
The BinProlog Experience: Implementing
a High-Performance Continuation Passing
Prolog Engine

Paul Tarau

Abstract

BinProlog is a compact C-emulated Prolog engine based on the continuation passing program transformation introduced in [21]. With the transformation to binary programs (consisting of clauses having only one atom in the head and in the body), a slow-down is expected, as the conventional WAM's highly optimized environments are replaced by a heap-only run-time system. However, this is not confirmed empirically: BinProlog is at this point among the fastest C-emulated Prolog engines (450 KLIPS on a 25MHz Sparcstation 10-20). This paper gives the implementation and benchmarking information needed to explain why our approach is a competitive alternative to the conventional WAM. We go into the details of data representation and compare the optimization opportunities offered by the BinWAM (our engine) with conventional WAM optimizations, while describing some of BinProlog's new implementation techniques, such as its tag-on-data term representation, its term-compression scheme and its copy-once heap-based findall.

Keywords: implementation of Prolog, WAM optimizations, program transformations, binary logic programs, continuation passing style compilation, data-representations for Prolog run-time systems.

1  Introduction

BinProlog is a C-emulated Prolog engine based on a program transformation introduced in [21]. It replaces the WAM by a more compact continuation passing logic engine [16] based on a mapping of full Prolog to binary logic programs (binarization). As the conventional WAM's highly optimized environments are discarded in favor of a heap-only run-time system, an important slow-down was expected.

We have decided, however, to put aside the established common knowledge about the WAM, its indexing and its usual data representations, and to start from binary logic programs (having only one atom in the head and in the body) and the `pure theory' of LD-resolution as described for this special case in [21].

The relatively surprising result was that our engine is now on par with, if not better than, standard WAM implementations. Even without list optimizations, BinProlog 2.30 is 1.5-2 times faster than SB-Prolog 3.1 and matches well-engineered standard WAM implementations like C-emulated SICStus 2.1 on most of the benchmarks.

How much of the speed of BinProlog comes from the opportunities opened by specializing the WAM to binary programs and how much comes from other factors?

BinProlog has no low-level `tricks'. We do not use assembly code or threading with first order labels; the emulator is a completely goto-less standard C program. It has a relatively small number of distinct opcodes (129) covering unification, control and built-ins; some instruction and term compression, generating about 4K of extra code, are the only `standard' optimizations, and they are used more heavily in other emulated WAM implementations.

Everything else being equal, our binarization based abstract machine (the BinWAM) looks empirically very competitive with environment based conventional WAMs.

We will show in this paper how the binarization based approach `called for' the simplifications in data-structure representation which actually pushed our compiler to its natural limits. The overall experience shows that, as in the case of RISC hardware, performance often comes simply from doing less instead of doing more.

We will start by reviewing the program transformation that allows compilation of logic programs towards a simplified WAM specialized for the execution of binary logic programs. We refer the reader to [21] for the original definition of this transformation.

2  The binarization transformation

Binary clauses have only one atom in the body (except for some inline `builtin' operations like arithmetic) and therefore they need no `return' after a call. A transformation introduced in [21] allows logic programs to be faithfully represented by operationally equivalent binary programs.

To keep things simple we will describe our transformations in the case of definite programs. First, we need to modify the well-known description of SLD-resolution [13] to be closer to Prolog's operational semantics. We will follow here the notations of [22].

Let us define the composition operator that combines clauses by unfolding the leftmost body-goal of the first argument.

Definition 1 Let A0:-A1,A2,...,An and B0:-B1,...,Bm be two clauses (suppose n > 0, m ≥ 0). We define

(A0:-A1,A2,...,An) ⊕ (B0:-B1,...,Bm) = (A0:-B1,...,Bm,A2,...,An)θ

with θ = mgu(A1,B0). If the atoms A1 and B0 do not unify, the result of the composition is denoted as ⊥. Furthermore, as usual, we consider A0:-true,A2,...,An to be equivalent to A0:-A2,...,An, and for any clause C, ⊥ ⊕ C = C ⊕ ⊥ = ⊥. We assume that at least one operand has been renamed to a variant with fresh variables.

This Prolog-like inference rule is called LD-resolution and it has the advantage of giving a more accurate description of Prolog's operational semantics than SLD-resolution.

Before defining the binarization transformation, we describe two auxiliary transformations.

The first transformation converts facts into rules by giving them the atom true as body. E.g., the fact p is transformed into the rule p :- true.

The second transformation, inspired by [27], eliminates the metavariables by wrapping them in a call/1 goal. E.g., the rule and(X,Y):-X, Y is transformed into and(X,Y) :- call(X), call(Y).

The transformation of [21] (binarization) adds continuations as extra arguments of atoms in a way that also preserves first argument indexing.

Definition 2 Let P be a definite program and Cont a new variable. Let T and E = p(T1,...,Tn) be two expressions. We denote by ψ(E,T) the expression p(T1,...,Tn,T). Starting with the clause

(C)        A :- B1,B2,...,Bn.

we construct the clause

(C')        ψ(A,Cont) :- ψ(B1,ψ(B2,...,ψ(Bn,Cont))).

The set P' of all clauses C' obtained from the clauses of P is called the binarization of P.

Example 1 The following example shows the result of this transformation on the well-known `naive reverse' program:

   app([],Ys,Ys,Cont):-true(Cont).
   app([A|Xs],Ys,[A|Zs],Cont):-app(Xs,Ys,Zs,Cont).
                                  
   nrev([],[],Cont):-true(Cont).
   nrev([X|Xs],Zs,Cont):-nrev(Xs,Ys,app(Ys,[X],Zs,Cont)).
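The binarized append above can be mimicked in C, which also previews why the BinWAM runs it with a single tail call per clause: the output cell [A|Zs] is built with Zs still a "hole" before the recursive call, and the continuation is only invoked when the original fact is reached. This is a hypothetical sketch (the cons/cont_fn types and the done callback are invented for illustration), not BinProlog's actual code:

```c
#include <stdlib.h>

/* Cons cells model Prolog integer lists; a continuation is a C callback. */
typedef struct cons { int head; struct cons *tail; } cons;
typedef void (*cont_fn)(void *env);

/* app([],Ys,Ys,Cont)        :- true(Cont).
   app([A|Xs],Ys,[A|Zs],Cont):- app(Xs,Ys,Zs,Cont). */
static void app(cons *xs, cons *ys, cons **zs, cont_fn cont, void *env) {
  if (xs == NULL) {            /* first clause: bind Zs = Ys, run Cont */
    *zs = ys;
    cont(env);
    return;
  }
  cons *cell = malloc(sizeof *cell);  /* build [A|Zs], Zs still a hole */
  cell->head = xs->head;
  cell->tail = NULL;
  *zs = cell;
  app(xs->tail, ys, &cell->tail, cont, env);  /* one tail call, as in LCO */
}

/* Helper: build a cons list from a C array. */
static cons *from_array(const int *a, int n) {
  cons *l = NULL;
  for (int i = n - 1; i >= 0; i--) {
    cons *c = malloc(sizeof *c);
    c->head = a[i]; c->tail = l; l = c;
  }
  return l;
}

/* The true(Cont) analogue: the continuation just records success here. */
static void done(void *env) { *(int *)env = 1; }
```

Note that no return address is ever pushed: control reaches done only through the chain of tail calls, exactly as the continuation argument threads through the binarized clauses.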

These transformations preserve a strong operational equivalence with the original program with respect to the LD resolution rule which is reified in the syntactical structure of the resulting program.

Theorem 1 ([22]) Each resolution step of an LD derivation on a definite program P can be mapped to an SLD-resolution step of the binarized program P'. Let G be an atomic goal and G' = ψ(G,true). Then, the computed answers obtained by querying P with G are the same as those obtained by querying P' with G'.

Notice that the equivalence between the binary version and the original program can also be explained in terms of fold/unfold transformations as suggested by [15].

Clearly, continuations become explicit in the binary version of the program. We will devise a technique to access and manipulate them in an intuitive way, by modifying BinProlog's binarization preprocessor.

3  Binarization based compilation

3.1  Virtualisation of meta-predicates

Note that the first step of the transformation simply wraps metavariables inside a new predicate call/1. The second step adds continuations as last arguments of each predicate and a new predicate true/1 to deal with unit clauses. During this step the arity of all predicates increases by 1, so that for instance call/1 becomes call/2. Although we could add clauses that describe how these predicates work on the set of all the functors occurring in the program, in practice it is simpler and more efficient to treat them as built-ins [16]. Our current implementation actually performs their execution inline, but still has to look up a hash-table to transform the term to which a meta-variable points into its corresponding predicate entry. As this happens only when we reach a `fact' in the original program, it has relatively little impact on performance. Note however that it can be done in part at compile time, by specializing the source program with respect to some known continuations.

3.2  Inline compilation of built-ins

Demoen and Mariën pointed out in [10] that a more implementation oriented view of binary programs can be very useful: a binary program is simply one that does not need an environment in the WAM. This allows inline code generation for built-ins occurring immediately after the head. They also describe a clever implementation of cut and other built-ins in the case of binary programs, by modifying the Prolog-by-BIM engine.

They found the program transformation of [21] useful primarily for better explaining and optimizing conventional WAM implementations; however, the authors did not consider a binary-only engine practical. On the other hand, it was our belief that specializing the WAM to binary programs could open some new opportunities worth investigating, at least for aesthetic reasons. After implementing BinProlog, it turned out to be also more efficient than expected. Inline expansion of builtins contributes significantly to BinProlog's speed and allows last call optimisation for frequently occurring linear recursive predicates, exactly as in conventional WAMs.

3.3  Early term construction vs. late term construction

In procedural and functional languages featuring only deterministic calls it makes sense to avoid eager early structure creation. The WAM [28] follows this trend, based on the argument that most logic programs are deterministic and that calls and structure creation in logic programming languages should therefore follow this model. A more careful analysis suggests that the choice between early and late term construction will favor different programming styles. Let us note at this point that the WAM's (common sense) assumptions are subject to a paradox, which explains in part why (against all obvious intuitions) a standard AND-stack based WAM is not necessarily faster than a properly implemented BinWAM.

Note that in the case of early failure in the context of deep backtracking, conventional WAMs have a clear advantage, although this again competes with the BinWAM's smaller and unlinked choicepoints, to the point that, for instance, the CHAT80 parser turns out to be faster in BinProlog than in a good C-emulated WAM like SICStus 2.1.

In practice, different programming styles will benefit differently from the various tradeoffs implied by one of these basic choices. In any case, system-level knowledge on what's really happening inside the engine should be abstracted in higher-order constructs to the advantage of application-level users. BinProlog 3.30's new extensions (intuitionistic and linear implication, various higher order constructs, extended DCGs etc.) [20,19] are based on this motivation.

3.4  A simplified run-time system

A simplified OR-stack having the layout shown in figure 1 is used only for (1-level) choice point creation in nondeterministic predicates.

P       next clause address
H       saved top of the heap
TR      saved top of the trail
A(N+1)  continuation argument register
A(N)    saved argument register N
...     ...
A1      saved argument register 1


Figure 1: BinProlog's OR-stack.

As a consequence, the heap consumption of the program goes up, although in some special cases partial evaluation at source level can deal with the problem [9,14], showing again that a heap-only approach is not necessarily worse. As automating these source-level transformations needs global compilation and possibly some help from programmer-declared pragmas, we wanted to ensure good performance for our engine without counting on them.

3.5  A simplified clause selection mechanism

As the compiler works on a clause-by-clause basis, it is the responsibility of the C-emulator to index clauses and link the code. It uses a global <key,key> → value hash table, seen as an abstract multipurpose dictionary. A one byte mark-field is used to distinguish between load-time and run-time use and for fast clean-up. We found that sharing the global dictionary, although somewhat slower than the small key → value hash tables injected into the code-space of the standard WAM, simplifies the implementation and makes it easier to switch to better indexing techniques in the future. The same table is used by the run-time system to get the addresses of meta-predicates and to perform first argument indexing, and is offered to the user as an entry point to a blackboard containing logically behaving global terms. This high level of code reuse contributes significantly to the small size and, indirectly, to the overall speed of BinProlog.

Predicates are classified as single-clause, deterministic and nondeterministic. At this time only predicates having all first-argument functors distinct are detected as deterministic. Although this is definitely a RISCy approach to indexing, we found that for predicates having a more general distribution of first arguments, a source-to-source transformation can be used. In the future we plan to integrate typical uses of cut and arithmetic tests in this indexing scheme and to extend it with ML-style `pattern matching' compilation that generates decision trees.

Indexing of deterministic predicates is done by a unique SWITCH instruction. If the first argument dereferences to a non-variable, SWITCH either fails or finds the 1-word address of the unique matching clause in the global hash-table, using the predicate and the functor of the first argument as a 2-word key. A specialized JUMP_IF instruction deals with the frequent case of 2-clause deterministic predicates. To reduce the interpretation overhead, SWITCH and JUMP_IF are combined with the preceding EXECUTE and the following GET_STRUCTURE instruction, giving EXEC_SWITCH and EXEC_JUMP_IF. This not only avoids dereferencing the first argument twice, but also reduces branching that breaks the processor's pipeline. Note that the basic difference with the WAM is the absence of intensive tag analysis. This is related also to our different low-level data-representation.

4  Data representation

4.1  Tag-on-pointer versus tag-on-data

When describing the data in a cell with a tag, we have basically two possibilities: we can put the tag in the same cell as the address of the data, or near the data itself. The first possibility, probably the most popular among WAM implementors, allows one to check the tag before deciding if and how the data has to be processed. We chose the second possibility, initially on purely aesthetic grounds, while our engine had only 6 instructions. As we became aware that with good indexing unifications are more often intended to succeed and move data around than to act as a clause selection mechanism, the WAM's preemptive tag checking lost some of its potential value. This also explains why we have not implemented the traditional WAM's SWITCH_ON_TAG instruction.

We found it very convenient to precompute a functor in the code-space as a word of the form <arity,symbol-number,tag> and then simply compare it with objects on the heap or in registers at 1-cycle cost, instead of comparing the tags, finding out that they are almost always the same, then comparing the functor-names and finding out that they are also the same, and finally comparing the arities with costly if-logic. Therefore we kept our unusual tag-on-data representation, which also has the advantage of consuming fewer tag bits. Up to now we use only 3 different tags, implemented on 2 bits, so a 4th tag value is still available for future use.

A seemingly unimportant (but potentially costly) point is the layout of the fields inside the word. We describe it as it helps to compare our implementation with WAMs having a less efficient low-level data representation. We chose to put the most frequently used data at the end of the word, the next most frequently used at the beginning of the word, and the least frequently used in the middle. Even on byte-addressable machines it is a good idea to fetch the whole word into a register and then get the lower bits with a small mask and the upper bits with a right shift, each in 1 cycle. It is less efficient to get the bits on the left with a mask, as a huge mask needs two instructions on most RISCs to be loaded into a register.

With this representation a functor fits completely in one word:

arity symbol-number 2-bit tag


By choosing TAG=0 for variables and having only 2-bit tags, every memory address (C pointer) looks like a logical variable. This gives a very low overhead and less error-prone integration of C code into the engine, and it has a positive impact on performance on architectures where indexed indirect addressing is not free.
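The word layout described above can be sketched with a few C macros. The field widths and tag values below are illustrative assumptions (only the 2-bit tag, the TAG=0-for-variables convention and the <arity,symbol-number,tag> order come from the text):

```c
#include <stdint.h>

typedef uintptr_t term;   /* one machine word holds a whole functor */

enum { VAR_TAG = 0, FUNTAG = 1, INTTAG = 2 };  /* 4th tag value still free */

#define TAG_BITS 2
#define SYM_BITS 22   /* hypothetical width of the symbol-number field */

#define TAG(w)   ((w) & 3u)                              /* lowest 2 bits  */
#define SYMNO(w) (((w) >> TAG_BITS) & ((1u << SYM_BITS) - 1))
#define ARITY(w) ((w) >> (TAG_BITS + SYM_BITS))          /* upper bits     */

/* Precompute a functor word <arity, symbol-number, tag> at load time. */
static term make_functor(uintptr_t arity, uintptr_t symno) {
  return (arity << (TAG_BITS + SYM_BITS)) | (symno << TAG_BITS) | FUNTAG;
}

/* Matching functors is a single word comparison: same word <=> same
   name, same arity and same tag, with no if-logic cascade. */
static int same_functor(term a, term b) { return a == b; }
```

Note how the tag and the symbol-number sit in the low bits (mask) and the arity in the high bits (shift), matching the mask-and-shift argument above, and how any word-aligned C pointer carries TAG 0, i.e. reads as a variable.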

4.2  Term compression

If a term has a last argument containing a functor, with our tag-on-data representation we can avoid the extra pointer from the last argument to the functor cell and simply make them collapse. Obviously the unification algorithm must take care of this case, but the space savings are important, especially in the case of lists, which become contiguous vectors with their N-th element directly addressable at offset (2*N+1)*sizeof(term) bytes from the beginning of the list, as shown in figure 2.

· a · b · c · []


Figure 2: List compression.

  1. compressed:    t/2 1 t/2 2 t/2 3 n/0

  2. uncompressed:  t/2 1 -> t/2 2 -> t/2 3 n/0

Figure 3: Term compression.

The effect of this last argument overlapping on t(1,t(2,t(3,n))) is represented in figure 3.

In BinProlog 1.71 only built-ins like det_append, copy_term and findall took advantage of this improved term representation. A change to the emulator integrated in BinProlog starting from version 2.20 has shown very encouraging results and practically no overhead when extending this optimization to all terms. It also reduces the space consumption for lists and other `chained functors' to values similar to or better than those of conventional WAMs. We refer to [24] for the details of the term-compression related optimizations of BinProlog.
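The compressed list layout can be made concrete with a small C sketch: [a,b,c] becomes the contiguous vector ./2 a ./2 b ./2 c [], so the N-th element sits at word offset 2*N+1. The CONS and NIL words below stand in for the real ./2 and []/0 functor encodings, which are assumptions here:

```c
#include <stdint.h>

typedef uintptr_t term;

#define CONS ((term)0x101)   /* stands for the precomputed ./2 functor word */
#define NIL  ((term)0x001)   /* stands for the []/0 functor word            */

/* Lay out [elems[0],...,elems[n-1]] as one contiguous compressed vector:
   each cell is <functor, element> and the last argument pointer to the
   next cell has collapsed into the next functor word itself. */
static void make_compressed_list(term *heap, const term *elems, int n) {
  for (int i = 0; i < n; i++) {
    heap[2 * i]     = CONS;
    heap[2 * i + 1] = elems[i];
  }
  heap[2 * n] = NIL;           /* terminator shares the last cell's slot */
}

/* Direct addressing: the N-th element is at word offset 2*N+1. */
static term nth(const term *list, int n) { return list[2 * n + 1]; }
```

A 3-element list takes 7 words here, against 9 for a chain of cons cells with explicit tail pointers, and element access becomes array indexing rather than pointer chasing.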

5  Optimizing the run-time system

5.1  Reducing the interpretation overhead

One can argue that the best way to reduce the interpretation overhead is not to interpret at all. However, most of the native code compilers we know of have a compact-code option that is actually a C-emulated WAM; being the most practical in terms of compilation time and code size, this mode is used by the Prolog developer most of the time, except for the final product. On the other hand, some of the techniques that follow can be used to eliminate tests and jumps, so they can also be useful in native code generation.

5.1.1  Instruction-compression and case-overlapping.

It happens very often that a sequence of consecutive instructions shares some WAM state information. For example, two consecutive unify instructions have the same mode, as they correspond to arguments of the same structure. Moreover, due to our very simple instruction set, some instructions have only a few possible instructions that can follow them. For example, after an EXECUTE instruction we can have a single, a deterministic or a nondeterministic clause. It makes sense to specialize the EXECUTE instruction with respect to what has to be done in each case. This gives, in the case of calls to deterministic predicates, the instructions EXEC_SWITCH and EXEC_JUMP_IF, as mentioned in the section on indexing. On the other hand, some instructions are simply so small that just dispatching them can cost more than actually performing the associated WAM-step. This in itself is a reason to compress two or more instructions taking less than a word into one. Again, having a small initial instruction set avoids combinatorial explosion in this case. For example, by compressing our UNIFY instructions and their WRITE-mode specializations, we get the following 8 new instructions:

UNIFY_VARIABLE_VARIABLE
WRITE_VARIABLE_VARIABLE
UNIFY_VALUE_VALUE
WRITE_VALUE_VALUE
UNIFY_VARIABLE_VALUE
WRITE_VARIABLE_VALUE
UNIFY_VALUE_VARIABLE
WRITE_VALUE_VARIABLE

This gives, in the case of the binarized version of the recursive clause of append/3 the following code:

append([A|Xs],Ys,[A|Zs],Cont):-append(Xs,Ys,Zs,Cont).

TRUST_ME_ELSE */4,     % keeps also the arity = 4
GET_STRUCTURE X1, ./2
UNIFY_VAR_VAR X5, A1
GET_STRUCTURE X3, ./2
UNIFY_VAL_VAR X5, A3
EXEC_JUMP_IF  append/4 % actually the address of append/4

The choice of candidates for instruction compression is based on low level profiling (instruction frequencies) and possibility of sharing of common work by two successive instructions and frequencies of functors with various arities. This justifies the choice of instructions like UNIFY_VARIABLE_VARIABLE.

Example 2 To emphasize some instruction compression possibilities not so obvious in the case of standard WAM let's show the BinProlog versus SICStus code for the clause:

a(X,Z):-b(X,Y),c(Y,Z). =>binary form=> a(X,Z,C):-b(X,Y,c(Y,Z,C)).

SICSTUS Prolog 2.1                BinProlog 1.43

clause(a/2/1,                  a/3:
   [ifshallow                  PUT_STRUCTURE    X4<-c/3
   ,neck(2)                    WRITE_VAR_VAL    X5,X2
   ,else                       WRITE_VALUE      X3
   ,endif                      MOVE_REG         X2<-X5
   ,allocate                   MOVE_REG         X3<-X4
   ,get_y_variable(1,1)        EXECUTE          b/3
   ,put_y_variable(0,1)
   ,init([])                      BinProlog 1.71
   ,call(b/2,2)                        
   ,put_y_unsafe_value(0,0)    PUT_WRITE_VAR_VAL  X4<-c/3, X5,X2
   ,put_y_value(1,1)           WRITE_VALUE        X3
   ,deallocate                 MOVE_REGx2         X2<-X5, X3<-X4
   ,execute(c/2)]).            EXECUTE            b/3

Clearly, combinatorial explosion due to the elaborate case analysis (safe vs. unsafe, x-variable vs. y-variable) makes the job of advanced instruction compression tedious in the case of the standard WAM. Moreover, in the case of an emulated engine, just decoding the init, allocate, deallocate and call instructions costs more than BinProlog's simple PUT_STRUCTURE and WRITE_VAR_VAL and their straightforward IF-less work of copying 3 heap cells from the registers.

BinProlog also integrates the preceding GET_STRUCTURE instruction into the double UNIFY instructions and the preceding PUT_STRUCTURE into the double WRITE instructions (starting from version 1.71). This gives another 16 instructions but it covers a large majority of uses of GET_STRUCTURE and PUT_STRUCTURE. Reducing interpretation overhead on those critical, high frequency instructions definitely contributes to the speed of our emulator. As a consequence, in the frequent case of structures of arity=2 (lists included), mode-related IF-logic is completely eliminated. The impact of this optimization can be seen clearly on the NREV benchmark (we refer to the section on performance evaluation).

Moreover, in the case of native code, reordering of BinProlog's PUT_STRUCTURE + WRITE groups can be very useful in slot-filling and more intricate super-scalar processing related instruction scheduling. On the other hand, the presence of environments in conventional WAM limits those reordering optimizations to one chunk. A very straightforward compilation to C [23] and the possibility of optimized `burst-mode' structure creation in PUT instructions are a direct consequence of binarization and would be harder to apply to AND-stack based traditional WAMs, which exhibit much less uniform instruction patterns.

Other Prologs also do instruction compression, and it is not unusual to hear about engines having 1000 instructions or more; the optimization itself is thus quite common. However, we found that simplifying the unification instructions of the BinWAM allows for very `general-purpose' instruction compression, whereas conventional WAMs often limit this kind of optimization to lists. We changed the list-constructor of the NREV benchmark and found that the performance of Quintus Prolog 3.1.1 dropped from 479 KLIPS to 130 KLIPS, while BinProlog achieved a constant 216 KLIPS in both cases. In engines counting less on instruction compression, like C-emulated SICStus 2.1 and SB-Prolog 3.1, the drop in performance was less dramatic (from 139 to 120 KLIPS and from 103 to 89 KLIPS, respectively).

We claim that in a simplified engine instruction compression can be made more `abstract' and therefore with fewer compressed instructions we can hit a statistically more relevant part of the code.

This means that in BinProlog, for instance, arithmetic expressions or programs manipulating binary trees will benefit from our compression strategy, while this may not be the case with conventional WAMs, unless they duplicate their (already) baroque list-instruction optimizations for arbitrary structures.

Another point is that instruction compression is usually applied inside a procedure. As BinProlog has a unique primitive EXECUTE instruction instead of the standard WAM's CALL, ALLOCATE, DEALLOCATE, EXECUTE and PROCEED, we can afford to do instruction compression across procedure boundaries with very little increase in code size, due to the relatively few different ways to combine control instructions. Inter-procedural instruction compression can be seen as a kind of `hand-crafted' partial evaluation at C-level, intended to optimize the main loop of the WAM-emulator. It can also be seen as a special case of the call forwarding technique used in the implementation of jc [12,7], and it has the same effect as partial evaluation at source level, which also eliminates procedure calls. At the global level, knowledge about possible continuations can also remove the run-time effort of address look-up for meta-predicates and useless trailing and dereferencing.

Case-overlapping is a well-known technique that saves code-size in the emulator. Within the main switch statement of the emulator we can share the code of an instruction that is a specialization of its predecessor, based on the well-known property of the case statement in C that, in the absence of a break; or continue;, control flows from one case label to the next. It is a 0-cost operation. The following example, from our emulator, shows how we can share the code between EXEC_SWITCH and SWITCH.

          case EXEC_SWITCH:
            .........
          case SWITCH:
            .........
          break;

Note that instructions like EXEC_SWITCH or EXEC_JUMP_IF actually have a global compilation flavor, as they exploit knowledge about the procedure being called. Due to the very simple initial instruction set of the BinProlog engine, these WAM-level code transformations are performed in C at load-time with no visible penalty on compilation time.

Finally, we can often apply both instruction compression and case-overlapping to further reduce space requirements. As compressed WRITE-instructions are still just special cases of the corresponding compressed UNIFY-instructions, we have:

          case UNIFY_VAR_VAL:
            .........
          case WRITE_VAR_VAL:
            .........
          break;

This explains why BinProlog's emulator is still only about 50K on most architectures, with obviously positive effects on locality due to code reuse.

5.2  Offset based structure manipulation

BinProlog 2.20 used a standard bottom-up structure creation method, which is reasonable for an emulated implementation. However, inside compressed instructions like PUT_WRITE_VAL_VAL it makes sense to do operations like H[0] = ...; H[1] = ...; H[2] = ...; H += 3; as offset based indirect addressing is free on most modern computers. This optimization lets us use the advantages of top-down structure creation in a high frequency subset of the instructions while retaining the simplicity of the standard bottom-up creation scheme. Similar operations involving the S register are possible for the read-mode of GET instructions, as for instance in GET_UNIFY_VAL_VAL. With BinProlog 3.00's top-down representation, motivated by term-compression and translation to C, the heap offset of every PUT instruction is available at compile time and the C-translation module takes full advantage of this [23].
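The burst-mode heap write can be sketched as a single C helper (the function name and the functor constant are illustrative assumptions):

```c
#include <stdint.h>

typedef uintptr_t term;

/* PUT_WRITE_VAL_VAL: store the functor and two argument values at fixed
   offsets from H, then bump H once. The fixed offsets let the compiler
   use free indexed addressing instead of three pointer increments. */
static term *put_write_val_val(term *H, term functor, term a1, term a2) {
  H[0] = functor;   /* PUT_STRUCTURE part */
  H[1] = a1;        /* WRITE_VALUE part   */
  H[2] = a2;        /* WRITE_VALUE part   */
  return H + 3;     /* single pointer bump */
}
```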

5.2.1  The benefits of two-stream sequences for free.

Notice that for GET instructions we have the benefits of separate READ and WRITE streams (for instance, avoidance of mode checking) on some high frequency instructions, without actually incurring the compilation complexity and emulation overhead of generating them. As terms of depth 1 and functors of low arity statistically dominate Prolog programs, our instruction compression scheme actually behaves as if two separate instruction streams were present, most of the time!

Our experiments in Prolog-to-C translation using BinProlog's WAM model and data representation show that performance similar to native SICStus and Aquarius can be obtained on top of this very simple instruction set [8].

6  Memory management

6.1  Towards an ecological Prolog engine

An ideal memory manager is `ecological': it works as a `self-purifying' engine that recuperates space not as a deliberate `garbage-collection' operation but as a natural `way of life', i.e. something done inside the normal, useful activities the engine performs. Let us start by describing some `real-world' ecology which has served as a model for BinProlog's heap-based findall implementation.

6.1.1  The rain-forest metaphor.

In a rain-forest, a natural garbage-collection process takes place: evaporated water originating from the forest forms clouds, and then condensed water falls back as rain.

In a Prolog engine terms are created on the heap. Although WAM's environment stack does some limited and well intentioned garbage prevention (i.e. environment trimming), often garbage collection is needed for large practical programs. The possibility of garbage prevention is reduced even more in BinProlog where continuations go also on the heap.

Following the rain-forest metaphor, our objective is to set up a natural and automatic memory recuperation cycle. Basically, the heap is split into a small lower half (the rain forest) and a large upper half (the clouds). The key idea is to fool Prolog's execution mechanism into working temporarily in the upper half (evaporation) and then let useful data `fall back' into the lower half of the heap (rain) in a condensed form (compaction by copying). The trick is very simple: as structure creation on the heap is always done around the H pointer, while everything else stays in BinProlog's registers, all we have to do is temporarily set H to a location at the beginning of the upper half of the heap and then let the engine work as usual. Figure 4 shows this heap-lifting technique and the position of the H pointer at various stages.


Figure 4: Heap-lifting.

The metaphor is stretched to its limits when applied to recursive uses of the mechanism, but the concrete implementation deals properly with this problem by saving the necessary information on a stack and by ensuring a reasonable number of embedded heap-splits.
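The heap-lifting trick can be sketched in a few lines of C. This is a deliberately simplified model under stated assumptions: answers are flat word arrays, there is no real copy_term, and the single saved low pointer stands in for the stacked H/HEAP_MARGIN pair that makes nesting work in the real engine:

```c
#include <string.h>
#include <stdint.h>

#define HEAP_WORDS 1024
typedef uintptr_t term;

static term heap[HEAP_WORDS];
static term *H = heap;          /* all structure creation happens at H */

/* Evaporation: remember the low-water mark and move H to the upper half,
   so the engine builds everything up there. */
static term *lift_heap(void) {
  term *saved = H;
  H = heap + HEAP_WORDS / 2;
  return saved;
}

/* Rain: an answer falls back as a condensed copy into the lower half,
   advancing the low-half allocation pointer. */
static term *store_answer(term *low, const term *ans, int n) {
  memcpy(low, ans, n * sizeof(term));
  return low + n;
}

/* Reset H just past the copied answers: the whole upper half is freed
   at once, in time proportional to the size of the answers kept. */
static void drop_upper_half(term *new_low) {
  H = new_low;
}
```

The cost structure claimed later for findall follows directly: only the copied answers survive in the lower half, and freeing the (arbitrarily large) upper half is a single pointer assignment.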

7  Fast heap-based builtins

Let us start with a detailed description of a heap-based implementation for two very useful Prolog primitives that will give an insight on how the principle can be used in practice. They also have the potential to significantly speed-up the overall performance of Prolog systems that decide to convert from classical assert-based implementations to a heap-based technique.

7.1  Findall

We implemented both findall/3 and findall/4. The latter (suggested by a public domain version by R.A. O'Keefe) appends the answers to an existing list. We used our fast copy_term to implement it. Basically, we execute the goal in the upper half of the heap. Findall is written as a failure driven loop in Prolog, using 3 builtins implemented in C.

findall(X,G,Xs):-findall_workhorse(X,G,[[]|Xs]).

findall(X,G,Xs,End):-findall_workhorse(X,G,[End|Xs]).

findall_workhorse(X,G,_):-
  lift_heap,
  G,
  findall_store_heap(X).
findall_workhorse(_,_,Xs):-
  findall_load_heap(Xs).	

The builtin lift_heap simply cuts the heap in half and temporarily assigns to H the value of the middle of the heap. The previous value of H and the previous HEAP_MARGIN, used for overflow checks, are stacked. This allows embedded calls to findall to work properly.

Then the goal G is executed in the upper half of the heap.

The builtin findall_store_heap pushes a copy of the answer X, made by copy_term, onto an open-ended list located in the lower half of the heap and starting from the initial position of H, which grows with each new answer. Then it forces backtracking.

The failure-driven Prolog loop ensures that, in the case of a finite generation of answers, they can all be collected by the builtin findall_load_heap starting from the initial position of H and forming an open-ended list. The space used is exactly the sum of the sizes of the answers plus the size of the list cells used as glue. The builtin findall_load_heap puts on the heap and returns a cons-cell containing the end of the list and the list itself. It also removes from the stack H and HEAP_MARGIN, used inside copy_term to check heap overflow. For findall/4 we also return the end of the open-ended list of answers, while for findall/3 we close it with an empty list. The total size of the Prolog and C code is about half that of the usual (correct) assert-based implementation of findall.

The C code of the three builtins is part of the main switch of our WAM loop.

The implementation can be further accelerated by moving it completely to C, but we have left it in Prolog to allow tracing and to be able to use the same builtins for other primitives.

Note that at the end, the entire upper half of the heap is freed. This means that an amount of space proportional to the size of the computation is freed within time proportional to the size of the answer.

7.2  Cheney-style compressing copy_term

We have used inside findall a very fast Cheney-style term-copying algorithm [25], also available through Prolog's copy_term/2, which enforces term compression, preserves sharing in most practical cases and deals correctly with infinite terms.

Cheney's classical algorithm [6] copies a term in a breadth-first manner, using forwarding pointers for already-moved cells. This algorithm can be used directly to implement copy_term. The single difference is that the forwarding pointers, which destroy the original term, have to be recorded on a value trail. As long as no sharing of identical terms is present, the algorithm produces only forward references, depicted `\nearrow'. We modified this algorithm to enforce last-argument overlapping as follows: when a structure is copied, its last argument and all its subterms that are again found in last-argument position are copied first. In this manner, most opportunities for last-argument overlapping are discovered. While the classical method is a pure breadth-first algorithm, our adaptation introduces a depth-first component. In the frequent case of a list of atomic cells, our algorithm creates a fully contiguous vector-like object.

We illustrate BinProlog's algorithm with the term t([a,b,c],[d,e,f]). Breadth-first copying leads to the first memory layout, while the second layout is the result of our algorithm in BinProlog. Note that the first algorithm slices up all linear chains so that no last argument overlapping is possible.

  1. t/2 \nearrow \nearrow ./2 a \nearrow ./2 d \nearrow ./2 b \nearrow ./2 e \nearrow ./2 c [] ./2 f []

  2. t/2 \nearrow ./2 d ./2 e ./2 f [] ./2 a ./2 b ./2 c []

As long as there are no shared subterms, our algorithm is able to exploit all overlaps. We refer to [25] for a best-case/worst-case analysis of the algorithm and the tradeoffs of various choices.

8  Performance evaluation

Figure 5 compares the performance of BinProlog 2.20 with various optimization switches on a Sparcstation 10-20.

The meaning of the switches is the following: No denotes the basic engine with no optimizations, T term compression, I2 and I3 instruction compression up to length 2 and 3 respectively, and O offset-based structure manipulation inside the compressed instructions; combinations such as I3,T,O turn on several of them at once.

For reference, timings for SICStus Prolog (column S2.1_8) are given. Timing is without garbage collection, as returned by statistics(runtime,_). The programs are available in the directory progs of the BinProlog distribution. NREV is the well-known naive reverse benchmark, CHAT-Parser is the unmodified version from the Berkeley Benchmark, FIBO(16)x50 is the recursive Fibonacci predicate, PERMS(8) is a permutation generator, DET-P(8) is a `prolog killer' deterministic all permutation program, FINDALL-P(8) is a findall-based all-permutations program, BOOTSTRAP is our compiler compiling itself, its libraries reader, tokenizer, writer and DCG preprocessor. QSORT is a version of the well known sorting program, DIFFEREN is a symbolic differentiation program and CHOICE10 is a backtracking intensive program from the ECRC benchmark. User-time and real-time are obtained directly11 with Unix's rusage tool and give an idea about the overall burden on the system that executes Prolog.

Figure 6 compares the code size of various programs in BinProlog 3.30 with emulated and native SICStus Prolog on a Sparcstation 10-20.

Bmark/Compiler    No      T       I2,T    I3      I3,T    I3,T,O  S2.1_8
NREV              229kl   238kl   326kl   408kl   436kl   453kl   271kl
BOOTSTRAP         13.90s  13.90s  12.47s  11.87s  11.95s  11.74s  11.08s
CHAT-Parser       0.81s   0.81s   0.72s   0.69s   0.70s   0.69s   0.69s
FIBO(16)x50       2.16s   2.21s   1.83s   1.78s   1.79s   1.79s   2.41s
PERMS(8)          0.49s   0.48s   0.41s   0.36s   0.37s   0.36s   0.49s
DET-P(8)          1.46s   1.47s   1.19s   0.93s   1.02s   0.99s   1.05s
FINDALL-P(8)      1.44s   1.42s   1.35s   1.36s   1.30s   1.30s   4.66s
DIFFEREN          0.61s   0.61s   0.54s   0.52s   0.53s   0.51s   0.67s
CHOICE            2.06s   2.06s   1.98s   1.92s   1.96s   1.92s   2.94s
QSORT             0.72s   0.67s   0.61s   0.61s   0.55s   0.58s   0.42s
total user-time   55.00s  54.00s  45.00s  42.00s  42.00s  40.00s  52.00s
total real-time   59.98s  57.93s  50.04s  47.79s  46.47s  43.82s  96.51s


Figure 5: Impact of instruction and term compression on execution speed.

File/Compiler   BinProlog 3.30   Emulated S2.1_9   Native S2.1_9
NREV                  3079            16048             19072
QSORT                 3155            13456             18880
CHAT-Parser          31732           123008            184080


Figure 6: BinProlog vs. SICStus user code size.

Compiler version        No      T       I2,T    I3      I3,T    I3,T,O
Emulator size in bytes  47460   47484   49356   51124   51244   51268


Figure 7: Impact of optimizations on emulator code size.

Note that we have used the same BinProlog run-time system in all the measurements, so that the evaluation of the optimizations on a feature-by-feature basis is accurate. The various optimizations can be turned on and off independently with appropriate compile-time options for the C emulator. Instruction and term compression follow the general RISC philosophy we have used in the design of BinProlog: the optimizations cover the most frequent idioms. Moreover, by careful case-overlapping, redundant code is avoided when possible. Figure 7 shows the impact of the optimizations on emulator code size.

Clearly, the code-size increase from the basic No version to the most optimized version I3,T,O, which uses term compression, instruction compression up to length 3 (like PUT_WRITE_VAR_VAL) and offset-based structure manipulation inside the compressed instructions, is still less than 4K.

9  Related work

In the case of heap-intensive binary programs, the ratio between the useful data and the size of the computation is rather small. This suggests using some form of (generational) copying garbage collector (first proposed for logic programming languages by Bekkers and Ungaro [3] in the MALI memory manager, and also used for the λProlog implementation by Brisset and Ridoux [4]) instead of the traditional mark-and-sweep collector of most Prologs. Our approach is based on resource-driven failure, introduced in [18] and described in detail in [17].

Binarization-based compilation of logic programming languages is very similar to the Continuation Passing Style compilation for functional programs described in [2] and [1]. The key idea of Appel, as presented in [1], is that heap deallocation by copying (generational) garbage collection has a better amortized cost than stack allocation and deallocation. As long as no backtracking is involved, this principle can be used to partially garbage-collect, in a very efficient way, a deterministic, deep branch of a WAM-implemented SLD-tree. Although things get more complex in the presence of choice points and trailing, memory management can be combined with other useful activities of the engine, most importantly with OR-parallel execution, as described in [17].

10  Conclusion

The simplicity of BinProlog's basic instruction set, due to its specialization to binary programs, allowed us to apply low-level optimizations not easily available on a standard WAM. Based on our previously reported [18] virtualization of demo-predicates at the WAM level, which largely eliminates the overhead of metaprogramming introduced by binarization, and the simplified memory management approach described in [17], the BinProlog engine can now be considered a realistic alternative to the standard WAM for full Prolog implementation, despite its still higher memory consumption. This is especially true for C-emulated WAMs, where simplification of the instruction set has a very positive impact on limiting the interpretation overhead. Absolute performance evaluation against a well-engineered12 conventional WAM like SICStus Prolog 2.1 gives empirical evidence for our claim. Developments on source-level transformations like U. Neumerkel's thesis [14] suggest that large classes of binary programs have `continuation-like' structures that can all benefit from some common optimizations. The Aquarius experience [26] shows that a more explicit and low-level refinement of the WAM can take better advantage of global optimization techniques. A BAM-like continuation passing engine would benefit from the simplicity of the BinWAM and from the dereferencing and trail-elimination opportunities of the BAM. Continuation-related optimizations can also be integrated naturally into a lower-level binary abstract machine. Encouraging results on Prolog-to-C translation using BinProlog's engine design and data representations [8] show that performance similar to the best existing native-code Prolog systems can be attained with a much smaller implementation effort.

Acknowledgements

Fruitful discussions with Yves Bekkers, Patrice Boizumault, Michel Boyer, Mats Carlsson, Jacques Cohen, Veronica Dahl, Thomas Lindgren, Jean-Francois Pique, Olivier Ridoux, Sten-Åke Tärnlund and comments from a large number of BinProlog users helped improve the design of our logic engine and clarify the content of this paper. Special thanks go to Koen De Bosschere, Bart Demoen, Gert Engels and Ulrich Neumerkel for their contribution to the implementation of BinProlog components.

References

[1]
A. Appel. A runtime system. Lisp and Symbolic Computation, (3):343-380, 1990.

[2]
A. Appel. Compiling with Continuations. Cambridge University Press, 1992.

[3]
Y. Bekkers and L. Ungaro. Two real-time garbage collectors for a prolog system. In Proceedings of the Logic Programming Conference'91, pages 137-149. ICOT, Tokyo, Sept. 1991.

[4]
P. Brisset and O. Ridoux. Continuations in lProlog. In D. S. Warren, editor, Proceedings of the Tenth International Conference on Logic Programming, pages 27-43, Budapest, Hungary, 1993. The MIT Press.

[5]
M. Carlsson. Design and Implementation of an OR-Parallel Prolog Engine. PhD thesis, SICS, 1990.

[6]
C. J. Cheney. A nonrecursive list compacting algorithm. Communications of the ACM, 13(11):677-678, Nov. 1970.

[7]
K. De Bosschere, S. Debray, D. Gudeman, and S. Kannan. Call Forwarding: A Simple Interprocedural Optimization Technique for Dynamically Typed Languages. In Proceedings of the 21st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL), pages 409-420, Portland/USA, Jan. 1994. ACM.

[8]
K. De Bosschere and P. Tarau. High Performance Continuation Passing Style Prolog-to-C Mapping. In E. Deaton, D. Oppenheim, J. Urban, and H. Berghel, editors, Proceedings of the 1994 ACM Symposium on Applied Computing, pages 383-387, Phoenix/AZ, Mar. 1994. ACM Press.

[9]
B. Demoen. On the Transformation of a Prolog program to a more efficient Binary program. Technical Report 130, K.U.Leuven, Dec. 1990.

[10]
B. Demoen and A. Mariën. Implementation of Prolog as binary definite Programs. In A. Voronkov, editor, Logic Programming, RCLP Proceedings, number 592 in Lecture Notes in Artificial Intelligence, pages 165-176, Berlin, Heidelberg, 1992. Springer-Verlag.

[11]
B. Demoen and G. Maris. A comparison of some schemes for translating logic to C. Technical Report 188, K.U.Leuven, Mar. 1994. presented at the Workshop on parallel and data parallel execution of logic programs, ICLP94.

[12]
D. Gudeman, K. De Bosschere, and S. Debray. jc: An Efficient and Portable Sequential Implementation of Janus. In K. Apt, editor, Joint International Conference and Symposium on Logic Programming, pages 399-413, Washington, Nov. 1992. MIT press.

[13]
J. Lloyd. Foundations of Logic Programming. Symbolic computation - Artificial Intelligence. Springer-Verlag, Berlin, 1987. Second edition.

[14]
U. Neumerkel. Specialization of Prolog Programs with Partially Static Goals and Binarization. PhD thesis, Technische Universität Wien, 1992.

[15]
M. Proietti. On the definition of binarization in terms of fold/unfold., June 1994. Personal Communication.

[16]
P. Tarau. A Simplified Abstract Machine for the Execution of Binary Metaprograms. In Proceedings of the Logic Programming Conference'91, pages 119-128. ICOT, Tokyo, 7 1991.

[17]
P. Tarau. Ecological Memory Management in a Continuation Passing Prolog Engine. In Y. Bekkers and J. Cohen, editors, Memory Management International Workshop IWMM 92 Proceedings, number 637 in Lecture Notes in Computer Science, pages 344-356. Springer, Sept. 1992.

[18]
P. Tarau. Program Transformations and WAM-support for the Compilation of Definite Metaprograms. In A. Voronkov, editor, Logic Programming, RCLP Proceedings, number 592 in Lecture Notes in Artificial Intelligence, pages 462-473, Berlin, Heidelberg, 1992. Springer-Verlag.

[19]
P. Tarau. Language Issues and Programming Techniques in BinProlog. In D. Sacca, editor, Proceeding of the GULP'93 Conference, Gizzeria Lido, Italy, June 1993.

[20]
P. Tarau. BinProlog 4.00 User Guide. Technical Report 95-1, Département d'Informatique, Université de Moncton, Feb. 1995. Available by ftp from clement.info.umoncton.ca.

[21]
P. Tarau and M. Boyer. Elementary Logic Programs. In P. Deransart and J. Maluszyński, editors, Proceedings of Programming Language Implementation and Logic Programming, number 456 in Lecture Notes in Computer Science, pages 159-173. Springer, Aug. 1990.

[22]
P. Tarau and K. De Bosschere. Memoing with Abstract Answers and Delphi Lemmas. In Y. Deville, editor, Logic Program Synthesis and Transformation, Springer-Verlag, pages 196-209, Louvain-la-Neuve, July 1993.

[23]
P. Tarau, B. Demoen, and K. De Bosschere. The Power of Partial Translation: an Experiment with the C-ification of Binary Prolog. In K. George, J. Carrol, E. Deaton, D. Oppenheim, and J. Hightower, editors, Proceedings of the 1995 ACM Symposium on Applied Computing, pages 152-176, Nashville, Feb. 1995. ACM Press.

[24]
P. Tarau and U. Neumerkel. Compact Representation of Terms and Instructions in the BinWAM. Technical Report 93-3, Dept. d'Informatique, Université de Moncton, Nov. 1993. available by ftp from clement.info.umoncton.ca.

[25]
P. Tarau and U. Neumerkel. A Novel Term Compression Scheme and Data Representation in the BinWAM. In M. Hermenegildo and J. Penjam, editors, Proceedings of Programming Language Implementation and Logic Programming, number 844 in Lecture Notes in Computer Science, pages 73-87. Springer, Sept. 1994.

[26]
P. Van Roy. Can Logic Programming Execute as Fast as Imperative Programming? PhD thesis, University of California at Berkeley, 1990.

[27]
D. H. D. Warren. Higher-order extensions to Prolog - are they needed? In D. Michie, J. Hayes, and Y. H. Pao, editors, Machine Intelligence 10. Ellis Horwood, 1981.

[28]
D. H. D. Warren. An Abstract Prolog Instruction Set. Technical Note 309, SRI International, Oct. 1983.


Footnotes:

1 Department of Computer Science, University of North Texas, Denton, Texas 76203 E-mail: tarau@cs.unt.edu

2 available from http://www.binnetcorp.com/BinProlog

3 Similar to the `goal-stacking' Warren rejected in 1983 []. We refer to [] for a thorough comparison of various goal-stacking implementations.

4 Our initial 200-line compiler generating code for a 6 instruction subset of ALS-prolog's WAM was probably the world's smallest definite program compiler.

5 Atom or term.

6 As the section about performance will show.

7 well, also GET_CONSTANT, although in this case it is not worth dealing with it separately

8 Current implementations of SICStus also do extensive instruction folding. They do need however a much larger set of folded instructions - about 256. This makes emulated code in SICStus 2.1_9 almost as large as native code and about 3-times larger than emulated BinProlog code.

9 Actually the ratio between the lower and the upper part is 1:4, 1:8 or even a higher power of 2. This must be in principle the statistically expected ratio between the size of the answer and the size of the computation that produces it.

10 Although shallow-backtracking is faster in C-emulated SICStus due to the optimization described in [] the overall performance for this benchmark is better in BinProlog due to its smaller and unlinked choice-points.

11 On a 1-user workstation with enough memory (32Mb) for in-core execution.

12 see [] for the details of its design and optimizations

