Keywords: implementation of Prolog, WAM optimizations, program transformations, binary logic programs, continuation passing style compilation, data-representations for Prolog run-time systems.
We have decided however to put aside the established common knowledge about the WAM, its indexing and its usual data representations and start from the special case of binary logic programs (having only one atom in the head and the body) and the `pure theory' of LD-resolution as described for the special case of binary programs in [21]^{4}.
The relatively surprising result was that our engine is now fairly equivalent if not better than standard WAM implementations. Even without list-optimizations, BinProlog 2.30 is 1.5-2 times faster than SB-Prolog 3.1 and matches well-engineered standard WAM implementations like C-emulated SICStus 2.1 on most of the benchmarks.
How much of the speed of BinProlog comes from the opportunities opened by specializing the WAM to binary programs and how much comes from other factors?
BinProlog has no low-level `tricks'. We do not use assembly code or threading with first order labels. The emulator is a completely goto-less standard C program. A relatively small number of distinct opcodes (129) covering unification, control and built-ins with some instruction and term compression generating about 4K of extra code are `standard' optimizations more heavily used in other emulated WAM implementations.
Everything else being equal, our binarization based abstract machine (the BinWAM) looks empirically very competitive with environment based conventional WAMs.
We will show in this paper how the binarization based approach `called for' the simplifications in data-structure representation which actually pushed our compiler to its natural limits. The overall experience shows that, as in the case of RISC hardware, performance comes often simply by doing less instead of doing more.
We will start by reviewing the program transformation that allows compilation of logic programs towards a simplified WAM specialized for the execution of binary logic programs. We refer the reader to [21] for the original definition of this transformation.
Binary clauses have only one atom in the body (except for some inline `builtin' operations like arithmetics) and therefore they need no `return' after a call. A transformation introduced in [21] allows to faithfully represent logic programs with operationally equivalent binary programs.
To keep things simple we will describe our transformations in the case of definite programs. First, we need to modify the well-known description of SLD-resolution [13] to be closer to Prolog's operational semantics. We will follow here the notations of [22].
Let us define the composition operator Å that combines clauses by unfolding the leftmost body-goal of the first argument.
Definition 1 Let A_{0}:-A_{1},A_{2},...,A_{n} and B_{0}:-B_{1},...,B_{m} be two clauses (suppose n > 0, m ³ 0). We define (A_{0}:-A_{1},A_{2},...,A_{n}) Å (B_{0}:-B_{1},...,B_{m}) = (A_{0}:-B_{1},...,B_{m},A_{2},...,A_{n})q
with q = mgu(A_{1},B_{0}). If the atoms A_{1} and B_{0} do not unify, the result of the composition is denoted as ^. Furthermore, as usual, we consider A_{0}:-true,A_{2},...,A_{n} to be equivalent to A_{0}:-A_{2},...,A_{n}, and for any clause C, ^ Å C = C Å ^ = ^. We assume that at least one operand has been renamed to a variant with fresh variables.
This Prolog-like inference rule is called LD-resolution and it has the advantage of giving a more accurate description of Prolog's operational semantics than SLD-resolution.
Before defining the binarization transformation, we describe two auxiliary transformations.
The first transformation converts facts into rules by giving them the atom true as body. E.g., the fact p is transformed into the rule p :- true.
The second transformation, inspired by [27], eliminates the metavariables by wrapping them in a call/1 goal. E.g., the rule and(X,Y):-X, Y is transformed into and(X,Y) :- call(X), call(Y).
The transformation of [21] (binarization) adds continuations as extra arguments of atoms in a way that preserves also first argument indexing.
Definition 2 Let P be a definite program and Cont a new variable. Let T and E = p(T_{1},...,T_{n}) be two expressions.^{5} We denote by y(E,T) the expression p(T_{1},...,T_{n},T). Starting with the clause
(C) A :- B_{1},B_{2},...,B_{n}.
we construct the clause
(C') y(A,Cont) :- y(B_{1},y(B_{2},...,y(B_{n},Cont))).
The set P¢ of all clauses C' obtained from the clauses of P is called the binarization of P.
Example 1 The following example shows the result of this transformation on the well-known `naive reverse' program:
app([],Ys,Ys,Cont):-true(Cont). app([A|Xs],Ys,[A|Zs],Cont):-app(Xs,Ys,Zs,Cont). nrev([],[],Cont):-true(Cont). nrev([X|Xs],Zs,Cont):-nrev(Xs,Ys,app(Ys,[X],Zs,Cont)).
These transformations preserve a strong operational equivalence with the original program with respect to the LD resolution rule which is reified in the syntactical structure of the resulting program.
Theorem 1 ([22]) Each resolution step of an LD derivation on a definite program P can be mapped to an SLD-resolution step of the binarized program P¢. Let G be an atomic goal and G¢ = y(G,true). Then, computed answers obtained querying P with G are the same as those obtained by querying P' with G'.
Notice that the equivalence between the binary version and the original program can also be explained in terms of fold/unfold transformations as suggested by [15].
Clearly, continuations become explicit in the binary version of the program. We will devise a technique to access and manipulate them in an intuitive way, by modifying BinProlog's binarization preprocessor.
Note that the first step of the transformation simply wraps metavariables inside a new predicate call/1. The second step adds continuations as last arguments of each predicate and a new predicate true/1 to deal with unit clauses. During this step the arity of all predicates increases by 1 so that for instance call/1 becomes call/2. Although we can add clauses that describe how they work on the set of all the functors occurring in the program, in practice it is simpler and more efficient to treat them as built-ins [16]. Our current implementation actually performs their execution inline but still has to look up in a hash-table to transform the term to which a meta-variable points, to its corresponding predicate-entry. As this happens only when we reach a `fact' in the original program, it has relatively little impact on performance. Note however that it can be done in part at compile time, by specializing the source program with respect to some known continuations.
Demoen and Mariën pointed out in [10] that a more implementation oriented view of binary programs can be very useful: a binary program is simply one that does not need an environment in the WAM. This allows inline code generation for built-ins occurring immediately after the head. They also describes a clever implementation of cut and other built-ins in the case of binary programs, by modifying the Prolog-by-BIM engine.
They found useful the program transformation of [21] primarily for better explaining and optimizing conventional WAM implementations. However, the authors have not considered practical a binary only engine. On the other hand, it was our belief that specializing the WAM to binary programs can give some new opportunities that are worthwhile to be investigated at least for aesthetic reasons. After implementing BinProlog it turned out to be also more efficient than expected. Inline expansion of builtins contributes signifcantly to BinProlog's speed and allows last call optimisation for frequently occurring linear recursive predicates exactly as it happens in conventional WAMs.
In procedural and functional languages featuring only deterministic calls it makes sense to avoid eager early structure creation. The WAM [28] follows this trend based on the argument that most logic programs are deterministic and therefore calls and structure creation in logic programming languages should follow this model. A more careful analysis suggests that the choice between
will favor different programming styles. Let's note at this point that the WAM's (common sense) assumptions are subject to the following paradox:
If the program is mostly nondeterministic then late and repeated construction (WAM) is not better than eager early creation (BinWAM) which is done only once, because it implies more work on backtracking.
This explains in part why (against all obvious intuitions), a standard AND-stack based WAM is not necessarily faster than a properly implemented BinWAM ^{6}.
Note that in the case of early failure in the context of deep backtracking conventional WAMs have a clear advantage although this will again compete with the BinWAM's smaller and unlinked choicepoints to the point that, for instance, the CHAT80 parser will turn out to be faster in BinProlog than in a good C-emulated WAMs like Sicstus 2.1.
In practice, different programming styles will benefit differently from the various tradeoffs implied by one of these basic choices. In any case, system-level knowledge on what's really happening inside the engine should be abstracted in higher-order constructs to the advantage of application-level users. BinProlog 3.30's new extensions (intuitionistic and linear implication, various higher order constructs, extended DCGs etc.) [20,19] are based on this motivation.
A simplified OR-stack having the layout shown in figure 1 is used only for (1-level) choice point creation in nondeterministic predicates.
P Þ | next clause address |
H Þ | saved top of the heap |
TR Þ | saved top of the trail |
A_{N+1} Þ | continuation argument register |
A_{N} Þ | saved argument register N |
... | ... |
A_{1} Þ | saved argument register 1 |
As a consequence, the heap consumption of the program goes up, although in some special cases, partial evaluation at source level can deal with the problem [9,14], showing again that a heap-only approach is not necessarily worse. As automating these source-level transformations needs global compilation and possibly some help from programmer declared pragmas, we wanted to ensure good performance for our engine without counting on them.
Predicates are classified as single-clause, deterministic and nondeterministic. At this time only predicates having all first-argument functors distinct are detected as deterministic. Although this is definitely a RISCy approach to indexing, we found that for predicates having a more general distribution of first-arguments, a source-to-source transformation can be used. In the future we plan to integrate typical uses of CUT and arithmetic tests in this indexing scheme and extended it with ML-style `pattern matching' compilation that generates decision trees.
Indexing of deterministic predicates is done by a unique SWITCH instruction. If the first argument dereferences to a non-variable, SWITCH either fails or finds the 1-word address of the unique matching clause in the global hash-table, using the predicate and the functor of the first argument as a 2-word key. A specialized JUMP-IF instruction deals with the frequent case of 2 clause deterministic predicates. To reduce the interpretation overhead SWITCH and JUMP_IF are combined with the preceding EXECUTE and the following GET_STRUCTURE^{7} instruction, giving EXEC_SWITCH and EXEC_JUMP_IF. This allows not only to avoid dereferencing the first argument twice, but also reduces branching that breaks the processor's pipeline. Note that the basic difference with the WAM is the absence of intensive tag analysis. This is related also to our different low-level data-representation.
We found it very convenient to precompute a functor in the code-space as a word of the form <arity,symbol-number,tag> and then simply compare it with objects on the heap or in registers at 1-cycle cost instead of comparing the tags, finding out that they are almost always the same, then compare the functor-name and find out that they are also the same and finally compare the arities with a costly if-logic. Therefore we kept our unusual tag-on-data representation, that also has the advantage of consuming less tag bits. Up to now, we have only 3 different tags, implemented on 2 bits and still there's a 4-th tag available for future use.
A seemingly not so important (but potentially costly) point is related to the layout of the fields inside the word. We describe it as it helps to compare our implementation with WAMs having a less efficient low-level data representation. We choosed to put most frequently used data at the end of the word, the next frequently used one at the beginning of the word and finally the less frequently used in the middle. Even on byte addressable machines it is a good idea to fetch the whole data in a register and than get the lower bits with a small mask and the upper bits with a right shift in 1 cycle. It is less efficient to get the bits at left with a mask, as a huge mask needs two instructions on most RISCs to be loaded in a register.
With this representation a functor fits completely in one word:
arity | symbol-number | 2-bit tag |
· | a | · | b | · | c | · | [] |
- t/2 1 ® t/2 2 ® t/2 3 n/0
- t/2 1 ¯
t/2 2 ¯
t/2 3 n/0
t/2 1 t/2 2 t/2 3 n/0
The effect of this last argument overlapping on t(1,t(2,t(3,n))) is represented in figure 2.
In BinProlog 1.71 only built-ins like det_append, copy_term, findall take advantage of this improved term representation. A change to the emulator integrated in BinProlog starting from version 2.20 has shown very encouraging results and practically no overhead by extending this optimization to all the terms. It also reduces the space consumption for lists and other `chained functors' to values similar or better than in the case of conventional WAMs. We refer to [24] for the details of the term-compression related optimizations of BinProlog.
UNIFY_VARIABLE_VARIABLE WRITE_VARIABLE_VARIABLE UNIFY_VALUE_VALUE WRITE_VALUE_VALUE UNIFY_VARIABLE_VALUE WRITE_VARIABLE_VALUE UNIFY_VALUE_VARIABLE WRITE_VALUE_VARIABLE
This gives, in the case of the binarized version of the recursive clause of append/3 the following code:
append([A|Xs],Ys,[A|Zs],Cont):-append(Xs,Ys,Zs,Cont). TRUST_ME_ELSE */4, % keeps also the arity = 4 GET_STRUCTURE X1, ./2 UNIFY_VAR_VAR X5, A1 GET_STRUCTURE X3, ./2 UNIFY_VAL_VAR X5, A3 EXEC_JUMP_IF append/4 % actually the address of append/4
The choice of candidates for instruction compression is based on low level profiling (instruction frequencies) and possibility of sharing of common work by two successive instructions and frequencies of functors with various arities. This justifies the choice of instructions like UNIFY_VARIABLE_VARIABLE.
Example 2 To emphasize some instruction compression possibilities not so obvious in the case of standard WAM let's show the BinProlog versus SICStus code for the clause:
a(X,Z):-b(X,Y),c(Y,Z). =>binary form=> a(X,Z,C):-b(X,Y,c(Y,Z,C)). SICSTUS Prolog 2.1 BinProlog 1.43 clause(a/2/1, a/3: [ifshallow PUT_STRUCTURE X4<-c/3 ,neck(2) WRITE_VAR_VAL X5,X2 ,else WRITE_VALUE X3 ,endif MOVE_REG X2<-X5 ,allocate MOVE_REG X3<-X4 ,get_y_variable(1,1) EXECUTE b/3 ,put_y_variable(0,1) ,init([]) BinProlog 1.71 ,call(b/2,2) ,put_y_unsafe_value(0,0) PUT_WRITE_VAR_VAL X4<-c/3, X5,X2 ,put_y_value(1,1) WRITE_VALUE X3 ,deallocate MOVE_REGx2 X2<-X5, X3<-X4 ,execute(c/2)]). EXECUTE b/3
Clearly, combinatorial explosion due to the elaborate case analysis (safe vs. unsafe, x-variable vs. y-variable) makes the job of advanced instruction compression tedious in the case of standard WAM ^{8}. Moreover, in the case of an emulated engine just decoding the init, allocate, deallocate and call instructions costs more than BinProlog's simple PUT_STRUCTURE and WRITE_VAR_VAL and their straightforward IF-less work on copying 3 heap cells from the registers.
BinProlog also integrates the preceding GET_STRUCTURE instruction into the double UNIFY instructions and the preceding PUT_STRUCTURE into the double WRITE instructions (starting from version 1.71). This gives another 16 instructions but it covers a large majority of uses of GET_STRUCTURE and PUT_STRUCTURE. Reducing interpretation overhead on those critical, high frequency instructions definitely contributes to the speed of our emulator. As a consequence, in the frequent case of structures of arity=2 (lists included), mode-related IF-logic is completely eliminated. The impact of this optimization can be seen clearly on the NREV benchmark (we refer to the section on performance evaluation).
Moreover, in the case of native code, reordering of BinProlog's PUT_STRUCTURE + WRITE groups can be very useful in slot-filling and more intricate super-scalar processing related instruction scheduling. On the other hand, the presence of environments in conventional WAM limits those reordering optimizations to one chunk. A very straightforward compilation to C [23] and the possibility of optimized `burst-mode' structure creation in PUT instructions are a direct consequence of binarization and would be harder to apply to AND-stack based traditional WAMs, which exhibit much less uniform instruction patterns.
Other Prologs also do instruction compression, and it is not unusual to hear about engines having 1000 instructions or more. Therefore this optimization is quite common. However, we found out that simplifying the unification instructions of the BinWAM allows for very `general-purpose' instruction compression. Conventional WAMs often limit this kind of optimization to lists. We changed the list-constructor of the NREV benchmark and found out that performance of Quintus Prolog 3.1.1 has dropped from 479 KLIPS to 130 KLIPS while BinProlog achieved a constant 216 KLIPS in both cases. In engines counting less on instruction compression like C-emulated SICStus 2.1 or SB-Prolog 3.1 the drop in performance was however less dramatic (from 139 to 120 KLIPS) and (from 103 to 89 KLIPS) respectively.
We claim that in a simplified engine instruction compression can be made more `abstract' and therefore with fewer compressed instructions we can hit a statistically more relevant part of the code.
It means that in BinProlog, for instance, arithmetic expressions or programs manipulating binary trees will benefit from our compression strategy while this may not be the case with conventional WAMs, unless they duplicate their (already) baroque list-instruction optimizations for arbitrary structures.
An other point is that instruction compression is usually applied inside a procedure. As BinProlog has a unique primitive EXECUTE instruction instead of standard WAM's CALL, ALLOCATE, DEALLOCATE, EXECUTE, PROCEED we can afford to do instruction compression across procedure boundaries with very little increase in code size due to relatively few different ways to combine control instructions. Inter-procedural instruction compression can be seen as a kind of `hand-crafted' partial evaluation at C-level, intended to optimize the main loop of the WAM-emulator. This can be seen as a special case of the call forwarding technique used in the implementation of jc [12,7]. It has the same effect as partial evaluation at source level which also eliminates procedure calls. At the global level, knowledge about possible continuations can also remove the run-time effort of address look-up for meta-predicates and useless trailing and dereferencing.
Case-overlapping is a well-known technique that saves code-size in the emulator. Note that we can share within the main switch statement of the emulator a case when an instruction is a specialization of its predecessor, based on the well known property of the case statement in C, that in the absence of a break; or continue; control flows from a case label to the next. It is a 0-cost operation. The following example, from our emulator, shows how we can share the code between EXEC_SWITCH and EXEC.
case EXEC_SWITCH: ......... case SWITCH: ......... break;
Note that instructions like EXEC_SWITCH or EXEC_JUMP_IF have actually a global compilation flavor as they exploit knowledge about the procedure which is called. Due to the very simple initial instruction set of the BinProlog engine these WAM-level code transformations are performed in C at load-time with no visible penalty on compilation time.
Finally, we can often apply both instruction compression and case-overlapping to further reduce the space requirements. As compressed WRITE-instructions are still just special cases of corresponding compressed UNIFY-instructions we have:
case UNIFY_VAR_VAL: ......... case WRITE_VAR_VAL: ......... break;
This explains why BinProlog's emulator is still only about 50K on most architectures with obviously positive effects on locality due to code reuse.
Our experiments in Prolog-to-C translation using BinProlog's WAM model and data representation show that performance similar to native SICStus and Aquarius can be obtained on top of this very simple instruction set [8].
An ideal memory manager is `ecological'. It works as a `self-purifying' engine that recuperates space not as a deliberate `garbage-collection' operation but as a natural `way of life' i.e something done inside the normal, useful activities the engine performs. Let's start by describing first some `real-world' ecology which has served as a model for BinProlog's heap-based findall implementation.
In a Prolog engine terms are created on the heap. Although WAM's environment stack does some limited and well intentioned garbage prevention (i.e. environment trimming), often garbage collection is needed for large practical programs. The possibility of garbage prevention is reduced even more in BinProlog where continuations go also on the heap.
Following the rain-forest metaphor, our objective is to set up a natural and automatic memory recuperation cycle. Basically the heap is split in a small lower half (the rain forest) and a large upper half (the clouds). The key idea is to fool Prolog's execution mechanism to work temporarily in the upper half (evaporation) and then let useful data `fall back' in the lower half of the heap (rain) in a condensed form (compaction by copying). The trick is very simple: as structure creation on the heap is always done around the H pointer while everything else stays in BinProlog's registers, all we have to do is temporarily set H to a location at the beginning of the upper half of the heap and then let the engine work as usual. The figure 1 shows this heap lifting technique and the position of the H pointer at various stages.
Picture Not Created.
The metaphor is stretched to its limits when applied to recursive uses of the mechanism, but the concrete implementation deals properly with this problem by saving the necessary information on a stack and by ensuring a reasonable number of embedded heap-splits.
We implemented both findall/3 and findall/4. The latter, (suggested by a public domain version of R.A. O'Keefe) appends the answers to an existing list. We used our fast copy_term to do it. Basically we execute the goal in the upper half^{9}. Findall is written as a failure driven loop in Prolog, using 3 builtins implemented in C.
findall(X,G,Xs):-findall_workhorse(X,G,[[]|Xs]). findall(X,G,Xs,End):-findall_workhorse(X,G,[End|Xs]). findall_workhorse(X,G,_):- lift_heap, G, findall_store_heap(X). findall_workhorse(_,_,Xs):- findall_load_heap(Xs).
The builtin lift_heap simply cuts the heap in half and temporarily assigns to H the value of the middle of the heap. The previous value of H and the previous HEAP_MARGIN used for overflow check are stacked. This allows for embedded calls to findall to work properly.
Then the goal G is executed in the upper half of the heap.
The builtin findall_store_heap pushes a copy of the answer X, made by copy_term to a an open ended list located in the lower half of the heap and starting from the initial position of H, that grows with each new answer. Then it forces backtracking.
The failure driven prolog-loop ensures that in the case of a finite generation of answers, they can all be collected by the builtin findall_load_heap starting from the initial position of H and forming an open ended list. The space used is exactly the sum of the sizes of the answers plus the size of the list cells used as glue. The builtin findall_load_heap puts on the heap and returns a cons-cell containing the end of the list and the list itself. It also removes from the stack H and HEAP_MARGIN, used inside copy_term to check heap overflow. For findall/4 we return also the end of the open ended list of answers, while for for findall/3 we close it with an empty list. The total size of the Prolog and C code is about half of the usual assert based (correct) implementation of findall.
The C-code of the 3 builtins is part of the main switch of our WAM-loop.
The implementation can be further accelerated by moving it completely to C, but we have left it in Prolog to allow tracing and to be able to use the same builtins for other primitives.
Note that at the end, all the upper half of the heap is freed. This means that an amount of space proportional to the size of the computation is freed within time proportional to the size of the answer.
We have used inside findall a very fast Cheney-style term copying algorithm [25] also available through Prolog's copy_term/2, which enforces term compression, preserves sharing in most practical cases and deals correctly with infinite terms.
Cheney's classical algorithm [6] copies a term in a breadth-first manner using forwarding pointers for already moved cells. This algorithm can be directly taken to implement copy_term. The single difference is that the forwarding pointers that destroy the original term, have to be trailed on a value trail. As long as no sharing of identical terms is present, the algorithm produces only forward references, depicted `\nearrow'. We modified this algorithm to enforce last argument overlapping as follows. When a structure is copied, its last argument and all its subterms that are again found as a last argument are copied first. In this manner, most possibilities for last argument overlapping are discovered. While the classical method is a pure breadth-first algorithm, our adaptation introduces a depth-first component. In the frequent case of a list of atomic cells our algorithm creates a fully contiguous vector-like object.
We illustrate BinProlog's algorithm with the term t([a,b,c],[d,e,f]). Breadth-first copying leads to the first memory layout, while the second layout is the result of our algorithm in BinProlog. Note that the first algorithm slices up all linear chains so that no last argument overlapping is possible.
As long as there are no shared subterms, our algorithm is able to exploit all overlaps. We refer to [25] for a best-case worst-case analyzis of the algorithm and the tradeoffs of various choices.
- t/2 \nearrow \nearrow ./2 a \nearrow ./2 d \nearrow ./2 b \nearrow ./2 e \nearrow ./2 c [] ./2 f []
- t/2 \nearrow ./2 d ./2 e ./2 f [] ./2 a ./2 b ./2 c []
The figure 1 compares the performance of BinProlog 2.20 with various optimization switches on a Sparcstation 10-20.
The meaning of the switches is the following:
For reference, timings for SICStus Prolog (column S2.1_8) are given. Timing is without garbage collection, as returned by statistics(runtime,_). The programs are available in the directory progs of the BinProlog distribution. NREV is the well-known naive reverse benchmark, CHAT-Parser is the unmodified version from the Berkeley Benchmark, FIBO(16)x50 is the recursive Fibonacci predicate, PERMS(8) is a permutation generator, DET-P(8) is a `prolog killer' deterministic all permutation program, FINDALL-P(8) is a findall-based all-permutations program, BOOTSTRAP is our compiler compiling itself, its libraries reader, tokenizer, writer and DCG preprocessor. QSORT is a version of the well known sorting program, DIFFEREN is a symbolic differentiation program and CHOICE^{10} is a backtracking intensive program from the ECRC benchmark. User-time and real-time are obtained directly^{11} with Unix's rusage tool and give an idea about the overall burden on the system that executes Prolog.
The figure 2 compares the code size of various program in BinProlog 3.30 with emulated and native SICStus Prolog on a Sparcstation 10-20.
Bmark/Compiler | No | T | I2,T | I3 | I3,T | I3,T,O | S2.1_8 |
NREV | 229kl | 238kl | 326kl | 408kl | 436kl | 453kl | 271kl |
BOOTSTRAP | 13.90s | 13.90s | 12.47s | 11.87s | 11.95s | 11.74s | 11.08s |
CHAT-Parser | 0.81s | 0.81s | 0.72s | 0.69s | 0.70s | 0.69s | 0.69s |
FIBO(16)x50 | 2.16s | 2.21s | 1.83s | 1.78s | 1.79s | 1.79s | 2.41s |
PERMS(8) | 0.49s | 0.48s | 0.41s | 0.36s | 0.37s | 0.36s | 0.49s |
DET-P(8) | 1.46s | 1.47s | 1.19s | 0.93s | 1.02s | 0.99s | 1.05s |
FINDALL-P(8) | 1.44s | 1.42s | 1.35s | 1.36s | 1.30s | 1.30s | 4.66s |
DIFFEREN | 0.61s | 0.61s | 0.54s | 0.52s | 0.53s | 0.51s | 0.67s |
CHOICE | 2.06s | 2.06s | 1.98s | 1.92s | 1.96s | 1.92s | 2.94s |
QSORT | 0.72s | 0.67s | 0.61s | 0.61s | 0.55s | 0.58s | 0.42s |
total user-time | 55.00s | 54.00s | 45.00s | 42.00s | 42.00s | 40.00s | 52.00s |
total real-time | 59.98s | 57.93s | 50.04s | 47.79s | 46.47s | 43.82s | 96.51s |
File/Compiler | BinProlog 3.30 | Emulated S2.1_9 | Native S2.1_9 |
NREV | 3079 | 16048 | 19072 |
QSORT | 3155 | 13456 | 18880 |
CHAT-Parser | 31732 | 123008 | 184080 |
Compiler version | No | T | I2,T | I3 | I3,T | I3,T,O |
Emulator size in bytes | 47460 | 47484 | 49356 | 51124 | 51244 | 51268 |
Note that we have used the same BinProlog run-time system in all the measurements so that the evaluation of the optimizations on a feature-by-feature basis is accurate. The various optimizations can be turned on and off independently with appropriate compile-time options for the C-emulator. Instruction and term-compression are subject to the general RISC-philosophy we have used in the design of BinProlog: the optimizations cover the most frequent idioms. Moreover, by careful case-overlapping redundant code is avoided when possible. Figure 3 shows the impact of the optimizations on code size.
Clearly the code-size increase from the basic No to the most optimized version I3,T,O that uses term compression, instruction compression up to length 3 (like PUT_WRITE_VAR_VAL ) and offset based structure manipulation inside the compressed instructions is still less then 4K.
Binarization based compilation of logic programming languages is very similar to the Continuation Passing Style compilation for functional programs described in [2] and [1]. The key idea of Appel as presented in [1] is that heap-deallocation by copying (generational) garbage-collection has a better amortized cost than stack-allocation and deallocation. As far as there's no backtracking involved, this principle can be used to partially garbage-collect in a very efficient way a deterministic, deep branch of a WAM-implemented SLD-tree. Although things can get more complex in the presence of choice points and trailing, memory management can be combined with other useful activities of the engine and most importantly with OR-parallel execution as described in [17].
^{1} Department of Computer Science, University of North Texas, Denton, Texas 76203 E-mail: tarau@cs.unt.edu
^{2} available from http://www.binnetcorp.com/BinProlog
^{3} Similar to the `goal-stacking' Warren rejected in 1983 []. We refer to [] for a thorough comparison of various goal-stacking implementations.
^{4} Our initial 200-line compiler generating code for a 6 instruction subset of ALS-prolog's WAM was probably the world's smallest definite program compiler.
^{5} Atom or term.
^{6} As the section about performance will show.
^{7} well, also GET_CONSTANT, although in this case it is not worth dealing with it separately
^{8} Current implementations of SICStus also do extensive instruction folding. They do need however a much larger set of folded instructions - about 256. This makes emulated code in SICStus 2.1_9 almost as large as native code and about 3-times larger than emulated BinProlog code.
^{9} Actually the ratio between the lower and the upper part is 1:4, 1:8 or even a higher power of 2. This must be in principle the statistically expected ratio between the size of the answer and the size of the computation that produces it.
^{10} Although shallow-backtracking is faster in C-emulated SICStus due to the optimization described in [] the overall performance for this benchmark is better in BinProlog due to its smaller and unlinked choice-points.
^{11} On a 1-user workstation with enough memory (32Mb) for in-core execution.
^{12} see [] for the details of its design and optimizations