
Precise and Practical Mode Analysis with the CHINA Analyzer
Roberto Bagnara
In this paper we will briefly describe some recent developments that allow to perform precise and practical mode analysis with the CHINA analyzer. The emphasized words are important. On one hand, precision means that we use the most precise domains and domain combinations known to date. On the other hand, the issue of practicality has many facets. Space limitations do not allow to discuss them here, but they can be synthesized as follows. The analyzer should deal with real programs designed to run on real, implemented languages, it should never crash (the only limitation being the amount of available virtual memory), and it must terminate in reasonable time for any input program. What enables CHINA to be able to achieve this goal for mode analysis is a particular combination of domains and the techniques that have been adopted for their implementation. These domains capture information about groundness, compoundness, pairsharing, freeness, and linearity. The Groundness DomainThe abstract domain employed in CHINA for detecting the property of definite groundness is based on Pos, the domain of positive Boolean functions. Pos functions are expressive enough to capture the kind of dependencies that are implicit in unification constraints such asZ=f(g(X),Y) :
the corresponding Boolean function
is (X and Y) iff Z,
meaning that Z is ground if and only if
X and Y are so.
They also capture dependencies arising from other constraint domains:
for instance, the CLP(R) constraint X+2Y+Z=4
can be abstracted as
((X and Y) implies Z) and ((X and Z) implies Y) and ((Y and Z) implies X) ,
indicating that determining any two variables is sufficient to
determine the third.
This ability to express dependencies makes analysis based on Pos very precise, but also makes it relatively expensive, as many operations on Boolean formulas have exponential worst case complexity. Reduced Ordered Binary Decision Diagrams (ROBDDs) also known as Bryant graphs, are a representation for Boolean functions supporting many efficient operations. Because of this, they have often been used to implement Pos for abstract interpretation. However, the observation of many (constraint) logic programs shows that the percentage of variables that are found to be ground during the analysis, for typical invocations, is as high as 80%. This suggests that representing Pos elements simply by means of ROBDDs, is probably not the best we can do [Bag96].
It is also the case that ROBDDs generated during program analysis
often contain many pairs of variables that are equivalent. In the
context of groundness analysis, this means that either both are
ground, or neither is. Such equivalent variables appear, of course,
for a program goal of the form The frequency with which definitely ground variables and groundequivalent pairs of variables occur in the analysis of real programs suggested to us to factorize apart this information [BS99]. In the GER representation, an element of Pos is represented by means of a triple: the G component (from "ground") records the set of ground variables; the E component (from "equivalent") records a transitively closed set of pairs of equivalent variables; the R component (from "ROBDD") records a residual groundness dependency. There are elements of Pos that can be represented by several such triples and, in the GER representation, we need to make a choice among those. This choice must be canonical and economical. Economy can be explained as follows: true variables are most efficiently represented in the first component (a bitvector at the implementation level) and should not occur anywhere else in the representation. Equivalent variables are best represented in the second component of the GER representation (implemented as a vector of integers). As equivalent variables partition the space of variables into equivalence classes, only one variable per equivalence class must occur in the ROBDD constituting the third component of the representation. If we choose, say, the least variable (with respect to a predefined total ordering on variables) of each equivalence class as the representative of the class, we have also ensured canonicity. The GER representation has several advantages. The ROBDD generated during the analysis are kept as small as possible, and thus several important operations (which are of exponential worst case complexity) are speeded up considerably. Of course, ensuring reduction of the GER representation has a non negligible cost, but this overhead is greatly repayed: only the analysis of tiny programs shows a very modest slowdown (which is measured in hundredth of a second) while on the analysis of real programs the GER representation gives rise to speedups of up to an order of magnitude. Another advantage of the GER representation (which, incidentally, was the very reason why this line of research was started) is that the information about ground variables is readily available, since all and only the ground variables are kept in the G component. Obtaining the set of ground variables or testing whether a certain variable is ground are thus very efficient operations. As we will see, this characteristic allows to improve the efficiency of other analyses.
When using the GER representation of Pos, it is quite
rare to find real programs for which the analysis becomes too complex.
In other words, the factorization of ground and groundequivalent
variables is such that the exponential nature of Pos
seldom manifests itself.
However, programs that make the GER representation explode do exist
and, moreover, any serious domain of analysis whose complexity is, say,
more than linear in the number of variables, needs a widening.
The GER representation lends itself to a particularly simple
and effective widening, both in terms of precision and in terms
of efficiency of the implementation.
Since the G and E components have a linear size representation
they do not need any widening.
In contrast, the R component can grow exponentially and this is where
widening takes place.
The widening imposes a limit on the maximum number of ROBDD nodes
allocated by any operation used by the analysis.
When this limit is reached an exception is thrown and a sound
approximation of the operation is returned instead of the real result.
For instance, here is how conjunction is implemented
in the ROBDD class upon which the GER representation for Pos is based
( const BNode* BNode::and_approximate_up(const BNode* y) const { try { return unsafe_and(this, y); } catch(ROBDD_Too_Complex) { emergency_cleanup(); return this; // x is a sound upapproximation of x & y. } } Theoretically speaking, another source of potential difficulties for a Posbased analysis is due to the fact that the longest increasing chain in Pos has exponential length in the number of variables. This means that the chaotic iteration could traverse a very long chain of Pos descriptions before converging to a (post) fixpoint. While programs exhibiting this behavior can easily be written on purpose, no real program we have access to gives rise to this phenomenon. Nonetheless, the "always terminate in reasonable time" rule demands this possibility to be taken into account. Again, the G and E components of the GER representation do not pose any problems: the longest chains in G and E have linear length in the number of variables. Given that the problem rarely manifests itself (to say the least) the simple widening ever proposed suffices: allow the ROBDD components to change only a certain number of times, then approximate them with True (which, in terms of Pos is the ``don't know'', description). The Domain for CompoundnessCompoundness here means `nonvariable'; that is, a variable is compound if it is bound to a term that is definitely not free. Compoundness information is very important to optimize clause indexing.
In the CHINA analyzer, the domain used
for compoundness is, again, Pos.
This is because compoundness and compoundness dependencies
are well represented by positive Boolean functions.
Thus the global domain employed by CHINA
includes a second GER representation for compoundness.
As any ground variables is clearly compound, the G component
of the groundness domain is always propagated to the GER representation
used to track compoundness.
This is just an example where the ready availability of the set
of ground variables is used to speed up other analyses.
Note, however, that the combination with groundness information
is an essential ingredient for achieving reasonable precision
on compoundness.
In fact, the compoundness Pos description for the
equation The Domain for PairSharing, Freeness, and LinearityThe abstract domain used in the CHINA analyzer to capture pairsharing is based on the setsharing domain Sharing of Jacobs and Langen. In 1997, when we first incorporated the Sharing domain into the analyzer, a number of theoretical and practical questions were still open. The most important issue was the one of correctness: up to 1998 there seemed to be no published proofs of the soundness of the domain. In [HBZ98] we have proved the soundness, idempotence, and commutativity of Sharing. Most importantly, these results have been established, for the first time, without assuming that the analyzed language performs the occurcheck in the unification procedure. This closed a longstanding gap, as all the works on the use of Sharing for the analysis of Prolog programs had always disregarded this problem.The goal of sharing analysis is to detect which pairs of variables are definitely independent (namely, they cannot be bound to terms having one or more variables in common). In fact, in the literature we can find no reference to the "independence of a set of variables". All the proposed applications of sharing analysis (compiletime optimizations, occurcheck reduction and so on) are based on information about the independence of pairs of variables. It it thus natural to ask whether the setsharing domain Sharing is adequate with respect to the property of interest, that is, pairsharing. In [BHZ97] we have proved that Sharing is redundant for pairsharing and we have identified the weakest abstraction of Sharing that can capture pairsharing with the same degree of precision. One notable advantage of this abstraction is that the costly starunion operator, whose complexity is exponential in the number of sharinggroups, is no longer necessary and can be safely replaced by the binaryunion operator, which has quadratic complexity. This line of work has thus had an important impact on the issue of practicality of sharing analysis. In fact, the nonredundant domain described in [BHZ97] gives rise to speedups of up to three order of magnitudes on the analysis of real programs. The complexity of the Sharing domain is exponential in the number of program variables. Its nonredundant version has also exponential complexity, even though, in practice, the nonredundant domain is much better behaved than the original. In any case, a stable behavior, and thus a truly practical domain, can only be obtained by means of widening operators. The problem of scalability of nonredundant Sharing while still retaining as much precision as possible has been tackled in [ZBH99], where a family of widenings is presented that allows to achieve the desired goal. We moved from the observation that, when the sharingsets become large, then they are at the same time heavy to manipulate and, at least for a subset of the variables involved, light as far as information content is concerned. We have thus introduced a new representation for setsharing made of two components. They are both sharingsets, that is, sets of sets of variables. However, while the second component is interpreted as in the standard Sharing domain, the first component records worstcase sharing assumptions of sets of variables. Let us consider, for instance, the sharinggroup {X,Y,Z}. While its presence on the second component indicates that variables X, Y, and Z can share a common variable, the same sharinggroup in the first component means that X, Y, and Z can share variables in all possible ways. This new representation supports a wide range of widenings. These widenings allow to trade precision for efficiency by moving information from the second component to the first one. What differentiates one widening from another concerns what to move and when. The advantage of this new representation is that the first component is able to represent possible sharing in a very compact, though imprecise way. Descriptions constituted by hundreds of thousands of sharinggroups can be collapsed into descriptions consisting of a few tens of such groups, often with very little precision loss. This phenomenon has been actually observed in the analysis of real programs. Of course, some precision is lost, but experimentation showed that the loss concerns almost exclusively ground dependencies, and these can be recovered by coupling the new sharing domain with Pos. This brings us to the topic of combining sharing domains with other domains. Practically speaking, a sharing domain is never used in isolation. The first combination was suggested by Langen himself in his PhD thesis: linearity (the property of all variables that can only be bound to terms without multiple occurrences of variables) can greatly improve the accuracy of sharing analysis. The synergy attainable from the integration between aliasing and freeness information (freeness is the property of all variables that can only be bound to other variables) has been pointed out, for the first time, by Muthukumar and Hermenegildo. Besides that, freeness is a very useful property in itself. These standard combinations are now widely accepted and nobody would seriously think of performing sharing analysis without them. However, a number of other proposals for refined domain combinations have been circulating more or less clandestinely for years. One feature that is common to these proposals is that they do not seem to have undergone experimental evaluation. We have tackled this issue in [BZH99], where we have evaluated several proposals for precise domain combinations. Apart from the combination with structural information the results of this work were disappointing: only very small precision gains seem to be attainable by the more refined domain combinations. The combination with Pos, also studied in [BZH99], deserves further discussion. It is well known that Sharing keeps track of ground dependencies. More precisely, Sharing contains Def, the domain of definite Boolean functions, as a proper subdomain. However, there are several good reasons to couple Sharing with Pos: (1) this combination is essential for the powerful widening techniques on nonredundant Sharing mentioned above to be applied. This is very important, since analysis based on (nonredundant) Sharing without a widening is not practical. (2) Def is not expressive enough to capture all the ground dependencies of Prolog programs. Moreover, Def cannot even capture the dependencies induced by the primitive constraints of some CLP languages, and we target the analysis at Prolog and CLP programs. (3) In the context of the analysis of CLP programs, the notions of "ground variable" and the notion of "variable that cannot share a common variable with other variables" are distinct. A numeric variable in, say, CLP(R), cannot share with other variables but is not ground unless it has been constrained to a unique value. Thus the analysis of CLP programs with Sharing alone will either lose precision on pairsharing (if numerical variables are allowed to "share" in order to compute their groundness) or be unable to compute the groundness of numerical variables (if numerical variables are excluded from the sharingsets). In the first alternative, as we have already noted, the precision with which groundness of numerical variables can be tracked will also be limited. Since groundness of numerical variables is important for a number of applications (e.g., compiling equality constraints down to assignments or tests in some circumstances), we advocate the use of Pos and Sharing at the same time. (4) Detecting definitely ground variables through Pos and exploiting them to simplify the operations on Sharing is very worthwhile as far as efficiency is concerned, if the set of ground variables is readily available. This is the case, for instance, with the GER implementation of Pos described above. Using the GER representation, the propagation of ground variables to the sharing domain allows to obtain additional speedups of up to two orders of magnitude in the analysis of real programs. (5) Knowing the set of ground variables in advance, not only reduces the complexity of the operations of the sharing domain. It also improves precision when the domain keeps track of linearity information, as in our case. In fact, while it has been proved that Sharing and its derivatives are commutative, meaning that the result of the analysis does not depend on the ordering in which the bindings are executed [HBZ98], Sharing plus linearity does not enjoy this property. In particular, it is known since the thesis of Langen that best results are obtained if the grounding unifications are considered before the others. Again, the combination with Pos, since it allows the analyzer to know the set of all definitely ground variables in advance, is clearly advantageous in this respect. Bibliography


bagnara@cs.unipr.it 
Home  Personal  Papers  Teaching  Links 