Data-Flow Analysis for Constraint Logic-Based Languages
Chapter 1: Introduction
This study is aimed at the development of precise, practical, and theoretically well-founded data-flow analyzers for constraint logic-based languages. The emphasis on the above words is important for a full understanding of what follows, since some of the author's personal views and expectations are hidden behind them. Thus, in order to make our aims clear from the very beginning, we should start by explaining the exact meaning of each word. This can be better achieved by treating them in reverse order.
Many problems can be conveniently expressed, reasoned about, and solved by means of constraints. These problems range from temporal, spatial, and physical reasoning, to image processing, automated diagnosis, and all the problems that are typical of the fields of operations research and mathematical programming. As we will see, an important class of data-flow analyses for constraint logic-based languages can also be easily understood in terms of constraint manipulation.
The use of constraints as a description language, and of techniques based on constraints, has a long history in the field of Artificial Intelligence. Many constraint-based applications have been developed in the last 30 years in order to explore and solve specific problems. These, however, were often ad hoc systems with very little in common among them. Eventually, researchers realized that constraints could allow for programming machines in a novel way: constraint programming.[*]
Those programming paradigms that are ``based on logic'' are the best suited for constraint programming. Indeed, viewing basic constraints as atomic formulae in first-order predicate logic is a natural thing to do. Logic-based languages then provide logical and meta-logical ways of composing and dealing with constraints.
A significant part of the material contained in the present thesis deals with constraints and ways of approximating them by means of other constraints. Thus, several results are of interest independently from the considered programming paradigm. Nonetheless we focus our attention on the ``constraint logic programming'' paradigm for reasons that will soon be explained.
Constraint logic programming (CLP) is a generalization of the pure logic programming paradigm (LP), having very similar model-theoretic, fixpoint, and operational semantics [JL87,JM94]. It embodies a quite natural view of ``constraint computations'' expressed by definite clauses. The CLP notion of constraint solving in a given algebraic structure encompasses that of unification over some Herbrand universe. This gives CLP languages a great advantage over LP in terms of expressivity and flexibility.
We believe that CLP languages have a future, and this is the first reason for our particular interest. They provide elegant and simple linguistic support for constraint programming. This makes them suitable for a number of applications. Several CLP languages and systems exist to date, both in academic and industrial installations. Thus, CLP languages now have a significant user community. As a consequence we have a rough idea as to what a constraint logic program looks like: for a project that aims at some experimental validation of the theoretical ideas, this is an important factor.
Another motivation for studying data-flow analysis of CLP languages is that such languages can benefit from the analysis' results more than other paradigms. Both CLP compilers and programmers can do a much better job if they are given information about the run-time behavior of programs.
In some cases CLP programs can be naturally more efficient than the corresponding LP ones, because of the possibility of reasoning directly in the ``domain of discourse'' (integer arithmetic, for instance), without requiring complicated encodings of data objects as first-order terms. However, the basic operational step in CLP program execution, a test for solvability of constraints, is generally far more complex than unification. Ideally, a correct implementation of a CLP language needs a complete solver, that is, a full decision procedure for the satisfiability of constraints in the language's domain(s).[*] The indiscriminate use of such complete solvers in their full generality can lead to severe inefficiencies. For these reasons, the optimized compilation of CLP programs can give rise to impressive performance improvements, even more impressive than those obtainable for the compilation of Prolog.
The CLP paradigm inherits from LP the complete freedom that programmers have when writing their programs. Usually there are no prescriptive declarations (types, for instance) and any program that is syntactically correct is a legal one. Even though this is a much debated point, absolute freedom makes CLP languages attractive for the so-called fast prototyping of applications. The other side of the coin is that there is not much the compiler can do in order to help programmers and maintainers during the various phases of the software life-cycle. This clearly poses serious software engineering problems and, as a matter of fact, many prototypes do not evolve into real applications. The good news is that (constraint) logic programming, due to its unique semantic features, has a unique potential for formal program development. This possibility can be turned into a reality by providing a set of semantics-based tools (i.e., tools based on program analysis) for writing, debugging, manipulating, and reasoning about programs.
In summary, data-flow analysis has a great potential for providing valuable information to both the compiler and the developer of CLP programs. It promises to play a fundamental role in the achievement of the last point in the following wish list for CLP: (1) retaining the declarativity and flexibility of logic programming; (2) augmenting expressivity by allowing direct programming in the intended computational domain; (3) gaining a competitive advantage, in terms of efficiency, productivity, and maintainability, over other programming paradigms.
A data-flow analyzer is a computer program: it takes a program P as its input and, after a reasonable amount of time, it delivers some partial information about the run-time behavior of P. As such, a data-flow analyzer is quite different from what is usually meant by ``a data-flow analysis'': a collection of techniques by which one can, in principle, derive the desired information about the behavior of programs. While it is undoubtedly true that a data-flow analyzer implements some data-flow analyses, the standard connotations of the two terms allow for a huge gap in the level of detail that one has to provide.
Designing a data-flow analysis requires a significant effort. The theory of abstract interpretation [CC77,CC92a] is a blessing in this respect, as it provides a variety of frameworks for ensuring the correctness of the analysis, once you have defined a domain of approximate program properties. The theory also provides standard ways of composing existing domains, reasoning about them, and deriving approximate versions of the relevant operations of the language at hand. However, the theory does not tell you how to invent an abstract domain that represents a reasonable compromise between precision and efficiency for the class of programs being analyzed. This is what makes the design of data-flow analyses potentially hard, depending on the use one has in mind for them.
If the purpose of the analysis is to write a paper, a number of shortcuts are possible. Some of them are completely justified by the needs of the presentation: cluttering the description with unnecessary details is counterproductive. The author's definition for ``unnecessary detail'' is: something that can be easily filled in by the interested reader without breaking any result in the paper. Other shortcuts are more questionable. Here is an incomplete categorization of things that make life easier.
In the author's experience, the researcher who engages in the task of designing a data-flow analysis with reasonable chances of being successfully implementable must be prepared to have a bad time. The mistakes that have been outlined above can trap him at any stage. Furthermore, during the pencil-and-paper development many questions arise naturally, and most of them cannot be answered. One typical form of such questions is the following: ``am I sure that this is reasonably implementable?'' Sometimes the answer comes later in the actual implementation stage. If the answer is negative some other solution has to be found. The implementor will have to conduct his own research, and an analysis distinct from the one proposed by the designer will have to be implemented.
Another class of questions is: ``how much am I going to gain if I do that, for the average program?'' For relatively new programming paradigms such as CLP the notion of average program makes little sense, which, again, means that there can be no answer. Thus, the risk of spending much time refining an analysis so as to make it implementable, and later discovering that relatively few programs may benefit from the results, is high.
In an attempt to circumvent these difficulties, we started our work with analyzers, not just analyses, in mind. In other words, our interest is in ``implementable data-flow analyses'', or, stated differently, ``data-flow analyses that are actually implemented by analyzers''. There is no free lunch, of course: experience has shown that dealing with analyzers, while perhaps saving you from overlooking important aspects of the problem, forebodes a plethora of other troubles. This, however, is another story. By the way, the name of the author's pet analyzer is CHINA[*]. CHINA is a data-flow analyzer for CLP languages based on the abstract compilation approach. It performs bottom-up analysis deriving information on both call- and success-patterns by means of program transformations and optimized fixpoint computation techniques. It currently supports combined domains for: (parametric) structural information, simple types, definiteness, aliasing and freeness, numerical bounds and relations of and among numerical leaves, and polymorphic types. CHINA consists of about 40,000 lines of C++ code for the abstract interpreter and about 5,000 lines of Prolog code for the abstract compiler.
It is the author's impression that too often the importance ascribed to a result is directly proportional to the strength of its conclusion. The ratio between the strength of the conclusion and the strength of the premises should, instead, be considered the right measure. A weak conclusion that can be established starting from weak premises can be very valuable. This is especially the case when one deals with real implementations where, in order to tackle the various complexity/precision trade-offs, one needs as much freedom as possible. Often this implies that the strong premises cannot be ensured, while weak conclusions would be enough. Scalable theories are particularly precious in this respect. The author's favorite reference for abstract interpretation is Abstract Interpretation Frameworks, by Cousot and Cousot [CC92a]. There, the authors start from very weak hypotheses that only guarantee correctness and finiteness of the analysis. Only later are Galois connections and insertions introduced, in a blaze of glory, with all their nice properties.
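To fix ideas on what a Galois connection asserts, here is a minimal sketch (ours, not taken from [CC92a]): sets of integers abstracted by intervals, with the concretization restricted to a finite universe so the defining property can be checked by enumeration.

```python
def alpha(c):
    """Abstraction: a non-empty set of integers to its enclosing interval."""
    return (min(c), max(c))

def gamma(a, universe=range(-100, 101)):
    """Concretization, restricted to a finite universe for this demo."""
    lo, hi = a
    return {n for n in universe if lo <= n <= hi}

def leq(a1, a2):
    """Interval containment: the partial order of the abstract domain."""
    return a2[0] <= a1[0] and a1[1] <= a2[1]

# The defining property of a Galois connection:
#   alpha(c) <= a   if and only if   c is a subset of gamma(a)
c = {3, 5, 8}
for a in [(0, 10), (4, 9), (-7, 8)]:
    assert leq(alpha(c), a) == (c <= gamma(a))
```

The property guarantees that alpha(c) is the most precise abstract description of c, which is exactly what makes correctness proofs routine once the connection is established.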
Hypothesis 2 has already been discussed. Whenever it applies we are faced with the research problem of removing any limiting assumption.
The author favors the last hypothesis. A couple of seconds is a ridiculous time for, say, optimized compilation. If one takes into account that
So we should go for more precision. The problem is how to increase precision while avoiding the concrete effects of exponential complexity. Consider groundness analysis, for instance. The cruder domains do not pose any efficiency problem. In contrast, the more refined domains for groundness, such as Pos, work perfectly until you bump into a program clause with more than, say, fifty variables. At that point, Pos will blow up your computer's memory. One would like to have a more linear, or stable, behavior. The right solution, as indicated by Cousot and Cousot [CC92b], is not to revert to the simpler domains. We should use complex domains instead, together with sophisticated widening/narrowing operators. With this technique we can try to limit precision losses to those cases where we cannot afford the complexity implied by the refined domains.
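The widening idea can be sketched on the simplest numeric domain, intervals. The following Python fragment is a toy illustration of our own (the program being abstracted and its transfer function are not part of CHINA): the standard interval widening pushes any unstable bound to infinity, forcing the fixpoint iteration to stabilize in a few steps.

```python
import math

def widen(a, b):
    """Standard interval widening: any bound of `b` that is not stable
    with respect to `a` is pushed to infinity, so that every ascending
    chain of iterates stabilizes in finitely many steps."""
    (al, ah), (bl, bh) = a, b
    return (al if al <= bl else -math.inf,
            ah if ah >= bh else math.inf)

# Abstract iteration for the loop `x = 0; while ...: x = x + 1`,
# with a transfer function that deliberately ignores the loop guard.
x, steps = (0, 0), 0
while True:
    steps += 1
    body = (0, x[1] + 1)   # join of the entry value [0,0] with [1, hi+1]
    new = widen(x, body)
    if new == x:
        break
    x = new

print(x, steps)   # (0, inf) 2 -- instead of an unbounded ascending chain
```

A subsequent narrowing iteration, taking the loop guard into account, can then recover a finite upper bound: this is precisely the mechanism that limits precision losses to where they are unavoidable.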
Ideally, it should be possible to endow data-flow analyzers with a knob. The user could then ``rotate the knob'' in order to control the complexity/precision ratio of the system. The widening/narrowing approach can make this possibility a reality. Unfortunately, the design of widening operators tends somewhat to escape the realm of theoretical analysis, and thus, in the author's opinion, it has not been studied enough. Indeed, the development of successful widening operators requires, perhaps more than anything else, extensive experimentation.
In this thesis you can find an account of the author's research from March 1993 to October 1996. The work included here is all centered around the development of the CHINA analyzer. By this we mean:
Some history is required so as to put the subsequent material into context. Also, since some chapters in this thesis, or their leading ideas, are based on papers that have already been published, we take the occasion to report here the relevant bibliographic sources.
We tackled the problem of data-flow analysis of CLP languages over finite numeric domains back in 1992. In [Bag92] we proposed an abstract interpretation deriving spatial approximations of the success set of program predicates. The concrete interpretation of a constraint, that is, a shape in k-dimensional space, was abstracted by an enclosing, though not necessarily minimal, bounding box. Bounding boxes are just rectangular regions with sides parallel to the axes. Thus, a bounding box is uniquely identified by its projections (i.e., intervals) onto the axes associated with the variables. Of course, bounding boxes are very coarse approximations of general shapes. Finer spatial approximations exist and are well known, such as enclosing convex polyhedra, grid and Z-order approximations [HMO91], and so on. However, the coarseness of bounding boxes is well repaid by the relative ease with which they can be derived, manipulated, and used to deduce information relevant to static analysis. There are several techniques for deriving bounding boxes from a given set of arithmetic constraints (especially if one confines attention to linear constraints): variants of Fourier-Motzkin variable elimination [DE73,Pug92], the sup-inf method [Ble74,Sho77,Sho81], etc. In [Bag92] we considered a variant of the constraint propagation technique called label inference with interval labels [Dav87]. More precisely, we used the Waltz algorithm [Wal75,Mac77,MF85] applying label refinement, in which the bounding box corresponding to a set of constraints is gradually restricted.
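A minimal sketch of this kind of label refinement follows (the constraints and the rule encoding are our own illustrative choices, not the algorithm as implemented in [Bag92]): each variable's interval is repeatedly intersected with the interval implied by each constraint, until a fixpoint is reached.

```python
def refine(box, rules):
    """Waltz-style label refinement: intersect each variable's interval
    with the interval implied by each constraint, to a fixpoint.
    `rules` maps a variable to functions computing an implied interval
    from the current box."""
    changed = True
    while changed:
        changed = False
        for v, vrules in rules.items():
            lo, hi = box[v]
            for rule in vrules:
                rlo, rhi = rule(box)
                lo, hi = max(lo, rlo), min(hi, rhi)
            if (lo, hi) != box[v]:
                box[v] = (lo, hi)
                changed = True
    return box

# x + y = 10 and x - y = 2 over the initial box [0,10] x [0,10].
box = {"x": (0.0, 10.0), "y": (0.0, 10.0)}
rules = {
    "x": [lambda b: (10 - b["y"][1], 10 - b["y"][0]),   # x = 10 - y
          lambda b: (2 + b["y"][0], 2 + b["y"][1])],    # x = 2 + y
    "y": [lambda b: (10 - b["x"][1], 10 - b["x"][0]),   # y = 10 - x
          lambda b: (b["x"][0] - 2, b["x"][1] - 2)],    # y = x - 2
}
print(refine(box, rules))  # {'x': (2.0, 10.0), 'y': (0.0, 8.0)}
```

Note that the resulting box soundly encloses the unique solution (6, 4) but is not minimal: this is precisely the coarseness of bounding boxes referred to above.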
Later, we turned our attention to the analysis and compilation issues of languages such as CLP(R). In particular, we studied a notion of future redundant constraint less restrictive than the one introduced by Jørgensen et al. [JMM91]. An analysis for the detection of implicit numeric constraints, future redundant ones in particular, was sketched by Giacobazzi, Levi and the author in [BGL92], and the constraint propagation techniques used, collectively known as constraint inference [Dav87], were presented in [BGL93]. Meanwhile, the development of CHINA's first prototype was started. The applications of implicit and redundant constraint analysis were first described in [Bag94], together with some experimental results obtained with the prototype analyzer.
The formalization of the (concrete and abstract) domains and of the analysis was unsatisfactory. What characterizes a numeric constraint? A possible answer is ``the set of its solutions''. Some authors have formalized concrete domains that way [GDL92]. But this does not work in general since, in order to constructively define an abstraction function, you must know the solutions. Unfortunately this is infeasible or even impossible, in general: solving a set of equations of the form (x_i - x_j)^2 + (y_i - y_j)^2 = c_ij is NP-hard [Yem79], whereas it is undecidable whether a transcendental expression is equal to zero [Ric68]. As a final example, solving a set of linear integer constraints with at most two variables per inequality is NP-complete [Lag85].
For these reasons we chose to define a constraint as ``the set of its consequences under some consequence relation''. This is a much more reasonable definition, as it allows us to capture both the concrete domains really implemented by CLP systems and the class of approximations we were considering. We were thus led to the study of partial information systems [Sco82] and constraint systems [SRP91]. Another theoretical difficulty was due to the fact that our numerical domain was a combination of two other domains: one for real intervals and the other for ordinal relationships. The combination was something different from the reduced product: sometimes less precise, sometimes more. More importantly, we wanted the combination to be programmable in order to meet different balances between precision and efficiency. As a solution, we gave the dignity of constraints to a restricted class of cc agents [SRP91,Sar93]. Some pieces of the theory were missing[*] and we became deeply involved in the subject. The results appeared first in [Bag95] and were later refined in [Bag97].
Meanwhile, the development of the second prototype of CHINA was initiated. At some point we faced the problem of analyzing non-linear constraints for CLP(R). Since in CLP(R) non-linear constraints are delayed until they become linear, the analyzer is not allowed to derive anything from them unless the analysis has inferred that they have indeed become linear. Having solved the theoretical difficulties about the combination of domains, we implemented the Pos groundness domain in the standard way and combined it with the numerical component. The interaction could be implemented in different ways: the numerical component could ask the definiteness component, unsolicited definiteness information could be sent to the numerical component, or a mixture of the two. In addition, some groundness information that is synthesized in the numerical component should be notified to the definiteness component. However, no one had studied the problem of efficiently detecting ground variables in the context of the standard implementation of Pos. This motivated our interest in groundness analysis, a topic that we believed was almost closed. The outcome of our studies on the subject was briefly sketched in [Bag96c] and later presented in [Bag96b].
The CHINA system is still evolving, and has never previously been described, although a very short description of the first prototype of CHINA is given in [Bag94].
Recently, we have been working on the precise detection of call-patterns by means of program transformation. In a joint work with Maria Cristina Scudellari (a former student at the University of Pisa) we devised a solution. However, this has not undergone experimental evaluation yet. It is presented here for the first time in Chapter 7.
Chapter 2 introduces some of the necessary mathematical background. It is mostly intended to define the terminology and notation that will be used throughout the thesis.
Chapter 3 is motivated by the fact that many interesting analyses for constraint logic-based languages are aimed at the detection of monotonic properties, that is to say, properties which are preserved as the computation progresses. Our basic claim is that most, if not all, of these analyses can be described within a unified notion of constraint domains. We present a class of constraint systems which allows for a smooth integration within an appropriate framework for the definition of non-standard semantics of constraint logic-based languages. Such a framework is also presented and motivated. We then show how such domains can be built, and we present construction techniques which induce a hierarchy of domains with interesting properties. In particular, we propose a general methodology for domain combination with asynchronous interaction (i.e., the interaction is not necessarily synchronized with the domains' operations). Following this methodology, interesting combinations of domains can be expressed with all the semantic elegance of concurrent constraint programming languages.
Chapter 4 presents the rational construction of a generic domain for structural analysis of CLP languages: Pattern(D#), where the parameter D# is an abstract domain satisfying certain properties. Our domain builds on Pat(R), the parameterized domain for the analysis of Prolog programs due to Cortesi et al. [CLV93,CLV94]. However, the formalization of our CLP abstract domain is independent of specific implementation techniques: Pat(R) (slightly extended and corrected) is one of the possible implementations. Reasoning at a higher level of abstraction, we are able to appeal to familiar notions of unification theory. One advantage is that we can identify an important parameter (a common anti-instance function, missing in [CLV93]) that gives some control over the precision and computational cost of the resulting generic structural domain. Moreover, we deal --apparently for the first time-- with the occur-check problem: while, to our knowledge, all the analysis domains that have been proposed in the literature (Pat(R) included) assume that the analyzed language implements complete unification, most real languages do omit the occur-check. This fact has serious implications for both the correctness and the precision of the analysis.
Chapter 5 addresses the problem of compile-time detection of implicit and redundant numeric constraints in CLP programs. We discuss how this kind of constraint has several important applications in the general field of semantics-based program manipulation and, specifically, in optimized compilation. In particular, we propose a novel optimization based on call-graph simplification. We then describe a sequence of approximations for characterizing, by means of ranges and relations, the values of the numerical leaves that appear in Herbrand terms. This, among other things, brings to light the need for information about how many numerical leaves a term has. For each approximation needed for the analysis we offer different alternatives that allow the complexity/precision tradeoff to be dealt with in several ways. The overall numeric domain (or, better, a family of them) is obtained (among other things) by means of the product and ask-and-tell constructions of Chapter 3. The ask-and-tell constraint system constitutes a very convenient formalism for expressing both (1) efficient reasoning techniques originating from the world of artificial intelligence, where approximate deduction has held the spotlight since the origins of the field; and (2) all the abstract operations needed for the analysis. In practice, we define a family of concurrent languages that serve as target languages in an abstract compilation approach.
Chapter 6 deals with groundness analysis for (constraint) logic programs. This subject has been widely studied, and interesting domains have been proposed. Pos (a domain of Boolean functions) has been recognized as the most suitable domain for capturing the kind of dependencies arising in groundness analysis. Its standard implementation is based on reduced ordered binary-decision diagrams (ROBDDs), a well-known symbolic representation for Boolean functions. Even though several authors have reported positive experiences using ROBDDs for groundness analysis, in the literature there is no reference to the problem of efficiently detecting those variables that are deemed to be ground in the context of a ROBDD. This is not surprising, since most currently implemented analyzers need to derive this information only at the end of the analysis and only for presentation purposes. Things are quite different when this information is required during the analysis. This need arises when dealing with languages that employ some sort of delay mechanism, typically based on groundness conditions. In these cases, the naïve approaches are too inefficient, since the abstract interpreter must quickly (and often) decide whether a constraint is delayed or not. Fast access to ground variables is also necessary when aliasing analysis is performed using a domain that does not keep track of ground dependencies. We introduce and study the problem, proposing two possible solutions. The second one, besides making the quick detection of ground variables possible, also has the effect of keeping the ROBDDs as small as possible, improving the efficiency of groundness analysis in itself.
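To fix ideas on what ``detecting ground variables'' means in Pos: a variable is definitely ground exactly when the Boolean function describing the current groundness dependencies entails it. The sketch below (our own toy, deliberately exponential: real implementations, and the techniques proposed in this chapter, work on ROBDDs) checks entailment by brute-force enumeration of models.

```python
from itertools import product

def models(vars_, f):
    """All models of Boolean function `f` over `vars_` (a naive
    truth-table representation, exponential in len(vars_))."""
    return [dict(zip(vars_, bits))
            for bits in product([False, True], repeat=len(vars_))
            if f(dict(zip(vars_, bits)))]

def ground_vars(vars_, f):
    """A variable is definitely ground iff it is true in every model
    of f, i.e. iff the formula entails it."""
    ms = models(vars_, f)
    return {v for v in vars_ if all(m[v] for m in ms)}

# Abstract meaning of  X = f(Y, Z), Y = a :
# (x <-> (y & z)) & y, whose models all force y to be true.
vars_ = ["x", "y", "z"]
f = lambda m: (m["x"] == (m["y"] and m["z"])) and m["y"]
print(ground_vars(vars_, f))  # {'y'}
```

The chapter's problem is obtaining the same answer directly from the ROBDD representing f, quickly enough to be consulted at every delayed constraint.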
Chapter 7 presents some recent work on the analysis of call-patterns for constraint logic programs. In principle, call-patterns can be reconstructed, to a limited extent, from the success-patterns. This, however, often implies significant precision losses. As the precision of call-patterns is very important for many applications, their direct computation is desirable. Top-down analysis methods are usually advocated for this purpose, since the standard execution strategy of (constraint) logic programs is top-down. Alternatively, bottom-up analysis methods based on the Magic Sets or similar transformations can be used [Kan93,CDY94,Nil91]. This approach, however, can result in a loss of precision because the connection between call- and success-patterns is not preserved. In a recent work, Debray and Ramakrishnan [DR94] introduced a bottom-up analysis technique for logic programs based on program transformation. They showed, among other things, that their bottom-up analysis is at least as precise (on both call- and success-patterns) as any top-down abstract interpretation using the same abstract domain and abstract operators. The basic idea behind [DR94] is to employ a variation of the Magic Templates algorithm [Ram88]. Moreover, the (possibly approximated or abstract) ``meaning'' of a program clause (or predicate) is a partial function mapping descriptions of tuples of terms (the arguments at the moment of the invocation: call-patterns) into sets of such descriptions (describing the possible arguments at the return point of that invocation: success-patterns). The solution of [DR94], however, does not generalize to the entire class of CLP languages, since it exploits the peculiar properties of the Herbrand constraint system, where a tuple of terms is a strong normal form for constraints. This way, and by duplicating the arguments, they are able to describe, in the clause heads, the partial functions which represent the connection between call- and success-patterns.
Of course, this cannot be done for generic CLP languages. Our aim is thus to generalize the overall result of [DR94] to the general case of constraint logic programs: ``the abstract interpretation of languages with a top-down execution strategy need not itself be top-down ''.
Chapter 8 draws some final conclusions about what we have done and learned. More importantly, it traces some lines for further research, describing what remains to be done in order to approach the objectives that have been delineated in this introduction.
As the reader will have already noticed, our approach is based on a mixture of theory and experimentation. In the field of computer science these two aspects of research are too often confined to distinct, separate worlds. On this subject, the ``Committee on Academic Careers for Experimental Computer Scientists'' of the U.S. National Research Council has words that deserve a long quotation [NRC94].
[...] the crux of the problem is a critical difference in the way the theoretical and experimental research methodologies approach research questions. The problem derives from the enormous complexity that is fundamental to computational problems [...] This complexity is confronted in the theoretical and experimental research in different ways, as the following oversimplified formulation exhibits.
This excerpt does contain a faithful description of the incommunicability that seems to be so frequent in our field. However, it takes for granted that there are two kinds of computer science researcher: the theoretician and the experimentalist.
The author has done five years of ``real programming work'' in environments where the theoreticians were often laughed at (and these environments were probably among the most enlightened ones where real programming takes place). And during the last four years he has done full-time research work in a milieu where experimentalists ``look strange'', so to speak. These hostile feelings are sometimes justified. There are, indeed, experimentalists who seem unable to communicate their findings to other researchers; some even refuse to do so. On the other hand, there are theoreticians who like to fill the ``applications'' sections of their papers with conjectures, even though these speculations are not explicitly presented as such.
These observations have convinced the author that being a theoretician and an experimentalist at the same time was something worth a try. The risk of being mediocre in both activities is high, of course, but we must say that we are satisfied with this choice. Having tried, our impression is that theoretical and experimental work do have synergistic effects. As has already been mentioned, most of our theoretical studies were suggested by practical difficulties. Moreover, the author shamelessly admits that several deficiencies in his theories have been discovered either at the time of writing the actual code (not knowing what to write: an under-specified theory), or by observing the results obtained with the implementation (a wrong theory).
If you try to write for the novice
you will communicate with the experts;
otherwise you will communicate with nobody.
-- DONALD E. KNUTH, lecturing on Mathematical Writing,
Stanford University (1987)