Verification of C Programs
Via Natural Semantics and Abstract Interpretation |
We are witnessing a substantial lack of available tools able to verify the absence of relevant classes of run-time errors in code written in (reasonably rich fragments of) C and C++. This is despite the progress made in recent years in the fields of program analysis and verification, and despite the huge impact such tools could have on the quality of a good portion of our software universe. It is interesting to observe that, among the dozens of freely available software development tools, hardly any, by analyzing the program semantics, are able to certify the absence of important classes of run-time hazards such as, say, the widely known buffer overflows in C code. The reason is, of course, that C and C++ are complex languages and the techniques that can be used to dominate this complexity still do not reduce tool development to simple, manageable tasks. Our overall aim for this research is to investigate how known techniques based on natural semantics and abstract interpretation can be extended so as to conveniently formalize and implement a range of analysis and verification tools for modern imperative languages such as C and C++.
With this aim in mind, in [2] we define a core language —called CPM— that has much in common with C and includes several features that are problematic from the point of view of the semantic analysis of C and C++ code: recursive functions, run-time system and user-defined exceptions, realistic data and memory models, pointer types to both data objects and functions, and non-structured control flow mechanisms. Note that the contradiction between targeting “real” imperative programming languages and choosing CPM, an unreal one, is only apparent. As C misses exceptions and C++ is too hard as a starter, choosing any one of these would not have allowed us to assess the adequacy of the methodology described below with respect to the above goals.
Verification of many program properties using static program analysis via abstract interpretation [9, 11] is now a well-researched area. Static analysis is conducted by mimicking the concrete execution of the programs on an abstract domain. This is a set of computable representations of program properties equipped with all the operations required to mirror, in an approximate though correct way, the real, concrete executions of the program. Therefore, a formal concrete semantics for the language to be analyzed that models all the aspects of executions that are relevant to the properties of interest must be provided. Of course, we need to work on a language that is completely defined. For the C language, for instance, this can be achieved by converting the source programs to some more constrained language —like CIL, the C Intermediate Language described in [17]— where all ambiguities have been removed and by fixing an ABI (Application Binary Interface) so as to conform to the C implementation of interest. The problem is then to define a concrete semantics for the fully specified language so as to ensure that:
We now review the first two points; we will come back to the third point after introducing the abstract semantics.
We need a formal semantics that can be recognized (without requiring a strong mathematical background) as corresponding to the intuitive, often involved and incomplete explanations provided by standardization documents. For this purpose, we adopted the G∞SOS approach of Cousot and Cousot [12] which generalizes with infinite computations the natural semantics approach by Kahn [16] which, in turn, is a “big-step” operational semantics defined by structural induction on program structures in the style of Plotkin [18]. A semantics for CPM is then expressed by means of a concise set of rather simple rules that are quite readable and, most importantly, directly correspond to executable Prolog clauses. What was not clear to us when we started this work is whether the approach “scales” when applied to languages like C: for example, how can run-time errors and non-structured control flow mechanisms be modeled in this framework? We now know that the natural semantics is fit for the purpose:
Concerning point (ii) above, we can only formally reason about properties if they are observable in the chosen concrete semantics. For example, if we want to prove that a program uses pointer arithmetic in a safe way, we need a concrete semantics that allows us to observe all the unsafe uses. Such a concrete semantics cannot simply model pointers as plain addresses, as more information is required than that to detect the violations. In the concrete semantics for CPM, these violations are optionally reported as run-time exceptions so that, proving that such an exception can never be thrown, amounts to proving the desired property. All exceptional and undefined behaviors (such as divisions by zero, overflows of signed integer variables and dangerous uses of the shift operators) are modeled by exceptions. This, besides the need to deal with user-defined exceptions as found in C++, Java and Python, is the reason for the inclusion in CPM of exception propagation and handling mechanisms. Note however that accommodating exceptions impacts on the specification of the other components of the semantic construction. For example, short-circuit evaluation of Boolean expressions cannot be normalized as proposed in [8], because such a normalization process, by influencing the order of evaluation of subexpressions, is unable to preserve the concrete semantics as far as exceptional computation paths are concerned.
Following the abstract interpretation approach, we also require an abstract semantics that has a correlation with the concrete semantics. In addition, we require that appropriate abstract domains are available that can provide correct approximations for the values and all the operations that are involved in the concrete computation [9, 10, 11, 12]. For CPM we have formally defined in [2] an abstract semantics framework that follows and extends the approach outlined in the works by Schmidt [19, 20, 21]. We have proved that any semantics within this framework will have a safe correlation with the concrete semantics (a summarized version of this proof is available in [2]). Moreover, this framework has been designed to be both modular and generic. It is modular because the overall static analyzer is naturally partitioned into components with clearly identified responsibilities and interfaces, something that greatly simplifies both the proof of correctness and the implementation. It is generic, since it is designed to be completely parametric on the analysis domains. In particular, and here we come to point (iii) above, it provides —differently from all published proposals we know of— full support for relational domains (i.e., abstract domains that can capture the relationships between different data objects). Achieving this goal constrains the design of both the concrete and the abstract semantics. As was the case for the concrete semantics, the abstract semantics rules for CPM are almost directly translated to generic Prolog code that can be interfaced with specialized libraries implementing several abstract domains, including accurate ones such as those provided by the Parma Polyhedra Library [3, 4, 5]. So this working prototype, which is currently being extended with the pointer analysis described in [13, 14, 15], demonstrates that the proposal of Schmidt can play a crucial role in the development of reliable and precise analyzers for real imperative languages including C, Java and, we believe, C++ and RPython (http://pypy.org/).
Although our framework is only fully specified for the core CPM language, and this encompasses C but not C++, we do not have a definite answer concerning the appropriateness of our proposal for the verification of C++ programs. That said, we do not see what, in the current design, would prevent the extension of the core language together with its concrete and abstract semantics so as to handle any other features of mainstream, single-threaded imperative programming languages.
Our proposed analysis framework is parametric on abstract memory structures. While the literature seems to provide all that is necessary to realize very sophisticated ones, we can confidently predict that, among all the code out there waiting to be analyzed, some will greatly exacerbate the complexity/precision trade-off. The ability to analyze C programs will confront us with a huge variety of inputs and it is hardly likely that the same compromises will be able to accommodate programs as diverse as the huge, pointer-free, synthesized loops handled by ASTRÉE3 and, say, libraries for manipulation of strings. However, these are research problems for the future — now we have a formal design on which analyzers can be built, our next goal is to complete the build and make this technology truly available and deployable.
This document was translated from LATEX by HEVEA.