Monday, February 18, 2013

Is your analysis concrete?

In the area of program analysis, it is all-too-easy to marry yourself to a model of computation that does not capture the entire behavior of the language you're analyzing before you even begin. The classical data flow analysis algorithms from the Fortran folks have been mostly successful for all the compiler hackers out there, but computer security has become such a real danger that we can't just use the work of hackers (though their work is appreciated). Before I talk more about program analysis, I want to say a few words on reliable computing.

Reliable computing and programming languages

If we are to be confident about any property of a program that we write, not only should it pass a large test suite, but it should also have that property proven either automatically, by an interactive proof process resulting in a machine-checkable proof, or provided directly by the language it is written in.

Herein lies the problem. Computers speak only one language - machine code. In modern computers, machine code is a binary representation of some low-level imperative language. In just under 100% of the time, programs are not written in machine code, so we employ compilers - program translators - to change a nicer language to machine code. Sometimes we even chain these compilers together so that there is a higher level intermediate code that even higher level languages can target, so that different kinds of translation technology can be used at the different levels of abstraction.

How, then, do we know properties hold of programs written in high level languages that promise to maintain certain abstractions? Can programs written in other languages break these properties? Many engineers understand languages by their implementation, but not the intended semantics of the language. After all, a language only means something if it executes machine code, right? Well, not really. This line of thinking becomes far too intractable as high level language implementors take more liberties with their implementation strategy. Instead, if we assume a mathematically ideal computational model for the machine code (not really a problem unless you're in outer space or working on an airplane for over 12 hours), then we can employ powerful proof techniques to prove that such translations not only preserve properties, but that programs in the target language can't break them. These techniques require a the model (semantics) for the machine code AND a semantics for the high level language. What is a semantics if not an implementation?

A programming language semantics is a mathematical model of the meaning of a program in that language. Yes, an imperative implementation could be a semantics, but it's often too large and complex of a mathematical object to reason about. Instead, we have different styles for writing programming language semantics (c.f. the Redex book) that are much more amenable to standard proof techniques like induction and subject reduction. I am a personal proponent of languages maintaining a formal semantics in one of these styles. The easiest to learn and possibly the most successful is small-step operational semantics.

Analysis of a language, not a graph

The standard model of data flow analysis is of graphs; basic blocks form nodes and control flow forms edges between nodes. This form of analysis can be proven correct with respect to a concrete semantics, but the path is cumbersome. Furthermore, not all languages have the standard imperative block structure. There is another way to describe a program analysis and have its correctness as an immediate result of the way it's written. This is in the form of Abstracting Abstract Machines (AAM), which shows a straightforward way to phrase your programming language semantics such that abstraction to a finite automaton (and thus computably explored) is dead simple.

This technique does not mark the end of research in analysis. Instead it opens possibilities and offers a new perspective: your analysis techniques should be built into your programming language semantics. That isn't to say your go-to model should be the translated form that makes abstraction simple. No, this means that you should have a repeatable, straightforward process for translating a run-of-the-mill concrete semantics into a ready-for-abstraction concrete semantics. It is illuminating to see what your analysis machinery really means in terms of concrete execution, and can ease your explanation of the idea to someone else; we humans are good with thinking concretely, while abstraction makes it harder.

I more recently developed a simpler translation path from semantics to a pushdown abstraction (rather than a finite one), although it assumes more of the programming model - that is, no more than a bounded amount of the continuation can be captured at once. Languages with full continuations can enjoy enhanced precision of the method with some extra model-specific work, but the pushdown guarantee flies out the window. Though the guarantee is theoretically broken, the use of heavy control operators tends to be controlled enough (hah) to not cause too much leaked precision.

The translation amounts to saving the state of the machine right after control transfers to a function as a key into a table of "local continuations" that also gets updated on function call to the continuation up to the entry of the function. This "up to" notion is easily captured by replacing the continuation on function call with the key into the continuation table. When it comes time to interpret that key, you just continue to all local continuations that were stored at that key. This non-determinism can lead to loops where previously there weren't (you can always choose the older of return continuations), but reachability of states is maintained with the original semantics. Abstraction follows the same strategy of AAM for finitizing the address space, but continuations no longer need to be store-allocated. This presentation immensely helped people in my department understand the work of my recently-graduated colleague, Dimitrios Vardoulakis, on CFA2, the first pushdown abstraction for a higher-order language. One element that was missing from his work was a non-standard concrete semantics, the term given to concrete semantics augmented with different/extra information to compute the answer of the analysis, rather than the answer of evaluation. Indeed, this idea is not new. Rather, the push I and my colleagues are making is to make the translation from standard to non-standard concrete semantics a straightforward and intuitive process. Secondly, all the trickery of the analysis should have a concrete counterpart. Once you abstract, there should not be anything surprising. The full generality of the abstract interpretation framework is something to appreciate, but we must also respect ourselves and our audience by providing an understandable and accessible technical artifact.


elgreco said...

CFA2 does have a non-standard concrete semantics (Fig 2.4b, pg 11 of my diss. It was also in the ESOP paper). It is the same non-standard semantics Matt Might used in his dissertation. This semantics is easy to abstract to a finite-state analysis like kCFA. It is trickier to go to the abstract semantics of CFA2; that abstraction is not straightforward.

This choice was intentional. I did not want to have a concrete semantics with a stack because I preferred to start from sth that has already been used by others. (I generally try to reuse as much prior work as possible; this makes it easier to show the delta of my work.) When you create a new concrete semantics, sm can come and ask "but how does this correspond to the other semantics people usually use," so then you have to go and do that proof.

J. Ian Johnson said...

I can understand that line of reasoning, but the correspondence in this case is not a hard one to show (reachability is preserved), and it gives people an operational understanding of the analysis' mechanics. The difficult proof shifts from the analysis' soundness (something that you put a lot of work into) to a less difficult proof of this correspondence. I know because I went through both kinds of proofs. It is because of this journey that I encourage people to put their cleverness into a concrete semantics, prove a correspondence, and then pointwise abstract. It helps intuition and keeps you honest about the workings of your algorithm; you can just run the concrete semantics on your test programs and not have results skewed by introduced approximations.