In 96/97, the Pregnana R&D center and the Bull-Inria GIE DYADE collaborated in applying formal methods to the Polykid architecture design process. This work was carried out by the DYADE action VASY (VAlidation of SYstems). The aim of this action is to experiment with and evaluate the use of FM (Formal Methods) in large projects, with a view to transferring the technology into Bull.
In order to assess this experiment, this note reports the results obtained and the status as of mid-October 97. Some conclusions are drawn for inserting FM into future projects.
FM have already been used at Bull in hardware design to prove that the logical behaviour of a chip is equivalent to its physical layout. Within the Polykid project, FM are used at a completely different level: to find issues in the specification of the overall system design (architecture and protocol specification), starting from the earliest phases of the development process.
"FM refer to the use of techniques from formal logic and discrete mathematics in the specification, design and construction of computer systems and software. [...] This is analogous to the role of mathematics in all other engineering disciplines: mathematics provides ways of modeling and predicting the behavior of systems through calculations." (quoted from ). For more information about FM, see the reference documents and the web sites listed below. Let us only point out that we use the FM technique called state enumeration, or model checking, which is based on automata generation (another technique is based on logic provers).
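The state-enumeration idea can be sketched in a few lines: generate every state the model can reach and check a property on each one. The Python sketch below uses a hypothetical two-cache toy system (each cache Invalid, Shared or Modified), chosen only for illustration; it is not the Polykid protocol and not the CADP tooling:

```python
from collections import deque

def check_invariant(initial, successors, invariant):
    """Exhaustive state enumeration ("model checking"): visit every
    reachable state breadth-first and test the invariant on each one.
    Returns (holds_everywhere, number_of_reachable_states)."""
    seen, queue = {initial}, deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            return False, len(seen)
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True, len(seen)

# Toy system (NOT the Polykid rules): two caches, states I/S/M.
def successors(state):
    for i in (0, 1):
        if state[1 - i] != "M":      # read: allowed unless peer is Modified
            yield tuple("S" if j == i else state[j] for j in (0, 1))
        # write: this cache becomes Modified, the peer is invalidated
        yield ("M", "I") if i == 0 else ("I", "M")

# Coherence invariant: at most one Modified copy, and never M beside S.
no_paradox = lambda s: s.count("M") <= 1 and not ("M" in s and "S" in s)
holds, n = check_invariant(("I", "I"), successors, no_paradox)
```

On this toy model the invariant holds on all six reachable states; the real difficulty, as the rest of this note shows, is that realistic protocols generate state spaces orders of magnitude larger.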
This note is organized as follows:
The application of FM (formal methods) to the verification of the cache coherence protocol of Polykid (Bull's CC-NUMA multiprocessor project) raised 20 issues throughout the development of the protocol specification (8 of which were behavioral errors) and cost about 10 man-months. Some of these errors are very hard to reproduce through tests and simulations. In the context of such complex systems, FM proved to be a protocol debugging aid and helped improve product quality.
We used the CADP toolbox (provided by INRIA Rhône-Alpes and Verimag) to compile, simulate and verify formal specifications of the Polykid cache coherence protocol. These were small configurations; we still have to work on tackling the formal specification and verification of larger configurations.
Work is also in progress to apply formal-methods techniques to the automatic generation of test cases, using the TGV tool (provided by INRIA-Rennes and Verimag); it will be applied in the Polykid test environment.
The experiment started at the beginning of Polykid specification in February 1996. The goals were:
We can distinguish three phases in the application of formal methods to Polykid, depending on the maturity of the architecture engineering product specification.
| Specification status | Work done | Issues raised |
|---|---|---|
| Specification under construction (cache coherence protocol) | 55 questions raised to Polykid architects; 2 formal specifications (4000 and 2000 lines) | |
| Draft versions of EPS* rev AC | 1.5 months; 20 questions raised to Polykid architects | 7 issues raised |
| EPS* rev AE to AF | 2 months; 10 questions raised to Polykid architects; a formal specification of 2000 lines (2 versions developed); 3 properties verified (on a small configuration) | 13 issues raised |

*EPS = Engineering Product Specification
Phase 1. (February to July 96) The Polykid cache coherence protocol was under construction. In order to build a formal specification, the first task was to interact with the architects and designers to get the information and to keep current with all the changes (a moving target).
Obviously, in this phase, it is difficult to speak of errors because the features were not yet frozen: the main benefit was to clarify undocumented concerns and to point out potential issues. All the more so since the most delicate points (e.g., the collision rules) were not yet specified.
About fifty-five questions were put to the Polykid hardware architects (40 oral and 15 written) over six months. Two formal LOTOS specifications were produced. The first (4000 LOTOS lines) was very detailed (although incomplete) and thus too large to be compiled by the LOTOS compiler. The second (2000 lines) focused on the cache coherence rules (without the collision rules). It was possible to execute this model (user-driven simulation), but the description was still too detailed and not decomposable, so formal verification was not feasible.
Phase 2. (September to December 96) From EPS rev.AB to AC. The cache coherence protocol was now complete in draft versions of EPS rev.AC. About twenty questions were asked of the architects to understand the latest changes and to get additional information. Seven issues were raised, categorized as follows:
Formal specification and verification activity was reduced from October 96 and then completely suspended from January 97 until June 97, in order to develop the RCC software emulator (TSP-system).
Phase 3. (July 97 to Sep 97) From EPS rev.AE to AF. The specification was now mature and fairly stable. In order to build a new formal specification (small and decomposable), a new thorough analysis of the EPS was undertaken, raising about ten questions to the architects. This led to 13 issues being raised, categorized as follows:
The Polykid hardware architects proposed solutions to these errors. Five out of seven errors could occur in a two-module configuration without the processor-cache level. So we built a new 2000-line LOTOS specification describing such a configuration with the corrected rules. Actually, two versions were built:
These models were small enough, and written in a decomposable way, that generating the corresponding automata and performing verification were possible. It was then formally verified that, in this downsized and simplified configuration, three properties hold:
To make sure that the models were accurate enough, we introduced into each version one of the errors found during modeling: the tools detected the errors and produced the corresponding diagnostic sequences.
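This error-seeding step can be illustrated with a small sketch: deliberately re-introduce a known bug and check that exhaustive enumeration produces a shortest diagnostic sequence leading to the incoherent state. The seeded bug below (a write that fails to invalidate the peer cache's copy) and the toy two-cache model are hypothetical illustrations, not the Polykid rules:

```python
from collections import deque

def find_violation(initial, transitions, invariant):
    """Breadth-first search; on the first state violating the invariant,
    rebuild the shortest path back to the initial state (the diagnostic
    sequence). Returns None when the invariant holds everywhere."""
    parent, queue = {initial: None}, deque([initial])
    while queue:
        s = queue.popleft()
        if not invariant(s):
            trace = []
            while s is not None:
                trace.append(s)
                s = parent[s]
            return list(reversed(trace))
        for t in transitions(s):
            if t not in parent:
                parent[t] = s
                queue.append(t)
    return None

def buggy_transitions(state):  # two I/S/M caches, illustrative rules
    for i in (0, 1):
        if state[1 - i] != "M":              # read: copy becomes Shared
            yield tuple("S" if j == i else state[j] for j in (0, 1))
        bad = list(state)
        bad[i] = "M"                         # seeded bug: write to Modified
        yield tuple(bad)                     # ...WITHOUT invalidating peer

coherent = lambda s: s.count("M") <= 1 and not ("M" in s and "S" in s)
diag = find_violation(("I", "I"), buggy_transitions, coherent)
```

On this seeded model the search returns a short sequence ending in a state where one cache holds a Shared copy while the other holds a Modified one, which is exactly the kind of diagnostic sequence the verification tools produced.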
This result increases the confidence in the corrected version of the design, but it should be stressed that it is in no way a formal proof of total correctness because:
Categorization of issues. We only considered issues that were unclear to a reader familiar with the protocol; obvious omissions and typing errors were not taken into account:
Learning and reuse factor. As the formal specification progressed, we improved our performance, having gained experience in:
As an example of such constraints: the LOTOS compiler performs better with fewer processes and fewer gates (a LOTOS event gate is an interaction point) and more functions, whereas the decomposition tools rely on the structuring into processes.
Some parts of the first-phase specification were reused in the third phase. We also took advantage of a previous experiment in 1995, in which we applied FM to the PowerScale architecture and so became familiar with the PowerPC architecture.
The features of the formal specifications were as follows:
| Model features | LOTOS spec. features | Automata generation |
|---|---|---|
| 2 processors per module with caches; remote cache level; 12 memory addresses | 52 event gates | Too large to be compilable |
| 3 modules (one home); 2 processors per module with caches; remote cache level; 1 memory address | IN_Q and OUT_Q collapsed; 11 event gates | Simulation possible, but the automaton could not be generated |
| 2 modules (one home); 1 cacheless processor per module; remote (node) cache level; 2 memory addresses (colliding in the module cache) | Distinct OUT_Q and IN_Q (1 entry each); 11 event gates | With 4 bus operations: 59 379 states, 216 539 transitions, generated in 45 mn (on an Ultra-Sparc 144 MHz with 256 MB); automaton size with 9 bus operations: |
To ascertain whether the money spent on applying formal methods to Polykid was well invested, we would have to compare the total cost (over the whole life-cycle) of the product without the use of FM and the total cost when FM are used. Equivalently, we have to compare:
To estimate the benefits, we need to measure:
However, these data are not available because:
Another rigorous way of performing a quantitative evaluation of FM benefits is to collect data from projects using FM and compare them with data collected from projects which do not use FM (the projects should deal with the same application domain and be staffed with equally skilled teams).
Since quantitative data are lacking, we may draw the following qualitative evaluation, showing the additional benefits of FM with respect to other activities:
Finally, the use of FM is a strong commitment to product quality: it has the potential to improve overall system quality by clarifying specifications, to find issues in the early phases of product development (thus reducing the risk of delays), and to increase confidence in the design.
VASY is also investigating the use of FM in test generation. This consists in generating test cases from a formal specification and formal test purposes; the generated test cases are then used to check that the implementation conforms to the formal specification.
The benefit of this approach is that test verifiers need not be written by hand: the automatically produced test case sends inputs to the implementation and verifies its outputs. This approach has already been applied successfully to telecom software (at INRIA-Rennes) using the TGV tool. The challenge for VASY is to use this tool to automatically generate test suites for the Polykid SIM1 test environment.
Because of staffing problems, this activity has been delayed. Work is in progress in three directions:
Nevertheless, the TGV tool has already been adapted to accept LOTOS formal specifications (this was not the case at the beginning of the action), and some preliminary experiments have been done on the first compilable LOTOS specification of Polykid to produce abstract test cases.
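The principle of deriving a test case from a specification automaton and a test purpose can be sketched as follows. The automaton, the action names and the plain breadth-first search below are illustrative assumptions, not TGV's actual algorithm (which works on the product of the specification and the test purpose):

```python
from collections import deque

# Hypothetical specification automaton: state -> [(action, next_state)].
# "?" marks inputs sent to the implementation, "!" its expected outputs.
SPEC = {
    "idle":    [("?read_req", "pending")],
    "pending": [("!data_rsp", "idle"), ("!retry", "idle")],
}

def generate_test(spec, start, target_action):
    """Search the specification for a shortest run reaching the action
    named by the test purpose; the run's labels form the test case."""
    queue, visited = deque([(start, [])]), {start}
    while queue:
        state, path = queue.popleft()
        for action, nxt in spec[state]:
            if action == target_action:
                return path + [action]   # stimuli + expected observations
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [action]))
    return None   # the purpose is unreachable in the specification

# Test purpose: "observe a data response after a read request".
test_case = generate_test(SPEC, "idle", "!data_rsp")
```

The resulting label sequence plays the role of the test case: the tester sends the "?" actions to the implementation and checks that the "!" actions are actually observed.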
In order to insert the use of FM in the standard methodology, the following points stem from this experiment:
The formal specification language used is LOTOS, an ISO standard. It allows algebraic type definitions and the structuring of systems into parallel processes (with rendez-vous synchronizations). The tools provided by INRIA are CADP and TGV (currently only available on Solaris systems):
CADP (Caesar Aldebaran Development Package): a toolbox developed by INRIA Rhône-Alpes and Verimag. The functions used in the Polykid project are:
In 1997, the graphical user interface was enhanced and the LOTOS compiler was optimized to run faster; new compositional verification tools were added to the toolbox. Work is in progress to improve the automata generation algorithm so that it generates fewer redundant states.
TGV (Test Generation with Verification technology). This tool is developed by IRISA (INRIA Rennes) and Verimag. From a formal LOTOS specification of Polykid and a formal test purpose (in an automaton format), TGV produces a test case as an automaton representing test sequences. During the last year, the algorithms and interfaces of TGV were improved to take into account the specific features of the LOTOS language and to fit as closely as possible the practices of hardware testing. Work is in progress to develop a tool which transforms this test case into a program to be used in the SIM1 Polykid test environment: this program will send inputs to the implementation, receive its outputs, and verify them with respect to the test purpose.
It seems that our most serious competitor in formal specification and verification of hardware protocols, and especially cache coherence protocols, is the Hardware Verification Group of Stanford University. They have designed their own formal specification language and a formal verifier called Murphi (the "phi" is the Greek letter).
This tool has been used by Sun within two projects:
We have more details about the applications done by the Stanford team, since these are included in their tool distribution. Due to several state reduction techniques implemented in Murphi, they are able to handle fairly detailed models. The interesting case studies are:
However, with Murphi it is not possible to run a user-driven graphical simulation (useful for debugging the specification), nor to verify liveness properties (such as: a request will eventually get an answer); only safety properties (such as: no coherence paradox) can be checked.
In the Polykid experiment, we obtained a good result by modeling two addresses in our system that map to the same slot of the memory directory and of the module (or remote) cache: this allowed us to take into account the conflicts that were at the origin of the trickiest errors. All the experiments above modeled only one address, which amounts to assuming a fully mapped directory and a remote cache per node as big as the whole memory. But we definitely need to improve our performance and handle larger models, which are likely to contain problems, by working in two directions: