Partenaires et Financeurs
MULTIVAL Progress Report 2009
Nicolas Coste (PhD student, STMicroelectronics and INRIA/VASY)
Hubert Garavel (group leader, INRIA/VASY)
Yves Guerte (software engineer, INRIA/VASY)
Rémi Hérilier (software engineer, INRIA/VASY)
Holger Hermanns (research scientist, INRIA/VASY)
Frédéric Lang (research scientist, INRIA/VASY)
Etienne Lantreibecq (engineer, STMicroelectronics)
Radu Mateescu (research scientist, INRIA/VASY)
Louis Paternault (software engineer, INRIA/VASY)
Wendelin Serwe (research scientist, INRIA/VASY)
Meriem Zidouni (PhD student, Bull and INRIA/VASY)
in close collaboration with
François Bertrand (group leader, CEA/Leti)
Richard Hersemeule (group leader, MULTIVAL project leader, STMicroelectronics)
Virang Shah (engineer, CEA/Leti)
Yvain Thonnart (engineer, CEA/Leti)
Together with Bull, we studied the MPI software layer and MPI benchmark applications to be run on FAME2 (Flexible Architecture for Multiple Environments), a CC-NUMA multiprocessor architecture developed at Bull for teraflop mainframes and petaflop computing.
In 2009, we pursued the study of the "barrier" primitive of MPI, which allows several parallel processes to synchronize (each process arriving at the barrier waits for all the others to arrive) before continuing their execution. The latency of the barrier primitive corresponds to the (average) time taken by a process to traverse the barrier, i.e., the time between the moment when it arrives and when it leaves the barrier. Our goal was to estimate this latency on different topologies, different software implementations of the MPI primitives, and different cache coherency protocols. We specified, using the LOTOS and C languages, five protocols implementing the barrier primitive (centralized, combining, tournament, dissemination, and tree-based), which differ by the shared data structures and the synchronizations used. We extended the LOTOS specifications of these five protocols with Markov delays corresponding to the read/write and fetch-and-decrement operations performed by these protocols.
In order to fight state explosion, we undertook the analysis of the Markov chains underlying these protocols by means of steady-state simulation, carried out on the fly using the newly developed CUNCTATOR tool. CUNCTATOR enables to terminate a simulation when a certain virtual time is reached and does not attempt to estimate by itself the accuracy of the simulation results (this estimation may depend on the type of application under study). To analyze the barrier protocols, we considered two widely-used estimation methods [Jain-91] that enabled us to decide whether the results of a simulation are sufficiently close to their real values: the confidence interval method, which involves launching increasingly longer simulations, and the concurrent estimation method, which involves launching in parallel several simulations with different random parameters. We automated these two convergence methods by means of shell scripts and applied them to the five barrier protocols.
Using the confidence interval method, which relies upon the save/restore mechanism implemented by the CUNCTATOR tool, we were able to compute the latencies of barrier traversals for all the five protocols, for configurations containing up to four processes. For the centralized and tree-based protocols (which could also be analyzed by completely generating their Markov chains and applying the BCG_STEADY tool), the throughputs computed by simulation were close (less than 5%) to those computed by BCG_STEADY. These results suggest that the latencies computed by simulation for the three other protocols with four processes (which are too complex to be analyzed by generating their Markov chains explicitly) are also close to their real values.
We pursued the study (started in 2005) of the FAUST (Flexible Architecture of Unified System for Telecom) NoC (Network on Chip) [BCVCR05] developed by the CEA/Leti laboratory. Since 2007, we are focusing on the latest version, named Magali, of this architecture.
Together with Leti scientists (Francois Bertrand, Virang Shah, and Yvain Thonnart), we studied the communication interconnect, which routes packets (consisting of several 34-bit flits) between the 23 components of the circuit. At the block level, this interconnect is described in the hardware process calculus CHP (Communicating Hardware Processes) and implemented, at the RTL level, in asynchronous logic. The interconnect has 23 communication nodes, each of which consists of five input controllers and five output controllers. Each input controller dispatches incoming flits to one out of four output controllers, and each output controller arbitrates between four input controllers.
In 2009, Leti scientists (Virang Shah and Yvain Thonnart) developed a set of tools for the verification of netlists for asynchronous circuits. These tools automatically translate netlists and standard cells of a commercial cell library commercialized by STMicroelectronics into LOTOS. These tools were applied, in conjunction with CADP, for analyzing a speed-optimized version of a component of the communication interconnect. Using the EVALUATOR model checker, a potential deadlock was exhibited, which had not been found using simulation. Because the deadlock only occurs under particular assumptions on differences of delay, Leti scientists are currently investigating the probability that these assumptions are satisfied in a real chip, to decide whether the optimized component should be integrated in the design.
Together with STMicroelectronics, we studied xSTream, a multiprocessor dataflow architecture for high performance embedded multimedia streaming applications. In this architecture, computation nodes (e.g., filters) communicate by xSTream queues connected by a NoC (Network on Chip). An xSTream queue generalizes a bounded FIFO queue in two ways: it provides additional primitives (such as peek to consult items in the middle of the queue, which is not possible with the standard push/pop primitives of FIFO queues), and a backlog (extra memory) to allow the increase of the queue size when the queue overflows.
In 2009, the verification of the xSTream architecture benefited from the latest enhancements brought to the CADP toolbox. For instance, the enhancements of the OPEN/CÆSAR table significantly decreased the state space generation time. Also, the extended capabilities offered by the enhanced OPEN/CÆSAR table and by version 1.1 of the BCG format enabled the generation of very large transition systems for the xSTream architecture (up to 15 billion transitions).
We also continued performance evaluation studies to predict latency and throughput of communication between xSTream queues. We consolidated our IPC (Interactive Probabilistic Chains) approach undertaken in 2008. This approach is inspired by Interactive Markov Chains [Her02], but uses probabilities instead of stochastic distributions and a central clock governing all delays.
The semantic definition of latency was generalized, the IPC tool chain has been enhanced, and a conference paper was published [CHLS09], which presents the IPC approach and its application to xSTream.
We continued to study whether the CADP tools could enrich with formal verification capabilities the current design flow of STMicroelectronics, which is based on SystemC/TLM (see section 7b below). With this aim, STMicroelectronics provided us with the Blitter Display (BDisp for short), a 2D-graphics co-processor implementing Blit (Block Image Transfer) and numerous graphical operators, e.g., rotations, alpha blending, or Blue Ray disc decoding. The BDisp is software-controlled through instructions written in nodes of up to four application queues and two real-time composition queues. It is described by more than 25,000 lines of SystemC/TLM code.
In 2009, we continued the verification of the BDisp following both approaches mentioned in section section 7b below:
We improved the translation scheme from TLM to LOTOS, optimized the hand-written LOTOS code (reducing the size of a state by a factor of 10), and ported the model to 64-bit architectures (so as to be able to use more than 4 Gb of memory). Although these improvements did still not allow to generate the complete state space, we were able to generate bigger graphs and to model check more correctness properties using the EVALUATOR tool than in 2008.
We discovered a synchronization issue that could be reproduced using an interactive SystemC scheduler on the untimed version of the TLM model (i.e., replacing delayed notifications by immediate notifications), but not on the timed version (nor on the actual circuit itself). We proposed a modification of the TLM model so that the property holds also in an untimed context. This work led to a publication in an international conference [GHPS09].
We developed a new TLM model of only the control part of the BDisp based on the specification documents. Experimenting with the TLM.OPEN tool on this new TLM model, we were able to generate a complete state space of less than 10,000 states, without any translation into LOTOS. These results have been presented to STMicroelectronics, which showed interest in experimenting with TLM.OPEN on a new case study.
The CADP toolbox contains several tools dedicated to the LOTOS language, namely the CÆSAR.ADT compiler for the data type part of LOTOS, the CÆSAR compiler for the process part of LOTOS, and the CÆSAR.INDENT pretty-printer.
In 2009, in addition to fixing five bugs in the CÆSAR and CÆSAR.ADT compilers, we improved the LOTOS dedicated tools of CADP as follows:
On 64-bit architectures, the maximum number of states that CÆSAR can generate directly and store has been increased from 232 to 234.
Similarly to the introduction of dynamic resizing of hash tables in OPEN/CÆSAR's tables, the state table used by CÆSAR during graph generation has also been made extensible.
CÆSAR now gathers information about the current machine architecture and the amount of physical memory available. With this information, CÆSAR now generates simulator programs that are optimized to store as many states as possible on the machine.
The hash function used by CÆSAR has been improved. This leads to faster computation and better dispersion.
We continued to enhance our TRAIAN compiler, which generates C code from LOTOS NT data type and function definitions. TRAIAN is distributed on the Internet and used intensively within the VASY project-team as a development tool for compiler construction [GLM02].
In 2009, we corrected four bugs and enhanced Traian in several respects: we changed the syntax of the nondeterministic choice statement to make it compatible with the syntax accepted by the LNT2LOTOS translator, we made it easier to define the range of natural and integer values in LOTOS NT descriptions, and we adapted the C code generated for tester and selector operations to make it compatible with the libraries already present in CADP.
We published in an international conference a paper [SLG09] about the compilation of temporal extensions and the new intermediate form ATLANTIF [Sto09], into which languages combining data, concurrency, and real-time (such as E-LOTOS and LOTOS NT) could be translated efficiently. ATLANTIF is based upon the NTIF model [GL02] for sequential processes with data, which we extended in several ways to support concurrency and real-time.
We continued to improve the LNT2LOTOS tool suite that translates (a large subset of) LOTOS NT into LOTOS, thus allowing to use CADP to verify LOTOS NT descriptions. This tool suite is being used by Bull to model and verify critical parts of its FAME2 multiprocessor architecture; it was also used to verify protocol specifications provided by AIRBUS.
In 2009, the LNT2LOTOS tool suite was enhanced significantly. Four successive versions were released, which corrected several bugs and brought many improvements, with the objective to prepare a future integration of this tool suite in CADP.
As regards the data type part of LOTOS NT, we added support for floating-point numbers and for inclusion of modules; we enriched the set of predefined operations for boolean, character and string types; we compiled constructor operations for the integer type efficiently; we extended the syntax of hexadecimal and character constants; and we enhanced the "case" statements to handle constants and free variables.
As regards the process part of LOTOS NT, we improved the translation of parallel composition of processes to support the "disrupt" and "break" statements inside parallel compositions; we simplified the syntax of "loop" statements and the code generated for the "case" and "raise" statements; we extended the syntax of patterns with expressions containing non-constructor operations; we simplified the LOTOS code generated for exit handling; we added generalized support for constructor operations; and we improved the translation of nested compound and "if-then-else" statements in order to reduce the computation effort and the size of the generated LOTOS code.
New options were added to select a specific LOTOS NT process as being the main process of the resulting LOTOS specification. The management of intermediate files was simplified, and error handling was made compatible with the CÆSAR and CÆSAR.ADT compilers for LOTOS.
A new OPEN/CÆSAR-compliant compiler named LNT.OPEN was introduced to explore the state spaces of LOTOS NT descriptions on the fly.
LNT2LOTOS is developed using the SYNTAX/TRAIAN technology. Currently, it represents 35,900 lines of code (4,000 lines of SYNTAX code, 29,000 lines of LOTOS NT code, and 2,900 lines of C code). The LOTOS NT reference manual was updated and grew from 71 pages (at the end of 2008) to 101 pages. The non-regression database was enriched with about 1,160 new LOTOS NT programs.
We investigated the verification of TLM (Transaction Level Model) models. Compared to traditional RTL (Register Transfer Level) models, TLM allows faster simulation, simultaneous development of software and hardware, and earlier hardware/software partitioning. Among all languages supporting TLM, SystemC emerges as an industrial standard. SystemC is a C++ library providing both a high-level description language and a simulation kernel that involves a central (largely deterministic) scheduler ordering the actions of the different processes.
Our prior work led to two approaches for translating the PV (Programmers View) level of SystemC/TLM (i.e., TLM models without explicit timing information) into LOTOS:
Our first approach [HP08] follows the "official" simulation semantics of SystemC, keeping the scheduler as defined in [IEEE05].
Our second approach [PS08] proposes a fully asynchronous semantics for TLM, getting rid of the central scheduler that is difficult to implement in parallel circuits and prevents certain execution sequences from being analyzed during formal verification.
We also developed a new approach that directly connects TLM to the OPEN/CÆSAR environment, avoiding any intermediate translation into LOTOS. This led to a new OPEN/CÆSAR-compliant compiler, named TLM.OPEN, which reuses the G++ compiler and the OSCI SystemC and TLM libraries. The numerous tools of CADP can be used to verify complex properties and check the equivalence of two TLM programs. The advantages of the new approach are the following:
TLM.OPEN is much simpler: engineers do not need to learn LOTOS and they have fewer lines of code to write (typically, additional functions for storing and restoring the current state of each TLM module, i.e., generally less than 20% of the size of the TLM model).
TLM.OPEN allows the verification of larger TLM models. In [HP08], we verified a benchmark containing m interrupt transmitter modules up to m = 19; TLM.OPEN can handle this benchmark up to m = 22 on the same machine. In [Moy05], another benchmark containing n transmitters is described and verified up to n = 13 using the LUSSY [MMC05] and SMV [M99] tools; TLM.OPEN can handle this benchmark up to n = 18 using the same amount of memory.
A key objective for the future of CADP is to maintain our ability to support the latest 64-bit platforms with the high level of quality expected by a growing user base on industry and academia. This is a heavy task because of the number of tools in CADP, their intrinsic complexity, and their reliance upon third-party software. In 2009, we continued our efforts in this direction:
We added support for two recent operating systems: Windows 7 and MacOS X 10.6 ("Snow Leopard").
We enhanced all CADP tool to handle large files, beyond the limit of 2 or 4 Gbytes.
We completed our collaboration with Pierre Boullier, Philippe Deschamp, and Benoît Sagot (Alpage project team, INRIA Rocquencourt) to port the SYNTAX compiler generation software (more than 300,000 lines of C code) to 12 different platforms (processor, operating system, and C compiler), and especially to 64-bit architectures. Together, we solved the few remaining bugs in the WiNDOWS version of SYNTAX and we enhanced the tools to support large files.
We continued building a comprehensive validation framework, based on non-regression testing and semantical checking for the CADP tools. This framework allows functional testing of individual tools as well as integration testing for several CADP tools used all together to perform complex verification scenarios on various computing platforms and using various compilers.
We continued gathering large collections of benchmarks (BCG graphs, boolean equation systems, mu-calculus formulas, etc.) for testing the CADP tools extensively.
For the development of CADP, we continued experimenting the Pipol platform of INRIA, which provides reservation, configuration, and deployment facilities for porting and testing applications across various hardware and software environments.
Finally, there have been significant software developments done in MULTIVAL that will not be detailed here, since they are not immediately visible to the industrial users of the CADP tools. More information can be found in the corresponding sections of the VASY project team activity report for 2009. Notable changes include:
Enhancements to the BCG format and libraries (section 5.1.1)
Enhancements to the OPEN/CÆSAR libraries (section 5.1.2)
Enhancements to the XTL tools (section 5.1.3)
Enhancements to the performance evaluation tools, including the addition of the new CUNCTATOR tool (section 5.1.5)