Partenaires et Financeurs
MULTIVAL Progress Report 2007
David Champelovier (software engineer, INRIA/VASY)
Nicolas Coste (PhD student, STMicroelectronics and INRIA/VASY)
Hubert Garavel (group leader, INRIA/VASY)
Yves Guerte (software engineer, INRIA/VASY)
Claude Helmstetter (postdoctoral researcher, INRIA/VASY)
Rémi Hérilier (software engineer, INRIA/VASY)
Holger Hermanns (research scientist, INRIA/VASY)
Etienne Lantreibecq (engineer, STMicroelectronics)
Romain Lacroix (software engineer, INRIA/VASY)
Frédéric Lang (research scientist, INRIA/VASY)
Radu Mateescu (research scientist, INRIA/VASY)
Olivier Ponsini (postdoctoral researcher, INRIA/VASY)
Wendelin Serwe (research scientist, INRIA/VASY)
Damien Thivolle (software engineer, INRIA/VASY)
Marie Vidal (software engineer, INRIA/VASY)
Meriem Zidouni (PhD student, Bull and INRIA/VASY)
in close collaboration with
Jack Abily (engineer, Bull)
François Bertrand (group leader, CEA/Leti)
Ghassan Chehaibar (engineer, Bull)
Sahar Foroutan (PhD student, STMicroelectronics and CEA/Leti)
Richard Hersemeule (group leader, MULTIVAL project leader, STMicroelectronics)
Sylvie Lesmanne (group leader, Bull)
Virang Shah (engineer, CEA/Leti)
Yvain Thonnart (engineer, CEA/Leti)
Together with Bull, we studied the MPI software layer and MPI benchmark applications to be run on FAME2 (Flexible Architecture for Multiple Environments), a CC-NUMA multiprocessor architecture developed at Bull for teraflop mainframes and petaflop computing.
In 2007, we focused on a particular MPI benchmark (the so-called "ping-pong" protocol), our goal being to predict the performance of this benchmark on FAME2 machines, in particular to estimate the latency of send/receive operations in different topologies, different software implementations of the MPI primitives, and different cache coherency protocols. The ping-pong benchmark consists of two parallel processes, which send to each other a data packet k times. Two variants of the benchmark were specified using a combination of LOTOS code (to describe the behavior of processes) and C code (to describe the data structures of memory and caches). Several configurations were studied, for two data packets and k varying between 1 and 10 (which corresponds to a normal termination of the benchmark session) or being infinite (which corresponds to cyclic execution).
For each configuration, several safety and liveness temporal properties were specified in regular alternation-free μ-calculus, and successfully verified on the corresponding graph using EVALUATOR 3.5. Additionally, it was successfully checked using BISIMULATOR that the two variants were branching equivalent to each other (when observing only send/receive actions), and that each configuration with k iterations was smaller (modulo the branching preorder) than the configurations with k+1 iterations and k infinite.
Finally, the LOTOS specification was extended with Markov delays and a performance analysis was carried out using the BCG_MIN, DETERMINATOR and BCG_STEADY tools in order to predict the latency of the send/receive operations and of the various types of transfers between memory and (local or remote) caches. The results obtained by numerical analysis were close enough to the experimental measures to assess the suitability of the model.
We pursued the study (started in 2005) of the FAUST (Flexible Architecture of Unified System for Telecom) NoC (Network on Chip) [BCVCR05] developed by the CEA/Leti laboratory. Our previous work on the first version of FAUST led to a joint INRIA/Leti publication [SSTV07].
In 2007, we focused on the second version of FAUST, a NoC-based platform for wireless telecom applications (4G, MIMO, etc.). Together with CEA/Leti scientists, we studied the communication interconnect part of FAUST, which routes packets (consisting of several 34-bit flits) between the 23 components of the FAUST circuit. At the block level, this interconnect is described in the hardware process calculus CHP (Communicating Hardware Processes) and implemented, at the RTL level, in asynchronous logic. The interconnect has 23 communication nodes, each of which consisting of five input controllers and five output controllers. Each input controller dispatches incoming flits to one out of four output controllers, and each output controller arbitrates between four input controllers.
To generate the state space of the input controller, we first abstracted the CHP specification (500 lines of code) by applying data independence considerations: we replaced the 34-bit data values by 8-bit data values, focusing on the control information. Then, applying our CHP2LOTOS translator to the CHP description of the input controller, we produced a LOTOS specification in less than one second. This led to 15 concurrent LOTOS processes, several of which were so large that their parallel composition could not be computed directly. By replacing "internal" choice in CHP/LOTOS by "external" choice to reflect the actual hardware implementation of FAUST, we further reduced the size of the intermediate state spaces by a factor of 50, making the state space of the entire input controller tractable (less than 3 million states).
To generate the state space of the output controller, we also applied our CHP2LOTOS translator to the corresponding CHP description (400 lines of code). By making the same data abstractions as for the input controller, by applying symmetry reductions (allowing to consider only two out of four input controllers), and by using the compositional state space generation features of CADP, we managed to generate the state space for an output controller (less than 6 million states).
Together with STMicroelectronics, we studied xSTream, a multiprocessor dataflow architecture for high performance embedded multimedia streaming applications. In this architecture, computation nodes (e.g., filters) communicate by xSTream queues connected by a NoC (Network on Chip). An xSTream queue generalizes a bounded FIFO queue in two ways: it provides additional primitives (such as peek to consult items in the middle of the queue, which is not possible with the standard push/pop primitives of FIFO queues), and a backlog (extra memory) to allow the increase of the queue size when the queue overflows.
In 2007, we studied in depth the concurrent behavior of xSTream queues:
We developed several LOTOS models of xSTream queues (in total 6,800 lines of code), differing in whether each xSTream primitive (push, pop, peek, etc.) is modeled by one or two (request/response) LOTOS events.
For several values of n and p, we established (by means of on the fly equivalence checking using the BISIMULATOR tool of CADP) that an xSTream queue of size n with a backlog of size p is branching equivalent to an xSTream queue of size n+p.
We then investigated the interaction of several xSTream queues sharing the NoC. By using BISIMULATOR, we discovered that the state space (1.5 million states and 8 million transitions) for a pair (Q1, Q2) of xSTream queues sharing the NoC with another pair (Q3, Q4) of xSTream queues was not branching equivalent, when hiding the actions of (Q3, Q4), to the state space obtained for (Q1, Q2) alone. Such a possibility of starvation caused by NoC access conflicts revealed that the xSTream credit protocol was not optional (as assumed so far) but mandatory.
We then incorporated the xSTream credit protocol into our LOTOS model and managed to generate the corresponding state space by means of compositional verification techniques (which allowed to reduce the size of the largest intermediate state space from 17 million states and 108 million transitions down to 1.5 million states and 8 million transitions). This time, we managed to prove branching equivalence, but only when the credit value parameter is equal to 1; for values greater than 1, we reported the issue to the xSTream architects, together with a convincing counterexample (a valid xSTream application running into a deadlock when executed on an xSTream simulator generated from our LOTOS model using the EXEC/CÆSAR environment of CADP).
In parallel, we undertook performance evaluation studies to predict latency and throughputs in the communication architecture, as well as occupancy within xSTream queues (with and without backlogs). A key challenge is to combine probabilistic/stochastic information (e.g., the rates at which xSTream applications push and pop elements in and out of the queues) with precise timing information (e.g., memory access time). We investigated different techniques and tools, starting from the Interactive Markov Chain model supported by CADP tools (we used Erlang distributions to approximate precise timing information), but considering also max-plus algebra, discrete event simulation, and timed automata (using the UPPAAL/Cora tools).
The CADP toolbox contains several tools dedicated to the LOTOS language, namely: the CÆSAR.ADT compiler for the data type part of LOTOS, the CÆSAR compiler for the process part of LOTOS, and the CÆSAR.INDENT pretty-printer.
In 2007, we performed maintenance activities for these tools, fixing a few errors. We also modified in many places the C code generated by CÆSAR and CÆSAR.ADT to avoid all warnings emitted by recent C compilers with the strictest code-checking options. For the end-user, this eases the development of LOTOS specifications by guaranteeing that all warnings seen when compiling the C code generated by CÆSAR and CÆSAR.ADT are relevant to the source LOTOS description (e.g., unused parameters of some LOTOS operation) or to the hand-written C code provided by the user (and are not meaningless warnings arising from C code generation artefacts).
We pursued our study of state space reduction techniques, our goal being to decrease the size of the graphs generated by CÆSAR, still preserving strong bisimulation between the original and reduced graphs. We improved the results of our reduction technique based on live variable analysis [GS06] by allowing data-flow optimizations to be repeatedly applied in a hierarchical manner on parts of the LOTOS specification. On 151 out of 684 benchmarks, we observed a reduction in the numbers of states (by a mean factor of 2.49 and a maximum factor of 24.27) and of transitions (by a mean factor of 2.41 and a maximum factor of 23.54).
We continued to enhance our TRAIAN 2.5 compiler, which generates C code from LOTOS NT data type and function definitions. TRAIAN is distributed on the Internet and used intensively within the VASY project-team as a development tool for compiler construction [GLM02].
In 2007, we corrected nine bugs (three bugs already known and six bugs discovered in 2007) and improved the C code generated by TRAIAN in two respects:
It no longer produces warnings when compiled with recent C compilers using the strictest code-checking options. Thus, TRAIAN users will only get warnings relevant to their LOTOS NT code.
It was ported to 64-bit architectures, together with the predefined includes and libraries provided with TRAIAN. We checked that this yields the same results as for 32-bit architectures.
In addition, we fully automated the cross-compilation of TRAIAN for multiple platforms. We also enriched the validation database for TRAIAN with 60 new files, and enhanced the validation scripts to support cross-testing on multiple platforms.
We undertook the development of a new version 3.0 of TRAIAN written entirely in LOTOS NT and no longer relying on the obsolete FNC2 compiler generation tool used for TRAIAN 1.* and 2.*. We completed the lexical and syntactic analysis phases of TRAIAN 3.0 representing 6,900 lines of code (4,500 lines of SYNTAX code, 2,000 lines of LOTOS NT code, and 400 lines of C code).
In the context of the FormalFame Plus contract, we continued to improve the LNT2LOTOS tool suite that translates (a large subset of) LOTOS NT into LOTOS, thus allowing to use CADP to verify LOTOS NT descriptions. Bull is using this tool suite to model and verify critical parts of its FAME2 multiprocessor architecture.
In 2007, we brought the following improvements:
As regards the data part of LOTOS NT, a new type constructor "array [n..m] of T" was added to LOTOS NT at Bull's request and implemented in the tool suite.
As regards the process part of LOTOS NT, we finalized the definition of processes and behaviors, ensuring language symmetry in the sense that most LOTOS NT constructs used in functions are also allowed in processes (absence of such a nice symmetry is a drawback of LOTOS making the language harder to learn). The concrete syntax and static semantics of processes have been specified. The translation of LOTOS NT process definitions is under development.
LNT2LOTOS is developed using the SYNTAX/TRAIAN technology. Currently, it represents 22,300 lines of code (3,700 lines of SYNTAX code, 16,500 lines of LOTOS NT code, and 2,200 lines of C code). The LOTOS NT reference manual was updated and grew from 47 pages (at the end of 2006) to 61 pages.
We continued the study of the process algebra CHP (Communicating Hardware Processes) for which the TIMA laboratory has developed a circuit synthesis tool named TAST and which is used by the CEA/Leti laboratory to describe the FAUST2 architecture. Our goal is to integrate formal verification tools into the design flow of complex asynchronous circuits.
In 2007, we revisited the CHP2LOTOS translator to add support for CHP expressions containing multiple probes of channels (this feature provides for efficient hardware implementation by checking the availability of different partners in handshake communications). To guarantee an "atomic" evaluation of such expressions and ensure mutual exclusion between channels, we designed a locking mechanism that we implemented in the generated LOTOS code. Comparing the new version of the translator to the previous on 310 examples, we found that this increased the state space for 40 examples (by a maximum factor of two), but surprisingly also decreased the state space for 19 examples (by a maximum factor of two). We wrote an overview publication on the CHP to LOTOS translation that was submitted to an international journal.
We investigated the verification of TLM (Transaction Level Model) models. Compared to traditional RTL (Register Transfer Level) models, TLM allows faster simulation, simultaneous development of software and hardware, and earlier hardware/software partitioning. Among all languages supporting TLM, SystemC emerges as an industrial standard. SystemC is a C++ library providing both a high-level description language and a simulation kernel that involves a central (largely deterministic) scheduler ordering the actions of the different processes.
In 2007, we focused on the PV (Programmers View) level of SystemC/TLM, i.e., TLM models without explicit timing information. We worked in two directions:
Our first approach follows the "official" simulation semantics of SystemC/TLM, keeping the scheduler as defined in [IEEE-1666-2005]. Inspired by a translation from SystemC/TLM to Promela [TCMM07], we devised a translation from SystemC/TLM to LOTOS, which we experimented on the parameterized benchmark with n concurrent components given in [TCMM07].
Using the REDUCTOR and BISIMULATOR tools of CADP, we compared the state space generated for the LOTOS description with the set of finite execution traces produced by the RVS tool [Hel07, HMMM06] for SystemC/TLM. Both were found to be trace equivalent, showing that our LOTOS-based approach complies to the SystemC/TLM simulation semantics.
Using the EVALUATOR tool of CADP, we verified the liveness property given in [TCMM07] that explores the complete state space of the parameterized benchmark. We experimented both direct state space generation (for which we obtained results comparable to those reported in [TCMM07]) and compositional state space generation (for which we observed state spaces three times smaller than those reported for SPIN in [TCMM07], which enabled us to handle the case n=19 while SPIN fails at n>15).
Our second approach pursues the work undertaken in 2006 to design a fully asynchronous semantics for SystemC/TLM. In this approach, we get rid of the central scheduler because it implements a global supervisor (that is difficult to achieve in a parallel circuit without sacrificing efficiency) and hampers the effectiveness of formal verification by preventing certain possible execution sequences from being analyzed. To the contrary, our approach explores all possible interleavings arising from concurrent components and fits well with the principles of GALS (Globally Asynchronous Locally Synchronous), NoCs (Networks on Chip), and SoCs (Systems on Chip).
We devised translation rules from SystemC/TLM to LOTOS, which rely on the user to translate C++ types and functions to LOTOS manually and which are parameterized by a locking policy that controls the level of asynchrony between transactions. This work led to a publication in an international conference [PS08].
We experimented our translation on several examples, namely the parameterized benchmark of [TCMM07] and seven examples taken from the SystemC/TLM reference document [RSPF05], and we started investigating a large industrial case study provided by STMicroelectronics.
We compared both approaches (i.e., with and without the scheduler) on the parameterized benchmark of [TCMM07]:
We could show that the state space for the approach with scheduler is included (modulo branching preorder) in the state space for the approach without scheduler.
We could show that the approach without scheduler generalizes the approach with scheduler in the sense that the latter is obtained from the former by choosing a particular locking policy. Remarkably, the approach without scheduler even allowed to handle the case n=19 using either direct state space generation (102 million states and 152 million transitions) or compositional generation (3 million states and 23 million transitions, the largest intermediate state space having 91 million states and 128 million transitions).
In practice, even if the approach without scheduler explores more configurations due to asynchrony, this is compensated in part by our efficient LOTOS encoding. For instance, the state spaces we generated for the approach without scheduler were similar in size to those reported in [TCMM07] with a scheduler-based semantics.
We migrated 32-bit versions of CADP to support recent computing platforms (including Sparc stations running Solaris 10, Intel machines running Linux 2.6, and Intel-based Apple computers running MacOS 10.3 and 10.4) together with the latest C compilers available on these platforms (Gcc 4, Intel Icc 9.0, and Sun Studio 11).
A key objective for the future of CADP is the ability to support the latest 64-bit platforms. This is a heavy task because of the number of tools in CADP, their intrinsic complexity, and their reliance upon third-party software. In the framework of the MULTIVAL project, we made significant steps in this direction:
We selected three 64-bit platforms (Sparc V9 stations running Solaris 10, Intel IA64 machines running Linux, and Intel EMT64/AMD64 machines running Linux) and we constructed cross-compiling environments for these platforms.
In a first step towards 64-bit architectures, we improved the code of all CADP tools so as to suppress any warning message issued by recent C compilers with the strictest code-checking options.
We ported all the 55 shell scripts of CADP to these 64-bit architectures.
We started porting third party software used by CADP. The CUDD, TCL-TK, TIX, and Boehm-Demers garbage collector libraries have been ported successfully to 64-bit architectures. The SPARSE library has been ported partially. The FNC2 compiler generation system will not be ported, since it is no longer maintained and its source code not accessible to us.
Together with Pierre Boullier, Philippe Deschamp, and Benoît Sagot (ATOLL/ALPAGE project-team of INRIA Rocquencourt), we started to port the SYNTAX compiler generation software (more than 300,000 lines of C code) to 64-bit architectures. We first removed all warning messages emitted by recent C compilers. We found and repaired several bugs, some of which were specific to 64-bit architectures. We set up automated procedures to compile SYNTAX on eleven different platforms (processor, operating system, and C compiler) as well as quality tests to be applied at each modification of SYNTAX. The port of SYNTAX is almost complete.
Finally, there have been significant software developements done in MULTIVAL that will not be detailed here, since they are not immediately visible to the industrial users of the CADP tools. More information can be found in the corresponding sections of the VASY project-team activity report for year 2007. We can mention namely:
Enhancements to the CÆSAR_SOLVE library (section 5.1.1)
Development of the BES_SOLVE tool (section 5.1.2)
Enhancements to the EVALUATOR 3.5 and 4.0 tools (section 5.1.3)
Development of the PROJECTOR 3.0 tool (section 5.1.4)