[
Accueil
]
|
Actualités Présentation Partenaires et Financeurs Offres d'emplois Liens Intranet Contacts |
MULTIVAL Progress Report 2008
|
David Champelovier (software engineer, INRIA/VASY) Nicolas Coste (PhD student, STMicroelectronics and INRIA/VASY) Hubert Garavel (group leader, INRIA/VASY) Yves Guerte (software engineer, INRIA/VASY) Rémi Hérilier (software engineer, INRIA/VASY) Holger Hermanns (research scientist, INRIA/VASY) Romain Lacroix (software engineer, INRIA/VASY) Frédéric Lang (research scientist, INRIA/VASY) Etienne Lantreibecq (engineer, STMicroelectronics) Radu Mateescu (research scientist, INRIA/VASY) Louis Paternault (software engineer, INRIA/VASY) Olivier Ponsini (postdoctoral researcher, INRIA/VASY) Wendelin Serwe (research scientist, INRIA/VASY) Meriem Zidouni (PhD student, Bull and INRIA/VASY) |
in close collaboration with
Jack Abily (engineer, Bull) François Bertrand (group leader, CEA/Leti) Ghassan Chehaibar (engineer, Bull) Sahar Foroutan (PhD student, STMicroelectronics and CEA/Leti) Richard Hersemeule (group leader, MULTIVAL project leader, STMicroelectronics) Sylvie Lesmanne (group leader, Bull) Virang Shah (engineer, CEA/Leti) Yvain Thonnart (engineer, CEA/Leti) |
Together with Bull, we studied the MPI software layer and MPI benchmark applications to be run on FAME2 (Flexible Architecture for Multiple Environments), a CC-NUMA multiprocessor architecture developed at Bull for teraflop mainframes and petaflop computing.
In 2008, our activities focused on the following aspects:
We pursued the study of the MPI benchmark called the "ping-pong" protocol, our goal being to predict the performance of this benchmark on FAME2 machines, in particular to estimate the latency of send/receive operations on different topologies, different software implementations of the MPI primitives, and different cache coherency protocols. The ping-pong benchmark consists of two parallel processes, which send a data packet to each other k times. The benchmark was specified using a combination of LOTOS code (to describe the behavior of processes) and C code (to describe the data structures of memory and caches). Several configurations were considered, by specifying two different implementations of the send/receive primitives (based on linked lists with locks and on lock-free buffers, respectively) and two cache coherency protocols (in which a variable written by a process becomes either owned by that process, or shared between that process and the previous owner, respectively).
The performance analysis was carried out by extending the LOTOS specification with Markov delays and applying the BCG_MIN, DETERMINATOR, and BCG_STEADY tools of CADP in order to calculate the latency of the send/receive operations. The quality of the model was improved by decomposing the read/write accesses into two request/response phases, which reflects the real behaviour of the system accurately. Thus, the latencies predicted from the model were close (down to 6% of difference) to the experimental measures. The performance analysis also allowed to estimate the number of cache misses corresponding to each instruction, which indicates that the second send/receive protocol (based on lock-free buffers) and the first cache coherency protocol (in which a variable written by a process becomes owned by that process) provide the best performance among the configurations considered. An article describing this work was accepted for publication in an international conference [CZM09].
We undertook the study of the "barrier" primitive of MPI, which allows several parallel processes to synchronize (each process arriving at the barrier waits for all the others to arrive) before continuing their execution. The latency of the barrier primitive corresponds to the (average) time taken by a process to traverse the barrier, i.e., the time between the moment when it arrives and when it leaves the barrier. We specified, using the LOTOS and C languages, five protocols implementing the barrier primitives centralized, combining, tournament, dissemination, and tree-based. These five protocols differ by the shared data structures and the synchronizations used. In addition to the classical read/write operations, all these protocols use a fetch-and-decrement operation, which consists of a read operation and a write operation executed atomically. The combining protocol involves non-tail recursion, which was eliminated by modeling an explicit stack.
For functional verification, we identified five temporal properties characterizing the correct behaviour of several parallel processes traversing a barrier cyclically: deadlock freeness, correct access to private variables, absence of access to a null address in memory, mandatory traversal of the current barrier by all processes, and forbidden leaving of a barrier by a process before all the others arrive at the barrier. These properties were successfully verified on the five barrier protocols by using the EVALUATOR model checker of CADP. The first three properties were expressed in regular alternation-free mu-calculus and checked using EVALUATOR 3.5, whereas the last two properties, which require a count of the processes arriving at a barrier, were expressed in MCL and checked using EVALUATOR 4.0.
For performance evaluation, we extended the LOTOS specifications of the centralized and tree-based barrier protocols (involving cyclic parallel processes) with Markov delays. It turned out that the Markovian model contained nondeterminism caused by simultaneous accesses of concurrent processes to the same shared variable. This nondeterminism was eliminated by inserting very small Markov delays before the conflicting accesses. In order to fight state explosion, we generated the Markov chain semi-compositionally by using hand-crafted interfaces for the memory and cache processes, and by minimizing the intermediate graphs progressively modulo stochastic branching bisimulation using BCG_MIN, the whole process being automated using SVL scripts. This enabled generation of the final Markov chain for the centralized barrier with six processes and the tree-based barrier with four processes, and to compute the latencies of barrier traversals for these protocols using BCG_STEADY.
We pursued the study (started in 2005) of the FAUST (Flexible Architecture of Unified System for Telecom) NoC (Network on Chip) [BCVCR05] developed by the CEA/Leti laboratory. Since 2007, we are focusing on the latest version, named Magali, of this architecture.
In 2008, we helped Leti scientists to write mu-calculus formulas expressing the correctness of the output controller. By using the EVALUATOR 3.5 tool, it was possible to verify several properties. For example, we verified that routing of flits on a virtual channel does not change before the last flit of a packet (i.e., all flits of the same packet are routed similarly), that an acknowledgment of routing is generated at the end of a packet, and that routing a new packet happens only after an arbitration decision has been taken.
To verify the input controller, Leti scientists decided not to use model checking (as done in 2006 [SSTV07] for a prior version of the input controller), since the mu-calculus formulas would have been too complex (the reason for the complexity increase is the higher storage capacity per channel in the latest version of the input controller). Instead, they used equivalence checking and developed an abstract model of the input controller (300 lines of LOTOS code, about 4.6 million states) that was proven, using the BISIMULATOR tool, to be included (with respect to branching bisimulation) in the input controller, showing that the input controller contains all expected behaviors.
We also helped Leti scientists in their effort to build a co-simulation platform to compare the formal LOTOS model of MAGALI and the implementation of MAGALI given as a VHDL netlist. Using the EXEC/CÆSAR framework, it was possible to embed the LOTOS model into a SYSTEM/C process that was executed (using the ModelSim tool of Mentor Graphics) in combination with the netlist. In this approach, the LOTOS model randomly generates inputs that are sent to the netlist. For each output sent back by the netlist, one checks whether a corresponding output also exists in the LOTOS model. As a first result, this co-simulation platform revealed that the LOTOS model was generating spurious inputs that would never occur actually. Leti scientists solved this issue by adding constraints in the LOTOS model to generate only valid inputs (e.g., in a sequence of flits, an end-of-packet may only occur after a begin-of-packet).
Together with STMicroelectronics, we studied xSTream, a multiprocessor dataflow architecture for high performance embedded multimedia streaming applications. In this architecture, computation nodes (e.g., filters) communicate by xSTream queues connected by a NoC (Network on Chip). An xSTream queue generalizes a bounded FIFO queue in two ways: it provides additional primitives (such as peek to consult items in the middle of the queue, which is not possible with the standard push/pop primitives of FIFO queues), and a backlog (extra memory) to allow the increase of the queue size when the queue overflows.
During 2007, the xSTream queues were studied extensively. In 2008, we focused on the NoC itself and its interaction with xSTream queues.
The xSTream NoC is composed of routers connected by direct communication links. First experiments showed that representing xSTream primitives in the routers by two events (request/response), as used for modeling xSTream queues, would lead to state space explosion. Therefore, we decided to represent each xSTream primitive by a single LOTOS event in a router. To reuse the LOTOS models already developed for xSTream queues (in which each xSTream primitive is represented using two LOTOS events), we introduced additional LOTOS processes, called network interfaces, that convert one representation into the other.
We developed two successive models of an xSTream router. The first model (150 lines of LOTOS code) uses a single LOTOS process with eight parameters representing the status of each of the eight ports of the router. State space generation for this model was slow due to the number of conditions evaluating to false when computing the successor states. Therefore, we produced a second model (120 lines of LOTOS code), in which each port is represented by a separate process, which reduced the generation time by a factor 50, still yielding the same state space.
In a first step, we considered a NoC composed of four routers, where each router is directly connected to all the other routers. For this model, we could generate the corresponding state space (about 5,600 states and 20,000 transitions) compositionally, the largest intermediate graph having one million states and 5 million transitions. By using the BISIMULATOR tool, we could show that this model (i) is not branching bisimilar to, but only contains (modulo the branching preorder) the parallel composition of two xSTream queues (after abstracting away the network interfaces and the routers); (ii) is branching bisimilar to the composition of two xSTream queues connected by two network interfaces (after abstracting away only the routers). By using the EVALUATOR tool, we could verify temporal logic formulas expressing that the additional behavior is correct.
In a second step, we considered the more intricate case of a NoC composed of six routers, which are no longer fully connected; this implies that a packet might traverse more than two routers, which in turn increases the number of messages that might be received by a router. This resulted in an increase of the state space size for each single router: a router in a NoC with four routers has less than 400 states and 140,000 transitions, whereas a router in a NoC with six routers has up to 6.7 million states and 49 million transitions, which was too large for further composition with the other routers, network interfaces, and xSTream queues. Therefore, we abstracted away the differences between all "disturbing" messages (i.e., messages that do not directly concern the considered pair of xSTream queues), ignoring destination and routing information in all these messages. This allowed us to simplify the model of a router by removing those ports that are only used by the abstracted messages. The state space corresponding to this simplified model of a router was reduced down to 240 states and 1,000 transitions. An additional benefit was that the same router could be used for all routers of the NoC (due to an asymmetry in the architecture, in the first model all routers had different state spaces). By using the BISIMULATOR tool, we found that the parallel composition of two xSTream queues connected by network interfaces and a NoC with six abstract routers was branching bisimilar to the parallel composition of two xSTream queues connected by two network interfaces. However, we could not establish an equivalence relation between the two models of a router, as the simplified model of a router is no longer deterministic (abstracting away the routing information from disturbing messages introduces nondeterminism, since some disturbing messages can be dropped while others have to be forwarded to the next router). We therefore refined the abstraction, distinguishing two kinds of disturbing packets, those to be dropped and those to be forwarded. Comparing this refined model with the complete one, we still did not observe branching equivalence, but only a mutual inclusion (i.e., two simulations, but not a bisimulation).
We also continued performance evaluation studies to predict latency and throughput of communication between xSTream queues. For us, a key challenge is to combine probabilistic/stochastic information (e.g., the rates at which xSTream applications push and pop elements in and out of the queues) with precise timing information (e.g., memory access time). After exploring different techniques and tools, we devised an approach, called IPC (Interactive Probabilistic Chains), inspired by Interactive Markov Chains [Her02], but using probabilities instead of stochastic distributions and a central clock governing all delays. We defined structured operational semantics for standard process algebra operators (sequential and parallel composition, nondeterministic choice, etc.) applied to IPC and proved that probabilistic branching bisimulation is a congruence for the parallel composition of IPC. Taking advantage of the open architecture of CADP, we prototyped a tool chain (6,400 lines of C code and 400 lines of Perl script), which we experimented on several examples.
We started to study whether the CADP tools could enrich with formal verification capabilities the current design flow of STMicroelectronics, which is based on SystemC/TLM (see section 7c below). With this aim, STMicroelectronics provided us with the Blitter Display (BDisp for short), a 2D-graphics co-processor implementing Blit (Block Image Transfer) and numerous graphical operators, e.g., rotations, alpha blending, or Blue Ray disc decoding. The BDisp is software-controlled through instructions written in nodes of up to four application queues and two real-time composition queues. It is described by more than 25,000 lines of SystemC/TLM code.
First, to palliate the absence of the STMicroelectronics in-house development environment, we designed a minimal testbench, and were able to recompile the BDisp and pass all of the required tests within our own build environment..We then sought to obtain a LOTOS model of the BDisp. This was not immediate, because the SystemC/TLM model of the BDisp is mostly sequential and data-intensive, and because it is not primarily tailored to formal verification. For this reason, we focused on the queue management part, which is responsible for triggering and ordering the execution of the queues according to their priorities. We isolated the queue management part from the rest (graphical data treatment) and explicited the asynchronous concurrency that actually exists in the transactions between the BDisp and its environment.
The translation from SystemC/TLM to LOTOS was done manually, but according to systematic rules. The SystemC/TLM code modeling input/output communications and concurrency was translated to LOTOS, while the rest of the code (i.e., plain C++ code) was kept unmodified and invoked from the LOTOS model. Concretely:
We translated the communication and concurrency primitives from SystemC/TLM into LOTOS using our schedulerless translation rules (see [PS08]). This produced 920 lines of LOTOS code.
We identified all variables composing the internal BDisp state and optimized their memory size on the state vector (e.g., replacement of integer variables by bit-fields of minimal length).
We split the rest of the SystemC code into functions testing the BDisp state and into functions modifying the BDisp state (5,550 lines of C++ code).
We abstracted the queue modeling (e.g., compact representation of instruction nodes, memory addresses abstracted to an enumerated type, etc.).
We developed an interface between the LOTOS and C++ codes (2,250 lines of C code).
This novel, hybrid approach reuses large parts of the original SystemC/TLM model (so that the model under verification remains close to the original one) and leads to good performance using the CADP tools. After assembling all these code fragments together, we obtained an executable LOTOS/C++ model of the BDisp. We used the OCIS tool of CADP to perform step-by-step simulation and the EVALUATOR tool to prove correctness properties on several verification scenarios.
The CADP toolbox contains several tools dedicated to the LOTOS language, namely: the CÆSAR.ADT compiler for the data type part of LOTOS, the CÆSAR compiler for the process part of LOTOS, and the CÆSAR.INDENT pretty-printer.
In 2008, we migrated these compilers to the latest version of SYNTAX. We also brought support for long LOTOS or C identifiers, as well as for large LOTOS programs having more than 2^16 lines. We ported to 64-bit platforms the C code generated by CÆSAR and CÆSAR.ADT, and we modified this code to remove all warnings generated by recent compilers and code checkers (namely, Gcc, Intel's ICC, and Sun's CC and Lint) running in 32-bit and 64-bit mode.
By applying intensive testing, we discovered and fixed several difficult semantic issues in CÆSAR's optimizations that would, in some rare cases, generate incorrect or suboptimal code.
We implemented in CÆSAR.ADT a feature (already present in TRAIAN since 2003) that allows values of dynamic data types (such as lists, trees, etc.) to be represented "canonically", meaning that they are stored in tables and, thus, represented only once in memory. A technical challenge was to make this feature optional, in the sense that CÆSAR.ADT users can do this selectively, for a set of types of their choice, while other types remain implemented as before.
We continued to enhance our TRAIAN compiler, which generates C code from LOTOS NT data type and function definitions. TRAIAN is distributed on the Internet and used intensively within the VASY project-team as a development tool for compiler construction [GLM02].
In 2008, we corrected five bugs and enhanced TRAIAN in several respects:
We improved the efficiency of the C code generated by TRAIAN when this code is compiled with optimizing Gcc.
We modified the generated C code so as to remove all warnings emitted by Sun's Lint code checker and many warnings emitted by Intel's ICC compiler.
We developed configuration files to support the LOTOS NT language in the Emacs, XEmacs, and Jedit editors
We enhanced the testing environment for TRAIAN and enriched its non-regression testing database with new examples.
A new version 2.6 of TRAIAN was released on February 27, 2008.
We continued to improve the LNT2LOTOS tool suite that translates (a large subset of) LOTOS NT into LOTOS, thus allowing to use CADP to verify LOTOS NT descriptions. This tool suite is being used by Bull to model and verify critical parts of its FAME2 multiprocessor architecture; it was also used to verify protocol specifications provided by Airbus.
In 2008, the LNT2LOTOS tool suite was enhanced significantly. Eight successive versions were released and the tool suite was ported to five 32-bit architectures and three 64-bit architectures.
For the data type part of LOTOS NT, we added support for character and string constants, simplified the use of predefined operations in module declarations, and allowed end-users to specify the names of the LOTOS and C types and functions to be generated by the LNT2LOTOS translator.
For the process part of LOTOS NT, we finalized the concrete syntax and static semantics of processes, and specified an algorithm translating a set of LOTOS NT processes into a collection of LOTOS types, functions, and processes. This algorithm was implemented almost entirely in LNT2LOTOS and is currently under validation.
A new tool named Lnt_Check was introduced to check the exhaustivity of "case" patterns occuring both in functions and processes.
LNT2LOTOS is developed using the SYNTAX/TRAIAN technology. Currently, it represents 32,500 lines of code (3,900 lines of SYNTAX code, 26,100 lines of LOTOS NT code, and 2,500 lines of C code). The LOTOS NT reference manual was updated and grew from 61 pages (at the end of 2007) to 71 pages. Five demo examples illustrating the usage of the LNT2LOTOS tool suite, together with a non-regression database containing 300 LOTOS NT programs, were developed.
We continued the study of the process algebra CHP (Communicating Hardware Processes) for which the TIMA laboratory has developed a circuit synthesis tool named TAST and which is used by the CEA/Leti laboratory to describe the FAUST/MAGALI architecture. Our goal is to integrate formal verification tools into the design flow of complex asynchronous circuits. Our prior work led to the development of the CHP2LOTOS tool, which translates CHP into LOTOS, our goal being to integrate formal verification tools into the design flow of complex asynchronous circuits.
In 2008, we devised proof arguments to establish the correctness of CHP2LOTOS. Our overview paper on the CHP to LOTOS translation [GSS09] was accepted for publication in an international journal .
We investigated the verification of TLM (Transaction Level Model) models. Compared to traditional RTL (Register Transfer Level) models, TLM allows faster simulation, simultaneous development of software and hardware, and earlier hardware/software partitioning. Among all languages supporting TLM, SystemC emerges as an industrial standard. SystemC is a C++ library providing both a high-level description language and a simulation kernel that involves a central (largely deterministic) scheduler ordering the actions of the different processes.
In 2008, we pursued our research on this topic. Both our approaches (i.e., with the scheduler [HP08] and without the scheduler [PS08]) were published in international conferences. A comparison of both approaches on the parameterized benchmark with n concurrent components given in [TCMM07] was done and reported in [HP08].
To improve the effectiveness of formal verification, we experimented with various ways of coding SystemC/TLM transactions in LOTOS (inlined, instantiated, or synchronized by rendezvous). We showed that inlining transactions in the initiator thread is, when possible, by far the most efficient encoding considering the size of the state space generated with CADP (up to 15 times fewer states than using instantiated transactions).
In parallel, to ease the integration of SystemC/TLM and LOTOS, we investigated how to reuse, within a LOTOS specification, existing fragments of code written in SystemC. In a first step, we experimented with this idea on the Blitter Display case-study provided to us by STMicroelectronics. In a second step, we undertook the design of a library that allows, from a LOTOS specification, invocation of SystemC/TLM simulation steps.
A key objective for the future of CADP is the ability to support the latest 64-bit platforms. This is a heavy task because of the number of tools in CADP, their intrinsic complexity, and their reliance upon third-party software. In 2008, we made significant steps in this direction:
For 32-bit machines, we added support for Windows Vista, MacOS 10.5, and the latest Linux distributions (Debian, Fedora Core, Red Hat, and Ubuntu).
For 64-bit machines, we completed porting CADP to SPARC V9 stations running Solaris 10, to Intel IA64 machines running Linux, and to Intel EMT64/AMD64 machines running Linux.
We enhanced the C code generated by CADP tools so that it compiles without any warning message using various compilers (namely GCC 3 and GCC 4, Sun's CC, and Intel's ICC).
We pursued our collaboration with Pierre Boullier, Philippe Deschamp, and Benoît Sagot (Alpage project team, INRIA Rocquencourt) to port the SYNTAX compiler generation software (more than 300,000 lines of C code) to 12 different platforms (processor, operating system, and C compiler), and especially to 64-bit architectures.
We introduced prototype declarations for all C functions of SYNTAX, which enabled stricter type checking for function calls. We discovered 18 bugs, 15 of which have been repaired in collaboration with the Alpage project team. We checked (using debugging tools such as Valgrind and MallocScribble) that the SYNTAX code is free from faulty memory accesses. We wrote an installation guide for SYNTAX and migrated all CADP tools to the latest version of SYNTAX.
Finally, there have been significant software developments done in MULTIVAL that will not be detailed here, since they are not immediately visible to the industrial users of the CADP tools. More information can be found in the corresponding sections of the VASY project-team activity report for 2008. Notable changes include:
Enhancements to the BCG format and libraries (section 5.1.1)
Enhancements to the OPEN/CÆSAR libraries (section 5.1.2)
Enhancements to the CÆSAR_SOLVE library (section 5.1.3)
Enhancements to the BISIMULATOR tool (section 5.1.4)
Enhancements to the EVALUATOR 3.5 and 4.0 tools (section 5.1.5)
Enhancements to the XTL tools (section 5.1.6)
Enhancements to the compositional verirication tools EXP.OPEN, PROJECTOR, and SVL (section 5.1.7)