MULTIVAL Progress Report 2010
Partners and Funders
Nicolas Coste (PhD student, STMicroelectronics and INRIA/VASY)
Hubert Garavel (group leader, INRIA/VASY)
Rémi Hérilier (software engineer, INRIA/VASY)
Holger Hermanns (research scientist, INRIA/VASY)
Alain Kaufmann (software engineer, INRIA/VASY)
Frédéric Lang (research scientist, INRIA/VASY)
Etienne Lantreibecq (engineer, STMicroelectronics)
Radu Mateescu (research scientist, INRIA/VASY)
Christine McKinty (engineer, INRIA/VASY)
Vincent Powazny (software engineer, INRIA/VASY)
Wendelin Serwe (research scientist, INRIA/VASY)
Meriem Zidouni (PhD student, Bull and INRIA/VASY)
in close collaboration with
Edith Beigné (engineer, CEA/Leti)
François Bertrand (group leader, CEA/Leti)
Fabien Clermidy (engineer, CEA/Leti)
Richard Hersemeule (group leader, MULTIVAL project leader, STMicroelectronics)
Virang Shah (engineer, CEA/Leti)
Yvain Thonnart (engineer, CEA/Leti)
Pascal Vivet (engineer, CEA/Leti)
Together with Bull, we studied the MPI software layer and MPI benchmark applications to be run on FAME2 (Flexible Architecture for Multiple Environments), a CC-NUMA multiprocessor architecture developed at Bull for teraflop mainframes and petaflop computing.
In 2010, we pursued our study of the MPI "barrier" primitive, which allows several parallel processes to synchronize (each process arriving at the barrier waits for all the others to arrive) before continuing their execution. Our goal was to estimate the latency of the barrier primitive (i.e., the average time taken by a process to traverse the barrier) for different topologies, different software implementations of the MPI primitives, and different cache coherency protocols. Based on the LOTOS NT specifications extended with Markov delays that we had previously developed for several protocols implementing the barrier primitive, we computed, using the Cunctator on-the-fly steady-state simulator, the latency of the centralized, combining, and tournament protocols for three different topologies and configurations containing up to 16 processes.
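The functional behaviour of the centralized protocol can be sketched as follows. This is an illustrative Python rendering (class and method names are hypothetical); the actual models were LOTOS NT specifications extended with Markov delays, and this sketch captures only the synchronization, not the timing.

```python
import threading

class CentralizedBarrier:
    """Centralized barrier protocol: a shared counter guarded by a lock.
    Each arriving process increments the counter; the last arrival resets
    it and wakes all the others. Illustrative sketch only."""

    def __init__(self, n):
        self.n = n                    # number of participating processes
        self.count = 0                # processes currently waiting
        self.generation = 0           # distinguishes successive barrier rounds
        self.cond = threading.Condition()

    def wait(self):
        with self.cond:
            gen = self.generation
            self.count += 1
            if self.count == self.n:          # last process to arrive
                self.count = 0
                self.generation += 1
                self.cond.notify_all()
            else:                             # wait for the others to arrive
                while gen == self.generation:
                    self.cond.wait()
```

The latency measured in the report corresponds to the time a process spends inside `wait`; the combining and tournament protocols replace the single shared counter by a tree of smaller synchronizations to reduce contention.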
These results were presented in the PhD thesis of Meriem Zidouni [Zid10], defended on May 25, 2010.
Together with STMicroelectronics, we studied xSTream, a multiprocessor dataflow architecture for high-performance embedded multimedia streaming applications. In this architecture, computation nodes (e.g., filters) communicate through xSTream queues connected by a NoC (Network on Chip). An xSTream queue generalizes a bounded FIFO queue in two ways: it provides additional primitives (such as peek, which consults items in the middle of the queue and is not possible with the standard push/pop primitives of FIFO queues), and a backlog (extra memory) that allows the queue to grow beyond its nominal size when it overflows.
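These two generalizations can be sketched functionally as follows (illustrative Python with hypothetical names and interfaces; the real xSTream queue is a hardware/software mechanism and its actual API differs):

```python
from collections import deque

class XStreamQueue:
    """Sketch of an xSTream queue: a bounded FIFO extended with a 'peek'
    primitive and a backlog that absorbs overflow. Names, sizes, and the
    overflow policy are illustrative assumptions."""

    def __init__(self, size, backlog_size):
        self.size = size                  # nominal FIFO capacity
        self.backlog_size = backlog_size  # extra memory used on overflow
        self.items = deque()

    def push(self, item):
        # The backlog lets the queue grow past its nominal size;
        # only when backlog memory is also exhausted does push fail.
        if len(self.items) >= self.size + self.backlog_size:
            raise OverflowError("queue and backlog are full")
        self.items.append(item)

    def pop(self):
        return self.items.popleft()

    def peek(self, i):
        """Consult the i-th item without removing it -- not possible with
        the standard push/pop primitives of a FIFO queue."""
        return self.items[i]
```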
In 2010, we continued our performance evaluation studies to predict the latency and throughput of communication via xSTream queues, each of which is composed of two queues (called push and pop, respectively) communicating over the NoC and taking advantage of the flow-control mechanism offered by xSTream. We consolidated the IPC (Interactive Probabilistic Chains) approach undertaken in 2008. This approach is inspired by Interactive Markov Chains [Her02], but uses probabilities instead of stochastic distributions and a central clock governing all delays.
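A consequence of the discrete-time setting (probabilities plus one central clock) is that, once interactive nondeterminism is resolved, the timing behaviour reduces to a discrete-time Markov chain, whose long-run measures can be obtained from its steady-state distribution. A minimal sketch of computing such a distribution by power iteration (illustrative Python; this is not part of the IPC tool chain):

```python
def steady_state(P, iterations=1000):
    """Power iteration on a discrete-time Markov chain. All delays are
    multiples of one central clock tick, so a single transition matrix P
    (rows summing to 1) captures the timing behaviour. Illustrative only;
    assumes the chain has a unique steady-state distribution."""
    n = len(P)
    pi = [1.0 / n] * n                    # start from the uniform distribution
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi
```

Throughput figures such as those reported below can then be derived as the steady-state probability of the states in which a transfer completes, per clock tick.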
Firstly, we used the IPC approach to compute the average throughput of the push operation for different sizes of the push and pop queues. Our experiments show that the relative sizes of the push and pop queues forming an xSTream queue may influence the average throughput of the push operation, although all xSTream queues with the same overall size (sum of the push and pop queue sizes) are functionally equivalent (i.e., branching bisimilar). We observed that an xSTream queue performs best when the sizes of the two queues are as similar as possible, and that larger push queues absorb bursts of messages better.
Secondly, we studied the impact of the flow-control mechanism on performance. To this end, we considered a system of two xSTream queues (i.e., two pairs of push and pop queues) sharing the NoC, and studied the throughput of the second xSTream queue for various consumption rates of the first. When the messages of the first xSTream queue are consumed rapidly, i.e., removed quickly from its pop queue, we observed that the flow-control mechanism has no impact on the throughput of the second xSTream queue. When the messages of the first xSTream queue are consumed slowly, the flow-control mechanism does have an impact: without it, the second xSTream queue is slowed down proportionally to the first, whereas with it, the throughput of the second xSTream queue even increases and tends towards the values observed for a single xSTream queue.
These results were presented in the PhD thesis of Nicolas Coste [Cos10], defended on June 25, 2010.
STMicroelectronics studied Platform 2012, a many-core programmable accelerator for ultra-efficient embedded computing in nanometer technology. This flexible and configurable multi-cluster platform fabric targets a range of emerging video, imaging, and next-generation immersive multimodal applications. Configurability options include the number of clusters, the number and type of processing elements (PE) per cluster, specialization of the architecture and instruction-set of the PEs, and finally, support of hardware-accelerated PEs. The platform is supported by a rich programming environment which embodies a range of platform programming models.
In 2010, we concentrated on two blocks of Platform 2012:
HWS (HardWare Synchronizer) is a hardware block providing mechanisms that allow the implementation of software routines for synchronization between execution threads of several processors. One of these mechanisms is the System Messenger based on asynchronous message queues. STMicroelectronics decided to analyze the correctness of the System Messenger, in particular to verify properties such as correct routing and absence of message loss.
The System Messenger comes with a set of usage rules that must be respected by the software, otherwise there are no correctness guarantees whatsoever. Therefore, we advised STMicroelectronics to use a constraint-oriented modelling style, i.e., to represent each usage constraint as a process to be composed in parallel with the System Messenger and the other constraints. We further reduced the size of the model by applying data abstractions (e.g., to reduce the number of bits per message) and symmetry arguments (e.g., to limit the number of processors). We also added processes to model the application software, considering two scenarios, namely n senders with one receiver, and one sender with n receivers. This led to a LOTOS NT model (eleven modules, 300 lines), for which we generated the state spaces (about 2,000 states and 3,000 transitions for each scenario). For all scenarios, we verified that the messages sent from the same sender to the same receiver are received in the same order. For the n-senders-one-receiver scenario, we also verified that no message is lost, and for the one-sender-n-receivers scenario, we also verified that the messages are correctly routed to their destination.
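The constraint-oriented style used above can be illustrated abstractly as follows: each usage rule becomes a process composed in parallel with the system, and an action is enabled only if every constraint allows it. This Python sketch uses hypothetical names and a hypothetical rule; the actual constraints were LOTOS NT processes synchronizing on shared gates.

```python
class Constraint:
    """A usage rule expressed as a process: it observes actions and
    refuses those that would violate the rule."""
    def allows(self, action, history):
        raise NotImplementedError

class AtMostOnePending(Constraint):
    """Hypothetical rule: a sender may not send again before receiving
    an acknowledgement (not an actual System Messenger rule)."""
    def allows(self, action, history):
        if action == 'send':
            return history.count('send') <= history.count('ack')
        return True

def compose(constraints):
    """Parallel composition in the constraint-oriented style: an action
    is enabled only if every constraint allows it, mirroring LOTOS
    parallel composition with synchronization on shared gates."""
    history = []
    def step(action):
        if all(c.allows(action, history) for c in constraints):
            history.append(action)
            return True
        return False                      # action refused by some constraint
    return step
```

Model checking then explores only the behaviours permitted by all constraints, which is what keeps the state spaces of the scenarios small.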
DTD (Dynamic Task Dispatcher) is a hardware block that dispatches a set of application tasks on a set of PEs. It is called dynamic because each task itself might add tasks to the set of those to be dispatched by the DTD. The DTD is synthesized from a C++ model, optimized to generate an efficient hardware block. Due to the intrinsic complexity of the DTD, STMicroelectronics was interested in the co-simulation of this C++ code with a formal model of the DTD.
As a first step, we developed a LOTOS NT model of the DTD capable of handling four PEs (1,200 lines of LOTOS NT). We also modelled as LOTOS NT processes the different sets of tasks corresponding to various applications. To express the operations provided by the DTD, we had to include a call-stack in the model of each PE, as a means of circumventing the static-control constraints of CÆSAR, which forbid recursion over parallel composition. STMicroelectronics judged LOTOS NT essential for modelling the DTD, because using LOTOS instead would have been extremely difficult, requiring complex continuations with numerous parameters.
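The recursion-to-stack transformation applied in the model of each PE is the classic one: instead of a task recursively spawning subtasks, pending tasks are pushed onto an explicit stack and processed iteratively. A minimal sketch (illustrative Python with a hypothetical task representation; the actual model was LOTOS NT):

```python
def run_tasks(initial_tasks, spawn):
    """Execute tasks that may themselves create new tasks, using an
    explicit stack instead of recursion -- the same transformation used
    to satisfy the static-control constraints of CAESAR."""
    stack = list(initial_tasks)   # explicit call-stack replacing recursion
    done = []
    while stack:
        task = stack.pop()
        done.append(task)         # "execute" the task (illustrative)
        stack.extend(spawn(task)) # children pushed instead of recursive calls
    return done
```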
Then, we generated the state space (about 20,000 states and 80,000 transitions) for a simple application (containing one fork on two PEs), and verified several properties, such as the absence of deadlocks and the correctness of the assertions inserted in the model. This allowed us to pinpoint a difference between our implementation and the architect's, highlighting a high sensitivity to the order of terms in an equation and revealing an under-specified mechanism. We also verified the correctness of a complex optimization.
Having gained confidence in the LOTOS NT model, we wrote a program to automatically generate a LOTOS NT model of the DTD capable of handling n PEs. For n = 4, this program yielded exactly the model discussed above. Using this program, we then generated a LOTOS NT model for n = 16, i.e., the number of PEs handled by the C++ model used for synthesis. We applied the Exec/Cæsar framework to co-simulate the C++ and LOTOS NT models of the DTD, a challenge being the connection of the asynchronous LOTOS NT model with the synchronous C++ model, because one step of the C++ model corresponds, in general, to several transitions of the LOTOS NT model.
These case studies enabled us to discover and correct several bugs in CADP.
The CADP toolbox contains several tools dedicated to the LOTOS language, namely the CÆSAR.ADT compiler for the data type part of LOTOS, the CÆSAR compiler for the process part of LOTOS, and the CÆSAR.INDENT pretty-printer.
In 2010, in addition to fixing seven bugs in the CÆSAR and CÆSAR.ADT compilers, we improved the LOTOS-dedicated tools of CADP as follows:
We added a means for the user to indicate that the values of a given sort are "canonical", so that they are stored automatically in a hash table whose maximal size is specified by the user. We also added checks to CÆSAR.ADT that detect non-canonical sorts and report them with informative messages, enabling the user to modify these sorts to make them canonical so that they can be stored in hash tables.
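The technique behind such a table is hash-consing: structurally equal values are mapped to a single shared representative, so that equality tests reduce to pointer comparisons. A minimal sketch (illustrative Python with hypothetical names; in CÆSAR.ADT this applies to the values of LOTOS sorts, and the size bound corresponds to the user-specified maximal table size):

```python
class CanonicalTable:
    """Hash-consing sketch: store one canonical representative per
    structurally distinct value, in a table of bounded size."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.table = {}

    def canonicalize(self, value):
        if value in self.table:
            return self.table[value]      # reuse the shared representative
        if len(self.table) >= self.max_size:
            raise MemoryError("hash table full (maximal size reached)")
        self.table[value] = value
        return value
```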
We modified the string management library so that strings can now be stored in hash tables. This means it is now possible to model check programs that manipulate variable-length strings. We also fixed a bug in this library.
The standard definition of the LOTOS language considers "i" or "I" as a reserved keyword denoting the internal gate and forbids its use elsewhere. The error messages issued by CÆSAR and CÆSAR.ADT when "i" is used incorrectly have been improved. We relaxed the rules for using "i" as a type, sort, operation, process, variable, or specification identifier. A new option can be used to instruct CÆSAR and CÆSAR.ADT to apply the LOTOS standard strictly, rather than the new relaxed rules.
To facilitate evolution, we merged the code bases for CÆSAR 6.3 and CÆSAR 7.1, which had diverged over time.
We continued to enhance our TRAIAN compiler, which generates C code from LOTOS NT data type and function definitions. TRAIAN is distributed on the Internet and used intensively within the VASY project-team as a development tool for compiler construction [GLM02].
In 2010, TRAIAN was essentially in maintenance mode. We did, however, correct one bug and we ported the related shell scripts to add support for the new Solaris Intel 32- and 64-bit architectures.
We continued to improve the LNT2LOTOS tool suite that translates (a large subset of) LOTOS NT into LOTOS, thus allowing the use of CADP to verify LOTOS NT descriptions. This tool suite is being used by Bull to model and verify critical parts of its FAME2 multiprocessor architecture; it was also used to verify protocol specifications provided by AIRBUS.
In 2010, the LNT2LOTOS tool was enhanced significantly, leading to a new stable version, 5.1, which was incorporated into CADP in January 2010. The enhancements include:
Several changes have been made to the treatment of infix functions and constructors.
A new option has been added to LNT2LOTOS and LNT.OPEN to store all the character strings in a hash table.
The "Lnt_V1" library was enriched with new functions for manipulating strings and with support for real numbers in floating point notation.
Important enhancements have been made to enable LNT2LOTOS to handle external C code that is provided by the user.
Support for lists, sets, and ordered lists has been improved.
Changes were made in the syntax of LOTOS NT patterns and value expressions, avoiding syntactic ambiguities, and fixing a bug regarding unary operators used without parentheses.
Support for multi-module compilation was added.
Several improvements were made to the generated LOTOS code, and the error messages and warnings issued in the case of a serious syntax error in the source LOTOS NT code were improved. The test suite for LNT2LOTOS was improved, and many new tests were added.
Major improvements were made to the LNT2LOTOS Reference Manual, providing new information as well as clarifications and improvements to the existing information.
We also further developed LNT.OPEN, a shell script that automates the conversion of LOTOS NT programs to LOTOS code and provides a connection between LNT2LOTOS and the OPEN/CÆSAR environment. LNT.OPEN takes as input the principal module of a LOTOS NT specification and an OPEN/CÆSAR application program. LNT.OPEN first translates the complete LOTOS NT specification (i.e., the principal module and all included modules) into LOTOS by calling LPP and LNT2LOTOS, then compiles the generated LOTOS specification by calling CÆSAR.ADT and CÆSAR, and finally invokes the OPEN/CÆSAR application program. LNT.OPEN is now the recommended tool for using LOTOS NT specifications in conjunction with CADP.
A key objective for the future of CADP is to maintain our ability to support the latest 64-bit platforms with the high level of quality expected by a growing user base in industry and academia. This is a heavy task because of the number of tools in CADP, their intrinsic complexity, and their reliance upon third-party software. In 2010, we continued our efforts in this direction:
We added support for Solaris Intel 32- and 64-bit systems, and for Mac OS X 10.6 ("Snow Leopard") systems with 64-bit kernels.
We made changes to handle the latest versions of Cygwin on Windows and recent versions of the supported C compilers.
We updated the installation framework and enhanced the installation documentation, including information on using the new supported platforms.
We made minor modifications to the Eucalyptus user interface and error messages, fixed a bug in the information displayed on certain recent Linux systems, added support for FSP and LOTOS NT specifications, and enhanced support for PDF, PostScript, zipped files, and files of unknown format.
We enhanced support for the Emacs, XEmacs, and jEdit text editors in CADP, notably adding support for editing LOTOS NT files.
Because of the growing usage of CADP in industry and academia, we pursued our efforts to master the software quality of CADP and to improve performance:
We continued building a comprehensive validation framework based on non-regression testing and semantic checking for the CADP tools. This framework allows functional testing of individual tools as well as integration testing of several CADP tools used together to perform complex verification scenarios, on various computing platforms and using various compilers.
We developed a number of complex, internal tools, both to improve routine testing and to facilitate support of CADP users.
We continued gathering large collections of benchmarks (BCG graphs, boolean equation systems, mu-calculus formulas, etc.) for testing the CADP tools extensively. We defined a set of rules for managing these benchmarks, making it easier to add new test patterns automatically, and scripts to check test integrity.
To facilitate contributions from users to our test suite, we improved the Contributor tool, optimizing the code to improve performance, and enhancing the user interface.
We implemented a comprehensive suite of scheduled automatic tests that run on the Pipol platform of Inria. These tests, which check the 40 CADP demos, validate changes on all the supported architectures, and monitor performance changes.
Finally, there have been significant software developments done in MULTIVAL that will not be detailed here, since they are not immediately visible to the industrial users of the CADP tools. More information can be found in the corresponding sections of the VASY project team activity report for 2010. Notable changes include:
Enhancements to the BCG format and libraries (section 6.1.1)
Enhancements to CAESAR_SOLVE and the OPEN/CÆSAR libraries (section 6.1.2)
Enhancements to the Evaluator tool (section 6.1.3)
Enhancements to the performance evaluation tools (section 6.1.5)