FlexRay Fault Tolerance article


Omar Jaradat
Antonio Cappiello

Published in: Automotive, Technology, Business
FlexRay Fault-Tolerance: Capabilities, weaknesses and proposed enhancements

Antonio Cappiello, Omar Jaradat
Mälardalen University, Västerås, Sweden, 03/2011
{aco10003, ojt10001}@student.mdh.se

Abstract

This paper gives an overview of FlexRay and summarizes its main components, with adequate detail on how those components work. The document focuses on FlexRay reliability and on why it is considered a fault-tolerant protocol; it also discusses the protocol's capabilities and weaknesses and the authors' enhancement proposals. After reading this paper, readers should have a good understanding of FlexRay and of how well the protocol achieves reliability.

Introduction

FlexRay is a communication system considered to be one of the next generation of bus protocols for automotive networks. It can be applied to any other real-time distributed system environment, but any researcher will notice that the protocol is usually tied to the automotive industry. The reason is simple: it was developed in 1999 by a cooperation of leading companies in the automotive industry, and it was developed exclusively for automotive use.

Software errors are considered one of the big challenges that seriously affect software performance. Our mission is to show how FlexRay can be considered a fault-tolerant system and how it can handle the failures and errors that can happen at any given time, and to propose ideas that could enhance the reliability of the FlexRay protocol.

In this paper we describe the FlexRay protocol and analyse the bus controller structure and how the nodes communicate and interact within the whole communication system. We begin with the protocol itself and then turn to fault tolerance, explaining the weaknesses and the capabilities on the basis of two main points: the bus controller and the network topology. The paper then proceeds with the current state of the work and ends with our conclusions.

The FlexRay Protocol

The FlexRay protocol is a time-triggered protocol that offers options for deterministic data arriving in a predictable time frame. FlexRay has a core with static frames and dynamic frames, with a communication cycle that provides a predefined space for static and dynamic data, so nodes on a FlexRay network must know how all the pieces of the network are configured in order to communicate. Embedded networks differ from normal PC networks: PC networks require mechanisms to discover and configure devices automatically at run-time, whereas a FlexRay network does not need any such mechanism, because it has a closed configuration that should not be changed once it is assembled in production.

FlexRay manages multiple nodes with a Time Division Multiple Access (TDMA) scheme. Every FlexRay node is synchronized to the same clock, and each node waits for its turn to write on the bus. Because the timing is consistent in a TDMA scheme, FlexRay can guarantee consistency of data delivery to the nodes on the network, which provides many advantages for systems that depend on up-to-date data between nodes.
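The turn-taking in a TDMA scheme can be illustrated with a minimal Python sketch. It assumes a static segment only, a fixed slot length and exactly one owner per slot; the class `StaticSchedule`, the node names and the 50 microsecond slot length are hypothetical and are not taken from the FlexRay specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StaticSchedule:
    """Hypothetical static TDMA schedule: one owner node per slot."""
    slot_owners: tuple   # slot_owners[i] is the node that owns static slot i
    slot_length_us: int  # duration of one slot in microseconds (assumed fixed)

    @property
    def cycle_length_us(self):
        # One communication cycle is the sequence of all static slots.
        return len(self.slot_owners) * self.slot_length_us

    def owner_at(self, time_us):
        """Return the node allowed to write on the bus at global time time_us."""
        offset = time_us % self.cycle_length_us
        return self.slot_owners[offset // self.slot_length_us]

    def may_transmit(self, node, time_us):
        """A node may write only while its own slot is active."""
        return self.owner_at(time_us) == node


# Example configuration (node names are illustrative only).
schedule = StaticSchedule(slot_owners=("brake", "steering", "engine"),
                          slot_length_us=50)
```

Because every node evaluates the same schedule against the same synchronized clock, at most one node considers the bus its own at any instant, which is what makes the delivery time of each frame predictable in this simplified model.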
Fault-Tolerance: Capabilities and Weaknesses

In this section we point out the means adopted by the FlexRay protocol to provide fault-tolerant communication. We have identified two aspects of a FlexRay system that are involved in assuring fault tolerance:

  1. The bus controller.
  2. The physical network architecture.

The bus controller

The bus controller consists of six components, as shown in Figure 1 [1]; some of them in particular use a mechanism to protect the communication from errors.

Figure 1

The Protocol Operation Control (POC), which is responsible for reacting to host commands and for instructing the other components, also reacts to error situations. For example, when an error occurs, the POC falls to the normal-passive state and tries to reintegrate, but when the error is fatal the POC falls into the halt state and all operations are stopped. These two states and the active state, in which the bus controller operates normally, together constitute the so-called three-level error model (Figure 2). This model provides a self-diagnostic mechanism for possible errors.

Figure 2

The Frame and Symbol Processing (FSP), besides separating the payload from the header of a received message, also provides status data to the host regarding frame reception, for example whether the received frame is valid or invalid.

On the sender node, the Coding/Decoding Unit (CODEC) computes the CRC checksum and appends it to the message that it has to encode and send on the bus. On the receiver node, after decoding the received message, the CODEC performs the CRC check in order to verify whether the integrity of the message has been affected by electromagnetic noise on the bus that flipped some bits.

In addition, in a time-triggered real-time system such as FlexRay, the nodes have to keep a consistent view of the global time even in faulty situations, and the component responsible for this is the Clock Synchronisation (CS). This component tries to improve the fault tolerance of the protocol through two kinds of correction: offset correction and rate correction. In particular, it is in the offset correction method that the CS adopts a fault-tolerant midpoint algorithm in order to compute an average over the time differences between the communication rounds.
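As an illustration of the offset correction step, the following Python sketch shows one common form of a fault-tolerant midpoint: the measured deviations are sorted, the k largest and the k smallest values are discarded, and the midpoint of the remaining extremes is taken. The function name, the thresholds used to choose k and the use of a floating-point result are assumptions made for this example and should be checked against the FlexRay specification.

```python
def fault_tolerant_midpoint(deviations):
    """Midpoint of the measured clock deviations after discarding outliers.

    deviations: time differences observed against other nodes.
    Returns the correction term; 0.0 if nothing was measured.
    """
    values = sorted(deviations)
    n = len(values)
    if n == 0:
        return 0.0  # no measurements: apply no correction

    # Number of extreme values dropped at each end.
    # These thresholds are an assumption of this sketch.
    if n <= 2:
        k = 0
    elif n <= 7:
        k = 1
    else:
        k = 2

    remaining = values[k:n - k]
    # Midpoint of the smallest and largest surviving values.
    return (remaining[0] + remaining[-1]) / 2
```

With five measurements such as `[-3, 1, 2, 4, 900]`, the single outlier 900 (for instance from a faulty node) is discarded together with -3, so the correction is derived from the values 1 to 4 only. This is how a bounded number of arbitrary values can be kept from dragging the result away.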
On the basis of this computation, the next message schedule is brought forward or delayed in such a way that all nodes have almost the same time in the next cycle.

The FlexRay Consortium claims that, thanks to this algorithm, up to two Byzantine faults¹ can be tolerated. When more than two of these faults happen, the system can fall into a situation in which there are different views of the global time, and consequently another problem can affect the system: the clique problem.

¹ A Byzantine fault is typical of distributed systems and shows up as the wrong behaviour of a node in the system, which consists in sending arbitrary messages, including messages aimed at corrupting the system. More details about this topic are provided in the Current State of Work section.

A clique is a group of nodes connected to a network that can communicate only inside the same group and not with the other nodes. FlexRay doesn't provide any means to detect and resolve this kind of problem. The clique problem in FlexRay has been analysed in depth in [2], where two kinds of cliques have been identified:

  1. Time domain cliques, which happen when subsets of nodes have different views of the global time, as described before.
  2. Value domain cliques, which occur when a frame is correctly placed in a slot but contains a different cycle counter.

Moreover, [2] states that "the FlexRay consortium is aware of the potential clique problem" but also that "the cliques do not constitute a noticeable risk in practice", maybe because "there are no report published on cliques observed in a practical setup". For these reasons the authors of that document show with experiments how to create cliques in a physical FlexRay cluster and how to avoid or detect possible cliques.

Finally, when all the means adopted by the bus controller described above are not enough to prevent faulty behaviour, an additional component can be inserted between the bus controller and the network, as shown in Figure 1: the Bus Guardian (BG). In [3], four properties of the BG have been identified and formally proved:

  1. Correct Relay.
  2. Validity.
  3. Agreement.
  4. Integrity.

These properties guarantee, on the one hand, that the communication controller has no access to the communication channel outside its pre-allocated slots and, on the other hand, the correct relay of messages coming from non-faulty communication controllers.

Summarising [4] [5] on fault tolerance, we can state that the FlexRay protocol:

  • manages errors with a "never-give-up" strategy thanks to the three-level error model explained above, because "stopping communication is a critical decision which must be made by the application whenever possible";
  • is able to handle both internal and external faults;
  • does not adopt any strategy such as retransmission in case of a corrupted message; it is the responsibility of the host application to deal with these problems, because the strategy of the protocol is to "signal" the error;
  • likewise does not provide security, which is the responsibility of the application;
  • "Requires application support for Byzantine faults (e.g. group membership)".

The physical network architecture

FlexRay supports single and dual channel configurations, which consist of one or two pairs of wires respectively. Most FlexRay nodes typically also have power and ground wires available to power transceivers and microprocessors.

FlexRay can be distinguished from other automotive protocols such as CAN and LIN by its network layout, because FlexRay supports a very flexible network topology: it has two channels that can be used in different ways. This flexibility allows the protocol to provide scalable fault tolerance, and it plays a big role in forming the FlexRay system structure, so redundant and independent systems are possible.

There are three possible FlexRay topologies:

  1. Passive Bus Topology: all nodes are connected to a bus; in the dual-channel case a node can be connected to both channels or to only one of them (Figure 3.A).
  2. Active Star Topology: the network can be built as an active star that contains star couplers; each node must be connected to one coupler (Figure 3.B).
  3. Combination of the topologies: a combination of the passive bus and the active star is used (Figure 3.C).

It is very important for designers to choose carefully between these topologies, because choosing the most suitable topology can play a big role in optimizing the cost, performance and reliability of their design.

The nodes of a FlexRay network must know how all the pieces of the network are configured in order to communicate efficiently.

Figures 3.A, 3.B and 3.C show several possible topologies that can be supported by FlexRay channels [1] [6].

Figure 3.A
Figure 3.B
Figure 3.C

The FlexRay communications bus is a deterministic, fault-tolerant and high-speed bus system. Using two separate physical FlexRay communication lines at 10 Mbps each, it can implement doubly redundant, fault-tolerant message transmission; the two lines can alternatively be used to double the data throughput. FlexRay delivers the error tolerance and time-determinism performance requirements for x-by-wire applications (i.e. drive-by-wire, steer-by-wire, brake-by-wire, etc.) [7].

Most first-generation FlexRay networks use only a single channel in order to keep wiring costs down, but future networks will use dual channels because of the big advantage they can gain: the dual channel enhances fault tolerance and increases the bandwidth.

FlexRay can transmit individual messages redundantly to provide an additional layer of network reliability. In fact, FlexRay networks provide scalable fault tolerance by allowing single or dual channel communication. The dual channel is preferred in many cases; for example, in security-critical applications all devices connected to the bus may use both channels for transferring data. However, it is always possible to connect only a single channel when redundancy is not needed, or to increase the bandwidth by using both channels for transferring non-redundant data.

As a result, FlexRay can be used with single or dual channels; since the dual channel adds redundancy, using a dual-channel topology instead of a single channel increases the fault tolerance [1].

Current State of Work

In this section we describe the second step of our work, which consists in collecting practical and theoretical research on the enhancement of the FlexRay fault-tolerance capabilities.

Although FlexRay is still a new protocol in the automotive industry, many works have been conducted by companies and researchers, on the one hand to find out the true potential of the protocol and determine its working features, and on the other hand to improve its reliability and effectiveness. Therefore, in collecting information we decided to adopt a research strategy based on selecting the most reliable work from international conferences, workshops and leading companies in
the field of embedded systems, such as the Real-Time Systems Symposium (RTSS), the Euromicro Conference on Real-Time Systems (ECRTS), the International Workshop on Automated Verification of Critical Systems (AVoCS), the Real-Time and Embedded Technology and Applications Symposium (RTAS), the IEEE Computer Society and many others.

Moreover, our research strategy focuses on selecting works on the reliability and fault-tolerance aspects of FlexRay that try to estimate its capacities and propose concrete solutions to its weaknesses. As a result of this research, we describe the most interesting outcomes as a kind of insight into the current state of work on FlexRay.

Before going into the individual results, we note a common motivation underlying each work: everyone agrees on the need to determine precisely the true performance, predictability and reliability of the protocol as a mandatory requirement for using FlexRay successfully in safety-critical applications. This common view is due to the fact that FlexRay is becoming the leader among distributed embedded systems targeted at high-performance vehicles.

Several studies, such as [8] and [9], compare the FlexRay protocol with the protocols most popular today in the automotive industry, such as LIN, CAN, TTCAN and others, with the purpose of showing how the flexibility and potential of FlexRay include all the benefits of the other protocols. In addition, other works such as [16] show in practice how it is possible to "migrate" from CAN to FlexRay, explaining the migration requirements, parameter calculation, message analysis, payload optimization and slot size definition; in the end, however, they indicate that there is a big problem in optimizing a FlexRay cycle, namely formalizing the static segment and dynamic segment parameters. The latter is one of the most interesting aspects, on which many researchers have spent their efforts.

For example, in [10] a technique to schedule messages on the FlexRay static segment has been proposed in order to compensate for the protocol's lack of handling of faulty messages due to transient and intermittent faults, which affect the reliability of the communication. The proposed technique generates a schedule on the basis of the probability of failure of the messages, using a heuristic that comes very close to CLP (Constraint Logic Programming) in terms of results but is computationally less expensive.

Regarding message scheduling, a good contribution has been given by [11], where the authors suggest different techniques to analyse the timing properties of both the static and the dynamic segment of a FlexRay communication cycle.

In more depth, for the timing properties of the static segment, an algorithm that builds the static schedule has been proposed and analysed. For the dynamic segment, several factors that can affect the worst-case response time have been analysed in three different approaches: an optimal (OO), a heuristic (HH) and a holistic (OH) solution. OO uses an ILP formulation, HH treats the problem as a bin-covering problem, and OH further reduces the time of HH by partially using an ILP formulation. All the proposed analyses are based on extensive experiments.

In another article [12], closely related to the previous one [11] and written by almost the same authors, a further step toward an efficient use of FlexRay is taken. While the first article bounds the message transmission time on both the ST and DYN segments, the second one focuses on finding the right bus configuration for a particular application in order to meet all the time constraints.

This purpose is achieved by providing four techniques extensively tested by the authors:

  1. The Basic Bus Configuration (BBC), which results from analysing the minimal bandwidth requirements of the application;
  2. The OBC heuristic with curve fitting (OBCCF), which, instead of exhaustively performing the scheduling for all possible values of the DYN segment length, evaluates the response time for only some values and then extrapolates the response time for the other points with a curve-fitting approach (this is based on the regularity of the dependence of response time on the size of the DYN segment, noticed in several experiments and depicted by the following picture);
Figure 4

  3. The OBC heuristic with an exhaustive exploration of the size of the DYN segment;
  4. The Simulated Annealing (SA) based design space exploration, used to provide a baseline for the evaluation of the proposed heuristics.

The results of the experiments conducted by the authors can be summarised by the following picture, taken from the same article:

Figure 5

As these studies have shown, designing the FlexRay schedule is a complex operation, not only because it must guarantee the tight time constraints and performance required by some automotive applications, but also because, in order to increase reusability and reduce validation time, it has to cope with the continuous and rapid changes in electronic control features. This means elaborating a schedule that takes a certain amount of uncertainty into account. In [13] the info-gap technique is presented with the purpose of generating different schedules with a degree of robustness related to different ranges of uncertainty. More precisely, the uncertainty analysed is in the payloads of the messages, but the same approach can also be used for uncertainty related to the dependency between tasks and messages, to the period (rate of task execution or message transmission) and to the topology (mapping of tasks to hosts and messages to channels).

So far we have discussed only the message scheduling problems in a system that uses the FlexRay communication protocol, but there are many other issues pointed out by other works that need particular attention. Most of these are related, for example, to the Byzantine fault, which is very common in distributed systems.

A Byzantine fault occurs when a faulty node corrupts its local state and sends arbitrary messages. To deal with this problem one can use a Byzantine fault tolerance (BFT) technique, which masks a bounded number of Byzantine faults, e.g. using state machine replication, or a detection technique, which equips each node with a detector in order to monitor other nodes and isolate the nodes with faulty behaviour. A formal study of these techniques has been conducted in [14]; it finds that the first technique is stronger than the second one, but an analysis of the trade-off between them shows that:

  • detection requires f+1 replicas versus the 3f+1 of BFT in order to cope with f concurrent faults;
  • detection systems need only be provisioned for the average load, while a BFT system must be provisioned for the peak load;
  • detection is cheaper.

In addition to this analysis, in the same article the authors propose a sketch of a system that implements a Byzantine fault detector providing accountability, completeness and accuracy.

Against Byzantine faults, the FlexRay system can be equipped with an additional module placed between the bus controller and the network: the Bus Guardian.
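The replication figures quoted from [14] (f+1 replicas for detection versus 3f+1 for BFT) can be made concrete with a few lines of Python. This is only an arithmetic illustration of those two formulas; the function names are hypothetical and nothing here models an actual detector or BFT protocol.

```python
def replicas_for_detection(f):
    """Replicas needed by a detection scheme to cope with f concurrent faults."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return f + 1


def replicas_for_bft(f):
    """Replicas needed by a masking BFT scheme to cope with f concurrent faults."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 3 * f + 1


def replication_table(max_faults):
    """Rows of (f, detection replicas, BFT replicas) for f = 1..max_faults."""
    return [(f, replicas_for_detection(f), replicas_for_bft(f))
            for f in range(1, max_faults + 1)]


for row in replication_table(3):
    print(row)
```

For f = 2, the same two Byzantine faults that the clock synchronisation claim refers to, detection would need 3 replicas while masking would need 7, which is the cost gap behind the statement that detection is cheaper.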
The functionality of the Bus Guardian has already been described in the previous section of the document, but the FlexRay specification does not give any proof of its functionality. In this regard, in [15] four properties have been identified and formally proved:

  1. Correct Relay,
  2. Validity,
  3. Agreement,
  4. Integrity.

Moreover, regarding Byzantine faults, the FlexRay specification claims that up to two Byzantine faults can be tolerated thanks to the clock synchronization algorithm, but this property also has to be proved, and the author of the previous article ([15]) is currently working on this problem as well.

Conclusion

The FlexRay communications bus is a deterministic, fault-tolerant and high-speed bus system with high performance, and it has an increasingly promising future in real-time distributed systems, especially in the automotive industry. The dual-channel topology offers enhanced fault tolerance and increased bandwidth: it provides message redundancy through double transmission, which increases reliability, or the two channels can be used only to increase the bandwidth, without redundant messages. FlexRay has a good mechanism for handling errors (i.e. the three-level error model), which provides a self-diagnostic mechanism for possible errors.

References

[1] Introduction to FlexRay and TTA, Peter Bohm, November 21, 2005.
[2] An Investigation of the Clique Problem in FlexRay, P. Milbredt, M. Horauer, A. Steininger, IEEE, 2008.
[3] On the Formal Verification of the FlexRay Communication Protocol, Bo Zhang, AVoCS 2006.
[4] Protocol Overview, C. Temple, Motorola, FlexRay International Workshop, Detroit, 2003.
[5] The FlexRay Protocol, P. Koopman, Carnegie Mellon, 2010.
[6] Seminar FlexRay, Robert Rieb, Chemnitz University, 2009.
[7] FlexRay Automotive Communication Bus Overview, National Instruments ("NI").
[8] Comparison of FieldBus Systems CAN, TTCAN, FlexRay and LIN in Passenger Vehicles, Steve C. Talbot, Shangping Ren, 29th IEEE International Conference on Distributed Computing Systems Workshops, Montreal, Quebec, Canada, June 22-26, 2009.
[9] In-Vehicle Networking, frescale.com.
[10] Scheduling for Fault-Tolerant Communication on the Static Segment of FlexRay, Bogdan Tanasa, Unmesh D. Bordoloi, Petru Eles, Zebo Peng, 31st IEEE Real-Time Systems Symposium, 2010.
[11] Timing Analysis of the FlexRay Communication Protocol, Traian Pop, Paul Pop, Petru Eles, Zebo Peng, Alexandru Andrei, Real-Time Systems Journal, Volume 39, Numbers 1-3, pp. 205-235, August 2008.
[12] Bus Access Optimisation for FlexRay-based Distributed Embedded Systems, Traian Pop, Paul Pop, Petru Ion Eles and Zebo Peng, Design, Automation, and Test in Europe Conference, DATE07.
[13] Computing Robustness of FlexRay Schedules to Uncertainties in Design Parameters, A. Ghosal, H. Zeng, Y. Ben-Haim, M. Di Natale, DATE 10, 2010.
[14] The Case for Byzantine Fault Detection, Andreas Haeberlen, Petr Kouznetsov, Peter Druschel, HOTDEP06, Proceedings of the 2nd Conference on Hot Topics in System Dependability, Volume 2, 2006.
[15] On the Formal Verification of the FlexRay Communication Protocol, Bo Zhang, Automatic Verification of Critical Systems (AVoCS), 2006, pp. 184-189.
[16] Migration Framework from CAN to FlexRay, Richard Murphy, Frank Walsh and Brendan Jackman, Automotive Control Group, Waterford Institute of Technology, Cork Road, Waterford, Ireland.