Analysing DIRAC's Behavior using Model Checking with Process Algebra
Daniela Remenska - Jeff Templon - Tim Willemse - Henri Bal - Kees Verstoep - Wan Fokkink
Philippe Charpentier - Ricardo Graciani - Elisa Lanciotti - Krzysztof Daniel Ciba - Stefan Roiser


Motivation

DIRAC background
▪ production activities and user analysis for LHCb
▪ distributed services and light-weight agents
▪ "blackboard" or "shared-memory" paradigm

Figure 1: DIRAC subsystems

▪ jobs often get into incorrect (or inconsistent) states
▪ staging requests become stuck
▪ difficult to trace the root of such unexpected behavior: many scenarios and components
▪ manual intervention necessary

There are formal or systematic approaches to tackle this!


From DIRAC to mCRL2

DIRAC (Python): ~150,000 lines of code
How the implementation is abstracted depends on the focus of the analysis.
Check for race conditions: agents update the state of shared entities.
Systems: Storage and Workload Management
Entities: Jobs, Cache-Replicas, Tasks

Figure 2: Job state machine

Agents and storage become processes.
Control flow is abstracted using mCRL2 non-deterministic choice and if-then-else constructs.
States of entities are described using custom abstract data types.


Verification

▪ Properties (Safety / Progress / Deadlock): the model checker checks them automatically.
▪ Property violated: a counter-example trace is provided.

Figure 6: Violation of progress and safety requirements
Figure 7: "Zombie" job starts running after being killed
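The verification idea (explore every interleaving, check a safety property, report a counter-example trace) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it is neither DIRAC code nor the mCRL2 model, it uses a plain breadth-first search instead of the mCRL2 toolset, and the names `State`, `starter` and `killer` are hypothetical. A "starter" agent reads the shared job status and later writes "Running", while a "killer" writes "Killed". The safety property is that a job that was killed is never "Running".

```python
from collections import deque
from typing import NamedTuple


class State(NamedTuple):
    """Global state of the toy model (all fields are hypothetical)."""
    status: str         # shared job status on the "blackboard"
    starter_pc: int     # program counter of the starter agent
    saw_waiting: bool   # starter's local (possibly stale) copy of the status
    killer_pc: int      # program counter of the killer
    killed: bool        # history flag: a kill has been issued


INITIAL = State("Waiting", 0, False, 0, False)


def successors(s, atomic):
    """Yield (action, next_state) for every atomic step enabled in s."""
    if s.starter_pc == 0:
        if atomic:
            # Check and update in one indivisible step.
            if s.status == "Waiting":
                yield "starter.test_and_set(Running)", s._replace(
                    status="Running", starter_pc=2)
            else:
                yield "starter.skip", s._replace(starter_pc=2)
        else:
            # Step 1 of 2: read the shared status into a local variable.
            yield "starter.read", s._replace(
                starter_pc=1, saw_waiting=(s.status == "Waiting"))
    elif s.starter_pc == 1:
        # Step 2 of 2: act on the local copy, which may be stale by now.
        if s.saw_waiting:
            yield "starter.set(Running)", s._replace(
                status="Running", starter_pc=2)
        else:
            yield "starter.skip", s._replace(starter_pc=2)
    if s.killer_pc == 0:
        yield "killer.set(Killed)", s._replace(
            status="Killed", killer_pc=1, killed=True)


def violates_safety(s):
    """Safety property: a killed job must never be running."""
    return s.killed and s.status == "Running"


def find_counterexample(atomic=False):
    """Breadth-first search over all interleavings.

    Returns the shortest action trace to a violating state, or None
    if no reachable state violates the property.
    """
    parent = {INITIAL: None}          # state -> (previous state, action)
    queue = deque([INITIAL])
    while queue:
        s = queue.popleft()
        if violates_safety(s):
            trace = []
            while parent[s] is not None:
                s, action = parent[s]
                trace.append(action)
            return trace[::-1]
        for action, nxt in successors(s, atomic):
            if nxt not in parent:
                parent[nxt] = (s, action)
                queue.append(nxt)
    return None


if __name__ == "__main__":
    print(find_counterexample(atomic=False))
    print(find_counterexample(atomic=True))
```

With the non-atomic starter, the search returns the trace `starter.read`, `killer.set(Killed)`, `starter.set(Running)`, a "zombie" scenario in this toy model. With the atomic variant no violating state is reachable and the function returns `None`. Real models have far larger state spaces, which is where dedicated tools and abstraction become necessary.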


Why Formal Methods?

▪ Based on process algebra laws: no ambiguity
▪ Model checking tools: full control over the execution of parallel processes. This way one gains more insight into the system behavior.
▪ Automatically explore the entire state space and check whether some "interesting" properties hold.
▪ Stronger than testing

Some drawbacks...
▪ Abstraction of the "real" behavior is needed. This means one must build a sound model.
▪ Expertise in formal methods and the system domain is necessary.
▪ The state space of the model can explode.


Language & Toolset

▪ Actions: atomic building blocks; can carry data parameters
▪ Processes: composed of actions, using algebra operators
▪ Built-in data types: integers, booleans, lists, sets, bags
▪ Abstract data types


State-space generation

Figure 3: State-space visualisation with LTSView


Analysis & Issues

Problems can be discovered while building and debugging the model (Figures 4a, 4b and 5).

Figure 4a: XSim simulator trace of a job workflow
Figure 4b: DIRAC logging info of a job workflow
Figure 5: State-transition visualisation with DiaGraphica


Conclusions

▪ Distributed systems are difficult to reason about: many components, all running in parallel.
▪ Formal methods are a more rigorous complement to testing, as a way to improve software quality.
▪ A sound model needs to be written manually. This requires experience and can be error-prone.
▪ Similar techniques can be re-applied to similar systems, once the learning curve has been overcome.


Future Work

▪ Automate (to some degree) the translation from code to model.

Poster, CHEP 2012
