Session 1: Introduction to Concurrent Programming



A short introduction to concurrent/parallel programming, some formalisation of systems engineering, and the formal development of OpenComRTOS.



  1. 1. Introduction Concurrent Programming
     Eric Verhulst, "From Deep Space to Deep Sea"
     Content:
     • The von Neumann legacy
     • Using more than one processor
     • On concurrent multi-tasking
     • How to program these things?
     • From MP to VSP: the Virtual Single Processor
     • Embedded Systems Engineering
     • The formal development of OpenComRTOS
     • Conclusion
     11 January 2012, MC4ES workshop
  2. 2. The von Neumann ALU vs. an embedded processor
     • The sequential programming paradigm is based on the von Neumann architecture
       • But this was only meant for a single ALU
     • A real processor in an embedded system:
       • inputs data, processes the data (only this is covered by von Neumann), outputs the result
     • In other words: at least two communications for, often, one computation
     • => The communication/computation ratio must be > 1 (in the optimal case)
     • Standard programming languages (C, Java, …) only cover the computation and sometimes limited runtime multitasking
     • We have an imbalance, and have been living with it for decades
     • Reason? History:
       • Computer scientists use workstations (the eye's Nyquist frequency is 50 Hz)
       • Only embedded systems must process data in real time
       • Embedded systems were first developed by hardware engineers

     Multi-tasking: doing more than one thing at a time
     • Origin:
       • A software solution to a hardware limitation
       • von Neumann processors are sequential; the real world is "parallel" by nature, and software is just modelling it
       • Developed out of industrial needs
     • How?
       • A function is a [callable] sequential stream of instructions
       • It uses resources [mainly registers] => this defines its "context"
       • Non-sequential processing =
         • switching ownership of the processor(s)
         • reducing overhead by using idle time or avoiding active waiting
         • each function has its own workspace
         • a task = a function with its own context and workspace
       • Scheduling to achieve real-time behaviour for each task
         • preemption capability: switch asynchronously
  3. 3. Scheduling approaches (1)
     • Dominant real-time/scheduling system-view paradigms:
       • Control flow:
         • event-driven, asynchronous: latency is the issue
         • traverse the state machine
         • uncovered states generate complexity
       • Data flow:
         • data-driven: throughput is the issue
         • multi-rate processing generates complexity
       • Time-triggered:
         • play safe: allocate timeslots beforehand
         • reliable if the system is predictable and stationary
       • Static scheduling:
         • play safe: a single pre-calculated thread
         • reliable if there are no SW errors and no HW failures

     Scheduling approaches (2)
     • REAL SYSTEMS:
       • are a combination of the above
       • the distinction is mainly an implementation and style issue, not a conceptual one
     • SCHEDULING IS AN ORTHOGONAL ISSUE TO MULTI-TASKING
       • multi-tasking gives the logical behaviour
       • scheduling gives the timely behaviour
  4. 4. Classical (RT) scheduling (1)
     • Superloop:
       • a single endless loop: [test, call function]
     • Round-robin, FIFO:
       • first in, first served
       • cooperative multi-tasking: run until a descheduling point, no preemption
     • Priority-based, pre-emptive:
       • tasks run, when executable, in order of priority
       • a task can be descheduled at any moment
       • Rate Monotonic Analysis (Liu & Stankovic): CPU load < 70 %
       • simplified model: tasks are independent
       • graceful degradation by design

     Classical (RT) scheduling (2)
     • Earliest deadline first (EDF):
       • the most dynamic form: deadlines are time-based
       • but complex to use and implement
       • time granularity is an issue (hardware lacks support)
       • catastrophic failure mode
     • For MP systems:
       • no real generic solution for embedded
       • IT solutions focus on QoS and soft real-time, with high overhead
       • higher need for error recovery
       • graceful degradation: guarantee QoS even if HW resources fail
       • RMA (priority-based, preemptive) is still the best approach
  5. 5. Scheduling visualized: tracer

     Using more than one processor
  6. 6. Why Multi-Processing?
     • Laws of diminishing returns:
       • Power consumption increases more than linearly with speed (F², V)
       • Highest (peak) speed is achieved by micro-parallel tricks:
         • pipelining, VLIW, out-of-order execution, branch prediction, …
         • efficiency depends on the application code
         • requires higher frequencies and many more gates
       • Creates new bottlenecks:
         • I/O and communication become bottlenecks
         • memory access speed is slower than ALU processing speed
     • Result:
       • 2 CPUs @ 1F Hz can be better than 1 CPU @ 2F Hz if the communication support (HW and SW) is adequate
     • The catch:
       • not supported by the von Neumann model
       • scheduling, task partitioning and communication are inter-dependent
       • BUT SCHEDULING IS NOT ORTHOGONAL TO PROCESSOR MAPPING AND INTERPROCESSOR COMMUNICATION

     Generic MP system
     • There is no conceptual difference between:
       • Multicore (often assumed homogeneous)
       • Manycore (often assumed heterogeneous)
       • Parallel (closer to reality)
       • Distributed
       • Networked
     • The difference is the implementation architecture:
       • shared vs. distributed memory
       • SMP vs. SIMD vs. MIMD
     • Note: PCIe (RapidIO, …) emulates a shared bus using very fast point-to-point bit-serial lanes
  7. 7. Example: A Rack in a (MP)-SoC (1)

     Example: A Rack in a (MP)-SoC (2)
  8. 8. Why MP-SoC is now standard
     • High NRE, high-frequency signals
     • Conclusion:
       • multi-core, coarse-grained asynchronous SoC design
       • cores as proven components -> well-defined interfaces
       • keep critical circuits inside
       • simplify I/O: high-speed serial links, no buses
       • NRE dictates high volume -> more reprogrammability
       • the system is now a component
       • below minimum thresholds of power and cost, it becomes cheap to "burn" gates
       • software becomes the differentiating factor

     On concurrent multi-tasking
  9. 9. On concurrent multi-tasking (1)
     • Tasks need to interact:
       • synchronise
       • pass data = communicate
       • share resources
     • A task = a virtual single processor, or unit of abstraction
     • A (SW) multi-tasking system can emulate a (HW) real system
     • Multi-tasking needs communication services

     On concurrent multi-tasking (2)
     • Theoretical model:
       • CSP: Communicating Sequential Processes (and its variations), C.A.R. Hoare
       • CSP := sequential processes + channels (cf. occam)
       • channels := synchronised (blocking) communication, no protocol
       • formal, but doesn't match the complexity of the real world
     • Generic model: module-based, multi-tasking-based, process-oriented, …
       • The generic model matches the reality of the MP-SoC
       • Very powerful for breaking the von Neumann constrictor
  10. 10. There are only programs
     • The simplest form of computation is assignment: a := b
     • Semi-formal:
       BEFORE: a = UNDEF;    b = VALUE(b)
       AFTER:  a = VALUE(b); b = VALUE(b)
     • Implementation on a typical von Neumann machine:
       Load  b, register X
       Store X, a

     CSP explained in occam
       PROC P1, P2 :
       CHAN OF INT32 c1, c2 :
       PAR
         P1(c1, c2)
         P2(c1, c2)
       /* c1 ? a : read from channel c1 into variable a       */
       /* c1 ! a : write variable a into channel c1           */
       /* order of execution is not defined by a clock but by */
       /* channel communication: execute when data is ready   */
     Needed:
     • context: a in P1, b in P2
     • communication: channels c1, c2
  11. 11. A small parallel program
     No assumption in the PAR case about the order of execution => self-synchronising
       PAR
         -- P1                -- P2
         INT32 a :            INT32 b :
         SEQ                  SEQ
           a := ANY             b := ANY
           c1 ! a               c1 ? b
     Equivalent:
       INT32 a, b :
       SEQ
         a := ANY
         b := ANY
         b := a

     The PAR version at von Neumann machine level
     PROC_1:
       Load  a, register X
       Store X, output register
       (hidden: start channel transfer)
       (hidden: transfer control to PROC_2)
     PROC_2:
       (hidden: detect channel transfer)
       Load  input register, X
       Store X, b
     In between:
       • data moves from the output register to the input register
       • the sequential case is an optimisation of the parallel case
       • but there are hidden assumptions about data validity!
  12. 12. The same program for hardware with Handel-C
     void main(void)
     {
       par  /* WILL GENERATE PARALLEL HW (1 clock cycle) */
       {
         chan chan_between;
         int a, b;
         chan_between ! a;
         chan_between ? b;
       }
     }
     But:
       seq  /* WILL GENERATE SEQUENTIAL HW (2 clock cycles) */
       {
         chan chan_between;
         int a, b;
         chan_between ! a;
         chan_between ? b;
       }

     Consequences
     • Data is protected inside the scope of a process
     • Interaction is through explicit communication
     • To safeguard the abstract equivalence:
       • a communication backbone is needed
       • automatic routing is needed (but deadlock-free)
       • a process scheduler if on the same processor
     • To safeguard real-time behaviour:
       • prioritisation of communication for dynamic applications (or a schedule)
     • To handle multi-byte communication:
       • buffering at the communication layer, packetisation, DMA in the background
     • Result:
       • prioritised packet switching: header, priority, payload
       • communication is not fundamentally different from data I/O
  13. 13. The (next-generation) SoC
     Xilinx Zynq: dual-core ARM + FPGA

     The (next-generation) SoC
     Altera: dual-core ARM + FPGA
  14. 14. How to program these things?

     Where's the (new) programming model?
     • Issue: what's wrong with the "good old" software?
       • => von Neumann => the shared-memory syndrome
       • the issue is not access to memory but integrity of memory
       • the issue is not bandwidth to memory, but latency
       • sequential programs have lost the information about the inherent parallelism of the problem domain
     • Most attempts (MPI, …) just add a heavy communication layer
       • Issue: the underlying hardware is still visible
       • Difficult for:
         • porting to another target; scalability (from small to large AND vice versa)
         • application-domain-specific code; performance doesn't scale
  15. 15. Beyond concurrent multi-tasking
     • Multi-tasking = Process-Oriented Programming
     • A Task =
       • a unit of execution
       • encapsulated functional behaviour
       • modular programming
     • High-Level [Programming] Language:
       • a common specification:
         • for SW: compile to asm
         • for HW: compile to VHDL or Verilog
         • e.g. program a PPC with ANSI C (and an RTOS), an FPGA with Handel-C
       • C-level design is an enabler for SoC "co-design"
       • more abstraction gives higher productivity
       • but interfaces must be better standardised for better re-use
       • interfaces can be "compiled" for higher-volume applications

     Multi-tasking API: an orthogonal set
     • Events: binary data, one-to-one, local -> interface to HW
     • [Counting] semaphores: events with a memory, many-to-many, distributed
     • FIFO queues: simple data communication between tasks, many-to-many, distributed
     • Mailbox/Port: variable-size data communication between tasks, rendez-vous, one/many-to-one/many, distributed
     • Resources: ownership protection, priority inheritance
     • Memory maps/pools
     • Semantic issues: distributed operation, blocking, non-blocking, time-out, asynchronous operation, group operators
       • L1_memcpy_[A][WT] (not just memcpy!)
       • L1_SendPacketToPort_[NW][W][T][A]
  16. 16. From MP to VSP: the Virtual Single Processor

     Virtual Single Processor (VSP) model
     • Transparent parallel programming
       • cross-development on any platform + portability
       • scalability, even on heterogeneous targets
     • Distributed semantics
       • program logic is neutral to topology and object mapping
       • a clean API provides for fewer programming errors
       • prioritised packet-switching communication layer
     • Based on CSP (C.A.R. Hoare), Communicating Sequential Processes: VSP is a pragmatic superset
     • Implemented first in the Virtuoso RTOS (now WRS/Intel) -> OpenComRTOS
       • multitasking and message passing
       • process-oriented programming
       • interfacing using communication protocols
       • the application doesn't need to know the physical layer
  17. 17. [diagram slides, no text]
  18. 18. Full application: Matlab/Simulink-type design
     [Diagram: an embedded DSP application with a GUI front-end. Virtuoso tasks and communication channels are mapped onto a specific DSP card (DSP 1–4): an ADC driver task reads audio data, a task splits the L/R channels, processing stages 1–6 run per channel across the DSPs, a channel-joiner task recombines them and a DAC driver task plays the audio. A parameter-settings & control task and a monitor task connect to the GUI front-end (parameter knobs, monitor windows, etc.). The front-end can be written in any language and run remotely.]

     Homogeneous OpenComRTOS
     Block diagram at the top level; executable spec in e.g. C
  19. 19. Heterogeneous OpenComRTOS with VSP and a host OS

     Next: heterogeneous VSP with reprogrammable HW
  20. 20. Conclusion
     • An RTOS is much more than real-time:
       • general-purpose "process-oriented" design and programming
       • hide complexity inside the chip for hardware (in the SoC)
       • hide complexity inside the task for software (with the RTOS)
       • hide the complexity of communication in system-level support
     • CSP provides a unified theoretical base for hardware and software; an RTOS makes it pragmatic for the real world:
       • "DESIGN PARALLEL, OPTIMIZE SEQUENTIALLY"
     • Software meets hardware with the same development paradigm:
       • Handel-C for FPGA, "parallel" C for SW
     • FPGA with macro-blocks is the next-generation SW-defined SoC
     • Time for asynchronous HW design?

     Embedded Systems Engineering
  21. 21. Engineering methodology
     [Diagram: GoedelWorks© project repository with formalised requirements & specifications capturing, formalised modelling, simulation, code generation and a test harness; OpenVE© for modelling user applications; OpenComRTOS©, formally developed, as runtime support for concurrency and communication; StarFish Controller©, a SIL-enabled control & processing platform with support for fault tolerance. Unifying repository, unified semantics, unified architectural paradigm: Interacting Entities.]

     Automating a complex process
     [Diagram: a four-phase development flow, from system requirements and specifications capturing through system architecting (with HARA, FMEA and safety analysis), software architectural design, implementation, integration, verification and test, to system validation and maintenance, including design releases, source code, test cases, test procedures, test results, fault cases and user manual. The cost of issues grows steeply from phase 1 to phase 4. The ASIL process flow identified 2800 process requirements.]
  22. 22. The OpenComRTOS approach
     • Derived from a unified systems engineering methodology
     • Two keywords:
       • Unified Semantics:
         • use of a common "systems grammar"
         • covers requirements, specifications, architecture, runtime, …
       • Interacting Entities (models almost any system)
     • RTOS and embedded systems:
       • map very well onto "interacting entities"
       • time and architecture are mostly orthogonal
       • the logical model is not communication but "interaction"

     The OpenComRTOS project
     • Target systems:
       • multi-/many-core, parallel processors, networked systems, including "legacy" processing nodes running an old (RT)OS
     • Methodology:
       • formal modelling and formal verification
     • Architecture:
       • the target is multi-node, hence communication is a system-level issue, not a programmer's concern
       • scheduling is an orthogonal issue
       • an application function = a "task" or a set of "tasks"
         • composed of sequential "segments"
         • in between, tasks synchronise and pass data ("interaction")
  23. 23. TLA+ used for formal modelling
     • TLA (the Temporal Logic of Actions) is a logic for specifying and reasoning about concurrent and reactive systems
     • UPPAAL was also used, for the time-out services

     Graphical view of RTOS "Hubs"
     [Diagram: a generic Hub (N-N) with waiting lists (W, L, T), a synchronisation threshold, a synchronising predicate, a synchronisation predicate action, a buffer list where data needs to be buffered, a count for semaphores, and an owner task with ceiling priority (priority inheritance) for resources. Similar to Guarded Actions; a pragmatic superset of CSP.]
  24. 24. All RTOS entities are "Hubs"

     OpenComRTOS application view: any entity can be mapped onto any node
  25. 25. Unexpected: an RTOS 10x smaller
     • Reference is the Virtuoso RTOS (ex Eonic Systems)
     • Benefits of the new architecture:
       • much easier to port
       • same functionality (and more) in 10x less code
       • smallest size SP: 1 KByte program, 200 bytes of RAM
       • smallest size MP: 2 KBytes
       • full version MP: 5 KBytes, growing to 8 KB for a complex SoC
     • Why is small better?
       • much better performance (fewer instructions)
       • frees up more fast internal memory
       • easier to verify and modify
     • The architecture allows new services without changing the RTOS kernel task!

     Clean architecture gives small code
     OpenComRTOS L1 code size figures (MLX16), in bytes:
                              MP FULL      SP SMALL
       L0 Port                  162           132
       L1 Hub shared            574           400
       L1 Port                    4             4
       L1 Event                  68            70
       L1 Semaphore              54            54
       L1 Resource              104           104
       L1 FIFO                  232           232
       L1 Resource List         184           184
       Total L1 services       1220          1048
       Grand total (L0 / L1)   3150 / 4532    996 / 2104
     Smallest application: 1048 bytes of program code and 198 bytes of RAM (data)
     (SP, 2 tasks with 2 Ports sending/receiving Packets in a loop, ANSI C)
     Number of instructions: 605 instructions for one loop
     (= 2 x context switch, 2 x L0_SendPacket_W, 2 x L0_ReceivePacket_W)
  26. 26. Probably the smallest MP demo in the world
                                          Code size   Data size
       Platform firmware                     520          0
       2 application tasks,                  230       1002, of which:
       2 UART driver tasks,                            - kernel stack: 100
       kernel task,                          338       - task stacks: 4 x 64
       idle task                                       - ISR stack: 64
                                                       - idle stack: 50
       OpenComRTOS full MP
       (_NW, _W, _WT, _A)                   3500        568
       Total                          4138 + 520       1002 + 568

     Conclusions
     • Don't be afraid of multi-, many-, parallel-, distributed-, network-centric programming
     • It's more natural than sequential programming
     • The benefits are: scalability, faster development, reuse, …
  27. 27. The book
     Formal Development of a Network-Centric RTOS: Software Engineering for Reliable Embedded Systems
     Verhulst, E., Boute, R.T., Faria, J.M.S., Sputh, B.H.C., Mezhuyev, V. (Springer)
     Documents the project (IWT co-funded) and the design, formal modelling and implementation of OpenComRTOS