SlideShare a Scribd company logo
1 of 87
Download to read offline
2011 Android System Developer Forum




Is Parallel Programming Hard, And If So,
What Can You Do About It?


        Paul E. McKenney
        IBM Distinguished Engineer & CTO Linux
        Linux Technology Center



         April 28, 2011                        Copyright © 2011 IBM IBM Corporation
                                                                 © 2002
IBM Linux Technology Center



Who is Paul and How Did He Get This Way?




   “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   2
IBM Linux Technology Center



Who is Paul and How Did He Get This Way?

 Grew up in rural Oregon, USA
 First use of computer in high school (72-76)
      IBM mainframe: punched cards and FORTRAN
      Later ASR-33 TTY and BASIC
 BSME & BSCS, Oregon State University (76-81)
      Tuition provided by FORTRAN and COBOL
 Contract Programming and Consulting (81-85)
      Building control system (Pascal on z80)
      Security card-access system (Pascal on PDP-11)
      Dining hall system (Pascal on PDP-11)
      Acoustic navigation system (C on PDP-11)


        “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   3
IBM Linux Technology Center



Who is Paul and How Did He Get This Way?

 SRI International (85-90)
      UNIX systems administration
      Packet-radio research
      Internet protocol research
      MSCS Computer Science (88)
 Sequent Computer Systems (90-00)
      Communications performance
      Parallel programming: memory allocators, RCU, ...
 IBM LTC (00-present)
      NUMA-aware and brlock-like locking primitive in AIX
      RCU maintainer for Linux kernel
      Ph.D. Computer Science and Engineering (04)



        “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   4
IBM Linux Technology Center



Who is Paul and How Did He Get This Way?

 SRI International (85-90)
      UNIX systems administration
      Packet-radio research
      Internet protocol research
      MSCS Computer Science (88)
 Sequent Computer Systems (90-00)
      Communications performance
      Parallel programming: memory allocators, RCU, ...
 IBM LTC (00-present)
      NUMA-aware and brlock-like locking primitive in AIX
      RCU maintainer for Linux kernel
      Ph.D. Computer Science and Engineering (04)
 Paul has been doing parallel programming for 20 years

        “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   5
IBM Linux Technology Center



Who is Paul and How Did He Get This Way?

 SRI International (85-90)
      UNIX systems administration
      Packet-radio research
      Internet protocol research
      MSCS Computer Science (88)
 Sequent Computer Systems (90-00)
      Communications performance
      Parallel programming: memory allocators, RCU, ...
 IBM LTC (00-present)
      NUMA-aware and brlock-like locking primitive in AIX
      RCU maintainer for Linux kernel
      Ph.D. Computer Science and Engineering (04)
 Paul has been doing parallel programming for 20 years
      What is so hard about parallel programming, then???
        “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   6
IBM Linux Technology Center



Why Has Parallel Programming Been Hard?




   “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   7
IBM Linux Technology Center



Why Has Parallel Programming Been Hard?

 Parallel systems were rare and expensive
 Very few developers had access to parallel
  systems: little ROI for parallel languages & tools
 Almost no publicly accessible parallel code
 Parallel-programming engineering discipline
  confined to a few proprietary projects
 Technological obstacles:
      Synchronization overhead
      Deadlock
      Data races

        “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   8
IBM Linux Technology Center



Parallel Programming: What Has Changed?

 Parallel systems were rare and expensive
 About $1,000,000 each in 1990
 About $100 in 2011
      Four orders of magnitude decrease in 21 years
      From many times the price of a house to a small
       fraction of the cost of a bicycle
 In 2006, masters student I was collaborating
 with bought a dual-core laptop
      And just to be able to present a parallel-
       programming talk using a parallel programmer


        “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   9
IBM Linux Technology Center



Parallel Programming: Rare and Expensive




     “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   10
IBM Linux Technology Center



Parallel Programming: Rare and Expensive




     “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   11
IBM Linux Technology Center



Parallel Programming: Cheap and Plentiful




     “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   12
IBM Linux Technology Center



Parallel Programming: What Has Changed?

 Back then: very few developers had access to
 parallel systems
      Maybe 1,000 developers with parallel systems
      Suppose that a tool could save 10% per developer
        • Then it would not make sense to invest 100 developers
        • Even assuming compatible environments: 'Fraid not!!!
 Now: almost anyone can have a parallel system
      More than 100,000 developers with parallel systems
        • Easily makes sense to invest 100 developers for 10% gain
        • Even assuming multiple incompatible environments
 We can expect many more parallel tools
      Linux kernel lockdep, lock profiling, ...
        “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   13
IBM Linux Technology Center



Why Has Parallel Programming Been Hard?

 Almost no publicly accessible parallel code
 Parallel-programming engineering discipline
 confined to a few proprietary projects
      Linux kernel: 13 million lines of parallel code
      MariaDB (the database formerly known as MySQL)
      memcached
      Apache web server
      Hadoop
      BOINC
      OpenMP
 Lots of parallel code for study and copying
        “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   14
IBM Linux Technology Center



Why Has Parallel Programming Been Hard?

 Technological obstacles:
      Synchronization overhead
      Deadlock
      Data races




        “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   15
IBM Linux Technology Center



Why Has Parallel Programming Been Hard?

 Technological obstacles:
      Synchronization overhead
      Deadlock
      Data races




 If these obstacles are impossible to overcome,
 why are there more than 13 million lines of
 parallel Linux kernel code?


        “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   16
IBM Linux Technology Center



Why Has Parallel Programming Been Hard?

 Technological obstacles:
      Synchronization overhead
      Deadlock
      Data races




 If these obstacles are impossible to overcome,
 why are there more than 13 million lines of
 parallel Linux kernel code?
      But parallel programming is harder than sequential
       programming...
        “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   17
IBM Linux Technology Center



      So Why Parallel Programming?




“Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   18
IBM Linux Technology Center



Why Parallel Programming? (Party Line)




     “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   19
IBM Linux Technology Center



Why Parallel Programming? (Reality)

 Parallelism can increase performance
      Assuming...
 Parallelism may be important battery-life extender
      Run two CPUs at half frequency
      25% power per CPU, 50% overall with same throughput
       assuming perfect scaling
        • And assuming CPU consumes most of the power
        • Which is not the case on many embedded systems
 But parallelism is but one way to optimize
 performance and battery lifetime
      Hashing, search trees, parsers, cordic algorithms, ...


        “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   20
IBM Linux Technology Center



Why Parallel Programming? (Reality)




    If your code runs well enough single threaded,
            then leave it be single threaded.

 And you always have the option of implementing key
               function in hardware.

  But where it makes sense, parallelism can be quite
                      valuable.



      “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   21
IBM Linux Technology Center



           Parallel Programming Goals




“Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   22
IBM Linux Technology Center



Parallel Programming Goals




                                        Performance



                  Productivity                                    Generality




     “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   23
IBM Linux Technology Center



Parallel Programming Goals: Why Performance?

 (Performance often expressed as scalability or
 normalized as in performance per watt)

 If you don't care about performance, why are you
 bothering with parallelism???
      Just run single threaded and be happy!!!

 But what about:
      All the multi-core systems out there?
      Efficient use of resources?
      Everyone saying parallel programming is crucial?

 Parallel Programming: one optimization of many
 CPU: one potential bottleneck of many
        “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   24
IBM Linux Technology Center



Parallel Programming Goals: Why Productivity?

 1948 CSIRAC (oldest intact computer)
      2,000 vacuum tubes, 768 20-bit words of memory
      $10M AU construction price
      1955 technical salaries: $3-5K/year
      Makes business sense to dedicate 10-person team to increasing
       performance by 10%
 2008 z80 (popular 8-bit microprocessor)
      8,500 transistors, 64K 8-bit works of memory
      $1.36 per CPU in quantity 1,000 (7 OOM decrease)
      2008 SW starting salaries: $50-95K/year US (1 OOM increase)
      Need 1M CPUs to break even on a one-person-year investment to
       gain 10% performance!
        • Or 10% more performance must be blazingly important
        • Or you are doing this as a hobby... In which case, do what you want!


        “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   25
IBM Linux Technology Center



Parallel Programming Goals: Why Generality?

 The more general the solution, the more users
 to spread the cost over.

                                           Productivity

                                             Application

                                   Middleware (e.g., DBMS)
                     Performance




                                                                             Generality
                                   System Utilities & Libraries

                                             OS Kernel

                                                   FW

                                                   HW


     “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   26
IBM Linux Technology Center



Performance of Synchronization Mechanisms




    “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   27
IBM Linux Technology Center



 Performance of Synchronization Mechanisms

                                       4-CPU 1.8GHz AMD Opteron 844 system
 Need to be here!
(Partitioning/RCU)        Operation                                       Cost (ns)                      Ratio
                          Clock period                                          0.6                          1
                          Best-case CAS                                       37.9                        63.2
                          Best-case lock                                      65.6                       109.3
                          Single cache miss                                  139.5                       232.5
                          CAS cache miss                                     306.0                       510.0

   Heavily optimized reader-
    writer lock might get here
     for readers (but too bad                                                 Typical synchronization
   about those poor writers...)                                               mechanisms do this a lot



            “Is Parallel Programming Hard?” 2011 Android System Development Forum    © 2011 IBM Corporation      28
IBM Linux Technology Center



  Performance of Synchronization Mechanisms

                                        4-CPU 1.8GHz AMD Opteron 844 system
  Need to be here!
 (Partitioning/RCU)        Operation                                       Cost (ns)                      Ratio
                           Clock period                                          0.6                          1
                           Best-case CAS                                       37.9                        63.2
                           Best-case lock                                      65.6                       109.3
                           Single cache miss                                  139.5                       232.5
                           CAS cache miss                                     306.0                       510.0

    Heavily optimized reader-
     writer lock might get here
      for readers (but too bad                                                 Typical synchronization
    about those poor writers...)                                               mechanisms do this a lot


But this is an old system...
             “Is Parallel Programming Hard?” 2011 Android System Development Forum    © 2011 IBM Corporation      29
IBM Linux Technology Center



  Performance of Synchronization Mechanisms

                                        4-CPU 1.8GHz AMD Opteron 844 system
  Need to be here!
 (Partitioning/RCU)        Operation                                       Cost (ns)                      Ratio
                           Clock period                                          0.6                          1
                           Best-case CAS                                       37.9                        63.2
                           Best-case lock                                      65.6                       109.3
                           Single cache miss                                  139.5                       232.5
                           CAS cache miss                                     306.0                       510.0

    Heavily optimized reader-
     writer lock might get here
      for readers (but too bad                                                 Typical synchronization
    about those poor writers...)                                               mechanisms do this a lot


But this is an old system...                                     And why low-level details???
             “Is Parallel Programming Hard?” 2011 Android System Development Forum    © 2011 IBM Corporation      30
IBM Linux Technology Center



Why All These Low-Level Details???

 Would you trust a bridge designed by someone who
 did not understand strengths of materials?
      Or a ship designed by someone who did not understand
       the steel-alloy transition temperatures?
      Or a house designed by someone who did not understand
       that unfinished wood rots when wet?
      Or a car designed by someone who did not understand
       the corrosion properties of the metals used in the exhaust
       system?
      Or a space shuttle designed by someone who did not
       understand the temperature limitations of O-rings?

 So why trust algorithms from someone ignorant of
 the properties of the underlying hardware???
        “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   31
IBM Linux Technology Center



Why Such An Old System???

 Because early low-power embedded CPUs are likely
 to have similar performance characteristics.




      “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   32
IBM Linux Technology Center



But Isn't Hardware Just Getting Faster?




      “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   33
IBM Linux Technology Center



Performance of Synchronization Mechanisms
                                16-CPU 2.8GHz Intel X5550 (Nehalem) System

    Operation                                                        Cost (ns)                        Ratio
    Clock period                                                           0.4                            1
    “Best-case” CAS                                                      12.2                          33.8
    Best-case lock                                                       25.6                          71.2
    Single cache miss                                                    12.9                          35.8
    CAS cache miss                                                         7.0                         19.4



What a difference a few years can make!!!


     “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation           34
IBM Linux Technology Center



Performance of Synchronization Mechanisms
                                16-CPU 2.8GHz Intel X5550 (Nehalem) System

    Operation                                                        Cost (ns)                        Ratio
    Clock period                                                           0.4                            1
    “Best-case” CAS                                                      12.2                          33.8
    Best-case lock                                                       25.6                          71.2
    Single cache “miss”                                                  12.9                          35.8
    CAS cache “miss”                                                       7.0                         19.4
    Single cache miss (off-core)                                         31.2                          86.6
    CAS cache miss (off-core)                                            31.2                          86.5



  Not quite so good... But still a 6x improvement!!!

     “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation           35
IBM Linux Technology Center



Performance of Synchronization Mechanisms
                                16-CPU 2.8GHz Intel X5550 (Nehalem) System

    Operation                                                            Cost (ns)                Ratio
    Clock period                                                               0.4                    1
    “Best-case” CAS                                                          12.2                  33.8
    Best-case lock                                                           25.6                  71.2
    Single cache miss                                                        12.9                  35.8
    CAS cache miss                                                             7.0                 19.4
    Single cache miss (off-core)                                             31.2                  86.6
    CAS cache miss (off-core)                                                31.2                  86.5
    Single cache miss (off-socket)                                           92.4                 256.7
    CAS cache miss (off-socket)                                              95.9                 266.4
                    Maybe not such a big difference after all...
                    And these are best-case values!!! (Why?)
     “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation       36
IBM Linux Technology Center



Performance of Synchronization Mechanisms
                                                         16-CPU 2.8GHz Intel X5550 (Nehalem) System
Most embedded devices here




                             Operation                                                            Cost (ns)                Ratio
                             Clock period                                                               0.4                    1
                             “Best-case” CAS                                                          12.2                  33.8
                             Best-case lock                                                           25.6                  71.2
                             Single cache miss                                                        12.9                  35.8
                             CAS cache miss                                                             7.0                 19.4
                             Single cache miss (off-core)                                             31.2                  86.6
                             CAS cache miss (off-core)                                                31.2                  86.5
                             Single cache miss (off-socket)                                           92.4                 256.7
                             CAS cache miss (off-socket)                                              95.9                 266.4
                                             Maybe not such a big difference after all...
                                             And these are best-case values!!! (Why?)
                              “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation       37
IBM Linux Technology Center



       Performance of Synchronization Mechanisms
                                                         16-CPU 2.8GHz Intel X5550 (Nehalem) System
Most embedded devices here




                             Operation                                                            Cost (ns)                Ratio
                             Clock period                                                               0.4                    1
                             “Best-case” CAS                                                          12.2                  33.8
        For now...




                             Best-case lock                                                           25.6                  71.2
                             Single cache miss                                                        12.9                  35.8
                             CAS cache miss                                                             7.0                 19.4
                             Single cache miss (off-core)                                             31.2                  86.6
                             CAS cache miss (off-core)                                                31.2                  86.5
                             Single cache miss (off-socket)                                           92.4                 256.7
                             CAS cache miss (off-socket)                                              95.9                 266.4
                                             Maybe not such a big difference after all...
                                             And these are best-case values!!! (Why?)
                              “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation       38
IBM Linux Technology Center



Performance of Synchronization Mechanisms




   If you thought a single atomic operation was slow, try lots of them!!!
     (Atomic increment of single variable on 1.9GHz Power 5 system)
      “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   39
IBM Linux Technology Center



Performance of Synchronization Mechanisms




   Same effect on a 16-CPU 2.8GHz Intel X5550 (Nehalem) system

      “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   40
IBM Linux Technology Center



Design Principle: Avoid Bottlenecks




    Only one of something: bad for performance and scalability
      “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   41
IBM Linux Technology Center



Design Principle: Avoid Bottlenecks




 Many instances of something: great for performance and scalability!
                    Any exceptions to this rule?
       “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   42
IBM Linux Technology Center



Exercise: Dining Philosophers Problem
Each philosopher requires two chopsticks to eat.
Need to avoid starvation.




          “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   43
IBM Linux Technology Center



  Exercise: Dining Philosophers Solution #1


                                                          1




                              2                                                     5




                                         3                                  4
Locking hierarchy.
Pick up low-numbered chopstick
first, preventing deadlock.                                                     Is this a good solution???
            “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   44
IBM Linux Technology Center



  Exercise: Dining Philosophers Solution #2




                                  1
                                                                                      5
                                 2

                                                         3



                                                                            4
Locking hierarchy.
Pick up low-numbered chopstick                                                      If all want to eat, at least two
first, preventing deadlock.                                                         will be able to do so.
            “Is Parallel Programming Hard?” 2011 Android System Development Forum      © 2011 IBM Corporation      45
IBM Linux Technology Center



Exercise: Dining Philosophers Solution #3




                                                                              Bundle pairs of chopsticks.
                                                                              Discard fifth chopstick.
                                                                              Take pair: no deadlock.
                                                                              Starvation still possible.
      “Is Parallel Programming Hard?” 2011 Android System Development Forum       © 2011 IBM Corporation    46
IBM Linux Technology Center



Exercise: Dining Philosophers Solution #4




                                                                              Zero contention.
                                                                              Neither deadlock nor livelock.
                                                                              All 5 can eat concurrently.
                                                                              Excellent disease control.
      “Is Parallel Programming Hard?” 2011 Android System Development Forum       © 2011 IBM Corporation   47
IBM Linux Technology Center



Exercise: Dining Philosophers Solutions

 Objections to solution #2 and #3:
     “You can't just change the rules like that!!!”
       • No rule against moving or adding chopsticks!!!
     “Dining Philosophers Problem valuable lock-hierarchy
      teaching tool – #3 just destroyed it!!!”
       • Lock hierarchy is indeed very valuable and widely used, so the
         restriction “there can only be five chopsticks positioned as
         shown” does indeed have its place, even if it didn't appear in
         this instance of the Dining Philosophers Problem.
       • But the lesson of transforming the problem into perfectly
         partitionable form is also very valuable, and given the wide
         availability of cheap multiprocessors, most desperately needed.
     “But what if each chopstick costs a million dollars?”
       • Then we make the philosophers eat with their fingers...                                         ☺

        “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation       48
IBM Linux Technology Center



But What About Real-World Software?




“Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   49
IBM Linux Technology Center



Sequential Programs Often Highly Optimized

 Highly optimized means difficult to modify
      And no wand can wish this away




        “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   50
IBM Linux Technology Center



Sequential Programs Often Highly Optimized

 Highly optimized means difficult to modify
      And no wand can wish this away
 But there are some tricks that can often help:
      Partition the problem at high level
        • Reduce communication overhead between partitions
        • Good approach for multitasking smartphones
      Use existing parallel software
        • Single-threaded user-interface programs talking to central
          database software in servers
        • Changes in protocols may be needed to open up similar
          opportunities in embedded
                  Hardware acceleration can also be attractive
      Parallelize only the performance-critical code
        “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   51
IBM Linux Technology Center



Some Issues With Real-World Software

 Object-oriented spaghetti code
 Highly optimized sequential code
 Parallel-unfriendly APIs
 Parallelization can be a major change




     “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   52
IBM Linux Technology Center



Object-Oriented Spaghetti Code




               Let's see you create a valid locking hierarchy!!!
               What can you do about this?
                                                                                 Photo: © 2007 Ed Hawco all cc-gy-sa
     “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation            53
IBM Linux Technology Center



Object-Oriented Spaghetti Code: Solution 1




                            Run multiple instances in parallel.
                            Maybe with the sh “&” operator.
                                                                                  Photo: © 2007 Ed Hawco all cc-gy-sa
      “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation            54
IBM Linux Technology Center



 Object-Oriented Spaghetti Code: Solution 2




Find the high-cost portion of the code, and rewrite that portion to be parallel-friendly.
Or implement in hardware.
Either way, a careful redesign will be required.
                                                                                        Photo: © 2007 Ed Hawco all cc-gy-sa
            “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation            55
IBM Linux Technology Center



Highly Optimized Sequential Code

 Highly optimized sequential code is difficult to
  change, especially if the result is to remain
  optimized
 Parallelization is a change, so parallelizing
  highly optimized sequential code is likely to be
  difficult
      But it is easy to run multiple instances in parallel
      It may well be that the sequential optimizations work
       better than parallelization
        • Parallelism is but one optimization of many: Use the best
          optimization for the task at hand


        “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   56
IBM Linux Technology Center



Parallel-Unfriendly APIs

 Consider a hash-table addition function that
  returns the number of entries in the table
 What is the best way to implement this in
  parallel?




      “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   57
IBM Linux Technology Center



Parallel-Unfriendly APIs

 Consider a hash-table addition function that
  returns the number of entries in the table
 What is the best way to implement this in
  parallel?
 There is no good way to do so!!!
 The problem is that concurrent additions must
  communicate, and this communication is
  inherently expensive
      http://lwn.net/Articles/423994/
 What should be done instead?

        “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   58
IBM Linux Technology Center



Parallel-Unfriendly APIs

 Don't return the count!
      Without the count, a hash-table implementation can
       handle two concurrent “add” operations with no
       conflicts: fully in parallel
      With the count, the two concurrent “add” operations
       must update the count
 Alternatively, return an approximate count
 Either way, sequential APIs may need to be
  reworked to allow efficient implementation



        “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   59
IBM Linux Technology Center



Parallelization Can Be A Major Change

 Software resembles building construction
      Start with architects
      Then construction
      Finally maintenance
 If you have a old piece of software, your staff might
 consist only of software janitors
      Nothing wrong with janitors: I used to be one
      But odds are against janitors remodeling skyscrapers
 Use the right people for the job
 Can current staff do a major change?
      And yes, you might need time and budget to suit...

        “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   60
IBM Linux Technology Center



                                Conclusions




“Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   61
IBM Linux Technology Center



Summary and Problem Statement

 Writing new parallel code is quite doable
      Many open-source projects prove this point
 Converting existing sequential code to run in
 parallel can be easy...
      … or arbitrarily difficult
 So don't do things the hard way!




        “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   62
IBM Linux Technology Center



 Is Parallel Programming Hard, And If So, Why?




Parallel Programming is as Hard or as Easy as We Make It.

    It is that hard (or that easy) because we make it that way!!!




        “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   63
IBM Linux Technology Center



Legal Statement

 This work represents the view of the author and does not
  necessarily represent the view of IBM.
 IBM and IBM (logo) are trademarks or registered
  trademarks of International Business Machines
  Corporation in the United States and/or other countries.
 Linux is a registered trademark of Linus Torvalds.
 Other company, product, and service names may be
  trademarks or service marks of others.
 This material is based upon work supported by the
  National Science Foundation under Grant No. CNS-
  0719851.


       “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   64
IBM Linux Technology Center



                         Summarized Summary



                   Use
              the right tool
              for the job!!!




Image copyright © 2004 Melissa McKenney


       “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   65
IBM Linux Technology Center



                                     BACKUP




“Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   66
IBM Linux Technology Center



Parallel Programming Advanced Topic: RCU

 For read-mostly data structures, RCU provides
 the benefits of the data-parallel model
      But without the need to actually partition or replicate
       the RCU-protected data structures
      Readers access data without needing to exclude
       each others or updates
        • Extremely lightweight read-side primitives


 And RCU provides additional read-side
 performance and scalability benefits
      With a few limitations and restrictions....


        “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   67
IBM Linux Technology Center



RCU for Read-Mostly Data Structures
                                                                                                             Almost...


                                                   Work
                                                 Partitioning


                                                                                    Resource
                                                                                      R
             Parallel                                                                  C
                                                                                   Partitioning
                                                                                          U
          Access Control
                                                                                  & Replication


                                               Interacting
                                              With Hardware



RCU data-parallel approach: first partition resources, then partition work, and
only then worry about parallel access control, and only for updates.
          “Is Parallel Programming Hard?” 2011 Android System Development Forum     © 2011 IBM Corporation               68
IBM Linux Technology Center



RCU Usage in the Linux Kernel




     “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   69
IBM Linux Technology Center



What Is RCU?

 Publication of new data
 Subscription to existing data
 Wait for pre-existing RCU readers




     “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   70
IBM Linux Technology Center



Publication of And Subscription To New Data

  Key:        All readers can access
              Only pre-existing readers can access
              Inaccessible to readers



   gp
   A                               gp                                        gp                                   gp




                                                                                       rcu_assign_pointer()
                                                        initialization
                kmalloc()




                                ->a=?                                    ->a=1                                ->a=1
                                ->b=?                                    ->b=2                                ->b=2
                                ->c=?                                    ->c=3                                ->c=3



                               p                                         p                                    p

         “Is Parallel Programming Hard?” 2011 Android System Development Forum    © 2011 IBM Corporation               71
IBM Linux Technology Center



 RCU Removal From Linked List

    Combines waiting for readers and multiple versions:
              Writer removes element B from the list (list_del_rcu())
              Writer waits for all readers to finish (synchronize_rcu())
              Writer can then free B (kfree())

       One Version                                  Two Versions                                 One Version

               A                                         A                                            A                             A




                                                                            synchronize_rcu()
                             list_del_rcu()




                                                                                                                          kfree()
readers?       B                              readers?   B                                            B


               C                                         C                                            C                             C

                                                                                                No more readers
                                                                                                 referencing B!

                   “Is Parallel Programming Hard?” 2011 Android System Development Forum                  © 2011 IBM Corporation        72
IBM Linux Technology Center



RCU Replacement Of Item In Linked List

                  1 Version             1 Version              1 Version                        2 Versions                       1 Version             1 Version

A                      AA                    AA                   AA                                AA                              AA                     A




                                                                                                             synchronize_rcu()
                                                                           list_replace_rcu()
B                                                                                                  B'                               B'                    B'
                                                                                                                                                          B
                        ?                    B                    B'
    kmalloc()




                                                      update




                                                                                                                                             kfree()
                                copy




C                                                                                                                                                         C
                                                                                                    B                               B
                       BB                    BB                   BB                                 B                               B


                       CC                    CC                   CC                                CC                              CC
                                                                                                                          No more readers
                                                                                                                           referencing B!

                “Is Parallel Programming Hard?” 2011 Android System Development Forum                    © 2011 IBM Corporation                                73
IBM Linux Technology Center



    Wait For Pre-Existing RCU Readers
                                                                                          rcu_read_lock()

                       Reader                                             Reader
                                                 Reader
                                                                                           Reader
         Reader                                               Reader

                                             Reader                                         rcu_read_unlock()
 RCU readers
concurrent with
   updates                                            Reader
                                                                                                          Forbidden!
                                             synchronize_rcu()

                                                                                      Change Visible
                  Change                        Grace Period
                                                                                       to All Readers



   So what happens if you try to extend an RCU read-side critical section across a grace period?
                  “Is Parallel Programming Hard?” 2011 Android System Development Forum    © 2011 IBM Corporation      74
IBM Linux Technology Center



Wait For Pre-Existing RCU Readers

             Reader                                             Reader
                                       Reader
                                                                                Reader
  Reader                                            Reader

                                   Reader                                                     Grace period
                                                                                              extends as
                                            Reader                                            needed.


                                       synchronize_rcu()

                                                                                Change Visible
       Change                              Grace Period
                                                                                 to All Readers



  A grace period is not permitted to end until all pre-existing readers have completed.
        “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation       75
IBM Linux Technology Center



Wait For Pre-Existing RCU Readers

           Reader                                             Reader
                                     Reader
                                                                              Reader
  Reader                                          Reader

                                 Reader

                                          Reader

                                              synchronize_rcu()

                                                                                              Change Visible
     Change                                      Grace Period
                                                                                               to All Readers



         But it is OK for a grace period to extend longer than necessary
      “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation        76
IBM Linux Technology Center



Wait For Pre-Existing RCU Readers

           Reader                                             Reader
                                     Reader
                                                                              Reader
  Reader                                          Reader

                                 Reader

                                          Reader

                                                    synchronize_rcu()

                                                                                              Change Visible
     Change                                            Grace Period
                                                                                               to All Readers



        And it is also OK for a grace period to begin later than necessary
      “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation        77
IBM Linux Technology Center



Wait For Pre-Existing RCU Readers

           Reader                                             Reader
                                     Reader
                                                                              Reader
  Reader                                          Reader

                                 Reader

                                          Reader

                                                    synchronize_rcu()
                                                                                              Change Visible
                 Change
                                                                                               to All Readers
                                                       Grace Period
                                                                                              Change Visible
     Change                                    synchronize_rcu()
                                                                                               to All Readers

       Starting a grace period late can allow it to serve multiple updates.
      “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation        78
IBM Linux Technology Center



Wait For Pre-Existing RCU Readers

                Reader                                             Reader
                                          Reader
                                                                                   Reader
     Reader                                            Reader
                                                                                                            Reader
                                      Reader

                                               Reader

                                                   synchronize_rcu()

                                                                                                   Change Visible
          Change                                                       Grace Period
                                                                                                    to All Readers
                                                                                                                 !!!
And it is OK for the system to complain (or even abort) if a grace period extends too long.
Too-long of grace periods are likely to result in death by memory exhaustion anyway.
           “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation            79
IBM Linux Technology Center



Wait For Pre-Existing RCU Readers

           Reader                                             Reader
                                     Reader
                                                                              Reader
  Reader                                          Reader

                                 Reader

                                          Reader

                                     synchronize_rcu()

                                                                              Change Visible
     Change                              Grace Period
                                                                               to All Readers



                   Returning to the minimum-duration grace period.
      “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   80
IBM Linux Technology Center



Wait For Pre-Existing RCU Readers

              Reader                                             Reader
                                        Reader
                                                                                 Reader
    Reader                                           Reader

                                    Reader

                               RT                Reader

                                        synchronize_rcu()

                                                                                        Change Visible
         Change                                Grace Period
                                                                                         to All Readers


Real-time scheduling constraints can extend grace periods by preempting RCU readers.
It is sometimes necessary to priority-boost RCU readers.
         “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   81
IBM Linux Technology Center



Waiting for Pre-Existing Readers: QSBR

 RCU readers are not permitted to block
      Same rule as for tasks holding spinlocks
 CPU context switch means all that CPU's readers are done
 Grace period ends after all CPUs execute a context switch




                                                                                             h
                                                                                         itc
                                                                                       sw
                                                                                     t
                                                                                  ex
                                                                                  nt
                                                                                co
   CPU 0

   CPU 1
                              synchronize_rcu()
   CPU 2

        “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   82
IBM Linux Technology Center



   RCU vs. Reader-Writer Locking Performance




How is this
possible?




                                Non-CONFIG_PREEMPT kernel build
              “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   83
IBM Linux Technology Center



RCU vs. Reader-Writer Locking Performance




                       CONFIG_PREEMPT kernel build
     “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   84
IBM Linux Technology Center



RCU vs. Reader-Writer Locking Performance




                   CONFIG_PREEMPT kernel build, 16 CPUs
     “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   85
IBM Linux Technology Center



RCU and Reader-Writer Lock Response Time




     “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   86
IBM Linux Technology Center



RCU Area of Applicability

                                     Read-Mostly, Stale &
                                     Inconsistent Data OK
                                     (RCU Works Great!!!)


                         Read-Mostly, Need Consistent Data
                                (RCU Works OK)


                          Read-Write, Need Consistent Data
                               (RCU Might Be OK...)

                 Update-Mostly, Need Consistent Data
        (RCU is Really Unlikely to be the Right Tool For The Job,
           But SLAB_DESTROY_BY_RCU Is A Possibility)


      “Is Parallel Programming Hard?” 2011 Android System Development Forum   © 2011 IBM Corporation   87

More Related Content

Similar to Is parallel programming hard? And if so, what can you do about it?

Csharp dot net
Csharp dot netCsharp dot net
Csharp dot netEkam Baram
 
(Costless) Software Abstractions for Parallel Architectures
(Costless) Software Abstractions for Parallel Architectures(Costless) Software Abstractions for Parallel Architectures
(Costless) Software Abstractions for Parallel ArchitecturesJoel Falcou
 
Beyond Expert-Only Parallel Programming
Beyond Expert-Only Parallel ProgrammingBeyond Expert-Only Parallel Programming
Beyond Expert-Only Parallel Programmingracesworkshop
 
From the Eclipse Foundation to the Symbian Foundation
From the Eclipse Foundation to the Symbian FoundationFrom the Eclipse Foundation to the Symbian Foundation
From the Eclipse Foundation to the Symbian FoundationDavid Wood
 
Linux on System z Update: Current & Future Linux on System z Technology
Linux on System z Update: Current & Future Linux on System z TechnologyLinux on System z Update: Current & Future Linux on System z Technology
Linux on System z Update: Current & Future Linux on System z TechnologyIBM India Smarter Computing
 
.Net introduction by Quontra Solutions
.Net introduction by Quontra Solutions.Net introduction by Quontra Solutions
.Net introduction by Quontra SolutionsQUONTRASOLUTIONS
 
VIRTUAL MACHINE VERSATILE PLATFORM01~chapter 1 (1).ppt
VIRTUAL MACHINE VERSATILE PLATFORM01~chapter 1 (1).pptVIRTUAL MACHINE VERSATILE PLATFORM01~chapter 1 (1).ppt
VIRTUAL MACHINE VERSATILE PLATFORM01~chapter 1 (1).pptnagarajans87
 
Building Embedded Linux Systems Introduction
Building Embedded Linux Systems IntroductionBuilding Embedded Linux Systems Introduction
Building Embedded Linux Systems IntroductionSherif Mousa
 
Introduction to .net framework
Introduction to .net frameworkIntroduction to .net framework
Introduction to .net frameworkArun Prasad
 
Portinig Application, Drivers And Os
Portinig Application, Drivers And OsPortinig Application, Drivers And Os
Portinig Application, Drivers And Osmomobangalore
 
Unik: Unikernel Backend to Cloud Foundry
Unik: Unikernel Backend to Cloud FoundryUnik: Unikernel Backend to Cloud Foundry
Unik: Unikernel Backend to Cloud FoundryVMware Tanzu
 
.NET Core, ASP.NET Core Course, Session 3
.NET Core, ASP.NET Core Course, Session 3.NET Core, ASP.NET Core Course, Session 3
.NET Core, ASP.NET Core Course, Session 3aminmesbahi
 
Linux-training-for-beginners-in-mumbai
Linux-training-for-beginners-in-mumbaiLinux-training-for-beginners-in-mumbai
Linux-training-for-beginners-in-mumbaiUnmesh Baile
 
Linux 101 Exploring Linux OS
Linux 101 Exploring Linux OSLinux 101 Exploring Linux OS
Linux 101 Exploring Linux OSRodel Barcenas
 
Architecting Solutions for the Manycore Future
Architecting Solutions for the Manycore FutureArchitecting Solutions for the Manycore Future
Architecting Solutions for the Manycore FutureTalbott Crowell
 
Linux training-for-beginners-in-mumbai
Linux training-for-beginners-in-mumbaiLinux training-for-beginners-in-mumbai
Linux training-for-beginners-in-mumbaivibrantuser
 
Linux training-for-beginners-in-mumbai
Linux training-for-beginners-in-mumbaiLinux training-for-beginners-in-mumbai
Linux training-for-beginners-in-mumbaivibrantuser
 

Similar to Is parallel programming hard? And if so, what can you do about it? (20)

Csharp dot net
Csharp dot netCsharp dot net
Csharp dot net
 
(Costless) Software Abstractions for Parallel Architectures
(Costless) Software Abstractions for Parallel Architectures(Costless) Software Abstractions for Parallel Architectures
(Costless) Software Abstractions for Parallel Architectures
 
Beyond Expert-Only Parallel Programming
Beyond Expert-Only Parallel ProgrammingBeyond Expert-Only Parallel Programming
Beyond Expert-Only Parallel Programming
 
From the Eclipse Foundation to the Symbian Foundation
From the Eclipse Foundation to the Symbian FoundationFrom the Eclipse Foundation to the Symbian Foundation
From the Eclipse Foundation to the Symbian Foundation
 
Linux on System z Update: Current & Future Linux on System z Technology
Linux on System z Update: Current & Future Linux on System z TechnologyLinux on System z Update: Current & Future Linux on System z Technology
Linux on System z Update: Current & Future Linux on System z Technology
 
Why linux on power?
Why linux on power?Why linux on power?
Why linux on power?
 
.Net introduction by Quontra Solutions
.Net introduction by Quontra Solutions.Net introduction by Quontra Solutions
.Net introduction by Quontra Solutions
 
Dedicated embedded linux af Esben Haabendal, Prevas A/S
Dedicated embedded linux af Esben Haabendal, Prevas A/SDedicated embedded linux af Esben Haabendal, Prevas A/S
Dedicated embedded linux af Esben Haabendal, Prevas A/S
 
VIRTUAL MACHINE VERSATILE PLATFORM01~chapter 1 (1).ppt
VIRTUAL MACHINE VERSATILE PLATFORM01~chapter 1 (1).pptVIRTUAL MACHINE VERSATILE PLATFORM01~chapter 1 (1).ppt
VIRTUAL MACHINE VERSATILE PLATFORM01~chapter 1 (1).ppt
 
Building Embedded Linux Systems Introduction
Building Embedded Linux Systems IntroductionBuilding Embedded Linux Systems Introduction
Building Embedded Linux Systems Introduction
 
Introduction to .net framework
Introduction to .net frameworkIntroduction to .net framework
Introduction to .net framework
 
Portinig Application, Drivers And Os
Portinig Application, Drivers And OsPortinig Application, Drivers And Os
Portinig Application, Drivers And Os
 
Unik: Unikernel Backend to Cloud Foundry
Unik: Unikernel Backend to Cloud FoundryUnik: Unikernel Backend to Cloud Foundry
Unik: Unikernel Backend to Cloud Foundry
 
.NET Core, ASP.NET Core Course, Session 3
.NET Core, ASP.NET Core Course, Session 3.NET Core, ASP.NET Core Course, Session 3
.NET Core, ASP.NET Core Course, Session 3
 
Lotus on Linux Report 2010
Lotus on Linux Report 2010Lotus on Linux Report 2010
Lotus on Linux Report 2010
 
Linux-training-for-beginners-in-mumbai
Linux-training-for-beginners-in-mumbaiLinux-training-for-beginners-in-mumbai
Linux-training-for-beginners-in-mumbai
 
Linux 101 Exploring Linux OS
Linux 101 Exploring Linux OSLinux 101 Exploring Linux OS
Linux 101 Exploring Linux OS
 
Architecting Solutions for the Manycore Future
Architecting Solutions for the Manycore FutureArchitecting Solutions for the Manycore Future
Architecting Solutions for the Manycore Future
 
Linux training-for-beginners-in-mumbai
Linux training-for-beginners-in-mumbaiLinux training-for-beginners-in-mumbai
Linux training-for-beginners-in-mumbai
 
Linux training-for-beginners-in-mumbai
Linux training-for-beginners-in-mumbaiLinux training-for-beginners-in-mumbai
Linux training-for-beginners-in-mumbai
 

Recently uploaded

Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsAndrey Dotsenko
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 

Recently uploaded (20)

Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 

Is parallel programming hard? And if so, what can you do about it?

  • 1. 2011 Android System Developer Forum Is Parallel Programming Hard, And If So, What Can You Do About It? Paul E. McKenney IBM Distinguished Engineer & CTO Linux Linux Technology Center April 28, 2011 Copyright © 2011 IBM IBM Corporation © 2002
  • 2. IBM Linux Technology Center Who is Paul and How Did He Get This Way? “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 2
  • 3. IBM Linux Technology Center Who is Paul and How Did He Get This Way?  Grew up in rural Oregon, USA  First use of computer in high school (72-76)  IBM mainframe: punched cards and FORTRAN  Later ASR-33 TTY and BASIC  BSME & BSCS, Oregon State University (76-81)  Tuition provided by FORTRAN and COBOL  Contract Programming and Consulting (81-85)  Building control system (Pascal on z80)  Security card-access system (Pascal on PDP-11)  Dining hall system (Pascal on PDP-11)  Acoustic navigation system (C on PDP-11) “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 3
  • 4. IBM Linux Technology Center Who is Paul and How Did He Get This Way?  SRI International (85-90)  UNIX systems administration  Packet-radio research  Internet protocol research  MSCS Computer Science (88)  Sequent Computer Systems (90-00)  Communications performance  Parallel programming: memory allocators, RCU, ...  IBM LTC (00-present)  NUMA-aware and brlock-like locking primitive in AIX  RCU maintainer for Linux kernel  Ph.D. Computer Science and Engineering (04) “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 4
  • 5. IBM Linux Technology Center Who is Paul and How Did He Get This Way?  SRI International (85-90)  UNIX systems administration  Packet-radio research  Internet protocol research  MSCS Computer Science (88)  Sequent Computer Systems (90-00)  Communications performance  Parallel programming: memory allocators, RCU, ...  IBM LTC (00-present)  NUMA-aware and brlock-like locking primitive in AIX  RCU maintainer for Linux kernel  Ph.D. Computer Science and Engineering (04)  Paul has been doing parallel programming for 20 years “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 5
  • 6. IBM Linux Technology Center Who is Paul and How Did He Get This Way?  SRI International (85-90)  UNIX systems administration  Packet-radio research  Internet protocol research  MSCS Computer Science (88)  Sequent Computer Systems (90-00)  Communications performance  Parallel programming: memory allocators, RCU, ...  IBM LTC (00-present)  NUMA-aware and brlock-like locking primitive in AIX  RCU maintainer for Linux kernel  Ph.D. Computer Science and Engineering (04)  Paul has been doing parallel programming for 20 years  What is so hard about parallel programming, then??? “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 6
  • 7. IBM Linux Technology Center Why Has Parallel Programming Been Hard? “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 7
  • 8. IBM Linux Technology Center Why Has Parallel Programming Been Hard?  Parallel systems were rare and expensive  Very few developers had access to parallel systems: little ROI for parallel languages & tools  Almost no publicly accessible parallel code  Parallel-programming engineering discipline confined to a few proprietary projects  Technological obstacles:  Synchronization overhead  Deadlock  Data races “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 8
  • 9. IBM Linux Technology Center Parallel Programming: What Has Changed?  Parallel systems were rare and expensive  About $1,000,000 each in 1990  About $100 in 2011  Four orders of magnitude decrease in 21 years  From many times the price of a house to a small fraction of the cost of a bicycle  In 2006, masters student I was collaborating with bought a dual-core laptop  And just to be able to present a parallel- programming talk using a parallel programmer “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 9
  • 10. IBM Linux Technology Center Parallel Programming: Rare and Expensive “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 10
  • 11. IBM Linux Technology Center Parallel Programming: Rare and Expensive “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 11
  • 12. IBM Linux Technology Center Parallel Programming: Cheap and Plentiful “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 12
  • 13. IBM Linux Technology Center Parallel Programming: What Has Changed?  Back then: very few developers had access to parallel systems  Maybe 1,000 developers with parallel systems  Suppose that a tool could save 10% per developer • Then it would not make sense to invest 100 developers • Even assuming compatible environments: 'Fraid not!!!  Now: almost anyone can have a parallel system  More than 100,000 developers with parallel systems • Easily makes sense to invest 100 developers for 10% gain • Even assuming multiple incompatible environments  We can expect many more parallel tools  Linux kernel lockdep, lock profiling, ... “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 13
  • 14. IBM Linux Technology Center Why Has Parallel Programming Been Hard?  Almost no publicly accessible parallel code  Parallel-programming engineering discipline confined to a few proprietary projects  Linux kernel: 13 million lines of parallel code  MariaDB (the database formerly known as MySQL)  memcached  Apache web server  Hadoop  BOINC  OpenMP  Lots of parallel code for study and copying “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 14
  • 15. IBM Linux Technology Center Why Has Parallel Programming Been Hard?  Technological obstacles:  Synchronization overhead  Deadlock  Data races “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 15
  • 16. IBM Linux Technology Center Why Has Parallel Programming Been Hard?  Technological obstacles:  Synchronization overhead  Deadlock  Data races  If these obstacles are impossible to overcome, why are there more than 13 million lines of parallel Linux kernel code? “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 16
  • 17. IBM Linux Technology Center Why Has Parallel Programming Been Hard?  Technological obstacles:  Synchronization overhead  Deadlock  Data races  If these obstacles are impossible to overcome, why are there more than 13 million lines of parallel Linux kernel code?  But parallel programming is harder than sequential programming... “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 17
  • 18. IBM Linux Technology Center So Why Parallel Programming? “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 18
  • 19. IBM Linux Technology Center Why Parallel Programming? (Party Line) “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 19
  • 20. IBM Linux Technology Center Why Parallel Programming? (Reality)  Parallelism can increase performance  Assuming...  Parallelism may be important battery-life extender  Run two CPUs at half frequency  25% power per CPU, 50% overall with same throughput assuming perfect scaling • And assuming CPU consumes most of the power • Which is not the case on many embedded systems  But parallelism is but one way to optimize performance and battery lifetime  Hashing, search trees, parsers, cordic algorithms, ... “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 20
  • 21. IBM Linux Technology Center Why Parallel Programming? (Reality) If your code runs well enough single threaded, then leave it be single threaded. And you always have the option of implementing key function in hardware. But where it makes sense, parallelism can be quite valuable. “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 21
  • 22. IBM Linux Technology Center Parallel Programming Goals “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 22
  • 23. IBM Linux Technology Center Parallel Programming Goals Performance Productivity Generality “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 23
  • 24. IBM Linux Technology Center Parallel Programming Goals: Why Performance?  (Performance often expressed as scalability or normalized as in performance per watt)  If you don't care about performance, why are you bothering with parallelism???  Just run single threaded and be happy!!!  But what about:  All the multi-core systems out there?  Efficient use of resources?  Everyone saying parallel programming is crucial?  Parallel Programming: one optimization of many  CPU: one potential bottleneck of many “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 24
  • 25. IBM Linux Technology Center Parallel Programming Goals: Why Productivity?  1948 CSIRAC (oldest intact computer)  2,000 vacuum tubes, 768 20-bit words of memory  $10M AU construction price  1955 technical salaries: $3-5K/year  Makes business sense to dedicate 10-person team to increasing performance by 10%  2008 z80 (popular 8-bit microprocessor)  8,500 transistors, 64K 8-bit works of memory  $1.36 per CPU in quantity 1,000 (7 OOM decrease)  2008 SW starting salaries: $50-95K/year US (1 OOM increase)  Need 1M CPUs to break even on a one-person-year investment to gain 10% performance! • Or 10% more performance must be blazingly important • Or you are doing this as a hobby... In which case, do what you want! “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 25
  • 26. IBM Linux Technology Center Parallel Programming Goals: Why Generality?  The more general the solution, the more users to spread the cost over. Productivity Application Middleware (e.g., DBMS) Performance Generality System Utilities & Libraries OS Kernel FW HW “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 26
  • 27. IBM Linux Technology Center Performance of Synchronization Mechanisms “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 27
  • 28. IBM Linux Technology Center Performance of Synchronization Mechanisms 4-CPU 1.8GHz AMD Opteron 844 system Need to be here! (Partitioning/RCU) Operation Cost (ns) Ratio Clock period 0.6 1 Best-case CAS 37.9 63.2 Best-case lock 65.6 109.3 Single cache miss 139.5 232.5 CAS cache miss 306.0 510.0 Heavily optimized reader- writer lock might get here for readers (but too bad Typical synchronization about those poor writers...) mechanisms do this a lot “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 28
  • 29. IBM Linux Technology Center Performance of Synchronization Mechanisms 4-CPU 1.8GHz AMD Opteron 844 system Need to be here! (Partitioning/RCU) Operation Cost (ns) Ratio Clock period 0.6 1 Best-case CAS 37.9 63.2 Best-case lock 65.6 109.3 Single cache miss 139.5 232.5 CAS cache miss 306.0 510.0 Heavily optimized reader- writer lock might get here for readers (but too bad Typical synchronization about those poor writers...) mechanisms do this a lot But this is an old system... “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 29
  • 30. IBM Linux Technology Center Performance of Synchronization Mechanisms 4-CPU 1.8GHz AMD Opteron 844 system Need to be here! (Partitioning/RCU) Operation Cost (ns) Ratio Clock period 0.6 1 Best-case CAS 37.9 63.2 Best-case lock 65.6 109.3 Single cache miss 139.5 232.5 CAS cache miss 306.0 510.0 Heavily optimized reader- writer lock might get here for readers (but too bad Typical synchronization about those poor writers...) mechanisms do this a lot But this is an old system... And why low-level details??? “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 30
  • 31. IBM Linux Technology Center Why All These Low-Level Details???  Would you trust a bridge designed by someone who did not understand strengths of materials?  Or a ship designed by someone who did not understand the steel-alloy transition temperatures?  Or a house designed by someone who did not understand that unfinished wood rots when wet?  Or a car designed by someone who did not understand the corrosion properties of the metals used in the exhaust system?  Or a space shuttle designed by someone who did not understand the temperature limitations of O-rings?  So why trust algorithms from someone ignorant of the properties of the underlying hardware??? “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 31
  • 32. IBM Linux Technology Center Why Such An Old System???  Because early low-power embedded CPUs are likely to have similar performance characteristics. “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 32
  • 33. IBM Linux Technology Center But Isn't Hardware Just Getting Faster? “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 33
  • 34. IBM Linux Technology Center Performance of Synchronization Mechanisms 16-CPU 2.8GHz Intel X5550 (Nehalem) System Operation Cost (ns) Ratio Clock period 0.4 1 “Best-case” CAS 12.2 33.8 Best-case lock 25.6 71.2 Single cache miss 12.9 35.8 CAS cache miss 7.0 19.4 What a difference a few years can make!!! “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 34
  • 35. IBM Linux Technology Center Performance of Synchronization Mechanisms 16-CPU 2.8GHz Intel X5550 (Nehalem) System Operation Cost (ns) Ratio Clock period 0.4 1 “Best-case” CAS 12.2 33.8 Best-case lock 25.6 71.2 Single cache “miss” 12.9 35.8 CAS cache “miss” 7.0 19.4 Single cache miss (off-core) 31.2 86.6 CAS cache miss (off-core) 31.2 86.5 Not quite so good... But still a 6x improvement!!! “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 35
  • 36. IBM Linux Technology Center Performance of Synchronization Mechanisms 16-CPU 2.8GHz Intel X5550 (Nehalem) System Operation Cost (ns) Ratio Clock period 0.4 1 “Best-case” CAS 12.2 33.8 Best-case lock 25.6 71.2 Single cache miss 12.9 35.8 CAS cache miss 7.0 19.4 Single cache miss (off-core) 31.2 86.6 CAS cache miss (off-core) 31.2 86.5 Single cache miss (off-socket) 92.4 256.7 CAS cache miss (off-socket) 95.9 266.4 Maybe not such a big difference after all... And these are best-case values!!! (Why?) “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 36
  • 37. IBM Linux Technology Center Performance of Synchronization Mechanisms 16-CPU 2.8GHz Intel X5550 (Nehalem) System Most embedded devices here Operation Cost (ns) Ratio Clock period 0.4 1 “Best-case” CAS 12.2 33.8 Best-case lock 25.6 71.2 Single cache miss 12.9 35.8 CAS cache miss 7.0 19.4 Single cache miss (off-core) 31.2 86.6 CAS cache miss (off-core) 31.2 86.5 Single cache miss (off-socket) 92.4 256.7 CAS cache miss (off-socket) 95.9 266.4 Maybe not such a big difference after all... And these are best-case values!!! (Why?) “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 37
  • 38. IBM Linux Technology Center Performance of Synchronization Mechanisms 16-CPU 2.8GHz Intel X5550 (Nehalem) System Most embedded devices here Operation Cost (ns) Ratio Clock period 0.4 1 “Best-case” CAS 12.2 33.8 For now... Best-case lock 25.6 71.2 Single cache miss 12.9 35.8 CAS cache miss 7.0 19.4 Single cache miss (off-core) 31.2 86.6 CAS cache miss (off-core) 31.2 86.5 Single cache miss (off-socket) 92.4 256.7 CAS cache miss (off-socket) 95.9 266.4 Maybe not such a big difference after all... And these are best-case values!!! (Why?) “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 38
  • 39. IBM Linux Technology Center Performance of Synchronization Mechanisms If you thought a single atomic operation was slow, try lots of them!!! (Atomic increment of single variable on 1.9GHz Power 5 system) “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 39
  • 40. IBM Linux Technology Center Performance of Synchronization Mechanisms Same effect on a 16-CPU 2.8GHz Intel X5550 (Nehalem) system “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 40
  • 41. IBM Linux Technology Center Design Principle: Avoid Bottlenecks Only one of something: bad for performance and scalability “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 41
  • 42. IBM Linux Technology Center Design Principle: Avoid Bottlenecks Many instances of something: great for performance and scalability! Any exceptions to this rule? “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 42
  • 43. IBM Linux Technology Center Exercise: Dining Philosophers Problem Each philosopher requires two chopsticks to eat. Need to avoid starvation. “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 43
  • 44. IBM Linux Technology Center Exercise: Dining Philosophers Solution #1 1 2 5 3 4 Locking hierarchy. Pick up low-numbered chopstick first, preventing deadlock. Is this a good solution??? “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 44
  • 45. IBM Linux Technology Center Exercise: Dining Philosophers Solution #2 1 5 2 3 4 Locking hierarchy. Pick up low-numbered chopstick If all want to eat, at least two first, preventing deadlock. will be able to do so. “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 45
  • 46. IBM Linux Technology Center Exercise: Dining Philosophers Solution #3 Bundle pairs of chopsticks. Discard fifth chopstick. Take pair: no deadlock. Starvation still possible. “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 46
  • 47. IBM Linux Technology Center Exercise: Dining Philosophers Solution #4 Zero contention. Neither deadlock nor livelock. All 5 can eat concurrently. Excellent disease control. “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 47
  • 48. IBM Linux Technology Center Exercise: Dining Philosophers Solutions  Objections to solution #2 and #3:  “You can't just change the rules like that!!!” • No rule against moving or adding chopsticks!!!  “Dining Philosophers Problem valuable lock-hierarchy teaching tool – #3 just destroyed it!!!” • Lock hierarchy is indeed very valuable and widely used, so the restriction “there can only be five chopsticks positioned as shown” does indeed have its place, even if it didn't appear in this instance of the Dining Philosophers Problem. • But the lesson of transforming the problem into perfectly partitionable form is also very valuable, and given the wide availability of cheap multiprocessors, most desperately needed.  “But what if each chopstick costs a million dollars?” • Then we make the philosophers eat with their fingers... ☺ “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 48
  • 49. IBM Linux Technology Center But What About Real-World Software? “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 49
  • 50. IBM Linux Technology Center Sequential Programs Often Highly Optimized  Highly optimized means difficult to modify  And no wand can wish this away “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 50
  • 51. IBM Linux Technology Center Sequential Programs Often Highly Optimized  Highly optimized means difficult to modify  And no wand can wish this away  But there are some tricks that can often help:  Partition the problem at high level • Reduce communication overhead between partitions • Good approach for multitasking smartphones  Use existing parallel software • Single-threaded user-interface programs talking to central database software in servers • Changes in protocols may be needed to open up similar opportunities in embedded  Hardware acceleration can also be attractive  Parallelize only the performance-critical code “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 51
  • 52. IBM Linux Technology Center Some Issues With Real-World Software  Object-oriented spaghetti code  Highly optimized sequential code  Parallel-unfriendly APIs  Parallelization can be a major change “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 52
  • 53. IBM Linux Technology Center Object-Oriented Spaghetti Code Let's see you create a valid locking hierarchy!!! What can you do about this? Photo: © 2007 Ed Hawco all cc-gy-sa “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 53
  • 54. IBM Linux Technology Center Object-Oriented Spaghetti Code: Solution 1 Run multiple instances in parallel. Maybe with the sh “&” operator. Photo: © 2007 Ed Hawco all cc-gy-sa “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 54
  • 55. IBM Linux Technology Center Object-Oriented Spaghetti Code: Solution 2 Find the high-cost portion of the code, and rewrite that portion to be parallel-friendly. Or implement in hardware. Either way, a careful redesign will be required. Photo: © 2007 Ed Hawco all cc-gy-sa “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 55
  • 56. IBM Linux Technology Center Highly Optimized Sequential Code  Highly optimized sequential code is difficult to change, especially if the result is to remain optimized  Parallelization is a change, so parallelizing highly optimized sequential code is likely to be difficult  But it is easy to run multiple instances in parallel  It may well be that the sequential optimizations work better than parallelization • Parallelism is but one optimization of many: Use the best optimization for the task at hand “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 56
  • 57. IBM Linux Technology Center Parallel-Unfriendly APIs  Consider a hash-table addition function that returns the number of entries in the table  What is the best way to implement this in parallel? “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 57
  • 58. IBM Linux Technology Center Parallel-Unfriendly APIs  Consider a hash-table addition function that returns the number of entries in the table  What is the best way to implement this in parallel?  There is no good way to do so!!!  The problem is that concurrent additions must communicate, and this communication is inherently expensive  http://lwn.net/Articles/423994/  What should be done instead? “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 58
  • 59. IBM Linux Technology Center Parallel-Unfriendly APIs  Don't return the count!  Without the count, a hash-table implementation can handle two concurrent “add” operations with no conflicts: fully in parallel  With the count, the two concurrent “add” operations must update the count  Alternatively, return an approximate count  Either way, sequential APIs may need to be reworked to allow efficient implementation “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 59
  • 60. IBM Linux Technology Center Parallelization Can Be A Major Change  Software resembles building construction  Start with architects  Then construction  Finally maintenance  If you have a old piece of software, your staff might consist only of software janitors  Nothing wrong with janitors: I used to be one  But odds are against janitors remodeling skyscrapers  Use the right people for the job  Can current staff do a major change?  And yes, you might need time and budget to suit... “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 60
  • 61. IBM Linux Technology Center Conclusions “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 61
  • 62. IBM Linux Technology Center Summary and Problem Statement  Writing new parallel code is quite doable  Many open-source projects prove this point  Converting existing sequential code to run in parallel can be easy...  … or arbitrarily difficult  So don't do things the hard way! “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 62
  • 63. IBM Linux Technology Center Is Parallel Programming Hard, And If So, Why? Parallel Programming is as Hard or as Easy as We Make It. It is that hard (or that easy) because we make it that way!!! “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 63
  • 64. IBM Linux Technology Center Legal Statement  This work represents the view of the author and does not necessarily represent the view of IBM.  IBM and IBM (logo) are trademarks or registered trademarks of International Business Machines Corporation in the United States and/or other countries.  Linux is a registered trademark of Linus Torvalds.  Other company, product, and service names may be trademarks or service marks of others.  This material is based upon work supported by the National Science Foundation under Grant No. CNS- 0719851. “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 64
  • 65. IBM Linux Technology Center Summarized Summary Use the right tool for the job!!! Image copyright © 2004 Melissa McKenney “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 65
  • 66. IBM Linux Technology Center BACKUP “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 66
  • 67. IBM Linux Technology Center Parallel Programming Advanced Topic: RCU  For read-mostly data structures, RCU provides the benefits of the data-parallel model  But without the need to actually partition or replicate the RCU-protected data structures  Readers access data without needing to exclude each others or updates • Extremely lightweight read-side primitives  And RCU provides additional read-side performance and scalability benefits  With a few limitations and restrictions.... “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 67
  • 68. IBM Linux Technology Center RCU for Read-Mostly Data Structures Almost... Work Partitioning Resource R Parallel C Partitioning U Access Control & Replication Interacting With Hardware RCU data-parallel approach: first partition resources, then partition work, and only then worry about parallel access control, and only for updates. “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 68
  • 69. IBM Linux Technology Center RCU Usage in the Linux Kernel “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 69
  • 70. IBM Linux Technology Center What Is RCU?  Publication of new data  Subscription to existing data  Wait for pre-existing RCU readers “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 70
  • 71. IBM Linux Technology Center Publication of And Subscription To New Data Key: All readers can access Only pre-existing readers can access Inaccessible to readers gp A gp gp gp rcu_assign_pointer() initialization kmalloc() ->a=? ->a=1 ->a=1 ->b=? ->b=2 ->b=2 ->c=? ->c=3 ->c=3 p p p “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 71
  • 72. IBM Linux Technology Center RCU Removal From Linked List  Combines waiting for readers and multiple versions:  Writer removes element B from the list (list_del_rcu())  Writer waits for all readers to finish (synchronize_rcu())  Writer can then free B (kfree()) One Version Two Versions One Version A A A A synchronize_rcu() list_del_rcu() kfree() readers? B readers? B B C C C C No more readers referencing B! “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 72
  • 73. IBM Linux Technology Center RCU Replacement Of Item In Linked List 1 Version 1 Version 1 Version 2 Versions 1 Version 1 Version A AA AA AA AA AA A synchronize_rcu() list_replace_rcu() B B' B' B' B ? B B' kmalloc() update kfree() copy C C B B BB BB BB B B CC CC CC CC CC No more readers referencing B! “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 73
  • 74. IBM Linux Technology Center Wait For Pre-Existing RCU Readers rcu_read_lock() Reader Reader Reader Reader Reader Reader Reader rcu_read_unlock() RCU readers concurrent with updates Reader Forbidden! synchronize_rcu() Change Visible Change Grace Period to All Readers So what happens if you try to extend an RCU read-side critical section across a grace period? “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 74
  • 75. IBM Linux Technology Center Wait For Pre-Existing RCU Readers Reader Reader Reader Reader Reader Reader Reader Grace period extends as Reader needed. synchronize_rcu() Change Visible Change Grace Period to All Readers A grace period is not permitted to end until all pre-existing readers have completed. “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 75
  • 76. IBM Linux Technology Center Wait For Pre-Existing RCU Readers Reader Reader Reader Reader Reader Reader Reader Reader synchronize_rcu() Change Visible Change Grace Period to All Readers But it is OK for a grace period to extend longer than necessary “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 76
  • 77. IBM Linux Technology Center Wait For Pre-Existing RCU Readers Reader Reader Reader Reader Reader Reader Reader Reader synchronize_rcu() Change Visible Change Grace Period to All Readers And it is also OK for a grace period to begin later than necessary “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 77
  • 78. IBM Linux Technology Center Wait For Pre-Existing RCU Readers Reader Reader Reader Reader Reader Reader Reader Reader synchronize_rcu() Change Visible Change to All Readers Grace Period Change Visible Change synchronize_rcu() to All Readers Starting a grace period late can allow it to serve multiple updates. “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 78
  • 79. IBM Linux Technology Center Wait For Pre-Existing RCU Readers Reader Reader Reader Reader Reader Reader Reader Reader Reader synchronize_rcu() Change Visible Change Grace Period to All Readers !!! And it is OK for the system to complain (or even abort) if a grace period extends too long. Too-long of grace periods are likely to result in death by memory exhaustion anyway. “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 79
  • 80. IBM Linux Technology Center Wait For Pre-Existing RCU Readers Reader Reader Reader Reader Reader Reader Reader Reader synchronize_rcu() Change Visible Change Grace Period to All Readers Returning to the minimum-duration grace period. “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 80
  • 81. IBM Linux Technology Center Wait For Pre-Existing RCU Readers Reader Reader Reader Reader Reader Reader Reader RT Reader synchronize_rcu() Change Visible Change Grace Period to All Readers Real-time scheduling constraints can extend grace periods by preempting RCU readers. It is sometimes necessary to priority-boost RCU readers. “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 81
  • 82. IBM Linux Technology Center Waiting for Pre-Existing Readers: QSBR  RCU readers are not permitted to block  Same rule as for tasks holding spinlocks  CPU context switch means all that CPU's readers are done  Grace period ends after all CPUs execute a context switch h itc sw t ex nt co CPU 0 CPU 1 synchronize_rcu() CPU 2 “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 82
  • 83. IBM Linux Technology Center RCU vs. Reader-Writer Locking Performance How is this possible? Non-CONFIG_PREEMPT kernel build “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 83
  • 84. IBM Linux Technology Center RCU vs. Reader-Writer Locking Performance CONFIG_PREEMPT kernel build “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 84
  • 85. IBM Linux Technology Center RCU vs. Reader-Writer Locking Performance CONFIG_PREEMPT kernel build, 16 CPUs “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 85
  • 86. IBM Linux Technology Center RCU and Reader-Writer Lock Response Time “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 86
  • 87. IBM Linux Technology Center RCU Area of Applicability Read-Mostly, Stale & Inconsistent Data OK (RCU Works Great!!!) Read-Mostly, Need Consistent Data (RCU Works OK) Read-Write, Need Consistent Data (RCU Might Be OK...) Update-Mostly, Need Consistent Data (RCU is Really Unlikely to be the Right Tool For The Job, But SLAB_DESTROY_BY_RCU Is A Possibility) “Is Parallel Programming Hard?” 2011 Android System Development Forum © 2011 IBM Corporation 87