Learning and Development
Presents




OPEN TALK SERIES
A series of illuminating talks and interactions that open our minds to new ideas
and concepts; that make us look for newer or better ways of doing what we do;
or point us to exciting things we have never done before. A range of topics on
Technology, Business, Fun and Life.

Be part of the learning experience at Aditi.
Join the talks. It's free. Free as in freedom at work, not free beer.
Speak at these events. Or bring an expert/friend to talk.
Mail LEAD with topic and availability.
Parallel Programming

    Sundararajan Subramanian
        Aditi Technologies
Introduction to Parallel Computing
• The challenge
  – Provide the abstractions, programming
    paradigms, and algorithms needed to
    effectively design, implement, and maintain
    applications that exploit the parallelism
    provided by the underlying hardware in order
    to solve modern problems.
Single-core CPU chip
[Figure: a single-core CPU chip, with the single core highlighted]
Multi-core architectures
[Figure: a multi-core CPU chip with four cores, Core 1 through Core 4, on one die]
Multi-core CPU chip
• The cores fit on a single processor socket
• Also called CMP (Chip Multi-Processor)
[Figure: cores 1 through 4 side by side on one chip]
The cores run in parallel
[Figure: threads 1 through 4, each running on its own core, 1 through 4]
Within each core, threads are time-sliced
       (just like on a uniprocessor)
[Figure: several threads time-sliced on each of the four cores]
Instruction-level parallelism
• Parallelism at the machine-instruction level
• The processor can re-order, pipeline
  instructions, split them into
  microinstructions, do aggressive branch
  prediction, etc.
• Instruction-level parallelism enabled rapid
  increases in processor speeds over the
  last 15 years
Instruction-level parallelism
• for (int i = 0; i < 1000; i++)
    { a[0]++; a[0]++; }
  – Both increments touch the same location, so the second
    depends on the first: the processor cannot overlap them.

• for (int i = 0; i < 1000; i++)
    { a[0]++; a[1]++; }
  – The two increments are independent, so the processor
    can execute them in parallel.
Thread-level parallelism (TLP)
• This is parallelism on a coarser scale
• A server can serve each client in a separate
  thread (Web server, database server)
• A computer game can do AI, graphics, and
  physics in three separate threads
• Single-core superscalar processors cannot
  fully exploit TLP
• Multi-core architectures are the next step in
  processor evolution: explicitly exploiting TLP
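The game example above maps directly onto threads. A minimal C# sketch (the class and method names are mine, for illustration): two independent concerns run in separate threads and the caller joins both.

```csharp
using System;
using System.Threading;

class TlpSketch
{
    // Two independent concerns run in separate threads, the way a game
    // might split AI and physics work (the names are illustrative).
    public static int RunFrame()
    {
        int aiResult = 0, physicsResult = 0;

        var ai      = new Thread(() => aiResult = 1);
        var physics = new Thread(() => physicsResult = 2);
        ai.Start();
        physics.Start();
        ai.Join();       // wait for both workers before the frame ends
        physics.Join();

        return aiResult + physicsResult;
    }

    static void Main() => Console.WriteLine(RunFrame());  // prints 3
}
```

On a multi-core machine the two threads can genuinely run at the same time; on a single core they are time-sliced, as the earlier slides show.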
A technique complementary to multi-core:
         Simultaneous multithreading

• Problem addressed: the processor pipeline can get stalled:
  – Waiting for the result of a long floating point
    (or integer) operation
  – Waiting for data to arrive from memory
  Other execution units wait unused.
[Figure: a processor pipeline — L1 D-Cache and D-TLB, integer and
floating point units, schedulers, uop queues, rename/alloc, BTB,
trace cache, uCode ROM, decoder, BTB and I-TLB, L2 cache and control,
bus. Source: Intel]
Simultaneous multithreading (SMT)
• Permits multiple independent threads to execute
  SIMULTANEOUSLY on the SAME core
• Weaving together multiple “threads”
  on the same core

• Example: if one thread is waiting for a floating
  point operation to complete, another thread can
  use the integer units


Without SMT, only a single thread can
        run at any given time
[Figure: the same pipeline, with only Thread 1 (floating point)
occupying its execution units]
Without SMT, only a single thread can
        run at any given time
[Figure: the same pipeline, with only Thread 2 (integer operation)
occupying its execution units]
SMT processor: both threads can run
          concurrently
[Figure: the pipeline with Thread 1 (floating point) and Thread 2
(integer operation) active at the same time]
But: Can’t simultaneously use the
       same functional unit
[Figure: Thread 1 and Thread 2 both contending for the integer unit,
marked IMPOSSIBLE. This scenario is impossible with SMT on a single
core (assuming a single integer unit)]
SMT not a “true” parallel processor
• Enables better threading (e.g. up to 30% better throughput)
• OS and applications perceive each
  simultaneous thread as a separate
  “virtual processor”
• The chip has only a single copy
  of each resource
• Compare to multi-core:
  each core has its own copy of resources
Multi-core:
       threads can run on separate cores
[Figure: two complete pipelines, one per core, each with its own L2
cache and control — Thread 1 on the first core, Thread 2 on the second]
Multi-core:
       threads can run on separate cores
[Figure: the same two per-core pipelines — Thread 3 on the first core,
Thread 4 on the second]
Combining Multi-core and SMT
• Cores can be SMT-enabled (or not)
• The different combinations:
  – Single-core, non-SMT: standard uniprocessor
  – Single-core, with SMT
  – Multi-core, non-SMT
  – Multi-core, with SMT: our fish machines
• The number of SMT threads:
  2, 4, or sometimes 8 simultaneous threads
• Intel calls them “hyper-threads”
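One practical consequence of these combinations: the operating system counts each simultaneous thread as a logical processor, and .NET reports that count directly. A small check, assuming nothing beyond the standard library:

```csharp
using System;

class LogicalCpus
{
    static void Main()
    {
        // On a dual-core chip with 2-way SMT this typically prints 4:
        // each "hyper-thread" appears to the OS as a virtual processor.
        Console.WriteLine(Environment.ProcessorCount);
    }
}
```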
SMT Dual-core: all four threads can run
            concurrently
[Figure: two SMT pipelines — Threads 1 and 3 on the first core,
Threads 2 and 4 on the second]
Designs with private L2 caches
[Figure: two designs. Left: CORE0 and CORE1, each with a private L1
and L2 cache, above shared memory. Right: CORE0 and CORE1, each with
a private L1, L2, and L3 cache, above shared memory.]
• Left: both L1 and L2 are private.
  Examples: AMD Opteron, AMD Athlon, Intel Pentium D
• Right: a design with L3 caches.
  Example: Intel Itanium 2
Private vs shared caches?
• Advantages/disadvantages?
Private vs shared caches
• Advantages of private:
  – They are closer to the core, so access is faster
  – Reduces contention
• Advantages of shared:
  – Threads on different cores can share the
    same cache data
  – More cache space is available if a single (or a
    few) high-performance thread runs on the
    system
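The contention point can be made concrete. A hedged sketch (the stride of 16 longs is an assumption about cache-line size, which varies by chip): two threads update counters spaced far enough apart that each core's private cache keeps its own line instead of ping-ponging a shared one. The sketch only verifies correctness; the timing effect would need a benchmark.

```csharp
using System;
using System.Threading.Tasks;

class CacheContention
{
    // 16 longs = 128 bytes, at least one cache line on common chips
    // (an assumption; real line sizes vary).
    const int Stride = 16;
    static long[] counters = new long[2 * Stride];

    public static long Run(int perThread)
    {
        // Each thread owns one counter; the stride keeps the two counters
        // on different cache lines, so the cores' private caches do not
        // fight over a single line.
        Task a = Task.Run(() => { for (int i = 0; i < perThread; i++) counters[0] += 1; });
        Task b = Task.Run(() => { for (int i = 0; i < perThread; i++) counters[Stride] += 1; });
        Task.WaitAll(a, b);
        return counters[0] + counters[Stride];
    }

    static void Main() => Console.WriteLine(Run(1000));  // prints 2000
}
```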
Parallel Architectures
• Use multiple
  – Datapaths
  – Memory units
  – Processing units
Parallel Architectures
• SIMD
  – Single instruction stream, multiple data stream
[Figure: one Control Unit driving several Processing Units,
connected by an Interconnect]
Parallel Architectures
• MIMD
  – Multiple instruction stream, multiple data stream
[Figure: several Processing/Control Units connected by an Interconnect]
Parallelism in Visual Studio 2010
[Figure: the Visual Studio 2010 parallel stack.
• Integrated tooling: Parallel Debugger Toolwindows; Profiler with
  Concurrency Analysis
• Managed programming models: PLINQ and the Task Parallel Library,
  with Data Structures, on a Concurrency Runtime (ThreadPool with
  Task Scheduler and Resource Manager)
• Native programming models: Parallel Pattern Library and Agents
  Library, with Data Structures, on a Concurrency Runtime
  (Task Scheduler and Resource Manager)
• Both sit on the Operating System and its Threads.
Key: Tools / Native Library / Managed Library]
Multithreading Today
• Divide the total number of activities across n
  processors
• In the case of 2 processors, divide it by 2.
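The division described above can be sketched as an even split of an index range across n workers (the class and method names are mine, for illustration):

```csharp
using System;
using System.Threading.Tasks;

class DivideWork
{
    // Splits the range [0, total) evenly across n workers:
    // with n = 2, each worker sums half the range.
    public static long SumRange(int total, int n)
    {
        long[] partial = new long[n];
        var tasks = new Task[n];
        int chunk = (total + n - 1) / n;           // ceiling division

        for (int w = 0; w < n; w++)
        {
            int id = w, lo = w * chunk, hi = Math.Min(total, lo + chunk);
            tasks[w] = Task.Run(() =>
            {
                long s = 0;
                for (int i = lo; i < hi; i++) s += i;
                partial[id] = s;                   // no sharing between workers
            });
        }
        Task.WaitAll(tasks);

        long sum = 0;
        foreach (long p in partial) sum += p;
        return sum;
    }

    static void Main() => Console.WriteLine(SumRange(1000, 2));  // prints 499500
}
```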
User Mode Scheduler
[Figure: the CLR thread pool. The program thread places work items
in a global queue; worker threads 1 through p dequeue and run them.]
User Mode Scheduler For Tasks
[Figure: the CLR thread pool with work-stealing. The program thread
adds tasks (Task 1, Task 2) to a global queue; worker threads 1
through p each also have a local queue holding their tasks (Task 3
through Task 6), and idle workers steal tasks from other local queues.]
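The local queues above are what make fine-grained, recursive tasks cheap: a worker pushes child tasks onto its own queue, and idle workers steal from the other end, balancing the tree across threads. A sketch of the pattern, assuming only the shipped Task API:

```csharp
using System;
using System.Threading.Tasks;

class WorkStealing
{
    // Recursively splits the range into child tasks; the work-stealing
    // scheduler spreads the resulting tree across the worker threads.
    public static long SumTree(int lo, int hi)
    {
        if (hi - lo <= 1000)                       // small enough: do it inline
        {
            long s = 0;
            for (int i = lo; i < hi; i++) s += i;
            return s;
        }
        int mid = lo + (hi - lo) / 2;
        var left  = Task.Run(() => SumTree(lo, mid));  // child task, locally queued
        long right = SumTree(mid, hi);                 // this worker keeps half
        return left.Result + right;
    }

    static void Main() => Console.WriteLine(SumTree(0, 10000));  // prints 49995000
}
```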
DEMO
Task-based Programming
ThreadPool Summary

ThreadPool.QueueUserWorkItem(…);

System.Threading.Tasks

Starting:
    Task.Factory.StartNew(…);

Parent/Child:
    var p = new Task(() => {
        var t = new Task(…);
    });

Continue/Wait/Cancel:
    Task t = …
    Task p = t.ContinueWith(…);
    t.Wait(2000);
    t.Cancel();

Tasks with results:
    Task<int> f = new Task<int>(() => C());
    …
    int result = f.Result;
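The fragments above can be assembled into one runnable program. (Cancellation is omitted here: in the API as it shipped, it goes through CancellationToken rather than a Cancel method on Task.)

```csharp
using System;
using System.Threading.Tasks;

class TaskSummary
{
    static int C() => 21;    // placeholder for the slide's C()

    static void Main()
    {
        // Fire off a plain work item.
        Task t = Task.Factory.StartNew(() => Console.WriteLine("work item"));

        // A task that produces a result.
        var f = new Task<int>(() => C());
        f.Start();
        int result = f.Result;      // blocks until C() has run

        // Continuation: runs only after f completes.
        Task p = f.ContinueWith(prev => Console.WriteLine(prev.Result * 2));

        Task.WaitAll(t, p);
        Console.WriteLine(result);  // prints 21 (the continuation printed 42)
    }
}
```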
Coordination Data Structures (1 of 3)

Concurrent Collections
•   BlockingCollection<T>
•   ConcurrentBag<T>
•   ConcurrentDictionary<TKey,TValue>
•   ConcurrentLinkedList<T>
•   ConcurrentQueue<T>
•   ConcurrentStack<T>
•   IProducerConsumerCollection<T>
•   Partitioner, Partitioner<T>,
    OrderablePartitioner<T>
[Figure: producers and consumers around a bounded collection —
producers block if it is full, consumers block if it is empty]
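BlockingCollection<T> is the collection in the picture: bounded, with producers blocking when it is full and consumers blocking when it is empty. A minimal producer/consumer sketch:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class ProducerConsumer
{
    public static int Run()
    {
        // Bounded to 2 items: the producer blocks when it gets ahead.
        var queue = new BlockingCollection<int>(boundedCapacity: 2);

        var producer = Task.Run(() =>
        {
            for (int i = 1; i <= 5; i++) queue.Add(i);  // blocks if full
            queue.CompleteAdding();                     // lets the consumer finish
        });

        int sum = 0;
        foreach (int item in queue.GetConsumingEnumerable())  // blocks if empty
            sum += item;

        producer.Wait();
        return sum;
    }

    static void Main() => Console.WriteLine(Run());  // prints 15
}
```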
Coordination Data Structures (2 of 3)

Synchronization Primitives
•   Barrier
•   CountdownEvent
•   ManualResetEventSlim
•   SemaphoreSlim
•   SpinLock
•   SpinWait
[Figure: a Barrier looping through phases with a postPhaseAction after
each; a CountdownEvent counting signals down to zero]
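A short sketch of the two primitives in the picture: a Barrier whose post-phase action runs once per phase, and a CountdownEvent that becomes signaled after a fixed number of Signal calls.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class SyncPrimitives
{
    public static int Run()
    {
        int phases = 0;
        // Two participants rendezvous; the post-phase action runs once
        // per phase, while the participants are still blocked.
        var barrier = new Barrier(2, b => phases++);

        var workers = new Task[2];
        for (int w = 0; w < 2; w++)
            workers[w] = Task.Run(() =>
            {
                barrier.SignalAndWait();   // end of phase 1
                barrier.SignalAndWait();   // end of phase 2
            });
        Task.WaitAll(workers);

        // CountdownEvent: set after 3 signals, so Wait returns at once.
        var done = new CountdownEvent(3);
        for (int i = 0; i < 3; i++) done.Signal();
        done.Wait();

        return phases;
    }

    static void Main() => Console.WriteLine(Run());  // prints 2
}
```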
Coordination Data Structures (3 of 3)

Initialization Primitives
•   Lazy<T>, LazyVariable<T>, LazyInitializer
•   ThreadLocal<T>

Cancellation Primitives
•   CancellationToken
•   CancellationTokenSource
•   ICancelableOperation
[Figure: a CancellationSource in MyMethod( ) hands a CancellationToken
across a thread boundary, through Foo(…, CancellationToken ct) and
Bar(…, CancellationToken ct), down to ManualResetEventSlim.Wait( ct )]
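A sketch tying the two groups together: lazy initialization, a token handed out by a CancellationTokenSource, and a cancellable wait that observes it (mirroring the ManualResetEventSlim.Wait( ct ) at the bottom of the slide):

```csharp
using System;
using System.Threading;

class InitAndCancel
{
    public static string Run()
    {
        // Lazy<T>: the factory runs only on first access to .Value.
        var lazy = new Lazy<string>(() => "initialized");
        string v = lazy.Value;

        // The source hands out tokens; callees only see the token.
        var cts = new CancellationTokenSource();
        CancellationToken token = cts.Token;
        cts.Cancel();

        // A cancellable wait wakes up with OperationCanceledException
        // instead of blocking forever on the unset event.
        var gate = new ManualResetEventSlim(false);
        try { gate.Wait(token); }
        catch (OperationCanceledException) { return v + ", canceled"; }
        return v;
    }

    static void Main() => Console.WriteLine(Run());  // prints "initialized, canceled"
}
```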
