Migration To Multi Core - Parallel Programming Models

This is a one-day parallel programming workshop I gave more than a year ago.


Transcript

  • 1. Migration to Multi-Core Zvi Avraham, CTO [email_address]
  • 2. Agenda
    • Multi-Core and Many-Core
    • Hardware slides (another PowerPoint)
    • Parallel Programming Models
    • CPU Affinity
    • Parallel Programming APIs
    • Win32 Threads API
    • OpenMP Tutorial (another PowerPoint)
    • Intel TBB – Thread Building Blocks (if time permits)
    • Demos – C++ source code samples
    • Summary
    • Questions
  • 3. Multi-Core vs. Many-Core
    • Multi-Core: ≤ 8 cores/threads
    • Many-Core: > 8 cores/threads
  • 4. Multi-Core x86
    • AMD Athlon X2
    • AMD Opteron Dual-Core
    • AMD Barcelona Quad-Core
    • AMD Phenom Triple-Core
    • Pentium D
    • Intel Core 2 Duo
    • Intel Core 2 Quad
    • Dual Core Xeon
    • Quad Core Xeon
  • 5. Many-Core Processors (MPU) – the replacement for DSP and FPGA
    • Sun UltraSPARC T2 / Niagara – 8 cores x 8 threads
      • (the fastest commodity processor today)
    • Sun UltraSPARC RK / Rock – 16 cores x 2 threads
    • RMI XLR™ 732 Processor – 8 cores x 4 threads (MIPS)
    • Cavium OCTEON – 16 cores (MIPS)
    • TILERA TILE64 – 64 cores (MIPS)
  • 6. Intel Tera-scale
    • 80 cores ~ 1.81 Teraflops, 2 GB RAM on chip
    • Currently only prototype (expected in 2009/2010)
  • 7. Cell Processor SONY Playstation 3
    • 1 main PowerPC core x 2 threads @ 3 GHz
    • + 7 “Synergistic Processing Elements” @ 3 GHz
    • Total – 9 threads
  • 8. Xenon CPU Microsoft XBOX 360
    • PowerPC – 3 Cores x 2 Threads @ 3.2 GHz
  • 9. NVIDIA GeForce 8800GTX
    • 128 Stream Processors @ 1.3 GHz, ~520 GFLOPS
  • 10. Intel Larrabee GPU
    • Up to 48 cores x 4 threads
    • Cores based on Pentium (x86-64)
    • 512 bit SIMD
  • 11. Hardware Slides
    • See external PowerPoint presentation:
      • Maximizing Desktop Application Performance on Dual-Core PC Platforms
  • 12. Parallel Programming Models
  • 13. Concurrent/Parallel/Distributed
    • Concurrency
      • property of systems in which several computational processes are executing at the same time, and potentially interacting with each other
    • Parallel
      • computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently ("in parallel")
    • Distributed
      • Tasks run on different interconnected computers
  • 14. Levels of HW Parallelism
    • Bit-level
      • (i.e. from 8bit to 16bit to 32bit to 64bit CPUs)
    • Instruction-level
      • (CPU schedules multiple instructions simultaneously)
      • SIMD
    • SMT – Simultaneous Multi-Threading
    • CMP – Core Multi-Processing
    • SMP – Symmetric Multi-Processing
    • Cluster – computers with fast interconnect
    • Grid – network of loosely connected computers
  • 15. Flynn’s Taxonomy

                        Single Instruction   Multiple Instruction
        Single Data           SISD                  MISD
        Multiple Data         SIMD                  MIMD
  • 16. Flynn’s Taxonomy (2)
  • 17. Parallel Programming Models
    • How can we write programs that run faster on a multi-core CPU?
    • How can we write programs that do not crash on a multi-core CPU?
    • Choose the right model!
    • There are two fundamental parallel programming models:
      • Shared State (Shared Memory)
      • Message Passing (Distributed Memory)
    • DSM (Distributed Shared Memory) – academic
  • 18. Shared State
    • Shared state concurrency involves the idea of “mutable state” (literally memory that can be changed).
    • This is fine as long as you have only one process/thread doing the changing.
    • If you have multiple processes sharing and modifying the same memory, you have a recipe for disaster - madness lies here.
    • To protect against the simultaneous modification of shared memory, we use a locking mechanism. Call this a mutex, critical section, synchronized method, or whatever you like, but it’s still a lock.
  • 19. Message Passing
    • In message passing concurrency, there is no shared state. All computations are done inside processes/threads, and the only way to exchange data is through asynchronous message passing.
    • Why is this good?
      • No need for locks.
      • No locks – (almost) no deadlocks.
      • No locks – deterministic execution – good for Real Time.
      • No locks – good for I/O-bound applications – efficient network programming.
  • 20. Message Passing (cont.)
    • ActiveObject design pattern
      • An ActiveObject is an object with its own thread of control and an attached message queue.
      • On arrival of a message, the ActiveObject wakes up and executes the command in the message.
      • Optionally, it sends the result back to the caller (via a callback or a Future).
    • Rational Rose Capsule design pattern
      • The same as ActiveObject, but with an attached FSM.
      • Messages are called “Events”.
      • Events change the FSM state according to the state transition table.
    • MPI (Message Passing Interface) – the de-facto standard API for message passing between nodes in computational clusters.
  • 21. Active Object Design Pattern
  • 22. Rational’s Capsule ROOM – Real-Time Object Oriented Modeling
  • 23. Implicit vs. Explicit Parallelism
    • Implicit Parallelism
      • Automatic parallelization of code by the compiler or a library
      • A pure implicitly parallel language does not need special directives, operators or functions to enable parallel execution.
      • The programmer focuses on the core tasks, instead of worrying about task division and communication
      • Less control for the programmer
      • Often less efficient
      • Examples: Matlab, R, LabView, NESL, ZPL, Intel Ct library for C++
  • 24. Implicit vs. Explicit Parallelism
    • Explicit Parallelism
      • Representation of concurrent computations by means of primitives in the form of special-purpose directives or function calls (sometimes called “parallelization overhead”)
      • Most parallel primitives are related to process synchronization, communication or task partitioning
      • Full control by the programmer
      • Examples: thread APIs, Erlang, Ada, Cilk, etc.
  • 25. Data Parallel vs. Task Parallel
    • Data Parallelism
      • a form of parallelization of computing across multiple processors in parallel computing environments. Data parallelism focuses on distributing the data across different parallel computing nodes.
    • Task Parallelism
      • a form of parallelization of computer code across multiple processors in parallel computing environments. Task parallelism focuses on distributing execution processes (threads) across different parallel computing nodes.
    • Most real programs fall somewhere on a continuum between Task parallelism and Data parallelism.
  • 26. Data Parallel example
        if CPU = "a" then
            low_limit   = 1
            upper_limit = 50
        else if CPU = "b" then
            low_limit   = 51
            upper_limit = 100
        end if
        for i = low_limit to upper_limit
            Task on d(i)
        end for
  • 27. Task Parallel example
        program:
            ...
            if CPU = "a" then
                do task "A"
            else if CPU = "b" then
                do task "B"
            end if
            ...
        end program
  • 28. “Plain” vs. Nested Data Parallelism
    • “Plain” Data Parallelism
      • Good for regular data (vectors, matrices)
      • No support for control flow
      • Efficient impl. on SIMD or GPGPU HW
    • Nested Data Parallelism
      • Support irregular or sparse data
      • Can express control flow
      • Efficient impl. on Multi-core CPUs with SIMD
      • Examples:
        • [a:5, b:7] + [b:3, c:3, d:1] = [a:5, b:10, c:3, d:1]
        • sum( [a:5, a:3, b:2, b:4, d:1] ) = [a:8, b:6, d:1]
  • 29. Parallel Models
    • Data Parallel models:
      • Work Sharing (using Fork / Join)
        • Parallel For
      • Scatter / Gather
      • Map / Reduce
        • Split / Map / Combine / Reduce / Merge
    • Task Parallel models:
      • Fork / Join
      • Recursive Fork / Join (Work Stealing scheduler)
      • Scheduler / Workers (aka Master / Slave)
        • Compute Grid: Executer / Scheduler / Workers
  • 30. Scatter / Gather
  • 31. Fork / Join (diagram: a master thread forks parallel regions; barriers act as “joins”)
  • 32. Recursive Fork/Join
        Result solve(Problem problem) {
            if (problem is small)
                directly solve problem
            else {
                split problem into independent parts
                fork new subtasks to solve each part
                join all subtasks
                compose result from subresults
            }
        }
  • 33. Map / Reduce
    • Input & Output: each a set of key/value pairs
    • Programmer specifies two functions:
      • map(in_key, in_value) -> list(out_key, intermediate_value)
        • Processes an input key/value pair
        • Produces a set of intermediate pairs
      • reduce(out_key, list(intermediate_value)) -> list(out_value)
        • Combines all intermediate values for a particular key
        • Produces a set of merged output values (usually just one)
  • 34. Map / Reduce
  • 35. Types of Parallelism
    • Fine-grained
      • when tasks need to communicate many-times per second
      • small messages
      • low-latency
    • Coarse-grained
      • when tasks do not need to communicate many times per second
    • Embarrassing parallelism
      • tasks are independent
  • 36. Embarrassingly Parallel Problems
    • An Embarrassingly Parallel Problem is one for which no particular effort is needed to split the problem into a very large number of parallel tasks.
    • There is no essential dependency (or communication) between those parallel tasks.
    • Each step can be computed independently of every other step, so each step can be run on a separate processor to achieve quicker results.
    • Examples:
      • running the same algorithm on each grabbed frame
      • running the same algorithm with different parameters
  • 37. Extended Flynn’s Taxonomy
    • SPMD - Single Program, Multiple Data streams
      • Multiple autonomous processors simultaneously executing the same program (but at independent points, rather than in the lockstep that SIMD imposes) on different data. Also referred to as “Single Process, Multiple Data” [6]. SPMD is the most common style of parallel programming [7].
    • MPMD - Multiple Program, Multiple Data streams
      • Multiple autonomous processors simultaneously operating at least 2 independent programs. Typically such systems pick one node to be the "host" ("the explicit host/node programming model") or "manager" (the "Manager/Worker" strategy), which runs one program that farms out data to all the other nodes which all run a second program. Those other nodes then return their results directly to the manager.
  • 38. CPU Affinity Changing Process and Thread Affinity
  • 39. CPU Affinity
    • The system uses a symmetric multiprocessing (SMP) model to schedule processes and threads on multiple processors.
    • Any process/thread can be assigned to any processor.
    • On a single-CPU system, threads cannot run concurrently; they are time-sliced instead.
    • On a multi-CPU system, threads can run concurrently on different processors.
    • Scheduling is still determined by thread priority.
    • However, on a multiprocessor computer, you can also affect scheduling by setting the thread affinity and the thread ideal processor (i.e. “binding” a thread to a specific CPU).
  • 40. Why mess with CPU Affinity? Legacy Code Migration
    • Usually the Windows OS handles CPU affinity automatically very well. So why mess with it?
    • When migrating old applications to a multi-core CPU, their performance may degrade. This happens because the OS moves the application’s threads and interrupt routines between cores.
    • By pinning the application’s process and threads to CPU #0, the application behaves the same way as on a uniprocessor machine.
  • 41. Why mess with CPU Affinity? Real-Time and Determinism
    • By setting CPU affinity for our application’s threads, we get the deterministic execution times needed for Real Time.
    • The OS will no longer move our threads between cores.
    • For example, suppose we are developing a video surveillance system on a dual-core CPU.
    • We have two threads:
      • #1 – receives video frames from the Matrox Frame Grabber
      • #2 – transmits the received frames via TCP/IP to the Server
    • By setting the affinity of the receiving thread to CPU #0 and of the transmitting thread to CPU #1, we get deterministic execution times.
  • 42. Why mess with CPU Affinity? Memory Affinity
    • Cache locality
      • Some multi-core CPUs have caches that are not shared between cores (for example: Pentium D, some of Intel’s quad-core CPUs).
      • If the OS moves a thread from one core to another, its data is missing from the second core’s cache and must be fetched from RAM, so we get performance degradation.
    • NUMA
      • On a NUMA system, each memory bank connects directly to a single CPU. Accessing RAM attached to another CPU is slower.
      • By setting a thread’s CPU affinity and allocating memory on the faster RAM bank, we get a performance improvement.
  • 43. Changing CPU Affinity
    • Using Task Manager – manually; needs to be redone each time the program runs
    • ImageCfg.exe utility – modifies the program’s EXE image – needs to be done only once
    • Process.exe utility – changes the program’s affinity without modifying its EXE image
    • IntFiltr.exe – Interrupt Affinity Filter tool
    • Win32 Affinity API:
      • SetProcessAffinityMask
      • SetThreadAffinityMask
  • 44. CPU Affinity – Task Manager
  • 45. CPU Affinity – Task Manager
  • 46. ImageCFG – Affinity Mask Tool
    • Change CPU Affinity mask for existing EXE file
    • WARNING: modifies .EXE file – so backup before you do anything
    • “-u” – marks the image as uniprocessor-only:
      • imagecfg -u c:\path\to\file.exe
    • “-a” – process affinity mask value in hex:
      • Permanently set the CPU affinity mask to CPU #1:
        • imagecfg -a 0x1 c:\path\to\file.exe
      • Permanently set the CPU affinity mask to CPU #2:
        • imagecfg -a 0x2 c:\path\to\file.exe
  • 47. Process.exe – Get Affinity Mask
    • The “-a” option, used in conjunction with a process name or PID, makes the utility show the System Affinity Mask and the Process Affinity Mask.
  • 48. Process.exe – Set Affinity Mask
    • To set the affinity mask, simply append the binary mask after the PID/image name. Leading zeros are ignored, so there is no need to enter the full 32-bit mask.
    • Does not modify the EXE file – so it needs to be run each time.
  • 49. IntFiltr.exe – Interrupt Affinity Filter
    • Binding of device interrupts to particular processors on multiprocessor computers is a useful technique to maximize performance, scaling, and partitioning of large computers.
    • Interrupt-Affinity Filter (IntFiltr) is an interrupt-binding tool that lets you bind device interrupts to particular processors on multiprocessor computers.
    • IntFiltr uses Plug and Play features of Windows 2000 and provides a Graphical User Interface (GUI) to permit interrupt binding.
  • 50. Parallel Programming APIs
  • 51. Parallel Programming APIs
    • Threads – native OS threads API (Win32 Threads, POSIX pthreads, etc.). Low-level, highly error-prone API. Locking, etc.
    • OpenMP – SMP standard, which allow incremental parallelization of serial code, by adding compiler directives, called “pragmas”.
    • Intel TBB – Thread Building Blocks C++ library from Intel. High-level constructs: think in terms of tasks instead of threads (like “Parallel STL”).
    • Auto-Parallelization
  • 52. Simplicity / Complexity
    • The native threads programming model introduces much more complexity within the code than OpenMP or Intel® TBB, making it more challenging to maintain.
    • One of the benefits of using Intel® TBB or OpenMP when appropriate is that these APIs create and manage the thread pool for you: thread synchronization and scheduling are handled automatically.
  • 53. Capabilities Comparison

                                               Intel TBB   OpenMP   Threads
        Task-level parallelism                     +          +        -
        Data decomposition support                 +          +        -
        Complex parallel patterns (non-loops)      +          -        -
        Generic parallel patterns                  +          -        -
        Scalable nested parallelism support        +          -        -
        Built-in load balancing                    +          +        -
        Affinity support                           -          +        +
        Static scheduling                          -          +        -
        Concurrent data structures                 +          -        -
        Scalable memory allocator                  +          -        -
        I/O-dominated tasks                        -          -        +
        User-level synch. primitives               +          +        -
        Compiler support not required              +          -        +
        Cross-OS support                           +          +        -
  • 54. Native Threads
    • Win32 threads – CreateThread
    • POSIX threads
    • boost::threads – platform independent
    • ACE threads – platform independent, optimized for concurrent and network programming (used by many defense and telecom companies: Boeing, Raytheon, Elbit, Ericsson, Motorola, Lucent, Siemens, etc.)
  • 55. Win32 Threads API – it is assumed that the reader/listener is familiar with basic multithreading and the corresponding Win32 API
  • 56. What are Win32 Threads?
    • Microsoft Windows implementation
    • C language interface to library
      • Follows Win32 programming model
    • Threads exist within single process
    • All threads are peers
      • No explicit parent-child model
  • 57. Win32 API Hierarchy for Concurrency (diagram: Windows OS → Jobs → Processes, each with a primary thread → Threads → Fibers)
  • 58. Process
    • Each process provides the resources needed to execute a program. A process has a
      • virtual address space
      • executable code
      • open handles to system objects
      • security context
      • unique process identifier
      • environment variables
      • priority class
      • minimum and maximum working set sizes
      • at least one thread of execution.
    • Each process is started with a single thread, often called the primary thread, but can create additional threads from any of its threads.
  • 59. Thread
    • A thread is the entity within a process that can be scheduled for execution.
    • All threads of a process share its virtual address space and system resources.
    • In addition, each thread maintains:
      • exception handlers
      • scheduling priority
      • thread local storage (TLS)
      • unique thread identifier
      • set of structures the system will use to save the thread context until it is scheduled.
    • The thread context includes the thread's set of machine registers, the kernel stack, a thread environment block, and a user stack in the address space of the thread's process. Threads can also have their own security context, which can be used for impersonating clients.
  • 60. Job object
    • A job object allows groups of processes to be managed as a unit. Job objects are namable, securable, sharable objects that control attributes of the processes associated with them. Operations performed on the job object affect all processes associated with the job object.
    • Job = Process Group
    • There is no Thread Group API in Win32 (unlike ACE C++ library or Java)
  • 61. Fiber / Coroutine / Microthread
    • A fiber is a unit of execution that must be manually scheduled by the application.
    • Fibers run in the context of the threads that schedule them.
    • Each thread can schedule multiple fibers.
    • Fibers do not run simultaneously, so there is no need to lock shared data
    • In general, fibers do not provide advantages over a well-designed multithreaded application. However, using fibers can make it easier to port applications that were designed to schedule their own threads.
  • 62. Win32 Threads API Example: CreateThread
  • 63. Win32 Threads API: Critical Section
  • 64. CCriticalSection C++ Class
  • 65. CCriticalSection & CLock example
  • 66. Thread Affinity
    • Thread affinity forces a thread to run on a specific subset of processors.
    • Use the SetProcessAffinityMask function to specify thread affinity for all threads of the process.
    • To set the thread affinity for a single thread, use the SetThreadAffinityMask function. The thread affinity must be a subset of the process affinity.
    • You can obtain the current process affinity by calling the GetProcessAffinityMask function.
    • Setting thread affinity should generally be avoided, because it can interfere with the scheduler's ability to schedule threads effectively across processors. This can decrease the performance gains produced by parallel processing.
  • 67. Thread Ideal Processor
    • When you specify a thread ideal processor, the scheduler runs the thread on the specified processor when possible.
    • Use the SetThreadIdealProcessor function to specify a preferred processor for a thread.
    • This does not guarantee that the ideal processor will be chosen, but provides a useful hint to the scheduler.
  • 68. Our Running Example: The PI Program – Numerical Integration
    • Mathematically, we know that the integral of F(x) = 4.0/(1+x²) over [0, 1] equals π.
    • We can approximate the integral as a sum of rectangles: π ≈ Σ (i = 0..N) F(xᵢ)·Δx,
      where each rectangle has width Δx and height F(xᵢ) at the middle of interval i.
  • 69. PI: Matlab

        N = 1000000;
        Step = 1/N;
        PI = Step * sum(4 ./ (1 + (((1:N) - 0.5) * Step).^2));

    • Implicit Data Parallel
    • Implemented using SIMD
    • Later versions of Matlab can utilize multiple / multi-core CPUs
  • 70. PI Program: an example

        static long num_steps = 100000;
        double step;

        void main()
        {
            int i;
            double x, pi, sum = 0.0;
            step = 1.0 / (double)num_steps;
            for (i = 1; i <= num_steps; i++) {
                x = (i - 0.5) * step;
                sum = sum + 4.0 / (1.0 + x * x);
            }
            pi = step * sum;
        }
  • 71. OpenMP PI Program: parallel for with a reduction

        #include <omp.h>
        static long num_steps = 100000;
        double step;
        #define NUM_THREADS 2

        void main()
        {
            int i;
            double x, pi, sum = 0.0;
            step = 1.0 / (double)num_steps;
            omp_set_num_threads(NUM_THREADS);
            #pragma omp parallel for reduction(+:sum) private(x)
            for (i = 1; i <= num_steps; i++) {
                x = (i - 0.5) * step;
                sum = sum + 4.0 / (1.0 + x * x);
            }
            pi = step * sum;
        }

    • OpenMP adds 2 to 4 lines of code
  • 72. OpenMP
  • 73. OpenMP Slides
    • See external PowerPoint presentation:
      • OpenMP Tutorial Part 1: The Core Elements of OpenMP
  • 74. OpenMP Compiler Option
    • How to enable OpenMP in MSVC++ 2005
    • Project -> Properties -> Configuration Properties -> C/C++ -> Command Line :
  • 75. OpenMP Compiler Option
    • How to enable OpenMP in Intel Compiler:
      • /Qopenmp
      • /Qopenmp_report{0|1|2}
  • 76. Auto-Parallelization
    • Intel compiler only, using compiler switch:
      • /Qparallel
      • /Qpar_report[n]
    • Automatic threading of loops without having to manually insert OpenMP directives.
    • The compiler can identify “easy” candidates for parallelization
    • Large applications are difficult to analyze
  • 77. Intel® TBB Thread Building Blocks C++ Library
  • 78. Calculate PI using TBB
  • 79. Calculate PI in MPI
  • 80. MPI_Reduce
    • Reduces values on all processes to a single value on root process
    • Synopsis:

        int MPI_Reduce(
            void         *sendbuf,   // address of send buffer
            void         *recvbuf,   // address of receive buffer (significant only at root)
            int           count,     // number of elements in send buffer
            MPI_Datatype  datatype,  // data type of elements of send buffer
            MPI_Op        op,        // reduce operation: sum, prod, min, max, etc.
            int           root,      // rank of root process
            MPI_Comm      comm       // communicator (handle)
        );
  • 81. Summary
    • Multi-core performance improvements do not come for free if your application is single-threaded.
    • (The exception: multiple concurrent applications do all run faster together.)
    • Use thread CPU affinity to improve your application’s performance and Real-Time deterministic execution.
    • Use OpenMP to parallelize your existing serial C/C++ code – minimal investment and incremental way to threading.
    • Use Intel® TBB library for your new C++ projects – high-level STL-like parallel API.
    • For I/O-bound Real-Time and/or concurrent network programming, look at the ACE framework or use your native threads API.
    • Be aware of potential performance pitfalls such as cache locality and FSB saturation.
    • Benchmark everything!
  • 82. Commercial Multi-core
    • Commercial multi-core applications are not the future.
      • They are NOW!
  • 83. Any Questions?
  • 84. Thank you!