Parallel Programming Concepts
OpenHPI Course
Week 6 : Patterns and Best Practices
Unit 6.1: Parallel Programming Patterns
Dr. Peter Tröger + Teaching Team
Summary: Week 5
■  “Shared nothing” systems provide very good scalability
□  Adding new processing elements not limited by “walls”
□  Different options for interconnect technology
■  Task granularity is essential
□  Surface-to-volume effect
□  Task mapping problem
■  De-facto standard is MPI programming
■  High level abstractions with
□  Channels
□  Actors
□  MapReduce
2
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
“What steps / strategy would you apply
to parallelize a given compute-intensive program?”
The Parallel Programming Problem
3
[Diagram: a flexible parallel application must be matched to an execution environment with a given type and configuration]
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Parallelization and Design Patterns
■  Parallel programming relies on experience
□  Identification of concurrency
□  Identification of feasible algorithmic structures
□  If done wrong, performance / correctness may suffer
■  Rule of thumb: Somebody else is smarter than you !!
■  Design Pattern
□  Best practices, formulated as a template
□  Focus on general applicability to common problems
□  Well-known in object-oriented programming (“gang of four”)
■  Parallel design patterns in literature
□  Structured parallelization methodologies (== pattern)
□  Algorithmic building blocks commonly found (== pattern)
4
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Patterns for Parallel Programming
[Mattson et al.]
■  Phases in creating a parallel program
□  Finding Concurrency: Identify and
analyze exploitable concurrency
□  Algorithm Structure: Structure
the algorithm to take advantage
of potential concurrency
□  Supporting Structures: Define
program structures and data
structures needed for the code
□  Implementation Mechanisms:
Threads, processes, messages, …
■  Each phase is a design space
5
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Finding Concurrency Design Space
■  Identify and analyze exploitable concurrency
■  Example: Data Decomposition Pattern
□  Context: Computation is organized around large data
manipulation, similar operations on different data parts
□  Solution: Array-based data access (row, block),
recursive data structure traversal
■  Example: Group Tasks Pattern
□  Context: Tasks share temporal constraints (e.g. intermediate
data) or work on a shared data structure
□  Solution: Apply ordering constraints to groups of tasks, put
truly independent tasks in one group for better scheduling
6
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Algorithm Structure Design Space
■  Structure the algorithm
■  Consider how the identified concurrency is organized
□  Organize algorithm by tasks
◊  Tasks are embarrassingly parallel, or organized linearly
-> Task Parallelism
◊  Tasks organized by recursive procedure
-> Divide and Conquer
□  Organize algorithm by data dependencies
◊  Linear data dependencies -> Geometric Decomposition
◊  Recursive data dependencies -> Recursive Data
□  Organize algorithm by application data flow
◊  Regular data flow for computation -> Pipeline
◊  Irregular data flow -> Event-Based Coordination
7
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Example: Parallelize Bubble Sort
8
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
■  Bubble sort
□  Compare pair-wise and swap,
if in wrong order
■  Finding concurrency demands data
dependency consideration
□  Compare-exchange approach
needs some operation order
□  Algorithm idea implies hidden
data dependency
□  Idea: Parallelize serial rounds
■  Odd-even sort –
Compare [odd|even]-indexed pairs
and swap if needed (see the code sketch below)
□  Apply task parallelism pattern
[Trace: serial bubble sort on 1 24 18 12 77, one adjacent compare-exchange at a time:
1 24 18 12 77 → 1 18 24 12 77 → 1 18 12 24 77 → … → 1 12 18 24 77 → …]
Example: Parallelize Bubble Sort
9
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
[Trace, left: the serial bubble-sort rounds from the previous slide.
Trace, right: odd-even sort on the same data, two independent compare-exchanges per step:
1 24 18 12 77 → 1 24 12 18 77 → 1 12 24 18 77 → 1 12 18 24 77]
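A minimal sketch of the resulting algorithm (not from the slides): odd-even transposition sort in C with OpenMP. In each phase the compared pairs are disjoint, so all compare-exchanges of one phase are independent tasks; the array a and its length n are assumed to be given.

/* Odd-even transposition sort: alternate between even-indexed and
   odd-indexed pairs; the pairs of one phase do not overlap, so their
   compare-exchanges can run in parallel. n phases are sufficient. */
void odd_even_sort(int *a, int n)
{
    for (int phase = 0; phase < n; phase++) {
        int start = phase % 2;        /* 0: pairs (0,1),(2,3),…; 1: pairs (1,2),(3,4),… */
        #pragma omp parallel for
        for (int i = start; i + 1 < n; i += 2) {
            if (a[i] > a[i + 1]) {    /* independent compare-exchange */
                int tmp = a[i];
                a[i] = a[i + 1];
                a[i + 1] = tmp;
            }
        }
    }
}

Compile with OpenMP enabled (e.g. -fopenmp); without OpenMP the pragma is ignored and the code simply runs serially.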
Supporting Structures Design Space
■  Software structures that support the expression of parallelism
■  Program structuring patterns - Single Program Multiple Data
(SPMD), master / worker, loop parallelism, fork / join
■  Data structuring patterns - Shared data, shared queue,
distributed array
□  Example: Shared data pattern
◊  Define shared abstract data type with concurrency control
(read only, read / write, independent sub sets, …)
◊  Choose appropriate synchronization construct
■  Supporting structures map to algorithm structure
□  Example: SPMD works well with geometric decomposition
10
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
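As an illustration of the shared queue / shared data patterns above, a minimal POSIX-threads sketch (names and the fixed capacity are illustrative, not from the slides): all access to the shared structure goes through one mutex, and consumers block on a condition variable.

#include <pthread.h>

#define CAP 128

typedef struct {
    int buf[CAP];
    int head, tail, count;      /* fields must be initialized (e.g. pthread_mutex_init) before use */
    pthread_mutex_t lock;
    pthread_cond_t not_empty;
} shared_queue;

void queue_put(shared_queue *q, int item)   /* sketch: assumes the queue never becomes full */
{
    pthread_mutex_lock(&q->lock);
    q->buf[q->tail] = item;
    q->tail = (q->tail + 1) % CAP;
    q->count++;
    pthread_cond_signal(&q->not_empty);     /* wake one waiting consumer */
    pthread_mutex_unlock(&q->lock);
}

int queue_get(shared_queue *q)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)                   /* wait until a producer inserted an item */
        pthread_cond_wait(&q->not_empty, &q->lock);
    int item = q->buf[q->head];
    q->head = (q->head + 1) % CAP;
    q->count--;
    pthread_mutex_unlock(&q->lock);
    return item;
}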
Patterns for Parallel Programming
11
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Design Space Parallelization Pattern
1. Finding Concurrency
Task Decomposition, Data Decomposition,
Group Tasks, Order Tasks, Data Sharing,
Design Evaluation
2. Algorithm Structure
Task Parallelism, Divide and Conquer,
Geometric Decomposition, Recursive Data,
Pipeline, Event-Based Coordination
3. Supporting Structures
SPMD, Master/Worker, Loop Parallelism,
Fork/Join, Shared Data, Shared Queue,
Distributed Array
4. Implementation Mechanisms
Thread & Process Creation and Destruction,
Memory Synchronization, Fences,
Barriers, Mutual Exclusion, Message Passing,
Collective Communication
Our Pattern Language (OPL)
■  Extended version of the Mattson et al. proposals
■  http://parlab.eecs.berkeley.edu/wiki/patterns/patterns
12
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Structural Patterns
(map/reduce, ...)
Computational Patterns
(Monte Carlo, ...)
Algorithm Strategy Patterns
(Task / data parallelism, pipelining, decomposition, ...)
Implementation Strategy Patterns
(SPMD, fork/join, Actors, shared queue, BSP, ...)
Concurrent Execution Patterns
(SIMD, MIMD, task graph, message passing, mutex, ...)
Our Pattern Language (OPL)
■  Structural patterns
□  Describe overall computational goal of the application
□  “Boxes and arrows”
■  Computational patterns
□  Classes of computations (Berkeley dwarves)
□  “Computations occurring in the boxes”
■  Algorithm strategy patterns
□  High-level strategies to exploit concurrency and parallelism
■  Implementation strategy patterns
□  Structures realized in source code
□  Program organization and data structures
■  Concurrent execution patterns
□  Approaches to support the execution of parallel algorithms
□  Strategies that advance a program
□  Basic building blocks for coordination of concurrent tasks
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
13
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
14
Example: Discrete Event Pattern
■  Name: Discrete Event Pattern
■  Problem: Suppose a computational pattern can be decomposed
into groups of semi-independent tasks interacting in an
irregular fashion. The interaction is determined by the flow of
data between them which implies ordering constraints between
the tasks. How can these tasks and their interaction be
implemented so they can execute concurrently?
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
15
Example: Discrete Event Pattern
■  Solution: A good solution is based on expressing the data flow using
abstractions called events, with each event having a task that
generates it and a task that processes it. Because an event must be
generated before it can be processed, events also define ordering
constraints between the tasks. Computation within each task consists
of processing events.
Initialize
while (not done)
{
    receive event
    process event
    send events
}
finalize
[Diagram: three tasks (1, 2, 3) exchanging events]
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
16
Patterns for Efficient Computation
[Cool et al.]
■  Nesting Patterns
■  Structured Serial Control Flow Patterns
(Selection, Iteration, Recursion, …)
■  Parallel Control Patterns
(Fork-Join, Stencil, Reduction, Scan, …)
■  Serial Data Management Patterns
(Closures, Objects, …)
■  Parallel Data Management Patterns
(Pack, Pipeline, Decomposition,
Gather, Scatter, …)
■  Other Parallel Patterns (Futures, Speculative Selection, Workpile,
Search, Segmentation, Category Reduction, …)
■  Non-Deterministic Patterns (Branch and Bound, Transactions, …)
■  Programming Model Support
17
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Parallel Programming Concepts
OpenHPI Course
Week 6 : Patterns and Best Practices
Unit 6.2: Foster’s Methodology
Dr. Peter Tröger + Teaching Team
Designing Parallel Algorithms [Foster]
■  Map workload problem on an execution environment
□  Concurrency & locality for speedup, scalability
■  Four distinct stages of a methodological approach
■  A) Search for concurrency and scalability
□  Partitioning:
Decompose computation and data into small tasks
□  Communication:
Define necessary coordination of task execution
■  B) Search for locality and performance
□  Agglomeration:
Consider performance and implementation costs
□  Mapping:
Maximize processor utilization, minimize communication
19
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Partitioning
■  Expose opportunities for parallel execution through
fine-grained decomposition
■  Good partition keeps computation and data together
□  Data partitioning leads to data parallelism
□  Computation partitioning leads to task parallelism
□  Complementary approaches, can lead to different algorithms
□  Reveal hidden structures of the algorithm that have potential
□  Investigate complementary views on the problem
■  Avoid replication of either computation or data,
can be revised later to reduce communication overhead
■  Activity results in multiple candidate solutions
20
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Partitioning - Decomposition Types
■  Domain Decomposition
□  Define small data fragments
□  Specify computation for them
□  Different phases of computation
on the same data are handled separately
□  Rule of thumb:
First focus on large, or frequently used, data structures
■  Functional Decomposition
□  Split up computation into disjoint
tasks, ignore the data accessed
for the moment
□  With significant data overlap,
domain decomposition is more
appropriate
21
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
[Foster]
[Foster]
Partitioning - Checklist
■  Checklist for resulting partitioning scheme
□  Order of magnitude more tasks than processors ?
◊  Keeps flexibility for next steps
□  Avoidance of redundant computation and storage needs ?
◊  Scalability for large problem sizes
□  Tasks of comparable size ?
◊  Goal to allocate equal work to processors
□  Does number of tasks scale with the problem size ?
◊  Algorithm should be able to solve larger tasks with more
given resources
■  Identify bad partitioning by estimating performance behavior
■  If needed, re-formulate the partitioning (backtracking)
□  May even happen in later steps
22
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Communication
■  Specify links between data consumers and data producers
■  Specify kind and number of messages on these links
■  Domain decomposition problems might have tricky communication
infrastructures, due to data dependencies
■  Communication in functional decomposition problems can easily
be modeled from the data flow between the tasks
■  Categorization of communication patterns
□  Local communication (few neighbors) vs.
global communication
□  Structured communication (e.g. tree) vs.
unstructured communication
□  Static vs. dynamic communication structure
□  Synchronous vs. asynchronous communication
23
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Communication - Hints
■  Distribute computation and communication,
don‘t centralize algorithm
□  Bad example: Central manager for parallel summation
■  Unstructured communication is hard to agglomerate,
better avoid it
■  Checklist for communication design
□  Do all tasks perform the same amount of communication ?
□  Does each task perform only local communication ?
□  Can communication happen concurrently ?
□  Can computation happen concurrently ?
■  Solve issues by distributing or replicating communication hot spots
24
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Communication - Ghost Cells
■  Domain decomposition might lead to chunks that
demand data from each other
■  Solution 1: Copy the necessary portion of data
(“ghost cells”, see the MPI sketch below)
□  If no synchronization is needed after update
□  Data amount and frequency of update
influence resulting overhead and efficiency
□  Additional memory consumption
■  Solution 2: Access relevant data “remotely”
□  Delays thread coordination until the data is
really needed
□  Correctness (“old” data vs. “new” data) must be
considered on parallel progress
25
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
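A minimal MPI sketch of solution 1 (variable names are illustrative): every rank owns local_n cells of a 1D domain plus one ghost cell on each side, and exchanges the border cells with its neighbors.

#include <mpi.h>

/* Exchange ghost cells with the left and right neighbor.
   u[1..local_n] are the locally owned cells; u[0] and u[local_n+1]
   receive copies of the neighbors' border cells. */
void exchange_ghost_cells(double *u, int local_n, int rank, int size)
{
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* send own left border cell, receive right neighbor's border cell */
    MPI_Sendrecv(&u[1],           1, MPI_DOUBLE, left,  0,
                 &u[local_n + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* send own right border cell, receive left neighbor's border cell */
    MPI_Sendrecv(&u[local_n], 1, MPI_DOUBLE, right, 1,
                 &u[0],       1, MPI_DOUBLE, left,  1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}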
Agglomeration
■  Algorithm so far is correct,
but not specialized for a particular execution environment
■  Check partitioning and communication decisions again
□  Agglomerate tasks for efficient execution on target hardware
□  Replicate data and / or computation for efficiency reasons
■  Resulting number of tasks can still be greater than the number of
processors
■  Three conflicting guiding decisions
□  Reduce communication costs by coarser granularity of
computation and communication
□  Preserve flexibility for later mapping by finer granularity
□  Reduce engineering costs for creating a parallel version
26
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Agglomeration - Granularity
■  Since the execution environment is now considered,
the surface-to-volume effect becomes relevant
■  Late consideration keeps
core algorithm flexibility
27
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
[Foster]
Surface-to-volume effect
Agglomeration - Checklist
■  Communication costs reduced by increasing locality ?
■  Does replicated computation outweigh its costs in all cases ?
■  Does data replication restrict the problem size ?
■  Do the larger tasks still have similar
computation / communication costs ?
■  Do the larger tasks still act with sufficient concurrency ?
■  Does the number of tasks still scale with the problem size ?
■  How much can the task count decrease, without disturbing load
balancing, scalability, or engineering costs ?
■  Is the transition to parallel code worth the engineering costs ?
28
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Mapping
■  Historically only relevant for shared-nothing systems
□  Shared memory systems have the operating system scheduler
□  With NUMA, this may also become relevant in shared memory
systems of the future (e.g. PGAS task placement)
■  Minimize execution time by …
□  … placing concurrent tasks on different nodes
□  … placing tasks with heavy communication on the same node
■  Conflicting strategies, additionally restricted by resource limits
□  Task mapping problem
□  Known to be compute-intense (bin packing)
■  Set of sophisticated (dynamic) heuristics for load balancing
□  Preference for local algorithms that do not need global
scheduling state
29
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Parallel Programming Concepts
OpenHPI Course
Week 6 : Patterns and Best Practices
Unit 6.3: Berkeley Dwarfs
Dr. Peter Tröger + Teaching Team
Common Algorithmic Problems
■  Sources
□  Parallel programming courses
□  Parallel benchmarks
□  Development guides
□  Parallel programming books
□  User stories
31
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
A View From Berkeley
■  Technical report from
Berkeley (2006), defining
parallel computing research
questions and
recommendations
■  Definition of „13 dwarfs“
□  Common designs of
parallel computation and
communication
□  Allow better evaluation
of programming models
and architectures
32
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
The Landscape of Parallel Computing Research: A
View from Berkeley
Krste Asanovic
Ras Bodik
Bryan Christopher Catanzaro
Joseph James Gebis
Parry Husbands
Kurt Keutzer
David A. Patterson
William Lester Plishker
John Shalf
Samuel Webb Williams
Katherine A. Yelick
Electrical Engineering and Computer Sciences
University of California at Berkeley
Technical Report No. UCB/EECS-2006-183
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
December 18, 2006
A View From Berkeley
■  Sources
□  EEMBC benchmarks (embedded systems), SPEC benchmarks
□  Database and text mining technology
□  Algorithms in computer game design and graphics
□  Machine learning algorithms
□  Original „7 Dwarfs“ for supercomputing [Colella]
■  „Anti-benchmark“
□  Dwarfs are not tied to code or language artifacts
□  Can serve as understandable vocabulary across disciplines
□  Allow feasibility study of hardware and software design
◊  No need to wait for applications being developed
33
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
13 Dwarfs
■  Dwarfs currently defined
□  Dense Linear Algebra
□  Sparse Linear Algebra
□  Spectral Methods
□  N-Body Methods
□  Structured Grids
□  Unstructured Grids
□  MapReduce
□  Combinational Logic
□  Graph Traversal
□  Dynamic Programming
□  Backtrack and Branch-and-Bound
□  Graphical Models
□  Finite State Machines
34
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
■  One dwarf may be implemented based on another one
■  Increasing uptake in scientific publications
■  Several reference implementations for CPU / GPU
Dwarfs in Popular Applications
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
35
[Heat map after Patterson: how hot (frequent) each dwarf is in popular application domains, from hot to cold]
■  Classic vector and matrix operations on non-sparse data
(vector op vector, matrix op vector, matrix op matrix)
■  Data layout as contiguous array(s)
■  High degree of data dependencies
■  Computation on elements, rows, columns or matrix blocks
■  Issues with memory hierarchy, data distribution is critical
■  Demands overlapping of computation and communication
Dense Linear Algebra
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
36
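A minimal sketch of one such operation, a dense matrix-vector product with row-wise decomposition (OpenMP, illustrative names): every row is an independent dot product over a contiguous stretch of the array.

/* y = A * x for a dense n x n matrix stored as one contiguous array */
void dense_matvec(const double *A, const double *x, double *y, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A[i * n + j] * x[j];   /* contiguous row access, cache friendly */
        y[i] = sum;
    }
}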
Sparse Linear Algebra
■  Operations on a sparse matrix (with lots of zeros)
■  Typically compressed data structures, integer operations,
only non-zero entries + indices
□  Dense blocks to exploit caches
■  Complex dependency structure
■  Scatter-gather vector operations
are often helpful
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
37
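A minimal sketch of a sparse matrix-vector product in compressed row storage (CSR), one typical compressed data structure as described above (illustrative names): only the non-zero values, their column indices, and one row-pointer array are stored.

/* y = A * x with A in CSR format; each row is processed independently */
void csr_matvec(const double *val, const int *col_idx, const int *row_ptr,
                const double *x, double *y, int n_rows)
{
    #pragma omp parallel for
    for (int i = 0; i < n_rows; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_idx[k]];   /* indexed (gather) access into x */
        y[i] = sum;
    }
}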
N-Body Methods
■  Physics: Predicting individual motions of an object group
interacting gravitationally
□  Calculations on interactions between many discrete points
■  Hierarchical tree-based and mesh-based methods,
avoid computing all pair-wise interactions
■  Variations with particle-particle
methods (one point to all others)
■  Large number of independent
calculations in a time step,
followed by all-to-all
communication
■  Issues with load balancing and
missing fixed hierarchy
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
38
Structured Grid
■  Data as a regular multidimensional grid
□  Access is regular and statically determinable
■  Computation as sequence of grid updates
□  Points are updated concurrently using values
from a small neighborhood
■  Spatial locality to use long cache lines
■  Temporal locality to allow cache reuse
■  Parallel mapping with sub-grid per processor
□  Ghost cells, surface to volume ratio
■  Latency hiding
□  Increased number of ghost cells
□  Coarse-grained data exchange
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
39
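A minimal sketch of one grid update (Jacobi-style 5-point stencil, OpenMP, illustrative names): every interior point is recomputed from a small neighborhood, reading from one grid and writing to another, so all points can be updated concurrently.

/* One sweep over an n x n grid (including border): replace every
   interior point by the average of its four neighbors. */
void jacobi_sweep(const double *src, double *dst, int n)
{
    #pragma omp parallel for
    for (int i = 1; i < n - 1; i++)
        for (int j = 1; j < n - 1; j++)
            dst[i * n + j] = 0.25 * (src[(i - 1) * n + j] + src[(i + 1) * n + j]
                                   + src[i * n + j - 1]   + src[i * n + j + 1]);
}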
Unstructured Grid
■  Elements update neighbors in irregular
mesh/grid - static or dynamic structure
■  Problematic data distribution and access
requirements, indirection through tables
■  Modeling domain (e.g. physics)
□  Mesh represents surface or volume
□  Entities are points, edges, faces, ...
□  Applying pressure, temperature, …
□  Computations involve numerical
solution of differential equations
□  Sequence of mesh updates
■  Massively data parallel, but irregularly
distributed data and communication
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
40
MapReduce
■  Originally called “Monte Carlo” in dwarf concept
□  Repeated independent execution of a function
(e.g. random number generation, map function)
□  Results aggregated at the end
□  Nearly no communication between tasks,
embarrassingly parallel
■  Examples: Monte Carlo, BOINC project, protein structures
[http://climatesanity.wordpress.com]
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
41
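A minimal sketch of the dwarf (not from the slides): Monte Carlo estimation of pi with OpenMP, where every sample is an independent function evaluation and a single reduction aggregates the results at the end.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    const long samples = 10000000;
    long hits = 0;

    #pragma omp parallel reduction(+:hits)
    {
        unsigned int seed = 42 + omp_get_thread_num();   /* independent per-thread random stream */
        #pragma omp for
        for (long i = 0; i < samples; i++) {
            double x = (double)rand_r(&seed) / RAND_MAX;
            double y = (double)rand_r(&seed) / RAND_MAX;
            if (x * x + y * y <= 1.0)
                hits++;                                   /* "map": independent evaluations */
        }
    }                                                     /* "reduce": sum of all hit counters */
    printf("pi is approximately %f\n", 4.0 * (double)hits / samples);
    return 0;
}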
■  Global optimization problem in large search space
■  Divide and Conquer principle
□  Branching into subdivisions
□  Optimize execution by ruling out regions
■  Examples: Integer linear programming,
boolean satisfiability, combinatorial optimization,
traveling salesman, constraint programming, …
■  Heuristics to guide search to productive regions
■  Parallel checking of sub-regions
□  Demands invariants about the search space
□  Demands dynamic load balancing, load prediction is hard
■  Example:
Place N queens on a chessboard so that no two attack each other
Backtrack / Branch-and-Bound
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
42
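A minimal sketch of the N-queens example (not from the slides): recursive backtracking that prunes conflicting placements, with the first branching level checked in parallel; each first-row position spans an independent sub-region of the search space.

#include <stdio.h>
#include <stdlib.h>

#define N 8

/* conflict check against the queens already placed in rows 0..row-1 */
static int safe(const int *cols, int row, int col)
{
    for (int r = 0; r < row; r++)
        if (cols[r] == col || abs(cols[r] - col) == row - r)
            return 0;
    return 1;
}

/* backtracking: place one queen per row, prune on conflict */
static long solve(int *cols, int row)
{
    if (row == N)
        return 1;
    long count = 0;
    for (int col = 0; col < N; col++)
        if (safe(cols, row, col)) {
            cols[row] = col;
            count += solve(cols, row + 1);
        }
    return count;
}

int main(void)
{
    long total = 0;
    /* parallelize the first branching level; deeper levels stay serial */
    #pragma omp parallel for reduction(+:total)
    for (int col = 0; col < N; col++) {
        int cols[N];
        cols[0] = col;
        total += solve(cols, 1);
    }
    printf("%d-queens solutions: %ld\n", N, total);
    return 0;
}

The sub-trees differ strongly in size, which illustrates the load balancing issue from the slide; a dynamic loop schedule (schedule(dynamic)) usually balances better than the default static one.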
Branch-and-Bound
43
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
[http://docs.jboss.org/drools]
44
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
[http://docs.jboss.org/drools]
Berkeley Dwarfs
■  Relevance of single dwarfs
widely differs
■  No widely accepted single
benchmark implementation
■  Computational dwarfs on
different layers,
implementations may be
based on each other
■  OpenDwarfs project
□  Optimized code for
different platforms
■  Parallel Dwarfs project
□  In C++, C#, F# for
Visual Studio
45
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
[Cover: “The Landscape of Parallel Computing Research: A View from Berkeley”, Technical Report UCB/EECS-2006-183, December 2006]
Parallel Programming Concepts
OpenHPI Course
Week 6 : Patterns and Best Practices
Unit 6.4: Some Future Trends
Dr. Peter Tröger + Teaching Team
NUMA Impact Increases
47
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
[Diagram: four processors, each with four cores, a shared L3 cache, an integrated memory controller, I/O, and locally attached memory, interconnected by QPI links (NUMA)]
Innovation in Memory Technology
■  3D NAND
■  Hybrid Memory Cube
□  Intel, Micron, …
□  3D array of DDR-alike
memory cells
□  Early samples
available, 160GB/s
□  Through-silicon via
(TSV) approach with
embedded controllers,
attached to CPU
■  RRAM / ReRAM
□  Non-volatile memory
48
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
[computerworld.com][extremetech.com]
Power Wall 2.0 = Dark Silicon
“Dark Silicon and the End
of Multicore Scaling”
by Hadi Esmaeilzadeh, Emily
Blem, Renée St. Amant,
Karthikeyan Sankaralingam,
Doug Burger
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
49
Hardware / Software Co-Design
■  Increasing number of cores by Moore‘s law
■  Power wall / dark silicon problem will become worse
□  In addition, battery-powered devices become more relevant
■  Idea: Use additional transistors for specialization
□  Design hardware for a software problem
□  Make it part of the processor („compile into hardware“)
□  More efficiency, less flexibility
□  Partially known from ILP SIMD support
□  Examples: Cryptography, regular expressions
■  Example: Cell processor (Playstation 3)
□  64-bit Power core
□  8 specialized co-processors
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
50
Software at Scale [Dongarra]
■  Effective utilization of many-core and hybrid hardware
□  Break fork-join parallelism
□  Dynamic data driven execution, consider block layout
□  Exploiting mixed precision (GPU vs. CPU, power consumption)
■  Aim for self-adapting software and auto-tuning support
□  Manual optimization is too hard
□  Let software optimize the software
■  Consider fault-tolerant software
□  With millions of cores, things break all the time
■  Focus on algorithm classes that reduce communication
□  Special problem in dense computation
□  Aim for asynchronous iterations
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
51
OpenMP 4.0
■  SIMD extensions
□  Portable primitives to describe SIMD parallelization
□  Loop vectorization with simd construct
□  Several arguments for guiding the compiler (e.g. alignment)
■  Targeting extensions
□  The thread running the OpenMP program executes on the host device
□  Implementation may support multiple target devices
□  Control off-loading of loops and code regions on such devices
■  New API for device data environment
□  OpenMP - managed data items can be moved to the device
□  New primitives for better cancellation support
□  User-defined reduction operations
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
52
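A minimal sketch combining the two extensions (illustrative names; whether the loop is actually off-loaded depends on the implementation and the available devices):

/* saxpy with OpenMP 4.0: map data to a target device, run the loop
   there in parallel, and ask the compiler to vectorize it (simd) */
void saxpy(int n, float a, const float *x, float *y)
{
    #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
    #pragma omp parallel for simd
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}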
OpenACC
■  „OpenMP for accelerators“ (GPU, FPGAs, ...)
□  Partners: Cray, supercomputing centers, NVIDIA, PGI
□  Annotation in C, C++, and Fortran source code
□  OpenACC code can also be started on the accelerator
■  Features
□  Specification of data locality and asynchronous execution
□  Abstract specification of data movement, loop parallelization
□  Caching and synchronization support
□  Management of data movement by compiler and runtime
□  Implementations available, e.g. for Xeon Phi
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
53
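For comparison, the same loop annotated with OpenACC (a sketch with illustrative names); compiler and runtime manage the data movement declared in the clauses:

void saxpy_acc(int n, float a, const float *restrict x, float *restrict y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}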
Autotuners
■  Optimize parallel code by generating many variants
□  Try many or all optimization switches
◊  Loop unrolling, utilization of processor registers, …
□  Rely on parallelization variations defined in the application
■  Automatically tested on target platform
■  Research shows promising results
□  Can be better than manually optimized code
□  Optimization can fit to multiple execution environments
□  Known examples for sparse and dense linear algebra libraries
◊  ATLAS (Automatically Tuned Linear Algebra Software)
54
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Intel Math Kernel Library (MKL)
■  Intel library with heavily optimized functionality, for C & Fortran
□  Linear algebra
◊  Basic Linear Algebra Subprograms (BLAS) API
◊  Follows standards in high-performance computing
◊  Vector-vector, matrix-vector, matrix-matrix operations
□  Fast Fourier Transforms (FFT)
◊  Single precision, double precision, complex, real, ...
□  Vector math and statistics functions
◊  Random number generators and probability distributions
◊  Spline-based data fitting
■  High-level abstraction of functionality,
parallelization completely transparent for the developer
55
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Future Trends
■  Active research on next-generation hardware
□  Driven by exa-scale efforts in supercomputing
□  Driven by combined power wall and memory wall
□  Driven by shift in computer markets (desktop -> mobile)
■  Impact on software development will get more visible
□  Hybrid computing is the future default
□  Heterogeneous mixture of CPU + specialized accelerators
□  Old assumptions are broken (flat memory, constant access
time, homogeneous processing elements)
□  Old programming models no longer match
□  Extending the existing programming paradigms seems to work
□  High-level specialized libraries get more relevance
56
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Parallel Programming Concepts
OpenHPI Course
Summary
Dr. Peter Tröger + Teaching Team
Course Organization
■  Week 1: Terminology and fundamental concepts
□  Moore’s law, power wall, memory wall, ILP wall
□  Speedup vs. scaleup, Amdahl’s law, Flynn’s taxonomy, …
■  Week 2: Shared memory parallelism – The basics
□  Concurrency, race condition, semaphore, deadlock, monitor, …
■  Week 3: Shared memory parallelism – Programming
□  Threads, OpenMP, Cilk, Scala, …
■  Week 4: Accelerators
□  Hardware today, GPU Computing, OpenCL, …
■  Week 5: Distributed memory parallelism
□  CSP, Actor model, clusters, HPC, MPI, MapReduce, …
■  Week 6: Patterns and best practices
□  Foster’s methodology, Berkeley dwarfs, OPL collection, …
58
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Week 1:
The Free Lunch Is Over
■  Clock speed curve
flattened in 2003
□  Heat, power,
leakage
■  Speeding up the serial
instruction execution
through clock speed
improvements no
longer works
■  Additional issues
□  ILP wall
□  Memory wall
[HerbSutter,2009]
59
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Three Ways Of Doing Anything Faster
[Pfister]
■  Work harder
(clock speed)
Ø  Power wall problem
Ø  Memory wall problem
■  Work smarter
(optimization, caching)
Ø  ILP wall problem
Ø  Memory wall problem
■  Get help
(parallelization)
□  More cores per single CPU
□  Software needs to exploit
them in the right way
Ø  Memory wall problem
[Diagram: one problem distributed across the cores of a multi-core CPU]
60
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Parallelism on Different Levels
[Diagram: programs are decomposed into tasks; tasks are mapped to processing elements (PEs); PEs share memory within a node; nodes are connected by a network]
61
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
The Parallel Programming Problem
■  Execution environment has a particular type
(SIMD, MIMD, UMA, NUMA, …)
■  Execution environment maybe configurable (number of resources)
■  Parallel application must be mapped to available resources
[Diagram: a flexible parallel application must be matched to an execution environment with a given type and configuration]
62
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Amdahl’s Law
63
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
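■  For reference: with parallelizable fraction P and N processors, the maximum speedup is
S(N) = 1 / ((1 - P) + P / N)
■  For N → ∞ the speedup is bounded by 1 / (1 - P), so the serial fraction limits scalability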
Gustafson-Barsis’ Law (1988)
■  Gustafson and Barsis: People are typically not interested in the
shortest execution time
□  Rather solve a bigger problem in reasonable time
■  Problem size could then scale with the number of processors
□  Typical in simulation and farmer / worker problems
□  Leads to larger parallel fraction with increasing N
□  Serial part is usually fixed or grows slower
■  Maximum scaled speedup by N processors:
■  Linear speedup now becomes possible
■  Software needs to ensure that serial parts remain constant
■  Other models exist (e.g. Work-Span model, Karp-Flatt metric)
64
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
S = (T_ser + N · T_par) / (T_ser + T_par)
Week 2:
Concurrency vs. Parallelism
■  Concurrency means dealing with several things at once
□  Programming concept for the developer
□  In shared-memory systems, implemented by time sharing
■  Parallelism means doing several things at once
□  Demands parallel hardware
■  Parallel programming is a misnomer
□  Concurrent programming aiming at parallel execution
■  Any parallel software is concurrent software
□  Note: Some researchers disagree, most practitioners agree
■  Concurrent software is not always parallel software
□  Many server applications achieve scalability
by optimizing concurrency only (web server)
65
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
[Diagram: parallelism as a subset of concurrency]
Parallelism [Mattson et al.]
■  Task
□  Parallel program breaks a problem into tasks
■  Execution unit
□  Representation of a concurrently running task (e.g. thread)
□  Tasks are mapped to execution units
■  Processing element (PE)
□  Hardware element running one execution unit
□  Depends on scenario - logical processor vs. core vs. machine
□  Execution units run simultaneously on processing elements,
controlled by some scheduler
■  Synchronization - Mechanism to order activities of parallel tasks
■  Race condition - Program result depends on the scheduling order
66
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Concurrency Issues
■  Mutual Exclusion
□  The requirement that when one concurrent task is using a
shared resource, no other shall be allowed to do that
■  Deadlock
□  Two or more concurrent tasks are unable to proceed
□  Each is waiting for one of the others to do something
■  Starvation
□  A runnable task is overlooked indefinitely
□  Although it is able to proceed, it is never chosen to run
■  Livelock
□  Two or more concurrent tasks continuously change their states
in response to changes in the other activities
□  No global progress for the application
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
67
Week 3:
Parallel Programming for Shared Memory
68
■  Different programming models for
concurrency with shared memory
■  Processes and threads mapped to
processing elements (cores)
■  Task model supports more
fine-grained parallelization than
with native threads
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
[Diagram: concurrent processes with explicitly shared memory, concurrent threads sharing one process's memory, and concurrent tasks mapped to a main thread / thread pool]
Task Parallelism and Data Parallelism
[Diagram: input data decomposed, processed in parallel, result data combined]
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
69
OpenMP
■  Programming with the fork-join model
□  Master thread forks into declared tasks
□  Runtime environment may run them in parallel,
based on dynamic mapping to threads from a pool
□  Worker task barrier before finalization (join)
70
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
[Wikipedia]
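A minimal fork-join example (not from the slides): the master thread forks a team at the parallel region, the loop iterations are distributed across the team, and the threads join at the implicit barrier at the end of the region.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)   /* fork: team executes the loop */
    for (int i = 1; i <= 1000; i++)
        sum += 1.0 / i;
    /* join: implicit barrier, only the master thread continues */
    printf("sum = %f (max threads: %d)\n", sum, omp_get_max_threads());
    return 0;
}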
High-Level Concurrency
71
Microsoft Parallel Patterns Library java.util.concurrent
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Partitioned Global Address Space
72
APGAS in X10: Places and Tasks
[Diagram: Place 0 … Place N, each with its own activities and local heap]
■  Place-shifting operations: at(p) S
■  Distributed heap with global references: GlobalRef[T]
■  Task parallelism: async S, finish S
■  Concurrency control within a place: when(c) S, atomic S
■  Parallel tasks, each operating in one place of the PGAS
□  Direct variable access only in local place
■  Implementation strategy is flexible
□  One operating system process per place, manages thread pool
□  Work-stealing scheduler
[IBM]
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Week 4:
Cheap Performance with Accelerators
■  Performance
■  Energy / Price
□  Cheap to buy and to maintain
□  GFLOPS per watt: Fermi 1.5 / Kepler 5 / Maxwell 15 (2014)
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
73
[Chart: execution time in milliseconds vs. problem size (number of Sudoku places) for an Intel E8500 CPU, an AMD R800 GPU, and an NVIDIA GT200 GPU; lower means faster]
GPU: Graphics Processing Unit
(CPU of a graphics card)
CPU vs. GPU Architecture
□  A few heavyweight threads
□  Branch prediction
□  1000+ light-weight threads
□  Memory latency hiding
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
74
[Diagram: CPU (“multi-core”) with few PEs, large control logic and cache vs. GPU (“many-core”) with many simple PEs; each has its own DRAM]
OpenCL Platform Model
□  OpenCL exposes CPUs, GPUs, and other Accelerators as “devices”
□  Each “device” contains one or more “compute units”, i.e. cores, SMs,...
□  Each “compute unit” contains one or more SIMD “processing elements”
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
75
Best Practices for Performance Tuning
• Algorithm Design: Asynchronous, Recompute, Simple
• Memory Transfer: Chaining, Overlap Transfer & Compute
• Control Flow: Divergent Branching, Predication
• Memory Types: Local Memory as Cache, rare resource
• Memory Access: Coalescing, Bank Conflicts
• Sizing: Execution Size, Evaluation
• Instructions: Shifting, Fused Multiply, Vector Types
• Precision: Native Math Functions, Build Options
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
76
Week 5:
Shared Nothing
■  Clusters: Stand-alone machines connected by a local network
□  Cost-effective technique for a large-scale parallel computer
□  Users are builders, have control over their system
□  Synchronization much slower than in shared memory
□  Task granularity becomes an issue
77
[Diagram: two processing elements, each with its own task and local memory, exchanging messages]
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Shared Nothing
■  Supercomputers / Massively Parallel Processing (MPP) systems
□  (Hierarchical) cluster with a lot of processors
□  Still standard hardware, but specialized setup
□  High-performance interconnection network
□  For massive data-parallel applications, mostly simulations
(weapons, climate, earthquakes, airplanes, car crashes, ...)
■  Examples (Nov 2013)
□  BlueGene/Q, 1.5 million cores, 1.5 PB memory, 17.1 PFlops
□  Tianhe-2, 3.1 million cores,
1 PB memory, 17,808 kW power,
33.86 PFlops (quadrillions of
calculations per second)
■  Annual ranking with the TOP500 list
(www.top500.org)
78
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Surface-To-Volume Effect
79
[nicerweb.com]
■  Fine-grained decomposition for
using all processing elements ?
■  Coarse-grained decomposition
to reduce communication
overhead ?
■  A tradeoff question !
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Message Passing
■  Parallel programming paradigm for “shared nothing” environments
□  Implementations for shared memory available,
but typically not the best approach
■  Users submit their message passing program & data as job
■  Cluster management system creates program instances
[Diagram: a submission host hands the job (application + data) to the cluster management software, which creates instances 0 to 3 of the program on the execution hosts]
80
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Single Program Multiple Data (SPMD)
81
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
// … (determine rank and comm_size) …
int token;
if (rank != 0) {
    // Receive from your 'left' neighbor if you are not rank 0
    MPI_Recv(&token, 1, MPI_INT, rank - 1, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Process %d received token %d from process %d\n",
           rank, token, rank - 1);
} else {
    // Set the token's value if you are rank 0
    token = -1;
}
// Send your local token value to your 'right' neighbor
MPI_Send(&token, 1, MPI_INT, (rank + 1) % comm_size,
         0, MPI_COMM_WORLD);
// Now rank 0 can receive from the last rank.
if (rank == 0) {
    MPI_Recv(&token, 1, MPI_INT, comm_size - 1, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Process %d received token %d from process %d\n",
           rank, token, comm_size - 1);
}
[Diagram: the input data and the SPMD program above are distributed to Instance 0 … Instance 4; every instance runs the same program text and branches on its own rank]
Actor Model
■  Carl Hewitt, Peter Bishop and Richard Steiger. A Universal Modular
Actor Formalism for Artificial Intelligence IJCAI 1973.
□  Mathematical model for concurrent computation
□  Actor as computational primitive
◊  Local decisions, concurrently sends / receives messages
◊  Has a mailbox for incoming messages
◊  Concurrently creates more actors
□  Asynchronous one-way message sending
□  Changing topology allowed, typically no order guarantees
◊  Recipient is identified by mailing address
◊  Actors can send their own identity to other actors
■  Available as programming language extension or library
in many environments
82
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Week 6:
Patterns for Parallel Programming
■  Phases in creating a parallel program
□  Finding Concurrency: Identify and
analyze exploitable concurrency
□  Algorithm Structure: Structure the
algorithm to take advantage of
potential concurrency
□  Supporting Structures: Define
program structures and data
structures needed for the code
□  Implementation Mechanisms:
Threads, processes, messages, …
■  Each phase is a design space
83
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Popular Applications vs. Dwarfs
[Heat map: dwarf frequency in popular applications, from hot to cold]
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
84
Designing Parallel Algorithms [Foster]
■  Map workload problem on an execution environment
□  Concurrency & locality for speedup, scalability
■  Four distinct stages of a methodological approach
■  A) Search for concurrency and scalability
□  Partitioning –
Decompose computation and data into small tasks
□  Communication –
Define necessary coordination of task execution
■  B) Search for locality and performance
□  Agglomeration –
Consider performance and implementation costs
□  Mapping –
Maximize processor utilization, minimize communication
85
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
86
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
And that’s it …
The End
■  Parallel programming is exciting again!
□  From massively parallel hardware to complex software
□  From abstract design patterns to specific languages
□  From deadlock freedom to extreme performance tuning
■  Some general concepts are established
□  Take this course as starting point
□  Learn from the high-performance computing community
■  Thanks for your participation
□  Lively discussion, directly and in the forums, we learned a lot
□  Sorry for technical flaws and content errors
■  Please use the feedback link
87
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Lecturer Contact
■  Operating Systems and Middleware Group at HPI
http://www.dcl.hpi.uni-potsdam.de
■  Dr. Peter Tröger
http://www.troeger.eu
http://twitter.com/ptroeger
http://www.linkedin.com/in/ptroeger
peter.troeger@hpi.uni-potsdam.de
■  M.Sc. Frank Feinbube
http://www.feinbube.de
http://www.linkedin.com/in/feinbube
frank.feinbube@hpi.uni-potsdam.de
88
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Dependable Systems -Fault Tolerance Patterns (4/16)Dependable Systems -Fault Tolerance Patterns (4/16)
Dependable Systems -Fault Tolerance Patterns (4/16)
 
Dependable Systems - Introduction (1/16)
Dependable Systems - Introduction (1/16)Dependable Systems - Introduction (1/16)
Dependable Systems - Introduction (1/16)
 
Dependable Systems -Dependability Means (3/16)
Dependable Systems -Dependability Means (3/16)Dependable Systems -Dependability Means (3/16)
Dependable Systems -Dependability Means (3/16)
 
Dependable Systems - Hardware Dependability with Diagnosis (13/16)
Dependable Systems - Hardware Dependability with Diagnosis (13/16)Dependable Systems - Hardware Dependability with Diagnosis (13/16)
Dependable Systems - Hardware Dependability with Diagnosis (13/16)
 
Dependable Systems -Dependability Attributes (5/16)
Dependable Systems -Dependability Attributes (5/16)Dependable Systems -Dependability Attributes (5/16)
Dependable Systems -Dependability Attributes (5/16)
 
Dependable Systems -Dependability Threats (2/16)
Dependable Systems -Dependability Threats (2/16)Dependable Systems -Dependability Threats (2/16)
Dependable Systems -Dependability Threats (2/16)
 
Verteilte Software-Systeme im Kontext von Industrie 4.0
Verteilte Software-Systeme im Kontext von Industrie 4.0Verteilte Software-Systeme im Kontext von Industrie 4.0
Verteilte Software-Systeme im Kontext von Industrie 4.0
 

Recently uploaded

Instructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptxInstructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptx
Jheel Barad
 
Honest Reviews of Tim Han LMA Course Program.pptx
Honest Reviews of Tim Han LMA Course Program.pptxHonest Reviews of Tim Han LMA Course Program.pptx
Honest Reviews of Tim Han LMA Course Program.pptx
timhan337
 
A Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in EducationA Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in Education
Peter Windle
 
Unit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdfUnit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdf
Thiyagu K
 
Polish students' mobility in the Czech Republic
Polish students' mobility in the Czech RepublicPolish students' mobility in the Czech Republic
Polish students' mobility in the Czech Republic
Anna Sz.
 
CACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdfCACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdf
camakaiclarkmusic
 
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
Nguyen Thanh Tu Collection
 
Guidance_and_Counselling.pdf B.Ed. 4th Semester
Guidance_and_Counselling.pdf B.Ed. 4th SemesterGuidance_and_Counselling.pdf B.Ed. 4th Semester
Guidance_and_Counselling.pdf B.Ed. 4th Semester
Atul Kumar Singh
 
1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx
JosvitaDsouza2
 
How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
Jisc
 
Acetabularia Information For Class 9 .docx
Acetabularia Information For Class 9  .docxAcetabularia Information For Class 9  .docx
Acetabularia Information For Class 9 .docx
vaibhavrinwa19
 
Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.
Ashokrao Mane college of Pharmacy Peth-Vadgaon
 
The Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official PublicationThe Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official Publication
Delapenabediema
 
The French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free downloadThe French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free download
Vivekanand Anglo Vedic Academy
 
Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345
beazzy04
 
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCECLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
BhavyaRajput3
 
The Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptxThe Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptx
DhatriParmar
 
Lapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdfLapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdf
Jean Carlos Nunes Paixão
 
special B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdfspecial B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdf
Special education needs
 
The geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideasThe geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideas
GeoBlogs
 

Recently uploaded (20)

Instructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptxInstructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptx
 
Honest Reviews of Tim Han LMA Course Program.pptx
Honest Reviews of Tim Han LMA Course Program.pptxHonest Reviews of Tim Han LMA Course Program.pptx
Honest Reviews of Tim Han LMA Course Program.pptx
 
A Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in EducationA Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in Education
 
Unit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdfUnit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdf
 
Polish students' mobility in the Czech Republic
Polish students' mobility in the Czech RepublicPolish students' mobility in the Czech Republic
Polish students' mobility in the Czech Republic
 
CACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdfCACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdf
 
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
 
Guidance_and_Counselling.pdf B.Ed. 4th Semester
Guidance_and_Counselling.pdf B.Ed. 4th SemesterGuidance_and_Counselling.pdf B.Ed. 4th Semester
Guidance_and_Counselling.pdf B.Ed. 4th Semester
 
1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx
 
How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
 
Acetabularia Information For Class 9 .docx
Acetabularia Information For Class 9  .docxAcetabularia Information For Class 9  .docx
Acetabularia Information For Class 9 .docx
 
Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.
 
The Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official PublicationThe Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official Publication
 
The French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free downloadThe French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free download
 
Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345
 
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCECLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
 
The Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptxThe Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptx
 
Lapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdfLapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdf
 
special B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdfspecial B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdf
 
The geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideasThe geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideas
 

OpenHPI - Parallel Programming Concepts - Week 6

  • 1. Parallel Programming Concepts OpenHPI Course Week 6 : Patterns and Best Practices Unit 6.1: Parallel Programming Patterns Dr. Peter Tröger + Teaching Team
  • 2. Summary: Week 5 ■  “Shared nothing” systems provide very good scalability □  Adding new processing elements not limited by “walls” □  Different options for interconnect technology ■  Task granularity is essential □  Surface-to-volume effect □  Task mapping problem ■  De-facto standard is MPI programming ■  High level abstractions with □  Channels □  Actors □  MapReduce 2 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger „What steps / strategy would you apply to parallelize a given compute-intense program? “
  • 3. The Parallel Programming Problem 3 Execution Environment Parallel Application Match ? Configuration Flexible Type OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
  • 4. Parallelization and Design Patterns ■  Parallel programming relies on experience □  Identification of concurrency □  Identification of feasible algorithmic structures □  If done wrong, performance / correctness may suffer ■  Rule of thumb: Somebody else is smarter than you !! ■  Design Pattern □  Best practices, formulated as a template □  Focus on general applicability to common problems □  Well-known in object-oriented programming (“gang of four”) ■  Parallel design patterns in literature □  Structured parallelization methodologies (== pattern) □  Algorithmic building blocks commonly found (== pattern) 4 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
  • 5. Patterns for Parallel Programming [Mattson et al.] ■  Phases in creating a parallel program □  Finding Concurrency: Identify and analyze exploitable concurrency □  Algorithm Structure: Structure the algorithm to take advantage of potential concurrency □  Supporting Structures: Define program structures and data structures needed for the code □  Implementation Mechanisms: Threads, processes, messages, … ■  Each phase is a design space 5 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
  • 6. Finding Concurrency Design Space ■  Identify and analyze exploitable concurrency ■  Example: Data Decomposition Pattern □  Context: Computation is organized around large data manipulation, similar operations on different data parts □  Solution: Array-based data access (row, block), recursive data structure traversal ■  Example: Group Tasks Pattern □  Context: Tasks shared temporal constraints (e.g. intermediate data), work on shared data structure □  Solution: Apply ordering constraints to groups of tasks, put truly independent tasks in one group for better scheduling 6 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
  • 7. Algorithm Structure Design Space ■  Structure the algorithm ■  Consider how the identified concurrency is organized □  Organize algorithm by tasks ◊  Tasks are embarrassingly parallel, or organized linearly -> Task Parallelism ◊  Tasks organized by recursive procedure -> Divide and Conquer □  Organize algorithm by data dependencies ◊  Linear data dependencies -> Geometric Decomposition ◊  Recursive data dependencies -> Recursive Data □  Organize algorithm by application data flow ◊  Regular data flow for computation -> Pipeline ◊  Irregular data flow -> Event-Based Coordination 7 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
  • 8. Example: Parallelize Bubble Sort 8 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger ■  Bubble sort □  Compare pair-wise and swap, if in wrong order ■  Finding concurrency demands data dependency consideration □  Compare-exchange approach needs some operation order □  Algorithm idea implies hidden data dependency □  Idea: Parallelize serial rounds ■  Odd-even sort – Compare [odd|even]-indexed pairs and swap, in case □  Apply task parallelism pattern 1 24 18 12 77 <-> 1 24 18 12 77 <-> 1 18 24 12 77 <-> 1 18 12 24 77 <-> 1 18 12 24 77 <-> 1 18 12 24 77 <-> 1 12 18 24 77 <-> ...
  • 9. Example: Parallelize Bubble Sort 9 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger 1 24 18 12 77 <-> 1 24 18 12 77 <-> 1 18 24 12 77 <-> 1 18 12 24 77 <-> 1 18 12 24 77 <-> 1 18 12 24 77 <-> 1 12 18 24 77 <-> ... 1 24 18 12 77 <-> <-> 1 24 12 18 77 <-> <-> 1 12 24 18 77 <-> <-> 1 12 18 24 77 <-> <-> 1 12 18 24 77 <-> <->
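The odd-even scheme on these two slides is a direct application of the task parallelism pattern: all compare-and-swap operations of one phase touch disjoint pairs and are therefore independent. Below is a minimal C sketch of this idea; the use of OpenMP for the per-phase parallelism is an illustrative choice for this example, not something prescribed by the slides.

    /* Odd-even transposition sort of the slide's example data.
       Compile e.g. with: gcc -fopenmp oddeven.c */
    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        int a[] = {1, 24, 18, 12, 77};
        int n = sizeof(a) / sizeof(a[0]);

        for (int phase = 0; phase < n; phase++) {
            int first = phase % 2;               /* alternate even / odd indexed pairs */
            #pragma omp parallel for             /* pairs within one phase are independent */
            for (int i = first; i + 1 < n; i += 2) {
                if (a[i] > a[i + 1]) {
                    int t = a[i]; a[i] = a[i + 1]; a[i + 1] = t;
                }
            }
        }
        for (int i = 0; i < n; i++) printf("%d ", a[i]);
        printf("\n");
        return 0;
    }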
  • 10. Supporting Structures Design Space ■  Software structures that support the expression of parallelism ■  Program structuring patterns - Single Program Multiple Data (SPMD), master / worker, loop parallelism, fork / join ■  Data structuring patterns - Shared data, shared queue, distributed array □  Example: Shared data pattern ◊  Define shared abstract data type with concurrency control (read only, read / write, independent sub sets, …) ◊  Choose appropriate synchronization construct ■  Supporting structures map to algorithm structure □  Example: SPMD works well with geometric decomposition 10 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
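To make the shared data / shared queue patterns from this slide concrete, here is a minimal bounded queue guarded by a mutex and two condition variables (POSIX threads). The type name, function names, and capacity are invented for this sketch; they are not part of the course material.

    #include <pthread.h>

    #define CAP 64

    typedef struct {
        int items[CAP];
        int head, tail, count;
        pthread_mutex_t lock;                    /* concurrency control for the shared ADT */
        pthread_cond_t not_empty, not_full;
    } shared_queue;

    void queue_init(shared_queue *q) {
        q->head = q->tail = q->count = 0;
        pthread_mutex_init(&q->lock, NULL);
        pthread_cond_init(&q->not_empty, NULL);
        pthread_cond_init(&q->not_full, NULL);
    }

    void queue_put(shared_queue *q, int value) { /* producer side */
        pthread_mutex_lock(&q->lock);
        while (q->count == CAP)
            pthread_cond_wait(&q->not_full, &q->lock);
        q->items[q->tail] = value;
        q->tail = (q->tail + 1) % CAP;
        q->count++;
        pthread_cond_signal(&q->not_empty);
        pthread_mutex_unlock(&q->lock);
    }

    int queue_get(shared_queue *q) {             /* worker side */
        pthread_mutex_lock(&q->lock);
        while (q->count == 0)
            pthread_cond_wait(&q->not_empty, &q->lock);
        int value = q->items[q->head];
        q->head = (q->head + 1) % CAP;
        q->count--;
        pthread_cond_signal(&q->not_full);
        pthread_mutex_unlock(&q->lock);
        return value;
    }

Such a queue combines naturally with the master/worker structure named on the same slide: the master puts work items in, the workers take them out.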
  • 11. Patterns for Parallel Programming 11 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger Design Space Parallelization Pattern 1. Finding Concurrency Task Decomposition, Data Decomposition, Group Tasks, Order Tasks, Data Sharing, Design Evaluation 2. Algorithm Structure Task Parallelism, Divide and Conquer, Geometric Decomposition, Recursive Data, Pipeline, Event-Based Coordination 3. Supporting Structures SPMD, Master/Worker, Loop Parallelism, Fork/Join, Shared Data, Shared Queue, Distributed Array 4. Implementation Mechanisms Thread & Process Creation and Destruction, Memory Synchronization, Fences, Barriers, Mutual Exclusion, Message Passing, Collective Communication
  • 12. Our Pattern Language (OPL) ■  Extended version of the Mattson el al. proposals ■  http://parlab.eecs.berkeley.edu/wiki/patterns/patterns 12 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger Structural Patterns (map/reduce, ...) Computational Patterns (Monte Carlo, ...) Algorithm Strategy Patterns (Task / data parallelism, pipelining, decomposition, ...) Implementation Strategy Patterns (SPMD, fork/join, Actors, shared queue, BSP, ...) Concurrent Execution Patterns (SIMD, MIMD, task graph, message passing, mutex, ...)
  • 13. Our Pattern Language (OPL) ■  Structural patterns □  Describe overall computational goal of the application □  “Boxes and arrows” ■  Computational patterns □  Classes of computations (Berkeley dwarves) □  “Computations occurring in the boxes” ■  Algorithm strategy patterns □  High-level strategies to exploit concurrency and parallelism ■  Implementation strategy patterns □  Structures realized in source code □  Program organization and data structures ■  Concurrent execution patterns □  Approaches to support the execution of parallel algorithms □  Strategies that advance a program □  Basic building blocks for coordination of concurrent tasks OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger 13
  • 14. OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger 14
  • 15. Example: Discrete Event Pattern ■  Name: Discrete Event Pattern ■  Problem: Suppose a computational pattern can be decomposed into groups of semi-independent tasks interacting in an irregular fashion. The interaction is determined by the flow of data between them which implies ordering constraints between the tasks. How can these tasks and their interaction be implemented so they can execute concurrently? OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger 15
  • 16. Example: Discrete Event Pattern ■  Solution: A good solution is based on expressing the data flow using abstractions called events, with each event having a task that generates it and a task that processes it. Because an event must be generated before it can be processed, events also define ordering constraints between the tasks. Computation within each task consists of processing events:

    initialize
    while (not done) {
        receive event
        process event
        send events
    }
    finalize

OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger 16
  • 17. Patterns for Efficient Computation [Cool et al.] ■  Nesting Patterns ■  Structured Serial Control Flow Patterns (Selection, Iteration, Recursion, …) ■  Parallel Control Patterns (Fork-Join, Stencil, Reduction, Scan, …) ■  Serial Data Management Patterns (Closures, Objects, …) ■  Parallel Data Management Patterns (Pack, Pipeline, Decomposition, Gather, Scatter, …) ■  Other Parallel Patterns (Futures, Speculative Selection, Workpile, Search, Segmentation, Category Reduction, …) ■  Non-Deterministic Patterns (Branch and Bound, Transactions, …) ■  Programming Model Support 1 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
  • 18. Parallel Programming Concepts OpenHPI Course Week 6 : Patterns and Best Practices Unit 6.2: Foster’s Methodology Dr. Peter Tröger + Teaching Team
  • 19. Designing Parallel Algorithms [Foster] ■  Map workload problem on an execution environment □  Concurrency & locality for speedup, scalability ■  Four distinct stages of a methodological approach ■  A) Search for concurrency and scalability □  Partitioning: Decompose computation and data into small tasks □  Communication: Define necessary coordination of task execution ■  B) Search for locality and performance □  Agglomeration: Consider performance and implementation costs □  Mapping: Maximize processor utilization, minimize communication 19 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
  • 20. Partitioning ■  Expose opportunities for parallel execution through fine-grained decomposition ■  Good partition keeps computation and data together □  Data partitioning leads to data parallelism □  Computation partitioning leads to task parallelism □  Complementary approaches, can lead to different algorithms □  Reveal hidden structures of the algorithm that have potential □  Investigate complementary views on the problem ■  Avoid replication of either computation or data, can be revised later to reduce communication overhead ■  Activity results in multiple candidate solutions 20 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
  • 21. Partitioning - Decomposition Types ■  Domain Decomposition □  Define small data fragments □  Specify computation for them □  Different phases of computation on the same data are handled separately □  Rule of thumb: First focus on large, or frequently used, data structures ■  Functional Decomposition □  Split up computation into disjoint tasks, ignore the data accessed for the moment □  With significant data overlap, domain decomposition is more appropriate 21 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger [Foster] [Foster]
  • 22. Partitioning - Checklist ■  Checklist for resulting partitioning scheme □  Order of magnitude more tasks than processors ? ◊  Keeps flexibility for next steps □  Avoidance of redundant computation and storage needs ? ◊  Scalability for large problem sizes □  Tasks of comparable size ? ◊  Goal to allocate equal work to processors □  Does number of tasks scale with the problem size ? ◊  Algorithm should be able to solve larger tasks with more given resources ■  Identify bad partitioning by estimating performance behavior ■  If necessary, re-formulate the partitioning (backtracking) □  May even happen in later steps 22 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
  • 23. Communication ■  Specify links between data consumers and data producers ■  Specify kind and number of messages on these links ■  Domain decomposition problems might have tricky communication infrastructures, due to data dependencies ■  Communication in functional decomposition problems can easily be modeled from the data flow between the tasks ■  Categorization of communication patterns □  Local communication (few neighbors) vs. global communication □  Structured communication (e.g. tree) vs. unstructured communication □  Static vs. dynamic communication structure □  Synchronous vs. asynchronous communication 23 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
  • 24. Communication - Hints ■  Distribute computation and communication, don't centralize the algorithm □  Bad example: Central manager for parallel summation ■  Unstructured communication is hard to agglomerate, better avoid it ■  Checklist for communication design □  Do all tasks perform the same amount of communication ? □  Does each task perform only local communication ? □  Can communication happen concurrently ? □  Can computation happen concurrently ? ■  Solve issues by distributing or replicating communication hot spots 24 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
  • 25. Communication - Ghost Cells ■  Domain decomposition might lead to chunks that demand data from each other ■  Solution 1: Copy necessary portion of data ('ghost cells') □  If no synchronization is needed after update □  Data amount and frequency of update influences resulting overhead and efficiency □  Additional memory consumption ■  Solution 2: Access relevant data 'remotely' □  Delays thread coordination until the data is really needed □  Correctness ("old" data vs. "new" data) must be considered as parallel execution progresses 25 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
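A sketch of solution 1 for a one-dimensional domain decomposition with MPI: each rank keeps one ghost cell on either side of its local chunk and refreshes both with a pair of MPI_Sendrecv calls. The chunk layout and the function name are assumptions made for this example.

    /* Exchange one ghost cell with each neighbor in a 1D decomposition.
       u[0] and u[chunk+1] are ghost cells, u[1..chunk] is the local data. */
    #include <mpi.h>

    void exchange_ghost_cells(double *u, int chunk, int rank, int size) {
        int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

        /* send rightmost own cell to the right, receive left ghost cell */
        MPI_Sendrecv(&u[chunk], 1, MPI_DOUBLE, right, 0,
                     &u[0],     1, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* send leftmost own cell to the left, receive right ghost cell */
        MPI_Sendrecv(&u[1],         1, MPI_DOUBLE, left,  1,
                     &u[chunk + 1], 1, MPI_DOUBLE, right, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }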
  • 26. Agglomeration ■  Algorithm so far is correct, but not specialized for a particular execution environment ■  Check partitioning and communication decisions again □  Agglomerate tasks for efficient execution on target hardware □  Replicate data and / or computation for efficiency reasons ■  Resulting number of tasks can still be greater than the number of processors ■  Three conflicting guiding decisions □  Reduce communication costs by coarser granularity of computation and communication □  Preserve flexibility for later mapping by finer granularity □  Reduce engineering costs for creating a parallel version 26 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
  • 27. Agglomeration - Granularity ■  Since execution environment is now considered, the surface- to-volume effect becomes relevant ■  Late consideration keeps core algorithm flexibility 27 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger [Foster] Surface-to-volume effect
  • 28. Agglomeration - Checklist ■  Communication costs reduced by increasing locality ? ■  Does replicated computation outweigh its costs in all cases ? ■  Does data replication restrict the problem size ? ■  Do the larger tasks still have similar computation / communication costs ? ■  Do the larger tasks still act with sufficient concurrency ? ■  Does the number of tasks still scale with the problem size ? ■  How much can the task count decrease, without disturbing load balancing, scalability, or engineering costs ? ■  Is the transition to parallel code worth the engineering costs ? 28 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
  • 29. Mapping ■  Historically only relevant for shared-nothing systems □  Shared memory systems have the operating system scheduler □  With NUMA, this may become also relevant in shared memory systems of the future (e.g. PGAS task placement) ■  Minimize execution time by … □  … placing concurrent tasks on different nodes □  … placing tasks with heavy communication on the same node ■  Conflicting strategies, additionally restricted by resource limits □  Task mapping problem □  Known to be compute-intense (bin packing) ■  Set of sophisticated (dynamic) heuristics for load balancing □  Preference for local algorithms that do not need global scheduling state 29 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
  • 30. Parallel Programming Concepts OpenHPI Course Week 6 : Patterns and Best Practices Unit 6.3: Berkeley Dwarfs Dr. Peter Tröger + Teaching Team
  • 31. Common Algorithmic Problems ■  Sources □  Parallel programming courses □  Parallel Benchmarks □  Development guides □  Parallel Programming books □  User stories 31 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
  • 32. A View From Berkeley ■  Technical report from Berkeley (2006), defining parallel computing research questions and recommendations ■  Definition of „13 dwarfs“ □  Common designs of parallel computation and communication □  Allow better evaluation of programming models and architectures 32 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger The Landscape of Parallel Computing Research: A View from Berkeley Krste Asanovic Ras Bodik Bryan Christopher Catanzaro Joseph James Gebis Parry Husbands Kurt Keutzer David A. Patterson William Lester Plishker John Shalf Samuel Webb Williams Katherine A. Yelick Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-2006-183 http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html December 18, 2006
  • 33. A View From Berkeley ■  Sources □  EEMBC benchmarks (embedded systems), SPEC benchmarks □  Database and text mining technology □  Algorithms in computer game design and graphics □  Machine learning algorithms □  Original „7 Dwarfs“ for supercomputing [Colella] ■  „Anti-benchmark“ □  Dwarfs are not tied to code or language artifacts □  Can serve as understandable vocabulary across disciplines □  Allow feasibility study of hardware and software design ◊  No need to wait for applications to be developed 33 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
  • 34. 13 Dwarfs ■  Dwarfs currently defined □  Dense Linear Algebra □  Sparse Linear Algebra □  Spectral Methods □  N-Body Methods □  Structured Grids □  Unstructured Grids □  MapReduce □  Combinational Logic □  Graph Traversal □  Dynamic Programming □  Backtrack and Branch-and-Bound □  Graphical Models □  Finite State Machines 34 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger ■  One dwarf may be implemented based on another one ■  Increasing uptake in scientific publications ■  Several reference implementations for CPU / GPU
  • 35. Dwarfs in Popular Applications OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger 35 Hot →  Cold [Patterson]
  • 36. Dense Linear Algebra ■  Classic vector and matrix operations on non-sparse data (vector op vector, matrix op vector, matrix op matrix) ■  Data layout as contiguous array(s) ■  High degree of data dependencies ■  Computation on elements, rows, columns or matrix blocks ■  Issues with memory hierarchy, data distribution is critical ■  Demands overlapping of computation and communication OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger 36
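A minimal sketch of this dwarf's core operation, a parallel matrix-matrix multiplication over contiguous row-major arrays. Blocking and other tuning for the memory hierarchy, which the slide points out as critical, are deliberately omitted here; the function name is made up for this example.

    /* C = A * B for n x n row-major matrices; loop order i-k-j keeps the
       innermost accesses contiguous, the outermost loop is parallelized. */
    #include <omp.h>

    void matmul_simple(int n, const double *A, const double *B, double *C) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++)
                C[i * n + j] = 0.0;
            for (int k = 0; k < n; k++) {
                double aik = A[i * n + k];
                for (int j = 0; j < n; j++)
                    C[i * n + j] += aik * B[k * n + j];
            }
        }
    }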
  • 37. Sparse Linear Algebra ■  Operations on a sparse matrix (with lots of zeros) ■  Typically compressed data structures, integer operations, only non-zero entries + indices □  Dense blocks to exploit caches ■  Complex dependency structure ■  Scatter-gather vector operations are often helpful OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger 37
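A sketch of the typical kernel of this dwarf, sparse matrix-vector multiplication in the compressed sparse row (CSR) format described above: only non-zero values plus integer index arrays, with a gather from the input vector. The array names follow the usual CSR convention and are assumptions of this example.

    /* y = A * x for a matrix stored in CSR format:
       val     - non-zero values
       col_idx - column index of each non-zero
       row_ptr - start of each row in val/col_idx (length n_rows + 1) */
    #include <omp.h>

    void spmv_csr(int n_rows, const double *val, const int *col_idx,
                  const int *row_ptr, const double *x, double *y) {
        #pragma omp parallel for
        for (int i = 0; i < n_rows; i++) {
            double sum = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                sum += val[k] * x[col_idx[k]];   /* gather from x via the index array */
            y[i] = sum;
        }
    }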
  • 38. N-Body Methods ■  Physics: Predicting individual motions of an object group interacting gravitationally □  Calculations on interactions between many discrete points ■  Hierarchical tree-based and mesh-based methods, avoid computing all pair-wise interactions ■  Variations with particle-particle methods (one point to all others) ■  Large number of independent calculations in a time step, followed by all-to-all communication ■  Issues with load balancing and missing fixed hierarchy OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger 38
  • 39. Structured Grid ■  Data as a regular multidimensional grid □  Access is regular and statically determinable ■  Computation as sequence of grid updates □  Points are updated concurrently using values from a small neighborhood ■  Spatial locality to use long cache lines ■  Temporal locality to allow cache reuse ■  Parallel mapping with sub-grid per processor □  Ghost cells, surface to volume ratio ■  Latency hiding □  Increased number of ghost cells □  Coarse-grained data exchange OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger 39
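A sketch of one grid update step as described above: a 2D Jacobi-style stencil in which every interior point is replaced by the average of its four neighbors. Grid size, the double-buffering of old and new values, and the boundary handling are assumptions of this sketch.

    /* One update sweep over an n x n grid stored row-major in 'old_grid';
       results go to 'new_grid', so all points can be updated concurrently. */
    #include <omp.h>

    void grid_update(int n, const double *old_grid, double *new_grid) {
        #pragma omp parallel for
        for (int i = 1; i < n - 1; i++)
            for (int j = 1; j < n - 1; j++)
                new_grid[i * n + j] = 0.25 * (old_grid[(i - 1) * n + j] +
                                              old_grid[(i + 1) * n + j] +
                                              old_grid[i * n + j - 1] +
                                              old_grid[i * n + j + 1]);
    }

On a distributed-memory system, the ghost cells from unit 6.2 would supply the neighbor values at the borders of each processor's sub-grid.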
  • 40. Unstructured Grid ■  Elements update neighbors in irregular mesh/grid - static or dynamic structure ■  Problematic data distribution and access requirements, indirection through tables ■  Modeling domain (e.g. physics) □  Mesh represents surface or volume □  Entities are points, edges, faces, ... □  Applying pressure, temperature, … □  Computations involve numerical solutions of differential equations □  Sequence of mesh updates ■  Massively data parallel, but irregularly distributed data and communication OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger 40
  • 41. MapReduce ■  Originally called “Monte Carlo” in dwarf concept □  Repeated independent execution of a function (e.g. random number generation, map function) □  Results aggregated at the end □  Nearly no communication between tasks, embarrassingly parallel ■  Examples: Monte Carlo, BOINC project, protein structures [http://climatesanity.wordpress.com] OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger 41
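A sketch of the "repeated independent execution of a function, results aggregated at the end" structure named above: the classic Monte Carlo estimate of pi with one reduction at the very end. The sample count and the per-thread seeding via POSIX rand_r are simplifications chosen for this example.

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    int main(void) {
        const long samples = 10000000;
        long hits = 0;

        /* independent trials, only one aggregation step at the end */
        #pragma omp parallel reduction(+:hits)
        {
            unsigned int seed = 1234 + omp_get_thread_num();   /* per-thread stream */
            #pragma omp for
            for (long i = 0; i < samples; i++) {
                double x = (double)rand_r(&seed) / RAND_MAX;
                double y = (double)rand_r(&seed) / RAND_MAX;
                if (x * x + y * y <= 1.0)
                    hits++;
            }
        }
        printf("pi is roughly %f\n", 4.0 * hits / samples);
        return 0;
    }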
  • 42. ■  Global optimization problem in large search space ■  Divide and Conquer principle □  Branching into subdivisions □  Optimize execution by ruling out regions ■  Examples: Integer linear programming, boolean satisfiability, combinatorial optimization, traveling salesman, constraint programming, … ■  Heuristics to guide search to productive regions ■  Parallel checking of sub-regions □  Demands invariants about the search space □  Demands dynamic load balancing, load prediction is hard ■  Example: Place N queens on a chessboard so that no two attack each other Backtrack / Branch-and-Bound OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger 42
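For the N-queens example mentioned on the slide, here is a sketch of parallel checking of sub-regions: each first-row placement spans an independent subtree that is searched by plain backtracking. Bounding heuristics beyond the safety check, and the dynamic load balancing the slide calls for, are omitted; board size and function names are assumptions of this example.

    #include <stdio.h>
    #include <omp.h>

    #define N 10

    /* is it safe to put a queen into 'row' at column 'col'? */
    static int safe(const int *cols, int row, int col) {
        for (int r = 0; r < row; r++)
            if (cols[r] == col || r - cols[r] == row - col || r + cols[r] == row + col)
                return 0;
        return 1;
    }

    /* count all completions of a partial placement (serial backtracking) */
    static long solve(int *cols, int row) {
        if (row == N) return 1;
        long count = 0;
        for (int col = 0; col < N; col++)
            if (safe(cols, row, col)) {
                cols[row] = col;
                count += solve(cols, row + 1);
            }
        return count;
    }

    int main(void) {
        long total = 0;
        /* each first-row placement is an independent sub-region of the search space */
        #pragma omp parallel for reduction(+:total)
        for (int col = 0; col < N; col++) {
            int cols[N];
            cols[0] = col;
            total += solve(cols, 1);
        }
        printf("%d-queens solutions: %ld\n", N, total);
        return 0;
    }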
  • 43. Branch-and-Bound 43 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger [http://docs.jboss.org/drools]
  • 44. 44 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger [http://docs.jboss.org/drools]
  • 45. Berkeley Dwarfs ■  Relevance of single dwarfs widely differs ■  No widely accepted single benchmark implementation ■  Computational dwarfs on different layers, implementations may be based on each other ■  OpenDwarfs project □  Optimized code for different platforms ■  Parallel Dwarfs project □  In C++, C#, F# for Visual Studio 45 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger The Landscape of Parallel Computing Research: A View from Berkeley Krste Asanovic Ras Bodik Bryan Christopher Catanzaro Joseph James Gebis Parry Husbands Kurt Keutzer David A. Patterson William Lester Plishker John Shalf Samuel Webb Williams Katherine A. Yelick Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-2006-183 http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html December 18, 2006
  • 46. Parallel Programming Concepts OpenHPI Course Week 6 : Patterns and Best Practices Unit 6.4: Some Future Trends Dr. Peter Tröger + Teaching Team
  • 47. NUMA Impact Increases 47 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger (diagram: four processor sockets, each with four cores, an L3 cache, a memory controller with local memory, and I/O, connected to one another via QPI links)
  • 48. Innovation in Memory Technology ■  3D NAND ■  Hybrid Memory Cube □  Intel, Micron, … □  3D array of DDR-like memory cells □  Early samples available, 160GB/s □  Through-silicon via (TSV) approach with embedded controllers, attached to CPU ■  RRAM / ReRAM □  Non-volatile memory 48 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger [computerworld.com][extremetech.com]
  • 49. Power Wall 2.0 = Dark Silicon “Dark Silicon and the End of Multicore Scaling” by Hadi Esmaeilzadeh, Emily Blem, Renée St. Amant, Karthikeyan Sankaralingam, Doug Burger OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger 49
  • 50. Hardware / Software Co-Design ■  Increasing number of cores by Moore‘s law ■  Power wall / dark silicon problem will become worse □  In addition, battery-powered devices become more relevant ■  Idea: Use additional transistors for specialization □  Design hardware for a software problem □  Make it part of the processor („compile into hardware“) □  More efficiency, less flexibility □  Partially known from ILP SIMD support □  Examples: Cryptography, regular expressions ■  Example: Cell processor (Playstation 3) □  64-bit Power core □  8 specialized co-processors OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger 50
  • 51. Software at Scale [Dongarra] ■  Effective utilization of many-core and hybrid hardware □  Break fork-join parallelism □  Dynamic data driven execution, consider block layout □  Exploiting mixed precision (GPU vs. CPU, power consumption) ■  Aim for self-adapting software and auto-tuning support □  Manual optimization is too hard □  Let software optimize the software ■  Consider fault-tolerant software □  With 1.000.000's of cores, things break all the time ■  Focus on algorithm classes that reduce communication □  Special problem in dense computation □  Aim for asynchronous iterations OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger 51
  • 52. OpenMP 4.0 ■  SIMD extensions □  Portable primitives to describe SIMD parallelization □  Loop vectorization with simd construct □  Several arguments for guiding the compiler (e.g. alignment) ■  Targeting extensions □  Threads of the OpenMP program execute on the host device □  Implementation may support multiple target devices □  Control off-loading of loops and code regions on such devices ■  New API for device data environment □  OpenMP-managed data items can be moved to the device □  New primitives for better cancellation support □  User-defined reduction operations OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger 52
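A sketch of the two OpenMP 4.0 additions named above: loop vectorization with the simd construct, and off-loading a loop to a target device with an explicit data mapping. Whether the second version actually runs on an accelerator depends on compiler and runtime support; the SAXPY kernel and function names are illustrative.

    #include <omp.h>

    void saxpy_simd(int n, float a, const float *x, float *y) {
        /* ask the compiler to vectorize the loop */
        #pragma omp simd
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    void saxpy_target(int n, float a, const float *x, float *y) {
        /* move x and y into the device data environment, run the loop there */
        #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }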
  • 53. OpenACC ■  „OpenMP for accelerators“ (GPU, FPGAs, ...) □  Partners: Cray, supercomputing centers, NVIDIA, PGI □  Annotation in C, C++, and Fortran source code □  OpenACC code can also be started on the accelerator ■  Features □  Specification of data locality and asynchronous execution □  Abstract specification of data movement, loop parallelization □  Caching and synchronization support □  Management of data movement by compiler and runtime □  Implementations available, e.g. for Xeon Phi OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger 53
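For comparison, a sketch of the corresponding OpenACC style, where loop parallelization and data locality are expressed as annotations and the compiler and runtime manage the data movement. The copy clauses and the kernel are illustrative assumptions, not taken from the slides.

    void vector_add(int n, const float *a, const float *b, float *c) {
        /* run the loop on the accelerator; copy a and b in, copy c out */
        #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }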
  • 54. Autotuners ■  Optimize parallel code by generating many variants □  Try many or all optimization switches ◊  Loop unrolling, utilization of processor registers, … □  Rely on parallelization variations defined in the application ■  Automatically tested on target platform ■  Research shows promising results □  Can be better than manually optimized code □  Optimization can fit to multiple execution environments □  Known examples for sparse and dense linear algebra libraries ◊  ATLAS (Automatically Tuned Linear Algebra Software) 54 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
  • 55. Intel Math Kernel Library (MKL) ■  Intel library with heavily optimized functionality, for C & Fortran □  Linear algebra ◊  Basic Linear Algebra Subprograms (BLAS) API ◊  Follows standards in high-performance computing ◊  Vector-vector, matrix-vector, matrix-matrix operations □  Fast Fourier Transforms (FFT) ◊  Single precision, double precision, complex, real, ... □  Vector math and statistics functions ◊  Random number generators and probability distributions ◊  Spline-based data fitting ■  High-level abstraction of functionality, parallelization completely transparent for the developer 55 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
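A sketch of how such a library is typically called through the standard CBLAS interface for the BLAS level-3 matrix-matrix product mentioned above; any internal threading stays transparent to the caller. The matrix sizes and fill values are arbitrary, and the header name assumes MKL (other BLAS implementations expose the same call via cblas.h).

    /* C = alpha * A * B + beta * C via the CBLAS dgemm routine. */
    #include <stdlib.h>
    #include <mkl.h>

    int main(void) {
        int m = 512, n = 512, k = 512;
        double *A = malloc(sizeof(double) * m * k);
        double *B = malloc(sizeof(double) * k * n);
        double *C = malloc(sizeof(double) * m * n);
        for (int i = 0; i < m * k; i++) A[i] = 1.0;
        for (int i = 0; i < k * n; i++) B[i] = 2.0;
        for (int i = 0; i < m * n; i++) C[i] = 0.0;

        /* row-major storage: leading dimensions are the row lengths */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k, 1.0, A, k, B, n, 0.0, C, n);

        free(A); free(B); free(C);
        return 0;
    }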
  • 56. Future Trends ■  Active research on next-generation hardware □  Driven by exa-scale efforts in supercomputing □  Driven by combined power wall and memory wall □  Driven by shift in computer markets (desktop -> mobile) ■  Impact on software development will get more visible □  Hybrid computing is the future default □  Heterogeneous mixture of CPU + specialized accelerators □  Old assumptions are broken (flat memory, constant access time, homogeneous processing elements) □  Old programming models no longer match □  Extending the existing programming paradigms seems to work □  High-level specialized libraries get more relevance 56 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
  • 57. Parallel Programming Concepts OpenHPI Course Summary Dr. Peter Tröger + Teaching Team
  • 58. Course Organization ■  Week 1: Terminology and fundamental concepts □  Moore’s law, power wall, memory wall, ILP wall □  Speedup vs. scaleup, Amdahl’s law, Flynn’s taxonomy, … ■  Week 2: Shared memory parallelism – The basics □  Concurrency, race condition, semaphore, deadlock, monitor, … ■  Week 3: Shared memory parallelism – Programming □  Threads, OpenMP, Cilk, Scala, … ■  Week 4: Accelerators □  Hardware today, GPU Computing, OpenCL, … ■  Week 5: Distributed memory parallelism □  CSP, Actor model, clusters, HPC, MPI, MapReduce, … ■  Week 6: Patterns and best practices □  Foster’s methodology, Berkeley dwarfs, OPL collection, … 58 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
  • 59. Week 1: The Free Lunch Is Over ■  Clock speed curve flattened in 2003 □  Heat, power, leakage ■  Speeding up the serial instruction execution through clock speed improvements no longer works ■  Additional issues □  ILP wall □  Memory wall [HerbSutter,2009] 59 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
  • 60. Three Ways Of Doing Anything Faster [Pfister] ■  Work harder (clock speed) →  Power wall problem →  Memory wall problem ■  Work smarter (optimization, caching) →  ILP wall problem →  Memory wall problem ■  Get help (parallelization) □  More cores per single CPU □  Software needs to exploit them in the right way →  Memory wall problem (diagram: one problem distributed across the cores of a single CPU) 60 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
  • 61. Parallelism on Different Levels 61 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger (diagram: programs broken into tasks and processes, mapped onto processing elements (PEs) with local memory inside nodes that are connected by a network)
  • 62. The Parallel Programming Problem ■  Execution environment has a particular type (SIMD, MIMD, UMA, NUMA, …) ■  Execution environment may be configurable (number of resources) ■  Parallel application must be mapped to available resources (diagram: parallel application matched against an execution environment with a given type, configuration, and flexibility) 62 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
  • 63. Amdahl’s Law 63 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
  • 64. Gustafson-Barsis’ Law (1988) ■  Gustafson and Barsis: People are typically not interested in the shortest execution time □  Rather solve a bigger problem in reasonable time ■  Problem size could then scale with the number of processors □  Typical in simulation and farmer / worker problems □  Leads to larger parallel fraction with increasing N □  Serial part is usually fixed or grows slower ■  Maximum scaled speedup by N processors: S = (T_SER + N · T_PAR) / (T_SER + T_PAR) ■  Linear speedup now becomes possible ■  Software needs to ensure that serial parts remain constant ■  Other models exist (e.g. Work-Span model, Karp-Flatt metric) 64 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
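To make the contrast between the two laws on these slides concrete, here is a small worked example with numbers chosen purely for illustration: with T_SER = 1 and T_PAR = 99 time units and N = 100 processors, Amdahl's law S = (T_SER + T_PAR) / (T_SER + T_PAR / N) gives S = 100 / (1 + 0.99) ≈ 50.3, because the fixed serial second caps the speedup. For the same fractions, the Gustafson-Barsis scaled speedup S = (T_SER + N · T_PAR) / (T_SER + T_PAR) gives (1 + 100 · 99) / 100 = 99.01, since the parallel workload is assumed to grow with N while the serial part stays constant.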
  • 65. Week 2: Concurrency vs. Parallelism ■  Concurrency means dealing with several things at once □  Programming concept for the developer □  In shared-memory systems, implemented by time sharing ■  Parallelism means doing several things at once □  Demands parallel hardware ■  Parallel programming is a misnomer □  Concurrent programming aiming at parallel execution ■  Any parallel software is concurrent software □  Note: Some researchers disagree, most practitioners agree ■  Concurrent software is not always parallel software □  Many server applications achieve scalability by optimizing concurrency only (web server) 65 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger Concurrency Parallelism
  • 66. Parallelism [Mattson et al.] ■  Task □  Parallel program breaks a problem into tasks ■  Execution unit □  Representation of a concurrently running task (e.g. thread) □  Tasks are mapped to execution units ■  Processing element (PE) □  Hardware element running one execution unit □  Depends on scenario - logical processor vs. core vs. machine □  Execution units run simultaneously on processing elements, controlled by some scheduler ■  Synchronization - Mechanism to order activities of parallel tasks ■  Race condition - Program result depends on the scheduling order 66 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
  • 67. Concurrency Issues ■  Mutual Exclusion □  The requirement that when one concurrent task is using a shared resource, no other shall be allowed to do that ■  Deadlock □  Two or more concurrent tasks are unable to proceed □  Each is waiting for one of the others to do something ■  Starvation □  A runnable task is overlooked indefinitely □  Although it is able to proceed, it is never chosen to run ■  Livelock □  Two or more concurrent tasks continuously change their states in response to changes in the other activities □  No global progress for the application OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger 67
  • 68. Week 3: Parallel Programming for Shared Memory 68 ■  Different programming models for concurrency with shared memory ■  Processes and threads mapped to processing elements (cores) ■  Task model supports more fine-grained parallelization than with native threads OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger (diagram: concurrent processes with explicitly shared memory, concurrent threads inside one process, and concurrent tasks mapped onto threads)
  • 69. Task Parallelism and Data Parallelism (diagram: input data flows through parallel processing into result data) OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger 69
  • 70. OpenMP ■  Programming with the fork-join model □  Master thread forks into declared tasks □  Runtime environment may run them in parallel, based on dynamic mapping to threads from a pool □  Worker task barrier before finalization (join) 70 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger [Wikipedia]
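A minimal sketch of the fork-join / task model summarized above: the master thread forks a thread team, a single thread declares the tasks, the runtime maps them onto the pool, and the implicit barriers provide the join. The work function is a placeholder invented for this example.

    #include <stdio.h>
    #include <omp.h>

    void process_item(int i) {            /* placeholder for real work */
        printf("item %d done by thread %d\n", i, omp_get_thread_num());
    }

    int main(void) {
        #pragma omp parallel              /* fork: create the thread team */
        #pragma omp single                /* one thread declares the tasks ... */
        for (int i = 0; i < 8; i++) {
            #pragma omp task firstprivate(i)
            process_item(i);              /* ... the pool executes them */
        }                                 /* implicit barriers: join */
        return 0;
    }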
  • 71. High-Level Concurrency 71 Microsoft Parallel Patterns Library java.util.concurrent OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
  • 72. Partitioned Global Address Space 72 APGAS in X10: Places and Tasks — task parallelism (async S, finish S), place-shifting operations (at(p) S), concurrency control within a place (when(c) S, atomic S), activities with a local heap per place (Place 0 … Place N), distributed heap with global references (GlobalRef[T]) ■  Parallel tasks, each operating in one place of the PGAS □  Direct variable access only in local place ■  Implementation strategy is flexible □  One operating system process per place, manages thread pool □  Work-stealing scheduler [IBM] OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
  • 73. Week 4: Cheap Performance with Accelerators ■  Performance ■  Energy / Price □  Cheap to buy and to maintain □  GFLOPS per watt: Fermi 1.5 / Kepler 5 / Maxwell 15 (2014) OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger 73 (chart: execution time in milliseconds over problem size (number of Sudoku places) for an Intel E8500 CPU, an AMD R800 GPU and an NVIDIA GT200 GPU; lower means faster) GPU: Graphics Processing Unit (CPU of a graphics card)
  • 74. CPU vs. GPU Architecture ■  CPU (“multi-core”) □  Some huge threads □  Branch prediction ■  GPU (“many-core”) □  1000+ light-weight threads □  Memory latency hiding OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger 74 (diagram: CPU with large control logic and cache in front of a few processing elements and DRAM vs. GPU with many simple processing elements and DRAM)
  • 75. OpenCL Platform Model □  OpenCL exposes CPUs, GPUs, and other Accelerators as “devices” □  Each “device” contains one or more “compute units”, i.e. cores, SMs,... □  Each “compute unit” contains one or more SIMD “processing elements” OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger 75
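A sketch of querying exactly this platform model from host code: enumerating the devices of the first platform and asking each one for its name and number of compute units. Error handling is omitted and the existence of at least one platform is assumed; the calls use the standard Khronos C API headers.

    #include <stdio.h>
    #include <CL/cl.h>

    int main(void) {
        cl_platform_id platform;
        cl_device_id devices[8];
        cl_uint num_devices = 0;

        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 8, devices, &num_devices);

        for (cl_uint i = 0; i < num_devices; i++) {
            char name[128];
            cl_uint compute_units = 0;
            clGetDeviceInfo(devices[i], CL_DEVICE_NAME, sizeof(name), name, NULL);
            clGetDeviceInfo(devices[i], CL_DEVICE_MAX_COMPUTE_UNITS,
                            sizeof(compute_units), &compute_units, NULL);
            printf("device %u: %s, %u compute units\n", i, name, compute_units);
        }
        return 0;
    }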
  • 76. Best Practices for Performance Tuning • Algorithm Design: Asynchronous, Recompute, Simple • Memory Transfer: Chaining, Overlap Transfer & Compute • Control Flow: Divergent Branching, Predication • Memory Types: Local Memory as Cache, rare resource • Memory Access: Coalescing, Bank Conflicts • Sizing: Execution Size, Evaluation • Instructions: Shifting, Fused Multiply, Vector Types • Precision: Native Math Functions, Build Options OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger 76
  • 77. Week 5: Shared Nothing ■  Clusters: Stand-alone machines connected by a local network □  Cost-effective technique for a large-scale parallel computer □  Users are builders, have control over their system □  Synchronization much slower than in shared memory □  Task granularity becomes an issue 77 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger (diagram: processing elements with local memory, each running a task, exchanging messages)
  • 78. Shared Nothing ■  Supercomputers / Massively Parallel Processing (MPP) systems □  (Hierarchical) cluster with a lot of processors □  Still standard hardware, but specialized setup □  High-performance interconnection network □  For massive data-parallel applications, mostly simulations (weapons, climate, earthquakes, airplanes, car crashes, ...) ■  Examples (Nov 2013) □  BlueGene/Q, 1.5 million cores, 1.5 PB memory, 17.1 PFlops □  Tianhe-2, 3.1 million cores, 1 PB memory, 17,808 kW power, 33.86 PFlops (quadrillions of calculations per second) ■  Annual ranking with the TOP500 list (www.top500.org) 78 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
  • 79. Surface-To-Volume Effect 79 [nicerweb.com] ■  Fine-grained decomposition for using all processing elements ? ■  Coarse-grained decomposition to reduce communication overhead ? ■  A tradeoff question ! OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
  • 80. Message Passing ■  Parallel programming paradigm for “shared nothing” environments □  Implementations for shared memory available, but typically not the best approach ■  Users submit their message passing program & data as a job ■  Cluster management system creates program instances (diagram: a job consisting of application and data goes from the submission host to the cluster management software, which starts Instance 0 … Instance 3 on the execution hosts) 80 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
  • 81. Single Program Multiple Data (SPMD) 81 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger — the slide shows the same MPI ring-passing program (input data plus SPMD program) replicated as Instance 0 … Instance 4; every instance runs identical code and differs in behavior only through its rank:

    // … (determine rank and comm_size) …
    int token;
    if (rank != 0) {
        // Receive from your ‘left’ neighbor if you are not rank 0
        MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process %d received token %d from process %d\n", rank, token, rank - 1);
    } else {
        // Set the token's value if you are rank 0
        token = -1;
    }
    // Send your local token value to your ‘right’ neighbor
    MPI_Send(&token, 1, MPI_INT, (rank + 1) % comm_size, 0, MPI_COMM_WORLD);
    // Now rank 0 can receive from the last rank.
    if (rank == 0) {
        MPI_Recv(&token, 1, MPI_INT, comm_size - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process %d received token %d from process %d\n", rank, token, comm_size - 1);
    }
  • 82. Actor Model ■  Carl Hewitt, Peter Bishop and Richard Steiger. A Universal Modular Actor Formalism for Artificial Intelligence IJCAI 1973. □  Mathematical model for concurrent computation □  Actor as computational primitive ◊  Local decisions, concurrently sends / receives messages ◊  Has a mailbox for incoming messages ◊  Concurrently creates more actors □  Asynchronous one-way message sending □  Changing topology allowed, typically no order guarantees ◊  Recipient is identified by mailing address ◊  Actors can send their own identity to other actors ■  Available as programming language extension or library in many environments 82 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
  • 83. Week 6: Patterns for Parallel Programming ■  Phases in creating a parallel program □  Finding Concurrency: Identify and analyze exploitable concurrency □  Algorithm Structure: Structure the algorithm to take advantage of potential concurrency □  Supporting Structures: Define program structures and data structures needed for the code □  Implementation Mechanisms: Threads, processes, messages, … ■  Each phase is a design space 83 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
  • 84. Popular Applications vs. Dwarfs Hot →  Cold OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger 84
  • 85. Designing Parallel Algorithms [Foster] ■  Map workload problem on an execution environment □  Concurrency & locality for speedup, scalability ■  Four distinct stages of a methodological approach ■  A) Search for concurrency and scalability □  Partitioning – Decompose computation and data into small tasks □  Communication – Define necessary coordination of task execution ■  B) Search for locality and performance □  Agglomeration – Consider performance and implementation costs □  Mapping – Maximize processor utilization, minimize communication 85 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
  • 86. 86 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger And that’s it …
  • 87. The End ■  Parallel programming is exciting again! □  From massively parallel hardware to complex software □  From abstract design patterns to specific languages □  From deadlock freedom to extreme performance tuning ■  Some general concepts are established □  Take this course as starting point □  Learn from the high-performance computing community ■  Thanks for your participation □  Lively discussion, directly and in the forums, we learned a lot □  Sorry for technical flaws and content errors ■  Please use the feedback link 87 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
  • 88. Lecturer Contact ■  Operating Systems and Middleware Group at HPI http://www.dcl.hpi.uni-potsdam.de ■  Dr. Peter Tröger http://www.troeger.eu http://twitter.com/ptroeger http://www.linkedin.com/in/ptroeger peter.troeger@hpi.uni-potsdam.de ■  M.Sc. Frank Feinbube http://www.feinbube.de http://www.linkedin.com/in/feinbube frank.feinbube@hpi.uni-potsdam.de 88 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger