Parallel Programming Concepts
OpenHPI Course
Week 1: Terminology and fundamental concepts
Unit 1.1: Welcome!
Dr. Peter Tröger + Teaching Team
Course Content
■  Overview of theoretical and practical concepts
■  This course is for you if …
□  … you have skills in software development,
regardless of the programming language.
□  … you want to get an overview of parallelization concepts.
□  … you want to assess the feasibility of parallel hardware,
software and libraries for your parallelization problem.
■  This course is not for you if …
□  … you have no practical experience with software
development at all.
□  … you want a solution for a specific parallelization problem.
□  … you want to learn one specific parallel programming tool
or language in detail.
Course Organization
■  Six lecture weeks, final exam in week 7
■  Several lecture units per week, per unit:
□  Video, slides, non-graded self-test
□  Sometimes mandatory and optional readings
□  Sometimes optional programming tasks
□  Week finished with a graded assignment
■  Six graded assignments sum up to max. 90 points
■  Graded final exam with max. 90 points
■  OpenHPI certificate awarded for getting ≥90 points in total
■  Forum can be used to discuss with other participants
■  FAQ is constantly updated
Course Organization
■  Week 1: Terminology and fundamental concepts
□  Moore’s law, power wall, memory wall, ILP wall,
speedup vs. scaleup, Amdahl’s law, Flynn’s taxonomy, …
■  Week 2: Shared memory parallelism – The basics
□  Concurrency, race condition, semaphore, mutex,
deadlock, monitor, …
■  Week 3: Shared memory parallelism – Programming
□  Threads, OpenMP, Intel TBB, Cilk, Scala, …
■  Week 4: Accelerators
□  Hardware today, CUDA, GPU Computing, OpenCL, …
■  Week 5: Distributed memory parallelism
□  CSP, Actor model, clusters, HPC, MPI, MapReduce, …
■  Week 6: Patterns, best practices and examples
Why Parallel?
Computer Markets
■  Embedded and Mobile Computing
□  Cars, smartphones, entertainment industry, medical devices, …
□  Power/performance and price as relevant issues
■  Desktop Computing
□  Price/performance ratio and extensibility as relevant issues
■  Server Computing
□  Business service provisioning as typical goal
□  Web servers, banking back-end, order processing, ...
□  Performance and availability as relevant issues
■  Most software benefits from having better performance
■  The computer hardware industry has constantly delivered this performance
Running Applications
[Figure: an application executes as a serial stream of instructions over time]
Three Ways Of Doing Anything Faster
[Pfister]
■  Work harder (clock speed)
□  Hardware solution
□  No longer feasible
■  Work smarter (optimization, caching)
□  Hardware solution
□  No longer feasible as only solution
■  Get help (parallelization)
□  Hardware + Software in cooperation
[Figure: the instruction stream of an application, processed faster over time t]
Parallel Programming Concepts
OpenHPI Course
Week 1: Terminology and fundamental concepts
Unit 1.2: Moore’s Law and the Power Wall
Dr. Peter Tröger + Teaching Team
Processor Hardware
■  First computers had fixed programs (e.g. electronic calculator)
■  Von Neumann architecture (1945)
□  Instructions for central processing unit (CPU) in memory
□  Program is treated as data
□  Loading of code during runtime, self-modification
■  Multiple such processors: Symmetric multiprocessing (SMP)
[Figure: von Neumann architecture — a CPU with control unit and arithmetic logic unit, memory, input and output, all connected by a bus]
Moore’s Law
■  “...the number of transistors that can be inexpensively placed on
an integrated circuit is increasing exponentially, doubling
approximately every two years. ...” (Gordon Moore, 1965)
□  CPUs contain different hardware parts, such as logic gates
□  Parts are built from transistors
□  Rule of exponential growth for the number
of transistors on one CPU chip
□  Meanwhile a self-fulfilling prophecy
□  Applied not only in processor industry,
but also in other areas
□  Sometimes misinterpreted as
performance indication
□  May still hold for the next 10-20 years
Moore’s Law
[Chart: transistor counts of microprocessors over time on a log scale, following the predicted doubling — Wikimedia]
Moore’s Law vs. Software
■  Nathan P. Myhrvold, “The Next Fifty Years of Software”, 1997
□  “Software is a gas. It expands to fit the container it is in.”
◊  Constant increase in the amount of code
□  “Software grows until it becomes limited by Moore’s law.”
◊  Software often grows faster than hardware capabilities
□  “Software growth makes Moore’s Law possible.”
◊  Software and hardware market stimulate each other
□  “Software is only limited by human ambition & expectation.”
◊  People will always find ways for exploiting performance
■  Jevons’ paradox:
□  “Technological progress that increases the efficiency with
which a resource is used tends to increase (rather than
decrease) the rate of consumption of that resource.”
Processor Performance Development
[Chart: transistor count, clock speed (MHz), power (W) and performance per clock (ILP) over time; clock speed (“work harder”) and perf/clock (“work smarter”) flatten in the mid-2000s — Herb Sutter, 2009]
A Physics Problem
■  Power: Energy needed to run the processor
■  Static power (SP): Leakage in transistors while being inactive
■  Dynamic power (DP): Energy needed to switch a transistor
■  Moore’s law: N goes up exponentially, C goes down with size
■  Power dissipation demands cooling
□  Power density: Watt/cm2
■  Make dynamic power increase less dramatic:
□  Bringing down V reduces energy consumption, quadratically!
□  Don’t use N only for logic gates
■  Industry was able to increase the frequency (F) for decades
DP ≈ Number of Transistors (N) × Capacitance (C) × Voltage² (V²) × Frequency (F)
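To see why the quadratic voltage term matters, a small worked example (ours, not from the slide): lowering the supply voltage by 20% at constant N, C and F leaves

DPnew / DPold = (0.8 × V)² / V² = 0.64

i.e. roughly a third less dynamic power, which is why voltage scaling could compensate the growing transistor count for so long.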
Processor Supply Voltage
[Chart: processor supply voltage (V, log scale from 1 to 100) versus year, 1970–2010, declining steadily — Moore, ISSCC]
Power Density
■  Growth of watts per square centimeter in microprocessors
■  Higher temperatures: Increased leakage, slower transistors
[Chart: power density (W/cm²) of microprocessors, 1992–2005, rising from near 0 W toward 140 W, passing the “hot plate” level and approaching the air-cooling limit]
Power Density
[Figure: thermal image of a CPU hot enough to cook on — Kevin Skadron, 2007]
“Cooking-Aware” Computing?
Second Problem: Leakage Increase
[Chart: processor power (W, log scale from 0.001 to 1000), 1960–2010, split into active and leakage power; leakage grows toward the active share — www.ieeeghn.org]
■  Static leakage today: Up to 40% of CPU power consumption
The Power Wall
■  Air cooling capabilities are limited
□  Maximum temperature of 100-125 °C, hot spot problem
□  Static and dynamic power consumption must be limited
■  Power consumption increases with Moore‘s law,
but growth of hardware performance is still expected
■  Further reducing voltage as compensation
□  We can’t do that endlessly; the lower limit is around 0.7V
□  Strange physical effects appear below that
■  Next-generation processors need to use even less power
□  Lower the frequencies, scale them dynamically
□  Use only parts of the processor at a time (‘dark silicon’)
□  Build energy-efficient special purpose hardware
■  No chance for faster processors through frequency increase
The Free Lunch Is Over
■  Clock speed curve flattened in 2003
□  Heat, power, leakage
■  Speeding up serial instruction execution through clock speed improvements no longer works
■  Additional issues
□  ILP wall
□  Memory wall
[Herb Sutter, 2009]
Parallel Programming Concepts
OpenHPI Course
Week 1: Terminology and fundamental concepts
Unit 1.3: ILP Wall and Memory Wall
Dr. Peter Tröger + Teaching Team
Three Ways Of Doing Anything Faster
[Pfister]
■  Work harder (clock speed)
□  Hardware solution
✗  Power wall problem
■  Work smarter (optimization, caching)
□  Hardware solution
■  Get help (parallelization)
□  Hardware + Software in cooperation
Instruction Level Parallelism
■  Increasing the frequency is no longer an option
■  Provide smarter instruction processing for better performance
■  Instruction level parallelism (ILP)
□  Processor hardware optimizes low-level instruction execution
□  Instruction pipelining
◊  Overlapped execution of serial instructions
□  Superscalar execution
◊  Multiple units of one processor are used in parallel
□  Out-of-order execution
◊  Reorder instructions that do not have data dependencies
□  Speculative execution
◊  Control flow speculation and branch prediction
■  Today’s processors are packed with such ILP logic
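To illustrate why these mechanisms depend on independent instructions, here is a small hedged C sketch (our own example, not from the slides): both functions compute the same sum, but the second keeps four independent accumulators, so a superscalar, out-of-order core can overlap the additions instead of waiting on one serial dependency chain.

#include <stddef.h>

/* Serial dependency chain: each addition depends on the previous one,
 * so the core can hardly overlap iterations. */
double sum_chained(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent accumulators: additions from different chains
 * can execute in parallel on a superscalar, out-of-order core. */
double sum_unrolled(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)  /* leftover elements */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}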
The ILP Wall
■  No longer cost-effective to dedicate new transistors to ILP mechanisms
■  Deeper pipelines make the power problem worse
■  High ILP complexity effectively reduces the processing speed for a given frequency (e.g. on misprediction)
■  More aggressive ILP technologies too risky due to unknown real-world workloads
■  No ground-breaking new ideas
→ “ILP wall”
■  Ok, let’s use the transistors for better caching
Caching
■  von Neumann architecture
□  Instructions are stored in main memory
□  Program is treated as data
□  For each instruction execution, data must be fetched
■  When the frequency increases, main memory becomes a
performance bottleneck
■  Caching: Keep data copy in very fast, small memory on the CPU
[Figure: von Neumann architecture with a cache added between the CPU and main memory]
Memory Hardware Hierarchy
■  From fast / expensive / small to slow / cheap / large:
□  Registers (volatile)
□  Processor caches (volatile)
□  Random access memory (RAM) (volatile)
□  Flash / SSD memory (non-volatile)
□  Hard drives (non-volatile)
□  Tapes (non-volatile)
Memory Hardware Hierarchy
[Figure: four CPU cores, each with a private L1 cache, pairs of cores sharing an L2 cache, and one L3 cache shared by all, connected via buses (L = level)]
Caching for Performance
■  Well established optimization technique for performance
■  Caching relies on data locality
□  Some instructions are often used (e.g. loops)
□  Some data is often used (e.g. local variables)
□  Hardware keeps a copy of the data in the faster cache
□  On read attempts, data is taken directly from the cache
□  On write, data is cached and eventually written to memory
■  Similar to ILP, the potential is limited
□  Larger caches do not help automatically
□  At some point, all data locality in the code is already exploited
□  Manual vs. compiler-driven optimization
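As a hedged illustration of data locality (our own example, not from the slides): both functions sum the same row-major C matrix, but the row-wise loop touches memory sequentially and profits from the cache, while the column-wise loop jumps a full row ahead on every access and typically runs several times slower for large N.

#define N 2048

/* Row-major traversal: consecutive memory accesses, cache-friendly. */
double sum_rowwise(const double m[N][N]) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

/* Column-major traversal: each access jumps N * sizeof(double) bytes,
 * defeating the spatial locality the cache relies on. */
double sum_colwise(const double m[N][N]) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];
    return s;
}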
Memory Wall
■  If caching is limited, we simply need faster memory
■  The problem: Shared memory is ‘shared’
□  Interconnect contention
□  Memory bandwidth
◊  Memory transfer speed is limited by the power wall
◊  Memory transfer size is limited by the power wall
■  Transfer technology cannot keep up with GHz processors
■  Memory is too slow; its effects cannot be hidden completely through caching
→ “Memory wall”
Problem Summary
■  Hardware perspective
□  Number of transistors N is still increasing
□  Building larger caches no longer helps (memory wall)
□  ILP is out of options (ILP wall)
□  Voltage / power / frequency is at the limit (power wall)
◊  Some help with dynamic scaling approaches
□  Remaining option: Use N for more cores per processor chip
■  Software perspective
□  Performance must come from the utilization of this increasing
core count per chip, since F is now fixed
□  Software must tackle the memory wall
Three Ways Of Doing Anything Faster
[Pfister]
■  Work harder (clock speed)
✗  Power wall problem
✗  Memory wall problem
■  Work smarter (optimization, caching)
✗  ILP wall problem
✗  Memory wall problem
■  Get help (parallelization)
□  More cores per single CPU
□  Software needs to exploit them in the right way
✗  Memory wall problem
[Figure: a problem handled by one CPU containing multiple cores]
Parallel Programming Concepts
OpenHPI Course
Week 1: Terminology and fundamental concepts
Unit 1.4: Parallel Hardware Classification
Dr. Peter Tröger + Teaching Team
Parallelism [Mattson et al.]
■  Task
□  Parallel program breaks a problem into tasks
■  Execution unit
□  Representation of a concurrently running task (e.g. thread)
□  Tasks are mapped to execution units
■  Processing element (PE)
□  Hardware element running one execution unit
□  Depends on scenario - logical processor vs. core vs. machine
□  Execution units run simultaneously on processing elements,
controlled by some scheduler
■  Synchronization - Mechanism to order activities of parallel tasks
■  Race condition - Program result depends on the scheduling order
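A minimal hedged sketch of a race condition using POSIX threads (our own example; such primitives are covered in week 2): two threads increment a shared counter without synchronization, so updates get lost in the interleaved read-modify-write sequences and the printed result usually falls short of 2,000,000. Uncommenting the mutex calls orders the accesses and fixes the result.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        /* pthread_mutex_lock(&lock);    -- uncomment to fix the race */
        counter++;                    /* unsynchronized read-modify-write */
        /* pthread_mutex_unlock(&lock); */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld (expected 2000000)\n", counter);
    return 0;
}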
Faster Processing through Parallelization
[Figure: a program is decomposed into multiple tasks that can be executed in parallel]
Flynn‘s Taxonomy (1966)
■  Classify parallel hardware architectures according to their
capabilities in the instruction and data processing dimension
[Figure: the four classes as processing-step diagrams —
SISD: one instruction stream on one data item;
SIMD: one instruction stream on multiple data items;
MISD: multiple instruction streams on one data item;
MIMD: multiple instruction streams on multiple data items]
Flynn‘s Taxonomy (1966)
■  Single Instruction, Single Data (SISD)
□  No parallelism in the execution
□  Old single-processor architectures
■  Single Instruction, Multiple Data (SIMD)
□  Multiple data streams processed with one instruction stream at the same time
□  Typical in graphics hardware and GPU accelerators
□  Special SIMD machines in high-performance computing
■  Multiple Instructions, Single Data (MISD)
□  Multiple instructions applied to the same data in parallel
□  Rarely used in practice, only for fault tolerance
■  Multiple Instructions, Multiple Data (MIMD)
□  Every modern processor, compute clusters
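A hedged C sketch of the SIMD idea (our own example, relying on the GCC/Clang vector_size extension, which is an assumption about your compiler): the second loop performs one addition per four packed floats, i.e. a single instruction stream operating on multiple data items at once.

/* SIMD sketch: one instruction, multiple data elements.
 * Requires GCC or Clang (vector_size extension). */
typedef float v4sf __attribute__((vector_size(16)));  /* 4 packed floats */

void add_scalar(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)     /* SISD view: one add per element */
        c[i] = a[i] + b[i];
}

void add_simd(const v4sf *a, const v4sf *b, v4sf *c, int n4) {
    for (int i = 0; i < n4; i++)    /* SIMD view: one add per 4 elements */
        c[i] = a[i] + b[i];         /* compiles to a single packed add */
}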
Parallelism on Different Levels
[Figure: programs are broken into tasks; tasks are mapped to processing elements (PE); PEs share memory within a node; nodes are connected by a network]
Parallelism on Different Levels
■  A processor chip (socket)
□  Chip multi-processing (CMP)
◊  Multiple CPUs per chip, called cores
◊  Multi-core / many-core
□  Simultaneous multi-threading (SMT)
◊  Interleaved execution of tasks on one core
◊  Example: Intel Hyperthreading
□  Chip multi-threading (CMT) = CMP + SMT
□  Instruction-level parallelism (ILP)
◊  Parallel processing of single instructions per core
■  Multiple processor chips in one machine (multi-processing)
□  Symmetric multi-processing (SMP)
■  Multiple processor chips in many machines (multi-computer)
Parallelism on Different Levels
[Figure: die shot of a CMP architecture; each core internally provides ILP and SMT — arstechnica.com]
Parallel Programming Concepts
OpenHPI Course
Week 1: Terminology and fundamental concepts
Unit 1.5: Memory Architectures
Dr. Peter Tröger + Teaching Team
Parallelism on Different Levels
[Figure repeated from Unit 1.4: tasks mapped to PEs, PEs sharing memory within a node, nodes connected by a network]
Shared Memory vs. Shared Nothing
■  Organization of parallel processing hardware as …
□  Shared memory system
◊  Tasks can directly access a common address space
◊  Implemented as memory hierarchy with different cache levels
□  Shared nothing system
◊  Tasks can only access local memory
◊  Global coordination of parallel execution by explicit
communication (e.g. messaging) between tasks
□  Hybrid architectures possible in practice
◊  Cluster of shared memory systems
◊  Accelerator hardware in a shared memory system
●  Dedicated local memory on the accelerator
●  Example: SIMD GPU hardware in SMP computer system
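To make “explicit communication” concrete, a minimal hedged message-passing sketch in C using MPI (our own example; MPI itself is covered in week 5): each process owns only local data, and rank 0 transfers a value to rank 1 by sending a message.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* which process am I? */

    if (rank == 0) {
        value = 42;                        /* lives in rank 0's local memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);       /* copied into rank 1's memory */
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}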
Shared Memory vs. Shared Nothing
■  Pfister: “shared memory” vs. “distributed memory”
■  Foster: “multiprocessor” vs. “multicomputer”
■  Tanenbaum: “shared memory” vs. “private memory”
[Figure: left — tasks on processing elements accessing common data in one shared memory; right — tasks on processing elements with private data, coordinating via messages]
Shared Memory
■  Processing elements act independently
■  Use the same global address space
■  Changes are visible for all processing elements
■  Uniform memory access (UMA) system
□  Equal access time for all PEs to all memory locations
□  Default approach for SMP systems of the past
■  Non-uniform memory access (NUMA) system
□  Delay on memory access according to the accessed region
□  Typically due to core / processor interconnect technology
■  Cache-coherent NUMA (CC-NUMA) system
□  NUMA system that keeps all caches consistent
□  Transparent hardware mechanisms
□  Became the standard approach with recent x86 chips
UMA Example
■  Two dual-core processor chips in an SMP system
■  Level 1 cache (fast, small), Level 2 cache (slower, larger)
■  Hardware manages cache coherency among all cores
[Figure: two sockets, each with two cores and private L1 caches above a shared L2 cache, connected through the system bus and chipset / memory controller to RAM]
NUMA Example
■  Eight cores on 2 sockets in an SMP system
■  Memory controllers + chip interconnect realize a single memory address space for the software
[Figure: two sockets, each with four cores (private L1 and L2 caches, shared L3 cache) and a local memory controller with attached RAM, coupled by a chip interconnect]
NUMA Example: 4-way Intel Nehalem SMP
[Figure: four processor chips, each with four cores, a shared L3 cache, a local memory controller with attached memory, and I/O links, fully connected via QPI interconnects]
Shared Nothing
■  Processing elements no longer share a common global memory
■  Easy scale-out by adding machines to the messaging network
■  Cluster computing: Combine machines with cheap interconnect
□  Compute cluster: Speedup for an application
◊  Batch processing, data parallelism
□  Load-balancing cluster: Better throughput for some service
□  High Availability (HA) cluster: Fault tolerance
■  Cluster to the extreme
□  High Performance Computing (HPC)
□  Massively Parallel Processing (MPP) hardware
□  TOP500 list of the fastest supercomputers
Clusters
[Figure: processing elements with private data, coordinating only via messages]
Shared Nothing Example
[Figure: several machines, each a socket with two cores (private L1 and L2, shared L3 cache), a memory controller with RAM, and a network interface; the machines communicate only over the interconnection network]
Hybrid Example
[Figure: machines with two sockets each (cores with private L1D and L2, shared L3 cache, local memory controller and RAM), coupled by a chip interconnect inside the machine and by the interconnection network between machines]
Example: Cluster of Nehalem SMPs
[Figure: several 4-way Nehalem SMP machines connected by a network]
The Parallel Programming Problem
■  Execution environment has a particular type
(SIMD, MIMD, UMA, NUMA, …)
■  Execution environment may be configurable (number of resources)
■  Parallel application must be mapped to available resources
[Figure: a parallel application must be matched against the execution environment, which has a fixed type and a flexible configuration]
Parallel Programming Concepts
OpenHPI Course
Week 1: Terminology and fundamental concepts
Unit 1.6: Speedup and Scaleup
Dr. Peter Tröger + Teaching Team
Which One Is Faster?
■  Usage scenario
□  Transporting a fridge
■  Usage environment
□  Driving through a forest
■  Perception of performance
□  Maximum speed
□  Average speed
□  Acceleration
■  We need some kind of
application-specific benchmark
Parallelism for …
■  Speedup – compute faster
■  Throughput – compute more in the same time
■  Scalability – compute faster / more with additional resources
■  …
[Figure: scaling up adds processing elements that share one main memory; scaling out adds further machines, each with its own main memory]
Metrics
■  Parallelization metrics are application-dependent,
but follow a common set of concepts
□  Speedup: Adding more resources leads to less time for
solving the same problem.
□  Linear speedup:
n times more resources → n times speedup
□  Scaleup: Adding more resources solves a larger version of the
same problem in the same time.
□  Linear scaleup:
n times more resources → n times larger problem solvable
■  The most important goal depends on the application
□  Throughput demands scalability of the software
□  Response time demands speedup of the processing
Speedup
■  Idealized assumptions
□  All tasks are equal sized
□  All code parts can run in parallel
■  Example application with v = 12 tasks
□  One processing element (N = 1): time needed T1 = 12
□  Three processing elements (N = 3): time needed T3 = 4
□  (Linear) speedup: T1/T3 = 12/4 = 3
[Figure: the twelve tasks executed serially on one PE vs. four per PE on three PEs]
Speedup with Load Imbalance
■  Assumptions
□  Tasks have different size; the best-possible speedup depends on optimized resource usage
□  All code parts can run in parallel
■  Example application with v = 12 tasks of different size
□  One processing element (N = 1): time needed T1 = 16
□  Three processing elements (N = 3): time needed T3 = 6
□  Speedup: T1/T3 = 16/6 ≈ 2.67
[Figure: the unevenly sized tasks scheduled on three PEs; imbalance leaves some PEs idle]
Speedup with Serial Parts
■  Each application has inherently non-parallelizable serial parts
□  Algorithmic limitations
□  Shared resources acting as bottleneck
□  Overhead for program start
□  Communication overhead in shared-nothing systems
[Figure: execution alternates between serial phases (tSER1, tSER2, tSER3) and parallelizable phases (tPAR1, tPAR2) containing the twelve tasks]
Amdahl’s Law
■  Gene Amdahl. “Validity of the single processor approach to achieving large scale computing capabilities”. AFIPS 1967
□  Serial parts: TSER = tSER1 + tSER2 + tSER3 + …
□  Parallelizable parts: TPAR = tPAR1 + tPAR2 + tPAR3 + …
□  Execution time with one processing element:
T1 = TSER + TPAR
□  Execution time with N parallel processing elements:
TN ≥ TSER + TPAR / N
◊  Equal only on perfect parallelization, e.g. no load imbalance
□  Amdahl’s Law for the maximum speedup with N processing elements:
S = T1 / TN = (TSER + TPAR) / (TSER + TPAR / N)
[Chart: speedup by Amdahl’s Law over N, flattening as N grows]
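A small hedged C sketch (our own example) that evaluates the formula: with a serial fraction of 10% (T1 = 1, TSER = 0.1), the speedup creeps toward 10 no matter how many processing elements are added.

#include <stdio.h>

/* Amdahl's Law: maximum speedup for serial fraction s with n PEs. */
static double amdahl_speedup(double s, int n) {
    return 1.0 / (s + (1.0 - s) / n);
}

int main(void) {
    const double s = 0.10;                      /* 10% inherently serial */
    const int n[] = {1, 2, 4, 8, 16, 64, 1024};
    for (size_t i = 0; i < sizeof n / sizeof n[0]; i++)
        printf("N = %4d -> speedup %.2f\n", n[i], amdahl_speedup(s, n[i]));
    return 0;                                   /* bounded by 1/s = 10 */
}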
Amdahl’s Law
■  Speedup through parallelism is hard to achieve
■  For unlimited resources, speedup is bound by the serial parts:
□  Assume T1 = 1; then S(N→∞) = T1 / T(N→∞) = 1 / TSER
■  Parallelization problem relates to all system layers
□  Hardware offers some degree of parallel execution
□  Speedup gained is bound by serial parts:
◊  Limitations of hardware components
◊  Necessary serial activities in the operating system, virtual runtime system, middleware and the application
◊  Overhead for the parallelization itself
Amdahl’s Law
■  “Everyone knows Amdahl’s law, but quickly forgets it.”
[Thomas Puzak, IBM]
■  90% parallelizable code leads to not more than 10x speedup
□  Regardless of the number of processing elements
■  Parallelism is only useful …
□  … for small number of processing elements
□  … for highly parallelizable code
■  What’s the sense in big parallel / distributed hardware setups?
■  Relevant assumptions
□  Put the same problem on different hardware
□  Assumption of fixed problem size
□  Only consideration of execution time for one problem
Gustafson-Barsis’ Law (1988)
■  Gustafson and Barsis: People are typically not interested in the
shortest execution time
□  Rather solve a bigger problem in reasonable time
■  Problem size could then scale with the number of processors
□  Typical in simulation and farmer / worker problems
□  Leads to larger parallel fraction with increasing N
□  Serial part is usually fixed or grows slower
■  Maximum scaled speedup by N processors:
S = (TSER + N · TPAR) / (TSER + TPAR)
■  Linear speedup now becomes possible
■  Software needs to ensure that serial parts remain constant
■  Other models exist (e.g. work-span model, Karp–Flatt metric)
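A hedged worked example (ours, not from the slides): with a fixed serial part TSER = 0.1 and a parallel part TPAR = 0.9 that scales with the machine, N = 100 processors yield

S = (0.1 + 100 × 0.9) / (0.1 + 0.9) = 90.1 / 1.0 ≈ 90

while Amdahl’s fixed-size view of the same 10% serial fraction would cap the speedup at 10.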
Summary: Week 1
■  Moore’s Law and the Power Wall
□  Processing element speed no longer increases
■  ILP Wall and Memory Wall
□  Memory access is not fast enough for modern hardware
■  Parallel Hardware Classification
□  From ILP to SMP, SIMD vs. MIMD
■  Memory Architectures
□  UMA vs. NUMA
■  Speedup and Scaleup
□  Amdahl’s Law and Gustafson–Barsis’ Law
Since we need parallelism for speedup,
how can we express it in software?
More Related Content

What's hot

FPGA Conference 2021: Breaking the TOPS ceiling with sparse neural networks -...
FPGA Conference 2021: Breaking the TOPS ceiling with sparse neural networks -...FPGA Conference 2021: Breaking the TOPS ceiling with sparse neural networks -...
FPGA Conference 2021: Breaking the TOPS ceiling with sparse neural networks -...
Numenta
 
NOR gate design in microwind
NOR gate design in microwindNOR gate design in microwind
NOR gate design in microwind
Omkar Rane
 
Signal Integrity - A Crash Course [R Lott]
Signal Integrity - A Crash Course [R Lott]Signal Integrity - A Crash Course [R Lott]
Signal Integrity - A Crash Course [R Lott]
Ryan Lott
 
Dell VMware Virtual SAN Ready Nodes
Dell VMware Virtual SAN Ready NodesDell VMware Virtual SAN Ready Nodes
Dell VMware Virtual SAN Ready Nodes
Andrew McDaniel
 
Cloud Computing & Its Impact on Project Management
Cloud Computing & Its Impact on Project ManagementCloud Computing & Its Impact on Project Management
Cloud Computing & Its Impact on Project Management
VSR *
 
CKAN as an open-source data management solution for open data
CKAN as an open-source data management solution for open data CKAN as an open-source data management solution for open data
CKAN as an open-source data management solution for open data
AIMS (Agricultural Information Management Standards)
 
Blockchain and Apache NiFi
Blockchain and Apache NiFiBlockchain and Apache NiFi
Blockchain and Apache NiFi
Timothy Spann
 
IBM Tape the future of tape
IBM Tape the future of tapeIBM Tape the future of tape
IBM Tape the future of tape
Josef Weingand
 
SimfiaNeo - Workbench for Safety Analysis powered by Sirius
SimfiaNeo - Workbench for Safety Analysis powered by SiriusSimfiaNeo - Workbench for Safety Analysis powered by Sirius
SimfiaNeo - Workbench for Safety Analysis powered by Sirius
Obeo
 
Analog vlsi
Analog vlsiAnalog vlsi
Analog vlsi
Khuong Lamborghini
 
Leveraging Operational Data in the Cloud
 Leveraging Operational Data in the Cloud Leveraging Operational Data in the Cloud
Leveraging Operational Data in the Cloud
Inductive Automation
 
Lessons Learned Running Hadoop and Spark in Docker Containers
Lessons Learned Running Hadoop and Spark in Docker ContainersLessons Learned Running Hadoop and Spark in Docker Containers
Lessons Learned Running Hadoop and Spark in Docker Containers
BlueData, Inc.
 
Hardware Acceleration for Machine Learning
Hardware Acceleration for Machine LearningHardware Acceleration for Machine Learning
Hardware Acceleration for Machine Learning
CastLabKAIST
 
VMworld 2017 - Top 10 things to know about vSAN
VMworld 2017 - Top 10 things to know about vSANVMworld 2017 - Top 10 things to know about vSAN
VMworld 2017 - Top 10 things to know about vSAN
Duncan Epping
 
BUilt-In-Self-Test for VLSI Design
BUilt-In-Self-Test for VLSI DesignBUilt-In-Self-Test for VLSI Design
BUilt-In-Self-Test for VLSI Design
Usha Mehta
 
Applications of ATPG
Applications of ATPGApplications of ATPG
Applications of ATPG
Ushaswini Chowdary
 
Introduction to COMS VLSI Design
Introduction to COMS VLSI DesignIntroduction to COMS VLSI Design
Introduction to COMS VLSI Design
Eutectics
 
Cache coloring Xen Summit 2020
Cache coloring Xen Summit 2020Cache coloring Xen Summit 2020
Cache coloring Xen Summit 2020
Stefano Stabellini
 
Study of vlsi design methodologies and limitations using cad tools for cmos t...
Study of vlsi design methodologies and limitations using cad tools for cmos t...Study of vlsi design methodologies and limitations using cad tools for cmos t...
Study of vlsi design methodologies and limitations using cad tools for cmos t...
Lakshmi Narain College of Technology & Science Bhopal
 
AMD Chiplet Architecture for High-Performance Server and Desktop Products
AMD Chiplet Architecture for High-Performance Server and Desktop ProductsAMD Chiplet Architecture for High-Performance Server and Desktop Products
AMD Chiplet Architecture for High-Performance Server and Desktop Products
AMD
 

What's hot (20)

FPGA Conference 2021: Breaking the TOPS ceiling with sparse neural networks -...
FPGA Conference 2021: Breaking the TOPS ceiling with sparse neural networks -...FPGA Conference 2021: Breaking the TOPS ceiling with sparse neural networks -...
FPGA Conference 2021: Breaking the TOPS ceiling with sparse neural networks -...
 
NOR gate design in microwind
NOR gate design in microwindNOR gate design in microwind
NOR gate design in microwind
 
Signal Integrity - A Crash Course [R Lott]
Signal Integrity - A Crash Course [R Lott]Signal Integrity - A Crash Course [R Lott]
Signal Integrity - A Crash Course [R Lott]
 
Dell VMware Virtual SAN Ready Nodes
Dell VMware Virtual SAN Ready NodesDell VMware Virtual SAN Ready Nodes
Dell VMware Virtual SAN Ready Nodes
 
Cloud Computing & Its Impact on Project Management
Cloud Computing & Its Impact on Project ManagementCloud Computing & Its Impact on Project Management
Cloud Computing & Its Impact on Project Management
 
CKAN as an open-source data management solution for open data
CKAN as an open-source data management solution for open data CKAN as an open-source data management solution for open data
CKAN as an open-source data management solution for open data
 
Blockchain and Apache NiFi
Blockchain and Apache NiFiBlockchain and Apache NiFi
Blockchain and Apache NiFi
 
IBM Tape the future of tape
IBM Tape the future of tapeIBM Tape the future of tape
IBM Tape the future of tape
 
SimfiaNeo - Workbench for Safety Analysis powered by Sirius
SimfiaNeo - Workbench for Safety Analysis powered by SiriusSimfiaNeo - Workbench for Safety Analysis powered by Sirius
SimfiaNeo - Workbench for Safety Analysis powered by Sirius
 
Analog vlsi
Analog vlsiAnalog vlsi
Analog vlsi
 
Leveraging Operational Data in the Cloud
 Leveraging Operational Data in the Cloud Leveraging Operational Data in the Cloud
Leveraging Operational Data in the Cloud
 
Lessons Learned Running Hadoop and Spark in Docker Containers
Lessons Learned Running Hadoop and Spark in Docker ContainersLessons Learned Running Hadoop and Spark in Docker Containers
Lessons Learned Running Hadoop and Spark in Docker Containers
 
Hardware Acceleration for Machine Learning
Hardware Acceleration for Machine LearningHardware Acceleration for Machine Learning
Hardware Acceleration for Machine Learning
 
VMworld 2017 - Top 10 things to know about vSAN
VMworld 2017 - Top 10 things to know about vSANVMworld 2017 - Top 10 things to know about vSAN
VMworld 2017 - Top 10 things to know about vSAN
 
BUilt-In-Self-Test for VLSI Design
BUilt-In-Self-Test for VLSI DesignBUilt-In-Self-Test for VLSI Design
BUilt-In-Self-Test for VLSI Design
 
Applications of ATPG
Applications of ATPGApplications of ATPG
Applications of ATPG
 
Introduction to COMS VLSI Design
Introduction to COMS VLSI DesignIntroduction to COMS VLSI Design
Introduction to COMS VLSI Design
 
Cache coloring Xen Summit 2020
Cache coloring Xen Summit 2020Cache coloring Xen Summit 2020
Cache coloring Xen Summit 2020
 
Study of vlsi design methodologies and limitations using cad tools for cmos t...
Study of vlsi design methodologies and limitations using cad tools for cmos t...Study of vlsi design methodologies and limitations using cad tools for cmos t...
Study of vlsi design methodologies and limitations using cad tools for cmos t...
 
AMD Chiplet Architecture for High-Performance Server and Desktop Products
AMD Chiplet Architecture for High-Performance Server and Desktop ProductsAMD Chiplet Architecture for High-Performance Server and Desktop Products
AMD Chiplet Architecture for High-Performance Server and Desktop Products
 

Similar to OpenHPI - Parallel Programming Concepts - Week 1

OpenHPI - Parallel Programming Concepts - Week 2
OpenHPI - Parallel Programming Concepts - Week 2OpenHPI - Parallel Programming Concepts - Week 2
OpenHPI - Parallel Programming Concepts - Week 2
Peter Tröger
 
OpenHPI - Parallel Programming Concepts - Week 5
OpenHPI - Parallel Programming Concepts - Week 5OpenHPI - Parallel Programming Concepts - Week 5
OpenHPI - Parallel Programming Concepts - Week 5
Peter Tröger
 
Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...
Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...
Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...
Aseda Owusua Addai-Deseh
 
Data Science as Scale
Data Science as ScaleData Science as Scale
Data Science as Scale
Conor B. Murphy
 
OpenHPI - Parallel Programming Concepts - Week 4
OpenHPI - Parallel Programming Concepts - Week 4OpenHPI - Parallel Programming Concepts - Week 4
OpenHPI - Parallel Programming Concepts - Week 4
Peter Tröger
 
Os Lamothe
Os LamotheOs Lamothe
Os Lamothe
oscon2007
 
Intro to open source observability with grafana, prometheus, loki, and tempo(...
Intro to open source observability with grafana, prometheus, loki, and tempo(...Intro to open source observability with grafana, prometheus, loki, and tempo(...
Intro to open source observability with grafana, prometheus, loki, and tempo(...
LibbySchulze
 
CK: from ad hoc computer engineering to collaborative and reproducible data s...
CK: from ad hoc computer engineering to collaborative and reproducible data s...CK: from ad hoc computer engineering to collaborative and reproducible data s...
CK: from ad hoc computer engineering to collaborative and reproducible data s...
Grigori Fursin
 
Chap10.pdf
Chap10.pdfChap10.pdf
Chap10.pdf
NaimKhan57
 
unit-1-181211045120.pdf
unit-1-181211045120.pdfunit-1-181211045120.pdf
unit-1-181211045120.pdf
Vhhvf
 
Data Science in Production: Technologies That Drive Adoption of Data Science ...
Data Science in Production: Technologies That Drive Adoption of Data Science ...Data Science in Production: Technologies That Drive Adoption of Data Science ...
Data Science in Production: Technologies That Drive Adoption of Data Science ...
Nir Yungster
 
OpenCL & the Future of Desktop High Performance Computing in CAD
OpenCL & the Future of Desktop High Performance Computing in CADOpenCL & the Future of Desktop High Performance Computing in CAD
OpenCL & the Future of Desktop High Performance Computing in CAD
Design World
 
slides.pdf
slides.pdfslides.pdf
slides.pdf
GafryMahmoud
 
Available HPC Resources at CSUC
Available HPC Resources at CSUCAvailable HPC Resources at CSUC
Assignment 1-mtat
Assignment 1-mtatAssignment 1-mtat
Assignment 1-mtat
zafargilani
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
Sampath Kumar
 
Parallel Algorithms
Parallel AlgorithmsParallel Algorithms
Parallel Algorithms
Dr Sandeep Kumar Poonia
 
Data Science
Data ScienceData Science
Data Science
Prithwis Mukerjee
 
Lecture1
Lecture1Lecture1
Lecture1
tt_aljobory
 
AI LAB using IBM Power 9 Processor
AI LAB using IBM Power 9 ProcessorAI LAB using IBM Power 9 Processor
AI LAB using IBM Power 9 Processor
Ganesan Narayanasamy
 

Similar to OpenHPI - Parallel Programming Concepts - Week 1 (20)

OpenHPI - Parallel Programming Concepts - Week 2
OpenHPI - Parallel Programming Concepts - Week 2OpenHPI - Parallel Programming Concepts - Week 2
OpenHPI - Parallel Programming Concepts - Week 2
 
OpenHPI - Parallel Programming Concepts - Week 5
OpenHPI - Parallel Programming Concepts - Week 5OpenHPI - Parallel Programming Concepts - Week 5
OpenHPI - Parallel Programming Concepts - Week 5
 
Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...
Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...
Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...
 
Data Science as Scale
Data Science as ScaleData Science as Scale
Data Science as Scale
 
OpenHPI - Parallel Programming Concepts - Week 4
OpenHPI - Parallel Programming Concepts - Week 4OpenHPI - Parallel Programming Concepts - Week 4
OpenHPI - Parallel Programming Concepts - Week 4
 
Os Lamothe
Os LamotheOs Lamothe
Os Lamothe
 
Intro to open source observability with grafana, prometheus, loki, and tempo(...
Intro to open source observability with grafana, prometheus, loki, and tempo(...Intro to open source observability with grafana, prometheus, loki, and tempo(...
Intro to open source observability with grafana, prometheus, loki, and tempo(...
 
CK: from ad hoc computer engineering to collaborative and reproducible data s...
CK: from ad hoc computer engineering to collaborative and reproducible data s...CK: from ad hoc computer engineering to collaborative and reproducible data s...
CK: from ad hoc computer engineering to collaborative and reproducible data s...
 
Chap10.pdf
Chap10.pdfChap10.pdf
Chap10.pdf
 
unit-1-181211045120.pdf
unit-1-181211045120.pdfunit-1-181211045120.pdf
unit-1-181211045120.pdf
 
Data Science in Production: Technologies That Drive Adoption of Data Science ...
Data Science in Production: Technologies That Drive Adoption of Data Science ...Data Science in Production: Technologies That Drive Adoption of Data Science ...
Data Science in Production: Technologies That Drive Adoption of Data Science ...
 
OpenCL & the Future of Desktop High Performance Computing in CAD
OpenCL & the Future of Desktop High Performance Computing in CADOpenCL & the Future of Desktop High Performance Computing in CAD
OpenCL & the Future of Desktop High Performance Computing in CAD
 
slides.pdf
slides.pdfslides.pdf
slides.pdf
 
Available HPC Resources at CSUC
Available HPC Resources at CSUCAvailable HPC Resources at CSUC
Available HPC Resources at CSUC
 
Assignment 1-mtat
Assignment 1-mtatAssignment 1-mtat
Assignment 1-mtat
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
Parallel Algorithms
Parallel AlgorithmsParallel Algorithms
Parallel Algorithms
 
Data Science
Data ScienceData Science
Data Science
 
Lecture1
Lecture1Lecture1
Lecture1
 
AI LAB using IBM Power 9 Processor
AI LAB using IBM Power 9 ProcessorAI LAB using IBM Power 9 Processor
AI LAB using IBM Power 9 Processor
 

More from Peter Tröger

WannaCry - An OS course perspective
WannaCry - An OS course perspectiveWannaCry - An OS course perspective
WannaCry - An OS course perspective
Peter Tröger
 
Cloud Standards and Virtualization
Cloud Standards and VirtualizationCloud Standards and Virtualization
Cloud Standards and Virtualization
Peter Tröger
 
Distributed Resource Management Application API (DRMAA) Version 2
Distributed Resource Management Application API (DRMAA) Version 2Distributed Resource Management Application API (DRMAA) Version 2
Distributed Resource Management Application API (DRMAA) Version 2
Peter Tröger
 
OpenSubmit - How to grade 1200 code submissions
OpenSubmit - How to grade 1200 code submissionsOpenSubmit - How to grade 1200 code submissions
OpenSubmit - How to grade 1200 code submissions
Peter Tröger
 
Design of Software for Embedded Systems
Design of Software for Embedded SystemsDesign of Software for Embedded Systems
Design of Software for Embedded Systems
Peter Tröger
 
Humans should not write XML.
Humans should not write XML.Humans should not write XML.
Humans should not write XML.
Peter Tröger
 
What activates a bug? A refinement of the Laprie terminology model.
What activates a bug? A refinement of the Laprie terminology model.What activates a bug? A refinement of the Laprie terminology model.
What activates a bug? A refinement of the Laprie terminology model.
Peter Tröger
 
Dependable Systems - Summary (16/16)
Dependable Systems - Summary (16/16)Dependable Systems - Summary (16/16)
Dependable Systems - Summary (16/16)
Peter Tröger
 
Dependable Systems - Hardware Dependability with Redundancy (14/16)
Dependable Systems - Hardware Dependability with Redundancy (14/16)Dependable Systems - Hardware Dependability with Redundancy (14/16)
Dependable Systems - Hardware Dependability with Redundancy (14/16)
Peter Tröger
 
Dependable Systems - System Dependability Evaluation (8/16)
Dependable Systems - System Dependability Evaluation (8/16)Dependable Systems - System Dependability Evaluation (8/16)
Dependable Systems - System Dependability Evaluation (8/16)
Peter Tröger
 
Dependable Systems - Structure-Based Dependabiilty Modeling (6/16)
Dependable Systems - Structure-Based Dependabiilty Modeling (6/16)Dependable Systems - Structure-Based Dependabiilty Modeling (6/16)
Dependable Systems - Structure-Based Dependabiilty Modeling (6/16)
Peter Tröger
 
Dependable Systems -Software Dependability (15/16)
Dependable Systems -Software Dependability (15/16)Dependable Systems -Software Dependability (15/16)
Dependable Systems -Software Dependability (15/16)
Peter Tröger
 
Dependable Systems -Reliability Prediction (9/16)
Dependable Systems -Reliability Prediction (9/16)Dependable Systems -Reliability Prediction (9/16)
Dependable Systems -Reliability Prediction (9/16)
Peter Tröger
 
Dependable Systems -Fault Tolerance Patterns (4/16)
Dependable Systems -Fault Tolerance Patterns (4/16)Dependable Systems -Fault Tolerance Patterns (4/16)
Dependable Systems -Fault Tolerance Patterns (4/16)
Peter Tröger
 
Dependable Systems - Introduction (1/16)
Dependable Systems - Introduction (1/16)Dependable Systems - Introduction (1/16)
Dependable Systems - Introduction (1/16)
Peter Tröger
 
Dependable Systems -Dependability Means (3/16)
Dependable Systems -Dependability Means (3/16)Dependable Systems -Dependability Means (3/16)
Dependable Systems -Dependability Means (3/16)
Peter Tröger
 
Dependable Systems - Hardware Dependability with Diagnosis (13/16)
Dependable Systems - Hardware Dependability with Diagnosis (13/16)Dependable Systems - Hardware Dependability with Diagnosis (13/16)
Dependable Systems - Hardware Dependability with Diagnosis (13/16)
Peter Tröger
 
Dependable Systems -Dependability Attributes (5/16)
Dependable Systems -Dependability Attributes (5/16)Dependable Systems -Dependability Attributes (5/16)
Dependable Systems -Dependability Attributes (5/16)
Peter Tröger
 
Dependable Systems -Dependability Threats (2/16)
Dependable Systems -Dependability Threats (2/16)Dependable Systems -Dependability Threats (2/16)
Dependable Systems -Dependability Threats (2/16)
Peter Tröger
 
Verteilte Software-Systeme im Kontext von Industrie 4.0
Verteilte Software-Systeme im Kontext von Industrie 4.0Verteilte Software-Systeme im Kontext von Industrie 4.0
Verteilte Software-Systeme im Kontext von Industrie 4.0
Peter Tröger
 

More from Peter Tröger (20)

WannaCry - An OS course perspective
WannaCry - An OS course perspectiveWannaCry - An OS course perspective
WannaCry - An OS course perspective
 
Cloud Standards and Virtualization
Cloud Standards and VirtualizationCloud Standards and Virtualization
Cloud Standards and Virtualization
 
Distributed Resource Management Application API (DRMAA) Version 2
Distributed Resource Management Application API (DRMAA) Version 2Distributed Resource Management Application API (DRMAA) Version 2
Distributed Resource Management Application API (DRMAA) Version 2
 
OpenSubmit - How to grade 1200 code submissions
OpenSubmit - How to grade 1200 code submissionsOpenSubmit - How to grade 1200 code submissions
OpenSubmit - How to grade 1200 code submissions
 
Design of Software for Embedded Systems
Design of Software for Embedded SystemsDesign of Software for Embedded Systems
Design of Software for Embedded Systems
 
Humans should not write XML.
Humans should not write XML.Humans should not write XML.
Humans should not write XML.
 
What activates a bug? A refinement of the Laprie terminology model.
What activates a bug? A refinement of the Laprie terminology model.What activates a bug? A refinement of the Laprie terminology model.
What activates a bug? A refinement of the Laprie terminology model.
 
Dependable Systems - Summary (16/16)
Dependable Systems - Summary (16/16)Dependable Systems - Summary (16/16)
Dependable Systems - Summary (16/16)
 
Dependable Systems - Hardware Dependability with Redundancy (14/16)
Dependable Systems - Hardware Dependability with Redundancy (14/16)Dependable Systems - Hardware Dependability with Redundancy (14/16)
Dependable Systems - Hardware Dependability with Redundancy (14/16)
 
Dependable Systems - System Dependability Evaluation (8/16)
Dependable Systems - System Dependability Evaluation (8/16)Dependable Systems - System Dependability Evaluation (8/16)
Dependable Systems - System Dependability Evaluation (8/16)
 
Dependable Systems - Structure-Based Dependabiilty Modeling (6/16)
Dependable Systems - Structure-Based Dependabiilty Modeling (6/16)Dependable Systems - Structure-Based Dependabiilty Modeling (6/16)
Dependable Systems - Structure-Based Dependabiilty Modeling (6/16)
 
Dependable Systems -Software Dependability (15/16)
Dependable Systems -Software Dependability (15/16)Dependable Systems -Software Dependability (15/16)
Dependable Systems -Software Dependability (15/16)
 
Dependable Systems -Reliability Prediction (9/16)
Dependable Systems -Reliability Prediction (9/16)Dependable Systems -Reliability Prediction (9/16)
Dependable Systems -Reliability Prediction (9/16)
 
Dependable Systems -Fault Tolerance Patterns (4/16)
Dependable Systems -Fault Tolerance Patterns (4/16)Dependable Systems -Fault Tolerance Patterns (4/16)
Dependable Systems -Fault Tolerance Patterns (4/16)
 
Dependable Systems - Introduction (1/16)
Dependable Systems - Introduction (1/16)Dependable Systems - Introduction (1/16)
Dependable Systems - Introduction (1/16)
 
Dependable Systems -Dependability Means (3/16)
Dependable Systems -Dependability Means (3/16)Dependable Systems -Dependability Means (3/16)
Dependable Systems -Dependability Means (3/16)
 
Dependable Systems - Hardware Dependability with Diagnosis (13/16)
Dependable Systems - Hardware Dependability with Diagnosis (13/16)Dependable Systems - Hardware Dependability with Diagnosis (13/16)
Dependable Systems - Hardware Dependability with Diagnosis (13/16)
 
Dependable Systems -Dependability Attributes (5/16)
Dependable Systems -Dependability Attributes (5/16)Dependable Systems -Dependability Attributes (5/16)
Dependable Systems -Dependability Attributes (5/16)
 
Dependable Systems -Dependability Threats (2/16)
Dependable Systems -Dependability Threats (2/16)Dependable Systems -Dependability Threats (2/16)
Dependable Systems -Dependability Threats (2/16)
 
Verteilte Software-Systeme im Kontext von Industrie 4.0
Verteilte Software-Systeme im Kontext von Industrie 4.0Verteilte Software-Systeme im Kontext von Industrie 4.0
Verteilte Software-Systeme im Kontext von Industrie 4.0
 

Recently uploaded

Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
Dr. Vinod Kumar Kanvaria
 
Pengantar Penggunaan Flutter - Dart programming language1.pptx
Pengantar Penggunaan Flutter - Dart programming language1.pptxPengantar Penggunaan Flutter - Dart programming language1.pptx
Pengantar Penggunaan Flutter - Dart programming language1.pptx
Fajar Baskoro
 
Film vocab for eal 3 students: Australia the movie
Film vocab for eal 3 students: Australia the movieFilm vocab for eal 3 students: Australia the movie
Film vocab for eal 3 students: Australia the movie
Nicholas Montgomery
 
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptxC1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
mulvey2
 
How to Fix the Import Error in the Odoo 17
How to Fix the Import Error in the Odoo 17How to Fix the Import Error in the Odoo 17
How to Fix the Import Error in the Odoo 17
Celine George
 
How to Add Chatter in the odoo 17 ERP Module
How to Add Chatter in the odoo 17 ERP ModuleHow to Add Chatter in the odoo 17 ERP Module
How to Add Chatter in the odoo 17 ERP Module
Celine George
 
World environment day ppt For 5 June 2024
World environment day ppt For 5 June 2024World environment day ppt For 5 June 2024
World environment day ppt For 5 June 2024
ak6969907
 
Digital Artefact 1 - Tiny Home Environmental Design
Digital Artefact 1 - Tiny Home Environmental DesignDigital Artefact 1 - Tiny Home Environmental Design
Digital Artefact 1 - Tiny Home Environmental Design
amberjdewit93
 
MARY JANE WILSON, A “BOA MÃE” .
MARY JANE WILSON, A “BOA MÃE”           .MARY JANE WILSON, A “BOA MÃE”           .
MARY JANE WILSON, A “BOA MÃE” .
Colégio Santa Teresinha
 
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdfANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
Priyankaranawat4
 
PCOS corelations and management through Ayurveda.
PCOS corelations and management through Ayurveda.PCOS corelations and management through Ayurveda.
PCOS corelations and management through Ayurveda.
Dr. Shivangi Singh Parihar
 
How to Build a Module in Odoo 17 Using the Scaffold Method
How to Build a Module in Odoo 17 Using the Scaffold MethodHow to Build a Module in Odoo 17 Using the Scaffold Method
How to Build a Module in Odoo 17 Using the Scaffold Method
Celine George
 
BBR 2024 Summer Sessions Interview Training
BBR  2024 Summer Sessions Interview TrainingBBR  2024 Summer Sessions Interview Training
BBR 2024 Summer Sessions Interview Training
Katrina Pritchard
 
DRUGS AND ITS classification slide share
DRUGS AND ITS classification slide shareDRUGS AND ITS classification slide share
DRUGS AND ITS classification slide share
taiba qazi
 
OpenHPI - Parallel Programming Concepts - Week 1

  • 11. Processor Hardware
    ■  First computers had fixed programs (e.g. electronic calculator)
    ■  Von Neumann architecture (1945)
      □  Instructions for central processing unit (CPU) in memory
      □  Program is treated as data
      □  Loading of code during runtime, self-modification
    ■  Multiple such processors: Symmetric multiprocessing (SMP)
    [Diagram: Von Neumann architecture with CPU (control unit, arithmetic logic unit), memory, input and output, connected by a bus]
  • 12. Moore’s Law
    ■  “...the number of transistors that can be inexpensively placed on an integrated circuit is increasing exponentially, doubling approximately every two years. ...” (Gordon Moore, 1965)
      □  CPUs contain different hardware parts, such as logic gates
      □  Parts are built from transistors
      □  Rule of exponential growth for the number of transistors on one CPU chip
      □  Meanwhile a self-fulfilling prophecy
      □  Applied not only in processor industry, but also in other areas
      □  Sometimes misinterpreted as performance indication
      □  May still hold for the next 10-20 years [Wikipedia]
  • 13. Moore’s Law
    [Figure: transistor count growth over time, from Wikimedia]
  • 14. Moore’s Law vs. Software
    ■  Nathan P. Myhrvold, “The Next Fifty Years of Software”, 1997
      □  “Software is a gas. It expands to fit the container it is in.”
        ◊  Constant increase in the amount of code
      □  “Software grows until it becomes limited by Moore’s law.”
        ◊  Software often grows faster than hardware capabilities
      □  “Software growth makes Moore’s Law possible.”
        ◊  Software and hardware market stimulate each other
      □  “Software is only limited by human ambition & expectation.”
        ◊  People will always find ways for exploiting performance
    ■  Jevons paradox:
      □  “Technological progress that increases the efficiency with which a resource is used tends to increase (rather than decrease) the rate of consumption of that resource.”
  • 15. Processor Performance Development
    [Figure from Herb Sutter, 2009: transistor count, clock speed (MHz), power (W) and performance per clock (ILP) over time; clock speed (“work harder”) and ILP (“work smarter”) flatten while transistor count keeps growing]
  • 16. A Physics Problem
    ■  Power: Energy needed to run the processor
    ■  Static power (SP): Leakage in transistors while being inactive
    ■  Dynamic power (DP): Energy needed to switch a transistor
      □  DP (approx.) = Number of Transistors (N) × Capacitance (C) × Voltage² (V²) × Frequency (F)
    ■  Moore’s law: N goes up exponentially, C goes down with size
    ■  Power dissipation demands cooling
      □  Power density: W/cm²
    ■  Make dynamic power increase less dramatic:
      □  Bringing down V reduces energy consumption, quadratically!
      □  Don’t use N only for logic gates
    ■  Industry was able to increase the frequency (F) for decades
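To get a feeling for the quadratic voltage term, here is a small C sketch (added for this transcript; all numbers are made-up illustration values, not real processor data) that evaluates the DP formula for a voltage drop at constant frequency:

#include <stdio.h>

/* Back-of-the-envelope evaluation of the slide's formula
 * DP ~ N * C * V^2 * F, with invented example values. */
static double dynamic_power(double n, double c, double v, double f) {
    return n * c * v * v * f;
}

int main(void) {
    double base    = dynamic_power(1e9, 1e-15, 1.2, 3e9);
    double lower_v = dynamic_power(1e9, 1e-15, 1.0, 3e9);
    /* Prints ~0.69: dropping V from 1.2 to 1.0 at fixed F cuts DP by
     * about 30%. The quadratic V term is why voltage scaling worked
     * for decades, and why the ~0.7 V floor hurts. */
    printf("relative DP after voltage drop: %.2f\n", lower_v / base);
    return 0;
}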
  • 17. Processor Supply Voltage
    [Figure from Moore, ISSCC: processor supply voltage (Volt, log scale) between 1970 and 2010, decreasing over time]
  • 18. Power Density
    ■  Growth of watts per square centimeter in microprocessors
    ■  Higher temperatures: Increased leakage, slower transistors
    [Figure: power density from 1992 to 2005, rising past the “hot plate” level toward the air cooling limit]
  • 19. Power Density
    [Figure from Kevin Skadron, 2007] “Cooking-Aware” Computing?
  • 20. Second Problem: Leakage Increase
    [Figure from www.ieeeghn.org: processor power (W, log scale) from 1960 to 2010, split into active and leakage power]
    ■  Static leakage today: Up to 40% of CPU power consumption
  • 21. The Power Wall
    ■  Air cooling capabilities are limited
      □  Maximum temperature of 100-125 °C, hot spot problem
      □  Static and dynamic power consumption must be limited
    ■  Power consumption increases with Moore’s law, but hardware performance is still expected to grow
    ■  Further reducing voltage as compensation
      □  We can’t do that endlessly, lower limit around 0.7 V
      □  Strange physical effects below that
    ■  Next-generation processors need to use even less power
      □  Lower the frequencies, scale them dynamically
      □  Use only parts of the processor at a time (‘dark silicon’)
      □  Build energy-efficient special purpose hardware
    ■  No chance for faster processors through frequency increase
  • 22. The Free Lunch Is Over
    ■  Clock speed curve flattened in 2003
      □  Heat, power, leakage
    ■  Speeding up the serial instruction execution through clock speed improvements no longer works
    ■  Additional issues
      □  ILP wall
      □  Memory wall
    [Figure from Herb Sutter, 2009]
  • 23. Parallel Programming Concepts
    OpenHPI Course
    Week 1: Terminology and fundamental concepts
    Unit 1.3: ILP Wall and Memory Wall
    Dr. Peter Tröger + Teaching Team
  • 24. Three Ways Of Doing Anything Faster [Pfister]
    ■  Work harder (clock speed)
      □  Hardware solution
      !  Power wall problem
    ■  Work smarter (optimization, caching)
      □  Hardware solution
    ■  Get help (parallelization)
      □  Hardware + Software
    [Diagram: application instructions over time]
  • 25. Instruction Level Parallelism
    ■  Increasing the frequency is no longer an option
    ■  Provide smarter instruction processing for better performance
    ■  Instruction level parallelism (ILP)
      □  Processor hardware optimizes low-level instruction execution
      □  Instruction pipelining
        ◊  Overlapped execution of serial instructions
      □  Superscalar execution
        ◊  Multiple units of one processor are used in parallel
      □  Out-of-order execution
        ◊  Reorder instructions that do not have data dependencies
      □  Speculative execution
        ◊  Control flow speculation and branch prediction
    ■  Today’s processors are packed with such ILP logic
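How much a pipelined, superscalar, out-of-order core can overlap depends on the data dependencies in the instruction stream. A minimal C sketch (an added illustration, not from the slides): both functions do the same arithmetic, but the first forms a single dependency chain while the second exposes four independent chains:

#include <stdio.h>
#include <stddef.h>

double dependent_sum(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];                  /* each add waits for the previous one */
    return s;
}

double independent_sums(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += a[i];                 /* four independent dependency chains; */
        s1 += a[i + 1];             /* pipelined, superscalar, out-of-order */
        s2 += a[i + 2];             /* hardware can overlap them            */
        s3 += a[i + 3];
    }
    for (; i < n; i++)              /* leftover elements */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}

int main(void) {
    static double a[1000];
    for (size_t i = 0; i < 1000; i++)
        a[i] = 1.0;
    printf("%f %f\n", dependent_sum(a, 1000), independent_sums(a, 1000));
    return 0;
}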
  • 26. The ILP Wall
    ■  No longer cost-effective to dedicate new transistors to ILP mechanisms
    ■  Deeper pipelines make the power problem worse
    ■  High ILP complexity effectively reduces the processing speed for a given frequency (e.g. misprediction)
    ■  More aggressive ILP technologies too risky due to unknown real-world workloads
    ■  No ground-breaking new ideas
    ■  → “ILP wall”
    ■  Ok, let’s use the transistors for better caching [Wikipedia]
  • 27. Caching
    ■  von Neumann architecture
      □  Instructions are stored in main memory
      □  Program is treated as data
      □  For each instruction execution, data must be fetched
    ■  When the frequency increases, main memory becomes a performance bottleneck
    ■  Caching: Keep a data copy in very fast, small memory on the CPU
    [Diagram: Von Neumann architecture with a cache added between CPU and memory]
  • 28. Memory Hardware Hierarchy
    [Diagram: pyramid from fast / small / expensive at the top to slow / large / cheap at the bottom]
    ■  Volatile: registers, processor caches, random access memory (RAM)
    ■  Non-volatile: flash / SSD memory, hard drives, tapes
  • 29. Memory Hardware Hierarchy
    [Diagram: four CPU cores, each with a private L1 cache; pairs of cores share an L2 cache; all cores share an L3 cache, connected by buses. L = Level]
  • 30. Caching for Performance
    ■  Well established optimization technique for performance
    ■  Caching relies on data locality
      □  Some instructions are often used (e.g. loops)
      □  Some data is often used (e.g. local variables)
      □  Hardware keeps a copy of the data in the faster cache
      □  On read attempts, data is taken directly from the cache
      □  On write, data is cached and eventually written to memory
    ■  Similar to ILP, the potential is limited
      □  Larger caches do not help automatically
      □  At some point, all data locality in the code is already exploited
      □  Manual vs. compiler-driven optimization [arstechnica.com]
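Data locality can be demonstrated in a few lines. In this added sketch (illustration only, not course material), both functions sum the same matrix; the first walks memory in layout order and reuses each fetched cache line, the second strides through memory and loses that reuse:

#include <stdio.h>
#include <stddef.h>

#define N 512

static double m[N][N];              /* zero-initialized static matrix */

double sum_row_major(void) {
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += m[i][j];           /* consecutive addresses: cache line reuse */
    return s;
}

double sum_col_major(void) {
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += m[i][j];           /* stride of N doubles: frequent cache misses */
    return s;
}

int main(void) {
    printf("%f %f\n", sum_row_major(), sum_col_major());
    return 0;
}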
  • 31. Memory Wall
    ■  If caching is limited, we simply need faster memory
    ■  The problem: Shared memory is ‘shared’
      □  Interconnect contention
      □  Memory bandwidth
        ◊  Memory transfer speed is limited by the power wall
        ◊  Memory transfer size is limited by the power wall
    ■  Transfer technology cannot keep up with GHz processors
    ■  Memory is too slow, effects cannot be hidden through caching completely → “Memory wall” [dell.com]
  • 32. Problem Summary
    ■  Hardware perspective
      □  Number of transistors N is still increasing
      □  Building larger caches no longer helps (memory wall)
      □  ILP is out of options (ILP wall)
      □  Voltage / power / frequency is at the limit (power wall)
        ◊  Some help with dynamic scaling approaches
      □  Remaining option: Use N for more cores per processor chip
    ■  Software perspective
      □  Performance must come from the utilization of this increasing core count per chip, since F is now fixed
      □  Software must tackle the memory wall
  • 33. Three Ways Of Doing Anything Faster [Pfister]
    ■  Work harder (clock speed)
      !  Power wall problem
      !  Memory wall problem
    ■  Work smarter (optimization, caching)
      !  ILP wall problem
      !  Memory wall problem
    ■  Get help (parallelization)
      □  More cores per single CPU
      □  Software needs to exploit them in the right way
      !  Memory wall problem
    [Diagram: a problem mapped to one CPU with multiple cores]
  • 34. Parallel Programming Concepts
    OpenHPI Course
    Week 1: Terminology and fundamental concepts
    Unit 1.4: Parallel Hardware Classification
    Dr. Peter Tröger + Teaching Team
  • 35. Parallelism [Mattson et al.]
    ■  Task
      □  Parallel program breaks a problem into tasks
    ■  Execution unit
      □  Representation of a concurrently running task (e.g. thread)
      □  Tasks are mapped to execution units
    ■  Processing element (PE)
      □  Hardware element running one execution unit
      □  Depends on scenario: logical processor vs. core vs. machine
      □  Execution units run simultaneously on processing elements, controlled by some scheduler
    ■  Synchronization: Mechanism to order activities of parallel tasks
    ■  Race condition: Program result depends on the scheduling order
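A race condition is easiest to see in code. The following minimal POSIX-threads sketch (added for illustration, not part of the slides; compile with something like cc -pthread) lets two execution units increment a shared counter without synchronization, so the final result depends on the scheduling order:

#include <pthread.h>
#include <stdio.h>

static long counter = 0;            /* shared state in one address space */

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter++;                  /* read-modify-write: not atomic */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* Expected 2000000, but typically less due to lost updates:
     * a race condition. */
    printf("counter = %ld\n", counter);
    return 0;
}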
  • 36. Faster Processing through Parallelization
    [Diagram: one program decomposed into several tasks]
  • 37. Flynn’s Taxonomy (1966)
    ■  Classify parallel hardware architectures according to their capabilities in the instruction and data processing dimension
    [Diagram: one processing step per class]
    □  Single Instruction, Single Data (SISD): one instruction, one data item per processing step
    □  Single Instruction, Multiple Data (SIMD): one instruction, multiple data items per processing step
    □  Multiple Instruction, Single Data (MISD): multiple instructions, one data item per processing step
    □  Multiple Instruction, Multiple Data (MIMD): multiple instructions, multiple data items per processing step
  • 38. Flynn’s Taxonomy (1966)
    ■  Single Instruction, Single Data (SISD)
      □  No parallelism in the execution
      □  Old single processor architectures
    ■  Single Instruction, Multiple Data (SIMD)
      □  Multiple data streams processed with one instruction stream at the same time
      □  Typical in graphics hardware and GPU accelerators
      □  Special SIMD machines in high-performance computing
    ■  Multiple Instructions, Single Data (MISD)
      □  Multiple instructions applied to the same data in parallel
      □  Rarely used in practice, only for fault tolerance
    ■  Multiple Instructions, Multiple Data (MIMD)
      □  Every modern processor, compute clusters
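The SIMD class maps directly to loops in which one operation is applied to many data items. A minimal C sketch (added illustration; whether a compiler actually emits SIMD instructions for it depends on optimization flags and the target hardware):

#include <stdio.h>
#include <stddef.h>

/* One instruction pattern (multiply-add), n data items. Compilers
 * typically auto-vectorize such loops at -O2/-O3 on SIMD-capable CPUs. */
void saxpy(float a, const float *x, float *y, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void) {
    float x[8] = {1, 2, 3, 4, 5, 6, 7, 8}, y[8] = {0};
    saxpy(2.0f, x, y, 8);
    printf("y[7] = %f\n", y[7]);    /* 16.0 */
    return 0;
}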
  • 39. Parallelism on Different Levels
    [Diagram: programs are split into tasks, tasks are mapped to processing elements (PE); PEs share memory within a node, and nodes are connected by a network]
  • 40. Parallelism on Different Levels
    ■  A processor chip (socket)
      □  Chip multi-processing (CMP)
        ◊  Multiple CPUs per chip, called cores
        ◊  Multi-core / many-core
      □  Simultaneous multi-threading (SMT)
        ◊  Interleaved execution of tasks on one core
        ◊  Example: Intel Hyperthreading
      □  Chip multi-threading (CMT) = CMP + SMT
      □  Instruction-level parallelism (ILP)
        ◊  Parallel processing of single instructions per core
    ■  Multiple processor chips in one machine (multi-processing)
      □  Symmetric multi-processing (SMP)
    ■  Multiple processor chips in many machines (multi-computer)
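How many processing elements the operating system exposes can be queried at runtime. A small sketch for Linux and most Unix systems (added illustration; _SC_NPROCESSORS_ONLN is a widespread extension rather than strict POSIX). On an SMT-capable chip, the reported count of logical processors is typically higher than the number of physical cores:

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Logical processing elements visible to the OS scheduler. */
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    printf("online logical processors: %ld\n", n);
    return 0;
}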
  • 41. Parallelism on Different Levels
    [Diagram from arstechnica.com: a CMP architecture in which every core contributes ILP and SMT]
  • 42. Parallel Programming Concepts
    OpenHPI Course
    Week 1: Terminology and fundamental concepts
    Unit 1.5: Memory Architectures
    Dr. Peter Tröger + Teaching Team
  • 43. Parallelism on Different Levels
    [Diagram repeated from slide 39: tasks mapped to PEs, PEs grouped into nodes with shared memory, nodes connected by a network]
  • 44. Shared Memory vs. Shared Nothing
    ■  Organization of parallel processing hardware as …
      □  Shared memory system
        ◊  Tasks can directly access a common address space
        ◊  Implemented as memory hierarchy with different cache levels
      □  Shared nothing system
        ◊  Tasks can only access local memory
        ◊  Global coordination of parallel execution by explicit communication (e.g. messaging) between tasks
      □  Hybrid architectures possible in practice
        ◊  Cluster of shared memory systems
        ◊  Accelerator hardware in a shared memory system
          ●  Dedicated local memory on the accelerator
          ●  Example: SIMD GPU hardware in SMP computer system
  • 45. Shared Memory vs. Shared Nothing
    ■  Pfister: “shared memory” vs. “distributed memory”
    ■  Foster: “multiprocessor” vs. “multicomputer”
    ■  Tanenbaum: “shared memory” vs. “private memory”
    [Diagram: left, tasks on processing elements accessing one shared memory; right, tasks with private data exchanging messages]
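The contrast becomes concrete with message passing. A minimal MPI sketch (MPI is covered in week 5; this example is added here only for illustration, run with e.g. mpicc msg.c && mpirun -np 2 ./a.out, where the file name is an assumption): process 0 sends a value from its private memory to process 1, since there is no shared address space:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, data = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        data = 42;                  /* lives in the local memory of process 0 */
        MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("process 1 received %d\n", data);
    }
    MPI_Finalize();
    return 0;
}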
  • 46. Shared Memory
    ■  Processing elements act independently
    ■  Use the same global address space
    ■  Changes are visible for all processing elements
    ■  Uniform memory access (UMA) system
      □  Equal access time for all PEs to all memory locations
      □  Default approach for SMP systems of the past
    ■  Non-uniform memory access (NUMA) system
      □  Delay on memory access according to the accessed region
      □  Typically due to core / processor interconnect technology
    ■  Cache-coherent NUMA (CC-NUMA) system
      □  NUMA system that keeps all caches consistent
      □  Transparent hardware mechanisms
      □  Became standard approach with recent x86 chips
  • 47. UMA Example
    ■  Two dual-core processor chips in an SMP system
    ■  Level 1 cache (fast, small), Level 2 cache (slower, larger)
    ■  Hardware manages cache coherency among all cores
    [Diagram: two sockets, each with two cores and private L1 caches sharing an L2 cache, connected via a system bus and a chipset / memory controller to RAM]
  • 48. NUMA Example
    ■  Eight cores on 2 sockets in an SMP system
    ■  Memory controllers + chip interconnect realize a single memory address space for the software
    [Diagram: two sockets, each with four cores (private L1/L2 caches, shared L3 cache) and an integrated memory controller with local RAM, linked by a chip interconnect]
  • 49. NUMA Example: 4-way Intel Nehalem SMP
    [Diagram: four quad-core sockets, each with its own L3 cache, memory controller, local memory and I/O, connected by QPI links]
  • 50. Shared Nothing
    ■  Processing elements no longer share a common global memory
    ■  Easy scale-out by adding machines to the messaging network
    ■  Cluster computing: Combine machines with cheap interconnect
      □  Compute cluster: Speedup for an application
        ◊  Batch processing, data parallelism
      □  Load-balancing cluster: Better throughput for some service
      □  High Availability (HA) cluster: Fault tolerance
    ■  Cluster to the extreme
      □  High Performance Computing (HPC)
      □  Massively Parallel Processing (MPP) hardware
      □  TOP500 list of the fastest supercomputers
  • 52. Shared Nothing Example
    [Diagram: several machines, each a socket with cores, caches, memory controller, local RAM and a network interface, connected by an interconnection network]
  • 53. Hybrid Example
    [Diagram: several machines, each with two sockets (cores, caches, memory controller, local RAM) coupled by a chip interconnect; the machines are connected by an interconnection network]
  • 54. Example: Cluster of Nehalem SMPs
    [Diagram: multiple Nehalem SMP machines connected by a network]
  • 55. The Parallel Programming Problem
    ■  Execution environment has a particular type (SIMD, MIMD, UMA, NUMA, …)
    ■  Execution environment may be configurable (number of resources)
    ■  Parallel application must be mapped to available resources
    [Diagram: a parallel application has to match a flexible, configurable execution environment]
  • 56. Parallel Programming Concepts
    OpenHPI Course
    Week 1: Terminology and fundamental concepts
    Unit 1.6: Speedup and Scaleup
    Dr. Peter Tröger + Teaching Team
  • 57. Which One Is Faster?
    ■  Usage scenario
      □  Transporting a fridge
    ■  Usage environment
      □  Driving through a forest
    ■  Perception of performance
      □  Maximum speed
      □  Average speed
      □  Acceleration
    ■  We need some kind of application-specific benchmark
  • 58. Parallelism for …
    ■  Speedup – compute faster
    ■  Throughput – compute more in the same time
    ■  Scalability – compute faster / more with additional resources
    ■  …
    [Diagram: scaling up adds processing elements to one shared-memory machine; scaling out adds more machines]
  • 59. Metrics
    ■  Parallelization metrics are application-dependent, but follow a common set of concepts
      □  Speedup: Adding more resources leads to less time for solving the same problem.
      □  Linear speedup: n times more resources → n times speedup
      □  Scaleup: Adding more resources solves a larger version of the same problem in the same time.
      □  Linear scaleup: n times more resources → n times larger problem solvable
    ■  The most important goal depends on the application
      □  Throughput demands scalability of the software
      □  Response time demands speedup of the processing
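Both metrics reduce to simple ratios. A tiny added sketch (illustration only) that uses the example numbers from the next slide (T1 = 12, T3 = 4, N = 3) as assumed measurements:

#include <stdio.h>

int main(void) {
    double t1 = 12.0, tn = 4.0;      /* same problem on 1 vs. N = 3 PEs */
    double n = 3.0;
    double speedup    = t1 / tn;     /* 3.0: linear speedup here */
    double efficiency = speedup / n; /* 1.0 means perfect resource usage */
    printf("speedup %.2f, efficiency %.2f\n", speedup, efficiency);
    return 0;
}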
  • 60. Speedup
    ■  Idealized assumptions
      □  All tasks are equal sized
      □  All code parts can run in parallel
    [Diagram: 12 equal tasks executed serially vs. on three processing elements]
    ■  Tasks: v=12, Processing elements: N=1, Time needed: T1=12
    ■  Tasks: v=12, Processing elements: N=3, Time needed: T3=4
    ■  (Linear) Speedup: T1/T3 = 12/4 = 3
  • 61. Speedup with Load Imbalance
    ■  Assumptions
      □  Tasks have different size, best-possible speedup depends on optimized resource usage
      □  All code parts can run in parallel
    [Diagram: 12 unequal tasks executed serially vs. on three processing elements]
    ■  Tasks: v=12, Processing elements: N=1, Time needed: T1=16
    ■  Tasks: v=12, Processing elements: N=3, Time needed: T3=6
    ■  Speedup: T1/T3 = 16/6 ≈ 2.67
  • 62. Speedup with Serial Parts
    ■  Each application has inherently non-parallelizable serial parts
      □  Algorithmic limitations
      □  Shared resources acting as bottleneck
      □  Overhead for program start
      □  Communication overhead in shared-nothing systems
    [Diagram: execution alternates between serial phases (tSER1, tSER2, tSER3) and parallelizable phases (tPAR1, tPAR2)]
  • 63. Amdahl’s Law
    ■  Gene Amdahl. “Validity of the single processor approach to achieving large scale computing capabilities”. AFIPS 1967
      □  Serial parts: TSER = tSER1 + tSER2 + tSER3 + …
      □  Parallelizable parts: TPAR = tPAR1 + tPAR2 + tPAR3 + …
      □  Execution time with one processing element: T1 = TSER + TPAR
      □  Execution time with N parallel processing elements: TN >= TSER + TPAR / N
        ◊  Equal only on perfect parallelization, e.g. no load imbalance
      □  Amdahl’s Law for maximum speedup with N processing elements:
        S = T1 / TN = (TSER + TPAR) / (TSER + TPAR / N)
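Plugging numbers into the formula shows how quickly the serial fraction dominates. An added C sketch (illustration only), assuming T1 = 1 with a 10% serial and 90% parallelizable part:

#include <stdio.h>

/* Maximum speedup per Amdahl's law for serial time t_ser,
 * parallelizable time t_par and N PEs, assuming perfect load balance. */
static double amdahl(double t_ser, double t_par, double n) {
    return (t_ser + t_par) / (t_ser + t_par / n);
}

int main(void) {
    double ns[] = {2, 4, 16, 256, 65536};
    for (int i = 0; i < 5; i++)
        printf("N = %6.0f -> speedup %.2f\n", ns[i], amdahl(0.1, 0.9, ns[i]));
    /* The values approach but never exceed 1 / 0.1 = 10. */
    return 0;
}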
  • 64. Amdahl’s Law
    [Figure: speedup curves over the number of processing elements]
  • 65. Amdahl’s Law
    ■  Speedup through parallelism is hard to achieve
    ■  For unlimited resources, speedup is bound by the serial parts:
      □  S(N→∞) = T1 / T(N→∞); assuming T1 = 1: S(N→∞) = 1 / TSER
    ■  Parallelization problem relates to all system layers
      □  Hardware offers some degree of parallel execution
      □  Speedup gained is bound by serial parts:
        ◊  Limitations of hardware components
        ◊  Necessary serial activities in the operating system, virtual runtime system, middleware and the application
        ◊  Overhead for the parallelization itself
  • 66. Amdahl’s Law
    ■  “Everyone knows Amdahl’s law, but quickly forgets it.” [Thomas Puzak, IBM]
    ■  90% parallelizable code leads to not more than 10x speedup
      □  Regardless of the number of processing elements
    ■  Parallelism is only useful …
      □  … for small numbers of processing elements
      □  … for highly parallelizable code
    ■  What’s the sense in big parallel / distributed hardware setups?
    ■  Relevant assumptions
      □  Put the same problem on different hardware
      □  Assumption of fixed problem size
      □  Only consideration of execution time for one problem
  • 67. Gustafson-Barsis’ Law (1988)
    ■  Gustafson and Barsis: People are typically not interested in the shortest execution time
      □  Rather solve a bigger problem in reasonable time
    ■  Problem size could then scale with the number of processors
      □  Typical in simulation and farmer / worker problems
      □  Leads to larger parallel fraction with increasing N
      □  Serial part is usually fixed or grows slower
    ■  Maximum scaled speedup by N processors:
      S = (TSER + N · TPAR) / (TSER + TPAR)
    ■  Linear speedup now becomes possible
    ■  Software needs to ensure that serial parts remain constant
    ■  Other models exist (e.g. Work-Span model, Karp-Flatt metric)
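For contrast with the Amdahl sketch above, here is the same assumed 10% / 90% split evaluated with the scaled-speedup formula (added illustration only); the scaled speedup keeps growing almost linearly with N because the parallel work grows with the problem size:

#include <stdio.h>

/* Scaled speedup per Gustafson-Barsis: the serial part stays fixed
 * while the parallel work grows with N. */
static double gustafson(double t_ser, double t_par, double n) {
    return (t_ser + n * t_par) / (t_ser + t_par);
}

int main(void) {
    double ns[] = {2, 4, 16, 256};
    for (int i = 0; i < 4; i++)
        printf("N = %4.0f -> scaled speedup %.2f\n",
               ns[i], gustafson(0.1, 0.9, ns[i]));
    /* Roughly 0.9 * N + 0.1: nearly linear, unlike the Amdahl case. */
    return 0;
}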
  • 68. Summary: Week 1
    ■  Moore’s Law and the Power Wall
      □  Processing element speed no longer increases
    ■  ILP Wall and Memory Wall
      □  Memory access is not fast enough for modern hardware
    ■  Parallel Hardware Classification
      □  From ILP to SMP, SIMD vs. MIMD
    ■  Memory Architectures
      □  UMA vs. NUMA
    ■  Speedup and Scaleup
      □  Amdahl’s Law and Gustafson’s Law
    Since we need parallelism for speedup, how can we express it in software?