1. Parallel Programming Concepts
OpenHPI Course
Week 6 : Patterns and Best Practices
Unit 6.1: Parallel Programming Patterns
Dr. Peter Tröger + Teaching Team
2. Summary: Week 5
■ “Shared nothing” systems provide very good scalability
□ Adding new processing elements not limited by “walls”
□ Different options for interconnect technology
■ Task granularity is essential
□ Surface-to-volume effect
□ Task mapping problem
■ De-facto standard is MPI programming
■ High level abstractions with
□ Channels
□ Actors
□ MapReduce
„What steps / strategy would you apply to parallelize a given compute-intense program?“
3. The Parallel Programming Problem
[Diagram: matching a parallel application (flexible) to an execution environment (type, configuration)]
4. Parallelization and Design Patterns
■ Parallel programming relies on experience
□ Identification of concurrency
□ Identification of feasible algorithmic structures
□ If done wrong, performance / correctness may suffer
■ Rule of thumb: Somebody else is smarter than you !!
■ Design Pattern
□ Best practices, formulated as a template
□ Focus on general applicability to common problems
□ Well-known in object-oriented programming (“gang of four”)
■ Parallel design patterns in literature
□ Structured parallelization methodologies (== pattern)
□ Algorithmic building blocks commonly found (== pattern)
5. Patterns for Parallel Programming
[Mattson et al.]
■ Phases in creating a parallel program
□ Finding Concurrency: Identify and
analyze exploitable concurrency
□ Algorithm Structure: Structure
the algorithm to take advantage
of potential concurrency
□ Supporting Structures: Define
program structures and data
structures needed for the code
□ Implementation Mechanisms:
Threads, processes, messages, …
■ Each phase is a design space
6. Finding Concurrency Design Space
■ Identify and analyze exploitable concurrency
■ Example: Data Decomposition Pattern
□ Context: Computation is organized around large data
manipulation, similar operations on different data parts
□ Solution: Array-based data access (row, block),
recursive data structure traversal
■ Example: Group Tasks Pattern
□ Context: Tasks share temporal constraints (e.g. intermediate
data), work on shared data structure
□ Solution: Apply ordering constraints to groups of tasks, put
truly independent tasks in one group for better scheduling
7. Algorithm Structure Design Space
■ Structure the algorithm
■ Consider how the identified concurrency is organized
□ Organize algorithm by tasks
◊ Tasks are embarrassingly parallel, or organized linearly
-> Task Parallelism
◊ Tasks organized by recursive procedure
-> Divide and Conquer
□ Organize algorithm by data dependencies
◊ Linear data dependencies -> Geometric Decomposition
◊ Recursive data dependencies -> Recursive Data
□ Organize algorithm by application data flow
◊ Regular data flow for computation -> Pipeline
◊ Irregular data flow -> Event-Based Coordination
8. Example: Parallelize Bubble Sort
■ Bubble sort
□ Compare pair-wise and swap,
if in wrong order
■ Finding concurrency demands data
dependency consideration
□ Compare-exchange approach
needs some operation order
□ Algorithm idea implies hidden
data dependency
□ Idea: Parallelize serial rounds
■ Odd-even sort –
Compare [odd|even]-indexed pairs
and swap them if they are in the wrong order
□ Apply task parallelism pattern
[Illustration: successive compare-exchange steps on the array 1 24 18 12 77, one swap per step, until it is sorted]
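A minimal sketch (ours, not from the deck, assuming OpenMP) of odd-even sort as an instance of the task parallelism pattern: within each phase all compare-exchange operations touch disjoint pairs, so the inner loop parallelizes directly.

#include <omp.h>

// Odd-even transposition sort: n serial phases; within one phase,
// all compare-exchange operations work on disjoint pairs.
void odd_even_sort(int *a, int n) {
    for (int phase = 0; phase < n; phase++) {
        int start = phase % 2;   // even phase: (0,1),(2,3)...; odd: (1,2),(3,4)...
        #pragma omp parallel for
        for (int i = start; i < n - 1; i += 2) {
            if (a[i] > a[i + 1]) {               // compare-exchange
                int tmp = a[i]; a[i] = a[i + 1]; a[i + 1] = tmp;
            }
        }
    }
}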
10. Supporting Structures Design Space
■ Software structures that support the expression of parallelism
■ Program structuring patterns - Single Program Multiple Data
(SPMD), master / worker, loop parallelism, fork / join
■ Data structuring patterns - Shared data, shared queue,
distributed array
□ Example: Shared data pattern
◊ Define shared abstract data type with concurrency control
(read only, read / write, independent sub sets, …)
◊ Choose appropriate synchronization construct
■ Supporting structures map to algorithm structure
□ Example: SPMD works well with geometric decomposition
11. Patterns for Parallel Programming
Design Space → Parallelization Patterns
1. Finding Concurrency: Task Decomposition, Data Decomposition,
Group Tasks, Order Tasks, Data Sharing, Design Evaluation
2. Algorithm Structure: Task Parallelism, Divide and Conquer,
Geometric Decomposition, Recursive Data, Pipeline,
Event-Based Coordination
3. Supporting Structures: SPMD, Master/Worker, Loop Parallelism,
Fork/Join, Shared Data, Shared Queue, Distributed Array
4. Implementation Mechanisms: Thread & Process Creation and
Destruction, Memory Synchronization, Fences, Barriers,
Mutual Exclusion, Message Passing, Collective Communication
12. Our Pattern Language (OPL)
■ Extended version of the Mattson et al. proposals
■ http://parlab.eecs.berkeley.edu/wiki/patterns/patterns
Structural Patterns
(map/reduce, ...)
Computational Patterns
(Monte Carlo, ...)
Algorithm Strategy Patterns
(Task / data parallelism, pipelining, decomposition, ...)
Implementation Strategy Patterns
(SPMD, fork/join, Actors, shared queue, BSP, ...)
Concurrent Execution Patterns
(SIMD, MIMD, task graph, message passing, mutex, ...)
13. Our Pattern Language (OPL)
■ Structural patterns
□ Describe overall computational goal of the application
□ “Boxes and arrows”
■ Computational patterns
□ Classes of computations (Berkeley dwarves)
□ “Computations occurring in the boxes”
■ Algorithm strategy patterns
□ High-level strategies to exploit concurrency and parallelism
■ Implementation strategy patterns
□ Structures realized in source code
□ Program organization and data structures
■ Concurrent execution patterns
□ Approaches to support the execution of parallel algorithms
□ Strategies that advance a program
□ Basic building blocks for coordination of concurrent tasks
15. Example: Discrete Event Pattern
■ Name: Discrete Event Pattern
■ Problem: Suppose a computational pattern can be decomposed
into groups of semi-independent tasks interacting in an
irregular fashion. The interaction is determined by the flow of
data between them which implies ordering constraints between
the tasks. How can these tasks and their interaction be
implemented so they can execute concurrently?
16. Example: Discrete Event Pattern
■ Solution: A good solution is based on expressing the data flow using
abstractions called events, with each event having a task that
generates it and a task that processes it. Because an event must be
generated before it can be processed, events also define ordering
constraints between the tasks. Computation within each task consists
of processing events.
initialize
while (not done) {
  receive event
  process event
  send events
}
finalize
17. Patterns for Efficient Computation
[McCool et al.]
■ Nesting Patterns
■ Structured Serial Control Flow Patterns
(Selection, Iteration, Recursion, …)
■ Parallel Control Patterns
(Fork-Join, Stencil, Reduction, Scan, …)
■ Serial Data Management Patterns
(Closures, Objects, …)
■ Parallel Data Management Patterns
(Pack, Pipeline, Decomposition,
Gather, Scatter, …)
■ Other Parallel Patterns (Futures, Speculative Selection, Workpile,
Search, Segmentation, Category Reduction, …)
■ Non-Deterministic Patterns (Branch and Bound, Transactions, …)
■ Programming Model Support
19. Designing Parallel Algorithms [Foster]
■ Map workload problem on an execution environment
□ Concurrency & locality for speedup, scalability
■ Four distinct stages of a methodological approach
■ A) Search for concurrency and scalability
□ Partitioning:
Decompose computation and data into small tasks
□ Communication:
Define necessary coordination of task execution
■ B) Search for locality and performance
□ Agglomeration:
Consider performance and implementation costs
□ Mapping:
Maximize processor utilization, minimize communication
20. Partitioning
■ Expose opportunities for parallel execution through
fine-grained decomposition
■ Good partition keeps computation and data together
□ Data partitioning leads to data parallelism
□ Computation partitioning leads to task parallelism
□ Complementary approaches, can lead to different algorithms
□ Reveal hidden structures of the algorithm that have potential
□ Investigate complementary views on the problem
■ Avoid replication of either computation or data,
can be revised later to reduce communication overhead
■ Activity results in multiple candidate solutions
21. Partitioning - Decomposition Types
■ Domain Decomposition
□ Define small data fragments
□ Specify computation for them
□ Different phases of computation
on the same data are handled separately
□ Rule of thumb:
First focus on large, or frequently used, data structures
■ Functional Decomposition
□ Split up computation into disjoint
tasks, ignore the data accessed
for the moment
□ With significant data overlap,
domain decomposition is more
appropriate
[Foster]
22. Partitioning - Checklist
■ Checklist for resulting partitioning scheme
□ Order of magnitude more tasks than processors ?
◊ Keeps flexibility for next steps
□ Avoidance of redundant computation and storage needs ?
◊ Scalability for large problem sizes
□ Tasks of comparable size ?
◊ Goal to allocate equal work to processors
□ Does number of tasks scale with the problem size ?
◊ Algorithm should be able to solve larger problems with
more given resources
■ Identify bad partitioning by estimating performance behavior
■ If necessary, re-formulate the partitioning (backtracking)
□ May even happen in later steps
23. Communication
■ Specify links between data consumers and data producers
■ Specify kind and number of messages on these links
■ Domain decomposition problems might have tricky communication
infrastructures, due to data dependencies
■ Communication in functional decomposition problems can easily
be modeled from the data flow between the tasks
■ Categorization of communication patterns
□ Local communication (few neighbors) vs.
global communication
□ Structured communication (e.g. tree) vs.
unstructured communication
□ Static vs. dynamic communication structure
□ Synchronous vs. asynchronous communication
24. Communication - Hints
■ Distribute computation and communication,
don't centralize the algorithm
□ Bad example: Central manager for parallel summation
■ Unstructured communication is hard to agglomerate,
better avoid it
■ Checklist for communication design
□ Do all tasks perform the same amount of communication ?
□ Does each task perform only local communication ?
□ Can communication happen concurrently ?
□ Can computation happen concurrently ?
■ Solve issues by distributing or replicating communication hot spots
25. Communication - Ghost Cells
■ Domain decomposition might lead to chunks that
demand data from each other
■ Solution 1: Copy the necessary portion of data
('ghost cells')
□ If no synchronization is needed after update
□ Data amount and frequency of update
influences resulting overhead and efficiency
□ Additional memory consumption
■ Solution 2: Access relevant data 'remotely'
□ Delays thread coordination until the data is
really needed
□ Correctness ('old' data vs. 'new' data) must be
considered as parallel progress is made
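A minimal sketch (ours, assuming MPI; the function name and CHUNK value are our own) of solution 1 for a 1-D domain decomposition: every rank stores two extra ghost cells and refreshes them from its neighbors before each update step.

#include <mpi.h>
#define CHUNK 1024   // cells owned per rank (arbitrary example value)

// u holds CHUNK owned cells in u[1..CHUNK]; u[0] and u[CHUNK+1] are ghosts.
void exchange_ghost_cells(double *u, int rank, int size) {
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;  // no-op at edges
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;
    // Send own left boundary leftwards, receive right neighbor's boundary.
    MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                 &u[CHUNK + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    // Send own right boundary rightwards, receive left neighbor's boundary.
    MPI_Sendrecv(&u[CHUNK], 1, MPI_DOUBLE, right, 1,
                 &u[0], 1, MPI_DOUBLE, left, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}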
26. Agglomeration
■ Algorithm so far is correct,
but not specialized for a particular execution environment
■ Check partitioning and communication decisions again
□ Agglomerate tasks for efficient execution on target hardware
□ Replicate data and / or computation for efficiency reasons
■ Resulting number of tasks can still be greater than the number of
processors
■ Three conflicting guiding decisions
□ Reduce communication costs by coarser granularity of
computation and communication
□ Preserve flexibility for later mapping by finer granularity
□ Reduce engineering costs for creating a parallel version
27. Agglomeration - Granularity
■ Since the execution environment is now considered,
the surface-to-volume effect becomes relevant
■ Late consideration keeps
core algorithm flexibility
[Figure: surface-to-volume effect; Foster]
28. Agglomeration - Checklist
■ Communication costs reduced by increasing locality ?
■ Does replicated computation outweigh its costs in all cases ?
■ Does data replication restrict the problem size ?
■ Do the larger tasks still have similar
computation / communication costs ?
■ Do the larger tasks still act with sufficient concurrency ?
■ Does the number of tasks still scale with the problem size ?
■ How much can the task count decrease, without disturbing load
balancing, scalability, or engineering costs ?
■ Is the transition to parallel code worth the engineering costs ?
29. Mapping
■ Historically only relevant for shared-nothing systems
□ Shared memory systems have the operating system scheduler
□ With NUMA, this may also become relevant in shared memory
systems of the future (e.g. PGAS task placement)
■ Minimize execution time by …
□ … placing concurrent tasks on different nodes
□ … placing tasks with heavy communication on the same node
■ Conflicting strategies, additionally restricted by resource limits
□ Task mapping problem
□ Known to be compute-intense (bin packing)
■ Set of sophisticated (dynamic) heuristics for load balancing
□ Preference for local algorithms that do not need global
scheduling state
31. Common Algorithmic Problems
■ Sources
□ Parallel programming courses
□ Parallel benchmarks
□ Development guides
□ Parallel programming books
□ User stories
32. A View From Berkeley
■ Technical report from
Berkeley (2006), defining
parallel computing research
questions and
recommendations
■ Definition of „13 dwarfs“
□ Common designs of
parallel computation and
communication
□ Allow better evaluation
of programming models
and architectures
Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, Katherine A. Yelick: The Landscape of Parallel Computing Research: A View from Berkeley. Electrical Engineering and Computer Sciences, University of California at Berkeley, Technical Report No. UCB/EECS-2006-183, December 18, 2006. http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
33. A View From Berkeley
■ Sources
□ EEMBC benchmarks (embedded systems), SPEC benchmarks
□ Database and text mining technology
□ Algorithms in computer game design and graphics
□ Machine learning algorithms
□ Original „7 Dwarfs“ for supercomputing [Colella]
■ „Anti-benchmark“
□ Dwarfs are not tied to code or language artifacts
□ Can serve as understandable vocabulary across disciplines
□ Allow feasibility study of hardware and software design
◊ No need to wait for applications being developed
34. 13 Dwarfs
■ Dwarfs currently defined
□ Dense Linear Algebra
□ Sparse Linear Algebra
□ Spectral Methods
□ N-Body Methods
□ Structured Grids
□ Unstructured Grids
□ MapReduce
□ Combinational Logic
□ Graph Traversal
□ Dynamic Programming
□ Backtrack and Branch-and-Bound
□ Graphical Models
□ Finite State Machines
■ One dwarf may be implemented based on another one
■ Increasing uptake in scientific publications
■ Several reference implementations for CPU / GPU
35. Dwarfs in Popular Applications
[Figure: heat map showing how hot or cold each dwarf is across popular application domains; Patterson]
36. Dense Linear Algebra
■ Classic vector and matrix operations on non-sparse data
(vector op vector, matrix op vector, matrix op matrix)
■ Data layout as contiguous array(s)
■ High degree of data dependencies
■ Computation on elements, rows, columns or matrix blocks
■ Issues with memory hierarchy, data distribution is critical
■ Demands overlapping of computation and communication
37. Sparse Linear Algebra
■ Operations on a sparse matrix (with lots of zeros)
■ Typically compressed data structures, integer operations,
only non-zero entries + indices
□ Dense blocks to exploit caches
■ Complex dependency structure
■ Scatter-gather vector operations
are often helpful
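A minimal sketch (ours) of the compressed-data-structure idea: sparse matrix-vector multiplication in CSR format, with gather accesses into x and an OpenMP-parallel outer loop over the independent rows.

// CSR (compressed sparse row): store only non-zero values, together
// with their column indices and per-row offsets.
typedef struct {
    int n;          // number of rows
    int *row_ptr;   // n+1 offsets into val / col
    int *col;       // column index of each non-zero entry
    double *val;    // non-zero values
} csr_matrix;

// y = A * x ; rows are independent, so the outer loop parallelizes.
void spmv(const csr_matrix *A, const double *x, double *y) {
    #pragma omp parallel for
    for (int i = 0; i < A->n; i++) {
        double sum = 0.0;
        for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++)
            sum += A->val[k] * x[A->col[k]];   // gather from x
        y[i] = sum;
    }
}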
38. N-Body Methods
■ Physics: Predicting individual motions of an object group
interacting gravitationally
□ Calculations on interactions between many discrete points
■ Hierarchical tree-based and mesh-based methods,
avoid computing all pair-wise interactions
■ Variations with particle-particle
methods (one point to all others)
■ Large number of independent
calculations in a time step,
followed by all-to-all
communication
■ Issues with load balancing and
missing fixed hierarchy
39. Structured Grid
■ Data as a regular multidimensional grid
□ Access is regular and statically determinable
■ Computation as sequence of grid updates
□ Points are updated concurrently using values
from a small neighborhood
■ Spatial locality to use long cache lines
■ Temporal locality to allow cache reuse
■ Parallel mapping with sub-grid per processor
□ Ghost cells, surface to volume ratio
■ Latency hiding
□ Increased number of ghost cells
□ Coarse-grained data exchange
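A minimal sketch (ours, assuming OpenMP; N is an arbitrary example value) of one structured-grid update: a Jacobi sweep where every interior point reads only its four neighbors from the old grid, so all point updates are independent.

#define N 512   // grid size (arbitrary example value)

// One Jacobi sweep: each interior point becomes the average of its
// four neighbors. All reads go to the old grid, all writes to the
// new grid, so every point update is independent.
void jacobi_sweep(double old[N][N], double new_[N][N]) {
    #pragma omp parallel for
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            new_[i][j] = 0.25 * (old[i - 1][j] + old[i + 1][j] +
                                 old[i][j - 1] + old[i][j + 1]);
}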
40. Unstructured Grid
■ Elements update neighbors in irregular
mesh/grid - static or dynamic structure
■ Problematic data distribution and access
requirements, indirection through tables
■ Modeling domain (e.g. physics)
□ Mesh represents surface or volume
□ Entities are points, edges, faces, ...
□ Applying pressure, temperature, …
□ Computations involve numerical
solutions or differential equations
□ Sequence of mesh updates
■ Massively data parallel, but irregularly
distributed data and communication
41. MapReduce
■ Originally called “Monte Carlo” in dwarf concept
□ Repeated independent execution of a function
(e.g. random number generation, map function)
□ Results aggregated at the end
□ Nearly no communication between tasks,
embarrassingly parallel
■ Examples: Monte Carlo, BOINC project, protein structures
[http://climatesanity.wordpress.com]
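A minimal sketch (ours, assuming OpenMP and the POSIX rand_r function; the simplistic per-sample seeding is for illustration only) of such an embarrassingly parallel computation, a Monte Carlo estimate of pi: every sample is independent, and only the final counter aggregation needs coordination.

#include <stdlib.h>

double monte_carlo_pi(long samples) {
    long hits = 0;
    #pragma omp parallel for reduction(+:hits)
    for (long i = 0; i < samples; i++) {
        unsigned int seed = (unsigned int)(i + 1);  // simple per-sample seed
        double x = rand_r(&seed) / (double)RAND_MAX;
        double y = rand_r(&seed) / (double)RAND_MAX;
        if (x * x + y * y <= 1.0) hits++;           // inside the quarter circle
    }
    return 4.0 * hits / (double)samples;            // aggregate at the end
}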
42. Backtrack / Branch-and-Bound
■ Global optimization problem in large search space
■ Divide and Conquer principle
□ Branching into subdivisions
□ Optimize execution by ruling out regions
■ Examples: Integer linear programming,
boolean satisfiability, combinatorial optimization,
traveling salesman, constraint programming, …
■ Heuristics to guide search to productive regions
■ Parallel checking of sub-regions
□ Demands invariants about the search space
□ Demands dynamic load balancing, load prediction is hard
■ Example:
Place N queens on a chessboard so that no two attack each other
(see the sketch after this slide)
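A compact sketch (ours, assuming OpenMP tasks) of the N-queens example: the first-row placements are independent sub-regions checked in parallel. A full branch-and-bound search would additionally share a bound to rule out regions; this sketch only counts solutions.

#include <stdio.h>
#define NQ 10   // board size (arbitrary example value)

// May queen (row, c) coexist with the queens placed in col[0..row-1]?
static int safe(const int col[], int row, int c) {
    for (int r = 0; r < row; r++)
        if (col[r] == c || col[r] - r == c - row || col[r] + r == c + row)
            return 0;   // same column or same diagonal
    return 1;
}

// Serial backtracking below the first row.
static long solve(int col[], int row) {
    if (row == NQ) return 1;
    long count = 0;
    for (int c = 0; c < NQ; c++)
        if (safe(col, row, c)) { col[row] = c; count += solve(col, row + 1); }
    return count;
}

int main(void) {
    long total = 0;
    #pragma omp parallel
    #pragma omp single
    for (int c = 0; c < NQ; c++) {
        #pragma omp task firstprivate(c) shared(total)
        {                          // each first-row placement is one sub-region
            int col[NQ];
            col[0] = c;
            long sub = solve(col, 1);
            #pragma omp atomic
            total += sub;
        }
    }                              // implicit barrier: all tasks finished
    printf("%d-queens solutions: %ld\n", NQ, total);
    return 0;
}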
44. [Figure; source: http://docs.jboss.org/drools]
45. Berkeley Dwarfs
■ Relevance of single dwarfs
widely differs
■ No widely accepted single
benchmark implementation
■ Computational dwarfs on
different layers,
implementations may be
based on each other
■ OpenDwarfs project
□ Optimized code for
different platforms
■ Parallel Dwarfs project
□ In C++, C#, F# for
Visual Studio
[Asanovic et al.: The Landscape of Parallel Computing Research: A View from Berkeley. Technical Report No. UCB/EECS-2006-183, December 18, 2006]
47. NUMA Impact Increases
[Diagram: four processors, each with four cores, a shared L3 cache, an integrated memory controller with attached local memory, and I/O, interconnected by QPI links; the cost of a memory access depends on which socket owns the data]
48. Innovation in Memory Technology
■ 3D NAND
■ Hybrid Memory Cube
□ Intel, Micron, …
□ 3D array of DDR-like
memory cells
□ Early samples
available, 160GB/s
□ Through-silicon via
(TSV) approach with
embedded controllers,
attached to CPU
■ RRAM / ReRAM
□ Non-volatile memory
[computerworld.com][extremetech.com]
49. Power Wall 2.0 = Dark Silicon
“Dark Silicon and the End
of Multicore Scaling”
by Hadi Esmaeilzadeh, Emily
Blem, Renée St. Amant,
Karthikeyan Sankaralingam,
Doug Burger
50. Hardware / Software Co-Design
■ Increasing number of cores by Moore's law
■ Power wall / dark silicon problem will become worse
□ In addition, battery-powered devices become more relevant
■ Idea: Use additional transistors for specialization
□ Design hardware for a software problem
□ Make it part of the processor („compile into hardware“)
□ More efficiency, less flexibility
□ Partially known from ILP SIMD support
□ Examples: Cryptography, regular expressions
■ Example: Cell processor (Playstation 3)
□ 64-bit Power core
□ 8 specialized co-processors
51. Software at Scale [Dongarra]
■ Effective utilization of many-core and hybrid hardware
□ Break fork-join parallelism
□ Dynamic data driven execution, consider block layout
□ Exploiting mixed precision (GPU vs. CPU, power consumption)
■ Aim for self-adapting software and auto-tuning support
□ Manual optimization is too hard
□ Let software optimize the software
■ Consider fault-tolerant software
□ With millions of cores, things break all the time
■ Focus on algorithm classes that reduce communication
□ Special problem in dense computation
□ Aim for asynchronous iterations
52. OpenMP 4.0
■ SIMD extensions
□ Portable primitives to describe SIMD parallelization
□ Loop vectorization with simd construct
□ Several arguments for guiding the compiler (e.g. alignment)
■ Targeting extensions
□ Thread with the OpenMP program executes on the host device
□ Implementation may support multiple target devices
□ Control off-loading of loops and code regions on such devices
■ New API for device data environment
□ OpenMP - managed data items can be moved to the device
□ New primitives for better cancellation support
□ User-defined reduction operations
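A minimal sketch (ours) of the two extension families; the loop body, alignment value, and function names are arbitrary examples, not from the deck.

// SIMD extension: vectorize a loop, guiding the compiler with alignment.
void saxpy(int n, float a, float *restrict x, float *restrict y) {
    #pragma omp simd aligned(x, y : 32)   // assumes 32-byte aligned buffers
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

// Targeting extension: off-load the loop to a device, with the data
// environment moving x and y there and copying y back.
void saxpy_offload(int n, float a, float *x, float *y) {
    #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
    #pragma omp parallel for simd
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}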
53. OpenACC
■ „OpenMP for accelerators“ (GPU, FPGAs, ...)
□ Partners: Cray, supercomputing centers, NVIDIA, PGI
□ Annotation in C, C++, and Fortran source code
□ OpenACC code can also be started on the accelerator
■ Features
□ Specification of data locality and asynchronous execution
□ Abstract specification of data movement, loop parallelization
□ Caching and synchronization support
□ Management of data movement by compiler and runtime
□ Implementations available, e.g. for Xeon Phi
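A minimal sketch (ours) of an OpenACC-annotated loop; the data clauses and loop body are arbitrary examples. The compiler and runtime derive the data movement and loop parallelization on the accelerator from the annotations.

// Vector addition off-loaded via OpenACC: copy a and b to the
// accelerator, run the loop there, copy c back.
void vector_add(int n, const float *a, const float *b, float *c) {
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}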
54. Autotuners
■ Optimize parallel code by generating many variants
□ Try many or all optimization switches
◊ Loop unrolling, utilization of processor registers, …
□ Rely on parallelization variations defined in the application
■ Automatically tested on target platform
■ Research shows promising results
□ Can be better than manually optimized code
□ Optimization can fit to multiple execution environments
□ Known examples for sparse and dense linear algebra libraries
◊ ATLAS (Automatically Tuned Linear Algebra Software)
55. Intel Math Kernel Library (MKL)
■ Intel library with heavily optimized functionality, for C & Fortran
□ Linear algebra
◊ Basic Linear Algebra Subprograms (BLAS) API
◊ Follows standards in high-performance computing
◊ Vector-vector, matrix-vector, matrix-matrix operations
□ Fast Fourier Transforms (FFT)
◊ Single precision, double precision, complex, real, ...
□ Vector math and statistics functions
◊ Random number generators and probability distributions
◊ Spline-based data fitting
■ High-level abstraction of functionality,
parallelization completely transparent for the developer
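A minimal sketch (ours, assuming the standard CBLAS interface exported by mkl.h) of the high-level abstraction: one library call replaces a hand-parallelized matrix multiplication, and MKL parallelizes it internally.

#include <mkl.h>   // provides the CBLAS interface

// C = 1.0 * A * B + 0.0 * C for row-major double matrices
// (A is m x k, B is k x n, C is m x n).
void matmul(int m, int n, int k,
            const double *A, const double *B, double *C) {
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k,
                1.0, A, k,    // alpha, A, leading dimension of A
                     B, n,    // B, leading dimension of B
                0.0, C, n);   // beta, C, leading dimension of C
}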
56. Future Trends
■ Active research on next-generation hardware
□ Driven by exa-scale efforts in supercomputing
□ Driven by combined power wall and memory wall
□ Driven by shift in computer markets (desktop -> mobile)
■ Impact on software development will get more visible
□ Hybrid computing is the future default
□ Heterogeneous mixture of CPU + specialized accelerators
□ Old assumptions are broken (flat memory, constant access
time, homogeneous processing elements)
□ Old programming models no longer match
□ Extending the existing programming paradigms seems to work
□ High-level specialized libraries get more relevance
59. Week 1:
The Free Lunch Is Over
■ Clock speed curve
flattened in 2003
□ Heat, power,
leakage
■ Speeding up the serial
instruction execution
through clock speed
improvements no
longer works
■ Additional issues
□ ILP wall
□ Memory wall
[Herb Sutter, 2009]
60. Three Ways Of Doing Anything Faster [Pfister]
■ Work harder (clock speed)
→ Power wall problem
→ Memory wall problem
■ Work smarter (optimization, caching)
→ ILP wall problem
→ Memory wall problem
■ Get help (parallelization)
□ More cores per single CPU
□ Software needs to exploit them in the right way
→ Memory wall problem
[Diagram: one problem split across the cores of a multi-core CPU]
61. Parallelism on Different Levels
[Diagram: programs are split into tasks, tasks run on processing elements (PEs), PEs share memory within a node, and nodes communicate over a network]
62. The Parallel Programming Problem
■ Execution environment has a particular type
(SIMD, MIMD, UMA, NUMA, …)
■ Execution environment maybe configurable (number of resources)
■ Parallel application must be mapped to available resources
[Diagram: matching a parallel application (flexible) to an execution environment (type, configuration)]
64. Gustafson-Barsis’ Law (1988)
■ Gustafson and Barsis: People are typically not interested in the
shortest execution time
□ Rather solve a bigger problem in reasonable time
■ Problem size could then scale with the number of processors
□ Typical in simulation and farmer / worker problems
□ Leads to larger parallel fraction with increasing N
□ Serial part is usually fixed or grows slower
■ Maximum scaled speedup by N processors:
S = (T_SER + N · T_PAR) / (T_SER + T_PAR)
■ Linear speedup now becomes possible
■ Software needs to ensure that serial parts remain constant
■ Other models exist (e.g. Work-Span model, Karp-Flatt metric)
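Worked example (numbers ours): with a 10% serial share (T_SER = 0.1, T_PAR = 0.9) and N = 100 processors, the scaled speedup is S = (0.1 + 100 · 0.9) / (0.1 + 0.9) = 90.1, whereas Amdahl's law for the same fixed-size problem would cap the speedup at roughly 9.2.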
65. Week 2:
Concurrency vs. Parallelism
■ Concurrency means dealing with several things at once
□ Programming concept for the developer
□ In shared-memory systems, implemented by time sharing
■ Parallelism means doing several things at once
□ Demands parallel hardware
■ Parallel programming is a misnomer
□ Concurrent programming aiming at parallel execution
■ Any parallel software is concurrent software
□ Note: Some researchers disagree, most practitioners agree
■ Concurrent software is not always parallel software
□ Many server applications achieve scalability
by optimizing concurrency only (web server)
[Diagram: parallelism as a proper subset of concurrency]
66. Parallelism [Mattson et al.]
■ Task
□ Parallel program breaks a problem into tasks
■ Execution unit
□ Representation of a concurrently running task (e.g. thread)
□ Tasks are mapped to execution units
■ Processing element (PE)
□ Hardware element running one execution unit
□ Depends on scenario - logical processor vs. core vs. machine
□ Execution units run simultaneously on processing elements,
controlled by some scheduler
■ Synchronization - Mechanism to order activities of parallel tasks
■ Race condition - Program result depends on the scheduling order
67. Concurrency Issues
■ Mutual Exclusion
□ The requirement that when one concurrent task is using a
shared resource, no other shall be allowed to do that
■ Deadlock
□ Two or more concurrent tasks are unable to proceed
□ Each is waiting for one of the others to do something
■ Starvation
□ A runnable task is overlooked indefinitely
□ Although it is able to proceed, it is never chosen to run
■ Livelock
□ Two or more concurrent tasks continuously change their states
in response to changes in the other activities
□ No global progress for the application
68. Week 3:
Parallel Programming for Shared Memory
■ Different programming models for
concurrency with shared memory
■ Processes and threads mapped to
processing elements (cores)
■ Task model supports more
fine-grained parallelization than
with native threads
[Diagram: concurrent processes with explicitly shared memory between their address spaces, concurrent threads sharing the memory of one process, and concurrent tasks multiplexed onto a main thread and worker threads]
70. OpenMP
■ Programming with the fork-join model
□ Master thread forks into declared tasks
□ Runtime environment may run them in parallel,
based on dynamic mapping to threads from a pool
□ Worker task barrier before finalization (join)
[Figure: fork-join execution of a master thread and a task pool; Wikipedia]
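A minimal sketch (ours) of this fork-join model with OpenMP tasks; the task body is an arbitrary example.

#include <stdio.h>

int main(void) {
    #pragma omp parallel       // fork: thread pool starts
    #pragma omp single         // one thread declares the tasks
    {
        for (int i = 0; i < 8; i++) {
            #pragma omp task firstprivate(i)
            printf("task %d running\n", i);  // mapped to any pool thread
        }
        #pragma omp taskwait   // worker task barrier
    }                          // join: implicit barrier ends the region
    return 0;
}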
72. Partitioned Global Address Space
■ APGAS in X10: Places and Tasks
□ Task parallelism: async S, finish S
□ Concurrency control within a place: when(c) S, atomic S
□ Place-shifting operations: at(p) S
□ Distributed heap: GlobalRef[T]
■ Parallel tasks, each operating in one place of the PGAS
□ Direct variable access only in local place
■ Implementation strategy is flexible
□ One operating system process per place, manages thread pool
□ Work-stealing scheduler
[Diagram: places 0 … N, each hosting activities and a local heap, linked by global references; IBM]
73. Week 4:
Cheap Performance with Accelerators
■ Performance
■ Energy / Price
□ Cheap to buy and to maintain
□ GFLOPS per watt: Fermi 1.5 / Kepler 5 / Maxwell 15 (2014)
[Chart: execution time in milliseconds vs. problem size (number of Sudoku places) for an Intel E8500 CPU, an AMD R800 GPU, and an NVIDIA GT200 GPU; lower means faster. GPU: Graphics Processing Unit (CPU of a graphics card)]
74. CPU vs. GPU Architecture
■ CPU („multi-core“): a few heavyweight threads, branch prediction
■ GPU („many-core“): 1000+ light-weight threads, memory latency hiding
[Diagram: CPU with large control logic and cache serving a few PEs vs. GPU with many small PEs and its own DRAM]
75. OpenCL Platform Model
□ OpenCL exposes CPUs, GPUs, and other Accelerators as “devices”
□ Each “device” contains one or more “compute units”, i.e. cores, SMs,...
□ Each “compute unit” contains one or more SIMD “processing elements”
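A minimal sketch (ours) that walks this platform model with the standard OpenCL C API: platform → devices → compute units.

#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);          // first available platform

    cl_device_id devices[8];
    cl_uint num_devices;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 8, devices, &num_devices);

    for (cl_uint i = 0; i < num_devices; i++) {
        char name[128];
        cl_uint units;
        clGetDeviceInfo(devices[i], CL_DEVICE_NAME, sizeof(name), name, NULL);
        clGetDeviceInfo(devices[i], CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(units), &units, NULL);
        printf("device %s: %u compute units\n", name, units);
    }
    return 0;
}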
76. Best Practices for Performance Tuning
■ Algorithm Design: Asynchronous, Recompute, Simple
■ Memory Transfer: Chaining, Overlap Transfer & Compute
■ Control Flow: Divergent Branching, Predication
■ Memory Types: Local Memory as Cache, rare resource
■ Memory Access: Coalescing, Bank Conflicts
■ Sizing: Execution Size, Evaluation
■ Instructions: Shifting, Fused Multiply, Vector Types
■ Precision: Native Math Functions, Build Options
77. Week 5:
Shared Nothing
■ Clusters: Stand-alone machines connected by a local network
□ Cost-effective technique for a large-scale parallel computer
□ Users are builders, have control over their system
□ Synchronization much slower than in shared memory
□ Task granularity becomes an issue
[Diagram: two processing elements, each with its own local memory and task, coordinating solely by exchanging messages]
78. Shared Nothing
■ Supercomputers / Massively Parallel Processing (MPP) systems
□ (Hierarchical) cluster with a lot of processors
□ Still standard hardware, but specialized setup
□ High-performance interconnection network
□ For massive data-parallel applications, mostly simulations
(weapons, climate, earthquakes, airplanes, car crashes, ...)
■ Examples (Nov 2013)
□ BlueGene/Q (Sequoia), 1.5 million cores, 1.5 PB memory, 17.1 PFlops
□ Tianhe-2, 3.1 million cores,
1 PB memory, 17,808 kW power,
33.86 PFlops (quadrillions of
calculations per second)
■ Annual ranking with the TOP500 list
(www.top500.org)
79. Surface-To-Volume Effect
[Figure: nicerweb.com]
■ Fine-grained decomposition for
using all processing elements ?
■ Coarse-grained decomposition
to reduce communication
overhead ?
■ A tradeoff question !
80. Message Passing
■ Parallel programming paradigm for “shared nothing” environments
□ Implementations for shared memory available,
but typically not the best approach
■ Users submit their message passing program & data as job
■ Cluster management system creates program instances
[Diagram: a submission host hands a job (application + data) to the cluster management software, which starts program instances 0..3 on the execution hosts]
81. Single Program Multiple Data (SPMD)
[Diagram: one SPMD program and its input data are started as five identical instances 0..4; each instance behaves differently only through its rank]

// … (determine rank and comm_size) …
int token;
if (rank != 0) {
    // Receive from your 'left' neighbor if you are not rank 0
    MPI_Recv(&token, 1, MPI_INT, rank - 1, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Process %d received token %d from process %d\n",
           rank, token, rank - 1);
} else {
    // Set the token's value if you are rank 0
    token = -1;
}
// Send your local token value to your 'right' neighbor
MPI_Send(&token, 1, MPI_INT, (rank + 1) % comm_size,
         0, MPI_COMM_WORLD);
// Now rank 0 can receive from the last rank.
if (rank == 0) {
    MPI_Recv(&token, 1, MPI_INT, comm_size - 1, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Process %d received token %d from process %d\n",
           rank, token, comm_size - 1);
}
82. Actor Model
■ Carl Hewitt, Peter Bishop and Richard Steiger. A Universal Modular
Actor Formalism for Artificial Intelligence. IJCAI, 1973.
□ Mathematical model for concurrent computation
□ Actor as computational primitive
◊ Local decisions, concurrently sends / receives messages
◊ Has a mailbox for incoming messages
◊ Concurrently creates more actors
□ Asynchronous one-way message sending
□ Changing topology allowed, typically no order guarantees
◊ Recipient is identified by mailing address
◊ Actors can send their own identity to other actors
■ Available as programming language extension or library
in many environments
83. Week 6:
Patterns for Parallel Programming
■ Phases in creating a parallel program
□ Finding Concurrency: Identify and
analyze exploitable concurrency
□ Algorithm Structure: Structure the
algorithm to take advantage of
potential concurrency
□ Supporting Structures: Define
program structures and data
structures needed for the code
□ Implementation Mechanisms:
Threads, processes, messages, …
■ Each phase is a design space
84. Popular Applications vs. Dwarfs
[Figure: heat map of dwarf usage in popular applications, hot → cold]
85. Designing Parallel Algorithms [Foster]
■ Map workload problem on an execution environment
□ Concurrency & locality for speedup, scalability
■ Four distinct stages of a methodological approach
■ A) Search for concurrency and scalability
□ Partitioning –
Decompose computation and data into small tasks
□ Communication –
Define necessary coordination of task execution
■ B) Search for locality and performance
□ Agglomeration –
Consider performance and implementation costs
□ Mapping –
Maximize processor utilization, minimize communication
87. The End
■ Parallel programming is exciting again!
□ From massively parallel hardware to complex software
□ From abstract design patterns to specific languages
□ From deadlock freedom to extreme performance tuning
■ Some general concepts are established
□ Take this course as starting point
□ Learn from the high-performance computing community
■ Thanks for your participation
□ Lively discussion, directly and in the forums, we learned a lot
□ Sorry for technical flaws and content errors
■ Please use the feedback link
88. Lecturer Contact
■ Operating Systems and Middleware Group at HPI
http://www.dcl.hpi.uni-potsdam.de
■ Dr. Peter Tröger
http://www.troeger.eu
http://twitter.com/ptroeger
http://www.linkedin.com/in/ptroeger
peter.troeger@hpi.uni-potsdam.de
■ M.Sc. Frank Feinbube
http://www.feinbube.de
http://www.linkedin.com/in/feinbube
frank.feinbube@hpi.uni-potsdam.de