6. A single process achieves parallelism by
creating separate threads for subtasks
A thread shares context with its parent
process
On a single processor, parallelism is an
illusion created by interleaving the threads
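A minimal sketch of the point above, using Python's standard `threading` module: two threads handle separate subtasks and write into one dictionary that belongs to the parent process. The names `subtask`, `results`, and the sample data are made up for illustration.

```python
import threading

# Shared state: lives in the process, visible to every thread it creates.
results = {}
lock = threading.Lock()

def subtask(name, values):
    """Sum one slice of the data and store it in the shared dictionary."""
    total = sum(values)
    with lock:  # guard the shared dictionary against concurrent writes
        results[name] = total

data = list(range(100))
threads = [
    threading.Thread(target=subtask, args=("first_half", data[:50])),
    threading.Thread(target=subtask, args=("second_half", data[50:])),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results["first_half"] + results["second_half"])  # 4950
```

No data is copied or sent anywhere: both threads see the same `results` object, which is what sharing context with the parent process means in practice.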
7. Each process has its own context
Overheads for creation, communication and
context switching are higher
Processes allow true concurrent computing
even on separate systems
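A small sketch of separate process contexts, assuming only the standard `subprocess` module. The child program text and the variable name `counter` are hypothetical; the child's `counter` has nothing to do with the parent's, and the result has to come back through a pipe.

```python
import subprocess
import sys

counter = 0  # lives in the parent's context only

# Hypothetical child program: it has its own `counter`, unrelated to ours.
child_source = "counter = 0\ncounter += 41\nprint(counter + 1)"

# Starting a process and passing its result back through a pipe is the
# creation and communication overhead mentioned above.
completed = subprocess.run(
    [sys.executable, "-c", child_source],
    capture_output=True, text=True, check=True,
)
child_result = int(completed.stdout.strip())

print(child_result)  # 42, computed in the child's own context
print(counter)       # 0, the parent's variable is untouched
```

The same pattern works when the child runs on another machine, at the price of an even slower communication channel.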
8. Threads are generally faster
Very dependent on hardware and
operating system
Difficult to generate metrics
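One rough way to collect a metric on your own hardware and operating system, as a sketch only. The helper names are made up, and the process figure includes Python interpreter startup, so it overstates the bare cost of creating a process.

```python
import subprocess
import sys
import threading
import time

def time_thread_creation(n):
    """Average seconds to start and join one do-nothing thread."""
    start = time.perf_counter()
    for _ in range(n):
        t = threading.Thread(target=lambda: None)
        t.start()
        t.join()
    return (time.perf_counter() - start) / n

def time_process_creation(n):
    """Average seconds to start and wait for one do-nothing Python process."""
    start = time.perf_counter()
    for _ in range(n):
        subprocess.run([sys.executable, "-c", "pass"], check=True)
    return (time.perf_counter() - start) / n

thread_cost = time_thread_creation(20)
process_cost = time_process_creation(3)
print(f"thread:  {thread_cost:.6f} s each")
print(f"process: {process_cost:.6f} s each")
```

The numbers vary from run to run and from machine to machine, which is exactly why such metrics are hard to generalize.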
10. Currently available
Intel: 2 – 4 cores
AMD: 2 cores (4 soon)
Cell processors: 9 cores
Graphics Processing Unit (GPU)
11. [Diagram: four-core processor. Each core has its own
Architectural State, Execution Engine, and Local APIC.
Each pair of cores shares a Second Level Cache and a
Bus Interface; both Bus Interfaces connect to the System Bus.]
12. SPE – Synergistic Processing Element
SPU – Synergistic Processor Unit
SXU – Synergistic Execution Unit
LS – Local Storage
MFC – Memory Flow Control
PPE – PowerPC Processor Element
PPU – PowerPC Processing Unit
PXU – PowerPC Execution Unit
L1, L2 – Level 1 and Level 2 caches
MIC – Memory Interface Controller
BIC – Broadband Interface Controller
[Diagram: eight SPEs in two rows of four, each with an SPU
(SXU and LS) and an MFC, attached to the Element Interconnect
Bus (EIB) (up to 96B/cycle). The PPE (PPU with PXU, L1, and L2),
the MIC, and the BIC are attached to the same bus.]
16. Defining and Preparing Threads: performed by
Programming Environment and Compiler
Operating Threads: performed by OS using Processes
Executing Threads: performed by Processors
17. User-Level Threads
Kernel-Level Threads
Hardware Threads
Intel/AMD: Sophisticated firmware on chip to handle
process execution
Cell: Minimal firmware on chip to handle execution
18. User-Level Threads
Kernel-Level Threads
Hardware Threads
Intel/AMD: Multiple process management of threads by
Operating System
Cell: Process management written by user: total control!
19. User-Level Threads
Kernel-Level Threads
Hardware Threads
Intel/AMD: Use threading package to manage threads
(OpenMP, Pthreads, TBB, etc.)
Cell: User manages threads directly or by adapting a
threading package
20. Intel/AMD: Completely controlled by OS and chip
Cell: Controlled by user
For execution to be fast, execution block
(code and data) must be kept in cache as
much as possible.
21. Global Interpreter Lock (GIL)
Cache Management
Data Management
Program Flow
Thread Design
22. Python's Global Interpreter Lock allows only one
thread to execute Python code at any given time
True parallelism within one interpreter is only
available by calling lower-level (C/C++/Fortran/etc.)
routines
This is as it should be! The Python
interpreter should not be parallelized
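A sketch of the lower-level route using only the standard library. The CPython `hashlib` documentation states that the lock is released while hashing data larger than 2047 bytes, so the C hashing routine can run in several threads at once. The buffer contents and the helper name `digest` are made up; whether this actually runs faster depends on the hardware and OS.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def digest(buffer):
    """Hash one buffer; the heavy lifting happens in C inside hashlib."""
    return hashlib.sha256(buffer).hexdigest()

# Four 1 MB buffers of made-up data, one per worker thread.
buffers = [bytes([i]) * (1024 * 1024) for i in range(4)]

with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(digest, buffers))

sequential = [digest(b) for b in buffers]
print(threaded == sequential)  # True
```

The Python code in `digest` still runs one thread at a time; only the time spent inside the C routine can overlap.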
23. Significant Factors
Available memory
Number of other processes running
How the OS's handling of threads and the hardware's
handling of threads interact with each other
24. Strategies
Design data structures so that data can be sliced
into small chunks
Start with a small program and small data structures, then
increase them slowly, looking for performance
degradation
Optimize code in called processes
Not enough control to do much else!
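The first two strategies can be sketched as follows. `chunked` and `process` are hypothetical helpers, the chunk size of 1,000 and the problem sizes are arbitrary, and the timings printed will differ on every machine.

```python
import time

def chunked(data, size):
    """Yield consecutive slices of `data`, each at most `size` items long."""
    for start in range(0, len(data), size):
        yield data[start:start + size]

def process(chunk):
    """Hypothetical stand-in for the real per-chunk work."""
    return sum(x * x for x in chunk)

# Grow the problem step by step and watch the cost per item.
per_item = {}
for n in (10_000, 100_000, 1_000_000):
    data = list(range(n))
    start = time.perf_counter()
    total = sum(process(c) for c in chunked(data, 1_000))
    per_item[n] = (time.perf_counter() - start) / n
    print(f"n = {n:>9}: {per_item[n] * 1e9:.1f} ns per item")
```

If the cost per item climbs sharply at some size, that is a hint that the working set no longer fits in cache or memory.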
25. Significant Factors
Available memory on PowerPC
Whether there are other users on the cell
Progressive computation on one set of data
vs. separate computation on separate data
26. Strategies
Process plus data for each SPE must fit within
256 KB
Optimize code running on SPEs: try
different options for your specific
application
Divide tasks sent to PPE into chunks that
will fit into SPEs.
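The chunk-size arithmetic can be sketched in a few lines. Real SPE code is written in C with the Cell SDK; this Python fragment only works out the budget. The 64 KB code size, 4-byte items, and two buffers are hypothetical numbers, and the sketch ignores stack space, which also lives in local storage.

```python
LOCAL_STORE_BYTES = 256 * 1024  # SPE local storage size quoted above

def max_items_per_chunk(code_bytes, item_bytes, buffers=1):
    """Largest item count whose buffers fit beside the code in local storage."""
    free = LOCAL_STORE_BYTES - code_bytes
    if free <= 0:
        raise ValueError("code alone does not fit in local storage")
    return free // (item_bytes * buffers)

def split_into_chunks(n_items, chunk_items):
    """Return (start, stop) index pairs that together cover n_items."""
    return [(s, min(s + chunk_items, n_items))
            for s in range(0, n_items, chunk_items)]

# Hypothetical job: 64 KB of SPE code, 4-byte floats, input and output buffer.
per_chunk = max_items_per_chunk(code_bytes=64 * 1024, item_bytes=4, buffers=2)
chunks = split_into_chunks(100_000, per_chunk)
print(per_chunk)    # 24576
print(len(chunks))  # 5
```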
27. Data Stream
Data 1 Data 2 Data 3 Data 4
Process Process Process Process
Result 1 Result 2 Result 3 Result 4
Different data is put through the same process
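A minimal sketch of this pattern with `concurrent.futures` from the standard library. The function `process` and the four input lists are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def process(data):
    """Hypothetical stand-in for the single process applied to every input."""
    return sum(data)

inputs = [[1, 2], [3, 4], [5, 6], [7, 8]]  # Data 1 .. Data 4

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process, inputs))  # Result 1 .. Result 4, in order

print(results)  # [3, 7, 11, 15]
```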
28. Data Stream
Data Data Data Data
Process 1 Process 2 Process 3 Process 4
Result 1 Result 2 Result 3 Result 4
The same data is put through different processes
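A minimal sketch of this pattern, again with `concurrent.futures`. The four built-in functions stand in for Process 1 to Process 4, and the sample data is made up.

```python
from concurrent.futures import ThreadPoolExecutor

data = [4, 8, 15, 16, 23, 42]     # the one shared input
processes = [min, max, sum, len]  # Process 1 .. Process 4

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(fn, data) for fn in processes]
    results = [f.result() for f in futures]  # Result 1 .. Result 4

print(results)  # [4, 42, 108, 6]
```

Because every process only reads the shared data, no locking is needed here.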
31. Written in Python
Python-like interface
Written up in August 2007 Dr Dobb’s Journal
(currently available on literature table)
Can work with other languages
Works on multiple processors as well as multi-core
Handles appropriate breakdown of data
32. Uses C++-like syntax to specify work to
be done in parallel
Otherwise similar in functionality to
NetWorkSpaces
Claims to be highly efficient
Currently in commercial use
Free for development; requires license
for released product
33. Originally intended to support GUI interfaces
across the internet (multiple systems)
Covers mechanics of interface with processors
Does not handle data
QtPy is a Python implementation
34. http://www-128.ibm.com/developerworks/power/cell/docs_documentation.html
Introduction to the Cell Multiprocessor
Cell Broadband Engine Programming Tutorial
Cell Broadband Engine Programming Handbook
Programming high-performance applications on
the Cell BE processor
Maximizing the power of the CBE Processor
35. Dr. Dobb’s Journal article about depth-first search:
http://www.ddj.com/dept/64bit/197801624
Software Development Kit
http://www-128.ibm.com/developerworks/power/cell
Programming the Cell Broadband Engine
http://www.embedded.com/showArticle.jhtml?articleID=188101999