In this video, Dr. Umit Catalyurek from Georgia Institute of Technology presents: Modern Computing: Cloud, Distributed, & High Performance.
Ümit V. Çatalyürek is a Professor in the School of Computational Science and Engineering in the College of Computing at the Georgia Institute of Technology. He received his Ph.D. in 2000 from Bilkent University. He is a recipient of an NSF CAREER award and is the principal investigator of several awards from the Department of Energy, the National Institutes of Health, and the National Science Foundation. He currently serves as an Associate Editor for Parallel Computing, and as an editorial board member for IEEE Transactions on Parallel and Distributed Systems and the Journal of Parallel and Distributed Computing.
Learn more: http://www.bigdatau.org/data-science-seminars
Watch the video presentation: http://wp.me/p3RLHQ-ghU
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Modern Computing: Cloud, Distributed, & High Performance
1. SECTION 3:
MODERN COMPUTING:
CLOUD, DISTRIBUTED & HIGH PERFORMANCE
DR. ÜMİT V. ÇATALYÜREK
PROFESSOR AND ASSOCIATE CHAIR
Georgia Institute of Technology
JANUARY 27, 2017
The Big Data to Knowledge (BD2K)
Guide to the Fundamentals of Data Science
1
2. ÜMİT V. ÇATALYÜREK
• A Professor in the School of Computational Science &
Engineering in the College of Computing at the Georgia
Institute of Technology.
• A recipient of an NSF CAREER award
• The principal investigator of several awards from the
Department of Energy, the National Institutes of Health, & the
National Science Foundation.
• An Associate Editor for Parallel Computing, & editorial board
member for IEEE Transactions on Parallel & Distributed
Systems, & the Journal of Parallel & Distributed Computing.
• A Fellow of IEEE, member of ACM & SIAM, the Chair for
IEEE TCPP for 2016-2017, & Vice-Chair for ACM SIGBio for the
2015-2018 term.
• Main research areas: parallel computing, combinatorial
scientific computing & biomedical informatics.
• More information about Dr. Ümit V. Çatalyürek can be
found at http://cc.gatech.edu/~umit.
2
3. MODERN COMPUTING: CLOUD, DISTRIBUTED &
HIGH PERFORMANCE COMPUTING
Ümit V. Çatalyürek
Professor and Associate Chair
School of Computational Science and Engineering
Georgia Institute of Technology
The BD2K Guide to the Fundamentals of Data Science Series
27 January 2017
3
4. Outline
• HPC
• What is it? Why?
• A Crash Course on (HPC) Computer Architecture
• History of Single “Processor” Performance
• Taxonomy of Processors, Memory Topology of Parallel Computers
• Supercomputers
• How to speed up your application?
• Focus on the common case
• Pay attention to locality
• Take advantage of parallelism
• An Example Application: Whole-Slide Histopathology
Image Analysis for Neuroblastoma
• Summary
4
5. What does High Performance Computing (HPC) mean?
• There is no such thing as “Low Performance Computing”
• “HPC most generally refers to the practice of aggregating computing
power in a way that delivers much higher performance than one could
get out of a typical desktop computer or workstation in order to solve
large problems in science, engineering, or business” (insideHPC)
• “HPC allows scientists and engineers to solve complex science,
engineering, and business problems using applications that require
high bandwidth, enhanced networking, and very high compute
capabilities.” (Amazon AWS)
• “HPC is the use of parallel processing for running advanced
application programs efficiently, reliably and quickly… The term HPC
is occasionally used as a synonym for supercomputing.”
(SearchEnterpriseLinux/WhatIs.com)
5
6. My Definition of High Performance Computing (HPC)
• Efficient use of computing platforms for running application
programs quickly.
• Why do we care about speed?
• We do not want science to wait for computing.
• Why do we care about efficiency?
• Efficient use of resources means more resources available to all of us ☺
• Somebody has to pay the bills!
• When you have an efficient program, it will also be very fast!
• Supercomputing is HPC, but HPC does not mean just
supercomputing
• For Supercomputers check top500.org (more later).
6
7. Computing Today
• Computing = Parallel Computing = HPC
• Any “computer” you touch has parallel processing power:
• Your laptop’s CPU has at least 2 cores.
• Your cell phone has 4-8 cores!
• This is a BD2K Seminar: Data (and hence the computational need) is BIG!
• So big that it does not fit into your computer.
• It takes too long to compute on your computer.
[Chart: growth of GenBank and WGS bases (megabases, log scale), Dec 1982 - Jul 2015; Oxford Nanopore MinION MkI shown. Source: http://www.genome.gov/sequencingcosts/]
7
8. Outline
• HPC
• What is it? Why?
• A Crash Course on (HPC) Computer Architecture
• History of Single “Processor” Performance
• Taxonomy of Processors, Memory Topology of Parallel Computers
• Supercomputers
• How to speed up your application?
• Focus on the common case
• Pay attention to locality
• Take advantage of parallelism
• An Example Application: Whole-Slide Histopathology
Image Analysis for Neuroblastoma
• Summary
8
10. Bandwidth and Latency
•Bandwidth or throughput
• Total work done in a given time
• 10,000-25,000X improvement for processors
• 300-1200X improvement for memory and disks
•Latency or response time
• Time between start and completion of an event
• 30-80X improvement for processors
• 6-8X improvement for memory and disks
10
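To make the bandwidth/latency distinction concrete, here is a minimal Python sketch (mine, not from the slides) that models transfer time as latency plus size divided by bandwidth; the link numbers are assumed, round figures for illustration only.

```python
# Illustrative model: total transfer time = latency + bytes / bandwidth.
# The latency and bandwidth values below are assumptions, not measurements.

def transfer_time(num_bytes, latency_s, bandwidth_bytes_per_s):
    """Simple latency + size/bandwidth model of one data transfer."""
    return latency_s + num_bytes / bandwidth_bytes_per_s

# A small request (4 KB) vs. a large one (1 GB) over an assumed link with
# 100 microseconds of latency and 10 GB/s of bandwidth.
for size in (4 * 1024, 1024**3):
    t = transfer_time(size, latency_s=100e-6, bandwidth_bytes_per_s=10 * 1024**3)
    print(f"{size:>12d} bytes -> {t * 1e3:8.3f} ms")

# Small transfers are dominated by latency, large transfers by bandwidth,
# which is why the two improve (and matter) so differently.
```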
21. Outline
• HPC
• What is it? Why?
• A Crash Course on (HPC) Computer Architecture
• History of Single “Processor” Performance
• Taxonomy of Processors, Memory Topology of Parallel Computers
• Supercomputers
• How to speed up your application?
• Focus on the common case
• Pay attention to locality
• Take advantage of parallelism
• An Example Application: Whole-Slide Histopathology
Image Analysis for Neuroblastoma
• Summary
21
22. Oxen or Chicken Dilemma
• "If you were plowing a field, which would you rather use?
Two strong oxen or 1024 chickens?”
Seymour Cray
22
27. Outline
• HPC
• What is it? Why?
• A Crash Course on (HPC) Computer Architecture
• History of Single “Processor” Performance
• Taxonomy of Processors, Memory Topology of Parallel Computers
• Supercomputers
• How to speed up your application?
• Focus on the common case
• Pay attention to locality
• Take advantage of parallelism
• An Example Application: Whole-Slide Histopathology
Image Analysis for Neuroblastoma
• Summary
27
29. Amdahl’s Law Example:
• Sequence Analysis Pipeline has a “slow” step which does
error correction of the input reads
• New CPU 10X faster
• I/O-bound server, so 60% of the time is spent waiting for I/O
• Apparently, it’s human nature to be attracted by “10X faster”,
vs. keeping in perspective it’s just 1.6X faster

\[
\text{Speedup}_{\text{overall}}
= \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}
= \frac{1}{(1 - 0.4) + \dfrac{0.4}{10}}
= \frac{1}{0.64}
\approx 1.56
\]

29
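A minimal Python transcription of the Amdahl's Law arithmetic above (40% of the time enhanced, 10x faster enhancement); this is just the formula, not code from the talk.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only a fraction of the work is enhanced."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# 60% of the time is spent waiting for I/O, so only 40% sees the 10x faster CPU.
print(round(amdahl_speedup(0.4, 10), 2))    # 1.56
# Even an infinitely fast CPU cannot do better than 1 / (1 - 0.4):
print(round(amdahl_speedup(0.4, 1e12), 2))  # 1.67
```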
31. CLUSTAL W
• Based on Higgins & Sharp CLUSTAL [Gene88]
• Progressive alignment-based strategy
• Pairwise Alignment (O(n²l²))
• A distance matrix is computed using either an approximate method (fast) or
dynamic programming (more accurate, slower)
• Computation of Guide Tree (O(n³)): phylogenetic tree
• Computed from the distance matrix
• Iteratively selecting aligned pairs and linking them.
• Progressive Alignment (O(nl²))
• A series of pairwise alignments computed using full dynamic programming to align
larger and larger groups of sequences.
• The order in the Guide Tree determines the ordering of sequence alignments.
• At each step, either two sequences are aligned, or a new sequence is aligned with
a group, or two groups are aligned.
• n: number of sequences in the query
• l: average sequence length
31
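To see where the O(n²l²) cost of the pairwise stage comes from, here is a hedged sketch (a generic edit-distance dynamic program, not CLUSTAL W's actual scoring): each of the O(n²) sequence pairs costs O(l²) work.

```python
from itertools import combinations

def dp_distance(a, b):
    """O(len(a) * len(b)) dynamic program; a stand-in for the pairwise
    alignment scoring CLUSTAL W can use in its 'accurate' mode."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # match / mismatch
        prev = cur
    return prev[-1]

def distance_matrix(seqs):
    """All-pairs distances: O(n^2) pairs times O(l^2) work per pair."""
    return {(i, j): dp_distance(a, b)
            for (i, a), (j, b) in combinations(enumerate(seqs), 2)}

print(distance_matrix(["ACGT", "ACGA", "TCGT"]))
```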
32. Speeding up CLUSTALW
[Chart: breakdown of CLUSTAL W execution time on a PIII-650 MHz vs. number of GPCR sequences (25-1000); time fractions for progressive alignment, guide-tree computation, and pairwise alignment.]
• By parallelizing the most time-consuming part: pairwise alignment
[Chart: speedup of the parallelized CLUSTAL W vs. number of processors (1-8), showing linear/ideal speedup, pairwise-alignment speedup, and total speedup.]
32
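Since the profile above shows the pairwise stage dominating, a sketch of the corresponding parallelization might distribute the independent sequence pairs across worker processes, as below; this illustrates the idea (task-parallelism over the common case), not the actual parallel CLUSTAL W code. `dp_distance` is the same illustrative stand-in used earlier.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import combinations

def dp_distance(a, b):
    # Stand-in for the pairwise alignment scoring kernel (see earlier sketch).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def score_pair(args):
    i, j, a, b = args
    return (i, j), dp_distance(a, b)

def parallel_distance_matrix(seqs, workers=4):
    """Each of the O(n^2) pairs is an independent task, so the dominant
    stage parallelizes naturally across a pool of worker processes."""
    pairs = [(i, j, seqs[i], seqs[j])
             for i, j in combinations(range(len(seqs)), 2)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(score_pair, pairs))

if __name__ == "__main__":   # guard needed for process pools on some platforms
    print(parallel_distance_matrix(["ACGT", "ACGA", "TCGT", "ACCT"]))
```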
34. Outline
• HPC
• What is it? Why?
• A Crash Course on (HPC) Computer Architecture
• History of Single “Processor” Performance
• Taxonomy of Processors, Memory Topology of Parallel Computers
• Supercomputers
• How to speed up your application?
• Focus on the common case
• Pay attention to locality
• Take advantage of parallelism
• An Example Application: Whole-Slide Histopathology
Image Analysis for Neuroblastoma
• Summary
34
35. Levels of the Memory Hierarchy
35
Level            Capacity           Access time                Cost
CPU registers    100s of bytes      300-500 ps (0.3-0.5 ns)    -
L1/L2 cache      10s-100s KBytes    ~1 ns - ~10 ns             ~$1000s/GByte
Main memory      GBytes             80-200 ns                  ~$100/GByte
Disk             10s of TBytes      10 ms (10,000,000 ns)      ~$1/GByte
Tape             infinite           sec-min                    ~$1/GByte

Data moves between adjacent levels in progressively larger units: instruction operands (registers), blocks (L1/L2 cache), pages (main memory), files (disk/tape). Upper levels are faster; lower levels are larger.
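A minimal sketch (mine, not from the slides) of why the hierarchy rewards locality: traversing a row-major NumPy array along rows touches contiguous memory, while traversing it along columns strides across it; the array size is an arbitrary choice.

```python
import time
import numpy as np

n = 4000
a = np.random.rand(n, n)                   # stored row-major (C order)

def timed(fn):
    t0 = time.perf_counter()
    fn()
    return time.perf_counter() - t0

# Row-wise traversal reads contiguous memory: good spatial locality.
row_wise = timed(lambda: sum(float(a[i, :].sum()) for i in range(n)))
# Column-wise traversal strides through memory: more cache misses.
col_wise = timed(lambda: sum(float(a[:, j].sum()) for j in range(n)))

print(f"row-wise:    {row_wise:.3f} s")
print(f"column-wise: {col_wise:.3f} s  (typically slower on a row-major array)")
```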
36. Locality Aware Remote Visualization
• Scientific and clinical research generate multi-GB to multi-TB of
spatially and temporally correlated data
• Different spatial and temporal resolutions
• Different acquisition modalities, from CT to light microscopy to electron
microscopy
• Example applications: Visible Human, mouse BIRN
• DataCutter Streams Data to MPI-based OSC Parallel Renderer
• Setup
• Full color Visible Woman dataset
• Super-sampled at 2x for entire dataset, 4x and 8x for regions of the dataset
• Data stored on 20 nodes
• 8 rendering nodes and 1 compositing node with texture VR
• Remote thin client connected over the Internet
41. Outline
• HPC
• What is it? Why?
• A Crash Course on (HPC) Computer Architecture
• History of Single “Processor” Performance
• Taxonomy of Processors, Memory Topology of Parallel Computers
• Supercomputers
• How to speed up your application?
• Focus on the common case
• Pay attention to locality
• Take advantage of parallelism
• An Example Application: Whole-Slide Histopathology
Image Analysis for Neuroblastoma
• Summary
41
42. Current and Emerging Scientific Applications
42
Processing Remotely-Sensed Data
NOAA Tiros-N
w/ AVHRR sensor
AVHRR Level 1 Data
• As the TIROS-N satellite orbits, the
Advanced Very High Resolution Radiometer (AVHRR)
sensor scans perpendicular to the satellite’s track.
• At regular intervals along a scan line measurements
are gathered to form an instantaneous field of view
(IFOV).
• Scan lines are aggregated into Level 1 data sets.
A single file of Global Area
Coverage (GAC) data
represents:
• ~one full earth orbit.
• ~110 minutes.
• ~40 megabytes.
• ~15,000 scan lines.
One scan line is 409 IFOVs
Satellite Data Processing
DCE-MRI Analysis
Short Sequence
Mapping
Quantum Chemistry
Image Processing, Multimedia, Video Surveillance, Montage
43. Application Patterns
•Complex and diverse processing structures
43
[Diagram: the satellite data processing application recast as a bag-of-tasks model: a data analysis application expressed as independent tasks operating on files.]
44. Application Patterns
•Complex and diverse processing structures
44
[Diagram: data analysis applications split into bag-of-tasks applications and non-streaming workflows, built from files and sequential or parallel tasks.]
45. Application Patterns
•Complex and diverse processing structures
45
[Diagram: data analysis applications split into bag-of-tasks applications, non-streaming workflows, and streaming workflows.]
46. Taxonomy of Parallelism
•Complex and diverse processing structures
• Varied parallelism
46
[Diagram: a bag-of-tasks application, i.e. independent sequential tasks and their input files farmed out to processors P1-P4 (task-parallelism).]
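A minimal bag-of-tasks sketch (my illustration): independent units of work, each standing in for one task/file pair, handed to whichever worker process is free. This is task-parallelism in its simplest form.

```python
from concurrent.futures import ProcessPoolExecutor

def process_task(task_id):
    """Placeholder for one independent unit of work (e.g. analyzing one file)."""
    return task_id, sum(i * i for i in range(100_000))   # synthetic work

def run_bag_of_tasks(task_ids, workers=4):
    # Task-parallelism: any free worker pulls the next independent task.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(process_task, task_ids))

if __name__ == "__main__":
    results = run_bag_of_tasks(range(32))
    print(len(results))    # 32 independent tasks completed
```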
49. Taxonomy of Parallelism
• Complex and diverse processing structures
• Varied parallelism
•Bag-of-tasks: task-parallelism
•Non-streaming workflows: task- and data-parallelism
49
50. Taxonomy of Parallelism
• Complex and diverse processing structures
• Varied parallelism
[Diagram: a streaming workflow of sequential or parallel tasks mapped onto processors P1-P4, exhibiting task-parallelism, data-parallelism, and pipelined-parallelism.]
50
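For the streaming case, a hedged sketch of pipelined parallelism (my illustration, not DataCutter code): two stages connected by a bounded queue, so the downstream stage processes item i while the upstream stage is already producing item i+1.

```python
import queue
import threading

SENTINEL = object()                      # marks the end of the stream

def producer(out_q, n_items):
    """Stage 1: read/decode items and stream them downstream."""
    for i in range(n_items):
        out_q.put(i * i)                 # placeholder for real work
    out_q.put(SENTINEL)

def consumer(in_q, results):
    """Stage 2: analyze items as they arrive."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            break
        results.append(item + 1)         # placeholder for real work

buf = queue.Queue(maxsize=8)             # bounded buffer between the stages
results = []
t1 = threading.Thread(target=producer, args=(buf, 100))
t2 = threading.Thread(target=consumer, args=(buf, results))
t1.start(); t2.start(); t1.join(); t2.join()
print(len(results))                      # 100: the two stages overlapped in time
```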
51. Taxonomy of Parallelism
• Complex and diverse processing structures
• Varied parallelism
•Bag-of-tasks: task-parallelism
•Non-streaming workflows: task- and data-parallelism
•Streaming workflows: task-, data- and pipelined-
parallelism
51
52. Outline
• HPC
• What is it? Why?
• A Crash Course on (HPC) Computer Architecture
• History of Single “Processor” Performance
• Taxonomy of Processors, Memory Topology of Parallel Computers
• Supercomputers
• How to speed up your application?
• Focus on the common case
• Pay attention to locality
• Take advantage of parallelism
• An Example Application: Whole-Slide Histopathology
Image Analysis for Neuroblastoma
• Summary
52
53. An Example Application: Whole-Slide Histopathology
Image Analysis for Neuroblastoma
• Classify biopsy tissue images into different subtypes of
prognostic significance
• Very high resolution slides
• Divided into smaller tiles
• Multi-resolution image analysis
• Mimics the way pathologists perform their analysis
• If classification at a lower resolution is not satisfactory, the analysis algorithm is
executed at higher resolution(s), hence the dynamic workload.
53
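A hedged sketch of the multi-resolution cascade described above (my paraphrase of the idea, with a made-up placeholder classifier): a tile is classified at the lowest resolution first and only escalated to higher resolutions when the confidence is too low, which is what makes the workload dynamic.

```python
RESOLUTIONS = [1, 2, 4]                   # stand-ins for increasing magnifications

def classify(tile, resolution):
    """Placeholder classifier returning (label, confidence); in the real
    application this is an expensive image-analysis step whose cost
    grows with resolution."""
    confidence = min(1.0, tile["base_confidence"] + 0.15 * resolution)
    return "subtype-A", confidence

def classify_multiresolution(tile, threshold=0.9):
    for res in RESOLUTIONS:
        label, conf = classify(tile, res)
        if conf >= threshold:             # satisfactory: stop early (the common case)
            return label, conf, res
    return label, conf, RESOLUTIONS[-1]   # forced up to the highest resolution

print(classify_multiresolution({"base_confidence": 0.80}))  # settled at low resolution
print(classify_multiresolution({"base_confidence": 0.40}))  # escalated to high resolution
```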
54. Why do we need HPC?
• Due to the large sizes of whole-slide images
• A 120K x 120K image digitized at 40x occupies more than 40 GB.
• The processing time on a single CPU
• For an image tile of 1K x 1K is ≈6 secs w/ Matlab, 850 msecs w/ C++
• For a “small” 50K x 50K slide (assuming 50% background) ≈20 min.
• In algorithm development
• Algorithm development in Matlab
• Requires evaluation of many different techniques, parameters, etc.
• In clinical practice, 8-9 biopsy samples are collected per patient. For an
average of 500 neuroblastoma patients treated annually, our biomedical
image analysis consumes:
• On a CPU: 24 months using Matlab and 3.4 months using C++.
• Can we reduce this to a couple of days or even hours?
54
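The slide's per-slide estimate can be reproduced with a short back-of-envelope calculation; the per-tile timings and the 50% background assumption come from the slide, while the tiling arithmetic is my rough sketch.

```python
TILE = 1024                                  # 1K x 1K pixel tiles

def slide_time_hours(width, height, secs_per_tile, background_fraction=0.5):
    """Rough single-CPU processing-time estimate for one whole-slide image."""
    tiles = (width // TILE) * (height // TILE) * (1 - background_fraction)
    return tiles * secs_per_tile / 3600.0

# A "small" 50K x 50K slide, assuming 50% background:
print(f"Matlab (~6 s/tile): {slide_time_hours(50_000, 50_000, 6.00):.2f} h")
print(f"C++ (~0.85 s/tile): {slide_time_hours(50_000, 50_000, 0.85) * 60:.0f} min")  # roughly 20 min
# Multiplying by thousands of biopsy slides per year is what pushes the
# single-CPU total into months, hence the case for parallel machines.
```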
56. Characterizing the GPU/CPU speed-up
56
                   Color          Co-occurrence   LBP            Histogram
                   conversion     matrices        operator
Color channels     Three          Three           One            One
Output results     1Kx1K tile     4x4 matrix      1Kx1K tile     256 bins
Comput. weight     Heavy          Average         Heavy          Low
Operator type      Streaming      Iterative       Streaming      Iterative
Data reuse         None           Strong          Little         Strong
Locality access    None           High            Little         High
Arithm. intensity  Heavy          Low             Average        Low
Memory access      Low            High            Average        High
GPU speed-up       166.09x        16.75x          85.86x         8.32x
57. Effect of runtime optimizations
57
• Homogeneous and heterogeneous base case evaluations
• Tile recalculation rate: % of tiles recalculated at higher resolution
• ODDS improves performance even in the base case
• Using an additional CPU-only machine is more than 3x faster than the GPU-only version
Table 6: Demand-driven scheduling policies used in Sect. 6

Policy    Area of effect   Sender queue        Receiver queue       Request size for data buffers
DDFCFS    Intra-filter     Unsorted            Unsorted             Static
DDWRR     Intra-filter     Unsorted            Sorted by speedup    Static
ODDS      Inter-filter     Sorted by speedup   Sorted by speedup    Dynamic
In Table 6 we present three demand-driven policies (where consumer filters only get as much data as they request) used in our evaluation. All these scheduling policies maintain some minimal queue at the receiver side, such that processor idle time is avoided. Simpler policies like round-robin or random do not fit into the demand-driven paradigm, as they simply push data buffers down to the consumer filters without any knowledge of whether the data buffers are being processed efficiently. As such, we do not consider these to be good scheduling methods, and we exclude them from our evaluation.

The First-Come, First-Served (DDFCFS) policy simply maintains FIFO queues of data buffers on both ends of the stream, and a filter instance requesting data will get whatever data buffer is next out of the queue. The DDWRR policy uses the same technique as DDFCFS on the sender side, but sorts its receiver-side queue of data buffers by the relative speedup to give the highest-performing data buffers to each processor. Both DDFCFS and DDWRR have a static value for requests for data buffers during execution, which is chosen by the programmer. For ODDS, discussed in Sect. 5.3, the sender and receiver queues are sorted by speedup and the receiver's number of requests for data buffers is dynamically calculated at run-time.
[Fig. 17: Homogeneous base case evaluation]
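A much-simplified sketch of the receiver-side idea behind DDWRR/ODDS as described above (my reading of the description, not the DataCutter implementation): buffered work items are kept sorted by their estimated GPU speedup, so a GPU worker is handed the item it accelerates most and a CPU worker the item it loses least on. The example speedups are the ones from the operator table above.

```python
import bisect

class SpeedupSortedQueue:
    """Receiver-side queue of data buffers kept sorted by estimated GPU speedup;
    a simplified stand-in for the DDWRR/ODDS receiver-side policy."""

    def __init__(self):
        self._items = []                       # (gpu_speedup, buffer), ascending

    def put(self, buffer, gpu_speedup):
        bisect.insort(self._items, (gpu_speedup, buffer))

    def get(self, device):
        # A GPU worker takes the buffer that benefits most from the GPU;
        # a CPU worker takes the buffer that benefits least.
        return (self._items.pop() if device == "gpu" else self._items.pop(0))[1]

q = SpeedupSortedQueue()
for name, speedup in [("color-conv", 166.0), ("histogram", 8.3),
                      ("lbp", 85.9), ("cooccur", 16.8)]:
    q.put(name, speedup)
print(q.get("gpu"))   # color-conv: highest GPU speedup
print(q.get("cpu"))   # histogram: lowest GPU speedup, better left on the CPU
```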
58. Outline
• HPC
• What is it? Why?
• A Crash Course on (HPC) Computer Architecture
• History of Single “Processor” Performance
• Taxonomy of Processors, Memory Topology of Parallel Computers
• Supercomputers
• How to speed up your application?
• Focus on the common case
• Pay attention to locality
• Take advantage of parallelism
• An Example Application: Whole-Slide Histopathology
Image Analysis for Neuroblastoma
• Summary
58
59. How about Cloud Computing?
• Cloud Computing
• It is not really “Cloud”; it is someone else’s computer!
• Rent instead of buy.
• Pay for Compute, Data Storage and Transfer.
• Our current best bet to enable sharing of large data, workflows and
computational resources.
• For “most of us” our best bet to achieve scalability and speed.
• Sample reading:
• Nature Reviews Genetics 11, 647-657 (September 2010) | doi:10.1038/nrg2857
• Computational solutions to large-scale data management and analysis
• Eric E. Schadt, Michael D. Linderman, Jon Sorenson, Lawrence Lee, and Garry
P. Nolan
• http://www.nature.com/nrg/multimedia/compsolutions/slideshow.html
• See also: Correspondence by Trelles et al. | Correspondence by Schadt et al.
59
60. Summary
• How to speed up your application?
• Focus on the common case
• If only 50% can be “improved”, the best you can get is a 2x speedup!
• Pay attention to locality
• Reduce data movement
• Move computation to data
• Take advantage of parallelism
• Multiple types of parallelism: task-, data- and pipelined-parallelism
• The fastest processor does not mean your application will run fast; find the most suitable
architecture.
• GPUs are good for “regular” computations
• GPUs can be up to 10x faster compared to a multi-core CPU; in many real-life
applications, it is usually 3-5x
60