In this video, Dr. Umit Catalyurek from Georgia Institute of Technology presents: Modern Computing: Cloud, Distributed, & High Performance.
Ümit V. Çatalyürek is a Professor in the School of Computational Science and Engineering in the College of Computing at the Georgia Institute of Technology. He received his Ph.D. in 2000 from Bilkent University. He is a recipient of an NSF CAREER award and is the principal investigator of several awards from the Department of Energy, the National Institutes of Health, and the National Science Foundation. He currently serves as an Associate Editor for Parallel Computing, and as an editorial board member for IEEE Transactions on Parallel and Distributed Systems and the Journal of Parallel and Distributed Computing.
Learn more: http://www.bigdatau.org/data-science-seminars
Watch the video presentation: http://wp.me/p3RLHQ-ghU
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Modern Computing: Cloud, Distributed, & High Performance
1. SECTION 3:
MODERN COMPUTING:
CLOUD, DISTRIBUTED & HIGH PERFORMANCE
DR. ÜMİT V. ÇATALYÜREK
PROFESSOR AND ASSOCIATE CHAIR
Georgia Institute of Technology
JANUARY 27, 2017
The Big Data to Knowledge (BD2K)
Guide to the Fundamentals of Data Science
1
2. ÜMİT V. ÇATALYÜREK
• A Professor in the School of Computational Science &
Engineering in the College of Computing at the Georgia
Institute of Technology.
• A recipient of an NSF CAREER award
• The principal investigator of several awards from the
Department of Energy, the National Institutes of Health, & the
National Science Foundation.
• An Associate Editor for Parallel Computing, & editorial board
member for IEEE Transactions on Parallel & Distributed
Systems, & the Journal of Parallel & Distributed Computing.
• A Fellow of IEEE, member of ACM & SIAM, the Chair for
IEEE TCPP for 2016-2017, & Vice-Chair for ACM SIGBio for the
2015-2018 term.
• Main research areas: parallel computing, combinatorial
scientific computing & biomedical informatics.
• More information about Dr. Ümit V. Çatalyürek can be
found at http://cc.gatech.edu/~umit.
2
3. MODERN COMPUTING: CLOUD, DISTRIBUTED &
HIGH PERFORMANCE COMPUTING
Ümit V. Çatalyürek
Professor and Associate Chair
School of Computational Science and Engineering
Georgia Institute of Technology
The BD2K Guide to the Fundamentals of Data Science Series
27 January 2017
3
4. Outline
• HPC
• What is it? Why?
• A Crash Course on (HPC) Computer Architecture
• History of Single “Processor” Performance
• Taxonomy of Processors, Memory Topology of Parallel Computers
• Supercomputers
• How to speed up your application?
• Focus on the common case
• Pay attention to locality
• Take advantage of parallelism
• An Example Application: Whole-Slide Histopathology
Image Analysis for Neuroblastoma
• Summary
4
5. What does High Performance Computing (HPC) mean?
• There is no such thing as “Low Performance Computing”
• “HPC most generally refers to the practice of aggregating computing
power in a way that delivers much higher performance than one could
get out of a typical desktop computer or workstation in order to solve
large problems in science, engineering, or business” (insideHPC)
• “HPC allows scientists and engineers to solve complex science,
engineering, and business problems using applications that require
high bandwidth, enhanced networking, and very high compute
capabilities.” (Amazon AWS)
• “HPC is the use of parallel processing for running advanced
application programs efficiently, reliably and quickly… The term HPC
is occasionally used as a synonym for supercomputing.”
(SearchEnterpriseLinux/WhatIs.com)
5
6. My Definition of High Performance Computing (HPC)
• Efficient use of computing platforms for running application
programs quickly.
• Why do we care about speed?
• We do not want science to wait for computing.
• Why do we care about efficiency?
• Efficient use of resources means more resources available to all of us ☺
• Somebody has to pay the bills!
• When you have an efficient program, it will also be very fast!
• Supercomputing is HPC, but HPC does not mean just
supercomputing
• For Supercomputers check top500.org (more later).
6
7. Computing Today
• Computing = Parallel Computing = HPC
• Any “computer” you touch has parallel processing power:
• Your laptop’s CPU has at least 2 cores.
• Your cell phone has 4-8 cores!
• This is a BD2K Seminar: Data (and hence the computational need) is BIG!
• So big that it does not fit into your computer.
• It takes too long to compute on your computer.
[Chart: growth of GenBank and WGS bases (megabases, log scale), Dec 1982 - Jul 2015; Oxford Nanopore MinION MkI shown. Source: http://www.genome.gov/sequencingcosts/]
7
8. Outline
• HPC
• What is it? Why?
• A Crash Course on (HPC) Computer Architecture
• History of Single “Processor” Performance
• Taxonomy of Processors, Memory Topology of Parallel Computers
• Supercomputers
• How to speed up your application?
• Focus on the common case
• Pay attention to locality
• Take advantage of parallelism
• An Example Application: Whole-Slide Histopathology
Image Analysis for Neuroblastoma
• Summary
8
10. Bandwidth and Latency
•Bandwidth or throughput
• Total work done in a given time
• 10,000-25,000X improvement for processors
• 300-1200X improvement for memory and disks
•Latency or response time
• Time between start and completion of an event
• 30-80X improvement for processors
• 6-8X improvement for memory and disks
10
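To make the bandwidth/latency distinction concrete, here is a minimal Python sketch (mine, not from the slides) that models transfer time as latency plus size divided by bandwidth; the link numbers are assumed, round figures for illustration only.

```python
# Illustrative model: total transfer time = latency + bytes / bandwidth.
# The latency and bandwidth values below are assumptions, not measurements.

def transfer_time(num_bytes, latency_s, bandwidth_bytes_per_s):
    """Simple latency + size/bandwidth model of one data transfer."""
    return latency_s + num_bytes / bandwidth_bytes_per_s

# A small request (4 KB) vs. a large one (1 GB) over an assumed link with
# 100 microseconds of latency and 10 GB/s of bandwidth.
for size in (4 * 1024, 1024**3):
    t = transfer_time(size, latency_s=100e-6, bandwidth_bytes_per_s=10 * 1024**3)
    print(f"{size:>12d} bytes -> {t * 1e3:8.3f} ms")

# Small transfers are dominated by latency, large transfers by bandwidth,
# which is why the two improve (and matter) so differently.
```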
21. Outline
• HPC
• What is it? Why?
• A Crash Course on (HPC) Computer Architecture
• History of Single “Processor” Performance
• Taxonomy of Processors, Memory Topology of Parallel Computers
• Supercomputers
• How to speed up your application?
• Focus on the common case
• Pay attention to locality
• Take advantage of parallelism
• An Example Application: Whole-Slide Histopathology
Image Analysis for Neuroblastoma
• Summary
21
22. Oxen or Chicken Dilemma
• "If you were plowing a field, which would you rather use?
Two strong oxen or 1024 chickens?”
Seymour Cray
22
27. Outline
• HPC
• What is it? Why?
• A Crash Course on (HPC) Computer Architecture
• History of Single “Processor” Performance
• Taxonomy of Processors, Memory Topology of Parallel Computers
• Supercomputers
• How to speed up your application?
• Focus on the common case
• Pay attention to locality
• Take advantage of parallelism
• An Example Application: Whole-Slide Histopathology
Image Analysis for Neuroblastoma
• Summary
27
29. Amdahl’s Law Example:
• Sequence Analysis Pipeline has a “slow” step which does
error correction of the input reads
• New CPU 10X faster
• I/O-bound server, so 60% of the time is spent waiting for I/O
• Apparently, it’s human nature to be attracted by “10X faster”,
vs. keeping in perspective it’s just 1.6X faster

\[
\text{Speedup}_{\text{overall}}
= \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}
= \frac{1}{(1 - 0.4) + \dfrac{0.4}{10}}
= \frac{1}{0.64}
\approx 1.56
\]

29
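A minimal Python transcription of the Amdahl's Law arithmetic above (40% of the time enhanced, 10x faster enhancement); this is just the formula, not code from the talk.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only a fraction of the work is enhanced."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# 60% of the time is spent waiting for I/O, so only 40% sees the 10x faster CPU.
print(round(amdahl_speedup(0.4, 10), 2))    # 1.56
# Even an infinitely fast CPU cannot do better than 1 / (1 - 0.4):
print(round(amdahl_speedup(0.4, 1e12), 2))  # 1.67
```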
31. CLUSTAL W
• Based on Higgins & Sharp CLUSTAL [Gene88]
• Progressive alignment-based strategy
• Pairwise Alignment (O(n²l²))
• A distance matrix is computed using either an approximate method (fast) or
dynamic programming (more accurate, slower)
• Computation of Guide Tree (O(n³)): phylogenetic tree
• Computed from the distance matrix
• Iteratively selecting aligned pairs and linking them.
• Progressive Alignment (O(nl²))
• A series of pairwise alignments computed using full dynamic programming to align
larger and larger groups of sequences.
• The order in the Guide Tree determines the ordering of sequence alignments.
• At each step, either two sequences are aligned, or a new sequence is aligned with
a group, or two groups are aligned.
• n: number of sequences in the query
• l: average sequence length
31
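To see where the O(n²l²) cost of the pairwise stage comes from, here is a hedged sketch (a generic edit-distance dynamic program, not CLUSTAL W's actual scoring): each of the O(n²) sequence pairs costs O(l²) work.

```python
from itertools import combinations

def dp_distance(a, b):
    """O(len(a) * len(b)) dynamic program; a stand-in for the pairwise
    alignment scoring CLUSTAL W can use in its 'accurate' mode."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # match / mismatch
        prev = cur
    return prev[-1]

def distance_matrix(seqs):
    """All-pairs distances: O(n^2) pairs times O(l^2) work per pair."""
    return {(i, j): dp_distance(a, b)
            for (i, a), (j, b) in combinations(enumerate(seqs), 2)}

print(distance_matrix(["ACGT", "ACGA", "TCGT"]))
```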
32. Speeding up CLUSTALW
[Chart: breakdown of CLUSTAL W execution time on a PIII-650 MHz vs. number of GPCR sequences (25-1000); time fractions for progressive alignment, guide-tree computation, and pairwise alignment.]
• By parallelizing the most time-consuming part: pairwise alignment
[Chart: speedup of the parallelized CLUSTAL W vs. number of processors (1-8), showing linear/ideal speedup, pairwise-alignment speedup, and total speedup.]
32
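Since the profile above shows the pairwise stage dominating, a sketch of the corresponding parallelization might distribute the independent sequence pairs across worker processes, as below; this illustrates the idea (task-parallelism over the common case), not the actual parallel CLUSTAL W code. `dp_distance` is the same illustrative stand-in used earlier.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import combinations

def dp_distance(a, b):
    # Stand-in for the pairwise alignment scoring kernel (see earlier sketch).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def score_pair(args):
    i, j, a, b = args
    return (i, j), dp_distance(a, b)

def parallel_distance_matrix(seqs, workers=4):
    """Each of the O(n^2) pairs is an independent task, so the dominant
    stage parallelizes naturally across a pool of worker processes."""
    pairs = [(i, j, seqs[i], seqs[j])
             for i, j in combinations(range(len(seqs)), 2)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(score_pair, pairs))

if __name__ == "__main__":   # guard needed for process pools on some platforms
    print(parallel_distance_matrix(["ACGT", "ACGA", "TCGT", "ACCT"]))
```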
34. Outline
• HPC
• What is it? Why?
• A Crash Course on (HPC) Computer Architecture
• History of Single “Processor” Performance
• Taxonomy of Processors, Memory Topology of Parallel Computers
• Supercomputers
• How to speed up your application?
• Focus on the common case
• Pay attention to locality
• Take advantage of parallelism
• An Example Application: Whole-Slide Histopathology
Image Analysis for Neuroblastoma
• Summary
34
35. Levels of the Memory Hierarchy
35
Level            Capacity           Access time                Cost
CPU registers    100s of bytes      300-500 ps (0.3-0.5 ns)    -
L1/L2 cache      10s-100s KBytes    ~1 ns - ~10 ns             ~$1000s/GByte
Main memory      GBytes             80-200 ns                  ~$100/GByte
Disk             10s of TBytes      10 ms (10,000,000 ns)      ~$1/GByte
Tape             infinite           sec-min                    ~$1/GByte

Data moves between adjacent levels in progressively larger units: instruction operands (registers), blocks (L1/L2 cache), pages (main memory), files (disk/tape). Upper levels are faster; lower levels are larger.
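A minimal sketch (mine, not from the slides) of why the hierarchy rewards locality: traversing a row-major NumPy array along rows touches contiguous memory, while traversing it along columns strides across it; the array size is an arbitrary choice.

```python
import time
import numpy as np

n = 4000
a = np.random.rand(n, n)                   # stored row-major (C order)

def timed(fn):
    t0 = time.perf_counter()
    fn()
    return time.perf_counter() - t0

# Row-wise traversal reads contiguous memory: good spatial locality.
row_wise = timed(lambda: sum(float(a[i, :].sum()) for i in range(n)))
# Column-wise traversal strides through memory: more cache misses.
col_wise = timed(lambda: sum(float(a[:, j].sum()) for j in range(n)))

print(f"row-wise:    {row_wise:.3f} s")
print(f"column-wise: {col_wise:.3f} s  (typically slower on a row-major array)")
```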
36. Locality Aware Remote Visualization
• Scientific and clinical research generate multi-GB to multi-TB of
spatially and temporally correlated data
• Different spatial and temporal resolutions
• Different acquisition modalities, from CT to light microscopy to electron
microscopy
• Example applications: Visible Human, mouse BIRN
• DataCutter Streams Data to MPI-based OSC Parallel Renderer
• Setup
• Full color Visible Woman dataset
• Super-sampled at 2x for entire dataset, 4x and 8x for regions of the dataset
• Data stored on 20 nodes
• 8 rendering nodes and 1 compositing node with texture VR
• Remote thin client connected over the Internet
41. Outline
• HPC
• What is it? Why?
• A Crash Course on (HPC) Computer Architecture
• History of Single “Processor” Performance
• Taxonomy of Processors, Memory Topology of Parallel Computers
• Supercomputers
• How to speed up your application?
• Focus on the common case
• Pay attention to locality
• Take advantage of parallelism
• An Example Application: Whole-Slide Histopathology
Image Analysis for Neuroblastoma
• Summary
41
42. Current and Emerging Scientific Applications
42
Processing Remotely-Sensed Data
NOAA Tiros-N
w/ AVHRR sensor
AVHRR Level 1 Data
• As the TIROS-N satellite orbits, the
Advanced Very High Resolution Radiometer (AVHRR)
sensor scans perpendicular to the satellite’s track.
• At regular intervals along a scan line measurements
are gathered to form an instantaneous field of view
(IFOV).
• Scan lines are aggregated into Level 1 data sets.
A single file of Global Area
Coverage (GAC) data
represents:
• ~one full earth orbit.
• ~110 minutes.
• ~40 megabytes.
• ~15,000 scan lines.
One scan line is 409 IFOVs
Satellite Data Processing
DCE-MRI Analysis
Short Sequence
Mapping
Quantum Chemistry
Image Processing, Multimedia, Video Surveillance, Montage
43. Application Patterns
•Complex and diverse processing structures
43
[Diagram: the satellite data processing application recast as a bag-of-tasks model: a data analysis application expressed as independent tasks operating on files.]
44. Application Patterns
•Complex and diverse processing structures
44
[Diagram: data analysis applications split into bag-of-tasks applications and non-streaming workflows, built from files and sequential or parallel tasks.]
45. Application Patterns
•Complex and diverse processing structures
45
[Diagram: data analysis applications split into bag-of-tasks applications, non-streaming workflows, and streaming workflows.]
46. Taxonomy of Parallelism
•Complex and diverse processing structures
• Varied parallelism
46
[Diagram: a bag-of-tasks application, i.e. independent sequential tasks and their input files farmed out to processors P1-P4 (task-parallelism).]
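A minimal bag-of-tasks sketch (my illustration): independent units of work, each standing in for one task/file pair, handed to whichever worker process is free. This is task-parallelism in its simplest form.

```python
from concurrent.futures import ProcessPoolExecutor

def process_task(task_id):
    """Placeholder for one independent unit of work (e.g. analyzing one file)."""
    return task_id, sum(i * i for i in range(100_000))   # synthetic work

def run_bag_of_tasks(task_ids, workers=4):
    # Task-parallelism: any free worker pulls the next independent task.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(process_task, task_ids))

if __name__ == "__main__":
    results = run_bag_of_tasks(range(32))
    print(len(results))    # 32 independent tasks completed
```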
49. Taxonomy of Parallelism
• Complex and diverse processing structures
• Varied parallelism
•Bag-of-tasks: task-parallelism
•Non-streaming workflows: task- and data-parallelism
49
50. Taxonomy of Parallelism
• Complex and diverse processing structures
• Varied parallelism
[Diagram: a streaming workflow of sequential or parallel tasks mapped onto processors P1-P4, exhibiting task-parallelism, data-parallelism, and pipelined-parallelism.]
50
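For the streaming case, a hedged sketch of pipelined parallelism (my illustration, not DataCutter code): two stages connected by a bounded queue, so the downstream stage processes item i while the upstream stage is already producing item i+1.

```python
import queue
import threading

SENTINEL = object()                      # marks the end of the stream

def producer(out_q, n_items):
    """Stage 1: read/decode items and stream them downstream."""
    for i in range(n_items):
        out_q.put(i * i)                 # placeholder for real work
    out_q.put(SENTINEL)

def consumer(in_q, results):
    """Stage 2: analyze items as they arrive."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            break
        results.append(item + 1)         # placeholder for real work

buf = queue.Queue(maxsize=8)             # bounded buffer between the stages
results = []
t1 = threading.Thread(target=producer, args=(buf, 100))
t2 = threading.Thread(target=consumer, args=(buf, results))
t1.start(); t2.start(); t1.join(); t2.join()
print(len(results))                      # 100: the two stages overlapped in time
```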
51. Taxonomy of Parallelism
• Complex and diverse processing structures
• Varied parallelism
•Bag-of-tasks: task-parallelism
•Non-streaming workflows: task- and data-parallelism
•Streaming workflows: task-, data- and pipelined-
parallelism
51
52. Outline
• HPC
• What is it? Why?
• A Crash Course on (HPC) Computer Architecture
• History of Single “Processor” Performance
• Taxonomy of Processors, Memory Topology of Parallel Computers
• Supercomputers
• How to speed up your application?
• Focus on the common case
• Pay attention to locality
• Take advantage of parallelism
• An Example Application: Whole-Slide Histopathology
Image Analysis for Neuroblastoma
• Summary
52
53. An Example Application: Whole-Slide Histopathology
Image Analysis for Neuroblastoma
• Classify biopsy tissue images into different subtypes of
prognostic significance
• Very high resolution slides
• Divided into smaller tiles
• Multi-resolution image analysis
• Mimics the way pathologists perform their analysis
• If classification at a lower resolution is not satisfactory, the analysis algorithm is
executed at higher resolution(s), hence the dynamic workload.
53
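A hedged sketch of the multi-resolution cascade described above (my paraphrase of the idea, with a made-up placeholder classifier): a tile is classified at the lowest resolution first and only escalated to higher resolutions when the confidence is too low, which is what makes the workload dynamic.

```python
RESOLUTIONS = [1, 2, 4]                   # stand-ins for increasing magnifications

def classify(tile, resolution):
    """Placeholder classifier returning (label, confidence); in the real
    application this is an expensive image-analysis step whose cost
    grows with resolution."""
    confidence = min(1.0, tile["base_confidence"] + 0.15 * resolution)
    return "subtype-A", confidence

def classify_multiresolution(tile, threshold=0.9):
    for res in RESOLUTIONS:
        label, conf = classify(tile, res)
        if conf >= threshold:             # satisfactory: stop early (the common case)
            return label, conf, res
    return label, conf, RESOLUTIONS[-1]   # forced up to the highest resolution

print(classify_multiresolution({"base_confidence": 0.80}))  # settled at low resolution
print(classify_multiresolution({"base_confidence": 0.40}))  # escalated to high resolution
```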
54. Why do we need HPC?
• Due to the large sizes of whole-slide images
• A 120K x 120K image digitized at 40x occupies more than 40 GB.
• The processing time on a single CPU
• For an image tile of 1K x 1K is ≈6 secs w/ Matlab, 850 msecs w/ C++
• For a “small” 50K x 50K slide (assuming 50% background) ≈20 min.
• In algorithm development
• Algorithm development in Matlab
• Requires evaluation of many different techniques, parameters, etc.
• In clinical practice, 8-9 biopsy samples are collected per patient. For an
average of 500 neuroblastoma patients treated annually, our biomedical
image analysis consumes:
• On a CPU: 24 months using Matlab and 3.4 months using C++.
• Can we reduce this to a couple of days or even hours?
54
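The slide's per-slide estimate can be reproduced with a short back-of-envelope calculation; the per-tile timings and the 50% background assumption come from the slide, while the tiling arithmetic is my rough sketch.

```python
TILE = 1024                                  # 1K x 1K pixel tiles

def slide_time_hours(width, height, secs_per_tile, background_fraction=0.5):
    """Rough single-CPU processing-time estimate for one whole-slide image."""
    tiles = (width // TILE) * (height // TILE) * (1 - background_fraction)
    return tiles * secs_per_tile / 3600.0

# A "small" 50K x 50K slide, assuming 50% background:
print(f"Matlab (~6 s/tile): {slide_time_hours(50_000, 50_000, 6.00):.2f} h")
print(f"C++ (~0.85 s/tile): {slide_time_hours(50_000, 50_000, 0.85) * 60:.0f} min")  # roughly 20 min
# Multiplying by thousands of biopsy slides per year is what pushes the
# single-CPU total into months, hence the case for parallel machines.
```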
56. Characterizing the GPU/CPU speed-up
56
                   Color          Co-occurrence   LBP            Histogram
                   conversion     matrices        operator
Color channels     Three          Three           One            One
Output results     1Kx1K tile     4x4 matrix      1Kx1K tile     256 bins
Comput. weight     Heavy          Average         Heavy          Low
Operator type      Streaming      Iterative       Streaming      Iterative
Data reuse         None           Strong          Little         Strong
Locality access    None           High            Little         High
Arithm. intensity  Heavy          Low             Average        Low
Memory access      Low            High            Average        High
GPU speed-up       166.09x        16.75x          85.86x         8.32x
57. Effect of runtime optimizations
57
• Homogeneous and heterogeneous base case evaluations
• Tile recalculation rate: % of tiles recalculated at higher resolution
• ODDS improves performance even in the base case
• Using an additional CPU-only machine is more than 3x faster than the GPU-only version
Table 6: Demand-driven scheduling policies used in Sect. 6

Policy    Area of effect   Sender queue        Receiver queue       Request size for data buffers
DDFCFS    Intra-filter     Unsorted            Unsorted             Static
DDWRR     Intra-filter     Unsorted            Sorted by speedup    Static
ODDS      Inter-filter     Sorted by speedup   Sorted by speedup    Dynamic
In Table 6 we present three demand-driven policies (where consumer filters only get as much data as they request) used in our evaluation. All these scheduling policies maintain some minimal queue at the receiver side, such that processor idle time is avoided. Simpler policies like round-robin or random do not fit into the demand-driven paradigm, as they simply push data buffers down to the consumer filters without any knowledge of whether the data buffers are being processed efficiently. As such, we do not consider these to be good scheduling methods, and we exclude them from our evaluation.

The First-Come, First-Served (DDFCFS) policy simply maintains FIFO queues of data buffers on both ends of the stream, and a filter instance requesting data will get whatever data buffer is next out of the queue. The DDWRR policy uses the same technique as DDFCFS on the sender side, but sorts its receiver-side queue of data buffers by the relative speedup to give the highest-performing data buffers to each processor. Both DDFCFS and DDWRR have a static value for requests for data buffers during execution, which is chosen by the programmer. For ODDS, discussed in Sect. 5.3, the sender and receiver queues are sorted by speedup and the receiver's number of requests for data buffers is dynamically calculated at run-time.
[Fig. 17: Homogeneous base case evaluation]
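A much-simplified sketch of the receiver-side idea behind DDWRR/ODDS as described above (my reading of the description, not the DataCutter implementation): buffered work items are kept sorted by their estimated GPU speedup, so a GPU worker is handed the item it accelerates most and a CPU worker the item it loses least on. The example speedups are the ones from the operator table above.

```python
import bisect

class SpeedupSortedQueue:
    """Receiver-side queue of data buffers kept sorted by estimated GPU speedup;
    a simplified stand-in for the DDWRR/ODDS receiver-side policy."""

    def __init__(self):
        self._items = []                       # (gpu_speedup, buffer), ascending

    def put(self, buffer, gpu_speedup):
        bisect.insort(self._items, (gpu_speedup, buffer))

    def get(self, device):
        # A GPU worker takes the buffer that benefits most from the GPU;
        # a CPU worker takes the buffer that benefits least.
        return (self._items.pop() if device == "gpu" else self._items.pop(0))[1]

q = SpeedupSortedQueue()
for name, speedup in [("color-conv", 166.0), ("histogram", 8.3),
                      ("lbp", 85.9), ("cooccur", 16.8)]:
    q.put(name, speedup)
print(q.get("gpu"))   # color-conv: highest GPU speedup
print(q.get("cpu"))   # histogram: lowest GPU speedup, better left on the CPU
```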
58. Outline
• HPC
• What is it? Why?
• A Crash Course on (HPC) Computer Architecture
• History of Single “Processor” Performance
• Taxonomy of Processors, Memory Topology of Parallel Computers
• Supercomputers
• How to speed up your application?
• Focus on the common case
• Pay attention to locality
• Take advantage of parallelism
• An Example Application: Whole-Slide Histopathology
Image Analysis for Neuroblastoma
• Summary
58
59. How about Cloud Computing?
• Cloud Computing
• It is not really “Cloud”; it is someone else’s computer!
• Rent instead of buy.
• Pay for Compute, Data Storage and Transfer.
• Our current best bet to enable sharing of large data, workflows and
computational resources.
• For “most of us” our best bet to achieve scalability and speed.
• Sample reading:
• Nature Reviews Genetics 11, 647-657 (September 2010) | doi:10.1038/nrg2857
• Computational solutions to large-scale data management and analysis
• Eric E. Schadt, Michael D. Linderman, Jon Sorenson, Lawrence Lee, and Garry
P. Nolan
• http://www.nature.com/nrg/multimedia/compsolutions/slideshow.html
• See also: Correspondence by Trelles et al. | Correspondence by Schadt et al.
59
60. Summary
• How to speed up your application?
• Focus on the common case
• If only 50% can be “improved”, the best you can get is a 2x speedup!
• Pay attention to locality
• Reduce data movement
• Move computation to data
• Take advantage of parallelism
• Multiple types of parallelism: task-, data- and pipelined-parallelism
• The fastest processor does not mean your application will run fast; find the most suitable
architecture.
• GPUs are good for “regular” computations
• GPUs can be up to 10x faster compared to a multi-core CPU; in many real-life
applications, it is usually 3-5x
60