This document discusses the evolution of computing from compute-centric to data-centric. It introduces concepts such as the memory-wall problem and the concurrent average memory access time (C-AMAT) model, which accounts for data-access concurrency. It also discusses solutions for improving parallel I/O performance, such as prefetching, data-layout optimization, and the merging of HPC and big-data frameworks. The challenges in data-intensive computing, and how the author's lab can help with applications and methods, are also summarized.
Iterative computations are at the core of the vast majority of data-intensive scientific computations. Recent advances in data-intensive computational fields are fueling dramatic growth in both the number and the usage of such iterative computations. The utility computing model introduced by cloud computing, combined with the rich set of cloud infrastructure services, offers a viable environment for scientists to perform data-intensive computations. However, clouds by nature pose unique reliability and sustained-performance challenges to large-scale distributed computations, necessitating computation frameworks tailored to cloud characteristics so that the power of clouds can be harnessed easily and effectively. My research focuses on identifying and developing user-friendly distributed parallel computation frameworks that facilitate the optimized, efficient execution of iterative as well as non-iterative data-intensive computations in cloud environments, alongside the evaluation of heterogeneous cloud resources that offer GPGPU resources in addition to CPUs for data-intensive iterative computations.
What is high-performance computing? Examples of OpenMP, MPI, and Pthreads, and how to obtain HPC benefits using OpenSolaris.
This version was presented at OpenSolaris Day in Porto Alegre, Brazil, on April 16, 2008.
Assisting Users’ Transition to Titan’s Accelerated Architecture (inside-BigData.com)
Oak Ridge National Lab is home to Titan, the largest GPU-accelerated supercomputer in the world. This alone can be intimidating for users new to leadership computing facilities. Our facility has collected over four years of experience helping users port applications to Titan. This talk will explain common paths and tools to successfully port applications, and expose common difficulties experienced by new users. Lastly, learn how our free and open training program can assist your organization in this transition.
Bryan Thompson, Chief Scientist and Founder at SYSTAP, LLC, at MLconf NYC (MLconf)
Graph Traversal at 30 billion edges per second with NVIDIA GPUs: I will discuss current research on the MapGraph platform. MapGraph is a new and disruptive technology for ultra-fast processing of large graphs on commodity many-core hardware. On a single GPU you can analyze the bitcoin transaction graph in .35 seconds. With MapGraph on 64 NVIDIA K20 GPUs, you can traverse a scale-free graph of 4.3 billion directed edges in .13 seconds for a throughput of 32 Billion Traversed Edges Per Second (32 GTEPS). I will explain why GPUs are an interesting option for data intensive applications, how we map graphs onto many-core processors, and what the future looks like for the MapGraph platform.
MapGraph provides a familiar vertex-centric abstraction, but its GPU acceleration is 100s of times faster than main-memory CPU-only technologies and up to 100,000 times faster than graph technologies based on MapReduce or key-value stores such as HBase, Titan, and Accumulo. Learn more at http://MapGraph.io.
In this deck, Peter Braam looks at how the TensorFlow framework could be used to accelerate high-performance computing.
"Google has developed TensorFlow, a truly complete platform for ML. The performance of the platform is amazing, and it raises the question of whether it will be useful for HPC in the same manner that GPUs heralded a revolution.
As described in his talk at the CHPC 2018 Conference in South Africa, TensorFlow contains many ingredients, for example:
* many domain-specific libraries for machine learning
* the TensorFlow domain-specific data-flow language
* carefully organized input and output for data flow
* an optimizing runtime and compiler
* hardware implementations of TensorFlow operations in TensorFlow Processing Unit (TPU) chips
Learn more: https://wp.me/p3RLHQ-jMv
and
https://www.tensorflow.org/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Big Graph Analytics Systems (SIGMOD'16 Tutorial), Yuanyuan Tian
In recent years we have witnessed a surging interest in developing Big Graph processing systems. To date, tens of Big Graph systems have been proposed. This tutorial provides a timely and comprehensive review of existing Big Graph systems, and summarizes their pros and cons from various perspectives. We start from the existing vertex-centric systems, in which a programmer thinks intuitively like a vertex when developing parallel graph algorithms. We then introduce systems that adopt other computation paradigms and execution settings. The topics covered in this tutorial include programming models and algorithm design, computation models, communication mechanisms, out-of-core support, fault tolerance, dynamic graph support, and so on. We also highlight future research opportunities on Big Graph analytics.
NERSC is the production high-performance computing (HPC) center for the United States Department of Energy (DOE) Office of Science. The center supports over 6,000 users in 600 projects, using a variety of applications in materials science, chemistry, biology, astrophysics, high energy physics, climate science, fusion science, and more.
NERSC deployed the Cori system with over 9,000 Intel® Xeon Phi™ processors. This session describes the optimization strategy for porting codes that target traditional architectures to these manycore processors. We also discuss highlights and lessons learned from the optimization process on 20 applications associated with the NERSC Exascale Science Application Program (NESAP).
Large Data with Scikit-learn (Boston Data Mining Meetup), Alexis Perrier
A presentation of adaptive classification and regression algorithms available in scikit-learn, with a focus on Stochastic Gradient Descent and KNN. Performance examples on two large datasets are presented for SGD, Multinomial Naive Bayes, Perceptron, and Passive-Aggressive algorithms.
In this deck, Torsten Hoefler from ETH Zurich presents: Data-Centric Parallel Programming.
"The ubiquity of accelerators in high-performance computing has driven programming complexity beyond the skill set of the average domain scientist. To maintain performance portability in the future, it is imperative to decouple architecture-specific programming paradigms from the underlying scientific computations. We present the Stateful DataFlow multiGraph (SDFG), a data-centric intermediate representation that enables separating code definition from its optimization. We show how to tune several applications in this model and IR. Furthermore, we show a global, data-centric view of a state-of-the-art quantum transport simulator to optimize its execution on supercomputers. The approach yields coarse- and fine-grained data-movement characteristics, which are used for performance and communication modeling, communication avoidance, and data-layout transformations. The transformations are tuned for the Piz Daint and Summit supercomputers, where each platform requires different caching and fusion strategies to perform optimally. We show that SDFGs deliver competitive performance, allowing domain scientists to develop applications naturally and port them to approach peak hardware performance without modifying the original scientific code."
Watch the video: https://wp.me/p3RLHQ-kup
Learn more: http://htor.inf.ethz.ch
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
AVOIDING DUPLICATED COMPUTATION TO IMPROVE THE PERFORMANCE OF PFSP ON CUDA GPUS (csandit)
Graphics Processing Units (GPUs) have emerged as powerful parallel compute platforms for various application domains. A GPU consists of hundreds or even thousands of processor cores and adopts the Single Instruction Multiple Threading (SIMT) architecture. Previously, we proposed an approach that optimizes the Tabu Search algorithm for solving the Permutation Flowshop Scheduling Problem (PFSP) on a GPU by using a math function to generate all distinct permutations, avoiding the need to place all the permutations in global memory. Building on that result, this paper proposes another approach that further improves performance by avoiding duplicated computation among threads, which arises when two permutations share the same prefix. Experimental results show that the GPU implementation of our proposed Tabu Search for PFSP runs up to 1.5 times faster than the GPU implementation proposed by Czapinski and Barnes.
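The abstract does not spell out the math function used to generate permutations. A standard technique with the same effect is factorial-base unranking, which lets each GPU thread compute its own permutation directly from its thread index instead of reading it from a table in global memory. A minimal Python sketch of that general technique (the function name and interface are illustrative, not from the paper):

```python
from math import factorial

def unrank_permutation(items, k):
    """Return the k-th permutation (0-indexed, lexicographic) of `items`
    using the factorial number system, without enumerating all n! orders."""
    items = list(items)
    n = len(items)
    assert 0 <= k < factorial(n)
    result = []
    for i in range(n - 1, -1, -1):
        f = factorial(i)          # size of the block of permutations
        idx, k = divmod(k, f)     # which remaining item leads this block
        result.append(items.pop(idx))
    return result

# Every permutation is reachable from its rank alone, so a GPU thread t
# can derive permutation t on the fly rather than load it from memory.
print(unrank_permutation([0, 1, 2], 0))  # [0, 1, 2]
print(unrank_permutation([0, 1, 2], 5))  # [2, 1, 0]
```

On a GPU, the same arithmetic would be done per thread in the kernel; the factorials can be precomputed into constant memory.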
A Tale of Data Pattern Discovery in Parallel, Jenny Liu
In the era of IoT and A.I., distributed and parallel computing is embracing big-data-driven and algorithm-focused applications and services. Despite rapid progress in parallel frameworks, algorithms, and accelerated computing capacities, it remains challenging to deliver an efficient and scalable data-analysis solution. This talk shares research experience on data pattern discovery in domain applications. In particular, the research scrutinizes key factors in analysis workflow design and data-parallelism improvement on the cloud.
We live in an era where the atomic building elements of silicon computers, e.g., transistors and wires, are no longer visible with traditional optical microscopes and their sizes are measured in just tens of angstroms. In addition, power dissipation per unit volume is bounded by the laws of physics, which, among other effects, has resulted in stagnating processor clock frequencies. Adding more and more processor cores that perform simpler and simpler tasks, in an attempt to efficiently fill the available on-chip area, seems to be the current trend taken by the industry.
Xian-He Sun: Data-Centric Intro
1. Computing: From Compute-Centric to Data-Centric
Xian-He Sun, Illinois Institute of Technology, sun@iit.edu
(Scalable Computing Software Lab, Illinois Institute of Technology)
2. The Design of Parallel Algorithms (CS546, Lecture 8)
* Unbalanced workload
* Communication delay
* Overhead increases with the ensemble size
3. Evolution of Computing: The Scalability (10/3/2017)
[Figure: IBM BG/P system build-up; source: ANL ALCF]
* Chip: 4 cores, 850 MHz, 8 MB EDRAM, 13.6 GF/s
* Compute card: 1 chip, 20 DRAMs, 2.0 GB DDR; supports 4-way SMP
* Node board: 32 compute cards (32 chips, 4x4x2), 0-2 I/O cards, 435 GF/s, 64 GB
* Rack: 32 node cards, 1024 chips, 4096 procs, 14 TF/s, 2 TB
* Petaflops system: 72 racks, cabled 8x8x16, 1 PF/s, 144 TB
* Maximum system: 256 racks, 3.5 PF/s, 512 TB
* Front-end node / service node: System p servers, Linux SLES10; HPC software: compilers, GPFS, ESSL, LoadLeveler
4. The Memory-Wall Problem
[Figure: performance vs. year, 1980-2010, log scale from 1 to 100,000, for uni-processors, multi-core/many-core processors, and memory; annotated growth rates of 25%, 52%, 20%, and 60% per year for processors and 9% for memory; sources: Intel, OCZ]
* Processor performance increases rapidly: ~52% per year for uni-processors until 2004; aggregate multi-core/many-core processor performance even higher since 2004
* Memory: ~9% per year; storage: ~6% per year
* The processor-memory speed gap keeps increasing
* Memory-bounded speedup (1990); memory-wall problem (1994)
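A quick back-of-the-envelope calculation (not from the deck) shows how the growth rates on this slide compound into a "wall"; assuming the ~52%-per-year processor and ~9%-per-year memory improvements hold for a decade:

```python
# Growth rates taken from the slide: processor ~52%/year (pre-2004),
# memory ~9%/year. The ratio of the two compounds every year.
cpu_rate, mem_rate = 1.52, 1.09
years = 10
gap = (cpu_rate / mem_rate) ** years
print(f"processor/memory gap after {years} years: {gap:.1f}x")  # ~27.8x
```

Even over a single decade the relative gap grows by well over an order of magnitude, which is why deepening the memory hierarchy alone eventually stops helping.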
5. Current Solution: Memory Hierarchy
[Figure: the memory hierarchy from the upper (faster, smaller) to the lower (larger, slower) levels, showing capacity, access time, bandwidth, and the staging transfer unit and its manager between levels]
* CPU registers: <8 KB; 0.2~0.5 ns; 500~800 GB/s/core; instruction operands, 1-8 bytes (program/compiler)
* Cache: <50 MB; 1-10 ns; 50~150 GB/s/core; blocks, 32-128 bytes (cache controller)
* Main memory: gigabytes; 50-100 ns; 5~10 GB/s/channel; pages, 4 KB-4 MB (OS)
* Disk: terabytes; ~5 ms; 100~300 MB/s; files, megabytes (user/operator)
* Tape: petabytes or effectively infinite; seconds to minutes
11. Model of Data Access (Memory Hierarchy)
The conventional AMAT (Average Memory Access Time) model:
AMAT = H1 + MR1 × AMP1, where AMP1 = H2 + MR2 × AMP2, so
AMAT = H1 + MR1 × (H2 + MR2 × (H3 + MR3 × AMP3))
The new C-AMAT (Concurrent AMAT) model:
C-AMAT1 = H1/C_H1 + MR1 × κ1 × C-AMAT2
where C-AMAT2 = H2/C_H2 + pMR2 × pAMP2/C_M2
and κ1 = (pMR1/MR1) × (pAMP1/AMP1) × (C_m1/C_M1); here κ1 is the overlapping contribution
X.-H. Sun and D. Wang, "Concurrent Average Memory Access Time," IEEE Computer, vol. 47, no. 5, pp. 74-80, May 2014
12. Data Access Time: AMAT
[Figure: on-die L1, L2, and L3 caches and main memory (DRAM); hit times: L1 = 1 clk, L2 = 10 clks, L3 = 20 clks, memory = 300 clks]
Average Memory Access Time (AMAT)
= Thit(L1) + Miss%(L1) × (Thit(L2) + Miss%(L2) × (Thit(L3) + Miss%(L3) × T(memory)))
Example (latencies as shown above):
Miss rates: L1 = 10%, L2 = 5%, L3 = 1% (be careful with the miss-rate definition; these are local, per-level rates)
AMAT = 1 + 0.10 × (10 + 0.05 × (20 + 0.01 × 300)) = 2.115 clks
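The recursive AMAT expression evaluates innermost-first: the memory time feeds L3's miss penalty, L3 feeds L2, and so on. A minimal Python sketch (not part of the deck) that reproduces the slide's 2.115-cycle result:

```python
def amat(hit_times, miss_rates, mem_time):
    """AMAT with local (per-level) miss rates: fold the hierarchy from
    main memory back up to L1."""
    t = mem_time
    for h, mr in zip(reversed(hit_times), reversed(miss_rates)):
        t = h + mr * t  # this level's hit time plus its expected miss penalty
    return t

# Slide numbers: L1/L2/L3 hit times of 1/10/20 clks, memory 300 clks,
# local miss rates of 10%/5%/1%.
print(round(amat([1, 10, 20], [0.10, 0.05, 0.01], 300), 3))  # 2.115
```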
13. Data Access Time: C-AMAT
[Figure: the same hierarchy; hit times: L1 = 1 clk, L2 = 10 clks, L3 = 20 clks, memory = 300 clks; hit concurrencies: L1 = 2, L2 = 3, L3 = 4, memory = 6]
Concurrent Average Memory Access Time (C-AMAT)
= H1/C_H1 + MR1 × κ1 × (H2/C_H2 + MR2 × κ2 × (H3/C_H3 + MR3 × κ3 × HMem/C_HMem))
Example
Miss rates: L1 = 10%, L2 = 5%, L3 = 1%
(pMR, pAMP, AMP, C_M, C_m): L1 = (7%, 10, 10, 5, 4); L2 = (3%, 60, 40, 9, 6); L3 = (0.8%, 400, 300, 16, 12)
κ: L1 = 0.56, L2 = 0.6, L3 = 0.8
C-AMAT ≈ 0.696 clks
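Plugging the slide's numbers into the C-AMAT recursion confirms the result. A minimal Python sketch (not part of the deck), with κ computed as (pMR/MR) × (pAMP/AMP) × (C_m/C_M), consistent with the κ values and the ≈0.696 total on this slide:

```python
def kappa(pmr, mr, pamp, amp, c_m, c_M):
    # Overlapping contribution: (pMR/MR) * (pAMP/AMP) * (C_m/C_M)
    return (pmr / mr) * (pamp / amp) * (c_m / c_M)

def c_amat(hit_times, hit_conc, miss_rates, kappas):
    """C-AMAT = H1/C_H1 + MR1*k1*(H2/C_H2 + MR2*k2*(... + MR3*k3*HMem/C_HMem))"""
    t = hit_times[-1] / hit_conc[-1]  # main memory: HMem / C_HMem
    for h, c, mr, k in zip(hit_times[-2::-1], hit_conc[-2::-1],
                           reversed(miss_rates), reversed(kappas)):
        t = h / c + mr * k * t
    return t

# Slide numbers: (pMR, pAMP, AMP, C_M, C_m) for L1, L2, L3.
k1 = kappa(0.07, 0.10, 10, 10, 4, 5)        # ≈ 0.56
k2 = kappa(0.03, 0.05, 60, 40, 6, 9)        # ≈ 0.6
k3 = kappa(0.008, 0.01, 400, 300, 12, 16)   # ≈ 0.8
print(round(c_amat([1, 10, 20, 300], [2, 3, 4, 6],
                   [0.10, 0.05, 0.01], [k1, k2, k3]), 3))  # 0.696
```

Note that the concurrent result (≈0.70 clks) is far below the sequential AMAT of 2.115 clks for the same hierarchy, which is the point of the model: concurrency hides a large fraction of the access time.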
14. What Does C-AMAT Say?
* C-AMAT is an extension of AMAT that considers concurrency
* C-AMAT can be measured at each layer with APC
* C-AMAT is data-centric thinking
* Data access is as important as computing
* High locality may hurt performance: the pure-miss concept
* Balance locality, concurrency, and overlapping with C-AMAT
* C-AMAT uniquely integrates the joint impact of locality, concurrency, and overlapping for optimization (analysis and measurement)
15. Pace-Matching Data Access
[Figure: four memory layers between the processor side and the off-chip side, with cases #1-#4 of matching data-access paces across layers]
No-delay data access, using C-AMAT as the gate to guide and LPM as the global controller for in-situ optimization.
16. I/O Optimization: From Prefetching to Merging
Application-aware optimization: methods (LPM) and software; model (C-AMAT) and measurement
* Prefetching: SC'08, "Parallel I/O Prefetching Using MPI File Caching and I/O Signatures"
* Data layout: HPDC'11, "A Cost-Intelligent Application-Specific Data Layout Scheme for Parallel File Systems"
* Data coordination: SC'11, "Server-Side I/O Coordination for Parallel File Systems"
* Data organization: PDSW'11, "Pattern-Aware File Reorganization in MPI-IO"
* Data replication: IPDPS'13, "Layout-Aware Replication Scheme for Parallel I/O Systems"
* Cache: ICDCS'14, "S4D-Cache: Smart Selective SSD Cache for Parallel I/O Systems"
* Heterogeneity: IPDPS'15, "HAS: Heterogeneity-Aware Selective Data Layout Scheme for Parallel File Systems on Hybrid Servers"
* Merging: BigData'15, "PortHadoop: Support Direct HPC Data Processing in Hadoop"
* Data compression
Website: www.cs.iit.edu/~scs/iosig/
17. Merging with an Industrial Solution: Distributed I/O Systems
[Figure: architecture of a typical HDFS system]
Models show that, for many applications, the best optimization is to use HDFS.
18. Integrated Data Access System (IDAS)
Two camps:
o MPI for HPC
o MapReduce for big data
Native data access:
o PFS (POSIX I/O and MPI-IO)
o Big data (HDFS API)
Limited support for data access across the two:
o Explicit serial copies
o Redundant data stored in HDFS
* Merging is more than access (maintain the merits of both)
Yang, N. Liu, B. Feng, X.-H. Sun and S. Zhou, "PortHadoop: Support Direct HPC Data Processing in Hadoop," in Proc. of the IEEE International Conference on Big Data (IEEE BigData 2015), Santa Clara, CA, USA, Oct. 2015.
19. Application: NASA Earth Science Data Analysis (Super-Cloud Project)
Simulation is done in an MPI environment and data analysis is done in a Hadoop environment.
20. PortHadoop-R vs. Conventional Methods: Performance (Vis-Only)
[Figure: elapsed time (seconds, 0-6000) vs. number of files (48, 96, 192, 384) for Serial Vis, Serial Copy, Parallel Vis, Parallel Copy, PortHadoop-R Vis, and Serial Animation]
* Serial copy: one copy command issued on one Hadoop node
* Parallel copy: 4 copy commands issued on each Hadoop node (32 in total)
* Serial vis: read all data into one fat node, then visualize
* Parallel vis: visualize all files on all Hadoop nodes
* PortHadoop-R vis: visualize all files into images
* Serial animation: merge all images into one animation
Speedup(PortHadoop) = 1.47; Speedup(total) = 15
Good speedup: total time is reduced by up to 15x compared to the serial solution and 1.47x compared to the parallel solution.
21. What Are the Challenges in Computing?
* When communication is the issue: domain decomposition, pipelining
* When locality (the memory hierarchy) is the concern: block-based computing
* When memory bound is the constraint: memory-bound analysis
* When data-access concurrency is a determining factor: what is the new frame of algorithm design?
22. How We Can Help You
* If you have big-data applications
* If you have data-intensive scientific applications
* If you need both computing and data power
* If you are interested in data-centric algorithm design
How You Can Help Us
* Application examples
* Methods: probability sampling, etc.
23. An Unbalanced System (CS570: Advanced Computer Architecture)
Source: Bob Colwell's keynote at ISCA 29, 2002: http://systems.cs.colorado.edu/ISCA2002/Colwell-ISCA-KEYNOTE-2002-final.ppt