Recent advances in large-scale experimental facilities have ushered in an era of data-driven science. These large-scale data increase the opportunity to answer many fundamental questions in basic science. However, they also pose new challenges to the scientific community in terms of optimal processing. Consequently, scientists are in dire need of robust high-performance computing (HPC) solutions that can scale to terabytes of data.
In this talk, I will address the challenges of two major aspects of scientific big data processing: 1) developing scalable software and algorithms for data- and compute-intensive scientific applications, and 2) proposing new cluster architectures that these applications and software tools need for good performance. I will mainly address the challenges involved in large-scale genome analysis applications, such as genomic error correction and genome assembly, which have recently moved to the forefront of big data challenges as sequencing machines outpace Moore's law by several orders of magnitude.
In the first part, I will address the challenges involved in developing scalable algorithms to process huge amounts of genomic big data using the power of recent analytic tools such as Hadoop, Giraph, and distributed NoSQL. The algorithms are carefully tailored to scale to terabytes of data across hundreds of computing nodes. At a broader level, these algorithms take advantage of locality-based computing for their scalability. In this context, I will briefly talk about my general-purpose analytic framework for easily and rapidly designing embarrassingly parallel algorithms for massive-scale scientific data.
In the second part, I will address the challenges in designing the hardware environment that these data- and compute-intensive applications require for good performance. I will pinpoint the limitations of a traditional HPC cluster (supercomputer) in processing these huge amounts of genomic data with respect to these applications, and propose a solution to those limitations by balancing the storage (both I/O and memory) bandwidth with the computational speed of high-performance CPUs. I will briefly discuss my theoretical model, which can help HPC system designers who are striving for system balance.
Many of these observations and developments have been used by hardware vendors such as Samsung and IBM to develop or improve the configuration of their next-gen HPC clusters (e.g., Samsung's hyperscale computing cluster, IBM's Power8-based supercomputer) with high-speed storage and processing power.
3. Introduction
Big data genome analysis
Big data analysis framework
Big data applications and genome sequences
De novo genome assembly
De novo genomic error correction
Big data cyberinfrastructure
Evaluation of different clusters
Model for optimally balanced cluster
7. Hardware evolution: cost decreases, bandwidth increases
[Charts: increase in FLOPS of the fastest supercomputer, 1993-2012 (roughly 10^10 to 10^17); increase in bandwidth (MB/s) for storage and network, 1995-2011, showing I/O bandwidth per device and network bandwidth per cable. Panels cover processor, storage, and network.]
8. Introduction
Big data genome analysis
Big data analysis framework
Big data applications and genome sequences
De novo genome assembly
De novo genomic error correction
Big data cyberinfrastructure
Evaluation of different clusters
Model for optimally balanced cluster
11. Like restoring a damaged book from multiple copies torn at random places
The problem can be mapped to a graph-analytic problem: the de Bruijn graph
14. Modified version of the parallel list-ranking algorithm
Mark the head (h) and tail (t) and merge the h-t link
Number of rounds: O(log n), where n is the number of vertices in the longest path (illustrated in the sketch below)
[Figures (slides 14-17): successive list-ranking rounds, #1 through #6]
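The deck shows the rounds only as figures; as a rough illustration, here is a minimal synchronous pointer-jumping (list-ranking) sketch in Python. All names are mine, and the real Giraph implementation, which marks head/tail vertices and merges h-t links in parallel, is more involved.

```python
# Minimal pointer-jumping (list-ranking) sketch. Hypothetical illustration
# only: it shows why the number of rounds is O(log n) for a linked list
# given as successor pointers, as on the slide.

def list_rank(succ):
    """succ[v] = next vertex in the list, or None at the tail.
    Returns rank[v] = number of hops from v to the tail."""
    rank = {v: (0 if succ[v] is None else 1) for v in succ}
    jump = dict(succ)
    changed = True
    while changed:                        # O(log n) synchronous rounds
        changed = False
        new_jump, new_rank = {}, dict(rank)
        for v in jump:                    # runs in parallel in the Giraph version
            nxt = jump[v]
            if nxt is not None and jump[nxt] is not None:
                new_rank[v] = rank[v] + rank[nxt]   # rank to the doubled target
                new_jump[v] = jump[nxt]             # pointer doubling
                changed = True
            else:
                new_jump[v] = nxt
        jump, rank = new_jump, new_rank
    return rank

# Example: a 5-vertex path a->b->c->d->e resolves in ~log2(5) rounds
print(list_rank({'a': 'b', 'b': 'c', 'c': 'd', 'd': 'e', 'e': None}))
# {'a': 4, 'b': 3, 'c': 2, 'd': 1, 'e': 0}
```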
Bubbles: vertices with the same predecessor and the same successor
A Levenshtein-like edit-distance algorithm is used
If the distance is below a threshold, the vertex with the minimum frequency is removed
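As a rough sketch of the bubble-popping test (the talk only says a "Levenshtein-like" distance is used; the exact variant and threshold are not specified), a standard dynamic-programming edit distance suffices to illustrate the idea:

```python
# Standard Levenshtein edit distance via dynamic programming, plus a
# hypothetical bubble-popping helper. Names and the threshold value are
# mine, not from the talk.

def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def pop_bubble(branch1, branch2, freq1, freq2, threshold=3):
    """If the two branches are near-identical, keep the higher-frequency one."""
    if levenshtein(branch1, branch2) < threshold:
        return branch1 if freq1 >= freq2 else branch2    # survivor
    return None                                          # keep both branches

print(levenshtein("ACGTACGT", "ACGTTCGT"))  # 1 (single substitution)
```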
22. XSEDE resource
LSU SuperMIC HPC cluster is used
Maximum #nodes: 128
Cores/node: 20 (two 10-core Intel Ivy Bridge)
DRAM/node: 64 GB
Disk/node: 250 GB (hard disk drive)
Network: 56 Gbps InfiniBand
24. ABySS processes failed many times due to network issues
Contrail, being disk-based, took more than the maximum allocated time for a single job

                 GiGA      ABySS    Contrail
#Contigs         3032297   -        -
NG50             827       -        -
Max contig size  35465     -        -
#Cores           512       -        -
Time (hours)     8.5       Failed   Failed
27. De Bruijn graph-based method
More scalable than overlap-based methods
The widest-path algorithm provides accuracy
28. Theory behind using the widest-path algorithm
Assume k-mer coverage is an i.i.d. random variable with CDF F(x)
Distribution of the minimum of n samples: 1 - (1 - F(x))^n
Proof sketch: the probability of containing the minimum-coverage k-mer is highest for the erroneous read, given that many reads sequenced the same region of the genome
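For completeness, the minimum-of-n distribution quoted on the slide follows from a one-line order-statistics argument (notation assumed from the slide):

```latex
% X_1,\dots,X_n: i.i.d. k-mer coverages with CDF F(x).
% The minimum exceeds x only if every sample does:
\Pr\!\left[\min_i X_i > x\right] = \prod_{i=1}^{n} \Pr[X_i > x] = \bigl(1-F(x)\bigr)^{n}
\quad\Longrightarrow\quad
F_{\min}(x) = 1 - \bigl(1-F(x)\bigr)^{n}
```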
29. There may be many erroneous k-mers with high coverage
But the probability of finding the minimum coverage is significantly higher in the error path
Hence, a widest-path algorithm is used to select the correct path
30. Hadoop (MapReduce): for computation
Hazelcast (In-memory NoSQL): for de Bruijn graph storage
31. Map: emits three k-mers
First k-mer: incoming edge
Middle k-mer: vertex
Third k-mer: outgoing edge
Coverage of middle k-mer: 1
Reduce: group by vertex
Aggregate incoming edges
Aggregate outgoing edges
Sum coverages
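A toy, single-process rendition of this map/reduce pattern may make the emission scheme concrete (function names and the in-memory "reduce" are mine; the real job runs on Hadoop with Hazelcast as the graph store):

```python
# Toy rendition of the de Bruijn graph construction job.
# Map: slide each read into (k+2)-length windows and emit the middle k-mer
# (vertex) with its incoming/outgoing k-mers and coverage 1.
# Reduce: group by vertex, aggregate edge sets, and sum coverages.

def map_read(read, k):
    """Emit (vertex k-mer, incoming k-mer, outgoing k-mer, coverage=1)."""
    for i in range(len(read) - (k + 2) + 1):
        window = read[i:i + k + 2]
        yield window[1:k + 1], window[0:k], window[2:k + 2], 1

def reduce_kmers(records):
    graph = {}
    for vertex, in_edge, out_edge, cov in records:
        node = graph.setdefault(vertex, {"in": set(), "out": set(), "cov": 0})
        node["in"].add(in_edge)
        node["out"].add(out_edge)
        node["cov"] += cov
    return graph

reads = ["ACGTAC", "CGTACG"]
records = (rec for read in reads for rec in map_read(read, k=3))
for vertex, node in reduce_kmers(records).items():
    print(vertex, node)
```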
32. Hadoop with Hazelcast
Error detection: a Hadoop map-only job
k-mer coverage < threshold → error k-mer
Millions of searches over the entire dataset
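A sketch of the map-only detection step, with a plain dict standing in for the Hazelcast k-mer store (an assumption for illustration; all names are mine):

```python
# Map-only error-detection sketch: each mapper looks up the coverage of
# every k-mer of a long read in a shared key-value store (a dict stands in
# for Hazelcast here) and flags k-mers below a coverage threshold.

def find_error_kmers(read, k, coverage, threshold):
    """Return positions of suspect (low-coverage) k-mers in the read."""
    errors = []
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if coverage.get(kmer, 0) < threshold:
            errors.append((i, kmer))
    return errors

coverage = {"ACG": 40, "CGT": 38, "GTA": 2}   # hypothetical counts
print(find_error_kmers("ACGTA", k=3, coverage=coverage, threshold=5))
# [(2, 'GTA')]
```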
33. Widest-path algorithm
Maximize the minimum k-mer coverage along the path in the de Bruijn graph
Modified version of Dijkstra's algorithm
Similar time complexity
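A minimal widest-path (maximum-bottleneck) variant of Dijkstra's algorithm, assuming edge "width" is the coverage of the destination k-mer; this is an illustrative sketch, not ParLECH's actual code:

```python
# Widest (maximum-bottleneck) path via a modified Dijkstra's algorithm.
# Goal: the path whose minimum k-mer coverage is as large as possible.
import heapq

def widest_path(graph, src, dst):
    """graph[u] = [(v, coverage_v), ...]; returns (bottleneck, path)."""
    best = {src: float("inf")}
    heap = [(-float("inf"), src, [src])]       # max-heap via negated widths
    while heap:
        neg_w, u, path = heapq.heappop(heap)
        width = -neg_w
        if u == dst:
            return width, path                  # first pop of dst is optimal
        if width < best.get(u, -1):
            continue                            # stale heap entry
        for v, cov in graph.get(u, []):
            w = min(width, cov)                 # bottleneck so far
            if w > best.get(v, -1):
                best[v] = w
                heapq.heappush(heap, (-w, v, path + [v]))
    return 0, []

# Hypothetical de Bruijn fragment with two branches of a bubble:
g = {"s": [("a", 30), ("b", 4)], "a": [("t", 25)], "b": [("t", 40)]}
print(widest_path(g, "s", "t"))   # (25, ['s', 'a', 't'])
```

Keying the heap on the negated bottleneck width preserves Dijkstra's structure, which is presumably the "similar time complexity" noted on the slide.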
34. PacBio data
Dataset    #Reads      Data size (GB)  Read length  %Reads aligned
E. coli    1129576     1.032           1120         78.97
Yeast      2315594     0.53            5874         82.12
Fruit fly  6701498     55              4328         51.14
Human      23897260    312             6587         72.3

Illumina data
Dataset    #Reads      Data size (GB)  Read length  %Reads aligned
E. coli    45440200    13.50           101          99.44
Yeast      4503422     1.20            101          93.75
Fruit fly  179363706   59              101          95.56
Human      1420689270  452             101          79.60
35. %Reads aligned: percentage of the corrected long reads aligned to the reference genome
%ReadsAligned = AlignedReads / TotalReads * 100
%Base pairs aligned: percentage of the base pairs (of total base pairs) of the corrected long reads aligned to the reference genome
%BasePairsAligned = AlignedBases / TotalBases * 100
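These metrics are plain ratios; a two-line helper (names mine) for clarity:

```python
# The two alignment metrics from the slide, as plain ratios.
def pct_reads_aligned(aligned_reads, total_reads):
    return aligned_reads / total_reads * 100

def pct_bases_aligned(aligned_bases, total_bases):
    return aligned_bases / total_bases * 100

# Hypothetical counts: 936,900 of 1,000,000 reads aligned -> 93.69%
print(pct_reads_aligned(936_900, 1_000_000))
```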
36. Widest path (WP): select the path in the de Bruijn graph that maximizes the minimum k-mer coverage
Leverages the coverage information while correcting the error
Dijkstra's shortest path (SP): select the shortest path without using any coverage information
Coverage information is used only when the de Bruijn graph is constructed; k-mers below a threshold are removed from the graph
1-step greedy (Gr): select the successor k-mer with the highest coverage
High chance of selecting the wrong path
Stopped after a predefined number of hops
37. Widest path shows the best performance
Greedy shows the worst performance
k is set to 15

Data       Algorithm    %Reads aligned  %Base pairs aligned
E. coli    ParLECH-WP   93.69           92.15
           ParLECH-SP   87.55           86.49
           ParLECH-Gr   76.68           70.92
Yeast      ParLECH-WP   86.07           89.31
           ParLECH-SP   84.92           86.44
           ParLECH-Gr   75.77           74.68
Fruit fly  ParLECH-WP   65.92           62.42
           ParLECH-SP   54.53           49.41
           ParLECH-Gr   43.97           37.44
38. ParLECH aligned more reads and base pairs to the reference genome compared to LoRDEC
k is set to 15

Data       Algorithm  %Reads aligned  %Base pairs aligned
E. coli    ParLECH    93.69           92.15
           LoRDEC     87.55           86.49
           Original   78.97           75.07
Yeast      ParLECH    86.07           89.31
           LoRDEC     84.92           87.08
           Original   82.12           88.69
Fruit fly  ParLECH    65.92           62.42
           LoRDEC     54.53           49.69
           Original   51.14           46.04
39. XSEDE resource
LSU SuperMIC HPC cluster is used
Maximum #nodes: 128
Cores/node: 20 (two 10-core Intel Ivy Bridge)
DRAM/node: 64 GB
Disk/node: 250 GB (hard disk drive)
Network: 56 Gbps InfiniBand
40. LoRDEC performs better on a single node
ParLECH outperforms it when more nodes are added
[Chart: execution time (min) vs. #nodes (1-32) for ParLECH and LoRDEC]
41. Almost linear scalability
[Chart, log-log: execution time (min) vs. number of nodes (16-128) for the KmerCount, LocateError, and CorrectError phases and the total]
42. A total of 764GB of data is processed
Appreciable accuracy
LoRDEC could not process it (could not produce the de Bruijn graph)

PacBio data size: 312GB
Illumina data size: 452GB
#Nodes used: 128
k: 17
Time: 28.6 hours
%Reads aligned: 78.3
%Base pairs aligned: 75.43
43. Desired software characteristics for big data genome analysis:
Distributed
Scalable
Low cost
Considers data locality
Capable of working on commodity hardware
Develop algorithms using the big data analytics model
Better performance than other MPI-based software on a traditional HPC environment
Can we get better performance by changing the hardware infrastructure?
44. Introduction
Big data genome analysis
Big data analysis framework
Big data applications and genome sequences
De novo genome assembly
De novo genomic error correction
Big data cyberinfrastructure
Evaluation of different clusters
Model for optimally balanced cluster
47. Network issues
Fat-tree architecture with 2:1 blocking
Low effective bandwidth → current programming models need high bandwidth
Storage issues
Few directly attached devices (normally hard disk drives)
Low I/O bandwidth → big data jobs become I/O-bound
Memory issues
Low RAM per core → significant tradeoff between the degree of data parallelism and the memory requirement
Low buffer size increases data spilling to disk → causes a significant performance drop with HDDs
50. Bumble bee
Stage                 Job type               Input                 Final output          #Jobs  Shuffled data  HDFS data
Graph construction    Hadoop                 90GB (500M reads)     95GB                  2      2TB            136GB
Graph simplification  Series of Giraph jobs  95GB (715M vertices)  640MB (62K vertices)  15     -              966GB

Human
Stage                 Job type               Input                 Final output          #Jobs  Shuffled data  HDFS data
Graph construction    Hadoop                 452GB (2B reads)      3TB                   2      9.9TB          3.2TB
Graph simplification  Series of Giraph jobs  3TB (1.5B vertices)   3.8GB (3M vertices)   15     -              4.1TB

#Nodes used: SuperMikeII: 15; SwatIII-Basic-HDD/SSD: 15; SwatIII-Memory: 15
51. 40Gbps InfiniBand + 2:1 blocking vs. 10Gbps Ethernet + no blocking → similar performance
SSD vs. HDD → Hadoop shows 50% improvement
256GB vs. 32GB DRAM → Hadoop shows 70% and Giraph shows 35% improvement
[Chart: effect of network (InfiniBand vs. Ethernet) on assembling the 90GB bumble bee genome; execution time normalized to SuperMikeII for graph construction, graph simplification, and the entire pipeline; SwatIII-Basic-HDD values: 1.012, 1.033, 1.025]
[Chart: effect of storage type (HDD vs. SSD) and RAM size while assembling the bumble bee genome; execution time normalized to SuperMikeII for SwatIII-Basic-SSD and SwatIII-Memory; labeled values: 0.5, 0.3, 0.96, 0.65, 0.79, 0.67]
52. Scaled-up cluster: more execution time, but more performance/$
HDD and SSD show almost the same execution time
HDD shows better performance/$ than SSD
[Charts: performance/$ and execution time for the 90GB bumble bee genome assembly (graph construction, graph simplification, entire pipeline), normalized to SuperMikeII]
53. Fewer scaled-up servers
Better than a traditional HPC cluster (3-4x benefit in performance/$)
HDD performs similarly to SSD
HDD shows better performance/$ than SSD
[Charts: for the human genome (452GB), execution time normalized to SuperMikeII (labeled values: 1.006, 1.128, 0.898, 1.023, 0.999, 1.077) and performance/$ (labeled values: 3.17, 4.36, 3.88, 4.79, 3.65, 4.21) across graph construction, graph simplification, and the entire pipeline]
54. 1 SSD performs similarly to 4 HDDs
The disk controller saturates at ~500MB/s
Adding more disks (HDD or SSD) does not improve performance any further
[Chart: execution time (s) by number and type of directly attached disks per DataNode: 1 HDD 5740, 2 HDD 4429, 4 HDD 3333, 1 SSD 2939, 2 SSD 2732]
55. Hyperscale system prototype
32 low-power nodes: 2 cores, 1 SSD, and 16GB RAM per node
10% better performance than SuperMikeII (16 cores, 1 HDD, and 32GB RAM per node)
More than twice the improvement in performance/$
[Charts: for the 90GB bumble bee genome assembly, execution time normalized to SuperMikeII (0.93, 0.89, 0.90 for graph construction, graph simplification, and the entire pipeline) and performance/$ (2.16, 2.24, 2.215)]
56. Increase compute bandwidth
Power8 processor has 8-way SMT
16 memory controllers
Increase I/O bandwidth
Many HDDs per node
I/O and compute distributed across SMT threads
Increase network bandwidth
Clos topology with no blocking
57. Intel Knights Landing (KNL) cluster
Low energy consumption
Knights Landing processor with a lower clock speed
Increased compute and I/O parallelism
4-way SMT (instead of 2-way hyperthreading)
Non-volatile RAM: high-bandwidth flash memory
Nvidia GPU cluster
General-purpose GPUs (GPGPU)
Work in conjunction with Intel or IBM Power8 processors
NVLink (high-speed connection between the IBM Power processor and the GPU)
58. Limitations in traditional HPC clusters and data centers:
Network
Storage
Memory
Huge tradeoff between performance and cost
How do we model these observations to develop an optimal cluster architecture?
59. A Theoretical Model to Build Cost-Effective Balanced HPC Infrastructure for Data-Driven Science
60. Amdahl's I/O number for a balanced system:
1 bit (0.125 bytes) of I/O per second per instruction per second (IPS)
Amdahl's memory number for a balanced system:
1 byte of memory per IPS
Limitation
One-size-fits-all: does not consider the impact of the application's characteristics
61. Modified Amdahl's I/O number
8 MIPS per MBps of I/O
Measured on the relevant application
Modified Amdahl's memory number
The MB/MIPS ratio is rising from 1 to 4
Limitations
Does not consider the cost component
Observations only: no theoretical background
63. Ignores overlap of work done by I/O and memory
Ignores the CPU microarchitecture
Considers the number of instructions executed per cycle (IPC) as proportional to the CPU core frequency
69.
Cluster         SuperMikeII        SwatIII            CeresII
Cluster type    Traditional HPC    Datacenter         MicroBrick
β_io            0.003              0.015              0.166
β_mem           0.77               6.15               5.33
γ_io            0.0005             0.01               1.03
γ_mem           0.06               1.47               1.25
Optimized for   Compute-intensive  Compute- and       I/O-, compute-, and
                applications only  memory-intensive   memory-intensive
                                   applications       applications
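The optima quoted later in the talk are β_io_opt = 0.17 and β_mem_opt = 2.7; a small Python sketch comparing each cluster's balance numbers (from this table) against them (the percentage framing is mine, for illustration):

```python
# Compare each cluster's balance ratios (from the slide) against the
# model's optima; the percentage framing is illustrative only.
clusters = {
    "SuperMikeII": {"beta_io": 0.003, "beta_mem": 0.77},
    "SwatIII":     {"beta_io": 0.015, "beta_mem": 6.15},
    "CeresII":     {"beta_io": 0.166, "beta_mem": 5.33},
}
optimal = {"beta_io": 0.17, "beta_mem": 2.7}   # from the talk's model

for name, b in clusters.items():
    io_ratio = b["beta_io"] / optimal["beta_io"]
    mem_ratio = b["beta_mem"] / optimal["beta_mem"]
    print(f"{name}: beta_io at {io_ratio:.0%} of optimal, "
          f"beta_mem at {mem_ratio:.0%} of optimal")
```

This reproduces the talk's classification: CeresII's β_io sits near the optimum, SuperMikeII's is far below it, and SwatIII overshoots heavily on memory.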
70. Benchmark applications
Terasort: Hadoop; input 1TB; output 1TB; shuffled data 1TB; map is CPU-intensive, reduce is I/O-intensive
Wordcount: Hadoop; input 1TB; output 1TB; shuffled data 1TB; map and reduce are I/O- and CPU-intensive
Genome assembly phase 1: Hadoop; input 452GB (2bn short reads); output 3TB; shuffled data 9.9TB; map and reduce are CPU- and I/O-intensive
Genome assembly phase 2: Giraph; input 3.2TB (1.5bn vertices); output 3.8GB; memory-intensive
71. Lower is better (price-to-performance of SuperMikeII is normalized to 1)
CeresII vs. SuperMikeII: >65% improvement for both applications
CeresII vs. SwatIII: >50% improvement for both applications
[Chart: price-to-performance normalized to SuperMikeII for TeraSort and WordCount on SuperMikeII, SwatIII, and CeresII; labeled values: 0.76, 0.37, 0.79, 0.35]
72. Lower is better (price-to-performance of SuperMikeII is normalized to 1)
CeresII vs. SuperMikeII: 88% and 85% improvement for phase-1 and phase-2, respectively
CeresII vs. SwatIII: 50% and 20% improvement for phase-1 and phase-2, respectively
[Chart: price-to-performance normalized to SuperMikeII for graph construction and graph simplification; labeled values: 0.24, 0.12, 0.22, 0.15]
73. For data-driven applications at current hardware prices:
Amdahl's I/O number (β_io_opt) should be increased compared to Gray's law (from 0.125 to 0.17)
Amdahl's memory number (β_mem_opt) should be decreased compared to Gray's law (from 4 to 2.7)
For HPC clusters:
β_io and β_mem provide an easy-to-use alternative to FLOPS for I/O- and memory-bound applications
Informed choice among hardware components when investing in an HPC cluster, when application characteristics are not known
74. Application of deep learning and AI methodologies to genomics
Key-value memory networks
Metagenomic assembly and error correction
Transferring big genomic data on a blockchain
Security of the sensitive data
High throughput
Current collaborations
San Diego Supercomputer Center
IBM OpenPower
75. Thanks to the faculty and staff of LSU and UW-Platteville
Dr. Seung-Jong Park, Dr. Kisung Lee, Dr. Seungwon Yang, Dr. Jianhua Chen, Dr. Praveen Koppa, Dr. Sayan Goswami, Dr. Richard Platania, Dr. Chui-hui Chiu, Dipak Singh, Dr. Lisa Landgraf, etc.
Samsung SSD team
▪ Jaeki Hong, Jay Seo, Jinki Kim, Wooseok Chang, etc.
IBM Power8 and OpenPower team
▪ Terry Leatherland, Ravi Arimilli, Ganesan Narayanswami, etc.
Other collaborators
▪ Dr. Ling Liu (GATECH)
Bioscience experts
▪ Dr. Joohyun Kim, Dr. Nayong Kim, Dr. Maheshi Dassanayake, Dr. Dong-Ha Oh, etc.
76. This work was supported in part by
NIH-P20GM103424
NSF-MRI-1338051
NSF-CC-NIE-1341008
NSF-IBSS-L-1620451
LA BoR LEQSF(2016-19)-RD-A-08
The HPC services are provided by
LSU HPC
LONI
Samsung Research S. Korea
IBM Research Austin
77. "Developing a Meta Framework for Key-Value Memory Networks on HPC Clusters." Choonhan Youn, Arghya Kusum Das, Seungwon Yang, Joohyun Kim. PEARC 2019. (Collaborative work: UW-Platteville, LSU, and San Diego Supercomputer Center)
"ParLECH: Parallel Long-read Error Correction with Hadoop." Arghya Kusum Das, Seung-Jong Park, Kisung Lee. IEEE BIBM 2018.
"A High-Throughput Interoperability Architecture over Ethereum and Swarm for Big Biomedical Data." Arghya Kusum Das, Seung-Jong Park, Kisung Lee. IEEE CHASE 2018 (Blockchain Workshop).
"Large-scale parallel genome assembler over cloud computing environment." Arghya Kusum Das, Praveen Kumar Koppa, Sayan Goswami, Richard Platania, Seung-Jong Park. JBCB, May 23, 2017 issue.
"ParSECH: Parallel Sequencing Error Correction with Hadoop for Large-Scale Genome Sequences." Arghya Kusum Das, Shayan Shams, Sayan Goswami, Richard Platania, Kisung Lee, Seung-Jong Park. BiCOB 2017.
"Lazer: A Memory-Efficient Framework for Large-Scale Genome Assembly." Sayan Goswami, Arghya Kusum Das, Richard Platania, Kisung Lee, Seung-Jong Park. IEEE Big Data 2016.
78. "Evaluating Different Distributed-Cyber-Infrastructure for Data and Compute Intensive Scientific Application." Arghya Kusum Das, Jaeki Hong, Sayan Goswami, Richard Platania, Wooseok Chang, Seung-Jong Park. IEEE Big Data 2015. [In collaboration with Samsung Electronics Ltd., S. Korea]
"Augmenting Amdahl's Second Law: A Theoretical Model for Cost-Effective Balanced HPC Infrastructure for Data-Driven Science." Arghya Kusum Das, Jaeki Hong, Sayan Goswami, Richard Platania, Kisung Lee, Wooseok Chang, Seung-Jong Park. IEEE Cloud 2017. [In collaboration with Samsung Electronics Ltd., S. Korea]
"IBM POWER8® HPC System Accelerates Genomics Analysis with SMT8 Multithreading." Arghya Kusum Das, Sayan Goswami, Richard Platania, Seung-Jong Park, Ram Ramanujam, Gus Kousoulas, Frank Lee, Ravi Arimilli, Terry Leatherland, Joana Wong, John Simpson, Grace Liu, Jinchun Wang. White paper, Louisiana State University collaboration with IBM.
"BIC-LSU: Big Data Research Integration with Cyberinfrastructure for LSU." Chui-hui Chiu, Nathan Lewis, Dipak Kumar Singh, Arghya Kusum Das, Mohammad M. Jalazai, Richard Platania, Sayan Goswami, Kisung Lee, Seung-Jong Park. XSEDE 2016.
Good morning, everybody. I am Arghya Kusum Das from CCT, LSU. Today I am going to present our paper titled "Augmenting Amdahl's Second Law: A Theoretical Model to Build Cost-Effective Balanced HPC Infrastructure for Data-Driven Science."
The most popular and effective law was proposed by computer scientist Gene Amdahl in the 1960s, where he stated that a balanced system needs one bit of I/O per second per CPU instruction per second. This is known as Amdahl's I/O number. Regarding memory, he stated that a balanced system needs one byte of memory per CPU instruction per second. This is known as Amdahl's memory number.
The major limitation of the law is that it proposes a one-size-fits-all type of design, which does not consider the impact of application characteristics, which are changing frequently nowadays.
To address this limitation, computer scientist Jim Gray modified the original law: he kept the I/O number the same as in the original law, but specified that it should be measured on the relevant application. Regarding the memory number, he observed that it is rising from 1 to 4.
Although Jim Gray considered the application characteristics for the I/O number, he did not consider the cost component. Furthermore, the memory number is simply an observation, which does not have any theoretical background.
This slide shows the concrete problem definition. We need to modify Amdahl's I/O and memory numbers, that is, beta-io-opt and beta-mem-opt, as a function of the application balance and the cost balance. The application balance is a measurement of whether the application is I/O-intensive, CPU-intensive, or memory-intensive; basically, it is the ratio between the required I/O bandwidth and the required CPU speed, or the ratio between the required memory and the required CPU speed. On the other hand, the cost balance is the ratio between the I/O cost per Gbps (or the memory cost per GB) and the CPU cost per gigahertz.
This slide shows the model's assumptions. First, the model is additive in nature, that is, it ignores any overlap between I/O and memory operations. This is a valid assumption because when a CPU is busy doing I/O, it does not do any memory operation.
Second, it ignores any CPU microarchitecture, which means it considers the number of instructions executed per cycle as proportional to the CPU core frequency.
[Microarchitecture is ignored in Alex Szalay's paper on Amdahl-balanced blades. Ref: Szalay, Alexander S., et al. "Low-power Amdahl-balanced blades for data intensive computing." ACM SIGOPS Operating Systems Review 44.1 (2010): 71-75.]
While deriving the model, at first we take the product of the time spent at each hardware component and the cost of the corresponding hardware component. Then we simply replace the resulting product with the balance terminology, that is, beta-io and beta-mem. After that, we take a partial derivative of the price-to-performance ratio with respect to beta-io and beta-mem. Since our motivation is to minimize the price-to-performance, we solve that equation for zero.
As the outcome, we get two modified Amdahl's numbers. According to our model, Amdahl's I/O number is the square root of the ratio of the application balance between I/O and CPU to the cost balance between the disk and CPU. On the other hand, Amdahl's memory number is the square root of the ratio of the application balance between memory and CPU to the cost balance between the memory and CPU.
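In symbols, a hedged reconstruction of this result (γ for application balance, κ for cost balance; the paper's exact notation may differ):

```latex
% Modified Amdahl numbers as described above, with gamma = application
% balance and kappa = cost balance (notation assumed, not the paper's):
\beta_{io}^{opt} = \sqrt{\frac{\gamma_{io}}{\kappa_{io}}},
\qquad
\beta_{mem}^{opt} = \sqrt{\frac{\gamma_{mem}}{\kappa_{mem}}}
% Gray's law is recovered as the special case gamma = 1/kappa,
% where beta_opt = gamma (the application balance exactly
% compensates the hardware cost balance).
```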
So the actual implication of the model lies in its consideration of the cost component. As can be seen in the figure, considering the lower cost of disk, the model produces a higher value for Amdahl's I/O number compared to Gray's amendment. On the other hand, considering the higher cost of memory, it produces a lower value for Amdahl's memory number compared to Gray's law.
In our model, Gray's law is a special case where the application's resource requirement exactly compensates the corresponding hardware cost, that is, the application balance is the inverse of the cost balance.
Also, using this figure, you can easily see which system architecture is optimized for what type of applications, or, the other way around, what type of applications should perform best on what type of architecture. We will use these characteristics for cluster classification later in the presentation.
This slide shows an example in the current scenario, where we analyzed amazon.com, newegg.com, etc. to get the average prices of the different modules of the system, that is, I/O and CPU, and then directly fed those into our model for an I/O- and compute-intensive application, which means gamma-io equals one. This way, our modified Amdahl's I/O number comes out as 0.17. Similarly, considering a memory- and compute-intensive application, that is, gamma-mem equals one, we calculated Amdahl's memory number as 2.7.
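A small calculator in this spirit (the prices below are placeholders, not the figures from the paper's Table II, so the outputs are illustrative only and will not reproduce 0.17 and 2.7):

```python
# Modified Amdahl numbers from the model: beta_opt = sqrt(gamma / kappa),
# where gamma is the application balance and kappa the cost balance.
# All prices are hypothetical placeholders; the paper's Table II has the
# real amazon.com/newegg.com averages.
from math import sqrt

def beta_opt(gamma, component_cost_per_unit, cpu_cost_per_ghz):
    kappa = component_cost_per_unit / cpu_cost_per_ghz  # cost balance
    return sqrt(gamma / kappa)

cpu_cost_per_ghz = 30.0    # hypothetical $/GHz
io_cost_per_gbps = 50.0    # hypothetical $/Gbps of I/O bandwidth
mem_cost_per_gb = 8.0      # hypothetical $/GB

print("beta_io_opt  =", beta_opt(1.0, io_cost_per_gbps, cpu_cost_per_ghz))  # gamma_io = 1
print("beta_mem_opt =", beta_opt(1.0, mem_cost_per_gb, cpu_cost_per_ghz))   # gamma_mem = 1
```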
The price table for some hardware is given in the paper (Table II).
Practical implication of the application balance: gamma (io or mem) = (data read from I/O or memory) / (instructions per second).
Flow of the derivation
We used three different types of clusters for this work. The first is SuperMikeII, which has 16 Xeon processing cores, one hard disk drive producing only 0.15 GBps of bandwidth, and 32 GB of RAM per node. The second cluster is SwatIII, which has the same processor configuration as SuperMikeII but four times more I/O bandwidth and eight times more memory. The third one is CeresII, which is a Samsung MicroBrick-based cluster powered by NVMe SSDs. Each node of this cluster has only 6 Xeon cores and 64 GB of memory; however, because of the NVMe SSD, each node has 2 GBps of I/O bandwidth, which is much higher than the other two clusters.
Based on these configurations, that is, the I/O bandwidth, memory, and CPU speed, we calculated the beta-io and beta-mem of all of these clusters. As can be seen, CeresII's beta-io is almost the same as the optimum produced by our model, whereas SuperMikeII shows an extremely low value. SwatIII lies in between the two. In terms of memory, again, SuperMikeII shows a very low value and SwatIII shows a very high value.
Now, using the curve shown earlier (that is, the beta-versus-gamma plot), we can easily determine which kind of application each of these architectures is optimized for. This way, SuperMikeII can be classified as a traditional HPC cluster, which is optimized only for compute-intensive applications, with very low values of gamma-io and gamma-mem. SwatIII can be classified as a regular data center, which is optimized for both compute- and memory-intensive applications. CeresII, on the other hand, even with the lowest processing speed per node, turned out to be the best for all I/O-, compute-, and memory-intensive applications.
So, for today's Hadoop-based scientific applications, CeresII is expected to show much better performance.
To prove that, we used three different types of benchmarks. The first two are the very common Terasort and Wordcount. Terasort has a CPU-intensive map phase and an I/O-intensive reduce phase. Wordcount has both CPU- and I/O-intensive map and reduce phases, depending on the data size.
The third benchmark is a genome assembly application developed by us using Hadoop and Giraph. The first phase of the assembler is a shuffle-intensive Hadoop job producing almost 10 TB of shuffled data, and the second phase is a memory-intensive Giraph job, which processes a 3.2 TB graph.
This slide compares the different clusters. As can be seen, CeresII shows more than 65% improvement over SuperMikeII for both Terasort and Wordcount. Compared to SwatIII, CeresII shows more than a 50% benefit for both applications.
This slide compares the price-to-performance of the different clusters for the human genome assembly. CeresII shows more than an 85% benefit for both phases of the assembly compared to SuperMikeII, which is optimized for compute-intensive applications only. Compared to SwatIII, it shows a 50% benefit in the Hadoop phase and a 20% benefit in the Giraph phase.
Now we are at the last part of the presentation. In this work, we provided a theoretical background for Amdahl's I/O number and memory number and modified them based on the application characteristics and the hardware price trend. According to our observations, Amdahl's I/O number should be increased compared to Gray's law because of the low price of disks. On the other hand, Amdahl's memory number should be decreased compared to Gray's law, as the memory price is high.
The model also provides an easy-to-use alternative to FLOPS for expressing the capability of HPC clusters, one with better expressive power for I/O- and memory-bound applications.
In this work, our focus was on simplicity, so that the model can be used by system designers; however, many subtle parameters, such as CPU multithreading and I/O latency, can be added to improve its accuracy.