Introduction of NGS Data Analysis on Hadoop 
Chung-Tsai Su 
SPN Architect, Core Tech 
Trend Micro 
2014/10/31 @CSIE.NTU 
10/31/2014 Confidential | Copyright 2012 Trend Micro Inc. 1
Q&A 
http://setmoney.blob.core.windows.net/newsimages/2014/09/04/136352-XXL.jpg
http://www.genome.gov/sequencingcosts/ 
NGS Era
NGS Pipeline 
High-Level Workflow of NGS 
Raw Reads (.fq) → Read Mapping → Sequence Alignment/Mapping (.sam/.bam) → Variant Calling → Variant Calling file (.vcf)
NGS Data Analysis Pipeline 
• GATK best practice 
https://www.broadinstitute.org/gatk/guide/best-practices?bpm=DNAseq
illumina solution 
http://systems.illumina.com/content/dam/illumina-marketing/documents/products/brochures/brochure_sequencing_systems_portfolio.pdf
The First $1,000 Genome – illumina HiSeq X Ten 
http://systems.illumina.com/systems/hiseq-x-sequencing-system.html
Expectation of Data Processing Power for illumina HiSeq X Ten
• A cluster of 10 HiSeq X instruments
• Capable of sequencing up to 18,000 whole human genomes each year
– Has a run cycle of ~3 days and produces ~150 genomes each run cycle
– Running the industry-standard BWA+GATK analysis pipeline on a reasonably high-end compute server (dual Intel Xeon E5-2697v2 CPU, 12 cores, 2.7 GHz, with 96 GB DRAM) takes ~24 hours per genome.
– To achieve the required throughput of 150 genomes every three days, at least 50 of these servers are required.
• The analysis should meet a target of ~28 minutes per genome for the completion of mapping, aligning, sorting, de-duplication and variant calling.
http://www.edicogenome.com/dragen/
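The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above (150 genomes per run, a 3-day run cycle, 24 hours per genome per server); the function names and the 365-day year are assumptions made for illustration.

```python
import math

# Figures taken from the slide.
GENOMES_PER_RUN = 150     # ~150 genomes per run cycle
RUN_CYCLE_DAYS = 3        # ~3 days per run cycle
HOURS_PER_GENOME = 24     # BWA+GATK on one high-end server

def genomes_per_year(days_per_year=365):
    """Yearly output if run cycles follow each other back to back (assumed)."""
    return GENOMES_PER_RUN * days_per_year / RUN_CYCLE_DAYS

def servers_required():
    """Servers needed so analysis keeps pace with sequencing."""
    server_hours_needed = GENOMES_PER_RUN * HOURS_PER_GENOME
    hours_available_per_server = RUN_CYCLE_DAYS * 24
    return math.ceil(server_hours_needed / hours_available_per_server)

def minutes_per_genome():
    """Time budget per genome if a single system handles every genome in turn."""
    return RUN_CYCLE_DAYS * 24 * 60 / GENOMES_PER_RUN

print(genomes_per_year())     # 18250.0, consistent with "up to 18,000" per year
print(servers_required())     # 50
print(minutes_per_genome())   # 28.8, quoted on the slide as ~28 minutes
```

The 50-server figure assumes each server works on one genome at a time for the full 24 hours; the ~28-minute target is the same throughput expressed as a per-genome budget for a single system.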
Literature Survey 
Literature 
• CloudBurst, 2009 
• CloudAligner, 2011 
• DistMap, 2013 
Algorithm of CloudBurst 
Seed-and-Extend Algorithm
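The slide names the seed-and-extend technique without spelling it out. As a rough illustration, the sketch below is a generic, single-process version of the idea, not CloudBurst's actual MapReduce code: to find alignments with at most k mismatches, the read is cut into k+1 non-overlapping seeds, so at least one seed must match the reference exactly (pigeonhole argument). Each exact seed hit is then extended to the full read length and the mismatches are counted. All function names are hypothetical.

```python
from collections import defaultdict

def build_kmer_index(reference, seed_len):
    """Map every substring of length seed_len to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index

def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def seed_and_extend(reference, read, max_mismatches):
    """Return sorted (start, mismatches) pairs for ungapped alignments."""
    seed_len = len(read) // (max_mismatches + 1)
    if seed_len == 0:
        raise ValueError("read is too short for this many mismatches")
    index = build_kmer_index(reference, seed_len)
    hits = set()
    for i in range(max_mismatches + 1):
        offset = i * seed_len
        seed = read[offset:offset + seed_len]
        # Seed phase: exact lookup of the seed in the reference index.
        for pos in index.get(seed, ()):
            start = pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            # Extend phase: compare the whole read at the implied position.
            mismatches = hamming(reference[start:start + len(read)], read)
            if mismatches <= max_mismatches:
                hits.add((start, mismatches))
    return sorted(hits)

reference = "ACGTTAGCCGATAGGCTAAC"
print(seed_and_extend(reference, "TAGCCGCT", 1))  # [(4, 1)]
print(seed_and_extend(reference, "TAGCCGCT", 0))  # []
```

In a MapReduce setting the seed lookup would typically be distributed by emitting seeds as keys so that matching reference and read seeds meet in the same reducer; this sketch keeps everything in one in-memory index for brevity.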
Experiments
Performance of CloudBurst: Scalability
[Figure: Running Time vs Number of Reads on Chr 1. x-axis: Millions of Reads (0 to 8); y-axis: Runtime (s) (0 to 16000); series: 0, 1, 2, 3, 4]
Speedup over Serial RMAP
EECS 584 - Fall 2013
[Figure: Speedup over serial RMAP. x-axis: Number of Mismatches (0 to 4); y-axis: Speedup (0 to 40); series: chr1, chr22]
Experiments
Speedup on EC2
[Figure: Running Time on EC2, High-CPU Medium Instance Cluster. x-axis: Number of Cores (24, 48, 72, 96); y-axis: Running time (s) (0 to 1800)]
Overhead of Disk I/O 
Architecture of CloudAligner 
Seed-and-Extend Algorithm
Performance on Small Data 
Performance on Large Data 
Performance on Amazon EMR 
Comparison with CloudBurst and CloudAligner 
Workflow of DistMap 
Evaluation of Read Mapping tools 
Comparison of DistMap and other tools for distributed mapping
Market Movement 
Hardware Solution - 
The World’s First NGS Bioinformatics Processor 
http://www.bina.com/product.html
Architecture of bina Technology 
http://www.bina.com/technology.html
https://www.dnanexus.com/images/usecases/dnanexus_CHARGE_prod1.png
Summary 
• NGS opens a new chapter in the Big Data era
• More CS experts are needed to solve scalability and performance issues
• More data scientists are also needed to discover the secrets/insights of the human genome
http://technews.tw/2014/08/02/gene-big-data/ 
Q&A 

Editor's Notes

  • #21 From the figure, we can see that CloudAligner is 60 to 80% faster than CloudBurst.
  • #22 We mapped different subsets of the accession SRR035459 to human chromosome 22 (50 Mbp), allowing up to 3 mismatches. From the figure, we can see that the execution time of both CloudBurst and CloudAligner is proportional to the number of reads, and that CloudAligner outperforms CloudBurst by 35 to 67%.
  • #23 With CloudBurst, the limitation of its approach is the network bandwidth. With CloudAligner, the limitation is the computation power of the Hadoop workers. Consequently, if we run CloudAligner on a cluster of legacy machines with a high-speed network, we will probably lose the performance advantage over CloudBurst.