A Survey of NGS Data Analysis on Hadoop

Introduction to NGS Data Analysis on Hadoop
Chung-Tsai Su
SPN Architect, Core Tech
Trend Micro
2014/10/31 @CSIE.NTU
10/31/2014 Confidential | Copyright 2012 Trend Micro Inc. 1

Q&A
http://setmoney.blob.core.windows.net/newsimages/2014/09/04/136352-XXL.jpg

http://www.genome.gov/sequencingcosts/
NGS Era

NGS Pipeline

High-Level Workflow of NGS
Raw Reads (.fq) -> Read Mapping -> Sequence Alignment/Mapping (.sam/.bam) -> Variant Calling -> Variant Calling file (.vcf)
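The three file types in the workflow above can be tied to concrete tool invocations. The following is a minimal sketch, not the presenter's pipeline: it assumes one common toolchain (BWA for mapping, samtools for sorting and indexing, GATK 4 HaplotypeCaller for variant calling), and the function name, sample name and file names are hypothetical. It only builds the command lines and does not run them; a real pipeline also needs read groups, reference indexes and de-duplication.

```python
from collections import namedtuple

# One pipeline stage: a label, the argv to run, and where stdout should go
# (None means the tool writes its own output file).
Stage = namedtuple("Stage", ["name", "argv", "stdout_to"])


def pipeline_stages(ref, fq1, fq2, sample):
    """Build (but do not execute) one command per workflow stage.

    Tool choices are assumptions for illustration: bwa, samtools, gatk.
    """
    sam = f"{sample}.sam"
    bam = f"{sample}.sorted.bam"
    vcf = f"{sample}.vcf"
    return [
        # Read mapping: raw reads (.fq) -> alignments (.sam), written to stdout
        Stage("read_mapping", ["bwa", "mem", ref, fq1, fq2], sam),
        # Coordinate-sort the alignments and store them as .bam
        Stage("sort", ["samtools", "sort", "-o", bam, sam], None),
        # Index the sorted .bam so the variant caller can seek into it
        Stage("index", ["samtools", "index", bam], None),
        # Variant calling: alignments (.bam) -> variants (.vcf)
        Stage("variant_calling",
              ["gatk", "HaplotypeCaller", "-R", ref, "-I", bam, "-O", vcf],
              None),
    ]


if __name__ == "__main__":
    for stage in pipeline_stages("ref.fa", "reads_1.fq", "reads_2.fq", "sample1"):
        redirect = f" > {stage.stdout_to}" if stage.stdout_to else ""
        print(f"{stage.name}: {' '.join(stage.argv)}{redirect}")
```

Keeping the stages as data rather than executing them directly makes it easy to hand each one to a scheduler, which is the part that Hadoop-based tools in this survey replace.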

NGS Data Analysis Pipeline
• GATK best practice
https://www.broadinstitute.org/gatk/guide/best-practices?bpm=DNAseq

illumina solution
http://systems.illumina.com/content/dam/illumina-marketing/documents/products/brochures/brochure_sequencing_systems_portfolio.pdf

The First $1,000 Genome – illumina HiSeq X Ten
http://systems.illumina.com/systems/hiseq-x-sequencing-system.html

Expectation of Data Processing Power for illumina HiSeq X Ten
• A cluster of 10 HiSeq X instruments
• Capable of sequencing up to 18,000 whole human genomes each year
– Has a run cycle of ~3 days and produces ~150 genomes per run cycle
– Running the industry-standard BWA+GATK analysis pipeline on a reasonably high-end compute server (dual Intel Xeon E5-2697v2 CPUs, 12-core, 2.7 GHz, with 96 GB DRAM) takes ~24 hours per genome
– To sustain the required throughput of 150 genomes every three days, at least 50 of these servers are required
• Data processing should therefore meet a target of ~28 minutes per genome for mapping, aligning, sorting, de-duplication and variant calling
http://www.edicogenome.com/dragen/
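The figures on this slide follow from simple arithmetic on the run cycle. The sketch below reproduces them from the three numbers the slide states (150 genomes per run, a 3-day cycle, 24 hours per genome per server); the function names are illustrative and a 365-day year with back-to-back runs is an assumption.

```python
import math

GENOMES_PER_RUN = 150    # slide: ~150 genomes per run cycle
RUN_CYCLE_DAYS = 3       # slide: run cycle of ~3 days
HOURS_PER_GENOME = 24    # slide: ~24 h per genome on one server


def servers_needed(genomes=GENOMES_PER_RUN, cycle_days=RUN_CYCLE_DAYS,
                   hours_per_genome=HOURS_PER_GENOME):
    """Servers required to finish one run's genomes before the next run ends."""
    total_server_hours = genomes * hours_per_genome
    return math.ceil(total_server_hours / (cycle_days * 24))


def minutes_budget_per_genome(genomes=GENOMES_PER_RUN, cycle_days=RUN_CYCLE_DAYS):
    """Wall-clock minutes available per genome if genomes are processed one by one."""
    return cycle_days * 24 * 60 / genomes


def genomes_per_year(genomes=GENOMES_PER_RUN, cycle_days=RUN_CYCLE_DAYS):
    """Yearly output, assuming back-to-back runs over 365 days."""
    return genomes * 365 / cycle_days


if __name__ == "__main__":
    print("servers needed:", servers_needed())
    print("minutes per genome:", minutes_budget_per_genome())
    print("genomes per year:", genomes_per_year())
```

The per-genome budget works out to 28.8 minutes, which matches the ~28 minute target, and the yearly figure lands slightly above the quoted 18,000 because real instruments do not run without gaps.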

Literature Survey

Literature
• CloudBurst, 2009
• CloudAligner, 2011
• DistMap, 2013

Algorithm of CloudBurst
Seed-and-Extend Algorithm
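The seed-and-extend idea can be illustrated on a single machine. This is a minimal sketch, not CloudBurst's implementation: it assumes mismatches only (no gaps), builds the seed index in memory rather than through a MapReduce shuffle, and all function names are hypothetical. It relies on the pigeonhole argument: if a read aligns with at most k mismatches, then at least one of its k + 1 non-overlapping seeds must match the reference exactly.

```python
from collections import defaultdict


def split_into_seeds(read, k):
    """Return (offset, seed) pairs for k + 1 non-overlapping seeds of equal length."""
    seed_len = len(read) // (k + 1)
    if seed_len == 0:
        raise ValueError("read is too short for k + 1 seeds")
    return [(i * seed_len, read[i * seed_len:(i + 1) * seed_len])
            for i in range(k + 1)]


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def seed_and_extend(reference, read, k):
    """Return sorted reference positions where read aligns with <= k mismatches."""
    seeds = split_into_seeds(read, k)
    seed_len = len(seeds[0][1])

    # Seed phase: index every reference substring of seed length.
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)

    hits = set()
    for offset, seed in seeds:
        for pos in index.get(seed, []):
            start = pos - offset  # where the whole read would begin
            if start < 0 or start + len(read) > len(reference):
                continue
            # Extend phase: verify the full read around the exact seed hit.
            if hamming(reference[start:start + len(read)], read) <= k:
                hits.add(start)
    return sorted(hits)


if __name__ == "__main__":
    ref = "ACGTACGTTTGACCA"
    print(seed_and_extend(ref, "TTGACA", 1))
```

In a distributed setting, grouping reference substrings and read seeds by their shared sequence is the step that a shuffle would perform, with the extension check running wherever the groups land.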

Performance of CloudBurst: Scalability
[Figure: Running Time vs Number of Reads on Chr 1. x-axis: Millions of Reads (0 to 8); y-axis: Runtime (s) (0 to 16000); legend: 0, 1, 2, 3, 4]

Speedup over Serial RMAP
[Figure: Speedup over serial RMAP. x-axis: Number of Mismatches (0 to 4); y-axis: Speedup (0 to 40); series: chr1, chr22]

Speedup on EC2
[Figure: Running Time on EC2, High-CPU Medium Instance Cluster. x-axis: Number of Cores (24, 48, 72, 96); y-axis: Running time (s) (0 to 1800)]

Overhead of Disk I/O

Architecture of CloudAligner
Seed-and-Extend Algorithm

Performance on Small Data

Performance on Large Data

Performance on Amazon EMR

Comparison with CloudBurst and CloudAligner

Workflow of DistMap

Evaluation of Read Mapping tools

Comparison of DistMap and other tools for distributed mapping

Market Movement

Hardware Solution: The World’s First NGS Bioinformatics Processor

http://www.bina.com/product.html

Architecture of bina Technology
http://www.bina.com/technology.html

https://www.dnanexus.com/images/usecases/dnanexus_CHARGE_prod1.png

Summary
• NGS opens a new chapter in the Big Data era
• More CS experts are needed to solve scalability and performance issues
• More data scientists are also needed to discover the secrets and insights of the human genome

http://technews.tw/2014/08/02/gene-big-data/

Q&A
