Introduction of NGS Data Analysis on Hadoop 
Chung-Tsai Su 
SPN Architect, Core Tech 
Trend Micro 
2014/10/31 @CSIE.NTU 
10/31/2014 Confidential | Copyright 2012 Trend Micro Inc. 1
Q&A 
http://setmoney.blob.core.windows.net/newsimages/2014/09/04/136352-XXL.jpg
http://www.genome.gov/sequencingcosts/ 
NGS Era
NGS Pipeline 
High-Level Workflow of NGS 
Raw Reads (.fq) -> Read Mapping -> Sequence Alignment/Mapping (.sam/.bam) -> Variant Calling -> Variant Calling file (.vcf)
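The workflow above can be sketched as a chain of command lines, each consuming the previous stage's file. This is a minimal illustration, not the pipeline used in the talk: the file names and the `build_pipeline` helper are hypothetical, the `gatk HaplotypeCaller` form assumes the GATK 4 command-line style (the 2014-era GATK 3 was invoked differently), and a real run would also need reference indexing, read groups, and de-duplication. The commands are only assembled, never executed.

```python
from typing import List, NamedTuple, Optional


class Stage(NamedTuple):
    name: str               # workflow step from the slide
    argv: List[str]         # command line (assembled only, not run)
    stdout: Optional[str]   # file that captures stdout, if the tool writes there
    output: str             # file produced by this stage


def build_pipeline(ref: str, fastq: str, sample: str) -> List[Stage]:
    """Assemble .fq -> .sam -> .bam -> .vcf commands (hypothetical helper)."""
    sam = f"{sample}.sam"
    bam = f"{sample}.sorted.bam"
    vcf = f"{sample}.vcf"
    return [
        # bwa mem writes SAM records to stdout, so the output is a redirect.
        Stage("read mapping", ["bwa", "mem", ref, fastq], sam, sam),
        Stage("sort", ["samtools", "sort", "-o", bam, sam], None, bam),
        Stage("index", ["samtools", "index", bam], None, bam + ".bai"),
        # Assumes GATK 4 syntax; this is an assumption, not from the slides.
        Stage("variant calling",
              ["gatk", "HaplotypeCaller", "-R", ref, "-I", bam, "-O", vcf],
              None, vcf),
    ]


if __name__ == "__main__":
    for stage in build_pipeline("ref.fa", "sample.fq", "sample"):
        redirect = f" > {stage.stdout}" if stage.stdout else ""
        print(f"{stage.name}: {' '.join(stage.argv)}{redirect}")
```

Keeping the stages as data makes it easy to see which file format is handed from one step to the next before anything is run.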
NGS Data Analysis Pipeline 
• GATK best practice 
https://www.broadinstitute.org/gatk/guide/best-practices?bpm=DNAseq
illumina solution 
http://systems.illumina.com/content/dam/illumina-marketing/documents/products/brochures/brochure_sequencing_systems_portfolio.pdf
The First $1,000 Genome – illumina HiSeq X Ten 
http://systems.illumina.com/systems/hiseq-x-sequencing-system.html
Expectation of Data Processing Power for illumina HiSeq X Ten 
• A cluster of 10 HiSeq X instruments 
• Capable of sequencing up to 18,000 whole human genomes each year 
– Has a run cycle of ~3 days and produces ~150 genomes per run cycle 
– Running the industry-standard BWA+GATK analysis pipeline on a reasonably high-end compute server (dual Intel Xeon E5-2697 v2 CPUs, 12 cores each at 2.7 GHz, with 96 GB DRAM) takes ~24 hours per genome. 
– To achieve the required throughput of 150 genomes every three days, at least 50 of these servers are required (150 genomes × 24 h ÷ 72 h). 
• The analysis should meet a target of ~28 minutes for the completion of the mapping, aligning, sorting, de-duplication and variant calling of each genome (72 h ÷ 150 genomes). 
http://www.edicogenome.com/dragen/
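The figures on this slide follow from simple arithmetic, which the sketch below reproduces. The constants come from the slide; the function names are illustrative only, and the yearly figure assumes the instruments run back-to-back all year.

```python
import math

# Figures quoted on the slide.
GENOMES_PER_RUN = 150      # genomes produced per run cycle
RUN_DAYS = 3               # length of one run cycle in days
HOURS_PER_GENOME = 24      # BWA+GATK time per genome on one server


def servers_needed(genomes: int, run_days: int, hours_per_genome: float) -> int:
    """Servers required to finish one run's genomes before the next run ends."""
    return math.ceil(genomes * hours_per_genome / (run_days * 24))


def minutes_per_genome(genomes: int, run_days: int) -> float:
    """Time budget per genome if a single system must keep up with the cluster."""
    return run_days * 24 * 60 / genomes


def genomes_per_year(genomes: int, run_days: int, days: int = 365) -> float:
    """Upper bound on yearly output, assuming continuous operation."""
    return genomes * days / run_days


if __name__ == "__main__":
    print(servers_needed(GENOMES_PER_RUN, RUN_DAYS, HOURS_PER_GENOME))  # 50
    print(minutes_per_genome(GENOMES_PER_RUN, RUN_DAYS))                # 28.8
    print(genomes_per_year(GENOMES_PER_RUN, RUN_DAYS))                  # 18250.0
```

The results agree with the slide's rounded values of 50 servers, roughly 28 minutes per genome, and about 18,000 genomes per year.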
Literature Survey 
Literature 
• CloudBurst, 2009 
• CloudAligner, 2011 
• DistMap, 2013 
Algorithm of CloudBurst 
Seed-and-Extend Algorithm
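The seed-and-extend idea can be illustrated with a small single-process sketch. It rests on the pigeonhole argument: if a read may contain at most k mismatches and is cut into k+1 non-overlapping seeds, at least one seed must match the reference exactly, so exact seed hits locate candidate positions that are then extended and verified. CloudBurst distributes these steps with MapReduce; the code below does not, and all names in it are hypothetical. It handles mismatches only, not gaps.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def seed_index(reference: str, seed_len: int) -> Dict[str, List[int]]:
    """Index every substring of length seed_len by its start positions."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def seed_and_extend(reference: str, read: str,
                    max_mismatches: int) -> List[Tuple[int, int]]:
    """Return sorted (start, mismatches) pairs for all valid alignments."""
    seed_len = len(read) // (max_mismatches + 1)
    if seed_len == 0:
        raise ValueError("read is too short for this many mismatches")
    index = seed_index(reference, seed_len)
    hits = set()
    for i in range(max_mismatches + 1):
        offset = i * seed_len                      # seed position in the read
        seed = read[offset:offset + seed_len]
        for pos in index.get(seed, []):            # exact seed matches
            start = pos - offset                   # implied start of the read
            if start < 0 or start + len(read) > len(reference):
                continue
            # Extend: compare the whole read against the candidate window.
            mismatches = hamming(reference[start:start + len(read)], read)
            if mismatches <= max_mismatches:
                hits.add((start, mismatches))
    return sorted(hits)


if __name__ == "__main__":
    ref = "ACGTACGTTAGCCGATAGGCT"
    print(seed_and_extend(ref, "TAGCTGAT", 1))  # [(8, 1)]
```

Collecting hits in a set removes duplicates when several seeds of the same read point at the same alignment.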
Performance of CloudBurst: Scalability 
[Figure: Running Time vs Number of Reads on Chr 1. x-axis: Millions of Reads (0 to 8); y-axis: Runtime (s) (0 to 16000); legend: 0, 1, 2, 3, 4]
Speedup over Serial RMAP 
[Figure: Speedup over serial RMAP. x-axis: Number of Mismatches (0 to 4); y-axis: Speedup (0 to 40); legend: chr1, chr22. Slide source: EECS 584, Fall 2013]
Speedup on EC2 
[Figure: Running Time on EC2, High-CPU Medium Instance Cluster. x-axis: Number of Cores (24, 48, 72, 96); y-axis: Running time (s) (0 to 1800)]
Overhead of Disk I/O 
Architecture of CloudAligner 
Seed-and-Extend Algorithm
Performance on Small Data 
Performance on Large Data 
Performance on Amazon EMR 
Comparison with CloudBurst and CloudAligner 
Workflow of DistMap 
Evaluation of Read Mapping tools 
Comparison of DistMap and other tools for distributed mapping 
Market Movement 
Hardware Solution - 
The World’s First NGS Bioinformatics Processor 
http://www.bina.com/product.html
Architecture of bina Technology 
http://www.bina.com/technology.html
https://www.dnanexus.com/images/usecases/dnanexus_CHARGE_prod1.png
Summary 
• NGS opens a new chapter in the Big Data era 
• More CS experts are needed to solve scalability and performance issues 
• More data scientists are also needed to discover the secrets and insights of the human genome 
http://technews.tw/2014/08/02/gene-big-data/ 
Q&A 
OECD Directorate for Financial and Enterprise Affairs
 
Presentatie 4. Jochen Cremer - TU Delft 28 mei 2024
Presentatie 4. Jochen Cremer - TU Delft 28 mei 2024Presentatie 4. Jochen Cremer - TU Delft 28 mei 2024
Presentatie 4. Jochen Cremer - TU Delft 28 mei 2024
Dutch Power
 
Artificial Intelligence, Data and Competition – LIM – June 2024 OECD discussion
Artificial Intelligence, Data and Competition – LIM – June 2024 OECD discussionArtificial Intelligence, Data and Competition – LIM – June 2024 OECD discussion
Artificial Intelligence, Data and Competition – LIM – June 2024 OECD discussion
OECD Directorate for Financial and Enterprise Affairs
 
2024-05-30_meetup_devops_aix-marseille.pdf
2024-05-30_meetup_devops_aix-marseille.pdf2024-05-30_meetup_devops_aix-marseille.pdf
2024-05-30_meetup_devops_aix-marseille.pdf
Frederic Leger
 
Pro-competitive Industrial Policy – OECD – June 2024 OECD discussion
Pro-competitive Industrial Policy – OECD – June 2024 OECD discussionPro-competitive Industrial Policy – OECD – June 2024 OECD discussion
Pro-competitive Industrial Policy – OECD – June 2024 OECD discussion
OECD Directorate for Financial and Enterprise Affairs
 
Presentatie 8. Joost van der Linde & Daniel Anderton - Eliq 28 mei 2024
Presentatie 8. Joost van der Linde & Daniel Anderton - Eliq 28 mei 2024Presentatie 8. Joost van der Linde & Daniel Anderton - Eliq 28 mei 2024
Presentatie 8. Joost van der Linde & Daniel Anderton - Eliq 28 mei 2024
Dutch Power
 
Tom tresser burning issue.pptx My Burning issue
Tom tresser burning issue.pptx My Burning issueTom tresser burning issue.pptx My Burning issue
Tom tresser burning issue.pptx My Burning issue
amekonnen
 
Mastering the Concepts Tested in the Databricks Certified Data Engineer Assoc...
Mastering the Concepts Tested in the Databricks Certified Data Engineer Assoc...Mastering the Concepts Tested in the Databricks Certified Data Engineer Assoc...
Mastering the Concepts Tested in the Databricks Certified Data Engineer Assoc...
SkillCertProExams
 

Recently uploaded (20)

Gregory Harris - Cycle 2 - Civics Presentation
Gregory Harris - Cycle 2 - Civics PresentationGregory Harris - Cycle 2 - Civics Presentation
Gregory Harris - Cycle 2 - Civics Presentation
 
ASONAM2023_presection_slide_track-recommendation.pdf
ASONAM2023_presection_slide_track-recommendation.pdfASONAM2023_presection_slide_track-recommendation.pdf
ASONAM2023_presection_slide_track-recommendation.pdf
 
Artificial Intelligence, Data and Competition – ČORBA – June 2024 OECD discus...
Artificial Intelligence, Data and Competition – ČORBA – June 2024 OECD discus...Artificial Intelligence, Data and Competition – ČORBA – June 2024 OECD discus...
Artificial Intelligence, Data and Competition – ČORBA – June 2024 OECD discus...
 
Competition and Regulation in Professions and Occupations – OECD – June 2024 ...
Competition and Regulation in Professions and Occupations – OECD – June 2024 ...Competition and Regulation in Professions and Occupations – OECD – June 2024 ...
Competition and Regulation in Professions and Occupations – OECD – June 2024 ...
 
Suzanne Lagerweij - Influence Without Power - Why Empathy is Your Best Friend...
Suzanne Lagerweij - Influence Without Power - Why Empathy is Your Best Friend...Suzanne Lagerweij - Influence Without Power - Why Empathy is Your Best Friend...
Suzanne Lagerweij - Influence Without Power - Why Empathy is Your Best Friend...
 
原版制作贝德福特大学毕业证(bedfordhire毕业证)硕士文凭原版一模一样
原版制作贝德福特大学毕业证(bedfordhire毕业证)硕士文凭原版一模一样原版制作贝德福特大学毕业证(bedfordhire毕业证)硕士文凭原版一模一样
原版制作贝德福特大学毕业证(bedfordhire毕业证)硕士文凭原版一模一样
 
Burning Issue Presentation By Kenmaryon.pdf
Burning Issue Presentation By Kenmaryon.pdfBurning Issue Presentation By Kenmaryon.pdf
Burning Issue Presentation By Kenmaryon.pdf
 
Gregory Harris' Civics Presentation.pptx
Gregory Harris' Civics Presentation.pptxGregory Harris' Civics Presentation.pptx
Gregory Harris' Civics Presentation.pptx
 
XP 2024 presentation: A New Look to Leadership
XP 2024 presentation: A New Look to LeadershipXP 2024 presentation: A New Look to Leadership
XP 2024 presentation: A New Look to Leadership
 
Pro-competitive Industrial Policy – LANE – June 2024 OECD discussion
Pro-competitive Industrial Policy – LANE – June 2024 OECD discussionPro-competitive Industrial Policy – LANE – June 2024 OECD discussion
Pro-competitive Industrial Policy – LANE – June 2024 OECD discussion
 
Updated diagnosis. Cause and treatment of hypothyroidism
Updated diagnosis. Cause and treatment of hypothyroidismUpdated diagnosis. Cause and treatment of hypothyroidism
Updated diagnosis. Cause and treatment of hypothyroidism
 
Mẫu PPT kế hoạch làm việc sáng tạo cho nửa cuối năm PowerPoint
Mẫu PPT kế hoạch làm việc sáng tạo cho nửa cuối năm PowerPointMẫu PPT kế hoạch làm việc sáng tạo cho nửa cuối năm PowerPoint
Mẫu PPT kế hoạch làm việc sáng tạo cho nửa cuối năm PowerPoint
 
Artificial Intelligence, Data and Competition – SCHREPEL – June 2024 OECD dis...
Artificial Intelligence, Data and Competition – SCHREPEL – June 2024 OECD dis...Artificial Intelligence, Data and Competition – SCHREPEL – June 2024 OECD dis...
Artificial Intelligence, Data and Competition – SCHREPEL – June 2024 OECD dis...
 
Presentatie 4. Jochen Cremer - TU Delft 28 mei 2024
Presentatie 4. Jochen Cremer - TU Delft 28 mei 2024Presentatie 4. Jochen Cremer - TU Delft 28 mei 2024
Presentatie 4. Jochen Cremer - TU Delft 28 mei 2024
 
Artificial Intelligence, Data and Competition – LIM – June 2024 OECD discussion
Artificial Intelligence, Data and Competition – LIM – June 2024 OECD discussionArtificial Intelligence, Data and Competition – LIM – June 2024 OECD discussion
Artificial Intelligence, Data and Competition – LIM – June 2024 OECD discussion
 
2024-05-30_meetup_devops_aix-marseille.pdf
2024-05-30_meetup_devops_aix-marseille.pdf2024-05-30_meetup_devops_aix-marseille.pdf
2024-05-30_meetup_devops_aix-marseille.pdf
 
Pro-competitive Industrial Policy – OECD – June 2024 OECD discussion
Pro-competitive Industrial Policy – OECD – June 2024 OECD discussionPro-competitive Industrial Policy – OECD – June 2024 OECD discussion
Pro-competitive Industrial Policy – OECD – June 2024 OECD discussion
 
Presentatie 8. Joost van der Linde & Daniel Anderton - Eliq 28 mei 2024
Presentatie 8. Joost van der Linde & Daniel Anderton - Eliq 28 mei 2024Presentatie 8. Joost van der Linde & Daniel Anderton - Eliq 28 mei 2024
Presentatie 8. Joost van der Linde & Daniel Anderton - Eliq 28 mei 2024
 
Tom tresser burning issue.pptx My Burning issue
Tom tresser burning issue.pptx My Burning issueTom tresser burning issue.pptx My Burning issue
Tom tresser burning issue.pptx My Burning issue
 
Mastering the Concepts Tested in the Databricks Certified Data Engineer Assoc...
Mastering the Concepts Tested in the Databricks Certified Data Engineer Assoc...Mastering the Concepts Tested in the Databricks Certified Data Engineer Assoc...
Mastering the Concepts Tested in the Databricks Certified Data Engineer Assoc...
 

A Survey of NGS Data Analysis on Hadoop

  • 1. Introduction of NGS Data Analysis on Hadoop Chung-Tsai Su SPN Architect, Core Tech Trend Micro 2014/10/31 @CSIE.NTU 10/31/2014 Confidential | Copyright 2012 Trend Micro Inc. 1
  • 2. Q&A http://setmoney.blob.core.windows.net/newsimages/2014/09/04/136352-XXL.jpg
  • 4. NGS Pipeline
  • 5. High-Level Workflow of NGS: Raw Reads (.fq) → Read Mapping → Sequence Alignment/Mapping (.sam/.bam) → Variant Calling → Variant Calling file (.vcf)
  • 6. NGS Data Analysis Pipeline • GATK best practice https://www.broadinstitute.org/gatk/guide/best-practices?bpm=DNAseq
  • 7. illumina solution http://systems.illumina.com/content/dam/illumina-marketing/documents/products/brochures/brochure_sequencing_systems_portfolio.pdf
  • 8. The First $1,000 Genome – illumina HiSeq X Ten http://systems.illumina.com/systems/hiseq-x-sequencing-system.html
  • 9. Expectation of Data Processing Power for illumina HiSeq X Ten
      • A cluster of 10 HiSeq X instruments
      • Capable of sequencing up to 18,000 whole human genomes each year
          – Has a run cycle of ~3 days and produces ~150 genomes each run cycle
          – Running the industry-standard BWA+GATK analysis pipeline on a reasonably high-end compute server (Dual Intel Xeon E5-2697v2 CPU – 12 core, 2.7 GHz with 96 GB DRAM) takes ~24 hours per genome.
          – To achieve the required throughput of 150 genomes every three days, at least 50 of these servers are required.
      • Should meet a target of ~28 minutes for the completion of the mapping, aligning, sorting, de-duplication and variant calling of each genome.
      http://www.edicogenome.com/dragen/
  • 10. Literature Survey
  • 11. Literature • CloudBurst, 2009 • CloudAligner, 2011 • DistMap, 2013
  • 13. Algorithm of CloudBurst: Seed-and-Extend Algorithm
  • 14. Experiments: Performance of CloudBurst, Scalability [Figure: Running Time vs Number of Reads on Chr 1; x-axis: Millions of Reads; y-axis: Runtime (s); series: 0, 1, 2, 3, 4]
  • 15. Speedup over Serial RMAP [Figure: Speedup over serial RMAP; x-axis: Number of Mismatches; y-axis: Speedup; series: chr1, chr22; figure label: EECS 584 – Fall 2013]
  • 16. Experiments: Speedup on EC2 [Figure: Running Time on EC2 High-CPU Medium Instance Cluster; x-axis: Number of Cores (24, 48, 72, 96); y-axis: Running time (s)]
  • 18. Overhead of Disk I/O
  • 19. Architecture of CloudAligner: Seed-and-Extend Algorithm
  • 20. Performance on Small Data
  • 21. Performance on Large Data
  • 22. Performance on Amazon EMR
  • 23. Comparison with CloudBurst and CloudAligner
  • 25. Workflow of DistMap
  • 26. Evaluation of Read Mapping tools
  • 27. Comparison of DistMap and other tools for distributed mapping
  • 28. Market Movement
  • 29. Hardware Solution - The World’s First NGS Bioinformatics Processor
  • 30. http://www.bina.com/product.html
  • 31. Architecture of bina Technology http://www.bina.com/technology.html
  • 32. https://www.dnanexus.com/images/usecases/dnanexus_CHARGE_prod1.png
  • 33. Summary • NGS opens a new page in the Big Data era • Need more CS experts to solve scalability and performance issues • Also need more data scientists to discover the secrets/insights of the human genome
  • 34. http://technews.tw/2014/08/02/gene-big-data/
  • 35. Q&A
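
The sizing figures on slide 9 follow from simple arithmetic. The sketch below reproduces them under the assumption that each server processes one genome at a time and that the ~28 minute target means processing genomes one after another within a single run cycle; the function and constant names are illustrative and do not come from the slides.

```python
import math

# Figures quoted on slide 9.
GENOMES_PER_RUN = 150       # genomes produced per run cycle
RUN_CYCLE_HOURS = 3 * 24    # run cycle of ~3 days
HOURS_PER_GENOME = 24       # BWA+GATK on one dual-Xeon server


def servers_needed(genomes, hours_per_genome, window_hours):
    """Servers required to finish `genomes` analyses within `window_hours`,
    assuming each server handles one genome at a time."""
    return math.ceil(genomes * hours_per_genome / window_hours)


def minutes_per_genome(genomes, window_hours):
    """Time budget per genome if the analyses run back to back."""
    return window_hours * 60 / genomes


servers = servers_needed(GENOMES_PER_RUN, HOURS_PER_GENOME, RUN_CYCLE_HOURS)
budget = minutes_per_genome(GENOMES_PER_RUN, RUN_CYCLE_HOURS)

print(servers)           # 50 servers
print(round(budget, 1))  # 28.8 minutes per genome
```

The 50-server figure is a lower bound: it leaves no headroom for failed runs, queueing, or data transfer.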

Editor's Notes

  1. From the figure, we can see that CloudAligner is 60 to 80% faster than CloudBurst.
  2. We mapped different subsets of the accession SRR035459 to human chromosome 22 (50 Mbp), allowing up to 3 mismatches. From the figure, we can see that the execution time of both CloudBurst and CloudAligner is proportional to the number of reads, and that CloudAligner outperforms CloudBurst by 35 to 67%.
  3. With CloudBurst, the limitation of its approach is the network bandwidth. With CloudAligner, the limitation is the computation power of the workers in Hadoop. Consequently, if we run CloudAligner on a cluster of legacy machines with a high-speed network, we will probably lose the performance advantage over CloudBurst.
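
Slides 13 and 19 name the seed-and-extend algorithm without spelling it out. As a rough illustration only, here is a single-process sketch of the usual pigeonhole formulation: a read allowed k mismatches is cut into k+1 non-overlapping seeds, so at least one seed must match the reference exactly, and each exact seed hit is then extended by counting mismatches over the full read. This is not the MapReduce implementation of CloudBurst or CloudAligner; `build_index` and `map_read` are hypothetical names, and the toy sequences are invented.

```python
from collections import defaultdict


def build_index(reference, seed_len):
    """Map every substring of length seed_len to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index


def map_read(read, reference, max_mismatches):
    """Return sorted (start, mismatches) pairs for every placement of
    `read` on `reference` with at most `max_mismatches` mismatches."""
    if len(read) < max_mismatches + 1:
        raise ValueError("read is too short for this many mismatches")
    # Pigeonhole: with k mismatches, one of k+1 seeds is mismatch-free.
    seed_len = len(read) // (max_mismatches + 1)
    index = build_index(reference, seed_len)
    hits = {}
    for i in range(max_mismatches + 1):
        offset = i * seed_len
        seed = read[offset:offset + seed_len]
        for seed_pos in index.get(seed, []):       # seed step: exact match
            start = seed_pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            if start in hits:
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))  # extend
            if mismatches <= max_mismatches:
                hits[start] = mismatches
    return sorted(hits.items())


reference = "ACGTTAGCCGATAGGCTAAC"
print(map_read("TAGCAGAT", reference, 1))  # [(4, 1)]
print(map_read("TAGCAGAT", reference, 0))  # []
```

In a Hadoop setting the seed lookup is the part that gets distributed; the sketch keeps everything in one process to show only the matching logic.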