Best practices at BGI for the Challenges in the Era of Big Genomics Data

My presentation for the workshop about the Best Practice Award BioIT at TriCon 2013

0 Comments
2 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
992
On SlideShare
0
From Embeds
0
Number of Embeds
21
Actions
Shares
0
Downloads
26
Comments
0
Likes
2
Embeds 0
No embeds

No notes for slide
  • At the heart of EasyGenomics is our Bioinformatics Core: 5 workflows built on carefully chosen, tested and optimized algorithms, covering filtering, QC reports and alignment, along with other supporting features.
  • Then we chose Hadoop. The Hadoop Streaming framework enabled us to quickly parallelize commonly used algorithms. The distributed storage system, HDFS, keeps our data safe. In addition, the Map/Reduce framework enabled us to further improve the accuracy and performance of our modules.
  • We integrated all the standard re-sequencing steps into the cloud-based framework, which can greatly accelerate data analysis.
  • Besides parallelizing existing algorithms, we also designed new approaches based on the distributed framework. Benefiting from its capacity for big data processing, we developed new models for mapping and variation detection.
  • This cloud-based pipeline is already applied in cancer and human disease studies within BGI. It can reduce the analysis time from two weeks to two days. Additionally, the figure at the bottom shows the speedups of our applications, which means we can control the analysis time by choosing the size of the cluster. With a larger cluster it would be even faster.
  • This is a comparison on a human sample from the 1000 Genomes Project. Our mapping tool, GaeaAlignment, returns the highest mapping rate.
  • Hecate distributes the graph simplification algorithm across the entire cluster of compute nodes. It converts memory usage into fast distributed I/O work, which enables the assembly work.
  • Sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. (Source: Wikipedia)
    SOAP: first-generation short read alignment tool.
    SOAP2 (2008): 20 to 30 times faster than SOAP, with less memory; collaboration between BGI and the University of Hong Kong; compressed indexing: bidirectional BWT (2BWT).
    SOAP3 (2011): 10 to 30 times faster than SOAP2; collaboration with the University of Hong Kong; uses the GPU’s parallel processing power; CPU memory: increase from a few GB to tens of GB; GPU-based indexing: GPU-2BWT.
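The notes above describe running common algorithms under Hadoop Streaming, where a mapper emits key/value lines, the framework sorts them by key, and a reducer processes each key group. The sketch below illustrates that pattern for a simplified duplicate-removal step. It is not the SOAP-Gaea implementation: the tab-separated input layout (`read_id`, `chrom`, `pos`, `strand`, `mapq`), the function names, and the "keep the highest mapping quality" rule are all assumptions made for illustration.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'chrom:pos:strand<TAB>mapq<TAB>read_id' for each aligned read.

    The input layout (read_id, chrom, pos, strand, mapq) is hypothetical.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        read_id, chrom, pos, strand, mapq = line.split("\t")
        yield f"{chrom}:{pos}:{strand}\t{mapq}\t{read_id}"


def reducer(sorted_lines):
    """For each key, keep the read with the highest mapping quality.

    Ties on mapping quality are broken by the larger read id.
    """
    def key_of(record):
        return record.split("\t", 1)[0]

    for key, group in groupby(sorted_lines, key=key_of):
        best_mapq, best_read = max(
            (int(mapq), read_id)
            for _, mapq, read_id in (rec.split("\t") for rec in group)
        )
        yield f"{key}\t{best_read}\t{best_mapq}"


def run_locally(lines):
    """Simulate map -> sort (shuffle) -> reduce in a single process."""
    return list(reducer(sorted(mapper(lines))))


# Two reads share a position and strand; only the better one survives.
reads = [
    "r1\tchr1\t100\t+\t30",
    "r2\tchr1\t100\t+\t60",
    "r3\tchr1\t250\t-\t20",
]
result = run_locally(reads)
```

Under Hadoop Streaming the mapper and reducer would normally be separate scripts reading standard input and writing standard output, with the sort between them done by the framework. Adding nodes then spreads the key groups across more reducers, which is the scaling behaviour the notes refer to.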
  • Best practices at BGI for the Challenges in the Era of Big Genomics Data

    1. Xing Xu, Ph.D., Director of Cloud Computing Product. Challenges in the Era of Big Genomic Data and Our Practices in BGI
    2. Topics for Today • About BGI • Challenges and Solutions: Data transfer; Cloud Computing; Computational Algorithms and Infrastructure; Data Storage
    3. BGI • The world's largest genome sequencing center. Started with the Human Genome Project in 1999 with only a few sequencers. Now more than 150 sequencers, 6 TB/day sequencing throughput. Installed sequencers by model: ABI 3730XL (16), Roche 454 (1), ABI SOLiD 4 (27), Solexa GA IIx (6), Illumina HiSeq 2000 (135).
    4. BGI • The world's largest genome sequencing center • The largest computing and storage center for genomics in China - 20,000+ CPU cores - 19 NVIDIA GPUs - 220+ Tflops peak performance - 17 PB data storage - Storage and computation capability have increased 10,000-fold - Still increasing …
    5. BGI • The world's largest genome sequencing center • The largest computing and storage center for genomics in China • One of the world's leading research institutes in genomics. Since 2007: - 253 papers in high-impact journals - Including 47 in Nature and its sub-journals, 9 in Science, 2 in Cell, and 1 in NEJM, with 42 first and/or corresponding authors - 369 patent applications - 254 software authorships
    6. BGI • The world's largest genome sequencing center • The largest computing and storage center for genomics in China • One of the world's leading research institutes in genomics. BGI has the sequencing capacity, hardware resources and software proficiency to be one of the strongest end-to-end service providers in the world for NGS sequencing, data analysis and data interpretation.
    7. Challenges for Handling Big Data • Exponential growth in data volume
    8. Challenges for Handling Big Data • Exponential growth in data volume • Complicated data analysis process
    9. Challenges for Handling Big Data • Exponential growth in data volume • Complicated data analysis process • Widely distributed data (images from omicsmaps.com)
    10. Challenges and Solutions • Data transfer • Cloud Computing • Computational Algorithms and Infrastructure • Data Management
    11. Solutions for data transfer • Data transfer: Solution I: Hard drive shipment (w/ FedEx)
    12. Solutions for data transfer • Data transfer: Solution I: Hard drive shipment (w/ FedEx); Solution II: High Speed Data Transfer
    13. Solutions for data transfer: High speed data transfer • Demonstrated 10 Gbps ultra-high-speed data exchange with UC Davis and NCBI in June 2012.
    14. Solutions for data transfer: High speed data transfer • Demonstrated 10 Gbps ultra-high-speed data exchange with UC Davis and NCBI in June. • A 24 GB file was transferred from China to the US in 30 seconds (~8 Gbits/s). Right software: Aspera FASP data transfer protocol. Right infrastructure: 10 Gb link between the US and China. Right technology: RAM disk, IPv6.
    15. Solutions for data transfer • Data transfer: Solution I: Hard drive shipment (w/ FedEx); Solution II: High Speed Data Transfer. [Diagram: one Aspera server serving several Aspera clients.] • Software license • Expensive physical bandwidth • Free BGI clients • Bottleneck on the client site • Not a good solution for sharing
    16. Solutions for data transfer • Data transfer: Solution I: Hard drive shipment (w/ FedEx); Solution II: High Speed Data Transfer; Solution III: Don’t move the data (Cloud Computing)
    17. Solutions for cloud • Data transfer: Solution I: Hard drive shipment (w/ FedEx); Solution II: High Speed Data Transfer; Solution III: Don’t move the data (Cloud Computing) • Cloud Computing: EasyGenomics, a Software as a Service (SaaS) platform for NGS data analysis
    18. EasyGenomics™: EasyGenomics is a Software as a Service (SaaS) bioinformatics platform for research and applications. Components: Algorithms, Workflows, Reports; Computational Resources; Database, Data management; Web portal, Simple UI; High speed connection
    19. A typical use case: EasyGenomics combined with the customer's local computational resources. Double-lined items are the customer's data or resources. Single-lined items are results and data within BGI and the EasyGenomics platform. Arrow widths represent the sizes of data flows (not to scale). [Diagram label: Customers’ Local Resources]
    20. Bioinformatics Workflow • Four steps: Upload, Create a Sample, Perform Analyses, Download Results • Algorithms: Carefully chosen, tested and optimized • Workflows: Whole Genome Resequencing, Exome Resequencing, RNA-Seq, small RNA, ncRNA, and De novo Assembly
    21. Homepage: Four task portals; Status of recent work; Warning and Logging; Navigation Tabs
    22. Sequencing Quality Report
    23. Mapping Report
    24. Create an Analysis: Selected sample(s) • One selected sample => Single Analysis • Multiple selected samples => Batch Analyses
    25. Create an Analysis: Selectable modules; Predefined Settings; Shortcut
    26. What’s new? • An internal version of EG is running automatically as a production system. • It integrates the new data delivery portal of the sequencing service: Aspera FASP download; accessible to all workflows on EasyGenomics
    27. You can choose to deliver data to the EasyGenomics platform. [Screenshot label: Configuration file]
    28. Import Data from Sequencing Service
    29. Import Data from Sequencing Service. [Screenshot label: Imported Samples]
    30. Solutions for cloud • Data transfer: Solution I: Hard drive shipment (w/ FedEx); Solution II: High Speed Data Transfer; Solution III: Don’t move the data (Cloud Computing) • Cloud Computing: EasyGenomics, a SaaS platform for NGS data analysis; Two paths for the future cloud solution
    31. Two paths for the future cloud solution • Software as a Service (SaaS) to Platform as a Service (PaaS), to give flexibility to research users: add their own tools (any tools); integrate their own workflows (different combinations of modules) • One-Click SaaS solution, to give an automated solution to clinical users: automated solution for repetitive work; fulfills very specific functions
    32. Solutions for Algorithm and Infrastructure • Data transfer: Solution I: Hard drive shipment (w/ FedEx); Solution II: High Speed Data Transfer; Solution III: Don’t move the data (Cloud Computing) • Cloud Computing: EasyGenomics, a SaaS platform for NGS data analysis; Two paths for the future cloud solution • Algorithm and Infrastructure: Scale up with Hadoop / MapReduce: Hecate (de novo assembly tool), Gaea (resequencing pipeline)
    33. • Fast Parallel Framework: Hadoop Streaming • Reliable Storage System: HDFS • Scalable Map/Reduce framework
    34. SOAP-Gaea: Hadoop-based resequencing pipeline. Steps: Raw Data -> QC -> Mapping -> Remove PCR duplicates -> Realignment -> Identify Variations -> Selection & Annotation. Tools per step: QC: SOAP-GaeaQC; Mapping: SOAPaligner, BWA, Bowtie, SOAP-GaeaAlignment; Remove PCR duplicates: SOAP-GaeaMarkDuplicate; Realignment: SOAP-GaeaRealignment; SNP: SOAPsnp, SOAP-GaeaSNP, SAMtools; InDel: Dindel, SOAP-GaeaIndel
    35. SOAP-Gaea: Hadoop-based resequencing pipeline. [Diagram labels: Reads, Reference, Key, Value, Position, Map, Aligning, Reduce] • Distributed indexing for load balancing • Flexible splitting tolerates more mismatches • Dynamic programming for robust gap alignment
    36. Fast and Scalable. [Chart: old pipeline vs. cloud-based pipeline.] Old pipeline: two weeks; cloud-based pipeline: within 15 hrs (120 cores). Data: human 60X whole genome re-sequencing. • The Hadoop implementation provides great scalability. • Simply by providing more resources, the analysis can finish much faster.
    37. SOAP-GaeaAlignment (1 human sample in 1000genome). It’s not only FAST, but also ACCURATE. Software: mapping rate / confident mapping rate (MAPQ>=10): Stampy 85.93% / 70.00%; SOAP2 79.14% / 79.14%; Novoalign 82.53% / 79.74%; BWA 91.54% / 84.78%; Bowtie 81.15% / 81.15%; SOAP-GaeaAlignment 91.75% / 85.20%
    38. SOAP-Hecate: Distributed de novo Genome Assembly. Assembly steps shown: Constructing de Bruijn Graph; Solving Tiny Repeats; Merging Bubbles; Scaffolding; Merging Contigs
    39. SOAP-Hecate is scalable and uses much less memory. Pipeline stages shown: Contig Extension, Scaffolding, Gap closing. • Scalability (80X human whole genome; values listed as SOAPdenovo v2 / SOAP-Hecate v2.5 on 84 cores / SOAP-Hecate v2.5 on 180 cores): Data size 670GB / 670GB / 670GB; No. of servers 1 / 7 / 15; Time 59 hours / 59 hours / 38 hours; Memory size 400*1 / 24*7 / 24G*15; Mode Centralized / Distributed / Distributed. • Performance (scaffold N50; tools in slide order: SOAP-Hecate, SOAPdenovo, ALLPATH, Phusion2, phrap, Meraculous, ABySS; values in slide order: 26,570,829, 117,000, 211,000, 495,000, 486,000, 144,300). Tested on simulated data from Assemblathon 1 (Earl, Bradnam et al.
    40. Solutions for Algorithms • Data transfer: Solution I: Hard drive shipment (w/ FedEx); Solution II: High Speed Data Transfer; Solution III: Don’t move the data (Cloud Computing) • Cloud Computing: EasyGenomics, a SaaS platform for NGS data analysis; Two paths for the future cloud solution • Algorithm and Infrastructure: Scale up with Hadoop / MapReduce: Hecate (de novo assembly tool), Gaea (resequencing pipeline); GPU-based acceleration: SOAP3 (aligner), GSNP (SNP caller), GAMA (population genetics tool)
    41. SOAP3: ~20X speedup from SOAP2. Timeline: SOAP; SOAP2 (2008), 20-30x; SOAP3 (2011), 10-30X, GPU version. Collaboration with the University of Hong Kong. [Charts: total time in seconds, speedup, and alignment ratio (%) for SOAP2 vs. SOAP3 on human and zebrafish data. Data labels as extracted: total time 1893.45, 10671.39, 211.53, 819.81; speedup 14.12, 14.6; alignment ratio 84.2, 64.49, 88.29, 76.55.]
    42. GSNP: 50X faster than the CPU-based SOAPsnp. Elapsed time (sec., log scale): Ch. 1: GSNP 527, SOAPsnp 21879; Ch. 21: GSNP 73, SOAPsnp 3675. • The elapsed time of all steps is included. • GSNP is around 50x faster than single-thread CPU-based SOAPsnp.
    43. Solutions for Data Management • Data transfer: Solution I: Hard drive shipment (w/ FedEx); Solution II: High Speed Data Transfer; Solution III: Don’t move the data (Cloud Computing) • Cloud Computing: EasyGenomics, a SaaS platform for NGS data analysis; Two paths for the future cloud solution • Algorithm and Infrastructure: Scale up with Hadoop / MapReduce; GPU-based acceleration • Data Management: Data management in BGI
    44. Paradigm Shift. Traditional Model: Business determines what question to ask; IT structures the data to answer that question. Big Data Model: IT delivers a platform to enable creative discovery; Business explores what questions could be asked.
    45. Information Pyramid. [Diagram labels: Value; Decision, Knowledge, Information, Data; Element, Meaning, Context, Application, Achievement; Organizing, Refining, Summarizing, Utilizing]
    46. BGI Data Pyramid. iRODS (Data): Data Preservation, Data Retrieval, Data Sharing. Database (Information): BGI-SNP, BGI-SV, BGI-GaP. Data Mining (Knowledge): Disease: HGVD/PMRD, Systems Biology, Drug Discovery. Health/Clinical APP (Decision): Diagnosis of Genetic Diseases, Drug of Choice.
    47. Data Flow. [Diagram labels: iRODS; Sequencer; Raw Data; Data Analysis; Analyzed Data; Data Warehousing; Personalized Analysis; Clinical Diagnosis; Knowledge Base; Metadata; LIMS; Public Resources; BGI-DB; Variant (Gene); Disease; Drug]
    48. iRODS - integrated Rule-Oriented Data System. * Access data with a web-based browser, the iRODS GUI, or command-line clients. (Image: renci.org)
    49. Data Flow - iRODS. [Data flow diagram as on slide 47.] iRODS-based Data Management • Contents: raw data, analyzed data and related metadata • Data backup • Fully integrated with LIMS • Able to search and access any data according to the metadata from the BGI data standard, e.g. project, sample, cohort, phenotype, QC, etc. • Federation: integrate separate iRODS zones
    50. Data Flow - BGI-DB. [Data flow diagram as on slide 47.] BGI-DB • A locus-specific database (LSDB) for all variants identified by BGI • Manages all basic information generated from data analysis pipelines • Links all detailed information about individual samples to each variant • Easy to query information from samples with certain commonality (such as same phenotype, same cohort, etc.) • Provides the raw information for further data mining steps
    51. Data Flow - BGI-DW & BGI-KB. [Data flow diagram as on slide 47.] BGI Data Warehousing & Knowledge Base • BGI data warehousing (BGI-DW) consists of a series of secondary databases related to variants, diseases and drugs • BGI knowledge base (BGI-KB) stores and manages the knowledge obtained through mining BGI-DB, BGI-DW and other public resources • Periodically and automatically updated • Provides APIs for bioinformaticians to query the information and generate individualized reports
    52. Data Flow - Success Story: Diagnosis for Monogenic Disease. [Data flow diagram as on slide 47.] • Group samples into cohorts based on their phenotypes • Calculate variant frequencies from certain cohorts and save them into the allele frequency database • Query the allele frequency database to filter out common variants and identify disease-causal variants
    53. Summary of Our Practice in IT infrastructure • Data transfer: Solution I: Hard drive shipment (w/ FedEx); Solution II: High Speed Data Transfer; Solution III: Don’t move the data (Cloud Computing) • Cloud Computing: EasyGenomics, a SaaS platform for NGS data analysis; Two paths for the future cloud solution • Algorithm and Infrastructure: Scale up with Hadoop / MapReduce; GPU-based acceleration • Data Management: Using the iRODS file system to manage big data
    54. Acknowledgement • Development Team: Dev: Ming Jiang, Yongsheng Chen, Can Long, Jiasheng Wu, etc.; Flex Lab: Yan Li (Hecate), Zhi Zhang (GAEA, iRODS), etc.; GPU Lab: Bingqiang Wang, etc. • Test & QA Team: Xin Guan, Jingjuan Liu, etc. • PMO & IT Operation: Wenjun Zeng, Litong Lai, Jing Tian, etc. • Product Team: Xing Xu, Jing Guo, Fang Fang, etc. • Other BGI Teams • Collaborators: University of Hong Kong (HKU); Hong Kong University of Science and Technology (HKUST); Nvidia; Aspera; RENCI; Tianjin Supercomputing Center
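As a quick check on the transfer figures quoted on slide 14 (a 24 GB file in 30 seconds, described as roughly 8 Gbit/s), the arithmetic can be sketched as follows. The helper names are hypothetical, and decimal units (1 GB = 8 Gbit) are assumed. Under that assumption the average rate works out to 6.4 Gbit/s, so the slide's figure may refer to a peak rate or to rounded inputs; the slides do not say which.

```python
def average_rate_gbit_per_s(size_gigabytes, seconds):
    """Average throughput in Gbit/s, assuming 1 GB = 8 Gbit (decimal units)."""
    return size_gigabytes * 8 / seconds


def transfer_time_s(size_gigabytes, rate_gbit_per_s):
    """Seconds needed to move the given volume at a sustained rate."""
    return size_gigabytes * 8 / rate_gbit_per_s


# Figures from slide 14: 24 GB moved in 30 seconds.
avg_rate = average_rate_gbit_per_s(24, 30)

# Time the same file would take at a sustained 8 Gbit/s.
time_at_8 = transfer_time_s(24, 8)

# One day of sequencing output (6 TB/day on slide 3) at a sustained 8 Gbit/s, in hours.
daily_hours = transfer_time_s(6000, 8) / 3600

print(f"average rate: {avg_rate:.1f} Gbit/s")
print(f"24 GB at 8 Gbit/s: {time_at_8:.0f} s")
print(f"6 TB at 8 Gbit/s: {daily_hours:.2f} h")
```

The last line is only a back-of-the-envelope illustration of why a 10 Gb link makes daily output transferable in principle. It ignores protocol overhead, disk speed, and the client-side bottleneck listed on slide 15.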
