TPC-DI - The First Industry Benchmark for Data Integration - Tilmann Rabl
This presentation was held by Meikel Poess on September 3, 2014 at VLDB 2014 in Hangzhou, China.
Full paper and additional information available at:
http://msrg.org/papers/VLDB2014TPCDI
Abstract:
Historically, the process of synchronizing a decision support system with data from operational systems has been referred to as Extract, Transform, Load (ETL), and the tools supporting this process have been referred to as ETL tools. Recently, ETL was replaced by the more comprehensive term data integration (DI). DI describes the process of extracting and combining data from a variety of data source formats, transforming that data into a unified data model representation, and loading it into a data store. This is done in the context of a variety of scenarios, such as data acquisition for business intelligence, analytics and data warehousing, but also synchronization of data between operational applications, data migrations and conversions, master data management, enterprise data sharing, and delivery of data services in a service-oriented architecture context, amongst others. With these scenarios relying on up-to-date information, it is critical to implement a highly performing, scalable, and easy to maintain data integration system. This is especially important as the complexity, variety, and volume of data are constantly increasing, making the performance of data integration systems ever more critical. Despite the significance of having a highly performing DI system, there has been no industry standard for measuring and comparing their performance. The TPC, acknowledging this void, has released TPC-DI, an innovative benchmark for data integration. This paper motivates the reasons behind its development, describes its main characteristics, including the workload, run rules, and metric, and explains key decisions.
This tutorial was held at IEEE BigData '14 on October 29, 2014 in Bethesda, MD, USA.
Presenters: Chaitan Baru and Tilmann Rabl
More information available at:
http://msrg.org/papers/BigData14-Rabl
Summary:
This tutorial will introduce the audience to the broad set of issues involved in defining big data benchmarks and in creating auditable, industry-standard benchmarks that consider performance as well as price/performance. Big data benchmarks must capture the essential characteristics of big data applications and systems, including heterogeneous data (e.g., structured, semi-structured, unstructured, graphs, and streams); large-scale and evolving system configurations; varying system loads; processing pipelines that progressively transform data; and workloads that include queries as well as data mining and machine learning operations and algorithms. Different benchmarking approaches will be introduced, from micro-benchmarks to application-level benchmarking.
Since May 2012, five workshops on Big Data Benchmarking have been held, with participation from industry and academia. One of the outcomes of these meetings has been the creation of the industry's first big data benchmark, viz., TPCx-HS, the Transaction Processing Performance Council's benchmark for Hadoop Systems. During these workshops, a number of other proposals have been put forward for more comprehensive big data benchmarking. The tutorial will present and discuss salient points and essential features of such benchmarks that have been identified in these meetings by experts in big data as well as benchmarking. Two key approaches are now being pursued: one, called BigBench, is based on extending the TPC Decision Support (TPC-DS) benchmark with big data application characteristics; the other, called the Deep Analytics Pipeline, is based on modeling processing that is routinely encountered in real-life big data applications. Both will be discussed.
We conclude with a discussion of a number of future directions for big data benchmarking.
This presentation was held at ISC 2014 on June 26, 2014 in Leipzig, Germany.
More information available at:
http://msrg.org/papers/ISC2014-Rabl
Abstract:
The Workshops on Big Data Benchmarking (http://clds.sdsc.edu/bdbc/workshops), which have been underway since May 2012, have identified a set of characteristics of big data applications that apply to industry as well as scientific application scenarios: pipelines of processing with steps that include aggregation, cleaning, and annotation of large volumes of data; filtering, integration, fusion, subsetting, and compaction of data; and subsequent analysis, including visualization, data mining, predictive analytics and, eventually, decision making. One of the outcomes of the WBDB workshops has been the formation of a Transaction Processing Performance Council subcommittee on Big Data, which is initially defining a Hadoop systems benchmark, TPCx-HS, based on TeraSort. TPCx-HS would be a simple, functional benchmark that would assist in determining basic resiliency and scalability features of large-scale systems. Other proposals are also under active development, including BigBench, which extends the TPC-DS benchmark for big data scenarios; the Big Decision Benchmark from HP; HiBench from Intel; and the Deep Analytics Pipeline (DAP), which defines a sequence of end-to-end processing steps consisting of some of the operations mentioned above. Pipeline benchmarks reveal the need for different processing modalities and system characteristics at different steps in the pipeline. For example, early processing steps may process very large volumes of data and may benefit from a Hadoop/MapReduce style of computing, while later steps may operate on more structured data and may require, say, SMP-style architectures or very large memory systems. This talk will provide an overview of these benchmark activities and discuss opportunities for collaboration and future work with industry partners.
High Performance Data Analytics with Java on Large Multicore HPC Clusters - Saliya Ekanayake
Within the last few years, there have been significant contributions to Java-based big data frameworks and libraries such as Apache Hadoop, Spark, and Storm. While these systems are rich in interoperability and features, developing high performance big data analytics applications is challenging. Moreover, the study of performance characteristics and high performance optimizations for these applications is lacking in the literature. By contrast, these features are well documented in the High Performance Computing (HPC) domain, and some of those techniques have potential performance benefits in the big data domain as well. This paper presents the implementation of a high performance big data analytics library - SPIDAL Java - with a comprehensive discussion of five performance challenges, their solutions, and speedup results. SPIDAL Java captures a class of global machine learning applications with significant computation and communication that can serve as a yardstick for studying performance bottlenecks in Java big data analytics. The five challenges presented here are the cost of intra-node messaging, inefficient cache utilization, performance costs with threads, the overhead of garbage collection, and the costs of heap-allocated objects. SPIDAL Java presents solutions to these challenges and demonstrates significant performance gains and scalability when running on up to 3072 cores on one of the latest Intel Haswell-based multicore clusters.
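Two of the challenges above, garbage-collection overhead and the cost of heap-allocated objects, come down to keeping boxed numeric types off hot paths. The sketch below is a minimal, hypothetical illustration of that general technique; the class and method names are invented for this example and are not SPIDAL Java's actual API. A primitive `double[]` is a single contiguous allocation with no per-element GC work, whereas a `Double[]` allocates one traceable heap object per element and unboxes on every access.

```java
// Illustrative only: boxed objects vs. primitive arrays for large numeric state.
public class HeapVsPrimitive {
    // Boxed: each Double is a separate heap object the garbage collector
    // must allocate, trace, and eventually reclaim; every read unboxes.
    static double sumBoxed(Double[] xs) {
        double s = 0.0;
        for (Double x : xs) s += x;   // implicit unboxing on each element
        return s;
    }

    // Primitive: one contiguous allocation, cache-friendly, no per-element
    // GC pressure - the pattern large Java analytics codes tend to prefer.
    static double sumPrimitive(double[] xs) {
        double s = 0.0;
        for (double x : xs) s += x;
        return s;
    }
}
```

Both methods compute the same sum; the difference is purely in allocation behavior, which is what matters at the scale of millions of elements per core.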
TPCx-HS is the first vendor-neutral benchmark focused on big data systems – which have become a critical part of the enterprise IT ecosystem.
Watch the video presentation: http://wp.me/p3RLHQ-cLY
Learn more: http://www.tpc.org/tpcx-hs
High Performance Processing of Streaming Data - Geoffrey Fox
Describes two parallel robot planning algorithms implemented with Apache Storm on OpenStack: SLAM (Simultaneous Localization and Mapping) and collision avoidance. Performance (response time) is studied and improved as an example of the HPC-ABDS (High Performance Computing enhanced Apache Big Data Software Stack) concept.
C2MON - A highly scalable monitoring platform for Big Data scenarios @CERN by... - J On The Beach
Developing reliable data acquisition, processing, and control modules for mission-critical systems - such as those that run at CERN - has always been challenging. As both data volumes and rates increase, non-functional requirements such as performance, availability, and maintainability are becoming more important than ever. C2MON is a modular open source Java framework for realising highly available, large industrial monitoring and control solutions. It was initially developed for CERN's demanding infrastructure monitoring needs and is based on more than 10 years of experience with the Technical Infrastructure Monitoring (TIM) systems at CERN. Combining maintainability and high availability within a portable architecture is the focus of this work. Making use of standard Java libraries for in-memory data management, clustering, and data persistence, the platform becomes interesting for many Big Data scenarios.
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes... - Geoffrey Fox
“Next Generation Grid – HPC Cloud” proposes a toolkit capturing current capabilities of Apache Hadoop, Spark, Flink, and Heron as well as MPI and Asynchronous Many Task systems from HPC. This supports a Cloud-HPC-Edge (Fog, Device) Function-as-a-Service architecture. Note that this "new grid" is focused on data and IoT, not computing. It uses interoperable common abstractions but multiple polymorphic implementations.
Ceph on the Brain: Storage and Data-Movement Supporting the Human Brain Project - inside-BigData.com
Adrian Tate from Cray and Stig Telfer from StackHPC gave this talk at the 2018 Swiss HPC Conference.
“The Human Brain Project is an EU flagship project that seeks to “provide researchers worldwide with tools and mathematical models for sharing and analysing large brain data they need for understanding how the human brain works.” The HBP encompasses massively parallel applications in neuro-simulation, with advanced computational and data movement requirements well beyond current technological capabilities. It will therefore drive innovation in HPC. As part of the R&D co-design, the HBP sub-contracted Cray to develop a pilot next-generation supercomputer system and software stack to address cutting-edge brain simulation and analysis activities. This system features a novel memory/storage hierarchy consisting of new memories, NVRAM, and SSDs, accessed through Ceph over the Intel Omni-Path interconnect.
This talk will describe how Cray, StackHPC and the HBP co-designed a next-generation storage system based on Ceph, exploiting complex memory hierarchies and enabling next-generation mixed workload execution. We will describe the challenges, show performance data and detail the ways that a similar storage setup may be used in HPC systems of the future.”
Watch the video: https://insidehpc.com/2018/04/ceph-brain-storage-data-movement-supporting-human-brain-project/
Learn more: http://www.hpcadvisorycouncil.com/events/2018/swiss-workshop/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Matching Data Intensive Applications and Hardware/Software Architectures - Geoffrey Fox
There is perhaps a broad consensus as to important issues in practical parallel computing as applied to large scale simulations; this is reflected in supercomputer architectures, algorithms, libraries, languages, compilers and best practice for application development. However the same is not so true for data intensive problems even though commercial clouds presumably devote more resources to data analytics than supercomputers devote to simulations. We try to establish some principles that allow one to compare data intensive architectures and decide which applications fit which machines and which software.
We use a sample of over 50 big data applications to identify characteristics of data intensive applications and propose a big data version of the famous Berkeley dwarfs and NAS parallel benchmarks. We consider hardware from clouds to HPC. Our software analysis builds on the Apache software stack (ABDS) that is well used in modern cloud computing, which we enhance with HPC concepts to derive HPC-ABDS.
We illustrate issues with examples including kernels like clustering, and multi-dimensional scaling; cyberphysical systems; databases; and variants of image processing from beam lines, Facebook and deep-learning.
Materials Data Facility: Streamlined and automated data sharing, discovery, ... - Ian Foster
Reviews recent results from the Materials Data Facility. Thanks in particular to Ben Blaiszik, Jonathon Goff, and Logan Ward, and the Globus data search team. Some features shown here are still in beta. We are grateful to NIST for their support.
An effective classification approach for big data with parallel generalized H... - riyaniaes
Advancements in information technology are contributing to the excessive rate of big data generation. Big data refers to datasets that are huge in volume and consume much time and space to process and transmit using the available resources; it also covers data in both unstructured and structured formats. Many agencies are currently funding research on big data analytics owing to the failure of existing data processing techniques to handle the rate at which big data is generated. This paper presents an efficient classification and reduction technique for big data based on the parallel generalized Hebbian algorithm (GHA), one of the commonly used principal component analysis (PCA) neural network (NN) learning algorithms. The new method proposed in this study was compared to existing methods to demonstrate its capabilities in reducing the dimensionality of big data. The proposed method is implemented on the Spark Radoop platform.
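The generalized Hebbian algorithm mentioned above (also known as Sanger's rule) updates a set of weight vectors so that, over many samples, they converge to the leading principal components of the input stream. The following is a minimal, self-contained sketch of one GHA update step, assuming zero-mean inputs; it illustrates the textbook rule only, not the paper's parallel Spark/Radoop implementation, and the class name is invented for this example.

```java
// One update step of Sanger's rule (generalized Hebbian algorithm):
//   dw_ij = lr * y_i * (x_j - sum_{m<=i} y_m * w_mj)
// where y = W x are the component outputs for the current sample x.
public class GhaSketch {
    static void ghaUpdate(double[][] w, double[] x, double lr) {
        int k = w.length, d = x.length;

        // Snapshot the weights so every row updates from the same state.
        double[][] w0 = new double[k][];
        for (int i = 0; i < k; i++) w0[i] = w[i].clone();

        // Component outputs y_i = w_i . x
        double[] y = new double[k];
        for (int i = 0; i < k; i++)
            for (int j = 0; j < d; j++)
                y[i] += w0[i][j] * x[j];

        // Hebbian update with Gram-Schmidt-like deflation term.
        for (int i = 0; i < k; i++) {
            for (int j = 0; j < d; j++) {
                double back = 0.0;                       // sum_{m<=i} y_m * w_mj
                for (int m = 0; m <= i; m++) back += y[m] * w0[m][j];
                w[i][j] += lr * y[i] * (x[j] - back);
            }
        }
    }
}
```

A unit weight vector already aligned with the input direction is a fixed point of this rule: the deflation term cancels the Hebbian term exactly, which is how the rows settle onto orthogonal principal components.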
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain... - Big Data Spain
Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmadigital.com
Abstract: http://www.bigdataspain.org/program/thu/slot-7.html
The Matsu Project - Open Source Software for Processing Satellite Imagery Data - Robert Grossman
The Matsu Project is an Open Cloud Consortium project that is developing open source software for processing satellite imagery data using Hadoop, OpenStack and R.
The Aspera solution provides telecommunication companies with a complete portfolio of file transfer, distribution, synchronization, and automation software to:
- Systematically achieve maximum transfer speeds for big data, regardless of network conditions and transfer distance
- Centrally manage, monitor, and control transfer activity, server infrastructure, and bandwidth utilisation
- Automate file transfer workflows and schedule transfer activity and bandwidth availability
- Deliver uncompromising security and reliability
Strong reversal of the lung fibrosis disease signature by autotaxin inhibitor... - Maté Ongenaert
Strong reversal of the lung fibrosis disease signature by autotaxin inhibitor GLPG1690 in a mouse model for IPF
Maté Ongenaert (Mechelen, Belgium), Maté Ongenaert, Sonia Dupont, Roland Blanqué, Reginald Brys, Ellen van der Aar, Bertrand Heckmann
ERS - European Respiratory Society International Congress 2016. Session: Therapeutic horizons: novel targets and pharmacological models
Background and objectives
GLPG1690 is a novel potent autotaxin (ATX) inhibitor shown to be efficacious in the mouse bleomycin (BLM) lung fibrosis model. Here, we analyze the impact of GLPG1690 on the gene expression signature in mouse fibrotic lung tissue.
Methods
Lung fibrosis was induced by intranasal administration of BLM. Animals were treated with GLPG1690 or vehicle. The whole superior right lung was used for RNA extraction. Full transcriptome analysis was performed using the Agilent SurePrint G3 mouse chip. Analysis was performed using empirical Bayes methods and linear models. Public human IPF expression data were re-analyzed.
Results
GLPG1690 strongly reduced lung fibrosis, as shown by reductions in Ashcroft scores and collagen content. Microarray analysis of the lungs revealed that GLPG1690 strongly reversed the impact of BLM on gene expression (367 out of the 2375 probes). As GLPG1690 treatment affects 395 probes in total, this treatment effect is highly relevant in the model. Gene clusters affected by BLM treatment and reverted by GLPG1690 are related to the extracellular matrix (such as Tnc and Spp1), collagen (Col3a1), and cytokines/chemokines (Cxcl12). Several of the affected genes are known to be involved in the development or progression of lung fibrosis in IPF patients.
Conclusions
These data provide further mechanistic understanding of the efficacy of ATX inhibition in a pre-clinical lung fibrosis model, highlighting a role for extracellular matrix and inflammation biology. These data strongly suggest that GLPG1690 may be beneficial in treating IPF patients and support its evaluation in a clinical study.
IBM Managed File Transfer Portfolio - IBM Impact 2014 - Leif Davidsen
The data held in files can represent some of the most valuable assets that a business has. If this data is trapped in a file on a remote system, it loses value. IBM has recently updated its portfolio of Managed File Transfer offerings, simplifying choice for the business user and offering better value for money, while extending access to this valuable data. Hear about different managed file transfer use cases and suggested solutions.
Open human genome data - presentation at the annual TKT/CLIDP doctoral programme symposium 2016 "Open up! – Open Data and Open Access" of the University of Turku.
New Progress in Pyrosequencing for DNA Methylation - QIAGEN
Pyrosequencing is a highly flexible technology that lets you rapidly and quantitatively analyze short- to medium-length sequences with high accuracy. The real-time, high-resolution sequence output makes the technology highly suitable for applications including complex mutation analysis, microbial identification and DNA methylation quantification.
The main bottleneck in Pyrosequencing has been limited sequence length, which is critical for some applications. Our new technology, software, and chemistry overcome this bottleneck and give sequence reads that are typically twice as long as those from previous PyroMark systems. The new PyroMark Q24 Advanced system also reduces background noise, improving quantification even at sites distant from the sequencing start. The new system is ideal for applications requiring analysis of longer sequences, such as DNA methylation analysis in epigenetic research, frequency determination in mutation analysis, and various de novo sequencing applications.
In this presentation, we will discuss the following applications and technology improvements:
• DNA methylation analysis at single base resolution at CpG and CpN sites
• Improved quantification of sequence variations at any sequence position
• Easy and improved base calling functionality
Is your organization considering migrating an existing data center to AWS to reduce cost and improve the reliability, security, and operational performance of your IT operations? If so, join us for a webinar on how to plan and execute your migration to the cloud, from classifying applications and assessing application needs to identifying target applications and other migration strategies.
PCR - From Setup to Cleanup: A Beginner's Guide with Useful Tips and Tricks -... QIAGEN
This End-Point PCR Beginner's Guide will not only give you a comprehensive overview of tools and techniques to help you get the most out of your samples, but also give you information on dedicated solutions and complete workflows for multiplex PCR and PCR fragment analysis.
Microbiome Profiling with the Microbial Genomics Pro Suite - QIAGEN
In this slide deck, we introduce the scientist-friendly Microbial Genomics Pro Suite offering workflows optimized for microbiome profiling, microbial typing and outbreak analysis. The workflows and tools for microbial genomics introduced with this software package are further extending the comprehensive set of genomics, transcriptomics and epigenomics analysis solutions that researchers know from CLC Genomics Workbench.
Clinical Metagenomics for Rapid Detection of Enteric Pathogens and Characteri... - QIAGEN
High-throughput sequencing, combined with high-resolution metagenomic analysis, provides a powerful diagnostic tool for clinical management of enteric disease. Forty-five patient samples of known and unknown disease etiology and 20 samples from healthy individuals were subjected to next-generation sequencing. Subsequent metagenomic analysis identified all microorganisms (bacteria, viruses, fungi and parasites) in the samples, including the expected pathogens in the samples of known etiology. Multiple pathogens were detected in individual samples, providing evidence for polymicrobial infection. Patients were clearly differentiated from healthy individuals based on microorganism abundance and diversity. The speed, accuracy and actionable features of CosmosID bioinformatics and curated GenBook® databases, implemented in the QIAGEN Microbial Genomics Pro Suite, and the functional analysis, leveraging the QIAGEN functional metagenomics workflow, provide a powerful tool contributing to the revolution in clinical diagnostics, prophylactics and therapeutics that is now in progress globally.
Strategies for Hyperscale Data Centers to Approach Net Zero - Mattias Ganslandt
Ganslandt and Bergquist presentation 22 February 2017 at Green Data Center Conference in San Diego and Di Data Center Conference in Stockholm #GDCON #didatacenter #cloudcomputing
This slide deck details two comprehensive informatics solutions, the Biomedical Genomics Workbench and Ingenuity Knowledge Base Variant Analysis platforms. We show the intuitive user interface of CLC Cancer Research Workbench and demonstrate how the rich biological content from the Ingenuity Knowledge Base helps you rapidly identify critical variants in your samples.
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio - Alluxio, Inc.
Alluxio Global Online Meetup
Apr 23, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Jiao (Jennie) Wang, Intel
Tsai Louie, Intel
Bin Fan, Alluxio
Today, many people run deep learning applications with training data from separate storage such as object storage or remote data centers. This presentation will demo the Intel Analytics Zoo + Alluxio stack, an architecture that enables high performance while keeping cost and resource efficiency balanced, without the network becoming an I/O bottleneck.
Intel Analytics Zoo is a unified data analytics and AI platform open-sourced by Intel. It seamlessly unites TensorFlow, Keras, PyTorch, Spark, Flink, and Ray programs into an integrated pipeline, which can transparently scale from a laptop to large clusters to process production big data. Alluxio, as an open-source data orchestration layer, accelerates data loading and processing in Analytics Zoo deep learning applications.
In this talk, we will go over:
- What is Analytics Zoo and how it works
- How to run Analytics Zoo with Alluxio in deep learning applications
- Initial performance benchmark results using the Analytics Zoo + Alluxio stack
Network Engineering for High Speed Data Sharing - Globus
These slides were presented by ESnet's Eli Dart at the AGU Fall Meeting 2018 in a session titled "Scalable Data Management Practices in Earth Sciences" convened by Ian Foster, Globus co-founder and director of Argonne's data science and learning division.
Denodo Platform 7.0: Redefine Analytics with In-Memory Parallel Processing an... - Denodo
Watch Pablo's session from Fast Data Strategy on-demand here: https://goo.gl/1aEBo8
The tide is changing for analytics architectures. Traditional approaches, from the data warehouse to the data lake, implicitly assume that all relevant data can be stored in a single, centralized repository. But this approach is slow and expensive, and sometimes not even feasible: some data sources are too big to be replicated, and data is often too widely distributed, as with cloud data sources, for a “full centralization” strategy to succeed.
Watch this session to learn more about:
• Modern data architectures
• Why logical architectures are the best option when integrating big data
• How Denodo’s parallel in-memory capabilities with dynamic query optimization redefine analytics architectures
This is a talk I gave at a Northwestern University - Complete Genomics Workshop on April 21, 2011 about using clouds to support research in genomics and related areas.
Cloudgene - A MapReduce based Workflow Management System - Lukas Forer
Cloudgene is a freely available platform that improves the usability of MapReduce programs by providing a graphical user interface for execution, data import and export, and reproducibility of workflows on in-house clusters (private clouds) and rented clusters (public clouds).
Have you ever wondered how search works while visiting an e-commerce site, internal website, or searching through other types of online resources? Look no further than this informative session on the ways that taxonomies help end-users navigate the internet! Hear from taxonomists and other information professionals who have first-hand experience creating and working with taxonomies that aid in navigation, search, and discovery across a range of disciplines.
This presentation, created by Syed Faiz ul Hassan, explores the profound influence of media on public perception and behavior. It delves into the evolution of media from oral traditions to modern digital and social media platforms. Key topics include the role of media in information propagation, socialization, crisis awareness, globalization, and education. The presentation also examines media influence through agenda setting, propaganda, and manipulative techniques used by advertisers and marketers. Furthermore, it highlights the impact of surveillance enabled by media technologies on personal behavior and preferences. Through this comprehensive overview, the presentation aims to shed light on how media shapes collective consciousness and public opinion.
Acorn Recovery: Restore IT infra within minutesIP ServerOne
Introducing Acorn Recovery as a Service, a simple, fast, and secure managed disaster recovery (DRaaS) offering by IP ServerOne: a DR solution that helps restore your IT infra within minutes.
Sharpen existing tools or get a new toolbox? Contemporary cluster initiatives...Orkestra
UIIN Conference, Madrid, 27-29 May 2024
James Wilson, Orkestra and Deusto Business School
Emily Wise, Lund University
Madeline Smith, The Glasgow School of Art
0x01 - Newton's Third Law: Static vs. Dynamic Abusers - OWASP Beja
If you offer a service on the web, odds are that someone will abuse it. Be it an API, a SaaS, a PaaS, or even a static website, someone somewhere will try to figure out a way to use it for their own ends. In this talk we'll compare measures that are effective against static attackers and how to battle a dynamic attacker who adapts to your counter-measures.
About the Speaker
===============
Diogo Sousa, Engineering Manager @ Canonical
An opinionated individual with an interest in cryptography and its intersection with secure software development.
This presentation by Morris Kleiner (University of Minnesota) was made during the discussion “Competition and Regulation in Professions and Occupations” held at the Working Party No. 2 on Competition and Regulation on 10 June 2024. More papers and presentations on the topic can be found at oe.cd/crps.
This presentation was uploaded with the author’s consent.
Best practices at BGI for the Challenges in the Era of Big Genomics Data
1. Xing Xu, Ph.D., Director of Cloud Computing Product
Challenges in the Era of Big Genomic Data and Our Practices at BGI
2. Topics for Today
- About BGI
- Challenges and Solutions
  - Data transfer
  - Cloud Computing
  - Computational Algorithms and Infrastructure
  - Data Storage
3. BGI
- The world's largest genome sequencing center
- Started with the Human Genome Project in 1999 with only a few sequencers
- Now more than 150 sequencers, 6 TB/day sequencing throughput

MODEL:          ABI 3730XL   Roche 454   ABI SOLiD 4   Solexa GA IIx   Illumina HiSeq 2000
INSTALLATIONS:  16           1           27            6               135
4. BGI
- The world's largest genome sequencing center
- The largest computing and storage center for genomics in China
  - 20,000+ CPU cores
  - 19 NVIDIA GPUs
  - 220+ Tflops peak performance
  - 17 PB data storage
  - Storage and computation capability have increased 10,000-fold
  - Still increasing ...
5. BGI
- The world's largest genome sequencing center
- The largest computing and storage center for genomics in China
- One of the world's leading research institutes in genomics
Since 2007:
- 253 papers in high-impact journals, including 47 in Nature and its sub-journals, 9 in Science, 2 in Cell, and 1 in NEJM, with 42 first and/or corresponding authorships
- 369 patent applications
- 254 software authorships
6. BGI
- The world's largest genome sequencing center
- The largest computing and storage center for genomics in China
- One of the world's leading research institutes in genomics
BGI has the sequencing capacity, hardware resources and software proficiency to be one of the strongest end-to-end service providers in the world for NGS sequencing, data analysis and data interpretation.
9. Challenges for Handling Big Data
- Exponential growth of data volume
- Complicated data analysis processes
- Widely distributed data
(Images from omicsmaps.com)
10. Challenges and Solutions
- Data transfer
- Cloud Computing
- Computational Algorithms and Infrastructure
- Data Management
11. Solutions for data transfer
Data transfer
- Solution I: Hard drive shipment (w/ FedEx)
12. Solutions for data transfer
Data transfer
- Solution I: Hard drive shipment (w/ FedEx)
- Solution II: High Speed Data Transfer
13. Solutions for data transfer: High speed data transfer
Demonstrated 10 Gbps ultra high speed data exchange with UC Davis and NCBI in June 2012.
14. Solutions for data transfer: High speed data transfer
Demonstrated 10 Gbps ultra high speed data exchange with UC Davis and NCBI in June 2012.
A 24 GB file was transferred from China to the US in 30 seconds (~8 Gbit/s).
- Right software: Aspera fasp data transfer protocol
- Right infrastructure: 10 Gb link between the US and China
- Right technology: RAM disk, IPv6
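The transfer figures above can be sanity-checked with simple arithmetic. This is a sketch of the average rate over the whole transfer; the ~8 Gbit/s quoted on the slide is presumably closer to the peak rate on the 10 Gb link:

```python
def throughput_gbps(size_gb: float, seconds: float) -> float:
    """Average throughput in gigabits per second for a transfer
    of size_gb gigabytes completed in the given number of seconds."""
    return size_gb * 8 / seconds

# 24 GB moved in 30 s averages 6.4 Gbit/s on a 10 Gbit/s link
print(throughput_gbps(24, 30))  # 6.4
```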
15. Solutions for data transfer
Data transfer
- Solution I: Hard drive shipment (w/ FedEx)
- Solution II: High Speed Data Transfer
[Diagram: an Aspera Server at BGI serving multiple Aspera Clients (client software is free). Drawbacks noted: software license cost, expensive physical bandwidth, a bottleneck on the client side, and not a good solution for sharing.]
16. Solutions for data transfer
Data transfer
- Solution I: Hard drive shipment (w/ FedEx)
- Solution II: High Speed Data Transfer
- Solution III: Don't move the data (Cloud Computing)
17. Solutions for cloud
Data transfer
- Solution I: Hard drive shipment (w/ FedEx)
- Solution II: High Speed Data Transfer
- Solution III: Don't move the data (Cloud Computing)
Cloud Computing
- EasyGenomics, a Software as a Service (SaaS) platform for NGS data analysis
18. EasyGenomics™
EasyGenomics is a Software as a Service (SaaS) bioinformatics platform for research and applications.
[Diagram: a web portal with a simple UI, connected via a high speed connection to algorithms, workflows and reports; computational resources; and databases with data management.]
19. A typical user case
[Diagram: a typical use case combining EasyGenomics with customers' local computational resources. Double-line items are customers' data or resources; single-line items are results and data within BGI and the EasyGenomics platform. Arrow widths represent relative data flow sizes (not to scale).]
20. Bioinformatics Workflow
Four steps: Upload, Create a Sample, Perform Analyses, Download Results
Algorithms: carefully chosen, tested and optimized
Workflows: Whole Genome Resequencing, Exome Resequencing, RNA-Seq, small RNA, ncRNA, and De novo Assembly
26. What’s new?
- An internal version of EasyGenomics is running automatically as a production system.
- It integrates the new data delivery portal of the sequencing service.
- Aspera fasp download
- Accessible to all workflows on EasyGenomics
27. You can choose to deliver data to the EasyGenomics platform
[Screenshot: configuration file]
30. Solutions for cloud
Data transfer
- Solution I: Hard drive shipment (w/ FedEx)
- Solution II: High Speed Data Transfer
- Solution III: Don't move the data (Cloud Computing)
Cloud Computing
- EasyGenomics, a SaaS platform for NGS data analysis
- Two paths for the future cloud solution
31. Two paths for the future cloud solution
Software as a Service (SaaS) to Platform as a Service (PaaS)
- To give flexibility to research users:
  - Add their own tools (any tools)
  - Integrate their own workflows (different combinations of modules)
One-click SaaS solution
- To give an automated solution to clinical users:
  - Automated solution for repetitive work
  - Fulfills very specific functions
32. Solutions for Algorithm and Infrastructure
Data transfer
- Solution I: Hard drive shipment (w/ FedEx)
- Solution II: High Speed Data Transfer
- Solution III: Don't move the data (Cloud Computing)
Cloud Computing
- EasyGenomics, a SaaS platform for NGS data analysis
- Two paths for the future cloud solution
Algorithm and Infrastructure
- Scale up with Hadoop / MapReduce: Hecate (de novo assembly tool), Gaea (resequencing pipeline)
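The Hadoop/MapReduce approach listed above can be illustrated with a Hadoop Streaming-style mapper/reducer pair (a minimal sketch, not BGI's actual Gaea code): the mapper emits `chromosome<TAB>1` for each tab-separated alignment record, and the reducer sums counts per chromosome. Because Hadoop Streaming simply pipes sorted text between stages, the same functions can be exercised locally:

```python
from itertools import groupby

def mapper(lines):
    """Emit 'chrom\t1' for each tab-separated alignment record (read, chrom, pos)."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            yield f"{fields[1]}\t1"

def reducer(lines):
    """Sum counts per chromosome. Input must be sorted by key, which the
    Hadoop shuffle phase guarantees between the map and reduce stages."""
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for chrom, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{chrom}\t{sum(int(count) for _, count in group)}"

# Local simulation of map -> shuffle (sort) -> reduce:
reads = ["r1\tchr1\t100", "r2\tchr1\t200", "r3\tchr2\t50"]
print(list(reducer(sorted(mapper(reads)))))  # ['chr1\t2', 'chr2\t1']
```

In a real deployment the two functions would be wired to stdin/stdout and passed to `hadoop jar hadoop-streaming.jar -mapper ... -reducer ...`.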
36. Fast and Scalable
[Chart: analysis time for a human 60X whole genome re-sequencing run. Old pipeline: two weeks; cloud-based pipeline: within 15 hrs (120 cores).]
- The Hadoop implementation provides great scalability.
- Simply by providing more resources, the analysis can finish much faster.
37. SOAP-GaeaAlignment (1 human sample from the 1000 Genomes Project)

Software             Mapping Rate   Confident Mapping Rate (MAPQ>=10)
Stampy               85.93%         70.00%
SOAP2                79.14%         79.14%
Novoalign            82.53%         79.74%
BWA                  91.54%         84.78%
Bowtie               81.15%         81.15%
SOAP-GaeaAlignment   91.75%         85.20%

It's not only FAST, but also ACCURATE.
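The two columns in the table follow from an aligner's per-read mapping qualities. A hypothetical helper (names are ours, not from the slide) showing the metric definitions, mapped reads over total, and reads at MAPQ >= 10 over total:

```python
def mapping_rates(mapqs, threshold=10):
    """Given per-read MAPQ values (None for unmapped reads), return
    (mapping rate, confident mapping rate at MAPQ >= threshold) in percent."""
    total = len(mapqs)
    mapped = sum(1 for q in mapqs if q is not None)
    confident = sum(1 for q in mapqs if q is not None and q >= threshold)
    return 100 * mapped / total, 100 * confident / total

# 8 of 10 reads mapped, 6 of them with MAPQ >= 10
qs = [None, None, 5, 9, 10, 20, 30, 40, 50, 60]
print(mapping_rates(qs))  # (80.0, 60.0)
```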
38. SOAP-Hecate: Distributed de novo Genome Assembly
Assembly steps:
- Constructing the de Bruijn graph
- Solving tiny repeats
- Merging bubbles
- Scaffolding
- Merging contigs
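The first step above, de Bruijn graph construction, can be sketched in a few lines (a toy single-machine version; Hecate's contribution is distributing this and the subsequent graph simplification over a cluster). Each read is broken into k-mers, and each k-mer contributes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix:

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, and each k-mer in a
    read adds an edge from its prefix (k-1)-mer to its suffix (k-1)-mer."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)

print(de_bruijn(["ACGT"], 3))  # {'AC': ['CG'], 'CG': ['GT']}
```

Assembly then amounts to walking paths in this graph, with the repeat-solving and bubble-merging steps cleaning up ambiguous branches.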
39. Scalability and Performance
Pipeline: contig extension, scaffolding, gap closing.

                 SOAPdenovo v2   SOAP-Hecate v2.5 (84 cores)   SOAP-Hecate v2.5 (180 cores)
Data Size        670 GB          670 GB                        670 GB
No. of Servers   1               7                             15
Time             59 hours        59 hours                      38 hours
Memory           400 GB x 1      24 GB x 7                     24 GB x 15
Mode             Centralized     Distributed                   Distributed
(*80X human whole genome)

SOAP-Hecate is scalable and uses much less memory.

Scaffold N50 (tested on simulated data from Assemblathon 1; Earl, Bradnam et al.):
SOAP-Hecate       26,570,829
SOAPdenovo        117,000
ALLPATH           211,000
Phusion2, phrap   495,000
Meraculous        486,000
ABySS             144,300
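The Scaffold N50 metric reported above has a standard definition (this is generic code, not BGI's): sort lengths in descending order and take the length at which the cumulative sum first reaches half the total assembly size.

```python
def n50(lengths):
    """N50: the length L such that scaffolds/contigs of length >= L
    together cover at least half of the total assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

print(n50([400, 300, 200, 100]))  # 300 (400 + 300 covers half of 1000)
```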
40. Solutions for Algorithms
Data transfer
- Solution I: Hard drive shipment (w/ FedEx)
- Solution II: High Speed Data Transfer
- Solution III: Don't move the data (Cloud Computing)
Cloud Computing
- EasyGenomics, a SaaS platform for NGS data analysis
- Two paths for the future cloud solution
Algorithm and Infrastructure
- Scale up with Hadoop / MapReduce: Hecate (de novo assembly tool), Gaea (resequencing pipeline)
- GPU based acceleration: SOAP3 (aligner), GSNP (SNP caller), GAMA (population genetics tool)
41. SOAP3: ~20X speed up from SOAP2
SOAP → SOAP2 (2008): 20-30x faster → SOAP3 (2011, GPU version): a further 10-30x faster.
[Charts comparing SOAP2 and SOAP3 on human and zebrafish data. Total time (seconds): SOAP2 1893.45 (human) / 10671.39 (zebrafish) vs. SOAP3 211.53 (human) / 819.81 (zebrafish). Speedup: 14.12 (human), 14.6 (zebrafish). Alignment ratio (%): SOAP2 84.2 (human) / 64.49 (zebrafish) vs. SOAP3 88.29 (human) / 76.55 (zebrafish).]
Collaboration with the University of Hong Kong.
43. Solutions for Data Management
Data transfer
- Solution I: Hard drive shipment (w/ FedEx)
- Solution II: High Speed Data Transfer
- Solution III: Don't move the data (Cloud Computing)
Cloud Computing
- EasyGenomics, a SaaS platform for NGS data analysis
- Two paths for the future cloud solution
Algorithm and Infrastructure
- Scale up with Hadoop / MapReduce
- GPU based acceleration
Data Management
- Data management in BGI
44. Paradigm Shift
Traditional Model:
- Business determines what question to ask
- IT structures the data to answer that question
Big Data Model:
- IT delivers a platform to enable creative discovery
- Business explores what questions could be asked
46. BGI Data Pyramid
- iRODS (Data): data preservation, data retrieval, data sharing
- Database (Information): BGI-SNP, BGI-SV, BGI-GaP; Disease: HGVD/PMRD
- Data Mining (Knowledge): systems biology, drug discovery
- Health/Clinical App (Decision): diagnosis of genetic diseases, drug of choice
48. iRODS - integrated Rule-Oriented Data System
Access data with a web-based browser, the iRODS GUI, or command line clients. (renci.org)
49. Data Flow - iRODS
[Diagram: sequencer → raw data → data analysis → analyzed data → data warehousing → personalized analysis → clinical diagnosis, with metadata from LIMS, public resources, the knowledge base, and BGI-DB (variant/gene, disease, drug).]
iRODS-based Data Management
- Contents: raw data, analyzed data and related metadata
- Data backup
- Fully integrated with LIMS
- Able to search and access any data according to metadata from the BGI data standard, e.g. project, sample, cohort, phenotype, QC, etc.
- Federation: integrates separate iRODS zones
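The metadata-driven search described above can be modeled in miniature (a toy in-memory sketch, not the iRODS API; the paths and attribute names are made up for illustration): each data object carries attribute-value metadata, and a query keeps only the objects matching every criterion.

```python
def query(catalog, **criteria):
    """Return paths whose metadata matches all attribute=value criteria,
    mimicking an iRODS-style metadata query (e.g. by project, sample, QC)."""
    return [path for path, meta in catalog.items()
            if all(meta.get(k) == v for k, v in criteria.items())]

catalog = {
    "/bgi/raw/s1.fastq": {"project": "P1", "sample": "S1", "qc": "pass"},
    "/bgi/raw/s2.fastq": {"project": "P1", "sample": "S2", "qc": "fail"},
    "/bgi/raw/s3.fastq": {"project": "P2", "sample": "S3", "qc": "pass"},
}
print(query(catalog, project="P1", qc="pass"))  # ['/bgi/raw/s1.fastq']
```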
50. Data Flow - BGI-DB
[Diagram: the same data flow as before, highlighting BGI-DB (variant/gene, disease, drug).]
BGI-DB
- A locus-specific database (LSDB) for all variants identified by BGI
- Manages all basic information generated from data analysis pipelines
- Links all detailed information about individual samples to each variant
- Easy to query information from samples with a certain commonality (such as same phenotype, same cohort, etc.)
- Provides the raw information for further data mining steps
51. Data Flow - BGI-DW & BGI-KB
[Diagram: the same data flow as before, highlighting data warehousing and the knowledge base.]
BGI Data Warehousing & Knowledge Base
- BGI data warehousing (BGI-DW) consists of a series of secondary databases related to variants, diseases and drugs
- The BGI knowledge base (BGI-KB) stores and manages the knowledge obtained through mining BGI-DB, BGI-DW and other public resources
- Periodically and automatically updated
- Provides APIs for bioinformaticians to query the information and generate individualized reports
52. Data Flow - A Success Story: Diagnosis of Monogenic Disease
[Diagram: the same data flow as before, annotated with the steps below.]
- Group samples into cohorts based on their phenotypes
- Calculate variant frequencies from certain cohorts and save them into the allele frequency database
- Query the allele frequency database to filter out common variants and identify disease-causal variants
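The filtering step in this success story can be sketched as follows (a hypothetical simplification of the described workflow; the variant identifiers are invented): variants whose cohort allele frequency exceeds a rarity threshold are discarded, leaving rare candidates for a monogenic disease.

```python
def candidate_variants(patient_variants, cohort_freqs, max_freq=0.01):
    """Keep only variants rare in the cohort (frequency <= max_freq).
    Variants absent from the frequency database are treated as novel (freq 0)."""
    return [v for v in patient_variants
            if cohort_freqs.get(v, 0.0) <= max_freq]

freqs = {"chr1:12345:A>T": 0.35, "chr7:5500:G>C": 0.002}
patient = ["chr1:12345:A>T", "chr7:5500:G>C", "chrX:900:C>T"]
print(candidate_variants(patient, freqs))  # ['chr7:5500:G>C', 'chrX:900:C>T']
```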
53. Summary of Our Practice in IT Infrastructure
Data transfer
- Solution I: Hard drive shipment (w/ FedEx)
- Solution II: High Speed Data Transfer
- Solution III: Don't move the data (Cloud Computing)
Cloud Computing
- EasyGenomics, a SaaS platform for NGS data analysis
- Two paths for the future cloud solution
Algorithm and Infrastructure
- Scale up with Hadoop / MapReduce
- GPU based acceleration
Data Management
- Using the iRODS file system to manage big data
54. Acknowledgement
Development Team
- Dev: Ming Jiang, Yongsheng Chen, Can Long, Jiasheng Wu, etc.
- Flex Lab: Yan Li (Hecate), Zhi Zhang (GAEA, iRODS), etc.
- GPU Lab: Bingqiang Wang, etc.
Test & QA Team: Xin Guan, Jingjuan Liu, etc.
PMO & IT Operations: Wenjun Zeng, Litong Lai, Jing Tian, etc.
Product Team: Xing Xu, Jing Guo, Fang Fang, etc.
Other BGI Teams
Collaborators: University of Hong Kong (HKU), Hong Kong University of Science and Technology (HKUST), NVIDIA, Aspera, RENCI, Tianjin Supercomputing Center
Editor's Notes
At the heart of EasyGenomics is our bioinformatics core: 5 workflows with carefully chosen algorithms, tested and optimized. Filtering, QC reports, and alignment, along with other supporting features.
Then we chose Hadoop. The Hadoop streaming framework enabled us to quickly parallelize commonly used algorithms. The distributed storage system HDFS ensured the safety of our data. In addition, the MapReduce framework enabled us to further improve the accuracy and performance of our modules.
We integrated all the standard re-sequencing steps onto the cloud-based framework, which can greatly accelerate the data analysis.
Besides parallelizing existing algorithms, we also designed new approaches based on the distributed frameworks. Benefiting from their capacity for big data processing, we developed new models for mapping and variation detection.
This cloud-based pipeline is already applied in cancer and human disease studies within BGI. It can reduce the analysis time from two weeks to two days. Additionally, the figure on the bottom shows the speedups of our applications, which means we can control the analysis time by choosing the size of the cluster. It could be even faster.
This is the comparison on a human sample from the 1000 Genomes Project. Our mapping tool, GaeaAlignment, returns the highest mapping rate.
Hecate distributes the graph simplification algorithm across an entire cluster of compute nodes, turning memory usage into fast distributed I/O work, which enables the assembly.
Sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. (Source: Wikipedia) SOAP: first-generation short read alignment tool. SOAP2 (2008): 20 to 30 times faster than SOAP, with less memory; a collaboration between BGI and the University of Hong Kong; compressed indexing: bidirectional BWT (2BWT). SOAP3 (2011): 10 to 30 times faster than SOAP2; a collaboration with the University of Hong Kong; uses the GPU's parallel processing power; CPU memory increased from a few GB to tens of GB; GPU-based indexing: GPU-2BWT.