WuXi NextCODE Scales up Genomic Sequencing on AWS (ANT210-S) - AWS re:Invent 2018

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
WuXi NextCODE Scales Up
Genomic Sequencing in AWS
Hákon Guðbjartsson, Ph.D.
Chief Informatics Officer,
WuXi NextCODE
hakon@wuxinextcode.com
ANT210-S
Jonsi Stefansson
Cloud Data Services CTO, NetApp
jonsi.Stefansson@netapp.com

Life Sciences Has a New Challenge…DATA!
The exponential growth of genomic data is challenging the industry to develop new and better
ways to manage and mine truly ‘big’ data.
Source: PLOSBiology
2 EB
Growth of genomic data by 2025
Genomes sequenced
40 Exabytes (EB)/yr
100M–2B
Cost-effective sequencing has been solved…
…the new challenge is data growth
1000 GB = 1 TB, 1000 TB = 1 PB, 1000 PB = 1 EB
Average price for WGS now < $1000
(1 EB = 1 × 10^6 TB)
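The unit ladder on this slide can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal (SI) prefixes, as the slide does; the function name `eb_to_tb` is made up for illustration.

```python
# Decimal (SI) storage units, matching the conversions stated on the slide.
GB_PER_TB = 1000
TB_PER_PB = 1000
PB_PER_EB = 1000


def eb_to_tb(exabytes):
    """Convert exabytes to terabytes using decimal prefixes (hypothetical helper)."""
    return exabytes * PB_PER_EB * TB_PER_PB


if __name__ == "__main__":
    # 1 EB expressed in TB, then the 40 EB/yr figure quoted on the slide.
    print(eb_to_tb(1))
    print(eb_to_tb(40))
```

Under these assumptions, 1 EB is one million TB, which agrees with the parenthetical on the slide.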

How does HTS data look?
Showing data for four samples in the BRCA2 gene (1/300k of the genome)

Sequence reads and consistent variations
Showing the first exon in BRCA2 (1/6M of the genome)

The sequence of the double-stranded helix
Many of the differences are noise. Reading frames show possible amino acids.

Figure labels: dis.1 … dis.N; subj. and g. pos (repeated three times); GW-STRs, FM-SNPs, GW-chip SNPs, SEQ
Completeness of sequence data promotes reuse

The WuXi NextCODE Difference
Our purpose-built platform and its breadth and depth of differentiated capabilities
set us apart
(Genomically Ordered Relational database)
The world’s scalable digital platform purpose-built for the genome and population health
ROBUST COHORT SOURCING
HIGHEST QUALITY SEQUENCING
SCALABLE DATA INTEGRATION
BEST-IN-CLASS DEEP LEARNING + A.I.
Fueled by the GORdb™ platform

WXNC Platform Overview
A single digital platform built from the ground up for population-scale genomics.
APP LAYER
Domain-specific applications that are powered by the WXNC platform
API + SDK
An intuitive API + SDK to allow customization of app layer capabilities
DATA
A single layer to combine user data with WXNC’s KnowledgeBASE of proprietary and reference databases
GORdb™
A single digital architecture to power WXNC’s end-to-end genomics capabilities
Stack diagram labels: DiscoveryCODE, CustomApps, PhenoCODE, CancerCODE, RareCODE; API + SDK; KnowledgeBASE + Client Data; GORdb™; NFS - NetApp cloud volumes

Our secure AWS system architecture
VPC or SAS
AWS Direct Connect
Co-location
NetApp Cloud Volumes Service

Amazon Web Services
Why WXNC chose to work with AWS
Accessibility
Services built to store and retrieve data from anywhere in the world. Availability in multiple countries.
Reliability
Redundancy to achieve 99.999999999% durability. Available across multiple zones.
High performance
Elastic environments that automatically scale. Battle-tested solutions.
Compliant
Comprehensive security suite. US and global security compliance.
Rich solution ecosystem
Amazon RDS for PostgreSQL and Oracle; cloud storage services such as Amazon S3, Amazon Glacier, and Amazon EFS; AWS Partner Network: NetApp Cloud Volumes; Edico secondary analysis.
Popularity
The preferred cloud platform for most of our pharma customers.

Platform connecting research to the clinic
Samples ingested in CSA are automatically available in the DiscoveryCODE platform for research
Perform Genomic Assay
Identify Variants
Annotate with Clinical Knowledgebase
Determine Variant Impact
Aggregate Patient Samples
Perform Statistical Analysis on Cohorts
Identify Biomarkers
Populate Clinical Knowledgebase
Generate Report
CSA: Clinical Care
The DiscoveryCODE Platform: Research Discovery

Optimize. Mine. Share.
Fuel your organization’s discovery and development pipeline with a comprehensive analytical suite built into the GORdb™ platform.
Discover genomic biomarkers from cohorts of individuals
Discover genomic classifiers for patient enrollment in clinical trials
De novo biomarker discovery
Phase 1
Discover genomic biomarkers associated with adverse effects
Phase 2+
Discover genomic biomarkers associated with adverse effects and for responders/non-responders
Phase 3
Develop genomic classifiers for companion IVD submission with an IND
DISCOVERY
DEVELOPMENT
Application Suite: Clinical Sequence Analyzer™, Sequence Miner™, Artificial Intelligence, PhenoCODE™, …and more to come
+ GORdb™

Validated A.I. in Genomics
WXNC is leading genomics A.I. with its suite of validated and published methods and algorithms.
CARDIOVASCULAR
Validated our machine learning capability in published in vitro models
METABOLIC
Classified all variants of targeted genes for childhood obesity drug and companion diagnostic development
ONCOLOGY
Identified a signal predictive of survival across 21 cancers using RNA, CNV, and methylation data

Clinical Sequence Analysis
The nature of the feedback loop for variant interpretation will change.

Data driven diagnosis of rare diseases
High impact variants in genes that “match” the signs & symptoms

Figure: variant counts 1.8M, 1400, 173, 1
Filter labels: Allele Freq <2%; VEP: MOD/LOF; Candidate Genes + Paralogs; Recessive; All Variants; All Shared Variants
Phenotypes: Blindness, Deafness, Diaphragmatic Weakness
https://www.youtube.com/watch?v=kaTlGr0bHSk
Clinical Diagnostics Example
Ending a 5-year odyssey in minutes for 2 sisters
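The filters named on this slide (allele frequency below 2%, VEP impact MOD/LOF, candidate genes, recessive inheritance) can be read as successive narrowing steps. The sketch below is only an illustration of that idea, not the Clinical Sequence Analyzer implementation: the `Variant` fields, the order of the filters, and the simplification of "recessive" to "homozygous" are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    # Hypothetical record layout, for illustration only.
    gene: str
    allele_freq: float  # population allele frequency in [0, 1]
    vep_impact: str     # impact label, e.g. "MOD", "LOF", "LOW"
    zygosity: str       # "hom" or "het"


def filter_funnel(variants, candidate_genes, max_af=0.02):
    """Apply the slide's filters in sequence and report the count after each step."""
    rare = [v for v in variants if v.allele_freq < max_af]
    impactful = [v for v in rare if v.vep_impact in {"MOD", "LOF"}]
    in_genes = [v for v in impactful if v.gene in candidate_genes]
    # Simplification: treat recessive as homozygous (ignores compound heterozygotes).
    recessive = [v for v in in_genes if v.zygosity == "hom"]
    counts = [len(variants), len(rare), len(impactful), len(in_genes), len(recessive)]
    return counts, recessive


if __name__ == "__main__":
    demo = [
        Variant("G1", 0.001, "LOF", "hom"),
        Variant("G1", 0.30, "LOF", "hom"),   # too common
        Variant("G2", 0.001, "LOW", "hom"),  # low impact
        Variant("G3", 0.001, "MOD", "hom"),  # not a candidate gene
        Variant("G1", 0.01, "MOD", "het"),   # not homozygous
    ]
    print(filter_funnel(demo, {"G1"}))
```

Each step only ever shrinks the list, which is why the counts in the figure fall so steeply from millions of variants to a single candidate.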

The Law of Diminishing Returns
The standard deviation of an estimate falls off as the inverse square root of the sample size
Figure: precision versus sample size, with rare disease and common disease marked
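The inverse square root behavior can be made concrete with the textbook standard error of a mean, sigma / sqrt(n). This is a generic statistical sketch, not a formula taken from the talk; `standard_error` is an illustrative name.

```python
import math


def standard_error(sigma, n):
    """Standard error of a sample mean: sigma / sqrt(n)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    return sigma / math.sqrt(n)


if __name__ == "__main__":
    # Each fourfold increase in sample size only halves the standard error.
    for n in (100, 400, 1600, 6400):
        print(n, standard_error(1.0, n))
```

This is the diminishing return in the slide title: going from 100 to 400 samples buys the same relative gain in precision as going from 1600 to 6400.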

Rare Exchange
Use-cases related to exchange of information
ACMG variant curations
Enable crowd-sourcing of variant curations across hospitals and organizations.
Sample match-maker
Enable exploration of the availability of samples that overlap with the index case in phenotypes or in rare high-impact variants in a gene.
Delegate analysis
Enable users to move a case study seamlessly to another organization that has expertise and data in the disease of interest.
Aggregate data sharing
Share information across organizations such as AF, GTF, segregated by predefined traits and by dynamic match-making definition. This can be on a variant level or a gene level.
Study/sample sharing
Share an entire study with its associated phenotypes and genomic data with another organization.
Blind analysis
Temporarily move one or more samples to other RareCODE systems and perform genome-wide burden analysis, inheritance labeling, etc.

An epilepsy study example
Even small patient cohorts can lead to identification of new causal genes
Whole exome sequencing of 117 undiagnosed patients with epilepsy (41 trios and 76 singletons), including epileptic encephalopathies (83 patients), febrile infection-related epilepsy, Rasmussen encephalitis, and other focal and generalized epilepsies.
Results: Likely pathogenic variants for epilepsy identified in 33% (n=39) of patients; in a further 35% (n=41), potentially clinically relevant variants in candidate genes.
accounted for 33% of the variants identified: KCNQ2 (n=8) most frequently, followed by SCN2A (n=4).
Epileptic encephalopathies associated with genetic variants in recently characterized genes: FGF12, GNAO1, ITPA, KCNB1, KCNH1, MBD5, PTPN23, RHOBTB2, SYNGAP1, and WWOX
Expanded the phenotype of genes known to be linked to intellectual disability: DYNC1H1, HUWE1, KAT6A and PTCHD
Anne Rochtus, Meredith Park, Lacey Smith, Alan Taylor, Christelle El Achkar, Shira Rockowitz, Beth Rosen Sheidley, Annapurna Poduri
Boston Children’s Hospital and Harvard Medical School using WXNC Clinical Sequence Analyzer and Sequence Miner

What is GORdb?
GORdb provides for genomic data the data abstraction and query functionality which conventional RDBMS provide for regular business data.
A relational database for genomic data

The GORql syntax
Influenced by SQL and shell pipe commands
SQL + Unix commands = GORql pipe syntax

Genomic Ordering Enables Rapid Queries
- GORql = SQL + Unix bash
- Allows for targeted querying and streaming
- Fast analysis and data updates
- Normalized schema designs for all genomic data
- Elastic scaling
- Parallel execution
- Materialized views
- External commands
GORdb™
Genomic Axis
Partition Axis
Targeted genomic queries based on chromosomal coordinates
SPEED SCALABILITY
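Why genomic ordering helps targeted queries can be illustrated with a sorted list and binary search: once rows are ordered by (chromosome, position), a coordinate range maps to one contiguous slice. This is a toy sketch of the general idea, not GORdb's storage format; the row layout and function names are assumptions, and chromosomes are compared as plain strings.

```python
from bisect import bisect_left, bisect_right


def build_index(rows):
    """Sort rows of (chrom, pos, payload) by coordinate and return rows plus sort keys."""
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    keys = [(r[0], r[1]) for r in ordered]
    return ordered, keys


def range_query(ordered, keys, chrom, start, end):
    """Return all rows on chrom with start <= pos <= end using two binary searches."""
    lo = bisect_left(keys, (chrom, start))
    hi = bisect_right(keys, (chrom, end))
    return ordered[lo:hi]


if __name__ == "__main__":
    rows = [
        ("chr2", 10, "e"),
        ("chr13", 150, "d"),
        ("chr1", 200, "b"),
        ("chr13", 50, "c"),
        ("chr1", 100, "a"),
    ]
    ordered, keys = build_index(rows)
    print(range_query(ordered, keys, "chr13", 40, 100))
```

The lookup cost grows with the logarithm of the table size plus the size of the result, so a small region stays cheap to fetch even from a very large variant table.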

GOR Genotype Files
VCF2GOR for sparse allele row format and non-sparse horizontal layout
Transposed GT file with variant listed for all PNs
0 = hom ref, good cov
1 = het, good cov
2 = hom, good cov
NA = poor cov, No Call
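The four genotype codes above could be derived from VCF-style genotype strings roughly as follows. This is a hedged sketch rather than the VCF2GOR converter: the function name, the depth threshold `min_depth=8`, and the restriction to biallelic diploid calls are assumptions made for illustration.

```python
def encode_genotype(gt, depth, min_depth=8):
    """Map a VCF-style GT string and read depth to the slide's codes: 0, 1, 2 or NA."""
    alleles = gt.replace("|", "/").split("/")
    # Poor coverage, missing calls and non-diploid calls are all reported as NA.
    if depth < min_depth or len(alleles) != 2 or "." in alleles:
        return "NA"
    # Count non-reference alleles: 0 = hom ref, 1 = het, 2 = hom alt (biallelic sites).
    return str(sum(allele != "0" for allele in alleles))


if __name__ == "__main__":
    calls = [("0/0", 30), ("0|1", 25), ("1/1", 40), ("./.", 30), ("0/1", 3)]
    print([encode_genotype(gt, dp) for gt, dp in calls])
```

A horizontal layout would then store one such code per PN on each variant row.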

SQL vs GORql pipe syntax
Calculating transition vs. transversion ratio – example taken from Google BigQuery
140 million rows, takes ~ 5 seconds:
create #t# = pgor -split 100 #dbsnp# | where len(ref)=1 and len(alt)=1
| calc transition = if(ref+'>'+alt in ('A>G','G>A','C>T','T>C'),1,0)
| calc transversion = 1 - transition
| group 100000 -sum -ic transition,transversion;
gor [#t#] | group 100000 -sum -ic sum_*
| calc TiTv_ratio = float(sum_sum_transition)/sum_sum_transversion
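For readers unfamiliar with GORql, the same calculation can be sketched in plain Python. This follows one reading of the query above (single-base variants only, counts summed per 100,000-base bin, then a ratio per bin) and is not a translation produced by WXNC; the tuple layout and the function name are assumptions.

```python
from collections import defaultdict

# The four single-base transitions, as listed in the query.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def titv_by_bin(variants, bin_size=100_000):
    """Return {(chrom, bin_index): Ti/Tv ratio} for (chrom, pos, ref, alt) tuples."""
    counts = defaultdict(lambda: [0, 0])  # [transitions, transversions]
    for chrom, pos, ref, alt in variants:
        if len(ref) != 1 or len(alt) != 1:
            continue  # skip indels and multi-base variants
        slot = 0 if (ref, alt) in TRANSITIONS else 1
        counts[(chrom, pos // bin_size)][slot] += 1
    # Bins with no transversions get None instead of dividing by zero.
    return {key: (ti / tv if tv else None) for key, (ti, tv) in counts.items()}


if __name__ == "__main__":
    demo = [
        ("chr1", 10, "A", "G"),
        ("chr1", 20, "C", "T"),
        ("chr1", 30, "A", "C"),
        ("chr1", 40, "AT", "A"),
        ("chr1", 150_000, "G", "A"),
        ("chr1", 150_001, "G", "T"),
    ]
    print(titv_by_bin(demo))
```

The loop makes a single pass over the rows and keeps only per-bin counters in memory.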

GORdb™ outperforms SparkSQL in genomic joins
The rapid, real-time querying relational database purpose-built for big genomic data.
Chart: GORdb™ Outperforms Spark in Genomic Queries (time in seconds, axis 0 to 300; queries Q1 to Q6; SparkSQL vs. GORdb)
QUERY 1 Retrieve dbSNP data and filter based on genomic range (e.g. 281 rows from a region in chr19)
QUERY 2 Retrieve dbSNP data based on overlap with genes (e.g. 90k rows of overlapping genes named BR*)
QUERY 3 Retrieve dbSNP data based on overlap with exons (i.e., more segments)
QUERY 4 Variant lookup by joining variants from dbSNP with a set of variants based on rsIDs (e.g. all rs22* giving ~100k rows)
QUERY 5 Aggregating counts of variants in dbSNP based on sequence structure
QUERY 6 Calculate transition-transversion ratio for all of dbSNP (>100 million rows)
The Genomically-Ordered Relational (GOR) Database
• Developed by WXNC specifically for large volumes of genomic data
• Database and query language optimized for genomic data streaming
• Genomic coordinate indexing
• On-the-fly data joins
• Instant updates
GORdb™ provides for genomics the data abstraction and query functionality which conventional RDBMS provides for regular business data
RDBMS = relational database management system
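Queries 2 and 3 join variants to gene or exon segments by overlap. When both inputs are already in genomic order, such a join can be streamed in one pass. The generator below is a minimal sketch of that general technique, not GORdb's join operator; the tuple layouts, the inclusive coordinates, and string comparison of chromosome names are assumptions.

```python
def overlap_join(variants, segments):
    """Stream-join sorted (chrom, pos, vid) rows with sorted (chrom, start, end, name) rows.

    Both inputs must be sorted by (chrom, start position); coordinates are inclusive.
    Yields (chrom, pos, vid, name) for every variant that falls inside a segment.
    """
    seg_iter = iter(segments)
    pending = next(seg_iter, None)
    active = []  # segments that started at or before the current variant
    for chrom, pos, vid in variants:
        # Admit every segment whose start is not after the current variant.
        while pending is not None and (pending[0], pending[1]) <= (chrom, pos):
            active.append(pending)
            pending = next(seg_iter, None)
        # Drop segments on earlier chromosomes or ending before this position.
        active = [s for s in active if s[0] == chrom and s[2] >= pos]
        for seg in active:
            yield (chrom, pos, vid, seg[3])


if __name__ == "__main__":
    variants = [("chr13", 5, "rs1"), ("chr13", 15, "rs2"),
                ("chr13", 25, "rs3"), ("chr17", 7, "rs4")]
    segments = [("chr13", 10, 20, "BRCA2"), ("chr13", 12, 30, "X"),
                ("chr17", 1, 10, "BRCA1")]
    for row in overlap_join(variants, segments):
        print(row)
```

Because neither input is loaded fully into memory, the same pattern applies whether the segment file lists a handful of genes or every exon.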

The benefits of NetApp Cloud Volumes
Moving from self-managed NFS storage to NetApp Cloud Volumes was seamless
Less complexity
Moved from 15 EBS volumes fronted with 3 large NFS servers to a single NetApp service endpoint.
Easy onboarding
Copied 50 TB of data in more than 2 million files in less than two days.
Performance
Reading mutation data from 100k samples with 1024 cores was 3x faster than with the self-managed solution.
Easier management
Backups and data cloning for test environments.
Reliability
Advanced RAID-DP technology provides more reliability for large data sets.
Tailored to GORdb needs
Transparent NFS caching
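The onboarding figure implies a lower bound on sustained copy throughput, which a short calculation makes explicit. This sketch assumes decimal terabytes and treats "less than two days" as exactly two days, so the true rates were at least this high; the function names are illustrative.

```python
SECONDS_PER_DAY = 86_400


def min_throughput_mb_per_s(terabytes, days):
    """Lower bound on average throughput in MB/s, using decimal units."""
    total_bytes = terabytes * 10**12
    return total_bytes / (days * SECONDS_PER_DAY) / 10**6


def min_files_per_s(file_count, days):
    """Lower bound on the average number of files copied per second."""
    return file_count / (days * SECONDS_PER_DAY)


if __name__ == "__main__":
    print(round(min_throughput_mb_per_s(50, 2), 1))
    print(round(min_files_per_s(2_000_000, 2), 1))
```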

Thank you!
Hákon Guðbjartsson, Ph.D.
Chief Informatics Officer
hakon@wuxinextcode.com
Jonsi Stefansson
Cloud Data Services CTO, NetApp
jonsi.Stefansson@netapp.com
