Genomics Data Warehousing and Analytics

Background
 Golden Helix
- Founded in 1998
- Genetic association software
- Analytic services
- Hundreds of users worldwide
- Over 900 customer citations in scientific
journals
 Products I Build with My Team
- SNP & Variation Suite (SVS)
- SNP, CNV, NGS tertiary analysis
- Import and deal with all flavors of upstream data
- VarSeq
- Annotate and filter variants in gene panels, exomes and
genomes for clinical labs and researchers.
- GenomeBrowse (Free!)
- Visualization of everything with genomic coordinates.
All standardized file formats.

Database Trends
VarSeq
VSReports
 Tertiary analysis to report in one click
 Focused and actionable data
 Modeled on ACMG guidelines
 Hereditary and cancer templates
 OMIM included
VSPipeline
 Command line runner
 Integrate with your current bioinformatics pipeline
 Create repeatable clinical workflows for CLIA and CAP certified analysis
 Supports high throughput scenarios
90s - SQL
 Transactions
 Disk structure optimized
 Fixed schema
 SQL matures
 Small memory footprints
 Master <-> Slaves
 Threaded / Locking
 Expensive large mainframes/servers
00s - NoSQL
 “Web Scale” - distributed
 Eventually consistent
 Schema-less, key-based
 Avoid joins
 Peer-to-peer
 Memory cheap
 Many cheap commodity servers in datacenter configurations
10s - NewSQL
 Scale out
 First class sharding
 Utilize cheap memory
 Don’t let disk be the bottleneck
 Support stream / distributed analytics
> SELECT * FROM trends GROUP BY decade;

The “Database” Market in Thirds
OLTP
 ACID / Transactions
 “Traditional” row-based
- MySQL
- Postgres
- Oracle
- MSSQL
 NewSQL
- VoltDB (scale-out)
- Google Spanner/F1
- MemSQL
- Clustrix
Other
 Key and Hierarchical Based
 Wide Columnar Stores
- BigTable / HBase
- Cassandra
 Hierarchical/Document
- MongoDB
- Couchbase
 Key-Value Stores
- Redis
- Memcached
- FoundationDB
 Tuple/Triple-stores
Data Warehousing
 Query Optimized
- Amazon Redshift
- HP Vertica
- Infobright
- Google BigQuery
- Teradata
- Cloudera Impala
- Hadoop+Hive
http://www.se-radio.net/2013/12/episode-199-michael-stonebraker/

 Mike Stonebraker
 Illustra (from Postgres), acquired by IBM Informix (1996)
 StreamBase (from Aurora), acquired by TIBCO (2013)
 Vertica (from C-Store), acquired by HP (2011)
 VoltDB (from H-Store), 23M in funding over 4 rounds
 Paradigm4 (from SciDB)
INGRES – 73 -> 90
Postgres – 84 -> 92
Mariposa – 92 -> 97
Aurora – 01 -> 08
C-Store – 05 -> 09
H-Store – 07 -> Present
SciDB – 08 -> Present

Big Data, Small Analytics => Don’t use MapReduce
http://www.slideshare.net/Hapyrus/amazon-redshift-is-10x-faster-and-cheaper-than-hadoop-hive

Data Warehousing / Scientific Analysis => Columnar
You’ve got to know what regression
means, what Naïve Bayes means,
what k-Nearest Neighbors means.
It’s all statistics.
All of that stuff turns out to be defined
on arrays. It’s not defined on tables.
The tools of future data scientists are
going to be array-based tools. Those
may live on top of relational database
systems. They may live on top of an
array database system, or perhaps
something else. It’s completely open.
• Columns -> Faster Queries
• Divide columns into chunks
• Compress chunks (better
ratios than rows)
• Pre-compute chunk-level
attributes (min/max etc)
• Flexible storage layer
• Distributed
• Encodings (Parquet,
ORC/Hive, custom)
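
The chunking bullets above can be sketched in a few lines. This is only an illustration, assuming a single integer column and zlib as the compressor; the names ChunkedColumn, CHUNK_ROWS and count_greater_than are hypothetical and not taken from any particular columnar engine.

```python
import zlib
from array import array

CHUNK_ROWS = 1000  # hypothetical chunk size


class ChunkedColumn:
    """One integer column stored as compressed chunks with min/max stats."""

    def __init__(self, values):
        # Each entry: (min, max, row_count, compressed_bytes)
        self.chunks = []
        for start in range(0, len(values), CHUNK_ROWS):
            block = array("q", values[start:start + CHUNK_ROWS])
            self.chunks.append(
                (min(block), max(block), len(block), zlib.compress(block.tobytes()))
            )

    def count_greater_than(self, threshold):
        """Count values > threshold, using chunk stats to avoid decompression."""
        count = 0
        decompressed = 0
        for lo, hi, n_rows, payload in self.chunks:
            if hi <= threshold:
                continue            # no value in this chunk can match
            if lo > threshold:
                count += n_rows     # every value matches, no need to read it
                continue
            block = array("q")
            block.frombytes(zlib.decompress(payload))
            decompressed += 1
            count += sum(1 for v in block if v > threshold)
        return count, decompressed


column = ChunkedColumn(list(range(10_000)))  # sorted data, 10 chunks
matches, chunks_read = column.count_greater_than(8_500)
print(matches, chunks_read)
```

With sorted input, only the one chunk that straddles the threshold has to be decompressed; the pre-computed min/max settle the other nine.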

Extract, Transform, Load (ETL)
 “Dimensional Modeling”
- Fact tables & dimensional tables
- Fact tables often measurements over time
- Dimensional table goes into item details
- Denormalized data, complexity hidden
- Often many sources loaded into same warehouse
- Logs
- One or more relational databases (sales, customer-facing etc)
- Vendor / Payment information
 Example
“Like table”: datetime, user_id, post_id, client_data
“User table”: user_id, subscription_type, last_paid, has_android_app
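
The “Like” fact table and “User” dimension table above can be tried out with an in-memory SQLite database. This is a minimal sketch: the table names likes and users, the column types, and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (            -- dimension table: one row per user
        user_id INTEGER PRIMARY KEY,
        subscription_type TEXT,
        last_paid TEXT,
        has_android_app INTEGER
    );
    CREATE TABLE likes (            -- fact table: one row per "like" event
        datetime TEXT,
        user_id INTEGER REFERENCES users(user_id),
        post_id INTEGER,
        client_data TEXT
    );
""")
conn.executemany("INSERT INTO users VALUES (?, ?, ?, ?)", [
    (1, "free", None, 1),
    (2, "premium", "2015-03-01", 0),
])
conn.executemany("INSERT INTO likes VALUES (?, ?, ?, ?)", [
    ("2015-03-02T10:00", 1, 100, "web"),
    ("2015-03-02T10:05", 2, 100, "ios"),
    ("2015-03-02T11:30", 2, 101, "ios"),
])

# Aggregate the fact table, sliced by an attribute of the dimension table.
likes_by_subscription = conn.execute("""
    SELECT u.subscription_type, COUNT(*) AS n_likes
    FROM likes AS l
    JOIN users AS u ON u.user_id = l.user_id
    GROUP BY u.subscription_type
    ORDER BY u.subscription_type
""").fetchall()
print(likes_by_subscription)
```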

Genomics (and Other Life Science) Data Is
Data Warehouse-Like

Gabe’s Adjusted “Moore’s Law” NGS Cost Graph

Sequencers: Versatile tools for science

Genomics is Big Data
 5,000 public data repositories
 Broad Institute:
- Process 40K samples/year
- 1000 people
- 51 High Throughput Sequencers
- 10+ PB of storage
 1 Genome in Data
- ~300GB Compressed Sequence Data
- ~150MB Compressed Variant Data
- Seq data went through 5-6 steps

We Want Variants
 Differences between your DNA and a reference come in many sizes:
- Single letter substitutions are called Single Nucleotide Polymorphisms (SNPs)
- Small “length polymorphisms” are called Insertions/Deletions (InDels)
- Large duplications/deletions are called Copy Number Variations (CNVs)
 Average European has ~3 million small variations relative to the reference. 100K of those are in the 30K “gene coding” regions (~2% of the genome)
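
For the small variant types above, the class can be read directly off the reference and alternate allele strings. The function below is a hypothetical helper, assuming alleles are given as plain base strings; large copy number variations are outside the scope of this sketch.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Classify a small variant from its reference and alternate alleles."""
    if ref == alt:
        return "reference"      # no difference from the reference
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"            # single letter substitution
    if len(ref) < len(alt):
        return "insertion"      # alternate allele is longer
    if len(ref) > len(alt):
        return "deletion"       # alternate allele is shorter
    return "substitution"       # same length > 1, several bases changed


for ref, alt in [("A", "G"), ("A", "ATG"), ("ATG", "A"), ("AT", "GC")]:
    print(ref, alt, classify_variant(ref, alt))
```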

Next Generation Sequencing Analysis
Primary Analysis
 Analysis of hardware generated data, software built by vendors
 Use FPGA and GPUs to handle real-time optical or electrical signals from sequencing hardware
Secondary Analysis
 Filtering/clipping of “reads” and their qualities
 Alignment/Assembly of reads
 Recalibrating, de-duplication, variant calling on aligned reads
Tertiary Analysis (“Sense Making”)
 QA and filtering of variant calls
 Annotation (querying) variants to databases, filtering on results
 Merging/comparing multiple samples (multiple files)
 Visualization of variants in genomic context
 Statistics on matrices

Applications of NGS Data in the Clinic
Carrier screening –
prenatal and standard
Lifetime risk prediction
Genetic disorder
diagnostics
Oncology care
PGx – dosage and
care

Public Annotations – Left Joins
 Exact Matching “Variants”
- “Population Catalogs”
- 1000 Genomes (84M variants)
- NHLBI 6,500 Exomes (2M variants)
- ExAC 61,486 exomes (10M variants)
- Clinical Classifications
- Precomputed predictions / scores
- dbNSFP - 89.6M predictions
 Algorithmic Classification
- How variant interacts with genes (85K tx)
 Region Based
- Disease regions
- Gene Lists
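
Both the exact-match and the region-based annotations above map onto SQL left joins. The following is a minimal sketch using SQLite; the table names, columns, coordinates, frequencies and the gene name GENE_A are all made up for illustration and do not come from any real catalog.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sample_variants (chrom TEXT, pos INTEGER, ref TEXT, alt TEXT);
    CREATE TABLE population_catalog (
        chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, alt_allele_freq REAL
    );
    CREATE TABLE gene_regions (gene TEXT, chrom TEXT, start INTEGER, stop INTEGER);
""")
conn.executemany("INSERT INTO sample_variants VALUES (?, ?, ?, ?)", [
    ("1", 1000, "A", "G"),
    ("1", 2000, "C", "T"),
    ("2", 500, "G", "GA"),
])
conn.executemany("INSERT INTO population_catalog VALUES (?, ?, ?, ?, ?)", [
    ("1", 1000, "A", "G", 0.12),
    ("1", 2000, "C", "A", 0.01),  # same position, different alt: not a match
])
conn.executemany("INSERT INTO gene_regions VALUES (?, ?, ?, ?)", [
    ("GENE_A", "1", 900, 1500),
])

# Exact matching: a variant is annotated only if chrom, pos, ref and alt agree.
exact = conn.execute("""
    SELECT v.chrom, v.pos, v.ref, v.alt, c.alt_allele_freq
    FROM sample_variants AS v
    LEFT JOIN population_catalog AS c
      ON c.chrom = v.chrom AND c.pos = v.pos AND c.ref = v.ref AND c.alt = v.alt
    ORDER BY v.chrom, v.pos
""").fetchall()

# Region based: a variant is annotated if its position falls inside a region.
by_region = conn.execute("""
    SELECT v.chrom, v.pos, g.gene
    FROM sample_variants AS v
    LEFT JOIN gene_regions AS g
      ON g.chrom = v.chrom AND v.pos BETWEEN g.start AND g.stop
    ORDER BY v.chrom, v.pos
""").fetchall()

print(exact)
print(by_region)
```

Because these are left joins, every sample variant is kept; variants with no match simply carry NULL (None in Python) in the annotation column.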

Annotations are Hard!
 HGVS is a standard that is not standard
- Tries to serve different goals
- Many representations of same variant
- Should not be used as IDs, but not many
good alternatives
 Transcripts
- Transcript set choice extremely important,
hard to curate with meaningful attributes as
well.
 Public Data Curation
- ClinVar: multi-record lines
- NHLBI: MAF vs AAF, splitting “glob” fields
- 1kG: No genotype counts
- ExAC: Multi-allelic splitting, left-align
- COSMIC: No Ref/Alt, only HGVS
- dbNSFP: Abbreviations and aggregate
scores
 Versioning and Issues
- ClinVar missing variants in VCF
- dbSNP patches without version changes
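
As one example of the curation work listed above, multi-allelic splitting can be sketched as follows. This is a simplified illustration, assuming a record is given as a position, a reference allele and a list of alternate alleles; split_and_trim is a hypothetical helper. It only trims bases shared by both alleles, since true left-alignment also requires the reference sequence.

```python
def split_and_trim(pos, ref, alts):
    """Split a multi-allelic record into biallelic ones and trim shared bases.

    A common suffix is trimmed first, then a common prefix, always leaving
    at least one base in each allele. Left-alignment against the reference
    sequence is not attempted here.
    """
    records = []
    for alt in alts:
        p, r, a = pos, ref, alt
        # Trim identical trailing bases.
        while len(r) > 1 and len(a) > 1 and r[-1] == a[-1]:
            r, a = r[:-1], a[:-1]
        # Trim identical leading bases, advancing the position.
        while len(r) > 1 and len(a) > 1 and r[0] == a[0]:
            r, a = r[1:], a[1:]
            p += 1
        records.append((p, r, a))
    return records


print(split_and_trim(100, "CAG", ["CA", "TAG", "CAGAG"]))
```

After splitting, each record can be matched against a catalog on its own, which is what exact-match annotation needs.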

N-Glycanase Deficiency
 http://www.ngly1.org/
 Matthew Might and Matt Wilsey. The
shifting model in clinical diagnostics:
how next-generation sequencing and
families are altering the way rare
diseases are discovered, studied,
and treated. Genetics in Medicine.
March 2014.

Personalized Medicine
 Cancer is a disease of the genome
 “Molecular Targeted” drugs are effective and usually side-effect free
 The genetic testing required to direct cancer treatment is becoming affordable

Tabular Storage Format
Postgres FDW

TSF
 Use SQLite as container.
 SQLite has great cache, multi-threaded and read/write properties
 Specialized genomic index, also lexicographical indexes (LevelDB to do string sorting)
 GZIP / BLOSC chunk compression
 Primitive, Enums and List Types
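
The idea of SQLite as a container for compressed column chunks can be illustrated with a small sketch. This is not the actual TSF layout: the column_chunks table, CHUNK_ROWS, and the two helper functions are hypothetical, and zlib stands in for the GZIP / BLOSC compressors.

```python
import sqlite3
import zlib
from array import array

CHUNK_ROWS = 4  # tiny hypothetical chunk size so the example is easy to follow

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE column_chunks (
        column_name TEXT,
        chunk_index INTEGER,
        payload BLOB,
        PRIMARY KEY (column_name, chunk_index)
    )
""")


def write_column(name, values):
    """Store a column of ints as zlib-compressed chunks, one row per chunk."""
    for i in range(0, len(values), CHUNK_ROWS):
        raw = array("i", values[i:i + CHUNK_ROWS]).tobytes()
        conn.execute(
            "INSERT INTO column_chunks VALUES (?, ?, ?)",
            (name, i // CHUNK_ROWS, zlib.compress(raw)),
        )


def read_value(name, row_id):
    """Fetch one value by decompressing only the chunk that holds it."""
    chunk_index, offset = divmod(row_id, CHUNK_ROWS)
    (payload,) = conn.execute(
        "SELECT payload FROM column_chunks"
        " WHERE column_name = ? AND chunk_index = ?",
        (name, chunk_index),
    ).fetchone()
    block = array("i")
    block.frombytes(zlib.decompress(payload))
    return block[offset]


write_column("read_depth", [31, 28, 40, 12, 55, 19, 33, 27, 45])
print(read_value("read_depth", 5))
```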

TSF Backed Relational Data Store
 More efficient conditional queries
 Invisible Joins (i.e. row_id => array
offset)
 Size on disk
 “NULL [NA, Missing values] values are part of the domain space, which avoids auxiliary bit masks at the expense of 'losing' a single value from the domain.”
 SQL front-end allows using as
back-end to existing analytic and
web-stacks
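
Two of the points above, in-domain NULL values and invisible joins, can be sketched with plain typed arrays. The sentinel INT32_NA and the encode/decode helpers are assumptions made for illustration (including that the platform's C int is 32 bits), not the actual TSF implementation.

```python
from array import array

INT32_NA = -2**31  # sentinel reserved to mean "missing" in this sketch


def encode(values):
    """Pack optional ints into an int array, mapping None to the sentinel."""
    for v in values:
        if v == INT32_NA:
            raise ValueError("value collides with the NA sentinel")
    return array("i", [INT32_NA if v is None else v for v in values])


def decode(column, row_id):
    """Look a value up by row_id, which is simply the array offset."""
    v = column[row_id]
    return None if v == INT32_NA else v


depth = encode([30, None, 12])
quality = encode([99, 45, None])

# "Invisible join": the columns share row order, so row_id joins them by offset.
row = 1
print(decode(depth, row), decode(quality, row))
```

No separate bit mask is stored for missing values; the cost is that one integer value can no longer be used as data.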
