Background
 Golden Helix
- Founded in 1998
- Genetic association software
- Analytic services
- Hundreds of users worldwide
- Over 900 customer citations in scientific
journals
 Products I Build with My Team
- SNP & Variation Suite (SVS)
- SNP, CNV, NGS tertiary analysis
- Import and deal with all flavors of upstream data
- VarSeq
- Annotate and filter variants in gene panels, exomes and
genomes for clinical labs and researchers.
- GenomeBrowse (Free!)
- Visualization of everything with genomic coordinates.
All standardized file formats.
Database Trends
VarSeq
 Tertiary analysis to report
in one click
 Focused and actionable
data
 Modeled on ACMG
guidelines
 Hereditary and cancer
templates
 OMIM included
VSReports
 Command line runner
 Integrate with your current
bioinformatics pipeline
 Create repeatable clinical
workflows for CLIA and
CAP certified analysis
 Supports high throughput
scenarios
VSPipeline
90s - SQL
 Transactions
 Disk structure optimized
 Fixed schema
 SQL matures
 Small memory footprints
 Master <-> Slaves
 Threaded / Locking
 Expensive large mainframes/servers
00s - NoSQL
 “Web Scale” - distributed
 Eventually consistent
 Schema-less, key-based
 Avoid joins
 Peer-to-peer
 Memory cheap
 Many cheap commodity servers in datacenter configurations
10s - NewSQL
 Scale out
 First class sharding
 Utilize cheap memory
 Don’t let disk be bottleneck
 Support stream / distributed analytics
> SELECT * FROM trends GROUP BY decade;
The “Database” Market in Thirds
OLTP
 ACID / Transactions
 “Traditional” row-based
- MySQL
- Postgres
- Oracle
- MSSQL
 NewSQL
- VoltDB (scale-out)
- Google Spanner/F1
- MemSQL
- Clustrix
Other
 Key and Hierarchical Based
 Wide Columnar Stores
- BigTable / HBase
- Cassandra
 Hierarchical/Document
- MongoDB
- Couchbase
 Key-Value Stores
- Redis
- Memcached
- FoundationDB
 Tuple/Triple-stores
Data Warehousing
 Query Optimized
 Amazon Redshift
 HP Vertica
 Infobright
 Google BigQuery
 Teradata
 Cloudera Impala
 Hadoop+Hive
http://www.se-radio.net/2013/12/episode-199-michael-stonebraker/
 Mike Stonebraker
 Illustra (from Postgres), acquired by IBM Informix (1996)
 StreamBase (from Aurora), acquired by TIBCO (2013)
 Vertica (from C-Store), acquired by HP (2011)
 VoltDB (from H-Store), 23M funding in 4 rounds
 Paradigm4 (from SciDB)
INGRES – 73 -> 90
Postgres – 84 -> 92
Mariposa – 92 -> 97
Aurora – 01 -> 08
C-Store – 05 -> 09
H-Store – 07 -> Present
SciDB – 08 -> Present
Data Warehouse Solutions
Big Data, Small Analytics => Don’t use MapReduce
http://www.slideshare.net/Hapyrus/amazon-redshift-is-10x-faster-and-cheaper-than-hadoop-hive
Data Warehousing / Scientific Analysis => Columnar
You’ve got to know what regression
means, what Naïve Bayes means,
what k-Nearest Neighbors means.
It’s all statistics.
All of that stuff turns out to be defined
on arrays. It’s not defined on tables.
The tools of future data scientists are
going to be array-based tools. Those
may live on top of relational database
systems. They may live on top of an
array database system, or perhaps
something else. It’s completely open.
• Columns -> Faster Queries
• Divide columns into chunks
• Compress chunks (better
ratios than rows)
• Pre-compute chunk-level
attributes (min/max etc)
• Flexible storage layer
• Distributed
• Encodings (Parquet,
ORC/Hive, custom)
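The chunking bullets above can be sketched in a few lines of Python. This is a minimal illustration, not any particular engine's storage layer: the `Chunk` class, the chunk size of 4, the int32 encoding and the use of zlib are all assumptions made for the example. It shows how pre-computed min/max values let a scan skip whole chunks without decompressing them.

```python
import struct
import zlib
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Chunk:
    """One compressed slice of a column plus pre-computed statistics."""
    min_value: int
    max_value: int
    count: int
    payload: bytes  # zlib-compressed little-endian int32 values


def build_column(values: List[int], chunk_size: int = 4) -> List[Chunk]:
    """Split a column into fixed-size chunks and compress each one."""
    chunks = []
    for start in range(0, len(values), chunk_size):
        part = values[start:start + chunk_size]
        raw = struct.pack(f"<{len(part)}i", *part)
        chunks.append(Chunk(min(part), max(part), len(part), zlib.compress(raw)))
    return chunks


def scan_greater_than(chunks: List[Chunk], threshold: int) -> Tuple[List[int], int]:
    """Return matching values and the number of chunks that were decompressed."""
    hits, decompressed = [], 0
    for chunk in chunks:
        if chunk.max_value <= threshold:
            continue  # the statistics prove nothing in this chunk can match
        decompressed += 1
        raw = zlib.decompress(chunk.payload)
        part = struct.unpack(f"<{chunk.count}i", raw)
        hits.extend(v for v in part if v > threshold)
    return hits, decompressed


# Synthetic column values, made up for the example.
depth = [10, 12, 11, 9, 30, 31, 29, 35, 8, 7, 9, 10]
column = build_column(depth)
hits, touched = scan_greater_than(column, 25)
print(hits, touched)
```

Only the middle chunk has a maximum above the threshold, so the other two are never decompressed.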
Extract, Transform, Load (ETL)
 “Dimensional Modeling”
- Fact tables & dimensional tables
- Fact tables often measurements over time
- Dimensional table goes into item details
- Denormalized data, complexity hidden
- Often many sources loaded into same warehouse
- Logs
- One or more relational databases (sales, customer-facing etc)
- Vendor / Payment information
 Example
“Like table”: datetime, user_id, post_id, client_data
“User table”: user_id, subscription_type, last_paid, has_android_app
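A small sketch of the example above using Python's built-in sqlite3 module, with the “Like table” as the fact table and the “User table” as the dimension table. The column names come from the slide; the table names `likes` and `users`, the column types and all row values are assumptions invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (              -- dimension table: one row per user
    user_id INTEGER PRIMARY KEY,
    subscription_type TEXT,
    last_paid TEXT,
    has_android_app INTEGER
);
CREATE TABLE likes (              -- fact table: one row per event over time
    datetime TEXT,
    user_id INTEGER REFERENCES users(user_id),
    post_id INTEGER,
    client_data TEXT
);
""")

# Made-up rows, only to make the query runnable.
conn.executemany("INSERT INTO users VALUES (?, ?, ?, ?)", [
    (1, "free", None, 1),
    (2, "premium", "2015-01-03", 0),
])
conn.executemany("INSERT INTO likes VALUES (?, ?, ?, ?)", [
    ("2015-02-01T10:00", 1, 100, "android"),
    ("2015-02-01T10:05", 2, 100, "web"),
    ("2015-02-02T09:30", 2, 101, "web"),
])

# Typical warehouse question: aggregate facts, grouped by a dimension attribute.
likes_by_subscription = conn.execute("""
    SELECT u.subscription_type, COUNT(*) AS n_likes
    FROM likes AS l
    JOIN users AS u ON u.user_id = l.user_id
    GROUP BY u.subscription_type
    ORDER BY u.subscription_type
""").fetchall()
print(likes_by_subscription)
```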
Genomics (Other Life Science) Data
Data Warehouse Like
Gabe’s Adjusted “Moore’s Law” NGS Cost Graph
Sequencers: Versatile tools for science
Genomics is Big Data
 5,000 public data repositories
 Broad Institute:
- Process 40K samples/year
- 1000 people
- 51 High Throughput Sequencers
- 10+ PB of storage
 1 Genome in Data
- ~300GB Compressed Sequence Data
- ~150MB Compressed Variant Data
- Seq data went through 5-6 steps
We Want Variants
 Differences between your DNA
and a reference come in many
sizes:
- Single letter substitutions are called
Single Nucleotide Polymorphisms
(SNPs)
- Small “length polymorphisms” are
called Insertions/Deletions (InDels)
- Large duplications/deletions are called
Copy Number Variations (CNVs)
 Average European has ~3 million
small variations to the reference.
100K of those in the 30K “gene
coding” regions (~2% of the
genome)
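As a rough sketch, the small variant types above can be told apart by comparing the lengths of the reference and alternate alleles. The function name and the returned labels are assumptions for illustration. Copy number variations are not covered here because they are usually described by a region and a copy count, not by a pair of short allele strings.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Classify a small variant from its reference and alternate alleles."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"            # single letter substitution
    if len(ref) < len(alt):
        return "insertion"      # alternate allele adds bases
    if len(ref) > len(alt):
        return "deletion"       # alternate allele removes bases
    return "MNP"                # same length, more than one base changed


for ref, alt in [("A", "G"), ("A", "ATG"), ("ATG", "A"), ("AT", "GC")]:
    print(ref, alt, classify_variant(ref, alt))
```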
Next Generation Sequencing Analysis
Primary
Analysis
Secondary
Analysis
Tertiary
Analysis
“Sense Making”
 Analysis of hardware generated data, software built by vendors
 Use FPGAs and GPUs to handle real-time optical or electrical signals
from sequencing hardware
 Filtering/clipping of “reads” and their qualities
 Alignment/Assembly of reads
 Recalibrating, de-duplication, variant calling on aligned reads
 QA and filtering of variant calls
 Annotation (querying) variants to databases, filtering on results
 Merging/comparing multiple samples (multiple files)
 Visualization of variants in genomic context
 Statistics on matrixes
Applications of NGS Data in the Clinic
Carrier screening –
prenatal and standard
Lifetime risk prediction
Genetic disorder
diagnostics
Oncology care
PGx – dosage and
care
Public Annotations – Left Joins
 Exact Matching “Variants”
- “Population Catalogs”
- 1000 Genomes (84M variants)
- NHLBI 6,500 Exomes (2M variants)
- ExAC 61,486 exomes (10M variants)
- Clinical Classifications
- Precomputed predictions / scores
- dbNSFP - 89.6M predictions
 Algorithmic Classification
- How variant interacts with genes (85K tx)
 Region Based
- Disease regions
- Gene Lists
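The “left join” framing can be made concrete with a short sqlite3 sketch. Every table name, column name and value below is a made-up assumption. Real catalogs such as 1000 Genomes or ExAC have their own schemas and need curation before a join like this works. The first join is an exact match on chromosome, position, reference and alternate allele. The second is region based.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sample_variants (chrom TEXT, pos INTEGER, ref TEXT, alt TEXT);
CREATE TABLE population_catalog (
    chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, alt_freq REAL
);
CREATE TABLE disease_regions (
    chrom TEXT, region_start INTEGER, region_end INTEGER, label TEXT
);
""")

# Synthetic rows for illustration only.
conn.executemany("INSERT INTO sample_variants VALUES (?, ?, ?, ?)", [
    ("1", 1000, "A", "G"),
    ("1", 2000, "C", "T"),
    ("2", 500, "G", "GA"),
])
conn.executemany("INSERT INTO population_catalog VALUES (?, ?, ?, ?, ?)", [
    ("1", 1000, "A", "G", 0.12),
    ("1", 2000, "C", "A", 0.30),  # same position, different alt: must not match
])
conn.execute("INSERT INTO disease_regions VALUES ('1', 1500, 2500, 'example_region')")

# LEFT JOIN keeps every sample variant, annotated or not.
annotated = conn.execute("""
    SELECT v.chrom, v.pos, v.ref, v.alt, p.alt_freq, r.label
    FROM sample_variants AS v
    LEFT JOIN population_catalog AS p
           ON p.chrom = v.chrom AND p.pos = v.pos
          AND p.ref = v.ref AND p.alt = v.alt
    LEFT JOIN disease_regions AS r
           ON r.chrom = v.chrom
          AND v.pos BETWEEN r.region_start AND r.region_end
    ORDER BY v.chrom, v.pos
""").fetchall()
for row in annotated:
    print(row)
```

Variants with no catalog entry keep a NULL frequency instead of dropping out of the result, which is what filtering on rare or novel variants relies on.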
Annotations are Hard!
 HGVS is a standard that is not standard
- Tries to serve different goals
- Many representations of same variant
- Should not be used as IDs, but not many
good alternatives
 Transcripts
- Transcript set choice extremely important,
hard to curate with meaningful attributes as
well.
 Public Data Curation
- ClinVar: multi-record lines
- NHLBI: MAF vs AAF, splitting “glob” fields
- 1kG: No genotype counts
- ExAC: Multi-allelic splitting, left-align
- COSMIC: No Ref/Alt, only HGVS
- dbNSFP: Abbreviations and aggregate
scores
 Versioning and Issues
- ClinVar missing variants in VCF
- dbSNP patches without version changes
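One of the curation steps listed above, multi-allelic splitting, can be sketched as follows. The function names are hypothetical and this is a simplification: it splits a comma-separated list of alternate alleles and trims bases shared by the reference and alternate alleles, but it does not perform true left-alignment, which needs the reference sequence to shift indels through repeats.

```python
from typing import List, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def trim_alleles(pos: int, ref: str, alt: str) -> Tuple[int, str, str]:
    """Drop bases shared by ref and alt, keeping at least one base in each."""
    # Trim the shared suffix first so the position does not move.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Then trim the shared prefix, advancing the position for each base dropped.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


def split_multiallelic(chrom: str, pos: int, ref: str, alts: str) -> List[Variant]:
    """Turn one record with several alternate alleles into one record per allele."""
    return [(chrom, *trim_alleles(pos, ref, alt)) for alt in alts.split(",")]


records = split_multiallelic("1", 100, "CAG", "CG,TAG")
print(records)
```

After splitting and trimming, the two alleles become a one-base deletion and a SNP, each of which can be matched exactly against an annotation source.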
Splice Mutation
N-Glycanase Deficiency
 http://www.ngly1.org/
 Matthew Might and Matt Wilsey. The
shifting model in clinical diagnostics:
how next-generation sequencing and
families are altering the way rare
diseases are discovered, studied,
and treated. Genetics in Medicine.
March 2014.
Personalized Medicine
 Cancer is a disease of the genome
 “Molecular Targeted” drugs are effective and usually side-effect free
 Genetic testing required to direct cancer treatment is becoming affordable
Tabular Storage Format
Postgres FDW
TSF
 Use SQLite as
container.
 SQLite has great
cache, multi-
threaded and
read/write properties
 Specialized genomic
index, also
lexicographical
indexes (LevelDB to
do string sorting)
 GZIP / BLOSC chunk
compression
 Primitive, Enums
and List Types
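To make the “SQLite as container” idea concrete, here is a hypothetical sketch that stores one column as compressed chunks in a SQLite table. It is not the actual TSF layout, which the slide does not describe: the `chunks` table, its columns, the chunk size, the int32 encoding and the use of zlib in place of GZIP/BLOSC are all assumptions.

```python
import sqlite3
import struct
import zlib

CHUNK_SIZE = 1000  # values per chunk; an arbitrary choice for this sketch


def write_column(conn, field, values):
    """Store one column as zlib-compressed int32 chunks, one row per chunk."""
    for index, start in enumerate(range(0, len(values), CHUNK_SIZE)):
        part = values[start:start + CHUNK_SIZE]
        payload = zlib.compress(struct.pack(f"<{len(part)}i", *part))
        conn.execute(
            "INSERT INTO chunks VALUES (?, ?, ?, ?)",
            (field, index, len(part), payload),
        )


def read_value(conn, field, row_id):
    """Fetch a single value by decompressing only the chunk that holds it."""
    chunk_index, offset = divmod(row_id, CHUNK_SIZE)
    n_values, payload = conn.execute(
        "SELECT n_values, payload FROM chunks WHERE field = ? AND chunk_index = ?",
        (field, chunk_index),
    ).fetchone()
    return struct.unpack(f"<{n_values}i", zlib.decompress(payload))[offset]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunks ("
    " field TEXT, chunk_index INTEGER, n_values INTEGER, payload BLOB,"
    " PRIMARY KEY (field, chunk_index))"
)
positions = list(range(0, 5000, 2))  # 2,500 synthetic values
write_column(conn, "pos", positions)
n_chunks = conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]
print(n_chunks, read_value(conn, "pos", 1234))
```

SQLite supplies the file format, caching and transactional writes, while the column data itself lives in opaque compressed blobs.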
TSF in Practice - VarSeq
TSF Backed Relational Data Store
 More efficient conditional queries
 Invisible Joins (i.e. row_id => array
offset)
 Size on disk
 “NULL [NA, Missing values] values are part of the domain space, which avoids auxiliary bit masks at the expense of ‘losing’ a single value from the domain.”
 SQL front-end allows using as
back-end to existing analytic and
web-stacks
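Two of the ideas above, the in-domain NULL and the invisible join, can be sketched with Python's standard `array` module. The sentinel value, the column names and the data are assumptions chosen for illustration, and the sketch assumes a platform where the `"i"` typecode is 32 bits wide.

```python
from array import array

# Reserve the smallest int32 as "missing" so no separate bit mask is needed.
INT32_NA = -2**31


def encode(values):
    """Pack Python ints (or None) into a typed array using the sentinel."""
    return array("i", (INT32_NA if v is None else v for v in values))


def decode_value(v):
    """Map the sentinel back to None when reading a single value."""
    return None if v == INT32_NA else v


# Two columns of the same table, stored separately; row i is offset i in both.
read_depth = encode([35, None, 12, 48])
gene_label = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]  # hypothetical labels

# A conditional query on one column yields row ids...
row_ids = [i for i, v in enumerate(read_depth) if v != INT32_NA and v > 20]

# ...and the "invisible join" is just indexing other columns by those ids.
matching_genes = [gene_label[i] for i in row_ids]
print(row_ids, matching_genes, decode_value(read_depth[1]))
```

Because the row id is the array offset, no key lookup or hash join is needed to bring in values from other columns.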

Genomics Data Warehousing and Analytics

  • 1. Background  Golden Helix - Founded in 1998 - Genetic association software - Analytic services - Hundreds of users worldwide - Over 900 customer citations in scientific journals  Products I Build with My Team - SNP & Variation Suite (SVS) - SNP, CNV, NGS tertiary analysis - Import and deal with all flavors of upstream data - VarSeq - Annotate and filter variants in gene panels, exomes and genomes for clinical labs and researchers. - GenomeBrowse (Free!) - Visualization of everything with genomic coordinates. All standardized file formats.
  • 2. Database Trends VarSeq  Tertiary analysis to report in one click  Focused and actionable data  Modeled on ACMG guidelines  Hereditary and cancer templates  OMIM included VSReports  Command line runner  Integrate with your current bioinformatics pipeline  Create repeatable clinical workflows for CLIA and CAP certified analysis  Supports high throughput scenarios VSPipeline  Transactions  Disk structure optimized  Fixed schema  SQL matures  Small mem footprints  Master <-> Slaves  Threaded / Locking  Expensive large mainframes/servers 90s - SQL  Scale out  First class sharding  Utalize cheap memory  Don’t let disk be bottleneck  Support stream / distributed analytics 10s - NewSQL  “Web Scale” - distributed  Eventually consistent  Schema-less, key-based  Avoid joins  Peer-to-peer  Memory cheap  Many cheap commodity servers in datacenter configurations 00s - NoSQL > SELECT * FROM trends GROUP BY decade;
  • 3. The “Database” Market in Thirds VarSeq  Tertiary analysis to report in one click  Focused and actionable data  Modeled on ACMG guidelines  Hereditary and cancer templates  OMIM included VSReports  Command line runner  Integrate with your current bioinformatics pipeline  Create repeatable clinical workflows for CLIA and CAP certified analysis  Supports high throughput scenarios VSPipeline  ACID / Transcations  “Traditional” row-based  MySQL  Postgres  Oracle  MSSQL  NewSQL  VoltDB (scale-out)  Google Spanner/F1  MemSQL  Clustrix OLTP  Key and Hiearchical Based  Wide Columnar Stores  BigTable / HBase  Cassandra  Hiearchical/Document  MongoDB  Couchbase  Key-Value Stores  Redis  Memcachd  FoundationDB  Tuple/Triple-stores Other  Query Optimized  Amazon Redshift  HP Vertica  Infobright  Google BigQuery  Teradata  Cloudera Impala  Hadoop+Hive Data Warehousing http://www.se-radio.net/2013/12/episode-199-michael-stonebraker/
  • 4.  Mike Stonebraker  Illustra (c Postgres), aquired by IBM Informix (1996)  StreamBase (c Aurora), acquired by TIBCO (2013)  Vertica (c C-Store), aquired HP (2011)  VoltDB (c H-Store) 23M function in 4 rounds  Paradigm (c SciDB) INGRES – 73 -> 90 Postgres – 84 -> 92 Mariposa – 92 -> 97 Aurora – 01 -> 08 C-Store – 05 -> 09 H-Store – 07 -> Present SciDB – 08 -> Present
  • 6. Big Data, Small Analytics => Don’t use MapReduce http://www.slideshare.net/Hapyrus/amazon-redshift-is-10x-faster-and-cheaper-than-hadoop-hive
  • 7. Data Warehousing / Scientific Analysis => Columnar You’ve got to know what regression means, what Naïve Bayes means, what k-Nearest Neighbors means. It’s all statistics. All of that stuff turns out to be defined on arrays. It’s not defined on tables. The tools of future data scientists are going to be array-based tools. Those may live on top of relational database systems. They may live on top of an array database system, or perhaps something else. It’s completely open. • Columns -> Faster Queries • Divide columns into chunks • Compress chunks (better ratios than rows) • Pre-compute chunk-level attributes (min/max etc) • Flexible storage layer • Distributed • Encodings (Parquet, ORC/Hive, custom)
  • 8. Extract, Transform, Load (ETL)  “Dimensional Moedeling” - Fact tables & dimensional tables - Fact tables often measurements over time - Dimensional table goes into item details - Denormalized data, complexity hidden - Often many sources loaded into same warehouse - Logs - One or more relational databases (sales, customer-facing etc) - Vender / Payment information  Example “Like table”: datetime, user_id, post_id,client_data “User table”: user_id, subscription_type, last_paid, has_android_app
  • 9. Genomics (Other Life Science) Data Data Warehouse Like
  • 10. Gabe’s Adjusted “Moore’s Law” NGS Cost Graph
  • 12. Genomics is Big Data  5,000 public data repositories  Broad Institute: - Process 40K samples/year - 1000 people - 51 High Throughput Sequencers - 10+ PB of storage  1 Genome in Data - ~300GB Compressed Sequence Data - ~150MB Compressed Variant Data - Seq data went through 5-6 steps
  • 13. We Want Variants  Differences between your DNA and a reference come in man sizes: - Single letter substitutions are called Single Nucleotide Polymorphisms (SNPs) - Small “length polymorphisms” are called Insertions/Deletions (InDels) - Large duplications/deletiosn are called Copy Number Variations  Average European has ~3 million small variations to the reference. 100K of those in the 30K “gene coding” regions (~2% of the genome)
  • 14. Next Generation Sequencing Analysis Primary Analysis:  Analysis of hardware generated data, software built by vendors  Use FPGA and GPUs to handle real-time optical or electrical signals from sequencing hardware Secondary Analysis:  Filtering/clipping of “reads” and their qualities  Alignment/Assembly of reads  Recalibrating, de-duplication, variant calling on aligned reads Tertiary Analysis (“Sense Making”):  QA and filtering of variant calls  Annotation (querying) variants to databases, filtering on results  Merging/comparing multiple samples (multiple files)  Visualization of variants in genomic context  Statistics on matrices
  • 15. Applications of NGS Data in the Clinic Carrier screening – prenatal and standard Lifetime risk prediction Genetic disorder diagnostics Oncology care PGx – dosage and care
  • 16. Public Annotations – Left Joins  Exact Matching “Variants” - “Population Catalogs” - 1000 Genomes (84M variants) - NHLBI 6,500 Exomes (2M variants) - ExAC 61,486 exomes (10M variants) - Clinical Classifications - Precomputed predictions / scores - dbNSFP - 89.6M predictions  Algorithmic Classification - How variant interacts with genes (85K tx)  Region Based - Disease regions - Gene Lists
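The "annotations as left joins" idea on slide 16 can be sketched with `sqlite3`: every sample variant is kept, an exact-match catalog contributes a frequency when chromosome, position, and both alleles agree, and a region table contributes a label when the position falls inside an interval. All table names, column names, and rows here are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sample_variants (chrom TEXT, pos INTEGER, ref TEXT, alt TEXT);
CREATE TABLE catalog (chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, alt_freq REAL);
CREATE TABLE regions (chrom TEXT, region_start INTEGER, region_end INTEGER,
                      label TEXT);
""")
con.executemany("INSERT INTO sample_variants VALUES (?, ?, ?, ?)", [
    ("1", 1000, "A", "G"),
    ("1", 2000, "C", "T"),
    ("2", 500, "G", "GA"),
])
con.executemany("INSERT INTO catalog VALUES (?, ?, ?, ?, ?)", [
    ("1", 1000, "A", "G", 0.12),
    ("1", 2000, "C", "A", 0.30),  # same site, different allele: must not match
])
con.execute("INSERT INTO regions VALUES ('2', 400, 600, 'GENE_X')")

annotated = con.execute("""
    SELECT v.chrom, v.pos, v.ref, v.alt, c.alt_freq, r.label
    FROM sample_variants AS v
    LEFT JOIN catalog AS c            -- exact matching on the full variant
      ON c.chrom = v.chrom AND c.pos = v.pos
     AND c.ref = v.ref AND c.alt = v.alt
    LEFT JOIN regions AS r            -- region-based matching
      ON r.chrom = v.chrom AND v.pos BETWEEN r.region_start AND r.region_end
    ORDER BY v.chrom, v.pos
""").fetchall()
for row in annotated:
    print(row)
```

Variants with no catalog entry come back with `None` in the annotation columns rather than being dropped, which is the behavior a filtering step downstream relies on.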
  • 17. Annotations are Hard!  HGVS is a standard that is not standard - Tries to serve different goals - Many representations of same variant - Should not be used as IDs, but not many good alternatives  Transcripts - Transcript set choice extremely important, hard to curate with meaningful attributes as well.  Public Data Curation - ClinVar: multi-record lines - NHLBI: MAF vs AAF, splitting “glob” fields - 1kG: No genotype counts - ExAC: Multi-allelic splitting, left-align - COSMIC: No Ref/Alt, only HGVS - dbNSFP: Abbreviations and aggregate scores  Versioning and Issues - ClinVar missing variants in VCF - dbSNP patches without version changes
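One of the curation chores listed on slide 17 is multi-allelic splitting. The sketch below splits a site into one record per alternate allele and trims bases shared by both alleles so each record is in a minimal form. The function name and tuple layout are assumptions, and this is trimming only: true left-alignment of InDels also needs the reference sequence, which is not modeled here.

```python
def split_multiallelic(pos, ref, alts):
    """Return one trimmed (pos, ref, alt) record per alternate allele."""
    records = []
    for alt in alts:
        r, a, p = ref, alt, pos
        # Trim shared trailing bases, keeping at least one base per allele.
        while len(r) > 1 and len(a) > 1 and r[-1] == a[-1]:
            r, a = r[:-1], a[:-1]
        # Trim shared leading bases and advance the position to match.
        while len(r) > 1 and len(a) > 1 and r[0] == a[0]:
            r, a, p = r[1:], a[1:], p + 1
        records.append((p, r, a))
    return records


# Made-up site: one deletion and two substitutions sharing a REF allele.
split = split_multiallelic(100, "CAG", ["C", "CAA", "TAG"])
print(split)
```

After splitting, each record can be matched exactly against a catalog that stores one alternate allele per row.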
  • 21. N-Glycanase Deficiency  http://www.ngly1.org/  Matthew Might and Matt Wilsey. The shifting model in clinical diagnostics: how next-generation sequencing and families are altering the way rare diseases are discovered, studied, and treated. Genetics in Medicine. March 2014.
  • 22. Personalized Medicine  Cancer is a disease of the genome  “Molecular Targeted” drugs are effective and usually side-effect free  Genetic testing required to direct cancer treatment is becoming affordable
  • 24. TSF  Use SQLite as container.  SQLite has great cache, multi-threaded and read/write properties  Specialized genomic index, also lexicographical indexes (LevelDB to do string sorting)  GZIP / BLOSC chunk compression  Primitive, Enums and List Types
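To make the "SQLite as container" idea on slide 24 concrete, here is a small sketch that stores one field as compressed chunks in a BLOB column and reads a single row back by decompressing only its chunk. This is not the actual TSF layout: the `chunks` schema, the helper names, and the chunk size are invented for illustration, and `zlib` stands in for the GZIP/BLOSC codecs named on the slide.

```python
import sqlite3
import zlib
from array import array

CHUNK_ROWS = 1000  # hypothetical chunk size

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE chunks (
    field TEXT, chunk_id INTEGER, first_row INTEGER, n_rows INTEGER,
    data BLOB, PRIMARY KEY (field, chunk_id))""")


def write_field(con, field, values, chunk_rows=CHUNK_ROWS):
    """Store an integer field as one compressed BLOB per chunk of rows."""
    for chunk_id, start in enumerate(range(0, len(values), chunk_rows)):
        part = array("i", values[start:start + chunk_rows])
        con.execute(
            "INSERT INTO chunks VALUES (?, ?, ?, ?, ?)",
            (field, chunk_id, start, len(part), zlib.compress(part.tobytes())),
        )


def read_value(con, field, row, chunk_rows=CHUNK_ROWS):
    """Fetch one row by decompressing only the chunk that contains it."""
    (blob,) = con.execute(
        "SELECT data FROM chunks WHERE field = ? AND chunk_id = ?",
        (field, row // chunk_rows),
    ).fetchone()
    part = array("i")
    part.frombytes(zlib.decompress(blob))
    return part[row % chunk_rows]


positions = list(range(0, 5000, 2))  # 2,500 made-up values
write_field(con, "pos", positions)
n_chunks = con.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]
print(n_chunks, read_value(con, "pos", 1234))
```

SQLite supplies the file format, page cache, and transactions, while the payload stays columnar and compressed inside the BLOBs.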
  • 25. TSF in Practice - VarSeq
  • 26. TSF Backed Relational Data Store  More efficient conditional queries  Invisible Joins (i.e. row_id => array offset)  Size on disk  “NULL [NA, Missing values] values are part of the domain space, which avoids auxiliary bit masks at the expense of 'loosing' a single value from the domain.”  SQL front-end allows using as back-end to existing analytic and web-stacks
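Two points on slide 26 lend themselves to a short sketch: an "invisible join", where the row id is simply the offset into every column array, and in-domain NULLs, where one value of the type is reserved to mean missing instead of keeping a separate bit mask. The sentinel choice (`INT_NA`), the column contents, and the helper name are assumptions for illustration, and the code assumes the platform's C `int` is 32 bits.

```python
from array import array

INT_NA = -2**31  # assumed sentinel: the smallest 32-bit int is reserved for NA

# Two columns of the same table; position in the array is the row id.
read_depth = array("i", [31, INT_NA, 18, 45])
gene = ["BRCA1", "NGLY1", "TP53", "CFTR"]


def rows_where_depth_at_least(column, threshold):
    """Return row ids passing the filter, skipping the NA sentinel."""
    return [row_id for row_id, value in enumerate(column)
            if value != INT_NA and value >= threshold]


selected = rows_where_depth_at_least(read_depth, 30)
# The "join" is just indexing the other column with the same offsets.
joined = [(row_id, gene[row_id], read_depth[row_id]) for row_id in selected]
print(joined)
```

No key lookup or hash table is involved, which is why this kind of join costs little more than the filter itself.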

Editor's Notes

  1. Experience comes from building secondary analysis pipelines for our services and for RNA-seq, as well as from supporting users of our downstream tools. Our tools start after secondary analysis.
  2. Powerful, commercial-grade software designed for local hardware. Largest and most up-to-date repository of public annotations. Attention to detail in getting public data right: left-aligning, multi-allelic splitting, etc. Advanced and powerful filtering through a rich user interface. Multiple definable outputs and data export options. Data transformations: VCF/gVCF merging, intelligent handling of multi-allelic sites, breaking MNVs into allelic primitives. Expression editor for creating custom variables.
  4. Mariposa -> federated database built on an economic model of resource trading, in which data distributed across multiple organizations could be integrated and queried from a single relational interface. Aurora -> focused on data management for streaming data, using a new data model and query language. Unlike relational systems, which "pull" data and process it a record at a time, in Aurora data is "pushed", arriving asynchronously from external data sources (such as stock ticks, news feeds, or sensors). C-Store -> a parallel, shared-nothing, column-oriented DBMS for data warehousing. By dividing and storing data in columns, C-Store is able to perform less I/O and get better compression ratios than conventional database systems that store data in rows. H-Store -> a distributed main-memory OLTP system designed to provide very high throughput on transaction processing workloads. First “NewSQL”. Horizontal partitioning (sharding). SciDB -> array-focused scientific workflows: multidimensional data management and analytics common to scientific, geospatial, financial, and industrial applications.
  5. Hive, on the other hand, works a bit differently. In a nutshell, Hive is a SQL-like data warehouse infrastructure built on HDFS (Hadoop Distributed File System). Instead of MPP, Hadoop uses a distributed processing model called MapReduce, which is also designed to process large data sets quickly. According to Amazon (so this data point may be somewhat biased), running an old school data warehouse costs $19,000 – $25,000 per terabyte per year. Redshift, on the other hand, boasts that it costs only $1,000 per terabyte per year at its lowest pricing tier.
  6. MapReduce is dead; it makes a horrible DBMS abstraction (all the HDFS abstractions work around it, and sometimes around HDFS). MapReduce was designed to parse the scraped web, and is not even used by Google to do that. Small analytics: for sum/group-by etc., optimized column stores are much faster than MapReduce.
  7. Starter Template (update to include Primary and Incidental record). Add Coverage Statistics. Add OMIM and ExAC. Filtering. GB Visualization. Lock filter chain. Save template. Re-run with new samples. Open a reports view. Show global config, discuss default templates etc. Select variants for reporting. Classification auto-fill. Change fields. Highly customizable, can include anything; we can customize as a service or you can do it yourself, example N of One.
  8. BUT all of this is embarrassingly parallel. No integration of this data until my product, and then the dataset size is down to single-computer sized work.
  9. Note that for desktop sequencers, Secondary is often bundled in with the machine (MiSeq, PGM, Proton). Primary Analysis: can forget it really exists. Secondary is not there yet. Hence that is the focus of this talk. Note this is DNA-seq focused, but you have similar steps for RNA.
  10. Mention ethnic matching of 1kG and NHLBI (I used European). 89,617,785 functional predictions in dbNSFP. Maybe browse GB.
  11. Upload VCF, get back VCF or interactive report