Counsyl
www.counsyl.com
How I Learned to Stop Worrying
about Big Data
...and love the data that actually counts
Imran S. Haque
Counsyl
18 Jul 2013
Friday, July 26, 13
About the Speaker
•Imran S. Haque (ihaque@counsyl.com)
•Director of Research at Counsyl
•BS EECS, UC Berkeley; PhD CS, Stanford
About Counsyl
We have developed a single genomic test that replaces 100+ expensive assays
It has reduced the cost of carrier testing by literally one hundred fold
Bloom Syndrome $167
Canavan Disease $473
Cystic Fibrosis $506
Familial Dysautonomia $334
Fanconi Anemia $167
Gaucher Disease $467
Glycogen Storage Disease Type Ia $283
Maple Syrup Urine Disease Type 1B $557
Mucolipidosis IV $279
Niemann-Pick Disease Type A $337
Spinal Muscular Atrophy $700
Tay-Sachs Disease $473
Total $4743
Engineering at Counsyl
How big is the data in genomics?
Wetlab
Biology
Ordering
Reporting
Billing
Fulfillment
Automation
Assay
Calling
Big Data Will
Save the World
But what is it, anyway?
Background
Wikipedia “Big Data”:
A collection of data sets so large and
complex that it becomes difficult to
process using on-hand database
management tools or traditional data
processing applications
What Defines Big Data
• Computation: data so large that algorithms must be o(N^(1+ε)):
“almost linear.”
• Handling: data so large that with tractable algorithms
communication becomes more significant than computation.
Why Do People Care?
Big Data is fundamental to fields in which each individual piece
of data is relatively information-light, so it is necessary to
aggregate a lot of it.
This particularly characterizes advertising, which funds the consumer
Internet. People are interested in Big Data as a means to an end (improving
conversion rates), not as an end in itself.
Genomics:
Big Data
But not as we know it.
Short-Read Sequencing in Short
I don’t know what they want from me
It’s like the more money we come across
The more problems we see
It’s like the more
w what they wan
acro5sThe more probl
re problems we see
...
Current sequencers can produce ~100Gb of short (100bp) reads/day
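A minimal sketch of the analogy above: shearing a text into short, fixed-length "reads" taken at random offsets, with an occasional substitution error (like the "acro5s" fragment). The function name `shear_into_reads`, the read length, and the error rate are illustrative assumptions, not anything from the talk.

```python
import random

def shear_into_reads(text, n_reads, read_len, error_rate, seed=0):
    """Sample fixed-length substrings at random offsets, with random substitutions."""
    rng = random.Random(seed)
    alphabet = sorted(set(text))
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(text) - read_len + 1)
        read = list(text[start:start + read_len])
        for i in range(read_len):
            # With probability error_rate, overwrite the character
            # (the replacement may happen to equal the original).
            if rng.random() < error_rate:
                read[i] = rng.choice(alphabet)
        reads.append((start, "".join(read)))
    return reads

lyric = ("I don't know what they want from me "
         "It's like the more money we come across "
         "The more problems we see")

for start, read in shear_into_reads(lyric, n_reads=5, read_len=18, error_rate=0.02):
    print(f"{start:3d}  {read}")
```

A real sequencer does not report the offset; recovering it is the alignment problem.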
Short-Read Alignment
It’s like the more money we come across
                          e come acr
It’s like the more
                re data  we c
     like the more d
                    ata  we come across
It’s like the more data we come across
Ferragina and Manzini, JACM 2005
Langmead et al, Genome Biol 2009
Li and Durbin et al, Bioinformatics 2009
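Once reads are placed against the reference, the sample's own sequence can be read off column by column. The sketch below does a simple majority vote per column. The offsets are assumptions reconstructed from the slide layout, and `consensus` is a hypothetical helper rather than any aligner's API.

```python
from collections import Counter

def consensus(placed_reads):
    """Majority vote per column over reads placed at known offsets."""
    length = max(offset + len(read) for offset, read in placed_reads)
    columns = [Counter() for _ in range(length)]
    for offset, read in placed_reads:
        for i, ch in enumerate(read):
            columns[offset + i][ch] += 1
    # "?" marks columns that no read covers.
    return "".join(col.most_common(1)[0][0] if col else "?" for col in columns)

reference = "It's like the more money we come across"
placed_reads = [  # (assumed offset into the reference, read text)
    (26, "e come acr"),
    (0, "It's like the more"),
    (16, "re data  we c"),
    (5, "like the more d"),
    (20, "ata  we come across"),
]

called = consensus(placed_reads)
variant_columns = [i for i, (r, c) in enumerate(zip(reference, called)) if r != c]
print(called)
print(variant_columns)
```

The columns where the consensus disagrees with the reference ("money" vs "data") play the role of a variant.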
Alignment Algorithms
• Smith-Waterman: O(MN), large constant factor
• Hash-based Alignment: much smaller constants than SW
  • MAQ, SSAHA
• Burrows-Wheeler Alignment: sublinear in size of genome
  • Bowtie, BWA
Ning, Cox, Mullikin. Genome Res 2001
Li, Ruan, Durbin Genome Res 2008
Ferragina and Manzini, JACM 2005
Langmead et al, Genome Biol 2009
Li and Durbin et al, Bioinformatics 2009
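To make "sublinear in size of genome" concrete, here is a toy sketch of exact-match backward search over a Burrows-Wheeler (FM) index. It is not how Bowtie or BWA are implemented: the suffix array is built naively, the tables are uncompressed, and mismatches are not handled. The names `build_fm_index` and `find` are assumptions made for this illustration.

```python
def build_fm_index(text):
    """Build a toy FM-index: suffix array, C table, and Occ table."""
    text += "$"  # sentinel; sorts before A, C, G, T
    # Naive suffix array construction (fine for a toy, far too slow for a genome).
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # first[c]: number of characters in text strictly smaller than c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i]: occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, first, occ

def find(pattern, index):
    """Return sorted start positions of exact matches of pattern."""
    sa, first, occ = index
    lo, hi = 0, len(sa)  # half-open range of suffix array rows
    for ch in reversed(pattern):  # one step per pattern character
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])

genome = "ATCCTTTGGGTGTATGGGTCGTAGCGAACTGAGAAGGGCCGAGG"
index = build_fm_index(genome)
print(find("GGGT", index))
```

The search loop runs once per pattern character, so lookup cost depends on the read length rather than the genome length; the genome-sized work is paid once when the index is built.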
Real-World Alignments
ATCCTTTGGGTGTATGGGTCGTAGCGAACTGAGAAGGGCCGAGG
............!!..............................
.............!!!!....C......................
..............!!!,,,,c,,,,,,,,,,,,,,,,,,,,,,
...............!!!..........................
.................!!..C......................
..................!!!C......................
.....................C!!....................
.....................C.!!...................
.....................C..!!..................
,,,,,,,,,,,,,,,,,,,,,,,,,!!!!!!,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,c,,,,!!!!!!............
...........................!!...............
............................!!..............
.....................C.......!!.............
.....................C.........!!...........
................................!!..........
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,!!.........
.....................C.............!!.......
PAH:Y414C
(heterozygote C/T)
phenylketonuria
Need to align 1.5M reads
per sample, across
thousands of samples!
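A toy sketch of reading a genotype off one pileup column. It assumes the usual pileup convention that "." and "," mean "matches the reference" and a letter means a mismatch; the "!" marks are left uninterpreted and ignored. The column string is modeled on the highlighted position above (reference base T), and the 20%/80% thresholds are illustrative assumptions. Production callers use probabilistic models rather than fixed cutoffs.

```python
from collections import Counter

def call_genotype(column, ref_base, het_range=(0.2, 0.8)):
    """Classify one pileup column by alternate-allele fraction (toy thresholds)."""
    ref_count = sum(ch in ".," for ch in column)
    alts = Counter(ch.upper() for ch in column if ch.upper() in "ACGT")
    depth = ref_count + sum(alts.values())
    if depth == 0:
        return "no-call"
    if not alts:
        return f"{ref_base}/{ref_base}"
    alt_base, alt_count = alts.most_common(1)[0]
    fraction = alt_count / depth
    low, high = het_range
    if fraction < low:
        return f"{ref_base}/{ref_base}"
    if fraction > high:
        return f"{alt_base}/{alt_base}"
    return "/".join(sorted([ref_base, alt_base]))

# One character per read covering the position, top to bottom.
column = ".Cc.CCCCC,c..CC.,C"
print(call_genotype(column, ref_base="T"))
```

With 11 of 18 reads showing C, the column falls inside the heterozygous band, consistent with the C/T call on the slide.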
Genomics: Big Data?
Genomics appears to have all the characteristics of Big Data.
• Large quantity: ~100Gb/day/sequencer
• Advanced algorithms: BWT alignment in linear/sublinear time
But characteristics of the data itself matter too!
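A back-of-the-envelope sketch using only figures quoted in the talk (~100 Gb/day/sequencer, 100 bp reads, 1.5M reads per sample). It assumes every read is usable and samples share a run evenly, so treat the result as an upper bound rather than a measured throughput.

```python
BASES_PER_DAY = 100e9      # ~100 Gb per sequencer per day
READ_LENGTH = 100          # bp per read
READS_PER_SAMPLE = 1.5e6   # reads aligned per sample

reads_per_day = BASES_PER_DAY / READ_LENGTH         # reads produced per sequencer-day
bases_per_sample = READS_PER_SAMPLE * READ_LENGTH   # bases consumed by one sample
samples_per_day = BASES_PER_DAY / bases_per_sample  # idealized samples per sequencer-day

print(f"{reads_per_day:.0e} reads/day")
print(f"{bases_per_sample / 1e6:.0f} Mb per sample")
print(f"~{samples_per_day:.0f} samples per sequencer-day")
```

The daily total is large, but each sample's share of it is small.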
Clinical Genomics: Not That Big
Most of the human genome is currently non-actionable.
Whole Genome Sequencing (~3000 Mb)
Whole Exome Sequencing (~30 Mb)
Clinical Carrier Screening (~1 Mb)
But 100Gb Is Still 100Gb, Right?
Clinical genomics analysis is per-sample.
• Processing is embarrassingly parallel after demultiplexing.
• Handling a single sample is trivial on even a laptop.
Use ZFS and LSF/SGE, not Cassandra and Hadoop.
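A minimal sketch of why per-sample processing is embarrassingly parallel: once reads are split by barcode, no sample needs data from any other. The barcode layout (a fixed-length prefix), the toy reads, and `analyze_sample` are assumptions for illustration. On a cluster each sample would typically be its own scheduler job; a thread pool stands in for that here.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def demultiplex(reads, barcode_len):
    """Group reads by their leading barcode and strip the barcode."""
    by_sample = defaultdict(list)
    for read in reads:
        by_sample[read[:barcode_len]].append(read[barcode_len:])
    return dict(by_sample)

def analyze_sample(item):
    """Stand-in for align-and-call; uses only this sample's reads."""
    barcode, sample_reads = item
    summary = {
        "n_reads": len(sample_reads),
        "n_bases": sum(len(r) for r in sample_reads),
    }
    return barcode, summary

reads = ["AAGGATTC", "CCGATTACA", "AATTGACC", "CCGGGT"]  # toy barcoded reads
samples = demultiplex(reads, barcode_len=2)

# Each sample is an independent task with no shared state.
with ThreadPoolExecutor() as pool:
    results = dict(pool.map(analyze_sample, samples.items()))

print(results)
```

Because the tasks never communicate, adding samples means adding jobs, not redesigning the pipeline.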
Why is Genomics
Still Interesting?
It’s OK to be Lil’.
Research Genomics
Counsyl runs this many samples every year; clinical = scale.
Target                   # Samples   # SNPs
Education Level            126,559   2.2M
Breast/Ovarian Cancer       11,705   31,812
Diabetes                    10,128   2.2M
Telomere Length             37,684   2.4M
Rietveld et al, Science 2013
Couch et al, PLoS Genet 2013
Zeggini et al, Nat Genet 2008
Codd et al, Nat Genet 2013
Clinical Genomics: Big Where It Matters
Whole Genome (3000 Mb)
Clinical Genome (1 Mb)
Clinical Genomics: Big Where It Matters
• Focusing on a small region means you can examine thousands
of people: study important regions in great depth.
• Embarrassingly parallel is a good thing: people pay the bills!
Let’s Science Up This Data
N=83,538 samples, 493 variants
Estimated carrier frequency per population as a binomial.
Bonferroni-corrected binomial equality test comparing each population
against the pooled data finds variants that are significantly enriched/
depleted in particular populations.
Haque et al, in preparation
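One way such a comparison could be sketched: an exact two-sided binomial test of each population's carrier count against the pooled carrier rate, with a Bonferroni-adjusted threshold. This is an illustration, not the analysis from the talk. The population counts are hypothetical, and the two-sided rule (sum all outcomes no more likely than the observed one) is one common convention, assumed here.

```python
import math

def log_binom_pmf(k, n, p):
    """Log of the binomial probability mass function, for 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))

def binom_test_two_sided(k, n, p):
    """Exact test: total probability of outcomes no more likely than k."""
    log_obs = log_binom_pmf(k, n, p)
    total = 0.0
    for i in range(n + 1):
        log_p = log_binom_pmf(i, n, p)
        if log_p <= log_obs + 1e-7:  # small tolerance for float ties
            total += math.exp(log_p)
    return min(1.0, total)

def bonferroni_significant(p_values, alpha=0.05):
    """Flag p-values below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# Hypothetical (carriers, samples tested) per population.
carriers = {"pop_A": (40, 1000), "pop_B": (10, 1000), "pop_C": (22, 1000)}
pooled_rate = (sum(k for k, _ in carriers.values())
               / sum(n for _, n in carriers.values()))

p_values = {pop: binom_test_two_sided(k, n, pooled_rate)
            for pop, (k, n) in carriers.items()}
flags = dict(zip(p_values, bonferroni_significant(list(p_values.values()))))

for pop in carriers:
    print(pop, f"{p_values[pop]:.3g}", flags[pop])
```

With hundreds of variants and several populations, the number of tests in the Bonferroni divisor would be the full variant-by-population count, not the three used in this toy.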
Smith-Lemli-Opitz Syndrome (DHCR7)
• We see a carrier rate double the predicted literature values
(e.g., 1/57 vs 1/124 in Northwestern Europeans)
• We find previously undescribed population associations for
DHCR7:IVS8-1G>C
Population   Frequency   Overall Frequency   P-value    N
AJ           1 in 46     1 in 96             1.18E-11   4330
➡ EA         0           1 in 96             1.56E-07   2739
Haque et al, in preparation
Genetic Disease in South Asians
Cystic Fibrosis (CFTR)
• 1/57 observed vs 1/118 in literature.
GJB2-related DFNB1 nonsyndromic hearing loss and deafness
• Literature claims 1/133 with 35delG, but we find 1/2191.
• 36/2191 carriers, 35 for W24X.
Progressive cone dystrophy/achromatopsia (CNGB3)
• R403Q present in 1/18: 30% of carriers in 4% of tested pop.
Haque et al, in preparation
Size Doesn’t Matter, It’s How You Use It
• Genomics has a real ground truth.
• Genomics has a real impact.
Clinical genomics is interesting independently of “Big”ness.
Future of Genomics
Cratering prices drive technological shifts.
Technologies at the research frontier will become commercialized.
• Whole-genome association studies
• RNA-seq and transcriptomics
• Epigenomics
• Pathogen sequencing and metagenomics
Where Are We Now?
• Theory has been developed in academia and government.
• Scale-up is just beginning in industry: started with tool
vendors, now reaching applications companies.
• New scales of data will feed back into basic R&D.
Recap
• Big Data = “near linear” algorithms; communication is harder than computation.
• Short-read sequencing produces large amounts of data.
• Useful clinical insights are mostly derived from embarrassingly-parallel small data.
• “Small data” genomics is highly impactful in its own right.
• Genomics may enter a “big data” phase in the future with new methods.
</talk>
jobs.counsyl.com
ihaque@counsyl.com
How I Learned to Stop Worrying about Big Data and Love the Data That Actually Counts - Counsyl Tech Talk