How I Learned to Stop Worrying about Big Data and Love the Data That Actually Counts - Counsyl Tech Talk

Published on

July 18, 2013 Counsyl Tech Talk on "How I Learned to Stop Worrying about Big Data and Love the Data That Actually Counts." Speaker: Imran Haque, Director of Research, Counsyl.

Video: https://vimeo.com/71282924
Check out our openings: jobs.counsyl.com

Published in: Technology, Health & Medicine
Transcript of "How I Learned to Stop Worrying about Big Data and Love the Data That Actually Counts - Counsyl Tech Talk"

  1. Counsyl (www.counsyl.com). How I Learned to Stop Worrying about Big Data ...and love the data that actually counts. Imran S. Haque, Counsyl, 18 Jul 2013. Friday, July 26, 13
  2. About the Speaker. • Imran S. Haque (ihaque@counsyl.com) • Director of Research at Counsyl • BS EECS, UC Berkeley; PhD CS, Stanford
  3. About Counsyl. We have developed a single genomic test that replaces 100+ expensive assays. It has reduced the cost of carrier testing by literally one hundred fold.
        Bloom Syndrome                       $167
        Canavan Disease                      $473
        Cystic Fibrosis                      $506
        Familial Dysautonomia                $334
        Fanconi Anemia                       $167
        Gaucher Disease                      $467
        Glycogen Storage Disease Type Ia     $283
        Maple Syrup Urine Disease Type 1B    $557
        Mucolipidosis IV                     $279
        Niemann-Pick Disease Type A          $337
        Spinal Muscular Atrophy              $700
        Tay-Sachs Disease                    $473
        Total                                $4743
  4. Engineering at Counsyl. Wetlab, Biology, Ordering, Reporting, Billing, Fulfillment, Automation, Assay, Calling.
  5. Engineering at Counsyl. How big is the data in genomics? Wetlab, Biology, Ordering, Reporting, Billing, Fulfillment, Automation, Assay, Calling.
  6. Big Data Will Save the World
  7. Big Data Will Save the World. But what is it, anyway?
  8. Background
  9. Background. Wikipedia, “Big Data”: “A collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.”
  10. What Defines Big Data. • Computation: data so large that algorithms must be o(N^(1+ε)): “almost linear.” • Handling: data so large that, with tractable algorithms, communication becomes more significant than computation.
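One way to read the “almost linear” criterion on slide 10, assuming it means the running time is o(N^(1+ε)) for every fixed ε > 0 (an interpretation, not something the slide spells out): an N log N algorithm qualifies, while a quadratic one does not.

```latex
\[
\lim_{N\to\infty}\frac{N\log N}{N^{1+\varepsilon}}
  = \lim_{N\to\infty}\frac{\log N}{N^{\varepsilon}} = 0
  \quad\text{for every } \varepsilon > 0,
\qquad
\lim_{N\to\infty}\frac{N^{2}}{N^{1+\varepsilon}}
  = \lim_{N\to\infty} N^{1-\varepsilon} = \infty
  \quad\text{for } 0 < \varepsilon < 1 .
\]
```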
  11. Why Do People Care? Big Data is fundamental to fields in which each individual piece of data is relatively information-light, so it is necessary to aggregate a lot of it. This particularly characterizes advertising, which funds the consumer Internet. People are interested in Big Data as a means to an end (improving conversion rates), not as an end in itself.
  12. Genomics: Big Data
  13. Genomics: Big Data. But not as we know it.
  14. Short-Read Sequencing in Short. “I don’t know what they want from me / It’s like the more money we come across / The more problems we see”
  15. Short-Read Sequencing in Short. “I don’t know what they want from me / It’s like the more money we come across / The more problems we see” It’s like the more w what they wan acro5sThe more probl re problems we see ...
  16. Short-Read Sequencing in Short. “I don’t know what they want from me / It’s like the more money we come across / The more problems we see” It’s like the more w what they wan acro5sThe more probl re problems we see ... Current sequencers can produce ~100 Gb of short (100 bp) reads/day.
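The read-shredding picture on slides 14 to 16 can be mimicked in a few lines. This is a minimal sketch, not Counsyl's pipeline: the `shred` helper, the 18-character read length, and the seed are all made up for illustration, and real sequencers also introduce errors that this sketch ignores. The last lines work through the throughput figure from slide 16.

```python
import random

def shred(text, read_len, n_reads, seed=0):
    """Sample n_reads substrings of length read_len at random offsets.

    Returns (start, read) pairs. A real sequencer does not report the
    start position; recovering it is the alignment problem.
    """
    last_start = len(text) - read_len
    if last_start < 0:
        raise ValueError("read_len is longer than the text")
    rng = random.Random(seed)
    starts = [rng.randint(0, last_start) for _ in range(n_reads)]
    return [(s, text[s:s + read_len]) for s in starts]

# The "genome" from the slides (straight apostrophes for simplicity).
LYRIC = ("I don't know what they want from me "
         "It's like the more money we come across "
         "The more problems we see")

reads = shred(LYRIC, read_len=18, n_reads=6)
for start, read in reads:
    print(f"{start:3d}  {read!r}")

# Throughput arithmetic from slide 16: ~100 Gb/day at 100 bp per read.
BASES_PER_DAY = 100e9
READ_LENGTH_BP = 100
reads_per_day = BASES_PER_DAY / READ_LENGTH_BP  # about one billion reads/day
```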
  17. Short-Read Alignment (Ferragina and Manzini, JACM 2005; Langmead et al, Genome Biol 2009; Li and Durbin, Bioinformatics 2009)
        It’s like the more money we come across
  18. Short-Read Alignment (Ferragina and Manzini, JACM 2005; Langmead et al, Genome Biol 2009; Li and Durbin, Bioinformatics 2009)
        It’s like the more money we come across
                                  e come acr
  19. Short-Read Alignment (Ferragina and Manzini, JACM 2005; Langmead et al, Genome Biol 2009; Li and Durbin, Bioinformatics 2009)
        It’s like the more money we come across
                                  e come acr
        It’s like the more
  20. Short-Read Alignment (Ferragina and Manzini, JACM 2005; Langmead et al, Genome Biol 2009; Li and Durbin, Bioinformatics 2009)
        It’s like the more money we come across
                                  e come acr
        It’s like the more
                        re data  we c
  21. Short-Read Alignment (Ferragina and Manzini, JACM 2005; Langmead et al, Genome Biol 2009; Li and Durbin, Bioinformatics 2009)
        It’s like the more money we come across
                                  e come acr
        It’s like the more
                        re data  we c
             like the more d
  22. Short-Read Alignment (Ferragina and Manzini, JACM 2005; Langmead et al, Genome Biol 2009; Li and Durbin, Bioinformatics 2009)
        It’s like the more money we come across
                                  e come acr
        It’s like the more
                        re data  we c
             like the more d
                            ata  we come across
  23. Short-Read Alignment (Ferragina and Manzini, JACM 2005; Langmead et al, Genome Biol 2009; Li and Durbin, Bioinformatics 2009)
        It’s like the more money we come across
                                  e come acr
        It’s like the more
                        re data  we c
             like the more d
                            ata  we come across
        It’s like the more data we come across
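The papers cited on slides 17 to 23 build on the FM-index. Below is a toy sketch of its core operation, backward search over a Burrows-Wheeler transform. It assumes a naive suffix sort and exact matching only; the function names `build_fm_index` and `find` are made up here, and Bowtie and BWA use far more compact structures and tolerate mismatches.

```python
def build_fm_index(text):
    """Build a toy FM-index (suffix array, C table, Occ table) for text."""
    text = text + "\0"  # sentinel that sorts before every other character
    # Naive suffix array: fine for a demo, far too slow for a genome.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # C[c] = number of characters in text strictly smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, C, occ

def find(pattern, index):
    """Return sorted start positions of exact matches of a non-empty pattern."""
    sa, C, occ = index
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    for ch in reversed(pattern):  # extend the match one character leftwards
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])

REFERENCE = "It's like the more money we come across"
index = build_fm_index(REFERENCE)
print(find("e come acr", index))  # the first read placed on slide 18
print(find("data", index))        # a read carrying the variant: no exact hit
```

The number of search steps depends on the read length, not the reference length, which is the sense in which slide 25 calls this approach sublinear in the size of the genome. The empty result for "data" is why practical aligners must allow mismatches and gaps.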
  24. Alignment Algorithms (Ning, Cox, Mullikin, Genome Res 2001; Li, Ruan, Durbin, Genome Res 2008; Ferragina and Manzini, JACM 2005; Langmead et al, Genome Biol 2009; Li and Durbin, Bioinformatics 2009)
  25. Alignment Algorithms. • Smith-Waterman: O(MN), large constant factor • Hash-based alignment: much smaller constants than SW (MAQ, SSAHA) • Burrows-Wheeler alignment: sublinear in size of genome (Bowtie, BWA). (Ning, Cox, Mullikin, Genome Res 2001; Li, Ruan, Durbin, Genome Res 2008; Ferragina and Manzini, JACM 2005; Langmead et al, Genome Biol 2009; Li and Durbin, Bioinformatics 2009)
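For comparison with the index-based methods on slide 25, here is a minimal Smith-Waterman sketch. The scoring values (match 2, mismatch -1, linear gap -2) are arbitrary assumptions, and it returns only the best local score and where it ends, not a traceback. The two nested loops over both sequences are where the O(MN) cost comes from.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a against b.

    Returns (score, end_in_a, end_in_b) with 1-based, inclusive end
    positions. Keeps only two rows of the dynamic-programming table.
    """
    prev = [0] * (len(b) + 1)
    best = (0, 0, 0)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never scores below zero.
            score = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(score)
            if score > best[0]:
                best = (score, i, j)
        prev = cur
    return best

REFERENCE = "It's like the more money we come across"
print(smith_waterman("come acr", REFERENCE))      # exact read
print(smith_waterman("more data we", REFERENCE))  # read spanning the variant
```

Unlike exact index lookup, the second call still finds a scoring placement for the read that says "data" where the reference says "money".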
  26. Real-World Alignments
        ATCCTTTGGGTGTATGGGTCGTAGCGAACTGAGAAGGGCCGAGG
        ............  ..............................
        .............    ....C......................
        ..............   ,,,,c,,,,,,,,,,,,,,,,,,,,,,
        ...............   ..........................
        .................  ..C......................
        ..................   C......................
        .....................C  ....................
        .....................C.  ...................
        .....................C..  ..................
        ,,,,,,,,,,,,,,,,,,,,,,,,,      ,,,,,,,,,,,,,
        ,,,,,,,,,,,,,,,,,,,,,c,,,,      ............
        ...........................  ...............
        ............................  ..............
        .....................C.......  .............
        .....................C.........  ...........
        ................................  ..........
        ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,  .........
        .....................C.............  .......
  27. Real-World Alignments
        ATCCTTTGGGTGTATGGGTCGTAGCGAACTGAGAAGGGCCGAGG
        ............  ..............................
        .............    ....C......................
        ..............   ,,,,c,,,,,,,,,,,,,,,,,,,,,,
        ...............   ..........................
        .................  ..C......................
        ..................   C......................
        .....................C  ....................
        .....................C.  ...................
        .....................C..  ..................
        ,,,,,,,,,,,,,,,,,,,,,,,,,      ,,,,,,,,,,,,,
        ,,,,,,,,,,,,,,,,,,,,,c,,,,      ............
        ...........................  ...............
        ............................  ..............
        .....................C.......  .............
        .....................C.........  ...........
        ................................  ..........
        ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,  .........
        .....................C.............  .......
      PAH:Y414C (heterozygote C/T), phenylketonuria
  28. Real-World Alignments
        ATCCTTTGGGTGTATGGGTCGTAGCGAACTGAGAAGGGCCGAGG
        ............  ..............................
        .............    ....C......................
        ..............   ,,,,c,,,,,,,,,,,,,,,,,,,,,,
        ...............   ..........................
        .................  ..C......................
        ..................   C......................
        .....................C  ....................
        .....................C.  ...................
        .....................C..  ..................
        ,,,,,,,,,,,,,,,,,,,,,,,,,      ,,,,,,,,,,,,,
        ,,,,,,,,,,,,,,,,,,,,,c,,,,      ............
        ...........................  ...............
        ............................  ..............
        .....................C.......  .............
        .....................C.........  ...........
        ................................  ..........
        ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,  .........
        .....................C.............  .......
      PAH:Y414C (heterozygote C/T), phenylketonuria. Need to align 1.5M reads per sample, across thousands of samples!
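The heterozygote call on slides 27 and 28 can be reproduced by counting bases in one column of the pileup. The sketch below assumes the usual pileup-viewer convention (`.` and `,` mean "matches the reference" on the forward and reverse strand, a letter is a mismatch) and uses a made-up 20% allele-fraction threshold. It is not Counsyl's caller; production callers use probabilistic models and base qualities.

```python
from collections import Counter

REFERENCE = "ATCCTTTGGGTGTATGGGTCGTAGCGAACTGAGAAGGGCCGAGG"

# Pileup rows transcribed from the slide; gaps between reads are spaces.
PILEUP = [
    "............  ..............................",
    ".............    ....C......................",
    "..............   ,,,,c,,,,,,,,,,,,,,,,,,,,,,",
    "...............   ..........................",
    ".................  ..C......................",
    "..................   C......................",
    ".....................C  ....................",
    ".....................C.  ...................",
    ".....................C..  ..................",
    ",,,,,,,,,,,,,,,,,,,,,,,,,      ,,,,,,,,,,,,,",
    ",,,,,,,,,,,,,,,,,,,,,c,,,,      ............",
    "...........................  ...............",
    "............................  ..............",
    ".....................C.......  .............",
    ".....................C.........  ...........",
    "................................  ..........",
    ",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,  .........",
    ".....................C.............  .......",
]

def allele_counts(reference, rows, column):
    """Count the bases observed at one 0-based reference column."""
    counts = Counter()
    for row in rows:
        if column >= len(row):
            continue
        ch = row[column]
        if ch in ".,":                # read agrees with the reference
            counts[reference[column]] += 1
        elif ch.upper() in "ACGT":    # read shows a different base
            counts[ch.upper()] += 1
        # anything else (a gap between reads) contributes no evidence
    return counts

def call_genotype(counts, min_fraction=0.2):
    """Naive call: report every allele seen in at least min_fraction of reads."""
    total = sum(counts.values())
    alleles = sorted(b for b, n in counts.items() if total and n / total >= min_fraction)
    if not alleles:
        return "no call"
    if len(alleles) == 1:
        alleles = alleles * 2
    return "/".join(alleles)

SITE = 21  # 0-based column of the variant in this window
counts = allele_counts(REFERENCE, PILEUP, SITE)
print(dict(counts), call_genotype(counts))
```

In this window 11 of the 18 reads covering the site carry C and 7 carry the reference T, which is consistent with the C/T heterozygote label on the slide.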
  29. Genomics: Big Data? Genomics appears to have all the characteristics of Big Data. • Large quantity: ~100 GB/day/sequencer • Advanced algorithms: BWT alignment in linear/sublinear time. But characteristics of the data itself matter too!
  30. Clinical Genomics: Not That Big
  31. Clinical Genomics: Not That Big. Most of the human genome is currently non-actionable. Whole Genome Sequencing (~3000 Mb)
  32. Clinical Genomics: Not That Big. Most of the human genome is currently non-actionable. Whole Genome Sequencing (~3000 Mb); Whole Exome Sequencing (~30 Mb)
  33. Clinical Genomics: Not That Big. Most of the human genome is currently non-actionable. Whole Genome Sequencing (~3000 Mb); Whole Exome Sequencing (~30 Mb); Clinical Carrier Screening (~1 Mb)
  34. Clinical Genomics: Not That Big. Most of the human genome is currently non-actionable. Whole Genome Sequencing (~3000 Mb); Whole Exome Sequencing (~30 Mb); Clinical Carrier Screening (~1 Mb). Exome Sequencing (30 Mb)
  35. Clinical Genomics: Not That Big. Most of the human genome is currently non-actionable. Whole Genome Sequencing (~3000 Mb); Whole Exome Sequencing (~30 Mb); Clinical Carrier Screening (~1 Mb). Exome Sequencing (30 Mb); Clinical Carrier Screening (~1 Mb)
  36. But 100Gb Is Still 100Gb, Right?
  37. But 100Gb Is Still 100Gb, Right? Clinical genomics analysis is per-sample. • Processing is embarrassingly parallel after demultiplexing. • Handling a single sample is trivial on even a laptop. Use ZFS and LSF/SGE, not Cassandra and Hadoop.
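The "embarrassingly parallel after demultiplexing" point on slide 37 can be sketched as follows. Everything here is illustrative: the `(barcode, sequence)` input format and the trivial per-sample "analysis" are assumptions, and a thread pool stands in for the batch scheduler (LSF or SGE) that the slide recommends, where each sample would simply be its own job.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def demultiplex(reads):
    """Group (barcode, sequence) pairs by barcode, one group per sample."""
    by_sample = defaultdict(list)
    for barcode, seq in reads:
        by_sample[barcode].append(seq)
    return dict(by_sample)

def analyze_sample(item):
    """Placeholder per-sample analysis: needs no data from other samples."""
    sample, seqs = item
    return sample, {"reads": len(seqs), "bases": sum(len(s) for s in seqs)}

def analyze_all(by_sample, workers=4):
    """Run every sample independently; no communication between tasks."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(analyze_sample, by_sample.items()))

# Hypothetical multiplexed run: two barcodes, three reads.
run = [("AAC", "ACGT"), ("GGT", "AC"), ("AAC", "TT")]
results = analyze_all(demultiplex(run))
print(results)
```

Because no task reads another sample's data, there is nothing to shuffle between workers, which is why a plain job scheduler is enough here.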
  38. Why is Genomics Still Interesting?
  39. Why is Genomics Still Interesting? It’s OK to be Lil’.
  40. Research Genomics
  41. Research Genomics. Counsyl runs this many samples every year; clinical = scale.
        Target                  # Samples   # SNPs   Reference
        Education Level         126,559     2.2M     Rietveld et al, Science 2013
        Breast/Ovarian Cancer   11,705      31,812   Couch et al, PLoS Genet 2013
        Diabetes                10,128      2.2M     Zeggini et al, Nat Genet 2008
        Telomere Length         37,684      2.4M     Codd et al, Nat Genet 2013
  42. Clinical Genomics: Big Where It Matters. Whole Genome (3000 Mb); Clinical Genome (1 Mb)
  43. Clinical Genomics: Big Where It Matters. • Focusing on a small region means you can examine thousands of people: study important regions in great depth. • Embarrassingly parallel is a good thing: people pay the bills!
  44. Let’s Science Up This Data. N=83,538 samples, 493 variants. Estimated carrier frequency per population as a binomial. A Bonferroni-corrected binomial equality test comparing each population against the pooled data finds variants that are significantly enriched/depleted in particular populations. (Haque et al, in preparation)
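The test named on slide 44 can be sketched with the standard library alone. This is a simplified stand-in, not the analysis from the paper in preparation: it treats the pooled carrier frequency as a fixed, known rate and runs an exact two-sided binomial test against it, then applies a Bonferroni correction. The carrier counts and the number of tests in the usage lines are invented for illustration.

```python
import math

def binom_logpmf(k, n, p):
    """Log of the Binomial(n, p) probability of exactly k successes."""
    if p <= 0.0:
        return 0.0 if k == 0 else -math.inf
    if p >= 1.0:
        return 0.0 if k == n else -math.inf
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))

def binom_test_two_sided(k, n, p):
    """Exact two-sided p-value: total probability of all outcomes that are
    no more likely than the observed count k."""
    observed = binom_logpmf(k, n, p)
    total = 0.0
    for i in range(n + 1):
        lp = binom_logpmf(i, n, p)
        if lp <= observed + 1e-9:  # small tolerance for float ties
            total += math.exp(lp)
    return min(1.0, total)

def bonferroni(p_value, n_tests):
    """Bonferroni correction: scale the p-value by the number of tests."""
    return min(1.0, p_value * n_tests)

# Hypothetical numbers: 30 carriers among 2,000 people in one population,
# against a pooled rate of 1 in 100, with 500 tests performed in total.
p_raw = binom_test_two_sided(30, 2000, 1 / 100)
p_adj = bonferroni(p_raw, n_tests=500)
print(p_raw, p_adj, p_adj < 0.05)
```

Working in log space with `math.lgamma` avoids overflow in the binomial coefficient at sample sizes in the thousands, as on the slides.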
  45. Smith-Lemli-Opitz Syndrome (DHCR7). • We see a carrier rate double the predicted literature values (e.g., 1/57 vs 1/124 in Northwestern Europeans). • We find previously undescribed population associations for DHCR7:IVS8-1G>C:
        Population   Population Frequency   Overall Frequency   P-value    N
        AJ           1 in 46                1 in 96             1.18E-11   4330
        ➡EA          0                      1 in 96             1.56E-07   2739
      (Haque et al, in preparation)
  46. Genetic Disease in South Asians. Cystic Fibrosis (CFTR): • 1/57 observed vs 1/118 in literature. GJB2-related DFNB1 nonsyndromic hearing loss and deafness: • Literature claims 1/133 with 35delG, but we find 1/2191. • 36/2191 carriers, 35 for W24X. Progressive cone dystrophy/achromatopsia (CNGB3): • R403Q present in 1/18: 30% of carriers in 4% of tested pop. (Haque et al, in preparation)
  47. Size Doesn’t Matter, It’s How You Use It. • Genomics has a real ground truth. • Genomics has a real impact. Clinical genomics is interesting independently of “Big”ness.
  48. Future of Genomics. Cratering prices drive technological shifts. Technologies at the research frontier will become commercialized. • Whole-genome association studies • RNA-seq and transcriptomics • Epigenomics • Pathogen sequencing and metagenomics
  49. Where Are We Now? • Theory has been developed in academia and government. • Scale-up is just beginning in industry: started with tool vendors, now reaching applications companies. • New scales of data will feed back into basic R&D.
  50. Recap. Big Data = • “near linear” algorithms • communication is harder than computation. Short-read sequencing produces large amounts of data. Useful clinical insights are mostly derived from embarrassingly-parallel small data. “Small data” genomics is highly impactful in its own right. Genomics may enter a “big data” phase in the future with new methods.
  51. </talk> jobs.counsyl.com ihaque@counsyl.com