Scott Kahn Genomic Big Data.gia.052913

Dr. Scott Kahn, CIO of Illumina, presents challenges and progress on big data solutions and its impact on scientific research at the 2013 Genome Informatics Alliance meeting.

Published in: Technology, Education
Transcript of "Scott Kahn Genomic Big Data.gia.052913"

  1. Genome Informatics Alliance 2013: Defining Genomic Big Data and its Impact on Scientific Progress
     COMPANY CONFIDENTIAL – INTERNAL USE ONLY
     © 2009 Illumina, Inc. All rights reserved. Illumina, illuminaDx, Solexa, Making Sense Out of Life, Oligator, Sentrix, GoldenGate, GoldenGate Indexing, DASL, BeadArray, Array of Arrays, Infinium, BeadXpress, VeraCode, IntelliHyb, iSelect, CSPro, and GenomeStudio are registered trademarks or trademarks of Illumina, Inc. All other brands and names contained herein are the property of their respective owners.
  2. From Whence We Came…
     [Figure: example sequence reads (ATGCCGTTT…, CCGGTTAAT…, GAATTGCAG…) and variant calls (6:A2567C, 12:C123T, 20:T4678A), with data sizes 30-40TB, ~5TB, 600GB, ~20GB]
  3. Genomic Big Data
     • Large amounts of data generated in genomics: multiple samples, size of data, etc.
     • Integration of digital data to enrich the context of samples: DNA, RNA, methylation, time courses, spatial distributions with samples, …
     • Fusion of digital data and categorical data: combination rules (categories), extraction from unstructured inputs, …
     • Tools and techniques appropriate for the resultant data sets: visualization, model building, exploration, …
     • Advances require data mining rather than the one-at-a-time hypothesis testing approaches of today
  4. Genomic Big Data and Personal Genome Information
     • PERSONAL SEQUENCE (owned by individual/doctor)
       – Issued: 01 MAR 07
       – Recommended next check: 28 FEB 10
       – PGI id: 5910322 – 61215923014
     • RISK VARIANTS (approved for clinical use)
     • [Figure labels: Human Genome, Clinical studies, Populations, Sequencing, Functional annotation; genome track 3: 12,300 to 3: 12,400 (kb), PPARg]
     • GENOMIC ANNOTATION (in public domain)
       – Variant: C3 : 12,450,610 : T0.7/C0.3 : PPARG : Pro/Leu
       – Medical consequence: Associated with severe insulin resistance, diabetes mellitus, hypertension
       – Pharmacological consequence: Resistant to thiazolidinediones
     • CLINICAL DECISION
       – Consultation
       – Consent
       – Clinical assessment
       – Selected risk information
  5. Platinum Genomes as a Truth Reference: Creating a catalogue of highly-accurate SNPs, indels & SVs
     • Sequencing a 17-member three-generation pedigree.
       – Ultra deep sequencing improves sensitivity
       – Leveraging inheritance information improves accuracy
       – Data and results made publicly available
     • Identifying ultra accurate genomic variants is enabling rapid improvements in technology and software
     • This data will allow us to assess accuracy for many FDA submissions
     • We are collaborating with NIST & CDC to develop a public resource for quantifying sequencing accuracy
  6. Data Reduction Via Vertical Compression (NA12882)
     • Reduction from 40 Q-scores to 8 Q-scores is becoming accepted
     • Sequencing output is still increasing exponentially, so further compression is likely to be required
     • Platinum genome work suggests ~95% of the genome is consistently called (this 95% is known as the platinum regions)
     • Regions which are reliably called may not need 8 Q-score resolution; we can reduce “well sequenced” regions to 2 Q-scores
     • Start with an 8 Q-score bam file:
       – Reduce the platinum regions to 2 Q-scores (keep non-platinum at 8 Q-scores)
       – Reduce the platinum regions to 1 Q-score
       – Whole genome 2 Q-score
       – Reduce platinum region to 2 Q-scores but also keep original Q-scores of mismatches (MM) and anomalous reads
       – ~40Gb (20Gb CRAM)

     Counts in parentheses are for >Q20.

     | Build | Total SNPs (>Q20) | SNPs diff genotype (>Q20) | Not called in Q-score compressed build (>Q20) | Not called in 8 Q-score build (>Q20) |
     |---|---|---|---|---|
     | 8 Q-score | 3,735,575 (3,627,165) | - | - | - |
     | 8 Q-score technical replicate | 3,734,849 (3,626,485) | 45,584 (22,400) | 80,131 (29,211) | 79,405 (28,845) |
     | Platinum Genome 2 Q-score | 3,732,568 (3,620,612) | 3,255 (161) | 3,417 (63) | 410 (127) |
     | Platinum Genome 1 Q-score | 3,764,928 (3,626,468) | 4,002 (584) | 2,605 (75) | 31,958 (2,964) |
     | Whole Genome 2 Q-score | 3,712,636 (3,598,400) | 25,175 (1,912) | 24,237 (166) | 1,298 (112) |
     | Platinum 2 Q-score, keep MM and anom. reads | 3,735,684 (3,627,226) | 197 (123) | 142 (35) | 251 (102) |
  7. Faster Data – DNA to Result in <2 Days
     • Workflow: Sample, Sequence, Analyze, Annotate
     • [Figure labels: 4.5 hr PCR Free library; 27 hr; 8 hr; HiSeq2500; Isaac analysis overnight; 40 hr; 12 core server, 64Gb RAM]
     • Fast turnaround is required for clinical applications
  8. Importance of Non-coding Mutations – Bigger Data!
     • WGS reveals somatic mutations in TERT gene promoter of melanoma patients
     • Form a novel transcription factor binding motif
     • Recurrence in melanoma is as high as any known coding mutation
     [Figure: TERT gene, positions -200 to +200]

     | Gene (mutation) | Incidence in melanoma |
     |---|---|
     | TERT (promoter) | 52% |
     | BRAF (V600E) | 53% |
     | CDKN2A | 50% |
     | NRAS (Q61R) | 28% |
     | TERT (coding) | 1% |

     Horn et al. & Huang et al., Science 2013
  9. Complexity of Data
  10. Surveillance of Leukaemia (CLL) – More Data Complexity!
     [Figure: event timeline with sequencing time points; events labeled Birth, Diagnosis, Treatment (three times), Death]
     [Figure: "Changing subclonal populations", abundance against time points a to e, for NORMAL and CLASS 1 to CLASS 4]
     • “Remission” has disease
     Schuh et al., Oxford
  11. A Deeper Complexity of Genomic Data
  12. Utility Requires Complex Composite Information
     • Annotation sources:
       – Allele Frequency in populations: www.1000genomes.org
       – Medical/Risk data (with expert review): Hgmd, pharmgkb
       – Genetic Variants: dbSNP
       – Functional Effects: ensembl.org, genome.ucsc.edu, encode.org
       – Disease association: genome.gov
     • ANNOTATED GENOME (gVCF), <1Gbyte
     • Outputs: Ancestry, Tissue type, Risk, Carrier status, Diagnosis, Drug response
     • Stages: Annotate, Disseminate, Interpret
     • Delivery: iPad, Plug and Play, Cloud
  13. Genomic Big Data Ecosystems
     [Figure labels: Apps, Public Genomic Databases, Users, EMR, Support & Engineering, Instruments]
  14. Genomic Big Data Status
     [Figure labels: Researcher, Treatment choice, Clinician, Patient, Knowledge, Information]
  15. Challenges for this Meeting to Address
     • What data frameworks and models are required?
     • How will genomes (DNA, RNA, methylation states, etc.) be aggregated and compared?
     • How will collaboration and data sharing evolve?
     • Where will the technology go, and how must the community respond to leverage the benefits?
     • Brainstorming of ideas
     • Sessions from groups that have experiences from many fields
     • Next steps!!
     • Actively participate and enjoy the entire experience!
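
Slide 4 shows a genomic annotation written as a colon-separated record ("C3 : 12,450,610 : T0.7/C0.3 : PPARG : Pro/Leu"). A minimal sketch of how such a record might be turned into structured data follows. The field meanings (chromosome, position, allele frequencies, gene, amino acid change) are an interpretation of the slide, and the names `VariantAnnotation` and `parse_variant` are hypothetical, not part of any Illumina tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantAnnotation:
    """Hypothetical container; field meanings are inferred from the slide."""
    chromosome: str
    position: int
    allele_frequencies: dict
    gene: str
    amino_acids: tuple


def parse_variant(record: str) -> VariantAnnotation:
    # Split the record on ':' and trim the padding around each field.
    chrom, pos, alleles, gene, residues = [f.strip() for f in record.split(":")]
    # "T0.7/C0.3" is read as allele letter followed by its frequency (assumption).
    freqs = {item[0]: float(item[1:]) for item in alleles.split("/")}
    return VariantAnnotation(
        chromosome=chrom.lstrip("C"),        # "C3" is read as chromosome 3 (assumption)
        position=int(pos.replace(",", "")),  # drop thousands separators
        allele_frequencies=freqs,
        gene=gene,
        amino_acids=tuple(residues.split("/")),
    )


variant = parse_variant("C3 : 12,450,610 : T0.7/C0.3 : PPARG : Pro/Leu")
print(variant)
```

Keeping the record as a typed object rather than a string makes it easier to join against other annotation sources by chromosome and position.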
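
Slide 6 describes reducing quality-score resolution by region: 2 Q-scores inside the platinum regions and 8 Q-scores elsewhere. The sketch below illustrates that idea on a single read. The bin edges and representative values are illustrative assumptions (the slide does not give them), the read is assumed to align without gaps, and the function names are hypothetical.

```python
from bisect import bisect_right

# Illustrative 8-level scheme (assumed, not taken from the slide):
# a score falls into the bin whose lower edge is the largest edge <= score.
EIGHT_EDGES = [2, 10, 20, 25, 30, 35, 40]
EIGHT_VALUES = [0, 6, 15, 22, 27, 33, 37, 40]

# Illustrative 2-level scheme (assumed): below Q20 versus Q20 and above.
TWO_EDGES = [20]
TWO_VALUES = [6, 37]


def bin_quality(q, edges, values):
    """Map one Q-score to the representative value of its bin."""
    return values[bisect_right(edges, q)]


def compress_read(quals, start, platinum_intervals):
    """Re-bin the Q-scores of one ungapped read.

    quals: per-base Q-scores; start: reference position of the first base;
    platinum_intervals: half-open (lo, hi) reference intervals.
    """
    out = []
    for offset, q in enumerate(quals):
        pos = start + offset
        in_platinum = any(lo <= pos < hi for lo, hi in platinum_intervals)
        if in_platinum:
            out.append(bin_quality(q, TWO_EDGES, TWO_VALUES))
        else:
            out.append(bin_quality(q, EIGHT_EDGES, EIGHT_VALUES))
    return out


compressed = compress_read([40, 37, 12, 33, 5, 27], start=100,
                           platinum_intervals=[(100, 103)])
print(compressed)
```

Fewer distinct quality values give a downstream compressor longer runs of identical symbols, which is where the size reduction comes from.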
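
The table on slide 6 can be read more easily as rates. The sketch below computes, for each compressed build, the fraction of SNPs whose genotype differs, using the counts from the slide. Dividing by each build's own total SNP count is an assumption made here for illustration; the slide does not state a denominator.

```python
# (total SNPs, SNPs with a different genotype), all-quality counts from slide 6.
builds = {
    "8 Q-score technical replicate": (3_734_849, 45_584),
    "Platinum Genome 2 Q-score": (3_732_568, 3_255),
    "Platinum Genome 1 Q-score": (3_764_928, 4_002),
    "Whole Genome 2 Q-score": (3_712_636, 25_175),
    "Platinum 2 Q-score, keep MM and anom. reads": (3_735_684, 197),
}

# Fraction of each build's SNPs whose genotype differs (assumed denominator).
rates = {name: diff / total for name, (total, diff) in builds.items()}

# Print builds from least to most discordant.
for name, rate in sorted(rates.items(), key=lambda kv: kv[1]):
    print(f"{name}: {rate:.4%}")
```

On these counts every compressed build shows fewer genotype differences than the technical replicate does.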
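
Slide 7 claims DNA to result in under 2 days and labels the figure with 4.5 hr, 27 hr, 8 hr and 40 hr. A quick arithmetic check is sketched below. Which duration belongs to which step is an assumption here (library preparation, sequencing, analysis), since the flattened slide text does not preserve that pairing.

```python
# Step durations in hours; the step-to-duration pairing is assumed.
steps = {
    "PCR Free library": 4.5,
    "HiSeq2500 sequencing": 27.0,
    "Isaac analysis": 8.0,
}

total_hours = sum(steps.values())
two_days = 48.0

print(f"Total: {total_hours} hr of {two_days} hr available")
print(f"Margin: {two_days - total_hours} hr")
```

Under this pairing the steps sum to roughly the 40 hr shown on the slide.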