Towards Precision Medicine: Tute Genomics, a cloud-based application for analysis of personal genomes

4,134 views

Published on

Tute Genomics is cloud-based software that can rapidly analyze entire human genomes. The cost of whole genome sequencing is dropping rapidly and we are in the middle of a genomic revolution. Tute is opening a new door for personalized medicine by helping researchers & healthcare organizations analyze human genomes.

0 Comments
2 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
4,134
On SlideShare
0
From Embeds
0
Number of Embeds
2,762
Actions
Shares
0
Downloads
51
Comments
0
Likes
2
Embeds 0
No embeds

No notes for slide

Towards Precision Medicine: Tute Genomics, a cloud-based application for analysis of personal genomes

  1. 1. TOWARDS PRECISION MEDICINE: a cloud-based application for analysis of personal genomes Reid J. Robison, MD MBA December 6th, 2013
  2. 2. THE FALLING COST of sequencing the human genome $3 BILLION $2000
  3. 3. THE SEQUENCING EXPLOSION
  4. 4. PACE OF DISCOVERY of novel rare-disease-causing genes using whole-exome sequencing 140 105 70 35 2009 2010 2011 2012 Boycott et al. Rare-disease genetics in the era of next-generation sequencing: discovery to translation. Nature Reviews Genetics 14, 681–691 (2013)
  5. 5. RATE OF APPROVAL of rare disease drug products 160 120 80 40 2000 2005 2010 2015 2020 2025 Extrapolation from January 2013 Orphanet Report Series 2030
  6. 6. “We are on the tipping point of a whole new game in how we develop drugs.” Janet Woodcock, M.D. Director, Center for Drug Evaluation and Research, FDA
  7. 7. GENE-FINDING
  8. 8. Patient w/ unknown disease Next-gen sequencing Data processing Variant calling 3 million SNVs 0.5 million indels 1000 SVs ????????????????????????????????
  9. 9. STEPWISE REDUCTION ANNOVAR 450 citations >40,000 downloads Wang K et al. ANNOVAR: Functional annotation of genetic variants from next-generation sequencing data Nucleic Acids Research, 38:e164, 2010
  10. 10. ① 25523 variants ADHD Only non-synonymous or frameshift 6423 variants Conserved variants from 44-species alignment and anemia? 2935 variants Remove variants in segmental duplication regions 2652 variants Remove variants with MAF>1% 421 variants Apply recessive model 17 genes Remove “dispensable” genes 10 genes Literature survey identifies PKLR as candidate gene (confirmed with biochemical assay)
  11. 11. Genome browser shot of the PKLR gene and the location of the two causal mutations. Each of the two mutations sits within an evolutionarily conserved region, and has been reported once in patients affected with PKLR deficiency.
  12. 12. 3910156 variants Keep only exonic/splicing variants 21488 variants Remove synonymous & non-synonymous frameshift variants 10380 variants Remove variants in 1000 genomes project 1146 variants Remove variants in ESP6500 database 935 variants Remove variants in dbSNP135 582 variants Keep only genes with multiple variants 52 genes ! ② BOOKMAN syndrome
  13. 13. 52 genes Remove psuedogenes & questionable calls Remove olfactory receptor genes Sanger sequencing validation ! 2 genes left: TAF1L RBCK1! RanBP-type and C3HC4-type zinc finger containing 1 (Mutation results in splicing error)
  14. 14. OGDEN SYNDROME Clinical Features Two Families
  15. 15. 3441 variants on X chromosome Keep only heterozygous 2381 variants Keep only Stop/NonSyn/FS/Splice ! ③ OGDEN syndrome 136 variants Remove variants in dbSNP 40 variants Remove variants in ClinSeq 40 variants Keep only variants in shared haplotype 1 variant NAA10 (Encodes the catalytic subunit of the major human N-terminal acetyltransferase)
  16. 16. VARIANT annotation & interpretation • • • • >60 annotation types (SIFT, PolyPhen, Allele Freq, HGMD…) User-driven filtering for step-wise reduction. Run gene panels. Robust, scalable, secure. Run case-control & family-based analyses Machine-learning algorithms to generate TUTE score for prioritization of variants
  17. 17. THE TUTE SCORE using machine-learning to prioritize disease genes ① Select a set of functional prediction scores for which coding and non-coding variants can be assigned into ② Built SVM prediction models using SVMsensus ③ Identify the optimal hyperplane for the biggest margin between training points for neutral and deleterious variants & genes ④ Test & refine prediction model using known disease variants from UniProt & synthetic data sets
  18. 18. COMING SOON: more accurate variant calling
  19. 19. Toward  more  accurate  variant  calling  for  “personal  genomes” Jason  O’Rawe1,2, Tao Jiang3, Guangqing Sun3, Yiyang Wu1,2, Wei Wang4, Jingchu Hu3, Paul Bodily5, Lifeng Tian6, Hakon Hakonarson6, W. Evan Johnson7, Reid J. Robison9, Zhi Wei4, Kai Wang8,9, Gholson J. Lyon1,2,9 1) Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, NY, USA; 2) Stony Brook University, Stony Brook, NY, USA; 3) BGI-Shenzhen, Shenzhen, China; 4) Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA; 5) Department of Computer Science, Brigham Young University, Provo, UT, USA; 6) Center for Applied Genomics, Children’s Hospital of Philadelphia, Philadelphia, PA, USA; 7) Department of Medicine, Boston University School of Medicine, Boston MA, USA; 8) Zilkha Neurogenetic Institute, Department of Psychiatry and Preventive Medicine, University of Southern California, Los Angeles, CA, USA; 9) Utah Foundation for Biomedical Research, Salt Lake City, UT, USA. Background Methods Results To facilitate the clinical implementation of genomic medicine by nextgeneration sequencing, it will be critically important to obtain accurate and consistent variant calls on personal genomes. Multiple software tools for variant calling are available, but it is unclear how comparable these tools are or what their relative merits in real-world scenarios might be. Under conditions where “perfect” pipeline parameterization is un-attainable, researchers and clinicians stand to benefit from a greater understanding of the variability introduced into human genetic variation discovery when utilizing many different bioinformatics pipelines or different sequencing platforms. We sequenced 15 exomes from four families using the Illumina HiSeq 2000 platform and Agilent SureSelect v.2 capture kit, with ~120X coverage on average. We analyzed the raw data using near-default parameters with 5 different alignment and variant calling pipelines (SOAP, BWA-GATK, BWA-SNVer, GNUMAP, and BWA-SAMTools). We additionally sequenced a single whole genome using the Complete Genomics (CG) sequencing and analysis pipeline (v2.0), with 95% of the exome region being covered by 20 or more reads per base. Finally, we attempted to validate 919 SNVs and 841 indels, including similar fractions of GATK-only, SOAP-only, and shared calls, on the MiSeq platform by amplicon sequencing with ~5000X average coverage. SNV concordance between five Illumina pipelines across all 15 exomes is 57.4%, while 0.5-5.1% variants were called as unique to each pipeline. Indel concordance is only 26.8% between three indel calling pipelines, even after left-normalizing and intervalizing genomic coordinates by 20 base pairs. 2085 CG v2.0 variants that fall within targeted regions in exome sequencing were not called by any of the Illumina-based exome analysis pipelines, likely due to poor capture efficiency in those regions. Based on targeted amplicon sequencing on the MiSeq platform, 97.1%, 60.2% and 99.1% of the GATK(v.15)-only, SOAPsnp(v1.03)-only and shared SNVs can be validated, yet 54.0%, 44.6% and 78.1% of the GATK-only, SOAP-only and shared indels can be validated. • All Illumina exomes have at least 20 reads or more per base pair in >80% or more of the 44 MB target region. • Concordance rates with common SNPs genotyped on Illumina 610K genotyping chips were calculated. • Sensitivities and specificities were calculated for each pipeline using the Illumina 610k genotyping chips as a golden standard. A) SNV concordance was measured between all SNV calls made by the five illumina data pipelines. Overall concordance is low: 57.4%. • All pipelines show relatively high sensitivity and specificity when detecting known and common SNPS. B) SNV concordance is higher for already described variation (present in dbSNP135). • Specificity generally increases for sets of variants detected by more than a single pipeline. C) SNV concordance is lower for novel, un-described, human genetic variation (absent in dbSNP135). • All pipelines are very good with identifying already known, common SNPs. Sensitivity Known SNPs Mother-1 SOAPsnp GATK SNVer GNUMAP SAMTools Son-1 SOAPsnp GATK SNVer GNUMAP SAMTools Son-2 SOAPsnp GATK SNVer GNUMAP SAMTools Father-1 SOAPsnp GATK SNVer GNUMAP SAMTools Concordance Concordance Sites rate 6088 6074 99.77% 6249 6224 99.60% 5723 5708 99.74% 5458 5434 99.56% 5885 5848 99.37% 6366 6353 99.80% 6341 6323 99.72% 6255 6239 99.74% 5850 5828 99.62% 6383 6362 99.67% 6412 6401 99.83% 6426 6413 99.80% 6336 6325 99.83% 5906 5889 99.71% 6477 6450 99.58% 6247 6238 99.86% 6304 6288 99.75% 6205 6192 99.79% 5805 5786 99.67% 6344 6327 99.73% Compared Sites C Novel SNPs Mean* Software B A Specificity SD Mean* SD #Total #cSNP Ti/Tv #Total #cSNP Ti/Tv SOAPsnp Sample The similarity between SNV and indel calls made between two versions of GATK, v1.5 and v2.3-9, was measured. SNV and indel calls were made using both the UnifiedGenotyper and HaplotypeCaller modules on the same k8101-49685 participant sample. 99.82 0.039 94.53 2.287 30,022 17,409 2.77 875 419 1.94 GATK 99.72 0.085 95.33 1.161 29,620 17,306 2.8 365 206 2.34 SNVer 99.78 0.044 92.32 4.339 28,242 17,111 2.85 490 253 2.52 GNUMAP 99.64 0.065 86.67 3.286 24,893 15,144 3.03 1,091 659 1.28 SAMTools 99.59 0.158 94.45 4.221 29,577 17,449 2.78 949 539 1.33 ANY pipeline 99.62 0.113 97.72 1.215 33,947 19,638 2.68 2,163 1,182 1.23 >=2 pipelines 99.69 0.074 96.68 2.298 31,099 18,108 2.77 639 323 2.17 >=3 pipelines 99.73 0.045 95.65 3.143 29,363 17,257 2.84 416 230 2.56 >=4 pipelines 99.82 0.041 92.63 3.412 26,772 16,097 2.91 318 193 2.67 5 pipelines 99.87 0.015 80.61 5.266 21,174 13,320 3.12 234 149 2.83 • SNP concordance between the illumina data calls and the Complete Genomics v2.0 data calls was calculated for a single sample, “k8101-49685”. • Indel concordance between the three indel calling Illumina data pipelines (A) is low, 26.8%. • MiSeq validation was performed on a combination of SNPs and indels chosen (1756 in total) from sequencing data from the sample “k8101-49685”. • There are 2085 SNVs that Complete Genomics v2.0 detected but are not detected by any of the five Illumina data pipelines, despite high mappability among these variants. • Concordance is much better for known indels (B), and conversely much lower for novel, unknown, indels (C) (as defined by presence or absence in dbSNP135). • SNVs that were uniquely called by the SOAP-SNP v.1.03/Soap indel v2.01 and GATK v1.5 pipeline validated relatively well, with the SNVs called by both pipelines being better validated. A • Indels validated poorly for both unique to GATK(v.1.5) and SOAPindel (v2.01) calls. Overlapping indel calls validated better, though still relatively poorly.
  20. 20. #variantcallingproblems • • • • • • • • ! 15 exomes from 4 families 1 whole genome from Complete Genomics Illumina HiSeq platform and Agilent SureSelect capture kit 120X mean coverage Five NGS alignment+variant calling pipelines are tested (SOAP, BWA-GATK, BWA-SNVer, GNUMAP, and BWA-SAMtools) Illumina 610k SNP array used as gold standard ~60% SNVs are called by all five pipelines 0.5 to 5.1% of variants were called as unique to each pipeline.
  21. 21. • • • • All pipelines are ‘good’ with known, common SNPs Specificity increases for variants detected by more than one pipeline Indel concordance between all 3 platforms was very low (26.8%) Complete Genomics picked up >2000 variants that weren’t seen on Illumina, despite high mappability in these regions Specificity Sensitivity Known SNPs Novel SNPs Mean* SD Mean* SD #Total #cSNP Ti/Tv #Total #cSNP Ti/Tv SOAPsnp 99.82 0.039 94.53 2.287 30,022 17,409 2.77 875 419 1.94 GATK1.5 99.72 0.085 95.33 1.161 29,620 17,306 2.8 365 206 2.34 SNVer 99.78 0.044 92.32 4.339 28,242 17,111 2.85 490 253 2.52 GNUMAP 99.64 0.065 86.67 3.286 24,893 15,144 3.03 1,091 659 1.28 SAMTools 99.59 0.158 94.45 4.221 29,577 17,449 2.78 949 539 1.33 ANY pipeline 99.62 0.113 97.72 1.215 33,947 19,638 2.68 2,163 1,182 1.23 >=2 pipelines 99.69 0.074 96.68 2.298 31,099 18,108 2.77 639 323 2.17 >=3 pipelines 99.73 0.045 95.65 3.143 29,363 17,257 2.84 416 230 2.56 >=4 pipelines 99.82 0.041 92.63 3.412 26,772 16,097 2.91 318 193 2.67 5 pipelines 99.87 0.015 80.61 5.266 21,174 13,320 3.12 234 149 2.83
  22. 22. VARIANT CALLING CONCLUSIONS ① Significant discrepancy among all pipelines when applied to the same Illumina datasets ② There exists a set of robust calls that are shared among all pipelines even under lax parameters (although false negative rate is high) To get an accurate genome, you need to run multiple algorithms.
  23. 23. A results portal that lets doctors, labs & researchers give their patients access to important genetic findings
  24. 24. Reid J. Robison, MD MBA reid@tutegenomics.com

×