Developing Tools & Methodologies for the Next Generation of Genomics & Bioinformatics
Published in: Technology, Business



  • 1. Computational challenges to getting NextGen sequence right: implications for diagnostics and therapeutics
  • 2. Progress in biomedical discovery has been enabled by technological progress
      • New sequencing technology – 100s of genomes a day are now produced
      • Advances in software
          • Standard DNA analysis codes have emerged
          • New versions continuously released
          • Custom software developed for unconventional analysis
          • Development of analysis pipelines for automated analysis and compilation of results
      • Advances in computational hardware
          • Codes standardized on Intel processor-based systems ease porting to new systems
          • Continuous advances in the Intel product line enable us to easily “keep up”
      • The bottom line – With process advances and new Intel MIC processors we have seen speedups from 1 genome/2 weeks to 50 genomes/day.
      • It is straightforward to expand hardware in response to computational demand
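
The throughput figures on slide 2 imply a speedup factor that is easy to check by hand. The sketch below only restates that arithmetic; the function name `throughput_speedup` is illustrative and not from the slides.

```python
def throughput_speedup(days_per_genome_before, genomes_per_day_after):
    """Ratio of new to old throughput, with both rates in genomes per day.

    The old rate is 1 / days_per_genome_before genomes per day, so dividing
    the new rate by it is the same as multiplying by days_per_genome_before.
    """
    return genomes_per_day_after * days_per_genome_before


# Figures quoted on the slide: 1 genome per 2 weeks, then 50 genomes per day.
speedup = throughput_speedup(days_per_genome_before=14, genomes_per_day_after=50)
print(f"Speedup: {speedup}x")  # prints "Speedup: 700x"
```

In other words, the quoted change corresponds to roughly a 700-fold increase in genomes processed per day.
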
  • 3. Computing is primarily done on a machine we developed: SHADOWFAX
      • A heterogeneous computing environment for data-intensive computations
      • ~2,524 CPUs, > 12 TB RAM (spectrum of Intel)
      • 8 Intel® Xeon® E5-2600/FPGA hybrid core systems (in partnership with Convey)
      • ~0.8 PB disk arrays (DDN)
      • 100 PB Sun/Oracle tape storage system
  • 4. Computing is primarily done on a machine we developed: SHADOWFAX
      • With local synchronized copies of major databases: Medline, arXiv, PubMed Central, Genbank, SwissProt, 1,000 Genomes Project, The Cancer Genome Atlas, Wikipedia
      • To meet the needs of applications that demand HPC: deep sequencing assembly and analysis, molecular modeling, simulations, proteomics analysis, text mining, Health IT
  • 5. NextGen DNA sequence analysis is now the rate-limiting step
      • The cost of sequencing has dropped from $3B/genome to ~$1K/genome.
      • New genomes are sequenced daily.
      • It is estimated that 30,000 human genomes are complete, with 15,000 of these in the public domain.
      • Analysis has focused on Single Nucleotide Polymorphisms (“SNPs”), which are single-letter changes in the DNA code.
      • For complex diseases like cancer, heart disease and mental disorders, extensive work has still explained only 10-20% of the known genetic component.
      • Recent research indicates that, due to experimental measurement noise, perhaps most of the measured variations are false positives.
  • 6. Microsatellites, or repetitive DNA sequences, are particularly challenging
      • Microsatellites, also called Simple Sequence Repeats or Short Tandem Repeats, are an understudied portion of the genome because they are considered part of our “Junk DNA” or, more recently, “Dark Matter” DNA; research focus has been on Single Nucleotide Polymorphisms (“SNPs”)
      • Microsatellites have known value: they have long been used for paternity and forensic testing and are linked to neurological diseases (e.g. Huntington’s and Fragile-X)
      • None of the major genomic research projects has focused on microsatellites: not the Human Genome Project, the 1000 Genomes Project, The Cancer Genome Atlas, ENCODE or the iCOGS study.
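
To make the definition on slide 6 concrete, here is a minimal sketch of locating short tandem repeats in a DNA string with a regular expression. It is an illustration only, not the pipeline described in these slides: the function name `find_microsatellites`, the 1-6 bp repeat-unit range, and the minimum of 3 repeats are all assumptions chosen for the example, and the toy sequence is made up.

```python
import re


def find_microsatellites(seq, min_unit=1, max_unit=6, min_repeats=3):
    """Return (start, unit, repeat_count) for each perfect tandem repeat.

    Assumed thresholds (not from the slides): repeat units of 1-6 bases,
    repeated at least 3 times in a row. Interrupted repeats are not handled.
    """
    # Lazy quantifier on the unit so the shortest repeating unit is tried
    # first; the backreference then consumes as many extra copies as exist.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits


# Toy sequence built for the example: a (CAG)x4 run and an (AC)x5 run.
toy_sequence = "GGT" + "CAG" * 4 + "TT" + "AC" * 5 + "GG"
print(find_microsatellites(toy_sequence))
# prints [(3, 'CAG', 4), (17, 'AC', 5)]
```

Calling real sequencing reads is harder than this: read errors and alignment ambiguity in repetitive regions are a large part of why the slide describes microsatellites as challenging.
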
  • 7. Genomeon’s Research Methodology
      • Download and rebuild thousands of “healthy” and “affected” genomes
      • Create genotype distributions for “healthy” and “affected” populations
      • Compute Fisher’s exact test p-value for each of ~1 million loci and rank results
      • Identify “Patterns of Informative Microsatellites” (PIM) from loci that pass Bonferroni and Benjamini–Hochberg False Discovery Rate tests
      • Manually review, do QC, compute sensitivity and specificity
      • Annotate with ontologies, literature, input from experts
      • Validate PIM with sequencing of well-characterized samples
      • Business analysis; product definition; IP
      • Publish; translate, regulatory approval, reimbursement; team with established clinical services co.
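
The statistical steps on slide 7 (a Fisher's exact test per locus, then Bonferroni and Benjamini–Hochberg filtering) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: each locus is reduced to a 2x2 table (the slides speak of genotype distributions, which may have more categories), the default `alpha=0.05` is an assumption, and the locus names and counts are invented for the example.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def table_prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = table_prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is at least as unlikely as the observed one.
    p = sum(q for q in map(table_prob, range(lo, hi + 1))
            if q <= p_observed * (1 + 1e-9))
    return min(1.0, p)


def bonferroni(pvalues, alpha=0.05):
    """True where p <= alpha / m (controls the family-wise error rate)."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg(pvalues, alpha=0.05):
    """True where the Benjamini-Hochberg step-up procedure rejects."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * alpha:
            cutoff_rank = rank  # keep the largest rank that meets its threshold
    passed = [False] * m
    for i in order[:cutoff_rank]:
        passed[i] = True
    return passed


# Hypothetical counts per locus, NOT data from the slides:
# (affected with genotype, affected without, healthy with, healthy without)
loci = {
    "locus_A": (30, 10, 10, 30),
    "locus_B": (20, 20, 19, 21),
    "locus_C": (25, 15, 14, 26),
}
names = list(loci)
pvalues = [fisher_exact_two_sided(*loci[name]) for name in names]
passes_bonf = bonferroni(pvalues)
passes_bh = benjamini_hochberg(pvalues)

# Rank loci by p-value, smallest first.
for p, name, bonf, bh in sorted(zip(pvalues, names, passes_bonf, passes_bh)):
    print(f"{name}: p={p:.3g} bonferroni={bonf} benjamini_hochberg={bh}")
```

With about one million loci, the Bonferroni threshold at an assumed alpha of 0.05 would be 0.05 / 1,000,000 = 5e-8 per locus, which shows how strict that filter is compared with the rank-dependent Benjamini–Hochberg thresholds.
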
  • 8. Genomeon has created a unique library of over 7700 genomes from the 1000 Genomes Project and The Cancer Genome Atlas, with corrected microsatellites
      • “Healthy Population” representing many ethnicities
      • Ovarian cancer
      • Breast cancer
      • Brain cancer: Glioma; Glioblastoma; Medulloblastoma
      • Lung adenocarcinoma
      • Prostate cancer
      • Melanoma
      • Autism
  • 9. Breast Cancer
  • 10. Pattern of 55 informative microsatellites differentiates Breast Cancer germlines from healthy germlines
      • Sensitivity = 84%
      • Specificity = 87%
      • BRCA-positive samples
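
For readers less familiar with the two metrics quoted on slide 10, a minimal sketch of how sensitivity and specificity are computed from classification counts follows. The counts are hypothetical placeholders chosen for round numbers; they are not the study's data and do not reproduce the 84% / 87% figures.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts.

    sensitivity = TP / (TP + FN): fraction of affected samples flagged.
    specificity = TN / (TN + FP): fraction of healthy samples cleared.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical counts (NOT from the slides): 100 affected and 100 healthy samples.
sens, spec = sensitivity_specificity(tp=90, fn=10, tn=80, fp=20)
print(f"Sensitivity = {sens:.0%}, Specificity = {spec:.0%}")
# prints "Sensitivity = 90%, Specificity = 80%"
```
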
  • 11. Applications of these microsatellite loci variations
      • Cancer Risk Diagnostics: Microsatellite profiling for increased risk of cancer, and the tissues at highest risk
      • Companion/Treatment Diagnostics: Many informative microsatellites are functional elements implicated in therapeutic response
      • Clinical Trial Support: Use of microsatellite profile to differentiate sub-populations in clinical trials
      • Drug Targets: Identification of a large number of genes previously unassociated with cancer, many with functions associated with cancer processes
      • Toxicology: Quantification of stress-induced exposures via microsatellite mutation screen
      • Prognosis: Comparison of microsatellite variations between germlines and tumors
      • Non-cancer Diseases: PTSD, Autism, MS, cardiac diseases, aging
  • 12. Thank you. Any Questions?