Bioinformatics in Gene Research
Intended for a mixed, general audience of clinicians, business interests, and research scientists. No audio; however, the event was recorded and posted to YouTube by Genome Atlantic: http://www.youtube.com/watch?v=FLVjwOngu-Q

Usage Rights

© All Rights Reserved

Bioinformatics in Gene Research Presentation Transcript

  • 1. Bioinformatics in Genetics Research. Genetics Noon Symposium Series. Daniel Gaston, PhD, Dr. Karen Bedard Lab, Department of Pathology. November 21st, 2012
  • 2. IGNITE Orphan Diseases: Identifying Genes and Novel Therapeutics to Enhance Treatment Identify causative genetic variations in orphan diseases with an emphasis on Atlantic Canada Develop animal and cell culture models Identify and develop novel therapeutics igniteproject.ca
  • 3. IGNITE Orphan Diseases: Identifying Genes and Novel Therapeutics to Enhance Treatment Identify causative genetic variations in orphan diseases with an emphasis on Atlantic Canada Develop animal and cell culture models Identify and develop novel therapeutics igniteproject.ca
  • 4. Outline Introduction  Bioinformatics in Disease Genomics  Next-Generation Sequencing Genomics in Research and the Clinic The Data Deluge and its Solutions  Bioinformatic Methods for Analyzing Genomic Data Case Studies Conclusion
  • 5-6. Bioinformatics in Disease Genomics
      ◦ Handling and long-term storage of raw data (sequencing, gene expression, etc.)
      ◦ Maintenance and support of computational infrastructure
      ◦ Experimental design
      ◦ Data analysis
      ◦ Methods development
          ▪ Analysis pipelines
          ▪ Statistical analysis techniques
          ▪ Algorithm design
  • 7. ‘Next-Generation’ Sequencing and Disease Genomics
  • 8-13. Disease Genomics: Hunting Down Pathogenic Genetic Variation. [Figure, built up across six slides. Labels: Reference: Exon 1, Intron 1, Exon 2; Splice Sites; Start; TAA (Stop); mRNA coding for protein; TAC (Tyr); Patient: Exon 1, Intron 1, Exon 2]
  • 14-15. Disease Genomics: Research vs Clinic
      ◦ Still predominantly research oriented
          ▪ Complex/Common disease
          ▪ Mendelian disorders
          ▪ Cancer genomics
      ◦ Clinical genomics starting to gain traction
          ▪ Cancer genomics: cancer subtype identification; personalized medicine and predicting outcomes
          ▪ Mendelian disorders: early diagnosis; cost effectiveness
  • 16. Clinical Genomics Children’s Mercy Hospital NICU  In the US >20% of infant deaths due to genetic disease  Serial sequencing of candidate genes too slow
  • 17-18. Children’s Mercy Hospital NICU
      ◦ 50-hour differential diagnosis of monogenic disease
          ▪ Sample preparation and sequencing: 30.5 hours
          ▪ Automated bioinformatics analysis: 17.5 hours
          ▪ Previous high-throughput sequencing methods: 19 days
          ▪ Tested on seven infants: two previously diagnosed using standard methods, five undiagnosed
      ◦ Caveats
          ▪ Bioinformatics portion not available outside of the hospital
          ▪ Requires thorough clinical phenotyping using a controlled vocabulary
          ▪ Generates a large amount of data
  • 19. The Data Deluge
      ◦ 4 million genetic variants
      ◦ 2 million associated with protein-coding genes
      ◦ 10,000 possibly of disease-causing type
      ◦ 1,500 with <1% frequency in population
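The funnel on slide 19 can be read as a series of successive reductions. The sketch below only re-expresses the four counts quoted on the slide as fractions of the preceding stage; the stage labels are paraphrased and the helper name `reduction_steps` is made up for illustration.

```python
# Variant counts quoted on the slide; labels are paraphrased.
FUNNEL = [
    ("all genetic variants", 4_000_000),
    ("associated with protein-coding genes", 2_000_000),
    ("possibly of disease-causing type", 10_000),
    ("population frequency below 1%", 1_500),
]


def reduction_steps(stages):
    """Return (label, count, fraction of the previous stage kept) per stage."""
    steps = []
    previous = None
    for label, count in stages:
        kept = 1.0 if previous is None else count / previous
        steps.append((label, count, kept))
        previous = count
    return steps


if __name__ == "__main__":
    for label, count, kept in reduction_steps(FUNNEL):
        print(f"{count:>9,}  {kept:7.2%} of previous stage  {label}")
```

Read this way, the largest single reduction comes from restricting to variants of a potentially disease-causing type, not from the frequency filter.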
  • 20. Surviving the Data Deluge. Reducing the Search Space: Exome Sequencing
  • 21. Exome Sequencing Exome: Portion of genome composed of protein- coding exons and functional RNA sequences 1.5 - 2% of human genome (50 Mb) > 85% of monogenic diseases due to variants in exome Complete exome sequencing: ~ $1000/sample
  • 22. Caveats Incomplete and non-uniform coverage of exome  Systematic bias (GC content)  Random sampling Not all genetic variants amenable to discovery  Non-coding variants  Structural variants
  • 23. Surviving the Data Deluge: Bioinformatics
  • 24. Typical Bioinformatics Workflow: QC of Raw Data → Map to Reference → QC → Find Variants → QC → Annotate → Filter
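The workflow on slide 24 is an ordered chain of stages with quality control between them. A minimal sketch of that shape is given below. Every stage is a placeholder that only records that it ran; no real QC, mapping, or variant-calling tool is invoked, and the names `make_stage` and `run_pipeline` are hypothetical.

```python
# Stage order taken from the slide; the implementations are placeholders.
STAGE_NAMES = [
    "QC of raw data",
    "Map to reference",
    "QC",
    "Find variants",
    "QC",
    "Annotate",
    "Filter",
]


def make_stage(name):
    """Build a placeholder stage that appends its name to the run history."""
    def run(state):
        return {**state, "history": state["history"] + [name]}
    return run


PIPELINE = [(name, make_stage(name)) for name in STAGE_NAMES]


def run_pipeline(sample_id, pipeline=PIPELINE):
    """Run every stage in order, threading a state dict through the chain."""
    state = {"sample": sample_id, "history": []}
    for _name, stage in pipeline:
        state = stage(state)
    return state


if __name__ == "__main__":
    print(run_pipeline("sample-001")["history"])
```

Keeping each stage as a separate callable makes it straightforward to swap one program for another at a given step, which matters given how many alternatives exist per stage.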
  • 25-29. It sounds simple but…
      ◦ For every stage there are multiple programs available and published in the literature
      ◦ For every program there is a wide variety of parameter values and options. Defaults are often “good enough”, but not always
      ◦ Best combinations of programs and options are not well understood
      ◦ Protocols change rapidly as new technologies and methods are developed
      ◦ Different centres and groups use slightly different workflows, with similar but not identical results
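One way to make "similar, but not identical results" concrete is to compare the variant call sets produced by two workflows on the same sample. The sketch below assumes each call is represented as a `(chromosome, position, ref, alt)` tuple; the two example call sets are invented for illustration and do not come from the talk.

```python
def concordance(calls_a, calls_b):
    """Summarize the overlap between two sets of variant calls."""
    a, b = set(calls_a), set(calls_b)
    shared = a & b
    union = a | b
    return {
        "shared": len(shared),
        "only_a": len(a - b),
        "only_b": len(b - a),
        # Jaccard index: shared calls over all distinct calls.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


# Invented example calls (chromosome, position, ref, alt).
WORKFLOW_A = {
    ("chr9", 121_000_100, "A", "G"),
    ("chr9", 125_500_000, "C", "T"),
    ("chr17", 80_000_000, "G", "A"),
}
WORKFLOW_B = {
    ("chr9", 121_000_100, "A", "G"),
    ("chr17", 80_000_000, "G", "A"),
    ("chr17", 80_500_000, "T", "C"),
}

if __name__ == "__main__":
    print(concordance(WORKFLOW_A, WORKFLOW_B))
```

Calls unique to one workflow are the ones worth inspecting by hand, since they may reflect parameter choices and not biology.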
  • 30. Typical Bioinformatics Workflow: QC of Raw Data → Map to Reference → QC → Find Variants → QC → Annotate → Filter
  • 31. Annotating Variants
  • 32. “If a problem cannot be solved, enlarge it.” -- Dwight D. Eisenhower
  • 33. Annotations Associated with Genomic Variants
      ◦ Is variant in a known protein-coding gene?
          ▪ What does the gene do?
          ▪ What molecular pathways?
          ▪ What protein-protein interactions?
          ▪ What tissues is it expressed in?
          ▪ When in development?
      ◦ Has this variant been seen before?
          ▪ What population(s)? With what frequency?
          ▪ Has it been seen in local sequencing projects?
          ▪ Is there any known clinical significance?
      ◦ What is the effect of the variation?
          ▪ Does it change the resulting protein? How?
    Sidebar (repeated from slide 19): 4 million genetic variants; 2 million associated with protein-coding genes; 10,000 possibly of disease-causing type; 1,500 with <1% frequency in population
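The questions on slide 33 map naturally onto the fields of a per-variant record. The sketch below is one possible layout, not the IGNITE project's actual schema; every field name is an assumption chosen to mirror a question on the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AnnotatedVariant:
    """Hypothetical record holding the annotations asked about on the slide."""
    chrom: str
    pos: int
    ref: str
    alt: str
    # Gene-level questions
    gene: Optional[str] = None
    pathways: List[str] = field(default_factory=list)
    interactions: List[str] = field(default_factory=list)
    expressed_in: List[str] = field(default_factory=list)
    # "Has this variant been seen before?"
    population_freq: Dict[str, float] = field(default_factory=dict)
    local_observations: int = 0
    clinical_significance: Optional[str] = None
    # "What is the effect of the variation?"
    protein_effect: Optional[str] = None

    def in_known_gene(self) -> bool:
        return self.gene is not None

    def seen_before(self) -> bool:
        return bool(self.population_freq) or self.local_observations > 0


if __name__ == "__main__":
    v = AnnotatedVariant("chr9", 121_000_100, "A", "G", gene="DNM1",
                         local_observations=2)
    print(v.in_known_gene(), v.seen_before())
```

Leaving unknown answers as `None` or empty collections keeps "not annotated" distinct from "annotated as absent".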
  • 34. Gene Annotation Resources
  • 35. Variant Annotation Resources
  • 36. Potential Pitfalls with Annotation Sources
      ◦ Databases often overlap and agree, but there may be disagreements
      ◦ Source of information: predicted versus experimental
      ◦ Incorrect and out-of-date information
      ◦ Large-scale un-validated versus manually curated datasets
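Disagreements between annotation sources are easy to miss unless they are checked for explicitly. A minimal sketch of such a check follows, assuming each source is reduced to a mapping from a variant identifier to a label. The source names `source_a` and `source_b`, the variant identifiers, and the labels are all placeholders.

```python
def find_disagreements(annotations_by_source):
    """Return variants whose labels differ between sources.

    annotations_by_source: {source_name: {variant_id: label}}
    Result: {variant_id: {source_name: label}} for conflicting variants only.
    """
    by_variant = {}
    for source, labels in annotations_by_source.items():
        for variant_id, label in labels.items():
            by_variant.setdefault(variant_id, {})[source] = label
    return {
        variant_id: labels
        for variant_id, labels in by_variant.items()
        if len(set(labels.values())) > 1
    }


if __name__ == "__main__":
    sources = {
        "source_a": {"var1": "benign", "var2": "pathogenic"},
        "source_b": {"var1": "benign", "var2": "benign", "var3": "unknown"},
    }
    print(find_disagreements(sources))
```

A variant present in only one source is not flagged here; whether that should count as a disagreement is a design decision left open.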
  • 37. Bioinformatics Analyses of Genomic Variants: Combining Data Sources and Filtering
  • 38. IGNITE Data Pipeline and Integration. [Diagram. Labels: Gene Annotations; Annotated Genomic Variants; Mapped Gene Region(s) Definitions; Known Genes; Pathway and Interactions; steps: Filter, Sort, Prioritize]
  • 39. Filtering the Data: Categorization. [Decision-tree figure. Recoverable labels: 4 million variants; Intronic; Exonic; Intergenic; Unknown; Splice Site (twice); Silent Mutation; Amino Acid Changing; Potential Disease Causing (twice); Known Genetic Disease Variant; Amino Acid Change Likely Pathogenic; Amino Acid Change Likely Benign; Known Polymorphism in Population; Stop Loss / Stop Gain]
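The categories on slide 39 can be expressed as a small decision function. The exact branching of the original tree is not recoverable from the transcript, so the order of checks below is an assumption, as are the dictionary keys used to describe a variant.

```python
def categorize(variant):
    """Assign a variant (a dict) to one of the categories named on the slide.

    Assumed keys: region ("intronic" | "exonic" | "intergenic"),
    consequence, known_disease_variant, known_polymorphism,
    predicted_pathogenic. The precedence of checks is illustrative only.
    """
    if variant.get("known_disease_variant"):
        return "known genetic disease variant"
    if variant.get("known_polymorphism"):
        return "known polymorphism in population"

    region = variant["region"]
    consequence = variant.get("consequence")

    if region == "intergenic":
        return "intergenic"
    if consequence == "splice_site":
        return "splice site (potentially disease causing)"
    if region == "intronic":
        return "intronic (unknown)"
    if consequence in ("stop_gain", "stop_loss"):
        return "stop loss / stop gain"
    if consequence == "silent":
        return "silent mutation"
    if consequence == "amino_acid_change":
        if variant.get("predicted_pathogenic"):
            return "amino acid change, likely pathogenic"
        return "amino acid change, likely benign"
    return "uncategorized"


if __name__ == "__main__":
    print(categorize({"region": "exonic", "consequence": "stop_gain"}))
```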
  • 40. Filtering the Data: Common or Rare?
      ◦ Variants in dbSNP: typically known polymorphisms, unlikely to be associated with rare disease
      ◦ Variants with relatively high frequency in control populations (1000 Genomes, HapMap, EVS, 2800 Exomes)
      ◦ Number of times variant previously seen at sequencing centre/locally
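The three criteria on slide 40 can be combined into a single rarity test. In the sketch below, the 1% frequency cutoff echoes the "<1% frequency" figure from slide 19, while the local-count cutoff of 2 and the dictionary keys are assumptions. Treating any dbSNP entry as a hard exclusion is a simplification of "typically known polymorphisms".

```python
def is_rare(variant, max_control_freq=0.01, max_local_count=2):
    """Return True if a variant passes all three rarity criteria.

    Assumed keys: in_dbsnp (bool), control_freqs ({dataset: frequency}),
    local_count (times seen at the local sequencing centre).
    """
    if variant.get("in_dbsnp"):
        return False
    control_freqs = variant.get("control_freqs", {})
    if any(freq > max_control_freq for freq in control_freqs.values()):
        return False
    return variant.get("local_count", 0) <= max_local_count


if __name__ == "__main__":
    candidate = {"in_dbsnp": False,
                 "control_freqs": {"1000 Genomes": 0.001, "EVS": 0.0},
                 "local_count": 1}
    print(is_rare(candidate))
```

Loosening `max_control_freq` or `max_local_count` trades more candidates to review against a lower chance of discarding the causal variant.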
  • 41-44. Notes on Filtering and Variant Annotation
      ◦ Very important to be aware of population when referencing frequency of a variant. Incorrect background leads to incorrect assumptions on prevalence
      ◦ Reasonably well-sampled local populations are better than any other reference
      ◦ Strike a balance between hard filtering for variants of largest potential effect and being inclusive to not miss variants
      ◦ Some genes acquire large-effect variants (stop loss / stop gain, etc.) frequently. Some genes can be lost without causing disease
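The preference for a reasonably well-sampled local population can be sketched as a fallback rule: use the local allele frequency when enough local alleles have been observed, otherwise pool all available references. The threshold of 200 alleles and the population names are assumptions made for illustration.

```python
def reference_frequency(allele_counts, local="local", min_local_alleles=200):
    """Pick the allele frequency to use as background for a variant.

    allele_counts: {population: (alt_alleles, total_alleles)}
    Returns (population_used, frequency), or (None, None) with no data.
    """
    alt, total = allele_counts.get(local, (0, 0))
    if total >= min_local_alleles:
        return local, alt / total

    # Local sample too small: fall back to all populations pooled together.
    pooled_alt = sum(a for a, _ in allele_counts.values())
    pooled_total = sum(t for _, t in allele_counts.values())
    if pooled_total == 0:
        return None, None
    return "pooled", pooled_alt / pooled_total


if __name__ == "__main__":
    # Invented counts: common locally, rare in a global reference.
    counts = {"local": (12, 400), "global": (5, 10_000)}
    print(reference_frequency(counts))
```

In the invented example the variant looks rare against the global reference but is at 3% locally, which is the kind of mismatch the slide warns about.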
  • 45. Applications to Real Data: Charcot-Marie-Tooth Disease and Cutis Laxa
  • 46. IGNITE Data Pipeline and Integration. [Diagram. Labels: Gene Annotations; Annotated Genomic Variants; Mapped Gene Region(s) Definitions; Known Genes; Pathway and Interactions; steps: Filter, Sort, Prioritize]
  • 47. Charcot-Marie-Tooth: Genetic Mapping. Chromosome 9: 120,962,282 - 133,033,431
  • 48. Cutis Laxa: Genetic Mapping. Chromosome 17: 79,596,811 - 81,041,077
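The mapped intervals on slides 47 and 48 act as the first filter: only variants falling inside the linked region are considered further. The sketch below uses the coordinates from the slides and assumes they are inclusive, 1-based positions; the helper names are hypothetical.

```python
# Mapped regions from the slides: (chromosome, start, end).
REGIONS = {
    "Charcot-Marie-Tooth": ("9", 120_962_282, 133_033_431),
    "Cutis Laxa": ("17", 79_596_811, 81_041_077),
}


def region_length(region):
    """Length in base pairs, assuming inclusive coordinates."""
    _chrom, start, end = region
    return end - start + 1


def in_region(chrom, pos, region):
    """True if a position on a chromosome falls inside the mapped region."""
    region_chrom, start, end = region
    return chrom == region_chrom and start <= pos <= end


if __name__ == "__main__":
    for disease, region in REGIONS.items():
        print(f"{disease}: {region_length(region) / 1e6:.2f} Mb")
```

Under these assumptions the Charcot-Marie-Tooth interval spans about 12.07 Mb and the Cutis Laxa interval about 1.44 Mb.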
  • 49. Charcot-Marie-Tooth
      ◦ 143 genes in region
      ◦ 13 known genes in genome: MPZ, PMP22, GDAP1, KIF1B, MFN2, SOX, EGR2, DNM2, RAB7, LITAF (SIMPLE), GARS, YARS, LMNA
    Cutis Laxa
      ◦ 52 genes in region
      ◦ 5 known genes in genome: ATP6V0A2, ELN, FBLN5, EFEMP2, SCYL1BP1, ALDH18A1
  • 50. Pathway and Interaction Data
    Charcot-Marie-Tooth: 37 pathways
      ◦ Clathrin-derived vesicle budding
      ◦ Lysosome vesicle biogenesis
      ◦ Endocytosis
      ◦ Golgi-associated vesicle biogenesis
      ◦ Membrane trafficking
      ◦ Trans-Golgi network vesicle budding
      ◦ Primarily LMNA or DNM2
    Cutis Laxa: 10 pathways
      ◦ Phagosome
      ◦ Collecting duct acid secretion
      ◦ Lysosome
      ◦ Protein digestion and absorption
      ◦ Metabolic pathways
      ◦ Oxidative phosphorylation
      ◦ Arginine and proline metabolism
      ◦ Primarily ATP6V0A2
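The prioritization step combines the preceding slides: genes in the mapped region are ranked by how strongly they connect to genes already known to cause the disease, through shared pathways or direct interactions. The scoring rule below (one point per shared pathway or interacting known gene) is an assumption, and the example inputs are a toy subset, not actual database content.

```python
def prioritize(candidates, known_genes, pathways, interactions):
    """Rank candidate genes by links to known disease genes.

    pathways:     {gene: set of pathway names}
    interactions: {gene: set of interacting genes}
    Returns a list of (gene, score, interacting known genes, shared pathways),
    highest score first; genes with no link are dropped.
    """
    known = set(known_genes)
    known_pathways = set().union(*(pathways.get(g, set()) for g in known))

    ranked = []
    for gene in candidates:
        shared = pathways.get(gene, set()) & known_pathways
        partners = interactions.get(gene, set()) & known
        score = len(shared) + len(partners)
        if score:
            ranked.append((gene, score, sorted(partners), sorted(shared)))
    return sorted(ranked, key=lambda row: (-row[1], row[0]))


if __name__ == "__main__":
    # Toy inputs for illustration only.
    ranked = prioritize(
        candidates=["DNM1", "LRSAM1", "GENE_X"],
        known_genes=["DNM2"],
        pathways={"DNM2": {"Endocytosis"}, "LRSAM1": {"Endocytosis"}},
        interactions={"DNM1": {"DNM2"}},
    )
    for row in ranked:
        print(row)
```

Genes with no pathway or interaction link are dropped from the ranking here; a more inclusive version could keep them with a score of zero so they remain available for review.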
  • 51. Results: Charcot-Marie-Tooth. 8 genes prioritized
      Gene       Interactions   Pathway
      LRSAM1     Multiple       Endocytosis
      DNM1       DNM2           -
      FNBP1      DNM2           -
      TOR1A      LMNA           -
      STXBP1     Multiple       Five
      SH3GLB2    -              Endocytosis
      PIP5KL1    -              Endocytosis
      FAM125B    -              Endocytosis
    For more information: Guernsey et al (2010) PLoS Genetics. 6(8): e1001081
  • 52. Results: Cutis Laxa. 10 genes prioritized
      Gene       Interactions   Pathway
      HEXDC      Multiple       Phagosome
      HG5        -              Phagosome
      HG5        Multiple       Lysosome, Protein digestion
      SIRT7      Multiple       Metabolic Pathways
      FASN       -              Metabolic Pathways
      DCXR       -              Metabolic Pathways
      PYCR1      -              Metabolic Pathways, Arginine/Proline
      PCYT2      -              Metabolic Pathways
      ARHGDIA    -              Oxidative Phosphorylation
    For more information: Guernsey et al (2009) Am J Hum Genet. 85(1): 120-9
  • 53-54. Conclusions
      ◦ Bioinformatics is involved at every stage of genomic research, from experimental design through to final analysis
      ◦ Standards and best practices do exist, but are rapidly evolving as new technologies and methods are developed
      ◦ Progress towards automatic generation of clinically interpretable genomics studies
      ◦ Annotation, filtering, and prioritization of genetic variants are crucial
      ◦ Balance between false positive calls and false negatives
  • 55. Where Are We Headed?
      ◦ Integration of more data sources
          ▪ Gene expression
          ▪ More annotation sources
          ▪ Controlled phenotype vocabularies
          ▪ Gene Ontology terms
          ▪ Predictive models
          ▪ Recessive versus dominant inheritance and penetrance
      ◦ “New” and Emerging Technologies
          ▪ RNA-Seq (Gene Expression)
          ▪ ChIP-Seq (Protein-DNA binding)
          ▪ Single-Molecule Sequencing
  • 56. Acknowledgements Dalhousie University  McGill/Genome Quebec  Dr. Karen Bedard  Dr. Jacek Majewski  Dr. Chris McMaster  Jeremy  Dr. Andrew Orr Schwartzentruber  Dr. Conrad Fernandez  Dr. Marissa Leblanc  Dr. Sarah Dyack  Mat Nightingale  Dr. Johane Robataille  Bedard Lab  Genome Atlantic  IGNITE