Bioinformatics in Gene Research
Intended for a mixed/general audience of clinicians, business interests, and research scientists. No audio; however, the event was recorded and posted to YouTube by Genome Atlantic: http://www.youtube.com/watch?v=FLVjwOngu-Q

Transcript

  • 1. Bioinformatics in Genetics Research Genetics Noon Symposium Series Daniel Gaston, PhD Dr. Karen Bedard Lab, Department of Pathology November 21st, 2012
  • 2. IGNITE Orphan Diseases: Identifying Genes and Novel Therapeutics to Enhance Treatment Identify causative genetic variations in orphan diseases with an emphasis on Atlantic Canada Develop animal and cell culture models Identify and develop novel therapeutics igniteproject.ca
  • 3. IGNITE Orphan Diseases: Identifying Genes and Novel Therapeutics to Enhance Treatment Identify causative genetic variations in orphan diseases with an emphasis on Atlantic Canada Develop animal and cell culture models Identify and develop novel therapeutics igniteproject.ca
  • 4. Outline Introduction  Bioinformatics in Disease Genomics  Next-Generation Sequencing Genomics in Research and the Clinic The Data Deluge and its Solutions  Bioinformatic Methods for Analyzing Genomic Data Case Studies Conclusion
  • 5. Bioinformatics in Disease Genomics Handling and long-term storage of raw data (sequencing, gene expression, etc) Maintenance and support of computational infrastructure Experimental design Data analysis Methods development  Analysis pipelines  Statistical analyses  Algorithm design
  • 6. Bioinformatics in Disease Genomics Handling and long-term storage of raw data (sequencing, gene expression, etc) Maintenance and support of computational infrastructure Experimental design Data analysis Methods development  Analysis pipelines  Statistical analysis techniques  Algorithm design
  • 7. ‘Next-Generation’ Sequencing and Disease Genomics
  • 8. Disease Genomics: Hunting Down Pathogenic Genetic Variation [Figure: reference gene with Start, Exon 1, Intron 1, Exon 2, and TAA stop codon]
  • 9. Disease Genomics: Hunting Down Pathogenic Genetic Variation [Figure: reference gene with Start, Exon 1, Intron 1, Exon 2, and TAA stop codon; splice sites; mRNA coding for protein]
  • 10. Disease Genomics: Hunting Down Pathogenic Genetic Variation [Figure: reference gene with Start, Exon 1, Intron 1, Exon 2, and TAA stop codon; splice sites; mRNA coding for protein; patient gene with Exon 1, Intron 1, Exon 2]
  • 11. Disease Genomics: Hunting Down Pathogenic Genetic Variation [Figure: reference gene with Start, Exon 1, Intron 1, Exon 2, and TAA stop codon; splice sites; mRNA coding for protein; patient gene with Exon 1, Intron 1, Exon 2, and TAC (Tyr)]
  • 12. Disease Genomics: Hunting Down Pathogenic Genetic Variation [Figure: reference gene with Start, Exon 1, Intron 1, Exon 2, and TAA stop codon; splice sites; mRNA coding for protein; patient gene with Exon 1, Intron 1, Exon 2, and TAC (Tyr)]
  • 13. Disease Genomics: Hunting Down Pathogenic Genetic Variation [Figure: reference gene with Start, Exon 1, Intron 1, Exon 2, and TAA stop codon; splice sites; mRNA coding for protein; patient gene with Exon 1, Intron 1, Exon 2, and TAC (Tyr)]
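The figure on slides 8 to 13 appears to contrast a reference TAA stop codon with a patient TAC (Tyr) codon, which would be a stop-loss change. A minimal Python sketch of classifying such a single-codon change follows. The partial codon table and the function name `classify_codon_change` are illustrative assumptions, not something taken from the talk.

```python
# Partial codon table: only the codons needed for this illustration.
CODON_TABLE = {
    "TAA": "Stop", "TAG": "Stop", "TGA": "Stop",
    "TAC": "Tyr", "TAT": "Tyr",
}

def classify_codon_change(ref_codon, alt_codon):
    """Classify the effect of replacing ref_codon with alt_codon."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "silent"
    if ref_aa == "Stop":
        return "stop loss"
    if alt_aa == "Stop":
        return "stop gain"
    return "amino acid change"

# Reference TAA (Stop) versus patient TAC (Tyr), as in the figure.
print(classify_codon_change("TAA", "TAC"))
```

A real annotator would use the full genetic code and the transcript's reading frame; this sketch only shows the comparison step.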
  • 14. Disease Genomics: Research vs Clinic Still predominantly research oriented  Complex/Common disease  Mendelian disorders  Cancer genomics
  • 15. Disease Genomics: Research vs Clinic Still predominantly research oriented  Complex/Common disease  Mendelian disorders  Cancer genomics Clinical genomics starting to gain traction  Cancer genomics  Cancer subtype identification  Personalized medicine and predicting outcomes  Mendelian disorders  Early diagnosis  Cost effectiveness
  • 16. Clinical Genomics Children’s Mercy Hospital NICU  In the US >20% of infant deaths due to genetic disease  Serial sequencing of candidate genes too slow
  • 17. Children’s Mercy Hospital NICU 50-hour differential diagnosis of monogenic disease  Sample preparation and sequencing: 30.5 hours  Automated bioinformatics analysis: 17.5 hours  Previous high-throughput sequencing methods: 19 days  Test on seven infants, two previously diagnosed using standard methods, five undiagnosed
  • 18. Children’s Mercy Hospital NICU 50-hour differential diagnosis of monogenic disease  Sample preparation and sequencing: 30.5 hours  Automated bioinformatics analysis: 17.5 hours  Previous high-throughput sequencing methods: 19 days  Test on seven infants, two previously diagnosed using standard methods, five undiagnosed Caveats  Bioinformatics portion not available outside of hospital  Requires thorough clinical phenotyping using a controlled vocabulary  Generates a large amount of data
  • 19. The Data Deluge 4 million genetic variants 2 million associated with protein-coding genes 10,000 possibly of disease causing type 1500 <1% frequency in population
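To make the scale of this funnel concrete, the short Python sketch below expresses each count on the slide as a percentage of the 4 million starting variants. The counts come from the slide; the helper name `percent_of_total` is an assumption for illustration.

```python
# Counts taken from the slide, in the order listed.
FUNNEL = [
    ("genetic variants", 4_000_000),
    ("associated with protein-coding genes", 2_000_000),
    ("possibly of disease causing type", 10_000),
    ("<1% frequency in population", 1_500),
]

def percent_of_total(funnel):
    """Return (label, percent of the first count) for every step."""
    total = funnel[0][1]
    return [(label, 100.0 * count / total) for label, count in funnel]

for label, percent in percent_of_total(FUNNEL):
    print(f"{percent:8.4f}%  {label}")
```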
  • 20. Surviving the Data Deluge Reducing the Search Space: Exome Sequencing
  • 21. Exome Sequencing Exome: Portion of genome composed of protein-coding exons and functional RNA sequences 1.5 - 2% of human genome (50 Mb) > 85% of monogenic diseases due to variants in exome Complete exome sequencing: ~ $1000/sample
  • 22. Caveats Incomplete and non-uniform coverage of exome  Systematic bias (GC content)  Random sampling Not all genetic variants amenable to discovery  Non-coding variants  Structural variants
  • 23. Surviving The Data Deluge Bioinformatics
  • 24. Typical Bioinformatics Workflow QC of Raw Data Map to Reference QC Find Variants QC Annotate Filter
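The workflow on this slide can be sketched as an ordered list of stages with QC gates in between. The Python below is only a skeleton under stated assumptions: `passthrough` stands in for real tools (aligner, variant caller, annotator), the `failed` flag is a hypothetical marker, and none of the names refer to an actual pipeline from the talk.

```python
def qc(records):
    """QC gate: keep only records that have not been flagged as failing."""
    return [r for r in records if not r.get("failed", False)]

def passthrough(records):
    """Stand-in for a real tool; returns its input unchanged."""
    return records

# Stage order follows the slide.
STAGES = [
    ("QC of Raw Data", qc),
    ("Map to Reference", passthrough),
    ("QC", qc),
    ("Find Variants", passthrough),
    ("QC", qc),
    ("Annotate", passthrough),
    ("Filter", passthrough),
]

def run_workflow(records, stages=STAGES):
    """Run every stage in order, recording how many records survive each."""
    history = []
    for name, stage in stages:
        records = stage(records)
        history.append((name, len(records)))
    return records, history

result, history = run_workflow([{"id": "r1"}, {"id": "r2", "failed": True}])
for name, remaining in history:
    print(f"{name}: {remaining} record(s)")
```

Recording the count after every stage makes it easier to see where data is being lost when a stage or parameter is swapped.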
  • 25. It Sounds simple but… For every stage there are multiple programs available and published in the literature
  • 26. It Sounds simple but… For every stage there are multiple programs available and published in the literature For every program there is a wide variety of parameter values and options. Defaults often “good enough” but not always
  • 27. It Sounds simple but… For every stage there are multiple programs available and published in the literature For every program there is a wide variety of parameter values and options. Defaults often “good enough” but not always Best combinations of programs and options not well understood
  • 28. It Sounds simple but… For every stage there are multiple programs available and published in the literature For every program there is a wide variety of parameter values and options. Defaults often “good enough” but not always Best combinations of programs and options not well understood Protocols changing rapidly as new technologies and methods developed
  • 29. It Sounds simple but… For every stage there are multiple programs available and published in the literature For every program there is a wide variety of parameter values and options. Defaults often “good enough” but not always Best combinations of programs and options not well understood Protocols changing rapidly as new technologies and methods developed Different centres and groups use slightly different workflows with similar, but not identical results
  • 30. Typical Bioinformatics Workflow QC of Raw Data Map to Reference QC Find Variants QC Annotate Filter
  • 31. Annotating Variants
  • 32. If a problem cannot be solved, enlarge it. --Dwight D. Eisenhower
  • 33. Annotations Associated with Genomic Variants Is variant in a known protein-coding gene? What does the gene do? What molecular pathways? What protein-protein interactions? What tissues is it expressed in? When in development? Has this variant been seen before? What population(s)? With what frequency? Has it been seen in local sequencing projects? Is there any known clinical significance? What is the effect of the variation? Does it change the resulting protein? How? [Inset, repeated from slide 19: 4 million genetic variants; 2 million associated with protein-coding genes; 10,000 possibly of disease causing type; 1500 <1% frequency in population]
  • 34. Gene Annotation Resources
  • 35. Variant Annotation Resources
  • 36. Potential Pitfalls with Annotation Sources Databases often overlap and agree, but there may be disagreements Source of information: Predicted versus experimental Incorrect and out-of-date information Large-scale un-validated versus manually curated datasets
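One way to surface the disagreements mentioned above is to compare the label each source assigns to the same variant. The Python sketch below assumes annotations have already been loaded into plain dictionaries; the source names, variant IDs, and labels are hypothetical placeholders, not real database output.

```python
def find_disagreements(annotations_by_source):
    """Return {variant_id: {source: label}} for variants whose labels differ.

    annotations_by_source maps a source name to {variant_id: label}.
    """
    by_variant = {}
    for source, calls in annotations_by_source.items():
        for variant_id, label in calls.items():
            by_variant.setdefault(variant_id, {})[source] = label
    return {
        variant_id: calls
        for variant_id, calls in by_variant.items()
        if len(set(calls.values())) > 1
    }

# Hypothetical annotations from two sources.
annotations = {
    "source_a": {"var1": "benign", "var2": "pathogenic"},
    "source_b": {"var1": "benign", "var2": "benign"},
}
print(find_disagreements(annotations))
```

Flagged variants still need manual review; the sketch says nothing about which source is correct.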
  • 37. Bioinformatics Analyses of Genomic Variants Combining Data Sources and Filtering
  • 38. IGNITE Data Pipeline and Integration [Figure: pipeline diagram. Inputs: Gene Annotations, Annotated Genomic Variants, Mapped Gene Region(s) Definitions, Known Genes, Pathway and Interactions. Steps: Filter, Sort, Prioritize]
  • 39. Filtering the Data: Categorization [Figure: decision tree. 4 million variants split into Intronic, Exonic, and Intergenic; lower levels include Unknown, Splice Site, Silent Mutation, Amino Acid Changing, Potential Disease Causing, Known Genetic Disease Variant, Amino Acid Change Likely Pathogenic, Amino Acid Change Likely Benign, Known Polymorphism in Population, and Stop Loss / Stop Gain]
  • 40. Filtering the Data: Common or Rare? Variants in dbSNP: Typically known polymorphisms, unlikely to be associated with rare disease Variants with relatively high frequency in control populations (1000 Genomes, HapMap, EVS, 2800 Exomes) Number of times variant previously seen at sequencing centre/locally
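The three criteria on this slide can be combined into one rarity check. The Python below is a sketch under stated assumptions: the dictionary keys (`in_dbsnp`, `control_frequencies`, `local_count`) and the default cut-offs are hypothetical. The 1% frequency cut-off echoes the "<1% frequency" figure from slide 19; the local count limit is purely illustrative.

```python
def is_candidate_rare_variant(variant, max_frequency=0.01,
                              max_local_count=1, exclude_dbsnp=True):
    """Return True if the variant passes all three rarity criteria."""
    # 1. Known polymorphisms in dbSNP are treated as unlikely candidates.
    if exclude_dbsnp and variant.get("in_dbsnp", False):
        return False
    # 2. Reject variants that are common in any control population.
    frequencies = variant.get("control_frequencies", {})
    if any(freq >= max_frequency for freq in frequencies.values()):
        return False
    # 3. Reject variants seen too often in local sequencing projects.
    return variant.get("local_count", 0) <= max_local_count

variant = {
    "in_dbsnp": False,
    "control_frequencies": {"1000 Genomes": 0.001, "EVS": 0.0},
    "local_count": 0,
}
print(is_candidate_rare_variant(variant))
```

Making `exclude_dbsnp` a parameter keeps the dbSNP criterion adjustable, since the slide says such variants are only "typically" known polymorphisms.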
  • 41. Notes on Filtering and Variant Annotation Very important to be aware of population when referencing frequency of a variant. Incorrect background leads to incorrect assumptions on prevalence
  • 42. Notes on Filtering and Variant Annotation Very important to be aware of population when referencing frequency of a variant. Incorrect background leads to incorrect assumptions on prevalence Reasonably well-sampled local populations are better than any other reference
  • 43. Notes on Filtering and Variant Annotation Very important to be aware of population when referencing frequency of a variant. Incorrect background leads to incorrect assumptions on prevalence Reasonably well-sampled local populations are better than any other reference Strike a balance between hard filtering for variants of largest potential effect and being inclusive to not miss variants
  • 44. Notes on Filtering and Variant Annotation Very important to be aware of population when referencing frequency of a variant. Incorrect background leads to incorrect assumptions on prevalence Reasonably well-sampled local populations are better than any other reference Strike a balance between hard filtering for variants of largest potential effect and being inclusive to not miss variants Some genes acquire large effect variants (stop loss / stop gain, etc) frequently. Some genes can be lost without causing disease
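One possible way to balance hard filtering against inclusiveness is to rank variants rather than discard them, so that lower-priority variants stay available for review. The Python sketch below does this with hypothetical effect weights; the weights, the dictionary keys, and the function name `rank_variants` are assumptions for illustration only.

```python
# Hypothetical weights: larger means larger potential effect.
EFFECT_WEIGHT = {
    "stop gain": 3,
    "stop loss": 3,
    "splice site": 3,
    "amino acid change": 2,
    "silent": 0,
}

def rank_variants(variants):
    """Sort by descending effect weight, then ascending population frequency.

    Nothing is removed; unknown effects get a middle weight of 1.
    """
    return sorted(
        variants,
        key=lambda v: (-EFFECT_WEIGHT.get(v["effect"], 1),
                       v.get("frequency", 0.0)),
    )

variants = [
    {"id": "v1", "effect": "silent", "frequency": 0.0},
    {"id": "v2", "effect": "stop gain", "frequency": 0.001},
    {"id": "v3", "effect": "amino acid change", "frequency": 0.002},
    {"id": "v4", "effect": "stop gain", "frequency": 0.0},
]
print([v["id"] for v in rank_variants(variants)])
```

As the last point above notes, a high-effect variant in a gene that tolerates loss may still be harmless, so a ranking like this is a starting point for review rather than a verdict.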
  • 45. Applications to Real Data Charcot-Marie-Tooth Disease and Cutis Laxa
  • 46. IGNITE Data Pipeline and Integration [Figure: pipeline diagram. Inputs: Gene Annotations, Annotated Genomic Variants, Mapped Gene Region(s) Definitions, Known Genes, Pathway and Interactions. Steps: Filter, Sort, Prioritize]
  • 47. Charcot-Marie-Tooth: Genetic Mapping Chromosome 9: 120,962,282 - 133,033,431
  • 48. Cutis Laxa: Genetic Mapping Chromosome 17: 79,596,811- 81,041,077
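Restricting variants to a genetically mapped interval is a simple coordinate test. The Python sketch below uses the two intervals given on slides 47 and 48. It assumes the coordinates are inclusive and that variants are dictionaries with hypothetical `chrom` and `pos` keys.

```python
# Mapped intervals from the slides: (chromosome, start, end).
REGIONS = {
    "Charcot-Marie-Tooth": ("9", 120_962_282, 133_033_431),
    "Cutis Laxa": ("17", 79_596_811, 81_041_077),
}

def in_region(variant, region):
    """Return True if the variant lies inside the region (ends inclusive)."""
    chrom, start, end = region
    return variant["chrom"] == chrom and start <= variant["pos"] <= end

variant = {"chrom": "9", "pos": 125_000_000}
print(in_region(variant, REGIONS["Charcot-Marie-Tooth"]))
```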
  • 49. Charcot-Marie-Tooth: 143 genes in region; 13 known genes in genome: MPZ, PMP22, GDAP1, KIF1B, MFN2, SOX, EGR2, DNM2, RAB7, LITAF (SIMPLE), GARS, YARS, LMNA. Cutis Laxa: 52 genes in region; 5 known genes in genome: ATP6V0A2, ELN, FBLN5, EFEMP2, SCYL1BP1, ALDH18A1
  • 50. Pathway and Interaction Data 37 pathways (primarily LMNA or DNM2): Clathrin-derived vesicle budding, Lysosome vesicle biogenesis, Endocytosis, Golgi-associated vesicle biogenesis, Membrane trafficking, Trans-Golgi network vesicle budding. 10 pathways (primarily ATP6V0A2): Phagosome, Collecting duct acid secretion, Lysosome, Protein digestion and absorption, Metabolic pathways, Oxidative phosphorylation, Arginine and proline metabolism
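The prioritization idea here, favouring genes in the mapped region that interact with known disease genes or share a pathway with them, can be sketched as a set intersection. The Python below is illustrative only: the input structure and the function name `prioritize` are assumptions, `GENE_X` and `GENE_Y` are made-up placeholders, and the toy inputs are not the project's actual data.

```python
def prioritize(candidates, known_genes, known_pathways):
    """Return candidates linked to known genes by interaction or pathway.

    candidates maps gene -> {"interactions": [...], "pathways": [...]}.
    Each result is (gene, shared interaction partners, shared pathways).
    """
    prioritized = []
    for gene, info in candidates.items():
        partners = set(info.get("interactions", ())) & set(known_genes)
        pathways = set(info.get("pathways", ())) & set(known_pathways)
        if partners or pathways:
            prioritized.append((gene, sorted(partners), sorted(pathways)))
    return sorted(prioritized)

# Toy inputs for illustration.
candidates = {
    "DNM1": {"interactions": ["DNM2"]},
    "LRSAM1": {"pathways": ["Endocytosis"]},
    "GENE_X": {"interactions": ["GENE_Y"]},
}
print(prioritize(candidates, {"DNM2", "LMNA"}, {"Endocytosis"}))
```

Candidates with no link to a known gene are dropped here for brevity; a more inclusive version could keep them at the bottom of the list.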
  • 51. Results: Charcot-Marie-Tooth 8 Genes Prioritized
      Gene      Interactions  Pathway
      LRSAM1    Multiple      Endocytosis
      DNM1      DNM2          -
      FNBP1     DNM2          -
      TOR1A     MNA           -
      STXBP1    Multiple      Five
      SH3GLB2   -             Endocytosis
      PIP5KL1   -             Endocytosis
      FAM125B   -             Endocytosis
    For more information: Guernsey et al (2010) PLoS Genetics. 6(8): e1001081
  • 52. Results: Cutis Laxa 10 genes prioritized
      Gene      Interactions  Pathway
      HEXDC     Multiple      Phagosome
      HG5       -             Phagosome
      HG5       Multiple      Lysosome, Protein digestion
      SIRT7     Multiple      Metabolic Pathways
      FASN      -             Metabolic Pathways
      DCXR      -             Metabolic Pathways
      PYCR1     -             Metabolic Pathways, Arginine/Proline
      PCYT2     -             Metabolic Pathways
      ARHGDIA   -             Oxidative Phosphorylation
    For more information: Guernsey et al (2009) Am J Hum Genet. 85(1): 120-9
  • 53. Conclusions
  • 54. Conclusions Bioinformatics is involved at every stage of genomic research from experimental design through to final analysis Standards and best practices do exist, but are rapidly evolving as new technologies and methods are developed Progress towards automatic generation of clinically interpretable genomics studies Annotation, filtering, and prioritization of genetic variants crucial Balance between false positive calls and false negatives
  • 55. Where Are We Headed? Integration of more data sources  Gene expression  More annotation sources  Controlled phenotype vocabularies  Gene Ontology terms  Predictive models  Recessive versus Dominant inheritance and Penetrance “New” and Emerging Technologies  RNA-Seq (Gene Expression)  ChIP-Seq (Protein-DNA binding)  Single-Molecule Sequencing
  • 56. Acknowledgements Dalhousie University: Dr. Karen Bedard, Dr. Chris McMaster, Dr. Andrew Orr, Dr. Conrad Fernandez, Dr. Marissa Leblanc, Dr. Sarah Dyack, Mat Nightingale, Dr. Johane Robataille. McGill/Genome Quebec: Dr. Jacek Majewski, Jeremy Schwartzentruber. Bedard Lab. Genome Atlantic. IGNITE