Bioinformatics in Gene Research

Intended for a mixed, general audience of clinicians, business interests, and research scientists. No audio; however, the event was recorded and posted to YouTube by Genome Atlantic: http://www.youtube.com/watch?v=FLVjwOngu-Q

Published in: Health & Medicine

Bioinformatics in Gene Research

  1. 1. Bioinformatics in Genetics Research
     Genetics Noon Symposium Series
     Daniel Gaston, PhD
     Dr. Karen Bedard Lab, Department of Pathology
     November 21st, 2012
  2. 2. IGNITE Orphan Diseases: Identifying Genes and Novel Therapeutics to Enhance Treatment Identify causative genetic variations in orphan diseases with an emphasis on Atlantic Canada Develop animal and cell culture models Identify and develop novel therapeutics igniteproject.ca
  3. 3. IGNITE Orphan Diseases: Identifying Genes and Novel Therapeutics to Enhance Treatment Identify causative genetic variations in orphan diseases with an emphasis on Atlantic Canada Develop animal and cell culture models Identify and develop novel therapeutics igniteproject.ca
  4. 4. Outline
     • Introduction
        • Bioinformatics in Disease Genomics
        • Next-Generation Sequencing
     • Genomics in Research and the Clinic
     • The Data Deluge and its Solutions
        • Bioinformatic Methods for Analyzing Genomic Data
     • Case Studies
     • Conclusion
  5. 5. Bioinformatics in Disease Genomics
     • Handling and long-term storage of raw data (sequencing, gene expression, etc.)
     • Maintenance and support of computational infrastructure
     • Experimental design
     • Data analysis
     • Methods development
        • Analysis pipelines
        • Statistical analyses
        • Algorithm design
  7. 7. ‘Next-Generation’ Sequencing and Disease Genomics
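  13. 13. Disease Genomics: Hunting Down Pathogenic Genetic Variation
      [Figure: gene model comparing a reference sequence with a patient sequence]
      • Reference: Start, Exon 1, Intron 1, Exon 2, with splice sites marked; the mRNA coding for protein ends at the stop codon TAA
      • Patient: Exon 1, Intron 1, Exon 2; the stop codon TAA is changed to TAC (Tyr)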
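
The TAA to TAC change in the figure can be checked with a few lines of Python. This is a minimal sketch, not a full variant-effect predictor: the codon table is deliberately truncated to the codons used here, and the function name `classify_codon_change` is a hypothetical helper introduced for illustration.

```python
# Truncated codon table: only the codons needed for this illustration.
CODON_TABLE = {
    "TAA": "Stop", "TAG": "Stop", "TGA": "Stop",
    "TAC": "Tyr", "TAT": "Tyr",
    "CAC": "His",
}


def classify_codon_change(ref_codon: str, alt_codon: str) -> str:
    """Classify a single-codon substitution by its effect on the protein."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "silent"
    if ref_aa == "Stop":
        return "stop loss"      # translation would continue past the normal end
    if alt_aa == "Stop":
        return "stop gain"      # translation would end prematurely
    return "amino acid change"


# The change shown on the slide: reference TAA (Stop), patient TAC (Tyr).
effect = classify_codon_change("TAA", "TAC")
print(effect)
```

A production tool would use the complete genetic code and the transcript's reading frame; the point here is only that a single base change can convert a stop codon into an amino acid.
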
  15. 15. Disease Genomics: Research vs Clinic
      • Still predominantly research oriented
         • Complex/Common disease
         • Mendelian disorders
         • Cancer genomics
      • Clinical genomics starting to gain traction
         • Cancer genomics
         • Cancer subtype identification
         • Personalized medicine and predicting outcomes
         • Mendelian disorders
         • Early diagnosis
         • Cost effectiveness
  16. 16. Clinical Genomics
      • Children’s Mercy Hospital NICU
         • In the US, >20% of infant deaths are due to genetic disease
         • Serial sequencing of candidate genes is too slow
  18. 18. Children’s Mercy Hospital NICU
      • 50-hour differential diagnosis of monogenic disease
         • Sample preparation and sequencing: 30.5 hours
         • Automated bioinformatics analysis: 17.5 hours
         • Previous high-throughput sequencing methods: 19 days
         • Tested on seven infants: two previously diagnosed using standard methods, five undiagnosed
      • Caveats
         • Bioinformatics portion not available outside of the hospital
         • Requires thorough clinical phenotyping using a controlled vocabulary
         • Generates a large amount of data
  19. 19. The Data Deluge
      • 4 million genetic variants
      • 2 million associated with protein-coding genes
      • 10,000 possibly of disease-causing type
      • 1,500 with <1% frequency in the population
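
To make the size of each reduction concrete, the four counts on this slide can be turned into retention fractions. This is a small arithmetic sketch; the counts come from the slide, while the `retention` helper and its output layout are illustrative choices.

```python
# Counts taken from the slide, in order of increasing stringency.
funnel = [
    ("all genetic variants", 4_000_000),
    ("associated with protein-coding genes", 2_000_000),
    ("possibly of disease-causing type", 10_000),
    ("<1% frequency in population", 1_500),
]


def retention(steps):
    """Return (label, count, fraction of previous step, fraction of total)."""
    total = steps[0][1]
    previous = total
    rows = []
    for label, count in steps:
        rows.append((label, count, count / previous, count / total))
        previous = count
    return rows


rows = retention(funnel)
for label, count, kept, overall in rows:
    print(f"{count:>9,}  {kept:8.3%} of previous  {overall:8.4%} of all  {label}")
```

The final step leaves 1,500 of 4 million variants, so even after three filters there are still far too many candidates to follow up by hand.
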
  20. 20. Surviving the Data Deluge
      Reducing the Search Space: Exome Sequencing
  21. 21. Exome Sequencing
      • Exome: portion of the genome composed of protein-coding exons and functional RNA sequences
      • 1.5-2% of the human genome (50 Mb)
      • >85% of monogenic diseases are due to variants in the exome
      • Complete exome sequencing: ~$1000/sample
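
The "1.5-2%" figure can be sanity-checked against the 50 Mb exome size. The genome size below is an assumption not stated on the slide: roughly 3,100 Mb for the haploid human genome.

```python
EXOME_MB = 50        # from the slide
GENOME_MB = 3_100    # assumption: approximate haploid human genome size in Mb

exome_fraction = EXOME_MB / GENOME_MB
print(f"Exome is about {exome_fraction:.1%} of the genome")
```

Under that assumption the exome comes out at about 1.6% of the genome, inside the range quoted on the slide.
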
  22. 22. Caveats
      • Incomplete and non-uniform coverage of exome
         • Systematic bias (GC content)
         • Random sampling
      • Not all genetic variants amenable to discovery
         • Non-coding variants
         • Structural variants
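
GC content, named above as a source of systematic coverage bias, is simple to compute per target. The sketch below flags targets with extreme GC content; the 35% and 65% cut-offs, the target names, and the sequences are all illustrative assumptions, not values from the talk.

```python
def gc_content(sequence: str) -> float:
    """Fraction of G and C bases among the A/C/G/T bases of a sequence."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return gc / (gc + at)


def flag_extreme_gc(targets, low=0.35, high=0.65):
    """Return names of targets whose GC content lies outside [low, high]."""
    return [name for name, seq in targets.items()
            if not low <= gc_content(seq) <= high]


# Toy sequences, chosen only to exercise the three cases.
targets = {
    "exon_a": "ATGCATGCAT",   # 4 of 10 bases are G or C
    "exon_b": "GGCGCCGGCG",   # all bases are G or C
    "exon_c": "ATATTTAAAT",   # no G or C bases
}
print(flag_extreme_gc(targets))
```

Targets flagged this way are the ones where low coverage is more likely to reflect capture or sequencing bias than a real absence of variants.
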
  23. 23. Surviving the Data Deluge
      Bioinformatics
  24. 24. Typical Bioinformatics Workflow
      QC of Raw Data → Map to Reference → QC → Find Variants → QC → Annotate → Filter
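
The workflow above is essentially an ordered list of stages with QC checkpoints between them. The sketch below models only that ordering: every stage is a placeholder that records that it ran, and names such as `make_stage` and `run_workflow` are hypothetical. In a real pipeline each placeholder would wrap an actual QC tool, aligner, variant caller, or annotator.

```python
from typing import Callable, Dict, List, Tuple

Stage = Tuple[str, Callable[[Dict], Dict]]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records that it ran."""
    def run(state: Dict) -> Dict:
        return {**state, "log": state["log"] + [name]}
    return name, run


# Stage order as listed on the slide.
WORKFLOW: List[Stage] = [make_stage(name) for name in [
    "QC of raw data", "Map to reference", "QC",
    "Find variants", "QC", "Annotate", "Filter",
]]


def run_workflow(sample_id: str, stages: List[Stage]) -> Dict:
    """Run every stage in order, passing the state from one to the next."""
    state: Dict = {"sample": sample_id, "log": []}
    for _name, stage in stages:
        state = stage(state)
    return state


result = run_workflow("sample_01", WORKFLOW)
print(" -> ".join(result["log"]))
```

Keeping the stages in a plain list makes it easy to swap one program for another at a given step without touching the rest of the workflow.
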
  29. 29. It Sounds Simple, but…
      • For every stage, multiple programs are available and published in the literature
      • Every program has a wide variety of parameter values and options. Defaults are often “good enough”, but not always
      • The best combinations of programs and options are not well understood
      • Protocols change rapidly as new technologies and methods are developed
      • Different centres and groups use slightly different workflows, with similar but not identical results
  31. 31. Annotating Variants
  32. 32. If a problem cannot be solved, enlarge it. --Dwight D. Eisenhower
  33. 33. Annotations Associated with Genomic Variants
      • Is the variant in a known protein-coding gene?
         • What does the gene do?
         • What molecular pathways?
         • What protein-protein interactions?
         • What tissues is it expressed in?
         • When in development?
      • Has this variant been seen before?
         • What population(s)? With what frequency?
         • Has it been seen in local sequencing projects?
         • Is there any known clinical significance?
      • What is the effect of the variation?
         • Does it change the resulting protein? How?
      [Figure, repeated from slide 19: 4 million genetic variants; 2 million associated with protein-coding genes; 10,000 possibly of disease-causing type; 1,500 with <1% frequency in population]
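
One way to hold the answers to these questions is a small record per variant. The sketch below is an assumed layout, not the schema used in the talk: all class and field names are hypothetical, and the example coordinates and gene symbol are placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class GeneAnnotation:
    """Gene-level answers: function, pathways, interactions, expression."""
    symbol: str
    function: str = ""
    pathways: List[str] = field(default_factory=list)
    interactions: List[str] = field(default_factory=list)
    tissues: List[str] = field(default_factory=list)
    developmental_stages: List[str] = field(default_factory=list)


@dataclass
class VariantAnnotation:
    """Variant-level answers: location, prior sightings, predicted effect."""
    chrom: str
    pos: int
    ref: str
    alt: str
    gene: Optional[GeneAnnotation] = None
    population_frequencies: Dict[str, float] = field(default_factory=dict)
    local_observations: int = 0
    clinical_significance: Optional[str] = None
    protein_effect: Optional[str] = None

    def in_known_gene(self) -> bool:
        return self.gene is not None

    def seen_before(self) -> bool:
        return bool(self.population_frequencies) or self.local_observations > 0


# Placeholder variant: coordinates and gene symbol are made up.
example = VariantAnnotation(
    chrom="9", pos=125_000_000, ref="A", alt="C",
    gene=GeneAnnotation(symbol="GENE_X", pathways=["Endocytosis"]),
    population_frequencies={"1000 Genomes": 0.002},
    protein_effect="stop loss",
)
print(example.in_known_gene(), example.seen_before())
```

Keeping gene-level and variant-level annotations in separate records mirrors the split between the gene and variant annotation resources listed on the following slides.
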
  34. 34. Gene Annotation Resources
  35. 35. Variant Annotation Resources
  36. 36. Potential Pitfalls with Annotation Sources
      • Databases often overlap and agree, but there may be disagreements
      • Source of information: predicted versus experimental
      • Incorrect and out-of-date information
      • Large-scale unvalidated versus manually curated datasets
  37. 37. Bioinformatics Analyses of Genomic Variants
      Combining Data Sources and Filtering
  38. 38. IGNITE Data Pipeline and Integration
      [Figure: data-integration diagram. Labels, in extraction order: Gene Annotations; Annotated Genomic Variants; Mapped Gene Region(s) Definitions; Filter; Sort; Prioritize; Known Genes; Pathway and Interactions]
  39. 39. Filtering the Data: Categorization
      [Figure: decision tree categorizing 4 million variants]
      • First split: Intronic, Exonic, Intergenic
      • Intermediate categories: Unknown, Splice Site, Silent Mutation, Amino Acid Changing, Potential Disease Causing
      • Final categories: Known Genetic Disease Variant, Stop Loss / Stop Gain, Amino Acid Change Likely Pathogenic, Amino Acid Change Likely Benign, Known Polymorphism in Population
  40. 40. Filtering the Data: Common or Rare?
      • Variants in dbSNP: typically known polymorphisms, unlikely to be associated with rare disease
      • Variants with relatively high frequency in control populations (1000 Genomes, HapMap, EVS, 2800 Exomes)
      • Number of times the variant was previously seen at the sequencing centre/locally
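
The three criteria above translate directly into a filter function. This is a minimal sketch under stated assumptions: the dictionary keys, the variant IDs, and the `max_local_count` default are invented for illustration, and the 1% frequency cut-off is borrowed from the "<1% frequency in population" figure on the Data Deluge slide.

```python
def is_candidate(variant, max_frequency=0.01, max_local_count=1, exclude_dbsnp=True):
    """Return True if the variant is rare enough to keep for a rare-disease search."""
    # Criterion 1: presence in dbSNP (optional, since "typically" is not "always").
    if exclude_dbsnp and variant.get("in_dbsnp", False):
        return False
    # Criterion 2: frequency in any control population at or above the cut-off.
    frequencies = variant.get("control_frequencies", {})
    if any(freq >= max_frequency for freq in frequencies.values()):
        return False
    # Criterion 3: how often the variant has already been seen locally.
    return variant.get("local_count", 0) <= max_local_count


# Toy variants; none of these values come from real data.
variants = [
    {"id": "var_1", "in_dbsnp": True, "control_frequencies": {"1000 Genomes": 0.12}, "local_count": 40},
    {"id": "var_2", "in_dbsnp": False, "control_frequencies": {"EVS": 0.03}, "local_count": 0},
    {"id": "var_3", "in_dbsnp": False, "control_frequencies": {"1000 Genomes": 0.001}, "local_count": 9},
    {"id": "var_4", "in_dbsnp": False, "control_frequencies": {}, "local_count": 1},
]
kept = [v["id"] for v in variants if is_candidate(v)]
print(kept)
```

The `exclude_dbsnp` switch is there because a blanket dbSNP exclusion is a blunt instrument; turning it off leaves the frequency and local-count checks to do the work.
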
  44. 44. Notes on Filtering and Variant Annotation
      • It is very important to be aware of the population when referencing the frequency of a variant. An incorrect background leads to incorrect assumptions about prevalence
      • Reasonably well-sampled local populations are better than any other reference
      • Strike a balance between hard filtering for variants of largest potential effect and being inclusive enough not to miss variants
      • Some genes acquire large-effect variants (stop loss / stop gain, etc.) frequently. Some genes can be lost without causing disease
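
The first point can be shown with a short calculation: the same variant can look rare against one background and common against another. The counts below are hypothetical, and the formula assumes an autosomal variant in diploid individuals.

```python
def allele_frequency(alt_alleles: int, individuals: int) -> float:
    """Alternate-allele frequency, assuming two alleles per sampled individual."""
    if individuals <= 0:
        raise ValueError("need at least one sampled individual")
    return alt_alleles / (2 * individuals)


RARE_THRESHOLD = 0.01   # the <1% cut-off used earlier in the talk

# Hypothetical counts for one variant: (alternate alleles seen, individuals sampled).
observations = {
    "broad reference panel": (5, 5000),
    "local population": (12, 200),
}

calls = {}
for name, (alt_alleles, individuals) in observations.items():
    freq = allele_frequency(alt_alleles, individuals)
    calls[name] = (freq, "rare" if freq < RARE_THRESHOLD else "common")
    print(f"{name}: {freq:.4f} ({calls[name][1]})")
```

With these made-up numbers the variant passes a rarity filter against the broad panel but fails it against the local population, which is the kind of mismatch the slide warns about.
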
  45. 45. Applications to Real Data
      Charcot-Marie-Tooth Disease and Cutis Laxa
  47. 47. Charcot-Marie-Tooth: Genetic Mapping Chromosome 9: 120,962,282 - 133,033,431
  48. 48. Cutis Laxa: Genetic Mapping Chromosome 17: 79,596,811 - 81,041,077
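
The two mapped intervals above can be used directly to restrict variants to the region of interest. In this sketch the interval coordinates come from the slides; treating both ends as inclusive, accepting a `chr` prefix, and the test positions are assumptions made for illustration.

```python
# Mapped intervals from the two genetic-mapping slides: (chromosome, start, end).
REGIONS = {
    "Charcot-Marie-Tooth": ("9", 120_962_282, 133_033_431),
    "Cutis Laxa": ("17", 79_596_811, 81_041_077),
}


def normalize_chrom(chrom: str) -> str:
    """Strip an optional 'chr' prefix so 'chr9' and '9' compare equal."""
    return chrom[3:] if chrom.lower().startswith("chr") else chrom


def in_region(chrom: str, pos: int, region) -> bool:
    """True if the position falls inside the region (ends treated as inclusive)."""
    region_chrom, start, end = region
    return normalize_chrom(chrom) == region_chrom and start <= pos <= end


def region_length(region) -> int:
    """Number of bases in the region, counting both end positions."""
    _chrom, start, end = region
    return end - start + 1


for disease, region in REGIONS.items():
    print(f"{disease}: {region_length(region):,} bp")
```

The Charcot-Marie-Tooth interval spans roughly 12 Mb and the Cutis Laxa interval roughly 1.4 Mb, so the region filter alone removes the vast majority of genome-wide variants before any annotation is considered.
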
  49. 49. Charcot-Marie-Tooth and Cutis Laxa
      • Charcot-Marie-Tooth
         • 143 genes in region
         • 13 known genes in genome: MPZ, PMP22, GDAP1, KIF1B, MFN2, SOX, EGR2, DNM2, RAB7, LITAF (SIMPLE), GARS, YARS, LMNA
      • Cutis Laxa
         • 52 genes in region
         • 5 known genes in genome: ATP6V0A2, ELN, FBLN5, EFEMP2, SCYL1BP1, ALDH18A1
  50. 50. Pathway and Interaction Data
      • Charcot-Marie-Tooth: 37 pathways
         • Clathrin-derived vesicle budding
         • Lysosome vesicle biogenesis
         • Endocytosis
         • Golgi-associated vesicle biogenesis
         • Membrane trafficking
         • Trans-Golgi network vesicle budding
         • Primarily LMNA or DNM2
      • Cutis Laxa: 10 pathways
         • Phagosome
         • Collecting duct acid secretion
         • Lysosome
         • Protein digestion and absorption
         • Metabolic pathways
         • Oxidative phosphorylation
         • Arginine and proline metabolism
         • Primarily ATP6V0A2
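
One plausible way to use pathway and interaction data is to rank the genes in a mapped region by how strongly they are linked to genes already known to cause the disease. The sketch below is an assumed scoring scheme, not the method used by the IGNITE pipeline: it gives one point per shared pathway and one point per direct interaction with a known gene, and the toy annotations are illustrative only.

```python
def prioritize(candidates, known_genes, pathways, interactions):
    """Rank candidates by shared pathways and interactions with known disease genes."""
    known = set(known_genes)
    known_pathways = set()
    for gene in known:
        known_pathways |= pathways.get(gene, set())

    ranked = []
    for gene in candidates:
        shared = pathways.get(gene, set()) & known_pathways
        partners = interactions.get(gene, set()) & known
        score = len(shared) + len(partners)   # assumption: equal weight per link
        if score > 0:
            ranked.append((gene, score, sorted(partners), sorted(shared)))
    # Highest score first; ties broken alphabetically for a stable order.
    ranked.sort(key=lambda row: (-row[1], row[0]))
    return ranked


# Toy annotations for illustration; GENE_X is a made-up symbol.
known_genes = ["DNM2", "LMNA"]
pathways = {
    "DNM2": {"Endocytosis"},
    "LRSAM1": {"Endocytosis"},
    "GENE_X": {"Unrelated pathway"},
}
interactions = {"DNM1": {"DNM2"}}

ranked = prioritize(["LRSAM1", "DNM1", "GENE_X"], known_genes, pathways, interactions)
for gene, score, partners, shared in ranked:
    print(gene, score, partners, shared)
```

Genes with no link to a known disease gene drop out entirely, which shortens the list but also means a truly novel mechanism would be missed; that is the balance between hard filtering and inclusiveness noted earlier.
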
  51. 51. Results: Charcot-Marie-Tooth
      8 genes prioritized

      Gene      Interactions  Pathway
      LRSAM1    Multiple      Endocytosis
      DNM1      DNM2          -
      FNBP1     DNM2          -
      TOR1A     MNA           -
      STXBP1    Multiple      Five
      SH3GLB2   -             Endocytosis
      PIP5KL1   -             Endocytosis
      FAM125B   -             Endocytosis

      For more information: Guernsey et al (2010) PLoS Genetics. 6(8): e1001081
  52. 52. Results: Cutis Laxa
      10 genes prioritized

      Gene      Interactions  Pathway
      HEXDC     Multiple      Phagosome
      HG5       -             Phagosome
      HG5       Multiple      Lysosome, Protein digestion
      SIRT7     Multiple      Metabolic Pathways
      FASN      -             Metabolic Pathways
      DCXR      -             Metabolic Pathways
      PYCR1     -             Metabolic Pathways, Arginine/Proline
      PCYT2     -             Metabolic Pathways
      ARHGDIA   -             Oxidative Phosphorylation

      For more information: Guernsey et al (2009) Am J Hum Genet. 85(1): 120-9
  53. 53. Conclusions
  54. 54. Conclusions
      • Bioinformatics is involved at every stage of genomic research, from experimental design through to final analysis
      • Standards and best practices do exist, but are rapidly evolving as new technologies and methods are developed
      • Progress towards automatic generation of clinically interpretable genomics studies
      • Annotation, filtering, and prioritization of genetic variants are crucial
      • Balance between false positive calls and false negatives
  55. 55. Where Are We Headed?
      • Integration of more data sources
         • Gene expression
         • More annotation sources
         • Controlled phenotype vocabularies
         • Gene Ontology terms
         • Predictive models
         • Recessive versus Dominant inheritance and Penetrance
      • “New” and Emerging Technologies
         • RNA-Seq (Gene Expression)
         • ChIP-Seq (Protein-DNA binding)
         • Single-Molecule Sequencing
  56. 56. Acknowledgements
      • Dalhousie University
         • Dr. Karen Bedard
         • Dr. Chris McMaster
         • Dr. Andrew Orr
         • Dr. Conrad Fernandez
         • Dr. Marissa Leblanc
         • Dr. Sarah Dyack
         • Mat Nightingale
         • Dr. Johane Robataille
         • Bedard Lab
         • Genome Atlantic
         • IGNITE
      • McGill/Genome Quebec
         • Dr. Jacek Majewski
         • Jeremy Schwartzentruber