A Systematic Approach to the Large-Scale Analysis of Genotype-Phenotype Correlations. Paul Fisher, Dr. Robert Stevens, Prof. Andrew Brass
Genotype: the entire genetic identity of an individual that does not show any outward characteristics, e.g. genes, mutations. DNA: ACTGCACTGACTGTACGTATATCT ACTGCACTG TG TGTACGTATATCT (figure labels: Mutations, Genes)
Phenotype (harder to characterise): the observable expression of genes producing notable characteristics in an individual, e.g. hair or eye colour, body mass, resistance to disease. Example: Brown vs. White and Brown
Genotype  to  Phenotype
Current Methods: Genotype, ?, Phenotype. 200: what processes to investigate?
Microarray + QTL: genes captured in the microarray experiment and present in the QTL (Quantitative Trait Loci) region. Phenotypic response investigated using microarray, in the form of expressed genes, or evidence provided through QTL mapping. (Diagram labels: ?, 200, Genotype, Metabolic pathways, Phenotype)
Diagram. Genotype: CHR, QTL, Gene A, Gene B, Gene C. Pathways: Pathway A, Pathway B, Pathway C. Phenotype: linked to pathways through literature. Priorities: pathway linked to phenotype, high priority; pathway not linked to phenotype, medium priority; pathway not linked to QTL, low priority
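The three priority levels on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Pathway` class, its two boolean fields, and the `priority` and `rank` functions are hypothetical names introduced here, not part of the workflows described in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pathway:
    """Hypothetical record for one candidate pathway."""
    name: str
    has_qtl_gene: bool         # at least one pathway gene lies in the QTL region
    linked_to_phenotype: bool  # literature links the pathway to the phenotype


def priority(pathway: Pathway) -> str:
    """Return the priority level described on the slide."""
    if not pathway.has_qtl_gene:
        return "low"       # pathway not linked to QTL
    if pathway.linked_to_phenotype:
        return "high"      # pathway linked to phenotype
    return "medium"        # in the QTL, but no known phenotype link


def rank(pathways):
    """Sort pathways so that high-priority candidates come first."""
    order = {"high": 0, "medium": 1, "low": 2}
    return sorted(pathways, key=lambda p: order[priority(p)])
```

Ranking this way keeps every pathway in view instead of discarding the low-priority ones, which matches the talk's concern about premature filtering.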
Issues with current approaches
Huge amounts of data: QTL region on chromosome, 200+ genes; microarray, 1000+ genes. How do I look at ALL the genes systematically?
Hypothesis-Driven Analyses. Case: African sleeping sickness (parasitic infection, known immune response). Start with 200 QTL genes; pick the genes involved in immunological processes: 40 QTL genes; pick the genes that I am most familiar with: 2 QTL genes. Biased view. Result for African sleeping sickness: immune response, cholesterol control, cell death
Manual Methods of data analysis: navigating through hyperlinks; no explicit methods; human error; tedious and repetitive
Implicit methods
Issues with current approaches: scale of analysis task; user bias and premature filtering; hypothesis-driven approach to data analysis; constant flux of data (problems with re-analysis of data); implicit methodologies (hyper-linking through web pages); error proliferation from any of the listed issues. Solution: automate through workflows
The Two W’s. Web Services: technology and standard for exposing code or a database by a means that can be consumed remotely by a third party; describes how to interact with it. Workflows: general technique for describing and executing a process; describes what you want to do
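As a rough illustration of "describing and executing a process", a workflow can be thought of as an explicit, ordered list of named steps. The sketch below is plain Python and is not Taverna's API; `run_workflow`, the step names, and the placeholder gene identifiers are all assumptions made for this example.

```python
def run_workflow(steps, data):
    """Apply each named step in order and record which steps ran."""
    log = []
    for name, step in steps:
        data = step(data)
        log.append(name)
    return data, log


# Hypothetical steps operating on a list of gene identifiers.
steps = [
    ("strip whitespace", lambda ids: [i.strip() for i in ids]),
    ("drop empty entries", lambda ids: [i for i in ids if i]),
    ("remove duplicates", lambda ids: sorted(set(ids))),
]

# Placeholder identifiers, not real data.
result, log = run_workflow(steps, [" GeneA", "GeneA ", "", "GeneB"])
```

Because every step is named and listed, the method is explicit and can be rerun unchanged when the underlying data change.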
Taverna Workflow Workbench http://taverna.sf.net
Hypothesis: utilising the capabilities of workflows and the pathway-driven approach, we are able to provide a more systematic, efficient, scalable, unbiased and unambiguous method of analysis. The benefit will be that new biology results will be derived, increasing community knowledge of genotype and phenotype interactions.
Workflow diagram. Inputs: Genomic Resource, Pathway Resource. QTL mapping study: identify genes in QTL regions; annotate genes with biological pathways. Microarray gene expression study: identify differentially expressed genes; annotate genes with biological pathways. Then: select common biological pathways; hypothesis generation and verification. Other labels: Statistical analysis, Literature, Wet Lab
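The "select common biological pathways" step in this diagram can be sketched as a set intersection. The function names and the shape of the gene-to-pathway mapping below are assumptions for illustration; in the talk this annotation is done by the workflow against a pathway resource.

```python
def pathways_for(genes, gene_to_pathways):
    """Map each pathway to the set of input genes annotated to it."""
    found = {}
    for gene in genes:
        for pathway in gene_to_pathways.get(gene, ()):
            found.setdefault(pathway, set()).add(gene)
    return found


def common_pathways(qtl_genes, expressed_genes, gene_to_pathways):
    """Keep pathways containing both a QTL gene and a differentially expressed gene."""
    qtl = pathways_for(qtl_genes, gene_to_pathways)
    expressed = pathways_for(expressed_genes, gene_to_pathways)
    shared = sorted(qtl.keys() & expressed.keys())
    return {
        pathway: {
            "qtl_genes": sorted(qtl[pathway]),
            "expressed_genes": sorted(expressed[pathway]),
        }
        for pathway in shared
    }
```

Returning the supporting genes alongside each pathway keeps the evidence trail visible for the later hypothesis generation and verification step.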
Replicated original chain of data analysis
Trypanosomiasis in Africa: http://www.genomics.liv.ac.uk/tryps/trypsindex.html Andy Brass, Steve Kemp + many others
Preliminary Results: Trypanosomiasis resistance. A strong candidate gene was found: Daxx. The Daxx gene was not found using manual investigation methods. The gene was identified from analysis of biological pathway information. Possible candidate identified by Yan et al. (2004): Daxx SNP info. Sequencing of the Daxx gene in the wet lab showed mutations that are thought to change the structure of the protein. The mutation was published in the scientific literature, noting its effect on the binding of Daxx protein to p53 protein; p53 plays a direct role in cell death and apoptosis, one of the Trypanosomiasis phenotypes. More genes to follow (hopefully) in publications being written
Shameless Plug! A Systematic Strategy for Large-Scale Analysis of Genotype-Phenotype Correlations: Identification of candidate genes involved in African Trypanosomiasis. Fisher et al. (2007) Nucleic Acids Research doi:10.1093/nar/gkm623. Explicitly discusses the methods we used for the Trypanosomiasis use case. Discusses the results for Daxx and shows the mutation. Sharing of workflows for re-use, re-purposing
Recycling, Reuse, Repurposing. Here’s the Science! Identified a candidate gene (Daxx) for Trypanosomiasis resistance. Manual analysis of the microarray and QTL data failed to identify this gene as a candidate. Unbiased analysis. Confirmed by the wet lab. Here’s the e-Science! Trypanosomiasis mouse workflow reused without change in Trichuris muris infection in mice. Identified biological pathways involved in sex dependence. A previous manual two-year study of candidate genes had failed to do this. Workflows now being run over Colitis / Inflammatory Bowel Disease in mice (without change)
Recycling, Reuse, Repurposing: http://www.myexperiment.org/ Share, Search, Re-use, Re-purpose, Execute, Communicate, Record
What next? More use cases?? Can be done, but not for my project. Text mining!!! Aid biologists in identifying novel links between pathways; link pathways to phenotype through literature
Workflow diagram. Inputs: Genomic Resource, Pathway Resource. QTL mapping study: identify genes in QTL regions; annotate genes with biological pathways. Microarray gene expression study: identify differentially expressed genes; annotate genes with biological pathways. Then: select common biological pathways; hypothesis generation and verification. Other labels: Statistical analysis, Literature, Wet Lab
Diagram. Genotype: CHR, QTL, Gene A, Gene B, Gene C. Pathways: Pathway A, Pathway B, Pathway C. Phenotype: linked to pathways through literature. Priorities: pathway linked to phenotype, high priority; pathway not linked to phenotype, medium priority; pathway not linked to QTL, low priority. DONE MANUALLY
It can’t be that hard, right? PubMed contains ~17,787,763 citations to date. Manually searching is tedious and frustrating. It can be hard to find the links. Computers can help with data gathering and information extraction: that’s their job!!!
Text Mining: a means of assisting the researcher with time, effort, narrow searches, hypothesis generation and verification, and suggested links. Limited corpus, but it’s specific. NOT A REPLACEMENT FOR DOMAIN EXPERTISE
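One very simple way to produce "suggested links" is to count how many abstracts mention both a pathway term and the phenotype term. The sketch below assumes abstracts are already available as plain strings and uses naive case-insensitive substring matching; the function name and matching strategy are illustrative assumptions, not the text-mining method used in the project.

```python
def cooccurrence_counts(abstracts, pathway_terms, phenotype_term):
    """Count abstracts that mention the phenotype together with each pathway term."""
    phenotype = phenotype_term.lower()
    counts = {term: 0 for term in pathway_terms}
    for text in abstracts:
        lowered = text.lower()
        if phenotype not in lowered:
            continue  # abstract does not mention the phenotype at all
        for term in pathway_terms:
            if term.lower() in lowered:
                counts[term] += 1
    return counts
```

A count like this only suggests where to look; deciding whether a link is biologically meaningful still needs domain expertise, as the slide says.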
To Sum Up: need for genotype-phenotype correlations with respect to disease control. High-throughput data can provide links between genotype and phenotype. Highlighted issues with manually conducted in silico experiments. Improved the methods of current microarray and QTL based investigations by making them systematic. Increased reproducibility of our methods: workflows stored in an XML based schema; explicit declaration of services, parameters, and methods of data analysis. Shown that workflows are capable of deriving new biologically significant results: African Trypanosomiasis in the mouse; infection of mice with Trichuris muris. The workflows require expansion to accommodate new analysis techniques: text mining
Many thanks to: Joanne Pennock, EPSRC, OMII, myGrid, and lots more people

Editor's Notes

  • #2 Title slide A Systematic approach to large-scale analysis Genotype-Phenotype correlations