Using Ontologies to
accelerate candidate gene
      identification
            Simon Twigger, Ph.D.

 AMIA Summit on Transl...
http://rgd.mcw.edu
Meet the client
Hypertension
Hypertensive




         Hypertension
QTL
Hypertensive




         Hypertension
QTL
Hypertensive

                        G   G   G




         Hypertension
Rat researchers ask...
      Has anyone done any expression
      studies using congenic rats?
          What tissue is th...
Biological Data Warehouse
Really important piece of data...
NCBI GEO db
Data hidden in plain sight
NCBO Annotator




http://www.bioontology.org/wiki/index.php/Annotator_Web_service
Parallel Annotation Workflow
 GEO Records


                Create Annotation
                Jobs & Queue Up

            ...
Current Ontologies




http://bioportal.bioontology.org/
gminer.mcw.edu
Using the ontology structure
Curation of results




NCBO Ontology Widgets
http://www.bioontology.org/wiki/index.php/Ontology_Widgets
Explore Cardio data
Find Congenic data
Browse by annotation
SHRSP overview
Combine results
Linking annotations to data




    Tm2d1
RGD1306410
       Svs4
       Hbb
    Scgb2a1
       Alb
Linking annotations to data
     Tm2d1
RGD1306410
       Svs4
       Hbb
    Scgb2a1
                          +
        A...
Probeset results on GMiner
Probeset 1395269_s_at for Gabrd - gamma-aminobutyric
             acid (GABA) A receptor, delta
QTL
Hypertensive

                        G   G   G




         Hypertension
QTL
Hypertensive

                                  G   G   G



                        Pathway




         Hypertension
Ontology Advantages
•   Unstructured to Structured (using OBA service)
•   Structured (Faceted) browsing of data
•   Encou...
Ontology Hurdles
•   Managing ontology/vocabulary terms and structure
•   Time to encode data using ontology vs free text
...
Acknowledgements
•   Joey Geiger - Development of GMiner

•   Jennifer Smith - Video creation, data curation

•   Rajni Ni...
Links
•   http://gminer.mcw.edu               Web application

•   http://github.com/mcwbbc/gminer     Gminer Code

•   ht...
Using Ontologies to accelerate candidate gene identification

Copy of my slides from the AMIA Summit on Translational Medicine, 2010. This outlines our work with the National Center for Biomedical Ontology where we are using their tools to index biological data repositories and then enable the use of these annotations for further discoveries.

  • The Rat Genome Database is one of the main projects we have at MCW. It is the model organism database for the laboratory rat, Rattus norvegicus. We curate genes, strains, QTL, etc., and make extensive use of ontologies such as GO, pathway, rat strain, disease, and phenotype.

  • This is a typical use case for rat genomics: how do you identify the causes of hypertension in a hypertensive rat? Quite often a QTL is identified, indicating a region of the chromosome that is statistically associated with the trait in question. How do you go from the genes in that region to the cause of the disease? Not an easy task - ‘then a miracle happens’.
  • Rat biologists ask many questions related to gene expression and diseases; these are some typical examples.
    Many of these questions are in areas covered by ontologies and would benefit from the additional searching flexibility that ontologies provide.
  • Technical problem - lots of data is being stored, and it is hard to find it again.
    Government Warehouse image. Data is archived with good intentions, but in doing so it often becomes hard to find again...
    If you can't find the data, it's not really much use.
  • NCBI’s Gene Expression Omnibus has a lot of relevant data, either as text or raw data.
  • Can we start to capture some of this information in an informatically tractable fashion, using ontologies and the OBA tools at the National Center for Biomedical Ontology in an annotation pipeline? The red boxes highlight some concepts of interest - the rat strains and tissues being used in this experiment. A human can read these and know what's going on, but what about a computer?
  • Driving biological project - use NCBO Annotator web services to mark up the text in the GEO records using ontologies

  • Take sections of text from GEO records, create annotation jobs, place in queue
    Workers take the jobs off the queue, index for appropriate ontologies at NCBO
    Results are placed on Input queue for saving back to the database.
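
The queue-and-worker flow described in this note can be sketched as follows. This is a minimal illustration only, assuming Python's in-process `queue.Queue` in place of a real message broker and a hypothetical `annotate` stub in place of the NCBO Annotator call; the record ids, term table, and function names are invented for the example and are not taken from GMiner.

```python
import queue
import threading

# Hypothetical stand-in for the annotation service: it only looks for a few
# known labels in the text. The real pipeline indexes text at NCBO instead.
KNOWN_TERMS = {"kidney": "anatomy", "brain": "anatomy", "shrsp": "rat strain"}


def annotate(text):
    """Return sorted (label, ontology) pairs for known labels found in text."""
    lowered = text.lower()
    return sorted(
        (label, ontology)
        for label, ontology in KNOWN_TERMS.items()
        if label in lowered
    )


def worker(jobs, results):
    """Take jobs off the outbound queue and put annotations on the inbound one."""
    while True:
        job = jobs.get()
        if job is None:  # sentinel: no more work for this worker
            break
        record_id, text = job
        results.put((record_id, annotate(text)))


def run_pipeline(records, n_workers=3):
    """Queue one job per record, run the workers, then collect results to save."""
    jobs, results = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(jobs, results))
        for _ in range(n_workers)
    ]
    for thread in threads:
        thread.start()
    for record in records.items():
        jobs.put(record)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()

    saved = {}  # stands in for the database the results are saved back to
    while not results.empty():
        record_id, annotations = results.get()
        saved[record_id] = annotations
    return saved
```

Calling `run_pipeline({"rec1": "Kidney tissue from SHRSP rats"})` would return the annotations keyed by record id. Separating job creation, annotation, and saving by queues means the number of workers can be changed without touching the other two stages.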
  • We are currently using two ontologies: the rat strain ontology created at RGD and the Mouse Gross Anatomy Ontology created at JAX. Both are available at the NCBO BioPortal.
  • GEO data is run through the pipeline and loaded into Gminer for curation and analysis
  • Searching for BRAIN returns results that also match any of these child terms of the concept Brain.
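
One way to picture this kind of search is to expand the query term to all of its descendants before matching. The sketch below assumes a tiny, made-up parent-to-children table and invented dataset ids; it is not the actual anatomy ontology or the GMiner implementation.

```python
# Hypothetical fragment of an anatomy hierarchy: parent -> direct children.
CHILDREN = {
    "brain": ["forebrain", "hindbrain"],
    "forebrain": ["hippocampus"],
    "hindbrain": ["cerebellum"],
}


def descendants(term, children=CHILDREN):
    """Collect every term below `term` in the hierarchy (iterative walk)."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def search(term, annotations, children=CHILDREN):
    """Return datasets annotated with `term` or with any of its descendants."""
    wanted = {term} | descendants(term, children)
    return sorted(
        dataset
        for dataset, terms in annotations.items()
        if wanted & set(terms)
    )
```

With annotations such as `{"ds1": ["cerebellum"], "ds2": ["kidney"]}`, a search for `"brain"` picks up `ds1` even though the word "brain" never appears in its annotations.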
  • New annotations can be added using the NCBO ontology widgets
  • Enter the ontology hierarchy at a top level and then drill down


  • Annotations and tag clouds can be used to explore the datasets. For example, what do we know about SHRSP (spontaneously hypertensive rat, stroke prone)? It appears in brain studies and is also used in conjunction with the SR and SR/JHsd rats.
  • Initial work focusing on GEO rat datasets has provided a lot of great information and allowed us to create some handy navigational interfaces to the data, enabling queries that were not possible on any other site. Want to find expression data for the SS rat kidney? Click the terms and the datasets appear.
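
Clicking several terms to narrow the datasets amounts to intersecting the sets of datasets that carry each term. The following is a small sketch of that idea, assuming an inverted index built from hypothetical dataset ids and annotation labels; it does not reflect how GMiner stores its data.

```python
from collections import defaultdict


def build_index(annotations):
    """Invert {dataset: terms} into {term: set of datasets}."""
    index = defaultdict(set)
    for dataset, terms in annotations.items():
        for term in terms:
            index[term].add(dataset)
    return index


def combine(index, selected_terms):
    """Return the datasets annotated with every selected term."""
    sets = [index.get(term, set()) for term in selected_terms]
    if not sets:
        return set()
    return set.intersection(*sets)
```

For example, with an index built from `{"d1": {"SS", "kidney"}, "d2": {"SS", "heart"}}`, selecting both `"SS"` and `"kidney"` leaves only `d1`.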
  • Can we link from the annotations to the samples, down to the raw data in that sample and from there to the genes involved? Affy chips have the detection call, a fairly conservative present/absent call indicating if the probe set was observed in that particular sample.
  • We can then relate the probesets to the genes and to the ontology annotations to create triples such as this. If we do this for the Affy data in GEO for rat, mouse and human, we will have somewhere upwards of 1.5B data points to encode.
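
As a rough sketch of how such triples might be assembled, the code below joins the probesets called present in a sample to their genes and attaches the sample's annotation. The `is_expressed_in` predicate and the sample and gene counts are taken from the slides; the probeset ids, the mapping, and the function name are hypothetical.

```python
def make_triples(present_probesets, probeset_to_gene, sample_annotation):
    """Build (gene, 'is_expressed_in', annotation) triples for one sample."""
    triples = set()
    for probeset in present_probesets:
        gene = probeset_to_gene.get(probeset)
        if gene is not None:  # skip probesets with no known gene
            triples.add((gene, "is_expressed_in", sample_annotation))
    return sorted(triples)


# Back-of-the-envelope size estimate using the figures quoted on the slides.
SAMPLES = 62_000
GENES_PER_SAMPLE = 25_000
TOTAL_DATA_POINTS = SAMPLES * GENES_PER_SAMPLE  # 1,550,000,000, i.e. about 1.5B
```

Here `make_triples(["probe_1"], {"probe_1": "Hbb"}, "rat kidney")` would yield a single triple reading "Hbb is_expressed_in rat kidney".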
  • For each probe we can look at the samples in which it was tested and see if it was present/absent/marginal and compile this data to get a feel for how often a gene was seen in a particular tissue/organ.
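
Compiling the calls for one probeset could look something like the sketch below. It assumes each observation is a (tissue, call) pair with calls coded as "P", "A" or "M" for present, absent and marginal; the data layout and function names are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def tally_calls(observations):
    """Count present/absent/marginal calls per tissue for one probeset."""
    tallies = defaultdict(Counter)
    for tissue, call in observations:
        tallies[tissue][call] += 1
    return tallies


def present_fraction(tallies):
    """Fraction of samples per tissue in which the probeset was called present."""
    return {
        tissue: counts["P"] / sum(counts.values())
        for tissue, counts in tallies.items()
    }
```

The per-tissue fractions give a simple basis for a tissue distribution chart, though other summaries (for example counting marginal calls as half) would be equally reasonable.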
  • This can be viewed as a chart of tissue distribution. When compared to similar results from GeneCards/Novartis BioGPS the results are quite comparable indicating that this approach has some merit.
  • As we start to create these triples we can bridge the gap from the QTL and its genes to the disease, allowing the scientists to identify or prioritize candidate genes in their QTL regions (or gene lists) and save them (to some degree) from spending a lot of time manually searching databases online.
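
A very simple form of this prioritization is to move the genes in a QTL region that have supporting expression triples for a tissue of interest to the front of the list. The sketch below assumes triples shaped like (gene, predicate, annotation); the gene names and the ranking rule are illustrative assumptions, not the method used in GMiner.

```python
def prioritize(qtl_genes, triples, tissue):
    """Order QTL genes so those expressed in `tissue` come first."""
    expressed = {
        gene
        for gene, predicate, annotation in triples
        if predicate == "is_expressed_in" and annotation == tissue
    }
    supported = [gene for gene in qtl_genes if gene in expressed]
    unsupported = [gene for gene in qtl_genes if gene not in expressed]
    return supported + unsupported
```

Genes without supporting evidence are kept at the end rather than dropped, since absence of an annotation is not evidence that the gene is irrelevant.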


  • Acknowledgements

  • Using Ontologies to accelerate candidate gene identification

    1. Using Ontologies to accelerate candidate gene identification Simon Twigger, Ph.D. AMIA Summit on Translational Bioinformatics San Francisco, March 2010
    2. http://rgd.mcw.edu
    3. Meet the client
    4. Hypertension
    5. Hypertensive Hypertension
    6. QTL Hypertensive Hypertension
    7. QTL Hypertensive G G G Hypertension
    8. QTL Hypertensive G G G Hypertension
    9. Rat researchers ask...
    10. Rat researchers ask... Has anyone done any expression studies using congenic rats?
    11. Rat researchers ask... Has anyone done any expression studies using congenic rats? What tissue is this gene expressed in?
    12. Rat researchers ask... Has anyone done any expression studies using congenic rats? What tissue is this gene expressed in? Are any of these genes associated with my phenotype?
    13. Rat researchers ask... Has anyone done any expression studies using congenic rats? What tissue is this gene expressed in? Are any of these genes associated with my phenotype? What rat expression studies have been done on Mammary Cancer (aka breast neoplasms/breast cancer/cancer of the
    14. Rat researchers ask... Has anyone done any expression studies using congenic rats? What tissue is this gene expressed in? Are any of these genes associated with my phenotype? What expression data is known for SD (aka SD/NHsd, Harlan Sprague Dawley, Sprague Dawley) rats? What rat expression studies have been done on Mammary Cancer (aka breast neoplasms/breast cancer/cancer of the
    15. Rat researchers ask... Has anyone done any expression studies using congenic rats? What tissue is this gene expressed in? Are any of these genes associated with my phenotype? What expression data is known for SD (aka SD/NHsd, Harlan Sprague Dawley, Sprague Dawley) rats? Has this gene been seen in the brain? What rat expression studies have been done on Mammary Cancer (aka breast neoplasms/breast cancer/cancer of the
    16. Biological Data Warehouse
    17. Biological Data Warehouse Really important piece of data...
    18. NCBI GEO db
    19. Data hidden in plain sight
    20. NCBO Annotator http://www.bioontology.org/wiki/index.php/Annotator_Web_service
    21. Parallel Annotation Workflow: GEO Records -> Create Annotation Jobs & Queue Up -> Q-Out (RabbitMQ) -> 1..n Annot. Workers (Index text at OBA, Parse Results) -> Put results in to queue for save -> Q-In (RabbitMQ) -> Results saved to GMiner database
    22. Current Ontologies http://bioportal.bioontology.org/
    23. gminer.mcw.edu
    24. Using the ontology structure
    25. Curation of results NCBO Ontology Widgets http://www.bioontology.org/wiki/index.php/Ontology_Widgets
    26. Curation of results NCBO Ontology Widgets http://www.bioontology.org/wiki/index.php/Ontology_Widgets
    27. Curation of results NCBO Ontology Widgets http://www.bioontology.org/wiki/index.php/Ontology_Widgets
    28. Curation of results NCBO Ontology Widgets http://www.bioontology.org/wiki/index.php/Ontology_Widgets
    29. Explore Cardio data
    30. Find Congenic data
    31. Browse by annotation
    32. SHRSP overview
    33. Combine results
    34. Combine results
    35. Linking annotations to data
    36. Linking annotations to data
    37. Linking annotations to data Tm2d1 RGD1306410 Svs4 Hbb Scgb2a1 Alb
    38. Linking annotations to data Tm2d1 RGD1306410 Svs4 Hbb Scgb2a1 Alb
    39. Linking annotations to data Tm2d1 RGD1306410 Svs4 Hbb Scgb2a1 + Alb
    40. Linking annotations to data Tm2d1 RGD1306410 Svs4 Hbb Scgb2a1 + Alb Hbb is_expressed_in rat kidney Tm2d1 is_expressed_in rat kidney
    41. Linking annotations to data Tm2d1 RGD1306410 Svs4 Hbb Scgb2a1 + Alb Hbb is_expressed_in rat kidney Tm2d1 is_expressed_in rat kidney Human (U133, U133v2.), Mouse (430, U74, U95) and Rat (U34a/b/c, 230, 230v2) 62,000 samples x ca. 25,000 genes/sample = 1.5B data points
    42. Probeset results on GMiner Probeset 1395269_s_at for Gabrd - gamma-aminobutyric acid (GABA) A receptor, delta
    43. Probeset results on GMiner Probeset 1395269_s_at for Gabrd - gamma-aminobutyric acid (GABA) A receptor, delta
    44. Probeset results on GMiner Probeset 1395269_s_at for Gabrd - gamma-aminobutyric acid (GABA) A receptor, delta Hs GABDR
    45. QTL Hypertensive G G G Hypertension
    46. QTL Hypertensive G G G Hypertension
    47. QTL Hypertensive G G G Pathway Hypertension
    48. QTL Hypertensive G G G Pathway G G Hypertension
    49. QTL Hypertensive G G G Pathway G G Component Function Process Hypertension
    50. QTL Hypertensive G G G Pathway G G Component Function Process Hypertension
    51. QTL Hypertensive G G G Pathway G G Anatomy (Kidney) Component Function Process Hypertension
    52. QTL Hypertensive G G G Pathway Str 1 != Str 2 G G Anatomy (Kidney) Component Function Process Hypertension
    53. Ontology Advantages • Unstructured to Structured (using OBA service) • Structured (Faceted) browsing of data • Encourages discussion of data & its meaning • Integration with other data (via same ontologies or mappings to others)
    54. Ontology Hurdles • Managing ontology/vocabulary terms and structure • Time to encode data using ontology vs free text • Consistent use/annotation using ontologies • Quite a few ‘standards’ to pick from....
    55. Acknowledgements • Joey Geiger - Development of GMiner • Jennifer Smith - Video creation, data curation • Rajni Nigam - Rat Strain Ontology • Clement Jonquet - NCBO Annotator tools • Mark Musen & NIH Roadmap Initiative - Our Funding!
    56. Links • http://gminer.mcw.edu Web application • http://github.com/mcwbbc/gminer Gminer Code • http://github.com/simont/MCW-RDF RDFizer code • http://bioportal.bioontology.org/ BioPortal Email: simont@mcw.edu Twitter: @simon_t
