Next-generation text-mining
applied to toxicogenomics data
            analysis

         Kristina Hettne
       PhD thesis defense


          20 December, 2012
Toxicogenomics: the study of whether a chemical
 causes damage to genes

Text mining: teach a computer to “read”
 articles and extract explicit information

Next-generation text mining: teach a
 computer to find implicit information in
 articles
Drug safety is essential!
                                  But… how to minimize animal testing?




Image source: The Independent, July 12, 2012
Toxicogenomics data                                Interpretation using
                                                       knowledge from manually
                                                       curated databases




Image sources: Verhallen and Piersma, 2011, de Jong et al 2011, http://www.flickr.com/photos/jseita/3764113525/
Toxicogenomics data                                Interpretation using
                                                       knowledge from manually
                                                       curated databases




                                                       Not sufficient in coverage

     We hypothesize that next-generation text mining
     can increase the information coverage
Image sources: Verhallen and Piersma, 2011, de Jong et al 2011, http://www.flickr.com/photos/jseita/3764113525/
Next-generation text mining = concept profile
   matching
     Information cloud for
     a gene concept                   Shared concepts




                                                        Information cloud
                                                        for a chemical
                                                        concept




Image source: Herman van Haagen
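
The matching idea on this slide can be sketched in a few lines of Python. This is a minimal illustration only, assuming each concept profile (information cloud) is stored as a dictionary of concept-to-weight pairs and that the match is scored with cosine similarity. The function name `match_profiles`, the concepts, and the weights are all hypothetical, and the scoring used in the thesis may differ.

```python
from math import sqrt


def match_profiles(profile_a, profile_b):
    """Return (cosine similarity, sorted shared concepts) for two profiles.

    Each profile maps a concept name to a non-negative association weight.
    """
    # Only concepts present in both information clouds contribute to the match.
    shared = sorted(set(profile_a) & set(profile_b))
    dot = sum(profile_a[c] * profile_b[c] for c in shared)
    norm_a = sqrt(sum(w * w for w in profile_a.values()))
    norm_b = sqrt(sum(w * w for w in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0, shared
    return dot / (norm_a * norm_b), shared


# Hypothetical toy profiles; concepts and weights are invented for illustration.
gene_profile = {"apoptosis": 0.8, "liver": 0.5, "oxidative stress": 0.3}
chemical_profile = {"liver": 0.6, "oxidative stress": 0.7, "kidney": 0.2}

score, shared = match_profiles(gene_profile, chemical_profile)
print(f"match score: {score:.3f}")
print("shared concepts:", shared)
```

The shared concepts are returned alongside the score because they are what makes an implicit gene-chemical link interpretable: they show which intermediate topics connect the two.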

Concepts come from a thesaurus and are identified
   in text with concept identification software


   A good
   thesaurus =
   the basis for
   good concept
   identification



Image source: Herman van Haagen
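
A minimal sketch of thesaurus-based concept identification follows, assuming a plain dictionary lookup with case-insensitive, longest-term-first matching. The concept identifiers (`CHEM:1` and so on), the tiny thesaurus, and the example sentence are hypothetical; actual concept identification software does considerably more, such as term normalization and disambiguation.

```python
import re


def build_matcher(thesaurus):
    """Compile one regex over all thesaurus terms, plus a term-to-concept map."""
    term_to_concept = {}
    for concept_id, terms in thesaurus.items():
        for term in terms:
            term_to_concept[term.lower()] = concept_id
    # Longest terms first, so multi-word terms win over their substrings.
    ordered = sorted(term_to_concept, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern, term_to_concept


def identify_concepts(text, thesaurus):
    """Return (matched text, concept id) pairs in order of appearance."""
    pattern, term_to_concept = build_matcher(thesaurus)
    return [
        (m.group(0), term_to_concept[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


# Hypothetical toy thesaurus: concept id -> list of synonyms.
toy_thesaurus = {
    "CHEM:1": ["flusilazole"],
    "CHEM:2": ["triazole", "triazoles"],
    "GENE:1": ["CYP26A1", "cytochrome P450 26A1"],
}

sentence = "Flusilazole is a triazole; CYP26A1 was measured."
for matched, concept_id in identify_concepts(sentence, toy_thesaurus):
    print(matched, "->", concept_id)
```

Because every match is resolved through the thesaurus, a missing synonym or an ambiguous term directly becomes a missed or wrong concept, which is why the quality of the thesaurus matters so much.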
Research objectives:
• Investigate information coverage in public
   biomedical and chemical thesauri and
   databases
• Provide methods to improve the quality
   and coverage
• Give recommendations for use
• Investigate added value of next-
   generation text mining when interpreting
   toxicogenomics data
Results




A thesaurus of chemical concepts [1] and
methods [1,2,3] to prepare a thesaurus to be
used with concept identification software




http://www.biosemantics.org/casper http://www.biosemantics.org/jochem


1. Hettne et al. Bioinformatics, 2009
2. Hettne et al. Journal of Biomedical Semantics, 2010
3. Hettne et al. Journal of Cheminformatics, 2010
A next-generation text mining-based method
   for interpreting biological data
     [Figure labels: Biological data; Statistical test; Next-generation text mining]




     This method gives more, and more specific, results [1]
     than other available tools
      http://www.biosemantics.org/weightedglobaltest

1. Jelier R, Goeman JJ, Hettne KM, Schuemie MJ, den Dunnen JT, 't Hoen PA. Briefings in Bioinformatics, 2011
Application to toxicogenomics
                            Hettne et al. (submitted)
http://www.biosemantics.org/index.php?page=chemicalresponse-specific-gene-sets
See developmental defects in stem cells instead of
       in animal embryos
     [Figure panels: 1. Embryonic structure; 2. Posterior neuropore open.
      A) Control group rat embryo B) Triazole-exposed rat embryo]
Image sources: 1. Verhallen and Piersma, 2011; 2. De Jong et al., 2012
Toxicity class prediction (case study: Triazoles)
      A 25 times larger chemical-gene matrix than the manually
      curated Comparative Toxicogenomics Database
     [Figure 1; label: Chemical]
Image source 1: Verhallen and Piersma, 2011
Conclusions
Next-generation text mining combined with
statistical tests complements, and is
sometimes superior to, manually curated
databases in:
- Relating chemical information to gene
   expression data
- Identifying toxic effects already at the
   gene expression stage
- Discriminating between different classes
   of chemicals
Future
1. Make the method easier to use
(currently being worked on)

2. Apply the method to new drugs
with unknown toxicity

Early prediction of toxicity ->
less animal testing and safer drugs
Thank you to all who made
      this possible!
