Enhancing Data Integration with Text Analysis to Find Genes Implicated in Plant Stress Response


International Symposium on Integrative Bioinformatics 2010


  1. Enhancing Data Integration with Text Analysis to Find Proteins Implicated in Plant Stress Response
     Keywan Hassani-Pak, keywan.hassani-pak@bbsrc.ac.uk
     Integrative Bioinformatics 2010
  2. Outline
     • Motivation
       – Prior information for candidate genes
       – Structured data and unstructured text
     • Methods
       – Text mining plugin for Ondex
       – Application case
     • Results
       – Visualisation
       – Association networks
       – Filtering noise
       – Validation
     • Summary
  3. Motivation
     • High-throughput ‘omics research can identify many candidate genes
     • Interpretation of experimental results needs prior information
     • Most important sources for prior information are
       – Structured bioinformatics databases
       – Unstructured scientific literature
     • GOAL: Automated methods for the integration of prior information
     [Figure: time course microarray data (Experiment1, Experiment2, …) are used to identify candidate genes (Gene1 … GeneN) that alter expression over time; prior information for these genes regarding the experiment is drawn from public data sources (DBs, literature).]
  4. Structured Data vs. Unstructured Text
     • Data integration methods
       – Syntactic and semantic heterogeneity
       – Literature references
     • Text mining methods
       – Identify facts hidden in unstructured text
       – Integrate facts with database entries
     http://www.nactem.ac.uk/software/kleio
     http://www.uniprot.org
  5. Integrative Text Mining
     • Old: Data integration and text mining systems have largely been developed independently
     • Idea: Combine structured knowledge stored in public databases with unstructured information in the literature
     • New: Text mining plugin for the data integration framework Ondex
     [Figure: Ondex architecture. Labels: Heterogeneous Data Sources (UniProt, OBO, KEGG, MEDLINE, each with a parser); Data Transformation; Ondex Core (generalized object data model, database layer); Lucene; Mapping Methods (accession-based, name-based, BLAST); Text Mining; Data Exchange (OXL/RDF, web service); Clients/Tools (Taverna, Cytoscape, Ondex Frontend); Ondex Integrator. www.ondex.org]
  6. Advanced Knowledge Base
     1. Structured information
        – Bioinformatics databases, ontologies
        – Curated citations in structured data sources (e.g. from UniProt)
     2. Unstructured information
        – MEDLINE titles and abstracts are indexed and normalised (by Lucene)
        – Information Retrieval strategies: exact, fuzzy, proximity
        – Named Entity Recognition: concept-based (names and synonyms)
        – Score: tf-idf weight (term frequency * inverse document frequency)
     [Figure: text mining links concepts A and B through publications x and y (published_in, is_related), giving an edge A-B in a weighted association network with IP=1.7; M=1.2; N=2.]
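The tf-idf weight named on the slide above can be illustrated with a minimal sketch. This uses the textbook formula (raw term count times the log of the inverse document frequency); it is not claimed to be the exact scoring that Lucene or the Ondex plugin applies, and the toy corpus and the function name `tf_idf` are hypothetical.

```python
import math
from collections import Counter


def tf_idf(term, doc_tokens, corpus):
    """Textbook tf-idf: raw term frequency times log inverse document frequency."""
    tf = Counter(doc_tokens)[term]
    df = sum(1 for doc in corpus if term in doc)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)


# Hypothetical abstracts, already tokenised and lower-cased.
corpus = [
    ["abi1", "modulates", "ethylene", "biosynthesis"],
    ["ethylene", "signalling", "in", "arabidopsis"],
    ["drought", "response", "in", "arabidopsis"],
]

rare = tf_idf("abi1", corpus[0], corpus)        # term occurs in 1 of 3 documents
common = tf_idf("ethylene", corpus[0], corpus)  # term occurs in 2 of 3 documents
print(rare > common)
```

A term that appears in fewer documents of the corpus receives a higher weight, which is why a rare protein name scores above a frequent stress term in the same abstract.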
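  7. Association Scores
     [Figure: an edge between concepts A and B in the weighted association network, annotated N=29; M=3.1; IP=22.4. N counts the publications linking A and B; the per-publication tf-idf values are 3.1 (= M), 1.7, 0.9, ...; IP = 22.4.]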
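A minimal sketch of how the three edge scores could be derived from per-publication scores. The slide states that N is the co-citation count and that M equals the largest tf-idf value; treating IP as the sum of the per-publication values is an assumption, chosen because it agrees with the numbers shown in the talk (for N=1 the table reports IP equal to M). The function name `association_scores` and the second input value are hypothetical.

```python
def association_scores(publication_scores):
    """Aggregate per-publication scores for one concept pair (A, B)."""
    if not publication_scores:
        raise ValueError("need at least one co-citing publication")
    return {
        "N": len(publication_scores),    # number of co-citing publications
        "M": max(publication_scores),    # largest single-publication score
        "IP": sum(publication_scores),   # ASSUMPTION: IP aggregates by summation
    }


# 1.2 matches M in the knowledge-base figure; 0.5 is a hypothetical second score.
scores = association_scores([1.2, 0.5])
print(scores["N"], scores["M"], round(scores["IP"], 1))
```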
  8. The PRESTA project
     [Figure: PRESTA workflow. Labels: Time Course Microarray Data; Identify genes that alter expression over time; Worldwide Data Resources (literature, public databases, public experiments); Ondex; Prior information; Network Inference; Identification of key regulatory genes; Knock-out experiments; Overexpresser experiments; Phenotypes.]
     http://www2.warwick.ac.uk/fac/sci/whri/research/presta
  9. Application Case: Knowledge Base for Stress Response in Arabidopsis
     • Publications (the corpus)
       – MEDLINE: search ‘Arabidopsis thaliana’ → 28653 publications
     • Proteins
       – UniProtKB: search ‘taxid:3702 + reviewed’ → 8582 proteins
       – 13502 curated citations
     • Plant Stress Ontology
       – 33 stresses/treatments related to PRESTA experiments
       – Biotic: Bacteria, Fungus, etc.
       – Abiotic: Drought, Salt, Light, Hormone, etc.
     [Figure: knowledge base schema with Stress, Protein, Publication and Enzyme concepts linked by published_in and is_related relations; counts shown: 13502 352445194]
  10. Network Visualisation
      [Figure: network view; labelled node: X. campestris]
  11. Protein-Stress Association Network
      • 3145 proteins linked to 32 stresses by 10777 relations
      • On average
        – each protein associated with 3.4 stresses
        – each stress associated with 337 proteins
      • Filtering associations based on three confidence scores IP, M and N

      Metric   Min    Max
      IP       0.01   347.26
      M        0.01   26.86
      N        1      600

      [Figure: network view; labelled nodes: X. campestris, Ethylene]
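The two averages on the slide above follow directly from the counts it reports; a short check of the arithmetic (variable names are illustrative):

```python
# Counts reported on the slide.
proteins, stresses, relations = 3145, 32, 10777

# In a protein-stress network, the mean number of partners on one side
# is the number of relations divided by the number of nodes on that side.
stresses_per_protein = relations / proteins
proteins_per_stress = relations / stresses

print(f"{stresses_per_protein:.1f} stresses per protein")  # 3.4
print(f"{proteins_per_stress:.0f} proteins per stress")    # 337
```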
  12. How to find cut-offs for filtering?
      • Problem: Text mining results are often error-prone
      • Aim: Improve the signal-to-noise ratio by setting optimal cut-offs
      • The co-citation number (N) is the simplest way to potentially reduce noise in such association networks
      • Filtering by IP and M should be more selective, as both consider the frequency of terms in the corpus
      • However, none of the metrics is superior overall
      • Considering several metrics at the same time seems to be the method of choice to reduce noise and highlight key associations
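Filtering on several metrics at once, as the last bullet suggests, might look like the sketch below. The three example rows use N, M and IP values from the Top 10 table later in the talk; the cut-off values, the `Association` class and the `passes` helper are hypothetical and purely illustrative, since the talk does not state which cut-offs were used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Association:
    protein: str
    n: int      # co-citation number
    m: float    # M score
    ip: float   # IP score


def passes(a, min_n=1, min_m=0.0, min_ip=0.0):
    """Keep an association only if it meets every cut-off."""
    return a.n >= min_n and a.m >= min_m and a.ip >= min_ip


associations = [
    Association("ACBP4", n=1, m=13.51, ip=13.51),
    Association("PR-5", n=12, m=5.18, ip=5.47),
    Association("Remorin", n=4, m=5.04, ip=6.77),
]

# Hypothetical cut-offs: at least two co-citations, M and IP of at least 5.
kept = [a.protein for a in associations
        if passes(a, min_n=2, min_m=5.0, min_ip=5.0)]
print(kept)
```

With these cut-offs the N filter drops ACBP4 despite its high M score, and Remorin survives although the manual evaluation marked it as not confirmed. This is consistent with the slide's point that no single metric is superior overall.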
  13. Validation of Protein-Ethylene Pairs
      • Ethylene association network contained 533 proteins
      • Ideally, one would read all abstracts and evaluate each association
      • Comparison with Arabidopsis Hormone Database (AHD)
        a. 31 curated associations: 71.0% recall
        b. 166 total associations (inc. GO): 44.8% recall
      [Figure: panels a and b comparing the text mining (TM) and AHD sets]
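Recall against a reference database such as AHD is the share of reference associations that text mining also found. A minimal sketch with hypothetical protein identifiers (the slide does not list the actual sets):

```python
def recall(predicted, reference):
    """Fraction of reference items that also appear in the predictions."""
    if not reference:
        raise ValueError("reference set is empty")
    return len(predicted & reference) / len(reference)


# Hypothetical identifiers, for illustration only.
text_mining = {"P1", "P2", "P3", "P5"}
ahd = {"P1", "P2", "P4"}

toy_recall = recall(text_mining, ahd)
print(f"{toy_recall:.1%}")
```

For the 31 curated associations in panel a, an overlap of 22 would reproduce the reported value (22/31 ≈ 71.0%); the slide itself does not state the overlap count.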
  14. Top 10 protein predictions

      ACCESSION  NAME           PUBMED    YEAR  M      N   IP     PVAL   TRUE
      AT3G05420  ACBP4          18836139  2008  13.51  1   13.51  1.00*  yes
      AT1G31812  ACBP6          18836139  2008  11.57  2   17.14  0.50   yes
      AT3G03190  ATGSTF6        14617075  2003  7.36   7   15.75  0.25   yes
      AT4G26080  ABI1           19705149  2009  6.66   10  12.22  0.39   yes
      AT3G21510  AHP1           18384742  2008  6.60   3   6.70   0.17   yes
      AT1G75040  PR-5           15988566  2005  5.18   12  5.47   0.07   yes
      AT2G45820  Remorin        9159183   1997  5.04   4   6.77   0.86   no
      AT3G11410  PP2CA          19705149  2009  5.00   1   5.00   1.00   yes
      AT1G09570  Phytochrome A  8703080   1996  4.79   11  8.47   0.19   no
      AT1G04240  IAA3           19213814  2009  4.54   3   5.14   0.67   yes

      • Evaluated top 10 proteins (sorted by M score) from our analyses that are linked to ethylene but were not found in AHD.
      • P-value relates to the significance of the IP score. However, if N=1 then P=1 (*)
      • Evidence text
        – PMID:18836139: the interaction of ACBP4 and AtEBP may be related to AtEBP-mediated defence possibly via ethylene and/or jasmonate signalling.
        – PMID:19705149: protein phosphatase 2C ABI1 modulates biosynthesis ratio of ABA and ethylene.
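The ranking and the manual evaluation in the table above can be re-checked with a short sketch. The names, M scores and TRUE flags are copied from the table; the tuple layout and variable names are illustrative.

```python
# (name, M score, confirmed by manual evaluation), copied from the table.
rows = [
    ("ACBP4", 13.51, True),
    ("ACBP6", 11.57, True),
    ("ATGSTF6", 7.36, True),
    ("ABI1", 6.66, True),
    ("AHP1", 6.60, True),
    ("PR-5", 5.18, True),
    ("Remorin", 5.04, False),
    ("PP2CA", 5.00, True),
    ("Phytochrome A", 4.79, False),
    ("IAA3", 4.54, True),
]

# Sorting by M in descending order should reproduce the table order.
ranked = sorted(rows, key=lambda row: row[1], reverse=True)

confirmed = sum(1 for _, _, ok in rows if ok)
fraction_confirmed = confirmed / len(rows)
print(confirmed, "of", len(rows), "predictions confirmed")
```

Eight of the ten predictions carry a "yes" in the TRUE column, so 80% of this small sample was judged correct.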
  15. Future Work
      • Integrate more advanced text mining methods
      • Extensive analysis and evaluation of our association metrics
      • Investigate alternative association metrics
      • Finding best cut-off for optimal signal-to-noise ratio
      • Apply method to more application cases
  16. Summary
      • Prior information needs to be extracted from structured data and unstructured text
      • Developed a flexible text mining plugin for the data integration framework Ondex (open source)
      • Can be linked into various bioinformatics workflows to enhance high-throughput ‘omics research
      • First report of systematically combining data integration with basic text mining
      • Generated prior information for Arabidopsis proteins regarding the PRESTA experiments
  17. Acknowledgements
      • ONDEX: BBSRC SABR Project BB/F006039
      • PRESTA: BBSRC SABR project BB/F005806
      Catherine Canevet, Chris Rawlings, Roxane Legaie, Hugo van den Berg, Jay Moore
      THANK YOU!
      Contact: keywan.hassani-pak@bbsrc.ac.uk
