2. Outline
• Motivation
– Prior information for candidate genes
– Structured data and unstructured text
• Methods
– Text mining plugin for Ondex
– Application case
• Results
– Visualisation
– Association networks
– Filtering noise
– Validation
• Summary
3. Motivation
• High‐throughput ‘omics research can identify many candidate genes
• Interpreting experimental results requires prior information
• The most important sources of prior information are
– Structured bioinformatics databases
– Unstructured scientific literature
• GOAL: Automated methods for the integration of prior information
[Figure: Time course microarray data (Experiment1, Experiment2, …) are used to identify candidate genes (Gene1 … GeneN) that alter expression over time; prior information for these genes regarding the experiment is then obtained from public data sources (DBs and literature).]
6. Advanced Knowledge Base
1. Structured information
– Bioinformatics databases, ontologies
– Curated citations in structured data sources (e.g. from UniProt)
2. Unstructured information
– MEDLINE titles and abstracts are indexed and normalised (by Lucene)
– Information Retrieval strategies: exact, fuzzy, proximity
– Named Entity Recognition: concept‐based (names and synonyms)
– Score: tf‐idf weight (term frequency * inverse document frequency)
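As a rough illustration of the tf‐idf weight named above, the following minimal sketch computes the textbook form (raw term frequency times the log of inverse document frequency) over a toy corpus. It is not the plugin's code: the function name `tf_idf` and the example tokens are hypothetical, and Lucene's own scoring applies further damping and normalisation that are not reproduced here.

```python
import math
from collections import Counter


def tf_idf(term, doc_tokens, corpus):
    """Textbook tf-idf: term frequency * log(N / document frequency).

    Assumption: raw counts and a natural log; real search engines
    such as Lucene use damped and normalised variants.
    """
    tf = Counter(doc_tokens)[term]                # occurrences in this document
    df = sum(1 for doc in corpus if term in doc)  # documents containing the term
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)


# Hypothetical tokenised abstracts, for illustration only.
corpus = [
    ["ethylene", "signalling", "ethylene"],
    ["auxin", "response"],
    ["ethylene", "defence"],
]

for i, doc in enumerate(corpus):
    print(i, round(tf_idf("ethylene", doc, corpus), 3))
```

A term that is frequent in one abstract but rare across the corpus receives a high weight, which is why the score is useful for ranking concept mentions in MEDLINE abstracts.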
[Figure: Text mining links concepts A and B through publications x and y (published_in), producing an is_related edge between A and B in a weighted association network; example edge scores: IP=1.7; M=1.2; N=2.]
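The edge scores in the figure (IP=1.7; M=1.2; N=2) are not defined on the slide. One reading that is consistent with the figure and with rows where N=1 gives IP equal to M is: N is the number of supporting publications, M is the maximum per‐publication score, and IP is the sum of the per‐publication scores. The sketch below aggregates per‐publication evidence under that assumption; the function name `aggregate`, the tuple layout, and the individual scores 1.2 and 0.5 are hypothetical and chosen only to reproduce the figure's values.

```python
from collections import defaultdict


def aggregate(evidence):
    """Collapse per-publication scores into one weighted edge per concept pair.

    evidence: iterable of (concept_a, concept_b, publication, score).
    Assumed meanings (not confirmed by the slides):
      N  = number of supporting publications
      M  = maximum per-publication score
      IP = sum of per-publication scores
    """
    scores_by_pair = defaultdict(list)
    for concept_a, concept_b, _publication, score in evidence:
        scores_by_pair[(concept_a, concept_b)].append(score)
    return {
        pair: {"N": len(scores), "M": max(scores), "IP": sum(scores)}
        for pair, scores in scores_by_pair.items()
    }


# Illustrative evidence: A and B co-occur in publications x and y.
edges = aggregate([("A", "B", "x", 1.2), ("A", "B", "y", 0.5)])
print(edges[("A", "B")])
```

Keeping both a maximum and a summed score lets a ranking distinguish one strong co‐mention from many weak ones.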
14. Top 10 protein predictions
ACCESSION  NAME           PUBMED    YEAR  M      N   IP     PVAL   TRUE
AT3G05420  ACBP4          18836139  2008  13.51  1   13.51  1.00*  yes
AT1G31812  ACBP6          18836139  2008  11.57  2   17.14  0.50   yes
AT3G03190  ATGSTF6        14617075  2003  7.36   7   15.75  0.25   yes
AT4G26080  ABI1           19705149  2009  6.66   10  12.22  0.39   yes
AT3G21510  AHP1           18384742  2008  6.60   3   6.70   0.17   yes
AT1G75040  PR‐5           15988566  2005  5.18   12  5.47   0.07   yes
AT2G45820  Remorin        9159183   1997  5.04   4   6.77   0.86   no
AT3G11410  PP2CA          19705149  2009  5.00   1   5.00   1.00   yes
AT1G09570  Phytochrome A  8703080   1996  4.79   11  8.47   0.19   no
AT1G04240  IAA3           19213814  2009  4.54   3   5.14   0.67   yes
• We evaluated the top 10 proteins (sorted by M score) from our analysis that are
linked to ethylene but were not found in AHD.
• The P‐value relates to the significance of the IP score.
However, if N=1 then P=1 (marked *).
• Evidence text
– PMID:18836139: the interaction of ACBP4 and AtEBP may be related to AtEBP‐mediated
defence possibly via ethylene and/or jasmonate signalling.
– PMID:19705149: protein phosphatase 2C ABI1 modulates biosynthesis ratio of ABA and
ethylene.