Text mining tools for semantically enriching scientific literature
1. Text mining tools for
semantically enriching the
scientific literature
Sophia Ananiadou
Director
National Centre for Text Mining
School of Computer Science
University of Manchester
2. Need for enriching the literature
• Need for semantic search i.e. beyond keywords
• Need for technologies enabling focused
semantic search via the creation of semantic
metadata from literature
“The current scientific literature, were it to be
presented in semantically accessible form,
contains huge amounts of undiscovered
science”
Peter Murray-Rust, Data-driven science: A Scientist’s view.
NSF/JISC Repositories Workshop, 2007
3. Impact of text mining
• Extraction of named entities (genes, proteins,
metabolites, etc.)
• Discovery of concepts allows semantic annotation of
documents
– Improves information access by going beyond index
terms, enabling semantic querying
– Improves clustering, classification of documents
– Visualisation based on semantic metadata derived
from text mining results
4. Beyond named entities: facts
• Extraction of relationships, events (facts)
for knowledge discovery
– Information extraction, more sophisticated
annotation of texts (fact annotation)
– Enables even more advanced semantic
querying
5. Enriched annotation
• Text Mining provides enriched annotation
layers
– the user will be able to carry out an easily
expressed semantic query that delivers
facts matching the query, rather than
just sets of documents to read
• Information Extraction and not just Information
Retrieval
• Fact extraction and not just sentence extraction
6. Annotations derived from Text Mining
[Figure: text processing pipeline, supported by a lexicon and an ontology]
raw (unstructured) text → part-of-speech tagging → named entity recognition → deep syntactic parsing → annotated (structured) text
Multi-layered annotations on an example sentence:
– Text: Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells .
– Part-of-speech tags: NN IN NN VBZ VBN IN NN IN JJ NN NNS .
– Syntactic structure: parse tree with S, NP, VP and PP nodes
– Named entities: protein_molecule, organic_compound, cell_line
– Event: negative regulation
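One way to picture multi-layered annotations is as stand-off spans: the sentence text is left untouched and each layer stores character offsets plus a label. The sketch below is only an illustration under assumptions. The `Span` class, the layer names, and the exact extent of each entity (for example, whether the cell line covers "U937 cells" or the longer phrase) are hypothetical and not taken from any NaCTeM tool.

```python
from dataclasses import dataclass

SENTENCE = "Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells."


@dataclass(frozen=True)
class Span:
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    label: str


def find_span(text, phrase, label):
    """Build a Span for the first occurrence of phrase in text."""
    start = text.index(phrase)
    return Span(start, start + len(phrase), label)


# Each layer is an independent list of spans over the same text.
# Entity extents are assumptions made for this illustration.
LAYERS = {
    "named_entity": [
        find_span(SENTENCE, "TNF", "protein_molecule"),
        find_span(SENTENCE, "BHA", "organic_compound"),
        find_span(SENTENCE, "U937 cells", "cell_line"),
    ],
    "event": [
        find_span(SENTENCE, "abolished", "negative_regulation"),
    ],
}


def covered_text(span):
    """Return the piece of the sentence that a span annotates."""
    return SENTENCE[span.start:span.end]


def labels_at(offset):
    """List (layer, label) pairs for every span covering a character offset."""
    return [
        (layer, span.label)
        for layer, spans in LAYERS.items()
        for span in spans
        if span.start <= offset < span.end
    ]
```

Because the layers never modify the text, further layers (part-of-speech tags, parse-tree nodes) could be added in the same way without disturbing the existing ones.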
7. Mining associations from MEDLINE
• FACTA: Finding Associated Concepts with
Text Analysis
– What diseases are related to a particular chemical?
– What proteins are related to a particular disease?
– etc.
• EBIMed http://www.ebi.ac.uk/Rebholz-srv/ebimed/index.jsp
• PubMatrix http://pubmatrix.grc.nia.nih.gov/
• FACTA http://text0.mib.man.ac.uk/software/facta/
– Quick and interactive
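Questions such as "what diseases are related to a particular chemical?" can be approximated by counting how often concepts co-occur in the same abstract. The sketch below is a minimal illustration, not FACTA's implementation: the toy documents use placeholder concept names invented for this example, and ranking by co-occurrence count with pointwise mutual information as a tie-breaker is an assumed scoring choice.

```python
import math
from collections import Counter
from itertools import combinations

# Toy "abstracts", each reduced to the set of concepts found in it.
# The concept names are placeholders invented for this example.
DOCS = [
    {"chemical:A", "disease:X"},
    {"chemical:A", "disease:X", "protein:P"},
    {"chemical:A", "disease:Y"},
    {"protein:P", "disease:X"},
    {"chemical:B", "disease:Z"},
]


def cooccurrence_counts(docs):
    """Count documents per concept and per unordered concept pair."""
    single = Counter()
    pair = Counter()
    for concepts in docs:
        single.update(concepts)
        pair.update(frozenset(p) for p in combinations(sorted(concepts), 2))
    return single, pair


def pmi(joint, count_a, count_b, n_docs):
    """Pointwise mutual information (base 2) from document counts."""
    return math.log2(joint * n_docs / (count_a * count_b))


def associated(query, docs):
    """Rank concepts co-occurring with query: by count, then by PMI."""
    single, pair = cooccurrence_counts(docs)
    ranked = []
    for other in single:
        joint = pair[frozenset((query, other))]
        if other != query and joint > 0:
            score = pmi(joint, single[query], single[other], len(docs))
            ranked.append((other, joint, score))
    return sorted(ranked, key=lambda item: (-item[1], -item[2], item[0]))
```

Calling `associated("chemical:A", DOCS)` returns the concepts that share at least one document with the query, most frequent first. Restricting the result to one concept type (for example, names starting with `disease:`) would mimic the "diseases related to a chemical" question.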
17. Named-entity recognition
Example: The peri-kappa B site mediates human immunodeficiency virus type 2 enhancer activation in monocytes …
– DNA: peri-kappa B site
– virus: human immunodeficiency virus type 2
– cell_type: monocytes
• Entity types (defined by ontologies)
– Gene/protein names
– Enzymes, substances, metabolites, etc.
– GO ontology, KEGG, ChEBI, etc.
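The simplest form of named-entity recognition is dictionary lookup with longest-match preference. The sketch below applies it to the example sentence; the three-entry dictionary is a hypothetical stand-in for a real lexicon or ontology, and real systems typically combine such lookup with machine-learned taggers.

```python
import re

# The example sentence is truncated in the source, so the ellipsis is kept.
SENTENCE = (
    "The peri-kappa B site mediates human immunodeficiency "
    "virus type 2 enhancer activation in monocytes …"
)

# Hypothetical toy dictionary: lower-cased surface form -> entity type.
DICTIONARY = {
    "peri-kappa b site": "DNA",
    "human immunodeficiency virus type 2": "virus",
    "monocytes": "cell_type",
}


def tokenize(text):
    """Split on whitespace, keeping character offsets for each token."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def tag_entities(text, dictionary, max_len=6):
    """Greedy longest-match lookup; returns (surface, type, start, end)."""
    tokens = tokenize(text)
    found = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate first so multi-word names win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            window = tokens[i:i + n]
            key = " ".join(tok for tok, _, _ in window).lower()
            if key in dictionary:
                start, end = window[0][1], window[-1][2]
                found.append((text[start:end], dictionary[key], start, end))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return found
```

`tag_entities(SENTENCE, DICTIONARY)` yields one tuple per recognised mention, with offsets that can be stored as a separate annotation layer.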
19. Leveraging resources
• Annotated texts (GENIA corpus, GENIA
event corpus)
• Resources for bio-text mining
– resource-building NLP tools for text-based
knowledge harvesting (NaCTeM)
– BioLexicon
• Over 1.5M lexical entries for bio-text mining, and
growing
• Containing rich linguistic information for bio-text
mining
20. Population Process
[Figure: how the BioLexicon is populated]
• Existing repositories: chemical, disease, enzyme, species names; gene/protein names
• Medline abstracts → named entity recognition → new gene/protein names
• Term mapping by normalization; subclustering of term variants
• Verb subcategorization → terminological verbs, verb subcategorization frames
• Output: BioLexicon; manual curation on-going
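Term mapping by normalization can be sketched as reducing each surface form to a canonical key and grouping the forms that share a key. The rules below (lower-casing, spelling out a few Greek letters, dropping spaces and punctuation) are assumptions chosen for illustration; they are not the normalization rules actually used to build the BioLexicon.

```python
import re
from collections import defaultdict

# Small, assumed table of Greek letters that are often spelled out.
GREEK = {"α": "alpha", "β": "beta", "γ": "gamma", "κ": "kappa"}


def normalize(term):
    """Map a term variant to a canonical key."""
    key = term.lower()
    for letter, name in GREEK.items():
        key = key.replace(letter, name)
    # Drop everything except ASCII letters and digits.
    return re.sub(r"[^a-z0-9]+", "", key)


def cluster_variants(terms):
    """Group term variants that normalize to the same key."""
    clusters = defaultdict(list)
    for term in terms:
        clusters[normalize(term)].append(term)
    return dict(clusters)


VARIANTS = ["NF-kappa B", "NF-κB", "NF kappaB", "IL-2", "IL 2", "il2"]
CLUSTERS = cluster_variants(VARIANTS)
```

Aggressive normalization like this can merge terms that should stay distinct, which is one reason a manual curation step remains useful.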
21. Semantic search based on facts
• MEDIE: an interactive advanced IR
system retrieving facts
• Performs a semantic search
• Core technology annotates texts
– GENIA tagger
– Enju (deep parser)
– Dictionary-based named entity recognition
• Annotations include syntactic structures and facts
J. Tsujii
22. Medie system overview
[Figure: system architecture]
• Off-line: input textbase → deep parser and entity recognizer → semantically-annotated textbase
• On-line: query → RegionAlgebra search engine over the semantically-annotated textbase → search results
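The split between off-line annotation and on-line search can be illustrated with a tiny fact index. In this sketch the off-line step is assumed to have already happened: the `Fact` records are hand-written stand-ins for parser and entity-recognizer output, built from sentences quoted elsewhere in this deck. The `search` function is a plain filter and does not implement region algebra or MEDIE's actual query language.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    subject: str
    verb: str       # base form of the predicate
    obj: str
    sentence: str   # evidence sentence the fact came from


# Off-line step (assumed): facts extracted once and stored in an index.
INDEX = [
    Fact(
        "BHA",
        "abolish",
        "secretion of TNF",
        "Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells.",
    ),
    Fact(
        "peri-kappa B site",
        "mediate",
        "human immunodeficiency virus type 2 enhancer activation",
        "The peri-kappa B site mediates human immunodeficiency virus "
        "type 2 enhancer activation in monocytes …",
    ),
]


def search(index, subject=None, verb=None, obj=None):
    """On-line step: return facts whose slots contain the given strings.

    A slot left as None matches anything; matching is case-insensitive.
    """
    def slot_matches(value, wanted):
        return wanted is None or wanted.lower() in value.lower()

    return [
        fact for fact in index
        if slot_matches(fact.subject, subject)
        and slot_matches(fact.verb, verb)
        and slot_matches(fact.obj, obj)
    ]
```

For example, `search(INDEX, verb="abolish", obj="TNF")` answers "what abolishes TNF secretion?" with a fact and its evidence sentence. Note that the passive sentence is stored with BHA as the logical subject, which is the kind of normalization a deep parser makes possible.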
25. InfoPubMed
• An interactive Information Extraction system and
an efficient PubMed search tool, helping users to
find information about biomedical entities such
as genes, proteins, and the interactions
between them.
• System components
– Deep parsing technology
– Extraction of protein-protein interactions
– Multi-window interface on a browser
26. InfoPubMed
Interactions and not
just co-occurrences.
Calculated using ML
and deep semantics.
27. Semantic Information Retrieval
http://nactem4.mc.man.ac.uk:8080/Kleio/
• KLEIO: a semantically enriched
information retrieval system for biology
• Offers textual and metadata searches
across MEDLINE
• Leverages terminology technologies
• Named entity recognition: gene, protein,
metabolite, organ, disease, symptom
31. Linking and enriching pathways
with text
• REFINE (BBSRC)
– MCISB and NaCTeM (Kell, Ananiadou, Tsujii)
– to integrate text mining techniques with
visualisation technologies for better
understanding of the evidence for biochemical
and signalling pathways
– to enrich pathway models encoded in the
Systems Biology Markup Language (SBML)
with evidence derived from text mining
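32. 2 Steps for linking text with pathways
[Figure: from literature to pathways]
• Step 1, event extraction: literature → biological events
– "… IkappaB is phosphorylated …" → phosphorylation of IkB
– "… Ikappa B ubiquitination …" → ubiquitination of IkB
– "… degradation of IkB …" → degradation of IkB
• Step 2, pathway construction: biological events → pathways
– IkB → phosphorylated IkB (IkB-P) → ubiquitinated IkB (IkB-U) → degradation
Tsujii-lab, Tokyo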
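The two steps can be sketched with the three literature snippets from this slide. The regular-expression patterns, the alias table and the fixed event ordering are assumptions made for illustration; the event extraction described here relies on deep parsing and annotated corpora rather than hand-written patterns.

```python
import re

SNIPPETS = [
    "IkappaB is phosphorylated",
    "Ikappa B ubiquitination",
    "degradation of IkB",
]

# Assumed alias table mapping spelling variants to one canonical name.
ALIASES = {"ikappab": "IkB", "ikb": "IkB"}

PROTEIN = r"(I\s?kappa\s?B|IkB)"
PATTERNS = [
    ("phosphorylation", re.compile(PROTEIN + r" is phosphorylated", re.I)),
    ("ubiquitination", re.compile(PROTEIN + r" ubiquitination", re.I)),
    ("degradation", re.compile(r"degradation of " + PROTEIN, re.I)),
]

# Assumed biological ordering used to line events up into a pathway.
EVENT_ORDER = ["phosphorylation", "ubiquitination", "degradation"]


def canonical(name):
    """Normalize a protein mention to its canonical name (or None)."""
    return ALIASES.get(re.sub(r"[\s-]+", "", name).lower())


def extract_events(text):
    """Step 1: return (event_type, protein) pairs found in text."""
    events = []
    for event_type, pattern in PATTERNS:
        for match in pattern.finditer(text):
            events.append((event_type, canonical(match.group(1))))
    return events


def build_pathway(events, protein):
    """Step 2: order the events for one protein into pathway steps."""
    seen = {etype for etype, prot in events if prot == protein}
    return [etype for etype in EVENT_ORDER if etype in seen]


EVENTS = [event for snippet in SNIPPETS for event in extract_events(snippet)]
PATHWAY = build_pathway(EVENTS, "IkB")
```

Each pathway step keeps a link back to the snippet it came from only implicitly here; a fuller version would store the evidence sentence with every event so that it can be attached to the pathway model.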
34. Statistics & References
• Statistics
– 36,114 events have been identified from
and annotated to
1,000 Medline abstracts, which contain
9,372 sentences
• References
– Kim, Jin-Dong, Tomoko Ohta and Jun'ichi
Tsujii (2008) Corpus annotation for
mining biomedical events from
literature. BMC Bioinformatics
– http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA
35. Acknowledgements
• Junichi Tsujii and his lab (University of Tokyo): MEDIE,
InfoPubMed, event annotation
• Yoshimasa Tsuruoka (NER, FACTA, KLEIO, REFINE)
• Naoaki Okazaki (TerMine, AcroMine)
• Yutaka Sasaki (BioLexicon, NER, KLEIO)
• John McNaught (BioLexicon, BOOTStrep project)
• Chikashi Nobata (KLEIO)
• Douglas Kell (REFINE)