Text mining tools for semantically enriching the scientific literature
Sophia Ananiadou

Presentation by Sophia Ananiadou at the Cheminformatics workshop, 4th March 2008.

Published in: Education, Technology

Transcript of "Text mining tools for semantically enriching scientific literature"

  1. Text mining tools for semantically enriching the scientific literature
     Sophia Ananiadou, Director, National Centre for Text Mining, School of Computer Science, University of Manchester
  2. Need for enriching the literature
     • Need for semantic search, i.e. beyond keywords
     • Need for technologies enabling focused semantic search via the creation of semantic metadata from literature
     “The current scientific literature, were it to be presented in semantically accessible form, contains huge amounts of undiscovered science”
     Peter Murray-Rust, Data-driven science: A Scientist’s view. NSF/JISC Repositories Workshop, 2007
  3. Impact of text mining
     • Extraction of named entities (genes, proteins, metabolites, etc.)
     • Discovery of concepts allows semantic annotation of documents
       – Improves information access by going beyond index terms, enabling semantic querying
       – Improves clustering and classification of documents
       – Visualisation based on semantic metadata derived from text mining results
  4. Beyond named entities: facts
     • Extraction of relationships, events (facts) for knowledge discovery
       – Information extraction, more sophisticated annotation of texts (fact annotation)
       – Enables even more advanced semantic querying
  5. Enriched annotation
     • Text Mining provides enriched annotation layers
       – the user will be able to carry out an easily expressed semantic query which will deliver facts matching that query, rather than just sets of documents the user has to read…
     • Information Extraction and not just Information Retrieval
     • Fact extraction and not just sentence extraction
  6. Annotations derived from Text Mining
     [Figure: text processing pipeline, supported by a lexicon and an ontology: raw (unstructured) text → part-of-speech tagging → named entity recognition → deep syntactic parsing → annotated (structured) text]
     Example sentence with multi-layered annotations:
     • Text: Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells.
     • Part-of-speech tags: Secretion/NN of/IN TNF/NN was/VBZ abolished/VBN by/IN BHA/NN in/IN PMA-stimulated/JJ U937/NN cells/NNS ./.
     • Parse tree constituents: S, NP, VP, PP
     • Entity and event labels: protein_molecule, organic_compound, cell_line, negative regulation
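The multi-layered annotation idea on this slide can be sketched as stand-off spans over a shared token sequence. This is only an illustrative sketch, not the representation used by the NaCTeM or GENIA tools: the `Span` class and helper names are hypothetical, and the token ranges assigned to the entity and event labels are assumptions, since the transcript lists the labels without saying which words they cover.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    layer: str   # annotation layer, e.g. "pos", "entity", "event"
    start: int   # index of the first token covered
    end: int     # index one past the last token covered
    label: str

# Tokens and part-of-speech tags as they appear on the slide.
TOKENS = "Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells .".split()
POS = "NN IN NN VBZ VBN IN NN IN JJ NN NNS .".split()

def build_layers():
    """Build one span per POS tag, then add the higher-level layers."""
    spans = [Span("pos", i, i + 1, tag) for i, tag in enumerate(POS)]
    # ASSUMPTION: these token ranges are illustrative guesses; the slide
    # transcript only gives the label names, not their extents.
    spans += [
        Span("entity", 2, 3, "protein_molecule"),
        Span("entity", 6, 7, "organic_compound"),
        Span("entity", 9, 11, "cell_line"),
        Span("event", 4, 5, "negative regulation"),
    ]
    return spans

def labels_at(spans, token_index):
    """Return {layer: label} for every span covering the given token."""
    return {s.layer: s.label for s in spans if s.start <= token_index < s.end}

def text_of(span):
    """Return the surface text covered by a span."""
    return " ".join(TOKENS[span.start:span.end])
```

Keeping every layer as spans over the same tokens means a query can combine layers (for example, "an event whose neighbouring tokens carry an entity label") without any layer needing to know about the others.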
  7. Mining associations from MEDLINE
     • FACTA: Finding Associated Concepts with Text Analysis
       – What diseases are related to a particular chemical?
       – What proteins are related to a particular disease?
       – etc.
     • EBIMed http://www.ebi.ac.uk/Rebholz-srv/ebimed/index.jsp
     • PubMatrix http://pubmatrix.grc.nia.nih.gov/
     • FACTA http://text0.mib.man.ac.uk/software/facta/
       – Quick and interactive
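The kind of question listed on this slide ("what diseases are related to a particular chemical?") can be illustrated with a simple co-occurrence count. The slides do not describe how FACTA actually ranks associations, so the sketch below is an assumption-laden stand-in: the function name, the document representation (one set of recognised concepts per abstract) and the placeholder concept names are all hypothetical.

```python
from collections import Counter

def associated_concepts(documents, query, top_n=3):
    """Rank concepts by how many documents they share with `query`.

    `documents` is a list of sets; each set holds the concepts recognised
    in one abstract (ASSUMPTION: concept recognition has already been done).
    """
    counts = Counter()
    for concepts in documents:
        if query in concepts:
            counts.update(c for c in concepts if c != query)
    # Sort by descending count, then by name so that ties are deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

# Placeholder data, not real MEDLINE content.
DOCS = [
    {"chemical A", "disease X"},
    {"chemical A", "disease X", "protein P"},
    {"chemical A", "disease Y"},
    {"chemical B", "disease Y"},
]
```

With the placeholder data, `associated_concepts(DOCS, "chemical A")` puts "disease X" first because it shares two documents with the query. A real system would likely normalise for concept frequency rather than use raw counts.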
  8. Query
  9. Click!
  10. Innovative Technologies applied to:
     • Term recognition
     • Named entity recognition
     • Fact extraction
     Semantic Mark-up:
     • semantic mark-up improves search
     • classifying, linking documents
     • knowledge discovery, hidden links, associations, hypothesis generation
  11. Natural Language Processing technologies
     • Part-of-speech tagging: GENIA
       – Tuned to biomedical text: 97-99% precision
     • Dictionary-based named-entity recognition
     • Deep parsing
       – Predicate argument relations (90%)
     • Protein-protein interaction extraction
     • Event / fact extraction
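Dictionary-based named-entity recognition, listed above, is commonly implemented as a longest-match lookup of token sequences in a lexicon. The following is a minimal sketch under that assumption, not the NaCTeM implementation; the function name, the `max_len` parameter and the lexicon entries are hypothetical.

```python
def dictionary_ner(tokens, lexicon, max_len=5):
    """Greedy longest-match lookup.

    `lexicon` maps tuples of lower-cased tokens to an entity type.
    Returns a list of (start, end, entity_type) with `end` exclusive.
    """
    matches = []
    i = 0
    while i < len(tokens):
        found = None
        # Try the longest candidate first so "u937 cells" beats "u937".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in lexicon:
                found = (i, i + n, lexicon[key])
                break
        if found:
            matches.append(found)
            i = found[1]   # continue after the matched span
        else:
            i += 1
    return matches

# ASSUMPTION: illustrative lexicon entries, not taken from a real resource.
LEXICON = {
    ("tnf",): "protein_molecule",
    ("u937",): "cell_line",
    ("u937", "cells"): "cell_line",
}
```

Preferring the longest match is a design choice that avoids splitting multi-word terms, at the cost of missing entities nested inside a longer one.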
  12. Automatic Term Recognition
     http://www.nactem.ac.uk/software/termine/
  13. Recognising and Disambiguating Acronyms in Biomedical Literature
     http://www.nactem.ac.uk/software/acromine
  14. Named-entity recognition
     Example: The peri-kappa B site mediates human immunodeficiency virus type 2 enhancer activation in monocytes …
     Labels shown in the example: DNA, virus, cell_type
     • Entity types (defined by ontologies)
       – Genes/protein names
       – Enzymes, substances, metabolites, etc.
       – GO ontology, KEGG, ChEBI, etc.
  15. Leveraging resources
     • Annotated texts (GENIA corpus, GENIA event corpus)
     • Resources for bio-text mining
       – resource-building NLP tools for text-based knowledge harvesting (NaCTeM)
       – BioLexicon
         • Over 1.5M lexical entries for bio-text mining and growing…
         • Containing rich linguistic information for bio-text mining
  16. Population Process
     [Figure: population of the BioLexicon. Labels: existing repositories (chemical, disease, enzyme, species names; gene/protein names); subclustering of term variants; Medline abstracts; named entity recognition (new gene/protein names); term mapping by normalization; terminological verbs; verb subcategorization (verb subcategorization frames); manual curation on-going]
  17. Semantic search based on facts
     • MEDIE: an interactive advanced IR system retrieving facts
     • Performs a semantic search
     • Core technology annotates texts
       – GENIA tagger
       – syntactic structures
       – Enju (deep parser)
       – facts
       – Dictionary-based named entity recognition
     J. Tsujii
  18. MEDIE system overview
     [Figure: system architecture]
     • Off-line: Input Textbase → Deep parser and Entity Recognizer → Semantically-annotated Textbase
     • On-line: Query → Region Algebra Search engine → Search results
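The off-line / on-line split in this overview can be illustrated with a toy fact index: facts are stored ahead of time, and queries are answered against the stored facts rather than the raw text. This sketch does not implement region algebra and is not MEDIE's design; it swaps in a plain in-memory index of (subject, verb, object) triples. The `FactIndex` class, its methods and the placeholder facts are hypothetical, and the triples are assumed to come from a parser that is not shown.

```python
from collections import defaultdict

class FactIndex:
    """Toy index over (subject, verb, object) facts, keyed by verb."""

    def __init__(self):
        self._facts = []                    # (sentence_id, subject, verb, object)
        self._by_verb = defaultdict(list)   # verb -> positions in self._facts

    def add(self, sentence_id, subject, verb, obj):
        """Off-line step: store one fact produced by the (assumed) parser."""
        self._by_verb[verb].append(len(self._facts))
        self._facts.append((sentence_id, subject, verb, obj))

    def search(self, subject=None, verb=None, obj=None):
        """On-line step: return sentence ids matching every given field.

        A field left as None acts as a wildcard.
        """
        if verb is not None:
            candidates = self._by_verb.get(verb, [])
        else:
            candidates = range(len(self._facts))
        hits = []
        for idx in candidates:
            sid, s, _, o = self._facts[idx]
            if subject is not None and s != subject:
                continue
            if obj is not None and o != obj:
                continue
            hits.append(sid)
        return hits

def build_demo_index():
    # Placeholder facts, not extracted from real abstracts.
    index = FactIndex()
    index.add("s1", "protein A", "activate", "protein B")
    index.add("s2", "protein C", "activate", "protein B")
    index.add("s3", "protein A", "inhibit", "protein D")
    return index
```

For example, `build_demo_index().search(verb="activate", obj="protein B")` answers "what activates protein B?" by returning the ids of the matching sentences, which is the fact-oriented style of query the slides describe.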
  19. Sentence Retrieval System Using Semantic Representation: MEDIE
  20. InfoPubMed
     • An interactive Information Extraction system and an efficient PubMed search tool, helping users to find information about biomedical entities such as genes, proteins, and the interactions between them.
     • System components
       – Deep parsing technology
       – Extraction of protein-protein interactions
       – Multi-window interface on a browser
  21. InfoPubMed
     Interactions and not just co-occurrences. Calculated using ML and deep semantics.
  22. Semantic Information Retrieval
     http://nactem4.mc.man.ac.uk:8080/Kleio/
     • KLEIO: a semantically enriched information retrieval system for biology
     • Offers textual and metadata searches across MEDLINE
     • Leverages terminology technologies
     • Named entity recognition: gene, protein, metabolite, organ, disease, symptom
  23. KLEIO architecture
  24. Fewer documents with more precise query
  25. Linking and enriching pathways with text
     – REFINE (BBSRC)
       • MCISB and NaCTeM (Kell, Ananiadou, Tsujii)
     – to integrate text mining techniques with visualisation technologies for better understanding of the evidence for biochemical and signalling pathways
     – to enrich pathway models encoded in the Systems Biology Markup Language (SBML) with evidence derived from text mining
  26. 2 Steps for linking text with pathways
     [Figure: Literature → Event Extraction → Biological events → Pathway Construction → Pathways, illustrated with IkB, phosphorylated IkB (IkB P) and ubiquitinated IkB (IkB U)]
     • Literature snippets: “… IkappaB is phosphorylated …”, “… Ikappa B ubiquitination …”, “… degradation of IkB …”
     Tsujii-lab, Tokyo
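The event extraction step on this slide can be illustrated with a deliberately naive trigger-word matcher run over the three literature snippets. The work credited on these slides relies on deep parsing and annotated corpora, so this regex approach is a swapped-in simplification; the `TRIGGERS` table, the `ENTITY` pattern and the function name are all hypothetical.

```python
import re

# ASSUMPTION: a hand-written trigger list; real systems learn or curate these.
TRIGGERS = {
    "phosphorylation": re.compile(r"phosphorylat", re.IGNORECASE),
    "ubiquitination": re.compile(r"ubiquitinat", re.IGNORECASE),
    "degradation": re.compile(r"degrad", re.IGNORECASE),
}

# Matches the spelling variants seen on the slide: IkappaB, Ikappa B, IkB.
ENTITY = re.compile(r"I\s?-?kappa\s?B|IkB", re.IGNORECASE)

def extract_events(snippet):
    """Return (event_type, normalised_entity) pairs found in a snippet."""
    if not ENTITY.search(snippet):
        return []
    # All spelling variants are normalised to the single name "IkB".
    return [(etype, "IkB") for etype, pattern in TRIGGERS.items()
            if pattern.search(snippet)]

SNIPPETS = [
    "IkappaB is phosphorylated",
    "Ikappa B ubiquitination",
    "degradation of IkB",
]
```

Normalising the name variants to one identifier is what lets events from different sentences attach to the same pathway node in the second step.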
  27. Event Annotation - Example
  28. Statistics & References
     • Statistics
       – 36,114 events have been identified in and annotated on
         • 1,000 Medline abstracts, which contain
         • 9,372 sentences
     • References
       – Kim, Jin-Dong, Tomoko Ohta and Jun'ichi Tsujii (2008). Corpus annotation for mining biomedical events from literature. BMC Bioinformatics.
       – http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA
  29. Acknowledgements
     • Junichi Tsujii and his lab (University of Tokyo): MEDIE, InfoPubMed, event annotation
     • Yoshimasa Tsuruoka (NER, FACTA, KLEIO, REFINE)
     • Naoaki Okazaki (TerMine, AcroMine)
     • Yutaka Sasaki (BioLexicon, NER, KLEIO)
     • John McNaught (BioLexicon, BOOTStrep project)
     • Chikashi Nobata (KLEIO)
     • Douglas Kell (REFINE)