Text mining tools for semantically enriching the scientific literature
Sophia Ananiadou
Director, National Centre for Text Mining
School of Computer Science, University of Manchester
Need for enriching the literature
• Need for semantic search, i.e. search that goes beyond keywords
• Need for technologies that enable focused
  semantic search by creating semantic
  metadata from the literature

 “The current scientific literature, were it to be
  presented in semantically accessible form,
  contains huge amounts of undiscovered
  science”
  Peter Murray-Rust, Data-driven science: A Scientist’s view.
  NSF/JISC Repositories Workshop, 2007
Impact of text mining
• Extraction of named entities (genes, proteins,
  metabolites, etc.)
• Discovery of concepts allows semantic annotation of
  documents
   – Improves information access by going beyond index
     terms, enabling semantic querying
   – Improves clustering, classification of documents
   – Visualisation based on semantic metadata derived
     from text mining results
Beyond named entities: facts
• Extraction of relationships, events (facts)
  for knowledge discovery
  – Information extraction, more sophisticated
    annotation of texts (fact annotation)
  – Enables even more advanced semantic
    querying
Enriched annotation
• Text Mining provides enriched annotation
  layers
  – The user can run an easily expressed
    semantic query that returns the facts
    matching it, rather than just sets of
    documents that must then be read
    • Information Extraction and not just Information
      Retrieval
    • Fact extraction and not just sentence extraction
Annotations derived from Text Mining

[Figure: text processing pipeline. Raw (unstructured) text passes through part-of-speech tagging, named entity recognition and deep syntactic parsing, supported by a lexicon and an ontology, to give annotated (structured) text.]

Multi-layered annotations for the example sentence "Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells .":
  – Part-of-speech tags: NN IN NN VBZ VBN IN NN IN JJ NN NNS .
  – Syntactic parse: S, NP, VP and PP constituents
  – Named entities: protein_molecule, organic_compound, cell_line
  – Event: negative regulation
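
The layered annotation in this figure can be pictured as stand-off spans over one token sequence. The following is a minimal Python sketch, not the format used by the tools in this talk. The `Span` class and the `layer` helper are hypothetical, and the token ranges given to each entity label and to the event trigger are an assumed reading of the figure.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    layer: str   # annotation layer, e.g. "pos", "entity", "event"
    start: int   # index of the first token (inclusive)
    end: int     # index after the last token (exclusive)
    label: str

# Example sentence and part-of-speech tags as they appear on the slide
tokens = "Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells .".split()
pos_tags = "NN IN NN VBZ VBN IN NN IN JJ NN NNS .".split()

# One span per token for the part-of-speech layer
annotations = [Span("pos", i, i + 1, tag) for i, tag in enumerate(pos_tags)]

# Entity and event layers; the token ranges are assumptions read off the figure
annotations += [
    Span("entity", 2, 3, "protein_molecule"),    # TNF
    Span("entity", 6, 7, "organic_compound"),    # BHA
    Span("entity", 9, 11, "cell_line"),          # U937 cells
    Span("event", 4, 5, "negative regulation"),  # trigger assumed to be "abolished"
]

def layer(name):
    """Return (text, label) pairs for every span in the named layer."""
    return [(" ".join(tokens[s.start:s.end]), s.label)
            for s in annotations if s.layer == name]

entities = layer("entity")
```

Keeping each layer as spans over shared token indices means a new layer can be added without rewriting the text or the existing layers.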
Mining associations from MEDLINE
• FACTA: Finding Associated Concepts with
  Text Analysis
   – What diseases are related to a particular chemical?
   – What proteins are related to a particular disease?
   – etc.

• EBIMed http://www.ebi.ac.uk/Rebholz-srv/ebimed/index.jsp
• PubMatrix http://pubmatrix.grc.nia.nih.gov/
• …
• FACTA http://text0.mib.man.ac.uk/software/facta/
   – Quick and interactive
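
A question such as "what diseases are related to a particular chemical?" can be approximated by counting how often concepts co-occur in the same abstract. The sketch below illustrates that general idea only; it is not FACTA's implementation. The concept names, the toy documents and the function `associated` are all hypothetical, and pointwise mutual information is used here as one common association score.

```python
import math
from collections import Counter

# Hypothetical toy input: each abstract reduced to the set of concepts found in it
docs = [
    {"chemical_A", "disease_X"},
    {"chemical_A", "disease_Y"},
    {"chemical_A", "disease_X", "disease_Y"},
    {"chemical_B", "disease_Z"},
]

def associated(query, docs):
    """Rank concepts that co-occur with `query`, with counts and PMI scores."""
    n = len(docs)
    freq = Counter(c for d in docs for c in d)          # document frequency per concept
    joint = Counter(c for d in docs if query in d
                    for c in d if c != query)           # co-occurrence with the query
    ranked = []
    for concept, count in joint.items():
        # PMI = log2( P(query, concept) / (P(query) * P(concept)) )
        pmi = math.log2(count * n / (freq[query] * freq[concept]))
        ranked.append((concept, count, pmi))
    # Most frequent co-occurrences first, ties broken alphabetically
    return sorted(ranked, key=lambda r: (-r[1], r[0]))

results = associated("chemical_A", docs)
```

Because only counts are needed at query time, the expensive concept recognition can be done once in advance, which is one way a tool of this kind could stay quick and interactive.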
Innovative Technologies applied to:
• Term recognition
• Named entity recognition
• Fact extraction
• Together these provide semantic mark-up
  – Semantic mark-up improves search
  – Classifying, linking documents
  – Knowledge discovery, hidden links,
    associations, hypothesis generation
Natural Language Processing technologies
• Part-of-speech tagging: GENIA
  – Tuned to biomedical text: 97-99% precision
• Dictionary-based named-entity recognition
• Deep parsing
  – Predicate argument relations (90%)
• Protein-protein interaction extraction
• Event / fact extraction
Automatic Term Recognition




http://www.nactem.ac.uk/software/termine/
Recognising and Disambiguating Acronyms in Biomedical Literature




        http://www.nactem.ac.uk/software/acromine
Named-entity recognition

    The peri-kappa B site mediates human immunodeficiency
    virus type 2 enhancer activation in monocytes …
    Entity labels shown in the example: DNA, virus, cell_type

• Entity types (defined by ontologies)
   – Gene/protein names
   – Enzymes, substances, metabolites, etc.
   – GO ontology, KEGG, ChEBI, etc.
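
Dictionary-based named-entity recognition can be sketched as a greedy longest-match lookup over the tokens of a sentence. This is a minimal illustration under stated assumptions, not the recogniser used by the tools in this talk. The `DICTIONARY` contents and the `tag` function are hypothetical, and the spans attached to each label are an assumed reading of the example above.

```python
# Hypothetical toy dictionary mapping token sequences to entity types
DICTIONARY = {
    ("peri-kappa", "B", "site"): "DNA",
    ("human", "immunodeficiency", "virus", "type", "2"): "virus",
    ("monocytes",): "cell_type",
}
MAX_LEN = max(len(key) for key in DICTIONARY)

def tag(tokens):
    """Greedy longest-match lookup; returns (entity text, label) pairs in order."""
    found, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first so longer names win over their prefixes
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            label = DICTIONARY.get(tuple(tokens[i:i + n]))
            if label:
                found.append((" ".join(tokens[i:i + n]), label))
                i += n
                break
        else:
            i += 1  # no dictionary entry starts at this token
    return found

sentence = ("The peri-kappa B site mediates human immunodeficiency "
            "virus type 2 enhancer activation in monocytes")
entities = tag(sentence.split())
```

A real dictionary would come from resources such as the ontologies listed above, and exact matching would need to be relaxed to cope with spelling variants.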
Leveraging resources
• Annotated texts (GENIA corpus, GENIA
  event corpus)
• Resources for bio-text mining
  – resource-building NLP tools for text-based
    knowledge harvesting (NaCTeM)
  – BioLexicon
    • Over 1.5M lexical entries for bio-text mining,
      and growing
    • Containing rich linguistic information for bio-text
      mining
Population Process
[Figure: BioLexicon population process]
• Inputs
   – Existing repositories: chemical, disease, enzyme, species names; gene/protein names
   – Medline abstracts: new gene/protein names via named entity recognition
• Processing steps
   – Subclustering of term variants
   – Term mapping by normalization
   – Manual curation: terminological verbs
   – Verb subcategorization (on-going): verb subcategorization frames
• Output: Bio-Lexicon
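
Term mapping by normalization can be illustrated by reducing each surface form to a canonical key and grouping the forms that share a key. The sketch below is an assumption about what such a step might look like, not the BioLexicon procedure. The normalization rules, the function names and the variant "I-kappa-B" are hypothetical; the other variants appear elsewhere in these slides.

```python
import re
from collections import defaultdict

def normalize(term):
    """Lower-case the term and drop spaces, hyphens, underscores and slashes."""
    return re.sub(r"[\s\-_/]+", "", term.lower())

def cluster_variants(terms):
    """Group surface forms that share the same normalized key."""
    clusters = defaultdict(list)
    for term in terms:
        clusters[normalize(term)].append(term)
    return dict(clusters)

# "I-kappa-B" is an assumed extra variant added for illustration
variants = ["IkappaB", "Ikappa B", "I-kappa-B", "IkB"]
clusters = cluster_variants(variants)
```

Purely orthographic rules merge spacing and hyphenation variants but leave abbreviations such as "IkB" in a separate cluster, which is one reason dictionaries and manual curation remain part of the process.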
Semantic search based on facts
• MEDIE: an interactive advanced IR
  system retrieving facts
• Performs a semantic search
• Core technology annotates texts
   – GENIA tagger
   – Enju (deep parser)
   – Dictionary-based named entity recognition
   – Resulting annotations: syntactic structures, facts
(J. Tsujii)
Medie system overview
[Figure: MEDIE system overview]
• Off-line: input textbase → deep parser and entity recognizer → semantically-annotated textbase
• On-line: query → region algebra search engine over the semantically-annotated textbase → search results
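
The split between off-line annotation and on-line search can be sketched as a pre-built index of facts that is filtered at query time. This is a simplified illustration, not MEDIE's implementation: a linear scan stands in for the region algebra search engine, and the `Fact` type, the hand-written index entries and the `search` function are hypothetical. The example sentences come from elsewhere in these slides.

```python
from typing import NamedTuple, Optional

class Fact(NamedTuple):
    subject: Optional[str]  # semantic subject; None when the text does not state it
    verb: str               # base form of the predicate
    obj: str                # semantic object
    sentence: str           # source sentence

# Off-line step: assumed output of parsing and entity recognition, written by hand here
INDEX = [
    Fact("BHA", "abolish", "secretion of TNF",
         "Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells."),
    Fact(None, "phosphorylate", "IkappaB",
         "… IkappaB is phosphorylated …"),
]

def search(subject=None, verb=None, obj=None):
    """On-line step: return facts matching every field given (None = wildcard)."""
    def matches(wanted, actual):
        if wanted is None:
            return True
        return actual is not None and wanted.lower() in actual.lower()
    return [f for f in INDEX
            if matches(subject, f.subject)
            and matches(verb, f.verb)
            and matches(obj, f.obj)]

hits = search(verb="abolish", obj="TNF")
```

Storing the semantic subject and object, rather than the surface word order, lets a query for "BHA abolishes X" match the passive sentence on the slide.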
MEDIE: Sentence Retrieval System Using Semantic Representation
InfoPubMed
• An interactive Information Extraction system and
  an efficient PubMed search tool, helping users to
  find information about biomedical entities such
  as genes, proteins, and the interactions
  between them.
• System components
   – Deep parsing technology
   – Extraction of protein-protein interactions
   – Multi-window interface on a browser
InfoPubMed
• Interactions and not just co-occurrences
• Calculated using ML and deep semantics
Semantic Information Retrieval
        http://nactem4.mc.man.ac.uk:8080/Kleio/


• KLEIO: a semantically enriched
  information retrieval system for biology
• Offers textual and metadata searches
  across MEDLINE
• Leverages terminology technologies
   – Named entity recognition: gene, protein,
     metabolite, organ, disease, symptom
KLEIO architecture
Fewer documents with a more precise query
Linking and enriching pathways with text

• REFINE (BBSRC)
   – MCISB and NaCTeM (Kell, Ananiadou, Tsujii)
• Aims to integrate text mining techniques with
  visualisation technologies for better
  understanding of the evidence for biochemical
  and signalling pathways
• Aims to enrich pathway models encoded in the
  Systems Biology Markup Language (SBML)
  with evidence derived from text mining
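2 Steps for linking text with pathways
[Figure: literature → biological events → pathways (Tsujii-lab, Tokyo)]
• Step 1, event extraction: literature → biological events
   – Literature: "… IkappaB is phosphorylated …", "… Ikappa B ubiquitination …", "… degradation of IkB…"
   – Biological events: IkB → IkB P, IkB → IkB U, degradation of IkB
• Step 2, pathway construction: biological events → pathways
   – Pathway: IkB → IkB P → IkB U → degradation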
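
The event extraction step can be illustrated with the three literature snippets on this slide. The sketch below uses simple trigger-word patterns, which is a deliberately naive stand-in for the parser-based extraction described in this talk. The `TRIGGERS` table, the `ENTITY` pattern and the `extract_events` function are hypothetical.

```python
import re

# Hypothetical trigger patterns mapped to event types
TRIGGERS = {
    r"phosphorylat\w*": "phosphorylation",
    r"ubiquitinat\w*": "ubiquitination",
    r"degrad\w*": "degradation",
}
# Surface variants of the protein name seen on the slide, normalized to "IkB"
ENTITY = re.compile(r"I\s?kappa\s?B|IkB", re.IGNORECASE)

def extract_events(snippet):
    """Return (event type, entity) pairs for triggers found next to a known entity."""
    events = []
    if ENTITY.search(snippet):
        for pattern, event_type in TRIGGERS.items():
            if re.search(pattern, snippet, re.IGNORECASE):
                events.append((event_type, "IkB"))
    return events

snippets = [
    "IkappaB is phosphorylated",
    "Ikappa B ubiquitination",
    "degradation of IkB",
]
events = [event for s in snippets for event in extract_events(s)]
```

Each extracted pair keeps a link to the snippet it came from, so a pathway step built from it can cite the supporting text as evidence.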
Event Annotation - Example
Statistics & References
• Statistics
   – 36,114 events have been identified and
     annotated in 1,000 Medline abstracts,
     which contain 9,372 sentences
• References
   – Kim, Jin-Dong, Tomoko Ohta and Jun'ichi
     Tsujii (2008) Corpus annotation for
     mining biomedical events from
     literature. BMC Bioinformatics
   – http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA
Acknowledgements
• Junichi Tsujii and his lab, University of Tokyo (MEDIE,
  InfoPubMed, event annotation)
• Yoshimasa Tsuruoka (NER, FACTA, KLEIO, REFINE)
• Naoaki Okazaki (TerMine, AcroMine)
• Yutaka Sasaki (BioLexicon, NER, KLEIO)
• John McNaught (BioLexicon, BOOTStrep project)
• Chikashi Nobata (KLEIO)
• Douglas Kell (REFINE)
