SlideShare a Scribd company logo
1 of 21
Syntactic-semantic analysis for
information extraction in biomedicine
Sérgio Matos1, Anabela Barreiro2
1IEETA, Universidade de Aveiro
2Centro de Linguística, Universidade do Porto
aleixomatos@ua.pt; barreiro_anabela@hotmail.com
June 2009
Outline
• Background
• Text Mining and Information Extraction in
Biomedicine
• Objectives
• Implementation
• Results
• Conclusions
Background
• Genomics and Proteomics are fast-growing fields
• Literature grows exponentially
– MEDLINE/PubMed ~ 18m citations
• Researchers need to contextualize their theories and findings
– Interactions between genes/proteins
– Involvement in biological processes and in disease
– And many other factors...
• How to keep up-to-date with new knowledge in the
field?
Background
• Manually curated biomedical databases are a good source of
information
– Publications are reviewed and important information added to DBs
(e.g. protein interactions)
– Impossible to keep DBs up-to-date due to increased volume of
publications
• Text Mining can be useful for
– Information retrieval (IR)
– Information extraction (IE)
– DB curators and end-users (researchers)
Text Mining and Information Extraction
in Biomedicine
• Text mining deals with the automated processing of texts to
derive high quality information
• Information Extraction can be seen as one application of TM
• Different processing levels
• Entity Recognition (ER) genes, proteins, etc.
• Normalization ATF2 - GeneID 1386
ATF-2 – Uniprot P15336
• Relation extraction PPI, gene/disease
• Event extraction gene expression, regulation
+ semantics + domain knowledge
Text Mining and Information Extraction
in Biomedicine
• Good results for NER, but limited to a few entity types
– 80%-90% for recognition of genes/proteins
– Need to include more entities, like chemical compounds, diseases,
experimental conditions
• Relation extraction has focused mostly on PPI
• Inter-concept relations not too explored
– e.g. gene/disease, drug/target
– mostly based on co-occurence statistics
Text Mining and Information Extraction
in Biomedicine
• Recent interest towards extraction of events
– BioNLP shared task and BioCreaTive II.5
• ... and other entities / facts
– e.g. Experimental conditions, lab techniques, measurements
• ... Discourse analysis
– “indicating/suggesting that...”, “in contrast...”
• Full-text vs. Abstracts
– Complexity in grammar
Linguistic Resources for Biomedical TM
• UMLS Metathesaurus
– various terms, all linked to same concept (e.g. ‘Hypertension’)
– semantic information provided by the UMLS Semantic Network
• BioLexicon
– Includes domain relevant verbs (localize, bind, express, …)
• Lexical resources can be created from available online DBs
– NCBI Entrez Gene for gene names
– UniProt for proteins
– OMIM for diseases
– Various ontologies
Objectives
• Extract phrases indicating a biomolecular event from
scientific text
• Biomolecular events include various types
– Examples
• “phosphorylation of TRAF2”
• “localization of beta-catenin”
• “TRADD interacts with TES2”
• BioNLP'09 Shared Task on Event Extraction
– http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/SharedTask/
Objectives
• Six event types considered
– Localization, Binding, Gene expression, Transcription, Protein
Catabolism, Phosphorylation
• Training data
– Annotation of genes/proteins occurring in each input text, including
the text section (start and end characters)
– Annotation of the events, including the event type, the participating
entities and the corresponding trigger word (with start and end times)
• Test data
– Annotation of participating genes/proteins is given
– Create annotation of events for the given entities
Implementation
• General approach
– Create syntactic grammars to detect phrases that indicate events
– Grammars are based only on NEs and domain verbs (and derived
names)
• Requisites
– Grammars outputs should indicate the event type
• Solution
– Event types can be associated with the trigger word using the
semantic properties in NooJ dictionaries
– Event types associated with each trigger word are derived from
training data
Implementation
• Resources
– Entity dictionary
• Create dictionary with list of entities occurring in the texts
Implementation
Lemma PoS FLX
Semantic
properties
ID TAXID
human N TABLE ORGANISM 9606
Homo sapiens N ORGANISM 9606
Mus musculus N ORGANISM 10090
Breast cancer type 1
susceptibility protein
N PROTEIN P38398 9606
BRCA1 N PROTEIN P38398 9606
BRCA1 N PROTEIN P48754 10090
BRCA1 N GENE 672 9606
RNF53 N GENE 672 9606
Implementation
• Resources
– Entity dictionary
• Create dictionary with list of entities occurring in the texts
– BioLexicon verb dictionary
• Adapted to include event type
– From the training data, extract the verbs associated with events
– Add a semantic property to the dictionary entry indicating the event type
– Example: “express,V+EventType=Gene_Expression”
• Added inflectional and derivation rules
– The inflected and derivated forms inherit the verb’s semantic properties
• Verb dictionary
Implementation
Lemma PoS DRV FLX EventType
express V ION:TABLE ABOLISH Gene_expression
ligate V TION:TABLE SMILE Binding
stimulate V TION:TABLE SMILE Positive_regulation
Implementation
• Syntactic grammars
– Sentences from training set used to generate surface
patterns
– Manual procedure
– Seven grammars created
– Example:
“stimulation of human CD4”
Implementation
Stimulation of human CD4
<EVENT+PROTEIN=CD4+EXP=Stimulation+TYPE=Positive_regulation>
Results
Pattern Concordance in text
<entity> [<entity_type>] <nominalization> HSP gene expression
<nominalization> “of” [<entity_type>] <entity> upregulation of Fas
<entity> [<entity_type>] <be> [“not”] [<adverb>] <verb>
IL-2R stimulation was totally
inhibited
<verb> <preposition> <entity> binding of TRAF2
<verb> <nominalization> “of” <entity>
suppressing activation of
STAT6
• Example patterns extracted from texts
Results
Event type Recall Precision F-score
Localization 35.63 70.45 47.33
Binding 13.54 34.06 19.38
Gene Expression 46.40 78.45 58.31
Transcription 33.58 41.07 36.95
Protein Catabolism 35.71 62.50 45.45
Phosphorylation 49.63 79.76 61.19
Average 36.76 65.58 47.11
• Average results
Conclusions
• NooJ syntactic grammars for IE
– Simple and flexible approach
– Takes advantage of semantic properties and inflectional and
derivational morphology in NooJ dictionaries
• Pattern identification
– Manual method is limited
– How to generate new patterns automatically ?
• Gene regulatory events
– Described by complex constructions
– Can syntactic grammars be used for this type of events ?
References and Acknowledgments
• BioLexicon was developed within the BOOTStrep project
– http://www.nactem.ac.uk/biolexicon/
– http://www.bootstrep.eu/bin/view/Extern/WebHome
• Data set from the BioNLP’09 Shared Task on Event Extraction
– http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/SharedTask/
Sérgio Matos is funded by Fundação para a Ciência e Tecnologia (FCT)
under the Ciência2007 programme .

More Related Content

Similar to Syntactic-semantic analysis for information extraction in biomedicine

Text Mining for Biocuration of Bacterial Infectious Diseases
Text Mining for Biocuration of Bacterial Infectious DiseasesText Mining for Biocuration of Bacterial Infectious Diseases
Text Mining for Biocuration of Bacterial Infectious DiseasesDan Sullivan, Ph.D.
 
Ontologies for Semantic Normalization of Immunological Data
Ontologies for Semantic Normalization of Immunological DataOntologies for Semantic Normalization of Immunological Data
Ontologies for Semantic Normalization of Immunological DataYannick Pouliot
 
Bioinformatics Introduction and Use of BLAST Tool
Bioinformatics Introduction and Use of BLAST ToolBioinformatics Introduction and Use of BLAST Tool
Bioinformatics Introduction and Use of BLAST ToolJesminBinti
 
Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Combining Explicit and Latent Web Semantics for Maintaining Knowledge GraphsCombining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Combining Explicit and Latent Web Semantics for Maintaining Knowledge GraphsPaul Groth
 
Big Data Standards - Workshop, ExpBio, Boston, 2015
Big Data Standards - Workshop, ExpBio, Boston, 2015Big Data Standards - Workshop, ExpBio, Boston, 2015
Big Data Standards - Workshop, ExpBio, Boston, 2015Susanna-Assunta Sansone
 
An Up-to-date Knowledge Base and Focused Exploration System for Human Perform...
An Up-to-date Knowledge Base and Focused Exploration System for Human Perform...An Up-to-date Knowledge Base and Focused Exploration System for Human Perform...
An Up-to-date Knowledge Base and Focused Exploration System for Human Perform...Artificial Intelligence Institute at UofSC
 
The Past, Present and Future of Knowledge in Biology
The Past, Present and Future of Knowledge in BiologyThe Past, Present and Future of Knowledge in Biology
The Past, Present and Future of Knowledge in Biologyrobertstevens65
 
A knowledge capture framework for domain specific search systems
A knowledge capture framework for domain specific search systemsA knowledge capture framework for domain specific search systems
A knowledge capture framework for domain specific search systemsramakanz
 
Bioinformatics: Building the cornerstones of Sequence Homology and its use fo...
Bioinformatics: Building the cornerstones of Sequence Homology and its use fo...Bioinformatics: Building the cornerstones of Sequence Homology and its use fo...
Bioinformatics: Building the cornerstones of Sequence Homology and its use fo...OECD Environment
 
Quantifying the content of biomedical semantic resources as a core for drug d...
Quantifying the content of biomedical semantic resources as a core for drug d...Quantifying the content of biomedical semantic resources as a core for drug d...
Quantifying the content of biomedical semantic resources as a core for drug d...Syed Muhammad Ali Hasnain
 
2011-10-11 Open PHACTS at BioIT World Europe
2011-10-11 Open PHACTS at BioIT World Europe2011-10-11 Open PHACTS at BioIT World Europe
2011-10-11 Open PHACTS at BioIT World Europeopen_phacts
 
Plant Pathogen Genome Data: My Life In Sequences
Plant Pathogen Genome Data: My Life In SequencesPlant Pathogen Genome Data: My Life In Sequences
Plant Pathogen Genome Data: My Life In SequencesLeighton Pritchard
 
Amia tb-review-12
Amia tb-review-12Amia tb-review-12
Amia tb-review-12Russ Altman
 
Case Study: Public Library of Science Thesaurus: Year One
Case Study: Public Library of Science Thesaurus: Year OneCase Study: Public Library of Science Thesaurus: Year One
Case Study: Public Library of Science Thesaurus: Year OneAccess Innovations, Inc.
 
research methodology ppt-pdf-converted.pptx
research methodology ppt-pdf-converted.pptxresearch methodology ppt-pdf-converted.pptx
research methodology ppt-pdf-converted.pptxDaniyalTahir9
 
Computing on the shoulders of giants
Computing on the shoulders of giantsComputing on the shoulders of giants
Computing on the shoulders of giantsBenjamin Good
 
Workshop on Assignment 2 SCI115 Live workshop 103020.docx
Workshop on Assignment 2 SCI115 Live workshop 103020.docxWorkshop on Assignment 2 SCI115 Live workshop 103020.docx
Workshop on Assignment 2 SCI115 Live workshop 103020.docxdunnramage
 

Similar to Syntactic-semantic analysis for information extraction in biomedicine (20)

Text Mining for Biocuration of Bacterial Infectious Diseases
Text Mining for Biocuration of Bacterial Infectious DiseasesText Mining for Biocuration of Bacterial Infectious Diseases
Text Mining for Biocuration of Bacterial Infectious Diseases
 
Ontologies for Semantic Normalization of Immunological Data
Ontologies for Semantic Normalization of Immunological DataOntologies for Semantic Normalization of Immunological Data
Ontologies for Semantic Normalization of Immunological Data
 
Bioinformatics Introduction and Use of BLAST Tool
Bioinformatics Introduction and Use of BLAST ToolBioinformatics Introduction and Use of BLAST Tool
Bioinformatics Introduction and Use of BLAST Tool
 
Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Combining Explicit and Latent Web Semantics for Maintaining Knowledge GraphsCombining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
 
Chibucos annot go_final
Chibucos annot go_finalChibucos annot go_final
Chibucos annot go_final
 
Big Data Standards - Workshop, ExpBio, Boston, 2015
Big Data Standards - Workshop, ExpBio, Boston, 2015Big Data Standards - Workshop, ExpBio, Boston, 2015
Big Data Standards - Workshop, ExpBio, Boston, 2015
 
An Up-to-date Knowledge Base and Focused Exploration System for Human Perform...
An Up-to-date Knowledge Base and Focused Exploration System for Human Perform...An Up-to-date Knowledge Base and Focused Exploration System for Human Perform...
An Up-to-date Knowledge Base and Focused Exploration System for Human Perform...
 
The Past, Present and Future of Knowledge in Biology
The Past, Present and Future of Knowledge in BiologyThe Past, Present and Future of Knowledge in Biology
The Past, Present and Future of Knowledge in Biology
 
A knowledge capture framework for domain specific search systems
A knowledge capture framework for domain specific search systemsA knowledge capture framework for domain specific search systems
A knowledge capture framework for domain specific search systems
 
Bioinformatics: Building the cornerstones of Sequence Homology and its use fo...
Bioinformatics: Building the cornerstones of Sequence Homology and its use fo...Bioinformatics: Building the cornerstones of Sequence Homology and its use fo...
Bioinformatics: Building the cornerstones of Sequence Homology and its use fo...
 
Quantifying the content of biomedical semantic resources as a core for drug d...
Quantifying the content of biomedical semantic resources as a core for drug d...Quantifying the content of biomedical semantic resources as a core for drug d...
Quantifying the content of biomedical semantic resources as a core for drug d...
 
2011-10-11 Open PHACTS at BioIT World Europe
2011-10-11 Open PHACTS at BioIT World Europe2011-10-11 Open PHACTS at BioIT World Europe
2011-10-11 Open PHACTS at BioIT World Europe
 
Plant Pathogen Genome Data: My Life In Sequences
Plant Pathogen Genome Data: My Life In SequencesPlant Pathogen Genome Data: My Life In Sequences
Plant Pathogen Genome Data: My Life In Sequences
 
Ibn Sina
Ibn SinaIbn Sina
Ibn Sina
 
Amia tb-review-12
Amia tb-review-12Amia tb-review-12
Amia tb-review-12
 
Case Study: Public Library of Science Thesaurus: Year One
Case Study: Public Library of Science Thesaurus: Year OneCase Study: Public Library of Science Thesaurus: Year One
Case Study: Public Library of Science Thesaurus: Year One
 
research methodology ppt-pdf-converted.pptx
research methodology ppt-pdf-converted.pptxresearch methodology ppt-pdf-converted.pptx
research methodology ppt-pdf-converted.pptx
 
Computing on the shoulders of giants
Computing on the shoulders of giantsComputing on the shoulders of giants
Computing on the shoulders of giants
 
Online Resources to Support Open Drug Discovery Systems
Online Resources to Support Open Drug Discovery SystemsOnline Resources to Support Open Drug Discovery Systems
Online Resources to Support Open Drug Discovery Systems
 
Workshop on Assignment 2 SCI115 Live workshop 103020.docx
Workshop on Assignment 2 SCI115 Live workshop 103020.docxWorkshop on Assignment 2 SCI115 Live workshop 103020.docx
Workshop on Assignment 2 SCI115 Live workshop 103020.docx
 

More from INESC-ID (Spoken Language Systems Laboratory - L2F)

More from INESC-ID (Spoken Language Systems Laboratory - L2F) (20)

Multi3Generation@INGL2020
Multi3Generation@INGL2020Multi3Generation@INGL2020
Multi3Generation@INGL2020
 
NooJ 2020 presentation
NooJ 2020 presentationNooJ 2020 presentation
NooJ 2020 presentation
 
PROPOR2020_Barreiroetal
PROPOR2020_BarreiroetalPROPOR2020_Barreiroetal
PROPOR2020_Barreiroetal
 
Análise comparativa das edições portuguesa e brasileira de Os livros que dev...
Análise comparativa das edições portuguesa e brasileira de  Os livros que dev...Análise comparativa das edições portuguesa e brasileira de  Os livros que dev...
Análise comparativa das edições portuguesa e brasileira de Os livros que dev...
 
Welcome session 3rd Annual MC Meeting - enetCollect COST Action
Welcome session 3rd Annual MC Meeting - enetCollect COST ActionWelcome session 3rd Annual MC Meeting - enetCollect COST Action
Welcome session 3rd Annual MC Meeting - enetCollect COST Action
 
Cross language semantic relations between English and Portuguese
Cross language semantic relations between English and PortugueseCross language semantic relations between English and Portuguese
Cross language semantic relations between English and Portuguese
 
Paraphrasing biomedical support verb constructions for machine translation
Paraphrasing biomedical support verb constructions for machine translationParaphrasing biomedical support verb constructions for machine translation
Paraphrasing biomedical support verb constructions for machine translation
 
ReWriter for legal text
ReWriter for legal textReWriter for legal text
ReWriter for legal text
 
Chatbots for Language Learning
Chatbots for Language LearningChatbots for Language Learning
Chatbots for Language Learning
 
eSPERTo’s Paraphrastic Knowledge Applied to Question-Answering and Summarization
eSPERTo’s Paraphrastic Knowledge Applied to Question-Answering and SummarizationeSPERTo’s Paraphrastic Knowledge Applied to Question-Answering and Summarization
eSPERTo’s Paraphrastic Knowledge Applied to Question-Answering and Summarization
 
Barreiro et al POP@PROPOR2018-informal2formal-language
Barreiro et al POP@PROPOR2018-informal2formal-languageBarreiro et al POP@PROPOR2018-informal2formal-language
Barreiro et al POP@PROPOR2018-informal2formal-language
 
Rebelo-Arnold et al POP@PROPOR2018-EP-BP-alignments
Rebelo-Arnold et al POP@PROPOR2018-EP-BP-alignmentsRebelo-Arnold et al POP@PROPOR2018-EP-BP-alignments
Rebelo-Arnold et al POP@PROPOR2018-EP-BP-alignments
 
Barreiro-Batista-LR4NLP@Coling2018-presentation
Barreiro-Batista-LR4NLP@Coling2018-presentationBarreiro-Batista-LR4NLP@Coling2018-presentation
Barreiro-Batista-LR4NLP@Coling2018-presentation
 
Barreiro-Mota-VarDial@Coling2018-poster
Barreiro-Mota-VarDial@Coling2018-posterBarreiro-Mota-VarDial@Coling2018-poster
Barreiro-Mota-VarDial@Coling2018-poster
 
NooJ-2018-Palermo
NooJ-2018-PalermoNooJ-2018-Palermo
NooJ-2018-Palermo
 
Poster @ enetCollect CA MC meeting in Iasi, Romania
Poster @ enetCollect CA MC meeting in Iasi, Romania Poster @ enetCollect CA MC meeting in Iasi, Romania
Poster @ enetCollect CA MC meeting in Iasi, Romania
 
projeto-eSPERTo
projeto-eSPERToprojeto-eSPERTo
projeto-eSPERTo
 
ReEscreve: A Translator-Friendly Multi-Purpose Paraphrasing Software Tool
ReEscreve: A Translator-Friendly Multi-Purpose Paraphrasing Software ToolReEscreve: A Translator-Friendly Multi-Purpose Paraphrasing Software Tool
ReEscreve: A Translator-Friendly Multi-Purpose Paraphrasing Software Tool
 
Poster l2f 2017
Poster l2f 2017Poster l2f 2017
Poster l2f 2017
 
Nooj2017 cmota-etal
Nooj2017 cmota-etalNooj2017 cmota-etal
Nooj2017 cmota-etal
 

Recently uploaded

Environmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial BiosensorEnvironmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial Biosensorsonawaneprad
 
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 
zoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzohaibmir069
 
OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024innovationoecd
 
TOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physicsTOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physicsssuserddc89b
 
GenBio2 - Lesson 1 - Introduction to Genetics.pptx
GenBio2 - Lesson 1 - Introduction to Genetics.pptxGenBio2 - Lesson 1 - Introduction to Genetics.pptx
GenBio2 - Lesson 1 - Introduction to Genetics.pptxBerniceCayabyab1
 
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)riyaescorts54
 
Forest laws, Indian forest laws, why they are important
Forest laws, Indian forest laws, why they are importantForest laws, Indian forest laws, why they are important
Forest laws, Indian forest laws, why they are importantadityabhardwaj282
 
BREEDING FOR RESISTANCE TO BIOTIC STRESS.pptx
BREEDING FOR RESISTANCE TO BIOTIC STRESS.pptxBREEDING FOR RESISTANCE TO BIOTIC STRESS.pptx
BREEDING FOR RESISTANCE TO BIOTIC STRESS.pptxPABOLU TEJASREE
 
Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Patrick Diehl
 
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxAnalytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxSwapnil Therkar
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxpriyankatabhane
 
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...lizamodels9
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
FREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by naFREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by naJASISJULIANOELYNV
 
The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxEran Akiva Sinbar
 
‏‏VIRUS - 123455555555555555555555555555555555555555
‏‏VIRUS -  123455555555555555555555555555555555555555‏‏VIRUS -  123455555555555555555555555555555555555555
‏‏VIRUS - 123455555555555555555555555555555555555555kikilily0909
 
Pests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdfPests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdfPirithiRaju
 

Recently uploaded (20)

Environmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial BiosensorEnvironmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial Biosensor
 
Engler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomyEngler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomy
 
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
zoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistan
 
OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024
 
TOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physicsTOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physics
 
GenBio2 - Lesson 1 - Introduction to Genetics.pptx
GenBio2 - Lesson 1 - Introduction to Genetics.pptxGenBio2 - Lesson 1 - Introduction to Genetics.pptx
GenBio2 - Lesson 1 - Introduction to Genetics.pptx
 
Volatile Oils Pharmacognosy And Phytochemistry -I
Volatile Oils Pharmacognosy And Phytochemistry -IVolatile Oils Pharmacognosy And Phytochemistry -I
Volatile Oils Pharmacognosy And Phytochemistry -I
 
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
 
Forest laws, Indian forest laws, why they are important
Forest laws, Indian forest laws, why they are importantForest laws, Indian forest laws, why they are important
Forest laws, Indian forest laws, why they are important
 
BREEDING FOR RESISTANCE TO BIOTIC STRESS.pptx
BREEDING FOR RESISTANCE TO BIOTIC STRESS.pptxBREEDING FOR RESISTANCE TO BIOTIC STRESS.pptx
BREEDING FOR RESISTANCE TO BIOTIC STRESS.pptx
 
Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?
 
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxAnalytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
 
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
 
FREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by naFREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by na
 
The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptx
 
‏‏VIRUS - 123455555555555555555555555555555555555555
‏‏VIRUS -  123455555555555555555555555555555555555555‏‏VIRUS -  123455555555555555555555555555555555555555
‏‏VIRUS - 123455555555555555555555555555555555555555
 
Pests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdfPests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdf
 

Syntactic-semantic analysis for information extraction in biomedicine

  • 1. Syntactic-semantic analysis for information extraction in biomedicine Sérgio Matos1, Anabela Barreiro2 1IEETA, Universidade de Aveiro 2Centro de Linguística, Universidade do Porto aleixomatos@ua.pt; barreiro_anabela@hotmail.com June 2009
  • 2. Outline • Background • Text Mining and Information Extraction in Biomedicine • Objectives • Implementation • Results • Conclusions
  • 3. Background • Genomics and Proteomics are fast-growing fields • Literature grows exponentially – MEDLINE/PubMed ~ 18m citations • Researchers need to contextualize their theories and findings – Interactions between genes/proteins – Involvement in biological processes and in disease – And many other factors... • How to keep up-to-date with new knowledge in the field?
  • 4. Background • Manually curated biomedical databases are a good source of information – Publications are reviewed and important information added to DBs (e.g. protein interactions) – Impossible to keep DBs up-to-date due to increased volume of publications • Text Mining can be useful for – Information retrieval (IR) – Information extraction (IE) – DB curators and end-users (researchers)
  • 5. Text Mining and Information Extraction in Biomedicine • Text mining deals with the automated processing of texts to derive high quality information • Information Extraction can be seen as one application of TM • Different processing levels • Entity Recognition (ER) genes, proteins, etc. • Normalization ATF2 - GeneID 1386 ATF-2 – Uniprot P15336 • Relation extraction PPI, gene/disease • Event extraction gene expression, regulation + semantics + domain knowledge
  • 6. Text Mining and Information Extraction in Biomedicine • Good results for NER, but limited to a few entity types – 80%-90% for recognition of genes/proteins – Need to include more entities, like chemical compounds, diseases, experimental conditions • Relation extraction has focused mostly on PPI • Inter-concept relations not too explored – e.g. gene/disease, drug/target – mostly based on co-occurence statistics
  • 7. Text Mining and Information Extraction in Biomedicine • Recent interest towards extraction of events – BioNLP shared task and BioCreaTive II.5 • ... and other entities / facts – e.g. Experimental conditions, lab techniques, measurements • ... Discourse analysis – “indicating/suggesting that...”, “in contrast...” • Full-text vs. Abstracts – Complexity in grammar
  • 8. Linguistic Resources for Biomedical TM • UMLS Metathesaurus – various terms, all linked to same concept (e.g. ‘Hypertension’) – semantic information provided by the UMLS Semantic Network • BioLexicon – Includes domain relevant verbs (localize, bind, express, …) • Lexical resources can be created from available online DBs – NCBI Entrez Gene for gene names – UniProt for proteins – OMIM for diseases – Various ontologies
  • 9. Objectives • Extract phrases indicating a biomolecular event from scientific text • Biomolecular events include various types – Examples • “phosphorylation of TRAF2” • “localization of beta-catenin” • “TRADD interacts with TES2” • BioNLP'09 Shared Task on Event Extraction – http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/SharedTask/
  • 10. Objectives • Six event types considered – Localization, Binding, Gene expression, Transcription, Protein Catabolism, Phosphorylation • Training data – Annotation of genes/proteins occurring in each input text, including the text section (start and end characters) – Annotation of the events, including the event type, the participating entities and the corresponding trigger word (with start and end times) • Test data – Annotation of participating genes/proteins is given – Create annotation of events for the given entities
  • 11. Implementation • General approach – Create syntactic grammars to detect phrases that indicate events – Grammars are based only on NEs and domain verbs (and derived names) • Requisites – Grammars outputs should indicate the event type • Solution – Event types can be associated with the trigger word using the semantic properties in NooJ dictionaries – Event types associated with each trigger word are derived from training data
  • 12. Implementation • Resources – Entity dictionary • Create dictionary with list of entities occurring in the texts
  • 13. Implementation Lemma PoS FLX Semantic properties ID TAXID human N TABLE ORGANISM 9606 Homo sapiens N ORGANISM 9606 Mus musculus N ORGANISM 10090 Breast cancer type 1 susceptibility protein N PROTEIN P38398 9606 BRCA1 N PROTEIN P38398 9606 BRCA1 N PROTEIN P48754 10090 BRCA1 N GENE 672 9606 RNF53 N GENE 672 9606
  • 14. Implementation • Resources – Entity dictionary • Create dictionary with list of entities occurring in the texts – BioLexicon verb dictionary • Adapted to include event type – From the training data, extract the verbs associated with events – Add a semantic property to the dictionary entry indicating the event type – Example: “express,V+EventType=Gene_Expression” • Added inflectional and derivation rules – The inflected and derivated forms inherit the verb’s semantic properties
  • 15. • Verb dictionary Implementation Lemma PoS DRV FLX EventType express V ION:TABLE ABOLISH Gene_expression ligate V TION:TABLE SMILE Binding stimulate V TION:TABLE SMILE Positive_regulation
  • 16. Implementation • Syntactic grammars – Sentences from training set used to generate surface patterns – Manual procedure – Seven grammars created – Example: “stimulation of human CD4”
  • 17. Implementation Stimulation of human CD4 <EVENT+PROTEIN=CD4+EXP=Stimulation+TYPE=Positive_regulation>
  • 18. Results Pattern Concordance in text <entity> [<entity_type>] <nominalization> HSP gene expression <nominalization> “of” [<entity_type>] <entity> upregulation of Fas <entity> [<entity_type>] <be> [“not”] [<adverb>] <verb> IL-2R stimulation was totally inhibited <verb> <preposition> <entity> binding of TRAF2 <verb> <nominalization> “of” <entity> suppressing activation of STAT6 • Example patterns extracted from texts
  • 19. Results Event type Recall Precision F-score Localization 35.63 70.45 47.33 Binding 13.54 34.06 19.38 Gene Expression 46.40 78.45 58.31 Transcription 33.58 41.07 36.95 Protein Catabolism 35.71 62.50 45.45 Phosphorylation 49.63 79.76 61.19 Average 36.76 65.58 47.11 • Average results
  • 20. Conclusions • NooJ syntactic grammars for IE – Simple and flexible approach – Takes advantage of semantic properties and inflectional and derivational morphology in NooJ dictionaries • Pattern identification – Manual method is limited – How to generate new patterns automatically ? • Gene regulatory events – Described by complex constructions – Can syntactic grammars be used for this type of events ?
  • 21. References and Acknowledgments • BioLexicon was developed within the BOOTStrep project – http://www.nactem.ac.uk/biolexicon/ – http://www.bootstrep.eu/bin/view/Extern/WebHome • Data set from the BioNLP’09 Shared Task on Event Extraction – http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/SharedTask/ Sérgio Matos is funded by Fundação para a Ciência e Tecnologia (FCT) under the Ciência2007 programme .

Editor's Notes

  1. Involvement of genes in biological processes and in disease
  2. DB curators can use text mining tools to accelerate their job. They can sort the articles by relevance and find the relevant sentences. If an article seems to contain relevant information they still need to read the full article to validate that. Users can use TM tools for better IR. There are also some tools that present the results obtained from IE Information retrieval - find and rank publications containing relevant information for a particular gene/protein/disease/... Information extraction – extract information such as interaction between two proteins or the involvement of a gene in a disease
  3. Different processing levels in TM for biomedicine require different levels of domain knowledge (semantics) Named entity recognition – recognizing gene/protein names, diseases, etc Normalization – Similar genes in different organisms (human, mouse, fly, etc) usually have the same gene symbol. Also a gene and related protein frequently have the same name, and some times also for a gene and a related disease (there is high ambiguity!). Normalization means two things: disambiguation (is it a gene, protein, disease) and specifying which gene/protein it refers to (for example, the x gene in men, or the x gene in mice) Relation extraction – extract a relation between a gene and a protein (“protein A, encoded by gene Y”), a gene and a disease (“Y is involved in X”), or between proteins (protein-protein interaction, PPI) Event extraction – similar, but may involve one, two or more entities: “gene Y was expressed”, “protein X binding with protein Z”
  4. This and next slide not too important. Just give an insight of where the field stands and where it’s going to (our own view)
  5. The shared tasks are evaluation contests for IR and IE There is some interest in discourse analysis (including me), for summarization for example, and for finding things like new research directions (“further work should be carried to validate .....”, “this may indicate ...”), hesitation/contradiction (“this seems to indicate...”, author X showed... However...”) Most work is done over abstracts (from the MEDLINE/PubMed database) but most information is in the full-texts Abstracts have a more “restrained” grammar as compared to full-text, and that “facilitates” our approach The BioCreaTive challenge is on full texts
  6. The BioLexicon is for now available in a early version, as a collection of tables that can be converted into a relational database (like MySQL). In this version the linguistic information is limited to a list of verbs The final version should be available soon and it will include inflexional and derivational forms of the verbs. It will also include information about what a specific verb/noun may indicate in the text
  7. A trigger word is the word in the sentence that indicates the event. In “TRADD interacts with TES2”, “interacts” is the trigger word Given this framework, we do not need to do NER, as the entities that we should worry about are given to us. Also, using the training data, it is simple to obtain a list of trigger words for each type of event.
  8. Example entries in the biomedical dictionaries We use separate dictionaries: one for organisms, other for genes+proteins Simple to add new entities like diseases or anatomy (arm, leg, heart, ...) IDs are obtained from major databases (Uniprot for proteins, Entrez Gene for genes, OMIM for diseases, etc) Note ambiguity in BRCA1: can be either a human or mouse protein or a human gene Note synonyms: BRCA1 and RNF53 represent the same human gene, with the unique ID 672 (NCBI gene ID) Note: Mus musculus scientific name for mouse
  9. As the current BioLexicon dictionaries do not contain semantic information about the verbs (which verbs are used to indicate a “localization”, “binding”, “regulation” and each other type of events) this had to be derived from the training data EPIA paper: “Based on the manual linguistic annotations, we extracted the sentences corresponding to each event, and assigned the event type to the verbs found on those sentences. We then manually checked this list and selected only those verbs showing a specific link to a type of event. In case verbs were linked to more than one event type, only the most frequent event type was selected, and the remaining ones removed.“ - If “localize” is used to indicate a “localization” event 20 times in the training data and is used to indicate a “binding” event 1 time, then we only keep the “localization” type
  10. Example entries in the verb dictionary The verbs that may indicate an event also have a ‘+FUNC’ semantic property (not shown) This is to differentiate from all other entries in the dictionary that are more general. The dictionary does not include names or adjectives. Only the ones derived from these verbs are available.
  11. Each grammar describes 1-3 syntactic patterns: “Stimulation of human CD4” and “human CD4 stimulation” Same grammar for both forms
  12. Average results are good compared with other results in the BioNLP shared task and given the simple implementation Binding events are more difficult because they usually (but not allways) include two proteins or genes We did not cover the regulation / down-regulation / up-regulation for lack of time. These are even more complex and need more complex grammars
  13. The proposed method takes advantage of the inflectional and derivational morphology and semantic properties established in dictionaries and grammars developed with NooJ, which allow to associate terminological verbs and their derivations to specific event types. Methods such as the one proposed in this paper can be used to help database curators identify the most relevant facts in the literature and speed-up the annotation process. Tools based on these methods can also provide alternative querying and browsing of facts cited in the literature and be useful for researchers.