Syntactic-semantic analysis for
information extraction in biomedicine
Sérgio Matos (1), Anabela Barreiro (2)
(1) IEETA, Universidade de Aveiro
(2) Centro de Linguística, Universidade do Porto
aleixomatos@ua.pt; barreiro_anabela@hotmail.com
June 2009
Outline
• Background
• Text Mining and Information Extraction in
Biomedicine
• Objectives
• Implementation
• Results
• Conclusions
Background
• Genomics and Proteomics are fast-growing fields
• Literature grows exponentially
– MEDLINE/PubMed: ~18 million citations
• Researchers need to contextualize their theories and findings
– Interactions between genes/proteins
– Involvement in biological processes and in disease
– And many other factors...
• How to keep up-to-date with new knowledge in the
field?
Background
• Manually curated biomedical databases are a good source of
information
– Publications are reviewed and important information added to DBs
(e.g. protein interactions)
– Impossible to keep DBs up-to-date due to increased volume of
publications
• Text Mining can be useful for
– Information retrieval (IR)
– Information extraction (IE)
– DB curators and end-users (researchers)
Text Mining and Information Extraction
in Biomedicine
• Text mining deals with the automated processing of texts to
derive high quality information
• Information Extraction can be seen as one application of TM
• Different processing levels
(each level adds more semantics and domain knowledge)
– Entity recognition (ER): genes, proteins, etc.
– Normalization: ATF2 → GeneID 1386; ATF-2 → UniProt P15336
– Relation extraction: PPI, gene/disease
– Event extraction: gene expression, regulation
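The normalization level can be pictured as a lookup from surface forms to database identifiers. The following is a minimal Python sketch, not the presented system: the `SYNONYMS` table and the `fold` and `normalize` helpers are hypothetical, and mapping both spellings to both identifiers is an illustrative assumption.

```python
import re

# Hypothetical synonym table; it holds only the two identifiers shown on the slide.
SYNONYMS = {
    "atf2": [("GeneID", "1386"), ("UniProt", "P15336")],
}

def fold(name: str) -> str:
    """Lower-case and drop hyphens/whitespace so 'ATF-2' and 'ATF2' share a key."""
    return re.sub(r"[-\s]", "", name).lower()

def normalize(mention: str):
    """Return the identifiers recorded for a mention, or an empty list."""
    return SYNONYMS.get(fold(mention), [])

print(normalize("ATF-2"))
```

A real system would also have to pick between gene and protein readings and between organisms, which this sketch does not attempt.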
Text Mining and Information Extraction
in Biomedicine
• Good results for NER, but limited to a few entity types
– 80%-90% for recognition of genes/proteins
– Need to include more entities, like chemical compounds, diseases,
experimental conditions
• Relation extraction has focused mostly on PPI
• Relations between other concept types remain little explored
– e.g. gene/disease, drug/target
– mostly based on co-occurrence statistics
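As a rough illustration of what "co-occurrence statistics" means here, the Python sketch below counts how often a gene name and a disease name appear in the same sentence. The function name, the toy sentences and the single-token matching are all assumptions made for the example; published systems typically add normalization and a statistical association score on top of raw counts.

```python
from collections import Counter
from itertools import product

def cooccurrence_counts(sentences, genes, diseases):
    """Count sentences in which a gene name and a disease name both appear."""
    counts = Counter()
    for sentence in sentences:
        tokens = set(sentence.lower().split())
        found_genes = [g for g in genes if g.lower() in tokens]
        found_diseases = [d for d in diseases if d.lower() in tokens]
        # Every (gene, disease) pair found in the sentence counts once.
        counts.update(product(found_genes, found_diseases))
    return counts

# Toy sentences invented for this sketch (not taken from any corpus).
toy = [
    "BRCA1 mutations are linked to cancer",
    "BRCA1 is expressed in many tissues",
    "TP53 loss drives cancer progression",
]
counts = cooccurrence_counts(toy, ["BRCA1", "TP53"], ["cancer"])
print(counts)
```

Counts like these say that two concepts are mentioned together, not how they are related, which is the limitation the slide points to.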
Text Mining and Information Extraction
in Biomedicine
• Recent interest towards extraction of events
– BioNLP shared task and BioCreative II.5
• ... and other entities / facts
– e.g. Experimental conditions, lab techniques, measurements
• ... Discourse analysis
– “indicating/suggesting that...”, “in contrast...”
• Full-text vs. Abstracts
– Complexity in grammar
Linguistic Resources for Biomedical TM
• UMLS Metathesaurus
– various terms, all linked to same concept (e.g. ‘Hypertension’)
– semantic information provided by the UMLS Semantic Network
• BioLexicon
– Includes domain relevant verbs (localize, bind, express, …)
• Lexical resources can be created from available online DBs
– NCBI Entrez Gene for gene names
– UniProt for proteins
– OMIM for diseases
– Various ontologies
Objectives
• Extract phrases indicating a biomolecular event from
scientific text
• Biomolecular events include various types
– Examples
• “phosphorylation of TRAF2”
• “localization of beta-catenin”
• “TRADD interacts with TES2”
• BioNLP'09 Shared Task on Event Extraction
– http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/SharedTask/
Objectives
• Six event types considered
– Localization, Binding, Gene expression, Transcription, Protein
Catabolism, Phosphorylation
• Training data
– Annotation of genes/proteins occurring in each input text, including
the text span (start and end character offsets)
– Annotation of the events, including the event type, the participating
entities and the corresponding trigger word (with start and end
character offsets)
• Test data
– Annotation of participating genes/proteins is given
– Create annotation of events for the given entities
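To make the annotation layout concrete, here is a hedged Python sketch that reads entity and event lines. It assumes the tab-separated standoff layout commonly associated with the shared task (`T` lines for text spans, `E` lines for events); the `parse_standoff` helper and the three example lines are written for this illustration and are not copied from the data set.

```python
def parse_standoff(lines):
    """Split annotation lines into text-bound entries (T*) and events (E*)."""
    spans, events = {}, {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        ident, body = line.split("\t", 1)
        if ident.startswith("T"):
            # Assumed layout: "Type start end<TAB>covered text"
            info, text = body.split("\t", 1)
            kind, start, end = info.split()
            spans[ident] = {"type": kind, "start": int(start),
                            "end": int(end), "text": text}
        elif ident.startswith("E"):
            # Assumed layout: "EventType:Trigger Role:Arg ..."
            trigger, *args = body.split()
            etype, trigger_id = trigger.split(":")
            events[ident] = {"type": etype, "trigger": trigger_id,
                             "args": [tuple(a.split(":")) for a in args]}
    return spans, events

# Hand-written example for the phrase "phosphorylation of TRAF2".
text = "phosphorylation of TRAF2"
example = [
    "T1\tProtein 19 24\tTRAF2",
    "T2\tPhosphorylation 0 15\tphosphorylation",
    "E1\tPhosphorylation:T2 Theme:T1",
]
spans, events = parse_standoff(example)
print(spans["T1"], events["E1"])
```

In the test setting only the protein lines are given, and the trigger and event lines are what the system has to produce.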
Implementation
• General approach
– Create syntactic grammars to detect phrases that indicate events
– Grammars are based only on NEs and domain verbs (and the nouns
derived from them)
• Requirements
– Grammar outputs should indicate the event type
• Solution
– Event types can be associated with the trigger word using the
semantic properties in NooJ dictionaries
– Event types associated with each trigger word are derived from
training data
Implementation
• Resources
– Entity dictionary
• Create dictionary with list of entities occurring in the texts
Implementation
Lemma                                        PoS  FLX    Semantic properties  ID      TAXID
human                                        N    TABLE  ORGANISM             -       9606
Homo sapiens                                 N    -      ORGANISM             -       9606
Mus musculus                                 N    -      ORGANISM             -       10090
Breast cancer type 1 susceptibility protein  N    -      PROTEIN              P38398  9606
BRCA1                                        N    -      PROTEIN              P38398  9606
BRCA1                                        N    -      PROTEIN              P48754  10090
BRCA1                                        N    -      GENE                 672     9606
RNF53                                        N    -      GENE                 672     9606
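The table shows that one surface form can carry several readings (BRCA1 as a human protein, a mouse protein, or a human gene). A small Python sketch of such a dictionary as a lookup structure follows; the `ENTRIES` list reuses the gene/protein rows from the table, while the `index` structure and the `lookup` helper are hypothetical and only illustrate the ambiguity.

```python
from collections import defaultdict

# (lemma, semantic class, database ID, NCBI taxonomy ID) from the table rows.
ENTRIES = [
    ("BRCA1", "PROTEIN", "P38398", "9606"),
    ("BRCA1", "PROTEIN", "P48754", "10090"),
    ("BRCA1", "GENE", "672", "9606"),
    ("RNF53", "GENE", "672", "9606"),
]

index = defaultdict(list)
for lemma, sem_class, db_id, taxid in ENTRIES:
    index[lemma.lower()].append({"class": sem_class, "id": db_id, "taxid": taxid})

def lookup(mention, taxid=None, sem_class=None):
    """Return all dictionary readings of a mention, optionally filtered."""
    hits = index.get(mention.lower(), [])
    if taxid is not None:
        hits = [h for h in hits if h["taxid"] == taxid]
    if sem_class is not None:
        hits = [h for h in hits if h["class"] == sem_class]
    return hits

print(lookup("BRCA1"))                  # three readings
print(lookup("BRCA1", taxid="10090"))   # only the mouse protein
```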
Implementation
• Resources
– Entity dictionary
• Create dictionary with list of entities occurring in the texts
– BioLexicon verb dictionary
• Adapted to include event type
– From the training data, extract the verbs associated with events
– Add a semantic property to the dictionary entry indicating the event type
– Example: “express,V+EventType=Gene_expression”
• Added inflectional and derivational rules
– The inflected and derived forms inherit the verb’s semantic properties
• Verb dictionary
Implementation
Lemma      PoS  DRV         FLX      EventType
express    V    ION:TABLE   ABOLISH  Gene_expression
ligate     V    TION:TABLE  SMILE    Binding
stimulate  V    TION:TABLE  SMILE    Positive_regulation
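The rows above can be serialized as one-line dictionary entries in the style of the “express,V+EventType=...” example. The Python sketch below does this under stated assumptions: the order of the `FLX` and `DRV` fields and the helper names `to_entry` and `event_type_of` are illustrative, not taken from the authors' files.

```python
# (lemma, PoS, DRV, FLX, EventType) as listed in the verb dictionary table.
ROWS = [
    ("express", "V", "ION:TABLE", "ABOLISH", "Gene_expression"),
    ("ligate", "V", "TION:TABLE", "SMILE", "Binding"),
    ("stimulate", "V", "TION:TABLE", "SMILE", "Positive_regulation"),
]

def to_entry(lemma, pos, drv, flx, event_type):
    """Build a one-line entry: lemma, category, then '+'-separated properties."""
    return f"{lemma},{pos}+FLX={flx}+DRV={drv}+EventType={event_type}"

def event_type_of(entry):
    """Read the EventType property back from an entry line (None if absent)."""
    _, _, properties = entry.partition(",")
    for prop in properties.split("+"):
        key, _, value = prop.partition("=")
        if key == "EventType":
            return value
    return None

entries = [to_entry(*row) for row in ROWS]
for entry in entries:
    print(entry)
```

Because the event type is attached to the verb lemma, forms produced by the inflectional and derivational rules (for example a nominalization of "stimulate") can carry the same property.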
Implementation
• Syntactic grammars
– Sentences from training set used to generate surface
patterns
– Manual procedure
– Seven grammars created
– Example:
“stimulation of human CD4”
Implementation
Stimulation of human CD4
<EVENT+PROTEIN=CD4+EXP=Stimulation+TYPE=Positive_regulation>
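Downstream code needs the fields of this annotation rather than the raw string. A minimal Python sketch of unpacking it is given below; `parse_annotation` is a hypothetical helper and assumes that feature values contain neither "+" nor ">".

```python
def parse_annotation(tag):
    """Turn '<CATEGORY+KEY=VALUE+...>' into (category, {key: value})."""
    inner = tag.strip().lstrip("<").rstrip(">")
    category, *features = inner.split("+")
    return category, dict(f.split("=", 1) for f in features)

category, fields = parse_annotation(
    "<EVENT+PROTEIN=CD4+EXP=Stimulation+TYPE=Positive_regulation>"
)
print(category, fields)
```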
Results
Pattern                                                    Concordance in text
<entity> [<entity_type>] <nominalization>                  HSP gene expression
<nominalization> “of” [<entity_type>] <entity>             upregulation of Fas
<entity> [<entity_type>] <be> [“not”] [<adverb>] <verb>    IL-2R stimulation was totally inhibited
<verb> <preposition> <entity>                              binding of TRAF2
<verb> <nominalization> “of” <entity>                      suppressing activation of STAT6
• Example patterns extracted from texts
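One of these surface patterns, a nominalization followed by "of", an optional modifier and an entity, can be approximated outside NooJ with a regular expression. The Python sketch below is only an approximation of the idea: the three small lexicons are hypothetical stand-ins for the dictionaries, and a regex lacks the morphology that the grammars get from the dictionaries.

```python
import re

# Hypothetical mini-lexicons standing in for the NooJ dictionaries.
NOMINALIZATIONS = {
    "stimulation": "Positive_regulation",
    "phosphorylation": "Phosphorylation",
    "localization": "Localization",
}
MODIFIERS = ["human"]
ENTITIES = ["beta-catenin", "TRAF2", "CD4"]

def alternation(words):
    """Escape the words and join them into a regex alternation."""
    return "|".join(re.escape(w) for w in words)

# <nominalization> "of" [<modifier>] <entity>
PATTERN = re.compile(
    rf"\b(?P<noun>{alternation(NOMINALIZATIONS)})\s+of\s+"
    rf"(?:(?:{alternation(MODIFIERS)})\s+)?"
    rf"(?P<entity>{alternation(ENTITIES)})(?![\w-])",
    re.IGNORECASE,
)

def find_events(text):
    """Return one record per match of the surface pattern."""
    return [
        {
            "type": NOMINALIZATIONS[m.group("noun").lower()],
            "trigger": m.group("noun"),
            "entity": m.group("entity"),
            "span": m.span(),
        }
        for m in PATTERN.finditer(text)
    ]

print(find_events("Stimulation of human CD4 was observed."))
```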
Results
Event type          Recall (%)  Precision (%)  F-score (%)
Localization        35.63       70.45          47.33
Binding             13.54       34.06          19.38
Gene Expression     46.40       78.45          58.31
Transcription       33.58       41.07          36.95
Protein Catabolism  35.71       62.50          45.45
Phosphorylation     49.63       79.76          61.19
Average             36.76       65.58          47.11
• Average results
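The F-score column is consistent with the usual harmonic mean of precision and recall, F = 2PR / (P + R). The short Python check below recomputes it from the reported recall and precision; the `f_score` helper is written for this illustration, and the small differences are due to rounding.

```python
def f_score(recall, precision):
    """Balanced F-score: harmonic mean of precision and recall."""
    if recall + precision == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (recall, precision, reported F-score), copied from the results table.
REPORTED = {
    "Localization": (35.63, 70.45, 47.33),
    "Binding": (13.54, 34.06, 19.38),
    "Gene Expression": (46.40, 78.45, 58.31),
    "Transcription": (33.58, 41.07, 36.95),
    "Protein Catabolism": (35.71, 62.50, 45.45),
    "Phosphorylation": (49.63, 79.76, 61.19),
    "Average": (36.76, 65.58, 47.11),
}

for name, (recall, precision, reported) in REPORTED.items():
    print(f"{name:<20} computed={f_score(recall, precision):.2f} reported={reported:.2f}")
```

Precision is consistently higher than recall, which fits a hand-written pattern approach: the patterns that exist are fairly reliable, but many event mentions are not covered by any pattern.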
Conclusions
• NooJ syntactic grammars for IE
– Simple and flexible approach
– Takes advantage of semantic properties and inflectional and
derivational morphology in NooJ dictionaries
• Pattern identification
– Manual method is limited
– How to generate new patterns automatically?
• Gene regulatory events
– Described by complex constructions
– Can syntactic grammars be used for this type of event?
References and Acknowledgments
• BioLexicon was developed within the BOOTStrep project
– http://www.nactem.ac.uk/biolexicon/
– http://www.bootstrep.eu/bin/view/Extern/WebHome
• Data set from the BioNLP’09 Shared Task on Event Extraction
– http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/SharedTask/
Sérgio Matos is funded by Fundação para a Ciência e Tecnologia (FCT)
under the Ciência2007 programme.

  • 5. Text Mining and Information Extraction in Biomedicine • Text mining deals with the automated processing of texts to derive high quality information • Information Extraction can be seen as one application of TM • Different processing levels • Entity Recognition (ER) genes, proteins, etc. • Normalization ATF2 - GeneID 1386 ATF-2 – Uniprot P15336 • Relation extraction PPI, gene/disease • Event extraction gene expression, regulation + semantics + domain knowledge
  • 6. Text Mining and Information Extraction in Biomedicine • Good results for NER, but limited to a few entity types – 80%-90% for recognition of genes/proteins – Need to include more entities, like chemical compounds, diseases, experimental conditions • Relation extraction has focused mostly on PPI • Inter-concept relations not well explored – e.g. gene/disease, drug/target – mostly based on co-occurrence statistics
  • 7. Text Mining and Information Extraction in Biomedicine • Recent interest towards extraction of events – BioNLP shared task and BioCreaTive II.5 • ... and other entities / facts – e.g. Experimental conditions, lab techniques, measurements • ... Discourse analysis – “indicating/suggesting that...”, “in contrast...” • Full-text vs. Abstracts – Complexity in grammar
  • 8. Linguistic Resources for Biomedical TM • UMLS Metathesaurus – various terms, all linked to same concept (e.g. ‘Hypertension’) – semantic information provided by the UMLS Semantic Network • BioLexicon – Includes domain relevant verbs (localize, bind, express, …) • Lexical resources can be created from available online DBs – NCBI Entrez Gene for gene names – UniProt for proteins – OMIM for diseases – Various ontologies
  • 9. Objectives • Extract phrases indicating a biomolecular event from scientific text • Biomolecular events include various types – Examples • “phosphorylation of TRAF2” • “localization of beta-catenin” • “TRADD interacts with TES2” • BioNLP'09 Shared Task on Event Extraction – http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/SharedTask/
  • 10. Objectives • Six event types considered – Localization, Binding, Gene expression, Transcription, Protein Catabolism, Phosphorylation • Training data – Annotation of genes/proteins occurring in each input text, including the text section (start and end characters) – Annotation of the events, including the event type, the participating entities and the corresponding trigger word (with start and end characters) • Test data – Annotation of participating genes/proteins is given – Create annotation of events for the given entities
  • 11. Implementation • General approach – Create syntactic grammars to detect phrases that indicate events – Grammars are based only on NEs and domain verbs (and derived nouns) • Requisites – Grammar outputs should indicate the event type • Solution – Event types can be associated with the trigger word using the semantic properties in NooJ dictionaries – Event types associated with each trigger word are derived from training data
  • 12. Implementation • Resources – Entity dictionary • Create dictionary with list of entities occurring in the texts
  • 13. Implementation
        Lemma | PoS | FLX | Semantic properties | ID | TAXID
        human | N | TABLE | ORGANISM | | 9606
        Homo sapiens | N | | ORGANISM | | 9606
        Mus musculus | N | | ORGANISM | | 10090
        Breast cancer type 1 susceptibility protein | N | | PROTEIN | P38398 | 9606
        BRCA1 | N | | PROTEIN | P38398 | 9606
        BRCA1 | N | | PROTEIN | P48754 | 10090
        BRCA1 | N | | GENE | 672 | 9606
        RNF53 | N | | GENE | 672 | 9606
  • 14. Implementation • Resources – Entity dictionary • Create dictionary with list of entities occurring in the texts – BioLexicon verb dictionary • Adapted to include event type – From the training data, extract the verbs associated with events – Add a semantic property to the dictionary entry indicating the event type – Example: “express,V+EventType=Gene_Expression” • Added inflectional and derivational rules – The inflected and derived forms inherit the verb’s semantic properties
  • 15. Implementation • Verb dictionary
        Lemma | PoS | DRV | FLX | EventType
        express | V | ION:TABLE | ABOLISH | Gene_expression
        ligate | V | TION:TABLE | SMILE | Binding
        stimulate | V | TION:TABLE | SMILE | Positive_regulation
  • 16. Implementation • Syntactic grammars – Sentences from training set used to generate surface patterns – Manual procedure – Seven grammars created – Example: “stimulation of human CD4”
  • 17. Implementation Stimulation of human CD4 <EVENT+PROTEIN=CD4+EXP=Stimulation+TYPE=Positive_regulation>
  • 18. Results • Example patterns extracted from texts
        Pattern | Concordance in text
        <entity> [<entity_type>] <nominalization> | HSP gene expression
        <nominalization> “of” [<entity_type>] <entity> | upregulation of Fas
        <entity> [<entity_type>] <be> [“not”] [<adverb>] <verb> | IL-2R stimulation was totally inhibited
        <verb> <preposition> <entity> | binding of TRAF2
        <verb> <nominalization> “of” <entity> | suppressing activation of STAT6
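To make the surface-pattern idea concrete, here is a minimal Python sketch of a single pattern, <nominalization> “of” [<entity_type>] <entity>. It is not the NooJ grammar used in the work: it swaps in a plain regular expression, and the word lists, the `find_events` helper and the fixed `PROTEIN` label are hypothetical choices made only for illustration. The output string imitates the annotation shown for “Stimulation of human CD4”.

```python
import re

# Toy resources (assumptions for illustration, not the real dictionaries).
NOMINALIZATIONS = {
    "stimulation": "Positive_regulation",
    "expression": "Gene_expression",
}
ENTITIES = {"CD4", "Fas", "TRAF2"}
MODIFIERS = {"human", "gene", "protein"}  # optional word before the entity


def alternation(words):
    """Build a regex alternation, longest words first so longer names win."""
    return "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))


# <nominalization> "of" [<modifier>] <entity>
PATTERN = re.compile(
    rf"\b(?P<trigger>{alternation(NOMINALIZATIONS)})\s+of\s+"
    rf"(?:(?:{alternation(MODIFIERS)})\s+)?"
    rf"(?P<entity>{alternation(ENTITIES)})\b",
    re.IGNORECASE,
)


def find_events(text):
    """Return one annotation string per match of the pattern in `text`."""
    events = []
    for m in PATTERN.finditer(text):
        trigger = m.group("trigger")
        event_type = NOMINALIZATIONS[trigger.lower()]
        # The entity class is hard-coded as PROTEIN in this sketch.
        events.append(
            f"<EVENT+PROTEIN={m.group('entity')}+EXP={trigger}+TYPE={event_type}>"
        )
    return events


print(find_events("Stimulation of human CD4 was observed."))
```

A real grammar would take the trigger words and entity names from the dictionaries, including inflected and derived forms, rather than from hand-written sets.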
  • 19. Results • Average results
        Event type | Recall (%) | Precision (%) | F-score (%)
        Localization | 35.63 | 70.45 | 47.33
        Binding | 13.54 | 34.06 | 19.38
        Gene Expression | 46.40 | 78.45 | 58.31
        Transcription | 33.58 | 41.07 | 36.95
        Protein Catabolism | 35.71 | 62.50 | 45.45
        Phosphorylation | 49.63 | 79.76 | 61.19
        Average | 36.76 | 65.58 | 47.11
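The F-score column is consistent with the usual harmonic mean of precision and recall, F = 2PR / (P + R). The slides do not state the formula, so treat this as an assumption; the short Python check below recomputes each row from the reported recall and precision. The function name `f_score` is ours.

```python
def f_score(recall, precision):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if recall + precision == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall, precision, reported F-score), copied from the results slide.
RESULTS = {
    "Localization": (35.63, 70.45, 47.33),
    "Binding": (13.54, 34.06, 19.38),
    "Gene Expression": (46.40, 78.45, 58.31),
    "Transcription": (33.58, 41.07, 36.95),
    "Protein Catabolism": (35.71, 62.50, 45.45),
    "Phosphorylation": (49.63, 79.76, 61.19),
    "Average": (36.76, 65.58, 47.11),
}

for event_type, (recall, precision, reported) in RESULTS.items():
    computed = f_score(recall, precision)
    print(f"{event_type:20s} computed={computed:.2f} reported={reported:.2f}")
```

The low Binding score comes mainly from recall (13.54), which pulls the harmonic mean down even though precision is moderate.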
  • 20. Conclusions • NooJ syntactic grammars for IE – Simple and flexible approach – Takes advantage of semantic properties and inflectional and derivational morphology in NooJ dictionaries • Pattern identification – Manual method is limited – How to generate new patterns automatically? • Gene regulatory events – Described by complex constructions – Can syntactic grammars be used for this type of event?
  • 21. References and Acknowledgments • BioLexicon was developed within the BOOTStrep project – http://www.nactem.ac.uk/biolexicon/ – http://www.bootstrep.eu/bin/view/Extern/WebHome • Data set from the BioNLP’09 Shared Task on Event Extraction – http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/SharedTask/ Sérgio Matos is funded by Fundação para a Ciência e Tecnologia (FCT) under the Ciência2007 programme.

Editor's Notes

  1. Involvement of genes in biological processes and in disease
  2. DB curators can use text mining tools to speed up their work: they can sort articles by relevance and find the relevant sentences. If an article seems to contain relevant information, they still need to read the full article to validate it. Users can use TM tools for better IR, and some tools also present the results obtained from IE. Information retrieval: find and rank publications containing relevant information for a particular gene/protein/disease/... Information extraction: extract information such as an interaction between two proteins or the involvement of a gene in a disease.
  3. Different processing levels in TM for biomedicine require different levels of domain knowledge (semantics). Named entity recognition: recognizing gene/protein names, diseases, etc. Normalization: similar genes in different organisms (human, mouse, fly, etc.) usually have the same gene symbol. A gene and its related protein also frequently share a name, as sometimes do a gene and a related disease (there is high ambiguity!). Normalization means two things: disambiguation (is it a gene, a protein or a disease?) and specifying which gene/protein the mention refers to (for example, the x gene in humans or the x gene in mice). Relation extraction: extract a relation between a gene and a protein (“protein A, encoded by gene Y”), a gene and a disease (“Y is involved in X”), or between proteins (protein-protein interaction, PPI). Event extraction: similar, but may involve one, two or more entities: “gene Y was expressed”, “protein X binding with protein Z”.
  4. This slide and the next are not too important. They just give an insight into where the field stands and where it is heading (our own view).
  5. The shared tasks are evaluation contests for IR and IE. There is some interest in discourse analysis (including from me), for summarization for example, and for finding things like new research directions (“further work should be carried out to validate .....”, “this may indicate ...”) and hesitation/contradiction (“this seems to indicate...”, “author X showed... However...”). Most work is done over abstracts (from the MEDLINE/PubMed database), but most information is in the full texts. Abstracts have a more “restrained” grammar than full texts, and that “facilitates” our approach. The BioCreaTive challenge is on full texts.
  6. For now, the BioLexicon is available in an early version, as a collection of tables that can be converted into a relational database (like MySQL). In this version the linguistic information is limited to a list of verbs. The final version should be available soon and will include inflectional and derivational forms of the verbs. It will also include information about what a specific verb/noun may indicate in the text.
  7. A trigger word is the word in the sentence that indicates the event. In “TRADD interacts with TES2”, “interacts” is the trigger word. Given this framework, we do not need to do NER, as the entities we need to consider are given to us. Also, using the training data, it is simple to obtain a list of trigger words for each type of event.
  8. Example entries in the biomedical dictionaries. We use separate dictionaries: one for organisms, another for genes and proteins. It is simple to add new entities like diseases or anatomy (arm, leg, heart, ...). IDs are obtained from major databases (UniProt for proteins, Entrez Gene for genes, OMIM for diseases, etc.). Note the ambiguity in BRCA1: it can be either a human or a mouse protein, or a human gene. Note the synonyms: BRCA1 and RNF53 represent the same human gene, with the unique ID 672 (NCBI gene ID). Note: Mus musculus is the scientific name for the mouse.
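The ambiguity and synonymy described in this note can be illustrated with a small Python sketch. It is not the NooJ dictionary format: the `Entry` type, the `build_index` and `synonyms` helpers, and the single merged list (the authors keep organisms and genes/proteins in separate dictionaries) are assumptions for illustration. The entry values are the ones on the dictionary slide, with the organism codes treated as taxonomy IDs.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    lemma: str
    semantic_type: str      # ORGANISM, PROTEIN or GENE
    db_id: Optional[str]    # UniProt accession or Entrez Gene ID, if any
    tax_id: str             # NCBI Taxonomy ID of the organism


ENTRIES = [
    Entry("human", "ORGANISM", None, "9606"),
    Entry("Homo sapiens", "ORGANISM", None, "9606"),
    Entry("Mus musculus", "ORGANISM", None, "10090"),
    Entry("Breast cancer type 1 susceptibility protein", "PROTEIN", "P38398", "9606"),
    Entry("BRCA1", "PROTEIN", "P38398", "9606"),
    Entry("BRCA1", "PROTEIN", "P48754", "10090"),
    Entry("BRCA1", "GENE", "672", "9606"),
    Entry("RNF53", "GENE", "672", "9606"),
]


def build_index(entries):
    """Map each surface form to every dictionary entry that uses it."""
    index = defaultdict(list)
    for entry in entries:
        index[entry.lemma].append(entry)
    return index


def synonyms(entries, semantic_type, db_id, tax_id):
    """All surface forms that point to the same database record."""
    key = (semantic_type, db_id, tax_id)
    return sorted(
        {e.lemma for e in entries if (e.semantic_type, e.db_id, e.tax_id) == key}
    )


index = build_index(ENTRIES)
for candidate in index["BRCA1"]:
    print(candidate.semantic_type, candidate.db_id, candidate.tax_id)
print(synonyms(ENTRIES, "GENE", "672", "9606"))
```

Looking up “BRCA1” returns three candidates (human protein, mouse protein, human gene), which is the ambiguity that normalization has to resolve from context.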
  9. As the current BioLexicon dictionaries do not contain semantic information about the verbs (which verbs are used to indicate a “localization”, “binding”, “regulation” or any other type of event), this had to be derived from the training data. EPIA paper: “Based on the manual linguistic annotations, we extracted the sentences corresponding to each event, and assigned the event type to the verbs found on those sentences. We then manually checked this list and selected only those verbs showing a specific link to a type of event. In case verbs were linked to more than one event type, only the most frequent event type was selected, and the remaining ones removed.” For example, if “localize” is used to indicate a “localization” event 20 times in the training data and a “binding” event once, then we only keep the “localization” type.
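The "keep only the most frequent event type" rule in this note can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the `(verb, event_type)` input pairs, the `select_event_types` and `to_entry` names, and the toy counts are assumptions, and the manual checking step is not modelled. The entry string follows the format of the example “express,V+EventType=...”.

```python
from collections import Counter, defaultdict


def select_event_types(observations):
    """Keep, for each verb, only its most frequent event type.

    `observations` is an iterable of (verb, event_type) pairs, one per
    annotated event in the training data. On a tie, the type seen first wins.
    """
    counts = defaultdict(Counter)
    for verb, event_type in observations:
        counts[verb][event_type] += 1
    return {verb: c.most_common(1)[0][0] for verb, c in counts.items()}


def to_entry(verb, event_type):
    """Format a verb as a dictionary line carrying its event type."""
    return f"{verb},V+EventType={event_type}"


# Toy data mirroring the example in the note: 20 vs. 1 for "localize".
observations = (
    [("localize", "Localization")] * 20
    + [("localize", "Binding")]
    + [("express", "Gene_expression")] * 3
)

selected = select_event_types(observations)
for verb, event_type in sorted(selected.items()):
    print(to_entry(verb, event_type))
```

Dropping the minority types trades some recall for simpler, unambiguous dictionary entries, since each trigger word then maps to exactly one event type.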
  10. Example entries in the verb dictionary. The verbs that may indicate an event also have a ‘+FUNC’ semantic property (not shown), to differentiate them from all other, more general entries in the dictionary. The dictionary does not include nouns or adjectives other than those derived from these verbs.
  11. Each grammar describes 1-3 syntactic patterns. For example, “Stimulation of human CD4” and “human CD4 stimulation” are handled by the same grammar.
  12. Average results are good compared with other results in the BioNLP shared task, given the simple implementation. Binding events are more difficult because they usually (but not always) include two proteins or genes. We did not cover regulation / down-regulation / up-regulation events for lack of time. These are even more complex and need more complex grammars.
  13. The proposed method takes advantage of the inflectional and derivational morphology and semantic properties established in dictionaries and grammars developed with NooJ, which allow terminological verbs and their derivations to be associated with specific event types. Methods such as the one proposed in this paper can help database curators identify the most relevant facts in the literature and speed up the annotation process. Tools based on these methods can also provide alternative ways of querying and browsing facts cited in the literature, and be useful for researchers.