This document discusses using syntactic-semantic analysis for information extraction in biomedicine. It aims to extract biomolecular events like phosphorylation from text. It uses dictionaries of entities and verbs associated with event types, and NooJ grammars to identify events. Evaluation on a shared task dataset shows average recall of 36.76% and precision of 65.58% for six event types. While results are promising, it discusses limitations like manual pattern identification and challenges with more complex event constructions.
Pests of jatropha_Bionomics_identification_Dr.UPR.pdf
Syntactic-semantic analysis for information extraction in biomedicine
1. Syntactic-semantic analysis for
information extraction in biomedicine
Sérgio Matos1, Anabela Barreiro2
1IEETA, Universidade de Aveiro
2Centro de Linguística, Universidade do Porto
aleixomatos@ua.pt; barreiro_anabela@hotmail.com
June 2009
2. Outline
• Background
• Text Mining and Information Extraction in
Biomedicine
• Objectives
• Implementation
• Results
• Conclusions
3. Background
• Genomics and Proteomics are fast-growing fields
• Literature grows exponentially
– MEDLINE/PubMed ~ 18m citations
• Researchers need to contextualize their theories and findings
– Interactions between genes/proteins
– Involvement in biological processes and in disease
– And many other factors...
• How to keep up-to-date with new knowledge in the
field?
4. Background
• Manually curated biomedical databases are a good source of
information
– Publications are reviewed and important information added to DBs
(e.g. protein interactions)
– Impossible to keep DBs up-to-date due to increased volume of
publications
• Text Mining can be useful for
– Information retrieval (IR)
– Information extraction (IE)
– DB curators and end-users (researchers)
5. Text Mining and Information Extraction
in Biomedicine
• Text mining deals with the automated processing of texts to
derive high quality information
• Information Extraction can be seen as one application of TM
• Different processing levels
• Entity Recognition (ER) genes, proteins, etc.
• Normalization ATF2 - GeneID 1386
ATF-2 – Uniprot P15336
• Relation extraction PPI, gene/disease
• Event extraction gene expression, regulation
+ semantics + domain knowledge
6. Text Mining and Information Extraction
in Biomedicine
• Good results for NER, but limited to a few entity types
– 80%-90% for recognition of genes/proteins
– Need to include more entities, like chemical compounds, diseases,
experimental conditions
• Relation extraction has focused mostly on PPI
• Inter-concept relations not too explored
– e.g. gene/disease, drug/target
– mostly based on co-occurence statistics
7. Text Mining and Information Extraction
in Biomedicine
• Recent interest towards extraction of events
– BioNLP shared task and BioCreaTive II.5
• ... and other entities / facts
– e.g. Experimental conditions, lab techniques, measurements
• ... Discourse analysis
– “indicating/suggesting that...”, “in contrast...”
• Full-text vs. Abstracts
– Complexity in grammar
8. Linguistic Resources for Biomedical TM
• UMLS Metathesaurus
– various terms, all linked to same concept (e.g. ‘Hypertension’)
– semantic information provided by the UMLS Semantic Network
• BioLexicon
– Includes domain relevant verbs (localize, bind, express, …)
• Lexical resources can be created from available online DBs
– NCBI Entrez Gene for gene names
– UniProt for proteins
– OMIM for diseases
– Various ontologies
9. Objectives
• Extract phrases indicating a biomolecular event from
scientific text
• Biomolecular events include various types
– Examples
• “phosphorylation of TRAF2”
• “localization of beta-catenin”
• “TRADD interacts with TES2”
• BioNLP'09 Shared Task on Event Extraction
– http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/SharedTask/
10. Objectives
• Six event types considered
– Localization, Binding, Gene expression, Transcription, Protein
Catabolism, Phosphorylation
• Training data
– Annotation of genes/proteins occurring in each input text, including
the text section (start and end characters)
– Annotation of the events, including the event type, the participating
entities and the corresponding trigger word (with start and end times)
• Test data
– Annotation of participating genes/proteins is given
– Create annotation of events for the given entities
11. Implementation
• General approach
– Create syntactic grammars to detect phrases that indicate events
– Grammars are based only on NEs and domain verbs (and derived
names)
• Requisites
– Grammars outputs should indicate the event type
• Solution
– Event types can be associated with the trigger word using the
semantic properties in NooJ dictionaries
– Event types associated with each trigger word are derived from
training data
13. Implementation
Lemma PoS FLX
Semantic
properties
ID TAXID
human N TABLE ORGANISM 9606
Homo sapiens N ORGANISM 9606
Mus musculus N ORGANISM 10090
Breast cancer type 1
susceptibility protein
N PROTEIN P38398 9606
BRCA1 N PROTEIN P38398 9606
BRCA1 N PROTEIN P48754 10090
BRCA1 N GENE 672 9606
RNF53 N GENE 672 9606
14. Implementation
• Resources
– Entity dictionary
• Create dictionary with list of entities occurring in the texts
– BioLexicon verb dictionary
• Adapted to include event type
– From the training data, extract the verbs associated with events
– Add a semantic property to the dictionary entry indicating the event type
– Example: “express,V+EventType=Gene_Expression”
• Added inflectional and derivation rules
– The inflected and derivated forms inherit the verb’s semantic properties
15. • Verb dictionary
Implementation
Lemma PoS DRV FLX EventType
express V ION:TABLE ABOLISH Gene_expression
ligate V TION:TABLE SMILE Binding
stimulate V TION:TABLE SMILE Positive_regulation
16. Implementation
• Syntactic grammars
– Sentences from training set used to generate surface
patterns
– Manual procedure
– Seven grammars created
– Example:
“stimulation of human CD4”
18. Results
Pattern Concordance in text
<entity> [<entity_type>] <nominalization> HSP gene expression
<nominalization> “of” [<entity_type>] <entity> upregulation of Fas
<entity> [<entity_type>] <be> [“not”] [<adverb>] <verb>
IL-2R stimulation was totally
inhibited
<verb> <preposition> <entity> binding of TRAF2
<verb> <nominalization> “of” <entity>
suppressing activation of
STAT6
• Example patterns extracted from texts
19. Results
Event type Recall Precision F-score
Localization 35.63 70.45 47.33
Binding 13.54 34.06 19.38
Gene Expression 46.40 78.45 58.31
Transcription 33.58 41.07 36.95
Protein Catabolism 35.71 62.50 45.45
Phosphorylation 49.63 79.76 61.19
Average 36.76 65.58 47.11
• Average results
20. Conclusions
• NooJ syntactic grammars for IE
– Simple and flexible approach
– Takes advantage of semantic properties and inflectional and
derivational morphology in NooJ dictionaries
• Pattern identification
– Manual method is limited
– How to generate new patterns automatically ?
• Gene regulatory events
– Described by complex constructions
– Can syntactic grammars be used for this type of events ?
21. References and Acknowledgments
• BioLexicon was developed within the BOOTStrep project
– http://www.nactem.ac.uk/biolexicon/
– http://www.bootstrep.eu/bin/view/Extern/WebHome
• Data set from the BioNLP’09 Shared Task on Event Extraction
– http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/SharedTask/
Sérgio Matos is funded by Fundação para a Ciência e Tecnologia (FCT)
under the Ciência2007 programme .
Editor's Notes
Involvement of genes in biological processes and in disease
DB curators can use text mining tools to accelerate their job. They can sort the articles by relevance and find the relevant sentences. If an article seems to contain relevant information they still need to read the full article to validate that.
Users can use TM tools for better IR. There are also some tools that present the results obtained from IE
Information retrieval - find and rank publications containing relevant information for a particular gene/protein/disease/...
Information extraction – extract information such as interaction between two proteins or the involvement of a gene in a disease
Different processing levels in TM for biomedicine require different levels of domain knowledge (semantics)
Named entity recognition – recognizing gene/protein names, diseases, etc
Normalization – Similar genes in different organisms (human, mouse, fly, etc) usually have the same gene symbol. Also a gene and related protein frequently have the same name, and some times also for a gene and a related disease (there is high ambiguity!). Normalization means two things: disambiguation (is it a gene, protein, disease) and specifying which gene/protein it refers to (for example, the x gene in men, or the x gene in mice)
Relation extraction – extract a relation between a gene and a protein (“protein A, encoded by gene Y”), a gene and a disease (“Y is involved in X”), or between proteins (protein-protein interaction, PPI)
Event extraction – similar, but may involve one, two or more entities: “gene Y was expressed”, “protein X binding with protein Z”
This and next slide not too important. Just give an insight of where the field stands and where it’s going to (our own view)
The shared tasks are evaluation contests for IR and IE
There is some interest in discourse analysis (including me), for summarization for example, and for finding things like new research directions (“further work should be carried to validate .....”, “this may indicate ...”), hesitation/contradiction (“this seems to indicate...”, author X showed... However...”)
Most work is done over abstracts (from the MEDLINE/PubMed database) but most information is in the full-texts
Abstracts have a more “restrained” grammar as compared to full-text, and that “facilitates” our approach
The BioCreaTive challenge is on full texts
The BioLexicon is for now available in a early version, as a collection of tables that can be converted into a relational database (like MySQL). In this version the linguistic information is limited to a list of verbs
The final version should be available soon and it will include inflexional and derivational forms of the verbs. It will also include information about what a specific verb/noun may indicate in the text
A trigger word is the word in the sentence that indicates the event. In “TRADD interacts with TES2”, “interacts” is the trigger word
Given this framework, we do not need to do NER, as the entities that we should worry about are given to us.
Also, using the training data, it is simple to obtain a list of trigger words for each type of event.
Example entries in the biomedical dictionaries
We use separate dictionaries: one for organisms, other for genes+proteins
Simple to add new entities like diseases or anatomy (arm, leg, heart, ...)
IDs are obtained from major databases (Uniprot for proteins, Entrez Gene for genes, OMIM for diseases, etc)
Note ambiguity in BRCA1: can be either a human or mouse protein or a human gene
Note synonyms: BRCA1 and RNF53 represent the same human gene, with the unique ID 672 (NCBI gene ID)
Note: Mus musculus scientific name for mouse
As the current BioLexicon dictionaries do not contain semantic information about the verbs (which verbs are used to indicate a “localization”, “binding”, “regulation” and each other type of events) this had to be derived from the training data
EPIA paper: “Based on the manual linguistic annotations, we extracted the sentences corresponding to each event, and assigned the event type to the verbs found on those sentences. We then manually checked this list and selected only those verbs showing a specific link to a type of event. In case verbs were linked to more than one event type, only the most frequent event type was selected, and the remaining ones removed.“
- If “localize” is used to indicate a “localization” event 20 times in the training data and is used to indicate a “binding” event 1 time, then we only keep the “localization” type
Example entries in the verb dictionary
The verbs that may indicate an event also have a ‘+FUNC’ semantic property (not shown)
This is to differentiate from all other entries in the dictionary that are more general.
The dictionary does not include names or adjectives. Only the ones derived from these verbs are available.
Each grammar describes 1-3 syntactic patterns:
“Stimulation of human CD4” and “human CD4 stimulation”
Same grammar for both forms
Average results are good compared with other results in the BioNLP shared task and given the simple implementation
Binding events are more difficult because they usually (but not allways) include two proteins or genes
We did not cover the regulation / down-regulation / up-regulation for lack of time. These are even more complex and need more complex grammars
The proposed method takes advantage of the inflectional and derivational morphology and semantic properties established in dictionaries and grammars developed with NooJ, which allow to associate terminological verbs and their derivations to specific event types.
Methods such as the one proposed in this paper can be used to help database curators identify the most relevant facts in the literature and speed-up the annotation process. Tools based on these methods can also provide alternative querying and browsing of facts cited in the literature and be useful for researchers.