Evaluating Distributional Semantic and Feature Selection for Extracting Relationships from Biological Text

Ehsan Emadzadeh*
Siddhartha Jonnalagadda†
Graciela Gonzalez*
*Department of Biomedical Informatics, Arizona State University
†Department of Health Sciences Research, Mayo Clinic

PROBLEMS
 Which features are useful for biological relation
extraction?
 The contribution of word-level semantic features, such
as distributional semantics, to relationship extraction
is unknown
 Which method of calculating semantic features works
best?

CORPUS
 We used the BioNLP 2011 GENIA corpus
 The training set includes 800 abstracts and 5 full papers
 The test set includes 150 abstracts and 5 full papers

[Figure: bar chart comparing the Train set and the Test set; y-axis from 0 to 4000]

OUR CONTRIBUTION
 Evaluated distributional semantic features and
different ways to calculate them
 Detailed evaluation of all features

 Measured how much greedy Feature Selection (FS)
can improve classification results

RELATION REPRESENTATION
 A relation consists of two parts:
 Trigger: the main evidence of a relation (or an event)
 Argument(s): complementary information about the relation
and the entities involved

Example 1: “Cross-linking of CD30 induces HIV in chronically infected T cells”
Trigger: Binding; Argument: Protein

Example 2: “Upon engagement of CD40 by CD40 ligand (CD40L)”
Trigger: Binding

PREVIOUS WORKS: FEATURE SELECTION
 Forman 2003: extensive survey of FS metrics for
text classification
 Eom et al. 2004: Feature Dimension Reduction
Filter (FDRF) for protein relationship extraction
from PubMed articles
 Saeys et al. 2007: survey of FS techniques in
bioinformatics
 Landeghem et al. 2008: an FS technique based on
the concept of gain ratio for extracting
protein-protein interactions
 Landeghem et al. 2010: ensemble feature selection
for biomolecular text mining

METHOD OUTLINE
 Preprocessing
 Building classifier instances

 Trigger extraction

PREPROCESSING
 How are string values converted to numeric features?
 For each string attribute, one new attribute is created
for each of its values
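
The conversion described above resembles what is commonly called one-hot encoding. The following is a minimal sketch, not the authors' implementation: it assumes each new attribute holds 0 or 1, and the function name `one_hot_encode` and the toy instances are hypothetical.

```python
def one_hot_encode(instances, string_attributes):
    """Expand each string attribute into one 0/1 attribute per distinct value.

    Assumption: the new attributes are binary indicators; the slides only say
    that one attribute is created per value.
    """
    # Collect the distinct values observed for each string attribute.
    values = {
        attr: sorted({inst[attr] for inst in instances})
        for attr in string_attributes
    }
    encoded = []
    for inst in instances:
        row = {}
        for attr, value in inst.items():
            if attr in values:
                # One new attribute per distinct value of this string attribute.
                for v in values[attr]:
                    row[f"{attr}={v}"] = 1 if value == v else 0
            else:
                # Numeric attributes are passed through unchanged.
                row[attr] = value
        encoded.append(row)
    return encoded


# Hypothetical toy instances using two feature names from the slides.
instances = [
    {"POS": "NN", "HasDigit": 0},
    {"POS": "VBZ", "HasDigit": 1},
]
encoded = one_hot_encode(instances, ["POS"])
```

Here the single string attribute `POS` becomes two numeric attributes, `POS=NN` and `POS=VBZ`, while `HasDigit` is left as it was.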

BUILDING CLASSIFIER INSTANCES
 Create a bag of triggers for each relation type from
the train set
 If a phrase appears in the bag of triggers, a training
example is created for it
 For training, the examples are labeled based on the
annotation
 When training, at most 3 negative examples were
generated for each positive example
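
The instance-building steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the BioEvent code: the function names are hypothetical, phrases are matched case-insensitively, the cap is read as "at most 3 negatives per positive in total", and surplus negatives are dropped by random sampling (the slides do not say how they were chosen).

```python
import random

MAX_NEGATIVES_PER_POSITIVE = 3  # cap stated on the slide


def build_trigger_bag(annotated_triggers):
    """Collect the trigger phrases annotated in the train set for one relation type."""
    return {phrase.lower() for phrase in annotated_triggers}


def build_instances(candidates, trigger_bag, seed=0):
    """Create labeled examples for phrases that appear in the bag of triggers.

    candidates: list of (phrase, is_annotated_trigger) pairs taken from the text.
    Returns a list of (phrase, label) pairs with label 1 (positive) or 0 (negative).
    """
    # Only phrases found in the bag of triggers become examples.
    in_bag = [(p, flag) for p, flag in candidates if p.lower() in trigger_bag]
    positives = [(p, 1) for p, flag in in_bag if flag]
    negatives = [(p, 0) for p, flag in in_bag if not flag]
    limit = MAX_NEGATIVES_PER_POSITIVE * len(positives)
    if len(negatives) > limit:
        # Assumption: surplus negatives are removed by random down-sampling.
        negatives = random.Random(seed).sample(negatives, limit)
    return positives + negatives


# Hypothetical toy data: one annotated "binding" trigger, five unannotated
# occurrences of the same word, and one phrase that is not in the bag.
bag = build_trigger_bag(["binding", "interaction"])
candidates = [("binding", True)] + [("binding", False)] * 5 + [("expression", False)]
examples = build_instances(candidates, bag)
```

With one positive example, the five in-bag negatives are cut down to three, and "expression" produces no example because it is not in the bag.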

TRIGGER EXTRACTION
 Double-layered machine learning approach for
trigger classification
 First layer: one-vs-rest binary classification for each
relation type (SVM only)
 Second layer: a single classifier takes the first-layer
classifiers’ votes and makes the final decision (evaluated
SVM, Decision Tree and Logistic Regression)
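
The data flow of the two layers can be sketched as below. This only illustrates the wiring and is not the authors' system: the first-layer "classifiers" are hypothetical hand-written scoring functions standing in for trained SVMs, and the second layer is a hypothetical rule standing in for a trained SVM, Decision Tree, or Logistic Regression model.

```python
def first_layer_scores(features, binary_classifiers):
    """Run every one-vs-rest classifier and collect the scores in a fixed order."""
    return [binary_classifiers[name](features) for name in sorted(binary_classifiers)]


def classify_trigger(features, binary_classifiers, second_layer):
    """The second-layer model sees only the first-layer outputs and decides."""
    return second_layer(first_layer_scores(features, binary_classifiers))


# Hypothetical stand-ins for trained first-layer classifiers (one per relation type).
binary_classifiers = {
    "Binding": lambda f: 1.0 if f["stem"] == "bind" else -1.0,
    "Localization": lambda f: 1.0 if f["stem"] == "local" else -1.0,
}


def rule_second_layer(scores):
    """Hypothetical stand-in for a trained second-layer classifier."""
    names = sorted(binary_classifiers)
    best = max(range(len(scores)), key=scores.__getitem__)
    # Report "None" when no first-layer classifier votes for its relation type.
    return names[best] if scores[best] > 0 else "None"


binding_label = classify_trigger({"stem": "bind"}, binary_classifiers, rule_second_layer)
other_label = classify_trigger({"stem": "walk"}, binary_classifiers, rule_second_layer)
```

In a real setup both layers would be trained models; the point of the sketch is only that the second layer's input is the vector of first-layer outputs.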

SEMANTIC FEATURE
 Distributional semantic similarity
 Random Indexing
 Semantic Vectors Java package

 Semantic similarity of a trigger candidate to the bag of
words (BOW) of known triggers
 Semantic features can be calculated in different
ways:
 Maximum similarity to the relation type’s BOW
 Average similarity to the relation type’s BOW
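
The two ways of calculating the semantic feature can be sketched as follows. This is a minimal illustration, assuming cosine similarity between word vectors; the toy two-dimensional vectors are hypothetical, and in practice the vectors would come from a distributional model such as Random Indexing.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_features(candidate_vec, bow_vectors):
    """Return (maximum, average) similarity of a candidate to a trigger BOW.

    bow_vectors: mapping from each known trigger word to its vector.
    """
    sims = [cosine(candidate_vec, vec) for vec in bow_vectors.values()]
    return max(sims), sum(sims) / len(sims)


# Hypothetical toy vectors for a candidate and a two-word "Binding" BOW.
candidate = [1.0, 0.0]
binding_bow = {"binding": [1.0, 0.0], "ligation": [0.0, 1.0]}
max_sim, avg_sim = semantic_features(candidate, binding_bow)
```

The maximum rewards a candidate that is close to any single known trigger, while the average rewards a candidate that is close to the BOW as a whole.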

MAXIMUM VS. AVERAGE

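[Figure: for the “Binding” event type, a trigger candidate is compared with the words in the “Binding” triggers BOW (interacting, recruitment, binding, …, ligation), with example similarities 0.6, 0.9, 0.8, …]
MaxSimilarity = 0.9
AverageSimilarity = 0.76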

ONE VS. ALL SEMANTIC FEATURES
 One Semantic Feature: each relation type’s classifier
includes only the similarity to its own BOW
 All Semantic Features: every relation type’s classifier
includes the similarities to the BOWs of all relation types
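
The difference between the two settings can be shown with a short sketch. The mode names `ONE_SF` and `ALL_SF` follow the labels used in the results; the function name, the feature-naming scheme, and the similarity values are hypothetical.

```python
def select_semantic_features(similarities, classifier_type, mode):
    """Pick the semantic features given to one relation type's classifier.

    similarities: relation type -> similarity of the candidate to that type's BOW.
    """
    if mode == "ONE_SF":
        # Only the similarity to the classifier's own BOW.
        return {f"sim_{classifier_type}": similarities[classifier_type]}
    if mode == "ALL_SF":
        # Similarities to the BOWs of all relation types.
        return {f"sim_{rel}": sim for rel, sim in similarities.items()}
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical similarities of one trigger candidate to three BOWs.
similarities = {"Binding": 0.9, "Regulation": 0.4, "Localization": 0.2}
one_sf = select_semantic_features(similarities, "Binding", "ONE_SF")
all_sf = select_semantic_features(similarities, "Binding", "ALL_SF")
```

Under `ALL_SF` the “Binding” classifier can also see that a candidate is far from the other relation types, which is information `ONE_SF` withholds.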

[Figure: the similarities to the “Binding”, “Regulation”, and “Localization” BOWs feed the “Binding”, “Regulation”, and “Localization” classifiers]

FEATURE SELECTION

[Figure: greedy feature selection flow. Starting from all features (29 features), each feature is classed as improving, neutral, or worsening. Improving features are kept, worsening features are removed, and neutral features go on to further evaluation. This process is repeated for each event type.]
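
One possible reading of this greedy procedure is sketched below. It is an interpretation, not the authors' exact algorithm: it assumes each feature is judged by how the score changes when that feature alone is left out, and that features still neutral at the end are dropped. The `evaluate` callback, the `tolerance` parameter, the "Noise" feature, and the toy weights are all hypothetical.

```python
def greedy_feature_selection(all_features, evaluate, tolerance=0.0):
    """Greedy selection by single-feature ablation.

    evaluate(features) -> score (for example F-measure) of a classifier
    trained with exactly those features.
    """
    kept = []                     # features judged as improving
    pending = list(all_features)  # features still to be judged
    while pending:
        current = kept + pending
        baseline = evaluate(current)
        improving, neutral, worsening = [], [], []
        for feature in pending:
            without = [f for f in current if f != feature]
            delta = baseline - evaluate(without)
            if delta > tolerance:
                improving.append(feature)   # score drops without it: keep
            elif delta < -tolerance:
                worsening.append(feature)   # score rises without it: remove
            else:
                neutral.append(feature)     # no clear effect: evaluate again
        kept.extend(improving)
        if not improving and not worsening:
            break  # assumption: features that stay neutral are dropped
        pending = neutral
    return kept


# Hypothetical additive scorer; a real run would train and test a classifier.
weights = {"POS": 5, "PorterStem": 3, "HasDigit": 0, "Noise": -2}
selected = greedy_feature_selection(
    list(weights), lambda feats: sum(weights[f] for f in feats)
)
```

In this toy run "POS" and "PorterStem" are kept, "Noise" is removed in the first round, and "HasDigit" stays neutral and is dropped. The procedure would be run once per event type, each with its own `evaluate`.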

SELECTED FEATURES
Event type | Selected features | Count
Localization | PhraseText, SentenceTFIDF | 2
Protein catabolism | POS, PorterStem, SentenceTFIDF | 3
Phosphorylation | POS, PorterStem, SentenceTFIDF | 3
Positive regulation | POS, POSNext1, POSNext2, POSPre1, POSPre2, PorterStem | 6
Gene expression | PorterStem, WordnetStem, PhraseText, AllUpperCase, HasDigit, QuoteRightCount, ProteinCountInWindow, ProteinCountInSentence | 8
Binding | WordnetStem, OriginalWord, StartWithUppercase, AllUppercase, AllLowercase, HasDigit, ProteinCountInWindow, ProteinCountInSentence | 8
Transcription | POS, POSNext1, POSNext2, PhraseText, CommaRightCount, POSPre1, POSPre2, ProteinCountInSentence, SentenceTFIDF | 9
Negative regulation | PorterStem, AllUppercase, AllLowercase, HasSpecialChars, HasDigit, QuoteRightCount, NameEntity, ProteinCountInWindow, ProteinCountInSentence | 9
Regulation | AllLowercase, AllUppercase, CommaLeftCount, HasDigit, MESHHeading, NameEntity, OriginalWord, PorterStem, POSNext2, POSPre2, ProteinCountInSentence, ProteinCountInWindow, SentenceTFIDF, StartWithUppercase, WordnetStem | 15

RESULTS: FEATURE SELECTION
Event type | Before FS | After FS | Change
Protein catabolism | 16.51 | 70.47 | 53.96
Negative regulation | 21.67 | 43.62 | 21.95
Localization | 31.11 | 51.98 | 20.87
Phosphorylation | 60.59 | 79.52 | 18.93
Regulation | 1.90 | 14.32 | 12.42
Gene expression | 66.43 | 68.62 | 2.19
Transcription | 29.17 | 30.48 | 1.31
Binding | 54.49 | 54.42 | -0.07
Positive regulation | 40.39 | 35.37 | -5.02

RESULTS: SEMANTIC FEATURES
[Figure: bar chart comparing Without SF, ONE_SF, and ALL_SF; y-axis from 0 to 90]

RESULTS: MAX VS. AVE
[Figure: bar chart comparing Maximum and Average; y-axis from 0 to 90]

RESULTS: SVM VS. LOGISTIC REGRESSION
[Figure: bar chart comparing SVM and Logistic Regression; y-axis from 0 to 90]

CONCLUSION
 We found a different optimized feature set for each
event type, and FS improved the classification of
triggers for 7 out of 9 event types
 Semantic features, namely distributional semantics,
can improve classification results by up to 19.37%
F-measure
 Using all semantic features for all classifiers is
better than using only the related semantic feature (for
most of the event types)
 “Maximum” is better when BOWs are very different
but “Average” is better when BOWs are very similar

FUTURE WORKS
 Compare different semantic similarity kernels
 Compare other FS methods to the one we used in
this work
 Try manually created trigger BOWs

Special thanks to: Dr. Trevor Cohen, Robert Leaman, Azadeh Nikfarjam
and Nate Sutton. This work is supported by funding from NLM Contract
HHSN276201000031C.

Relationship extraction tool:
BioEvent (http://bioevent.sf.net)
Distributional semantic similarity:
The Semantic Vectors Package (http://code.google.com/p/semanticvectors/)
