The LODIE team participated in the Entity Discovery task of the Cold Start KBP track in 2015. Cold Start KBP aims to build a knowledge base (KB) from scratch using a given corpus and a predefined schema for the entities and relations that will compose the KB.
The LODIE¹ Team Participation at the TAC2015 Entity Discovery Task of the Cold Start KBP Track
The Task
Entity Discovery of Cold Start KBP
• Cold Start KBP aims to build a KB from scratch using a given corpus and a predefined schema for the entities and relations that will compose the KB
• Entity Discovery (ED, new in 2015)
• create a KB node for each person (PER), organization (ORG) and geo-political entity (GPE) mention in the document collection
• cluster all KB nodes that refer to the same entity
Challenge
• Scale-up: millions of name mentions are extracted and clustered
Ziqi Zhang, Jie Gao, and Anna Lisa Gentile
¹ Representing the Linked Open Data for Information Extraction (LODIE) project team. Contact: Ziqi Zhang, ziqi.zhang@sheffield.ac.uk
Method Overview – a cross-document coreference approach
• State-of-the-art Named Entity Recognition (NER)
• Clustering within each type of Named Entities (NEs)
• A non-deterministic string similarity clustering process to split data into macro-clusters
• Agglomerative clustering within each macro-cluster that contains NEs from different documents
Performance Overview
• 63.2 CEAF mention F-measure (ranked #3) on the 2015 Cold Start KBP evaluation dataset
Evaluation
1. NER
Modules                     NE Types        Text type
Stanford NER (standard)     PER, ORG, GPE   All
Stanford NER (re-trained)   PER, ORG, GPE   Colloquial
Gazetteer                   GPE             All
Ad-hoc rules                PER             Colloquial

The outputs of the four modules are merged by heuristics.
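The poster does not specify the merge heuristics, so the sketch below makes two assumptions of its own: overlapping mentions are resolved in favour of the longer span, and exact-span ties are broken by a module priority order. Both the rule and the module names are illustrative, not the team's actual code.

```python
# Illustrative merge of mentions produced by several NER modules.
# Assumed heuristics (not from the poster): longer spans win on overlap;
# for identical spans, the module appearing earlier in PRIORITY wins.

PRIORITY = ["stanford_retrained", "stanford_standard", "gazetteer", "adhoc_rules"]

def merge_mentions(mentions):
    """mentions: list of (start, end, ne_type, module) tuples."""
    # Rank candidates: longest span first, then by module priority
    ranked = sorted(
        mentions,
        key=lambda m: (-(m[1] - m[0]), PRIORITY.index(m[3])),
    )
    kept = []
    for m in ranked:
        # keep a mention only if it does not overlap one already kept
        if all(m[1] <= k[0] or m[0] >= k[1] for k in kept):
            kept.append(m)
    return sorted(kept, key=lambda m: m[0])

merged = merge_mentions([
    (0, 12, "PER", "stanford_standard"),   # "Barack Obama"
    (7, 12, "PER", "adhoc_rules"),         # "Obama" (subsumed, dropped)
    (20, 26, "GPE", "gazetteer"),          # "Hawaii"
])
```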
2.1. String similarity clustering
• string similarity between entity names
• non-deterministic
• splits data into smaller macro-clusters
• focuses on conflation of entity names
• can over-cluster, e.g., sim(‘David Miliband’, ‘Ed Miliband’) = 0.8
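The poster does not name its similarity measure, so the sketch below uses `difflib`'s `ratio` as a stand-in; conveniently, it also scores the poster's example pair at 0.8. The greedy first-fit grouping is order-dependent, which illustrates why such a pass is non-deterministic.

```python
from difflib import SequenceMatcher

# Illustrative macro-clustering; difflib's ratio is a stand-in for the
# (unnamed) string similarity measure used by the team.

def sim(a, b):
    return SequenceMatcher(None, a, b).ratio()

def macro_cluster(names, threshold):
    """Greedy first-fit clustering: each name joins the first cluster whose
    representative (first member) is similar enough, else starts a new one.
    The result depends on input order, i.e. the process is non-deterministic."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if sim(name, cluster[0]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

# The over-clustering example from the poster: distinct people, similar names
print(round(sim("David Miliband", "Ed Miliband"), 2))   # 0.8

groups = macro_cluster(
    ["David Miliband", "Ed Miliband", "Tony Blair"], threshold=0.7)
# At threshold 0.7 both Milibands land in the same macro-cluster
```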
2.2. Agglomerative clustering
Applied to each macro-cluster that contains NEs from different documents (hypothesizing ‘one-sense-per-name’ within an individual document).
b. Clustering
• Standard group-average agglomerative clustering (Murtagh, 1985) with L1 distance
• Determine a natural cluster number in the data:
o Silhouette coefficient to evaluate clustering quality
o A non-greedy iterative algorithm that searches for a local optimum as an approximation
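The combination above can be sketched in a few dozen lines: group-average agglomerative merging under L1 distance, with the silhouette coefficient used to pick a natural cluster number. The toy data and the exhaustive scan over cluster counts are illustrative only; at the task's scale the poster's system uses an iterative search for a local optimum instead.

```python
from itertools import combinations

# Minimal sketch: group-average agglomerative clustering with L1 distance,
# selecting the partition with the best silhouette coefficient.

def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def avg_link(ca, cb):
    # group-average linkage: mean pairwise L1 distance between clusters
    return sum(l1(a, b) for a in ca for b in cb) / (len(ca) * len(cb))

def silhouette(clusters):
    scores = []
    for ci, c in enumerate(clusters):
        for p in c:
            if len(c) == 1:
                scores.append(0.0)  # convention for singleton clusters
                continue
            a = sum(l1(p, q) for q in c if q is not p) / (len(c) - 1)
            b = min(sum(l1(p, q) for q in o) / len(o)
                    for cj, o in enumerate(clusters) if cj != ci)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

def best_partition(points):
    clusters = [[p] for p in points]
    best = None
    while len(clusters) > 2:
        # merge the pair of clusters with the smallest average L1 distance
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: avg_link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
        score = silhouette(clusters)
        if best is None or score > best[0]:
            best = (score, [list(c) for c in clusters])
    return best[1]

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
# The two tight groups should be recovered as the best-scoring partition
```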
a. Featurization
• Contextual tokens: previous and following n tokens
• Contextual NEs: previous and following n NEs
• Surface tokens (‘Mr Blair’ => ‘mr’, ‘blair’)
• Word embedding based
o train word & phrase embeddings using Mikolov et al. (2013)
o compute OOV vectors based on additive compositionality, i.e., vec(London Tower) = vec(London) + vec(Tower)
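The additive rule for out-of-vocabulary phrases is simple element-wise summation; a toy version with made-up 3-d vectors (real embeddings would come from a word2vec-style model):

```python
# Toy illustration of the OOV strategy: a phrase vector is composed
# additively from its word vectors. The 3-d vectors below are made up.

emb = {
    "London": [0.2, 0.5, -0.1],
    "Tower":  [0.4, -0.3, 0.6],
}

def phrase_vector(phrase, dim=3):
    """vec(phrase) = sum of the vectors of its in-vocabulary words."""
    total = [0.0] * dim
    for w in phrase.split():
        if w in emb:
            total = [t + x for t, x in zip(total, emb[w])]
    return total

vec = phrase_vector("London Tower")
# vec == [0.2 + 0.4, 0.5 - 0.3, -0.1 + 0.6], element-wise
```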
Feature combination by weighting was also experimented with.
Training the Stanford NER for colloquial text:
- the training dataset of the TAC2014 English Entity Discovery and Linking (EDL) track
Training the word embeddings:
- comprehensive English source corpora, 2013–14
Computing resources: single-thread NER; agglomerative clustering parallelized on 16 cores, with a maximum of 64GB memory
Evaluation measures: standard Precision, Recall, and F1 for NER; mention CEAF for clustering
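Mention CEAF scores a clustering by the best one-to-one alignment between key (gold) and response (system) clusters, where the similarity of two clusters is the number of mentions they share. A sketch of the scorer follows; the brute-force search over permutations is illustrative (real scorers use the Kuhn-Munkres algorithm), and the mention ids are made up.

```python
from itertools import permutations

# Illustrative mention-CEAF scorer: find the one-to-one alignment between
# key and response clusters that maximises shared mentions, then score.

def ceaf_m(key, response):
    """key, response: lists of sets of mention ids."""
    small, large = sorted((key, response), key=len)
    # best total overlap over all one-to-one alignments
    phi = max(
        sum(len(a & b) for a, b in zip(small, perm))
        for perm in permutations(large, len(small))
    )
    p = phi / sum(len(c) for c in response)
    r = phi / sum(len(c) for c in key)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

p, r, f = ceaf_m(
    key=[{"m1", "m2", "m3"}, {"m4"}],
    response=[{"m1", "m2"}, {"m3", "m4"}],
)
# phi = 3 (align {m1,m2,m3} with {m1,m2} and {m4} with {m3,m4});
# p = r = f = 0.75
```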
Settings: string similarity thresholds of 0.7 (ss0.7) and 0.9 (ss0.9), combined with:
• previous and following 5 tokens (tok5);
• previous and following 3 entity mentions (ne3);
• surface tokens (sf);
• word embedding based (dvec)

Results (figures omitted):
1. NER results on the TAC2014 EDL datasets
2. Clustering results (mention CEAF: P, R, F) on the TAC2014 EDL evaluation dataset
3. Clustering results using NER ground truth (ceiling CEAF: P, R, F) on the TAC2014 EDL evaluation dataset
4. Final results (CEAF) on the TAC2015 evaluation dataset, comparing ss0.9+sf+dvec, ss0.9 only, ss0.9+dvec, ss0.7+sf+dvec, ss0.7+dvec