LODIE Team Participation Summary
Method Details
The LODIE¹ Team's Participation in the TAC2015 Entity Discovery Task of the Cold Start KBP Track
The Task
Entity Discovery of Cold Start KBP
•  Cold Start KBP aims to build a KB from scratch using a given corpus and a
predefined schema for the entities and relations that will compose the KB
•  Entity Discovery (ED, new in 2015)
   o  create a KB node for each person (PER), organization (ORG) and geo-political entity (GPE) mention in the document collection
   o  cluster all KB nodes that refer to the same entity
Challenge
•  Scale-up: millions of name mentions are extracted and clustered
Ziqi Zhang, Jie Gao, and Anna Lisa Gentile
1. Representing the Linked Open Data for Information Extraction project team. Contact: Ziqi Zhang, ziqi.zhang@sheffield.ac.uk
Method Overview – a cross-document coreference approach
•  State-of-the-art Named Entity Recognition (NER)
•  Clustering within each type of Named Entities (NEs)
   o  A non-deterministic string similarity clustering process to split data into macro-clusters
   o  Agglomerative clustering within each macro-cluster (that contains NEs from different documents)
Performance Overview
•  63.2 CEAF mention F-measure (ranked as #3) on the 2015 Cold
Start KBP evaluation dataset
Evaluation
1. NER
Modules                  NE Types       Text type
Stanford NER Standard    PER, ORG, GPE  All
Stanford NER Re-trained  PER, ORG, GPE  Colloquial
Gazetteer                GPE            All
Ad-hoc Rules             PER            Colloquial
Module outputs are merged by heuristics.
2.1. String similarity clustering
•  string similarity between entity names
•  non-deterministic
•  to split data into smaller macro-clusters
•  to focus on conflation of entity names
•  can over-cluster, e.g., similarity(‘David Miliband’, ‘Ed Miliband’) = 0.8
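The thresholded grouping described above can be sketched roughly as follows. The poster names neither the similarity measure nor the exact clustering procedure, so both are assumptions here: `difflib.SequenceMatcher` stands in for the similarity function, and `macro_cluster` is a hypothetical greedy single-pass grouping whose result depends on input order.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] (assumed stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def macro_cluster(names, threshold=0.7):
    """Greedy single pass: a name joins the first cluster holding a member
    at least `threshold` similar to it, otherwise it starts a new cluster."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if any(name_similarity(name, m) >= threshold for m in cluster):
                cluster.append(name)
                break
        else:  # no existing cluster was similar enough
            clusters.append([name])
    return clusters


names = ["David Miliband", "Ed Miliband", "Tony Blair", "Mr Blair", "Sheffield"]
clusters = macro_cluster(names, threshold=0.7)
for cluster in clusters:
    print(cluster)
```

With a threshold of 0.7 the two Miliband names fall into the same macro-cluster, which illustrates the over-clustering risk that the later agglomerative step has to undo.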
2.2. Agglomerative clustering
Applied to each macro-cluster that contains NEs from different documents
(assuming ‘one sense per name’ within an individual document).
b. Clustering
•  Standard group-average agglomerative clustering (Murtagh, 1985) with L1 distance
•  Determine a natural number of clusters in the data:
   o  Silhouette coefficient to evaluate clustering quality
   o  A non-greedy iterative algorithm that searches for a local optimum as an approximation
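A minimal sketch of group-average agglomerative clustering with L1 distance, scored by the silhouette coefficient, might look like this. The poster does not detail its non-greedy search, so `best_clustering` simply scores every cluster count in a small range, which is a simplification; all function names and the toy vectors are illustrative assumptions.

```python
from itertools import combinations


def l1(u, v):
    """L1 (Manhattan) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def group_average(c1, c2, X):
    """Mean pairwise L1 distance between the members of two clusters."""
    return sum(l1(X[i], X[j]) for i in c1 for j in c2) / (len(c1) * len(c2))


def agglomerate(X, k):
    """Merge the two closest clusters (group-average linkage) until k remain."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > k:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: group_average(clusters[p[0]], clusters[p[1]], X))
        clusters[a] += clusters[b]  # a < b, so deleting b keeps index a valid
        del clusters[b]
    return clusters


def silhouette(clusters, X):
    """Mean silhouette coefficient; singleton clusters score 0 by convention."""
    scores = []
    for ci, c in enumerate(clusters):
        for i in c:
            if len(c) == 1:
                scores.append(0.0)
                continue
            a = sum(l1(X[i], X[j]) for j in c if j != i) / (len(c) - 1)
            b = min(sum(l1(X[i], X[j]) for j in o) / len(o)
                    for oi, o in enumerate(clusters) if oi != ci)
            scores.append((b - a) / max(a, b) if max(a, b) else 0.0)
    return sum(scores) / len(scores)


def best_clustering(X, k_max):
    """Score k = 2..k_max exhaustively (simplified stand-in for the search)."""
    candidates = [agglomerate(X, k) for k in range(2, k_max + 1)]
    return max(candidates, key=lambda c: silhouette(c, X))


# Toy feature vectors: two tight groups and one outlier.
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
best = best_clustering(X, k_max=4)
print(sorted(sorted(c) for c in best))
```

Re-running the agglomeration for each candidate k is wasteful but keeps the sketch short; a practical version would cut a single merge tree at different levels.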
a. Featurization
•  Contextual tokens: previous and following n tokens
•  Contextual NEs: previous and following n NEs
•  Surface tokens (‘Mr Blair’ => ‘mr’, ‘blair’)
•  Word embedding based
   o  train word & phrase embeddings using Mikolov et al. (2013)
   o  compute OOV vectors based on additive compositionality, i.e., vec(London Tower) = vec(London) + vec(Tower)
Feature combination by weighting was also experimented with.
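The features listed above could be computed along these lines. This is only an illustrative sketch: the function names, the whitespace tokenisation and the three-dimensional toy embeddings are assumptions, not the team's implementation.

```python
def context_tokens(tokens, start, end, n):
    """Tokens in the n positions before and after the mention tokens[start:end]."""
    return tokens[max(0, start - n):start] + tokens[end:end + n]


def surface_tokens(mention):
    """Lower-cased tokens of the mention string, e.g. 'Mr Blair' -> ['mr', 'blair']."""
    return mention.lower().split()


def phrase_vector(phrase, embeddings):
    """Return the phrase's embedding; for an out-of-vocabulary phrase, fall back
    to the element-wise sum of its known word vectors (additive compositionality)."""
    if phrase in embeddings:
        return embeddings[phrase]
    known = [embeddings[w] for w in phrase.split() if w in embeddings]
    if not known:
        return None  # nothing to compose from
    return [sum(values) for values in zip(*known)]


# Toy embeddings with made-up values; real ones would come from a trained model.
embeddings = {"London": [1.0, 0.0, 2.0], "Tower": [0.5, 1.0, -1.0]}
tokens = "Yesterday Mr Blair visited London Tower with his family".split()

print(context_tokens(tokens, 1, 3, 2))           # context of the mention 'Mr Blair'
print(surface_tokens("Mr Blair"))
print(phrase_vector("London Tower", embeddings))
```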
Training the Stanford NER for colloquial text:
- The training dataset of TAC2014 English Entity Discovery and Linking (EDL)
Training the word embeddings:
- Comprehensive English source corpora 2013-14
Computing resources: single-threaded NER; agglomerative clustering parallelized on 16 cores, max of 64GB memory
Evaluation measures: standard Precision, Recall, F1 for NER; mention CEAF for clustering
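For reference, the standard precision, recall and F1 used for the NER step can be computed as below. The sketch assumes that a predicted mention counts as correct only when its document, character span and type all match a gold mention exactly; the poster does not state the matching criterion, and the mentions shown are made up.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall and F1 over two collections of mentions."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical mentions as (document id, start offset, end offset, type).
gold = {("doc1", 0, 10, "PER"), ("doc1", 25, 31, "GPE"), ("doc2", 4, 13, "ORG")}
predicted = {("doc1", 0, 10, "PER"), ("doc1", 25, 31, "ORG"),
             ("doc2", 4, 13, "ORG"), ("doc2", 40, 45, "PER")}

p, r, f1 = precision_recall_f1(predicted, gold)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Mention CEAF for the clustering step additionally requires an optimal one-to-one alignment between system and gold clusters, which is not shown here.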
Settings: string similarity thresholds of 0.7 (ss0.7) and 0.9 (ss0.9), combined with:
•  previous and following 5 tokens (tok5);
•  previous and following 3 entity mentions (ne3);
•  surface tokens (sf);
•  word embedding based features (dvec).
1. NER results (table: NER on TAC2014 EDL datasets)
2. Clustering results (CEAF mention) (table: CEAF (P, R, F) on TAC2014 EDL evaluation dataset)
3. Clustering results, using NER ground truth (table: ceiling CEAF (P, R, F) on TAC2014 EDL evaluation dataset)
4. Final results on TAC2015 (table: CEAF on TAC 2015 evaluation dataset for the runs ss0.9+sf+dvec, ss0.9 only, ss0.9+dvec, ss0.7+sf+dvec and ss0.7+dvec)
