PhD Thesis:
Entity-Centric Knowledge Discovery for
Idiosyncratic Domains
Roman Prokofyev
eXascale Infolab
University of Fribourg, Switzerland
June 17th
Fribourg, Switzerland
Outline
1. Motivation
2. Problem definition
3. Contributions
1. Named entity recognition
2. Co-reference resolution
3. Entity disambiguation
4. Tag recommendation
4. Conclusions
Outline
1. Motivation
2. Problem definition
3. Contributions
1. Named entity recognition
2. Co-reference resolution
3. Entity disambiguation
4. Tag recommendation
4. Conclusions
Structured vs. unstructured data
• Structured data
  • Databases, CRMs
  • Structured formats/tables (CSV, MS Excel)
  • Schema.org, RDFa
• Unstructured data
  • Reports, presentations, emails, etc.
  • Web pages, social media content
Structured data
Unstructured data
Unstructured → natural language content
Structured vs. unstructured data
• Structured data
Can be understood and processed by machines
• Unstructured data
Can be understood by humans in individual pieces,
but extremely hard to get insights from large collections
Technical knowledge growth
Communications of the ACM, Vol. 56 No. 12, Pages 64-73
According to IBM, 80% of data generated on the web is
unstructured.
Idiosyncratic data
• Data in vertical domains with specific vocabulary.
• Parts of vocabulary are evolving and not clearly
defined.
Examples: physics, biomedicine, etc.
Hence, knowledge cannot be extracted from such domains using
existing knowledge bases / dictionaries.
Outline
1. Motivation
2. Problem definition
3. Contributions
1. Named entity recognition
2. Co-reference resolution
3. Entity disambiguation
4. Tag recommendation
4. Conclusions
Problem definition
Goal of the thesis:
Design effective methods and techniques to mine
structured data from idiosyncratic domains.
Knowledge discovery tasks
• Named entity recognition
• Entity Disambiguation and Linking
• Co-reference resolution
(Diagram: overlapping tasks: Recognition, Disambiguation and Linking, Co-reference)
Named Entity Recognition
How It Should Have Ended began after Daniel Baxter and Tommy
Watson started discussing alternate endings for a movie they
had watched. Christina "Tina" Alexander, who has previously
worked with Daniel, joined the team shortly thereafter.
HISHE was awarded the "Best Internet Parody" award for How
Superman Should Have Ended by Spike TV at the 2006 Scream
Awards at the Pantages Theater in Hollywood, California. It has
also been featured as a Yahoo! Profile Pick and has appeared in
both Fade In and Wired magazines.
(Highlighted entity types: Person, Organization, Location)
Entity Disambiguation and Linking
Co-reference resolution
http://www.telegraph.co.uk/
“Xi Jinping was due to arrive in Washington for a
dinner with Barack Obama on Thursday night, in
which he will aim to reassure the US president
about a rising China. The Chinese president said he
favors a ‘new model of major country relationship’
built on understanding, rather than suspicion.”
Co-reference chains: {Xi Jinping, he, he, Chinese president}; {Barack Obama, US president}
Outline
1. Motivation
2. Problem definition
3. Contributions
1. Named entity recognition
2. Co-reference resolution
3. Entity disambiguation
4. Tag recommendation
4. Conclusions
Technical contributions
3.1 Named entity recognition
Roman Prokofyev, Gianluca Demartini, and Philippe Cudré-Mauroux.
Effective Named Entity Recognition for Idiosyncratic Web Collections.
In Proceedings of the 23rd International Conference on World Wide Web (WWW ’14).
3.2 Co-reference resolution
Roman Prokofyev, Alberto Tonon, Djellel Difallah, Michael Luggen, Loïc Vouilloz, and Philippe Cudré-Mauroux.
SANAPHOR: Ontology-Based Coreference Resolution.
In Proceedings of the 14th International Semantic Web Conference (ISWC ’15).
3.3 Entity disambiguation
Roman Prokofyev, Gianluca Demartini, Alexey Boyarsky, Oleg Ruchayskiy, and Philippe Cudré-Mauroux.
Ontology-Based Word Sense Disambiguation for Scientific Literature.
In Proceedings of the 35th European Conference on Advances in Information Retrieval (ECIR ’13).
3.4 Tag recommendation
Roman Prokofyev, Alexey Boyarsky, Oleg Ruchayskiy, Karl Aberer, Gianluca Demartini, and Philippe Cudré-Mauroux.
Tag Recommendation for Large-Scale Ontology-Based Information Systems.
In Proceedings of the 11th International Semantic Web Conference (ISWC ’12).
Outline
1. Motivation
2. Problem definition
3. Contributions
1. Named entity recognition
2. Co-reference resolution
3. Entity disambiguation
4. Tag recommendation
4. Conclusions
Problem Definition
Entity type: scientific concept
Example scientific concepts, highlighted in the paper
“Galactic Signatures of Decaying Dark Matter”, arXiv.org, 2009:
• non-baryonic dark matter
• dark matter decay
• gravitational lensing
• dark matter
• Standard Model
Intuitions
In order to extract named entities, we need to classify all
phrases in the text.
1. Reduce the number of phrases to analyze by candidate
pre-selection.
2. Classify a relatively small number of candidates with a
supervised method.
(Legend: phrases marked as named entity vs. not a named entity)
Approach
Our problem is formulated as a binary classification task.
Two-step classification:
• Extract candidate named entities using a frequency-filtering
algorithm.
• Classify candidate named entities using a supervised
classifier.
Candidate selection allows us to significantly reduce the
number of n-grams to classify.
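The first step can be sketched in a few lines of Python. This is a minimal illustration of the frequency-filtering idea, not the thesis implementation; the threshold value is an assumption.

```python
from collections import Counter

def select_candidates(ngrams, min_freq=2):
    """Keep only n-grams occurring at least `min_freq` times in the
    document: the frequency filter that prunes the candidate set
    before supervised classification."""
    counts = Counter(ngrams)
    return {ng: c for ng, c in counts.items() if c >= min_freq}

doc_ngrams = ["dark matter", "dark matter", "dark matter",
              "cold dark", "cold dark", "dominant dark"]
print(select_candidates(doc_ngrams))  # {'dark matter': 3, 'cold dark': 2}
```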
Pipeline
Processing pipeline:
1. Text extraction (Apache Tika)
2. N-gram indexing (lemmatization, POS tagging, n+1-gram merging, frequency reweighting) → list of extracted n-grams
3. Candidate selection → list of selected n-grams
4. Feature extraction (for each candidate)
5. Supervised classifier → ranked list of n-grams
Candidate selection
Example passage:
“The dominant dark matter decay channels are to standard
model leptons… We know that the self-interacting nature
of cold dark matter has been supported by some recent
observational data… In this paper, we strongly argue in
favor of the collisionless nature of cold dark matter
particles, which are feebly self-interacting at very
small scales.”
Extracted candidate n-grams with occurrence counts:
dark matter (×3), cold dark (×2), matter decay (×2), dominant dark (×2)
Candidate selection: Output
(Figure: first page of “Galactic Signatures of Decaying Dark Matter”, arXiv.org, 2009,
with candidate n-grams highlighted as named entity vs. not a named entity)
Classification
Machine Learning algorithm:
Forests of Decision Trees
Feature families:
• POS Tags and their derivatives
• External knowledge bases (DBLP, DBpedia)
• DBpedia relation graphs
• Syntactic features
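A minimal sketch of this step with scikit-learn's random forest; the feature vectors and labels below are invented placeholders for the real feature families.

```python
from sklearn.ensemble import RandomForestClassifier

# One row per candidate n-gram; columns are placeholder features, e.g.
# [POS-pattern match, found in DBLP, found in DBpedia, component size]
X = [[1, 1, 1, 14], [0, 0, 0, 1], [1, 0, 1, 9], [0, 0, 1, 2]]
y = [1, 0, 1, 0]  # 1 = named entity, 0 = not a named entity

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
print(clf.feature_importances_)                 # per-feature importances
print(clf.predict_proba([[1, 1, 0, 7]])[:, 1])  # score used to rank candidates
```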
Datasets
Two collections:
• Computer Science collection: 100 papers (SIGIR 2012)
• Physics collection: 100 papers randomly selected from
arXiv.org High Energy Physics category
Collection statistics:
• CS Collection: ~21,500 candidate n-grams, ~8,000 entities
• Physics Collection: ~18,000 candidate n-grams, ~6,000 entities
Available at: github.com/XI-lab/scientific_NER_dataset
Features: Connected components
(Figure: graph of candidate concepts. Standard Model, dark matter, particle physics,
gravitational lensing, non-relativistic fluid, … form one large connected component;
an unrelated candidate such as “music song” forms a component of only one node.)
Features: Connected components
(Figure: distribution of connected-component sizes; x-axis: component size, 5 to 40;
y-axis: number of components, log scale from 0.4 to 400)
Feature: the size of the connected component a candidate belongs to (numeric feature).
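The feature itself is cheap to compute once the concept graph is built. A toy sketch with networkx follows; in the thesis the edges come from DBpedia relations, while here they are made up for illustration.

```python
import networkx as nx

# Toy relation graph between candidate concepts (edges invented).
G = nx.Graph()
G.add_edges_from([
    ("Standard Model", "dark matter"),
    ("dark matter", "particle physics"),
    ("particle physics", "gravitational lensing"),
])
G.add_node("music song")  # unrelated candidate stays isolated

# Map each candidate to the size of its connected component.
component_size = {
    node: len(comp)
    for comp in nx.connected_components(G)
    for node in comp
}
print(component_size["dark matter"])  # 4 -> likely a domain concept
print(component_size["music song"])   # 1 -> likely noise
```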
Experiments
1. Feature importance scores
2. Comparison with a state-of-the-art MaxEntropy method
All results are averaged over 10-fold cross-validation.
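The evaluation protocol can be reproduced in one call with scikit-learn; the synthetic data below merely stands in for the real candidate feature matrix and labels.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the candidate feature matrix and gold labels.
X, y = make_classification(n_samples=500, n_features=7, random_state=0)
scores = cross_val_score(RandomForestClassifier(n_estimators=100),
                         X, y, cv=10, scoring="f1")
print(f"mean F1 over 10 folds: {scores.mean():.3f}")
```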
Feature importance
CS Collection (7 features):
• NN*: 0.3091
• DBLP: 0.1442
• Components + DBLP: 0.1125
• Components: 0.0789
• *VB: 0.0386
• *NN: 0.0380
• JJ*: 0.0364
Physics Collection (6 features):
• ScienceWISE: 0.2870
• Components + ScienceWISE: 0.1948
• Wikipedia redirects: 0.1104
• Components: 0.1093
• Wikilinks: 0.0439
• Participation count: 0.0370
State-of-the-art comparison
Method: Precision / Recall / F1 score
• Maximum Entropy: 0.6566 / 0.7196 / 0.6867
• Decision Trees: 0.8121 / 0.8742 / 0.8420
The MaxEntropy classifier receives the full text as input
(we used the classifier from the NLTK package).
Comparison experiment: 80% of the collection as training
data, 20% as the test dataset.
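For reference, a baseline of this kind can be set up with NLTK's MaxentClassifier; the toy feature sets below are illustrative, not the features used in the thesis.

```python
from nltk.classify import MaxentClassifier

# Toy (featureset, label) pairs; the real baseline is trained on
# features derived from the full text of the collection.
train_set = [
    ({"word": "dark", "pos": "JJ"}, "entity"),
    ({"word": "matter", "pos": "NN"}, "entity"),
    ({"word": "the", "pos": "DT"}, "other"),
    ({"word": "recent", "pos": "JJ"}, "other"),
]
baseline = MaxentClassifier.train(train_set, algorithm="IIS", max_iter=10)
print(baseline.classify({"word": "lensing", "pos": "NN"}))
```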
Outline
1. Motivation
2. Problem definition
3. Contributions
1. Named entity recognition
2. Co-reference resolution
3. Entity disambiguation
4. Tag recommendation
4. Conclusions
Motivations and Task Overview
Task: identify groups (clusters) of co-referring mentions.
Motivations:
• identification of the specific type of an unknown entity
• extraction of more relationships between named entities
http://www.telegraph.co.uk/
“Xi Jinping was due to arrive in Washington for a
dinner with Barack Obama on Thursday night, in
which he will aim to reassure the US president
about a rising China. The Chinese president said he
favors a ‘new model of major country relationship’
built on understanding, rather than suspicion.”
Motivations for a rich semantic layer
Syntactic approaches are not able to differentiate between
the names of the city and the province.
http://www.telegraph.co.uk/
“Xi Jinping was due to arrive in Washington for a
dinner with Barack Obama on Thursday night, in
which he will aim to reassure the US president
about a rising China. The Chinese president said he
favors a ‘new model of major country relationship’
built on understanding, rather than suspicion.”
Generic overview of the approach
Key techniques
Split and merge clusters based on their semantics.
SANAPHOR pipeline: clusters produced by Stanford Coref → entity/type linking → split clusters → merge clusters
Pre-Processing: Entity Linking
(Figure: mentions US President, Barack Obama, Australia, Quintex Australia, Quintex ltd.
before and after linking to knowledge-base entities)
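The speaker notes mention DBpedia Spotlight for this step. Below is a hedged sketch against Spotlight's public demo endpoint; the exact endpoint, parameters, and response fields may differ from the linker actually used in SANAPHOR.

```python
import requests

# Entity linking via the public DBpedia Spotlight demo service.
resp = requests.get(
    "https://api.dbpedia-spotlight.org/en/annotate",
    params={"text": "US President Barack Obama visited Australia.",
            "confidence": 0.5},
    headers={"Accept": "application/json"},
)
for res in resp.json().get("Resources", []):
    print(res["@surfaceForm"], "->", res["@URI"])
```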
Pre-Processing: Semantic Typing
Semantic typing: recognized entities are typed directly;
other mentions are typed by string similarity against a YAGO index.
(Figure: the mention “US President” is mapped to the YAGO type Politician)
Cluster splits
Entity- and type-based splitting of clusters.
Example: {Australia} is split from {Quintex Australia, Quintex ltd.}.
Cluster merges
Merge different clusters that contain the same types/entities.
Example: {US President} is merged with {Barack Obama}.
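A simplified sketch of the split/merge idea follows; SANAPHOR additionally reasons over YAGO types and type hierarchies, which this toy version omits.

```python
from collections import defaultdict

def split_by_entity(cluster, linked):
    """Split one coreference cluster so that mentions linked to
    different KB entities land in different clusters; unlinked
    mentions (None) stay together in their own group."""
    groups = defaultdict(list)
    for mention in cluster:
        groups[linked.get(mention)].append(mention)
    return list(groups.values())

def merge_by_entity(clusters, linked):
    """Merge clusters that contain mentions of the same KB entity."""
    merged = defaultdict(list)
    for i, cluster in enumerate(clusters):
        entity = next((linked[m] for m in cluster if m in linked),
                      f"unlinked-{i}")
        merged[entity].extend(cluster)
    return list(merged.values())

linked = {"Australia": "dbr:Australia",
          "Quintex Australia": "dbr:Quintex",
          "Quintex ltd.": "dbr:Quintex"}
print(split_by_entity(["Australia", "Quintex Australia", "Quintex ltd."], linked))
# [['Australia'], ['Quintex Australia', 'Quintex ltd.']]
print(merge_by_entity([["US President"], ["Barack Obama"]],
                      {"US President": "dbr:Barack_Obama",
                       "Barack Obama": "dbr:Barack_Obama"}))
# [['US President', 'Barack Obama']]
```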
Evaluation
CoNLL-2012 Shared Task on Coreference Resolution:
• over 1M words
• 3 parts: development, training, and test
Methods are designed on the development set and evaluated on the test set.
Metrics:
• Precision/Recall/F1 for clustering
• noun-only clusters are evaluated separately (no pronouns)
Cluster optimization results
• Our system improves on Stanford Coref in both the split
and merge tasks.
• The improvement in the split task is greater for noun-only
clusters, since we do not re-assign pronouns.
Outline
1. Motivation
2. Problem definition
3. Contributions
1. Named entity recognition
2. Co-reference resolution
3. Entity disambiguation
4. Tag recommendation
4. Conclusions
Ontology-based entity disambiguation
More than 10% improvement in Precision
over state-of-the-art machine learning approaches.
Example: disambiguating the acronym “SSM” in context:
“We study finely tuned SSM, recently proposed by… The runnings
of the four gaugino Yukawa couplings, the mu term, the gaugino
masses, and the Higgs quartic coupling are computed…”
Candidate senses: State Space Model, Sequential Standard Model,
Symmetric Standard Model, Supersymmetric Standard Model.
Graph-based disambiguation strategies over the ontology: minimal distance, shortest path.
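A toy illustration of the shortest-path strategy using networkx; the ontology edges are invented for the example and the unreachable-node penalty is an arbitrary choice.

```python
import networkx as nx

# Toy fragment of a domain ontology (edges invented for illustration).
onto = nx.Graph()
onto.add_edges_from([
    ("Supersymmetric Standard Model", "gaugino"),
    ("gaugino", "Yukawa coupling"),
    ("State Space Model", "Kalman filter"),
])

context = ["gaugino", "Yukawa coupling"]  # concepts found near "SSM"
candidates = ["Supersymmetric Standard Model", "State Space Model"]

def score(candidate):
    """Sum of shortest-path distances from the candidate sense to the
    context concepts; unreachable concepts get a fixed penalty."""
    total = 0
    for concept in context:
        try:
            total += nx.shortest_path_length(onto, candidate, concept)
        except nx.NetworkXNoPath:
            total += 10  # arbitrary penalty
    return total

print(min(candidates, key=score))  # -> Supersymmetric Standard Model
```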
Outline
1. Motivation
2. Problem definition
3. Contributions
1. Named entity recognition
2. Co-reference resolution
3. Entity disambiguation
4. Tag recommendation
4. Conclusions
Why is knowledge discovery important?
Knowledge extraction and discovery opens the door to
automated processing of unstructured data.
Example: processing textual data to create targeted Q&A
systems.
Outline
1. Motivation
2. Problem definition
3. Contributions
1. Named entity recognition
2. Co-reference resolution
3. Entity disambiguation
4. Tag recommendation
4. Conclusions
Conclusions
• We demonstrated the importance of knowledge discovery
in idiosyncratic domains with a real system
(ScienceWISE).
• We improved Entity Recognition and Entity
Disambiguation in idiosyncratic domains leveraging
domain-specific ontologies.
• We improved Coreference Resolution by incorporating
semantic information into a state-of-the-art model.
Editor's Notes

  • #2 Good afternoon everybody, you all look great, and I’m pleased to have you here. My name is Roman Prokofyev, from the …, and today I’m going to give you a presentation on my PhD thesis entitled ...
  • #3 Let me start with a brief outline of my talk, so that you know what's ahead of you. First, I'm going to motivate the problem I'm addressing in this thesis, why it's important and why it's challenging. Next, I'm going to describe in detail the contributions we made during the course of this work. Finally, I will finish this talk with conclusions and future work.
  • #5 First, I’m going to introduce some concepts that define the scope of my work. In this work, we make a distinction between two types of data: structured data adheres to a schema; humans can read and understand unstructured data in individual pieces, but it’s extremely hard and inefficient to get insights from large collections of such data.
  • #8 First, I’m going to introduce some concepts that define the scope of my work. In this work, we make a distinction between two types of data: structured data adheres to a schema; humans can read and understand unstructured data in individual pieces, but it’s extremely hard and inefficient to get insights from large collections of such data.
  • #9 rise of unstructured data in vertical domains
  • #10 A specific sub-category of unstructured data, which we explore in this work, is unstructured data in idiosyncratic domains. In essence, idiosyncratic data is data in vertical domains… As a result, knowledge cannot be extracted from such domains using existing... We need more specific techniques to address this problem.
  • #12 Develop better methods and applications to mine structured information from and for idiosyncratic domains. Addressed via specific subtasks.
  • #13 There are a number of
  • #14 We need to detect entity mentions in these documents and assign certain types to them.
  • #17 To give you an overview of my talk, I’m going to start with a definition of the problem that we’re trying to solve here and explain why we think it is different from standard NER. I will then continue with an overview of our approach, motivation, candidate entity selection, dataset and feature descriptions, and finally the evaluation.
  • #18 During the course of this thesis, we made a number of contributions and addressed the tasks previously described for the case of idiosyncratic domains. In particular, we made …
  • #20 We start with a problem definition. Entities can be identified either by the authors of a given document or by an expert in the domain of the document.
  • #21 Now, in traditional NER, there are a few approaches that are widely used: entities obey certain rules and appear in specific contexts, which is not the case for scientific concepts. We will present a comparison experiment to support this hypothesis.
  • #23 Put emphasis on N-grams in candidate selection
  • #24 Put emphasis on N-grams in candidate selection
  • #26 What we are left with after the candidate selection process.
  • #28 We used 2 datasets for our evaluation.
  • #30 Next feature family TODO: binary feature, examples
  • #31 Additionally, we perform a match with the DBpedia knowledge base, but since it’s a general-purpose KB, it contains many common concepts that we are not interested in. Two-hop distance.
  • #32 Additionally, we perform a match with the DBpedia knowledge base, but since it’s a general-purpose KB, it contains many common concepts that we are not interested in. Two-hop distance.
  • #35 TODO: why only CS collection?
  • #36 To give you an overview of my talk, I’m going to start with a definition of the problem that we’re trying to solve here and explain why we think it is different from standard NER. I will then continue with an overview of our approach, motivation, candidate entity selection, dataset and feature descriptions, and finally the evaluation.
  • #37 TODO: change example
  • #38 However, the NLP-based approach fails to determine the correct coreference cluster when the referring phrases are somewhat ambiguous.
  • #39 Thus, we have designed the following pipeline for our system. Let’s see how each box operates in detail.
  • #40 The first step of our pipeline, … Spotlight is a decent technology.
  • #41 Beyond entity linking, the next pre-processing step is semantic typing…
  • #42 Now, after we have completed the necessary pre-processing steps, we start re-arranging the coreference clusters. The first step is to split semantically unrelated clusters, meaning clusters that contain either different entities or types from different branches of the hierarchy. ANSWER: high-precision entity linking.
  • #43 The second step is cluster merging, that is, merging clusters that either contain the same entities, or exactly the same types, or, in case there is a mix of types and entities,…
  • #44 OntoNotes 5: available from the LDC for free; 1M words from newswire, magazine articles, and web data.
  • #45 In the evaluation, we focus on the two subtasks of co-reference resolution, that is, splitting and merging the clusters. We notice that the absolute increase in F1 score for the split task is greater for the Noun-Only case (+10.54% vs +2.94%). This results from the fact that All Clusters also contain non-noun mentions, such as pronouns, which we don’t directly tackle in this work but which have to be assigned to one of the splits nevertheless. Our approach in that context is to keep the non-noun mentions with the first noun mention in the cluster, which seems to be suboptimal for this case. For the merge task, the difference between All and Noun-Only clusters is much smaller (+27.03% for All Clusters vs +18.96% for the Noun-Only case). In this case, non-noun words do not have any effect, since we merge clusters and also include all other mentions. ANSWER: We focused on these two parts because our recall on entity linking is low.
  • #46 To give you an overview of my talk, I’m going to start with a definition of the problem that we’re trying to solve here and explain why we think it is different from standard NER. I will then continue with an overview of our approach, motivation, candidate entity selection, dataset and feature descriptions, and finally the evaluation.
  • #47 Naïve Bayes, minimal distance, shortest path, nearest neighbors.
  • #48 To give you an overview of my talk, I’m going to start with a definition of the problem that we’re trying to solve here and explain why we think it is different from standard NER. I will then continue with an overview of our approach, motivation, candidate entity selection, dataset and feature descriptions, and finally the evaluation.
  • #51 So, why is knowledge discovery important, especially in idiosyncratic domains? The one-sentence summary would be that it opens the door to automated processing of unstructured data by machines. Here is one example: let’s say we have a collection of documents that we want to analyze. By applying knowledge extraction techniques, we can build a system that stores structured information on top of this collection and allows people to execute natural-language queries about its knowledge.
  • #52 Provide a highly relevant answer (not thousands of results) to a question posed in natural language, not keywords. Monitor the law for changes that can positively or negatively affect your case, instead of flooding you with legal news. Learn more as you and other lawyers use it. One of the country’s biggest law firms has become the first to publicly announce that it has “hired” a robot lawyer to assist with bankruptcy cases.
  • #53 To give you an overview of my talk, I’m going to start with a definition of the problem that we’re trying to solve here and explain why we think it is different from standard NER. I will then continue with an overview of our approach, motivation, candidate entity selection, dataset and feature descriptions, and finally the evaluation.
  • #54 KD is important and there are multiple tasks in there. In this work, we focused on a number of challenging tasks; in detail …
  • #55 I was one of the first students in his lab, and the environment he has created since the lab's inception is really amazing. Philippe really helps us PhD students; he embraced our own ideas and guided us toward properly executing and evaluating them. Alberto, my permanent Italian officemate.
  • #56 KD is important and there are multiple tasks in there. In this work, we focused on a number of challenging tasks; in detail …
  • #57 My mother, who is here today, and who put a lot of effort into me and my brother, maybe even sacrificing too much for this, so I’m immensely grateful to her. My brilliant wife, who is also here today, and who