The document describes a framework for performing information extraction from social media posts. It leverages entity linking to improve named entity recognition in short texts. The framework includes components for named entity recognition, candidate retrieval from a knowledge base, entity disambiguation, and using the linking results to enhance the entity classifier. An evaluation on a dataset of tweets shows marginal improvements in entity classification accuracy when incorporating entity linking information. The authors discuss opportunities to better identify new emerging entities from social media and extract relationships to improve knowledge base curation.
Talk at the 2nd Summer Workshop of the Center for Semantic Web Research (January 16, 2016, Santiago, Chile) about the construction of Yahoo's Knowledge Graph and associated research challenges.
Information Extraction, Named Entity Recognition, NER, text analytics, text mining, e-discovery, unstructured data, structured data, calendaring, standard evaluation per entity, standard evaluation per token, sequence classifier, sequence labeling, word shapes, semantic analysis in language technology
Information Extraction and Linked Data Cloud (Dhaval Thakker)
In the media industry there is a great emphasis on providing descriptive metadata as part of the media assets to the consumers. Information extraction (IE) is considered an important tool for the metadata generation process, and its performance largely depends on the knowledge base it utilizes. The advances in "Linked Data Cloud" research provide a great opportunity for generating such a knowledge base, one that benefits from the participation of a wider community. In this talk, I will discuss our experiences of utilizing the Linked Data Cloud in conjunction with a GATE-based IE system.
Presented January 18, 2010 to the ALCTS Committee on Cataloging: Description and Access (CC:DA) as an introduction to RDF data, and application profiles. Presenters were Jon Phipps, Karen Coyle and Diane Hillmann.
The Eprints Application Profile: a FRBR approach to modelling repository meta... (Julie Allinson)
Julie Allinson, Pete Johnston and Andy Powell, UKOLN, University of Bath, present recent work on developing a Dublin Core Application Profile (DCAP) for describing "scholarly publications" (eprints). They will explain why the Dublin Core Abstract Model is well suited to creating descriptions based on entity-relational models such as the FRBR-based (Functional Requirements for Bibliographic Records) Eprints data model. The ePrints DCAP highlights the relational nature of the model underpinning Dublin Core and illustrates that the Dublin Core Abstract Model can support the representation of complex data describing multiple entities and their relationships.
SEMANTIC WEB SOURCES – comparison of open-source Knowledge Graphs (MatteoBelcao)
A theoretical and practical comparison of the most widely used open-source Knowledge Graphs: DBpedia, Wikidata, and YAGO.
A practical explanation of how to query each Knowledge Graph with SPARQL and the sandboxes.
Entity Search on Virtual Documents Created with Graph Embeddings (Sease)
Entity Search is a search paradigm that aims to retrieve entities and all the information related to them. In the last few years the topic has grown in importance because 40% of user queries now mention specific entities.
This talk gives a first overview of the state-of-the-art methods used for entity retrieval and then describes the new approach Anna implemented and proposed in her master's thesis. The novelty of this work lies in combining two machine learning techniques: neural networks and clustering.
Linked Data Generation for the University Data From Legacy Database (dannyijwest)
The Web was developed to share information among users over the internet as hyperlinked documents. Anyone who wants to collect data from the web has to search and crawl through the documents to fulfil their needs. The concept of Linked Data creates a breakthrough at this stage by enabling links within data. So, besides the web of connected documents, a new web has developed for both humans and machines: the web of connected data, known simply as the Linked Data Web. Since it is a very new domain, very little work has been done so far, especially on publishing legacy data within a university domain as Linked Data.
KDIR2015-Entity Linking and Knowledge Discovery in Microblogs-Presentation
1. PIKAKSHI MANCHANDA
DISCo, University of Milano-Bicocca, Milan, Italy
@pikakshi787
Manchanda et al., Leveraging Entity Linking for Entity Recognition in Microposts
KDIR 2015
KDIR 2015, Lisbon, 12th November 2015
2. SOCIAL MEDIA: ENTITIES-EMOJIS-EVENTS
People increasingly communicate and share information through social media platforms
Fresh information emerges in real time on social media platforms, primarily:
• New entities (newly emerging, newly relevant/popular)
• New relationships
• Factual information
• Events
3. WHY INFORMATION EXTRACTION?
[Figure: existing entities (Apple: Company; IBM OS2: Product), a new entity from a product launch (Apple Watch: Product), and new relations]
WHY SOCIAL MEDIA PLATFORMS?
• Fresh, real-time info
• Incomplete KBs
• Unstructured Web
4. MOTIVATION
Bridging the gap between Unstructured Web and Web of Data
• Intrinsic incompleteness in KBs
Information Extraction from social media streams (microposts,..)
• Named Entity Recognition (NER)
• Named Entity Classification
• Named Entity Linking (NEL)
Knowledge Base (KB) enrichment
• Identify new knowledge
• Improve NER
• Lexically enriching knowledge bases for existing & new entities
5. INFORMATION EXTRACTION
Named Entity Recognition: the task of identifying named entities in a piece of text
Named Entities: text fragments that refer to entities in the real world (proper nouns, ...)
Named Entity Classification: classifying recognized named entities into entity types such as PERSON, LOCATION, ORGANIZATION, ...
Named Entity Linking: linking the identified named entities to resources in a knowledge base (such as Wikipedia, DBpedia)
6. INFORMATION EXTRACTION: Named Entity Linking
Tweet: "The Town might be one of the best movies I have seen all year. So, so good. And don't worry Ben, we already forgave you for Gigli. Really."
Ben: http://dbpedia.org/page/Ben_Affleck (foaf:Person, yago:AmericanFilmActors)
Gigli: http://dbpedia.org/page/Gigli (dbo:Film, yago:AmericanFilms)
The Town: http://es.dbpedia.org/page/The_Town (dbpedia-owl:Film, schema.org/Movie)
7. INFORMATION EXTRACTION: Named Entity Linking
The same tweet as on slide 6, with a different resource shown for "The Town":
Ben: http://dbpedia.org/page/Ben_Affleck (foaf:Person, yago:AmericanFilmActors)
Gigli: http://dbpedia.org/page/Gigli (dbo:Film, yago:AmericanFilms)
The Town: http://live.dbpedia.org/page/The_Town_(2012_TV_series) (dbo:TelevisionShow, http://schema.org/CreativeWork)
8. CHALLENGES
Entity Recognition and Linking in microposts has been reported to be quite challenging:
1. Short and noisy nature, typographic errors, shortening of words, ambiguity, polysemy (Liu et al. 2013, Ritter et al. 2011, Meij et al. 2012)
2. Out Of Vocabulary (OOV) entity mention identification problem, e.g. The Big Bang Theory being referred to as TBBT
3. Out of Knowledge base (OOKB) entity problem, e.g. a new upcoming company, Widro
9. ENTITY RECOGNITION
| Systems/Tools | Approach | Domain | Entity Types/Classes | Taxonomy |
|---|---|---|---|---|
| ANNIE | Gazetteers & FSM | Newswire | 7 (adapted) | MUC |
| Stanford NER | CRF | Newswire | 4, 3 or 7 | CoNLL, ACE |
| Alchemy API | Machine Learning | Unspecified | 324 | Alchemy |
| NERD-ML | KNN & Naïve Bayes | Twitter | 4 | NERD |
| TextRazor | Machine Learning | Unspecified | 1779 | DBpedia, Freebase |
| Ritter et al., 2011 | CRF | Twitter | 3 or 10 | CoNLL, ACE |
| Liu et al. 2011 | KNN & CRF | Twitter | 4 | CoNLL, ACE |
| Kalina et al, 2013 | Gazetteers & FSM | Twitter | 3 or 10 | CoNLL |
| Derczynski et al, 2015 | Structured Learning (CRF) | Twitter | 10 | Freebase |
10. ENTITY LINKING
| Tools | Taxonomy | Approach/Features used | Domain |
|---|---|---|---|
| DBpedia Spotlight (Mendes et al., 2011) | DBpedia, Freebase, Schema.org | Gazetteers and Similarity Metrics | Unspecified |
| TAGME (Ferragina and Scaiella, 2010) | Wikipedia | Wikipedia anchor texts and the pages linked to those anchor texts | Short texts |
| YODIE (Damljanovic and Bontcheva, 2012) | DBpedia | Similarity metrics and URI frequency | Twitter |
| Babelfy (Moro et al., 2014) | BabelNet semantic network | Graph-based approach, semantic signatures | Short text |
| Meij et al., 2012 | Wikipedia | n-gram features, concept features, and tweet features | Twitter |
| S-MART, Yang et al, 2015 | Wikipedia | Structural Learning (Tree-based) | Twitter |
| Weasel (Tristram et al, 2015) | DBpedia | Machine Learning (using SVM) | Newspaper Articles |
| Guo et al., 2013 | Wikipedia | Structural SVM | Twitter |
| Yamada et al., 2015 | Wikipedia | Supervised (String matching, n-grams) | Twitter |
Callouts on the slide: "Mention detection & disambiguation system: Pipeline" and "Use NEL to learn how to perform NER: pipeline"
11. THE PROPOSED SYSTEM
An end-to-end IE framework for microblogs to orchestrate NER and NEL
• Entity Recognition and Classification
• Candidate match retrieval for identified entities
• Entity linking
• Leverage entity linking to improve named entity classification
Gold-standard corpus of ~2400 tweets (Ritter et al., 2011)
Ground Truth: Manually curated set of 1616 named entities identified with entity types
Use of DBpedia as an external KB
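The candidate match retrieval step can be pictured as a top-k lookup of a surface form against an index of KB labels. The following is only a minimal sketch under stated assumptions: a tiny in-memory dictionary stands in for the DBpedia rdfs:label index, the helper name `top_k_labels`, the lowercase labels, and the cutoff value are hypothetical, and `difflib` string similarity is a stand-in for whatever ranking the actual index uses.

```python
from difflib import get_close_matches

# Hypothetical label index: lowercased rdfs:label -> resource URI.
# The URIs are the ones shown on the slides; the labels are illustrative.
LABEL_INDEX = {
    "ben affleck": "http://dbpedia.org/page/Ben_Affleck",
    "gigli": "http://dbpedia.org/page/Gigli",
    "the town": "http://es.dbpedia.org/page/The_Town",
    "the town (2012 tv series)":
        "http://live.dbpedia.org/page/The_Town_(2012_TV_series)",
}


def top_k_labels(surface_form, index, k=3, cutoff=0.4):
    """Return up to k (label, resource) pairs, most similar label first."""
    labels = get_close_matches(surface_form.lower(), list(index),
                               n=k, cutoff=cutoff)
    return [(label, index[label]) for label in labels]


candidates = top_k_labels("The Town", LABEL_INDEX)
for label, resource in candidates:
    print(label, "->", resource)
```

For the surface form "The Town" this returns both the film and the TV series as candidates, which is the ambiguity the disambiguation step then has to resolve.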
12. FRAMEWORK
[Figure: framework diagram. Tweet → Named Entity Recognition → surface forms of named entities → Entity Search over an index (rdfs:label) → top-k labels for each surface form → resource for each label, with resource description from the KB → Entity Disambiguation (function f over surface form, entity type & context) → resource for surface form → Improvement of NER. Entity Search and Entity Disambiguation are grouped as Entity Linking.]
13. ENTITY RECOGNITION
[Figure: the framework diagram from slide 12, repeated under the heading ENTITY RECOGNITION]
14. Experimental Analysis: Entity Recognition
NER system: T-NER (Ritter et al. 2011), grounded on Conditional Random Fields (Sutton and McCallum, 2006)
Identification Errors
"@vogueglamGIRL Ah I know! She is simply the best in The Sept Issue. My boyfriend's aunt worked for Anna Wintor in NY"
Classification Errors
"Cant wait for the ravens game tomorrow....go ray rice!!!!!!!"
Entity type labels shown on the example tweets: O, Geo-Loc, Band, Sportsteam
T-NER classifies each entity e into one or more entity types/classes c with a probability score $P_{CRF}(e, c)$:
$$P_{CRF}(e, c) = \exp\left(\sum_{k=1}^{K} w_k f_k(e, c)\right)$$
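The score P_CRF(e, c) = exp(Σ_k w_k f_k(e, c)) is a log-linear combination of feature values. The sketch below only illustrates that arithmetic: the weights, the feature values, and the helper names `crf_score` and `normalise` are hypothetical, and the normalisation over classes is an added assumption (the slide shows only the exponential term).

```python
import math


def crf_score(weights, features):
    """Unnormalised score exp(sum_k w_k * f_k(e, c)) for one (e, c) pair."""
    return math.exp(sum(w * f for w, f in zip(weights, features)))


def normalise(scores):
    """Assumption: divide by the sum over classes to obtain probabilities."""
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


# Hypothetical weights and per-class feature values f_k(e, c) for one entity.
weights = [1.2, -0.4, 0.7]
features_by_class = {
    "Sportsteam": [1.0, 0.0, 1.0],
    "Band": [0.0, 1.0, 1.0],
}

scores = {c: crf_score(weights, f) for c, f in features_by_class.items()}
probs = normalise(scores)
print(probs)
```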
15. T-NER Classification Performance
| Text Phrase | Classification Level | Example Entity | Example Entity Type | Classification (%) |
|---|---|---|---|---|
| Entities (1496) | Correctly Classified | Justin Bieber | Person | 61.57 |
| Entities (1496) | Incorrectly Classified | Chicago | Person | 37.96 |
| Entities (1496) | Segmentation Error | Alpha, Omega (Alpha-Omega) | Geo-Location, Band | 0.47 |
| Non-Entities (44k) | Correctly Classified | It | Outside (O) | 99.8 |
| Non-Entities (44k) | Incorrectly Classified | justthen | Person | 0.2 |
T-NER identifies 1496 named entities from the gold standard (GS), in contrast to 1616 entities in the ground truth.
8% of entities are not even recognized and are thus classified as non-entities (amongst the other 44k tokens)
Classification Error Rate: T-NER
| Entity Type | Error (%) |
|---|---|
| Band | 73.83 |
| Company | 21.9 |
| Facility | 54.79 |
| Geo-Location | 19.75 |
| Movie | 75.83 |
| Other | 46.29 |
| Person | 28.18 |
| Product | 39.70 |
| Sportsteam | 48.27 |
| TVshow | 48.71 |
16. ENTITY LINKING
[Figure: the framework diagram from slide 12, repeated under the heading ENTITY LINKING]
DBpedia Titles files and NLP resources available at: http://wiki.dbpedia.org/Downloads2015-04
17. Experimental Analysis: Entity Linking
Entity Linking-Performance Analysis
| Classifiable Named Entity | Linking Level | Example Entity | Example DBpedia Type | Linking (%) |
|---|---|---|---|---|
| Linkable | Correctly Linked | Wisconsin | Geo-Location | 63.11 |
| Linkable | Incorrectly Linked | America | Movie | 3.05 |
| Linkable | Uninformative | N.J. | Thing | 16.15 |
| Non-Linkable | Uninformative | Secrets | Thing | 11.85 |
| Non-Linkable | Generic | Whitney | Other | 5.83 |
A total of 1442 entities out of 1496 entities are disambiguated with ~4k candidate KB resources
Matching function, $P_{KB}(e, r_c)$, to detect the resource for a surface form of a named entity in the KB, if it exists:
1. Lexical Similarity, $lex(e, l_{r_c})$
2. Coherence, $coh(e^+, d_{r_c})$
$$P_{KB}(e, r_c) = \alpha \cdot lex(e, l_{r_c}) + (1 - \alpha) \cdot coh(e^+, d_{r_c})$$
($\alpha$ currently set to 0.5)
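The matching function is a weighted sum of a lexical similarity between the surface form and a resource label and a coherence between the entity's context and the resource description. The slides do not spell out how lex and coh are computed, so the sketch below substitutes simple stand-ins, and both are assumptions: a `difflib` similarity ratio for lex and a Jaccard token overlap for coh. The mixing weight is named `alpha` here and defaults to the 0.5 given on the slide.

```python
from difflib import SequenceMatcher


def lex(surface_form, label):
    """Stand-in lexical similarity in [0, 1] (assumption: difflib ratio)."""
    return SequenceMatcher(None, surface_form.lower(), label.lower()).ratio()


def coh(context_tokens, description):
    """Stand-in coherence (assumption: Jaccard overlap of token sets)."""
    ctx = {t.lower() for t in context_tokens}
    desc = set(description.lower().split())
    if not ctx or not desc:
        return 0.0
    return len(ctx & desc) / len(ctx | desc)


def p_kb(surface_form, context_tokens, label, description, alpha=0.5):
    """alpha * lex + (1 - alpha) * coh, following the slide's formula."""
    return (alpha * lex(surface_form, label)
            + (1 - alpha) * coh(context_tokens, description))


# Hypothetical label and description for a candidate resource.
score = p_kb("Gigli", ["movies"], "Gigli", "american movies")
print(score)
```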
18. ENTITY RECOGNITION ENHANCEMENT
[Figure: the framework diagram from slide 12, repeated with the "Improvement of NER" stage labelled T-NER+]
T-NER+ assigns each entity e the class that maximises the product of the two scores:
$$c^*_e = \arg\max_{c} \{ P_{CRF}(e, c) \cdot P_{KB}(e, r_c) \}$$
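The re-classification rule picks, for each entity, the class with the highest product of the CRF score and the KB matching score. A minimal sketch, assuming both scores are already available as per-class dictionaries; the numeric values and the helper name `reclassify` are hypothetical, and treating a class without a KB candidate as score 0 is a simplifying assumption.

```python
def reclassify(p_crf, p_kb):
    """Return the class c maximising P_CRF(e, c) * P_KB(e, r_c)."""
    # Assumption: a class with no linked KB candidate contributes 0.
    return max(p_crf, key=lambda c: p_crf[c] * p_kb.get(c, 0.0))


# Hypothetical scores for the entity "Canada" (T-NER alone says Person).
p_crf = {"Person": 0.5, "Geo-Location": 0.4, "Band": 0.1}
p_kb = {"Person": 0.1, "Geo-Location": 0.9}

print(reclassify(p_crf, p_kb))
```

With these illustrative numbers the KB evidence outweighs the CRF preference, so the entity moves from Person to Geo-Location.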
19. Experimental Analysis: Entity Recognition Enhancement
Comparative Analysis: T-NER and T-NER+
| Entity Type | T-NER Precision | T-NER Recall | T-NER F1 | T-NER+ Precision | T-NER+ Recall | T-NER+ F1 |
|---|---|---|---|---|---|---|
| Band | 0.26 | 0.88 | 0.40 | 0.39 | 0.90 | 0.54 |
| Company | 0.78 | 0.90 | 0.84 | 0.81 | 0.90 | 0.85 |
| Facility | 0.45 | 0.72 | 0.55 | 0.50 | 0.72 | 0.59 |
| Geo-Location | 0.80 | 0.95 | 0.87 | 0.80 | 0.95 | 0.87 |
| Movie | 0.24 | 0.88 | 0.38 | 0.34 | 0.88 | 0.49 |
| Other | 0.57 | 0.70 | 0.63 | 0.56 | 0.76 | 0.64 |
| Person | 0.72 | 0.92 | 0.81 | 0.77 | 0.92 | 0.84 |
| Product | 0.60 | 0.69 | 0.65 | 0.63 | 0.71 | 0.67 |
| Sportsteam | 0.52 | 0.83 | 0.64 | 0.63 | 0.85 | 0.72 |
| TVshow | 0.51 | 0.91 | 0.66 | 0.45 | 0.89 | 0.59 |
| Overall | 0.62 | 0.87 | 0.73 | 0.66 | 0.88 | 0.76 |
Example: Re-classification of entities
| Entity | Ground-Truth | T-NER | T-NER+ |
|---|---|---|---|
| 30stm | Band | Product | Band |
| Yahoo | Company | Band | Company |
| Southgate House | Facility | Band | Facility |
| Canada | Geo-Location | Person | Geo-Location |
| Camp rock 2 | Movie | Person | Movie |
| Thanksgiving | Other | Person | Other |
| John Acuff | Person | Facility | Person |
| iphone | Product | Company | Product |
| Lions | Sportsteam | Person | Sportsteam |
| TMZ | TVshow | Band | TVshow |
$$\text{Precision } (P) = \frac{|\{cor.cl\} \cap \{cl\}|}{|\{cl\}|}$$
$$\text{Recall } (R) = \frac{|\{cor.cl\} \cap \{cl\}|}{|\{cor.cl\}|}$$
$$F_1 \text{ Measure} = \frac{2 \times P \times R}{P + R}$$
cor.cl denotes correctly classified entities, while cl denotes classified entities.
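The three measures can be computed directly from two sets of entities per entity type. The sketch below assumes one reading of the slide's notation: `classified` is the set of entities the system assigned to a type (cl), and `correct` is the set of entities that belong to that type in the ground truth (cor.cl). The helper names and the sample sets are hypothetical.

```python
def f1_measure(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(classified, correct):
    """Set-based P, R and F1 for a single entity type."""
    hits = len(classified & correct)
    precision = hits / len(classified) if classified else 0.0
    recall = hits / len(correct) if correct else 0.0
    return precision, recall, f1_measure(precision, recall)


# Hypothetical entity sets for one type.
classified = {"a", "b", "c", "d"}
correct = {"a", "b", "e"}
p, r, f1 = precision_recall_f1(classified, correct)
print(p, r, f1)

# Cross-check with the T-NER Band row (P = 0.26, R = 0.88).
band_f1 = round(f1_measure(0.26, 0.88), 2)
print(band_f1)
```

The Band cross-check gives an F1 of about 0.40, in line with the value reported for T-NER in the comparison table.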
20. New knowledge emerges constantly on social media streams
It's important to identify new knowledge in order to bridge the gap between Unstructured Web and Web of Data
An end-to-end entity linking pipeline might be helpful for detecting new knowledge
Entity linking can be used to improve classification performance of an entity recognition system
Improving entity recognition is crucial for identifying new entities
21. Presented an end-to-end entity linking pipeline for short textual formats (microposts)
Presented an approach for improving entity recognition through re-classification
Marginal improvements observed in re-classification using linked entities
A definite scope for improving the current system
New knowledge has been identified, though not dealt with currently
Quality assessment, trustworthy factors…
Relation extraction from microposts to improve identification of new knowledge
Experimenting with more recent datasets