Statistical Entity Linking

CONTENT INTELLIGENCE
Laurie Lugrin, R&D NLP engineer

idio, content marketing
for marketing & sales teams
• content insight
• user interest profile
• recommendation demo
content model
user model
topic performance chart

idio, content marketing
• focus on interests, not socio-demographics / firmographics
• automatic text analysis, model building and recommendation

Content analysis at idio
• Python
• build automation (luigi)
• web services
• Scala
• entity linking, i.e. finding topics in texts

How most discussions start at conferences
• me: “I work on Natural Language Processing.”
• other: “So you’re in the field of deep learning?”
• however: our topic extractor is based on statistical analysis

Outline
• entity linking task
• method (inspired by DBpedia-Spotlight)
• data pre-processing
• adaptation

Entity Linking task
• find topics in a text written in a natural language
• For us: 1 topic = 1 URI of a Wikipedia article.
• Sometimes called “wikification”.
demo
spotlight demo

Entity Linking jargon
Surface Form: word (or phrase) that refers to a topic
“hoverboard” and “hover board” are surface
forms for the topic “Hoverboard”, the fictional
levitating board used for personal transportation.
Context: words surrounding a surface form.

I'm on cloud 9 whenever I write Python code.

Challenges
• ambiguous words
• multi-word expressions
• different possible splits
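
To make the "different possible splits" challenge concrete, here is a minimal sketch that lists every word n-gram of a sentence that is a known surface form. The function name `candidate_spans`, the `max_len` limit and the whitespace tokenisation are assumptions made for illustration; the set of known surface forms is the one used in the "Text annotation" slide.

```python
def candidate_spans(tokens, known_sfs, max_len=3):
    """Return (start, end, phrase) for every n-gram that is a known surface form."""
    spans = []
    for start in range(len(tokens)):
        last_end = min(start + max_len, len(tokens))
        for end in range(start + 1, last_end + 1):
            phrase = " ".join(tokens[start:end])
            if phrase in known_sfs:
                spans.append((start, end, phrase))
    return spans


# Naive whitespace tokenisation, good enough for this sentence.
tokens = "I'm on cloud 9 whenever I write Python code".split()
known_sfs = {"cloud", "9", "cloud 9", "write", "Python", "code"}

spans = candidate_spans(tokens, known_sfs)
for start, end, phrase in spans:
    print(start, end, phrase)
```

"cloud", "9" and "cloud 9" all come back as candidates over the same words, which is exactly the overlap a later step has to resolve.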

Method
• Build an annotation model
• statistics about words and topics
• Apply it on the given input text

Build an annotation model
Data: Wikipedia
• 1 link = 1 annotation
demo
edit mode of a Wikipedia page

Build an annotation model
Data: Wikipedia
• 1 link = 1 annotation
Algorithm overview:
• find all potential Surface Forms
• decide to annotate or not
• decide which topic
So, what statistics do we need?
demo
spotlight demo: candidates

Extract stats
Identify all known SFs
‣ Set of SFs we’ve seen annotated
Decide to annotate or not
‣ P ( annotation | SF )
‣ number of annotations for each SF
Decide which topic
‣ P ( topic | SF )
‣ P ( topic | annotation )
‣ P ( topic | context )

Model
surface forms
• annotated count, total count
topics
• number of annotations
• context, i.e. surrounding words, with number of occurrences
surface form <-> topic associations
• number of annotations
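
The counts listed in this model are enough to estimate the probabilities as plain relative frequencies. The sketch below assumes that simple ratio with no smoothing, which may differ from what DBpedia Spotlight or idio actually compute; the class and function names, and the numbers in the usage lines, are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class SurfaceFormStats:
    annotated: int  # times the phrase was seen as a link
    total: int      # times the phrase was seen at all


def p_annotation_given_sf(sf):
    """P(annotation | SF): share of occurrences that were links."""
    return sf.annotated / sf.total if sf.total else 0.0


def p_topic_given_sf(pair_annotations, sf):
    """P(topic | SF): share of the SF's links that point to this topic."""
    return pair_annotations / sf.annotated if sf.annotated else 0.0


def p_topic_given_annotation(topic_annotations, all_annotations):
    """P(topic | annotation): share of all links that point to this topic."""
    return topic_annotations / all_annotations if all_annotations else 0.0


# Invented counts, not taken from Wikipedia.
python_sf = SurfaceFormStats(annotated=4, total=16)
print(p_annotation_given_sf(python_sf))
print(p_topic_given_sf(3, python_sf))
```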

Model: Example
SciPy is an [[open source]]
[[Python (programming language)|Python]] library
used by scientists, analysts, and engineers doing
[[scientific computing]] and technical computing.
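
Since 1 link = 1 annotation, the (surface form, topic) pairs of this example can be read straight from the wiki markup. The regular expression below is a simplified sketch: it handles `[[target]]` and `[[target|label]]` only and ignores section anchors, namespaces and templates. The names `WIKILINK` and `extract_annotations` are made up for this example.

```python
import re

# [[target]] or [[target|label]]
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]+))?\]\]")


def extract_annotations(wikitext):
    """Return a list of (surface form, topic) pairs, one per link."""
    annotations = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        label = match.group(2)
        surface_form = label.strip() if label is not None else target
        # Wikipedia titles: first letter upper-cased, spaces become underscores.
        topic = (target[0].upper() + target[1:]).replace(" ", "_")
        annotations.append((surface_form, topic))
    return annotations


wikitext = (
    "SciPy is an [[open source]] "
    "[[Python (programming language)|Python]] library "
    "used by scientists, analysts, and engineers doing "
    "[[scientific computing]] and technical computing."
)

annotations = extract_annotations(wikitext)
for surface_form, topic in annotations:
    print(surface_form, "->", topic)
```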

Model
surface forms: annotated count, total count
• “is” / “an” / “and” are skipped because they are too common (not informative)
SF                  anno  total
SciPy               0     1
Python              1     1
open                0     1
open source         1     1
engineers doing     0     1
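
A rough sketch of how these two counts could be collected for the SciPy example: count every word and word pair in the plain text for "total", and count the linked phrases for "anno". It assumes the markup and punctuation have already been removed, and it leaves out stemming and skip words to stay short; `ngrams` and `sf_row` are names invented here.

```python
from collections import Counter


def ngrams(tokens, max_len=2):
    """Yield every phrase of 1..max_len consecutive tokens."""
    for start in range(len(tokens)):
        last_end = min(start + max_len, len(tokens))
        for end in range(start + 1, last_end + 1):
            yield " ".join(tokens[start:end])


# The example sentence with links and punctuation already stripped.
plain_tokens = (
    "SciPy is an open source Python library used by scientists "
    "analysts and engineers doing scientific computing and technical computing"
).split()
# The phrases that were links in the original markup.
linked_phrases = ["open source", "Python", "scientific computing"]

total = Counter(ngrams(plain_tokens))
annotated = Counter(linked_phrases)


def sf_row(surface_form):
    """Return (annotated count, total count) for one surface form."""
    return annotated[surface_form], total[surface_form]


for sf in ["SciPy", "Python", "open", "open source", "engineers doing"]:
    print(sf, *sf_row(sf))
```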

Model
topics: number of annotations, context
topic                            num. annotations   context
Open_source                      1                  SciPy, Python, library, engineer, technical, computing, use, science x2, analyst
Python_(programming_language)    1                  (same)
Scientific_computing             1                  (same)
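
One possible way to accumulate these topic statistics: every topic linked in a paragraph gets one more annotation and receives the paragraph's context words. This is a sketch under that assumption; the structure of `topic_stats` and the function name are invented, and the context words are copied from the table rather than computed.

```python
from collections import Counter, defaultdict


def update_topic_stats(topic_stats, linked_topics, context_words):
    """Add one paragraph's links and context words to the topic statistics."""
    for topic in linked_topics:  # one entry per link
        entry = topic_stats[topic]
        entry["annotations"] += 1
        entry["context"].update(context_words)


topic_stats = defaultdict(lambda: {"annotations": 0, "context": Counter()})

# Context of the SciPy paragraph, as listed in the table ("science" twice).
context_words = [
    "SciPy", "Python", "library", "engineer", "technical",
    "computing", "use", "science", "science", "analyst",
]
linked_topics = [
    "Open_source",
    "Python_(programming_language)",
    "Scientific_computing",
]

update_topic_stats(topic_stats, linked_topics, context_words)
print(topic_stats["Open_source"]["annotations"])
print(topic_stats["Open_source"]["context"]["science"])
```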

Model
surface form <-> topic associations: number of annotations
SF                     topic                            num. annotations
open source            Open_source                      1
Python                 Python_(programming_language)    1
scientific computing   Scientific_computing             1

Text annotation
• model: SF, topic and association statistics
• input text: “I'm on cloud 9 whenever I write Python code.”
Known SFs: “cloud”, “9”, “cloud 9”, “write”, “Python”, “code”
Annotate?
• discard “9” and “write” because low anno probability
• “cloud” vs “cloud 9” overlap: keep higher anno probability
• keep “Python” and “code”
Which topic?
• SF “Python” is ambiguous: animal or programming language?
• The context supports the programming language.
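
The two decisions above can be sketched in a few lines. Everything numeric here is invented for illustration, including which of "cloud" and "cloud 9" ends up with the higher probability. The scoring in `pick_topic` (prior multiplied by one plus the number of shared context words) is a deliberately simple stand-in, not the formula used by DBpedia Spotlight or idio, and the topic identifiers are illustrative.

```python
def select_surface_forms(candidates, p_annotation, threshold=0.1):
    """Drop unlikely candidates, then keep the most probable of any overlap."""
    likely = [c for c in candidates if p_annotation.get(c[2], 0.0) >= threshold]
    likely.sort(key=lambda c: p_annotation[c[2]], reverse=True)
    selected = []
    for start, end, phrase in likely:
        if all(end <= s or start >= e for s, e, _ in selected):
            selected.append((start, end, phrase))
    return sorted(selected)


def pick_topic(surface_form, context, p_topic_given_sf, topic_context):
    """Pick the candidate topic with the best prior-times-context score."""
    def score(topic):
        shared = sum(1 for word in context if word in topic_context.get(topic, ()))
        return p_topic_given_sf[surface_form][topic] * (1 + shared)
    return max(p_topic_given_sf[surface_form], key=score)


# (start, end, phrase) token spans of the known SFs in the input text.
candidates = [
    (2, 3, "cloud"), (2, 4, "cloud 9"), (3, 4, "9"),
    (6, 7, "write"), (7, 8, "Python"), (8, 9, "code"),
]
# Invented probabilities.
p_annotation = {"cloud": 0.2, "cloud 9": 0.6, "9": 0.01,
                "write": 0.02, "Python": 0.5, "code": 0.3}
p_topic_given_sf = {"Python": {"Python_(genus)": 0.6,
                               "Python_(programming_language)": 0.4}}
topic_context = {"Python_(genus)": {"snake", "species"},
                 "Python_(programming_language)": {"code", "library", "write"}}

selected = select_surface_forms(candidates, p_annotation)
topic = pick_topic("Python", ["cloud", "9", "write", "code"],
                   p_topic_given_sf, topic_context)
print(selected)
print(topic)
```

With these numbers the animal has the better prior, but the context words "write" and "code" tip the score towards the programming language.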

Data preparation
• stemming: reduce words to their word stem
• “laptops” -> “laptop”, “worked” -> “work”
• skip words (too common to be informative)
• the Wikipedia dump is a single XML file of ~50 GB.
(See our blog post on “idio’s Wikipedia toolkits”)
extract stats: “It’s just counting.”
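
A file of that size cannot be loaded in one go, so the counting has to stream over it. The sketch below uses the standard library's `xml.etree.ElementTree.iterparse` on a tiny in-memory stand-in for the dump. The two sample pages, the skip-word list and the crude `toy_stem` function are all invented for illustration; a real pipeline would use a proper stemmer and is not necessarily built this way at idio.

```python
import io
import re
import xml.etree.ElementTree as ET
from collections import Counter

SKIP_WORDS = {"is", "an", "and", "the", "a", "of"}  # illustrative list


def toy_stem(word):
    """Very crude suffix stripping, only meant to show the idea."""
    for suffix in ("ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def normalise(text):
    """Lower-case, tokenise, drop skip words and stem what is left."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [toy_stem(w) for w in words if w not in SKIP_WORDS]


def iter_page_texts(xml_file):
    """Yield the text of each page without building the whole tree."""
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace if present
        if tag == "text":
            yield elem.text or ""
        elif tag == "page":
            elem.clear()  # release the page's content once it is processed


# Tiny stand-in for the real dump.
sample = io.BytesIO(
    b"<mediawiki>"
    b"<page><title>A</title><revision><text>Laptops worked.</text></revision></page>"
    b"<page><title>B</title><revision><text>The laptop is an example.</text></revision></page>"
    b"</mediawiki>"
)

counts = Counter()
for text in iter_page_texts(sample):
    counts.update(normalise(text))
print(counts)
```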

Challenges
Wikipedia is not representative of our clients’ articles.
• Errors in Wikipedia, but that’s not the worst problem
• Unwanted SF-topic associations
• Virtually all surface forms are ambiguous.
• Bad priors: “yesterday” -> Beatles song (even in lower-case)
• annotation of a sub-part of an SF

How to fix?
• Whenever possible, find a pattern, understand the underlying issue
and make algorithm changes: give a boost to some SFs, some
topics, or some SF-topic associations, e.g. capitalised phrases.
• tweak the model/probabilities for isolated issues.
• We made a model editor for Spotlight. Check our GitHub.
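
Both kinds of fix can be pictured as small adjustments on top of the learned probabilities. The sketch below shows one boost (capitalised phrases) and one isolated model edit (removing an unwanted SF-topic association and renormalising). The function names, the boost factor of 1.5, the probabilities and the topic identifiers are all hypothetical; this is not the API of the model editor mentioned above.

```python
def boosted_annotation_probability(phrase, base_probability, capitalised_boost=1.5):
    """Boost phrases whose words all start with a capital letter (capped at 1)."""
    words = phrase.split()
    if words and all(word[0].isupper() for word in words):
        return min(1.0, base_probability * capitalised_boost)
    return base_probability


def remove_association(p_topic_given_sf, surface_form, topic):
    """Drop one SF-topic pair and renormalise the SF's remaining topics."""
    remaining = {
        t: p
        for t, p in p_topic_given_sf.get(surface_form, {}).items()
        if t != topic
    }
    total = sum(remaining.values())
    return {t: p / total for t, p in remaining.items()} if total else {}


# Invented probabilities and illustrative topic identifiers.
model = {"yesterday": {"Yesterday_(Beatles_song)": 0.75, "Yesterday_(time)": 0.25}}

print(boosted_annotation_probability("Python", 0.5))
print(boosted_annotation_probability("yesterday", 0.5))
print(remove_association(model, "yesterday", "Yesterday_(Beatles_song)"))
```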

Takeaways
• Wikipedia is an awesome resource for NLP
• Statistics can solve some NLP tasks
• Adapt the formula with rules and boosts to make up for the
differences between the learning data set and the output we want

Links
Resources
DBpedia
• http://dl.acm.org/citation.cfm?id=2002592
• http://dbpedia-spotlight.github.io/demo/
• https://github.com/dbpedia-spotlight/dbpedia-spotlight
idio
• idioplatform.com
• http://engineering.idioplatform.com/
• github.com/idio/

Thank you
