© 2016 Relegence.
Roni Wiener
AOL’s Named Entity Resolver
via strongly connected components and ad-hoc edge
construction
What is News Tagging?
Categories - extract what an article is about.
Entities - extract who and what is in the article.
Entities: Will Smith, New Orleans, NFL, New Orleans Saints, Atlanta Falcons, ...
Categories: U.S News, Crime, Football, Murder, Celebrities, ...
Easy for humans, hard for machines
Try to mimic humans, then: search for hints
http://www.sbnation.com/2016/4/28/11518540/will-smith-murder-case-new-orleans-saints-trial
Outline
What is News Document Tagging?
Entity Ambiguity and Unknown Entities
Document Graph
Graph-based Ambiguity Resolution
So, What is the Big Deal?
We use Machine Learning for Categories and Keyword Search for Entity mentions.
[Diagram: ARTICLE → Machine Learning → Categories; ARTICLE → Keywords Search → Entities From Knowledge Base]
Well … we are not done yet: we still need to handle
AMBIGUITY and UNKNOWN ENTITIES ...
What are Entity Ambiguity and Unknown Entities?
Keyword: Will Smith
200 ‘Will Smith’s are in our Knowledge Base.
Most of the ‘Will Smith’s in the world are not in our Knowledge Base, so we must consider an Unknown Entity for each keyword.
Totaling 201 possibilities to choose from.
OK, This May Be Tricky
For the given “Will Smith” article, we have:
Keyword options = (Will Smith) 201 × (Saints) 31 × (Georgia) 26 × …
Totaling 263,006,617,337,856,000,000 Entity combinations!!
BUT ONLY 1 COMBINATION IS CORRECT
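The size of the search space is the product of the candidate counts per keyword. A minimal sketch using only the three counts quoted above (the remaining keywords are elided on the slide, so this does not reproduce the full total; the variable names are illustrative):

```python
from math import prod

# Candidate counts quoted on the slide: 200 Knowledge Base entries plus
# 1 Unknown Entity for "Will Smith", and so on for the other keywords.
candidates_per_keyword = {"Will Smith": 201, "Saints": 31, "Georgia": 26}

# Every keyword picks one candidate independently, so the counts multiply.
combinations = prod(candidates_per_keyword.values())
print(f"{combinations:,}")  # prints 162,006
```

Three keywords already give over 160,000 combinations, which is why the total for a whole article grows so quickly.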
Wow, How Can We Solve This Mess?
Build a Document Graph with the aid of a Knowledge Graph. It allows us to do:
• Pruning: remove improbable Entities
• Grouping: solve ambiguity between a small number of Entity groups
Both grouping and pruning are done by looking for contextual hints left by the article’s author.
What is a Knowledge Graph … Wikipedia ++
We have a lot of prior knowledge represented as a graph; it contains Entities, Categories,
Text and the relations between them.
The knowledge graph can be thought of as the prior knowledge shared by the article’s author and its readers;
an article’s contextual hints are based on it.
[Figure: example knowledge graph with Entity nodes (Michelle Obama, Barack Obama), Text nodes (“First Lady”, “President”) and a Category node (Politics)]
Building the Document Graph - Entities
The Document Graph is derived from the Knowledge Graph, with additional temporal knowledge.
The first hints are all possible entity candidates and their relations, derived from the knowledge
graph.
[Figure: candidate nodes New Orleans Saints (Football team), Atlanta Falcons (Football team), Will Smith (Footballer), NFL (Football League), New Orleans (City), Will Smith (Actor), 22 Jump Street (Movie), Facebook (Company); five edges labelled w=1.0]
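As a rough illustration of this step, the sketch below creates one node per candidate entity (plus an Unknown Entity per keyword) and adds a w=1.0 edge wherever the knowledge graph relates two candidates. The toy `KNOWLEDGE_BASE`, `KG_RELATIONS` and the function name are assumptions made for the example, not the actual Relegence data model.

```python
from itertools import combinations

# Hypothetical toy knowledge base: keyword -> candidate entities.
KNOWLEDGE_BASE = {
    "Will Smith": ["Will Smith (Footballer)", "Will Smith (Actor)"],
    "Saints": ["New Orleans Saints (Football team)"],
    "NFL": ["NFL (Football League)"],
}

# Hypothetical knowledge-graph relations between entities (unordered pairs).
KG_RELATIONS = {frozenset(pair) for pair in [
    ("Will Smith (Footballer)", "New Orleans Saints (Football team)"),
    ("New Orleans Saints (Football team)", "NFL (Football League)"),
]}

def build_document_graph(keywords):
    """Return (nodes, edges): nodes maps candidate -> keyword, edges maps pair -> weight."""
    nodes = {}
    for kw in keywords:
        for entity in KNOWLEDGE_BASE.get(kw, []):
            nodes[entity] = kw
        # Always keep an Unknown Entity option for each keyword.
        nodes[f"{kw} (Unknown)"] = kw
    edges = {}
    for a, b in combinations(sorted(nodes), 2):
        if frozenset((a, b)) in KG_RELATIONS:
            edges[(a, b)] = 1.0  # weight taken from the w=1.0 labels in the figure
    return nodes, edges

nodes, edges = build_document_graph(["Will Smith", "Saints", "NFL"])
for edge, weight in edges.items():
    print(edge, weight)
```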
Building the Document Graph - Categories
The next hints are the Document categories derived by our classifier.
[Figure: the same entity candidate nodes as on the previous slide, plus two Category nodes, Celebrities and Football, and two more edges labelled w=1.0]
Heuristic Nodes and Edges
Heuristics help us prune improbable Entities, for example highly frequent terms, common
names and single words, as they are not sound contextual hints.
[Figure: Heuristic Nodes (Common Names, Single Word, Frequent Word) connected to the entity nodes Will Smith and Facebook by edges with w < 0]
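One way to picture heuristic nodes is as a function that returns negative-weight edges for a keyword. The word lists and the -0.5 penalty below are hypothetical placeholders; the slides only state that such edges have w < 0.

```python
# Hypothetical resources; the real lists and weights are not given in the slides.
COMMON_NAMES = {"Will Smith", "John Smith"}
FREQUENT_WORDS = {"facebook", "today", "news"}
PENALTY = -0.5  # assumed value; the slide only says w < 0

def heuristic_edges(keyword):
    """Return (heuristic node, weight) edges that apply to a keyword."""
    edges = []
    if keyword in COMMON_NAMES:
        edges.append(("Common Name", PENALTY))
    if len(keyword.split()) == 1:
        edges.append(("Single Word", PENALTY))
    if keyword.lower() in FREQUENT_WORDS:
        edges.append(("Frequent Word", PENALTY))
    return edges

for kw in ["Will Smith", "Facebook", "New Orleans Saints"]:
    print(kw, heuristic_edges(kw))
```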
Context Nodes and Ad Hoc Edges
Knowledge Graph nodes that are highly connected to the Document Graph can be assumed to
fall under the same context.
We represent this prior knowledge by an edge between the relevant document nodes.
[Figure: Context Nodes in the Knowledge Graph (Footballer A, Footballer B, Footballer C) linked to the Document Graph nodes New Orleans Saints (Football team) and Atlanta Falcons (Football team); an ad hoc edge with w=1.0 is added between the document nodes]
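A minimal sketch of ad hoc edge construction, under the assumption that "highly connected" means two document nodes share at least `min_context` knowledge-graph neighbours that are not themselves in the document. The threshold, function name and data layout are illustrative, not taken from the slides.

```python
from itertools import combinations

def ad_hoc_edges(doc_nodes, kg_neighbors, min_context=2, weight=1.0):
    """Add an edge between document nodes that share enough context nodes."""
    in_document = set(doc_nodes)
    edges = {}
    for a, b in combinations(sorted(in_document), 2):
        # Context nodes: shared knowledge-graph neighbours outside the document.
        shared = (kg_neighbors.get(a, set()) & kg_neighbors.get(b, set())) - in_document
        if len(shared) >= min_context:
            edges[(a, b)] = weight
    return edges

kg_neighbors = {
    "New Orleans Saints": {"Footballer A", "Footballer B", "Footballer C"},
    "Atlanta Falcons": {"Footballer A", "Footballer B", "Footballer C"},
    "Facebook": set(),
}
doc_nodes = ["New Orleans Saints", "Atlanta Falcons", "Facebook"]
edges = ad_hoc_edges(doc_nodes, kg_neighbors)
print(edges)
```

The context nodes never enter the Document Graph; only the edge they justify does.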
Building the Document Graph - Text
Example text: “ … Defensive End Will Smith ...”
P(Footballer(w) | “Defensive End”, w-5, w+5) ≲ 1
P(Actor(w) | “Defensive End”, w-5, w+5) ≳ 0
[Figure: the Text node “Defensive End” pointing to the candidates Footballer and Actor]
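The window condition can be sketched as a lookup of known text cues within five tokens of the mention. The cue table is hypothetical; its two weights borrow the w=1.0 and w=-0.6 labels that appear in the next figure, and assigning them to Footballer and Actor respectively is an assumption.

```python
WINDOW = 5  # tokens on each side of the mention, from the w-5 .. w+5 range

# Hypothetical expectation table: text cue -> {candidate type: edge weight}.
CUE_WEIGHTS = {("defensive", "end"): {"Footballer": 1.0, "Actor": -0.6}}

def text_edges(tokens, mention_start, mention_len):
    """Return {(cue, candidate type): weight} for cues found near the mention."""
    lo = max(0, mention_start - WINDOW)
    hi = min(len(tokens), mention_start + mention_len + WINDOW)
    window = [t.lower() for t in tokens[lo:hi]]
    edges = {}
    for cue, weights in CUE_WEIGHTS.items():
        n = len(cue)
        found = any(tuple(window[i:i + n]) == cue for i in range(len(window) - n + 1))
        if found:
            for candidate, w in weights.items():
                edges[(" ".join(cue), candidate)] = w
    return edges

tokens = "Former Saints Defensive End Will Smith was shot".split()
edges = text_edges(tokens, mention_start=4, mention_len=2)
print(edges)
```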
Building the Document Graph - Text
Add the Text nodes found in the Document, and draw edges by proximity and expectations.
[Figure: the document graph with the Text node “Defensive End” added, attached by one edge labelled w=1.0 and one edge labelled w=-0.6]
Ok, The Graph Is Built, Now What?
Weight each Entity node by the sum of the weights of its edges.
[Figure: the full document graph, including the Heuristic nodes Name, Frequent and Single Word, with each Entity node labelled by the sign of its weight (w > 0, w = 0 or w < 0)]
Node Weight Groups
• w < 0 (e.g. Facebook): Prune. Improbable in the given document context, or an Unknown Entity.
• w = 0 (e.g. 22 Jump Street): No / Mixed Signal. Resolve them once the correct document context is clearer.
• w > 0 (e.g. NFL): Positive Nodes. Let’s start with these.
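Node weighting and grouping can be sketched in a few lines. Edges are treated as undirected here, and the group names and the -0.5 heuristic weight are assumptions for the example.

```python
from collections import defaultdict

def node_weights(edges):
    """Weight of a node = sum of the weights of the edges touching it."""
    weights = defaultdict(float)
    for (a, b), w in edges.items():
        weights[a] += w
        weights[b] += w
    return dict(weights)

def group_by_sign(weights, entity_nodes):
    """Split entity nodes into the three groups described above."""
    groups = {"prune": [], "revisit": [], "positive": []}
    for node in entity_nodes:
        w = weights.get(node, 0.0)  # a node without edges has weight 0
        key = "prune" if w < 0 else "positive" if w > 0 else "revisit"
        groups[key].append(node)
    return groups

edges = {
    ("NFL", "New Orleans Saints"): 1.0,
    ("Facebook", "Frequent Word"): -0.5,  # assumed heuristic edge weight
}
weights = node_weights(edges)
groups = group_by_sign(weights, ["NFL", "Facebook", "22 Jump Street"])
print(groups)
```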
Solved and Unsolved Node Sets
Divide the positive nodes into two sets; edges can still run between the sets.
[Figure: the Unsolved Set (Has Ambiguity) and the Solved Set (No Ambiguity), with two Will Smith nodes]
Our goal is to move all Entities from the unsolved set to the solved one.
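A possible way to make the split, assuming a node counts as ambiguous when another positive node competes for the same keyword (the slides do not spell out the exact criterion):

```python
from collections import Counter

def split_positive_nodes(positive_nodes):
    """positive_nodes maps entity -> keyword; returns (solved, unsolved) sets."""
    per_keyword = Counter(positive_nodes.values())
    # Assumption: an entity is solved when it is the only candidate for its keyword.
    solved = {e for e, kw in positive_nodes.items() if per_keyword[kw] == 1}
    unsolved = set(positive_nodes) - solved
    return solved, unsolved

positive_nodes = {
    "Will Smith (Footballer)": "Will Smith",
    "Will Smith (Actor)": "Will Smith",
    "New Orleans Saints (Football team)": "Saints",
    "NFL (Football League)": "NFL",
}
solved, unsolved = split_positive_nodes(positive_nodes)
print(sorted(solved))
print(sorted(unsolved))
```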
Group Nodes by Context
Group nodes by finding Strongly Connected Components (SCCs) in each set,
revealing contextual regions in the document.
[Figure: the Unsolved Set (Has Ambiguity) and the Solved Set (No Ambiguity) with their nodes, including Will Smith, Will Smith and New Orleans Saints, grouped into SCCs]
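The slides do not say which SCC algorithm is used; the sketch below uses Tarjan's algorithm as a stand-in. The edge directions in the example are invented purely to show how one-way links keep a node out of a component.

```python
def strongly_connected_components(nodes, edges):
    """Tarjan's algorithm; edges is an iterable of directed (u, v) pairs."""
    adj = {n: [] for n in nodes}
    for u, v in edges:
        if u in adj and v in adj:
            adj[u].append(v)
    index, low, stack, on_stack, components = {}, {}, [], set(), []

    def visit(u):
        index[u] = low[u] = len(index)
        stack.append(u)
        on_stack.add(u)
        for v in adj[u]:
            if v not in index:
                visit(v)
                low[u] = min(low[u], low[v])
            elif v in on_stack:
                low[u] = min(low[u], index[v])
        if low[u] == index[u]:  # u is the root of a component
            component = set()
            while True:
                v = stack.pop()
                on_stack.discard(v)
                component.add(v)
                if v == u:
                    break
            components.append(component)

    for n in nodes:
        if n not in index:
            visit(n)
    return components

nodes = ["Saints", "Falcons", "NFL", "Will Smith (Footballer)"]
edges = [
    ("Saints", "Falcons"), ("Falcons", "Saints"),
    ("Saints", "NFL"), ("NFL", "Saints"),
    ("Will Smith (Footballer)", "Saints"),  # one-way link, so not in the same SCC
]
components = strongly_connected_components(nodes, edges)
print(components)
```

The recursive form is fine for document-sized graphs; a very large graph would need an iterative variant to avoid Python's recursion limit.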
Iteratively Solve on the SCC Level
Iterate:
• Move the best-scored SCC from the unsolved set to the solved one
• Filter out all losing ambiguous nodes and their edges
• Prune negative-weight nodes and their edges
• Extract SCCs from each set
Stop when the unsolved set is empty - no more ambiguity.
[Figure: the Max Score SCC moving from the Unsolved Set (Has Ambiguity) to the Solved Set (No Ambiguity)]
How to Score SCCs
An SCC’s score is the sum of its nodes’ scores.
A node’s score is dominated by the sum of its edges’ weights to solved SCCs:
Node(score) = ∑(Edge(w) * |SCC|)
[Figure: the two Will Smith candidates in the Unsolved Set, with SCCs of size |SCC| = 2 and |SCC| = 3]
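The iteration and the scoring rule can be combined in one small sketch. It reads |SCC| as the size of the solved SCC at the far end of each edge, which is an interpretation of the formula rather than something the slides state. It also carries its own compact SCC helper (Kosaraju's algorithm) so that it runs on its own. All names and the toy graph are illustrative.

```python
from collections import defaultdict

def sccs(nodes, edges):
    """Kosaraju's algorithm on the subgraph induced by `nodes`."""
    fwd, rev = defaultdict(list), defaultdict(list)
    for u, v in edges:
        if u in nodes and v in nodes:
            fwd[u].append(v)
            rev[v].append(u)
    seen = set()

    def dfs(u, adj, out):
        seen.add(u)
        for v in adj[u]:
            if v not in seen:
                dfs(v, adj, out)
        out.append(u)

    order = []
    for u in sorted(nodes):
        if u not in seen:
            dfs(u, fwd, order)
    seen.clear()
    components = []
    for u in reversed(order):
        if u not in seen:
            component = []
            dfs(u, rev, component)
            components.append(frozenset(component))
    return components

def resolve(keyword_of, edges, solved, unsolved):
    """Move the best-scored SCC to the solved set until nothing is unsolved."""
    solved, unsolved, edges = set(solved), set(unsolved), dict(edges)
    while unsolved:
        # Size of the solved SCC that each solved node belongs to.
        size = {n: len(c) for c in sccs(solved, edges) for n in c}

        def node_score(n):
            # Node(score) = sum(Edge(w) * |SCC|) over edges into the solved set.
            return sum(w * size[v] for (u, v), w in edges.items()
                       if u == n and v in solved)

        best = max(sccs(unsolved, edges),
                   key=lambda c: (sum(node_score(n) for n in c), sorted(c)))
        won = {keyword_of[n] for n in best}
        # Losing candidates compete for a keyword that the best SCC just settled.
        losers = {n for n in unsolved - best if keyword_of[n] in won}
        solved |= best
        unsolved -= best | losers
        edges = {e: w for e, w in edges.items() if not (set(e) & losers)}
        # Prune unsolved nodes whose remaining edge weights sum below zero.
        weight = defaultdict(float)
        for (u, _), w in edges.items():
            weight[u] += w
        negative = {n for n in unsolved if weight[n] < 0}
        unsolved -= negative
        edges = {e: w for e, w in edges.items() if not (set(e) & negative)}
    return solved

def both(a, b, w=1.0):
    """Store an undirected relation as two directed edges."""
    return {(a, b): w, (b, a): w}

edges = {**both("Saints", "Falcons"), **both("Saints", "NFL"), **both("Falcons", "NFL"),
         **both("Will Smith (Footballer)", "Saints"),
         **both("Will Smith (Actor)", "Celebrities")}
keyword_of = {"Will Smith (Footballer)": "Will Smith", "Will Smith (Actor)": "Will Smith"}
result = resolve(keyword_of, edges,
                 solved={"Saints", "Falcons", "NFL", "Celebrities"},
                 unsolved=set(keyword_of))
print(sorted(result))
```

In this toy graph the footballer scores 1.0 × 3 through the three-node football SCC, while the actor scores 1.0 × 1 through the lone Celebrities node, so the footballer wins and the actor is filtered out.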
Revisiting Zero Weight Nodes
Based on the correct Entities and context regions, heuristics are applied to resolve
these nodes. (Out of our scope)
Most Entities with positive correlation to the context are regarded as correct.
Summary
Document Graph
Solving on the SCC level and not on the Entity level
Dealing with Unknown Entities
Shows significant improvements over state-of-the-art products in real-life
scenarios
Thank You
Roni Wiener
For more information, please contact:
roni.wienner@teamaol.com
www.relegence.com
Please try it at:
http://www.relegence.com/demo
