Knowledge graphs, Yosi Mass

IBM Research
© 2014 IBM Corporation
A Scalable Graph Representation of Knowledge Bases
and its Uses for Semantic Document Relatedness
Yosi Mass, Dafna Sheinwald (HRL)
Feng Cao, Yuan Ni, Hai Pei Zhang, Qiongkai Xu (CRL)

Introduction – Knowledge Base
A knowledge base (KB) is a representation of knowledge in which:
• Nodes represent entities
• Edges represent relationships between entities
• Nodes and edges may have attributes
Linked Open Data

The DBPedia Knowledge base

Usage of Knowledge Bases
1. Semantic understanding of a text by mapping phrases to the knowledge base.
2. Finding relatedness/similarity between two given texts.
In the United Kingdom and Ireland, high school students traditionally do not have 'free
periods' but do have 'break' which normally occurs just after their second lesson of the
day (normally referred to as second period).
Mentions:
• United Kingdom - http://en.wikipedia.org/wiki/United_Kingdom
• Ireland - http://en.wikipedia.org/wiki/Ireland
• high school students - http://en.wikipedia.org/wiki/High_school - note the derivation to "high school student" and then the redirect to "High school".
• ‘free periods’ - http://en.wikipedia.org/wiki/Period_(school) - note the disambiguation.
• ‘break’ - http://en.wikipedia.org/wiki/Break_(work) - note the disambiguation.
• lesson - http://en.wikipedia.org/wiki/Lesson
• day - http://en.wikipedia.org/wiki/Day
• period - http://en.wikipedia.org/wiki/Period_(school) - note the disambiguation.

Facet graph use case - find semantic relatedness between two text paragraphs
[Figure: Paragraph 1 and Paragraph 2, with "?" marking the unknown relatedness between them]
Mention Detection
Graph-based similarity scorers
• Exploit the graph structure to find relationships between pairs of mentions
• Aggregate over all pairs
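The "aggregate over all pairs" step can be sketched as follows. This is only an illustration: the slides do not say which aggregation is used, so the plain average here is an assumption, and the names paragraph_relatedness and exact_match are hypothetical. Any graph-based pairwise scorer could be passed in place of the toy one.

```python
from itertools import product


def paragraph_relatedness(mentions_a, mentions_b, pair_score):
    """Average a pairwise score over all (mention, mention) pairs.

    The averaging is an assumption; the slides only say "aggregate".
    """
    pairs = list(product(mentions_a, mentions_b))
    if not pairs:
        return 0.0
    return sum(pair_score(a, b) for a, b in pairs) / len(pairs)


def exact_match(a, b):
    """Toy pairwise scorer: 1.0 for identical mentions, else 0.0."""
    return 1.0 if a == b else 0.0


score = paragraph_relatedness(
    ["United_Kingdom", "High_school"],
    ["United_Kingdom", "Lesson"],
    exact_match,
)
```

With the toy scorer, one of the four pairs matches, so the score is 0.25.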

Outline
• Generation of the Facet Graph from DBPedia
• Mention Detection
• Similarity measures on the Facet Graph

The TinkerPop Stack Usage in a project
• Titan graph - graph stack library
• HBase (Cassandra planned)
• Hadoop - MapReduce code to generate the graph
• Shortest path, similarity scorers - access the graph

Generate the Knowledge Graph from RDF data
• Input is given as RDF triples.
• Example:
  subject: http://dbpedia.org/resource/Yehuda_Vilner
  predicate: http://dbpedia.org/ontology/birthPlace
  object: http://dbpedia.org/resource/Israel
• URIs are translated to vertexIDs
• Adding a triple requires:
1. Add the subject and object as nodes (or get their IDs if they are already in the graph)
2. Add the predicate as an edge between the two nodes
Callouts on the slide: "This is the most expensive operation" and "Does not scale to millions of triples".
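The triple-by-triple procedure can be sketched as below. This is a minimal in-memory illustration, not the project's code: the class TinyGraph and its methods are hypothetical, and a Python dict stands in for the URI-to-vertexID lookup that a persistent graph store would have to perform for every triple.

```python
class TinyGraph:
    """Hypothetical in-memory stand-in for a graph store."""

    def __init__(self):
        self.vertex_ids = {}  # URI -> vertexID
        self.edges = []       # (subject_id, predicate, object_id)

    def get_or_add_vertex(self, uri):
        # Look the URI up first; create a new vertex only if it is missing.
        if uri not in self.vertex_ids:
            self.vertex_ids[uri] = len(self.vertex_ids) + 1
        return self.vertex_ids[uri]

    def add_triple(self, subject, predicate, obj):
        # Step 1: add subject and object as nodes (or get their IDs).
        sid = self.get_or_add_vertex(subject)
        oid = self.get_or_add_vertex(obj)
        # Step 2: add the predicate as an edge between the two nodes.
        self.edges.append((sid, predicate, oid))
        return sid, oid


graph = TinyGraph()
first = graph.add_triple(
    "http://dbpedia.org/resource/Yehuda_Vilner",
    "http://dbpedia.org/ontology/birthPlace",
    "http://dbpedia.org/resource/Israel",
)
```

Every call performs two lookups by URI, one for the subject and one for the object, so the number of lookups grows with the number of triples rather than with the number of distinct URIs.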

A scalable solution using MapReduce
• What is MapReduce?
• Programming model for expressing distributed computations at a massive scale
• Execution framework for organizing and performing such computations
• Open-source implementation called Hadoop
• Programmers specify two functions:
map (k, v) → <k’, v’>*
reduce (k’, v’*) → <k’’, v’’>*
All values with the same key are sent to the same reducer
The execution framework handles everything else…
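As a rough single-process illustration of the programming model (not of Hadoop itself), the sketch below runs a map function over input records, groups the intermediate values by key, and hands each group to a reduce function. The helper run_mapreduce and the word-count mapper and reducer are hypothetical names chosen for this example.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Simulate map, shuffle-and-sort, and reduce in one process."""
    groups = defaultdict(list)
    # Map phase: each (k, v) record yields zero or more (k', v') pairs.
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            # Shuffle: all values with the same key end up together.
            groups[out_key].append(out_value)
    output = []
    # Reduce phase: one call per key, keys visited in sorted order.
    for out_key in sorted(groups):
        output.extend(reducer(out_key, groups[out_key]))
    return output


def count_mapper(_, line):
    for word in line.split():
        yield word, 1


def count_reducer(word, counts):
    yield word, sum(counts)


result = run_mapreduce([(0, "a b a"), (1, "c b")], count_mapper, count_reducer)
```

In a real cluster the grouping step is what the execution framework provides: mappers and reducers run on different machines, and the framework routes every value for a given key to the same reducer.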

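MapReduce
[Figure: MapReduce data flow]
Input: (k1, v1) … (k6, v6), processed by four map tasks
Map output: (a, 1), (b, 2) | (c, 3), (c, 6) | (a, 5), (c, 2) | (b, 7), (c, 8)
Shuffle and Sort: aggregate values by keys: a → 1, 5; b → 2, 7; c → 2, 3, 6, 8
Reduce output (three reduce tasks): (r1, s1), (r2, s2), (r3, s3)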

Graph generation using MapReduce
Input triples:
(S1, P1, O1)
(S2, P2, O2)
(S3, P3, O1)
(S1, P2, O2)
Job 1 – sort by subjects
map output:
S1 (P1, O1)
S2 (P2, O2)
S3 (P3, O1)
S1 (P2, O2)
reduce
Job 2 – add subjects to graph and sort by objects
map output:
O1 (P1, SID1)
O2 (P2, SID2)
O1 (P3, SID3)
O2 (P2, SID1)
reduce
Job 3 – add objects and edges to graph
map
[Figure: resulting graph with edges SID1 -P1-> OID1, SID1 -P2-> OID2, SID3 -P3-> OID1, SID2 -P2-> OID2]
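The three jobs can be imitated in a single process as follows. This is a sketch under assumptions, not the Hadoop code behind the slides: group_by_key stands in for the shuffle-and-sort step, a dict stands in for the graph's URI-to-ID index, and the function names are hypothetical. It shows why the grouping helps: each distinct subject and each distinct object is resolved to a vertex ID once per group rather than once per triple.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Stand-in for shuffle and sort: collect values per key, sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def build_graph(triples):
    vertex_ids = {}  # URI -> vertex ID (stand-in for the graph index)
    edges = []       # (subject_id, predicate, object_id)

    def vertex_id(uri):
        # Get the existing ID or assign the next free one.
        return vertex_ids.setdefault(uri, len(vertex_ids) + 1)

    # Job 1: key every triple by its subject and group.
    by_subject = group_by_key((s, (p, o)) for s, p, o in triples)

    # Job 2: add each subject once, then re-key the records by object.
    keyed_by_object = []
    for subject, po_list in by_subject:
        sid = vertex_id(subject)
        for predicate, obj in po_list:
            keyed_by_object.append((obj, (predicate, sid)))
    by_object = group_by_key(keyed_by_object)

    # Job 3: add each object once, then add its incoming edges.
    for obj, ps_list in by_object:
        oid = vertex_id(obj)
        for predicate, sid in ps_list:
            edges.append((sid, predicate, oid))
    return vertex_ids, edges


vertex_ids, edges = build_graph([
    ("S1", "P1", "O1"),
    ("S2", "P2", "O2"),
    ("S3", "P3", "O1"),
    ("S1", "P2", "O2"),
])
```

For the four example triples this creates five vertices and four edges. A URI that occurs both as a subject and as an object is still mapped to a single vertex, because Job 3 reuses the ID assigned in Job 2.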

Facet Graph Architecture
• Implementation based on the Titan graph library with HBase as the backend
• Runs on a cluster of 3 machines
• Each machine has 16 cores, 2 TB disk and 32 GB memory
[Figure: architecture diagram with components Application, REST API, Rexster Server, Titan graph 1 … Titan graph n, HBase, Hadoop cluster]

Facet Graph performance
• Creation (offline)
• Use three MapReduce jobs to index DBPedia into Titan
1. First job sorts subjects
2. Second job adds subjects
3. Third job adds objects and edges
• Access (online)
• Implemented as a Java API that wraps the REST API of the Rexster server
• Performance on a cluster of 3 machines, each with 16 cores, 2 TB disk and 32 GB memory

Graph          #Vertices   #Edges   Creation time   Access time
Semantics FG   14M         72M      3h:45m          1 msec to get a node description; 2 sec to get the 223K inlinks of a heavy node (USA)
Links FG       19M         152M     7h:18m          4.4 sec to get the 447K inlinks of a heavy node (USA)

Mention detection
[Figure: pipeline Input Text → Spotting → Candidate Selection → Disambiguation → Annotated Text, with resources Lexicon, Lucene Index and Facet Graph]
Spotting stage: recognizes in a sentence the phrases (surface forms) that may indicate a mention in the KB
Candidate selection stage: given the surface form, retrieves the set of candidate URIs for disambiguation
Disambiguation stage: uses the context around the spotted phrase to decide on the best candidate.
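The three stages can be sketched with toy data as below. Everything in this example is an assumption made for illustration: the LEXICON and CONTEXT_WORDS tables are hypothetical stand-ins for the lexicon, the Lucene index and the Facet Graph, and disambiguation by keyword overlap is a deliberately simple substitute for whatever scoring the real system uses.

```python
import re

# Hypothetical lexicon: surface form -> candidate page titles.
LEXICON = {
    "break": ["Break_(work)", "Break_(music)"],
    "period": ["Period_(school)", "Period_(geology)"],
    "lesson": ["Lesson"],
}

# Hypothetical context keywords associated with each candidate.
CONTEXT_WORDS = {
    "Break_(work)": {"school", "lesson", "students"},
    "Break_(music)": {"drum", "song"},
    "Period_(school)": {"school", "lesson", "students"},
    "Period_(geology)": {"rock", "era"},
    "Lesson": set(),
}


def spot(tokens):
    """Spotting: keep the tokens that are known surface forms."""
    return [t for t in tokens if t in LEXICON]


def select_candidates(surface_form):
    """Candidate selection: look up the candidates for a surface form."""
    return LEXICON[surface_form]


def disambiguate(surface_form, context):
    """Disambiguation: pick the candidate sharing most words with the context."""
    return max(
        select_candidates(surface_form),
        key=lambda cand: len(CONTEXT_WORDS[cand] & context),
    )


def annotate(text):
    tokens = re.findall(r"[a-z]+", text.lower())
    context = set(tokens)
    return {sf: disambiguate(sf, context) for sf in spot(tokens)}


annotations = annotate("Students have a break after their second lesson at school")
```

Here "break" resolves to Break_(work) because the surrounding words overlap with that candidate's keywords, which mirrors the disambiguation noted in the earlier high school example.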

Pairwise Concept similarity based on wikilinks [1]
[1] Milne D., Witten I. H., An Effective, Low-Cost Measure of Semantic Relatedness Obtained from
Wikipedia Links, AAAI, 2008
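The slide's formula did not survive extraction. As a hedged sketch, the link-based measure of [1] is commonly stated as a normalized distance over the sets A and B of articles that link to concepts a and b, with W the set of all articles: (log(max(|A|, |B|)) - log(|A ∩ B|)) / (log(|W|) - log(min(|A|, |B|))). The function below turns that distance into a similarity by subtracting it from 1 and clamping at 0, which is a common convention and an assumption here; the name mw_relatedness is hypothetical.

```python
import math


def mw_relatedness(inlinks_a, inlinks_b, total_articles):
    """Inlink-based relatedness in [0, 1], following the form given in [1]."""
    a, b = set(inlinks_a), set(inlinks_b)
    common = a & b
    if not common:
        return 0.0  # no shared inlinks: treat as unrelated
    if a == b:
        return 1.0  # identical inlink sets; also avoids a zero denominator
    distance = (math.log(max(len(a), len(b))) - math.log(len(common))) / (
        math.log(total_articles) - math.log(min(len(a), len(b)))
    )
    # Clamp, since the distance can exceed 1 for weakly related concepts.
    return max(0.0, 1.0 - distance)


related = mw_relatedness({1, 2, 3, 4}, {3, 4, 5, 6}, total_articles=100)
```

In the Facet Graph setting the inlink sets would presumably come from the Links FG, where fetching the inlinks of a heavy node is the costly step.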

Our assets on IBM.next
http://ibmnext.stage1.mybluemix.net/assets

Thank You
