Distributed Graph Databases and the Emerging Web of Data


Marko A. Rodriguez
T-5, Center for Nonlinear Studies
Los Alamos National Laboratory
http://markorodriguez.com

April 16, 2009

Abstract

The World Wide Web is the de facto medium for publicly exposing a corpus
of interrelated documents. In its current form, the World Wide Web is the
Web of Documents. The next generation of the World Wide Web will
support the Web of Data. The Web of Data utilizes the same Uniform
Resource Identifier (URI) address space as the Web of Documents, but
instead of exposing a graph of documents, the Web of Data exposes a
graph of data. Given that the URI address space of the Web is distributed
and infinite, the Web of Data provides a single unified space by which the
world's data can be publicly exposed and interrelated. The Web of Data is
supported by both graph databases (which structure the data) and
distributed computing mechanisms (which process the data). This
presentation will discuss the Web of Data, graph databases, and models of
computing in this emerging space.

Computer Science Department Colloquium – University of New Mexico – April 16, 2009

Outline

• The Relational Database vs. the Graph Database

• The Web of Documents vs. the Web of Data

• Local Computing vs. Distributed Computing

• Multi-Relational Network Analysis with Grammar Walkers


The Relational Database vs. the Graph Database

• A relational database’s (e.g. MySQL, PostgreSQL, Oracle) data model
is a collection of interlinked tables.

• A graph database’s (e.g. OpenSesame, AllegroGraph, Neo4j) data model
is a multi-relational graph.

[Figure: a relational database (interlinked tables) at 127.0.0.1 beside a graph database (vertices a, b, c, d joined by edges) at 127.0.0.2.]


Types of Graphs
• Undirected single-relational graph: homogeneous set of symmetric links.

• Directed single-relational graph: homogeneous set of links.

• Directed multi-relational graph: heterogeneous set of links.

[Figure: three example graphs: an undirected single-relational graph, a directed single-relational graph, and a directed multi-relational graph.]


Our Make Believe World - Phase 1

• Marko is a human and Fluffy is a dog.


Our World Modeled in a Relational Database - Phase 1

Object_Table

ID    Name    Type   Legs  Fur
0001  Marko   Human  2     false
0002  Fluffy  Dog    4     true


Our World Modeled in a Graph Database - Phase 1

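[Figure: the same facts as a graph. Edges, written as "subject predicate object":]

0001  type  Human
0001  name  Marko
0001  legs  2
0001  fur   false
0002  type  Dog
0002  name  Fluffy
0002  legs  4
0002  fur   true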




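Our Make Believe World - Phase 2

• Marko and Fluffy are good friends.


Our World Modeled in a Relational Database - Phase 2

Object_Table

ID    Name    Type   Legs  Fur
0001  Marko   Human  2     false
0002  Fluffy  Dog    4     true

Friendship_Table

ID1   ID2
0001  0002
0002  0001


Our World Modeled in a Graph Database - Phase 2

[Figure: the Phase 1 graph, plus the edges:]

0001  friend  0002
0002  friend  0001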




Our Make Believe World - Phase 3

• Human and dog are subclasses of mammal.


Our World Modeled in a Relational Database - Phase 3

Object_Table

ID    Name    Type   Legs  Fur
0001  Marko   Human  2     false
0002  Fluffy  Dog    4     true

Friendship_Table

ID1   ID2
0001  0002
0002  0001

Subclass_Table

Type1  Type2
Human  Mammal
Dog    Mammal


Our World Modeled in a Graph Database - Phase 3

[Figure: the Phase 2 graph, plus the edges:]

Human  subclassof  Mammal
Dog    subclassof  Mammal






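Our Make Believe World - Phase 4

• Fluffy peed on the carpet.


Our World Modeled in a Relational Database - Phase 4

Object_Table (new row; Friendship_Table and Subclass_Table are also shown)

ID    Name    Type    Legs  Fur
0003  My_Rug  Carpet  N/A   N/A

Pee_Table

ID1   ID2
0002  0003


Our World Modeled in a Graph Database - Phase 4

[Figure: the Phase 3 graph, plus the edges:]

0003  type    Carpet
0003  name    My_Rug
0002  peedOn  0003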






Our Make Believe World - Phase 5

• Fluffy peed on the carpet.

• Marko and Fluffy are both mammals.


Our World Modeled in a Relational Database - Phase 5

Object_Table (row shown; Friendship_Table and Subclass_Table are also shown)

ID    Name    Type    Legs  Fur
0003  My_Rug  Carpet  N/A   N/A

Pee_Table

ID1   ID2
0002  0003

Type_Table

ID    Type
0001  Human
0002  Dog
0003  Carpet
0001  Mammal
0002  Mammal


Our World Modeled in a Graph Database - Phase 5

[Figure: the Phase 4 graph, plus the edges:]

0001  type  Mammal
0002  type  Mammal


The Graph as the Natural World Model

• The world is inherently (or perceived as) object-oriented.

• The world is filled with objects and relations among them.

• The multi-relational graph is a very natural representation of the world.


The Graph as the Natural Programming Model

• High-level computer languages are object-oriented.

• Nearly no impedance mismatch between the multi-relational graph and
the programming object.

• It is easy to go from graph database to in-memory object.

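// Illustrative only: Human and Dog stand for classes mapped from the graph.
Dog fluffy = new Dog();
Human marko = new Human();
marko.setName("Marko");
marko.addFriend(fluffy);
marko.setHasFur(false);
marko.setLegs(2);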


SQL vs. SPARQL

SELECT OTY.Name
FROM Object_Table AS OTX, Object_Table AS OTY, Friendship_Table
WHERE OTX.Name = 'Marko'
  AND Friendship_Table.ID1 = OTY.ID
  AND Friendship_Table.ID2 = OTX.ID;

SELECT ?z WHERE {
?x name "Marko" .
?y friend ?x .
?y name ?z }

E. Prud’hommeaux and A. Seaborne. SPARQL Query Language for RDF, WWW Consortium, http://www.w3.org/TR/2004/WD-rdf-sparql-query-20041012/, 2004.
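
The three triple patterns of the SPARQL query can be mimicked over an in-memory set of triples. The following is a minimal Python sketch, not how a SPARQL engine actually evaluates queries; the function name `friends_of` and the tuple layout are assumptions made for illustration, and the data is the make-believe world from the earlier slides.

```python
# Triples from the make-believe world: (subject, predicate, object).
TRIPLES = {
    ("0001", "name", "Marko"),
    ("0002", "name", "Fluffy"),
    ("0001", "friend", "0002"),
    ("0002", "friend", "0001"),
}

def friends_of(name, triples=TRIPLES):
    """Names ?z such that: ?x name <name> . ?y friend ?x . ?y name ?z"""
    # ?x name "<name>"
    xs = {s for (s, p, o) in triples if p == "name" and o == name}
    # ?y friend ?x
    ys = {s for (s, p, o) in triples if p == "friend" and o in xs}
    # ?y name ?z
    return sorted(o for (s, p, o) in triples if p == "name" and s in ys)

print(friends_of("Marko"))  # ['Fluffy']
```

Each pattern narrows the set of candidate bindings, which is the same join the SQL version spells out across two tables.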


Internet Address Spaces

• The Uniform Resource Identifier (URI) is the superclass of the Uniform
Resource Locator (URL) and Uniform Resource Name (URN).


The Uniform Resource Locator
• The set of all URLs is the address space of all resources that can be
located and retrieved on the Web. URLs denote where a resource is.
http://markorodriguez.com/index.html
∗ Domain name server (DNS): markorodriguez.com → 216.251.43.6
∗ http:// means GET at port 80,
∗ /index.html means the resource to get at that Internet location.

[Figure: a Web server at markorodriguez.com (216.251.43.6) serving index.html.]


The Uniform Resource Name

• The set of all URNs is the address space of all resources within the urn:
namespace.
urn:uuid:bd93def0-8026-11dd-842be54955baa12
urn:issn:0892-3310
urn:doi:10.1016/j.knosys.2008.03.030

• Named resources need not be retrievable through the Web.

• URNs denote what a resource is.


The Uniform Resource Identifier
• The URI address space is an infinite space for all Internet resources.
urn:issn:0892-3310
ftp://markorodriguez.com/private/markos_secrets.txt
http://www.lanl.gov#fluffy

• Important: URIs can denote concepts, instances, and data.

lanl:fluffy lanl:fluffy_legs

lanl is a namespace prefix which extends to http://www.lanl.gov#.
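
Prefix expansion is plain string substitution. A minimal Python sketch follows, assuming a hypothetical `expand` helper and a prefix table that holds only the lanl prefix given on the slide.

```python
# Namespace prefixes; only the lanl prefix comes from the slide.
PREFIXES = {"lanl": "http://www.lanl.gov#"}

def expand(prefixed_name, prefixes=PREFIXES):
    """Expand a prefixed name such as 'lanl:fluffy' into a full URI."""
    prefix, sep, local = prefixed_name.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"unknown prefix in {prefixed_name!r}")
    return prefixes[prefix] + local

print(expand("lanl:fluffy"))  # http://www.lanl.gov#fluffy
```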


The Web of Documents
• The Web of Documents is primarily concerned with the Hypertext
Transfer Protocol (HTTP) and with retrievable resources in the URL
address space.

• These retrievable resources are files: HTML documents, images, audio,
etc. The “web” is created when HTML documents contain URLs.
[Figure: index.html at http://markorodriguez.com/ links by href to Resume.html, Home.html, and Research.html.]


The Web of Data

• The Web of Data is primarily concerned with URIs.

• The Resource Description Framework (RDF) is the standard for
representing the relationship between URIs and literals (e.g. float, string,
date time, etc.).

subject      predicate   object
lanl:marko   foaf:knows  lanl:fluffy
lanl:marko   foaf:name   "Marko A. Rodriguez"^^xsd:string
lanl:fluffy  foaf:name   "Fluffy P. Everywhere"^^xsd:string

C. Bizer, T. Heath, K. Idehen, and T. Berners-Lee. Linked Data on the Web, International World Wide Web Conference, 2008.


Our Make Believe World in RDF
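[Figure: the make-believe world as an RDF graph. Its triples:]

lanl:Human   rdfs:subClassOf  lanl:Mammal
lanl:Dog     rdfs:subClassOf  lanl:Mammal
lanl:marko   rdf:type         lanl:Human
lanl:marko   rdf:type         lanl:Mammal
lanl:fluffy  rdf:type         lanl:Dog
lanl:fluffy  rdf:type         lanl:Mammal
lanl:marko   lanl:friend      lanl:fluffy
lanl:fluffy  lanl:friend      lanl:marko
lanl:marko   lanl:fur         "false"^^xsd:boolean
lanl:marko   lanl:legs        "2"^^xsd:integer
lanl:marko   foaf:name        "Marko A. Rodriguez"^^xsd:string
lanl:fluffy  lanl:fur         "true"^^xsd:boolean
lanl:fluffy  lanl:legs        "4"^^xsd:integer
lanl:fluffy  foaf:name        "Fluffy P. Everywhere"^^xsd:string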


The Web of Data is a Distributed Database

• The URI address space is distributed.

• URIs can denote data.

• RDF denotes the relationships between URIs.

• The Web of Data’s foundational standard is RDF.

• Therefore, the Web of Data is a distributed database.


The Web of Documents vs. the Web of Data

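[Figure: in the Web of Documents, an HTML document on the Web server at 127.0.0.1 links by href to an HTML document on the Web server at 127.0.0.2. In the Web of Data, a vertex in the graph database at 127.0.0.1 links by lanl:friend to a vertex in the graph database at 127.0.0.2.]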


The Current Web of Data - March 2009
[Figure: the Linked Data cloud as of March 2009, drawn as a graph whose vertices are data sets (e.g. dbpedia, geonames, uniprot, musicbrainz) and whose edges are links between data sets.]

M.A. Rodriguez. A Graph Analysis of the Linked Data Cloud, in review, http://arxiv.org/abs/0903.0194, 2009.

The Current Web of Data - March 2009
data set          domain      data set          domain      data set            domain
audioscrobbler    music       govtrack          government  pubguide            books
bbclatertotp      music       homologene        biology     qdos                social
bbcplaycountdata  music       ibm               computer    rae2001             computer
bbcprogrammes     media       ieee              computer    rdfbookmashup       books
budapestbme       computer    interpro          biology     rdfohloh            social
chebi             biology     jamendo           music       resex               computer
crunchbase        business    laascnrs          computer    riese               government
dailymed          medical     libris            books       semanticweborg      computer
dblpberlin        computer    lingvoj           reference   semwebcentral       social
dblphannover      computer    linkedct          medical     siocsites           social
dblprkbexplorer   computer    linkedmdb         movie       surgeradio          music
dbpedia           general     magnatune         music       swconferencecorpus  computer
doapspace         social      musicbrainz       music       taxonomy            reference
drugbank          medical     myspacewrapper    social      umbel               general
eurecom           computer    opencalais        reference   uniref              biology
eurostat          government  opencyc           general     unists              biology
flickrexporter    images      openguides        reference   uscensusdata        government
flickrwrappr      images      pdb               biology     virtuososponger     reference
foafprofiles      social      pfam              biology     w3cwordnet          reference
freebase          general     pisa              computer    wikicompany         business
geneid            biology     prodom            biology     worldfactbook       government
geneontology      biology     projectgutenberg  books       yago                general
geonames          geographic  prosite           biology     ...


Cultural Differences that are Leading to Web-Based
Data Management - Part 1

• Relational databases tend to not maintain public access points.

• Relational database users tend to not publish their schemas.

• Web of Data graph databases maintain public access points called
SPARQL end-points or Linked Data URLs.

• Web of Data graph database users tend to reuse and extend public
schemas called ontologies.


Cultural Differences that are Leading to Web-Based
Data Management - Part 2

[Figure: two models side by side. Conventional Model: Applications 1, 2, and 3 at 127.0.0.1, 127.0.0.2, and 127.0.0.3 each both process and structure their own data. Web of Data Model: Applications 1, 2, and 3 at 127.0.0.1, 127.0.0.2, and 127.0.0.3 process data from the shared Web of Data, which is structured by the hosts 127.0.0.4, 127.0.0.5, and 127.0.0.6.]


SPARQLing a Data Provider - Local Computing

[Figure: the client at 127.0.0.1 sends a SPARQL query to the end-point of the graph database at 127.0.0.2 and receives the bound results.]

SELECT ?x WHERE {
  lanl:marko lanl:friend ?x
}

{ lanl:fluffy }

• The 127.0.0.1 client is querying the 127.0.0.2 server.

• The query is any read-based SPARQL query.

• The results are those resources that bind to the query variables.
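
A client typically submits such a query over HTTP; in the W3C SPARQL Protocol the query text travels in a `query` parameter. The Python sketch below only builds the request and does not send it. The end-point path `/sparql` is a hypothetical placeholder (the slide shows only a server at 127.0.0.2), and the `Accept` value is the registered media type for SPARQL JSON results.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request

# Hypothetical end-point URL; the slide only shows a server at 127.0.0.2.
ENDPOINT = "http://127.0.0.2/sparql"

QUERY = """PREFIX lanl: <http://www.lanl.gov#>
SELECT ?x WHERE { lanl:marko lanl:friend ?x }"""

def build_request(endpoint, query):
    """Build (but do not send) an HTTP GET carrying the query string."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})

req = build_request(ENDPOINT, QUERY)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it; that step is left out because no real end-point is assumed here.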


GETing Linked Data as RDF - Local Computing
[Figure: the client at 127.0.0.1 issues an HTTP GET on http://www.lanl.gov#marko and receives the RDF subgraph around lanl:marko (lanl:marko lanl:friend lanl:fluffy; lanl:marko lanl:wrote vub:1010). A second HTTP GET on http://www.vub.edu#1010 returns the subgraph around vub:1010 (lanl:marko lanl:wrote vub:1010; vub:1010 lanl:cites ieee:2020).]


Problem with the Current Web of Data Infrastructure

• The only interfaces are SPARQL end-points and HTTP GETs of RDF
subgraphs.

• For human-based document retrieval, this is fine. For machine-based
data processing, this does not scale.

M.A. Rodriguez. A Distributed Process Infrastructure for a Distributed Data Structure. Semantic Web and Information Systems Bulletin, AIS Special Interest Group on Semantic Web and Information Systems, http://arxiv.org/abs/0807.3908, 2008.


Problem with the Current Web of Data Infrastructure

• We cannot rely on the “download and index” philosophy of the World
Wide Web.
  ∗ As of March 2009, the Web of Data maintains 4.5 billion triples.

• The Web of Data cannot rely on a single service provider:
  ∗ too much data.
  ∗ too many types of algorithms that can utilize this data.
  ∗ too many clock cycles to locally process this data.


The Open Virtual Machine Farm
[Figure: the graph databases at 127.0.0.1 and 127.0.0.2 are joined by a lanl:friend edge; each host also runs a Virtual Machine Farm, and code and machines migrate between the two farms.]

• Distributed computing through code/machine migration between farms.

• Move the process to the data, not the data to the process.

M.A. Rodriguez. General Purpose Computing on a Semantic Network Substrate. In Emergent Web Intelligence, eds. R. Chbeir, A. Hassanien, A. Abraham and Y. Badr, Springer-Verlag, http://arxiv.org/abs/0704.3395, 2009.

M.A. Rodriguez. The RDF Virtual Machine, in review, LA-UR-08-03925, 2009.


Neno RDF Programming Language - Code Serialization
xsd:int example(xsd:string a)
{
  if(a == "marko")
    return 1;
  else
    return 2;
}

[Figure: the method above serialized as an RDF graph. demo:Human has a Method (hasMethod) whose name is "example"^^xsd:string (hasMethodName) and whose body is a Block (hasBlock). The instructions in the graph (Equals, Branch, PushValue, Return), their operands (LocalDirect, with hasURI and hasValue), and the Blocks are each identified by a urn:uuid: URI and are connected by nextInst, trueInst, falseInst, hasLeft, and hasRight edges.]


The Fhat RDF Virtual Machine - Machine Serialization
[Figure: the Fhat ontology. Classes shown: RVM, Fhat, Instruction, Block, Frame, Frame Variable, Operand Stack, Return Stack, Block Stack, and rdfs:Resource. Properties shown: methodReuse and halt (xsd:boolean), programLocation, operandTop, returnTop, blockTop, currentFrame, hasFrame, forFrame, fromBlock, hasSymbol (xsd:string), hasValue (rdfs:Resource), rdf:first, rdf:rest, and rdf:li, each annotated with its cardinality.]


A Collection of Interlinked Graph Databases - Currently
[Figure: interlinked graph databases hosted at 127.0.0.2 through 127.0.0.11.]


A Collection of Interlinked Graph Databases and
Processors - Future
[Figure: the same hosts, 127.0.0.2 through 127.0.0.11, each now providing both a graph database and a processor.]


The Future of Web-Based Distributed Computing

• The HTTP GET approach to the Web of Data does not scale.

• The Neno/Fhat (or any general-purpose computing) environment is
unsafe.

• The Web of Data needs an open, safe, flexible, and easy-to-adopt
computing infrastructure.


What Type of Processing?

• Object-oriented programming: Web of Data as an object repository.

• Logic: Web of Data as a knowledge-base.

• Graph/network analysis: Web of Data as a multi-relational graph.

• The future computing environment should support at least these popular
processing models.

• We will focus on graph/network analysis for the remainder of this
presentation.


Introduction to Random Walkers

• Random walkers can be used in single-relational networks to calculate:
  ∗ stationary probability distribution: primary eigenvector calculation
  ∗ spreading activation: search by means of diffusion

• There is a continuous and a discrete form of the general random walk
method.


Random Walks in a Single-Relational Network

• Suppose a single-relational network $G$, where

$$G = (V, E \subseteq (V \times V)).$$

• Let’s represent that network as a row-stochastic adjacency matrix $A \in [0, 1]^{|V| \times |V|}$, where

$$A_{i,j} = \begin{cases} \frac{1}{\Gamma(i)} & \text{if } (i, j) \in E \\ 0 & \text{otherwise.} \end{cases}$$

• Finally, assume an “energy vector” $\pi \in \mathbb{R}^{|V|}$.


Random Walks in a Single-Relational Network

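[Figure: the graph G with vertices a, b, c, d and edges a→b, a→d, b→c, c→a, c→d, d→b, shown with its matrix A and energy vector π.]

With rows and columns ordered a, b, c, d:

$$A = \begin{pmatrix} 0 & 0.5 & 0 & 0.5 \\ 0 & 0 & 1 & 0 \\ 0.5 & 0 & 0 & 0.5 \\ 0 & 1 & 0 & 0 \end{pmatrix}, \qquad \pi = (1, 0, 0, 0)$$
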
• πA can be interpreted as the continuous form of propagating random
walkers over G.
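
One propagation step is a row-vector by matrix product. The Python sketch below uses the matrix A from the slide (vertex order a, b, c, d); the helper name `step` is an assumption made for illustration.

```python
# Row-stochastic adjacency matrix from the slide (vertex order a, b, c, d).
A = [
    [0.0, 0.5, 0.0, 0.5],  # a -> b, d
    [0.0, 0.0, 1.0, 0.0],  # b -> c
    [0.5, 0.0, 0.0, 0.5],  # c -> a, d
    [0.0, 1.0, 0.0, 0.0],  # d -> b
]

def step(pi, A):
    """One propagation step: returns the row vector pi * A."""
    n = len(A)
    return [sum(pi[i] * A[i][j] for i in range(n)) for j in range(n)]

pi = [1.0, 0.0, 0.0, 0.0]  # all energy starts at vertex a
for t in range(1, 4):
    print(t, pi)
    pi = step(pi, A)
```

Because every row of A sums to 1, the total energy in π is preserved at each step.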


Stationary Probability Distribution in a
Single-Relational Network
[Figure: the energy vector π propagated through A over time, starting with all energy at vertex a.]

time  a     b     c     d
π1    1     0     0     0
π2    0     0.5   0     0.5
π3    0     0.5   0.5   0
π4    0.25  0     0.5   0.25
π5    0.25  0.38  0     0.38
π6    0     0.5   0.38  0.13
...
π∞    0.15  0.31  0.31  0.23


Stationary Probability Distribution in a
Single-Relational Network

• If G is strongly connected and aperiodic then there exists a π such that
π = πA.

• This stationary π∞ is the primary eigenvector of A.

• PageRank computes the stationary π by forcing G (the Web citation
graph) to be strongly connected and aperiodic.
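
For the four-vertex example, the stationary vector can be approximated by repeating the propagation step until π stops changing (power iteration). This Python sketch is one simple way to do it, not necessarily how the figures on the slide were produced; the helper names, tolerance, and iteration cap are assumptions.

```python
# Row-stochastic adjacency matrix of the example graph (order a, b, c, d).
A = [
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
    [0.5, 0.0, 0.0, 0.5],
    [0.0, 1.0, 0.0, 0.0],
]

def step(pi, A):
    """One propagation step: returns the row vector pi * A."""
    n = len(A)
    return [sum(pi[i] * A[i][j] for i in range(n)) for j in range(n)]

def stationary(A, tol=1e-12, max_iter=10_000):
    """Iterate pi <- pi * A from a uniform start until the change is below tol."""
    n = len(A)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = step(pi, A)
        if max(abs(x - y) for x, y in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    raise RuntimeError("did not converge")

pi_inf = stationary(A)
print([round(x, 2) for x in pi_inf])  # [0.15, 0.31, 0.31, 0.23]
```

Solving π = πA by hand for this graph gives (2/13, 4/13, 4/13, 3/13), which rounds to the values printed above.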


Spreading Activation in a Single-Relational Network

• Spreading activation can be thought of as a “local rank” algorithm, while
calculating the stationary probability provides you a “global rank”.

• With spreading activation, you iterate for only a certain number of
timesteps.

• Also, you record how much energy has flowed through each vertex.

• Let’s demonstrate using a single discrete walker...


Spreading Activation in a Single-Relational Network

• The walker moves from vertex to vertex, choosing its next vertex
according to the probability distribution in A.

• At every step, if the walker is at vertex i then $\pi_i = \pi_i + 1$.

[Figure: a single walker on G visits a, b, c, and a again at times 1 to 4.]

time  a  b  c  d
π1    1  0  0  0
π2    1  1  0  0
π3    1  1  1  0
π4    2  1  1  0
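
The discrete form can be simulated directly. In this Python sketch, a single walker starts at vertex a and the visit counter plays the role of π; the function name `walk`, the seed, and the step count are assumptions, and the particular path taken depends on the random number generator.

```python
import random

# Row-stochastic adjacency matrix of the example graph (order a, b, c, d).
A = [
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
    [0.5, 0.0, 0.0, 0.5],
    [0.0, 1.0, 0.0, 0.0],
]

def walk(A, start, steps, rng):
    """Move one walker for `steps` time steps, counting visits per vertex."""
    n = len(A)
    visits = [0] * n
    v = start
    for _ in range(steps):
        visits[v] += 1  # pi_i = pi_i + 1 for the current vertex i
        v = rng.choices(range(n), weights=A[v])[0]  # next vertex drawn from row v
    return visits

visits = walk(A, start=0, steps=4, rng=random.Random(0))
print(visits)
```

Stopping after a few steps gives the "local rank" around the start vertex; running many walkers for many steps would approach the stationary distribution.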


Random Walks in a Multi-Relational Network

• Suppose a multi-relational network $M$, where

$$M = (V, \mathbb{E} = \{E_0, E_1, \ldots, E_k \subseteq (V \times V)\})$$

• Represent as a {0, 1}-adjacency tensor $\mathcal{A} \in \{0, 1\}^{|V| \times |V| \times |\mathbb{E}|}$, where

$$\mathcal{A}^m_{i,j} = \begin{cases} 1 & \text{if } (i, j) \in E_m : 1 \leq m \leq k \\ 0 & \text{otherwise.} \end{cases}$$

• Then assume an “energy vector” $\pi \in \mathbb{R}^{|V|}$.

M.A. Rodriguez and J. Shinavier. Exposing Multi-Relational Networks to Single-Relational Network Analysis Algorithms, in review, http://arxiv.org/abs/0806.2274, 2009.


Random Walks in a Multi-Relational Network

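[Figure: a multi-relational network M on vertices a, b, c, d with an authored edge between a and b, a cites edge from b to c, and a contains edge between c and d, shown with its adjacency tensor A (one 4×4 slice each for authored, cites, and contains) and energy vector π = (1, 0, 0, 0).]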


The Operations of the Multi-Relational Path Algebra

• $A \cdot B$: ordinary matrix multiplication determines the number of (A, B)-paths between vertices.
• $A^\top$: matrix transpose inverts path directionality.
• $A \circ B$: Hadamard, entry-wise multiplication applies a filter to selectively exclude paths.
• $n(A)$: not generates the complement of a $\{0, 1\}^{n \times n}$ matrix.
• $c(A)$: clip generates a $\{0, 1\}^{n \times n}$ matrix from a $\mathbb{R}_+^{n \times n}$ matrix.
• $v^\pm(A)$: vertex generates a $\{0, 1\}^{n \times n}$ matrix from a $\mathbb{R}_+^{n \times n}$ matrix, where only certain rows or columns contain non-zero values.
• $\lambda A$: scalar multiplication weights the entries of a matrix.
• $A + B$: matrix addition merges paths.


The Traverse Operation
• An interesting aspect of the single-relational adjacency matrix $A \in \{0, 1\}^{n \times n}$ is that when it is raised to the kth power, the entry $A^{(k)}_{i,j}$ is equal to the number of paths of length k that connect vertex i to vertex j.

• Given, by definition, that $A^{(1)}_{i,j}$ (i.e. $A_{i,j}$) represents the number of paths that go from i to j of length 1 (i.e. a single edge) and by the rules of ordinary matrix multiplication,

$$A^{(k)}_{i,j} = \sum_{l \in V} A^{(k-1)}_{i,l} \cdot A_{l,j} : k \geq 2.$$

[Figure: the chain a → b → c.]

With rows and columns ordered a, b, c:

$$\begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix} = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}$$

There is a path of length 2 from a to c.
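
The three-vertex example can be checked with a few lines of Python. This is a minimal sketch with a hand-written `matmul` helper (an assumption; any matrix library would do).

```python
def matmul(X, Y):
    """Ordinary matrix product of two square list-of-lists matrices."""
    n = len(X)
    return [[sum(X[i][l] * Y[l][j] for l in range(n)) for j in range(n)]
            for i in range(n)]

# Adjacency matrix of the chain a -> b -> c (order a, b, c).
A = [
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]

A2 = matmul(A, A)   # entry (i, j) counts the paths of length 2 from i to j
A3 = matmul(A2, A)  # entry (i, j) counts the paths of length 3 from i to j
print(A2)  # [[0, 0, 1], [0, 0, 0], [0, 0, 0]]
```

The single 1 in `A2` is the path a → b → c; `A3` is all zeros because the chain has no path of length 3.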


$A^1$: authored    $A^2$: cites    $A^3$: contains

The Traverse Operation

$$Z = A^1 \cdot A^2 \cdot A^{1\top},$$

$Z_{i,j}$ defines the number of paths from vertex i to vertex j such that a path goes from author i to one of the
articles he or she has authored, from that article to one of the articles it cites, and finally, from that cited
article to its author j. Semantically, Z is an author-citation single-relational path matrix.

[Figure: lanl:marko authored vub:1010 ($A^1$), vub:1010 cites ieee:2020 ($A^2$), and ieee:2020 was authored by vub:fheyligh ($A^{1\top}$); the resulting Z edge is lanl:marko lanl:author-citation vub:fheyligh.]

* NOTE: All diagrams are with respect to a “source” vertex (the blue vertex) in order to preserve clarity. In reality, the operations operate on all vertices in parallel.
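
The author-citation product can be worked through on the four vertices named in the figure. In this Python sketch the vertex ordering and the helper names (`slice_of`, `matmul`, `transpose`) are assumptions made for illustration.

```python
V = ["lanl:marko", "vub:fheyligh", "vub:1010", "ieee:2020"]
IDX = {v: i for i, v in enumerate(V)}
N = len(V)

def slice_of(edges):
    """Build one {0,1} slice of the adjacency tensor from (subject, object) pairs."""
    M = [[0] * N for _ in range(N)]
    for s, o in edges:
        M[IDX[s]][IDX[o]] = 1
    return M

def matmul(X, Y):
    return [[sum(X[i][l] * Y[l][j] for l in range(N)) for j in range(N)]
            for i in range(N)]

def transpose(X):
    return [list(col) for col in zip(*X)]

authored = slice_of([("lanl:marko", "vub:1010"), ("vub:fheyligh", "ieee:2020")])
cites = slice_of([("vub:1010", "ieee:2020")])

# Z = authored . cites . authored^T
Z = matmul(matmul(authored, cites), transpose(authored))
print(Z[IDX["lanl:marko"]][IDX["vub:fheyligh"]])  # 1
```

The transpose turns "author authored article" into "article was authored by author", which is what lets the path end on an author.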


The Filter Operation
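Various path filters can be defined and applied using the entry-wise
Hadamard matrix product denoted ◦, where

$$A \circ B = \begin{pmatrix} A_{1,1} \cdot B_{1,1} & \cdots & A_{1,m} \cdot B_{1,m} \\ \vdots & \ddots & \vdots \\ A_{n,1} \cdot B_{n,1} & \cdots & A_{n,m} \cdot B_{n,m} \end{pmatrix}$$

$$\underbrace{\begin{pmatrix} 24 & 1 & 0 & 0 & 0 \\ 0 & 72 & 0 & 4 & 0 \\ 23 & 0 & 0 & 0 & 0 \\ 0 & 0 & 15.3 & 0 & 0 \\ 0 & 0 & 0 & 0 & 12 \end{pmatrix}}_{\text{Path Matrix}} \circ \underbrace{\begin{pmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix}}_{\text{Path Filter}} = \underbrace{\begin{pmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 72 & 0 & 0 & 0 \\ 23 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix}}_{\text{Filtered Path Matrix}}$$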


The Filter Operation

• A◦1=A
• A◦0=0
• A◦B=B◦A
• A ◦ (B + C) = (A ◦ B) + (A ◦ C)
• $A^\top \circ B^\top = (A \circ B)^\top$.
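
The filter is an entry-wise product, so a zero in the filter removes the corresponding paths. This Python sketch reproduces the 5×5 example; the helper names are assumptions.

```python
def hadamard(X, Y):
    """Entry-wise (Hadamard) product of two equally sized matrices."""
    return [[x * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def transpose(X):
    return [list(col) for col in zip(*X)]

PATHS = [
    [24, 1, 0, 0, 0],
    [0, 72, 0, 4, 0],
    [23, 0, 0, 0, 0],
    [0, 0, 15.3, 0, 0],
    [0, 0, 0, 0, 12],
]
FILTER = [
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

filtered = hadamard(PATHS, FILTER)
for row in filtered:
    print(row)
```

Only the entries (1,2), (2,2), and (3,1) survive, since those are the only positions where the filter holds a 1.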


The Not Filter
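The not filter is useful for excluding a set of paths to or from a vertex.

$$n : \{0, 1\}^{n \times n} \rightarrow \{0, 1\}^{n \times n}$$

with a function rule of

$$n(A)_{i,j} = \begin{cases} 1 & \text{if } A_{i,j} = 0 \\ 0 & \text{otherwise.} \end{cases}$$

$$n \begin{pmatrix} 0 & 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 1 & 1 \\ 1 & 1 & 0 & 1 & 1 \\ 1 & 1 & 1 & 1 & 0 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}$$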


The Not Filter

If A ∈ {0, 1}n×n, then

• n(n(A)) = A
• A ◦ n(A) = 0
• n(A) ◦ n(A) = n(A).
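
These identities are easy to check numerically. The Python sketch below applies the not filter to the 5×5 example; the helper names `n` and `hadamard` mirror the slide notation and are otherwise assumptions.

```python
def n(A):
    """Not filter: complement of a {0,1} matrix."""
    return [[1 if a == 0 else 0 for a in row] for row in A]

def hadamard(X, Y):
    """Entry-wise (Hadamard) product."""
    return [[x * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

A = [
    [0, 0, 1, 1, 1],
    [1, 0, 1, 0, 1],
    [0, 1, 1, 1, 1],
    [1, 1, 0, 1, 1],
    [1, 1, 1, 1, 0],
]

for row in n(A):
    print(row)
```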



The Not Filter
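A coauthorship path matrix is

$$Z = A^1 \cdot A^{1\top} \circ n(I)$$

[Figure: lanl:marko and lanl:jbollen both authored acm:0505. Following $A^1$ and then $A^{1\top}$ yields the Z edge lanl:marko lanl:coauthor lanl:jbollen; the n(I) filter removes the lanl:coauthor loop from lanl:marko back to himself.]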


The Clip Filter
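The general purpose of clip is to take a path matrix and “clip”, or
normalize, it to a $\{0, 1\}^{n \times n}$ matrix.

$$c : \mathbb{R}_+^{n \times n} \rightarrow \{0, 1\}^{n \times n}$$

$$c(Z)_{i,j} = \begin{cases} 1 & \text{if } Z_{i,j} > 0 \\ 0 & \text{otherwise.} \end{cases}$$

$$c \begin{pmatrix} 24 & 1 & 0 & 0 & 0 \\ 0 & 72 & 0 & 4 & 0 \\ 23 & 0 & 0 & 0 & 0 \\ 0 & 0 & 15.3 & 0 & 0 \\ 0 & 0 & 0 & 0 & 12 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}$$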


The Clip Filter

If $A, B \in \{0, 1\}^{n \times n}$ and $Y, Z \in \mathbb{R}_+^{n \times n}$, then

• c(A) = A
• c(n(A)) = n(c(A)) = n(A)
• c(Y ◦ Z) = c(Y) ◦ c(Z)
• n(A ◦ B) = c (n(A) + n(B))
• n(A + B) = n(A) ◦ n(B)
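
The clip identities can be spot-checked on small matrices. This Python sketch defines clip and not as on the slides and tests them on the 5×5 path matrix and on two hypothetical 2×2 {0,1} matrices; a numeric check on examples illustrates the identities but does not prove them.

```python
def hadamard(X, Y):
    """Entry-wise (Hadamard) product."""
    return [[x * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def add(X, Y):
    """Entry-wise matrix addition."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def n(A):
    """Not filter: 1 where the entry is 0, else 0."""
    return [[1 if a == 0 else 0 for a in row] for row in A]

def c(Z):
    """Clip filter: 1 where the entry is positive, else 0."""
    return [[1 if z > 0 else 0 for z in row] for row in Z]

PATHS = [
    [24, 1, 0, 0, 0],
    [0, 72, 0, 4, 0],
    [23, 0, 0, 0, 0],
    [0, 0, 15.3, 0, 0],
    [0, 0, 0, 0, 12],
]

# Hypothetical {0,1} matrices used only to exercise the identities.
A = [[1, 0], [0, 1]]
B = [[1, 1], [0, 0]]

for row in c(PATHS):
    print(row)
print(n(hadamard(A, B)) == c(add(n(A), n(B))))  # True
```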



The Clip Filter
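Suppose we want to create an author-citation path matrix that does not allow self citation or coauthor
citations.

$$Z = \underbrace{A^1 \cdot A^2 \cdot A^{1\top}}_{\text{cites}} \circ \underbrace{n\left(c\left(A^1 \cdot A^{1\top} \circ n(I)\right)\right)}_{\text{no coauthors}} \circ \underbrace{n(I)}_{\text{no self}}$$

[Figure: lanl:marko authored lanl:3030, which cites lanl:4040, which was authored by odu:nelson, giving the Z edge lanl:marko lanl:author-citation odu:nelson. lanl:marko and lanl:jbollen are coauthors, so the term $n(c(A^1 \cdot A^{1\top} \circ n(I)))$ removes author-citation paths between them, and n(I) removes paths from an author back to himself.]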


The Clip Filter

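However, using various theorems of the path algebra and abstract algebra
in general,

$$Z = \underbrace{A^1 \cdot A^2 \cdot A^{1\top}}_{\text{cites}} \circ \underbrace{n\left(c\left(A^1 \cdot A^{1\top} \circ n(I)\right)\right)}_{\text{no coauthors}} \circ \underbrace{n(I)}_{\text{no self}}$$

becomes

$$Z = A^1 \cdot A^2 \cdot A^{1\top} \circ n\left(c\left(A^1 \cdot A^{1\top}\right)\right) \circ n(I).$$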
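
The filtered author-citation matrix can be worked through on a small network. In this Python sketch the vertex names come from the figure, but the exact edge set (in particular that lanl:jbollen authored both articles) is an assumption chosen so that each filter has something to remove. It also checks that dropping the inner n(I) leaves the result unchanged on this example.

```python
V = ["lanl:marko", "lanl:jbollen", "odu:nelson", "lanl:3030", "lanl:4040"]
IDX = {v: i for i, v in enumerate(V)}
N = len(V)

def slice_of(edges):
    """Build one {0,1} slice of the adjacency tensor from (subject, object) pairs."""
    M = [[0] * N for _ in range(N)]
    for s, o in edges:
        M[IDX[s]][IDX[o]] = 1
    return M

def matmul(X, Y):
    return [[sum(X[i][l] * Y[l][j] for l in range(N)) for j in range(N)]
            for i in range(N)]

def transpose(X):
    return [list(col) for col in zip(*X)]

def hadamard(X, Y):
    return [[x * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def n(X):
    return [[1 if x == 0 else 0 for x in row] for row in X]

def c(X):
    return [[1 if x > 0 else 0 for x in row] for row in X]

I = [[1 if i == j else 0 for j in range(N)] for i in range(N)]

# Assumed edges: authored (A1) and cites (A2).
A1 = slice_of([("lanl:marko", "lanl:3030"), ("lanl:jbollen", "lanl:3030"),
               ("lanl:jbollen", "lanl:4040"), ("odu:nelson", "lanl:4040")])
A2 = slice_of([("lanl:3030", "lanl:4040")])

cites = matmul(matmul(A1, A2), transpose(A1))  # author-citation paths
coauthor = matmul(A1, transpose(A1))           # shared-article paths

Z_full = hadamard(hadamard(cites, n(c(hadamard(coauthor, n(I))))), n(I))
Z_short = hadamard(hadamard(cites, n(c(coauthor))), n(I))

links = [(V[i], V[j]) for i in range(N) for j in range(N) if Z_full[i][j]]
print(links)  # [('lanl:marko', 'odu:nelson')]
```

Of the four unfiltered author-citation paths, three are removed: two connect coauthors and one is lanl:jbollen citing himself.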


Other Filters and Operations...

• Please refer to the article for more information on these filters and
operations.


Problems with the Path Algebra

• As a matrix algebra, it is computationally infeasible to
compute matrix operations over the entire Web of Data.

• However, it is possible to approximate these calculations using “random”
walkers.


Mapping Paths to Grammar-Based Random Walkers

• A grammar-based random walker is a walker that obeys a path
description.

• Able to compute “semantically rich” spreading activation and stationary
probability distributions in a multi-relational network.

• Able to approximate through the convergence properties of these
operations.

• Provides a convenient application to the Web of Data and linked graph
databases.

M.A. Rodriguez. Grammar-Based Random Walkers in Semantic Networks. Knowledge-Based Systems, 21(7), 727–739, 2008.
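
As a rough illustration, the coauthorship path (authored, then authored in reverse, excluding the source) can be sampled by walkers instead of computed by matrix products. The Python sketch below is a simplified reading of the idea and not the formalism of the cited article; the function name `coauthor_walks` and the `ex:` article and author are hypothetical additions to the acm:0505 example.

```python
import random
from collections import Counter

# author -> articles authored. The ex: entries are hypothetical.
AUTHORED = {
    "lanl:marko": ["acm:0505", "ex:0606"],
    "lanl:jbollen": ["acm:0505"],
    "ex:alice": ["ex:0606"],
}

# Inverse index: article -> authors.
AUTHORS_OF = {}
for author, articles in AUTHORED.items():
    for article in articles:
        AUTHORS_OF.setdefault(article, []).append(author)

def coauthor_walks(source, walks, rng):
    """Tally where walkers obeying 'authored, authored-inverse, not self' end."""
    tally = Counter()
    for _ in range(walks):
        article = rng.choice(AUTHORED[source])    # step 1: follow an authored edge
        author = rng.choice(AUTHORS_OF[article])  # step 2: follow authored backwards
        if author != source:                      # n(I): discard walks ending at the source
            tally[author] += 1
    return tally

tally = coauthor_walks("lanl:marko", walks=1000, rng=random.Random(42))
print(sorted(tally))
```

Each walker touches only the vertices on its own path, which is why this style of computation suits data spread across many hosts.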


A Grammar Walker
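[Figure: a grammar walker carrying the path description $A^1 \cdot A^{1\top} \circ n(I)$ moves through the Web of Data at times t=1, t=2, and t=3, crossing the hosts 127.0.0.4, 127.0.0.5, and 127.0.0.6.]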


Grammar Walking the Web of Data
[Figure: a grammar walker leaves 127.0.0.1, traverses the interlinked graph databases at 127.0.0.2 through 127.0.0.11 in seven numbered steps, and returns to 127.0.0.1.]


Conclusion

• Graph databases will increasingly support the Web of Data.

• The Web of Data is about open, global-scale data management.

• Distributed computing is required for global-scale data processing.

• Grammar walkers can be used for distributed network analysis on the
Web of Data.


Thank You For Your Time

My homepage: http://markorodriguez.com
Neno/Fhat: http://neno.lanl.gov
Collective Decision Making Systems: http://cdms.lanl.gov
Faith in the Algorithm: http://faithinthealgorithm.net
MESUR: http://www.mesur.org

