Processing Linked Data at
Warp Speed
Abraham Bernstein
CCBY NASA http://www.flickr.com/photos/nasa_jsc_photo/sets/72157629726792248/with/7197236116/0
CCBY NASA http://www.flickr.com/photos/nasa_jsc_photo/sets/72157629726792248/with/7197236116/0
"Earth's Location in the Universe (JPEG)" by Andrew Z. Colvin - Own work. Licensed under Creative Commons Attribution-Share Alike 3.0 via Wikimedia Commons
http://commons.wikimedia.org/wiki/File:Earth%27s_Location_in_the_Universe_(JPEG).jpg#mediaviewer/File:Earth%27s_Location_in_the_Universe_(JPEG).jpg
"IBM Electronic Data Processing Machine - GPN-2000-001881"
NASA, Public Domain @ Wikimedia Commons - 

http://commons.wikimedia.org/wiki/File:IBM_Electronic_Data_Processing_Machine_-_GPN-2000-001881.jpg
Processing Graphs
"IBM Electronic Data Processing Machine - GPN-2000-001881"
NASA, Public Domain @ Wikimedia Commons - 

http://commons.wikimedia.org/wiki/File:IBM_Electronic_Data_Processing_Machine_-_GPN-2000-001881.jpg
Semantic Web Reasoning
KB:Asserted Triples
Entailed KB: 	

Asserted & Infered Triples
DLReasoning
InductiveR.
AnalogicalR.
YourR.
Signal/Collect
P. Stutz, A. Bernstein, and W. Cohen. Signal/Collect: Graph algorithms for the (semantic) web.
International Semantic Web Conference–ISWC 2010. Springer Berlin Heidelberg, 2010. 764-780.
• Vertices as stateful
processing units	

• Vertices interact through
signals along edges	

• Which are collected by a
processing function that
updates the vertex state
Processing Graphs Naturally
• Define the graph structure	

• Vertices represent
RDFS classes	

• Edges from superclasses
to subclasses	

• Vertex state initialized
with the class that the
vertex represents
Signal/Collect: An Intuition
for RDFS subclass inference
id: animal
state: {animal}
id: bird
state: {bird}
id: owl
state: {owl}
id: penguin
state: {penguin}
id: animal
id: bird
id: owlid: penguin
Stutz et al, 2010
Signal/Collect: An Intuition
for RDFS subclass inference
Stutz et al, 2010
id: animal
state: {animal}
id: bird
state: {bird}
id: owl
state: {owl}
id: penguin
state: {penguin}
{bird, animal}
{animal}
{bird, animal}
id: animal
state: {animal}
id: bird
state: {bird}
id: owl
state: {owl}
id: penguin
state: {penguin}
id: penguin
state: {penguin, bird, animal}
id: bird
state: {bird, animal}
id: owl
state: {owl, bird, animal}
def collect =
state [
[
s2signals
s
def signal = state
Scoring/Asychronicity:
Single-Source Shortest Path
state: ∞
state: ∞
state: ∞
state: ∞
state: ∞
state: ∞
state: 0
∞
1
∞
∞
∞1
1
∞
state: 1 state: ∞
state: ∞
state: ∞state: 1
state: 1 2
1
2
∞
21
1
∞
state: 1 state: 2
state: 2
state: 2state: 1
state: 1 2
1
2
3
21
1
3
state: 1 state: 2
state: 2
state: 2state: 1
state: 1
def signal = state + weight
def collect = min (state, min (signals) )
state: 1
Scoring/Asynchronicity:
Single-Source Shortest Path
state: ∞
state: ∞
state: ∞
state: 0
1
1
1
state: 1 state: ∞
state: ∞
state: ∞
state: 1 2
2
2
state: 2
state: 2
state: 2
state: 2
state: 2
var oldState = infinity
def scoreSignal =
if (state ! = oldState)
1
else
0
3
3
PageRank in Code
class Document(id: Any) extends Vertex(id, 0.15) {
def collect = 0.15 + 0.85 * signals[Double].foldLeft(0.0)(_ + _)
}
Algorithm
class Citation(citer: Any, cited: Any) extends Edge(citer, cited) {
def signal = source.state.asInstanceOf[Double] * weight / source.sumOfOutWeights
}
ExecutionInitialization
object Algorithm {
def executeCitationRank(db: SparqlAccessor) {
val computeGraph = new AsynchronousComputeGraph()
val citations = new SparqlTuples(db, "select ?source ?target where {"
+ "?source <http://lsdis.cs.uga.edu/projects/semdis/opus#cites> ?target}")
citations foreach {
case (citer, cited) =>
computeGraph.addVertex[Document](citer)
computeGraph.addVertex[Document](cited)
computeGraph.addEdge[Citation](citer, cited)
}
computeGraph.execute()
}
}
Signal/Collect as a Platform
P. Stutz, A. Bernstein, and W. Cohen. Signal/Collect: Graph algorithms for the (semantic) web.
International Semantic Web Conference–ISWC 2010. Springer Berlin Heidelberg, 2010. 764-780.
T R I P L E R U S H F R A U D D E T E C T I O N
D C O P S
Images::BoInsogna,https://flic.kr/p/9famxT,CC-NC-NDhttps://creativecommons.org/licenses/by-nc-nd/2.0/
Antana,https://flic.kr/p/gGQPhA,CC-BY-SA,https://creativecommons.org/licenses/by-sa/2.0/,
JeffKubina,https://flic.kr/p/2CG5PU,CC-BY-SA
Tr i p l e S t o re
E LV I S D Y L A N
J O B S
? X I N S P I R E D ? Y
? Y I N S P I R E D ? Z
I N S P I R E D
DATA QUERY
IN
SP
IR
E
D
DylanElvis inspired
*Elvis inspiredDylan
* inspired DylanElvis
*
*Elvis
*** inspired Dylan
* *
** *
DylanElvis inspired
*Elvis inspiredDylan
* inspired DylanElvis
*
*Elvis
*** inspired Dylan
* *
** *
*Dylan inspired Jobs
* inspiredJobsDylan
*
JobsDylan inspired
*Dylan
*Jobs
* *
DylanElvis inspired
Dylan
* inspired
** inspired
*Dylan inspired Jobs
* inspired
JobsDylan inspired
Query Vertex
?X inspired ?Y!
?Y inspired ?Z
?X inspired ?Y!
?Y inspired ?Z
Elvis inspired Dylan!
Dylan inspired ?Z
Dylan inspired Jobs!
Jobs inspired ?Z
No vertex with ID!
[ Jobs inspired * ]
Query Vertex
{ ?X = Elvis, ?Y = Dylan, ?Z = Jobs }
✗
Elvis inspired Dylan!
Dylan inspired Jobs
✓
?X inspired ?Y!
?Y inspired ?Z
DylanElvis inspired
Dylan
* inspired
** inspired
*Dylan inspired Jobs
* inspired
JobsDylan inspired
Query Vertex
?X inspired ?Y!
?Y inspired ?Z
?X inspired ?Y!
?Y inspired ?Z
Elvis inspired Dylan!
Dylan inspired ?Z
Dylan inspired Jobs!
Jobs inspired ?Z
No vertex with ID!
[ Jobs inspired * ]
Query Vertex
{ ?X = Elvis, ?Y = Dylan, ?Z = Jobs }
✗
Elvis inspired Dylan!
Dylan inspired Jobs
✓
?X inspired ?Y!
?Y inspired ?Z
P e r f o r m a n c e R e s u l t s
Distributed (8 nodes), LUBM 10240 (~1.36 billion triples)
Single-node, LUBM 160 (~21 million triples)
Fastest L1 L2 L3 L4 L5 L6 L7 Geo.
of 10 runs mean
TripleRush 3,111.2 1,457.9 0.7 3.5 9.5 29.1 1,165.8 62.1
Trinity.RDF 12,648.0 6,018.0 8,735.0 5.0 4.0 9.0 31,214.0 450.0
TriAD 7,631.0 1,663.0 4,290.0 2.1 0.5 69.0 14,895.0 249.0
TriAD-SG 2,146.0 2,025.0 1,647.0 1.3 0.7 1.4 16,863.0 106.0
Fastest L1 L2 L3 L4 L5 L6 L7 Geo
of 10 runs mean
TripleRush 22.6 27.8 0.4 1.0 0.4 0.9 21.2 2.94
Trinity.RDF 281.0 132.0 110.0 5.0 4.0 9.0 630.0 46.0
TriAD 427.0 117.0 210.0 2.0 0.5 19.0 693.0 39.0
TriAD-SG 97.0 140.0 31.0 1.0 0.2 1.8 711.0 14.0
O n g o i n g Wo r k : 

G r a p h P a r t i t i o n i n g
• https://github.com/uzh/triplerush
T R I P L E R U S H F R A U D D E T E C T I O N
D C O P S
Signal/Collect as a Platform
Images::BoInsogna,https://flic.kr/p/9famxT,CC-NC-NDhttps://creativecommons.org/licenses/by-nc-nd/2.0/
Antana,https://flic.kr/p/gGQPhA,CC-BY-SA,https://creativecommons.org/licenses/by-sa/2.0/,
JeffKubina,https://flic.kr/p/2CG5PU,CC-BY-SA
D e c o m p o s i n g F r a u d
P a t t e r n s
• Participants can be
labeled as:
• Splitters
• Aggregators
• Forwarders
$10k$11k
2 x $5k $10k
10k CHF
2k CHF
4k CHF
6k CHF
8k CHF
E l i c i t a t i o n P ro c e s s
FilterConnectMatch
B i t c o i n Tr a n s a c t i o n s :
R u n t i m e v s . M a t c h i n g C o m p l e x i t y
0"
1000"
2000"
3000"
4000"
5000"
6000"
7000"
4" 6" 8" 10" 12"
Time%(sec)%
Matching%Complexity%
Time"in"GC"
Total"Processing"Time"
Dataset Size: 50M Transactions, Matching Duration: 1 week
Runtime: 27 min
Throughput: 1.8M / min
Runtime: 35 min
Throughput: 1.4M / min
Runtime: 98 min
Throughput: 0.5M / min
T R I P L E R U S H F R A U D D E T E C T I O N
D C O P S
Signal/Collect as a Platform
Images::BoInsogna,https://flic.kr/p/9famxT,CC-NC-NDhttps://creativecommons.org/licenses/by-nc-nd/2.0/
Antana,https://flic.kr/p/gGQPhA,CC-BY-SA,https://creativecommons.org/licenses/by-sa/2.0/,
JeffKubina,https://flic.kr/p/2CG5PU,CC-BY-SA
≠
≠≠
≠
y
Distributed Constraint
Optimization
z
q
x
x ≠ y
y ≠ z
z ≠ x
q ≠ z
x ∈ {0,1,2}
y ∈ {0,1}
z ∈ {1,2}
q ∈ {0,1,2}
Vertex Coloring in action
Optimized Version of DSA Running on a MacBook Pro with 8 workers

(slow, due to lots of IO for logging, bookkeeping, etc.)

• Scaled to: 10 Million vertices / variables
Verman and Bernstein, 2014
Industry Usage
!
“We	
  are	
  using	
  Signal/Collect	
  to	
  analyze	
  millions	
  of	
  
claims	
  every	
  day	
  to	
  iden9fy	
  opportuni9es	
  for	
  our	
  
clients	
  to	
  save	
  money	
  through	
  be=er	
  healthcare	
  or	
  
avoiding	
  fraud,	
  waste,	
  and	
  abuse.”	
  
US	
  Healthcare	
  Analy/cs	
  Company	
  


"IBM Electronic Data Processing Machine - GPN-2000-001881"
NASA, Public Domain @ Wikimedia Commons - 

http://commons.wikimedia.org/wiki/File:IBM_Electronic_Data_Processing_Machine_-_GPN-2000-001881.jpg
Processing Graph Streams
"IBM Electronic Data Processing Machine - GPN-2000-001881"
NASA, Public Domain @ Wikimedia Commons - 

http://commons.wikimedia.org/wiki/File:IBM_Electronic_Data_Processing_Machine_-_GPN-2000-001881.jpg
147.1b € 97.5b €
147.1b € 97.5b €
5’100
3mViewers	

300 Events/s
250 TV Channels	

25 frames/s
EPG for 7 days	

LOD enhanced
Traditional TripleStore
http://www.mpi.de/
http://www.ifi.uzh.ch/
http://www.uni-sb.de/courses/
http://www.universities.de/Saarbruecken http://www.ifi.uzh.ch/
http://www.ifi.uzh.ch/i_am_a_URI
http://www.ifi.uzh.ch/i_am_a_URI
http://www.ifi.uzh.ch/i_am_a_URI
http://www.ifi.uzh.ch/i_am_a_URI
http://www.ifi.uzh.ch/i_am_a_URI
http://www.ifi.uzh.ch/i_am_a_URI
http://www.lubm.org/teaches
http://www.lubm.org/
Traditional TripleStore
http://www.mpi.de/
http://www.ifi.uzh.ch/
http://www.uni-sb.de/courses/
http://www.universities.de/Saarbruecken http://www.ifi.uzh.ch/
http://www.ifi.uzh.ch/i_am_a_URI
http://www.ifi.uzh.ch/i_am_a_URI
http://www.ifi.uzh.ch/i_am_a_URI
http://www.ifi.uzh.ch/i_am_a_URI
http://www.ifi.uzh.ch/i_am_a_URI
http://www.ifi.uzh.ch/i_am_a_URI
http://www.lubm.org/teaches
http://www.lubm.org/
Semantic Flow
Processing
Semantic Flow
Processing
t
ViSTA-TV Viewership Data
0
7500
15000
22500
30000
0h 8h 16h 24h 8h 16h 24h 8h 16h 24h
Num of Valid Data Entries
day 1 day 2 day 3
Cache
ViSTA-­‐TV	
  project:	
  UserLog	
  and	
  EPG	
  streams
Semantic Flow
Processing is:
• Time-stamped tripes t = <s,p,o> [time] 	

• Semantic flow F = [t1, t2, ... tn]	

• Perform query matching on cached subset of F	

• Subject to Stress—incoming data-rate
overwhelms the system’s processing
capability
Context:	

Time Window
last 1 min
?
within 1 sec
Load Shedding
last 1 min
Eviction
last 1 min
?
Eviction is:
• Remove cached data	

• Note: Eviction may lower recall	

• Evict potential to produce future results	

• Types of eviction strategies (considered):	

• Random	

• Time-based (i.e, FIFO)	

• Least Recently Used (LRU)
Why not LRU?
LRU
C
B
Input: UserLog
,<watch>,ChannelUser 1 B
Join
EPG Cache
Two-way Join:	

UserLog EPG
Why not LRU?
LRU
C
B
User 1
Input: UserLog
,<watch>,Channel C
Join
EPG Cache
Why not LRU?
LRU
B
C
Input
Input: EPG
Channel , <Play>, Show 4DD
EPG Cache
D
User 3
Input: UserLog
,<watch>,Channel
?
Why not LRU?
LRU
C
Input: EPG
Channel , <Play>, Show 5E
B
Input: EPG
Channel , <Play>, Show 4D
InputInput: EPG
Channel , <Play>, Show 6F
Input: EPG
Channel , <Play>, Show 7G
Input: EPG
Channel , <Play>, Show 8H
D
EPG Cache
CLOCK is:
• Consider both recency (LRU) and past
results	

• Giving each data entry a score	

• The score could be incremented and
depreciated	

• Named by the buffer management
algorithm CLOCK
CLOCK
CLOCK Score
6
1
3C
B
Input: UserLog
,<watch>,ChannelUser 2 C
Join
EPG Cache
CLOCK
CLOCK Score
6
1
4C
B
Input: UserLog
,<watch>,ChannelUser 2 C
Join
EPG Cache
CLOCK
CLOCK Score
6
1
4C
B
Input: EPG
Channel , <Play>, Show 4D
Input
dep()
EPG Cache
CLOCK
CLOCK Score
5
1
4C
B
Input: EPG
Channel , <Play>, Show 4D
Input
dep()
EPG Cache
CLOCK
CLOCK Score
5
0
4C
B
Input: EPG
Channel , <Play>, Show 4D
Input dep()
D
EPG Cache
CLOCK
CLOCK Score
5
init = 1
4C
Input: EPG
Channel , <Play>, Show 4D
Input dep()
D
EPG Cache
CLOCK
CLOCK Score
5
1
4C
Input
dep()
D
Input: EPG
Channel , <Play>, Show 5E EPG Cache
CLOCK
CLOCK Score
5
1
3C
Input
dep()
D
Input: EPG
Channel , <Play>, Show 5E EPG Cache
CLOCK
CLOCK Score
4
1
3C
Input
dep()
D
Input: EPG
Channel , <Play>, Show 5E EPG Cache
CLOCK
CLOCK Score
4
0
3C
Input dep()
D
Input: EPG
Channel , <Play>, Show 5E EPG CacheE
CLOCK
CLOCK Score
4
1
3C
Input
dep()
E
EPG Cache
Input: EPG
Channel , <Play>, Show 6F
Input: EPG
Channel , <Play>, Show 7G
Input: EPG
Channel , <Play>, Show 8H
CLOCK Eviction is:
!
• The weight is adjusted by dep():	

• Linear: dep(w) = w - 1	

• Exponential: dep(w) = w * ρ (0< ρ < 1)
Recency History
Experimental Results
TV Viewership Data
Recall
0%
25%
50%
75%
100%
Cache Size
100% 50% 25% 15% 1%
Random FIFO LRU CLOCK
Two-way join query with 1,919,216 input triples
Experimental Results:	

Depreciation Function
Exponential factor ρ = ρ in {0.25, 0.5, 0.75, 0.95}
Limitations
• Static depreciation function: dep()	

• Experiment on other datasets	

• Real implementation to investigate other
performance metrics	

• Only local eviction strategies considered
CCBY NASA http://www.flickr.com/photos/nasa_jsc_photo/sets/72157629726792248/with/7197236116/0	

"IBM Electronic Data Processing Machine - GPN-2000-001881"
NASA, Public Domain @ Wikimedia Commons - 

http://commons.wikimedia.org/wiki/File:IBM_Electronic_Data_Processing_Machine_-_GPN-2000-001881.jpg
Semantic Web Reasoning
KB:Asserted Triples
Entailed KB: !
Asserted & Infered Triples
DLReasoning
InductiveR.
AnalogicalR.
YourR.
Signal/Collect
P. Stutz, A. Bernstein, and W. Cohen. Signal/Collect: Graph algorithms for the (semantic) web.
International Semantic Web Conference–ISWC 2010. Springer Berlin Heidelberg, 2010. 764-780.
CLOCK Eviction is:
!
• The weight is adjusted by dep():"
• Linear: dep(w) = w - 1"
• Exponential: dep(w) = w * ρ (0< ρ < 1)
Recency History
CCBY NASA http://www.flickr.com/photos/nasa_jsc_photo/sets/72157629726792248/with/7197236116/0	

Philip Stutz, Mihaela Verman, Shen Gao,
Daniel Strebel, Bibek Paudel, Lorenz Fischer,
Thomas Keller, Robin Hafen, Genc Mazlami

Thu bernstein key_warp_speed