Streaming and Online Algorithms for GraphX 
Graph Analytics Team 
Xia (Ivy) Zhu 
Intel Confidential — Do Not Forward
Why Stream Processing on Graphs?

Every day:
• New stores join
• New users join
• New users browse, click, and buy items
• Old users browse, click, and buy items
• New ads are added
• …

How to:
• Recommend products based on users' interests
• Recommend products based on users' shopping habits
• Recommend products based on users' purchasing capability
• Place the ads most likely to be clicked by users
• …

A huge number of relationships is created every day; using them wisely is important.
Alibaba Is Not Alone, Graphs Are Everywhere

• 100B neurons, 100T relationships
• 1.23B users, 160B friendships
• 1 trillion pages, 100s of trillions of links
• Millions of products and users
• 50M users, 1B hours/month watched
• Large biological cell networks
… And Graphs Keep Evolving 
Streaming Processing Pipeline

Data Stream → ETL → Graph Creation → ML, on top of a distributed messaging system

• We are using Kafka for distributed messaging
• GraphX as the graph processing engine
What Is GraphX
• Graph processing engine on Spark
• Supports Pregel-type vertex programming
• Unifies data-parallel and graph-parallel processing
Picture Source: GraphX team
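The Pregel model the bullet above refers to boils down to a loop of send, merge, and apply steps. A minimal plain-Python sketch of that model (not GraphX's actual API; the function names and signatures here are illustrative):

```python
def pregel(vertices, edges, initial_msg, vprog, send_msg, merge_msg, max_iters=20):
    """Generic Pregel-style loop: run vprog on the initial message, then
    iterate supersteps of send -> merge -> apply until no messages flow."""
    state = {v: vprog(v, attr, initial_msg) for v, attr in vertices.items()}
    for _ in range(max_iters):
        inbox = {}
        for src, dst in edges:
            for tgt, m in send_msg(src, state[src], dst, state[dst]):
                inbox[tgt] = merge_msg(inbox[tgt], m) if tgt in inbox else m
        if not inbox:          # no messages: every vertex has converged
            break
        for v, m in inbox.items():
            state[v] = vprog(v, state[v], m)
    return state

# Usage: propagate the maximum vertex id (a toy connected-components pass)
verts = {1: 1, 2: 2, 3: 3}
edges = [(1, 2), (2, 1), (2, 3), (3, 2)]
result = pregel(
    verts, edges,
    initial_msg=0,
    vprog=lambda v, attr, msg: max(attr, msg),
    send_msg=lambda s, sa, d, da: [(d, sa)] if sa > da else [],
    merge_msg=max,
)
print(result)  # every vertex ends up with the component maximum, 3
```

GraphX's real `Pregel` operator works on distributed RDDs and adds active-set tracking, but the vertex program / send / merge decomposition is the same.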
Why GraphX
• GraphLab performs well, but is standalone
• Giraph is open source and scales well, but its performance is poor
• GraphX supports both table and graph operations
• On the same platform, Spark Streaming provides a basic streaming framework

[Diagram: the Spark stack. RDDs, transformations, and actions underpin Spark Streaming (DStreams: streams of RDDs, real-time), Spark SQL (SchemaRDDs), MLlib (RDD-based matrices, machine learning), and GraphX (RDD-based graphs, graph processing / machine learning)]
Picture Source: Databricks
Naïve Streaming Does Not Scale
• Current GraphX is designed for static graphs
• Current Spark Streaming provides limited types of stateful DStreams
• Naïve approach:
  • Merge table data before entering the graph processing pipeline
  • Re-generate the whole graph and re-run ML at each window
  • Minimal changes to GraphX and Spark Streaming
  • Straightforward, but does not scale well
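The naïve approach above can be sketched in a few lines of plain Python (names are illustrative): each window merges its batch into the accumulated table data, rebuilds the whole graph, and reruns the analytic from scratch, so per-window work grows with total graph size rather than with the size of the delta.

```python
def naive_streaming(windows, analytic):
    """Naive windowed pipeline: re-generate the whole graph and
    re-run the analytic at every window."""
    seen = []
    results = []
    for batch in windows:
        seen.extend(batch)               # merge table data
        graph = set(seen)                # re-generate the whole graph (here: dedup edges)
        results.append(analytic(graph))  # re-run ML from scratch
    return results

# Toy usage: count distinct edges after each window
windows = [[(1, 2), (2, 3)], [(3, 1)], [(1, 2), (3, 4)]]
print(naive_streaming(windows, len))  # [2, 3, 4]
```

Even in this toy, each window touches all edges seen so far, which is exactly why latency keeps climbing as the stream progresses.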
[Chart: throughput vs. latency of naïve graph streaming; latency (s) plotted per sample point]
Our Solution
• Static algorithms -> online algorithms
• Merge information at the graph phase
• Efficient graph store for evolving graphs
• Better partitioning algorithms to reduce replicas
• Static index -> on-the-fly indexing method (ongoing)
Static vs. Online Algorithms
• Static algorithms
  • Re-compute the whole graph and re-run ML at each time instance
  • Increasingly infeasible in the Big Data era, given the size and growth rate of graphs
• Online algorithms
  • Incremental machine learning is triggered by changes in the graph
  • We designed delta-update-based online algorithms
  • PageRank as an example
  • The same idea applies to other machine learning algorithms
Static vs. Online PageRank

Static_PageRank
  // initial vertex value
  (0.0, 0.0)
  // first message
  initialMessage:
    msg = alpha / (1.0 - alpha)
  // broadcast to neighbors
  sendMessage:
    if (edge.srcAttr._2 > tol)
      Iterator((edge.dstId, edge.srcAttr._2 * edge.attr))
  // aggregate messages for each vertex
  messageCombiner(a, b):
    sum = a + b
  // update vertex
  vertexProgram(sum):
    updates = (1.0 - alpha) * sum
    (oldPR + updates, updates)

Online_PageRank
  // initial vertex value
  base graph: (0.0, 0.0)
  incremental graph:
    old vertices: (lastWindowPR, lastWindowDelta)
    new vertices: (alpha, alpha)
  // first message
  initialMessage:
    base graph: msg = alpha / (1.0 - alpha)
    incremental graph: none
  // broadcast to neighbors
  sendMessage:
    oldSrc -> newDst:
      Iterator((edge.dstId, (edge.srcAttr._1 - alpha) * edge.attr))
    newSrc -> newDst, or not converged:
      Iterator((edge.dstId, edge.srcAttr._2 * edge.attr))
  // aggregate messages for each vertex
  messageCombiner(a, b):
    sum = a + b
  // update vertex
  vertexProgram(sum):
    updates = (1.0 - alpha) * sum
    (oldPR + updates, updates)
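The delta-propagating loop above can be simulated in plain Python (a sketch of the message flow, not GraphX code). Each vertex carries a (rank, delta) pair, and only vertices whose delta still exceeds `tol` keep sending. The `init` parameter is the hook the online variant uses: old vertices can be seeded with last window's (rank, delta), while fresh vertices start at (alpha, alpha), which is exactly where a (0.0, 0.0) vertex lands after the first message.

```python
from collections import defaultdict

def delta_pagerank(edges, alpha=0.15, tol=1e-4, init=None):
    """Delta-propagating PageRank: vertex state is (rank, delta).
    `init` may seed vertices with a previous window's (rank, delta)."""
    deg = defaultdict(int)
    verts = set()
    for s, d in edges:
        verts.update((s, d))
        deg[s] += 1
    out = defaultdict(list)
    for s, d in edges:
        out[s].append((d, 1.0 / deg[s]))       # edge.attr = 1 / out-degree
    state = {v: (init or {}).get(v, (alpha, alpha)) for v in verts}
    active = {v for v in verts if state[v][1] > tol}
    while active:
        msgs = defaultdict(float)
        for v in active:                        # sendMessage: only while delta > tol
            for dst, w in out[v]:
                msgs[dst] += state[v][1] * w    # messageCombiner: sum
        active = set()
        for v, s in msgs.items():               # vertexProgram
            upd = (1.0 - alpha) * s
            state[v] = (state[v][0] + upd, upd)
            if upd > tol:
                active.add(v)
    return state

# Toy usage: on a 3-cycle, every rank converges toward 1.0
ranks = delta_pagerank([(1, 2), (2, 3), (3, 1)])
```

An incremental run then passes the previous window's `state` as `init`, so only vertices touched by new edges (or not yet converged) generate messages.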
GraphX Data Loading and Data Structure

[Diagram: edge lists (SrcId, DstId, Data) are re-hash-partitioned into an EdgeRDD of EdgePartitions (SrcId, DstId, Data, Index); RoutingTableMessages (HasSrcId, HasDstId) build the RoutingTablePartitions of the VertexRDD, whose VertexPartitions hold (Vid, Data, Mask); ShippableVertexPartitions form the replicated vertex view inside GraphImpl]
GraphX Data Loading and Data Structure (cont.)

[Same diagram, annotated: the EdgePartition's index is a static index, and the partitioning algorithm can help reduce the replication factor]
Partitioning Algorithm
• Torus-based partitioning
  • Divide the overall partitions into an A x B matrix
  • A vertex's master partition is decided by a hash function
  • The replica set lies in the same column as the master partition (the full column) and in the same row as the master partition (… + 1 elements starting from the master partition)
  • The intersection between the source replica set and the target replica set decides where an edge is placed
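A plain-Python sketch of the placement rule, assuming a wrap-around row segment of floor(B/2) + 1 partitions (the exact fraction did not survive extraction from the slide); that segment length is sufficient to guarantee that any two replica sets intersect, so every edge has somewhere to go.

```python
def master(v, A, B):
    """Map a vertex id to its master cell in the A x B torus via a hash."""
    h = hash(v)
    return (h % A, (h // A) % B)

def replica_set(v, A, B):
    """Full column of the master, plus a wrap-around row segment
    of B // 2 + 1 cells starting at the master (assumed length)."""
    r, c = master(v, A, B)
    column = {(i, c) for i in range(A)}
    row_seg = {(r, (c + j) % B) for j in range(B // 2 + 1)}
    return column | row_seg

def edge_partition(src, dst, A, B):
    """Place an edge in the intersection of the two replica sets."""
    common = replica_set(src, A, B) & replica_set(dst, A, B)
    return min(common)  # any deterministic choice from the intersection works

# Usage: with the assumed segment length, the intersection is never empty
A, B = 3, 4
assert all(replica_set(s, A, B) & replica_set(d, A, B)
           for s in range(20) for d in range(20))
```

The intersection is non-empty because, for any two columns, one lies within floor(B/2) forward steps of the other, so one vertex's full column always crosses the other vertex's row segment.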
Index Structure for Graph Streaming
• GraphX uses a CSR (Compressed Sparse Row)-based index
  • Originated from sparse matrix compression
  • Good for finding all out-edges of a source vertex
  • No support for finding all in-edges of a target vertex; that requires a full table scan
• At a minimum, we need to add CSC (Compressed Sparse Column) indexing for in-edges
Raw edge lists (Src, Dst; edge data omitted):
(3,2) (3,5) (3,9) (5,2) (5,3) (7,3) (8,5) (8,6) (10,6)

CSR (edges sorted by Src):
Unique Src: 3  5  7  8  10
Idx:        0  3  5  6  8
Dst:        2  5  9  2  3  3  5  6  6

CSC (edges sorted by Dst):
Unique Dst: 2  3  5  6  9
Idx:        0  2  4  6  8
Src:        3  5  5  7  3  8  8  10  3
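Both layouts above can be built with a single sort over the edge list; a plain-Python sketch (edge data omitted, as in the table):

```python
def build_index(edges, key=0):
    """Sort edges by the chosen endpoint (key=0 -> src for CSR,
    key=1 -> dst for CSC), then record the offset where each
    unique key's run of edges begins."""
    other = 1 - key
    order = sorted(edges, key=lambda e: (e[key], e[other]))
    uniq, offsets = [], []
    for i, e in enumerate(order):
        if not uniq or e[key] != uniq[-1]:
            uniq.append(e[key])
            offsets.append(i)
    return uniq, offsets, [e[other] for e in order]

# The edge list from the table above
edges = [(3, 2), (3, 5), (3, 9), (5, 2), (5, 3), (7, 3), (8, 5), (8, 6), (10, 6)]
csr = build_index(edges, key=0)  # ([3, 5, 7, 8, 10], [0, 3, 5, 6, 8], [2, 5, 9, 2, 3, 3, 5, 6, 6])
csc = build_index(edges, key=1)  # ([2, 3, 5, 6, 9], [0, 2, 4, 6, 8], [3, 5, 5, 7, 3, 8, 8, 10, 3])
```

The offsets reproduce the Idx columns of the table: the out-edges of source v occupy `Dst[offsets[i] : offsets[i+1]]`, and symmetrically for in-edges in the CSC layout.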
Index Structure for Graph Streaming (cont.)
• Both CSR and CSC must first sort the edge lists and then create the index
• An even better way is to build the index on the fly
• For graph streaming, we need both fast insert/write and fast search/read
• HashMap
  • Good for exact-match, point search
  • Fast on insert and search
  • Good for graphs of fixed/known size
  • Needs re-hashing when size surpasses capacity
• Trees: B-Tree, LSM-Tree (Log-Structured Merge Tree), COLA (Cache-Oblivious Lookahead Array)
  • Support both point search and range search
  • B-Tree: fast search, slow insert
  • LSM-Tree: fast insert, slow search
  • COLA achieves a good tradeoff: fast insert and good-enough search
=> COLA-based index for graph streaming
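A minimal sketch of the COLA idea (not the production index): level k holds either nothing or a sorted run of 2^k keys, an insert cascades merges upward like binary addition, and a search binary-searches each non-empty level.

```python
from bisect import bisect_left

class COLA:
    """Cache-oblivious lookahead array sketch: level k is either empty
    or a sorted list of 2**k keys; inserts cascade merges upward."""

    def __init__(self):
        self.levels = []

    def insert(self, key):
        carry = [key]
        k = 0
        while True:
            if k == len(self.levels):
                self.levels.append([])
            if not self.levels[k]:
                self.levels[k] = carry      # empty slot: drop the run here
                return
            # level full: merge it with the carry and continue upward
            carry = sorted(self.levels[k] + carry)
            self.levels[k] = []
            k += 1

    def contains(self, key):
        # binary-search every non-empty level (O(log^2 n) point search)
        for lvl in self.levels:
            if lvl:
                i = bisect_left(lvl, key)
                if i < len(lvl) and lvl[i] == key:
                    return True
        return False

# Usage
c = COLA()
for x in [5, 1, 9, 3, 7, 2, 8, 0, 6, 4]:
    c.insert(x)
```

Each key is merged O(log n) times over its lifetime, giving amortized O(log n) inserts without the per-insert rebalancing of a B-Tree, which is the tradeoff the slide refers to.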
Putting Things Together: Our Streaming Pipeline

[Diagram: at each window, the incoming delta graph is merged ("+") into the evolving graph, and the online ML (OML) algorithms run on the result; the cycle repeats window after window …]
Performance - Convergence Rate

[Chart: convergence rate. Normalized number of iterations (0.0 to 1.2) vs. graph size (number of edges), from Base to +200%, for the naïve and incremental approaches]

Xia Zhu – Intel at MLconf ATL
