Streaming and Online Algorithms for GraphX 
Graph Analytics Team 
Xia (Ivy) Zhu 
Intel Confidential — Do Not Forward
Why Stream Processing on Graphs?

Every day:
• New stores join
• New users join
• New users browse, click, and buy items
• Old users browse, click, and buy items
• New ads are added
• …

How to:
• Recommend products based on users' interests
• Recommend products based on users' shopping habits
• Recommend products based on users' purchasing capability
• Place the ads most likely to be clicked by users
• …

A huge number of relationships is created every day; using them wisely is important.
Alibaba Is Not Alone, Graphs Are Everywhere

• 100B neurons, 100T relationships
• 1.23B users, 160B friendships
• 1 trillion pages, 100s of trillions of links
• Millions of products and users
• 50M users, 1B hours/month watched
• Large biological cell networks
… And Graphs Keep Evolving 
Streaming Processing Pipeline

Data Stream → ETL → Graph Creation → ML, on top of a distributed messaging system

• We are using Kafka for distributed messaging
• GraphX as the graph processing engine
What Is GraphX
• Graph processing engine on Spark
• Supports Pregel-type vertex programming
• Unifies data-parallel and graph-parallel processing
Picture Source: GraphX team
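The Pregel model the bullet above refers to boils down to a loop of send, merge, and apply steps. A minimal plain-Python sketch of that model (not GraphX's actual API; the function names and signatures here are illustrative):

```python
def pregel(vertices, edges, initial_msg, vprog, send_msg, merge_msg, max_iters=20):
    """Generic Pregel-style loop: run vprog on the initial message, then
    iterate supersteps of send -> merge -> apply until no messages flow."""
    state = {v: vprog(v, attr, initial_msg) for v, attr in vertices.items()}
    for _ in range(max_iters):
        inbox = {}
        for src, dst in edges:
            for tgt, m in send_msg(src, state[src], dst, state[dst]):
                inbox[tgt] = merge_msg(inbox[tgt], m) if tgt in inbox else m
        if not inbox:          # no messages: every vertex has converged
            break
        for v, m in inbox.items():
            state[v] = vprog(v, state[v], m)
    return state

# Usage: propagate the maximum vertex id (a toy connected-components pass)
verts = {1: 1, 2: 2, 3: 3}
edges = [(1, 2), (2, 1), (2, 3), (3, 2)]
result = pregel(
    verts, edges,
    initial_msg=0,
    vprog=lambda v, attr, msg: max(attr, msg),
    send_msg=lambda s, sa, d, da: [(d, sa)] if sa > da else [],
    merge_msg=max,
)
print(result)  # every vertex ends up with the component maximum, 3
```

GraphX's real `Pregel` operator works on distributed RDDs and adds active-set tracking, but the vertex program / send / merge decomposition is the same.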
Why GraphX
• GraphLab performs well, but is standalone
• Giraph is open source and scales well, but its performance is poor
• GraphX supports both table and graph operations
• On the same platform, Spark Streaming provides a basic streaming framework

[Diagram: the Spark stack. RDDs, transformations, and actions underpin Spark Streaming (DStreams: streams of RDDs, real-time), Spark SQL (SchemaRDDs), MLlib (RDD-based matrices, machine learning), and GraphX (RDD-based graphs, graph processing / machine learning)]
Picture Source: Databricks
Naïve Streaming Does Not Scale
• Current GraphX is designed for static graphs
• Current Spark Streaming provides limited types of stateful DStreams
• Naïve approach:
  • Merge table data before entering the graph processing pipeline
  • Re-generate the whole graph and re-run ML at each window
  • Minimal changes to GraphX and Spark Streaming
  • Straightforward, but does not scale well
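The naïve approach above can be sketched in a few lines of plain Python (names are illustrative): each window merges its batch into the accumulated table data, rebuilds the whole graph, and reruns the analytic from scratch, so per-window work grows with total graph size rather than with the size of the delta.

```python
def naive_streaming(windows, analytic):
    """Naive windowed pipeline: re-generate the whole graph and
    re-run the analytic at every window."""
    seen = []
    results = []
    for batch in windows:
        seen.extend(batch)               # merge table data
        graph = set(seen)                # re-generate the whole graph (here: dedup edges)
        results.append(analytic(graph))  # re-run ML from scratch
    return results

# Toy usage: count distinct edges after each window
windows = [[(1, 2), (2, 3)], [(3, 1)], [(1, 2), (3, 4)]]
print(naive_streaming(windows, len))  # [2, 3, 4]
```

Even in this toy, each window touches all edges seen so far, which is exactly why latency keeps climbing as the stream progresses.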
[Chart: throughput vs. latency of naïve graph streaming; latency (s) plotted per sample point]
Our Solution
• Static algorithms -> online algorithms
• Merge information at the graph phase
• Efficient graph store for evolving graphs
• Better partitioning algorithms to reduce replicas
• Static index -> on-the-fly indexing method (ongoing)
Static vs. Online Algorithms
• Static algorithms
  • Re-compute the whole graph and re-run ML at each time instance
  • Increasingly infeasible in the Big Data era, given the size and growth rate of graphs
• Online algorithms
  • Incremental machine learning is triggered by changes in the graph
  • We designed delta-update-based online algorithms
  • PageRank as an example
  • The same idea applies to other machine learning algorithms
Static vs. Online PageRank

Static_PageRank
  // initial vertex value
  (0.0, 0.0)
  // first message
  initialMessage:
    msg = alpha / (1.0 - alpha)
  // broadcast to neighbors
  sendMessage:
    if (edge.srcAttr._2 > tol)
      Iterator((edge.dstId, edge.srcAttr._2 * edge.attr))
  // aggregate messages for each vertex
  messageCombiner(a, b):
    sum = a + b
  // update vertex
  vertexProgram(sum):
    updates = (1.0 - alpha) * sum
    (oldPR + updates, updates)

Online_PageRank
  // initial vertex value
  base graph: (0.0, 0.0)
  incremental graph:
    old vertices: (lastWindowPR, lastWindowDelta)
    new vertices: (alpha, alpha)
  // first message
  initialMessage:
    base graph: msg = alpha / (1.0 - alpha)
    incremental graph: none
  // broadcast to neighbors
  sendMessage:
    oldSrc -> newDst:
      Iterator((edge.dstId, (edge.srcAttr._1 - alpha) * edge.attr))
    newSrc -> newDst, or not converged:
      Iterator((edge.dstId, edge.srcAttr._2 * edge.attr))
  // aggregate messages for each vertex
  messageCombiner(a, b):
    sum = a + b
  // update vertex
  vertexProgram(sum):
    updates = (1.0 - alpha) * sum
    (oldPR + updates, updates)
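The delta-propagating loop above can be simulated in plain Python (a sketch of the message flow, not GraphX code). Each vertex carries a (rank, delta) pair, and only vertices whose delta still exceeds `tol` keep sending. The `init` parameter is the hook the online variant uses: old vertices can be seeded with last window's (rank, delta), while fresh vertices start at (alpha, alpha), which is exactly where a (0.0, 0.0) vertex lands after the first message.

```python
from collections import defaultdict

def delta_pagerank(edges, alpha=0.15, tol=1e-4, init=None):
    """Delta-propagating PageRank: vertex state is (rank, delta).
    `init` may seed vertices with a previous window's (rank, delta)."""
    deg = defaultdict(int)
    verts = set()
    for s, d in edges:
        verts.update((s, d))
        deg[s] += 1
    out = defaultdict(list)
    for s, d in edges:
        out[s].append((d, 1.0 / deg[s]))       # edge.attr = 1 / out-degree
    state = {v: (init or {}).get(v, (alpha, alpha)) for v in verts}
    active = {v for v in verts if state[v][1] > tol}
    while active:
        msgs = defaultdict(float)
        for v in active:                        # sendMessage: only while delta > tol
            for dst, w in out[v]:
                msgs[dst] += state[v][1] * w    # messageCombiner: sum
        active = set()
        for v, s in msgs.items():               # vertexProgram
            upd = (1.0 - alpha) * s
            state[v] = (state[v][0] + upd, upd)
            if upd > tol:
                active.add(v)
    return state

# Toy usage: on a 3-cycle, every rank converges toward 1.0
ranks = delta_pagerank([(1, 2), (2, 3), (3, 1)])
```

An incremental run then passes the previous window's `state` as `init`, so only vertices touched by new edges (or not yet converged) generate messages.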
GraphX Data Loading and Data Structure

[Diagram: edge lists (SrcId, DstId, Data) are re-hash-partitioned into an EdgeRDD of EdgePartitions (SrcId, DstId, Data, Index); RoutingTableMessages (HasSrcId, HasDstId) build the RoutingTablePartitions of the VertexRDD, whose VertexPartitions hold (Vid, Data, Mask); ShippableVertexPartitions form the replicated vertex view inside GraphImpl]
GraphX Data Loading and Data Structure (cont.)

[Same diagram, annotated: the EdgePartition's index is a static index, and the partitioning algorithm can help reduce the replication factor]
Partitioning Algorithm
• Torus-based partitioning
  • Divide the overall partitions into an A x B matrix
  • A vertex's master partition is decided by a hash function
  • The replica set lies in the same column as the master partition (the full column) and in the same row as the master partition (… + 1 elements starting from the master partition)
  • The intersection between the source replica set and the target replica set decides where an edge is placed
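A plain-Python sketch of the placement rule, assuming a wrap-around row segment of floor(B/2) + 1 partitions (the exact fraction did not survive extraction from the slide); that segment length is sufficient to guarantee that any two replica sets intersect, so every edge has somewhere to go.

```python
def master(v, A, B):
    """Map a vertex id to its master cell in the A x B torus via a hash."""
    h = hash(v)
    return (h % A, (h // A) % B)

def replica_set(v, A, B):
    """Full column of the master, plus a wrap-around row segment
    of B // 2 + 1 cells starting at the master (assumed length)."""
    r, c = master(v, A, B)
    column = {(i, c) for i in range(A)}
    row_seg = {(r, (c + j) % B) for j in range(B // 2 + 1)}
    return column | row_seg

def edge_partition(src, dst, A, B):
    """Place an edge in the intersection of the two replica sets."""
    common = replica_set(src, A, B) & replica_set(dst, A, B)
    return min(common)  # any deterministic choice from the intersection works

# Usage: with the assumed segment length, the intersection is never empty
A, B = 3, 4
assert all(replica_set(s, A, B) & replica_set(d, A, B)
           for s in range(20) for d in range(20))
```

The intersection is non-empty because, for any two columns, one lies within floor(B/2) forward steps of the other, so one vertex's full column always crosses the other vertex's row segment.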
Index Structure for Graph Streaming
• GraphX uses a CSR (Compressed Sparse Row)-based index
  • Originated from sparse matrix compression
  • Good for finding all out-edges of a source vertex
  • No support for finding all in-edges of a target vertex; that requires a full table scan
• At a minimum, we need to add CSC (Compressed Sparse Column) indexing for in-edges
Raw edge lists (Src, Dst; edge data omitted):
(3,2) (3,5) (3,9) (5,2) (5,3) (7,3) (8,5) (8,6) (10,6)

CSR (edges sorted by Src):
Unique Src: 3  5  7  8  10
Idx:        0  3  5  6  8
Dst:        2  5  9  2  3  3  5  6  6

CSC (edges sorted by Dst):
Unique Dst: 2  3  5  6  9
Idx:        0  2  4  6  8
Src:        3  5  5  7  3  8  8  10  3
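Both layouts above can be built with a single sort over the edge list; a plain-Python sketch (edge data omitted, as in the table):

```python
def build_index(edges, key=0):
    """Sort edges by the chosen endpoint (key=0 -> src for CSR,
    key=1 -> dst for CSC), then record the offset where each
    unique key's run of edges begins."""
    other = 1 - key
    order = sorted(edges, key=lambda e: (e[key], e[other]))
    uniq, offsets = [], []
    for i, e in enumerate(order):
        if not uniq or e[key] != uniq[-1]:
            uniq.append(e[key])
            offsets.append(i)
    return uniq, offsets, [e[other] for e in order]

# The edge list from the table above
edges = [(3, 2), (3, 5), (3, 9), (5, 2), (5, 3), (7, 3), (8, 5), (8, 6), (10, 6)]
csr = build_index(edges, key=0)  # ([3, 5, 7, 8, 10], [0, 3, 5, 6, 8], [2, 5, 9, 2, 3, 3, 5, 6, 6])
csc = build_index(edges, key=1)  # ([2, 3, 5, 6, 9], [0, 2, 4, 6, 8], [3, 5, 5, 7, 3, 8, 8, 10, 3])
```

The offsets reproduce the Idx columns of the table: the out-edges of source v occupy `Dst[offsets[i] : offsets[i+1]]`, and symmetrically for in-edges in the CSC layout.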
Index Structure for Graph Streaming (cont.)
• Both CSR and CSC must first sort the edge lists and then create the index
• An even better way is to build the index on the fly
• For graph streaming, we need both fast insert/write and fast search/read
• HashMap
  • Good for exact-match, point search
  • Fast on insert and search
  • Good for graphs of fixed/known size
  • Needs re-hashing when size surpasses capacity
• Trees: B-Tree, LSM-Tree (Log-Structured Merge Tree), COLA (Cache-Oblivious Lookahead Array)
  • Support both point search and range search
  • B-Tree: fast search, slow insert
  • LSM-Tree: fast insert, slow search
  • COLA achieves a good tradeoff: fast insert and good-enough search
=> COLA-based index for graph streaming
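A minimal sketch of the COLA idea (not the production index): level k holds either nothing or a sorted run of 2^k keys, an insert cascades merges upward like binary addition, and a search binary-searches each non-empty level.

```python
from bisect import bisect_left

class COLA:
    """Cache-oblivious lookahead array sketch: level k is either empty
    or a sorted list of 2**k keys; inserts cascade merges upward."""

    def __init__(self):
        self.levels = []

    def insert(self, key):
        carry = [key]
        k = 0
        while True:
            if k == len(self.levels):
                self.levels.append([])
            if not self.levels[k]:
                self.levels[k] = carry      # empty slot: drop the run here
                return
            # level full: merge it with the carry and continue upward
            carry = sorted(self.levels[k] + carry)
            self.levels[k] = []
            k += 1

    def contains(self, key):
        # binary-search every non-empty level (O(log^2 n) point search)
        for lvl in self.levels:
            if lvl:
                i = bisect_left(lvl, key)
                if i < len(lvl) and lvl[i] == key:
                    return True
        return False

# Usage
c = COLA()
for x in [5, 1, 9, 3, 7, 2, 8, 0, 6, 4]:
    c.insert(x)
```

Each key is merged O(log n) times over its lifetime, giving amortized O(log n) inserts without the per-insert rebalancing of a B-Tree, which is the tradeoff the slide refers to.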
Putting Things Together: Our Streaming Pipeline

[Diagram: at each window, the incoming delta graph is merged ("+") into the evolving graph, and the online ML (OML) algorithms run on the result; the cycle repeats window after window …]
Performance - Convergence Rate

[Chart: convergence rate. Normalized number of iterations (0.0 to 1.2) vs. graph size (number of edges), from Base to +200%, for the naïve and incremental approaches]

Xia Zhu – Intel at MLconf ATL
