Distributed graph processing

Graphs - going distributed
vertices, edges, degrees, properties… in the Big Data era

Plan
1. Vocabulary
2. Storage
3. Ad-hoc analysis
4. Data processing
5. Data visualization
6. Algorithms
7. Challenges

Vocabulary
Undirected graph
Directed graph
graph types - direction

Vocabulary
Acyclic
Cyclic
graph types - cycles

Vocabulary
Labeled graph
graph types - labels

Vocabulary
Simple graph
Multigraph
Pseudograph
graph types - parallel edges and loops

Vocabulary
graph types - bipartite

Vocabulary
● G = (V, E)
● e = {u, v} or (u, v) or uv, V = {v₁, v₂, …, vₙ}
● |E(G)|, |V(G)| or |E|, |V|
● vertex-cut G − v: G with the vertex v and its incident edges removed
● graph partition satisfies: V = V₁ ∪ V₂, V₁ ∩ V₂ = ∅, V₁ ≠ ∅, V₂ ≠ ∅
● vertex degrees: d⁺(v) and d⁻(v)
graph theory
http://math.tut.fi/~ruohonen/GT_English.pdf
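The notation above can be tried out on a tiny example. The sketch below is only an illustration: the graph, the helper names (`degrees`, `is_partition`, `remove_vertex`) and the convention that d+ is the out-degree and d- the in-degree are assumptions, not something taken from the slides.

```python
from collections import Counter

# Hypothetical directed graph G = (V, E); every edge is a (u, v) pair.
V = {"a", "b", "c", "d"}
E = {("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")}


def degrees(edges):
    """Return (out-degree, in-degree) counters, assumed here to be d+(v) and d-(v)."""
    out_deg = Counter(u for u, _ in edges)
    in_deg = Counter(v for _, v in edges)
    return out_deg, in_deg


def is_partition(vertices, v1, v2):
    """V1 and V2 partition V when both are non-empty, disjoint and together cover V."""
    return bool(v1) and bool(v2) and not (v1 & v2) and (v1 | v2) == vertices


def remove_vertex(vertices, edges, v):
    """G - v: drop the vertex v and every edge incident to it."""
    return vertices - {v}, {(a, b) for a, b in edges if v not in (a, b)}


out_deg, in_deg = degrees(E)
```

Counting |V| and |E| is simply `len(V)` and `len(E)` with this representation.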

Storage
Adjacency matrix
representation
(+) easy to check whether two vertices are connected
(+) easy to add or remove edges
(-) not space efficient
(-) slow traversals - a whole row must be scanned to find the neighbours of a vertex

Storage
Incidence matrix
representation
similar characteristics to adjacency matrix

Storage
Adjacency list
representation
(+) better space efficiency than matrices
(+) better for traversals - list of outgoing
edges retrieved immediately
(-) slower to check whether 2 vertices are connected
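The trade-offs between the matrix and the list can be seen side by side in a small sketch. The 4-vertex graph and the function names are hypothetical and only serve the comparison.

```python
# Hypothetical 4-vertex directed graph, vertices numbered 0..3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4

# Adjacency matrix: always n x n cells, whatever the number of edges.
matrix = [[0] * n for _ in range(n)]
for u, v in edges:
    matrix[u][v] = 1

# Adjacency list: one entry per edge.
adj = {u: [] for u in range(n)}
for u, v in edges:
    adj[u].append(v)


def connected_matrix(u, v):
    """Connectivity check with the matrix: a single cell lookup."""
    return matrix[u][v] == 1


def connected_list(u, v):
    """Connectivity check with the list: scans the neighbours of u."""
    return v in adj[u]


def neighbours_matrix(u):
    """Traversal step with the matrix: scans a whole row of n cells."""
    return [v for v in range(n) if matrix[u][v]]


def neighbours_list(u):
    """Traversal step with the list: the outgoing edges are already there."""
    return adj[u]
```

With n vertices the matrix stores n² cells, while the list stores one entry per edge, which is where the space difference comes from on sparse graphs.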

Storage
Edge table
representation
https://docs.microsoft.com/en-us/sql/relational-databases/graphs/media/person-friends-tables.png?view=sql-server-2017

Storage
Index-free adjacency list
representation
(+) natural graph data representation
(+) traversal time depends on the number N of vertices visited, not on the total graph size
(-) inefficient with vertex-to-vertex connectivity checks
(-) writing is hard for supernodes
(-) horizontal scalability is not easy
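A rough way to picture index-free adjacency is a set of records that hold direct references to their neighbours, so a hop never goes through a global index. The `Node` class and the `link` and `traverse` helpers below are hypothetical and only illustrate the idea; they do not reflect how any particular database lays out its records.

```python
class Node:
    """A vertex that stores direct references to its neighbours."""

    def __init__(self, name):
        self.name = name
        self.out = []  # references to neighbour Node objects, no index lookup


def link(src, dst):
    """Create an outgoing edge by appending a reference."""
    src.out.append(dst)


def traverse(start, depth):
    """Return the names reachable within `depth` hops by following references only."""
    seen = {start.name}
    frontier = [start]
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            for neighbour in node.out:
                if neighbour.name not in seen:
                    seen.add(neighbour.name)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return seen


alice, bob, carol, dave = (Node(n) for n in ("alice", "bob", "carol", "dave"))
link(alice, bob)
link(bob, carol)
link(carol, dave)
```

The work done by `traverse` grows with the part of the graph it touches, not with the total number of stored vertices. Checking whether two arbitrary vertices are connected still requires walking the references.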

Storage
Graph as a first-class storage citizen: Neo4j’s
concept used to distinguish “real” graph databases
from “fake” ones, i.e. those built on top of
key-value stores or RDBMS.
Master-slave architecture, horizontal scalability
for reads, HA as a commercial feature
native graph storage - Neo4j
“Graph Databases” by Ian Robinson, Jim Webber & Emil Eifrem

Storage
NoSQL data stores used as a persistence
layer.
NoSQL - on top of other types
https://docs.janusgraph.org/latest/images/architecture-layer-diagram.svg

But NoSQL graph engines exist too: DGraph
● sharded and distributed
● consistent
● provides High Availability
● fault tolerant
● Docker-friendly
● adjacency list of (entity, attribute, value) triples
○ entity = UID, attribute = predicate, value = object (literal or UID of another entity)
○ triples are the unit of sharding
■ sharding key - attribute
■ each attribute assigned to a group of nodes
● schema definition or inference
● backed by Badger - Go K/V store
Storage
graph NoSQL
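Sharding (entity, attribute, value) triples by attribute can be sketched in a few lines. This is a toy illustration under loud assumptions: the triples, `NUM_GROUPS`, and the CRC32-based `group_for` function are all made up, and the real engine may assign attributes to groups in a completely different way.

```python
import zlib
from collections import defaultdict

# Hypothetical (entity, attribute, value) triples.
triples = [
    ("0x1", "name", "Alice"),
    ("0x2", "name", "Bob"),
    ("0x1", "friend", "0x2"),
    ("0x2", "works_for", "0x3"),
]

NUM_GROUPS = 2  # assumed number of node groups


def group_for(attribute):
    """Assumed placement rule: a deterministic hash of the attribute name."""
    return zlib.crc32(attribute.encode("utf-8")) % NUM_GROUPS


def shard(all_triples):
    """Route every triple to the group that owns its attribute."""
    groups = defaultdict(list)
    for entity, attribute, value in all_triples:
        groups[group_for(attribute)].append((entity, attribute, value))
    return dict(groups)


shards = shard(triples)
```

Because the attribute is the sharding key, all values of one attribute end up in the same group, so a query that touches a single attribute can be answered by one group.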

Not efficient but technically possible:
● multiple JOINs - dynamic or static generation
○ inefficient
○ fine for 1 level but imagine more:
SELECT p3.name FROM Person p1
LEFT JOIN Person p2 ON p1.boss = p2.id
LEFT JOIN Person p3 ON p2.boss = p3.id
WHERE p1.id = 1
● hierarchical queries
○ WITH RECURSIVE
● native graph support: SQL Server Graph Databases
○ vertex and edge tables
○ SQL with MATCH clause:
-- use MATCH in SELECT to find friends of Alice
SELECT Person2.name AS FriendName
FROM Person Person1, friend, Person Person2
WHERE MATCH(Person1-(friend)->Person2)
AND Person1.name = 'Alice';
● a lot of drawbacks: costly joins, querying difficulty
Storage
RDBMS
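The hierarchical query option can be illustrated with a recursive common table expression. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `Person` rows, the `CHAIN_QUERY` text and the `management_chain` helper are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (id INTEGER PRIMARY KEY, name TEXT, boss INTEGER)")
conn.executemany(
    "INSERT INTO Person (id, name, boss) VALUES (?, ?, ?)",
    [(1, "Alice", 2), (2, "Bob", 3), (3, "Carol", None)],  # hypothetical data
)

# Walk the boss relationship to any depth instead of adding one JOIN per level.
CHAIN_QUERY = """
WITH RECURSIVE chain(id, name, boss, depth) AS (
    SELECT id, name, boss, 0 FROM Person WHERE id = ?
    UNION ALL
    SELECT p.id, p.name, p.boss, chain.depth + 1
    FROM Person p JOIN chain ON p.id = chain.boss
)
SELECT name, depth FROM chain ORDER BY depth
"""


def management_chain(person_id):
    """Return (name, depth) rows from the person up to the top of the hierarchy."""
    return conn.execute(CHAIN_QUERY, (person_id,)).fetchall()
```

The query length no longer depends on the number of levels, but every level is still resolved with a join, so the cost concern from the slide remains.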

Storage
Azure Cosmos DB
Microsoft is ahead of the competition:
● multi-model database: key-value, document,
column-family and graph
● geographical distribution
● elastic scalability
● based on Open Source: API built on TinkerPop Gremlin
Amazon Neptune
AWS is trying to catch up
● queryable with TinkerPop
● structure stored and backed up in S3
● master-slave architecture
cloud

Storage
FlockDB
● aka: horizontally scaling relational database
● graph reduced to “edges between nodes” table
● not maintained anymore
big graph players - Twitter
https://blog.twitter.com/content/dam/blog-twitter/archive/it_s_in_the_coud96.thumb.1280.1280.png

Storage
TAO
● aka: “The Associations and Objects”
● large collection of geographically distributed server
clusters
● persistent storage with memory cache
big graph players - Facebook
https://scontent-cdg2-1.xx.fbcdn.net/v/t1.0-9/s720x720/1016494_10151647749827200_1788611608_n.png?_nc_cat=106&_nc_ht=scontent-cdg2-1.xx&oh=6c17fc682408be59f3a4718b34200324&oe=5C4FFAA6

Ad-hoc analysis
Cypher
● natural representation of graphs in query
language
○ expresses graph traversal
○ applies to simple and more complex
cases
● limited to Neo4j
Any edges between node1 and node2:
MATCH (node1)-->(node2)
RETURN node1, node2
Users with more than 10 friends:
MATCH (user)-[:FRIEND]-(friend)
WHERE user.name = $name
WITH user, count(friend) AS friends
WHERE friends > 10
RETURN user
Edge creation with a property:
CREATE (worker)-[:WORKS_FOR {since: 2018}]->(company)
query languages

Ad-hoc analysis
Gremlin
● less SQL-friendly
● DSL
● OLTP and OLAP
● vendor-agnostic
Every vertex whose code property equals ‘AUS’:
g.V().has('code','AUS').valueMap(true).unfold()
Number of vertices with the ‘airport’ label, grouped by country:
g.V().hasLabel('airport').groupCount().by('country')
Partial output of a traversal result (first 10 paths starting from ‘SAF’):
g.V().has('code','SAF').repeat(out()).emit().path().by('code').limit(10)
query languages

Ad-hoc analysis
GraphQL+-
● DGraph proposal
● inspired by GraphQL
● query = nested blocks starting with a query root
● more a JSON document than a query
All nodes whose name contains both terms “jones” and “indiana”:
{
  me(func: allofterms(name@en, "jones indiana")) {
    name@en
    genre {
      name@en
    }
  }
}
query languages

Data processing
going distributed

Data processing
Think Like a Vertex
Main computation component ⇒ vertex
● a vertex knows only about its nearest
neighbours
● communication with other vertices happens in
computation stages called supersteps
Google Pregel, open source counterpart: Apache Giraph
● Facebook-proven
Other implementations
vertex-centric approach
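The superstep loop of the vertex-centric model can be sketched on a single machine. The example below propagates the maximum value through a graph; `pregel_max` and the sample graph are hypothetical, and the code imitates the programming model rather than the API of any real framework.

```python
def pregel_max(values, neighbours):
    """Propagate the maximum value with Pregel-style supersteps.

    values: vertex -> number, neighbours: vertex -> list of outgoing neighbours.
    A vertex only sees its own value and the messages it received.
    """
    values = dict(values)

    # Superstep 0: every vertex sends its value to its neighbours.
    inbox = {v: [] for v in values}
    for v, value in values.items():
        for n in neighbours[v]:
            inbox[n].append(value)

    # Following supersteps: run until no message is in flight.
    while any(inbox.values()):
        outbox = {v: [] for v in values}
        for v, messages in inbox.items():
            if messages and max(messages) > values[v]:
                values[v] = max(messages)
                for n in neighbours[v]:
                    outbox[n].append(values[v])
            # Otherwise the vertex stays silent (it "votes to halt").
        inbox = outbox  # barrier between two supersteps
    return values


# Hypothetical path graph a - b - c - d, stored with both directions.
sample_neighbours = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
sample_values = {"a": 3, "b": 6, "c": 2, "d": 1}
result = pregel_max(sample_values, sample_neighbours)
```

In a distributed setting the `inbox` and `outbox` dictionaries would be network messages between workers, which is the overhead the next slide refers to.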

Data processing
Think Like a Graph
● TLAV works well, but network communication can
become an overhead
● graph-centric:
○ data stored in subgraphs
○ traversals applied locally first
○ results propagated to boundary
vertices
● implementation example: GoFFish v3
○ unfortunately: lack of industry
adoption
○ Spark on Neo4j tends to imitate this
approach
graph-centric approach

Data processing
Graph as a stream
● born from hardware limitations
● stream of vertices, triples or edges - edges
work best because they avoid the
desynchronization problem
● Scatter-Gather logic
● implementation examples from EPFL:
X-Stream (standalone), Chaos (distributed
X-Stream)
● still seems to be at the research stage
streaming approach
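The Scatter-Gather logic over a stream of edges can be sketched with connected components computed by label propagation. This is only a single-machine illustration: `stream_components`, the sample edges and the choice of treating edges as undirected are assumptions, and the code does not reproduce the internals of X-Stream or Chaos.

```python
def stream_components(num_vertices, edge_stream):
    """Label connected components by repeatedly streaming over the edges.

    edge_stream: callable returning a fresh iterator over (src, dst) pairs.
    Only the vertex labels are kept in memory; edges are read sequentially.
    """
    label = list(range(num_vertices))
    changed = True
    while changed:
        # Scatter: one sequential pass over the edges, emitting updates.
        updates = []
        for src, dst in edge_stream():
            if label[src] < label[dst]:
                updates.append((dst, label[src]))
            elif label[dst] < label[src]:
                updates.append((src, label[dst]))

        # Gather: apply the updates to the vertex state.
        changed = False
        for vertex, candidate in updates:
            if candidate < label[vertex]:
                label[vertex] = candidate
                changed = True
    return label


# Hypothetical edge list: two components, {0, 1, 2} and {3, 4}.
sample_edges = [(0, 1), (1, 2), (3, 4)]
labels = stream_components(5, lambda: iter(sample_edges))
```

The edges are never indexed or accessed randomly, which is what makes the approach friendly to sequential storage.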

Data visualization
D3.js
JavaScript libraries

Data visualization
Sigma.js

Data visualization
Alchemy.js

Data visualization
Neo4j console
databases UIs

Data visualization
DGraph Ratel
databases UIs
https://user-images.githubusercontent.com/4924405/34387959-6d224cd2-eb31-11e7-9169-9d23cc1e6405.png

Data visualization
Cytoscape
3rd party tools
http://manual.cytoscape.org/en/stable/_images/sampleOriginal.png

Data visualization
Linkurious
3rd party tools
https://linkurio.us/wp-content/uploads/2015/09/czech-ehealth-ecosystem.png

Data visualization
IBM Compose
3rd party tools
https://www.ibm.com/blogs/bluemix/wp-content/uploads/2017/07/Fullscreenbrowser-800x653.png

Algorithms
PageRank
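● problem of authority of vertices in the graph
● math formula:
PR(v) = 0.15/N + 0.85 * Sum PR(vin)/L(vin)
where N is the number of vertices, vin ranges over the vertices linking to v and L(vin) is the number of outgoing links of vin
● used for: recommendation systems
(TwitterRank, ItemRank), anomaly detection,
authority detection (BooRank), prediction and
much more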
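The PageRank formula with a 0.85 damping factor can be turned into a short iterative sketch. The `pagerank` function and the sample graph are hypothetical, and the code assumes that every vertex has at least one outgoing link, so dangling vertices are not handled.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Iterate PR(v) = (1 - damping)/N + damping * sum(PR(vin) / L(vin)).

    out_links: vertex -> list of vertices it links to.
    Assumption: every vertex has at least one outgoing link.
    """
    n = len(out_links)
    pr = {v: 1.0 / n for v in out_links}  # starting PR: 1/N for every vertex
    for _ in range(iterations):
        incoming = {v: 0.0 for v in out_links}
        for v, targets in out_links.items():
            share = pr[v] / len(targets)  # PR(vin) / L(vin)
            for target in targets:
                incoming[target] += share
        pr = {v: (1 - damping) / n + damping * incoming[v] for v in out_links}
    return pr


# Hypothetical graph: a links to b and c, b links to c, c links back to a.
sample_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(sample_links)
```

Each pass of the loop corresponds to one superstep in the vertex-centric model: vertices send PR/L along their outgoing edges, then recompute their own value from what they received.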

Algorithms
PageRank - superstep 1
starting PR: 1/5 = 0.2
[figure: example graph, values shown: 0.2, 0.06, 0.06, 0.06]

Algorithms
[figures: PageRank - following supersteps, values shown: 0.17, 0.2, 0.05, then 0.04]

Challenges
Partitioning is hard - limited
choice in Open Source frameworks.
Real-time processing is not as easy
as with other data stores.
Lack of a standardised query
language.
graphs at scale
