This document discusses challenges and approaches for working with graphs at scale in a distributed environment. It begins with an overview of graph terminology and representations. It then covers approaches for distributed storage of graphs using techniques like adjacency matrices, lists, and NoSQL databases. Methods for distributed graph processing are discussed, including vertex-centric, graph-centric, and streaming approaches. Graph query languages like Cypher, Gremlin and GraphQL are presented. The document also reviews algorithms like PageRank and tools for visualizing large graphs. It concludes by noting challenges like partitioning graphs across servers and providing real-time capabilities at large scale.
12. Storage
Adjacency list
representation
(+) better space efficiency than matrices
(+) better for traversals - the list of outgoing edges is retrieved immediately
(-) slower to check whether 2 vertices are connected
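The trade-off above can be sketched on a toy graph. This is a minimal illustration, not taken from the slides: the edge list and the helper names (`neighbours`, `connected_list`, `connected_matrix`) are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy graph: directed edges as (source, target) pairs.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "a")]
vertices = sorted({v for edge in edges for v in edge})

# Adjacency list: one entry per edge, so space grows with |V| + |E|.
adjacency_list = defaultdict(list)
for source, target in edges:
    adjacency_list[source].append(target)

# Adjacency matrix: one cell per vertex pair, so space grows with |V|^2.
index = {v: i for i, v in enumerate(vertices)}
matrix = [[0] * len(vertices) for _ in vertices]
for source, target in edges:
    matrix[index[source]][index[target]] = 1


def neighbours(vertex):
    """Traversal step: the outgoing edges are available directly."""
    return adjacency_list.get(vertex, [])


def connected_list(source, target):
    """Connectivity check on the list: scans the neighbours of source."""
    return target in adjacency_list.get(source, [])


def connected_matrix(source, target):
    """Connectivity check on the matrix: a single cell lookup."""
    return matrix[index[source]][index[target]] == 1
```

The list stores 4 entries for this graph while the matrix stores 16 cells, but the matrix answers a connectivity check with one lookup.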
14. Storage
Index-free adjacency list representation
(+) natural graph data representation
(+) traversal time depends only on the number of elements traversed (N), not on the total graph size
(-) inefficient for vertex-to-vertex connectivity checks
(-) writes are hard for supernodes
(-) horizontal scalability is not easy
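A minimal in-memory sketch of the idea, assuming each vertex simply holds direct references to its neighbours. The `Node` class and helper functions are hypothetical and do not reflect any database's on-disk layout.

```python
class Node:
    """A vertex that holds direct references to its neighbours."""

    def __init__(self, name):
        self.name = name
        self.outgoing = []  # direct references to other Node objects

    def connect(self, other):
        self.outgoing.append(other)


def traverse(start, depth):
    """Collect names reachable within `depth` hops by following references only."""
    seen = {start.name}
    frontier = [start]
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            for neighbour in node.outgoing:  # no global index lookup here
                if neighbour.name not in seen:
                    seen.add(neighbour.name)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return seen


def is_connected(source, target):
    """Vertex-to-vertex check: has to scan the whole edge list of source."""
    return any(neighbour is target for neighbour in source.outgoing)


alice, bob, carol = Node("alice"), Node("bob"), Node("carol")
alice.connect(bob)
bob.connect(carol)
```

The traversal cost depends on the edges actually followed, while `is_connected` on a supernode would scan every one of its edges.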
15. Storage
Graph as a first-class storage citizen: Neo4j’s concept for distinguishing “real” (native) graph databases from “fake” ones built on top of key-value stores or an RDBMS.
Master-slave architecture, horizontal scalability for reads, HA as a commercial feature
native graph storage - Neo4j
“Graph Databases” by Ian Robinson, Jim Webber & Emil Eifrem
16. Storage
NoSQL data stores used as a persistence layer.
NoSQL - on top of other types
https://docs.janusgraph.org/latest/images/architecture-layer-diagram.svg
17. But NoSQL graph engines exist too: DGraph
● sharded and distributed
● consistent
● provides high availability
● fault tolerant
● Docker-friendly
● adjacency list of (entity, attribute, value) triples
○ entity = UID, attribute = predicate, value = object (literal or UID of another entity)
○ triples are sharded:
■ sharding unit - attribute
■ each attribute is assigned to a group of nodes
● schema definition or inference
● backed by Badger - a key-value store written in Go
Storage
graph NoSQL
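The attribute-based sharding described above can be sketched as follows. This is only an illustration under stated assumptions: the hash-based `group_for` function, `NUM_GROUPS` and the sample triples are hypothetical, and DGraph's real placement and rebalancing logic is not shown in the slides and is not reproduced here.

```python
from collections import defaultdict
import zlib

NUM_GROUPS = 3  # hypothetical number of server groups


def group_for(predicate):
    """Assign a predicate to a group with a stable hash (illustrative only)."""
    return zlib.crc32(predicate.encode("utf-8")) % NUM_GROUPS


# (entity UID, attribute/predicate, value) triples; values are literals or UIDs.
triples = [
    ("0x1", "name", "Alice"),
    ("0x2", "name", "Bob"),
    ("0x1", "friend", "0x2"),
    ("0x2", "friend", "0x1"),
    ("0x1", "age", 30),
]

# shards[group][predicate][entity] -> list of values (one adjacency list per predicate)
shards = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
for entity, predicate, value in triples:
    shards[group_for(predicate)][predicate][entity].append(value)


def lookup(entity, predicate):
    """Only the group that owns the predicate needs to be contacted."""
    owner = shards[group_for(predicate)]
    return list(owner[predicate].get(entity, []))
```

Because all triples of one attribute live in the same group, a query such as "friends of 0x1" touches a single group, whereas a query mixing several attributes may need several groups.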
18. Not efficient but technically possible:
● multiple JOINS - dynamic or static generation
○ inefficient
○ fine for 1 level but imagine more:
SELECT p3.name
FROM Person p1
LEFT JOIN Person p2 ON p1.boss = p2.id
LEFT JOIN Person p3 ON p2.boss = p3.id
WHERE p1.id = 1
● hierarchical queries
○ WITH RECURSIVE
● native graph support: SQL Server graph databases
○ node and edge tables
○ SQL with MATCH clause:
-- use MATCH in SELECT to find friends of Alice
SELECT Person2.name AS FriendName
FROM Person Person1, friend, Person Person2
WHERE MATCH(Person1-(friend)->Person2)
AND Person1.name = 'Alice';
● a lot of drawbacks: costly joins, queries that are hard to write
Storage
RDBMS
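The WITH RECURSIVE option can be tried with Python's built-in sqlite3 module. This is a minimal sketch: the Person table mirrors the columns used in the JOIN example, while the sample rows and the `management_chain` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (id INTEGER PRIMARY KEY, name TEXT, boss INTEGER)")
conn.executemany(
    "INSERT INTO Person (id, name, boss) VALUES (?, ?, ?)",
    [(1, "Dev", 2), (2, "Lead", 3), (3, "Director", 4), (4, "CEO", None)],
)

# Walk the boss chain upwards from one person, at any depth,
# without writing one JOIN per level.
CHAIN_QUERY = """
WITH RECURSIVE chain(id, name, boss, level) AS (
    SELECT id, name, boss, 0 FROM Person WHERE id = ?
    UNION ALL
    SELECT p.id, p.name, p.boss, chain.level + 1
    FROM Person p JOIN chain ON p.id = chain.boss
)
SELECT name, level FROM chain ORDER BY level
"""


def management_chain(person_id):
    """Return (name, level) rows from the person up to the top of the hierarchy."""
    return conn.execute(CHAIN_QUERY, (person_id,)).fetchall()
```

The recursion stops on its own when a row has no boss, since a NULL never matches the join condition. A cyclic hierarchy would need an explicit depth limit.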
19. Storage
Azure Cosmos DB
Microsoft outperforms the competition:
● multi-model database: key-value, document, column-family and graph
● geographical distribution
● elastic scalability
● built on open source: Apache TinkerPop Gremlin-based API
Amazon Neptune - AWS is trying to catch up
● queryable with TinkerPop
● structure stored and backed up in S3
● master-slave architecture
cloud
20. Storage
FlockDB
● aka: horizontally scaling relational database
● graph reduced to “edges between nodes” table
● not maintained anymore
big graph players - Twitter
https://blog.twitter.com/content/dam/blog-twitter/archive/it_s_in_the_coud96.thumb.1280.1280.png
21. Storage
TAO
● aka: “The Associations and Objects”
● large collection of geographically distributed server
clusters
● persistent storage with memory cache
big graph players - Facebook
https://scontent-cdg2-1.xx.fbcdn.net/v/t1.0-9/s720x720/1016494_10151647749827200_1788611608_n.png?_nc_cat=106&_nc_ht=scontent-cdg2-1.xx&oh=6c17fc682408be59f3a4718b34200324&oe=5C4FFAA6
22. Ad-hoc analysis
Cypher
● natural representation of graphs in the query language
○ expresses graph traversal
○ applies to simple and more complex cases
● limited to Neo4j
Any edges between node1 and node2
MATCH (node1)-->(node2)
RETURN node1, node2
Users with more than 10 friends
MATCH (user)-[:FRIEND]-(friend)
WHERE user.name = $name
WITH user, count(friend) AS friends
WHERE friends > 10
RETURN user
Edge creation with a property
CREATE (worker)-[:WORKS_FOR {since: 2018}]->(company)
query languages
23. Ad-hoc analysis
Gremlin
● less SQL-friendly
● DSL
● OLTP and OLAP
● vendor-agnostic
Every vertex whose code property equals ‘AUS’
g.V().has('code','AUS').valueMap(true).unfold()
Every vertex with the ‘airport’ label, counted by country
g.V().hasLabel('airport').groupCount().by('country')
Partial output of a traversal (first 10 paths starting from ‘SAF’)
g.V().has('code','SAF').repeat(out()).emit().path().by('code').limit(10)
query languages
24. Ad-hoc analysis
GraphQL+-
● proposed by DGraph
● inspired by GraphQL
● query = nested blocks starting with a query root
● more a JSON document than a query
All nodes whose name contains both terms “jones” and “indiana”
{
  me(func: allofterms(name@en, "jones indiana")) {
    name@en
    genre {
      name@en
    }
  }
}
query languages
26. Data processing
Think Like a Vertex (TLAV)
Main computation component ⇒ vertex
● a vertex knows only about its nearest neighbours
● communication with other vertices happens in computation stages called supersteps
Google Pregel ⇒ open-source implementation: Apache Giraph
● Facebook-proven
Other implementations
vertex-centric approach
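The superstep model can be sketched on a single machine. This is a simplified illustration, not Pregel's or Giraph's API: the function name, the message dictionaries and the sample graph are hypothetical, and edge weights are assumed to be positive.

```python
import math


def pregel_shortest_paths(out_edges, source):
    """Single-source shortest paths in a vertex-centric, superstep style.

    out_edges maps every vertex to a list of (neighbour, weight) pairs.
    """
    distance = {v: math.inf for v in out_edges}
    inbox = {v: [] for v in out_edges}
    inbox[source].append(0)  # the initial message wakes up the source
    supersteps = 0

    # Run until no vertex receives a message (every vertex has voted to halt).
    while any(inbox.values()):
        outbox = {v: [] for v in out_edges}
        for vertex, messages in inbox.items():
            if not messages:
                continue  # inactive vertex in this superstep
            best = min(messages)
            if best < distance[vertex]:
                distance[vertex] = best
                # A vertex only talks to its direct neighbours.
                for neighbour, weight in out_edges[vertex]:
                    outbox[neighbour].append(best + weight)
        inbox = outbox  # barrier: messages are delivered in the next superstep
        supersteps += 1
    return distance, supersteps


graph = {
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 2), ("d", 6)],
    "c": [("d", 1)],
    "d": [],
}
distances, steps = pregel_shortest_paths(graph, "a")
```

In a distributed engine the inner loop over vertices runs in parallel on many workers, and the `inbox = outbox` line corresponds to the network exchange between supersteps.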
27. Data processing
Think Like a Graph
● TLAV works well, but network communication can become an overhead
● graph-centric:
○ data stored in subgraphs
○ traversals applied locally first
○ results propagated to boundary vertices
● implementation example: GoFFish v3
○ unfortunately: lack of industry adoption
○ Spark on Neo4j tends to imitate the approach
graph-centric approach
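A rough sketch of the local-first idea, using connected components as the example. It is not GoFFish code: the partition layout, the function names and the centralised merge step are assumptions made for illustration (a real engine would exchange the boundary information through messages between workers).

```python
def local_components(vertices, edges):
    """Sequential union-find inside one subgraph (no network traffic)."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path compression
            v = parent[v]
        return v

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)
    return {v: find(v) for v in vertices}


def graph_centric_components(partitions, cut_edges):
    """partitions: list of (vertices, internal_edges); cut_edges cross partitions."""
    # Step 1: every partition solves its own subgraph locally.
    label = {}
    for vertices, edges in partitions:
        label.update(local_components(vertices, edges))
    # Step 2: only component ids of boundary vertices are exchanged and merged.
    roots = set(label.values())
    boundary_links = [(label[a], label[b]) for a, b in cut_edges]
    merged = local_components(roots, boundary_links)
    return {v: merged[label[v]] for v in label}


partitions = [
    ({1, 2, 3}, [(1, 2), (2, 3)]),
    ({4, 5, 6, 7}, [(4, 5), (6, 7)]),
]
components = graph_centric_components(partitions, cut_edges=[(3, 4)])
```

Only one piece of information crosses the partition boundary here (the link between the components of vertices 3 and 4), whereas a vertex-centric run would send messages along every edge over several supersteps.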
28. Data processing
Graph as a stream
● born from hardware limitations
● stream of vertices, triples or edges - edges work best because they avoid the desynchronization problem
● Scatter-Gather logic
● implementation examples from EPFL: X-Stream (standalone), Chaos (distributed X-Stream)
● still seems to be at the research stage
streaming approach
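An edge-centric scatter-gather loop can be sketched as follows, using connected components by minimum-label propagation. This is a simplified illustration and not X-Stream's implementation: `edge_stream` stands in for a sequential read from storage, and the update list is kept in memory for brevity.

```python
def edge_stream():
    """Hypothetical edge source; in practice edges would be read sequentially from disk."""
    yield from [(1, 2), (2, 3), (4, 5), (6, 6)]


def connected_components(stream_factory, vertices):
    """Label every vertex with the smallest vertex id of its component."""
    label = {v: v for v in vertices}  # only the vertex state is kept in memory
    changed = True
    while changed:
        # Scatter: stream over all edges and emit updates for their endpoints.
        updates = []
        for src, dst in stream_factory():
            if label[src] < label[dst]:
                updates.append((dst, label[src]))
            elif label[dst] < label[src]:
                updates.append((src, label[dst]))
        # Gather: stream over the updates and apply them to the vertex state.
        changed = False
        for vertex, candidate in updates:
            if candidate < label[vertex]:
                label[vertex] = candidate
                changed = True
    return label


labels = connected_components(edge_stream, vertices=range(1, 7))
```

The edges are never indexed or accessed randomly. Each iteration is one sequential pass over the edge stream, which is what makes the approach attractive when random access is expensive.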
36. Data visualization
IBM Compose
3rd party tools
https://www.ibm.com/blogs/bluemix/wp-content/uploads/2017/07/Fullscreenbrowser-800x653.png
37. Algorithms
PageRank
● measures the authority of vertices in the graph
● math formula:
PR(v) = 0.15/N + 0.85 * Sum PR(vin)/L(vin)
(N = number of vertices, vin = vertices linking to v, L(vin) = number of outgoing edges of vin)
● used for: recommendation systems (TwitterRank, ItemRank), anomaly detection, authority detection (BooRank), prediction and much more
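A minimal power-iteration sketch of PageRank with the usual damping factor of 0.85, i.e. PR(v) = (1 - d)/N + d * Sum PR(vin)/L(vin). The function name, the sample graph and the handling of vertices without outgoing edges (their rank is spread evenly) are assumptions made for this illustration.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """out_links maps every vertex to the list of vertices it links to."""
    n = len(out_links)
    rank = {v: 1.0 / n for v in out_links}
    for _ in range(iterations):
        # Rank held by vertices without outgoing edges is spread evenly
        # (one common convention; an assumption, not taken from the slides).
        dangling = sum(rank[v] for v, targets in out_links.items() if not targets)
        new_rank = {
            v: (1.0 - damping) / n + damping * dangling / n for v in out_links
        }
        # Every vertex shares its rank equally among its outgoing edges.
        for v, targets in out_links.items():
            for target in targets:
                new_rank[target] += damping * rank[v] / len(targets)
        rank = new_rank
    return rank


web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
```

In this toy graph "c" collects links from three vertices and ends up with the highest rank, while "d" has no incoming links and keeps only the base share 0.15/N.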
43. Challenges
Partitioning is hard - limited choice in open-source frameworks.
Real-time is not as easy as for other data stores.
Lack of a standardised query language.
graphs at scale
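One way to see why partitioning is hard is to count the edges that cross partitions (the edge cut), since each of them implies network traffic. The sketch below compares a naive hash placement with a placement that follows the graph's structure. The sample graph, the function names and both assignments are hypothetical.

```python
import zlib


def hash_partition(vertex, num_partitions):
    """Naive placement: ignores the graph structure entirely."""
    return zlib.crc32(str(vertex).encode("utf-8")) % num_partitions


def edge_cut(edges, assignment):
    """Count the edges whose endpoints live on different partitions."""
    return sum(1 for src, dst in edges if assignment[src] != assignment[dst])


# Two triangles joined by a single bridge edge (3, 4).
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
vertices = range(1, 7)

hashed = {v: hash_partition(v, 2) for v in vertices}
by_community = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1}

cut_hashed = edge_cut(edges, hashed)
cut_by_community = edge_cut(edges, by_community)
```

The structure-aware placement cuts only the bridge edge. Finding such a placement on a large, changing graph is the hard part, and hash placement, which is cheap, typically cuts many more edges.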