Graphs - going distributed
vertices, edges, degrees, properties… in the Big Data era
Plan
1. Vocabulary
2. Storage
3. Ad-hoc analysis
4. Data processing
5. Data visualization
6. Algorithms
7. Challenges
Vocabulary
fundamentals
Vocabulary
Undirected graph
Directed graph
graph types - direction
Vocabulary
Acyclic
Cyclic
graph types - cycles
Vocabulary
Labeled graph
graph types - labels
Vocabulary
Simple graph
Multigraph
Pseudograph
graph types - parallel edges and loops
Vocabulary
graph types - bipartite
Vocabulary
● G = (V, E)
● e = {u, v} or (u, v) or uv, V = {v₁, v₂, …, vₙ}
● |E(G)|, |V(G)| or |E|, |V|
● vertex-cut G − v: [illustration not preserved]
● graph partition satisfies: V = V₁ ∪ V₂, V₁ ∩ V₂ = ∅, V₁ ≠ ∅, V₂ ≠ ∅
● vertex degrees: d⁺(v) and d⁻(v)
graph theory
http://math.tut.fi/~ruohonen/GT_English.pdf
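The notation maps onto a few lines of code. A minimal Python sketch, assuming a small made-up directed graph; the helper names `degrees` and `is_partition` are illustrative and not from the slides. It reads d⁺ as out-degree and d⁻ as in-degree, which is an assumption since conventions differ between texts, and it treats a partition as two non-empty, disjoint subsets that together cover V.

```python
from collections import Counter

# Hypothetical directed graph G = (V, E); vertex names are illustrative only.
V = {"a", "b", "c", "d"}
E = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]


def degrees(edges):
    """Return (out_degree, in_degree) counters for a list of (u, v) edges."""
    out_deg = Counter(u for u, _ in edges)
    in_deg = Counter(v for _, v in edges)
    return out_deg, in_deg


def is_partition(vertices, v1, v2):
    """V1 and V2 partition V if they cover V, are disjoint and are both non-empty."""
    return (v1 | v2 == vertices) and not (v1 & v2) and bool(v1) and bool(v2)


out_deg, in_deg = degrees(E)
```

A `Counter` returns 0 for vertices with no edges in a given direction, so sources and sinks need no special handling.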
Storage
Adjacency matrix
representation
(+) easy to check whether two vertices are connected
(+) easy to add or remove edges
(-) not space efficient
(-) slow traversals: listing a vertex's neighbours scans a whole row
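The trade-offs listed for the matrix can be seen in a minimal Python sketch. It assumes a hypothetical 4-vertex directed graph with vertices numbered 0..3; the function names are illustrative.

```python
# Hypothetical 4-vertex directed graph; vertices are numbered 0..3.
n = 4
matrix = [[0] * n for _ in range(n)]  # n * n cells, whatever the number of edges


def add_edge(u, v):
    """Adding an edge flips a single cell."""
    matrix[u][v] = 1


def has_edge(u, v):
    """Connectivity check is a single lookup."""
    return matrix[u][v] == 1


def neighbours(u):
    """Traversal step: the whole row is scanned, even in a sparse graph."""
    return [v for v in range(n) if matrix[u][v]]


for u, v in [(0, 1), (0, 2), (2, 3)]:
    add_edge(u, v)
```

Three edges still cost 16 cells here, which is the space inefficiency the slide refers to.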
Storage
Incidence matrix
representation
similar characteristics to adjacency matrix
Storage
Adjacency list
representation
(+) better space efficiency than matrices
(+) better for traversals - list of outgoing
edges retrieved immediately
(-) slower to check whether two vertices are connected
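For comparison, a minimal Python sketch of an adjacency list, assuming a made-up directed graph with the illustrative helper names below. Outgoing edges come back in one lookup, while the connectivity check has to scan the list.

```python
from collections import defaultdict

# Hypothetical directed graph stored as vertex -> list of successors.
adjacency = defaultdict(list)


def add_edge(u, v):
    """Only existing edges take space."""
    adjacency[u].append(v)


def neighbours(u):
    """Outgoing edges are retrieved immediately, without scanning other vertices."""
    return adjacency.get(u, [])


def has_edge(u, v):
    """Connectivity check scans the successor list of u."""
    return v in adjacency.get(u, [])


for u, v in [("a", "b"), ("a", "c"), ("c", "d")]:
    add_edge(u, v)
```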
Storage
Edge table
representation
https://docs.microsoft.com/en-us/sql/relational-databases/graphs/media/person-friends-tables.png?view=sql-server-2017
Storage
Index-free adjacency list
representation
(+) natural graph data representation
(+) traversal cost depends only on the N elements visited, not on the total graph size
(-) inefficient vertex-to-vertex connectivity checks
(-) writes are hard for supernodes
(-) horizontal scalability is not easy
Storage
The graph as a first-class storage citizen: Neo4j's
concept for distinguishing "real" graph databases from
"fake" ones built on key-value stores or an RDBMS.
Master-slave architecture, horizontal scalability for
reads only, high availability as a commercial feature
native graph storage - Neo4j
“Graph Databases” by Ian Robinson, Jim Webber & Emil Eifrem
Storage
NoSQL data stores used as a persistence
layer.
NoSQL - on top of other types
https://docs.janusgraph.org/latest/images/architecture-layer-diagram.svg
But the NoSQL graph engines exist too: DGraph
● sharded and distributed
● consistent
● highly available
● fault-tolerant
● Docker-friendly
● adjacency-list of (entity, attribute, value)
triples
○ entity = UID, attribute = predicate, value =
object (literal, UID of other entities)
○ triple - unit of sharding
■ sharding unit - attribute
■ each attribute assigned to a group of
nodes
● schema definition or inference
● backed by Badger - Go K/V store
Storage
graph NoSQL
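The idea of sharding by attribute can be sketched in Python. This is an illustration only: the triples, the number of groups and the round-robin assignment are all assumptions made for the example, and it does not reproduce how DGraph actually assigns attributes to groups.

```python
from collections import defaultdict

# Hypothetical (entity, attribute, value) triples; UIDs and names are made up.
triples = [
    ("0x1", "name", "Alice"),
    ("0x2", "name", "Bob"),
    ("0x1", "friend", "0x2"),
    ("0x2", "lives_in", "Paris"),
]

NUM_GROUPS = 2  # assumed number of server groups


def shard_by_attribute(triples, num_groups):
    """Assign every attribute to one group; all triples of that attribute follow it."""
    attributes = sorted({attr for _, attr, _ in triples})
    # Round-robin assignment is an assumption chosen for simplicity.
    group_of = {attr: i % num_groups for i, attr in enumerate(attributes)}
    shards = defaultdict(list)
    for entity, attr, value in triples:
        shards[group_of[attr]].append((entity, attr, value))
    return group_of, dict(shards)


group_of, shards = shard_by_attribute(triples, NUM_GROUPS)
```

Because a whole attribute lives in one group, a query on a single attribute such as `name` touches one group only, whatever the number of entities.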
Not efficient but technically possible:
● multiple JOINs - dynamic or static generation
○ inefficient
○ fine for 1 level but imagine more:
SELECT p3.name
FROM Person p1
LEFT JOIN Person p2 ON p1.boss = p2.id
LEFT JOIN Person p3 ON p2.boss = p3.id
WHERE p1.id = 1
● hierarchical queries
○ WITH RECURSIVE
● native graph support: SQL Server Graph
Databases
○ vertices and edge tables
○ SQL with MATCH clause:
-- use MATCH in SELECT to find friends of Alice
SELECT Person2.name AS FriendName
FROM Person Person1, friend, Person Person2
WHERE MATCH(Person1-(friend)->Person2)
AND Person1.name = 'Alice';
● many drawbacks: costly joins, queries that are hard to write
Storage
RDBMS
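The WITH RECURSIVE option can be tried with SQLite from Python. A minimal sketch, assuming a hypothetical `Person` table with a `boss` column like the one in the JOIN example; the names and rows are made up.

```python
import sqlite3

# In-memory database with a hypothetical Person(id, name, boss) table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (id INTEGER PRIMARY KEY, name TEXT, boss INTEGER);
    INSERT INTO Person VALUES (1, 'Ann', 2), (2, 'Bob', 3), (3, 'Cid', NULL);
""")

# Walk the boss chain upwards from person 1, whatever its depth,
# instead of writing one JOIN per level.
rows = conn.execute("""
    WITH RECURSIVE chain(id, name, boss, depth) AS (
        SELECT id, name, boss, 0 FROM Person WHERE id = 1
        UNION ALL
        SELECT p.id, p.name, p.boss, c.depth + 1
        FROM Person p JOIN chain c ON p.id = c.boss
    )
    SELECT name, depth FROM chain ORDER BY depth
""").fetchall()
```

The recursion stops when a row has no boss, since `p.id = NULL` never matches.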
Storage
Azure Cosmos DB
Microsoft is ahead of the competition:
● multi-model database: key-value, document,
column-family and graph
● geographical distribution
● elastic scalability
● built on open source: Apache TinkerPop Gremlin API
Amazon Neptune -
AWS is trying to catch up
● queryable with Tinkerpop
● structure stored and backed up in S3
● master-slave architecture
cloud
Storage
FlockDB
● aka: horizontally scaling relational database
● graph reduced to “edges between nodes” table
● not maintained anymore
big graph players - Twitter
https://blog.twitter.com/content/dam/blog-twitter/archive/it_s_in_the_coud96.thumb.1280.1280.png
Storage
TAO
● aka: “The Associations and Objects”
● large collection of geographically distributed server
clusters
● persistent storage with memory cache
big graph players - Facebook
https://scontent-cdg2-1.xx.fbcdn.net/v/t1.0-9/s720x720/1016494_10151647749827200_1788611608_n.png?_nc_cat=106&_nc_ht=scontent-cdg2-1.xx&oh=6c17fc682408be59f3a4718b34200324&oe=5C4FFAA6
Ad-hoc analysis
Cypher
● natural representation of graphs in query
language
○ expresses graph traversal
○ applies to simple and more complex
cases
● limited to Neo4J
Any edges between node1 and node2
MATCH (node1)-->(node2)
Users with more than 10 friends
MATCH (user)-[:FRIEND]-(friend)
WHERE user.name = $name
WITH user, count(friend) AS friends
WHERE friends > 10
RETURN user
Edge creation with a property
CREATE (worker)-[:WORKS_FOR {since: 2018}]->(company)
query languages
Ad-hoc analysis
Gremlin
● less SQL-friendly
● DSL
● OLTP and OLAP
● vendor-agnostic
Every vertex whose code property equals 'AUS'
g.V().has('code','AUS').valueMap(true).unfold()
Count of vertices with the 'airport' label, grouped by country
g.V().hasLabel('airport').groupCount().by('country')
First 10 paths found when traversing outgoing edges from 'SAF'
g.V().has('code','SAF').repeat(out()).emit().path().by('code').limit(10)
query languages
Ad-hoc analysis
GraphQL+-
● DGraph proposal
● inspired by GraphQL
● query = nested blocks starting with a query
root
● more a JSON document than a query
All nodes with "jones indiana" in name
{
  me(func: allofterms(name@en, "jones indiana")) {
    name@en
    genre {
      name@en
    }
  }
}
query languages
Data processing
going distributed
Data processing
Think Like a Vertex
Main computation component ⇒ vertex
● a vertex knows only about its nearest
neighbours
● it communicates with other vertices in
computation stages called supersteps
Google Pregel ⇒ open-source counterpart: Apache Giraph
● Facebook-proven
Other implementations
vertex-centric approach
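The superstep model can be sketched in plain Python. This is a single-process illustration under stated assumptions, not Pregel's or Giraph's API: the graph, the starting values and the function name `run_supersteps` are made up. Each vertex sees only its own value and the messages received in the previous superstep, and the computation stops when no messages are in flight.

```python
# Hypothetical undirected graph as adjacency lists, with an initial value per vertex.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
values = {"a": 3, "b": 6, "c": 2, "d": 1}


def run_supersteps(graph, values):
    """Propagate the maximum value; returns (final values, supersteps used)."""
    values = dict(values)
    # Superstep 0: every vertex sends its value to its neighbours.
    inbox = {v: [] for v in graph}
    for v in graph:
        for neighbour in graph[v]:
            inbox[neighbour].append(values[v])
    supersteps = 1
    while any(inbox.values()):
        outbox = {v: [] for v in graph}
        for v, messages in inbox.items():
            # Vertex program: only the local value and incoming messages are visible.
            if messages and max(messages) > values[v]:
                values[v] = max(messages)
                for neighbour in graph[v]:
                    outbox[neighbour].append(values[v])
            # Otherwise the vertex stays silent, which plays the role of voting to halt.
        inbox = outbox  # messages are delivered at the start of the next superstep
        supersteps += 1
    return values, supersteps


final_values, used = run_supersteps(graph, values)
```

In a distributed engine the inner loop over vertices runs in parallel on the workers, and the swap of `inbox` and `outbox` corresponds to the synchronisation barrier between supersteps.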
Data processing
Think Like a Graph
● TLAV works well, but network communication
can become an overhead
● graph-centric:
○ data stored in subgraphs
○ traversals applied first locally
○ results propagated to boundary
vertices
● implementation example: GoFFish v3
○ unfortunately: lack of industry
adoption
○ Spark on Neo4J tends to imitate the
approach
graph-centric approach
Data processing
Graph as a stream
● born from hardware limitations
● stream of vertices, triples or edges; edges
work best because they avoid the
desynchronization problem
● Scatter-Gather logic
● implementation examples from EPFL:
X-Stream (standalone), Chaos (distributed
X-Stream)
● still seems to be at the research stage
streaming approach
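The edge-centric Scatter-Gather logic can be illustrated with a short Python sketch. It is a simplified, in-memory illustration and not the X-Stream or Chaos implementation; the edge list and function names are assumptions. The scatter phase makes one sequential pass over the edge stream and emits updates, and the gather phase applies them to vertex state. The example labels weakly connected components with the smallest vertex id.

```python
def edge_stream():
    """Hypothetical edge source; a real system would read edges sequentially from disk."""
    yield from [(1, 2), (2, 3), (4, 5)]


def connected_components(vertices, edges):
    """Label every vertex with the smallest vertex id of its component."""
    label = {v: v for v in vertices}
    changed = True
    while changed:
        # Scatter: stream over the edges once and emit label updates.
        updates = []
        for u, v in edges():
            if label[u] < label[v]:
                updates.append((v, label[u]))
            elif label[v] < label[u]:
                updates.append((u, label[v]))
        # Gather: apply the updates to the vertex state.
        changed = False
        for vertex, candidate in updates:
            if candidate < label[vertex]:
                label[vertex] = candidate
                changed = True
    return label


labels = connected_components({1, 2, 3, 4, 5}, edge_stream)
```

Edges are never accessed randomly, only streamed, which is the property that makes the approach attractive when random access to storage is expensive.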
Data visualization
D3.js
JavaScript libraries
Data visualization
Sigma.js
JavaScript libraries
Data visualization
Alchemy.js
JavaScript libraries
Data visualization
Neo4J console
databases UIs
Data visualization
DGraph Ratel
databases UIs
https://user-images.githubusercontent.com/4924405/34387959-6d224cd2-eb31-11e7-9169-9d23cc1e6405.png
Data visualization
Cytoscape
3rd party tools
http://manual.cytoscape.org/en/stable/_images/sampleOriginal.png
Data visualization
Linkurious
3rd party tools
https://linkurio.us/wp-content/uploads/2015/09/czech-ehealth-ecosystem.png
Data visualization
IBM Compose
3rd party tools
https://www.ibm.com/blogs/bluemix/wp-content/uploads/2017/07/Fullscreenbrowser-800x653.png
Algorithms
PageRank
● problem of authority of vertices in the graph
● math formula:
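PR(v) = 0.15/N + 0.85 * Σ PR(v_in) / L(v_in)
(sum over the vertices v_in linking to v; L(v_in) = number of outgoing links of v_in; N = number of vertices)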
● used for: recommendation systems
(TwitterRank, ItemRank), anomaly detection,
authority detection (BooRank), prediction and
much more
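The formula can be iterated superstep by superstep. A minimal Python sketch, assuming a made-up three-vertex graph in which every vertex has at least one outgoing link; the function name, the graph and the number of supersteps are illustrative, and the 0.85 damping factor is the one implied by the 0.15 and 0.85 constants on the slide.

```python
def pagerank(out_links, supersteps=30, damping=0.85):
    """Iterate PR(v) = (1 - d)/N + d * sum(PR(v_in) / L(v_in)) a fixed number of times."""
    n = len(out_links)
    rank = {v: 1.0 / n for v in out_links}  # starting PR: 1/N per vertex
    for _ in range(supersteps):
        incoming = {v: 0.0 for v in out_links}
        for v, targets in out_links.items():
            # Each vertex splits its current rank evenly among its outgoing links.
            for target in targets:
                incoming[target] += rank[v] / len(targets)
        rank = {v: (1 - damping) / n + damping * incoming[v] for v in out_links}
    return rank


# Hypothetical graph: a links to b and c, b links to c, c links back to a.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
```

Vertices without outgoing links would leak rank in this sketch; handling them is left out to keep the example close to the formula.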
Algorithms
PageRank - superstep 1
[figure values: starting PR 1/5 = 0.2; 0.2, 0.06, 0.06, 0.06]
Algorithms
PageRank - superstep 1
[figure values: 0.17, 0.2, 0.05, 0.05, 0.05]
Algorithms
PageRank - superstep 2
[figure values: 0.17, 0.2, 0.05, 0.05, 0.05]
Algorithms
PageRank - superstep 2
[figure values: 0.17, 0.2, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05]
Algorithms
PageRank - superstep 3
[figure values: 0.17, 0.2, 0.04, 0.04, 0.04]
Challenges
Partitioning is hard: limited choice among open-source frameworks.
Real-time processing is not as easy as with other data stores.
Lack of a standardised query language.
graphs at scale
Thank you
Questions ?

Distributed graph processing
