Presentation of the Gradoop Framework at the Flink & Neo4j Meetup in Berlin (http://www.meetup.com/graphdb-berlin/events/228576494/). The talk is about the extended property graph model, its operators and how they are implemented on top of Apache Flink. The talk also includes some benchmark results on scalability and a demo involving Neo4j, Flink and Gradoop (see www.gradoop.com)
Presentation of the Gradoop Framework at the Graph Database Meetup in Munich (https://www.meetup.com/inovex-munich/events/231187528/). The talk is about the extended property graph model, its operators and how they are implemented on top of Apache Flink. The talk also includes some benchmark results on scalability (see www.gradoop.com)
Presentation of the Gradoop Framework at the GraphDevroom @FOSDEM 2016. The talk is about the extended property graph model, its operators and how they are implemented on top of Apache Flink, a distributed dataflow framework. The talk also includes a social network analysis example and some benchmark results on scalability. (see www.gradoop.com)
Slides from Matt Dowle's presentation at H2O Open Tour: NYC
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink (Martin Junghanns)
The slides contain an overview of Gradoop, our framework for end-to-end graph analytics. We present our extended property graph data model and give an introduction to Apache Flink and its DataSet API. We show how our data model is mapped to Flink DataSets and how we implement graph operators using DataSet transformations. Furthermore, the slides contain information about two useful tools we developed around Gradoop: the Graph Definition Language (GDL) and ldbc-flink-import.
Information-Rich Programming in F# with Semantic Data (Steffen Staab)
Programming with rich data frequently implies that one needs to search for, understand, integrate, and program with new data, with each of these steps constituting a major obstacle to successful data use.
In this talk we will explain and demonstrate how our approach, LITEQ (Language Integrated Types, Extensions and Queries for RDF Graphs), which is realized as part of the F# / Visual Studio environment, supports the software developer. Using the extended IDE, the developer may now:
a. explore new, previously unseen data sources, which are either natively in RDF or mapped into RDF;
b. use the exploration of schemata and data to construct types and objects in the F# environment;
c. automatically map between data and programming-language objects in order to make them persistent in the data source;
d. have extended typing functionality added to the F# environment, resulting from the exploration of the data source and its mapping into F#.
Core to this approach is the novel node path query language, NPQL, which allows for interactive, intuitive exploration of data schemata and the data proper, as well as for the mapping and definition of types, object collections, and individual objects. Beyond the existing type provider mechanism for F#, our approach also allows for property-based navigation and runtime querying for data objects.
Linked geospatial data has recently received attention, as researchers and practitioners have started tapping the wealth of geospatial information available on the Web. Incomplete geospatial information, although appearing often in the applications captured by such datasets, is not represented and queried properly due to the lack of appropriate data models and query languages. We discuss our recent work on the model RDFi, an extension of RDF with the ability to represent property values that exist, but are unknown or partially known, using constraints, and an extension of the query language SPARQL with qualitative and quantitative geospatial querying capabilities. We demonstrate the usefulness of RDFi in geospatial Semantic Web applications by giving examples and comparing the modeling capabilities of RDFi with the ones of related Semantic Web systems.
The world around us is full of connected information. Neo4j was originally developed to solve two complex "network" problems in a document management system, because it was too hard to manage rich connection information efficiently in traditional and newer "NoSQL" databases. During this meetup, we will talk about the technology and about the journey that a couple of technologists from Malmö took. You will learn:
* how Neo Technology grew from just the three founders into a global database company with use cases in every domain imaginable;
* how focusing on customer and community feedback allows us to provide a solution for managing connected data to everyone, not just the large internet companies.
Of course we will also introduce the graph model, its whiteboard friendliness, and how to get started with Neo4j and its easy and powerful query language Cypher. We'll also compare the graph and relational data models to see how they differ in shape and capabilities. Finally, we discuss the foundations that enable graph databases to provide higher join performance, faster development processes, and more inclusive software for all stakeholders. With use cases from gaming, dating, and finance, we'll see how to apply graph capabilities to these domains to realize new functionality or opportunities that were not possible before.
Finally, if there's a question you've always wanted to ask/discuss, we'll have plenty of time for that at the end of Michael's presentation.
Save queries as annotations: a method for the digital preservation of queries on a Hebrew text database enriched with linguistic information. These queries form the data for interpretations by biblical scholars. Sharing those queries as Open Annotation enables researchers to communicate their (intermediate) results.
Jeff will showcase sparklyr, the new R package to interface with Spark, and talk about the different extensions, including the rsparkling ML package.
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
We will cover Apache Spark's Machine Learning Library (MLlib), focusing on using Spark for recommender systems.
MLlib is a library built on top of Spark's engine which allows us to train, test, validate and operationalize machine learning models while working with lots of data in a convenient way thanks to its robust abstractions over data sets.
Find out how you can use MLlib to build product recommendation systems by employing both traditional ML techniques such as collaborative filtering, as well as more novel, deep-learning approaches which make use of Neural Networks.
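The collaborative-filtering idea behind MLlib's recommenders can be sketched in plain Python. This is a toy matrix factorization trained by stochastic gradient descent, not MLlib's actual ALS implementation; the ratings matrix and hyperparameters are invented for the example:

```python
import random

# Toy ratings: (user, item) -> rating on a 1-5 scale.
ratings = {
    (0, 0): 5.0, (0, 1): 3.0, (1, 0): 4.0,
    (1, 2): 1.0, (2, 1): 4.0, (2, 2): 5.0,
}

def factorize(ratings, n_users, n_items, k=2, lr=0.02, reg=0.02,
              epochs=2000, seed=0):
    """Learn user/item latent factors by SGD on squared error."""
    rnd = random.Random(seed)
    U = [[rnd.uniform(0.1, 0.9) for _ in range(k)] for _ in range(n_users)]
    V = [[rnd.uniform(0.1, 0.9) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for (u, i), r in ratings.items():
            pred = sum(U[u][f] * V[i][f] for f in range(k))
            err = r - pred
            for f in range(k):
                uf, vf = U[u][f], V[i][f]
                U[u][f] += lr * (err * vf - reg * uf)  # regularized SGD step
                V[i][f] += lr * (err * uf - reg * vf)
    return U, V

def predict(U, V, u, i):
    """Predicted rating is the dot product of the latent factors."""
    return sum(uf * vf for uf, vf in zip(U[u], V[i]))

U, V = factorize(ratings, n_users=3, n_items=3)
```

Unobserved (user, item) cells of the reconstructed matrix then serve as recommendation scores; real systems would hold out ratings to validate them.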
LDQL: A Query Language for the Web of Linked Data (Olaf Hartig)
I used this slideset to present our research paper at the 14th Int. Semantic Web Conference (ISWC 2015). Find a preprint of the paper here:
http://olafhartig.de/files/HartigPerez_ISWC2015_Preprint.pdf
Ensemble machine learning methods are often used when the true prediction function is not easily approximated by a single algorithm. Practitioners may prefer ensemble algorithms when model performance is valued above other factors such as model complexity and training time. The Super Learner algorithm, also called "stacking", learns the optimal combination of the base learner fits. The latest version of H2O now contains a "Stacked Ensemble" method, which allows the user to stack H2O models into a Super Learner. The Stacked Ensemble method is the native H2O version of stacking, previously only available in the h2oEnsemble R package, and now enables stacking from all the H2O APIs: Python, R, Scala, etc.
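The stacking recipe described above can be sketched in plain Python. This is a toy illustration, not H2O's Stacked Ensemble API; the two base learners and the data are invented for the example. The steps are: train base learners, collect their out-of-fold predictions, then fit a meta-learner on those predictions:

```python
# Toy data, roughly y = 2x + 1; two weak base models disagree.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 2.9, 5.1, 7.0, 9.1, 10.9]

def base_mean(train_x, train_y):
    """Base learner 1: always predicts the training mean."""
    m = sum(train_y) / len(train_y)
    return lambda x: m

def base_slope(train_x, train_y):
    """Base learner 2: line through the origin, least-squares slope."""
    s = sum(x * y for x, y in zip(train_x, train_y)) / sum(x * x for x in train_x)
    return lambda x: s * x

def stack(xs, ys, bases):
    # Leave-one-out (out-of-fold) predictions for each base learner.
    Z = []
    for i in range(len(xs)):
        tx, ty = xs[:i] + xs[i+1:], ys[:i] + ys[i+1:]
        Z.append([b(tx, ty)(xs[i]) for b in bases])
    # Meta-learner: two-weight least squares via the normal equations.
    a11 = sum(z[0] * z[0] for z in Z)
    a12 = sum(z[0] * z[1] for z in Z)
    a22 = sum(z[1] * z[1] for z in Z)
    b1 = sum(z[0] * y for z, y in zip(Z, ys))
    b2 = sum(z[1] * y for z, y in zip(Z, ys))
    det = a11 * a22 - a12 * a12
    w1 = (b1 * a22 - b2 * a12) / det
    w2 = (a11 * b2 - a12 * b1) / det
    models = [b(xs, ys) for b in bases]  # refit base learners on all data
    return lambda x: w1 * models[0](x) + w2 * models[1](x)

model = stack(xs, ys, [base_mean, base_slope])
```

Fitting the meta-learner on out-of-fold rather than in-sample predictions is the key design choice: it keeps the combination weights from simply rewarding the base model that overfits the most.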
Erin is a Statistician and Machine Learning Scientist at H2O.ai. Before joining H2O, she was the Principal Data Scientist at Wise.io (acquired by GE Digital) and Marvin Mobile Security (acquired by Veracode) and the founder of DataScientific, Inc. Erin received her Ph.D. from the University of California, Berkeley. Her research focuses on ensemble machine learning, learning from imbalanced binary-outcome data, influence curve based variance estimation, and statistical computing.
This is my Deep Water talk for the TensorFlow Paris meetup.
Deep Water is H2O's integration with multiple open source deep learning libraries such as TensorFlow, MXNet, and Caffe. On top of the performance gains from GPU backends, Deep Water naturally inherits all H2O properties in scalability, ease of use, and deployment.
Medical Heritage Library (MHL) on ArchiveSpark (Helge Holzmann)
This presentation gives an introduction to ArchiveSpark and the recent extension to use it with any archival collection. The slides demonstrate how to set it up and use it for analyzing data from medical journals of the Medical Heritage Library (MHL).
Suneel Marthi - Deep Learning with Apache Flink and DL4J (Flink Forward)
http://flink-forward.org/kb_sessions/deep-learning-with-apache-flink-and-dl4j/
Deep learning has become very popular over the last few years in areas such as image recognition, fraud detection, and machine translation, and has proved to be very useful for handling unstructured data and extracting value from it. A big challenge in building deep learning models has been the high cost of training them. With the recent advent of distributed frameworks like Apache Flink and Apache Spark, it is now faster to train deep learning models in parallel on modern platform architectures. In this talk, we'll show how to use Apache Flink Streaming with the open source deep learning framework DeepLearning4j to perform large-scale deep learning model training. We will show a demo of a recurrent neural net that is trained for language modeling and have it generate text.
Discover what's new in the Neo4j community for the week of 14 October 2017, including projects around Ethereum, graph visualization, & recommender systems.
This week Will and I interviewed Ward Cunningham as part of the Neo4j online meetup and we launched the first version of the much awaited Kafka Connector. Neo4j 3.5 was also released and Jennifer kicked off an exciting series of posts on the Marvel Universe.
Discover what's new in the Neo4j community for the week of 7 October 2017, including projects around Data Science, Facebook, and Natural Language Processing.
HDF Augmentation: Interoperability in the Last Mile (Ted Habermann)
Science data files are generally written to serve well-defined purposes for a small science team. In many cases, the organization of the data and the metadata is designed for custom tools developed and maintained by and for the team. Using these data outside of this context often involves restructuring, re-documenting, or reformatting the data. This expensive and time-consuming process usually prevents data reuse and thus decreases the total life-cycle value of the data considerably. If the data are unique or critically important to solving a particular problem, they can be modified into a more generally usable form, or metadata can be added in order to enable reuse. This augmentation process can be done to enhance data for the intended purpose or for a new purpose, to make the data available to new tools and applications, to make the data more conventional or standard, or to simplify preservation of the data. The HDF Group has addressed augmentation needs in many ways: by adding extra information, by renaming objects or moving them around in the file, by reducing the complexity of the organization, and sometimes by hiding data objects that are not understood by specific applications. In some cases these approaches require re-writing the data into new files; in other cases augmentation can be done externally, without affecting the original file. We will describe and compare several examples of each approach.
Predicting Influence and Communities Using Graph Algorithms (Databricks)
Relationships are one of the most predictive indicators of behavior and preferences. Community detection based on relationships is a powerful tool for inferring similar preferences in peer groups, anticipating future behavior, estimating group resiliency, finding hierarchies, and preparing data for other analysis. Centrality measures based on relationships identify the most important items in a network and help us understand group dynamics such as influence, accessibility, the speed at which things spread, and bridges between groups. Data scientists use graph algorithms to identify groups and estimate important entities based on their interactions. In this session, we'll cover the common uses of community detection and centrality measures and how some of the iconic graph algorithms compute values. We'll show examples of how to run community detection and centrality algorithms in Apache Spark, including using the AggregateMessages function to add your own algorithms. You'll learn best practices and tips for tricky situations. For those that want to run graph algorithms in a graph platform, we'll also illustrate a few examples in Neo4j. Some of the algorithms included:
* Triangle Count and Clustering Coefficient to estimate network cohesiveness
* Strongly Connected Components and Connected Components to find clusters
* Label Propagation to quickly infer groups and clean data with semi-supervised learning
* Louvain Modularity to uncover group hierarchies
* Balanced Triad to identify unstable groups
* PageRank to reveal influencers
* Betweenness Centrality to predict bottlenecks and bridges
Authors: Amy Hodler, Sören Reichardt
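The Label Propagation step in the list above can be sketched in plain Python. This is a deterministic toy version on a hard-coded adjacency list, not Spark's or Neo4j's implementation; the graph, sweep order, and tie-breaking rule are illustrative:

```python
from collections import Counter

# Two 4-cliques joined by a single bridge edge (3-4).
graph = {
    0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2, 4],
    4: [3, 5, 6, 7], 5: [4, 6, 7], 6: [4, 5, 7], 7: [4, 5, 6],
}

def label_propagation(graph, rounds=20):
    """Each node repeatedly adopts the most frequent label among its
    neighbors. Toy determinism: nodes are swept in sorted order and
    ties go to the larger label (real implementations randomize both)."""
    labels = {n: n for n in graph}  # every node starts as its own community
    for _ in range(rounds):
        changed = False
        for n in sorted(graph):
            counts = Counter(labels[m] for m in graph[n])
            best = max(counts, key=lambda l: (counts[l], l))
            if labels[n] != best:
                labels[n] = best
                changed = True
        if not changed:  # converged: no node wants to switch
            break
    return labels

labels = label_propagation(graph)
```

On this graph the two cliques settle on distinct labels, with the bridge edge unable to pull either side over, which is the behavior the algorithm exploits at scale.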
Vancouver part 1: Intro to Elasticsearch and Kibana - Beginner's Crash Course ... (UllyCarolinneSampaio)
Elasticsearch is known as the heart of the Elastic Stack, which consists of Beats, Logstash, Elasticsearch, and Kibana. The Elastic Stack allows us to take data from any source, in any format, then search, analyze, and visualize it in real time.
If you are a developer who is looking to make data usable in real time and at scale, Elasticsearch and Kibana are great tools to have on your belt.
This week I had fun with the online meetup on similarity algorithms with Tomaz Bratanic. I came across a great post written by Adrien Sales showing how to analyse PostgreSQL metadata using Neo4j and learned a neat approach to ingesting data into Neo4j using Kafka Streams and GraphQL.
K-CAI NEURAL API is a Keras-based neural network API for machine learning that lets you prototype with many of TensorFlow's possibilities. Python, Free Pascal, and Delphi together in Google Colab, Git, or the Community Edition.
Extending the Stream/Table Duality into a Trinity, with Graphs (David Allen & ..., Confluent)
The stream/table duality in Kafka lets us look at our data in two different ways, whichever is more convenient for our use. But what about when the connections between the data points add much more value to our data? For this, we need to look at our data as a graph. Graphs help drive financial fraud investigations, social media analyses, network & IT management use cases, recommendation engines, and knowledge management. These are all cases where patterns of interaction in your data (for example, a pattern of structured financial transactions) matter more than the individual data points (a single transfer). We'll cover how to easily transform Kafka streams or tables into graphs and query them declaratively using Cypher or GraphQL. In graph shape, we can enrich our social network streams with powerful graph algorithms that tell us about user and event influence through graph centrality, then stream the results back to Kafka. Stream/table duality becomes the stream/table/graph trinity. We will demonstrate the trinity by:
- Getting started with regular Kafka streams
- Using Confluent Hub's Neo4j sink
- Exposing query-able graphs with Cypher & GraphQL
- Analyzing data with Neo4j's graph algorithms
- Transforming graphs back into streams
The trinity means not choosing between representations, but using the best one for your use case. We'll demonstrate how it can be used to tackle social network analysis problems and discuss how the approach can be extended to real-time financial fraud detection and more.
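The stream-to-graph idea can be sketched in plain Python: fold a stream of interaction events into a graph view, then run a graph query over it. This is a toy stand-in; Kafka, the Neo4j sink, and Cypher are replaced by illustrative data structures, and the event list is invented for the example:

```python
from collections import defaultdict

# A toy "stream" of interaction events (who -> whom), as a broker might deliver them.
events = [
    ("alice", "bob"), ("alice", "carol"), ("bob", "carol"),
    ("dave", "alice"), ("carol", "alice"),
]

def fold_stream_into_graph(events):
    """Materialize the event stream as a graph view (adjacency sets)."""
    graph = defaultdict(set)
    for src, dst in events:
        graph[src].add(dst)
        graph[dst].add(src)  # treat interactions as undirected edges
    return graph

def degree_centrality(graph):
    """Simplest 'influence' score: fraction of other nodes each node touches."""
    n = len(graph)
    return {node: len(nbrs) / (n - 1) for node, nbrs in graph.items()}

graph = fold_stream_into_graph(events)
scores = degree_centrality(graph)
```

The centrality scores could then be published back to the stream, which is the round trip the trinity describes: stream in, graph view, enriched stream out.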
Implementing the FRBR Conceptual Model in the Variations Music Discovery System (Jenn Riley)
Riley, Jenn, Paul McElwain and Alex Berry. "Implementing the FRBR Conceptual Model in the Variations Music Discovery System." Digital Library Program Brown Bag Presentation, October 28, 2009.
This week we have hierarchical community detection using Louvain and a Case Law Network Graph. We also learn how to create a schema.org linked data endpoint on Neo4j, handle authorization in the GRANDstack, learn taxonomies from user-tagged data, and more!
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details, visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... (pchutichetpong)
M Capital Group ("MCG") expects demand to keep growing and supply to keep evolving, facilitated by institutional investment rotating out of offices and into work-from-home ("WFH") arrangements, while the need for data storage expands ever further as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, as represented by the recent second bankruptcy filing of Sungard, which blames "COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services", the industry has seen key adjustments; MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment, will drive market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow to more than 3.6x their current value by 2026, will likely help propel data center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Opendatabay - Open Data Marketplace.pptx (Opendatabay)
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
The first open hub for data enthusiasts to collaborate and innovate: a platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, Opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. It leverages cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay also breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
The effect of service quality and online reviews on customer loyalty in the E...
Gradoop: Scalable Graph Analytics with Apache Flink @ Flink & Neo4j Meetup Berlin
1. GRADOOP: Scalable Graph Analytics with Apache Flink
Martin Junghanns (@kc1s)
Apache Flink and Neo4j Meetup Berlin
2. About the speaker and the team
André (PhD Student), Martin (PhD Student), Kevin (M.Sc. Student), Niklas (M.Sc. Student), Prof. Dr. Erhard Rahm (Database Chair)
10. „Graphs can be analyzed“
Apache Flink and Neo4j Meetup Berlin 10
Assuming a social network (heterogeneous data):
1. Determine subgraph • Apply graph transformation
2. Find communities • Handle collections of graphs
3. Filter communities • Aggregation, Selection
4. Find common subgraph • Apply dedicated algorithm
24. „And let‘s not forget …“
Apache Flink and Neo4j Meetup Berlin 24
26. „A framework and research platform for efficient,
distributed and domain independent management
and analytics of heterogeneous graph data.“
Apache Flink and Neo4j Meetup Berlin 26
33. Apache Flink and Neo4j Meetup Berlin 29
Extended Property Graph Model (EPGM)
34. Extended Property Graph Model
Apache Flink and Neo4j Meetup Berlin 30
• Vertices and directed Edges
• Logical Graphs
• Identifiers
• Type Labels
• Properties
[Example diagram: two logical graphs, 1|Community|interest:Heavy Metal and 2|Community|interest:Hard Rock. Graph 1 contains Person "Alice" (born 1984), Band "Metallica" (founded 1981) and Person "Bob"; graph 2 contains Bob, Band "AC/DC" (founded 1973) and Person "Eve". Edges are typed likes (since 2013–2015) and knows.]
96. Flink DataSet API
Apache Flink and Neo4j Meetup Berlin 50
• DataSet := Distributed Collection of Data Objects
• Transformation := Operation on DataSets
• Flink Program := Composition of Transformations
[Diagram: DataSets flowing through Transformations into new DataSets, composed into a Flink Program]
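The map/filter-style composition of transformations described above can be imitated locally with the JDK Stream API (a local analogy only; real Flink DataSet programs are lazily compiled into a distributed dataflow):

```java
import java.util.List;
import java.util.stream.Collectors;

public class DataflowAnalogy {
    public static void main(String[] args) {
        // A "DataSet" of raw records ...
        List<String> names = List.of("Alice", "Bob", "Eve", "Metallica");

        // ... pushed through a composition of transformations,
        // analogous to dataSet.filter(...).map(...) in Flink.
        List<String> shouted = names.stream()
            .filter(n -> n.length() <= 3)   // transformation 1: filter
            .map(String::toUpperCase)       // transformation 2: map
            .collect(Collectors.toList());  // terminal step materializes the result

        System.out.println(shouted); // [BOB, EVE]
    }
}
```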
110. Graph Representation
Apache Flink and Neo4j Meetup Berlin 52
[Example diagram: the two Community graphs from the EPGM example, shown next to their DataSet representation]
DataSet<EPGMGraphHead>
Id Label Properties
1 Community {interest:Heavy Metal}
2 Community {interest:Hard Rock}
DataSet<EPGMVertex>
Id Label Properties Graphs
1 Person {name:Alice, born:1984} {1}
2 Band {name:Metallica,founded:1981} {1}
3 Person {name:Bob} {1,2}
4 Band {name:AC/DC,founded:1973} {2}
5 Person {name:Eve} {2}
DataSet<EPGMEdge>
Id Label Source Target Properties Graphs
1 likes 1 2 {since:2014} {1}
2 likes 3 2 {since:2013} {1}
3 likes 3 4 {since:2015} {2}
4 knows 3 5 {} {2}
5 likes 5 4 {since:2014} {2}
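The vertex table above can be modeled with a plain Java record standing in for EPGMVertex (a simplified local sketch, not the Gradoop type): because every vertex carries its set of graph ids, "which vertices belong to logical graph 2?" becomes a simple filter over the collection, just like a Flink filter transformation.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class GraphRepresentation {
    // Simplified stand-in for Gradoop's EPGMVertex: id, label, graph membership.
    record Vertex(long id, String label, Set<Long> graphs) {}

    public static void main(String[] args) {
        List<Vertex> vertices = List.of(
            new Vertex(1, "Person", Set.of(1L)),
            new Vertex(2, "Band",   Set.of(1L)),
            new Vertex(3, "Person", Set.of(1L, 2L)),
            new Vertex(4, "Band",   Set.of(2L)),
            new Vertex(5, "Person", Set.of(2L)));

        // Select the vertices of logical graph 2 via their Graphs column.
        Set<Long> inGraph2 = vertices.stream()
            .filter(v -> v.graphs().contains(2L))
            .map(Vertex::id)
            .collect(Collectors.toSet());

        System.out.println(inGraph2); // the ids 3, 4 and 5
    }
}
```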
117. Operator Implementation – Exclusion
Apache Flink and Neo4j Meetup Berlin 54
[Example diagram: the two Community graphs; excluding graph 2 from graph 1 keeps only the elements of graph 1 that are not contained in graph 2]
// input: firstGraph (G[1]), secondGraph (G[2])
DataSet<GradoopId> graphId = secondGraph.getGraphHead()
  .map(new Id<G>());

DataSet<V> newVertices = firstGraph.getVertices()
  .filter(new NotInGraphBroadCast<V>())
  .withBroadcastSet(graphId, GRAPH_ID);

DataSet<E> newEdges = firstGraph.getEdges()
  .filter(new NotInGraphBroadCast<E>())
  .withBroadcastSet(graphId, GRAPH_ID)
  .join(newVertices)
  .where(new SourceId<E>()).equalTo(new Id<V>())
  .with(new LeftSide<E, V>())
  .join(newVertices)
  .where(new TargetId<E>()).equalTo(new Id<V>())
  .with(new LeftSide<E, V>());
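The same exclusion logic can be checked on the toy data with plain collections (a single-machine sketch; the broadcast set and the two joins in the Flink code are what make it distributed). Keep the vertices of the first graph that are not in the second, then keep the edges of the first graph whose source and target both survived:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class ExclusionSketch {
    record Vertex(long id, Set<Long> graphs) {}
    record Edge(long id, long source, long target, Set<Long> graphs) {}

    public static void main(String[] args) {
        long first = 1L, second = 2L; // exclude graph 2 from graph 1
        List<Vertex> vertices = List.of(
            new Vertex(1, Set.of(1L)), new Vertex(2, Set.of(1L)),
            new Vertex(3, Set.of(1L, 2L)), new Vertex(4, Set.of(2L)),
            new Vertex(5, Set.of(2L)));
        List<Edge> edges = List.of(
            new Edge(1, 1, 2, Set.of(1L)), new Edge(2, 3, 2, Set.of(1L)),
            new Edge(3, 3, 4, Set.of(2L)), new Edge(4, 3, 5, Set.of(2L)),
            new Edge(5, 5, 4, Set.of(2L)));

        // Vertices of the first graph NOT contained in the second graph
        // (the role of NotInGraphBroadCast + withBroadcastSet above).
        Set<Long> newVertexIds = vertices.stream()
            .filter(v -> v.graphs().contains(first) && !v.graphs().contains(second))
            .map(Vertex::id)
            .collect(Collectors.toSet());

        // Surviving edges: not in the second graph, and both endpoints kept
        // (the role of the two joins on SourceId/TargetId above).
        List<Long> newEdgeIds = edges.stream()
            .filter(e -> e.graphs().contains(first) && !e.graphs().contains(second))
            .filter(e -> newVertexIds.contains(e.source()) && newVertexIds.contains(e.target()))
            .map(Edge::id)
            .collect(Collectors.toList());

        System.out.println(newVertexIds); // vertices 1 and 2 (Alice, Metallica)
        System.out.println(newEdgeIds);   // edge 1 only: Bob's vertex was removed
    }
}
```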
121. Operator Implementation – Exclusion
Apache Flink and Neo4j Meetup Berlin 55
graphId = secondGraph.getGraphHead()
  .map(new Id<G>());
// graph head of the second graph:
Id Label Properties
2 Community {interest:Hard Rock}
// mapped to its id:
Id
2
newVertices = firstGraph.getVertices()
  .filter(new NotInGraphBroadCast<V>())
  .withBroadcastSet(graphId, GRAPH_ID);
// vertices of the first graph:
Id Label Properties Graphs
1 Person {name:Alice} {1}
2 Band {name:Metallica,founded:1981} {1}
3 Person {name:Bob} {1,2}
// after filtering out the vertices contained in graph 2:
Id Label Properties Graphs
1 Person {name:Alice} {1}
2 Band {name:Metallica,founded:1981} {1}
150. Social Network Benchmark
Apache Flink and Neo4j Meetup Berlin 61
1. Extract subgraph containing only Persons and knows relations
2. Transform Persons to necessary information
3. Find communities using Label Propagation
4. Aggregate vertex count for each community
5. Select communities with more than 50K users
6. Combine large communities to a single graph
7. Group graph by Persons' location and gender
8. Aggregate vertex and edge count of grouped graph
http://www.ldbcouncil.org/
https://git.io/vgozj
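Steps 7 and 8 of the workflow boil down to a group-and-count over vertices. On toy data the effect can be sketched with plain Java (a local illustration, not the Gradoop grouping operator; the Person fields are invented for the example):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupingSketch {
    record Person(String name, String city, String gender) {}

    public static void main(String[] args) {
        List<Person> persons = List.of(
            new Person("Alice", "Leipzig", "f"),
            new Person("Bob",   "Leipzig", "m"),
            new Person("Carol", "Berlin",  "f"),
            new Person("Dave",  "Leipzig", "m"));

        // Steps 7/8: group vertices by (location, gender), aggregate the
        // vertex count per group; each group becomes one summary vertex.
        Map<String, Long> grouped = persons.stream()
            .collect(Collectors.groupingBy(
                p -> p.city() + "|" + p.gender(),
                Collectors.counting()));

        System.out.println(grouped.get("Leipzig|m")); // 2
        System.out.println(grouped.get("Berlin|f"));  // 1
    }
}
```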