SlideShare a Scribd company logo
1 of 60
Download to read offline
GRADOOP: Scalable Graph
Analytics with Apache Flink
Martin Junghanns
@kc1s
Leipzig University
About the speaker and the team
André, PhD StudentMartin, PhD Student
Kevin, M.Sc. StudentNiklas, M.Sc. Student
Prof. Dr. Erhard Rahm
Database Chair
Motivation
𝑮𝑟𝑟𝑟𝑟 = (𝑽𝑒𝑒𝑒𝑒𝑒𝑒𝑒, 𝑬𝑑𝑑𝑑𝑑)
“Graphs are everywhere”
𝐺𝐺𝐺𝐺𝐺 = (𝐔𝐔𝐔𝐔𝐔, 𝐹𝐹𝐹𝐹𝐹𝐹𝐹𝐹𝐹)
“Graphs are everywhere”
Alice
Bob
Eve
Dave
Carol
Mallory
Peggy
Trent
𝐺𝐺𝐺𝐺𝐺 = (𝐔𝐔𝐔𝐔𝐔, 𝐹𝐹𝐹𝐹𝐹𝐹𝐹𝐹𝐹𝐹𝐹)
“Graphs are everywhere”
Alice
Bob
Eve
Dave
Carol
Mallory
Peggy
Trent
“Graphs are heterogeneous”
Alice
Bob
AC/DC
Dave
Carol
Mallory
Peggy
Metallica
𝐺𝐺𝐺𝐺𝐺 = (𝐔𝐔𝐔𝐔𝐔 ∪ 𝐁𝐁𝐁𝐁𝐁, 𝐹𝐹𝐹𝐹𝐹𝐹𝐹𝐹𝐹𝐹𝐹 ∪ 𝐿𝐿𝐿𝐿𝐿)
𝐺𝐺𝐺𝐺𝐺 = (𝐔𝐔𝐔𝐔𝐔 ∪ 𝐁𝐁𝐁𝐁𝐁, 𝐹𝐹𝐹𝐹𝐹𝐹𝐹𝐹𝐹𝐹𝐹 ∪ 𝐿𝐿𝐿𝐿𝐿)
“Graphs can be analyzed”
Alice
Bob
AC/DC
Dave
Carol
Mallory
Peggy
Metallica
0.2
0.28
0.26
0.33
0.25
0.26
“Graphs can be analyzed”
Alice
Bob
AC/DC
Dave
Carol
Mallory
Peggy
Metallica
3.6
2.82
𝐺𝐺𝐺𝐺𝐺 = (𝐔𝐔𝐔𝐔𝐔 ∪ 𝐁𝐁𝐁𝐁𝐁, 𝐹𝐹𝐹𝐹𝐹𝐹𝐹𝐹𝐹𝐹𝐹 ∪ 𝐿𝐿𝐿𝐿𝐿)
“Graphs can be analyzed“
Assuming a social network
“Graphs can be analyzed“
Assuming a social network
1. Determine subgraph
“Graphs can be analyzed“
Assuming a social network
1. Determine subgraph
2. Find communities
“Graphs can be analyzed“
Assuming a social network
1. Determine subgraph
2. Find communities
3. Filter communities
“Graphs can be analyzed“
Assuming a social network
1. Determine subgraph
2. Find communities
3. Filter communities
4. Find common subgraph
“Graph data models must be expressive“
Assuming a social network
 Heterogeneous data
1. Determine subgraph
 Apply graph transformation
2. Find communities
 Handle collections of graphs
3. Filter communities
 Aggregation, Selection
4. Find common subgraph
 Apply dedicated algorithms
„And let‘s not forget …“
“…Graphs are large”
A framework and research platform for efficient,
distributed and domain independent graph data
management and analytics.
High Level Architecture
HDFS/YARN
Cluster
HBase Distributed Graph Store
Extended Property Graph Model
Flink Operator Implementation
Graph Analytical DSL  Java
 25K (33K) LOC
 GPLv3
Extended Property Graph Model (EPGM)
EPGM – Graph Representation
0
1
2
3
4
5
6 7 8 9
10
11 12 13 14
15
16
17
18 19 20 21
22
23
knows
since : 2014
knows
since : 2014
knows
since : 2013
hasInterest
hasInterest hasInterest
hasInterest
hasModeratorhasModerator
hasMember hasMember
hasMember hasMember
hasTag hasTaghasTag hasTag
knows
since : 2013
knows
since : 2014
knows
since : 2014
knows
since : 2015
knows
since : 2015
knows
since : 2015
knows
since : 2013
[0] Tag
name : Databases
[1] Tag
name : Graphs
[2] Tag
name : Hadoop
[3] Forum
title : Graph Databases
[4] Forum
title : Graph Processing
[5] Person
name : Alice
gender : f
city : Leipzig
age : 23
[6] Person
name : Bob
gender : m
city : Leipzig
age : 30
[7] Person
name : Carol
gender : f
city : Dresden
age : 30
[8] Person
name : Dave
gender : m
city : Dresden
age : 42
[9] Person
name : Eve
gender : f
city : Dresden
age : 35
speaks : en
[10] Person
name : Frank
gender : m
city : Berlin
age : 23
IP: 169.32.1.3
[2] Community | interest : Graphs | vertexCount : 4
[1] Community | interest : Hadoop| vertexCount : 3[0] Community | interest : Databases | vertexCount : 3
EPGM – Graph Representation
[0] Tag
name : Databases
[1] Tag
name : Graphs
[2] Tag
name : Hadoop
[3] Forum
title : Graph Databases
[4] Forum
title : Graph Processing
[5] Person
name : Alice
gender : f
city : Leipzig
age : 23
[6] Person
name : Bob
gender : m
city : Leipzig
age : 30
[7] Person
name : Carol
gender : f
city : Dresden
age : 30
[8] Person
name : Dave
gender : m
city : Dresden
age : 42
[9] Person
name : Eve
gender : f
city : Dresden
age : 35
speaks : en
[10] Person
name : Frank
gender : m
city : Berlin
age : 23
IP: 169.32.1.3
0
1
2
3
4
5
6 7 8 9
10
11 12 13 14
15
16
17
18 19 20 21
22
23
knows
since : 2014
knows
since : 2014
knows
since : 2013
hasInterest
hasInterest hasInterest
hasInterest
hasModeratorhasModerator
hasMember hasMember
hasMember hasMember
hasTag hasTaghasTag hasTag
knows
since : 2013
knows
since : 2014
knows
since : 2014
knows
since : 2015
knows
since : 2015
knows
since : 2015
knows
since : 2013
EPGM – Operators and Algorithms
Operators
Unary Binary
GraphCollectionLogicalGraph
Algorithms
Aggregation
Pattern Matching
Transformation
Grouping Equality
Call *
Combination
Overlap
Exclusion
Equality
Union
Intersection
Difference
Gelly Library
BTG Extraction
Frequent Subgraphs
Limit
Selection
Distinct
Sort
Apply *
Reduce *
Call *
* auxiliary
Adaptive Partitioning
Subgraph
EPGM – Operators and Algorithms
Operators
Unary Binary
GraphCollectionLogicalGraph
Algorithms
Aggregation
Pattern Matching
Transformation
Grouping Equality
Call *
Combination
Overlap
Exclusion
Equality
Union
Intersection
Difference
Gelly Library
BTG Extraction
Frequent Subgraphs
Limit
Selection
Distinct
Sort
Apply *
Reduce *
Call *
* auxiliary
Adaptive Partitioning
Subgraph
EPGM – Operators and Algorithms
Operators
Unary Binary
GraphCollectionLogicalGraph
Algorithms
Aggregation
Pattern Matching
Transformation
Grouping Equality
Call *
Combination
Overlap
Exclusion
Equality
Union
Intersection
Difference
Gelly Library
BTG Extraction
Frequent Subgraphs
Limit
Selection
Distinct
Sort
Apply *
Reduce *
Call *
* auxiliary
Adaptive Partitioning
Subgraph
Combination
1: personGraph = db.G[0].combine(db.G[1]).combine(db.G[2])
[2] Community | interest : Graphs| vertexCount : 4
[1] Community | interest : Hadoop| vertexCount : 3[0] Community | interest : Databases | vertexCount : 3
[0] Tag
name : Databases
[1] Tag
name : Graphs
[2] Tag
name : Hadoop
[3] Forum
title : Graph Databases
[4] Forum
title : Graph Processing
[5] Person
name : Alice
gender : f
city : Leipzig
age : 23
[6] Person
name : Bob
gender : m
city : Leipzig
age : 30
[7] Person
name : Carol
gender : f
city : Dresden
age : 30
[8] Person
name : Dave
gender : m
city : Dresden
age : 42
[9] Person
name : Eve
gender : f
city : Dresden
age : 35
speaks : en
[10] Person
name : Frank
gender : m
city : Berlin
age : 23
IP: 169.32.1.3
0
1
2
3
4
5
6 7 8 9
10
11 12 13 14
15
16
17
18 19 20 21
22
23
knows
since : 2014
knows
since : 2014
knows
since : 2013
hasInterest
hasInterest hasInterest
hasInterest
hasModeratorhasModerator
hasMember hasMember
hasMember hasMember
hasTag hasTaghasTag hasTag
knows
since : 2013
knows
since : 2014
knows
since : 2014
knows
since : 2015
knows
since : 2015
knows
since : 2015
knows
since : 2013
DB
[0] Community | interest : Databases | vertexCount : 3
[1] Community | interest : Hadoop| vertexCount : 3[0] Community | interest : Databases | vertexCount : 3
[0] Tag
name : Databases
[1] Tag
name : Graphs
[2] Tag
name : Hadoop
[3] Forum
title : Graph Databases
[4] Forum
title : Graph Processing
10
11 12 13 14
15
16
17
18 19 20 21
22
23
hasInterest
hasInterest hasInterest
hasInterest
hasModeratorhasModerator
hasMember hasMember
hasMember hasMember
hasTag hasTaghasTag hasTag
DB
Combination
[4]
[5] Person
name : Alice
gender : f
city : Leipzig
age : 23
[6] Person
name : Bob
gender : m
city : Leipzig
age : 30
[7] Person
name : Carol
gender : f
city : Dresden
age : 30
[8] Person
name : Dave
gender : m
city : Dresden
age : 42
[9] Person
name : Eve
gender : f
city : Dresden
age : 35
speaks : en
[10] Person
name : Frank
gender : m
city : Berlin
age : 23
IP: 169.32.1.3
0
1
2
3
4
5
6 7 8 9
knows
since : 2014
knows
since : 2014
knows
since : 2013
knows
since : 2013
knows
since : 2014
knows
since : 2014
knows
since : 2015
knows
since : 2015
knows
since : 2015
knows
since : 2013
1: personGraph = db.G[0].combine(db.G[1]).combine(db.G[2])
EPGM – Operators and Algorithms
Operators
Unary Binary
GraphCollectionLogicalGraph
Algorithms
Aggregation
Pattern Matching
Transformation
Grouping Equality
Call *
Combination
Overlap
Exclusion
Equality
Union
Intersection
Difference
Gelly Library
BTG Extraction
Frequent Subgraphs
Limit
Selection
Distinct
Sort
Apply *
Reduce *
Call *
* auxiliary
Adaptive Partitioning
Subgraph
[0] Community | interest : Databases | vertexCount : 3
[1] Community | interest : Hadoop| vertexCount : 3[0] Community | interest : Databases | vertexCount : 3
[0] Tag
name : Databases
[1] Tag
name : Graphs
[2] Tag
name : Hadoop
[3] Forum
title : Graph Databases
[4] Forum
title : Graph Processing
10
11 12 13 14
15
16
17
18 19 20 21
22
23
hasInterest
hasInterest hasInterest
hasInterest
hasModeratorhasModerator
hasMember hasMember
hasMember hasMember
hasTag hasTaghasTag hasTag
DB
Combination + Grouping
[4]
[5] Person
name : Alice
gender : f
city : Leipzig
age : 23
[6] Person
name : Bob
gender : m
city : Leipzig
age : 30
[7] Person
name : Carol
gender : f
city : Dresden
age : 30
[8] Person
name : Dave
gender : m
city : Dresden
age : 42
[9] Person
name : Eve
gender : f
city : Dresden
age : 35
speaks : en
[10] Person
name : Frank
gender : m
city : Berlin
age : 23
IP: 169.32.1.3
0
1
2
3
4
5
6 7 8 9
knows
since : 2014
knows
since : 2014
knows
since : 2013
knows
since : 2013
knows
since : 2014
knows
since : 2014
knows
since : 2015
knows
since : 2015
knows
since : 2015
knows
since : 2013
1: personGraph = db.G[0].combine(db.G[1]).combine(db.G[2])
2: vertexGroupingKeys = [:label, “city”]
3: edgeGroupingKeys = [: label]
4: vertexAggFunc = (superVertex, vertices => superVertex[“count”] = |vertices|)
5: edgeAggFunc = (superEdge, edges => superEdge[“count”] = |edges|)
6: sumGraph = personGraph.groupBy(vertexGroupingKeys, vertexAggFunc,
edgeGroupingKeys, edgeAggFunc)
[0] Community | interest : Databases | vertexCount : 3
[1] Community | interest : Hadoop| vertexCount : 3[0] Community | interest : Databases | vertexCount : 3
[0] Tag
name : Databases
[1] Tag
name : Graphs
[2] Tag
name : Hadoop
[3] Forum
title : Graph Databases
[4] Forum
title : Graph Processing
10
11 12 13 14
15
16
17
18 19 20 21
22
23
hasInterest
hasInterest hasInterest
hasInterest
hasModeratorhasModerator
hasMember hasMember
hasMember hasMember
hasTag hasTaghasTag hasTag
DB
Combination + Grouping
[4]
[5] Person
name : Alice
gender : f
city : Leipzig
age : 23
[6] Person
name : Bob
gender : m
city : Leipzig
age : 30
[7] Person
name : Carol
gender : f
city : Dresden
age : 30
[8] Person
name : Dave
gender : m
city : Dresden
age : 42
[9] Person
name : Eve
gender : f
city : Dresden
age : 35
speaks : en
[10] Person
name : Frank
gender : m
city : Berlin
age : 23
IP: 169.32.1.3
0
1
2
3
4
5
6 7 8 9
knows
since : 2014
knows
since : 2014
knows
since : 2013
knows
since : 2013
knows
since : 2014
knows
since : 2014
knows
since : 2015
knows
since : 2015
knows
since : 2015
knows
since : 2013
1: personGraph = db.G[0].combine(db.G[1]).combine(db.G[2])
2: vertexGroupingKeys = [:label, “city”]
3: edgeGroupingKeys = [: label]
4: vertexAggFunc = (superVertex, vertices => superVertex[“count”] = |vertices|)
5: edgeAggFunc = (superEdge, edges => superEdge[“count”] = |edges|)
6: sumGraph = personGraph.groupBy(vertexGroupingKeys, vertexAggFunc,
edgeGroupingKeys, edgeAggFunc)
[5]
[11] Person
city : Leipzig
count : 2
[12] Person
city : Dresden
count : 3
[13] Person
city : Berlin
count : 1
24
25
26
27
28
knows
count : 3
knows
count : 1
knows
count : 2
knows
count : 2
knows
count : 2
[0] Community | interest : Databases | vertexCount : 3
[1] Community | interest : Hadoop| vertexCount : 3[0] Community | interest : Databases | vertexCount : 3
[0] Tag
name : Databases
[1] Tag
name : Graphs
[2] Tag
name : Hadoop
[3] Forum
title : Graph Databases
[4] Forum
title : Graph Processing
10
11 12 13 14
15
16
17
18 19 20 21
22
23
hasInterest
hasInterest hasInterest
hasInterest
hasModeratorhasModerator
hasMember hasMember
hasMember hasMember
hasTag hasTaghasTag hasTag
DB
Combination + Grouping + Aggregation
[4]
[5] Person
name : Alice
gender : f
city : Leipzig
age : 23
[6] Person
name : Bob
gender : m
city : Leipzig
age : 30
[7] Person
name : Carol
gender : f
city : Dresden
age : 30
[8] Person
name : Dave
gender : m
city : Dresden
age : 42
[9] Person
name : Eve
gender : f
city : Dresden
age : 35
speaks : en
[10] Person
name : Frank
gender : m
city : Berlin
age : 23
IP: 169.32.1.3
0
1
2
3
4
5
6 7 8 9
knows
since : 2014
knows
since : 2014
knows
since : 2013
knows
since : 2013
knows
since : 2014
knows
since : 2014
knows
since : 2015
knows
since : 2015
knows
since : 2015
knows
since : 2013
1: personGraph = db.G[0].combine(db.G[1]).combine(db.G[2])
2: vertexGroupingKeys = [:label, “city”]
3: edgeGroupingKeys = [: label]
4: vertexAggFunc = (superVertex, vertices => superVertex[“count”] = |vertices|)
5: edgeAggFunc = (superEdge, edges => superEdge[“count”] = |edges|)
6: sumGraph = personGraph.groupBy(vertexGroupingKeys, vertexAggFunc,
edgeGroupingKeys, edgeAggFunc)
7: aggFunc = (g => |g.E|)
8: aggGraph = sumGraph.aggregate(“edgeCount”, aggFunc)
[5]
[11] Person
city : Leipzig
count : 2
[12] Person
city : Dresden
count : 3
[13] Person
city : Berlin
count : 1
24
25
26
27
28
knows
count : 3
knows
count : 1
knows
count : 2
knows
count : 2
knows
count : 2
[0] Community | interest : Databases | vertexCount : 3
[1] Community | interest : Hadoop| vertexCount : 3[0] Community | interest : Databases | vertexCount : 3
[0] Tag
name : Databases
[1] Tag
name : Graphs
[2] Tag
name : Hadoop
[3] Forum
title : Graph Databases
[4] Forum
title : Graph Processing
10
11 12 13 14
15
16
17
18 19 20 21
22
23
hasInterest
hasInterest hasInterest
hasInterest
hasModeratorhasModerator
hasMember hasMember
hasMember hasMember
hasTag hasTaghasTag hasTag
DB
Combination + Grouping + Aggregation
[4]
[5] Person
name : Alice
gender : f
city : Leipzig
age : 23
[6] Person
name : Bob
gender : m
city : Leipzig
age : 30
[7] Person
name : Carol
gender : f
city : Dresden
age : 30
[8] Person
name : Dave
gender : m
city : Dresden
age : 42
[9] Person
name : Eve
gender : f
city : Dresden
age : 35
speaks : en
[10] Person
name : Frank
gender : m
city : Berlin
age : 23
IP: 169.32.1.3
0
1
2
3
4
5
6 7 8 9
knows
since : 2014
knows
since : 2014
knows
since : 2013
knows
since : 2013
knows
since : 2014
knows
since : 2014
knows
since : 2015
knows
since : 2015
knows
since : 2015
knows
since : 2013
1: personGraph = db.G[0].combine(db.G[1]).combine(db.G[2])
2: vertexGroupingKeys = [:label, “city”]
3: edgeGroupingKeys = [: label]
4: vertexAggFunc = (superVertex, vertices => superVertex[“count”] = |vertices|)
5: edgeAggFunc = (superEdge, edges => superEdge[“count”] = |edges|)
6: sumGraph = personGraph.groupBy(vertexGroupingKeys, vertexAggFunc,
edgeGroupingKeys, edgeAggFunc)
7: aggFunc = (g => |g.E|)
8: aggGraph = sumGraph.aggregate(“edgeCount”, aggFunc)
[5] edgeCount : 5
[11] Person
city : Leipzig
count : 2
[12] Person
city : Dresden
count : 3
[13] Person
city : Berlin
count : 1
24
25
26
27
28
knows
count : 3
knows
count : 1
knows
count : 2
knows
count : 2
knows
count : 2
EPGM – Operators and Algorithms
Operators
Unary Binary
GraphCollectionLogicalGraph
Algorithms
Aggregation
Pattern Matching
Transformation
Grouping Equality
Call *
Combination
Overlap
Exclusion
Equality
Union
Intersection
Difference
Gelly Library
BTG Extraction
Frequent Subgraphs
Limit
Selection
Distinct
Sort
Apply *
Reduce *
Call *
* auxiliary
Adaptive Partitioning
Subgraph
Selection
1: resultColl = db.G[0,1,2].select((g => g[“vertexCount”] > 3))
[2] Community | interest : Graphs | vertexCount : 4
[1] Community | interest : Hadoop| vertexCount : 3[0] Community | interest : Databases | vertexCount : 3
[0] Tag
name : Databases
[1] Tag
name : Graphs
[2] Tag
name : Hadoop
[3] Forum
title : Graph Databases
[4] Forum
title : Graph Processing
[5] Person
name : Alice
gender : f
city : Leipzig
age : 23
[6] Person
name : Bob
gender : m
city : Leipzig
age : 30
[7] Person
name : Carol
gender : f
city : Dresden
age : 30
[8] Person
name : Dave
gender : m
city : Dresden
age : 42
[9] Person
name : Eve
gender : f
city : Dresden
age : 35
speaks : en
[10] Person
name : Frank
gender : m
city : Berlin
age : 23
IP: 169.32.1.3
0
1
2
3
4
5
6 7 8 9
10
11 12 13 14
15
16
17
18 19 20 21
22
23
knows
since : 2014
knows
since : 2014
knows
since : 2013
hasInterest
hasInterest hasInterest
hasInterest
hasModeratorhasModerator
hasMember hasMember
hasMember hasMember
hasTag hasTaghasTag hasTag
knows
since : 2013
knows
since : 2014
knows
since : 2014
knows
since : 2015
knows
since : 2015
knows
since : 2015
knows
since : 2013
DB
Selection
1: resultColl = db.G[0,1,2].select((g => g[“vertexCount”] > 3))
[1] Community | interest : Hadoop| vertexCount : 3[0] Community | interest : Databases | vertexCount : 3
[0] Tag
name : Databases
[1] Tag
name : Graphs
[2] Tag
name : Hadoop
[3] Forum
title : Graph Databases
[4] Forum
title : Graph Processing
[9] Person
name : Eve
gender : f
city : Dresden
age : 35
speaks : en
[10] Person
name : Frank
gender : m
city : Berlin
age : 23
IP: 169.32.1.3
6 7 8 9
10
11 12 13 14
15
16
17
18 19 20 21
22
23
hasInterest
hasInterest hasInterest
hasInterest
hasModeratorhasModerator
hasMember hasMember
hasMember hasMember
hasTag hasTaghasTag hasTag
knows
since : 2015
knows
since : 2015
knows
since : 2015
knows
since : 2013
DB
[2] Community | interest : Graphs | vertexCount : 4
[5] Person
name : Alice
gender : f
city : Leipzig
age : 23
[6] Person
name : Bob
gender : m
city : Leipzig
age : 30
[7] Person
name : Carol
gender : f
city : Dresden
age : 30
[8] Person
name : Dave
gender : m
city : Dresden
age : 42
0
1
2
3
4
5
knows
since : 2014
knows
since : 2014
knows
since : 2013
knows
since : 2013
knows
since : 2014
knows
since : 2014
EPGM – Operators and Algorithms
Operators
Unary Binary
GraphCollectionLogicalGraph
Algorithms
Aggregation
Pattern Matching
Transformation
Grouping Equality
Call *
Combination
Overlap
Exclusion
Equality
Union
Intersection
Difference
Gelly Library
BTG Extraction
Frequent Subgraphs
Limit
Selection
Distinct
Sort
Apply *
Reduce *
Call *
* auxiliary
Adaptive Partitioning
Subgraph
Apache Flink
Apache Flink
http://www.slideshare.net/robertmetzger1/apache-flink-meetup-munich-november-2015-flink-overview-architecture-integrations-and-use-case
„Streaming Dataflow Engine that provides
• data distribution,
• communication,
• and fault tolerance
for distributed computations over data streams.“
HDFS
LocalFS
HBase
JDBC
Kafka
RabbitMQ
Flume
(Neo4j) EmbeddedTezYarnClusterLocal
Streaming Dataflow Runtime
DataSet DataStream
HadoopMR
Table
Gelly
ML
Table
Zeppelin
Cascading
MRQL
Dataflow
Storm(wip)
Dataflow(wip)
SAMOA
Apache Flink – DataSet API
 DataSet := Distributed Collection of Data
 Transformation := Operation applied on DataSet
 Flink Program := Composition of Transformations
DataSet
DataSet
DataSet
Transformation
Transformation
DataSet
DataSet
Transformation DataSet
Flink Program
Apache Flink – DataSet Transformations
 aggregate
 coGroup
 cross
 distinct
 filter
 first-N
 flatMap
 groupBy
 join
 leftOuterJoin
 rightOuterJoin
 fullOuterJoin
 map
 mapPartition
 project
 reduce
 reduceGroup
 union
 iterate
 iterateDelta
The „Hello World“ of Big Data – Word Count
1: ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
2:
3: DataSet<String> text = env.fromElements( // or env.readTextFile(„hdfs://…“)
4: „He who controls the past controls the future.“,
5: „He who controls the present controls the past.“);
6:
7: DataSet<Tuple2<String, Integer>> wordCounts = text
8: .flatMap(new LineSplitter()) // splits the line and outputs (word, 1) tuples
9: .groupBy(0)
10: .sum(1);
11:
12: wordCounts.print(); // trigger execution
flatMap
„He who controls the past controls the future.“
„He who controls the present controls the past.“
(He,1)
(who,1)
(controls,1)
(the,1)
(past,1)
// ...
groupBy(0)
[(He,1),(He,1)]
[(who,1),(who,1)]
[(future,1)]
[(past,1),(past,1)]
[(present,1)]
// ...
sum(1)
(He,2)
(who,2)
(future,1)
(past,2)
(present,1)
// ...
EPGM on Apache Flink
EPGM on Apache Flink – User facing API
LogicalGraph
fromCollections(…) : LogicalGraph
fromDataSets(…) : LogicalGraph
fromGellyGraph(…) : LogicalGraph
getGraphHead() : DataSet<EPGMGraphHead>
toGellyGraph() : Graph
combine(…) : LogicalGraph
intersect(…) : LogicalGraph
groupBy(…) : LogicalGraph
match(…) : GraphCollection
// ...
GraphCollection
fromCollections(…) : GraphCollection
fromDataSets(…) : GraphCollection
getGraphHeads() : DataSet<EPGMGraphHead>
getGraph(…) : LogicalGraph
getGraphs(…) : GraphCollection
select(…) : GraphCollection
union(…) : GraphCollection
distinct(…) : GraphCollection
sortBy(…) : GraphCollection
// ...
GraphBase
getVertices() : DataSet<EPGMVertex>
getEdges() : DataSet<EPGMEdge>
// ...
graphHeads : DataSet<EPGMGraphHead>
vertices : DataSet<EPGMVertex>
edges : DataSet<EPGMEdge>
EPGMDatabase
fromCollections(…) : EPGMDatabase
fromJSONFile(…) : EPGMDatabase
fromHBase(…) : EPGMDatabase
writeAsJSON(…) : void
writeToHBase(…) : void
getDatabaseGraph() : LogicalGraph
// ...
EPGM on Apache Flink – DataSets
Id Label Properties Graphs
Id Label Properties SourceId TargetId Graphs
EPGMGraphHead
EPGMVertex
EPGMEdge
Id Label Properties POJO
POJO
POJO
DataSet<EPGMGraphHead>
DataSet<EPGMVertex>
DataSet<EPGMEdge>
Id Label Properties Graphs
EPGMVertex
GradoopId := UUID
128-bit
String PropertyList := List<Property>
Property := (String, PropertyValue)
PropertyValue := byte[]
GradoopIdSet := Set<GradoopId>
EPGM on Apache Flink – Exclusion
// input: firstGraph (G[0]), secondGraph (G[2])
1: DataSet<GradoopId> graphId = secondGraph.getGraphHead()
2: .map(new Id<G>());
3:
4: DataSet<V> newVertices = firstGraph.getVertices()
5: .filter(new NotInGraphBroadCast<V>())
6: .withBroadcastSet(graphId, GRAPH_ID);
7:
8: DataSet<E> newEdges = firstGraph.getEdges()
9: .filter(new NotInGraphBroadCast<E>())
10: .withBroadcastSet(graphId, GRAPH_ID)
11: .join(newVertices)
12: .where(new SourceId<E>().equalTo(new Id<V>())
13: .with(new LeftSide<E, V>())
14: .join(newVertices)
15: .where(new TargetId<E>().equalTo(new Id<V>())
16: .with(new LeftSide<E, V>());
db.G[0].exclude(db.G[2])
[2] Community | interest : Graphs| vertexCount : 4
[0] Community | interest : Databases | vertexCount : 3
[5] Person
name : Alice
gender : f
city : Leipzig
age : 23
[6] Person
name : Bob
gender : m
city : Leipzig
age : 30
[7] Person
name : Carol
gender : f
city : Dresden
age : 30
[8] Person
name : Dave
gender : m
city : Dresden
age : 42
[9] Person
name : Eve
gender : f
city : Dresden
age : 35
speaks : en
0
1
2
3
4
5
6 7
knows
since : 2014
knows
since : 2014
knows
since : 2013
knows
since : 2013
knows
since : 2014
knows
since : 2014
knows
since : 2015
knows
since : 2013
EPGM on Apache Flink – Exclusion
Id Label Properties
2 Community interest: Graphs vertexCount: 4
graphId =
secondGraph.getGraphHead()
Id
2
newVertices =
firstGraph.getVertices() Id Label Properties Graphs
5 Person name: Alice gender: f … [0, 2]
6 Person name: Bob gender: m … [0, 2]
9 Person name: Eve gender: f … [0]
Id Label Properties Graphs
9 Person name: Eve gender: f … [0]
.map(new Id<G>());
.filter(new NotInGraphBroadCast<V>())
.withBroadcastSet(graphId, GRAPH_ID);
EPGM in Apache Flink – Exclusion
newEdges =
firstGraph.getEdges()
Id Label SourceId TargetId Properties Graphs
0 knows 5 6 since: 2014 [0, 2]
1 knows 6 5 since: 2014 [0, 2]
6 knows 9 5 since: 2013 [0]
7 knows 9 6 since: 2015 [0]
Id Label SourceId TargetId Properties Graphs
6 knows 9 5 since: 2013 [0]
7 knows 9 6 since: 2015 [0]
Id Label SourceId TargetId … Id Label …
6 knows 9 5 … 9 Person …
7 knows 9 6 … 9 Person …
Id Label SourceId TargetId …
6 knows 9 5 …
7 knows 9 6 …
Id Label SourceId TargetId … Id Label …
Id Label SourceId TargetId ….with(new LeftSide<E, V>());
.join(newVertices)
.where(new TargetId<E>().equalTo(new Id<V>())
.with(new LeftSide<E, V>())
.join(newVertices)
.where(new SourceId<E>().equalTo(new Id<V>())
.filter(new NotInGraphBroadCast<E>())
.withBroadcastSet(graphId, GRAPH_ID)
Social Network Example
LDBC Social Network Data
http://ldbcouncil.org/
LDBC Social Network Data
socialNetwork
.subgraph(
(v => v.label == ‘Person‘),
(e => e.label == ‘knows‘))
.transform(
(gIn, gOut => gOut = gIn),
(vIn, vOut => {
vOut.label = vIn.label,
vOut[‘city‘] = vIn[‘city‘]
vOut[‘gender‘] = vIn[‘gender‘]
vOut[‘key‘] = vIn[‘birthday‘]}),
(eIn, eOut) => eOut.label = eIn.label))
.callForCollection(:LabelPropagation, [‘key‘, 4])
.apply(g =>
g.aggregate(‘vertexCount‘, (h => |h.V|))
.select(g => g[‘vertexCount‘] > 50000)
.reduce(g, h => g.combine(h))
.groupBy(
[‘city‘,‘gender‘], (superVertex, vertices =>
superVertex[‘count‘] = |vertices|),
[], (superEdge, edges =>
superEdge[‘count‘] = |edges|)
.aggregate(‘vCount‘, (g => |g.V|))
.aggregate(‘eCount‘, (g => |g.E|))
LDBC Social Network Data
socialNetwork
.subgraph(
(v => v.label == ‘Person‘),
(e => e.label == ‘knows‘))
.transform(
(gIn, gOut => gOut = gIn),
(vIn, vOut => {
vOut.label = vIn.label,
vOut[‘city‘] = vIn[‘city‘]
vOut[‘gender‘] = vIn[‘gender‘]
vOut[‘key‘] = vIn[‘birthday‘]}),
(eIn, eOut) => eOut.label = eIn.label))
.callForCollection(:LabelPropagation, [‘key‘, 4])
.apply(g =>
g.aggregate(‘vertexCount‘, (h => |h.V|))
.select(g => g[‘vertexCount‘] > 50000)
.reduce(g, h => g.combine(h))
.groupBy(
[‘city‘,‘gender‘], (superVertex, vertices =>
superVertex[‘count‘] = |vertices|),
[], (superEdge, edges =>
superEdge[‘count‘] = |edges|)
.aggregate(‘vCount‘, (g => |g.V|))
.aggregate(‘eCount‘, (g => |g.E|))
Benchmark Results
Dataset # Vertices # Edges
Graphalytics.1 61,613 2,026,082
Graphalytics.10 260,613 16,600,778
Graphalytics.100 1,695,613 147,437,275
Graphalytics.1000 12,775,613 1,363,747,260
Graphalytics.10000 90,025,613 10,872,109,028
 16x Intel(R) Xeon(R) 2.50GHz (6 Cores)
 16x 48 GB RAM
 1 Gigabit Ethernet
 Hadoop 2.6.0
 Flink 1.0-SNAPSHOT
 slots (per worker) 12
 jobmanager.heap.mb 2048
 taskmanager.heap.mb 40960
0
200
400
600
800
1000
1200
1 2 4 8 16
Runtime[s]
Number of workers
Runtime
Graphalytics.100
Benchmark Results
Dataset # Vertices # Edges
Graphalytics.1 61,613 2,026,082
Graphalytics.10 260,613 16,600,778
Graphalytics.100 1,695,613 147,437,275
Graphalytics.1000 12,775,613 1,363,747,260
Graphalytics.10000 90,025,613 10,872,109,028
1
2
4
8
16
1 2 4 8 16
Speedup
Number of workers
Speedup
Graphalytics.100 Linear
 16x Intel(R) Xeon(R) 2.50GHz (6 Cores)
 16x 48 GB RAM
 1 Gigabit Ethernet
 Hadoop 2.6.0
 Flink 1.0-SNAPSHOT
 slots (per worker) 12
 jobmanager.heap.mb 2048
 taskmanager.heap.mb 40960
Benchmark Results
Dataset # Vertices # Edges
Graphalytics.1 61,613 2,026,082
Graphalytics.10 260,613 16,600,778
Graphalytics.100 1,695,613 147,437,275
Graphalytics.1000 12,775,613 1,363,747,260
Graphalytics.10000 90,025,613 10,872,109,028
1
10
100
1000
10000
Runtime[s]
Datasets
 16x Intel(R) Xeon(R) 2.50GHz (6 Cores)
 16x 48 GB RAM
 1 Gigabit Ethernet
 Hadoop 2.6.0
 Flink 1.0-SNAPSHOT
 slots (per worker) 12
 jobmanager.heap.mb 2048
 taskmanager.heap.mb 40960
Current State & Future Work
Current State
Operators
Unary Binary
GraphCollectionLogicalGraph
Algorithms
Aggregation
Pattern Matching
Transformation
Grouping Equality
Call *
Combination
Overlap
Exclusion
Equality
Union
Intersection
Difference
Gelly Library
BTG Extraction
Frequent Subgraphs
Limit
Selection
Distinct
Sort
Apply *
Reduce *
Call *
* auxiliary
Adaptive Partitioning
Subgraph
Current State
 0.0.1 First Prototype (May 2015)
 Hadoop MapReduce and Giraph for operator implementations
 Too much complexity
 Performance loss through serialization in HDFS/HBase
 0.0.2 Using Flink as execution layer (June 2015)
 Basic operators
 0.1 December 2015
 System-side identifiers (UUID)
 Improved property handling
 More operator implementations (e.g., Equality, Bool operators)
 Code refactoring
 0.2-SNAPSHOT
 Graph Pattern Matching
 Frequent Subgraph Mining
 Memory optimization (96-bit ID, Dictionary Encoding, …)
 Tuple Implementation
Contributions to Flink
 FLINK-2411 Add basic graph summarization algorithm
 FLINK-2590 DataSetUtils.zipWithUniqueID creates duplicate Ids
 FLINK-2905 Add intersect method to Graph class
 FLINK-2910 Combine tests for binary graph operators
 FLINK-2941 Implement a neo4j - Flink/Gelly connector
 FLINK-2981 Update README for building docs
 FLINK-3064 Missing size check in GroupReduceOperatorBase leads to NPE
 FLINK-3118 Check if MessageFunction implements ResultTypeQueryable
 FLINK-3122 Generalize value type in LabelPropagation
 FLINK-3272 Generalize vertex value type in ConnectedComponents
 Flink Forward (October 2015)
 Meetup Big Data Usergroup Saxony (December 2015)
Contributions welcome!
 Code
 Operator implementations
 Performance Tuning
 Extend HBase Storage
 People
 Bachelor / Master Thesis
 Open PhD positions in Leipzig, Germany
 Data! and Use Cases
 We are researchers, we assume ...
Thank you!
www.gradoop.com
https://flink.apache.org
http://ldbcouncil.org/
http://dbs.uni-leipzig.de/file/GradoopTR.pdf
http://dbs.uni-leipzig.de/file/biiig-vldb2014.pdf
https://github.com/dbs-leipzig/gradoop
https://github.com/s1ck/gdl
https://github.com/s1ck/ldbc-flink-import
https://github.com/s1ck/flink-neo4j

More Related Content

What's hot

Finding Insights In Connected Data: Using Graph Databases In Journalism
Finding Insights In Connected Data: Using Graph Databases In JournalismFinding Insights In Connected Data: Using Graph Databases In Journalism
Finding Insights In Connected Data: Using Graph Databases In Journalism
William Lyon
 

What's hot (10)

Spot Sigs
Spot SigsSpot Sigs
Spot Sigs
 
Mapping Graph Queries to PostgreSQL
Mapping Graph Queries to PostgreSQLMapping Graph Queries to PostgreSQL
Mapping Graph Queries to PostgreSQL
 
Why do they call it Linked Data when they want to say...?
Why do they call it Linked Data when they want to say...?Why do they call it Linked Data when they want to say...?
Why do they call it Linked Data when they want to say...?
 
Kaggle Competitions, New Friends, New Skills and New Opportunities
Kaggle Competitions, New Friends, New Skills and New OpportunitiesKaggle Competitions, New Friends, New Skills and New Opportunities
Kaggle Competitions, New Friends, New Skills and New Opportunities
 
ParlBench: a SPARQL-benchmark for electronic publishing applications.
ParlBench: a SPARQL-benchmark for electronic publishing applications.ParlBench: a SPARQL-benchmark for electronic publishing applications.
ParlBench: a SPARQL-benchmark for electronic publishing applications.
 
Querying Linked Geospatial Data with Incomplete Information
Querying Linked Geospatial Data with  Incomplete InformationQuerying Linked Geospatial Data with  Incomplete Information
Querying Linked Geospatial Data with Incomplete Information
 
Finding Insights In Connected Data: Using Graph Databases In Journalism
Finding Insights In Connected Data: Using Graph Databases In JournalismFinding Insights In Connected Data: Using Graph Databases In Journalism
Finding Insights In Connected Data: Using Graph Databases In Journalism
 
Network Analysis: People and Open Source Communities - LinuxCon Seattle and D...
Network Analysis: People and Open Source Communities - LinuxCon Seattle and D...Network Analysis: People and Open Source Communities - LinuxCon Seattle and D...
Network Analysis: People and Open Source Communities - LinuxCon Seattle and D...
 
Shebanq gniezno
Shebanq gnieznoShebanq gniezno
Shebanq gniezno
 
Learning Commonalities in RDF
Learning Commonalities in RDFLearning Commonalities in RDF
Learning Commonalities in RDF
 

Similar to Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

20110719 social media research foundation-charting collections of connections
20110719 social media research foundation-charting collections of connections20110719 social media research foundation-charting collections of connections
20110719 social media research foundation-charting collections of connections
SMRFoundation
 

Similar to Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016 (20)

" NoSQL Databases: An Overview" Lena Wiese, Research Group Knowledge Engineer...
" NoSQL Databases: An Overview" Lena Wiese, Research Group Knowledge Engineer..." NoSQL Databases: An Overview" Lena Wiese, Research Group Knowledge Engineer...
" NoSQL Databases: An Overview" Lena Wiese, Research Group Knowledge Engineer...
 
Redis Day TLV 2018 - Graph Distribution
Redis Day TLV 2018 - Graph DistributionRedis Day TLV 2018 - Graph Distribution
Redis Day TLV 2018 - Graph Distribution
 
Graph technology meetup slides
Graph technology meetup slidesGraph technology meetup slides
Graph technology meetup slides
 
Introduction to graph databases in term of neo4j
Introduction to graph databases in term of neo4jIntroduction to graph databases in term of neo4j
Introduction to graph databases in term of neo4j
 
Interactive visualization and exploration of network data with Gephi
Interactive visualization and exploration of network data with GephiInteractive visualization and exploration of network data with Gephi
Interactive visualization and exploration of network data with Gephi
 
Graph and RDF databases
Graph and RDF databasesGraph and RDF databases
Graph and RDF databases
 
Introducción a Neo4j
Introducción a Neo4jIntroducción a Neo4j
Introducción a Neo4j
 
Spectral clustering with motifs and higher-order structures
Spectral clustering with motifs and higher-order structuresSpectral clustering with motifs and higher-order structures
Spectral clustering with motifs and higher-order structures
 
How Graph Databases used in Police Department?
How Graph Databases used in Police Department?How Graph Databases used in Police Department?
How Graph Databases used in Police Department?
 
Data Science and Analytics Brown Bag
Data Science and Analytics Brown BagData Science and Analytics Brown Bag
Data Science and Analytics Brown Bag
 
Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)
 
Real-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL DatabasesReal-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL Databases
 
Graph Database Use Cases - StampedeCon 2015
Graph Database Use Cases - StampedeCon 2015Graph Database Use Cases - StampedeCon 2015
Graph Database Use Cases - StampedeCon 2015
 
Graph database Use Cases
Graph database Use CasesGraph database Use Cases
Graph database Use Cases
 
GraphTour Boston - Graphs for AI and ML
GraphTour Boston - Graphs for AI and MLGraphTour Boston - Graphs for AI and ML
GraphTour Boston - Graphs for AI and ML
 
20110719 social media research foundation-charting collections of connections
20110719 social media research foundation-charting collections of connections20110719 social media research foundation-charting collections of connections
20110719 social media research foundation-charting collections of connections
 
Graphs for Ai and ML
Graphs for Ai and MLGraphs for Ai and ML
Graphs for Ai and ML
 
Neo4j Makes Graphs Easy- GraphDays
Neo4j Makes Graphs Easy- GraphDaysNeo4j Makes Graphs Easy- GraphDays
Neo4j Makes Graphs Easy- GraphDays
 
Wimmics Overview 2021
Wimmics Overview 2021Wimmics Overview 2021
Wimmics Overview 2021
 
Doing data science with F#
Doing data science with F#Doing data science with F#
Doing data science with F#
 

Recently uploaded

Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
chadhar227
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
gajnagarg
 
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
vexqp
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
gajnagarg
 
PLE-statistics document for primary schs
PLE-statistics document for primary schsPLE-statistics document for primary schs
PLE-statistics document for primary schs
cnajjemba
 
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
vexqp
 
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
vexqp
 
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
vexqp
 
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling ManjurJual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
ptikerjasaptiker
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
nirzagarg
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Klinik kandungan
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
Health
 

Recently uploaded (20)

Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxThe-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
 
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATIONCapstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
 
PLE-statistics document for primary schs
PLE-statistics document for primary schsPLE-statistics document for primary schs
PLE-statistics document for primary schs
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
 
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
 
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling ManjurJual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
Sequential and reinforcement learning for demand side management by Margaux B...
Sequential and reinforcement learning for demand side management by Margaux B...Sequential and reinforcement learning for demand side management by Margaux B...
Sequential and reinforcement learning for demand side management by Margaux B...
 

Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

  • 1. GRADOOP: Scalable Graph Analytics with Apache Flink Martin Junghanns @kc1s Leipzig University
  • 2. About the speaker and the team André, PhD StudentMartin, PhD Student Kevin, M.Sc. StudentNiklas, M.Sc. Student Prof. Dr. Erhard Rahm Database Chair
  • 4. 𝑮𝑟𝑟𝑟𝑟 = (𝑽𝑒𝑒𝑒𝑒𝑒𝑒𝑒, 𝑬𝑑𝑑𝑑𝑑) “Graphs are everywhere”
  • 5. 𝐺𝐺𝐺𝐺𝐺 = (𝐔𝐔𝐔𝐔𝐔, 𝐹𝐹𝐹𝐹𝐹𝐹𝐹𝐹𝐹) “Graphs are everywhere” Alice Bob Eve Dave Carol Mallory Peggy Trent
  • 6. 𝐺𝐺𝐺𝐺𝐺 = (𝐔𝐔𝐔𝐔𝐔, 𝐹𝐹𝐹𝐹𝐹𝐹𝐹𝐹𝐹𝐹𝐹) “Graphs are everywhere” Alice Bob Eve Dave Carol Mallory Peggy Trent
  • 7. “Graphs are heterogeneous” Alice Bob AC/DC Dave Carol Mallory Peggy Metallica 𝐺𝐺𝐺𝐺𝐺 = (𝐔𝐔𝐔𝐔𝐔 ∪ 𝐁𝐁𝐁𝐁𝐁, 𝐹𝐹𝐹𝐹𝐹𝐹𝐹𝐹𝐹𝐹𝐹 ∪ 𝐿𝐿𝐿𝐿𝐿)
  • 8. 𝐺𝐺𝐺𝐺𝐺 = (𝐔𝐔𝐔𝐔𝐔 ∪ 𝐁𝐁𝐁𝐁𝐁, 𝐹𝐹𝐹𝐹𝐹𝐹𝐹𝐹𝐹𝐹𝐹 ∪ 𝐿𝐿𝐿𝐿𝐿) “Graphs can be analyzed” Alice Bob AC/DC Dave Carol Mallory Peggy Metallica
  • 9. 0.2 0.28 0.26 0.33 0.25 0.26 “Graphs can be analyzed” Alice Bob AC/DC Dave Carol Mallory Peggy Metallica 3.6 2.82 𝐺𝐺𝐺𝐺𝐺 = (𝐔𝐔𝐔𝐔𝐔 ∪ 𝐁𝐁𝐁𝐁𝐁, 𝐹𝐹𝐹𝐹𝐹𝐹𝐹𝐹𝐹𝐹𝐹 ∪ 𝐿𝐿𝐿𝐿𝐿)
  • 10. “Graphs can be analyzed“ Assuming a social network
  • 11. “Graphs can be analyzed“ Assuming a social network 1. Determine subgraph
  • 12. “Graphs can be analyzed“ Assuming a social network 1. Determine subgraph 2. Find communities
  • 13. “Graphs can be analyzed“ Assuming a social network 1. Determine subgraph 2. Find communities 3. Filter communities
  • 14. “Graphs can be analyzed“ Assuming a social network 1. Determine subgraph 2. Find communities 3. Filter communities 4. Find common subgraph
  • 15. “Graph data models must be expressive“ Assuming a social network  Heterogeneous data 1. Determine subgraph  Apply graph transformation 2. Find communities  Handle collections of graphs 3. Filter communities  Aggregation, Selection 4. Find common subgraph  Apply dedicated algorithms
  • 16. „And let‘s not forget …“
  • 18. A framework and research platform for efficient, distributed and domain independent graph data management and analytics.
  • 19. High Level Architecture HDFS/YARN Cluster HBase Distributed Graph Store Extended Property Graph Model Flink Operator Implementation Graph Analytical DSL  Java  25K (33K) LOC  GPLv3
  • 20. Extended Property Graph Model (EPGM)
  • 21. EPGM – Graph Representation 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 knows since : 2014 knows since : 2014 knows since : 2013 hasInterest hasInterest hasInterest hasInterest hasModeratorhasModerator hasMember hasMember hasMember hasMember hasTag hasTaghasTag hasTag knows since : 2013 knows since : 2014 knows since : 2014 knows since : 2015 knows since : 2015 knows since : 2015 knows since : 2013 [0] Tag name : Databases [1] Tag name : Graphs [2] Tag name : Hadoop [3] Forum title : Graph Databases [4] Forum title : Graph Processing [5] Person name : Alice gender : f city : Leipzig age : 23 [6] Person name : Bob gender : m city : Leipzig age : 30 [7] Person name : Carol gender : f city : Dresden age : 30 [8] Person name : Dave gender : m city : Dresden age : 42 [9] Person name : Eve gender : f city : Dresden age : 35 speaks : en [10] Person name : Frank gender : m city : Berlin age : 23 IP: 169.32.1.3
  • 22. [2] Community | interest : Graphs | vertexCount : 4 [1] Community | interest : Hadoop| vertexCount : 3[0] Community | interest : Databases | vertexCount : 3 EPGM – Graph Representation [0] Tag name : Databases [1] Tag name : Graphs [2] Tag name : Hadoop [3] Forum title : Graph Databases [4] Forum title : Graph Processing [5] Person name : Alice gender : f city : Leipzig age : 23 [6] Person name : Bob gender : m city : Leipzig age : 30 [7] Person name : Carol gender : f city : Dresden age : 30 [8] Person name : Dave gender : m city : Dresden age : 42 [9] Person name : Eve gender : f city : Dresden age : 35 speaks : en [10] Person name : Frank gender : m city : Berlin age : 23 IP: 169.32.1.3 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 knows since : 2014 knows since : 2014 knows since : 2013 hasInterest hasInterest hasInterest hasInterest hasModeratorhasModerator hasMember hasMember hasMember hasMember hasTag hasTaghasTag hasTag knows since : 2013 knows since : 2014 knows since : 2014 knows since : 2015 knows since : 2015 knows since : 2015 knows since : 2013
  • 23. EPGM – Operators and Algorithms Operators Unary Binary GraphCollectionLogicalGraph Algorithms Aggregation Pattern Matching Transformation Grouping Equality Call * Combination Overlap Exclusion Equality Union Intersection Difference Gelly Library BTG Extraction Frequent Subgraphs Limit Selection Distinct Sort Apply * Reduce * Call * * auxiliary Adaptive Partitioning Subgraph
  • 24. EPGM – Operators and Algorithms Operators Unary Binary GraphCollectionLogicalGraph Algorithms Aggregation Pattern Matching Transformation Grouping Equality Call * Combination Overlap Exclusion Equality Union Intersection Difference Gelly Library BTG Extraction Frequent Subgraphs Limit Selection Distinct Sort Apply * Reduce * Call * * auxiliary Adaptive Partitioning Subgraph
  • 25. EPGM – Operators and Algorithms Operators Unary Binary GraphCollectionLogicalGraph Algorithms Aggregation Pattern Matching Transformation Grouping Equality Call * Combination Overlap Exclusion Equality Union Intersection Difference Gelly Library BTG Extraction Frequent Subgraphs Limit Selection Distinct Sort Apply * Reduce * Call * * auxiliary Adaptive Partitioning Subgraph
  • 26. Combination 1: personGraph = db.G[0].combine(db.G[1]).combine(db.G[2]) [2] Community | interest : Graphs| vertexCount : 4 [1] Community | interest : Hadoop| vertexCount : 3[0] Community | interest : Databases | vertexCount : 3 [0] Tag name : Databases [1] Tag name : Graphs [2] Tag name : Hadoop [3] Forum title : Graph Databases [4] Forum title : Graph Processing [5] Person name : Alice gender : f city : Leipzig age : 23 [6] Person name : Bob gender : m city : Leipzig age : 30 [7] Person name : Carol gender : f city : Dresden age : 30 [8] Person name : Dave gender : m city : Dresden age : 42 [9] Person name : Eve gender : f city : Dresden age : 35 speaks : en [10] Person name : Frank gender : m city : Berlin age : 23 IP: 169.32.1.3 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 knows since : 2014 knows since : 2014 knows since : 2013 hasInterest hasInterest hasInterest hasInterest hasModeratorhasModerator hasMember hasMember hasMember hasMember hasTag hasTaghasTag hasTag knows since : 2013 knows since : 2014 knows since : 2014 knows since : 2015 knows since : 2015 knows since : 2015 knows since : 2013 DB
  • 27. [0] Community | interest : Databases | vertexCount : 3 [1] Community | interest : Hadoop| vertexCount : 3[0] Community | interest : Databases | vertexCount : 3 [0] Tag name : Databases [1] Tag name : Graphs [2] Tag name : Hadoop [3] Forum title : Graph Databases [4] Forum title : Graph Processing 10 11 12 13 14 15 16 17 18 19 20 21 22 23 hasInterest hasInterest hasInterest hasInterest hasModeratorhasModerator hasMember hasMember hasMember hasMember hasTag hasTaghasTag hasTag DB Combination [4] [5] Person name : Alice gender : f city : Leipzig age : 23 [6] Person name : Bob gender : m city : Leipzig age : 30 [7] Person name : Carol gender : f city : Dresden age : 30 [8] Person name : Dave gender : m city : Dresden age : 42 [9] Person name : Eve gender : f city : Dresden age : 35 speaks : en [10] Person name : Frank gender : m city : Berlin age : 23 IP: 169.32.1.3 0 1 2 3 4 5 6 7 8 9 knows since : 2014 knows since : 2014 knows since : 2013 knows since : 2013 knows since : 2014 knows since : 2014 knows since : 2015 knows since : 2015 knows since : 2015 knows since : 2013 1: personGraph = db.G[0].combine(db.G[1]).combine(db.G[2])
  • 28. EPGM – Operators and Algorithms Operators Unary Binary GraphCollectionLogicalGraph Algorithms Aggregation Pattern Matching Transformation Grouping Equality Call * Combination Overlap Exclusion Equality Union Intersection Difference Gelly Library BTG Extraction Frequent Subgraphs Limit Selection Distinct Sort Apply * Reduce * Call * * auxiliary Adaptive Partitioning Subgraph
  • 29. [0] Community | interest : Databases | vertexCount : 3 [1] Community | interest : Hadoop| vertexCount : 3[0] Community | interest : Databases | vertexCount : 3 [0] Tag name : Databases [1] Tag name : Graphs [2] Tag name : Hadoop [3] Forum title : Graph Databases [4] Forum title : Graph Processing 10 11 12 13 14 15 16 17 18 19 20 21 22 23 hasInterest hasInterest hasInterest hasInterest hasModeratorhasModerator hasMember hasMember hasMember hasMember hasTag hasTaghasTag hasTag DB Combination + Grouping [4] [5] Person name : Alice gender : f city : Leipzig age : 23 [6] Person name : Bob gender : m city : Leipzig age : 30 [7] Person name : Carol gender : f city : Dresden age : 30 [8] Person name : Dave gender : m city : Dresden age : 42 [9] Person name : Eve gender : f city : Dresden age : 35 speaks : en [10] Person name : Frank gender : m city : Berlin age : 23 IP: 169.32.1.3 0 1 2 3 4 5 6 7 8 9 knows since : 2014 knows since : 2014 knows since : 2013 knows since : 2013 knows since : 2014 knows since : 2014 knows since : 2015 knows since : 2015 knows since : 2015 knows since : 2013 1: personGraph = db.G[0].combine(db.G[1]).combine(db.G[2]) 2: vertexGroupingKeys = [:label, “city”] 3: edgeGroupingKeys = [: label] 4: vertexAggFunc = (superVertex, vertices => superVertex[“count”] = |vertices|) 5: edgeAggFunc = (superEdge, edges => superEdge[“count”] = |edges|) 6: sumGraph = personGraph.groupBy(vertexGroupingKeys, vertexAggFunc, edgeGroupingKeys, edgeAggFunc)
  • 30. [0] Community | interest : Databases | vertexCount : 3 [1] Community | interest : Hadoop| vertexCount : 3[0] Community | interest : Databases | vertexCount : 3 [0] Tag name : Databases [1] Tag name : Graphs [2] Tag name : Hadoop [3] Forum title : Graph Databases [4] Forum title : Graph Processing 10 11 12 13 14 15 16 17 18 19 20 21 22 23 hasInterest hasInterest hasInterest hasInterest hasModeratorhasModerator hasMember hasMember hasMember hasMember hasTag hasTaghasTag hasTag DB Combination + Grouping [4] [5] Person name : Alice gender : f city : Leipzig age : 23 [6] Person name : Bob gender : m city : Leipzig age : 30 [7] Person name : Carol gender : f city : Dresden age : 30 [8] Person name : Dave gender : m city : Dresden age : 42 [9] Person name : Eve gender : f city : Dresden age : 35 speaks : en [10] Person name : Frank gender : m city : Berlin age : 23 IP: 169.32.1.3 0 1 2 3 4 5 6 7 8 9 knows since : 2014 knows since : 2014 knows since : 2013 knows since : 2013 knows since : 2014 knows since : 2014 knows since : 2015 knows since : 2015 knows since : 2015 knows since : 2013 1: personGraph = db.G[0].combine(db.G[1]).combine(db.G[2]) 2: vertexGroupingKeys = [:label, “city”] 3: edgeGroupingKeys = [: label] 4: vertexAggFunc = (superVertex, vertices => superVertex[“count”] = |vertices|) 5: edgeAggFunc = (superEdge, edges => superEdge[“count”] = |edges|) 6: sumGraph = personGraph.groupBy(vertexGroupingKeys, vertexAggFunc, edgeGroupingKeys, edgeAggFunc) [5] [11] Person city : Leipzig count : 2 [12] Person city : Dresden count : 3 [13] Person city : Berlin count : 1 24 25 26 27 28 knows count : 3 knows count : 1 knows count : 2 knows count : 2 knows count : 2
  • 31. [0] Community | interest : Databases | vertexCount : 3 [1] Community | interest : Hadoop| vertexCount : 3[0] Community | interest : Databases | vertexCount : 3 [0] Tag name : Databases [1] Tag name : Graphs [2] Tag name : Hadoop [3] Forum title : Graph Databases [4] Forum title : Graph Processing 10 11 12 13 14 15 16 17 18 19 20 21 22 23 hasInterest hasInterest hasInterest hasInterest hasModeratorhasModerator hasMember hasMember hasMember hasMember hasTag hasTaghasTag hasTag DB Combination + Grouping + Aggregation [4] [5] Person name : Alice gender : f city : Leipzig age : 23 [6] Person name : Bob gender : m city : Leipzig age : 30 [7] Person name : Carol gender : f city : Dresden age : 30 [8] Person name : Dave gender : m city : Dresden age : 42 [9] Person name : Eve gender : f city : Dresden age : 35 speaks : en [10] Person name : Frank gender : m city : Berlin age : 23 IP: 169.32.1.3 0 1 2 3 4 5 6 7 8 9 knows since : 2014 knows since : 2014 knows since : 2013 knows since : 2013 knows since : 2014 knows since : 2014 knows since : 2015 knows since : 2015 knows since : 2015 knows since : 2013 1: personGraph = db.G[0].combine(db.G[1]).combine(db.G[2]) 2: vertexGroupingKeys = [:label, “city”] 3: edgeGroupingKeys = [: label] 4: vertexAggFunc = (superVertex, vertices => superVertex[“count”] = |vertices|) 5: edgeAggFunc = (superEdge, edges => superEdge[“count”] = |edges|) 6: sumGraph = personGraph.groupBy(vertexGroupingKeys, vertexAggFunc, edgeGroupingKeys, edgeAggFunc) 7: aggFunc = (g => |g.E|) 8: aggGraph = sumGraph.aggregate(“edgeCount”, aggFunc) [5] [11] Person city : Leipzig count : 2 [12] Person city : Dresden count : 3 [13] Person city : Berlin count : 1 24 25 26 27 28 knows count : 3 knows count : 1 knows count : 2 knows count : 2 knows count : 2
  • 32. [0] Community | interest : Databases | vertexCount : 3 [1] Community | interest : Hadoop| vertexCount : 3[0] Community | interest : Databases | vertexCount : 3 [0] Tag name : Databases [1] Tag name : Graphs [2] Tag name : Hadoop [3] Forum title : Graph Databases [4] Forum title : Graph Processing 10 11 12 13 14 15 16 17 18 19 20 21 22 23 hasInterest hasInterest hasInterest hasInterest hasModeratorhasModerator hasMember hasMember hasMember hasMember hasTag hasTaghasTag hasTag DB Combination + Grouping + Aggregation [4] [5] Person name : Alice gender : f city : Leipzig age : 23 [6] Person name : Bob gender : m city : Leipzig age : 30 [7] Person name : Carol gender : f city : Dresden age : 30 [8] Person name : Dave gender : m city : Dresden age : 42 [9] Person name : Eve gender : f city : Dresden age : 35 speaks : en [10] Person name : Frank gender : m city : Berlin age : 23 IP: 169.32.1.3 0 1 2 3 4 5 6 7 8 9 knows since : 2014 knows since : 2014 knows since : 2013 knows since : 2013 knows since : 2014 knows since : 2014 knows since : 2015 knows since : 2015 knows since : 2015 knows since : 2013 1: personGraph = db.G[0].combine(db.G[1]).combine(db.G[2]) 2: vertexGroupingKeys = [:label, “city”] 3: edgeGroupingKeys = [: label] 4: vertexAggFunc = (superVertex, vertices => superVertex[“count”] = |vertices|) 5: edgeAggFunc = (superEdge, edges => superEdge[“count”] = |edges|) 6: sumGraph = personGraph.groupBy(vertexGroupingKeys, vertexAggFunc, edgeGroupingKeys, edgeAggFunc) 7: aggFunc = (g => |g.E|) 8: aggGraph = sumGraph.aggregate(“edgeCount”, aggFunc) [5] edgeCount : 5 [11] Person city : Leipzig count : 2 [12] Person city : Dresden count : 3 [13] Person city : Berlin count : 1 24 25 26 27 28 knows count : 3 knows count : 1 knows count : 2 knows count : 2 knows count : 2
  • 33. EPGM – Operators and Algorithms Operators Unary Binary GraphCollectionLogicalGraph Algorithms Aggregation Pattern Matching Transformation Grouping Equality Call * Combination Overlap Exclusion Equality Union Intersection Difference Gelly Library BTG Extraction Frequent Subgraphs Limit Selection Distinct Sort Apply * Reduce * Call * * auxiliary Adaptive Partitioning Subgraph
  • 34. Selection 1: resultColl = db.G[0,1,2].select((g => g[“vertexCount”] > 3)) [2] Community | interest : Graphs | vertexCount : 4 [1] Community | interest : Hadoop| vertexCount : 3[0] Community | interest : Databases | vertexCount : 3 [0] Tag name : Databases [1] Tag name : Graphs [2] Tag name : Hadoop [3] Forum title : Graph Databases [4] Forum title : Graph Processing [5] Person name : Alice gender : f city : Leipzig age : 23 [6] Person name : Bob gender : m city : Leipzig age : 30 [7] Person name : Carol gender : f city : Dresden age : 30 [8] Person name : Dave gender : m city : Dresden age : 42 [9] Person name : Eve gender : f city : Dresden age : 35 speaks : en [10] Person name : Frank gender : m city : Berlin age : 23 IP: 169.32.1.3 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 knows since : 2014 knows since : 2014 knows since : 2013 hasInterest hasInterest hasInterest hasInterest hasModeratorhasModerator hasMember hasMember hasMember hasMember hasTag hasTaghasTag hasTag knows since : 2013 knows since : 2014 knows since : 2014 knows since : 2015 knows since : 2015 knows since : 2015 knows since : 2013 DB
  • 35. Selection 1: resultColl = db.G[0,1,2].select((g => g[“vertexCount”] > 3)) [1] Community | interest : Hadoop| vertexCount : 3[0] Community | interest : Databases | vertexCount : 3 [0] Tag name : Databases [1] Tag name : Graphs [2] Tag name : Hadoop [3] Forum title : Graph Databases [4] Forum title : Graph Processing [9] Person name : Eve gender : f city : Dresden age : 35 speaks : en [10] Person name : Frank gender : m city : Berlin age : 23 IP: 169.32.1.3 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 hasInterest hasInterest hasInterest hasInterest hasModeratorhasModerator hasMember hasMember hasMember hasMember hasTag hasTaghasTag hasTag knows since : 2015 knows since : 2015 knows since : 2015 knows since : 2013 DB [2] Community | interest : Graphs | vertexCount : 4 [5] Person name : Alice gender : f city : Leipzig age : 23 [6] Person name : Bob gender : m city : Leipzig age : 30 [7] Person name : Carol gender : f city : Dresden age : 30 [8] Person name : Dave gender : m city : Dresden age : 42 0 1 2 3 4 5 knows since : 2014 knows since : 2014 knows since : 2013 knows since : 2013 knows since : 2014 knows since : 2014
  • 36. EPGM – Operators and Algorithms Operators Unary Binary GraphCollectionLogicalGraph Algorithms Aggregation Pattern Matching Transformation Grouping Equality Call * Combination Overlap Exclusion Equality Union Intersection Difference Gelly Library BTG Extraction Frequent Subgraphs Limit Selection Distinct Sort Apply * Reduce * Call * * auxiliary Adaptive Partitioning Subgraph
  • 38. Apache Flink http://www.slideshare.net/robertmetzger1/apache-flink-meetup-munich-november-2015-flink-overview-architecture-integrations-and-use-case „Streaming Dataflow Engine that provides • data distribution, • communication, • and fault tolerance for distributed computations over data streams.“ HDFS LocalFS HBase JDBC Kafka RabbitMQ Flume (Neo4j) EmbeddedTezYarnClusterLocal Streaming Dataflow Runtime DataSet DataStream HadoopMR Table Gelly ML Table Zeppelin Cascading MRQL Dataflow Storm(wip) Dataflow(wip) SAMOA
  • 39. Apache Flink – DataSet API  DataSet := Distributed Collection of Data  Transformation := Operation applied on DataSet  Flink Program := Composition of Transformations DataSet DataSet DataSet Transformation Transformation DataSet DataSet Transformation DataSet Flink Program
  • 40. Apache Flink – DataSet Transformations  aggregate  coGroup  cross  distinct  filter  first-N  flatMap  groupBy  join  leftOuterJoin  rightOuterJoin  fullOuterJoin  map  mapPartition  project  reduce  reduceGroup  union  iterate  iterateDelta
  • 41. The „Hello World“ of Big Data – Word Count 1: ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); 2: 3: DataSet<String> text = env.fromElements( // or env.readTextFile(„hdfs://…“) 4: „He who controls the past controls the future.“, 5: „He who controls the present controls the past.“); 6: 7: DataSet<Tuple2<String, Integer>> wordCounts = text 8: .flatMap(new LineSplitter()) // splits the line and outputs (word, 1) tuples 9: .groupBy(0) 10: .sum(1); 11: 12: wordCounts.print(); // trigger execution flatMap „He who controls the past controls the future.“ „He who controls the present controls the past.“ (He,1) (who,1) (controls,1) (the,1) (past,1) // ... groupBy(0) [(He,1),(He,1)] [(who,1),(who,1)] [(future,1)] [(past,1),(past,1)] [(present,1)] // ... sum(1) (He,2) (who,2) (future,1) (past,2) (present,1) // ...
  • 42. EPGM on Apache Flink
  • 43. EPGM on Apache Flink – User facing API LogicalGraph fromCollections(…) : LogicalGraph fromDataSets(…) : LogicalGraph fromGellyGraph(…) : LogicalGraph getGraphHead() : DataSet<EPGMGraphHead> toGellyGraph() : Graph combine(…) : LogicalGraph intersect(…) : LogicalGraph groupBy(…) : LogicalGraph match(…) : GraphCollection // ... GraphCollection fromCollections(…) : GraphCollection fromDataSets(…) : GraphCollection getGraphHeads() : DataSet<EPGMGraphHead> getGraph(…) : LogicalGraph getGraphs(…) : GraphCollection select(…) : GraphCollection union(…) : GraphCollection distinct(…) : GraphCollection sortBy(…) : GraphCollection // ... GraphBase getVertices() : DataSet<EPGMVertex> getEdges() : DataSet<EPGMEdge> // ... graphHeads : DataSet<EPGMGraphHead> vertices : DataSet<EPGMVertex> edges : DataSet<EPGMEdge> EPGMDatabase fromCollections(…) : EPGMDatabase fromJSONFile(…) : EPGMDatabase fromHBase(…) : EPGMDatabase writeAsJSON(…) : void writeToHBase(…) : void getDatabaseGraph() : LogicalGraph // ...
  • 44. EPGM on Apache Flink – DataSets Id Label Properties Graphs Id Label Properties SourceId TargetId Graphs EPGMGraphHead EPGMVertex EPGMEdge Id Label Properties POJO POJO POJO DataSet<EPGMGraphHead> DataSet<EPGMVertex> DataSet<EPGMEdge> Id Label Properties Graphs EPGMVertex GradoopId := UUID 128-bit String PropertyList := List<Property> Property := (String, PropertyValue) PropertyValue := byte[] GradoopIdSet := Set<GradoopId>
  • 45. EPGM on Apache Flink – Exclusion // input: firstGraph (G[0]), secondGraph (G[2]) 1: DataSet<GradoopId> graphId = secondGraph.getGraphHead() 2: .map(new Id<G>()); 3: 4: DataSet<V> newVertices = firstGraph.getVertices() 5: .filter(new NotInGraphBroadCast<V>()) 6: .withBroadcastSet(graphId, GRAPH_ID); 7: 8: DataSet<E> newEdges = firstGraph.getEdges() 9: .filter(new NotInGraphBroadCast<E>()) 10: .withBroadcastSet(graphId, GRAPH_ID) 11: .join(newVertices) 12: .where(new SourceId<E>().equalTo(new Id<V>()) 13: .with(new LeftSide<E, V>()) 14: .join(newVertices) 15: .where(new TargetId<E>().equalTo(new Id<V>()) 16: .with(new LeftSide<E, V>()); db.G[0].exclude(db.G[2]) [2] Community | interest : Graphs| vertexCount : 4 [0] Community | interest : Databases | vertexCount : 3 [5] Person name : Alice gender : f city : Leipzig age : 23 [6] Person name : Bob gender : m city : Leipzig age : 30 [7] Person name : Carol gender : f city : Dresden age : 30 [8] Person name : Dave gender : m city : Dresden age : 42 [9] Person name : Eve gender : f city : Dresden age : 35 speaks : en 0 1 2 3 4 5 6 7 knows since : 2014 knows since : 2014 knows since : 2013 knows since : 2013 knows since : 2014 knows since : 2014 knows since : 2015 knows since : 2013
  • 46. EPGM on Apache Flink – Exclusion Id Label Properties 2 Community interest: Graphs vertexCount: 4 graphId = secondGraph.getGraphHead() Id 2 newVertices = firstGraph.getVertices() Id Label Properties Graphs 5 Person name: Alice gender: f … [0, 2] 6 Person name: Bob gender: m … [0, 2] 9 Person name: Eve gender: f … [0] Id Label Properties Graphs 9 Person name: Eve gender: f … [0] .map(new Id<G>()); .filter(new NotInGraphBroadCast<V>()) .withBroadcastSet(graphId, GRAPH_ID);
  • 47. EPGM in Apache Flink – Exclusion newEdges = firstGraph.getEdges() Id Label SourceId TargetId Properties Graphs 0 knows 5 6 since: 2014 [0, 2] 1 knows 6 5 since: 2014 [0, 2] 6 knows 9 5 since: 2013 [0] 7 knows 9 6 since: 2015 [0] Id Label SourceId TargetId Properties Graphs 6 knows 9 5 since: 2013 [0] 7 knows 9 6 since: 2015 [0] Id Label SourceId TargetId … Id Label … 6 knows 9 5 … 9 Person … 7 knows 9 6 … 9 Person … Id Label SourceId TargetId … 6 knows 9 5 … 7 knows 9 6 … Id Label SourceId TargetId … Id Label … Id Label SourceId TargetId ….with(new LeftSide<E, V>()); .join(newVertices) .where(new TargetId<E>().equalTo(new Id<V>()) .with(new LeftSide<E, V>()) .join(newVertices) .where(new SourceId<E>().equalTo(new Id<V>()) .filter(new NotInGraphBroadCast<E>()) .withBroadcastSet(graphId, GRAPH_ID)
  • 49. LDBC Social Network Data http://ldbcouncil.org/
  • 50. LDBC Social Network Data socialNetwork .subgraph( (v => v.label == ‘Person‘), (e => e.label == ‘knows‘)) .transform( (gIn, gOut => gOut = gIn), (vIn, vOut => { vOut.label = vIn.label, vOut[‘city‘] = vIn[‘city‘] vOut[‘gender‘] = vIn[‘gender‘] vOut[‘key‘] = vIn[‘birthday‘]}), (eIn, eOut) => eOut.label = eIn.label)) .callForCollection(:LabelPropagation, [‘key‘, 4]) .apply(g => g.aggregate(‘vertexCount‘, (h => |h.V|)) .select(g => g[‘vertexCount‘] > 50000) .reduce(g, h => g.combine(h)) .groupBy( [‘city‘,‘gender‘], (superVertex, vertices => superVertex[‘count‘] = |vertices|), [], (superEdge, edges => superEdge[‘count‘] = |edges|) .aggregate(‘vCount‘, (g => |g.V|)) .aggregate(‘eCount‘, (g => |g.E|))
  • 51. LDBC Social Network Data socialNetwork .subgraph( (v => v.label == ‘Person‘), (e => e.label == ‘knows‘)) .transform( (gIn, gOut => gOut = gIn), (vIn, vOut => { vOut.label = vIn.label, vOut[‘city‘] = vIn[‘city‘] vOut[‘gender‘] = vIn[‘gender‘] vOut[‘key‘] = vIn[‘birthday‘]}), (eIn, eOut) => eOut.label = eIn.label)) .callForCollection(:LabelPropagation, [‘key‘, 4]) .apply(g => g.aggregate(‘vertexCount‘, (h => |h.V|)) .select(g => g[‘vertexCount‘] > 50000) .reduce(g, h => g.combine(h)) .groupBy( [‘city‘,‘gender‘], (superVertex, vertices => superVertex[‘count‘] = |vertices|), [], (superEdge, edges => superEdge[‘count‘] = |edges|) .aggregate(‘vCount‘, (g => |g.V|)) .aggregate(‘eCount‘, (g => |g.E|))
  • 52. Benchmark Results Dataset # Vertices # Edges Graphalytics.1 61,613 2,026,082 Graphalytics.10 260,613 16,600,778 Graphalytics.100 1,695,613 147,437,275 Graphalytics.1000 12,775,613 1,363,747,260 Graphalytics.10000 90,025,613 10,872,109,028  16x Intel(R) Xeon(R) 2.50GHz (6 Cores)  16x 48 GB RAM  1 Gigabit Ethernet  Hadoop 2.6.0  Flink 1.0-SNAPSHOT  slots (per worker) 12  jobmanager.heap.mb 2048  taskmanager.heap.mb 40960 0 200 400 600 800 1000 1200 1 2 4 8 16 Runtime[s] Number of workers Runtime Graphalytics.100
  • 53. Benchmark Results Dataset # Vertices # Edges Graphalytics.1 61,613 2,026,082 Graphalytics.10 260,613 16,600,778 Graphalytics.100 1,695,613 147,437,275 Graphalytics.1000 12,775,613 1,363,747,260 Graphalytics.10000 90,025,613 10,872,109,028 1 2 4 8 16 1 2 4 8 16 Speedup Number of workers Speedup Graphalytics.100 Linear  16x Intel(R) Xeon(R) 2.50GHz (6 Cores)  16x 48 GB RAM  1 Gigabit Ethernet  Hadoop 2.6.0  Flink 1.0-SNAPSHOT  slots (per worker) 12  jobmanager.heap.mb 2048  taskmanager.heap.mb 40960
  • 54. Benchmark Results Dataset # Vertices # Edges Graphalytics.1 61,613 2,026,082 Graphalytics.10 260,613 16,600,778 Graphalytics.100 1,695,613 147,437,275 Graphalytics.1000 12,775,613 1,363,747,260 Graphalytics.10000 90,025,613 10,872,109,028 1 10 100 1000 10000 Runtime[s] Datasets  16x Intel(R) Xeon(R) 2.50GHz (6 Cores)  16x 48 GB RAM  1 Gigabit Ethernet  Hadoop 2.6.0  Flink 1.0-SNAPSHOT  slots (per worker) 12  jobmanager.heap.mb 2048  taskmanager.heap.mb 40960
  • 55. Current State & Future Work
  • 56. Current State Operators Unary Binary GraphCollectionLogicalGraph Algorithms Aggregation Pattern Matching Transformation Grouping Equality Call * Combination Overlap Exclusion Equality Union Intersection Difference Gelly Library BTG Extraction Frequent Subgraphs Limit Selection Distinct Sort Apply * Reduce * Call * * auxiliary Adaptive Partitioning Subgraph
  • 57. Current State  0.0.1 First Prototype (May 2015)  Hadoop MapReduce and Giraph for operator implementations  Too much complexity  Performance loss through serialization in HDFS/HBase  0.0.2 Using Flink as execution layer (June 2015)  Basic operators  0.1 December 2015  System-side identifiers (UUID)  Improved property handling  More operator implementations (e.g., Equality, Bool operators)  Code refactoring  0.2-SNAPSHOT  Graph Pattern Matching  Frequent Subgraph Mining  Memory optimization (96-bit ID, Dictionary Encoding, …)  Tuple Implementation
  • 58. Contributions to Flink  FLINK-2411 Add basic graph summarization algorithm  FLINK-2590 DataSetUtils.zipWithUniqueID creates duplicate Ids  FLINK-2905 Add intersect method to Graph class  FLINK-2910 Combine tests for binary graph operators  FLINK-2941 Implement a neo4j - Flink/Gelly connector  FLINK-2981 Update README for building docs  FLINK-3064 Missing size check in GroupReduceOperatorBase leads to NPE  FLINK-3118 Check if MessageFunction implements ResultTypeQueryable  FLINK-3122 Generalize value type in LabelPropagation  FLINK-3272 Generalize vertex value type in ConnectedComponents  Flink Forward (October 2015)  Meetup Big Data Usergroup Saxony (December 2015)
  • 59. Contributions welcome!  Code  Operator implementations  Performance Tuning  Extend HBase Storage  People  Bachelor / Master Thesis  Open PhD positions in Leipzig, Germany  Data! and Use Cases  We are researchers, we assume ...