Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

GRADOOP: Scalable Graph
Analytics with Apache Flink
Martin Junghanns
Leipzig University
Big Data User Group Dresden / Graph Databases Sachsen
December 2015

About the speaker and the team
André, PhD Student
Martin, PhD Student
Kevin, M.Sc. Student
Niklas, M.Sc. Student
Prof. Dr. Erhard Rahm
Database Chair

Outline
• Motivation
• Gradoop Architecture
• Extended Property Graph Model (EPGM)
• Apache Flink
• EPGM on Apache Flink
• Business Intelligence Use Case
• Tooling
• Current State & Future Work

Graph = (Vertices, Edges)
“Graphs are everywhere”

Graph = (Users, Friendships)
[Figure: friendship graph of Alice, Bob, Eve, Dave, Carol, Mallory and Peggy]

Graph = (Users, Followers)
[Figure: follower graph of Alice, Bob, Eve, Dave, Carol, Mallory, Peggy and Trent]

Graph = (Cities, Connections)
[Figure: connection graph of cities with populations: Leipzig (544K), Dresden (536K), Berlin (3.5M), Hamburg (1.7M), Munich (1.4M), Chemnitz (243K), Nuremberg (500K), Cologne (1M)]

“Graphs are large”
• World Wide Web
  - ca. 1 billion websites
• Facebook
  - ca. 1.49 billion active users
  - ca. 340 friends per user

End-to-End Graph Analytics
Data Integration → Graph Analytics → Representation
• Integrate data from one or more sources into a dedicated graph storage with a common graph data model
• Define analytical workflows from an operator algebra
• Represent results in a meaningful way

Graph Data Management
                 | Graph Database Systems | Graph Processing Systems | Distributed Workflow Systems
Examples         | Neo4j, OrientDB        | Pregel, Giraph           | Flink Gelly, Spark GraphX
Data Model       | Rich Graph Models      | Generic Graph Models     | Generic Graph Models
Focus            | Local ACID Operations  | Global Graph Operations  | Global Data and Graph Operations
Query Language   | Yes                    | No                       | No
Persistency      | Yes                    | No                       | No
Scalability      | Vertical               | Horizontal               | Horizontal
Workflows        | No                     | No                       | Yes
Data Integration | No                     | No                       | No
Graph Analytics  | No                     | Yes                      | Yes
Representation   | Yes                    | No                       | No

What's missing?
An end-to-end framework and research platform
for efficient, distributed and domain independent
graph data management and analytics.

Gradoop Architecture & Data Model

High Level Architecture
[Figure: layered architecture with the following labels]
• HDFS/YARN Cluster
• HBase Distributed Graph Store
• Extended Property Graph Model
• Flink Operator Implementations, Flink Operator Execution
• Workflow Declaration (Visual, GrALa DSL), Workflow Execution, Representation
• Stages: Data Integration, Graph Analytics, Representation
• Arrows: Data flow, Control flow

[Figure: example EPGM database (social network)]
Logical graphs:
[0] Community | interest : Databases | vertexCount : 3
[1] Community | interest : Hadoop | vertexCount : 3
Vertices:
[0] Tag | name : Databases
[1] Tag | name : Graphs
[2] Tag | name : Hadoop
[3] Forum | title : Graph Databases
[4] Forum | title : Graph Processing
[5] Person | name : Alice | gender : f | city : Leipzig | age : 23
[6] Person | name : Bob | gender : m | city : Leipzig | age : 30
[7] Person | name : Carol | gender : f | city : Dresden | age : 30
[8] Person | name : Dave | gender : m | city : Dresden | age : 42
[9] Person | name : Eve | gender : f | city : Dresden | age : 35 | speaks : en
[10] Person | name : Frank | gender : m | city : Berlin | age : 23 | IP : 169.32.1.3
Edges (ids 0 to 23; endpoints are shown only in the figure):
10 × knows (since : 2013, 2014 or 2015), 4 × hasInterest, 2 × hasModerator, 4 × hasMember, 4 × hasTag

[2] Community | interest : Graphs | vertexCount : 4
[Figure: the same example social network (vertices [0] to [10], edges 0 to 23), shown together with logical graph [2].]

Graph Operators and Algorithms
Operators
                | Unary                                                             | Binary
LogicalGraph    | Aggregation, Pattern Matching, Projection, Summarization, Call *  | Combination, Overlap, Exclusion, Equality
GraphCollection | Top, Selection, Distinct, Sort, Apply *, Reduce *, Call *         | Equality, Union, Intersection, Difference
Algorithms
Gelly Library, BTG Extraction, Label Propagation, Graph Forecasting, Frequent Subgraphs
* auxiliary

Combination
personGraph = db.G[0].combine(db.G[1]).combine(db.G[2])
[Figure: input database DB of the example social network with logical graphs [0], [1] and [2].]

Combination
[Figure: result of the combination. The new logical graph [4] contains the Person vertices [5] to [10] and the knows edges 0 to 9. The Tag and Forum vertices [0] to [4] and the edges 10 to 23 (hasInterest, hasMember) remain only in the database graph DB.]
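
As a rough illustration of the combination operator, the following plain-Java sketch treats each logical graph as a set of vertex ids and combines them by set union. It is not Gradoop code. The memberships of G[0] and G[2] follow the example data in this deck; the membership of G[1] (Carol, Dave, Frank) is an assumption made for illustration, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

/** Sketch: combination as the union of the vertex id sets of logical graphs. */
final class CombinationSketch {

    /** Returns the union of all given id sets in ascending order. */
    @SafeVarargs
    static Set<Integer> combine(Set<Integer>... idSets) {
        Set<Integer> result = new TreeSet<>();
        for (Set<Integer> ids : idSets) {
            result.addAll(ids);
        }
        return result;
    }

    /** Small helper to build an id set. */
    static Set<Integer> ids(Integer... values) {
        return new TreeSet<>(Arrays.asList(values));
    }

    public static void main(String[] args) {
        Set<Integer> g0 = ids(5, 6, 9);     // Alice, Bob, Eve
        Set<Integer> g1 = ids(7, 8, 10);    // ASSUMPTION: Carol, Dave, Frank
        Set<Integer> g2 = ids(5, 6, 7, 8);  // Alice, Bob, Carol, Dave
        System.out.println(combine(g0, g1, g2));
    }
}
```

The same union would be applied to the edge id sets; shared vertices such as Alice and Bob appear only once in the result.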

Combination + Summarization
[Figure: database DB and the combined logical graph [4] (Person vertices [5] to [10], knows edges 0 to 9), which is the input of the summarization.]
vertexGroupingKeys = {:LABEL, "city"}
edgeGroupingKeys = {:LABEL}
vertexAggFunc = (Vertex vSum, Set vertices => vSum["count"] = |vertices|)
edgeAggFunc = (Edge eSum, Set edges => eSum["count"] = |edges|)
sumGraph = personGraph.summarize(vertexGroupingKeys, vertexAggFunc,
                                 edgeGroupingKeys, edgeAggFunc)
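
The grouping logic behind summarize can be sketched on plain Java collections. This is only an illustration under simplifying assumptions, not the Gradoop operator: vertices are grouped by label and city, edges by label and by the groups of their endpoints, and both aggregate functions just count. The vertex data is taken from the example graph; only four knows edges are included (those whose endpoints are listed in the exclusion tables later in the deck). All class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch of summarization on plain collections (not the Gradoop implementation). */
final class SummarizationSketch {

    static final class Vertex {
        final int id;
        final String label;
        final String city;

        Vertex(int id, String label, String city) {
            this.id = id;
            this.label = label;
            this.city = city;
        }

        /** Grouping key built from vertexGroupingKeys = {:LABEL, "city"}. */
        String groupKey() {
            return label + ":" + city;
        }
    }

    static final class Edge {
        final String label;
        final int source;
        final int target;

        Edge(String label, int source, int target) {
            this.label = label;
            this.source = source;
            this.target = target;
        }
    }

    /** Counts the vertices of every (label, city) group. */
    static Map<String, Integer> summarizeVertices(List<Vertex> vertices) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Vertex v : vertices) {
            counts.merge(v.groupKey(), 1, Integer::sum);
        }
        return counts;
    }

    /** Counts the edges per (source group, edge label, target group). */
    static Map<String, Integer> summarizeEdges(List<Vertex> vertices, List<Edge> edges) {
        Map<Integer, String> groupOf = new HashMap<>();
        for (Vertex v : vertices) {
            groupOf.put(v.id, v.groupKey());
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Edge e : edges) {
            String key = groupOf.get(e.source) + " -" + e.label + "-> " + groupOf.get(e.target);
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    /** The six Person vertices of the example graph. */
    static List<Vertex> persons() {
        return Arrays.asList(
            new Vertex(5, "Person", "Leipzig"), new Vertex(6, "Person", "Leipzig"),
            new Vertex(7, "Person", "Dresden"), new Vertex(8, "Person", "Dresden"),
            new Vertex(9, "Person", "Dresden"), new Vertex(10, "Person", "Berlin"));
    }

    /** Four knows edges with known endpoints; the other endpoints are not listed in the text. */
    static List<Edge> someKnowsEdges() {
        return Arrays.asList(
            new Edge("knows", 5, 6), new Edge("knows", 6, 5),
            new Edge("knows", 9, 5), new Edge("knows", 9, 6));
    }

    public static void main(String[] args) {
        System.out.println(summarizeVertices(persons()));
        System.out.println(summarizeEdges(persons(), someKnowsEdges()));
    }
}
```

Each map entry corresponds to one summarized vertex or edge carrying a count property.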

Combination + Summarization
[Figure: database DB, the combined logical graph [4] and the resulting summary graph [5].]
Summary graph [5]
Vertices:
[11] Person | city : Leipzig | count : 2
[12] Person | city : Dresden | count : 3
[13] Person | city : Berlin | count : 1
Edges 24 to 28 (endpoints are shown only in the figure):
knows with count : 3, 1, 2, 2 and 2

Combination + Summarization + Aggregation
aggFunc = (Graph g => |g.E|)
aggGraph = sumGraph.aggregate("edgeCount", aggFunc)
[Figure: database DB, the combined logical graph [4] and the summary graph [5] (vertices [11] to [13], edges 24 to 28) as input of the aggregation.]

Combination + Summarization + Aggregation
aggFunc = (Graph g => |g.E|)
aggGraph = sumGraph.aggregate("edgeCount", aggFunc)
[Figure: result of the aggregation. Summary graph [5] (vertices [11] to [13], edges 24 to 28) now carries the graph property edgeCount : 5.]

Selection
resultColl = db.G[0,1,2].select((Graph g => g["vertexCount"] > 3))
[Figure: input database DB of the example social network with logical graphs [0], [1] and [2].]

Selection
resultColl = db.G[0,1,2].select((Graph g => g["vertexCount"] > 3))
[Figure: result of the selection. Only logical graph [2] (vertexCount : 4) satisfies the predicate; it contains the Person vertices [5] to [8] and the knows edges 0 to 5. All other vertices and edges remain only in the database graph DB.]
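
The selection operator can be pictured as a filter over graph heads. The sketch below is plain Java with hypothetical names, not the Gradoop API; it applies the predicate vertexCount > 3 to the three Community graphs of the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Sketch: selection keeps the graphs of a collection whose head satisfies a predicate. */
final class SelectionSketch {

    /** Graph head reduced to the fields needed for this example. */
    static final class GraphHead {
        final int id;
        final String interest;
        final int vertexCount;

        GraphHead(int id, String interest, int vertexCount) {
            this.id = id;
            this.interest = interest;
            this.vertexCount = vertexCount;
        }
    }

    /** Returns all graph heads for which the predicate holds, in input order. */
    static List<GraphHead> select(List<GraphHead> collection, Predicate<GraphHead> predicate) {
        List<GraphHead> result = new ArrayList<>();
        for (GraphHead g : collection) {
            if (predicate.test(g)) {
                result.add(g);
            }
        }
        return result;
    }

    /** The three Community graphs from the example database. */
    static List<GraphHead> communities() {
        return Arrays.asList(
            new GraphHead(0, "Databases", 3),
            new GraphHead(1, "Hadoop", 3),
            new GraphHead(2, "Graphs", 4));
    }

    public static void main(String[] args) {
        for (GraphHead g : select(communities(), g -> g.vertexCount > 3)) {
            System.out.println(g.id + " " + g.interest);
        }
    }
}
```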

Apache Flink
"Streaming Dataflow Engine that provides
• data distribution,
• communication,
• and fault tolerance
for distributed computations over data streams."
Source: http://www.slideshare.net/robertmetzger1/apache-flink-meetup-munich-november-2015-flink-overview-architecture-integrations-and-use-case
[Figure: Flink stack]
• Data sources and sinks: HDFS, LocalFS, HBase, JDBC, Kafka, RabbitMQ, Flume, (Neo4j)
• Deployment: Local, Cluster, Yarn, Tez, Embedded
• Core: Streaming Dataflow Runtime
• APIs: DataSet, DataStream
• Libraries and integrations: Hadoop MR, Table, Gelly, ML, Zeppelin, Cascading, MRQL, Dataflow, Storm (wip), Dataflow (wip), SAMOA

Apache Flink – DataSet API
• DataSet := Distributed Collection of Data
• Transformation := Operation applied on DataSet
• Flink Program := Composition of Transformations
[Figure: a Flink program as a dataflow in which DataSets are connected by Transformations that produce new DataSets]

Apache Flink – DataSet Transformations
• aggregate, coGroup, cross, distinct, filter, first-N, flatMap, groupBy
• join, leftOuterJoin, rightOuterJoin, fullOuterJoin
• map, mapPartition, reduce, reduceGroup, union
• iterate, iterateDelta

The "Hello World" of Big Data – Word Count
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

DataSet<String> text = env.fromElements( // or env.readTextFile("hdfs://…")
    "He who controls the past controls the future.",
    "He who controls the present controls the past.");

DataSet<Tuple2<String, Integer>> wordCounts = text
    .flatMap(new LineSplitter()) // splits the line and outputs (word, 1) tuples
    .groupBy(0)                  // group by the word (tuple field 0)
    .sum(1);                     // sum the counts (tuple field 1)

wordCounts.print(); // trigger execution

Dataflow:
Input:
  "He who controls the past controls the future."
  "He who controls the present controls the past."
flatMap → (He,1), (who,1), (controls,1), (the,1), (past,1), // ...
groupBy(0) → [(He,1),(He,1)], [(who,1),(who,1)], [(future,1)], [(past,1),(past,1)], [(present,1)], // ...
sum(1) → (He,2), (who,2), (future,1), (past,2), (present,1), // ...
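
For readers without a Flink setup, the same three steps can be mimicked with the Java standard library. The sketch below is an analogy, not Flink code: java.util.stream plays the role of the DataSet, and splitting on non-word characters stands in for the LineSplitter used above, whose implementation is not shown in the slides.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Sketch: word count with java.util.stream, mirroring flatMap, groupBy and sum. */
final class WordCountSketch {

    static Map<String, Integer> wordCount(List<String> lines) {
        return lines.stream()
            // flatMap: split every line into words (assumed splitting rule: non-word characters)
            .flatMap(line -> Arrays.stream(line.split("\\W+")))
            .filter(word -> !word.isEmpty())
            // groupBy(0) and sum(1): group equal words and add 1 per occurrence
            .collect(Collectors.groupingBy(
                word -> word, TreeMap::new, Collectors.summingInt(word -> 1)));
    }

    public static void main(String[] args) {
        List<String> text = Arrays.asList(
            "He who controls the past controls the future.",
            "He who controls the present controls the past.");
        System.out.println(wordCount(text));
    }
}
```

Unlike the Flink program, this version runs in a single JVM and evaluates eagerly at collect; Flink builds a distributed dataflow that only executes when a sink such as print() is called.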

EPGM in Apache Flink – User facing API
LogicalGraph
  fromCollections(…) : LogicalGraph
  fromDataSets(…) : LogicalGraph
  fromGellyGraph(…) : LogicalGraph
  getGraphHead() : DataSet<EPGMGraphHead>
  toGellyGraph() : Graph
  combine(…) : LogicalGraph
  intersect(…) : LogicalGraph
  summarize(…) : LogicalGraph
  match(…) : GraphCollection
  // ...
GraphCollection
  fromCollections(…) : GraphCollection
  fromDataSets(…) : GraphCollection
  getGraphHeads() : DataSet<EPGMGraphHead>
  getGraph(…) : LogicalGraph
  getGraphs(…) : GraphCollection
  select(…) : GraphCollection
  union(…) : GraphCollection
  distinct(…) : GraphCollection
  sortBy(…) : GraphCollection
  // ...
GraphBase
  getVertices() : DataSet<EPGMVertex>
  getEdges() : DataSet<EPGMEdge>
  // ...
  graphHeads : DataSet<EPGMGraphHead>
  vertices : DataSet<EPGMVertex>
  edges : DataSet<EPGMEdge>
EPGMDatabase
  fromCollections(…) : EPGMDatabase
  fromJSONFile(…) : EPGMDatabase
  fromHBase(…) : EPGMDatabase
  writeAsJSON(…) : void
  writeToHBase(…) : void
  getDatabaseGraph() : LogicalGraph
  // ...

EPGM in Apache Flink – DataSets
POJO          | Fields                                            | DataSet
EPGMGraphHead | Id, Label, Properties                             | DataSet<EPGMGraphHead>
EPGMVertex    | Id, Label, Properties, Graphs                     | DataSet<EPGMVertex>
EPGMEdge      | Id, Label, Properties, SourceId, TargetId, Graphs | DataSet<EPGMEdge>
Field types:
• Id: GradoopId := UUID (128-bit)
• Label: String
• Properties: PropertyList := List<Property>, Property := (String, PropertyValue), PropertyValue := byte[]
• Graphs: GradoopIdSet := Set<GradoopId>
Example EPGMVertex:
(55421132-f45b-40f0-8f6a-50ea13dbf2ea:Person{gender=f,city=Leipzig,name=Alice,age=20} @ [c2c0f288-9f27-4e55-b1c6-7a35e0eabe36, 77b710f9-07c2-49ab-b4bf-51e1a3138822])
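
To make the vertex layout more tangible, here is a minimal plain-Java sketch of such an element. It is not the Gradoop class: property values are kept as plain Java objects instead of byte[] for readability, and the class and method names are hypothetical. The ids are the ones from the example vertex above.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

/** Sketch of an EPGM vertex: id, label, properties and the ids of the graphs containing it. */
final class VertexSketch {
    final UUID id;
    final String label;
    // ASSUMPTION: plain objects instead of the byte[] PropertyValue named on the slide
    final Map<String, Object> properties = new LinkedHashMap<>();
    final Set<UUID> graphIds = new LinkedHashSet<>();

    VertexSketch(UUID id, String label) {
        this.id = id;
        this.label = label;
    }

    VertexSketch property(String key, Object value) {
        properties.put(key, value);
        return this;
    }

    VertexSketch inGraph(UUID graphId) {
        graphIds.add(graphId);
        return this;
    }

    /** Graph membership test, the basis of operators such as exclusion. */
    boolean isContainedIn(UUID graphId) {
        return graphIds.contains(graphId);
    }

    @Override
    public String toString() {
        return "(" + id + ":" + label + properties + " @ " + graphIds + ")";
    }

    /** Rebuilds the example vertex shown above. */
    static VertexSketch alice() {
        return new VertexSketch(UUID.fromString("55421132-f45b-40f0-8f6a-50ea13dbf2ea"), "Person")
            .property("gender", "f")
            .property("city", "Leipzig")
            .property("name", "Alice")
            .property("age", 20)
            .inGraph(UUID.fromString("c2c0f288-9f27-4e55-b1c6-7a35e0eabe36"))
            .inGraph(UUID.fromString("77b710f9-07c2-49ab-b4bf-51e1a3138822"));
    }

    public static void main(String[] args) {
        System.out.println(alice());
    }
}
```

Storing the graph ids on every vertex and edge is what lets a single DataSet of vertices serve many overlapping logical graphs.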

EPGM in Apache Flink – Exclusion
// input: firstGraph (G[0]), secondGraph (G[2])
// id of the graph whose elements are to be excluded
DataSet<GradoopId> graphId = secondGraph.getGraphHead()
    .map(new Id<G>());

// keep the vertices of the first graph that are not contained in the second graph
DataSet<V> newVertices = firstGraph.getVertices()
    .filter(new NotInGraphBroadCast<V>())
    .withBroadcastSet(graphId, GRAPH_ID);

// keep the edges that are not in the second graph and whose source and target survive
DataSet<E> newEdges = firstGraph.getEdges()
    .filter(new NotInGraphBroadCast<E>())
    .withBroadcastSet(graphId, GRAPH_ID)
    .join(newVertices)
    .where(new SourceId<E>()).equalTo(new Id<V>())
    .with(new LeftSide<E, V>())
    .join(newVertices)
    .where(new TargetId<E>()).equalTo(new Id<V>())
    .with(new LeftSide<E, V>());
db.G[0].exclude(db.G[2])
[Figure: logical graphs G[0] and G[2] of the example social network (Person vertices [5] to [9], knows edges 0 to 7).]

graphId = secondGraph.getGraphHead()
  Id | Label     | Properties
  2  | Community | interest: Graphs, vertexCount: 4
.map(new Id<G>());
  Id
  2

newVertices = firstGraph.getVertices()
  Id | Label  | Properties                | Graphs
  5  | Person | name: Alice, gender: f, … | [0, 2]
  6  | Person | name: Bob, gender: m, …   | [0, 2]
  9  | Person | name: Eve, gender: f, …   | [0]
.filter(new NotInGraphBroadCast<V>())
.withBroadcastSet(graphId, GRAPH_ID);
  Id | Label  | Properties                | Graphs
  9  | Person | name: Eve, gender: f, …   | [0]

newEdges = firstGraph.getEdges()
  Id | Label | SourceId | TargetId | Properties  | Graphs
  0  | knows | 5        | 6        | since: 2014 | [0, 2]
  1  | knows | 6        | 5        | since: 2014 | [0, 2]
  6  | knows | 9        | 5        | since: 2013 | [0]
  7  | knows | 9        | 6        | since: 2015 | [0]
.filter(new NotInGraphBroadCast<E>())
.withBroadcastSet(graphId, GRAPH_ID)
  Id | Label | SourceId | TargetId | Properties  | Graphs
  6  | knows | 9        | 5        | since: 2013 | [0]
  7  | knows | 9        | 6        | since: 2015 | [0]
.join(newVertices)
.where(new SourceId<E>()).equalTo(new Id<V>())
  Id | Label | SourceId | TargetId | … | Id | Label  | …
  6  | knows | 9        | 5        | … | 9  | Person | …
  7  | knows | 9        | 6        | … | 9  | Person | …
.with(new LeftSide<E, V>())
  Id | Label | SourceId | TargetId | …
  6  | knows | 9        | 5        | …
  7  | knows | 9        | 6        | …
.join(newVertices)
.where(new TargetId<E>()).equalTo(new Id<V>())
  Id | Label | SourceId | TargetId | … | Id | Label | …
  (empty: the target vertices 5 and 6 are not in newVertices)
.with(new LeftSide<E, V>());
  Id | Label | SourceId | TargetId | …
  (empty)
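
The walkthrough above can be replayed on plain Java collections. The sketch below is not the Flink implementation; it replaces the broadcast filter and the two joins with simple loops and set lookups, and uses exactly the vertices and edges listed in the tables. Class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch: exclusion G[0] \ G[2] on in-memory lists. */
final class ExclusionSketch {

    static final class Vertex {
        final int id;
        final Set<Integer> graphs;

        Vertex(int id, Integer... graphs) {
            this.id = id;
            this.graphs = new HashSet<>(Arrays.asList(graphs));
        }
    }

    static final class Edge {
        final int id;
        final int source;
        final int target;
        final Set<Integer> graphs;

        Edge(int id, int source, int target, Integer... graphs) {
            this.id = id;
            this.source = source;
            this.target = target;
            this.graphs = new HashSet<>(Arrays.asList(graphs));
        }
    }

    /** Keeps the vertices that are not contained in the second graph. */
    static List<Vertex> newVertices(List<Vertex> firstVertices, int secondGraphId) {
        List<Vertex> result = new ArrayList<>();
        for (Vertex v : firstVertices) {
            if (!v.graphs.contains(secondGraphId)) {
                result.add(v);
            }
        }
        return result;
    }

    /** Keeps the edges not in the second graph whose source and target both survived. */
    static List<Edge> newEdges(List<Edge> firstEdges, List<Vertex> keptVertices, int secondGraphId) {
        Set<Integer> keptIds = new HashSet<>();
        for (Vertex v : keptVertices) {
            keptIds.add(v.id);
        }
        List<Edge> result = new ArrayList<>();
        for (Edge e : firstEdges) {
            boolean notInSecond = !e.graphs.contains(secondGraphId);
            if (notInSecond && keptIds.contains(e.source) && keptIds.contains(e.target)) {
                result.add(e);
            }
        }
        return result;
    }

    /** Vertices of G[0] as listed in the tables above. */
    static List<Vertex> g0Vertices() {
        return Arrays.asList(new Vertex(5, 0, 2), new Vertex(6, 0, 2), new Vertex(9, 0));
    }

    /** Edges of G[0] as listed in the tables above. */
    static List<Edge> g0Edges() {
        return Arrays.asList(
            new Edge(0, 5, 6, 0, 2), new Edge(1, 6, 5, 0, 2),
            new Edge(6, 9, 5, 0), new Edge(7, 9, 6, 0));
    }
}
```

With this data only Eve (vertex 9) remains, and no edge survives because every remaining edge points to an excluded vertex, which matches the empty tables at the end of the walkthrough.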

Use Case: Graph Business Intelligence

Use Case: Graph Business Intelligence
• Business intelligence is usually based on relational data warehouses
  - Enterprise data is integrated within a dimensional schema
  - Analysis is limited to predefined relationships
  - No support for relationship-oriented data mining
• Graph-based approach
  - Integrate data sources within an instance graph, preserving the original relationships between data objects (transactional and master data)
  - Determine subgraphs (business transaction graphs) related to business activities
  - Analyze subgraphs or entire graphs with aggregation queries, mining of relationship patterns, etc.
[Figure: star schema with a Facts table and the dimensions Dim 1, Dim 2 and Dim 3]

Prerequisites: Data Integration
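[Figure: integration pipeline]
• Data Sources and Enterprise Service Bus provide metadata and data
• (1) Metadata acquisition: metadata → Unified Metadata Graph (reviewed by a domain expert)
• (2) Graph integration: data → Integrated Instance Graph
• (3) Subgraph detection: Integrated Instance Graph → Business Transaction Graphs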

[Figure: integrated instance graph built from the two sources CIT and ERP]
Vertices:
• Employee (Name: Dave), Employee (Name: Alice), Employee (Name: Bob), Employee (Name: Carol)
• Ticket (Expense: 500)
• SalesQuotation, SalesOrder, 2 × PurchaseOrder
• SalesInvoice (Revenue: 5,000)
• PurchaseInvoice (Expense: 2,000), PurchaseInvoice (Expense: 1,500)
Edge labels: sentBy, createdBy, processedBy, openedFor, basedOn, serves, bills

[Figure: the integrated instance graph from the CIT and ERP sources again, with the same vertices and edge labels as on the previous slide.]

(1) BTG Extraction
[Figure: the instance graph is decomposed into business transaction graphs BTG 1, BTG 2, BTG 3, BTG 4, BTG 5, …, BTG n]

(1) BTG Extraction
// generate base collection
btgs = iig.callForCollection( :BusinessTransactionGraphs , {} )

(2) Profit Aggregation
[Figure: the integrated instance graph from the CIT and ERP sources; the Revenue and Expense properties of its vertices (Ticket Expense: 500, SalesInvoice Revenue: 5,000, PurchaseInvoice Expense: 2,000 and 1,500) are the inputs of the profit aggregation.]

// define profit aggregate function
aggFunc = ( Graph g =>
  g.V.values("Revenue").sum() - g.V.values("Expense").sum()
)

BTG   | ∑ Revenue | ∑ Expenses | Net Profit
BTG 1 | 5,000     | -3,000     | 2,000
BTG 2 | 9,000     | -3,000     | 6,000
BTG 3 | 2,000     | -1,500     | 500
BTG 4 | 5,000     | -7,000     | -2,000
BTG 5 | 10,000    | -15,000    | -5,000
…     | …         | …          | …
BTG n | 8,000     | -4,000     | 4,000

// define profit aggregate function
aggFunc = ( Graph g =>
  g.V.values("Revenue").sum() - g.V.values("Expense").sum()
)
// apply aggregate function and store result at new property
btgs = btgs.apply( Graph g =>
  g.aggregate( "Profit" , aggFunc )
)
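
A plain-Java reading of the profit aggregate function might look as follows. This is a sketch with hypothetical names, not GrALa or Gradoop code: each vertex is reduced to its property map, and the sample values are the Revenue and Expense properties visible in the instance graph figure.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Sketch: profit of a graph = sum of Revenue properties minus sum of Expense properties. */
final class ProfitSketch {

    /** Sums one numeric property over all vertices; vertices without it contribute 0. */
    static int sum(List<Map<String, Integer>> vertexProperties, String key) {
        int total = 0;
        for (Map<String, Integer> properties : vertexProperties) {
            total += properties.getOrDefault(key, 0);
        }
        return total;
    }

    /** Counterpart of aggFunc in the GrALa snippet above. */
    static int profit(List<Map<String, Integer>> vertexProperties) {
        return sum(vertexProperties, "Revenue") - sum(vertexProperties, "Expense");
    }

    /** Property maps of the vertices shown in the instance graph figure. */
    static List<Map<String, Integer>> exampleVertices() {
        return Arrays.asList(
            Collections.singletonMap("Expense", 500),     // Ticket
            Collections.singletonMap("Revenue", 5000),    // SalesInvoice
            Collections.singletonMap("Expense", 2000),    // PurchaseInvoice
            Collections.singletonMap("Expense", 1500),    // PurchaseInvoice
            Collections.<String, Integer>emptyMap());     // e.g. SalesOrder, no amounts
    }

    public static void main(String[] args) {
        System.out.println("Profit: " + profit(exampleVertices()));
    }
}
```

In the workflow, apply would run this function once per business transaction graph and store the result as the graph property Profit.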

(3) BTG Clustering
// select profit and loss clusters
profitBtgs = btgs.select( Graph g => g["Profit"] >= 0 )
lossBtgs = btgs.difference(profitBtgs)

(4) Cluster Characteristic Patterns
[Figure: the integrated instance graph and the table of BTGs with ∑ Revenue, ∑ Expenses and Net Profit from the previous steps, plus an example pattern with the vertices Ticket, Alice, Bob and PurchaseOrder and the edges processedBy and createdBy.]

// select profit and loss clusters
profitBtgs = btgs.select( Graph g => g["Profit"] >= 0 )
lossBtgs = btgs.difference(profitBtgs)
// apply magic
profitFreqPats = profitBtgs.callForCollection(
  :FrequentSubgraphs , {"Threshold":0.7}
)
lossFreqPats = lossBtgs.callForCollection(
  :FrequentSubgraphs , {"Threshold":0.7}
)
// determine cluster characteristic patterns
trivialPats = profitFreqPats.intersect(lossFreqPats)
profitCharPatterns = profitFreqPats.difference(trivialPats)
lossCharPatterns = lossFreqPats.difference(trivialPats)
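
The set logic of this workflow can be checked with a few lines of plain Java. The sketch below is an illustration, not Gradoop code: the profit values are taken from the BTG table, frequent subgraph mining is not implemented, and the pattern names used in the usage example are purely hypothetical placeholders for mined patterns.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Sketch: profit/loss clustering and characteristic patterns via set operations. */
final class ClusterPatternSketch {

    static <T> Set<T> intersect(Set<T> a, Set<T> b) {
        Set<T> result = new LinkedHashSet<>(a);
        result.retainAll(b);
        return result;
    }

    static <T> Set<T> difference(Set<T> a, Set<T> b) {
        Set<T> result = new LinkedHashSet<>(a);
        result.removeAll(b);
        return result;
    }

    /** Counterpart of btgs.select(g => g["Profit"] >= 0). */
    static Set<String> selectProfitable(Map<String, Integer> profitByBtg) {
        Set<String> result = new LinkedHashSet<>();
        for (Map.Entry<String, Integer> entry : profitByBtg.entrySet()) {
            if (entry.getValue() >= 0) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    /** Net profit of BTG 1 to BTG 5 from the table. */
    static Map<String, Integer> exampleProfits() {
        Map<String, Integer> profits = new LinkedHashMap<>();
        profits.put("BTG 1", 2000);
        profits.put("BTG 2", 6000);
        profits.put("BTG 3", 500);
        profits.put("BTG 4", -2000);
        profits.put("BTG 5", -5000);
        return profits;
    }

    static Set<String> setOf(String... items) {
        return new LinkedHashSet<>(Arrays.asList(items));
    }

    public static void main(String[] args) {
        Set<String> profitBtgs = selectProfitable(exampleProfits());
        Set<String> lossBtgs = difference(exampleProfits().keySet(), profitBtgs);
        System.out.println(profitBtgs + " / " + lossBtgs);

        // HYPOTHETICAL pattern names standing in for mined frequent subgraphs
        Set<String> profitFreqPats = setOf("patternA", "patternB");
        Set<String> lossFreqPats = setOf("patternB", "patternC");
        Set<String> trivialPats = intersect(profitFreqPats, lossFreqPats);
        System.out.println(difference(profitFreqPats, trivialPats));
        System.out.println(difference(lossFreqPats, trivialPats));
    }
}
```

Patterns that are frequent in both clusters are removed as trivial, so only the patterns that distinguish profitable from loss-making transactions remain.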

Graph Definition Language (Cypher for EPGM)
• Unit testing graph analytical operators can be hard
• Y U NO MAKE IT DECLARATIVE?
• Describe the expected output in the unit test
• FlinkAsciiGraphLoader
  - Creates LogicalGraphs and GraphCollections based on an ASCII graph
  - Based on Cypher: https://github.com/s1ck/gdl
  - Define vertices
    (alice:User {name = "Alice", age = 23})
  - Define edges
    (alice)-[e1:knows {since = 2014}]->(bob)
  - Define paths
    (alice)-->(bob)<--(eve)-->(carol)-->(alice)
  - Define graphs
    g1:Community {title = "Graphs", memberCount = 3}[
      (alice:User)-[:knows]->(bob:User)
      (bob)-[e:knows]->(eve:User)
      (eve)
    ]

LDBC-Flink-Import
• Linked Data Benchmark Council
  - MapReduce-based data generator for social network data
  - http://ldbcouncil.org/

LDBC-Flink-Import
• Makes LDBC output available in Flink DataSets
• https://github.com/s1ck/ldbc-flink-import
LDBCToFlink ldbcToFlink = new LDBCToFlink(
    "/path/to/ldbc/output", // or "hdfs://..."
    ExecutionEnvironment.getExecutionEnvironment());

DataSet<LDBCVertex> vertices = ldbcToFlink.getVertices();
DataSet<LDBCEdge> edges = ldbcToFlink.getEdges();

Current State
• 0.0.1 First Prototype (May 2015)
  - Hadoop MapReduce and Giraph for operator implementations
  - Too much complexity
  - Performance loss through serialization in HDFS/HBase
• 0.0.2 Using Flink as execution layer (June 2015)
  - Basic operators
• 0.1 Today
  - Improved ID handling
  - Improved property handling
  - More operator implementations (e.g. Equality, Bool operators)
  - Code refactoring
• 0.2-SNAPSHOT
  - Graph Pattern Matching
  - Frequent Subgraph Mining

Current State
Operators
                | Unary                                                             | Binary
LogicalGraph    | Aggregation, Pattern Matching, Projection, Summarization, Call *  | Combination, Overlap, Exclusion, Equality
GraphCollection | Top, Selection, Distinct, Sort, Apply *, Reduce *, Call *         | Equality, Union, Intersection, Difference
Algorithms
Gelly Library, BTG Extraction, Label Propagation, Graph Forecasting, Frequent Subgraphs
* auxiliary

Benchmark Preview
[Figure: Summarization (Vertex and Edge Labels); x-axis: # Worker (1, 2, 4, 8, 16), y-axis: Time [s] (0 to 1400)]
• 16x Intel(R) Xeon(R) CPU E5-2430 v2 @ 2.50GHz (12 Cores), 48 GB RAM
• Hadoop 2.5.2, Flink 0.9.0
  - slots (per node): 12
  - jobmanager.heap.mb: 2048
  - taskmanager.heap.mb: 40960
• Foodbroker Graph (https://github.com/dbs-leipzig/foodbroker)
  - Generates BI process data
  - 858,624,267 Vertices, 4,406,445,007 Edges, 663GB Payload

Contributions welcome!
• Code
  - Operator implementations
  - Performance Tuning
  - Extend HBase Storage
• Data! and Use Cases
  - We are researchers, we assume ...
  - Getting real data (especially BI data) is nearly impossible

Thank you!
www.gradoop.com
https://flink.apache.org
http://ldbcouncil.org/
http://dbs.uni-leipzig.de/file/GradoopTR.pdf
http://dbs.uni-leipzig.de/file/biiig-vldb2014.pdf
https://github.com/dbs-leipzig/gradoop
https://github.com/s1ck/gdl
https://github.com/s1ck/ldbc-flink-import
(https://github.com/s1ck/flink-neo4j)
