Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Distributed Graph Analytics with Gradoop

948 views

Published on

Presentation of the Gradoop Framework at the Graph Database Meetup in Munich (https://www.meetup.com/inovex-munich/events/231187528/). The talk is about the extended property graph model, its operators and how they are implemented on top of Apache Flink. The talk also includes some benchmark results on scalability (see www.gradoop.com)

Published in: Data & Analytics
  • Be the first to comment

Distributed Graph Analytics with Gradoop

  1. 1. Distributed Graph Analytics with Gradoop inovex Meetup Munich Let‘s talk about Graph Databases July 2016 Martin Junghanns (@kc1s) University of Leipzig – Database Research Group
  2. 2. Motivation Extended Property Graph Model Operators Benchmark Implementation
  3. 3. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 3 Motivation EPGM Operators BenchmarkImplementation 3 Motivation
  4. 4. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 4 Motivation EPGM Operators BenchmarkImplementation 4 Motivation 𝑮𝑟𝑎𝑝ℎ = (𝑽𝑒𝑟𝑡𝑖𝑐𝑒𝑠, 𝑬𝑑𝑔𝑒𝑠) „Graphs are everywhere“
  5. 5. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 5 Motivation EPGM Operators BenchmarkImplementation 5 Motivation Alice Bob Eve Dave Carol Mallory Peggy Trent 𝐺𝑟𝑎𝑝ℎ = (𝐔𝐬𝐞𝐫𝐬, 𝐹𝑜𝑙𝑙𝑜𝑤𝑒𝑟𝑠) „Graphs are everywhere“
  6. 6. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 6 Motivation EPGM Operators BenchmarkImplementation 6 Motivation 𝐺𝑟𝑎𝑝ℎ = (𝐔𝐬𝐞𝐫𝐬, 𝐹𝑟𝑖𝑒𝑛𝑑𝑠ℎ𝑖𝑝𝑠) Alice Bob Eve Dave Carol Mallory Peggy Trent „Graphs are everywhere“
  7. 7. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 7 Motivation EPGM Operators BenchmarkImplementation 7 Motivation Alice Bob AC/DC Dave Carol Mallory Peggy Metallica 𝐺𝑟𝑎𝑝ℎ = (𝐔𝐬𝐞𝐫𝐬 ∪ 𝐁𝐚𝐧𝐝𝐬, 𝐹𝑟𝑖𝑒𝑛𝑑𝑠ℎ𝑖𝑝𝑠 ∪ 𝐿𝑖𝑘𝑒𝑠) „Graphs are heterogeneous“
  8. 8. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 8 Motivation EPGM Operators BenchmarkImplementation 8 Motivation Alice Bob AC/DC Dave Carol Mallory Peggy Metallica 𝐺𝑟𝑎𝑝ℎ = (𝐔𝐬𝐞𝐫𝐬 ∪ 𝐁𝐚𝐧𝐝𝐬, 𝐹𝑟𝑖𝑒𝑛𝑑𝑠ℎ𝑖𝑝𝑠 ∪ 𝐿𝑖𝑘𝑒𝑠) „Graphs can be analyzed“
  9. 9. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 9 Motivation EPGM Operators BenchmarkImplementation 9 Motivation 0.2 0.28 0.26 0.33 0.25 0.26 Alice Bob AC/DC Dave Carol Mallory Peggy Metallica 3.6 2.82 𝐺𝑟𝑎𝑝ℎ = (𝐔𝐬𝐞𝐫𝐬 ∪ 𝐁𝐚𝐧𝐝𝐬, 𝐹𝑟𝑖𝑒𝑛𝑑𝑠ℎ𝑖𝑝𝑠 ∪ 𝐿𝑖𝑘𝑒𝑠) „Graphs can be analyzed“
  10. 10. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 10 Motivation EPGM Operators BenchmarkImplementation 10 Motivation Assuming a social network
  11. 11. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 11 Motivation EPGM Operators BenchmarkImplementation 11 Motivation Assuming a social network 1. Determine subgraph
  12. 12. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 12 Motivation EPGM Operators BenchmarkImplementation 12 Motivation Assuming a social network 1. Determine subgraph
  13. 13. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 13 Motivation EPGM Operators BenchmarkImplementation 13 Motivation Assuming a social network 1. Determine subgraph 2. Find communities
  14. 14. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 14 Motivation EPGM Operators BenchmarkImplementation 14 Motivation Assuming a social network 1. Determine subgraph 2. Find communities
  15. 15. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 15 Motivation EPGM Operators BenchmarkImplementation 15 Motivation Assuming a social network 1. Determine subgraph 2. Find communities 3. Filter communities
  16. 16. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 16 Motivation EPGM Operators BenchmarkImplementation 16 Motivation Assuming a social network 1. Determine subgraph 2. Find communities 3. Filter communities
  17. 17. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 17 Motivation EPGM Operators BenchmarkImplementation 17 Motivation Assuming a social network 1. Determine subgraph 2. Find communities 3. Filter communities 4. Find common subgraph
  18. 18. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 18 Motivation EPGM Operators BenchmarkImplementation 18 Motivation Assuming a social network 1. Determine subgraph 2. Find communities 3. Filter communities 4. Find common subgraph
  19. 19. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 19 Motivation EPGM Operators BenchmarkImplementation 19 Motivation Assuming a social network • Heterogeneous data 1. Determine subgraph • Apply graph transformation 2. Find communities • Handle collections of graphs 3. Filter communities • Aggregation, Selection 4. Find common subgraph • Apply dedicated algorithm
  20. 20. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 20 Motivation EPGM Operators BenchmarkImplementation 20 Motivation Assuming a social network • Heterogeneous data 1. Determine subgraph • Apply graph transformation 2. Find communities • Handle collections of graphs 3. Filter communities • Aggregation, Selection 4. Find common subgraph • Apply dedicated algorithm
  21. 21. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 21 Motivation EPGM Operators BenchmarkImplementation 21 Motivation Assuming a social network • Heterogeneous data 1. Determine subgraph • Apply graph transformation 2. Find communities • Handle collections of graphs 3. Filter communities • Aggregation, Selection 4. Find common subgraph • Apply dedicated algorithm
  22. 22. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 22 Motivation EPGM Operators BenchmarkImplementation 22 Motivation Assuming a social network • Heterogeneous data 1. Determine subgraph • Apply graph transformation 2. Find communities • Handle collections of graphs 3. Filter communities • Aggregation, Selection 4. Find common subgraph • Apply dedicated algorithm
  23. 23. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 23 Motivation EPGM Operators BenchmarkImplementation 23 Motivation Assuming a social network • Heterogeneous data 1. Determine subgraph • Apply graph transformation 2. Find communities • Handle collections of graphs 3. Filter communities • Aggregation, Selection 4. Find common subgraph • Apply dedicated algorithm
  24. 24. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 24 Motivation EPGM Operators BenchmarkImplementation 24 Motivation „And let‘s not forget …“
  25. 25. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 25 Motivation EPGM Operators BenchmarkImplementation 25 Motivation “...Graphs are large.”
  26. 26. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 26 Motivation EPGM Operators BenchmarkImplementation 26 Motivation „An open-source framework and research platform for efficient, distributed and domain independent management and analytics of heterogeneous graph data.“
  27. 27. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 27 Motivation EPGM Operators BenchmarkImplementation 27 Motivation Data Volume and Problem Complexity Ease-of-use Graph Processing Systems Graph Databases Graph Dataflow Systems Gelly
  28. 28. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 28 Motivation EPGM Operators BenchmarkImplementation 28 Motivation Distributed Graph Store (Apache HBase) Apache Flink Operator Implementation Apache Flink Distributed Operator Execution Extended Property Graph Model (EPGM) Graph Analytical Language (GrALa) I/O Distributed File System (Apache HDFS)
  29. 29. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 29 Motivation EPGM Operators BenchmarkImplementation 29 Extended Property Graph Model (EPGM)
  30. 30. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 30 Motivation EPGM Operators BenchmarkImplementationEPGM • Vertices and directed Edges
  31. 31. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 31 Motivation EPGM Operators BenchmarkImplementationEPGM • Vertices and directed Edges • Logical Graphs
  32. 32. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 32 Motivation EPGM Operators BenchmarkImplementationEPGM • Vertices and directed Edges • Logical Graphs • Identifiers 1 3 4 5 2 1 2 3 4 5 1 2
  33. 33. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 33 Motivation EPGM Operators BenchmarkImplementationEPGM • Vertices and directed Edges • Logical Graphs • Identifiers • Type Labels 1 3 4 5 2 1 2 3 4 5 Person Band Person Person Band likes likes likes knows likes 1|Community 2|Community
  34. 34. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 34 Motivation EPGM Operators BenchmarkImplementationEPGM • Vertices and directed Edges • Logical Graphs • Identifiers • Type Labels • Properties 1 3 4 5 2 1 2 3 4 5 Person name : Alice born : 1984 Band name : Metallica founded : 1981 Person name : Bob Person name : Eve Band name : AC/DC founded : 1973 likes since : 2014 likes since : 2013 likes since : 2015 knows likes since : 2014 1|Community|interest:Heavy Metal 2|Community|interest:Hard Rock
  35. 35. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 35 Motivation EPGM Operators BenchmarkImplementation 35 Operators
  36. 36. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 36 Motivation EPGM Operators BenchmarkImplementation 36 Operators Operators Unary Binary GraphCollectionLogicalGraph Algorithms Aggregation Pattern Matching Transformation Grouping Equality Call Combination Overlap Exclusion Equality Union Intersection Difference Flink Gelly Library BTG Extraction Frequent Subgraphs Limit Selection Distinct Sort Apply Reduce Call Adaptive Partitioning Subgraph
  37. 37. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 37 Motivation EPGM Operators BenchmarkImplementation 37 Operators 1 3 4 5 2 3 1 2 1 3 4 5 2 1 2 4 5 Combination Overlap Exclusion 3Basic Binary Operators LogicalGraph graph3 = graph1.combine(graph2); LogicalGraph graph4 = graph1.overlap(graph2); LogicalGraph graph5 = graph1.exclude(graph2);
  38. 38. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 38 Motivation EPGM Operators BenchmarkImplementation 38 Operators 1 3 4 5 2 3 1 3 4 5 2 3 | vertexCount: 5 UDF Aggregation graph3 = graph3.aggregate(“vertexCount”, new AggregateFunction<Long>() { public DataSet<Long> execute(LogicalGraph g) { return Count.count(g.getVertices()); } });
  39. 39. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 39 Motivation EPGM Operators BenchmarkImplementation 39 Operators UDF 3 | vertexCount: 5 name:Alice f_name:Bob1 3 4 5 2 3 | Community| vCount: 5 f_name:Alice f_name:Bob1 3 4 5 2 Transformation graph3 = graph3.transformEdges(new TransformationFunction<Edge>() { public Edge execute(Edge e) { e.setLabel(e.getLabel().equals(“orange”) ? “red” : e.getLabel()); return e; }});
  40. 40. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 40 Motivation EPGM Operators BenchmarkImplementation 40 Operators 3 1 3 4 5 2 3 4 1 2 3 4 1 2 4 3 5 2 UDF UDF UDF Subgraph LogicalGraph graph4 = graph3.subgraph( new FilterFunction<Vertex>() { public boolean execute(Vertex v) { return v.getLabel().equals(“green”); }}, new FilterFunction<Edge>() { public boolean execute(Edge e) { return e.getLabel().equals(“orange”); }});
  41. 41. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 41 Motivation EPGM Operators BenchmarkImplementation 41 Operators 3 1 3 4 5 2 Pattern 4 5 1 3 4 2 Graph Collection Pattern Matching GraphCollection collection = graph3.match(“(:Green)-[:orange]->(:Orange)”);
  42. 42. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 42 Motivation EPGM Operators BenchmarkImplementation 42 Operators Keys 3 1 3 4 5 2 +Aggregate 3 a:23 a:84 a:42 a:12 1 3 4 5 2 a:13 a:21 4 count:2 count:3 max(a):42 max(a):84 max(a):13 max(a):21 6 7 4 6 7 Grouping LogicalGraph grouped = graph3.groupBy() .useVertexLabel() .useEdgeLabel() .addVertexAggregate(new CountAggregator()) .addEdgeAggregate(new MaxAggregator(“a”));
  43. 43. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 43 Motivation EPGM Operators BenchmarkImplementation 43 Operators Operator 1 2 0 2 3 4 1 5 7 86 1 | vertexCount: 5 2 | vertexCount: 4 0 2 3 4 1 5 7 86 Apply (e.g. Aggregation) collection = collection.apply(new Aggregation<>(“vertexCount”, new VertexCount()));
  44. 44. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 44 Motivation EPGM Operators BenchmarkImplementation 44 Operators UDF vertexCount > 4 1 | vertexCount: 5 2 | vertexCount: 4 0 2 3 4 1 5 7 86 1 | vertexCount: 5 0 2 3 4 1 Selection GraphCollection filtered = collection.select(new FilterFunction<GraphHead>() { public boolean filter(GraphHead g) { return g.getPropertyValue(“vertexCount”).getLong() > 4L; } });
  45. 45. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 45 Motivation EPGM Operators BenchmarkImplementation 45 Operators Algorithm 1 0 2 3 4 1 5 7 86 2 3 0 2 3 4 1 5 7 86 Call (e.g. Clustering) GraphCollection clustering = graph.callForCollection(new ClusteringAlgorithm());
  46. 46. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 46 Motivation EPGM Operators BenchmarkImplementation 46 Operators Algorithm 2 rank:0.11 rank:0.25 rank:0.11 rank:1.29 rank:1.29 rank:1.58rank:0.11 rank:0.75rank:0.11 0 2 3 4 1 5 7 86 1 0 2 3 4 1 5 7 86 Call (e.g. Page Rank) LogicalGraph pageRankGraph = graph.callForGraph(new PageRankAlgorithm());
  47. 47. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 47 Motivation EPGM Operators BenchmarkImplementation 47 Implementation Apache Flink Gradoop on Flink
  48. 48. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 48 Motivation EPGM Operators BenchmarkImplementation 48 Implementation Apache Flink Gradoop on Flink „Streaming Dataflow Engine that provides data distribution, communication and fault tolerance for distributed computations over data streams.“ https://flink.apache.org/ Streaming Dataflow Runtime DataSet DataStream HadoopMR Table Gelly FlinkML Table Zeppelin Cascading MRQL Dataflow Storm Dataflow SAMOA GRADOOP Cluster (e.g. YARN)Local Cloud (e.g. EC2) Batch Stream Data Storage (e.g. Files, HDFS, S3, JDBC, HBase, Kafka, …)
  49. 49. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 49 Motivation EPGM Operators BenchmarkImplementation 49 Implementation Apache Flink Gradoop on Flink DataSetDataSetDataSet DataSetDataSetDataSet DataSetDataSetDataSet DataSetDataSetDataSet DataSetDataSetDataSet DataSetDataSetDataSet • DataSet := Distributed Collection of Data Objects • Transformation := Operation on DataSets (Higher-order function) • Flink Programm := Composition of Transformations DataSet DataSet DataSet Transformation Transformation DataSet DataSet Transformation DataSet Flink Program
  50. 50. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 50 Motivation EPGM Operators BenchmarkImplementation 50 Implementation Apache Flink Gradoop on Flink Hadoop-like Transformations • map • flatMap • mapPartition • reduce • reduceGroup • coGroup Special Flink Operations • iterate • iterateDelta SQL-like Transformations • filter • project • cross • union • distinct • first-N (limit) • groupBy • aggregate • join • leftOuterJoin • rightOuterJoin • fullOuterJoin
  51. 51. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 51 Motivation EPGM Operators BenchmarkImplementation 51 Implementation Apache Flink Gradoop on Flink 1: ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); 2: 3: DataSet<String> text = env.fromElements( // or env.readTextFile(„hdfs://…“) 4: „He who controls the past controls the future.“, 5: „He who controls the present controls the past.“); 6: 7: DataSet<Tuple2<String, Integer>> wordCounts = text 8: .flatMap(new LineSplitter()) // splits the line and outputs (word, 1) tuples 9: .groupBy(0) 10: .sum(1); 11: 12: wordCounts.print(); // trigger execution flatMap „He who controls the past controls the future.“ „He who controls the present controls the past.“ (He,1) (who,1) (controls,1) (the,1) (past,1) // ... groupBy(0) [(He,1),(He,1)] [(who,1),(who,1)] [(future,1)] [(past,1),(past,1)] [(present,1)] // ... sum(1) (He,2) (who,2) (future,1) (past,2) (present,1) // ...
  52. 52. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 52 Motivation EPGM Operators BenchmarkImplementation 52 Implementation Apache Flink Gradoop on Flink flatMap (He,1) (who,1) (controls,1) groupBy(0) [(He,1),(He,1)] [(who,1),(who,1)] sum(1) (He,2) (who,2) Source flatMap (the,1) (past,1) groupBy(0) [(future,1)] [(past,1),(past,1)] sum(1) (future,1) (past,2) flatMap (future,1) (past,1) groupBy(0) [(present,1)] sum(1) (present,1) Sink
  53. 53. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 53 Motivation EPGM Operators BenchmarkImplementation 53 Implementation Apache Flink Gradoop on Flink Id Label Properties Graphs Id Label Properties SourceId TargetId Graphs EPGMGraphHead EPGMVertex EPGMEdge Id Label Properties POJO POJO POJO DataSet<EPGMGraphHead> DataSet<EPGMVertex> DataSet<EPGMEdge> Id Label Properties Graphs EPGMVertex GradoopId := UUID 128-bit String PropertyList := List<Property> Property := (String, PropertyValue) PropertyValue := byte[] GradoopIdSet := Set<GradoopId> EPGM Graph Representation
  54. 54. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 54 Motivation EPGM Operators BenchmarkImplementation 54 Implementation Apache Flink Gradoop on Flink Id Label Properties 1 Community {interest:Heavy Metal} 2 Community {interest:Hard Rock} Id Label Properties Graphs 1 Person {name:Alice, born:1984} {1} 2 Band {name:Metallica,founded:1981} {1} 3 Person {name:Bob} {1,2} 4 Band {name:AC/DC,founded:1973} {2} 5 Person {name:Eve} {2} Id Label Source Target Properties Graphs 1 likes 1 2 {since:2014} {1} 2 likes 3 2 {since:2013} {1} 3 likes 3 4 {since:2015} {2} 4 knows 3 5 {} {2} 5 likes 5 4 {since:2014} {2} likes since : 2014 likes since : 2013 1 3 4 5 2 1|Community|interest:Heavy Metal 2|Community|interest:Hard Rock Person name : Alice born : 1984 Band name : Metallica founded : 1981 Person name : Bob Person name : Eve Band name : AC/DC founded : 1973likes since : 2015 knows likes since : 2014 1 2 3 4 5 DataSet<EPGMGraphHead> DataSet<EPGMVertex> DataSet<EPGMEdge>
  55. 55. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 55 Motivation EPGM Operators BenchmarkImplementation 55 Implementation Apache Flink Gradoop on Flink LogicalGraph grouped = graph1.combine(graph2).groupBy() .useVertexLabel() .useEdgeLabel() .addVertexAggregate(new CountAggregator()) .addEdgeAggregate(new CountAggregator()); 6 7 Person count : 3 Band count : 2 likes count : 4 knows count : 1 6 7 4 likes since : 2014 likes since : 2013 1 3 4 5 2 1|Community|interest:Heavy Metal 2|Community|interest:Hard Rock Person name : Alice born : 1984 Band name : Metallica founded : 1981 Person name : Bob Person name : Eve Band name : AC/DC founded : 1973likes since : 2015 knows likes since : 2014 1 2 3 4 5
  56. 56. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 56 Motivation EPGM Operators BenchmarkImplementation 56 Implementation Apache Flink Gradoop on Flink GroupBy(1,2,3) + GC + GR* + Map Assign edges to groups Compute aggregates Build super edges Filter + Map Extract super vertex tuples Build super vertices GroupBy(1) + GroupReduce* Assign vertices to groups Compute aggregates Create super vertex tuples Forward updated group members V E (1,[Person],[]) (2,[Band],[]) (3,[Person],[]) (4,[Band],[]) (5,[Person],[]) (-,6,[Person],[3]) (1,6,[],[]) (-,7,[Band],[2]) (2,7,[],[]) (3,6,[],[]) (4,7,[],[]) (5,6,[],[]) v6 v7 (1,6) (2,7) (3,6) (4,7) (5,6) (1,1,2,[likes],[]) (2,3,2,[likes],[]) (3,3,4,[likes],[]) (4,3,5,[knows],[]) (5,5,4,[likes],[]) (1,6,7,[likes],[]) (2,6,7,[likes],[]) (3,6,7,[likes],[]) (4,6,6,[knows],[]) (5,6,7,[likes],[]) e6 e7 Map Extract attributes Filter + Map Extract group members Reduce memory footprint Join* Replace Source/TargetId with corresponding super vertex id Map Extract attributes *requires worker communication
  57. 57. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 57 Motivation EPGM Operators BenchmarkImplementation 57 Implementation Apache Flink Gradoop on Flink class LogicalGraph<G extends EPGMGraphHead, V extends EPGMVertex, E extends EPGMEdge> { fromCollections(...) : LogicalGraph<G, V, E> fromDataSets(...) : LogicalGraph<G, V, E> fromGellyGraph(...) : LogicalGraph<G, V, E> getGraphHead() : DataSet<G> getVertices() : DataSet<V> getEdges() : DataSet<E> aggregate(...) : LogicalGraph<G, V, E> match(...) : GraphCollection<G, V, E> groupBy(...) : LogicalGraph<G, V, E> subgraph(...) : LogicalGraph<G, V, E> combine(...) : LogicalGraph<G, V, E> // ... } class GraphCollection<G extends EPGMGraphHead, V extends EPGMVertex, E extends EPGMEdge > { fromCollections(...) : GraphCollection<G, V, E> fromDataSets(...) : GraphCollection<G, V, E> getGraphHeads() : DataSet<G> getVertices() : DataSet<V> getEdges() : DataSet<E> select(...) : GraphCollection<G, V, E> distinct( ) : GraphCollection<G, V, E> sortBy(...) : GraphCollection<G, V, E> union(...) : GraphCollection<G, V, E> difference(...) : GraphCollection<G, V, E> // ... } EPGM API (Operators)
  58. 58. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 58 Motivation EPGM Operators BenchmarkImplementation 58 Implementation Apache Flink Gradoop on Flink interface DataSource<G extends EPGMGraphHead, V extends EPGMVertex, E extends EPGMEdge> { getLogicalGraph(...) : LogicalGraph<G, V, E> getGraphCollection(...) : GraphCollection<G, V, E> } interface DataSink<G extends EPGMGraphHead, V extends EPGMVertex, E extends EPGMEdge > { write(LogicalGraph<G, V, E>) : void write(GraphCollection<G, V, E>) : void } class GraphDataSource<...> implements DataSource<...> { } class HBaseDataSource<...> implements DataSource<...> { } class JSONDataSource<...> implements DataSource<...> { } class TLFDataSource<...> implements DataSource<...> { } class HBaseDataSink<...> implements DataSink<...> { } class JSONDataSink<...> implements DataSink<...> { } class TLFDataSink<...> implements DataSource<...> { } EPGM API (I/O)
  59. 59. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 59 Motivation EPGM Operators BenchmarkImplementation 59 Benchmark
  60. 60. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 60 Motivation EPGM Operators BenchmarkImplementation 60 Benchmark 1. Extract subgraph containing only Persons and knows relations 2. Transform Persons to necessary information 3. Find communities using Label Propagation 4. Aggregate vertex count for each community 5. Select communities with more than 50K users 6. Combine large communities to a single graph 7. Group graph by Persons location and gender 8. Aggregate vertex and edge count of grouped graph http://ldbcouncil.org/
  61. 61. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 61 Motivation EPGM Operators BenchmarkImplementation 61 Benchmark 1. Extract subgraph containing only Persons and knows relations 2. Transform Persons to necessary information 3. Find communities using Label Propagation 4. Aggregate vertex count for each community 5. Select communities with more than 50K users 6. Combine large communities to a single graph 7. Group graph by Persons location and gender 8. Aggregate vertex and edge count of grouped graph https://git.io/vgozj
  62. 62. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 62 Motivation EPGM Operators BenchmarkImplementation 62 Benchmark Dataset # Vertices # Edges Disk size Graphalytics.1 61,613 2,026,082 570 MB Graphalytics.10 260,613 16,600,778 4.5 GB Graphalytics.100 1,695,613 147,437,275 40.2 GB Graphalytics.1000 12,775,613 1,363,747,260 372 GB Graphalytics.10000 90,025,613 10,872,109,028 2.9 TB • 16x Intel(R) Xeon(R) 2.50GHz 6 (12) • 16x 48 GB RAM • 1 Gigabit Ethernet • Hadoop 2.6.0 • Flink 1.0-SNAPSHOT • slots (per worker) 12 • jobmanager.heap.mb 2048 • taskmanager.heap.mb 40960
  63. 63. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 63 Motivation EPGM Operators BenchmarkImplementation 63 Benchmark Dataset # Vertices # Edges Disk size Graphalytics.1 61,613 2,026,082 570 MB Graphalytics.10 260,613 16,600,778 4.5 GB Graphalytics.100 1,695,613 147,437,275 40.2 GB Graphalytics.1000 12,775,613 1,363,747,260 372 GB Graphalytics.10000 90,025,613 10,872,109,028 2.9 TB • 16x Intel(R) Xeon(R) 2.50GHz 6 (12) • 16x 48 GB RAM • 1 Gigabit Ethernet • Hadoop 2.6.0 • Flink 1.0-SNAPSHOT • slots (per worker) 12 • jobmanager.heap.mb 2048 • taskmanager.heap.mb 40960 0 200 400 600 800 1000 1200 1 2 4 8 16 Runtime[s] Number of workers Graphalytics.100
  64. 64. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 64 Motivation EPGM Operators BenchmarkImplementation 64 Benchmark 1 2 4 8 16 1 2 4 8 16 Speedup Number of workers Graphalytics.100 Linear Dataset # Vertices # Edges Disk size Graphalytics.1 61,613 2,026,082 570 MB Graphalytics.10 260,613 16,600,778 4.5 GB Graphalytics.100 1,695,613 147,437,275 40.2 GB Graphalytics.1000 12,775,613 1,363,747,260 372 GB Graphalytics.10000 90,025,613 10,872,109,028 2.9 TB • 16x Intel(R) Xeon(R) 2.50GHz 6 (12) • 16x 48 GB RAM • 1 Gigabit Ethernet • Hadoop 2.6.0 • Flink 1.0-SNAPSHOT • slots (per worker) 12 • jobmanager.heap.mb 2048 • taskmanager.heap.mb 40960
  65. 65. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 65 Motivation EPGM Operators BenchmarkImplementation 65 Benchmark 1 10 100 1000 10000 Runtime[s] Dataset # Vertices # Edges Disk size Graphalytics.1 61,613 2,026,082 570 MB Graphalytics.10 260,613 16,600,778 4.5 GB Graphalytics.100 1,695,613 147,437,275 40.2 GB Graphalytics.1000 12,775,613 1,363,747,260 372 GB Graphalytics.10000 90,025,613 10,872,109,028 2.9 TB • 16x Intel(R) Xeon(R) 2.50GHz 6 (12) • 16x 48 GB RAM • 1 Gigabit Ethernet • Hadoop 2.6.0 • Flink 1.0-SNAPSHOT • slots (per worker) 12 • jobmanager.heap.mb 2048 • taskmanager.heap.mb 40960
  66. 66. Summary • 0.0.1 First Prototype (May 2015) – Hadoop MapReduce and Giraph for operator implementations – Too much complexity – Performance loss through serialization in HDFS/HBase • 0.0.2 Using Flink as execution layer (June 2015) – Basic operators • 0.1 December 2015 – System-side identifiers (UUID) – Improved property handling – More operator implementations (e.g., Equality, Bool operators) – Code refactoring • 0.2-SNAPSHOT August 2016 – Graph Pattern Matching  – Frequent Subgraph Mining  – Memory optimization (96-bit ID, Dictionary Encoding, …) – Refactoring Release History
  67. 67. Summary Contributions welcome! • Code • I/O Formats (GraphML, DOT, …) • Operators and Algorithms • Tuning (Memory consumption, serialization, …) • API improvements • Use cases and data • Business Intelligence • Fraud Detection • Pattern Mining • …
  68. 68. • Extended Property Graph Model • Schema flexible: Type Labels and Properties • Logical Graphs / Graphs Collection • Graph and Collection Operators • Combination to analytical workflows • Implemented on Apache Flink • Built-in scalability • Combine with other libraries Summary
  69. 69. www.gradoop.com [1] Junghanns, M.; Petermann, A.; Teichmann, N.; Gomez, K.; Rahm, E., „Analyzing Extended Property Graphs with Apache Flink“, Int. Workshop on Network Data Analytics (NDA), SIGMOD 2016. [2] Petermann, A.; Junghanns, M., „Scalable Business Intelligence with Graph Collections“, it – Special Issue on Big Data Analytics, 2016. [3] Petermann, A.; Junghanns, M.; Müller, M.; Rahm, E., „Graph-based Data Integration and Business Intelligence with BIIIG“, Proc. VLDB Conf. (Demo), 2014.

×