Apache Flink & Graph Processing

Graph processing with Gelly at the London Apache Flink meetup.

Published in: Data & Analytics

Apache Flink & Graph Processing

  1. 1. Batch & Stream Graph Processing with Apache Flink Vasia Kalavri vasia@apache.org @vkalavri Apache Flink Meetup London October 5th, 2016
  2. 2. Graphs capture relationships between data items: connections, interactions, purchases, dependencies, friendships, etc. Application areas: recommenders, social networks, bioinformatics, web search.
  3. 3. Outline • Distributed Graph Processing 101 • Gelly: Batch Graph Processing with Apache Flink • BREAK! • Gelly-Stream: Continuous Graph Processing with Apache Flink
  4. 4. Apache Flink • An open-source, distributed data analysis framework • True streaming at its core • Streaming & Batch API
     [diagram: event logs arrive through Kafka, RabbitMQ, ... for streaming jobs (low latency, windowing, aggregations, ...); historic data is read from HDFS, JDBC, ... for batch jobs (ETL, graphs, machine learning, relational, ...)]
  5. 5. WHEN DO YOU NEED DISTRIBUTED GRAPH PROCESSING?
  6. 6. MY GRAPH IS SO BIG, IT DOESN’T FIT IN A SINGLE MACHINE Big Data Ninja MISCONCEPTION #1
  7. 7. A SOCIAL NETWORK
  8. 8. NAIVE WHO(M)-TO-FOLLOW ▸ compute a friends-of-friends list per user ▸ exclude existing friends ▸ rank by common connections
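The three steps on this slide can be sketched on a single machine as follows. This is a minimal illustration, not code from the talk: the class name, the sample users, and the tie-breaking rule are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Naive who(m)-to-follow on a tiny in-memory graph (illustrative sketch). */
class NaiveWhoToFollow {

    /** Returns follow candidates for user, most common connections first. */
    static List<String> recommend(Map<String, Set<String>> follows, String user) {
        Set<String> friends = follows.getOrDefault(user, Collections.<String>emptySet());
        // Step 1: walk the friends-of-friends list and count, per candidate,
        // how many of the user's friends lead to it.
        Map<String, Integer> common = new TreeMap<>();
        for (String friend : friends) {
            for (String candidate : follows.getOrDefault(friend, Collections.<String>emptySet())) {
                // Step 2: exclude the user and existing friends.
                if (!candidate.equals(user) && !friends.contains(candidate)) {
                    common.merge(candidate, 1, Integer::sum);
                }
            }
        }
        // Step 3: rank by common connections. The sort is stable, so ties keep
        // the alphabetical order of the TreeMap keys.
        List<String> ranked = new ArrayList<>(common.keySet());
        ranked.sort((a, b) -> common.get(b) - common.get(a));
        return ranked;
    }

    /** Hypothetical sample data: ann follows bob and cat, and so on. */
    static Map<String, Set<String>> sampleGraph() {
        Map<String, Set<String>> g = new HashMap<>();
        g.put("ann", new HashSet<>(Arrays.asList("bob", "cat")));
        g.put("bob", new HashSet<>(Arrays.asList("dan", "eve")));
        g.put("cat", new HashSet<>(Arrays.asList("dan", "ann")));
        return g;
    }

    /** Recommendations for ann as a comma-separated string. */
    static String demo() {
        return String.join(",", recommend(sampleGraph(), "ann"));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The inner loop visits every two-hop path that starts at the user, so the intermediate friends-of-friends data can be much larger than the input edge list when vertices have many neighbors.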
  9. 9. DON’T JUST CONSIDER YOUR INPUT GRAPH SIZE. INTERMEDIATE DATA MATTERS TOO!
  10. 10. DISTRIBUTED PROCESSING IS ALWAYS FASTER THAN SINGLE-NODE Data Science Rockstar MISCONCEPTION #2
  11. 11. GRAPHS DON’T APPEAR OUT OF THIN AIR Expectation…
  12. 12. GRAPHS DON’T APPEAR OUT OF THIN AIR Reality!
  13. 13. WHEN DO YOU NEED DISTRIBUTED GRAPH PROCESSING? ▸ When you do have really big graphs ▸ When the intermediate data is big ▸ When your data is already distributed ▸ When you want to build end-to-end graph pipelines
  14. 14. HOW DO WE EXPRESS A DISTRIBUTED GRAPH ANALYSIS TASK?
  15. 15. RECENT DISTRIBUTED GRAPH PROCESSING HISTORY 2004 MapReduce Pegasus 2009 Pregel 2010 Signal-Collect PowerGraph 2012 Iterative value propagation Giraph++ 2013 Graph Traversals NScale 2014 Ego-network analysis Arabesque 2015 Pattern Matching Tinkerpop
  16. 16. PREGEL: THINK LIKE A VERTEX [figure: example graph with vertices 1 to 5 and per-vertex adjacency lists: 1 -> 3, 4; 2 -> 1, 4; 5 -> 3; ...]
  17. 17. PREGEL: SUPERSTEPS (Vi+1, outbox) <- compute(Vi, inbox) [figure: the per-vertex adjacency lists at superstep i and at superstep i+1]
  18. 18. PAGERANK: THE WORD COUNT OF GRAPH PROCESSING
     VertexID | Out-degree | Transition Probability
     1        | 2          | 1/2
     2        | 2          | 1/2
     3        | 0          | -
     4        | 3          | 1/3
     5        | 1          | 1
     [figure: example graph with vertices 1 to 5]
  19. 19. PAGERANK: THE WORD COUNT OF GRAPH PROCESSING (same table and graph as slide 18) PR(3) = 0.5*PR(1) + 0.33*PR(4) + PR(5)
  23. 23. PREGEL EXAMPLE: PAGERANK
     void compute(messages):
       // sum up received messages
       sum = 0.0
       for (m <- messages) do
         sum = sum + m
       end for
       // update vertex rank
       setValue(0.15/numVertices() + 0.85*sum)
       // distribute rank to neighbors (numEdges: this vertex's number of out-edges)
       for (edge <- getOutEdges()) do
         sendMessageTo(edge.target(), getValue()/numEdges)
       end for
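To make the superstep loop concrete, here is a minimal single-machine simulation of the compute function above. It is a sketch under stated assumptions, not the Pregel or Gelly API: the class and method names are invented for the example, the out-edges of vertices 1, 2 and 5 follow the adjacency lists on the slides, and the targets of vertex 4 other than 3 are assumed. Because messages are only ever summed, the inbox stores one running sum per vertex.

```java
import java.util.Arrays;

/** Single-machine simulation of Pregel-style supersteps for PageRank (sketch). */
class PregelPageRankSketch {
    static final int N = 5;
    // OUT[v] lists the targets of v's out-edges (index 0 is unused). The targets of
    // vertices 1, 2 and 5 follow the adjacency lists on the slides; the targets of
    // vertex 4 other than 3 are an ASSUMPTION made for this example.
    static final int[][] OUT = { {}, {3, 4}, {1, 4}, {}, {1, 3, 5}, {3} };

    /** Delivers rank/outDegree along every out-edge; inbox[v] is the sum v receives. */
    static double[] deliver(double[] rank) {
        double[] inbox = new double[N + 1];
        for (int v = 1; v <= N; v++) {
            for (int target : OUT[v]) {
                inbox[target] += rank[v] / OUT[v].length;
            }
        }
        return inbox;
    }

    /** Runs the given number of supersteps; returns ranks indexed by vertex ID. */
    static double[] run(int supersteps) {
        double[] rank = new double[N + 1];
        Arrays.fill(rank, 1, N + 1, 1.0 / N);   // uniform initial ranks
        double[] inbox = deliver(rank);           // messages sent before the first update
        for (int s = 0; s < supersteps; s++) {
            for (int v = 1; v <= N; v++) {
                // compute(): the inbox already holds the summed messages
                rank[v] = 0.15 / N + 0.85 * inbox[v];
            }
            inbox = deliver(rank);                // barrier: messages arrive next superstep
        }
        return rank;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run(20)));
    }
}
```

After one superstep, vertex 3 holds 0.15/5 + 0.85 * (0.5*0.2 + (1/3)*0.2 + 0.2), which is the PR(3) formula from the earlier slides applied to uniform initial ranks.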
  24. 24. SIGNAL-COLLECT Each superstep has two phases. Signal: outbox <- signal(Vi). Collect: Vi+1 <- collect(inbox). [figure: the per-vertex adjacency lists during the signal and collect phases of superstep i, and at superstep i+1]
  25. 25. SIGNAL-COLLECT EXAMPLE: PAGERANK
     void signal():
       // distribute rank to neighbors
       for (edge <- getOutEdges()) do
         sendMessageTo(edge.target(), getValue()/numEdges)
       end for
     void collect(messages):
       // sum up messages
       sum = 0.0
       for (m <- messages) do
         sum = sum + m
       end for
       // update vertex rank
       setValue(0.15/numVertices() + 0.85*sum)
  26. 26. GATHER-SUM-APPLY (POWERGRAPH) [figure: in each superstep a vertex gathers values from its neighbors, sums them, and applies the result to its own value; shown for superstep i and superstep i+1]
  27. 27. GSA EXAMPLE: PAGERANK
     double gather(source, edge, target):
       // compute partial rank
       return target.value() / target.numEdges()
     double sum(rank1, rank2):
       // combine partial ranks
       return rank1 + rank2
     double apply(sum, currentRank):
       // update rank
       return 0.15 + 0.85*sum
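The three GSA functions can be exercised with a small single-machine driver. This is a hedged sketch rather than PowerGraph or Gelly code: the class name and the driver loop are invented, each vertex gathers from its in-neighbors, and the example graph is partly assumed (only the out-edges of vertices 1, 2 and 5 come from the slides).

```java
/** Gather-sum-apply PageRank on a tiny graph, one superstep at a time (sketch). */
class GsaPageRankSketch {
    static final int N = 5;
    // OUT[v] lists the targets of v's out-edges (index 0 is unused). The targets
    // of vertex 4 other than 3 are an ASSUMPTION made for this example.
    static final int[][] OUT = { {}, {3, 4}, {1, 4}, {}, {1, 3, 5}, {3} };

    /** Gather: the partial rank a neighbor contributes along one of its out-edges. */
    static double gather(double neighborRank, int neighborOutDegree) {
        return neighborRank / neighborOutDegree;
    }

    /** Sum: combines two partial ranks; associative and commutative. */
    static double sum(double rank1, double rank2) {
        return rank1 + rank2;
    }

    /** Apply: turns the combined partial ranks into the new rank. */
    static double apply(double sum) {
        return 0.15 + 0.85 * sum;
    }

    /** One GSA superstep over all vertices; rank is indexed by vertex ID. */
    static double[] superstep(double[] rank) {
        double[] acc = new double[N + 1];
        for (int u = 1; u <= N; u++) {
            for (int v : OUT[u]) {
                // v gathers from its in-neighbor u and folds the value into its accumulator
                acc[v] = sum(acc[v], gather(rank[u], OUT[u].length));
            }
        }
        double[] next = new double[N + 1];
        for (int v = 1; v <= N; v++) {
            next[v] = apply(acc[v]);
        }
        return next;
    }

    public static void main(String[] args) {
        double[] rank = {0, 1, 1, 1, 1, 1};
        for (int i = 0; i < 20; i++) {
            rank = superstep(rank);
        }
        System.out.println(java.util.Arrays.toString(rank));
    }
}
```

Because sum is associative and commutative, partial sums can be computed in any order and on any machine before apply runs, which is the restriction listed for GSA in the comparison slide.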
  28. 28. PREGEL VS. SIGNAL-COLLECT VS. GSA
     Model          | Update Function Properties | Update Function Logic      | Communication Scope | Communication Logic
     Pregel         | arbitrary                  | arbitrary                  | any vertex          | arbitrary
     Signal-Collect | arbitrary                  | based on received messages | any vertex          | based on vertex state
     GSA            | associative & commutative  | based on neighbors’ values | neighborhood        | based on vertex state
  29. 29. CAN WE HAVE IT ALL? ▸ Data pipeline integration: built on top of an efficient distributed processing engine ▸ Graph ETL: high-level API with abstractions and methods to transform graphs ▸ Familiar programming model: support popular programming abstractions
  30. 30. Gelly the Apache Flink Graph API
  31. 31. Apache Flink Stack [diagram: libraries and compatibility layers (Gelly, Table/SQL, ML, SAMOA, Hadoop M/R, Dataflow, Cascading, CEP) sit on the DataSet (Java/Scala) and DataStream (Java/Scala) APIs, which run on the streaming dataflow runtime; deployment options: Local, Remote, Yarn, Embedded]
  32. 32. Meet Gelly • Java & Scala Graph APIs on top of Flink’s DataSet API • Components: Transformations and Utilities, Iterative Graph Processing, Graph Library [diagram: Gelly next to FlinkML, the Table API, ... on top of the Scala API and Java API (batch and streaming) and Flink Core]
  33. 33. Gelly is NOT • a graph database • a specialized graph processor
  34. 34. Hello, Gelly!
     Java:
       ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
       DataSet<Edge<Long, NullValue>> edges = getEdgesDataSet(env);
       Graph<Long, Long, NullValue> graph = Graph.fromDataSet(edges, env);
       DataSet<Vertex<Long, Long>> verticesWithMinIds = graph.run(
           new ConnectedComponents(maxIterations));
     Scala:
       val env = ExecutionEnvironment.getExecutionEnvironment
       val edges: DataSet[Edge[Long, NullValue]] = getEdgesDataSet(env)
       val graph = Graph.fromDataSet(edges, env)
       val components = graph.run(new ConnectedComponents(maxIterations))
  35. 35. Graph Methods
     Graph Properties: getVertexIds, getEdgeIds, numberOfVertices, numberOfEdges, getDegrees
     Mutations: add vertex/edge, remove vertex/edge
     Transformations: map, filter, join, subgraph, union, difference, reverse, undirected, getTriplets
     Generators: R-Mat (power-law), Grid, Star, Complete, …
  36. 36. Example: mapVertices
 val graph = Graph.fromDataSet(...)
 
 // increment each vertex value by one
 val updatedGraph = graph.mapVertices(v => v.getValue + 1)
 [figure: example graph with its vertex values before and after the update]
  37. 37. Example: subGraph val graph: Graph[Long, Long, Long] = ...
 
 // keep only vertices with positive values
 // and only edges with negative values
 val subGraph = graph.subgraph( vertex => vertex.getValue > 0, edge => edge.getValue < 0 )
  38. 38. Neighborhood Methods Apply a reduce function to the 1st-hop neighborhood of each vertex in parallel:
     graph.reduceOnNeighbors(new MinValue, EdgeDirection.OUT)
  39. 39. What makes Gelly unique? • Batch graph processing on top of a streaming dataflow engine • Built for end-to-end analytics • Support for multiple iteration abstractions • Graph algorithm building blocks • A large open-source library of graph algorithms
  40. 40. Why streaming dataflow? • Batch engines materialize data… even if they don’t have to • the graph is always loaded and materialized in memory, even if not needed, e.g. mapping, filtering, transformation • Communication and computation overlap • We can do continuous graph processing (more after the break!)
  41. 41. End-to-end analytics • Graphs don’t appear out of thin air… • We need to support pre- and post-processing • Gelly can be easily mixed with the DataSet API: pre-processing, graph analysis, and post-processing in the same Flink program
  42. 42. Iterative Graph Processing • Gelly offers iterative graph processing abstractions on top of Flink’s Delta iterations • vertex-centric • scatter-gather • gather-sum-apply • partition-centric*
  43. 43. Flink Iteration Operators [diagram: one iteration feeds the Input through an Iterative Update Function and replaces it with the Result each round; the other keeps a Workset and a Solution Set as State and updates them through the Iterative Update Function]
  44. 44. Optimization • the runtime is aware of the iterative execution • no scheduling overhead between iterations • caching and state maintenance are handled automatically [diagram labels: push work “out of the loop”; cache loop-invariant data; maintain state as index]
  45. 45. Vertex-Centric SSSP
     final class SSSPComputeFunction extends ComputeFunction {
       override def compute(vertex: Vertex, messages: MessageIterator) = {
         // the source starts at distance 0, every other vertex at "infinity"
         var minDistance = if (vertex.getId == srcId) 0.0 else Double.MaxValue
         // keep the smallest distance received
         while (messages.hasNext) {
           val msg = messages.next
           if (msg < minDistance) minDistance = msg
         }
         // on improvement, store the new distance and propagate it
         if (vertex.getValue > minDistance) {
           setNewVertexValue(minDistance)
           for (edge: Edge <- getEdges)
             sendMessageTo(edge.getTarget, minDistance + edge.getValue)
         }
       }
     }
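The behaviour of this compute function across supersteps can be traced with a small single-machine simulation. It is a sketch, not the Gelly runtime: the class name, the edge list, and the weights are hypothetical, weights are assumed non-negative, and the per-vertex inbox keeps only the smallest message because the function only needs the minimum.

```java
import java.util.Arrays;

/** Single-machine simulation of vertex-centric SSSP (illustrative sketch). */
class VertexCentricSsspSketch {
    // Hypothetical weighted edges as {source, target, weight}.
    static final int[][] EDGES = { {1, 2, 4}, {1, 3, 1}, {3, 2, 2}, {2, 4, 5} };

    /** Returns shortest distances from srcId, indexed by vertex ID (1..n). */
    static double[] run(int n, int srcId) {
        final double inf = Double.POSITIVE_INFINITY;
        double[] dist = new double[n + 1];
        Arrays.fill(dist, inf);
        double[] inbox = new double[n + 1];   // smallest message per vertex; inf = none
        Arrays.fill(inbox, inf);
        for (int superstep = 0; ; superstep++) {
            double[] outbox = new double[n + 1];
            Arrays.fill(outbox, inf);
            boolean sent = false;
            for (int v = 1; v <= n; v++) {
                // After superstep 0, only vertices that received a message are active.
                if (superstep > 0 && inbox[v] == inf) continue;
                double minDistance = Math.min(v == srcId ? 0.0 : inf, inbox[v]);
                if (dist[v] > minDistance) {
                    dist[v] = minDistance;                    // setNewVertexValue
                    for (int[] e : EDGES) {
                        if (e[0] == v) {                      // sendMessageTo along out-edges
                            outbox[e[1]] = Math.min(outbox[e[1]], minDistance + e[2]);
                            sent = true;
                        }
                    }
                }
            }
            if (!sent) return dist;            // no messages in flight: converged
            inbox = outbox;
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run(4, 1)));
    }
}
```

With source vertex 1, vertex 2 first receives distance 4 over the direct edge and is later corrected to 3 through vertex 3, which in turn lowers vertex 4 from 9 to 8. Messages carry the newly computed distance, not the value the vertex held at the start of the superstep.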
  46. 46. Algorithm building blocks • Allow operator re-use across graph algorithms when processing the same input with a similar configuration
  47. 47. Library of Algorithms • PageRank • Single Source Shortest Paths • Label Propagation • Weakly Connected Components • Community Detection • Triangle Count & Enumeration • Local and Global Clustering Coefficient • HITS • Jaccard & Adamic-Adar Similarity • Graph Summarization
     Example: val ranks = inputGraph.run(new PageRank(0.85, 20))
  48. 48. Web Tracking [diagram: two trackers and an ad server; labels: profiling, cookie exchange, display relevant ads]
  49. 49. Can’t we block them? [diagram: a proxy, a legitimate site, two trackers, and an ad server]
  50. 50. Available Solutions Crowd-sourced “black lists” of tracker URLs (AdBlock, DoNotTrack, EasyPrivacy): • not frequently updated • unclear who blacklists URLs, or based on what criteria • miss “hidden” trackers or dual-role nodes • blocking requires manual matching against the list • can you buy your way into the whitelist?
  51. 51. DataSet • 6 months (Nov 2014 - April 2015) of augmented Apache logs from a web proxy • 80m requests, 2m distinct URLs, 3k users
  52. 52. The Hosts-Projection Graph [figure: the bipartite referer-hosts graph (U: referers r1 to r7, V: hosts h1 to h8) next to its hosts-projection graph; legend: referer, non-tracker host (NT), tracker host (T), unlabeled host (?)]
  53. 53. Classification via Label Propagation [figure legend: non-tracker, tracker, unlabeled]
  54. 54. Data Pipeline
     1: logs pre-processing (raw logs to cleaned logs)
     2: bipartite graph creation
     3: largest connected component extraction
     4: hosts-projection graph creation
     5: community detection
     6: results, e.g. google-analytics.com: T, bscored-research.com: T, facebook.com: NT, github.com: NT, cdn.cxense.com: NT, ...
     [stage labels on the slide, in order: DataSet API, Gelly, DataSet API]
  55. 55. Feeling Gelly? • Gelly Guide https://ci.apache.org/projects/flink/flink-docs-master/libs/gelly_guide.html • To Petascale and Beyond @Flink Forward ’16 http://flink-forward.org/kb_sessions/to-petascale-and-beyond-apache-flink-in-the-clouds/ • Web Tracker Detection @Flink Forward ’15 https://www.youtube.com/watch?v=ZBCXXiDr3TU paper: Kalavri, Vasiliki, et al. "Like a pack of wolves: Community structure of web trackers." International Conference on Passive and Active Network Measurement, 2016.
  56. 56. Gelly-Stream single-pass stream graph processing with Flink
  57. 57. Real Graphs are dynamic Graphs are created from events happening in real-time
  58. 58. How we’ve done graph processing so far 1. Load: read the graph from disk and partition it in memory
  59. 59. How we’ve done graph processing so far 1. Load: read the graph from disk and partition it in memory 2. Compute: read and mutate the graph state
  60. 60. How we’ve done graph processing so far 1. Load: read the graph from disk and partition it in memory 2. Compute: read and mutate the graph state 3. Store: write the final graph state back to disk
  61. 61. What’s wrong with this model? • It is slow • wait until the computation is over before you see any result • pre-processing and partitioning • It is expensive • lots of memory and CPU required in order to scale • It requires re-computation for graph changes • no efficient way to deal with updates
  62. 62. Can we do graph processing on streams? • Maintain the dynamic graph structure • Provide up-to-date results with low latency • Compute on fresh state only
  63. 63. Single-pass graph streaming • Each event is an edge addition • Maintain only a graph summary • Recent events are grouped in graph windows
  64. 64. Graph Summaries • spanners for distance estimation • sparsifiers for cut estimation • sketches for homomorphic properties [diagram: the algorithm run on the graph yields R1; the algorithm run on the graph summary yields R2, with R1 ~ R2]
  65. 65. Batch Connected Components, i=0 [figure: vertices 1 to 8, each labeled with its own ID]
  66. 66. Batch Connected Components, i=0 [figure: each vertex sends its label to its neighbors]
  67. 67. Batch Connected Components, i=1 [figure: the vertex labels are now 1, 2 and 6]
  68. 68. Batch Connected Components, i=1 [figure: the updated labels are sent to the neighbors]
  69. 69. Batch Connected Components, i=2 [figure: final labels, 1 on five vertices and 6 on three vertices]
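The iterations pictured on these slides correspond to minimum-label propagation: every vertex repeatedly adopts the smallest label among itself and its neighbors until nothing changes. The sketch below is a single-machine illustration with an invented class name; the edge list is an assumption for the example (the same vertex pairs appear as the edge stream on the streaming slides) and may differ from the graph drawn here.

```java
import java.util.Arrays;

/** Batch connected components by iterative minimum-label propagation (sketch). */
class BatchComponentsSketch {
    // Undirected edges over vertices 1..8; an ASSUMED edge list for this example.
    static final int[][] EDGES = {
        {3, 1}, {5, 2}, {4, 2}, {8, 6}, {7, 6}, {5, 4}, {4, 3}, {8, 7}, {4, 1}
    };

    /** Returns the component label of every vertex, indexed by vertex ID. */
    static int[] run(int n) {
        int[] label = new int[n + 1];
        for (int v = 0; v <= n; v++) {
            label[v] = v;                       // i=0: every vertex starts with its own ID
        }
        while (true) {
            // One synchronous iteration: read the old labels, write the new ones.
            int[] next = label.clone();
            for (int[] e : EDGES) {
                next[e[0]] = Math.min(next[e[0]], label[e[1]]);
                next[e[1]] = Math.min(next[e[1]], label[e[0]]);
            }
            if (Arrays.equals(next, label)) {
                return label;                   // no label changed: converged
            }
            label = next;
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run(8)));
    }
}
```

Every iteration scans the full edge list again, which is the multi-pass behaviour that the streaming approach on the following slides avoids.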
  70. 70. Stream Connected Components • Graph Summary: Disjoint Set (Union-Find) • Only store component IDs and vertex IDs [figure: edge stream (5,4) (7,6) (8,6) (4,2) (3,1) (5,2)]
  71. 71. [figure: edge (4,3) joins the stream; summary: Cid 1 = {1, 3}]
  72. 72. [figure: edge (8,7) joins the stream; summary: Cid 1 = {1, 3}, Cid 2 = {2, 5}]
  73. 73. [figure: edge (4,1) joins the stream; summary: Cid 1 = {1, 3}, Cid 2 = {2, 5, 4}]
  74. 74. [figure: summary: Cid 1 = {1, 3}, Cid 2 = {2, 5, 4}, Cid 6 = {6, 7}]
  75. 75. [figure: summary: Cid 1 = {1, 3}, Cid 2 = {2, 5, 4}, Cid 6 = {6, 7, 8}]
  76. 76. [figure: edge (3,1) has left the stream; summary unchanged]
  77. 77. [figure: edge (5,2) has left the stream; summary unchanged]
  78. 78. [figure: components 1 and 2 are merged; summary: Cid 1 = {1, 3, 2, 5, 4}, Cid 6 = {6, 7, 8}]
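The disjoint-set summary on these slides can be sketched as a union-find structure that sees each edge exactly once and never stores it. This is an illustrative stand-in, not the Gelly-Stream DisjointSet class: the class and method names are invented, the arrival order of the edges is my reading of the figures, and choosing the smaller root as the component ID is an assumption that happens to reproduce the Cid values shown.

```java
import java.util.HashMap;
import java.util.Map;

/** Single-pass connected components over an edge stream with union-find (sketch). */
class StreamingComponentsSketch {
    // The only state kept: vertex ID -> parent vertex ID. Edges are never stored.
    private final Map<Integer, Integer> parent = new HashMap<>();

    /** Returns the component ID (root) of v, registering v on first sight. */
    int find(int v) {
        parent.putIfAbsent(v, v);
        int root = v;
        while (parent.get(root) != root) {
            root = parent.get(root);
        }
        // Path compression: point every vertex on the path directly at the root.
        while (parent.get(v) != root) {
            int next = parent.get(v);
            parent.put(v, root);
            v = next;
        }
        return root;
    }

    /** Processes one edge addition from the stream. */
    void addEdge(int u, int v) {
        int ru = find(u);
        int rv = find(v);
        if (ru != rv) {
            // The smaller root wins, so a component ID is its smallest vertex ID.
            parent.put(Math.max(ru, rv), Math.min(ru, rv));
        }
    }

    /** Feeds the example stream and returns the component ID of vertices 1..8. */
    static String demo() {
        int[][] stream = {
            {3, 1}, {5, 2}, {4, 2}, {8, 6}, {7, 6}, {5, 4}, {4, 3}, {8, 7}, {4, 1}
        };
        StreamingComponentsSketch cc = new StreamingComponentsSketch();
        for (int[] e : stream) {
            cc.addEdge(e[0], e[1]);
        }
        StringBuilder sb = new StringBuilder();
        for (int v = 1; v <= 8; v++) {
            if (v > 1) sb.append(',');
            sb.append(cc.find(v));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The state grows with the number of vertices rather than the number of edges, and a result is available after every edge instead of only after a full batch run.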
  79. 79. Distributed Stream Connected Components
  80. 80. Stream Connected Components with Flink DataStream<DisjointSet> cc = edgeStream
 .keyBy(0)
 .timeWindow(Time.of(100, TimeUnit.MILLISECONDS))
 .fold(new DisjointSet(), new UpdateCC())
 .flatMap(new Merger())
 .setParallelism(1);
  81. 81. Stream Connected Components with Flink (same code as slide 80): .keyBy(0) partitions the edge stream
  82. 82. Stream Connected Components with Flink (same code as slide 80): .timeWindow(Time.of(100, TimeUnit.MILLISECONDS)) defines the merging frequency
  83. 83. Stream Connected Components with Flink (same code as slide 80): .fold(new DisjointSet(), new UpdateCC()) merges locally
  84. 84. Stream Connected Components with Flink (same code as slide 80): .flatMap(new Merger()).setParallelism(1) merges globally
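The four annotated stages (partition, window, local merge, global merge) can be imitated without Flink to show why merging local summaries gives the right answer. This is a plain-Java sketch under several assumptions: all names are invented, partitioning uses a simple modulo instead of Flink's key hashing, the given edges stand for a single window, and Summary is a stand-in for the DisjointSet on the slides.

```java
import java.util.Map;
import java.util.TreeMap;

/** Partitioned stream connected components for one window of edges (sketch). */
class PartitionedComponentsSketch {

    /** Minimal union-find; a stand-in for the DisjointSet named on the slides. */
    static final class Summary {
        final Map<Integer, Integer> parent = new TreeMap<>();

        int find(int v) {
            parent.putIfAbsent(v, v);
            while (parent.get(v) != v) {
                v = parent.get(v);
            }
            return v;
        }

        void union(int u, int v) {
            int ru = find(u);
            int rv = find(v);
            if (ru != rv) {
                parent.put(Math.max(ru, rv), Math.min(ru, rv));
            }
        }

        /** Merges another summary by replaying its (vertex, parent) pairs as edges. */
        void merge(Summary other) {
            for (Map.Entry<Integer, Integer> e : other.parent.entrySet()) {
                union(e.getKey(), e.getValue());
            }
        }
    }

    /** Processes one window of edges with the given number of parallel partitions. */
    static Summary processWindow(int[][] edges, int parallelism) {
        Summary[] local = new Summary[parallelism];
        for (int p = 0; p < parallelism; p++) {
            local[p] = new Summary();
        }
        for (int[] e : edges) {
            // Partition: edges with the same source vertex go to the same partition.
            // Local merge: each partition folds its edges into its own summary.
            local[Math.floorMod(e[0], parallelism)].union(e[0], e[1]);
        }
        // Global merge: a single task (parallelism 1) combines all local summaries.
        Summary global = new Summary();
        for (Summary s : local) {
            global.merge(s);
        }
        return global;
    }

    public static void main(String[] args) {
        int[][] edges = {
            {3, 1}, {5, 2}, {4, 2}, {8, 6}, {7, 6}, {5, 4}, {4, 3}, {8, 7}, {4, 1}
        };
        Summary s = processWindow(edges, 3);
        System.out.println(s.find(5) + " " + s.find(8));
    }
}
```

Only the compact summaries travel to the merging task, so the single-parallelism stage handles one small object per partition and window rather than every edge.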
  85. 85. Gelly on Streams
     Gelly (on DataSet): • Static Graphs • Multi-Pass Algorithms • Full Computations
     Gelly-Stream (on DataStream): • Dynamic Graphs • Single-Pass Algorithms • Approximate Computations
     [diagram: both APIs run on the Distributed Dataflow layer, above Deployment]
  86. 86. Introducing Gelly-Stream Gelly-Stream enriches the DataStream API with two new ADTs: • GraphStream: • A representation of a data stream of edges. • Edges can have state (e.g. weights). • Supports property streams, transformations and aggregations. • GraphWindow: • A “time-slice” of a graph stream. • It enables neighborhood aggregations.
  87. 87. GraphStream Operations
     Property Streams (GraphStream -> DataStream): .getEdges() .getVertices() .numberOfVertices() .numberOfEdges() .getDegrees() .inDegrees() .outDegrees()
     Transformations (GraphStream -> GraphStream): .mapEdges() .distinct() .filterVertices() .filterEdges() .reverse() .undirected() .union()
  88. 88. Graph Stream Aggregations
     graphStream.aggregate(new MyGraphAggregation(window, fold, combine, transform))
     [diagram: the edges of the graph stream are folded per window into local summaries, which are combined (reduce) into a global summary; the result is an aggregate property stream. Global aggregates can be persistent or transient.]
  89. 89. Slicing Graph Streams graphStream.slice(Time.of(1, MINUTE)); [timeline: 11:40, 11:41, 11:42, 11:43]
  90. 90. Aggregating Slices • Slicing collocates edges by vertex information • Neighborhood aggregations on sliced graphs
     graphStream.slice(Time.of(1, MINUTE), direction)
     Aggregations: .reduceOnEdges(); .foldNeighbors(); .applyOnNeighbors();
     [diagram labels: source, target]
  91. 91. Finding Matches Nearby
     graphStream.filterVertices(GraphGeeks())
       .slice(Time.of(15, MINUTE), EdgeDirection.IN)
       .applyOnNeighbors(FindPairs())
     GraphStream :: graph geek check-ins: wendy checked_in soap_bar; steve checked_in soap_bar; tom checked_in joe’s_grill; sandra checked_in soap_bar; rafa checked_in joe’s_grill
     GraphWindow :: user-place (after slice): soap bar: wendy, steve, sandra; joe’s grill: tom, rafa
     FindPairs output: {wendy, steve} {steve, sandra} {wendy, sandra} {tom, rafa}
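What slice and applyOnNeighbors compute in this example can be reproduced in plain Java: group the check-ins of one time slice by place, then emit every pair of users who share a place. This is a sketch of the logic only, not the Gelly-Stream API; the class and method names are invented, and the five check-ins are treated as one 15-minute slice.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Pairs of users who checked in at the same place within one slice (sketch). */
class FindPairsSketch {

    /** Each check-in is {user, place}; returns the pairs per place in arrival order. */
    static List<String> findPairs(String[][] checkIns) {
        // Slice with incoming edges: collocate the users of every place.
        Map<String, List<String>> usersByPlace = new LinkedHashMap<>();
        for (String[] checkIn : checkIns) {
            usersByPlace.computeIfAbsent(checkIn[1], place -> new ArrayList<>()).add(checkIn[0]);
        }
        // Apply on neighbors: emit every unordered pair within a neighborhood.
        List<String> pairs = new ArrayList<>();
        for (List<String> users : usersByPlace.values()) {
            for (int i = 0; i < users.size(); i++) {
                for (int j = i + 1; j < users.size(); j++) {
                    pairs.add("{" + users.get(i) + ", " + users.get(j) + "}");
                }
            }
        }
        return pairs;
    }

    /** Runs the check-ins from the slide and joins the pairs into one string. */
    static String demo() {
        String[][] checkIns = {
            {"wendy", "soap_bar"}, {"steve", "soap_bar"}, {"tom", "joe's_grill"},
            {"sandra", "soap_bar"}, {"rafa", "joe's_grill"}
        };
        return String.join("; ", findPairs(checkIns));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The sketch yields the same four pairs as the slide, although the order within the soap bar group differs because pairs are generated in arrival order.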
  92. 92. Feeling Gelly? • Gelly Guide https://ci.apache.org/projects/flink/flink-docs-master/libs/gelly_guide.html • Gelly-Stream Repository https://github.com/vasia/gelly-streaming • Gelly-Stream talk @FOSDEM16 https://fosdem.org/2016/schedule/event/graph_processing_apache_flink/ • Related Papers http://www.citeulike.org/user/vasiakalavri/tag/graph-streaming
  93. 93. Batch & Stream Graph Processing with Apache Flink Vasia Kalavri vasia@apache.org @vkalavri Apache Flink Meetup London October 5th, 2016
