
Xia Zhu – Intel at MLconf ATL


Streaming and Online Algorithms for GraphX

GraphX is a resilient distributed graph processing framework on Apache Spark. It is designed for, and is good at, analysis of static graphs, but it does not yet support analysis of time-evolving graphs. In this talk, I will present graph processing research on streaming enhancements for GraphX, which may be used in either pure stream-processing or lambda architectures. I will describe an architecture design and demonstrate how it works with three machine learning algorithms, with a detailed evaluation and analysis of performance and scalability.



  1. Streaming and Online Algorithms for GraphX. Graph Analytics Team. Xia (Ivy) Zhu. Intel Confidential — Do Not Forward
  2. Why Streaming Processing on Graphs? Every day: • New stores join • New users join • New users browse, click, and buy items • Old users browse, click, and buy items • New ads are added • … How to: • Recommend products based on users’ interests • Recommend products based on users’ shopping habits • Recommend products based on users’ purchasing capability • Place ads that users are most likely to click • … A huge number of relationships is created every day; utilizing them wisely is important
  3. Alibaba Is Not Alone: Graphs Are Everywhere • 100B neurons, 100T relationships • 1.23B users, 160B friendships • 1 trillion pages, 100s of trillions of links • Millions of products and users • 50M users, 1B hours/month watched • Large biological cell networks
  4. … And Graphs Keep Evolving
  5. Streaming Processing Pipeline: Data Stream -> ETL -> Graph Creation -> ML, built on a distributed messaging system • We are using Kafka for distributed messaging • GraphX as the graph processing engine (see the sketch below)
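As a rough illustration of this pipeline, the sketch below wires Kafka into Spark Streaming and builds a GraphX graph per window. It is not the talk's implementation: the ZooKeeper endpoint, topic name, record format ("srcId,dstId,weight"), and window lengths are all assumed.

```scala
// Illustrative sketch of the slide-5 pipeline: Kafka -> ETL -> graph creation -> ML.
// Endpoint, topic, record format, and window sizes are hypothetical.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.graphx.{Edge, Graph}

object StreamingGraphPipeline {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("graph-streaming")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Distributed messaging: consume edge records from a Kafka topic.
    val records = KafkaUtils
      .createStream(ssc, "zkhost:2181", "graph-consumers", Map("edges" -> 1))
      .map(_._2)

    // ETL + graph creation, once per window.
    records.window(Seconds(60)).foreachRDD { rdd =>
      val edges = rdd.map { line =>
        val Array(src, dst, w) = line.split(',') // assumed "srcId,dstId,weight"
        Edge(src.toLong, dst.toLong, w.toDouble)
      }
      val graph = Graph.fromEdges(edges, defaultValue = 0.0)
      // The ML stage would run here, e.g. an online PageRank over `graph`.
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```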
  6. What is GraphX? • A graph processing engine on Spark • Supports Pregel-style vertex programming • Unifies data-parallel and graph-parallel processing (a vertex-programming example follows) Picture source: GraphX team
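The Pregel-style vertex programming model can be seen in the canonical single-source shortest-paths example from the GraphX programming guide, sketched below; it assumes an existing SparkContext `sc`, and the generated test graph and source vertex are placeholders.

```scala
// Pregel-style vertex programming in GraphX: single-source shortest paths.
import org.apache.spark.graphx.{Graph, VertexId}
import org.apache.spark.graphx.util.GraphGenerators

val graph: Graph[Long, Double] =
  GraphGenerators.logNormalGraph(sc, numVertices = 100).mapEdges(e => e.attr.toDouble)
val sourceId: VertexId = 42L // placeholder source vertex

// Start every vertex at infinity except the source.
val init = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val sssp = init.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist),        // vertex program
  triplet =>                                             // send messages
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    else Iterator.empty,
  (a, b) => math.min(a, b)                               // message combiner
)
```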
  7. Why GraphX? • GraphLab performs well, but is a standalone system • Giraph is open source and scales well, but its performance is not good • GraphX supports both table and graph operations • On the same platform, Spark Streaming provides a basic streaming framework [Diagram: the Spark stack: Spark core (RDDs, transformations, and actions), Spark Streaming (DStreams: streams of RDDs, real-time), Spark SQL (SchemaRDDs), MLlib (RDD-based matrices, machine learning), and GraphX (RDD-based graphs, graph processing/machine learning). Picture source: Databricks]
  8. Naïve Streaming Does Not Scale • Current GraphX is designed for static graphs • Current Spark Streaming provides only limited types of stateful DStreams • Naïve approach: • Merge table data before it enters the graph processing pipeline • Re-generate the whole graph and re-run ML at each window • Minimal changes to GraphX and Spark Streaming • Straightforward, but does not scale well [Chart: throughput vs. latency of naïve graph streaming; latency (s) over sample points 1–9]
  9. Our Solution • Static algorithms -> online algorithms • Merge information at the graph phase • An efficient graph store for evolving graphs • Better partitioning algorithms to reduce replicas • Static index -> on-the-fly indexing method (ongoing)
  10. Static vs. Online Algorithms • Static algorithms • Re-compute the whole graph at each time instance and re-run ML • Becomes increasingly infeasible in the Big Data era, given the size and growth rate of graphs • Online algorithms • Incremental machine learning is triggered by changes in the graph • We designed delta-update-based online algorithms • PageRank as an example (next slide) • The same idea applies to other machine learning algorithms
  11. Static vs. Online PageRank

     Static_PageRank
     // initial vertex value: (0.0, 0.0)
     // first message
     initialMessage: msg = alpha / (1.0 - alpha)
     // broadcast to neighbors
     sendMessage:
       if (edge.srcAttr._2 > tol)
         Iterator((edge.dstId, edge.srcAttr._2 * edge.attr))
     // aggregate messages for each vertex
     messageCombiner(a, b): sum = a + b
     // update vertex
     vertexProgram(sum):
       updates = (1.0 - alpha) * sum
       (oldPR + updates, updates)

     Online_PageRank
     // initialize vertex values
     base graph: (0.0, 0.0)
     incremental graph:
       old vertices: (lastWindowPR, lastWindowDelta)
       new vertices: (alpha, alpha)
     // first message
     initialMessage:
       base graph: msg = alpha / (1.0 - alpha)
       incremental graph: none
     // broadcast to neighbors
     sendMessage:
       oldSrc -> newDst: Iterator((edge.dstId, (edge.srcAttr._1 - alpha) * edge.attr))
       newSrc -> newDst, or not converged: Iterator((edge.dstId, edge.srcAttr._2 * edge.attr))
     // aggregate messages for each vertex
     messageCombiner(a, b): sum = a + b
     // update vertex
     vertexProgram(sum):
       updates = (1.0 - alpha) * sum
       (oldPR + updates, updates)

     (a runnable Scala sketch follows)
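For reference, the static column above closely mirrors GraphX's own Pregel-based PageRank. The sketch below is a hedged Scala rendering of it; the function name and default alpha are illustrative. Per the slide, the online variant would seed old vertices with (lastWindowPR, lastWindowDelta) and new vertices with (alpha, alpha) instead of starting from (0.0, 0.0).

```scala
// Hedged Scala rendering of the slide's static PageRank vertex program,
// close to GraphX's Pregel-based runUntilConvergence.
import org.apache.spark.graphx._

def staticPageRank(g: Graph[Double, Double], tol: Double,
                   alpha: Double = 0.15): Graph[(Double, Double), Double] = {
  // Vertex value = (rank, last delta); edge attr = 1/outDegree weight.
  val init = g.mapVertices((_, _) => (0.0, 0.0))

  init.pregel(initialMsg = alpha / (1.0 - alpha))(
    // Vertex program: fold the aggregated message sum into the rank.
    (id, attr, msgSum) => {
      val delta = (1.0 - alpha) * msgSum
      (attr._1 + delta, delta)
    },
    // Send message: propagate only while the source's delta exceeds tol.
    triplet =>
      if (triplet.srcAttr._2 > tol)
        Iterator((triplet.dstId, triplet.srcAttr._2 * triplet.attr))
      else Iterator.empty,
    // Message combiner.
    (a, b) => a + b
  )
}
```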
  12. GraphX Data Loading and Data Structure [Diagram: edge lists (SrcId, DstId, Data) are re-hash-partitioned into an EdgeRDD of EdgePartitions (data plus index); the VertexRDD holds RoutingTablePartitions (RoutingTableMessage: HasSrcId, HasDstId) and ShippableVertexPartitions (Vid, Data, Mask); GraphImpl ties them together through the replicated vertex view]
  13. GraphX Data Loading and Data Structure [Same diagram, annotated: the EdgePartition index is static; the partitioning algorithm can help reduce the replication factor]
  14. Partitioning Algorithm • Torus-based partitioning • Divide the overall partitions into an A x B matrix • A vertex’s master partition is decided by a hash function • The replica set is the full column of the master partition, plus ⌊B/2⌋ + 1 elements of the master’s row, starting from the master partition • The intersection between the source’s replica set and the target’s replica set decides where an edge is placed (see the sketch below)
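A minimal sketch of this placement rule, with an assumed 4 x 6 grid and illustrative helper names; the row segment of ⌊B/2⌋ + 1 columns is what guarantees that any two replica sets intersect, so every edge has a home.

```scala
// Toy model of torus-based partitioning on an A x B grid (shape assumed).
val A = 4; val B = 6

def master(vid: Long): (Int, Int) = {
  val p = ((vid % (A * B)) + A * B).toInt % (A * B) // hash vid into [0, A*B)
  (p / B, p % B)                                    // (row, column)
}

// Replica set: the master's full column, plus floor(B/2) + 1 consecutive
// row elements starting at the master (wrapping around the torus).
def replicas(vid: Long): Set[(Int, Int)] = {
  val (r, c) = master(vid)
  val column = (0 until A).map(row => (row, c)).toSet
  val rowSeg = (0 to B / 2).map(k => (r, (c + k) % B)).toSet
  column ++ rowSeg
}

// The row segment covers more than half the columns, so either vertex's
// segment always meets the other's column: the intersection is non-empty.
def place(src: Long, dst: Long): (Int, Int) =
  (replicas(src) & replicas(dst)).head
```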
  15. Index Structure for Graph Streaming • GraphX uses a CSR (Compressed Sparse Row)-based index • Originated from sparse matrix compression • Good for finding all out-edges of a source vertex • No support for finding all in-edges of a target vertex; that needs a full table scan • At a minimum, we need to add CSC (Compressed Sparse Column) to index in-edges

     Raw edge list (Src, Dst): (3,2) (3,5) (3,9) (5,2) (5,3) (7,3) (8,5) (8,6) (10,6)

     CSR: unique Src = [3, 5, 7, 8, 10], Idx = [0, 3, 5, 6, 8], Dst = [2, 5, 9, 2, 3, 3, 5, 6, 6]

     CSC: unique Dst = [2, 3, 5, 6, 9], Idx = [0, 2, 4, 6, 8], Src = [3, 5, 5, 7, 3, 8, 8, 10, 3]

     (a small construction sketch follows)
  16. Index Structure for Graph Streaming • Both CSR and CSC require first sorting the edge lists and then building the index • An even better way is to build the index on the fly • Graph streaming needs to support both fast insert/write and fast search/read • HashMap • Good for exact-match point search • Fast on insert and search • Good for graphs of fixed/known size • Needs re-hashing when size surpasses capacity • Trees: B-Tree, LSM-Tree (Log-Structured Merge Tree), COLA (Cache-Oblivious Lookahead Array) • Support both point search and range search • B-Tree: fast search, slow insert • LSM-Tree: fast insert, slow search • COLA achieves a good tradeoff: fast insert and good-enough search -> a COLA-based index for graph streaming (a toy sketch follows)
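To make the COLA idea concrete, here is a toy sketch of its level-doubling structure; it is not the talk's index, only an illustration of why inserts are cheap (sequential merges) while lookups stay logarithmic per level.

```scala
// Toy COLA: log(N) levels of sorted arrays, level k holding 2^k keys or
// none. Inserts cascade merges downward; lookups binary-search each level.
import scala.collection.mutable.ArrayBuffer

class ToyCola {
  private val levels = ArrayBuffer[Array[Long]]()

  def insert(key: Long): Unit = {
    var carry = Array(key)
    var k = 0
    // Merge full levels into the carry until an empty level is found.
    while (k < levels.length && levels(k).nonEmpty) {
      carry = (levels(k) ++ carry).sorted
      levels(k) = Array.empty
      k += 1
    }
    if (k == levels.length) levels += carry else levels(k) = carry
  }

  def contains(key: Long): Boolean =
    levels.exists(l => java.util.Arrays.binarySearch(l, key) >= 0)
}

// Usage: val idx = new ToyCola; idx.insert(5L); idx.contains(5L) == true
```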
  17. Putting Things Together: Our Streaming Pipeline [Diagram: the pipeline chains online ML (OML) stages across successive windows: OML + OML + OML + OML + OML …]
  18. Performance: Convergence Rate [Chart: normalized number of iterations vs. graph size (number of edges), from Base to +200%, naïve vs. incremental]
  19. Performance: Communication Overhead [Chart: normalized number of messages sent vs. graph size (number of edges), from Base to +200%, naïve vs. incremental]
  20. Ongoing and Future Work • Working on online versions of ML algorithms in different categories • Performance evaluation of the various online algorithms • Complete the on-the-fly indexing work • Performance evaluation of different indexing methods
  21. Intel Confidential — Do Not Forward
