Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF

Abstract: How graphs became just another big data primitive

Graph-shaped data is used in product recommendation systems, social network analysis, network threat detection, image de-noising, and many other important applications. A growing number of these applications will benefit from parallel distributed processing for graph feature engineering, model training, and model serving. But today’s graph tools are riddled with limitations, such as a lack of language bindings, streaming support, and seamless integration with other popular data services. In this talk, we’ll argue that the key to doing more with graphs is doing less with specialized systems and more with systems already good at handling data of other shapes. We’ll examine some practical data science workflows to further motivate this argument, and we’ll talk about some of the things that Intel is doing with the open source community and industry to make graphs just another big data primitive.

  1. 1. How graphs became just another big data primitive. Ted Willke, Cloud Platforms Group / Big Data Solutions
  2. 2. Why graphs are cool: DEMO
  3. 3. So, how did graphs become just another useful big data primitive? They DIDN’T.
  4. 4. Reduce the tool drag for graph analytics -- Vision (early 2012). Set off in the right direction.
  5. 5. A complete graph analytics solution -- July 2013
  6. 6. User Interest: Wide on Analytics → E2E on Graph → Deep on Graph → Wide on Analytics
  7. 7. Learning #1: Don’t ignore what’s popular!
  8. 8. Popular Big Data (Structure) Primitives: Key-Value, Document, Column, Tabular, Graph. Which one is best? It depends… and it’s probably not just one.
  9. 9. Popular Big Data (Structure) Primitives: Key-Value. Basic dictionary. Very fast. Very easy. No/minimal structure. Java, PIQL, Lua, XML, XQuery,…
  10. 10. Popular Big Data (Structure) Primitives: Document. Key(s), metadata, hierarchy, document structure. XML, BSON, JSON… Java, C, C++, REST, Clojure, Scala…
  11. 11. Popular Big Data (Structure) Primitives: Column. Key:col_val, Key:col_val… Great for “do this to everything in this column”. Not so much for multiple columns, specific keys. Hadoop, Zookeeper, Java, Python,…
  12. 12. Popular Big Data (Structure) Primitives: Tabular. Old-school RDBMS. Collection of tables + relations that join them. *SQL*
  13. 13. Popular Big Data (Structure) Primitives: Graph. Nodes, edges, properties of nodes and edges. Java, Clojure, Lisp, Ruby, C, C++, Scala, REST,…
  14. 14. How we use the primitives. Model: Key-Value, Document, Graph, Column, SQL. Access: Sync (I/O), Async (Bus), Off-line (Queue). Implementation: API (Remote), LIB (Local).
  15. 15. How are these primitives put to use?
  16. 16. Data workflow example: Ingest & Clean, Engineer Features, Structure Model, Train Model, Query & Analyze, Learn, Visualize
  17. 17. Data Representation: Personal Learning Knowledge Graph. Graph? Columnar? Tabular??
        • Task Level: name "10th Grade", value 10
        • Learning Task: name "Matrix Multiplication", task_id 101, description "Demonstrate how to multiply two matrices", type "homework"
        • Subject: name "Linear Algebra", subject_id 100
        • Task Outcome: score 0.8, num_correct 8, num_attempts 2
        • Learning Plan: plan_id 1, num_tasks 5, expected_time 5h
        • Learning Goal: goal_id 9, description "Achieve above average proficiency in all Linear Algebra course tasks"
        • Proficiency: name "Above Average"
        • Edge labels: has_associated, has_result, contains, implemented_by, evaluated_by, summarized_by, has_prerequisite
  18. 18. Run a graph-based classifier (e.g. LBP); Build graph w/ features from frame; Pull results back to frame to get model perf stats; Engineer features (avg, ratios); Input from another model (segment/cluster)
  19. 19. Learning #2: The primitives are not used in isolation.
  20. 20. Tooling mash-up! Workflow stages: Ingest & Clean, Engineer Features, Structure Model, Train Model, Query & Analyze, Learn, Visualize. Tools (in slide order): Pig/MR, PySpark, ETL Tools?, Pig/MR, PySpark, Java, Scala, Giraph, GraphX (Java, Scala…), Mahout, MLlib, ??, *SQL*, BI tools, PySpark…
  21. 21. Tools are not used in isolation either. How can we cope with this?
  22. 22. Direction #1: Unify primitives and processing on a workflow-oriented engine
  23. 23. Unification with Apache Spark (Image Source: Databricks)
        • In-memory structures (RDDs) support both table and graph abstractions
        • Batch processing and Spark Streaming
        • Spark: RDDs, Transformations, and Actions
        • Spark Streaming (real-time): DStreams, streams of RDDs
        • Spark SQL: SchemaRDDs
        • MLlib (machine learning): RDD-based matrices
        • GraphX (graph processing / machine learning): RDD-based graphs
  24. 24. GraphX API for Spark (Image Source: GraphX project)
        • Graph processing engine on Spark
        • Supports Pregel-style vertex programming
        • View same data as either graphs or collections
  25. 25. Python bindings for Spark (GraphX). Diagram labels: Client, Server; Python, JVM, Py4J, Files; JVM, Akka, Python Worker, Pipes; Serialized Python Functions, Results; “Transformations”, “Actions”, “Operations”
  26. 26. Python bindings for Spark GraphX
  27. 27. Python bindings for Spark GraphX. Coming soon to Apache!
        Vertex
        • Transformations: filter, mapValues, diff
        • Actions: aggregateUsingIndex
        • Join Operations: innerJoin, leftJoin
        Edge
        • Transformations: filter, mapValues, reverse
        • Join Operations: innerJoin
        Graph
        • Property Operators: mapVertices, mapEdges, mapTriplets
        • Structural Operators: subgraph, reverse, mask, groupEdges
        • Join Operations: joinVertices, outerJoinVertices
        • Neighborhood Aggregation: mapReduceTriplets
        • Analytics: ALS, SVDPlusPlus, TriangleCount, PageRank, ConnectedComponents, ShortestPaths
  28. 28. Direction #1: Spark
        • Feature engineering
        • Model training
        • Limited language binding (Python, R getting better)
        • Lacks transactions and model serving
  29. 29. Lacks transactions and model serving... or does it? Extending BDAS with Velox: A UC Berkeley AMPlab project (sponsored in part by Intel). Image Source: Crankshaw, D., et al., “The Missing Piece in Complex Analytics: Low Latency, Scalable Model Management and Serving with Velox,” Cornell University Library Archive, retrieved November 2014
  30. 30. Direction #2: Unify primitives and processing in a relational database
  31. 31. Unification within the In-Memory Database (IMDB). Source: Marcus Paradies, GRAph Data-management & ExperienceS Workshop (GRADES 2014)
        • Index data structure for graph traversal
        • Prototyped in SAP HANA distributed columnar IMDB
        • Lays foundation for complex graph query and algorithms
  32. 32. Graph Traversal. Source: Marcus Paradies, GRAph Data-management & ExperienceS Workshop (GRADES 2014)
  33. 33. Graph Indexing. Source: Marcus Paradies, GRAph Data-management & ExperienceS Workshop (GRADES 2014)
  34. 34. Graph Traversal Results. Source: Marcus Paradies, GRAph Data-management & ExperienceS Workshop (GRADES 2014)
  35. 35. Graph Analytics in Relational Databases? Source: ISTC for Big Data, Alekh Jindal, “Graph Analytics: The New Use Case for Relational Databases,” blog
        • Store graph as a set of nodes and a set of edges
        • Relational algebra captures all basic graph operations
        • Iterative algorithms captured as driver program that calls stored procedures
  36. 36. Graph Analytics in Relational Databases? Relational and graphical analysis: better together! Source: ISTC for Big Data, Alekh Jindal, “Graph Analytics: The New Use Case for Relational Databases,” blog
  37. 37. Expressing Graph in SQL. Source: ISTC for Big Data, Alekh Jindal
  38. 38. Future Vision – BigDAWG. Diagram labels: Applications, Visualization, Languages; BQL – BigDAWG Query Language & Compiler (“Narrow waist” provides portability); Real Time Database; Historical / Analytics Databases; Spill; Stream; Analytics Libraries; Hardware Platforms
  39. 39. Future Vision – BigDAWG. Diagram labels: Applications, e.g., medical data, astronomy, Twitter, urban sensing, IoT; Visualization & Presentation, e.g., ScalaR, imMens, TweetMap, Prefetching; Languages, e.g., Julia, R, MLbase, GraphLab; BQL – BigDAWG Query Language & Compiler (“Narrow waist” provides portability); Real Time DBMSs; Historical / Analytics DBMSs; Spill; Stream; systems shown: SciDB, TupleWare, TileDB, S-Store, MyriaX; Analytics, e.g., ScaLAPACK, ML algos, plsh, other analytics packages; Hardware Platforms, e.g., NVM simulator, Xeon Phi, Xeon
  40. 40. Direction #2: Relational DB
        • Feature engineering
        • Transactions and model serving
        • Performant model training?
        • Just another Spark behind *QL?
  41. 41. Which direction do you favor? Will the lines blur?
  42. 42. Takeaway from both: Do all of the parallel distributed processing in one place and work with it through one UI!
  43. 43. Pressing forward with the Intel Analytics Toolkit
        • Highlights: Unified UI’s across the workflow; Easier feature & model creation; End-to-end graph pipeline; Fully scalable throughout; Multiple data primitives; Optimized for IA
        • Python Libraries, 3rd Party GUIs/SDKs, Viz Tools, Future Libraries, BI Connectors, Query Interfaces, ...
        • “DATA SCIENCE” REST API
        • Machine learning and statistics: Graphical Algorithms, Classical Algorithms
        • Data wrangling: Graph Construction Tools, Useful String Manipulation, Useful Math Operators
        • Apache Hadoop, Apache Spark
        • Filesystems and NoSQL storage
        • HW platform
  44. 44. Analyzing the Semantic Web. Reputations: Neutral, Good, Bad, Suspect
  45. 45. Unified programming environment: DEMO
  46. 46. PROGRESS TOWARD VISION
  47. 47. If we are successful... graph will become just another big data primitive!
  48. 48. How graphs became just another big data primitive
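
Slides 13 and 17 describe the graph primitive as nodes, edges, and properties on both, and slide 24 mentions viewing the same data as either graphs or collections. A minimal Python sketch of that idea follows, reusing a few node properties from slide 17. The `PropertyGraph` class, the node ids, and the way the two edges are wired (`contains`, `has_result`) are illustrative assumptions; the flattened slide does not show which nodes each edge connects.

```python
from dataclasses import dataclass, field


@dataclass
class PropertyGraph:
    """Hypothetical minimal property graph: nodes and edges carry property dicts."""
    nodes: dict = field(default_factory=dict)  # node_id -> {"label": ..., **props}
    edges: list = field(default_factory=list)  # (src, edge_label, dst, props)

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, **props}

    def add_edge(self, src, label, dst, **props):
        self.edges.append((src, label, dst, props))

    def neighbors(self, node_id, label=None):
        """Graph view: follow outgoing edges, optionally filtered by edge label."""
        return [dst for src, lbl, dst, _ in self.edges
                if src == node_id and (label is None or lbl == label)]

    def node_rows(self, label):
        """Tabular view: one row (dict) per node that has the given label."""
        return [{"id": nid, **{k: v for k, v in n.items() if k != "label"}}
                for nid, n in self.nodes.items() if n["label"] == label]


g = PropertyGraph()
# Property values below are taken from slide 17; the ids are made up.
g.add_node("subject:100", "Subject", name="Linear Algebra", subject_id=100)
g.add_node("task:101", "LearningTask", name="Matrix Multiplication",
           task_id=101, type="homework")
g.add_node("outcome:1", "TaskOutcome", score=0.8, num_correct=8, num_attempts=2)
# Assumption: which edge label joins which pair of nodes is a guess.
g.add_edge("subject:100", "contains", "task:101")
g.add_edge("task:101", "has_result", "outcome:1")

task_ids = g.neighbors("subject:100", "contains")  # graph-style traversal
outcome_rows = g.node_rows("TaskOutcome")          # table-style projection
```

The same stored data answers a traversal question (`neighbors`) and a row-oriented one (`node_rows`), which is the "Graph? Columnar? Tabular??" tension slide 17 raises.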
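
Slide 24 notes that GraphX supports Pregel-style vertex programming. The sketch below is plain Python and does not use the GraphX API; it only illustrates the model, in which each vertex reads its incoming messages, updates its own state, and sends messages to neighbors until no vertex changes. The function name and the choice of connected components as the example algorithm are assumptions made for illustration.

```python
def pregel_connected_components(vertices, edges, max_supersteps=50):
    """Label every vertex with the smallest vertex id in its component."""
    # Treat edges as undirected for component detection.
    neighbors = {v: set() for v in vertices}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    state = {v: v for v in vertices}  # each vertex starts with its own id

    # Superstep 0: every vertex sends its id to all neighbors.
    inbox = {v: [] for v in vertices}
    for v in vertices:
        for n in neighbors[v]:
            inbox[n].append(state[v])

    for _ in range(max_supersteps):
        outbox = {v: [] for v in vertices}
        any_change = False
        for v in vertices:
            if not inbox[v]:
                continue  # no messages: this vertex stays halted
            candidate = min(inbox[v])
            if candidate < state[v]:
                state[v] = candidate
                for n in neighbors[v]:
                    outbox[n].append(candidate)
                any_change = True
        if not any_change:
            break  # all vertices voted to halt
        inbox = outbox
    return state


labels = pregel_connected_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
# labels maps vertices 1, 2, 3 to 1 and vertices 4, 5 to 4
```

The per-vertex body only sees its own state and inbox, which is what lets an engine run it in parallel across partitions.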
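
Slide 35 summarizes the relational approach: store the graph as a node set and an edge set, express graph operations in relational algebra, and run iterative algorithms from a driver program. A rough sketch using Python's built-in `sqlite3` module is below. The table layout, the `hop_distances` function, and the choice of hop-count search are assumptions for illustration, and because SQLite has no stored procedures the driver issues a plain SQL statement per iteration instead.

```python
import sqlite3


def hop_distances(edge_list, source):
    """Single-source hop counts over directed edges, driven by repeated SQL."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, dist INTEGER)")
    db.execute("CREATE TABLE edges (src INTEGER, dst INTEGER)")
    db.executemany("INSERT INTO edges VALUES (?, ?)", edge_list)
    # Node set: every id that appears at either end of an edge.
    db.execute(
        "INSERT INTO nodes (id) SELECT src FROM edges UNION SELECT dst FROM edges"
    )
    db.execute("INSERT OR IGNORE INTO nodes (id) VALUES (?)", (source,))
    db.execute("UPDATE nodes SET dist = 0 WHERE id = ?", (source,))

    step = 0
    while True:
        # One iteration = one join: frontier nodes (dist = step) joined to
        # edges yield the next frontier, restricted to unreached nodes.
        cur = db.execute(
            """
            UPDATE nodes SET dist = ?
            WHERE dist IS NULL AND id IN (
                SELECT e.dst FROM edges e JOIN nodes n ON n.id = e.src
                WHERE n.dist = ?
            )
            """,
            (step + 1, step),
        )
        if cur.rowcount == 0:
            break  # frontier is empty, so the driver stops
        step += 1

    rows = db.execute("SELECT id, dist FROM nodes ORDER BY id").fetchall()
    db.close()
    return dict(rows)


dist = hop_distances([(1, 2), (2, 3), (1, 3), (3, 4), (5, 6)], source=1)
# Unreachable nodes keep dist = None (SQL NULL).
```

The loop in Python plays the role of the driver program, while each frontier expansion is an ordinary join plus update that the database can plan and execute on its own.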
