
Graph Features in Spark 3.0: Integrating Graph Querying and Algorithms in Spark Graph


Spark 3.0 introduces a new module: Spark Graph. Spark Graph adds the popular query language Cypher, its accompanying Property Graph Model, and graph algorithms to the data science toolbox. Graphs have a plethora of useful applications in recommendation, fraud detection, and research. The tutorial aims to help you understand when graphs should be used and how Spark Graph can be used to extend analytical workflows. In this tutorial we will explore the concepts and motivations behind graph querying and graph algorithms, the components of the new Spark Graph module and their APIs, and how those APIs allow you to successfully write your own graph applications and integrate them in your data science workflows.

The tutorial is a mixture of presentation, code examples, and notebooks. We will demonstrate how to write an end-to-end Graph application that operates on different kinds of input data. We will show how Spark Graph interacts with Spark SQL and openCypher Morpheus, a Spark Graph extension that allows you to easily manage multiple graphs and provides built-in Property Graph Data Sources for the Neo4j graph database as well as Cypher language extensions.

At the end of the tutorial, attendees will have a good understanding of when to apply graphs in their data science workflows, how to bring Spark Graph into an existing Spark workflow, and how to make best use of the new APIs. The tutorial will be led by the presenters and will include hands-on interactive sessions. The tutorial material will be made available during the presentation.



  1. WIFI SSID: Spark+AISummit | Password: UnifiedDataAnalytics
  2. #UnifiedDataAnalytics #SparkAISummit
  3. Graphs are everywhere
  4. … and growing
  5. Graphs at Spark Summit
  6. Property Graphs & Big Data
     The Property Graph data model is becoming increasingly mainstream:
     • Cloud graph data services such as Azure Cosmos DB and Amazon Neptune
     • Simple graph features in SQL Server 2017; multiple new graph DB products
     • A new graph query language to be standardized by ISO
     • Neo4j becoming a common operational store in retail, finance, telcos, and more
     • Increasing interest in graph algorithms over graph data as a basis for AI
     Apache® Spark is the leading scale-out clustered-memory solution for Big Data.
     Spark 2: data viewed as tables (DataFrames), processed by SQL, in function chains, using queries and user functions, transforming immutable tabular data sets.
  7. Graphs are coming to Spark
     [SPARK-25994] SPIP: Property Graphs, Cypher Queries, and Algorithms
     Goal:
     • bring Property Graphs and the Cypher query language to Spark
     • the Spark SQL for graphs
     Status:
     • accepted by the community
     • implementation still work in progress
  8. Demonstration
  9. The Property Graph
  10. The Whiteboard Model Is the Physical Model
      • Eliminates graph-to-relational mapping in your data
      • Bridges the gap between the logical model and the database model
  11. Property Graph Model Components
      (diagram: two User nodes (name: "Dan", born: May 29, 1970, twitter: "@dan"; name: "Ann", born: Dec 5, 1975) and a Business node (name: "Cars, Inc", sector: "automotive"), connected by KNOWS, FOLLOWS, and REVIEWS relationships, one REVIEWS carrying date: Jan 10, 2011)
      Nodes
      • The objects in the graph
      • Can have name-value properties
      • Can be labeled
      Relationships
      • Relate nodes by type and direction
      • Can have name-value properties
  12. Relational Versus Graph Models
      (diagram: the same review data shown as a relational model, with User, User-Business, and Business tables, and as a graph, with Alice connected to Burgers, Inc, Pizza, Inc, and Pretzels via REVIEWS relationships)
  13. Graphs in Spark 3.0
  14. Tables for Labels
      • In Spark Graph, PropertyGraphs are represented by node tables and relationship tables
      • Tables are represented by DataFrames and require a fixed schema
      • Property Graphs have a Graph Type:
        – the node and relationship types that occur in the graph
        – the node and relationship properties and their data types
  15. Tables for Labels
      :User:ProAccount node table:
        id | name
        0  | Alice
      :Business node table:
        id | name
        1  | Burgers, Inc
      :REVIEWS relationship table:
        id | source | target
        0  | 0      | 1
      Graph Type { :User:ProAccount ( name: STRING ), :Business ( name: STRING ), :REVIEWS }
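The table-to-graph mapping above can be sketched with plain Scala collections, a simplified stand-in for the DataFrame-backed node and relationship tables (the case classes and field names here are illustrative, not Spark Graph API):

```scala
// Simplified stand-in for Spark Graph's node and relationship tables.
// NodeRow and RelRow are illustrative names, not part of the Spark Graph API.
case class NodeRow(id: Long, labels: Set[String], name: String)
case class RelRow(id: Long, source: Long, target: Long, relType: String)

val nodes = Seq(
  NodeRow(0L, Set("User", "ProAccount"), "Alice"),
  NodeRow(1L, Set("Business"), "Burgers, Inc")
)
val rels = Seq(RelRow(0L, source = 0L, target = 1L, relType = "REVIEWS"))

// Resolve each REVIEWS relationship to its endpoint names, as the
// source/target join against the node tables would.
val byId = nodes.map(n => n.id -> n).toMap
val reviews = rels.filter(_.relType == "REVIEWS")
  .map(r => (byId(r.source).name, byId(r.target).name))
```

The `source` and `target` columns of the relationship table are foreign keys into the node tables, which is exactly what makes a DataFrame representation of a Property Graph possible.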
  16. Creating a graph
      Property Graphs are created from a set of DataFrames. There are two options:
      • Wide tables: one DataFrame for nodes and one for relationships; a column-name convention identifies label and property columns
      • NodeFrames and RelationshipFrames: a single DataFrame per node-label combination and relationship type; allows mapping DataFrame columns to properties
  17. Storing and Loading
      business.json, user.json, review.json → create node and relationship tables → create Property Graph → store Property Graph as Parquet
  18. Demonstration
  19. Graph Querying with Cypher
  20. What is Cypher?
      • Declarative query language for graphs: "SQL for graphs"
      • Based on pattern matching
      • Supports basic data types for properties
      • Functions, aggregations, etc.
  21. Pattern matching
      (diagram: a query graph with node variables a and b is matched against a data graph with nodes 1-5; each embedding of the query pattern in the data graph produces one row of the result)
  22. Basic Pattern: Alice's reviews?
      (:User {name: 'Alice'})-[:REVIEWS]->(business:Business)
      (annotated on the slide: the parts of the pattern are nodes, a label, a property, a relationship type, and a variable)
  23. Cypher query structure
      • Cypher operates over a graph and returns a table
      • Basic structure:
        MATCH pattern
        WHERE predicate
        RETURN/WITH expression AS alias, ...
        ORDER BY expression
        SKIP ... LIMIT ...
  24. Basic Query: Businesses Alice has reviewed?
      MATCH (user:User)-[r:REVIEWS]->(b:Business)
      WHERE user.name = 'Alice'
      RETURN b.name, r.rating
  25. Query Comparison: Co-reviewers of Alice?
      SQL:
        SELECT co.name AS coReviewer, count(co) AS nbrOfCoReviews
        FROM User AS user
        JOIN UserBusiness AS ub1 ON (user.id = ub1.user_id)
        JOIN UserBusiness AS ub2 ON (ub1.b_id = ub2.b_id)
        JOIN User AS co ON (co.id = ub2.user_id)
        WHERE user.name = 'Alice'
        GROUP BY co.name
      Cypher:
        MATCH (user:User)-[:REVIEWS]->(:Business)<-[:REVIEWS]-(co:User)
        WHERE user.name = 'Alice'
        RETURN co.name AS coReviewer, count(*) AS nbrOfCoReviews
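To see the join structure both queries express, here is the same co-reviewer computation over plain Scala collections (the review data is illustrative, and this is a semantic sketch, not how Spark executes either query):

```scala
// The co-reviewer query, expressed over plain Scala collections to show
// the self-join both the SQL and the Cypher versions describe.
case class Review(user: String, business: String)

val reviews = Seq(
  Review("Alice", "Burgers, Inc"),
  Review("Bob",   "Burgers, Inc"),
  Review("Bob",   "Pizza, Inc"),
  Review("Carol", "Pizza, Inc"),
  Review("Alice", "Pizza, Inc")
)

// Self-join on the shared business, keeping Alice on one side and any
// other reviewer on the other; then group by co-reviewer and count.
val coReviews = for {
  r1 <- reviews if r1.user == "Alice"
  r2 <- reviews if r2.business == r1.business && r2.user != "Alice"
} yield r2.user

val counts = coReviews.groupBy(identity).view.mapValues(_.size).toMap
```

Bob co-reviewed two of Alice's businesses and Carol one, so `counts` is `Map("Bob" -> 2, "Carol" -> 1)`: the same grouping and counting that `GROUP BY co.name` and `count(*)` perform.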
  26. Variable-length patterns
      MATCH (a:User)-[r:KNOWS*2..6]->(other:User)
      RETURN a, other, length(r) AS length
      • Allows the traversal of paths of variable length
      • Returns all results between the minimum and maximum number of hops
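A minimal sketch of what variable-length expansion computes, as recursive path enumeration over an adjacency list (this illustrates the min/max-hop semantics only; it ignores Cypher's relationship-uniqueness rule and is not how Spark Cypher plans such patterns):

```scala
// Variable-length expansion (*2..3 here) sketched as recursive path
// enumeration over an illustrative KNOWS adjacency list.
val knows: Map[String, List[String]] = Map(
  "a" -> List("b"),
  "b" -> List("c"),
  "c" -> List("d"),
  "d" -> Nil
)

def expand(from: String, min: Int, max: Int): List[(String, Int)] = {
  // Walk outward, emitting (endpoint, hops) once the minimum is reached.
  def step(node: String, depth: Int): List[(String, Int)] =
    if (depth == max) Nil
    else knows.getOrElse(node, Nil).flatMap { next =>
      val emitted = if (depth + 1 >= min) List((next, depth + 1)) else Nil
      emitted ++ step(next, depth + 1)
    }
  step(from, 0)
}

val reached = expand("a", min = 2, max = 3)
```

Starting from "a", the one-hop neighbor "b" is below the minimum and is skipped, while "c" (2 hops) and "d" (3 hops) fall inside the 2..3 range, mirroring how `*2..6` returns every match between the bounds.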
  27. Aggregations
      • Cypher supports a number of aggregators: min(), max(), sum(), avg(), count(), collect(), ...
      • When aggregating, the non-aggregation projections form a grouping key:
        MATCH (u:User)
        RETURN u.name, count(*) AS count
      The above query returns the count per unique name.
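The implicit grouping key can be mirrored with a plain `groupBy` (illustrative data):

```scala
// Cypher's implicit grouping key, mirrored with groupBy: the non-aggregated
// projection (u.name) becomes the key, count(*) the aggregate.
val userNames = Seq("Alice", "Bob", "Alice", "Alice")
val countPerName = userNames.groupBy(identity).view.mapValues(_.size).toMap
```

Unlike SQL, there is no explicit GROUP BY clause: every projected expression that is not wrapped in an aggregator becomes part of the key.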
  28. Projections
      • UNWIND turns a list into one row per element:
        UNWIND ['a', 'b', 'c'] AS list
      • WITH
        – behaves similarly to RETURN
        – allows projection of values into new variables
        – controls scoping of variables
        MATCH (n1)-[r1]->(m1)
        WITH n1, collect(r1) AS r // r1, m1 not visible after this
        RETURN n1, r
  29. Expressions
      • Arithmetic (+, -, *, /, %)
      • Logical (AND, OR, NOT)
      • Comparison (<, <=, =, <>, >=, >)
      • Functions
        – math functions (sin(), cos(), asin(), ceil(), floor())
        – conversion (toInteger(), toFloat(), toString())
        – string functions
        – date and time functions
        – containers (nodes, relationships, lists, maps)
        – ...
  30. Cypher in Spark 3.0
  31. In the previous session...
      • Property Graph is a growing data model
      • Spark Graph will bring Property Graphs and Cypher to Spark 3.0
      • Cypher is "SQL for graphs", based on pattern matching
  32. Now that you know Cypher...
      val graph: PropertyGraph = …
      graph.cypher(
        """
          |MATCH (u:User)-[:REVIEWS]->(b:Business)
          |WHERE u.name = 'Alice'
          |RETURN u.name, b.name
        """.stripMargin
      ).df.show
  33. So, what happens in that call?
  34. Query Processing
      MATCH (u:User)-[:REVIEWS]->(b:Business) WHERE u.name = 'Alice' RETURN u.name, b.name
      • openCypher Frontend: shared with the Neo4j database system; parsing, rewriting, normalization; semantic analysis (scoping, typing, etc.)
      • Okapi + Spark Cypher: schema and type handling; query translation to DataFrame operations
      • Spark SQL: rule-based query optimization
      • Spark Core: distributed execution
  35. Query Translation
      MATCH (u:User)-[:REVIEWS]->(b:Business) WHERE u.name = 'Alice' RETURN u.name, b.name
      (diagram: the logical view of the query next to the physical view as DataFrame operations, combining NodeTable(User), RelTable(REVIEWS), and NodeTable(Business) into the result)
  36. Spark Cypher Architecture
      • openCypher Frontend
      • Okapi + Spark Cypher:
        – Intermediate Language: conversion and typing of expressions
        – Logical Planning: translation into logical operators; basic logical optimization
        – Relational Planning: column layout computation for intermediate results; translation into relational operations
        – Spark Cypher: translation of relational operations into DataFrame transformations; expression conversion to Spark SQL Columns
      • Spark SQL / Spark Core
  37. openCypher Frontend: the query is parsed, rewritten, and normalized into an AST
      MATCH (u:User)-[:REVIEWS]->(b:Business) WHERE u.name = 'Alice' RETURN u.name, b.name
      Query(None, SingleQuery(List(
        Match(false, Pattern(List(
          EveryPath(
            RelationshipChain(
              NodePattern(Some(Variable(u)),List(),None,None),
              RelationshipPattern(Some(Variable(UNNAMED18)),List(RelTypeName(REVIEWS)),None,None,OUTGOING,None,false),
              NodePattern(Some(Variable(b)),List(),None,None))))),List(),
          Some(Where(
            Ands(
              Set(HasLabels(Variable(u),List(LabelName(User))),
                HasLabels(Variable(b),List(LabelName(Business))),
                Equals(Property(Variable(u),PropertyKeyName(name)),Parameter( AUTOSTRING0,String))))))),
        With(false,ReturnItems(false,List(
          AliasedReturnItem(Property(Variable(u),PropertyKeyName(name)),Variable(u.name)),
          AliasedReturnItem(Property(Variable(b),PropertyKeyName(name)),Variable(b.name)))),None,None,None,None),
        Return(false,ReturnItems(false,List(
          AliasedReturnItem(Variable(u.name),Variable(u.name)),
          AliasedReturnItem(Variable(b.name),Variable(b.name)))),None,None,None,Set()))))
  38. Intermediate Language
      • Converting AST patterns into Okapi patterns
      • Converting AST expressions into Okapi expressions
      • Typing expressions
      MATCH (u:User)-[:REVIEWS]->(b:Business) WHERE u.name = 'Alice' RETURN u.name, b.name
      ╙──TableResultBlock(OrderedFields(List(u.name :: STRING, b.name :: STRING)), ...)
      ╙──ProjectBlock(Fields(Map(u.name :: STRING -> u.name :: STRING, b.name :: STRING -> b.name :: STRING)), Set(), ...)
      ╙──ProjectBlock(Fields(Map(u.name :: STRING -> u.name :: STRING, b.name :: STRING -> b.name :: STRING)), Set(), ...)
      ╙──MatchBlock(Pattern(
            Set(u :: NODE, b :: NODE, UNNAMED18 :: RELATIONSHIP(:REVIEWS)),
            Map(UNNAMED18 :: RELATIONSHIP(:REVIEWS) -> DirectedRelationship(Endpoints(u :: NODE, b :: NODE))),
            Set(u:User :: BOOLEAN, b:Business :: BOOLEAN, u.name :: STRING = "Alice" :: STRING)))
      ╙──SourceBlock(IRCatalogGraph(session.tmp#1))
  39. Logical Planning: convert Intermediate Language blocks into logical query operators
      MATCH (u:User)-[:REVIEWS]->(b:Business) WHERE u.name = 'Alice' RETURN u.name, b.name
      Select(List(u.name :: STRING, b.name :: STRING), ...)
      ╙─Project((b.name :: STRING,Some(b.name :: STRING)), ...)
      ╙─Project((u.name :: STRING,Some(u.name :: STRING)), ...)
      ╙─Filter(u.name :: STRING = $ AUTOSTRING0 :: STRING, ...)
      ╙─Project((u.name :: STRING,None), ...)
      ╙─Filter(b:Business :: BOOLEAN, ...)
      ╙─Filter(u:User :: BOOLEAN, ...)
      ╙─Expand(u :: NODE, UNNAMED18 :: RELATIONSHIP(:REVIEWS), b :: NODE, Directed, ...)
        ╟─NodeScan(u :: NODE, ...)
        ║ ╙─Start(LogicalCatalogGraph(session.tmp#1), ...)
        ╙─NodeScan(b :: NODE, ...)
          ╙─Start(LogicalCatalogGraph(session.tmp#1), ...)
  40. Logical optimization: apply basic optimizations to the logical query plan (e.g. label pushdown)
      MATCH (u:User)-[:REVIEWS]->(b:Business) WHERE u.name = 'Alice' RETURN u.name, b.name
      Select(List(u.name :: STRING, b.name :: STRING), ...)
      ╙─Project((b.name :: STRING,Some(b.name :: STRING)), ...)
      ╙─Project((u.name :: STRING,Some(u.name :: STRING)), ...)
      ╙─Filter(u.name :: STRING = $ AUTOSTRING0 :: STRING, ...)
      ╙─Project((u.name :: STRING,None), ...)
      ╙─Expand(u :: NODE, UNNAMED18 :: RELATIONSHIP(:REVIEWS), b :: NODE, Directed, ...)
        ╟─NodeScan(u :: NODE(:User), ...)
        ║ ╙─Start(LogicalCatalogGraph(session.tmp#1), ...)
        ╙─NodeScan(b :: NODE(:Business), ...)
          ╙─Start(LogicalCatalogGraph(session.tmp#1), ...)
  41. Relational Planning: translation of graph operations into relational operations
      MATCH (u:User)-[:REVIEWS]->(b:Business) WHERE u.name = 'Alice' RETURN u.name, b.name
      Select(List(u.name :: STRING, b.name :: STRING), RecordHeader with 2 entries)
      ╙─Alias(b.name :: STRING AS b.name :: STRING, RecordHeader with 15 entries)
      ╙─Alias(u.name :: STRING AS u.name :: STRING, RecordHeader with 14 entries)
      ╙─Filter(u.name :: STRING = "Alice", RecordHeader with 13 entries)
      ╙─Join((target(UNNAMED18 :: RELATIONSHIP(:REVIEWS)) -> b :: NODE), RecordHeader with 13 entries, InnerJoin)
        ╟─Join((u :: NODE -> source(UNNAMED18 :: RELATIONSHIP(:REVIEWS))), RecordHeader with 9 entries, InnerJoin)
        ║ ╟─NodeScan(u :: NODE(:User), RecordHeader with 4 entries)
        ║ ║ ╙─Start(Some(CAPSRecords.unit), session.tmp#1)
        ║ ╙─RelationshipScan(UNNAMED18 :: RELATIONSHIP(:REVIEWS), RecordHeader with 5 entries)
        ║   ╙─Start(None)
        ╙─NodeScan(b :: NODE(:Business), RecordHeader with 4 entries)
          ╙─Start(Some(CAPSRecords.unit))
  42. The RecordHeader
      • Describes the output table of a relational operator
      • Maps query expressions (e.g. 'n.name' or 'n:User') to DataFrame / table columns
      • Used to access columns when evaluating expressions during physical execution
      • Supports relational operations to reflect data changes (e.g. header.join(otherHeader))
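A minimal sketch of the idea: a map from query expressions (as strings) to backing column names. The real Okapi RecordHeader is considerably richer; the `Header` type and its methods here are illustrative, not the actual API:

```scala
// A minimal RecordHeader sketch: query expressions mapped to the
// physical column names that back them.
case class Header(mapping: Map[String, String]) {
  def column(expr: String): String = mapping(expr)
  // An alias shares the physical column of the expression it renames.
  def withAlias(expr: String, alias: String): Header =
    Header(mapping + (alias -> mapping(expr)))
}

val header = Header(Map(
  "n"      -> "n",
  "n:User" -> "____n:User",
  "n.name" -> "____n_dot_nameSTRING"
))

// Project(n.name AS name): no data moves, only the header changes.
val projected = header.withAlias("n.name", "name")
```

This is why projections are cheap: aliasing an expression only adds a header entry pointing at an existing column, as the slides that follow show.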
  43. RecordHeader contents for the plan:
      Select(b, name) ╙─Project(b.name as name) ╙─Project(n as b) ╙─NodeScan(n:User)
      After NodeScan(n:User):
        Expression                         Column Name
        NodeVar(n) :: CTNODE(:User)        n
        HasLabel(n, :User) :: CTBOOLEAN    ____n:User
        Property(n.name) :: CTSTRING       ____n_dot_nameSTRING
  44. After Project(n as b), entries for b are added, backed by the same columns:
        NodeVar(b) :: CTNODE(:User)        n
        HasLabel(b, :User) :: CTBOOLEAN    ____n:User
        Property(b.name) :: CTSTRING       ____n_dot_nameSTRING
  45. After Project(b.name as name), an alias entry is added:
        SimpleVar(name) :: CTSTRING        ____n_dot_nameSTRING
  46. After Select(b, name), only the entries for b and name remain:
        NodeVar(b) :: CTNODE(:User)        n
        HasLabel(b, :User) :: CTBOOLEAN    ____n:User
        Property(b.name) :: CTSTRING       ____n_dot_nameSTRING
        SimpleVar(name) :: CTSTRING        ____n_dot_nameSTRING
  47. The Table abstraction
      • Abstracts relational operators over the backend's table type
      • Converts Okapi expressions into backend-specific expressions
      trait Table[E] {
        def header: RecordHeader
        def select(expr: String, exprs: String*): Table[E]
        def filter(expr: Expr): Table[E]
        def distinct: Table[E]
        def order(by: SortItem[Expr]*): Table[E]
        def group(by: Set[Expr], aggregations: Set[Aggregation]): Table[E]
        def join(other: Table[E], joinExprs: Set[(String, String)], joinType: JoinType): Table[E]
        def unionAll(other: Table[E]): Table[E]
        def add(expr: Expr): Table[E]
        def addInto(expr: Expr, into: String): Table[E]
        def drop(columns: String*): Table[E]
        def rename(oldColumn: Expr, newColumn: String): Table[E]
      }
  48. DataFrameTable: the Spark implementation of the table abstraction
      class DataFrameTable(df: DataFrame) extends RelationalTable[DataFrameTable] {
        // ...
        override def filter(expr: Expr): DataFrameTable = {
          new DataFrameTable(df.filter(convertExpression(expr, header)))
        }
        override def join(other: DataFrameTable, joinExprs: Set[(String, String)], joinType: JoinType): DataFrameTable = {
          val joinExpr = joinExprs.map { case (l, r) => df.col(l) === other.df(r) }.reduce(_ && _)
          new DataFrameTable(df.join(other.df, joinExpr, joinType))
        }
        // ...
      }
  49. Future improvement ideas
      • Common table expressions
      • Worst-case optimal joins
      • Graph-aware optimisations in Catalyst
      • Graph-aware data partitioning
  50. Spark Cypher in Action
  51. Extending Spark Graph with Neo4j Morpheus
  52. Neo4j Morpheus
      • Incubator for Spark Cypher
      • Extends the Cypher language with multiple-graph features
      • Graph catalog
      • Property graph data sources for integration with Neo4j, SQL DBMSs, etc.
      https://github.com/opencypher/morpheus
  53. Get Involved!
      • The SPIP was accepted in February
      • Current status:
        – core development is PoC-complete
        – PRs in review
      • We are not Spark committers:
        – help us review / merge
        – contribute to documentation and the Python API
  54. Don't forget to rate and review the sessions. Search "Spark + AI Summit".
