GraphFrames Access Methods
Jim Hatcher
Solution Architect, DataStax
Twitter: @thejimhatcher
Graph Day - San Francisco
September 2018
© DataStax, All Rights Reserved.
Agenda
● Building Blocks
● OSS Spark GraphFrames
● DSEGraphFrames
● Demo
● Resources
Building Blocks
DSE Graph Frames - Mental Model of Concepts
[Diagram: maps concepts (graph theory, graph database, distributed database, distributed execution framework, OLTP / real-time database) to their open-source building blocks (Apache Cassandra, Apache Spark, Apache TinkerPop & Gremlin, Resilient Distributed Datasets (RDDs), Spark DataFrames with query plan and memory optimization, Spark GraphX with graph algorithms and machine learning, Spark GraphFrames) and to the corresponding DataStax Enterprise (DSE) components (DSE Core, DSE Graph, DSE Search, DSE Analytics, DSE Graph Frames).]
Typical Cluster Topology in DSE Graph
[Diagram: a single cluster with two data centers; Data Center 1 serves OLTP / real-time workloads for real-time clients, and Data Center 2 serves OLAP / batch workloads for batch clients.]
OSS Spark GraphFrames
Capabilities
● Parallelization / Resilience / Distributed (from Spark)
● Query Plan Optimization (from Spark’s Catalyst engine)
● Memory Optimization (from Spark’s Tungsten engine)
● Spark SQL (from Spark DataFrames)
Motif Finding
● Motif Finding
○ g.find()
○ motif patterns (a subset of Cypher); see the sketch below
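A minimal sketch of what motif finding looks like in OSS GraphFrames, assuming the graphframes package is on the classpath; the tiny vertex/edge DataFrames and column names below are purely illustrative:

import org.graphframes.GraphFrame

// GraphFrames expects an "id" column on vertices and "src"/"dst" columns on edges
val vertices = spark.createDataFrame(Seq(
  ("1", "Alice"), ("2", "Bob"), ("3", "Carol")
)).toDF("id", "name")
val edges = spark.createDataFrame(Seq(
  ("1", "2", "follows"), ("2", "1", "follows"), ("2", "3", "follows")
)).toDF("src", "dst", "relationship")
val g = GraphFrame(vertices, edges)

// motif pattern: pairs (a)->(b) and (b)->(a), i.e. mutual follows
val mutual = g.find("(a)-[e1]->(b); (b)-[e2]->(a)")
  .filter("e1.relationship = 'follows' AND e2.relationship = 'follows'")
mutual.select("a.name", "b.name").show()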
Graph Algorithms
● Graph Algorithms (from GraphX)
○ Breadth-First Search (BFS)
○ Connected Components / Strongly Connected Components
○ Label Propagation Algorithm (LPA)
○ Page Rank
○ Shortest Paths
○ SVD++
○ Triangle Count
● Building blocks to write your own algorithms
○ aggregateMessages()
○ pregel() - GraphX
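Running one of the packaged algorithms is a one-liner on a GraphFrame; a sketch, reusing the illustrative GraphFrame g from the motif example above:

// PageRank: damping via resetProbability, fixed number of iterations
val ranks = g.pageRank.resetProbability(0.15).maxIter(10).run()
ranks.vertices.select("id", "pagerank").orderBy($"pagerank".desc).show()

// connected components requires a Spark checkpoint directory to be set first
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")
g.connectedComponents.run().select("id", "component").show()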
Data Source
● Load your vertices / edges from any Spark source
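A sketch of what that looks like; the paths and formats below are hypothetical, and any Spark DataFrame reader works as long as the vertex DataFrame has an "id" column and the edge DataFrame has "src" and "dst" columns:

import org.graphframes.GraphFrame

// hypothetical sources on a distributed file system
val vertices = spark.read.parquet("dsefs:///data/people_vertices.parquet")  // must contain "id"
val edges = spark.read.json("dsefs:///data/knows_edges.json")               // must contain "src", "dst"
val g = GraphFrame(vertices, edges)
g.vertices.count()
g.edges.count()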
DSEGraphFrames
Data Source
● Point to your DSE Graph
val g = spark.dseGraph("my_graph_name")
● Or, point to any other data source
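In the dse spark shell, spark.dseGraph(...) is available out of the box; a standalone Spark application would likely need an import along these lines (package name as documented for DSE 5.1/6.0, so treat the exact details as an assumption for your version):

// brings spark.dseGraph(...) into scope
import com.datastax.bdp.graph.spark.graphframe._

val g = spark.dseGraph("my_graph_name")
// the returned DseGraphFrame also wraps a plain GraphFrame (exposed as gf) if you want the OSS API directly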
Apache TinkerPop support
● The same Gremlin that you write for your OLTP-based traversals can be used for analytical requirements
● However, only a limited subset of the Gremlin steps is currently implemented
○ Inclusions:
■ DSE 5.1: https://docs.datastax.com/en/dse/5.1/dse-dev/datastax_enterprise/graph/graphAnalytics/tinkerpopDseGraphFrame.html
■ DSE 6.0: https://docs.datastax.com/en/dse/6.0/dse-dev/datastax_enterprise/graph/graphAnalytics/tinkerpopDseGraphFrame.html
○ Notable Exclusions:
■ repeat()
■ union()
■ as() / select() -- added in DSE 6.0
Good for Scan Operations
● Very good for operations that require table scans
○ Examples:
■ g.V().count()
■ g.E().count()
■ g.V().groupCount().by(__.label())
■ g.E().groupCount().by(__.label())
Mutations
● Effective way of mutating the graph (not available in OSS GraphFrames)
○ Mutations cannot be done using Gremlin OLAP
○ Takes advantage of Spark’s innate ability to parallelize processes
● Potential Use Cases
○ Migration from current graph schema to new graph schema
○ Adding shortcut edges
○ Initial load of the graph (see the sketch below)
■ Requires a distributed file system such as DSEFS or HDFS
○ Drop all instances of Vertex Label X
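A sketch of the initial-load use case, assuming a dse spark shell pointed at the killrvideo graph; the file path and columns are hypothetical, and the DataFrame columns must line up with the properties of the target vertex label:

import org.apache.spark.sql.functions.lit

val g = spark.dseGraph("killrvideo")

// hypothetical CSV on DSEFS with one row per person vertex
val people = spark.read
  .option("header", "true")
  .csv("dsefs:///data/person.csv")
  .withColumn("~label", lit("person"))

// updateVertices upserts vertices in parallel across the Spark executors;
// updateEdges (shown later in the demo) does the same for edges
g.updateVertices(people)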
Demo
Dataset
KillrVideo - reference application
https://github.com/datastax/graph-examples/
Summary Traversals - TinkerPop/Gremlin
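// note: __, local, values, and decr in the traversal below are standard TinkerPop references
// (dsl.graph.__, Scope.local, Column.values, Order.decr), assumed to already be in scope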
val g = spark.dseGraph("killrvideo")
g.V().count()
g.E().count()
g.V().groupCount().by(__.label())
g.E().groupCount().by(__.label())
//get count of actors by movie
g.V()
.hasLabel("movie")
//.has("title", "I Am Legend")
.as("m")
.out("actor")
.groupCount().by(__.select("m").values("title"))
.order(local).by(values, decr)
Summary Traversals - Spark SQL
//register our vertex and edge tables so we can reference them in Spark SQL
spark.read.format("com.datastax.bdp.graph.spark.sql.vertex").option("graph",
"killrvideo").load.createOrReplaceTempView("vertices")
spark.read.format("com.datastax.bdp.graph.spark.sql.edge").option("graph",
"killrvideo").load.createOrReplaceTempView("edges")
//get Count of Actors by movie
val moviesAndActorCounts = spark.sql("""
SELECT vMovie.title, COUNT(*) AS NumberOfActors
FROM vertices vMovie
INNER JOIN edges eActor ON vMovie.id = eActor.src AND eActor.`~label` = 'actor'
WHERE vMovie.`~label` = 'movie'
GROUP BY vMovie.id, vMovie.title
ORDER BY COUNT(*) DESC
""")
moviesAndActorCounts.show(false)
//moviesAndActorCounts.explain
Summary Traversals - Spark SQL (cont'd)
val actorsInMultipleGenres = spark.sql("""
SELECT ActorGenreGrouping.ActorName, ActorGenreGrouping.NumberOfGenres
FROM
(
SELECT vPerson.name AS ActorName, COUNT(DISTINCT vGenre.name) AS NumberOfGenres
FROM vertices vPerson
INNER JOIN edges eActor ON vPerson.id = eActor.dst AND eActor.`~label` = 'actor'
INNER JOIN vertices vMovie ON vMovie.id = eActor.src AND vMovie.`~label` = 'movie'
INNER JOIN edges eGenre ON vMovie.id = eGenre.src AND eGenre.`~label` = 'belongsTo'
INNER JOIN vertices vGenre ON vGenre.id = eGenre.dst AND vGenre.`~label` = 'genre'
WHERE vPerson.`~label` = 'person'
AND vPerson.name <> 'Animation'
GROUP BY vPerson.name
) AS ActorGenreGrouping
WHERE ActorGenreGrouping.NumberOfGenres > 1
ORDER BY ActorGenreGrouping.NumberOfGenres DESC
""")
actorsInMultipleGenres.show(false)
Motif finding
val g = spark.dseGraph("killrvideo")
//get a list of actors who have worked in comedy movies
var comedyActors = g.find("(movie)-[e1]->(person); (movie)-[e2]->(genre)")
.filter("""
person.`~label` = 'person'
and e1.`~label` = 'actor'
and movie.`~label` = 'movie'
and e2.`~label` = 'belongsTo'
and genre.`~label` = 'genre'
and genre.name = 'Comedy'
""")
.select("person.name", "movie.title", "genre.name")
comedyActors.show(false)
//comedyActors.explain
Adding Shortcut Edges - DataFrames
// lit is pre-imported in the Spark shell; a standalone app needs the explicit import
import org.apache.spark.sql.functions.lit

val g = spark.dseGraph("killrvideo")
val vPerson1 = g.vertices.filter($"~label" === "person")
val eActor1 = g.edges.filter($"~label" === "actor")
val vMovie1 = g.vertices.filter($"~label" === "movie")
val eActor2 = g.edges.filter($"~label" === "actor")
val tempResults1 = vPerson1
.join(eActor1, vPerson1.col("id") === eActor1.col("dst"))
.select(vPerson1.col("id").as("vPerson1_id"), vPerson1.col("name").as("vPerson1_name"), eActor1.col("src").as("eActor1_src"))
val tempResults2 = tempResults1
.join(vMovie1, tempResults1.col("eActor1_src") === vMovie1.col("id"))
.select(tempResults1.col("vPerson1_id"), tempResults1.col("vPerson1_name"), vMovie1.col("id").as("vMovie1_id"), vMovie1.col("title"))
val tempResults3 = tempResults2
.join(eActor2, tempResults2.col("vMovie1_id") === eActor2.col("src"))
.select(tempResults2.col("vPerson1_id"), tempResults2.col("vPerson1_name"), tempResults2.col("title"), eActor2.col("dst").as("eActor2_dst"))
val shortcutEdges = tempResults3
.filter($"vPerson1_id" =!= $"eActor2_dst")
.select(tempResults3.col("vPerson1_id").as("src"), tempResults3.col("eActor2_dst").as("dst"), lit("workedTogether").as("~label"))
g.updateEdges(shortcutEdges)
Shortest Path
spark.sparkContext.setCheckpointDir("dsefs://127.0.0.1:5598/checkpoints")
val g = spark.dseGraph("killrvideo")
val johnWayneId = g.V.has("person", "name", "John Wayne").df.collect()(0)(0)
val jamesStewartId = g.V.has("person", "name", "James Stewart").df.collect()(0)(0)
val shortestPaths = g.shortestPaths.landmarks(Seq(johnWayneId, jamesStewartId)).run
//make a C* table that matches the schema of my dataframe
shortestPaths.createCassandraTable(
"test", //keyspace
"shortest_paths", //table_name
partitionKeyColumns = Some(Seq("id")),
clusteringKeyColumns = Some(Seq("~label")))
Shortest Path (cont'd)
//write to the table
shortestPaths.write.format("org.apache.spark.sql.cassandra")
.options(
Map(
"table" -> "shortest_paths",
"keyspace" -> "test",
"spark.cassandra.output.ignoreNulls" -> "true"
)
).save
//read it back in later
//val shortestPathsFromTable = spark.read.cassandraFormat("shortest_paths", "test").load
shortestPaths
.filter($"~label" === "person")
.select('name, 'distances(johnWayneId).as("hopsFromDuke"), 'distances(jamesStewartId).as("hopsFromJimmy"))
.orderBy('hopsFromDuke.desc)
.show(500, false)
Resources
https://graphframes.github.io/user-guide.html
https://github.com/apache/spark/tree/master/graphx/src/main/scala/org/apache/spark/graphx
https://github.com/graphframes/graphframes
https://www.youtube.com/watch?v=DW09q18OHfc - Russell Spitzer / Artem Aliev - Spark Summit talk
https://www.datastax.com/dev/blog/dse-graph-frame
https://github.com/datastax/graph-examples/blob/master/dse-graph-frame/Spark-shell-notes.scala
https://www.manning.com/books/spark-graphx-in-action
https://academy.datastax.com/resources/ds332
