GraphFrames Access Methods
Jim Hatcher
Solution Architect, DataStax
Twitter: @thejimhatcher
Graph Day - San Francisco
September 2018
© DataStax, All Rights Reserved.
Agenda
● Building Blocks
● OSS Spark GraphFrames
● DSEGraphFrames
● Demo
● Resources
Building Blocks
DSE Graph Frames - Mental Model of Concepts
[Diagram: maps concepts (graph theory, graph database, distributed database, distributed execution framework, OLTP / real-time database) to their open-source building blocks (Apache Cassandra, Apache Spark, Apache TinkerPop & Gremlin, Resilient Distributed Datasets (RDDs), Spark DataFrames with query plan and memory optimization, Spark GraphX with graph algorithms and machine learning, Spark GraphFrames) and to the corresponding DataStax Enterprise (DSE) components (DSE Core, DSE Graph, DSE Search, DSE Analytics, DSE Graph Frames).]
Typical Cluster Topology in DSE Graph
[Diagram: a single cluster with two data centers; Data Center 1 serves OLTP / real-time workloads for real-time clients, and Data Center 2 serves OLAP / batch workloads for batch clients.]
OSS Spark GraphFrames
Capabilities
● Parallelization / Resilience / Distributed (from Spark)
● Query Plan Optimization (from Spark’s Catalyst engine)
● Memory Optimization (from Spark’s Tungsten engine)
● Spark SQL (from Spark DataFrames)
Motif Finding
● Motif Finding
○ g.find()
○ motif patterns (a subset of Cypher); see the sketch below
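A minimal sketch of what motif finding looks like in OSS GraphFrames, assuming the graphframes package is on the classpath; the tiny vertex/edge DataFrames and column names below are purely illustrative:

import org.graphframes.GraphFrame

// GraphFrames expects an "id" column on vertices and "src"/"dst" columns on edges
val vertices = spark.createDataFrame(Seq(
  ("1", "Alice"), ("2", "Bob"), ("3", "Carol")
)).toDF("id", "name")
val edges = spark.createDataFrame(Seq(
  ("1", "2", "follows"), ("2", "1", "follows"), ("2", "3", "follows")
)).toDF("src", "dst", "relationship")
val g = GraphFrame(vertices, edges)

// motif pattern: pairs (a)->(b) and (b)->(a), i.e. mutual follows
val mutual = g.find("(a)-[e1]->(b); (b)-[e2]->(a)")
  .filter("e1.relationship = 'follows' AND e2.relationship = 'follows'")
mutual.select("a.name", "b.name").show()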
Graph Algorithms
● Graph Algorithms (from GraphX)
○ Breadth-First Search (BFS)
○ Connected Components / Strongly Connected Components
○ Label Propagation Algorithm (LPA)
○ Page Rank
○ Shortest Paths
○ SVD++
○ Triangle Count
● Building blocks to write your own algorithms
○ aggregateMessages()
○ pregel() - GraphX
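Running one of the packaged algorithms is a one-liner on a GraphFrame; a sketch, reusing the illustrative GraphFrame g from the motif example above:

// PageRank: damping via resetProbability, fixed number of iterations
val ranks = g.pageRank.resetProbability(0.15).maxIter(10).run()
ranks.vertices.select("id", "pagerank").orderBy($"pagerank".desc).show()

// connected components requires a Spark checkpoint directory to be set first
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")
g.connectedComponents.run().select("id", "component").show()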
Data Source
● Load your vertices / edges from any Spark source
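A sketch of what that looks like; the paths and formats below are hypothetical, and any Spark DataFrame reader works as long as the vertex DataFrame has an "id" column and the edge DataFrame has "src" and "dst" columns:

import org.graphframes.GraphFrame

// hypothetical sources on a distributed file system
val vertices = spark.read.parquet("dsefs:///data/people_vertices.parquet")  // must contain "id"
val edges = spark.read.json("dsefs:///data/knows_edges.json")               // must contain "src", "dst"
val g = GraphFrame(vertices, edges)
g.vertices.count()
g.edges.count()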
DSEGraphFrames
Data Source
● Point to your DSE Graph
val g = spark.dseGraph("my_graph_name")
● Or, point to any other data source
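In the dse spark shell, spark.dseGraph(...) is available out of the box; a standalone Spark application would likely need an import along these lines (package name as documented for DSE 5.1/6.0, so treat the exact details as an assumption for your version):

// brings spark.dseGraph(...) into scope
import com.datastax.bdp.graph.spark.graphframe._

val g = spark.dseGraph("my_graph_name")
// the returned DseGraphFrame also wraps a plain GraphFrame (exposed as gf) if you want the OSS API directly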
Apache TinkerPop support
● The same Gremlin that you write for your OLTP-based traversals can be used for analytical requirements
● However, only a limited subset of the Gremlin steps is currently implemented
○ Inclusions:
■ DSE 5.1: https://docs.datastax.com/en/dse/5.1/dse-dev/datastax_enterprise/graph/graphAnalytics/tinkerpopDseGraphFrame.html
■ DSE 6.0: https://docs.datastax.com/en/dse/6.0/dse-dev/datastax_enterprise/graph/graphAnalytics/tinkerpopDseGraphFrame.html
○ Notable Exclusions:
■ repeat()
■ union()
■ as() / select() -- added in DSE 6.0
Good for Scan Operations
● Very good for operations that require table scans
○ Examples:
■ g.V().count()
■ g.E().count()
■ g.V().groupCount().by(__.label())
■ g.E().groupCount().by(__.label())
Mutations
● Effective way of mutating the graph (not available in OSS GraphFrames)
○ Mutations cannot be done using Gremlin OLAP
○ Takes advantage of Spark’s innate ability to parallelize processes
● Potential Use Cases
○ Migration from current graph schema to new graph schema
○ Adding shortcut edges
○ Initial load of the graph (see the sketch below)
■ Requires a distributed file system such as DSEFS or HDFS
○ Drop all instances of Vertex Label X
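A sketch of the initial-load use case, assuming a dse spark shell pointed at the killrvideo graph; the file path and columns are hypothetical, and the DataFrame columns must line up with the properties of the target vertex label:

import org.apache.spark.sql.functions.lit

val g = spark.dseGraph("killrvideo")

// hypothetical CSV on DSEFS with one row per person vertex
val people = spark.read
  .option("header", "true")
  .csv("dsefs:///data/person.csv")
  .withColumn("~label", lit("person"))

// updateVertices upserts vertices in parallel across the Spark executors;
// updateEdges (shown later in the demo) does the same for edges
g.updateVertices(people)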
Demo
Dataset
KillrVideo - reference application
https://github.com/datastax/graph-examples/
Summary Traversals - TinkerPop/Gremlin
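// note: __, local, values, and decr in the traversal below are standard TinkerPop references
// (dsl.graph.__, Scope.local, Column.values, Order.decr), assumed to already be in scope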
val g = spark.dseGraph("killrvideo")
g.V().count()
g.E().count()
g.V().groupCount().by(__.label())
g.E().groupCount().by(__.label())
//get count of actors by movie
g.V()
.hasLabel("movie")
//.has("title", "I Am Legend")
.as("m")
.out("actor")
.groupCount().by(__.select("m").values("title"))
.order(local).by(values, decr)
Summary Traversals - Spark SQL
//register our vertex and edge tables so we can reference them in Spark SQL
spark.read.format("com.datastax.bdp.graph.spark.sql.vertex").option("graph",
"killrvideo").load.createOrReplaceTempView("vertices")
spark.read.format("com.datastax.bdp.graph.spark.sql.edge").option("graph",
"killrvideo").load.createOrReplaceTempView("edges")
//get Count of Actors by movie
val moviesAndActorCounts = spark.sql("""
SELECT vMovie.title, COUNT(*) AS NumberOfActors
FROM vertices vMovie
INNER JOIN edges eActor ON vMovie.id = eActor.src AND eActor.`~label` = 'actor'
WHERE vMovie.`~label` = 'movie'
GROUP BY vMovie.id, vMovie.title
ORDER BY COUNT(*) DESC
""")
moviesAndActorCounts.show(false)
//moviesAndActorCounts.explain
Summary Traversals - Spark SQL (cont'd)
val actorsInMultipleGenres = spark.sql("""
SELECT ActorGenreGrouping.ActorName, ActorGenreGrouping.NumberOfGenres
FROM
(
SELECT vPerson.name AS ActorName, COUNT(DISTINCT vGenre.name) AS NumberOfGenres
FROM vertices vPerson
INNER JOIN edges eActor ON vPerson.id = eActor.dst AND eActor.`~label` = 'actor'
INNER JOIN vertices vMovie ON vMovie.id = eActor.src AND vMovie.`~label` = 'movie'
INNER JOIN edges eGenre ON vMovie.id = eGenre.src AND eGenre.`~label` = 'belongsTo'
INNER JOIN vertices vGenre ON vGenre.id = eGenre.dst AND vGenre.`~label` = 'genre'
WHERE vPerson.`~label` = 'person'
AND vPerson.name <> 'Animation'
GROUP BY vPerson.name
) AS ActorGenreGrouping
WHERE ActorGenreGrouping.NumberOfGenres > 1
ORDER BY ActorGenreGrouping.NumberOfGenres DESC
""")
actorsInMultipleGenres.show(false)
Motif finding
val g = spark.dseGraph("killrvideo")
//get a list of actors who have worked in comedy movies
var comedyActors = g.find("(movie)-[e1]->(person); (movie)-[e2]->(genre)")
.filter("""
person.`~label` = 'person'
and e1.`~label` = 'actor'
and movie.`~label` = 'movie'
and e2.`~label` = 'belongsTo'
and genre.`~label` = 'genre'
and genre.name = 'Comedy'
""")
.select("person.name", "movie.title", "genre.name")
comedyActors.show(false)
//comedyActors.explain
Adding Shortcut Edges - DataFrames
// lit is pre-imported in the Spark shell; a standalone app needs the explicit import
import org.apache.spark.sql.functions.lit

val g = spark.dseGraph("killrvideo")
val vPerson1 = g.vertices.filter($"~label" === "person")
val eActor1 = g.edges.filter($"~label" === "actor")
val vMovie1 = g.vertices.filter($"~label" === "movie")
val eActor2 = g.edges.filter($"~label" === "actor")
val tempResults1 = vPerson1
.join(eActor1, vPerson1.col("id") === eActor1.col("dst"))
.select(vPerson1.col("id").as("vPerson1_id"), vPerson1.col("name").as("vPerson1_name"), eActor1.col("src").as("eActor1_src"))
val tempResults2 = tempResults1
.join(vMovie1, tempResults1.col("eActor1_src") === vMovie1.col("id"))
.select(tempResults1.col("vPerson1_id"), tempResults1.col("vPerson1_name"), vMovie1.col("id").as("vMovie1_id"), vMovie1.col("title"))
val tempResults3 = tempResults2
.join(eActor2, tempResults2.col("vMovie1_id") === eActor2.col("src"))
.select(tempResults2.col("vPerson1_id"), tempResults2.col("vPerson1_name"), tempResults2.col("title"), eActor2.col("dst").as("eActor2_dst"))
val shortcutEdges = tempResults3
.filter($"vPerson1_id" =!= $"eActor2_dst")
.select(tempResults3.col("vPerson1_id").as("src"), tempResults3.col("eActor2_dst").as("dst"), lit("workedTogether").as("~label"))
g.updateEdges(shortcutEdges)
Shortest Path
spark.sparkContext.setCheckpointDir("dsefs://127.0.0.1:5598/checkpoints")
val g = spark.dseGraph("killrvideo")
val johnWayneId = g.V.has("person", "name", "John Wayne").df.collect()(0)(0)
val jamesStewartId = g.V.has("person", "name", "James Stewart").df.collect()(0)(0)
val shortestPaths = g.shortestPaths.landmarks(Seq(johnWayneId, jamesStewartId)).run
//make a C* table that matches the schema of my dataframe
shortestPaths.createCassandraTable(
"test", //keyspace
"shortest_paths", //table_name
partitionKeyColumns = Some(Seq("id")),
clusteringKeyColumns = Some(Seq("~label")))
Shortest Path (cont'd)
//write to the table
shortestPaths.write.format("org.apache.spark.sql.cassandra")
.options(
Map(
"table" -> "shortest_paths",
"keyspace" -> "test",
"spark.cassandra.output.ignoreNulls" -> "true"
)
).save
//read it back in later
//val shortestPathsFromTable = spark.read.cassandraFormat("shortest_paths", "test").load
shortestPaths
.filter($"~label" === "person")
.select('name, 'distances(johnWayneId).as("hopsFromDuke"), 'distances(jamesStewartId).as("hopsFromJimmy"))
.orderBy('hopsFromDuke.desc)
.show(500, false)
Resources
https://graphframes.github.io/user-guide.html
https://github.com/apache/spark/tree/master/graphx/src/main/scala/org/apache/spark/graphx
https://github.com/graphframes/graphframes
https://www.youtube.com/watch?v=DW09q18OHfc - Russell Spitzer / Artem Aliev - Spark Summit talk
https://www.datastax.com/dev/blog/dse-graph-frame
https://github.com/datastax/graph-examples/blob/master/dse-graph-frame/Spark-shell-notes.scala
https://www.manning.com/books/spark-graphx-in-action
https://academy.datastax.com/resources/ds332
