Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

If you've been aware of the buzzword scene for the last 6 months, you've probably come across the term "graph" quite a bit. Sadly, very few people are aware of what a graph is or how to use it appropriately. If I've just described you, don't panic! In this talk, I'll bring everyone up to speed on the basics of what a graph is all about. The term graph has its foundation in some pretty hardcore computer science. This lends itself to some complexity, but there's hope! We can leverage these roots in a less complicated manner by using GraphFrames and Spark to extract maximum analytical awesomeness from our existing Cassandra data.

About the Speaker
Jon Haddad, Consultant, The Last Pickle

Jon Haddad has 15 years of experience in both development and operations. For the last 11 years, he's worked at various startups in southern California. In a past life, he put Cassandra in production and co-wrote cqlengine, the Python object mapper for CQL. After 2 years as an evangelist at DataStax, he is now Principal Consultant at The Last Pickle, helping customers of all sizes succeed with Apache Cassandra.

Published in: Software

  1. JON HADDAD PRINCIPAL CONSULTANT, TLP CONNECTING C* DATA WITH GRAPHFRAMES loves
  2. WHAT’S THE LAST PICKLE DO?
  3. WE HELP MAKE YOU A GROUP OF EXPERTS
  4. WHO IS THIS GUY?
  5. 15 YEARS EXPERIENCE
  6. 4 YEARS WITH CASSANDRA
  7. CASSANDRA DATA MODELING
  8. BIG DATA PROBLEMS
  9. DATA MODELING RULES
  10. DENORMALIZE
  11. I MISS MY JOINS
  12. PERFORMANCE & RELIABILITY
  13. LOW FLEXIBILITY
  14. WHAT IS A GRAPH?
  15. JON TLP NATE cofounded knows works at
  16. RELATIONSHIPS ARE ARBITRARY
  17. A GRAPH IS TRAVERSED JON TLP works at start end follow
  18. ELEMENTS ARE NOT TYPED
  19. FOLLOW ALL EDGES IN ANY QUERY
  20. JON TLP NATE cofounded knows works at
  21. TRAVERSALS ARE MORE FLEXIBLE THAN JOINS
  22. THE WORLD IS A GRAPH
  23. epic chart of flexibility Apache Cassandra Graph databases
  24. GRAPH IS COOL RIGHT?
  25. NEO4J TITAN DSE GRAPH cookie monster photo?
  26. GRAPH ALL THE THINGS!
  27. TRADEOFFS
  28. PERFORMANCE graph
  29. REMEMBER WHY WE DON’T DO JOINS?
  30. DISTRIBUTED JOINS ARE HARD WORK
  31. MORE WORK = SLOWER DATABASE
  32. APPLICATION COMPLEXITY
  33. DO I NEED GRAPH ALL THE TIME?
  34. GRAPH QUERIES ON CASSANDRA?
  35. GRAPH IS COOL FOR ANALYTICS
  36. LET’S USE SPARK
  37. cdm install movielens
  38. CREATE TABLE movies (
          id uuid PRIMARY KEY,
          avg_rating float,
          genres set<text>,
          name text,
          release_date date,
          url text,
          video_release_date date
      );
  39. CREATE TABLE users (
          id uuid PRIMARY KEY,
          address text,
          age int,
          city text,
          gender text,
          name text,
          occupation text,
          zip text
      );
  40. CREATE TABLE ratings_by_movie (
          movie_id uuid,
          user_id uuid,
          rating int,
          ts int,
          PRIMARY KEY (movie_id, user_id)
      );
  41. RECOMMENDATION ENGINE
  42. AWESOME
  43. GET DATA INTO A GRAPH
  44. JON TOP GUN label: rated rating: 5
  45. DATAFRAMES
      id                                    genres                            name
      ae4f9269-5d62-4ad1-b87c-1b23962bb224  {'Drama'}                         Prefontaine (1997)
      de9a14a9-6d6d-4573-b415-c8555e85d391  {'Drama'}                         Raging Bull (1980)
      0b67d4e7-ee2b-47ab-9437-df0c793ea72a  {'Action', 'Sci-Fi', 'Thriller'}  Face/Off (1997)
  46. BOILERPLATE
      from functools import partial
      from pyspark.sql import SQLContext, functions as F
      from graphframes import GraphFrame
      sql = SQLContext(sc)  # sc is the SparkContext provided by the pyspark shell
      connector = "org.apache.spark.sql.cassandra"
      load = partial(sql.read.format(connector).load, keyspace="movielens")
  47. LOAD THE DATA FRAMES
      movies = load(table="movies")
      ratings = load(table="ratings_by_movie")
      users = load(table="users")
  48. WHAT’S A GRAPHFRAME?
      GraphFrame(v, e)
  49. VERTEX LIST
      movies + users
  50. MOVIE DATAFRAME
      DataFrame[
          id: string,
          avg_rating: float,
          genres: array<string>,
          name: string,
          release_date: date,
          url: string,
          video_release_date: date
      ]
  51. MOVIES AS LIST OF VERTICES
      # graph elements have no type, so each vertex is tagged with a label column
      movies_v = movies.select("id", "name").withColumn("label", F.lit("movie"))
  52. MOVIE VERTEX
      [Row(id=u'6d318848…', name=u'Anna (1996)', label=u'movie')]
  53. USERS AS VERTICES
      users_v = users.select("id", "name").withColumn("label", F.lit("user"))
      [Row(id=u'b52fcdfc…', name=u'Harrold Hills', label=u'user'),
  54. CREATE THE FULL VERTEX LIST
      vertices = movies_v.unionAll(users_v)
  55. GET THE EDGES
      edges = ratings.select(ratings.movie_id.alias("dst"),
                             ratings.user_id.alias("src"),
                             "rating")
  56. CREATE THE GRAPH
      g = GraphFrame(vertices, edges)
  57. PATTERN MATCHING AKA MOTIFS
  58. (a)-[r]->(c) MOTIFS
  59. (a)-[r]->(c); (b)-[s]->(c) MOTIFS
  60. a r c b s (a)-[r]->(c); (b)-[s]->(c) name: jon name: dani name: top gun
  61. QUERY THE GRAPH
      corated = (g.find("(a)-[r]->(c); (b)-[s]->(c)")
                  .filter("a.label = 'user'")
                  .filter("b.label = 'user'")
                  .filter("r.rating >= 4")
                  .filter("s.rating >= 4"))
  62. WORKING WITH RESULTS
      # the second user is vertex b; vertex c is the movie both users rated
      user_movie_rating_freq = (corated.select(corated.a.id.alias("user"),
                                               corated.b.id.alias("user2"))
                                       .groupBy("user", "user2").count())
      [Row(user=u'87281e3a-3ca5-438b-917d-fb8d3d96da35',
           user2=u'e9e24ad2-457a-488c-bdd1-3cb0ea82a470',
           count=7)]
  63. WRITE YOUR DATA FRAMES BACK TO CASSANDRA
  64. create table corated (
          user1 uuid,
          user2 uuid,
          count int,
          primary key(user1, user2)
      );
  65. SHORTEST PATH
  66. WHAT IS THE CONNECTION FROM A TO B?
  67. ATTENDED @RUSTYRAZORBLADE @PATRICKMCFADIN COUSIN @oscar_the_grouch ATTENDED CAL POLY MY COUSIN WENT TO SCHOOL WITH PATRICK
  68. GRAPH PROBLEMS ARE USUALLY JUST FEATURES
  69. USE CASSANDRA PLUS SPARK
  70. @RUSTYRAZORBLADE
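
The two analyses the slides name, the co-rating motif (slides 57 to 62) and shortest path (slides 65 to 67), can be tried without a Spark cluster. What follows is a minimal plain-Python sketch, not the GraphFrames implementation. The sample ratings are made up, the function names `corated_counts` and `shortest_path` are hypothetical, and the edge list is one reading of the slide 67 diagram (cousin, attended, attended). Unlike the motif on slide 61, which places no constraint on `a` and `b` being different vertices, this sketch skips pairs where both users are the same person.

```python
from collections import Counter, deque
from itertools import permutations

# Hypothetical sample data: (user, movie, rating) triples standing in for
# rows of ratings_by_movie. Names and ratings are invented for illustration.
ratings = [
    ("jon", "top gun", 5),
    ("dani", "top gun", 4),
    ("jon", "face/off", 4),
    ("dani", "face/off", 5),
    ("nate", "top gun", 2),
]

def corated_counts(ratings, min_rating=4):
    """Count, per ordered user pair, the movies both rated >= min_rating."""
    fans_by_movie = {}
    for user, movie, rating in ratings:
        if rating >= min_rating:
            fans_by_movie.setdefault(movie, set()).add(user)
    counts = Counter()
    for fans in fans_by_movie.values():
        # Ordered pairs of distinct users who share this movie.
        counts.update(permutations(sorted(fans), 2))
    return counts

def shortest_path(edges, start, goal):
    """Breadth-first search over undirected edges; returns a vertex list or None."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in sorted(neighbors.get(node, ())):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None

# Assumed reading of the slide 67 diagram.
social_edges = [
    ("@rustyrazorblade", "@oscar_the_grouch"),  # cousin
    ("@oscar_the_grouch", "cal poly"),          # attended
    ("@patrickmcfadin", "cal poly"),            # attended
]

pair_counts = corated_counts(ratings)
path = shortest_path(social_edges, "@rustyrazorblade", "@patrickmcfadin")
```

Here `pair_counts` plays the role of the `corated` table from slide 64 (one count per ordered user pair), and `path` walks from one handle to the other through the shared school.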
