©2013 DataStax Confidential. Do not distribute without consent.

Enter the Snake Pit for Fast and Easy Spark
Jon Haddad
Technical Evangelist, DataStax
@rustyrazorblade
Why is Cassandra Awesome?
• Fast ingestion of data
• Even on spinning disks
• Storage optimized for reading sorted rows
• Linear scalability
• Multi-DC: Active / Active
What’s Hard with Cassandra?
• Ad hoc querying
• Batch Processing
• Schema migrations
• Analytics
• Machine Learning
• Visualizing data
Enter Apache Spark
Distributed computation, kind of like MapReduce, but not completely horrible
Apache Spark
• Batch processing
• Functional constructs
• map / reduce / filter
• Fully distributed SQL
• RDD is a distributed collection of data
• Scala, Python, Java
• Streaming
• Machine learning
• Graph analytics (GraphX)
Batch Processing
• Data Migrations
• Aggregations
• Read data from one source
• Perform computations
• Write to somewhere else
Input: raw ratings

user | movie | rating
-----+-------+-------
   1 |     1 |     10
   2 |     1 |      9
   3 |     2 |      8
   4 |     2 |     10

Output: per-movie averages

id | name  | rating
---+-------+-------
 1 | rambo |    9.5
 2 | rocky |      9
Stream Processing
• Read data from a streaming source
• ZeroMQ, Kafka, Raw Socket
• Data is read in batches
• Streaming is at best an approximation
• ssc = StreamingContext(sc, 1) # 1 second batches (full sketch below)
Time | 1.1   1.5   1.7   2.1   2.4   2.8   3.4
Data | (1,2) (4,2) (6,2) (9,1) (3,5) (7,1) (3,10)
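A minimal runnable sketch of the batch model, assuming a raw socket source on localhost:9999 (hypothetical host and port; start one with nc -lk 9999):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StreamingSketch")
ssc = StreamingContext(sc, 1)  # 1 second batches

# Each batch is an RDD of the lines received during that second
lines = ssc.socketTextStream("localhost", 9999)
counts = lines.flatMap(lambda line: line.split()) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()  # prints each batch's counts

ssc.start()
ssc.awaitTermination()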
Machine Learning
• Supervised Learning
• Predictive questions
• Unsupervised learning
• Clustering
• Batch or streaming
• reevaluate clusters as new data arrives
Collaborative Filtering
• Recommendation engine
• Algo: Alternating least squares
• Movies, music, etc
• Perfect match for Cassandra
• Source of truth
• Hot, live data
• Spark generates recommendations (stored in Cassandra; sketch below)
• Feedback loops generate better recommendations over time
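A minimal sketch of ALS with pyspark.mllib, using the ratings DataFrame loaded later in the deck; the rank and iteration counts are illustrative assumptions, not tuned values:

from pyspark.mllib.recommendation import ALS, Rating

# Convert the ratings DataFrame to Rating(user, product, rating) records
training = ratings.rdd.map(
    lambda row: Rating(row.user_id, row.movie_id, float(row.rating)))

model = ALS.train(training, rank=10, iterations=10)

# Top 5 movie recommendations for user 1; these go back into Cassandra
model.recommendProducts(1, 5)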
Setup
• Set up a second DC for analytics
• Run Spark on each Cassandra node (configuration sketch below)
• More RAM is good
• Connector is smart about locality
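A minimal configuration sketch; the spark-cassandra-connector jar is supplied at launch (e.g. via pyspark --packages), and the contact point below is a local-cluster assumption:

from pyspark import SparkConf, SparkContext

# The connector reads spark.cassandra.connection.host to find the cluster;
# 127.0.0.1 assumes a local node
conf = SparkConf() \
    .setAppName("analytics") \
    .set("spark.cassandra.connection.host", "127.0.0.1")
sc = SparkContext(conf=conf)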
In the beginning… there was RDD
from operator import add
from random import random
import sys

from pyspark import SparkContext

sc = SparkContext(appName="PythonPi")
partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
n = 100000 * partitions

def f(_):
    # Sample a point in the unit square; 1 if it lands inside the unit circle
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 < 1 else 0

count = sc.parallelize(range(1, n + 1), partitions) \
          .map(f).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))
sc.stop()
Why Not Python + RDDs?
[Diagram: Python RDD → Py4J → JavaGatewayServer → JVM RDD]
Why Python?
• Spark is written in Scala
• Python is slow :(
• Python is popular
• Pandas, matplotlib, numpy
• We already know it
• It's so beautiful…
DataFrames
• Abstraction over RDDs
• Modeled after Pandas & R
• Structured data
• Python passes commands only
• Commands are pushed down
• Goal: Data Never Leaves the JVM
• You can still use the RDD if you want (sketch below)
• Sometimes you still need to pull data into Python (UUIDs)
[Diagram: DataFrame layered on top of the RDD]
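A sketch of dropping down to the RDD, which pulls rows out of the JVM into Python (movies is the DataFrame defined in the next section):

# DataFrame operations stay in the JVM; .rdd ships rows to Python workers
titles = movies.select("title").rdd.map(lambda row: row.title)
titles.take(5)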
Let's play with code
Sample Dataset - Movielens
• Subset of movies (1-100)
• ~800k ratings
CREATE TABLE movielens.movie (
movie_id int PRIMARY KEY,
genres set<text>,
title text
)
CREATE TABLE movielens.rating (
movie_id int,
user_id int,
rating decimal,
ts int,
PRIMARY KEY (movie_id, user_id)
)
Reading Cassandra Tables
• DataFrames have a standard interface for reading
• Cache if you want to keep the dataset in memory
cl = "org.apache.spark.sql.cassandra"
movies = sql.read.format(cl).
load(keyspace="movielens",
table="movie").cache()
ratings = sql.read.format(cl).
load(keyspace="movielens",
table="rating").cache()
Filtering
• Select specific rows matching various patterns
• Fields do not require indexes
• Filtering occurs in memory
• You can use DSE Solr search queries
• Filtering returns a DataFrame

movies.filter(movies.movie_id == 1)
movies[movies.movie_id == 1]
movies.filter("movie_id = 1")

movies.filter("title like '%Kombat%'")

movie_id | title                | genres
      44 | Mortal Kombat (1995) | ['Action', 'Adventure', 'Fantasy']
Filtering
• Helper function: explode()
• select() to keep specific columns
• alias() to rename

from pyspark.sql import functions as F

movies.select("title", F.explode("genres").alias("genre")) \
      .filter("genre = 'Action'").select("title")

After explode(), each movie appears once per genre:

title               | genre
Broken Arrow (1996) | Action
Broken Arrow (1996) | Adventure
Broken Arrow (1996) | Thriller

After filtering on genre and selecting title:

title
Broken Arrow (1996)
GoldenEye (1995)
Mortal Kombat (1995)
White Squall (1996)
Nick of Time (1995)
Aggregation
• Count, sum, avg
• in SQL: GROUP BY
• Useful with Spark Streaming
• Aggregate raw data
• Send to dashboards
ratings.groupBy("movie_id").agg(F.avg("rating").alias("avg"))
ratings.groupBy("movie_id").avg("rating")

movie_id | avg
      31 | 3.24
      32 | 3.8823
      33 | 3.021
Joins
• Inner join by default
• Can do various outer joins as well
• Returns a new DF with all the columns
ratings.join(movies, "movie_id")
DataFrame[movie_id: int, user_id: int, rating: decimal(10,0),
          ts: int, genres: array<string>, title: string]
Chaining Operations
• Similar to SQL, we can build up in complexity
• Combine joins with aggregations, limits & sorting
(ratings.groupBy("movie_id")
        .agg(F.avg("rating").alias("avg"))
        .sort("avg", ascending=False)
        .limit(3)
        .join(movies, "movie_id")
        .select("title", "avg"))

title                       | avg
Usual Suspects, The (1995)  | 4.32
Seven (a.k.a. Se7en) (1995) | 4.054
Persuasion (1995)           | 4.053
SparkSQL
• Register DataFrame as Table
• Query using HiveQL syntax
movies.registerTempTable("movie")
ratings.registerTempTable("rating")
sql.sql("""select title, avg(rating) as avg_rating
from movie join rating
on movie.movie_id = rating.movie_id
group by title
order by avg_rating DESC limit 3""")
Database Migrations
• DataFrame reader supports JDBC
• JOIN operations can be cross DB
• Read a DataFrame from JDBC, write it to Cassandra
Inter-DB Migration
from pyspark.sql import SQLContext
sql = SQLContext(sc)
m_con = "jdbc:mysql://127.0.0.1:3307/movielens?user=root"
movies = sql.read.jdbc(m_con, "movielens.movies")
movies.write.format("org.apache.spark.sql.cassandra") \
    .options(table="movie", keyspace="lens") \
    .save(mode="append")
http://rustyrazorblade.com/2015/08/migrating-from-mysql-to-cassandra-using-spark/
Jupyter Notebooks
• Iterate quickly
• Test ideas
• Graph results
Visualization
• dataframe.toPandas()
• Matplotlib
• Seaborn (looks nicer)
• Crunch big data in Spark (sketch below)
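A minimal sketch of the pattern: aggregate in Spark, hand the small result to pandas, plot with matplotlib (avg(rating) is Spark's default name for the avg() column):

import matplotlib.pyplot as plt

# Heavy lifting stays in Spark; only the aggregated result crosses over
avg_df = ratings.groupBy("movie_id").avg("rating").toPandas()

avg_df.plot(x="movie_id", y="avg(rating)", kind="bar")
plt.show()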