Lightning-fast analytics with Spark and Cassandra 
Nick Bailey 
@nickmbailey 
©2014 DataStax Confidential. Do not distribute without consent. 
What is Spark?
* Open source since 2010; an Apache project since 2013
* Fast
  * 10x-100x faster than Hadoop MapReduce
  * In-memory storage
  * Single JVM process per node
* Easy
  * Rich Scala, Java and Python APIs
  * 2x-5x less code
  * Interactive shell
API 
map reduce
API
map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin,
reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip,
sample, take, first, partitionBy, mapWith, pipe, save, ...
API
* Resilient Distributed Datasets (RDD)
  * Collections of objects spread across a cluster, stored in RAM or on disk
  * Built through parallel transformations
  * Automatically rebuilt on failure
* Operations
  * Transformations (e.g. map, filter, groupBy)
  * Actions (e.g. count, collect, save)
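A minimal sketch of the transformation/action split, assuming spark-core is on the classpath and using a local master instead of a cluster. The object name `RddBasics`, the helper `longWordsUpper` and the sample words are illustrative and not part of the original deck.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddBasics {
  // Transformations (filter, map) only describe a new RDD;
  // the collect() action runs the job and returns the results to the driver.
  def longWordsUpper(sc: SparkContext, input: Seq[String]): Array[String] = {
    val words = sc.parallelize(input)          // build an RDD from a local collection
    val longWords = words.filter(_.length > 3) // transformation
    val upper = longWords.map(_.toUpperCase)   // transformation
    upper.collect()                            // action
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local[2] runs Spark inside this JVM with two threads, no cluster needed
    val conf = new SparkConf().setMaster("local[2]").setAppName("rdd-basics")
    val sc = new SparkContext(conf)
    try {
      println(longWordsUpper(sc, Seq("foo", "bar", "cassandra", "spark")).mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

Nothing is computed until `collect()` is called; the two transformations before it only extend the RDD's lineage, which is also what lets Spark rebuild lost partitions after a failure.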
Operator Graph: Optimization and Fault Tolerance
[Figure: operator graph of RDDs A-F connected by map, filter, groupBy and join, split into Stages 1-3; legend: RDD, cached partition]
Fast
* Logistic regression performance
[Figure: running time (s, 0-4000) vs. number of iterations (1, 5, 10, 20, 30), Hadoop vs. Spark]
* Hadoop: 110 sec / iteration
* Spark: first iteration 80 sec, further iterations 1 sec
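The per-iteration figures on this slide can be turned into total running times with a little arithmetic. The sketch below assumes a simple linear model (constant cost per iteration); the chart's exact data points are not recoverable from the transcript, and the names `IterationCost`, `hadoopSeconds` and `sparkSeconds` are illustrative.

```scala
object IterationCost {
  // Hadoop: 110 seconds for every iteration (figure from the slide)
  def hadoopSeconds(iterations: Int): Int = 110 * iterations

  // Spark: 80 seconds for the first iteration, 1 second for each further one
  def sparkSeconds(iterations: Int): Int =
    if (iterations <= 0) 0 else 80 + (iterations - 1)

  def main(args: Array[String]): Unit =
    // Iteration counts taken from the chart's x-axis
    for (n <- Seq(1, 5, 10, 20, 30))
      println(s"$n iterations: Hadoop ${hadoopSeconds(n)} s, Spark ${sparkSeconds(n)} s")
}
```

Under this model, 30 iterations cost 3300 s on Hadoop and 109 s on Spark, roughly a 30x difference, because the data stays cached in memory after the first pass.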
Why Spark on Cassandra?
* Data model independent queries
* Cross-table operations (JOIN, UNION, etc.)
* Complex analytics (e.g. machine learning)
* Data transformation, aggregation, etc.
* Stream processing
How to Spark on Cassandra?
* DataStax Cassandra Spark driver
  * Open source: https://github.com/datastax/spark-cassandra-connector
* Compatible with
  * Spark 0.9+
  * Cassandra 2.0+
  * DataStax Enterprise 4.5+
Cassandra Spark Driver
* Cassandra tables exposed as Spark RDDs
* Read from and write to Cassandra
* Mapping of C* tables and rows to Scala objects
* All Cassandra types supported and converted to Scala types
* Server side data selection
* Spark Streaming support
* Scala and Java support
Connecting to Cassandra
// Spark configuration and context classes
import org.apache.spark.{SparkConf, SparkContext}
// Import Cassandra-specific functions on SparkContext and RDD objects
// (later connector releases use the package com.datastax.spark.connector._)
import com.datastax.driver.spark._

// Spark connection options
val conf = new SparkConf(true)
  .setMaster("spark://192.168.123.10:7077")
  .setAppName("cassandra-demo")
  .set("cassandra.connection.host", "192.168.123.10") // initial contact
  .set("cassandra.username", "cassandra")
  .set("cassandra.password", "cassandra")

val sc = new SparkContext(conf)
Accessing Data
CREATE TABLE test.words (word text PRIMARY KEY, count int);

INSERT INTO test.words (word, count) VALUES ('bar', 30);
INSERT INTO test.words (word, count) VALUES ('foo', 20);
* Accessing table above as RDD:
// Use table as RDD
val rdd = sc.cassandraTable("test", "words")
// rdd: CassandraRDD[CassandraRow] = CassandraRDD[0]

// collect() brings all rows to the driver (replaces the deprecated toArray)
rdd.collect().foreach(println)
// CassandraRow[word: bar, count: 30]
// CassandraRow[word: foo, count: 20]

rdd.columnNames // Stream(word, count)
rdd.size // 2

val firstRow = rdd.first // firstRow: CassandraRow = CassandraRow[word: bar, count: 30]
firstRow.getInt("count") // Int = 30
Saving Data
val newRdd = sc.parallelize(Seq(("cat", 40), ("fox", 50)))
// newRdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[2]

newRdd.saveToCassandra("test", "words", Seq("word", "count"))
* RDD above saved to Cassandra:
SELECT * FROM test.words;

 word | count
------+-------
  bar |    30
  foo |    20
  cat |    40
  fox |    50

(4 rows)
Type Mapping
CQL Type        | Scala Type
----------------+---------------------------------------------------------------
ascii           | String
bigint          | Long
boolean         | Boolean
counter         | Long
decimal         | BigDecimal, java.math.BigDecimal
double          | Double
float           | Float
inet            | java.net.InetAddress
int             | Int
list            | Vector, List, Iterable, Seq, IndexedSeq, java.util.List
map             | Map, TreeMap, java.util.HashMap
set             | Set, TreeSet, java.util.HashSet
text, varchar   | String
timestamp       | Long, java.util.Date, java.sql.Date, org.joda.time.DateTime
timeuuid        | java.util.UUID
uuid            | java.util.UUID
varint          | BigInt, java.math.BigInteger
nullable values | Option
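A small plain-Scala sketch of what the Option mapping means for application code, with no Spark or Cassandra dependency. The case class `WordCount`, the helper `countLabel` and the row `baz` with a null count are hypothetical and only illustrate the idea.

```scala
object NullableColumns {
  // A nullable `count` column is modelled as Option[Int] rather than a bare Int
  case class WordCount(word: String, count: Option[Int])

  // Render the count for display, falling back to a placeholder when the column was null
  def countLabel(row: WordCount): String =
    row.count.map(_.toString).getOrElse("n/a")

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      WordCount("bar", Some(30)), // value present
      WordCount("baz", None)      // hypothetical row whose count is null
    )
    rows.foreach(r => println(s"${r.word}: ${countLabel(r)}"))
  }
}
```

Using Option makes the missing-value case explicit in the type, so the compiler forces a decision about nulls instead of leaving it to a runtime exception.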
Mapping Rows to Objects
* Mapping rows to Scala case classes
* CQL underscore-case column mapped to Scala camel-case property
* Custom mapping functions (see docs)
CREATE TABLE test.cars (
  id text PRIMARY KEY,
  model text,
  fuel_type text,
  year int
);
case class Vehicle(
  id: String,
  model: String,
  fuelType: String,
  year: Int
)

sc.cassandraTable[Vehicle]("test", "cars").collect()
// Array(Vehicle(KF334L, Ford Mondeo, Petrol, 2009),
//       Vehicle(MT8787, Hyundai x35, Diesel, 2011))
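The underscore-to-camel-case naming convention can be sketched in a few lines of plain Scala. This is an illustration of the convention only, not the connector's actual implementation; `ColumnNames` and `toCamelCase` are hypothetical names.

```scala
object ColumnNames {
  // Convert a CQL-style underscore name (fuel_type) to a camel-case property name (fuelType)
  def toCamelCase(column: String): String = {
    val parts = column.split("_").filter(_.nonEmpty)
    if (parts.isEmpty) column
    else parts.head + parts.tail.map(_.capitalize).mkString
  }

  def main(args: Array[String]): Unit =
    // Column names from the test.cars table
    Seq("id", "model", "fuel_type", "year")
      .foreach(c => println(s"$c -> ${toCamelCase(c)}"))
}
```

Names without an underscore pass through unchanged, which is why only `fuel_type` needs a differently spelled field in the case class.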
Server Side Data Selection
* Reduce the amount of data transferred
* Selecting columns
sc.cassandraTable("test", "users").select("username").collect().foreach(println)
// CassandraRow{username: john}
// CassandraRow{username: tom}

* Selecting rows (by clustering columns and/or secondary indexes)
sc.cassandraTable("test", "cars").select("model").where("color = ?", "black").collect().foreach(println)
// CassandraRow{model: Ford Mondeo}
// CassandraRow{model: Hyundai x35}
Spark SQL 
[Figure: Spark stack: Spark SQL, Streaming, ML and Graph on top of Spark (general execution engine), on top of Cassandra; labeled "Compatible"]
Spark SQL
* SQL query engine on top of Spark
* Hive compatible (JDBC, UDFs, types, metadata, etc.)
* Support for in-memory processing
* Pushdown of predicates to Cassandra when possible
Spark SQL Example

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.cassandra.CassandraSQLContext
import com.datastax.spark.connector._

// Connect to the Spark cluster
val conf = new SparkConf(true)...
val sc = new SparkContext(conf)

// Create Cassandra SQL context
val cc = new CassandraSQLContext(sc)

// Execute SQL query
val rdd = cc.sql("SELECT * FROM keyspace.table WHERE ...")
Analytics Workload Isolation
[Figure: mixed-load Cassandra cluster with two data centers: a Cassandra-only DC serving the online app and a Cassandra + Spark DC serving the analytical app]
Analytics High Availability
* Spark Workers run on all Cassandra nodes
* Workers are resilient by default
* First Spark node promoted as Spark Master
* Standby Master promoted on failure
* Master HA available in DataStax Enterprise
[Figure: Spark Master, Spark Standby Master and Spark Workers]
Questions?

More Related Content

What's hot

Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials DayAnalytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Matthias Niehoff
 
Lightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and CassandraLightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and Cassandra
Rustam Aliyev
 
Lightning fast analytics with Cassandra and Spark
Lightning fast analytics with Cassandra and SparkLightning fast analytics with Cassandra and Spark
Lightning fast analytics with Cassandra and Spark
Victor Coustenoble
 
Analytics with Cassandra & Spark
Analytics with Cassandra & SparkAnalytics with Cassandra & Spark
Analytics with Cassandra & Spark
Matthias Niehoff
 
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Spark Summit
 
Introduction to Real-Time Analytics with Cassandra and Hadoop
Introduction to Real-Time Analytics with Cassandra and HadoopIntroduction to Real-Time Analytics with Cassandra and Hadoop
Introduction to Real-Time Analytics with Cassandra and Hadoop
Patricia Gorla
 
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
StampedeCon
 
Spark and Cassandra 2 Fast 2 Furious
Spark and Cassandra 2 Fast 2 FuriousSpark and Cassandra 2 Fast 2 Furious
Spark and Cassandra 2 Fast 2 Furious
Russell Spitzer
 
Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and Future
Russell Spitzer
 
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
Natalino Busa
 
Spark Streaming with Cassandra
Spark Streaming with CassandraSpark Streaming with Cassandra
Spark Streaming with Cassandra
Jacek Lewandowski
 
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...
DataStax
 
Heuritech: Apache Spark REX
Heuritech: Apache Spark REXHeuritech: Apache Spark REX
Heuritech: Apache Spark REX
didmarin
 
Time series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long versionTime series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long version
Patrick McFadin
 
BI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraBI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache Cassandra
Victor Coustenoble
 
Intro to py spark (and cassandra)
Intro to py spark (and cassandra)Intro to py spark (and cassandra)
Intro to py spark (and cassandra)
Jon Haddad
 
Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data
prajods
 
Using Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into CassandraUsing Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into Cassandra
Jim Hatcher
 
Cassandra Fundamentals - C* 2.0
Cassandra Fundamentals - C* 2.0Cassandra Fundamentals - C* 2.0
Cassandra Fundamentals - C* 2.0
Russell Spitzer
 
Escape From Hadoop: Spark One Liners for C* Ops
Escape From Hadoop: Spark One Liners for C* OpsEscape From Hadoop: Spark One Liners for C* Ops
Escape From Hadoop: Spark One Liners for C* Ops
Russell Spitzer
 

What's hot (20)

Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials DayAnalytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
 
Lightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and CassandraLightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and Cassandra
 
Lightning fast analytics with Cassandra and Spark
Lightning fast analytics with Cassandra and SparkLightning fast analytics with Cassandra and Spark
Lightning fast analytics with Cassandra and Spark
 
Analytics with Cassandra & Spark
Analytics with Cassandra & SparkAnalytics with Cassandra & Spark
Analytics with Cassandra & Spark
 
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
 
Introduction to Real-Time Analytics with Cassandra and Hadoop
Introduction to Real-Time Analytics with Cassandra and HadoopIntroduction to Real-Time Analytics with Cassandra and Hadoop
Introduction to Real-Time Analytics with Cassandra and Hadoop
 
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
 
Spark and Cassandra 2 Fast 2 Furious
Spark and Cassandra 2 Fast 2 FuriousSpark and Cassandra 2 Fast 2 Furious
Spark and Cassandra 2 Fast 2 Furious
 
Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and Future
 
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
 
Spark Streaming with Cassandra
Spark Streaming with CassandraSpark Streaming with Cassandra
Spark Streaming with Cassandra
 
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...
 
Heuritech: Apache Spark REX
Heuritech: Apache Spark REXHeuritech: Apache Spark REX
Heuritech: Apache Spark REX
 
Time series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long versionTime series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long version
 
BI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraBI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache Cassandra
 
Intro to py spark (and cassandra)
Intro to py spark (and cassandra)Intro to py spark (and cassandra)
Intro to py spark (and cassandra)
 
Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data
 
Using Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into CassandraUsing Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into Cassandra
 
Cassandra Fundamentals - C* 2.0
Cassandra Fundamentals - C* 2.0Cassandra Fundamentals - C* 2.0
Cassandra Fundamentals - C* 2.0
 
Escape From Hadoop: Spark One Liners for C* Ops
Escape From Hadoop: Spark One Liners for C* OpsEscape From Hadoop: Spark One Liners for C* Ops
Escape From Hadoop: Spark One Liners for C* Ops
 

Viewers also liked

Apache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series dataApache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series dataPatrick McFadin
 
Time Series Processing with Apache Spark
Time Series Processing with Apache SparkTime Series Processing with Apache Spark
Time Series Processing with Apache Spark
Josef Adersberger
 
Transform & Analyze Time Series Data via Apache Spark @Windward
Transform & Analyze Time Series Data via Apache Spark @WindwardTransform & Analyze Time Series Data via Apache Spark @Windward
Transform & Analyze Time Series Data via Apache Spark @Windward
Demi Ben-Ari
 
Building Scalable IoT Apps (QCon S-F)
Building Scalable IoT Apps (QCon S-F)Building Scalable IoT Apps (QCon S-F)
Building Scalable IoT Apps (QCon S-F)
Pavel Hardak
 
CQL performance with Apache Cassandra 3.0 (Aaron Morton, The Last Pickle) | C...
CQL performance with Apache Cassandra 3.0 (Aaron Morton, The Last Pickle) | C...CQL performance with Apache Cassandra 3.0 (Aaron Morton, The Last Pickle) | C...
CQL performance with Apache Cassandra 3.0 (Aaron Morton, The Last Pickle) | C...
DataStax
 
Data Modeling with Cassandra and Time Series Data
Data Modeling with Cassandra and Time Series DataData Modeling with Cassandra and Time Series Data
Data Modeling with Cassandra and Time Series Data
Dani Traphagen
 
Spark Summit EU talk by Miha Pelko and Til Piffl
Spark Summit EU talk by Miha Pelko and Til PifflSpark Summit EU talk by Miha Pelko and Til Piffl
Spark Summit EU talk by Miha Pelko and Til Piffl
Spark Summit
 
Time Series Analysis with Spark
Time Series Analysis with SparkTime Series Analysis with Spark
Time Series Analysis with Spark
Sandy Ryza
 
Cassandra 3.0
Cassandra 3.0Cassandra 3.0
Cassandra 3.0
Robert Stupp
 
Real-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache SparkReal-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache Spark
Guido Schmutz
 
Time Series Processing with Apache Spark
Time Series Processing with Apache SparkTime Series Processing with Apache Spark
Time Series Processing with Apache Spark
QAware GmbH
 
Cassandra 3 new features 2016
Cassandra 3 new features 2016Cassandra 3 new features 2016
Cassandra 3 new features 2016
Duyhai Doan
 
Why Scala Is Taking Over the Big Data World
Why Scala Is Taking Over the Big Data WorldWhy Scala Is Taking Over the Big Data World
Why Scala Is Taking Over the Big Data World
Dean Wampler
 
An excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXAn excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphX
Krishna Sankar
 
Spark & Zeppelin을 활용한 머신러닝 실전 적용기
Spark & Zeppelin을 활용한 머신러닝 실전 적용기Spark & Zeppelin을 활용한 머신러닝 실전 적용기
Spark & Zeppelin을 활용한 머신러닝 실전 적용기
Taejun Kim
 
Time Series Analysis with Spark by Sandy Ryza
Time Series Analysis with Spark by Sandy RyzaTime Series Analysis with Spark by Sandy Ryza
Time Series Analysis with Spark by Sandy Ryza
Spark Summit
 
[D2 COMMUNITY] Spark User Group - 머신러닝 인공지능 기법
[D2 COMMUNITY] Spark User Group - 머신러닝 인공지능 기법[D2 COMMUNITY] Spark User Group - 머신러닝 인공지능 기법
[D2 COMMUNITY] Spark User Group - 머신러닝 인공지능 기법
NAVER D2
 
Apache Spark 입문에서 머신러닝까지
Apache Spark 입문에서 머신러닝까지Apache Spark 입문에서 머신러닝까지
Apache Spark 입문에서 머신러닝까지
Donam Kim
 
Analyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraAnalyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and Cassandra
Patrick McFadin
 
Spark overview 이상훈(SK C&C)_스파크 사용자 모임_20141106
Spark overview 이상훈(SK C&C)_스파크 사용자 모임_20141106Spark overview 이상훈(SK C&C)_스파크 사용자 모임_20141106
Spark overview 이상훈(SK C&C)_스파크 사용자 모임_20141106
SangHoon Lee
 

Viewers also liked (20)

Apache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series dataApache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series data
 
Time Series Processing with Apache Spark
Time Series Processing with Apache SparkTime Series Processing with Apache Spark
Time Series Processing with Apache Spark
 
Transform & Analyze Time Series Data via Apache Spark @Windward
Transform & Analyze Time Series Data via Apache Spark @WindwardTransform & Analyze Time Series Data via Apache Spark @Windward
Transform & Analyze Time Series Data via Apache Spark @Windward
 
Building Scalable IoT Apps (QCon S-F)
Building Scalable IoT Apps (QCon S-F)Building Scalable IoT Apps (QCon S-F)
Building Scalable IoT Apps (QCon S-F)
 
CQL performance with Apache Cassandra 3.0 (Aaron Morton, The Last Pickle) | C...
CQL performance with Apache Cassandra 3.0 (Aaron Morton, The Last Pickle) | C...CQL performance with Apache Cassandra 3.0 (Aaron Morton, The Last Pickle) | C...
CQL performance with Apache Cassandra 3.0 (Aaron Morton, The Last Pickle) | C...
 
Data Modeling with Cassandra and Time Series Data
Data Modeling with Cassandra and Time Series DataData Modeling with Cassandra and Time Series Data
Data Modeling with Cassandra and Time Series Data
 
Spark Summit EU talk by Miha Pelko and Til Piffl
Spark Summit EU talk by Miha Pelko and Til PifflSpark Summit EU talk by Miha Pelko and Til Piffl
Spark Summit EU talk by Miha Pelko and Til Piffl
 
Time Series Analysis with Spark
Time Series Analysis with SparkTime Series Analysis with Spark
Time Series Analysis with Spark
 
Cassandra 3.0
Cassandra 3.0Cassandra 3.0
Cassandra 3.0
 
Real-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache SparkReal-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache Spark
 
Time Series Processing with Apache Spark
Time Series Processing with Apache SparkTime Series Processing with Apache Spark
Time Series Processing with Apache Spark
 
Cassandra 3 new features 2016
Cassandra 3 new features 2016Cassandra 3 new features 2016
Cassandra 3 new features 2016
 
Why Scala Is Taking Over the Big Data World
Why Scala Is Taking Over the Big Data WorldWhy Scala Is Taking Over the Big Data World
Why Scala Is Taking Over the Big Data World
 
An excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXAn excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphX
 
Spark & Zeppelin을 활용한 머신러닝 실전 적용기
Spark & Zeppelin을 활용한 머신러닝 실전 적용기Spark & Zeppelin을 활용한 머신러닝 실전 적용기
Spark & Zeppelin을 활용한 머신러닝 실전 적용기
 
Time Series Analysis with Spark by Sandy Ryza
Time Series Analysis with Spark by Sandy RyzaTime Series Analysis with Spark by Sandy Ryza
Time Series Analysis with Spark by Sandy Ryza
 
[D2 COMMUNITY] Spark User Group - 머신러닝 인공지능 기법
[D2 COMMUNITY] Spark User Group - 머신러닝 인공지능 기법[D2 COMMUNITY] Spark User Group - 머신러닝 인공지능 기법
[D2 COMMUNITY] Spark User Group - 머신러닝 인공지능 기법
 
Apache Spark 입문에서 머신러닝까지
Apache Spark 입문에서 머신러닝까지Apache Spark 입문에서 머신러닝까지
Apache Spark 입문에서 머신러닝까지
 
Analyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraAnalyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and Cassandra
 
Spark overview 이상훈(SK C&C)_스파크 사용자 모임_20141106
Spark overview 이상훈(SK C&C)_스파크 사용자 모임_20141106Spark overview 이상훈(SK C&C)_스파크 사용자 모임_20141106
Spark overview 이상훈(SK C&C)_스파크 사용자 모임_20141106
 

Similar to Lightning fast analytics with Spark and Cassandra

Lightning Fast Analytics with Cassandra and Spark
Lightning Fast Analytics with Cassandra and SparkLightning Fast Analytics with Cassandra and Spark
Lightning Fast Analytics with Cassandra and Spark
Tim Vincent
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In Production
Lightbend
 
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Summit
 
5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment
Jim Hatcher
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
Dean Chen
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
jeykottalam
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to Streaming
Databricks
 
Spark Programming
Spark ProgrammingSpark Programming
Spark Programming
Taewook Eom
 
Apache Spark for Library Developers with Erik Erlandson and William Benton
Apache Spark for Library Developers with Erik Erlandson and William BentonApache Spark for Library Developers with Erik Erlandson and William Benton
Apache Spark for Library Developers with Erik Erlandson and William Benton
Databricks
 
Escape from Hadoop
Escape from HadoopEscape from Hadoop
Escape from Hadoop
DataStax Academy
 
An Introduct to Spark - Atlanta Spark Meetup
An Introduct to Spark - Atlanta Spark MeetupAn Introduct to Spark - Atlanta Spark Meetup
An Introduct to Spark - Atlanta Spark Meetup
jlacefie
 
An Introduction to Spark
An Introduction to SparkAn Introduction to Spark
An Introduction to Sparkjlacefie
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Anastasios Skarlatidis
 
3 Dundee-Spark Overview for C* developers
3 Dundee-Spark Overview for C* developers3 Dundee-Spark Overview for C* developers
3 Dundee-Spark Overview for C* developers
Christopher Batey
 
Spark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted Malaska
Spark Summit
 
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Cassandra and Spark, closing the gap between no sql and analytics   codemotio...Cassandra and Spark, closing the gap between no sql and analytics   codemotio...
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Duyhai Doan
 
Apache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingApache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster Computing
Gerger
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Аліна Шепшелей
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
Inhacking
 
Declarative & workflow based infrastructure with Terraform
Declarative & workflow based infrastructure with TerraformDeclarative & workflow based infrastructure with Terraform
Declarative & workflow based infrastructure with Terraform
Radek Simko
 

Similar to Lightning fast analytics with Spark and Cassandra (20)

Lightning Fast Analytics with Cassandra and Spark
Lightning Fast Analytics with Cassandra and SparkLightning Fast Analytics with Cassandra and Spark
Lightning Fast Analytics with Cassandra and Spark
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In Production
 
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard Maas
 
5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to Streaming
 
Spark Programming
Spark ProgrammingSpark Programming
Spark Programming
 
Apache Spark for Library Developers with Erik Erlandson and William Benton
Apache Spark for Library Developers with Erik Erlandson and William BentonApache Spark for Library Developers with Erik Erlandson and William Benton
Apache Spark for Library Developers with Erik Erlandson and William Benton
 
Escape from Hadoop
Escape from HadoopEscape from Hadoop
Escape from Hadoop
 
An Introduct to Spark - Atlanta Spark Meetup
An Introduct to Spark - Atlanta Spark MeetupAn Introduct to Spark - Atlanta Spark Meetup
An Introduct to Spark - Atlanta Spark Meetup
 
An Introduction to Spark
An Introduction to SparkAn Introduction to Spark
An Introduction to Spark
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
3 Dundee-Spark Overview for C* developers
3 Dundee-Spark Overview for C* developers3 Dundee-Spark Overview for C* developers
3 Dundee-Spark Overview for C* developers
 
Spark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted Malaska
 
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Cassandra and Spark, closing the gap between no sql and analytics   codemotio...Cassandra and Spark, closing the gap between no sql and analytics   codemotio...
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
 
Apache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingApache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster Computing
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
 
Declarative & workflow based infrastructure with Terraform
Declarative & workflow based infrastructure with TerraformDeclarative & workflow based infrastructure with Terraform
Declarative & workflow based infrastructure with Terraform
 

 

 

 

 

Lightning fast analytics with Spark and Cassandra

  • 1. Lightning-fast analytics with Spark and Cassandra
      Nick Bailey, @nickmbailey
      ©2014 DataStax Confidential. Do not distribute without consent.
  • 2. What is Spark?
      * Open source since 2010 (Apache project since 2013)
      * Fast
          * 10x-100x faster than Hadoop MapReduce
          * In-memory storage
          * Single JVM process per node
      * Easy
          * Rich Scala, Java and Python APIs
          * 2x-5x less code
          * Interactive shell
      [Diagram labels: Analytic, Analytic, Search]
  • 4. API
      map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin,
      reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip,
      sample, take, first, partitionBy, mapWith, pipe, save, ...
  • 5. API
      * Resilient Distributed Datasets (RDD)
          * Collections of objects spread across a cluster, stored in RAM or on disk
          * Built through parallel transformations
          * Automatically rebuilt on failure
      * Operations
          * Transformations (e.g. map, filter, groupBy)
          * Actions (e.g. count, collect, save)
  • 6. Operator Graph: Optimization and Fault Tolerance
      [Diagram: RDDs A to F; operators map, filter, groupBy, join; Stages 1 to 3; legend marks RDDs and cached partitions]
  • 7. Fast
      * Logistic Regression Performance
      [Chart: running time (s) against number of iterations (1, 5, 10, 20, 30)]
      * Hadoop: 110 sec / iteration
      * Spark: first iteration 80 sec, further iterations 1 sec
  • 8. Why Spark on Cassandra?
      * Data model independent queries
      * Cross-table operations (JOIN, UNION, etc.)
      * Complex analytics (e.g. machine learning)
      * Data transformation, aggregation, etc.
      * Stream processing
  • 9. How to Spark on Cassandra?
      * DataStax Cassandra Spark driver
      * Open source: https://github.com/datastax/spark-cassandra-connector
      * Compatible with
          * Spark 0.9+
          * Cassandra 2.0+
          * DataStax Enterprise 4.5+
  • 10. Cassandra Spark Driver
      * Cassandra tables exposed as Spark RDDs
      * Read from and write to Cassandra
      * Mapping of C* tables and rows to Scala objects
      * All Cassandra types supported and converted to Scala types
      * Server side data selection
      * Spark Streaming support
      * Scala and Java support
  • 11. Connecting to Cassandra
        // Core Spark classes used below
        import org.apache.spark.{SparkConf, SparkContext}
        // Import Cassandra-specific functions on SparkContext and RDD objects
        import com.datastax.driver.spark._

        // Spark connection options
        val conf = new SparkConf(true)
          .setMaster("spark://192.168.123.10:7077")
          .setAppName("cassandra-demo")
          .set("cassandra.connection.host", "192.168.123.10") // initial contact
          .set("cassandra.username", "cassandra")
          .set("cassandra.password", "cassandra")

        val sc = new SparkContext(conf)
  • 12. Accessing Data
        CREATE TABLE test.words (word text PRIMARY KEY, count int);

        INSERT INTO test.words (word, count) VALUES ('bar', 30);
        INSERT INTO test.words (word, count) VALUES ('foo', 20);

      * Accessing the table above as an RDD:

        // Use table as RDD
        val rdd = sc.cassandraTable("test", "words")
        // rdd: CassandraRDD[CassandraRow] = CassandraRDD[0]

        rdd.toArray.foreach(println)
        // CassandraRow[word: bar, count: 30]
        // CassandraRow[word: foo, count: 20]

        rdd.columnNames // Stream(word, count)
        rdd.size        // 2

        val firstRow = rdd.first // firstRow: CassandraRow = CassandraRow[word: bar, count: 30]
        firstRow.getInt("count") // Int = 30
  • 13. Saving Data
        val newRdd = sc.parallelize(Seq(("cat", 40), ("fox", 50)))
        // newRdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[2]

        newRdd.saveToCassandra("test", "words", Seq("word", "count"))

      * The RDD above saved to Cassandra:

        SELECT * FROM test.words;

         word | count
        ------+-------
          bar |    30
          foo |    20
          cat |    40
          fox |    50

        (4 rows)
  • 14. Type Mapping
        CQL Type         Scala Type
        ---------------  ---------------------------------------------------------------
        ascii            String
        bigint           Long
        boolean          Boolean
        counter          Long
        decimal          BigDecimal, java.math.BigDecimal
        double           Double
        float            Float
        inet             java.net.InetAddress
        int              Int
        list             Vector, List, Iterable, Seq, IndexedSeq, java.util.List
        map              Map, TreeMap, java.util.HashMap
        set              Set, TreeSet, java.util.HashSet
        text, varchar    String
        timestamp        Long, java.util.Date, java.sql.Date, org.joda.time.DateTime
        timeuuid         java.util.UUID
        uuid             java.util.UUID
        varint           BigInt, java.math.BigInteger
        nullable values  Option
  • 15. Mapping Rows to Objects
      * Mapping rows to Scala case classes
      * CQL underscore-case columns are mapped to Scala camel-case properties
      * Custom mapping functions (see docs)

        CREATE TABLE test.cars (
          id text PRIMARY KEY,
          model text,
          fuel_type text,
          year int
        );

        case class Vehicle(
          id: String,
          model: String,
          fuelType: String,
          year: Int
        )

        sc.cassandraTable[Vehicle]("test", "cars").toArray
        // Array(Vehicle(KF334L, Ford Mondeo, Petrol, 2009),
        //       Vehicle(MT8787, Hyundai x35, Diesel, 2011))
  • 16. Server Side Data Selection
      * Reduce the amount of data transferred
      * Selecting columns

        sc.cassandraTable("test", "users").select("username").toArray.foreach(println)
        // CassandraRow{username: john}
        // CassandraRow{username: tom}

      * Selecting rows (by clustering columns and/or secondary indexes)

        sc.cassandraTable("test", "cars").select("model").where("color = ?", "black").toArray.foreach(println)
        // CassandraRow{model: Ford Mondeo}
        // CassandraRow{model: Hyundai x35}
  • 17. Spark SQL
      [Diagram labels: Compatible; Spark SQL, Streaming, ML, Graph; Spark (General execution engine); Cassandra]
  • 18. Spark SQL
      * SQL query engine on top of Spark
      * Hive compatible (JDBC, UDFs, types, metadata, etc.)
      * Support for in-memory processing
      * Pushdown of predicates to Cassandra when possible
  • 19. Spark SQL Example
        import com.datastax.spark.connector._

        // Connect to the Spark cluster
        val conf = new SparkConf(true)...
        val sc = new SparkContext(conf)

        // Create Cassandra SQL context
        val cc = new CassandraSQLContext(sc)

        // Execute SQL query
        val rdd = cc.sql("SELECT * FROM keyspace.table WHERE ...")
  • 20. Analytics Workload Isolation
      [Diagram labels: Cassandra + Spark DC; Cassandra Only DC; Online App; Analytical App; Mixed Load Cassandra Cluster]
  • 21. Analytics High Availability
      * Spark Workers run on all Cassandra nodes
      * Workers are resilient by default
      * First Spark node promoted as Spark Master
      * Standby Master promoted on failure
      * Master HA available in DataStax Enterprise
      [Diagram labels: Spark Master, Spark Standby Master, Spark Worker]
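
Slide 15 says that CQL underscore-case column names are mapped to Scala camel-case properties (for example `fuel_type` to `fuelType`). The naming convention itself can be sketched in plain Scala, with no Spark or Cassandra dependency. This is only an illustration of the rule, not the connector's actual implementation: `ColumnNames` and `toCamelCase` are hypothetical names introduced here, and the handling of edge cases such as leading underscores is an assumption.

```scala
// Hypothetical helper illustrating the naming rule from slide 15.
// It is NOT part of the spark-cassandra-connector API.
object ColumnNames {
  /** Convert an underscore_case CQL column name to a camelCase property name. */
  def toCamelCase(column: String): String = {
    // Split on underscores and drop empty fragments (e.g. from "a__b").
    val parts = column.split("_").filter(_.nonEmpty)
    if (parts.isEmpty) column // nothing to convert; return the input unchanged
    else parts.head + parts.tail.map(_.capitalize).mkString
  }
}

object ColumnNamesDemo {
  def main(args: Array[String]): Unit = {
    // Columns of the test.cars table from slide 15.
    val columns = Seq("id", "model", "fuel_type", "year")
    columns.foreach { cql =>
      val prop = ColumnNames.toCamelCase(cql)
      println(s"$cql -> $prop")
    }
  }
}
```

Under this rule the four columns of `test.cars` line up with the four fields of the `Vehicle` case class, which is why the slide's example needs no explicit mapping code.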