SlideShare a Scribd company logo
1 of 22
©2014 DataStax Confidential. Do not distribute without consent.
timothy.vincent@datastax.com
Tim Vincent
Solution Engineer
Lightning-fast analytics with Spark
for Cassandra and DataStax Enterprise
1
What is Spark?
* Apache Project since 2010
* Fast
* 10x-100x faster than Hadoop MapReduce
* In-memory storage
* Single JVM process per node
* Easy
* Rich Scala, Java and Python APIs
* 2x-5x less code
* Interactive shell
Analytic
Analytic
Search
API
map reduce
API
map
filter
groupBy
sort
union
join
leftOuterJoin
rightOuterJoin
reduce
count
fold
reduceByKey
groupByKey
cogroup
cross
zip
sample
take
first
partitionBy
mapWith
pipe
save
...
API
* Resilient Distributed Datasets (RDDs)
* Collections of objects spread across a cluster
* Stored in RAM or on Disk
* Built through parallel transformations
* Automatically rebuilt on failure
* Operations
* Transformations (e.g. map, filter, groupBy)
* Actions (e.g. count, collect, save)
Operator Graph: Optimization and Fault Tolerance
join
filter
groupBy
Stage 3
Stage 1
Stage 2
A: B:
C: D: E:
F:
map
= Cached partition= RDD
Fast
0
500
1000
1500
2000
2500
3000
3500
4000
1 5 10 20 30
RunningTime(s)
Number of Iterations
Hadoop Spark
110 sec / iteration
first iteration 80 sec
further iterations 1 sec
* Logistic Regression Performance
Why Spark on Cassandra?
* Data model independent queries
* Cross-table operations (JOIN, UNION, etc.)
* Complex analytics (e.g. machine learning)
* Data transformation, aggregation, etc.
* Stream processing (coming soon)
* Near real time
How to Spark on Cassandra?
* DataStax Cassandra Spark driver
* Open source: https://github.com/datastax/cassandra-driver-spark
* Compatible with
* Spark 0.9+
* Cassandra 2.0+
* DataStax Enterprise 4.5+
Analytics Workload Isolation
Cassandra
+ Spark DC
Cassandra
Only DC
Online
App
Analytical
App
Mixed Load Cassandra Cluster
Analytics High Availability
* All nodes are Spark Workers
* By default resilient to Worker failures
* First Spark node promoted as Spark Master
* Standby Master promoted on failure
* Master HA available in DataStax Enterprise
Spark Master
Spark Standby Master
Spark Worker
Cassandra Spark Driver
* Cassandra tables exposed as Spark RDDs
* Read from and write to Cassandra
* Mapping of C* tables and rows to Scala objects
* All Cassandra types supported and converted to Scala types
* Server side data selection
* Virtual Nodes support
* Scala only driver for now
Connecting to Cassandra
// Import Cassandra-specific functions on SparkContext and RDD objects
import com.datastax.driver.spark._
// Spark connection options
val conf = new SparkConf(true)
.setMaster("spark://192.168.123.10:7077")
.setAppName("cassandra-demo")
.set("cassandra.connection.host", "192.168.123.10") // initial contact
.set("cassandra.username", "cassandra")
.set("cassandra.password", "cassandra")
val sc = new SparkContext(conf)
Accessing Data
CREATE TABLE test.words (word text PRIMARY KEY, count int);
INSERT INTO test.words (word, count) VALUES ('bar', 30);
INSERT INTO test.words (word, count) VALUES ('foo', 20);
// Use table as RDD
val rdd = sc.cassandraTable("test", "words")
// rdd: CassandraRDD[CassandraRow] = CassandraRDD[0]
rdd.toArray.foreach(println)
// CassandraRow[word: bar, count: 30]
// CassandraRow[word: foo, count: 20]
rdd.columnNames // Stream(word, count)
rdd.size // 2
val firstRow = rdd.first // firstRow: CassandraRow = CassandraRow[word: bar, count: 30]
firstRow.getInt("count") // Int = 30
* Accessing table above as RDD:
Saving Data
val newRdd = sc.parallelize(Seq(("cat", 40), ("fox", 50)))
// newRdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[2]
newRdd.saveToCassandra("test", "words", Seq("word", "count"))
SELECT * FROM test.words;
word | count
------+-------
bar | 30
foo | 20
cat | 40
fox | 50
(4 rows)
* RDD above saved to Cassandra:
Type Mapping
CQL Type Scala Type
ascii String
bigint Long
boolean Boolean
counter Long
decimal BigDecimal, java.math.BigDecimal
double Double
float Float
inet java.net.InetAddress
int Int
list Vector, List, Iterable, Seq, IndexedSeq, java.util.List
map Map, TreeMap, java.util.HashMap
set Set, TreeSet, java.util.HashSet
text, varchar String
timestamp Long, java.util.Date, java.sql.Date, org.joda.time.DateTime
timeuuid java.util.UUID
uuid java.util.UUID
varint BigInt, java.math.BigInteger
*nullable values Option
Mapping Rows to Objects
CREATE TABLE test.cars (
id text PRIMARY KEY,
model text,
fuel_type text,
year int
);
case class Vehicle(
id: String,
model: String,
fuelType: String,
year: Int
)
sc.cassandraTable[Vehicle]("test", "cars").toArray
//Array(Vehicle(KF334L, Ford Mondeo, Petrol, 2009),
// Vehicle(MT8787, Hyundai x35, Diesel, 2011)

* Mapping rows to Scala Case Classes
* CQL underscore case column mapped to Scala camel case property
* Custom mapping functions (see docs)
Server Side Data Selection
* Reduce the amount of data transferred
* Selecting columns
* Selecting rows (by clustering columns and/or secondary indexes)
sc.cassandraTable("test", "users").select("username").toArray.foreach(println)
// CassandraRow{username: john}
// CassandraRow{username: tom}
sc.cassandraTable("test", "cars").select("model").where("color = ?", "black").toArray.foreach(println)
// CassandraRow{model: Ford Mondeo}
// CassandraRow{model: Hyundai x35}
Shark
* SQL query engine on top of Spark
* Not part of Apache Spark
* Hive compatible (JDBC, UDFs, types, metadata, etc.)
* Supports in-memory tables
* Available as a part of DataStax Enterprise
Shark In-memory Tables
CREATE TABLE CachedStocks TBLPROPERTIES ("shark.cache" = "true")
AS SELECT * from PortfolioDemo.Stocks WHERE value > 95.0;
OK
Time taken: 1.215 seconds
SELECT * FROM CachedStocks;
OK
MQT price 97.9270442241818
SII price 99.69238346610474
.
. (123 additional prices)
.
PBG price 96.09162963505352
Time taken: 0.569 seconds
Spark SQL vs Shark
Shark
or
Spark SQL
Streaming ML
Spark (General execution engine)
Graph
Cassandra
Compatible
Questions?

More Related Content

What's hot

Hadoop+Cassandra_Integration
Hadoop+Cassandra_IntegrationHadoop+Cassandra_Integration
Hadoop+Cassandra_Integration
Joyabrata Das
 

What's hot (20)

Spark + Cassandra = Real Time Analytics on Operational Data
Spark + Cassandra = Real Time Analytics on Operational DataSpark + Cassandra = Real Time Analytics on Operational Data
Spark + Cassandra = Real Time Analytics on Operational Data
 
Cassandra spark connector
Cassandra spark connectorCassandra spark connector
Cassandra spark connector
 
Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0
 
Updates from Cassandra Summit 2016 & SASI Indexes
Updates from Cassandra Summit 2016 & SASI IndexesUpdates from Cassandra Summit 2016 & SASI Indexes
Updates from Cassandra Summit 2016 & SASI Indexes
 
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
 
Cassandra + Spark + Elk
Cassandra + Spark + ElkCassandra + Spark + Elk
Cassandra + Spark + Elk
 
PySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark MeetupPySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark Meetup
 
Intro to py spark (and cassandra)
Intro to py spark (and cassandra)Intro to py spark (and cassandra)
Intro to py spark (and cassandra)
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL Workshop
 
Cassandra and Spark: Optimizing for Data Locality
Cassandra and Spark: Optimizing for Data LocalityCassandra and Spark: Optimizing for Data Locality
Cassandra and Spark: Optimizing for Data Locality
 
Munich March 2015 - Cassandra + Spark Overview
Munich March 2015 -  Cassandra + Spark OverviewMunich March 2015 -  Cassandra + Spark Overview
Munich March 2015 - Cassandra + Spark Overview
 
Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and Future
 
Hadoop+Cassandra_Integration
Hadoop+Cassandra_IntegrationHadoop+Cassandra_Integration
Hadoop+Cassandra_Integration
 
Spark Cassandra Connector Dataframes
Spark Cassandra Connector DataframesSpark Cassandra Connector Dataframes
Spark Cassandra Connector Dataframes
 
The Automation Factory
The Automation FactoryThe Automation Factory
The Automation Factory
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
 
5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment
 
Zero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and CassandraZero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and Cassandra
 
Real time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesosReal time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesos
 
Storlets fb session_16_9
Storlets fb session_16_9Storlets fb session_16_9
Storlets fb session_16_9
 

Viewers also liked

The Network: A Data Structure that Links Domains
The Network: A Data Structure that Links DomainsThe Network: A Data Structure that Links Domains
The Network: A Data Structure that Links Domains
Marko Rodriguez
 
An Evidential Logic for Multi-Relational Networks
An Evidential Logic for Multi-Relational NetworksAn Evidential Logic for Multi-Relational Networks
An Evidential Logic for Multi-Relational Networks
Marko Rodriguez
 

Viewers also liked (9)

The Pathology of Graph Databases
The Pathology of Graph DatabasesThe Pathology of Graph Databases
The Pathology of Graph Databases
 
The Network: A Data Structure that Links Domains
The Network: A Data Structure that Links DomainsThe Network: A Data Structure that Links Domains
The Network: A Data Structure that Links Domains
 
An Evidential Logic for Multi-Relational Networks
An Evidential Logic for Multi-Relational NetworksAn Evidential Logic for Multi-Relational Networks
An Evidential Logic for Multi-Relational Networks
 
From the Signal to the Symbol: Structure and Process in Artificial Intelligence
From the Signal to the Symbol: Structure and Process in Artificial IntelligenceFrom the Signal to the Symbol: Structure and Process in Artificial Intelligence
From the Signal to the Symbol: Structure and Process in Artificial Intelligence
 
Business Case Calculator for DevOps Initiatives - Leading credit card service...
Business Case Calculator for DevOps Initiatives - Leading credit card service...Business Case Calculator for DevOps Initiatives - Leading credit card service...
Business Case Calculator for DevOps Initiatives - Leading credit card service...
 
An Overview of Data Management Paradigms: Relational, Document, and Graph
An Overview of Data Management Paradigms: Relational, Document, and GraphAn Overview of Data Management Paradigms: Relational, Document, and Graph
An Overview of Data Management Paradigms: Relational, Document, and Graph
 
DataStax Careers
DataStax CareersDataStax Careers
DataStax Careers
 
Titan: The Rise of Big Graph Data
Titan: The Rise of Big Graph DataTitan: The Rise of Big Graph Data
Titan: The Rise of Big Graph Data
 
Digital Banking Strategy Roadmap - 3.24.15
Digital Banking Strategy Roadmap - 3.24.15Digital Banking Strategy Roadmap - 3.24.15
Digital Banking Strategy Roadmap - 3.24.15
 

Similar to Lightning Fast Analytics with Cassandra and Spark

An Introduction to Spark
An Introduction to SparkAn Introduction to Spark
An Introduction to Spark
jlacefie
 

Similar to Lightning Fast Analytics with Cassandra and Spark (20)

A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In Production
 
Using spark 1.2 with Java 8 and Cassandra
Using spark 1.2 with Java 8 and CassandraUsing spark 1.2 with Java 8 and Cassandra
Using spark 1.2 with Java 8 and Cassandra
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to Streaming
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
 
Apache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingApache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster Computing
 
Spark Cassandra Connector: Past, Present and Furure
Spark Cassandra Connector: Past, Present and FurureSpark Cassandra Connector: Past, Present and Furure
Spark Cassandra Connector: Past, Present and Furure
 
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard Maas
 
Dancing with the Elephant
Dancing with the ElephantDancing with the Elephant
Dancing with the Elephant
 
Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C*...
Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C*...Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C*...
Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C*...
 
Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra
 
Apache cassandra and spark. you got the the lighter, let's start the fire
Apache cassandra and spark. you got the the lighter, let's start the fireApache cassandra and spark. you got the the lighter, let's start the fire
Apache cassandra and spark. you got the the lighter, let's start the fire
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
An Introduction to Spark
An Introduction to SparkAn Introduction to Spark
An Introduction to Spark
 
An Introduct to Spark - Atlanta Spark Meetup
An Introduct to Spark - Atlanta Spark MeetupAn Introduct to Spark - Atlanta Spark Meetup
An Introduct to Spark - Atlanta Spark Meetup
 
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials DayAnalytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
 
No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in production
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Escape from Hadoop
Escape from HadoopEscape from Hadoop
Escape from Hadoop
 

Recently uploaded

AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
VictorSzoltysek
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
VishalKumarJha10
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
mohitmore19
 

Recently uploaded (20)

Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdfAzure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verifiedSector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
 
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdfThe Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 

Lightning Fast Analytics with Cassandra and Spark

  • 1. ©2014 DataStax Confidential. Do not distribute without consent. timothy.vincent@datastax.com Tim Vincent Solution Engineer Lightning-fast analytics with Spark for Cassandra and DataStax Enterprise 1
  • 2. What is Spark? * Apache Project since 2010 * Fast * 10x-100x faster than Hadoop MapReduce * In-memory storage * Single JVM process per node * Easy * Rich Scala, Java and Python APIs * 2x-5x less code * Interactive shell Analytic Analytic Search
  • 5. API * Resilient Distributed Datasets (RDDs) * Collections of objects spread across a cluster * Stored in RAM or on Disk * Built through parallel transformations * Automatically rebuilt on failure * Operations * Transformations (e.g. map, filter, groupBy) * Actions (e.g. count, collect, save)
  • 6. Operator Graph: Optimization and Fault Tolerance join filter groupBy Stage 3 Stage 1 Stage 2 A: B: C: D: E: F: map = Cached partition= RDD
  • 7. Fast 0 500 1000 1500 2000 2500 3000 3500 4000 1 5 10 20 30 RunningTime(s) Number of Iterations Hadoop Spark 110 sec / iteration first iteration 80 sec further iterations 1 sec * Logistic Regression Performance
  • 8. Why Spark on Cassandra? * Data model independent queries * Cross-table operations (JOIN, UNION, etc.) * Complex analytics (e.g. machine learning) * Data transformation, aggregation, etc. * Stream processing (coming soon) * Near real time
  • 9. How to Spark on Cassandra? * DataStax Cassandra Spark driver * Open source: https://github.com/datastax/cassandra-driver-spark * Compatible with * Spark 0.9+ * Cassandra 2.0+ * DataStax Enterprise 4.5+
  • 10. Analytics Workload Isolation Cassandra + Spark DC Cassandra Only DC Online App Analytical App Mixed Load Cassandra Cluster
  • 11. Analytics High Availability * All nodes are Spark Workers * By default resilient to Worker failures * First Spark node promoted as Spark Master * Standby Master promoted on failure * Master HA available in DataStax Enterprise Spark Master Spark Standby Master Spark Worker
  • 12. Cassandra Spark Driver * Cassandra tables exposed as Spark RDDs * Read from and write to Cassandra * Mapping of C* tables and rows to Scala objects * All Cassandra types supported and converted to Scala types * Server side data selection * Virtual Nodes support * Scala only driver for now
  • 13. Connecting to Cassandra // Import Cassandra-specific functions on SparkContext and RDD objects import com.datastax.driver.spark._ // Spark connection options val conf = new SparkConf(true) .setMaster("spark://192.168.123.10:7077") .setAppName("cassandra-demo") .set("cassandra.connection.host", "192.168.123.10") // initial contact .set("cassandra.username", "cassandra") .set("cassandra.password", "cassandra") val sc = new SparkContext(conf)
  • 14. Accessing Data CREATE TABLE test.words (word text PRIMARY KEY, count int); INSERT INTO test.words (word, count) VALUES ('bar', 30); INSERT INTO test.words (word, count) VALUES ('foo', 20); // Use table as RDD val rdd = sc.cassandraTable("test", "words") // rdd: CassandraRDD[CassandraRow] = CassandraRDD[0] rdd.toArray.foreach(println) // CassandraRow[word: bar, count: 30] // CassandraRow[word: foo, count: 20] rdd.columnNames // Stream(word, count) rdd.size // 2 val firstRow = rdd.first // firstRow: CassandraRow = CassandraRow[word: bar, count: 30] firstRow.getInt("count") // Int = 30 * Accessing table above as RDD:
  • 15. Saving Data val newRdd = sc.parallelize(Seq(("cat", 40), ("fox", 50))) // newRdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[2] newRdd.saveToCassandra("test", "words", Seq("word", "count")) SELECT * FROM test.words; word | count ------+------- bar | 30 foo | 20 cat | 40 fox | 50 (4 rows) * RDD above saved to Cassandra:
  • 16. Type Mapping CQL Type Scala Type ascii String bigint Long boolean Boolean counter Long decimal BigDecimal, java.math.BigDecimal double Double float Float inet java.net.InetAddress int Int list Vector, List, Iterable, Seq, IndexedSeq, java.util.List map Map, TreeMap, java.util.HashMap set Set, TreeSet, java.util.HashSet text, varchar String timestamp Long, java.util.Date, java.sql.Date, org.joda.time.DateTime timeuuid java.util.UUID uuid java.util.UUID varint BigInt, java.math.BigInteger *nullable values Option
  • 17. Mapping Rows to Objects CREATE TABLE test.cars ( id text PRIMARY KEY, model text, fuel_type text, year int ); case class Vehicle( id: String, model: String, fuelType: String, year: Int ) sc.cassandraTable[Vehicle]("test", "cars").toArray //Array(Vehicle(KF334L, Ford Mondeo, Petrol, 2009), // Vehicle(MT8787, Hyundai x35, Diesel, 2011)  * Mapping rows to Scala Case Classes * CQL underscore case column mapped to Scala camel case property * Custom mapping functions (see docs)
  • 18. Server Side Data Selection * Reduce the amount of data transferred * Selecting columns * Selecting rows (by clustering columns and/or secondary indexes) sc.cassandraTable("test", "users").select("username").toArray.foreach(println) // CassandraRow{username: john} // CassandraRow{username: tom} sc.cassandraTable("test", "cars").select("model").where("color = ?", "black").toArray.foreach(println) // CassandraRow{model: Ford Mondeo} // CassandraRow{model: Hyundai x35}
  • 19. Shark * SQL query engine on top of Spark * Not part of Apache Spark * Hive compatible (JDBC, UDFs, types, metadata, etc.) * Supports in-memory tables * Available as a part of DataStax Enterprise
  • 20. Shark In-memory Tables CREATE TABLE CachedStocks TBLPROPERTIES ("shark.cache" = "true") AS SELECT * from PortfolioDemo.Stocks WHERE value > 95.0; OK Time taken: 1.215 seconds SELECT * FROM CachedStocks; OK MQT price 97.9270442241818 SII price 99.69238346610474 . . (123 additional prices) . PBG price 96.09162963505352 Time taken: 0.569 seconds
  • 21. Spark SQL vs Shark Shark or Spark SQL Streaming ML Spark (General execution engine) Graph Cassandra Compatible

Editor's Notes

  1. Key thing to explain on this slide is that computation for 2nd iteration will not go beyond cached RDDs. So for example when F: requests second iteration it will not hit A: as long as data is in B:. We basically perform operation on A: and keep in B: RDD. Example of how Spark computes job stages. Boxes with solid outlines are RDDs. Partitions are shaded rectangles, in brownif they are already in memory. To run an action on RDD F, we build build stages at wide dependencies and pipeline narrow transformations inside each stage. In this case, stage 1’s output RDD is already in RAM, so we run stage 2 and then 3.