Spark and Cassandra:
2 Fast, 2 Furious
Russell Spitzer
DataStax Inc.
Russell, ostensibly a software engineer
• Did a Ph.D. in bioinformatics at some point
• Wrote a great deal of automation and testing framework code
• Now develops for Datastax on the Analytics Team
• Focuses a lot on the Datastax OSS Spark Cassandra Connector
Datastax Spark Cassandra Connector
Let Spark Interact with your Cassandra Data!
https://github.com/datastax/spark-cassandra-connector
http://spark-packages.org/package/datastax/spark-cassandra-connector
http://spark-packages.org/package/TargetHolding/pyspark-cassandra
Compatible with Spark 1.6 + Cassandra 3.0
• Bulk writing to Cassandra
• Distributed full table scans
• Optimized direct joins with Cassandra
• Secondary index pushdown
• Connection and prepared statement pools
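For orientation, here is a minimal sketch of the connector in use from a Spark context. The keyspace ks and table kv below are hypothetical; any table with a matching schema would do.

import com.datastax.spark.connector._

// Distributed full table scan into an RDD of CassandraRow
val rows = sc.cassandraTable("ks", "kv")

// Bulk write an RDD of tuples back to Cassandra
sc.parallelize(Seq((1, "one"), (2, "two")))
  .saveToCassandra("ks", "kv", SomeColumns("k", "v"))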
Cassandra is essentially a hybrid between a key-value and a column-oriented
(or tabular) database management system. Its data model is a partitioned row
store with tunable consistency*
*https://en.wikipedia.org/wiki/Apache_Cassandra
Let's break that down
1. What is a C* Partition and Row?
2. How does C* Place Partitions?
CQL looks a lot like SQL
CREATE TABLE tracker (
vehicle_id uuid,
ts timestamp,
x double,
y double,
PRIMARY KEY (vehicle_id, ts))
INSERTS look almost Identical
CREATE TABLE tracker (
vehicle_id uuid,
ts timestamp,
x double,
y double,
PRIMARY KEY (vehicle_id, ts))
INSERT INTO tracker (vehicle_id, ts, x, y) Values ( 1, 0, 0, 1)
Cassandra Data is stored in Partitions
CREATE TABLE tracker (
vehicle_id uuid,
ts timestamp,
x double,
y double,
PRIMARY KEY (vehicle_id, ts))
vehicle_id
1
ts: 0
x: 0 y: 1
INSERT INTO tracker (vehicle_id, ts, x, y) Values ( 1, 0, 0, 1)
C* Partitions Store Many Rows
CREATE TABLE tracker (
vehicle_id uuid,
ts timestamp,
x double,
y double,
PRIMARY KEY (vehicle_id, ts))
vehicle_id
1
ts: 0
x: 0 y: 1
ts: 1
x: -1 y:0
ts: 2
x: 0 y: -1
ts: 3
x: 1 y: 0
C* Partitions Store Many Rows
CREATE TABLE tracker (
vehicle_id uuid,
ts timestamp,
x double,
y double,
PRIMARY KEY (vehicle_id, ts))
vehicle_id
1
ts: 0
x: 0 y: 1
ts: 1
x: -1 y:0
ts: 2
x: 0 y: -1
ts: 3
x: 1 y: 0
Partition
C* Partitions Store Many Rows
CREATE TABLE tracker (
vehicle_id uuid,
ts timestamp,
x double,
y double,
PRIMARY KEY (vehicle_id, ts))
vehicle_id
1
ts: 0
x: 0 y: 1
ts: 1
x: -1 y:0
ts: 2
x: 0 y: -1
ts: 3
x: 1 y: 0
Row
Within a partition there is ordering based on the
Clustering Keys
CREATE TABLE tracker (
vehicle_id uuid,
ts timestamp,
x double,
y double,
PRIMARY KEY (vehicle_id, ts))
vehicle_id
1
ts: 0
x: 0 y: 1
ts: 1
x: -1 y:0
ts: 2
x: 0 y: -1
ts: 3
x: 1 y: 0
Ordered by Clustering Key
Slices within a Partition are Very Easy
CREATE TABLE tracker (
vehicle_id uuid,
ts timestamp,
x double,
y double,
PRIMARY KEY (vehicle_id, ts))
vehicle_id
1
ts: 0
x: 0 y: 1
ts: 1
x: -1 y:0
ts: 2
x: 0 y: -1
ts: 3
x: 1 y: 0
Ordered by Clustering Key
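Because rows are kept in clustering-key order, a slice is just a range predicate on ts within one partition. A hypothetical CQL slice over the tracker table above (the values are illustrative and mirror the simplified inserts in these slides):

SELECT * FROM tracker
WHERE vehicle_id = 1
AND ts >= 1 AND ts < 3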
Cassandra is a Distributed Fault-Tolerant Database
San Jose
Oakland
San Francisco
Data is located on a Token Range
San Jose
Oakland
San Francisco
Data is located on a Token Range
0
San Jose
Oakland
San Francisco
1200
600
The Partition Key of a Row is Hashed to Determine
its Token
0
San Jose
Oakland
San Francisco
1200
600
The Partition Key of a Row is Hashed to Determine
its Token
0 1200 600
San Jose
Oakland
San Francisco
vehicle_id
1
INSERT INTO tracker (vehicle_id, ts, x, y) Values ( 1, 0, 0, 1)
The Partition Key of a Row is Hashed to Determine
its Token
0 1200 600
San Jose
Oakland
San Francisco
vehicle_id
1
INSERT INTO tracker (vehicle_id, ts, x, y) Values ( 1, 0, 0, 1)
Hash(1) = 1000
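You can ask Cassandra for the hash it actually assigns with the CQL token function (the real value is a Murmur3 token rather than the illustrative 1000 above):

SELECT token(vehicle_id), vehicle_id, ts FROM tracker;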
How The Spark Cassandra Connector Reads
0
500
How The Spark Cassandra Connector Reads
0
San Jose Oakland San Francisco
500
Spark Partitions are Built From the Token Range
0
San Jose Oakland San Francisco
500
500 - 600
600 - 700
500 1200
Each Partition Can be Drawn Locally from at
Least One Node
0
San Jose Oakland San Francisco
500
500 - 600
600 - 700
500 1200
Each Partition Can be Drawn Locally from at
Least One Node
0
San Jose Oakland San Francisco
500
500 - 600
600 - 700
500 1200
Spark Executor Spark Executor Spark Executor
No Need to Leave the Node For Data!
0
500
Data is Retrieved using the Datastax Java Driver
0
Spark Executor Cassandra
A Connection is Established
0
Spark Executor Cassandra
500 - 600
A Query is Prepared with Token Bounds
0
Spark Executor Cassandra
SELECT * FROM TABLE WHERE
Token(PK) > StartToken AND
Token(PK) < EndToken
500 - 600
The Spark Partition's Bounds are Placed in the Query
0
Spark Executor Cassandra
SELECT * FROM TABLE WHERE
Token(PK) > 500 AND
Token(PK) < 600
500 - 600
Data is Paged a Number of Rows at a Time
0
Spark Executor Cassandra
500 - 600
SELECT * FROM TABLE WHERE
Token(PK) > 500 AND
Token(PK) < 600
Data is Paged
0
Spark Executor Cassandra
500 - 600
SELECT * FROM TABLE WHERE
Token(PK) > 500 AND
Token(PK) < 600
How we write to Cassandra
• Data is written via the Datastax Java Driver
• Data is grouped based on Partition Key
(configurable)
• Batches are written
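As a sketch, a bulk write from an RDD looks like the following; the readings RDD and keyspace ks are hypothetical, and the grouping key and batch sizes mentioned above are governed by connector settings (for example spark.cassandra.output.batch.grouping.key and spark.cassandra.output.batch.size.rows in the connector's reference docs).

import com.datastax.spark.connector._

// Hypothetical RDD of readings shaped like the tracker table
val readings = sc.parallelize(Seq(
  (1, 0, 0.0, 1.0),
  (1, 1, -1.0, 0.0)))

readings.saveToCassandra("ks", "tracker",
  SomeColumns("vehicle_id", "ts", "x", "y"))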
Data is Written using the Datastax Java Driver
0
Spark Executor Cassandra
A Connection is Established
0
Spark Executor Cassandra
Data is Grouped on Key
0
Spark Executor Cassandra
Once a Batch is Full or There are Too Many Batches, the Largest Batch is Executed
0
Spark Executor Cassandra
Most Common Features
• RDD APIs
• cassandraTable
• saveToCassandra
• repartitionByCassandraReplica
• joinWithCassandraTable
• DF API
• Datasource
Full Table Scans, Making an RDD out of a Table
import com.datastax.spark.connector._
sc.cassandraTable(KeyspaceName, TableName)

import com.datastax.spark.connector._
sc.cassandraTable[MyClass](KeyspaceName, TableName)
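For the typed variant, MyClass is assumed to be a case class whose fields line up with the table's columns (the connector maps vehicleId to vehicle_id). For the tracker table above it might look like:

case class MyClass(vehicleId: java.util.UUID, ts: java.util.Date, x: Double, y: Double)

sc.cassandraTable[MyClass]("ks", "tracker")  // keyspace "ks" is hypothetical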
Pushing Down CQL to Cassandra
import com.datastax.spark.connector._
sc.cassandraTable[MyClass](KeyspaceName, TableName).select("vehicle_id").where("ts > 10")

SELECT "vehicle_id" FROM TABLE
WHERE
Token(PK) > 500 AND
Token(PK) < 600 AND
ts > 10
Distributed Key Retrieval
import com.datastax.spark.connector._
rdd.joinWithCassandraTable("keyspace", "table")
San Jose Oakland San Francisco
Spark Executor Spark Executor Spark Executor
RDD
But our Data isn't Colocated
import com.datastax.spark.connector._
rdd.joinWithCassandraTable("keyspace", "table")
San Jose Oakland San Francisco
Spark Executor Spark Executor Spark Executor
RDD
RBCR (repartitionByCassandraReplica) Reshuffles Our Data in Bulk so That It Will Be Local
rdd
  .repartitionByCassandraReplica("keyspace", "table")
  .joinWithCassandraTable("keyspace", "table")
San Jose Oakland San Francisco
Spark Executor Spark Executor Spark Executor
RDD
The Connector Provides a Distributed Pool for
Prepared Statements and Sessions
CassandraConnector(sc.getConf)

rdd.mapPartitions{ it =>
  val ps = CassandraConnector(sc.getConf)
    .withSessionDo( s => s.prepare )
  it.map{ ps.bind(_).executeAsync() }
}
Your Pool Ready to Be Deployed
Prepared Statement Cache
Session Cache
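Filling in the pieces the snippet above elides, a runnable version might look like the sketch below. The INSERT statement and the shape of rdd are hypothetical; the point is that sessions and prepared statements obtained through CassandraConnector inside a partition come from the connector's caches.

import com.datastax.spark.connector.cql.CassandraConnector

val connector = CassandraConnector(sc.getConf)

// Hypothetical rdd of (java.util.UUID, java.util.Date) pairs to write
rdd.foreachPartition { rows =>
  connector.withSessionDo { session =>
    // Prepared once per partition; reused from the connector's cache after that
    val ps = session.prepare(
      "INSERT INTO ks.tracker (vehicle_id, ts) VALUES (?, ?)")
    rows.foreach { case (id, ts) => session.execute(ps.bind(id, ts)) }
  }
}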
The Connector Supports the DataSources Api
sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "read_test", "table" -> "simple_kv"))
  .load

import org.apache.spark.sql.cassandra._

sqlContext
  .read
  .cassandraFormat("simple_kv", "read_test")
  .load
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md
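Writing goes through the same Data Source; a minimal sketch, assuming df matches the table's schema and the table already exists:

df.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "read_test", "table" -> "simple_kv"))
  .mode("append")
  .save()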
The Connector Works with Catalyst to Push Down Predicates to Cassandra
import com.datastax.spark.connector._
df.select("vehicle_id").filter("ts > 10")

SELECT "vehicle_id" FROM TABLE
WHERE
Token(PK) > 500 AND
Token(PK) < 600 AND
ts > 10
The Connector Works with Catalyst to Push Down Predicates to Cassandra
import com.datastax.spark.connector._
df.select("vehicle_id").filter("ts > 10")
QueryPlan Catalyst PrunedFilteredScan
Only prunes (projections) and filters (predicates) can be pushed down to Cassandra.
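A quick way to see what was actually pushed down is to inspect the physical plan; the pushed filters should appear on the Cassandra scan node (a sketch):

df.select("vehicle_id").filter("ts > 10").explain()
// The plan should list ts > 10 among the pushed filters on the Cassandra
// relation; anything that cannot be pushed is applied by Spark afterwards.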
Recent Features
• C* 3.0 Support
• Materialized Views
• SASI Indexes (for pushdown)
• Advanced Spark Partitioner Support
• Increased Performance on JWCT
Use C* Partitioning in Spark
• C* Data is Partitioned
• Spark has Partitions and partitioners
• Spark can use a known partitioner to speed up
Cogroups (joins)
• How Can we Leverage this?
Use C* Partitioning in Spark
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/16_partitioning.md
Now, if keyBy is used on a CassandraTableScanRDD and the Partition Key is included in the key, the RDD will be given a C* Partitioner
val ks = "doc_example"
val rdd = { sc.cassandraTable[(String, Int)](ks, "users")
  .select("name" as "_1", "zipcode" as "_2", "userid")
  .keyBy[Tuple1[Int]]("userid")
}
rdd.partitioner
// res3: Option[org.apache.spark.Partitioner] =
//   Some(com.datastax.spark.connector.rdd.partitioner.CassandraPartitioner@94515d3e)
Use C* Partitioning in Spark
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/16_partitioning.md
Share partitioners between Tables for joins on Partition Key
val ks = "doc_example"
val rdd1 = { sc.cassandraTable[(Int, Int, String)](ks, "purchases")
  .select("purchaseid" as "_1", "amount" as "_2", "objectid" as "_3", "userid")
  .keyBy[Tuple1[Int]]("userid")
}
val rdd2 = { sc.cassandraTable[(String, Int)](ks, "users")
  .select("name" as "_1", "zipcode" as "_2", "userid")
  .keyBy[Tuple1[Int]]("userid")
}.applyPartitionerFrom(rdd1) // Assigns the partitioner from the first rdd to this one
val joinRDD = rdd1.join(rdd2)
Use C* Partitioning in Spark
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/16_partitioning.md
Many other uses for this, try it yourself!
• All self joins using the Partition key
• Groups within C* Partitions
• Anything formerly using SpanBy
• Joins with other RDDs

And much more!
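For example, once the keyed RDD from the keyBy example above carries a CassandraPartitioner, grouping by partition key needs no shuffle, which covers many of the old spanBy use cases (a sketch reusing that rdd):

// rdd is keyed by Tuple1(userid) and already has a Cassandra partitioner,
// so groupByKey can group within partitions without a shuffle
val usersGrouped = rdd.groupByKey()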
Enhanced Parallelism with JWCT
• joinWithCassandraTable now has increased
concurrency and parallelism!
• 5X Speed increases in some cases
• https://datastax-oss.atlassian.net/browse/SPARKC-233
• Thanks Jaroslaw!
The Connector wants you!
• OSS Project that loves community involvement
• Bug Reports
• Feature Requests
• Doc Improvements
• Come join us!
Vin Diesel may or may not be a contributor
Tons of Videos at Datastax Academy
https://academy.datastax.com/
https://academy.datastax.com/courses/getting-started-apache-spark
See you at Cassandra Summit!
Join me and 3500 of your database peers, and take a deep dive
into Apache Cassandra™, the massively scalable NoSQL
database that powers global businesses like Apple, Spotify,
Netflix and Sony.
San Jose Convention Center

September 7-9, 2016
https://cassandrasummit.org/
Build Something Disruptive
