SlideShare a Scribd company logo
1 of 64
Download to read offline
Spark and Cassandra:
2 Fast, 2 Furious
Russell Spitzer
DataStax Inc.
Russell, ostensibly a software engineer
• Did a Ph.D in bioinformatics at some
point
• Written a great deal of automation and
testing framework code
• Now develops for Datastax on the
Analytics Team
• Focuses a lot on the 

Datastax OSS Spark Cassandra Connector
Datastax Spark Cassandra Connector
Let Spark Interact with your Cassandra Data!
https://github.com/datastax/spark-cassandra-connector
http://spark-packages.org/package/datastax/spark-cassandra-connector
http://spark-packages.org/package/TargetHolding/pyspark-cassandra
Compatible with Spark 1.6 + Cassandra 3.0
• Bulk writing to Cassandra
• Distributed full table scans
• Optimized direct joins with Cassandra
• Secondary index pushdown
• Connection and prepared statement pools
Cassandra is essentially a hybrid between a key-value and a column-oriented
(or tabular) database management system. Its data model is a partitioned row
store with tunable consistency*
*https://en.wikipedia.org/wiki/Apache_Cassandra
Let's break that down

1.What is a C* Partition and Row

2.How does C* Place Partitions
CQL looks a lot like SQL
CREATE TABLE tracker (
vehicle_id uuid,
ts timestamp,
x double,
y double,
PRIMARY KEY (vehicle_id, ts))
INSERTS look almost Identical
CREATE TABLE tracker (
vehicle_id uuid,
ts timestamp,
x double,
y double,
PRIMARY KEY (vehicle_id, ts))
INSERT INTO tracker (vehicle_id, ts, x, y) Values ( 1, 0, 0, 1)
Cassandra Data is stored in Partitions
CREATE TABLE tracker (
vehicle_id uuid,
ts timestamp,
x double,
y double,
PRIMARY KEY (vehicle_id, ts))
vehicle_id
1
ts: 0
x: 0 y: 1
INSERT INTO tracker (vehicle_id, ts, x, y) Values ( 1, 0, 0, 1)
C* Partitions Store Many Rows
CREATE TABLE tracker (
vehicle_id uuid,
ts timestamp,
x double,
y double,
PRIMARY KEY (vehicle_id, ts))
vehicle_id
1
ts: 0
x: 0 y: 1
ts: 1
x: -1 y:0
ts: 2
x: 0 y: -1
ts: 3
x: 1 y: 0
C* Partitions Store Many Rows
CREATE TABLE tracker (
vehicle_id uuid,
ts timestamp,
x double,
y double,
PRIMARY KEY (vehicle_id, ts))
vehicle_id
1
ts: 0
x: 0 y: 1
ts: 1
x: -1 y:0
ts: 2
x: 0 y: -1
ts: 3
x: 1 y: 0Partition
C* Partitions Store Many Rows
CREATE TABLE tracker (
vehicle_id uuid,
ts timestamp,
x double,
y double,
PRIMARY KEY (vehicle_id, ts))
vehicle_id
1
ts: 0
x: 0 y: 1
ts: 1
x: -1 y:0
ts: 2
x: 0 y: -1
ts: 3
x: 1 y: 0Row
Within a partition there is ordering based on the
Clustering Keys
CREATE TABLE tracker (
vehicle_id uuid,
ts timestamp,
x double,
y double,
PRIMARY KEY (vehicle_id, ts))
vehicle_id
1
ts: 0
x: 0 y: 1
ts: 1
x: -1 y:0
ts: 2
x: 0 y: -1
ts: 3
x: 1 y: 0
Ordered by Clustering Key
Slices within a Partition are Very Easy
CREATE TABLE tracker (
vehicle_id uuid,
ts timestamp,
x double,
y double,
PRIMARY KEY (vehicle_id, ts))
vehicle_id
1
ts: 0
x: 0 y: 1
ts: 1
x: -1 y:0
ts: 2
x: 0 y: -1
ts: 3
x: 1 y: 0
Ordered by Clustering Key
Cassandra is a Distributed Fault-Tolerant Database
San Jose
Oakland
San Francisco
Data is located on a Token Range
San Jose
Oakland
San Francisco
Data is located on a Token Range
0
San Jose
Oakland
San Francisco
1200
600
The Partition Key of a Row is Hashed to Determine
it's Token
0
San Jose
Oakland
San Francisco
1200
600
The Partition Key of a Row is Hashed to Determine
it's Token
01200
600
San Jose
Oakland
San Francisco
vehicle_id
1
INSERT INTO tracker (vehicle_id, ts, x, y) Values ( 1, 0, 0, 1)
The Partition Key of a Row is Hashed to Determine
it's Token
01200
600
San Jose
Oakland
San Francisco
vehicle_id
1
INSERT INTO tracker (vehicle_id, ts, x, y) Values ( 1, 0, 0, 1)
Hash(1) = 1000
How The Spark Cassandra Connector Reads
0
500
How The Spark Cassandra Connector Reads
0
San Jose Oakland San Francisco
500
Spark Partitions are Built From the Token Range
0
San Jose Oakland San Francisco
500
500
-
600
600
-
700
500 1200
Each Partition Can be Drawn Locally from at
Least One Node
0
San Jose Oakland San Francisco
500
500
-
600
600
-
700
500 1200
Each Partition Can be Drawn Locally from at
Least One Node
0
San Jose Oakland San Francisco
500
500
-
600
600
-
700
500 1200
Spark Executor Spark Executor Spark Executor
No Need to Leave the Node For Data!
0
500
Data is Retrieved using the Datastax Java Driver
0
Spark Executor Cassandra
A Connection is Established
0
Spark Executor Cassandra
500
-
600
A Query is Prepared with Token Bounds
0
Spark Executor Cassandra
SELECT * FROM TABLE WHERE
Token(PK) > StartToken AND
Token(PK) < EndToken500
-
600
The Spark Partitions Bounds are Placed in the Query
0
Spark Executor Cassandra
SELECT * FROM TABLE WHERE
Token(PK) > 500 AND
Token(PK) < 600500
-
600
Paged a number of rows at a Time
0
Spark Executor Cassandra
500
-
600
SELECT * FROM TABLE WHERE
Token(PK) > 500 AND
Token(PK) < 600
Data is Paged
0
Spark Executor Cassandra
500
-
600
SELECT * FROM TABLE WHERE
Token(PK) > 500 AND
Token(PK) < 600
Data is Paged
0
Spark Executor Cassandra
500
-
600
SELECT * FROM TABLE WHERE
Token(PK) > 500 AND
Token(PK) < 600
Data is Paged
0
Spark Executor Cassandra
500
-
600
SELECT * FROM TABLE WHERE
Token(PK) > 500 AND
Token(PK) < 600
Data is Paged
0
Spark Executor Cassandra
500
-
600
SELECT * FROM TABLE WHERE
Token(PK) > 500 AND
Token(PK) < 600
How we write to Cassandra
• Data is written via the Datastax Java Driver
• Data is grouped based on Partition Key
(configurable)
• Batches are written
Data is Written using the Datastax Java Driver
0
Spark Executor Cassandra
A Connection is Established
0
Spark Executor Cassandra
Data is Grouped on Key
0
Spark Executor Cassandra
Data is Grouped on Key
0
Spark Executor Cassandra
Once a Batch is Full or There are Too Many Batches

The Largest Batch is Executed
0
Spark Executor Cassandra
Once a Batch is Full or There are Too Many Batches

The Largest Batch is Executed
0
Spark Executor Cassandra
Once a Batch is Full or There are Too Many Batches

The Largest Batch is Executed
0
Spark Executor Cassandra
Once a Batch is Full or There are Too Many Batches

The Largest Batch is Executed
0
Spark Executor Cassandra
Most Common Features
• RDD APIs
• cassandraTable
• saveToCassandra
• repartitionByCassandraTable
• joinWithCassandraTable
• DF API
• Datasource
Full Table Scans, Making an RDD out of a Table
import	com.datastax.spark.connector._	
sc.cassandraTable(KeyspaceName,	TableName)
import	com.datastax.spark.connector._	
sc.cassandraTable[MyClass](KeyspaceName,	TableName)
Pushing Down CQL to Cassandra
import	com.datastax.spark.connector._	
sc.cassandraTable[MyClass](KeyspaceName,	
TableName).select("vehicle_id").where("ts	>	10")
SELECT "vehicle_id" FROM TABLE
WHERE
Token(PK) > 500 AND
Token(PK) < 600 AND 

ts > 10
Distributed Key Retrieval
import	com.datastax.spark.connector._	
rdd.joinWithCassandraTable("keyspace",	"table")
San Jose Oakland San Francisco
Spark Executor Spark Executor Spark Executor
RDD
But our Data isn't Colocated
import	com.datastax.spark.connector._	
rdd.joinWithCassandraTable("keyspace",	"table")
San Jose Oakland San Francisco
Spark Executor Spark Executor Spark Executor
RDD
RBCR Moves bulk reshuffles our data so Data Will
Be Local
rdd	
	.repartitionByCassandraReplica("keyspace","table")	
	.joinWithCassandraTable("keyspace",	"table")
San Jose Oakland San Francisco
Spark Executor Spark Executor Spark Executor
RDD
The Connector Provides a Distributed Pool for
Prepared Statements and Sessions
CassandraConnector(sc.getConf)
rdd.mapPartitions{	it	=>	{	
		val	ps	=	CassandraConnector(sc.getConf)	
				.withSessionDo(	s	=>	s.prepare)		
		it.map{	ps.bind(_).executeAsync()}

		}
Your Pool Ready to Be Deployed
The Connector Provides a Distributed Pool for
Prepared Statements and Sessions
CassandraConnector(sc.getConf)
rdd.mapPartitions{	it	=>	{	
		val	ps	=	CassandraConnector(sc.getConf)	
				.withSessionDo(	s	=>	s.prepare)		
		it.map{	ps.bind(_).executeAsync()}

		}
Your Pool Ready to Be Deployed
Prepared
Statement Cache
Session Cache
The Connector Supports the DataSources Api
sqlContext

		.read

		.format("org.apache.spark.sql.cassandra")

		.options(Map("keyspace"	->	"read_test",	"table"	->	"simple_kv"))

		.load
import	org.apache.spark.sql.cassandra._	
sqlContext

		.read

		.cassandraFormat("read_test","table")	
		.load
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md
The Connector Works with Catalyst to 

Pushdown Predicates to Cassandra
import	com.datastax.spark.connector._	
df.select("vehicle_id").filter("ts	>	10")
SELECT "vehicle_id" FROM TABLE
WHERE
Token(PK) > 500 AND
Token(PK) < 600 AND 

ts > 10
The Connector Works with Catalyst to 

Pushdown Predicates to Cassandra
import	com.datastax.spark.connector._	
df.select("vehicle_id").filter("ts	>	10")
QueryPlan Catalyst PrunedFilteredScan
Only Prunes (projections) and
Filters (predicates) are able to
be pushed down.
Recent Features
• C* 3.0 Support
• Materialized Views
• SASL Indexes (for pushdown)
• Advanced Spark Partitioner Support
• Increased Performance on JWCT
Use C* Partitioning in Spark
Use C* Partitioning in Spark
• C* Data is Partitioned
• Spark has Partitions and partitioners
• Spark can use a known partitioner to speed up
Cogroups (joins)
• How Can we Leverage this?
Use C* Partitioning in Spark
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/16_partitioning.md
Now if keyBy is used on a CassandraTableScanRDD and the PartitionKey
is included in the key. The RDD will be given a C* Partitioner
val	ks	=	"doc_example"	
val	rdd	=	{	sc.cassandraTable[(String,	Int)](ks,	"users")	
		.select("name"	as	"_1",	"zipcode"	as	"_2",	"userid")	
		.keyBy[Tuple1[Int]]("userid")	
}	
rdd.partitioner	
//res3:	Option[org.apache.spark.Partitioner]	=	
Some(com.datastax.spark.connector.rdd.partitioner.CassandraPartitioner@94515d3e)
Use C* Partitioning in Spark
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/16_partitioning.md
Share partitioners between Tables for joins on Partition Key
val	ks	=	"doc_example"	
val	rdd1	=	{	sc.cassandraTable[(Int,	Int,	String)](ks,	"purchases")	
		.select("purchaseid"	as	"_1",	"amount"	as	"_2",	"objectid"	as	"_3",	"userid")	
		.keyBy[Tuple1[Int]]("userid")	
}	
val	rdd2	=	{	sc.cassandraTable[(String,	Int)](ks,	"users")	
		.select("name"	as	"_1",	"zipcode"	as	"_2",	"userid")	
		.keyBy[Tuple1[Int]]("userid")	
}.applyPartitionerFrom(rdd1)	//	Assigns	the	partitioner	from	the	first	rdd	to	this	one	
val	joinRDD	=	rdd1.join(rdd2)
Use C* Partitioning in Spark
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/16_partitioning.md
Many other uses for this, try it yourself!
• All self joins using the Partition key
• Groups within C* Partitions
• Anything formerly using SpanBy
• Joins with other RDDs

And much more!
Enhanced Parallelism with JWCT
• joinWithCassandraTable now has increased
concurrency and parallelism!
• 5X Speed increases in some cases
• https://datastax-oss.atlassian.net/browse/
SPARKC-233
• Thanks Jaroslaw!
The Connector wants you!
• OSS Project that loves community involvement
• Bug Reports
• Feature Requests
• Doc Improvements
• Come join us!
Vin Diesel may or may not be a contributor
Tons of Videos at Datastax Academy
https://academy.datastax.com/
https://academy.datastax.com/courses/getting-started-apache-spark
Tons of Videos at Datastax Academy
https://academy.datastax.com/
https://academy.datastax.com/courses/getting-started-apache-spark
See you at Cassandra Summit!
Join me and 3500 of your database peers, and take a deep dive
into Apache Cassandra™, the massively scalable NoSQL
database that powers global businesses like Apple, Spotify,
Netflix and Sony.
San Jose Convention Center

September 7-9, 2016
https://cassandrasummit.org/
Build Something Disruptive

More Related Content

What's hot

Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDatabricks
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeDatabricks
 
Presto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation EnginesPresto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation EnginesDatabricks
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
Apply Hammer Directly to Thumb; Avoiding Apache Spark and Cassandra AntiPatt...
 Apply Hammer Directly to Thumb; Avoiding Apache Spark and Cassandra AntiPatt... Apply Hammer Directly to Thumb; Avoiding Apache Spark and Cassandra AntiPatt...
Apply Hammer Directly to Thumb; Avoiding Apache Spark and Cassandra AntiPatt...Databricks
 
Cql – cassandra query language
Cql – cassandra query languageCql – cassandra query language
Cql – cassandra query languageCourtney Robinson
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationDatabricks
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming JobsDatabricks
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureDatabricks
 
Spark overview
Spark overviewSpark overview
Spark overviewLisa Hua
 
Accelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks AutoloaderAccelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks AutoloaderDatabricks
 
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...Spark Summit
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Parallelization of Structured Streaming Jobs Using Delta Lake
Parallelization of Structured Streaming Jobs Using Delta LakeParallelization of Structured Streaming Jobs Using Delta Lake
Parallelization of Structured Streaming Jobs Using Delta LakeDatabricks
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupDatabricks
 
Understanding Data Partitioning and Replication in Apache Cassandra
Understanding Data Partitioning and Replication in Apache CassandraUnderstanding Data Partitioning and Replication in Apache Cassandra
Understanding Data Partitioning and Replication in Apache CassandraDataStax
 
PySpark Best Practices
PySpark Best PracticesPySpark Best Practices
PySpark Best PracticesCloudera, Inc.
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in DeltaDatabricks
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauSpark Summit
 

What's hot (20)

Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 
Presto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation EnginesPresto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation Engines
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Apply Hammer Directly to Thumb; Avoiding Apache Spark and Cassandra AntiPatt...
 Apply Hammer Directly to Thumb; Avoiding Apache Spark and Cassandra AntiPatt... Apply Hammer Directly to Thumb; Avoiding Apache Spark and Cassandra AntiPatt...
Apply Hammer Directly to Thumb; Avoiding Apache Spark and Cassandra AntiPatt...
 
Cql – cassandra query language
Cql – cassandra query languageCql – cassandra query language
Cql – cassandra query language
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse Architecture
 
Spark overview
Spark overviewSpark overview
Spark overview
 
Accelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks AutoloaderAccelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks Autoloader
 
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Parallelization of Structured Streaming Jobs Using Delta Lake
Parallelization of Structured Streaming Jobs Using Delta LakeParallelization of Structured Streaming Jobs Using Delta Lake
Parallelization of Structured Streaming Jobs Using Delta Lake
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Understanding Data Partitioning and Replication in Apache Cassandra
Understanding Data Partitioning and Replication in Apache CassandraUnderstanding Data Partitioning and Replication in Apache Cassandra
Understanding Data Partitioning and Replication in Apache Cassandra
 
PySpark Best Practices
PySpark Best PracticesPySpark Best Practices
PySpark Best Practices
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in Delta
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden Karau
 

Viewers also liked

Spark on Mesos
Spark on MesosSpark on Mesos
Spark on MesosJen Aman
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityJen Aman
 
Livy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache SparkLivy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache SparkJen Aman
 
Airstream: Spark Streaming At Airbnb
Airstream: Spark Streaming At AirbnbAirstream: Spark Streaming At Airbnb
Airstream: Spark Streaming At AirbnbJen Aman
 
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkYggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkJen Aman
 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersTime-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersJen Aman
 
Building Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemMLBuilding Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemMLJen Aman
 
Low Latency Execution For Apache Spark
Low Latency Execution For Apache SparkLow Latency Execution For Apache Spark
Low Latency Execution For Apache SparkJen Aman
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development KitJen Aman
 
Recent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsRecent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsDatabricks
 
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Databricks
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Databricks
 
A Graph-Based Method For Cross-Entity Threat Detection
 A Graph-Based Method For Cross-Entity Threat Detection A Graph-Based Method For Cross-Entity Threat Detection
A Graph-Based Method For Cross-Entity Threat DetectionJen Aman
 
Morticia: Visualizing And Debugging Complex Spark Workflows
Morticia: Visualizing And Debugging Complex Spark WorkflowsMorticia: Visualizing And Debugging Complex Spark Workflows
Morticia: Visualizing And Debugging Complex Spark WorkflowsSpark Summit
 
RISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time DecisionsRISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time DecisionsJen Aman
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Jen Aman
 
From MapReduce to Apache Spark
From MapReduce to Apache SparkFrom MapReduce to Apache Spark
From MapReduce to Apache SparkJen Aman
 
Huawei Advanced Data Science With Spark Streaming
Huawei Advanced Data Science With Spark StreamingHuawei Advanced Data Science With Spark Streaming
Huawei Advanced Data Science With Spark StreamingJen Aman
 
Huohua: A Distributed Time Series Analysis Framework For Spark
Huohua: A Distributed Time Series Analysis Framework For SparkHuohua: A Distributed Time Series Analysis Framework For Spark
Huohua: A Distributed Time Series Analysis Framework For SparkJen Aman
 
GPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And PythonGPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And PythonJen Aman
 

Viewers also liked (20)

Spark on Mesos
Spark on MesosSpark on Mesos
Spark on Mesos
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
 
Livy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache SparkLivy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache Spark
 
Airstream: Spark Streaming At Airbnb
Airstream: Spark Streaming At AirbnbAirstream: Spark Streaming At Airbnb
Airstream: Spark Streaming At Airbnb
 
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkYggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersTime-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity Clusters
 
Building Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemMLBuilding Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemML
 
Low Latency Execution For Apache Spark
Low Latency Execution For Apache SparkLow Latency Execution For Apache Spark
Low Latency Execution For Apache Spark
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development Kit
 
Recent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsRecent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced Analytics
 
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
 
A Graph-Based Method For Cross-Entity Threat Detection
 A Graph-Based Method For Cross-Entity Threat Detection A Graph-Based Method For Cross-Entity Threat Detection
A Graph-Based Method For Cross-Entity Threat Detection
 
Morticia: Visualizing And Debugging Complex Spark Workflows
Morticia: Visualizing And Debugging Complex Spark WorkflowsMorticia: Visualizing And Debugging Complex Spark Workflows
Morticia: Visualizing And Debugging Complex Spark Workflows
 
RISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time DecisionsRISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time Decisions
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
 
From MapReduce to Apache Spark
From MapReduce to Apache SparkFrom MapReduce to Apache Spark
From MapReduce to Apache Spark
 
Huawei Advanced Data Science With Spark Streaming
Huawei Advanced Data Science With Spark StreamingHuawei Advanced Data Science With Spark Streaming
Huawei Advanced Data Science With Spark Streaming
 
Huohua: A Distributed Time Series Analysis Framework For Spark
Huohua: A Distributed Time Series Analysis Framework For SparkHuohua: A Distributed Time Series Analysis Framework For Spark
Huohua: A Distributed Time Series Analysis Framework For Spark
 
GPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And PythonGPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And Python
 

Similar to Spark And Cassandra: 2 Fast, 2 Furious

Manchester Hadoop Meetup: Cassandra Spark internals
Manchester Hadoop Meetup: Cassandra Spark internalsManchester Hadoop Meetup: Cassandra Spark internals
Manchester Hadoop Meetup: Cassandra Spark internalsChristopher Batey
 
Cassandra and Spark
Cassandra and Spark Cassandra and Spark
Cassandra and Spark datastaxjp
 
Analyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraAnalyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraPatrick McFadin
 
Cassandra London - C* Spark Connector
Cassandra London - C* Spark ConnectorCassandra London - C* Spark Connector
Cassandra London - C* Spark ConnectorChristopher Batey
 
A Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
A Cassandra + Solr + Spark Love Triangle Using DataStax EnterpriseA Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
A Cassandra + Solr + Spark Love Triangle Using DataStax EnterprisePatrick McFadin
 
Apache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series dataApache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series dataPatrick McFadin
 
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016StampedeCon
 
Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...
Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...
Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...DataStax Academy
 
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by ScyllaScylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by ScyllaScyllaDB
 
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...StampedeCon
 
Analytics with Cassandra & Spark
Analytics with Cassandra & SparkAnalytics with Cassandra & Spark
Analytics with Cassandra & SparkMatthias Niehoff
 
Spark Cassandra Connector Dataframes
Spark Cassandra Connector DataframesSpark Cassandra Connector Dataframes
Spark Cassandra Connector DataframesRussell Spitzer
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionLightbend
 
SparkSQL et Cassandra - Tool In Action Devoxx 2015
 SparkSQL et Cassandra - Tool In Action Devoxx 2015 SparkSQL et Cassandra - Tool In Action Devoxx 2015
SparkSQL et Cassandra - Tool In Action Devoxx 2015Alexander DEJANOVSKI
 
Apache cassandra and spark. you got the the lighter, let's start the fire
Apache cassandra and spark. you got the the lighter, let's start the fireApache cassandra and spark. you got the the lighter, let's start the fire
Apache cassandra and spark. you got the the lighter, let's start the firePatrick McFadin
 
Spark Sql for Training
Spark Sql for TrainingSpark Sql for Training
Spark Sql for TrainingBryan Yang
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Helena Edelson
 
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Cassandra and Spark, closing the gap between no sql and analytics   codemotio...Cassandra and Spark, closing the gap between no sql and analytics   codemotio...
Cassandra and Spark, closing the gap between no sql and analytics codemotio...Duyhai Doan
 
Spark + Cassandra = Real Time Analytics on Operational Data
Spark + Cassandra = Real Time Analytics on Operational DataSpark + Cassandra = Real Time Analytics on Operational Data
Spark + Cassandra = Real Time Analytics on Operational DataVictor Coustenoble
 

Similar to Spark And Cassandra: 2 Fast, 2 Furious (20)

Manchester Hadoop Meetup: Cassandra Spark internals
Manchester Hadoop Meetup: Cassandra Spark internalsManchester Hadoop Meetup: Cassandra Spark internals
Manchester Hadoop Meetup: Cassandra Spark internals
 
Cassandra and Spark
Cassandra and Spark Cassandra and Spark
Cassandra and Spark
 
Analyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraAnalyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and Cassandra
 
Cassandra London - C* Spark Connector
Cassandra London - C* Spark ConnectorCassandra London - C* Spark Connector
Cassandra London - C* Spark Connector
 
A Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
A Cassandra + Solr + Spark Love Triangle Using DataStax EnterpriseA Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
A Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
 
Apache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series dataApache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series data
 
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
 
Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...
Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...
Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...
 
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by ScyllaScylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
 
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
 
Analytics with Cassandra & Spark
Analytics with Cassandra & SparkAnalytics with Cassandra & Spark
Analytics with Cassandra & Spark
 
Spark Cassandra Connector Dataframes
Spark Cassandra Connector DataframesSpark Cassandra Connector Dataframes
Spark Cassandra Connector Dataframes
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In Production
 
SparkSQL et Cassandra - Tool In Action Devoxx 2015
 SparkSQL et Cassandra - Tool In Action Devoxx 2015 SparkSQL et Cassandra - Tool In Action Devoxx 2015
SparkSQL et Cassandra - Tool In Action Devoxx 2015
 
Apache cassandra and spark. you got the the lighter, let's start the fire
Apache cassandra and spark. you got the the lighter, let's start the fireApache cassandra and spark. you got the the lighter, let's start the fire
Apache cassandra and spark. you got the the lighter, let's start the fire
 
Spark Sql for Training
Spark Sql for TrainingSpark Sql for Training
Spark Sql for Training
 
Escape from Hadoop
Escape from HadoopEscape from Hadoop
Escape from Hadoop
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
 
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Cassandra and Spark, closing the gap between no sql and analytics   codemotio...Cassandra and Spark, closing the gap between no sql and analytics   codemotio...
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
 
Spark + Cassandra = Real Time Analytics on Operational Data
Spark + Cassandra = Real Time Analytics on Operational DataSpark + Cassandra = Real Time Analytics on Operational Data
Spark + Cassandra = Real Time Analytics on Operational Data
 

More from Jen Aman

Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaDeep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaJen Aman
 
Snorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher RéSnorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher RéJen Aman
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesJen Aman
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesJen Aman
 
Spatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using SparkSpatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using SparkJen Aman
 
Deploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using SparkDeploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using SparkJen Aman
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityJen Aman
 
Efficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out DatabasesEfficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out DatabasesJen Aman
 
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibJen Aman
 
Spark at Bloomberg: Dynamically Composable Analytics
Spark at Bloomberg:  Dynamically Composable Analytics Spark at Bloomberg:  Dynamically Composable Analytics
Spark at Bloomberg: Dynamically Composable Analytics Jen Aman
 
EclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache SparkEclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache SparkJen Aman
 
Spark: Interactive To Production
Spark: Interactive To ProductionSpark: Interactive To Production
Spark: Interactive To ProductionJen Aman
 
High-Performance Python On Spark
High-Performance Python On SparkHigh-Performance Python On Spark
High-Performance Python On SparkJen Aman
 
Scalable Deep Learning Platform On Spark In Baidu
Scalable Deep Learning Platform On Spark In BaiduScalable Deep Learning Platform On Spark In Baidu
Scalable Deep Learning Platform On Spark In BaiduJen Aman
 
Scaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of ParametersScaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of ParametersJen Aman
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Jen Aman
 
Temporal Operators For Spark Streaming And Its Application For Office365 Serv...
Temporal Operators For Spark Streaming And Its Application For Office365 Serv...Temporal Operators For Spark Streaming And Its Application For Office365 Serv...
Temporal Operators For Spark Streaming And Its Application For Office365 Serv...Jen Aman
 
Utilizing Human Data Validation For KPI Analysis And Machine Learning
Utilizing Human Data Validation For KPI Analysis And Machine LearningUtilizing Human Data Validation For KPI Analysis And Machine Learning
Utilizing Human Data Validation For KPI Analysis And Machine LearningJen Aman
 

More from Jen Aman (18)

Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaDeep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
 
Snorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher RéSnorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher Ré
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best Practices
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best Practices
 
Spatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using SparkSpatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using Spark
 
Deploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using SparkDeploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using Spark
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
 
Efficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out DatabasesEfficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out Databases
 
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlib
 
Spark at Bloomberg: Dynamically Composable Analytics
Spark at Bloomberg:  Dynamically Composable Analytics Spark at Bloomberg:  Dynamically Composable Analytics
Spark at Bloomberg: Dynamically Composable Analytics
 
EclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache SparkEclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache Spark
 
Spark: Interactive To Production
Spark: Interactive To ProductionSpark: Interactive To Production
Spark: Interactive To Production
 
High-Performance Python On Spark
High-Performance Python On SparkHigh-Performance Python On Spark
High-Performance Python On Spark
 
Scalable Deep Learning Platform On Spark In Baidu
Scalable Deep Learning Platform On Spark In BaiduScalable Deep Learning Platform On Spark In Baidu
Scalable Deep Learning Platform On Spark In Baidu
 
Scaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of ParametersScaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of Parameters
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
 
Temporal Operators For Spark Streaming And Its Application For Office365 Serv...
Temporal Operators For Spark Streaming And Its Application For Office365 Serv...Temporal Operators For Spark Streaming And Its Application For Office365 Serv...
Temporal Operators For Spark Streaming And Its Application For Office365 Serv...
 
Utilizing Human Data Validation For KPI Analysis And Machine Learning
Utilizing Human Data Validation For KPI Analysis And Machine LearningUtilizing Human Data Validation For KPI Analysis And Machine Learning
Utilizing Human Data Validation For KPI Analysis And Machine Learning
 

Recently uploaded

Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...nirzagarg
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...Bertram Ludäscher
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubaikojalkojal131
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.pptibrahimabdi22
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...nirzagarg
 
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制vexqp
 
Data Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdfData Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdftheeltifs
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Klinik kandungan
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabiaahmedjiabur940
 
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制vexqp
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...gajnagarg
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxParas Gupta
 
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
SR-101-01012024-EN.docx  Federal Constitution  of the Swiss ConfederationSR-101-01012024-EN.docx  Federal Constitution  of the Swiss Confederation
SR-101-01012024-EN.docx Federal Constitution of the Swiss ConfederationEfruzAsilolu
 
Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........EfruzAsilolu
 
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制vexqp
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...gajnagarg
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 

Recently uploaded (20)

Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
 
Data Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdfData Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdf
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptx
 
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
SR-101-01012024-EN.docx  Federal Constitution  of the Swiss ConfederationSR-101-01012024-EN.docx  Federal Constitution  of the Swiss Confederation
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
 
Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........
 
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 

Spark And Cassandra: 2 Fast, 2 Furious

  • 1. Spark and Cassandra: 2 Fast, 2 Furious Russell Spitzer DataStax Inc.
  • 2. Russell, ostensibly a software engineer • Did a Ph.D in bioinformatics at some point • Written a great deal of automation and testing framework code • Now develops for Datastax on the Analytics Team • Focuses a lot on the 
 Datastax OSS Spark Cassandra Connector
  • 3. Datastax Spark Cassandra Connector Let Spark Interact with your Cassandra Data! https://github.com/datastax/spark-cassandra-connector http://spark-packages.org/package/datastax/spark-cassandra-connector http://spark-packages.org/package/TargetHolding/pyspark-cassandra Compatible with Spark 1.6 + Cassandra 3.0 • Bulk writing to Cassandra • Distributed full table scans • Optimized direct joins with Cassandra • Secondary index pushdown • Connection and prepared statement pools
  • 4. Cassandra is essentially a hybrid between a key-value and a column-oriented (or tabular) database management system. Its data model is a partitioned row store with tunable consistency* *https://en.wikipedia.org/wiki/Apache_Cassandra Let's break that down
 1.What is a C* Partition and Row
 2.How does C* Place Partitions
  • 5. CQL looks a lot like SQL CREATE TABLE tracker ( vehicle_id uuid, ts timestamp, x double, y double, PRIMARY KEY (vehicle_id, ts))
  • 6. INSERTS look almost Identical CREATE TABLE tracker ( vehicle_id uuid, ts timestamp, x double, y double, PRIMARY KEY (vehicle_id, ts)) INSERT INTO tracker (vehicle_id, ts, x, y) Values ( 1, 0, 0, 1)
  • 7. Cassandra Data is stored in Partitions CREATE TABLE tracker ( vehicle_id uuid, ts timestamp, x double, y double, PRIMARY KEY (vehicle_id, ts)) vehicle_id 1 ts: 0 x: 0 y: 1 INSERT INTO tracker (vehicle_id, ts, x, y) Values ( 1, 0, 0, 1)
  • 8. C* Partitions Store Many Rows CREATE TABLE tracker ( vehicle_id uuid, ts timestamp, x double, y double, PRIMARY KEY (vehicle_id, ts)) vehicle_id 1 ts: 0 x: 0 y: 1 ts: 1 x: -1 y:0 ts: 2 x: 0 y: -1 ts: 3 x: 1 y: 0
  • 9. C* Partitions Store Many Rows CREATE TABLE tracker ( vehicle_id uuid, ts timestamp, x double, y double, PRIMARY KEY (vehicle_id, ts)) vehicle_id 1 ts: 0 x: 0 y: 1 ts: 1 x: -1 y:0 ts: 2 x: 0 y: -1 ts: 3 x: 1 y: 0Partition
  • 10. C* Partitions Store Many Rows CREATE TABLE tracker ( vehicle_id uuid, ts timestamp, x double, y double, PRIMARY KEY (vehicle_id, ts)) vehicle_id 1 ts: 0 x: 0 y: 1 ts: 1 x: -1 y:0 ts: 2 x: 0 y: -1 ts: 3 x: 1 y: 0Row
  • 11. Within a partition there is ordering based on the Clustering Keys CREATE TABLE tracker ( vehicle_id uuid, ts timestamp, x double, y double, PRIMARY KEY (vehicle_id, ts)) vehicle_id 1 ts: 0 x: 0 y: 1 ts: 1 x: -1 y:0 ts: 2 x: 0 y: -1 ts: 3 x: 1 y: 0 Ordered by Clustering Key
  • 12. Slices within a Partition are Very Easy CREATE TABLE tracker ( vehicle_id uuid, ts timestamp, x double, y double, PRIMARY KEY (vehicle_id, ts)) vehicle_id 1 ts: 0 x: 0 y: 1 ts: 1 x: -1 y:0 ts: 2 x: 0 y: -1 ts: 3 x: 1 y: 0 Ordered by Clustering Key
  • 13. Cassandra is a Distributed Fault-Tolerant Database San Jose Oakland San Francisco
  • 14. Data is located on a Token Range San Jose Oakland San Francisco
  • 15. Data is located on a Token Range 0 San Jose Oakland San Francisco 1200 600
  • 16. The Partition Key of a Row is Hashed to Determine it's Token 0 San Jose Oakland San Francisco 1200 600
  • 17. The Partition Key of a Row is Hashed to Determine it's Token 01200 600 San Jose Oakland San Francisco vehicle_id 1 INSERT INTO tracker (vehicle_id, ts, x, y) Values ( 1, 0, 0, 1)
  • 18. The Partition Key of a Row is Hashed to Determine it's Token 01200 600 San Jose Oakland San Francisco vehicle_id 1 INSERT INTO tracker (vehicle_id, ts, x, y) Values ( 1, 0, 0, 1) Hash(1) = 1000
  • 19. How The Spark Cassandra Connector Reads 0 500
  • 20. How The Spark Cassandra Connector Reads 0 San Jose Oakland San Francisco 500
  • 21. Spark Partitions are Built From the Token Range 0 San Jose Oakland San Francisco 500 500 - 600 600 - 700 500 1200
  • 22. Each Partition Can be Drawn Locally from at Least One Node 0 San Jose Oakland San Francisco 500 500 - 600 600 - 700 500 1200
  • 23. Each Partition Can be Drawn Locally from at Least One Node 0 San Jose Oakland San Francisco 500 500 - 600 600 - 700 500 1200 Spark Executor Spark Executor Spark Executor
  • 24. No Need to Leave the Node For Data! 0 500
  • 25. Data is Retrieved using the Datastax Java Driver 0 Spark Executor Cassandra
  • 26. A Connection is Established 0 Spark Executor Cassandra 500 - 600
  • 27. A Query is Prepared with Token Bounds 0 Spark Executor Cassandra SELECT * FROM TABLE WHERE Token(PK) > StartToken AND Token(PK) < EndToken500 - 600
  • 28. The Spark Partitions Bounds are Placed in the Query 0 Spark Executor Cassandra SELECT * FROM TABLE WHERE Token(PK) > 500 AND Token(PK) < 600500 - 600
  • 29. Paged a number of rows at a Time 0 Spark Executor Cassandra 500 - 600 SELECT * FROM TABLE WHERE Token(PK) > 500 AND Token(PK) < 600
  • 30. Data is Paged 0 Spark Executor Cassandra 500 - 600 SELECT * FROM TABLE WHERE Token(PK) > 500 AND Token(PK) < 600
  • 31. Data is Paged 0 Spark Executor Cassandra 500 - 600 SELECT * FROM TABLE WHERE Token(PK) > 500 AND Token(PK) < 600
  • 32. Data is Paged 0 Spark Executor Cassandra 500 - 600 SELECT * FROM TABLE WHERE Token(PK) > 500 AND Token(PK) < 600
  • 33. Data is Paged 0 Spark Executor Cassandra 500 - 600 SELECT * FROM TABLE WHERE Token(PK) > 500 AND Token(PK) < 600
  • 34. How we write to Cassandra • Data is written via the Datastax Java Driver • Data is grouped based on Partition Key (configurable) • Batches are written
  • 35. Data is Written using the Datastax Java Driver 0 Spark Executor Cassandra
  • 36. A Connection is Established 0 Spark Executor Cassandra
  • 37. Data is Grouped on Key 0 Spark Executor Cassandra
  • 38. Data is Grouped on Key 0 Spark Executor Cassandra
  • 39. Once a Batch is Full or There are Too Many Batches
 The Largest Batch is Executed 0 Spark Executor Cassandra
  • 40. Once a Batch is Full or There are Too Many Batches
 The Largest Batch is Executed 0 Spark Executor Cassandra
  • 41. Once a Batch is Full or There are Too Many Batches
 The Largest Batch is Executed 0 Spark Executor Cassandra
  • 42. Once a Batch is Full or There are Too Many Batches
 The Largest Batch is Executed 0 Spark Executor Cassandra
  • 43. Most Common Features • RDD APIs • cassandraTable • saveToCassandra • repartitionByCassandraTable • joinWithCassandraTable • DF API • Datasource
  • 44. Full Table Scans, Making an RDD out of a Table import com.datastax.spark.connector._ sc.cassandraTable(KeyspaceName, TableName) import com.datastax.spark.connector._ sc.cassandraTable[MyClass](KeyspaceName, TableName)
  • 45. Pushing Down CQL to Cassandra import com.datastax.spark.connector._ sc.cassandraTable[MyClass](KeyspaceName, TableName).select("vehicle_id").where("ts > 10") SELECT "vehicle_id" FROM TABLE WHERE Token(PK) > 500 AND Token(PK) < 600 AND 
 ts > 10
  • 46. Distributed Key Retrieval import com.datastax.spark.connector._ rdd.joinWithCassandraTable("keyspace", "table") San Jose Oakland San Francisco Spark Executor Spark Executor Spark Executor RDD
  • 47. But our Data isn't Colocated import com.datastax.spark.connector._ rdd.joinWithCassandraTable("keyspace", "table") San Jose Oakland San Francisco Spark Executor Spark Executor Spark Executor RDD
  • 48. RBCR Moves bulk reshuffles our data so Data Will Be Local rdd .repartitionByCassandraReplica("keyspace","table") .joinWithCassandraTable("keyspace", "table") San Jose Oakland San Francisco Spark Executor Spark Executor Spark Executor RDD
  • 49. The Connector Provides a Distributed Pool for Prepared Statements and Sessions CassandraConnector(sc.getConf) rdd.mapPartitions{ it => { val ps = CassandraConnector(sc.getConf) .withSessionDo( s => s.prepare) it.map{ ps.bind(_).executeAsync()}
 } Your Pool Ready to Be Deployed
  • 50. The Connector Provides a Distributed Pool for Prepared Statements and Sessions CassandraConnector(sc.getConf) rdd.mapPartitions{ it => { val ps = CassandraConnector(sc.getConf) .withSessionDo( s => s.prepare) it.map{ ps.bind(_).executeAsync()}
 } Your Pool Ready to Be Deployed Prepared Statement Cache Session Cache
  • 51. The Connector Supports the DataSources Api sqlContext
 .read
 .format("org.apache.spark.sql.cassandra")
 .options(Map("keyspace" -> "read_test", "table" -> "simple_kv"))
 .load import org.apache.spark.sql.cassandra._ sqlContext
 .read
 .cassandraFormat("read_test","table") .load https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md
  • 52. The Connector Works with Catalyst to 
 Pushdown Predicates to Cassandra import com.datastax.spark.connector._ df.select("vehicle_id").filter("ts > 10") SELECT "vehicle_id" FROM TABLE WHERE Token(PK) > 500 AND Token(PK) < 600 AND 
 ts > 10
  • 53. The Connector Works with Catalyst to 
 Pushdown Predicates to Cassandra import com.datastax.spark.connector._ df.select("vehicle_id").filter("ts > 10") QueryPlan Catalyst PrunedFilteredScan Only Prunes (projections) and Filters (predicates) are able to be pushed down.
  • 54. Recent Features • C* 3.0 Support • Materialized Views • SASL Indexes (for pushdown) • Advanced Spark Partitioner Support • Increased Performance on JWCT
  • 56. Use C* Partitioning in Spark • C* Data is Partitioned • Spark has Partitions and partitioners • Spark can use a known partitioner to speed up Cogroups (joins) • How Can we Leverage this?
  • 57. Use C* Partitioning in Spark https://github.com/datastax/spark-cassandra-connector/blob/master/doc/16_partitioning.md Now if keyBy is used on a CassandraTableScanRDD and the PartitionKey is included in the key. The RDD will be given a C* Partitioner val ks = "doc_example" val rdd = { sc.cassandraTable[(String, Int)](ks, "users") .select("name" as "_1", "zipcode" as "_2", "userid") .keyBy[Tuple1[Int]]("userid") } rdd.partitioner //res3: Option[org.apache.spark.Partitioner] = Some(com.datastax.spark.connector.rdd.partitioner.CassandraPartitioner@94515d3e)
  • 58. Use C* Partitioning in Spark https://github.com/datastax/spark-cassandra-connector/blob/master/doc/16_partitioning.md Share partitioners between Tables for joins on Partition Key val ks = "doc_example" val rdd1 = { sc.cassandraTable[(Int, Int, String)](ks, "purchases") .select("purchaseid" as "_1", "amount" as "_2", "objectid" as "_3", "userid") .keyBy[Tuple1[Int]]("userid") } val rdd2 = { sc.cassandraTable[(String, Int)](ks, "users") .select("name" as "_1", "zipcode" as "_2", "userid") .keyBy[Tuple1[Int]]("userid") }.applyPartitionerFrom(rdd1) // Assigns the partitioner from the first rdd to this one val joinRDD = rdd1.join(rdd2)
  • 59. Use C* Partitioning in Spark https://github.com/datastax/spark-cassandra-connector/blob/master/doc/16_partitioning.md Many other uses for this, try it yourself! • All self joins using the Partition key • Groups within C* Partitions • Anything formerly using SpanBy • Joins with other RDDs
 And much more!
  • 60. Enhanced Parallelism with JWCT • joinWithCassandraTable now has increased concurrency and parallelism! • 5X Speed increases in some cases • https://datastax-oss.atlassian.net/browse/ SPARKC-233 • Thanks Jaroslaw!
  • 61. The Connector wants you! • OSS Project that loves community involvement • Bug Reports • Feature Requests • Doc Improvements • Come join us! Vin Diesel may or may not be a contributor
  • 62. Tons of Videos at Datastax Academy https://academy.datastax.com/ https://academy.datastax.com/courses/getting-started-apache-spark
  • 63. Tons of Videos at Datastax Academy https://academy.datastax.com/ https://academy.datastax.com/courses/getting-started-apache-spark
  • 64. See you at Cassandra Summit! Join me and 3500 of your database peers, and take a deep dive into Apache Cassandra™, the massively scalable NoSQL database that powers global businesses like Apple, Spotify, Netflix and Sony. San Jose Convention Center
 September 7-9, 2016 https://cassandrasummit.org/ Build Something Disruptive