SlideShare a Scribd company logo
Maximum Overdrive:
Tuning the Spark Cassandra Connector
Russell Spitzer, Datastax
© DataStax, All Rights Reserved.
Who is this guy and why should I listen to him?
2
Russell Spitzer, Passing Software Engineer
•Been working at DataStax since 2013
•Worked in Test Engineering and now Analytics Dev
•Have been working with Spark since 0.9
•Working with Cassandra since 1.2
•Main focus is the Spark Cassandra Connector
•Surgically grafted to the Spark Cassandra Connector Mailing List
© DataStax, All Rights Reserved.
The Spark Cassandra Connector
Connects Spark to Cassandra
3
It's all there in the name
•Provides a DataSource for Datasets/DataFrames
•Provides methods for Writing DataSets/Data Frames
•Reading and Writing RDD
•Connection Pooling
•Type Conversions and Mapping
•Data Locality
•Open Source Software!
https://github.com/datastax/spark-cassandra-connector
© DataStax, All Rights Reserved. 4
WARNING: THIS TALK WILL CONTAIN TECHNICAL DETAILS AND EXPLICIT
SCALA
DISTRIBUTED SYSTEMS
Tuning the Spark Cassandra Connector
DISTRIBUTED SYSTEMS
© DataStax, All Rights Reserved.
1 Lots of Write Tuning
2 A Bit of Read Tuning
5
© DataStax, All Rights Reserved. 6
Context is Very Important
Knowing your Data is Key for Maximum Performance
© DataStax, All Rights Reserved.
Write Tuning in the SCC Is all about
Batching
7
Batches aren't
good for
performance in
Cassandra.
Not when the writes
within the batch are in
the same Partition and
they are unlogged!
I keep telling you this!
© DataStax, All Rights Reserved.
Multi-Partition Key Batches put load on the
Coordinator
8
THE BATCH
Cassandra
Cluster
Row Row
Row
Row
© DataStax, All Rights Reserved.
Multi-Partition Key Batches put load on the
Coordinator
9
Cassandra
Cluster
R
o
w
R
o
w
R
o
w
R
o
w
A batch moves as a single entity
to the Coordinator for that write
This batch has to sit there until
all the portions of it get confirmed
at their set consistency level
© DataStax, All Rights Reserved.
Multi-Partition Key Batches put load on the
Coordinator
10
Cassandra
Cluster
R
o
w
R
o
w
R
o
w
R
o
w
Even when some portions of the
batch finish early we have to wait
until the entire thing is done before
we can respond to the client.
© DataStax, All Rights Reserved.
We end up with a lot of rows just sitting around in
memory waiting for others to get out of the way
11
© DataStax, All Rights Reserved.
Single Partition Batches are Treated as
A Single Mutation in Cassandra
12
THE BATCH
Row Row Row
RowRow Row
Row Row Row
Row Row Row
Cassandra
Cluster
© DataStax, All Rights Reserved.
Single Partition Batches are Treated as
A Single Mutation in Cassandra
13
R
o
w
R
o
w
R
o
w
R
o
w
R
o
w
R
o
w
R
o
w
R
o
w
R
o
w
R
o
w
R
o
w
R
o
w
Cassandra
Cluster
Now the entire batch can be
treated as a single mutation. We
also only have to wait for one set
of replicas
© DataStax, All Rights Reserved.
When all of the Rows are Going to the Same Place
Writing to Cassandra is Fast
14
The Connector Will Automatically Batch
Writes
15
rdd.saveToCassandra("bestkeyspace", "besttable")
df.write
.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "besttable", "keyspace" -> "bestkeyspace"))
.save()
import org.apache.spark.sql.cassandra._
df.write
.cassandraFormat("besttable", "bestkeyspace")
.save()
RDD
DataFrame
By default batching happens on
Identical Partition Key
16
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters
WriteConf(batchGroupingKey= ?)
Change it as a SparkConf or DataFrame Parameter
Or directly pass in a WriteConf
Batches are Placed in Holding Until Certain
Thresholds are hit
17
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters
Batches are Placed in Holding Until Certain
Thresholds are hit
18
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters
output.batch.grouping.buffer.size
output.batch.size.bytes / output.batch.size.rows
output.concurrent.writes
output.consistency.level
Batches are Placed in Holding Until Certain
Thresholds are hit
19
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters
output.batch.grouping.buffer.size
output.batch.size.bytes / output.batch.size.rows
output.concurrent.writes
output.consistency.level
Batches are Placed in Holding Until Certain
Thresholds are hit
20
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters
output.batch.grouping.buffer.size
output.batch.size.bytes / output.batch.size.rows
output.concurrent.writes
output.consistency.level
Batches are Placed in Holding Until Certain
Thresholds are hit
21
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters
output.batch.grouping.buffer.size
output.batch.size.bytes / output.batch.size.rows
output.concurrent.writes
output.consistency.level
Spark Cassandra Stress for Running Basic
Benchmarks
22
https://github.com/datastax/spark-cassandra-stress
Running Benchmarks on a bunch of AWS Machines,
5 M3.2XLarge
DSE 5.0.1
Spark 1.6.1
Spark CC 1.6.0
RF = 3
2 M Writes/ 100K C* Partitions and 400 Spark Partitions
Caveat: Don't benchmark exactly like this
I'm making some bad decisions to to make some broad points
Depending on your use case Sorting within Partitions
can Greatly Increase Write Performance
23
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters
28
77
0
20
40
60
80
100
Rows Out of Order Rows In Order
Default Conf kOps/s
Grouping on Partition Key
The Safest Thing You Can Do
Including everything in the Batch
24
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters
28
77
69
125
0
32.5
65
97.5
130
162.5
Rows Out of Order Rows In Order
Default Conf kOps/s No Batch Key
May be Safe For Short Durations
BUT WILL LEAD TO SYSTEM
INSTABILITY
Grouping on Replica Set
25
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters
28
77
70
125
0
32.5
65
97.5
130
162.5
Rows Out of Order Rows In Order
Default Conf kOps/s
Grouped on Replica Set
Safer, But still will put
extra load on the Coordinator
Remember the Tortoise vs the Hare
26
Overwhelming Cassandra will slow you down
Limit the amount of writes per executor : output.throughput_mb_per_sec
Limit maximum executor cores : spark.max.cores
Lower concurrency : output.concurrent.writes
DEPENDING ON DISK PERFORMANCE YOUR
INITIAL SPEEDS IN BENCHMARKING MAY
NOT BE SUSTAINABLE
For Example Lets run with Batch Key None for a
Longer Test (20M writes)
27
[Stage 0:=========================> (191 + 15) / 400]WARN 2016-08-19 21:11:55,817 org.apache.spark.scheduler.TaskS
at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:166)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:134)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:110)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:109)
at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:139)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:109)
at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:134)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
For Example Lets run with Batch Key None for a
Longer Test (20M writes)
28
[Stage 0:=========================> (191 + 15) / 400]WARN 2016-08-19 21:11:55,817 org.apache.spark.scheduler.TaskS
at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:166)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:134)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:110)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:109)
at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:139)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:109)
at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:134)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Back to Default PartitionKey Batching
29
28
77
39
190
0
50
100
150
200
Rows Out of Order Rows In Order
Default Conf kOps/s 10X Length Run
So why are we doing so
much better over a longer
run?
Back to Default PartitionKey Batching
30
28
77
39
190
0
50
100
150
200
Rows Out of Order Rows In Order
Default Conf kOps/s 10X Length Run
400 Spark Partitions in Both Cases
2M/ 400 = 5000
20M / 400 = 50000
Having Too Many Partitions will Slow Down your
Writes
31
Every task has Setup and Teardown and
we can only build up good batches if there
are enough elements to build them from
Depending on your use case Sorting within Partitions
can Greatly Increase Write Performance
32
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters
39
190
0
50
100
150
200
Rows Out of Order Rows In Order
10X Length Run
A spark sort on partition key
may speed up your total operation
by several fold
Maximizing performance for out of Order Writes or
No Clustering Keys
33
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters
39
190
47
59
0
50
100
150
200
Rows Out of Order Rows In Order
10X Length Run Modified Conf kOps/s
Turn Off Batching
Increase Concurrency
spark.cassandra.output.batch.size.rows 1
spark.cassandra.output.concurrent.writes 2000
Maximizing performance for out of Order Writes or
No Clustering Keys
34
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters
39
190
47
59
0
50
100
150
200
Rows Out of Order Rows In Order
10X Length Run Modified Conf kOps/s
Turn Off Batching
Increase Concurrency
spark.cassandra.output.batch.size.rows 1
spark.cassandra.output.concurrent.writes 2000
Single
Partition
Batches are
good I keep
telling you!
This turns the connector into a Multi-Machine Cassandra
Loader (Basically just executeAsync as fast as possible)
35
https://github.com/brianmhess/cassandra-loader
Now Let's Talk About Reading!
36
Read Tuning mostly About Partitioning
37
• RDDs are a large Dataset Broken Into Bits,
• These bits are call Partitions
• Cassandra Partitions != Spark Partitions
• Spark Partitions are sized based on the estimated data size of the underlying C* table
• input.split.size_in_mb
TokenRange
Spark Partitions
OOMs Caused by Spark Partitions Holding Too Much
Data
38
Executor JVM Heap
Core 1
Core 2
Core 3
As a general rule of thumb your Executor should be
set to hold
Number of Cores * Size of Partition * 1.2
See a lot of GC? OOM? Increase the amount of partitions
Some Caveats
• We don't know the actual partition size until runtime
• Cassandra on disk memory usage != in memory size
OOMs Caused by Spark Partitions Holding Too Much
Data
39
Executor JVM Heap
Core 1
Core 2
Core 3
input.split.size_in_mb 64
Approx amount of data to be fetched into a
Spark partition. Minimum number of resulting
Spark partitions is
1 + 2 * SparkContext.defaultParallelism
split.size_in_mb compares uses the system table size_esitmates
to determine how many Cassandra Partitions should be in a
Spark Partition.
Due to Compression and Inflation, the actual in memory size
can be much larger
Certain Queries can't be broken Up
40
• Hot Spots Make a Spark Partition OOM
• Full C* Partition in Spark Partiton
Certain Queries can't be broken Up
41
• Hot Spots Make a Spark Partition OOM
• Full C* Partition in Spark Partiton
• Single Partition Lookups
• Can't do anything about this
• Don't know how partition is distributed
Certain Queries can't be broken Up
42
• Hot Spots Make a Spark Partition OOM
• Full C* Partition in Spark Partiton
• Single Partition Lookups
• Can't do anything about this
• Don't know how partition is distributed
• IN clauses
• Replace with JoinWithCassandraTable
• If all else fails use CassandraConnector
Read speed is mostly dictated by Cassandra's
Paging Speed
43
input.fetch.size_in_rows 1000 Number of CQL rows fetched per driver request
Cassandra of the Future, As Fast as CSV!?!
44
https://issues.apache.org/jira/browse/CASSANDRA-9259 :
Bulk Reading from Cassandra
Stefania Alborghetti
45
The End
Don't Let it End Like That!
Contribute to the Spark Cassandra Connector
46
• OSS Project that loves community involvement
• Bug Reports
• Feature Requests
• Write Code
• Doc Improvements
• Come join us!
https://github.com/datastax/spark-cassandra-connector
See you on the mailing list!
https://github.com/datastax/spark-cassandra-connector

More Related Content

What's hot

HDFS on Kubernetes—Lessons Learned with Kimoon Kim
HDFS on Kubernetes—Lessons Learned with Kimoon KimHDFS on Kubernetes—Lessons Learned with Kimoon Kim
HDFS on Kubernetes—Lessons Learned with Kimoon Kim
Databricks
 
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenApache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Databricks
 
Kubernetes at Datadog the very hard way
Kubernetes at Datadog the very hard wayKubernetes at Datadog the very hard way
Kubernetes at Datadog the very hard way
Laurent Bernaille
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
Spark Summit
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
Simplilearn
 
Living the Stream Dream with Pulsar and Spring Boot
Living the Stream Dream with Pulsar and Spring BootLiving the Stream Dream with Pulsar and Spring Boot
Living the Stream Dream with Pulsar and Spring Boot
Timothy Spann
 
Deep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.xDeep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.x
Databricks
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
Databricks
 
PySpark Best Practices
PySpark Best PracticesPySpark Best Practices
PySpark Best Practices
Cloudera, Inc.
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query ProcessingApache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
Hortonworks
 
Oracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data Streaming
Oracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data StreamingOracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data Streaming
Oracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data Streaming
Michael Rainey
 
Tuning for Oracle RAC Wait Events
Tuning for Oracle RAC Wait EventsTuning for Oracle RAC Wait Events
Tuning for Oracle RAC Wait Events
Confio Software
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Noritaka Sekiyama
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
Amazon Web Services
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Edureka!
 
Staying Ahead of the Curve with Spring and Cassandra 4
Staying Ahead of the Curve with Spring and Cassandra 4Staying Ahead of the Curve with Spring and Cassandra 4
Staying Ahead of the Curve with Spring and Cassandra 4
VMware Tanzu
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
Databricks
 
Dongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of FlinkDongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of Flink
Flink Forward
 
Dependency Injection in Apache Spark Applications
Dependency Injection in Apache Spark ApplicationsDependency Injection in Apache Spark Applications
Dependency Injection in Apache Spark Applications
Databricks
 

What's hot (20)

HDFS on Kubernetes—Lessons Learned with Kimoon Kim
HDFS on Kubernetes—Lessons Learned with Kimoon KimHDFS on Kubernetes—Lessons Learned with Kimoon Kim
HDFS on Kubernetes—Lessons Learned with Kimoon Kim
 
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenApache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
 
Kubernetes at Datadog the very hard way
Kubernetes at Datadog the very hard wayKubernetes at Datadog the very hard way
Kubernetes at Datadog the very hard way
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
Living the Stream Dream with Pulsar and Spring Boot
Living the Stream Dream with Pulsar and Spring BootLiving the Stream Dream with Pulsar and Spring Boot
Living the Stream Dream with Pulsar and Spring Boot
 
Deep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.xDeep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.x
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
PySpark Best Practices
PySpark Best PracticesPySpark Best Practices
PySpark Best Practices
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query ProcessingApache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 
Oracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data Streaming
Oracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data StreamingOracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data Streaming
Oracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data Streaming
 
Tuning for Oracle RAC Wait Events
Tuning for Oracle RAC Wait EventsTuning for Oracle RAC Wait Events
Tuning for Oracle RAC Wait Events
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
 
Staying Ahead of the Curve with Spring and Cassandra 4
Staying Ahead of the Curve with Spring and Cassandra 4Staying Ahead of the Curve with Spring and Cassandra 4
Staying Ahead of the Curve with Spring and Cassandra 4
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
Dongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of FlinkDongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of Flink
 
Dependency Injection in Apache Spark Applications
Dependency Injection in Apache Spark ApplicationsDependency Injection in Apache Spark Applications
Dependency Injection in Apache Spark Applications
 

Similar to Maximum Overdrive: Tuning the Spark Cassandra Connector

Spark Cassandra Connector Dataframes
Spark Cassandra Connector DataframesSpark Cassandra Connector Dataframes
Spark Cassandra Connector Dataframes
Russell Spitzer
 
Quick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skillsQuick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skills
Ravindra kumar
 
Managing Cassandra at Scale by Al Tobey
Managing Cassandra at Scale by Al TobeyManaging Cassandra at Scale by Al Tobey
Managing Cassandra at Scale by Al Tobey
DataStax Academy
 
The Apache Cassandra ecosystem
The Apache Cassandra ecosystemThe Apache Cassandra ecosystem
The Apache Cassandra ecosystem
Alex Thompson
 
Nyc summit intro_to_cassandra
Nyc summit intro_to_cassandraNyc summit intro_to_cassandra
Nyc summit intro_to_cassandra
zznate
 
Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...
Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...
Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...
DataStax Academy
 
Apache cassandra and spark. you got the the lighter, let's start the fire
Apache cassandra and spark. you got the the lighter, let's start the fireApache cassandra and spark. you got the the lighter, let's start the fire
Apache cassandra and spark. you got the the lighter, let's start the fire
Patrick McFadin
 
TechEvent Apache Cassandra
TechEvent Apache CassandraTechEvent Apache Cassandra
TechEvent Apache Cassandra
Trivadis
 
Scaling opensimulator inventory using nosql
Scaling opensimulator inventory using nosqlScaling opensimulator inventory using nosql
Scaling opensimulator inventory using nosql
David Daeschler
 
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Cassandra and Spark, closing the gap between no sql and analytics   codemotio...Cassandra and Spark, closing the gap between no sql and analytics   codemotio...
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Duyhai Doan
 
Spark Cassandra Connector: Past, Present and Furure
Spark Cassandra Connector: Past, Present and FurureSpark Cassandra Connector: Past, Present and Furure
Spark Cassandra Connector: Past, Present and Furure
DataStax Academy
 
Large Scale Data Analytics with Spark and Cassandra on the DSE Platform
Large Scale Data Analytics with Spark and Cassandra on the DSE PlatformLarge Scale Data Analytics with Spark and Cassandra on the DSE Platform
Large Scale Data Analytics with Spark and Cassandra on the DSE Platform
DataStax Academy
 
C* Summit EU 2013: Mixing Batch and Real-Time: Cassandra with Shark
C* Summit EU 2013: Mixing Batch and Real-Time: Cassandra with Shark C* Summit EU 2013: Mixing Batch and Real-Time: Cassandra with Shark
C* Summit EU 2013: Mixing Batch and Real-Time: Cassandra with Shark
DataStax Academy
 
Mixing Batch and Real-time: Cassandra with Shark (Cassandra Europe 2013)
Mixing Batch and Real-time: Cassandra with Shark (Cassandra Europe 2013)Mixing Batch and Real-time: Cassandra with Shark (Cassandra Europe 2013)
Mixing Batch and Real-time: Cassandra with Shark (Cassandra Europe 2013)
Richard Low
 
Cassandra Summit 2014: Lesser Known Features of Cassandra 2.1
Cassandra Summit 2014: Lesser Known Features of Cassandra 2.1Cassandra Summit 2014: Lesser Known Features of Cassandra 2.1
Cassandra Summit 2014: Lesser Known Features of Cassandra 2.1
DataStax Academy
 
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
StampedeCon
 
Breakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and SparkBreakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and Spark
Evan Chan
 
Analyzing_Data_with_Spark_and_Cassandra
Analyzing_Data_with_Spark_and_CassandraAnalyzing_Data_with_Spark_and_Cassandra
Analyzing_Data_with_Spark_and_CassandraRich Beaudoin
 
Storage cassandra
Storage   cassandraStorage   cassandra
Storage cassandraPL dream
 
Oscon 2019 - Optimizing analytical queries on Cassandra by 100x
Oscon 2019 - Optimizing analytical queries on Cassandra by 100xOscon 2019 - Optimizing analytical queries on Cassandra by 100x
Oscon 2019 - Optimizing analytical queries on Cassandra by 100x
shradha ambekar
 

Similar to Maximum Overdrive: Tuning the Spark Cassandra Connector (20)

Spark Cassandra Connector Dataframes
Spark Cassandra Connector DataframesSpark Cassandra Connector Dataframes
Spark Cassandra Connector Dataframes
 
Quick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skillsQuick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skills
 
Managing Cassandra at Scale by Al Tobey
Managing Cassandra at Scale by Al TobeyManaging Cassandra at Scale by Al Tobey
Managing Cassandra at Scale by Al Tobey
 
The Apache Cassandra ecosystem
The Apache Cassandra ecosystemThe Apache Cassandra ecosystem
The Apache Cassandra ecosystem
 
Nyc summit intro_to_cassandra
Nyc summit intro_to_cassandraNyc summit intro_to_cassandra
Nyc summit intro_to_cassandra
 
Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...
Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...
Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...
 
Apache cassandra and spark. you got the the lighter, let's start the fire
Apache cassandra and spark. you got the the lighter, let's start the fireApache cassandra and spark. you got the the lighter, let's start the fire
Apache cassandra and spark. you got the the lighter, let's start the fire
 
TechEvent Apache Cassandra
TechEvent Apache CassandraTechEvent Apache Cassandra
TechEvent Apache Cassandra
 
Scaling opensimulator inventory using nosql
Scaling opensimulator inventory using nosqlScaling opensimulator inventory using nosql
Scaling opensimulator inventory using nosql
 
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Cassandra and Spark, closing the gap between no sql and analytics   codemotio...Cassandra and Spark, closing the gap between no sql and analytics   codemotio...
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
 
Spark Cassandra Connector: Past, Present and Furure
Spark Cassandra Connector: Past, Present and FurureSpark Cassandra Connector: Past, Present and Furure
Spark Cassandra Connector: Past, Present and Furure
 
Large Scale Data Analytics with Spark and Cassandra on the DSE Platform
Large Scale Data Analytics with Spark and Cassandra on the DSE PlatformLarge Scale Data Analytics with Spark and Cassandra on the DSE Platform
Large Scale Data Analytics with Spark and Cassandra on the DSE Platform
 
C* Summit EU 2013: Mixing Batch and Real-Time: Cassandra with Shark
C* Summit EU 2013: Mixing Batch and Real-Time: Cassandra with Shark C* Summit EU 2013: Mixing Batch and Real-Time: Cassandra with Shark
C* Summit EU 2013: Mixing Batch and Real-Time: Cassandra with Shark
 
Mixing Batch and Real-time: Cassandra with Shark (Cassandra Europe 2013)
Mixing Batch and Real-time: Cassandra with Shark (Cassandra Europe 2013)Mixing Batch and Real-time: Cassandra with Shark (Cassandra Europe 2013)
Mixing Batch and Real-time: Cassandra with Shark (Cassandra Europe 2013)
 
Cassandra Summit 2014: Lesser Known Features of Cassandra 2.1
Cassandra Summit 2014: Lesser Known Features of Cassandra 2.1Cassandra Summit 2014: Lesser Known Features of Cassandra 2.1
Cassandra Summit 2014: Lesser Known Features of Cassandra 2.1
 
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
 
Breakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and SparkBreakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and Spark
 
Analyzing_Data_with_Spark_and_Cassandra
Analyzing_Data_with_Spark_and_CassandraAnalyzing_Data_with_Spark_and_Cassandra
Analyzing_Data_with_Spark_and_Cassandra
 
Storage cassandra
Storage   cassandraStorage   cassandra
Storage cassandra
 
Oscon 2019 - Optimizing analytical queries on Cassandra by 100x
Oscon 2019 - Optimizing analytical queries on Cassandra by 100xOscon 2019 - Optimizing analytical queries on Cassandra by 100x
Oscon 2019 - Optimizing analytical queries on Cassandra by 100x
 

More from Russell Spitzer

Cassandra and Spark SQL
Cassandra and Spark SQLCassandra and Spark SQL
Cassandra and Spark SQL
Russell Spitzer
 
Tale of Two Graph Frameworks: Graph Frames and Tinkerpop
Tale of Two Graph Frameworks: Graph Frames and TinkerpopTale of Two Graph Frameworks: Graph Frames and Tinkerpop
Tale of Two Graph Frameworks: Graph Frames and Tinkerpop
Russell Spitzer
 
Spark and Cassandra 2 Fast 2 Furious
Spark and Cassandra 2 Fast 2 FuriousSpark and Cassandra 2 Fast 2 Furious
Spark and Cassandra 2 Fast 2 Furious
Russell Spitzer
 
Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and Future
Russell Spitzer
 
Cassandra and Spark: Optimizing for Data Locality
Cassandra and Spark: Optimizing for Data LocalityCassandra and Spark: Optimizing for Data Locality
Cassandra and Spark: Optimizing for Data Locality
Russell Spitzer
 
Cassandra and IoT
Cassandra and IoTCassandra and IoT
Cassandra and IoT
Russell Spitzer
 
Zero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and CassandraZero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and Cassandra
Russell Spitzer
 
Cassandra Fundamentals - C* 2.0
Cassandra Fundamentals - C* 2.0Cassandra Fundamentals - C* 2.0
Cassandra Fundamentals - C* 2.0
Russell Spitzer
 
Escape From Hadoop: Spark One Liners for C* Ops
Escape From Hadoop: Spark One Liners for C* OpsEscape From Hadoop: Spark One Liners for C* Ops
Escape From Hadoop: Spark One Liners for C* Ops
Russell Spitzer
 

More from Russell Spitzer (9)

Cassandra and Spark SQL
Cassandra and Spark SQLCassandra and Spark SQL
Cassandra and Spark SQL
 
Tale of Two Graph Frameworks: Graph Frames and Tinkerpop
Tale of Two Graph Frameworks: Graph Frames and TinkerpopTale of Two Graph Frameworks: Graph Frames and Tinkerpop
Tale of Two Graph Frameworks: Graph Frames and Tinkerpop
 
Spark and Cassandra 2 Fast 2 Furious
Spark and Cassandra 2 Fast 2 FuriousSpark and Cassandra 2 Fast 2 Furious
Spark and Cassandra 2 Fast 2 Furious
 
Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and Future
 
Cassandra and Spark: Optimizing for Data Locality
Cassandra and Spark: Optimizing for Data LocalityCassandra and Spark: Optimizing for Data Locality
Cassandra and Spark: Optimizing for Data Locality
 
Cassandra and IoT
Cassandra and IoTCassandra and IoT
Cassandra and IoT
 
Zero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and CassandraZero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and Cassandra
 
Cassandra Fundamentals - C* 2.0
Cassandra Fundamentals - C* 2.0Cassandra Fundamentals - C* 2.0
Cassandra Fundamentals - C* 2.0
 
Escape From Hadoop: Spark One Liners for C* Ops
Escape From Hadoop: Spark One Liners for C* OpsEscape From Hadoop: Spark One Liners for C* Ops
Escape From Hadoop: Spark One Liners for C* Ops
 

Recently uploaded

SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar
 
A Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdfA Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdf
kalichargn70th171
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
Paco van Beckhoven
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
Georgi Kodinov
 
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Mind IT Systems
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
Max Andersen
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
Fermin Galan
 
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
wottaspaceseo
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
Globus
 
Using IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New ZealandUsing IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New Zealand
IES VE
 
RISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent EnterpriseRISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent Enterprise
Srikant77
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Globus
 
top nidhi software solution freedownload
top nidhi software solution freedownloadtop nidhi software solution freedownload
top nidhi software solution freedownload
vrstrong314
 
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
informapgpstrackings
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
AMB-Review
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Globus
 
First Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User EndpointsFirst Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User Endpoints
Globus
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
rickgrimesss22
 

Recently uploaded (20)

SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBroker
 
A Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdfA Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdf
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
 
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
 
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
 
Using IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New ZealandUsing IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New Zealand
 
RISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent EnterpriseRISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent Enterprise
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
 
top nidhi software solution freedownload
top nidhi software solution freedownloadtop nidhi software solution freedownload
top nidhi software solution freedownload
 
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
 
First Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User EndpointsFirst Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User Endpoints
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
 

Maximum Overdrive: Tuning the Spark Cassandra Connector

  • 1. Maximum Overdrive: Tuning the Spark Cassandra Connector Russell Spitzer, Datastax
  • 2. © DataStax, All Rights Reserved. Who is this guy and why should I listen to him? 2 Russell Spitzer, Passing Software Engineer •Been working at DataStax since 2013 •Worked in Test Engineering and now Analytics Dev •Have been working with Spark since 0.9 •Working with Cassandra since 1.2 •Main focus is the Spark Cassandra Connector •Surgically grafted to the Spark Cassandra Connector Mailing List
  • 3. © DataStax, All Rights Reserved. The Spark Cassandra Connector Connects Spark to Cassandra 3 It's all there in the name •Provides a DataSource for Datasets/DataFrames •Provides methods for Writing DataSets/Data Frames •Reading and Writing RDD •Connection Pooling •Type Conversions and Mapping •Data Locality •Open Source Software! https://github.com/datastax/spark-cassandra-connector
  • 4. © DataStax, All Rights Reserved. 4 WARNING: THIS TALK WILL CONTAIN TECHNICAL DETAILS AND EXPLICIT SCALA DISTRIBUTED SYSTEMS Tuning the Spark Cassandra Connector DISTRIBUTED SYSTEMS
  • 5. © DataStax, All Rights Reserved. 1 Lots of Write Tuning 2 A Bit of Read Tuning 5
  • 6. © DataStax, All Rights Reserved. 6 Context is Very Important Knowing your Data is Key for Maximum Performance
  • 7. © DataStax, All Rights Reserved. Write Tuning in the SCC Is all about Batching 7 Batches aren't good for performance in Cassandra. Not when the writes within the batch are in the same Partition and they are unlogged! I keep telling you this!
  • 8. © DataStax, All Rights Reserved. Multi-Partition Key Batches put load on the Coordinator 8 THE BATCH Cassandra Cluster Row Row Row Row
  • 9. © DataStax, All Rights Reserved. Multi-Partition Key Batches put load on the Coordinator 9 Cassandra Cluster R o w R o w R o w R o w A batch moves as a single entity to the Coordinator for that write This batch has to sit there until all the portions of it get confirmed at their set consistency level
  • 10. © DataStax, All Rights Reserved. Multi-Partition Key Batches put load on the Coordinator 10 Cassandra Cluster R o w R o w R o w R o w Even when some portions of the batch finish early we have to wait until the entire thing is done before we can respond to the client.
  • 11. © DataStax, All Rights Reserved. We end up with a lot of rows just sitting around in memory waiting for others to get out of the way 11
  • 12. © DataStax, All Rights Reserved. Single Partition Batches are Treated as A Single Mutation in Cassandra 12 THE BATCH Row Row Row RowRow Row Row Row Row Row Row Row Cassandra Cluster
  • 13. © DataStax, All Rights Reserved. Single Partition Batches are Treated as A Single Mutation in Cassandra 13 R o w R o w R o w R o w R o w R o w R o w R o w R o w R o w R o w R o w Cassandra Cluster Now the entire batch can be treated as a single mutation. We also only have to wait for one set of replicas
  • 14. © DataStax, All Rights Reserved. When all of the Rows are Going to the Same Place Writing to Cassandra is Fast 14
  • 15. The Connector Will Automatically Batch Writes 15 rdd.saveToCassandra("bestkeyspace", "besttable") df.write .format("org.apache.spark.sql.cassandra") .options(Map("table" -> "besttable", "keyspace" -> "bestkeyspace")) .save() import org.apache.spark.sql.cassandra._ df.write .cassandraFormat("besttable", "bestkeyspace") .save() RDD DataFrame
  • 16. By default batching happens on Identical Partition Key 16 https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters WriteConf(batchGroupingKey= ?) Change it as a SparkConf or DataFrame Parameter Or directly pass in a WriteConf
  • 17. Batches are Placed in Holding Until Certain Thresholds are hit 17 https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters
  • 18. Batches are Placed in Holding Until Certain Thresholds are hit 18 https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters output.batch.grouping.buffer.size output.batch.size.bytes / output.batch.size.rows output.concurrent.writes output.consistency.level
  • 19. Batches are Placed in Holding Until Certain Thresholds are hit 19 https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters output.batch.grouping.buffer.size output.batch.size.bytes / output.batch.size.rows output.concurrent.writes output.consistency.level
  • 20. Batches are Placed in Holding Until Certain Thresholds are hit 20 https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters output.batch.grouping.buffer.size output.batch.size.bytes / output.batch.size.rows output.concurrent.writes output.consistency.level
  • 21. Batches are Placed in Holding Until Certain Thresholds are hit 21 https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters output.batch.grouping.buffer.size output.batch.size.bytes / output.batch.size.rows output.concurrent.writes output.consistency.level
  • 22. Spark Cassandra Stress for Running Basic Benchmarks 22 https://github.com/datastax/spark-cassandra-stress Running Benchmarks on a bunch of AWS Machines, 5 M3.2XLarge DSE 5.0.1 Spark 1.6.1 Spark CC 1.6.0 RF = 3 2 M Writes/ 100K C* Partitions and 400 Spark Partitions Caveat: Don't benchmark exactly like this I'm making some bad decisions to to make some broad points
  • 23. Depending on your use case Sorting within Partitions can Greatly Increase Write Performance 23 https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters 28 77 0 20 40 60 80 100 Rows Out of Order Rows In Order Default Conf kOps/s Grouping on Partition Key The Safest Thing You Can Do
  • 24. Including everything in the Batch 24 https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters 28 77 69 125 0 32.5 65 97.5 130 162.5 Rows Out of Order Rows In Order Default Conf kOps/s No Batch Key May be Safe For Short Durations BUT WILL LEAD TO SYSTEM INSTABILITY
  • 25. Grouping on Replica Set 25 https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters 28 77 70 125 0 32.5 65 97.5 130 162.5 Rows Out of Order Rows In Order Default Conf kOps/s Grouped on Replica Set Safer, But still will put extra load on the Coordinator
  • 26. Remember the Tortoise vs the Hare 26 Overwhelming Cassandra will slow you down Limit the amount of writes per executor : output.throughput_mb_per_sec Limit maximum executor cores : spark.max.cores Lower concurrency : output.concurrent.writes DEPENDING ON DISK PERFORMANCE YOUR INITIAL SPEEDS IN BENCHMARKING MAY NOT BE SUSTAINABLE
  • 27. For Example Lets run with Batch Key None for a Longer Test (20M writes) 27 [Stage 0:=========================> (191 + 15) / 400]WARN 2016-08-19 21:11:55,817 org.apache.spark.scheduler.TaskS at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:166) at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:134) at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:110) at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:109) at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:139) at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:109) at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:134) at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37) at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745)
  • 28. For Example Lets run with Batch Key None for a Longer Test (20M writes) 28 [Stage 0:=========================> (191 + 15) / 400]WARN 2016-08-19 21:11:55,817 org.apache.spark.scheduler.TaskS at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:166) at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:134) at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:110) at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:109) at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:139) at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:109) at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:134) at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37) at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745)
  • 29. Back to Default PartitionKey Batching 29 28 77 39 190 0 50 100 150 200 Rows Out of Order Rows In Order Default Conf kOps/s 10X Length Run So why are we doing so much better over a longer run?
  • 30. Back to Default PartitionKey Batching 30 28 77 39 190 0 50 100 150 200 Rows Out of Order Rows In Order Default Conf kOps/s 10X Length Run 400 Spark Partitions in Both Cases 2M/ 400 = 5000 20M / 400 = 50000
  • 31. Having Too Many Partitions will Slow Down your Writes 31 Every task has Setup and Teardown and we can only build up good batches if there are enough elements to build them from
  • 32. Depending on your use case Sorting within Partitions can Greatly Increase Write Performance 32 https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters 39 190 0 50 100 150 200 Rows Out of Order Rows In Order 10X Length Run A spark sort on partition key may speed up your total operation by several fold
  • 33. Maximizing performance for out of Order Writes or No Clustering Keys 33 https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters 39 190 47 59 0 50 100 150 200 Rows Out of Order Rows In Order 10X Length Run Modified Conf kOps/s Turn Off Batching Increase Concurrency spark.cassandra.output.batch.size.rows 1 spark.cassandra.output.concurrent.writes 2000
  • 34. Maximizing performance for out of Order Writes or No Clustering Keys 34 https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters 39 190 47 59 0 50 100 150 200 Rows Out of Order Rows In Order 10X Length Run Modified Conf kOps/s Turn Off Batching Increase Concurrency spark.cassandra.output.batch.size.rows 1 spark.cassandra.output.concurrent.writes 2000 Single Partition Batches are good I keep telling you!
  • 35. This turns the connector into a Multi-Machine Cassandra Loader (Basically just executeAsync as fast as possible) 35 https://github.com/brianmhess/cassandra-loader
  • 36. Now Let's Talk About Reading! 36
  • 37. Read Tuning mostly About Partitioning 37 • RDDs are a large Dataset Broken Into Bits, • These bits are call Partitions • Cassandra Partitions != Spark Partitions • Spark Partitions are sized based on the estimated data size of the underlying C* table • input.split.size_in_mb TokenRange Spark Partitions
  • 38. OOMs Caused by Spark Partitions Holding Too Much Data 38 Executor JVM Heap Core 1 Core 2 Core 3 As a general rule of thumb your Executor should be set to hold Number of Cores * Size of Partition * 1.2 See a lot of GC? OOM? Increase the amount of partitions Some Caveats • We don't know the actual partition size until runtime • Cassandra on disk memory usage != in memory size
  • 39. OOMs Caused by Spark Partitions Holding Too Much Data 39 Executor JVM Heap Core 1 Core 2 Core 3 input.split.size_in_mb 64 Approx amount of data to be fetched into a Spark partition. Minimum number of resulting Spark partitions is 1 + 2 * SparkContext.defaultParallelism split.size_in_mb compares uses the system table size_esitmates to determine how many Cassandra Partitions should be in a Spark Partition. Due to Compression and Inflation, the actual in memory size can be much larger
  • 40. Certain Queries can't be broken Up 40 • Hot Spots Make a Spark Partition OOM • Full C* Partition in Spark Partiton
  • 41. Certain Queries can't be broken Up 41 • Hot Spots Make a Spark Partition OOM • Full C* Partition in Spark Partiton • Single Partition Lookups • Can't do anything about this • Don't know how partition is distributed
  • 42. Certain Queries can't be broken Up 42 • Hot Spots Make a Spark Partition OOM • Full C* Partition in Spark Partiton • Single Partition Lookups • Can't do anything about this • Don't know how partition is distributed • IN clauses • Replace with JoinWithCassandraTable • If all else fails use CassandraConnector
  • 43. Read speed is mostly dictated by Cassandra's Paging Speed 43 input.fetch.size_in_rows 1000 Number of CQL rows fetched per driver request
  • 44. Cassandra of the Future, As Fast as CSV!?! 44 https://issues.apache.org/jira/browse/CASSANDRA-9259 : Bulk Reading from Cassandra Stefania Alborghetti
  • 46. Don't Let it End Like That! Contribute to the Spark Cassandra Connector 46 • OSS Project that loves community involvement • Bug Reports • Feature Requests • Write Code • Doc Improvements • Come join us! https://github.com/datastax/spark-cassandra-connector
  • 47. See you on the mailing list! https://github.com/datastax/spark-cassandra-connector