Cassandra And Spark Dataframes
Russell Spitzer
Software Engineer @ Datastax
Tungsten Gives Dataframes Off-Heap Power!
Memory can be compared off-heap and bitwise!
Code generation!
The Core is the Cassandra Source
https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra

/**
 * Implements [[BaseRelation]], [[InsertableRelation]] and [[PrunedFilteredScan]].
 * It inserts data to and scans a Cassandra table. If filterPushdown is true,
 * it pushes down some filters to CQL.
 */

DataFrame -> org.apache.spark.sql.cassandra source -> CassandraSourceRelation -> CassandraTableScanRDD
Configuration Can Be Done on a Per-Source Level
clusterName:keyspaceName/propertyName

Example: Changing Cluster/Keyspace Level Properties

val conf = new SparkConf()
  .set("ClusterOne/spark.cassandra.input.split.size_in_mb", "32")
  .set("default:test/spark.cassandra.input.split.size_in_mb", "128")

val lastdf = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map(
    "table" -> "words",
    "keyspace" -> "test",
    "cluster" -> "ClusterOne"
  ))
  .load()

Namespace: ClusterOne
spark.cassandra.input.split.size_in_mb=32

Namespace: default
Keyspace: test
spark.cassandra.input.split.size_in_mb=128

A source reading with cluster "ClusterOne" picks up the 32 mb split size; the same
read with cluster "default" and keyspace "test" picks up the 128 mb split size; a
read against cluster "default" and any other keyspace falls back to the connector
default.
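These scoped keys can also be set at runtime rather than on the SparkConf; a sketch assuming the Spark 1.x SQLContext.setConf API, mirroring the properties above:

sqlContext.setConf("ClusterOne/spark.cassandra.input.split.size_in_mb", "32")
sqlContext.setConf("default:test/spark.cassandra.input.split.size_in_mb", "128")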
Predicate Pushdown Is Automatic!
SELECT * FROM cassandraTable WHERE clusteringKey > 100

DataFrame (DataFromC*) -> Filter (clusteringKey > 100) -> Show

Catalyst spots the filter above the Cassandra scan, and the source ANDs
"clusteringKey > 100" into the CQL where clause.
https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra/PredicatePushDown.scala
What can be pushed down?
1. Only push down non-partition key column predicates with =, >, <, >=, <= predicates.
2. Only push down primary key column predicates with = or IN predicates.
3. If there are regular columns in the pushdown predicates, they should have at least one EQ
expression on an indexed column and no IN predicates.
4. All partition column predicates must be included in the predicates to be pushed down; only
the last part of the partition key can be an IN predicate. For each partition column, only one
predicate is allowed.
5. For clustering column predicates, only the last predicate can be a non-EQ predicate
(including IN), and the preceding column predicates must be EQ predicates.
6. If there is only one clustering column predicate, it can be any non-IN predicate.
No predicates are pushed down if there is any OR condition or NOT IN condition.
7. We're not allowed to push down multiple predicates for the same column if any of them is an
equality or IN predicate.
What can be pushed down?
In short: if you could write it in CQL, it will get pushed down.
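A quick way to verify a pushdown is to print the DataFrame's physical plan; a sketch where the column name is an assumption, and exactly how pushed filters are displayed varies by Spark version:

df.filter("clusteringkey > 100").explain()
// The Cassandra scan node in the printed plan should carry the filter
// (shown as PushedFilters in newer Spark versions) instead of a
// separate Spark-side Filter step.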
What are we Pushing Down To?
CassandraTableScanRDD
All of the underlying code is the same as with sc.cassandraTable,
so everything about Reading and Writing applies.
https://academy.datastax.com/
Watch me talk about this in the privacy of your own home!
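Since the source bottoms out in CassandraTableScanRDD, the RDD API exposes the same machinery directly; a sketch where keyspace, table, and column names are assumptions:

import com.datastax.spark.connector._ // adds cassandraTable to SparkContext

val rdd = sc.cassandraTable("test", "words") // a CassandraTableScanRDD
rdd.select("word", "count")  // server-side projection
   .where("count > ?", 10)   // server-side filter (needs an indexed or clustering column)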
How the
Spark Cassandra Connector
Reads Data
Spark RDDs
Represent a Large
Amount of Data
Partitioned into Chunks
[Diagram: an RDD of nine numbered partitions distributed across Node 1 through Node 4]
Cassandra Data is Distributed By Token Range
[Diagram: a token ring running from 0 through 500 to 999, divided among Node 1
through Node 4. Without vnodes each node owns one contiguous range; with vnodes
each node owns many small ranges scattered around the ring.]
The Connector Uses Information on the Node to Make Spark Partitions

spark.cassandra.input.split_size_in_mb  1
Reported density is 100 tokens per mb

[Diagram: Node 1 owns token ranges 120-220, 300-500, 780-830, and 0-50. At a
split size of 1 mb and a density of 100 tokens per mb, the connector groups and
splits these ranges into Spark partitions of roughly 100 tokens each: 120-220
becomes partition 1, 300-500 is split into partition 2 (300-400) and partition 3
(400-500), and 780-830 together with 0-50 becomes partition 4.]
Data is Retrieved Using the DataStax Java Driver

spark.cassandra.input.page.row.size  50

[Diagram: Spark partition 4, covering token ranges 780-830 and 0-50 on Node 1]

SELECT * FROM keyspace.table WHERE
token(pk) > 780 and token(pk) <= 830

SELECT * FROM keyspace.table WHERE
token(pk) > 0 and token(pk) <= 50

The driver pages the results back 50 CQL rows at a time until each token range
query is exhausted.
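Both input knobs from this walkthrough are ordinary Spark properties; a minimal sketch with the example's values, using the property spellings from the earlier configuration slide:

val conf = new SparkConf()
  .set("spark.cassandra.input.split.size_in_mb", "1")
  .set("spark.cassandra.input.page.row.size", "50")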
How The Spark
Cassandra Connector
Writes Data
Spark RDDs
Represent a Large
Amount of Data
Partitioned into Chunks
[Diagram: the same RDD of nine partitions distributed across Node 1 through Node 4]
The Spark Cassandra Connector saveToCassandra
method can be called on almost all RDDs
rdd.saveToCassandra("Keyspace", "Table")
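saveToCassandra arrives via the connector's implicits; a minimal sketch, where the keyspace, table, column names, and row tuples are assumptions:

import com.datastax.spark.connector._ // enables saveToCassandra on RDDs

val rows = sc.parallelize(Seq((1, 1, 1), (1, 2, 1), (2, 1, 1)))
rows.saveToCassandra("test", "kv", SomeColumns("pk", "ck", "value"))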
A Java Driver connection is made to the local node
and a prepared statement is built for the target table.

Batches are built from data in Spark partitions.

spark.cassandra.output.batch.grouping.key      partition
spark.cassandra.output.batch.size.rows         4
spark.cassandra.output.batch.buffer.size       3
spark.cassandra.output.concurrent.writes       2
spark.cassandra.output.throughput_mb_per_sec   5

By default these batches only contain CQL rows
which share the same partition key.

When an element is not part of an existing batch,
a new batch is started.

If a batch reaches batch.size.rows or batch.size.bytes,
it is executed by the driver.

If more than batch.buffer.size batches are currently being built,
the largest batch is executed by the Java Driver.

If more batches are currently being executed by the Java Driver than
concurrent.writes, we wait until one of the requests has been completed.

The last parameter, throughput_mb_per_sec, blocks further batches if we
have written more than that much in the past second.

[Diagram: on Node 1, rows such as (1,1,1), (1,2,1), and (3,4,1) stream from a
Spark partition into per-partition-key batches (PK=1, PK=2, PK=3, ...), which
the Java Driver executes and acknowledges.]
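All five output knobs from the walkthrough are ordinary Spark properties; a minimal sketch with the values used above:

val conf = new SparkConf()
  .set("spark.cassandra.output.batch.grouping.key", "partition")
  .set("spark.cassandra.output.batch.size.rows", "4")
  .set("spark.cassandra.output.batch.buffer.size", "3")
  .set("spark.cassandra.output.concurrent.writes", "2")
  .set("spark.cassandra.output.throughput_mb_per_sec", "5")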
Thanks for Coming, and I Hope You Have a Great Time at C* Summit
http://cassandrasummit-datastax.com/agenda/the-spark-cassandra-connector-past-present-and-future/
Also ask these guys really hard questions:
Jacek, Piotr, Alex
Navigating the Metaverse: A Journey into Virtual Evolution"Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"
Donna Lenk
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
Paco van Beckhoven
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
Georgi Kodinov
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
Globus
 
Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
Matt Welsh
 

Recently uploaded (20)

A Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdfA Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdf
 
Accelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with PlatformlessAccelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with Platformless
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good Practices
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
 
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
 
Cyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdfCyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdf
 
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdfEnhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
 
Graphic Design Crash Course for beginners
Graphic Design Crash Course for beginnersGraphic Design Crash Course for beginners
Graphic Design Crash Course for beginners
 
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
 
Into the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfInto the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdf
 
Using IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New ZealandUsing IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New Zealand
 
Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
 
Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
 

Spark Cassandra Connector Dataframes

  • 1. Cassandra And Spark Dataframes Russell Spitzer Software Engineer @ Datastax
  • 2-5. Cassandra And Spark Dataframes
  • 6. Tungsten Gives Dataframes OffHeap Power! Can compare memory off-heap and bitwise! Code generation!
  • 7. The Core is the Cassandra Source https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra /** Implements [[BaseRelation]], [[InsertableRelation]] and [[PrunedFilteredScan]]. It inserts data into and scans a Cassandra table. If filterPushdown is true, it pushes down some filters to CQL. */ DataFrame source org.apache.spark.sql.cassandra
  • 8. The Core is the Cassandra Source https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra /** Implements [[BaseRelation]], [[InsertableRelation]] and [[PrunedFilteredScan]]. It inserts data into and scans a Cassandra table. If filterPushdown is true, it pushes down some filters to CQL. */ DataFrame CassandraSourceRelation CassandraTableScanRDD Configuration
  • 9-14. Configuration Can Be Done on a Per Source Level: clusterName:keyspaceName/propertyName. Example changing cluster/keyspace level properties:

val conf = new SparkConf()
  .set("ClusterOne/spark.cassandra.input.split.size_in_mb", "32")
  .set("default:test/spark.cassandra.input.split.size_in_mb", "128")

val lastdf = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map(
    "table"    -> "words",
    "keyspace" -> "test",
    "cluster"  -> "ClusterOne"
  )).load()

Reading with cluster = "ClusterOne" picks up the ClusterOne namespace value (split size 32). Reading with cluster = "default" and keyspace = "test" picks up the default:test value (128). Reading with cluster = "default" and keyspace = "other" matches neither namespace and falls back to the Connector Default.
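The frames above exercise only the cluster and cluster:keyspace levels. A minimal sketch of all three levels, assuming the 1.x connector's most-specific-wins resolution (cluster and keyspace names are illustrative):

import org.apache.spark.SparkConf

// Most specific wins: clusterName:keyspaceName/property, then clusterName/property,
// then the plain property as the connector-wide default.
val conf = new SparkConf()
  .set("spark.cassandra.input.split.size_in_mb", "64")                  // connector-wide default
  .set("ClusterOne/spark.cassandra.input.split.size_in_mb", "32")       // everything on ClusterOne
  .set("ClusterOne:test/spark.cassandra.input.split.size_in_mb", "16")  // only keyspace test on ClusterOne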
  • 15-19. Predicate Pushdown Is Automatic! Select * From cassandraTable where clusteringKey > 100: Catalyst takes the logical plan (DataFromC* -> Filter clusteringKey > 100 -> Show) and, applying the rules in https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra/PredicatePushDown.scala, rewrites the Filter step into a CQL where clause ("clusteringKey > 100") on the Cassandra scan itself.
  • 20. What can be pushed down?
  1. Only push down non-partition-key column predicates that use =, >, <, >=, or <=.
  2. Only push down primary key column predicates that use = or IN.
  3. If regular columns appear in the pushdown predicates, there must be at least one EQ expression on an indexed column and no IN predicates.
  4. All partition-column predicates must be included in the predicates to be pushed down; only the last part of the partition key can use IN. For each partition column, only one predicate is allowed.
  5. For clustering-column predicates, only the last predicate can be a non-EQ predicate (including IN); the preceding column predicates must be EQ.
  6. If there is only one clustering-column predicate, it can be any non-IN predicate. Nothing is pushed down if there is any OR or NOT IN condition.
  7. Multiple predicates on the same column cannot be pushed down if any of them is an equality or IN predicate.
  • 21. What can be pushed down? If you could write in CQL it will get pushed down.
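To watch the pushdown happen, here is a minimal sketch (keyspace, table, and column names are hypothetical) that inspects the plan Catalyst produces; in recent Spark versions the pushed predicates are listed on the scan in the physical plan:

val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "events", "keyspace" -> "test"))
  .load()

// A clustering-column range predicate: expressible in CQL, so eligible for pushdown
val filtered = df.filter("clusteringKey > 100")

// The predicate should appear as a pushed filter on the Cassandra scan rather than
// only as a Spark-side Filter step
filtered.explain(true)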
  • 22. What are we Pushing Down To? CassandraTableScanRDD. All of the underlying code is the same as with sc.cassandraTable, so everything about reading and writing RDDs applies.
  • 23. What are we Pushing Down To? CassandraTableScanRDD. All of the underlying code is the same as with sc.cassandraTable, so everything about reading and writing RDDs applies. https://academy.datastax.com/ (watch me talk about this in the privacy of your own home!)
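For reference, a minimal sketch of the RDD API the DataFrame source delegates to, using the word-count table from earlier in the deck:

import com.datastax.spark.connector._

val rdd = sc.cassandraTable("test", "words") // a CassandraTableScanRDD
  .select("word", "count")                   // column pruning, like a DataFrame projection
  .where("count > ?", 10)                    // appended to the CQL, like a pushed-down filter
rdd.collect().foreach(println)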
  • 24. How the Spark Cassandra Connector Reads Data
  • 25-27. Spark RDDs Represent a Large Amount of Data Partitioned into Chunks: the RDD's chunks (1-9) are distributed across Node 1 through Node 4.
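A minimal sketch for inspecting that chunking from the driver, using only standard RDD methods (table names as above):

import com.datastax.spark.connector._

val rdd = sc.cassandraTable("test", "words")
println(s"Spark partitions: ${rdd.partitions.length}")
// Each partition prefers the Cassandra replicas that own its token ranges
rdd.partitions.take(3).foreach { p =>
  println(s"partition ${p.index} -> ${rdd.preferredLocations(p)}")
}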
  • 28-33. Cassandra Data is Distributed By Token Range: the token ring (0-999 in this example) is divided among Node 1 through Node 4, as contiguous ranges without vnodes or as many small ranges per node with vnodes.
  • 34-46. The Connector Uses Information on the Node to Make Spark Partitions. With spark.cassandra.input.split_size_in_mb = 1 and a reported density of 100 tokens per MB, the token ranges on Node 1 (120-220, 300-500, 780-830, 0-50) are grouped into Spark partitions; ranges that are too large are split (300-500 becomes 300-400 and 400-500) until the node's data is covered by partitions 1 through 4, each about one split in size.
  • 47-64. Data is Retrieved Using the DataStax Java Driver. With spark.cassandra.input.page.row.size = 50, Spark partition 4 (token ranges 780-830 and 0-50 on Node 1) is read with queries like SELECT * FROM keyspace.table WHERE token(pk) > 780 and token(pk) <= 830 and SELECT * FROM keyspace.table WHERE token(pk) > 0 and token(pk) <= 50, with results paged back 50 CQL rows at a time until each range is exhausted.
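The page size is the second read-side knob: it controls how many rows each token-range query fetches per round trip. A minimal sketch:

import org.apache.spark.SparkConf

// Under the hood each Spark partition issues queries like
//   SELECT * FROM keyspace.table WHERE token(pk) > 0 AND token(pk) <= 50
// and pages through the results this many CQL rows at a time.
val conf = new SparkConf()
  .set("spark.cassandra.input.page.row.size", "1000")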
  • 65. How The Spark Cassandra Connector Writes Data
  • 66-67. Spark RDDs Represent a Large Amount of Data Partitioned into Chunks: the RDD's chunks are distributed across Node 1 through Node 4.
  • 68. The Spark Cassandra Connector saveToCassandra method can be called on almost all RDDs: rdd.saveToCassandra("Keyspace","Table")
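A minimal runnable sketch of that call, assuming a table test.words(word text PRIMARY KEY, count int) matching the word-count example:

import com.datastax.spark.connector._

// Tuples map positionally onto the listed columns; case classes map by field name
val rows = sc.parallelize(Seq(("cat", 30), ("fox", 40)))
rows.saveToCassandra("test", "words", SomeColumns("word", "count"))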
  • 69. A Java Driver connection is made to the local node and a prepared statement is built for the target table.
  • 70. Batches are built from data in Spark partitions.
  • 71-72. By default these batches only contain CQL rows that share the same partition key. (Settings used throughout slides 71-95: spark.cassandra.output.batch.grouping.key = partition, spark.cassandra.output.batch.size.rows = 4, spark.cassandra.output.batch.buffer.size = 3, spark.cassandra.output.concurrent.writes = 2, spark.cassandra.output.throughput_mb_per_sec = 5.)
  • 73-75. When an element is not part of an existing batch, a new batch is started.
  • 76-79. If a batch reaches batch.size.rows or batch.size.bytes, it is executed by the driver.
  • 80-83. If more than batch.buffer.size batches are being built at once, the largest batch is executed by the Java Driver.
  • 84-89. If more batches are currently being executed by the Java Driver than concurrent.writes, we wait until one of the requests completes.
  • 90-95. The last parameter, throughput_mb_per_sec, blocks further batches once we have written more than that amount in the past second.
  • 96. Thanks for coming, and I hope you have a great time at C* Summit! http://cassandrasummit-datastax.com/agenda/the-spark-cassandra-connector-past-present-and-future/ Also ask these guys really hard questions: Jacek, Piotr, Alex