DatasourceV2 and Cassandra
A Whole New World!
Russell Spitzer
Software Engineer @ Datastax
Can we make it better?
• Software Engineer
• New Orleans
• Worked on Spark since 1.0
• Cassandra Test Engineer
• Spark Cassandra Connector
• Other analytics integrations
Why Do We Need DSV2
▪ Users need to know too much
▪ Spark ends up knowing too little
DataSourceV1 - RDD Backed
▪ Black Box
▪ Difficult Primitives
▪ No Catalog Integration
(Illustration: the Cave of RDDs)
DataSourceV2 - An Abstraction of a DataSource
▪ Separation of Concerns
▪ Friendly Interaction with Catalyst
▪ Extensible
▪ Native Catalogs
The Cassandra Catalog
Catalog Features
▪ Infinite Catalogs Allowed
▪ Properties per Catalog
▪ Automatic Pickup of all C* Tables
▪ Always up to date
▪ Create Tables and Keyspaces Directly from SQL
Catalog Architecture
▪ Catalog Connects to Cassandra
▪ Driver Metadata Populates Catalog
▪ Metadata updated whenever the cluster changes
▪ Always up to date
▪ Can Change Cassandra Cluster

(Diagram: SparkSession → Cassandra Catalog → Cassandra Connector → Cassandra Cluster, one such chain per configured catalog)
Catalog Setup
▪ Add SCC via --packages
▪ Configure in the Spark Session or SparkConf

Setup options: SparkConf.set, session.conf.set, session.sql("SET …"), or JDBC: SET …

spark.sql.catalog.mycluster = com.datastax.spark.connector.datasource.CassandraCatalog
spark.sql.catalog.mycluster.prop = key
spark.sql.defaultCatalog = mycluster
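Put together, the launch might look like the following sketch. The artifact coordinates and version are assumptions; substitute the SCC release that matches your Spark version.

```shell
# Launch a Spark shell with the Spark Cassandra Connector on the classpath
# and a Cassandra catalog configured up front.
# NOTE: the artifact version below is an assumption, not a recommendation.
spark-shell \
  --packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0 \
  --conf spark.sql.catalog.mycluster=com.datastax.spark.connector.datasource.CassandraCatalog \
  --conf spark.sql.defaultCatalog=mycluster
```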
Setting up and Using a Catalog
A Basic Example
SET spark.sql.catalog.mycluster = com.datastax.spark.connector.datasource.CassandraCatalog
SET spark.sql.defaultCatalog = mycluster

SHOW DATABASES LIKE 'created*'
+--------------------+
|           namespace|
+--------------------+
|created_in_cassandra|
+--------------------+

SHOW TABLES IN created_in_cassandra
+--------------------+---------+
|           namespace|tableName|
+--------------------+---------+
|created_in_cassandra|     dogs|
+--------------------+---------+

SELECT * FROM created_in_cassandra.dogs
+--------+---+
|    name|age|
+--------+---+
|    Cara| 13|
|Sundance| 13|
+--------+---+
Inspecting Table Metadata
DESCRIBE NAMESPACE EXTENDED created_in_cassandra
+--------------+--------------------------------------------------------------------------------------------------+
|name          |value                                                                                             |
+--------------+--------------------------------------------------------------------------------------------------+
|Namespace Name|created_in_cassandra                                                                              |
|Description   |null                                                                                              |
|Location      |null                                                                                              |
|Properties    |((durable_writes,true),(class,org.apache.cassandra.locator.SimpleStrategy),(replication_factor,1))|
+--------------+--------------------------------------------------------------------------------------------------+

DESCRIBE TABLE created_in_cassandra.dogs
+--------------------+--------------------+-------+
|            col_name|           data_type|comment|
+--------------------+--------------------+-------+
|                name|              string|       |
|                 age|                 int|       |
|                    |                    |       |
|      # Partitioning|                    |       |
|              Part 0|                name|       |
+--------------------+--------------------+-------+
Inspecting Table Metadata
SHOW TBLPROPERTIES created_in_cassandra.dogs
+--------------------+--------------------+
| key| value|
+--------------------+--------------------+
| crc_check_chance| 1.0|
| compression|{chunk_length_in_...|
| clustering_key| []|
| max_index_interval| 2048|
| compaction|{class=org.apache...|
| gc_grace_seconds| 864000|
| extensions| {}|
|bloom_filter_fp_c...| 0.01|
| caching|{keys=ALL, rows_p...|
|dclocal_read_repa...| 0.1|
| min_index_interval| 128|
| speculative_retry| 99PERCENTILE|
| comment| |
|default_time_to_live| 0|
| read_repair_chance| 0.0|
|memtable_flush_pe...| 0|
+--------------------+--------------------+
Setting up Multiple Catalogs for one Cluster
SET spark.sql.catalog.fast = com.datastax.spark.connector.datasource.CassandraCatalog
SET spark.sql.catalog.fast.spark.cassandra.output.throughputMBPerSec = 10
SET spark.sql.catalog.slow = com.datastax.spark.connector.datasource.CassandraCatalog
SET spark.sql.catalog.slow.spark.cassandra.output.throughputMBPerSec = 0.01



INSERT INTO slow.created_in_cassandra.dogs SELECT * from fast.created_in_cassandra.dogs

A more complicated example
(Diagram: one SparkSession with Fast and Slow catalogs, each using its own Cassandra Connector, both pointed at the same Cassandra Cluster)
Setting up Multiple Catalogs for Multiple Clusters
SET spark.sql.catalog.clustera = com.datastax.spark.connector.datasource.CassandraCatalog
SET spark.sql.catalog.clustera.spark.cassandra.connection.host = 127.0.0.1
SET spark.sql.catalog.clusterb = com.datastax.spark.connector.datasource.CassandraCatalog
SET spark.sql.catalog.clusterb.spark.cassandra.connection.host = 127.0.0.2



INSERT INTO clusterb.created_in_cassandra.dogs SELECT * from clustera.created_in_cassandra.dogs

Maximum Complexity
(Diagram: one SparkSession; the clustera connector points at the Tokyo cluster at 127.0.0.1, and the clusterb connector points at the Kyoto cluster at 127.0.0.2)
Create Tables - Create Keyspaces
CREATE DATABASE IF NOT EXISTS created_in_spark WITH DBPROPERTIES
  (class='SimpleStrategy', replication_factor='1')

CREATE TABLE created_in_spark.ages (age INT, name STRING)
  USING cassandra
  PARTITIONED BY (age)
  TBLPROPERTIES (clustering_key='name')

INSERT INTO created_in_spark.ages SELECT * FROM created_in_cassandra.dogs

SELECT * from created_in_spark.ages
Maximum Cool

Spark (almost CQL):
+---+--------+
|age|    name|
+---+--------+
| 13|    Cara|
| 13|Sundance|
+---+--------+

CQL:
cqlsh> SELECT * from created_in_spark.ages;

 age | name
-----+----------
  13 |     Cara
  13 | Sundance
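The mapping behind the scenes can be sketched in plain Python (this is illustrative, not the connector's actual code): the `PARTITIONED BY` columns become the CQL partition key, and the `clustering_key` table property becomes the CQL clustering columns.

```python
# Illustrative sketch (NOT the connector's actual implementation) of how a
# Spark CREATE TABLE ... USING cassandra statement maps onto CQL DDL.
def to_cql_create(keyspace, table, columns, partitioned_by, tblproperties):
    # clustering_key may list several comma-separated clustering columns
    clustering = [c.strip() for c in
                  tblproperties.get("clustering_key", "").split(",") if c.strip()]
    cols = ", ".join(f"{name} {ctype}" for name, ctype in columns)
    pk = f"({', '.join(partitioned_by)})"          # partition key
    if clustering:                                 # append clustering columns
        pk = f"({pk}, {', '.join(clustering)})"
    return f"CREATE TABLE {keyspace}.{table} ({cols}, PRIMARY KEY {pk})"

cql = to_cql_create("created_in_spark", "ages",
                    [("age", "int"), ("name", "text")],
                    ["age"], {"clustering_key": "name"})
# → "CREATE TABLE created_in_spark.ages (age int, name text, PRIMARY KEY ((age), name))"
```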
All Cassandra Table Options are Available
ALTER TABLE created_in_spark.ages SET TBLPROPERTIES
  (gc_grace_seconds = '22')

SHOW TBLPROPERTIES created_in_spark.ages (gc_grace_seconds)
+----------------+-----+
|key             |value|
+----------------+-----+
|gc_grace_seconds|22   |
+----------------+-----+

ALTER TABLE created_in_spark.ages SET TBLPROPERTIES
  (compaction='{class=SizeTieredCompactionStrategy,bucket_high=1001}')

SHOW TBLPROPERTIES created_in_spark.ages (compaction)
+----------+------------------------------------------------------------------------------------------+
|key       |value                                                                                     |
+----------+------------------------------------------------------------------------------------------+
|compaction|{bucket_high=1001, class=org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy,|
|          | max_threshold=32, min_threshold=4}                                                       |
+----------+------------------------------------------------------------------------------------------+
DSV2 Performance Enhancements
Data Sources as a Black Box
• Datasources Provide RDDs
• Writing Provides RDDs to Datasource
• Filters / Pushdowns up to implementer to modify RDD directly
• Spark Doesn't Know any of the Details of the RDD
Partitioning
• Separates Different Ideas
• Table can create Readers or Writers if Supported
• Can report its own partitioning
• Readers and Writers extensible with other capabilities

(Diagram: a Table exposes Properties and Partitioning, and produces a Reader and a Writer)
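This separation of concerns can be sketched in plain Python. The class and field names below are illustrative stand-ins, not the actual Spark DSV2 Java interfaces:

```python
# Illustrative sketch (plain Python, NOT Spark's actual DSV2 interfaces) of
# the separation of concerns: a Table exposes properties and partitioning,
# and hands out independent readers/writers only when it supports them.

class Partitioning:
    """How the source's data is already split, so Spark can avoid shuffles."""
    def __init__(self, key_columns, num_partitions):
        self.key_columns = key_columns
        self.num_partitions = num_partitions

class Reader:
    """Reads batches; extensible with capabilities such as filter pushdown."""
    def __init__(self, table):
        self.table = table

class Writer:
    """Writes batches back to the source."""
    def __init__(self, table):
        self.table = table

class Table:
    def __init__(self, name, properties, partitioning):
        self.name = name
        self.properties = properties      # e.g. Cassandra table options
        self.partitioning = partitioning  # reported back to the planner

    def new_reader(self):
        return Reader(self)

    def new_writer(self):
        return Writer(self)

ages = Table("created_in_spark.ages", {"clustering_key": "name"},
             Partitioning(["age"], 2))
```

Because the table reports its own partitioning, the planner can see that data is already grouped by `age` and plan around it, as the next slides show.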
Partitioning
• SELECT DISTINCT age

(Diagram: Created in Spark Ages Table, ages 1-4 split across Part 1 and Part 2)

*(1) HashAggregate(keys=[age#1174], functions=[])
+- *(1) HashAggregate(keys=[age#1174], functions=[])
+- *(1) Project [age#1174]
+- BatchScan[age#1174] Cassandra Scan created_in_spark.ages
Partitioning
• SELECT DISTINCT age
• SELECT DISTINCT name

(Diagram: names such as bob, sue, wen, and jack repeat across both Part 1 and Part 2 of the Created in Spark Ages Table, while each age value lives in only one partition)

*(2) HashAggregate(keys=[name#1171], functions=[])
+- Exchange hashpartitioning(name#1171, 200), true, [id=#256]
+- *(1) HashAggregate(keys=[name#1171], functions=[])
+- *(1) Project [name#1171]
+- BatchScan[name#1171] Cassandra Scan created_in_spark.ages

*(1) HashAggregate(keys=[age#1174], functions=[])
+- *(1) HashAggregate(keys=[age#1174], functions=[])
+- *(1) Project [age#1174]
+- BatchScan[age#1174] Cassandra Scan created_in_spark.ages
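A toy model in plain Python shows why the two plans differ. The data below is illustrative; rows are pre-split by the partition key `age`, so every row with a given age lives in exactly one partition, while the same `name` can appear in many:

```python
# Toy model of partition-key-aware aggregation (illustrative data).
partitions = [
    [{"age": 1, "name": "bob"}, {"age": 1, "name": "sue"},
     {"age": 2, "name": "bob"}],
    [{"age": 3, "name": "sue"}, {"age": 4, "name": "doug"}],
]

# DISTINCT age: deduplicate inside each partition and concatenate.
# No cross-partition merge (no Exchange) is needed, because the
# partitioning guarantees no age value straddles two partitions.
distinct_age = [a for part in partitions
                for a in {row["age"] for row in part}]

# DISTINCT name: per-partition dedup is not enough ("bob" and "sue"
# appear in both partitions), so the partial results must be shuffled
# and merged -- the Exchange hashpartitioning step in the plan above.
partials = [{row["name"] for row in part} for part in partitions]
distinct_name = set().union(*partials)
```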
Partitioning
• Future - Clustering as well
• Partition function itself?
• …
New OSS Features from Datastax
spark.sql.extensions = com.datastax.spark.connector.CassandraSparkExtensions
DirectJoin in Datasources
• JoinWithCassandraTable Support in Catalyst
• Avoids Full Scans

(Diagram: partition keys flow from Spark directly to the Cassandra Cluster)

== Physical Plan ==
*(2) Project [name#215, age#216, age#110, name#111]
+- Cassandra Direct Join [age = age#216] created_in_spark.ages - Reading (age, name) Pushed {}
+- *(1) Project [name#215, age#216]
+- *(1) Filter isnotnull(age#216)
+- BatchScan[name#215, age#216] Cassandra Scan created_in_cassandra.dogs
   Server Side Filters []
   Columns [name,age]
TTL/Writetime Support
• Built into SparkSQL (with Extensions)
• Functional Support Also Provided
• SELECT TTL(age), WRITETIME(age) FROM created_in_cassandra.dogs
+--------+----------------+
|TTL(age)| WRITETIME(age)|
+--------+----------------+
| 49933|1589753892022515|
| null|1589745922353001|
| null|1589745922353000|
+--------+----------------+
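The semantics of those two functions can be modeled in plain Python (a toy model, not the connector's code): `WRITETIME` is the cell's insertion timestamp in microseconds since the epoch, and `TTL` is the remaining seconds until the cell expires, or null when no TTL was set.

```python
import time

# Toy model of the per-cell metadata surfaced by TTL() and WRITETIME().
def cell(value, ttl_seconds=None, now_us=None):
    now_us = now_us if now_us is not None else int(time.time() * 1_000_000)
    return {"value": value, "writetime": now_us, "ttl": ttl_seconds}

def remaining_ttl(cell, now_us):
    if cell["ttl"] is None:
        return None  # WRITETIME is still reported; TTL comes back null
    elapsed_s = (now_us - cell["writetime"]) // 1_000_000
    return max(cell["ttl"] - elapsed_s, 0)

# A cell written with a 50,000-second TTL, queried 67 seconds later:
c = cell(13, ttl_seconds=50_000, now_us=1_589_753_892_022_515)
remaining_ttl(c, 1_589_753_892_022_515 + 67 * 1_000_000)  # → 49933
```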
InClause to Join
• Selecting many values in an In Clause is Translated to a Direct Join

KeyA in (1,2,3 … 100)
KeyB in (1,2,3 … 100)
KeyC in (1,2,3 … 100)

CQL - KeyA, B, C in (1 ………….)
The old method generated one very large statement
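The scale problem is easy to see with a small sketch in plain Python (illustrative lists, not connector code): a single statement with IN clauses on all three key columns must cover the full cross product of the lists.

```python
# Sketch of why one giant IN-clause statement blows up (illustrative data).
from itertools import product

key_a = [1, 2, 3]
key_b = [1, 2, 3]
key_c = [1, 2, 3]

# The old approach bound every list into one CQL statement:
#   SELECT ... WHERE KeyA IN (...) AND KeyB IN (...) AND KeyC IN (...)
# whose predicate covers the full cross product of the lists.
combos = list(product(key_a, key_b, key_c))  # 3 * 3 * 3 = 27 key tuples

# With 100 values per key, as on the slide, that is 100**3 = 1,000,000
# combinations in a single statement. The direct-join translation instead
# materializes the key tuples as a dataset and joins it against the table,
# so each tuple becomes an ordinary partition-key lookup spread across
# executors.
```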
InClause to Join
• Selecting many values in an In Clause is Translated to Direct Join

KeyA in (1,2,3 … 100)
KeyB in (1,2,3 … 100)
KeyC in (1,2,3 … 100)

(Diagram: the key values are permuted into tuples and joined directly against the Cassandra Cluster)
Astra / Java Driver 4.0
More To Come!
Future Plans
• Spark Cassandra Connector 2.5.0
  • Long Term Bugfix Branch for Spark 2.X
  • Contains all DSE Features!
• Spark Cassandra Connector 3.0.0
  • Spark 3.0 Compatibility
  • DSV2 Catalog and Read/Write
  • Deletes and more to come!
OSS needs you!
• github.com/datastax/spark-cassandra-connector
• community.datastax.com
• Our Mailing List:
  https://groups.google.com/a/lists.datastax.com/forum/#!forum/spark-connector-user
Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.