DatasourceV2 and Cassandra
A Whole New World!
Russell Spitzer
Software Engineer @ Datastax
Can we make it better?
• Software Engineer
• New Orleans
• Worked on Spark since 1.0
• Cassandra Test Engineer
• Spark Cassandra Connector
• Other analytics integrations
Why Do We Need DSV2
▪ Users need to know too much
▪ Spark ends up knowing too little
DataSourceV1 - RDD Backed
▪ Black Box
▪ Difficult Primitives
▪ No Catalog Integration
(Illustration: the Cave of RDDs)
DataSourceV2 - An Abstraction of a DataSource
▪ Separation of Concerns
▪ Friendly Interaction with Catalyst
▪ Extensible
▪ Native Catalogs
The Cassandra Catalog
Catalog Features
▪ Infinite Catalogs Allowed
▪ Properties per Catalog
▪ Automatic Pickup of all C* Tables
▪ Always up to date
▪ Create Tables and Keyspaces Directly from SQL
Catalog Architecture
▪ Catalog Connects to Cassandra
▪ Driver Metadata Populates Catalog
▪ Metadata updated whenever the cluster changes
▪ Always up to date
▪ Can Change Cassandra Cluster

(Diagram: SparkSession → Cassandra Catalog → Cassandra Connector → Cassandra Cluster, one such chain per configured catalog)
Catalog Setup
▪ Add SCC via --packages
▪ Configure in the Spark Session or SparkConf

Setup options: SparkConf.set, session.conf.set, session.sql("SET …"), or JDBC: SET …

spark.sql.catalog.mycluster = com.datastax.spark.connector.datasource.CassandraCatalog
spark.sql.catalog.mycluster.prop = key
spark.sql.defaultCatalog = mycluster
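Put together, the launch might look like the following sketch. The artifact coordinates and version are assumptions; substitute the SCC release that matches your Spark version.

```shell
# Launch a Spark shell with the Spark Cassandra Connector on the classpath
# and a Cassandra catalog configured up front.
# NOTE: the artifact version below is an assumption, not a recommendation.
spark-shell \
  --packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0 \
  --conf spark.sql.catalog.mycluster=com.datastax.spark.connector.datasource.CassandraCatalog \
  --conf spark.sql.defaultCatalog=mycluster
```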
Setting up and Using a Catalog
A Basic Example
SET spark.sql.catalog.mycluster = com.datastax.spark.connector.datasource.CassandraCatalog
SET spark.sql.defaultCatalog = mycluster

SHOW DATABASES LIKE 'created*'
+--------------------+
|           namespace|
+--------------------+
|created_in_cassandra|
+--------------------+

SHOW TABLES IN created_in_cassandra
+--------------------+---------+
|           namespace|tableName|
+--------------------+---------+
|created_in_cassandra|     dogs|
+--------------------+---------+

SELECT * FROM created_in_cassandra.dogs
+--------+---+
|    name|age|
+--------+---+
|    Cara| 13|
|Sundance| 13|
+--------+---+
Inspecting Table Metadata
DESCRIBE NAMESPACE EXTENDED created_in_cassandra
+--------------+--------------------------------------------------------------------------------------------------+
|name          |value                                                                                             |
+--------------+--------------------------------------------------------------------------------------------------+
|Namespace Name|created_in_cassandra                                                                              |
|Description   |null                                                                                              |
|Location      |null                                                                                              |
|Properties    |((durable_writes,true),(class,org.apache.cassandra.locator.SimpleStrategy),(replication_factor,1))|
+--------------+--------------------------------------------------------------------------------------------------+

DESCRIBE TABLE created_in_cassandra.dogs
+--------------------+--------------------+-------+
|            col_name|           data_type|comment|
+--------------------+--------------------+-------+
|                name|              string|       |
|                 age|                 int|       |
|                    |                    |       |
|      # Partitioning|                    |       |
|              Part 0|                name|       |
+--------------------+--------------------+-------+
Inspecting Table Metadata
SHOW TBLPROPERTIES created_in_cassandra.dogs
+--------------------+--------------------+
| key| value|
+--------------------+--------------------+
| crc_check_chance| 1.0|
| compression|{chunk_length_in_...|
| clustering_key| []|
| max_index_interval| 2048|
| compaction|{class=org.apache...|
| gc_grace_seconds| 864000|
| extensions| {}|
|bloom_filter_fp_c...| 0.01|
| caching|{keys=ALL, rows_p...|
|dclocal_read_repa...| 0.1|
| min_index_interval| 128|
| speculative_retry| 99PERCENTILE|
| comment| |
|default_time_to_live| 0|
| read_repair_chance| 0.0|
|memtable_flush_pe...| 0|
+--------------------+--------------------+
Setting up Multiple Catalogs for one Cluster
SET spark.sql.catalog.fast = com.datastax.spark.connector.datasource.CassandraCatalog
SET spark.sql.catalog.fast.spark.cassandra.output.throughputMBPerSec = 10
SET spark.sql.catalog.slow = com.datastax.spark.connector.datasource.CassandraCatalog
SET spark.sql.catalog.slow.spark.cassandra.output.throughputMBPerSec = 0.01



INSERT INTO slow.created_in_cassandra.dogs SELECT * from fast.created_in_cassandra.dogs

A more complicated example
(Diagram: one SparkSession with Fast and Slow catalogs, each using its own Cassandra Connector, both pointed at the same Cassandra Cluster)
Setting up Multiple Catalogs for Multiple Clusters
SET spark.sql.catalog.clustera = com.datastax.spark.connector.datasource.CassandraCatalog
SET spark.sql.catalog.clustera.spark.cassandra.connection.host = 127.0.0.1
SET spark.sql.catalog.clusterb = com.datastax.spark.connector.datasource.CassandraCatalog
SET spark.sql.catalog.clusterb.spark.cassandra.connection.host = 127.0.0.2



INSERT INTO clusterb.created_in_cassandra.dogs SELECT * from clustera.created_in_cassandra.dogs

Maximum Complexity
(Diagram: one SparkSession; the clustera connector points at the Tokyo cluster at 127.0.0.1, and the clusterb connector points at the Kyoto cluster at 127.0.0.2)
Create Tables - Create Keyspaces
CREATE DATABASE IF NOT EXISTS created_in_spark WITH DBPROPERTIES
  (class='SimpleStrategy', replication_factor='1')

CREATE TABLE created_in_spark.ages (age INT, name STRING)
  USING cassandra
  PARTITIONED BY (age)
  TBLPROPERTIES (clustering_key='name')

INSERT INTO created_in_spark.ages SELECT * FROM created_in_cassandra.dogs

SELECT * from created_in_spark.ages
Maximum Cool

Spark (almost CQL):
+---+--------+
|age|    name|
+---+--------+
| 13|    Cara|
| 13|Sundance|
+---+--------+

CQL:
cqlsh> SELECT * from created_in_spark.ages;

 age | name
-----+----------
  13 |     Cara
  13 | Sundance
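The mapping behind the scenes can be sketched in plain Python (this is illustrative, not the connector's actual code): the `PARTITIONED BY` columns become the CQL partition key, and the `clustering_key` table property becomes the CQL clustering columns.

```python
# Illustrative sketch (NOT the connector's actual implementation) of how a
# Spark CREATE TABLE ... USING cassandra statement maps onto CQL DDL.
def to_cql_create(keyspace, table, columns, partitioned_by, tblproperties):
    # clustering_key may list several comma-separated clustering columns
    clustering = [c.strip() for c in
                  tblproperties.get("clustering_key", "").split(",") if c.strip()]
    cols = ", ".join(f"{name} {ctype}" for name, ctype in columns)
    pk = f"({', '.join(partitioned_by)})"          # partition key
    if clustering:                                 # append clustering columns
        pk = f"({pk}, {', '.join(clustering)})"
    return f"CREATE TABLE {keyspace}.{table} ({cols}, PRIMARY KEY {pk})"

cql = to_cql_create("created_in_spark", "ages",
                    [("age", "int"), ("name", "text")],
                    ["age"], {"clustering_key": "name"})
# → "CREATE TABLE created_in_spark.ages (age int, name text, PRIMARY KEY ((age), name))"
```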
All Cassandra Table Options are Available
ALTER TABLE created_in_spark.ages SET TBLPROPERTIES
  (gc_grace_seconds = '22')

SHOW TBLPROPERTIES created_in_spark.ages (gc_grace_seconds)
+----------------+-----+
|key             |value|
+----------------+-----+
|gc_grace_seconds|22   |
+----------------+-----+

ALTER TABLE created_in_spark.ages SET TBLPROPERTIES
  (compaction='{class=SizeTieredCompactionStrategy,bucket_high=1001}')

SHOW TBLPROPERTIES created_in_spark.ages (compaction)
+----------+------------------------------------------------------------------------------------------+
|key       |value                                                                                     |
+----------+------------------------------------------------------------------------------------------+
|compaction|{bucket_high=1001, class=org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy,|
|          | max_threshold=32, min_threshold=4}                                                       |
+----------+------------------------------------------------------------------------------------------+
DSV2 Performance Enhancements
Data Sources as a Black Box
• Datasources Provide RDDs
• Writing Provides RDDs to Datasource
• Filters / Pushdowns up to implementer to modify RDD directly
• Spark Doesn't Know any of the Details of the RDD
Partitioning
• Separates Different Ideas
• Table can create Readers or Writers if Supported
• Can report its own partitioning
• Readers and Writers extensible with other capabilities

(Diagram: a Table exposes Properties and Partitioning, and produces a Reader and a Writer)
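This separation of concerns can be sketched in plain Python. The class and field names below are illustrative stand-ins, not the actual Spark DSV2 Java interfaces:

```python
# Illustrative sketch (plain Python, NOT Spark's actual DSV2 interfaces) of
# the separation of concerns: a Table exposes properties and partitioning,
# and hands out independent readers/writers only when it supports them.

class Partitioning:
    """How the source's data is already split, so Spark can avoid shuffles."""
    def __init__(self, key_columns, num_partitions):
        self.key_columns = key_columns
        self.num_partitions = num_partitions

class Reader:
    """Reads batches; extensible with capabilities such as filter pushdown."""
    def __init__(self, table):
        self.table = table

class Writer:
    """Writes batches back to the source."""
    def __init__(self, table):
        self.table = table

class Table:
    def __init__(self, name, properties, partitioning):
        self.name = name
        self.properties = properties      # e.g. Cassandra table options
        self.partitioning = partitioning  # reported back to the planner

    def new_reader(self):
        return Reader(self)

    def new_writer(self):
        return Writer(self)

ages = Table("created_in_spark.ages", {"clustering_key": "name"},
             Partitioning(["age"], 2))
```

Because the table reports its own partitioning, the planner can see that data is already grouped by `age` and plan around it, as the next slides show.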
Partitioning
• SELECT DISTINCT age

(Diagram: Created in Spark Ages Table, ages 1-4 split across Part 1 and Part 2)

*(1) HashAggregate(keys=[age#1174], functions=[])
+- *(1) HashAggregate(keys=[age#1174], functions=[])
+- *(1) Project [age#1174]
+- BatchScan[age#1174] Cassandra Scan created_in_spark.ages
Partitioning
• SELECT DISTINCT age
• SELECT DISTINCT name

(Diagram: names such as bob, sue, wen, and jack repeat across both Part 1 and Part 2 of the Created in Spark Ages Table, while each age value lives in only one partition)

*(2) HashAggregate(keys=[name#1171], functions=[])
+- Exchange hashpartitioning(name#1171, 200), true, [id=#256]
+- *(1) HashAggregate(keys=[name#1171], functions=[])
+- *(1) Project [name#1171]
+- BatchScan[name#1171] Cassandra Scan created_in_spark.ages

*(1) HashAggregate(keys=[age#1174], functions=[])
+- *(1) HashAggregate(keys=[age#1174], functions=[])
+- *(1) Project [age#1174]
+- BatchScan[age#1174] Cassandra Scan created_in_spark.ages
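A toy model in plain Python shows why the two plans differ. The data below is illustrative; rows are pre-split by the partition key `age`, so every row with a given age lives in exactly one partition, while the same `name` can appear in many:

```python
# Toy model of partition-key-aware aggregation (illustrative data).
partitions = [
    [{"age": 1, "name": "bob"}, {"age": 1, "name": "sue"},
     {"age": 2, "name": "bob"}],
    [{"age": 3, "name": "sue"}, {"age": 4, "name": "doug"}],
]

# DISTINCT age: deduplicate inside each partition and concatenate.
# No cross-partition merge (no Exchange) is needed, because the
# partitioning guarantees no age value straddles two partitions.
distinct_age = [a for part in partitions
                for a in {row["age"] for row in part}]

# DISTINCT name: per-partition dedup is not enough ("bob" and "sue"
# appear in both partitions), so the partial results must be shuffled
# and merged -- the Exchange hashpartitioning step in the plan above.
partials = [{row["name"] for row in part} for part in partitions]
distinct_name = set().union(*partials)
```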
Partitioning
• Future - Clustering as well
• Partition function itself?
• …
New OSS Features from Datastax
spark.sql.extensions = com.datastax.spark.connector.CassandraSparkExtensions
DirectJoin in Datasources
• JoinWithCassandraTable Support in Catalyst
• Avoids Full Scans

(Diagram: partition keys flow from Spark directly to the Cassandra Cluster)

== Physical Plan ==
*(2) Project [name#215, age#216, age#110, name#111]
+- Cassandra Direct Join [age = age#216] created_in_spark.ages - Reading (age, name) Pushed {}
+- *(1) Project [name#215, age#216]
+- *(1) Filter isnotnull(age#216)
+- BatchScan[name#215, age#216] Cassandra Scan created_in_cassandra.dogs
   Server Side Filters []
   Columns [name,age]
TTL/Writetime Support
• Built into SparkSQL (with Extensions)
• Functional Support Also Provided
• SELECT TTL(age), WRITETIME(age) FROM created_in_cassandra.dogs
+--------+----------------+
|TTL(age)| WRITETIME(age)|
+--------+----------------+
| 49933|1589753892022515|
| null|1589745922353001|
| null|1589745922353000|
+--------+----------------+
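The semantics of those two functions can be modeled in plain Python (a toy model, not the connector's code): `WRITETIME` is the cell's insertion timestamp in microseconds since the epoch, and `TTL` is the remaining seconds until the cell expires, or null when no TTL was set.

```python
import time

# Toy model of the per-cell metadata surfaced by TTL() and WRITETIME().
def cell(value, ttl_seconds=None, now_us=None):
    now_us = now_us if now_us is not None else int(time.time() * 1_000_000)
    return {"value": value, "writetime": now_us, "ttl": ttl_seconds}

def remaining_ttl(cell, now_us):
    if cell["ttl"] is None:
        return None  # WRITETIME is still reported; TTL comes back null
    elapsed_s = (now_us - cell["writetime"]) // 1_000_000
    return max(cell["ttl"] - elapsed_s, 0)

# A cell written with a 50,000-second TTL, queried 67 seconds later:
c = cell(13, ttl_seconds=50_000, now_us=1_589_753_892_022_515)
remaining_ttl(c, 1_589_753_892_022_515 + 67 * 1_000_000)  # → 49933
```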
InClause to Join
• Selecting many values in an In Clause is Translated to a Direct Join

KeyA in (1,2,3 … 100)
KeyB in (1,2,3 … 100)
KeyC in (1,2,3 … 100)

CQL - KeyA, B, C in (1 ………….)
The old method generated one very large statement
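The scale problem is easy to see with a small sketch in plain Python (illustrative lists, not connector code): a single statement with IN clauses on all three key columns must cover the full cross product of the lists.

```python
# Sketch of why one giant IN-clause statement blows up (illustrative data).
from itertools import product

key_a = [1, 2, 3]
key_b = [1, 2, 3]
key_c = [1, 2, 3]

# The old approach bound every list into one CQL statement:
#   SELECT ... WHERE KeyA IN (...) AND KeyB IN (...) AND KeyC IN (...)
# whose predicate covers the full cross product of the lists.
combos = list(product(key_a, key_b, key_c))  # 3 * 3 * 3 = 27 key tuples

# With 100 values per key, as on the slide, that is 100**3 = 1,000,000
# combinations in a single statement. The direct-join translation instead
# materializes the key tuples as a dataset and joins it against the table,
# so each tuple becomes an ordinary partition-key lookup spread across
# executors.
```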
InClause to Join
• Selecting many values in an In Clause is Translated to Direct Join

KeyA in (1,2,3 … 100)
KeyB in (1,2,3 … 100)
KeyC in (1,2,3 … 100)

(Diagram: the key values are permuted into tuples and joined directly against the Cassandra Cluster)
Astra / Java Driver 4.0
More To Come!
Future Plans
• Spark Cassandra Connector 2.5.0
  • Long Term Bugfix Branch for Spark 2.X
  • Contains all DSE Features!
• Spark Cassandra Connector 3.0.0
  • Spark 3.0 Compatibility
  • DSV2 Catalog and Read/Write
  • Deletes and more to come!
OSS needs you!
• github.com/datastax/spark-cassandra-connector
• community.datastax.com
• Our Mailing List:
  https://groups.google.com/a/lists.datastax.com/forum/#!forum/spark-connector-user
Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.