SlideShare a Scribd company logo
1 of 37
Download to read offline
Lessons Learned with
Cassandra & Spark_
Stephan Kepser
Matthias Niehoff
Cassandra Summit 2016
@matthiasniehoff
@codecentric
1
Our Use Cases
read
read
join
join
write
write
Lessons Learned with
Cassandra
! Primary key defines access way to a table
! It cannot be changed after the table is filled with data.
! Any change of the primary key, 

like, e.g, adding a column,

requires a data migration.
Data modeling: Primary key
! Well known:

If you delete a column or whole row,

the data is not really deleted.

Rather a tombstone is created to mark the deletion.
! Much later tombstones are removed during
compactions.
Data modeling: Deletions
! Updates don’t create tombstones, 

data is overwritten in situ.
! Exception: maps, lists, sets
! Use frozen<list> instead.
Unexpected Tombstones: Built-in Maps, Lists, Sets
! sstable2json shows details of sstable files in json
format.
! Usage: go to /var/lib/cassandra/data/keyspace/table
! > sstable2json *-Data.db
! You can actually see what’s in the individual rows of the
data files.
! sstabledump in 3.6
Debug tool: sstable2json
Example
CREATE TABLE customer_cache.tenant (
name text PRIMARY KEY,
status text
)
select * from tenant ;
name | status
------+--------
ru | ACTIVE
es | ACTIVE
jp | ACTIVE
vn | ACTIVE
pl | ACTIVE
cz | ACTIVE
Example
{"key": "ru",
"cells": [["status","ACTIVE",1464344127007511]]},
{"key": "it",
"cells": [[„status“,"ACTIVE",1464344146457930, T]]},
{"key": "de",
"cells": [["status","ACTIVE",1464343910541463]]},
{"key": "ro",
"cells": [["status","ACTIVE",1464344151160601]]},
{"key": "fr",
"cells": [["status","ACTIVE",1464344072061135]]},
{"key": "cn",
"cells": [["status","ACTIVE",1464344083085247]]},
{"key": "kz",
"cells": [["status","ACTIVE",1467190714345185]]}
deletion
marker
! Strategy to reduce partition size.
! Actually, general strategy in databases to distribute a
table.
! In Cassandra, each partition of a table can be
distributed over a set of buckets.
! Example: A train departure time table

is bucketed over the hours of the day.
! The bucket becomes part of the partition key.
! It must be easily calculable.
Bucketing
! Bulk read or write operations should not be performed
synchronously.
! It makes no sense to wait for each read in the bulk to
complete individually.
! Instead
! send the requests asynchronously to Cassandra, and
! collect the answers.
Bulk Reads or Writes: Asynchronously
Example
Session session = cc.openSession();

PreparedStatement getEntries = session.prepare("SELECT * FROM " + keyspace + „."
+ table +" WHERE tenant=? AND core_system=? AND change_type = ? AND seconds = ?
AND bucket = ? „);
private List<ResultSetFuture> sendQueries(Collection<Object[]> partitionKeys) {

List<ResultSetFuture> futures =
Lists.newArrayListWithExpectedSize(partitionKeys.size());

for (Object[] keys : partitionKeys) {

futures.add(session.executeAsync(getEntries.bind(keys)));

}

return futures;

}


Example
private void processAsyncResults(List<ResultSetFuture> futures) {

for (ListenableFuture<ResultSet> future : Futures.inCompletionOrder(futures)) {

ResultSet rs = future.get();

if (rs.getAvailableWithoutFetching() > 0 || rs.one() != null) {

// do your program logic here
}
}
}


Cassandra as Streaming Source
table of consolidated entriesEvent Source
! Event log fills over time
! Requires fine grained time based partitions to avoid
huge partition size.
! Spark jobs read latest events.
! Access path: time stamp
! Jobs have to read many partitions.
Cassandra as Streaming Source
! Pull method requires frequent reads of empty
partitions, because no new data arrived.
! Impossible to re-read the whole table: 

by far too many partitions to read

(our case: 900,000 per day -> 30 millions for a month)

Unsuitable for λ-architecture
! Eventual consistency
! When can you be sure that you read all events?

— Never
! Might improve with CDC
Cassandra is unsuitable as a time-based event source.
! One keyspace per tenant?
! One (set of) table(s) per tenant?
! Our option: Table per tenant
! Feasible only for limited number of tenants (~1000)
Separating Data of Different Tenants
! Switch on monitoring
! ELK, OpsCenter, self built, ....
! Avoid Log level debug for C* messages
! Drowning in irrelevant messages
! Substantial performance drawback
! Log level info for development, pre-production
! Log level error in production sufficient
Monitoring
! When writing data,

Cassandra never checks if there is enough space left
on disk.
! It keeps writing data till the disk is full.
! This can bring the OS to a halt.
! Cassandra error messages are confusing at this point.
! Thus monitoring disk space is mandatory.
Monitoring: Disk Space
! Cassandra requires a lot of disk space for compaction.
! For SizeTieredCompaction reserve as much disk
space as Cassandra needs to store the data.
! Set-up monitoring on disk space.
! Receive an alert if the data carrying disk partition fills
up to 50%.
! Add nodes to the cluster and rebalance.
Monitoring: Disk Space
Lessons Learned with
Cassandra & Spark
repartitionByCassandraReplica
Node 1
Node 2
Node 3
Node 4
1-25
26-5051-75
76-0
repartitionByCassandraReplica
Node 1
Node 2
Node 3
Node 4
1-25
26-5051-75
76-0
some tasks took ~3s longer..
! Watch for Spark Locality Level
! aim for process or node local
! avoid any
Spark locality
! spark job does not run on every C* node
! # spark nodes < # cassandra nodes
! # job cores < # cassandra nodes
! spark job cores all on one node
! time for repartition > time saving through locality
Do not use repartitionByCassandraReplica when ...
! one query per partition key
! one query at a time per executor
joinWithCassandraTable
Spark
Cassandra
t t+1 t+2 t+3 t+4 t+5
! parallel async queries
joinWithCassandraTable
Spark
Cassandra
t t+1 t+2 t+3 t+4 t+5
! built a custom async implementation
joinWithCassandraTable
someDStream.transformToPair(rdd	->	{	
			return	rdd.mapPartitionsToPair(iterator	->	{	
				...	
				Session	session	=	cc.openSession())	{	
				while	(iterator.hasNext())	{	
						...	
						session.executeAsync(..)	
				}	
				[collect	futures]	
return	List<Tuple2<Left,Optional<Right>>>	
		});	
});
! solved with SPARKC-233 

(1.6.0 / 1.5.1 / 1.4.3)
! 5-6 times faster 

than sync implementation!
joinWithCassandraTable
! joinWithCassandraTable is a full inner join
Left join with Cassandra
RDD C*
! Might include shuffle --> quite expensive
Left join with Cassandra
RDD C*join =
RDD substract = RDD
+ RDD=
Left join with Cassandra
! built a custom async implementation

someDStream.transformToPair(rdd	->	{	
			return	rdd.mapPartitionsToPair(iterator	->	{	
				...	
				Session	session	=	cc.openSession())	{	
				while	(iterator.hasNext())	{	
						...	
						session.executeAsync(..)	
						...	
				}	
				[collect	futures]	
return	List<Tuple2<Left,Optional<Right>>>	
		});	
});
! solved with SPARKC-1.81

(2.0.0)
! basically uses async joinWithC* 

implementation
Left join with Cassandra
! spark.cassandra.connection.keep_alive_ms
! Default: 5s
! Streaming Batch Size > 5s
! Open Connection for every new batch
! Should be multiple times the streaming interval!
Connection keep alive
! cache saves performance by preventing recalculation
! it also helps you with regards to correctness!
Cache! Not only for performance
val	changedStream	=	someDStream.map(e	->	someMethod(e)).cache()	
changedStream.saveToCassandra("keyspace","table1")	
changedStream.saveToCassandra("keyspace","table1")	
ChangedEntry	someMethod(Entry	e)	{	
		return	new	ChangedEntry(new	Date(),...);	
}
• Know the most important internals
• Know your tools
• Monitor your cluster
• Use existing knowledge resources
• Use the mailing lists
• Participate in the community
Summary
Questions?
Matthias Niehoff Stephan Kepser

More Related Content

What's hot

Cassandra Backups and Restorations Using Ansible (Joshua Wickman, Knewton) | ...
Cassandra Backups and Restorations Using Ansible (Joshua Wickman, Knewton) | ...Cassandra Backups and Restorations Using Ansible (Joshua Wickman, Knewton) | ...
Cassandra Backups and Restorations Using Ansible (Joshua Wickman, Knewton) | ...
DataStax
 

What's hot (20)

Zero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and CassandraZero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and Cassandra
 
C* Summit 2013: Cassandra at Instagram by Rick Branson
C* Summit 2013: Cassandra at Instagram by Rick BransonC* Summit 2013: Cassandra at Instagram by Rick Branson
C* Summit 2013: Cassandra at Instagram by Rick Branson
 
Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Ca...
Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Ca...Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Ca...
Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Ca...
 
Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra
 
Cassandra Backups and Restorations Using Ansible (Joshua Wickman, Knewton) | ...
Cassandra Backups and Restorations Using Ansible (Joshua Wickman, Knewton) | ...Cassandra Backups and Restorations Using Ansible (Joshua Wickman, Knewton) | ...
Cassandra Backups and Restorations Using Ansible (Joshua Wickman, Knewton) | ...
 
Time series with apache cassandra strata
Time series with apache cassandra   strataTime series with apache cassandra   strata
Time series with apache cassandra strata
 
Advanced Apache Cassandra Operations with JMX
Advanced Apache Cassandra Operations with JMXAdvanced Apache Cassandra Operations with JMX
Advanced Apache Cassandra Operations with JMX
 
The Best and Worst of Cassandra-stress Tool (Christopher Batey, The Last Pick...
The Best and Worst of Cassandra-stress Tool (Christopher Batey, The Last Pick...The Best and Worst of Cassandra-stress Tool (Christopher Batey, The Last Pick...
The Best and Worst of Cassandra-stress Tool (Christopher Batey, The Last Pick...
 
Analytics with Cassandra & Spark
Analytics with Cassandra & SparkAnalytics with Cassandra & Spark
Analytics with Cassandra & Spark
 
Lightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and CassandraLightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and Cassandra
 
How We Used Cassandra/Solr to Build Real-Time Analytics Platform
How We Used Cassandra/Solr to Build Real-Time Analytics PlatformHow We Used Cassandra/Solr to Build Real-Time Analytics Platform
How We Used Cassandra/Solr to Build Real-Time Analytics Platform
 
Cassandra 2.0 and timeseries
Cassandra 2.0 and timeseriesCassandra 2.0 and timeseries
Cassandra 2.0 and timeseries
 
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials DayAnalytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
 
Cassandra Summit 2015: Intro to DSE Search
Cassandra Summit 2015: Intro to DSE SearchCassandra Summit 2015: Intro to DSE Search
Cassandra Summit 2015: Intro to DSE Search
 
Spark Streaming with Cassandra
Spark Streaming with CassandraSpark Streaming with Cassandra
Spark Streaming with Cassandra
 
An Introduction to time series with Team Apache
An Introduction to time series with Team ApacheAn Introduction to time series with Team Apache
An Introduction to time series with Team Apache
 
Owning time series with team apache Strata San Jose 2015
Owning time series with team apache   Strata San Jose 2015Owning time series with team apache   Strata San Jose 2015
Owning time series with team apache Strata San Jose 2015
 
Time series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long versionTime series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long version
 
Cassandra EU - Data model on fire
Cassandra EU - Data model on fireCassandra EU - Data model on fire
Cassandra EU - Data model on fire
 
Nike Tech Talk: Double Down on Apache Cassandra and Spark
Nike Tech Talk:  Double Down on Apache Cassandra and SparkNike Tech Talk:  Double Down on Apache Cassandra and Spark
Nike Tech Talk: Double Down on Apache Cassandra and Spark
 

Viewers also liked

Viewers also liked (7)

Spark cassandra connector.API, Best Practices and Use-Cases
Spark cassandra connector.API, Best Practices and Use-CasesSpark cassandra connector.API, Best Practices and Use-Cases
Spark cassandra connector.API, Best Practices and Use-Cases
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
 
Spark Cassandra Connector Dataframes
Spark Cassandra Connector DataframesSpark Cassandra Connector Dataframes
Spark Cassandra Connector Dataframes
 
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
 
Cassandra data structures and algorithms
Cassandra data structures and algorithmsCassandra data structures and algorithms
Cassandra data structures and algorithms
 
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
 

Similar to Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentric AG) | C* Summit 2016

Apache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series dataApache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series data
Patrick McFadin
 
Spark with Elasticsearch - umd version 2014
Spark with Elasticsearch - umd version 2014Spark with Elasticsearch - umd version 2014
Spark with Elasticsearch - umd version 2014
Holden Karau
 
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
Data Con LA
 
Montreal User Group - Cloning Cassandra
Montreal User Group - Cloning CassandraMontreal User Group - Cloning Cassandra
Montreal User Group - Cloning Cassandra
Adam Hutson
 

Similar to Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentric AG) | C* Summit 2016 (20)

Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
 
London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...
London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...
London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...
 
Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...
Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...
Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...
 
High Performance, High Reliability Data Loading on ClickHouse
High Performance, High Reliability Data Loading on ClickHouseHigh Performance, High Reliability Data Loading on ClickHouse
High Performance, High Reliability Data Loading on ClickHouse
 
Cassandra for Sysadmins
Cassandra for SysadminsCassandra for Sysadmins
Cassandra for Sysadmins
 
Apache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series dataApache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series data
 
Spark with Elasticsearch - umd version 2014
Spark with Elasticsearch - umd version 2014Spark with Elasticsearch - umd version 2014
Spark with Elasticsearch - umd version 2014
 
Intro to Spark
Intro to SparkIntro to Spark
Intro to Spark
 
DTCC '14 Spark Runtime Internals
DTCC '14 Spark Runtime InternalsDTCC '14 Spark Runtime Internals
DTCC '14 Spark Runtime Internals
 
Moving from a Relational Database to Cassandra: Why, Where, When, and How
Moving from a Relational Database to Cassandra: Why, Where, When, and HowMoving from a Relational Database to Cassandra: Why, Where, When, and How
Moving from a Relational Database to Cassandra: Why, Where, When, and How
 
It Depends - Database admin for developers - Rev 20151205
It Depends - Database admin for developers - Rev 20151205It Depends - Database admin for developers - Rev 20151205
It Depends - Database admin for developers - Rev 20151205
 
Apache Cassandra at the Geek2Geek Berlin
Apache Cassandra at the Geek2Geek BerlinApache Cassandra at the Geek2Geek Berlin
Apache Cassandra at the Geek2Geek Berlin
 
Intro to cassandra
Intro to cassandraIntro to cassandra
Intro to cassandra
 
Migrating from a Relational Database to Cassandra: Why, Where, When and How
Migrating from a Relational Database to Cassandra: Why, Where, When and HowMigrating from a Relational Database to Cassandra: Why, Where, When and How
Migrating from a Relational Database to Cassandra: Why, Where, When and How
 
Manchester Hadoop Meetup: Spark Cassandra Integration
Manchester Hadoop Meetup: Spark Cassandra IntegrationManchester Hadoop Meetup: Spark Cassandra Integration
Manchester Hadoop Meetup: Spark Cassandra Integration
 
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
 
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
 
Breakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and SparkBreakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and Spark
 
Exactly once with spark streaming
Exactly once with spark streamingExactly once with spark streaming
Exactly once with spark streaming
 
Montreal User Group - Cloning Cassandra
Montreal User Group - Cloning CassandraMontreal User Group - Cloning Cassandra
Montreal User Group - Cloning Cassandra
 

More from DataStax

More from DataStax (20)

Is Your Enterprise Ready to Shine This Holiday Season?
Is Your Enterprise Ready to Shine This Holiday Season?Is Your Enterprise Ready to Shine This Holiday Season?
Is Your Enterprise Ready to Shine This Holiday Season?
 
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...
 
Running DataStax Enterprise in VMware Cloud and Hybrid Environments
Running DataStax Enterprise in VMware Cloud and Hybrid EnvironmentsRunning DataStax Enterprise in VMware Cloud and Hybrid Environments
Running DataStax Enterprise in VMware Cloud and Hybrid Environments
 
Best Practices for Getting to Production with DataStax Enterprise Graph
Best Practices for Getting to Production with DataStax Enterprise GraphBest Practices for Getting to Production with DataStax Enterprise Graph
Best Practices for Getting to Production with DataStax Enterprise Graph
 
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step JourneyWebinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
 
Webinar | How to Understand Apache Cassandra™ Performance Through Read/Writ...
Webinar  |  How to Understand Apache Cassandra™ Performance Through Read/Writ...Webinar  |  How to Understand Apache Cassandra™ Performance Through Read/Writ...
Webinar | How to Understand Apache Cassandra™ Performance Through Read/Writ...
 
Webinar | Better Together: Apache Cassandra and Apache Kafka
Webinar  |  Better Together: Apache Cassandra and Apache KafkaWebinar  |  Better Together: Apache Cassandra and Apache Kafka
Webinar | Better Together: Apache Cassandra and Apache Kafka
 
Top 10 Best Practices for Apache Cassandra and DataStax Enterprise
Top 10 Best Practices for Apache Cassandra and DataStax EnterpriseTop 10 Best Practices for Apache Cassandra and DataStax Enterprise
Top 10 Best Practices for Apache Cassandra and DataStax Enterprise
 
Introduction to Apache Cassandra™ + What’s New in 4.0
Introduction to Apache Cassandra™ + What’s New in 4.0Introduction to Apache Cassandra™ + What’s New in 4.0
Introduction to Apache Cassandra™ + What’s New in 4.0
 
Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...
Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...
Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...
 
Webinar | Aligning GDPR Requirements with Today's Hybrid Cloud Realities
Webinar  |  Aligning GDPR Requirements with Today's Hybrid Cloud RealitiesWebinar  |  Aligning GDPR Requirements with Today's Hybrid Cloud Realities
Webinar | Aligning GDPR Requirements with Today's Hybrid Cloud Realities
 
Designing a Distributed Cloud Database for Dummies
Designing a Distributed Cloud Database for DummiesDesigning a Distributed Cloud Database for Dummies
Designing a Distributed Cloud Database for Dummies
 
How to Power Innovation with Geo-Distributed Data Management in Hybrid Cloud
How to Power Innovation with Geo-Distributed Data Management in Hybrid CloudHow to Power Innovation with Geo-Distributed Data Management in Hybrid Cloud
How to Power Innovation with Geo-Distributed Data Management in Hybrid Cloud
 
How to Evaluate Cloud Databases for eCommerce
How to Evaluate Cloud Databases for eCommerceHow to Evaluate Cloud Databases for eCommerce
How to Evaluate Cloud Databases for eCommerce
 
Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...
Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...
Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...
 
Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...
Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...
Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...
 
Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...
Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...
Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...
 
Datastax - The Architect's guide to customer experience (CX)
Datastax - The Architect's guide to customer experience (CX)Datastax - The Architect's guide to customer experience (CX)
Datastax - The Architect's guide to customer experience (CX)
 
An Operational Data Layer is Critical for Transformative Banking Applications
An Operational Data Layer is Critical for Transformative Banking ApplicationsAn Operational Data Layer is Critical for Transformative Banking Applications
An Operational Data Layer is Critical for Transformative Banking Applications
 
Becoming a Customer-Centric Enterprise Via Real-Time Data and Design Thinking
Becoming a Customer-Centric Enterprise Via Real-Time Data and Design ThinkingBecoming a Customer-Centric Enterprise Via Real-Time Data and Design Thinking
Becoming a Customer-Centric Enterprise Via Real-Time Data and Design Thinking
 

Recently uploaded

introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
VishalKumarJha10
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
VictorSzoltysek
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
Health
 

Recently uploaded (20)

introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdf
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 

Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentric AG) | C* Summit 2016

  • 1. Lessons Learned with Cassandra & Spark_ Stephan Kepser Matthias Niehoff Cassandra Summit 2016 @matthiasniehoff @codecentric 1
  • 4. ! Primary key defines access way to a table ! It cannot be changed after the table is filled with data. ! Any change of the primary key, 
 like, e.g, adding a column,
 requires a data migration. Data modeling: Primary key
  • 5. ! Well known:
 If you delete a column or whole row,
 the data is not really deleted.
 Rather a tombstone is created to mark the deletion. ! Much later tombstones are removed during compactions. Data modeling: Deletions
  • 6. ! Updates don’t create tombstones, 
 data is overwritten in situ. ! Exception: maps, lists, sets ! Use frozen<list> instead. Unexpected Tombstones: Built-in Maps, Lists, Sets
  • 7. ! sstable2json shows details of sstable files in json format. ! Usage: go to /var/lib/cassandra/data/keyspace/table ! > sstable2json *-Data.db ! You can actually see what’s in the individual rows of the data files. ! sstabledump in 3.6 Debug tool: sstable2json
  • 8. Example CREATE TABLE customer_cache.tenant ( name text PRIMARY KEY, status text ) select * from tenant ; name | status ------+-------- ru | ACTIVE es | ACTIVE jp | ACTIVE vn | ACTIVE pl | ACTIVE cz | ACTIVE
  • 9. Example {"key": "ru", "cells": [["status","ACTIVE",1464344127007511]]}, {"key": "it", "cells": [[„status“,"ACTIVE",1464344146457930, T]]}, {"key": "de", "cells": [["status","ACTIVE",1464343910541463]]}, {"key": "ro", "cells": [["status","ACTIVE",1464344151160601]]}, {"key": "fr", "cells": [["status","ACTIVE",1464344072061135]]}, {"key": "cn", "cells": [["status","ACTIVE",1464344083085247]]}, {"key": "kz", "cells": [["status","ACTIVE",1467190714345185]]} deletion marker
  • 10. ! Strategy to reduce partition size. ! Actually, general strategy in databases to distribute a table. ! In Cassandra, each partition of a table can be distributed over a set of buckets. ! Example: A train departure time table
 is bucketed over the hours of the day. ! The bucket becomes part of the partition key. ! It must be easily calculable. Bucketing
  • 11. ! Bulk read or write operations should not be performed synchronously. ! It makes no sense to wait for each read in the bulk to complete individually. ! Instead ! send the requests asynchronously to Cassandra, and ! collect the answers. Bulk Reads or Writes: Asynchronously
  • 12. Example Session session = cc.openSession();
 PreparedStatement getEntries = session.prepare("SELECT * FROM " + keyspace + „." + table +" WHERE tenant=? AND core_system=? AND change_type = ? AND seconds = ? AND bucket = ? „); private List<ResultSetFuture> sendQueries(Collection<Object[]> partitionKeys) {
 List<ResultSetFuture> futures = Lists.newArrayListWithExpectedSize(partitionKeys.size());
 for (Object[] keys : partitionKeys) {
 futures.add(session.executeAsync(getEntries.bind(keys)));
 }
 return futures;
 } 

  • 13. Example private void processAsyncResults(List<ResultSetFuture> futures) {
 for (ListenableFuture<ResultSet> future : Futures.inCompletionOrder(futures)) {
 ResultSet rs = future.get();
 if (rs.getAvailableWithoutFetching() > 0 || rs.one() != null) {
 // do your program logic here } } } 

  • 14. Cassandra as Streaming Source table of consolidated entriesEvent Source
  • 15. ! Event log fills over time ! Requires fine grained time based partitions to avoid huge partition size. ! Spark jobs read latest events. ! Access path: time stamp ! Jobs have to read many partitions. Cassandra as Streaming Source
  • 16. ! Pull method requires frequent reads of empty partitions, because no new data arrived. ! Impossible to re-read the whole table: 
 by far too many partitions to read
 (our case: 900,000 per day -> 30 millions for a month)
 Unsuitable for λ-architecture ! Eventual consistency ! When can you be sure that you read all events?
 — Never ! Might improve with CDC Cassandra is unsuitable as a time-based event source.
  • 17. ! One keyspace per tenant? ! One (set of) table(s) per tenant? ! Our option: Table per tenant ! Feasible only for limited number of tenants (~1000) Separating Data of Different Tenants
  • 18. ! Switch on monitoring ! ELK, OpsCenter, self built, .... ! Avoid Log level debug for C* messages ! Drowning in irrelevant messages ! Substantial performance drawback ! Log level info for development, pre-production ! Log level error in production sufficient Monitoring
  • 19. ! When writing data,
 Cassandra never checks if there is enough space left on disk. ! It keeps writing data till the disk is full. ! This can bring the OS to a halt. ! Cassandra error messages are confusing at this point. ! Thus monitoring disk space is mandatory. Monitoring: Disk Space
  • 20. ! Cassandra requires a lot of disk space for compaction. ! For SizeTieredCompaction reserve as much disk space as Cassandra needs to store the data. ! Set-up monitoring on disk space. ! Receive an alert if the data carrying disk partition fills up to 50%. ! Add nodes to the cluster and rebalance. Monitoring: Disk Space
  • 22. repartitionByCassandraReplica Node 1 Node 2 Node 3 Node 4 1-25 26-5051-75 76-0
  • 23. repartitionByCassandraReplica Node 1 Node 2 Node 3 Node 4 1-25 26-5051-75 76-0 some tasks took ~3s longer..
  • 24. ! Watch for Spark Locality Level ! aim for process or node local ! avoid any Spark locality
  • 25. ! spark job does not run on every C* node ! # spark nodes < # cassandra nodes ! # job cores < # cassandra nodes ! spark job cores all on one node ! time for repartition > time saving through locality Do not use repartitionByCassandraReplica when ...
  • 26. ! one query per partition key ! one query at a time per executor joinWithCassandraTable Spark Cassandra t t+1 t+2 t+3 t+4 t+5
  • 27. ! parallel async queries joinWithCassandraTable Spark Cassandra t t+1 t+2 t+3 t+4 t+5
  • 28. ! built a custom async implementation joinWithCassandraTable someDStream.transformToPair(rdd -> { return rdd.mapPartitionsToPair(iterator -> { ... Session session = cc.openSession()) { while (iterator.hasNext()) { ... session.executeAsync(..) } [collect futures] return List<Tuple2<Left,Optional<Right>>> }); });
  • 29. ! solved with SPARKC-233 
 (1.6.0 / 1.5.1 / 1.4.3) ! 5-6 times faster 
 than sync implementation! joinWithCassandraTable
  • 30. ! joinWithCassandraTable is a full inner join Left join with Cassandra RDD C*
  • 31. ! Might include shuffle --> quite expensive Left join with Cassandra RDD C*join = RDD substract = RDD + RDD=
  • 32. Left join with Cassandra ! built a custom async implementation
 someDStream.transformToPair(rdd -> { return rdd.mapPartitionsToPair(iterator -> { ... Session session = cc.openSession()) { while (iterator.hasNext()) { ... session.executeAsync(..) ... } [collect futures] return List<Tuple2<Left,Optional<Right>>> }); });
  • 33. ! solved with SPARKC-1.81
 (2.0.0) ! basically uses async joinWithC* 
 implementation Left join with Cassandra
  • 34. ! spark.cassandra.connection.keep_alive_ms ! Default: 5s ! Streaming Batch Size > 5s ! Open Connection for every new batch ! Should be multiple times the streaming interval! Connection keep alive
  • 35. ! cache saves performance by preventing recalculation ! it also helps you with regards to correctness! Cache! Not only for performance val changedStream = someDStream.map(e -> someMethod(e)).cache() changedStream.saveToCassandra("keyspace","table1") changedStream.saveToCassandra("keyspace","table1") ChangedEntry someMethod(Entry e) { return new ChangedEntry(new Date(),...); }
  • 36. • Know the most important internals • Know your tools • Monitor your cluster • Use existing knowledge resources • Use the mailing lists • Participate in the community Summary