SlideShare a Scribd company logo
Lessons Learned with
Cassandra & Spark_
Stephan Kepser
Matthias Niehoff
Cassandra Summit 2016
@matthiasniehoff
@codecentric
1
Our Use Cases
read
read
join
join
write
write
Lessons Learned with
Cassandra
! Primary key defines access way to a table
! It cannot be changed after the table is filled with data.
! Any change of the primary key, 

like, e.g, adding a column,

requires a data migration.
Data modeling: Primary key
! Well known:

If you delete a column or whole row,

the data is not really deleted.

Rather a tombstone is created to mark the deletion.
! Much later tombstones are removed during
compactions.
Data modeling: Deletions
! Updates don’t create tombstones, 

data is overwritten in situ.
! Exception: maps, lists, sets
! Use frozen<list> instead.
Unexpected Tombstones: Built-in Maps, Lists, Sets
! sstable2json shows details of sstable files in json
format.
! Usage: go to /var/lib/cassandra/data/keyspace/table
! > sstable2json *-Data.db
! You can actually see what’s in the individual rows of the
data files.
! sstabledump in 3.6
Debug tool: sstable2json
Example
CREATE TABLE customer_cache.tenant (
name text PRIMARY KEY,
status text
)
select * from tenant ;
name | status
------+--------
ru | ACTIVE
es | ACTIVE
jp | ACTIVE
vn | ACTIVE
pl | ACTIVE
cz | ACTIVE
Example
{"key": "ru",
"cells": [["status","ACTIVE",1464344127007511]]},
{"key": "it",
"cells": [[„status“,"ACTIVE",1464344146457930, T]]},
{"key": "de",
"cells": [["status","ACTIVE",1464343910541463]]},
{"key": "ro",
"cells": [["status","ACTIVE",1464344151160601]]},
{"key": "fr",
"cells": [["status","ACTIVE",1464344072061135]]},
{"key": "cn",
"cells": [["status","ACTIVE",1464344083085247]]},
{"key": "kz",
"cells": [["status","ACTIVE",1467190714345185]]}
deletion
marker
! Strategy to reduce partition size.
! Actually, general strategy in databases to distribute a
table.
! In Cassandra, each partition of a table can be
distributed over a set of buckets.
! Example: A train departure time table

is bucketed over the hours of the day.
! The bucket becomes part of the partition key.
! It must be easily calculable.
Bucketing
! Bulk read or write operations should not be performed
synchronously.
! It makes no sense to wait for each read in the bulk to
complete individually.
! Instead
! send the requests asynchronously to Cassandra, and
! collect the answers.
Bulk Reads or Writes: Asynchronously
Example
Session session = cc.openSession();

PreparedStatement getEntries = session.prepare("SELECT * FROM " + keyspace + „."
+ table +" WHERE tenant=? AND core_system=? AND change_type = ? AND seconds = ?
AND bucket = ? „);
private List<ResultSetFuture> sendQueries(Collection<Object[]> partitionKeys) {

List<ResultSetFuture> futures =
Lists.newArrayListWithExpectedSize(partitionKeys.size());

for (Object[] keys : partitionKeys) {

futures.add(session.executeAsync(getEntries.bind(keys)));

}

return futures;

}


Example
private void processAsyncResults(List<ResultSetFuture> futures) {

for (ListenableFuture<ResultSet> future : Futures.inCompletionOrder(futures)) {

ResultSet rs = future.get();

if (rs.getAvailableWithoutFetching() > 0 || rs.one() != null) {

// do your program logic here
}
}
}


Cassandra as Streaming Source
table of consolidated entriesEvent Source
! Event log fills over time
! Requires fine grained time based partitions to avoid
huge partition size.
! Spark jobs read latest events.
! Access path: time stamp
! Jobs have to read many partitions.
Cassandra as Streaming Source
! Pull method requires frequent reads of empty
partitions, because no new data arrived.
! Impossible to re-read the whole table: 

by far too many partitions to read

(our case: 900,000 per day -> 30 millions for a month)

Unsuitable for λ-architecture
! Eventual consistency
! When can you be sure that you read all events?

— Never
! Might improve with CDC
Cassandra is unsuitable as a time-based event source.
! One keyspace per tenant?
! One (set of) table(s) per tenant?
! Our option: Table per tenant
! Feasible only for limited number of tenants (~1000)
Separating Data of Different Tenants
! Switch on monitoring
! ELK, OpsCenter, self built, ....
! Avoid Log level debug for C* messages
! Drowning in irrelevant messages
! Substantial performance drawback
! Log level info for development, pre-production
! Log level error in production sufficient
Monitoring
! When writing data,

Cassandra never checks if there is enough space left
on disk.
! It keeps writing data till the disk is full.
! This can bring the OS to a halt.
! Cassandra error messages are confusing at this point.
! Thus monitoring disk space is mandatory.
Monitoring: Disk Space
! Cassandra requires a lot of disk space for compaction.
! For SizeTieredCompaction reserve as much disk
space as Cassandra needs to store the data.
! Set-up monitoring on disk space.
! Receive an alert if the data carrying disk partition fills
up to 50%.
! Add nodes to the cluster and rebalance.
Monitoring: Disk Space
Lessons Learned with
Cassandra & Spark
repartitionByCassandraReplica
Node 1
Node 2
Node 3
Node 4
1-25
26-5051-75
76-0
repartitionByCassandraReplica
Node 1
Node 2
Node 3
Node 4
1-25
26-5051-75
76-0
some tasks took ~3s longer..
! Watch for Spark Locality Level
! aim for process or node local
! avoid any
Spark locality
! spark job does not run on every C* node
! # spark nodes < # cassandra nodes
! # job cores < # cassandra nodes
! spark job cores all on one node
! time for repartition > time saving through locality
Do not use repartitionByCassandraReplica when ...
! one query per partition key
! one query at a time per executor
joinWithCassandraTable
Spark
Cassandra
t t+1 t+2 t+3 t+4 t+5
! parallel async queries
joinWithCassandraTable
Spark
Cassandra
t t+1 t+2 t+3 t+4 t+5
! built a custom async implementation
joinWithCassandraTable
someDStream.transformToPair(rdd	->	{	
			return	rdd.mapPartitionsToPair(iterator	->	{	
				...	
				Session	session	=	cc.openSession())	{	
				while	(iterator.hasNext())	{	
						...	
						session.executeAsync(..)	
				}	
				[collect	futures]	
return	List<Tuple2<Left,Optional<Right>>>	
		});	
});
! solved with SPARKC-233 

(1.6.0 / 1.5.1 / 1.4.3)
! 5-6 times faster 

than sync implementation!
joinWithCassandraTable
! joinWithCassandraTable is a full inner join
Left join with Cassandra
RDD C*
! Might include shuffle --> quite expensive
Left join with Cassandra
RDD C*join =
RDD substract = RDD
+ RDD=
Left join with Cassandra
! built a custom async implementation

someDStream.transformToPair(rdd	->	{	
			return	rdd.mapPartitionsToPair(iterator	->	{	
				...	
				Session	session	=	cc.openSession())	{	
				while	(iterator.hasNext())	{	
						...	
						session.executeAsync(..)	
						...	
				}	
				[collect	futures]	
return	List<Tuple2<Left,Optional<Right>>>	
		});	
});
! solved with SPARKC-1.81

(2.0.0)
! basically uses async joinWithC* 

implementation
Left join with Cassandra
! spark.cassandra.connection.keep_alive_ms
! Default: 5s
! Streaming Batch Size > 5s
! Open Connection for every new batch
! Should be multiple times the streaming interval!
Connection keep alive
! cache saves performance by preventing recalculation
! it also helps you with regards to correctness!
Cache! Not only for performance
val	changedStream	=	someDStream.map(e	->	someMethod(e)).cache()	
changedStream.saveToCassandra("keyspace","table1")	
changedStream.saveToCassandra("keyspace","table1")	
ChangedEntry	someMethod(Entry	e)	{	
		return	new	ChangedEntry(new	Date(),...);	
}
• Know the most important internals
• Know your tools
• Monitor your cluster
• Use existing knowledge resources
• Use the mailing lists
• Participate in the community
Summary
Questions?
Matthias Niehoff Stephan Kepser

More Related Content

What's hot

Zero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and CassandraZero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and Cassandra
Russell Spitzer
 
C* Summit 2013: Cassandra at Instagram by Rick Branson
C* Summit 2013: Cassandra at Instagram by Rick BransonC* Summit 2013: Cassandra at Instagram by Rick Branson
C* Summit 2013: Cassandra at Instagram by Rick Branson
DataStax Academy
 
Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Ca...
Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Ca...Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Ca...
Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Ca...
DataStax
 
Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra
Matthias Niehoff
 
Cassandra Backups and Restorations Using Ansible (Joshua Wickman, Knewton) | ...
Cassandra Backups and Restorations Using Ansible (Joshua Wickman, Knewton) | ...Cassandra Backups and Restorations Using Ansible (Joshua Wickman, Knewton) | ...
Cassandra Backups and Restorations Using Ansible (Joshua Wickman, Knewton) | ...
DataStax
 
Time series with apache cassandra strata
Time series with apache cassandra   strataTime series with apache cassandra   strata
Time series with apache cassandra strata
Patrick McFadin
 
Advanced Apache Cassandra Operations with JMX
Advanced Apache Cassandra Operations with JMXAdvanced Apache Cassandra Operations with JMX
Advanced Apache Cassandra Operations with JMX
zznate
 
The Best and Worst of Cassandra-stress Tool (Christopher Batey, The Last Pick...
The Best and Worst of Cassandra-stress Tool (Christopher Batey, The Last Pick...The Best and Worst of Cassandra-stress Tool (Christopher Batey, The Last Pick...
The Best and Worst of Cassandra-stress Tool (Christopher Batey, The Last Pick...
DataStax
 
Analytics with Cassandra & Spark
Analytics with Cassandra & SparkAnalytics with Cassandra & Spark
Analytics with Cassandra & Spark
Matthias Niehoff
 
Lightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and CassandraLightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and Cassandra
nickmbailey
 
How We Used Cassandra/Solr to Build Real-Time Analytics Platform
How We Used Cassandra/Solr to Build Real-Time Analytics PlatformHow We Used Cassandra/Solr to Build Real-Time Analytics Platform
How We Used Cassandra/Solr to Build Real-Time Analytics Platform
DataStax Academy
 
Cassandra 2.0 and timeseries
Cassandra 2.0 and timeseriesCassandra 2.0 and timeseries
Cassandra 2.0 and timeseries
Patrick McFadin
 
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials DayAnalytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Matthias Niehoff
 
Cassandra Summit 2015: Intro to DSE Search
Cassandra Summit 2015: Intro to DSE SearchCassandra Summit 2015: Intro to DSE Search
Cassandra Summit 2015: Intro to DSE Search
Caleb Rackliffe
 
Spark Streaming with Cassandra
Spark Streaming with CassandraSpark Streaming with Cassandra
Spark Streaming with Cassandra
Jacek Lewandowski
 
An Introduction to time series with Team Apache
An Introduction to time series with Team ApacheAn Introduction to time series with Team Apache
An Introduction to time series with Team Apache
Patrick McFadin
 
Owning time series with team apache Strata San Jose 2015
Owning time series with team apache   Strata San Jose 2015Owning time series with team apache   Strata San Jose 2015
Owning time series with team apache Strata San Jose 2015
Patrick McFadin
 
Time series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long versionTime series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long version
Patrick McFadin
 
Cassandra EU - Data model on fire
Cassandra EU - Data model on fireCassandra EU - Data model on fire
Cassandra EU - Data model on fire
Patrick McFadin
 
Nike Tech Talk: Double Down on Apache Cassandra and Spark
Nike Tech Talk:  Double Down on Apache Cassandra and SparkNike Tech Talk:  Double Down on Apache Cassandra and Spark
Nike Tech Talk: Double Down on Apache Cassandra and Spark
Patrick McFadin
 

What's hot (20)

Zero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and CassandraZero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and Cassandra
 
C* Summit 2013: Cassandra at Instagram by Rick Branson
C* Summit 2013: Cassandra at Instagram by Rick BransonC* Summit 2013: Cassandra at Instagram by Rick Branson
C* Summit 2013: Cassandra at Instagram by Rick Branson
 
Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Ca...
Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Ca...Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Ca...
Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Ca...
 
Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra
 
Cassandra Backups and Restorations Using Ansible (Joshua Wickman, Knewton) | ...
Cassandra Backups and Restorations Using Ansible (Joshua Wickman, Knewton) | ...Cassandra Backups and Restorations Using Ansible (Joshua Wickman, Knewton) | ...
Cassandra Backups and Restorations Using Ansible (Joshua Wickman, Knewton) | ...
 
Time series with apache cassandra strata
Time series with apache cassandra   strataTime series with apache cassandra   strata
Time series with apache cassandra strata
 
Advanced Apache Cassandra Operations with JMX
Advanced Apache Cassandra Operations with JMXAdvanced Apache Cassandra Operations with JMX
Advanced Apache Cassandra Operations with JMX
 
The Best and Worst of Cassandra-stress Tool (Christopher Batey, The Last Pick...
The Best and Worst of Cassandra-stress Tool (Christopher Batey, The Last Pick...The Best and Worst of Cassandra-stress Tool (Christopher Batey, The Last Pick...
The Best and Worst of Cassandra-stress Tool (Christopher Batey, The Last Pick...
 
Analytics with Cassandra & Spark
Analytics with Cassandra & SparkAnalytics with Cassandra & Spark
Analytics with Cassandra & Spark
 
Lightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and CassandraLightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and Cassandra
 
How We Used Cassandra/Solr to Build Real-Time Analytics Platform
How We Used Cassandra/Solr to Build Real-Time Analytics PlatformHow We Used Cassandra/Solr to Build Real-Time Analytics Platform
How We Used Cassandra/Solr to Build Real-Time Analytics Platform
 
Cassandra 2.0 and timeseries
Cassandra 2.0 and timeseriesCassandra 2.0 and timeseries
Cassandra 2.0 and timeseries
 
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials DayAnalytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
 
Cassandra Summit 2015: Intro to DSE Search
Cassandra Summit 2015: Intro to DSE SearchCassandra Summit 2015: Intro to DSE Search
Cassandra Summit 2015: Intro to DSE Search
 
Spark Streaming with Cassandra
Spark Streaming with CassandraSpark Streaming with Cassandra
Spark Streaming with Cassandra
 
An Introduction to time series with Team Apache
An Introduction to time series with Team ApacheAn Introduction to time series with Team Apache
An Introduction to time series with Team Apache
 
Owning time series with team apache Strata San Jose 2015
Owning time series with team apache   Strata San Jose 2015Owning time series with team apache   Strata San Jose 2015
Owning time series with team apache Strata San Jose 2015
 
Time series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long versionTime series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long version
 
Cassandra EU - Data model on fire
Cassandra EU - Data model on fireCassandra EU - Data model on fire
Cassandra EU - Data model on fire
 
Nike Tech Talk: Double Down on Apache Cassandra and Spark
Nike Tech Talk:  Double Down on Apache Cassandra and SparkNike Tech Talk:  Double Down on Apache Cassandra and Spark
Nike Tech Talk: Double Down on Apache Cassandra and Spark
 

Viewers also liked

Spark cassandra connector.API, Best Practices and Use-Cases
Spark cassandra connector.API, Best Practices and Use-CasesSpark cassandra connector.API, Best Practices and Use-Cases
Spark cassandra connector.API, Best Practices and Use-Cases
Duyhai Doan
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
Amazon Web Services
 
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
DataStax
 
Spark Cassandra Connector Dataframes
Spark Cassandra Connector DataframesSpark Cassandra Connector Dataframes
Spark Cassandra Connector Dataframes
Russell Spitzer
 
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
DataStax
 
Cassandra data structures and algorithms
Cassandra data structures and algorithmsCassandra data structures and algorithms
Cassandra data structures and algorithms
Duyhai Doan
 
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
Amazon Web Services
 

Viewers also liked (7)

Spark cassandra connector.API, Best Practices and Use-Cases
Spark cassandra connector.API, Best Practices and Use-CasesSpark cassandra connector.API, Best Practices and Use-Cases
Spark cassandra connector.API, Best Practices and Use-Cases
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
 
Spark Cassandra Connector Dataframes
Spark Cassandra Connector DataframesSpark Cassandra Connector Dataframes
Spark Cassandra Connector Dataframes
 
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
 
Cassandra data structures and algorithms
Cassandra data structures and algorithmsCassandra data structures and algorithms
Cassandra data structures and algorithms
 
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
 

Similar to Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentric AG) | C* Summit 2016

Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
Data Con LA
 
London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...
London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...
London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...
DataStax Academy
 
Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...
Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...
Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...
DataStax
 
High Performance, High Reliability Data Loading on ClickHouse
High Performance, High Reliability Data Loading on ClickHouseHigh Performance, High Reliability Data Loading on ClickHouse
High Performance, High Reliability Data Loading on ClickHouse
Altinity Ltd
 
Cassandra for Sysadmins
Cassandra for SysadminsCassandra for Sysadmins
Cassandra for Sysadmins
Nathan Milford
 
Apache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series dataApache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series dataPatrick McFadin
 
Spark with Elasticsearch - umd version 2014
Spark with Elasticsearch - umd version 2014Spark with Elasticsearch - umd version 2014
Spark with Elasticsearch - umd version 2014Holden Karau
 
Intro to Spark
Intro to SparkIntro to Spark
Intro to Spark
Kyle Burke
 
DTCC '14 Spark Runtime Internals
DTCC '14 Spark Runtime InternalsDTCC '14 Spark Runtime Internals
DTCC '14 Spark Runtime Internals
Cheng Lian
 
Moving from a Relational Database to Cassandra: Why, Where, When, and How
Moving from a Relational Database to Cassandra: Why, Where, When, and HowMoving from a Relational Database to Cassandra: Why, Where, When, and How
Moving from a Relational Database to Cassandra: Why, Where, When, and How
Anant Corporation
 
It Depends - Database admin for developers - Rev 20151205
It Depends - Database admin for developers - Rev 20151205It Depends - Database admin for developers - Rev 20151205
It Depends - Database admin for developers - Rev 20151205
Maggie Pint
 
Apache Cassandra at the Geek2Geek Berlin
Apache Cassandra at the Geek2Geek BerlinApache Cassandra at the Geek2Geek Berlin
Apache Cassandra at the Geek2Geek Berlin
Christian Johannsen
 
Intro to cassandra
Intro to cassandraIntro to cassandra
Intro to cassandra
Aaron Ploetz
 
Migrating from a Relational Database to Cassandra: Why, Where, When and How
Migrating from a Relational Database to Cassandra: Why, Where, When and HowMigrating from a Relational Database to Cassandra: Why, Where, When and How
Migrating from a Relational Database to Cassandra: Why, Where, When and How
Anant Corporation
 
Manchester Hadoop Meetup: Spark Cassandra Integration
Manchester Hadoop Meetup: Spark Cassandra IntegrationManchester Hadoop Meetup: Spark Cassandra Integration
Manchester Hadoop Meetup: Spark Cassandra Integration
Christopher Batey
 
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
Altinity Ltd
 
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
Data Con LA
 
Breakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and SparkBreakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and Spark
Evan Chan
 
Exactly once with spark streaming
Exactly once with spark streamingExactly once with spark streaming
Exactly once with spark streaming
Quentin Ambard
 
Montreal User Group - Cloning Cassandra
Montreal User Group - Cloning CassandraMontreal User Group - Cloning Cassandra
Montreal User Group - Cloning CassandraAdam Hutson
 

Similar to Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentric AG) | C* Summit 2016 (20)

Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
 
London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...
London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...
London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...
 
Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...
Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...
Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...
 
High Performance, High Reliability Data Loading on ClickHouse
High Performance, High Reliability Data Loading on ClickHouseHigh Performance, High Reliability Data Loading on ClickHouse
High Performance, High Reliability Data Loading on ClickHouse
 
Cassandra for Sysadmins
Cassandra for SysadminsCassandra for Sysadmins
Cassandra for Sysadmins
 
Apache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series dataApache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series data
 
Spark with Elasticsearch - umd version 2014
Spark with Elasticsearch - umd version 2014Spark with Elasticsearch - umd version 2014
Spark with Elasticsearch - umd version 2014
 
Intro to Spark
Intro to SparkIntro to Spark
Intro to Spark
 
DTCC '14 Spark Runtime Internals
DTCC '14 Spark Runtime InternalsDTCC '14 Spark Runtime Internals
DTCC '14 Spark Runtime Internals
 
Moving from a Relational Database to Cassandra: Why, Where, When, and How
Moving from a Relational Database to Cassandra: Why, Where, When, and HowMoving from a Relational Database to Cassandra: Why, Where, When, and How
Moving from a Relational Database to Cassandra: Why, Where, When, and How
 
It Depends - Database admin for developers - Rev 20151205
It Depends - Database admin for developers - Rev 20151205It Depends - Database admin for developers - Rev 20151205
It Depends - Database admin for developers - Rev 20151205
 
Apache Cassandra at the Geek2Geek Berlin
Apache Cassandra at the Geek2Geek BerlinApache Cassandra at the Geek2Geek Berlin
Apache Cassandra at the Geek2Geek Berlin
 
Intro to cassandra
Intro to cassandraIntro to cassandra
Intro to cassandra
 
Migrating from a Relational Database to Cassandra: Why, Where, When and How
Migrating from a Relational Database to Cassandra: Why, Where, When and HowMigrating from a Relational Database to Cassandra: Why, Where, When and How
Migrating from a Relational Database to Cassandra: Why, Where, When and How
 
Manchester Hadoop Meetup: Spark Cassandra Integration
Manchester Hadoop Meetup: Spark Cassandra IntegrationManchester Hadoop Meetup: Spark Cassandra Integration
Manchester Hadoop Meetup: Spark Cassandra Integration
 
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
 
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
 
Breakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and SparkBreakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and Spark
 
Exactly once with spark streaming
Exactly once with spark streamingExactly once with spark streaming
Exactly once with spark streaming
 
Montreal User Group - Cloning Cassandra
Montreal User Group - Cloning CassandraMontreal User Group - Cloning Cassandra
Montreal User Group - Cloning Cassandra
 

More from DataStax

Is Your Enterprise Ready to Shine This Holiday Season?
Is Your Enterprise Ready to Shine This Holiday Season?Is Your Enterprise Ready to Shine This Holiday Season?
Is Your Enterprise Ready to Shine This Holiday Season?
DataStax
 
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...
DataStax
 
Running DataStax Enterprise in VMware Cloud and Hybrid Environments
Running DataStax Enterprise in VMware Cloud and Hybrid EnvironmentsRunning DataStax Enterprise in VMware Cloud and Hybrid Environments
Running DataStax Enterprise in VMware Cloud and Hybrid Environments
DataStax
 
Best Practices for Getting to Production with DataStax Enterprise Graph
Best Practices for Getting to Production with DataStax Enterprise GraphBest Practices for Getting to Production with DataStax Enterprise Graph
Best Practices for Getting to Production with DataStax Enterprise Graph
DataStax
 
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step JourneyWebinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
DataStax
 
Webinar | How to Understand Apache Cassandra™ Performance Through Read/Writ...
Webinar  |  How to Understand Apache Cassandra™ Performance Through Read/Writ...Webinar  |  How to Understand Apache Cassandra™ Performance Through Read/Writ...
Webinar | How to Understand Apache Cassandra™ Performance Through Read/Writ...
DataStax
 
Webinar | Better Together: Apache Cassandra and Apache Kafka
Webinar  |  Better Together: Apache Cassandra and Apache KafkaWebinar  |  Better Together: Apache Cassandra and Apache Kafka
Webinar | Better Together: Apache Cassandra and Apache Kafka
DataStax
 
Top 10 Best Practices for Apache Cassandra and DataStax Enterprise
Top 10 Best Practices for Apache Cassandra and DataStax EnterpriseTop 10 Best Practices for Apache Cassandra and DataStax Enterprise
Top 10 Best Practices for Apache Cassandra and DataStax Enterprise
DataStax
 
Introduction to Apache Cassandra™ + What’s New in 4.0
Introduction to Apache Cassandra™ + What’s New in 4.0Introduction to Apache Cassandra™ + What’s New in 4.0
Introduction to Apache Cassandra™ + What’s New in 4.0
DataStax
 
Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...
Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...
Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...
DataStax
 
Webinar | Aligning GDPR Requirements with Today's Hybrid Cloud Realities
Webinar  |  Aligning GDPR Requirements with Today's Hybrid Cloud RealitiesWebinar  |  Aligning GDPR Requirements with Today's Hybrid Cloud Realities
Webinar | Aligning GDPR Requirements with Today's Hybrid Cloud Realities
DataStax
 
Designing a Distributed Cloud Database for Dummies
Designing a Distributed Cloud Database for DummiesDesigning a Distributed Cloud Database for Dummies
Designing a Distributed Cloud Database for Dummies
DataStax
 
How to Power Innovation with Geo-Distributed Data Management in Hybrid Cloud
How to Power Innovation with Geo-Distributed Data Management in Hybrid CloudHow to Power Innovation with Geo-Distributed Data Management in Hybrid Cloud
How to Power Innovation with Geo-Distributed Data Management in Hybrid Cloud
DataStax
 
How to Evaluate Cloud Databases for eCommerce
How to Evaluate Cloud Databases for eCommerceHow to Evaluate Cloud Databases for eCommerce
How to Evaluate Cloud Databases for eCommerce
DataStax
 
Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...
Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...
Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...
DataStax
 
Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...
Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...
Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...
DataStax
 
Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...
Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...
Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...
DataStax
 
Datastax - The Architect's guide to customer experience (CX)
Datastax - The Architect's guide to customer experience (CX)Datastax - The Architect's guide to customer experience (CX)
Datastax - The Architect's guide to customer experience (CX)
DataStax
 
An Operational Data Layer is Critical for Transformative Banking Applications
An Operational Data Layer is Critical for Transformative Banking ApplicationsAn Operational Data Layer is Critical for Transformative Banking Applications
An Operational Data Layer is Critical for Transformative Banking Applications
DataStax
 
Becoming a Customer-Centric Enterprise Via Real-Time Data and Design Thinking
Becoming a Customer-Centric Enterprise Via Real-Time Data and Design ThinkingBecoming a Customer-Centric Enterprise Via Real-Time Data and Design Thinking
Becoming a Customer-Centric Enterprise Via Real-Time Data and Design Thinking
DataStax
 

More from DataStax (20)

Is Your Enterprise Ready to Shine This Holiday Season?
Is Your Enterprise Ready to Shine This Holiday Season?Is Your Enterprise Ready to Shine This Holiday Season?
Is Your Enterprise Ready to Shine This Holiday Season?
 
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...
 
Running DataStax Enterprise in VMware Cloud and Hybrid Environments
Running DataStax Enterprise in VMware Cloud and Hybrid EnvironmentsRunning DataStax Enterprise in VMware Cloud and Hybrid Environments
Running DataStax Enterprise in VMware Cloud and Hybrid Environments
 
Best Practices for Getting to Production with DataStax Enterprise Graph
Best Practices for Getting to Production with DataStax Enterprise GraphBest Practices for Getting to Production with DataStax Enterprise Graph
Best Practices for Getting to Production with DataStax Enterprise Graph
 
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step JourneyWebinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
 
Webinar | How to Understand Apache Cassandra™ Performance Through Read/Writ...
Webinar  |  How to Understand Apache Cassandra™ Performance Through Read/Writ...Webinar  |  How to Understand Apache Cassandra™ Performance Through Read/Writ...
Webinar | How to Understand Apache Cassandra™ Performance Through Read/Writ...
 
Webinar | Better Together: Apache Cassandra and Apache Kafka
Webinar  |  Better Together: Apache Cassandra and Apache KafkaWebinar  |  Better Together: Apache Cassandra and Apache Kafka
Webinar | Better Together: Apache Cassandra and Apache Kafka
 
Top 10 Best Practices for Apache Cassandra and DataStax Enterprise
Top 10 Best Practices for Apache Cassandra and DataStax EnterpriseTop 10 Best Practices for Apache Cassandra and DataStax Enterprise
Top 10 Best Practices for Apache Cassandra and DataStax Enterprise
 
Introduction to Apache Cassandra™ + What’s New in 4.0
Introduction to Apache Cassandra™ + What’s New in 4.0Introduction to Apache Cassandra™ + What’s New in 4.0
Introduction to Apache Cassandra™ + What’s New in 4.0
 
Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...
Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...
Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...
 
Webinar | Aligning GDPR Requirements with Today's Hybrid Cloud Realities
Webinar  |  Aligning GDPR Requirements with Today's Hybrid Cloud RealitiesWebinar  |  Aligning GDPR Requirements with Today's Hybrid Cloud Realities
Webinar | Aligning GDPR Requirements with Today's Hybrid Cloud Realities
 
Designing a Distributed Cloud Database for Dummies
Designing a Distributed Cloud Database for DummiesDesigning a Distributed Cloud Database for Dummies
Designing a Distributed Cloud Database for Dummies
 
How to Power Innovation with Geo-Distributed Data Management in Hybrid Cloud
How to Power Innovation with Geo-Distributed Data Management in Hybrid CloudHow to Power Innovation with Geo-Distributed Data Management in Hybrid Cloud
How to Power Innovation with Geo-Distributed Data Management in Hybrid Cloud
 
How to Evaluate Cloud Databases for eCommerce
How to Evaluate Cloud Databases for eCommerceHow to Evaluate Cloud Databases for eCommerce
How to Evaluate Cloud Databases for eCommerce
 
Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...
Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...
Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...
 
Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...
Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...
Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...
 
Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...
Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...
Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...
 
Datastax - The Architect's guide to customer experience (CX)
Datastax - The Architect's guide to customer experience (CX)Datastax - The Architect's guide to customer experience (CX)
Datastax - The Architect's guide to customer experience (CX)
 
An Operational Data Layer is Critical for Transformative Banking Applications
An Operational Data Layer is Critical for Transformative Banking ApplicationsAn Operational Data Layer is Critical for Transformative Banking Applications
An Operational Data Layer is Critical for Transformative Banking Applications
 
Becoming a Customer-Centric Enterprise Via Real-Time Data and Design Thinking
Becoming a Customer-Centric Enterprise Via Real-Time Data and Design ThinkingBecoming a Customer-Centric Enterprise Via Real-Time Data and Design Thinking
Becoming a Customer-Centric Enterprise Via Real-Time Data and Design Thinking
 

Recently uploaded

Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume MontevideoVitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke
 
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdfEnhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Jay Das
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
Adele Miller
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns
 
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.ILBeyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Natan Silnitsky
 
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamOpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
takuyayamamoto1800
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
abdulrafaychaudhry
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
Globus
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Globus
 
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Globus
 
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
Tier1 app
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
rickgrimesss22
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Globus
 
Corporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMSCorporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMS
Tendenci - The Open Source AMS (Association Management Software)
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
Google
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus
 
RISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent EnterpriseRISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent Enterprise
Srikant77
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Globus
 
Graphic Design Crash Course for beginners
Graphic Design Crash Course for beginnersGraphic Design Crash Course for beginners
Graphic Design Crash Course for beginners
e20449
 

Recently uploaded (20)

Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume MontevideoVitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume Montevideo
 
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdfEnhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
 
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.ILBeyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
 
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamOpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
 
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBroker
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
 
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
 
Corporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMSCorporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMS
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
 
RISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent EnterpriseRISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent Enterprise
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
 
Graphic Design Crash Course for beginners
Graphic Design Crash Course for beginnersGraphic Design Crash Course for beginners
Graphic Design Crash Course for beginners
 

Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentric AG) | C* Summit 2016

  • 1. Lessons Learned with Cassandra & Spark_ Stephan Kepser Matthias Niehoff Cassandra Summit 2016 @matthiasniehoff @codecentric 1
  • 4. ! Primary key defines access way to a table ! It cannot be changed after the table is filled with data. ! Any change of the primary key, 
 like, e.g, adding a column,
 requires a data migration. Data modeling: Primary key
  • 5. ! Well known:
 If you delete a column or whole row,
 the data is not really deleted.
 Rather a tombstone is created to mark the deletion. ! Much later tombstones are removed during compactions. Data modeling: Deletions
  • 6. ! Updates don’t create tombstones, 
 data is overwritten in situ. ! Exception: maps, lists, sets ! Use frozen<list> instead. Unexpected Tombstones: Built-in Maps, Lists, Sets
  • 7. ! sstable2json shows details of sstable files in json format. ! Usage: go to /var/lib/cassandra/data/keyspace/table ! > sstable2json *-Data.db ! You can actually see what’s in the individual rows of the data files. ! sstabledump in 3.6 Debug tool: sstable2json
  • 8. Example CREATE TABLE customer_cache.tenant ( name text PRIMARY KEY, status text ) select * from tenant ; name | status ------+-------- ru | ACTIVE es | ACTIVE jp | ACTIVE vn | ACTIVE pl | ACTIVE cz | ACTIVE
  • 9. Example {"key": "ru", "cells": [["status","ACTIVE",1464344127007511]]}, {"key": "it", "cells": [[„status“,"ACTIVE",1464344146457930, T]]}, {"key": "de", "cells": [["status","ACTIVE",1464343910541463]]}, {"key": "ro", "cells": [["status","ACTIVE",1464344151160601]]}, {"key": "fr", "cells": [["status","ACTIVE",1464344072061135]]}, {"key": "cn", "cells": [["status","ACTIVE",1464344083085247]]}, {"key": "kz", "cells": [["status","ACTIVE",1467190714345185]]} deletion marker
  • 10. ! Strategy to reduce partition size. ! Actually, general strategy in databases to distribute a table. ! In Cassandra, each partition of a table can be distributed over a set of buckets. ! Example: A train departure time table
 is bucketed over the hours of the day. ! The bucket becomes part of the partition key. ! It must be easily calculable. Bucketing
  • 11. ! Bulk read or write operations should not be performed synchronously. ! It makes no sense to wait for each read in the bulk to complete individually. ! Instead ! send the requests asynchronously to Cassandra, and ! collect the answers. Bulk Reads or Writes: Asynchronously
  • 12. Example Session session = cc.openSession();
 PreparedStatement getEntries = session.prepare("SELECT * FROM " + keyspace + „." + table +" WHERE tenant=? AND core_system=? AND change_type = ? AND seconds = ? AND bucket = ? „); private List<ResultSetFuture> sendQueries(Collection<Object[]> partitionKeys) {
 List<ResultSetFuture> futures = Lists.newArrayListWithExpectedSize(partitionKeys.size());
 for (Object[] keys : partitionKeys) {
 futures.add(session.executeAsync(getEntries.bind(keys)));
 }
 return futures;
 } 

  • 13. Example private void processAsyncResults(List<ResultSetFuture> futures) {
 for (ListenableFuture<ResultSet> future : Futures.inCompletionOrder(futures)) {
 ResultSet rs = future.get();
 if (rs.getAvailableWithoutFetching() > 0 || rs.one() != null) {
 // do your program logic here } } } 

  • 14. Cassandra as Streaming Source table of consolidated entriesEvent Source
  • 15. ! Event log fills over time ! Requires fine grained time based partitions to avoid huge partition size. ! Spark jobs read latest events. ! Access path: time stamp ! Jobs have to read many partitions. Cassandra as Streaming Source
  • 16. ! Pull method requires frequent reads of empty partitions, because no new data arrived. ! Impossible to re-read the whole table: 
 by far too many partitions to read
 (our case: 900,000 per day -> 30 millions for a month)
 Unsuitable for λ-architecture ! Eventual consistency ! When can you be sure that you read all events?
 — Never ! Might improve with CDC Cassandra is unsuitable as a time-based event source.
  • 17. ! One keyspace per tenant? ! One (set of) table(s) per tenant? ! Our option: Table per tenant ! Feasible only for limited number of tenants (~1000) Separating Data of Different Tenants
  • 18. ! Switch on monitoring ! ELK, OpsCenter, self built, .... ! Avoid Log level debug for C* messages ! Drowning in irrelevant messages ! Substantial performance drawback ! Log level info for development, pre-production ! Log level error in production sufficient Monitoring
  • 19. ! When writing data,
 Cassandra never checks if there is enough space left on disk. ! It keeps writing data till the disk is full. ! This can bring the OS to a halt. ! Cassandra error messages are confusing at this point. ! Thus monitoring disk space is mandatory. Monitoring: Disk Space
  • 20. ! Cassandra requires a lot of disk space for compaction. ! For SizeTieredCompaction reserve as much disk space as Cassandra needs to store the data. ! Set-up monitoring on disk space. ! Receive an alert if the data carrying disk partition fills up to 50%. ! Add nodes to the cluster and rebalance. Monitoring: Disk Space
  • 22. repartitionByCassandraReplica Node 1 Node 2 Node 3 Node 4 1-25 26-5051-75 76-0
  • 23. repartitionByCassandraReplica Node 1 Node 2 Node 3 Node 4 1-25 26-5051-75 76-0 some tasks took ~3s longer..
  • 24. ! Watch for Spark Locality Level ! aim for process or node local ! avoid any Spark locality
  • 25. ! spark job does not run on every C* node ! # spark nodes < # cassandra nodes ! # job cores < # cassandra nodes ! spark job cores all on one node ! time for repartition > time saving through locality Do not use repartitionByCassandraReplica when ...
  • 26. ! one query per partition key ! one query at a time per executor joinWithCassandraTable Spark Cassandra t t+1 t+2 t+3 t+4 t+5
  • 27. ! parallel async queries joinWithCassandraTable Spark Cassandra t t+1 t+2 t+3 t+4 t+5
  • 28. ! built a custom async implementation joinWithCassandraTable someDStream.transformToPair(rdd -> { return rdd.mapPartitionsToPair(iterator -> { ... Session session = cc.openSession()) { while (iterator.hasNext()) { ... session.executeAsync(..) } [collect futures] return List<Tuple2<Left,Optional<Right>>> }); });
  • 29. ! solved with SPARKC-233 
 (1.6.0 / 1.5.1 / 1.4.3) ! 5-6 times faster 
 than sync implementation! joinWithCassandraTable
  • 30. ! joinWithCassandraTable is a full inner join Left join with Cassandra RDD C*
  • 31. ! Might include shuffle --> quite expensive Left join with Cassandra RDD C*join = RDD substract = RDD + RDD=
  • 32. Left join with Cassandra ! built a custom async implementation
 someDStream.transformToPair(rdd -> { return rdd.mapPartitionsToPair(iterator -> { ... Session session = cc.openSession()) { while (iterator.hasNext()) { ... session.executeAsync(..) ... } [collect futures] return List<Tuple2<Left,Optional<Right>>> }); });
  • 33. ! solved with SPARKC-1.81
 (2.0.0) ! basically uses async joinWithC* 
 implementation Left join with Cassandra
  • 34. ! spark.cassandra.connection.keep_alive_ms ! Default: 5s ! Streaming Batch Size > 5s ! Open Connection for every new batch ! Should be multiple times the streaming interval! Connection keep alive
  • 35. ! cache saves performance by preventing recalculation ! it also helps you with regards to correctness! Cache! Not only for performance val changedStream = someDStream.map(e -> someMethod(e)).cache() changedStream.saveToCassandra("keyspace","table1") changedStream.saveToCassandra("keyspace","table1") ChangedEntry someMethod(Entry e) { return new ChangedEntry(new Date(),...); }
  • 36. • Know the most important internals • Know your tools • Monitor your cluster • Use existing knowledge resources • Use the mailing lists • Participate in the community Summary