We built an application based on the principles of CQRS and Event Sourcing, using Cassandra and Spark. During the project we encountered a number of challenges and problems with Cassandra and the Spark Cassandra Connector.
In this talk we outline a few of those problems and the actions we took to solve them. While some problems are specific to CQRS and Event Sourcing applications, most of them are use-case independent.
About the Speakers
Matthias Niehoff IT-Consultant, codecentric AG
works as an IT consultant at codecentric AG in Germany. His focus is on big data & streaming applications with Apache Cassandra & Apache Spark. Yet he does not lose track of other tools in the big data area. Matthias shares his experiences at conferences, meetups, and user groups.
Stephan Kepser Senior IT Consultant and Data Architect, codecentric AG
Dr. Stephan Kepser is an expert on cloud computing and big data. He has written several journal articles and blog posts on subjects from both fields. His interests range from legal questions to the architecture and design of cloud computing and big data systems, down to the technical details of NoSQL databases.
4. ! The primary key defines the access path to a table.
! It cannot be changed after the table is filled with data.
! Any change to the primary key,
e.g., adding a column,
requires a data migration.
Data modeling: Primary key
5. ! Well known:
If you delete a column or a whole row,
the data is not really deleted.
Rather, a tombstone is created to mark the deletion.
! Tombstones are only removed much later,
during compactions.
Data modeling: Deletions
6. ! Updates don’t create tombstones;
data is overwritten in place.
! Exception: non-frozen maps, lists, and sets.
! Use frozen<list> (and its frozen counterparts for maps and sets) instead.
Unexpected Tombstones: Built-in Maps, Lists, Sets
7. ! sstable2json shows the contents of SSTable files in JSON
format.
! Usage: go to /var/lib/cassandra/data/keyspace/table
! > sstable2json *-Data.db
! You can actually see what’s in the individual rows of the
data files.
! Replaced by sstabledump as of Cassandra 3.6.
Debug tool: sstable2json
8. Example
CREATE TABLE customer_cache.tenant (
    name text PRIMARY KEY,
    status text
);

SELECT * FROM tenant;

 name | status
------+--------
   ru | ACTIVE
   es | ACTIVE
   jp | ACTIVE
   vn | ACTIVE
   pl | ACTIVE
   cz | ACTIVE
10. ! A strategy to reduce partition size.
! Actually a general strategy in databases to distribute a
table.
! In Cassandra, each partition of a table can be
split over a set of buckets.
! Example: a table of train departure times
is bucketed over the hours of the day.
! The bucket becomes part of the partition key.
! It must be easy to calculate.
Bucketing
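As a sketch of what “easily calculable” means: the snippet below derives an hour-of-day bucket from an event timestamp, which can then be used as part of the partition key. The class and method names are illustrative, not from the original application.

```java
import java.time.Instant;
import java.time.ZoneOffset;

public class Bucketing {

    // Illustrative bucket function: the hour of the day (0-23),
    // derived from the event timestamp in UTC. Cheap to compute,
    // so every client can reconstruct the full partition key.
    static int bucketFor(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis)
                      .atZone(ZoneOffset.UTC)
                      .getHour();
    }

    public static void main(String[] args) {
        long ts = Instant.parse("2016-06-01T08:30:00Z").toEpochMilli();
        System.out.println(bucketFor(ts)); // prints 8
    }
}
```

Because both writer and reader can recompute the bucket from the timestamp alone, no lookup table is needed to locate a partition.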
11. ! Bulk read or write operations should not be performed
synchronously.
! It makes no sense to wait for each individual read in the
bulk to complete.
! Instead:
! send the requests asynchronously to Cassandra, and
! collect the answers.
Bulk Reads or Writes: Asynchronously
12. Example
Session session = cc.openSession();
PreparedStatement getEntries = session.prepare(
    "SELECT * FROM " + keyspace + "." + table +
    " WHERE tenant = ? AND core_system = ? AND change_type = ?" +
    " AND seconds = ? AND bucket = ?");

private List<ResultSetFuture> sendQueries(Collection<Object[]> partitionKeys) {
    List<ResultSetFuture> futures =
        Lists.newArrayListWithExpectedSize(partitionKeys.size());
    for (Object[] keys : partitionKeys) {
        futures.add(session.executeAsync(getEntries.bind(keys)));
    }
    return futures;
}
13. Example
private void processAsyncResults(List<ResultSetFuture> futures) {
    for (ListenableFuture<ResultSet> future : Futures.inCompletionOrder(futures)) {
        // getUnchecked avoids the checked exceptions of future.get()
        ResultSet rs = Futures.getUnchecked(future);
        Row row = rs.one();
        if (row != null) {
            // do your program logic here
        }
    }
}
15. ! The event log fills up over time.
! It requires fine-grained, time-based partitions to avoid
huge partitions.
! Spark jobs read the latest events.
! Access path: time stamp.
! Jobs have to read many partitions.
Cassandra as Streaming Source
16. ! The pull method requires frequent reads of empty
partitions, because no new data has arrived.
! Impossible to re-read the whole table:
by far too many partitions to read
(our case: 900,000 per day, i.e. about 27 million per month).
! Unsuitable for a λ-architecture.
! Eventual consistency:
when can you be sure that you have read all events?
— Never.
! Might improve with CDC.
Cassandra is unsuitable as a time-based event source.
17. ! One keyspace per tenant?
! One (set of) table(s) per tenant?
! Our choice: one table per tenant.
! Feasible only for a limited number of tenants (~1,000).
Separating Data of Different Tenants
18. ! Switch on monitoring.
! ELK, OpsCenter, self-built, ...
! Avoid log level DEBUG for C* messages:
! you will drown in irrelevant messages,
! and there is a substantial performance penalty.
! Log level INFO for development and pre-production.
! Log level ERROR is sufficient in production.
Monitoring
19. ! When writing data,
Cassandra never checks whether there is enough space left
on disk.
! It keeps writing data until the disk is full.
! This can bring the OS to a halt.
! Cassandra’s error messages are confusing at this point.
! Thus monitoring disk space is mandatory.
Monitoring: Disk Space
20. ! Cassandra requires a lot of disk space for compaction.
! For SizeTieredCompactionStrategy, reserve as much free
disk space as Cassandra needs to store the data.
! Set up monitoring on disk space.
! Receive an alert when the data-carrying disk partition fills
up to 50%.
! Then add nodes to the cluster and rebalance.
Monitoring: Disk Space
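A minimal sketch of such a check using only the JDK; the 50% threshold is taken from the slides, while the class name and the idea of polling the data directory are assumptions (a real deployment would wire this into the alerting system):

```java
import java.io.File;

public class DiskSpaceCheck {

    // Threshold from the slides: react when the data-carrying
    // partition is 50% full, so compaction still has room to work.
    static final double ALERT_RATIO = 0.5;

    // Fraction of the partition holding `path` that is already used.
    static double usedRatio(File path) {
        long total = path.getTotalSpace();
        if (total == 0) {
            return 0.0; // path does not name a partition
        }
        return (double) (total - path.getUsableSpace()) / total;
    }

    static boolean shouldAlert(File path) {
        return usedRatio(path) >= ALERT_RATIO;
    }

    public static void main(String[] args) {
        // Hypothetical default; pass the real data directory in production.
        File dataDir = new File(args.length > 0 ? args[0] : "/");
        System.out.printf("used: %.0f%%, alert: %b%n",
                100 * usedRatio(dataDir), shouldAlert(dataDir));
    }
}
```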
24. ! Watch the Spark locality level:
! aim for PROCESS_LOCAL or NODE_LOCAL,
! avoid ANY.
Spark locality
25. ! the Spark job does not run on every C* node:
! # Spark nodes < # Cassandra nodes
! # job cores < # Cassandra nodes
! all Spark job cores are on one node
! the time for repartitioning exceeds the time saved through locality
Do not use repartitionByCassandraReplica when ...
26. ! one query per partition key
! one query at a time per executor
joinWithCassandraTable
(Diagram: a Spark executor issues its queries to Cassandra one after another, serialized over t, t+1, ..., t+5.)
35. ! cache() saves time by preventing recalculation.
! It also helps you with correctness!
Cache! Not only for performance
val changedStream = someDStream.map(e => someMethod(e)).cache()
changedStream.saveToCassandra("keyspace", "table1")
changedStream.saveToCassandra("keyspace", "table2")

def someMethod(e: Entry): ChangedEntry =
  new ChangedEntry(new Date(), ...)
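The correctness point can be illustrated without Spark: when the map function is non-deterministic (like `new Date()` above), every action on an uncached stream recomputes it, so two writes of "the same" stream see different values; caching materializes the result once. A minimal sketch with a plain supplier standing in for the DStream (all names are illustrative):

```java
import java.util.List;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class CacheDemo {

    static long counter = 0;

    // Stands in for someMethod(e): the output depends on mutable
    // state (a counter here, `new Date()` in the slides).
    static String transform(String e) {
        return e + "-" + (counter++);
    }

    public static void main(String[] args) {
        // Uncached: every consumer re-runs the map function.
        Supplier<List<String>> uncached = () ->
            Stream.of("a", "b")
                  .map(CacheDemo::transform)
                  .collect(Collectors.toList());

        List<String> firstWrite  = uncached.get(); // [a-0, b-1]
        List<String> secondWrite = uncached.get(); // [a-2, b-3]
        System.out.println(firstWrite.equals(secondWrite)); // prints false

        // "Cached": materialize the result once, reuse it for both writes.
        List<String> cached = uncached.get();
        List<String> thirdWrite  = cached;
        List<String> fourthWrite = cached;
        System.out.println(thirdWrite.equals(fourthWrite)); // prints true
    }
}
```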
36. • Know the most important internals
• Know your tools
• Monitor your cluster
• Use existing knowledge resources
• Use the mailing lists
• Participate in the community
Summary