Cassandra Data Maintenance with Spark


1,218 views

Published on

Most people hear "Spark" and think "Analytics". But Spark's ability to efficiently distribute and manage a full-table traversal while functionally transforming the data makes it perfectly suited to executing "Big Data" maintenance jobs.

Published in: Technology


  1. CASSANDRA DATA MAINTENANCE WITH SPARK
     Operate on your Data
  2. WHAT IS SPARK?
     A large-scale data processing framework
  3. STEP 1: Make Fake Data
     (unless you have a million records to spare)
  4. def create_fake_record( num: Int ) = {
       (num,
        1453389992000L + num,
        s"My Token $num",
        s"My Session Data $num")
     }
     sc.parallelize(1 to 1000000)
       .map( create_fake_record )
       .repartitionByCassandraReplica("maintdemo", "oauth_cache", 10)
       .saveToCassandra("maintdemo", "oauth_cache")
  5. THREE BASIC PATTERNS
     • Read - Transform - Write (1:1) - .map()
     • Read - Transform - Write (1:m) - .flatMap()
     • Read - Filter - Delete (m:1) - it’s complicated
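With the spark-cassandra-connector, the first two patterns might be sketched like this; the keyspace, table, and column names are invented for illustration:

```scala
import com.datastax.spark.connector._

// 1:1 - read, transform each row, write it back (.map)
sc.cassandraTable("maintdemo", "oauth_cache")
  .map(row => (row.getInt("userid"), row.getLong("last_access"), row.getString("token").trim))
  .saveToCassandra("maintdemo", "oauth_cache", SomeColumns("userid", "last_access", "token"))

// 1:m - each input row fans out into several output rows (.flatMap)
sc.cassandraTable("maintdemo", "oauth_cache")
  .flatMap(row => row.getString("session_data").split(";")
    .map(event => (row.getInt("userid"), event)))
  .saveToCassandra("maintdemo", "session_events", SomeColumns("userid", "event"))
```

The third pattern, deleting, is what the next slides are about.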
  6. DELETES ARE TRICKY
  7. DELETES ARE TRICKY
     • Keep tombstones in mind
     • Select the records you want to delete, then loop over those and issue deletes through the driver
     • OR select the records you want to keep, rewrite them, then delete the partitions they lived in… IN THE PAST…
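The "select, then delete through the driver" approach could be sketched like this, assuming the last_access predicate can be pushed down; all keyspace, table, and column names are illustrative:

```scala
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnector

val cutoffMillis = System.currentTimeMillis - 86400000L  // e.g. older than a day
val connector = CassandraConnector(sc.getConf)           // serializable, safe to ship to executors

sc.cassandraTable("maintdemo", "oauth_cache")
  .where("last_access < ?", cutoffMillis)
  .foreachPartition { rows =>
    connector.withSessionDo { session =>
      val stmt = session.prepare(
        "DELETE FROM maintdemo.oauth_cache WHERE userid = ? AND last_access = ?")
      rows.foreach { r =>
        session.execute(stmt.bind(
          r.getInt("userid"): java.lang.Integer,
          r.getLong("last_access"): java.lang.Long))
      }
    }
  }
```

Each executor issues its own point deletes through the Cassandra driver, so the work stays distributed instead of funneling every row back to the Spark driver program.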
  8. DELETING
  9. PREDICATE PUSHDOWN
     • Use Cassandra-level filtering at every opportunity
     • With DSE, benefit from predicate pushdown to solr_query
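The difference is easiest to see side by side. A sketch, assuming the user_visits table from the later slides (where last_access is a clustering column) and a precomputed cutoffMillis timestamp:

```scala
// Pushed down: the predicate becomes part of the CQL query, so only
// matching rows leave Cassandra (works on clustering/indexed columns)
sc.cassandraTable("maintdemo", "user_visits")
  .where("last_access < ?", cutoffMillis)

// Not pushed down: a full table scan, with the filtering done in Spark
sc.cassandraTable("maintdemo", "user_visits")
  .filter(_.getLong("last_access") < cutoffMillis)

// With DSE Search, a Solr query can ride along the same way
sc.cassandraTable("maintdemo", "user_visits")
  .where("solr_query = 'last_access:[* TO 1453389992000]'")
```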
  10. GOTCHAS
      • Null fields
      • Writing jobs which aren’t, or can’t be, distributed
  11. TIPS & TRICKS
      • .spanBy( partitionKey ) - work on one Cassandra partition at a time
      • .repartitionByCassandraReplica()
      • Tune spark.cassandra.output.throughput_mb_per_sec to throttle writes
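The throttle is a plain Spark property set when the context is built; the 5 MB/s figure below is arbitrary:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Cap the connector's write throughput (per executor core) so a bulk
// maintenance job does not starve live traffic
val conf = new SparkConf()
  .setAppName("maintenance-job")
  .set("spark.cassandra.output.throughput_mb_per_sec", "5")
val sc = new SparkContext(conf)
```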
  12. USE CASE: CACHE MAINTENANCE
  13. USE CASE: TRIM USER HISTORY
      • Cassandra Data Model: PRIMARY KEY( userid, last_access )
      • Keep last X records
      • .spanBy( partitionKey ), then flatMap a filtered Seq
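A sketch of the trim, using spanBy to process one user's partition at a time; keepLast and the table names are illustrative:

```scala
import com.datastax.spark.connector._

val keepLast = 10

// Group rows by Cassandra partition, sort each user's visits newest-first,
// and keep everything beyond the last X as deletion candidates
val expired = sc.cassandraTable("maintdemo", "user_visits")
  .spanBy(row => row.getInt("userid"))
  .flatMap { case (_, rows) =>
    rows.toSeq.sortBy(r => -r.getLong("last_access")).drop(keepLast)
  }
```

The resulting RDD of stale rows can then be deleted through the driver, as slide 7 describes. spanBy works here because cassandraTable returns each partition's rows contiguously.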
  14. USE CASE: PUBLISH DATA
      • Cassandra Data Model: publish_date field
      • Filter by date, map to a new RDD matching the destination, saveToCassandra()
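A sketch, assuming a staging table whose publish_date column can be filtered server-side and a destination table with a matching schema; all names are invented:

```scala
import com.datastax.spark.connector._

val today = java.time.LocalDate.now.toString   // e.g. "2016-01-21"

sc.cassandraTable("staging", "articles")
  .where("publish_date <= ?", today)
  .map(row => (row.getInt("article_id"), row.getString("title"), row.getString("body")))
  .saveToCassandra("published", "articles", SomeColumns("article_id", "title", "body"))
```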
  15. USE CASE: MULTITENANT BACKUP AND RECOVERY
      • Cassandra Data Model: PRIMARY KEY((tenant_id, other_partition_key), other_cluster, …)
      • Backup: filter for tenant_id and .foreach() write to an external location
      • Recovery: read the backup and upsert
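A sketch of both directions; the tenant_data table, the backup path, and parseBackupLine are all hypothetical (the parser would need to match whatever format the backup step wrote):

```scala
import com.datastax.spark.connector._

val tenant = "acme"

// Backup: select one tenant's rows and write them outside Cassandra
sc.cassandraTable("maintdemo", "tenant_data")
  .filter(_.getString("tenant_id") == tenant)  // or .where(...) when pushdown applies
  .map(_.toString)
  .saveAsTextFile(s"/backups/$tenant")

// Recovery: read the dump back and upsert. Cassandra writes are upserts,
// so re-saving the rows restores them in place
sc.textFile(s"/backups/$tenant")
  .map(parseBackupLine)                        // hypothetical line parser
  .saveToCassandra("maintdemo", "tenant_data")
```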
