Cassandra at Lithium
Paul Cichonski, Senior Software Engineer
@paulcichonski
Lithium?
• Helping companies build social communities for their customers
• Founded in 2001
• ~300 customers
• ~84 million users
• ~5 million unique logins in past 20 days

2
Use Case: Notification Service
1. Stores subscriptions
2. Processes community events
3. Generates notifications when events match against subscriptions
4. Builds user activity feed out of notifications

3
Notification Service – System View

4
The Cluster (v1.2.6)
• 4 nodes, each node:
– CentOS 6.4
– 8 cores, 2TB for commit-log, 3x 512GB SSD for data

• Average writes/s: 100-150, peak: 2000
• Average reads/s: 100, peak: 1500
• Use Astyanax on client-side

5
Data Model

6
Data Model: Subscription Fulfillment

(figure callouts: "identifies target of subscription"; "identifies entity that is subscribed")

7
standard_subscription_index row
stored as:
row key: 66edfdb7-6ff7-458c-94a8-421627c1b6f5:message:13
columns:
  user:2:creationtimestamp  -> 1390939665
  user:53:creationtimestamp -> 1390939670
  user:88:creationtimestamp -> 1390939660

maps to (cqlsh):
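
A minimal CQL 3 sketch that would roughly produce the storage layout above (table and column names and types here are assumptions, not the production schema):

    CREATE TABLE standard_subscription_index (
        target            text,    -- row key, e.g. '66edfdb7-6ff7-458c-94a8-421627c1b6f5:message:13'
        subscriber_type   text,    -- e.g. 'user'
        subscriber_id     text,    -- e.g. '2'
        creationtimestamp bigint,  -- e.g. 1390939665
        PRIMARY KEY (target, subscriber_type, subscriber_id)
    );

    -- fulfillment read: one wide-row slice returns every subscriber
    -- for the entity an event touched
    SELECT subscriber_type, subscriber_id, creationtimestamp
    FROM standard_subscription_index
    WHERE target = '66edfdb7-6ff7-458c-94a8-421627c1b6f5:message:13';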

8
Data Model: Subscription Display (time series)

9
subscriptions_for_entity_by_time row
stored as:
row key: 66edfdb7-6ff7-458c-94a8-421627c1b6f5:user:2:0
columns:
  1390939670:label:testlabel
  1390939665:board:53
  1390939660:message:13

maps to (cqlsh):
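
A sketch of a matching CQL 3 definition for the time-series view (names are assumptions); a reversed clustering order keeps the newest subscriptions first, as in the layout above:

    CREATE TABLE subscriptions_for_entity_by_time (
        entity      text,    -- row key, e.g. '66edfdb7-6ff7-458c-94a8-421627c1b6f5:user:2:0'
        created     bigint,  -- subscription creation time
        target_type text,    -- 'message' | 'board' | 'label'
        target_id   text,
        PRIMARY KEY (entity, created, target_type, target_id)
    ) WITH CLUSTERING ORDER BY (created DESC, target_type ASC, target_id ASC);

    -- newest-first page of a user's subscriptions
    SELECT created, target_type, target_id
    FROM subscriptions_for_entity_by_time
    WHERE entity = '66edfdb7-6ff7-458c-94a8-421627c1b6f5:user:2:0'
    LIMIT 20;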

10
Data Model: Subscription Display (content browsing)

11
subscriptions_for_entity_by_type row
stored as:
row key: 66edfdb7-6ff7-458c-94a8-421627c1b6f5:user:2
columns:
  message:13:creationtimestamp      -> 1390939660
  board:53:creationtimestamp        -> 1390939665
  label:testlabel:creationtimestamp -> 1390939670

maps to (cqlsh):
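
A sketch of a matching CQL 3 definition for the content-browsing view (names are assumptions); "is user 2 subscribed to message 13?" becomes a single point read:

    CREATE TABLE subscriptions_for_entity_by_type (
        entity            text,    -- row key, e.g. '66edfdb7-6ff7-458c-94a8-421627c1b6f5:user:2'
        target_type       text,    -- 'message' | 'board' | 'label'
        target_id         text,
        creationtimestamp bigint,
        PRIMARY KEY (entity, target_type, target_id)
    );

    SELECT creationtimestamp
    FROM subscriptions_for_entity_by_type
    WHERE entity = '66edfdb7-6ff7-458c-94a8-421627c1b6f5:user:2'
      AND target_type = 'message' AND target_id = '13';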

12
Data Model: Activity Feed (fan-out writes)

JSON blob representing activity

13
activity_for_entity row
stored as:
row key: 66edfdb7-6ff7-458c-94a8-421627c1b6f5:user:2:0
columns:
  31aac580-8550-11e3-ad74-000c29351b9d:moderationAction:event_summary -> {moderation_json}
  f4efd590-82ca-11e3-ad74-000c29351b9d:badge:event_summary            -> {badge_json}
  1571b680-7254-11e3-8d70-000c29351b9d:kudos:event_summary            -> {kudos_json}

maps to (cqlsh):
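
A sketch of a matching CQL 3 definition for the fan-out feed (names are assumptions); the timeuuid clustering column keeps the feed in event-time order and the JSON blob sits in a regular column:

    CREATE TABLE activity_for_entity (
        entity        text,      -- row key, e.g. '66edfdb7-6ff7-458c-94a8-421627c1b6f5:user:2:0'
        event_id      timeuuid,  -- e.g. 31aac580-8550-11e3-ad74-000c29351b9d
        event_type    text,      -- 'moderationAction' | 'badge' | 'kudos'
        event_summary text,      -- JSON blob representing the activity
        PRIMARY KEY (entity, event_id, event_type)
    );

    -- most recent activity for a user's feed
    SELECT event_id, event_type, event_summary
    FROM activity_for_entity
    WHERE entity = '66edfdb7-6ff7-458c-94a8-421627c1b6f5:user:2:0'
    ORDER BY event_id DESC LIMIT 50;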

14
Migration Strategy
(mysql → cassandra)

15
Data Migration: Trust, but Verify
Fully repeatable due to idempotent writes.

1) Bulk migrate all subscription data (HTTP, lia → NS)

2) Consistency check all subscription data (HTTP, lia → NS) – also runs after migration to verify shadow-writes

16
Verify: Consistency Checking

17
Subscription Write Strategy
Reads for subscription fulfillment happen in NS.
Reads for UI fulfilled by legacy mysql (temporary).

(architecture diagram: user, lia, activemq, Notification Service, mysql, Cassandra; subscription_write and subscription_write (shadow_write) flows crossing the NS system boundary)

18
Path to Production: QA Issue #1
(many writes to same row kill cluster)

19
Problem: CQL INSERTs
Single thread SLOW, even with BATCH
(multiple-second latency for writing chunks of 1000 subscriptions)
Largest customer (~20 million subscriptions) would have taken weeks to migrate
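
For reference, the single-threaded load looked roughly like the following batched inserts (a sketch reusing the assumed column names from the data-model sketches above); even batched, a chunk of 1000 subscriptions took multiple seconds:

    BEGIN BATCH
      INSERT INTO standard_subscription_index (target, subscriber_type, subscriber_id, creationtimestamp)
        VALUES ('66edfdb7-6ff7-458c-94a8-421627c1b6f5:message:13', 'user', '2', 1390939660);
      INSERT INTO subscriptions_for_entity_by_time (entity, created, target_type, target_id)
        VALUES ('66edfdb7-6ff7-458c-94a8-421627c1b6f5:user:2:0', 1390939660, 'message', '13');
      INSERT INTO subscriptions_for_entity_by_type (entity, target_type, target_id, creationtimestamp)
        VALUES ('66edfdb7-6ff7-458c-94a8-421627c1b6f5:user:2', 'message', '13', 1390939660);
      -- ...repeated for each of the ~1000 subscriptions in the chunk, 3 CFs each...
    APPLY BATCH;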

20
Just Use More Threads? Not Quite

21
Cluster Essentially Died

22
Mutations Could Not Keep Up

23
Solution: Work Closer to Storage Layer
Work here (the storage view):

row key: 66edfdb7-6ff7-458c-94a8-421627c1b6f5:message:13
columns:
  user:2:creationtimestamp  -> 1390939665
  user:53:creationtimestamp -> 1390939670
  user:88:creationtimestamp -> 1390939660

Not here (the CQL view):

24
Solution: Thrift batch_mutate

More details: http://thelastpickle.com/blog/2013/09/13/CQL3-to-Astyanax-Compatibility.html
Allowed us to write 200,000 subscriptions to 3 CFs in ~45 seconds with almost no impact on the cluster.
NOTE: supposedly fixed in 2.0: CASSANDRA-4693
25
Path to Production: QA Issue #2
(read timeouts)

26
Tombstone Buildup and Timeouts

CF holding notification settings rewritten every 30 minutes
Eventually tombstone build-up caused reads to time out

27
Solution
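
Based on the gc_grace_seconds / compaction notes for this slide, one plausible shape of the fix (table name and value are assumptions, not the actual change) is to drop gc_grace_seconds on the frequently rewritten CF so compaction can purge tombstones quickly:

    -- safe here only because the data is fully rewritten every 30 minutes,
    -- so re-appearing deletes are not a concern (default gc_grace_seconds is 864000 = 10 days)
    ALTER TABLE notification_settings WITH gc_grace_seconds = 3600;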

28
Production Issue #1
(dead cluster)

29
Hard Drive Failure on All Nodes
4 days after release, we started seeing this in /var/log/cassandra/system.log

After following a bunch of dead ends, we also found this in /var/log/messages

This cascaded to all nodes and within an hour, cluster was dead

30
TRIM Support to the Rescue

* http://www.slideshare.net/rbranson/cassandra-and-solid-state-drives

31
Production Issue #2
(repair causing tornadoes of destruction)

32
Activity Feed Data Explosion
• Activity data written with a TTL of 30 days (see the sketch below).
• Users in the 99th percentile were receiving multiple thousands of writes per day.
• Compacted row maximum size: ~85mb (after 30 days)

Here be Dragons:
– CASSANDRA-5799: Column can expire while lazy compacting it...
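
For context, a TTL'd feed write looks roughly like this in CQL (column names follow the assumed activity_for_entity sketch earlier); every cell written this way becomes a tombstone once it expires:

    INSERT INTO activity_for_entity (entity, event_id, event_type, event_summary)
    VALUES ('66edfdb7-6ff7-458c-94a8-421627c1b6f5:user:2:0',
            1571b680-7254-11e3-8d70-000c29351b9d, 'kudos', '{kudos_json}')
    USING TTL 2592000;  -- 30 days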
33
Problem Did Not Surface for 30 Days
• Repairs started taking up to a week
• Created 1000s of SSTables
• High latency:

34
Solution: Trim Feeds Manually
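
There is no range-slice delete in this version of Cassandra, so the manual trim is essentially "read the row, then delete old entries one by one" (a sketch using the assumed column names from earlier):

    -- 1) read the feed entries for an entity
    SELECT event_id, event_type
    FROM activity_for_entity
    WHERE entity = '66edfdb7-6ff7-458c-94a8-421627c1b6f5:user:2:0';

    -- 2) explicitly delete each entry older than the cutoff
    DELETE FROM activity_for_entity
    WHERE entity = '66edfdb7-6ff7-458c-94a8-421627c1b6f5:user:2:0'
      AND event_id = 1571b680-7254-11e3-8d70-000c29351b9d
      AND event_type = 'kudos';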

35
activity_for_entity cfstats

36
How we monitor in Prod
• Nodetool, OpsCenter and JMX to monitor the cluster
• Yammer Metrics at every layer of the Notification Service, with Graphite to visualize
• Netflix Hystrix in the Notification Service to guard against cluster failure

37
Lessons Learned
• Have a migration strategy that allows both systems to stay live until you have proven Cassandra in prod
• Longevity tests are key, especially if you will have tombstones
• Understand how gc_grace_seconds and compaction affect tombstone cleanup
• Test with production data loads if you can
38
Questions?
@paulcichonski

39


Editor's Notes

  • #3 - Talk about how we are leaning towards Cassandra as our go-to data store for most real-time query purposes; NS was the first step.
  • #4 - Internal multi-tenant Dropwizard service within Lithium. It is a shared-nothing, horizontally scalable service that uses Cassandra for data storage. Really more of a subscription fulfillment service that also emits notifications. Example subscription types: subscription to a board where a user posts messages (notified whenever someone posts in the board); subscription to a label that might be used on a message (notified whenever someone uses the label); subscription to a specific message (notified whenever someone replies to it).
  • #5 Our traditional infrastructure is primarily single tenant; NS is a multi-tenant service with many clients. All access to Cassandra goes through the Notification Service. Infrastructure events are dropped onto a queue, then NS queries Cassandra to find all subscriptions pertaining to that event, generates the correct notifications, and writes the necessary activity data.
  • #6 Running in prod. Expect load to increase by around 50% when all customers are on. We have a duplicate cluster running in EMEA, but they are each stand-alone.
  • #8 Key pattern dictates query plan. We denormalize subscriptions in three ways: one to support fast reading of all subscriptions associated with an event, so we can generate notifications quickly (standard_subscription_index); one to support quickly telling whether a user is subscribed to a specific "thing", for when a user is browsing the community (subscriptions_for_entity_by_type); and one to support fast reading of all subscriptions for a user, in a time-series view (subscriptions_for_entity_by_time). Caveat: we are only actively using standard_subscription_index because there is still work to do refactoring UI views to use the new data.
  • #9 This is essentially how the data is stored on disk and how cqlsh interprets it before presenting it to the user. Data distribution for standard_subscription_index rows: the 75th percentile of rows have ~5 subscriptions, the 99th percentile ~160, and the 99.9th percentile ~20k. The largest customer has ~5 million subscriptions; the average customer has ~100,000.
  • #10 rowindex is used to allow for expansion in the future (i.e., one user across multiple rows). We're not using a UUID to represent the timestamp because we didn't need that uniqueness constraint; having subscription_type in the composite key was enough. A timeUUID also makes writes to this (or re-migrations) non-idempotent.
  • #11 - This is essentially how the data is stored on disk and how cqlsh interprets it before presenting it to the user.
  • #12 Use Case: user is browsing a customer’s site and wants to know all the things they are subscribed to on a specific page (i.e., board, topic, specific message).
  • #13 - This is essentially how the data is stored on disk and how cqlsh interprets it before presenting it to the user.
  • #15 - This is essentially how the data is stored on disk and how cqlsh interprets it before presenting it to the user.
  • #17 Very simplistic view. Some key things not covered: NS is a cluster (currently 3 HA / shared-nothing nodes); we have hundreds of "lia" clients running in the infrastructure, and every customer gets one or many lia instances for their site.
  • #18 Actual code uses the last subscription creationTime, not currentTime(). A synchronous process in lia runs every n minutes and verifies all subscriptions it wrote in the last n minutes. It only increments its progress state if it is successful. This requires all writes in NS to be idempotent. This "consistency-repair" process has saved us multiple times during NS failures. Weak consistency necessitates a way to recover consistency eventually.
  • #19 Every shadow-write is also later verified with a synchronous HTTP request from lia to NS (offline anti-entropy)
  • #21 The data distribution for a single row was ~100-200 items at the 75th percentile and ~1000 items at the 99th percentile. The first approach involved using a single thread to write the subscriptions we received from 'lia'; there was no guarantee as to the ordering or key-distribution of the data we received. Migration time for the largest customer included buffer time to avoid performing migrations at peak times and overloading boxes. We don't have the exact latency numbers because we never wrote them down.
  • #22 Worked fine at first, but then we increased the chunk size to 100,000 subscriptions and row key hotspots appeared. We tried tweaking cluster settings like memory and concurrent_writes with little impact. The issue was that too much data was being written to the same row key at the same time (hotspots). NOTE: OpsCenter was running on a Cassandra node in QA, and that node was completely unresponsive during blackout times.
  • #25 - Remember that CQL is a row-oriented binding on top of a column-oriented db, but it is still possible to work on the columns directly; this is what is shown in the cassandra-cli view of the data.
  • #26 First we tried partitioning the writes so that a single thread would write all the data for a single row; this didn't really help. Then we switched the write for that row from many CQL inserts to a single Thrift batch_mutate against a single row, which allowed us to write 200,000 subscriptions in ~50 seconds. More details: http://mail-archives.apache.org/mod_mbox/cassandra-user/201309.mbox/%3C522F32FC.2030804@gmail.com%3E and http://thelastpickle.com/blog/2013/09/13/CQL3-to-Astyanax-Compatibility.html
  • #28 - Cassandra cannot remove tombstones until 1) compaction runs on the table and 2) gc_grace_seconds for the data has expired (default is 10 days). Normally gc_grace_seconds is useful for allowing anti-entropy (i.e., nodetool repair) to run on data before it is removed, which prevents problems like deletes re-appearing. However, in the case described in this slide, the data was re-written every 30 minutes so it was not an issue.
  • #32 - Someone (not me) remembered Rick Branson's talk* about how to reduce write amplification on SSDs. Enabled TRIM support and remounted all drives (the OS tells the disk that a certain region is no longer valid, so the SSD stops doing GC on that region). Everything worked; total downtime of ~4 hours. The consistency verification mechanism we had built into the lia code saved us here, since all data that could not be written while the cluster was down was just re-written when it came back up, fixing all data loss.
  • #34 Row maximum size was hit when the TTLs started expiring. https://issues.apache.org/jira/browse/CASSANDRA-5799
  • #36 Since this particular feature was in private beta, we were able to just truncate all the activity feed data we had previously collected. Unfortunately there is no easy way to do "range-slice" deletes in Cassandra, so you need to grab all the entries and then explicitly delete the old ones. The cache of "recently seen entities" is in Cassandra.
  • #37 Max row size: ~23mb