Cassandra at Lithium
Paul Cichonski, Senior Software Engineer
@paulcichonski
Lithium?
• Helping companies build social communities for their customers
• Founded in 2001
• ~300 customers
• ~84 million users
• ~5 million unique logins in past 20 days
Use Case: Notification Service
1. Stores subscriptions
2. Processes community events
3. Generates notifications when events match against subscriptions
4. Builds user activity feed out of notifications
Notification Service: System View
The Cluster (v1.2.6)
• 4 nodes, each node:
  – CentOS 6.4
  – 8 cores, 2TB for commit-log, 3x 512GB SSD for data
• Average writes/s: 100-150, peak: 2000
• Average reads/s: 100, peak: 1500
• Use Astyanax on client-side
Data Model

Data Model: Subscriptions Fulfillment

(schema diagram with callouts: "identifies target of subscription", "identifies entity that is subscribed")
standard_subscription_index row stored as:

row key: 66edfdb7-6ff7-458c-94a8-421627c1b6f5:message:13
  user:2:creationtimestamp  = 1390939665
  user:53:creationtimestamp = 1390939670
  user:88:creationtimestamp = 1390939660

maps to (cqlsh): [cqlsh screenshot]
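The deck doesn't show client code for this read path, but it does say the service uses Astyanax, where fetching every subscriber for an event target is a single row read. A minimal sketch, assuming string serializers and a helper class of my own naming; the deck never says what the leading UUID in the row key identifies (presumably the community), so treat that as an assumption:

import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.model.ColumnList;
import com.netflix.astyanax.serializers.StringSerializer;

public class SubscriptionIndexReader {
    // Serializers are assumed; the deck never shows the CF declaration.
    private static final ColumnFamily<String, String> STANDARD_SUBSCRIPTION_INDEX =
        new ColumnFamily<>("standard_subscription_index",
            StringSerializer.get(), StringSerializer.get());

    private final Keyspace keyspace;

    public SubscriptionIndexReader(Keyspace keyspace) {
        this.keyspace = keyspace;
    }

    /** Fetch every subscriber for an event target with a single row read. */
    public ColumnList<String> subscribersFor(String communityId, String targetType, String targetId)
            throws ConnectionException {
        // Row key layout from the slide: <uuid>:<target-type>:<target-id>
        String rowKey = communityId + ":" + targetType + ":" + targetId;
        return keyspace.prepareQuery(STANDARD_SUBSCRIPTION_INDEX)
                       .getKey(rowKey)
                       .execute()
                       .getResult();
    }
}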
Data Model: Subscription Display (time series)
subscriptions_for_entity_by_time row stored as:

row key: 66edfdb7-6ff7-458c-94a8-421627c1b6f5:user:2:0
  1390939670:label:testlabel
  1390939665:board:53
  1390939660:message:13

maps to (cqlsh): [cqlsh screenshot]
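A sketch of the time-series read this layout enables: columns sort by their leading timestamp, so an Astyanax RangeBuilder slice can page a user's subscriptions newest-first. Class and method names are mine, and treating the composite column names as plain strings is a simplification of whatever comparator the real CF uses:

import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.model.ColumnList;
import com.netflix.astyanax.serializers.StringSerializer;
import com.netflix.astyanax.util.RangeBuilder;

public class SubscriptionTimelineReader {
    private static final ColumnFamily<String, String> BY_TIME =
        new ColumnFamily<>("subscriptions_for_entity_by_time",
            StringSerializer.get(), StringSerializer.get());

    private final Keyspace keyspace;

    public SubscriptionTimelineReader(Keyspace keyspace) {
        this.keyspace = keyspace;
    }

    /** Newest-first page of one entity's subscriptions. */
    public ColumnList<String> latestSubscriptions(String rowKey, int pageSize)
            throws ConnectionException {
        return keyspace.prepareQuery(BY_TIME)
                       .getKey(rowKey)
                       .withColumnRange(new RangeBuilder()
                           .setReversed(true)  // walk columns from highest timestamp down
                           .setLimit(pageSize)
                           .build())
                       .execute()
                       .getResult();
    }
}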
Data Model: Subscription Display (content browsing)
subscriptions_for_entity_by_type row stored as:

row key: 66edfdb7-6ff7-458c-94a8-421627c1b6f5:user:2
  message:13:creationtimestamp      = 1390939660
  board:53:creationtimestamp        = 1390939665
  label:testlabel:creationtimestamp = 1390939670

maps to (cqlsh): [cqlsh screenshot]
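This third denormalization turns "is this user subscribed to that thing?" into a point column lookup. A sketch under the same assumptions as above; note that Astyanax reports a missing column by throwing NotFoundException:

import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;
import com.netflix.astyanax.connectionpool.exceptions.NotFoundException;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.serializers.StringSerializer;

public class SubscriptionChecker {
    private static final ColumnFamily<String, String> BY_TYPE =
        new ColumnFamily<>("subscriptions_for_entity_by_type",
            StringSerializer.get(), StringSerializer.get());

    private final Keyspace keyspace;

    public SubscriptionChecker(Keyspace keyspace) {
        this.keyspace = keyspace;
    }

    /** Point lookup: is this entity subscribed to, e.g., message:13? */
    public boolean isSubscribed(String rowKey, String targetType, String targetId)
            throws ConnectionException {
        try {
            keyspace.prepareQuery(BY_TYPE)
                    .getKey(rowKey)
                    .getColumn(targetType + ":" + targetId + ":creationtimestamp")
                    .execute();
            return true;
        } catch (NotFoundException absent) {
            return false;
        }
    }
}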
Data Model: Activity Feed (fan-out writes)

(column values are JSON blobs representing the activity)
activity_for_entity row stored as:

row key: 66edfdb7-6ff7-458c-94a8-421627c1b6f5:user:2:0
  31aac580-8550-11e3-ad74-000c29351b9d:moderationAction:event_summary = {moderation_json}
  f4efd590-82ca-11e3-ad74-000c29351b9d:badge:event_summary = {badge_json}
  1571b680-7254-11e3-8d70-000c29351b9d:kudos:event_summary = {kudos_json}

maps to (cqlsh): [cqlsh screenshot]
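The deck says activity is fanned out on write with a 30-day TTL, and JSON event summaries match the column values above. A hedged sketch of that fan-out using an Astyanax MutationBatch; class and method names are mine:

import java.util.List;

import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.serializers.StringSerializer;

public class ActivityFeedWriter {
    private static final int THIRTY_DAYS_SECONDS = 30 * 24 * 60 * 60;

    private static final ColumnFamily<String, String> ACTIVITY_FOR_ENTITY =
        new ColumnFamily<>("activity_for_entity",
            StringSerializer.get(), StringSerializer.get());

    private final Keyspace keyspace;

    public ActivityFeedWriter(Keyspace keyspace) {
        this.keyspace = keyspace;
    }

    /** Fan one activity (a JSON blob) out to every subscriber's feed row, TTL'd at 30 days. */
    public void fanOut(String columnName, String activityJson, List<String> subscriberRowKeys)
            throws ConnectionException {
        MutationBatch m = keyspace.prepareMutationBatch();
        for (String rowKey : subscriberRowKeys) {
            m.withRow(ACTIVITY_FOR_ENTITY, rowKey)
             .putColumn(columnName, activityJson, THIRTY_DAYS_SECONDS);
        }
        m.execute();
    }
}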
Migration Strategy (mysql → cassandra)
Data Migration: Trust, but Verify
Fully repeatable due to idempotent writes

1) Bulk migrate all subscription data from lia to NS (HTTP)
2) Consistency check all subscription data (HTTP); also runs after migration to verify shadow-writes
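The speaker notes describe the verification loop: a synchronous process in lia re-checks everything it wrote during the last n minutes and only advances its progress checkpoint on success, which is safe precisely because NS writes are idempotent. A sketch of that shape; every type and method name here is hypothetical, since the deck describes only the behavior:

import java.util.List;

public class ConsistencyChecker {

    /** Minimal client abstraction; the real transport is HTTP. */
    interface NotificationServiceClient {
        boolean verifyOrRewrite(Subscription s); // re-sends the write if NS disagrees
    }

    interface SubscriptionStore {
        long lastVerifiedCreationTime();                    // persisted checkpoint
        void advanceCheckpoint(long creationTime);          // only moved on success
        List<Subscription> writtenSince(long creationTime); // local (mysql) view
    }

    static class Subscription {
        final String entity; final String target; final long creationTime;
        Subscription(String entity, String target, long creationTime) {
            this.entity = entity; this.target = target; this.creationTime = creationTime;
        }
    }

    private final NotificationServiceClient ns;
    private final SubscriptionStore store;

    ConsistencyChecker(NotificationServiceClient ns, SubscriptionStore store) {
        this.ns = ns;
        this.store = store;
    }

    /** Runs every n minutes inside lia. */
    void runOnce() {
        long checkpoint = store.lastVerifiedCreationTime();
        long newCheckpoint = checkpoint;
        for (Subscription s : store.writtenSince(checkpoint)) {
            if (!ns.verifyOrRewrite(s)) {
                return; // leave checkpoint alone; the whole window is retried next run
            }
            newCheckpoint = Math.max(newCheckpoint, s.creationTime);
        }
        // Per the notes, the checkpoint tracks the last subscription's creationTime,
        // not the wall clock, and only advances when verification succeeded.
        store.advanceCheckpoint(newCheckpoint);
    }
}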
Verify: Consistency Checking

Subscription Write Strategy

(diagram: a user's subscription_write goes to lia, which writes to mysql and also sends a shadow subscription_write across the NS system boundary, via activemq, into the Notification Service and on to Cassandra; reads for subscription fulfillment happen in NS, while reads for the UI are temporarily fulfilled by legacy mysql)
Path to Production: QA Issue #1 (many writes to same row kill cluster)
Problem: CQL INSERTs
Single thread SLOW, even with BATCH (multiple-second latency for writing chunks of 1000 subscriptions)
Largest customer (~20 million subscriptions) would have taken weeks to migrate
Just Use More Threads? Not Quite

Cluster Essentially Died

Mutations Could Not Keep Up
Solution: Work Closer to Storage Layer

Work here: the raw storage view of the row (the standard_subscription_index layout shown above)
Not here: the cqlsh/CQL view of the same data
Solution: Thrift batch_mutate

Allowed us to write 200,000 subscriptions to 3 CFs in ~45 seconds with almost no impact on the cluster.
NOTE: supposedly fixed in 2.0: CASSANDRA-4693

More details: http://thelastpickle.com/blog/2013/09/13/CQL3-to-Astyanax-Compatibility.html
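In Astyanax terms the fix was to batch all columns for a row into one MutationBatch, which goes out as a single Thrift batch_mutate call instead of one CQL INSERT per subscription. A minimal sketch, again assuming string serializers:

import java.util.Map;

import com.netflix.astyanax.ColumnListMutation;
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.serializers.StringSerializer;

public class SubscriptionBulkWriter {
    private static final ColumnFamily<String, String> STANDARD_SUBSCRIPTION_INDEX =
        new ColumnFamily<>("standard_subscription_index",
            StringSerializer.get(), StringSerializer.get());

    private final Keyspace keyspace;

    public SubscriptionBulkWriter(Keyspace keyspace) {
        this.keyspace = keyspace;
    }

    /**
     * All columns for one row leave in a single Thrift batch_mutate call,
     * rather than as one CQL INSERT per subscription.
     */
    public void writeRow(String rowKey, Map<String, Long> subscriberToCreationTime)
            throws ConnectionException {
        MutationBatch batch = keyspace.prepareMutationBatch();
        ColumnListMutation<String> row = batch.withRow(STANDARD_SUBSCRIPTION_INDEX, rowKey);
        for (Map.Entry<String, Long> e : subscriberToCreationTime.entrySet()) {
            // column name like "user:2:creationtimestamp", value = creation time
            row.putColumn(e.getKey(), e.getValue());
        }
        batch.execute(); // one network round-trip for the whole row
    }
}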
Path to Production: QA Issue #2 (read timeouts)
Tombstone Buildup and Timeouts

CF holding notification settings was rewritten every 30 minutes.
Eventually tombstone build-up caused reads to time out.
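Why a 30-minute rewrite cycle breeds tombstones: if each rewrite deletes the row before reinserting it, every cycle leaves a tombstone that cannot be purged until compaction runs and gc_grace_seconds (default 10 days) has elapsed, so reads wade through hundreds of dead row versions. A sketch of the pattern, with a hypothetical CF name; the delete and the reinsert are issued as separate batches because an Astyanax batch shares one timestamp, under which the delete could shadow the new columns:

import java.util.Map;

import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.serializers.StringSerializer;

public class NotificationSettingsRewriter {
    private static final ColumnFamily<String, String> NOTIFICATION_SETTINGS =
        new ColumnFamily<>("notification_settings", // hypothetical CF name
            StringSerializer.get(), StringSerializer.get());

    private final Keyspace keyspace;

    public NotificationSettingsRewriter(Keyspace keyspace) {
        this.keyspace = keyspace;
    }

    /** Runs every 30 minutes: drop the old row, write the fresh settings. */
    public void rewrite(String rowKey, Map<String, String> settings)
            throws ConnectionException {
        // delete() writes a row tombstone; it survives until compaction runs
        // AND gc_grace_seconds has passed. At 48 rewrites a day over a 10-day
        // grace window, that is on the order of 480 shadowed row versions.
        MutationBatch drop = keyspace.prepareMutationBatch();
        drop.withRow(NOTIFICATION_SETTINGS, rowKey).delete();
        drop.execute();

        MutationBatch insert = keyspace.prepareMutationBatch();
        for (Map.Entry<String, String> e : settings.entrySet()) {
            insert.withRow(NOTIFICATION_SETTINGS, rowKey).putColumn(e.getKey(), e.getValue());
        }
        insert.execute();
    }
}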
Solution

(slide content was an image; per the speaker notes, tombstones can only be purged once compaction has run and gc_grace_seconds has expired, and because this CF was fully rewritten every 30 minutes the usual repair-safety reason for a long gc_grace_seconds did not apply)
Production Issue #1 (dead cluster)
Hard Drive Failure on All Nodes

4 days after release, we started seeing this in /var/log/cassandra/system.log [log screenshot]
After following a bunch of dead ends, we also found this in /var/messages.log [log screenshot]
This cascaded to all nodes and within an hour, the cluster was dead.
TRIM Support to the Rescue

* http://www.slideshare.net/rbranson/cassandra-and-solid-state-drives

Production Issue #2 (repair causing tornadoes of destruction)
Activity Feed Data Explosion
• Activity data written with a TTL of 30 days
• Users in the 99th percentile were receiving multiple thousands of writes per day
• Compacted row maximum size: ~85 MB (after 30 days)

Here be dragons:
– CASSANDRA-5799: Column can expire while lazy compacting it...
Problem Did Not Surface for 30 Days
• Repairs started taking up to a week
• Created 1000s of SSTables
• High latency: [graph screenshot]
Solution: Trim Feeds Manually
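Per the speaker notes there is no easy range-slice delete, so trimming means reading a feed row back and explicitly deleting the old columns. A sketch with names of my own choosing:

import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;
import com.netflix.astyanax.model.Column;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.model.ColumnList;
import com.netflix.astyanax.serializers.StringSerializer;
import com.netflix.astyanax.util.RangeBuilder;

public class FeedTrimmer {
    private static final ColumnFamily<String, String> ACTIVITY_FOR_ENTITY =
        new ColumnFamily<>("activity_for_entity",
            StringSerializer.get(), StringSerializer.get());

    private final Keyspace keyspace;

    public FeedTrimmer(Keyspace keyspace) {
        this.keyspace = keyspace;
    }

    /** Keep only the newest maxEntries columns of one feed row; delete the rest explicitly. */
    public void trim(String rowKey, int maxEntries) throws ConnectionException {
        // No range-slice delete in this Cassandra version, so read the row back...
        ColumnList<String> columns = keyspace.prepareQuery(ACTIVITY_FOR_ENTITY)
            .getKey(rowKey)
            .withColumnRange(new RangeBuilder().setReversed(true).build()) // newest first
            .execute()
            .getResult();

        // ...then issue an explicit delete for everything past the cutoff.
        MutationBatch m = keyspace.prepareMutationBatch();
        int kept = 0;
        for (Column<String> c : columns) {
            if (++kept > maxEntries) {
                m.withRow(ACTIVITY_FOR_ENTITY, rowKey).deleteColumn(c.getName());
            }
        }
        if (!m.isEmpty()) {
            m.execute();
        }
    }
}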
activity_for_entity cfstats [screenshot]
How We Monitor in Prod
• Nodetool, OpsCenter and JMX to monitor the cluster
• Yammer Metrics at every layer of Notification Service, with graphite to visualize
• Netflix Hystrix in Notification Service to guard against cluster failure
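The deck doesn't show its Hystrix usage, but the standard pattern is to wrap each Cassandra call in a HystrixCommand so that a sick cluster trips a circuit breaker instead of tying up request threads. A minimal sketch reusing the fulfillment read from earlier:

import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.model.ColumnList;
import com.netflix.astyanax.serializers.StringSerializer;
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

/** Wraps a single-row Cassandra read in a circuit breaker. */
public class ReadSubscriptionsCommand extends HystrixCommand<ColumnList<String>> {
    private static final ColumnFamily<String, String> STANDARD_SUBSCRIPTION_INDEX =
        new ColumnFamily<>("standard_subscription_index",
            StringSerializer.get(), StringSerializer.get());

    private final Keyspace keyspace;
    private final String rowKey;

    public ReadSubscriptionsCommand(Keyspace keyspace, String rowKey) {
        super(HystrixCommandGroupKey.Factory.asKey("cassandra"));
        this.keyspace = keyspace;
        this.rowKey = rowKey;
    }

    @Override
    protected ColumnList<String> run() throws Exception {
        return keyspace.prepareQuery(STANDARD_SUBSCRIPTION_INDEX)
                       .getKey(rowKey)
                       .execute()
                       .getResult();
    }

    // No getFallback() override here: when the circuit is open or the read
    // times out, callers get a fast HystrixRuntimeException instead of a hang.
}

Callers would run it with new ReadSubscriptionsCommand(keyspace, rowKey).execute().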
Lessons Learned
• Have a migration strategy that allows both systems to stay live until you have proven Cassandra in prod
• Longevity tests are key, especially if you will have tombstones
• Understand how gc_grace_seconds and compaction affect tombstone cleanup
• Test with production data loads if you can
Questions?
@paulcichonski

Speaker Notes
  • We are leaning towards Cassandra as our go-to data store for most real-time query purposes; NS was the first step.
  • NS is an internal multi-tenant Dropwizard service within Lithium: a shared-nothing, horizontally scalable service that uses Cassandra for data storage. It is really more of a subscription-fulfillment service that also emits notifications. Example subscription types: a subscription to a board, to get notified whenever someone posts in the board; a subscription to a label, to get notified whenever someone uses the label on a message; and a subscription to a specific message, to get notified whenever someone replies to it.
  • Our traditional infrastructure is primarily single-tenant; NS is a multi-tenant service with many clients. All access to Cassandra goes through the Notification Service. Infrastructure events are dropped onto a queue; NS then queries Cassandra to find all subscriptions pertaining to each event, generates the correct notifications, and writes the necessary activity data.
  • Running in prod. We expect load to increase by around 50% when all customers are on. We have a duplicate cluster running in EMEA, but the two clusters are stand-alone.
  • Key pattern dictates query plan. We denormalize subscriptions three ways: one to support fast reading of all subscriptions associated with an event, so we can generate notifications quickly (standard_subscription_index); one to quickly tell whether a user is subscribed to a specific "thing" while browsing the community (subscriptions_for_entity_by_type); and one to support fast reading of all of a user's subscriptions as a time series (subscriptions_for_entity_by_time). Caveat: we are only actively using standard_subscription_index, because refactoring the UI views to use the new data is still in progress.
  • This is essentially how the data is stored on disk and how cqlsh interprets it before presenting it to the user. Data distribution for standard_subscription_index rows: the 75th percentile of rows hold ~5 subscriptions, the 99th percentile ~160, and the 99.9th percentile ~20k. Our largest customer has ~5 million subscriptions; the average customer has ~100,000.
  • rowindex allows for expansion in the future (i.e., one user across multiple rows). We're not using a UUID to represent the timestamp because we didn't need that uniqueness constraint; having subscription_type in the composite key was enough. A timeUUID would also make writes to this CF (and re-migrations) non-idempotent.
  • This is essentially how the data is stored on disk and how cqlsh interprets it before presenting it to the user (the same applies to the other storage-layout slides).
  • Use case: a user is browsing a customer's site and wants to know all the things they are subscribed to on a specific page (i.e., a board, a topic, or a specific message).
  • Very simplistic view. Some key things not covered: NS is itself a cluster (currently 3 HA, share-nothing nodes), and we have hundreds of "lia" clients running in the infrastructure; every customer gets one or more lia instances for their site.
  • The actual code uses the last subscription's creationTime, not currentTime(). A synchronous process in lia runs every n minutes and verifies all subscriptions it wrote during the last n minutes, and it only increments its progress state if verification succeeds. This requires all writes in NS to be idempotent. The consistency-repair process has saved us multiple times during NS failures; weak consistency necessitates a way to recover consistency eventually.
  • Every shadow-write is also later verified with a synchronous HTTP request from lia to NS (offline anti-entropy)
  • The data distribution for a single row was ~100-200 items at the 75th percentile and ~1000 items at the 99th. Our first approach used a single thread to write the subscriptions we received from 'lia'; there was no guarantee as to the ordering or key-distribution of the data we received. Migration time for the largest customer included buffer time to avoid performing migrations at peak times and overloading boxes. We don't have the exact latency numbers because we never wrote them down.
  • It worked fine at first, but when we increased the chunk size to 100,000 subscriptions, row-key hotspots caused the failures shown. We tried tweaking cluster settings like memory and concurrent_writes with little impact; the issue was that too much data was being written to the same row key at the same time (hotspots). NOTE: OpsCenter was running on a Cassandra node in QA, and that node was completely unresponsive during the blackout periods.
  • Remember that CQL is a row-oriented binding on top of a column-oriented db, but it is still possible to work on the columns directly; this is what the cassandra-cli view of the data shows.
  • We first tried partitioning the writes so that a single thread would write all the data for a single row; this didn't really help. Then we switched the write for each row from many CQL inserts to a single Thrift batch_mutate against that row, which allowed us to write 200,000 subscriptions in ~50 seconds. More details: http://mail-archives.apache.org/mod_mbox/cassandra-user/201309.mbox/%3C522F32FC.2030804@gmail.com%3E and http://thelastpickle.com/blog/2013/09/13/CQL3-to-Astyanax-Compatibility.html
  • Cassandra cannot remove tombstones until 1) compaction runs on the table and 2) gc_grace_seconds for the data has expired (default is 10 days). Normally gc_grace_seconds is useful for letting anti-entropy (i.e., nodetool repair) run on data before it is removed, which prevents problems like deletes re-appearing. In the case described on this slide, however, the data was re-written every 30 minutes, so that was not an issue.
  • Someone (not me) remembered Rick Branson's talk* about how to reduce write amplification on SSDs. We enabled TRIM support and remounted all drives (the OS tells the disk that a region is no longer valid, so the SSD stops doing GC on that region). Everything worked, with total downtime of ~4 hours. The consistency-verification mechanism built into the lia code saved us here: all data that could not be written while the cluster was down was simply re-written once it came back up.
  • Rows hit their maximum size when the TTLs started expiring. https://issues.apache.org/jira/browse/CASSANDRA-5799
  • Since this particular feature was in private beta, we were able to simply truncate all the activity-feed data we had previously collected. Unfortunately there is no easy way to do "range-slice" deletes in Cassandra, so you need to grab all the entries and then explicitly delete the old ones. The cache of "recently seen entities" is in Cassandra.
  • Max row size: ~23 MB.