Cassandra Day NY 2014: Utilizing Apache Cassandra at UltraVisual


Cassandra has been an integral part of Ultravisual’s infrastructure since its launch, allowing us to rapidly prototype and build new features that further enhance the user experience. This talk covers three topics: how Cassandra came to be used at Ultravisual and the key problem it solved; how our use of Cassandra has evolved alongside the product; and some of our experiences deploying and running Cassandra in production.

Published in: Technology

  1. CASSANDRA @ ULTRAVISUAL. Cassandra Day New York 2014. Skye Book, Lead Systems Architect
  2. ULTRAVISUAL: A visual network for inspiration, expression, and collaboration
  3. The Feed
       • A user’s first taste of UV
       • More than just posts
       • Constantly being tweaked and re-thought
  4. The Old Way: Started Simple. “Show me recent posts in collections I follow”
       SELECT DISTINCT _post.*
       FROM _post
       JOIN _collection_post cp ON _post.uuid = cp.post_uuid
       JOIN _collection_follow cf ON cp.c_uuid = cf.collection_uuid
       WHERE cf.user_id = ?
       ORDER BY _post.created_at DESC
       LIMIT 20 OFFSET 0
  5. The Old Way: Added Complexity. “Show me people recently followed by my connections”
       SELECT a.*
       FROM _user_follow a, _user_follow b
       WHERE b.follower = 12345 AND a.follower = b.followed
       ORDER BY a.followed_at DESC
       LIMIT 20 OFFSET 0
  6. The Old Way: Every new feature needs another query. Feed requests generate a disproportionate amount of load relative to normal CRUD ops
  7. Reframing the Problem. From this: A place for posts, new collections, social activity, and anything else interesting
  8. Reframing the Problem. To this: A list of items interesting to the user
  9. The New Way: Model First
       • With an SQL background, this can be misleading.
       • Essential question: “How do I need to access this data?”
  10. The New Way: “Try to model data as a log of user intent” (Rick Branson, Instagram, Cassandra Summit 2013)
  11. The New Way: Model for user feeds
        user  status  created_at  story                         json
        2     0       61b97280    user_follow:3:5               {"foo":"bar"}
        2     1       5daa04c0    post:bfbd0a39                 {"foo":"bar"}
        2     1       565752e0    collection_follow:5:d70961c1  {"foo":"bar"}
        2     1       4a8189e0    user_follow:3:5               {"foo":"bar"}
      Slide labels: “Primary Key” over the leading key columns, “Cached story JSON” over the json column.
        • Fast to fetch user stories
        • Cached JSON means almost zero SQL requests
  12. Fast. Response times cut from hundreds of milliseconds to the 30 ms range
  13. Launch Week: Featured by Apple! (Chart: Cluster Disk Usage, 26% / 74%)
  14. Don’t be too cute
        cqlsh:ultravisual> ALTER TABLE latest_feed DROP json;
  15. Handling Deletions
        • Data is only appended, never deleted from user feeds
        • Adapted Instagram’s ‘Anti-Column’ solution
        • Avoids missed deletions for nodes down longer than GCGraceSeconds
        • Avoids race condition where deletion arrives before write
      Sam follows Sandy:
        user  created_at  status  story
        2     4a8189e0    1       user_follow:3:5
      Sam unfollows Sandy:
        user  created_at  status  story
        2     61b97280    0       user_follow:3:5
        2     4a8189e0    1       user_follow:3:5
  16. Negated Entries
      Layout A:
        user  created_at  status  story
        2     61b97280    0       user_follow:3:5
        2     4a8189e0    1       user_follow:3:5
        • Keeps all entries in a single time series
        • First page can usually be populated by a single read
      Layout B:
        user  status  created_at  story
        2     0       61b97280    user_follow:3:5
        2     1       4a8189e0    user_follow:3:5
        • Splits user’s row into two lists, live and undo
        • Will always require at least two reads
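
The live/undo layout on slide 16 can be illustrated with a small in-memory sketch. This is not Ultravisual's code: the `Entry` class, the `visible_stories` function, and the integer `created_at` values (standing in for the timeuuid column, larger meaning later) are all assumptions made for illustration. It shows one plausible way a reader could combine the two lists: fetch the undo entries, fetch the live entries, and hide a live entry only when a newer negation exists for the same story.

```python
from dataclasses import dataclass

LIVE, UNDO = 1, 0  # status values as shown on the slides


@dataclass(frozen=True)
class Entry:
    user: int
    status: int
    created_at: int  # hypothetical stand-in for a timeuuid; larger is later
    story: str


def visible_stories(entries, user, limit=20):
    """Return the user's live story keys, newest first, minus negated ones."""
    # Read 1: the "undo" list. Remember the latest negation time per story.
    undone = {}
    for e in entries:
        if e.user == user and e.status == UNDO:
            previous = undone.get(e.story, e.created_at)
            undone[e.story] = max(previous, e.created_at)

    # Read 2: the "live" list, newest first.
    live = sorted(
        (e for e in entries if e.user == user and e.status == LIVE),
        key=lambda e: e.created_at,
        reverse=True,
    )

    # A live entry is hidden only if a negation of the same story is newer,
    # so the order in which the two writes arrived does not matter.
    kept = [
        e.story
        for e in live
        if undone.get(e.story, float("-inf")) < e.created_at
    ]
    return kept[:limit]
```

Because nothing is ever deleted, a follow that happens after an unfollow simply appears as a newer live entry and becomes visible again.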
  17. Further Uses
        • User Notifications
        • User Onboarding
        • Reshare Statistics
        • User & Content Reports
        • API Statistics
  18. User Onboarding: Sequenced feed entries for users on signup
        user  created_at  sequence      step  content
        2     61b97280    onboaring_v2  1     rec_collections_1
        3     5daa04c0    onboaring_v2  2     rec_collections_2
        5     565752e0    onboaring_v3  1     find_friends
        6     4a8189e0    onboaring_v3  1     find_friends
  19. Production Experiences: Drivers
        • Java: Started with Astyanax, moved to DataStax v2
        • Node.js: node-cassandra-cql
  20. DS Driver Issue 229: Cryptic message with large batch updates in pre-release versions of the 2.0 driver
        com.datastax.driver.core.exceptions.DriverInternalError: An unexpected protocol error occured. This is a bug in this library, please report: Unknown code 256 for a consistency level
      As of 2.0, batches with more than 64k statements throw a better exception:
        java.lang.IllegalStateException: Batch statement cannot contain more than 65536 statements.
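
Given the statement limit described on slide 20, one common workaround is to split a large update into smaller batches before handing them to the driver. The sketch below is a generic chunking helper, not part of any Cassandra driver: the name `chunked` and the default `max_batch_size` of 1000 are hypothetical choices for illustration, and the statements are treated as opaque objects.

```python
def chunked(statements, max_batch_size=1000):
    """Yield lists of at most max_batch_size statements, preserving order."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    batch = []
    for statement in statements:
        batch.append(statement)
        if len(batch) == max_batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch
```

Each yielded list would then be submitted as its own batch, which keeps every batch well under the limit at the cost of losing single-batch semantics across the whole update.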
  21. Just use LZ4 Compression
  22. Cassandra-4851: Unfortunate truth in Cassandra 2.0.5
        cqlsh:test> SELECT * FROM user_feed WHERE user = 2 AND created_at > :some_uuid AND status=0;
        cqlsh:test> Bad Request: PRIMARY KEY part status cannot be restricted (preceding part created_at is either not restricted or by a non-EQ relation)
  23. Cassandra-4851: Adds CQL3 support for vector comparison syntax. Available in 2.0.6
        cqlsh:test> SELECT * FROM timeline WHERE day = '21 Jun 2014' AND (hour,min) >= (3,50) AND (hour,min,sec) <= (4,37,30);
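
The vector comparison on slide 23 compares column tuples in order, column by column, much like Python compares tuples. The sketch below mimics the slide's query over hypothetical `(hour, min, sec)` rows; the `in_range` helper and the sample rows are assumptions for illustration, not Cassandra code.

```python
def in_range(row, lower, upper):
    """Mimic (cols...) >= lower AND (cols...) <= upper on a prefix of row."""
    # Compare only as many leading columns as each bound supplies,
    # e.g. (hour, min) against a two-element lower bound.
    return row[:len(lower)] >= lower and row[:len(upper)] <= upper


# Hypothetical (hour, min, sec) clustering values within a single day.
rows = [(3, 49, 59), (3, 50, 0), (4, 0, 0), (4, 37, 30), (4, 37, 31), (5, 0, 0)]

# Equivalent of: (hour,min) >= (3,50) AND (hour,min,sec) <= (4,37,30)
matches = [r for r in rows if in_range(r, (3, 50), (4, 37, 30))]
```

Note that `(4, 0, 0)` matches even though its minute value is below 50: the tuple is compared as a whole, which is different from restricting each column independently.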
  24. Production Experiences: Upgrades
        • Manual package installs (dsc20 from DataStax)
        • One node at a time
        • Upgrade, wait for healthy status & operations, move on
        • OpsCenter provides good overview
  25. Production Experiences: Speaking of OpsCenter…
        • Don’t be alarmed if nodes appear but agent data does not
        • opscenterd often needs a restart after cluster upgrade to see agents again
  26. Production Experiences: Service Discovery
        • Running on AWS using Ec2MultiRegionSnitch
        • Using OpsWorks (Amazon’s Chef service) for seed config
  27. Chef Cookbook
        • Forked from Michael Klishin’s awesome C* cookbook
        • Added integration with OpsWorks’ stack.json

        # Add this node as the first seed
        # If using the multi-region snitch, we must use the public IP address
        if node["cassandra"]["snitch"] == "Ec2MultiRegionSnitch"
          seed_array << node["opsworks"]["instance"]["ip"]
        else
          seed_array << node["opsworks"]["instance"]["private_ip"]
        end

        # Add every instance in the cassandra layer, using the same address rule
        node["opsworks"]["layers"]["cassandra"]["instances"].each do |instance_name, values|
          if node["cassandra"]["snitch"] == "Ec2MultiRegionSnitch"
            seed_array << values["ip"]
          else
            seed_array << values["private_ip"]
          end
        end
        set[:cassandra][:seeds] = seed_array
  28. Questions