Cassandra from the trenches: migrating Netflix

Slide deck on migrating Netflix to Cassandra in EC2 from a legacy, DC-bound relational database.

Speaker notes:
  • Point of departure from the datacenter; data modeling from relational to non-relational; implementation(s); real world – ops, tuning, compactions, gotchas
  • Background as to why Netflix has moved to the cloud and embraced new databases
  • Circa mid-late 2010, we evaluated a bunch of database systems, primarily focusing on the new NoSQL breed.
  • I lead AB testing, and we’ll be using that data set as a model for discussion. I’ll describe the legacy Oracle implementation and how I went about moving it to Cassandra
  • Show example of an AB test (1482) on the homepage
  • Existing data sets in our legacy Oracle database that need to be migrated and transformed
  • LAST SLIDE ON DATA MODELING! Next is running this in prod!
  • Going to share real world issues from design, ops, performance
  • For some systems, as long as one write wins (eventual consistency), all is fine
  • Explain difference between read repair and node repair
  • Makes minor compactions smoother
  • Rows that are too large – the AB indices ran afoul of this; a problem for reads, compactions, and repairs

    1. Cassandra from the trenches: migrating Netflix – Jason Brown, Senior Software Engineer, Netflix. @jasobrown, jasedbrown@gmail.com, http://www.linkedin.com/in/jasedbrown
    2. Your host for the evening
       • Sr. Software Engineer at Netflix for > 3 years
         – Currently lead a team developing and operating AB testing infrastructure in EC2
         – Spent time migrating core e-commerce functionality out of PL/SQL and scaling it up
       • MLB Advanced Media – ran the ecommerce engineering group
       • Wandered about in the wireless space (J2ME, BREW)
    3. History
       • In the beginning, there was the webapp
         – And a database, too
         – In one datacenter
       • Then we grew, and grew, and grew
         – More databases, all conjoined
         – Database links with PL/SQL and materialized views
         – Multi-master replication
    4. History, 2
       • Then it melted down (2008)
         – Oracle MMR between two databases
         – SPOF – one Oracle instance for the website (no backup)
       • Couldn’t ship DVDs for ~3 days
    5. History, 3
       • Time to rethink everything
         – Abandon the datacenter for EC2
           • We’re not in the business of building datacenters
         – Ditch the monolithic webapp for distributed systems
           • Greater independence for all teams/initiatives
         – Migrate the SPOF database to …
    6. History, 4
       • SimpleDB/S3
         – Somebody else manages your database (yeah!)
         – Tried it out, but it didn’t quite work well for us
         – High latency, rate limiting (throttling), no auto-sharding, no backup – problems
       • Time to try out one of them (other) new-fangled NoSQL things…
    7. Shiny new toy
       • We selected Cassandra
         – The Dynamo model appealed to us
         – The column-based, key-value data model seemed sufficient for most needs
         – Performance looked great (rudimentary tests)
       • Now what?
         – Put something into it
         – Run it in EC2
         – Sounds easy enough…
    8. Data Modeling – where the rubber meets the road
    9. About Netflix’s AB Testing
       • We use it everywhere (no, really)
       • Basic concepts
         – Test – an experiment where several competing behaviors are implemented and compared
         – Cell – the different experiences within a test that are compared against each other
         – Allocation – a customer-specific assignment to a cell within a test
           • A customer can only be in one cell of a test at a time
           • Generally immutable (very important for analysis)
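
A minimal sketch of those three concepts as plain Java types; the names and fields here are assumptions for illustration, not Netflix’s actual classes.

```java
import java.util.Date;
import java.util.List;

// Illustrative model of the AB concepts above (hypothetical names/fields).
public class AbModel {
    // A test is an experiment with several competing cells (experiences).
    record Cell(int cellId, String description) {}
    record Test(int testId, String name, List<Cell> cells) {}

    // An allocation pins one customer to exactly one cell of a test,
    // and is generally immutable once written.
    record Allocation(long customerId, int testId, int cellId, Date allocatedAt) {}

    public static void main(String[] args) {
        Test test = new Test(42, "Example homepage test",
                List.of(new Cell(1, "Control"), new Cell(2, "New experience")));
        Allocation allocation = new Allocation(1234L, test.testId(), 2, new Date());
        System.out.println(test + " -> " + allocation);
    }
}
```
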
    10. Data modeling – background
       • AB has two sets of data
         – Metadata about tests
         – Allocations
       • Both need to be migrated out of Oracle and into Cassandra in the cloud
    11. AB – allocations
       • A single table holds allocations
         – Currently at ~950 million records
         – Plus indices!
       • One record for every test that every customer is allocated into
       • Unique constraint on customer/test
    12. AB – metadata
       • Fairly typical parent-child table relationship
       • Not updated frequently, so the service can cache it
    13. Data modeling in Cassandra
       • Everywhere I looked, the internets told me to understand my data use patterns
         – Understand the questions that you need to answer from the data
           • Meaning: know how you will query your data, and structure the persistence model to match
       • There’s no free lunch here, apparently
    14. Identifying the AB questions that need to be answered
       • Get all allocations for a customer
       • Get the count of customers in a test/cell
       • Find all customers in a test/cell
         – So we can kick them out of the test
         – So we can clean up ancient data
         – So we can move them to a different cell in the test
       • Find all customers allocated to a test within a date range
         – So we can kick them out of the test
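
Those access patterns can be written down as a query interface; a hypothetical sketch, with method names and return types assumed for illustration rather than taken from the actual service API.

```java
import java.util.Date;
import java.util.List;
import java.util.Map;

// Hypothetical interface capturing the AB access patterns above.
public interface AllocationQueries {
    // All allocations for one customer (the hot, read-heavy path): testId -> cellId.
    Map<Integer, Integer> allocationsForCustomer(long customerId);

    // Count of customers allocated into a given test/cell.
    long countCustomersInCell(int testId, int cellId);

    // All customers in a given test/cell (served by a reverse index).
    List<Long> customersInCell(int testId, int cellId);

    // All customers allocated to a test within a date range.
    List<Long> customersAllocatedBetween(int testId, Date from, Date to);
}
```
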
    15. Modeling allocations in Cassandra
       • As we’re read-heavy, read all allocations for a customer as fast as possible
         – Denormalize allocations into a single row per customer
         – But, how do I denormalize?
       • Find all customers in a test/cell = reverse index
       • Get the count of customers in a test/cell = count the entries in the reverse index
    16. Denormalization HOWTO
       • The internets talk about it, but there are no real-world examples
         – ‘Normalization is for sissies’, Pat Helland
       • Denormalizing allocations per customer
         – Trivial with a schema-less database
    17. Denormalized allocations
       • Sample normalized data
       • Sample denormalized data (sparse!)
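
The slide shows the actual sample data; as a stand-in, here is a sketch of the two shapes: one relational row per allocation (normalized) versus one sparse row per customer whose columns hold all of that customer’s allocations (denormalized). All values are made up.

```java
import java.util.List;
import java.util.Map;

// Conceptual sketch of normalized vs. denormalized allocation data (made-up values).
public class DenormalizationExample {
    public static void main(String[] args) {
        // Normalized (relational): one row per customer/test allocation.
        List<long[]> normalizedRows = List.of(
            new long[] {1234, 42, 2},   // customerId, testId, cell
            new long[] {1234, 47, 0},
            new long[] {5678, 42, 1}
        );

        // Denormalized (Cassandra): one row per customer; every allocation lives
        // in that row as columns. Rows are sparse -- each customer only has
        // columns for the tests they are actually in.
        Map<Long, Map<String, String>> denormalizedRows = Map.of(
            1234L, Map.of("42:cell", "2", "42:enabled", "Y",
                          "47:cell", "0", "47:enabled", "Y"),
            5678L, Map.of("42:cell", "1", "42:enabled", "Y")
        );

        System.out.println(normalizedRows.size() + " normalized rows vs "
                + denormalizedRows.size() + " denormalized rows");
    }
}
```
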
    18. Implementing allocations
       • As an allocation for a customer has a handful of data points, they logically can be grouped together
       • Hello, super columns
       • Avoided blobs, JSON or otherwise
         – Data race concerns
         – BI integration
         – Serialization algorithm changes could tank the data
    19. Implementing allocations, second round
       • But, Cassandra devs secretly despise (er, don’t enjoy) super columns
       • Switched to a standard column family, using composite columns
       • Composite columns are sorted by each ‘token’ in the name
         – This sorts each allocation’s data together (by testId)
    20. Composite columns
       • Allocation column naming convention – <testId>:<field>
         – 42:cell = 2
         – 42:enabled = Y
         – 47:cell = 0
         – 47:enabled = Y
       • Using terse field names, but there is still per-column name overhead (~15 bytes)
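
A small sketch of that naming convention, using a sorted map of "testId:field" strings to stand in for a Cassandra row. The real implementation would use Cassandra’s composite column types; this only illustrates how sorting keeps one test’s fields adjacent.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of the <testId>:<field> composite-column convention.
// A TreeMap stands in for a Cassandra row: sorting by column name keeps all of
// one test's fields adjacent. (Note: a real composite type sorts the testId
// token numerically, whereas this string key sorts lexically.)
public class CompositeColumns {
    static String columnName(int testId, String field) {
        return testId + ":" + field;
    }

    public static void main(String[] args) {
        Map<String, String> customerRow = new TreeMap<>();
        customerRow.put(columnName(42, "cell"), "2");
        customerRow.put(columnName(42, "enabled"), "Y");
        customerRow.put(columnName(47, "cell"), "0");
        customerRow.put(columnName(47, "enabled"), "Y");

        // Columns for test 42 sort together, then test 47's.
        customerRow.forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```
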
    21. Implementing indices
       • Cassandra’s secondary indices vs. hand-built and maintained alternate indices
       • Secondary indices work great on uniform data between rows
       • But sparse column data is not so easy
    22. Hand-built indices, 1
       • Reverse index
         – Test/cell (key) to custIds (columns)
           • Column value is the timestamp
       • Mutate on allocating a customer into a test
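
A conceptual sketch of that reverse index, with plain maps standing in for the two column families; the mutation writes both the customer’s denormalized row and the reverse-index row. Illustrative only, not the actual Netflix code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Sketch: on allocation, mutate the customer's row and the reverse index together.
public class ReverseIndex {
    // customerId -> (columnName -> value): the denormalized allocation rows.
    static final Map<Long, Map<String, String>> allocations = new HashMap<>();
    // "testId:cellId" -> (customerId -> allocation timestamp): the reverse index.
    static final Map<String, Map<Long, Long>> reverseIndex = new HashMap<>();

    static void allocate(long customerId, int testId, int cellId) {
        long now = System.currentTimeMillis();
        Map<String, String> row = allocations.computeIfAbsent(customerId, k -> new TreeMap<>());
        row.put(testId + ":cell", String.valueOf(cellId));
        row.put(testId + ":enabled", "Y");

        // Reverse index: key is test/cell, one column per customer, value is the timestamp.
        reverseIndex.computeIfAbsent(testId + ":" + cellId, k -> new HashMap<>())
                    .put(customerId, now);
    }

    public static void main(String[] args) {
        allocate(1234L, 42, 2);
        allocate(5678L, 42, 2);
        System.out.println(reverseIndex.get("42:2").keySet()); // all customers in test 42, cell 2
    }
}
```
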
    23. Hand-built indices, 2
       • Counter column family
         – Test/cell (key) to the count of customers in the test (columns)
         – Mutate on allocating a customer into a test
       • Counters are not idempotent!
       • Counter mutates need to write to every node that hosts that key
    24. Index rebuilding
       • Yeah, even Oracle needs to have its indices rebuilt
       • Easy enough to rebuild the reverse index, but how about that counter column?
         – Read the reverse index for the count and write that as the counter’s value
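
A conceptual sketch of that rebuild step, again with plain maps standing in for the column families: count the columns in each reverse-index row and overwrite the corresponding counter.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: rebuild the counter column family from the reverse index.
public class CounterRebuild {
    public static void main(String[] args) {
        // "testId:cellId" -> (customerId -> timestamp), as maintained on allocation.
        Map<String, Map<Long, Long>> reverseIndex = Map.of(
            "42:2", Map.of(1234L, 1L, 5678L, 2L),
            "47:0", Map.of(1234L, 3L)
        );

        // "testId:cellId" -> customer count, overwritten from the reverse index.
        Map<String, Long> counters = new HashMap<>();
        reverseIndex.forEach((testCell, customers) ->
            counters.put(testCell, (long) customers.size()));

        System.out.println(counters); // e.g. {42:2=2, 47:0=1}
    }
}
```
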
    25. Modeling AB metadata in Cassandra
       • Explored several models, including JSON blobs, spreading across multiple CFs, and differing degrees of denormalization
       • A reverse index identifies all tests for loading
    26. Implementing metadata
       • One CF, one row for all of a test’s data
         – Every data point is a column – no blobs
       • Composite columns – type:id:field
         – Types = base info, cells, allocation plans
         – Id = cell number, allocation plan (gu)id
         – Field = type-specific
           • Base info = test name, description, enabled
           • Cell’s name / description
           • Plan’s start/end dates, country to allocate to
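
A sketch of the type:id:field convention for a single test’s metadata row; the concrete type names, ids, and values below are illustrative assumptions, not the actual schema.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of the type:id:field column naming for one test's metadata row.
public class MetadataRow {
    public static void main(String[] args) {
        Map<String, String> testRow = new TreeMap<>();
        // type = base info
        testRow.put("base:0:name", "Example homepage test");
        testRow.put("base:0:enabled", "Y");
        // type = cell, id = cell number
        testRow.put("cell:1:description", "Control experience");
        testRow.put("cell:2:description", "New experience");
        // type = allocation plan, id = plan (gu)id
        testRow.put("plan:3f9c:start", "2012-03-01");
        testRow.put("plan:3f9c:end", "2012-04-01");
        testRow.put("plan:3f9c:country", "US");

        // Sorted column names keep each type/id's fields together in the row.
        testRow.forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```
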
    27. Into the real world … here comes the hurt
    28. Allocation mutates
       • AB allocations are immutable, so how do you prevent mutating them?
         – Oracle – unique constraint on the table
         – Cassandra – read before write
       • Read before write in a distributed system is a data race
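
A small illustration of why check-then-write is a race: two clients can both observe “no allocation yet” before either write lands, and without a unique constraint the second write silently replaces the first. This is a generic concurrency sketch, not the actual client code.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

// Two "clients" both read (no allocation exists), then both write.
public class ReadBeforeWriteRace {
    static final ConcurrentHashMap<String, Integer> allocations = new ConcurrentHashMap<>();

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch bothHaveRead = new CountDownLatch(2);
        Runnable client = () -> {
            String key = "customer1234:test42";
            boolean alreadyAllocated = allocations.containsKey(key);   // read
            bothHaveRead.countDown();
            try { bothHaveRead.await(); } catch (InterruptedException ignored) {}
            if (!alreadyAllocated) {
                // Both clients reach this point; the later write overwrites
                // the supposedly immutable allocation.
                allocations.put(key, (int) (Math.random() * 3));        // write
            }
        };
        Thread a = new Thread(client), b = new Thread(client);
        a.start(); b.start(); a.join(); b.join();
        System.out.println("allocation = " + allocations.get("customer1234:test42"));
    }
}
```
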
    29. Running Cassandra
       • Compactions happen
         – Part of the Cassandra lifestyle
         – Mutations are written to memory (memtable)
         – Flushed to disk (sstable) on a triggering threshold
           • Time
           • Size
           • Operations against the column family
         – Eventually, Cassandra decides to merge sstables as data for individual rows becomes scattered
    30. Compactions, 2
       • Spikes happen, especially on read-heavy systems
         – Everything can slow down
         – Sometimes, average latency > 95th percentile
         – Throttling in newer Cassandra versions helps, I think
         – Affects clients (Hector, Astyanax)
    31. Repairs
       • Different from read repair!
       • Fix all the data in a single node by pulling shared ranges from neighbor nodes
    32. Repairs, 2
       • Replication factor determines the number of nodes involved in the repair of a single node
       • Neighbor nodes will perform a validation compaction
         – Pushes disk and network hard, depending on data size
       • Guess what happens when you run a multi-region cluster?
    33. Client libraries
       • Round-robin is not the way to go for connection pooling
         – Coordinator Cassandra nodes will incorrectly be marked down rather than the slow target node
       • Token-aware routing is safer and faster, but harder to implement
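
A simplified illustration of the token-aware idea: map the row key onto the ring and send the request straight to a node that owns that range, instead of round-robining through coordinators. Real client libraries use the cluster’s ring and partitioner metadata; the hashing and node list here are stand-ins.

```java
import java.util.List;

// Simplified token-aware routing sketch (not a real client implementation).
public class TokenAwareRouting {
    static final List<String> nodes = List.of("10.0.0.1", "10.0.0.2", "10.0.0.3");

    // Stand-in for the partitioner: map a row key onto one of the nodes.
    static String pickReplica(String rowKey) {
        int token = Math.floorMod(rowKey.hashCode(), nodes.size());
        return nodes.get(token);
    }

    public static void main(String[] args) {
        System.out.println("customer 1234 -> " + pickReplica("1234"));
        System.out.println("customer 5678 -> " + pickReplica("5678"));
    }
}
```
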
    34. Tunings, 1
       • Key and row caches
         – Left unbounded, they can chew up JVM memory needed for normal work
         – Latencies will spike as the JVM has to fight for memory
         – The off-heap row cache is better, but still maintains data structures on-heap
    35. Tunings, 2
       • mmap() as an in-memory cache
         – When the process is terminated, the mmap’d pages are added to the free list
    36. Tunings, 3
       • Sizing memtable flushes to optimize compactions
         – Easier when writes are uniformly distributed over time – easier to reason about flush patterns
         – Best to trigger flushes based on memtable size, not time
    37. Tunings, 4
       • Sharding
         – Not dead yet!
         – If a single row has disproportionately high gets/mutates, the nodes holding it will become hot spots
         – If a row grows too large, it won’t fit into memory
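
One common way to shard a hot or oversized row is to append a shard suffix to the row key so the data spreads across several rows (and therefore nodes), at the cost of fanning reads out across every shard. A sketch of the idea, not necessarily the exact approach used at Netflix.

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch: split one logical row across NUM_SHARDS physical row keys.
public class RowSharding {
    static final int NUM_SHARDS = 16;

    // Writes pick a shard (randomly here; hashing the column name also works).
    static String writeKey(String baseKey) {
        return baseKey + ":" + ThreadLocalRandom.current().nextInt(NUM_SHARDS);
    }

    // Reads must query every shard of the row and merge the results.
    static String[] readKeys(String baseKey) {
        String[] keys = new String[NUM_SHARDS];
        for (int i = 0; i < NUM_SHARDS; i++) {
            keys[i] = baseKey + ":" + i;
        }
        return keys;
    }

    public static void main(String[] args) {
        System.out.println("write -> " + writeKey("test42_allocations"));
        System.out.println("reads -> " + String.join(", ", readKeys("test42_allocations")));
    }
}
```
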
    38. Takeaways
       • Netflix is making all of our components distributed and fault tolerant as we grow domestically and internationally
       • Cassandra is a core piece of our cloud infrastructure
    39. 終わり (The End)
       • Q&A
       • @jasobrown, jasedbrown@gmail.com, http://www.linkedin.com/in/jasedbrown
    40. References
       • Pat Helland, “Normalization Is for Sissies”, http://blogs.msdn.com/b/pathelland/archive/2007/07/23/normalization-is-for-sissies.aspx
       • btoddb, “Storage Sizing”, http://btoddb-cass-storage.blogspot.com/
