Scylla @ Disney+ Hotstar

Editor's Notes

  • #4 Notes : Add a graphic explaining the numbers. Continue-Watching accounts for a huge percentage of watch-time for Hotstar. Every day, on average, our users watch 1B minutes of video, and we process almost 100-200 GB of data to maintain an accurate Continue-Watching state for over 300 million users. Our use case mainly needed a DB that can handle heavy writes, due to the volatile nature of user watching behaviour. We also needed a DB that can scale during high-traffic times, when the request volume goes 10-20x within a minute.
  • #5 Notes : Add a graphic explaining the numbers. Disney+ Hotstar, as an OTT platform, requires a strong data store for Continue-Watching data; the scale numbers and DB requirements are the same as in #4.
  • #6 Cross platform
  • #7 Next episode / New episode
  • #8 Next episode / New episode
  • #9 Redis: Redis gave good latencies, but the growth in data size meant we had to horizontally scale the cluster, which increased our cost every 3-4 months. Elasticsearch: Elasticsearch latencies were on the higher end, around 200 ms on average; the cost of the DB was very high considering the returns, and we often had issues with node maintenance that required manual effort to resolve.
  • #10 ES doc → Redis → What is the problem with ES and Redis? Explain graphically.
  • #11 Multiple data stores: Redis, Elasticsearch, and Scylla open-source. Different data models. Huge data, on the order of TBs. Cost of migration.
  • #12 We chose to go with a NoSQL key-value data store and simplified the data model to just two tables (see the CQL sketch below). User table: used to retrieve the entire tray at once for a user; a newly added movie is appended under the same user_id key. User-Content table: used for modifying a specific content_id's data, e.g. when the user resumes the video and pauses it at a later time, the updated timestamp is stored; when the video is fully watched, the entry can be directly queried and deleted.
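    A minimal CQL sketch of the two-table model above; the keyspace, table, and column names are assumptions for illustration, not the production schema.

      -- User table: one partition per user; the whole Continue-Watching tray
      -- is read back with a single partition query.
      CREATE TABLE IF NOT EXISTS cw.user_tray (
          user_id      text,
          content_id   text,
          updated_at   timestamp,
          resume_point int,                       -- seconds into the video
          PRIMARY KEY ((user_id), content_id)
      );

      -- User-Content table: keyed by (user_id, content_id) so a single entry can be
      -- updated when playback pauses, or deleted once the title is fully watched.
      CREATE TABLE IF NOT EXISTS cw.user_content (
          user_id      text,
          content_id   text,
          updated_at   timestamp,
          resume_point int,
          PRIMARY KEY ((user_id, content_id))
      );

      -- Typical access patterns:
      --   SELECT * FROM cw.user_tray WHERE user_id = 'u1';                               -- full tray
      --   UPDATE cw.user_content SET resume_point = 1800, updated_at = toTimestamp(now())
      --     WHERE user_id = 'u1' AND content_id = 'c42';                                 -- pause / resume
      --   DELETE FROM cw.user_content WHERE user_id = 'u1' AND content_id = 'c42';       -- fully watched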
  • #17 We used snapshots of Redis instead of exporting data directly, since we didn't want to put load on the Redis machines. Exported the data from Redis to an RDB file, converted the RDB file to a CSV file, then loaded it with COPY `table` FROM `csv` WITH DELIMITER=`,` AND CHUNKSIZE=1 (see the cqlsh sketch below). Ran with 7 threads and completed 1M records in 15 minutes; scaled up the number of threads and increased the number of boxes to speed up the process. A similar approach was followed with Elasticsearch.
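    A sketch of the cqlsh COPY step from #17, assuming the hypothetical cw.user_content table above and an illustrative CSV path; CHUNKSIZE and the worker count mirror the numbers in the note.

      -- Run inside cqlsh on each loader box.
      COPY cw.user_content (user_id, content_id, updated_at, resume_point)
      FROM '/data/redis-export/user_content.csv'
      WITH DELIMITER = ','
       AND HEADER = false
       AND CHUNKSIZE = 1           -- value from the note above
       AND NUMPROCESSES = 7;       -- matches the 7 loader threads mentioned above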
  • #18 How did we move our prod APIs to point to the new cluster / new flow and terminate the older ones -- add diagrams. How did we master the migration?
  • #19 Before moving to Scylla Cloud we initially moved our data to Scylla open-source. After we explored the advantages of Scylla Cloud and its enriched support, we decided to move our data to Scylla Cloud.
  • #20 Link the snapshot folder into the Scylla data folder: ln -s ../snapshot<i> <keyspace>/<table>
  • #21 Cons: We were able to migrate the table with a single primary key to Scylla Cloud. We ran in batches of 3 nodes at a time in order to avoid affecting our production open-source table. SSTable migration slowed down when we had a secondary / composite key; to speed up the process we tried the Scylla Spark Migrator. Notes: We ran in batches to keep the load on the active cluster low.
  • #22 The unirestore tool helps automate this. Duplicated the data with replication factor = 1. Mention the price; it is a big node to use.
  • #31 Append-only writes: writes are fast on Cassandra / Scylla, so blind writes are cheap. Aggregate on reads: reduce the number of rows, remove duplicated contents, partial aggregation for less online computation. Expiration: aggregated into buckets by month (see the CQL sketch below).
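    A minimal CQL sketch of the append-only, month-bucketed layout described in #31; the table name, columns, and the 90-day TTL are assumptions for illustration.

      CREATE TABLE IF NOT EXISTS cw.watch_events (
          user_id      text,
          month_bucket text,                      -- e.g. '2023-04', keeps partitions bounded
          event_time   timestamp,
          content_id   text,
          resume_point int,
          PRIMARY KEY ((user_id, month_bucket), event_time, content_id)
      ) WITH default_time_to_live = 7776000;      -- 90 days; rows expire instead of being deleted

      -- Blind, append-only write (no read-before-write):
      --   INSERT INTO cw.watch_events (user_id, month_bucket, event_time, content_id, resume_point)
      --   VALUES ('u1', '2023-04', toTimestamp(now()), 'c42', 1800);
      -- Reads fetch one (user, month) partition and aggregate / de-duplicate in the service,
      -- which keeps the online computation small (partial aggregation).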
  • #33 Timeline: We started seeing high latency on the Scylla cluster. We ran nodetool repair -pr on two of the old nodes sequentially; the second one got stuck and was terminated (https://docs.scylladb.com/kb/stop-local-repair/) at ~8pm IST. We scaled up the cluster, adding 3 new nodes. We tried repairing one of the newly added nodes; the repair got stuck, and we terminated it with https://docs.scylladb.com/kb/stop-local-repair/
  • #35 Tombstones will bite you if you do lots of deletes! OMG! We are the antipattern: https://www.datastax.com/blog/cassandra-anti-patterns-queues-and-queue-datasets
  • #36 Tombstones will bite you if you do lots of deletes! OMG! We are the antipattern: https://www.datastax.com/blog/cassandra-anti-patterns-queues-and-queue-datasets
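    To illustrate the anti-pattern called out in #35-#36, a minimal CQL sketch using the hypothetical cw.user_tray table from #12: delete-heavy Continue-Watching traffic leaves tombstones that every full-tray read has to skip. The gc_grace_seconds value shown is the Cassandra/Scylla default; lowering it is only an illustrative option, not something the notes prescribe.

      -- Every fully-watched title is removed from the tray; each DELETE writes a tombstone.
      DELETE FROM cw.user_tray WHERE user_id = 'u1' AND content_id = 'c42';

      -- The full-tray read scans the whole partition and walks over all accumulated
      -- tombstones until compaction purges them, which is what drives read latency up.
      SELECT * FROM cw.user_tray WHERE user_id = 'u1';

      -- Tombstones are only dropped by compaction after gc_grace_seconds
      -- (default 864000 seconds = 10 days).
      -- ALTER TABLE cw.user_tray WITH gc_grace_seconds = 86400;   -- example only; needs timely repairs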
  • #37 We observed that compactions were causing the latency spikes. To avoid them, we stopped auto-compactions during the morning hours and enabled a daily major compaction early in the morning. The goal is to have predictable latencies.
  • #38 Operations & Compaction. The cluster is back to normal after removing tombstones with compaction.
  • #39 Operations & Compaction