
Avoiding Data Hotspots at Scale

There are two key choices when scaling a NoSQL data store: choosing between hash-based and range-based sharding, and choosing the right sharding key. Either choice is a trade-off between the scalability of read, append, and update workloads. In this talk I will present the standard scaling techniques, some non-universal sharding tricks, less obvious causes of hotspots, and techniques to avoid them.

  1. Brought to you by ScyllaDB. Avoiding Data Hotspots at Scale. Konstantin Osipov, Engineering at ScyllaDB
  2. Konstantin Osipov, Director of Engineering ■ Worked on lightweight transactions in Scylla ■ Rarely happy with the status quo (AKA the stubborn one) ■ A very happy father ■ Career and public speaking coach
  3. RUM conjecture and scalability
  4. What this talk is not about: ● replication ● re-sharding and re-balancing data ● distributed queries & jobs. The talk will focus on the principles of data distribution only.
  5. Ways to shard
  6. Define sharding. Sharding is horizontal partitioning of data across multiple servers. It can be used to scale the capacity and (possibly) the throughput of the database. Three key challenges: ● choosing a way to split data across nodes ● re-balancing data and maintaining location information ● routing queries to the data (a minimal routing sketch follows).
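
A minimal sketch of the splitting and routing challenges under static membership, in Python; the node names and helper are hypothetical. Each key is hashed and taken modulo the node count:

    import zlib

    NODES = ["node-a", "node-b", "node-c"]  # hypothetical fixed cluster

    def route(key: str) -> str:
        # Hash the key and pick a node by taking the hash modulo the node count.
        return NODES[zlib.crc32(key.encode()) % len(NODES)]

    print(route("user:42"))  # the same key always routes to the same node

The weakness of plain modulo routing is re-balancing: adding or removing a node remaps almost every key, which is what the consistent hashing on the next slide avoids.
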
  7. Hash-based sharding (diagram: hash ring, hashed keys, consistent hash, Ketama hash)
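
A minimal consistent-hash ring sketch, assuming virtual nodes to even out ownership; it illustrates the idea rather than the exact Ketama algorithm. A key belongs to the first node clockwise from its hash, so adding or removing a node only moves the keys adjacent to that node's ring positions:

    import bisect
    import hashlib

    class HashRing:
        def __init__(self, nodes, vnodes=100):
            # Place each node at `vnodes` points on the ring to even out ownership.
            self.ring = sorted(
                (self._hash(f"{node}#{i}"), node)
                for node in nodes
                for i in range(vnodes)
            )
            self.positions = [h for h, _ in self.ring]

        @staticmethod
        def _hash(value: str) -> int:
            # 64-bit ring position derived from an MD5 digest.
            return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

        def node_for(self, key: str) -> str:
            # First ring position clockwise from the key's hash, wrapping around.
            i = bisect.bisect(self.positions, self._hash(key)) % len(self.ring)
            return self.ring[i][1]

    ring = HashRing(["node-a", "node-b", "node-c"])
    print(ring.node_for("user:42"))
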
  8. Sharding: hash + virtual buckets in Couchbase
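
Couchbase's scheme can be sketched as a fixed space of 1,024 virtual buckets plus a mutable vbucket-to-server map; the map layout below is illustrative. Re-balancing moves whole vbuckets by updating the map, so keys never need re-hashing:

    import zlib

    N_VBUCKETS = 1024  # fixed bucket count, as in Couchbase

    # Cluster metadata: which server owns each vbucket (illustrative layout).
    # Moving a vbucket to another server only updates this table.
    vbucket_map = {vb: f"server-{vb % 3}" for vb in range(N_VBUCKETS)}

    def server_for(key: str) -> str:
        # The key-to-vbucket mapping is stable; only vbucket ownership changes.
        vb = zlib.crc32(key.encode()) % N_VBUCKETS
        return vbucket_map[vb]

    print(server_for("user:42"))
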
  9. Sharding: chunk splits and migrations in MongoDB
  10. Hotspots
  11. Range-based sharding
  12. Sharding: ranges in CockroachDB
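
Range sharding keeps keys sorted and assigns contiguous key ranges to nodes, as CockroachDB does with its ranges. A minimal lookup sketch over hypothetical split points:

    import bisect

    # Sorted exclusive upper bounds of each range, and the node owning each
    # resulting range (both illustrative).
    split_points = ["g", "n", "t"]
    owners = ["node-a", "node-b", "node-c", "node-d"]

    def node_for(key: str) -> str:
        # Keys below "g" go to node-a, keys in ["g", "n") to node-b, and so on.
        return owners[bisect.bisect_right(split_points, key)]

    print(node_for("alice"))  # node-a
    print(node_for("zed"))    # node-d
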
  13. MongoDB docs: For queries that don’t include the shard key, mongos must query all shards, wait for their response and then return the result to the application. These “scatter/gather” queries can be long running operations. However, range based partitioning can result in an uneven distribution of data, which may negate some of the benefits of sharding. For example, if the shard key is a linearly increasing field, such as time, then all requests for a given time range will map to the same chunk, and thus the same shard. In this situation, a small set of shards may receive the majority of requests and the system would not scale very well.
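
The scatter/gather pattern can be sketched as follows; the Shard class and its methods are hypothetical stand-ins, not MongoDB's API. A query carrying the shard key is routed to exactly one shard, while any other predicate must be broadcast to all shards and the results merged:

    import zlib

    class Shard:
        # Hypothetical shard holding rows keyed by the shard key.
        def __init__(self):
            self.rows = {}

        def find(self, key):
            return [self.rows[key]] if key in self.rows else []

        def scan(self, predicate):
            return [row for row in self.rows.values() if predicate(row)]

    def query(shards, shard_key=None, predicate=None):
        if shard_key is not None:
            # Targeted read: the router computes the single owning shard.
            return shards[zlib.crc32(shard_key.encode()) % len(shards)].find(shard_key)
        # Scatter/gather: every shard is queried and the results are merged.
        results = []
        for shard in shards:
            results.extend(shard.scan(predicate))
        return results
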
  14. Spanner docs: One cause of hotspots is having a column whose value monotonically increases as the first key part, because this results in all inserts occurring at the end of your key space. This pattern is undesirable because Cloud Spanner divides data among servers by key ranges, which means all your inserts will be directed at a single server that will end up doing all the work.
  15. Avoiding hotspots
  16. Bit-reversing the partition key
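
Bit-reversal turns a monotonically increasing key into one that jumps around the key space, spreading sequential inserts across range-sharded servers while remaining reversible. A small sketch for 64-bit integer keys:

    def bit_reverse64(x: int) -> int:
        # Reverse the order of the 64 bits of x; applying the function
        # twice recovers the original value.
        result = 0
        for _ in range(64):
            result = (result << 1) | (x & 1)
            x >>= 1
        return result

    sequence = [1, 2, 3, 4]
    print([bit_reverse64(n) for n in sequence])
    # Consecutive inputs land far apart in the 64-bit key space.
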
  17. Descending order for timestamp-based keys:

      CREATE TABLE UserAccessLog (
        UserId INT64 NOT NULL,
        LastAccess TIMESTAMP NOT NULL,
        ...
      ) PRIMARY KEY (UserId, LastAccess DESC);
  18. Replicating dimension tables everywhere
  19. VoltDB docs: To further optimize performance, VoltDB allows selected tables to be replicated on all partitions of the cluster. This strategy minimizes cross-partition join operations. For example, a retail merchandising database that uses product codes as the primary key may have one table that simply correlates the product code with the product's category and full name. Since this table is relatively small and does not change frequently (unlike inventory and orders) it can be replicated to all partitions. This way stored procedures can retrieve and return user-friendly product information when searching by product code without impacting the performance of order and inventory updates and searches.
  20. Good and bad shard keys ■ good: user session, shopping order ■ maybe: user_id (if user data isn’t too thick) ■ better: (user_id, post_id) ■ bad: inventory item, order date
  21. Special cases
  22. Scaling a message queue
  23. Scaling in a data warehouse ■ Data warehouses usually don’t check unique constraints ■ Data is sorted multiple times, according to multiple dimensions ■ Sharding can be done according to a hash of multiple fields
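
Sharding on a hash of multiple fields can be sketched like this; the column names and shard count are illustrative. Combining several dimensions into one composite partition key keeps any single dimension (for example, the order date alone) from concentrating the load on one shard:

    import zlib

    N_SHARDS = 16  # illustrative shard count

    def shard_for(row: dict) -> int:
        # Derive the partition key from several dimensions of the row.
        composite = f"{row['customer_id']}|{row['product_id']}|{row['order_date']}"
        return zlib.crc32(composite.encode()) % N_SHARDS

    print(shard_for({"customer_id": 7, "product_id": 1001,
                     "order_date": "2021-03-01"}))
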
  24. Let’s recap
  25. Summary: design choices

      Workload                                   Hash             Range
      Write-heavy / monotonic / time series      Linear scaling   Hotspots
      Primary key read                           Linear scaling   Linear scaling
      Partial key read                           Hotspots         Linear scaling
      Indexed range read                         Hotspots         Linear scaling
      Non-indexed read                           Hotspots         Hotspots
  26. Brought to you by ScyllaDB. Konstantin Osipov, kostja@scylladb.com, @kostja_osipov
