Hellenic MongoDB user group - Introduction to sharding


Published on

Hellenic MongoDB user group.
16 Jan 2013,
1st meetup

Published in: Technology


  1. Introduction to sharding. Christos Soulios, Software Architect, Persado (christos.soulios@persado.com). 15 Jan 2013
  2. Let's start with an example:
     • We have launched our latest and greatest web application
     • We use the MongoDB database, which is fast and cool
     • We have even set up replication for high availability
     • Our application turns out to be popular and we are already planning our next project
     Cool!
  3. Unfortunately, our website becomes too popular too fast. And this causes problems.
  4. MongoDB problems when the dataset grows
     • The dataset does not fit on local disks. Solution: let's buy more disks
     • Database indexes do not fit in memory. They have to be paged in and out, and the database becomes sluggish. Solution: let's buy more memory
     • High-throughput write operations cause high contention on the infamous MongoDB locks. Now what?
     We need to scale horizontally. We need sharding.
  5. What is sharding?
     • Sharding is automatic data partitioning
     • Distributes data evenly across cluster nodes (called shards)
     • Allows for seamless querying. Almost no functionality is lost compared with a single master
     • Keeps the database consistent
  6. How sharding works
     • Collection data is broken into chunks based on ranges of a selected collection field. This field is called the shard key
     • Chunks are evenly distributed across shards. Each data chunk is controlled by a single shard
     • Special config servers are responsible for storing which shard controls which chunks
     • Database clients communicate with the shards through the mongos router process
     • To the client, the mongos router behaves just like a normal mongod server. Sharding is transparent to the client
     • For each database operation, the mongos router looks up the shard key in the chunk metadata held by the config servers and redirects the operation to the correct shards
     • As more data is inserted, ranges are split into more chunks
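The range-based routing described on this slide can be sketched in a few lines of plain JavaScript. This is only an illustration, not MongoDB's implementation: the chunk table, the shard names (`shardA`, `shardB`, `shardC`), the split points, and the helpers `routeByShardKey` and `splitChunk` are all hypothetical, and `user_id` is assumed as the shard key.

```javascript
// Hypothetical chunk table, similar in spirit to what the config servers record.
// Each chunk covers the half-open shard key range [min, max).
const chunks = [
  { min: -Infinity, max: 2000, shard: "shardA" },
  { min: 2000, max: 4000, shard: "shardB" },
  { min: 4000, max: Infinity, shard: "shardC" },
];

// Return the shard that owns the chunk containing the given shard key value.
function routeByShardKey(chunkTable, key) {
  const chunk = chunkTable.find((c) => key >= c.min && key < c.max);
  return chunk.shard;
}

// Split the chunk at `index` into two chunks at `splitPoint`.
// Both halves stay on the original shard until a balancer moves them.
function splitChunk(chunkTable, index, splitPoint) {
  const c = chunkTable[index];
  const left = { min: c.min, max: splitPoint, shard: c.shard };
  const right = { min: splitPoint, max: c.max, shard: c.shard };
  return [
    ...chunkTable.slice(0, index),
    left,
    right,
    ...chunkTable.slice(index + 1),
  ];
}

console.log(routeByShardKey(chunks, 45));   // shardA
console.log(routeByShardKey(chunks, 4503)); // shardC

// After more inserts, the first range is split into two chunks.
const afterSplit = splitChunk(chunks, 0, 1000);
console.log(afterSplit.length);             // 4
```

The ranges are half-open so that every key value belongs to exactly one chunk, which is what lets a single shard control each chunk.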
  7. Example (Users collection)
     { 'user_id': 45, 'username': 'asterix', 'email': 'asterix@google.com', 'last_login': '11/11/2012' },
     { 'user_id': 4503, 'username': 'gandalf', 'email': 'gandalf_rules@yahoo.com', 'last_login': '01/14/2013' },
     { 'user_id': 1153, 'username': 'superman', 'email': 'superman@superdomain.com', 'last_login': '10/30/2012' },
     { 'user_id': 5434, 'username': 'darth_vader', 'email': 'darth@stardestroyer.org', 'last_login': '07/01/2012' }
     > db.runCommand( { shardcollection: "test.users", key: { username: 1 } } )
  8. Shard architecture (sharding by user_id)
  9. Database operations
     • All queries are routed through the mongos process
     • Insert operations are routed by shard key. The shard key is required
     • Querying by shard key routes the query only to the shards that hold the matching chunks
     • Querying by a non-shard-key field scatters the query to all shards and gathers the results
     • Updates and deletes behave like queries
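The difference between a targeted query and a scatter-gather query can be illustrated with a small in-memory model. This is a sketch under assumptions: the `cluster` layout, the shard names, the range boundaries, and the `find` helper are hypothetical, and `user_id` is assumed to be the shard key, as in the slide 8 title.

```javascript
// Hypothetical in-memory cluster: each shard holds the documents whose
// user_id (the assumed shard key) falls in its range [min, max).
const cluster = [
  {
    name: "shardA", min: -Infinity, max: 2000,
    docs: [
      { user_id: 45, username: "asterix" },
      { user_id: 1153, username: "superman" },
    ],
  },
  {
    name: "shardB", min: 2000, max: 5000,
    docs: [{ user_id: 4503, username: "gandalf" }],
  },
  {
    name: "shardC", min: 5000, max: Infinity,
    docs: [{ user_id: 5434, username: "darth_vader" }],
  },
];

// Equality-only query, loosely imitating what a router has to decide:
// if the query contains the shard key, contact only the owning shard;
// otherwise contact every shard and merge the results.
function find(query) {
  const targeted = "user_id" in query;
  const contacted = targeted
    ? cluster.filter((s) => query.user_id >= s.min && query.user_id < s.max)
    : cluster;
  const results = contacted.flatMap((s) =>
    s.docs.filter((d) =>
      Object.entries(query).every(([field, value]) => d[field] === value)
    )
  );
  return { contacted: contacted.map((s) => s.name), results };
}

const byShardKey = find({ user_id: 4503 });     // targeted: one shard
const byOtherField = find({ username: "gandalf" }); // scatter-gather: all shards
console.log(byShardKey.contacted, byOtherField.contacted);
```

Both calls return the same document, but the second one has to ask every shard, which is the cost the later slides on shard key selection try to avoid.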
  10. Data balancing
      • The system becomes unbalanced when one shard stores more data chunks than the others
      • Data is automatically balanced without intervention from the client application or the administrator
  11. Data balancing
      • The range of the loaded shard is split and chunks are migrated to other shards
  12. Data balancing
      • Config servers are updated using a two-phase commit process to ensure database consistency
      • The system ends up balanced
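The balancing idea from these three slides can be sketched as a simple loop that moves chunks from the fullest shard to the emptiest one. This is a deliberately simplified model, not MongoDB's balancer: the `balance` function, the `threshold` parameter, and the chunk ids are hypothetical, and the two-phase commit against the config servers is not modeled at all.

```javascript
// Hypothetical balancer. `shards` maps a shard name to the list of chunk ids
// it currently stores. Chunks are moved one at a time from the most loaded
// shard to the least loaded one until the difference is within `threshold`.
function balance(shards, threshold = 1) {
  const limit = Math.max(1, threshold); // a limit below 1 could never settle
  const migrations = [];
  for (;;) {
    const names = Object.keys(shards);
    const most = names.reduce((a, b) =>
      shards[b].length > shards[a].length ? b : a
    );
    const least = names.reduce((a, b) =>
      shards[b].length < shards[a].length ? b : a
    );
    if (shards[most].length - shards[least].length <= limit) break;
    const chunk = shards[most].pop();
    shards[least].push(chunk);
    migrations.push({ chunk, from: most, to: least });
  }
  return migrations;
}

// One shard has received most of the chunks.
const shards = {
  shardA: ["c1", "c2", "c3", "c4", "c5", "c6"],
  shardB: ["c7"],
  shardC: ["c8"],
};
const migrations = balance(shards);
console.log(migrations.length); // 3
console.log(Object.values(shards).map((c) => c.length)); // [ 3, 3, 2 ]
```

In a real cluster each migration would also have to update the chunk metadata atomically, which is the role the slide assigns to the two-phase commit.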
  13. Choosing a shard key
      • Choosing a good shard key is critical
      • Once chosen, we are stuck with it
      • The shard key must be immutable
      • It should distribute data load evenly across shards
      • It should be of high cardinality. Enumerated values are not good shard keys
      • It should not be monotonically increasing. ObjectIds, dates or database sequences are not good shard keys, because they create hotspots
      • It should be used by the most critical queries to provide query isolation. Avoid scatter-gather queries
      • It should provide good data affinity to avoid disk-to-memory transfers (random values are not good shard keys)
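The hotspot caused by a monotonically increasing shard key can be made concrete by counting where inserts land. The sketch below is illustrative only: the chunk boundaries, shard names, `insertLoad` helper, and both key sequences are made up for the example.

```javascript
// Hypothetical range chunks over a numeric shard key.
const numericChunks = [
  { min: -Infinity, max: 1000, shard: "shardA" },
  { min: 1000, max: 2000, shard: "shardB" },
  { min: 2000, max: Infinity, shard: "shardC" },
];

// Count how many inserts each shard receives for a list of shard key values.
function insertLoad(chunkTable, keys) {
  const load = {};
  for (const c of chunkTable) load[c.shard] = 0;
  for (const key of keys) {
    const chunk = chunkTable.find((c) => key >= c.min && key < c.max);
    load[chunk.shard] += 1;
  }
  return load;
}

// A sequence-like key: every new value is larger than all previous ones,
// so every insert falls into the last chunk.
const sequenceKeys = Array.from({ length: 300 }, (_, i) => 2000 + i);

// A key whose values are spread over the whole range
// (generated deterministically here so the example is reproducible).
const spreadKeys = Array.from({ length: 300 }, (_, i) => (i * 37) % 3000);

const sequenceLoad = insertLoad(numericChunks, sequenceKeys);
const spreadLoad = insertLoad(numericChunks, spreadKeys);
console.log(sequenceLoad); // all 300 inserts on shardC
console.log(spreadLoad);   // inserts divided among the three shards
```

With the increasing key, one shard takes the entire write load no matter how many shards exist, which is the hotspot the slide warns about.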
  14. Choosing a shard key
      Know your data. It is important.
      • What is the expected dataset size?
      • What is the write throughput?
      • What does the data look like? Which fields are random or increasing? Are there low-cardinality fields?
      • Can we identify any access patterns for reads? What data is indexed?
      • What is the active working set? Is there historical data that is not used after some time?
  15. Choosing a shard key
      • It is not trivial
      • Most of the time there is no single field that can be used as the shard key
      • We have to invent one
  16. Choosing a shard key
      • Applications usually access recently inserted data more often
      • What about a compound shard key?
      • What about a combination of a coarsely ascending field and a commonly queried search key?
      • The coarsely ascending key should have a few hundred chunks per value. This provides good data locality and even distribution
      • The search key provides query isolation
      • Rule of thumb: {coarseLocality: 1, search: 1}
  17. Example (Tweets collection)
      { user: 'asterix', ts: ISODate("2013-01-14T22:53:33.123Z"), month: '2013-01', retweets: 45, client: 'TweetDeck', text: 'MongoDB sharding is super cool!' }
      We are typically looking for the latest tweets of a user. Therefore, a combination of the 'month' and 'user' fields would make a good shard key.
      • The month field is coarsely ascending, so only the latest tweets need to be loaded into memory
      • The user field is a commonly searched key
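To see how a compound key such as {month: 1, user: 1} isolates a query, the sketch below routes (month, user) pairs through chunks whose bounds are compared field by field. Everything here is an assumption made for illustration: the chunk boundaries, the shard names, the `MIN` and `MAX` sentinels, and the helpers `compareKeys` and `shardFor`. Following the syntax of the earlier users example, the collection itself might be sharded with something like `db.runCommand( { shardcollection: "test.tweets", key: { month: 1, user: 1 } } )`, where the `test.tweets` namespace is hypothetical.

```javascript
// Compare two compound keys [month, user] field by field (lexicographically).
function compareKeys(a, b) {
  for (let i = 0; i < a.length; i++) {
    if (a[i] < b[i]) return -1;
    if (a[i] > b[i]) return 1;
  }
  return 0;
}

// Stand-ins for the lowest and highest possible key in this toy model.
const MIN = ["", ""];
const MAX = ["\uffff", "\uffff"];

// Hypothetical chunks over the compound key; each covers [min, max).
const tweetChunks = [
  { min: MIN, max: ["2012-12", "m"], shard: "shardA" },
  { min: ["2012-12", "m"], max: ["2013-01", "h"], shard: "shardB" },
  { min: ["2013-01", "h"], max: MAX, shard: "shardC" },
];

// Find the shard that owns the chunk containing the (month, user) pair.
function shardFor(month, user) {
  const key = [month, user];
  const chunk = tweetChunks.find(
    (c) => compareKeys(key, c.min) >= 0 && compareKeys(key, c.max) < 0
  );
  return chunk.shard;
}

// The latest tweets of one user in one month resolve to a single shard.
console.log(shardFor("2013-01", "asterix")); // shardB
console.log(shardFor("2013-01", "obelix"));  // shardC
console.log(shardFor("2012-11", "asterix")); // shardA
```

Because the month comes first in the key, tweets from the same month sit next to each other in the key space, while the user field spreads a single month's writes over several chunks instead of one.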
  18. Conclusion
      • Sharding allows MongoDB databases to scale horizontally
      • Shard balancing is performed automatically by the system
      • Sharding is transparent to the client application
      • Choosing a good shard key is critical
      • Choosing a good shard key is not trivial
      • Be creative and experiment with your data before choosing the shard key
  19. Questions?