Webinar: MongoDB 2.4 Feature Demo and Q&A on Hash-based Sharding

In version 2.4, MongoDB introduces hash-based sharding, allowing the user to shard based on a randomized shard key to spread documents evenly across a cluster. Hash-based sharding is an alternative to range-based sharding, making it easier to manage your growing cluster. In this talk, we'll provide an overview of this new feature and discuss the pros and cons of a hash-based vs. a range-based sharding approach.

Published in: Technology, Business
Notes
  • BIO: I live and work in the Washington D.C. area and focus on supporting the MongoDB community and delivering solutions using MongoDB to the Federal government. I’ve spent the last 7 years or so working with NoSQL databases. I’ve worked in a variety of industries including precise timing systems, military command and control and digital mapping, e-commerce, distributed computing platforms, search, system integration and Big Data applications.
  • Remind everyone what a sharded cluster is. We will take a close look at how sharded clusters work and at the new hashed shard key feature of 2.4.
  • Isolating queries (to a few shards). Scatter-gather (high latency, but not bad). Hash keys.
  • Min value included. Max value not included.
  • The balancer runs on mongos. Once the difference in chunks between the most dense shard and the least dense shard is above the migration threshold, a balancing round starts.
  • Moved chunk on shard2 should be gray
  • Source shard deletes moved data. Must wait for open cursors to either close or time out. NoTimeout cursors may prevent the release of the lock. Mongos releases the balancer lock after old chunks are deleted.
  • Moving data is expensive (I/O, network bandwidth). Moving many chunks takes a long time (can only move one chunk at a time). Balancing and migrations compete for resources with your application.
  • What’s the solution to sharding on incremental values as a shard key?
  • mongod --setParameter=enableTestCommands=1
  • Seed and hashVersion are undocumented at this point. Let people know we are at least thinking about these things (especially the ability to change the hash algorithm).
  • When sharding a new collection using Hash-based shard keys, MongoDB will take care of the presplitting for you. Similarly sized ranges of the Hash-based key are distributed to each existing shard, which means that no initial balancing is needed (unless of course new shards are added).
  • Only happens on new collections
  • Query contains the shard key field
  • The mongos does not have to load the whole set into memory since each shard sorts locally. The mongos can just getMore from the shards as needed and incrementally return the results to the client.
  • Uses the hashed index
  • Assuming only a hashed index on “x”
  • Tag-aware note: it doesn’t usually make a lot of sense to tag anything other than the full hashed shard key collection to particular shards. By design, there’s no real way to know or control what data is in what range. Since the chunk ranges are based on the value of the randomized hash of the shard key instead of the shard key itself, tagging is usually only useful for assigning the whole range to a specific set of shards.

    1. 2.4 Sharding Features. James Kerr, Senior Solutions Architect, 10gen
    2. Agenda
        • Mechanics of sharding
          – Key space
          – Chunks
          – Balancing
        • Types of requests
        • Hashed shard keys
          – Why use hashed shard keys
          – How to enable hashed shard keys
          – Limitations
    3. Sharded Cluster
    4. Sharding your data
    5. What is a Shard Key
        • Shard key is used to partition your collection
        • Shard key must exist in every document
        • Shard key is immutable
        • Shard key values are immutable
        • Shard key must be indexed
        • Shard key is used to route requests to shards
    6. The key space: {x: 10} {x: -5} {x: -9} {x: 7} {x: 6} {x: 0}
    7. Inserting data: {x: 0} {x: 6} {x: 7} {x: -5} {x: 10} {x: -9}
    8. Inserting data: {x: 0} {x: 6} {x: 7} {x: -5} {x: 10} {x: -9}
    9. Chunk range and size: {x: 0} {x: 6} {x: 7} {x: -5} {x: 10} {x: -9}
    10. Inserting further data: {x: 0} {x: 6} {x: 7} {x: -5} {x: 10} {x: -9} {x: 9} {x: -7} {x: 3}
    11. Chunk splitting: {x: 0} {x: 6} {x: 7} {x: -5} {x: 10} {x: -9} (diagram: chunk boundary at 0)
        • A chunk is split once it exceeds the maximum size
        • There is no split point if all documents have the same shard key
        • Chunk split is a logical operation (no data is moved)
        • If a split creates too large a discrepancy in chunk count across the cluster, a balancing round starts
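The splitting rules on slide 11 can be illustrated with a small sketch. It is plain JavaScript rather than server code: `findSplitPoint` is a hypothetical helper, and choosing the median key is an assumption made for illustration, not a description of how the server picks split points. It does show why a chunk whose documents all share one shard key value cannot be split.

```javascript
// Hypothetical helper: choose a split point from a chunk's sorted shard key values.
// The split point becomes the inclusive lower bound of the new upper chunk, so it
// must be greater than the smallest key, otherwise the lower chunk would be empty.
function findSplitPoint(sortedKeys) {
  if (sortedKeys.length < 2) return null;
  let i = Math.floor(sortedKeys.length / 2); // start at the median (assumption)
  // Skip forward past keys equal to the smallest key in the chunk.
  while (i < sortedKeys.length && sortedKeys[i] === sortedKeys[0]) i++;
  return i < sortedKeys.length ? sortedKeys[i] : null;
}

console.log(findSplitPoint([-9, -5, 0, 6, 7, 10])); // a usable split point
console.log(findSplitPoint([5, 5, 5, 5]));          // null: identical keys, no split point
```

Note that the function only returns a boundary value. Nothing is copied or moved, which mirrors the slide's point that a split is a logical operation.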
    12. Data distribution
        • MinKey to 0 lives on Shard1
        • 0 to MaxKey lives on Shard2
        • Mongos routes queries appropriately
    13. Mongos routes data (diagram: chunks minKey to 0 and 0 to maxKey): db.test.insert({ x: -1000 })
    14. Mongos routes data (diagram: chunks minKey to 0 and 0 to maxKey): db.test.insert({ x: -1000 })
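As a rough model of what the router does with the two chunks above, the following sketch looks up the chunk whose range contains a shard key value. The chunk table and `routeToShard` are hypothetical names, and `-Infinity`/`Infinity` merely stand in for MinKey/MaxKey. Per the speaker notes, the lower bound of a chunk is inclusive and the upper bound is exclusive.

```javascript
// Stand-ins for MinKey and MaxKey (assumption: numeric shard keys only).
const MIN_KEY = -Infinity;
const MAX_KEY = Infinity;

// Chunk table matching slide 12: [MinKey, 0) on Shard1 and [0, MaxKey) on Shard2.
const chunks = [
  { min: MIN_KEY, max: 0, shard: "Shard1" },
  { min: 0, max: MAX_KEY, shard: "Shard2" },
];

// Hypothetical router lookup: min is inclusive, max is exclusive.
function routeToShard(chunkTable, key) {
  const chunk = chunkTable.find((c) => key >= c.min && key < c.max);
  if (!chunk) throw new Error("no chunk covers key " + key);
  return chunk.shard;
}

console.log(routeToShard(chunks, -1000)); // Shard1, as in db.test.insert({ x: -1000 })
console.log(routeToShard(chunks, 0));     // Shard2, because the lower bound is inclusive
```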
    15. Unbalanced shards (diagram: chunks minKey to 0 and 0 to maxKey)
    16. Balancing
        • Migration threshold
        • Number of chunks less than 20, migration threshold of 2
        • 21-80, migration threshold 4
        • >80, migration threshold 8
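The threshold table on slide 16 translates into a short decision function. This is only a sketch: the slide does not say which band exactly 20 chunks falls into, so placing it in the middle band is an assumption, and the strict "greater than" comparison follows the speaker note's wording ("above the migration threshold"). The function names are hypothetical.

```javascript
// Migration threshold by total chunk count, following slide 16.
// Assumption: exactly 20 chunks is treated as part of the 21-80 band.
function migrationThreshold(totalChunks) {
  if (totalChunks < 20) return 2;
  if (totalChunks <= 80) return 4;
  return 8;
}

// Hypothetical check: does the gap between the fullest and emptiest shard
// exceed the threshold? chunksPerShard maps shard name -> chunk count.
function needsBalancing(chunksPerShard) {
  const counts = Object.values(chunksPerShard);
  const total = counts.reduce((sum, n) => sum + n, 0);
  const gap = Math.max(...counts) - Math.min(...counts);
  return gap > migrationThreshold(total);
}

console.log(needsBalancing({ shard0000: 0, shard0001: 3 })); // true: gap 3 exceeds threshold 2
console.log(needsBalancing({ shard0000: 4, shard0001: 5 })); // false: gap 1 is within threshold 2
```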
    17. Moving the chunk
        • One chunk of data is copied from Shard 1 to Shard 2
    18. Committing Migration
        • Once everyone agrees the data has moved, that chunk gets deleted from Shard 1.
    19. Cleanup
        • Other mongos have to find out about new configuration
    20. Migrations effect
        • Expensive
        • Can take a long time
        • Competes for limited resources
    21. Picking a shard key
        • Cardinality
        • Optimize routing
        • Minimize (unnecessary) traffic
        • Allow best scaling
    22. What about Object Id? ObjectId("51597ca8e28587b86528edfd")
        • Used for _id
        • 12 byte value
        • Generated by the driver if not specified
        • Theoretically globally unique
    23. What about Object Id? ObjectId("51597ca8e28587b86528edfd")
        • 12 bytes: Timestamp, MAC, PID, Counter
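To make the 12-byte layout concrete, here is a minimal sketch that splits the hex string from the slide into its four parts. It assumes the field widths used by ObjectIds of that era (4-byte timestamp, 3-byte machine identifier, 2-byte process id, 3-byte counter); `parseObjectId` is a hypothetical helper, not a driver API.

```javascript
// Hypothetical helper: split a 24-character ObjectId hex string into its fields.
// Assumed layout: 4-byte timestamp, 3-byte machine id, 2-byte pid, 3-byte counter.
function parseObjectId(hex) {
  if (!/^[0-9a-fA-F]{24}$/.test(hex)) {
    throw new Error("expected 24 hex characters");
  }
  return {
    timestamp: parseInt(hex.slice(0, 8), 16), // seconds since the Unix epoch
    machine: hex.slice(8, 14),                // machine identifier, kept as hex
    pid: parseInt(hex.slice(14, 18), 16),     // process id
    counter: parseInt(hex.slice(18, 24), 16), // per-process incrementing counter
  };
}

const parts = parseObjectId("51597ca8e28587b86528edfd");
console.log(parts);
console.log(new Date(parts.timestamp * 1000).toISOString()); // creation time in UTC
```

Because the leading bytes are a timestamp, newly generated ObjectIds sort after older ones, which is why the following slides treat `_id` as an incrementing shard key.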
    24. Sharding on ObjectId
        // enabling sharding on test database
        mongos> sh.enableSharding("test")
        { "ok" : 1 }
        // sharding the test collection
        mongos> sh.shardCollection("test.test",{_id:1})
        { "collectionsharded" : "test.test", "ok" : 1 }
        // create a loop inserting data
        mongos> for (x=0; x<10000; x++) {
        ... db.test.insert({value:x})
        ... }
    25. ObjectId chunk distribution
        shards:
          { "_id" : "shard0000", "host" : "localhost:30000" }
          { "_id" : "shard0001", "host" : "localhost:30001" }
        databases:
          { "_id" : "test", "partitioned" : true, "primary" : "shard0001" }
          test.test
            shard key: { "_id" : 1 }
            chunks:
              shard0001 3
              { "_id" : { "$minKey" : 1 } } -->> { "_id" : ObjectId("...") } on : shard0001 { "t" : 1000, "i" : 1 }
              { "_id" : ObjectId("...") } -->> { "_id" : { "$maxKey" : 1 } } on : shard0001 { "t" : 1000, "i" : 2 }
    26. ObjectId gives a hot shard (diagram: chunks minKey to 0 and 0 to maxKey)
    27. Sharding on incremental values like timestamp is not optimum for even distribution
    28. Hashed Shard Keys
    29. Hashed Shard Keys
        {x:2} md5 c81e728d9d4c2f636f067f89cc14862c
        {x:3} md5 eccbc87e4b5ce2fe28308fd9f2a7baf3
        {x:1} md5 c4ca4238a0b923820dcc509a6f75849b
    30. Hashed shard key eliminates hot shards (diagram: chunks minKey to 0 and 0 to maxKey)
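The contrast between an incrementing key and a hashed key can be simulated outside MongoDB. The sketch below uses Node's built-in `crypto` module and a two-chunk layout split at 0, as in the diagrams. It is an illustration under stated assumptions: `hash64` hashes the decimal string of the value (which reproduces the MD5 digests shown on slide 29) and reads the first 8 bytes big-endian, whereas the server hashes the BSON element, so the numbers here will not match real hashed index values.

```javascript
const crypto = require("crypto");

// MD5 digest of the value's decimal string, as hex (matches the slide 29 digests).
function md5Hex(value) {
  return crypto.createHash("md5").update(String(value)).digest("hex");
}

// First 64 bits of that digest as a signed BigInt.
// Assumption: string input and big-endian byte order are illustrative choices only.
function hash64(value) {
  const digest = crypto.createHash("md5").update(String(value)).digest();
  return digest.readBigInt64BE(0);
}

// Two chunks split at 0: [MinKey, 0) on Shard1 and [0, MaxKey) on Shard2.
function shardFor(key) {
  return key < 0 ? "Shard1" : "Shard2";
}

function distribution(keys) {
  const counts = { Shard1: 0, Shard2: 0 };
  for (const key of keys) counts[shardFor(key)] += 1;
  return counts;
}

// 10,000 steadily increasing keys, similar to timestamps or ObjectIds.
const increasing = Array.from({ length: 10000 }, (_, i) => i + 1);

const rangeCounts = distribution(increasing);              // every insert hits Shard2
const hashedCounts = distribution(increasing.map(hash64)); // roughly half on each shard

console.log("range-based:", rangeCounts);
console.log("hash-based: ", hashedCounts);
```

With the raw key, all writes land in the chunk that ends at MaxKey. With the hashed key, consecutive values scatter across the whole 64-bit range, so both shards take writes.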
    31. Under the hood
        • Create a hashed index used for sharding
        • Uses the first 64 bits of the md5 hash of the field
        • Uses existing hash index, or creates a new one on a collection
        • Hash both data and BSON type
        • Represented as a NumberLong in the JS shell
    32. Hash on simple or embedded BSON values
        // hash on 1 as an integer
        > db.runCommand({_hashBSONElement:1})
        {
          "key" : 1,
          "seed" : 0,
          "out" : NumberLong("5902408780260971510"),
          "ok" : 1
        }
        // hash on "1" as a string
        > db.runCommand({_hashBSONElement:"1"})
        {
          "key" : "1",
          "seed" : 0,
          "out" : NumberLong("-2448670538483119681"),
          "ok" : 1
        }
    33. Enabling hashed indexes
        • Create index
          – db.collection.ensureIndex( {field : "hashed"} )
        • Options
          – Seed, specify a different seed to use
          – hashVersion, at the moment only version 0 (md5).
    34. Using hash shard keys
        • Enable sharding on collection
          – sh.shardCollection("test.collection", {field: "hashed"})
        • Options
          – numInitialChunks, specifies the number of initial chunks per shard. Default is two chunks per shard (use "sh._adminCommand" to specify options)
    35. Sharding on hashed ObjectId
        // enabling sharding on test database
        mongos> sh.enableSharding("test")
        { "ok" : 1 }
        // shard by hashed _id field
        mongos> sh.shardCollection("test.hash",{_id:"hashed"})
        { "collectionsharded" : "test.hash", "ok" : 1 }
    36. Pre-splitting the data
        databases:
          { "_id" : "test", "partitioned" : true, "primary" : "shard0001" }
          test.hash
            shard key: { "_id" : "hashed" }
            chunks:
              shard0000 2
              shard0001 2
              { "_id" : { "$minKey" : 1 } } -->> { "_id" : NumberLong("-4611686018427387902") } on : shard0000 { "t" : 2000, "i" : 2 }
              { "_id" : NumberLong("-4611686018427387902") } -->> { "_id" : NumberLong(0) } on : shard0000 { "t" : 2000, "i" : 3 }
              { "_id" : NumberLong(0) } -->> { "_id" : NumberLong("4611686018427387902") } on : shard0001 { "t" : 2000, "i" : 4 }
              { "_id" : NumberLong("4611686018427387902") } -->> { "_id" : { "$maxKey" : 1 } } on : shard0001 { "t" : 2000, "i" : 5 }
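The pre-split boundaries on slide 36 divide the signed 64-bit hash space into equal ranges. The sketch below reproduces those three boundaries for four chunks using BigInt arithmetic. The formula (integer-divide the maximum 64-bit value by the chunk count, then double it) is an assumption chosen because it yields exactly the values shown on the slide; `initialSplitPoints` is a hypothetical name, and only even chunk counts are handled.

```javascript
const INT64_MAX = 9223372036854775807n;

// Hypothetical helper: evenly spaced split points across the signed 64-bit range.
// Assumption: interval = floor(INT64_MAX / numChunks) * 2, mirrored around 0.
function initialSplitPoints(numChunks) {
  if (!Number.isInteger(numChunks) || numChunks < 2 || numChunks % 2 !== 0) {
    throw new RangeError("this sketch only handles an even chunk count >= 2");
  }
  const interval = (INT64_MAX / BigInt(numChunks)) * 2n; // BigInt division truncates
  const points = [0n];
  for (let i = 1n; i < BigInt(numChunks) / 2n; i++) {
    points.push(interval * i, -interval * i);
  }
  return points.sort((a, b) => (a < b ? -1 : a > b ? 1 : 0));
}

// Two shards with the default of two initial chunks per shard gives four chunks.
console.log(initialSplitPoints(4).map(String));
// [ '-4611686018427387902', '0', '4611686018427387902' ]
```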
    37. Inserting into hashed shard key collection
        // create a loop inserting data
        mongos> for (x=0; x<10000; x++) {
        ... db.hash.insert({value:x})
        ... }
    38. Even distribution of chunks
        test.hash
          shard key: { "_id" : "hashed" }
          chunks:
            shard0000 4
            shard0001 4
            { "_id" : { "$minKey" : 1 } } -->> { "_id" : NumberLong("-7374407069602479355") } on : shard0000 { "t" : 2000, "i" : 8 }
            { "_id" : NumberLong("-7374407069602479355") } -->> { "_id" : NumberLong("-4611686018427387902") } on : shard0000 { "t" : 2000, "i" : 9 }
            { "_id" : NumberLong("-4611686018427387902") } -->> { "_id" : NumberLong("-2456929743513174890") } on : shard0000 { "t" : 2000, "i" : 6 }
            { "_id" : NumberLong("-2456929743513174890") } -->> { "_id" : NumberLong(0) } on : shard0000 { "t" : 2000, "i" : 7 }
            { "_id" : NumberLong(0) } -->> { "_id" : NumberLong("1483539935376971743") } on : shard0001 { "t" : 2000, "i" : 12 }
    39. Routing Requests
    40. Cluster Request Routing
        • Targeted Queries
        • Scatter Gather Queries
        • Scatter Gather Queries with Sort
    41. Cluster Request Routing: Targeted Query
    42. Routable request received
    43. Request routed to appropriate shard
    44. Shard returns results
    45. Mongos returns results to client
    46. Cluster Request Routing: Non-Targeted Query
    47. Non-Targeted Request Received
    48. Request sent to all shards
    49. Shards return results to mongos
    50. Mongos returns results to client
    51. Cluster Request Routing: Non-Targeted Query with Sort
    52. Non-Targeted request with sort received
    53. Request sent to all shards
    54. Query and sort performed locally
    55. Shards return results to mongos
    56. Mongos merges sorted results
    57. Mongos returns results to client
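The merge step in slides 54 to 56 is a k-way merge: each shard has already sorted its own results, so the router only has to pick the smallest head element repeatedly. The sketch below works on in-memory arrays for simplicity. `mergeSorted` is a hypothetical function, and a real router would pull batches from each shard lazily (as the speaker notes describe with getMore) rather than hold complete result sets.

```javascript
// Merge result lists that are each already sorted according to `compare`.
function mergeSorted(shardResults, compare) {
  const positions = shardResults.map(() => 0); // next unread index per shard
  const merged = [];
  for (;;) {
    let best = -1;
    for (let s = 0; s < shardResults.length; s++) {
      if (positions[s] >= shardResults[s].length) continue; // this shard is exhausted
      if (
        best === -1 ||
        compare(shardResults[s][positions[s]], shardResults[best][positions[best]]) < 0
      ) {
        best = s;
      }
    }
    if (best === -1) return merged; // every shard is exhausted
    merged.push(shardResults[best][positions[best]]);
    positions[best] += 1;
  }
}

// Each shard returns its own documents sorted by x.
const fromShard1 = [{ x: 1 }, { x: 4 }, { x: 9 }];
const fromShard2 = [{ x: 2 }, { x: 3 }, { x: 10 }];

const sorted = mergeSorted([fromShard1, fromShard2], (a, b) => a.x - b.x);
console.log(sorted.map((doc) => doc.x)); // [ 1, 2, 3, 4, 9, 10 ]
```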
    58. Hash keys are great for equality queries
        • Equality queries directed to a specific shard
        • Will use the index
        • Most efficient query possible
    59. Explain plan of an equality query
        mongos> db.hash.find({x:1}).explain()
        {
          "cursor" : "BtreeCursor x_hashed",
          "n" : 1,
          "nscanned" : 1,
          "nscannedObjects" : 1,
          "millisShardTotal" : 0,
          "numQueries" : 1,
          "numShards" : 1,
          "indexBounds" : {
            "x" : [
              [ NumberLong("5902408780260971510"), NumberLong("5902408780260971510") ]
            ]
          },
          "millis" : 0
        }
    60. But not so good for a range query
        • Range queries scatter gather
        • Won’t use index
    61. Explain plan of a range query
        mongos> db.hash.find({x:{$gt:1, $lt:99}}).explain()
        {
          "cursor" : "BasicCursor",
          "n" : 97,
          "nChunkSkips" : 0,
          "nYields" : 0,
          "nscanned" : 1000,
          "nscannedAllPlans" : 1000,
          "nscannedObjects" : 1000,
          "nscannedObjectsAllPlans" : 1000,
          "millisShardTotal" : 0,
          "millisShardAvg" : 0,
          "numQueries" : 2,
          "numShards" : 2,
          "millis" : 3
        }
    62. Other limitations
        • Cannot use a compound key
        • Key cannot have an array value
        • Tag-aware sharding
          – Only makes sense to assign the full hashed shard key collection to particular shards
          – By design, there’s no real way to know or control what data is in what range
        • Key with poor cardinality is going to give a hash with poor cardinality
          – Floating point numbers are squashed. E.g. 100.4 will be hashed as 100
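The last limitation can be shown with a few values. The sketch assumes the fractional part is dropped before hashing, which is consistent with the slide's example of 100.4 being hashed as 100; whether the server truncates or rounds is not stated in the deck, so the sample values are chosen to give the same result either way. `hashedKeyInput` is a hypothetical helper.

```javascript
// Hypothetical model of the value that actually gets hashed.
// Assumption: numbers lose their fractional part before hashing.
function hashedKeyInput(value) {
  return typeof value === "number" ? Math.trunc(value) : value;
}

const shardKeyValues = [100, 100.1, 100.4, 101.2];
const distinctInputs = new Set(shardKeyValues.map(hashedKeyInput));

console.log(shardKeyValues.length); // 4 distinct shard key values
console.log(distinctInputs.size);   // 2 distinct hash inputs, so at most 2 distinct hashes
```

Values that collapse to the same hash input always land in the same chunk, so a field made of closely spaced floats has lower effective cardinality as a hashed shard key than it appears to have.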
    63. Summary
        • Range-based Sharding
          – Most efficient for applications that operate on ranges
          – Requires careful shard key selection
        • Hash-based Sharding
          – Uniform writes
          – No routed range queries
        • Tag Aware Sharding
          – That’s another talk!
    64. Resources
        • Tutorial: Sharding (laptop friendly)
        • Tutorial: Converting a Replica Set to a Replicated Sharded Cluster (laptop friendly)
        • Manual: Select a Shard Key
        • Manual: Hash-based Sharding
        • Manual: Tag Aware Sharding
        • Manual: Strategies for Bulk Inserts in Sharded Clusters
        • Manual: Manage Sharded Cluster Balancer
    65. Questions?
    66. Thank You. James Kerr, Senior Solutions Architect, 10gen