
Webinar: MongoDB 2.4 Feature Demo and Q&A on Hash-based Sharding


In version 2.4, MongoDB introduces hash-based sharding, allowing the user to shard based on a randomized shard key to spread documents evenly across a cluster. Hash-based sharding is an alternative to range-based sharding, making it easier to manage your growing cluster. In this talk, we'll provide an overview of this new feature and discuss the pros and cons of a hash-based vs. range-based approach.

Speaker Notes
  • BIO: I live and work in the Washington D.C. area and focus on supporting the MongoDB community and delivering solutions using MongoDB to the Federal government. I've spent the last 7 years or so working with NoSQL databases. I've worked in a variety of industries including precise timing systems, military command and control, digital mapping, e-commerce, distributed computing platforms, search, system integration, and Big Data applications.
  • Remind everyone what a sharded cluster is. We will take a close look at how sharded clusters work and at the new hashed shard key feature of 2.4.
  • Isolating queries (to a few shards). Scatter-gather (high latency, but not bad). Hash keys.
  • Min value included; max value not included.
  • Balancer is running on mongos. Once the difference in chunks between the most dense shard and the least dense shard is above the migration threshold, a balancing round starts.
  • Moved chunk on shard2 should be gray
  • Source shard deletes moved data. Must wait for open cursors to either close or time out. NoTimeout cursors may prevent the release of the lock. Mongos releases the balancer lock after old chunks are deleted.
  • Moving data is expensive (I/O, network bandwidth). Moving many chunks takes a long time (only one chunk can be moved at a time). Balancing and migrations compete for resources with your application.
  • What’s the solution to sharding on incremental values as a shard key?
  • mongod --setParameter=enableTestCommands=1
  • Seed and hashVersion are undocumented at this point. Let people know we are at least thinking about these things (especially the ability to change the hash algorithm).
  • When sharding a new collection using Hash-based shard keys, MongoDB will take care of the presplitting for you. Similarly sized ranges of the Hash-based key are distributed to each existing shard, which means that no initial balancing is needed (unless of course new shards are added).
  • Only happens on new collections
  • Query contains the shard key field
  • The mongos does not have to load the whole set into memory since each shard sorts locally. The mongos can just getMore from the shards as needed and incrementally return the results to the client.
  • Uses the hashed index
  • Assuming only a hashed index on “x”
  • Tag-aware note: it doesn't usually make a lot of sense to tag anything other than the full hashed shard key collection to particular shards; by design, there's no real way to know or control what data is in what range. Since the chunk ranges are based on the randomized hash of the shard key rather than the shard key itself, tagging is usually only useful for assigning the whole range to a specific set of shards.

Transcript

  • 1. 2.4 Sharding Features. James Kerr, Senior Solutions Architect, 10gen
  • 2. Agenda: Mechanics of sharding (key space, chunks, balancing); Types of requests; Hashed shard keys (why use them, how to enable them, limitations)
  • 3. Sharded Cluster
  • 4. Sharding your data
  • 5. What is a Shard Key? The shard key is used to partition your collection. The shard key must exist in every document. The shard key is immutable. Shard key values are immutable. The shard key must be indexed. The shard key is used to route requests to shards.
  • 6. The key space: {x: 10} {x: -5} {x: -9} {x: 7} {x: 6} {x: 0}
  • 7. Inserting data: {x: 0} {x: 6} {x: 7} {x: -5} {x: 10} {x: -9}
  • 8. Inserting data: {x: 0} {x: 6} {x: 7} {x: -5} {x: 10} {x: -9}
  • 9. Chunk range and size: {x: 0} {x: 6} {x: 7} {x: -5} {x: 10} {x: -9}
  • 10. Inserting further data: {x: 0} {x: 6} {x: 7} {x: -5} {x: 10} {x: -9} {x: 9} {x: -7} {x: 3}
  • 11. Chunk splitting: {x: 0} {x: 6} {x: 7} {x: -5} {x: 10} {x: -9}, split at 0. A chunk is split once it exceeds the maximum size. There is no split point if all documents have the same shard key. A chunk split is a logical operation (no data is moved). If a split creates too large a discrepancy in chunk count across the cluster, a balancing round starts.
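    Splits normally happen automatically as inserts grow a chunk past the maximum size, but the same operation can be triggered by hand. A minimal mongo shell sketch, assuming the test.test collection used in the demo later in this deck:

        // split the chunk containing {x: 7} at its median point
        sh.splitFind("test.test", {x: 7})
        // or split a chunk at an exact shard key value
        sh.splitAt("test.test", {x: 0})
        // both are metadata-only operations; no documents are moved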
  • 12. Data distribution: MinKey to 0 lives on Shard1; 0 to MaxKey lives on Shard2; mongos routes queries appropriately.
  • 13. Mongos routes data: minKey → 0, 0 → maxKey; db.test.insert({ x: -1000 })
  • 14. Mongos routes data: minKey → 0, 0 → maxKey; db.test.insert({ x: -1000 })
  • 15. Unbalanced shards: minKey → 0, 0 → maxKey
  • 16. Balancing: the migration threshold depends on the number of chunks. Fewer than 20 chunks: migration threshold of 2; 21-80 chunks: threshold of 4; more than 80 chunks: threshold of 8.
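    The balancer itself can be inspected from any mongos with the standard shell helpers; a quick sketch:

        // is the balancer enabled at all?
        sh.getBalancerState()
        // is a balancing round running right now?
        sh.isBalancerRunning()
        // cluster overview, including chunk counts per shard
        sh.status()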
  • 17. Moving the chunk• One chunk of data is copied from Shard 1 to Shard 2
  • 18. Committing Migration: Once everyone agrees the data has moved, that chunk gets deleted from Shard 1.
  • 19. Cleanup• Other mongos have to find out about new configuration
  • 20. Migration effects: expensive; can take a long time; competes for limited resources.
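    Because migrations compete with application traffic, a common mitigation is to confine balancing to off-peak hours. A hedged sketch using the balancer's documented activeWindow setting (the times shown are placeholders):

        // run against a mongos
        use config
        // only allow migrations between 11 PM and 6 AM server time
        db.settings.update(
            { _id: "balancer" },
            { $set: { activeWindow: { start: "23:00", stop: "6:00" } } },
            true  // upsert in case the settings document does not exist yet
        )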
  • 21. Picking a shard key: cardinality; optimize routing; minimize (unnecessary) traffic; allow best scaling.
  • 22. What about ObjectId? ObjectId("51597ca8e28587b86528edfd"): used for _id; a 12-byte value; generated by the driver if not specified; theoretically globally unique.
  • 23. What about ObjectId? The 12 bytes of ObjectId("51597ca8e28587b86528edfd") break down into a timestamp, MAC address, PID, and counter.
  • 24. Sharding on ObjectId:
        // enabling sharding on the test database
        mongos> sh.enableSharding("test")
        { "ok" : 1 }
        // sharding the test collection
        mongos> sh.shardCollection("test.test", {_id: 1})
        { "collectionsharded" : "test.test", "ok" : 1 }
        // create a loop inserting data
        mongos> for (x = 0; x < 10000; x++) {
        ...     db.test.insert({value: x})
        ... }
  • 25. ObjectId chunk distribution:
        shards:
          { "_id" : "shard0000", "host" : "localhost:30000" }
          { "_id" : "shard0001", "host" : "localhost:30001" }
        databases:
          { "_id" : "test", "partitioned" : true, "primary" : "shard0001" }
          test.test
            shard key: { "_id" : 1 }
            chunks:
              shard0001  3
            { "_id" : { "$minKey" : 1 } } -->> { "_id" : ObjectId("...") } on : shard0001 { "t" : 1000, "i" : 1 }
            { "_id" : ObjectId("...") } -->> { "_id" : { "$maxKey" : 1 } } on : shard0001 { "t" : 1000, "i" : 2 }
  • 26. ObjectId gives a hot shard: minKey → 0, 0 → maxKey
  • 27. Sharding on incremental values like a timestamp is not optimal for even distribution.
  • 28. Hashed Shard Keys
  • 29. Hashed Shard Keys: {x: 2} → md5 → c81e728d9d4c2f636f067f89cc14862c; {x: 3} → md5 → eccbc87e4b5ce2fe28308fd9f2a7baf3; {x: 1} → md5 → c4ca4238a0b923820dcc509a6f75849b
  • 30. Hashed shard keys eliminate hot shards: minKey → 0, 0 → maxKey
  • 31. Under the hood: Creates a hashed index used for sharding. Uses the first 64 bits of the md5 hash of the field. Uses an existing hashed index, or creates a new one on the collection. Hashes both the data and the BSON type. Represented as a NumberLong in the JS shell.
  • 32. Hash on simple or embedded BSON values:
        // hash on 1 as an integer
        > db.runCommand({_hashBSONElement: 1})
        {
          "key" : 1,
          "seed" : 0,
          "out" : NumberLong("5902408780260971510"),
          "ok" : 1
        }
        // hash on "1" as a string
        > db.runCommand({_hashBSONElement: "1"})
        {
          "key" : "1",
          "seed" : 0,
          "out" : NumberLong("-2448670538483119681"),
          "ok" : 1
        }
  • 33. Enabling hashed indexes: Create the index with db.collection.ensureIndex({field: "hashed"}). Options: seed, to specify a different seed to use; hashVersion, at the moment only version 0 (md5).
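    For illustration, a small sketch of creating such an index and confirming it exists, assuming a collection named test.hash:

        // create a hashed index on field x
        mongos> db.hash.ensureIndex({x: "hashed"})
        // the new index appears alongside the default _id index
        mongos> db.hash.getIndexes()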
  • 34. Using hashed shard keys: Enable sharding on the collection with sh.shardCollection("test.collection", {field: "hashed"}). Options: numInitialChunks specifies the number of initial chunks per shard; the default is two chunks per shard (use sh._adminCommand to specify options).
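    A minimal sketch of passing numInitialChunks through the underlying shardCollection admin command (the collection name and chunk count are placeholders):

        // run against a mongos; assumes sharding is already enabled on "test"
        db.adminCommand({
            shardCollection: "test.collection",
            key: { field: "hashed" },
            numInitialChunks: 8  // ask for 8 initial chunks up front
        })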
  • 35. Sharding on hashed ObjectId:
        // enabling sharding on the test database
        mongos> sh.enableSharding("test")
        { "ok" : 1 }
        // shard by hashed _id field
        mongos> sh.shardCollection("test.hash", {_id: "hashed"})
        { "collectionsharded" : "test.hash", "ok" : 1 }
  • 36. Pre-splitting the data:
        databases:
          { "_id" : "test", "partitioned" : true, "primary" : "shard0001" }
          test.hash
            shard key: { "_id" : "hashed" }
            chunks:
              shard0000  2
              shard0001  2
            { "_id" : { "$minKey" : 1 } } -->> { "_id" : NumberLong("-4611686018427387902") } on : shard0000 { "t" : 2000, "i" : 2 }
            { "_id" : NumberLong("-4611686018427387902") } -->> { "_id" : NumberLong(0) } on : shard0000 { "t" : 2000, "i" : 3 }
            { "_id" : NumberLong(0) } -->> { "_id" : NumberLong("4611686018427387902") } on : shard0001 { "t" : 2000, "i" : 4 }
            { "_id" : NumberLong("4611686018427387902") } -->> { "_id" : { "$maxKey" : 1 } } on : shard0001 { "t" : 2000, "i" : 5 }
  • 37. Inserting into a hashed shard key collection:
        // create a loop inserting data
        mongos> for (x = 0; x < 10000; x++) {
        ...     db.hash.insert({value: x})
        ... }
  • 38. Even distribution of chunks:
        test.hash
          shard key: { "_id" : "hashed" }
          chunks:
            shard0000  4
            shard0001  4
          { "_id" : { "$minKey" : 1 } } -->> { "_id" : NumberLong("-7374407069602479355") } on : shard0000 { "t" : 2000, "i" : 8 }
          { "_id" : NumberLong("-7374407069602479355") } -->> { "_id" : NumberLong("-4611686018427387902") } on : shard0000 { "t" : 2000, "i" : 9 }
          { "_id" : NumberLong("-4611686018427387902") } -->> { "_id" : NumberLong("-2456929743513174890") } on : shard0000 { "t" : 2000, "i" : 6 }
          { "_id" : NumberLong("-2456929743513174890") } -->> { "_id" : NumberLong(0) } on : shard0000 { "t" : 2000, "i" : 7 }
          { "_id" : NumberLong(0) } -->> { "_id" : NumberLong("1483539935376971743") } on : shard0001 { "t" : 2000, "i" : 12 }
  • 39. Routing Requests
  • 40. Cluster Request Routing: targeted queries; scatter-gather queries; scatter-gather queries with sort.
  • 41. Cluster Request Routing: Targeted Query
  • 42. Routable request received
  • 43. Request routed to appropriate shard
  • 44. Shard returns results
  • 45. Mongos returns results to client
  • 46. Cluster Request Routing: Non-Targeted Query
  • 47. Non-Targeted Request Received
  • 48. Request sent to all shards
  • 49. Shards return results to mongos
  • 50. Mongos returns results to client
  • 51. Cluster Request Routing: Non-Targeted Query with Sort
  • 52. Non-targeted request with sort received
  • 53. Request sent to all shards
  • 54. Query and sort performed locally
  • 55. Shards return results to mongos
  • 56. Mongos merges sorted results
  • 57. Mongos returns results to client
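    As a concrete illustration, a non-targeted query with a sort against the demo collection might look like this; each shard sorts its own results and mongos merge-sorts the streams (the value field comes from the insert loop on slide 37):

        // no shard key in the predicate, so this is sent to all shards
        mongos> db.hash.find({}).sort({value: 1}).limit(10)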
  • 58. Hash keys are great for equality queries: equality queries are directed to a specific shard; will use the index; the most efficient query possible.
  • 59. Explain plan of an equality query:
        mongos> db.hash.find({x: 1}).explain()
        {
          "cursor" : "BtreeCursor x_hashed",
          "n" : 1,
          "nscanned" : 1,
          "nscannedObjects" : 1,
          "millisShardTotal" : 0,
          "numQueries" : 1,
          "numShards" : 1,
          "indexBounds" : {
            "x" : [
              [
                NumberLong("5902408780260971510"),
                NumberLong("5902408780260971510")
              ]
            ]
          },
          "millis" : 0
        }
  • 60. But not so good for a range query: range queries scatter-gather; won't use the index.
  • 61. Explain plan of a range query:
        mongos> db.hash.find({x: {$gt: 1, $lt: 99}}).explain()
        {
          "cursor" : "BasicCursor",
          "n" : 97,
          "nChunkSkips" : 0,
          "nYields" : 0,
          "nscanned" : 1000,
          "nscannedAllPlans" : 1000,
          "nscannedObjects" : 1000,
          "nscannedObjectsAllPlans" : 1000,
          "millisShardTotal" : 0,
          "millisShardAvg" : 0,
          "numQueries" : 2,
          "numShards" : 2,
          "millis" : 3
        }
  • 62. Other limitations: Cannot use a compound key. The key cannot have an array value. Tag-aware sharding: it only makes sense to assign the full hashed shard key collection to particular shards; by design, there's no real way to know or control what data is in what range. A key with poor cardinality is going to give a hash with poor cardinality. Floating point numbers are squashed, e.g. 100.4 will be hashed as 100.
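    The floating point squashing can be seen directly with the _hashBSONElement test command from earlier (requires mongod to be started with --setParameter=enableTestCommands=1; exact hash values omitted here):

        // hash the integer 100
        > db.runCommand({_hashBSONElement: 100}).out
        // 100.4 is truncated to 100 before hashing,
        // so this should return the same NumberLong
        > db.runCommand({_hashBSONElement: 100.4}).out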
  • 63. Summary: Range-based sharding is most efficient for applications that operate on ranges, but requires careful shard key selection. Hash-based sharding gives uniform writes, but no routed range queries. Tag-aware sharding: that's another talk!
  • 64. Resources: Tutorial: Sharding (laptop friendly); Tutorial: Converting a Replica Set to a Replicated Sharded Cluster (laptop friendly); Manual: Select a Shard Key; Manual: Hash-based Sharding; Manual: Tag-Aware Sharding; Manual: Strategies for Bulk Inserts in Sharded Clusters; Manual: Manage Sharded Cluster Balancer
  • 65. Questions?
  • 66. Thank You. James Kerr, Senior Solutions Architect, 10gen