MongoDB San Francisco 2013: Hash-based Sharding in MongoDB 2.4 presented by Brandon Black, 10gen
 


In version 2.4, MongoDB introduces hash-based sharding, a new option for distributing data in sharded collections. Hash-based sharding and range-based sharding present different advantages for MongoDB users deploying large scale systems. In this talk, we'll provide an overview of this new feature and discuss when to use hash-based sharding or range-based sharding.

  • Remind everyone what a sharded cluster is. We will take a close look at how sharded clusters work and at the new hashed shard key feature of 2.4.
  • Isolating queries (to a few shards). Scatter-gather (high latency, but not bad). Hash keys.
  • Min value included. Max value not included.
  • The balancer runs on mongos. Once the difference in chunk count between the most dense shard and the least dense shard is above the migration threshold, a balancing round starts.
  • Moved chunk on shard2 should be gray.
  • The source shard deletes the moved data. It must wait for open cursors to either close or time out. NoTimeout cursors may prevent the release of the lock. Mongos releases the balancer lock after old chunks are deleted.
  • Moving data is expensive (I/O, network bandwidth). Moving many chunks takes a long time (only one chunk can move at a time). Balancing and migrations compete for resources with your application.
  • The mongos does not have to load the whole set into memory since each shard sorts locally. The mongos can just getMore from the shards as needed and incrementally return the results to the client.
  • What’s the solution to sharding on incremental values as a shard key?
  • Uses the hashed index.
  • Range based: best. Hash based: uniform writes but not routed range queries. Tag aware.

Presentation Transcript

  • Hash-Based Sharding in MongoDB 2.4. Brandon Black, Software Engineer, 10gen (@brandonmblack). #MongoDBDays
  • Agenda
    • Mechanics of Sharding
      – Key space
      – Chunks
      – Balancing
    • Request Routing
    • Hashed Shard Keys
      – Why use hashed shard keys
      – How to enable hashed shard keys
      – Limitations
  • Sharded Cluster
  • Sharding Your Data
  • What Is A Shard Key?
    • Shard key is used to partition your collection
    • Shard key must exist in every document
    • Shard key is immutable
    • Shard key values are immutable
    • Shard key must be indexed
    • Shard key is used to route requests to shards
  • The Key Space: {x: 10} {x: -5} {x: -9} {x: 7} {x: 6} {x: 0}
  • Inserting Data: {x: 0} {x: 6} {x: 7} {x: -5} {x: 10} {x: -9}
  • Chunk Range and Size: {x: 0} {x: 6} {x: 7} {x: -5} {x: 10} {x: -9}
  • Inserting Further Data: {x: 0} {x: 6} {x: 7} {x: -5} {x: 10} {x: -9} {x: 9} {x: -7} {x: 3}
  • Chunk Splitting: {x: 0} {x: 6} {x: 7} {x: -5} {x: 10} {x: -9}
    • A chunk is split once it exceeds the maximum size
    • There is no split point if all documents have the same shard key
    • Chunk split is a logical operation (no data is moved)
    • If a split creates too large a discrepancy in chunk count across the cluster, a balancing round starts
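The splitting rules above can be sketched in plain JavaScript. This is only an illustration: `chooseSplitKey` and `splitChunk` are hypothetical helpers, and the median-based choice is an assumption made for the example (the slide only says a chunk splits once it exceeds the maximum size). `-Infinity` and `Infinity` stand in for MinKey and MaxKey.

```javascript
// Hypothetical helper: pick a split key from the shard key values in a chunk.
// Returns null when every document has the same key (no split point exists).
function chooseSplitKey(keys) {
  const sorted = [...keys].sort((a, b) => a - b);
  const lowest = sorted[0];
  // Start near the median and move right until the key differs from the
  // lowest one, so that both resulting chunks are non-empty.
  for (let i = Math.floor((sorted.length - 1) / 2); i < sorted.length; i++) {
    if (sorted[i] > lowest) return sorted[i];
  }
  return null;
}

// Splitting is a metadata change: one range becomes two, no documents move.
function splitChunk(chunk, keys) {
  const splitKey = chooseSplitKey(keys);
  if (splitKey === null) return [chunk];
  return [
    { min: chunk.min, max: splitKey, shard: chunk.shard },
    { min: splitKey, max: chunk.max, shard: chunk.shard },
  ];
}

const whole = { min: -Infinity, max: Infinity, shard: "Shard1" };
console.log(splitChunk(whole, [0, 6, 7, -5, 10, -9]));
console.log(splitChunk(whole, [3, 3, 3])); // unchanged: no split point
```

Both halves stay on the original shard after the split; moving one of them is the balancer's job, not the split's.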
  • Data Distribution
    • MinKey to 0 lives on Shard1
    • 0 to MaxKey lives on Shard2
    • Mongos routes queries appropriately
  • Mongos Routes Data: chunk ranges [minKey, 0) and [0, maxKey); db.test.insert({ x: -1000 })
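A minimal sketch of the routing decision, assuming the two-chunk layout from the slides. The chunk table and `routeKey` are made up for illustration and are not mongos internals; `-Infinity` and `Infinity` stand in for MinKey and MaxKey.

```javascript
// Assumed chunk table mirroring the slides:
// [MinKey, 0) lives on Shard1 and [0, MaxKey) lives on Shard2.
const chunks = [
  { min: -Infinity, max: 0, shard: "Shard1" },
  { min: 0, max: Infinity, shard: "Shard2" },
];

// Find the shard that owns a key: min is inclusive, max is exclusive.
function routeKey(chunkTable, x) {
  const chunk = chunkTable.find((c) => x >= c.min && x < c.max);
  return chunk ? chunk.shard : null;
}

console.log(routeKey(chunks, -1000)); // Shard1
console.log(routeKey(chunks, 0));     // Shard2 (0 is the inclusive lower bound)
```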
  • Unbalanced Shards: chunk ranges [minKey, 0) and [0, maxKey)
  • Balancing
    • Migration threshold
    • Fewer than 20 chunks: migration threshold of 2
    • 21-80 chunks: migration threshold of 4
    • More than 80 chunks: migration threshold of 8
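The threshold tiers can be written as a small lookup. This is a sketch, not the balancer's code: `migrationThreshold` and `needsBalancing` are hypothetical names, the slide leaves a count of exactly 20 unassigned (it is placed in the middle tier here), and treating "reaches the threshold" as the trigger is also an assumption.

```javascript
// Tiers as listed on the slide. Assumption: exactly 20 chunks falls in the
// middle tier, since the slide does not assign it.
function migrationThreshold(totalChunks) {
  if (totalChunks < 20) return 2;
  if (totalChunks <= 80) return 4;
  return 8;
}

// chunksPerShard maps a shard name to its chunk count.
function needsBalancing(chunksPerShard) {
  const counts = Object.values(chunksPerShard);
  const total = counts.reduce((sum, n) => sum + n, 0);
  const spread = Math.max(...counts) - Math.min(...counts);
  // Assumption: a spread that reaches the threshold starts a balancing round.
  return spread >= migrationThreshold(total);
}

console.log(needsBalancing({ shard0000: 10, shard0001: 2 })); // true
console.log(needsBalancing({ shard0000: 5, shard0001: 5 }));  // false
```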
  • Moving the chunk
    • One chunk of data is copied from Shard 1 to Shard 2
  • Committing Migration
    • Once everyone agrees the data has moved, that chunk gets deleted from Shard 1
  • Cleanup
    • Other mongos have to find out about new configuration
  • Effects of Migrations
    • Expensive
    • Can take a long time
    • Competes for limited resources
  • Picking A Shard Key
    • Cardinality
    • Optimize routing
    • Minimize (unnecessary) traffic
    • Allow best scaling
  • Routing Requests
  • Cluster Request Routing
    • Targeted Queries
    • Scatter Gather Queries
    • Scatter Gather Queries with Sort
  • Cluster Request Routing: Targeted Query
  • Routable Request Received
  • Request routed to appropriate shard
  • Shard returns results
  • Mongos returns results to client
  • Cluster Request Routing: Non-Targeted Query
  • Non-Targeted Request Received
  • Request sent to all shards
  • Shards return results to mongos
  • Mongos returns results to client
  • Cluster Request Routing: Non-Targeted Query with Sort
  • Non-Targeted request with sort received
  • Request sent to all shards
  • Query and sort performed locally
  • Shards return results to mongos
  • Mongos merges sorted results
  • Mongos returns results to client
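The merge step in the sorted scatter-gather flow can be sketched as a k-way merge over per-shard results that are already sorted. `mergeSorted` is a hypothetical helper written for this illustration; a real mongos pulls batches lazily with getMore rather than holding complete arrays.

```javascript
// Merge already-sorted per-shard result arrays into one sorted array.
// shardResults: array of arrays; compare: standard comparator function.
function mergeSorted(shardResults, compare) {
  const positions = shardResults.map(() => 0); // read cursor per shard
  const merged = [];
  for (;;) {
    let best = -1;
    for (let s = 0; s < shardResults.length; s++) {
      if (positions[s] >= shardResults[s].length) continue; // shard exhausted
      if (
        best === -1 ||
        compare(shardResults[s][positions[s]], shardResults[best][positions[best]]) < 0
      ) {
        best = s;
      }
    }
    if (best === -1) return merged; // every shard is exhausted
    merged.push(shardResults[best][positions[best]++]);
  }
}

const fromShards = [[1, 4, 7], [2, 5], [3, 6, 8]];
console.log(mergeSorted(fromShards, (a, b) => a - b)); // [1, 2, 3, 4, 5, 6, 7, 8]
```

Because each shard sorts locally, the router only ever compares the current head of each shard's stream, which is why it does not need the whole result set in memory.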
  • What About ObjectId? ObjectId("51597ca8e28587b86528edfd")
    • Used for _id
    • 12 byte value
    • Generated by the driver if not specified
    • Theoretically globally unique
  • What About ObjectId? ObjectId("51597ca8e28587b86528edfd")
    • 12 bytes: Timestamp, MAC, PID, Counter
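To see why ObjectId values increase over time, the example value from the slide can be taken apart. This sketch assumes the layout used by drivers of that era: 4 bytes of timestamp, 3 bytes of machine identifier, 2 bytes of process ID, and 3 bytes of counter. `parseObjectId` is a hypothetical helper, not a driver API.

```javascript
// Split a 24-character hex ObjectId string into its four fields.
// Assumed layout: 4-byte timestamp, 3-byte machine id, 2-byte pid, 3-byte counter.
function parseObjectId(hex) {
  if (!/^[0-9a-f]{24}$/i.test(hex)) {
    throw new Error("expected 24 hex characters");
  }
  return {
    timestamp: parseInt(hex.slice(0, 8), 16), // seconds since the Unix epoch
    machine: hex.slice(8, 14),
    pid: parseInt(hex.slice(14, 18), 16),
    counter: parseInt(hex.slice(18, 24), 16),
  };
}

const parts = parseObjectId("51597ca8e28587b86528edfd");
console.log(parts.timestamp);                             // 1364819112
console.log(new Date(parts.timestamp * 1000).toISOString());
```

The leading timestamp is what makes consecutive ObjectIds sort close together, which matters for the range-sharded example that follows in the deck.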
  • Sharding on ObjectId
      // enabling sharding on test database
      mongos> sh.enableSharding("test")
      { "ok" : 1 }
      // sharding the test collection
      mongos> sh.shardCollection("test.test", {_id:1})
      { "collectionsharded" : "test.test", "ok" : 1 }
      // create a loop inserting data
      mongos> for (x=0; x<10000; x++) {
      ...   db.test.insert({value:x})
      ... }
  • ObjectId Chunk Distribution
      shards:
        { "_id" : "shard0000", "host" : "localhost:30000" }
        { "_id" : "shard0001", "host" : "localhost:30001" }
      databases:
        { "_id" : "test", "partitioned" : true, "primary" : "shard0001" }
          test.test
            shard key: { "_id" : 1 }
            chunks:
              shard0001 3
              { "_id" : { "$minKey" : 1 } } -->> { "_id" : ObjectId("...") } on : shard0001 { "t" : 1000, "i" : 1 }
              { "_id" : ObjectId("...") } -->> { "_id" : { "$maxKey" : 1 } } on : shard0001 { "t" : 1000, "i" : 2 }
  • ObjectId Results In A “Hot Shard”: chunk ranges [minKey, 0) and [0, maxKey)
  • Sharding on incremental values like timestamp is not optimum for even distribution
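A small simulation shows the hot-shard effect under range-based chunks. The chunk boundaries, shard names, and `countInsertsPerShard` are assumptions made for this sketch; the point is only that every key larger than the last split point lands in the final chunk.

```javascript
// Count how many inserts each shard receives under range-based chunks.
// Chunk ranges are [min, max): min inclusive, max exclusive.
function countInsertsPerShard(chunks, keys) {
  const counts = {};
  for (const c of chunks) counts[c.shard] = 0;
  for (const key of keys) {
    const owner = chunks.find((c) => key >= c.min && key < c.max);
    counts[owner.shard] += 1;
  }
  return counts;
}

// Assumed layout: the last split happened at key 1000.
const chunks = [
  { min: -Infinity, max: 1000, shard: "shard0000" },
  { min: 1000, max: Infinity, shard: "shard0001" },
];

// Monotonically increasing keys, like timestamps or ObjectIds.
const increasingKeys = Array.from({ length: 1000 }, (_, i) => 1000 + i);
console.log(countInsertsPerShard(chunks, increasingKeys));
// { shard0000: 0, shard0001: 1000 }
```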
  • Hashed Shard Keys
  • Hashed Shard Keys
    • {x:2} md5 c81e728d9d4c2f636f067f89cc14862c
    • {x:3} md5 eccbc87e4b5ce2fe28308fd9f2a7baf3
    • {x:1} md5 c4ca4238a0b923820dcc509a6f75849b
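The digests on this slide match the md5 of the decimal strings "1", "2", and "3", which can be checked with Node's built-in `crypto` module. This only reproduces the slide's illustration: `md5Hex` is a helper written for this sketch, and MongoDB itself hashes the BSON element rather than a string.

```javascript
const crypto = require("crypto");

// md5 of a string, as a lowercase hex digest.
function md5Hex(text) {
  return crypto.createHash("md5").update(text).digest("hex");
}

// Adjacent inputs produce digests with no visible ordering relationship.
for (const key of ["1", "2", "3"]) {
  console.log(key, md5Hex(key));
}
```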
  • Hashed Shard Key Eliminates “Hot Shard”: chunk ranges [minKey, 0) and [0, maxKey)
  • Under the Hood
    • Create a hashed index used for sharding
    • Uses the first 64 bits of the md5 hash of the field
    • Hash both data and BSON type
    • Represented as a NumberLong in the shell
  • Hash on both data and BSON type
      // hash on 1 as an integer
      > db.runCommand({_hashBSONElement:1})
      {
        "key" : 1,
        "seed" : 0,
        "out" : NumberLong("5902408780260971510"),
        "ok" : 1
      }
      // hash on "1" as a string
      > db.runCommand({_hashBSONElement:"1"})
      {
        "key" : "1",
        "seed" : 0,
        "out" : NumberLong("-2448670538483119681"),
        "ok" : 1
      }
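The idea of hashing the type together with the value can be imitated outside the server. This is a loose sketch under stated assumptions: the `typeof` tag and string encoding are invented for the example, the byte order used to read the first 64 bits is assumed, and the results will not match `_hashBSONElement` output.

```javascript
const crypto = require("crypto");

// Hypothetical encoding: a type tag plus the value's text. MongoDB hashes the
// BSON element instead, so these numbers differ from the server's.
function hashWithType(value) {
  const tagged = `${typeof value}:${String(value)}`;
  const digest = crypto.createHash("md5").update(tagged).digest();
  // First 64 bits of the digest as a signed integer (byte order assumed).
  return digest.readBigInt64LE(0);
}

console.log(hashWithType(1));   // hash of the number 1
console.log(hashWithType("1")); // a different hash for the string "1"
```

Because the type is part of the hashed input, a document with `x: 1` and one with `x: "1"` are treated as different keys.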
  • Enabling Hashed Indexes
    • Create index:
      db.collection.ensureIndex({field : "hashed"})
  • Using Hash Shard Keys
    • Enable sharding on collection:
      sh.shardCollection("test.collection", {field: "hashed"})
  • Sharding on Hashed ObjectId
      // enabling sharding on test database
      mongos> sh.enableSharding("test")
      { "ok" : 1 }
      // shard by hashed _id field
      mongos> sh.shardCollection("test.hash", {_id:"hashed"})
      { "collectionsharded" : "test.hash", "ok" : 1 }
  • Pre-Splitting the Data
      databases:
        { "_id" : "test", "partitioned" : true, "primary" : "shard0001" }
          test.hash
            shard key: { "_id" : "hashed" }
            chunks:
              shard0000 2
              shard0001 2
              { "_id" : { "$minKey" : 1 } } -->> { "_id" : NumberLong("-4611686018427387902") } on : shard0000 { "t" : 2000, "i" : 2 }
              { "_id" : NumberLong("-4611686018427387902") } -->> { "_id" : NumberLong(0) } on : shard0000 { "t" : 2000, "i" : 3 }
              { "_id" : NumberLong(0) } -->> { "_id" : NumberLong("4611686018427387902") } on : shard0001 { "t" : 2000, "i" : 4 }
              { "_id" : NumberLong("4611686018427387902") } -->> { "_id" : { "$maxKey" : 1 } } on : shard0001 { "t" : 2000, "i" : 5 }
  • Inserting Into Hashed Shard Key Collection
      // create a loop inserting data
      mongos> for (x=0; x<10000; x++) {
      ...   db.hash.insert({value:x})
      ... }
  • Even Distribution of Chunks
      test.hash
        shard key: { "_id" : "hashed" }
        chunks:
          shard0000 4
          shard0001 4
          { "_id" : { "$minKey" : 1 } } -->> { "_id" : NumberLong("-7374407069602479355") } on : shard0000 { "t" : 2000, "i" : 8 }
          { "_id" : NumberLong("-7374407069602479355") } -->> { "_id" : NumberLong("-4611686018427387902") } on : shard0000 { "t" : 2000, "i" : 9 }
          { "_id" : NumberLong("-4611686018427387902") } -->> { "_id" : NumberLong("-2456929743513174890") } on : shard0000 { "t" : 2000, "i" : 6 }
          { "_id" : NumberLong("-2456929743513174890") } -->> { "_id" : NumberLong(0) } on : shard0000 { "t" : 2000, "i" : 7 }
          { "_id" : NumberLong(0) } -->> { "_id" : NumberLong("1483539935376971743") } on : shard0001 { "t" : 2000, "i" : 12 }
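The even spread can be approximated offline. The sketch below hashes 10,000 sequential keys with md5, reads the first 64 bits as a signed integer, and counts how many land in each of the four pre-split ranges shown earlier in the deck. The string encoding of the key and the byte order are assumptions, so individual hash values differ from MongoDB's, but the distribution behaves the same way.

```javascript
const crypto = require("crypto");

// Split points from the pre-split slide, as BigInt values.
const splitPoints = [-4611686018427387902n, 0n, 4611686018427387902n];

// First 64 bits of md5 over the key's decimal string (encoding assumed).
function hash64(key) {
  return crypto.createHash("md5").update(String(key)).digest().readBigInt64LE(0);
}

// Index of the chunk whose [min, max) range contains the hash.
function chunkIndex(hash) {
  let i = 0;
  while (i < splitPoints.length && hash >= splitPoints[i]) i++;
  return i;
}

// Count documents per chunk for sequential keys 0..n-1.
function distribution(n) {
  const counts = [0, 0, 0, 0];
  for (let x = 0; x < n; x++) counts[chunkIndex(hash64(x))]++;
  return counts;
}

console.log(distribution(10000)); // four counts, each close to 2500
```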
  • Hash Keys Are Great for Equality Queries
    • Equality queries directed to a specific shard
    • Will use the index
    • Most efficient query possible
  • Explain Plan of an Equality Query
      mongos> db.hash.find({x:1}).explain()
      {
        "cursor" : "BtreeCursor x_hashed",
        "n" : 1,
        "nscanned" : 1,
        "nscannedObjects" : 1,
        "millisShardTotal" : 0,
        "numQueries" : 1,
        "numShards" : 1,
        "indexBounds" : {
          "x" : [
            [ NumberLong("5902408780260971510"), NumberLong("5902408780260971510") ]
          ]
        },
        "millis" : 0
      }
  • Not So Good for a Range Query
    • Range queries scatter gather
    • Don’t use the index
    • Inefficient query
  • Explain Plan of a Range Query
      mongos> db.hash.find({x:{$gt:1, $lt:99}}).explain()
      {
        "cursor" : "BasicCursor",
        "n" : 97,
        "nChunkSkips" : 0,
        "nYields" : 0,
        "nscanned" : 1000,
        "nscannedAllPlans" : 1000,
        "nscannedObjects" : 1000,
        "nscannedObjectsAllPlans" : 1000,
        "millisShardTotal" : 0,
        "millisShardAvg" : 0,
        "numQueries" : 2,
        "numShards" : 2,
        "millis" : 3
      }
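Why a range on the original value cannot be targeted under a hashed key can be shown by checking where each matching value would live. This sketch assumes two shards split at 0 and an md5-based hash of the decimal string; the router names and `shardsTouched` are hypothetical. A real mongos does not enumerate values: it broadcasts because a range on `x` does not map to a contiguous range of hashes.

```javascript
const crypto = require("crypto");

// First 64 bits of md5 over the key's decimal string (encoding assumed).
function hash64(key) {
  return crypto.createHash("md5").update(String(key)).digest().readBigInt64LE(0);
}

// Range-based layout: [MinKey, 0) on shard0000, [0, MaxKey) on shard0001.
function rangeShard(x) {
  return x < 0 ? "shard0000" : "shard0001";
}

// Hashed layout: the same split point, applied to the hash of x.
function hashedShard(x) {
  return hash64(x) < 0n ? "shard0000" : "shard0001";
}

// Set of shards holding integers strictly between lo and hi.
function shardsTouched(route, lo, hi) {
  const shards = new Set();
  for (let x = lo + 1; x < hi; x++) shards.add(route(x));
  return shards;
}

console.log([...shardsTouched(rangeShard, 1, 99)]);  // a single shard
console.log([...shardsTouched(hashedShard, 1, 99)]); // both shards
```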
  • Limitations
    • Cannot use a compound key
    • Key cannot have an array value
    • Incompatible with tag aware sharding
      – Tags would be assigned the value of the hash, not the value of the underlying key
    • Key with poor cardinality is going to give a hash with poor cardinality
      – Floating point numbers are squashed. E.g. 100.4 will be hashed as 100
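The floating point limitation can be illustrated with a hash that truncates to an integer before hashing. This is a sketch only: `hashNumericKey` is hypothetical, the md5-over-string encoding is an assumption, and the values do not match MongoDB's. It shows how distinct fractional keys collapse onto one hash.

```javascript
const crypto = require("crypto");

// Truncate to an integer before hashing, so 100.4 and 100 share a hash.
function hashNumericKey(value) {
  const squashed = Math.trunc(value); // 100.4 -> 100
  return crypto
    .createHash("md5")
    .update(String(squashed)) // string encoding assumed for the sketch
    .digest()
    .readBigInt64LE(0);
}

console.log(hashNumericKey(100.4) === hashNumericKey(100)); // true
console.log(hashNumericKey(100) === hashNumericKey(101));   // false
```

A field holding values such as 0.1, 0.2, and 0.9 would therefore behave like a single key, which is the poor-cardinality case the slide warns about.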
  • Summary
    • There are 3 different approaches for sharding
    • Hash shard keys give great distribution
    • Hash shard keys are good for equality
    • Pick the right shard key for your application
  • Thank You. Brandon Black, Software Engineer, 10gen (@brandonmblack). #MongoDBDays