Advanced Sharding Features in MongoDB 2.4
 

Speaker Notes

  • Remind everyone what a sharded cluster is. We will be talking about shard key considerations, hashed shard keys, and tag aware sharding.
  • Make this point: *If you need to. Not everyone needs to shard!*
  • Shard key: standard stuff. Hashed shard keys: useful for some applications that need uniform write distribution but don’t have suitable fields available to use as a shard key. Tag aware sharding: how to influence the balancer.
  • There is also a case of sorting on the shard key, which entails multiple targeted queries in order.
  • Routing targeted requests scales better than scattered requests.
  • Distributed merge sort entails additional processing on mongos to sort all results from the shards. When possible, sorting on the shard key is preferable, as it entails multiple targeted queries executed in order.
  • Shard key considerations:
    – Cardinality – Can your data be broken down enough? How many documents share the same shard key?
    – Write distribution – How well are writes and data balanced across all shards?
    – Query isolation – Query targeting to a specific shard, vs. scatter/gather.
    – Reliability – Impact on the application if a shard outage occurs (ideally, good replica set design can avoid this).
    – Index locality – How hot the index on each shard will be, and how distributed the most used indexes are across our cluster.
    A good shard key can: optimize routing, minimize (unnecessary) traffic, and allow best scaling.
  • The visuals for index locality and shard usage go hand in hand. For sharding, we want to have reads and writes distributed across the cluster to make efficient use of hardware. But for indexes, the inverse is true. We’d prefer not to access all parts of an index if we can avoid it. Just like disk access, we’d prefer to read/write sequentially, and only to a portion. This avoids having the entire index “hot”.
  • This shows good index or disk locality on a single shard.
  • This also illustrates the inverse relationship between index usage on a shard and write distribution on the shard cluster. In this case, one shard is receiving all of the writes, and the index is being used for right-balanced insertions (incremental data).
  • While hashed shard keys offer poor index locality, that may be ok for some architectures if uniform write distribution is a higher priority. Also, if the indexes can fit in memory, random access of the index may not be a concern.
  • A compound shard key utilizes an existing index we already need. For queries on a user’s inbox ordered by time, if that user’s data spans multiple shards, this is an example of ordered, targeted queries.
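
    A minimal sketch of such an inbox query, assuming the { user: 1, time: 1 } shard key from the email example (collection name illustrative):

    // routed only to the shard(s) holding user 123; the sort runs on
    // the existing { user: 1, time: 1 } index rather than in memory
    db.emails.find({ user: 123 }).sort({ time: -1 })
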
  • Hashed shard keys are not for everyone, but they will generally be a better option for folks that don’t have suitable shard keys in their schema, or those that were manually hashing their shard keys. Since hashed values are 8 bytes (64-bit portion of the md5), the index will also be smaller than a typical, ordered ObjectId index.
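
    As a rough way to see the size difference, a sketch (collection name illustrative; indexSizes reports bytes per index):

    // sketch: compare the hashed index, which stores 8-byte
    // NumberLongs, against a regular ObjectId index
    db.emails.stats().indexSizes
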
  • _hashBSONElement requires the server to be started with test commands enabled (--setParameter enableTestCommands=1) to work.
  • Note: the “2 per shard” default is a special computed value. If any value is specified for numInitialChunks, it will be divided by the total number of shards to determine how many chunks to create on each shard. The shardCollection helper method doesn’t seem to support this option, so users will have to use the raw command to specify it.
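
    A minimal sketch of the raw command form (namespace and chunk count illustrative):

    // sketch: pass numInitialChunks through the shardCollection
    // command directly, since sh.shardCollection() doesn't expose it
    mongos> db.adminCommand({
    ...     shardCollection: "test.hash",
    ...     key: { _id: "hashed" },
    ...     numInitialChunks: 8
    ... })
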
  • Illustrates default pre-splitting behavior. Each shard for this new collection has two chunks from the get-go.
  • Uses the hashed index
  • A supplemental, ordered index is a regular, non-hashed index that starts with the shard key. Although mongos will have to scatter the query across all shards, the shards themselves will be able to use the ordered index on the shard key field instead of a BasicCursor. The take-away is that a hashed index is simply not usable for range queries; thankfully, MongoDB allows an ordered index to co-exist for the same field.
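
    A minimal sketch of maintaining both indexes side by side (field name illustrative):

    // sketch: the hashed index backs the shard key, while a regular
    // ascending index on the same field serves range queries
    db.collection.ensureIndex({ x: "hashed" })
    db.collection.ensureIndex({ x: 1 })
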
  • Emphasize the last point, since users cannot change their shard key after the fact. Additionally, if the application is new and users don’t fully understand their query patterns, they cannot make an informed decision on what would make a suitable shard key.
  • An application has a global user base and we’d like to optimize read/write latency for those users. This will also reduce overall network traffic. We’re not discussing replica set distribution for disaster recovery; that is a separate topic.
  • Having all of our writeable databases/shards in a single region is a bottleneck.
  • Ideally, we’d like to have writeable shards in each region, to service users (and application servers) in that region.
  • Many-to-many relationship between shards and tags.
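
    A minimal sketch of that many-to-many relationship (shard names and tags illustrative):

    // sketch: one shard carries two tags, and one tag spans two shards
    mongos> sh.addShardTag("shard0001", "APAC")
    mongos> sh.addShardTag("shard0001", "EMEA")
    mongos> sh.addShardTag("shard0002", "APAC")
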
  • Configuring the range may be unintuitive for a fixed string value. The lower bound is inclusive and the upper bound is exclusive. For this example, we have to concoct a meaningless country code for the upper bound, because it is the next logical value above “aus”. APAC stands for “Asia Pacific” (for Australia in this case).
  • For controlling collection distribution, Kristina suggested creating tag ranges for the entire shard key range and assigning it to a particular shard. Although we technically enabled sharding for such collections, all of their data will reside on a single shard. This technique is used to get around the default behavior to place non-sharded collections on the primary shard for a database.
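
    A minimal sketch of Kristina’s technique (names illustrative): tag one shard, then assign the collection’s entire shard key range to that tag.

    // sketch: pin a (technically sharded) collection to a single shard
    mongos> sh.addShardTag("shard0000", "PINNED")
    mongos> sh.addTagRange("test.pinned",
    ...     { _id: MinKey }, { _id: MaxKey },
    ...     "PINNED")
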

Presentation Transcript

  • Advanced Sharding Features in MongoDB 2.4. Jeremy Mikola (jmikola), Software Engineer, 10gen. #MongoDBDays
  • Sharded cluster
  • Sharding is a powerful way to scale your database…
  • MongoDB 2.4 adds some new features to get more out of it.
  • Agenda
    • Shard keys
      – Desired properties
      – Evaluating shard key choices
    • Hashed shard keys
      – Why and how to use hashed shard keys
      – Limitations
    • Tag-aware sharding
      – How it works
      – Use case examples
  • Shard Keys
  • What is a shard key?
    • Incorporates one or more fields
    • Used to partition your collection
    • Must be indexed and exist in every document
    • Definition and values are immutable
    • Used to route requests to shards
  • Cluster request routing
    • Targeted queries
    • Scatter/gather queries
    • Scatter/gather queries with sort
  • Cluster request routing: writes
    • Inserts
      – Shard key required
      – Targeted query
    • Updates and removes
      – Shard key optional for multi-document operations
      – May be targeted or scattered
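
    As a sketch of the write-side difference, assuming a collection sharded on { user: 1 } (names illustrative):

    // targeted: the multi-update includes the shard key, so mongos
    // routes it to the shard(s) owning user 123
    db.emails.update({ user: 123, read: false },
                     { $set: { read: true } }, false, true)

    // scattered: no shard key in the query, so every shard is hit
    db.emails.update({ read: false },
                     { $set: { read: true } }, false, true)
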
  • Cluster request routing: reads
    • Queries
      – With shard key: targeted
      – Without shard key: scatter/gather
    • Sorted queries
      – With shard key: targeted in order
      – Without shard key: distributed merge sort
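
    The read side under the same assumed { user: 1 } shard key, as a sketch:

    // targeted: shard key in the predicate
    db.emails.find({ user: 123 })

    // scatter/gather: no shard key, results merged on mongos
    db.emails.find({ subject: "hello" })

    // distributed merge sort: sorted without the shard key, so mongos
    // merges the shards' individually sorted partial results
    db.emails.find({ subject: "hello" }).sort({ time: 1 })
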
  • Cluster request routing: targeted query
  • Routable request received
  • Request routed to appropriate shard
  • Shard returns results
  • Mongos returns results to client
  • Cluster request routing: scattered query
  • Non-targeted request received
  • Request sent to all shards
  • Shards return results to mongos
  • Mongos returns results to client
  • Distributed merge sort
  • Shard key considerations
    • Cardinality
    • Write distribution
    • Query isolation
    • Reliability
    • Index locality
  • Request distribution and index locality (diagrams: mongos routing requests across Shard 1, Shard 2, Shard 3)
  • Example: email storage
    • Most common scenario, can be applied to 90% of cases
    • Each document can be up to 16MB
    • Each user may have GBs of storage
    • Most common query: get user emails sorted by time
    • Indexes on {_id}, {user, time}, {recipients}

    {
      _id: ObjectId(),
      user: 123,
      time: Date(),
      subject: "…",
      recipients: [],
      body: "…",
      attachments: []
    }
  • Example: email storage: evaluating candidate shard keys (_id, hash(_id), user, {user, time}) against cardinality, write scaling, query isolation, reliability, and index locality
  • ObjectId composition: ObjectId("51597ca8e28587b86528edfd") is 12 bytes: Timestamp | Host | PID | Counter
  • Sharding on ObjectId

    // enable sharding on test database
    mongos> sh.enableSharding("test")
    { "ok" : 1 }

    // shard the test collection
    mongos> sh.shardCollection("test.test", { _id: 1 })
    { "collectionsharded" : "test.test", "ok" : 1 }

    // insert many documents in a loop
    mongos> for (x=0; x<10000; x++) db.test.insert({ value: x });
  • Uneven chunk distribution

    shards:
      { "_id" : "shard0000", "host" : "localhost:30000" }
      { "_id" : "shard0001", "host" : "localhost:30001" }
    databases:
      { "_id" : "test", "partitioned" : true, "primary" : "shard0001" }
      test.test
        shard key: { "_id" : 1 }
        chunks:
          shard0001  2
          { "_id" : { "$minKey" : 1 } } -->> { "_id" : ObjectId("…") }
            on : shard0001 { "t" : 1000, "i" : 1 }
          { "_id" : ObjectId("…") } -->> { "_id" : { "$maxKey" : 1 } }
            on : shard0001 { "t" : 1000, "i" : 2 }
  • Incremental values lead to a hot shard (diagram: chunks minKey → 0 and 0 → maxKey, with new writes all landing in the top chunk)
  • Example: email storage (evaluation matrix, filled in one key per slide)

    Key         Cardinality  Write scaling  Query isolation  Reliability           Index locality
    _id         Doc level    One shard      Scatter/gather   All users affected    Good
    hash(_id)   Hash level   All shards     Scatter/gather   All users affected    Poor
    user        Many docs    All shards     Targeted         Some users affected   Good
    user, time  Doc level    All shards     Targeted         Some users affected   Good
  • Hashed Shard Keys
  • Why is this relevant?
    • Documents may not already have a suitable value
    • Hashing allows us to utilize an existing field
    • More efficient index storage
      – At the expense of locality
  • Hashed shard keys

    {x:2} → md5 → c81e728d9d4c2f636f067f89cc14862c
    {x:3} → md5 → eccbc87e4b5ce2fe28308fd9f2a7baf3
    {x:1} → md5 → c4ca4238a0b923820dcc509a6f75849b
  • Hashed shard keys avoid a hot shard (diagram: writes spread evenly across the minKey → maxKey range)
  • Under the hood
    • Create a hashed index for use with sharding
    • Contains first 64 bits of a field’s md5 hash
    • Considers BSON type and value
    • Represented as NumberLong in the JS shell
  • Hashing BSON elements

    // hash on 1 as an integer
    > db.runCommand({ _hashBSONElement: 1 })
    {
      "key" : 1,
      "seed" : 0,
      "out" : NumberLong("5902408780260971510"),
      "ok" : 1
    }

    // hash on "1" as a string
    > db.runCommand({ _hashBSONElement: "1" })
    {
      "key" : "1",
      "seed" : 0,
      "out" : NumberLong("-2448670538483119681"),
      "ok" : 1
    }
  • Using hashed indexes
    • Create index:
      – db.collection.ensureIndex({ field : "hashed" })
    • Options:
      – seed: specify a hash seed to use (default: 0)
      – hashVersion: currently supports only version 0 (md5)
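
    A minimal sketch with the options spelled out (collection and field names illustrative; assumes the options are passed at index creation):

    // sketch: hashed index using the default seed of 0
    db.collection.ensureIndex({ field: "hashed" }, { seed: 0 })
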
  • Using hashed shard keys
    • Enable sharding on collection:
      – sh.shardCollection("test.collection", { field: "hashed" })
    • Options:
      – numInitialChunks: chunks to create (default: 2 per shard)
  • Sharding on hashed ObjectId

    // enable sharding on test database
    mongos> sh.enableSharding("test")
    { "ok" : 1 }

    // shard by hashed _id field
    mongos> sh.shardCollection("test.hash", { _id: "hashed" })
    { "collectionsharded" : "test.hash", "ok" : 1 }
  • Pre-splitting the data

    databases:
      { "_id" : "test", "partitioned" : true, "primary" : "shard0001" }
      test.hash
        shard key: { "_id" : "hashed" }
        chunks:
          shard0000  2
          shard0001  2
          { "_id" : { "$minKey" : 1 } } -->> { "_id" : NumberLong("-4611...") }
            on : shard0000 { "t" : 2000, "i" : 2 }
          { "_id" : NumberLong("-4611...") } -->> { "_id" : NumberLong(0) }
            on : shard0000 { "t" : 2000, "i" : 3 }
          { "_id" : NumberLong(0) } -->> { "_id" : NumberLong("4611...") }
            on : shard0001 { "t" : 2000, "i" : 4 }
          { "_id" : NumberLong("4611...") } -->> { "_id" : { "$maxKey" : 1 } }
            on : shard0001 { "t" : 2000, "i" : 5 }
  • Even chunk distribution after insertions

    test.hash
      shard key: { "_id" : "hashed" }
      chunks:
        shard0000  4
        shard0001  4
        { "_id" : { "$minKey" : 1 } } -->> { "_id" : NumberLong("-7374...") }
          on : shard0000 { "t" : 2000, "i" : 8 }
        { "_id" : NumberLong("-7374...") } -->> { "_id" : NumberLong("-4611...") }
          on : shard0000 { "t" : 2000, "i" : 9 }
        { "_id" : NumberLong("-4611...") } -->> { "_id" : NumberLong("-2456...") }
          on : shard0000 { "t" : 2000, "i" : 6 }
        { "_id" : NumberLong("-2456...") } -->> { "_id" : NumberLong(0) }
          on : shard0000 { "t" : 2000, "i" : 7 }
        { "_id" : NumberLong(0) } -->> { "_id" : NumberLong("1483...") }
          on : shard0001 { "t" : 2000, "i" : 12 }
  • Hashed keys are great for equality queries
    • Equality queries routed to a specific shard
    • Will make use of the hashed index
    • Most efficient query possible
  • Explain plan of an equality query

    mongos> db.hash.find({ x: 1 }).explain()
    {
      "cursor" : "BtreeCursor x_hashed",
      "n" : 1,
      "nscanned" : 1,
      "nscannedObjects" : 1,
      "numQueries" : 1,
      "numShards" : 1,
      "indexBounds" : {
        "x" : [
          [
            NumberLong("5902408780260971510"),
            NumberLong("5902408780260971510")
          ]
        ]
      },
      "millis" : 0
    }
  • But not so good for range queries
    • Range queries will be scatter/gather
    • Cannot utilize a hashed index
      – Supplemental, ordered index may be used at the shard level
    • Inefficient query pattern
  • Explain plan of a range query

    mongos> db.hash.find({ x: { $gt: 1, $lt: 99 }}).explain()
    {
      "cursor" : "BasicCursor",
      "n" : 97,
      "nscanned" : 1000,
      "nscannedObjects" : 1000,
      "numQueries" : 2,
      "numShards" : 2,
      "millis" : 3
    }
  • Other limitations of hashed indexes
    • Cannot be used in compound or unique indexes
    • No support for multi-key indexes (i.e. array values)
    • Incompatible with tag aware sharding
      – Tags would be assigned hashed values, not the original key
    • Will not overcome keys with poor cardinality
      – Floating point numbers are truncated before hashing
  • Summary
    • There are multiple approaches for sharding
    • Hashed shard keys give great distribution
    • Hashed shard keys are good for equality queries
    • Pick a shard key that best suits your application
  • Tag Aware Sharding
  • Global scenario
  • Single database
  • Optimal architecture
  • Tag aware sharding
    • Associate shard key ranges with specific shards
    • Shards may have multiple tags, and vice versa
    • Dictates behavior of the balancer process
    • No relation to replica set member tags
  • Configuring tag aware sharding

    // tag a shard
    mongos> sh.addShardTag("shard0001", "APAC")

    // shard by country code and user ID
    mongos> sh.shardCollection("test.tas", { c: 1, uid: 1 })
    { "collectionsharded" : "test.tas", "ok" : 1 }

    // tag a shard key range
    mongos> sh.addTagRange("test.tas",
    ...     { c: "aus", uid: MinKey },
    ...     { c: "aut", uid: MaxKey },
    ...     "APAC"
    ... )
  • Use cases for tag aware sharding
    • Operational and/or location-based separation
    • Legal requirements for data storage
    • Reducing latency of geographical requests
    • Cost of overseas network bandwidth
    • Controlling collection distribution
      – http://www.kchodorow.com/blog/2012/07/25/controlling-collection-distribution/
  • Other Changes in 2.4
  • Other changes in 2.4
    • Make secondaryThrottle the default
      – https://jira.mongodb.org/browse/SERVER-7779
    • Faster migration of empty chunks
      – https://jira.mongodb.org/browse/SERVER-3602
    • Specify chunk by bounds for moveChunk
      – https://jira.mongodb.org/browse/SERVER-7674
    • Read preferences for commands
      – https://jira.mongodb.org/browse/SERVER-7423
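
    As a sketch of the moveChunk change (SERVER-7674), a chunk can now be identified by its exact bounds, which suits hashed shard keys; the namespace, bounds, and target shard below are illustrative:

    // sketch: specify the chunk by its lower/upper bounds instead of
    // a point contained within it (bounds must match an actual chunk)
    mongos> db.adminCommand({
    ...     moveChunk: "test.hash",
    ...     bounds: [ { _id: MinKey }, { _id: NumberLong(0) } ],
    ...     to: "shard0001"
    ... })
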
  • Questions?
  • Thank You. Jeremy Mikola (jmikola), Software Engineer, 10gen. #MongoDBDays