MongoDB Hacks of Frustration

Presentation Transcript

  • MongoDB Hacks of Frustration: Foursquare Hacks for a Better Mongo. MongoNYC, June 21, 2013. Leo Kim, Software Engineer, Foursquare
  • Agenda
    – About Foursquare
    – Vital stats
    – Our hacks/tools
    – Questions
  • What is Foursquare? Foursquare helps you explore the world around you. Meet up with friends, discover new places, and save money using your phone.
  • Big stats
    – 35,000,000+ people
    – 4,000,000,000+ check-ins
    – 55,000,000+ points of interest
    – 1,300,000+ merchants
  • Moar stats
    – 5MM-6MM check-ins a day
    – ~4K-5K qps against our API, ~150K-300K qps against Mongo
    – 11 clusters, 8 sharded + 3 replica sets (Mongo 2.2.{3,4})
    – ~4TB of data and indexes, all of it kept in memory (24-core Intel, 192GB RAM, 4 120GB SSD drives)
    – Extensive use of sharding + replica set ReadPreference.secondary reads
  • Mongo has scaled with us: We have been using Mongo for three years. Mongo has enabled us to scale our service along with our growing user base. It has given us the flexibility and agility to innovate in an exciting space where there is still much to figure out.
  • Still, some things to deal with
    – Monitoring: MMS is good, but we could always use more stats to narrow down on pain points
    – General maintenance: a constant struggle with data fragmentation
    – Sharding: no load-based balancing (SERVER-2472); overhead of all-shards queries; bugs can leave duplicate records on the wrong shards
  • Monitoring hack: “ODash”
  • Monitoring hack: “Mongolyzer”
  • Monitoring hack: “Telemetry”
  • Data size and fragmentation
    – Problem: even with bare metal and SSDs, fragmentation can degrade performance by growing the data size beyond available memory
    – Can also be an issue with autobalancing, as chunk moves induce increased paging and further degrade I/O
    – We have enough replicas (~400) that we need to deal with this regularly
  • Alerts!
  • Hack: “Mackinac”
    – (Mostly) automated repair script
    – “Kill file” mongod: drains queries gracefully from mongod, stops mongod, resyncs from the primary
    – We considered running compact(), but it doesn't reclaim disk space. May revisit this, though.
  • Hack: “Mackinac”
  • Hack: Shard key checker
    – Checks for shard key usage in the app
    – Loads shard keys from the mongo config servers, matches keys against a given {query, document}
    – e.g. db.users.find({ _id : 12345 }) // shard key match!
      db.users.find({ twitter: "" }) // shard key miss!
    – Why use this?
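The matching step of such a checker might look like the following Python sketch. This is an illustrative reconstruction, not Foursquare's actual (JVM-based) code: a query can be routed to a single shard only if it constrains every shard-key field with a plain equality value, while missing fields or `$`-operator range documents force a scatter-gather.

```python
def matches_shard_key(shard_key_fields, query):
    """Return True if `query` pins every shard-key field to an equality value."""
    for field in shard_key_fields:
        value = query.get(field)
        # A missing field, or an operator document like {"$gte": ...},
        # means mongos cannot target a single shard with this query.
        if value is None or (isinstance(value, dict) and
                             any(k.startswith("$") for k in value)):
            return False
    return True

# "users" is sharded on _id, mirroring the example on the slide.
assert matches_shard_key(["_id"], {"_id": 12345})       # shard key match
assert not matches_shard_key(["_id"], {"twitter": ""})  # shard key miss
```

A real checker would also handle compound shard keys and dotted field paths, but the core decision is this per-field equality test.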
  • Detect all-shards queries!
    – Problem: not all our queries use shard keys (unfortunately)
    – They use up connections, network traffic, and overhead in query processing on mongod + mongos
    – Gets worse with more shards
    – What happens if one of the replicas is not responding?
    – Solution: measure by intercepting queries with the shard key checker and counting misses
  • All-shards queries
    [2013-06-09 19:24:27,706] WARN c.f.boot.RogueShardKeyChecker - Possible all-shards query: db.venues.find({ "del" : { "$ne" : true} , "mayor" : xxxx , "closed" : { "$ne" : true}}).sort({ "mayor_count" : -1})
    [2013-06-09 19:24:28,296] WARN c.f.boot.RogueShardKeyChecker - Possible all-shards query: { "uid" : xxxx}, { "_id" : 1})
    [2013-06-09 19:24:28,326] WARN c.f.boot.RogueShardKeyChecker - Possible all-shards query: { "uid" : xxxx})
    [2013-06-09 19:24:28,696] WARN c.f.boot.RogueShardKeyChecker - Possible all-shards query: db.comments2.find({ "c.u" : xxxx})
    [2013-06-09 19:24:32,246] WARN c.f.boot.RogueShardKeyChecker - Possible all-shards query: db.user_venue_aggregations2.find({ "_id" : { "$gte" : { "u" : xxxx , "v" : { "$oid" : "000000000000000000000000"}} , "$lte" : { "u" : xxxx , "v" : { "$oid" : "000000000000000000000000"}}}})
  • All-shards queries
  • Find hot chunks!
    – Problem: the autobalancer balances by data size, but not load
    – Checkins shard key → { u : 1 }
    – Imagine a group of users who check in a bunch
    – Imagine the balancer putting all those users on the same shard, or even the same chunk
    – Solution: intercept queries with the shard key checker and bucket hits by chunk
  • Hack: Hot chunk detector
    – Need to do a little more to make this work with the shard key checker
    – Create trees of chunk ranges per collection
    – Match shard keys from queries to chunk ranges, accumulate counts
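A per-collection chunk-range lookup of that kind can be sketched in a few lines of Python. This is a simplified, hypothetical model (single-field numeric shard keys; chunk lower bounds as they would be read from the config server's chunks metadata), using a sorted list plus binary search in place of a tree:

```python
import bisect
from collections import Counter

class ChunkCounter:
    """Buckets shard-key hits by the chunk whose range contains them."""

    def __init__(self, lower_bounds):
        self.bounds = sorted(lower_bounds)  # one lower bound per chunk
        self.hits = Counter()

    def record(self, shard_key_value):
        # Find the rightmost chunk whose lower bound <= the key value.
        i = bisect.bisect_right(self.bounds, shard_key_value) - 1
        if i >= 0:
            self.hits[self.bounds[i]] += 1

# Three chunks with lower bounds 0, 1000, 2000 on a {u: 1} shard key.
counter = ChunkCounter([0, 1000, 2000])
for uid in (5, 7, 1500, 2500):
    counter.record(uid)
# counter.hits is now {0: 2, 1000: 1, 2000: 1}
```

Real MongoDB chunk bounds are BSON documents ordered by the full shard key, so a production version would compare tuples rather than plain integers, but the bisect-based bucketing is the same idea.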
  • Hack: Hot chunk detector
  • Deeper hack: Hot chunk mover
    – A standalone process reading from a JSON endpoint on the hot chunk detector
    – Identifies the hottest chunk, attempts to split it (if necessary), and moves it to the “coldest” shard
    – Subject to the same problems as regular chunk moves; can disrupt latencies
    – Currently using a hit ratio of p9999/p50 to identify the hot chunk
    – Work in progress
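The mover's selection step might look like the sketch below: if the extreme tail of the per-chunk hit distribution (p9999) dwarfs the median (p50), treat the hottest chunk as a move candidate and target the least-loaded shard. The threshold, the nearest-rank percentile, and all the names here are illustrative assumptions, not Foursquare's actual heuristic values.

```python
def percentile(sorted_vals, p):
    """Nearest-rank percentile over an already-sorted list."""
    idx = min(len(sorted_vals) - 1, int(p * len(sorted_vals)))
    return sorted_vals[idx]

def pick_hot_chunk(chunk_hits, shard_hits, ratio_threshold=10.0):
    """Return (hottest_chunk, coldest_shard) if the load skew warrants a move."""
    counts = sorted(chunk_hits.values())
    p50 = percentile(counts, 0.50)
    p9999 = percentile(counts, 0.9999)
    if p50 == 0 or p9999 / p50 < ratio_threshold:
        return None  # load is roughly even; no move needed
    hottest = max(chunk_hits, key=chunk_hits.get)
    coldest = min(shard_hits, key=shard_hits.get)
    return hottest, coldest

chunk_hits = {"chunk_a": 5, "chunk_b": 4, "chunk_c": 900}
shard_hits = {"shard0": 900, "shard1": 9}
# One chunk is 180x hotter than the median -> move chunk_c to shard1.
assert pick_hot_chunk(chunk_hits, shard_hits) == ("chunk_c", "shard1")
```

The actual split and move would then go through the normal `split` and `moveChunk` admin commands, with the latency caveats the slide mentions.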
  • Sample hot chunk detector JSON
  • Fix data integrity!
    – Problem: the application doesn't always clean up after itself properly; duplicate documents can exist on multiple shards
    – Solution: compare the document shard key from the host replica against the canonical metadata in the shard key checker (i.e. where the document should “live”)
  • Hack: “Chunksanity”
    – Simple algorithm: connect to each shard; iterate through each document in each collection; verify that the document is correctly placed according to the chunk data in the mongo config server; delete any incorrectly placed documents
    – Heavyweight process, run only periodically on specific suspicious collections
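The core placement check reduces to: look up which shard owns the chunk containing a document's shard-key value, and flag the document if it was read from any other shard. A minimal Python sketch, assuming single-field numeric shard keys and parallel lists of chunk lower bounds and owning shards (a hypothetical simplification of the config server's chunk metadata):

```python
import bisect

def owning_shard(chunk_bounds, chunk_shards, shard_key_value):
    """Return the shard that owns the chunk containing shard_key_value."""
    i = bisect.bisect_right(chunk_bounds, shard_key_value) - 1
    return chunk_shards[i]

def is_misplaced(chunk_bounds, chunk_shards, doc_key, host_shard):
    """A document is misplaced if its host shard does not own its chunk."""
    return owning_shard(chunk_bounds, chunk_shards, doc_key) != host_shard

# Two chunks: [0, 1000) owned by shard0000, [1000, max) by shard0001.
bounds = [0, 1000]
shards = ["shard0000", "shard0001"]
assert not is_misplaced(bounds, shards, 42, "shard0000")  # correct home
assert is_misplaced(bounds, shards, 42, "shard0001")      # stray duplicate
```

In the real tool this check runs inside the per-shard, per-collection scan described above, and the misplaced copies are the ones deleted.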
  • Sample Chunksanity logging
    [2013-05-29 18:23:36,128] [main] INFO c.f.m.chunksanity.MongoChunkSanity - Logging misplaced docs to localhost/production_misplaced_docs
    [2013-05-29 18:23:36,301] [main] INFO c.f.m.chunksanity.MongoChunkSanity - Verifying chunks on users at office-canary-2/
    [2013-05-29 18:23:37,971] [ForkJoinPool-1-worker-1] INFO c.f.m.chunksanity.MongoChunkSanity - Looking at collection foursquare.users on shard shard0007 using filter: { } and shardKey { "_id" : 1.0}
    [2013-05-29 18:24:06,892] [ForkJoinPool-1-worker-2] INFO c.f.m.chunksanity.MongoChunkSanity - Done with foursquare.users on shard users 200/xxxx misplaced in 28911 ms
  • Questions?