MongoDB Real-time Data Collection and Stats Generation

My talk from #mongoseattle on how I've used MongoDB for real-time data collection and stats generation. Includes basic usage of increment modifiers as well as a map/reduce example.

  1. MongoDB for real-time data collection and stats generation
     Damon Cortesi (@dacort), damon@rowfeeder.com
  2. The Past
     • Lots of data
     • 2M rows/day
     • Post-computation is slow
     • TweetStats == 1B tweets
  3. This is NOT real-time

         mysql> select count(*) from gnip_activity_2_1 WHERE created_at BETWEEN '2010-03-01' AND '2010-03-02';
         +----------+
         | count(*) |
         +----------+
         |  1480294 |
         +----------+
         1 row in set (6.32 sec)

         mysql> SELECT app,(count(*)/(select count(*) from gnip_activity_2_1 WHERE created_at BETWEEN '2010-03-01' AND '2010-03-02'))*100 AS percent, count(*) as count
             -> FROM gnip_activity_2_1
             -> WHERE created_at BETWEEN '2010-03-01' AND '2010-03-02'
             -> GROUP BY app
             -> ORDER BY count DESC
             -> LIMIT 10;
         +-------------+---------+--------+
         | app         | percent | count  |
         +-------------+---------+--------+
         | web         | 31.1618 | 461286 |
         | TweetDeck   | 12.4685 | 184570 |
         | UberTwitter |  7.3333 | 108555 |
         | twitterfeed |  7.1350 | 105619 |
         | API         |  5.3761 |  79582 |
         | Echofon     |  4.2635 |  63112 |
         | Tweetie     |  3.4734 |  51416 |
         | Seesmic     |  2.1913 |  32438 |
         | mobile web  |  1.8382 |  27211 |
         | HootSuite   |  1.7951 |  26573 |
         +-------------+---------+--------+
         10 rows in set (1 min 27.07 sec)
  4. So now what?
     • Tokyo
     • Riak
     • CouchDB
     • MongoDB
     Photo: http://www.flickr.com/photos/bob_august/4307291275/
  5. • Initially simple - Tweets -> Spreadsheet
     • “Maybe we should save this data...”
     • Real-time updates
     • Minimal stats generation
     • Writes > Reads (200x)

         > db.serverStatus()
         "opcounters" : {
             "insert" : 36687455,
             "query" : 857059,
             "update" : 189207744,
             "delete" : 0,
             "getmore" : 4176334,
             "command" : 36734580
         }
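
The "Writes > Reads (200x)" claim can be sanity-checked against the opcounters on this slide. The following is a minimal sketch, assuming "writes" means insert + update + delete and "reads" means query only. The slide does not say how the ratio was derived, and counting getmore as a read would lower the result considerably.

```javascript
// opcounters copied from the db.serverStatus() output on the slide.
const opcounters = {
  insert: 36687455,
  query: 857059,
  update: 189207744,
  delete: 0,
  getmore: 4176334,
  command: 36734580,
};

// Assumption: writes = insert + update + delete, reads = query only.
function writeReadRatio(counters) {
  const writes = counters.insert + counters.update + counters.delete;
  return writes / counters.query;
}

const ratio = writeReadRatio(opcounters);
console.log(ratio.toFixed(1)); // about 263.6 under the assumption above
```

Under that assumption the ratio lands in the same ballpark as the slide's rounded "200x".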
  6. Streaming Twitter

         curl http://stream.twitter.com/1/statuses/sample.json -u<user>:<pass> | mongoimport -c twitter_live

     Courtesy @eliothorowitz ...
  7. Slightly more complex ;)
     [Diagram labels: Twitter/Facebook/etc; Redis/Resque Queues; Process (x3); MongoDB; Save/Update/Stats]
  8. Modifier Operations
     • No query/retrieve
     • Just $set or $inc
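
The slide's own code is not part of this transcript, so here is a minimal sketch of what a "no query, just $inc" update might look like. The helper name `buildStatsUpdate`, the one-document-per-day layout, and the field names `total` and `apps` are all assumptions for illustration. The resulting pair would be passed to an upsert in the mongo shell, for example `db.stats.update(filter, update, { upsert: true })`.

```javascript
// Hypothetical helper: turns one incoming tweet into the filter and update
// documents for an upsert. Nothing is read back from the database first.
// The schema (one document per day, one counter per app) is an assumption.
function buildStatsUpdate(tweet) {
  const day = tweet.created_at.slice(0, 10); // "YYYY-MM-DD"
  // Dots are path separators in MongoDB field names, so replace them.
  const app = tweet.app.replace(/\./g, "_");
  return {
    filter: { _id: day },
    update: { $inc: { total: 1, ["apps." + app]: 1 } },
  };
}

const { filter, update } = buildStatsUpdate({
  created_at: "2010-03-01T12:34:56Z",
  app: "TweetDeck",
});
console.log(JSON.stringify(filter), JSON.stringify(update));
```

Because `$inc` creates missing fields and the upsert creates missing documents, the writer never needs to check whether a counter already exists.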
  9. Stats
  10. Other Benefits
     • New field? No multi-day ALTER statement.
     • Auto-sharding in 1.6
     • --notablescan
     • aka, don’t pull a Twitter
     • “On Monday, our users database, where we store millions of user records, got hung up running a long-running query” -- 7/21
  11. So...that earlier SQL?
     • Incrementers
     • Much Better
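
To illustrate why incrementers replace the earlier SQL, here is a sketch of reading the same "top apps by percent" report from a single pre-computed document. The document shape is hypothetical (the slide's code is not in this transcript); the counts are the first three rows of the MySQL output shown earlier in the deck.

```javascript
// Hypothetical per-day stats document, as $inc updates might leave it.
// Counts are the top three rows from the MySQL result earlier in the deck.
const dayStats = {
  _id: "2010-03-01",
  total: 1480294,
  apps: { web: 461286, TweetDeck: 184570, UberTwitter: 108555 },
};

// Rank apps by count and attach each one's share of the day's total.
function topApps(stats, limit) {
  return Object.entries(stats.apps)
    .map(([app, count]) => ({ app, count, percent: (count / stats.total) * 100 }))
    .sort((a, b) => b.count - a.count)
    .slice(0, limit);
}

const top = topApps(dayStats, 10);
for (const row of top) {
  console.log(row.app, row.percent.toFixed(4), row.count);
}
```

The report becomes a single document fetch plus a small in-memory sort, instead of two scans over a day's worth of rows.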
  12. Simple. Pre-Computed.
     • Wait, can’t I pre-compute in MySQL?
     • Mongo == Fire/forget and async
  13. But sometimes...
     • Aggregation might still be necessary
     • This works...but what if we need more?
  14. lnkby.me
     ‣ Problem statement:
       1. Aggregate stats on shortened links
       2. Top domains based on # clicks
       3. Top users driving traffic to those top domains for the past seven days
     [Slide annotations: Easy; Slightly more difficult]
  15. Solution
     • Gather stats
     • Map/Reduce!
     • Server-side JavaScript
     • Temporary collection to hold output
     • Can be written to permanent collection
     • temp collection renamed atomically
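
As a rough illustration of the map/reduce approach, the sketch below counts clicks per domain. The `map` and `reduce` functions have the shape MongoDB expects for `db.clicks.mapReduce(map, reduce, { out: "domain_counts" })`. The collection names, the `domain` field, and the `runMapReduce` helper are assumptions; the helper is only an in-memory stand-in so the example runs without a server.

```javascript
// map: runs once per click document; `this` is the document.
// Assumes each click has a `domain` field (hypothetical schema).
function map() {
  emit(this.domain, 1);
}

// reduce: sums the partial counts for one domain. It must also accept its
// own output as input, which a plain sum does.
function reduce(key, values) {
  return values.reduce((sum, v) => sum + v, 0);
}

// Tiny in-memory stand-in for the server, only so the sketch runs locally.
function runMapReduce(docs, mapFn, reduceFn) {
  const groups = new Map();
  globalThis.emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  for (const doc of docs) mapFn.call(doc);
  delete globalThis.emit;
  return [...groups].map(([key, values]) => ({ _id: key, value: reduceFn(key, values) }));
}

const clicks = [
  { domain: "example.com", user: "alice" },
  { domain: "example.org", user: "bob" },
  { domain: "example.com", user: "bob" },
];
const domainCounts = runMapReduce(clicks, map, reduce);
console.log(domainCounts);
```

The output uses the `{ _id, value }` shape that MongoDB writes to the output collection, so later steps can treat it like any other collection.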
  16. click
  17. aggregate
     Get Top 10 Domains, then...
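
The two-step idea (top domains first, then top users within those domains) can be sketched in plain JavaScript. This runs in memory and only stands in for what would presumably be a second map/reduce restricted by a query on the top domains; the slide's actual code is not in this transcript. Field names and sample data are hypothetical, and the seven-day date filter is omitted for brevity.

```javascript
// Hypothetical click documents; field names are assumptions.
const clicks = [
  { domain: "a.com", user: "alice" },
  { domain: "a.com", user: "bob" },
  { domain: "a.com", user: "alice" },
  { domain: "b.com", user: "carol" },
  { domain: "c.com", user: "bob" },
  { domain: "b.com", user: "carol" },
  { domain: "a.com", user: "alice" },
];

// Count occurrences of keyFn(doc); return [key, count] pairs, largest first.
function countBy(docs, keyFn) {
  const counts = new Map();
  for (const doc of docs) {
    const key = keyFn(doc);
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return [...counts].sort((x, y) => y[1] - x[1]);
}

// Step 1: top N domains by clicks.
function topDomains(docs, n) {
  return countBy(docs, (d) => d.domain).slice(0, n).map(([domain]) => domain);
}

// Step 2: within those domains only, rank users by clicks driven.
function topUsersFor(docs, domains) {
  const wanted = new Set(domains);
  return countBy(docs.filter((d) => wanted.has(d.domain)), (d) => d.user);
}

const domains = topDomains(clicks, 2);
const users = topUsersFor(clicks, domains);
console.log(domains, users);
```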
  18. Caveat
     • Indexes, indexes, indexes
     • Compound indexes
     • a,b,c
     • Query on a; a,b; or a,b,c
     • Sort on last field
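
The compound-index rule on this slide (an index on a,b,c serves queries on a; a,b; or a,b,c) is a prefix rule. The sketch below models only that rule; `usesIndexPrefix` is a hypothetical helper, not a MongoDB API, and it ignores sort order, range conditions, and the real query planner. Such an index would be created with something like `db.clicks.createIndex({ a: 1, b: 1, c: 1 })` (`ensureIndex` in older shells).

```javascript
// Returns true when the queried fields form a leading prefix of the
// compound index, e.g. index [a, b, c] serves {a}, {a, b} and {a, b, c}.
// Simplified model: ignores sort order, ranges, and the query planner.
function usesIndexPrefix(indexFields, queryFields) {
  const wanted = new Set(queryFields);
  if (wanted.size === 0 || wanted.size > indexFields.length) return false;
  return indexFields.slice(0, wanted.size).every((f) => wanted.has(f));
}

const index = ["a", "b", "c"];
console.log(usesIndexPrefix(index, ["a"]));      // true
console.log(usesIndexPrefix(index, ["b", "a"])); // true (field order in the query does not matter)
console.log(usesIndexPrefix(index, ["b", "c"])); // false (skips the leading field a)
```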
  19. Mongo++
     • High-volume updates -- win
     • Stats generation -- win
     • Mutable schema && json -- win
