MongoDB Real-time Data Collection and Stats Generation

My talk from #mongoseattle on how I've used MongoDB for real-time data collection and stats generation. Includes basic usage of increment modifiers as well as a map/reduce example.

  • I can't seem to convert your Keynote to PDF / view it on my Keynote program (error displayed is 'too old version')
  • Does it handle analytic queries well on about 3 GB of data with 2 GB of RAM? How does it perform in this scenario?
  • are you using mongoid orm here?

MongoDB Real-time Data Collection and Stats Generation Presentation Transcript

  • 1. MongoDB for real-time data collection and stats generation Damon Cortesi (@dacort) damon@rowfeeder.com
  • 2. The Past • Lots of data • 2M rows/day • Post-computation is slow • TweetStats == 1B tweets
  • 3. mysql> select count(*) from gnip_activity_2_1 WHERE created_at BETWEEN '2010-03-01' AND '2010-03-02';
       +----------+
       | count(*) |
       +----------+
       |  1480294 |
       +----------+
       1 row in set (6.32 sec)

       mysql> SELECT app,(count(*)/(select count(*) from gnip_activity_2_1 WHERE created_at BETWEEN '2010-03-01' AND '2010-03-02'))*100 AS percent, count(*) as count
           -> FROM gnip_activity_2_1
           -> WHERE created_at BETWEEN '2010-03-01' AND '2010-03-02'
           -> GROUP BY app
           -> ORDER BY count DESC
           -> LIMIT 10;
       +-------------+---------+--------+
       | app         | percent | count  |
       +-------------+---------+--------+
       | web         | 31.1618 | 461286 |
       | TweetDeck   | 12.4685 | 184570 |
       | UberTwitter |  7.3333 | 108555 |
       | twitterfeed |  7.1350 | 105619 |
       | API         |  5.3761 |  79582 |
       | Echofon     |  4.2635 |  63112 |
       | Tweetie     |  3.4734 |  51416 |
       | Seesmic     |  2.1913 |  32438 |
       | mobile web  |  1.8382 |  27211 |
       | HootSuite   |  1.7951 |  26573 |
       +-------------+---------+--------+
       10 rows in set (1 min 27.07 sec)

       This is NOT real-time
  • 4. So now what? Tokyo, Riak, CouchDB, MongoDB (photo: http://www.flickr.com/photos/bob_august/4307291275/)
  • 5. • Initially simple - Tweets -> Spreadsheet • “Maybe we should save this data...” • Real-time updates • Minimal stats generation • Writes > Reads (200x)
       > db.serverStatus()
       "opcounters" : { "insert" : 36687455, "query" : 857059, "update" : 189207744, "delete" : 0, "getmore" : 4176334, "command" : 36734580 }
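
As a rough check on the write-heavy claim, the opcounters shown on this slide can be turned into a ratio. This is only a sketch: the slide does not say how "writes" and "reads" were counted, so the grouping below (insert + update + delete versus query) is an assumption.

```python
# opcounters copied from the db.serverStatus() output on the slide.
opcounters = {
    "insert": 36687455,
    "query": 857059,
    "update": 189207744,
    "delete": 0,
    "getmore": 4176334,
    "command": 36734580,
}

# Assumption: "writes" = insert + update + delete, "reads" = query only.
writes = opcounters["insert"] + opcounters["update"] + opcounters["delete"]
reads = opcounters["query"]
write_read_ratio = writes / reads

print(f"writes={writes:,} reads={reads:,} ratio={write_read_ratio:.0f}x")
```

Under that grouping the ratio lands in the same ballpark as the slide's "200x". Counting getmore as a read as well would give a noticeably lower figure.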
  • 6. Streaming Twitter: curl http://stream.twitter.com/1/statuses/sample.json -u<user>:<pass> | mongoimport -c twitter_live Courtesy @eliothorowitz ...
  • 7. Slightly more complex ;) Twitter/Facebook/etc -> Redis/Resque Queues -> Process, Process, Process -> MongoDB (Save/Update/Stats)
  • 8. Modifier Operations No query/retrieve Just $set or $inc
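
The code shown on this slide is not in the transcript, so here is a minimal sketch of what $set and $inc do to a document. The client sends only the update document and the server applies it, with no read beforehand. `apply_update` and `_walk` are hypothetical helpers that imitate that behaviour on a plain dict; they are not driver calls, and the field names are made up for illustration.

```python
def apply_update(doc, update):
    """Apply a MongoDB-style update document ($set / $inc only) to a dict in place.

    Hypothetical stand-in for what the server does; not a real driver call.
    """
    for path, value in update.get("$set", {}).items():
        parent, key = _walk(doc, path)
        parent[key] = value
    for path, amount in update.get("$inc", {}).items():
        parent, key = _walk(doc, path)
        parent[key] = parent.get(key, 0) + amount  # a missing field starts at 0
    return doc


def _walk(doc, dotted_path):
    """Return (parent_dict, last_key) for a dotted path, creating sub-documents as needed."""
    *parents, last = dotted_path.split(".")
    for part in parents:
        doc = doc.setdefault(part, {})
    return doc, last


# Illustrative per-day stats document; the _id and field names are assumptions.
stats = {"_id": "2010-03-01"}
apply_update(stats, {"$inc": {"total": 1, "apps.web": 1}})
apply_update(stats, {"$inc": {"total": 1, "apps.TweetDeck": 1},
                     "$set": {"last_seen": "2010-03-01T00:00:05Z"}})
print(stats)
```

Each incoming event becomes one small update document, which is what keeps the write path cheap.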
  • 9. Stats
  • 10. Other Benefits • New field? No multi-day ALTER statement. • Auto-sharding in 1.6 • --notablescan • aka, don’t pull a Twitter • “On Monday, our users database, where we store millions of user records, got hung up running a long-running query” -- 7/21
  • 11. So...that earlier SQL? Incrementers Much Better
  • 12. Simple. Pre-Computed. Wait, can’t I pre-compute in MySQL? Mongo == Fire/forget and async
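
To make the contrast with the earlier SQL concrete, here is a sketch of the pre-computed approach under an assumed schema: one stats document per day, incremented as each tweet arrives, so the "top apps" report becomes a single document read plus arithmetic. The function names and the `total` / `apps.<name>` fields are illustrative, not taken from the talk.

```python
def stats_update_for(tweet):
    """Write side: build the (selector, update) pair for one incoming tweet."""
    day = tweet["created_at"][:10]  # "2010-03-01T12:34:56Z" -> "2010-03-01"
    selector = {"_id": day}
    update = {"$inc": {"total": 1, "apps." + tweet["app"]: 1}}
    return selector, update


def top_apps(day_doc, limit=10):
    """Read side: rank apps from one pre-computed document, no scan or GROUP BY."""
    total = day_doc["total"]
    ranked = sorted(day_doc["apps"].items(), key=lambda kv: kv[1], reverse=True)
    return [(app, round(100 * count / total, 4), count) for app, count in ranked[:limit]]


selector, update = stats_update_for({"created_at": "2010-03-01T12:34:56Z", "app": "web"})
print(selector, update)

# Counts taken from the MySQL result on slide 3 (first three apps only).
day_doc = {
    "_id": "2010-03-01",
    "total": 1480294,
    "apps": {"TweetDeck": 184570, "web": 461286, "UberTwitter": 108555},
}
rows = top_apps(day_doc, limit=3)
for row in rows:
    print(row)
```

The selector and update pair is what would be sent as an upsert. App names containing a dot would need escaping before being used as field names.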
  • 13. But sometimes... • Aggregation might still be necessary • This works...but what if we need more?
  • 14. lnkby.me ‣ Problem statement: 1. Aggregate stats on shortened links (easy) 2. Top domains based on # clicks (easy) 3. Top users driving traffic to those top domains for the past seven days (slightly more difficult)
  • 15. Solution • Gather stats • Map/Reduce! • Server-side JavaScript • Temporary collection to hold output • Can be written to permanent collection • temp collection renamed atomically
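
In MongoDB the map and reduce functions are JavaScript that runs on the server. The sketch below imitates the same two-phase shape in plain Python so the data flow is easy to follow. The click fields (`url`, `user`) and all function names are assumptions for illustration, and the returned dict stands in for the output collection.

```python
from collections import defaultdict
from urllib.parse import urlparse


def map_click(click):
    """Map step: emit one (key, value) pair per click document."""
    yield urlparse(click["url"]).netloc, 1


def reduce_counts(key, values):
    """Reduce step: fold every value emitted for one key into a single value."""
    return sum(values)


def map_reduce(docs, map_fn, reduce_fn):
    """Group emitted values by key, then reduce each group."""
    emitted = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            emitted[key].append(value)
    return {key: reduce_fn(key, values) for key, values in emitted.items()}


clicks = [
    {"url": "http://example.com/a", "user": "alice"},
    {"url": "http://example.com/b", "user": "bob"},
    {"url": "http://example.org/", "user": "alice"},
]
clicks_per_domain = map_reduce(clicks, map_click, reduce_counts)
print(clicks_per_domain)
```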
  • 16. click
  • 17. aggregate Get Top 10 Domains, then...
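
The slide's own code is not in the transcript. One way to picture the two-step aggregation (top domains first, then the users driving clicks to them) is the plain-Python sketch below. It assumes each click carries `domain`, `user`, and `at` fields, and that the seven-day window applies to both steps; neither assumption is confirmed by the slides.

```python
from collections import Counter
from datetime import datetime, timedelta


def top_users_for_top_domains(clicks, now, n_domains=10, days=7):
    """Step 1: rank domains by recent clicks. Step 2: rank users within each top domain."""
    since = now - timedelta(days=days)
    recent = [c for c in clicks if c["at"] >= since]

    domain_counts = Counter(c["domain"] for c in recent)
    top_domains = [domain for domain, _ in domain_counts.most_common(n_domains)]

    user_counts = Counter(
        (c["domain"], c["user"]) for c in recent if c["domain"] in top_domains
    )
    result = {}
    for domain in top_domains:
        users = [(user, n) for (d, user), n in user_counts.items() if d == domain]
        # Most clicks first; ties broken by user name for a stable order.
        result[domain] = sorted(users, key=lambda un: (-un[1], un[0]))
    return result


now = datetime(2010, 7, 27)
clicks = [
    {"domain": "example.com", "user": "alice", "at": datetime(2010, 7, 26)},
    {"domain": "example.com", "user": "alice", "at": datetime(2010, 7, 25)},
    {"domain": "example.com", "user": "bob", "at": datetime(2010, 7, 24)},
    {"domain": "example.org", "user": "bob", "at": datetime(2010, 7, 23)},
    {"domain": "example.net", "user": "carol", "at": datetime(2010, 7, 1)},  # outside the window
]
report = top_users_for_top_domains(clicks, now)
print(report)
```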
  • 18. Caveat • Indexes, indexes, indexes • Compound indexes • a,b,c • Query on a; a,b; or a,b,c • Sort on last field
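
The prefix rule on this slide can be written down as a small check. This is a deliberate simplification that encodes only the rule as stated (query on a; a,b; or a,b,c); the real query planner is more nuanced, and sorting is not modelled. `is_index_prefix` is a hypothetical helper.

```python
def is_index_prefix(index_fields, query_fields):
    """True if the queried fields are exactly a leading prefix of the compound index."""
    query_fields = set(query_fields)
    prefix = index_fields[:len(query_fields)]
    return len(query_fields) > 0 and set(prefix) == query_fields


index = ["a", "b", "c"]
for fields in (["a"], ["a", "b"], ["a", "b", "c"], ["b"], ["b", "c"], ["a", "c"]):
    print(fields, is_index_prefix(index, fields))
```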
  • 19. Mongo++ • High-volume updates -- win • Stats generation -- win • Mutable schema && json -- win