MongoDB for real-time data collection and stats generation
Damon Cortesi (@dacort)
damon@rowfeeder.com

My talk from #mongoseattle on how I've used MongoDB for real-time data collection and stats generation. Includes basic usage of increment modifiers as well as a map/reduce example.

Published in: Technology
  • MongoDB Real-time Data Collection and Stats Generation

    1. MongoDB for real-time data collection and stats generation
       Damon Cortesi (@dacort)
       damon@rowfeeder.com
    2. The Past
       • Lots of data
       • 2M rows/day
       • Post-computation is slow
       • TweetStats == 1B tweets
    3. mysql> select count(*) from gnip_activity_2_1 WHERE created_at BETWEEN '2010-03-01' AND '2010-03-02';
       +----------+
       | count(*) |
       +----------+
       |  1480294 |
       +----------+
       1 row in set (6.32 sec)

       mysql> SELECT app,(count(*)/(select count(*) from gnip_activity_2_1 WHERE created_at BETWEEN '2010-03-01' AND '2010-03-02'))*100 AS percent, count(*) as count
           -> FROM gnip_activity_2_1
           -> WHERE created_at BETWEEN '2010-03-01' AND '2010-03-02'
           -> GROUP BY app
           -> ORDER BY count DESC
           -> LIMIT 10;
       +-------------+---------+--------+
       | app         | percent | count  |
       +-------------+---------+--------+
       | web         | 31.1618 | 461286 |
       | TweetDeck   | 12.4685 | 184570 |
       | UberTwitter |  7.3333 | 108555 |
       | twitterfeed |  7.1350 | 105619 |
       | API         |  5.3761 |  79582 |
       | Echofon     |  4.2635 |  63112 |
       | Tweetie     |  3.4734 |  51416 |
       | Seesmic     |  2.1913 |  32438 |
       | mobile web  |  1.8382 |  27211 |
       | HootSuite   |  1.7951 |  26573 |
       +-------------+---------+--------+
       10 rows in set (1 min 27.07 sec)

       This is NOT real-time
    4. So now what?
       Tokyo, Riak, CouchDB, MongoDB
       Photo: http://www.flickr.com/photos/bob_august/4307291275/
    5. • Initially simple - Tweets -> Spreadsheet
       • “Maybe we should save this data...”
       • Real-time updates
       • Minimal stats generation
       • Writes > Reads (200x)

       > db.serverStatus()
       "opcounters" : {
           "insert" : 36687455,
           "query" : 857059,
           "update" : 189207744,
           "delete" : 0,
           "getmore" : 4176334,
           "command" : 36734580
       }
    6. Streaming Twitter
       curl http://stream.twitter.com/1/statuses/sample.json -u<user>:<pass> | mongoimport -c twitter_live
       Courtesy @eliothorowitz
       ...
    7. Slightly more complex ;)
       Diagram labels: Twitter/Facebook/etc; Redis/Resque; Queues; Process (x3); MongoDB; Save/Update/Stats
    8. Modifier Operations
       No query/retrieve
       Just $set or $inc
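The modifier pattern on slide 8 can be sketched roughly as follows. This is an illustration under stated assumptions, not the speaker's code: the collection name `stats`, the fields `date`, `total`, `apps` and `last_seen`, and the helper `buildStatsUpdate` are all hypothetical. The function only builds the filter and update documents, so it runs without a server.

```javascript
// Build an upsert that bumps per-day counters without reading the document first.
// Field names (date, total, apps, last_seen) are hypothetical, chosen for illustration.
function buildStatsUpdate(tweet) {
  const day = tweet.created_at.slice(0, 10); // "2010-03-01T12:34:56Z" -> "2010-03-01"
  // Dots in a key would be read as a nested path, so replace them in the app name.
  const app = tweet.app.replace(/\./g, "_");
  const inc = { total: 1 };
  inc["apps." + app] = 1;
  return {
    filter: { date: day },
    update: {
      $inc: inc,
      $set: { last_seen: tweet.created_at },
    },
  };
}

const { filter, update } = buildStatsUpdate({
  app: "TweetDeck",
  created_at: "2010-03-01T12:34:56Z",
});
console.log(JSON.stringify(filter), JSON.stringify(update));
// In the mongo shell this pair could be sent as:
//   db.stats.update(filter, update, { upsert: true })
```

With an upsert, the first tweet of a day creates the document and every later tweet only increments it, so the writer never has to query or retrieve anything.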
    9. Stats
    10. Other Benefits
        • New field? No multi-day ALTER statement.
        • Auto-sharding in 1.6
        • --notablescan
         • aka, don’t pull a Twitter
         • “On Monday, our users database, where we store millions of user records, got hung up running a long-running query” -- 7/21
    11. So...that earlier SQL?
        Incrementers
        Much Better
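To make the contrast with the earlier SQL concrete, here is a hedged sketch of the read side once counters are pre-computed. It assumes a hypothetical per-day document shaped like `{ date, total, apps: { name: count } }` that is kept up to date with `$inc`; the helper name `topApps` is also illustrative. The counts are copied from the MySQL output on slide 3.

```javascript
// Turn one pre-computed daily stats document into a "top apps" report.
// The document shape is an assumption; the counts come from the SQL output on slide 3.
function topApps(statsDoc, limit) {
  return Object.entries(statsDoc.apps)
    .map(([app, count]) => ({
      app,
      count,
      percent: (count / statsDoc.total) * 100,
    }))
    .sort((a, b) => b.count - a.count)
    .slice(0, limit);
}

const day = {
  date: "2010-03-01",
  total: 1480294,
  apps: { TweetDeck: 184570, web: 461286, UberTwitter: 108555 },
};

for (const row of topApps(day, 10)) {
  console.log(row.app, row.percent.toFixed(4), row.count);
}
```

The report becomes a sort over a handful of keys in a single document rather than a GROUP BY over every row for the day.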
    12. Simple. Pre-Computed.
        Wait, can’t I pre-compute in MySQL?
        Mongo == Fire/forget and async
    13. But sometimes...
        • Aggregation might still be necessary
        • This works...but what if we need more?
    14. lnkby.me
        ‣ Problem statement:
        1. Aggregate stats on shortened links
        2. Top domains based on # clicks
        3. Top users driving traffic to those top domains for the past seven days
        Labels on slide: Easy; Slightly more difficult
    15. Solution
        • Gather stats
        • Map/Reduce!
         • Server-side JavaScript
         • Temporary collection to hold output
         • Can be written to permanent collection
         • temp collection renamed atomically
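The map/reduce step on slide 15 might look something like the sketch below for the "top domains by clicks" question. The `domain` field, the sample click documents and the `runMapReduce` harness are assumptions made for illustration; the harness is only a tiny in-memory stand-in so the two functions can be tried without a server.

```javascript
// map and reduce are written the way MongoDB expects them: map reads the
// current document from `this` and calls emit(key, value); reduce folds the
// values for one key. The `domain` field on click documents is hypothetical.
let emit; // provided by the server in MongoDB; set by the local harness below

function mapClick() {
  emit(this.domain, 1);
}

function reduceClicks(key, values) {
  // Must be safe to re-run on its own output, so sum rather than count.
  return values.reduce((sum, n) => sum + n, 0);
}

// Tiny in-memory stand-in for the server, only for trying the functions out.
function runMapReduce(docs, map, reduce) {
  const groups = new Map();
  emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  docs.forEach((doc) => map.call(doc));
  return [...groups].map(([key, values]) => ({
    _id: key,
    value: values.length === 1 ? values[0] : reduce(key, values),
  }));
}

const clicks = [
  { domain: "example.com", user: "a" },
  { domain: "example.org", user: "b" },
  { domain: "example.com", user: "c" },
];

const topDomains = runMapReduce(clicks, mapClick, reduceClicks)
  .sort((a, b) => b.value - a.value)
  .slice(0, 10);
console.log(topDomains);
```

Against a real server the same two functions would be passed to something like `db.clicks.mapReduce(mapClick, reduceClicks, { out: "domain_counts" })`, where both collection names are hypothetical.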
    16. click
    17. aggregate
        Get Top 10 Domains, then...
    18. Caveat
        • Indexes, indexes, indexes
        • Compound indexes
         • a,b,c
         • Query on a; a,b; or a,b,c
         • Sort on last field
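The compound-index rule on slide 18 can be illustrated with a small check. This is a deliberately simplified model, not how the query planner actually works: it only tests whether the fields of an equality query form a leading prefix of the index, and it does not model the "sort on last field" point. The helper name `matchesIndexPrefix` is hypothetical.

```javascript
// Simplified model of the rule on this slide: a compound index on (a, b, c)
// can serve an equality query whose fields are a leading prefix of the index.
// This ignores ranges, sort direction and other planner details.
function matchesIndexPrefix(indexFields, queryFields) {
  if (queryFields.length === 0 || queryFields.length > indexFields.length) {
    return false;
  }
  const prefix = new Set(indexFields.slice(0, queryFields.length));
  return queryFields.every((field) => prefix.has(field));
}

const index = ["a", "b", "c"];
console.log(matchesIndexPrefix(index, ["a"]));           // true
console.log(matchesIndexPrefix(index, ["b", "a"]));      // true: order inside the query does not matter
console.log(matchesIndexPrefix(index, ["a", "b", "c"])); // true
console.log(matchesIndexPrefix(index, ["b", "c"]));      // false: skips the leading field
```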
    19. Mongo++
        • High-volume updates -- win
        • Stats generation -- win
        • Mutable schema && json -- win