Realtime Analytics with MongoDB


My talk from Mongo Boston (9/20/2010) about how we use MongoDB to scale Rails at Yottaa.



  1. Scaling Rails @ Yottaa
     Jared Rosoff (@forjared)
     September 20th, 2010
  2. From zero to humongous
     About our application
     How we chose MongoDB
     How we use MongoDB
  3. About our application
     We collect lots of data:
       6,000+ URLs
       300 samples per URL per day
       Some samples are >1 MB (Firebug)
       Missing a sample isn't a big deal
     We visualize data in real time:
       No delay when showing data
       "On-demand" samples: the "check now" button
  4. The Yottaa Network
  5. How we chose MongoDB
  6. Requirements
     Our data set is going to grow very quickly: scalable by default
     We have a very small team: focus on the application, not infrastructure
     We are a startup: requirements change hourly
     Operations: we're 100% in the cloud
  7. Rails default architecture
     (Diagram: Data Source → Collection Server → MySQL → Reporting Server → User; "just" a Rails app)
     Performance bottleneck: too much load
  8. Let's add replication!
     (Diagram: a MySQL master replicating to several read replicas)
     Off the shelf! Scalable reads!
     Performance bottleneck: still can't scale writes
  9. What about sharding?
     (Diagram: collection and reporting servers doing their own sharding across several MySQL masters)
     Scalable writes!
     Development bottleneck: need to write custom code
  10. Key-value stores to the rescue?
      (Diagram: Cassandra or Voldemort in place of the MySQL masters)
      Scalable writes!
      Development bottleneck: reporting is limited / hard
  11. Can I Hadoop my way out of this?
      (Diagram: Cassandra or Voldemort for writes, Hadoop for reports, MySQL master/slave behind the "just a Rails app" reporting server)
      Scalable writes! Flexible reports!
      Development bottleneck: too many systems!
  12. MongoDB!
      (Diagram: MongoDB alone between the collection and reporting servers)
      Scalable writes! Flexible reporting! "Just" a Rails app
  13. (Architecture diagram: Data Source and User hit a Load Balancer, then Nginx + Passenger app servers for collection and reporting, then mongos routing to mongod instances)
      Sharding! High concurrency, scale-out
  14. Sharding is critical
      Distribute write load across servers
      Decentralize data storage
      Scale out!
  15. Before sharding
      Need higher write volume? Buy a bigger database.
      Need more storage volume? Buy a bigger database.
  16. After sharding
      Need higher write volume? Add more servers.
      Need more storage volume? Add more servers.
  17. Scale-out is the new scale-up
  18. How we're using MongoDB
  19. Our data model
      One document per URL we track:
        meta-data, summary data, most recent measurements
      One document per URL per day:
        detailed metrics, pre-aggregated data
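The two document shapes above can be sketched as Ruby hashes. Every concrete value here, and any field the slide doesn't name (such as `last_sample`), is illustrative:

```ruby
# One document per tracked URL: meta-data, summary data, and the
# most recent measurements.
url_doc = {
  url: "example.com",                            # illustrative URL
  summary: { connect: { sum: 0, count: 0 } },    # summary data
  last_sample: { connect: 23, timestamp: 1234 }  # most recent measurement
}

# One document per URL per day: detailed, pre-aggregated metrics,
# broken down by location as in the later slides.
daily_doc = {
  url: "example.com",
  day: "9/20/2010",
  connect: { sum: 0, count: 0,
             sfo: { sum: 0, count: 0 },
             nyc: { sum: 0, count: 0 } }
}
```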
  20. Thinking in rows
      { url: '',
        location: 'SFO',
        connect: 23,
        first_byte: 123,
        last_byte: 245,
        timestamp: 1234 }
      { url: '',
        location: 'NYC',
        connect: 23,
        first_byte: 123,
        last_byte: 245,
        timestamp: 2345 }
  21. Thinking in rows
      What was the average connect time for google on Friday?
      From SFO? From NYC? Between 1 AM and 2 AM?
  22. Thinking in rows
      Up to 100s of samples per URL per day!
      (Diagram: each day's rows are averaged, then the daily averages are combined into the result)
      Over a 30-day query range, an "average" chart had to hit 600 rows.
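The cost the slide describes can be sketched in plain Ruby: with one row per sample, every chart point means scanning and averaging raw rows (the sample values are made up):

```ruby
# Row-oriented model: raw sample rows, one per measurement.
rows = [
  { url: "example.com", location: "SFO", connect: 23, timestamp: 1234 },
  { url: "example.com", location: "NYC", connect: 27, timestamp: 2345 },
  { url: "example.com", location: "SFO", connect: 25, timestamp: 3456 }
]

# Overall average connect time: scan every row.
avg = rows.sum { |r| r[:connect] }.to_f / rows.size        # => 25.0

# Per-location average: filter, then scan again.
sfo = rows.select { |r| r[:location] == "SFO" }
sfo_avg = sfo.sum { |s| s[:connect] }.to_f / sfo.size      # => 24.0
```

At hundreds of samples per URL per day, a 30-day chart repeats this scan over hundreds of rows per point, which is the cost the document-oriented model removes.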
  23. Thinking in documents
      (Annotated document: one document holds all data collected during 9/20/2010; each metric field stores the average value for this URL and time period, with per-location values for SFO and NYC)
  24. Storing a sample
      db.metrics.dailies.update(
        // Which document we're updating
        { url: '', day: '9/20/2010' },
        // Update the aggregate value and the location-specific value
        { '$inc': { 'connect.sum': 1234,
                    'connect.count': 1,
                    'connect.sfo.sum': 1234,
                    'connect.sfo.count': 1 } },
        // Atomically update; create the document if it doesn't already exist
        { upsert: true }
      );
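The same pre-aggregation can be sketched against an in-memory Ruby hash, so the sum/count bookkeeping is visible without a running MongoDB (the URL and metric values are illustrative):

```ruby
# Find-or-create per (url, day) key, mimicking upsert: true.
dailies = Hash.new { |h, k| h[k] = Hash.new(0) }

def record_sample(dailies, url:, day:, location:, connect:)
  doc = dailies[[url, day]]
  doc["connect.sum"]   += connect              # aggregate value
  doc["connect.count"] += 1
  doc["connect.#{location}.sum"]   += connect  # location-specific value
  doc["connect.#{location}.count"] += 1
end

record_sample(dailies, url: "example.com", day: "9/20/2010",
              location: "sfo", connect: 1234)
record_sample(dailies, url: "example.com", day: "9/20/2010",
              location: "nyc", connect: 766)
```

In MongoDB the whole increment is a single atomic `$inc`, so concurrent collection servers never observe a half-applied update.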
  25. Putting it together
      For each incoming sample, e.g.
      { url: '', location: 'SFO', connect: 23,
        first_byte: 123, last_byte: 245, timestamp: 1234 }
      (1) atomically update the daily data,
      (2) atomically update the weekly data,
      (3) atomically update the monthly data.
  26. Drawing the connect-time graph
      db.metrics.dailies.find(
        // Data for google, over the range of dates for the chart
        { url: '',
          day: { '$gte': '9/1/2010', '$lte': '9/20/2010' } },
        // We just want connect-time data
        { 'connect': true }
      );
      // Compound index to make this query fast
      db.metrics.dailies.ensureIndex({ url: 1, day: -1 })
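Once the daily documents come back, each chart point is just the stored sum divided by the stored count; a Ruby sketch with made-up numbers:

```ruby
# Pre-aggregated daily documents, as a find like the one above would return.
daily_docs = [
  { day: "9/18/2010", connect: { sum: 300, count: 10 } },
  { day: "9/19/2010", connect: { sum: 500, count: 20 } }
]

# One chart point per document: average = stored sum / stored count.
chart = daily_docs.map do |doc|
  m = doc[:connect]
  [doc[:day], m[:sum].to_f / m[:count]]
end
# chart == [["9/18/2010", 30.0], ["9/19/2010", 25.0]]
```

No raw samples are touched at read time; the averaging work was already paid for, one `$inc` at a time, when each sample arrived.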
  27. More efficient charts
      One document per URL per day
      (Diagram: one daily document feeds each AVG, then the result)
      A 30-day average chart hits 30 documents: 20x fewer than the 600 rows before.
  28. Real-time updates
      A single query fetches all metric data for a URL
      Fast enough that the browser can poll constantly for updated data without impacting the server
  29. Final thoughts
      MongoDB has been a great choice:
        80 GB of data and counting
        Majorly compressed after moving from a table- to a document-oriented data model
        100s of updates per second, 24x7
      Not using sharding in production yet, but planning on it soon
      You are using replication, right?