
Realtime Analytics with MongoDB

My talk from Mongo Boston (9/20/2010) about how we use MongoDB to scale Rails at Yottaa.


  1. Scaling Rails @ Yottaa
     Jared Rosoff (@forjared, jrosoff@yottaa.com)
     September 20th, 2010
  2. From zero to humongous
     - About our application
     - How we chose MongoDB
     - How we use MongoDB
  3. About our application
     - We collect lots of data
       - 6,000+ URLs
       - 300 samples per URL per day
       - Some samples are >1MB (Firebug traces)
       - Missing a sample isn't a big deal
     - We visualize data in real time
       - No delay when showing data
       - "On-demand" samples: the "check now" button
  4. The Yottaa Network
     (network diagram)
  5. How we chose MongoDB
  6. Requirements
     - Our data set is going to grow very quickly: scalable by default
     - We have a very small team: focus on the application, not infrastructure
     - We are a startup: requirements change hourly
     - Operations: we're 100% in the cloud
  7. Rails default architecture
     (Diagram: Data Source -> Collection Server -> MySQL <- Reporting Server <- User; "just" a Rails app)
     Performance bottleneck: too much load on a single database.
  8. Let’s add replication!<br />Performance Bottleneck: Still can’t scale writes<br />MySQL<br />Master<br />Collection Server<br />Data Source<br />Replication<br />MySQL<br />Master<br />User<br />Reporting Server<br />MySQL<br />Master<br />MySQL<br />Master<br />Off the shelf!<br />Scalable Reads!<br />
  9. What about sharding?
     (Diagram: custom sharding layers in both the collection server and the reporting server route across several MySQL masters)
     Scalable writes!
     Development bottleneck: need to write custom sharding code.
  10. Key-value stores to the rescue?
      (Diagram: Cassandra or Voldemort replaces the MySQL masters behind the collection and reporting servers)
      Scalable writes!
      Development bottleneck: reporting is limited / hard.
  11. Can I Hadoop my way out of this?
      (Diagram: the collection server writes to Cassandra or Voldemort, Hadoop jobs aggregate into replicated MySQL, and the reporting server ("just" a Rails app) reads from the MySQL replicas)
      Scalable writes! Flexible reports!
      Development bottleneck: too many systems!
  12. MongoDB!
      (Diagram: both the collection server and the reporting server talk directly to MongoDB)
      Scalable writes! Flexible reporting! "Just" a Rails app!
  13. (Architecture diagram: data sources and users hit an Nginx/Passenger app-server tier behind a load balancer; collection and reporting traffic goes through mongos to multiple mongod shards)
      Sharding! High concurrency! Scale-out!
  14. Sharding is critical
      - Distribute write load across servers
      - Decentralize data storage
      - Scale out!
  15. Before sharding
      - Need higher write volume? Buy a bigger database.
      - Need more storage volume? Buy a bigger database.
  16. After sharding
      - Need higher write volume? Add more servers.
      - Need more storage volume? Add more servers.
  17. Scale out is the new scale up.
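
The deck doesn't show the cluster setup itself; as a minimal sketch, enabling sharding with the admin commands current in MongoDB 1.6 looks roughly like this (host names, the database name, and the choice of url as shard key are all assumptions):

    // Run in the mongo shell against a mongos router; all names are illustrative.
    var admin = db.getSisterDB('admin');
    admin.runCommand({ addshard: 'shard1.example.com:27017' });  // register each shard
    admin.runCommand({ addshard: 'shard2.example.com:27017' });
    admin.runCommand({ enablesharding: 'yottaa' });              // shard this database
    admin.runCommand({ shardcollection: 'yottaa.metrics.dailies',
                       key: { url: 1 } });                       // spread documents by URL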
  18. How we’re using MongoDB<br />18<br />
  19. Our data model
      - One document per URL we track:
        - meta-data
        - summary data
        - most recent measurements
      - One document per URL per day:
        - detailed metrics
        - pre-aggregated data
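
The deck lists only these categories; as a sketch, the per-URL document might look like this (every field name except url is hypothetical):

    // Hypothetical per-URL summary document; only the field categories come from the deck.
    {
      url: 'www.google.com',              // meta-data
      summary: { connect_avg: 24 },       // summary data (illustrative)
      last_sample: {                      // most recent measurement
        location: 'NYC',
        connect: 23,
        first_byte: 123,
        last_byte: 245,
        timestamp: 2345
      }
    }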
  20. Thinking in rows
      { url: 'www.google.com', location: 'SFO',
        connect: 23, first_byte: 123, last_byte: 245,
        timestamp: 1234 }
      { url: 'www.google.com', location: 'NYC',
        connect: 23, first_byte: 123, last_byte: 245,
        timestamp: 2345 }
  21. Thinking in rows
      What was the average connect time for google on Friday?
      - From SFO?
      - From NYC?
      - Between 1AM and 2AM?
  22. Thinking in rows
      Up to 100s of samples per URL per day!
      (Diagram: each day's rows are averaged separately, then combined into the result)
      Over a 30-day query range, an "average" chart had to hit 600 rows.
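
In the row model, every chart point means scanning raw samples and averaging client-side; a minimal mongo-shell sketch of that shape (the samples collection name and the timestamp bounds are assumptions):

    // One document per raw sample (collection name 'samples' is assumed).
    var windowStart = 1283299200, windowEnd = 1285891200;    // illustrative 30-day bounds
    var sum = 0, count = 0;
    db.samples.find(
      { url: 'www.google.com',
        timestamp: { $gte: windowStart, $lt: windowEnd } },
      { connect: true }
    ).forEach(function (row) {
      sum += row.connect;        // every raw row ships to the client
      count += 1;
    });
    var avgConnect = sum / count;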
  23. Thinking in documents
      One document contains all the data for www.google.com collected during 9/20/2010, including:
      - the average value of each metric for this URL and time period
      - the average value from SFO
      - the average value from NYC
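
Judging from the $inc keys in the update on the next slide, each average is stored as a running sum and count, overall and per location; a sketch of one daily document under that assumption:

    // Daily document shape inferred from the slide-24 update; values are illustrative.
    {
      url: 'www.google.com',
      day: '9/20/2010',
      connect: {
        sum: 145, count: 6,               // overall average = sum / count
        sfo: { sum: 70, count: 3 },       // per-location aggregates
        nyc: { sum: 75, count: 3 }
      },
      first_byte: { sum: 738, count: 6 }  // other metrics follow the same pattern
    }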
  24. Storing a sample
      db.metrics.dailies.update(
        { url: 'www.google.com',        // which document we're updating
          day: '9/20/2010' },
        { $inc: {
            'connect.sum': 1234,        // update the aggregate value
            'connect.count': 1,
            'connect.sfo.sum': 1234,    // update the location-specific value
            'connect.sfo.count': 1 } },
        { upsert: true }                // create the document if it doesn't already exist
      );
      One atomic update per sample.
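
Storing the sum and count rather than the average itself is presumably what keeps this a single atomic operation: $inc can grow both fields without a read-modify-write cycle, and the average falls out as sum / count at read time.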
  25. Putting it together
      Each incoming sample, e.g.
      { url: 'www.google.com', location: 'SFO',
        connect: 23, first_byte: 123, last_byte: 245,
        timestamp: 1234 }
      triggers three writes:
      1. Atomically update the daily data
      2. Atomically update the weekly data
      3. Atomically update the monthly data
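
A sketch of that fan-out in the mongo shell, reusing the slide-24 upsert; the weekly and monthly collection names, the period keys, and the helper itself are assumptions:

    // Hypothetical helper: apply the slide-24 upsert to one period's collection.
    function recordSample(coll, periodKey, s) {
      var loc = s.location.toLowerCase();
      var inc = { 'connect.sum': s.connect, 'connect.count': 1 };
      inc['connect.' + loc + '.sum'] = s.connect;
      inc['connect.' + loc + '.count'] = 1;
      coll.update({ url: s.url, day: periodKey }, { $inc: inc }, { upsert: true });
    }

    var sample = { url: 'www.google.com', location: 'SFO', connect: 23 };
    recordSample(db.metrics.dailies,   '9/20/2010', sample);  // 1. daily
    recordSample(db.metrics.weeklies,  '9/19/2010', sample);  // 2. weekly (week-start key, name assumed)
    recordSample(db.metrics.monthlies, '9/2010',    sample);  // 3. monthly (name assumed)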
  26. Drawing the connect-time graph
      db.metrics.dailies.find(
        { url: 'www.google.com',          // data for google
          day: { $gte: '9/1/2010',        // the range of dates for the chart
                 $lte: '9/20/2010' } },
        { connect: true }                 // we just want connect-time data
      );
      // Compound index to make this query fast:
      db.metrics.dailies.ensureIndex({ url: 1, day: -1 });
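
Each returned document already carries the pre-aggregated sum and count, so producing a chart point is one division per day; a sketch:

    // Build the connect-time series from the pre-aggregated daily documents.
    var points = [];
    db.metrics.dailies.find(
      { url: 'www.google.com',
        day: { $gte: '9/1/2010', $lte: '9/20/2010' } },
      { day: true, connect: true }
    ).forEach(function (doc) {
      points.push({ day: doc.day, avg: doc.connect.sum / doc.connect.count });
    });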
  27. More efficient charts
      One document per URL per day.
      (Diagram: each daily document is averaged directly into the result)
      30 days == 30 documents, so an average chart hits 30 documents: 20x fewer than before.
  28. Real-time updates
      - A single query fetches all metric data for a URL.
      - Fast enough that the browser can poll constantly for updated data without impacting the server.
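
If that poll reads the per-URL document from slide 19, which holds the summary data and most recent measurements, it can be a single findOne; a sketch (the collection name is an assumption):

    // Hypothetical summary collection; one indexed lookup per browser poll.
    var current = db.metrics.urls.findOne({ url: 'www.google.com' });
    // current.summary and current.last_sample feed the live dashboard.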
  29. Final thoughts
      - Mongo has been a great choice:
        - 80 GB of data and counting
        - footprint shrank substantially after moving from a table-oriented to a document-oriented data model
        - hundreds of updates per second, 24x7
      - Not using sharding in production yet, but planning on it soon
      - You are using replication, right?
