Realtime Analytics with MongoDB
1. Scaling Rails @ Yottaa
Jared Rosoff (@forjared, jrosoff@yottaa.com)
September 20th, 2010
2. From zero to humongous
- About our application
- How we chose MongoDB
- How we use MongoDB
3. About our application
- We collect lots of data: 6000+ URLs, 300 samples per URL per day
- Some samples are >1MB (Firebug)
- Missing a sample isn't a big deal
- We visualize data in real time: no delay when showing data
- "On-demand" samples: the "check now" button
6. Requirements
- Our data set is going to grow very quickly: scalable by default
- We have a very small team: focus on application, not infrastructure
- We are a startup: requirements change hourly
- Operations: we're 100% in the cloud
7. Rails default architecture
- Data Source → Collection Server → MySQL ← Reporting Server ← User
- "Just" a Rails app
- Performance bottleneck: too much load
8. Let's add replication!
- One MySQL master replicating to read replicas: off the shelf! Scalable reads!
- Performance bottleneck: still can't scale writes
9. What about sharding?
- Collection and reporting servers shard writes across multiple MySQL masters: scalable writes!
- Development bottleneck: need to write custom sharding code
10. Key-value stores to the rescue?
- Replace the sharded MySQL masters with Cassandra or Voldemort: scalable writes!
- Development bottleneck: reporting is limited / hard
11. Can I Hadoop my way out of this?
- Cassandra or Voldemort for collection, Hadoop feeding a MySQL master/slave pair behind the "just a Rails app" reporting server
- Scalable writes! Flexible reports!
- Development bottleneck: too many systems!
12. MongoDB!
- Data Source → Collection Server → MongoDB ← Reporting Server ← User
- Scalable writes! Flexible reporting! "Just" a Rails app
13.
- Data sources and users hit a load balancer in front of Nginx + Passenger app servers (collection and reporting)
- App servers talk to mongos, which routes to sharded mongod instances
- Sharding! High concurrency, scale-out
14. Sharding is critical
- Distribute write load across servers
- Decentralize data storage
- Scale out!
15. Before sharding
- Need higher write volume? Buy a bigger database
- Need more storage volume? Buy a bigger database
16. After sharding
- Need higher write volume? Add more servers
- Need more storage volume? Add more servers
17. Scale-out is the new scale-up
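The scale-out idea in slides 14-17 can be sketched as a toy shard router (hypothetical code, not Yottaa's; real MongoDB sharding uses range-based chunks managed by mongos, not hash-mod routing):

```ruby
require 'zlib'

# Toy illustration of write scale-out: route each URL to one of N shards.
# Hash-mod routing is only a sketch of the "add more servers" idea; real
# MongoDB routes via mongos using range-based chunks on the shard key.
class ToyShardRouter
  def initialize(shards)
    @shards = shards
  end

  # Pick a shard deterministically from the shard key (here: the URL).
  def shard_for(url)
    @shards[Zlib.crc32(url) % @shards.size]
  end

  # Need more write or storage volume? Add more servers.
  def add_shard(shard)
    @shards << shard
  end
end

router = ToyShardRouter.new(%w[mongod-1 mongod-2])
puts router.shard_for('www.google.com')
router.add_shard('mongod-3')   # scale out instead of buying a bigger box
```

Adding a shard changes where some keys land, which is why real systems migrate chunks between servers rather than recompute a modulus.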
19. Our data model
- Document per URL we track: meta-data, summary data, most recent measurements
- Document per URL per day: detailed metrics, pre-aggregated data
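The two document shapes above might look roughly like this as Ruby hashes; any field name beyond those the slide mentions (meta-data, summary, recent measurements, metrics) is hypothetical:

```ruby
# One document per tracked URL: meta-data + summary + recent measurements.
# Field names like :created_at and :avg_connect are illustrative guesses.
url_doc = {
  url: 'www.google.com',
  metadata: { created_at: '9/1/2010' },          # hypothetical meta-data
  summary:  { avg_connect: 25 },                 # hypothetical summary data
  recent:   [{ connect: 23, timestamp: 1234 }]   # most recent measurements
}

# One document per URL per day: detailed, pre-aggregated metrics,
# matching the sum/count structure used in the update on slide 24.
daily_doc = {
  url: 'www.google.com',
  day: '9/20/2010',
  connect: { sum: 1234, count: 1,                # aggregate across locations
             sfo: { sum: 1234, count: 1 } }      # per-location breakdown
}

puts daily_doc[:connect][:sum]
```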
21. Thinking in rows
- What was the average connect time for google on Friday? From SFO? From NYC? Between 1AM and 2AM?
22. Thinking in rows
- Up to 100s of samples per URL per day!
- Over a 30-day query range, an "average" chart had to hit 600 rows: one AVG per day (Day 1 AVG, Day 2 AVG, Day 3 AVG, ...)
23. Thinking in documents
- One document contains all data for www.google.com collected during 9/20/2010
- It tells us the average value for each metric for this URL / time period, including the average value from SFO and the average value from NYC
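Pre-aggregation stores a running (sum, count) pair per metric rather than a computed mean, so the average falls out as one division at read time. A minimal sketch with made-up numbers:

```ruby
# A pre-aggregated daily document: each metric keeps a (sum, count) pair,
# overall and per location, so an average is just sum / count.
daily = {
  url: 'www.google.com',
  day: '9/20/2010',
  connect: {
    sum: 480.0, count: 20,            # aggregate for this URL / day
    sfo: { sum: 150.0, count: 6 },    # average value from SFO
    nyc: { sum: 330.0, count: 14 }    # average value from NYC
  }
}

def average(metric)
  metric[:sum] / metric[:count]
end

puts average(daily[:connect])         # => 24.0 overall
puts average(daily[:connect][:sfo])   # => 25.0 from SFO
```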
24. Storing a sample

```js
db.metrics.dailies.update(
  // which document we're updating
  { url: 'www.google.com', day: '9/20/2010' },
  // atomically update the document: $inc the aggregate value
  // and the location-specific value
  { $inc: { 'connect.sum': 1234,
            'connect.count': 1,
            'connect.sfo.sum': 1234,
            'connect.sfo.count': 1 } },
  // create the document if it doesn't already exist
  { upsert: true }
);
```
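What that upsert does on the server can be mimicked in plain Ruby: find-or-create the (url, day) document, then increment each sum/count field. This is an in-memory sketch of the logic only; the real atomicity lives inside mongod.

```ruby
# In-memory sketch of the upsert above: find-or-create the (url, day)
# document, then '$inc' the aggregate and the location-specific fields.
def record_sample(collection, url, day, location, connect_ms)
  key = [url, day]
  doc = collection[key] ||= { url: url, day: day }   # upsert: true

  [['connect'], ['connect', location]].each do |path|
    metric = path.inject(doc) { |d, k| d[k] ||= {} }
    metric['sum']   = (metric['sum'] || 0) + connect_ms
    metric['count'] = (metric['count'] || 0) + 1
  end
  doc
end

db = {}
record_sample(db, 'www.google.com', '9/20/2010', 'sfo', 1234)
record_sample(db, 'www.google.com', '9/20/2010', 'sfo', 766)
doc = db[['www.google.com', '9/20/2010']]
puts doc['connect']['sum']            # => 2000
puts doc['connect']['sfo']['count']   # => 2
```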
25. Putting it together

An incoming sample:

```js
{ url: 'www.google.com',
  location: 'SFO',
  connect: 23,
  first_byte: 123,
  last_byte: 245,
  timestamp: 1234 }
```

1. Atomically update the daily data
2. Atomically update the weekly data
3. Atomically update the monthly data
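The fan-out above can be sketched as one sample updating three period documents. The period-key formats and helper names here are hypothetical, not Yottaa's schema:

```ruby
require 'date'

# Sketch of the fan-out: one sample updates the daily, weekly, and monthly
# pre-aggregated documents with the same sum/count $inc pattern.
def period_keys(timestamp)
  t = Time.at(timestamp).to_date
  { daily:   t.strftime('%-m/%-d/%Y'),     # hypothetical key formats
    weekly:  "#{t.year}-W#{t.cweek}",
    monthly: t.strftime('%-m/%Y') }
end

def apply_sample(db, sample)
  period_keys(sample[:timestamp]).each do |granularity, key|
    doc = db[[sample[:url], granularity, key]] ||= { sum: 0, count: 0 }
    doc[:sum]   += sample[:connect]
    doc[:count] += 1
  end
end

db = {}
apply_sample(db, { url: 'www.google.com', location: 'SFO',
                   connect: 23, first_byte: 123, last_byte: 245,
                   timestamp: 1_284_940_800 })
```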
26. Drawing the connect time graph

```js
db.metrics.dailies.find(
  // data for google, over the range of dates for the chart
  { url: 'www.google.com',
    day: { $gte: '9/1/2010', $lte: '9/20/2010' } },
  // we just want connect time data
  { connect: true }
);

// compound index to make this query fast
db.metrics.dailies.ensureIndex({ url: 1, day: -1 });
```
27. More efficient charts
- 1 document per URL per day
- A 30-day average chart now hits 30 documents (Day 1 AVG, Day 2 AVG, ...): 20x fewer than the 600 rows before
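Given the 30 daily documents, the chart series is one division per document rather than an AVG over hundreds of rows. A sketch with made-up values:

```ruby
# Build the connect-time chart from daily documents: one point per
# document, each just sum / count of the pre-aggregated metric.
def connect_series(daily_docs)
  daily_docs.map do |doc|
    [doc[:day], doc[:connect][:sum].to_f / doc[:connect][:count]]
  end
end

docs = [
  { day: '9/18/2010', connect: { sum: 400, count: 20 } },
  { day: '9/19/2010', connect: { sum: 600, count: 20 } },
  { day: '9/20/2010', connect: { sum: 500, count: 20 } }
]
puts connect_series(docs).inspect
# => [["9/18/2010", 20.0], ["9/19/2010", 30.0], ["9/20/2010", 25.0]]
```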
28. Real-time updates
- Single query fetches all metric data for a URL
- Fast enough that the browser can poll constantly for updated data without impacting the server
29. Final thoughts
- Mongo has been a great choice
- 80GB of data and counting; majorly compressed after moving from a table- to a document-oriented data model
- 100s of updates per second, 24x7
- Not using sharding in production yet, but planning on it soon
- You are using replication, right?