How Yottaa used MongoDB & Ruby on Rails to build a scalable realtime analytics platform. This was my presentation for the NYC MongoDB Meetup on 11-16-2010.
Realtime Analytics with MongoDB - MongoDB Meetup NYC
1. Yottaa Inc.
2 Canal Park 5th Floor
Cambridge MA 02141
http://www.yottaa.com
Realtime Analytics with
MongoDB & Rails
Jared Rosoff
@forjared
forjared@gmail.com
8. Engineering Challenges
• High write volume from day 1
– Sample collection is like having millions of users on the first day
– After 60 days, we have > 150GB of data
– Adding about 5GB / day today
• Small engineering team
– 1 built the data warehouse & portal, 1 built the monitoring agents
– Bigger team now, but this was how we started
• Must be Agile
– We didn’t know exactly what features we’d need
– Requirements change daily
• Limited operations budget
– No full time operations staff
– 100% in the cloud
19. Collecting Data
- Sample data is passed in body of POST request
- Rails makes it really easy to parse JSON, XML, YAML (we use JSON)
- We have a bunch of other stuff that happens when data arrives, but
all you really need to do is write the data
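The "all you really need to do is write the data" step can be sketched in plain Ruby. This is only an illustration of the endpoint's core logic, not Yottaa's actual code; the helper name and field list are ours:

```ruby
require 'json'

# Hypothetical sketch: parse a JSON sample from a POST body into
# the fields the rest of the pipeline expects. In Rails the parsed
# body would typically already be available via params.
def parse_sample(request_body)
  data = JSON.parse(request_body)
  {
    url:        data.fetch('url'),
    location:   data.fetch('location'),
    connect:    data.fetch('connect'),
    first_byte: data.fetch('first_byte'),
    last_byte:  data.fetch('last_byte'),
    timestamp:  data.fetch('timestamp')
  }
end

body = '{"url":"www.google.com","location":"SFO","connect":23,' \
       '"first_byte":123,"last_byte":245,"timestamp":1234}'
sample = parse_sample(body)
```

From here the sample goes straight into the atomic updates shown on the later slides.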
25. Thinking in rows

One row per sample:

| URL | Location | Connect | First Byte | Last Byte | Timestamp |

What was the average connect time for google on Friday? From SFO? From NYC? Between 1AM-2AM?
26. Thinking in rows

| URL | Location | Connect | First Byte | Last Byte | Timestamp |

[Figure: AVG over the rows for Day 1, Day 2, and Day 3 produces the result]

Up to 100's of samples per URL per day! With a 30-day average query range, an "average" chart had to hit 600 rows.
27. Thinking in Documents

{ URL: www.google.com,
  Day: 9/20/2010,
  Last Byte: {
    Sum: 2312, Count: 12,
    SFO: { Sum: 1200, Count: 5 },
    NYC: { Sum: 1112, Count: 7 } } }

This document contains all data for www.google.com collected during 9/20/2010. The top-level sum and count tell us the average value for this metric for this url / time period; the nested pairs give the average value from SFO and from NYC.
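Recovering an average from one of these documents is a single division of the stored sum by the stored count. A minimal Ruby sketch, with field names mirroring the example above (the helper name is ours):

```ruby
# Compute an average from the pre-aggregated sum/count pair
# stored in a metrics document.
def average(agg)
  agg['count'].zero? ? nil : agg['sum'].to_f / agg['count']
end

doc = {
  'last_byte' => {
    'sum' => 2312, 'count' => 12,
    'sfo' => { 'sum' => 1200, 'count' => 5 },
    'nyc' => { 'sum' => 1112, 'count' => 7 }
  }
}

overall  = average(doc['last_byte'])          # ~192.67 overall
from_sfo = average(doc['last_byte']['sfo'])   # 240.0 from SFO
```

The same helper works at every nesting level, which is the payoff of keeping the aggregate and per-location pairs in the same shape.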
28. Storing a sample

A single atomic update stores a sample. The query selects which document we're updating; '$inc' updates both the aggregate value and the location-specific value; the upsert flag creates the document if it doesn't already exist.

db.metrics.dailies.update(
  { url: 'www.google.com',
    day: new Date(2010,9,2) },
  { '$inc': {
      'connect.sum': 1234,
      'connect.count': 1,
      'connect.sfo.sum': 1234,
      'connect.sfo.count': 1 } },
  true // upsert
);
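On the Rails side, the selector and '$inc' document can be built from a sample hash before handing them to the driver. A sketch under our own naming (the driver call itself is omitted):

```ruby
# Build the MongoDB selector and atomic $inc update for one sample,
# incrementing both the aggregate and the location-specific counters.
def upsert_spec(sample)
  loc = sample[:location].downcase
  selector = { url: sample[:url], day: sample[:day] }
  update = { '$inc' => {
    'connect.sum'           => sample[:connect],
    'connect.count'         => 1,
    "connect.#{loc}.sum"    => sample[:connect],
    "connect.#{loc}.count"  => 1
  } }
  [selector, update]
end

sel, upd = upsert_spec(url: 'www.google.com', day: '2010-09-20',
                       location: 'SFO', connect: 1234)
# With a driver: collection.update_one(sel, upd, upsert: true)
```

Building the dotted key with interpolation is what lets one code path handle every location.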
29. An example document

{
  "_id": ObjectId("4bb55c59c3666e02fc000001"),
  "url": "http://www.google.com/",
  "date": "Mon Jun 07 2010 00:00:00 GMT",
  "connect": {
    "sum": 999,              # sum of all the locations
    "sum_of_squares": 99999,
    "count": 99,
    "san_francisco": {
      "sum": 555,            # sum of this location
      "sum_of_squares": 55555,
      "count": 55,
      "values": [
        ["Mon Jun 07 2010 20:00:00 GMT", 12],
        ["Mon Jun 07 2010 20:10:00 GMT", 13],
        .........
      ]
    },
    ...
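The sum_of_squares field in the document above is what makes variance and standard deviation derivable without revisiting raw samples. A sketch of the standard identity Var(X) = E[X²] − E[X]² (the helper is ours, not from the talk):

```ruby
# Mean and (population) standard deviation from running aggregates:
# variance = E[X^2] - E[X]^2, computed from sum, sum_of_squares, count.
def stats(sum:, sum_of_squares:, count:)
  mean = sum.to_f / count
  variance = sum_of_squares.to_f / count - mean**2
  [mean, Math.sqrt(variance)]
end

mean, stddev = stats(sum: 25, sum_of_squares: 135, count: 5)
# mean = 5.0, variance = 135/5 - 25 = 2.0, stddev = sqrt(2)
```

Because both sum and sum_of_squares are just '$inc'-able counters, the same atomic-update scheme covers standard deviation for free.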
30. Putting it together

An incoming sample:

{ url: 'www.google.com',
  location: 'SFO',
  connect: 23,
  first_byte: 123,
  last_byte: 245,
  timestamp: 1234 }

1. Atomically update the daily data
2. Atomically update the weekly data
3. Atomically update the monthly data
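Each sample therefore lands in three rollup documents keyed by its time bucket. A sketch of how the daily / weekly / monthly keys might be derived from a sample's timestamp (bucket names and week convention are our assumptions):

```ruby
require 'date'

# Derive the daily, weekly, and monthly bucket keys for a sample's
# timestamp, so the same sample can be $inc'd into all three rollups.
# Weeks are taken to start on Sunday here; that choice is ours.
def bucket_keys(time)
  d = time.to_date
  {
    daily:   d,
    weekly:  d - d.wday,                  # back up to Sunday
    monthly: Date.new(d.year, d.month, 1) # first of the month
  }
end

keys = bucket_keys(Time.utc(2010, 9, 20, 14, 30))  # a Monday
```

Each key then goes into the selector of its own atomic update, exactly as on the previous slide.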
31. Sharding our Data

[Diagram: collection servers write to four shards; reporting servers read from them. URLs 1-8 are distributed two per shard across shards 1-4.]

Shard by URL:
- Write load evenly distributed
- Most reads hit a single shard
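Why a { url: 1 } shard key gives those two properties can be seen with a toy model: each shard owns a contiguous range of the key space, so writes for many URLs spread out while a query for one URL routes to exactly one shard. This is only an illustration of the idea, not how mongos actually stores or balances chunks:

```ruby
# Toy model of range-based sharding on a { url: 1 } key: each shard
# owns a half-open range [lo, hi) of URL strings.
CHUNKS = [
  ['a', 'g', :shard1],
  ['g', 'n', :shard2],
  ['n', 't', :shard3],
  ['t', '{', :shard4]   # '{' sorts just after 'z' in ASCII
].freeze

def shard_for(url)
  CHUNKS.find { |lo, hi, _| url >= lo && url < hi }.last
end

shard_for('www.google.com')   # routes to a single shard
```

A chart query for one URL touches one entry in this map, while samples for a mixed population of URLs land across all four.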
32. 3 Steps to Real Time Analytics

1. Collect data
2. Store data
3. Display reports
33. Drawing connect time graph

A compound index makes this query fast. The query selects the data for google across the chart's range of dates; the projection returns just the connect time data, but we can include as many metrics as we want.

db.metrics.dailies.ensureIndex({ url: 1, day: -1 })
db.metrics.dailies.find(
  { url: 'www.google.com',
    day: { '$gte': new Date(2010,9,1),
           '$lte': new Date(2010,9,30) } },
  { 'connect': true }
);
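The documents that come back reduce to one (day, average) point each for the chart. A sketch of that reduction in Ruby, with field names following the earlier example document (the inline data is illustrative):

```ruby
# Turn pre-aggregated daily documents into chart points:
# one { day, avg } pair per document, where avg = sum / count.
def chart_points(docs)
  docs.map do |doc|
    c = doc['connect']
    { day: doc['day'], avg: c['sum'].to_f / c['count'] }
  end
end

points = chart_points([
  { 'day' => '2010-09-01', 'connect' => { 'sum' => 100, 'count' => 4 } },
  { 'day' => '2010-09-02', 'connect' => { 'sum' => 90,  'count' => 3 } }
])
```

No per-sample work happens at read time; the chart is a straight map over 30 documents.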
34. More efficient charts

| URL | Day | <data> |

[Figure: AVG over the Day 1, Day 2, and Day 3 documents produces the result]

1 document per URL per day: 30 days == 30 documents, so an average chart hits 30 documents. 20x fewer.
35. Real Time Updates

| URL | Most Recent Data |

A single query fetches all metric data for a URL. Fast enough that the browser can poll constantly for updated data without impacting the server.
36. Evaluation
• High write volume
– Currently handling thousands of DB writes per second on a single MongoDB server
– Adding ~5GB per day
• Small Engineering Team
– Core system built by 2 engineers in <1 month
• Agile
– BDD using Rails
• Limited operations budget
– Runs on a handful of EC2 instances
– No major issues
37. Final thoughts

• Love MongoDB. (It's now my default when starting a new project.)
• Using MongoMapper as the ORM, but I think there must be a better way, more in tune with the document model rather than a port of ActiveRecord.
• There's magic in documents, but it requires thinking about your data in new ways.