How Yottaa used MongoDB & Ruby on Rails to build a scalable realtime analytics platform. This was my presentation for the NYC MongoDB Meetup on 11-16-2010.
Realtime Analytics with MongoDB - MongoDB Meetup NYC
1. Yottaa Inc.
2 Canal Park 5th Floor
Cambridge MA 02141
http://www.yottaa.com
Realtime Analytics with
MongoDB & Rails
Jared Rosoff
@forjared
forjared@gmail.com
8. Engineering Challenges
• High write volume from day 1
– Sample collection is like having millions of users on the first day
– After 60 days, we have > 150GB of data
– Adding about 5GB / day today
• Small engineering team
– 1 built the data warehouse & portal, 1 built the monitoring agents
– Bigger team now, but this was how we started
• Must be Agile
– We didn’t know exactly what features we’d need
– Requirements change daily
• Limited operations budget
– No full time operations staff
– 100% in the cloud
19. Collecting Data
- Sample data is passed in body of POST request
- Rails makes it really easy to parse JSON, XML, YAML (we use JSON)
- We have a bunch of other stuff that happens when data arrives, but
all you really need to do is write the data
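The "all you really need to do is write the data" step can be sketched in plain Ruby. This is only an illustration of the endpoint's core logic, not Yottaa's actual code; the helper name and field list are ours:

```ruby
require 'json'

# Hypothetical sketch: parse a JSON sample from a POST body into
# the fields the rest of the pipeline expects. In Rails the parsed
# body would typically already be available via params.
def parse_sample(request_body)
  data = JSON.parse(request_body)
  {
    url:        data.fetch('url'),
    location:   data.fetch('location'),
    connect:    data.fetch('connect'),
    first_byte: data.fetch('first_byte'),
    last_byte:  data.fetch('last_byte'),
    timestamp:  data.fetch('timestamp')
  }
end

body = '{"url":"www.google.com","location":"SFO","connect":23,' \
       '"first_byte":123,"last_byte":245,"timestamp":1234}'
sample = parse_sample(body)
```

From here the sample goes straight into the atomic updates shown on the later slides.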
25. Thinking in rows

One row per sample:

| URL | Location | Connect | First Byte | Last Byte | Timestamp |

What was the average connect time for google on Friday? From SFO? From NYC? Between 1AM-2AM?
26. Thinking in rows

| URL | Location | Connect | First Byte | Last Byte | Timestamp |

[Figure: AVG over the rows for Day 1, Day 2, and Day 3 produces the result]

Up to 100's of samples per URL per day! With a 30-day average query range, an "average" chart had to hit 600 rows.
27. Thinking in Documents

{ URL: www.google.com,
  Day: 9/20/2010,
  Last Byte: {
    Sum: 2312, Count: 12,
    SFO: { Sum: 1200, Count: 5 },
    NYC: { Sum: 1112, Count: 7 } } }

This document contains all data for www.google.com collected during 9/20/2010. The top-level sum and count tell us the average value for this metric for this url / time period; the nested pairs give the average value from SFO and from NYC.
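Recovering an average from one of these documents is a single division of the stored sum by the stored count. A minimal Ruby sketch, with field names mirroring the example above (the helper name is ours):

```ruby
# Compute an average from the pre-aggregated sum/count pair
# stored in a metrics document.
def average(agg)
  agg['count'].zero? ? nil : agg['sum'].to_f / agg['count']
end

doc = {
  'last_byte' => {
    'sum' => 2312, 'count' => 12,
    'sfo' => { 'sum' => 1200, 'count' => 5 },
    'nyc' => { 'sum' => 1112, 'count' => 7 }
  }
}

overall  = average(doc['last_byte'])          # ~192.67 overall
from_sfo = average(doc['last_byte']['sfo'])   # 240.0 from SFO
```

The same helper works at every nesting level, which is the payoff of keeping the aggregate and per-location pairs in the same shape.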
28. Storing a sample

A single atomic update stores a sample. The query selects which document we're updating; '$inc' updates both the aggregate value and the location-specific value; the upsert flag creates the document if it doesn't already exist.

db.metrics.dailies.update(
  { url: 'www.google.com',
    day: new Date(2010,9,2) },
  { '$inc': {
      'connect.sum': 1234,
      'connect.count': 1,
      'connect.sfo.sum': 1234,
      'connect.sfo.count': 1 } },
  true // upsert
);
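On the Rails side, the selector and '$inc' document can be built from a sample hash before handing them to the driver. A sketch under our own naming (the driver call itself is omitted):

```ruby
# Build the MongoDB selector and atomic $inc update for one sample,
# incrementing both the aggregate and the location-specific counters.
def upsert_spec(sample)
  loc = sample[:location].downcase
  selector = { url: sample[:url], day: sample[:day] }
  update = { '$inc' => {
    'connect.sum'           => sample[:connect],
    'connect.count'         => 1,
    "connect.#{loc}.sum"    => sample[:connect],
    "connect.#{loc}.count"  => 1
  } }
  [selector, update]
end

sel, upd = upsert_spec(url: 'www.google.com', day: '2010-09-20',
                       location: 'SFO', connect: 1234)
# With a driver: collection.update_one(sel, upd, upsert: true)
```

Building the dotted key with interpolation is what lets one code path handle every location.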
29. An example document

{
  "_id": ObjectId("4bb55c59c3666e02fc000001"),
  "url": "http://www.google.com/",
  "date": "Mon Jun 07 2010 00:00:00 GMT",
  "connect": {
    "sum": 999,              # sum of all the locations
    "sum_of_squares": 99999,
    "count": 99,
    "san_francisco": {
      "sum": 555,            # sum of this location
      "sum_of_squares": 55555,
      "count": 55,
      "values": [
        ["Mon Jun 07 2010 20:00:00 GMT", 12],
        ["Mon Jun 07 2010 20:10:00 GMT", 13],
        .........
      ]
    },
    ...
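The sum_of_squares field in the document above is what makes variance and standard deviation derivable without revisiting raw samples. A sketch of the standard identity Var(X) = E[X²] − E[X]² (the helper is ours, not from the talk):

```ruby
# Mean and (population) standard deviation from running aggregates:
# variance = E[X^2] - E[X]^2, computed from sum, sum_of_squares, count.
def stats(sum:, sum_of_squares:, count:)
  mean = sum.to_f / count
  variance = sum_of_squares.to_f / count - mean**2
  [mean, Math.sqrt(variance)]
end

mean, stddev = stats(sum: 25, sum_of_squares: 135, count: 5)
# mean = 5.0, variance = 135/5 - 25 = 2.0, stddev = sqrt(2)
```

Because both sum and sum_of_squares are just '$inc'-able counters, the same atomic-update scheme covers standard deviation for free.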
30. Putting it together

An incoming sample:

{ url: 'www.google.com',
  location: 'SFO',
  connect: 23,
  first_byte: 123,
  last_byte: 245,
  timestamp: 1234 }

1. Atomically update the daily data
2. Atomically update the weekly data
3. Atomically update the monthly data
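Each sample therefore lands in three rollup documents keyed by its time bucket. A sketch of how the daily / weekly / monthly keys might be derived from a sample's timestamp (bucket names and week convention are our assumptions):

```ruby
require 'date'

# Derive the daily, weekly, and monthly bucket keys for a sample's
# timestamp, so the same sample can be $inc'd into all three rollups.
# Weeks are taken to start on Sunday here; that choice is ours.
def bucket_keys(time)
  d = time.to_date
  {
    daily:   d,
    weekly:  d - d.wday,                  # back up to Sunday
    monthly: Date.new(d.year, d.month, 1) # first of the month
  }
end

keys = bucket_keys(Time.utc(2010, 9, 20, 14, 30))  # a Monday
```

Each key then goes into the selector of its own atomic update, exactly as on the previous slide.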
31. Sharding our Data

[Diagram: collection servers write to four shards; reporting servers read from them. URLs 1-8 are distributed two per shard across shards 1-4.]

Shard by URL:
- Write load evenly distributed
- Most reads hit a single shard
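Why a { url: 1 } shard key gives those two properties can be seen with a toy model: each shard owns a contiguous range of the key space, so writes for many URLs spread out while a query for one URL routes to exactly one shard. This is only an illustration of the idea, not how mongos actually stores or balances chunks:

```ruby
# Toy model of range-based sharding on a { url: 1 } key: each shard
# owns a half-open range [lo, hi) of URL strings.
CHUNKS = [
  ['a', 'g', :shard1],
  ['g', 'n', :shard2],
  ['n', 't', :shard3],
  ['t', '{', :shard4]   # '{' sorts just after 'z' in ASCII
].freeze

def shard_for(url)
  CHUNKS.find { |lo, hi, _| url >= lo && url < hi }.last
end

shard_for('www.google.com')   # routes to a single shard
```

A chart query for one URL touches one entry in this map, while samples for a mixed population of URLs land across all four.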
32. 3 Steps to Real Time Analytics

1. Collect data
2. Store data
3. Display reports
33. Drawing connect time graph

A compound index makes this query fast. The query selects the data for google across the chart's range of dates; the projection returns just the connect time data, but we can include as many metrics as we want.

db.metrics.dailies.ensureIndex({ url: 1, day: -1 })
db.metrics.dailies.find(
  { url: 'www.google.com',
    day: { '$gte': new Date(2010,9,1),
           '$lte': new Date(2010,9,30) } },
  { 'connect': true }
);
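The documents that come back reduce to one (day, average) point each for the chart. A sketch of that reduction in Ruby, with field names following the earlier example document (the inline data is illustrative):

```ruby
# Turn pre-aggregated daily documents into chart points:
# one { day, avg } pair per document, where avg = sum / count.
def chart_points(docs)
  docs.map do |doc|
    c = doc['connect']
    { day: doc['day'], avg: c['sum'].to_f / c['count'] }
  end
end

points = chart_points([
  { 'day' => '2010-09-01', 'connect' => { 'sum' => 100, 'count' => 4 } },
  { 'day' => '2010-09-02', 'connect' => { 'sum' => 90,  'count' => 3 } }
])
```

No per-sample work happens at read time; the chart is a straight map over 30 documents.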
34. More efficient charts

| URL | Day | <data> |

[Figure: AVG over the Day 1, Day 2, and Day 3 documents produces the result]

1 document per URL per day: 30 days == 30 documents, so an average chart hits 30 documents. 20x fewer.
35. Real Time Updates

| URL | Most Recent Data |

A single query fetches all metric data for a URL. Fast enough that the browser can poll constantly for updated data without impacting the server.
36. Evaluation
• High write volume
– Currently handling thousands of DB writes per second on a single MongoDB server
– Adding ~5GB per day
• Small Engineering Team
– Core system built by 2 engineers in <1 month
• Agile
– BDD using Rails
• Limited operations budget
– Runs on a handful of EC2 instances
– No major issues
37. Final thoughts

• Love MongoDB. (It's now my default when starting a new project.)
• Using MongoMapper as the ORM, but I think there must be a better way, more in tune with the document model rather than a port of ActiveRecord.
• There's magic in documents, but it requires thinking about your data in new ways.