Realtime Analytics with MongoDB - MongoDB Meetup NYC

How Yottaa used MongoDB & Ruby on Rails to build a scalable realtime analytics platform. This was my presentation for the NYC MongoDB Meetup on 11-16-2010.

  1. Realtime Analytics with MongoDB & Rails. Jared Rosoff (@forjared, forjared@gmail.com). Yottaa Inc., 2 Canal Park, 5th Floor, Cambridge, MA 02141. http://www.yottaa.com
  2. Overview • About Yottaa • Engineering challenges • Approaches we considered • How we did it • How it works
  3. Who’s driving your website? http://stop-the-damage.com/2010/08/276/
  4. We can help you make it faster. (“OMG!! 15 seconds? WTF?”)
  5. Knowing is half the battle. Locations: San Francisco, Washington DC, London.
  6. Data data everywhere • We collect lots of data – 14,000+ URLs being tracked – Up to 300 samples per URL per day – Some samples are >1 MB (Firebug) – Missing a sample isn’t a big deal • We try to make everything real time – No batch jobs, everything displayed as it happens – “Check Now” button runs tests on demand
  7. Demo!
  8. Engineering Challenges • High write volume from day 1 – Sample collection is like having millions of users on the first day – After 60 days, we have >150 GB of data – Adding about 5 GB/day today • Small engineering team – 1 engineer built the data warehouse & portal, 1 built the monitoring agents – Bigger team now, but this was how we started • Must be agile – We didn’t know exactly what features we’d need – Requirements change daily • Limited operations budget – No full-time operations staff – 100% in the cloud
  9. Rails default architecture: Data Source → Collection Server → MySQL → Reporting Server → User. “Just” a Rails app. Performance bottleneck: too much load.
  10. Let’s add replication! Data Source → Collection Server → MySQL master, replicating to read slaves → Reporting Server → User. Off the shelf! Scalable reads! Performance bottleneck: still can’t scale writes.
  11. What about sharding? Data Source → Collection Server → several MySQL masters (sharding) → Reporting Server → User. Scalable writes! Development bottleneck: need to write custom code.
  12. Key Value stores to the rescue? Data Source → Collection Server → Cassandra or Voldemort → Reporting Server → User. Scalable writes! Development bottleneck: reporting is limited / hard.
  13. Can I Hadoop my way out of this? Data Source → Collection Server → Cassandra or Voldemort → Hadoop → MySQL master plus slaves → Reporting Server → User. Scalable writes! Flexible reports! “Just” a Rails app. Development bottleneck: too many systems!
  14. MongoDB! Data Source → Collection Server → MongoDB → Reporting Server → User. Scalable writes! “Just” a Rails app. Flexible reporting!
  15. Data Source and User → Load Balancer → App Servers (Nginx + Passenger, handling Collection and Reporting) → mongos → mongod shards. Sharding! High concurrency! Scale-out! Easy as Rails!
  16. 3 Steps to Real Time Analytics: 1. Collect data 2. Store data 3. Display reports
  17. 3 Steps to Real Time Analytics: 1. Collect data 2. Store data 3. Display reports
  18. Collecting Data: many Data Sources POST to http://collector.com/samples through a Load Balancer (we use Amazon ELB) to a pool of Collection Servers (we use Amazon EC2).
  19. Collecting Data - Sample data is passed in the body of the POST request - Rails makes it really easy to parse JSON, XML, or YAML (we use JSON) - We have a bunch of other stuff that happens when data arrives, but all you really need to do is write the data
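A minimal sketch of such a collection endpoint, assuming a plain Rails controller and the 1.x mongo-ruby-driver of the era; the controller, database, and collection names are illustrative, not Yottaa's actual code:

    # app/controllers/samples_controller.rb -- hypothetical collection endpoint
    require 'json'
    require 'mongo'   # mongo-ruby-driver 1.x

    class SamplesController < ApplicationController
      skip_before_filter :verify_authenticity_token   # machine-to-machine POSTs, no CSRF token

      # POST /samples -- the load balancer forwards the agents' requests here
      def create
        sample = JSON.parse(request.body.read)        # sample data arrives as JSON in the POST body
        self.class.samples.insert(sample)             # all you really need to do is write the data
        head :ok
      rescue JSON::ParserError
        head :bad_request                             # missing a sample isn't a big deal
      end

      # One driver connection per process; 'analytics' / 'raw_samples' are made-up names.
      def self.samples
        @@samples ||= Mongo::Connection.new('localhost', 27017).db('analytics').collection('raw_samples')
      end
    end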
  20. A Sample Sample!
      {
        url: 'www.google.com',
        location: 'SFO',
        connect: 23,
        first_byte: 123,
        last_byte: 245,
        timestamp: 1234
      }
  21. A more complicated example
  22. The raw sample is a full Firebug HTTP Archive, with one entry per resource on the page (excerpt):
      {
        "location": "aws-us-east",
        "timestamp": "08/05/2010 07:11:54",
        "http_archive": {
          "log": {
            "creator": { "name": "Firebug", "version": "1.4.3" },
            "version": "1.1",
            "pages": [
              {
                "title": "\u4e2d\u56fd\u7f51\u7edc\u7535\u89c6\u53f0-CNTV",
                "id": "page_0",
                "startedDateTime": "2010-08-05T08:11:51.897+01:00",
                "pageTimings": { "onContentLoad": 1883, "onLoad": 2828 }
              }
            ],
            "entries": [
              {
                "pageref": "page_0",
                "startedDateTime": "2010-08-05T08:11:51.897+01:00",
                "time": 580,
                "request": { "method": "GET", "url": "http://www.cntv.cn/",
                             "httpVersion": "HTTP/1.1", "headersSize": -1, "bodySize": -1 },
                "response": { "status": 200, "statusText": "OK", "httpVersion": "HTTP/1.1",
                              "headersSize": -1, "bodySize": 2067,
                              "content": { "size": 4467, "mimeType": "text/html" },
                              "redirectURL": "" },
                "cache": {},
                "timings": { "blocked": null, "dns": 0, "connect": null,
                             "send": 0, "wait": 561, "receive": 19 }
              },
              {
                "pageref": "page_0",
                "startedDateTime": "2010-08-05T08:11:52.481+01:00",
                "time": 370,
                "request": { "method": "GET",
                             "url": "http://www.cntv.cn/nettv/homepage2010/globalhomepage_image/r_bg.jpg",
                             "httpVersion": "HTTP/1.1", "headersSize": -1, "bodySize": -1 },
                "response": { "status": 200, "statusText": "OK", "httpVersion": "HTTP/1.1",
                              "headersSize": -1, "bodySize": 740,
                              "content": { "size": 740, "mimeType": "image/jpeg" },
                              "redirectURL": "" },
                "cache": {},
                "timings": { "blocked": null, "dns": 0, "connect": null,
                             "send": 0, "wait": 188, "receive": 1 }
              },
              ...
            ]
          }
        }
      }
  23. 3 Steps to Real Time Analytics: 1. Collect data 2. Store data 3. Display reports
  24. Thinking in rows: each sample becomes a row in a table with columns URL, Location, Connect, First Byte, Last Byte, Timestamp. For example, { url: 'www.google.com', location: 'SFO', connect: 23, first_byte: 123, last_byte: 245, timestamp: 1234 } and { url: 'www.google.com', location: 'NYC', connect: 23, first_byte: 123, last_byte: 245, timestamp: 2345 }.
  25. Thinking in rows: against that table, the queries look like “What was the average connect time for Google on Friday? From SFO? From NYC? Between 1 AM and 2 AM?”
  26. Thinking in rows: to chart averages, you average the rows for Day 1, Day 2, Day 3, … and combine the results. With up to 100s of samples per URL per day and an average query range of 30 days, an “average” chart had to hit 600 rows.
  27. Thinking in Documents: one document contains all data for www.google.com collected during 9/20/2010. For each metric (e.g. Last Byte) it stores an overall aggregate (Sum 2312, Count 12), which gives the average value for that metric for this URL / time period, plus per-location aggregates: SFO (Sum 1200, Count 5) and NYC (Sum 1112, Count 7) give the average values from SFO and NYC.
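Those sum/count pairs make the average a constant-time read. A tiny illustrative Ruby helper (field names follow the sum/count layout on this slide; not Yottaa's actual code):

    # The average for a metric, or for a location nested under it, is just sum / count.
    def average(agg)
      return nil if agg.nil? || agg['count'].to_i.zero?
      agg['sum'].to_f / agg['count']
    end

    last_byte = { 'sum' => 2312, 'count' => 12,
                  'sfo' => { 'sum' => 1200, 'count' => 5 },
                  'nyc' => { 'sum' => 1112, 'count' => 7 } }

    average(last_byte)          # => ~192.7, the day's overall last-byte average
    average(last_byte['sfo'])   # => 240.0, the average from SFO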
  28. Storing a sample: atomically update the document
      db.metrics.dailies.update(
        { url: 'www.google.com', day: new Date(2010, 9, 2) },   // which document we're updating
        { '$inc': {
            'connect.sum': 1234,     'connect.count': 1,        // update the aggregate value
            'connect.sfo.sum': 1234, 'connect.sfo.count': 1     // update the location-specific value
        } },
        true  // upsert: create the document if it doesn't already exist
      );
  29. An example document
      {
        "_id": ObjectId("4bb55c59c3666e02fc000001"),
        "url": "http://www.google.com/",
        "date": "Mon Jun 07 2010 00:00:00 GMT",
        "connect": {
          "sum": 999,               // sum of all the locations
          "sum_of_squares": 99999,
          "count": 99,
          "san_francisco": {
            "sum": 555,             // sum of this location
            "sum_of_squares": 55555,
            "count": 55,
            "values": [
              ["Mon Jun 07 2010 20:00:00 GMT", 12],
              ["Mon Jun 07 2010 20:10:00 GMT", 13],
              .........
            ]
          },
  30. Putting it together: when a sample { url: 'www.google.com', location: 'SFO', connect: 23, first_byte: 123, last_byte: 245, timestamp: 1234 } arrives: 1. Atomically update the daily data. 2. Atomically update the weekly data. 3. Atomically update the monthly data.
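A sketch of that fan-out from the Rails side, assuming the 1.x mongo-ruby-driver and hypothetical collection names (metrics.dailies, metrics.weeklies, metrics.monthlies); the real system tracks more metrics than these three:

    # Apply one incoming sample to its daily, weekly and monthly documents.
    # `db` is a Mongo::DB handle, e.g. Mongo::Connection.new('localhost', 27017).db('analytics')
    METRICS = %w(connect first_byte last_byte)

    def record(db, sample)
      ts  = Time.at(sample['timestamp']).utc
      loc = sample['location'].downcase                          # e.g. 'sfo'
      day = Time.utc(ts.year, ts.month, ts.day)

      { 'metrics.dailies'   => day,
        'metrics.weeklies'  => day - ts.wday * 86_400,           # start of the week
        'metrics.monthlies' => Time.utc(ts.year, ts.month, 1) }.each do |coll, period|
        inc = {}
        METRICS.each do |m|
          inc["#{m}.sum"]          = sample[m]                   # the aggregate value...
          inc["#{m}.count"]        = 1
          inc["#{m}.#{loc}.sum"]   = sample[m]                   # ...and the location-specific value
          inc["#{m}.#{loc}.count"] = 1
        end
        db.collection(coll).update(
          { 'url' => sample['url'], 'day' => period },           # which document we're updating
          { '$inc' => inc },
          :upsert => true                                        # create it if it doesn't already exist
        )
      end
    end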
  31. Sharding our Data: shard by URL, so URL 1 … URL 8 spread evenly across Shard 1 … Shard 4. Write load is evenly distributed, and most reads hit a single shard.
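Turning that on is a couple of admin commands against a mongos (roughly db.runCommand({shardcollection: ..., key: {url: 1}}) in the shell of the time). A hedged sketch via the 1.x Ruby driver; the database and collection names are the hypothetical ones used above:

    require 'mongo'   # mongo-ruby-driver 1.x

    # Run against a mongos. Command key order matters ('shardcollection' must come
    # first), which Ruby 1.9+ hashes preserve.
    admin = Mongo::Connection.new('mongos-host', 27017).db('admin')

    admin.command({ 'enablesharding' => 'analytics' })
    admin.command({ 'shardcollection' => 'analytics.metrics.dailies',
                    'key'             => { 'url' => 1 } })       # shard by URL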
  32. 3 Steps to Real Time Analytics: 1. Collect data 2. Store data 3. Display reports
  33. Drawing the connect time graph
      // Compound index to make this query fast
      db.metrics.dailies.ensureIndex({ url: 1, day: -1 });
      db.metrics.dailies.find(
        { url: 'www.google.com',                    // data for google
          day: { '$gte': new Date(2010, 9, 1),      // the range of dates for the chart
                 '$lte': new Date(2010, 9, 30) } },
        { 'connect': true }                         // we just want connect time data, but we can
      );                                            // include as many metrics as we want
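The same query from Ruby, turning each daily document's sum/count into a chart point (1.x driver API; names as in the earlier sketches):

    require 'mongo'   # mongo-ruby-driver 1.x

    dailies = Mongo::Connection.new('localhost', 27017).db('analytics').collection('metrics.dailies')
    dailies.create_index([['url', 1], ['day', -1]])               # the compound index from the slide

    cursor = dailies.find(
      { 'url' => 'www.google.com',
        'day' => { '$gte' => Time.utc(2010, 10, 1),               # October, matching new Date(2010, 9, 1)
                   '$lte' => Time.utc(2010, 10, 30) } },
      :fields => { 'day' => 1, 'connect' => 1 }                   # just the connect-time data
    )

    points = cursor.map do |doc|
      agg = doc['connect'] || {}
      avg = agg['count'].to_i.zero? ? nil : agg['sum'].to_f / agg['count']
      [doc['day'], avg]                                           # one [day, average] pair per document
    end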
  34. More efficient charts: with 1 document per URL per day, 30 days == 30 documents, so the average chart hits 30 documents. 20x fewer.
  35. Real Time Updates: a single query fetches all metric data for a URL (one document per URL holds the most recent data), fast enough that the browser can poll constantly for updated data without impacting the server.
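A sketch of the reporting side of that polling as a hypothetical Rails action. It assumes a metrics.current collection keyed by URL, with one document holding that URL's most recent data as pictured on the slide; all names are illustrative:

    # app/controllers/reports_controller.rb -- hypothetical polling endpoint
    require 'mongo'   # mongo-ruby-driver 1.x

    class ReportsController < ApplicationController
      # GET /reports/current?url=www.google.com -- hit every few seconds by the browser
      def current
        doc = self.class.current_metrics.find_one({ 'url' => params[:url] })  # the single query per poll
        doc.delete('_id') if doc            # ObjectId doesn't render nicely as JSON; drop it
        render :json => (doc || {})
      end

      # One driver connection per process; 'analytics' / 'metrics.current' are made-up names.
      def self.current_metrics
        @@current_metrics ||= Mongo::Connection.new('localhost', 27017).db('analytics').collection('metrics.current')
      end
    end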
  36. Evaluation • High write volume – Currently handling thousands of db writes per second on a single MongoDB server – Adding ~5 GB per day • Small engineering team – Core system built by 2 engineers in <1 month • Agile – BDD using Rails • Limited operations budget – Runs on a handful of EC2 instances – No major issues
  37. Final thoughts • Love MongoDB. (It’s now my default when starting a new project.) • Using MongoMapper as the ORM, but I think there must be a better way, more in tune with the document model rather than a port of ActiveRecord • There’s magic in documents, but it requires thinking about your data in new ways.
  38. Q & A. Thank you for viewing.
