MongoDB for Analytics
A loving conversation with @jnunemaker




MongoChicago 2012        John Nunemaker
November 12, 2012                 GitHub
Background
How hernias can be good for you
1 month
Of evenings and weekends
18 months
Since public launch
10-15 Million
Page views per day
2.7 Billion
Page views to date
13 tiny servers
2 web, 6 app, 3 db, 2 queue
[Graphs: requests/sec, ops/sec, cpu %, lock %]
Implementation
How we do what we do
Doing It (mostly) Live
No aggregate querying
get('/track.gif') do
  track_service.record(...)
  TrackGif
end
class TrackService
  def record(attrs)
    message = MessagePack.pack(attrs)
    @client.set(@queue, message)
  end
end
class TrackProcessor
  def run
    loop { process }
  end

  def process
    record @client.get(@queue)
  end

  def record(message)
    attrs = MessagePack.unpack(message)
    Hit.record(attrs)
  end
end
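A minimal sketch of the same enqueue-then-process flow, runnable without Kestrel or MessagePack. MemoryQueueClient, SketchTrackService and SketchTrackProcessor are hypothetical names, JSON stands in for MessagePack, and an in-memory array stands in for the queue server; none of this is the production code.

```ruby
require 'json'

# Hypothetical stand-in for the queue client: named in-memory FIFO queues.
class MemoryQueueClient
  def initialize
    @queues = Hash.new { |hash, name| hash[name] = [] }
  end

  def set(queue, message)
    @queues[queue] << message
  end

  # Returns nil when the queue is empty.
  def get(queue)
    @queues[queue].shift
  end
end

# Web side: serialize the hit and enqueue it, nothing more.
class SketchTrackService
  def initialize(client, queue)
    @client = client
    @queue = queue
  end

  def record(attrs)
    @client.set(@queue, JSON.generate(attrs))
  end
end

# Worker side: dequeue, deserialize, record.
class SketchTrackProcessor
  attr_reader :recorded

  def initialize(client, queue)
    @client = client
    @queue = queue
    @recorded = []
  end

  # Drains the queue once; the processor in the talk loops forever instead.
  def drain
    while (message = @client.get(@queue))
      record(message)
    end
  end

  def record(message)
    # Assumption: collecting into an array replaces Hit.record(attrs).
    @recorded << JSON.parse(message)
  end
end

client = MemoryQueueClient.new
SketchTrackService.new(client, 'hits').record('site' => 'abc', 'screenx' => 1440)
processor = SketchTrackProcessor.new(client, 'hits')
processor.drain
```

The point of the split is that the request for track.gif only pays for one serialize and one enqueue; all database work happens in the processor.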
http://bit.ly/rt-kestrel
class Hit
  def record
    site.atomic_update(site_updates)

    Resolution.record(self)
    Technology.record(self)
    Location.record(self)
    Referrer.record(self)
    Content.record(self)
    Search.record(self)
    Notification.record(self)
    View.record(self)
  end
end
class Resolution
  # Called as Resolution.record(hit), so this is a class method.
  def self.record(hit)
    query = {'_id' => "..."}
    update = {'$inc' => {}}
    update['$inc']["sx.#{hit.screenx}"] = 1
    update['$inc']["bx.#{hit.browserx}"] = 1
    update['$inc']["by.#{hit.browsery}"] = 1

    collection(hit.created_on)
      .update(query, update, :upsert => true)
  end
end
Pros
 Space
 RAM
 Reads
 Live
Cons
 Writes
 Constraints
 More Forethought
 No raw data
http://bit.ly/rt-counters
http://bit.ly/rt-counters2
Time Frame
Minute, hour, month, day, year, forever?
# of Variations
One document vs many
Single Document
Per Time Frame
{
    "t" => 336381,
    "u" => 158951,
    "2011" => {
      "02" => {
        "18" => {
          "t" => 9,
          "u" => 6
        }
      }
    }
}
{
    '$inc' => {
      't' => 1,
      'u' => 1,
      '2011.02.18.t' => 1,
      '2011.02.18.u' => 1,
    }
}
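One way the '$inc' document above could be built for an arbitrary date. This is a sketch under assumptions: time_frame_increments is a hypothetical helper, "t" and "u" are read as total and unique counts, and a single `unique` flag covers both the all-time and the per-day unique counter, which the real system may well track separately.

```ruby
require 'date'

# Builds the '$inc' document for one hit on the given date.
# unique: hypothetical flag, true when the hit counts as a unique view.
def time_frame_increments(date, unique)
  day_key = date.strftime('%Y.%m.%d') # e.g. "2011.02.18"
  inc = { 't' => 1, "#{day_key}.t" => 1 }
  if unique
    inc['u'] = 1
    inc["#{day_key}.u"] = 1
  end
  { '$inc' => inc }
end

unique_update = time_frame_increments(Date.new(2011, 2, 18), true)
repeat_update = time_frame_increments(Date.new(2011, 2, 18), false)
```

Because every counter is a dotted path into the same document, one update touches the all-time and the per-day totals together.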
Single Document
For all ranges in time frame
{
    "_id" =>"...:10",
    "bx" => {
      "320" => 85,
      "480" => 318,
      "800" => 1938,
      "1024" => 5033,
      "1280" => 6288,
      "1440" => 2323,
      "1600" => 3817,
      "2000" => 137
    },
    "by" => {
      "480" => 2205,
      "600" => 7359,
      "768" => 4515,
      "900" => 3833,
      "1024" => 2026
    },
    "sx" => {
      "320" => 191,
      "480" => 179,
      "800" => 195,
      "1024" => 1059,
      "1280" => 5861,
      "1440" => 3533,
      "1600" => 7675,
      "2000" => 1279
    }
}
{
    '$inc' => {
      'sx.1440' => 1,
      'bx.1280' => 1,
      'by.768' => 1,
    }
}
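To see why a dotted-key '$inc' with upsert needs no read before the write, here is a small in-memory imitation of what the server does with such an update. apply_inc is a hypothetical helper for illustration only; it covers just '$inc' on nested hashes and is not how MongoDB is implemented.

```ruby
# Applies a MongoDB-style '$inc' update to a plain Ruby hash.
# Missing parent hashes and missing counters are created on the fly,
# which mirrors what '$inc' does on a missing field.
def apply_inc(document, update)
  update.fetch('$inc').each do |path, amount|
    *parents, leaf = path.split('.')
    target = parents.reduce(document) { |node, key| node[key] ||= {} }
    target[leaf] = target.fetch(leaf, 0) + amount
  end
  document
end

doc = {} # an upsert starts from an empty document
update = { '$inc' => { 'sx.1440' => 1, 'bx.1280' => 1, 'by.768' => 1 } }
2.times { apply_inc(doc, update) }
```

After two identical hits each counter holds 2, and the nested "sx", "bx" and "by" hashes were never created explicitly.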
Many Documents
Search terms, content, referrers...
[
    {
      "_id"   =>   "<oid>:<hash>",
      "t"     =>   "ruby class variables",
      "sid"   =>   BSON::ObjectId('<oid>'),
      "v"     =>   352
    },
    {
      "_id"   =>   "<oid>:<hash>",
      "t"     =>   "ruby unless",
      "sid"   =>   BSON::ObjectId('<oid>'),
      "v"     =>   347
    },
]
Writes
{'_id' => "#{sid}:#{hash}"}
Reads
[['sid', 1], ['v', -1]]
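A sketch of the many-documents pattern with a plain hash in place of the collection. The write side derives a deterministic _id from site and term so that each hit is a single upsert by key; the read side returns a site's terms ordered by views, which is the access path the ['sid', 1], ['v', -1] index serves. The hash function is not named in the slides, so MD5 here is an assumption, and term_id, record_term and top_terms are hypothetical helpers.

```ruby
require 'digest/md5'

# Assumption: MD5 of the term stands in for the unspecified <hash>.
def term_id(sid, term)
  "#{sid}:#{Digest::MD5.hexdigest(term)}"
end

# Write path: upsert by _id, then increment the view counter "v".
def record_term(store, sid, term)
  id = term_id(sid, term)
  doc = (store[id] ||= { '_id' => id, 'sid' => sid, 't' => term, 'v' => 0 })
  doc['v'] += 1
end

# Read path: one site's terms, most viewed first.
def top_terms(store, sid, limit)
  store.values
       .select { |doc| doc['sid'] == sid }
       .sort_by { |doc| -doc['v'] }
       .first(limit)
end

store = {}
3.times { record_term(store, 'site1', 'ruby unless') }
5.times { record_term(store, 'site1', 'ruby class variables') }
record_term(store, 'site2', 'mongodb')
top = top_terms(store, 'site1', 1)
```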
Growth
Don’t say shard, don’t say shard...
Partition Hot Data
Currently using collections for time frames
[
    "content.2011.7",
    "content.2011.8",
    "content.2011.9",
    "content.2011.10",
    "content.2011.11",
    "content.2011.12",
    "content.2012.1",
    "content.2012.2",
    "content.2012.3",
    "content.2012.4",
]
[
    "resolutions.2011",
    "resolutions.2012",
]
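The collection names above follow a simple pattern, sketched here as a hypothetical collection_name helper. Which data uses which period is taken from the two lists (content per month, resolutions per year); the helper itself is an illustration, not the talk's code.

```ruby
require 'date'

# Maps a base name and a date to a time-partitioned collection name.
# Months are not zero-padded, matching names such as "content.2011.7".
def collection_name(base, date, period)
  case period
  when :month then "#{base}.#{date.year}.#{date.month}"
  when :year  then "#{base}.#{date.year}"
  else raise ArgumentError, "unknown period: #{period.inspect}"
  end
end

monthly = collection_name('content', Date.new(2011, 7, 4), :month)
yearly  = collection_name('resolutions', Date.new(2012, 3, 1), :year)
```

Writes for the current period all land in one collection, so older collections go cold and can drop out of RAM.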
Move
BigintMove
MakeYouWannaMove
DaMove
SmoothMove
NightMove
DanceMove
Bigger, Faster Server
More CPU, RAM, Disk Space
Users
Sites
Content
Referrers
Terms
Engines
Resolutions
Locations
Partition by Function
Spread writes across a few servers
Users
Sites
Content
Referrers
Engines
Terms
Locations
Resolutions
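Partitioning by function can be as small as a lookup from collection family to server. The sketch below is purely illustrative: the host names and the grouping of functions onto servers are invented, and server_for is a hypothetical helper.

```ruby
# Hypothetical mapping of each function to the server that owns it.
SERVERS_BY_FUNCTION = {
  'users'       => 'db1.example.com',
  'sites'       => 'db1.example.com',
  'content'     => 'db2.example.com',
  'referrers'   => 'db2.example.com',
  'engines'     => 'db3.example.com',
  'terms'       => 'db3.example.com',
  'locations'   => 'db4.example.com',
  'resolutions' => 'db4.example.com'
}.freeze

# Resolves the server for a collection such as "content.2012.4" by
# looking at the part of the name before the first dot.
def server_for(collection_name)
  function = collection_name.split('.').first
  SERVERS_BY_FUNCTION.fetch(function) do
    raise ArgumentError, "no server configured for #{function}"
  end
end

content_server = server_for('content.2012.4')
users_server = server_for('users')
```

Since each hit fans out into writes for several functions, a mapping like this spreads one hit's writes over several machines without any sharding logic.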
Partition by Server
Spread writes across a ton of servers,
way down the road, not worried yet
Thank you!
john@github.com
@jnunemaker


