Flexible Event Logging

Analyzing Funnels, Retention, and Viral Spread with MongoDB


       Paul Gebheim - Justin.tv
How can we effectively use our data to make Justin.tv better?
Questions

Who does what, and how? → Funnels

How valuable are groups of users? → Virality

Are our changes working? → Retention, Funnel Conversion
The Dream




A general framework for creating, deploying, and analyzing A/B tests in terms of Funnels, Virality, and Retention.
Backend Dreams


              Flexibility

            Queryability

             Scalability

... and it should be easy to work with
Backend Dreams... come true


             Schema-less!

  Rich data access/manipulation toolset

   At home in a web-centric toolchain

   Sharding, Map/Reduce, Replication
Let's build it...
Aggregating Data
              Web Site Events

[{
    "name": "front_page/broadcast_click",
    "date": "2010-04-20 12:00:00-0700",
    "unique_id": "fRx8zq",
    "bucket": "big_red_button"
},
{
    "name": "front_page/broadcast_click",
    "date": "2010-04-20 12:01:00-0700",
    "unique_id": "9aB8c2",
    "bucket": "small_blue_button"
}]
Aggregating Data
            Video System Events



[{
    "name": "broadcast/started",
    "date": "2010-04-20 12:10:00-0700",
    "unique_id": "fRx8zq",
    "bucket": "big_red_button",
    "channel": "my_1337_ch4nn31l"
}]
Processing Data


             Python

          Map/Reduce

     Configuration Documents

Generate/Apply MongoDB operations
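
As a concrete illustration of the last two steps, a configuration document can name which events to count and which field to key the counts by, and a small Python shim can turn each incoming event into a generated $inc operation. This is a minimal sketch using the current pymongo API; the collection name counter_config, the document fields, and apply_config are ours, not from the talk.

# Hypothetical configuration document: which events to count and
# which event field to group the counts by.
from pymongo import MongoClient

db = MongoClient().events_db

config = {
    "_id": "bucket_counts",
    "events": ["front_page/broadcast_click", "broadcast/started"],
    "group_field": "bucket",
}
db.counter_config.replace_one({"_id": config["_id"]}, config, upsert=True)

def apply_config(event, config):
    """Translate one incoming event into a generated MongoDB $inc."""
    if event["name"] not in config["events"]:
        return
    count_key = "counts.%s" % event[config["group_field"]]
    db.event_counts.update_one(
        {"_id": event["name"]},
        {"$inc": {count_key: 1}},
        upsert=True,
    )

Changing what gets counted then means editing a document, not redeploying code.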
Example:




Count how many times each event occurs per 'bucket'
Example
        Historical Data with SQL:




select
    event_name, bucket, count(*)
from
    events
group by event_name, bucket;
Mongo can do that!
 For small datasets, use collection.group()

var count_events_per_bucket = function() {
    return db.events.group({
        key: {name: 1, bucket: 1},
        cond: {/* include all events */},
        reduce: function(event, aggregate) {
            aggregate.count += 1;
        },
        initial: {
            count: 0
        }
    });
}
Mongo can do that!
For large datasets, use collection.mapReduce()
var count_events_per_bucket_big = function() {
    var res = db.events.mapReduce(
        // map
        function() {
            emit({
                name: this.name,
                bucket: this.bucket
            }, 1);
        },
        // reduce
        function(key, values_list) {
            var count = 0;
            values_list.forEach(function(v) {
                count += v;
            });
            return count;
        }
    );

    // results land in the collection named by res.result
    return db[res.result].find();
};
Mongo can also...
              be used to do the counting in real time!

matchers = {
    "front_page/broadcast_click": lambda event: event["bucket"],
    "broadcast/started": lambda event: event["bucket"]
}

for event in events:
    key = event["name"]
    if key in matchers:
        # extractDay maps the event timestamp to its day, e.g. "2010-04-20"
        count_key = "counts.%s.%s" % (
                        extractDay(event["date"]),
                        matchers[key](event))
        event_db.event_counts.update(
                {"_id": key},
                {"$inc": {count_key: 1}},
                multi=True, upsert=True)
    event_db.events.insert(event)
Example
       How the results appear in Mongo
> db.event_counts.find()
{
    "_id": "front_page/broadcast_click",
    "counts": {
        "2010-04-20": {
            "big_red_button": 1231,
            "small_blue_button": 86
        }
    }
}
{
    "_id": "broadcast/started",
    "counts": {
        "2010-04-20": {
            "big_red_button": 72,
            "small_blue_button": 6
        }
    }
}
>
What’s that we have there?



  First, Click the “Broadcast Button”

      Then, Start Broadcasting
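
With the counts above, per-bucket funnel conversion is just the ratio of the two steps. A minimal pymongo sketch, assuming the event_counts documents from the previous slide; the helper name conversion_rates is ours:

from pymongo import MongoClient

db = MongoClient().events_db

def conversion_rates(day):
    """Fraction of broadcast clicks that became broadcasts, per bucket."""
    clicks = db.event_counts.find_one({"_id": "front_page/broadcast_click"})
    starts = db.event_counts.find_one({"_id": "broadcast/started"})
    rates = {}
    for bucket, n_clicks in clicks["counts"][day].items():
        n_starts = starts["counts"][day].get(bucket, 0)
        rates[bucket] = float(n_starts) / n_clicks
    return rates

# -> {'big_red_button': 0.0585..., 'small_blue_button': 0.0697...}
print(conversion_rates("2010-04-20"))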
We can add more events...

  First, Click the “Broadcast Button”

            Authenticate

Click the Flash “Allow” or “Disallow” box

         Share with friends

                   ...

     Then, Start Broadcasting
Periodic Map/Reduce



Computing a bunch of stuff every half hour is fine if it's fast enough

 A program can generate arbitrarily complex Map/Reduce code...
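
A minimal sketch of what "generate Map/Reduce code" can look like: Python splices a config value into the JavaScript it ships to the server via the mapReduce command. The template and the output collection name event_counts_mr are assumptions, not the talk's actual generator.

from bson.code import Code
from pymongo import MongoClient

db = MongoClient().events_db

def build_map(group_field):
    # The grouping field from the config document is spliced into
    # the JavaScript sent to the server.
    return Code("""
        function() {
            emit({name: this.name, group: this.%s}, 1);
        }""" % group_field)

reduce_fn = Code("""
    function(key, values) {
        var count = 0;
        values.forEach(function(v) { count += v; });
        return count;
    }""")

# Run the generated job periodically (scheduling omitted here).
db.command("mapReduce", "events",
           map=build_map("bucket"),
           reduce=reduce_fn,
           out="event_counts_mr")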
Accurate Funnel Calculation

• Per user rollup
   – For each user, which steps in the funnel they have
     reached, with constraints applied
   – A map to get unique users, a reduce to count which
     unique events they triggered
• Per bucket rollup
   – For each bucket, how many users are at each ‘step’ in
     the funnel
   – Sum counts at each step per bucket (see the sketch
     below)
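
A minimal sketch of the two rollups in plain Python, rather than the generated Map/Reduce the talk describes, assuming the event documents shown earlier; FUNNEL and the intermediate dict shapes are ours.

from collections import defaultdict
from pymongo import MongoClient

db = MongoClient().events_db

FUNNEL = ["front_page/broadcast_click", "broadcast/started"]

def funnel_rollup():
    # Phase 1: per-user rollup. Which funnel steps did each
    # unique user reach? (Users keep the same bucket throughout.)
    steps_by_user = defaultdict(set)
    for ev in db.events.find({"name": {"$in": FUNNEL}}):
        steps_by_user[(ev["unique_id"], ev["bucket"])].add(ev["name"])

    # Phase 2: per-bucket rollup. How many users reached each step?
    counts = defaultdict(lambda: defaultdict(int))
    for (user, bucket), steps in steps_by_user.items():
        for step in steps:
            counts[bucket][step] += 1
    return counts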
Same strategy...




All calculations ended up being done in batch jobs...
Thoughts...




Interactive performance poor during M/R jobs

      Eliot says this is fixed in 1.5.0 :-)
Thoughts...




Even so... it's fast enough!
Future work



Migrating old Postgres-backed system to MongoDB

  Real-time calculation for timeseries data

   Batch jobs for Funnel, Retention, and Virality
