Social Analytics with MongoDB



  1. Platform Overview
     Social analytics on MongoDB
     New York MongoDB user group
     Feb 24, 2011
  2. Why MongoDB?
  3. Why MongoDB?
     There was no real reason. It just looked cool, and we like messing with cool things because then we are cool by association.
  4. Why MongoDB?
     OK, actually we started with Cassandra, because Facebook wrote Cassandra and Facebook is really cool.
  5. Why MongoDB?
     Ya know what's not cool? Thrift is not cool.
     "Thrift is a software framework for scalable cross-language services development. It combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, and OCaml."
  6. Why MongoDB?
     So MongoDB it was.
     Also, 10gen is in NYC, and what's cooler than NYC? Nothing.
  7. Transition to Mongo
     3 stages:
     Stage 1 – Started blindly writing code on a secret project. Data was NON-critical.
     Stage 2 – Stage 1 worked out pretty well; we think we're pretty smart now. Let's go all-in.
     Stage 3 – OK, now we are REALLY smart. Let's use it for analytics.
  8. Stage 1
     Stage 1: non-critical project with non-critical data.
     We thought it would be cool to let clients view their user activity.
     Mostly, we wanted to be able to prove to our project managers that it was the client who screwed up their content, not our platform.
     Effectively a LOG application. (We didn't even read that.)
  9. Stage 1 (diagram slide)
  10. Stage 1 – what we learned
      MongoDB is not MySQL. We re-learn this almost every day. You will too.
      "Schemaless" is a bad way to think about it. Schema is INCREDIBLY important.
      DBRefs for foreign keys are kind of nasty.
        If you *really* need FKs, it's usually easier to just use the IDs naturally.
        You probably do this in MySQL anyway and don't bother with FK constraints. Your app handles it.
        You probably don't need references. MongoDB is not MySQL.
      Use sub-documents. If you don't, you've just got rows and tables, and rows and tables is MySQL.
        It's OK to store data in sub-docs that will change later. If it's not… you're probably trying to use the wrong tool.
      Don't let that 4 MB document limit worry you too much. 4 MB is a lot.
      Use the right tool for the job! Typical jobs include: logging, queues, aggregate analytics.
      A BSON object is not an ORM object.
        DON'T take the whole document, alter it, and re-save it (ORM style).
        You don't need ORM.
      You probably don't need a heavy abstraction layer. It sort of depends on what language you're using.
        If you're using PHP, you might want an abstraction layer. You might also want a new language…
      (Stack: PHP, Python, JavaScript console)
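The "use sub-documents" advice above can be sketched with a plain Python dict. The log-entry shape and every field name here are hypothetical, invented for illustration rather than taken from the actual Buddy Media schema:

```python
# Hypothetical log event, modeled the "Mongo way": one document per event,
# with related details embedded as sub-documents instead of joined tables.
log_entry = {
    "client_id": 42,                      # a plain ID, not a DBRef
    "action": "content_updated",
    "timestamp": "2011-02-23 15:32:00",
    "user": {                             # sub-document, not a foreign key
        "name": "jsmith",
        "ip": "10.0.0.5",
    },
    "changes": [                          # embedded list of sub-documents
        {"field": "title", "old": "A", "new": "B"},
    ],
}

# The relational equivalent would be three tables (events, users, changes)
# joined on IDs; here a single fetch returns the whole event.
assert log_entry["user"]["name"] == "jsmith"
assert len(log_entry["changes"]) == 1
```

One fetch, no joins, and the document can grow new embedded fields later without a migration.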
  11. Stage 1 – what we learned
      Most importantly: don't let people find out about secret side projects. BLAMO, you're in production.
  12. Stage 2
      Stage 2: critical data. Medium volume. Big spikes.
      We manage hundreds of pages on Facebook, which account for hundreds of millions of fans.
      Those fans post to brand walls.
      Let's build an app which lets the brand moderate that content.
  13. Stage 2 (diagram slide)
  14. Stage 2 – what we learned
      Modifier operators are dope.
        Remember to use $set. MySQL complains; Mongo happily destroys your document.
      Tell your query what you want returned.
      Be careful with 64-bit integers on 32-bit machines. (Facebook uses 64-bit IDs.)
      Read everything you can about indexing. You will likely create 2 dozen indexes that never get used.
      Make sure your indexes fit in memory.
      Use replica sets. Seriously, use them.
      .stats(), .explain(), the profiler, and mongostat are your friends.
        Got slow queries? Use .explain() + .stats() to figure out whether they're using your indexes effectively. Try .hint().
        Got slow queries but can't find them? Use the profiler. Hell, you can query your queries!
        Still slow? Use mongostat to look for faults and locks. Faults = going to disk.
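The $set pitfall can be sketched without a server. `mongo_update` below is a toy in-memory simulation of MongoDB's two update forms, not a real driver call (and it glosses over details like `_id` being preserved on a real replacement):

```python
def mongo_update(stored, update):
    """Toy simulation of MongoDB update semantics on one matched document.

    A bare document REPLACES the stored document; {"$set": {...}} merges
    only the named fields in. (Real MongoDB also keeps _id on a replace;
    that detail is omitted here.)
    """
    if "$set" in update:
        merged = dict(stored)
        merged.update(update["$set"])   # only the named fields change
        return merged
    return dict(update)                  # bare doc: full replacement

doc = {"_id": 1, "name": "page", "fans": 1000}

# Forgetting $set: everything but the new field is destroyed.
replaced = mongo_update(doc, {"fans": 1001})
assert replaced == {"fans": 1001}

# With $set: the rest of the document survives.
merged = mongo_update(doc, {"$set": {"fans": 1001}})
assert merged == {"_id": 1, "name": "page", "fans": 1001}
```

MySQL would reject the malformed statement; Mongo cheerfully performs the replacement, which is why this lesson made the slide.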
  15. Stage 3 – Analytics
      Lots of "stuff" happens at Buddy Media.
      We needed a structured way to keep track of all of that stuff.
      Has to be flexible enough to handle different levels of aggregation.
      Has to be near-real-time (1-minute aggregates).
      Need to be able to add new "stuff" or aggregates on the fly.
      Needs to handle lots of writes very fast.
  16. Stage 3 – Analytics
      MongoDB does.
      Well, DUH! That's everyone's dream analytics system! What makes you special?
  17. Stage 3 – Analytics
      Upsert + $inc.
      *Remember earlier when I said modifier operators are awesome?
  18. Stage 3 – Analytics
      A metric:
      {
          "_id" : ObjectId("4d656bd84b4395dce2bb7110"),
          "aggregates" : {
              "site1" : 3,
              "site2" : 2
          },
          "type" : "pageviews",
          "period" : "minute",
          "start_date" : "2011-02-23 15:32:00"
      }
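The upsert + $inc pattern behind a metric document like the one above can be sketched as a toy in-memory simulation (plain dicts, no server; the helper name and its mechanics are illustrative, not a driver API):

```python
def upsert_inc(collection, key, inc):
    """Mimic db.metrics.update(key, {"$inc": inc}, upsert=True) on a list."""
    doc = next((d for d in collection
                if all(d.get(k) == v for k, v in key.items())), None)
    if doc is None:
        doc = dict(key)            # upsert: seed a new doc from the query key
        collection.append(doc)
    for dotted, n in inc.items():  # "aggregates.site1" -> nested counter
        sub, field = dotted.split(".")
        doc.setdefault(sub, {})
        doc[sub][field] = doc[sub].get(field, 0) + n   # missing counters start at 0
    return doc

metrics = []
key = {"type": "pageviews", "period": "minute",
       "start_date": "2011-02-23 15:32:00"}
upsert_inc(metrics, key, {"aggregates.site1": 3, "aggregates.site2": 2})
upsert_inc(metrics, key, {"aggregates.site1": 1})   # same minute: no new doc

assert len(metrics) == 1
assert metrics[0]["aggregates"] == {"site1": 4, "site2": 2}
```

The first write for a minute creates the document; every later write for that minute just bumps counters. No read-modify-write cycle, no race.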
  19. (diagram slide)
  20. Stage 3 – Analytics
      Storing the metrics
      Originally, we keyed an event on [type, period, start_date, object]. This got huge… fast.
        Think pageviews. If you have 1000 pages and use a unique document to store pageviews for each object, you have max 1000 * 60 documents per hour. That's for ONE metric.
        It's not very Mongo-like (use sub-documents!).
        If I want to know the pageviews all 1000 pages got in an hour, by minute, I have to return and iterate over 60,000 documents.
      Instead, we key on [type, period, start_date]. This reduces the number of documents dramatically:
        60 documents per hour instead of 60,000.
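The document-count arithmetic from this slide, written out as a quick sanity check:

```python
pages = 1000
minutes_per_hour = 60

# Keying on [type, period, start_date, object]:
# every page gets its own document for every minute.
docs_per_object_keying = pages * minutes_per_hour
assert docs_per_object_keying == 60_000

# Keying on [type, period, start_date]:
# one document per minute, with all pages as sub-document counters.
docs_per_metric_keying = minutes_per_hour
assert docs_per_metric_keying == 60
```

Same data, a 1000x difference in documents scanned for an hourly-by-minute query.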
  21. Stage 3 – Analytics
      Pulling events off the queue:

      from datetime import datetime

      def getEvent(self, db):
          try:
              item = db.command(
                  'findAndModify',
                  'events',
                  query={"status.state": 0},
                  update={"$set": {"status": {"state": 1,
                                              "updated": datetime.utcnow()}}},
                  sort={"created_date": 1})
              return item['value']
          except Exception:
              return None

      One event at a time. No race conditions.
  22. Stage 3 – Analytics
      Events → Metrics
      While we can only pull one event off the queue at a time, that doesn't mean we should process one event at a time.
      Remember, our documents contain lots of "object" aggregates. We can update a whole bunch at once.
      We pull 10k events off the queue. At 10k (or an empty queue) we process them by creating local documents in memory, adding each object to each document.
      We then construct a single upsert per metric instead of per event.
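The batching step can be sketched as a pure-Python fold: collapse a batch of raw events into one $inc spec per metric document. The event shape and field names are illustrative assumptions, not the actual Buddy Media schema:

```python
from collections import defaultdict

def batch_events(events):
    """Fold raw events into one $inc upsert per metric.

    Returns {(type, period, start_date): {"aggregates.<object>": count}},
    so a batch of N events becomes one update per distinct metric document
    instead of N individual updates.
    """
    upserts = defaultdict(lambda: defaultdict(int))
    for e in events:
        key = (e["type"], e["period"], e["start_date"])
        upserts[key]["aggregates.%s" % e["object"]] += 1
    return upserts

events = [
    {"type": "pageviews", "period": "minute",
     "start_date": "2011-02-23 15:32:00", "object": "blue"},
    {"type": "pageviews", "period": "minute",
     "start_date": "2011-02-23 15:32:00", "object": "blue"},
    {"type": "pageviews", "period": "minute",
     "start_date": "2011-02-23 15:32:00", "object": "green"},
]

upserts = batch_events(events)
# Three events collapse into a single upsert for one metric document.
assert len(upserts) == 1
key = ("pageviews", "minute", "2011-02-23 15:32:00")
assert upserts[key] == {"aggregates.blue": 2, "aggregates.green": 1}
```

Each resulting incrementors dict would then be sent as one `{"$inc": ...}` upsert, as the next slide's code shows in shell form.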
  23. Stage 3 – Analytics
      Pageview events: all 5 occurred in minute 32 of 3pm.

      metric = {
          "type": "pageviews",
          "period": "minute",
          "start_date": "2011-02-23 15:32:00",
      }
      aggregates = {"blue": 3, "green": 2}

      incrementors = {}
      for agg in aggregates:
          incrementors['aggregates.%s' % agg] = aggregates[agg]

      db.metrics.update(metric, {'$inc': incrementors}, True)

      Result:
      db.metrics.find({type: "pageviews", period: "minute",
                       start_date: "2011-02-23 15:32:00"});
      {
          "_id" : ObjectId("4d656bd84b4395dce2bb7110"),
          "aggregates" : {
              "blue" : 3,
              "green" : 2
          },
          "type" : "pageviews",
          "period" : "minute",
          "start_date" : "2011-02-23 15:32:00"
      }
  24. Stage 3 – Analytics
      SQL:
      SELECT
          DATE_TRUNC('day', event_time) AS group_time,
          COUNT(DISTINCT event_transaction_id) AS counts
      FROM f_localevents
      WHERE 1=1
          AND module_id = '4ce420bc36913'
          AND event_name = 'polls.vote_submitted'
          AND event_time >= '2010-12-01 00:00:00'
          AND event_time < '2010-12-31 00:00:00'
      GROUP BY DATE_TRUNC('day', event_time)
      ORDER BY group_time;

      Time: 0 hrs, 0 mins, 33 secs, 264 ms (holy shit)

      MongoDB:
      db.metrics.find(
          {
              name: "module.polls.vote_submitted",
              period: "day",
              start_date: {
                  "$gte": "2010-12-01 00:00:00",
                  "$lte": "2010-12-31 23:59:59"
              }
          },
          {"aggregates.4ce420bc36913": 1}
      ).explain();

      {
          "cursor" : "BtreeCursor name_1_period_1_start_date_1",
          "nscanned" : 7,
          "nscannedObjects" : 7,
          "n" : 7,
          "millis" : 0,
          …
      }
  25. Stage 3 – what we learned
      You probably don't need sharding. But if there is one situation where sharding is going to come up quickly, it's on the cloud.
      I can't say, "give me all pageviews for pages in category A." There's a way to do this, but we haven't quite figured it out yet. Our app handles it for now.
      Fewer documents are always better. Find ways to combine data structures effectively.
      And last but not least… our LEAST favorite thing in MongoDB…
  26. Stage 3 – what we learned
      patrick@newdev:~$ mongo localhost
      MongoDB shell version: 1.6.4
      connecting to: localhost
      > use analtyics;
      switched to db analtyics
      >
      (Typo the database name and Mongo silently creates it. No complaint, no warning.)
  27. Shameless plug(s)
      We are hiring all walks of life: engineers, SysOps, product managers, UX designers.
      Get to work on cool problems like this. (That makes you cool by association.)
      Meet us at SXSW!
  28. Even more shameless…
      @patr1cks
      @buddymedia