3. Why mongodb? 3 There was no real reason. It just looked cool and we like messing with cool things because then we are cool by association.
4. Why mongodb? 4 Ok, actually we started with Cassandra cuz Facebook wrote Cassandra and Facebook is really cool.
5. Why mongodb? 5 Ya know what’s not cool? Thrift is not cool. Thrift is a software framework for scalable cross-language services development. It combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, and OCaml.
6. Why mongodb? 6 So MongoDB it was. Also, 10gen is in NYC and what’s cooler than NYC? Nothing.
7. Transition to mongo 3 Stages
- Stage 1 – Started blindly writing code on a secret project. Data was NON-Critical.
- Stage 2 – Stage 1 worked out pretty well; we think we're pretty smart now. Let's go all-in.
- Stage 3 – OK, now we are REALLY smart. Let's use it for analytics.
8. Stage 1 Stage 1: Non-Critical Project w/ Non-Critical Data
- We thought it would be cool to let clients view their user activity.
- Mostly, we wanted to be able to prove to our Project Managers that it was the client who screwed up their content, not our platform.
- Effectively a LOG application.
- We didn't even read that.
10. Stage 1 – what we learned
- MongoDB is not MySQL. We re-learn this almost every day. You will too.
- "Schemaless" is a bad way to think about it. Schema is INCREDIBLY important.
- DBRef for foreign keys is kind of nasty. If you *really* need FKs, it's usually easier to just use the IDs directly. You probably do this in MySQL anyway and don't bother with FK constraints; your app handles it. You probably don't need references.
- MongoDB is not MySQL. Use sub documents. If you don't, you've just got rows and tables, and rows and tables is MySQL.
- It's OK to store data in sub docs that will change later. If it's not… you're probably trying to use the wrong tool.
- Don't let that 4mb document limit worry you too much. 4mb is a lot.
- Use the right tool for the job! Typical jobs include: Logging, Queues, Aggregate Analytics.
- A BSON object is not an ORM object. DON'T take the whole document, alter it, and re-save it. You don't need an ORM.
- You probably don't need a heavy abstraction layer. It sort of depends on what language you're using. If you're using PHP, you might want an abstraction layer. You might also want a new language… (PHP, Python, JavaScript, Console)
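A rough sketch of the "sub documents, not rows and tables" and "don't re-save the whole document" points, in plain Python (all names and fields here are made up for illustration; a real app would send the modifier via pymongo's collection.update):

```python
# Hypothetical log document: one doc per user session, with events
# embedded as sub documents instead of one row per event (the MySQL habit).
session = {
    "_id": "session-123",
    "client": "acme",
    "events": [  # sub documents, not a joined table
        {"type": "edit", "field": "headline", "ts": "2010-12-01 10:02:11"},
        {"type": "publish", "ts": "2010-12-01 10:05:42"},
    ],
}

def append_event(event):
    """Build a modifier that pushes one event sub document.

    Instead of fetching the whole document, mutating it, and re-saving
    it (the ORM habit), we send only this targeted modifier to the server.
    """
    return {"$push": {"events": event}}

update_doc = append_event(
    {"type": "edit", "field": "body", "ts": "2010-12-01 10:08:00"})
```

The server applies the `$push` in place, so two app servers appending events at the same time never clobber each other's writes.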
11. Stage 1 – what we learned Most importantly: don't let people find out about secret side projects. BLAMO, you're in production.
12. Stage 2 Stage 2: Critical Data. Medium Volume. Big Spikes. We manage hundreds of pages on Facebook which account for hundreds of millions of fans. Those fans post to brand walls. Let's build an app which lets the brand moderate that content.
14. Stage 2 – What we Learned
- Modifier operators are dope. http://www.mongodb.org/display/DOCS/Updating
- Remember to use $set. MySQL complains; Mongo happily destroys your document.
- Tell your query what you want returned.
- Be careful with 64-bit integers on 32-bit machines. (Facebook uses 64-bit IDs.) http://derickrethans.nl/64bit-ints-in-mongodb.html
- Read everything you can about indexing. You will likely create 2 dozen indexes that never get used. http://kylebanker.com/blog/2010/09/21/the-joy-of-mongodb-indexes/
- Make sure your indexes fit in memory.
- Use replica sets. Seriously, use them.
- .stats(), .explain(), the profiler, and mongostat are your friends.
- Got slow queries? Use .explain() + .stats() to figure out if they're using your indexes effectively. Try .hint().
- Got slow queries but can't find them? Use the profiler. Hell, you can query your queries!
- Still slow? Use mongostat to look for faults and locks. Faults = going to disk.
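The "$set or Mongo destroys your document" point can be simulated in plain Python, with dicts standing in for stored documents (field names are hypothetical):

```python
# A stored document, simulated as a plain dict.
doc = {"_id": 1, "name": "page", "fan_count": 500, "moderated": True}

# Forgetting $set: an update without modifier operators REPLACES the
# whole document with whatever you passed (only _id survives).
bad_update = {"fan_count": 501}
replaced = dict(bad_update, _id=doc["_id"])  # "name" and "moderated" are gone

# With $set, only the named field changes; everything else is untouched.
good_update = {"$set": {"fan_count": 501}}
patched = dict(doc)
patched.update(good_update["$set"])
```

MySQL would reject an UPDATE that drops columns; Mongo happily writes the two-field document and moves on, which is why this bites people coming from SQL.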
15. Stage 3 – Analytics
- Lots of "stuff" happens at Buddy Media. We needed a structured way to keep track of all of that stuff.
- Has to be flexible enough to handle different levels of aggregation.
- Has to be near-real time (1-minute aggregates).
- Need to be able to add new "stuff" or aggregates on the fly.
- Needs to handle lots of writes very fast.
16. Stage 3 – Analytics MongoDB Does. Well DUH! That's everyone's dream Analytics system! What makes you special?
17. Stage 3 – Analytics Upsert + $inc. *Remember earlier when I said Modifier Operators are awesome?
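The upsert + $inc pattern, sketched in plain Python on an in-memory dict standing in for the collection (a real app would call something like db.metrics.update(query, {"$inc": {...}}, upsert=True) via pymongo; the metric names below are made up):

```python
# In-memory stand-in for the metrics collection.
metrics = {}

def upsert_inc(key, field, n=1):
    """If the metric document exists, increment the counter; else create it.

    This is what upsert + $inc buys you server-side: one atomic operation,
    no read-modify-write, no "does this document exist yet?" check.
    """
    doc = metrics.setdefault(key, {})   # upsert: create the doc on miss
    doc[field] = doc.get(field, 0) + n  # $inc: increment in place

# Two pageviews for the same page in the same minute:
upsert_inc(("pageview", "minute", "2010-12-01 10:02"), "aggregates.page1")
upsert_inc(("pageview", "minute", "2010-12-01 10:02"), "aggregates.page1")
```

The first call creates the document, the second just increments it; the writer never needs to know which case it hit.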
20. Stage 3 – Analytics Storing the Metrics
- Originally, we keyed an event on [type, period, start_date, object]. This got huge… fast.
- Think pageviews. If you have 1000 pages and use a unique document to store pageviews for each object, then you have up to 1000 * 60 documents per hour. That's for ONE metric.
- It's not very Mongo-like (use sub documents!). If I want to know the pageviews all 1000 pages got in an hour by minute, I have to return and iterate over 60,000 documents.
- Instead, we key on [type, period, start_date]. Reduces the number of documents dramatically: 60 documents per hour instead of 60,000.
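A sketch of what one such document might look like (object ids and counts here are invented for illustration): one document per (type, period, start_date), with per-object counts folded into an "aggregates" sub document.

```python
# One metric document covers ALL objects for this minute.
metric_doc = {
    "name": "pageview",
    "period": "minute",
    "start_date": "2010-12-01 10:02:00",
    "aggregates": {
        # object id -> count for this minute (hypothetical ids)
        "4ce420bc36913": 7,
        "4ce420bc36914": 3,
    },
}

# "All pages this minute" is one document, not 1000; an hour by minute
# is 60 documents, not 60,000.
total = sum(metric_doc["aggregates"].values())
```

Incrementing a single page's count is then a dotted-field `$inc` like `{"$inc": {"aggregates.4ce420bc36913": 1}}` against that one document.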
21. Stage 3 – Analytics Pulling Events off the Queue

def getEvent(self, db):
    try:
        # 'now' is a timestamp computed elsewhere in the class
        item = db.command(
            'findAndModify', 'events',
            query={"status.state": 0},
            update={"$set": {"status": {"state": 1, "updated": now}}},
            sort={"created_date": 1})
        return item['value']
    except Exception:
        return None

One Event at a Time. No Race Conditions.
22. Stage 3 – Analytics Events to Metrics
- While we can only pull one event off the Queue at a time, that doesn't mean we should process one event at a time.
- Remember, our documents contain lots of "object" aggregates. We can update a whole bunch at once.
- We pull 10k events off the Queue. At 10k (or an empty Queue) we process them by creating local documents in memory, adding each object to each document.
- We then construct a single upsert per metric instead of per event.
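The batching step above can be sketched like this (event field names are hypothetical; the real code would finish by looping over the result and issuing one pymongo update with upsert=True per key):

```python
from collections import defaultdict

def build_upserts(events):
    """Collapse a batch of events into one $inc upsert per metric document.

    Counts are accumulated locally, keyed by (name, period, start_date),
    so 10k events become a handful of updates instead of 10k.
    """
    pending = defaultdict(lambda: defaultdict(int))
    for ev in events:
        key = (ev["name"], ev["period"], ev["start_date"])
        pending[key]["aggregates." + ev["object_id"]] += 1
    # One update document per metric; real code would then call
    # db.metrics.update(query_for(key), update, upsert=True) for each.
    return {key: {"$inc": dict(incs)} for key, incs in pending.items()}

events = [
    {"name": "pageview", "period": "minute",
     "start_date": "2010-12-01 10:02:00", "object_id": "page1"},
    {"name": "pageview", "period": "minute",
     "start_date": "2010-12-01 10:02:00", "object_id": "page1"},
    {"name": "pageview", "period": "minute",
     "start_date": "2010-12-01 10:02:00", "object_id": "page2"},
]
upserts = build_upserts(events)  # three events collapse into one upsert
```

Because `$inc` is additive, folding counts together client-side first gives the same totals as applying the events one at a time.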
24. Stage 3 – Analytics SQL vs MongoDB

SQL:
SELECT DATE_TRUNC('day', event_time) AS group_time
     , COUNT(DISTINCT event_transaction_id) AS counts
FROM f_localevents
WHERE 1=1
  AND module_id = '4ce420bc36913'
  AND event_name = 'polls.vote_submitted'
  AND event_time >= '2010-12-01 00:00:00'
  AND event_time < '2010-12-31 00:00:00'
GROUP BY DATE_TRUNC('day', event_time)
ORDER BY group_time;
Time: 0 hrs, 0 mins, 33 secs, 264 ms (holy shit)

MongoDB:
db.metrics.find(
    { name: "module.polls.vote_submitted",
      period: "day",
      start_date: { "$gte": "2010-12-01 00:00:00",
                    "$lte": "2010-12-31 23:59:59" } },
    { "aggregates.4ce420bc36913": 1 }
).explain();
{ "cursor" : "BtreeCursor name_1_period_1_start_date_1",
  "nscanned" : 7, "nscannedObjects" : 7, "n" : 7, "millis" : 0, …
25. Stage 3 – what we learned
- You probably don't need sharding. But if there is one situation where sharding is going to come up quickly, it's on the cloud.
- I can't say "give me all pageviews for pages in category A." There's a way to do this, but we haven't quite figured it out yet. Our app handles it for now.
- Fewer documents are always better. Find ways to combine data structures effectively.
- And last but not least… our LEAST favorite thing in MongoDB…
26. Stage 3 – What we learned

patrick@newdev:~$ mongo localhost
MongoDB shell version: 1.6.4
connecting to: localhost
> use analtyics;
switched to db analtyics
>
27. Shameless Plug(s) We are hiring all walks of life. Engineers, SysOps, Product Managers, UX Designers. Get to work on cool problems like this. (That makes you cool by association.) http://bddy.me/ia8gi3 Meet us at SXSW! http://www.facebook.com/event.php?eid=204744279542095