Lessons Learned Migrating 2+ Billion Documents at Craigslist

The slides from my 2011 MongoSF talk of the same name


  1. Lessons Learned from Migrating 2+ Billion Documents at Craigslist
     Jeremy Zawodny
     jzawodn@craigslist.org
     Jeremy@Zawodny.com
     http://blog.zawodny.com/
  2. Outline
     • Recap last year’s MongoSV Talk
     • The Archive, Why MongoDB, etc.
     • http://www.10gen.com/video/mongosv2010/craigslist
     • The Infrastructure
     • The Lessons
     • Wishlist
     • Q&A
  3. Craigslist Numbers
     • 2 data centers
     • ~500 servers
     • ~100 MySQL servers
     • ~700 cities, worldwide
     • ~1 billion hits/day
     • ~1.5 million posts/day
  4. Archive: Where Data Goes To Die
     • Live Numbers
     • ~1.75M posts/day
     • ~14 day avg. lifetime
     • ~60 day retention
     • ~100M posts
     • We keep all postings
     • Users reuse postings
     • Daily archive migration
     • Internal query tools
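The "~100M posts" figure follows roughly from the posting rate and the retention window. A minimal sketch of that arithmetic, assuming a simple steady-state model (inflow times retention); the function name is illustrative only:

```python
# Figures taken from the slide; the steady-state model itself is an assumption.
POSTS_PER_DAY = 1_750_000   # ~1.75M posts/day
RETENTION_DAYS = 60         # ~60 day retention


def live_posts(posts_per_day: int, retention_days: int) -> int:
    """Postings held in the live database once inflow and expiry balance out."""
    return posts_per_day * retention_days


if __name__ == "__main__":
    print(f"{live_posts(POSTS_PER_DAY, RETENTION_DAYS):,} live posts")
```

That works out to about 105 million live postings, in line with the slide's rounded ~100M.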
  5. Archive Pain
     • Coupled Schemas
     • Big Indexes
     • Hardware Failures
     • Replication Lag
     • Poor Search
     • Human Time Costs
  6. MongoDB Wins
     • Scalable
     • Fast
     • Friendly
     • Proven
     • Pragmatic
     • Approachable
  7. MongoDB Details
     • Plan for 5 billion documents
     • Average size: 2KB
     • 3 Replica sets, 3 Servers each
     • Deploy to 2 datacenters
     • Same deployment in each datacenter
     • Posting ID is sharding key
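The planning numbers above imply a rough storage budget. The sketch below is a back-of-the-envelope estimate only: it assumes 1 KB = 1,000 bytes, ignores indexes and storage overhead, and assumes every replica set member holds a full copy of its shard and that each datacenter holds the full data set. None of those assumptions come from the slides.

```python
# Planning figures from the slide.
DOCS = 5_000_000_000
AVG_DOC_BYTES = 2_000        # assumption: 2KB taken as 2,000 bytes
SHARDS = 3                   # 3 replica sets
REPLICAS_PER_SHARD = 3       # 3 servers each
DATACENTERS = 2

TB = 10**12


def capacity_plan(docs, avg_bytes, shards, replicas, datacenters):
    """Rough storage estimate in terabytes, excluding indexes and overhead."""
    raw = docs * avg_bytes
    return {
        "raw_tb": raw / TB,                            # one logical copy
        "per_server_tb": raw / shards / TB,            # each member holds its whole shard
        "per_datacenter_tb": raw * replicas / TB,      # all copies in one datacenter
        "total_tb": raw * replicas * datacenters / TB,
    }


if __name__ == "__main__":
    plan = capacity_plan(DOCS, AVG_DOC_BYTES, SHARDS, REPLICAS_PER_SHARD, DATACENTERS)
    for name, value in plan.items():
        print(f"{name}: {value:.1f}")
```

Under these assumptions the raw data is about 10 TB, or a little over 3 TB per server, which is far more than fits in RAM on typical hardware of the time.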
  8. MongoDB Architecture
     • Typical Sharding with Replica Sets
     • (external sphinx full-text indexers not pictured)
     • [Diagram: four clients, three mongos routers, three config servers, and three shards (shard001, shard002, shard003), each a replica set]
  9. Lesson: Know Your Hardware
     • MongoDB on blades really sucks
     • Single 10k RPM disks can’t take it when data is noticeably larger than RAM
     • Mongo operations can hit the client timeout (30 sec default)
     • Even minutely cron jobs start to spew
     • Lots of time wasted in development environment, trying different kernels, tuning, etc.
     • Most noticeable during heavy writes but can happen if pages fall out of RAM for other reasons
  10. Lesson: Replica Sets Rock
     • Lots of reboots happened during dev environment troubleshooting
     • Each time, one of the remaining nodes took over
     • No “reclone”, no config file or DNS changes
     • Stuff “just worked” while nodes bounced up and down
  11. Lesson: Know Your Data
     • MongoDB is UTF-8
     • Some of our older data is decidedly NOT UTF-8
     • We had lots of sloppy encoding issues, and we had to clean them all up.
     • Start data load. Wait 12-36 hours. Witness fail. Fix code. Start over. Sigh.
     • This is a combination of having been sloppy and having old data. Even with a lot less history, this can bite you. Get your encoding house in order!
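One way to avoid discovering bad bytes 12-36 hours into a load is to scan the source data for invalid UTF-8 first and repair it before inserting. This is a minimal sketch, not the code used for the migration: the function names are illustrative, and the choice of Windows-1252 as the fallback encoding is an assumption about what legacy text tends to look like.

```python
from typing import Iterable, Iterator, Tuple

FALLBACK_ENCODING = "cp1252"  # assumption: legacy text is mostly Windows-1252


def find_bad_records(records: Iterable[bytes]) -> Iterator[Tuple[int, UnicodeDecodeError]]:
    """Yield (index, error) for every record that is not valid UTF-8."""
    for index, raw in enumerate(records):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as exc:
            yield index, exc


def to_utf8_text(raw: bytes, fallback: str = FALLBACK_ENCODING) -> str:
    """Decode as UTF-8; otherwise fall back to the legacy encoding.

    Bytes the fallback cannot map become U+FFFD instead of raising.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback, errors="replace")


if __name__ == "__main__":
    sample = [b"plain ascii", b"caf\xe9", "caf\u00e9".encode("utf-8")]
    for index, error in find_bad_records(sample):
        print(f"record {index}: {error.reason} at byte {error.start}")
    print([to_utf8_text(raw) for raw in sample])
```

Running the scan as a dry run over the whole data set gives a full list of offenders up front, so the fix-and-restart loop happens before the long load rather than during it.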
  12. Lesson: Know Your Data Size
     • MongoDB has a document size limit
     • 4MB in 1.6.x, 16MB in 1.8.x
     • What to do with outliers?
     • In our case, trim off some useless data.
     • But going from relational to document means this sort of problem is easy to have. One parent, many children.
     • It’d be nice if this were easier to change, but clients have it hard-coded too.
     • Compression would help, of course.
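Outliers are easier to handle if they are caught before the insert is attempted. The sketch below checks an approximate size and drops expendable fields until the document fits. It is an illustration only: the JSON-encoded length is used as a stand-in for the real BSON size (the two differ), and the field names in the example are hypothetical.

```python
import json

LIMIT_1_6 = 4 * 1024 * 1024    # 4MB limit in 1.6.x (from the slide)
LIMIT_1_8 = 16 * 1024 * 1024   # 16MB limit in 1.8.x (from the slide)


def approx_size(doc: dict) -> int:
    """Approximate document size in bytes (JSON length, not true BSON size)."""
    return len(json.dumps(doc, ensure_ascii=False).encode("utf-8"))


def trim_to_fit(doc: dict, droppable: list, limit: int = LIMIT_1_8) -> dict:
    """Return a copy of doc, dropping fields from `droppable` in order until it fits."""
    trimmed = dict(doc)
    for field in droppable:
        if approx_size(trimmed) <= limit:
            break
        trimmed.pop(field, None)
    if approx_size(trimmed) > limit:
        raise ValueError("document still exceeds the size limit after trimming")
    return trimmed


if __name__ == "__main__":
    # Hypothetical posting with one bulky, expendable field.
    posting = {"id": 1, "body": "x", "debug_log": "y" * 100}
    print(trim_to_fit(posting, ["debug_log"], limit=50))
```

Leaving some headroom below the hard limit is a reasonable precaution, since the estimate here is not the exact encoded size.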
  13. Lesson: Know Your Data Types
     • Field Types and Conversions can be expensive to do after the fact!
     • MongoDB treats strings and numbers differently, but some programming languages (such as Perl) don’t make that distinction obvious
     • This has indexing implications when you later look for 123456789 but had unknowingly stored “123456789”
     • http://search.cpan.org/dist/MongoDB/lib/MongoDB/DataTypes.pod
  14. Data Types, continued
     • “If the type of a field is ambiguous and important to your application, you should document what you expect the application to send to the database and convert your data to those types before sending.”
     • Do you know how to do that in your language of choice?
     • Some drivers may make a “guess” that gets it right most of the time.
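The advice quoted above (document the expected types, then convert before sending) can be sketched as a small coercion step applied to every document before it reaches the driver. The schema and field names here are hypothetical, and the example is in Python rather than the Perl the talk refers to.

```python
# Hypothetical schema: the documented type for each field the application cares about.
SCHEMA = {"posting_id": int, "price": float, "title": str}


def coerce(doc: dict, schema: dict = SCHEMA) -> dict:
    """Return a copy of doc with schema fields converted to their documented types.

    Fields not named in the schema, and None values, pass through unchanged.
    """
    out = {}
    for key, value in doc.items():
        caster = schema.get(key)
        if caster is not None and value is not None:
            out[key] = caster(value)
        else:
            out[key] = value
    return out


if __name__ == "__main__":
    raw = {"posting_id": "123456789", "title": 42, "category": "for sale"}
    clean = coerce(raw)
    print(clean)
    # The string and the number are different values, so a lookup by one
    # would not match a document stored with the other.
    print("123456789" == 123456789)
```

Doing the conversion in one place also makes the expected types explicit, which helps when more than one program writes to the same collection.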
  15. Lesson: Know Some Sharding
     • The Balancer can be your frenemy
     • Initial insert rate: 8,000/sec
     • Later drops to 200/sec
     • Too much time spent waiting to page in data that’s going to be sent to another node and never looked at (locally) again
     • Pre-split your data if possible
     • http://blog.zawodny.com/2011/03/06/mongodb-pre-splitting-for-faster-data-loading-and-importing/
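Pre-splitting means deciding the chunk boundaries up front so that inserts land on their final shard instead of being migrated later by the balancer. The sketch below only computes evenly spaced boundaries over a numeric posting ID range and shows which shard each ID would map to if chunks were assigned round-robin. It is not the procedure from the linked post: the ID range and chunk count are made up, and issuing the actual split and move commands to the cluster is left out.

```python
import bisect
from typing import List

SHARDS = ["shard001", "shard002", "shard003"]  # shard names from the architecture slide


def split_points(min_id: int, max_id: int, chunks: int) -> List[int]:
    """Evenly spaced boundaries that divide [min_id, max_id) into `chunks` ranges."""
    if chunks < 1 or max_id <= min_id:
        raise ValueError("need chunks >= 1 and max_id > min_id")
    span = max_id - min_id
    return [min_id + (span * i) // chunks for i in range(1, chunks)]


def shard_for(posting_id: int, points: List[int], shards: List[str] = SHARDS) -> str:
    """Shard that would own posting_id if chunks were dealt out round-robin."""
    chunk_index = bisect.bisect_right(points, posting_id)  # lower bound is inclusive
    return shards[chunk_index % len(shards)]


if __name__ == "__main__":
    points = split_points(0, 1000, 4)   # hypothetical ID range and chunk count
    print(points)
    for posting_id in (0, 250, 999):
        print(posting_id, shard_for(posting_id, points))
```

With the boundaries known in advance, a loader can also group its input by destination shard, which keeps each shard's writes local.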
  16. Lesson: Know Some Replica Sets
     • Replica Set re-sync requires index rebuilds on the secondary
     • Most painful when a slave is down too long and can’t catch up using the oplog
     • Typically during high write volumes
     • In a large data set, the index rebuilding can take a couple of days, even without many indexes
     • What if you lose another while that is happening?
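How long a secondary can be down before it falls off the end of the oplog depends on the oplog size and the write rate. A rough sketch of that estimate follows. The 8,000 writes/sec and 2 KB figures come from earlier slides; the 50 GB oplog size is a made-up example, and treating each oplog entry as the same size as the document is a simplifying assumption.

```python
def oplog_window_seconds(oplog_bytes: int, writes_per_sec: int, avg_entry_bytes: int) -> float:
    """Approximate time until the oldest oplog entry is overwritten."""
    bytes_per_sec = writes_per_sec * avg_entry_bytes
    if bytes_per_sec <= 0:
        raise ValueError("write rate and entry size must be positive")
    return oplog_bytes / bytes_per_sec


if __name__ == "__main__":
    oplog_bytes = 50 * 10**9     # hypothetical 50 GB oplog
    window = oplog_window_seconds(oplog_bytes, writes_per_sec=8_000, avg_entry_bytes=2_000)
    print(f"{window / 3600:.2f} hours of downtime before a full re-sync is needed")
```

Under these assumptions the window during a bulk load is under an hour, which is why an outage at peak write volume so easily turns into a full re-sync.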
  17. MongoDB Wishlist
     • Replica set node re-sync without index rebuilding
     • Record (or field) compression (not everyone uses a filesystem that offers compression)
     • Method to tap into the oplog so that changes can be fed to external indexers (Sphinx, Redis, etc.)
     • Hash-based sharding (coming soon?)
     • Cluster snapshot/backup tool
  18. craigslist is hiring!
     • send resumes to: z@craigslist.org
     • Plain Text or PDF, no Word Docs!
     • Front-end Engineering: HTML, CSS, JavaScript, jQuery (Mobile too)
     • Network Administration: Routers, switches, load balancers, etc.
     • Back-end Engineering: Linux, Apache, Perl, MySQL, MongoDB, Redis, Gearman, etc.
     • Systems Administration: Help keep all those systems running.
  19. craigslist is hiring!
     • send resumes to: z@craigslist.org
     • Plain Text or PDF, no Word Docs!
     • Laid back, non-corporate environment
     • Engineering driven culture
     • Lots of interesting technical challenges
     • Easy SF commute
     • Excellent benefits and pay
     • High-impact work
     • Millions use craigslist daily