Lessons Learned Migrating 2+ Billion Documents at Craigslist



The slides from my 2011 MongoSF talk of the same name








Usage Rights

© All Rights Reserved

    Presentation Transcript

    • Lessons Learned from Migrating 2+ Billion Documents at Craigslist
      Jeremy Zawodny
    • Outline
      Recap last year’s MongoSV Talk
      The Archive, Why MongoDB, etc.
      The Infrastructure
      The Lessons
    • Craigslist Numbers
      2 data centers
      ~500 servers
      ~100 MySQL servers
      ~700 cities, worldwide
      ~1 billion hits/day
      ~1.5 million posts/day
    • Archive: Where Data Goes To Die
      Live Numbers
      ~1.75M posts/day
      ~14 day avg. lifetime
      ~60 day retention
      ~100M posts
      We keep all postings
      Users reuse postings
      Daily archive migration
      Internal query tools
    • Archive Pain
      Coupled Schemas
      Big Indexes
      Hardware Failures
      Replication Lag
      Poor Search
      Human Time Costs
    • MongoDB Wins
    • MongoDB Details
      Plan for 5 billion documents
      Average size: 2KB
      3 Replica sets, 3 Servers each
      Deploy to 2 datacenters
      Same deployment in each datacenter
      Posting ID is sharding key
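The planning numbers on this slide imply a rough storage footprint, which also puts the later "larger than RAM" hardware lesson in context. The following is only a back-of-envelope sketch: it assumes every replica set member holds a full copy of its shard's data, and it ignores indexes, padding, and the oplog. The function name `sizing` is illustrative.

```python
# Back-of-envelope sizing using only the numbers quoted on the slide.
PLANNED_DOCS = 5_000_000_000   # "Plan for 5 billion documents"
AVG_DOC_BYTES = 2 * 1024       # "Average size: 2KB"
SHARDS = 3                     # "3 Replica sets"
MEMBERS_PER_SHARD = 3          # "3 Servers each"
DATACENTERS = 2                # "Same deployment in each datacenter"


def sizing():
    """Return raw data sizes in bytes (indexes and overhead NOT included)."""
    raw_bytes = PLANNED_DOCS * AVG_DOC_BYTES
    return {
        "raw_bytes": raw_bytes,
        # Data one shard must hold, assuming an even spread of posting IDs.
        "per_shard_bytes": raw_bytes / SHARDS,
        # Every member of every replica set in both datacenters keeps a copy.
        "all_copies_bytes": raw_bytes * MEMBERS_PER_SHARD * DATACENTERS,
    }


if __name__ == "__main__":
    for name, value in sizing().items():
        print(f"{name}: {value / 1e12:.2f} TB")
```

Under these assumptions each server carries several terabytes of documents, far more than fits in memory on typical hardware.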
    • MongoDB Architecture
      Typical Sharding with Replica Sets
      (external Sphinx full-text indexers not pictured)
      [Diagram: three replica sets]
    • Lesson: Know Your Hardware
      MongoDB on blades really sucks
      Single 10k RPM disks can’t take it when data is noticeably larger than RAM
      Mongo operations can hit the client timeout (30 sec default)
      Even minutely cron jobs start to spew
      Lots of time wasted in development environment, trying different kernels, tuning, etc.
      Most noticeable during heavy writes but can happen if pages fall out of RAM for other reasons
    • Lesson: Replica Sets Rock
      Lots of reboots happened during dev environment troubleshooting
      Each time, one of the remaining nodes took over
      No “reclone”, no config file or DNS changes
      Stuff “just worked” while nodes bounced up and down
    • Lesson: Know Your Data
      MongoDB is UTF-8
      Some of our older data is decidedly NOT UTF-8
      We had lots of sloppy encoding issues, and we had to clean them all up.
      Start data load. Wait 12-36 hours. Witness fail. Fix code. Start over. Sigh.
      This is a combination of having been sloppy and having old data. Even with a lot less history, this can bite you. Get your encoding house in order!
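One way to avoid the "wait 12-36 hours, witness fail" loop is a cheap dry-run pass over the raw bytes before the real load begins. The sketch below is not the code used at Craigslist. It assumes records are available as raw `bytes`, and the Windows-1252 fallback is a guess about what the legacy data might contain. Both function names are hypothetical.

```python
def find_bad_records(records):
    """Yield (index, raw_bytes) for every record that is not valid UTF-8."""
    for i, raw in enumerate(records):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            yield i, raw


def to_utf8_text(raw, fallback="cp1252"):
    """Decode as UTF-8 if possible; otherwise try a legacy encoding.

    The fallback encoding is an assumption. Bytes that are undefined in
    the fallback are replaced with U+FFFD instead of raising.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback, errors="replace")


if __name__ == "__main__":
    sample = [b"plain ascii", "caf\u00e9".encode("utf-8"), b"caf\xe9"]
    for index, raw in find_bad_records(sample):
        print(index, repr(raw), "->", to_utf8_text(raw))
```

Running the finder over a sample of old data first gives a list of offenders in minutes rather than a failed load after a day.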
    • Lesson: Know Your Data Size
      MongoDB has a document size limit
      4MB in 1.6.x, 16MB in 1.8.x
      What to do with outliers?
      In our case, trim off some useless data.
      But going from relational to document means this sort of problem is easy to have. One parent, many children.
      It’d be nice if this was easier to change, but clients have it hard-coded too.
      Compression would help, of course.
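Outliers can be caught and trimmed before they reach the database. The sketch below assumes documents are plain dictionaries. It uses the JSON-encoded length as a rough stand-in for the real BSON size, which will differ, so a real check should leave headroom or measure with the driver's own encoder. The `droppable` field list and the function names are hypothetical.

```python
import json

MAX_DOC_BYTES = 16 * 1024 * 1024  # the 16MB limit quoted for 1.8.x


def approx_size(doc):
    """Approximate encoded size in bytes (JSON proxy, NOT exact BSON size)."""
    return len(json.dumps(doc, ensure_ascii=False).encode("utf-8"))


def trim_to_limit(doc, droppable, limit=MAX_DOC_BYTES):
    """Return a copy of doc, dropping `droppable` fields in order until it fits.

    Raises ValueError if the document is still too large after trimming.
    """
    trimmed = dict(doc)
    for field in droppable:
        if approx_size(trimmed) <= limit:
            break
        trimmed.pop(field, None)
    if approx_size(trimmed) > limit:
        raise ValueError("document still exceeds limit after trimming")
    return trimmed


if __name__ == "__main__":
    posting = {"id": 1, "body": "x", "debug_blob": "y" * 1000}
    print(trim_to_limit(posting, ["debug_blob"], limit=100))
```

Listing droppable fields in order of least value keeps as much of the document as possible.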
    • Lesson: Know Your Data Types
      Field Types and Conversions can be expensive to do after the fact!
      MongoDB treats strings and numbers differently, but some programming languages (such as Perl) don’t make that distinction obvious
      This has indexing implications when you later look for 123456789 but had unknowingly stored “123456789”
    • Data Types, continued
      “If the type of a field is ambiguous and important to your application, you should document what you expect the application to send to the database and convert your data to those types before sending.”
      Do you know how to do that in your language of choice?
      Some drivers may make a “guess” that gets it right most of the time.
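The advice quoted above amounts to coercing each field to a documented type just before the document is handed to the driver. Here is a minimal sketch of that idea. The schema, field names, and `coerce` function are made up for illustration and do not come from the talk.

```python
# Hypothetical schema: the types the application promises to send.
EXPECTED_TYPES = {"posting_id": int, "price": float, "title": str}


def coerce(doc, schema=EXPECTED_TYPES):
    """Return a copy of doc with schema fields converted to their expected types.

    Fields not named in the schema, and None values, are left untouched.
    A value that cannot be converted raises ValueError or TypeError, which
    is better discovered here than after the data is stored and indexed.
    """
    out = dict(doc)
    for field, expected in schema.items():
        value = out.get(field)
        if value is not None and not isinstance(value, expected):
            out[field] = expected(value)
    return out


if __name__ == "__main__":
    # A string ID, as a loosely typed language might produce it.
    print(coerce({"posting_id": "123456789", "title": 42}))
```

With this in place, a later lookup for the number 123456789 matches what was stored, instead of silently missing the string “123456789”.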
    • Lesson: Know Some Sharding
      The Balancer can be your frenemy
      Initial insert rate: 8,000/sec
      Later drops to 200/sec
      Too much time spent waiting to page in data that’s going to be sent to another node and never looked at (locally) again
      Pre-split your data if possible
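Pre-splitting means choosing chunk boundaries on the shard key up front and spreading the empty chunks across shards, so that bulk inserts land on their final shard instead of being migrated later by the balancer. The sketch below only computes a plan. It assumes posting IDs are integers spread roughly evenly over a known range. The shard names and function names are hypothetical, and applying the plan (with MongoDB's split and chunk-move admin commands) is not shown.

```python
def split_points(min_id, max_id, chunks):
    """Return chunks-1 evenly spaced boundaries strictly between min_id and max_id."""
    if chunks < 2 or max_id <= min_id:
        return []
    step = (max_id - min_id) / chunks
    return [min_id + round(step * i) for i in range(1, chunks)]


def plan_chunks(min_id, max_id, shards, chunks_per_shard=1):
    """Return (lower, upper, shard) tuples, assigning chunks round-robin."""
    points = split_points(min_id, max_id, len(shards) * chunks_per_shard)
    bounds = [min_id] + points + [max_id]
    return [
        (bounds[i], bounds[i + 1], shards[i % len(shards)])
        for i in range(len(bounds) - 1)
    ]


if __name__ == "__main__":
    # Hypothetical ID range and shard names.
    for lower, upper, shard in plan_chunks(0, 3_000_000_000, ["rs0", "rs1", "rs2"], 2):
        print(f"[{lower}, {upper}) -> {shard}")
```

If IDs are not evenly distributed, boundaries taken from a sample of real IDs would balance better than evenly spaced ones.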
    • Lesson: Know Some Replica Sets
      Replica Set re-sync requires index rebuilds on the secondary
      Most painful when a slave is down too long and can’t catch up using the oplog
      Typically during high write volumes
      In a large data set, the index rebuilding can take a couple of days, even without many indexes
      What if you lose another while that is happening?
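Whether a node can catch up from the oplog after downtime comes down to simple arithmetic: the oplog holds a fixed number of bytes, and heavy writes shorten the time window it covers. The sketch below is a rough estimate only. It assumes oplog entries are about the size of the documents written, and the 50 GB oplog size and the 0.8 safety factor are made-up values. The function names are illustrative.

```python
def oplog_window_hours(oplog_bytes, writes_per_sec, avg_entry_bytes):
    """Estimate how many hours of writes fit in the oplog at a steady rate."""
    bytes_per_hour = writes_per_sec * avg_entry_bytes * 3600
    return oplog_bytes / bytes_per_hour


def can_catch_up(downtime_hours, window_hours, safety=0.8):
    """True if the node was down for less than a safety fraction of the window."""
    return downtime_hours < window_hours * safety


if __name__ == "__main__":
    # 8,000 inserts/sec and 2KB documents come from the slides;
    # the 50 GB oplog is a hypothetical size.
    window = oplog_window_hours(50_000_000_000, 8000, 2048)
    print(f"oplog window: {window:.2f} hours")
    print("30 min outage recoverable:", can_catch_up(0.5, window))
    print("1 hour outage recoverable:", can_catch_up(1.0, window))
```

At bulk-load write rates the window can shrink to well under an hour, which matches the slide's point that falling too far behind typically happens during high write volumes.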
    • MongoDB Wishlist
      Replica set node re-sync without index rebuilding
      Record (or field) compression (not everyone uses a filesystem that offers compression)
      Method to tap into the oplog so that changes can be fed to external indexers (Sphinx, Redis, etc.)
      Hash-based sharding (coming soon?)
      Cluster snapshot/backup tool
    • craigslist is hiring!
      send resumes to: z@craigslist.org
      Plain Text or PDF, no Word Docs!
      Front-end Engineering
      HTML, CSS, JavaScript, jQuery
      (Mobile too)
      Network Administration
      Routers, switches, load balancers, etc.
      Back-end Engineering
      Linux, Apache, Perl, MySQL, MongoDB, Redis, Gearman, etc.
      Systems Administration
      Help keep all those systems running.
    • craigslist is hiring!
      send resumes to: z@craigslist.org
      Plain Text or PDF, no Word Docs!
      Laid-back, non-corporate environment
      Engineering-driven culture
      Lots of interesting technical challenges
      Easy SF commute
      Excellent benefits and pay
      High-impact work
      Millions use craigslist daily