Lessons Learned Migrating 2+ Billion Documents at Craigslist

The slides from my 2011 MongoSF talk of the same name

Presentation Transcript

  • Lessons Learned from Migrating 2+ Billion Documents at Craigslist
    Jeremy Zawodny
    jzawodn@craigslist.org
    Jeremy@Zawodny.com
    http://blog.zawodny.com/
  • Outline
    Recap last year’s MongoSV Talk
    The Archive, Why MongoDB, etc.
    http://www.10gen.com/video/mongosv2010/craigslist
    The Infrastructure
    The Lessons
    Wishlist
    Q&A
  • Craigslist Numbers
    2 data centers
    ~500 servers
    ~100 MySQL servers
    ~700 cities, worldwide
    ~1 billion hits/day
    ~1.5 million posts/day
  • Archive: Where Data Goes To Die
    Live Numbers
    ~1.75M posts/day
    ~14 day avg. lifetime
    ~60 day retention
    ~100M posts
    We keep all postings
    Users reuse postings
    Daily archive migration
    Internal query tools
  • Archive Pain
    Coupled Schemas
    Big Indexes
    Hardware Failures
    Replication Lag
    Poor Search
    Human Time Costs
  • MongoDB Wins
    Scalable
    Fast
    Friendly
    Proven
    Pragmatic
    Approachable
  • MongoDB Details
    Plan for 5 billion documents
    Average size: 2KB
    3 Replica sets, 3 Servers each
    Deploy to 2 datacenters
    Same deployment in each datacenter
    Posting ID is sharding key
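The planning numbers on this slide imply a rough storage footprint. The sketch below works through that arithmetic; it assumes 1 KB = 1024 bytes and ignores index and storage overhead, neither of which the slides quantify. The constant and function names are illustrative only.

```python
# Back-of-the-envelope sizing from the slide's planning numbers.
PLANNED_DOCS = 5_000_000_000   # "Plan for 5 billion documents"
AVG_DOC_BYTES = 2 * 1024       # "Average size: 2KB" (assumes 1 KB = 1024 bytes)
SHARDS = 3                     # "3 Replica sets"
SERVERS_PER_SHARD = 3          # "3 Servers each"

TIB = 1024 ** 4


def raw_data_bytes(docs: int = PLANNED_DOCS, avg_bytes: int = AVG_DOC_BYTES) -> int:
    """Total document bytes, before indexes and storage overhead."""
    return docs * avg_bytes


def per_shard_bytes(total_bytes: int, shards: int = SHARDS) -> float:
    """Bytes per shard, assuming documents spread evenly across shards."""
    return total_bytes / shards


total = raw_data_bytes()
per_shard = per_shard_bytes(total)
servers_per_datacenter = SHARDS * SERVERS_PER_SHARD

print(f"total:     {total / TIB:.2f} TiB")
print(f"per shard: {per_shard / TIB:.2f} TiB")
print(f"servers per datacenter: {servers_per_datacenter}")
```

Every member of a replica set holds a full copy of its shard, so under these assumptions each server carries roughly a third of the total, far more than fits in RAM.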
  • MongoDB Architecture
    Typical Sharding with Replica Sets
    (external Sphinx full-text indexers not pictured)
    [Diagram: 4 clients, 3 mongos routers, 3 config servers, and 3 shards (shard001, shard002, shard003), each a replica set]
  • Lesson: Know Your Hardware
    MongoDB on blades really sucks
    Single 10k RPM disks can’t take it when data is noticeably larger than RAM
    Mongo operations can hit the client timeout (30 sec default)
    Even minutely cron jobs start to spew
    Lots of time wasted in development environment, trying different kernels, tuning, etc.
    Most noticeable during heavy writes but can happen if pages fall out of RAM for other reasons
  • Lesson: Replica Sets Rock
    Lots of reboots happened during dev environment troubleshooting
    Each time, one of the remaining nodes took over
    No “reclone”, no config file or DNS changes
    Stuff “just worked” while nodes bounced up and down
  • Lesson: Know Your Data
    MongoDB is UTF-8
    Some of our older data is decidedly NOT UTF-8
    We had lots of sloppy encoding issues, and we had to clean them all up.
    Start data load. Wait 12-36 hours. Witness fail. Fix code. Start over. Sigh.
    This is a combination of having been sloppy and having old data. Even with a lot less history, this can bite you. Get your encoding house in order!
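One way to avoid discovering bad bytes 12-36 hours into a load is a cheap validation pass over the source data first. The following is a minimal sketch of that idea, not the code used at Craigslist. The record layout (an ID plus a dict of fields) and the cp1252 fallback are assumptions made for illustration.

```python
def to_utf8_text(raw: bytes, fallback: str = "cp1252") -> str:
    """Decode bytes as UTF-8; on failure, fall back to a legacy encoding.

    The fallback encoding is a guess (an assumption); bytes it cannot map
    are replaced with U+FFFD rather than raising.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback, errors="replace")


def find_bad_records(records):
    """Yield (record_id, field_name) for each bytes field that is not valid UTF-8.

    `records` is an iterable of (record_id, {field_name: value}) pairs,
    a hypothetical shape chosen for this sketch.
    """
    for rec_id, fields in records:
        for name, value in fields.items():
            if isinstance(value, bytes):
                try:
                    value.decode("utf-8")
                except UnicodeDecodeError:
                    yield rec_id, name


sample = [
    (1, {"title": b"plain ascii"}),
    (2, {"title": b"caf\xe9"}),  # Latin-1 style byte, invalid as UTF-8
]
for rec_id, field in find_bad_records(sample):
    print(f"record {rec_id}: field {field!r} is not valid UTF-8")
```

Running the scan as a dry run lists every offending record up front, so the fix-and-restart cycle happens before the long load rather than during it.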
  • Lesson: Know Your Data Size
    MongoDB has a document size limit
    4MB in 1.6.x, 16MB in 1.8.x
    What to do with outliers?
    In our case, trim off some useless data.
    But going from relational to document means this sort of problem is easy to have. One parent, many children.
    It’d be nice if this was easier to change, but clients have it hard-coded too.
    Compression would help, of course.
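Outliers can be found and trimmed before the load rather than during it. The sketch below is one possible approach, not the talk's actual code: it uses the JSON-encoded length as a rough stand-in for the BSON size (an assumption; the real limit applies to the BSON encoding) and drops trailing items from a hypothetical list field until the document fits.

```python
import json

MAX_DOC_BYTES = 16 * 1024 * 1024  # the 16MB limit cited for 1.8.x


def approx_size(doc: dict) -> int:
    """Rough size estimate: UTF-8 length of the JSON encoding (not the BSON size)."""
    return len(json.dumps(doc, ensure_ascii=False).encode("utf-8"))


def trim_to_fit(doc: dict, list_field: str, limit: int = MAX_DOC_BYTES):
    """Return (trimmed_copy, dropped_count) with trailing list items removed.

    Items are dropped from the end of doc[list_field] until the estimated
    size is within `limit` or the list is empty. The input doc is not modified.
    """
    trimmed = dict(doc)
    items = list(doc.get(list_field, []))
    trimmed[list_field] = items
    dropped = 0
    while items and approx_size(trimmed) > limit:
        items.pop()
        dropped += 1
    return trimmed, dropped


# Tiny limit so the example is easy to follow.
doc = {"_id": 1, "children": ["x" * 100] * 10}
trimmed, dropped = trim_to_fit(doc, "children", limit=500)
print(dropped, len(trimmed["children"]))
```

Which data counts as "useless" is application-specific; dropping from the end of a list is only a placeholder policy.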
  • Lesson: Know Your Data Types
    Field Types and Conversions can be expensive to do after the fact!
    MongoDB treats strings and numbers differently, but some programming languages (such as Perl) don’t make that distinction obvious
    This has indexing implications when you later look for 123456789 but had unknowingly stored “123456789”
    http://search.cpan.org/dist/MongoDB/lib/MongoDB/DataTypes.pod
  • Data Types, continued
    “If the type of a field is ambiguous and important to your application, you should document what you expect the application to send to the database and convert your data to those types before sending.”
    Do you know how to do that in your language of choice?
    Some drivers may make a “guess” that gets it right most of the time.
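The advice quoted above, converting data to the expected types before sending it, can be sketched in a few lines. This example uses Python rather than Perl, and the field names and schema are hypothetical; the point is only that the conversion happens explicitly in application code instead of being left to a driver's guess.

```python
# Hypothetical schema: the types this application expects to store.
EXPECTED_TYPES = {"posting_id": int, "price": float, "title": str}


def coerce(doc: dict, schema: dict = EXPECTED_TYPES) -> dict:
    """Return a copy of doc with schema fields converted to their expected types.

    Fields not named in the schema pass through unchanged. A value that
    cannot be converted raises ValueError, which is preferable to silently
    storing the wrong type.
    """
    out = {}
    for key, value in doc.items():
        want = schema.get(key)
        if want is None or isinstance(value, want):
            out[key] = value
        else:
            out[key] = want(value)
    return out


raw = {"posting_id": "123456789", "title": 42, "note": None}
clean = coerce(raw)
print(clean)
```

After coercion, a later lookup for the number 123456789 matches what was stored, because the string form never reaches the database.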
  • Lesson: Know Some Sharding
    The Balancer can be your frenemy
    Initial insert rate: 8,000/sec
    Later drops to 200/sec
    Too much time spent waiting to page in data that’s going to be sent to another node and never looked at (locally) again
    Pre-split your data if possible
    http://blog.zawodny.com/2011/03/06/mongodb-pre-splitting-for-faster-data-loading-and-importing/
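Pre-splitting means creating the chunk boundaries, and placing the empty chunks on their shards, before the bulk load starts, so the balancer has little to move afterwards. The sketch below only computes evenly spaced split points over a numeric posting ID range and builds the corresponding admin command documents; it does not talk to a cluster. The namespace, key name, and ID range are hypothetical, and the `split` and `moveChunk` command shapes should be checked against the documentation for the MongoDB version in use.

```python
def split_points(min_id: int, max_id: int, chunks: int) -> list:
    """Evenly spaced boundaries that divide [min_id, max_id) into `chunks` ranges."""
    step = (max_id - min_id) // chunks
    return [min_id + step * i for i in range(1, chunks)]


def split_commands(namespace: str, key: str, points: list) -> list:
    """One `split` admin command document per boundary."""
    return [{"split": namespace, "middle": {key: p}} for p in points]


def move_commands(namespace: str, key: str, points: list, shards: list) -> list:
    """Assign the chunk starting at each boundary to a shard, round-robin."""
    return [
        {"moveChunk": namespace, "find": {key: p}, "to": shards[i % len(shards)]}
        for i, p in enumerate(points)
    ]


# Hypothetical namespace and ID range; shard names follow the architecture slide.
points = split_points(0, 1200, 4)
for cmd in split_commands("archive.postings", "posting_id", points):
    print(cmd)
for cmd in move_commands("archive.postings", "posting_id", points,
                         ["shard001", "shard002", "shard003"]):
    print(cmd)
```

The insert rates on the slide show why this matters: loading 2 billion documents takes about 2.9 days at 8,000/sec but about 116 days at 200/sec.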
  • Lesson: Know Some Replica Sets
    Replica Set re-sync requires index rebuilds on the secondary
    Most painful when a secondary is down too long and can’t catch up using the oplog
    Typically during high write volumes
    In a large data set, the index rebuilding can take a couple of days, even without many indexes
    What if you lose another while that is happening?
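Whether a secondary can catch up from the oplog depends on how much write history the oplog holds. A rough estimate of that window is sketched below. The oplog size and the assumption that each oplog entry is about as large as an average 2KB document are illustrative guesses, not figures from the talk.

```python
GIB = 1024 ** 3


def oplog_window_hours(oplog_bytes: int, ops_per_sec: float, avg_op_bytes: int) -> float:
    """Approximate hours of write history the oplog holds at a steady write rate."""
    return oplog_bytes / (ops_per_sec * avg_op_bytes) / 3600


def can_catch_up(downtime_hours: float, window_hours: float) -> bool:
    """A secondary can replay the oplog only if it was down for less than the window."""
    return downtime_hours < window_hours


# Assumed values: 50 GiB oplog, 8,000 inserts/sec, about 2 KiB per entry.
window = oplog_window_hours(50 * GIB, 8000, 2 * 1024)
print(f"oplog window: {window:.2f} hours")
print("down 30 minutes:", can_catch_up(0.5, window))
print("down 2 hours:", can_catch_up(2.0, window))
```

Under these assumptions the window is under an hour during a heavy load, which is consistent with the slide's point that falling behind typically happens during high write volumes.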
  • MongoDB Wishlist
    Replica set node re-sync without index rebuilding
    Record (or field) compression (not everyone uses a filesystem that offers compression)
    Method to tap into the oplog so that changes can be fed to external indexers (Sphinx, Redis, etc.)
    Hash-based sharding (coming soon?)
    Cluster snapshot/backup tool
  • craigslist is hiring!
    send resumes to: z@craigslist.org
    Plain Text or PDF, no Word Docs!
    Front-end Engineering
    HTML, CSS, JavaScript, jQuery
    (Mobile too)
    Network Administration
    Routers, switches, load balancers, etc.
    Back-end Engineering
    Linux, Apache, Perl, MySQL, MongoDB, Redis, Gearman, etc.
    Systems Administration
    Help keep all those systems running.
  • craigslist is hiring!
    send resumes to: z@craigslist.org
    Plain Text or PDF, no Word Docs!
    Laid back, non-corporate environment
    Engineering-driven culture
    Lots of interesting technical challenges
    Easy SF commute
    Excellent benefits and pay
    High-impact work
    Millions use craigslist daily