
Living with SQL and NoSQL at craigslist, a Pragmatic Approach



From the 2012 Percona Live MySQL Conference in Santa Clara, CA.

Craigslist uses a variety of data storage systems in its backend systems: in-memory, SQL, and NoSQL. This talk is an overview of how craigslist works with a focus on the data storage and management choices that were made in each of its major subsystems. These include MySQL, memcached, Redis, MongoDB, Sphinx, and the filesystem. Special attention will be paid to the benefits and tradeoffs associated with choosing from the various popular data storage systems, including long-term viability, support, and ease of integration.

Published in: Technology


  1. Living with SQL and NoSQL at Craigslist
     Jeremy Zawodny, craigslist
  2. "There is no stack anymore..." -- Mårten Mickos, during Wednesday’s keynote
  3. Data Storage at craigslist
     • MySQL
     • Memcached
     • Redis
     • MongoDB
     • Sphinx
     • Filesystem
  4. Choosing the Right Tool
     • Durability
     • Performance
     • Query API
     • Features
     • Complexity
     • Support
  5. Request Flow (reads): posting, search, and browse requests [diagram]. Labels: Browser; Load Balancer; Caching Proxy (Perl+epoll, Memcached proxy cache); Web Server; Async Services; Apache mod_perl; Memcached; Perl+epoll; Memcached; Posting Cache; haproxy; MongoDB; Sphinx; MySQL; Archived Postings; Live and Archived Postings; Live Postings
  6. Request Flow (reads): image requests [diagram]. Labels: Browser; Load Balancer; Caching Proxy (Perl+epoll, Memcached proxy cache); Image Storage (Apache mod_perl, xfs+JBOD)
  7. Data Repositories [diagram]: MongoDB MySQL Filesystem Old Postings Email Meta Postings Finance Images Logs Users Misc Meta Abuse WorkQueue Stats Monitoring Redis Memcached Counters Lists Sphinx Counters Postings Blobs Monitoring Postings Internal Blobs Objects WorkQueue Forums Archive
  8. MySQL at craigslist
     • Vertical Partitioning: Clusters
       • auth/users, abuse/spam, postings, finance
     • Sub-partitioning: Roles
       • master, read, long read, dumper, thrash
     • Lots of SSD storage (mostly fusion-io)
       • solved most of our performance problems
     • Few manual tasks
       • re-cloning slaves, master swaps
  9. MySQL at craigslist
     • MySQL 5.5.x
       • hoping to move to 5.6.x
       • GTID + crash-safe slaves?!?!
     • InnoDB almost everywhere
       • InnoDB compression where it works well
       • Large buffer pool (48GB common)
     • haproxy sits between clients and servers
  10. MySQL at Craigslist: Postings Database Cluster [diagram]. Client(s) connect through haproxy to servers in these roles: write (one), read (four), long read (two), dumper (one), thrash (one)
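Slides 8 to 10 describe one cluster split into roles, with haproxy between clients and servers. The following is a minimal Python sketch of how a client might pick a role and then an endpoint for it. The host name, ports, the 5-second threshold, and the `pick_role` and `endpoint_for` helpers are all hypothetical; the talk does not say how craigslist assigns queries to roles, so the "dumper" and "thrash" roles are only reachable here by asking for them by name.

```python
from typing import Tuple

# Roles come from the slides; every endpoint below is a made-up placeholder
# for an haproxy listener that fronts the servers in that role.
ROLE_ENDPOINTS = {
    "write": ("haproxy.example", 3306),
    "read": ("haproxy.example", 3307),
    "long read": ("haproxy.example", 3308),
    "dumper": ("haproxy.example", 3309),
    "thrash": ("haproxy.example", 3310),
}


def pick_role(is_write: bool, expected_seconds: float = 0.0,
              long_read_threshold: float = 5.0) -> str:
    """Choose a role for a query (assumed policy, not from the talk)."""
    if is_write:
        return "write"
    if expected_seconds >= long_read_threshold:
        # Keep slow queries away from the servers answering quick reads.
        return "long read"
    return "read"


def endpoint_for(role: str) -> Tuple[str, int]:
    """Return the (host, port) a client would connect to for a role."""
    try:
        return ROLE_ENDPOINTS[role]
    except KeyError:
        raise ValueError(f"unknown role: {role!r}") from None
```

Keeping the role decision in the client and the server list in the proxy means servers can be re-cloned or swapped behind a listener without touching application code, which fits the "few manual tasks" point on slide 8.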
  11. Why MySQL?
      • It’s the devil we know!
        • Very reliable
        • Lots of Admin and Dev skills
      • Durability
      • Replication
      • Support
        • Seriously, look at this ecosystem
      • Data Model
  12. Why memcached?
      • Wickedly Fast
      • Stable
      • Virtually zero administration required
      • Easily co-exists with CPU-intensive services
      • Multi-core? Run more instances!
  13. Memcached at craigslist
      • Primary cache for rendered pages (compressed and full), serialized objects, and misc. other data
      • Used for lots of transient data blobs (and occasional counters)
      • Custom async client library
        • Some key encoding issues
      • Durability via client-side mirroring (think RAID-1)
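Slide 13 mentions durability via client-side mirroring, compared to RAID-1. A rough Python sketch of that idea follows: every write goes to all mirrors, and a read is served by the first mirror that has the value. `InMemoryCache` is a hypothetical stand-in for a memcached server (no network I/O), and `MirroredCache` is an illustrative name; neither reflects craigslist's custom async client library, whose details the talk does not give.

```python
class InMemoryCache:
    """Stand-in for one memcached server (hypothetical, in-process only)."""

    def __init__(self):
        self._data = {}
        self.up = True  # flip to False to simulate a dead node

    def set(self, key, value):
        if not self.up:
            raise ConnectionError("cache node down")
        self._data[key] = value

    def get(self, key):
        if not self.up:
            raise ConnectionError("cache node down")
        return self._data.get(key)


class MirroredCache:
    """Write to every mirror; read from the first mirror holding the key."""

    def __init__(self, mirrors):
        self.mirrors = list(mirrors)

    def set(self, key, value):
        accepted = 0
        for node in self.mirrors:
            try:
                node.set(key, value)
                accepted += 1
            except ConnectionError:
                continue  # a dead mirror must not fail the whole write
        return accepted  # number of mirrors that stored the value

    def get(self, key):
        for node in self.mirrors:
            try:
                value = node.get(key)
            except ConnectionError:
                continue  # fall through to the next mirror
            if value is not None:
                return value
        return None
```

With two mirrors, losing one node costs no cached data, at the price of doubling write traffic and memory use.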
  14. Redis at craigslist
      • Primary repository of posting activity metadata used in analysis tasks
      • Remote replication in 2nd data center
      • 80+% of data in sorted sets (ZSETS)
      • Sharded multi-node cluster
        • See:
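Slide 14 combines two ideas: data kept in sorted sets, and keys spread over a sharded cluster. The Python sketch below illustrates both under stated assumptions. `FakeShard` is an in-process stand-in that imitates the ordering of the Redis `ZADD` and `ZRANGEBYSCORE` commands (by score, then by member); the CRC32-modulo shard choice, the `activity:<id>` key format, and the use of timestamps as scores are hypothetical and not taken from the talk.

```python
import zlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index with a stable hash (assumed scheme)."""
    return zlib.crc32(key.encode("utf-8")) % shard_count


class FakeShard:
    """In-process stand-in for one Redis node holding sorted sets."""

    def __init__(self):
        self._zsets = {}

    def zadd(self, key, member, score):
        # Re-adding a member just updates its score, as in Redis.
        self._zsets.setdefault(key, {})[member] = score

    def zrangebyscore(self, key, lo, hi):
        items = self._zsets.get(key, {}).items()
        ordered = sorted(items, key=lambda kv: (kv[1], kv[0]))
        return [member for member, score in ordered if lo <= score <= hi]


class ShardedActivity:
    """Posting activity stored as one sorted set per posting."""

    def __init__(self, shard_count=4):
        self.shards = [FakeShard() for _ in range(shard_count)]

    def _locate(self, posting_id):
        key = f"activity:{posting_id}"  # hypothetical key format
        return self.shards[shard_for(key, len(self.shards))], key

    def record(self, posting_id, event, timestamp):
        shard, key = self._locate(posting_id)
        shard.zadd(key, event, timestamp)

    def events_between(self, posting_id, start, end):
        shard, key = self._locate(posting_id)
        return shard.zrangebyscore(key, start, end)
```

Because all events for a posting live under one key, a time-window query touches exactly one shard.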
  15. Why Redis?
      • Features
      • Performance
      • Flexible Persistence
      • Excellent but simple API
      • Project Vision
      • Multi-core? Run more instances!
  16. MongoDB at craigslist
      • Repository of 2.5+ billion archived postings
        • growing and growing and growing
      • 3 shards across 3-node replica sets
        • duplicate config in 2nd data center
      • ~6TB of data, sized up to 12TB
      • Biggest challenge was data migration
      • Previous talks:
        • (before)
        • (after)
  17. Why MongoDB?
      • Schema free
      • Active community
      • Commercial support
      • Perl client!
      • Ease of scaling
        • Yay! for built-in sharding support
      • Fewer single points of failure
        • Replica sets are awesome
  18. Sphinx at craigslist
      • Full-text indexing and search of
        • all live postings
        • all archived postings
        • all forums (in progress)
      • 300+ million daily queries
  19. Why Sphinx?
      • Performance
      • Friendly API
      • Flexibility in deployment model
      • Commercial support
  20. Filesystem at craigslist
      • All uploaded images are stored in XFS
      • Multiple image sizes, resized upon upload
  21. Why Filesystem?
      • Reliable (and Simple)
        • We use XFS for images and databases
        • Proven technology
      • Fast
        • Some other filesystems have had performance issues
      • Easy to move data around
      • No other metadata/indexes to worry about
  22. So Many Data Stores...
      • Can be hard for developers if you don’t have good APIs or abstractions in place!
        • We built an object layer for our MongoDB migration
        • It speaks MySQL, Sphinx, MongoDB, Memcached
      • Relational vs. Non-Relational?
        • In practice, we often just don’t care
        • NoSQL is a stupid label
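Slide 22 mentions an object layer that hides which store holds a posting. The Python sketch below shows one way such a layer could look: callers ask for a posting by ID, and the layer tries a cache, then the live store, then the archive. The class names, the `fetch`/`store` methods, and the lookup order are assumptions for illustration; the talk does not describe the real layer's API. `DictBackend` is a dictionary-backed stand-in for Memcached, MySQL, and MongoDB.

```python
from typing import Optional, Protocol


class PostingBackend(Protocol):
    """Anything that can look up a posting by ID."""

    def fetch(self, posting_id: int) -> Optional[dict]: ...


class DictBackend:
    """Dictionary-backed stand-in for a real store (hypothetical)."""

    def __init__(self, name: str, rows: Optional[dict] = None):
        self.name = name
        self.rows = dict(rows or {})

    def fetch(self, posting_id: int) -> Optional[dict]:
        return self.rows.get(posting_id)

    def store(self, posting_id: int, posting: dict) -> None:
        self.rows[posting_id] = posting


class PostingStore:
    """One lookup call for callers, several stores behind it."""

    def __init__(self, cache: DictBackend, live: PostingBackend,
                 archive: PostingBackend):
        self.cache = cache
        self.live = live
        self.archive = archive

    def get(self, posting_id: int) -> Optional[dict]:
        posting = self.cache.fetch(posting_id)
        if posting is not None:
            return posting
        # Cache miss: try live postings first, then the archive.
        for backend in (self.live, self.archive):
            posting = backend.fetch(posting_id)
            if posting is not None:
                self.cache.store(posting_id, posting)  # fill the cache
                return posting
        return None
```

Callers never learn whether a posting came from a relational or non-relational store, which is the sense in which the slide says "we often just don’t care".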
  23. Craigslist Tech FAQs
      • Self-hosted (no virtualization or “cloud”)
      • Mix of hardware (2 main vendors)
        • Blades
        • Larger multi-U multi-disk RAID boxes
      • Mostly local storage (SAN for backups)
      • Virtually all open source infrastructure tools
      • Famously small (but growing) tech team
  24. Craigslist is Hiring!
      • Developers
        • Back-end
        • Front-end
      • Systems Administrators
      • Network Engineers
      • Email: plain text resume!