Living with SQL and NoSQL at craigslist, a Pragmatic Approach

From the 2012 Percona Live MySQL Conference in Santa Clara, CA.

Craigslist uses a variety of data storage systems in its backend systems: in-memory, SQL, and NoSQL. This talk is an overview of how craigslist works with a focus on the data storage and management choices that were made in each of its major subsystems. These include MySQL, memcached, Redis, MongoDB, Sphinx, and the filesystem. Special attention will be paid to the benefits and tradeoffs associated with choosing from the various popular data storage systems, including long-term viability, support, and ease of integration.


  • 1. Living with SQL and NoSQL at Craigslist. Jeremy Zawodny, craigslist
  • 2. "There is no stack anymore..." -- Mårten Mickos, during Wednesday’s keynote
  • 3. Data Storage at craigslist
      • MySQL
      • Memcached
      • Redis
      • MongoDB
      • Sphinx
      • Filesystem
  • 4. Choosing the Right Tool
      • Durability
      • Performance
      • Query API
      • Features
      • Complexity
      • Support
  • 5. Request Flow (reads): Posting, Search, Browse
      [Diagram. Labels in extraction order: Browser; Load Balancer; Caching Proxy (Perl+epoll, Memcached Proxy Cache); Web Server; Async Services; Apache mod_perl; Memcached; Perl+epoll; Memcached Posting Cache; haproxy; MongoDB (Archived Postings); Sphinx (Live and Archived Postings); MySQL (Live Postings)]
  • 6. Request Flow (reads): Image Requests
      [Diagram. Labels in extraction order: Browser; Load Balancer; Caching Proxy (Perl+epoll, Memcached Proxy Cache); Image Storage (Apache mod_perl, xfs+JBOD)]
  • 7. Data Repositories
      [Diagram. Repositories: MongoDB, MySQL, Filesystem, Redis, Memcached, Sphinx. Content labels as extracted (their grouping in the diagram is not recoverable): OldPostings Email Meta Postings Finance Images Logs Users Misc Meta Abuse WorkQueue Stats Monitoring Counters Lists Counters Postings Blobs Monitoring Postings Internal Blobs Objects WorkQueue Forums Archive]
  • 8. MySQL at craigslist
      • Vertical Partitioning: Clusters
          • auth/users, abuse/spam, postings, finance
      • Sub-partitioning: Roles
          • master, read, long read, dumper, thrash
      • Lots of SSD storage (mostly fusion-io)
          • solved most of our performance problems
      • Few manual tasks
          • re-cloning slaves, master swaps
  • 9. MySQL at craigslist
      • MySQL 5.5.x
          • hoping to move to 5.6.x
          • GTID + crash-safe slaves?!?!
      • InnoDB almost everywhere
          • InnoDB compression where it works well
          • Large buffer pool (48GB common)
      • haproxy sits between clients and servers
  • 10. MySQL at Craigslist: Postings Database Cluster
      [Diagram. Labels: client(s); haproxy; server roles: write, read (x4), long read (x2), dumper, thrash]
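The cluster and role split on the MySQL slides can be pictured as a two-level lookup: first the cluster (vertical partition), then the role within it. The following is only a minimal sketch of that idea. The host names, the `CLUSTERS` table, and the `route` function are hypothetical, and in the setup described in the talk haproxy sits between clients and servers, so the real balancing presumably happens there rather than in client code.

```python
import random

# Hypothetical layout: clusters are vertical partitions, roles are
# sub-partitions. Host names are invented for illustration only.
CLUSTERS = {
    "postings": {
        "master": ["postings-master-1"],
        "read": ["postings-read-1", "postings-read-2"],
        "long read": ["postings-longread-1"],
    },
    "users": {
        "master": ["users-master-1"],
        "read": ["users-read-1"],
    },
}


def pick_host(cluster, role, rng=random):
    """Return one host serving the given role in the given cluster."""
    try:
        hosts = CLUSTERS[cluster][role]
    except KeyError:
        raise LookupError(f"no hosts for cluster={cluster!r} role={role!r}") from None
    return rng.choice(hosts)


def route(cluster, is_write, long_running=False):
    """Map the shape of a query to a role, then to a host."""
    if is_write:
        role = "master"
    elif long_running:
        role = "long read"
    else:
        role = "read"
    return pick_host(cluster, role)
```

Keeping slow analytical queries on their own "long read" hosts, as the role list suggests, means they cannot delay the short reads that serve page requests.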
  • 11. Why MySQL?
      • It’s the devil we know!
          • Very reliable
          • Lots of Admin and Dev skills
      • Durability
      • Replication
      • Support
          • Seriously, look at this ecosystem
      • Data Model
  • 12. Why memcached?
      • Wickedly Fast
      • Stable
      • Virtually zero administration required
      • Easily co-exists with CPU-intensive services
      • Multi-core? Run more instances!
  • 13. Memcached at craigslist
      • Primary cache for rendered pages (compressed and full), serialized objects, and misc. other data
      • Used for lots of transient data blobs (and occasional counters)
      • Custom async client library
          • Some key encoding issues
      • Durability via client-side mirroring (think RAID-1)
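"Client-side mirroring (think RAID-1)" suggests that the client writes each key to more than one cache server and can read it back from any of them. The sketch below illustrates that pattern under stated assumptions. `FakeCacheNode` is an in-memory stand-in for a memcached server, and `MirroredCache` is a hypothetical name. Neither is craigslist's custom async client.

```python
class FakeCacheNode:
    """In-memory stand-in for one memcached server (hypothetical)."""

    def __init__(self):
        self.data = {}
        self.up = True

    def set(self, key, value):
        if not self.up:
            raise ConnectionError("node down")
        self.data[key] = value

    def get(self, key):
        if not self.up:
            raise ConnectionError("node down")
        return self.data.get(key)


class MirroredCache:
    """Write every key to all mirrors; read from the first that has it."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def set(self, key, value):
        accepted = 0
        for node in self.nodes:
            try:
                node.set(key, value)
                accepted += 1
            except ConnectionError:
                pass  # a dead mirror must not fail the write
        return accepted  # number of mirrors that stored the value

    def get(self, key):
        for node in self.nodes:
            try:
                value = node.get(key)
            except ConnectionError:
                continue  # try the next mirror
            if value is not None:
                return value
        return None
```

With two mirrors, a value survives the loss or restart of either node, at the cost of storing everything twice.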
  • 14. Redis at craigslist
      • Primary repository of posting activity metadata used in analysis tasks
      • Remote replication in 2nd data center
      • 80+% of data in sorted sets (ZSETS)
      • Sharded multi-node cluster
          • See:
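A sorted set keeps unique members ordered by a numeric score, which suits time-stamped activity data, and a sharded deployment needs a stable rule for mapping each key to a node. The sketch below shows both ideas in plain Python. `TinyZSet` only imitates the behaviour of the Redis `ZADD` and `ZRANGEBYSCORE` commands in memory. The CRC32-based `shard_for` rule and the key names are assumptions, since the slide does not say how craigslist shards its keys.

```python
import zlib


def shard_for(key, num_shards):
    """Pick a shard index from a stable hash of the key (assumed scheme)."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


class TinyZSet:
    """In-memory stand-in for a Redis sorted set."""

    def __init__(self):
        self.scores = {}

    def zadd(self, member, score):
        self.scores[member] = score  # re-adding a member updates its score

    def zrangebyscore(self, lo, hi):
        hits = [(s, m) for m, s in self.scores.items() if lo <= s <= hi]
        return [m for s, m in sorted(hits)]  # ordered by score, then member


class ShardedActivity:
    """Route each key to one shard; each shard holds sorted sets by key."""

    def __init__(self, num_shards):
        self.shards = [dict() for _ in range(num_shards)]

    def _shard(self, key):
        return self.shards[shard_for(key, len(self.shards))]

    def record(self, key, member, score):
        self._shard(key).setdefault(key, TinyZSet()).zadd(member, score)

    def between(self, key, lo, hi):
        zset = self._shard(key).get(key)
        return zset.zrangebyscore(lo, hi) if zset else []
```

Because the shard is derived from the key alone, every client that uses the same rule and the same shard count finds the same data without any central lookup.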
  • 15. Why Redis?
      • Features
      • Performance
      • Flexible Persistence
      • Excellent but simple API
      • Project Vision
      • Multi-core? Run more instances!
  • 16. MongoDB at craigslist
      • Repository of 2.5+ billion archived postings
          • growing and growing and growing
      • 3 shards across 3-node replica sets
          • duplicate config in 2nd data center
      • ~6TB of data, sized up to 12TB
      • Biggest challenge was data migration
      • Previous talks:
          • (before)
          • (after)
  • 17. Why MongoDB?
      • Schema free
      • Active community
      • Commercial support
      • Perl client!
      • Ease of scaling
          • Yay! for built-in sharding support
      • Fewer single points of failure
          • Replica sets are awesome
  • 18. Sphinx at craigslist
      • Full-text indexing and search of
          • all live postings
          • all archived postings
          • all forums (in progress)
      • 300+ million daily queries
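To put the daily query figure in per-second terms, a quick back-of-the-envelope calculation helps. It assumes the load is spread evenly over the day, which real traffic is not, and "300+" is a lower bound, so peak rates are likely well above this average.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

daily_queries = 300_000_000  # lower bound taken from the slide
average_qps = daily_queries / SECONDS_PER_DAY  # roughly 3,472 queries per second

print(f"average: {average_qps:,.0f} queries/second")
```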
  • 19. Why Sphinx?
      • Performance
      • Friendly API
      • Flexibility in deployment model
      • Commercial support
  • 20. Filesystem at craigslist
      • All uploaded images are stored in XFS
      • Multiple image sizes, resized upon upload
  • 21. Why Filesystem?
      • Reliable (and Simple)
          • We use XFS for images and databases
          • Proven technology
      • Fast
          • Some other filesystems have had performance issues
      • Easy to move data around
      • No other metadata/indexes to worry about
  • 22. So Many Data Stores...
      • Can be hard for developers if you don’t have good APIs or abstractions in place!
          • We built an object layer for our MongoDB migration
          • It speaks MySQL, Sphinx, MongoDB, Memcached
      • Relational vs. Non-Relational?
          • In practice, we often just don’t care
          • NoSQL is a stupid label
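The object layer mentioned on this slide is not described in detail, but the general idea of hiding several stores behind one interface can be sketched as follows. Everything here is hypothetical: `DictBackend` stands in for real MySQL, MongoDB, or Memcached clients, and `PostingStore` and its method names are invented for illustration.

```python
class DictBackend:
    """In-memory stand-in for a real data store (hypothetical)."""

    def __init__(self, name, rows=None):
        self.name = name
        self.rows = dict(rows or {})

    def get(self, posting_id):
        return self.rows.get(posting_id)

    def put(self, posting_id, posting):
        self.rows[posting_id] = posting

    def delete(self, posting_id):
        return self.rows.pop(posting_id, None)


class PostingStore:
    """Single entry point that hides which backend holds a posting."""

    def __init__(self, cache, live, archive):
        self.cache = cache
        self.live = live
        self.archive = archive

    def fetch(self, posting_id):
        posting = self.cache.get(posting_id)
        if posting is not None:
            return posting
        # Callers never need to know whether a posting is live or archived.
        for backend in (self.live, self.archive):
            posting = backend.get(posting_id)
            if posting is not None:
                self.cache.put(posting_id, posting)
                return posting
        return None

    def archive_posting(self, posting_id):
        """Move a posting from the live store to the archive store."""
        posting = self.live.delete(posting_id)
        if posting is None:
            return False
        self.archive.put(posting_id, posting)
        return True
```

With a layer like this, moving archived postings from one database to another changes which backend object is passed in, not the code that calls `fetch`.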
  • 23. Craigslist Tech FAQs
      • Self-hosted (no virtualization or “cloud”)
      • Mix of hardware (2 main vendors)
          • Blades
          • Larger multi-U multi-disk RAID boxes
      • Mostly local storage (SAN for backups)
      • Virtually all open source infrastructure tools
      • Famously small (but growing) tech team
  • 24. Craigslist is Hiring!
      • Developers
          • Back-end
          • Front-end
      • Systems Administrators
      • Network Engineers
      • Email: plain text resume!