From the 2012 Percona Live MySQL Conference in Santa Clara, CA.
Craigslist uses a variety of data storage systems in its backend: in-memory, SQL, and NoSQL. This talk is an overview of how craigslist works, with a focus on the data storage and management choices that were made in each of its major subsystems. These include MySQL, memcached, Redis, MongoDB, Sphinx, and the filesystem. Special attention will be paid to the benefits and tradeoffs associated with choosing from the various popular data storage systems, including long-term viability, support, and ease of integration.
7. Data Repositories
• MongoDB: OldPostings, Email Meta
• MySQL: Postings, Finance, Users, Misc Meta, Abuse, WorkQueue, Stats, Monitoring
• Filesystem: Images, Logs
• Redis: Counters, Lists, Blobs, Monitoring, WorkQueue
• Memcached: Counters, Postings, Blobs, Objects
• Sphinx: Postings, Internal, Forums, Archive
8. MySQL at craigslist
• Vertical Partitioning: Clusters
• auth/users, abuse/spam, postings, finance
• Sub-partitioning: Roles
• master, read, long read, dumper, thrash
• Lots of SSD storage (mostly fusion-io)
• solved most of our performance problems
• Few manual tasks
• re-cloning slaves, master swaps
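The cluster and role split on this slide can be pictured as a two-level lookup: pick the cluster by the kind of data, then pick the role by the kind of work. The sketch below is only an illustration. The cluster and role names come from the slide, but the `WORKLOAD_ROLE` mapping, the meaning assigned to each role, the pool-name format, and the `pick_pool` helper are assumptions and are not taken from the talk.

```python
# Cluster and role names are from the slide; everything else is hypothetical.
CLUSTERS = ("auth/users", "abuse/spam", "postings", "finance")
ROLES = ("master", "read", "long read", "dumper", "thrash")

# Assumed mapping from a kind of work to the role that serves it.
# The talk does not define the roles this precisely.
WORKLOAD_ROLE = {
    "write": "master",
    "short_select": "read",
    "report": "long read",
    "backup": "dumper",
    "experiment": "thrash",
}


def pick_pool(cluster: str, workload: str) -> str:
    """Return a pool name such as 'postings:read' for a cluster and workload."""
    if cluster not in CLUSTERS:
        raise ValueError(f"unknown cluster: {cluster!r}")
    role = WORKLOAD_ROLE[workload]  # raises KeyError for an unknown workload
    return f"{cluster}:{role}"


if __name__ == "__main__":
    print(pick_pool("postings", "short_select"))
    print(pick_pool("finance", "write"))
```

Keeping the lookup in one table means adding a role or a cluster touches a single place rather than every caller.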
9. MySQL at craigslist
• MySQL 5.5.x
• hoping to move to 5.6.x
• GTID + crash-safe slaves?!?!
• InnoDB almost everywhere
• InnoDB compression where it works well
• Large buffer pool (48GB common)
• haproxy sits between clients and servers
10. MySQL at craigslist
Postings Database Cluster (diagram)
• client(s) connect through haproxy
• write: 1 server
• read: 4 servers
• long read: 2 servers
• dumper: 1 server
• thrash: 1 server
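With several interchangeable read servers behind a proxy, a client request can land on any healthy one. The sketch below models that idea with plain round-robin that skips servers marked down. It is not haproxy configuration, and the talk does not say which balancing policy craigslist uses; the `ReadPool` class and the server names are hypothetical.

```python
from itertools import cycle
from typing import Dict, Iterator, List


class ReadPool:
    """Round-robin over read servers, skipping any that are marked down."""

    def __init__(self, servers: List[str]) -> None:
        self._servers = list(servers)
        self._up: Dict[str, bool] = {s: True for s in self._servers}
        self._ring: Iterator[str] = cycle(self._servers)

    def mark(self, server: str, up: bool) -> None:
        """Record a health-check result for one server."""
        self._up[server] = up

    def next_server(self) -> str:
        # Look at each server at most once per call.
        for _ in range(len(self._servers)):
            candidate = next(self._ring)
            if self._up[candidate]:
                return candidate
        raise RuntimeError("no read servers available")


if __name__ == "__main__":
    pool = ReadPool(["read1", "read2", "read3", "read4"])
    pool.mark("read2", False)  # simulate a failed health check
    print([pool.next_server() for _ in range(4)])
```

With `read2` marked down, four consecutive calls return `read1`, `read3`, `read4`, `read1`.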
11. Why MySQL?
• It’s the devil we know!
• Very reliable
• Lots of Admin and Dev skills
• Durability
• Replication
• Support
• Seriously, look at this ecosystem
• Data Model
12. Why memcached?
• Wickedly Fast
• Stable
• Virtually zero administration required
• Easily co-exists with CPU-intensive services
• Multi-core? Run more instances!
13. Memcached at craigslist
• Primary cache for rendered pages (compressed and full), serialized objects, and misc. other data
• Used for lots of transient data blobs (and occasional counters)
• Custom async client library
• Some key encoding issues
• Durability via client-side mirroring (think RAID-1)
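Client-side mirroring in the RAID-1 sense means the client writes each value to more than one cache server and reads from whichever copy answers. The sketch below shows that pattern with in-memory stand-ins instead of real memcached connections. `FakeCacheNode`, `MirroredCache`, and their method names are hypothetical, and the real craigslist client is asynchronous, which this sketch is not. It stores only a zlib-compressed copy of each page, whereas the slide mentions both compressed and full copies.

```python
import zlib
from typing import Dict, List, Optional


class FakeCacheNode:
    """In-memory stand-in for one memcached server (hypothetical)."""

    def __init__(self) -> None:
        self.data: Dict[str, bytes] = {}
        self.alive = True

    def set(self, key: str, value: bytes) -> None:
        if not self.alive:
            raise ConnectionError("node down")
        self.data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        if not self.alive:
            raise ConnectionError("node down")
        return self.data.get(key)


class MirroredCache:
    """Write every value to all mirrors; read from the first that has it."""

    def __init__(self, mirrors: List[FakeCacheNode]) -> None:
        self.mirrors = mirrors

    def set_page(self, key: str, html: str) -> None:
        blob = zlib.compress(html.encode("utf-8"))
        for node in self.mirrors:
            try:
                node.set(key, blob)
            except ConnectionError:
                continue  # a dead mirror should not fail the whole write

    def get_page(self, key: str) -> Optional[str]:
        for node in self.mirrors:
            try:
                blob = node.get(key)
            except ConnectionError:
                continue  # fall through to the next mirror
            if blob is not None:
                return zlib.decompress(blob).decode("utf-8")
        return None  # a miss on every reachable mirror
```

The tradeoff is the usual one for mirroring: twice the memory and write traffic in exchange for a cache that survives the loss of one node without a cold start.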
14. Redis at craigslist
• Primary repository of posting activity metadata used in analysis tasks
• Remote replication in 2nd data center
• 80+% of data in sorted sets (ZSETS)
• Sharded multi-node cluster
• See: http://bit.ly/I4XUCj
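Two ideas from this slide can be sketched without a Redis server: choosing a shard from a key, and the sorted-set behaviour that holds most of the data. The code below is a toy model in plain Python. The shard count, the crc32-modulo scheme, the key format, and the `TinySortedSet` class are assumptions, since the slide does not describe how craigslist actually shards its Redis nodes. The `zadd` and `zrangebyscore` method names only mirror the Redis commands of the same name.

```python
import zlib
from typing import Dict, List, Tuple

NUM_SHARDS = 4  # hypothetical; the slide does not give a shard count


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Pick a shard with a stable hash so every client agrees on placement."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


class TinySortedSet:
    """Toy model of a sorted set: unique members, each with a float score."""

    def __init__(self) -> None:
        self._scores: Dict[str, float] = {}

    def zadd(self, member: str, score: float) -> None:
        self._scores[member] = score  # re-adding a member updates its score

    def zrangebyscore(self, lo: float, hi: float) -> List[Tuple[str, float]]:
        """Members with lo <= score <= hi, ordered by score, then by member."""
        hits = [(m, s) for m, s in self._scores.items() if lo <= s <= hi]
        return sorted(hits, key=lambda pair: (pair[1], pair[0]))


if __name__ == "__main__":
    activity = TinySortedSet()
    activity.zadd("viewed", 100.0)   # scores stand in for timestamps
    activity.zadd("flagged", 50.0)
    activity.zadd("replied", 150.0)
    print(shard_for("posting:123"), activity.zrangebyscore(60, 200))
```

Using a timestamp as the score is one common reason to reach for sorted sets: range queries over a time window become a single call.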
15. Why Redis?
• Features
• Performance
• Flexible Persistence
• Excellent but simple API
• Project Vision
• Multi-core? Run more instances!
16. MongoDB at craigslist
• Repository of 2.5+ billion archived postings
• growing and growing and growing
• 3 shards across 3 node replica sets
• duplicate config in 2nd data center
• ~6TB of data, sized up to 12TB
• Biggest challenge was data migration
• Previous talks:
• http://bit.ly/HEYJ57 (before)
• http://bit.ly/Hr2qMf (after)
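The figures on this slide allow a rough back-of-the-envelope check. The arithmetic below assumes decimal terabytes, an even spread of data across the three shards, and that "~6TB" is the total before replication; none of those assumptions is stated in the talk, so the results are order-of-magnitude only.

```python
# Figures from the slide; how they are interpreted is an assumption.
postings = 2.5e9        # "2.5+ billion archived postings"
data_bytes = 6e12       # "~6TB", taken as decimal terabytes
capacity_bytes = 12e12  # "sized up to 12TB"
shards = 3

avg_bytes_per_posting = data_bytes / postings   # about 2.4 KB, indexes included
data_per_shard_tb = data_bytes / shards / 1e12  # about 2 TB if evenly spread
headroom = capacity_bytes / data_bytes          # about 2x room to grow

if __name__ == "__main__":
    print(avg_bytes_per_posting, data_per_shard_tb, headroom)
```

Because the posting count is given as "2.5+ billion", the per-posting average is an upper bound.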
17. Why MongoDB?
• Schema free
• Active community
• Commercial support
• Perl client!
• Ease of scaling
• Yay! for built-in sharding support
• Fewer single points of failure
• Replica sets are awesome
18. Sphinx at craigslist
• Full-text indexing and search of
• all live postings
• all archived postings
• all forums (in progress)
• 300+ million daily queries
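To put the daily query count in per-second terms, the arithmetic below spreads it evenly over a day. Real traffic is not flat, so peak load would be higher than this average; the talk gives no peak figure.

```python
daily_queries = 300e6          # "300+ million daily queries"
seconds_per_day = 24 * 60 * 60

# Average rate if the load were perfectly even across the day.
avg_queries_per_second = daily_queries / seconds_per_day

if __name__ == "__main__":
    print(round(avg_queries_per_second))
```

That works out to roughly 3,500 queries per second on average, as a lower bound.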
20. Filesystem at craigslist
• All uploaded images are stored in XFS
• Multiple image sizes, resized upon upload
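Resizing at upload time means deciding the target dimensions for every stored size once, when the image arrives. The sketch below covers only that planning step, fitting an image inside a bounding box while keeping its aspect ratio. The size names and boxes in `SIZES` are hypothetical since the talk does not list them, and the actual pixel resizing and the write to XFS are left out.

```python
from typing import Dict, Tuple

# Hypothetical size names and bounding boxes; not taken from the talk.
SIZES: Dict[str, Tuple[int, int]] = {
    "thumb": (50, 50),
    "medium": (300, 300),
    "large": (600, 450),
}


def fit_within(width: int, height: int, box: Tuple[int, int]) -> Tuple[int, int]:
    """Scale (width, height) down to fit inside box, keeping aspect ratio."""
    max_w, max_h = box
    scale = min(max_w / width, max_h / height, 1.0)  # never upscale
    return max(1, round(width * scale)), max(1, round(height * scale))


def plan_resizes(width: int, height: int) -> Dict[str, Tuple[int, int]]:
    """Target dimensions for every stored size of one uploaded image."""
    return {name: fit_within(width, height, box) for name, box in SIZES.items()}


if __name__ == "__main__":
    print(plan_resizes(1200, 800))
```

Doing this work once at upload trades extra disk space for never having to resize on the read path.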
21. Why Filesystem?
• Reliable (and Simple)
• We use XFS for images and databases
• Proven technology
• Fast
• Some other filesystems have had performance issues
• Easy to move data around
• No other metadata/indexes to worry about
22. So Many Data Stores...
• Can be hard for developers if you don’t have good APIs or abstractions in place!
• We built an object layer for our MongoDB migration
• It speaks MySQL, Sphinx, MongoDB, Memcached
• Relational vs. Non-Relational?
• In practice, we often just don’t care
• NoSQL is a stupid label
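An object layer of the kind described here gives callers one entry point and hides which store answered. The sketch below is one way such a layer could look, not the craigslist implementation. `PostingBackend`, `DictBackend`, and `PostingStore` are hypothetical names, and the backends are in-memory dictionaries standing in for real MySQL and MongoDB clients.

```python
from typing import Dict, Optional, Protocol


class PostingBackend(Protocol):
    """Anything that can look up a posting by id."""

    def fetch(self, posting_id: int) -> Optional[dict]: ...


class DictBackend:
    """In-memory stand-in for a real store such as MySQL or MongoDB."""

    def __init__(self, name: str, rows: Dict[int, dict]) -> None:
        self.name = name
        self.rows = rows

    def fetch(self, posting_id: int) -> Optional[dict]:
        return self.rows.get(posting_id)


class PostingStore:
    """Single entry point: try each backend in order, return the first hit."""

    def __init__(self, *backends: PostingBackend) -> None:
        self.backends = backends

    def get(self, posting_id: int) -> Optional[dict]:
        for backend in self.backends:
            row = backend.fetch(posting_id)
            if row is not None:
                return row
        return None  # not found in any backend


if __name__ == "__main__":
    live = DictBackend("mysql", {1: {"title": "bike"}})
    archive = DictBackend("mongodb", {2: {"title": "old couch"}})
    store = PostingStore(live, archive)
    print(store.get(1), store.get(2), store.get(3))
```

Callers ask for a posting and never learn whether it was live or archived, which matches the slide's point that in practice the relational versus non-relational distinction often does not matter to the developer.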
23. Craigslist Tech FAQs
• Self-hosted (no virtualization or “cloud”)
• Mix of hardware (2 main vendors)
• Blades
• Larger multi-U multi-disk RAID boxes
• Mostly local storage (SAN for backups)
• Virtually all open source infrastructure tools
• Famously small (but growing) tech team
24. Craigslist is Hiring!
• Developers
• Back-end
• Front-end
• Systems Administrators
• Network Engineers
• Email: z@craigslist.org (plain text resume!)