Living with SQL and NoSQL at craigslist, a Pragmatic Approach

From the 2012 Percona Live MySQL Conference in Santa Clara, CA.

Craigslist uses a variety of data storage systems in its backend systems: in-memory, SQL, and NoSQL. This talk is an overview of how craigslist works with a focus on the data storage and management choices that were made in each of its major subsystems. These include MySQL, memcached, Redis, MongoDB, Sphinx, and the filesystem. Special attention will be paid to the benefits and tradeoffs associated with choosing from the various popular data storage systems, including long-term viability, support, and ease of integration.

Published in: Technology
  • Living with SQL and NoSQL at craigslist, a Pragmatic Approach

    1. Living with SQL and NoSQL at craigslist
        Jeremy Zawodny, craigslist
    2. There is no stack anymore...
        -- Mårten Mickos during Wednesday’s Keynote
    3. Data Storage at craigslist
        • MySQL
        • Memcached
        • Redis
        • MongoDB
        • Sphinx
        • Filesystem
    4. Choosing the Right Tool
        • Durability
        • Performance
        • Query API
        • Features
        • Complexity
        • Support
    5. Request Flow (reads) [diagram, from browser to data stores]
        • Browser
        • Load Balancer
        • Caching Proxy: Perl+epoll, Memcached (Proxy Cache)
        • Posting, Search, Browse
        • Web Server: Apache mod_perl, Memcached
        • Async Services: Perl+epoll, Memcached (Posting Cache)
        • haproxy
        • MongoDB: Archived Postings
        • Sphinx: Live and Archived Postings
        • MySQL: Live Postings
    6. Request Flow (reads) [diagram, image requests]
        • Browser
        • Load Balancer
        • Caching Proxy: Perl+epoll, Memcached (Proxy Cache)
        • Image Requests
        • Image Storage: Apache mod_perl, xfs+JBOD
    7. Data Repositories [diagram]
        • Stores shown: MongoDB, MySQL, Filesystem, Redis, Memcached, Sphinx
        • Contents labels, in extracted order: OldPostings, Email Meta, Postings, Finance, Images, Logs, Users, Misc Meta, Abuse, WorkQueue, Stats, Monitoring, Counters, Lists, Counters, Postings, Blobs, Monitoring, Postings, Internal Blobs, Objects, WorkQueue, Forums, Archive
    8. MySQL at craigslist
        • Vertical Partitioning: Clusters
            • auth/users, abuse/spam, postings, finance
        • Sub-partitioning: Roles
            • master, read, long read, dumper, thrash
        • Lots of SSD storage (mostly fusion-io)
            • solved most of our performance problems
        • Few manual tasks
            • re-cloning slaves, master swaps
    9. MySQL at craigslist
        • MySQL 5.5.x
            • hoping to move to 5.6.x
                • GTID + crash-safe slaves?!?!
        • InnoDB almost everywhere
            • InnoDB compression where it works well
            • Large buffer pool (48GB common)
        • haproxy sits between clients and servers
    10. MySQL at craigslist: Postings Database Cluster [diagram]
        • Database nodes by role: long read, long read, dumper, thrash, write, read, read, read, read
        • haproxy
        • client(s)
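
The cluster and role split on slides 8 through 10 amounts to a two-level lookup: pick the cluster by data domain, then the role by workload. The following is a minimal Python sketch of that idea, not craigslist's code. The endpoint naming scheme, the `classify` thresholds, and the `.example.internal` hostnames are all assumptions made for illustration; the slides do not say how clients choose a role, and they do not explain the "thrash" role, so it is listed but never routed to.

```python
# Cluster and role names come from the slides; everything else is hypothetical.
CLUSTERS = {"auth", "abuse", "postings", "finance"}
ROLES = {"master", "read", "long read", "dumper", "thrash"}

# Hypothetical naming: clients connect to haproxy, never to a MySQL host directly.
ENDPOINT_TEMPLATE = "haproxy-{cluster}-{role}.example.internal"


def classify(is_write: bool, expected_seconds: float, is_backup: bool = False) -> str:
    """Pick a role for a unit of work (illustrative rules, not from the slides)."""
    if is_backup:
        return "dumper"
    if is_write:
        return "master"
    # Assumption: a query expected to run longer than one second is a "long read".
    return "long read" if expected_seconds > 1.0 else "read"


def endpoint(cluster: str, role: str) -> str:
    """Return the (hypothetical) haproxy endpoint for a cluster/role pair."""
    if cluster not in CLUSTERS:
        raise ValueError(f"unknown cluster: {cluster}")
    if role not in ROLES:
        raise ValueError(f"unknown role: {role}")
    return ENDPOINT_TEMPLATE.format(cluster=cluster, role=role.replace(" ", "-"))
```

For example, `endpoint("postings", classify(False, 0.01))` resolves to the read pool of the postings cluster, while a multi-second reporting query would land on the long read pool and leave the regular read replicas alone.
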
    11. Why MySQL?
        • It’s the devil we know!
            • Very reliable
            • Lots of Admin and Dev skills
        • Durability
        • Replication
        • Support
            • Seriously, look at this ecosystem
        • Data Model
    12. Why memcached?
        • Wickedly Fast
        • Stable
        • Virtually zero administration required
        • Easily co-exists with CPU-intensive services
        • Multi-core? Run more instances!
    13. Memcached at craigslist
        • Primary cache for rendered pages (compressed and full), serialized objects, and misc. other data
        • Used for lots of transient data blobs (and occasional counters)
        • Custom async client library
            • Some key encoding issues
        • Durability via client-side mirroring (think RAID-1)
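
The "client-side mirroring (think RAID-1)" bullet can be pictured as a client that writes every key to two cache nodes and reads from whichever one still has it. The Python sketch below is only an illustration of that pattern under stated assumptions: `FakeCacheNode` is an in-memory stand-in for a memcached server, and both class names are hypothetical. The slides do not describe how the custom async client selects or repairs mirrors.

```python
class FakeCacheNode:
    """In-memory stand-in for one memcached server (hypothetical, for illustration)."""

    def __init__(self):
        self.data = {}
        self.up = True

    def set(self, key, value):
        if not self.up:
            raise ConnectionError("node down")
        self.data[key] = value

    def get(self, key):
        if not self.up:
            raise ConnectionError("node down")
        return self.data.get(key)


class MirroredCache:
    """Write each key to every mirror; read from the first mirror that returns a value."""

    def __init__(self, mirrors):
        self.mirrors = list(mirrors)

    def set(self, key, value):
        written = 0
        for node in self.mirrors:
            try:
                node.set(key, value)
                written += 1
            except ConnectionError:
                continue  # a dead mirror must not fail the write
        return written  # number of mirrors that accepted the value

    def get(self, key):
        for node in self.mirrors:
            try:
                value = node.get(key)
            except ConnectionError:
                continue  # fall through to the next mirror
            if value is not None:
                return value
        return None
```

With two mirrors, losing either node leaves every previously written key readable from the other, which is the RAID-1 analogy. The cost is double the memory and double the write traffic.
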
    14. Redis at craigslist
        • Primary repository of posting activity metadata used in analysis tasks
        • Remote replication in 2nd data center
        • 80+% of data in sorted sets (ZSETS)
        • Sharded multi-node cluster
            • See: http://bit.ly/I4XUCj
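
To make the sorted-set and sharding bullets concrete, here is a small in-memory Python model. It is not a Redis client and not craigslist's sharding code: `SortedSet` only imitates the semantics of the Redis `ZADD` and `ZRANGEBYSCORE` commands, and `shard_for` uses CRC32 modulo the shard count purely as an assumed placement rule, since the slides do not name one. The key and member names in the usage note are hypothetical.

```python
import bisect
import zlib


class SortedSet:
    """Tiny model of a Redis sorted set: unique members ordered by score."""

    def __init__(self):
        self._scores = {}  # member -> score
        self._order = []   # sorted list of (score, member)

    def zadd(self, score, member):
        old = self._scores.get(member)
        if old is not None:
            self._order.remove((old, member))  # re-adding a member updates its score
        self._scores[member] = score
        bisect.insort(self._order, (score, member))

    def zrangebyscore(self, lo, hi):
        """Members with lo <= score <= hi, in ascending score order."""
        start = bisect.bisect_left(self._order, (lo,))
        out = []
        for score, member in self._order[start:]:
            if score > hi:
                break
            out.append(member)
        return out


def shard_for(key, num_shards):
    """Stable key -> shard index. CRC32 is an assumption, not the documented scheme."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


class ShardedSets:
    """Route each key's sorted set to exactly one shard."""

    def __init__(self, num_shards):
        self.shards = [dict() for _ in range(num_shards)]

    def zadd(self, key, score, member):
        shard = self.shards[shard_for(key, len(self.shards))]
        shard.setdefault(key, SortedSet()).zadd(score, member)

    def zrangebyscore(self, key, lo, hi):
        zset = self.shards[shard_for(key, len(self.shards))].get(key)
        return zset.zrangebyscore(lo, hi) if zset is not None else []
```

A sorted set fits activity metadata well when the score is a timestamp: storing events under a key such as `posting:42` lets an analysis task ask for everything that happened to that posting inside a time window with a single range query.
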
    15. Why Redis?
        • Features
        • Performance
        • Flexible Persistence
        • Excellent but simple API
        • Project Vision
        • Multi-core? Run more instances!
    16. MongoDB at craigslist
        • Repository of 2.5+ billion archived postings
            • growing and growing and growing
        • 3 shards across 3 node replica sets
            • duplicate config in 2nd data center
        • ~6TB of data, sized up to 12TB
        • Biggest challenge was data migration
        • Previous talks:
            • http://bit.ly/HEYJ57 (before)
            • http://bit.ly/Hr2qMf (after)
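
The figures on this slide support a quick back-of-the-envelope sizing check. The Python sketch below uses only the numbers quoted above. It assumes that the ~6TB figure counts one copy of the data, that 1 TB means 10**12 bytes, and that data is spread evenly across shards; none of these is stated on the slide, so the results are rough estimates.

```python
SHARDS = 3                 # "3 shards"
NODES_PER_REPLICA_SET = 3  # "3 node replica sets"
DATA_TB = 6.0              # "~6TB of data"
CAPACITY_TB = 12.0         # "sized up to 12TB"
POSTINGS = 2.5e9           # "2.5+ billion", so a lower bound

# Data-bearing nodes in one data center (the second data center duplicates this).
data_nodes = SHARDS * NODES_PER_REPLICA_SET

# Assumption: data is spread evenly across shards.
per_shard_tb = DATA_TB / SHARDS

# Ratio of provisioned capacity to current data.
headroom = CAPACITY_TB / DATA_TB

# Assumption: 1 TB = 10**12 bytes and ~6TB is a single copy of the data.
avg_bytes_per_posting = DATA_TB * 1e12 / POSTINGS

print(data_nodes, per_shard_tb, headroom, round(avg_bytes_per_posting))
```

Under those assumptions the archive averages on the order of a few kilobytes per posting, and the cluster was provisioned with room to roughly double before resizing.
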
    17. Why MongoDB?
        • Schema free
        • Active community
        • Commercial support
        • Perl client!
        • Ease of scaling
            • Yay! for built-in sharding support
        • Fewer single points of failure
            • Replica sets are awesome
    18. Sphinx at craigslist
        • Full-text indexing and search of
            • all live postings
            • all archived postings
            • all forums (in progress)
        • 300+ million daily queries
    19. Why Sphinx?
        • Performance
        • Friendly API
        • Flexibility in deployment model
        • Commercial support
    20. Filesystem at craigslist
        • All uploaded images are stored in XFS
        • Multiple image sizes, resized upon upload
    21. Why Filesystem?
        • Reliable (and Simple)
            • We use XFS for images and databases
            • Proven technology
        • Fast
            • Some other filesystems have had performance issues
        • Easy to move data around
        • No other metadata/indexes to worry about
    22. So Many Data Stores...
        • Can be hard for developers if you don’t have good APIs or abstractions in place!
            • We built an object layer for our MongoDB migration
            • It speaks MySQL, Sphinx, MongoDB, Memcached
        • Relational vs. Non-Relational?
            • In practice, we often just don’t care
            • NoSQL is a stupid label
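
An object layer of the kind this slide mentions hides which store answers a request. The Python sketch below shows one possible shape for it and should be read as a guess, not as craigslist's design: the `PostingStore` interface, the `DictStore` stand-in, and the lookup order (cache, then live store, then archive) are all assumptions, loosely based on the read-flow diagram where Memcached fronts MySQL for live postings and MongoDB for archived ones.

```python
from typing import Optional, Protocol


class PostingStore(Protocol):
    """Anything that can look up a posting by ID (hypothetical interface)."""

    def get(self, posting_id: int) -> Optional[dict]: ...


class DictStore:
    """In-memory stand-in for a real backend such as Memcached, MySQL, or MongoDB."""

    def __init__(self, rows=None):
        self.rows = dict(rows or {})

    def get(self, posting_id):
        return self.rows.get(posting_id)

    def put(self, posting_id, posting):
        self.rows[posting_id] = posting


class PostingRepository:
    """Callers ask for a posting; the layer decides which store answers."""

    def __init__(self, cache: DictStore, live: PostingStore, archive: PostingStore):
        self.cache = cache
        self.live = live
        self.archive = archive

    def get(self, posting_id: int) -> Optional[dict]:
        posting = self.cache.get(posting_id)
        if posting is not None:
            return posting
        # Assumed order: live postings first, then the archive.
        for store in (self.live, self.archive):
            posting = store.get(posting_id)
            if posting is not None:
                self.cache.put(posting_id, posting)  # populate the cache on a miss
                return posting
        return None
```

Because callers only see `PostingRepository.get`, moving archived postings from one backend to another means swapping the `archive` argument, which is the property that makes such a layer useful during a migration.
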
    23. Craigslist Tech FAQs
        • Self-hosted (no virtualization or “cloud”)
        • Mix of hardware (2 main vendors)
            • Blades
            • Larger multi-U multi-disk RAID boxes
        • Mostly local storage (SAN for backups)
        • Virtually all open source infrastructure tools
        • Famously small (but growing) tech team
    24. Craigslist is Hiring!
        • Developers
            • Back-end
            • Front-end
        • Systems Administrators
        • Network Engineers
        • Email: z@craiglist.org (plain text resume!)