MySQL and Search at Craigslist


           Jeremy Zawodny
        jzawodn@craigslist.org
          http://craigslist.org/...
Transcript of "My Sql And Search At Craigslist"

  1. MySQL and Search at Craigslist
     Jeremy Zawodny
     jzawodn@craigslist.org
     http://craigslist.org/
     Jeremy@Zawodny.com
     http://jeremy.zawodny.com/blog/
  2. Who Am I?
     ● Creator and co-author of High Performance MySQL
     ● Creator of mytop
     ● Perl Hacker
     ● MySQL Geek
     ● Craigslist Engineer (as of July, 2008)
         – MySQL, Data, Search, Perl
     ● Ex-Yahoo (Perl, MySQL, Search, Web Services)
  3. What is Craigslist?
  4. What is Craigslist?
     ● Local Classifieds
         – Jobs, Housing, Autos, Goods, Services
     ● ~500 cities world-wide
     ● Free
         – Except for jobs in ~18 cities and brokered apartments in NYC
         – Over 20B pageviews/month
         – 50M monthly users
         – 50+ countries, multiple languages
         – 40+M ads/month, 10+M images
  5. What is Craigslist?
     ● Forums
         – 100M posts
         – 100s of forums
  6. Technical and other Challenges
     ● High ad churn rate
         – Post half-life can be short
     ● Growth
     ● High traffic volume
     ● Back-end tools and data analysis needs
     ● Growth
     ● Need to archive postings... forever!
         – 100s of millions, searchable
     ● Internationalization and UTF-8
  7. Technical and other Challenges
     ● Small Team
         – Fires take priority
         – Infrastructure gets creaky
         – Organic code and schema growth over years
     ● Growth
     ● Lack of abstractions
         – Too much embedded SQL in code
     ● Documentation vs. Institutional Knowledge
         – “Why do we have things configured like this?”
  8. Goals
     ● Use Open Source
     ● Keep infrastructure small and simple
         – Lower power is good!
         – Efficiency all around
         – Do more with less
     ● Keep site easy and approachable
         – Don't overload with features
         – People are easily confused
  9. Craigslist Internals Overview
     [Architecture diagram. Labels in reading order: Load Balancer; Read Proxy Array; Write Proxy Array; Perl + memcached; ...; Web Read Array; Apache 1.3 + mod_perl; Object Cache; Search Cluster; Perl + memcached; Sphinx; Read DB Cluster; MySQL 5.0.xx]
     Not Included:
         - user db, image db
         - async tasks, email
         - accounting, internal tools
         - and more!
  10. Vertical Partitioning: Roles
      [Diagram. Top row: Users, Classifieds, Forums. Bottom row: Write, Read, Long, Trash, Stats, Archive]
  11. Vertical Partitioning
      ● Different roles have different access patterns
          – Sub-roles based on query type
      ● Easier to manage and scale
      ● Logical, self-contained data
      ● Servers may not need to be as big/fast/expensive
      ● Difficult to do retroactively
      ● Various named db “handles” in code
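The named db “handles” on slide 11 could look roughly like the sketch below. The deck does not show the real lookup, so the table layout, the function name `handle_for`, and every hostname here are hypothetical placeholders; only the role and sub-role names come from slide 10.

```python
# Hypothetical registry of named handles keyed by (role, sub-role).
# Hostnames are placeholders for illustration, not Craigslist's real hosts.
HANDLES = {
    ("classifieds", "write"): "db-classifieds-write.example",
    ("classifieds", "read"): "db-classifieds-read.example",
    ("classifieds", "long"): "db-classifieds-long.example",
    ("forums", "read"): "db-forums-read.example",
    ("users", "write"): "db-users-write.example",
}


def handle_for(role: str, sub_role: str) -> str:
    """Return the host registered for a (role, sub-role) pair."""
    try:
        return HANDLES[(role, sub_role)]
    except KeyError:
        raise LookupError(f"no handle named {role}/{sub_role}") from None
```

Code that asks for `handle_for("classifieds", "long")` never needs to know which machine serves slow queries, which is presumably what makes each role easier to scale on its own.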
  12. Horizontal Partitioning: Hydra
      [Diagram: client; cluster_01, cluster_02, cluster_03, ... cluster_N]
  13. Horizontal Partitioning: Hydra
      ● Need to retrofit a lot of code
      ● Need non-blocking Perl MySQL client
      ● Wrapped http://code.google.com/p/perl-mysql-async/
      ● Eventually can size DB boxes based on price/power and adjust mapping function(s)
          – Choose hardware first
          – Make the db “fit”
      ● Archiving lets us age a cluster instead of migrating its data to a new one.
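Slide 13 mentions adjusting the mapping function(s) to match hardware chosen on price/power. The deck does not say how Hydra maps data to clusters, so the following is only one possible sketch: a weighted hash in which a larger box takes a larger share of the keys. The weights, the use of CRC32, and the name `cluster_for` are all assumptions.

```python
import zlib

# Hypothetical weights: cluster_03 is assumed to be twice the size of the others.
CLUSTER_WEIGHTS = {"cluster_01": 1, "cluster_02": 1, "cluster_03": 2}


def cluster_for(posting_id: int, weights: dict[str, int] = CLUSTER_WEIGHTS) -> str:
    """Map a posting ID to a cluster in proportion to the cluster weights."""
    total = sum(weights.values())
    # crc32 gives the same value in every process, so all clients agree.
    slot = zlib.crc32(str(posting_id).encode("ascii")) % total
    # Walk the clusters in a fixed order, giving each one `weight` slots.
    for name in sorted(weights):
        if slot < weights[name]:
            return name
        slot -= weights[name]
    raise AssertionError("unreachable: slot is always below the total weight")
```

Changing a weight remaps some existing keys, so a scheme like this fits best when old clusters are left to age out, as the last bullet of the slide suggests.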
  14. Search Evolution
      ● Problem: Users want to find stuff.
      ● Solution: Use MySQL Full Text.
      ● ...time passes...
      ● Problem: MySQL Full Text Doesn't Scale!
      ● Solution: Use Sphinx.
      ● ...time passes...
      ● Problem: Sphinx doesn't scale!
      ● Solution: Patch Sphinx.
  15. MySQL Full-Text Problems
      ● Hitting invisible limits
          – CPU not pegged, Memory available
          – Disk I/O not unreasonable
          – Locking / Mutex contention? Probably.
      ● MyISAM has occasional crashing / corruption
      ● 5 clusters of 5 machines
          – Partitioning based on city and category
          – All “hand balanced” and high-maintenance
      ● ~30M queries/day
          – Close to limits
  16. Sphinx: My First CL Project
      ● Sphinx is designed for text search
      ● Fast and lean C++ code
      ● Forking model scales well on multi-core
      ● Control over indexing, weighting, etc.
      ● Also spent some time looking at Apache Solr
  17. Search Implementation Details
      ● Partitioning based on cities (each has a numeric id)
      ● Attributes vs. Keywords
      ● Persistent Connections
          – Custom client and server modifications
      ● Minimal stopword List
      ● Partition into 2 clusters (1 master, 4 slaves)
  18. Sphinx Incremental Indexing
      ● Re-index every N minutes
      ● Use main + delta strategy
          – Adopted as: index + today + delta
          – One set per city (~500 * 3)
      ● Slaves handle live queries, update via rsync
      ● Need lots of FDs
      ● Use all 4 cores to index
      ● Every night, perform “daily merge”
      ● Generate config files via Perl
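To make the “one set per city (~500 * 3)” figure on slide 18 concrete, here is a small sketch that enumerates the index names a config generator would have to emit. The tier names come from the slide; the `city<ID>_<tier>` naming scheme and both function names are hypothetical, and the talk's own generator was written in Perl.

```python
TIERS = ("index", "today", "delta")  # tier names as listed on the slide


def index_names(city_id: int) -> list[str]:
    """Names of the three indexes for one city (naming scheme is hypothetical)."""
    return [f"city{city_id}_{tier}" for tier in TIERS]


def all_index_names(city_ids) -> list[str]:
    """Every index a generated config would need to declare."""
    return [name for city_id in city_ids for name in index_names(city_id)]


names = all_index_names(range(1, 501))
print(len(names))  # 500 cities * 3 tiers = 1500 indexes
```

With about 1,500 indexes, each made of several files, the “need lots of FDs” bullet follows directly.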
  19. Sphinx Incremental Indexing
  20. Sphinx Issues
      ● Merge bugs [fixed]
      ● File descriptor corruption [fixed]
      ● Persistent connections [fixed]
          – Overhead of fork() was substantial in our testing
          – 200 queries/sec vs. 1,000 queries/sec per box
      ● Missing attribute updates [unreported]
      ● Bogus docids in responses
      ● We need to upgrade to latest Sphinx soon
      ● Andrew and team have been excellent!
  21. Search Project Results
      ● From 25 MySQL Boxes to 10 Sphinx
      ● Lots more headroom!
      ● New Features
          – Nearby Search
      ● No seizing or locking issues
      ● 1,000+ qps during peak w/room to grow
      ● 50M queries per day w/steady growth
      ● Cluster partitioning built but not needed (yet?)
      ● Better separation of code
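The two throughput figures on slide 21 can be related with simple arithmetic. The sketch below only converts the daily total into an average rate; the function name is made up for illustration, and nothing here is a measurement beyond the slide's own numbers.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_qps(queries_per_day: int) -> float:
    """Average queries per second implied by a daily total."""
    return queries_per_day / SECONDS_PER_DAY


avg = average_qps(50_000_000)
print(f"average: {avg:.0f} qps")  # about 579 qps, against the 1,000+ qps peak
```

So the quoted peak is a little under twice the daily average.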
  22. Sphinx Wishlist
      ● Efficient delete handling (kill lists)
      ● Non-fatal “missing” indexes
      ● Index dump tool
      ● Live document add/change/delete
      ● Built-in replication
      ● Stats and counters
      ● Text attributes
      ● Protocol checksum
  23. Data Archiving, Replication, Indexes
      ● Problem: We want to keep everything.
      ● Solution: Archive to an archive cluster.
      ● Problem: Archiving is too painful. Index updates are expensive! Slaves affected.
      ● Solution: Archive with home-grown eventually consistent replication.
  24. Data Archiving: OOB Replication
      ● Eventual Consistency
      ● Master process
          – SET SQL_LOG_BIN=0
          – Select expired IDs
          – Export records from live master
          – Import records into archive master
          – Delete expired from live master
          – Add IDs to list
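The master-process steps on slide 24 can be sketched as follows. This is an illustration under stated assumptions, not the production code: in-memory SQLite databases stand in for the live and archive MySQL masters, the `postings` and `expired_ids` tables and their columns are invented, and keeping the ID list in a table on the live master is a guess, since the slide does not say where the list lives.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create a stand-in database with a hypothetical schema."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE postings (id INTEGER PRIMARY KEY, body TEXT, expires_at INTEGER)")
    db.execute("CREATE TABLE expired_ids (seq INTEGER PRIMARY KEY AUTOINCREMENT, posting_id INTEGER)")
    return db


def archive_expired(live: sqlite3.Connection, archive: sqlite3.Connection, now: int) -> list[int]:
    """Move expired postings from the live master to the archive master."""
    # On MySQL the session would first run SET SQL_LOG_BIN=0 so that these
    # writes stay out of the binary log; SQLite has no equivalent.
    rows = live.execute(
        "SELECT id, body, expires_at FROM postings WHERE expires_at <= ? ORDER BY id",
        (now,),
    ).fetchall()
    if not rows:
        return []
    ids = [(row[0],) for row in rows]
    # Import into the archive before deleting, so a crash cannot lose a record.
    archive.executemany("INSERT OR REPLACE INTO postings VALUES (?, ?, ?)", rows)
    archive.commit()
    live.executemany("DELETE FROM postings WHERE id = ?", ids)
    # The seq column gives slave processes an increasing cursor to follow.
    live.executemany("INSERT INTO expired_ids (posting_id) VALUES (?)", ids)
    live.commit()
    return [posting_id for (posting_id,) in ids]
```

Because the deletes bypass the binary log, the slaves never see them through normal replication, which is why the list of expired IDs has to be published separately.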
  25. Data Archiving: OOB Replication
      ● Slave process
          – One per MySQL slave
          – Throttled to minimize impact
          – State kept on slave
              ● Clone friendly
          – Simple logic
              ● Select expired IDs added since my sequence number
              ● Delete expired records
              ● Update local “last seen” sequence number
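The slave-side logic on slide 25 might be sketched as below. As before, SQLite stands in for a MySQL slave, and the `archive_state` table, the `(sequence, posting_id)` pair format, and the `batch` parameter used for throttling are assumptions made for illustration.

```python
import sqlite3


def make_slave(posting_ids) -> sqlite3.Connection:
    """Create a stand-in slave with a hypothetical schema and a cursor at 0."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE postings (id INTEGER PRIMARY KEY)")
    db.execute("CREATE TABLE archive_state (last_seen INTEGER NOT NULL)")
    db.execute("INSERT INTO archive_state VALUES (0)")
    db.executemany("INSERT INTO postings VALUES (?)", [(i,) for i in posting_ids])
    db.commit()
    return db


def apply_expired(slave: sqlite3.Connection, expired: list[tuple[int, int]], batch: int = 100) -> int:
    """Delete up to `batch` expired postings and return the new sequence number.

    `expired` holds (sequence, posting_id) pairs published by the master process.
    """
    (last_seen,) = slave.execute("SELECT last_seen FROM archive_state").fetchone()
    # Only entries added since this slave's own sequence number are new to it.
    todo = sorted(entry for entry in expired if entry[0] > last_seen)[:batch]
    if not todo:
        return last_seen
    slave.executemany("DELETE FROM postings WHERE id = ?", [(pid,) for _, pid in todo])
    last_seen = todo[-1][0]
    # The cursor is stored on the slave, so a cloned slave resumes correctly.
    slave.execute("UPDATE archive_state SET last_seen = ?", (last_seen,))
    slave.commit()
    return last_seen
```

A caller would run this in a loop and sleep between batches, which is one simple way to get the throttling the slide asks for.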
  26. Long Term Data Archiving
      ● Schema coupling is bad
          – ALTER TABLE takes forever
          – Lots of NULLs flying around
      ● CouchDB or similar long-term?
          – Schema-free feels like a good fit
      ● Tested some home grown solutions already
      ● Separate storage and indexing?
          – Indexing with Sphinx?
  27. Drizzle, XtraDB, Future Stuff
      ● CouchDB looks very interesting. Maybe for archive?
      ● XtraDB / InnoDB plugin
          – Better concurrency
          – Better tuning of InnoDB internals
      ● libdrizzle + Perl
          – DBI/DBD may not fit an async model well
          – Can talk to both MySQL and Drizzle!
      ● Oracle buying Sun?!?!
  28. We're Hiring!
      ● Work in San Francisco
      ● Flexible, Small Company
      ● Excellent Benefits
      ● Help Millions of People Every Week
      ● We Need Perl/MySQL Hackers
      ● Come Help us Scale and Grow
  29. Questions?