MySQL And Search At Craigslist


My talk as given at the 2009 MySQL Conference and Expo in Santa Clara, CA

Published in: Technology
  • Great preso! Thanks for uploading it.

    We use sphinx at slideshare as well but we've been having some reliability problems lately. Need to get down to what the cause is ... I'm sure it's our fault, not sphinx!
  • Welcome to the neighborhood Jeremy. Great slideshow.

    You said 'Also spent some time looking at Apache Solr'. What was the result? Why did you not use Solr?

MySQL And Search At Craigslist

  1. MySQL and Search at Craigslist <ul><ul><li>Jeremy Zawodny </li></ul></ul><ul><ul><li>[email_address] </li></ul></ul><ul><ul><li>[email_address] </li></ul></ul>
  2. Who Am I? <ul><li>Creator and co-author of High Performance MySQL </li></ul><ul><li>Creator of mytop </li></ul><ul><li>Perl Hacker </li></ul><ul><li>MySQL Geek </li></ul><ul><li>Craigslist Engineer (as of July, 2008) </li></ul><ul><ul><li>MySQL, Data, Search, Perl </li></ul></ul><ul><li>Ex-Yahoo (Perl, MySQL, Search, Web Services) </li></ul>
  3. What is Craigslist?
  4. What is Craigslist? <ul><li>Local Classifieds </li></ul><ul><ul><li>Jobs, Housing, Autos, Goods, Services </li></ul></ul><ul><li>~500 cities world-wide </li></ul><ul><li>Free </li></ul><ul><ul><li>Except for jobs in ~18 cities and brokered apartments in NYC </li></ul></ul><ul><ul><li>Over 20B pageviews/month </li></ul></ul><ul><ul><li>50M monthly users </li></ul></ul><ul><ul><li>50+ countries, multiple languages </li></ul></ul><ul><ul><li>40+M ads/month, 10+M images </li></ul></ul>
  5. What is Craigslist? <ul><li>Forums </li></ul><ul><ul><li>100M posts </li></ul></ul><ul><ul><li>100s of forums </li></ul></ul>
  6. Technical and other Challenges <ul><li>High ad churn rate </li></ul><ul><ul><li>Post half-life can be short </li></ul></ul><ul><li>Growth </li></ul><ul><li>High traffic volume </li></ul><ul><li>Back-end tools and data analysis needs </li></ul><ul><li>Growth </li></ul><ul><li>Need to archive postings... forever! </li></ul><ul><ul><li>100s of millions, searchable </li></ul></ul><ul><li>Internationalization and UTF-8 </li></ul>
  7. Technical and other Challenges <ul><li>Small Team </li></ul><ul><ul><li>Fires take priority </li></ul></ul><ul><ul><li>Infrastructure gets creaky </li></ul></ul><ul><ul><li>Organic code and schema growth over years </li></ul></ul><ul><li>Growth </li></ul><ul><li>Lack of abstractions </li></ul><ul><ul><li>Too much embedded SQL in code </li></ul></ul><ul><li>Documentation vs. Institutional Knowledge </li></ul><ul><ul><li>“Why do we have things configured like this?” </li></ul></ul>
  8. Goals <ul><li>Use Open Source </li></ul><ul><li>Keep infrastructure small and simple </li></ul><ul><ul><li>Lower power is good! </li></ul></ul><ul><ul><li>Efficiency all around </li></ul></ul><ul><ul><li>Do more with less </li></ul></ul><ul><li>Keep site easy and approachable </li></ul><ul><ul><li>Don't overload with features </li></ul></ul><ul><ul><li>People are easily confused </li></ul></ul>
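  9. Craigslist Internals Overview <ul><li>Diagram components: Load Balancer, Read Proxy Array, Write Proxy Array, Web Read Array, Object Cache, Read DB Cluster, Search Cluster </li></ul><ul><li>Diagram software labels: Perl + memcached, Apache 1.3 + mod_perl, MySQL 5.0.xx, Sphinx </li></ul><ul><li>Not included: </li></ul><ul><ul><li>user db, image db </li></ul></ul><ul><ul><li>async tasks, email </li></ul></ul><ul><ul><li>accounting, internal tools </li></ul></ul><ul><ul><li>and more! </li></ul></ul>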
  10. Vertical Partitioning: Roles <ul><li>Roles: Users, Classifieds, Forums, Stats, Archive </li></ul><ul><li>Sub-roles: Write, Read, Long, Trash </li></ul>
  11. Vertical Partitioning <ul><li>Different roles have different access patterns </li></ul><ul><ul><li>Sub-roles based on query type </li></ul></ul><ul><li>Easier to manage and scale </li></ul><ul><li>Logical, self-contained data </li></ul><ul><li>Servers may not need to be as big/fast/expensive </li></ul><ul><li>Difficult to do retroactively </li></ul><ul><li>Various named db “handles” in code </li></ul>
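
The named db "handles" idea can be sketched as a small lookup keyed by role and sub-role. This is only an illustration: the handle table, the host names, and the `handle_for` helper are all hypothetical, and the original code base is Perl, not Python.

```python
# Hypothetical handle table: (role, sub-role) -> host.
# Role and sub-role names come from the slides; the hosts are invented.
DB_HANDLES = {
    ("classifieds", "write"): "db-classifieds-write.example",
    ("classifieds", "read"): "db-classifieds-read.example",
    ("classifieds", "long"): "db-classifieds-long.example",
    ("forums", "read"): "db-forums-read.example",
    ("archive", "read"): "db-archive-read.example",
}


def handle_for(role, sub_role):
    """Return the host behind a named handle, or fail loudly if it is unknown."""
    try:
        return DB_HANDLES[(role, sub_role)]
    except KeyError:
        raise ValueError(f"no db handle for role={role!r} sub_role={sub_role!r}") from None


if __name__ == "__main__":
    print(handle_for("classifieds", "read"))
```

Keeping the mapping in one place means a role can move to different hardware without touching the callers, which is part of why retrofitting handles later is painful.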
  12. Horizontal Partitioning: Hydra <ul><li>client </li></ul><ul><ul><li>cluster_01, cluster_02, cluster_03, ..., cluster_N </li></ul></ul>
  13. Horizontal Partitioning: Hydra <ul><li>Need to retrofit a lot of code </li></ul><ul><li>Need non-blocking Perl MySQL client </li></ul><ul><li>Wrapped </li></ul><ul><li>Eventually can size DB boxes based on price/power and adjust mapping function(s) </li></ul><ul><ul><li>Choose hardware first </li></ul></ul><ul><ul><li>Make the db “fit” </li></ul></ul><ul><li>Archiving lets us age a cluster instead of migrating its data to a new one. </li></ul>
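
One way to picture "adjust mapping function(s)" is a weighted mapping from posting ID to cluster, where a bigger box gets more slots. The slides do not describe the actual Hydra mapping, so the weights, the `build_slots` helper, and the modulo scheme below are assumptions made for illustration.

```python
# Hypothetical capacity weights; cluster names mirror the slide's labels.
WEIGHTS = {"cluster_01": 1, "cluster_02": 1, "cluster_03": 2}


def build_slots(weights):
    """Expand weights into a slot list: a cluster with weight 2 appears twice."""
    slots = []
    for name in sorted(weights):
        slots.extend([name] * weights[name])
    return slots


def cluster_for(posting_id, slots):
    """Map a posting ID to a cluster by taking the ID modulo the slot count."""
    return slots[posting_id % len(slots)]


if __name__ == "__main__":
    slots = build_slots(WEIGHTS)
    for posting_id in range(8):
        print(posting_id, cluster_for(posting_id, slots))
```

Changing the weights in a scheme like this remaps existing IDs, so in practice new weights would apply only to new data. That fits the slide's point about aging a cluster out through archiving rather than migrating its data.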
  14. Search Evolution <ul><li>Problem: Users want to find stuff. </li></ul><ul><li>Solution: Use MySQL Full Text. </li></ul><ul><li>...time passes... </li></ul><ul><li>Problem: MySQL Full Text Doesn't Scale! </li></ul><ul><li>Solution: Use Sphinx. </li></ul><ul><li>...time passes... </li></ul><ul><li>Problem: Sphinx doesn't scale! </li></ul><ul><li>Solution: Patch Sphinx. </li></ul>
  15. MySQL Full-Text Problems <ul><li>Hitting invisible limits </li></ul><ul><ul><li>CPU not pegged, Memory available </li></ul></ul><ul><ul><li>Disk I/O not unreasonable </li></ul></ul><ul><ul><li>Locking / Mutex contention? Probably. </li></ul></ul><ul><li>MyISAM has occasional crashing / corruption </li></ul><ul><li>5 clusters of 5 machines </li></ul><ul><ul><li>Partitioning based on city and category </li></ul></ul><ul><ul><li>All “hand balanced” and high-maintenance </li></ul></ul><ul><li>~30M queries/day </li></ul><ul><ul><li>Close to limits </li></ul></ul>
  16. Sphinx: My First CL Project <ul><li>Sphinx is designed for text search </li></ul><ul><li>Fast and lean C++ code </li></ul><ul><li>Forking model scales well on multi-core </li></ul><ul><li>Control over indexing, weighting, etc. </li></ul><ul><li>Also spent some time looking at Apache Solr </li></ul>
  17. Search Implementation Details <ul><li>Partitioning based on cities (each has a numeric id) </li></ul><ul><li>Attributes vs. Keywords </li></ul><ul><li>Persistent Connections </li></ul><ul><ul><li>Custom client and server modifications </li></ul></ul><ul><li>Minimal stopword List </li></ul><ul><li>Partition into 2 clusters (1 master, 4 slaves) </li></ul>
  18. Sphinx Incremental Indexing <ul><li>Re-index every N minutes </li></ul><ul><li>Use main + delta strategy </li></ul><ul><ul><li>Adopted as: index + today + delta </li></ul></ul><ul><ul><li>One set per city (~500 * 3) </li></ul></ul><ul><li>Slaves handle live queries, update via rsync </li></ul><ul><li>Need lots of FDs </li></ul><ul><li>Use all 4 cores to index </li></ul><ul><li>Every night, perform “daily merge” </li></ul><ul><li>Generate config files via Perl </li></ul>
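
The "one set per city" layout and the generated config files can be sketched as follows. The talk says the real generator is written in Perl; this Python version is a stand-in. The index naming scheme, the base path, and the assumption that each index has a source of the same name are all hypothetical. Only the `index`, `source`, and `path` keywords are standard Sphinx configuration.

```python
# The three index kinds named on the slide: index + today + delta.
INDEX_KINDS = ("index", "today", "delta")


def index_names(city_id):
    """Return the three index names for one city (naming scheme is assumed)."""
    return [f"city{city_id}_{kind}" for kind in INDEX_KINDS]


def config_stanza(name, base_path="/var/sphinx"):
    """Render a minimal index stanza; base_path is a made-up location."""
    return (
        f"index {name}\n"
        "{\n"
        f"    source = {name}\n"
        f"    path = {base_path}/{name}\n"
        "}\n"
    )


def generate_config(city_ids):
    """Concatenate stanzas for every city, three indexes each."""
    return "\n".join(
        config_stanza(name) for city_id in city_ids for name in index_names(city_id)
    )


if __name__ == "__main__":
    print(generate_config([1, 2]))
```

With roughly 500 cities this produces about 1,500 index definitions, which is why hand-maintained config files are not practical and why the search daemon needs many file descriptors.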
  19. Sphinx Incremental Indexing
  20. Sphinx Issues <ul><li>Merge bugs [fixed] </li></ul><ul><li>File descriptor corruption [fixed] </li></ul><ul><li>Persistent connections [fixed] </li></ul><ul><ul><li>Overhead of fork() was substantial in our testing </li></ul></ul><ul><ul><li>200 queries/sec vs. 1,000 queries/sec per box </li></ul></ul><ul><li>Missing attribute updates [unreported] </li></ul><ul><li>Bogus docids in responses </li></ul><ul><li>We need to upgrade to latest Sphinx soon </li></ul><ul><li>Andrew and team have been excellent! </li></ul>
  21. Search Project Results <ul><li>From 25 MySQL Boxes to 10 Sphinx </li></ul><ul><li>Lots more headroom! </li></ul><ul><li>New Features </li></ul><ul><ul><li>Nearby Search </li></ul></ul><ul><li>No seizing or locking issues </li></ul><ul><li>1,000+ qps during peak w/room to grow </li></ul><ul><li>50M queries per day w/steady growth </li></ul><ul><li>Cluster partitioning built but not needed (yet?) </li></ul><ul><li>Better separation of code </li></ul>
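
The daily and per-second figures on the slides can be related with a little arithmetic. The sketch below uses only numbers from the slides (30M queries/day on 25 MySQL boxes, 50M queries/day on 10 Sphinx boxes). The per-box split assumes load spread evenly over every box, which overstates the share of any master that serves no queries.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_qps(queries_per_day):
    """Average queries per second over a full day."""
    return queries_per_day / SECONDS_PER_DAY


def queries_per_box(queries_per_day, boxes):
    """Daily queries per box, assuming an even spread (a simplification)."""
    return queries_per_day / boxes


if __name__ == "__main__":
    print(f"MySQL full-text: {average_qps(30_000_000):.1f} qps average")
    print(f"Sphinx:          {average_qps(50_000_000):.1f} qps average")
    print(f"MySQL per box:   {queries_per_box(30_000_000, 25):,.0f} queries/day")
    print(f"Sphinx per box:  {queries_per_box(50_000_000, 10):,.0f} queries/day")
```

50M queries per day averages out to about 579 queries per second, so the 1,000+ qps peak is a little under twice the daily average.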
  22. Sphinx Wishlist <ul><li>Efficient delete handling (kill lists) </li></ul><ul><li>Non-fatal “missing” indexes </li></ul><ul><li>Index dump tool </li></ul><ul><li>Live document add/change/delete </li></ul><ul><li>Built-in replication </li></ul><ul><li>Stats and counters </li></ul><ul><li>Text attributes </li></ul><ul><li>Protocol checksum </li></ul>
  23. Data Archiving, Replication, Indexes <ul><li>Problem: We want to keep everything. </li></ul><ul><li>Solution: Archive to an archive cluster. </li></ul><ul><li>Problem: Archiving is too painful. Index updates are expensive! Slaves affected. </li></ul><ul><li>Solution: Archive with home-grown eventually consistent replication. </li></ul>
  24. Data Archiving: OOB Replication <ul><li>Eventual Consistency </li></ul><ul><li>Master process </li></ul><ul><ul><li>SET SQL_LOG_BIN=0 </li></ul></ul><ul><ul><li>Select expired IDs </li></ul></ul><ul><ul><li>Export records from live master </li></ul></ul><ul><ul><li>Import records into archive master </li></ul></ul><ul><ul><li>Delete expired from live master </li></ul></ul><ul><ul><li>Add IDs to list </li></ul></ul>
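
The master-side steps above can be sketched with two in-memory SQLite databases standing in for the live master and the archive master. The table layout, column names, and the `archive_expired` helper are hypothetical. SQLite has no binary log, so the `SET SQL_LOG_BIN=0` step appears only as a comment.

```python
import sqlite3

SCHEMA = "CREATE TABLE postings (id INTEGER PRIMARY KEY, body TEXT, expires INTEGER)"


def setup():
    """Create stand-ins for the live master and the archive master."""
    live = sqlite3.connect(":memory:")
    archive = sqlite3.connect(":memory:")
    live.execute(SCHEMA)
    archive.execute(SCHEMA)
    # The ID list carries a sequence number so slaves can track their position.
    live.execute(
        "CREATE TABLE expired_ids (seq INTEGER PRIMARY KEY AUTOINCREMENT, id INTEGER)"
    )
    return live, archive


def archive_expired(live, archive, now):
    """Move expired postings to the archive and record their IDs."""
    # On MySQL, the export and delete would run with SET SQL_LOG_BIN=0 so the
    # bulk delete is not replicated to the slaves through the binary log.
    rows = live.execute(
        "SELECT id, body, expires FROM postings WHERE expires <= ? ORDER BY id",
        (now,),
    ).fetchall()
    archive.executemany("INSERT OR REPLACE INTO postings VALUES (?, ?, ?)", rows)
    archive.commit()
    ids = [(row[0],) for row in rows]
    live.executemany("DELETE FROM postings WHERE id = ?", ids)
    # Assumption: this list is what reaches the slaves, which delete on their own.
    live.executemany("INSERT INTO expired_ids (id) VALUES (?)", ids)
    live.commit()
    return [row[0] for row in rows]
```

Importing into the archive before deleting from the live master means a crash between the two steps leaves a duplicate rather than a lost record.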
  25. Data Archiving: OOB Replication <ul><li>Slave process </li></ul><ul><ul><li>One per MySQL slave </li></ul></ul><ul><ul><li>Throttled to minimize impact </li></ul></ul><ul><ul><li>State kept on slave </li></ul></ul><ul><ul><ul><li>Clone friendly </li></ul></ul></ul><ul><ul><li>Simple logic </li></ul></ul><ul><ul><ul><li>Select expired IDs added since my sequence number </li></ul></ul></ul><ul><ul><ul><li>Delete expired records </li></ul></ul></ul><ul><ul><ul><li>Update local “last seen” sequence number </li></ul></ul></ul>
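
The slave-side loop can be sketched the same way, with an in-memory SQLite database standing in for one MySQL slave. The tables, the `purge_step` helper, and the use of a batch size as the throttle are assumptions. The slides say only that the process is throttled and keeps its state on the slave.

```python
import sqlite3


def make_slave():
    """Create a stand-in slave holding postings, the ID list, and local state."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE postings (id INTEGER PRIMARY KEY, body TEXT)")
    db.execute("CREATE TABLE expired_ids (seq INTEGER PRIMARY KEY, id INTEGER)")
    # State lives on the slave itself, so a cloned slave resumes where its source was.
    db.execute("CREATE TABLE archive_state (last_seq INTEGER)")
    db.execute("INSERT INTO archive_state VALUES (0)")
    db.commit()
    return db


def purge_step(db, batch_size=100):
    """Delete one batch of expired postings and advance the sequence number."""
    (last_seq,) = db.execute("SELECT last_seq FROM archive_state").fetchone()
    rows = db.execute(
        "SELECT seq, id FROM expired_ids WHERE seq > ? ORDER BY seq LIMIT ?",
        (last_seq, batch_size),
    ).fetchall()
    if not rows:
        return 0
    db.executemany("DELETE FROM postings WHERE id = ?", [(pid,) for _, pid in rows])
    db.execute("UPDATE archive_state SET last_seq = ?", (rows[-1][0],))
    db.commit()
    return len(rows)
```

A real process would call something like `purge_step` in a loop with a pause between batches. Because the deletes are idempotent, repeating a batch after a crash is harmless.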
  26. Long Term Data Archiving <ul><li>Schema coupling is bad </li></ul><ul><ul><li>ALTER TABLE takes forever </li></ul></ul><ul><ul><li>Lots of NULLs flying around </li></ul></ul><ul><li>CouchDB or similar long-term? </li></ul><ul><ul><li>Schema-free feels like a good fit </li></ul></ul><ul><li>Tested some home grown solutions already </li></ul><ul><li>Separate storage and indexing? </li></ul><ul><ul><li>Indexing with Sphinx? </li></ul></ul>
  27. Drizzle, XtraDB, Future Stuff <ul><li>CouchDB looks very interesting. Maybe for archive? </li></ul><ul><li>XtraDB / InnoDB plugin </li></ul><ul><ul><li>Better concurrency </li></ul></ul><ul><ul><li>Better tuning of InnoDB internals </li></ul></ul><ul><li>libdrizzle + Perl </li></ul><ul><ul><li>DBI/DBD may not fit an async model well </li></ul></ul><ul><ul><li>Can talk to both MySQL and Drizzle! </li></ul></ul><ul><li>Oracle buying Sun?!?! </li></ul>
  28. We're Hiring! <ul><li>Work in San Francisco </li></ul><ul><li>Flexible, Small Company </li></ul><ul><li>Excellent Benefits </li></ul><ul><li>Help Millions of People Every Week </li></ul><ul><li>We Need Perl/MySQL Hackers </li></ul><ul><li>Come Help us Scale and Grow </li></ul>
  29. Questions?