MySQL And Search At Craigslist
My talk as given at the 2009 MySQL Conference and Expo in Santa Clara, CA

Comments

  • Great preso! Thanks for uploading it.

    We use Sphinx at SlideShare as well, but we've been having some reliability problems lately. We need to get down to what the cause is ... I'm sure it's our fault, not Sphinx's!
  • Welcome to the neighborhood, Jeremy. Great slideshow.

    You said 'Also spent some time looking at Apache Solr'. What was the result? Why did you not use Solr?

MySQL And Search At Craigslist Presentation Transcript

  • 1. MySQL and Search at Craigslist
      • Jeremy Zawodny
      • [email_address]
      • http://craigslist.org/
      • [email_address]
      • http://jeremy.zawodny.com/blog/
  • 2. Who Am I?
    • Creator and co-author of High Performance MySQL
    • Creator of mytop
    • Perl Hacker
    • MySQL Geek
    • Craigslist Engineer (as of July, 2008)
      • MySQL, Data, Search, Perl
    • Ex-Yahoo (Perl, MySQL, Search, Web Services)
  • 3. What is Craigslist?
  • 4. What is Craigslist?
    • Local Classifieds
      • Jobs, Housing, Autos, Goods, Services
    • ~500 cities world-wide
    • Free
      • Except for jobs in ~18 cities and brokered apartments in NYC
    • Over 20B pageviews/month
    • 50M monthly users
    • 50+ countries, multiple languages
    • 40+M ads/month, 10+M images
  • 5. What is Craigslist?
    • Forums
      • 100M posts
      • 100s of forums
  • 6. Technical and other Challenges
    • High ad churn rate
      • Post half-life can be short
    • Growth
    • High traffic volume
    • Back-end tools and data analysis needs
    • Growth
    • Need to archive postings... forever!
      • 100s of millions, searchable
    • Internationalization and UTF-8
  • 7. Technical and other Challenges
    • Small Team
      • Fires take priority
      • Infrastructure gets creaky
      • Organic code and schema growth over years
    • Growth
    • Lack of abstractions
      • Too much embedded SQL in code
    • Documentation vs. Institutional Knowledge
      • “Why do we have things configured like this?”
  • 8. Goals
    • Use Open Source
    • Keep infrastructure small and simple
      • Lower power is good!
      • Efficiency all around
      • Do more with less
    • Keep site easy and approachable
      • Don't overload with features
      • People are easily confused
  • 9. Craigslist Internals Overview
    • Diagram components: Load Balancer, Read Proxy Array, Write Proxy Array, Web Read Array, Object Cache, Read DB Cluster, Search Cluster, ...
    • Technology labels in the diagram: Perl + memcached (appears twice), Apache 1.3 + mod_perl, MySQL 5.0.xx, Sphinx
    • Not included:
      • user db, image db
      • async tasks, email
      • accounting, internal tools
      • and more!
  • 10. Vertical Partitioning: Roles
    • Diagram labels: Users, Classifieds, Forums, Stats, Archive; Write, Read, Long, Trash
  • 11. Vertical Partitioning
    • Different roles have different access patterns
      • Sub-roles based on query type
    • Easier to manage and scale
    • Logical, self-contained data
    • Servers may not need to be as big/fast/expensive
    • Difficult to do retroactively
    • Various named db “handles” in code
  • 12. Horizontal Partitioning: Hydra
    • Diagram: client connected to cluster_01, cluster_02, cluster_03, ..., cluster_N
  • 13. Horizontal Partitioning: Hydra
    • Need to retrofit a lot of code
    • Need non-blocking Perl MySQL client
    • Wrapped http://code.google.com/p/perl-mysql-async/
    • Eventually can size DB boxes based on price/power and adjust mapping function(s)
      • Choose hardware first
      • Make the db “fit”
    • Archiving lets us age a cluster instead of migrating its data to a new one.
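
The slides do not show Hydra's mapping function, so the following is only a minimal sketch of the general idea: a function that sends a key to a cluster in proportion to a per-cluster weight, so that boxes of different size can take different shares. The cluster weights, the use of MD5, and the names `CLUSTER_WEIGHTS` and `cluster_for` are all assumptions made for illustration, and the sketch is in Python rather than the Perl used at Craigslist.

```python
import hashlib
from bisect import bisect_right
from itertools import accumulate

# Hypothetical capacity weights: cluster_03 is assumed to be twice as large.
CLUSTER_WEIGHTS = {"cluster_01": 1, "cluster_02": 1, "cluster_03": 2}


def cluster_for(key, weights=CLUSTER_WEIGHTS):
    """Map a key (for example a posting ID) to a cluster name.

    Each cluster owns a number of slots equal to its weight; the key is
    hashed into one of the slots, so bigger clusters receive more keys.
    """
    names = sorted(weights)
    # Cumulative upper bounds of each cluster's slot range, e.g. [1, 2, 4].
    bounds = list(accumulate(weights[name] for name in names))
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    slot = int.from_bytes(digest[:8], "big") % bounds[-1]
    return names[bisect_right(bounds, slot)]


if __name__ == "__main__":
    for posting_id in (1001, 1002, 1003):
        print(posting_id, "->", cluster_for(posting_id))
```

Changing the weights in a scheme like this moves existing keys between clusters, which is one reason a real deployment would need either a migration step or, as the slide suggests, a policy of letting old clusters age out through archiving.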
  • 14. Search Evolution
    • Problem: Users want to find stuff.
    • Solution: Use MySQL Full Text.
    • ...time passes...
    • Problem: MySQL Full Text Doesn't Scale!
    • Solution: Use Sphinx.
    • ...time passes...
    • Problem: Sphinx doesn't scale!
    • Solution: Patch Sphinx.
  • 15. MySQL Full-Text Problems
    • Hitting invisible limits
      • CPU not pegged, Memory available
      • Disk I/O not unreasonable
      • Locking / Mutex contention? Probably.
    • MyISAM has occasional crashing / corruption
    • 5 clusters of 5 machines
      • Partitioning based on city and category
      • All “hand balanced” and high-maintenance
    • ~30M queries/day
      • Close to limits
  • 16. Sphinx: My First CL Project
    • Sphinx is designed for text search
    • Fast and lean C++ code
    • Forking model scales well on multi-core
    • Control over indexing, weighting, etc.
    • Also spent some time looking at Apache Solr
  • 17. Search Implementation Details
    • Partitioning based on cities (each has a numeric id)
    • Attributes vs. Keywords
    • Persistent Connections
      • Custom client and server modifications
    • Minimal stopword List
    • Partition into 2 clusters (1 master, 4 slaves)
  • 18. Sphinx Incremental Indexing
    • Re-index every N minutes
    • Use main + delta strategy
      • Adopted as: index + today + delta
      • One set per city (~500 * 3)
    • Slaves handle live queries, update via rsync
    • Need lots of FDs
    • Use all 4 cores to index
    • Every night, perform “daily merge”
    • Generate config files via Perl
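
With one index + today + delta set per city, the configuration runs to roughly 1,500 index definitions, which is why it is generated rather than written by hand. The sketch below shows the shape of such a generator in Python (the talk used Perl). The index naming scheme, the `src_` source names, and the data directory are assumptions; only the `index` block with its `source` and `path` directives is standard Sphinx configuration syntax, and the matching `source` blocks are left out.

```python
PARTS = ("index", "today", "delta")  # the three tiers named on the slide


def index_stanzas(city_id, data_dir="/var/sphinx/data"):
    """Return the three Sphinx index blocks for one city.

    The naming scheme and data_dir are hypothetical; each block refers to
    a source block (src_<name>) that would be generated separately.
    """
    stanzas = []
    for part in PARTS:
        name = f"city{city_id}_{part}"
        stanzas.append(
            f"index {name}\n"
            "{\n"
            f"    source = src_{name}\n"
            f"    path   = {data_dir}/{name}\n"
            "}\n"
        )
    return stanzas


def build_config(city_ids):
    """Concatenate the index blocks for every city into one config text."""
    return "\n".join(s for cid in city_ids for s in index_stanzas(cid))


if __name__ == "__main__":
    print(build_config([1, 2]))
```

A query for one city would then name all three of that city's indexes, so that results from the nightly-merged index, the current day, and the latest delta are combined.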
  • 19. Sphinx Incremental Indexing
  • 20. Sphinx Issues
    • Merge bugs [fixed]
    • File descriptor corruption [fixed]
    • Persistent connections [fixed]
      • Overhead of fork() was substantial in our testing
      • 200 queries/sec vs. 1,000 queries/sec per box
    • Missing attribute updates [unreported]
    • Bogus docids in responses
    • We need to upgrade to latest Sphinx soon
    • Andrew and team have been excellent!
  • 21. Search Project Results
    • From 25 MySQL Boxes to 10 Sphinx
    • Lots more headroom!
    • New Features
      • Nearby Search
    • No seizing or locking issues
    • 1,000+ qps during peak w/room to grow
    • 50M queries per day w/steady growth
    • Cluster partitioning built but not needed (yet?)
    • Better separation of code
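
To put the daily figures on the same scale as the per-second ones, the numbers from the slides can be converted as below. This is a back-of-the-envelope sketch: it assumes queries were spread evenly across all boxes in each setup, which the slides do not state (in the Sphinx setup, for instance, the slaves serve live queries), so the per-box values are only rough averages.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_qps(queries_per_day):
    """Average queries per second implied by a daily total."""
    return queries_per_day / SECONDS_PER_DAY


# Figures quoted on the slides.
mysql_avg = average_qps(30_000_000)   # MySQL full-text, 25 boxes
sphinx_avg = average_qps(50_000_000)  # Sphinx, 10 boxes
peak_qps = 1_000                      # "1,000+ qps during peak"

if __name__ == "__main__":
    print(f"MySQL full-text: {mysql_avg:.0f} qps average, "
          f"{mysql_avg / 25:.1f} per box")
    print(f"Sphinx:          {sphinx_avg:.0f} qps average, "
          f"{sphinx_avg / 10:.1f} per box")
    print(f"Peak-to-average ratio under Sphinx: {peak_qps / sphinx_avg:.2f}")
```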
  • 22. Sphinx Wishlist
    • Efficient delete handling (kill lists)
    • Non-fatal “missing” indexes
    • Index dump tool
    • Live document add/change/delete
    • Built-in replication
    • Stats and counters
    • Text attributes
    • Protocol checksum
  • 23. Data Archiving, Replication, Indexes
    • Problem: We want to keep everything.
    • Solution: Archive to an archive cluster.
    • Problem: Archiving is too painful. Index updates are expensive! Slaves affected.
    • Solution: Archive with home-grown eventually consistent replication.
  • 24. Data Archiving: OOB Replication
    • Eventual Consistency
    • Master process
      • SET SQL_LOG_BIN=0
      • Select expired IDs
      • Export records from live master
      • Import records into archive master
      • Delete expired from live master
      • Add IDs to list
  • 25. Data Archiving: OOB Replication
    • Slave process
      • One per MySQL slave
      • Throttled to minimize impact
      • State kept on slave
        • Clone friendly
      • Simple logic
        • Select expired IDs added since my sequence number
        • Delete expired records
        • Update local “last seen” sequence number
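
The master and slave steps on the last two slides can be sketched as follows. This is not the Craigslist implementation: it uses in-memory SQLite databases from Python's standard library in place of MySQL, the table and column names are invented, and it assumes each slave process reads the expired-ID list directly from the master. On MySQL the master pass would begin with `SET SQL_LOG_BIN=0` so that its writes bypass normal replication; SQLite has no equivalent, so that step appears only as a comment. The `batch` limit is a simple stand-in for throttling.

```python
import sqlite3


def make_live_db(with_id_list=False):
    """Create a live posting store (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE postings "
               "(id INTEGER PRIMARY KEY, body TEXT, expired INTEGER)")
    if with_id_list:
        # Master only: sequence-numbered list of archived posting IDs.
        db.execute("CREATE TABLE expired_ids "
                   "(seq INTEGER PRIMARY KEY AUTOINCREMENT, posting_id INTEGER)")
    else:
        # Slave only: local state, so a cloned slave resumes where it left off.
        db.execute("CREATE TABLE archive_state (last_seen INTEGER)")
        db.execute("INSERT INTO archive_state VALUES (0)")
    db.commit()
    return db


def make_archive_db():
    """Create the archive master's posting store."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE postings (id INTEGER PRIMARY KEY, body TEXT)")
    db.commit()
    return db


def master_pass(live, archive):
    """Move expired postings from the live master to the archive master."""
    # MySQL only: SET SQL_LOG_BIN=0 would be issued here.
    rows = live.execute(
        "SELECT id, body FROM postings WHERE expired = 1").fetchall()
    archive.executemany(
        "INSERT OR REPLACE INTO postings (id, body) VALUES (?, ?)", rows)
    archive.commit()  # import before deleting, so no record is lost
    ids = [(row[0],) for row in rows]
    live.executemany("DELETE FROM postings WHERE id = ?", ids)
    live.executemany("INSERT INTO expired_ids (posting_id) VALUES (?)", ids)
    live.commit()
    return len(rows)


def slave_pass(master, slave, batch=1000):
    """Delete postings archived since this slave's last-seen sequence number."""
    (last_seen,) = slave.execute(
        "SELECT last_seen FROM archive_state").fetchone()
    rows = master.execute(
        "SELECT seq, posting_id FROM expired_ids "
        "WHERE seq > ? ORDER BY seq LIMIT ?", (last_seen, batch)).fetchall()
    if not rows:
        return 0
    slave.executemany("DELETE FROM postings WHERE id = ?",
                      [(posting_id,) for _, posting_id in rows])
    slave.execute("UPDATE archive_state SET last_seen = ?", (rows[-1][0],))
    slave.commit()
    return len(rows)
```

Between a master pass and the next slave pass, a slave still holds rows the master has already deleted, which is the eventual consistency the slides refer to.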
  • 26. Long Term Data Archiving
    • Schema coupling is bad
      • ALTER TABLE takes forever
      • Lots of NULLs flying around
    • CouchDB or similar long-term?
      • Schema-free feels like a good fit
    • Tested some home grown solutions already
    • Separate storage and indexing?
      • Indexing with Sphinx?
  • 27. Drizzle, XtraDB, Future Stuff
    • CouchDB looks very interesting. Maybe for archive?
    • XtraDB / InnoDB plugin
      • Better concurrency
      • Better tuning of InnoDB internals
    • libdrizzle + Perl
      • DBI/DBD may not fit an async model well
      • Can talk to both MySQL and Drizzle!
    • Oracle buying Sun?!?!
  • 28. We're Hiring!
    • Work in San Francisco
    • Flexible, Small Company
    • Excellent Benefits
    • Help Millions of People Every Week
    • We Need Perl/MySQL Hackers
    • Come Help us Scale and Grow
  • 29. Questions?