MySQL and Search at Craigslist

Presentation Transcript

  • MySQL and Search at Craigslist
    Jeremy Zawodny
    jzawodn@craigslist.org / http://craigslist.org/
    Jeremy@Zawodny.com / http://jeremy.zawodny.com/blog/
  • Who Am I?
    ● Creator and co-author of High Performance MySQL
    ● Creator of mytop
    ● Perl Hacker
    ● MySQL Geek
    ● Craigslist Engineer (as of July, 2008)
      – MySQL, Data, Search, Perl
    ● Ex-Yahoo (Perl, MySQL, Search, Web Services)
  • What is Craigslist?
  • What is Craigslist?
    ● Local Classifieds
      – Jobs, Housing, Autos, Goods, Services
    ● ~500 cities world-wide
    ● Free
      – Except for jobs in ~18 cities and brokered apartments in NYC
      – Over 20B pageviews/month
      – 50M monthly users
      – 50+ countries, multiple languages
      – 40+M ads/month, 10+M images
  • What is Craigslist?
    ● Forums
      – 100M posts
      – 100s of forums
  • Technical and other Challenges
    ● High ad churn rate
      – Post half-life can be short
    ● Growth
    ● High traffic volume
    ● Back-end tools and data analysis needs
    ● Growth
    ● Need to archive postings... forever!
      – 100s of millions, searchable
    ● Internationalization and UTF-8
  • Technical and other Challenges
    ● Small Team
      – Fires take priority
      – Infrastructure gets creaky
      – Organic code and schema growth over years
    ● Growth
    ● Lack of abstractions
      – Too much embedded SQL in code
    ● Documentation vs. Institutional Knowledge
      – “Why do we have things configured like this?”
  • Goals
    ● Use Open Source
    ● Keep infrastructure small and simple
      – Lower power is good!
      – Efficiency all around
      – Do more with less
    ● Keep site easy and approachable
      – Don't overload with features
      – People are easily confused
  • Craigslist Internals Overview (architecture diagram)
      – Load Balancer
      – Read Proxy Array / Write Proxy Array (Perl + memcached)
      – Web Read Array (Apache 1.3 + mod_perl)
      – Object Cache (Perl + memcached)
      – Search Cluster (Sphinx)
      – Read DB Cluster (MySQL 5.0.xx)
      – Not Included: user db, image db, async tasks, email, accounting, internal tools, and more!
  • Vertical Partitioning: Roles (diagram)
      – Roles: Users, Classifieds, Forums
      – Sub-roles: Write, Read, Long, Trash, Stats, Archive
  • Vertical Partitioning
    ● Different roles have different access patterns
      – Sub-roles based on query type
    ● Easier to manage and scale
    ● Logical, self-contained data
    ● Servers may not need to be as big/fast/expensive
    ● Difficult to do retroactively
    ● Various named db “handles” in code
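
A minimal sketch of what role-based, named database handles can look like in Perl with DBI; the role names, DSNs, and credentials here are illustrative assumptions, not Craigslist's actual configuration:

    use DBI;

    # Map each logical role/sub-role to its own connection settings.
    my %db_roles = (
        classifieds_read  => 'DBI:mysql:database=cl;host=classifieds-read',
        classifieds_write => 'DBI:mysql:database=cl;host=classifieds-write',
        forums_read       => 'DBI:mysql:database=forums;host=forums-read',
        archive           => 'DBI:mysql:database=archive;host=archive-db',
    );

    my %dbh_cache;

    # Application code asks for a handle by role name instead of embedding
    # host names and credentials everywhere.
    sub get_dbh {
        my ($role) = @_;
        die "unknown db role: $role" unless exists $db_roles{$role};
        return $dbh_cache{$role} ||= DBI->connect(
            $db_roles{$role}, 'app_user', 'secret', { RaiseError => 1 }
        );
    }

    my $dbh = get_dbh('classifieds_read');
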
  • Horizontal Partitioning: Hydra (diagram: a client fans out across cluster_01, cluster_02, cluster_03, ... cluster_N)
  • Horizontal Partitioning: Hydra
    ● Need to retrofit a lot of code
    ● Need non-blocking Perl MySQL client
    ● Wrapped http://code.google.com/p/perl-mysql-async/
    ● Eventually can size DB boxes based on price/power and adjust mapping function(s)
      – Choose hardware first
      – Make the db “fit”
    ● Archiving lets us age a cluster instead of migrating its data to a new one.
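
A toy illustration of a cluster mapping function for Hydra-style horizontal partitioning; the modulo scheme, cluster names, and posting-ID argument are assumptions for the sketch, not the real Hydra mapping:

    # Route a posting ID to one of N clusters.
    my @clusters = qw(cluster_01 cluster_02 cluster_03 cluster_04);

    sub cluster_for_posting {
        my ($posting_id) = @_;
        return $clusters[ $posting_id % scalar @clusters ];
    }

    # Hardware is chosen first; the cluster list and mapping function are
    # then adjusted so the data "fits" the boxes.
    my $cluster = cluster_for_posting(123_456_789);
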
  • Search Evolution
    ● Problem: Users want to find stuff.
    ● Solution: Use MySQL Full Text.
    ● ...time passes...
    ● Problem: MySQL Full Text Doesn't Scale!
    ● Solution: Use Sphinx.
    ● ...time passes...
    ● Problem: Sphinx doesn't scale!
    ● Solution: Patch Sphinx.
  • MySQL Full-Text Problems
    ● Hitting invisible limits
      – CPU not pegged, Memory available
      – Disk I/O not unreasonable
      – Locking / Mutex contention? Probably.
    ● MyISAM has occasional crashing / corruption
    ● 5 clusters of 5 machines
      – Partitioning based on city and category
      – All “hand balanced” and high-maintenance
    ● ~30M queries/day
      – Close to limits
  • Sphinx: My First CL Project
    ● Sphinx is designed for text search
    ● Fast and lean C++ code
    ● Forking model scales well on multi-core
    ● Control over indexing, weighting, etc.
    ● Also spent some time looking at Apache Solr
  • Search Implementation Details
    ● Partitioning based on cities (each has a numeric id)
    ● Attributes vs. Keywords
    ● Persistent Connections
      – Custom client and server modifications
    ● Minimal stopword list
    ● Partition into 2 clusters (1 master, 4 slaves)
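
A minimal query sketch using the CPAN Sphinx::Search client, showing a keyword query filtered by a numeric city attribute; Craigslist ran a customized client and server, and the host, port, index name, and attribute name below are assumptions:

    use Sphinx::Search;

    my $sph = Sphinx::Search->new();
    $sph->SetServer('searchd-host', 9312);   # hypothetical searchd host/port
    $sph->SetFilter('city_id', [ 7 ]);       # attribute filter: one city's numeric id

    # Keyword query against an assumed per-city index name.
    my $results = $sph->Query('road bike', 'posts_city_7');
    if ($results) {
        for my $match (@{ $results->{matches} }) {
            printf "doc=%d weight=%d\n", $match->{doc}, $match->{weight};
        }
    }
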
  • Sphinx Incremental Indexing
    ● Re-index every N minutes
    ● Use main + delta strategy
      – Adopted as: index + today + delta
      – One set per city (~500 * 3)
    ● Slaves handle live queries, update via rsync
    ● Need lots of FDs
    ● Use all 4 cores to index
    ● Every night, perform “daily merge”
    ● Generate config files via Perl
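
A rough sketch of generating the per-city index + today + delta stanzas from Perl, as the last bullet describes; the index names, source names, and paths are made up for illustration, and a real sphinx.conf would also need matching source definitions:

    # Emit one "index" stanza per (city, part) pair: index + today + delta.
    my @city_ids = (1 .. 500);   # hypothetical numeric city ids

    open my $conf, '>', 'sphinx_indexes.conf' or die "open: $!";
    for my $city (@city_ids) {
        for my $part (qw(index today delta)) {
            my $name = "posts_${city}_${part}";
            print {$conf} "index $name\n",
                          "{\n",
                          "    source = src_$name\n",
                          "    path   = /var/data/sphinx/$name\n",
                          "}\n\n";
        }
    }
    close $conf;
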
  • Sphinx Incremental Indexing
  • Sphinx Issues
    ● Merge bugs [fixed]
    ● File descriptor corruption [fixed]
    ● Persistent connections [fixed]
      – Overhead of fork() was substantial in our testing
      – 200 queries/sec vs. 1,000 queries/sec per box
    ● Missing attribute updates [unreported]
    ● Bogus docids in responses
    ● We need to upgrade to latest Sphinx soon
    ● Andrew and team have been excellent!
  • Search Project Results
    ● From 25 MySQL Boxes to 10 Sphinx
    ● Lots more headroom!
    ● New Features
      – Nearby Search
    ● No seizing or locking issues
    ● 1,000+ qps during peak w/room to grow
    ● 50M queries per day w/steady growth
    ● Cluster partitioning built but not needed (yet?)
    ● Better separation of code
  • Sphinx Wishlist
    ● Efficient delete handling (kill lists)
    ● Non-fatal “missing” indexes
    ● Index dump tool
    ● Live document add/change/delete
    ● Built-in replication
    ● Stats and counters
    ● Text attributes
    ● Protocol checksum
  • Data Archiving, Replication, Indexes
    ● Problem: We want to keep everything.
    ● Solution: Archive to an archive cluster.
    ● Problem: Archiving is too painful. Index updates are expensive! Slaves affected.
    ● Solution: Archive with home-grown eventually consistent replication.
  • Data Archiving: OOB Replication
    ● Eventual Consistency
    ● Master process
      – SET SQL_LOG_BIN=0
      – Select expired IDs
      – Export records from live master
      – Import records into archive master
      – Delete expired from live master
      – Add IDs to list
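
A condensed Perl/DBI sketch of the master-side pass described above; the hosts, table names (posting, expire_log), column names, and batch size are assumptions, and the real home-grown tooling is more involved:

    use DBI;

    my $live    = DBI->connect('DBI:mysql:database=cl;host=live-master',
                               'app', 'secret', { RaiseError => 1 });
    my $archive = DBI->connect('DBI:mysql:database=cl;host=archive-master',
                               'app', 'secret', { RaiseError => 1 });

    # Keep the archiving deletes out of the binlog; slaves are cleaned up
    # separately by the out-of-band slave process.
    $live->do('SET SQL_LOG_BIN=0');

    # Select a batch of expired posting IDs.
    my $ids = $live->selectcol_arrayref(
        'SELECT id FROM posting WHERE expires_at < NOW() LIMIT 1000');

    for my $id (@$ids) {
        # Export from the live master, import into the archive master.
        my $row = $live->selectrow_hashref(
            'SELECT id, body FROM posting WHERE id = ?', undef, $id);
        $archive->do('REPLACE INTO posting (id, body) VALUES (?, ?)',
                     undef, $row->{id}, $row->{body});

        # Delete from the live master and record the ID so slaves can follow.
        $live->do('DELETE FROM posting WHERE id = ?', undef, $id);
        $live->do('INSERT INTO expire_log (id) VALUES (?)', undef, $id);
    }
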
  • Data Archiving: OOB Replication
    ● Slave process
      – One per MySQL slave
      – Throttled to minimize impact
      – State kept on slave
    ● Clone friendly
      – Simple logic
    ● Select expired IDs added since my sequence number
    ● Delete expired records
    ● Update local “last seen” sequence number
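
And a matching sketch of the per-slave cleanup loop: read expired IDs past the locally stored sequence number, delete them, persist the new sequence number, and throttle; again, the table names, state file, and sleep interval are assumptions:

    use DBI;

    my $master = DBI->connect('DBI:mysql:database=cl;host=live-master',
                              'app', 'secret', { RaiseError => 1 });
    my $slave  = DBI->connect('DBI:mysql:database=cl;host=localhost',
                              'app', 'secret', { RaiseError => 1 });

    # "State kept on slave": the last expire_log sequence number we applied.
    my $state_file = '/var/run/archive_last_seq';
    my $last_seq   = 0;
    if (open my $fh, '<', $state_file) { $last_seq = <$fh> + 0 }

    while (1) {
        my $rows = $master->selectall_arrayref(
            'SELECT seq, id FROM expire_log WHERE seq > ? ORDER BY seq LIMIT 500',
            { Slice => {} }, $last_seq);
        last unless @$rows;

        for my $row (@$rows) {
            $slave->do('DELETE FROM posting WHERE id = ?', undef, $row->{id});
            $last_seq = $row->{seq};
        }

        # Persist the new "last seen" sequence number, then throttle.
        open my $out, '>', $state_file or die "open: $!";
        print {$out} $last_seq;
        close $out;
        sleep 1;
    }
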
  • Long Term Data Archiving
    ● Schema coupling is bad
      – ALTER TABLE takes forever
      – Lots of NULLs flying around
    ● CouchDB or similar long-term?
      – Schema-free feels like a good fit
    ● Tested some home grown solutions already
    ● Separate storage and indexing?
      – Indexing with Sphinx?
  • Drizzle, XtraDB, Future Stuff
    ● CouchDB looks very interesting. Maybe for archive?
    ● XtraDB / InnoDB plugin
      – Better concurrency
      – Better tuning of InnoDB internals
    ● libdrizzle + Perl
      – DBI/DBD may not fit an async model well
      – Can talk to both MySQL and Drizzle!
    ● Oracle buying Sun?!?!
  • We're Hiring!
    ● Work in San Francisco
    ● Flexible, Small Company
    ● Excellent Benefits
    ● Help Millions of People Every Week
    ● We Need Perl/MySQL Hackers
    ● Come Help us Scale and Grow
  • Questions?