MySQL and Search at Craigslist

Jeremy Zawodny
jzawodn@craigslist.org
http://craigslist.org/

Jeremy@Zawodny.com
http://jeremy.zawodny.com/blog/

Who Am I?
Creator and co-author of High Performance MySQL
●

Creator of mytop
●

Perl Hacker
●

MySQL Geek
●

Craigslist Engineer (as of July, 2008)
●

MySQL, Data, Search, Perl
–

Ex-Yahoo (Perl, MySQL, Search, Web Services)
●

What is Craigslist?
Local Classifieds
●

Jobs, Housing, Autos, Goods, Services
–

~500 cities world-wide
●

Free
●

Except for jobs in ~18 cities and brokered apartments in NYC
–

Over 20B pageviews/month
–

50M monthly users
–

50+ countries, multiple languages
–

40+M ads/month, 10+M images
–

What is Craigslist?
Forums
●

100M posts
–

100s of forums
–

Technical and other Challenges
High ad churn rate
●

Post half-life can be short
–

Growth
●

High traffic volume
●

Back-end tools and data analysis needs
●

Growth
●

Need to archive postings... forever!
●

100s of millions, searchable
–

Internationalization and UTF-8
●

Technical and other Challenges
Small Team
●

Fires take priority
–

Infrastructure gets creaky
–

Organic code and schema growth over years
–

Growth
●

Lack of abstractions
●

Too much embedded SQL in code
–

Documentation vs. Institutional Knowledge
●

“Why do we have things configured like this?”
–

Goals
Use Open Source
●

Keep infrastructure small and simple
●

Lower power is good!
–

Efficiency all around
–

Do more with less
–

Keep site easy and approachable
●

Don't overload with features
–

People are easily confused
–

Craigslist Internals Overview
(Architecture diagram, top to bottom)
Load Balancer
Read Proxy Array, Write Proxy Array: Perl + memcached
Web Read Array: Apache 1.3 + mod_perl
Object Cache: Perl + memcached
Search Cluster: Sphinx
Read DB Cluster: MySQL 5.0.xx
Not included: user db, image db; async tasks, email; accounting, internal tools; and more!

Vertical Partitioning: Roles

(Diagram labels: Users, Classifieds, Forums; Write, Read, Long, Trash; Stats, Archive)

Vertical Partitioning
Different roles have different access patterns
●

Sub-roles based on query type
–

Easier to manage and scale
●

Logical, self-contained data
●

Servers may not need to be as big/fast/expensive
●

Difficult to do retroactively
●

Various named db “handles” in code
●
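
The idea of named handles per role can be pictured as a small registry keyed by role and sub-role. This is a minimal sketch in Python (the site's own code is Perl); the host names, the `HANDLES` table, and `handle_for` are hypothetical and are not taken from the Craigslist codebase.

```python
# Hypothetical role -> sub-role -> host mapping; all names are illustrative only.
HANDLES = {
    "users": {"write": "users-master.example", "read": "users-slave.example"},
    "classifieds": {
        "write": "cl-master.example",
        "read": "cl-slave.example",
        "long": "cl-long.example",  # assumed home for slow, long-running queries
    },
    "forums": {"write": "forums-master.example", "read": "forums-slave.example"},
}


def handle_for(role, sub_role="read"):
    """Return the host behind a named handle, e.g. ("classifieds", "long")."""
    try:
        return HANDLES[role][sub_role]
    except KeyError:
        raise KeyError(f"no db handle named {role}/{sub_role}") from None
```

Code that asks for `handle_for("classifieds", "long")` never needs to know which machine serves that role, so a role can be moved to smaller or larger hardware by editing the table alone.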

Horizontal Partitioning: Hydra

(Diagram: one client connected to cluster_01, cluster_02, cluster_03, ... cluster_N)

Horizontal Partitioning: Hydra
Need to retrofit a lot of code
●

Need non-blocking Perl MySQL client
●

Wrapped http://code.google.com/p/perl-mysql-async/
●

Eventually can size DB boxes based on price/power and adjust mapping function(s)
●

Choose hardware first
–

Make the db “fit”
–

Archiving lets us age a cluster instead of migrating its data to a new one.
●
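
The slides do not show Hydra's mapping function or client code, so the following is only a sketch of the general shape: a range-based mapping from posting ID to cluster, plus a concurrent fan-out that stands in for the non-blocking MySQL client. It is written in Python with `asyncio` rather than Perl; the ID ranges, `cluster_for`, `fetch_from_cluster`, and `fetch_postings` are all assumptions made for illustration.

```python
import asyncio
import bisect

# Hypothetical mapping: each cluster owns IDs below its (exclusive) upper bound.
# New IDs fall into the newest cluster, so older clusters are left to age.
CLUSTER_UPPER_BOUNDS = [1_000_000, 2_000_000, 3_000_000]
CLUSTER_NAMES = ["cluster_01", "cluster_02", "cluster_03"]


def cluster_for(posting_id):
    """Map a posting ID to the name of the cluster that owns it."""
    i = bisect.bisect_right(CLUSTER_UPPER_BOUNDS, posting_id)
    if i >= len(CLUSTER_NAMES):
        raise ValueError(f"no cluster mapped for id {posting_id}")
    return CLUSTER_NAMES[i]


async def fetch_from_cluster(cluster, ids):
    """Stand-in for a non-blocking query against one cluster."""
    await asyncio.sleep(0)  # yield to the event loop, as real network I/O would
    return {pid: f"row {pid} from {cluster}" for pid in ids}


async def fetch_postings(ids):
    """Group IDs by cluster, query every cluster concurrently, merge results."""
    by_cluster = {}
    for pid in ids:
        by_cluster.setdefault(cluster_for(pid), []).append(pid)
    parts = await asyncio.gather(
        *(fetch_from_cluster(c, group) for c, group in by_cluster.items())
    )
    merged = {}
    for part in parts:
        merged.update(part)
    return merged
```

With a mapping like this, resizing a cluster means changing one boundary in the table, which is the sense in which hardware can be chosen first and the data made to fit.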

Search Evolution
Problem: Users want to find stuff.
●

Solution: Use MySQL Full Text.
●

...time passes...
●

Problem: MySQL Full Text Doesn't Scale!
●

Solution: Use Sphinx.
●

...time passes...
●

Problem: Sphinx doesn't scale!
●

Solution: Patch Sphinx.
●

MySQL Full-Text Problems
Hitting invisible limits
●

CPU not pegged, Memory available
–

Disk I/O not unreasonable
–

Locking / Mutex contention? Probably.
–

MyISAM has occasional crashing / corruption
●

5 clusters of 5 machines
●

Partitioning based on city and category
–

All “hand balanced” and high-maintenance
–

~30M queries/day
●

Close to limits
–

Sphinx: My First CL Project
Sphinx is designed for text search
●

Fast and lean C++ code
●

Forking model scales well on multi-core
●

Control over indexing, weighting, etc.
●

Also spent some time looking at Apache Solr
●

Search Implementation Details
Partitioning based on cities (each has a numeric id)
●

Attributes vs. Keywords
●

Persistent Connections
●

Custom client and server modifications
–

Minimal stopword List
●

Partition into 2 clusters (1 master, 4 slaves)
●
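
How cities are assigned to the two search clusters is not stated in the slides. As one possible reading, the sketch below assigns a city to a cluster by its numeric id modulo the cluster count and sends each live query to one of that cluster's four slaves. The host names, `cluster_index`, and `pick_search_host` are hypothetical.

```python
import random

# Hypothetical layout: 2 clusters, each with 1 master (indexing) and 4 slaves (queries).
SEARCH_CLUSTERS = [
    {"master": "search1-master", "slaves": [f"search1-slave{i}" for i in range(1, 5)]},
    {"master": "search2-master", "slaves": [f"search2-slave{i}" for i in range(1, 5)]},
]


def cluster_index(city_id):
    """Assumed rule: numeric city id modulo the number of clusters."""
    return city_id % len(SEARCH_CLUSTERS)


def pick_search_host(city_id, rng=random):
    """Choose one slave in the city's cluster to serve a live query."""
    return rng.choice(SEARCH_CLUSTERS[cluster_index(city_id)]["slaves"])
```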

Sphinx Incremental Indexing
Re-index every N minutes
●

Use main + delta strategy
●

Adopted as: index + today + delta
–

One set per city (~500 * 3)
–

Slaves handle live queries, update via rsync
●

Need lots of FDs
●

Use all 4 cores to index
●

Every night, perform “daily merge”
●

Generate config files via Perl
●
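
To make the "one set per city" layout concrete, here is a minimal sketch of a config generator in Python (the slides say the real one is Perl). The naming scheme `city<ID>_<kind>` and the data directory are assumptions; `index`, `source`, and `path` are standard Sphinx configuration directives, and the matching `source` blocks are omitted for brevity.

```python
# Kinds taken from the slide: index + today + delta.
INDEX_KINDS = ("index", "today", "delta")


def index_names(city_id):
    """Names of the three Sphinx indexes for one city (hypothetical naming)."""
    return [f"city{city_id}_{kind}" for kind in INDEX_KINDS]


def index_stanza(name, data_dir="/var/sphinx/data"):
    """Render one index block; data_dir is an illustrative path."""
    return (
        f"index {name}\n"
        "{\n"
        f"    source = {name}\n"
        f"    path = {data_dir}/{name}\n"
        "}\n"
    )


def generate_config(city_ids):
    """Concatenate the index blocks for every city."""
    return "\n".join(
        index_stanza(name) for cid in city_ids for name in index_names(cid)
    )
```

Calling `generate_config(range(1, 501))` yields 1,500 index blocks. Each Sphinx index is stored as several files on disk, which is consistent with the note about needing lots of file descriptors.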

Sphinx Issues
Merge bugs [fixed]
●

File descriptor corruption [fixed]
●

Persistent connections [fixed]
●

Overhead of fork() was substantial in our testing
–

200 queries/sec vs. 1,000 queries/sec per box
–

Missing attribute updates [unreported]
●

Bogus docids in responses
●

We need to upgrade to latest Sphinx soon
●

Andrew and team have been excellent!
●

Search Project Results
From 25 MySQL Boxes to 10 Sphinx
●

Lots more headroom!
●

New Features
●

Nearby Search
–

No seizing or locking issues
●

1,000+ qps during peak w/room to grow
●

50M queries per day w/steady growth
●

Cluster partitioning built but not needed (yet?)
●

Better separation of code
●
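
As a quick sanity check on these figures, the daily totals can be converted to average queries per second. The helper below is plain arithmetic on the numbers quoted in the slides (~30M queries/day on MySQL full text, 50M queries/day on Sphinx); `average_qps` is a name introduced here.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_qps(queries_per_day):
    """Average queries per second implied by a daily total."""
    return queries_per_day / SECONDS_PER_DAY


mysql_avg = average_qps(30_000_000)   # about 347 qps
sphinx_avg = average_qps(50_000_000)  # about 579 qps
```

An average near 579 qps fits with the reported peak of 1,000+ qps, since peak traffic runs well above the daily mean.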

Sphinx Wishlist
Efficient delete handling (kill lists)
●

Non-fatal “missing” indexes
●

Index dump tool
●

Live document add/change/delete
●

Built-in replication
●

Stats and counters
●

Text attributes
●

Protocol checksum
●

Data Archiving, Replication, Indexes
Problem: We want to keep everything.
●

Solution: Archive to an archive cluster.
●

Problem: Archiving is too painful. Index updates are expensive! Slaves affected.
●

Solution: Archive with home-grown eventually consistent replication.
●

Data Archiving: OOB Replication
Eventual Consistency
●

Master process
●

SET SQL_LOG_BIN=0
–

Select expired IDs
–

Export records from live master
–

Import records into archive master
–

Delete expired from live master
–

Add IDs to list
–
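
The master-side steps above can be sketched as follows. This is an illustration only: it uses Python's `sqlite3` with in-memory databases in place of the live and archive MySQL masters, and the table names, columns, and `archive_expired` function are assumptions. Where the list of expired IDs actually lives is not stated in the slides; here it is a table with an auto-incrementing sequence number.

```python
import sqlite3

# Hypothetical schemas standing in for the live and archive masters.
SCHEMA_LIVE = """
CREATE TABLE postings (id INTEGER PRIMARY KEY, body TEXT, expires_at INTEGER);
CREATE TABLE expired_ids (seq INTEGER PRIMARY KEY AUTOINCREMENT, posting_id INTEGER);
"""
SCHEMA_ARCHIVE = (
    "CREATE TABLE postings (id INTEGER PRIMARY KEY, body TEXT, expires_at INTEGER);"
)


def archive_expired(live, archive, now):
    """One pass of the master-side archiver; returns the IDs that were moved."""
    # On MySQL the session would first run SET SQL_LOG_BIN=0 so these changes
    # stay out of the binary log. SQLite has no equivalent, so it is omitted.
    rows = live.execute(
        "SELECT id, body, expires_at FROM postings WHERE expires_at <= ?", (now,)
    ).fetchall()  # select expired IDs and export their records
    if not rows:
        return []
    ids = [(row[0],) for row in rows]
    with archive:  # import into the archive master first, so nothing is lost
        archive.executemany("INSERT OR REPLACE INTO postings VALUES (?, ?, ?)", rows)
    with live:  # then delete from the live master and publish the IDs
        live.executemany("DELETE FROM postings WHERE id = ?", ids)
        live.executemany("INSERT INTO expired_ids (posting_id) VALUES (?)", ids)
    return [i for (i,) in ids]
```

Importing before deleting means a crash between the two steps leaves a record in both places rather than in neither, and the `INSERT OR REPLACE` makes a retried pass harmless.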

Data Archiving: OOB Replication
Slave process
●

One per MySQL slave
–

Throttled to minimize impact
–

State kept on slave
–

Clone friendly
●

Simple logic
–

Select expired IDs added since my sequence number
●

Delete expired records
●

Update local “last seen” sequence number
●
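
The slave-side loop can be sketched in the same spirit. Again this uses Python's `sqlite3` as a stand-in for a MySQL slave; the `archive_state` table, the `purge_pass` function, and the use of a batch size as the throttle are assumptions, not the actual implementation.

```python
import sqlite3

# Hypothetical slave schema: the data table plus one row of local state.
SLAVE_SCHEMA = """
CREATE TABLE postings (id INTEGER PRIMARY KEY, body TEXT);
CREATE TABLE archive_state (last_seq INTEGER NOT NULL);
INSERT INTO archive_state VALUES (0);
"""


def purge_pass(slave, expired, batch_size=2):
    """Run one throttled pass on a slave.

    `expired` is a list of (seq, posting_id) pairs sorted by seq, standing in
    for the master's list of expired IDs. Returns how many entries were handled.
    """
    (last_seq,) = slave.execute("SELECT last_seq FROM archive_state").fetchone()
    # Select expired IDs added since this slave's sequence number.
    batch = [(seq, pid) for seq, pid in expired if seq > last_seq][:batch_size]
    if not batch:
        return 0
    with slave:  # deletes and the new sequence number commit together
        slave.executemany(
            "DELETE FROM postings WHERE id = ?", [(pid,) for _, pid in batch]
        )
        slave.execute("UPDATE archive_state SET last_seq = ?", (batch[-1][0],))
    return len(batch)
```

Because the sequence number is stored in the slave's own database, a copy of that slave presumably starts from the same position, which is one way to read "clone friendly".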

Long Term Data Archiving
Schema coupling is bad
●

ALTER TABLE takes forever
–

Lots of NULLs flying around
–

CouchDB or similar long-term?
●

Schema-free feels like a good fit
–

Tested some home grown solutions already
●

Separate storage and indexing?
●

Indexing with Sphinx?
–

Drizzle, XtraDB, Future Stuff
CouchDB looks very interesting. Maybe for archive?
●

XtraDB / InnoDB plugin
●

Better concurrency
–

Better tuning of InnoDB internals
–

libdrizzle + Perl
●

DBI/DBD may not fit an async model well
–

Can talk to both MySQL and Drizzle!
–

Oracle buying Sun?!?!
●

We're Hiring!
Work in San Francisco
●

Flexible, Small Company
●

Excellent Benefits
●

Help Millions of People Every Week
●

We Need Perl/MySQL Hackers
●

Come Help us Scale and Grow
●
