2. About Me
at craigslist since mid-2008
first major project: “fix search”
Perl, search, MySQL, redis, MongoDB, data, backend services
previously: Yahoo! and Marathon Oil
wrote 1st edition of High Performance MySQL
3. About craigslist
engineering culture
no product managers or marketing
< 50 employees
self-hosted infrastructure we own & manage
no “cloud” or virtualization
multi-datacenter
driven by user needs and feedback
6. Challenges
indexing rate (incoming volume)
thousands of postings per minute
churn and half-life
traffic (always increasing)
peak over 4,000 queries/second
query multipliers (new features)
spreading the load
sharding and partitioning
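
One way to picture "spreading the load" across clusters is a stable hash from city name to cluster number. This is only a minimal sketch: the talk does not say how cities were assigned to clusters, and the function name `cluster_for_city` and the hashing scheme are assumptions made for illustration.

```python
import hashlib


def cluster_for_city(city: str, num_clusters: int) -> int:
    """Map a city name to a cluster number with a stable hash.

    Hypothetical helper: the real assignment scheme is not described
    in the slides.
    """
    if num_clusters < 1:
        raise ValueError("num_clusters must be at least 1")
    # md5 is used only for a stable, well-spread hash, not for security.
    digest = hashlib.md5(city.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_clusters


if __name__ == "__main__":
    for city in ("sfbay", "newyork", "chicago"):
        print(city, cluster_for_city(city, 3))
```

With plain modulo hashing, most cities move to a different cluster when the cluster count changes, so an explicit city-to-cluster lookup table may be the more practical choice when growing from 2 clusters to 3.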
8. Evolution
needs and desires are changing
sphinx is improving
hardware is more capable
learn from previous mistakes
it’s fun to do new things :-)
searching > browsing
10. MySQL Full-Text
used up until late 2008
manual sharding
performance was poor (easy to DoS)
often fell off a cliff
limited query syntax
MyISAM corruption
13. Sphinx Tools
searchd: the sphinx server process
multi-threaded or pre-forking
indexer: build batch indexes off-line
indextool: check indexes and get details
search: diagnostic tool for simple searches
14. Master/Slave Sphinx
one index per city
growth by sharding into 2 then 3 clusters
masters build indexes every 10 minutes
used indexer and perl scripts to generate XML
build versioning and rollback mechanism
slaves pull indexes via rsync and reload
used pre-forking config
hardware was dual proc, dual core AMD Opterons with 32GB RAM
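
The versioning and rollback idea can be sketched as numbered build directories with a `current` symlink that is swapped atomically. The slides do not describe the actual mechanism; the directory layout (`build-0001`, `build-0002`, ...), the `current` link, and the function names below are all assumptions for illustration, and the sketch assumes a POSIX filesystem.

```python
import os


def publish(root: str, version: str) -> None:
    """Atomically point root/current at root/<version> (hypothetical layout)."""
    tmp_link = os.path.join(root, "current.tmp")
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    # Relative link target, so the tree can be copied (e.g. by rsync) intact.
    os.symlink(version, tmp_link)
    # rename(2) replaces the old link in one step; readers never see a gap.
    os.replace(tmp_link, os.path.join(root, "current"))


def rollback(root: str) -> str:
    """Point root/current at the build just before the active one."""
    versions = sorted(d for d in os.listdir(root) if d.startswith("build-"))
    active = os.readlink(os.path.join(root, "current"))
    position = versions.index(active)
    if position == 0:
        raise RuntimeError("no earlier build to roll back to")
    previous = versions[position - 1]
    publish(root, previous)
    return previous
```

Keeping a few old builds on disk makes a rollback a link swap rather than a rebuild, which matters when a fresh build takes up to 10 minutes.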
20. Real-Time Sphinx
RT indexes in sphinx have matured
reduce overhead from the searchd restart
reduce time to search from posting going live
goal < 10 seconds
eliminate XML generation code
use MySQL protocol
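
Using the MySQL protocol means a posting can be written to an RT index with an SQL-like statement (SphinxQL) instead of a generated XML file. The sketch below only builds the statements and their parameters; the index name `postings_rt` and the `title` and `body` fields are hypothetical, not taken from the talk.

```python
import re

_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def _check_index(index: str) -> str:
    """Reject index names that are not plain identifiers."""
    if not _NAME.fullmatch(index):
        raise ValueError(f"bad index name: {index!r}")
    return index


def replace_posting(index: str, doc_id: int, title: str, body: str):
    """Build a SphinxQL REPLACE that inserts or overwrites one document."""
    sql = f"REPLACE INTO {_check_index(index)} (id, title, body) VALUES (%s, %s, %s)"
    return sql, (doc_id, title, body)


def search_postings(index: str, query: str, limit: int = 20):
    """Build a SphinxQL full-text query returning matching document ids."""
    sql = f"SELECT id FROM {_check_index(index)} WHERE MATCH(%s) LIMIT %s"
    return sql, (query, limit)


if __name__ == "__main__":
    print(replace_posting("postings_rt", 42, "red bike", "lightly used"))
    print(search_postings("postings_rt", "bike"))
```

Each `(sql, params)` pair is shaped for a MySQL client library that fills in `%s` placeholders on the client side, connected to whatever SphinxQL listener searchd is configured with. Values are passed as parameters so the client handles quoting.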
21. Sphinx Clusters
Live: what you use
highest traffic, volume, churn
Team: what we use
lowest traffic, lots of extra data
Forums: yes, we have threaded discussions
low volume, low traffic
Archive: postings more than a few months old
terabytes of indexes, constantly growing
22. Ram & Disk Chunks
Indexes begin as “ram chunks”
rt_mem_limit caps their size
once too large, they become “disk chunks”
obviously, disk is slower than RAM
the more chunks, the more docs to check
queries slow down, CPU use rises...
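
A rough way to see why rt_mem_limit matters is to estimate how many disk chunks an index of a given size produces. This is a deliberately simplified model, assuming every full ram chunk is flushed as one disk chunk of about rt_mem_limit bytes; it ignores deletions, compression, and any later merging of chunks. The function name and the sizes in the example are made up for illustration.

```python
def estimate_disk_chunks(index_bytes: int, rt_mem_limit_bytes: int) -> int:
    """Estimate disk chunks, assuming one flush per full ram chunk."""
    if rt_mem_limit_bytes <= 0:
        raise ValueError("rt_mem_limit_bytes must be positive")
    if index_bytes < 0:
        raise ValueError("index_bytes must not be negative")
    # Whatever does not fill a whole chunk is still sitting in RAM.
    return index_bytes // rt_mem_limit_bytes


if __name__ == "__main__":
    MIB = 1024 ** 2
    GIB = 1024 ** 3
    print(estimate_disk_chunks(10 * GIB, 512 * MIB))  # 20
    print(estimate_disk_chunks(10 * GIB, 2 * GIB))    # 5
```

Under this model, a limit four times larger gives a quarter as many chunks for each query to check, at the cost of more RAM per index.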
23. Lessons
stopwords: google has spoiled users
MONITOR ALL THE THINGS!!1!
Mind your rt_mem_limit
Keep it all in RAM
Make re-indexing easy
Automate cloning
24. Questions?
While you ask, keep in mind...
craigslist is hiring
front-end developers
systems and network admins
back-end developers
send me your resume: z@craigslist.org
https://www.craigslist.org/about/craigslist_is_hiring