Transcript of "Realtime Search Infrastructure at Craigslist (OpenWest 2014)"
sphinx at craigslist
at craigslist since mid-2008
first major project: “fix search”
Perl, search, MySQL, redis, MongoDB, data,
previously: Yahoo! and Marathon Oil
wrote 1st edition of High Performance MySQL
no product managers or marketing
< 50 employees
self-hosted infrastructure we own & manage
no “cloud” or virtualization
driven by user needs and feedback
history of search at craigslist
indexing rate (incoming volume)
thousands of postings per minute
churn and half-life
traffic (always increasing)
peak over 4,000 queries/second
query multipliers (new features)
spreading the load
sharding and partitioning
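One way to picture "spreading the load" is a stable mapping from each city to a cluster. The sketch below is only an illustration: the talk does not say how cities were assigned to clusters, so the hash-based rule and the names `shard_for` and `partition` are assumptions.

```python
import zlib


def shard_for(city: str, num_clusters: int) -> int:
    """Map a city name to a cluster number with a stable hash (assumed scheme)."""
    if num_clusters < 1:
        raise ValueError("num_clusters must be at least 1")
    # zlib.crc32 gives the same value on every run, unlike the built-in hash().
    return zlib.crc32(city.encode("utf-8")) % num_clusters


def partition(cities, num_clusters):
    """Group cities by the cluster that would serve their index."""
    clusters = {n: [] for n in range(num_clusters)}
    for city in cities:
        clusters[shard_for(city, num_clusters)].append(city)
    return clusters
```

A modulo hash moves many cities when the cluster count changes, so an explicit lookup table may be a better fit when clusters are added one at a time.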
very fast queries
easy to understand
searchd: the sphinx server process
multi-threaded or pre-forking
indexer: build batch indexes off-line
indextool: check indexes and get details
search: diagnostic tool for simple searches
one index per city
growth by sharding into 2 then 3 clusters
masters build indexes every 10 minutes
used indexer and perl scripts to generate XML
build versioning and rollback mechanism
slaves pull indexes via rsync and reload
used pre-forking config
hardware was dual proc, dual core AMD Opterons with 32GB RAM
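The "build versioning and rollback mechanism" bullet can be pictured as a short history of published builds with a pointer to the newest one. This is a minimal in-memory sketch, not the scripts craigslist used: the class name `IndexVersions`, the `keep` parameter, and the build identifiers are all hypothetical.

```python
class IndexVersions:
    """Track published index builds so a bad build can be rolled back."""

    def __init__(self, keep: int = 3):
        if keep < 2:
            raise ValueError("keep at least 2 builds so rollback is possible")
        self.keep = keep
        self.builds = []  # oldest first, newest last

    def publish(self, build_id: str) -> None:
        """Record a new build and drop builds beyond the retention limit."""
        self.builds.append(build_id)
        del self.builds[:-self.keep]

    @property
    def current(self):
        """The build that search nodes should be serving, or None."""
        return self.builds[-1] if self.builds else None

    def rollback(self) -> str:
        """Discard the newest build and return the previous one."""
        if len(self.builds) < 2:
            raise RuntimeError("no earlier build to roll back to")
        self.builds.pop()
        return self.current
```

In a setup like the one described, a search node would sync whatever `current` names and reload, so rolling back on the master is enough to revert every node at its next pull.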
RT indexes in sphinx have matured
reduce overhead from the searchd restart
reduce time to search from posting going live
goal < 10 seconds
eliminate XML generation code
use MySQL protocol
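With RT indexes, documents can be written and searched with SQL-like statements (SphinxQL) sent over the MySQL protocol, which is what removes the XML step. The sketch below only builds the statements. The index name `postings_rt`, the `title` and `body` fields, and the helper names are hypothetical, and the `%s` placeholders assume a MySQL client library that substitutes parameters on the client side.

```python
import re

_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _check_index(index: str) -> str:
    """Reject index names that are not plain identifiers."""
    if not _IDENT.match(index):
        raise ValueError(f"unsafe index name: {index!r}")
    return index


def replace_posting(index: str, doc_id: int, title: str, body: str):
    """Build a SphinxQL REPLACE that adds or updates one posting."""
    sql = f"REPLACE INTO {_check_index(index)} (id, title, body) VALUES (%s, %s, %s)"
    return sql, (doc_id, title, body)


def search_postings(index: str, terms: str, limit: int = 20):
    """Build a SphinxQL full-text query returning matching document ids."""
    sql = f"SELECT id FROM {_check_index(index)} WHERE MATCH(%s) LIMIT %s"
    return sql, (terms, limit)
```

Each returned pair would be handed to a cursor's `execute(sql, params)` on a connection opened against searchd's SphinxQL listener rather than against MySQL itself.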
Live: what you use
highest traffic, volume, churn
Team: what we use
lowest traffic, lots of extra data
Forums: yes, we have threaded discussions
low volume, low traffic
Archive: posting more than a few months old
terabytes of indexes, constantly growing
Ram & Disk Chunks
Indexes begin as “ram chunks”
rt_mem_limit caps their size
once too large, they become “disk chunks”
obviously, disk is slower than RAM
the more chunks, the more docs to check
query times fall, CPU use rises...
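The chunk lifecycle above can be illustrated with a toy model: documents collect in a RAM chunk until a size cap is reached, then the chunk is set aside as a disk chunk and a new RAM chunk starts. This is a deliberately simplified sketch and not Sphinx's actual implementation. The class `ToyRTIndex` is hypothetical, and its `rt_mem_limit` argument only borrows the name of the real setting.

```python
class ToyRTIndex:
    """Toy model of RAM chunks spilling into disk chunks at a size cap."""

    def __init__(self, rt_mem_limit: int):
        self.rt_mem_limit = rt_mem_limit  # cap in bytes (toy units)
        self.ram_chunk = []
        self.ram_bytes = 0
        self.disk_chunks = []

    def insert(self, doc: str) -> None:
        """Add a document, flushing the RAM chunk first if it would overflow."""
        size = len(doc.encode("utf-8"))
        if self.ram_chunk and self.ram_bytes + size > self.rt_mem_limit:
            self.flush()
        self.ram_chunk.append(doc)
        self.ram_bytes += size

    def flush(self) -> None:
        """Turn the current RAM chunk into a disk chunk and start a new one."""
        self.disk_chunks.append(self.ram_chunk)
        self.ram_chunk = []
        self.ram_bytes = 0

    def chunks_to_search(self) -> int:
        """Number of chunks a query would have to visit."""
        return len(self.disk_chunks) + (1 if self.ram_chunk else 0)
```

In this model a smaller cap produces more chunks for the same data, so every query has more places to look, which is the cost the slide warns about.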
stopwords: google has spoiled users
MONITOR ALL THE THINGS!!1!!
Mind your rt_mem_limit
Keep it all in RAM
Make re-indexing easy
While you ask, keeping in mind...
craigslist is hiring
systems and network admins
send me your resume: firstname.lastname@example.org