These are the slides from my talk at the 2012 Sphinx Search Day in Santa Clara, California. They provide a high-level picture of where Sphinx is used at craigslist, along with a bit of history, issues, and future work.
3. CL Sphinx Infrastructure• Live Sphinx • ~30 million postings • end users searching for stuff on craigslist• Team Sphinx • ~100 million postings • additional indexes of postings for internal use (including non-live postings)
4. CL Sphinx Infrastructure• Archive Sphinx • older postings (~3 billion) • constantly growing in size• Real-Time Sphinx • last ~2 days worth of postings• Forums Sphinx • ~150 million forum postings
5. How We Got Here
6. Back in 2008• MySQL FULL TEXT (MyISAM)• 25 Servers• Melted Down Frequently• Desperately Needed a Solution• This was my first project at craigslist...• Looked at Solr, Sphinx, Xapian• Sphinx felt like the right fit
7. Making Sphinx Work• Benchmarking showed promising results • Query performance was great • ~800qps/instance • back then we only needed 1,200/sec • Indexing performance too • Can index documents far faster than I can make the XML for input (from Perl)• Can’t index and serve at the same time, though...
8. “Live” Sphinx• One index per city (~700 indexes) • Main + Delta • xmlpipe2 input• Data all fits on a single machine• 32-bit ids• High churn rate• Settled on Master/Slave model w/rsync replication• Deployed in January, 2009
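The xmlpipe2 input mentioned above is an XML stream that the indexer reads from a command's standard output. The talk generated it from Perl; the following is a minimal Python sketch of the same idea, not the craigslist code. The field and attribute names (`title`, `body`, `posted`) and the shape of the posting dicts are assumptions made for illustration.

```python
from xml.sax.saxutils import escape, quoteattr


def xmlpipe2_lines(postings):
    """Yield the lines of an xmlpipe2 stream for an iterable of posting dicts."""
    yield '<?xml version="1.0" encoding="utf-8"?>'
    yield "<sphinx:docset>"
    # Schema: the field and attribute names are illustrative assumptions.
    yield "<sphinx:schema>"
    yield '<sphinx:field name="title"/>'
    yield '<sphinx:field name="body"/>'
    yield '<sphinx:attr name="posted" type="timestamp"/>'
    yield "</sphinx:schema>"
    for p in postings:
        # One <sphinx:document> per posting; text is XML-escaped.
        yield "<sphinx:document id=%s>" % quoteattr(str(p["id"]))
        yield "<title>%s</title>" % escape(p["title"])
        yield "<body>%s</body>" % escape(p["body"])
        yield "<posted>%d</posted>" % p["posted"]
        yield "</sphinx:document>"
    yield "</sphinx:docset>"
```

A script that prints these lines can be attached to a source of `type = xmlpipe2` through `xmlpipe_command` in sphinx.conf. With one index per city, one such stream would be produced per city for the main index and another for its delta.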
10. Main+Delta Indexes (diagram; labels: transient delta index for today, regular merge from transient delta, periodic merge to clean house, logical index)
11. Early Issues• Monitoring• Persistent Connections w/prefork • hacked up my own initially• Index merge crashes/bugs• We’re always running svn snapshots
12. Early Success• Replaced the 25 MySQL servers• Used 10 sphinx servers (2 masters, 8 slaves)• Search traffic continued to increase• Tons of headroom!• Typical search is under 5ms• New Features • “nearby” search • sort by: recent, price, best match
13. Early Mistakes• Stopwords• Not setting query limits • Sphinx handled this just fine!• ASCII-only• Query mangling • need to understand how users search and what they expect to find• UpdateAttributes (no kill lists!)
14. What Then?
15. Growth• Wanted Sphinx for “internal” use• Created internal “team sphinx” with more indexed data • includes non-visible postings • includes additional fields• Space became an issue, so had to build some simple sharding into our code • 2 clusters: even/odd split for indexes
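The even/odd split can be sketched in a few lines. This is an illustration under assumptions, not the craigslist code: it assumes each index has a numeric id, and the cluster names are hypothetical.

```python
# Hypothetical cluster names; the slides only say there are two clusters.
CLUSTERS = {0: "team-sphinx-even", 1: "team-sphinx-odd"}


def cluster_for_index(index_id: int) -> str:
    """Route an index to a cluster by the parity of its numeric id."""
    return CLUSTERS[index_id % 2]
```

For example, `cluster_for_index(10)` returns `"team-sphinx-even"` and `cluster_for_index(7)` returns `"team-sphinx-odd"`. A parity split is easy to reason about, but it only balances space if even and odd indexes are of similar total size.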
16. Live Sphinx Today• 300+ million queries/day• 5,000 queries/sec peak load• removed stopwords• threaded workers• dict=keywords• wildcard search enabled• UTF-8 (mostly) and charset_table• blend_chars• kill lists (no searchd on masters)• sharded (3 masters, 18 slaves) on blades
18. Query Volume
19. Archive Sphinx• The Archive Project!• 2.5 billion postings• Growing by ~1.6 million daily• String attributes• 4 shards, each is a 1 master, 2 slave cluster• Bucket based on UserID (not city)• Low query volume• Need a way to reindex all docs
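A rough sketch of bucketing by UserID across the four archive clusters follows. The slides do not say how a UserID is mapped to a shard, so the modulo rule is an assumption, as are the host names and the choice to index on the master and search on the slaves in round-robin order.

```python
import itertools

NUM_SHARDS = 4  # from the slide: 4 shards, each 1 master + 2 slaves

# Hypothetical host names for illustration only.
SHARDS = [
    {
        "master": f"archive{n}-master",
        "slaves": [f"archive{n}-slave1", f"archive{n}-slave2"],
    }
    for n in range(NUM_SHARDS)
]
_slave_cycles = [itertools.cycle(shard["slaves"]) for shard in SHARDS]


def shard_for_user(user_id: int) -> int:
    """Assumed rule: bucket a user by UserID modulo the shard count."""
    return user_id % NUM_SHARDS


def index_host(user_id: int) -> str:
    """New documents for a user go to that shard's master."""
    return SHARDS[shard_for_user(user_id)]["master"]


def search_host(user_id: int) -> str:
    """Searches for a user rotate across that shard's slaves."""
    return next(_slave_cycles[shard_for_user(user_id)])
```

Bucketing by user rather than by city means that all of one user's postings land on a single shard, so a query scoped to one user touches one cluster instead of four.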
20. Real-Time Sphinx• There’s a delay in indexing data on the master and replicating to the slaves...• What if we want to offer “real-time search” of your own postings?
21. So I built something...• Known as rtsd (real-time search daemon)• Sphinx instance with MySQL Protocol• Primarily uses in-memory indexes• Used to bridge the gap between “now” and “archive sphinx”• Configured as an N day rolling window• Runs on archive sphinx master hosts
22. Sphinx Time Horizons• Columns: Classic, Team, Archive, rtsd• 0-20min: All• 20m-1day: Visible, All, All• 1-60 days: Visible, All, All• 60+ days: All• Note: Visible postings are findable on the site.
24. Daily Posting Buckets• 3 indexes • yesterday • today • tomorrow• (DayofYear(PostedDate)%3) = $index_num• Nightly cron to “TRUNCATE RTINDEX” on the “tomorrow” index • sponsored feature!
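The bucket formula above can be worked through in Python. This is a sketch of the arithmetic only: the index names (`rt_day0` and so on) are hypothetical, and the statement is just built as a string rather than sent to a running daemon.

```python
from datetime import date, timedelta

NUM_BUCKETS = 3


def bucket_for(posted: date) -> int:
    """DayOfYear(PostedDate) % 3, as on the slide."""
    return posted.timetuple().tm_yday % NUM_BUCKETS


def truncate_statement(today: date) -> str:
    """Statement a nightly job would run to empty the "tomorrow" index.

    The rt_dayN index naming is an assumption for illustration.
    """
    tomorrow = today + timedelta(days=1)
    return f"TRUNCATE RTINDEX rt_day{bucket_for(tomorrow)}"
```

For April 6, 2012 (day 97 of the year), yesterday, today, and tomorrow map to buckets 0, 1, and 2, so the nightly job would truncate `rt_day2` before the next day's postings arrive. Taken literally, the formula does not cycle cleanly across a non-leap year boundary (day 365 and day 2 both map to bucket 2); the slides do not say how that case is handled.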
25. rtsd indexes
26. rtsd virtual indexes
27. rtsd virtual indexes
28. Future Work• autonomous nodes (no master/slave) • many-core blades with SSD storage• better performance metrics • we drop a lot of data on the floor• log mining and analysis• sphinx for “table of contents” (browsing)• haproxy in front of sphinx• generic sharding code• testing framework
29. Sphinx Wishlist• 32 -> 64 bit migration tool• capture stats at daemon shut down• RT optimizations for DELETE (high churn)• distributed search (agent) config with multiple servers per index (for failover and load):
31. Craigslist is Hiring!• Developers • Back-end • Front-end• Systems Administrators• Network Engineers• Email: firstname.lastname@example.org plain text resume!