Sphinx at Craigslist in 2012

These are the slides from my talk at the 2012 Sphinx Search Day in Santa Clara, California. They give a high-level picture of where Sphinx is used at craigslist, along with a bit of history, issues, and future work.

  1. Sphinx at Craigslist
     Jeremy Zawodny, craigslist
  2. Brief Overview
  3. CL Sphinx Infrastructure
     • Live Sphinx
       • ~30 million postings
       • end users searching for stuff on craigslist
     • Team Sphinx
       • ~100 million postings
       • additional indexes of postings for internal use (including non-live postings)
  4. CL Sphinx Infrastructure
     • Archive Sphinx
       • older postings (~3 billion)
       • constantly growing in size
     • Real-Time Sphinx
       • last ~2 days' worth of postings
     • Forums Sphinx
       • ~150 million forum postings
  5. How We Got Here
  6. Back in 2008
     • MySQL FULL TEXT (MyISAM)
     • 25 Servers
     • Melted Down Frequently
     • Desperately Needed a Solution
     • This was my first project at craigslist...
     • Looked at Solr, Sphinx, Xapian
     • Sphinx felt like the right fit
  7. Making Sphinx Work
     • Benchmarking showed promising results
       • Query performance was great
       • ~800 qps/instance
       • back then we only needed 1,200/sec
       • Indexing performance too
       • Can index documents far faster than I can make the XML for input (from Perl)
     • Can’t index and serve at the same time, though...
  8. “Live” Sphinx
     • One index per city (~700 indexes)
       • Main + Delta
       • xmlpipe2 input
     • Data all fits on a single machine
     • 32-bit IDs
     • High churn rate
     • Settled on Master/Slave model w/rsync replication
     • Deployed in January, 2009
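
The slides mention xmlpipe2 input generated from Perl. As a rough illustration only (this is not the author's code), here is a minimal Python sketch of emitting an xmlpipe2 stream. The field and attribute names (`title`, `body`, `posted_date`) and the shape of the posting dicts are assumptions made for the example.

```python
from xml.sax.saxutils import escape, quoteattr


def xmlpipe2_docset(postings):
    """Yield the lines of an xmlpipe2 stream for an iterable of posting dicts.

    Each posting is assumed (hypothetically) to have the keys
    "id", "title", "body" and "posted_date" (a Unix timestamp).
    """
    yield '<?xml version="1.0" encoding="utf-8"?>'
    yield "<sphinx:docset>"
    # Inline schema: two full-text fields and one timestamp attribute.
    yield "<sphinx:schema>"
    yield '<sphinx:field name="title"/>'
    yield '<sphinx:field name="body"/>'
    yield '<sphinx:attr name="posted_date" type="timestamp"/>'
    yield "</sphinx:schema>"
    for posting in postings:
        yield "<sphinx:document id=%s>" % quoteattr(str(posting["id"]))
        # escape() protects against &, < and > in user-supplied text.
        yield "<title>%s</title>" % escape(posting["title"])
        yield "<body>%s</body>" % escape(posting["body"])
        yield "<posted_date>%d</posted_date>" % posting["posted_date"]
        yield "</sphinx:document>"
    yield "</sphinx:docset>"


sample = [{"id": 42, "title": "bike & helmet",
           "body": "<b>like new</b>", "posted_date": 1325376000}]
lines = list(xmlpipe2_docset(sample))
```

Writing the stream as a generator keeps memory use flat, which matters when the indexer consumes documents faster than they can be produced.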
  9. Master/Slave Clusters
     • Number of slaves varies (typically 3-7)
     [Diagram: clusters of master and slave nodes]
  10. Main+Delta Indexes
      [Diagram; labels: delta, today, index, Logical Index, “Regular Merge from transient delta”, “Periodic Merge to clean house”]
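
To make the main+delta idea concrete, here is a toy Python model. It is an assumption-laden sketch, not Sphinx's API: plain dicts stand in for indexes, substring matching stands in for full-text search, and the kill list (mentioned elsewhere in the deck) is modeled as a set of document IDs whose copies in the main index should be suppressed.

```python
def search(main, delta, kill_list, term):
    """Search the logical index formed by main + delta.

    Documents whose IDs are in kill_list are hidden from the main index,
    so only their newer copy in delta (if any) can match.
    """
    hits = {doc_id: text for doc_id, text in main.items()
            if term in text and doc_id not in kill_list}
    hits.update({doc_id: text for doc_id, text in delta.items()
                 if term in text})
    return hits


def merge(main, delta, kill_list):
    """Fold delta into main, dropping killed documents; return the new main."""
    merged = {doc_id: text for doc_id, text in main.items()
              if doc_id not in kill_list}
    merged.update(delta)
    return merged


main = {1: "red bike", 2: "blue couch"}
delta = {1: "red scooter", 3: "green bike"}  # posting 1 was edited
kill_list = {1}

with_kill_list = search(main, delta, kill_list, "bike")
without_kill_list = search(main, delta, set(), "bike")
new_main = merge(main, delta, kill_list)
```

Without the kill list, the stale copy of posting 1 in the main index still matches "bike" even though the posting was edited. With it, only posting 3 is returned.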
  11. Early Issues
      • Monitoring
      • Persistent Connections w/prefork
        • hacked up my own initially
      • Index merge crashes/bugs
      • We’re always running svn snapshots
  12. Early Success
      • Replaced the 25 MySQL servers
      • Used 10 sphinx servers (2 masters, 8 slaves)
      • Search traffic continued to increase
      • Tons of headroom!
      • Typical search is under 5ms
      • New Features
        • “nearby” search
        • sort by: recent, price, best match
  13. Early Mistakes
      • Stopwords
      • Not setting query limits
        • Sphinx handled this just fine!
      • ASCII-only
      • Query mangling
        • need to understand how users search and what they expect to find
      • UpdateAttributes (no kill lists!)
  14. What Then?
  15. Growth
      • Wanted Sphinx for “internal” use
      • Created internal “team sphinx” with more indexed data
        • includes non-visible postings
        • includes additional fields
      • Space became an issue, so had to build some simple sharding into our code
        • 2 clusters: even/odd split for indexes
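
The even/odd split can be sketched in a few lines of Python. The slide does not say what is even or odd, so this assumes each index has a numeric ID and that parity of that ID picks the cluster. The cluster labels are placeholders.

```python
def split_even_odd(index_ids):
    """Partition numeric index IDs into two clusters by parity (assumed rule)."""
    clusters = {"even": [], "odd": []}
    for index_id in index_ids:
        clusters["odd" if index_id % 2 else "even"].append(index_id)
    return clusters


clusters = split_even_odd(range(1, 6))
```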
  16. Live Sphinx Today
      • 300+ million queries/day
      • 5,000 queries/sec peak load
      • removed stopwords
      • threaded workers
      • dict=keywords
      • wildcard search enabled
      • UTF-8 (mostly) and charset_table
      • blend_chars
      • kill lists (no searchd on masters)
      • sharded (3 masters, 18 slaves) on blades
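
As a quick sanity check on the two traffic figures, the arithmetic below converts the daily volume into an average rate and compares it with the stated peak. It treats "300+ million" as exactly 300 million, so the average is a lower bound and the peak-to-average ratio an upper bound.

```python
QUERIES_PER_DAY = 300_000_000   # "300+ million queries/day", taken as a floor
PEAK_QPS = 5_000                # "5,000 queries/sec peak load"
SECONDS_PER_DAY = 24 * 60 * 60

average_qps = QUERIES_PER_DAY / SECONDS_PER_DAY
peak_to_average = PEAK_QPS / average_qps

print(f"average: {average_qps:,.0f} qps")               # average: 3,472 qps
print(f"peak is {peak_to_average:.2f}x the average")    # peak is 1.44x the average
```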
  17. Sharding
  18. Query Volume
  19. Archive Sphinx
      • The Archive Project!
      • 2.5 billion postings
      • Growing by ~1.6 million daily
      • String attributes
      • 4 shards, each is a 1 master, 2 slave cluster
      • Bucket based on UserID (not city)
      • Low query volume
      • Need a way to reindex all docs
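
One way to picture the archive layout is the Python sketch below. The shard count and the 1-master, 2-slave shape come from the slide. The host names are hypothetical, and the bucketing rule (UserID modulo the shard count) is an assumption, since the slide only says buckets are based on UserID.

```python
from dataclasses import dataclass

NUM_SHARDS = 4  # from the slide


@dataclass(frozen=True)
class Shard:
    master: str
    slaves: tuple


# Hypothetical host names: one master and two slaves per shard.
SHARDS = tuple(
    Shard(master=f"archive{n}-master",
          slaves=(f"archive{n}-slave1", f"archive{n}-slave2"))
    for n in range(NUM_SHARDS)
)


def shard_for_user(user_id):
    """Pick the shard holding a user's postings (assumed rule: modulo)."""
    return SHARDS[user_id % NUM_SHARDS]


shard = shard_for_user(10)
```

Bucketing by user rather than by city keeps all of one user's postings on a single shard, so a search over a user's own history only needs to query that shard's slaves.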
  20. Real-Time Sphinx
      • There’s a delay in indexing data on the master and replicating to the slaves...
      • What if we want to offer “real-time search” of your own postings?
  21. So I built something...
      • Known as rtsd (real-time search daemon)
      • Sphinx instance with MySQL Protocol
      • Primarily uses in-memory indexes
      • Used to bridge the gap between “now” and “archive sphinx”
      • Configured as an N day rolling window
      • Runs on archive sphinx master hosts
  22. Sphinx Time Horizons
      Columns: Classic, Team, Archive, rtsd (row values listed in slide order)
      • 0-20min: All
      • 20m-1day: Visible, All, All
      • 1-60 days: Visible, All, All
      • 60+ days: All
      Note: Visible postings are findable on the site.
  23. rtsd overview
      [Diagram; components: PostingInfo table, rtsd_consumer, redis queue, rtsd_indexer, PostingCache, rtsd_sphinx, webbie (x3)]
  24. Daily Posting Buckets
      • 3 indexes
        • yesterday
        • today
        • tomorrow
      • (DayofYear(PostedDate)%3) = $index_num
      • Nightly cron to “TRUNCATE RTINDEX” on the “tomorrow” index
        • sponsored feature!
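
The bucket formula on this slide can be tried out in Python. This is a sketch of the arithmetic only: the function names are made up for the example, and `tm_yday` (1 to 366) is used as the equivalent of `DayofYear`.

```python
from datetime import date, timedelta

NUM_BUCKETS = 3  # yesterday, today, tomorrow


def bucket_for(posted):
    """Index number for a posting date: DayOfYear(PostedDate) % 3."""
    return posted.timetuple().tm_yday % NUM_BUCKETS


def buckets_around(today):
    """Index numbers currently playing the yesterday/today/tomorrow roles."""
    one_day = timedelta(days=1)
    return {
        "yesterday": bucket_for(today - one_day),
        "today": bucket_for(today),
        "tomorrow": bucket_for(today + one_day),
    }


mid_year = buckets_around(date(2012, 1, 2))
year_end = buckets_around(date(2013, 12, 31))
```

For 2 January 2012 the three roles map to indexes 1, 2 and 0. One wrinkle of the formula as written: across the end of a non-leap year the sequence goes 1, 2, 1 (days 364, 365, then day 1), so "yesterday" and "tomorrow" share an index. The slides do not say how, or whether, that case is handled.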
  25. rtsd indexes
  26. rtsd virtual indexes
  27. rtsd virtual indexes
  28. Future Work
      • autonomous nodes (no master/slave)
        • many-core blades with SSD storage
      • better performance metrics
        • we drop a lot of data on the floor
      • log mining and analysis
      • sphinx for “table of contents” (browsing)
      • haproxy in front of sphinx
      • generic sharding code
      • testing framework
  29. Sphinx Wishlist
      • 32 -> 64 bit migration tool
      • capture stats at daemon shut down
      • RT optimizations for DELETE (high churn)
      • distributed search (agent) config with multiple servers per index (for failover and load):
  30. Sphinx Wishlist
      • 32 -> 64 bit migration tool
      • capture stats at daemon shut down
      • RT optimizations for DELETE (high churn)
      • distributed search (agent) config with multiple servers per index (for failover and load):
  31. Craigslist is Hiring!
      • Developers
        • Back-end
        • Front-end
      • Systems Administrators
      • Network Engineers
      • Email: plain text resume!