Sphinx at Craigslist in 2012

These are the slides from my talk at the 2012 Sphinx Search Day in Santa Clara, California. It provides a high-level picture of where Sphinx is used at craigslist, a bit of history, issues, and future work.

  • Transcript

    • 1. Sphinx at Craigslist
        • Jeremy Zawodny, craigslist
    • 2. Brief Overview
    • 3. CL Sphinx Infrastructure
        • Live Sphinx
            • ~30 million postings
            • end users searching for stuff on craigslist
        • Team Sphinx
            • ~100 million postings
            • additional indexes of postings for internal use (including non-live postings)
    • 4. CL Sphinx Infrastructure
        • Archive Sphinx
            • older postings (~3 billion)
            • constantly growing in size
        • Real-Time Sphinx
            • last ~2 days worth of postings
        • Forums Sphinx
            • ~150 million forum postings
    • 5. How We Got Here
    • 6. Back in 2008
        • MySQL FULL TEXT (MyISAM)
        • 25 Servers
        • Melted Down Frequently
        • Desperately Needed a Solution
        • This was my first project at craigslist...
        • Looked at Solr, Sphinx, Xapian
        • Sphinx felt like the right fit
    • 7. Making Sphinx Work
        • Benchmarking showed promising results
            • Query performance was great
            • ~800qps/instance
            • back then we only needed 1,200/sec
            • Indexing performance too
            • Can index documents far faster than I can make the XML for input (from Perl)
        • Can’t index and serve at the same time, though...
    • 8. “Live” Sphinx
        • One index per city (~700 indexes)
            • Main + Delta
            • xmlpipe2 input
        • Data all fits on a single machine
        • 32bit ids
        • High churn rate
        • Settled on Master/Slave model w/rsync replication
        • Deployed in January, 2009
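The "xmlpipe2 input" mentioned on slides 7 and 8 is an XML stream that the Sphinx indexer reads from a command's output. The talk generated it from Perl; the following is only a minimal Python sketch of what such a stream looks like. The field names (`title`, `body`), the `posted` attribute, and the sample posting are assumptions made for illustration and are not taken from the slides.

```python
from xml.sax.saxutils import escape, quoteattr

# The slides mention 32bit ids; document ids are also required to be non-zero.
MAX_DOC_ID = 2**32 - 1


def xmlpipe2_lines(postings):
    """Yield the lines of an xmlpipe2 stream for an iterable of posting dicts.

    Each posting is a dict with hypothetical keys: id, title, body, posted.
    """
    yield '<?xml version="1.0" encoding="utf-8"?>'
    yield "<sphinx:docset>"
    # Schema: two full-text fields and one timestamp attribute (assumed names).
    yield "<sphinx:schema>"
    yield '<sphinx:field name="title"/>'
    yield '<sphinx:field name="body"/>'
    yield '<sphinx:attr name="posted" type="timestamp"/>'
    yield "</sphinx:schema>"
    for posting in postings:
        doc_id = int(posting["id"])
        if not 0 < doc_id <= MAX_DOC_ID:
            raise ValueError("document id out of 32-bit range: %r" % doc_id)
        yield "<sphinx:document id=%s>" % quoteattr(str(doc_id))
        # escape() protects &, < and > in user-supplied text.
        yield "<title>%s</title>" % escape(posting["title"])
        yield "<body>%s</body>" % escape(posting["body"])
        yield "<posted>%d</posted>" % posting["posted"]
        yield "</sphinx:document>"
    yield "</sphinx:docset>"


sample = [{"id": 42, "title": "Couch & chair", "body": "Good condition <$50",
           "posted": 1333238400}]
lines = list(xmlpipe2_lines(sample))
print("\n".join(lines))
```

Streaming the lines from a generator keeps memory use flat for large indexes, which matters if, as the slide says, producing the XML is slower than indexing it.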
    • 9. Master/Slave Clusters
        • Number of slaves varies (typically 3-7)
        • Diagram: master and slave hosts
    • 10. Main+Delta Indexes
        • Diagram labels: delta; regular merge from transient delta; today; periodic merge to clean house; logical index
    • 11. Early Issues
        • Monitoring
        • Persistent Connections w/prefork
            • hacked up my own initially
        • Index merge crashes/bugs
        • We’re always running svn snapshots
    • 12. Early Success
        • Replaced the 25 MySQL servers
        • Used 10 sphinx servers (2 masters, 8 slaves)
        • Search traffic continued to increase
        • Tons of headroom!
        • Typical search is under 5ms
        • New Features
            • “nearby” search
            • sort by: recent, price, best match
    • 13. Early Mistakes
        • Stopwords
        • Not setting query limits
            • Sphinx handled this just fine!
        • ASCII-only
        • Query mangling
            • need to understand how users search and what they expect to find
        • UpdateAttributes (no kill lists!)
    • 14. What Then?
    • 15. Growth
        • Wanted Sphinx for “internal” use
        • Created internal “team sphinx” with more indexed data
            • includes not visible postings
            • includes additional fields
        • Space became an issue, so had to build some simple sharding into our code
            • 2 clusters: even/odd split for indexes
    • 16. Live Sphinx Today
        • 300+ million queries/day
        • 5,000 queries/sec peak load
        • removed stopwords
        • threaded workers
        • dict=keywords
        • wildcard search enabled
        • UTF-8 (mostly) and charset_table
        • blend_chars
        • kill lists (no searchd on masters)
        • sharded (3 masters, 18 slaves) on blades
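To put the two load figures on slide 16 side by side, here is a small worked calculation. It uses only the numbers quoted on the slide; since the daily figure is given as "300+ million", the average it produces is a lower bound.

```python
# Figures quoted on the slide.
QUERIES_PER_DAY = 300_000_000   # "300+ million queries/day" (lower bound)
PEAK_QPS = 5_000                # "5,000 queries/sec peak load"

SECONDS_PER_DAY = 24 * 60 * 60

# Average rate if the daily volume were spread evenly over the day.
average_qps = QUERIES_PER_DAY / SECONDS_PER_DAY

# How much higher the quoted peak is than that even-spread average.
peak_to_average = PEAK_QPS / average_qps

print("average: about %d queries/sec" % round(average_qps))
print("peak is about %.2fx the average" % peak_to_average)
```

On these numbers the average works out to roughly 3,470 queries/sec, so the quoted peak is under one and a half times the daily average.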
    • 17. Sharding
    • 18. Query Volume
    • 19. Archive Sphinx
        • The Archive Project!
        • 2.5 billion postings
        • Growing by ~1.6 million daily
        • String attributes
        • 4 shards, each is a 1 master, 2 slave cluster
        • Bucket based on UserID (not city)
        • Low query volume
        • Need a way to reindex all docs
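Slide 19 says archive postings are bucketed by UserID across 4 shards, but not how the bucket is computed. The sketch below assumes a plain modulo on a numeric user id, which is one common way to do it and not necessarily what craigslist uses. The function name and the sample postings are hypothetical.

```python
from collections import defaultdict

NUM_SHARDS = 4  # "4 shards" per the slide


def archive_shard(user_id):
    """Map a numeric user id to a shard number (assumed rule: modulo)."""
    if user_id < 0:
        raise ValueError("user id must be non-negative")
    return user_id % NUM_SHARDS


def group_by_shard(postings):
    """Group (posting_id, user_id) pairs by the shard that would index them."""
    shards = defaultdict(list)
    for posting_id, user_id in postings:
        shards[archive_shard(user_id)].append(posting_id)
    return dict(shards)


# Hypothetical postings: (posting_id, user_id)
sample = [(1001, 10), (1002, 11), (1003, 10), (1004, 7)]
grouped = group_by_shard(sample)
print(grouped)
```

Bucketing by user rather than by city means all of one user's postings land on the same shard, so a search over a single user's history only needs to query one cluster.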
    • 20. Real-Time Sphinx
        • There’s a delay in indexing data on the master and replicating to the slaves...
        • What if we want to offer “real-time search” of your own postings?
    • 21. So I built something...
        • Known as rtsd (real-time search daemon)
        • Sphinx instance with MySQL Protocol
        • Primarily uses in-memory indexes
        • Used to bridge the gap between “now” and “archive sphinx”
        • Configured as an N day rolling window
        • Runs on archive sphinx master hosts
    • 22. Sphinx Time Horizons
        • Columns: Classic, Team, Archive, rtsd
        • 0-20min: All
        • 20m-1day: Visible, All, All
        • 1-60 days: Visible, All, All
        • 60+ days: All
        • Note: Visible postings are findable on the site.
    • 23. rtsd overview
        • Diagram components: PostingInfo table, rtsd_consumer, redis queue, rtsd_indexer, PostingCache, rtsd_sphinx, webbie (x3)
    • 24. Daily Posting Buckets
        • 3 indexes
            • yesterday
            • today
            • tomorrow
        • (DayofYear(PostedDate)%3) = $index_num
        • Nightly cron to “TRUNCATE RTINDEX” on the “tomorrow” index
            • sponsored feature!
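The bucket formula on slide 24 can be sketched as follows. This follows the slide's expression literally, using Python's day-of-year (1 to 366, the same range as MySQL's `DAYOFYEAR`). The index name prefix `rt_postings_` and the helper names are hypothetical; the slides do not show how the nightly cron job is written.

```python
from datetime import date, timedelta

NUM_BUCKETS = 3                 # "3 indexes" per the slide
INDEX_PREFIX = "rt_postings_"   # hypothetical index naming scheme


def bucket_for(posted):
    """Literal version of (DayofYear(PostedDate) % 3) from the slide."""
    return posted.timetuple().tm_yday % NUM_BUCKETS


def bucket_roles(today):
    """Return which bucket number plays yesterday/today/tomorrow."""
    one_day = timedelta(days=1)
    return {
        "yesterday": bucket_for(today - one_day),
        "today": bucket_for(today),
        "tomorrow": bucket_for(today + one_day),
    }


def nightly_truncate_statement(today):
    """Build the SphinxQL statement that clears the "tomorrow" bucket."""
    return "TRUNCATE RTINDEX %s%d" % (INDEX_PREFIX, bucket_for(today + timedelta(days=1)))


roles = bucket_roles(date(2012, 5, 1))
print(roles)
print(nightly_truncate_statement(date(2012, 5, 1)))
```

One caveat about taking the formula at face value: the day-of-year counter restarts on January 1, so around the end of a 365-day year two adjacent days can map to the same bucket (December 30 and January 1 both give 1). The slides do not say how rtsd handles that case, so real code would need its own guard there.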
    • 25. rtsd indexes
    • 26. rtsd virtual indexes
    • 27. rtsd virtual indexes
    • 28. Future Work
        • autonomous nodes (no master/slave)
            • many-core blades with SSD storage
        • better performance metrics
            • we drop a lot of data on the floor
        • log mining and analysis
        • sphinx for “table of contents” (browsing)
        • haproxy in front of sphinx
        • generic sharding code
        • testing framework
    • 29. Sphinx Wishlist
        • 32 -> 64 bit migration tool
        • capture stats at daemon shut down
        • RT optimizations for DELETE (high churn)
        • distributed search (agent) config with multiple servers per index (for failover and load):
    • 31. Craigslist is Hiring!
        • Developers
            • Back-end
            • Front-end
        • Systems Administrators
        • Network Engineers
        • Email: z@craiglist.org plain text resume!
