Sphinx at Craigslist. Jeremy Zawodny, craigslist

Sphinx at Craigslist in 2012

These are the slides from my talk at the 2012 Sphinx Search Day in Santa Clara, California. It provides a high-level picture of where Sphinx is used at craigslist, a bit of history, issues, and future work.

  • Sphinx at Craigslist in 2012

    1. Sphinx at Craigslist
       Jeremy Zawodny, craigslist
    2. Brief Overview
    3. CL Sphinx Infrastructure
       • Live Sphinx
         • ~30 million postings
         • end users searching for stuff on craigslist
       • Team Sphinx
         • ~100 million postings
         • additional indexes of postings for internal use (including non-live postings)
    4. CL Sphinx Infrastructure
       • Archive Sphinx
         • older postings (~3 billion)
         • constantly growing in size
       • Real-Time Sphinx
         • last ~2 days worth of postings
       • Forums Sphinx
         • ~150 million forum postings
    5. How We Got Here
    6. Back in 2008
       • MySQL FULL TEXT (MyISAM)
       • 25 Servers
       • Melted Down Frequently
       • Desperately Needed a Solution
       • This was my first project at craigslist...
       • Looked at Solr, Sphinx, Xapian
       • Sphinx felt like the right fit
    7. Making Sphinx Work
       • Benchmarking showed promising results
         • Query performance was great
           • ~800qps/instance
           • back then we only needed 1,200/sec
         • Indexing performance too
           • Can index documents far faster than I can make the XML for input (from Perl)
       • Can’t index and serve at the same time, though...
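The benchmark figures on slide 7 imply a simple capacity calculation, sketched here in Python. The function name is hypothetical, and the result is only a floor: it ignores headroom, failover, and the fact that a node could not index and serve at the same time.

```python
import math


def instances_needed(required_qps: float, qps_per_instance: float) -> int:
    """Smallest number of search instances that can serve required_qps."""
    if required_qps < 0 or qps_per_instance <= 0:
        raise ValueError("required_qps must be >= 0 and qps_per_instance > 0")
    return math.ceil(required_qps / qps_per_instance)


# Figures from the slide: ~800 qps per instance, 1,200 queries/sec needed.
print(instances_needed(1200, 800))  # 2
```

At the 2008 load, two instances would have covered the raw query rate, which is consistent with the "tons of headroom" reported later for the 10-server deployment.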
    8. “Live” Sphinx
       • One index per city (~700 indexes)
         • Main + Delta
         • xmlpipe2 input
       • Data all fits on a single machine
       • 32bit ids
       • High churn rate
       • Settled on Master/Slave model w/rsync replication
       • Deployed in January, 2009
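For readers unfamiliar with xmlpipe2, the following is a minimal sketch of producing an xmlpipe2 document stream for the indexer. The talk says the real input was generated from Perl; Python is used here only for illustration, and the field names (`title`, `body`) and the shape of the posting dicts are assumptions, not craigslist's schema.

```python
from xml.sax.saxutils import escape, quoteattr


def xmlpipe2_docset(postings, fields=("title", "body")):
    """Build an xmlpipe2 stream (as a string) from an iterable of posting dicts.

    Each posting needs an integer "id"; missing text fields become empty.
    """
    out = ['<?xml version="1.0" encoding="utf-8"?>', "<sphinx:docset>"]
    # Declare the full-text fields once, up front.
    out.append("<sphinx:schema>")
    for name in fields:
        out.append(f"<sphinx:field name={quoteattr(name)}/>")
    out.append("</sphinx:schema>")
    # One <sphinx:document> per posting, with XML-escaped field text.
    for p in postings:
        out.append(f'<sphinx:document id="{int(p["id"])}">')
        for name in fields:
            out.append(f"<{name}>{escape(str(p.get(name, '')))}</{name}>")
        out.append("</sphinx:document>")
    out.append("</sphinx:docset>")
    return "\n".join(out)


print(xmlpipe2_docset([{"id": 1, "title": "Bikes & parts", "body": "like new"}]))
```

A script like this would typically be named in the source's `xmlpipe_command` setting so that the indexer reads the stream from its standard output.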
    9. Master/Slave Clusters
       • Number of slaves varies (typically 3-7)
       [Diagram: clusters of master and slave nodes]
    10. Main+Delta Indexes
        [Diagram labels: delta; today; index; Regular Merge from transient delta; Periodic Merge to clean house; Logical Index]
    11. Early Issues
        • Monitoring
        • Persistent Connections w/prefork
          • hacked up my own initially
        • Index merge crashes/bugs
        • We’re always running svn snapshots
    12. Early Success
        • Replaced the 25 MySQL servers
        • Used 10 sphinx servers (2 masters, 8 slaves)
        • Search traffic continued to increase
        • Tons of headroom!
        • Typical search is under 5ms
        • New Features
          • “nearby” search
          • sort by: recent, price, best match
    13. Early Mistakes
        • Stopwords
        • Not setting query limits
          • Sphinx handled this just fine!
        • ASCII-only
        • Query mangling
          • need to understand how users search and what they expect to find
        • UpdateAttributes (no kill lists!)
    14. What Then?
    15. Growth
        • Wanted Sphinx for “internal” use
        • Created internal “team sphinx” with more indexed data
          • includes non-visible postings
          • includes additional fields
        • Space became an issue, so had to build some simple sharding into our code
          • 2 clusters: even/odd split for indexes
    16. Live Sphinx Today
        • 300+ million queries/day
        • 5,000 queries/sec peak load
        • removed stopwords
        • threaded workers
        • dict=keywords
        • wildcard search enabled
        • UTF-8 (mostly) and charset_table
        • blend_chars
        • kill lists (no searchd on masters)
        • sharded (3 masters, 18 slaves) on blades
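The two traffic figures on slide 16 can be related with a little arithmetic, sketched below. The slide says "300+ million", so the average computed here is a lower bound and the peak-to-average ratio an upper bound; the function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_qps(queries_per_day: float) -> float:
    """Mean queries per second implied by a daily query total."""
    return queries_per_day / SECONDS_PER_DAY


avg = average_qps(300_000_000)   # roughly 3,472 queries/sec on average
peak_to_average = 5_000 / avg    # peak load is about 1.44x the daily average
print(round(avg), round(peak_to_average, 2))
```

In other words, the stated peak sits less than 1.5x above the daily mean at most, which suggests fairly flat load across the day rather than sharp spikes.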
    17. Sharding
    18. Query Volume
    19. Archive Sphinx
        • The Archive Project!
        • 2.5 billion postings
        • Growing by ~1.6 million daily
        • String attributes
        • 4 shards, each is a 1 master, 2 slave cluster
        • Bucket based on UserID (not city)
        • Low query volume
        • Need a way to reindex all docs
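Slide 19 says only that archive postings are bucketed by UserID across 4 shards. One common way to do that is a plain modulo, sketched here as an assumption; the talk does not state the actual mapping, and the function name is hypothetical.

```python
NUM_ARCHIVE_SHARDS = 4  # from the slide: 4 shards


def archive_shard_for_user(user_id: int, num_shards: int = NUM_ARCHIVE_SHARDS) -> int:
    """Map a user ID to a shard number in [0, num_shards).

    Assumption: simple modulo bucketing, not confirmed by the talk.
    """
    if user_id < 0:
        raise ValueError("user_id must be non-negative")
    return user_id % num_shards


# Every posting by the same user lands on the same shard.
print(archive_shard_for_user(1234567))
```

Whatever the real mapping is, bucketing by user rather than by city means a search over one user's posting history only has to touch a single shard.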
    20. Real-Time Sphinx
        • There’s a delay in indexing data on the master and replicating to the slaves...
        • What if we want to offer “real-time search” of your own postings?
    21. So I built something...
        • Known as rtsd (real-time search daemon)
        • Sphinx instance with MySQL Protocol
        • Primarily uses in-memory indexes
        • Used to bridge the gap between “now” and “archive sphinx”
        • Configured as an N day rolling window
        • Runs on archive sphinx master hosts
    22. Sphinx Time Horizons
        Columns: Classic, Team, Archive, rtsd
        • 0-20min: All
        • 20m-1day: Visible, All, All
        • 1-60 days: Visible, All, All
        • 60+ days: All
        Note: Visible postings are findable on the site.
    23. rtsd overview
        [Diagram components: PostingInfo table; rtsd_consumer; redis queue; rtsd_indexer; PostingCache; rtsd_sphinx; webbie (x3)]
    24. Daily Posting Buckets
        • 3 indexes
          • yesterday
          • today
          • tomorrow
        • (DayofYear(PostedDate)%3) = $index_num
        • Nightly cron to “TRUNCATE RTINDEX” on the “tomorrow” index
          • sponsored feature!
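The bucket rule on slide 24 can be sketched in Python as follows. The function name is hypothetical; `tm_yday` is used because it counts days from 1, as MySQL's `DAYOFYEAR()` does.

```python
from datetime import date, timedelta


def bucket_for(posted: date, num_indexes: int = 3) -> int:
    """Index number for a posting date: day-of-year modulo the index count."""
    return posted.timetuple().tm_yday % num_indexes


today = date(2012, 6, 15)
for label, day in (("yesterday", today - timedelta(days=1)),
                   ("today", today),
                   ("tomorrow", today + timedelta(days=1))):
    print(label, bucket_for(day))
```

Within a year, any three consecutive days map to three different indexes, so the nightly truncate can clear the index that is about to become "tomorrow" without touching live data. Taken literally, the formula does not cycle cleanly after a non-leap year (day 365 maps to index 2, and so does day 2 of the next year); the slides do not say how that boundary is handled.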
    25. rtsd indexes
    26. rtsd virtual indexes
    27. rtsd virtual indexes
    28. Future Work
        • autonomous nodes (no master/slave)
          • many-core blades with SSD storage
        • better performance metrics
          • we drop a lot of data on the floor
        • log mining and analysis
        • sphinx for “table of contents” (browsing)
        • haproxy in front of sphinx
        • generic sharding code
        • testing framework
    29. Sphinx Wishlist
        • 32 -> 64 bit migration tool
        • capture stats at daemon shut down
        • RT optimizations for DELETE (high churn)
        • distributed search (agent) config with multiple servers per index (for failover and load):
    30. Sphinx Wishlist
        • 32 -> 64 bit migration tool
        • capture stats at daemon shut down
        • RT optimizations for DELETE (high churn)
        • distributed search (agent) config with multiple servers per index (for failover and load):
    31. Craigslist is Hiring!
        • Developers
          • Back-end
          • Front-end
        • Systems Administrators
        • Network Engineers
        • Email: z@craiglist.org plain text resume!