Sphinx at Craigslist in 2012

These are the slides from my talk at the 2012 Sphinx Search Day in Santa Clara, California. They provide a high-level picture of where Sphinx is used at craigslist, a bit of history, issues, and future work.

Sphinx at Craigslist in 2012: Presentation Transcript

  • Sphinx at Craigslist (Jeremy Zawodny, craigslist)
  • Brief Overview
  • CL Sphinx Infrastructure
      • Live Sphinx
          • ~30 million postings
          • end users searching for stuff on craigslist
      • Team Sphinx
          • ~100 million postings
          • additional indexes of postings for internal use (including non-live postings)
  • CL Sphinx Infrastructure
      • Archive Sphinx
          • older postings (~3 billion)
          • constantly growing in size
      • Real-Time Sphinx
          • last ~2 days worth of postings
      • Forums Sphinx
          • ~150 million forum postings
  • How We Got Here
  • Back in 2008
      • MySQL FULL TEXT (MyISAM)
      • 25 Servers
      • Melted Down Frequently
      • Desperately Needed a Solution
      • This was my first project at craigslist...
      • Looked at Solr, Sphinx, Xapian
      • Sphinx felt like the right fit
  • Making Sphinx Work
      • Benchmarking showed promising results
          • Query performance was great
              • ~800qps/instance
              • back then we only needed 1,200/sec
          • Indexing performance too
              • Can index documents far faster than I can make the XML for input (from Perl)
      • Can’t index and serve at the same time, though...
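As a back-of-the-envelope check on the benchmark figures above (roughly 800 queries per second per instance against a requirement of 1,200 per second), the minimum instance count can be worked out as follows. This is only a sketch: the function name is hypothetical, and it ignores headroom, failover, and the indexing constraint mentioned on the slide.

```python
import math


def min_instances(required_qps: float, qps_per_instance: float) -> int:
    """Smallest number of instances whose combined throughput covers the load."""
    if qps_per_instance <= 0:
        raise ValueError("qps_per_instance must be positive")
    return math.ceil(required_qps / qps_per_instance)


# Figures from the slide: ~800 qps per instance, 1,200 qps needed.
print(min_instances(1200, 800))  # 2
```

In practice a deployment would likely want more than this floor so that losing one instance does not drop capacity below the required rate.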
  • “Live” Sphinx
      • One index per city (~700 indexes)
          • Main + Delta
          • xmlpipe2 input
      • Data all fits on a single machine
      • 32bit ids
      • High churn rate
      • Settled on Master/Slave model w/rsync replication
      • Deployed in January, 2009
  • Master/Slave Clusters
      • Number of slaves varies (typically 3-7)
      • [Diagram: clusters of master and slave nodes]
  • Main+Delta Indexes
      • [Diagram labels: delta; today; index; Logical Index; Regular Merge from transient delta; Periodic Merge to clean house]
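The main+delta idea, together with the kill lists mentioned on later slides, can be illustrated with a toy in-memory model. This is not Sphinx's API or the author's code: the dictionaries, the `search` function, and the sample postings are all hypothetical, and matching is reduced to whole-word lookup. The point is only that documents listed in the delta's kill list are suppressed from the main index's results, so edited or deleted postings do not resurface.

```python
def search(term, main, delta, kill_list):
    """Return sorted ids of matching documents across main and delta.

    Ids in kill_list are hidden from the main index's results, so a
    changed or deleted document is only ever served from the delta.
    """
    hits = [
        doc_id
        for doc_id, text in main.items()
        if term in text.split() and doc_id not in kill_list
    ]
    hits += [doc_id for doc_id, text in delta.items() if term in text.split()]
    return sorted(hits)


# Hypothetical sample data.
main = {1: "red bike", 2: "blue couch", 3: "red couch"}
delta = {2: "green couch", 4: "red lamp"}  # doc 2 was edited, doc 4 is new
kill_list = {2, 3}  # doc 2 superseded by the delta, doc 3 deleted

print(search("red", main, delta, kill_list))    # [1, 4]
print(search("blue", main, delta, kill_list))   # []
print(search("couch", main, delta, kill_list))  # [2]
```

Without the kill list, the query for "blue" would still return the stale copy of document 2 from the main index.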
  • Early Issues
      • Monitoring
      • Persistent Connections w/prefork
          • hacked up my own initially
      • Index merge crashes/bugs
      • We’re always running svn snapshots
  • Early Success
      • Replaced the 25 MySQL servers
      • Used 10 sphinx servers (2 masters, 8 slaves)
      • Search traffic continued to increase
      • Tons of headroom!
      • Typical search is under 5ms
      • New Features
          • “nearby” search
          • sort by: recent, price, best match
  • Early Mistakes
      • Stopwords
      • Not setting query limits
          • Sphinx handled this just fine!
      • ASCII-only
      • Query mangling
          • need to understand how users search and what they expect to find
      • UpdateAttributes (no kill lists!)
  • What Then?
  • Growth
      • Wanted Sphinx for “internal” use
      • Created internal “team sphinx” with more indexed data
          • includes not visible postings
          • includes additional fields
      • Space became an issue, so had to build some simple sharding into our code
          • 2 clusters: even/odd split for indexes
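The even/odd split described above could be as simple as the sketch below. It assumes each index has a numeric id, which the slides do not state; the cluster names and the `cluster_for_index` function are hypothetical.

```python
# Hypothetical cluster names; the slide only says "2 clusters: even/odd split".
CLUSTERS = ("cluster_even", "cluster_odd")


def cluster_for_index(index_id: int) -> str:
    """Route an index to one of two clusters by the parity of its id."""
    return CLUSTERS[index_id % 2]


print(cluster_for_index(42))  # cluster_even
print(cluster_for_index(7))   # cluster_odd
```

A parity split needs no lookup table, but it only works for two clusters; a later slide lists generic sharding code as future work.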
  • Live Sphinx Today
      • 300+ million queries/day
      • 5,000 queries/sec peak load
      • removed stopwords
      • threaded workers
      • dict=keywords
      • wildcard search enabled
      • UTF-8 (mostly) and charset_table
      • blend_chars
      • kill lists (no searchd on masters)
      • sharded (3 masters, 18 slaves) on blades
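To put the two traffic figures above side by side, the daily total can be converted to an average rate. This is plain arithmetic on the slide's numbers; since the slide says "300+ million", the average below is a lower bound and the peak-to-average ratio an upper bound.

```python
SECONDS_PER_DAY = 24 * 60 * 60

queries_per_day = 300_000_000  # "300+ million queries/day" from the slide
peak_qps = 5_000               # "5,000 queries/sec peak load" from the slide

average_qps = queries_per_day / SECONDS_PER_DAY
print(round(average_qps))                # 3472
print(round(peak_qps / average_qps, 2))  # 1.44
```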
  • Sharding
  • Query Volume
  • Archive Sphinx
      • The Archive Project!
      • 2.5 billion postings
      • Growing by ~1.6 million daily
      • String attributes
      • 4 shards, each is a 1 master, 2 slave cluster
      • Bucket based on UserID (not city)
      • Low query volume
      • Need a way to reindex all docs
  • Real-Time Sphinx
      • There’s a delay in indexing data on the master and replicating to the slaves...
      • What if we want to offer “real-time search” of your own postings?
  • So I built something...
      • Known as rtsd (real-time search daemon)
      • Sphinx instance with MySQL Protocol
      • Primarily uses in-memory indexes
      • Used to bridge the gap between “now” and “archive sphinx”
      • Configured as an N day rolling window
      • Runs on archive sphinx master hosts
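Bucketing the archive by UserID across 4 shards might look like the sketch below. The slides do not say which function maps a user to a shard, so the modulo here is an assumption, as are the function name and the integer user ids.

```python
NUM_SHARDS = 4  # "4 shards" from the slide


def shard_for_user(user_id: int) -> int:
    """Pick an archive shard from the user id (modulo is an assumed scheme)."""
    return user_id % NUM_SHARDS


print(shard_for_user(1001))  # 1
print(shard_for_user(1004))  # 0
```

Under a scheme like this, all of one user's postings land on the same shard, so a search over a single user's history would only need to query one cluster.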
  • Sphinx Time Horizons
      • Columns: Classic, Team, Archive, rtsd
      • 0-20min: All
      • 20m-1day: Visible, All, All
      • 1-60 days: Visible, All, All
      • 60+ days: All
      • Note: Visible postings are findable on the site.
  • rtsd overview
      • [Diagram components: PostingInfo table, rtsd_consumer, redis queue, rtsd_indexer, PostingCache, rtsd_sphinx, webbie (x3)]
  • Daily Posting Buckets
      • 3 indexes
          • yesterday
          • today
          • tomorrow
      • (DayofYear(PostedDate)%3) = $index_num
      • Nightly cron to “TRUNCATE RTINDEX” on the “tomorrow” index
          • sponsored feature!
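The bucket formula above, `DayofYear(PostedDate) % 3`, can be sketched as follows. The function names are hypothetical; only the formula and the three roles (yesterday, today, tomorrow) come from the slide.

```python
from datetime import date, timedelta

NUM_INDEXES = 3  # "3 indexes" from the slide


def index_num(posted: date) -> int:
    """Index number for a posting date: day of year (1..366) modulo 3."""
    return posted.timetuple().tm_yday % NUM_INDEXES


def roles(today: date) -> dict:
    """Which index number plays each rolling role on a given day."""
    one_day = timedelta(days=1)
    return {
        "yesterday": index_num(today - one_day),
        "today": index_num(today),
        "tomorrow": index_num(today + one_day),
    }


# 15 March 2012 is day 75 of the year, and 75 % 3 == 0.
print(roles(date(2012, 3, 15)))  # {'yesterday': 2, 'today': 0, 'tomorrow': 1}
```

The index returned for "tomorrow" is the one the nightly cron job would truncate, so it is empty when it becomes "today". Because day-of-year restarts at 1 on January 1, a plain modulo can map two of the three rolling days to the same index around the year boundary; the slides do not say how that case is handled.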
  • rtsd indexes
  • rtsd virtual indexes
  • Future Work
      • autonomous nodes (no master/slave)
          • many-core blades with SSD storage
      • better performance metrics
          • we drop a lot of data on the floor
      • log mining and analysis
      • sphinx for “table of contents” (browsing)
      • haproxy in front of sphinx
      • generic sharding code
      • testing framework
  • Sphinx Wishlist
      • 32 -> 64 bit migration tool
      • capture stats at daemon shut down
      • RT optimizations for DELETE (high churn)
      • distributed search (agent) config with multiple servers per index (for failover and load):
  • Craigslist is Hiring!
      • Developers
          • Back-end
          • Front-end
      • Systems Administrators
      • Network Engineers
      • Email: z@craiglist.org (plain text resume!)