Scaling / optimizing search on netlog


Published on

Presentation I gave on #barcamp2, 29-11-2008 @ Ghent

Published in: Technology
No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Scaling / optimizing search on netlog

  1. Scaling search & content filtering
  2. Search optimization Netlog => social network • meet / connect to new people => search essential • localized content => content filtering essential Types of searches
  3. Content filtering
  4. Search filtering
  5. Daily search statistics on Netlog
  6. How to handle this Problem 1: Large number of requests + each request is kind of unique Problem 2: Content to search on is spread • different distributions (nl, en, fr, .. ) • each with their own databasehosts/ isolations : videos, photos, ... • different shards as explained previously
  7. Solution #1 Add fulltext indexes to tables aggregate different data later on f.e. VIDEOS Full text index on title, tags, description, combine results at the end Problems • Large indexes • Not all indexes are effective • Locking of table => searches are having an impact on other things on the site • May work good for a small site but otherwise => BAD
  8. Solution #2 Create seperate tables with fulltext indexes especially for searching queries f.e. VIDEOS • Table SEARCH_VIDEOS (videoid (int), searchvideo(text)) Combine title, tags, description, .. in 1 mysql text field: “searchvideo”. Add a full text index on it. Combine results at the end. Problems • Duplication of data may cause inconsistencies • Not easy to rebuild (takes a very long time) • Peak moments: updates of changes + a lot of searches => table locks. (MyISAM)
  9. Solution #3 ...almost there :) Looking for non MySQL based alternatives • Google • no control over results or whats being indexed/ when its being indexed. • Yahoo BOSS • promising, great step on making search more open. Is rather new, so may suffer from bugs. • still rely on a third party for delivering your results, f.e. footnote on site: * BOSS offers developers unlimited daily queries, though Yahoo! reserves the right to limit unintended usage • Lucene • Java based, from the creators of Apache • Servers are not optimized for running java/ tomcat + more custom coding is needed to make php <-> java bridge • Sphinx • C++ based, more inhouse expertise • fast results in test setup
  10. Solution #3 ...sphinx! How sphinx works: • Full text search engine • two essential cli- tools: • indexer • creating indexes • searchd • daemon that serves indexes & handles search requests, delivers results in form of documentids & attributes • uses custom protocol for retreiving results => need a sphinx API in PHP, java,.. to talk to this daemon: (use search for debugging) • Some sphinx terminology • sphinx.conf the basic config file, with two essential parts: sources & indexes • documentid: id that uniquely identifies a document in the sphinx search index (must be unique!) • attribute: each documentid can have additional attributtes, these can
  11. Indexing (1) • Indexing • We need to index a data source (SQL database, text files, html files.. ) defining this in sphinx.conf can be as easy as source users { type = mysql sql_host = localhost sql_db = localdb sql_user = jayme sql_pass = ******* sql_port = 3306 sql_query = SELECT id, firstname, lastname, counter_photos FROM USERS sql_attr_uint = counter_photos } • We define counter_photos as an attribute, because we want to sort/ filter on it later on.
  12. Indexing (2) 1. Define the index in the config, which searchd will serve. An index can have more then 1 source. index users { docinfo = extern source = users path = /var/lib/sphinx/data/users } 2. When running the indexer, sphinx splits up each document (SQL record in our case) in to several words internally : a. creates a dictionary of all of these words. (WordIDs) b. keeps references to documentIDs for each WordID c. stores attributes with references to documentIDs
  13. Indexing (3) & searching • indexing ./indexer -c ../etc/sphinx.conf users or ./indexer -c ../etc/sphinx.conf users --rotate (when searchd is running) Searching using php api:
  14. Sphinx netlog setup We use a main+delta scheme main: For each search type (people, video, photo,..) we have a main index that is being rebuild every night. Takes +- 20 minutes to rebuild the largest table that we have. delta: Changes to videos, photos, .. are tracked in a table f.e. SPHINX_PHOTO_UPDATE, with 1 column, the ID of the photo. Halfhourly : sphinx regenerates a delta index based on this index. This table is truncated once day. When searching we use 2 indexes: $cl->Query(‘test’, ‘users users_delta’) Sphinx will use the last index first when searching, so if needed newer content will be found / returned
  15. Future developments on Netlog with sphinx Indexing of shards (messages / friendships) • Running an indexer on each shard • Creating a main index for x shards (merge these shards in to 1) • Running distributed searches on these indexes Generation of tag clouds ./indexer -c ..etc/sphinx.conf users --buildstops test.txt 100 --buildstops => sphinx has an option to generate the most used words in an index which can be relevant for tags
  16. Some sphinx tips & tweaks • Use range queries when indexing data try always to have a an autoincrement field on MySQL tables when indexing. Sphinx has a mechanism which does indexes ranges of data, thus avoiding table locks. (where id > 1000 AND id < 2000 etc.. ) • Narrowest search first (f.e. when searching for users in Belgium that are basketball @hobbies basket @country BE) • Avoid searhes on small words with OR (f.e. the|new|...) • Define a charset table when indexing UTF-8, foreign languages • Check if there are no trailing spaces after in your sphinx.conf when using multi -lined queries, can cause weird errors else. • Cache results! • More info/ advanced usage on:
  17. Questions? -