Scaling / optimizing search on netlog

Loading...

Flash Player 9 (or above) is needed to view presentations.
We have detected that you do not have it on your computer. To install it, go here.

1 comments

Comments 1 - 1 of 1 previous next Post a comment

Post a comment
Embed Video
Edit your comment Cancel

5 Favorites & 1 Event

Scaling / optimizing search on netlog - Presentation Transcript

  1. Scaling search & content filtering
  2. Search optimization Netlog => social network • meet / connect to new people => search essential • localized content => content filtering essential Types of searches
  3. Content filtering
  4. Search filtering
  5. Daily search statistics on Netlog
  6. How to handle this Problem 1: Large number of requests + each request is kind of unique Problem 2: Content to search on is spread • different distributions (nl, en, fr, .. ) • each with their own databasehosts/ isolations : videos, photos, ... • different shards as explained previously
  7. Solution #1 Add fulltext indexes to tables aggregate different data later on f.e. VIDEOS Full text index on title, tags, description, combine results at the end Problems • Large indexes • Not all indexes are effective • Locking of table => searches are having an impact on other things on the site • May work good for a small site but otherwise => BAD
  8. Solution #2 Create seperate tables with fulltext indexes especially for searching queries f.e. VIDEOS • Table SEARCH_VIDEOS (videoid (int), searchvideo(text)) Combine title, tags, description, .. in 1 mysql text field: “searchvideo”. Add a full text index on it. Combine results at the end. Problems • Duplication of data may cause inconsistencies • Not easy to rebuild (takes a very long time) • Peak moments: updates of changes + a lot of searches => table locks. (MyISAM)
  9. Solution #3 ...almost there :) Looking for non MySQL based alternatives • Google • no control over results or whats being indexed/ when its being indexed. • Yahoo BOSS • promising, great step on making search more open. Is rather new, so may suffer from bugs. • still rely on a third party for delivering your results, f.e. footnote on site: * BOSS offers developers unlimited daily queries, though Yahoo! reserves the right to limit unintended usage • Lucene • Java based, from the creators of Apache • Servers are not optimized for running java/ tomcat + more custom coding is needed to make php <-> java bridge • Sphinx • C++ based, more inhouse expertise • fast results in test setup
  10. Solution #3 ...sphinx! How sphinx works: • Full text search engine • two essential cli- tools: • indexer • creating indexes • searchd • daemon that serves indexes & handles search requests, delivers results in form of documentids & attributes • uses custom protocol for retreiving results => need a sphinx API in PHP, java,.. to talk to this daemon: (use search for debugging) • Some sphinx terminology • sphinx.conf the basic config file, with two essential parts: sources & indexes • documentid: id that uniquely identifies a document in the sphinx search index (must be unique!) • attribute: each documentid can have additional attributtes, these can
  11. Indexing (1) • Indexing • We need to index a data source (SQL database, text files, html files.. ) defining this in sphinx.conf can be as easy as source users { type = mysql sql_host = localhost sql_db = localdb sql_user = jayme sql_pass = ******* sql_port = 3306 sql_query = SELECT id, firstname, lastname, counter_photos FROM USERS sql_attr_uint = counter_photos } • We define counter_photos as an attribute, because we want to sort/ filter on it later on.
  12. Indexing (2) 1. Define the index in the config, which searchd will serve. An index can have more then 1 source. index users { docinfo = extern source = users path = /var/lib/sphinx/data/users } 2. When running the indexer, sphinx splits up each document (SQL record in our case) in to several words internally : a. creates a dictionary of all of these words. (WordIDs) b. keeps references to documentIDs for each WordID c. stores attributes with references to documentIDs
  13. Indexing (3) & searching • indexing ./indexer -c ../etc/sphinx.conf users or ./indexer -c ../etc/sphinx.conf users --rotate (when searchd is running) Searching using php api:
  14. Sphinx netlog setup We use a main+delta scheme main: For each search type (people, video, photo,..) we have a main index that is being rebuild every night. Takes +- 20 minutes to rebuild the largest table that we have. delta: Changes to videos, photos, .. are tracked in a table f.e. SPHINX_PHOTO_UPDATE, with 1 column, the ID of the photo. Halfhourly : sphinx regenerates a delta index based on this index. This table is truncated once day. When searching we use 2 indexes: $cl->Query(‘test’, ‘users users_delta’) Sphinx will use the last index first when searching, so if needed newer content will be found / returned
  15. Future developments on Netlog with sphinx Indexing of shards (messages / friendships) • Running an indexer on each shard • Creating a main index for x shards (merge these shards in to 1) • Running distributed searches on these indexes Generation of tag clouds ./indexer -c ..etc/sphinx.conf users --buildstops test.txt 100 --buildstops => sphinx has an option to generate the most used words in an index which can be relevant for tags
  16. Some sphinx tips & tweaks • Use range queries when indexing data try always to have a an autoincrement field on MySQL tables when indexing. Sphinx has a mechanism which does indexes ranges of data, thus avoiding table locks. (where id > 1000 AND id < 2000 etc.. ) • Narrowest search first (f.e. when searching for users in Belgium that are basketball @hobbies basket @country BE) • Avoid searhes on small words with OR (f.e. the|new|...) • Define a charset table when indexing UTF-8, foreign languages • Check if there are no trailing spaces after \\ in your sphinx.conf when using multi -lined queries, can cause weird errors else. • Cache results! • More info/ advanced usage on: sphinxsearch.com
  17. Questions? netlog.com/go/developer jayme@netlog.com - jurriaan@netlog.com

+ Jayme RotsaertJayme Rotsaert, 7 months ago

custom

2216 views, 5 favs, 1 embeds more stats

Presentation I gave on #barcamp2, 29-11-2008 @ Ghen more

More Info

© All Rights Reserved

Go to text version
  • Total Views 2216
    • 2215 on SlideShare
    • 1 from embeds
  • Comments 1
  • Favorites 5
  • Downloads 67
Most viewed embeds
  • 1 views on http://dinhcuongvu.wordpress.com

more

All embeds
  • 1 views on http://dinhcuongvu.wordpress.com

less

Flagged as inappropriate Flag as inappropriate
Flag as innappropriate

Select your reason for flagging this presentation as inappropriate. If needed, use the feedback form to let us know more details.

Cancel

Categories

Groups / Events