Scaling / optimizing search on netlog
Upcoming SlideShare
Loading in...5
×

Like this? Share it with your network

Share

Scaling / optimizing search on netlog

  • 12,532 views
Uploaded on

Presentation I gave on #barcamp2, 29-11-2008 @ Ghent

Presentation I gave on #barcamp2, 29-11-2008 @ Ghent

More in: Technology
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
No Downloads

Views

Total Views
12,532
On Slideshare
12,492
From Embeds
40
Number of Embeds
5

Actions

Shares
Downloads
176
Comments
2
Likes
16

Embeds 40

http://www.slideshare.net 24
http://dinh.wordpress.com 13
http://dinhcuongvu.wordpress.com 1
http://static.slidesharecdn.com 1
http://www.linkedin.com 1

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide

Transcript

  • 1. Scaling search & content filtering
  • 2. Search optimization Netlog => social network • meet / connect to new people => search essential • localized content => content filtering essential Types of searches
  • 3. Content filtering
  • 4. Search filtering
  • 5. Daily search statistics on Netlog
  • 6. How to handle this Problem 1: Large number of requests + each request is kind of unique Problem 2: Content to search on is spread • different distributions (nl, en, fr, .. ) • each with their own databasehosts/ isolations : videos, photos, ... • different shards as explained previously
  • 7. Solution #1 Add fulltext indexes to tables aggregate different data later on f.e. VIDEOS Full text index on title, tags, description, combine results at the end Problems • Large indexes • Not all indexes are effective • Locking of table => searches are having an impact on other things on the site • May work good for a small site but otherwise => BAD
  • 8. Solution #2 Create seperate tables with fulltext indexes especially for searching queries f.e. VIDEOS • Table SEARCH_VIDEOS (videoid (int), searchvideo(text)) Combine title, tags, description, .. in 1 mysql text field: “searchvideo”. Add a full text index on it. Combine results at the end. Problems • Duplication of data may cause inconsistencies • Not easy to rebuild (takes a very long time) • Peak moments: updates of changes + a lot of searches => table locks. (MyISAM)
  • 9. Solution #3 ...almost there :) Looking for non MySQL based alternatives • Google • no control over results or whats being indexed/ when its being indexed. • Yahoo BOSS • promising, great step on making search more open. Is rather new, so may suffer from bugs. • still rely on a third party for delivering your results, f.e. footnote on site: * BOSS offers developers unlimited daily queries, though Yahoo! reserves the right to limit unintended usage • Lucene • Java based, from the creators of Apache • Servers are not optimized for running java/ tomcat + more custom coding is needed to make php <-> java bridge • Sphinx • C++ based, more inhouse expertise • fast results in test setup
  • 10. Solution #3 ...sphinx! How sphinx works: • Full text search engine • two essential cli- tools: • indexer • creating indexes • searchd • daemon that serves indexes & handles search requests, delivers results in form of documentids & attributes • uses custom protocol for retreiving results => need a sphinx API in PHP, java,.. to talk to this daemon: (use search for debugging) • Some sphinx terminology • sphinx.conf the basic config file, with two essential parts: sources & indexes • documentid: id that uniquely identifies a document in the sphinx search index (must be unique!) • attribute: each documentid can have additional attributtes, these can
  • 11. Indexing (1) • Indexing • We need to index a data source (SQL database, text files, html files.. ) defining this in sphinx.conf can be as easy as source users { type = mysql sql_host = localhost sql_db = localdb sql_user = jayme sql_pass = ******* sql_port = 3306 sql_query = SELECT id, firstname, lastname, counter_photos FROM USERS sql_attr_uint = counter_photos } • We define counter_photos as an attribute, because we want to sort/ filter on it later on.
  • 12. Indexing (2) 1. Define the index in the config, which searchd will serve. An index can have more then 1 source. index users { docinfo = extern source = users path = /var/lib/sphinx/data/users } 2. When running the indexer, sphinx splits up each document (SQL record in our case) in to several words internally : a. creates a dictionary of all of these words. (WordIDs) b. keeps references to documentIDs for each WordID c. stores attributes with references to documentIDs
  • 13. Indexing (3) & searching • indexing ./indexer -c ../etc/sphinx.conf users or ./indexer -c ../etc/sphinx.conf users --rotate (when searchd is running) Searching using php api:
  • 14. Sphinx netlog setup We use a main+delta scheme main: For each search type (people, video, photo,..) we have a main index that is being rebuild every night. Takes +- 20 minutes to rebuild the largest table that we have. delta: Changes to videos, photos, .. are tracked in a table f.e. SPHINX_PHOTO_UPDATE, with 1 column, the ID of the photo. Halfhourly : sphinx regenerates a delta index based on this index. This table is truncated once day. When searching we use 2 indexes: $cl->Query(‘test’, ‘users users_delta’) Sphinx will use the last index first when searching, so if needed newer content will be found / returned
  • 15. Future developments on Netlog with sphinx Indexing of shards (messages / friendships) • Running an indexer on each shard • Creating a main index for x shards (merge these shards in to 1) • Running distributed searches on these indexes Generation of tag clouds ./indexer -c ..etc/sphinx.conf users --buildstops test.txt 100 --buildstops => sphinx has an option to generate the most used words in an index which can be relevant for tags
  • 16. Some sphinx tips & tweaks • Use range queries when indexing data try always to have a an autoincrement field on MySQL tables when indexing. Sphinx has a mechanism which does indexes ranges of data, thus avoiding table locks. (where id > 1000 AND id < 2000 etc.. ) • Narrowest search first (f.e. when searching for users in Belgium that are basketball @hobbies basket @country BE) • Avoid searhes on small words with OR (f.e. the|new|...) • Define a charset table when indexing UTF-8, foreign languages • Check if there are no trailing spaces after in your sphinx.conf when using multi -lined queries, can cause weird errors else. • Cache results! • More info/ advanced usage on: sphinxsearch.com
  • 17. Questions? netlog.com/go/developer jayme@netlog.com - jurriaan@netlog.com