Scaling / optimizing search on netlog

Presentation I gave on #barcamp2, 29-11-2008 @ Ghent

Published in: Technology

  1. Scaling search & content filtering
  2. Search optimization
     Netlog => social network
     • meet / connect to new people => search essential
     • localized content => content filtering essential
     Types of searches
  3. Content filtering
  4. Search filtering
  5. Daily search statistics on Netlog
  6. How to handle this
     Problem 1: Large number of requests, and nearly every request is unique
     Problem 2: The content to search is spread out
     • different distributions (nl, en, fr, ..)
     • each with its own database hosts / isolations: videos, photos, ...
     • different shards, as explained previously
  7. Solution #1
     Add fulltext indexes to the existing tables and aggregate the different data later on.
     e.g. VIDEOS: fulltext index on title, tags, description; combine the results at the end.
     Problems
     • Large indexes
     • Not all indexes are effective
     • Table locking => searches have an impact on other things on the site
     • May work well for a small site, but otherwise => BAD
  8. Solution #2
     Create separate tables with fulltext indexes, dedicated to search queries.
     e.g. VIDEOS
     • Table SEARCH_VIDEOS (videoid (int), searchvideo (text))
     Combine title, tags, description, .. in 1 MySQL text field: "searchvideo". Add a fulltext index on it. Combine the results at the end.
     Problems
     • Duplication of data may cause inconsistencies
     • Not easy to rebuild (takes a very long time)
     • Peak moments: updates of changes + a lot of searches => table locks (MyISAM)
  9. Solution #3 ...almost there :)
     Looking for non-MySQL-based alternatives
     • Google
        • no control over the results, or over what is indexed and when it is indexed
     • Yahoo BOSS
        • promising, a great step towards making search more open; rather new, so it may suffer from bugs
        • you still rely on a third party to deliver your results, e.g. the footnote on the site: "* BOSS offers developers unlimited daily queries, though Yahoo! reserves the right to limit unintended usage"
     • Lucene
        • Java based, an Apache Software Foundation project
        • our servers are not optimized for running Java / Tomcat, and more custom coding is needed to build a PHP <-> Java bridge
     • Sphinx
        • C++ based, more in-house expertise
        • fast results in the test setup
  10. Solution #3 ...sphinx!
      How sphinx works:
      • Full text search engine
      • two essential CLI tools:
         • indexer
            • creates the indexes
         • searchd
            • daemon that serves indexes & handles search requests; delivers results in the form of documentids & attributes
            • uses a custom protocol for retrieving results => you need a sphinx API in PHP, Java, .. to talk to this daemon (use the "search" tool for debugging)
      • Some sphinx terminology
         • sphinx.conf: the basic config file, with two essential parts: sources & indexes
         • documentid: id that uniquely identifies a document in the sphinx search index (must be unique!)
         • attribute: each documentid can have additional attributes, these can ...
  11. Indexing (1)
      • Indexing
         • We need to index a data source (SQL database, text files, HTML files, ..). Defining this in sphinx.conf can be as easy as:

             source users
             {
                 type          = mysql
                 sql_host      = localhost
                 sql_db        = localdb
                 sql_user      = jayme
                 sql_pass      = *******
                 sql_port      = 3306
                 sql_query     = SELECT id, firstname, lastname, counter_photos FROM USERS
                 sql_attr_uint = counter_photos
             }

         • We define counter_photos as an attribute, because we want to sort / filter on it later on.
  12. Indexing (2)
      1. Define the index in the config, which searchd will serve. An index can have more than 1 source.

             index users
             {
                 docinfo = extern
                 source  = users
                 path    = /var/lib/sphinx/data/users
             }

      2. When running the indexer, sphinx internally splits each document (an SQL record in our case) into words and:
         a. creates a dictionary of all of these words (WordIDs)
         b. keeps references to documentIDs for each WordID
         c. stores attributes with references to documentIDs
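The three indexing steps (a, b, c) can be illustrated with a toy inverted index. This is only a rough sketch of the idea, not how Sphinx stores its data on disk: the function names `build_index` and `search`, the whitespace tokenizer, and the sample users are all assumptions made for illustration.

```python
from collections import defaultdict


def build_index(rows, text_fields, attr_fields):
    """Build a toy inverted index from (doc_id, record) pairs."""
    word_ids = {}                # a. dictionary: word -> WordID
    postings = defaultdict(set)  # b. WordID -> set of documentIDs
    attributes = {}              # c. documentID -> attribute values
    for doc_id, record in rows:
        for field in text_fields:
            for word in record[field].lower().split():
                # Assign the next free WordID the first time a word is seen.
                word_id = word_ids.setdefault(word, len(word_ids) + 1)
                postings[word_id].add(doc_id)
        attributes[doc_id] = {name: record[name] for name in attr_fields}
    return word_ids, postings, attributes


def search(index, word):
    """Return (documentID, attributes) pairs for one word, sorted by ID."""
    word_ids, postings, attributes = index
    word_id = word_ids.get(word.lower())
    if word_id is None:
        return []
    return [(doc_id, attributes[doc_id]) for doc_id in sorted(postings[word_id])]


# Hypothetical rows shaped like the USERS query from the config.
users = [
    (1, {"firstname": "Ann", "lastname": "Peeters", "counter_photos": 12}),
    (2, {"firstname": "Bob", "lastname": "Janssens", "counter_photos": 3}),
    (3, {"firstname": "Ann", "lastname": "Maes", "counter_photos": 0}),
]
index = build_index(users, ["firstname", "lastname"], ["counter_photos"])
print(search(index, "ann"))
```

As with searchd, the lookup returns only document IDs and attributes; the application would still fetch the full records from its own database.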
  13. Indexing (3) & searching
      • indexing

             ./indexer -c ../etc/sphinx.conf users
        or
             ./indexer -c ../etc/sphinx.conf users --rotate    (when searchd is running)

      Searching using php api:
  14. Sphinx netlog setup
      We use a main+delta scheme.
      main: For each search type (people, video, photo, ..) we have a main index that is rebuilt every night. It takes about 20 minutes to rebuild the largest table that we have.
      delta: Changes to videos, photos, .. are tracked in a table, e.g. SPHINX_PHOTO_UPDATE, with 1 column: the ID of the photo. Every half hour sphinx regenerates a delta index based on this table. The table is truncated once a day.
      When searching we use 2 indexes:

             $cl->Query('test', 'users users_delta')

      When a document is present in both indexes, sphinx uses the match from the last index listed, so newer content is found / returned when needed.
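The "last index wins" behaviour of a main+delta query can be pictured with a small merge. This is a simplified sketch rather than Sphinx's actual implementation; `merge_results` and the sample rows are hypothetical.

```python
def merge_results(*index_results):
    """Merge per-index result lists, given in the order the indexes are queried.

    Each argument is a list of (doc_id, attrs) pairs from one index.
    When the same doc_id appears in several indexes, the entry from the
    index listed last replaces the earlier ones.
    """
    merged = {}
    for results in index_results:
        for doc_id, attrs in results:
            merged[doc_id] = attrs
    return merged


# Hypothetical matches: document 1 changed after the nightly main rebuild.
main_hits = [(1, {"title": "old title"}), (2, {"title": "cat video"})]
delta_hits = [(1, {"title": "new title"}), (3, {"title": "fresh upload"})]

merged = merge_results(main_hits, delta_hits)
print(merged[1]["title"])
```

Listing the delta index after the main index is what makes the updated version of document 1 show up in the results.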
  15. Future developments on Netlog with sphinx
      Indexing of shards (messages / friendships)
      • Running an indexer on each shard
      • Creating a main index for x shards (merging these shards into 1)
      • Running distributed searches on these indexes
      Generation of tag clouds

             ./indexer -c ../etc/sphinx.conf users --buildstops test.txt 100

      --buildstops => sphinx has an option to generate a list of the most used words in an index, which can be relevant for tags
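What --buildstops produces is conceptually a "top N most frequent words" list. A minimal sketch of that idea follows; it does not reproduce Sphinx's tokenizer or output file format, and `most_used_words` and the sample texts are made up for illustration.

```python
import re
from collections import Counter


def most_used_words(documents, limit):
    """Return the `limit` most frequent words as (word, count) pairs."""
    counts = Counter()
    for text in documents:
        # Naive tokenizer: lowercase, then runs of word characters.
        counts.update(re.findall(r"\w+", text.lower()))
    return counts.most_common(limit)


# Hypothetical profile texts standing in for indexed documents.
profiles = ["basket ball", "basket fan", "music fan basket"]
print(most_used_words(profiles, 2))
```

For a tag cloud, the counts could be mapped to font sizes after dropping very common filler words.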
  16. Some sphinx tips & tweaks
      • Use range queries when indexing data: always try to have an autoincrement field on the MySQL tables you index. Sphinx has a mechanism that indexes the data in ranges, thus avoiding long table locks (where id > 1000 AND id < 2000, etc.)
      • Narrowest search first (e.g. when searching for users in Belgium who play basketball: @hobbies basket @country BE)
      • Avoid searches on small words with OR (e.g. the|new|...)
      • Define a charset table when indexing UTF-8 / foreign languages
      • Check that there are no trailing spaces at the end of lines in your sphinx.conf when using multi-line queries; they can cause weird errors
      • Cache results!
      • More info / advanced usage on: sphinxsearch.com
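In sphinx.conf, ranged indexing is configured with sql_query_range and sql_range_step, and the main query uses the $start and $end macros. How the ID space gets cut into steps can be sketched as follows; the helper names `id_ranges` and `ranged_queries` are hypothetical, and the inclusive bounds are an assumption of this sketch.

```python
def id_ranges(min_id, max_id, step):
    """Yield inclusive (start, end) ID windows covering min_id..max_id."""
    start = min_id
    while start <= max_id:
        end = min(start + step - 1, max_id)
        yield start, end
        start = end + 1


def ranged_queries(template, min_id, max_id, step):
    """Fill the $start / $end placeholders of a query for every window."""
    return [
        template.replace("$start", str(start)).replace("$end", str(end))
        for start, end in id_ranges(min_id, max_id, step)
    ]


template = "SELECT id, firstname, lastname FROM USERS WHERE id >= $start AND id <= $end"
for query in ranged_queries(template, 1, 2500, 1000):
    print(query)
```

Each window results in a short query, so no single SELECT holds a lock on the table for the full duration of the indexing run.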
  17. Questions?
      netlog.com/go/developer
      jayme@netlog.com - jurriaan@netlog.com