Scaling / optimizing search on netlog

Scaling search & content filtering

Search optimization

Netlog => social network
• meet / connect to new people => search essential
• localized content => content filtering essential

Types of searches

Daily search statistics on Netlog

How to handle this

Problem 1:
Large number of requests
+ nearly every request is unique

Problem 2:
Content to search is spread out
• different distributions (nl, en, fr, ...)
• each with its own database hosts / isolations:
videos, photos, ...
• different shards, as explained previously

Solution #1

Add full-text indexes to the existing tables,
aggregate the different data later on
e.g. VIDEOS
Full-text index on title, tags, description;
combine results at the end

Problems
• Large indexes
• Not all indexes are effective
• Locking of tables => searches have an impact on
other things on the site
• May work well for a small site, but otherwise => BAD

Solution #2

Create separate tables with full-text indexes, dedicated
to search queries
e.g. VIDEOS
• Table SEARCH_VIDEOS (videoid (int), searchvideo (text))
Combine title, tags, description, ... in one MySQL text field: “searchvideo”.
Add a full-text index on it. Combine results at the end.

Problems
• Duplication of data may cause inconsistencies
• Not easy to rebuild (takes a very long time)
• Peak moments: many updates + a lot of
searches => table locks (MyISAM)

Solution #3 ...almost there :)

Looking for non-MySQL-based alternatives
• Google
  • no control over the results, or over what is indexed and when it is
    indexed.
• Yahoo BOSS
  • promising, a great step towards making search more open.
    Is rather new, so may suffer from bugs.
  • you still rely on a third party for delivering your results,
    e.g. footnote on site: * BOSS offers developers unlimited daily queries, though Yahoo!
    reserves the right to limit unintended usage

• Lucene
  • Java based, an Apache Software Foundation project
  • Our servers are not optimized for running Java / Tomcat +
    more custom coding is needed to build a PHP <-> Java bridge
• Sphinx
  • C++ based, more in-house expertise
  • fast results in test setup

Solution #3 ...sphinx!

How Sphinx works:
• Full-text search engine
• two essential CLI tools:
  • indexer
    • creates the indexes
  • searchd
    • daemon that serves indexes & handles search requests; delivers
      results in the form of document IDs & attributes
    • uses a custom protocol for retrieving results => you need a Sphinx API
      (PHP, Java, ...) to talk to this daemon (use the search CLI tool for debugging)
• Some Sphinx terminology
  • sphinx.conf: the basic config file, with two essential parts: sources &
    indexes
  • documentid: ID that uniquely identifies a document in the Sphinx
    search index (must be unique!)
  • attribute: each documentid can have additional attributes; these can
    be used to sort or filter results
Indexing (1)

• Indexing
• We need to index a data source (SQL database, text files, HTML
files, ...). Defining this in sphinx.conf can be as easy as:
source users
{
    type          = mysql
    sql_host      = localhost
    sql_db        = localdb
    sql_user      = jayme
    sql_pass      = *******
    sql_port      = 3306
    # first column is the document ID, the other text columns are full-text indexed
    sql_query     = SELECT id, firstname, lastname, counter_photos FROM USERS
    # counter_photos is stored as an unsigned integer attribute
    sql_attr_uint = counter_photos
}

• We define counter_photos as an attribute, because we want to sort /
filter on it later on.

Indexing (2)

1. Define the index in the config, which searchd will
serve. An index can have more than one source.
index users
{
    docinfo = extern
    source  = users
    path    = /var/lib/sphinx/data/users
}

2. When running the indexer, Sphinx splits each
document (an SQL record in our case) into separate words
internally:
a. creates a dictionary of all of these words (WordIDs)
b. keeps references to documentIDs for each WordID
c. stores attributes with references to documentIDs
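
Steps a, b and c can be sketched in a few lines of Python. This is only a toy illustration of the idea, not how Sphinx stores its indexes internally: the function names build_index and search, the whitespace tokenizer and the in-memory dicts are all assumptions made for the example.

```python
from collections import defaultdict


def build_index(rows, text_fields, attr_fields):
    """Build a toy inverted index from SQL-like rows (dicts with an 'id' key)."""
    dictionary = {}               # a. word -> WordID
    postings = defaultdict(set)   # b. WordID -> set of document IDs
    attributes = {}               # c. document ID -> {attribute name: value}
    for row in rows:
        doc_id = row["id"]
        for field in text_fields:
            # Naive tokenizer (assumption): lowercase and split on whitespace.
            for word in str(row[field]).lower().split():
                word_id = dictionary.setdefault(word, len(dictionary) + 1)
                postings[word_id].add(doc_id)
        attributes[doc_id] = {name: row[name] for name in attr_fields}
    return dictionary, postings, attributes


def search(word, dictionary, postings):
    """Return the sorted document IDs that contain the given word."""
    word_id = dictionary.get(word.lower())
    if word_id is None:
        return []
    return sorted(postings[word_id])


# Hypothetical rows, shaped like the result of the sql_query on USERS.
rows = [
    {"id": 1, "firstname": "Ann", "lastname": "Peeters", "counter_photos": 12},
    {"id": 2, "firstname": "Ann", "lastname": "Janssens", "counter_photos": 0},
]
dictionary, postings, attributes = build_index(
    rows, ["firstname", "lastname"], ["counter_photos"]
)
matching_ids = search("ann", dictionary, postings)
```

The attributes dict is what makes it possible to sort or filter the matching IDs on counter_photos without going back to the database.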

Indexing (3) & searching

• Indexing
./indexer -c ../etc/sphinx.conf users or
./indexer -c ../etc/sphinx.conf users --rotate (when searchd is running)
• Searching
using the PHP API:

Sphinx netlog setup

We use a main+delta scheme
main:
For each search type (people, video, photo, ...) we have a main index that is
rebuilt every night. It takes about 20 minutes to rebuild the index for the
largest table that we have.

delta:
Changes to videos, photos, ... are tracked in a table,
e.g. SPHINX_PHOTO_UPDATE, with one column: the ID of the photo.
Every half hour, Sphinx regenerates a delta index based on this table. The table
is truncated once a day.

When searching we use both indexes: $cl->Query('test', 'users users_delta')
If a document is in both indexes, Sphinx uses the version from the index listed last,
so newer content from the delta index is found / returned
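
The effect of listing the delta index last can be illustrated with a small Python sketch. It is a simplified model, not the Sphinx implementation: the indexes are plain dicts, query_indexes is a hypothetical helper, and matching is a naive word lookup.

```python
def query_indexes(term, indexes):
    """Search the indexes in order; for an ID that matches in several
    indexes, the hit from the index listed last replaces earlier ones."""
    hits = {}
    for index in indexes:  # e.g. [main, delta]
        for doc_id, doc in index.items():
            if term.lower() in doc["text"].lower().split():
                hits[doc_id] = doc  # a later index overwrites an earlier hit
    return hits


# Hypothetical data: document 1 was changed after the nightly main rebuild.
main = {
    1: {"text": "test video", "title": "old title"},
    2: {"text": "another test", "title": "untouched"},
}
delta = {
    1: {"text": "test video", "title": "new title"},
}

results = query_indexes("test", [main, delta])
```

Here results[1] carries the title from the delta index, while document 2 is still served from the main index.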

Future developments on Netlog with sphinx

Indexing of shards (messages / friendships)
• Running an indexer on each shard
• Creating a main index for x shards
(merge these shards in to 1)
• Running distributed searches on these indexes
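
A distributed search over several shard indexes comes down to querying each shard and merging the ranked results. The sketch below shows that merge step only, under stated assumptions: each shard is a dict of document ID to (text, weight), the weight is a precomputed relevance score, and distributed_search is a hypothetical name.

```python
import heapq
from itertools import islice


def distributed_search(term, shard_indexes, limit=10):
    """Query every shard and merge the per-shard hits into one ranked list."""
    per_shard = []
    for index in shard_indexes:
        hits = [
            (weight, doc_id)
            for doc_id, (text, weight) in index.items()
            if term in text.split()
        ]
        # Each shard returns its own hits sorted by descending weight.
        per_shard.append(sorted(hits, reverse=True))
    # Merge the already sorted lists, highest weight first.
    merged = heapq.merge(*per_shard, reverse=True)
    return [doc_id for _weight, doc_id in islice(merged, limit)]


# Hypothetical shards holding messages.
shard_a = {1: ("hello world", 5), 2: ("hello there", 1)}
shard_b = {10: ("hello again", 3), 11: ("bye", 9)}

top_ids = distributed_search("hello", [shard_a, shard_b])
```

Because every shard sorts its own hits first, the merge only has to interleave the lists and can stop as soon as limit results are collected.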

Generation of tag clouds
./indexer -c ../etc/sphinx.conf users --buildstops test.txt 100
=> Sphinx has an option to generate a list of the most used words in an index, which
can be relevant for tags
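
What such a list of most used words looks like can be sketched in Python. This only mimics the idea of counting word frequencies for a tag cloud; most_used_words and its min_length cut-off are assumptions for the example and are not part of Sphinx.

```python
from collections import Counter


def most_used_words(documents, top_n=100, min_length=3):
    """Return the top_n most frequent words as (word, count) pairs."""
    counts = Counter()
    for text in documents:
        # Skip very short words (assumption): they rarely make useful tags.
        counts.update(
            word for word in text.lower().split() if len(word) >= min_length
        )
    return counts.most_common(top_n)


# Hypothetical profile texts.
documents = ["music basketball", "basketball party", "music basketball"]
tag_cloud = most_used_words(documents, top_n=2)
```

The counts can then be mapped to font sizes when rendering the cloud.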

Some sphinx tips & tweaks

• Use range queries when indexing data.
Try to always have an auto-increment field on MySQL tables that you
index. Sphinx has a mechanism that indexes the data in ranges,
thus avoiding table locks
(WHERE id >= 1000 AND id < 2000, etc.)
• Narrowest search first
(e.g. when searching for users in Belgium who like basketball:
@hobbies basket @country BE)
• Avoid searches on small words with OR (e.g. the|new|...)
• Define a charset table when indexing UTF-8 /
foreign languages
• Check that there are no trailing spaces after the line-continuation backslash in your sphinx.conf
when using multi-line queries; they can cause weird errors otherwise.
• Cache results!
• More info / advanced usage at: sphinxsearch.com
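
The range-query tip can be illustrated with a short Python sketch that splits an ID span into fixed-size steps and builds one query per step. In Sphinx itself this is configured in sphinx.conf (sql_query_range and sql_range_step); the helper names id_ranges and ranged_queries and the step size are assumptions for the example.

```python
def id_ranges(min_id, max_id, step):
    """Yield inclusive (start, end) ID ranges that cover min_id..max_id."""
    start = min_id
    while start <= max_id:
        end = min(start + step - 1, max_id)
        yield start, end
        start = end + 1


def ranged_queries(table, min_id, max_id, step=1000):
    """Build one SELECT per ID range instead of one query on the whole table."""
    return [
        f"SELECT id, firstname, lastname FROM {table} "
        f"WHERE id >= {start} AND id <= {end}"
        for start, end in id_ranges(min_id, max_id, step)
    ]


ranges = list(id_ranges(1, 2500, 1000))
queries = ranged_queries("USERS", 1, 2500)
```

Each query touches a limited number of rows, so no single statement holds the table for long.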

Questions?

netlog.com/go/developer
jayme@netlog.com - jurriaan@netlog.com
