7. How to handle this
Problem 1:
Large number of requests
+ each request is kind of unique
Problem 2:
Content to search on is spread
• different distributions (nl, en, fr, ...)
• each with its own database hosts / isolations:
videos, photos, ...
• different shards as explained previously
8. Solution #1
Add full-text indexes to the existing tables,
aggregate the data from the different tables later on
e.g. VIDEOS
Full-text index on title, tags, description;
combine results at the end
Problems
• Large indexes
• Not all indexes are effective
• Table locking => searches have an impact on
other things on the site
• May work well for a small site, but otherwise => BAD
9. Solution #2
Create separate tables with full-text indexes, specifically
for search queries
e.g. VIDEOS
• Table SEARCH_VIDEOS (videoid (int), searchvideo (text))
Combine title, tags, description, ... in one MySQL text field: "searchvideo".
Add a full-text index on it. Combine results at the end.
Problems
• Duplication of data may cause inconsistencies
• Not easy to rebuild (takes a very long time)
• Peak moments: many updates + a lot of
searches at the same time => table locks (MyISAM)
10. Solution #3 ...almost there :)
Looking for non-MySQL-based alternatives
• Google
  • no control over the results, or over what is indexed and when it is
    indexed
• Yahoo BOSS
  • promising, a great step towards making search more open.
    Is rather new, so it may suffer from bugs.
  • you still rely on a third party for delivering your results,
    e.g. footnote on the site: * BOSS offers developers unlimited daily queries, though Yahoo!
    reserves the right to limit unintended usage
• Lucene
  • Java based, an Apache Software Foundation project
  • our servers are not optimized for running Java/Tomcat +
    more custom coding is needed to build a PHP <-> Java bridge
• Sphinx
  • C++ based, more in-house expertise
  • fast results in test setup
11. Solution #3 ...sphinx!
How Sphinx works:
• Full-text search engine
• two essential CLI tools:
  • indexer
    • creates the indexes
  • searchd
    • daemon that serves indexes & handles search requests, delivers
      results in the form of documentids & attributes
    • uses a custom protocol for retrieving results => you need a Sphinx API
      in PHP, Java, ... to talk to this daemon (use the search CLI tool for debugging)
• Some Sphinx terminology
  • sphinx.conf: the basic config file, with two essential parts: sources &
    indexes
  • documentid: id that uniquely identifies a document in the Sphinx
    search index (must be unique!)
  • attribute: each documentid can have additional attributes; these can
    be used to sort or filter results
12. Indexing (1)
• Indexing
• We need to index a data source (SQL database, text files, HTML
files, ...). Defining this in sphinx.conf can be as easy as:
source users
{
    type          = mysql
    sql_host      = localhost
    sql_db        = localdb
    sql_user      = jayme
    sql_pass      = *******
    sql_port      = 3306
    sql_query     = SELECT id, firstname, lastname, counter_photos FROM USERS
    sql_attr_uint = counter_photos
}
• We define counter_photos as an attribute, because we want to sort/
filter on it later on.
13. Indexing (2)
1. Define the index in the config, which searchd will
serve. An index can have more than one source.
index users
{
    docinfo = extern
    source  = users
    path    = /var/lib/sphinx/data/users
}
2. When running the indexer, Sphinx internally splits each
document (an SQL record in our case) into separate words and:
a. creates a dictionary of all of these words (WordIDs)
b. keeps references to documentIDs for each WordID
c. stores attributes with references to documentIDs
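Steps a to c can be illustrated with a toy inverted index. This is only a minimal sketch in Python, not how Sphinx is implemented: the function names, the whitespace tokenizer and the sample rows are assumptions made for illustration.

```python
from collections import defaultdict


def build_index(rows, text_fields, attr_fields):
    """Build a toy inverted index from SQL-like rows (dicts with an 'id' key)."""
    dictionary = {}               # a. word -> WordID
    postings = defaultdict(set)   # b. WordID -> set of documentIDs
    attributes = {}               # c. documentID -> {attribute: value}
    for row in rows:
        doc_id = row["id"]
        for field in text_fields:
            # Naive tokenizer (assumption): lowercase and split on whitespace.
            for word in str(row[field]).lower().split():
                word_id = dictionary.setdefault(word, len(dictionary) + 1)
                postings[word_id].add(doc_id)
        attributes[doc_id] = {name: row[name] for name in attr_fields}
    return dictionary, postings, attributes


def search(word, dictionary, postings):
    """Return the sorted documentIDs that contain the given word."""
    word_id = dictionary.get(word.lower())
    return sorted(postings[word_id]) if word_id is not None else []


# Hypothetical sample rows, shaped like the USERS query of the previous slide.
rows = [
    {"id": 1, "firstname": "Jan", "lastname": "Peeters", "counter_photos": 12},
    {"id": 2, "firstname": "An", "lastname": "Peeters", "counter_photos": 3},
    {"id": 3, "firstname": "Jan", "lastname": "Janssens", "counter_photos": 0},
]
dictionary, postings, attributes = build_index(
    rows, text_fields=["firstname", "lastname"], attr_fields=["counter_photos"]
)
matches = search("peeters", dictionary, postings)
```

A search only touches the dictionary and the postings; the attributes are looked up afterwards by documentID, which is why they can be used for sorting and filtering without being part of the full-text data.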
14. Indexing (3) & searching
• Indexing
./indexer -c ../etc/sphinx.conf users
or, when searchd is running:
./indexer -c ../etc/sphinx.conf users --rotate
• Searching
Using the PHP API:
15. Sphinx Netlog setup
We use a main+delta scheme
main:
For each search type (people, video, photo, ...) we have a main index that is
rebuilt every night. It takes about 20 minutes to rebuild the largest table
that we have.
delta:
Changes to videos, photos, ... are tracked in a table,
e.g. SPHINX_PHOTO_UPDATE, with one column: the ID of the photo.
Every half hour, Sphinx regenerates a delta index based on this table. The table
is truncated once a day.
When searching we use both indexes: $cl->Query('test', 'users users_delta')
When the same document ID is found in both indexes, Sphinx returns the match
from the index listed last, so newer content from the delta is found / returned.
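The "last index wins" behaviour of a main+delta query can be sketched in a few lines of Python. This is a simplified model, not the Sphinx implementation: the dict-based indexes, the "text" and "version" keys and the function name are assumptions for illustration only.

```python
def query_indexes(term, indexes):
    """Search several indexes in order; on duplicate IDs the later index wins."""
    merged = {}
    for index in indexes:                        # e.g. [main, delta]
        for doc_id, doc in index.items():
            if term.lower() in doc["text"].lower().split():
                merged[doc_id] = doc             # a later index overwrites an earlier one
    return merged


# Hypothetical data: document 2 was changed after the nightly main rebuild,
# document 3 was created after it.
users = {
    1: {"text": "jan test", "version": "main"},
    2: {"text": "an test", "version": "main"},
}
users_delta = {
    2: {"text": "an test updated", "version": "delta"},
    3: {"text": "piet test", "version": "delta"},
}

result = query_indexes("test", [users, users_delta])
```

Note that in this sketch a stale entry in the main index still matches when its updated version in the delta no longer contains the search term; the sketch makes no attempt to hide such entries.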
16. Future developments on Netlog with sphinx
Indexing of shards (messages / friendships)
• Running an indexer on each shard
• Creating a main index for x shards
(merge these shards in to 1)
• Running distributed searches on these indexes
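The idea behind a distributed search over several shard indexes can be sketched as: query each shard, then merge the ranked per-shard hits into one result list. The Python below is a minimal model under stated assumptions: the dict-based shard indexes, the "weight" relevance score and the function name are hypothetical, and document IDs are assumed to be unique across shards.

```python
import heapq
from itertools import islice


def distributed_search(term, shard_indexes, limit=10):
    """Query every shard index and merge the hits into one ranked list of IDs."""
    per_shard = []
    for shard in shard_indexes:
        hits = [
            (doc["weight"], doc_id)
            for doc_id, doc in shard.items()
            if term in doc["text"].split()
        ]
        # Each shard returns its own hits, best first.
        per_shard.append(sorted(hits, reverse=True))
    # Merge the already-sorted per-shard lists, keeping the best-first order.
    merged = heapq.merge(*per_shard, reverse=True)
    return [doc_id for _weight, doc_id in islice(merged, limit)]


# Hypothetical shard contents (messages spread over two shards).
shard_a = {
    101: {"text": "hello friend", "weight": 5},
    102: {"text": "hello", "weight": 2},
}
shard_b = {
    201: {"text": "hello there", "weight": 4},
    202: {"text": "bye", "weight": 9},
}

top_hits = distributed_search("hello", [shard_a, shard_b], limit=2)
```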
Generation of tag clouds
./indexer -c ../etc/sphinx.conf users --buildstops test.txt 100
=> Sphinx has an option to generate a list of the most used words in an index, which
can be relevant for tags
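What such a "most used words" list contains can be sketched as a simple word count. This Python is only an illustration of the idea, not of how the indexer computes it: the tokenizer, the function name and the sample texts are assumptions.

```python
import re
from collections import Counter


def most_frequent_words(documents, limit):
    """Return the `limit` most frequent words as (word, count) pairs."""
    counts = Counter()
    for text in documents:
        # Naive tokenizer (assumption): lowercase runs of word characters.
        counts.update(re.findall(r"\w+", text.lower()))
    return counts.most_common(limit)


# Hypothetical tag texts.
documents = [
    "party brussels party",
    "brussels music",
    "party music festival",
]
top_words = most_frequent_words(documents, 3)
```

The counts can then drive the font size of each tag in the cloud.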
17. Some sphinx tips & tweaks
• Use range queries when indexing data
Always try to have an auto-increment field on MySQL tables when
indexing. Sphinx has a mechanism (sql_query_range, sql_range_step) that
indexes the data in ranges, thus avoiding long table locks.
(WHERE id >= 1000 AND id <= 1999, etc.)
• Narrowest search first
(e.g. when searching for users in Belgium who have basketball as a hobby:
@hobbies basket @country BE)
• Avoid searches on small words with OR (e.g. the|new|...)
• Define a charset table when indexing UTF-8 /
foreign languages
• Check that there are no trailing spaces after the line-continuation backslash in your sphinx.conf
when using multi-line queries; they can cause weird errors.
• Cache results!
• More info/ advanced usage on: sphinxsearch.com
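The range-query tip can be illustrated with a small sketch that reads a table in bounded id ranges instead of one big SELECT. Sphinx does this itself when range queries are configured; the Python below only models the idea. SQLite stands in for MySQL here so that the example is self-contained, and the function name, the step size and the sample data are assumptions.

```python
import sqlite3


def fetch_in_ranges(conn, step):
    """Yield rows from USERS in id ranges of `step`, one bounded query per range."""
    min_id, max_id = conn.execute("SELECT MIN(id), MAX(id) FROM USERS").fetchone()
    if min_id is None:          # empty table: nothing to fetch
        return
    start = min_id
    while start <= max_id:
        end = start + step - 1
        # Inclusive bounds, so no id on a range border is skipped.
        yield from conn.execute(
            "SELECT id, firstname FROM USERS WHERE id >= ? AND id <= ? ORDER BY id",
            (start, end),
        )
        start = end + 1


# Hypothetical in-memory table with an auto-increment id (SQLite, not MySQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE USERS (id INTEGER PRIMARY KEY AUTOINCREMENT, firstname TEXT)")
conn.executemany(
    "INSERT INTO USERS (firstname) VALUES (?)",
    [(f"user{i}",) for i in range(1, 2501)],
)
rows = list(fetch_in_ranges(conn, step=1000))
```

Each query only touches a limited id range, so no single statement has to hold the table for the duration of the whole indexing run.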