Your SlideShare is downloading. ×
0
Advanced fulltext search
with Sphinx
Adrian Nuta // Sphinxsearch // 2014
Fulltext search in MySQL
● available for MyISAM and lately for InnoDB
● limited in indexation options
○ only min length an...
Why Sphinx?
● GPLv2
● better performance
● lot of features, both on indexing and
searching
● easy to transit from MySQL:
○...
MySQL vs Sphinx fulltext index
● B-tree index
● easy to update
frequently, easy to
access by PK
● columnar storage
● OLTP
...
Simple fulltext search
MySQL:
mysql> SELECT * FROM myindex
WHERE MATCH('title,content') AGAINST ('find me fast');
Sphinx:
...
More complete Sphinx search
mysql> SELECT * FROM index WHERE
MATCH('"a quorum search is made here"/4')
ORDER BY WEIGHT() D...
Searching only on some fields
● Not possible in MySQL, need to declare
separate index
● in Sphinx - syntax operator:
mysql...
Indexing features
●
●
●
●
●
●

charset table
stopwords, wordforms
stemming and lemmatization
HTML stripping
blending, igno...
Searching operators
●
●
●
●
●
●
●

wildcard
proximity
phrase
start/end
qourum matching
strict order
sentence, paragraph, H...
Ranking factors formulas
● bm25
● LCS - distance between query and
document
● word and hit counting
● tf_idf and idf
● wor...
Ranking without field weighting
mysql> SELECT id,title,weight() FROM wikipedia WHERE MATCH('inverted index') OPTION
ranker...
Ranking with field weighting
mysql> SELECT id,title,WEIGHT() FROM index WHERE MATCH('inverted index') OPTION
ranker=expr('...
Words proximity
mysql> SELECT id,title,WEIGHT() FROM index
WHERE MATCH('@title list of football players') OPTION ranker=ex...
word and hit count
mysql> SELECT id,title,WEIGHT() AS w FROM index WHERE MATCH('@title php | api') OPTION
ranker=expr('sum...
Position
mysql> select id,title,weight() as w from forum where match('@title sphinx php api')
option ranker=expr('sum(min_...
IDF
mysql> select id,title,weight() from wikipedia where match('@title (Polyphonic |
Polysyllabic | Oberheim) ') option ra...
BM25F
mysql> select … where match('odbc') option ranker=expr('1000*bm25f(1,1)');
+------+---------------------------------...
Language morphology
Will the user search ‘shirt’ or ‘shirts’?
● stemming:
○ shirt = shirts

● index_exact_form for exact m...
EF-S 18-200mm f/3.5-5.6

blend_chars
● act as both separators and valid chars
● 10-200mm with - blended will index 3 terms...
Sentence delimitation
mysql> INSERT INTO index VALUES(1,
'quick brown fox jumps over the lazy dog');
mysql> INSERT INTO in...
Paragraph delimitation
mysql> INSERT
'<p>The quick
mysql> INSERT
'<p>The quick
dog</p>');

INTO index VALUES(1,
brown fox ...
More fulltext features
●
●
●
●
●
●

bigrams
more ranking factors: lccs, wlccs, atc
phrase boundary chars
HTML index attrib...
Thank you!
http://www.sphinxsearch.com
adrian.nuta@sphinxsearch.com
Upcoming SlideShare
Loading in...5
×

Advanced fulltext search with Sphinx

2,195

Published on

My talk on several advanced fulltext features in Sphinx. Presented at FOSDEM 2014, MySQL room

Published in: Technology, Business
0 Comments
2 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
2,195
On Slideshare
0
From Embeds
0
Number of Embeds
9
Actions
Shares
0
Downloads
21
Comments
0
Likes
2
Embeds 0
No embeds

No notes for slide

Transcript of "Advanced fulltext search with Sphinx"

  1. 1. Advanced fulltext search with Sphinx Adrian Nuta // Sphinxsearch // 2014
  2. 2. Fulltext search in MySQL ● available for MyISAM and lately for InnoDB ● limited in indexation options ○ only min length and list of stopwords ● limited in search options ○ boolean ○ natural mode ○ with query expansion
  3. 3. Why Sphinx? ● GPLv2 ● better performance ● lot of features, both on indexing and searching ● easy to transit from MySQL: ○ easy to index from MySQL ○ SphinxQL - access and query Sphinx using any MySQL client
  4. 4. MySQL vs Sphinx fulltext index ● B-tree index ● easy to update frequently, easy to access by PK ● columnar storage ● OLTP ● inverted index ● hard to update, fast to read ● keyword based storage ● OLAP
  5. 5. Simple fulltext search MySQL: mysql> SELECT * FROM myindex WHERE MATCH('title,content') AGAINST ('find me fast'); Sphinx: mysql> SELECT * FROM myindex WHERE MATCH('find me fast');
  6. 6. More complete Sphinx search mysql> SELECT * FROM index WHERE MATCH('"a quorum search is made here"/4') ORDER BY WEIGHT() DESC, id ASC OPTION ranker = expr( 'sum( exact_hit+10*(min_hit_pos==1)+lcs*(0.1*my_attr) )*1000 + bm25' );
  7. 7. Searching only on some fields ● Not possible in MySQL, need to declare separate index ● in Sphinx - syntax operator: mysql> SELECT * FROM myindex WHERE MATCH(‘@(title,content) find me fast’);
  8. 8. Indexing features ● ● ● ● ● ● charset table stopwords, wordforms stemming and lemmatization HTML stripping blending, ignore chars, bigram words custom regexp filters
  9. 9. Searching operators ● ● ● ● ● ● ● wildcard proximity phrase start/end qourum matching strict order sentence, paragraph, HTML zone limitation
  10. 10. Ranking factors formulas ● bm25 ● LCS - distance between query and document ● word and hit counting ● tf_idf and idf ● word positioning ● possible to use attribute values
  11. 11. Ranking without field weighting mysql> SELECT id,title,weight() FROM wikipedia WHERE MATCH('inverted index') OPTION ranker=expr('sum(hit_count*user_weight)'), field_weights=(title=1,body=1); +-----------+----------------------------------------------------------+----------+ | id | title | weight() | +-----------+----------------------------------------------------------+----------+ | 221501516 | Index (search engine) | 125 | | 221487412 | Inverted index | 47 | Doc. 221501516: 1 hit in ‘title’ x 100 + 124 hits in ‘body’ = 125 Doc. 221487412: 2 hits in ‘title’x 100 + 45 hits in ‘body’ = 47
  12. 12. Ranking with field weighting mysql> SELECT id,title,WEIGHT() FROM index WHERE MATCH('inverted index') OPTION ranker=expr('sum(hit_count*user_weight)'), field_weights=(title=100,body=1); +-----------+------------------------------------------------------+----------+ | id | title | WEIGHT() | +-----------+------------------------------------------------------+----------+ | 221487412 | Inverted index | 245 | | 221501516 | Index (search engine) | 224 | Doc. 221501516: 1 hit in ‘title’ x 100 + 124 hits in ‘body’ = 100+124 = 224 Doc. 221487412: 2 hits in ‘title’ x 100 + 45 hits in ‘body’ = 200 +45 = 245
  13. 13. Words proximity mysql> SELECT id,title,WEIGHT() FROM index WHERE MATCH('@title list of football players') OPTION ranker=expr('sum(lcs)'); +-----------+-----------------------------------------------------+----------+ | id | title | weight() | +-----------+-----------------------------------------------------+----------+ | 207381464 | List of football players from Amsterdam | 4 | | 221196229 | List of Football Kingz F.C. players | 3 | | 210456301 | List of Florida State University football players | 2 | +-----------+-----------------------------------------------------+----------+
  14. 14. word and hit count mysql> SELECT id,title,WEIGHT() AS w FROM index WHERE MATCH('@title php | api') OPTION ranker=expr('sum(hit_count)'); +---------+----------------------------------------------------------+------+ | id | title | w | +---------+----------------------------------------------------------+------+ | 1000671 | PHP API gives PHP Warnings - tips? | 3 | ... mysql> SELECT id,title,WEIGHT() AS w FROM index WHERE MATCH('@title php | api') OPTION ranker=expr('sum(word_count)'); +---------+----------------------------------------------------------+------+ | id | title | w | +---------+----------------------------------------------------------+------+ | 1000671 | PHP API gives PHP Warnings - tips? | 2 |
  15. 15. Position mysql> select id,title,weight() as w from forum where match('@title sphinx php api') option ranker=expr('sum(min_hit_pos)'); +---------+--------------------------------------------------------------+------+ | id | title | w | +---------+--------------------------------------------------------------+------+ | 1004955 | how can i do a sample search use sphinx php api | 9 | | 1004900 | How to update fulltext field using sphinx api of PHP? | 7 | | 1008783 | Update MVA-Attributes with the PHP-API Sphinx 2.0.2 | 6 | | 1000498 | Limits in sphinx when using PHP sphinx API | 3 | how can i do a sample search use sphinx php api 1 2 3 4 5 6 7 8 9
  16. 16. IDF mysql> select id,title,weight() from wikipedia where match('@title (Polyphonic | Polysyllabic | Oberheim) ') option ranker=expr('sum(max_idf)*1000'); +-----------+---------------------------+----------+ | id | title | weight() | +-----------+---------------------------+----------+ | 165867281 | The Polysyllabic Spree | 112 | Polysyllabic - rare | 208650218 | Oberheim Xpander | 108 | Oberheim - not so rare | 209138112 | Oberheim OB-8 | 108 | | 180503990 | Polyphonic Era | 85 | Polyphonic - common | 183135294 | Polyphonic C sharp | 85 | | 219939232 | Polyphonic HMI | 85 | +-----------+---------------------------+----------+
  17. 17. BM25F mysql> select … where match('odbc') option ranker=expr('1000*bm25f(1,1)'); +------+------------------------------------+-----------+----------+----------+ | id | title | title_len | body_len | weight() | +------+------------------------------------+-----------+----------+----------+ | 179 | odbc_dsn | 1 | 69 | 775 | | 170 | type | 1 | 124 | 742 | … mysql> select … where match('odbc') option ranker=expr('1000*bm25f(1,0)'); +------+------------------------------------+-----------+----------+----------+ | id | title | title_len | body_len | weight() | +------+------------------------------------+-----------+----------+----------+ | 169 | Data source configuration options | 4 | 6246 | 758 | | 179 | odbc_dsn | 1 | 69 | 743 | | 170 | type | 1 | 124 | 689 |
  18. 18. Language morphology Will the user search ‘shirt’ or ‘shirts’? ● stemming: ○ shirt = shirts ● index_exact_form for exact matching ● lemmatization: ○ men = man
  19. 19. EF-S 18-200mm f/3.5-5.6 blend_chars ● act as both separators and valid chars ● 10-200mm with - blended will index 3 terms: 10-200mm, 10 and 200mm ● leading or trailing blend char behaviour can be configured to be stripped or indexed
  20. 20. Sentence delimitation mysql> INSERT INTO index VALUES(1, 'quick brown fox jumps over the lazy dog'); mysql> INSERT INTO index VALUES(2, 'The quick brown fox made it. Where was the lazy dog?'); mysql> SELECT * FROM index WHERE MATCH('brown fox SENTENCE lazy dog'); +------+ | id | +------+ | 1 | +------+
  21. 21. Paragraph delimitation mysql> INSERT '<p>The quick mysql> INSERT '<p>The quick dog</p>'); INTO index VALUES(1, brown fox jumps over the lazy dog</p>'); INTO index VALUES(2, brown fox jumps</p><p>over the lazy mysql> SELECT * FROM index WHERE MATCH('brown fox PARAGRAPH lazy dog'); +------+ | id | +------+ | 1 | +------+
  22. 22. More fulltext features ● ● ● ● ● ● bigrams more ranking factors: lccs, wlccs, atc phrase boundary chars HTML index attributes, elements removal RLP Chinese tokenization position step tunning
  23. 23. Thank you! http://www.sphinxsearch.com adrian.nuta@sphinxsearch.com
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×