Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Real time fulltext search with sphinx


Published on

My talk about real-time indexing and searching with Sphinx. It was given at the 2013 Froscon in Sankt Augustin, Germany.

Published in: Technology
  • Was there a video for this one?
    Are you sure you want to  Yes  No
    Your message goes here

Real time fulltext search with sphinx

  1. 1. Real time fulltext search with Sphinx Adrian Nuta // Sphinxsearch // 2013
  2. 2. Quick intro Sphinx search • high performance fulltext search engine • written in C++ • serving searches since 2001 • can work on any modern architecture • distributed under GPL2 licence
  3. 3. Why a search engine? • performance o a search engine delivery faster a search and with less resourses • quality of search o build-in FTS in databases don’t offer advanced search options • independent FTS engines offer speed not only for FT searches, but other types, like geo or faceted searches
  4. 4. Classic way of indexing in Sphinx on-disk (classic) method: • use a data source which is indexed • to update the index you need to reindex again • in addition to main index, a secondary index (delta) index can be used to reindex only latest changes • easy because indexing doesn’t require changes in the application, but: • reindexing, even delta one, can put pressure on data source and system
  5. 5. Real time indexing in Sphinx • index has no data source • everything that needs be indexed must be added manually in the index • you can add/update/remove at any time • compared to classic method, RT requires changes in the application • performance is same or near same as classic index • Only specific requirement : workers = threads
  6. 6. Structures
  7. 7. RealTime index definition index rt { type = rt rt_field = title rt_field = content rt_attr_uint = user_id rt_attr_string = title rt_attr_json = metadata }
  8. 8. Schema - Fields rt_field - fulltext field, raw text is not stored Tokenization features: wildcarding ( prefix or infix), morphology, custom charset definition, stopwords, synonyms, segmentation, html stripping, paragraph/sentence detection etc.
  9. 9. Schema - Attributes • rt_attr_uint & rt_attr_bigint • rt_attr_bool • rt_attr_float • rt_attr_multi & rt_attr_multi64 - integer set • rt_attr_timestamp • rt_attr_string - actual text stored, kept in memory, used only for display, sorting and grouping. • rt_attr_json - full support for JSON documents
  10. 10. Content manipulation
  11. 11. Quick intro to SphinxQL • our SQL dialect • any mysql client can be used to connect to Sphinx • MySQL server is not required! • Full document updates only possible with SphinxQL • to enable it, add in searchd section of config listen = host:port:mysql41
  12. 12. Content insert $mysql> INSERT INTO rt (id,title,content,user_id,metadata) VALUES(100,’My title’, ‘Some long content to search’, 10, ’{“image_id”:1,”props”:[20,30,40]}’);
  13. 13. Full content replace $mysql> REPLACE INTO rt (id,title,content,user_id,metadata) VALUES(100,’My title’, ‘Some long content to search’, 10, ’{“image_id”:1,”props”:[20,30,40]}’); • needed for text field, json and string attribute updates
  14. 14. Updating numerics • For numeric attributes including MVA: $mysql> UPDATE rt SET user_id = 10 WHERE id = 100; • For numeric JSON elements it’s possible to do inplace updates: $mysql> UPDATE rt SET metadata.image_id = 1234 WHERE id=100;
  15. 15. Deleting $mysql> DELETE FROM rt WHERE id = 100; $mysql> DELETE FROM rt WHERE user_id > 100; $mysql> TRUNCATE RTINDEX rt; ● empty the memory shard, delete all disk shards and release the index binlogs
  16. 16. Adding new attributes mysql> ALTER TABLE rt ADD COLUMN gid INTEGER; • only for int/bigint/float/bool attributes for now
  17. 17. Searching
  18. 18. Searching • no difference in searching a RT or classic index • dict = keywords required for wildcard search.
  19. 19. Relevancy ranking • build-in rankers: o proximity_bm25 ( default) o none, matchany,wordcount,fieldmask,bm25 • custom ranker - create own expression rank example ranker = proximity_bm25 same as ranker = expr(‘sum(lcs*user_weight)*1000+bm25’)
  20. 20. Tokenization settings example index rt { … charset_type = utf-8 dict = keywords min_word_len = 2 min_infix_len = 3 morphology = stem_en enable_star = 1 … }
  21. 21. Operators on fulltext fields • Boolean: hello | world, hello ! world • phrasing: “hello world” • proximity: “hello world”~10 • quorum: “world is a beautiful place”/3 • exact form: =cats and =dogs • strict order: cats << and << dogs • zone limit: (h2,h4) cats and dogs • SENTENCE: all SENTENCE words SENTENCE “ in one sentence” • PARAGRAPH: “this search” PARAGRAPH “is fast” • selected fields only: @(title,body) hello world • excluded fields: @!(title,body) hello world
  22. 22. Using API <?php require("sphinxapi.php"); $cl = new SphinxClient(); $res = $cl->Query('search me now','rt'); print_r($res); Official: PHP, Python, Ruby, Java, C Unofficial: JS(Node.js), perl, C++, Haskell, .NET
  23. 23. Using SphinxQL $mysql> SELECT * FROM rt WHERE MATCH('”search me fuzzy”~10') AND featured = 1 LIMIT 0,20; $mysql> SELECT * FROM rt WHERE MATCH('”search me fuzzy”~10 @tag computers') AND featured = 1 GROUP BY user_id ORDER BY title ASC LIMIT 30,60 OPTION field_weights=(title=10,content=1), ranker=expr(‘sum((4*lcs+2*(min_hit_pos==1) +exact_hit)*user_weight)*1000+bm25’);
  24. 24. Boolean filtering $mysql> SELECT *, views > 10 OR category = 4 AS cond FROM rt WHERE MATCH('”search me proximity”~10') AND featured = 1 AND cond = 1 GROUP BY user_id ORDER BY title ASC LIMIT 30,60 OPTION ranker=sph04;
  25. 25. Geo search mysql> SELECT *, GEODIST(lat,long,0.71147,- 1.29153) as distance FROM rt WHERE distance < 1000 ORDER BY distance ASC; mysql> SELECT *, GEODIST(lat,long,40.76439,- 73.99976, {in=degrees,out=miles,method=adaptive}) as distance FROM rt WHERE distance < 10 ORDER BY distance ASC;
  26. 26. Multi-queries mysql> DELIMITER mysql> SELECT *,COUNT(*) as counter FROM rt WHERE MATCH('search me') GROUP by property_one ORDER by counter DESC;SELECT *,COUNT(*) as counter FROM rt WHERE MATCH('search me') GROUP by property_two ORDER by counter DESC;SELECT *,COUNT(*) as counter FROM rt WHERE MATCH('search me') GROUP by property_three ORDER by counter DESC; • used for faceting
  27. 27. Internals
  28. 28. Internal architecture Each RT index is a sharded index consisting of: • one memory shard for latest content • one or more disk shards
  29. 29. Internal shards management rt_mem_limit = maximum size of memory shard When full, is flushed to disk as a new disk shard. • OPTIMIZE INDEX rt - merge all disk shards into one. o Merging too intensive? throttle with rt_merge_iops and rt_merge_maxiosize
  30. 30. Binlog support Sphinx support binlogs, so memory shard will not be lost in case of disasters • binlog_flush o like innodb_flush_log_at_trx_commit o 0 - flush and sync every second - fastest, 1 sec lose o 1 - flush and sync every transaction - most safe, but slowest o 2 - flush every transaction, sync every second - best balance, default mode • binlog_path o binlog_path = # disable logging
  31. 31. Fast RT setup using classic index • Create classic index to get initial data. • Declare a RT index • mysql> ATTACH INDEX classic TO RTINDEX rt • transform classic index to RT • operation is almost instant o in essence is a file renaming: classic index becomes a RT disk shard
  32. 32. Sphinx use 1 CPU core per index More power? Distribute!
  33. 33. Distributed RT index Update on each shard, search on everything index distributed { type = distributed local = rtlocal_one local = rtlocal_two agent = some.ip:rtremote_one } don’t forget about dist_threads = x
  34. 34. Copy RT index from one server to another • just simulate a daemon restart • searchd --stopwait • flushes memory shard to disk • Copy all index files to new server. • Add RT index on new server sphinx.conf • Start searchd on new server
  35. 35. Questions? Docs: Wiki: Official blog: SVN repository: