Full text search

This presentation covers the basics of full text search and how it can be implemented in PostgreSQL.
It was presented at the India PostgreSQL meetup in Pune on 16 Nov 2013.

  1. Full Text Search
     Rahila Syed, Beena Emerson
     © 2013 NTT DATA, Inc.
  2. Index
     • Full text search and its types
     • Full text search in PostgreSQL
     • PostgreSQL extension
     • Similarity Search
  3. Full Text Search
  4. What is full text search?
     • Searching for a group of keywords in a pile of texts
       – Document
       – Query
       – Similarity
     • Full text search in database
       – Searching for a set of keywords in a text field of a database table
       – The data used for full text search can be huge
       – Indexing words and associating indexed words with documents
  5. Full Text Search in PostgreSQL
  6. Steps
     • Creating Tokens
       – Parsing the document into a set of tokens such as numbers, words, complex words and email addresses
     • Creating Lexemes
       – Normalization (the dictionary controls this):
         • Removal of suffixes: converts variants into a single form (worry, worries, worried, etc.)
         • Conversion to lower case
         • Removal of stop words: common words useless for searching (the, at, etc.)
     • Storing preprocessed documents
       – Storing documents and creating indexes over them for faster search
     • Relevance ranking
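The preprocessing steps above can be sketched in a few lines of Python. This is only an illustration: the stop-word list and the suffix rules are tiny hand-written assumptions, not the parser or Snowball dictionary that PostgreSQL actually uses, and the names `normalize` and `to_lexemes` are hypothetical.

```python
import re

# Assumption: a tiny illustrative stop-word list, not PostgreSQL's real one.
STOP_WORDS = {"the", "at", "to", "be", "of", "this", "a", "is", "in"}

# Assumption: crude suffix rules standing in for a real stemmer.
SUFFIX_RULES = (("ies", "i"), ("ied", "i"), ("es", ""), ("ed", ""), ("s", ""))


def normalize(token):
    """Lower-case a token and strip a common suffix so variants collapse."""
    word = token.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: len(word) - len(suffix)] + replacement
            break
    # Turn a trailing "y" after a consonant into "i" (worry -> worri).
    if len(word) > 2 and word.endswith("y") and word[-2] not in "aeiou":
        word = word[:-1] + "i"
    return word


def to_lexemes(document):
    """Tokenize, drop stop words, normalize, and record word positions."""
    lexemes = {}
    tokens = re.findall(r"[A-Za-z0-9]+", document)
    for position, token in enumerate(tokens, start=1):
        if token.lower() in STOP_WORDS:
            continue  # stop words still consume a position, as in a tsvector
        lexemes.setdefault(normalize(token), []).append(position)
    return lexemes


print(to_lexemes("Glad to be part of this meetup"))
```

The positions are counted before stop words are removed, which is why the surviving lexemes keep gaps between their positions.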
  7. Full text search in PostgreSQL
     • Full integration
     • 27 built-in configurations for 10 languages
     • Support of user-defined FTS configurations
     • Pluggable dictionaries (ispell, snowball, thesaurus) and parsers
     • Relevance ranking
     • GIN and GiST indexes
  8. Full text search in PostgreSQL
     Morphological search
     • Indexed tokens are words of a language
     • E.g. tree, book, rain
     • Small index size
     • Good with orthographical variants
     • Search results depend on the division of words
     • Used for large documents such as a thesis
     • Ex. tsvector
     N-gram search
     • Indexed tokens are characters
     • E.g. _t, tr, re, e_ (2-grams)
     • Big index size
     • Cannot match orthographical variants
     • Results closer to indexed LIKE
     • Better suited for a limited set of words
     • Ex. pg_bigm, pg_trgm
  9. Why full text search?
     • Search for similar words (no linguistic support otherwise)
     • Ranking of search results
     • Searches substrings
       – Indexes do not support substring search
       – The LIKE operator doesn't use an index when the pattern is preceded by %
       – Low performance when compared to full text search using GIN and GiST
     • Accuracy issue
       – E.g. LIKE '%one%' matches prone, money, lonely
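The accuracy issue can be reproduced outside the database. The sketch below emulates `LIKE '%one%'` with a plain substring test and contrasts it with a whole-word test; both helper names are hypothetical and the word list is made up for illustration.

```python
def like_contains(values, fragment):
    """Emulate LIKE '%fragment%': keep every value containing the fragment."""
    return [value for value in values if fragment in value]


def word_match(values, word):
    """Keep only values in which `word` appears as a whole word."""
    return [value for value in values if word in value.split()]


values = ["one", "prone", "money", "lonely", "one more"]
print(like_contains(values, "one"))  # substring matching keeps all five values
print(word_match(values, "one"))     # word matching keeps only the real hits
```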
  10. Measurement results
      • POSIX Expression
          =# EXPLAIN ANALYZE SELECT * FROM fulltext_search WHERE doc ~ 'postgresql';
                                       QUERY PLAN
          -------------------------------------------------------------------------
           Seq Scan on fulltext_search (cost=10000000000.00..10000000473.77 rows=40 width=152) (actual time=10.871..390.019 rows=250 loops=1)
             Filter: (doc ~ 'postgresql'::text)
             Rows Removed by Filter: 11397
           Total runtime: 390.060 ms
      • LIKE Query
          =# EXPLAIN ANALYZE SELECT * FROM fulltext_search WHERE doc LIKE '%postgresql%';
                                       QUERY PLAN
          ------------------------------------------------------------------------
           Seq Scan on fulltext_search (cost=10000000000.00..10000000473.77 rows=40 width=152) (actual time=1.342..110.107 rows=250 loops=1)
             Filter: (doc ~~ '%postgresql%'::text)
             Rows Removed by Filter: 11397
           Total runtime: 110.134 ms
  11. Measurement results
      • Full Text Search
           Nested Loop (cost=352.83..508.22 rows=107 width=64) (actual time=1.397..1.575 rows=250 loops=1)
             -> Function Scan on to_tsquery query (cost=0.00..0.01 rows=1 width=32) (actual time=0.023..0.023 rows=1 loops=1)
             -> Bitmap Heap Scan on full_text_search (cost=352.83..507.14 rows=107 width=32) (actual time=1.371..1.516 rows=250 loops=1)
                  Recheck Cond: (query.query @@ to_tsvector('english'::regconfig, doc))
                  -> Bitmap Index Scan on full_search_idx (cost=0.00..352.80 rows=107 width=0) (actual time=1.354..1.354 rows=348 loops=1)
                       Index Cond: (query.query @@ to_tsvector('english'::regconfig, doc))
           Total runtime: 1.619 ms
  12. Ranking Example
      Normal Search:
          SELECT * FROM tbl WHERE col1 LIKE 'The tiger is the largest cat species';
                          col1
          --------------------------------------
           The tiger is the largest cat species
          (1 row)
      Full Text Search:
          SELECT col1, similarity(col1, 'The tiger is the largest cat species') AS sml
            FROM tbl_t
           WHERE col1 % 'The tiger is the largest cat species'
           ORDER BY sml DESC, col1;
                            col1                   |   sml
          -----------------------------------------+----------
           The tiger is the largest cat species    |        1
           The peacock is the largest bird species | 0.511111
           The cheetah is the fastest cat species  | 0.466667
          (3 rows)
  13. Indexes Used in Full Text Search
      • GIN (Generalized Inverted Index)
        • Custom strategies for particular data types
        • Inverted indexes
        • Interface for custom data types
        • Slower to update
        • Deterministic
        • Appropriate for fixed data sets
          KEY      TID
          Meetup   100, 140
          Pune     100, 150
          Here     100
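The KEY/TID table on the GIN slide is an inverted index: each key maps to the row identifiers that contain it. A minimal Python sketch of that idea follows. The row texts are hypothetical, chosen only so that the resulting index reproduces the table on the slide, and the sketch ignores everything GIN does beyond the key-to-TID mapping.

```python
from collections import defaultdict

# Hypothetical rows keyed by TID; the slide only shows the resulting index.
ROWS = {100: "Meetup here Pune", 140: "Meetup", 150: "Pune"}


def build_inverted_index(rows):
    """Map every lower-cased word to the set of TIDs that contain it."""
    index = defaultdict(set)
    for tid, text in rows.items():
        for word in text.lower().split():
            index[word].add(tid)
    return index


def search_all(index, words):
    """Return the TIDs containing every word by intersecting posting lists."""
    postings = [index.get(word.lower(), set()) for word in words]
    return sorted(set.intersection(*postings)) if postings else []


index = build_inverted_index(ROWS)
print(search_all(index, ["Meetup", "Pune"]))
```

Because a lookup only touches the posting lists of the searched keys, the result is exact and no false matches need to be filtered out afterwards.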
  14. Indexes Used in Full Text Search
      • GiST (Generalized Search Tree)
        • Interface for data types and access methods
        • Document is represented in the index by a fixed-length signature
        • Based on hash tables
        • Probability of false match
        • Table row must be retrieved to see if the match is correct
        • Inappropriate for large data sets
        • Filtering data at the end of index search to remove false matches
          EXPLAIN SELECT * FROM tab WHERE textsearch @@ to_tsquery('Mountain');
                                      QUERY PLAN
          -------------------------------------------------------------------------
           Index Scan using text_search_idx on tab (cost=0.00..12.29 rows=2 width=1469)
             Index Cond: (textsearch @@ '''Mountain'''::tsquery)
             Filter: (textsearch @@ '''Mountain'''::tsquery)
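The fixed-length signature and the resulting false matches can be illustrated with a small Python sketch. It is an assumption-laden simplification: the 16-bit width, the use of `zlib.crc32` as the hash, and the flat scan over rows are all choices made for this example, whereas a real GiST index organizes signatures in a tree.

```python
import zlib

SIGNATURE_BITS = 16  # assumption: deliberately small so collisions are plausible


def signature(words):
    """Hash each word to one bit and OR the bits into a fixed-length signature."""
    sig = 0
    for word in words:
        bit = zlib.crc32(word.lower().encode("utf-8")) % SIGNATURE_BITS
        sig |= 1 << bit
    return sig


def may_match(doc_signature, query_words):
    """True if every query bit is set; can be a false match, never a false miss."""
    query_signature = signature(query_words)
    return (doc_signature & query_signature) == query_signature


def search(rows, query_words):
    """Filter by signature first, then fetch the row to discard false matches."""
    wanted = {word.lower() for word in query_words}
    candidates = [tid for tid, text in rows.items()
                  if may_match(signature(text.split()), query_words)]
    return [tid for tid in candidates
            if wanted <= {word.lower() for word in rows[tid].split()}]


rows = {1: "Mountain view hotel", 2: "River side camp"}
print(search(rows, ["mountain"]))
```

Two different words can set the same bit, so `may_match` alone may accept a row that does not contain the word. The final filter in `search` plays the role of the Filter step in the plan above.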
  15. tsvector
      • Representation of a document best suited for full text search
      • Normalized lexemes formed by pre-processing of the documents
      • Function to convert normal text to tsvector: to_tsvector
          to_tsvector([ config regconfig, ] document text) returns tsvector
          =# SELECT to_tsvector('english', 'Glad to be part of this meetup');
                   to_tsvector
          ------------------------------
           'glad':1 'meetup':7 'part':4
          (1 row)
      • The query above specifies 'english' as the configuration to be used to parse and normalize the strings. The default_text_search_config value will be used if the configuration parameter is omitted.
  16. tsquery
      • Representation of a search query best suited for full text search
      • Normalized lexemes formed by processing the query
      • May be combined using the AND, OR, or NOT operators
      • All keywords used for search
  17. tsquery
      • Functions to convert normal text to tsquery:
        • to_tsquery
          to_tsquery([ config regconfig, ] querytext text) returns tsquery
          =# SELECT to_tsquery('meetups & in & ! Pune');
               to_tsquery
          --------------------
           'meetup' & !'pune'
          (1 row)
        • plainto_tsquery
          plainto_tsquery([ config regconfig, ] querytext text) returns tsquery
          =# SELECT plainto_tsquery('english', 'meetups in Pune');
            plainto_tsquery
          -------------------
           'meetup' & 'pune'
          (1 row)
  18. Match operator @@
      • Checks a tsvector (document) against a tsquery (search words)
      • Returns true if all tsquery elements are present in the tsvector of the document
          =# SELECT to_tsvector('Welcome to this postgresql meetup') @@ plainto_tsquery('PostgreSQL Meetups');
           ?column?
          ----------
           t
          (1 row)
          =# SELECT to_tsvector('Welcome to this postgresql meetup') @@ plainto_tsquery('Pune meetup');
           ?column?
          ----------
           f
          (1 row)
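Once both sides are reduced to lexemes, the match boils down to set containment. The sketch below assumes the lexemes are already normalized (they are written by hand here, not produced by `to_tsvector`), and the function name `matches` plus its `excluded` parameter for NOT terms are hypothetical.

```python
def matches(document_lexemes, required, excluded=()):
    """True if all required lexemes are present and no excluded one is."""
    document = set(document_lexemes)
    return set(required) <= document and document.isdisjoint(excluded)


# Hand-written lexemes for 'Welcome to this postgresql meetup' (assumption).
document = {"welcome", "postgresql", "meetup"}

print(matches(document, {"postgresql", "meetup"}))      # both lexemes present
print(matches(document, {"pune", "meetup"}))            # 'pune' is missing
print(matches(document, {"meetup"}, excluded={"pune"})) # meetup AND NOT pune
```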
  19. Full text search without index
          SELECT * FROM <table>
           WHERE to_tsvector('<config>', <colname>) @@ to_tsquery('<config>', '<search word>');
      The configuration parameter of the functions to_tsvector and to_tsquery should be the same.
      Example:
          =# SELECT * FROM tbl WHERE to_tsvector('english', col) @@ to_tsquery('english', 'enjoy');
                        col
          --------------------------------
           He enjoyed the party
           He enjoys the classical music.
          (2 rows)
  20. Full text search using index
      • Creating the index
          CREATE INDEX <index_name> ON <table> USING gin(to_tsvector('<config>', <col>));
      • Performing search using the index:
          SELECT * FROM <table>
           WHERE to_tsvector('<config>', <col>) @@ plainto_tsquery('<config>', '<search word>');
      Example:
          =# CREATE INDEX idx ON tbl USING gin(to_tsvector('english', col));
          =# SELECT * FROM tbl WHERE to_tsvector('english', col) @@ plainto_tsquery('english', 'enjoy');
                        col
          --------------------------------
           He enjoyed the party
           He enjoys the classical music.
          (2 rows)
  21. Full text search using separate column
      • Procedure
        – Create a column of tsvector type
        – Define a trigger which will automatically update the tsvector column
        – Perform the search on the tsvector column
      • Advantages:
        – No need to specify the text search configuration in every query in order to make use of the index
        – Faster searches as the to_tsvector function will not be called for each search query
  22. Full text search using separate column
      Example:
          =# CREATE TABLE tbl (col text, tsv_col tsvector);
          =# CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE ON tbl
               FOR EACH ROW EXECUTE PROCEDURE
               tsvector_update_trigger(tsv_col, 'pg_catalog.english', col);
          =# INSERT INTO tbl VALUES ('He enjoyed the party'), ('He enjoys the classical music.'), ('The moon winked at him');
          =# SELECT * FROM tbl;
                        col               |             tsv_col
          --------------------------------+---------------------------------
           He enjoyed the party           | 'enjoy':2 'parti':4
           He enjoys the classical music. | 'classic':4 'enjoy':2 'music':5
           The moon winked at him         | 'moon':2 'wink':3
          (3 rows)
  23. Full text search using separate column
      Example:
          =# CREATE TABLE tbl (col text, tsv_col tsvector);
          =# CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE ON tbl
               FOR EACH ROW EXECUTE PROCEDURE
               tsvector_update_trigger(tsv_col, 'pg_catalog.english', col);
          =# INSERT INTO tbl VALUES ('He enjoyed the party'), ('He enjoys the classical music.'), ('The moon winked at him');
          =# SELECT col FROM tbl WHERE tsv_col @@ to_tsquery('enjoys');
                        col
          --------------------------------
           He enjoyed the party
           He enjoys the classical music.
          (2 rows)
  24. Ranking
      • ts_rank: lexical ranking
          ts_rank([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4
          =# select ts_rank(to_tsvector('Free text seaRCh is a wonderful Thing'), to_tsquery('wonderful | thing'));
            ts_rank
          -----------
           0.0607927
      • ts_rank_cd: proximity ranking
          =# select ts_rank_cd(to_tsvector('Free text seaRCh is a wonderful Thing'), to_tsquery('wonderful & thing'));
           ts_rank_cd
          ------------
                  0.1
  25. Ranking
      • Structural ranking
        – Query
          select ts_rank(array[0.1,0.1,0.9,0.1],
                         setweight(to_tsvector('All about search'), 'B') ||
                         setweight(to_tsvector('Free text seaRCh is a wonderful Thing'), 'A'),
                         to_tsquery('wonderful & search'));
        – Result
           ts_rank
          ----------
           0.328337
  26. PostgreSQL Extension
  27. pg_trgm
      • Uses an index made from trigrams (3 consecutive characters from a string)
      • Finds string similarity by comparing the trigrams
      • Provides GiST and GIN index operator classes to create an index
          CREATE INDEX <idx> ON <tbl> USING gist (<col> gist_trgm_ops);
          CREATE INDEX <idx> ON <tbl> USING gin (<col> gin_trgm_ops);
      • Problem:
        – No partial match algorithm
        – Slow when the search key is < 3 characters (GIN_SEARCH_MODE_ALL is used)
  28. pg_bigm
      • PostgreSQL module which provides full text search capability using a 2-gram index
      • Based on pg_trgm
      • First released in April 2013. Version 1.1 to be released soon.
      • Developed by NTT Data
      • Site: http://sourceforge.jp/projects/pgbigm/
  29. Difference
          Feature                        pg_trgm                   pg_bigm
          Method of full text search     3-gram                    2-gram
                                         "  a", " ab", abc, bcd    " a", ab, bc, cd, "d "
          Available index                GIN and GiST              GIN only
          1-2 character keyword search   Slow                      Fast
  30. Install pg_bigm
      • Download the tar.gz file from the site
      • Install pg_bigm
          $ make USE_PGXS=1
          $ su
          # make USE_PGXS=1 install
      • Register: set the postgresql.conf variables
        – shared_preload_libraries = 'pg_bigm'
        – custom_variable_classes = 'pg_bigm' (only in 9.1)
      • Load into the required database
          =# CREATE EXTENSION pg_bigm;
  31. Function – show_bigm
      Argument: Search string
      Return value: Array of all possible 2-gram character strings
      Procedure:
      • For each word perform the following:
        • Add a space character before and after the text
        • Moving from left to right, extract strings in units of 2 characters
          =# SELECT show_bigm('ab');
             show_bigm
          ----------------
           {" a",ab,"b "}
          (1 row)
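The procedure on the show_bigm slide is short enough to mirror in Python. This is a sketch of the idea rather than pg_bigm's implementation: it pads the whole input once (identical to per-word padding for single-word inputs) and returns the distinct 2-grams in sorted order.

```python
def show_bigm(text):
    """Return the sorted, distinct 2-grams of `text` padded with one space."""
    padded = " " + text + " "
    return sorted({padded[i:i + 2] for i in range(len(padded) - 1)})


print(show_bigm("ab"))   # [' a', 'ab', 'b ']
print(show_bigm("cat"))  # [' c', 'at', 'ca', 't ']
```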
  32. Function - likequery
      Argument: Search string
      Return value: String in a pattern to be used in LIKE for full-text search
      Procedure:
      • Add % to the beginning and the end of the search string
      • Add a backslash (\) before every underscore (_), percent (%) and backslash (\) present in the search string
          =# SELECT likequery('pg_bigm ppt');
             likequery
          ----------------
           %pg\_bigm ppt%
          (1 row)
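The escaping rules for likequery translate directly into Python. The sketch below follows the two steps listed on the slide; the backslash is escaped first so that the backslashes added for `%` and `_` are not escaped a second time.

```python
def likequery(text):
    """Escape LIKE metacharacters and wrap the text in % wildcards."""
    escaped = (text.replace("\\", "\\\\")  # backslash first, to avoid double escaping
                   .replace("%", "\\%")
                   .replace("_", "\\_"))
    return "%" + escaped + "%"


print(likequery("pg_bigm ppt"))  # %pg\_bigm ppt%
```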
  33. Creation of Index
      • Only GIN is supported
      • Create the index on the text column of a table
          CREATE INDEX <index_name> ON <table> USING gin (<column> gin_bigm_ops);
      Table
          TID   Data
          1     cat
          5     mat
      Generate bigrams
          cat - " c", at, ca, "t "
          mat - " m", at, ma, "t "
      Index
          Key    TID
          " c"   1
          " m"   5
          at     1, 5
          ca     1
          ma     5
          "t "   1, 5
  34. Full text search Query
          SELECT * FROM <tbl> WHERE <col> LIKE likequery('<word>');
          =# EXPLAIN ANALYZE SELECT * FROM tbl WHERE col LIKE likequery('cat');
                                       QUERY PLAN
          -------------------------------------------------------------------
           Bitmap Heap Scan on tbl (cost=12.00..16.01 rows=1 width=4) (actual time=0.038..0.039 rows=1 loops=1)
             Recheck Cond: (col ~~ '%cat%'::text)
             -> Bitmap Index Scan on idx (cost=0.00..12.00 rows=1 width=0) (actual time=0.025..0.025 rows=1 loops=1)
                  Index Cond: (col ~~ '%cat%'::text)
           Total runtime: 0.093 ms
          (5 rows)
  35. Full text search Query
      Search key (cat) → Generate bigrams → Index lookup → Result candidates → Perform Recheck → Final Result
      Index
          Key    TID
          " c"   1
          " m"   5
          at     1, 5
          ca     1
          ma     5
          "t "   1, 5
      Final Result
          TID   Data
          1     cat
  36. Why Recheck?
      • Removes wrong results from the result candidates of the index scan
          =# EXPLAIN ANALYZE SELECT * FROM tbl WHERE col LIKE likequery('trial');
                                       QUERY PLAN
          -------------------------------------------------------------------
           Bitmap Heap Scan on tbl (cost=24.00..28.01 rows=1 width=5) (actual time=0.060..0.060 rows=1 loops=1)
             Recheck Cond: (col ~~ '%trial%'::text)
             Rows Removed by Index Recheck: 1
             -> Bitmap Index Scan on idx (cost=0.00..24.00 rows=1 width=0) (actual time=0.043..0.043 rows=2 loops=1)
                  Index Cond: (col ~~ '%trial%'::text)
           Total runtime: 0.117 ms
          (6 rows)
  37. Why Recheck?
      Table
          TID   Data
          1     trial
          2     trivial
      Bigrams
          trial   - " t", al, ia, "l ", ri, tr
          trivial - " t", al, ia, iv, "l ", ri, tr, vi
      Index
          Key    TID
          " t"   1, 2
          al     1, 2
          ia     1, 2
          iv     2
          "l "   1, 2
          ri     1, 2
          tr     1, 2
          vi     2
      Search 'trial'
          Index scan: TID 1 (trial), TID 2 (trivial)
          Recheck:    TID 1 (trial)
  38. Disabling Recheck
      Parameter - enable_recheck
      • Set to off to disable Recheck and get all the results retrieved by the index scan
      • Values: on/off
          =# SET pg_bigm.enable_recheck = on;
          =# SELECT * FROM tbl WHERE doc LIKE likequery('trial');
                   doc
          ----------------------
           He is awaiting trial
          (1 row)
          =# SET pg_bigm.enable_recheck = off;
          =# SELECT * FROM tbl WHERE doc LIKE likequery('trial');
                     doc
          --------------------------
           He is awaiting trial
           It was a trivial mistake
          (2 rows)
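The index scan followed by Recheck can be sketched in Python. This is a simplified model under stated assumptions: the index is a plain dictionary from bigram to TIDs, the search key is split into bigrams without padding (because the LIKE pattern has a wildcard on both sides), and the `recheck` flag is a stand-in for `pg_bigm.enable_recheck`.

```python
from collections import defaultdict


def bigrams(text, pad=True):
    """Distinct 2-grams of text; stored values are padded with one space."""
    source = " " + text + " " if pad else text
    return {source[i:i + 2] for i in range(len(source) - 1)}


def build_index(rows):
    """Map every bigram to the set of TIDs whose text contains it."""
    index = defaultdict(set)
    for tid, text in rows.items():
        for gram in bigrams(text):
            index[gram].add(tid)
    return index


def like_search(rows, index, key, recheck=True):
    """Emulate LIKE '%key%': index scan for candidates, then optional recheck."""
    candidates = set(rows)
    for gram in bigrams(key, pad=False):
        candidates &= index.get(gram, set())  # row must contain every bigram
    if recheck:
        # Recheck: the bigrams may all be present without the key itself.
        candidates = {tid for tid in candidates if key in rows[tid]}
    return sorted(candidates)


rows = {1: "He is awaiting trial", 2: "It was a trivial mistake"}
index = build_index(rows)
print(like_search(rows, index, "trial"))                 # [1]
print(like_search(rows, index, "trial", recheck=False))  # [1, 2]
```

The row containing "trivial" holds every bigram of "trial" (tr, ri, ia, al), so it survives the index scan and is only removed by the recheck.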
  39. pg_bigm Full Text Search Sample
          =# CREATE TABLE tbl (col text);
          =# CREATE INDEX tbl_idx ON tbl USING gin (col gin_bigm_ops);
          =# INSERT INTO tbl VALUES
               ('He is awaiting trial'),
               ('Those orchids are very special to her '),
               ('pg_bigm performs full text search using 2 gram index'),
               ('pg_trgm performs full text search using 3 gram index');
          =# SELECT * FROM tbl WHERE col LIKE likequery('full text search');
                                   col
          ------------------------------------------------------
           pg_bigm performs full text search using 2 gram index
           pg_trgm performs full text search using 3 gram index
          (2 rows)
  40. Similarity Search
  41. Function – bigm_similarity
      Argument: The 2 strings whose similarity is to be checked
      Return value: The similarity value of the two arguments (0 - 1)
      • Measures the similarity of two strings by counting the number of 2-grams they share
          =# SELECT bigm_similarity('test', 'text');
           bigm_similarity
          -----------------
                       0.6
          (1 row)
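The shared 2-gram count can be turned into a score in Python as follows. The slide does not state the exact formula, so dividing by the larger of the two bigram counts is an assumption, chosen because it reproduces the values shown in this deck (0.6 for 'test' and 'text').

```python
def bigram_set(text):
    """Distinct 2-grams of text padded with one space on each side."""
    padded = " " + text + " "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bigm_similarity(first, second):
    """Shared 2-grams divided by the larger 2-gram count (assumed formula)."""
    grams_first, grams_second = bigram_set(first), bigram_set(second)
    shared = len(grams_first & grams_second)
    return shared / max(len(grams_first), len(grams_second))


# 'test' and 'text' share " t", "te" and "t " out of 5 bigrams each.
print(bigm_similarity("test", "text"))  # 0.6
```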
  42. Parameter - similarity_limit
      • Specifies the threshold used for the similarity search
      • Search returns rows with similarity value >= similarity_limit
      • Default: 0.3
      • The SET command can be used to modify the value
          =# SHOW pg_bigm.similarity_limit;
           pg_bigm.similarity_limit
          --------------------------
           0.3
          (1 row)
          =# SET pg_bigm.similarity_limit = 0.5;
  43. Similarity Operator - =%
      • Used to perform similarity search
      • Uses the full text search index
      • Returns rows whose similarity is higher than or equal to the value of pg_bigm.similarity_limit
          SELECT * FROM <tbl> WHERE <col> =% '<key>';
  44. Similarity Search Sample
          =# SET pg_bigm.similarity_limit = 0.2;
          =# SELECT *, bigm_similarity(col, 'test') FROM tbl WHERE col =% 'test';
            col  | bigm_similarity
          -------+-----------------
           test  |               1
           text  |             0.6
           treat |        0.333333
          (3 rows)
          =# SET pg_bigm.similarity_limit = 0.5;
          =# SELECT *, bigm_similarity(col, 'test') FROM tbl WHERE col =% 'test';
           col  | bigm_similarity
          ------+-----------------
           test |               1
           text |             0.6
          (2 rows)
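The effect of the threshold in the sample above can be mimicked with a small Python filter. As before this is a sketch: the similarity formula (shared 2-grams over the larger 2-gram count) is an assumption that matches the values in the sample, `similarity_search` is a hypothetical helper, and the extra row 'bigm' is made up to show a value that is filtered out.

```python
def similarity(first, second):
    """Assumed formula: shared padded 2-grams over the larger 2-gram count."""
    def grams(text):
        padded = " " + text + " "
        return {padded[i:i + 2] for i in range(len(padded) - 1)}
    a, b = grams(first), grams(second)
    return len(a & b) / max(len(a), len(b))


def similarity_search(values, key, limit=0.3):
    """Keep values whose similarity to key is >= limit, best match first."""
    scored = [(value, similarity(value, key)) for value in values]
    kept = [pair for pair in scored if pair[1] >= limit]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


values = ["text", "treat", "test", "bigm"]
for limit in (0.2, 0.5):
    print(limit, [value for value, _ in similarity_search(values, "test", limit)])
```

Raising the limit from 0.2 to 0.5 drops 'treat', whose similarity to 'test' is about 0.33.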