Successfully reported this slideshow.
Your SlideShare is downloading. ×

Full Text Search Throwdown

Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Loading in …3
×

Check these out next

1 of 69 Ad

Full Text Search Throwdown

Download to read offline

Odds are if you develop database applications you have been asked to make a large table of textual data searchable. How can we do this most reliably and efficiently? Bill compares a range of techniques to search text data with MySQL, with a focus on performance.
- Ad hoc searches with the LIKE predicate or regular expressions.
- MySQL FULLTEXT index.
- InnoDB FULLTEXT index in MySQL 5.6.
- Apache Solr search engine
- Sphinx search engine.
- Trigraphs.

Odds are if you develop database applications you have been asked to make a large table of textual data searchable. How can we do this most reliably and efficiently? Bill compares a range of techniques to search text data with MySQL, with a focus on performance.
- Ad hoc searches with the LIKE predicate or regular expressions.
- MySQL FULLTEXT index.
- InnoDB FULLTEXT index in MySQL 5.6.
- Apache Solr search engine
- Sphinx search engine.
- Trigraphs.

Advertisement
Advertisement

More Related Content

Viewers also liked (19)

Recently uploaded (20)

Advertisement

Full Text Search Throwdown

  1. 1. Full Text Search Throwdown Bill Karwin, Percona Inc.
  2. 2. In a full text search, the search engine examines all of the words in every stored document as it tries to match search words supplied by the user. http://www.flickr.com/photos/tryingyouth/
  3. 3. StackOverflow Test Data • Data dump, exported December 2011 • 7.4 million Posts = 8.18 GB www.percona.com
  4. 4. StackOverflow ER diagram searchable text www.percona.com
  5. 5. The Baseline: Naive Search Predicates
  6. 6. Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems. — Jamie Zawinsky www.percona.com
  7. 7. Accuracy issue • Irrelevant or false matching words ‘one’, ‘money’, ‘prone’, etc.: SELECT * FROM Posts WHERE Body LIKE '%one%' • Regular expressions in MySQL support escapes for word boundaries: SELECT * FROM Posts WHERE Body RLIKE '[[:<:]]one[[:>:]]' www.percona.com
  8. 8. Performance issue • LIKE with wildcards: ! SELECT * FROM Posts 49 sec WHERE title LIKE '%performance%' ! OR body LIKE '%performance%' ! OR tags LIKE '%performance%'; • POSIX regular expressions: ! SELECT * FROM Posts 7 min 57 sec WHERE title RLIKE '[[:<:]]performance[[:>:]]' ! OR body RLIKE '[[:<:]]performance[[:>:]]' ! OR tags RLIKE '[[:<:]]performance[[:>:]]'; www.percona.com
  9. 9. Why so slow? CREATE TABLE TelephoneBook ( ! FullName VARCHAR(50)); CREATE INDEX name_idx ON TelephoneBook ! (FullName); INSERT INTO TelephoneBook VALUES ! ('Riddle, Thomas'), ! ('Thomas, Dean'); www.percona.com
  10. 10. Why so slow? • Search for all with last name “Thomas” uses index SELECT * FROM telephone_book WHERE full_name LIKE 'Thomas%' • Search for all with first name “Thomas” SELECT * FROM telephone_book WHERE full_name LIKE '%Thomas' can’t use index www.percona.com
  11. 11. Because: B-Tree indexes can’t ☞ search for substrings www.percona.com
  12. 12. • FULLTEXT in MyISAM • FULLTEXT in InnoDB • Apache Solr • Sphinx Search • Trigraphs
  13. 13. FULLTEXT in MyISAM
  14. 14. FULLTEXT Index with MyISAM • Special index type for MyISAM • Integrated with SQL queries • Indexes always in sync with data • Balances features vs. speed vs. space www.percona.com
  15. 15. Insert Data into Index (MyISAM) mysql> INSERT INTO Posts SELECT * FROM PostsSource; time: 33 min, 34 sec www.percona.com
  16. 16. Build Index on Data (MyISAM) mysql> CREATE FULLTEXT INDEX PostText ! ON Posts(title, body, tags); time: 31 min, 18 sec www.percona.com
  17. 17. Querying SELECT * FROM Posts WHERE MATCH(column(s)) AGAINST('query pattern'); must include all columns of your index, in the order you defined www.percona.com
  18. 18. Natural Language Mode (MyISAM) • Searches concepts with free text queries: ! SELECT * FROM Posts WHERE MATCH(title, body, tags ) AGAINST('mysql performance' IN NATURAL LANGUAGE MODE) LIMIT 100; time with index: 200 milliseconds www.percona.com
  19. 19. Query Profile: Natural Language Mode (MyISAM) +-------------------------+----------+ | Status | Duration | +-------------------------+----------+ | starting | 0.000068 | | checking permissions | 0.000006 | | Opening tables | 0.000017 | | init | 0.000032 | | System lock | 0.000007 | | optimizing | 0.000007 | | statistics | 0.000018 | | preparing | 0.000006 | | FULLTEXT initialization | 0.198358 | | executing | 0.000012 | | Sending data | 0.001921 | | end | 0.000005 | | query end | 0.000003 | | closing tables | 0.000018 | | freeing items | 0.000341 | | cleaning up | 0.000012 | +-------------------------+----------+ www.percona.com
  20. 20. Boolean Mode (MyISAM) • Searches words using mini-language: ! SELECT * FROM Posts WHERE MATCH(title, body, tags) AGAINST('+mysql +performance' IN BOOLEAN MODE) LIMIT 100; time with index: 16 milliseconds www.percona.com
  21. 21. Query Profile: Boolean Mode (MyISAM) +-------------------------+----------+ | Status | Duration | +-------------------------+----------+ | starting | 0.000031 | | checking permissions | 0.000003 | | Opening tables | 0.000008 | | init | 0.000017 | | System lock | 0.000004 | | optimizing | 0.000004 | | statistics | 0.000008 | | preparing | 0.000003 | | FULLTEXT initialization | 0.000008 | | executing | 0.000001 | | Sending data | 0.015703 | | end | 0.000004 | | query end | 0.000002 | | closing tables | 0.000007 | | freeing items | 0.000381 | | cleaning up | 0.000007 | +-------------------------+----------+ www.percona.com
  22. 22. FULLTEXT in InnoDB
  23. 23. FULLTEXT Index with InnoDB • Under development in MySQL 5.6 • I’m testing 5.6.6 m1 • Usage very similar to FULLTEXT in MyISAM • Integrated with SQL queries • Indexes always* in sync with data • Read the blogs for more details: • http://blogs.innodb.com/wp/2011/07/overview-and-getting-started-with-innodb-fts/ • http://blogs.innodb.com/wp/2011/07/innodb-full-text-search-tutorial/ • http://blogs.innodb.com/wp/2011/07/innodb-fts-performance/ • http://blogs.innodb.com/wp/2011/07/difference-between-innodb-fts-and-myisam-fts/ www.percona.com
  24. 24. Insert Data into Index (InnoDB) mysql> INSERT INTO Posts SELECT * FROM PostsSource; time: 55 min 46 sec www.percona.com
  25. 25. Build Index on Data (InnoDB) • Still under development; you might see problems: mysql> CREATE FULLTEXT INDEX PostText ! ON Posts(title, body, tags); ERROR 2013 (HY000): Lost connection to MySQL server during query www.percona.com
  26. 26. Build Index on Data (InnoDB) • Solution: make sure you define a primary key column `FTS_DOC_ID` explicitly: mysql> ALTER TABLE Posts CHANGE COLUMN PostId `FTS_DOC_ID` BIGINT UNSIGNED; mysql> CREATE FULLTEXT INDEX PostText ! ON Posts(title, body, tags); time: 25 min 27 sec www.percona.com
  27. 27. Natural Language Mode (InnoDB) • Searches concepts with free text queries: ! SELECT * FROM Posts WHERE MATCH(title, body, tags) AGAINST('mysql performance' IN NATURAL LANGUAGE MODE) LIMIT 100; time with index: 740 milliseconds www.percona.com
  28. 28. Query Profile: Natural Language Mode (InnoDB) +-------------------------+----------+ | Status | Duration | +-------------------------+----------+ | starting | 0.000074 | | checking permissions | 0.000007 | | Opening tables | 0.000020 | | init | 0.000034 | | System lock | 0.000007 | | optimizing | 0.000009 | | statistics | 0.000020 | | preparing | 0.000008 | | FULLTEXT initialization | 0.577257 | | executing | 0.000013 | | Sending data | 0.106279 | | end | 0.000018 | | query end | 0.000012 | | closing tables | 0.000018 | | freeing items | 0.055584 | | cleaning up | 0.000039 | +-------------------------+----------+ www.percona.com
  29. 29. Boolean Mode (InnoDB) • Searches words using mini-language: ! SELECT * FROM Posts WHERE MATCH(title, body, tags) AGAINST('+mysql +performance' IN BOOLEAN MODE) LIMIT 100; time with index: 350 milliseconds www.percona.com
  30. 30. Query Profile: Boolean Mode (InnoDB) +-------------------------+----------+ | Status | Duration | +-------------------------+----------+ | starting | 0.000064 | | checking permissions | 0.000005 | | Opening tables | 0.000017 | | init | 0.000047 | | System lock | 0.000007 | | optimizing | 0.000009 | | statistics | 0.000019 | | preparing | 0.000008 | | FULLTEXT initialization | 0.347172 | | executing | 0.000014 | | Sending data | 0.008089 | | end | 0.000011 | | query end | 0.000012 | | closing tables | 0.000015 | | freeing items | 0.001570 | | cleaning up | 0.000023 | +-------------------------+----------+ www.percona.com
  31. 31. Apache Solr
  32. 32. Apache Solr • http://lucene.apache.org/solr/ • Formerly known as Lucene, started 2001 • Apache License • Java implementation • Web service architecture • Many sophisticated search features www.percona.com
  33. 33. DataImportHandler • conf/solrconfig.xml: . . . <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler"> <lst name="defaults"> <str name="config">data-config.xml</str> </lst> </requestHandler> . . . www.percona.com
  34. 34. DataImportHandler • conf/data-config.xml: <dataConfig> <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/testpattern?useUnicode=true" batchSize="-1" user="xxxx" password="xxxx"/> <document> <entity name="id" query="SELECT PostId, ParentId, Title, Body, Tags FROM Posts"> </entity> </document> </dataConfig> extremely important to avoid buffering the whole query result! www.percona.com
  35. 35. DataImportHandler •conf/schema.xml: . . . <fields> <field name="PostId" type="string" indexed="true" stored="true" required="true" /> <field name="ParentId" type="string" indexed="true" stored="true" required="false" /> <field name="Title" type="text_general" indexed="false" stored="false" required="false" /> <field name="Body" type="text_general" indexed="false" stored="false" required="false" / > <field name="Tags" type="text_general" indexed="false" stored="false" required="false" / > <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/ > <fields> <uniqueKey>PostId</uniqueKey> <defaultSearchField>text</defaultSearchField> <copyField source="Title" dest="text"/> <copyField source="Body" dest="text"/> <copyField source="Tags" dest="text"/> . . . www.percona.com
  36. 36. Insert Data into Index (Solr) • http://localhost:8983/solr/dataimport? command=full-import time: 14 min 28 sec www.percona.com
  37. 37. Searching Solr • http://localhost:8983/solr/select/?q=mysql+AND +performance time: 79ms Query results are cached (like MySQL Query Cache), so they return much faster on subsequent execution www.percona.com
  38. 38. Sphinx Search
  39. 39. Sphinx Search • http://sphinxsearch.com/ • Started in 2001 • GPLv2 license • C++ implementation • SphinxSE storage engine for MySQL • Supports MySQL protocol, SQL-like queries • Many sophisticated search features www.percona.com
  40. 40. sphinx.conf source src1 { ! type = mysql ! sql_host = localhost ! sql_user = xxxx ! sql_pass = xxxx ! sql_db = testpattern ! sql_query = SELECT PostId, ParentId, Title, ! ! Body, Tags FROM Posts ! sql_query_info = SELECT * FROM Posts ! ! WHERE PostId=$id } www.percona.com
  41. 41. sphinx.conf ! index test1 { ! source = src1 ! path = C:Sphinxdata } www.percona.com
  42. 42. Insert Data into Index (Sphinx) ! C:Sphinx> indexer.exe -c sphinx.conf.in --verbose test1 Sphinx 2.0.5-release (r3309) using config file 'sphinx.conf'... indexing index 'test1'... collected 7397507 docs, 5731.8 MB time: 8 min 20 sec sorted 920.3 Mhits, 100.0% done total 7397507 docs, 5731776959 bytes total 500.149 sec, 11460138 bytes/sec, 14790.60 docs/sec total 11 reads, 15.898 sec, 314584.8 kb/call avg, 1445.3 msec/call avg total 542 writes, 3.129 sec, 12723.3 kb/call avg, 5.7 msec/ call avg Execution time: 500.196 s www.percona.com
  43. 43. Querying index $ mysql --port 9306 Server version: 2.0.5-release (r3309) mysql> SELECT * FROM test1 WHERE MATCH('mysql performance'); +---------+--------+ | id | weight | +---------+--------+ | 6016856 | 6600 | | 4207641 | 6595 | | 2656325 | 6593 | | 7192928 | 5605 | | 8118235 | 5598 | . . . 20 rows in set (0.02 sec) www.percona.com
  44. 44. Querying index mysql> SHOW META; +---------------+-------------+ | Variable_name | Value | +---------------+-------------+ | total | 1000 | | total_found | 7672 | | time | 0.013 | | keyword[0] | mysql | | docs[0] | 162287 | time: 13ms | hits[0] | 363694 | | keyword[1] | performance | | docs[1] | 147249 | | hits[1] | 210895 | +---------------+-------------+ www.percona.com
  45. 45. Trigraphs
  46. 46. Trigraphs Overview • Not very fast, but still better than LIKE / RLIKE • Generic, portable SQL solution • No dependency on version, storage engine, third- party technology www.percona.com
  47. 47. Three-Letter Sequences ! CREATE TABLE AtoZ ( ! c! ! CHAR(1), ! PRIMARY KEY (c)); ! INSERT INTO AtoZ (c) VALUES ('a'), ('b'), ('c'), ... ! CREATE TABLE Trigraphs ( ! Tri! ! CHAR(3), ! PRIMARY KEY (Tri)); ! INSERT INTO Trigraphs (Tri) SELECT CONCAT(t1.c, t2.c, t3.c) FROM AtoZ t1 JOIN AtoZ t2 JOIN AtoZ t3; www.percona.com
  48. 48. Insert Data Into Index my $sth = $dbh1->prepare("SELECT * FROM Posts") or die $dbh1->errstr; $sth->execute() or die $dbh1->errstr; $dbh2->begin_work; my $i = 0; while (my $row = $sth->fetchrow_hashref ) { my $text = lc(join('|', ($row->{title}, $row->{body}, $row->{tags}))); my %tri; map($tri{$_}=1, ( $text =~ m/[[:alpha:]]{3}/g )); next unless %tri; my $tuple_list = join(",", map("('$_',$row->{postid})", keys %tri)); my $sql = "INSERT IGNORE INTO PostsTrigraph (tri, PostId) VALUES $tuple_list"; $dbh2->do($sql) or die "SQL = $sql, ".$dbh2->errstr; if (++$i % 1000 == 0) { print "."; $dbh2->commit; $dbh2->begin_work; time: 116 min 50 sec } } space: 16.2GiB print ".n"; $dbh2->commit; rows: 519 million www.percona.com
  49. 49. Indexed Lookups ! SELECT p.* time: 46 sec FROM Posts p JOIN PostsTrigraph t1 ON ! t1.PostId = p.PostId AND t1.Tri = 'mys' ! www.percona.com
  50. 50. Search Among Fewer Matches ! SELECT p.* time: 19 sec FROM Posts p JOIN PostsTrigraph t1 ON ! t1.PostId = p.PostId AND t1.Tri = 'mys' JOIN PostsTrigraph t2 ON ! t2.PostId = p.PostId AND t2.Tri = 'per' www.percona.com
  51. 51. Search Among Fewer Matches ! SELECT p.* time: 22 sec FROM Posts p JOIN PostsTrigraph t1 ON ! t1.PostId = p.PostId AND t1.Tri = 'mys' JOIN PostsTrigraph t2 ON ! t2.PostId = p.PostId AND t2.Tri = 'per' JOIN PostsTrigraph t3 ON ! t3.PostId = p.PostId AND t3.Tri = 'for' www.percona.com
  52. 52. Search Among Fewer Matches ! SELECT p.* time: 13.6 sec FROM Posts p JOIN PostsTrigraph t1 ON ! t1.PostId = p.PostId AND t1.Tri = 'mys' JOIN PostsTrigraph t2 ON ! t2.PostId = p.PostId AND t2.Tri = 'per' JOIN PostsTrigraph t3 ON ! t3.PostId = p.PostId AND t3.Tri = 'for' JOIN PostsTrigraph t4 ON ! t4.PostId = p.PostId AND t4.Tri = 'man' www.percona.com
  53. 53. Narrow Down Further ! SELECT p.* time: 13.8 sec FROM Posts p JOIN PostsTrigraph t1 ON ! t1.PostId = p.PostId AND t1.Tri = 'mys' JOIN PostsTrigraph t2 ON ! t2.PostId = p.PostId AND t2.Tri = 'per' JOIN PostsTrigraph t3 ON ! t3.PostId = p.PostId AND t3.Tri = 'for' JOIN PostsTrigraph t4 ON ! t4.PostId = p.PostId AND t4.Tri = 'man' WHERE CONCAT(p.title,p.body,p.tags) LIKE '%mysql%' ! AND CONCAT(p.title,p.body,p.tags) LIKE '%performance%'; www.percona.com
  54. 54. And the winner is... Jarrett Campbell http://www.flickr.com/people/77744839@N00
  55. 55. Time to Insert Data into Index LIKE expression n/a FULLTEXT MyISAM 33 min, 34 sec FULLTEXT InnoDB 55 min, 46 sec Apache Solr 14 min, 28 sec Sphinx Search 8 min, 20 sec Trigraphs 116 min, 50 sec www.percona.com
  56. 56. Insert Data into Index (sec) 8000 6000 4000 2000 0 LIKE MyISAM InnoDB Solr Sphinx Trigraph www.percona.com
  57. 57. Time to Build Index on Data LIKE expression n/a FULLTEXT MyISAM 31 min, 18 sec FULLTEXT InnoDB 25 min, 27 sec Apache Solr n/a Sphinx Search n/a Trigraphs n/a www.percona.com
  58. 58. Build Index on Data (sec) 4000 3000 2000 1000 n/a n/a n/a 0 LIKE MyISAM InnoDB Solr Sphinx Trigraph www.percona.com
  59. 59. Index Storage LIKE expression n/a FULLTEXT MyISAM 2382 MiB FULLTEXT InnoDB ? MiB Apache Solr 2766 MiB Sphinx Search 3355 MiB Trigraphs 16589 MiB www.percona.com
  60. 60. Index Storage (MiB) 20000 15000 10000 5000 0 LIKE MyISAM InnoDB Solr Sphinx Trigraph www.percona.com
  61. 61. Query Speed LIKE expression 49,000ms - 399,000ms FULLTEXT MyISAM 16-200ms FULLTEXT InnoDB 350-740ms Apache Solr 79ms Sphinx Search 13ms Trigraphs 13800ms www.percona.com
  62. 62. Query Speed (ms) 400000 350000 300000 250000 200000 150000 100000 50000 0 LIKE MyISAM InnoDB Solr Sphinx Trigraph www.percona.com
  63. 63. 1000 Query Speed (ms) 750 500 250 0 LIKE MyISAM InnoDB Solr Sphinx Trigraph www.percona.com
  64. 64. Bottom Line build insert storage query solution 49k-399k LIKE expression 0 0 0 ms SQL FULLTEXT MyISAM 31:18 33:28 2382MiB 16-200ms MySQL FULLTEXT InnoDB 25:27 55:46 ? 350-740ms MySQL 5.6 Apache Solr n/a 14:28 2766MiB 79ms Java Sphinx Search n/a 8:20 3487MiB 13ms C++ Trigraphs n/a 116:50 16.2 GiB 13,800ms SQL www.percona.com
  65. 65. Final Thoughts • Third-party search engines are complex to keep in sync with data, and adding another type of server adds more operations work for you. • Built-in FULLTEXT indexes are therefore useful even if they are not absolutely the fastest. • Different search implementations may return different results, so you should evaluate what works best for your project. • Any indexed search solution is orders of magnitude better than LIKE! www.percona.com
  66. 66. New York, October 1-2, 2012 London, December 3-4, 2012 Santa Clara, April 22-25, 2013 www.percona.com/live
  67. 67. Expert instructors In-person training Custom onsite training Live virtual training http://www.percona.com/training www.percona.com
  68. 68. http://www.pragprog.com/titles/bksqla/ www.percona.com
  69. 69. Copyright 2012 Bill Karwin www.slideshare.net/billkarwin Released under a Creative Commons 3.0 License: http://creativecommons.org/licenses/by-nc-nd/3.0/ You are free to share - to copy, distribute and transmit this work, under the following conditions: Attribution. Noncommercial. No Derivative Works. You must attribute this You may not use this work You may not alter, work to Bill Karwin. for commercial purposes. transform, or build upon this work. www.percona.com

×