Full Text Search Throwdown
Upcoming SlideShare
Loading in...5
×
 

Full Text Search Throwdown

on

  • 49,440 views

Odds are if you develop database applications you have been asked to make a large table of textual data searchable. How can we do this most reliably and efficiently? Bill compares a range of ...

Odds are if you develop database applications you have been asked to make a large table of textual data searchable. How can we do this most reliably and efficiently? Bill compares a range of techniques to search text data with MySQL, with a focus on performance.
- Ad hoc searches with the LIKE predicate or regular expressions.
- MySQL FULLTEXT index.
- InnoDB FULLTEXT index in MySQL 5.6.
- Apache Solr search engine
- Sphinx search engine.
- Trigraphs.

Statistics

Views

Total Views
49,440
Views on SlideShare
47,633
Embed Views
1,807

Actions

Likes
61
Downloads
976
Comments
8

28 Embeds 1,807

http://misnotaslinux.blogspot.com 659
http://misnotaslinux.blogspot.com.es 460
http://misnotaslinux.blogspot.mx 338
http://www.slideshare.net 167
http://misnotaslinux.blogspot.com.ar 119
http://itnlu.info 9
http://www.misnotaslinux.blogspot.com 8
http://a0.twimg.com 8
http://misnotaslinux.blogspot.com.br 7
https://twitter.com 4
http://www.techgig.com 4
http://www.linkedin.com 3
http://misnotaslinux.blogspot.ca 2
http://misnotaslinux.blogspot.co.at 2
http://www.misnotaslinux.blogspot.com.es 2
http://misnotaslinux.blogspot.pt 2
http://translate.googleusercontent.com 2
http://misnotaslinux.blogspot.de 1
http://www.lmodules.com 1
http://misnotaslinux.blogspot.co.uk 1
file:// 1
http://misnotaslinux.blogspot.fr 1
http://misnotaslinux.blogspot.cz 1
http://misnotaslinux.blogspot.jp 1
http://paper.li 1
https://twimg0-a.akamaihd.net 1
https://si0.twimg.com 1
http://us-w1.rockmelt.com 1
More...

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

CC Attribution-NonCommercial-NoDerivs LicenseCC Attribution-NonCommercial-NoDerivs LicenseCC Attribution-NonCommercial-NoDerivs License

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel

18 of 8 Post a comment

  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
  • I can't find the definition of PostsTrigraph in the slideshow. The columns are obvious of course, but I'm interested on the indexing you used.
    Are you sure you want to
    Your message goes here
    Processing…
  • Best one
    Hope you are in good health. My name is AMANDA . I am a single girl, Am looking for reliable and honest person. please have a little time for me. Please reach me back amanda_n14144@yahoo.com so that i can explain all about myself .
    Best regards AMANDA.
    amanda_n14144@yahoo.com
    Are you sure you want to
    Your message goes here
    Processing…
  • Not Work with Arabic (utf8_general_ci ) Text ??!!
    Help Please
    Are you sure you want to
    Your message goes here
    Processing…
  • thank you guys
    Are you sure you want to
    Your message goes here
    Processing…
  • Very good. really helpful for me.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    Full Text Search Throwdown Full Text Search Throwdown Presentation Transcript

    • Full Text Search Throwdown Bill Karwin, Percona Inc.
    • In a full text search, the search engine examines all of the words in every stored document as it tries to match search words supplied by the user.http://www.flickr.com/photos/tryingyouth/
    • StackOverflow Test Data• Data dump, exported December 2011• 7.4 million Posts = 8.18 GB www.percona.com
    • StackOverflow ER diagramsearchable text www.percona.com
    • The Baseline:Naive Search Predicates
    • Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems. — Jamie Zawinsky www.percona.com
    • Accuracy issue• Irrelevant or false matching words ‘one’, ‘money’, ‘prone’, etc.: SELECT * FROM Posts WHERE Body LIKE %one%• Regular expressions in MySQL support escapes for word boundaries: SELECT * FROM Posts WHERE Body RLIKE [[:<:]]one[[:>:]] www.percona.com
    • Performance issue• LIKE with wildcards: ! SELECT * FROM Posts 49 sec WHERE title LIKE %performance% ! OR body LIKE %performance% ! OR tags LIKE %performance%;• POSIX regular expressions: ! SELECT * FROM Posts 7 min 57 sec WHERE title RLIKE [[:<:]]performance[[:>:]] ! OR body RLIKE [[:<:]]performance[[:>:]] ! OR tags RLIKE [[:<:]]performance[[:>:]]; www.percona.com
    • Why so slow?CREATE TABLE TelephoneBook ( ! FullName VARCHAR(50));CREATE INDEX name_idx ON TelephoneBook ! (FullName);INSERT INTO TelephoneBook VALUES ! (Riddle, Thomas), ! (Thomas, Dean); www.percona.com
    • Why so slow?• Search for all with last name “Thomas” uses index SELECT * FROM telephone_book WHERE full_name LIKE Thomas%• Search for all with first name “Thomas” SELECT * FROM telephone_book WHERE full_name LIKE %Thomas can’t use index www.percona.com
    • Because: B-Tree indexes can’t☞ search for substrings www.percona.com
    • • FULLTEXT in MyISAM• FULLTEXT in InnoDB• Apache Solr• Sphinx Search• Trigraphs
    • FULLTEXTin MyISAM
    • FULLTEXT Index with MyISAM• Special index type for MyISAM• Integrated with SQL queries• Indexes always in sync with data• Balances features vs. speed vs. space www.percona.com
    • Insert Data into Index (MyISAM)mysql> INSERT INTO Posts SELECT * FROM PostsSource; time: 33 min, 34 sec www.percona.com
    • Build Index on Data (MyISAM)mysql> CREATE FULLTEXT INDEX PostText ! ON Posts(title, body, tags); time: 31 min, 18 sec www.percona.com
    • QueryingSELECT * FROM Posts WHERE MATCH(column(s)) AGAINST(query pattern); must include all columns of your index, in the order you defined www.percona.com
    • Natural Language Mode (MyISAM)• Searches concepts with free text queries: ! SELECT * FROM Posts WHERE MATCH(title, body, tags ) AGAINST(mysql performance IN NATURAL LANGUAGE MODE) LIMIT 100; time with index: 200 milliseconds www.percona.com
    • Query Profile: Natural Language Mode (MyISAM)+-------------------------+----------+| Status | Duration |+-------------------------+----------+| starting | 0.000068 || checking permissions | 0.000006 || Opening tables | 0.000017 || init | 0.000032 || System lock | 0.000007 || optimizing | 0.000007 || statistics | 0.000018 || preparing | 0.000006 || FULLTEXT initialization | 0.198358 || executing | 0.000012 || Sending data | 0.001921 || end | 0.000005 || query end | 0.000003 || closing tables | 0.000018 || freeing items | 0.000341 || cleaning up | 0.000012 |+-------------------------+----------+ www.percona.com
    • Boolean Mode (MyISAM)• Searches words using mini-language: ! SELECT * FROM Posts WHERE MATCH(title, body, tags) AGAINST(+mysql +performance IN BOOLEAN MODE) LIMIT 100; time with index: 16 milliseconds www.percona.com
    • Query Profile: Boolean Mode (MyISAM)+-------------------------+----------+| Status | Duration |+-------------------------+----------+| starting | 0.000031 || checking permissions | 0.000003 || Opening tables | 0.000008 || init | 0.000017 || System lock | 0.000004 || optimizing | 0.000004 || statistics | 0.000008 || preparing | 0.000003 || FULLTEXT initialization | 0.000008 || executing | 0.000001 || Sending data | 0.015703 || end | 0.000004 || query end | 0.000002 || closing tables | 0.000007 || freeing items | 0.000381 || cleaning up | 0.000007 |+-------------------------+----------+ www.percona.com
    • FULLTEXTin InnoDB
    • FULLTEXT Index with InnoDB• Under development in MySQL 5.6 • I’m testing 5.6.6 m1• Usage very similar to FULLTEXT in MyISAM• Integrated with SQL queries• Indexes always* in sync with data• Read the blogs for more details: • http://blogs.innodb.com/wp/2011/07/overview-and-getting-started-with-innodb-fts/ • http://blogs.innodb.com/wp/2011/07/innodb-full-text-search-tutorial/ • http://blogs.innodb.com/wp/2011/07/innodb-fts-performance/ • http://blogs.innodb.com/wp/2011/07/difference-between-innodb-fts-and-myisam-fts/ www.percona.com
    • Insert Data into Index (InnoDB)mysql> INSERT INTO Posts SELECT * FROM PostsSource; time: 55 min 46 sec www.percona.com
    • Build Index on Data (InnoDB)• Still under development; you might see problems:mysql> CREATE FULLTEXT INDEX PostText ! ON Posts(title, body, tags);ERROR 2013 (HY000): Lost connection to MySQL server during query www.percona.com
    • Build Index on Data (InnoDB)• Solution: make sure you define a primary key column `FTS_DOC_ID` explicitly:mysql> ALTER TABLE Posts CHANGE COLUMN PostId `FTS_DOC_ID` BIGINT UNSIGNED;mysql> CREATE FULLTEXT INDEX PostText ! ON Posts(title, body, tags); time: 25 min 27 sec www.percona.com
    • Natural Language Mode (InnoDB)• Searches concepts with free text queries: ! SELECT * FROM Posts WHERE MATCH(title, body, tags) AGAINST(mysql performance IN NATURAL LANGUAGE MODE) LIMIT 100; time with index: 740 milliseconds www.percona.com
    • Query Profile: Natural Language Mode (InnoDB)+-------------------------+----------+| Status | Duration |+-------------------------+----------+| starting | 0.000074 || checking permissions | 0.000007 || Opening tables | 0.000020 || init | 0.000034 || System lock | 0.000007 || optimizing | 0.000009 || statistics | 0.000020 || preparing | 0.000008 || FULLTEXT initialization | 0.577257 || executing | 0.000013 || Sending data | 0.106279 || end | 0.000018 || query end | 0.000012 || closing tables | 0.000018 || freeing items | 0.055584 || cleaning up | 0.000039 |+-------------------------+----------+ www.percona.com
    • Boolean Mode (InnoDB)• Searches words using mini-language: ! SELECT * FROM Posts WHERE MATCH(title, body, tags) AGAINST(+mysql +performance IN BOOLEAN MODE) LIMIT 100; time with index: 350 milliseconds www.percona.com
    • Query Profile: Boolean Mode (InnoDB)+-------------------------+----------+| Status | Duration |+-------------------------+----------+| starting | 0.000064 || checking permissions | 0.000005 || Opening tables | 0.000017 || init | 0.000047 || System lock | 0.000007 || optimizing | 0.000009 || statistics | 0.000019 || preparing | 0.000008 || FULLTEXT initialization | 0.347172 || executing | 0.000014 || Sending data | 0.008089 || end | 0.000011 || query end | 0.000012 || closing tables | 0.000015 || freeing items | 0.001570 || cleaning up | 0.000023 |+-------------------------+----------+ www.percona.com
    • Apache Solr
    • Apache Solr• http://lucene.apache.org/solr/• Formerly known as Lucene, started 2001• Apache License• Java implementation• Web service architecture• Many sophisticated search features www.percona.com
    • DataImportHandler• conf/solrconfig.xml:. . .<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler"> <lst name="defaults"> <str name="config">data-config.xml</str> </lst></requestHandler>. . . www.percona.com
    • DataImportHandler• conf/data-config.xml: <dataConfig> <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/testpattern?useUnicode=true" batchSize="-1" user="xxxx" password="xxxx"/> <document> <entity name="id" query="SELECT PostId, ParentId, Title, Body, Tags FROM Posts"> </entity> </document> </dataConfig> extremely important to avoid buffering the whole query result! www.percona.com
    • DataImportHandler•conf/schema.xml:. . .<fields> <field name="PostId" type="string" indexed="true" stored="true" required="true" /> <field name="ParentId" type="string" indexed="true" stored="true" required="false" /> <field name="Title" type="text_general" indexed="false" stored="false" required="false" /> <field name="Body" type="text_general" indexed="false" stored="false" required="false" / > <field name="Tags" type="text_general" indexed="false" stored="false" required="false" / > <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/ ><fields><uniqueKey>PostId</uniqueKey><defaultSearchField>text</defaultSearchField><copyField source="Title" dest="text"/><copyField source="Body" dest="text"/><copyField source="Tags" dest="text"/>. . . www.percona.com
    • Insert Data into Index (Solr)• http://localhost:8983/solr/dataimport? command=full-import time: 14 min 28 sec www.percona.com
    • Searching Solr• http://localhost:8983/solr/select/?q=mysql+AND +performance time: 79ms Query results are cached (like MySQL Query Cache), so they return much faster on subsequent execution www.percona.com
    • Sphinx Search
    • Sphinx Search• http://sphinxsearch.com/• Started in 2001• GPLv2 license• C++ implementation• SphinxSE storage engine for MySQL• Supports MySQL protocol, SQL-like queries• Many sophisticated search features www.percona.com
    • sphinx.confsource src1{! type = mysql! sql_host = localhost! sql_user = xxxx! sql_pass = xxxx! sql_db = testpattern! sql_query = SELECT PostId, ParentId, Title,! ! Body, Tags FROM Posts! sql_query_info = SELECT * FROM Posts ! ! WHERE PostId=$id} www.percona.com
    • sphinx.conf! index test1 { ! source = src1 ! path = C:Sphinxdata } www.percona.com
    • Insert Data into Index (Sphinx)! C:Sphinx> indexer.exe -c sphinx.conf.in --verbose test1 Sphinx 2.0.5-release (r3309) using config file sphinx.conf... indexing index test1... collected 7397507 docs, 5731.8 MB time: 8 min 20 sec sorted 920.3 Mhits, 100.0% done total 7397507 docs, 5731776959 bytes total 500.149 sec, 11460138 bytes/sec, 14790.60 docs/sec total 11 reads, 15.898 sec, 314584.8 kb/call avg, 1445.3 msec/call avg total 542 writes, 3.129 sec, 12723.3 kb/call avg, 5.7 msec/ call avg Execution time: 500.196 s www.percona.com
    • Querying index$ mysql --port 9306Server version: 2.0.5-release (r3309)mysql> SELECT * FROM test1 WHERE MATCH(mysql performance);+---------+--------+| id | weight |+---------+--------+| 6016856 | 6600 || 4207641 | 6595 || 2656325 | 6593 || 7192928 | 5605 || 8118235 | 5598 |. . .20 rows in set (0.02 sec) www.percona.com
    • Querying indexmysql> SHOW META;+---------------+-------------+| Variable_name | Value |+---------------+-------------+| total | 1000 || total_found | 7672 || time | 0.013 || keyword[0] | mysql || docs[0] | 162287 | time: 13ms| hits[0] | 363694 || keyword[1] | performance || docs[1] | 147249 || hits[1] | 210895 |+---------------+-------------+ www.percona.com
    • Trigraphs
    • Trigraphs Overview• Not very fast, but still better than LIKE / RLIKE• Generic, portable SQL solution• No dependency on version, storage engine, third- party technology www.percona.com
    • Three-Letter Sequences! CREATE TABLE AtoZ ( ! c! ! CHAR(1), ! PRIMARY KEY (c));! INSERT INTO AtoZ (c) VALUES (a), (b), (c), ...! CREATE TABLE Trigraphs ( ! Tri! ! CHAR(3), ! PRIMARY KEY (Tri));! INSERT INTO Trigraphs (Tri) SELECT CONCAT(t1.c, t2.c, t3.c) FROM AtoZ t1 JOIN AtoZ t2 JOIN AtoZ t3; www.percona.com
    • Insert Data Into Indexmy $sth = $dbh1->prepare("SELECT * FROM Posts") or die $dbh1->errstr;$sth->execute() or die $dbh1->errstr;$dbh2->begin_work;my $i = 0;while (my $row = $sth->fetchrow_hashref ) { my $text = lc(join(|, ($row->{title}, $row->{body}, $row->{tags}))); my %tri; map($tri{$_}=1, ( $text =~ m/[[:alpha:]]{3}/g )); next unless %tri; my $tuple_list = join(",", map("($_,$row->{postid})", keys %tri)); my $sql = "INSERT IGNORE INTO PostsTrigraph (tri, PostId) VALUES $tuple_list"; $dbh2->do($sql) or die "SQL = $sql, ".$dbh2->errstr; if (++$i % 1000 == 0) { print "."; $dbh2->commit; $dbh2->begin_work; time: 116 min 50 sec} } space: 16.2GiBprint ".n";$dbh2->commit; rows: 519 million www.percona.com
    • Indexed Lookups! SELECT p.* time: 46 sec FROM Posts p JOIN PostsTrigraph t1 ON ! t1.PostId = p.PostId AND t1.Tri = mys ! www.percona.com
    • Search Among Fewer Matches! SELECT p.* time: 19 sec FROM Posts p JOIN PostsTrigraph t1 ON ! t1.PostId = p.PostId AND t1.Tri = mys JOIN PostsTrigraph t2 ON ! t2.PostId = p.PostId AND t2.Tri = per www.percona.com
    • Search Among Fewer Matches! SELECT p.* time: 22 sec FROM Posts p JOIN PostsTrigraph t1 ON ! t1.PostId = p.PostId AND t1.Tri = mys JOIN PostsTrigraph t2 ON ! t2.PostId = p.PostId AND t2.Tri = per JOIN PostsTrigraph t3 ON ! t3.PostId = p.PostId AND t3.Tri = for www.percona.com
    • Search Among Fewer Matches! SELECT p.* time: 13.6 sec FROM Posts p JOIN PostsTrigraph t1 ON ! t1.PostId = p.PostId AND t1.Tri = mys JOIN PostsTrigraph t2 ON ! t2.PostId = p.PostId AND t2.Tri = per JOIN PostsTrigraph t3 ON ! t3.PostId = p.PostId AND t3.Tri = for JOIN PostsTrigraph t4 ON ! t4.PostId = p.PostId AND t4.Tri = man www.percona.com
    • Narrow Down Further! SELECT p.* time: 13.8 sec FROM Posts p JOIN PostsTrigraph t1 ON ! t1.PostId = p.PostId AND t1.Tri = mys JOIN PostsTrigraph t2 ON ! t2.PostId = p.PostId AND t2.Tri = per JOIN PostsTrigraph t3 ON ! t3.PostId = p.PostId AND t3.Tri = for JOIN PostsTrigraph t4 ON ! t4.PostId = p.PostId AND t4.Tri = man WHERE CONCAT(p.title,p.body,p.tags) LIKE %mysql% ! AND CONCAT(p.title,p.body,p.tags) LIKE %performance%; www.percona.com
    • And the winner is... Jarrett Campbell http://www.flickr.com/people/77744839@N00
    • Time to Insert Data into IndexLIKE expression n/aFULLTEXT MyISAM 33 min, 34 secFULLTEXT InnoDB 55 min, 46 secApache Solr 14 min, 28 secSphinx Search 8 min, 20 secTrigraphs 116 min, 50 sec www.percona.com
    • Insert Data into Index (sec)8000600040002000 0 LIKE MyISAM InnoDB Solr Sphinx Trigraph www.percona.com
    • Time to Build Index on DataLIKE expression n/aFULLTEXT MyISAM 31 min, 18 secFULLTEXT InnoDB 25 min, 27 secApache Solr n/aSphinx Search n/aTrigraphs n/a www.percona.com
    • Build Index on Data (sec)4000300020001000 n/a n/a n/a 0 LIKE MyISAM InnoDB Solr Sphinx Trigraph www.percona.com
    • Index StorageLIKE expression n/aFULLTEXT MyISAM 2382 MiBFULLTEXT InnoDB ? MiBApache Solr 2766 MiBSphinx Search 3355 MiBTrigraphs 16589 MiB www.percona.com
    • Index Storage (MiB)200001500010000 5000 0 LIKE MyISAM InnoDB Solr Sphinx Trigraph www.percona.com
    • Query SpeedLIKE expression 49,000ms - 399,000msFULLTEXT MyISAM 16-200msFULLTEXT InnoDB 350-740msApache Solr 79msSphinx Search 13msTrigraphs 13800ms www.percona.com
    • Query Speed (ms)400000350000300000250000200000150000100000 50000 0 LIKE MyISAM InnoDB Solr Sphinx Trigraph www.percona.com
    • 1000 Query Speed (ms) 750 500 250 0 LIKE MyISAM InnoDB Solr Sphinx Trigraph www.percona.com
    • Bottom Line build insert storage query solution 49k-399kLIKE expression 0 0 0 ms SQLFULLTEXT MyISAM 31:18 33:28 2382MiB 16-200ms MySQLFULLTEXT InnoDB 25:27 55:46 ? 350-740ms MySQL 5.6Apache Solr n/a 14:28 2766MiB 79ms JavaSphinx Search n/a 8:20 3487MiB 13ms C++Trigraphs n/a 116:50 16.2 GiB 13,800ms SQL www.percona.com
    • Final Thoughts• Third-party search engines are complex to keep in sync with data, and adding another type of server adds more operations work for you.• Built-in FULLTEXT indexes are therefore useful even if they are not absolutely the fastest.• Different search implementations may return different results, so you should evaluate what works best for your project.• Any indexed search solution is orders of magnitude better than LIKE! www.percona.com
    • New York, October 1-2, 2012London, December 3-4, 2012Santa Clara, April 22-25, 2013 www.percona.com/live
    • Expert instructors In-person training Custom onsite training Live virtual traininghttp://www.percona.com/training www.percona.com
    • http://www.pragprog.com/titles/bksqla/ www.percona.com
    • Copyright 2012 Bill Karwin www.slideshare.net/billkarwin Released under a Creative Commons 3.0 License: http://creativecommons.org/licenses/by-nc-nd/3.0/ You are free to share - to copy, distribute and transmit this work, under the following conditions: Attribution. Noncommercial. No Derivative Works.You must attribute this You may not use this work You may not alter, work to Bill Karwin. for commercial purposes. transform, or build upon this work. www.percona.com