Odds are if you develop database applications you have been asked to make a large table of textual data searchable. How can we do this most reliably and efficiently? Bill compares a range of techniques to search text data with MySQL, with a focus on performance.
- Ad hoc searches with the LIKE predicate or regular expressions.
- MySQL FULLTEXT index.
- InnoDB FULLTEXT index in MySQL 5.6.
- Apache Solr search engine.
- Sphinx search engine.
- Trigraphs.
2. In a full text search, the search engine examines all of the words in every stored document as it tries to match search words supplied by the user.
http://www.flickr.com/photos/tryingyouth/
3. StackOverflow Test Data
• Data dump, exported December 2011
• 7.4 million Posts = 8.18 GB
www.percona.com
6. Some people, when confronted with a problem, think
“I know, I’ll use regular expressions.”
Now they have two problems.
-- Jamie Zawinski
7. Accuracy issue
• LIKE returns irrelevant or false matches: a search for ‘one’ also matches ‘money’, ‘prone’, etc.:
SELECT * FROM Posts
WHERE Body LIKE '%one%'
• Regular expressions in MySQL support escapes
for word boundaries:
SELECT * FROM Posts
WHERE Body RLIKE '[[:<:]]one[[:>:]]'
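The difference between the two predicates can be sketched outside the database. The snippet below is a minimal Python illustration, not MySQL's implementation: the sample strings are made up, and Python's `\b` word-boundary escape stands in for MySQL's `[[:<:]]` and `[[:>:]]`.

```python
import re

# Hypothetical sample rows, standing in for Posts.Body.
bodies = [
    "I need one example",
    "money is tight",
    "prone to errors",
]

def like_match(needle, text):
    """Rough equivalent of Body LIKE '%needle%': a plain substring test."""
    return needle in text

def word_match(needle, text):
    """Rough equivalent of RLIKE '[[:<:]]needle[[:>:]]': a whole-word test."""
    return re.search(r"\b" + re.escape(needle) + r"\b", text) is not None

like_hits = [b for b in bodies if like_match("one", b)]
word_hits = [b for b in bodies if word_match("one", b)]
print(like_hits)   # every row contains the substring "one"
print(word_hits)   # only the row with the standalone word
```

The substring test accepts all three rows, while the word-boundary test keeps only the row where "one" stands alone.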
8. Performance issue
• LIKE with wildcards:
SELECT * FROM Posts
WHERE title LIKE '%performance%'
  OR body LIKE '%performance%'
  OR tags LIKE '%performance%';
time: 49 sec
• POSIX regular expressions:
SELECT * FROM Posts
WHERE title RLIKE '[[:<:]]performance[[:>:]]'
  OR body RLIKE '[[:<:]]performance[[:>:]]'
  OR tags RLIKE '[[:<:]]performance[[:>:]]';
time: 7 min 57 sec
9. Why so slow?
CREATE TABLE TelephoneBook (
  FullName VARCHAR(50));
CREATE INDEX name_idx ON TelephoneBook
  (FullName);
INSERT INTO TelephoneBook VALUES
  ('Riddle, Thomas'),
  ('Thomas, Dean');
10. Why so slow?
• Search for all with last name “Thomas” (uses index):
SELECT * FROM TelephoneBook
WHERE FullName LIKE 'Thomas%'
• Search for all with first name “Thomas” (can’t use index):
SELECT * FROM TelephoneBook
WHERE FullName LIKE '%Thomas'
11. Because: B-Tree indexes can’t search for substrings
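Why a leading wildcard defeats the index can be sketched with a sorted list, which is a simplified stand-in for the ordered keys of a B-Tree. This is an illustrative Python sketch, not how MySQL is implemented; the function names are hypothetical and the rows are the two from the TelephoneBook example.

```python
import bisect

# The FullName values kept in sorted order, as an index would keep its keys.
index = sorted(["Riddle, Thomas", "Thomas, Dean"])

def prefix_search(keys, prefix):
    """Like LIKE 'prefix%': binary search to the first candidate, then read on."""
    start = bisect.bisect_left(keys, prefix)
    hits = []
    for key in keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so no later key can start with prefix
        hits.append(key)
    return hits

def suffix_search(keys, suffix):
    """Like LIKE '%suffix': the ordering is no help, so every key is examined."""
    return [key for key in keys if key.endswith(suffix)]

last_name = prefix_search(index, "Thomas")
first_name = suffix_search(index, "Thomas")
print(last_name)
print(first_name)
```

The prefix search jumps straight to the matching range, while the suffix search has to look at every entry, which is the full scan that makes `LIKE '%...'` slow on a large table.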
12. • FULLTEXT in MyISAM
• FULLTEXT in InnoDB
• Apache Solr
• Sphinx Search
• Trigraphs
14. FULLTEXT Index with MyISAM
• Special index type for MyISAM
• Integrated with SQL queries
• Indexes always in sync with data
• Balances features vs. speed vs. space
15. Insert Data into Index (MyISAM)
mysql> INSERT INTO Posts
SELECT * FROM PostsSource;
time: 33 min, 34 sec
16. Build Index on Data (MyISAM)
mysql> CREATE FULLTEXT INDEX PostText
ON Posts(title, body, tags);
time: 31 min, 18 sec
17. Querying
SELECT * FROM Posts
WHERE MATCH(column(s))
AGAINST('query pattern');
• MATCH() must include all columns of your index, in the order you defined.
18. Natural Language Mode (MyISAM)
• Searches concepts with free text queries:
SELECT * FROM Posts
WHERE MATCH(title, body, tags)
AGAINST('mysql performance'
IN NATURAL LANGUAGE MODE)
LIMIT 100;
time with index: 200 milliseconds
20. Boolean Mode (MyISAM)
• Searches words using mini-language:
SELECT * FROM Posts
WHERE MATCH(title, body, tags)
AGAINST('+mysql +performance'
IN BOOLEAN MODE)
LIMIT 100;
time with index: 16 milliseconds
23. FULLTEXT Index with InnoDB
• Under development in MySQL 5.6
• I’m testing 5.6.6 m1
• Usage very similar to FULLTEXT in MyISAM
• Integrated with SQL queries
• Indexes always* in sync with data
• Read the blogs for more details:
• http://blogs.innodb.com/wp/2011/07/overview-and-getting-started-with-innodb-fts/
• http://blogs.innodb.com/wp/2011/07/innodb-full-text-search-tutorial/
• http://blogs.innodb.com/wp/2011/07/innodb-fts-performance/
• http://blogs.innodb.com/wp/2011/07/difference-between-innodb-fts-and-myisam-fts/
24. Insert Data into Index (InnoDB)
mysql> INSERT INTO Posts
SELECT * FROM PostsSource;
time: 55 min 46 sec
25. Build Index on Data (InnoDB)
• Still under development; you might see problems:
mysql> CREATE FULLTEXT INDEX PostText
ON Posts(title, body, tags);
ERROR 2013 (HY000): Lost connection to
MySQL server during query
26. Build Index on Data (InnoDB)
• Solution: make sure you define a primary key
column `FTS_DOC_ID` explicitly:
mysql> ALTER TABLE Posts
CHANGE COLUMN PostId
`FTS_DOC_ID` BIGINT UNSIGNED;
mysql> CREATE FULLTEXT INDEX PostText
ON Posts(title, body, tags);
time: 25 min 27 sec
27. Natural Language Mode (InnoDB)
• Searches concepts with free text queries:
SELECT * FROM Posts
WHERE MATCH(title, body, tags)
AGAINST('mysql performance'
IN NATURAL LANGUAGE MODE)
LIMIT 100;
time with index: 740 milliseconds
29. Boolean Mode (InnoDB)
• Searches words using mini-language:
SELECT * FROM Posts
WHERE MATCH(title, body, tags)
AGAINST('+mysql +performance'
IN BOOLEAN MODE)
LIMIT 100;
time with index: 350 milliseconds
32. Apache Solr
• http://lucene.apache.org/solr/
• Built on Apache Lucene, an Apache project since 2001
• Apache License
• Java implementation
• Web service architecture
• Many sophisticated search features
39. Sphinx Search
• http://sphinxsearch.com/
• Started in 2001
• GPLv2 license
• C++ implementation
• SphinxSE storage engine for MySQL
• Supports MySQL protocol, SQL-like queries
• Many sophisticated search features
46. Trigraphs Overview
• Not very fast, but still better than LIKE / RLIKE
• Generic, portable SQL solution
• No dependency on version, storage engine, third-party technology
48. Insert Data Into Index
# Read every post and write one (tri, PostId) row per distinct trigraph.
my $sth = $dbh1->prepare("SELECT * FROM Posts") or die $dbh1->errstr;
$sth->execute() or die $dbh1->errstr;
$dbh2->begin_work;
my $i = 0;
# 'NAME_lc' lower-cases the hash keys, so {postid} works whatever the column case.
while (my $row = $sth->fetchrow_hashref('NAME_lc')) {
  my $text = lc(join('|', ($row->{title}, $row->{body}, $row->{tags})));
  my %tri;
  # Matches do not overlap: each run of letters is cut into 3-letter chunks.
  map($tri{$_}=1, ( $text =~ m/[[:alpha:]]{3}/g ));
  next unless %tri;
  my $tuple_list = join(",", map("('$_',$row->{postid})", keys %tri));
  my $sql = "INSERT IGNORE INTO PostsTrigraph (tri, PostId) VALUES $tuple_list";
  $dbh2->do($sql) or die "SQL = $sql, ".$dbh2->errstr;
  # Commit in batches of 1000 posts.
  if (++$i % 1000 == 0) {
    print ".";
    $dbh2->commit;
    $dbh2->begin_work;
  }
}
print ".\n";
$dbh2->commit;
time: 116 min 50 sec
space: 16.2GiB
rows: 519 million
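The chunking step is easy to misread, so here is a minimal Python sketch of just that part. It assumes ASCII letters only (`[a-z]` after lower-casing), which is narrower than Perl's `[[:alpha:]]`, and the function name `trigraphs` is made up for illustration.

```python
import re

def trigraphs(text):
    """Return the distinct 3-letter chunks of text.

    Mirrors lc(...) followed by m/[[:alpha:]]{3}/g for ASCII input:
    matches do not overlap, so each run of letters is cut into
    consecutive chunks and a trailing remainder of one or two
    letters is dropped.
    """
    return set(re.findall(r"[a-z]{3}", text.lower()))

chunks = sorted(trigraphs("MySQL performance"))
print(chunks)
```

For "MySQL performance" this gives `mys`, `per`, `for` and `man`: the leftover `ql` and `ce` are too short to form a chunk, which is why the lookups on the following slides search for exactly those four values.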
49. Indexed Lookups
SELECT p.*
FROM Posts p
JOIN PostsTrigraph t1 ON
  t1.PostId = p.PostId AND t1.Tri = 'mys'
time: 46 sec
50. Search Among Fewer Matches
SELECT p.*
FROM Posts p
JOIN PostsTrigraph t1 ON
  t1.PostId = p.PostId AND t1.Tri = 'mys'
JOIN PostsTrigraph t2 ON
  t2.PostId = p.PostId AND t2.Tri = 'per'
time: 19 sec
51. Search Among Fewer Matches
SELECT p.*
FROM Posts p
JOIN PostsTrigraph t1 ON
  t1.PostId = p.PostId AND t1.Tri = 'mys'
JOIN PostsTrigraph t2 ON
  t2.PostId = p.PostId AND t2.Tri = 'per'
JOIN PostsTrigraph t3 ON
  t3.PostId = p.PostId AND t3.Tri = 'for'
time: 22 sec
52. Search Among Fewer Matches
SELECT p.*
FROM Posts p
JOIN PostsTrigraph t1 ON
  t1.PostId = p.PostId AND t1.Tri = 'mys'
JOIN PostsTrigraph t2 ON
  t2.PostId = p.PostId AND t2.Tri = 'per'
JOIN PostsTrigraph t3 ON
  t3.PostId = p.PostId AND t3.Tri = 'for'
JOIN PostsTrigraph t4 ON
  t4.PostId = p.PostId AND t4.Tri = 'man'
time: 13.6 sec
53. Narrow Down Further
SELECT p.*
FROM Posts p
JOIN PostsTrigraph t1 ON
  t1.PostId = p.PostId AND t1.Tri = 'mys'
JOIN PostsTrigraph t2 ON
  t2.PostId = p.PostId AND t2.Tri = 'per'
JOIN PostsTrigraph t3 ON
  t3.PostId = p.PostId AND t3.Tri = 'for'
JOIN PostsTrigraph t4 ON
  t4.PostId = p.PostId AND t4.Tri = 'man'
WHERE CONCAT(p.title,p.body,p.tags) LIKE '%mysql%'
  AND CONCAT(p.title,p.body,p.tags) LIKE '%performance%';
time: 13.8 sec
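Writing one JOIN per trigraph by hand gets tedious, so an application would probably generate the query. The following Python sketch shows one possible way to do that; it is not from the original talk. The function names are hypothetical, `%s` is assumed to be the placeholder style of the MySQL driver in use, and LIKE wildcards inside the search terms are not escaped.

```python
import re

def trigraphs(term):
    """Consecutive 3-letter chunks of a search term (ASCII letters only)."""
    return re.findall(r"[a-z]{3}", term.lower())

def build_query(terms):
    """Build the JOIN-per-trigraph query plus LIKE filters, with placeholders.

    Returns the SQL text and the parameter list in placeholder order.
    """
    sql = ["SELECT p.*", "FROM Posts p"]
    params = []
    n = 0
    # One self-join on PostsTrigraph per trigraph narrows the candidate posts.
    for term in terms:
        for tri in trigraphs(term):
            n += 1
            sql.append(f"JOIN PostsTrigraph t{n} ON")
            sql.append(f"  t{n}.PostId = p.PostId AND t{n}.Tri = %s")
            params.append(tri)
    # The LIKE filters then confirm the full words among the survivors.
    filters = []
    for term in terms:
        filters.append("CONCAT(p.title,p.body,p.tags) LIKE %s")
        params.append(f"%{term}%")
    if filters:
        sql.append("WHERE " + "\n  AND ".join(filters))
    return "\n".join(sql), params

query, params = build_query(["mysql", "performance"])
print(query)
print(params)
```

Passing the values as parameters rather than splicing them into the SQL string keeps user-supplied search words from being interpreted as SQL.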
54. And the winner is...
Jarrett Campbell
http://www.flickr.com/people/77744839@N00
55. Time to Insert Data into Index
LIKE expression    n/a
FULLTEXT MyISAM    33 min, 34 sec
FULLTEXT InnoDB    55 min, 46 sec
Apache Solr        14 min, 28 sec
Sphinx Search      8 min, 20 sec
Trigraphs          116 min, 50 sec
56. Insert Data into Index (sec)
[Bar chart; x-axis: LIKE, MyISAM, InnoDB, Solr, Sphinx, Trigraph; y-axis: 0 to 8000 sec]
57. Time to Build Index on Data
LIKE expression    n/a
FULLTEXT MyISAM    31 min, 18 sec
FULLTEXT InnoDB    25 min, 27 sec
Apache Solr        n/a
Sphinx Search      n/a
Trigraphs          n/a
58. Build Index on Data (sec)
[Bar chart; x-axis: LIKE, MyISAM, InnoDB, Solr, Sphinx, Trigraph; y-axis: 0 to 4000 sec; some bars marked n/a]
64. Bottom Line
                  build   insert   storage    query         solution
LIKE expression   0       0        0          49k-399k ms   SQL
FULLTEXT MyISAM   31:18   33:34    2382MiB    16-200ms      MySQL
FULLTEXT InnoDB   25:27   55:46    ?          350-740ms     MySQL 5.6
Apache Solr       n/a     14:28    2766MiB    79ms          Java
Sphinx Search     n/a     8:20     3487MiB    13ms          C++
Trigraphs         n/a     116:50   16.2 GiB   13,800ms      SQL
65. Final Thoughts
• Third-party search engines are complex to keep in
sync with data, and adding another type of
server adds more operations work for you.
• Built-in FULLTEXT indexes are therefore useful
even if they are not absolutely the fastest.
• Different search implementations may return
different results, so you should evaluate what
works best for your project.
• Any indexed search solution is orders of
magnitude better than LIKE!
69. Copyright 2012 Bill Karwin
www.slideshare.net/billkarwin
Released under a Creative Commons 3.0 License:
http://creativecommons.org/licenses/by-nc-nd/3.0/
You are free to share - to copy, distribute and
transmit this work, under the following conditions:
Attribution. You must attribute this work to Bill Karwin.
Noncommercial. You may not use this work for commercial purposes.
No Derivative Works. You may not alter, transform, or build upon this work.