Sphinx Full Text Search Server

Sphinx is a standalone, full-text search daemon that allows advanced searching over large collections of blocks of text, either from a database or as documents on a file system. Sphinx can scale to billions of documents while still providing sub-second results to boolean queries, wildcards and other advanced search features. I cover basic setup, building a simple index, and demonstrate how to use SQL queries to retrieve results through its API.

Usage Rights: CC Attribution-NonCommercial-ShareAlike License

    Sphinx Full Text Search Server: Presentation Transcript

    • Search Server
      Sphinx is an open source full text search server, designed from the ground up with performance, relevance (a.k.a. search quality), and integration simplicity in mind.
      • Craigslist serves 200 million queries/day
      • Used by Slashdot, Mozilla, Meetup
      • Scales to billions of documents (distributed)
      • Supports almost any data source (SQL, XML, etc.)
      • Batch and real-time indexes
      By Andrew Kandels
    • What is a Search Server?
      Sphinx is like a database because…
      • It has a schema
      • It has field types (integer, boolean, strings, dates)
      • It responds to queries (SQL, API):
          SELECT * FROM Books WHERE MATCH('a rose by any other name')
    • Documents
      Sphinx indexes data from just about any source.
      From a SQL query:
          SELECT CONCAT(a.first_name, ' ', a.last_name) AS full_name,
                 COUNT(b.book_id) AS num_books,
                 MIN(b.publish_date) AS first_published
          FROM author a
          INNER JOIN book b ON a.author_id = b.author_id
          GROUP BY a.author_id
      From an XML document:
          <?xml version="1.0"?>
          <author>
            <id>1433</id>
            <name>Mark Twain</name>
            <books>
              <book>A Connecticut Yankee in King Arthur’s Court</book>
            </books>
          </author>
    • How it Works
      Sphinx parses plain text queries and answers with rows.
      Search:
          @author_id 15 "Mark Twain" king << arthur
      Results:
          1. document=1433, weight=1692, createdAt=Jan 1 1889
    • Relevance
      Only the strongest will survive, but relevance is in the eye of the beholder. Some factors include:
      • How many times did our keywords match?
      • How many times did they repeat in the query?
      • How frequently do keywords appear?
      • Do keywords in the document appear in the same order as the query?
      • Did we match exactly, or is it a stemmed match?
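A few of the factors above can be combined into a toy score. This is a minimal sketch for illustration only, not Sphinx's ranker: the function name `toy_score` and the weight of 2 for the in-order bonus are assumptions made up for this example.

```python
def toy_score(document, query):
    """Toy relevance: keyword hit count plus a bonus for matching word order."""
    doc_words = document.lower().split()
    query_words = query.lower().split()

    # Factor 1: how many times did the query keywords match in the document?
    hits = sum(doc_words.count(word) for word in set(query_words))

    # Factor 2: longest run of words that appears in the same order
    # in both the document and the query.
    longest = 0
    for i in range(len(doc_words)):
        for j in range(len(query_words)):
            k = 0
            while (i + k < len(doc_words) and j + k < len(query_words)
                   and doc_words[i + k] == query_words[j + k]):
                k += 1
            longest = max(longest, k)

    # The weight of 2 is arbitrary (an assumption for this sketch).
    return hits + 2 * longest


print(toy_score("mark twain wrote this", "mark twain"))
print(toy_score("twain mark wrote this", "mark twain"))
```

Both documents contain the same keywords the same number of times, but the first keeps them in query order and so scores higher.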
    • B-Tree Index
      User:
          First Name  Last Name  City        State  Notes
          Allison     Janney     Baltimore   MA     Cregg
          John        Spencer    Des Moines  IA     McGarry
          Bradley     Whitford   Newport     VA     Lyman
          Martin      Sheen      Seattle     WA     Bartlett
          Janel       Moloney    Hollywood   CA     Moss
          Richard     Schiff     Lincoln     NE     Ziegler
      Index (Last Name (4)):
          Row #  Contents
          1      Jann
          5      Molo
          6      Schi
          4      Shee
          2      Spen
          3      Whit
      A B-tree is a tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic time.
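The prefix index on the slide can be approximated with a sorted list and binary search. This is a rough sketch, not a real B-tree: a sorted Python list searched with `bisect` stands in for the tree, and it assumes the user rows are numbered from 1 in the order shown. The names `build_prefix_index` and `lookup` are hypothetical.

```python
import bisect

# Rows of the user table; row numbers start at 1 in the order shown (assumption).
USERS = [
    ("Allison", "Janney"),
    ("John", "Spencer"),
    ("Bradley", "Whitford"),
    ("Martin", "Sheen"),
    ("Janel", "Moloney"),
    ("Richard", "Schiff"),
]

PREFIX_LEN = 4  # mirrors the "Last Name (4)" index


def build_prefix_index(rows):
    """Return a sorted list of (prefix, row_number) pairs."""
    entries = [(last[:PREFIX_LEN], number)
               for number, (_first, last) in enumerate(rows, start=1)]
    return sorted(entries)


def lookup(index, last_name):
    """Binary-search the sorted index, then confirm candidates in the table."""
    prefix = last_name[:PREFIX_LEN]
    pos = bisect.bisect_left(index, (prefix, 0))
    matches = []
    while pos < len(index) and index[pos][0] == prefix:
        row_number = index[pos][1]
        # The index holds only a prefix, so check the full value in the row.
        if USERS[row_number - 1][1] == last_name:
            matches.append(row_number)
        pos += 1
    return matches


INDEX = build_prefix_index(USERS)
print(INDEX)
print(lookup(INDEX, "Sheen"))  # -> [4]
```

The sorted entries come out in the same order as the index table above, and a lookup needs only a logarithmic number of comparisons to find the first candidate.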
    • Logical Queries
      Logical conditions return a boolean result based on an expression:
          country = "United States"
          AND num_published >= 50
          AND (author_id = 5 OR author_id = 8 OR author_id = 10)
      Logical queries can be complex and typically evaluate based on the whole value of a column.
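    • Stemming
      Stemming (a.k.a. morphology) is the process of reducing inflected or derived words to their stem, base or root form.
      For example, "fishing", "fished" and "fisher" all reduce to the stem "fish".
      Synonyms are a separate matter: "dove" and "pigeon" are different words that can mean the same thing, and stemming alone does not connect them.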
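To make the idea concrete, here is a deliberately naive suffix-stripping stemmer. It is only a sketch: it is not the algorithm Sphinx uses, the suffix list is an assumption chosen for this example, and it does not handle irregular forms such as "caught".

```python
# Suffixes to strip, longest first (an illustrative list, not a real stemmer's rules).
SUFFIXES = ("ing", "ed", "er", "es", "s")


def naive_stem(word):
    """Strip one known suffix if at least three characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for word in ["fish", "fishing", "fished", "fisher", "catches", "dove", "pigeon"]:
    print(word, "->", naive_stem(word))
```

All the "fish" variants collapse to one stem, while "dove" and "pigeon" stay distinct, which is why synonyms need a different mechanism than stemming.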
    • Tokenizing
      Sphinx breaks down documents into keywords. This is called tokenization.
      Word breaker characters allow exception cases for keywords like AT&T, C++ or T-Mobile.
      Short words are ignored (by default, words shorter than 3 characters), but a placeholder is saved to support proximity and phrase searching.
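The behaviour described above can be sketched in a few lines. This is an illustration under stated assumptions, not Sphinx's tokenizer: the exception list, the `MIN_WORD_LEN` constant and the use of `None` as the placeholder are all choices made for this example.

```python
import re

# Keywords that must survive as-is even though they contain word breakers.
EXCEPTIONS = ["at&t", "c++", "t-mobile"]
MIN_WORD_LEN = 3  # the threshold stated on the slide

# Try the exceptions first, then fall back to plain alphanumeric runs.
TOKEN_RE = re.compile(
    "|".join(re.escape(e) for e in EXCEPTIONS) + r"|[a-z0-9]+"
)


def tokenize(text):
    """Return (position, keyword) pairs; short words keep their position as None."""
    tokens = []
    for position, match in enumerate(TOKEN_RE.finditer(text.lower()), start=1):
        word = match.group(0)
        if len(word) < MIN_WORD_LEN and word not in EXCEPTIONS:
            tokens.append((position, None))  # placeholder keeps positions intact
        else:
            tokens.append((position, word))
    return tokens


print(tokenize("A man caught a fish"))
print(tokenize("I use C++ on T-Mobile"))
```

Keeping the placeholder means "man" stays at position 2 and "fish" at position 5, so phrase and proximity queries can still measure distances correctly.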
    • Full Text Index Inversion
      Document:
          A man caught a fish
      Index (Full Text):
          [spacer]
          man, person, human, being
          caught, catch, catcher, catching, catches
          [spacer]
          fish, fishing, fished, fisher
      Metadata:
          man     2  1
          caught  3  1
          fish    5  1
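A small inverted index shows how the metadata above can be derived. This is a sketch, not Sphinx's on-disk format: the function name, the second sample document and the dictionary layout (keyword to document id to positions) are assumptions for illustration.

```python
from collections import defaultdict

MIN_WORD_LEN = 3  # short words are skipped but still consume a position


def build_inverted_index(documents):
    """Map each keyword to {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split(), start=1):
            if len(word) < MIN_WORD_LEN:
                continue  # nothing stored, but the position counter advances
            index[word].setdefault(doc_id, []).append(position)
    return dict(index)


DOCS = {
    1: "A man caught a fish",
    2: "The fish got away",  # second document added for illustration
}
INDEX = build_inverted_index(DOCS)
print(INDEX["man"])   # -> {1: [2]}
print(INDEX["fish"])  # -> {1: [5], 2: [2]}
```

Looking up a keyword now costs one dictionary access instead of a scan over every document, which is the point of inverting the index.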
    • Full Text Queries
      Searches multiple columns or within contents in columns, also known as Keyword Searching.
          Boolean Search           fiction AND (Twain OR Dickens)
          Phrase Search            "Mark Twain"
          Field-Based Search       @author_id 15
          Proximity Search         "fear itself"~2, fear << itself
          Substring Search         @author[4] Mark
          Quorum Search            "the world is a wonderful place"/3
          Same Sentence/Paragraph  fear SENTENCE itself
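Three of these query types are easy to express over a tokenized document. The functions below are a rough sketch of the matching rules, assuming whitespace tokenization; they are hypothetical helpers, not part of any Sphinx API.

```python
def words(text):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return text.lower().split()


def matches_phrase(document, phrase):
    """Phrase search: the words must appear adjacent and in order."""
    doc, wanted = words(document), words(phrase)
    return any(doc[i:i + len(wanted)] == wanted
               for i in range(len(doc) - len(wanted) + 1))


def matches_quorum(document, keywords, threshold):
    """Quorum search: at least `threshold` distinct keywords must be present."""
    present = set(words(document))
    return sum(1 for k in set(words(keywords)) if k in present) >= threshold


def matches_strict_order(document, first, second):
    """Order operator (first << second): first occurs somewhere before second."""
    doc = words(document)
    return first in doc and second in doc[doc.index(first) + 1:]


QUOTE = "the only thing we have to fear is fear itself"
print(matches_phrase(QUOTE, "fear itself"))
print(matches_quorum("the world is a stage", "the world is a wonderful place", 3))
print(matches_strict_order(QUOTE, "itself", "fear"))
```

A real engine answers these from the positions stored in the inverted index rather than rescanning the text, but the matching rules are the same.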
    • Getting Sphinx
      Download it from http://www.sphinxsearch.com (RPM, DEB, Tarball)
    • Important Files and Binaries
      A successful Sphinx installation will yield the following:
          searchd      The search daemon; answers queries
          indexer      Collects documents and builds the index
          search       Performs a search (useful for debugging)
          sphinx.conf  Defines your data and configures your indexes and daemon
    • sphinx.conf
      Defaults to /etc/sphinx/sphinx.conf, but can exist anywhere.
      It can even be executable:
          #!/usr/bin/env php
          source mysource
          {
              type     = mysql
              sql_host = <?php echo DB_HOST; ?>
          }
    • sphinx.conf Blocks
      The contents of sphinx.conf consist of several named blocks:
          source   Defines your data source and queries
          index    Defines which sources to index for searching
          indexer  Configures the indexer utility
          searchd  Configures the search daemon
    • Source
      Define the connection to your database and query in the source block.
          source filmssource
          {
              type            = mysql
              sql_host        = localhost
              sql_user        = root
              sql_pass        =
              sql_db          = sakila
              sql_query       = \
                  SELECT f.film_id, f.title, f.description, f.release_year, \
                         f.rating, l.name AS language \
                  FROM film f \
                  INNER JOIN language l ON l.language_id = f.language_id
              sql_attr_uint   = release_year
              sql_attr_string = rating
              sql_attr_string = language
          }
    • Index
      Define which sources to include and index parameters:
          index films
          {
              source         = filmssource
              charset_type   = utf-8
              path           = /home/andrew/sphinx/films
              stopwords      = /home/andrew/sphinx/stopwords.txt
              enable_star    = 1
              min_word_len   = 2
              min_prefix_len = 0
              min_infix_len  = 2
          }
    • Indexer (optional)
      Configure the indexing process which runs occasionally as a batch:
          indexer
          {
              mem_limit = 256M
          }
    • Searchd (optional)
      Configure the search daemon (searchd) which answers queries:
          searchd
          {
              listen          = localhost:9312
              listen          = localhost:9306:mysql41
              log             = /home/andrew/sphinx.log
              read_timeout    = 8
              max_children    = 30
              pid_file        = /home/andrew/sphinx.pid
              max_matches     = 25
              seamless_rotate = 1
              preopen_indexes = 1
              unlink_old      = 1
          }
    • stopwords.txt
      To generate stopwords from your data, use the indexer binary:
          indexer --config /path/to/sphinx.conf --buildstops /path/to/stopwords.txt 25
      Sample contents of the resulting file:
          of
          who
          must
          in
          and
          the
          mad
          An
      Builds a stopwords.txt file with the 25 most commonly found words.
      Use --buildfreqs to include counts.
      Stopwords can dramatically reduce the index size and build time, but it’s a good idea to inspect the output before using it!
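The idea behind building a stopword list is simply counting word frequencies and keeping the top N. The sketch below illustrates that idea only; it is not how the indexer is implemented, and the function name and sample sentences are made up for this example.

```python
from collections import Counter


def build_stopwords(documents, limit, with_counts=False):
    """Return the `limit` most frequent words, optionally with their counts."""
    counts = Counter()
    for text in documents:
        counts.update(text.lower().split())
    top = counts.most_common(limit)
    return top if with_counts else [word for word, _count in top]


# Hypothetical sample data in the style of film descriptions.
DOCS = [
    "a mad scientist who must battle a teacher",
    "a teacher who must fight a mad robot",
]
print(build_stopwords(DOCS, 5))
print(build_stopwords(DOCS, 1, with_counts=True))  # -> [('a', 4)]
```

Inspecting the list matters because frequent words are not always noise: in this sample "teacher" would be dropped from the index along with "a".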
    • Build your Index
      To generate your index, use the indexer binary:
          indexer --config /path/to/sphinx.conf --all --rotate
      Output:
          Sphinx 2.0.4-release (r3135)
          Copyright (c) 2001-2012, Andrew Aksyonoff
          Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)
          using config file 'sphinx.conf'...
          indexing index films...
          collected 1000 docs, 0.1 MB
          sorted 0.3 Mhits, 100.0% done
          total 1000 docs, 108077 bytes
          total 0.148 sec, 727012 bytes/sec, 6726.80 docs/sec
          total 3 reads, 0.003 sec, 675.6 kb/call avg, 1.1 msec/call avg
          total 11 writes, 0.004 sec, 331.8 kb/call avg, 0.4 msec/call avg
    • Start the Server
      Start the server by executing the searchd binary:
          searchd --config /path/to/sphinx.conf
      Output:
          Sphinx 2.0.4-release (r3135)
          Copyright (c) 2001-2012, Andrew Aksyonoff
          Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)
          using config file 'sphinx.conf'...
          listening on 127.0.0.1:9312
          listening on 127.0.0.1:9306
          precaching index films
          precached 1 indexes in 0.001 sec
    • Run a Search
      Test your index by running a search:
          search --limit 3 robot
      Output:
          Sphinx 2.0.4-release (r3135)
          Copyright (c) 2001-2012, Andrew Aksyonoff
          Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)
          using config file ./sphinx.conf...
          index films: query robot : returned 77 matches of 77 total in 0.000 sec
          displaying matches:
          1. document=138, weight=1612, release_year=2006, rating=R, language=English
          2. document=920, weight=1612, release_year=2006, rating=G, language=English
          3. document=6, weight=1581, release_year=2006, rating=PG, language=English
          words:
          1. robot: 77 documents, 79 hits
    • MySQL Interface
      You can query Sphinx using the MySQL protocol:
          mysql -h127.0.0.1 -P 9306
      Output:
          Reading table information for completion of table and column names
          You can turn off this feature to get a quicker startup with -A
          Welcome to the MySQL monitor. Commands end with ; or \g.
          Your MySQL connection id is 1
          Server version: 2.0.4-release (r3135)
          Copyright (c) 2000, 2010, Oracle and/or its affiliates. All rights reserved.
          This software comes with ABSOLUTELY NO WARRANTY. This is free software,
          and you are welcome to modify and redistribute it under the GPL v2 license
          Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
          mysql>
    • MySQL Interface
      Queries are written in SphinxQL, which is much like SQL:
          mysql> SELECT *
                 FROM films
                 WHERE MATCH('robot')
                 ORDER BY release_year DESC
                 LIMIT 5;
          +------+--------+--------------+--------+----------+
          | id   | weight | release_year | rating | language |
          +------+--------+--------------+--------+----------+
          |    6 |   1581 |         2006 | PG     | English  |
          |   16 |   1581 |         2006 | NC-17  | English  |
          |   25 |   1581 |         2006 | G      | English  |
          |   42 |   1581 |         2006 | NC-17  | English  |
          |   61 |   1581 |         2006 | G      | English  |
          +------+--------+--------------+--------+----------+
          5 rows in set (0.00 sec)
    • MySQL Interface
      Additional metrics can also be retrieved:
          mysql> SHOW META;
          +---------------+-------+
          | Variable_name | Value |
          +---------------+-------+
          | total         | 77    |
          | total_found   | 77    |
          | time          | 0.000 |
          | keyword[0]    | robot |
          | docs[0]       | 77    |
          | hits[0]       | 79    |
          +---------------+-------+
          6 rows in set (0.00 sec)
    • MySQL Interface
      You can even do grouping:
          mysql> SELECT rating, COUNT(*) AS num_movies, MIN(release_year) AS first_year
                 FROM films
                 GROUP BY rating
                 ORDER BY num_movies DESC;
          +------+--------+--------------+--------+------------+--------+
          | id   | weight | release_year | rating | first_year | @count |
          +------+--------+--------------+--------+------------+--------+
          |    7 |      1 |         2006 | PG-13  |       2006 |    223 |
          |    3 |      1 |         2006 | NC-17  |       2006 |    210 |
          |    8 |      1 |         2006 | R      |       2006 |    195 |
          |    1 |      1 |         2006 | PG     |       2006 |    194 |
          |    2 |      1 |         2006 | G      |       2006 |    178 |
          +------+--------+--------------+--------+------------+--------+
          5 rows in set (0.00 sec)
    • Other Applications
      Sphinx does more than just full text search. It has other practical applications as well:
      • Metrics and Reporting
      • Data Warehouse
      • Materialized Views
      • Operational Data Store
      • Offloading Queries
    • Quick and Dirty PHP
      Integrate Sphinx by using any MySQL driver (like PDO).
    • SphinxAPI
      Or use a native extension like SphinxClient for PHP.
      Download it here: http://pecl.php.net/sphinx
    • Indexing Strategies
      Sphinx supports several types of indexes:
      • Disk
      • In-memory
      • Distributed
      • Real-time
    • Main+delta Batch Indexes
      Disk indexes often use the main+delta(s) strategy:
      • One or more delta indexes collect new data as often as every minute.
      • Larger batch indexes rebuild daily, weekly or even less frequently.
      Disk indexes have the following benefits:
      • They can be re-indexed online without interruption (--rotate)
      • They can be distributed over filesystems and hardware
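The main+delta idea can be sketched with plain lists: a large, rarely rebuilt main set plus a small delta holding everything added since the last main rebuild. This is an illustration under assumptions, not Sphinx code; splitting on the highest document id seen at the last main rebuild is one common approach, and all names and sample rows below are hypothetical.

```python
def split_main_delta(rows, max_main_id):
    """Rows up to max_main_id belong to main; newer rows go to the delta."""
    main = [row for row in rows if row["id"] <= max_main_id]
    delta = [row for row in rows if row["id"] > max_main_id]
    return main, delta


def search(index_rows, keyword):
    """Return ids of rows whose title contains the keyword."""
    return [row["id"] for row in index_rows
            if keyword in row["title"].lower().split()]


def search_main_and_delta(main, delta, keyword):
    """Query both indexes and merge the results."""
    return sorted(search(main, keyword) + search(delta, keyword))


ROWS = [
    {"id": 1, "title": "Robot Wars"},
    {"id": 2, "title": "Quiet Garden"},
    {"id": 3, "title": "Robot Garden"},  # added after the last main rebuild
]
main, delta = split_main_delta(ROWS, max_main_id=2)
print(search_main_and_delta(main, delta, "robot"))  # -> [1, 3]
```

Only the small delta has to be rebuilt every minute, while searches still see both old and new documents.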
    • The End
      There’s a book!
      Andrew Kandels
      Website: http://andrewkandels.com
      Twitter: @andrewkandels
      Facebook/G+: No thanks