Full Text Search for when a database is not enough...
TOC
● What is "Full text search"?
● How does it work?
● What is it good for?
● What makes it so good?
● Common Characteristics
● Some of the most known solutions
● Who uses them?
● Practical Example
What is full text search?
Wikipedia says: full text search refers to a technique for searching a computer-stored document or database. In a full text search, the search engine examines all of the words in every stored document as it tries to match search words supplied by the user.
I say: full text search is a technique for searching documents or databases that allows for a more relevant search (getting the results that we need instead of results that merely "match" our query).
How does it work?
In order to do a full text search, we first have to index all the information. There are several techniques for indexing, but the basic idea behind it is as follows:
 1. Scan the document.
 2. For every word within the document, create an entry in the index with that word and its relative position within the document.
 3. Apply specific rules to the terms, such as:
    ○ Ignoring stop words
    ○ Stemming
    ○ etc.
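The three steps above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular engine builds its index: the `build_index` function, the `STOP_WORDS` set and the sample documents are all made up for the example, and stemming is left out.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list, kept tiny for the example.
STOP_WORDS = {"the", "a", "an", "by", "can"}

def build_index(documents):
    """Map each term to {doc_id: [positions]} (an inverted index)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        # Step 1: scan the document and split it into lowercase words.
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words):
            # Step 3: apply rules to the terms; here we only drop stop words.
            if word in STOP_WORDS:
                continue
            # Step 2: store the word with its position in the document.
            # Positions are counted before stop words are removed.
            index[word][doc_id].append(position)
    return index

docs = {
    1: "The quick brown fox",
    2: "A quick search by the fox",
}
index = build_index(docs)
print(dict(index["quick"]))  # {1: [1], 2: [1]}
print(dict(index["fox"]))    # {1: [3], 2: [5]}
```

Looking up a word is then a dictionary access instead of a scan over every document, which is where the speed comes from.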
... how? part II
We have the index ready, now what?
Depending on the solution used, we'll have access to a formal querying language. Using that, we can query our engine to tell it what we're looking for. Something like:
title:"The Right Way" AND text:goorjakarta^4 apache
This tells our search engine to look for documents with the title "The Right Way" and also those that have the words "goorjakarta" and "apache" in their text; the only difference is that "goorjakarta" is 4 times more important than the word "apache".
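To get a feeling for what a boost such as ^4 does, here is a toy scorer in Python. It is an assumption-laden sketch: real engines use far more elaborate scoring formulas, and the `score` function, the `query` dictionary and the two sample texts are invented for the illustration.

```python
def score(text, boosts):
    """Sum, over the query terms, of (occurrences in text) * (boost)."""
    words = text.lower().split()
    return sum(words.count(term) * boost for term, boost in boosts.items())

# Hypothetical query: "goorjakarta" weighs four times as much as "apache".
query = {"goorjakarta": 4, "apache": 1}

docs = {
    "a": "apache apache apache",  # score: 3 * 1 = 3
    "b": "goorjakarta apache",    # score: 1 * 4 + 1 * 1 = 5
}
ranked = sorted(docs, key=lambda d: score(docs[d], query), reverse=True)
print(ranked)  # ['b', 'a']
```

Document "b" wins even though "a" contains more matching words, because a single hit on the boosted term outweighs three hits on the other one.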
What is it good for?
Full text search allows us to search (well, duh!) very large amounts of information in a very short time.
This type of solution is generally used when the size of the database to be searched reaches the gigabytes.
It is normally used for searching inside the content of documents, such as Word documents, Excel spreadsheets, web pages, etc.
What makes it so good?
Full text search is great! (but why?)
Some of the most important characteristics of full text search solutions are:
- Relevant search: The results we get can be sorted by relevance, which lets users find what they are looking for easily (e.g. if we search for "red" and "apple", we want to get the fruit and not results about the Apple company).
- Keywords: When indexing, keywords can be assigned to different parts of the documents, allowing for more specific queries.
- Wildcards: A great tool that lets us search for terms when we don't know exactly how they are written.
- Fuzzy search: Using this technique, we can search for terms that are close to the ones in our query string.
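Fuzzy search from the list above can be approximated with Python's standard `difflib` module. This is only a stand-in to show the idea: search engines implement fuzzy matching in their own ways, and the `vocabulary` list and the misspelled query here are made up.

```python
import difflib

# Hypothetical list of indexed terms.
vocabulary = ["search", "server", "church", "engine"]

# The user misspells "search"; ask for indexed terms that are close to it.
matches = difflib.get_close_matches("serch", vocabulary, n=3, cutoff=0.6)
print(matches)  # ['search']
```

Raising `cutoff` makes the match stricter, lowering it lets more distant terms through.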
Common characteristics
Let's talk about some of the most common characteristics among full text search solutions.
 ● Precision vs. Recall
 ● Stopwords
 ● Stemming
 ● Wildcards
Precision vs. recall tradeoff
Precision: Number of relevant results returned divided by the total number of results returned.
Recall: Number of relevant results returned divided by the total number of relevant results.
When choosing a solution, it is important to balance these two concepts correctly. An increase in precision usually means a decrease in recall, and the opposite also applies.
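Both definitions translate directly into code. A small sketch, with a hypothetical `precision_recall` helper and made-up document ids:

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a list of returned and relevant ids."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)  # relevant results that were returned
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

returned = ["d1", "d2", "d3", "d4"]  # what the engine gave us
relevant = ["d1", "d2", "d5"]        # what we actually wanted

p, r = precision_recall(returned, relevant)
print(p)  # 0.5 (2 of the 4 returned results are relevant)
print(r)  # 2 of the 3 relevant results were returned, about 0.67
```

Returning every document in the collection would push recall to 1.0 while precision collapses, which is the tradeoff in a nutshell.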
Stopwords
Stopwords are terms that are too common in a language and therefore are not specific enough to be of use when searching.
Some examples are words like "the", "a", "an", "by", "can", etc.
They're normally ignored by full text analyzers when indexing information.
Stemming
Stemming allows us to reduce a word to its root form (or stem) in order to generalize terms while searching. Note that this is not the same as synonyms.
For example, a stemmer would generalize words like "catlike", "catty" and "cats" to their root form: "cat".
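A deliberately naive suffix-stripping stemmer shows the idea. Real stemmers use much richer rule sets; the `SUFFIXES` tuple below is hypothetical and was chosen only so that the three example words reduce to "cat".

```python
# Hypothetical suffix list, picked to fit the example words only.
SUFFIXES = ("like", "ty", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ["catlike", "catty", "cats", "cat"]]
print(stems)  # ['cat', 'cat', 'cat', 'cat']
```

If the same function is applied both when indexing and when querying, a search for "cats" also finds documents that only mention "catlike".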
W?ldc*ds (A.k.a: Wildcards)
Wildcards are a bit better known and they do what you'd expect them to do: they are used in place of characters when you don't know exactly how your search terms are spelled.
Wildcard characters may vary from one solution to another, but there are normally two: one that represents a single character, and one that represents a group of them.
For example: the string hel* would match words like hello, helium and others, while the string hel? would only match words that begin with "hel" and end with one more character, like "hell" but not "helium".
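The hel* and hel? patterns can be tried out with Python's standard `fnmatch` module, which happens to use the same two wildcard characters. The `wildcard_search` helper and the word list are made up for the example; a search engine would match against its index rather than a plain list.

```python
from fnmatch import fnmatchcase

def wildcard_search(pattern, terms):
    """Return the terms matching a pattern ('*' = any run, '?' = one char)."""
    return [term for term in terms if fnmatchcase(term, pattern)]

words = ["hello", "helium", "hell", "help", "halo"]

print(wildcard_search("hel*", words))  # ['hello', 'helium', 'hell', 'help']
print(wildcard_search("hel?", words))  # ['hell', 'help']
```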
Some of the most known solutions
There are different types of solutions: some of them are just APIs that can be integrated into our projects, whilst others are servers that provide an entire layer of services between our application and the information.
Some examples are:
APIs:
 ● Xapian
 ● Lucene
Servers:
 ● Sphinx
 ● Solr
... a bit more about Lucene and Xapian
There are many more, but those are some of the best-known ones...
Xapian and Lucene are both APIs, but they work differently: Xapian needs bindings for each language in order to be compatible, whereas Lucene has a specific implementation for each compatible language.
... and a bit more about Sphinx and Solr
On the other hand, Solr (which is based on Lucene) and Sphinx are both full text search servers.
They both provide their functionality through interfaces rather than directly inside the application.
Sphinx is designed to be efficient when indexing database content.
Who uses them?
These types of solutions are used by many companies, for example:
- Debian uses Xapian for many tasks, one of them being searching its archive of software packages
- NASA Planetary Data System (PDS) uses Solr to search for dataset, mission, instrument, target, and host information
- Digg uses Solr for searching their site
- Craigslist uses Sphinx
- Moove-it! has used Sphinx on some of its projects
- And many more...
Practical Example
Let's take a look at a very original example...