Full Text Search
for when a database is not enough...
TOC
● What is "Full text search"?
● How does it work?
● What is it good for?
● What makes it so good?
● Common Characteristics
● Some of the most known solutions
● Who uses them?
● Practical Example
What is full text search?
Wikipedia says: full text search refers to a technique for searching a
computer-stored document or database. In a full text search, the search engine
examines all of the words in every stored document as it tries to match search
words supplied by the user.
I say: Full text search is a technique for searching documents or databases
that allows for a more relevant search (getting the results that we need instead
of the results that just "match" with our query).
How does it work?
In order to do a full text search, we first have to index all the information.
There are several techniques for indexing, but the basic idea behind it is as
follows:
1. Scan the document
2. For every word within the document, create an entry in the index with that
word, and with the relative position within the document.
3. Apply specific rules to the terms, such as:
○ Ignoring stop words
○ Stemming
○ etc
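The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular engine does it: the stop-word list and the `build_index` helper are assumptions made for this sketch, and stemming is left out.

```python
from collections import defaultdict

# Hypothetical stop-word list, just for this sketch.
STOP_WORDS = {"the", "a", "an", "by", "can"}

def build_index(documents):
    """Map each term to a list of (doc_id, position) pairs."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        # Step 1: scan the document word by word.
        for position, word in enumerate(text.lower().split()):
            term = word.strip('.,;:!?"\'')
            # Step 3: apply rules to the terms (here, only stop-word removal).
            if term and term not in STOP_WORDS:
                # Step 2: record the word and its relative position.
                index[term].append((doc_id, position))
    return index

docs = {1: "The cat sat by the door.", 2: "A cat can purr."}
index = build_index(docs)
print(index["cat"])  # [(1, 1), (2, 1)]
```

Positions are counted before stop words are dropped, so they still reflect where each word sits in the original text.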
... how? part II
We have the index ready, now what?
Depending on the solution used, we'll have access to a formal querying
language. Using that, we can query our engine to tell it what we're looking for.
Something like:
title:"The Right Way" AND text:goorjakarta^4 apache
This tells our search engine to look for documents with a title equal to "The
Right Way" that also have the words "goorjakarta" and "apache"
in their text. The only difference is that "goorjakarta" is 4 times more important
than the word "apache".
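To get a feel for what a boost such as ^4 does, here is a toy scoring sketch. It is not the formula any real engine uses: the `score` helper and the sample documents are assumptions, and the score is simply the boost multiplied by the number of occurrences of each term.

```python
def score(text, boosts):
    """Sum boost * occurrence count for each query term (toy scoring)."""
    words = text.lower().split()
    return sum(boost * words.count(term) for term, boost in boosts.items())

# "goorjakarta" carries a boost of 4, "apache" the default of 1.
boosts = {"goorjakarta": 4, "apache": 1}

# Hypothetical documents, only for illustration.
docs = {
    "doc_a": "apache apache server notes",
    "doc_b": "goorjakarta project page",
}

ranked = sorted(docs, key=lambda name: score(docs[name], boosts), reverse=True)
print(ranked)  # ['doc_b', 'doc_a']
```

Even though "apache" appears twice in doc_a, the single boosted term in doc_b outweighs it.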
What is it good for?
Full text search allows us to search (well duh!) very large amounts of
information in a very short time.
This type of solution is generally used when the size of the database to be
searched reaches the gigabytes.
It is normally used for searching inside the content of documents, such as Word
documents, Excel spreadsheets, web pages, etc.
What makes it so good?
Full text search is great! (but why?)
Some of the most important characteristics common to all full text search
solutions are:
-Relevant search: The results we get can be sorted by relevance, which
lets users find what they are looking for easily (e.g. if we search for "red"
and "apple" we want to get the fruit and not results about the Apple company).
-Keywords: When indexing, keywords can be assigned to different parts of the
documents, allowing for a more specific type of query.
-Wildcards: A great tool that allows us to search for terms when we don't know
exactly how they are written.
-Fuzzy search: Using this technique, we can search for terms that are close to
the ones in our query string.
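Fuzzy search can be approximated with Python's standard `difflib` module. This is just a sketch: `difflib` ranks candidates by a similarity ratio, which may differ from what a given search engine does internally, and the term list and the 0.6 cutoff are assumptions chosen for the example.

```python
import difflib

# Hypothetical list of terms already present in the index.
indexed_terms = ["apple", "apply", "maple", "banana"]

# Ask for up to 3 terms that are "close enough" to the misspelled query.
matches = difflib.get_close_matches("aple", indexed_terms, n=3, cutoff=0.6)
print(matches)
```

The misspelled "aple" still finds "apple", while an unrelated term such as "banana" falls below the cutoff.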
Common characteristics
Let's talk about some of the most common characteristics
amongst full text search solutions.
● Precision vs. Recall
● Stopwords
● Stemming
● Wildcards
Precision vs. recall tradeoff
Precision: Number of relevant results returned divided by the
total number of results returned.
Recall: Number of relevant results returned divided by the total
number of relevant results.
When choosing a solution, it is important to balance these two
concepts. An increase in precision usually means a
decrease in recall, and the opposite also applies.
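Both definitions translate directly into code. The sketch below assumes results are identified by document ids; the `precision_recall` helper and the sample ids are made up for illustration.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a set of returned document ids."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)  # relevant results that were returned
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 4 results returned, 2 of them relevant, out of 8 relevant documents overall.
p, r = precision_recall(returned=[1, 2, 3, 4],
                        relevant=[2, 4, 5, 6, 7, 8, 9, 10])
print(p, r)  # 0.5 0.25
```

Returning more documents would probably raise recall, but each extra irrelevant result would lower precision.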
Stopwords
Stopwords are terms that are too common in a language and
therefore are not specific enough to be of use when
searching.
Some examples are words like "the", "a", "an", "by",
"can", etc.
They're normally ignored by full text analyzers when indexing
information.
Stemming
Stemming allows us to reduce a word to its root form (or stem)
in order to generalize terms while searching. Note that this is
not the same as synonyms.
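A stemmer can be sketched as a simple suffix stripper. The version below is a toy: the suffix list is an assumption picked for this slide, and real stemmers (Porter or Snowball, for instance) use far more careful rules.

```python
# Hypothetical suffix list, checked in this order (longest first).
SUFFIXES = ("like", "ty", "s")

def stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [stem(w) for w in ("catlike", "catty", "cats", "cat")]
print(stems)  # ['cat', 'cat', 'cat', 'cat']
```

Applying the same function at index time and at query time is what makes a search for "cats" find documents that only mention "cat".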
For example, a stemmer would generalize words like "catlike",
"catty" and "cats" to their root form: "cat".
W?ldc*ds (A.k.a: Wildcards)
Wildcards are a bit better known and they do what you'd expect
them to do: they are used in place of characters when you don't
know exactly how your search terms are formed.
Wildcard characters may vary from one solution to another,
but there are normally two: one that represents a single
character, and one that represents a group of them.
For example: the string 'hel*' would match words like 'hello',
'helium' and others, while the string 'hel?' would only match
words that begin with "hel" and end with one more character,
like "hell" but not "helium".
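The same behaviour can be tried out with Python's standard `fnmatch` module, which happens to use `*` and `?` with these meanings. This only illustrates the matching rules on an assumed word list, it is not how a search engine evaluates wildcards against its index.

```python
from fnmatch import fnmatchcase

# Hypothetical list of indexed words.
words = ["hell", "hello", "helium", "help", "shell"]

# '*' stands for any group of characters (including none).
star_matches = [w for w in words if fnmatchcase(w, "hel*")]
# '?' stands for exactly one character.
question_matches = [w for w in words if fnmatchcase(w, "hel?")]

print(star_matches)      # ['hell', 'hello', 'helium', 'help']
print(question_matches)  # ['hell', 'help']
```

Note that "shell" matches neither pattern, because the pattern has to match the whole word from its first character.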
Some of the most known solutions
There are different types of solutions: some of them are just
APIs that can be integrated into our projects, whilst others are
servers that provide an entire layer of services between our
application and the information.
Some examples of these are:
APIs:
● Xapian
● Lucene
Servers:
● Sphinx
● Solr
... a bit more about Lucene and Xapian
There are many more, but those are some of the best-known
ones...
Xapian and Lucene are both APIs, but they work differently:
Xapian needs bindings for every language in order to
be compatible.
In the case of Lucene, there are specific implementations of
Lucene for every compatible language.
... and a bit more about Sphinx and Solr
On the other hand, Solr (which is based on Lucene) and
Sphinx are both full text search servers.
They both provide their functionalities through interfaces and
not directly inside the application.
Sphinx is designed to be efficient while indexing database
content.
Who uses them?
These types of solutions are used by many companies and projects, for
example:
- Debian uses Xapian for many tasks, one of them
is searching their archive of software packages
- NASA Planetary Data System (PDS) uses Solr to search for
dataset, mission, instrument, target, and host information
- Digg uses Solr for searching their site
- Craigslist uses Sphinx
- Moove-it! has used Sphinx on some of its projects
- And many more...
Practical Example
Let's take a look at a very original example...
Thanks for reading...
... and happy searching!
