Full Text Search




     for when a database is not enough...
TOC

● What is "Full text search"?
● How does it work?
● What is it good for?
● What makes it so good?
● Common Characteristics
● Some of the best-known solutions
● Who uses them?
● Practical Example
What is full text search?

Wikipedia says: full text search refers to a technique for searching a
computer-stored document or database. In a full text search, the search engine
examines all of the words in every stored document as it tries to match search
words supplied by the user.



I say: Full text search is a technique for searching documents or databases
that allows for a more relevant search (getting the results that we need instead
of the results that just "match" with our query).
How does it work?

In order to do a full text search, we first have to index all the information.

There are several techniques for indexing, but the basic idea behind it is as
follows:

 1. Scan the document
 2. For every word within the document, create an entry in the index with that
    word, and with the relative position within the document.
 3. Apply specific rules to the terms, such as:
      ○ Ignoring stop words
      ○ Stemming
      ○ etc
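
The steps above can be sketched in a few lines of Python. This is only a minimal illustration of an inverted index, not how any particular engine does it; the function name build_index, the tiny stop-word list and the sample documents are all assumptions made up for the example.

```python
import re
from collections import defaultdict

# Small illustrative stop-word list (an assumption; real engines ship larger ones).
STOP_WORDS = {"the", "a", "an", "by", "can"}

def build_index(documents):
    """Map each term to {doc_id: [word positions within that document]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        # 1. Scan the document and split it into lowercase words.
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words):
            # 3. Apply a rule: ignore stop words.
            if word in STOP_WORDS:
                continue
            # 2. Record the word together with its relative position.
            index[word].setdefault(doc_id, []).append(position)
    return index

# Hypothetical sample documents.
documents = {
    1: "The cat sat by the door",
    2: "A door can open",
}
index = build_index(documents)
print(index["door"])
```

Looking up a term is then a single dictionary access instead of a scan over every document, which is the point of indexing first.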
... how? part II

We have the index ready, now what?
Depending on the solution used, we'll have access to a formal querying
language. Using that, we can query our engine to tell it what we're looking for.

Something like:
title:"The Right Way" AND text:goorjakarta^4 apache

This tells our search engine to look for documents with the title "The
Right Way" that also have the words "goorjakarta" and "apache" in
their text. The only difference is that, when ranking the results,
"goorjakarta" is 4 times more important than the word "apache".
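
To get a feel for what a boost such as ^4 does to ranking, here is a deliberately simplified sketch. It assumes a toy score (term count multiplied by the boost); real engines use more elaborate scoring formulas, and the score function and sample documents are hypothetical.

```python
from collections import Counter

def score(text, boosts):
    """Toy relevance score: occurrences of each query term times its boost."""
    counts = Counter(text.lower().split())
    return sum(weight * counts[term] for term, weight in boosts.items())

# Query terms with their boosts, mirroring the ^4 in the query above.
boosts = {"goorjakarta": 4.0, "apache": 1.0}

# Hypothetical documents.
docs = {
    "doc_a": "apache apache apache",
    "doc_b": "goorjakarta apache",
}

# Highest score first.
ranked = sorted(docs, key=lambda name: score(docs[name], boosts), reverse=True)
print(ranked)
```

Even though doc_a mentions "apache" three times, doc_b ranks first because a single hit on the boosted term outweighs them.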
What is it good for?

Full text search allows us to search (well, duh!) very large amounts of
information in a very short time.


These solutions are generally used when the size of the database to be
searched reaches the gigabytes.


It is normally used for searching inside the content of documents, such as Word
documents, Excel spreadsheets, web pages, etc.
What makes it so good?
Full text search is great! (but why?)

Some of the most important characteristics of full text search
solutions are:

- Relevant search: The results we get can be sorted by relevance, which
makes it easier for users to find what they are looking for. (e.g., if we search
for "red" and "apple", we want to get the fruit and not results about the Apple company)

- Keywords: When indexing, keywords can be assigned to different parts of the
documents, allowing for a more specific type of query.

- Wildcards: A great tool that lets us search for terms when we don't know
exactly how they are written.

- Fuzzy search: Using these techniques, we can search for terms that are close to
the ones in our query string.
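
As a rough illustration of fuzzy search, the sketch below uses difflib from Python's standard library to find indexed terms that are close to a misspelled query term. The term list and the cutoff value are assumptions for the example; search engines typically rely on their own similarity measures.

```python
import difflib

# Hypothetical terms taken from an index.
indexed_terms = ["apple", "maple", "grape", "banana"]

# "aple" is a misspelled query term; keep candidates with similarity >= 0.6.
matches = difflib.get_close_matches("aple", indexed_terms, n=3, cutoff=0.6)
print(matches)
```

Raising the cutoff makes the match stricter, and lowering it lets in terms that are further from the query.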
Common characteristics

Let's talk about some of the most common characteristics
among full text search solutions.

 ● Precision vs. Recall
 ● Stopwords
 ● Stemming
 ● Wildcards
Precision vs. recall tradeoff

Precision: Number of relevant results returned divided by the
total of results returned.

Recall: Number of relevant results returned divided by the total
of relevant results.

When choosing a solution, it is important to balance these two
concepts. An increase in precision usually means a
decrease in recall, and vice versa.
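
The two definitions translate directly into code. The following sketch assumes results and relevant documents are identified by IDs; the function name and the sample IDs are made up for illustration.

```python
def precision_recall(returned, relevant):
    """Compute (precision, recall) from returned and relevant document IDs."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant  # relevant results that were actually returned
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 4 results returned, 5 relevant documents exist, 2 overlap.
precision, recall = precision_recall(returned=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
print(precision, recall)
```

Here half of the returned results are relevant (precision 0.5), but only two of the five relevant documents were found (recall 0.4).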
Stopwords

Stopwords are terms that are too common in a language and
therefore are not specific enough to be useful when
searching.

Some examples are words like "the", "a", "an", "by",
"can", etc.

They're normally ignored by full text analyzers when indexing
information.
Stemming

Stemming allows us to reduce a word to its root form (or stem)
in order to generalize terms while searching. Note that this is
not the same as synonyms.
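
A toy suffix-stripping sketch shows the idea. The suffix rules below are hypothetical and chosen only so that words such as "catlike", "catty" and "cats" reduce to "cat"; real stemmers (the Porter stemmer, for instance) use much more careful, language-specific rules.

```python
# Hypothetical toy rules; not a real stemming algorithm.
SUFFIXES = ("like", "ty", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            return stem
    return word

stems = {w: toy_stem(w) for w in ["catlike", "catty", "cats", "cat"]}
print(stems)
```

Applying the same function to both the indexed words and the query terms is what lets a search for one form find the others.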

For example, a stemmer would generalize words like "catlike",
"catty" and "cats" to their root form: "cat".
W?ldc*ds (a.k.a. Wildcards)

Wildcards are better known, and they do what you'd expect
them to do: they are used in place of characters when you don't
know exactly how your search terms are spelled.

Wildcard characters may vary from one solution to another,
but there are normally two: one that represents a single
character, and one that represents a group of them.

For example: the string 'hel*' would match words like 'hello',
'helium' and others, while the string 'hel?' would only match
words that begin with "hel" and end with one more character,
like "hell" but not "helium".
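
The same behaviour can be tried out with the fnmatch module from Python's standard library, which happens to use the same two wildcard characters. This is only an illustration over a hypothetical word list; each search engine has its own wildcard syntax and matching rules.

```python
from fnmatch import fnmatchcase

# Hypothetical list of indexed words.
words = ["hello", "helium", "hell", "help", "held", "he"]

# '*' stands for any group of characters (including none).
star_matches = [w for w in words if fnmatchcase(w, "hel*")]

# '?' stands for exactly one character.
question_matches = [w for w in words if fnmatchcase(w, "hel?")]

print(star_matches)
print(question_matches)
```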
Some of the best-known solutions

There are different types of solutions: some of them are just
APIs that can be integrated into our projects, while others are
servers that provide an entire layer of services between our
application and the information.
Some examples are:

APIs:
 ● Xapian
 ● Lucene

Servers:
 ● Sphinx
 ● Solr
... a bit more about Lucene and Xapian

There are many more, but those are some of the best-known
ones...

Xapian and Lucene are both APIs, but they work differently:
Xapian needs bindings for every language it is to be
used from.
In the case of Lucene, there are specific implementations of
Lucene for every compatible language.
... and a bit more about Sphinx and Solr

On the other hand, Solr (which is based on Lucene) and
Sphinx are both full text search servers.

They both run as separate servers and provide their functionality
through interfaces, rather than directly inside the application.

Sphinx is designed to be efficient while indexing database
content.
Who uses them?

These types of solutions are used by many companies, for
example:


- Debian uses Xapian for many tasks, one of them
is searching their archive of software packages
- NASA Planetary Data System (PDS) uses Solr to search for
dataset, mission, instrument, target, and host information
- Digg uses Solr for searching their site
- Craigslist uses Sphinx
- Moove-it! has used Sphinx on some of its projects
- And many more...
Practical Example

Let's take a look at a very original example...
Thanks for reading...

 ... and happy searching!
