Full-text & Relational search
VIJAY YADAV
072-BCT-547
SEARCH IS HARD
• Google alone handles over 3.5 billion searches per day on average.
• That’s about one search a day for every two people in the world (including babies and grandmothers, but excluding zombies).
• That doesn’t even include the number of searches on Amazon, LinkedIn, and Facebook. We use search for everything.
• Oh, except company data. We still use BI analysts, data scientists, specialized tools, and SQL for that.
So what is full-text search?
• It is a document-based search, used mainly by word processing applications and various search engines.
• It usually involves two stages: indexing and searching.
• The indexing stage scans the text of all the documents and builds a list of search terms (called an index). The indexer ignores stop words such as "the" and "and". Word forms such as "drives", "drove", and "driven" are recorded as the single word "drive".
• In the search stage, a query is answered by consulting only the index, rather than the text of the original documents.
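The two stages can be sketched in a few lines of Python. This is only an illustration: the stop-word list, the word-form table, and the sample documents are hypothetical, and real indexers use much larger lists and proper stemming.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny stop-word list and word-form table.
STOP_WORDS = {"the", "and", "a", "of", "to"}
BASE_FORMS = {"drives": "drive", "drove": "drive", "driven": "drive"}


def normalize(text):
    """Lowercase, split into words, drop stop words, map forms to a base word."""
    words = re.findall(r"[a-z]+", text.lower())
    return [BASE_FORMS.get(w, w) for w in words if w not in STOP_WORDS]


def build_index(documents):
    """Indexing stage: map each term to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in normalize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Search stage: consult only the index, never the original text."""
    terms = normalize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents.
documents = {
    1: "She drove the car to the station",
    2: "He drives a truck",
    3: "The station and the car park",
}
index = build_index(documents)
print(sorted(search(index, "driven")))   # documents 1 and 2
print(sorted(search(index, "the car")))  # documents 1 and 3
```

Note that a query made up only of stop words returns nothing here, because those words were never indexed.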
Two ways to improve performance
I. Improved query tools
II. Improved search algorithms
Improved search algorithms
• PageRank, an algorithm developed by Google.
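As a rough idea of how PageRank scores pages, here is a simplified power-iteration sketch. It is not Google's implementation: the link graph is hypothetical, the damping factor of 0.85 is just the commonly quoted value, and every linked page is assumed to appear as a key in the graph.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small base score, then shares from pages linking to it.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: A links to B and C, B to C, C to A, D to C.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

Page C ends up with the highest score because three pages link to it, while D, which nothing links to, keeps only the base score.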
Improved query tools
• Keywords: Creators are asked to list the words that best describe the text, including synonyms.
• Phrase search: Matches only documents that contain a given phrase.
• Fuzzy search: Also matches documents that contain variations of the given term.
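Phrase search and fuzzy search can be sketched as follows. The documents and vocabulary are hypothetical, and the fuzzy matcher uses Python's standard difflib similarity ratio as a stand-in; it is one of several possible ways to tolerate variations.

```python
import difflib

# Hypothetical sample documents.
documents = {
    1: "relational search gives accurate results",
    2: "full-text search builds an index",
    3: "an index of search terms",
}


def phrase_search(documents, phrase):
    """Return ids of documents containing the words of the phrase in order."""
    wanted = phrase.lower().split()
    hits = []
    for doc_id, text in documents.items():
        words = text.lower().split()
        for start in range(len(words) - len(wanted) + 1):
            if words[start:start + len(wanted)] == wanted:
                hits.append(doc_id)
                break
    return hits


def fuzzy_search(vocabulary, term, cutoff=0.8):
    """Return vocabulary words whose similarity to the term is at least cutoff."""
    return difflib.get_close_matches(term, vocabulary, n=3, cutoff=cutoff)


print(phrase_search(documents, "search terms"))
print(fuzzy_search(["elephant", "element", "telephone"], "elephents"))
```

Here the phrase "search terms" matches only document 3, even though every document contains the word "search", and the misspelled "elephents" still finds "elephant".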
Some fuzzy search algorithms
• Soundex
• Metaphone
• Double Metaphone
Soundex
In PostgreSQL, the soundex function (provided by the fuzzystrmatch extension) returns the same code for both queries below, so even a mistyped word can give the right result.
1. SELECT soundex('elephant');
   => E415
2. SELECT soundex('elephents');
   => E415
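To show where a code like E415 comes from, here is a simplified Soundex sketch in Python. It follows the common textbook rules (keep the first letter, encode consonants as digits, skip vowels, collapse repeated codes); edge cases involving "h" and "w" differ between implementations, so it should not be assumed to match PostgreSQL for every input.

```python
# Consonant groups and their Soundex digits; vowels, h, w, and y have no code.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Return a four-character Soundex code, or "" if the word has no letters."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for c in letters[1:]:
        code = CODES.get(c, "")
        if code and code != previous:
            result += code
        if c not in "hw":
            # h and w do not separate two letters with the same code.
            previous = code
        if len(result) == 4:
            break
    return result.ljust(4, "0")


print(soundex("elephant"))   # E415
print(soundex("elephents"))  # E415
```

Both spellings reduce to E, then l (4), p (1), and n (5), so a lookup keyed on the code finds the same entries for either one.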
Software performing full-text search
Problems with full-text search
• The results may not be 100% accurate.
• Many results are irrelevant, because the relationships among the words are ignored.
Why relational search?
• Gives more accurate and relevant results.
• Very useful for business analytics.
but…
Relational search is even harder because
1. Company data is complicated
• Search on LinkedIn probably means searching for a person or a company.
• Search on Amazon probably means searching for a product.
• But company data spans multiple databases, tables, columns, and rows, with complicated relationships between them.
2. It needs to be 100% accurate, or you risk your business
What’s worse than guessing?
Being convinced by bad data.
3. It needs to be faster
Relational search makes a huge difference in the enterprise because it takes deterministic input and gives deterministic output.
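One way to picture "deterministic input, deterministic output" is an exact query over related tables. The sketch below uses Python's built-in sqlite3 module; the tables, columns, and rows are hypothetical, and it only stands in for what a relational search tool would do behind the scenes.

```python
import sqlite3

# In-memory database with two related tables (hypothetical schema and data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers(id),
                         amount REAL);
    INSERT INTO customers VALUES (1, 'Acme', 'East'), (2, 'Globex', 'West');
    INSERT INTO orders VALUES (1, 1, 120.0), (2, 1, 80.0), (3, 2, 50.0);
""")

# "Revenue by region": the join follows the relationship between the tables,
# so the same question over the same data always returns the same answer.
rows = conn.execute("""
    SELECT c.region, SUM(o.amount) AS revenue
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()
print(rows)  # [('East', 200.0), ('West', 50.0)]
```

Unlike a fuzzy text match, there is no ranking or guessing here: the result is either exactly right for the data or the query is wrong.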
THANK YOU