Search Engine Basics
       Ruben Ortega
What is covered?

• A non-programmer’s introduction to:
 • Why we have search engines.
 • How search works across a page, a book,
    thousands of books, and millions of books.
 • How to get a good search result.
Speaker Background
• 10+ years working on search engines
 • Amazon, A9.com, Mechanical Turk,
    Trusera.com
 • 13 patents -- Helping people find anything
  • Billions of dollars of revenue
  • Millions of searches per hour
Have you searched a
 book for your name?
• Wonder how many times your name was
  mentioned in your High School yearbook?
• Find your name across all your High School
  and college yearbooks?
• Which would be the “best result” if I
  searched for your name in those
  yearbooks?
Success of Search
    Engines
Search engines not
taught before the web
• Not taught because there was no demand.
• Why no demand?
 • Machines had 10-20MB of disk.
 • $100 per MB of disk --> Disk quotas
 • Limited networking --> Limited
    information
What does 1 Megabyte
   of space hold?

• Book Page -- 2.5 Kilobytes of text
• 1 Megabyte == 400 pages ~ 1 thick book
Is it worth it to store a
         book ?

• If disk space cost $100 per MB it had
  better be worth it!
• Copying a $20 book onto $100 of disk
  space is not cost-effective.
Why has Search grown
     so quickly?

• Lots and lots of fantastically cheap disk
  space!
Inexpensive Disk!
[Two charts, 1984 to 2008: “Cost per Megabyte of Disk” (vertical axis 0 to 100) and “Megabytes per dollar of disk” (vertical axis 0 to 10,000)]
So what happens?

• Information blossoms.
• Quotas are gone -- Never have to delete!
• Email
• Web Pages
• Books, Image data, Music
Demand for search
     skyrocketed
• Cheaper disks == more data to search.
• More data means
 • Demand for better search techniques
 • Different handling of indexed items
 • Better user interfaces
• Reminder: There is no magic in search!
How does search
        work?


• Let’s run through a text search example
Simple Searching

• How do you search for the word “coyness”
  in the following string:
• “Had we but world enough and time thy
  coyness lady would be no crime.”
Find the first
          “c”
to his coy mistress
had we but world enough
and time thy coyness lady
[Animation: the word “coyness” slides along the text one character at a time, comparing letters at each position.]
No match.
[Animation: the word keeps sliding, one character at a time, through the remaining lines.]
Matched!
[Final frame: “coyness” lined up with “coyness” in “and time thy coyness lady”.]
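The character-by-character walk-through above can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's code; the function name `naive_search` and the joining of the three lines into one string are assumptions made for the example.

```python
def naive_search(text: str, word: str) -> int:
    """Return the index of the first occurrence of word in text, or -1."""
    n, m = len(text), len(word)
    # Try every starting position, moving forward one character at a time.
    for start in range(n - m + 1):
        matched = 0
        # Compare the word against the text letter by letter.
        while matched < m and text[start + matched] == word[matched]:
            matched += 1
        if matched == m:
            return start  # every letter matched
    return -1  # ran off the end of the text without a full match


# The three lines from the slides, joined with spaces (an assumption).
text = "to his coy mistress had we but world enough and time thy coyness lady"
position = naive_search(text, "coyness")
print(position, text[position:position + len("coyness")])
```

The early “coy” in “to his coy mistress” matches three letters and then fails, so the search simply moves on by one character and tries again.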
Can we find it faster?

• Yes!
• Boyer-Moore-Horspool.
 • Compare starting from the end of the word.
 • If the text character under the word’s last
    letter appears elsewhere in the word, shift
    the word forward to line up with it;
    otherwise skip ahead by the whole word.
to his coy mistress
had we but world enough
and time thy coyness lady
[Animation: “coyness” is compared against the text starting from its last letter.]
No match, skip.
[Animation: the word jumps ahead several characters at a time instead of one.]
Doesn’t match but C is
 a letter in our word
Jump 7 spaces
Matched!
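A minimal sketch of the Boyer-Moore-Horspool idea, assuming the common textbook form of the algorithm: a table records how far the word may jump for each letter it contains, and any other character allows a jump of the full word length. The function name `horspool_search` is made up for this example.

```python
def horspool_search(text: str, word: str) -> int:
    """Return the index of the first occurrence of word in text, or -1."""
    n, m = len(text), len(word)
    if m == 0:
        return 0
    # Shift table: for each letter of the word except the last, the distance
    # from its rightmost occurrence to the end of the word.
    shift = {ch: m - 1 - i for i, ch in enumerate(word[:-1])}
    pos = 0
    while pos <= n - m:
        # Compare from the end of the word backwards.
        j = m - 1
        while j >= 0 and text[pos + j] == word[j]:
            j -= 1
        if j < 0:
            return pos  # all letters matched
        # Jump according to the text character under the word's last letter;
        # characters that are not in the word allow a full-length jump.
        pos += shift.get(text[pos + m - 1], m)
    return -1


text = "to his coy mistress had we but world enough and time thy coyness lady"
print(horspool_search(text, "coyness"))
```

The only extra work compared with the naive search is building the small `shift` table once per search word.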
Simple Search works!
• Naive algorithm can work quickly for
  documents you have never seen before and
  don’t want to bother keeping around.
• Boyer-Moore-Horspool works even faster,
  at the small extra cost of building a
  table.
• But, what if I have extra disk space to store
  a book and want to go even faster?
Build an index!




Image by Dan Taylor: http://www.flickr.com/photos/dantaylor/1145628275/
Indexes are not new

• Indexes created in the 10th century to find
  words in books.
• Card catalogs in libraries provide indexes
  to books.
• What is new is how much information can
  be stored in a single place.
Indexing is simple


• For each word in a book
 • Store which page in the book it is on.
Partial index
• a -- 1,2,3,4,5,6,7,8,9,10,....
• cat -- 20, 45, 56, 58, 93, 84, 85
• coyness -- 70, 152, 425
• hat -- 6, 10, 35, 58, 89,105
• in -- 1,2,3,4,5,6,7,8,9,10,....58,......
• the -- 1,2,3,4,5,6,7,8,9,10,....58,....
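Building such an index can be sketched as follows. The page texts are invented for illustration, and the simple lowercase word splitting is an assumption; real indexers handle punctuation and word forms with more care.

```python
import re
from collections import defaultdict


def build_index(pages: dict[int, str]) -> dict[str, list[int]]:
    """Map each word to the sorted list of page numbers it appears on."""
    index = defaultdict(set)
    for page_number, page_text in pages.items():
        # Lowercase the page and pull out runs of letters as words.
        for word in re.findall(r"[a-z']+", page_text.lower()):
            index[word].add(page_number)
    return {word: sorted(page_numbers) for word, page_numbers in index.items()}


# Hypothetical three-page "book".
pages = {
    1: "The cat sat.",
    2: "A hat for the cat.",
    3: "Thy coyness, lady.",
}
index = build_index(pages)
for word in sorted(index):
    print(word, "--", index[word])
```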
Indexes use more disk
        space

• A complete index takes about 33% of the
  text indexed.
• In 1984, that would be $133 in disk space
  per book.
• In 2008, $133 is able to store and index 1
  million books.
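The dollar figures above can be checked with a little arithmetic. This sketch assumes one book is about 1 MB (400 pages at 2.5 KB each), a 1984 price of $100 per MB, and roughly 10,000 MB per dollar in 2008; the 2008 rate is read off the top of the chart axis and is an assumption, not a quoted price.

```python
BOOK_MB = 1.0                  # one thick book, about 400 pages of 2.5 KB
INDEX_FRACTION = 0.33          # index size relative to the indexed text
COST_PER_MB_1984 = 100.0       # dollars per megabyte
MB_PER_DOLLAR_2008 = 10_000.0  # assumed from the chart's upper axis value


def storage_mb(books: int) -> float:
    """Megabytes needed to hold the text of the books plus their index."""
    return books * BOOK_MB * (1 + INDEX_FRACTION)


cost_1984_one_book = storage_mb(1) * COST_PER_MB_1984
cost_2008_million_books = storage_mb(1_000_000) / MB_PER_DOLLAR_2008

print(f"1984, one book:      ${cost_1984_one_book:.0f}")
print(f"2008, million books: ${cost_2008_million_books:.0f}")
```

Under these assumptions both figures come out at about $133.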
How do you search
    with an index?

• Step 1: Look up the words you are searching
  for in the index.
• Step 2: Return all the pages those words
  appear on.
Search for “coyness”
• a -- 1,2,3,4,5,6,7,8,9,10,....
• cat -- 20, 45, 56, 58, 93, 84, 85
• coyness -- 70, 152, 425
• hat -- 6, 10, 35, 58, 89,105
• in -- 1,2,3,4,5,6,7,8,9,10,....58,......
• the -- 1,2,3,4,5,6,7,8,9,10,....58,....
Search for “Cat in the
        Hat”
• a -- 1,2,3,4,5,6,7,8,9,10,....
• cat -- 20, 45, 56, 58, 93, 84, 85
• coyness -- 70, 152, 425
• hat -- 6, 10, 35, 58, 89,105
• in -- 1,2,3,4,5,6,7,8,9,10,....58,......
• the -- 1,2,3,4,5,6,7,8,9,10,....58,....
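The two steps can be sketched with the partial index from the slides. The page lists for “cat”, “coyness”, and “hat” are copied as shown; the lists for “in” and “the” are elided on the slides, so the ranges below are invented for illustration. Requiring every query word to be on the page is also an assumption about how the multi-word search is meant to work.

```python
index = {
    "cat": [20, 45, 56, 58, 93, 84, 85],
    "coyness": [70, 152, 425],
    "hat": [6, 10, 35, 58, 89, 105],
    # Hypothetical: very common words appear on nearly every page.
    "in": list(range(1, 121)),
    "the": list(range(1, 121)),
}


def search(index: dict[str, list[int]], query: str) -> list[int]:
    """Return the pages that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return []
    # Step 1: pick each word's page list out of the index.
    page_sets = [set(index.get(word, [])) for word in words]
    # Step 2: keep only the pages common to all of the words.
    return sorted(set.intersection(*page_sets))


print(search(index, "coyness"))         # [70, 152, 425]
print(search(index, "Cat in the Hat"))  # [58]
```

Page 58 is the only page listed under both “cat” and “hat”, which is why it survives the intersection.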
Phrase Search for “Cat
     in the Hat”
• a -- page 1(3, 12, 15,18),2( 12, 54,56)....
• cat -- page 20(45), 56(5), 58(3), 93(23)....
• coyness -- 70(56, 82), 152(45), 425(12)
• hat -- 6, 10, 35, 58(6), 89,105
• in -- 1,2,3,4,5,6,7,8,9,10,....58(4),......
• the -- 1,2,3,4,5,6,7,8,9,10,....58(5),....
        Added page position in ()
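A phrase search needs the words to be adjacent and in order, which is what the stored positions make possible. The sketch below uses only the entries whose positions are given on the slide; the nested-dictionary layout and the name `phrase_search` are assumptions for the example.

```python
# word -> {page: [positions on that page]}, taken from the slide where given.
positional_index = {
    "cat": {20: [45], 56: [5], 58: [3], 93: [23]},
    "in": {58: [4]},
    "the": {58: [5]},
    "hat": {58: [6]},
}


def phrase_search(index, phrase: str) -> list[tuple[int, int]]:
    """Return (page, start position) pairs where the exact phrase occurs."""
    words = phrase.lower().split()
    if not words or any(word not in index for word in words):
        return []
    hits = []
    # Start from every occurrence of the first word...
    for page, positions in sorted(index[words[0]].items()):
        for start in positions:
            # ...and require each following word at the next position.
            if all(start + offset in index[word].get(page, [])
                   for offset, word in enumerate(words)):
                hits.append((page, start))
    return hits


print(phrase_search(positional_index, "Cat in the Hat"))  # [(58, 3)]
```

On page 58 the four words sit at positions 3, 4, 5, and 6, so the phrase matches there and nowhere else.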
How about Searching
  1000's of books?

• Leverage the same tools we used before
 • Create an index over multiple books
 • Perform a search returning books and
    pages
Multiple books for “Cat
      in the Hat”
• cat -- [Dr. Seuss] 20, 45, 56, 58, [Pet Health
   Dictionary] 5, 25, 68
• hat -- [Harry Potter] 6, 92, [Dr. Seuss] 35,
   58, 89,105
• in -- [Twilight]1,2,...[Dr. Seuss],1,2,3,...58,...
• the -- [Programming Perl] 1,2,3,4,5, .... [Dr.
   Seuss]...58,....

          Added Book titles in []
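One way to sketch a multi-book index is to store (book, page) pairs instead of bare page numbers. Book titles and the listed pages come from the slide; the page ranges for “in” and “the” are elided there, so the ranges used here are invented for illustration.

```python
def postings(entries: dict) -> set:
    """Flatten {book: pages} into a set of (book, page) pairs."""
    return {(book, page) for book, pages in entries.items() for page in pages}


index = {
    "cat": postings({"Dr. Seuss": [20, 45, 56, 58],
                     "Pet Health Dictionary": [5, 25, 68]}),
    "hat": postings({"Harry Potter": [6, 92],
                     "Dr. Seuss": [35, 58, 89, 105]}),
    # Hypothetical page ranges for the very common words.
    "in": postings({"Twilight": range(1, 60), "Dr. Seuss": range(1, 60)}),
    "the": postings({"Programming Perl": range(1, 6),
                     "Dr. Seuss": range(1, 60)}),
}


def search_books(index: dict, query: str) -> list:
    """Return the (book, page) pairs that contain every query word."""
    words = query.lower().split()
    if not words:
        return []
    return sorted(set.intersection(*(index.get(word, set()) for word in words)))


print(search_books(index, "cat in the hat"))  # [('Dr. Seuss', 58)]
```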
How do you search
  Millions of Books?
• Similar to finding all the Aces in a deck of
  cards.
  • 1 person -- 30 seconds if deck is
    unsorted
  • 1 person -- 3 seconds if deck is sorted
  • 26 people -- 1 second if each has 2 cards.
How do you search
Millions of Books?
[Diagram: Website --> Search Service --> Query Collector --> many Index Servers, each holding Book Indexes]
• Search across many machines and return best
  results.
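The fan-out from a query collector to many index servers can be imitated on one machine. In this sketch each "index server" is just a dictionary holding part of the index, and threads stand in for separate machines; the shard contents and function names are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each one indexes a different slice of the books.
shards = [
    {"cat": [("Dr. Seuss", 58)], "hat": [("Dr. Seuss", 58)]},
    {"cat": [("Pet Health Dictionary", 5)], "hat": []},
    {"hat": [("Harry Potter", 6)]},
]


def search_shard(shard: dict, words: list) -> set:
    """One index server: return its entries that contain every word."""
    page_sets = [set(shard.get(word, [])) for word in words]
    return set.intersection(*page_sets) if page_sets else set()


def query_collector(shards: list, query: str) -> list:
    """Send the query to every shard in parallel and merge the answers."""
    words = query.lower().split()
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partial_results = pool.map(lambda shard: search_shard(shard, words), shards)
        merged = set().union(*partial_results)
    return sorted(merged)


print(query_collector(shards, "cat hat"))  # [('Dr. Seuss', 58)]
```

Each shard only searches its own small index, like each person in the card analogy holding two cards, and the collector's job is just to combine the answers.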
Millions of books to
    millions of customers.
[Diagram: Website --> Search Service --> three Query Collectors, each fanning out to its own group of Index Servers]
Which is the best
        result?
• Should a search for “cat in the hat” return:
 • The book by Dr. Seuss,
 • A book about all the Dr. Seuss books,
 • A story in which a mother reads the
    book to her child?
• Did you get what the customer wanted?
Relevancy (It depends)
 • TF-IDF -- Prefer results that match rare
   words over results that match only
   common words
 • Amazon -- Biases towards what people
   are searching and buying recently.
 • Google -- Biases towards user activity,
   PageRank, and other factors.
 • Depends on what the customer intends
   and how they ask the question.
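To make the TF-IDF idea concrete, here is a minimal sketch using one common formulation: term frequency is the word's share of the document, and inverse document frequency is log(N / number of documents containing the word). The three tiny documents are invented, and production engines use more refined variants.

```python
import math
from collections import Counter


def tf_idf_scores(documents: dict[str, str], query: str) -> dict[str, float]:
    """Score each document against the query with a basic TF-IDF sum."""
    tokenized = {name: text.lower().split() for name, text in documents.items()}
    n_docs = len(tokenized)
    scores = {}
    for name, words in tokenized.items():
        counts = Counter(words)
        score = 0.0
        for term in query.lower().split():
            # Number of documents that contain this term at all.
            doc_freq = sum(1 for other in tokenized.values() if term in other)
            if doc_freq == 0 or not words:
                continue
            tf = counts[term] / len(words)       # frequent in this document
            idf = math.log(n_docs / doc_freq)    # rare across all documents
            score += tf * idf
        scores[name] = score
    return scores


# Hypothetical documents.
documents = {
    "seuss": "the cat in the hat",
    "pets": "the cat and the dog",
    "potter": "the wizard in the hat",
}
scores = tf_idf_scores(documents, "cat in the hat")
print(max(scores, key=scores.get), scores)
```

Because “the” appears in every document its weight is zero, so the ranking is decided entirely by the rarer words.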
Last step: Get the text
        snippet.
• You have searched across millions of books,
• You have found the “Best” books with the
  words “cat in the hat”
• You have spent 50 msec across hundreds of
  machines to get the right result.
• How do you find the “snippet” on the
  page?
Snippets
           Excerpts
Get snippet using
      simple search
• Fetch the book page from a different disk.
• Use a simple linear search like Naive or
  Boyer-Moore to get snippet and
  surrounding text.
• Simple techniques applied across more
  machines.
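Snippet extraction can be sketched with Python's built-in substring search standing in for the simple linear search. The function name `snippet`, the `context` parameter, and the sample page text are assumptions made for the example.

```python
from typing import Optional


def snippet(page_text: str, phrase: str, context: int = 30) -> Optional[str]:
    """Return the phrase with some surrounding text, or None if absent."""
    # Case-insensitive linear search for the phrase on the fetched page.
    position = page_text.lower().find(phrase.lower())
    if position == -1:
        return None
    start = max(0, position - context)
    end = min(len(page_text), position + len(phrase) + context)
    # Mark where text was cut off on either side.
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(page_text) else ""
    return prefix + page_text[start:end] + suffix


page = "Had we but world enough and time thy coyness lady would be no crime"
print(snippet(page, "coyness", context=10))
```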
Future Trends

• Disk space costs dropping --> More data
• More networked devices --> More sharing
• What would you do with:
 • All the web on your cell phone
 • All your family/friends instantly available
Just scratching the
          surface.
• Lucene search engine -- Open source. How
  to index and search results.
  http://lucene.apache.org/
• Google -- Presentations and research notes.
  http://research.google.com/video.html
• http://www.searchenginehistory.com/
Questions?


Search Engine Basics - Ruben Ortega

  • 1. Search Engine Basics Ruben Ortega
  • 2. What is covered? • A non-programmers introduction to: • Why do we have search engines. • How search works across a page, a book, thousands of books, to millions of books. • How to get a good search result.
  • 3. Speaker Background • 10+ years working on search engines • Amazon, A9.com, Mechanical Turk, Trusera.com • 13 patents -- Helping people find anything • Billions of dollars of revenue • Millions of searches per hour
  • 4. Have you searched a book for your name? • Wonder how many times your name was mentioned in your High School yearbook? • Find your name across all your High School and college yearbooks? • Which would be the “best result” if I searched for your name in those yearbooks?
  • 6. Search engines not taught before the web • Not taught because there was no demand. • Why no demand? • Machines had 10-20MB of disk. • $100 per MB of disk --> Disk quotas • Limited networking --> Limited information
  • 7. What does 1 Megabyte of space hold? • Book Page -- 2.5 Kilobytes of text • 1 Megabyte == 400 pages ~ 1 thick book
  • 8. Is it worth it to store a book ? • If disk space cost $100 per MB it had better be worth it! • Copying a $20 book into a $100 of disk space is not cost effective.
  • 9. Why has Search grown so quickly? • Lots and lots of fantastically cheap disk space!
  • 10. Inexpensive Disk! Cost per Megabyte of Disk Megabytes per dollar of disk 100 10000 75 7500 50 5000 25 2500 0 1984 1986 1988 1990 1992 1994 1996 1998 2000 2002 2004 2006 2008 0 1984 1986 1988 1990 1992 1994 1996 1998 2000 2002 2004 2006 2008
  • 11. So what happens? • Information blossoms. • Quotas are gone -- Never have to delete! • Email • Web Pages • Books, Image data, Music
  • 12. Demand for search skyrocketed • Cheaper disks == more data to search. • More data means • Demand better search techniques • Different handling of items indexed. • Better user interfaces • Reminder: There is no magic in search!
  • 13. How does search work? • Let’s run through a text search example
  • 14. Simple Searching • How do you search for the word “coyness” in the following string: • “Had we but world enough and time thy coyness lady would be no crime.”
  • 15. Find the first “c” coyness to his coy mistress had we but world enough and time thy coyness lady
  • 16. coyness to his coy mistress had we but world enough and time thy coyness lady
  • 17. coyness to his coy mistress had we but world enough and time thy coyness lady
  • 18. coyness to his coy mistress had we but world enough and time thy coyness lady
  • 19. coyness to his coy mistress had we but world enough and time thy coyness lady
  • 20. coyness to his coy mistress had we but world enough and time thy coyness lady
  • 21. coyness to his coy mistress had we but world enough and time thy coyness lady
  • 22. coyness to his coy mistress had we but world enough and time thy coyness lady
  • 23. coyness to his coy mistress had we but world enough and time thy coyness lady
  • 24. coyness to his coy mistress had we but world enough and time thy coyness lady
  • 25. No match. coyness to his coy mistress had we but world enough and time thy coyness lady
  • 26. coyness to his coy mistress had we but world enough and time thy coyness lady
  • 27. coyness to his coy mistress had we but world enough and time thy coyness lady
  • 28. coyness to his coy mistress had we but world enough and time thy coyness lady
  • 29. coyness to his coy mistress had we but world enough and time thy coyness lady
  • 30. coyness to his coy mistress had we but world enough and time thy coyness lady
  • 31. coynes to his coy mistress s had we but world enough and time thy coyness lady
  • 32. coyne to his coy mistress ss had we but world enough and time thy coyness lady
  • 33. coyn to his coy mistress ess had we but world enough and time thy coyness lady
  • 34. coy to his coy mistress ness had we but world enough and time thy coyness lady
  • 35. co to his coy mistress yness had we but world enough and time thy coyness lady
  • 36. c to his coy mistress oyness had we but world enough and time thy coyness lady
  • 37. to his coy mistress coyness had we but world enough and time thy coyness lady
  • 38. to his coy mistress coyness had we but world enough and time thy coyness lady
  • 39. to his coy mistress coyness had we but world enough and time thy coyness lady
  • 40. to his coy mistress coyness had we but world enough and time thy coyness lady
  • 41. to his coy mistress coyness had we but world enough and time thy coyness lady
  • 42. to his coy mistress coyness had we but world enough and time thy coyness lady
  • 43. to his coy mistress coyness had we but world enough and time thy coyness lady
  • 44. to his coy mistress coyness had we but world enough and time thy coyness lady
  • 45. to his coy mistress coyness had we but world enough and time thy coyness lady
  • 46. to his coy mistress coyness had we but world enough and time thy coyness lady
  • 47. to his coy mistress coyness had we but world enough and time thy coyness lady
  • 48. to his coy mistress coyness had we but world enough and time thy coyness lady
  • 49. to his coy mistress coyness had we but world enough and time thy coyness lady
  • 50. to his coy mistress coyness had we but world enough and time thy coyness lady
  • 51. to his coy mistress coyness had we but world enough and time thy coyness lady
  • 52. to his coy mistress coyness had we but world enough and time thy coyness lady
  • 53. to his coy mistress coyness had we but world enough and time thy coyness lady
  • 54. to his coy mistress coynes had we but world enough s and time thy coyness lady
  • 55. to his coy mistress coyne had we but world enough ss and time thy coyness lady
  • 56. to his coy mistress coyn had we but world enough ess and time thy coyness lady
  • 57. to his coy mistress coy had we but world enough ness and time thy coyness lady
  • 58. to his coy mistress co had we but world enough yness and time thy coyness lady
  • 59. to his coy mistress c had we but world enough oyness and time thy coyness lady
  • 60. to his coy mistress had we but world enough coyness and time thy coyness lady
  • 61. to his coy mistress had we but world enough coyness and time thy coyness lady
  • 62. to his coy mistress had we but world enough coyness and time thy coyness lady
  • 63. to his coy mistress had we but world enough coyness and time thy coyness lady
  • 64. to his coy mistress had we but world enough coyness and time thy coyness lady
  • 65. to his coy mistress had we but world enough coyness and time thy coyness lady
  • 66. to his coy mistress had we but world enough coyness and time thy coyness lady
  • 67. to his coy mistress had we but world enough coyness and time thy coyness lady
  • 68. to his coy mistress had we but world enough coyness and time thy coyness lady
  • 69. to his coy mistress had we but world enough coyness and time thy coyness lady
  • 70. to his coy mistress had we but world enough coyness and time thy coyness lady
  • 71. to his coy mistress had we but world enough coyness and time thy coyness lady
  • 72. to his coy mistress had we but world enough coyness and time thy coyness lady
  • 73. to his coy mistress had we but world enough coyness and time thy coyness lady
  • 74. to his coy mistress had we but world enough coyness and time thy coyness lady
  • 75. to his coy mistress had we but world enough coyness and time thy coyness lady
  • 76. to his coy mistress had we but world enough coyness and time thy coyness lady
  • 77. to his coy mistress had we but world enough coyness and time thy coyness lady
  • 78. to his coy mistress had we but world enough coyness and time thy coyness lady
  • 79. Matched! to his coy mistress had we but world enough coyness and time thy coyness lady
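The slides show the naive search only as an animation. A minimal Python sketch of the same idea follows, assuming the text and the word are plain strings; the function name `naive_search` is made up for illustration and is not from the talk.

```python
def naive_search(text, word):
    """Return the position of the first occurrence of word in text, or -1."""
    # Try every possible starting position, one character at a time.
    for start in range(len(text) - len(word) + 1):
        matched = 0
        # Compare left to right until a character differs or the word ends.
        while matched < len(word) and text[start + matched] == word[matched]:
            matched += 1
        if matched == len(word):
            return start
    return -1


text = "to his coy mistress had we but world enough and time thy coyness lady"
position = naive_search(text, "coyness")
print(position, text[position:position + len("coyness")])
```

Every failed attempt moves the word forward by only one character, which is why the animation needs so many frames before the match.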
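  • 80. Can we find it faster? • Yes! • Boyer-Moore-Horspool. • Start comparing from the end of the word. • On a mismatch, look at the text character lined up with the end of the word: if that character appears in the word, shift the word forward to line up with it; if it does not, skip ahead by the whole length of the word.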
  • 81. coyness to his coy mistress had we but world enough and time thy coyness lady
  • 82. coyness to his coy mistress had we but world enough and time thy coyness lady
  • 83. No match, skip. coyness to his coy mistress had we but world enough and time thy coyness lady
  • 84. coyne to his coy mistress ss had we but world enough and time thy coyness lady
  • 85. to his coy mistress coyness had we but world enough and time thy coyness lady
  • 86. to his coy mistress coyness had we but world enough and time thy coyness lady
  • 87. to his coy mistress coyness had we but world enough and time thy coyness lady
  • 88. to his coy mistress had we but world enough coyness and time thy coyness lady
  • 89. Doesn’t match but C is a letter in our word to his coy mistress had we but world enough coyness and time thy coyness lady
  • 90. Jump 7 spaces to his coy mistress had we but world enough coyness and time thy coyness lady
  • 91. to his coy mistress had we but world enough coyness and time thy coyness lady
  • 92. to his coy mistress had we but world enough coyness and time thy coyness lady
  • 93. to his coy mistress had we but world enough coyness and time thy coyness lady
  • 94. to his coy mistress had we but world enough coyness and time thy coyness lady
  • 95. Matched! to his coy mistress had we but world enough coyness and time thy coyness lady
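The Boyer-Moore-Horspool walkthrough above can be sketched in Python roughly as follows. This is a textbook-style version written for illustration, not code from the talk; the names `build_shift_table` and `horspool_search` are assumptions.

```python
def build_shift_table(word):
    """Map each character of word (except the last) to its jump distance."""
    last = len(word) - 1
    # Later occurrences overwrite earlier ones, so each character ends up
    # with the distance from its rightmost occurrence to the end of the word.
    return {ch: last - i for i, ch in enumerate(word[:-1])}


def horspool_search(text, word):
    """Return the position of the first occurrence of word in text, or -1."""
    n, m = len(text), len(word)
    if m == 0:
        return 0
    shift = build_shift_table(word)
    start = 0
    while start <= n - m:
        # Compare from the end of the word back toward its beginning.
        j = m - 1
        while j >= 0 and text[start + j] == word[j]:
            j -= 1
        if j < 0:
            return start
        # Jump based on the text character under the end of the word.
        # Characters that are not in the word allow a full-length jump.
        start += shift.get(text[start + m - 1], m)
    return -1


text = "to his coy mistress had we but world enough and time thy coyness lady"
print(horspool_search(text, "coyness"))
```

The shift table is the "little extra overhead": it is built once per search word, and in exchange most failed attempts skip several characters instead of one.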
  • 96. Simple Search works! • The naive algorithm works quickly enough for documents you have never seen before and don’t want to keep around. • Boyer-Moore-Horspool works even faster, with a little extra overhead for building a shift table. • But what if I have extra disk space to store a book and want to go even faster?
  • 97. Build an index! Image by Dan Taylor: http://www.flickr.com/photos/dantaylor/1145628275/
  • 98. Indexes are not new • Indexes were created in the 10th century to find words in books. • Card catalogs in libraries provide indexes to books. • What is new is how much information can be stored in a single place.
  • 99. Indexing is simple • For each word in a book • Store which page in the book it is on.
  • 100. Partial index • a -- 1,2,3,4,5,6,7,8,9,10,.... • cat -- 20, 45, 56, 58, 84, 85, 93 • coyness -- 70, 152, 425 • hat -- 6, 10, 35, 58, 89, 105 • in -- 1,2,3,4,5,6,7,8,9,10,....58,...... • the -- 1,2,3,4,5,6,7,8,9,10,....58,....
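As a rough illustration of "for each word, store which page it is on", here is a small Python sketch. The three sample pages and the name `build_index` are invented for the example, and real indexers handle punctuation, word forms, and storage far more carefully.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Build a word -> sorted list of page numbers mapping.

    pages is a dict of page number -> page text (sample data, not a real book).
    """
    index = defaultdict(set)
    for page_number, page_text in pages.items():
        # Lowercase the text and pull out runs of letters, digits, apostrophes.
        for word in re.findall(r"[a-z0-9']+", page_text.lower()):
            index[word].add(page_number)
    return {word: sorted(numbers) for word, numbers in index.items()}


pages = {
    1: "The cat sat.",
    2: "A hat in the hall.",
    3: "The cat in the hat.",
}
index = build_index(pages)
for word in sorted(index):
    print(word, "--", index[word])
```

Common words such as "the" end up with an entry for almost every page, which matches the long page lists for "a", "in", and "the" in the partial index.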
  • 101. Indexes use more disk space • A complete index takes about 33% as much space as the text it indexes. • In 1984, that would be $133 in disk space per book ($100 for the text plus $33 for the index).
  • 102. • In 2008, $133 can store and index 1 million books.
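The arithmetic behind these two slides can be checked with a few lines of Python. The inputs are the figures given earlier in the talk (about 1 MB per book, $100 per MB in 1984, index at about 33% of the text); the 2008 price per MB is only what the "1 million books for $133" claim implies, not an independently sourced number.

```python
BOOK_MB = 1.0                # one thick book of text, per the earlier slide
INDEX_FRACTION = 0.33        # index size relative to the text
DOLLARS_PER_MB_1984 = 100.0  # disk price quoted earlier in the talk


def cost_per_book(dollars_per_mb):
    """Cost of storing one book plus its index at the given disk price."""
    return BOOK_MB * (1 + INDEX_FRACTION) * dollars_per_mb


def implied_dollars_per_mb(budget, books):
    """Disk price implied by storing and indexing `books` books on `budget`."""
    return budget / (books * BOOK_MB * (1 + INDEX_FRACTION))


cost_1984 = cost_per_book(DOLLARS_PER_MB_1984)
implied_2008 = implied_dollars_per_mb(133.0, 1_000_000)
print(f"1984: ${cost_1984:.0f} per book")
print(f"2008: about ${implied_2008:.4f} per MB implied")
```

Under these assumptions a book plus its index cost about $133 in 1984, and the 2008 claim works out to roughly $0.0001 per MB, or about 10,000 MB per dollar.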
  • 103. How do you search with an index? • Step 1: Look up each word you are searching for in the index. • Step 2: Return all the pages that the word appears on.
  • 104. Search for “coyness” • a -- 1,2,3,4,5,6,7,8,9,10,.... • cat -- 20, 45, 56, 58, 84, 85, 93 • coyness -- 70, 152, 425 • hat -- 6, 10, 35, 58, 89, 105 • in -- 1,2,3,4,5,6,7,8,9,10,....58,...... • the -- 1,2,3,4,5,6,7,8,9,10,....58,....
  • 105. Search for “coyness” • a -- 1,2,3,4,5,6,7,8,9,10,.... • cat -- 20, 45, 56, 58, 84, 85, 93 • coyness -- 70, 152, 425 • hat -- 6, 10, 35, 58, 89, 105 • in -- 1,2,3,4,5,6,7,8,9,10,....58,...... • the -- 1,2,3,4,5,6,7,8,9,10,....58,....
  • 106. Search for “coyness” • a -- 1,2,3,4,5,6,7,8,9,10,.... • cat -- 20, 45, 56, 58, 84, 85, 93 • coyness -- 70, 152, 425 • hat -- 6, 10, 35, 58, 89, 105 • in -- 1,2,3,4,5,6,7,8,9,10,....58,...... • the -- 1,2,3,4,5,6,7,8,9,10,....58,....
  • 107. Search for “Cat in the Hat” • a -- 1,2,3,4,5,6,7,8,9,10,.... • cat -- 20, 45, 56, 58, 84, 85, 93 • coyness -- 70, 152, 425 • hat -- 6, 10, 35, 58, 89, 105 • in -- 1,2,3,4,5,6,7,8,9,10,....58,...... • the -- 1,2,3,4,5,6,7,8,9,10,....58,....
  • 108. Search for “Cat in the Hat” • a -- 1,2,3,4,5,6,7,8,9,10,.... • cat -- 20, 45, 56, 58, 84, 85, 93 • coyness -- 70, 152, 425 • hat -- 6, 10, 35, 58, 89, 105 • in -- 1,2,3,4,5,6,7,8,9,10,....58,...... • the -- 1,2,3,4,5,6,7,8,9,10,....58,....
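A minimal Python sketch of the two search steps, using the partial index from the slides. The page lists for "in" and "the" are cut down to the pages the slide actually shows (the elided entries are left out), and the function name `search_all_words` is an assumption made for the example.

```python
# Partial index from the slides. The "in" and "the" lists hold only the
# pages printed on the slide; the elided pages are deliberately omitted.
index = {
    "cat": [20, 45, 56, 58, 84, 85, 93],
    "coyness": [70, 152, 425],
    "hat": [6, 10, 35, 58, 89, 105],
    "in": list(range(1, 11)) + [58],
    "the": list(range(1, 11)) + [58],
}


def search_all_words(index, query):
    """Return the pages that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    # Step 1: look up each word. Step 2: keep pages common to all the words.
    page_sets = [set(index.get(word, [])) for word in words]
    return sorted(set.intersection(*page_sets))


print(search_all_words(index, "coyness"))
print(search_all_words(index, "Cat in the Hat"))
```

With this data the single-word query returns the three "coyness" pages, and the four-word query narrows down to page 58, the only page on which all four words appear.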
  • 109. Phrase Search for “Cat in the Hat” • a -- page 1(3, 12, 15, 18), 2(12, 54, 56).... • cat -- page 20(45), 56(5), 58(3), 93(23).... • coyness -- 70(56, 82), 152(45), 425(12) • hat -- 6, 10, 35, 58(6), 89, 105 • in -- 1,2,3,4,5,6,7,8,9,10,....58(4),...... • the -- 1,2,3,4,5,6,7,8,9,10,....58(5),.... Position of each word on the page added in ()
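Phrase search with word positions can be sketched in Python as below. Only the page and position pairs that the slide actually prints are included (for "hat", "in", and "the" that is page 58 alone), and `phrase_search` is a hypothetical name for illustration.

```python
# word -> {page: [positions of the word on that page]}
# Only entries whose positions appear on the slide are included.
positional_index = {
    "cat": {20: [45], 56: [5], 58: [3], 93: [23]},
    "coyness": {70: [56, 82], 152: [45], 425: [12]},
    "hat": {58: [6]},
    "in": {58: [4]},
    "the": {58: [5]},
}


def phrase_search(index, phrase):
    """Return pages where the words of phrase appear next to each other, in order."""
    words = phrase.lower().split()
    if not words or any(word not in index for word in words):
        return []
    hits = []
    # Start from each occurrence of the first word, then require the second
    # word at the next position, the third at the one after, and so on.
    for page, starts in index[words[0]].items():
        for start in starts:
            if all(start + offset in index[word].get(page, ())
                   for offset, word in enumerate(words)):
                hits.append(page)
                break
    return sorted(hits)


print(phrase_search(positional_index, "Cat in the Hat"))
```

On page 58 the four words sit at positions 3, 4, 5, and 6, so the phrase matches there; a page that merely contains all four words in scattered places would not.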
  • 110. How about searching thousands of books? • Use the same tools we used before. • Create an index over multiple books. • Perform a search that returns books and pages.
  • 111. Multiple books for “Cat in the Hat” • cat -- [Dr. Seuss] 20, 45, 56, 58, [Pet Health Dictionary] 5, 25, 68 • hat -- [Harry Potter] 6, 92, [Dr. Seuss] 35, 58, 89, 105 • in -- [Twilight] 1,2,... [Dr. Seuss] 1,2,3,...58,... • the -- [Programming Perl] 1,2,3,4,5, .... [Dr. Seuss] ...58,.... Book titles added in []
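Extending the index to several books mostly means recording the book alongside the page. A small Python sketch using the entries printed on the slide (elided pages are left out, and `search_books` is an invented name):

```python
# word -> {book title: [pages]}; only pages printed on the slide are listed.
multi_book_index = {
    "cat": {"Dr. Seuss": [20, 45, 56, 58], "Pet Health Dictionary": [5, 25, 68]},
    "hat": {"Harry Potter": [6, 92], "Dr. Seuss": [35, 58, 89, 105]},
    "in": {"Twilight": [1, 2], "Dr. Seuss": [1, 2, 3, 58]},
    "the": {"Programming Perl": [1, 2, 3, 4, 5], "Dr. Seuss": [58]},
}


def search_books(index, query):
    """Return (book, page) pairs on which every word of the query appears."""
    words = query.lower().split()
    if not words:
        return []
    hit_sets = []
    for word in words:
        # Flatten each word's entry into a set of (book, page) pairs.
        hit_sets.append({(book, page)
                         for book, pages in index.get(word, {}).items()
                         for page in pages})
    return sorted(set.intersection(*hit_sets))


print(search_books(multi_book_index, "Cat in the Hat"))
```

The search itself is unchanged from the single-book case; only the thing being intersected has grown from a page number to a (book, page) pair.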
  • 112. How do you search Millions of Books? • Similar to finding all the Aces in a deck of cards. • 1 person -- 30 seconds if deck is unsorted • 1 person -- 3 seconds if deck is sorted • 26 people -- 1 second if each has 2 cards.
  • 113. How do you search Millions of Books? • Search across many machines and return best results. [Diagram: Website, Search Service, Query Collector, multiple Index Servers, Book Indexes]
  • 114. Millions of books to millions of customers. [Diagram: Website, Search Service, three Query Collectors, many Index Servers]
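The query collector and index server arrangement can be imitated on one machine with threads. The sketch below is only an analogy: the class names mirror the labels in the diagram, but the code, the sample shards, and the simple merge step (no ranking) are assumptions, not the architecture of any real service.

```python
from concurrent.futures import ThreadPoolExecutor


class IndexServer:
    """Holds the index for one slice of the books (a stand-in for a machine)."""

    def __init__(self, index):
        self.index = index  # word -> {book title: [pages]}

    def search(self, word):
        return [(book, page)
                for book, pages in self.index.get(word, {}).items()
                for page in pages]


class QueryCollector:
    """Sends the query to every index server and merges the partial results."""

    def __init__(self, servers):
        self.servers = servers

    def search(self, word):
        # Ask all servers at the same time, like 26 people each checking 2 cards.
        with ThreadPoolExecutor(max_workers=max(1, len(self.servers))) as pool:
            partial_results = list(pool.map(lambda s: s.search(word), self.servers))
        return sorted(hit for partial in partial_results for hit in partial)


# Two hypothetical shards, each covering different books.
collector = QueryCollector([
    IndexServer({"cat": {"Dr. Seuss": [20, 58]}}),
    IndexServer({"cat": {"Pet Health Dictionary": [5]}, "hat": {"Harry Potter": [6]}}),
])
print(collector.search("cat"))
```

Adding index servers lets each one hold a smaller share of the books, and adding query collectors lets more customers search at once.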
  • 115. Which is the best result? • Should a search for “cat in the hat” return: • The book by Dr. Seuss, • A book about all the Dr. Seuss books, • A story in which a mother reads the book to her child? • Did you get what the customer wanted?
  • 116. Relevancy (It depends) • TF/IDF -- Prefer results with rare words versus results with common words • Amazon -- Biases towards what people are searching and buying recently. • Google -- Biases towards user activity, PageRank, and other factors. • Depends on what the customer intends and how they ask the question.
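To make the TF/IDF bullet concrete, here is one common textbook variant in Python: term frequency is the share of a document's words that are the query term, and inverse document frequency is the logarithm of (number of documents / documents containing the term). Many other weightings exist, and the sample documents and the name `tf_idf_scores` are invented for the example.

```python
import math
from collections import Counter


def tf_idf_scores(documents, query):
    """Score each document for the query with a basic TF-IDF sum."""
    tokenized = {name: text.lower().split() for name, text in documents.items()}
    terms = set(query.lower().split())
    # Document frequency: in how many documents does each query term appear?
    doc_freq = {term: sum(1 for words in tokenized.values() if term in words)
                for term in terms}
    scores = {}
    for name, words in tokenized.items():
        counts = Counter(words)
        score = 0.0
        for term in terms:
            if doc_freq[term] == 0 or not words:
                continue
            tf = counts[term] / len(words)
            # A word found in every document gets log(1) = 0, so it adds nothing.
            idf = math.log(len(tokenized) / doc_freq[term])
            score += tf * idf
        scores[name] = score
    return scores


documents = {
    "seuss": "the cat in the hat",
    "park": "the dog in the park",
    "shop": "the hat shop",
}
scores = tf_idf_scores(documents, "the cat")
print(max(scores, key=scores.get))
```

Here "the" appears in every document and so contributes nothing, while the rarer "cat" decides the ranking. This covers only the first bullet; the Amazon and Google signals on the slide depend on behavioral data that a formula like this does not capture.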
  • 117. Last step: Get the text snippet. • You have searched across millions of books. • You have found the “Best” books with the words “cat in the hat”. • You have spent 50 msec across hundreds of machines to get the right result. • How do you find the “snippet” on the page?
  • 118. Snippets Excerpts
  • 119. Get snippet using simple search • Fetch the book page from a different disk. • Use a simple linear search like Naive or Boyer-Moore to get snippet and surrounding text. • Simple techniques applied across more machines.
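A snippet extractor can be sketched in a few lines of Python. The built-in `str.find` stands in here for the naive or Boyer-Moore search on the fetched page; the function name `make_snippet`, the `context` parameter, and the sample page text are assumptions for the example.

```python
def make_snippet(page_text, phrase, context=30):
    """Return the phrase with up to `context` characters on each side, or None."""
    # A simple linear search over the one page that was fetched.
    start = page_text.lower().find(phrase.lower())
    if start == -1:
        return None
    begin = max(0, start - context)
    end = min(len(page_text), start + len(phrase) + context)
    # Mark the places where the page text was cut off.
    prefix = "..." if begin > 0 else ""
    suffix = "..." if end < len(page_text) else ""
    return prefix + page_text[begin:end] + suffix


page = "had we but world enough and time thy coyness lady were no crime"
print(make_snippet(page, "coyness", context=10))
```

Because the index has already narrowed the search to one page, a linear scan is cheap at this stage even though it would be far too slow across millions of books.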
  • 120. Future Trends • Disk space costs dropping --> More data • More networked devices --> More sharing • What would you do with: • All the web on your cell phone • All your family/friends instantly available
  • 121. Just scratching the surface. • Lucene search engine -- Open source. How to index and search results. http://lucene.apache.org/ • Google -- Presentations and research notes. http://research.google.com/video.html • http://www.searchenginehistory.com/