Search Engines
Mahesh Sharma (CE/10/1158)
Computer Science Department
Today's Coverage
● Introduction
● Types of Search Engines
● Components of a Search Engine
● Semantics and Relevancy
● Search Engine Optimization
Introduction
• A web search engine is a software system
designed to search for information on
the World Wide Web. The search results are
generally presented as a list, often
referred to as search engine results pages (SERPs).
• Search engines look through their own
databases of indexed pages to find what
you are looking for.
Types of Search Engines
● Crawler Powered Indexes
– Guruji.com, Google.com
● Human Powered Indexes
– www.dmoz.org
● Hybrid Models
– Submitted URLs to a search engine?
● Semantic Indexes
– Hakia.com
Copyleft (ɔ) 2009 Sudarsun Santhiappan 5
How Does a Search Engine Work?
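[Figure: How Search Engines Work (Sherman 2003). A crawler fetches URL1 to URL4 from the Web, the indexer builds the search engine database, and the query "Eggs?" sent from your browser returns ranked matches: Eggs 90%, Eggo 81%, Ego 40%, Huh? 10%. The example page is "All About Eggs" by S. I. Am.]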
Search Engine Internals
● Crawlers
● Indexers
● Searching
● Semantics
● Ranking
Crawlers
• A crawler is a program that visits Web sites
and reads their pages and other information
in order to create entries for a search
engine index. The major search engines on the
Web all have such a program, which is also
known as a "spider" or a "bot."
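The crawling loop described above can be sketched in a few lines. This is a minimal illustration, not how any particular search engine is implemented: the names `LinkExtractor`, `crawl`, and `FAKE_WEB` are hypothetical, and an in-memory dictionary stands in for real HTTP requests so the example runs without a network.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(fetch, seed_urls, max_pages=100):
    """Breadth-first crawl; returns {url: html} for every page visited."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # unreachable or missing page
            continue
        pages[url] = html  # this is what would be handed to the indexer
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                queue.append(link)
    return pages


# Hypothetical in-memory "web" standing in for real HTTP fetches.
FAKE_WEB = {
    "URL1": '<a href="URL2">two</a> <a href="URL3">three</a>',
    "URL2": '<a href="URL1">one</a>',
    "URL3": '<a href="URL4">four</a>',
    "URL4": "no links here",
}

pages = crawl(FAKE_WEB.get, ["URL1"])
print(sorted(pages))
```

A real crawler would also need politeness delays, robots.txt handling, and URL normalization, none of which are shown here.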
Indexers
• A database index is a data structure that
improves the speed of data retrieval
operations on a database table at the cost of
additional writes and the use of more storage
space to maintain the extra copy of data.
Semantics
• Semantics is the study of meaning. It focuses
on the relation between signifiers, such as
words, phrases, signs, and symbols, and what
they stand for (their denotation). It is
used to understand human expression through
language.
Inverted Indexes
How Inverted Files Are Created
● Periodically rebuilt, static otherwise.
● Documents are parsed to extract
tokens. These are saved with the
Document ID.
Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight
Term Doc #
now 1
is 1
the 1
time 1
for 1
all 1
good 1
men 1
to 1
come 1
to 1
the 1
aid 1
of 1
their 1
country 1
it 2
was 2
a 2
dark 2
and 2
stormy 2
night 2
in 2
the 2
country 2
manor 2
the 2
time 2
was 2
past 2
midnight 2
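The parsing step can be reproduced with a short script. This is a minimal sketch that assumes a very simple tokenizer (lowercase, runs of letters and digits only); the names `DOCS`, `tokenize`, and `extract_pairs` are illustrative and not part of the original slides.

```python
import re

# The two example documents from the slide, keyed by document ID.
DOCS = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. "
       "The time was past midnight",
}


def tokenize(text):
    """Lowercase the text and return its alphanumeric tokens in order."""
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_pairs(docs):
    """Return one (term, doc_id) pair per token occurrence, in document order."""
    pairs = []
    for doc_id in sorted(docs):
        for term in tokenize(docs[doc_id]):
            pairs.append((term, doc_id))
    return pairs


pairs = extract_pairs(DOCS)
for term, doc_id in pairs[:4]:
    print(term, doc_id)
```

Repeated words such as "the" and "to" appear once per occurrence at this stage, exactly as in the table above.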
How Inverted Files Are Created
● After all documents have been parsed, the inverted file is sorted alphabetically.
Term Doc #
a 2
aid 1
all 1
and 2
come 1
country 1
country 2
dark 2
for 1
good 1
in 2
is 1
it 2
manor 2
men 1
midnight 2
night 2
now 1
of 1
past 2
stormy 2
the 1
the 1
the 2
the 2
their 1
time 1
time 2
to 1
to 1
was 2
was 2
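The sorting step is a plain lexicographic sort of the (term, document ID) pairs. A minimal sketch on a small, hand-picked subset of the pairs (the subset is an assumption made to keep the example short):

```python
# A few (term, doc_id) pairs in the order they were extracted.
pairs = [
    ("now", 1), ("is", 1), ("the", 1), ("time", 1),
    ("the", 2), ("country", 2), ("the", 1), ("country", 1),
]

# Tuples compare element by element, so this sorts by term first
# and by document ID second.
sorted_pairs = sorted(pairs)

for term, doc_id in sorted_pairs:
    print(term, doc_id)
```

After sorting, all entries for the same term sit next to each other, which is what makes the later merging step a single linear pass.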
How Inverted Files Are Created
● Multiple term entries for a single document are merged.
● Within-document term frequency information is compiled.
Term Doc # Freq
a 2 1
aid 1 1
all 1 1
and 2 1
come 1 1
country 1 1
country 2 1
dark 2 1
for 1 1
good 1 1
in 2 1
is 1 1
it 2 1
manor 2 1
men 1 1
midnight 2 1
night 2 1
now 1 1
of 1 1
past 2 1
stormy 2 1
the 1 2
the 2 2
their 1 1
time 1 1
time 2 1
to 1 2
was 2 2
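Merging duplicate entries and counting within-document frequency can be sketched with a counter. The input below is the tail of the sorted table (from "the" onward); using `collections.Counter` is one convenient choice, not necessarily how a production indexer does it.

```python
from collections import Counter

# Tail of the sorted (term, doc_id) list, duplicates still present.
sorted_pairs = [
    ("the", 1), ("the", 1), ("the", 2), ("the", 2),
    ("their", 1), ("time", 1), ("time", 2),
    ("to", 1), ("to", 1), ("was", 2), ("was", 2),
]

# Count how often each (term, doc_id) pair occurs, then emit one
# (term, doc_id, freq) row per distinct pair, in sorted order.
freq = Counter(sorted_pairs)
merged = [(term, doc_id, n) for (term, doc_id), n in sorted(freq.items())]

for row in merged:
    print(*row)
```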
How Inverted Files are Created
● Finally, the file can be split into
– A Dictionary or Lexicon file
and
– A Postings file
How Inverted Files are Created
Postings
Doc # Freq
2 1
1 1
1 1
2 1
1 1
1 1
2 1
2 1
1 1
1 1
2 1
1 1
2 1
2 1
1 1
2 1
2 1
1 1
1 1
2 1
2 1
1 2
2 2
1 1
1 1
2 1
1 2
2 2
Dictionary/Lexicon
Term N docs Tot Freq
a 1 1
aid 1 1
all 1 1
and 1 1
come 1 1
country 2 2
dark 1 1
for 1 1
good 1 1
in 1 1
is 1 1
it 1 1
manor 1 1
men 1 1
midnight 1 1
night 1 1
now 1 1
of 1 1
past 1 1
stormy 1 1
the 2 4
their 1 1
time 2 2
to 1 2
was 1 2
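The split into a dictionary and a postings file can be sketched as follows. The `offset` field, which records where a term's postings start, is an assumption added here so the two structures can be linked back together; the slides show only the "N docs" and "Tot Freq" columns. The input is a hand-picked subset of the merged table.

```python
# A few merged (term, doc_id, freq) rows, already sorted by term.
merged = [
    ("country", 1, 1), ("country", 2, 1), ("dark", 2, 1),
    ("the", 1, 2), ("the", 2, 2), ("to", 1, 2),
]


def split_index(rows):
    """Split merged rows into a dictionary and a flat postings list."""
    dictionary = {}
    postings = []
    for term, doc_id, freq in rows:
        if term not in dictionary:
            # offset (hypothetical field): index of the term's first posting
            dictionary[term] = {"n_docs": 0, "tot_freq": 0,
                                "offset": len(postings)}
        entry = dictionary[term]
        entry["n_docs"] += 1
        entry["tot_freq"] += freq
        postings.append((doc_id, freq))
    return dictionary, postings


def postings_for(term, dictionary, postings):
    """Look a term up in the dictionary and slice out its postings."""
    entry = dictionary.get(term)
    if entry is None:
        return []
    start = entry["offset"]
    return postings[start:start + entry["n_docs"]]


dictionary, postings = split_index(merged)
print(dictionary["the"])
print(postings_for("the", dictionary, postings))
```

Keeping the dictionary small means it can often stay in memory, while the much larger postings list is read only for the terms a query actually mentions.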
Inverted Index
• In computer science, an inverted index (also referred
to as a postings file or inverted file) is an index data
structure storing a mapping from content, such as
words or numbers, to its locations in a database file, or
in a document or a set of documents. The purpose of
an inverted index is to allow fast full-text searches, at the
cost of increased processing when a document is
added to the database. The inverted file may be the
database file itself, rather than its index. It is the most
popular data structure used in document
retrieval systems and is used on a large scale, for example, in
search engines.
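To make the "fast full-text search" point concrete, here is a minimal sketch of an in-memory inverted index with an AND query. The function names `build_index` and `search` and the whitespace tokenizer are assumptions for illustration only; the two documents are the ones used in the earlier slides.

```python
from collections import defaultdict

DOCS = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. "
       "The time was past midnight",
}


def build_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().replace(".", " ").split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents containing every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())  # intersect posting sets
    return result


index = build_index(DOCS)
print(search(index, "time country"))
print(search(index, "dark night"))
```

A query only touches the posting sets of its own terms, so lookup cost depends on how common those terms are rather than on the total size of the collection.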
From a description of the FAST search engine, by Knut Risvik:
In this example, the data for the pages is partitioned across machines.
Additionally, each partition is allocated multiple machines to handle the queries.
● Each row can handle 120 queries per second.
● Each column can handle 7M pages.
● To handle more queries, add another row.
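The row and column figures above lend themselves to a quick sizing calculation. The sketch below assumes one machine per grid cell and uses made-up workload numbers (70M pages, 500 queries per second); only the 120 queries per second per row and 7M pages per column come from the slide.

```python
import math

QPS_PER_ROW = 120             # from the slide: one row serves 120 queries/s
PAGES_PER_COLUMN = 7_000_000  # from the slide: one column holds 7M pages


def grid_size(total_pages, target_qps):
    """Return (rows, columns, machines) needed for the given workload.

    Assumes one machine per (row, column) cell, which is an
    illustrative simplification.
    """
    columns = math.ceil(total_pages / PAGES_PER_COLUMN)  # index partitions
    rows = math.ceil(target_qps / QPS_PER_ROW)           # replicas
    return rows, columns, rows * columns


# Hypothetical workload: 70M pages at 500 queries per second.
rows, columns, machines = grid_size(70_000_000, 500)
print(rows, columns, machines)
```

Under these assumptions, columns scale with the size of the collection and rows scale with query traffic, so the two can be grown independently.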
PageRank
● Let A1, A2, …, An be the pages that point to
page A. Let C(P) be the number of links out of page
P. The PageRank (PR) of page A is defined
as:
PR(A) = (1-d) + d ( PR(A1)/C(A1) + … + PR(An)/C(An) )
where d is a damping factor between 0 and 1 (0.85 is a common choice).
● PageRank is the principal eigenvector of the
normalized link matrix of the web.
● It can be computed as the fixpoint of the
above equation.
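The fixpoint computation can be sketched by applying the PR(A) equation repeatedly until the values stop changing. This is a minimal illustration under stated assumptions: d is taken to be a damping factor set to 0.85, every page has at least one outgoing link, every link target is itself a key of the graph, and the three-page graph is made up for the example.

```python
def pagerank(links, d=0.85, tol=1e-10, max_iter=1000):
    """Iterate PR(A) = (1-d) + d * sum(PR(Ai)/C(Ai)) to a fixpoint.

    links maps each page to the list of pages it points to.
    """
    pages = list(links)
    out_count = {p: len(links[p]) for p in pages}  # C(P)
    inbound = {p: [] for p in pages}               # pages pointing to P
    for src, targets in links.items():
        for dst in targets:
            inbound[dst].append(src)

    pr = {p: 1.0 for p in pages}  # arbitrary starting values
    for _ in range(max_iter):
        new_pr = {
            p: (1 - d) + d * sum(pr[q] / out_count[q] for q in inbound[p])
            for p in pages
        }
        delta = max(abs(new_pr[p] - pr[p]) for p in pages)
        pr = new_pr
        if delta < tol:  # values have stopped changing
            break
    return pr


# Hypothetical three-page web: A links to B and C, B to C, C to A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

With this form of the equation and no dead-end pages, the ranks sum to the number of pages rather than to 1.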
Search Engine Optimization
