SlideShare a Scribd company logo
Web Technology Presentation
How search works in SEARCH ENGINE
Index
✔ Prelude
✔ How search engine works
✔ Search Engine Architecture
✔ Working
✔ Features(Web Crawling,Page Rank,etc.)
✔ Problems
✔ Future
✔ Conclusion
Prelude
✔ A Search Engine is an information retrieval system
designed to help find information stored on a computer
system, such as on the
● World Wide Web,
● inside a corporate or proprietary network,
● or in a personal computer.
✔ The search engine allows one to ask for content meeting
specific criteria (typically those containing a given word or
phrase) and retrieves a list of items that match those criteria.
How search engine works
✔ SEARCH STARTS WITH THE WEB. IT'S MADE UP OF OVER
60  TRILLION INDIVIDUAL PAGES AND IT'S CONSTANTLY
GROWING.
✔ Search engine navigates the web by:
● Web Crawling
● Indexing
● Searching
Architecture Overview
Working...
✔ There is an URL server that sends lists of URL to be
fetched to the crawlers.
✔ The web pages that are fetched are then sent to the store
server.
✔ The store server then compresses and stores the web
pages into a repository.
✔ Every web page has an associated ID number called a
doc ID which is assigned whenever a new URL is parsed
out of a web page.
Working...(cont.)
✔ The indexing function is performed by the indexer and the
sorter.
✔ The indexer performs a number of functions. It reads the
repository, uncompresses the documents, and parses them.
✔ The indexer distributes hits into a set of "barrels", creating a
partially sorted forward index.
✔ It parses out all the links in every web page and stores
important information about them in anchor files.
✔ Anchor file contains enough information to determine where
each link points from and to, and the text of the link.
Working...(cont.)
✔ The URL resolver reads the anchors file and converts
relative URL into absolute URL and in turn into doc ID.
✔ It puts the anchor text into the forward index, associated
with the doc ID that the anchor points to. It also
generates a database of links which are pairs of doc IDs.
The links database is used to compute Page Ranks for
all the documents.
✔ The sorter takes the barrels, which are sorted by doc ID
and resorts them by word ID to generate the inverted
index.
Working...(cont.)
✔ A program called Dump Lexicon takes this list together
with the lexicon produced by the indexer and generates a
new lexicon to be used by the searcher.
✔ The searcher is run by a web server and uses the lexicon
built by Dump Lexicon together with the inverted index
and the Page Ranks to answer queries.
Crawling the Web (Web Crawler)
✔ A web crawler or spider is used to create an index of web pages to
be used by a web search engine.
✔ Web search is based on this index.
✔ Search engines has a fast distributed crawling system.
✔ A single URL server serves lists of URL to a number of crawlers.
✔ Both the URL server and the crawlers are implemented in Python.
✔ Each crawler keeps roughly 300 connections open at once. At peak
speeds, the system can crawl over 100 web pages per second using
four crawlers. This amounts to roughly 600K per second of data.
✔ Each crawler maintains its own DNS cache so it does not need to do
a DNS lookup before crawling each document.
Crawling the Web
Indexing the Web
✔ Parsing
◦ Any parser which is designed to run on the entire Web must handle a huge
array of possible errors.
✔
Indexing Documents into Barrels
◦ After each document is parsed, it is encoded into a number of barrels. Every
word is converted into a word ID by using an in-memory hash table -- the
lexicon.
◦ Once the words are converted into word ID, their occurrences in the current
document are translated into hit lists and are written into the forward barrels.
✔ Sorting
◦ the sorter takes each of the forward barrels and sorts it by word ID to produce
an inverted barrel for title and anchor hits and a full text inverted barrel.
Indexing the Web
Web Searching
✔ Match the query terms with words in the index
✔ Sort documents by relevance
✔ Find files or records
✔ Open each one and read it
✔ Store each word in a search able index
✔ Kinds of information
✔ Dates
✔ Languages
✔ Update processes
Web Searching
Other System Features
✔ It makes use of the link structure of the Web to calculate
a quality ranking for each web page, called Page Rank
◦ Page Rank is a trademark of Google. The Page Rank
process has been patented.
✔ Search Engine utilizes link to improve search results.
Page Rank
✔ Page Rank is a link analysis algorithm which assigns a
numerical weighting to each Web page, with the purpose
of "measuring" relative importance.
Based on the hyper links map
An excellent way to prioritize the results of web keyword
searches
Intuitive Justification
✔ A "random surfer" who is given a web page at random and keeps
clicking on links, never hitting "back“, but eventually gets bored
and starts on another random page.
◦ The probability that the random surfer visits a page is its Page
Rank.
◦ The d damping factor is the probability at each page the "random
surfer" will get bored and request another random page.
✔ A page can have a high Page Rank
◦ If there are many pages that point to it
◦ Or if there are some pages that point to it, and have a high Page
Rank.
Anchor Text
● <A href =" http://www.yahoo.com/">Yahoo!</A>
Besides the text of a hyper link (anchor text) is
associated with the page that the link is on, it is also
associated with the page the link points to.
◦ anchors often provide more accurate descriptions of
web pages than the pages themselves.
◦ anchors may exist for documents which cannot be
indexed by a text-based search engine, such as images,
programs, and databases.
Major Data Structures
✔
Big Files
 virtual files spanning multiple file systems and are addressable by 64 bit integers.
✔ Repository
 contains the full HTML of every web page.
✔ Document Index
 keeps information about each document.
✔
Lexicon
 two parts – a list of the words and a hash table of pointers.
✔
Hit Lists
 a list of occurrences of a particular word in a particular document including position, font, and
capitalization information.
✔
Forward Index
 stored in a number of barrels
✔
Inverted Index
 consists of the same barrels as the forward index, except that they have been processed by the sorter.
Search is NOT a Panacea
✔ Search can’t find what’s not there
◦ The content is hugely important
✔ Information Architecture is vital
✔ Usable sites have good navigation and structure
Problems
✔ High Quality Search
The biggest problem facing users of web search engines
today is the quality of the results they get back.
✔ Scalable Architecture
Google is designed to scale. It must be efficient in both
space and time
The Future
“The ultimate search engine would
understand exactly what you mean
and give back exactly what you
want.”
- Larry Page
Conclusion
✔ The primary goal is to provide high quality search results
over a rapidly growing World Wide Web.
✔ Search engines employs a number of techniques to
improve search quality including page rank, anchor text,
and proximity information.
✔ Search engines is a complete architecture for gathering
web pages, indexing them, and performing search
queries over them.
And thats how search works...
● THANK YOU
Presented by:Presented by:
SAURABH GOELSAURABH GOEL
111438111438

More Related Content

Viewers also liked

Sita an illustrated retelling of the ramayana devdutt pattanaik
Sita  an illustrated retelling of the ramayana   devdutt pattanaikSita  an illustrated retelling of the ramayana   devdutt pattanaik
Sita an illustrated retelling of the ramayana devdutt pattanaik
saurabh goel
 
Basic about-router
Basic about-routerBasic about-router
Basic about-router
saurabh goel
 
Dynamic routing
Dynamic routingDynamic routing
Dynamic routing
saurabh goel
 
Iaf presentation
Iaf presentationIaf presentation
Iaf presentation
saurabh goel
 
Homeloans
HomeloansHomeloans
Homeloans
saurabh goel
 
Последний звонок 2013
Последний звонок 2013Последний звонок 2013
Последний звонок 2013sch86
 
Don't lose your mind lose your weight rujuta diwekar
Don't lose your mind lose your weight rujuta diwekarDon't lose your mind lose your weight rujuta diwekar
Don't lose your mind lose your weight rujuta diwekar
saurabh goel
 
Medical malpractices
Medical malpracticesMedical malpractices
Medical malpractices
saurabh goel
 
Medical malpractices (1)
Medical malpractices (1)Medical malpractices (1)
Medical malpractices (1)
saurabh goel
 
Interview presentation
Interview presentationInterview presentation
Interview presentation
saurabh goel
 
Secrets of success
Secrets of success Secrets of success
Secrets of success
saurabh goel
 
Последний звонок 2011
Последний звонок 2011Последний звонок 2011
Последний звонок 2011sch86
 

Viewers also liked (12)

Sita an illustrated retelling of the ramayana devdutt pattanaik
Sita  an illustrated retelling of the ramayana   devdutt pattanaikSita  an illustrated retelling of the ramayana   devdutt pattanaik
Sita an illustrated retelling of the ramayana devdutt pattanaik
 
Basic about-router
Basic about-routerBasic about-router
Basic about-router
 
Dynamic routing
Dynamic routingDynamic routing
Dynamic routing
 
Iaf presentation
Iaf presentationIaf presentation
Iaf presentation
 
Homeloans
HomeloansHomeloans
Homeloans
 
Последний звонок 2013
Последний звонок 2013Последний звонок 2013
Последний звонок 2013
 
Don't lose your mind lose your weight rujuta diwekar
Don't lose your mind lose your weight rujuta diwekarDon't lose your mind lose your weight rujuta diwekar
Don't lose your mind lose your weight rujuta diwekar
 
Medical malpractices
Medical malpracticesMedical malpractices
Medical malpractices
 
Medical malpractices (1)
Medical malpractices (1)Medical malpractices (1)
Medical malpractices (1)
 
Interview presentation
Interview presentationInterview presentation
Interview presentation
 
Secrets of success
Secrets of success Secrets of success
Secrets of success
 
Последний звонок 2011
Последний звонок 2011Последний звонок 2011
Последний звонок 2011
 

Similar to Web Search Engine

Web Mining.pptx
Web Mining.pptxWeb Mining.pptx
Web Mining.pptx
ScrbifPt
 
Googling of GooGle
Googling of GooGleGoogling of GooGle
Googling of GooGle
binit singh
 
DC presentation 1
DC presentation 1DC presentation 1
DC presentation 1
Harini Sirisena
 
Search engine and web crawler
Search engine and web crawlerSearch engine and web crawler
Search engine and web crawler
vinay arora
 
How search engine work ppt
How search engine work pptHow search engine work ppt
How search engine work ppt
Shubham Chinchkar
 
How Google Works
How Google WorksHow Google Works
How Google Works
Ganesh Solanke
 
CAB 2.pptx
CAB 2.pptxCAB 2.pptx
Basics of SEO
Basics of SEO Basics of SEO
Basics of SEO
Tishone Robson
 
Modern web search: Lecture 11
Modern web search: Lecture 11Modern web search: Lecture 11
Modern web search: Lecture 11
Artificial Intelligence Institute at UofSC
 
Modern web search: Web Information Systems
Modern web search: Web Information SystemsModern web search: Web Information Systems
Modern web search: Web Information Systems
Artificial Intelligence Institute at UofSC
 
Google Paper
Google Paper Google Paper
Google Paper
girish1m
 
Get the best Seo training in Pune at brainmine.
Get the best Seo training in Pune at brainmine.Get the best Seo training in Pune at brainmine.
Get the best Seo training in Pune at brainmine.
Seo Brainmine
 
Notes for
Notes forNotes for
Notes for
9pallen
 
Search Engine Optimization - Fundamentals - SEO
Search Engine Optimization - Fundamentals - SEOSearch Engine Optimization - Fundamentals - SEO
Search Engine Optimization - Fundamentals - SEO
Neeraj Reddy
 
Search Engine
Search EngineSearch Engine
Search Engine
Ram Dutt Shukla
 
Trending Update How Search Engine Work
Trending Update How Search Engine WorkTrending Update How Search Engine Work
Trending Update How Search Engine Work
AyeshaFerry
 
Introduction to SEO Basics
Introduction to SEO BasicsIntroduction to SEO Basics
Introduction to SEO Basics
Jenifer Renjini
 
Seminar on crawler
Seminar on crawlerSeminar on crawler
Seminar on crawler
Sanjeev Kumar Jaiswal
 
Web Search Engine
Web Search EngineWeb Search Engine
Web Search Engine
Chidanand Byahatti
 
page ranking web crawling
page ranking web crawlingpage ranking web crawling
page ranking web crawling
pradiprahul
 

Similar to Web Search Engine (20)

Web Mining.pptx
Web Mining.pptxWeb Mining.pptx
Web Mining.pptx
 
Googling of GooGle
Googling of GooGleGoogling of GooGle
Googling of GooGle
 
DC presentation 1
DC presentation 1DC presentation 1
DC presentation 1
 
Search engine and web crawler
Search engine and web crawlerSearch engine and web crawler
Search engine and web crawler
 
How search engine work ppt
How search engine work pptHow search engine work ppt
How search engine work ppt
 
How Google Works
How Google WorksHow Google Works
How Google Works
 
CAB 2.pptx
CAB 2.pptxCAB 2.pptx
CAB 2.pptx
 
Basics of SEO
Basics of SEO Basics of SEO
Basics of SEO
 
Modern web search: Lecture 11
Modern web search: Lecture 11Modern web search: Lecture 11
Modern web search: Lecture 11
 
Modern web search: Web Information Systems
Modern web search: Web Information SystemsModern web search: Web Information Systems
Modern web search: Web Information Systems
 
Google Paper
Google Paper Google Paper
Google Paper
 
Get the best Seo training in Pune at brainmine.
Get the best Seo training in Pune at brainmine.Get the best Seo training in Pune at brainmine.
Get the best Seo training in Pune at brainmine.
 
Notes for
Notes forNotes for
Notes for
 
Search Engine Optimization - Fundamentals - SEO
Search Engine Optimization - Fundamentals - SEOSearch Engine Optimization - Fundamentals - SEO
Search Engine Optimization - Fundamentals - SEO
 
Search Engine
Search EngineSearch Engine
Search Engine
 
Trending Update How Search Engine Work
Trending Update How Search Engine WorkTrending Update How Search Engine Work
Trending Update How Search Engine Work
 
Introduction to SEO Basics
Introduction to SEO BasicsIntroduction to SEO Basics
Introduction to SEO Basics
 
Seminar on crawler
Seminar on crawlerSeminar on crawler
Seminar on crawler
 
Web Search Engine
Web Search EngineWeb Search Engine
Web Search Engine
 
page ranking web crawling
page ranking web crawlingpage ranking web crawling
page ranking web crawling
 

Recently uploaded

Open Channel Flow: fluid flow with a free surface
Open Channel Flow: fluid flow with a free surfaceOpen Channel Flow: fluid flow with a free surface
Open Channel Flow: fluid flow with a free surface
Indrajeet sahu
 
1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf
1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf
1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf
MadhavJungKarki
 
一比一原版(uoft毕业证书)加拿大多伦多大学毕业证如何办理
一比一原版(uoft毕业证书)加拿大多伦多大学毕业证如何办理一比一原版(uoft毕业证书)加拿大多伦多大学毕业证如何办理
一比一原版(uoft毕业证书)加拿大多伦多大学毕业证如何办理
sydezfe
 
Mechanical Engineering on AAI Summer Training Report-003.pdf
Mechanical Engineering on AAI Summer Training Report-003.pdfMechanical Engineering on AAI Summer Training Report-003.pdf
Mechanical Engineering on AAI Summer Training Report-003.pdf
21UME003TUSHARDEB
 
Object Oriented Analysis and Design - OOAD
Object Oriented Analysis and Design - OOADObject Oriented Analysis and Design - OOAD
Object Oriented Analysis and Design - OOAD
PreethaV16
 
P5 Working Drawings.pdf floor plan, civil
P5 Working Drawings.pdf floor plan, civilP5 Working Drawings.pdf floor plan, civil
P5 Working Drawings.pdf floor plan, civil
AnasAhmadNoor
 
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODELDEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
ijaia
 
SENTIMENT ANALYSIS ON PPT AND Project template_.pptx
SENTIMENT ANALYSIS ON PPT AND Project template_.pptxSENTIMENT ANALYSIS ON PPT AND Project template_.pptx
SENTIMENT ANALYSIS ON PPT AND Project template_.pptx
b0754201
 
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
DharmaBanothu
 
Call For Paper -3rd International Conference on Artificial Intelligence Advan...
Call For Paper -3rd International Conference on Artificial Intelligence Advan...Call For Paper -3rd International Conference on Artificial Intelligence Advan...
Call For Paper -3rd International Conference on Artificial Intelligence Advan...
ijseajournal
 
Unit -II Spectroscopy - EC I B.Tech.pdf
Unit -II Spectroscopy - EC  I B.Tech.pdfUnit -II Spectroscopy - EC  I B.Tech.pdf
Unit -II Spectroscopy - EC I B.Tech.pdf
TeluguBadi
 
Supermarket Management System Project Report.pdf
Supermarket Management System Project Report.pdfSupermarket Management System Project Report.pdf
Supermarket Management System Project Report.pdf
Kamal Acharya
 
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
nedcocy
 
Tools & Techniques for Commissioning and Maintaining PV Systems W-Animations ...
Tools & Techniques for Commissioning and Maintaining PV Systems W-Animations ...Tools & Techniques for Commissioning and Maintaining PV Systems W-Animations ...
Tools & Techniques for Commissioning and Maintaining PV Systems W-Animations ...
Transcat
 
Accident detection system project report.pdf
Accident detection system project report.pdfAccident detection system project report.pdf
Accident detection system project report.pdf
Kamal Acharya
 
Assistant Engineer (Chemical) Interview Questions.pdf
Assistant Engineer (Chemical) Interview Questions.pdfAssistant Engineer (Chemical) Interview Questions.pdf
Assistant Engineer (Chemical) Interview Questions.pdf
Seetal Daas
 
Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...
Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...
Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...
PriyankaKilaniya
 
OOPS_Lab_Manual - programs using C++ programming language
OOPS_Lab_Manual - programs using C++ programming languageOOPS_Lab_Manual - programs using C++ programming language
OOPS_Lab_Manual - programs using C++ programming language
PreethaV16
 
ITSM Integration with MuleSoft.pptx
ITSM  Integration with MuleSoft.pptxITSM  Integration with MuleSoft.pptx
ITSM Integration with MuleSoft.pptx
VANDANAMOHANGOUDA
 
An Introduction to the Compiler Designss
An Introduction to the Compiler DesignssAn Introduction to the Compiler Designss
An Introduction to the Compiler Designss
ElakkiaU
 

Recently uploaded (20)

Open Channel Flow: fluid flow with a free surface
Open Channel Flow: fluid flow with a free surfaceOpen Channel Flow: fluid flow with a free surface
Open Channel Flow: fluid flow with a free surface
 
1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf
1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf
1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf
 
一比一原版(uoft毕业证书)加拿大多伦多大学毕业证如何办理
一比一原版(uoft毕业证书)加拿大多伦多大学毕业证如何办理一比一原版(uoft毕业证书)加拿大多伦多大学毕业证如何办理
一比一原版(uoft毕业证书)加拿大多伦多大学毕业证如何办理
 
Mechanical Engineering on AAI Summer Training Report-003.pdf
Mechanical Engineering on AAI Summer Training Report-003.pdfMechanical Engineering on AAI Summer Training Report-003.pdf
Mechanical Engineering on AAI Summer Training Report-003.pdf
 
Object Oriented Analysis and Design - OOAD
Object Oriented Analysis and Design - OOADObject Oriented Analysis and Design - OOAD
Object Oriented Analysis and Design - OOAD
 
P5 Working Drawings.pdf floor plan, civil
P5 Working Drawings.pdf floor plan, civilP5 Working Drawings.pdf floor plan, civil
P5 Working Drawings.pdf floor plan, civil
 
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODELDEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
 
SENTIMENT ANALYSIS ON PPT AND Project template_.pptx
SENTIMENT ANALYSIS ON PPT AND Project template_.pptxSENTIMENT ANALYSIS ON PPT AND Project template_.pptx
SENTIMENT ANALYSIS ON PPT AND Project template_.pptx
 
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
 
Call For Paper -3rd International Conference on Artificial Intelligence Advan...
Call For Paper -3rd International Conference on Artificial Intelligence Advan...Call For Paper -3rd International Conference on Artificial Intelligence Advan...
Call For Paper -3rd International Conference on Artificial Intelligence Advan...
 
Unit -II Spectroscopy - EC I B.Tech.pdf
Unit -II Spectroscopy - EC  I B.Tech.pdfUnit -II Spectroscopy - EC  I B.Tech.pdf
Unit -II Spectroscopy - EC I B.Tech.pdf
 
Supermarket Management System Project Report.pdf
Supermarket Management System Project Report.pdfSupermarket Management System Project Report.pdf
Supermarket Management System Project Report.pdf
 
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
 
Tools & Techniques for Commissioning and Maintaining PV Systems W-Animations ...
Tools & Techniques for Commissioning and Maintaining PV Systems W-Animations ...Tools & Techniques for Commissioning and Maintaining PV Systems W-Animations ...
Tools & Techniques for Commissioning and Maintaining PV Systems W-Animations ...
 
Accident detection system project report.pdf
Accident detection system project report.pdfAccident detection system project report.pdf
Accident detection system project report.pdf
 
Assistant Engineer (Chemical) Interview Questions.pdf
Assistant Engineer (Chemical) Interview Questions.pdfAssistant Engineer (Chemical) Interview Questions.pdf
Assistant Engineer (Chemical) Interview Questions.pdf
 
Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...
Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...
Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...
 
OOPS_Lab_Manual - programs using C++ programming language
OOPS_Lab_Manual - programs using C++ programming languageOOPS_Lab_Manual - programs using C++ programming language
OOPS_Lab_Manual - programs using C++ programming language
 
ITSM Integration with MuleSoft.pptx
ITSM  Integration with MuleSoft.pptxITSM  Integration with MuleSoft.pptx
ITSM Integration with MuleSoft.pptx
 
An Introduction to the Compiler Designss
An Introduction to the Compiler DesignssAn Introduction to the Compiler Designss
An Introduction to the Compiler Designss
 

Web Search Engine

  • 1. Web Technology Presentation How search works in SEARCH ENGINE
  • 2. Index ✔ Prelude ✔ How search engine works ✔ Search Engine Architecture ✔ Working ✔ Features(Web Crawling,Page Rank,etc.) ✔ Problems ✔ Future ✔ Conclusion
  • 3. Prelude ✔ A Search Engine is an information retrieval system designed to help find information stored on a computer system, such as on the ● World Wide Web, ● inside a corporate or proprietary network, ● or in a personal computer. ✔ The search engine allows one to ask for content meeting specific criteria (typically those containing a given word or phrase) and retrieves a list of items that match those criteria.
  • 4. How search engine works ✔ SEARCH STARTS WITH THE WEB. IT'S MADE UP OF OVER 60  TRILLION INDIVIDUAL PAGES AND IT'S CONSTANTLY GROWING. ✔ Search engine navigates the web by: ● Web Crawling ● Indexing ● Searching
  • 6. Working... ✔ There is an URL server that sends lists of URL to be fetched to the crawlers. ✔ The web pages that are fetched are then sent to the store server. ✔ The store server then compresses and stores the web pages into a repository. ✔ Every web page has an associated ID number called a doc ID which is assigned whenever a new URL is parsed out of a web page.
  • 7. Working...(cont.) ✔ The indexing function is performed by the indexer and the sorter. ✔ The indexer performs a number of functions. It reads the repository, uncompresses the documents, and parses them. ✔ The indexer distributes hits into a set of "barrels", creating a partially sorted forward index. ✔ It parses out all the links in every web page and stores important information about them in anchor files. ✔ Anchor file contains enough information to determine where each link points from and to, and the text of the link.
  • 8. Working...(cont.) ✔ The URL resolver reads the anchors file and converts relative URL into absolute URL and in turn into doc ID. ✔ It puts the anchor text into the forward index, associated with the doc ID that the anchor points to. It also generates a database of links which are pairs of doc IDs. The links database is used to compute Page Ranks for all the documents. ✔ The sorter takes the barrels, which are sorted by doc ID and resorts them by word ID to generate the inverted index.
  • 9. Working...(cont.) ✔ A program called Dump Lexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher. ✔ The searcher is run by a web server and uses the lexicon built by Dump Lexicon together with the inverted index and the Page Ranks to answer queries.
  • 10. Crawling the Web (Web Crawler) ✔ A web crawler or spider is used to create an index of web pages to be used by a web search engine. ✔ Web search is based on this index. ✔ Search engines has a fast distributed crawling system. ✔ A single URL server serves lists of URL to a number of crawlers. ✔ Both the URL server and the crawlers are implemented in Python. ✔ Each crawler keeps roughly 300 connections open at once. At peak speeds, the system can crawl over 100 web pages per second using four crawlers. This amounts to roughly 600K per second of data. ✔ Each crawler maintains its own DNS cache so it does not need to do a DNS lookup before crawling each document.
  • 12. Indexing the Web ✔ Parsing ◦ Any parser which is designed to run on the entire Web must handle a huge array of possible errors. ✔ Indexing Documents into Barrels ◦ After each document is parsed, it is encoded into a number of barrels. Every word is converted into a word ID by using an in-memory hash table -- the lexicon. ◦ Once the words are converted into word ID, their occurrences in the current document are translated into hit lists and are written into the forward barrels. ✔ Sorting ◦ the sorter takes each of the forward barrels and sorts it by word ID to produce an inverted barrel for title and anchor hits and a full text inverted barrel.
  • 14. Web Searching ✔ Match the query terms with words in the index ✔ Sort documents by relevance ✔ Find files or records ✔ Open each one and read it ✔ Store each word in a search able index ✔ Kinds of information ✔ Dates ✔ Languages ✔ Update processes
  • 16. Other System Features ✔ It makes use of the link structure of the Web to calculate a quality ranking for each web page, called Page Rank ◦ Page Rank is a trademark of Google. The Page Rank process has been patented. ✔ Search Engine utilizes link to improve search results.
  • 17. Page Rank ✔ Page Rank is a link analysis algorithm which assigns a numerical weighting to each Web page, with the purpose of "measuring" relative importance. Based on the hyper links map An excellent way to prioritize the results of web keyword searches
  • 18. Intuitive Justification ✔ A "random surfer" who is given a web page at random and keeps clicking on links, never hitting "back“, but eventually gets bored and starts on another random page. ◦ The probability that the random surfer visits a page is its Page Rank. ◦ The d damping factor is the probability at each page the "random surfer" will get bored and request another random page. ✔ A page can have a high Page Rank ◦ If there are many pages that point to it ◦ Or if there are some pages that point to it, and have a high Page Rank.
  • 19. Anchor Text ● <A href =" http://www.yahoo.com/">Yahoo!</A> Besides the text of a hyper link (anchor text) is associated with the page that the link is on, it is also associated with the page the link points to. ◦ anchors often provide more accurate descriptions of web pages than the pages themselves. ◦ anchors may exist for documents which cannot be indexed by a text-based search engine, such as images, programs, and databases.
  • 20. Major Data Structures ✔ Big Files  virtual files spanning multiple file systems and are addressable by 64 bit integers. ✔ Repository  contains the full HTML of every web page. ✔ Document Index  keeps information about each document. ✔ Lexicon  two parts – a list of the words and a hash table of pointers. ✔ Hit Lists  a list of occurrences of a particular word in a particular document including position, font, and capitalization information. ✔ Forward Index  stored in a number of barrels ✔ Inverted Index  consists of the same barrels as the forward index, except that they have been processed by the sorter.
  • 21. Search is NOT a Panacea ✔ Search can’t find what’s not there ◦ The content is hugely important ✔ Information Architecture is vital ✔ Usable sites have good navigation and structure
  • 22. Problems ✔ High Quality Search The biggest problem facing users of web search engines today is the quality of the results they get back. ✔ Scalable Architecture Google is designed to scale. It must be efficient in both space and time
  • 23. The Future “The ultimate search engine would understand exactly what you mean and give back exactly what you want.” - Larry Page
  • 24. Conclusion ✔ The primary goal is to provide high quality search results over a rapidly growing World Wide Web. ✔ Search engines employs a number of techniques to improve search quality including page rank, anchor text, and proximity information. ✔ Search engines is a complete architecture for gathering web pages, indexing them, and performing search queries over them.
  • 25. And thats how search works... ● THANK YOU Presented by:Presented by: SAURABH GOELSAURABH GOEL 111438111438