Web Technology Presentation
How Search Works in a Search Engine
Index
✔ Prelude
✔ How search engine works
✔ Search Engine Architecture
✔ Working
✔ Features (Web Crawling, Page Rank, etc.)
✔ Problems
✔ Future
✔ Conclusion
Prelude
✔ A Search Engine is an information retrieval system
designed to help find information stored on a computer
system, such as on the
● World Wide Web,
● inside a corporate or proprietary network,
● or in a personal computer.
✔ The search engine allows one to ask for content meeting
specific criteria (typically those containing a given word or
phrase) and retrieves a list of items that match those criteria.
How search engine works
✔ Search starts with the web. It is made up of over
60 trillion individual pages and is constantly
growing.
✔ A search engine navigates the web by:
● Web Crawling
● Indexing
● Searching
Architecture Overview
Working...
✔ A URL server sends lists of URLs to be
fetched to the crawlers.
✔ The web pages that are fetched are then sent to the store
server.
✔ The store server then compresses and stores the web
pages into a repository.
✔ Every web page has an associated ID number called a
doc ID which is assigned whenever a new URL is parsed
out of a web page.
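The storage path described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the actual system: the `Repository` class, its method names, the sequential doc ID policy, and the choice of zlib for compression are all made up for the example.

```python
import zlib


class Repository:
    """Toy store server: assigns doc IDs to URLs and keeps compressed pages."""

    def __init__(self):
        self.doc_ids = {}   # URL -> doc ID
        self.pages = {}     # doc ID -> compressed HTML bytes

    def doc_id_for(self, url):
        # A new doc ID is assigned the first time a URL is seen.
        if url not in self.doc_ids:
            self.doc_ids[url] = len(self.doc_ids)
        return self.doc_ids[url]

    def store(self, url, html):
        # Compress the fetched page and file it under its doc ID.
        doc_id = self.doc_id_for(url)
        self.pages[doc_id] = zlib.compress(html.encode("utf-8"))
        return doc_id

    def load(self, doc_id):
        # The indexer later reads the repository and uncompresses documents.
        return zlib.decompress(self.pages[doc_id]).decode("utf-8")


repo = Repository()
first = repo.store("http://example.com/", "<html><body>Hello</body></html>")
second = repo.store("http://example.com/about", "<html><body>About</body></html>")
```

Storing a URL a second time reuses its doc ID, so the mapping from URL to doc ID stays stable across recrawls in this sketch.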
Working...(cont.)
✔ The indexing function is performed by the indexer and the
sorter.
✔ The indexer performs a number of functions. It reads the
repository, uncompresses the documents, and parses them.
✔ The indexer distributes hits into a set of "barrels", creating a
partially sorted forward index.
✔ It parses out all the links in every web page and stores
important information about them in an anchors file.
✔ The anchors file contains enough information to determine where
each link points from and to, and the text of the link.
Working...(cont.)
✔ The URL resolver reads the anchors file and converts
relative URLs into absolute URLs and in turn into doc IDs.
✔ It puts the anchor text into the forward index, associated
with the doc ID that the anchor points to. It also
generates a database of links which are pairs of doc IDs.
The links database is used to compute Page Ranks for
all the documents.
✔ The sorter takes the barrels, which are sorted by doc ID,
and resorts them by word ID to generate the inverted
index.
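The URL resolver step can be illustrated with a short Python sketch. It assumes anchor records of the form (source URL, href, anchor text) and a plain dictionary for doc IDs; the function name `resolve_anchors` and the record layout are hypothetical and chosen only for this example.

```python
from urllib.parse import urljoin


def resolve_anchors(anchors, doc_ids):
    """Turn (source URL, href, anchor text) records into doc-ID link pairs.

    Returns the links database (pairs of doc IDs) and the anchor text
    grouped under the doc ID that each anchor points to.
    """
    links = []
    anchor_text = {}
    for source_url, href, text in anchors:
        target_url = urljoin(source_url, href)   # relative -> absolute URL
        # Assign doc IDs on first sight (an assumed policy for this sketch).
        src = doc_ids.setdefault(source_url, len(doc_ids))
        dst = doc_ids.setdefault(target_url, len(doc_ids))
        links.append((src, dst))
        anchor_text.setdefault(dst, []).append(text)
    return links, anchor_text


anchors = [
    ("http://example.com/a/index.html", "../b.html", "page B"),
    ("http://example.com/b.html", "/a/index.html", "page A"),
]
doc_ids = {}
links, anchor_text = resolve_anchors(anchors, doc_ids)
```

The `links` list is the input a Page Rank computation would need, and `anchor_text` is what would be added to the forward index for the target documents.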
Working...(cont.)
✔ The sorter also produces a list of word IDs and offsets into
the inverted index. A program called Dump Lexicon takes this list
together with the lexicon produced by the indexer and generates a
new lexicon to be used by the searcher.
✔ The searcher is run by a web server and uses the lexicon
built by Dump Lexicon together with the inverted index
and the Page Ranks to answer queries.
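How the searcher combines these three inputs can be shown with a small Python sketch. It assumes the lexicon maps words to word IDs, the inverted index maps word IDs to lists of doc IDs, and results are ordered by Page Rank alone; a real engine would combine many more signals, and the `search` function here is hypothetical.

```python
def search(query, lexicon, inverted_index, page_rank):
    """Return doc IDs containing every query word, best Page Rank first."""
    postings = []
    for word in query.lower().split():
        word_id = lexicon.get(word)
        if word_id is None:
            return []            # a word missing from the lexicon matches nothing
        postings.append(set(inverted_index.get(word_id, ())))
    if not postings:
        return []                # empty query
    matches = set.intersection(*postings)
    # Higher Page Rank first; ties are broken by doc ID for a stable order.
    return sorted(matches, key=lambda d: (-page_rank.get(d, 0.0), d))


lexicon = {"web": 0, "search": 1, "engine": 2}
inverted_index = {0: [1, 2, 3], 1: [2, 3], 2: [3]}
page_rank = {1: 0.2, 2: 0.5, 3: 0.3}
results = search("web search", lexicon, inverted_index, page_rank)
```

Documents 2 and 3 contain both query words, and document 2 comes first because its Page Rank is higher.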
Crawling the Web (Web Crawler)
✔ A web crawler or spider is used to create an index of web pages to
be used by a web search engine.
✔ Web search is based on this index.
✔ The search engine has a fast distributed crawling system.
✔ A single URL server serves lists of URLs to a number of crawlers.
✔ Both the URL server and the crawlers are implemented in Python.
✔ Each crawler keeps roughly 300 connections open at once. At peak
speeds, the system can crawl over 100 web pages per second using
four crawlers. This amounts to roughly 600K per second of data.
✔ Each crawler maintains its own DNS cache so it does not need to do
a DNS lookup before crawling each document.
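The per-crawler DNS cache can be sketched as follows. This is an illustrative assumption rather than the crawler's actual code: the `DnsCache` class is hypothetical, it never expires entries, and the resolver is injectable so the example runs without network access.

```python
import socket


class DnsCache:
    """Per-crawler cache so each host name is resolved at most once."""

    def __init__(self, resolver=socket.gethostbyname):
        self._resolver = resolver   # injectable so the sketch can run offline
        self._cache = {}
        self.lookups = 0            # number of real DNS lookups performed

    def resolve(self, host):
        if host not in self._cache:
            self.lookups += 1
            self._cache[host] = self._resolver(host)
        return self._cache[host]


# Offline demonstration with a stand-in resolver (made-up addresses).
fake_dns = {"example.com": "192.0.2.10", "example.org": "192.0.2.20"}
cache = DnsCache(resolver=fake_dns.__getitem__)
addresses = [cache.resolve(h) for h in ["example.com", "example.org", "example.com"]]
```

Three pages are requested but only two lookups are made, which is the saving the slide refers to when many documents live on the same host.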
Indexing the Web
✔ Parsing
◦ Any parser which is designed to run on the entire Web must handle a huge
array of possible errors.
✔ Indexing Documents into Barrels
◦ After each document is parsed, it is encoded into a number of barrels. Every
word is converted into a word ID by using an in-memory hash table -- the
lexicon.
◦ Once the words are converted into word ID, their occurrences in the current
document are translated into hit lists and are written into the forward barrels.
✔ Sorting
◦ The sorter takes each of the forward barrels and sorts it by word ID to produce
an inverted barrel for title and anchor hits and a full-text inverted barrel.
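The three indexing steps above can be sketched in Python. The sketch makes simplifying assumptions: documents are already plain text, a hit records only the word position (no font or capitalization), and a single dictionary stands in for the set of barrels. The function names are hypothetical.

```python
from collections import defaultdict


def build_forward_index(docs, lexicon):
    """Map doc ID -> {word ID: [positions]} (a toy forward barrel)."""
    forward = {}
    for doc_id, text in docs.items():
        hits = defaultdict(list)
        for position, word in enumerate(text.lower().split()):
            # The lexicon is an in-memory hash table from word to word ID.
            word_id = lexicon.setdefault(word, len(lexicon))
            hits[word_id].append(position)
        forward[doc_id] = dict(hits)
    return forward


def invert(forward):
    """Re-sort the forward index by word ID to get the inverted index."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward):
        for word_id, positions in forward[doc_id].items():
            inverted[word_id].append((doc_id, positions))
    return dict(sorted(inverted.items()))


lexicon = {}
docs = {0: "web search engine", 1: "search the web"}
forward = build_forward_index(docs, lexicon)
inverted = invert(forward)
```

The forward index answers "which words occur in this document", while the inverted index answers "which documents contain this word", which is the lookup a query needs.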
Web Searching
✔ Match the query terms with words in the index
✔ Sort documents by relevance
✔ Find files or records
✔ Open each one and read it
✔ Store each word in a searchable index
✔ Kinds of information
✔ Dates
✔ Languages
✔ Update processes
Other System Features
✔ It makes use of the link structure of the Web to calculate
a quality ranking for each web page, called Page Rank
◦ Page Rank is a trademark of Google. The Page Rank
process has been patented.
✔ The search engine uses links to improve search results.
Page Rank
✔ Page Rank is a link analysis algorithm which assigns a
numerical weighting to each Web page, with the purpose
of "measuring" relative importance.
✔ Based on the hyperlink map
✔ An excellent way to prioritize the results of web keyword
searches
Intuitive Justification
✔ Page Rank can be thought of as a model of a "random surfer" who is
given a web page at random and keeps clicking on links, never hitting
"back", but eventually gets bored and starts on another random page.
◦ The probability that the random surfer visits a page is its Page
Rank.
◦ The damping factor d controls how often the surfer gets bored: in the
usual formulation, the surfer follows a link with probability d and
requests another random page with probability 1 - d.
✔ A page can have a high Page Rank
◦ If there are many pages that point to it
◦ Or if there are some pages that point to it, and have a high Page
Rank.
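The random surfer model can be turned into a short power-iteration sketch in Python. It assumes the common convention in which d (0.85 here, an assumed default) is the probability of following a link, so 1 - d is the chance of jumping to a random page, and it normalizes ranks to sum to 1. The `page_rank` function and the four-page graph are illustrative only.

```python
def page_rank(links, d=0.85, iterations=50):
    """Power iteration over a dict {page: [pages it links to]}.

    Ranks form a probability distribution (they sum to 1).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the "bored surfer" share first.
        new_rank = {p: (1.0 - d) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = d * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = d * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = page_rank(links)
```

Page C ends up with the highest rank because three pages point to it, and A ranks well with a single incoming link because that link comes from the highly ranked C. Page D, which nothing links to, keeps only the random-jump share.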
Anchor Text
● <A href="http://www.yahoo.com/">Yahoo!</A>
The text of a hyperlink (anchor text) is associated not only
with the page that the link is on, but also
with the page the link points to.
◦ Anchors often provide more accurate descriptions of
web pages than the pages themselves.
◦ Anchors may exist for documents which cannot be
indexed by a text-based search engine, such as images,
programs, and databases.
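Collecting anchor text so it can be credited to the target page can be sketched with Python's standard `html.parser` module. The `AnchorCollector` class and the base URL are hypothetical; the sketch ignores nested or malformed anchors, which a parser for the real web would have to tolerate.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collect (target URL, anchor text) pairs from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.anchors = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":                       # the parser lowercases tag names
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href.strip())
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


collector = AnchorCollector("http://example.com/links.html")
collector.feed('<p>Try <A href="http://www.yahoo.com/">Yahoo!</A> today.</p>')
collector.close()
```

Each collected pair names the page the link points to, so the text "Yahoo!" can be indexed for http://www.yahoo.com/ even if that page has not been crawled.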
Major Data Structures
✔ Big Files
◦ Virtual files that span multiple file systems and are addressable by 64-bit integers.
✔ Repository
◦ Contains the full HTML of every web page.
✔ Document Index
◦ Keeps information about each document.
✔ Lexicon
◦ Two parts: a list of the words and a hash table of pointers.
✔ Hit Lists
◦ A list of occurrences of a particular word in a particular document, including position, font, and
capitalization information.
✔ Forward Index
◦ Stored in a number of barrels.
✔ Inverted Index
◦ Consists of the same barrels as the forward index, except that they have been processed by the sorter.
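Because hit lists dominate the size of both indexes, each hit is worth encoding compactly. The Python sketch below packs position, font, and capitalization into one 16-bit integer. The bit widths (1 bit for capitalization, 3 bits for font size, 12 bits for position) and the function names are assumptions chosen for the example.

```python
def pack_hit(capitalized, font_size, position):
    """Pack one word occurrence into a 16-bit integer.

    Assumed layout: 1 bit capitalization, 3 bits font size, 12 bits position.
    """
    if not 0 <= font_size < 8:
        raise ValueError("font_size must fit in 3 bits")
    position = min(position, 4095)          # clamp positions that overflow 12 bits
    return (int(bool(capitalized)) << 15) | (font_size << 12) | position


def unpack_hit(hit):
    """Recover (capitalized, font_size, position) from a packed hit."""
    return bool(hit >> 15), (hit >> 12) & 0b111, hit & 0xFFF


hit = pack_hit(capitalized=True, font_size=3, position=42)
```

Clamping trades exact positions late in long documents for a fixed two-byte hit, which keeps the barrels small.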
Search is NOT a Panacea
✔ Search can’t find what’s not there
◦ The content is hugely important
✔ Information Architecture is vital
✔ Usable sites have good navigation and structure
Problems
✔ High Quality Search
◦ The biggest problem facing users of web search engines
today is the quality of the results they get back.
✔ Scalable Architecture
◦ Google is designed to scale. It must be efficient in both
space and time.
The Future
“The ultimate search engine would
understand exactly what you mean
and give back exactly what you
want.”
- Larry Page
Conclusion
✔ The primary goal is to provide high quality search results
over a rapidly growing World Wide Web.
✔ A search engine employs a number of techniques to
improve search quality, including Page Rank, anchor text,
and proximity information.
✔ A search engine is a complete architecture for gathering
web pages, indexing them, and performing search
queries over them.
And that's how search works...
● THANK YOU
Presented by:
SAURABH GOEL
111438
