Modern Web Search
Web Information Systems - 2015
Module Outline
● Story so far…
● This presentation talks about:
o Search Engine (Google as a case study)
▪ Crawlers
▪ Web Indexing
Web Search Basics
● Web search engines don’t search the web
o They search a copy of the web
o They crawl or spider documents from the web
o They index the documents, and provide a
search interface based on that index
o Users click on links to visit the actual, original
web document
Slide taken from Hugh E. Williams, eBay Inc.
Google Search Engine Architecture
Taken from: http://infolab.stanford.edu/~backrub/google.html
Web Crawlers
Main idea:
● Start with known sites
● Record information for these sites
● Follow the links from each site
● Record information found at new sites
● Repeat
Reference: http://web.mst.edu/~ercal/253/Papers/WebSearchEngines-1.pdf
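The loop above can be sketched as a breadth-first traversal. This is a minimal illustration, not how any production crawler is built: the "web" here is a hypothetical in-memory dictionary (`TOY_WEB`), and `fetch_links` stands in for downloading a page and extracting its links.

```python
from collections import deque

# Hypothetical in-memory "web": page URL -> URLs it links to.
TOY_WEB = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/": ["c.example/", "a.example/"],
    "c.example/": [],
}


def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl: start from seeds, record pages, follow links."""
    frontier = deque(seeds)   # URLs waiting to be visited
    seen = set(seeds)         # URLs already queued, so nothing is queued twice
    visited = []              # order in which pages were recorded
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)                # "record information" for this page
        for link in fetch_links(url):      # "follow the links"
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


order = crawl(["a.example/"], lambda url: TOY_WEB.get(url, []))
print(order)
```

The `seen` set is what keeps the "repeat" step from looping forever when pages link back to each other.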
Web Crawlers
● Issues with this:
o Speed
o Politeness
o Excluded content
o Spamming
o Continuous Crawling - e.g. “current weather”
Reference: http://web.mst.edu/~ercal/253/Papers/WebSearchEngines-1.pdf
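Two of these issues, politeness and excluded content, can be sketched with Python's standard library. The robots.txt text, the user-agent name `toycrawler`, and the `PolitenessGate` class are all assumptions made for illustration; the per-host delay is one simple way to be polite, not the method used by any particular engine.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for example.com.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())  # parse rules without any network access


class PolitenessGate:
    """Tracks, per host, how long a crawler should wait before the next hit."""

    def __init__(self, min_delay_seconds):
        self.min_delay = min_delay_seconds
        self.next_allowed = {}  # host -> earliest time of the next request

    def wait_needed(self, url, now):
        host = urlsplit(url).netloc
        wait = max(0.0, self.next_allowed.get(host, now) - now)
        # The request happens at now + wait; the following one must come later.
        self.next_allowed[host] = now + wait + self.min_delay
        return wait


gate = PolitenessGate(min_delay_seconds=2.0)
print(robots.can_fetch("toycrawler", "http://example.com/private/data.html"))
print(gate.wait_needed("http://example.com/a", now=0.0))
print(gate.wait_needed("http://example.com/b", now=0.5))
```

Passing `now` in explicitly keeps the sketch deterministic; a real crawler would read a clock and sleep for the returned duration.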
Some more Crawler Challenges
Crawler challenges:
● HTTP HEAD and GET requests sometimes return different
headers
● A page can redirect to itself, or into a cycle (more in a
moment)
● Pages can look different to end-user browsers and crawlers
● Pages can require JavaScript processing
● Pages can require cookies
● Pages can be built in non-HTML environments
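The redirect problem in the list above can be handled by remembering which URLs have already been seen while following a chain. The sketch below assumes a hypothetical `REDIRECTS` table in place of real HTTP responses, and the hop limit of 10 is an arbitrary choice.

```python
# Hypothetical redirect table: URL -> URL it redirects to.
REDIRECTS = {
    "/old": "/new",
    "/self": "/self",   # redirects to itself
    "/ping": "/pong",   # two-page cycle
    "/pong": "/ping",
}


def resolve_redirects(url, redirect_map, max_hops=10):
    """Follow redirects; return the final URL, or None on a cycle or when
    more than max_hops redirects would be needed."""
    seen = set()
    for _ in range(max_hops + 1):
        if url in seen:
            return None          # we have been here before: a cycle
        seen.add(url)
        target = redirect_map.get(url)
        if target is None:
            return url           # no further redirect: this is the final page
        url = target
    return None                  # chain longer than max_hops


print(resolve_redirects("/old", REDIRECTS))
print(resolve_redirects("/ping", REDIRECTS))
```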
The Processing Involved: A Google Case Study
“In order to scale to hundreds of millions of web pages, Google
has a fast distributed crawling system. A single URLserver
serves lists of URLs to a number of crawlers (we typically ran
about 3). Each crawler keeps roughly 300 connections open at
once. This is necessary to retrieve web pages at a fast enough
pace. At peak speeds, the system can crawl over 100 web pages
per second using four crawlers. This amounts to roughly 600K
per second of data.”
Taken from: http://infolab.stanford.edu/~backrub/google.html
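A quick back-of-the-envelope check of the figures in the quote. Two assumptions are made here that the quote does not state: "600K" is read as roughly 600,000 bytes, and the peak rate is treated as if it could be sustained all day. The 100-million-page target is just an illustrative round number.

```python
PAGES_PER_SECOND = 100          # peak rate from the quote
BYTES_PER_SECOND = 600_000      # "roughly 600K per second" (assumed to mean bytes)
CRAWLERS = 4                    # crawlers used at peak speed
CONNECTIONS_PER_CRAWLER = 300   # open connections per crawler

avg_page_bytes = BYTES_PER_SECOND / PAGES_PER_SECOND       # 6000.0 bytes per page
open_connections = CRAWLERS * CONNECTIONS_PER_CRAWLER      # 1200 connections
pages_per_day = PAGES_PER_SECOND * 60 * 60 * 24            # 8,640,000 pages
days_for_100m_pages = 100_000_000 / pages_per_day          # about 11.6 days

print(avg_page_bytes, open_connections, pages_per_day, round(days_for_100m_pages, 1))
```

So the quoted numbers imply an average page of about 6 KB, and even at peak speed a crawl of 100 million pages takes well over a week.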
Web Indexing
● Given a set of keywords, how do we find relevant documents?
● We looked at the inverted index in the previous lecture
Taken from: http://infolab.stanford.edu/~backrub/google.html and
http://www.google.com/insidesearch/howsearchworks/crawling-indexing.html
Inverted Index
● Simple (one occurrence per doc):
word (e.g. “knoesis”): DocID, DocID, DocID
● Detailed (all occurrences in docs):
word (e.g. “knoesis”): DocID, Position, Position, DocID, Position, ...
● Even more detailed: Position annotated with info (heading,
boldface, ...). Useful for ranking.
● Adding another dimension: spatial e.g. Google’s new Pigeon
rank search
Reference: http://www.google.com/technology/pigeonrank.html
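The simple and detailed forms above can be sketched in a few lines. The three documents in `DOCS` are made up for illustration, and tokenization is a plain whitespace split, which real indexers refine considerably.

```python
from collections import defaultdict

# Hypothetical toy collection: DocID -> text.
DOCS = {
    1: "knoesis research center",
    2: "web search at knoesis",
    3: "knoesis knoesis",
}


def build_simple_index(docs):
    """word -> sorted list of DocIDs (one entry per document)."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for word in set(docs[doc_id].split()):
            index[word].append(doc_id)
    return dict(index)


def build_positional_index(docs):
    """word -> {DocID: [positions]} (every occurrence recorded)."""
    index = defaultdict(dict)
    for doc_id in sorted(docs):
        for position, word in enumerate(docs[doc_id].split()):
            index[word].setdefault(doc_id, []).append(position)
    return dict(index)


simple = build_simple_index(DOCS)
positional = build_positional_index(DOCS)
print(simple["knoesis"])
print(positional["knoesis"])
```

The "even more detailed" form would replace each integer position with a small record, for example position plus a flag for heading or boldface.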
Google’s Inverted Index
Indexing Tricks
● Scale Up: “Barrels”
● Phrase Indexing: sublists
o The list for “electrical” is the union of the sublists for phrases
that contain it: {“electrical apparatus”, “electrical current”, ..}
● Anchor text: the text of links pointing to other web pages. It
often serves as a short summary of the linked page
o e.g. a link whose text reads “This is Tanvi Banerjee’s web page”
References: Web Search Engines: Part 2, David Hawking
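The anchor-text trick can be illustrated by indexing the words of a link under the page it points to rather than the page it appears on. This is a rough sketch using the standard-library HTML parser; the HTML snippet and the URL `http://example.org/tanvi` are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects (href, anchor text) pairs from <a href=...> elements."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def index_anchor_text(html):
    """word -> set of target URLs whose incoming anchor text contains it."""
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    index = defaultdict(set)
    for href, text in collector.anchors:
        for word in text.lower().split():
            index[word].add(href)
    return dict(index)


PAGE = '<p><a href="http://example.org/tanvi">This is Tanvi Banerjee\'s web page</a></p>'
anchor_index = index_anchor_text(PAGE)
print(sorted(anchor_index))
```

A query for "tanvi" can now find the target page even if that page never contains the word itself.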
Page Ranking Using Google as Case Study
“Google maintains much more information about web documents
than typical search engines. Every hitlist includes position, font,
and capitalization information. Additionally, we factor in hits from
anchor text and the PageRank of the document. Combining all of
this information into a rank is difficult. We designed our ranking
function so that no particular factor can have too much influence.”
Reference: http://infolab.stanford.edu/~backrub/google.html
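PageRank, one of the factors named in the quote, can be approximated by repeated redistribution of rank along links. The sketch below uses the common normalized variant (ranks sum to 1) with a damping factor of 0.85; the four-page graph, the iteration count, and the handling of pages without outlinks are assumptions of this example, not details taken from the quote.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict: page -> list of pages it links to.

    Every link target must also appear as a key of `links`.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            # A page with no outlinks spreads its rank evenly over all pages.
            targets = outlinks if outlinks else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph.
GRAPH = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(GRAPH)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Page `c` ends up on top because three pages link to it, while `d`, which nothing links to, keeps only the baseline share.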
Query Processing
● Current search engines return only documents containing all
the query words
● Incorporating more factors in the ranking algorithm improves
search quality but slows down query processing
● Speed up Query Processing
o Clever numbering of documents in decreasing order of importance
(a query-independent score), so the best candidates come first
in each postings list
o Caching
References: Web Search Engines: Part 2, David Hawking
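The three ideas on this slide fit together in a short sketch: AND semantics via intersection of sorted posting lists, document IDs assumed to have been assigned so that a smaller ID means a more important document (so the first k matches are the top k), and a cache for repeated queries. The `POSTINGS` data and the default `k=2` are hypothetical.

```python
from functools import lru_cache

# Hypothetical postings, sorted ascending. IDs are assumed to have been
# assigned in decreasing order of a precomputed importance score.
POSTINGS = {
    "web": [1, 2, 4, 7, 9],
    "search": [2, 3, 4, 9],
    "engine": [2, 4, 5, 9, 11],
}


def intersect(a, b):
    """Merge-style intersection of two sorted lists of DocIDs."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


@lru_cache(maxsize=1024)
def search(query, k=2):
    """Return up to k DocIDs that contain every query word."""
    lists = [POSTINGS.get(term, []) for term in query.lower().split()]
    if not lists:
        return ()
    lists.sort(key=len)            # start with the shortest list
    result = lists[0]
    for other in lists[1:]:
        result = intersect(result, other)
    return tuple(result[:k])       # smallest IDs are the most important docs


print(search("web search engine"))
print(search("web search engine"))   # second call is served from the cache
print(search.cache_info().hits)
```

Because the lists are ordered by importance, a production system could stop intersecting as soon as k matches are found; this sketch computes the full intersection for clarity.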
Other Aspects of Search Engines
● Numerous other aspects of a search engine
o query specific advertisement
o image, video, audio search
● Many challenges to building a search engine
o Continuous room for improvement (Google’s Pigeon
rank..)

Editor's Notes

  1. http://www.cs.au.dk/~gerth/webalg02/slides/indexing.pdf