CSCI 494 - Lect. 3. Anatomy of Search Engines/Building a Crawler
1. The Anatomy of Search Engines
(Assignment 2: Build Basic Crawler)
Lecture 3
2. Next Week (Review Matrices):
“The Google Matrix”
G = αS + (1 − α)(1/n)ee^T
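A minimal NumPy sketch (my addition, Python 3, not part of the slides) of how next week's formula is assembled: S is a row-stochastic link matrix, e is the all-ones vector, and α is the damping factor. The 3-page link matrix below is invented purely for illustration.

# Sketch of the Google matrix G = alpha*S + (1 - alpha)*(1/n)*e*e^T
# (hypothetical 3-page web; each row of S gives a page's out-link probabilities)
import numpy as np

alpha = 0.85                      # damping factor in the formula
S = np.array([[0.0, 0.5, 0.5],    # page 1 links to pages 2 and 3
              [1.0, 0.0, 0.0],    # page 2 links to page 1
              [0.0, 1.0, 0.0]])   # page 3 links to page 2
n = S.shape[0]
e = np.ones((n, 1))

G = alpha * S + (1 - alpha) * (1.0 / n) * (e @ e.T)
print(G)                          # every row of G sums to 1 (row-stochastic)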
3. B.S. Physics, 1993, University of Washington
M.S. EE, 1998, Washington State (four patents)
10+ Years in Search Marketing
Founder of SEMJ.org (Research Journal)
Frequent Speaker
Blogger for SemanticWeb.com
President of Future Farm Inc.
7. Google 1998 – ~26 million pages
Reported 1 trillion indexed pages in 2008
~3 billion searches per day
~450,000 servers, ~$2 million/month power bill
Project 02 – $600 million (2006), The Dalles, OR. Cooling towers four stories high, two football fields long
8. Information Retrieval
AI
Algorithms
Search Engines
Architectures
Crawling
Indexing
Ranking
Text processing -> Unstructured data.
Big Data
Data Science & Analytics
Social Networks
Semantic Data
9. Web page updates follow a Poisson process, on average.
The time until the next update is governed by an exponential distribution.
Alpha is the average change frequency, e.g., 1/7 for a page that changes once every seven days.
(Cho & Garcia-Molina, 2003)
10. Below: if the average change frequency of the document set is once every seven days and we crawl after one week, the average age of the documents is about 2.6 days. (Figure: y-axis age, x-axis crawl day.)
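A quick check of the 2.6-day figure (my addition, Python 3), assuming the expected-age formula for Poisson updates, E[age] = t − (1/λ)(1 − e^(−λt)), with λ the change rate, in the spirit of Cho & Garcia-Molina.

# Expected age of a page at crawl time t, for Poisson updates with rate lam
# (assumed expected-age formula; lam = 1/7 changes per day, crawl after t = 7 days)
import math

def expected_age(lam, t):
    return t - (1.0 / lam) * (1.0 - math.exp(-lam * t))

print(expected_age(1.0 / 7, 7))   # ~2.58 days, roughly the 2.6 days quoted on the slide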
12. Crawler Module
Walks through the resources or data as directed and downloads content
Example: a directory or list of sites
Spiders
Directed by the crawler with sets of URLs to visit, following links across the web
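A minimal sketch (my addition, Python 3, not the course code) of the crawler/spider idea above: a frontier of URLs to visit, fetch each page, and push newly discovered links back onto the frontier. The seed URL and the link-extraction regex are placeholders.

# Toy crawl loop: take URLs from a frontier, fetch them, queue the links found
import re
import urllib.request
from collections import deque

frontier = deque(["http://www.example.com/"])   # seed URL (placeholder)
seen = set(frontier)

while frontier and len(seen) < 50:              # small cap so the sketch terminates
    url = frontier.popleft()
    try:
        html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
    except Exception:
        continue                                # skip pages that fail to download
    for link in re.findall(r'href="(http[^"]+)"', html):   # crude link extraction
        if link not in seen:
            seen.add(link)
            frontier.append(link)               # the spider follows links across the web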
13. Repository
Storage of the data collected by the spiders
Indexer
Reads the repository, parses vital information and descriptors
Indexes
Hold compressed information for web documents
Content index, structure index, …
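A small sketch (my addition, Python 3) of the repository idea: compress each downloaded page and store it under a docID so the indexer can read it back later. The pages dict is made-up sample data.

# Toy repository: docID -> zlib-compressed HTML; the indexer later uncompresses and parses it
import zlib

repository = {}
pages = {1: "<html><title>computer book</title>...</html>",   # made-up sample pages
         2: "<html><title>table</title>...</html>"}

for doc_id, html in pages.items():
    repository[doc_id] = zlib.compress(html.encode("utf-8"))

# The indexer reads the repository back:
for doc_id, blob in repository.items():
    html = zlib.decompress(blob).decode("utf-8")
    print(doc_id, len(blob), "bytes compressed ->", len(html), "bytes of HTML")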
14. Query Module
Displays relevant results to users
Converts the user's natural-language query into a form the indexes can use
Gets the appropriate data from the indexes
Ranking Module
Ranks a set of relevant web pages
Content scoring
Popularity scoring
○ PageRank algorithms
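A hedged sketch (my addition, Python 3) of how the query and ranking modules might fit together: pull candidate pages from the index, then rank them by a weighted mix of a content score and a popularity score. The index, scores, and weights below are invented for illustration; real systems use far richer models (e.g., PageRank for popularity).

# Toy query + ranking: candidates come from the index, then are ranked by
# w_content * content_score + w_popularity * popularity_score (weights invented)
index = {"computer": {2, 7, 112}, "book": {2, 22, 117}}            # term -> docIDs (sample data)
content_score = {2: 0.9, 7: 0.4, 112: 0.2, 22: 0.5, 117: 0.3}      # e.g. term-frequency based
popularity_score = {2: 0.7, 7: 0.9, 112: 0.1, 22: 0.2, 117: 0.6}   # e.g. PageRank based

def rank(query_terms, w_content=0.6, w_popularity=0.4):
    candidates = set.intersection(*(index[t] for t in query_terms))
    scored = [(w_content * content_score[d] + w_popularity * popularity_score[d], d)
              for d in candidates]
    return sorted(scored, reverse=True)

print(rank(["computer", "book"]))   # doc 2 is the only page containing both terms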
16. Lexicon: a list of all the words in a language
or:
“It can be thought of as a list of all possible roots of a language, or all morphemes -- parts of words that contain no smaller meaningful parts -- that can stand alone or be combined with other parts to produce words.”
17. The web crawler client program connects to a domain name system (DNS) server
The DNS server translates the hostname into an internet protocol (IP) address
The crawler then attempts to connect to the server host using a specific port
After connecting, the crawler sends an HTTP request to the web server to request a page, usually a GET request
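A small Python 3 sketch (my addition) of the sequence just described: resolve the hostname via DNS, connect to the server on port 80, and send a GET request. The hostname is the MSU one used later in the assignment; everything else is standard socket usage.

# DNS lookup, TCP connect on port 80, then an HTTP GET for the home page
import socket

host = "www.montana.edu"                      # hostname from the assignment slide
ip = socket.gethostbyname(host)               # the DNS server translates hostname -> IP address
print(host, "->", ip)

s = socket.create_connection((ip, 80), timeout=10)   # connect to the server host on port 80
request = "GET / HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n" % host
s.sendall(request.encode("ascii"))            # the GET request the crawler sends
print(s.recv(200).decode("ascii", "ignore"))  # first bytes of the server's response
s.close()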
18. Every page has a unique uniform resource locator (URL)
Web pages are stored on web servers that use HTTP to exchange information with client software
e.g.,
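The slide's example URL did not survive the export; as an illustration (my addition, Python 3, using a hypothetical montana.edu path), here is how a URL breaks into the pieces a crawler cares about.

# Split a URL into scheme, host, and path (the path below is hypothetical)
from urllib.parse import urlparse

url = "http://www.montana.edu/calendar/events.html"
parts = urlparse(url)
print(parts.scheme)    # 'http'  -> protocol used to talk to the web server
print(parts.netloc)    # 'www.montana.edu' -> host that stores the page
print(parts.path)      # '/calendar/events.html' -> which page on that host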
19. Web crawlers spend a lot of time waiting for responses to requests
To reduce this inefficiency, web crawlers use threads and fetch hundreds of pages at once
Crawlers could potentially flood sites with requests for pages
To avoid this problem, web crawlers use politeness policies
e.g., a delay between requests to the same web server
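A minimal sketch (my addition, Python 3) of such a politeness policy: remember when each host was last contacted and sleep so that requests to the same web server are at least a fixed delay apart. The 5-second delay is an arbitrary example value.

# Per-host politeness delay: space out requests to the same server
import time
from urllib.parse import urlparse

DELAY = 5.0                    # example politeness delay in seconds (arbitrary choice)
last_hit = {}                  # host -> time of the last request

def polite_wait(url):
    host = urlparse(url).netloc
    now = time.time()
    wait = last_hit.get(host, 0) + DELAY - now
    if wait > 0:
        time.sleep(wait)       # pause until the delay since the last request has passed
    last_hit[host] = time.time()

# usage: call polite_wait(url) right before fetching url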
21. Parse a file for “important” information.
Example: inverted file (lookup table)
Term 1 (computer): 2, 7, 112
Term 2 (book): 2, 22, 117, 1674, 250121
Term 3 (table): 3, 5, 201, 656
Etc…
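A tiny Python 3 sketch (my addition) of the lookup table above and the AND query from the speaker notes: intersecting the 'computer' and 'book' posting lists leaves only document 2.

# Inverted file: term -> sorted list of docIDs containing the term (numbers from the slide)
inverted = {
    "computer": [2, 7, 112],
    "book":     [2, 22, 117, 1674, 250121],
    "table":    [3, 5, 201, 656],
}

# AND query: documents containing every query term = intersection of the posting lists
def and_query(terms):
    postings = [set(inverted[t]) for t in terms]
    return sorted(set.intersection(*postings))

print(and_query(["computer", "book"]))   # -> [2]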
22. Large files
Large numbers of pages using the same words
If pages change content, the inverted files must change
Updating index files is an active area of research
23. Suppose we store other information in the inverted file:
Term 1 in a title
Term 1 in some type of metadata
Term 1 in a description
Term 1 frequency
24. Append with a new vector:
Term 1 (computer): 2, 7 [2 7 4 8], 112
Term 2 (book): 2, 22, 117, 1674, 250121
Term 3 (table): 3, 5, 201, 656
Etc…
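A sketch (my addition, Python 3) of the augmented entry on this slide: each posting can carry a vector such as [2, 7, 4, 8], read here (following the speaker notes) as occurrences in the title, in metadata, in the description, and the overall term frequency for that document. Only document 7 carries a vector on the slide; the other entries are placeholders.

# Postings as (docID, [title, metadata, description, frequency]) pairs
inverted = {
    "computer": [(2, None), (7, [2, 7, 4, 8]), (112, None)],
}

doc_id, vector = inverted["computer"][1]
title, metadata, description, frequency = vector
print("doc", doc_id, "title hits:", title, "metadata hits:", metadata,
      "description hits:", description, "term frequency:", frequency)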
27. Build a focused crawler in:
Java, Python, Perl, or Matlab
Point it at the MSU home page. Gather all the URLs and store them for later use.
http://www.montana.edu/robots.txt
Store all the HTML and label it with a DocID.
Read Google’s paper. Next time: PageRank & the Google Matrix.
Contest: who can store the most unique URLs?
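Before gathering URLs from the MSU site, the crawler should respect the robots.txt linked above. A short Python 3 sketch (my addition) using the standard-library robot parser; the user-agent name is the one from the sample code on the next slide.

# Check robots.txt before fetching a page from www.montana.edu
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.montana.edu/robots.txt")
rp.read()                                             # download and parse the rules

url = "http://www.montana.edu/"                       # page we want to crawl
if rp.can_fetch("BadBot/1.0", url):                   # user-agent from the sample crawler
    print("allowed to fetch", url)
else:
    print("robots.txt disallows", url)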
28. #!/usr/bin/python
### Basic web crawler in Python 2: grab a URL from the command line
## Use the urllib2 library for URLs, use BeautifulSoup (v3) to parse the HTML
#
from BeautifulSoup import BeautifulSoup
import sys                                   # lets the user pass the URL as a command-line argument
import urllib2
#### Change the user-agent name reported to web servers
from urllib import FancyURLopener
class MyOpener(FancyURLopener):
    version = 'BadBot/1.0'
print MyOpener.version                       # print the user-agent name
# Send the custom user-agent with the request (plain urlopen would use the default agent)
request = urllib2.Request(sys.argv[1], headers={'User-Agent': MyOpener.version})
httpResponse = urllib2.urlopen(request)
29. # Store the HTML page in an object called htmlPage
htmlPage = httpResponse.read()
print htmlPage
htmlDom = BeautifulSoup(htmlPage)
# Dump the page title
print htmlDom.title.string
# Dump all the links in the page
allLinks = htmlDom.findAll('a', {'href': True})
for link in allLinks:
    print link['href']
# Print the name of the bot
print MyOpener.version
Never taught this course in MT. Taught for MASCO last Jan.
Created using Python and Java
Google is secretive about its data centers. Project 02 leaked… the site was chosen for cheap hydroelectric power.
Alpha is the average change frequency of the document set, e.g., if your average change frequency is once every seven days…
Google’s original architecture: the URL server sends lists of URLs to be fetched; the repository stores the text of the documents. From the paper: In Google, the web crawling (downloading of web pages) is done by several distributed crawlers. There is a URLserver that sends lists of URLs to be fetched to the crawlers. The web pages that are fetched are then sent to the storeserver. The storeserver then compresses and stores the web pages into a repository. Every web page has an associated ID number called a docID which is assigned whenever a new URL is parsed out of a web page. The indexing function is performed by the indexer and the sorter. The indexer performs a number of functions. It reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits. The hits record the word, position in document, an approximation of font size, and capitalization. The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index. The indexer performs another important function. It parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link. The URLresolver reads the anchors file and converts relative URLs into absolute URLs and in turn into docIDs. It puts the anchor text into the forward index, associated with the docID that the anchor points to. It also generates a database of links which are pairs of docIDs. The links database is used to compute PageRanks for all the documents. The sorter takes the barrels, which are sorted by docID (this is a simplification, see Section 4.2.5), and resorts them by wordID to generate the inverted index. This is done in place so that little temporary space is needed for this operation. The sorter also produces a list of wordIDs and offsets into the inverted index. A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher. The searcher is run by a web server and uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks to answer queries.
If it has definitions then it is a dictionary.
Hypertext Transfer Protocol… what your IP sends to the web server.
Term 1 is “computer” and it appears in documents 2, 7, 112. If we want “computer” AND “book” we AND the posting lists together; “computer book” gives us document 2.
Page 7 has term 1 with 2 occurrences in the title tag, 7 in metadata, 4 in the description, and a frequency of 8…