CSCI 494 - Lect. 3. Anatomy of Search Engines/Building a Crawler

Sean Golliher: Computer Science Lecture (Jan 2012) MSU. Describes Elements of Search Engines and How to Write a Basic Crawler in Python.

  • Never taught this course in MT. Taught for MASCO last Jan.
  • Created using Python and Java.
  • Google is secretive about its data centers. Project 02 leaked… chosen for cheap hydroelectric power.
  • Alpha is the average change frequency of the document set, e.g. an average change frequency of 7 days.
  • Google's original architecture: the URL server sends lists of URLs to be fetched; the repository stores the text from the documents. From the paper: In Google, the web crawling (downloading of web pages) is done by several distributed crawlers. There is a URLserver that sends lists of URLs to be fetched to the crawlers. The web pages that are fetched are then sent to the storeserver. The storeserver then compresses and stores the web pages into a repository. Every web page has an associated ID number called a docID which is assigned whenever a new URL is parsed out of a web page. The indexing function is performed by the indexer and the sorter. The indexer performs a number of functions. It reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits. The hits record the word, position in document, an approximation of font size, and capitalization. The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index. The indexer performs another important function: it parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link. The URLresolver reads the anchors file and converts relative URLs into absolute URLs and in turn into docIDs. It puts the anchor text into the forward index, associated with the docID that the anchor points to. It also generates a database of links, which are pairs of docIDs. The links database is used to compute PageRanks for all the documents. The sorter takes the barrels, which are sorted by docID (this is a simplification, see Section 4.2.5), and resorts them by wordID to generate the inverted index. This is done in place so that little temporary space is needed for this operation. The sorter also produces a list of wordIDs and offsets into the inverted index. A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher. The searcher is run by a web server and uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks to answer queries.
  • If it has definitions then it is a dictionary.
  • Hypertext Transfer Protocol… what your IP sends to the web server.
  • Term 1 is "computer" and it is in documents 2, 7, and 12. If we want computer AND book, we AND the two lists together; "computer book" gives us page 2.
  • Page 7 has term 1 with 2 occurrences in the title tag, 7 in metadata, density, frequency of 8…

    1. The Anatomy of Search Engines (Assignment 2: Build a Basic Crawler), Lecture 3
    2. Next Week (Review Matrices): "The Google Matrix" G = αS + (1 − α)(1/n)ee^T
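        The matrix notation is easier to see on a tiny example. A minimal sketch, not from the deck: the 3-page link graph and α = 0.85 are illustrative assumptions, and numpy is used for the linear algebra.

        # Google matrix G = alpha*S + (1 - alpha)*(1/n)*e*e^T for a tiny
        # 3-page web; the link graph and alpha = 0.85 are made-up examples.
        import numpy as np

        alpha = 0.85
        # Column-stochastic link matrix S: S[i, j] = 1/outdegree(j) if page j links to page i.
        S = np.array([[0.0, 0.5, 1.0],
                      [0.5, 0.0, 0.0],
                      [0.5, 0.5, 0.0]])
        n = S.shape[0]
        E = np.ones((n, n)) / n          # the teleportation matrix (1/n)*e*e^T

        G = alpha * S + (1 - alpha) * E  # the Google matrix

        # Power iteration: the PageRank vector is the dominant eigenvector of G.
        r = np.ones(n) / n
        for _ in range(100):
            r = G.dot(r)
        print r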
    3. B.S. Physics 1993, University of Washington. M.S. EE 1998, Washington State (four patents). 10+ years in search marketing. Founder of SEMJ.org (research journal). Frequent speaker. Blogger for SemanticWeb.com. President of Future Farm Inc.
    4. Google 1998 – ~26 million pages. Reported 1 trillion indexed pages in 2008. ~3 billion searches per day. ~450,000 servers. ~$2 million/mo power bill. Project 02 - $600 million (2006), The Dalles, OR. Cooling towers 4 stories high, two football fields long.
    5. Information Retrieval; AI Algorithms; Search Engines (architectures, crawling, indexing, ranking); text processing -> unstructured data; Big Data; Data Science & Analytics; Social Networks; Semantic Data.
    6. Web page updates follow a Poisson distribution on average; the time until the next update is governed by an exponential distribution. Alpha is the average change frequency, e.g. 1/7 for pages that change every seven days (Cho & Garcia-Molina, 2003).
    7. Below: if the average alpha of the document set is 7 days and we crawl after 1 week, the average age of the documents is about 2.6 days. (Plot: y-axis age, x-axis crawl day.)
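        A quick check of the 2.6-day figure, assuming the Poisson-update model from slide 6; the expected-age formula below is the standard one for exponential inter-update times and is not shown on the slides.

        # Expected age of a page t days after its last crawl, when it changes
        # according to a Poisson process with rate lam:
        #   E[Age(t)] = t - (1/lam) * (1 - exp(-lam * t))
        import math

        lam = 1.0 / 7.0   # average change frequency: one change per 7 days
        t = 7.0           # re-crawl after one week

        expected_age = t - (1.0 / lam) * (1.0 - math.exp(-lam * t))
        print round(expected_age, 1)   # ~2.6 days, matching the slide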
    8. Crawler Module: walks through the resources or data as directed and downloads content. Example: a directory or list of sites. Spiders: directed by crawlers with sets of URLs to visit, following links across the web.
    9. Repository: storage of data from spiders. Indexer: reads the repository and parses vital information and descriptors. Indexes: hold compressed information for web documents (content index, structure index).
    10. Query Module: displays relevant results to users, converts query languages, and gets the appropriate data from the indexes. Ranking Module: ranks a set of relevant web pages using content scoring and popularity scoring (PageRank algorithms).
    11. A list of all the words in a language, or: "It can be thought of as a list of all possible roots of a language, or all morphemes -- parts of words that contain no smaller meaningful parts -- that can stand alone or be combined with other parts to produce words."
    12. The web crawler client program connects to a domain name system (DNS) server. The DNS server translates the hostname into an internet protocol (IP) address. The crawler then attempts to connect to the server host using a specific port. After connecting, the crawler sends an HTTP request to the web server to request a page, usually a GET request.
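        A minimal Python 2 sketch of that sequence (DNS lookup, connect to port 80, send a GET), using only the standard library; the hostname is just the MSU one from the assignment slide.

        import socket

        host = "www.montana.edu"
        ip = socket.gethostbyname(host)            # DNS: hostname -> IP address
        print ip

        sock = socket.create_connection((ip, 80))  # connect to the web server's HTTP port
        sock.sendall("GET / HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n" % host)
        print sock.recv(4096).split("\r\n")[0]     # status line, e.g. "HTTP/1.1 200 OK"
        sock.close()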
    13. Every page has a unique uniform resource locator (URL). Web pages are stored on web servers that use HTTP to exchange information with client software, e.g., …
    14. Web crawlers spend a lot of time waiting for responses to requests. To reduce this inefficiency, web crawlers use threads and fetch hundreds of pages at once. Crawlers could potentially flood sites with requests for pages; to avoid this problem, web crawlers use politeness policies, e.g., a delay between requests to the same web server.
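        One simple way to implement such a politeness delay (a sketch only; the per-host bookkeeping and the 2-second figure are illustrative assumptions):

        import time
        from urlparse import urlparse   # Python 2; urllib.parse in Python 3

        last_request = {}   # host -> time of the last request to that host
        DELAY = 2.0         # minimum seconds between requests to one host

        def polite_wait(url):
            """Sleep just long enough to keep requests to one host DELAY seconds apart."""
            host = urlparse(url).netloc
            wait = last_request.get(host, 0) + DELAY - time.time()
            if wait > 0:
                time.sleep(wait)
            last_request[host] = time.time()

        # Call polite_wait(url) right before each fetch of url.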
    15. Parse a file for "important" information. Example: inverted file (lookup table):
        Term 1 (computer): 2, 7, 112
        Term 2 (book): 2, 22, 117, 1674, 250121
        Term 3 (Table): 3, 5, 201, 656
        Etc.
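        The same inverted file as a plain Python dictionary, with a boolean AND that intersects two postings lists (per the speaker notes, "computer AND book" returns page 2):

        postings = {
            "computer": [2, 7, 112],
            "book":     [2, 22, 117, 1674, 250121],
            "table":    [3, 5, 201, 656],
        }

        def and_query(term1, term2):
            # intersect the two docID lists
            return sorted(set(postings[term1]) & set(postings[term2]))

        print and_query("computer", "book")   # [2]: only page 2 contains both terms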
    16. Large files. A large number of pages use the same words. If pages change content, the inverted files must change. Updating index files is an active area of research.
    17. Suppose we store other information in the inverted file: Term 1 in a title; Term 1 in some type of metadata; Term 1 in a description; Term 1 frequency.
    18. Append with a new vector:
        Term 1 (computer): 2, 7 [2 7 4 8], 112
        Term 2 (book): 2, 22, 117, 1674, 250121
        Term 3 (Table): 3, 5, 201, 656
        Etc.
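        A sketch of how such an augmented posting could be stored. Following the speaker notes, docID 7 carries the vector [title occurrences, metadata occurrences, density, frequency]; that field order and the tuple layout are assumptions made here for illustration.

        postings = {
            # docID 7 is annotated with its feature vector [2, 7, 4, 8]
            "computer": [2, (7, [2, 7, 4, 8]), 112],
        }

        for entry in postings["computer"]:
            if isinstance(entry, tuple):              # annotated posting
                doc_id, (title_hits, meta_hits, density, freq) = entry
                print doc_id, title_hits, meta_hits, density, freq
            else:                                     # bare docID
                print entry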
    19. Trusting the author of the document
    20. The HTTP protocol returns: Last-Modified: Fri, 04 Jan 2008
    21. Build a focused crawler in Java, Python, PERL, or Matlab. Point it at the MSU home page. Gather all the URLs and store them for later use. http://www.montana.edu/robots.txt. Store all the HTML and label it with a DocID. Read Google's paper. Next time: PageRank and the Google Matrix. Contest: who can store the most unique URLs?
    22. #!/usr/bin/python
        ### Basic web crawler in Python to grab a URL from the command line
        ## Use the urllib2 library for URLs; use BeautifulSoup to parse the HTML
        from BeautifulSoup import BeautifulSoup
        import sys                        # allow the user to pass in a URL string
        import urllib2
        #### change the user-agent name
        from urllib import FancyURLopener
        class MyOpener(FancyURLopener):
            version = "BadBot/1.0"
        print MyOpener.version            # print the user-agent name
        httpResponse = urllib2.urlopen(sys.argv[1])
    23. # store the HTML page in an object called htmlPage
        htmlPage = httpResponse.read()
        print htmlPage
        htmlDom = BeautifulSoup(htmlPage)
        # dump the page title
        print htmlDom.title.string
        # dump all the links in the page
        allLinks = htmlDom.findAll("a", {"href": True})
        for link in allLinks:
            print link["href"]
        # print the name of the bot
        print MyOpener.version
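        If the two code slides above are saved together as one script, say crawler.py (a hypothetical filename), it can be run as "python crawler.py http://www.montana.edu"; it requires Python 2 (urllib2) and the BeautifulSoup 3 library.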
    24. Open-source Java-based crawler: https://webarchive.jira.com/wiki/display/Heritrix/Heritrix;jsessionid=AE9A595F01CAAB59BBCDC50C8A3ED2A9 http://www.robotstxt.org/robotstxt.html http://www.commoncrawl.org/
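        The assignment also points at http://www.montana.edu/robots.txt; a minimal sketch of honoring it with Python 2's standard-library robotparser module before fetching a page:

        import robotparser

        rp = robotparser.RobotFileParser()
        rp.set_url("http://www.montana.edu/robots.txt")
        rp.read()
        # check whether our bot may fetch the home page
        print rp.can_fetch("BadBot/1.0", "http://www.montana.edu/")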
    25. Questions?
