Information Retrieval      Lecture 6
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6  - Indexing
Lecture 6 on information retrieval, indexing, and BigTable.

  • Grab all terms and store the docID. The 2nd column is organized alphabetically. The 3rd column, document frequency, is the number of documents a term appears in (good for speed later). The last column, the postings list, records which documents the term appears in (good for ranking algorithms later).
  • 2^8 = 256 (the number of values one byte can hold).
  • 960 = 1111000000 in binary, so 10 bits. That means we need the range 128-2047 (the two-byte UTF-8 form). The leading bits are FIXED in the coding scheme.
  • Distributed database (IR book, p. 57). In use at Google. A petabyte is 10^15 bytes. Can't use SQL; it is not a relational database.
  • Notes: not all rows have to have the same columns; a big difference from relational databases. The site is example.com. other.com links to example.com (since that is the row we are talking about) with anchor text "example"; anchor is the anchor text column. null.com links to example.com with anchor text "click here".
  • Tokens: should we treat "apple" the same as "Apple"?
  • DOM: Document Object Model (see R. Song et al.).

    1. Information Retrieval, Lecture 6
    2. Some crawlers were stuck in loops. Issues with storage. MSU's site has about 60,000+ URLs, including subdomains.
    3. 37.5% of the grade is based on the crawler, 37.5% on the presentation, and 25% on attendance. Paper topics will be handed out next week.
    4. Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
    5. Three reasons to use multiple computers for crawling: it helps to put the crawler closer to the sites it crawls, it reduces the number of sites each crawler has to remember, and it reduces the computing resources required. A distributed crawler uses a hash function to assign URLs to crawling computers; the hash function should be computed on the host part of each URL.
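The host-based assignment above can be sketched in a few lines of Python. The hash choice (MD5) and the crawler count are illustrative assumptions, not the lecture's implementation; the key point is that hashing only the host sends all pages of one site to the same machine.

```python
from urllib.parse import urlparse
import hashlib

def assign_crawler(url: str, num_crawlers: int) -> int:
    """Assign a URL to a crawler machine by hashing only its host part,
    so every page from one site goes to the same machine."""
    host = urlparse(url).netloc.lower()
    # A stable hash (unlike Python's per-process hash()) keeps the
    # assignment reproducible across runs and machines.
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_crawlers

# Both URLs share a host, so they land on the same crawler.
a = assign_crawler("http://www.example.com/page1", 8)
b = assign_crawler("http://www.example.com/page2.html", 8)
assert a == b
```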
    6. A linear scan of documents is called "grepping." To perform ranked retrieval, i.e., find the best answers among many documents (billions), we need to index the documents in advance. Example: Boolean retrieval.
    7. Example: the works of Shakespeare. Record whether each work contains each word out of all the words (~32,000), creating a binary term-document incidence matrix.
    8. Suppose we want to answer Brutus AND Caesar AND NOT Calpurnia. We operate on the row vectors, complementing Calpurnia's: 110100 AND 110111 AND 101111 = 100100, so the answer is "Antony and Cleopatra" and "Hamlet".
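The bit-vector intersection on this slide can be reproduced directly with Python integers. The play ordering follows the classic Shakespeare example (Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth):

```python
# Term incidence rows from the slide, one bit per play (leftmost = first play).
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000
MASK = 0b111111  # six documents

result = brutus & caesar & (~calpurnia & MASK)
assert result == 0b100100  # matches the slide's answer

plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]
answers = [p for i, p in enumerate(plays) if result & (1 << (5 - i))]
print(answers)  # ['Antony and Cleopatra', 'Hamlet']
```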
    9. A collection of documents is a corpus. To assess the effectiveness of an IR system: precision is the fraction of returned documents that are relevant to the information need; recall is the fraction of the relevant documents in the collection that were returned by the system.
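Both measures fall out of simple set arithmetic on returned and relevant document IDs; the IDs below are made up for illustration:

```python
returned = {1, 2, 3, 4, 5}   # documents the system returned
relevant = {1, 2, 6, 7}      # documents relevant to the information need

hits = returned & relevant
precision = len(hits) / len(returned)  # fraction of returned that are relevant
recall = len(hits) / len(relevant)     # fraction of relevant that were returned
print(precision, recall)  # 0.4 0.5
```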
    10. The incidence matrix of the last example is too large: a 500K x 1M matrix has half a trillion 0's and 1's, and it is too sparse (99.8% of the matrix is zero). An inverted index records only the items that do appear. To gain speed at retrieval time we build the index in advance, in two steps: collect the documents, then tokenize the text (break it into terms).
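A minimal inverted index following those two steps, with document frequency and postings lists as described in the speaker notes. The toy corpus and the naive lowercase/whitespace tokenization are assumptions for illustration:

```python
from collections import defaultdict

docs = {1: "new home sales top forecasts",
        2: "home sales rise in july",
        3: "increase in home sales in july"}

postings = defaultdict(list)  # term -> list of docIDs containing it
for doc_id in sorted(docs):                         # step 1: collect documents
    for term in set(docs[doc_id].lower().split()):  # step 2: tokenize into terms
        postings[term].append(doc_id)

# Document frequency is just the length of the postings list.
print(postings["sales"], len(postings["sales"]))  # [1, 2, 3] 3

def boolean_and(t1: str, t2: str) -> list:
    """Intersect two postings lists to answer an AND query."""
    return sorted(set(postings[t1]) & set(postings[t2]))

print(boolean_and("home", "july"))  # [2, 3]
```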
    11. Text is stored in hundreds of incompatible file formats; to a web server it is just bytes. E.g., raw text, RTF, HTML, XML, Microsoft Word, ODF, PDF. Other types of files are also important, e.g., PowerPoint and Excel. Typically a conversion tool is used that converts the document content into a tagged text format such as HTML or XML and retains some of the important formatting information.
    12. A character encoding is a mapping between bits and glyphs, i.e., getting from bits in a file to characters on a screen. It can be a major source of incompatibility. ASCII (1963) is the basic character encoding scheme for English: it encodes 128 letters, numbers, special characters, and control characters in 7 bits, extended with an extra bit for storage in bytes.
    13. Other languages can have many more glyphs; e.g., Chinese has more than 40,000 characters, with over 3,000 in common use. Many languages have multiple encoding schemes, e.g., the CJK (Chinese-Japanese-Korean) family of East Asian languages, Hindi, and Arabic. With these schemes you must specify the encoding, and you can't mix multiple languages in one file.
    14. ASCII was able to represent all English characters using numbers 32-127 (around the time Unix and C were being invented). IBM started using the other bits for special characters; then came the ANSI standard. The WWW caused all of this to "crash," leading to Unicode (1991).
    15. A different way of thinking about character representation. Question: is A different from A, e.g., rendered in another font or script? In some languages even the shape matters. Every letter is assigned a number (a code point) by the Unicode consortium, e.g., A = U+0041 (hex).
    16. UTF-8 is identical to ASCII for the first 128 code points; it is one coding from bits to glyphs. You have to know what encoding is being used!
    17. Unicode is a single mapping from numbers to glyphs that attempts to include all glyphs in common use in all known languages. It maps numbers to glyphs only; it does not uniquely specify the bits-to-glyph mapping! That is the job of encodings such as UTF-8 (1-4 bytes per character), UTF-16, and UTF-32.
    18. E.g., the Greek letter pi (π) is Unicode code point 960; in binary, 00000011 11000000 (3C0 in hexadecimal). The final UTF-8 encoding is (110)01111 (10)000000, i.e., the bytes CF 80 in hexadecimal.
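This worked example is easy to verify in Python, whose `str.encode` produces exactly the byte sequence derived on the slide:

```python
# Code point of Greek pi: 960 decimal = 0x3C0.
assert ord("π") == 960 == 0x3C0

# UTF-8 packs the 10 payload bits into two bytes: 110xxxxx 10xxxxxx.
encoded = "π".encode("utf-8")
assert encoded == b"\xcf\x80"                # CF 80, as on the slide
print(f"{encoded[0]:08b} {encoded[1]:08b}")  # 11001111 10000000
```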
    19. Web sites declare the encoding with: <html> <head> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    20. Requirements for a document storage system: random access (request the content of a document based on its URL; a hash function on the URL is typical); compression and large files (reducing storage requirements while keeping access efficient); and update (handling large volumes of new and modified documents, and adding new anchor text).
    21. Text is highly redundant (or predictable). Compression techniques exploit this redundancy to make files smaller without losing any of the content. Popular algorithms can compress HTML and XML text by 80%, e.g., DEFLATE (zip, gzip) and LZW (UNIX compress, PDF).
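The redundancy claim can be demonstrated with DEFLATE via Python's `zlib`. The sample markup is invented and the exact ratio depends on the input, but repetitive HTML routinely compresses this well:

```python
import zlib

# Repetitive HTML-like text, as real markup tends to be.
html = ("<tr><td class='cell'>value</td>"
        "<td class='cell'>value</td></tr>\n" * 200).encode("utf-8")

compressed = zlib.compress(html, level=9)  # DEFLATE, as used by zip/gzip
ratio = 1 - len(compressed) / len(html)
print(f"{len(html)} -> {len(compressed)} bytes ({ratio:.0%} saved)")
assert zlib.decompress(compressed) == html  # lossless: content fully recovered
```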
    22. BigTable is Google's document storage system (Chang et al., 2006). It is customized for storing, finding, and updating web pages, with tablets served by thousands of machines, and it handles large collection sizes using inexpensive computers.
    23. There is no query language and no complex queries to optimize; only row-level transactions. Tablets are stored in a replicated file system that is accessible by all BigTable servers. Any change to a tablet is recorded in a transaction log, which is also stored in a shared file system. If a tablet server crashes, another server can immediately read the tablet data and transaction log from the file system and take over.
    24. A BigTable is logically organized into rows; a row stores the data for a single web page. The combination of a row key, a column key, and a timestamp points to a single cell in the row. See http://video.google.com/videoplay?docid=7278544055668715642
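The (row key, column key, timestamp) addressing can be sketched as nested maps. This is a toy in-memory model for intuition only, not BigTable's actual storage format; the reversed-host row key and `contents:`/`anchor:` column names follow the usual BigTable example, and the timestamps are invented:

```python
# table[row_key][column_key][timestamp] -> cell value
table = {
    "com.example.www": {  # reversed host keeps one site's pages adjacent
        "contents:": {
            1090000000: "<html>old version</html>",
            1091000000: "<html>new version</html>",
        },
        "anchor:other.com": {1091000000: "example"},
    }
}

def read_cell(row: str, column: str, timestamp=None):
    """Return one cell; with no timestamp, the most recent version."""
    versions = table[row][column]
    ts = timestamp if timestamp is not None else max(versions)
    return versions[ts]

print(read_cell("com.example.www", "contents:"))  # <html>new version</html>
```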
    25. BigTable can have a huge number of columns per row. All rows have the same column groups, but not all rows have the same columns; this is important for reducing the disk reads needed to access document data. Rows are partitioned into tablets based on their row keys, which simplifies determining which server is appropriate.
    26. Duplicate and near-duplicate documents occur in many situations: copies, versions, plagiarism, spam, mirror sites. About 30% of the web pages in a large crawl are exact or near duplicates of pages in the other 70%. Duplicates consume significant resources during crawling, indexing, and search, and have little value to most users.
    27. Exact duplicate detection is relatively easy using checksum techniques. A checksum is a value computed from the content of the document, e.g., the sum of the bytes in the document file. It is possible for files with different text to have the same checksum, so functions such as the cyclic redundancy check (CRC), which also consider the positions of the bytes, have been developed.
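The weakness of a plain byte-sum checksum, and why position-sensitive functions like a CRC help, shows up immediately with two strings that are permutations of each other (`zlib.crc32` stands in here for a generic CRC):

```python
import zlib

def byte_sum(data: bytes) -> int:
    """Naive checksum: sum of the bytes, ignoring their positions."""
    return sum(data)

a, b = b"dog, cat", b"cat, dog"  # different text, same bytes overall
assert byte_sum(a) == byte_sum(b)      # collision: byte sum can't tell them apart
assert zlib.crc32(a) != zlib.crc32(b)  # CRC accounts for byte positions
```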
    28. Near-duplicate detection is a more challenging task: are web pages with the same text content but different advertising or formatting near-duplicates? A near-duplicate document is defined using a threshold value for some similarity measure between pairs of documents; e.g., document D1 is a near-duplicate of document D2 if more than 90% of the words in the documents are the same.
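The 90%-of-words definition can be implemented as a word-overlap similarity. Using the Jaccard coefficient is my choice of measure here (the slide does not name one); the threshold is the slide's 0.9, and the sample documents are invented:

```python
def word_similarity(d1: str, d2: str) -> float:
    """Fraction of distinct words shared between two documents (Jaccard)."""
    w1, w2 = set(d1.lower().split()), set(d2.lower().split())
    return len(w1 & w2) / len(w1 | w2)

def near_duplicate(d1: str, d2: str, threshold: float = 0.9) -> bool:
    return word_similarity(d1, d2) >= threshold

base = ("one two three four five six seven eight nine ten "
        "eleven twelve thirteen fourteen fifteen sixteen seventeen eighteen nineteen")
d1 = base + " twenty"  # same page ...
d2 = base + " extra"   # ... with one word changed (e.g., different ad text)

print(round(word_similarity(d1, d2), 3))  # 0.905 (19 shared / 21 total words)
print(near_duplicate(d1, d2))             # True
```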
    29. Many web pages contain text, links, and pictures that are not directly related to the main content of the page. This additional material is mostly noise that could negatively affect the ranking of the page. Techniques have been developed to detect the content blocks in a web page; non-content material is either ignored or reduced in importance in the indexing process.
    30. (Repeat of slide 29.)
    31. Other approaches use DOM structure and visual (layout) features.
    32. The cumulative distribution of tags in the example web page: the main text content of the page corresponds to the "plateau" in the middle of the distribution.
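The "plateau" idea can be sketched as an optimization over the token sequence: choose the window that maximizes tags outside plus words inside, which is exactly the flat middle of the cumulative tag distribution. This is a brute-force O(n^2) sketch of that idea; the sample HTML is invented:

```python
import re

def main_content(html: str) -> str:
    # Split the page into tag tokens and word tokens.
    tokens = re.findall(r"<[^>]+>|\w+", html)
    n = len(tokens)
    prefix = [0]  # prefix[i] = number of tag tokens before position i
    for t in tokens:
        prefix.append(prefix[-1] + int(t.startswith("<")))
    # Choose the window [i, j) maximizing (tags outside + words inside):
    # the plateau of the cumulative tag distribution is the main content.
    best_score, best = -1, (0, 0)
    for i in range(n + 1):
        for j in range(i, n + 1):
            tags_outside = prefix[i] + (prefix[n] - prefix[j])
            words_inside = (j - i) - (prefix[j] - prefix[i])
            if tags_outside + words_inside > best_score:
                best_score, best = tags_outside + words_inside, (i, j)
    i, j = best
    return " ".join(t for t in tokens[i:j] if not t.startswith("<"))

page = ("<html><body><a>nav</a>"
        "<p>main article text here</p>"
        "<a>footer</a></body></html>")
print(main_content(page))  # main article text here
```

The nav link and footer fall outside the chosen window because they sit in tag-dense regions, while the article text forms the tag-free plateau.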
    33. R. Song et al., 2004. "Learning Block Importance Models for Web Pages."
