Information Retrieval
      Lecture 6
-   Some crawlers were stuck in loops.
-   Issues with storage
-   MSU’s site has about 60,00+ URLs
    including subdomains.
- 37.5% of grade based on Crawler
- 37.5% on presentation
- 25% Attendance
- Paper topics handed out next week
- Information retrieval (IR) is finding material
(usually documents) of an unstructured nature
(usually text) that satisfies an information need
from within large collections (usually stored on
computers)
   Three reasons to use multiple computers for
    crawling
     Helps to put the crawler closer to the sites it
      crawls
     Reduces the number of sites the crawler has to
      remember
     Reduces computing resources required
   Distributed crawler uses a hash function to
    assign URLs to crawling computers
     hash function should be computed on the host
      part of each URL
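A minimal sketch of the host-based assignment described above. The function name `assign_crawler` and the choice of SHA-256 are assumptions made for illustration; any stable hash would do. Hashing only the host part means every URL from one site lands on the same crawling computer.

```python
import hashlib
from urllib.parse import urlsplit


def assign_crawler(url, num_crawlers):
    """Return the index of the crawling computer responsible for this URL.

    Only the host part of the URL is hashed, so all pages of one site
    are assigned to the same crawler (hypothetical helper, not a real API).
    """
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then reduce it.
    return int.from_bytes(digest[:8], "big") % num_crawlers


same_site_a = assign_crawler("http://www.example.com/index.html", 4)
same_site_b = assign_crawler("http://www.example.com/about/staff?id=7", 4)
```

Here `same_site_a` and `same_site_b` are equal because both URLs share the host `www.example.com`.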
 A linear scan of documents is called
  “grepping”.
 In order to perform “ranked retrieval” we
  need the best answer among many
  documents (billions).
 Need to Index documents in advance.
 Ex: Boolean Retrieval
 Ex: The works of Shakespeare.
 For each book, record whether it contains each
  word out of all the words (~32,000)
 Create a binary term-document incidence
  matrix.
   Suppose we want to answer the query Brutus AND
    Caesar AND NOT Calpurnia. We take the row vectors
    for Brutus and Caesar, complement the row vector
    for Calpurnia (NOT(Calpurnia)), and AND them bitwise.

    110100 AND 110111 AND 101111 = 100100

=> Ans: “Antony and Cleopatra” and “Hamlet”
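The same query can be sketched with Python integers standing in for the row vectors. The column order of the plays is an assumption taken from the usual textbook version of this example; the slide itself only names the two plays in the answer. The Calpurnia row 010000 is the complement of the 101111 used above.

```python
# Assumed column order (leftmost bit = first play).
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# One row of the incidence matrix per term.
INCIDENCE = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}

# Mask that keeps NOT confined to the six document bits.
MASK = (1 << len(PLAYS)) - 1


def matching_plays(bits):
    """Translate a result bit vector back into play titles."""
    width = len(PLAYS)
    return [play for i, play in enumerate(PLAYS)
            if bits & (1 << (width - 1 - i))]


result = (INCIDENCE["Brutus"]
          & INCIDENCE["Caesar"]
          & (~INCIDENCE["Calpurnia"] & MASK))
answer = matching_plays(result)
```

`result` is `0b100100`, and `answer` holds the two plays named in the answer line.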
 Collection of Documents: Corpus
 To assess the effectiveness of an IR system:
      Precision: What fraction of the returned documents are
       relevant to the information need?
      Recall: What fraction of the relevant documents in
       the collection were returned by the system?
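Both measures reduce to set arithmetic over document IDs. A small sketch, where `precision_recall` is a hypothetical helper and the document IDs are made up for illustration:

```python
def precision_recall(returned, relevant):
    """Compute (precision, recall) from two collections of document IDs."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)          # relevant documents that were returned
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# The system returned 4 documents; 3 documents in the collection are relevant.
p, r = precision_recall(returned=[1, 2, 3, 4], relevant=[2, 4, 6])
```

Two of the four returned documents are relevant (precision 0.5), and two of the three relevant documents were found (recall 2/3).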
 The incidence matrix from the last example is
  too large: a 500k x 1M matrix has ½ trillion 0’s
  and 1’s. It is also too sparse: 99.8% of the
  matrix is zero.
 Inverted Index: Record items that do appear.
 To gain speed at retrieval time we build an
  index in advance.
 Two steps:
     Collect documents
     Tokenize the text (break into terms)
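A minimal sketch of those two steps feeding an inverted index, which records only the terms that do appear. The tokenizer here (lowercase, split on non-alphanumerics) is an assumption chosen for brevity, and the two tiny documents are made up.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Break text into lowercase terms (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


docs = {
    1: "Brutus killed Caesar",
    2: "Caesar was ambitious",
}
index = build_inverted_index(docs)
```

Terms that never occur, such as "calpurnia" here, take no space at all, which is the advantage over the sparse incidence matrix.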
   Text is stored in hundreds of incompatible file
    formats, all of which are just bytes on a web server.
     e.g., raw text, RTF, HTML, XML, Microsoft
     Word, ODF, PDF
   Other types of files also important
     e.g., PowerPoint, Excel
   Typically use a conversion tool
     converts the document content into a tagged text
      format such as HTML or XML
     retains some of the important formatting
      information
   A character encoding is a mapping
    between bits and glyphs
     i.e., getting from bits in a file to characters on
      a screen
     Can be a major source of incompatibility
   ASCII (1963) is basic character encoding
    scheme for English
     encodes 128 letters, numbers, special
      characters, and control characters in 7
      bits, extended with an extra bit for storage in
      bytes
   Other languages can have many more
    glyphs
     e.g., Chinese has more than 40,000
     characters, with over 3,000 in common use
   Many languages have multiple encoding
    schemes
     e.g., CJK (Chinese-Japanese-Korean) family of
      East Asian languages, Hindi, Arabic
     must specify encoding
     can’t have multiple languages in one file
 ASCII: Was able to represent all English
  characters using the numbers 32-127 (at the
  time Unix and C were being invented).
 IBM started using the other bits for special
  characters.
 Then came the ANSI Standard
 The WWW caused all this to
  “crash”... Unicode (1991)
 A different way of thinking about character
  representation
 Question: is an A in one typeface different from
  an A in another? In some languages even the
  shape matters.
 Every letter is assigned a number (code
  point) by the Unicode Consortium. Ex: A =
  U+0041 (hex)
 UTF-8 -> Identical to ASCII for the ASCII
  characters (code points 0-127). An encoding
  maps bits to glyphs.
 You have to know what Encoding is being
  used!
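A short illustration of why the encoding must be known: the same bytes decode to different text under different encodings. The sample word is made up for the example.

```python
# The bytes a crawler might receive for the word "café" stored as UTF-8.
data = "café".encode("utf-8")

# Decoding with the right encoding recovers the text.
as_utf8 = data.decode("utf-8")

# Decoding the same bytes as Latin-1 silently produces the wrong characters.
as_latin1 = data.decode("latin-1")
```

`as_utf8` is the original four-character word, while `as_latin1` is the five-character string "cafÃ©", because the two bytes of "é" are each read as a separate Latin-1 character.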
 Single mapping from numbers to glyphs
  that attempts to include all glyphs in
  common use in all known languages
 Unicode is a mapping between numbers
  and glyphs
     does not uniquely specify bits to glyph
      mapping!
     e.g., UTF-8 (1-4 bytes per character), UTF-16, UTF-32
 e.g., Greek letter pi (π) is Unicode symbol
  number 960
 In binary, 00000011 11000000 (3C0 in
  hexadecimal)
 Final encoding is (110)01111 (10)000000 (CF80
  in hexadecimal)
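The π example can be checked directly. The helper `utf8_two_byte` is a hypothetical function written for this sketch; it covers only the two-byte case (code points 0x80 to 0x7FF), where the bits are packed as 110xxxxx 10xxxxxx.

```python
def utf8_two_byte(code_point):
    """Encode a code point in the two-byte UTF-8 range by hand."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only the two-byte range is handled in this sketch")
    first = 0b11000000 | (code_point >> 6)         # 110 + top 5 bits
    second = 0b10000000 | (code_point & 0b111111)  # 10 + low 6 bits
    return bytes([first, second])


code_point = ord("π")                  # Unicode number of pi
encoded = "π".encode("utf-8")          # Python's built-in encoder
by_hand = utf8_two_byte(code_point)    # same result from the bit pattern
bits = " ".join(f"{b:08b}" for b in encoded)
```

`code_point` is 960 (0x3C0), both encodings give the bytes CF 80, and `bits` is `11001111 10000000`, matching the slide.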
   Web sites declare the encoding in use:
    <html>
    <head>
   <meta http-equiv="Content-Type"
    content="text/html; charset=UTF-8" />
   Requirements for document storage system:
     Random access
      ○ request the content of a document based on its
        URL
      ○ hash function based on URL is typical
     Compression and large files
      ○ reducing storage requirements and efficient access
     Update
      ○ handling large volumes of new and modified
        documents
      ○ adding new anchor text
 Text is highly redundant (or predictable)
 Compression techniques exploit this
  redundancy to make files smaller without
  losing any of the content
 Popular algorithms can compress HTML and
  XML text by 80%
     e.g., DEFLATE (zip, gzip) and LZW (UNIX
     compress, PDF)
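A small sketch of lossless compression using Python's `zlib` module, which implements DEFLATE. The repeated table row is a made-up stand-in for markup; it is far more repetitive than a real page, so the ratio it produces says nothing about typical HTML.

```python
import zlib

# Artificially redundant markup (an assumption for illustration only).
html = ('<tr><td class="cell">value</td></tr>\n' * 200).encode("utf-8")

compressed = zlib.compress(html)
restored = zlib.decompress(compressed)

# Fraction of the original size that remains after compression.
ratio = len(compressed) / len(html)
```

`restored` is byte-for-byte identical to `html`, so nothing is lost, and `ratio` shows how much smaller the stored form is.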
   BigTable: Google’s document storage system (Chang
    et al., 2006)
     Customized for storing, finding, and updating web
      pages. Tablets served by 1000’s of machines.
     Handles large collection sizes using inexpensive
      computers
   No query language, no complex queries to
    optimize
   Only row-level transactions
   Tablets are stored in a replicated file system
    that is accessible by all BigTable servers
   Any changes to a BigTable tablet are
    recorded to a transaction log, which is also
    stored in a shared file system
   If any tablet server crashes, another server
    can immediately read the tablet data and
    transaction log from the file system and take
    over
 Logically organized into rows
 A row stores data for a single web page




 Combination of a row key, a column key, and
  a timestamp point to a single cell in the row
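The addressing scheme can be pictured as a nested map. This is a toy in-memory model, not BigTable's API: the class `ToyTable`, its methods, and the sample keys are all assumptions made for illustration.

```python
from collections import defaultdict


class ToyTable:
    """Toy model: (row key, column key, timestamp) -> one cell value."""

    def __init__(self):
        # row key -> column key -> {timestamp: value}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, column_key, timestamp, value):
        self._rows[row_key][column_key][timestamp] = value

    def get(self, row_key, column_key, timestamp=None):
        """Return the cell at the given timestamp, or the newest if omitted."""
        versions = self._rows.get(row_key, {}).get(column_key, {})
        if not versions:
            return None
        if timestamp is None:
            timestamp = max(versions)
        return versions.get(timestamp)


table = ToyTable()
table.put("www.example.com", "text", 1, "first crawl")
table.put("www.example.com", "text", 2, "second crawl")
latest = table.get("www.example.com", "text")
older = table.get("www.example.com", "text", timestamp=1)
```

One row holds everything about one page, and the timestamp lets several versions of the same cell coexist.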
 http://video.google.com/videoplay?docid=7278544055668715642
   BigTable can have a huge number of
    columns per row
     all rows have the same column groups
     not all rows have the same columns
     important for reducing disk reads to access
     document data
   Rows are partitioned into tablets based
    on their row keys
     simplifies determining which server is
     appropriate
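Because tablets cover contiguous ranges of row keys, finding the right server is a search over sorted split points. A sketch under stated assumptions: the split points, server names, and function names below are hypothetical.

```python
import bisect

# Hypothetical split points: tablet 0 holds keys below "g", tablet 1 holds
# keys from "g" up to (not including) "n", and so on.
SPLITS = ["g", "n", "t"]
SERVERS = ["server-0", "server-1", "server-2", "server-3"]


def tablet_for(row_key):
    """Return the index of the tablet whose key range contains row_key."""
    return bisect.bisect_right(SPLITS, row_key)


def server_for(row_key):
    """Return the (hypothetical) server currently holding that tablet."""
    return SERVERS[tablet_for(row_key)]


first = server_for("example.com")
middle = server_for("msu.edu")
last = server_for("zzz.org")
```

The lookup costs a binary search over the split points, regardless of how many rows each tablet contains.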
   Duplicate and near-duplicate documents
    occur in many situations
     Copies, versions, plagiarism, spam, mirror
      sites
     30% of the web pages in a large crawl are
      exact or near duplicates of pages in the other
      70%
   Duplicates consume significant resources
    during crawling, indexing, and search
     Little value to most users
 Exact duplicate detection is relatively easy
 Checksum techniques
     A checksum is a value that is computed based on the
      content of the document
      ○ e.g., sum of the bytes in the document file



     Possible for files with different text to have same
      checksum
   Functions such as a cyclic redundancy check
    (CRC), have been developed that consider the
    positions of the bytes
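The weakness of a plain byte sum, and the benefit of a position-sensitive function, can be seen with two anagrams. The helper `byte_sum_checksum` and its modulus are assumptions for this sketch; `zlib.crc32` is Python's standard CRC-32.

```python
import zlib


def byte_sum_checksum(data):
    """Naive checksum: sum of the bytes, kept to 16 bits."""
    return sum(data) % 65536


a = b"stop"
b = b"pots"   # same bytes as a, in a different order

same_sum = byte_sum_checksum(a) == byte_sum_checksum(b)
same_crc = zlib.crc32(a) == zlib.crc32(b)
```

`same_sum` is True because addition ignores order, while `same_crc` is False because the CRC depends on where each byte sits.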
   More challenging task
     Are web pages with the same text content but
     different advertising or format near-duplicates?
   A near-duplicate document is defined using a
    threshold value for some similarity measure
    between pairs of documents
     e.g., document D1 is a near-duplicate of
     document D2 if more than 90% of the words in
     the documents are the same
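One way to turn the 90% rule into code is to compare the sets of words in the two documents. The overlap measure below (shared words divided by all distinct words) is an assumption; the slide does not fix a particular similarity measure, and the function names are hypothetical.

```python
def word_overlap(d1, d2):
    """Fraction of distinct words shared by the two documents."""
    w1, w2 = set(d1.lower().split()), set(d2.lower().split())
    if not w1 and not w2:
        return 1.0
    return len(w1 & w2) / len(w1 | w2)


def is_near_duplicate(d1, d2, threshold=0.9):
    """Apply the threshold test to the similarity measure."""
    return word_overlap(d1, d2) > threshold


base = " ".join(f"w{i}" for i in range(20))        # 20 distinct words
almost_same = base + " extra"                      # one added word
half_same = " ".join([f"w{i}" for i in range(10)]
                     + [f"x{i}" for i in range(10)])

near = is_near_duplicate(base, almost_same)   # overlap 20/21
far = is_near_duplicate(base, half_same)      # overlap 10/30
```

Comparing every pair this way is quadratic in the number of documents, so it only illustrates the definition, not a scalable detector.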
 Many web pages contain text, links, and
  pictures that are not directly related to the
  main content of the page
 This additional material is mostly noise that
  could negatively affect the ranking of the
  page
 Techniques have been developed to detect
  the content blocks in a web page
     Non-content material is either ignored or reduced
     in importance in the indexing process
   Other approaches
    use DOM
    structure and
    visual (layout)
    features
   Cumulative distribution of tags in the example
    web page




     Main text content of the page corresponds to the
      “plateau” in the middle of the distribution
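A rough sketch of one way to locate such a plateau, assuming the page has already been split into tokens flagged as tag or non-tag. The scoring rule (count tags outside the span plus text tokens inside it) is a simple heuristic chosen for this example and is not necessarily the method behind the slide's figure; `find_content_span` is a hypothetical name.

```python
def find_content_span(is_tag):
    """Return (i, j), the token span that best separates text from tags.

    The score of a span is the number of tag tokens outside it plus the
    number of non-tag tokens inside it. Runs in O(n^2) using prefix sums.
    """
    n = len(is_tag)
    prefix = [0]   # prefix[k] = number of tag tokens among the first k tokens
    for flag in is_tag:
        prefix.append(prefix[-1] + (1 if flag else 0))
    total_tags = prefix[n]

    best_score, best_span = -1, (0, -1)
    for i in range(n):
        for j in range(i, n):
            tags_inside = prefix[j + 1] - prefix[i]
            text_inside = (j - i + 1) - tags_inside
            score = (total_tags - tags_inside) + text_inside
            if score > best_score:
                best_score, best_span = score, (i, j)
    return best_span


# Three tag tokens, four text tokens, then two tag tokens.
tokens = [True, True, True, False, False, False, False, True, True]
span = find_content_span(tokens)
```

For this toy input the best span covers exactly the four text tokens, positions 3 through 6, which is the flat stretch of the cumulative tag count.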
   R. Song et al., 2004. “Learning Block
    Importance Models for Web Pages”

Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 

Recently uploaded (20)

De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 

Information Retrieval, Encoding, Indexing, Big Table. Lecture 6 - Indexing

  • 2. - Some crawlers were stuck in loops. - Issues with storage - MSU’s site has about 60,00+ URLs including subdomains.
  • 3. - 37.5% of grade based on Crawler - 37.5% on presentation - 25% Attendance - Paper topics handed out next week
  • 4. - Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)
  • 5. Three reasons to use multiple computers for crawling: - Helps to put the crawler closer to the sites it crawls - Reduces the number of sites the crawler has to remember - Reduces computing resources required - Distributed crawler uses a hash function to assign URLs to crawling computers - hash function should be computed on the host part of each URL
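
A minimal sketch of the host-based assignment described in slide 5, assuming SHA-1 as the hash and a fixed number of crawling machines (the function name `crawler_for` and both choices are illustrative, not from the lecture). Hashing only the host part keeps every URL of one site on the same machine.

```python
import hashlib
from urllib.parse import urlsplit


def crawler_for(url: str, num_crawlers: int) -> int:
    """Return the index of the crawling machine responsible for `url`.

    Only the host part of the URL is hashed, so all pages of one site
    are assigned to the same machine.
    """
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then reduce
    # it to a machine index.
    return int.from_bytes(digest[:8], "big") % num_crawlers


same_site_a = crawler_for("http://example.com/index.html", 4)
same_site_b = crawler_for("http://example.com/about?page=2", 4)
```

Because the path and query string are ignored, `same_site_a` and `same_site_b` are always equal.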
  • 6. - Linear scan of documents is called “grepping” - In order to perform “ranked retrieval” we need the best answer among many documents (billions). - Need to index documents in advance. - Ex: Boolean retrieval
  • 7. - Ex: The works of Shakespeare. - Record whether each book contains a word out of all the words (~32,000) - Create a binary term-document incidence matrix.
  • 8. Suppose we want to answer the query Brutus AND Caesar AND NOT Calpurnia. We operate on the row vectors, complementing the row for Calpurnia: 110100 AND 110111 AND 101111 = 100100 => Ans: “Antony and Cleopatra” and “Hamlet”
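
The bit-vector computation in slide 8 can be reproduced with integer bit operations. This is a sketch: the column order of the plays is an assumption chosen to agree with the stated answer, and the Calpurnia row is inferred from its complement 101111 on the slide.

```python
# Assumed column order (leftmost bit first); not given on the slide.
PLAYS = [
    "Antony and Cleopatra",
    "Julius Caesar",
    "The Tempest",
    "Hamlet",
    "Othello",
    "Macbeth",
]

# Rows of the term-document incidence matrix, one bit per play.
brutus = 0b110100
caesar = 0b110111
calpurnia = 0b010000  # complement of 101111 from the slide

# Mask that keeps NOT within the six document columns.
MASK = (1 << len(PLAYS)) - 1


def matching_plays(bits: int) -> list[str]:
    """List the plays whose bit is set, reading bits left to right."""
    last = len(PLAYS) - 1
    return [play for i, play in enumerate(PLAYS) if (bits >> (last - i)) & 1]


result = brutus & caesar & (~calpurnia & MASK)
answer = matching_plays(result)
```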
  • 9. - Collection of documents: corpus - To assess the effectiveness of an IR system: - Precision: What fraction of the returned documents are relevant to the information need? - Recall: What fraction of the relevant documents in the collection were returned by the system?
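
As a small illustration of the two measures in slide 9, the sketch below computes them from sets of document IDs. The function name and the sample IDs are made up for the example.

```python
def precision_recall(returned, relevant):
    """Compute (precision, recall) for one query.

    returned: IDs of the documents the system returned
    relevant: IDs of the documents that are actually relevant
    """
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    # Guard against empty sets to avoid division by zero.
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents returned, two of them relevant; three relevant overall.
p, r = precision_recall(returned=[1, 2, 3, 4], relevant=[2, 4, 5])
```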
  • 10. - The incidence matrix from the last example is too large: a 500k x 1M matrix has ½ trillion 0’s and 1’s. It is also too sparse: 99.8% of the matrix is zero. - Inverted index: record only the items that do appear. - To gain speed at retrieval time we build an index in advance. - Two steps: - Collect documents - Tokenize the text (break into terms)
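
A minimal sketch of the two steps in slide 10, followed by building the postings lists. The tokenizer (lowercase, split on anything that is not a letter or digit) and the sample documents are assumptions made for illustration.

```python
import re
from collections import defaultdict


def build_inverted_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Map each term to the sorted list of docIDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Tokenize: lowercase the text and break it into terms.
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            postings[term].add(doc_id)
    # Store terms alphabetically with sorted postings lists.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


docs = {
    1: "Brutus killed Caesar",
    2: "Caesar was ambitious",
}
index = build_inverted_index(docs)
# The document frequency of a term is the length of its postings list.
doc_freq = {term: len(ids) for term, ids in index.items()}
```

Only the terms that actually occur are stored, which is what avoids the mostly-zero incidence matrix.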
  • 11.
  • 12. Text is stored in hundreds of incompatible file formats. Bytes on a web server. - e.g., raw text, RTF, HTML, XML, Microsoft Word, ODF, PDF - Other types of files also important - e.g., PowerPoint, Excel - Typically use a conversion tool - converts the document content into a tagged text format such as HTML or XML - retains some of the important formatting information
  • 13. A character encoding is a mapping between bits and glyphs - i.e., getting from bits in a file to characters on a screen - Can be a major source of incompatibility - ASCII (1963) is basic character encoding scheme for English - encodes 128 letters, numbers, special characters, and control characters in 7 bits, extended with an extra bit for storage in bytes
  • 14. Other languages can have many more glyphs - e.g., Chinese has more than 40,000 characters, with over 3,000 in common use - Many languages have multiple encoding schemes - e.g., CJK (Chinese-Japanese-Korean) family of East Asian languages, Hindi, Arabic - must specify encoding - can’t have multiple languages in one file
  • 15. - ASCII was able to represent all English characters using the numbers 32-127 (around the time Unix and C were being invented). - IBM started using the other bits for special characters. - Then came the ANSI Standard - The WWW caused all this to “crash”... Unicode (1991)
  • 16. - A different way of thinking about character representation - Question: is A different from A? In some languages even the shape matters. - Every letter is assigned a number (code point) by the Unicode Consortium. Ex: A = U+0041 (hex)
  • 17. - UTF-8 is identical to ASCII for code points 0-127. An encoding maps bits to glyphs. - You have to know what encoding is being used!
  • 18. - Single mapping from numbers to glyphs that attempts to include all glyphs in common use in all known languages - Unicode is a mapping between numbers and glyphs - does not uniquely specify bits to glyph mapping! - e.g., UTF-8 (1-4 bytes per character), UTF-16, UTF-32
  • 19. - e.g., Greek letter pi (π) is Unicode symbol number 960 - In binary, 00000011 11000000 (3C0 in hexadecimal) - Final encoding is (110)01111 (10)000000 (CF80 in hexadecimal)
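
The pi example in slide 19 can be checked in a few lines. The helper `utf8_two_byte` is a hypothetical name for a hand-rolled version of the two-byte UTF-8 form (code points 128-2047). It is a sketch for illustration, and Python's built-in `str.encode` is used as the reference.

```python
def utf8_two_byte(code_point: int) -> bytes:
    """Encode a code point in the range 128-2047 as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point is outside the two-byte UTF-8 range")
    # First byte: fixed prefix 110 followed by the top 5 bits.
    first = 0b11000000 | (code_point >> 6)
    # Second byte: fixed prefix 10 followed by the low 6 bits.
    second = 0b10000000 | (code_point & 0b00111111)
    return bytes([first, second])


pi = chr(960)                      # Greek letter pi, U+03C0
by_hand = utf8_two_byte(ord(pi))   # manual bit packing
built_in = pi.encode("utf-8")      # Python's own encoder
```

Both `by_hand` and `built_in` are the two bytes CF 80, matching the slide.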
  • 20.
  • 21. Web sites use: <html> <head> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
  • 22. Requirements for document storage system: - Random access ○ request the content of a document based on its URL ○ hash function based on URL is typical - Compression and large files ○ reducing storage requirements and efficient access - Update ○ handling large volumes of new and modified documents ○ adding new anchor text
  • 23. - Text is highly redundant (or predictable) - Compression techniques exploit this redundancy to make files smaller without losing any of the content - Popular algorithms can compress HTML and XML text by 80% - e.g., DEFLATE (zip, gzip) and LZW (UNIX compress, PDF)
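
A short sketch of lossless compression with DEFLATE, using Python's standard `zlib` module. The sample HTML is made up and deliberately repetitive, so the ratio it produces says nothing about real pages.

```python
import zlib

# Made-up, highly redundant HTML used only for illustration.
html = (
    "<html><body><table>"
    + "<tr><td class='cell'>value</td></tr>" * 200
    + "</table></body></html>"
).encode("utf-8")

compressed = zlib.compress(html)        # DEFLATE in a zlib container
restored = zlib.decompress(compressed)  # lossless: the original bytes

ratio = len(compressed) / len(html)
print(f"{len(html)} bytes -> {len(compressed)} bytes ({ratio:.1%} of original)")
```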
  • 24. Google’s document storage system (Chang et al., 2006) - Customized for storing, finding, and updating web pages. Tablets served by 1000’s of machines. - Handles large collection sizes using inexpensive computers
  • 25. No query language, no complex queries to optimize - Only row-level transactions - Tablets are stored in a replicated file system that is accessible by all BigTable servers - Any changes to a BigTable tablet are recorded to a transaction log, which is also stored in a shared file system - If any tablet server crashes, another server can immediately read the tablet data and transaction log from the file system and take over
  • 26. - Logically organized into rows - A row stores data for a single web page - Combination of a row key, a column key, and a timestamp point to a single cell in the row - http://video.google.com/videoplay?docid=7278544055668715642
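
To make the cell addressing in slide 26 concrete, here is a toy in-memory model in which a (row key, column key, timestamp) triple points to one value. The class `ToyTable`, its methods, and the sample keys are all hypothetical. This is not BigTable's API, and nothing here is distributed or persistent.

```python
from collections import defaultdict


class ToyTable:
    """Toy model: row key -> column key -> timestamp -> value."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, column_key, timestamp, value):
        self._rows[row_key][column_key][timestamp] = value

    def get(self, row_key, column_key, timestamp=None):
        """Return the newest value at or before `timestamp`, or None.

        With no timestamp, the newest stored version is returned.
        """
        versions = self._rows.get(row_key, {}).get(column_key, {})
        stamps = [t for t in versions if timestamp is None or t <= timestamp]
        return versions[max(stamps)] if stamps else None


table = ToyTable()
# One row per web page; rows need not share the same columns.
table.put("www.example.com", "text", 1, "<html>old</html>")
table.put("www.example.com", "text", 2, "<html>new</html>")
table.put("www.example.com", "anchor:other.com", 2, "example")

latest = table.get("www.example.com", "text")
older = table.get("www.example.com", "text", timestamp=1)
```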
  • 27. BigTable can have a huge number of columns per row - all rows have the same column groups - not all rows have the same columns - important for reducing disk reads to access document data - Rows are partitioned into tablets based on their row keys - simplifies determining which server is appropriate
  • 28. Duplicate and near-duplicate documents occur in many situations - Copies, versions, plagiarism, spam, mirror sites - 30% of the web pages in a large crawl are exact or near duplicates of pages in the other 70% - Duplicates consume significant resources during crawling, indexing, and search - Little value to most users
  • 29. - Exact duplicate detection is relatively easy - Checksum techniques - A checksum is a value that is computed based on the content of the document ○ e.g., sum of the bytes in the document file - Possible for files with different text to have same checksum - Functions such as a cyclic redundancy check (CRC) have been developed that consider the positions of the bytes
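
The weakness of a plain byte sum noted in slide 29 shows up as soon as two files contain the same bytes in a different order. The sketch below compares it with CRC-32 from Python's `zlib` module. The two sample strings are made up.

```python
import zlib


def byte_sum(data: bytes) -> int:
    """Naive checksum: the sum of the bytes, ignoring their positions."""
    return sum(data) % 2**32


a = b"stop"
b = b"spot"  # same bytes as `a`, in a different order

# The byte sums collide because addition ignores byte order.
sums_collide = byte_sum(a) == byte_sum(b)

# CRC-32 depends on byte positions, so these two inputs differ.
crcs_differ = zlib.crc32(a) != zlib.crc32(b)
```

CRC-32 can still collide on other inputs. It only makes this kind of reordering collision far less likely.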
  • 30. More challenging task - Are web pages with the same text content but different advertising or format near-duplicates? - A near-duplicate document is defined using a threshold value for some similarity measure between pairs of documents - e.g., document D1 is a near-duplicate of document D2 if more than 90% of the words in the documents are the same
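
One way to read the 90% rule in slide 30 is as a threshold on word-set overlap. The sketch below uses Jaccard similarity over the sets of distinct words, which is an assumption: the slide does not say which similarity measure is meant. The function names and sample documents are illustrative.

```python
def word_overlap(d1: str, d2: str) -> float:
    """Jaccard similarity between the sets of distinct words in two texts."""
    w1, w2 = set(d1.lower().split()), set(d2.lower().split())
    if not w1 and not w2:
        return 1.0  # two empty documents are treated as identical
    return len(w1 & w2) / len(w1 | w2)


def is_near_duplicate(d1: str, d2: str, threshold: float = 0.9) -> bool:
    """True if the overlap is above the threshold (90% by default)."""
    return word_overlap(d1, d2) > threshold


words = [f"w{i}" for i in range(20)]
page = " ".join(words)
# Same page with one word swapped for different advertising text.
page_with_ad = " ".join(words[:19] + ["ad"])
# A page that shares only 15 of the 20 words.
other_page = " ".join(words[:15] + [f"x{i}" for i in range(5)])

close = is_near_duplicate(page, page_with_ad)  # overlap 19/21
far = is_near_duplicate(page, other_page)      # overlap 15/25
```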
  • 31. - Many web pages contain text, links, and pictures that are not directly related to the main content of the page - This additional material is mostly noise that could negatively affect the ranking of the page - Techniques have been developed to detect the content blocks in a web page - Non-content material is either ignored or reduced in importance in the indexing process
  • 32.
  • 33. - Many web pages contain text, links, and pictures that are not directly related to the main content of the page - This additional material is mostly noise that could negatively affect the ranking of the page - Techniques have been developed to detect the content blocks in a web page - Non-content material is either ignored or reduced in importance in the indexing process
  • 34. Other approaches use DOM structure and visual (layout) features
  • 35. Cumulative distribution of tags in the example web page - Main text content of the page corresponds to the “plateau” in the middle of the distribution
  • 36. R. Song et al., 2004. “Learning Block Importance Models for Web Pages”

Editor's Notes

  1. Grab all terms and store the docID for each. 2nd column: organize the terms alphabetically. 3rd column: document frequency, the number of documents a term appears in (good for speed later). Last column: the postings list, which documents the term appears in. Good for ranking algorithms later.
  2. Comp
  3. 2^8 = 256
  4. 2^8 = 256
  5. 2^8 = 256
  6. 960 = 1111000000 in binary, so 10 bits, which means we need to use the 128-2047 range. The first bits of each byte are FIXED in the coding scheme.
  7. Distributed database p6. 57 IR book. In use at Google. Petabyte = 10^15 bytes. Can’t use SQL it is a large relational database. Table
  8. Notes: Not all rows have to have the same columns. A big difference from relational databases. The site is example.com. Other.com links to example.com (since that is the row we are talking about) with anchor text “example”. Anchor is the anchor text. Null.com links to example.com with anchor text “click here”.
  9. Tokens. Should we treat apple the same as Apple?
  10. DOM –Document object Model. Rong
  11. DOM –Document object Model. Rong