Web Indexing
About
• Includes back-of-book-style indexes to
  individual websites or an intranet.
• Creation of keyword metadata to provide a
  more useful vocabulary for the Internet.
• It is also becoming important for periodical
  websites as their number increases.
Purpose
• Collects, parses, and stores data to facilitate
  fast and accurate information retrieval.
• Index design incorporates interdisciplinary
  concepts from linguistics, cognitive
  psychology, mathematics, informatics, physics
  and computer science.
• Popular engines focus on the full-text indexing
  of online, natural language documents.
Purpose(contd..)
• Media types such as video, audio, and
  graphics are also searchable.
• Cache-based search engines permanently
  store the index along with the corpus.
• Unlike full-text indices, partial-text services
  restrict the depth indexed to reduce index
  size.
Back-of-the-book-style
• Back-of-the-book-style web indexes may be
  called "web site A-Z indexes."
• The implication of "A-Z" is that there is an
  alphabetical browse view or interface.
• An A-Z index can also be used to index multiple
  sites, rather than the multiple pages of a
  single site, though this is unusual.
Metadata web indexing
• Metadata web indexing involves assigning
  keywords or phrases to web pages or web
  sites within a meta-tag.
• The web page or web site can be retrieved
  with a search engine that is customized to
  search the keywords field.
• This may or may not involve using keywords
  restricted to a controlled vocabulary list.
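As a hedged illustration of how such a keywords field might be read for indexing (the class and variable names below are hypothetical, not any engine's actual API), a small parser can pull the content of a <meta name="keywords"> tag from a page:

```python
# Minimal sketch: collect the contents of <meta name="keywords" content="...">
# tags so the keywords field can be indexed separately from the page body.
from html.parser import HTMLParser


class MetaKeywordExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        if attr_map.get("name", "").lower() == "keywords":
            content = attr_map.get("content") or ""
            self.keywords.extend(k.strip() for k in content.split(",") if k.strip())


page = '<html><head><meta name="keywords" content="indexing, search, metadata"></head></html>'
extractor = MetaKeywordExtractor()
extractor.feed(page)
print(extractor.keywords)  # ['indexing', 'search', 'metadata']
```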
Purpose
• To optimize speed and performance in finding
  relevant documents for a search query.
• Without an index, the search engine would have to
  scan every document in the corpus, which would
  require considerable time and computing power.
• The additional computer storage required to store
  the index, and the increase in the time required
  for an update to take place, are traded off for
  the time saved during information retrieval.
Index Design Factors
• Merge factors
  – indexer must first check whether it is updating old
    content or adding new content.
  – similar in concept to the SQL Merge command
    and other merge algorithms.
• Storage techniques
  – How the information should be stored: as
    compressed or filtered data.
Index Design Factors(Contd..)
• Index size
  – Computer storage required to support the index.


• Lookup speed
  – How quickly a word can be found in the inverted
    index.
  – The speed of finding an entry in a data structure,
    compared with how quickly it can be updated or
    removed.
Index Design Factors(Contd..)
• Maintenance
  – How the index is maintained over time.
• Fault tolerance
  – The service must be reliable; this involves:
  – dealing with index corruption,
  – determining whether bad data can be treated in
    isolation,
  – dealing with bad hardware,
  – partitioning, with schemes such as hash-based or
    composite partitioning (see the sketch after this
    list), and
  – replication.
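A minimal sketch of the hash-based partitioning idea mentioned above (NUM_SHARDS and shard_for are illustrative names, not part of any particular system):

```python
# Route each document to one of NUM_SHARDS index partitions by hashing its id.
import hashlib

NUM_SHARDS = 4


def shard_for(doc_id: str) -> int:
    """Deterministically map a document id to a shard number."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS


for url in ["http://example.com/a", "http://example.com/b", "http://example.com/c"]:
    print(url, "-> shard", shard_for(url))
```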
Index Data Structures
• Suffix tree
   – Structured like a tree, supports linear time lookup.
   – Built by storing the suffixes of words.
   – Supports extendable hashing, which is important
     for search engine indexing.
   – Used for searching for patterns in DNA sequences and
     clustering.
Index Data Structures(Contd..)
• Trie
  – An ordered tree data structure used to store an
    associative array where the keys are strings
    (a minimal sketch follows this list).
  – Faster than a hash table but less space-efficient.
• Inverted index
  – Stores a list of occurrences of each atomic search
    criterion.
• Citation index
  – Stores citations or hyperlinks between documents to
    support citation analysis.
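A minimal trie sketch for the structure described above (toy class names; production tries use much more compact node layouts):

```python
# Toy trie: an ordered tree keyed on strings, mapping each key to a value
# (here, a posting list of document ids).
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.value = None
        self.is_key = False


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key, value):
        node = self.root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.is_key = True
        node.value = value

    def lookup(self, key):
        node = self.root
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.value if node.is_key else None


terms = Trie()
terms.insert("search", [1, 4, 9])   # term -> document ids
print(terms.lookup("search"))       # [1, 4, 9]
print(terms.lookup("sea"))          # None: a prefix, not a stored key
```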
Index Data Structures(Contd..)
• Ngram index
  – Stores sequences of n items (n-grams) of data to
    support other types of retrieval or text mining.


• Term document matrix
  – Used in latent semantic analysis, stores the
    occurrences of words in documents in a two-
    dimensional sparse matrix.
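A hedged sketch of a term-document matrix kept as a sparse dictionary-of-dictionaries (a toy representation; latent semantic analysis implementations normally use numeric sparse-matrix libraries):

```python
# Toy sparse term-document matrix: matrix[term][doc_id] = occurrence count.
from collections import defaultdict

documents = {
    "doc1": "web indexing stores data for fast retrieval",
    "doc2": "the inverted index stores word occurrences",
}

matrix = defaultdict(lambda: defaultdict(int))
for doc_id, text in documents.items():
    for term in text.split():
        matrix[term][doc_id] += 1

print(dict(matrix["stores"]))  # {'doc1': 1, 'doc2': 1}
```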
Indexes vs. Taxonomies
• Hierarchical taxonomy vs. alphabetical index
• Two-step process of taxonomy development
  and content linking vs. integrated
  indexing/index creation
• Each is more suitable for different kinds of
  content.
• A site may have both, as different means to
  access the same content.
Challenges in Parallelism
• Management of parallel computing processes.
• Many opportunities for race conditions and
  coherent faults.
• Collisions between two competing tasks are possible.
• Search engine's architecture may involve
  distributed computing, where the search engine
  consists of several machines operating in unison.
• This makes it more difficult to maintain a fully
  synchronized, distributed, parallel architecture.
Index Merging
• Inverted index is filled via a merge or rebuild.
• Rebuild is similar to a merge but first deletes
  the contents of the inverted index.
• A merge conflates newly indexed documents,
  typically residing in virtual memory, with the
  index cache residing on one or more
  computer hard drives.
• Inverted index is a word-sorted forward index.
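A hedged sketch of the merge step, with an in-memory Python dict standing in for the on-disk index cache (real engines merge sorted posting files on disk; all names here are illustrative):

```python
# Merge newly indexed postings (in memory) into an existing inverted index.
# Posting lists are kept sorted and de-duplicated.
import heapq


def merge_postings(existing, new):
    merged = {}
    for term in set(existing) | set(new):
        combined = heapq.merge(existing.get(term, []), new.get(term, []))
        postings = []
        for doc_id in combined:
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
        merged[term] = postings
    return merged


on_disk = {"web": [1, 3], "index": [2]}
in_memory = {"web": [4], "merge": [4]}
print(merge_postings(on_disk, in_memory))
# e.g. {'index': [2], 'merge': [4], 'web': [1, 3, 4]} (key order may vary)
```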
The Forward Index
• Stores a list of words for each document.
• As documents are parsed, it is better to
  immediately store the words per document.
• The forward index is sorted to transform it into
  an inverted index.
• Converting a forward index to an inverted index is
  only a matter of sorting the pairs by the words.
• Essentially a list of pairs consisting of a
  document and a word, collated by the
  document.
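A minimal sketch of that conversion on toy data: flatten the forward index into (word, document) pairs, sort by word, and group the pairs into posting lists.

```python
# Convert a forward index (document -> words) into an inverted index
# (word -> sorted document ids) by sorting the (word, document) pairs.
from itertools import groupby

forward_index = {
    "doc1": ["web", "indexing", "stores", "data"],
    "doc2": ["inverted", "index", "stores", "words"],
}

pairs = sorted(
    (word, doc_id)
    for doc_id, words in forward_index.items()
    for word in words
)

inverted_index = {
    word: sorted({doc_id for _, doc_id in group})
    for word, group in groupby(pairs, key=lambda pair: pair[0])
}

print(inverted_index["stores"])  # ['doc1', 'doc2']
```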
Compression
• Generating or maintaining a large-scale search
  engine index represents a significant storage
  and processing challenge.
• Compression is used to reduce the size of the
  indices on disk.
• Tradeoff is the time and processing power
  required to perform compression and
  decompression.
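As a hedged illustration of one common compression idea (not tied to any particular engine's format), a sorted posting list can be stored as document-id gaps, each gap encoded in a variable number of bytes:

```python
# Delta-encode sorted document ids, then variable-byte encode each gap:
# 7 data bits per byte, with the high bit set on the final byte of a number.
def vbyte_encode(numbers):
    out = bytearray()
    for n in numbers:
        chunk = []
        while True:
            chunk.append(n % 128)
            if n < 128:
                break
            n //= 128
        chunk[0] += 128              # mark the terminating byte
        out.extend(reversed(chunk))
    return bytes(out)


def compress_postings(doc_ids):
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return vbyte_encode(gaps)


postings = [824, 829, 215406]
encoded = compress_postings(postings)
print(len(encoded), "bytes instead of", 4 * len(postings))  # 6 bytes instead of 12
```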
A Scenario
• A full text, Internet search engine:
  – 6,000,000,000 different web pages exist as of the
    year 2008.
  – 250 words on each webpage (based on the
    assumption they are similar to the pages of a
    novel).
  – 8 bits (or 1 byte) to store a single character.
  – average number of characters in any given word
    on a page may be estimated at 5
  – average personal computer comes with 100 to
    250 gigabytes of usable space.
Under the Same Scenario
• an uncompressed index (assuming a non-
  conflated, simple, index) for 6 billion web
  pages would need to store 1500 billion word
  entries.
• At 1 byte per character, or 5 bytes per word, this
  would require 7500 gigabytes of storage space
  alone.
• The index can be reduced to a fraction of this
  size using a suitable compression algorithm.
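The arithmetic behind these figures as a short sketch, using the scenario's assumptions (decimal gigabytes):

```python
# Back-of-the-envelope size of an uncompressed, non-conflated index.
pages = 6_000_000_000       # web pages in the scenario
words_per_page = 250        # assumed average
bytes_per_word = 5          # 5 characters at 1 byte each

word_entries = pages * words_per_page          # 1,500,000,000,000 entries
raw_bytes = word_entries * bytes_per_word
print(raw_bytes / 1_000_000_000, "gigabytes")  # 7500.0 gigabytes
```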
Document Parsing
• Breaks apart the components (words) of a
  document or other form of media for
  insertion into the forward and inverted
  indices.
• words found are called tokens, and so, in the
  context of search engine indexing and natural
  language processing, parsing is more
  commonly referred to as tokenization.
Document Parsing(Contd..)
• Also sometimes called word boundary
  disambiguation, tagging, text segmentation,
  content analysis, text analysis, text mining,
  concordance generation, speech
  segmentation, lexing, or lexical analysis.
• 'indexing', 'parsing', and 'tokenization' are
  used interchangeably in corporate slang.
Challenges in Natural Language
                Processing
•   Word Boundary Ambiguity
•   Language Ambiguity
•   Diverse File Formats
•   Faulty Storage
Tokenization
• Computers do not understand the structure of a
  natural language document and cannot
  automatically recognize words and sentences.
• The computer must be programmed to identify what
  constitutes an individual or distinct word,
  referred to as a token.
• Such a program is commonly called a tokenizer,
  parser, or lexer.
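A hedged sketch of a very simple tokenizer (a regular-expression pass over English-like text; real tokenizers handle punctuation, hyphenation, and languages without explicit word boundaries far more carefully):

```python
# Toy tokenizer: extract lowercase word tokens with a regular expression.
import re

TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+")


def tokenize(text):
    return [match.group(0).lower() for match in TOKEN_PATTERN.finditer(text)]


print(tokenize("Computers cannot automatically recognize words, sentences..."))
# ['computers', 'cannot', 'automatically', 'recognize', 'words', 'sentences']
```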
Format Analysis
• HTML
• ASCII text files (a text document without specific computer
  readable formatting)
• Adobe's Portable Document Format (PDF)
• PostScript (PS)
• LaTeX
• UseNet net news server formats
• XML and derivatives like RSS
• SGML
• Multimedia metadata formats like ID3
• Microsoft Word
• Microsoft Excel
• Microsoft PowerPoint
• IBM Lotus Notes
Format Analysis(Compressed)
• ZIP - Zip archive file
• RAR - Roshal ARchive File
• CAB - Microsoft Windows Cabinet File
• Gzip - File compressed with gzip
• BZIP - File compressed using bzip2
• Tape ARchive (TAR), Unix archive file, not (itself)
  compressed
• TAR.Z, TAR.GZ or TAR.BZ2 - Unix archive files
  compressed with Compress, GZIP or BZIP2
