Web Indexing
About
• Includes back-of-book-style indexes to
  individual websites or an intranet.
• Creation of keyword metadata to provide a
  more useful vocabulary for the Internet.
• It is also becoming important for periodical
  websites as their number increases.
Purpose
• Collects, parses, and stores data to facilitate
  fast and accurate information retrieval.
• Index design incorporates interdisciplinary
  concepts from linguistics, cognitive
  psychology, mathematics, informatics, physics
  and computer science.
• Popular engines focus on the full-text indexing
  of online, natural language documents.
Purpose(contd..)
• Media types such as video, audio, and
  graphics are also searchable.
• Cache-based search engines permanently
  store the index along with the corpus.
• Unlike full-text indices, partial-text services
  restrict the depth indexed to reduce index
  size.
Back-of-the-book-style
• Back-of-the-book-style web indexes may be
  called "web site A-Z indexes."
• The implication of "A-Z" is that there is an
  alphabetical browse view or interface.
• An A-Z index can also be used to index multiple
  sites, rather than the multiple pages of a
  single site, though this is unusual.
Metadata web indexing
• Metadata web indexing involves assigning
  keywords or phrases to web pages or web
  sites within a meta-tag.
• The web page or web site can be retrieved
  with a search engine that is customized to
  search the keywords field.
• This may or may not involve using keywords
  restricted to a controlled vocabulary list.
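As a hedged illustration of how such a keywords field might be read for indexing (the class and variable names below are hypothetical, not any engine's actual API), a small parser can pull the content of a <meta name="keywords"> tag from a page:

```python
# Minimal sketch: collect the contents of <meta name="keywords" content="...">
# tags so the keywords field can be indexed separately from the page body.
from html.parser import HTMLParser


class MetaKeywordExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        if attr_map.get("name", "").lower() == "keywords":
            content = attr_map.get("content") or ""
            self.keywords.extend(k.strip() for k in content.split(",") if k.strip())


page = '<html><head><meta name="keywords" content="indexing, search, metadata"></head></html>'
extractor = MetaKeywordExtractor()
extractor.feed(page)
print(extractor.keywords)  # ['indexing', 'search', 'metadata']
```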
Purpose
• To optimize speed and performance in finding
  relevant documents for a search query.
• Without an index, the search engine would have to
  scan every document in the corpus, which would
  require considerable time and computing power.
• The additional computer storage required to store
  the index, and the increase in the time required
  for an update to take place, are traded off for
  the time saved during information retrieval.
Index Design Factors
• Merge factors
  – indexer must first check whether it is updating old
    content or adding new content.
  – similar in concept to the SQL Merge command
    and other merge algorithms.
• Storage techniques
  – How the information should be stored: as
    compressed or filtered data.
Index Design Factors(Contd..)
• Index size
  – Computer storage required to support the index.


• Lookup speed
  – How quickly a word can be found in the inverted
    index.
  – The speed of finding an entry in a data structure,
    compared with how quickly it can be updated or
    removed.
Index Design Factors(Contd..)
• Maintenance
  – How the index is maintained over time.
• Fault tolerance
  – The service must be reliable; this involves:
  – dealing with index corruption,
  – determining whether bad data can be treated in
    isolation,
  – dealing with bad hardware,
  – partitioning, with schemes such as hash-based or
    composite partitioning (see the sketch after this
    list), and
  – replication.
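A minimal sketch of the hash-based partitioning idea mentioned above (NUM_SHARDS and shard_for are illustrative names, not part of any particular system):

```python
# Route each document to one of NUM_SHARDS index partitions by hashing its id.
import hashlib

NUM_SHARDS = 4


def shard_for(doc_id: str) -> int:
    """Deterministically map a document id to a shard number."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS


for url in ["http://example.com/a", "http://example.com/b", "http://example.com/c"]:
    print(url, "-> shard", shard_for(url))
```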
Index Data Structures
• Suffix tree
   – Structured like a tree, supports linear time lookup.
   – Built by storing the suffixes of words.
   – Supports extendable hashing, which is important
     for search engine indexing.
   – Used for searching for patterns in DNA sequences and
     clustering.
Index Data Structures(Contd..)
• Trie
  – An ordered tree data structure used to store an
    associative array where the keys are strings
    (a minimal sketch follows this list).
  – Faster than a hash table but less space-efficient.
• Inverted index
  – Stores a list of occurrences of each atomic search
    criterion.
• Citation index
  – Stores citations or hyperlinks between documents to
    support citation analysis.
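A minimal trie sketch for the structure described above (toy class names; production tries use much more compact node layouts):

```python
# Toy trie: an ordered tree keyed on strings, mapping each key to a value
# (here, a posting list of document ids).
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.value = None
        self.is_key = False


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key, value):
        node = self.root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.is_key = True
        node.value = value

    def lookup(self, key):
        node = self.root
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.value if node.is_key else None


terms = Trie()
terms.insert("search", [1, 4, 9])   # term -> document ids
print(terms.lookup("search"))       # [1, 4, 9]
print(terms.lookup("sea"))          # None: a prefix, not a stored key
```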
Index Data Structures(Contd..)
• Ngram index
  – Stores sequences of n items (n-grams) of data to
    support other types of retrieval or text mining.


• Term document matrix
  – Used in latent semantic analysis, stores the
    occurrences of words in documents in a two-
    dimensional sparse matrix.
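A hedged sketch of a term-document matrix kept as a sparse dictionary-of-dictionaries (a toy representation; latent semantic analysis implementations normally use numeric sparse-matrix libraries):

```python
# Toy sparse term-document matrix: matrix[term][doc_id] = occurrence count.
from collections import defaultdict

documents = {
    "doc1": "web indexing stores data for fast retrieval",
    "doc2": "the inverted index stores word occurrences",
}

matrix = defaultdict(lambda: defaultdict(int))
for doc_id, text in documents.items():
    for term in text.split():
        matrix[term][doc_id] += 1

print(dict(matrix["stores"]))  # {'doc1': 1, 'doc2': 1}
```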
Indexes vs. Taxonomies
• Hierarchical taxonomy vs. alphabetical index
• Two-step process of taxonomy development
  and content linking vs. integrated
  indexing/index creation
• Each is more suitable for different kinds of
  content.
• A site may have both, as different means to
  access the same content.
Challenges in Parallelism
• Management of parallel computing processes.
• Many opportunities for race conditions and
  coherent faults.
• Collisions between two competing tasks are possible.
• Search engine's architecture may involve
  distributed computing, where the search engine
  consists of several machines operating in unison.
• This makes it more difficult to maintain a fully
  synchronized, distributed, parallel architecture.
Index Merging
• Inverted index is filled via a merge or rebuild.
• Rebuild is similar to a merge but first deletes
  the contents of the inverted index.
• A merge conflates newly indexed documents,
  typically residing in virtual memory, with the
  index cache residing on one or more
  computer hard drives.
• Inverted index is a word-sorted forward index.
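A hedged sketch of the merge step, with an in-memory Python dict standing in for the on-disk index cache (real engines merge sorted posting files on disk; all names here are illustrative):

```python
# Merge newly indexed postings (in memory) into an existing inverted index.
# Posting lists are kept sorted and de-duplicated.
import heapq


def merge_postings(existing, new):
    merged = {}
    for term in set(existing) | set(new):
        combined = heapq.merge(existing.get(term, []), new.get(term, []))
        postings = []
        for doc_id in combined:
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
        merged[term] = postings
    return merged


on_disk = {"web": [1, 3], "index": [2]}
in_memory = {"web": [4], "merge": [4]}
print(merge_postings(on_disk, in_memory))
# e.g. {'index': [2], 'merge': [4], 'web': [1, 3, 4]} (key order may vary)
```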
The Forward Index
• Stores a list of words for each document.
• As documents are parsed, it is better to
  immediately store the words per document.
• The forward index is sorted to transform it into
  an inverted index.
• Converting a forward index to an inverted index is
  only a matter of sorting the pairs by the words.
• Essentially a list of pairs consisting of a
  document and a word, collated by the
  document.
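A minimal sketch of that conversion on toy data: flatten the forward index into (word, document) pairs, sort by word, and group the pairs into posting lists.

```python
# Convert a forward index (document -> words) into an inverted index
# (word -> sorted document ids) by sorting the (word, document) pairs.
from itertools import groupby

forward_index = {
    "doc1": ["web", "indexing", "stores", "data"],
    "doc2": ["inverted", "index", "stores", "words"],
}

pairs = sorted(
    (word, doc_id)
    for doc_id, words in forward_index.items()
    for word in words
)

inverted_index = {
    word: sorted({doc_id for _, doc_id in group})
    for word, group in groupby(pairs, key=lambda pair: pair[0])
}

print(inverted_index["stores"])  # ['doc1', 'doc2']
```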
Compression
• Generating or maintaining a large-scale search
  engine index represents a significant storage
  and processing challenge.
• Compression is used to reduce the size of the
  indices on disk.
• Tradeoff is the time and processing power
  required to perform compression and
  decompression.
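As a hedged illustration of one common compression idea (not tied to any particular engine's format), a sorted posting list can be stored as document-id gaps, each gap encoded in a variable number of bytes:

```python
# Delta-encode sorted document ids, then variable-byte encode each gap:
# 7 data bits per byte, with the high bit set on the final byte of a number.
def vbyte_encode(numbers):
    out = bytearray()
    for n in numbers:
        chunk = []
        while True:
            chunk.append(n % 128)
            if n < 128:
                break
            n //= 128
        chunk[0] += 128              # mark the terminating byte
        out.extend(reversed(chunk))
    return bytes(out)


def compress_postings(doc_ids):
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return vbyte_encode(gaps)


postings = [824, 829, 215406]
encoded = compress_postings(postings)
print(len(encoded), "bytes instead of", 4 * len(postings))  # 6 bytes instead of 12
```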
A Scenario
• A full text, Internet search engine:
  – 6,000,000,000 different web pages exist as of the
    year 2008.
  – 250 words on each webpage (based on the
    assumption they are similar to the pages of a
    novel).
  – 8 bits (or 1 byte) to store a single character.
  – average number of characters in any given word
    on a page may be estimated at 5
  – average personal computer comes with 100 to
    250 gigabytes of usable space.
Under the Same Scenario
• an uncompressed index (assuming a non-
  conflated, simple, index) for 6 billion web
  pages would need to store 1500 billion word
  entries.
• At 1 byte per character, or 5 bytes per word, this
  would require 7500 gigabytes of storage space
  alone.
• The index can be reduced to a fraction of this
  size using a suitable compression algorithm.
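The arithmetic behind these figures as a short sketch, using the scenario's assumptions (decimal gigabytes):

```python
# Back-of-the-envelope size of an uncompressed, non-conflated index.
pages = 6_000_000_000       # web pages in the scenario
words_per_page = 250        # assumed average
bytes_per_word = 5          # 5 characters at 1 byte each

word_entries = pages * words_per_page          # 1,500,000,000,000 entries
raw_bytes = word_entries * bytes_per_word
print(raw_bytes / 1_000_000_000, "gigabytes")  # 7500.0 gigabytes
```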
Document Parsing
• Breaks apart the components (words) of a
  document or other form of media for
  insertion into the forward and inverted
  indices.
• words found are called tokens, and so, in the
  context of search engine indexing and natural
  language processing, parsing is more
  commonly referred to as tokenization.
Document Parsing(Contd..)
• Also sometimes called word boundary
  disambiguation, tagging, text segmentation,
  content analysis, text analysis, text mining,
  concordance generation, speech
  segmentation, lexing, or lexical analysis.
• 'indexing', 'parsing', and 'tokenization' are
  used interchangeably in corporate slang.
Challenges in Natural Language
                Processing
•   Word Boundary Ambiguity
•   Language Ambiguity
•   Diverse File Formats
•   Faulty Storage
Tokenization
• Computers do not understand the structure of a
  natural language document and cannot
  automatically recognize words and sentences.
• The computer must be programmed to identify what
  constitutes an individual or distinct word,
  referred to as a token.
• Such a program is commonly called a tokenizer,
  parser, or lexer.
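A hedged sketch of a very simple tokenizer (a regular-expression pass over English-like text; real tokenizers handle punctuation, hyphenation, and languages without explicit word boundaries far more carefully):

```python
# Toy tokenizer: extract lowercase word tokens with a regular expression.
import re

TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+")


def tokenize(text):
    return [match.group(0).lower() for match in TOKEN_PATTERN.finditer(text)]


print(tokenize("Computers cannot automatically recognize words, sentences..."))
# ['computers', 'cannot', 'automatically', 'recognize', 'words', 'sentences']
```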
Format Analysis
• HTML
• ASCII text files (a text document without specific computer
  readable formatting)
• Adobe's Portable Document Format (PDF)
• PostScript (PS)
• LaTeX
• UseNet net news server formats
• XML and derivatives like RSS
• SGML
• Multimedia metadata formats like ID3
• Microsoft Word
• Microsoft Excel
• Microsoft PowerPoint
• IBM Lotus Notes
Format Analysis(Compressed)
• ZIP - Zip archive file
• RAR - Roshal ARchive File
• CAB - Microsoft Windows Cabinet File
• Gzip - File compressed with gzip
• BZIP - File compressed using bzip2
• Tape ARchive (TAR), Unix archive file, not (itself)
  compressed
• TAR.Z, TAR.GZ or TAR.BZ2 - Unix archive files
  compressed with Compress, GZIP or BZIP2
