About• Includes back-of-book-style indexes to individual websites or an intranet.• Creation of keyword metadata to provide a more useful vocabulary for Internet search.• It is also becoming important for periodical websites as their number increases.
Purpose• Collects, parses, and stores data to facilitate fast and accurate information retrieval.• Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, physics and computer science.• Popular engines focus on the full-text indexing of online, natural language documents.
Purpose(contd..)• Media types such as video and audio and graphics are also searchable.• Cache-based search engines permanently store the index along with the corpus.• Unlike full-text indices, partial-text services restrict the depth indexed to reduce index size.
Back-of-the-book-style• Back-of-the-book-style web indexes may be called "web site A-Z indexes."• The implication of "A-Z" is that there is an alphabetical browse view or interface.• An A-Z index could be used to index multiple sites rather than the multiple pages of a single site, but this is unusual.
Metadata web indexing• Metadata web indexing involves assigning keywords or phrases to web pages or web sites within a meta-tag.• The web page or web site can be retrieved with a search engine that is customized to search the keywords field.• This may or may not involve using keywords restricted to a controlled vocabulary list.
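As a minimal sketch of the idea on this slide, the snippet below reads the keywords field of a page and optionally filters it against a controlled vocabulary. The class and function names (`KeywordMetaParser`, `extract_keywords`) and the sample page are illustrative assumptions, not part of any particular search engine.

```python
from html.parser import HTMLParser


class KeywordMetaParser(HTMLParser):
    """Collects the comma-separated content of <meta name="keywords"> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "keywords":
            content = attr.get("content") or ""
            # Normalise each keyword: trim whitespace and lowercase it.
            self.keywords.extend(
                k.strip().lower() for k in content.split(",") if k.strip()
            )


def extract_keywords(html, controlled_vocabulary=None):
    """Return page keywords, optionally restricted to a controlled vocabulary."""
    parser = KeywordMetaParser()
    parser.feed(html)
    if controlled_vocabulary is None:
        return parser.keywords
    return [k for k in parser.keywords if k in controlled_vocabulary]


# Hypothetical page used only for illustration.
page = (
    '<html><head><meta name="keywords" '
    'content="Indexing, Search engines, taxonomy"></head></html>'
)
print(extract_keywords(page))
print(extract_keywords(page, controlled_vocabulary={"indexing", "taxonomy"}))
```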
Purpose• To optimize speed and performance in finding relevant documents for a search query.• Without an index, the search engine would have to scan every document in the corpus, which would require considerable time and computing power.• The additional computer storage required to store the index, and the extra time required for an update to take place, are traded off for the time saved during information retrieval.
Index Design Factors• Merge factors – The indexer must first check whether it is updating old content or adding new content. – Similar in concept to the SQL MERGE command and other merge algorithms.• Storage techniques – Whether information should be compressed or filtered.
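The "update or add" check can be sketched as follows. This is only an illustration under simple assumptions: the index is an in-memory dictionary from word to document ids, and a second dictionary (`doc_words`, a hypothetical helper structure) remembers which words each document contributed so that stale entries can be removed on update.

```python
def upsert_document(index, doc_words, doc_id, words):
    """Add a new document or replace an already indexed one.

    index:     word -> set of document ids containing that word
    doc_words: document id -> set of words currently indexed for it
    Returns "updated" if the document was already indexed, else "added".
    """
    is_update = doc_id in doc_words
    if is_update:
        # Updating old content: drop the document's stale postings first.
        for word in doc_words[doc_id]:
            index[word].discard(doc_id)
            if not index[word]:
                del index[word]
    new_words = set(words)
    for word in new_words:
        index.setdefault(word, set()).add(doc_id)
    doc_words[doc_id] = new_words
    return "updated" if is_update else "added"


index, doc_words = {}, {}
print(upsert_document(index, doc_words, 1, ["web", "index"]))
print(upsert_document(index, doc_words, 1, ["index", "search"]))
print(index)
```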
Index Design Factors(Contd..)• Index size – Computer storage required to support the index.• Lookup speed – How quickly a word can be found in the inverted index. – Speed of finding an entry in a data structure, compared with how quickly it can be updated or removed.
Index Design Factors(Contd..)• Maintenance – How the index is maintained over time.• Fault tolerance – The service must be reliable. – dealing with index corruption, – determining whether bad data can be treated in isolation, – dealing with bad hardware, – partitioning, – schemes such as hash-based or composite partitioning, – replication.
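Hash-based partitioning with replication can be sketched in a few lines. The shard count, the use of MD5 as a stable hash, and the "next shard holds the replica" placement rule are all assumptions made for illustration; production systems use more elaborate schemes.

```python
import hashlib


def shard_for(term, num_shards):
    """Map a term to a shard with a hash that is stable across processes."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def replicas_for(term, num_shards, replication=2):
    """Return the primary shard plus the following shards as replicas."""
    first = shard_for(term, num_shards)
    copies = min(replication, num_shards)  # cannot have more copies than shards
    return [(first + i) % num_shards for i in range(copies)]


for term in ("index", "search", "engine"):
    print(term, replicas_for(term, num_shards=4))
```

If one machine fails, every term it held is still available on the next shard in the list, which is the point of combining partitioning with replication.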
Index Data Structures• Suffix tree – Structured like a tree, supports linear time lookup. – Built by storing the suffixes of words. – Supports extendable hashing, which is important for search engine indexing. – Used for searching for patterns in DNA sequences and clustering.
Index Data Structures(Contd..)• Trie – Ordered tree data structure that is used to store an associative array where the keys are strings. – Regarded as faster than a hash table but less space-efficient.• Inverted index – Stores a list of occurrences of each atomic search criterion.• Citation index – Stores citations or hyperlinks between documents to support citation analysis.
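An ordered tree that stores an associative array with string keys can be sketched as below. This is a deliberately simple illustration (one node per character, children held in a dictionary); the class and method names are assumptions, and real implementations compress paths to save space.

```python
class Trie:
    """Associative array keyed by strings, one tree level per character."""

    def __init__(self):
        self.children = {}
        self.value = None
        self.has_value = False

    def insert(self, key, value):
        node = self
        for ch in key:
            node = node.children.setdefault(ch, Trie())
        node.value = value
        node.has_value = True

    def get(self, key, default=None):
        node = self
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return default
        return node.value if node.has_value else default

    def keys_with_prefix(self, prefix):
        """Return all stored keys starting with prefix, sorted alphabetically."""
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        found, stack = [], [(node, prefix)]
        while stack:
            current, key = stack.pop()
            if current.has_value:
                found.append(key)
            for ch, child in current.children.items():
                stack.append((child, key + ch))
        return sorted(found)


terms = Trie()
terms.insert("index", [1, 2])     # value: ids of documents containing the term
terms.insert("indexer", [2])
terms.insert("inverted", [3])
print(terms.get("index"))
print(terms.keys_with_prefix("ind"))
```

Lookup cost depends on the length of the key rather than on the number of stored keys, and prefix queries come for free, which a plain hash table does not offer.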
Index Data Structures(Contd..)• N-gram index – Stores sequences of n items of data to support other types of retrieval or text mining.• Term document matrix – Used in latent semantic analysis, stores the occurrences of words in documents in a two-dimensional sparse matrix.
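Both structures on this slide can be illustrated briefly. The sketch assumes character trigrams for the n-gram index and a dictionary keyed by (term, document id) for the sparse matrix; the function names and sample data are hypothetical.

```python
from collections import Counter, defaultdict


def char_ngrams(word, n=3):
    """Return the overlapping character sequences of length n in word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def build_ngram_index(words, n=3):
    """Map each n-gram to the set of words that contain it."""
    index = defaultdict(set)
    for word in words:
        for gram in char_ngrams(word, n):
            index[gram].add(word)
    return index


def build_term_document_matrix(docs):
    """Sparse matrix as {(term, doc_id): count}; zero cells are not stored."""
    matrix = {}
    for doc_id, text in docs.items():
        for term, count in Counter(text.lower().split()).items():
            matrix[(term, doc_id)] = count
    return matrix


ngram_index = build_ngram_index(["index", "indent", "dexter"])
print(sorted(ngram_index["ind"]))
print(sorted(ngram_index["dex"]))

matrix = build_term_document_matrix({1: "web index web", 2: "index"})
print(matrix)
```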
Indexes vs. Taxonomies• Hierarchical taxonomy vs. alphabetical index• Two-step process of taxonomy development and content linking vs. integrated indexing/index creation• Each is more suitable for different kinds of content.• Sometimes have both, as different means to access the same content.
Challenges in Parallelism• Management of parallel computing processes.• Many opportunities for race conditions and coherent faults.• Collision between two competing tasks.• Search engine architecture may involve distributed computing, where the search engine consists of several machines operating in unison.• It is more difficult to maintain a fully synchronized, distributed, parallel architecture.
Index Merging• Inverted index is filled via a merge or rebuild.• Rebuild is similar to a merge but first deletes the contents of the inverted index.• A merge conflates newly indexed documents, typically residing in virtual memory, with the index cache residing on one or more computer hard drives.• Inverted index is a word-sorted forward index.
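The difference between a merge and a rebuild can be sketched with plain dictionaries. This assumes the stored index and the newly indexed batch are both maps from a word to a list of document ids; keeping everything in memory is a simplification of the memory-versus-disk arrangement described on the slide.

```python
def merge(index, new_postings):
    """Fold newly indexed postings into the existing index, in place."""
    for word, doc_ids in new_postings.items():
        existing = index.setdefault(word, [])
        existing[:] = sorted(set(existing) | set(doc_ids))
    return index


def rebuild(index, new_postings):
    """Like merge, but first delete the contents of the index."""
    index.clear()
    return merge(index, new_postings)


stored = {"index": [1, 2], "web": [1]}
batch = {"index": [3], "search": [3]}
print(merge(dict(stored), batch))
print(rebuild(dict(stored), batch))
```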
The Forward Index• Stores a list of words for each document.• As documents are parsed, it is better to immediately store the words per document.• Sorted to transform it into an inverted index.• Converting a forward index to an inverted index is only a matter of sorting the pairs by the words.• Essentially a list of pairs consisting of a document and a word, collated by the document.
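The sort-based conversion can be shown directly. The sketch below assumes whitespace tokenization and integer document ids, and the sample documents are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter


def forward_pairs(documents):
    """Forward index as (doc_id, word) pairs, collated by document."""
    pairs = []
    for doc_id in sorted(documents):
        for word in documents[doc_id].lower().split():
            pairs.append((doc_id, word))
    return pairs


def invert(pairs):
    """Sort the pairs by word (then document) and group them into postings."""
    by_word = sorted(set(pairs), key=itemgetter(1, 0))
    return {
        word: [doc_id for doc_id, _ in group]
        for word, group in groupby(by_word, key=itemgetter(1))
    }


documents = {1: "the cow says moo", 2: "the cat and the hat"}
pairs = forward_pairs(documents)
inverted = invert(pairs)
print(pairs[:4])
print(inverted["the"], inverted["cow"])
```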
Compression• Generating or maintaining a large-scale search engine index represents a significant storage and processing challenge.• Compression to reduce the size of the indices on disk.• Tradeoff is the time and processing power required to perform compression and decompression.
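One common family of techniques, shown here only as an illustrative sketch and not as the method any specific engine uses, stores the gaps between sorted document ids and writes each gap with a variable number of bytes. The function names are assumptions; the input must be a strictly ascending list of non-negative integers.

```python
def encode_postings(doc_ids):
    """Delta-encode ascending ids, then write each gap as a variable-length integer."""
    out = bytearray()
    previous = 0
    for doc_id in doc_ids:
        gap = doc_id - previous
        previous = doc_id
        # Low 7 bits per byte; the high bit marks "more bytes follow".
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decode_postings(data):
    """Inverse of encode_postings."""
    doc_ids, current, gap, shift = [], 0, 0, 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            current += gap
            doc_ids.append(current)
            gap, shift = 0, 0
    return doc_ids


ids = [3, 7, 300, 1_000_000]
packed = encode_postings(ids)
print(len(packed), "bytes instead of", 4 * len(ids))
print(decode_postings(packed) == ids)
```

The tradeoff named on the slide is visible here: the list is smaller on disk, but every lookup pays for the decoding loop.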
A Scenario• A full text, Internet search engine: – 6,000,000,000 different web pages exist as of the year 2008. – 250 words on each webpage (based on the assumption they are similar to the pages of a novel). – 8 bits (or 1 byte) to store a single character. – average number of characters in any given word on a page may be estimated at 5 – average personal computer comes with 100 to 250 gigabytes of usable space.
Under the Same Scenario• An uncompressed index (assuming a non-conflated, simple index) for 6 billion web pages would need to store 1,500 billion word entries.• At 1 byte per character, or 5 bytes per word, this would require 7,500 gigabytes of storage space alone.• The index can be reduced to a fraction of this size using a suitable compression algorithm.
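The arithmetic of the scenario can be checked with a short calculation. It uses only the figures stated in the scenario and assumes decimal gigabytes (10^9 bytes); the variable names are illustrative.

```python
PAGES = 6_000_000_000       # web pages in the scenario
WORDS_PER_PAGE = 250        # assumed average, as for a page of a novel
CHARS_PER_WORD = 5          # assumed average word length
BYTES_PER_CHAR = 1          # 8 bits per character

word_entries = PAGES * WORDS_PER_PAGE
index_bytes = word_entries * CHARS_PER_WORD * BYTES_PER_CHAR
index_gigabytes = index_bytes / 10**9   # decimal gigabytes

print(f"word entries: {word_entries:,}")
print(f"uncompressed size: {index_gigabytes:,.0f} GB")
print(f"250 GB disks needed: {index_gigabytes / 250:.0f}")
```

With these inputs the product is 1,500 billion word entries and 7,500 GB, which is about thirty of the largest personal-computer disks mentioned in the scenario.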
Document Parsing• Breaks apart the components (words) of a document or other form of media for insertion into the forward and inverted indices.• The words found are called tokens, so in the context of search engine indexing and natural language processing, parsing is more commonly referred to as tokenization.
Document Parsing(Contd..)• Also sometimes called word boundary disambiguation, tagging, text segmentation, content analysis, text analysis, text mining, concordance generation, speech segmentation, lexing, or lexical analysis.• The terms indexing, parsing, and tokenization are used interchangeably in corporate slang.
Challenges in Natural Language Processing• Word Boundary Ambiguity• Language Ambiguity• Diverse File Formats• Faulty Storage
Tokenization• Computers do not understand the structure of a natural language document and cannot automatically recognize words and sentences.• The computer must be programmed to identify what constitutes an individual or distinct word, referred to as a token.• Such a program is commonly called a tokenizer, parser, or lexer.
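A very small tokenizer gives a feel for what "programming the computer to identify a word" means. The rule used here (runs of ASCII letters and digits, optionally followed by an apostrophe and more letters) is an assumption chosen for brevity; it handles simple English text only and ignores the word boundary problems of other languages.

```python
import re

# Hypothetical token rule: letters/digits, with an optional apostrophe suffix.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?")


def tokenize(text):
    """Return (token, character offset) pairs, with tokens lowercased."""
    return [(m.group().lower(), m.start()) for m in TOKEN_PATTERN.finditer(text)]


print(tokenize("Don't index the Web-page!"))
```

Keeping the character offset alongside each token lets the indexer record where in the document the word occurred, which later supports phrase search and highlighting.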
Format Analysis• HTML• ASCII text files (a text document without specific computer-readable formatting)• Adobe's Portable Document Format (PDF)• PostScript (PS)• LaTeX• Usenet netnews server formats• XML and derivatives like RSS• SGML• Multimedia metadata formats like ID3• Microsoft Word• Microsoft Excel• Microsoft PowerPoint• IBM Lotus Notes
Format Analysis(Compressed)• ZIP - Zip archive file• RAR - Roshal ARchive file• CAB - Microsoft Windows Cabinet file• Gzip - File compressed with gzip• BZIP2 - File compressed with bzip2• Tape ARchive (TAR), Unix archive file, not (itself) compressed• TAR.Z, TAR.GZ or TAR.BZ2 - Unix archive files compressed with compress, gzip or bzip2
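Before the content of such a file can be tokenized, the indexer has to recognize the container. One simple approach, sketched here as an assumption rather than a description of any particular engine, is to inspect the leading "magic" bytes. The function names are hypothetical, and only gzip and bzip2 layers are actually unwrapped because the Python standard library can decode them in memory.

```python
import bz2
import gzip

# Leading signature bytes of the compressed formats listed above.
MAGIC = [
    (b"PK\x03\x04", "zip"),       # non-empty ZIP archive
    (b"Rar!\x1a\x07", "rar"),
    (b"MSCF", "cab"),
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bzip2"),
    (b"\x1f\x9d", "compress"),    # Unix compress (.Z)
]


def sniff_container(data):
    """Guess the container format from the first bytes of a file."""
    for magic, name in MAGIC:
        if data.startswith(magic):
            return name
    # POSIX tar stores the string "ustar" at offset 257 of the first header.
    if data[257:262] == b"ustar":
        return "tar"
    return "unknown"


def unwrap(data):
    """Peel off gzip/bzip2 layers and report what is found inside."""
    kind = sniff_container(data)
    while kind in ("gzip", "bzip2"):
        data = gzip.decompress(data) if kind == "gzip" else bz2.decompress(data)
        kind = sniff_container(data)
    return kind, data


print(sniff_container(gzip.compress(b"hello")))
print(unwrap(bz2.compress(gzip.compress(b"plain text"))))
```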