
Indexing Techniques: Their Usage in Search Engines for Information Retrieval


About Indexing Techniques in Information Retrieval

Published in: Education

  1. 1. Topics and Speakers
     1. Introduction and Overview (Sayon Roy)
     2. Indexing Techniques – Transition from Manual to Automated System (Kaustav Saha)
     3. Usage in Modern Day Search Engines (Vikas Bhushan)
     4. Current Trends and Applications (Debashis Naskar)
     5. Conclusion (Sumanta Bag)
  2. 2. Indexing…. an Overview • Indexing is a crucial part of any information retrieval system. It is a challenging task that requires attention to many theoretical and practical issues. While the move towards digital information systems and automated indexing is thought to have reduced the need for indexers in some areas, professional indexers are still much needed; in fact, the electronic environment has posed new challenges for them. • Indexing is more a process of extraction than of content analysis. • The terms in an index represent certain concepts.
  3. 3. Subject Indexing and Subject Retrieval • Subject indexing can be described as a system of classifying without notation. It is the core theme of information science. • Today subject retrieval is facilitated through the use of structured databases. • The items that are retrieved are listed in the index. • In OPACs indexing is done manually to determine what a resource is about. Once identified, this aboutness is translated into the language of the indexing vocabulary.
  4. 4. Schematic Illustration…
     Conception of Subject Analysis and Indexing | Type of Subject Information | Indexing Method
     Simplistic Conception | Explicit Information | Extraction
     Content-Oriented Conception | Implicit Information | Assignment
     Requirement-Oriented Conception
  5. 5. Early use of computers for Information Retrieval • In 1948 a “machine called the Univac”, capable of searching for text references associated with a subject code, was described. • The machine could process text “at the rate of 120 words per minute”. This appears to be the first reference to a computer being used to search for content. • The impact of computers on IR was highlighted when Hollywood drew public attention to the innovation with the comedy “Desk Set”, released in 1957, which centred on a group of reference librarians who were about to be replaced by a computer. • IR was starting to emerge as a research discipline at this time, with two important strands of development: how to index documents and how to retrieve them.
  6. 6. Indexing and Information Retrieval… A Chronology • Mortimer Taube’s Uniterm system was essentially a proposal to index items by a list of keywords. As simple an idea as this seems today, it was at the time a radical step.
  7. 7. Ranked retrieval •The ranked retrieval approach to search was taken up by IR researchers, who over the following decades refined and revised the means by which documents were sorted in relation to a query. •The superior effectiveness of this approach over Boolean search was demonstrated in many experiments over those years. • Work in the 1950s established computers as the definitive tool for search.
  8. 8. 1960s … • The 1960s witnessed the formalization of algorithms to rank documents relative to a query. • Relevance feedback was introduced: a process to support iterative search, in which documents previously retrieved could be marked as relevant in an IR system. • Versions of this process are used in modern search engines, such as the “Related articles” link on Google Scholar.
  9. 9. 1970s… • One of the key developments of this period was that Luhn’s term frequency (tf) weights (based on the occurrence of words within a document) were complemented by Spärck Jones’s work on word occurrence across the documents of a collection, which introduced the idea of inverse document frequency (idf). • An alternative means of modelling IR systems involved extending Maron, Kuhns and Ray’s idea of using probability theory.
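To make the tf and idf weights on the 1970s slide concrete, here is a minimal sketch. It assumes raw term counts for tf and idf = log(N / df); the slides do not fix a particular variant, and the three toy documents are hypothetical.

```python
import math
from collections import Counter

# Toy collection (hypothetical documents, for illustration only).
docs = {
    "d1": "index terms represent concepts in an index",
    "d2": "search engines build an inverted index",
    "d3": "librarians assign subject headings",
}

def tf_idf(docs):
    """Return {doc_id: {term: tf * idf}} with raw tf and idf = log(N / df)."""
    n_docs = len(docs)
    tokens = {doc_id: text.split() for doc_id, text in docs.items()}
    # df: number of documents in which each term occurs at least once.
    df = Counter(term for words in tokens.values() for term in set(words))
    weights = {}
    for doc_id, words in tokens.items():
        tf = Counter(words)  # occurrences of each term within this document
        weights[doc_id] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return weights

weights = tf_idf(docs)
```

A term that is frequent in one document but rare across the collection ("subject" in d3) receives a high weight, while a term occurring in every document would receive a weight of zero.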
  10. 10. 1980s – mid 1990s • Building on the developments of the 1970s, variations of tf-idf weighting schemes were produced and the formal models of retrieval were extended. • The original probabilistic model did not include tf weights, and a number of researchers worked to incorporate them in an effective and principled way. • Amongst other achievements, this work ultimately led to BM25, which has proven to be a highly effective ranking function and is still commonly used. • Advances on the basic vector space model were also developed; probably the best known is Latent Semantic Indexing (LSI).
  11. 11. Mid 1990s – present • The arrival of the web initiated the study of new problems in IR. • Search engine developers quickly realised that they could use the links between web pages to construct a crawler or robot to traverse and gather most web pages on the internet. • The first full-text search engine using a crawler was WebCrawler, released in 1994.
  12. 12. Indexing Techniques – Transition from Manual to Automated System (Kaustav Saha)
  13. 13. What is an index? •A Database where information (after being collected, parsed and processed) is stored to allow for quick retrieval. •Association of descriptors (keywords, concepts, metadata) to documents in view of future retrieval •The knowledge / expectation / behavior of the searcher needs to be anticipated
  14. 14. Example of Indexing using POPSI
      Title: A report on the treatment of infectious disease of lungs in India during 1982-85
      Discipline: Medical Science
      Entity: Lung
      Property: Infectious disease
      Action: Treatment
      Space modifier: India
      Time modifier: 1982-85
      Form modifier: Report
      Subject headings:
        MEDICAL SCIENCE, LUNG
          infectious disease, treatment, India, 1982-85
        INFECTIOUS DISEASE, TREATMENT
          medical science, lung, India, 1982-85
      Cross references:
        Therapeutics see Treatment
        Therapy see Treatment
  15. 15. Manual and Automatic Indexing •Manual •Human indexers assign index terms to documents •A computer system may be used to record the descriptors generated by the human •Automatic •The system extracts “typical”/ “significant” terms •The human may contribute by setting the parameters or thresholds, or by choosing components or algorithms •Semi-automatic •The system supports the indexer with word lists, thesauri, reference systems, etc., with or without automatic processing of the text
  16. 16. Manual vs. Automatic Indexing •Manual •Slow and expensive •Is based on intellectual judgment and semantic interpretation (concepts, themes) •Low consistency •Automatic •Fast and inexpensive •Mechanical execution of algorithms, with no intelligent interpretation (aboutness / relevance) •Consistent
  17. 17. Vocabulary •Vocabulary (indexing language) •The set of concepts (terms or phrases) that can be used to index documents in a collection •Controlled •Specific for specialized domains •Potential for increased consistency of indexing and precision of retrieval •Un-controlled (free) •Potentially all the terms in the documents •Potential for increased recall
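The precision and recall mentioned on the Vocabulary slide can be illustrated with a small calculation. This is a sketch under the usual definitions (precision = relevant retrieved / retrieved, recall = relevant retrieved / relevant); the document identifiers and the two result sets are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a result set against the known relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {"d1", "d2", "d3", "d4"}

# Hypothetical outcomes: a narrow controlled-vocabulary search and a broad free-text search.
controlled_search = ["d1", "d2"]
free_text_search = ["d1", "d2", "d3", "d5", "d6", "d7"]

controlled_result = precision_recall(controlled_search, relevant)
free_text_result = precision_recall(free_text_search, relevant)
```

In this toy case the controlled search is more precise (1.0 against 0.5) while the free-text search has higher recall (0.75 against 0.5), which is the trade-off the slide describes.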
  18. 18. Thesauri •Capture relationships between indexing terms •Hierarchical •Synonymous •Related •Creation of thesauri •Manual vs. automatic •Use of thesauri •In manual / semi-automatic / automatic fashion •Syntagmatic co-ordination / thesaurus-based query expansion during indexing / searching
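Thesaurus-based query expansion, as listed on the Thesauri slide, might be sketched as follows. The miniature thesaurus, its relation names and the `expand_query` helper are assumptions made for illustration; a real thesaurus would be far larger and is not described in the slides.

```python
# Hypothetical miniature thesaurus: each term maps relationship types to other terms.
THESAURUS = {
    "treatment": {
        "synonyms": ["therapy", "therapeutics"],
        "narrower": ["chemotherapy"],
        "related": ["diagnosis"],
    },
    "lung": {
        "synonyms": [],
        "broader": ["respiratory system"],
        "related": ["bronchi"],
    },
}

def expand_query(terms, relations=("synonyms",)):
    """Add thesaurus terms for the chosen relations, keeping first-seen order."""
    expanded = []
    for term in terms:
        candidates = [term]
        entry = THESAURUS.get(term, {})
        for relation in relations:
            candidates.extend(entry.get(relation, []))
        for candidate in candidates:
            if candidate not in expanded:  # avoid duplicates
                expanded.append(candidate)
    return expanded

synonym_query = expand_query(["lung", "treatment"])
wider_query = expand_query(["lung", "treatment"], relations=("synonyms", "narrower"))
```

Expanding with synonyms only tends to raise recall at little cost; adding narrower or related terms widens the search further and usually lowers precision.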
  19. 19. Text Representation: Steps of automatic indexing • Collection/document structure • Lexical analysis • Stemming • Stop word removal • Representation • Data structure
  20. 20. Role of Indexing in Information Retrieval [figure: Population of Documents; Selected Documents; Indexing; Document Description; Document Store; Database in printed or electronic form; System Vocabulary; Search Strategy; Information Needs; Population of database users]
  21. 21. Usage in Modern Day Search Engines (Vikas Bhushan) • Search Engines • Use of search engines • Types of Search Engines • Software Components in Search Engines • Pictorial representation of Components • How Search Engines Work, with a Model • Post-coordinate Indexes
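The text-processing steps named on the slide (lexical analysis, stop word removal, stemming) can be sketched as a tiny pipeline. The stop word list and the suffix-stripping rule below are deliberately crude assumptions, far simpler than a real stemmer such as Porter's, and stop words are removed before stemming, which is a common ordering rather than one the slide specifies.

```python
import re

# Hypothetical, very small stop word list.
STOP_WORDS = {"a", "an", "and", "of", "in", "on", "the", "to", "during"}
# Hypothetical suffix list; longer suffixes are tried first.
SUFFIXES = ("ing", "ment", "ions", "ion", "es", "s")

def tokenize(text):
    """Lexical analysis: lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def stem(token):
    """Strip the first matching suffix, keeping a stem of at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def index_terms(text):
    """Tokenize, drop stop words, then stem what remains."""
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]

terms = index_terms("A report on the treatment of lung infections in India")
```

The resulting list of stems is the text representation that would then be stored in a data structure such as an inverted file.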
  22. 22. Search engines: An initiative towards correct retrieval from a Labyrinth of Ideas • Search engines do not search only for keywords; some search for other things as well. • They are not really “engines” in the classical sense, but then a computer mouse is not a “mouse” either. Rather, they are computer programs that search for particular keywords and return a list of the documents in which they were found, especially services that scan documents on the Internet.
  23. 23. Types of Search Engines • Crawler Based – Google, AltaVista • Human Based – Yahoo directory, Open directory, LookSmart • Hybrid Models – Yahoo, Google • Meta Search Engines – Dogpile, MetaCrawler
  24. 24. Use of search engines … As WebCrawler founder Brian Pinkerton, among others, puts it: "Imagine walking up to a librarian and saying, 'travel'. They’re going to look at you with a blank face."
  25. 25. Components in the Back-end & Front-end process • Software components, back-end: Crawler/Spider; Indexer; Index File Database • Software components, front-end: Search Engine Interface; Query Parser; Ranking Mechanism (Google uses PageRank, Teoma uses ExpertRank, Yahoo uses TrustRank)
  26. 26. Pictorial representation of Front-end & Back-end Process [figure: search engine and database]
  27. 27. How Search Engines Work (Sherman 2003) [figure: a crawler fetches URL1 to URL4 from the Web; the indexer feeds the search engine database; the browser sends the query “Eggs?” and the search engine answers “Eggs.” with ranked matches: Eggs 90%, Eggo 81%, Ego 40%, Huh? 10%; example result: “All About Eggs” by S. I. Am]
  28. 28. Post-coordinate Indexes An information retrieval system that allows the searcher to combine terms in any way is frequently referred to as post-coordinate. Modern computer-based systems, operated online, can be considered direct descendants of the earlier manual systems. The files of an online system comprise two major elements: 1. A complete set of document representations: bibliographic references, similar to a search engine database. 2. A list of terms, sometimes referred to as an inverted file or a postings file. Continued…
  29. 29. Post-coordinate Indexes… The subject matter discussed in a document, and represented by the index terms assigned to it, is multidimensional in character. Consider, for example, an article discussing “Political Contenders in Assembly Polls of Karnataka”. It might be indexed under the following terms: • Political Contenders • Constituencies • Assembly Polls • Karnataka
  30. 30. The index terms mentioned previously actually represent a network of relationships: Political Contenders, Constituencies, Assembly Polls, Karnataka. Continued…
  31. 31. Information Retrieval System Represented as a Matrix [figure: matrix with rows A to H and columns 1 to 14; an X marks each cell where a term is assigned to a document]
  32. 32. Current Trends and Applications (Debashis Naskar)
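The two files of a post-coordinate system described above (document representations plus an inverted or postings file) can be sketched as follows. The first record reuses the Karnataka example from the slides; the other records, including the term "Kerala", are hypothetical additions so that different term combinations give different results.

```python
from collections import defaultdict

# File 1: document representations (record 1 mirrors the slide's example;
# records 2 to 4 are hypothetical).
documents = {
    1: ["Political Contenders", "Constituencies", "Assembly Polls", "Karnataka"],
    2: ["Assembly Polls", "Kerala"],
    3: ["Political Contenders", "Karnataka"],
    4: ["Constituencies", "Karnataka"],
}

def build_inverted_file(documents):
    """File 2: map each index term to the set of documents it is assigned to."""
    postings = defaultdict(set)
    for doc_id, terms in documents.items():
        for term in terms:
            postings[term].add(doc_id)
    return dict(postings)

def search_all(postings, *terms):
    """Boolean AND: documents indexed under every one of the given terms."""
    sets = [postings.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()

def search_any(postings, *terms):
    """Boolean OR: documents indexed under at least one of the given terms."""
    return set().union(*(postings.get(term, set()) for term in terms))

postings = build_inverted_file(documents)
```

Because terms are combined only at search time, the searcher is free to pair any terms, for example `search_all(postings, "Assembly Polls", "Karnataka")`, without the indexer having anticipated that combination. Each postings set corresponds to one row of X marks in the matrix view.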
  33. 33. Current trends and applications • The web creates new challenges for information retrieval. The amount of information on the web is growing rapidly, as is the number of new users inexperienced in the art of web research. • Automated search engines that rely on keyword matching usually return too many low-quality matches. • A large-scale search engine makes heavy use of the additional structure present in hypertext to provide much higher quality search results.
  34. 34. What is XML Indexing? • XML indexing is a form of embedded indexing in which tags are inserted into an XML document to mark the occurrences of indexable terms or topics. • The client’s publishing process automatically generates an index from these index elements. Because this automated process handles all layout and formatting, the indexer does not need to be concerned with these issues.
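One well-known way of exploiting the link structure of hypertext is PageRank, which an earlier slide names as Google's ranking mechanism. The slides do not give the algorithm, so the following is only a simplified power-iteration sketch; the damping factor of 0.85, the iteration count and the four-page link graph are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to; every link
    target must itself be a key of links.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page without outlinks spreads its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: page C is linked to by A, B and D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
```

In this toy graph the page with the most incoming links (C) ends up with the highest score and the page nobody links to (D) with the lowest, independently of any keyword matching.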
  35. 35. What makes it work? Index entries in DocBook are encoded using the parent element <indexterm>, which has five child elements. They are summarized below: • <indexterm> element: wrapper element for an index entry of any type. • <primary> element: main entry. • <secondary> element: subentry. • <tertiary> element: sub-subentry. • <see> element: ‘see’ references. • <seealso> element: ‘see also’ references.
  36. 36. Future hopes for Indexers • Indexers should offer XML-based services, which are a prerequisite for joining the digital publishing revolution. • Indexers are good with structures, and the use of XML indexing in publishing is about imposing structure on text.
  37. 37. Results and Performance • The most important measure of a search engine is the quality of its search results. • Here we highlight the performance of, and experience with, Google. It produces better results than the major commercial search engines for most searches.
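How a publishing process might collect embedded index entries can be sketched with Python's standard `xml.etree.ElementTree` module. The sample fragment is hypothetical and omits the XML namespace that DocBook 5 documents carry, and the `collect_index_entries` helper is an illustration, not part of any DocBook toolchain.

```python
import xml.etree.ElementTree as ET

# Hypothetical DocBook-style fragment with embedded index terms (no namespace).
SAMPLE = """
<para>Indexing
  <indexterm><primary>indexing</primary><secondary>automatic</secondary></indexterm>
  can be automated, while thesauri
  <indexterm><primary>thesauri</primary><seealso>controlled vocabulary</seealso></indexterm>
  support it.
  <indexterm><primary>therapy</primary><see>treatment</see></indexterm>
</para>
"""

def collect_index_entries(xml_text):
    """Return one dict per <indexterm>, keyed by child tag, sorted by primary entry."""
    root = ET.fromstring(xml_text)
    entries = []
    for term in root.iter("indexterm"):
        # Each child (<primary>, <secondary>, <see>, ...) becomes a key of the entry.
        entries.append({child.tag: (child.text or "").strip() for child in term})
    return sorted(entries, key=lambda entry: entry.get("primary", "").lower())

entries = collect_index_entries(SAMPLE)
```

The sorted entries are what a formatter would then lay out as the printed index, which is why the indexer only needs to place the tags and not worry about layout.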
  38. 38. Data (values in %)
      Month    Google  bing  Yahoo!  Baidu  Babylon  Others
      2012-04  91.7    3.5   3.36    0.26   0        1.18
      2012-05  92.04   3.36  3.26    0.22   0        1.12
      2012-06  91.75   3.27  3.04    0.23   0.29     1.42
      2012-07  91.17   3.22  2.95    0.45   0.54     1.67
      2012-08  91.01   3.22  2.98    0.5    0.6      1.7
      2012-09  91.04   3.16  2.91    0.49   0.6      1.8
      2012-10  90.75   3.35  2.91    0.54   0.58     1.87
      2012-11  90.75   3.32  2.84    0.58   0.6      1.92
      2012-12  90.43   3.26  2.89    0.66   0.54     2.21
      2013-01  90.47   3.19  2.88    0.63   0.48     2.35
      2013-02  89.64   3.62  3.17    0.73   0.39     2.45
      2013-03  89.89   3.59  3.2     0.93   0.29     2.11
      2013-04  90.17   3.61  3.08    0.92   0.27     1.95
  39. 39. Models for Information Retrieval • Boolean and vector space models of IR (Information Retrieval): in these, matching is done in a formally defined but semantically imprecise calculus of index terms. • A number of retrieval models work on a probabilistic basis. The Binary Independence Model (BIM) is the original and still the most influential of the probabilistic retrieval models. Contd…
  40. 40. OKAPI BM25: model for Information Retrieval • The BIM was originally designed for short catalogue records and abstracts of fairly consistent length. • For modern full-text search collections, a model should pay attention to term frequency and document length. The BM25 weighting scheme, often called Okapi weighting after the system in which it was first implemented, was developed as a way of building a probabilistic model sensitive to these quantities. Contd…
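  41. 41. The score of any document under OKAPI is determined through the following equations.
      Equation 1. The simplest score for document d is just idf weighting of the query terms present:
      $$RSV_d = \sum_{t \in q} \log \frac{N}{df_t}$$
      where N is the number of documents in the collection and df_t is the number of documents that contain term t.
      Equation 2. Sometimes an alternative version of idf is used. If, in the absence of relevance feedback information, we estimate that S = s = 0, we get the following alternative idf formulation:
      $$RSV_d = \sum_{t \in q} \log \frac{N - df_t + \frac{1}{2}}{df_t + \frac{1}{2}}$$
      Contd…
  42. 42. Equation 3. We can improve on Equation 1 by factoring in the frequency of each term and the document length:
      $$RSV_d = \sum_{t \in q} \log\left[\frac{N}{df_t}\right] \cdot \frac{(k_1 + 1)\, tf_{td}}{k_1\left((1 - b) + b \cdot (L_d / L_{ave})\right) + tf_{td}}$$
      where tf_{td} is the frequency of term t in document d, L_d is the length of document d, L_{ave} is the average document length in the collection, and k_1 and b are tuning parameters.
      Equation 4. If the query is long, we might also use similar weighting for the query terms. This is appropriate if the queries are paragraph-long information needs, but unnecessary for short queries:
      $$RSV_d = \sum_{t \in q} \log\left[\frac{N}{df_t}\right] \cdot \frac{(k_1 + 1)\, tf_{td}}{k_1\left((1 - b) + b \cdot (L_d / L_{ave})\right) + tf_{td}} \cdot \frac{(k_3 + 1)\, tf_{tq}}{k_3 + tf_{tq}}$$
      where tf_{tq} is the frequency of term t in the query and k_3 is a further tuning parameter.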
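The scoring described in Equation 3 (idf combined with term frequency and document-length normalisation) can be sketched as below. The parameter values k1 = 1.2 and b = 0.75 are commonly quoted defaults and are assumptions here, as are the whitespace tokenisation, the idf form log(N / df) and the three toy documents.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document for the query with a basic BM25 weighting."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(words) for words in tokenized.values()) / n_docs
    # df: number of documents that contain each term.
    df = Counter(term for words in tokenized.values() for term in set(words))
    scores = {}
    for doc_id, words in tokenized.items():
        tf = Counter(words)
        # Length normalisation: long documents need more occurrences to score well.
        length_norm = k1 * ((1 - b) + b * len(words) / avg_len)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue  # query terms absent from the document contribute nothing
            idf = math.log(n_docs / df[term])
            score += idf * (k1 + 1) * tf[term] / (length_norm + tf[term])
        scores[doc_id] = score
    return scores

# Hypothetical documents: d1 is short and on topic, d2 is long and repetitive.
docs = {
    "d1": "inverted index for search engines",
    "d2": "index index index terms and more index terms in a long document about many things",
    "d3": "manual subject cataloguing",
}
scores = bm25_scores("inverted index", docs)
```

Although d2 repeats "index" four times, the short document d1 scores higher: the repeated term saturates, the long document is penalised, and d1 also matches the rarer term "inverted".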
  43. 43. Conclusion (Sumanta Bag)
  44. 44. Conclusion: Enhancement of Indexing Procedures • When implementing indexing services, individual indexers may prefer different approaches. • The effectiveness of an index as a search tool depends on the number of access points provided. • Different factors influence the recall and precision of any retrieved information. • Indexing and its usage can be made more sophisticated through the application of certain concepts, such as: • Weighted Indexing • Linking of terms • Role Indicators • Subheadings • Index Language Devices
  45. 45. Weighted Indexing • Many automatic systems include some form of weighting to allow ranking. • Weighted indexing gives the searcher the freedom to vary the exhaustivity of a search. • It simplifies the process of indexing. • Weighted indexing assigns a numerical value to individual terms. • A weighted index offers two levels of retrieval from the database: major and minor descriptors.
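Weighted indexing with major and minor descriptors might look like the following sketch. The 1 to 5 weight scale, the threshold for a "major" descriptor and the sample records are all assumptions made for illustration.

```python
# Hypothetical scale: weights run from 1 to 5; a term weighted at or above
# MAJOR_THRESHOLD is treated as a major descriptor, otherwise as a minor one.
MAJOR_THRESHOLD = 4

weighted_index = {
    "d1": {"lung": 5, "treatment": 4, "india": 2},
    "d2": {"lung": 2, "surgery": 5},
    "d3": {"treatment": 3, "lung": 4},
}

def search(term, min_weight=1):
    """Return (document, weight) pairs for the term, highest weight first."""
    hits = [
        (doc_id, weights[term])
        for doc_id, weights in weighted_index.items()
        if weights.get(term, 0) >= min_weight
    ]
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))

exhaustive = search("lung")                   # major and minor descriptors
major_only = search("lung", MAJOR_THRESHOLD)  # major descriptors only
```

Raising `min_weight` lets the searcher trade recall for precision at query time, and the stored weights also provide a simple ranking of the results.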
  46. 46. Linking of terms • Linking supports efficient and timely retrieval of appropriate and correct information. • Inappropriate or irrelevant responses can be avoided by reducing the exhaustivity of the index. • Unwanted or false associations are removed. • False associations are avoided by linking the index terms that belong together.
  47. 47. Role Indicators • Role indicators play an important part in the retrieval of accurate information. • Syntax is used to reduce ambiguity. • Role indicators were introduced into retrieval systems in the early 1960s. • The first of its kind was the Engineers Joint Council (EJC) set of role indicators. • The document surrogate was a ‘telegraphic abstract’ encoded by means of a ‘semantic code dictionary’.
  48. 48. Subheadings • With the advent of automated systems, the need for retrieval of precise information gained importance. • The problem of false or ambiguous associations is now less severe. • Subheadings can be applied in many post-coordinate index systems. • They have been successful in reducing ambiguity when searching electronic databases.
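A false association arises when two query terms are both assigned to a document but describe different things in it. The sketch below shows how linking terms into groups prevents this; the welding example and both search helpers are hypothetical and not taken from the slides.

```python
# Each document is a list of link groups; terms within one group belong together.
# Hypothetical records: d1 discusses welding of copper AND heat treatment of steel,
# d2 discusses welding of steel.
documents = {
    "d1": [{"welding", "copper"}, {"heat treatment", "steel"}],
    "d2": [{"welding", "steel"}],
}

def search_unlinked(*terms):
    """Ignore links: match if the document carries all terms anywhere."""
    wanted = set(terms)
    return sorted(
        doc_id for doc_id, groups in documents.items()
        if wanted <= set().union(*groups)
    )

def search_linked(*terms):
    """Respect links: all terms must occur within a single link group."""
    wanted = set(terms)
    return sorted(
        doc_id for doc_id, groups in documents.items()
        if any(wanted <= group for group in groups)
    )

without_links = search_unlinked("welding", "steel")
with_links = search_linked("welding", "steel")
```

Without links, d1 is retrieved for "welding" and "steel" even though it never discusses welding steel; with links, only d2 is returned, so precision improves.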
  49. 49. Index Language Devices • Precision devices: weighting, links, role indicators • Recall devices: subheadings, synonym control, inverse relation
  50. 50. Before we Conclude… • The entire discussion was based on the application of indexing techniques and principles to the design of search engines. • The aim is to develop software tools that allow the user to perform relatively specific subject searches across resources of any type. • Search engines operate by building ‘indexes’ to network resources. • Boolean logic is used for searching. • Search engines use inverted indexes.
  51. 51. Conclusion Today the internet has become versatile and is treated as a significant source of information. The transition from traditional to electronic information resources has paved the way for the creation of various software tools. These provide enhanced navigation among resources available in electronic form and within networked environments. However, various studies indicate that there is much ground to be covered before machines become intelligent enough to replace humans completely. As of now the role of the human indexer remains indispensable. In the days to come, improved indexing techniques and principles will surely be developed, ensuring efficient and timely retrieval of information from a digitized environment.
  52. 52. Thank you