TDM: Information Retrieval

  1. 1. Text Data Mining PART I - IR
  2. 2. Text Mining Applications • Information Retrieval: query-based search of large text archives, e.g., the Web • Text Classification: automated assignment of topics to Web pages (e.g., Yahoo, Google); automated classification of email into spam and non-spam • Text Clustering: automated organization of search results into categories in real time; discovering clusters and trends in technical literature (e.g., CiteSeer) • Information Extraction: extracting standard fields from free text; extracting names and places from reports and newspapers
  3. 3. Information Retrieval - Definition  Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).  Information Retrieval  Deals with the representation, storage, organization of, and access to information items – Modern Information Retrieval  General Objective: Minimize the overhead of a user locating needed information
  4. 4. Information Retrieval Is Not Database. Information Retrieval: • processes stored documents • searches for documents relevant to user queries • no standard for how queries should be expressed • query results are permissive of errors or inaccurate items. Database: • normally no processing of data • searches for records matching queries • standard query language: SQL • query results should have 100% accuracy; zero tolerance for errors
  5. 5. Information Retrieval Is Not Data Mining Information Retrieval  User target: Existing relevant data entries Data Mining  User target: Knowledge (rules, etc.) implied by data (not the individual data entries themselves) • Many techniques and models are shared and related • E.g. classification of documents
  6. 6. Is Information Retrieval a Form of Text Mining? What is the principal computer specialty for processing documents and text? • Information Retrieval (IR) • The task of IR is to retrieve relevant documents in response to a query • The fundamental technique of IR is measuring similarity • A query is examined and transformed into a vector of values to be compared with stored documents
  7. 7. Is Information Retrieval a Form of Text Mining? • In the prediction problem, similar documents are retrieved and their properties measured, i.e., the class labels are counted to see which label should be assigned to a new document • The objectives of prediction can be posed in the form of an IR model where documents relevant to a query are retrieved; the query is a new document
  8. 8. Key steps in information retrieval: specify query → search document collection → return subset of relevant documents. Key steps in predictive text mining: examine document collection → learn classification criteria → apply criteria to new documents.
  9. 9. Key steps in IR when predicting from retrieved documents: specify query → match against document collection → get subset of relevant documents → examine document properties using simple criteria such as the documents' labels.
  10. 10. Information Retrieval (IR)  Conceptually, IR is the study of finding needed information. I.e., IR helps users find information that matches their information needs.  Expressed as queries  Historically, IR is about document retrieval, emphasizing document as the basic unit.  Finding documents relevant to user queries  Technically, IR studies the acquisition, organization, storage, retrieval, and distribution of information.
  11. 11. Information Retrieval Cycle: source selection → query formulation → search → selection of results → examination of documents → delivery of information, with feedback loops for resource reselection, system discovery, vocabulary discovery, concept discovery, and document discovery.
  12. 12. Abstract IR Architecture: the query and the documents each pass through a representation function, yielding a query representation and document representations; a comparison function matches the query representation against the document index to produce hits. Document indexing is done offline; query processing is done online.
  13. 13. IR Architecture
  14. 14. IR Queries  Keyword queries  Boolean queries (using AND, OR, NOT)  Phrase queries  Proximity queries  Full document queries  Natural language questions
  15. 15. Information retrieval models  An IR model governs how a document and a query are represented and how the relevance of a document to a user query is defined.  Main models:  Boolean model  Vector space model  Statistical language model  etc
  16. 16. Elements in Information Retrieval  Processing of documents  Acceptance and processing of queries from users  Modelling, searching and ranking of documents  Presenting the search result
  17. 17. Process of Retrieving Information
  18. 18. Document Processing • Removing stopwords (words that appear frequently but carry little meaning, e.g., "the", "of") • Stemming: recognizing different words with the same grammatical root • Noun groups: common combinations of words • Indexing: for locating documents quickly
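To make these processing steps concrete, here is a minimal Python sketch of stopword removal and stemming. The stopword list, the suffix rules, and the example sentence are illustrative assumptions; a real system would use a standard stopword list and a proper stemmer such as Porter's algorithm.

```python
import re

# Illustrative stopword list and suffix rules, not a production configuration.
STOPWORDS = {"the", "of", "a", "an", "and", "or", "to", "in", "is", "are"}
SUFFIXES = ("ing", "ed", "es", "s")

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def stem(word):
    """Very crude suffix stripping, standing in for a real stemming algorithm."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Tokenize, drop stopwords, and stem the remaining terms."""
    return [stem(t) for t in tokenize(text) if t not in STOPWORDS]

print(preprocess("The users are retrieving indexed documents"))
# -> ['user', 'retriev', 'index', 'document']
```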
  19. 19. Processing Queries • Define a "language" for queries: syntax, operators, etc. • Modify the queries for better search: ignore meaningless parts (punctuation, conjunctions, etc.); append synonyms, e.g., e-business / e-commerce • Emerging technology: natural language queries
  20. 20. Modelling/Ranking of Documents • Model the relevance (usefulness) of documents against the user query Q • The model represents a function Rel(Q,D), where D is a document, Q is a user query, and Rel(Q,D) is the relevance of document D to query Q • Many models are available: algebraic models, probabilistic models, set-theoretic models
  21. 21. Basic Vector Space Model • Define a set of words and phrases as terms • Text is represented by a vector of terms • A user query is converted to a vector, too • Measure the vector "distance" between a document vector and the query vector. Example with term set (business, computer, PowerPoint, presentation, user, web): the document "We are doing an e-business presentation in PowerPoint." becomes (1,0,1,1,0,0); the query "computer presentation" becomes (0,1,0,1,0,0); Distance = sqrt((1-0)^2 + (0-1)^2 + (1-0)^2 + (1-1)^2 + (0-0)^2 + (0-0)^2) = sqrt(3)
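The slide's distance computation can be reproduced with a few lines of Python. The vectors below are the document and query vectors from the example, and Euclidean distance is assumed as the vector "distance".

```python
import math

# Term set from the slide: (business, computer, PowerPoint, presentation, user, web)
document = (1, 0, 1, 1, 0, 0)  # "We are doing an e-business presentation in PowerPoint."
query = (0, 1, 0, 1, 0, 0)     # "computer presentation"

def euclidean_distance(u, v):
    """Euclidean distance between two term vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

print(euclidean_distance(document, query))  # sqrt(3), about 1.732
```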
  22. 22. Probabilistic Models Overview • Ranking: the probability that a document is relevant to a query, often denoted Pr(R|D,Q) • In the actual measure, a log-odds transformation is used: log [ Pr(R|D,Q) / Pr(not R|D,Q) ] • Probability values are estimated in applications
  23. 23. Information Retrieval. Given: • a source of textual documents • a well-defined, limited query (text based). Find: • sentences with relevant information • extract the relevant information and ignore non-relevant information (important!) • link related information and output it in a predetermined format. Examples: news stories, e-mails, web pages, photographs, music, statistical data, biomedical data, etc. Information items can be in the form of text, image, video, audio, numbers, etc.
  24. 24. Information Retrieval. Two basic information retrieval (IR) processes: • Browsing or navigation system: the user skims the document collection by jumping from one document to another via hypertext or hypermedia links until a relevant document is found • Classical IR system (question answering system): the query is a question in natural language, and the answer is extracted directly from the text of the document collection. Text-based information retrieval: the information item (document) is in text format (written/spoken) or has a textual description; the information need is expressed as a query
  25. 25. Classical IR System Process
  26. 26. General concepts in IR Representation language  Typically a vector of d attribute values, e.g.,  set of color, intensity, texture, features characterizing images  word counts for text documents Data set D of N objects  Typically represented as an N x d matrix Query Q  User poses a query to search D  Query is typically expressed in the same representation language as the data, e.g.,  each text document is a set of words that occur in the document
  27. 27. Query by Content Traditional DB query: exact matches  E.g. query Q = [level = MANAGER] AND [age < 30] or,  Boolean match on text  Query = “Irvine” AND “fun”: return all docs with “Irvine” and “fun”  Not useful when there are many matches  E.g., “data mining” in Google returns 60 million documents Query-by-content query: more general / less precise  E.g. what record is most similar to a query Q?  For text data, often called “information retrieval (IR)”  Can also be used for images, sequences, video, etc  Q can itself be an object (e.g., a document) or a shorter version (e.g., 1 word) Goal  Match query Q to the N objects in the database  Return a ranked list (typically) of the most similar/relevant objects in the data set D given Q
  28. 28. Issues in Query by Content  What representation language to use  How to measure similarity between Q and each object in D  How to compute the results in real-time (for interactive querying)  How to rank the results for the user  Allowing user feedback (query modification)  How to evaluate and compare different IR algorithms/systems
  29. 29. The Standard Approach  Fixed-length (d dimensional) vector representation  For query (1-by-d Q) and database (n-by-d X) objects  Use domain-specific higher-level features (vs raw)  Image “bag of features”: color (e.g. RGB), texture (e.g. Gabor, Fourier coeffs), …  Text “bag of words”: freq count for each word in each document, … Also known as the “vector-space” model  Compute distances between vectorized representation  Use k-NN to find k vectors in X closest to Q
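A minimal sketch of this "standard approach" for text: documents and the query are turned into bag-of-words count vectors over a fixed vocabulary, and the k nearest document vectors are returned. The vocabulary, the sample documents, and the use of Euclidean distance here are illustrative assumptions.

```python
import math
from collections import Counter

def to_vector(text, vocabulary):
    """Bag-of-words count vector over a fixed term vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def knn_retrieve(query_vec, doc_vecs, k=2):
    """Return indices of the k document vectors closest to the query (Euclidean)."""
    dist = lambda u, v: math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    ranked = sorted(range(len(doc_vecs)), key=lambda i: dist(query_vec, doc_vecs[i]))
    return ranked[:k]

vocabulary = ["data", "mining", "text", "retrieval"]
docs = ["text mining", "data retrieval", "mining of data"]
doc_vecs = [to_vector(d, vocabulary) for d in docs]
print(knn_retrieve(to_vector("text mining", vocabulary), doc_vecs, k=2))  # -> [0, 2]
```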
  30. 30. Text Retrieval  Document: book, paper, WWW page, ...  Term: word, word-pair, phrase, … (often: 50,000+)  query Q = set of terms, e.g., “data” + “mining”  NLP (natural language processing) too hard, so …  Want (vector) representation for text which  Retains maximum useful semantics  Supports efficient distance computes between docs and Q  Term weights  Boolean (e.g. term in document or not); “bag of words”  Real-valued (e.g. freq term in doc; relative to all docs) ...
  31. 31. Practical Issues Tokenization  Convert document to word counts  word token = “any nonempty sequence of characters”  for HTML (etc) need to remove formatting Canonical forms, Stopwords, Stemming  Remove capitalization  Stopwords  Remove very frequent words (a, the, and…) – can use standard list  Can also remove very rare words  Stemming (next slide) Data representation  E.g., 3 column: <docid termid position>  Inverted index (faster)  List of sorted <termid docid> pairs: useful for finding docs containing certain terms  Equivalent to a sparse representation of term x doc matrix
  32. 32. Intelligent Information Retrieval  Meaning of words  Synonyms “buy” / “purchase”  Ambiguity “bat” (baseball vs. mammal)  Order of words in the query  Hot dog stand in the amusement park  Hot amusement stand in the dog park
  33. 33. Key Word Search • The technical goal of prediction is to classify new, unseen documents • Prediction and IR are unified by the computation of document similarity • IR is based on traditional keyword search through a search engine • So we should recognize that using a search engine is a special instance of the prediction concept
  34. 34. Key Word Search • We enter key words into a search engine and expect relevant documents to be returned • These key words are words in a dictionary created from the document collection and can be viewed as a small document • So we want to measure how similar the new document (the query) is to the documents in the collection
  35. 35. Key Word Search • So the notion of similarity is reduced to finding documents with the same keywords as those posed to the search engine • But the objective of the search engine is to rank the documents, not to assign a label • So we need additional techniques to break the expected ties (all retrieved documents match the search criteria)
  36. 36. Key Word Search • In full-text retrieval, all the words in each document are considered to be keywords; we use the word term to refer to the words in a document • Information-retrieval systems typically allow query expressions formed using keywords and the logical connectives and, or, and not; ands are implicit, even if not explicitly specified • Ranking of documents on the basis of estimated relevance to a query is critical • Relevance ranking is based on factors such as: term frequency (frequency of occurrence of the query keyword in the document); inverse document frequency (how many documents the query keyword occurs in; fewer documents means the keyword gets more importance); hyperlinks to documents
  37. 37. Relevance Ranking Using Terms. TF-IDF (term frequency / inverse document frequency) ranking: • Let n(d) = number of terms in document d, and n(d,t) = number of occurrences of term t in document d • Relevance of a document d to a term t: TF(d,t) = log(1 + n(d,t)/n(d)); the log factor avoids giving excessive weight to frequent terms • Relevance of document d to query Q: r(d,Q) = sum over t in Q of TF(d,t)/n(t), where n(t) is the number of documents containing term t
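A small sketch of this ranking, using the reconstructed formulas TF(d,t) = log(1 + n(d,t)/n(d)) and r(d,Q) = sum over query terms of TF(d,t)/n(t). The three documents and the query are illustrative, and n(t) is taken here to be the number of documents containing term t.

```python
import math
from collections import Counter

documents = {
    "d1": "data mining finds patterns in data",
    "d2": "text mining is mining applied to text",
    "d3": "information retrieval finds relevant documents",
}
term_counts = {d: Counter(text.split()) for d, text in documents.items()}

def n_t(term):
    """n(t): number of documents containing term t."""
    return sum(1 for counts in term_counts.values() if term in counts)

def tf(doc_id, term):
    """TF(d, t) = log(1 + n(d, t) / n(d))."""
    counts = term_counts[doc_id]
    return math.log(1 + counts[term] / sum(counts.values()))

def relevance(doc_id, query):
    """r(d, Q) = sum over query terms of TF(d, t) / n(t), skipping unseen terms."""
    return sum(tf(doc_id, t) / n_t(t) for t in query.split() if n_t(t) > 0)

query = "text mining"
for d in sorted(documents, key=lambda d: relevance(d, query), reverse=True):
    print(d, round(relevance(d, query), 3))
```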
  38. 38. Relevance Ranking Using Terms  Most systems add to the above model  Words that occur in title, author list, section headings, etc. are given greater importance  Words whose first occurrence is late in the document are given lower importance  Very common words such as “a”, “an”, “the”, “it” etc are eliminated Called stop words  Proximity: if keywords in query occur close together in the document, the document has higher importance than if they occur far apart  Documents are returned in decreasing order of relevance score  Usually only top few documents are returned, not all
  39. 39. Similarity Based Retrieval • Similarity-based retrieval: retrieve documents similar to a given document • Similarity may be defined on the basis of common words, e.g., find the k terms in A with highest TF(d,t)/n(t) and use these terms to find the relevance of other documents • Relevance feedback: similarity can be used to refine the answer set of a keyword query; the user selects a few relevant documents from those retrieved, and the system finds other documents similar to these • Vector space model: define an n-dimensional space, where n is the number of words in the document set; the vector for document d goes from the origin to a point whose i-th coordinate is TF(d,t)/n(t); the cosine of the angle between the vectors of two documents is used as a measure of their similarity
  40. 40. Relevance Using Hyperlinks  Number of documents relevant to a query can be enormous if only term frequencies are taken into account  Using term frequencies makes “spamming” easy  E.g. a travel agency can add many occurrences of the words “travel” to its page to make its rank very high  Most of the time people are looking for pages from popular sites  Idea: use popularity of Web site (e.g. how many people visit it) to rank site pages that match given keywords
  41. 41. Relevance Using Hyperlinks • Solution: use the number of hyperlinks to a site as a measure of the popularity or prestige of the site • Count only one hyperlink from each site • The popularity measure is for the site, not for individual pages; most hyperlinks are to the root of a site, and the concept of "site" is difficult to define since a URL prefix like cs.yale.edu contains many unrelated pages of varying popularity • Refinement: when computing prestige based on links to a site, give more weight to links from sites that themselves have higher prestige; the definition is circular, so set up and solve a system of simultaneous linear equations
  42. 42. Relevance Using Hyperlinks • Connections to social networking theories that ranked the prestige of people: e.g., the president of the U.S.A. has high prestige since many people know him; someone known by multiple prestigious people has high prestige • Hub and authority based ranking: a hub is a page that stores links to many pages (on a topic); an authority is a page that contains actual information on a topic • Each page gets a hub prestige based on the prestige of the authorities it points to • Each page gets an authority prestige based on the prestige of the hubs that point to it • Again, the prestige definitions are cyclic and can be obtained by solving a system of linear equations
  43. 43. Nearest-Neighbor Methods  A method that compares vectors and measures similarity  In Prediction: the NNMs will collect the K most similar documents and then look at their labels  In IR: the NNMs will determine whether a satisfactory response to the search query has been found
  44. 44. Measuring Similarity  These measures used to examine how documents are similar and the output is a numerical measure of similarity  Three increasingly complex measures:  Shared Word Count  Word Count and Bonus  Cosine Similarity
  45. 45. Shared Word Count • Counts the shared words between documents • The words: in IR we have a global dictionary where all potential words are included, with the exception of stopwords; in prediction it is better to preselect the dictionary relative to the label
  46. 46. Computing similarity by Shared words • Look at all words in the new document • For each document in the collection, count how many of these words appear • No weighting is used, just a simple count • The dictionary has true key words (weakly predictive words removed) • The results of this measure are clearly intuitive: no one will question why a document was retrieved
  47. 47. Computing similarity by Shared words • Each document is represented as a vector of key words (zeros and ones) • The similarity of two documents is the product of the two vectors • If two documents have the same key word, then this word is counted (1*1) • The performance of this measure depends mainly on the dictionary used
  48. 48. Computing similarity by Shared words • Shared words is an exact search: a document is either retrieved or not retrieved • No weighting can be done on terms: in the query "A and B", you can't specify that A is more important than B • Every retrieved document is treated equally
  49. 49. Word Count and Bonus • TF (term frequency): number of times a term occurs in a document • DF (document frequency): number of documents that contain the term • IDF (inverse document frequency) = log(N/df), where N is the total number of documents • A vector is a numerical representation of a point in a multi-dimensional space: (x1, x2, ..., xn); the dimensions of the space need to be defined, and a measure on the space needs to be defined
  50. 50. Word Count and Bonus • Each indexing term is a dimension • Each document is a vector: Di = (ti1, ti2, ti3, ti4, ..., tik) • Document similarity is defined as Similarity(Di, Dnew) = sum over j = 1..K of w(j), where w(j) = 1 + 1/df(j) if word j occurs in both documents and w(j) = 0 otherwise; K is the number of words
  51. 51. Word Count and Bonus • The bonus 1/df(j) is a variant of idf; thus, if the word occurs in many documents, the bonus is small • This measure is better than the shared word count because it discriminates between weakly and strongly predictive words
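A minimal sketch of the word-count-and-bonus measure from the preceding slides: binary term vectors, with each shared word contributing 1 + 1/df(j). The five-term space and the labeled vectors here are made up for illustration.

```python
def bonus_similarity(new_doc, labeled_doc, doc_freq):
    """Sum 1 + 1/df(j) over the words j that occur in both binary vectors."""
    return sum(
        1 + 1 / doc_freq[j]
        for j, (a, b) in enumerate(zip(new_doc, labeled_doc))
        if a == 1 and b == 1
    )

# Illustrative binary vectors over a five-term space.
labeled_docs = [(1, 0, 1, 0, 1), (0, 1, 0, 1, 0), (1, 1, 0, 0, 1)]
new_doc = (1, 0, 1, 1, 0)

# df(j): number of labeled documents containing term j (floored at 1 to avoid division by zero).
doc_freq = [max(1, sum(d[j] for d in labeled_docs)) for j in range(len(new_doc))]

for d in labeled_docs:
    print(d, bonus_similarity(new_doc, d, doc_freq))
# Shared words that are rare in the collection earn a larger bonus than common ones.
```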
  52. 52. Word Count and Bonus: computing similarity scores with bonus • A document space is defined by five terms: hardware, software, user, information, index; the query is "hardware, user, information" (new document vector 1101 as shown on the slide) • The labeled spreadsheet vectors 10101, 11000, 00010, 10001, 00100, 01010, 11001 receive similarity scores with bonus of 2.83, 1.33, 0, 1.33, 1.5, 1.33, 2.67 respectively
  53. 53. Cosine Similarity: The Vector Space • A document is represented as a vector (W1, W2, ..., Wn) • Binary: Wi = 1 if the corresponding term is in the document, Wi = 0 if it is not • TF (term frequency): Wi = tfi, where tfi is the number of times the term occurred in the document • TF*IDF (term frequency * inverse document frequency): Wi = tfi*idfi = tfi*(1 + log(N/dfi)), where dfi is the number of documents containing term i and N is the total number of documents in the collection
  54. 54. Cosine Similarity: The Vector Space • vec(D) = (w1, w2, ..., wt) • Sim(d1,d2) = cos(θ) = vec(d1) · vec(d2) / (|d1| * |d2|) = sum over j of wd1(j)*wd2(j) / (|d1| * |d2|) • W(j) > 0 whenever j ∈ di, so 0 <= sim(d1,d2) <= 1 • A document is retrieved even if it matches the query terms only partially
  55. 55. Cosine Similarity • How to compute the weight wj? A good weight must take into account two effects: quantification of intra-document content (similarity), the tf factor, i.e., the term frequency within a document; quantification of inter-document separation (dissimilarity), the idf factor, i.e., the inverse document frequency • wj = tf(j) * idf(j)
  56. 56. Cosine Similarity • TF in the given document shows how important the term is in this document (it makes words that are frequent in the document more important) • IDF makes words that are rare across all documents more important • A high weight in a tf-idf ranking scheme is therefore reached by a high term frequency in the given document and a low term frequency in all other documents • Term weights in a document affect the position of the document vector
  57. 57. Cosine Similarity: TF-IDF definitions • fik: number of occurrences of term ti in document Dk • tfik = fik / max(fik): normalized term frequency • dfi: number of documents which contain ti • idfi = log(N / dfi), where N is the total number of documents • wik = tfik * idfi: term weight • Intuition: rare words get more weight, common words less weight
  58. 58. Example TF-IDF  Given a document containing terms with given frequencies: Kent = 3; Ohio = 2; University = 1 and assume a collection of 10,000 documents and document frequencies of these terms are: Kent = 50; Ohio = 1300; University = 250. THEN Kent: tf = 3/3; idf = log(10000/50) = 5.3; tf-idf = 5.3 Ohio: tf = 2/3; idf = log(10000/1300) = 2.0; tf-idf = 1.3 University: tf = 1/3; idf = log(10000/250) = 3.7; tf-idf = 1.2
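The arithmetic on this slide can be checked with a few lines of Python; the snippet assumes the natural logarithm, which reproduces the slide's idf values up to rounding.

```python
import math

N = 10000  # total number of documents in the collection
# term: (raw frequency in the document, document frequency in the collection)
terms = {"Kent": (3, 50), "Ohio": (2, 1300), "University": (1, 250)}

max_f = max(f for f, _ in terms.values())  # normalize tf by the largest raw count
for term, (f, df) in terms.items():
    tf = f / max_f
    idf = math.log(N / df)  # natural log matches the slide's 5.3 / 2.0 / 3.7
    print(term, round(tf, 2), round(idf, 1), round(tf * idf, 1))
# The tf-idf products agree with the slide's 5.3, ~1.3, ~1.2 up to rounding.
```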
  59. 59. Cosine Similarity • W(j) = tf(j) * idf(j) • idf(j) = log(N / df(j)) • cos(d1,d2) = sum over j of wd1(j)*wd2(j) / sqrt( (sum over j of wd1(j)^2) * (sum over j of wd2(j)^2) )
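A compact sketch of cosine similarity with tf-idf weights, following the formulas above. The three sample documents and the query are illustrative, and idf is computed as log(N/df) on this toy collection.

```python
import math
from collections import Counter

docs = [
    "information retrieval ranks documents",
    "text mining and information extraction",
    "cosine similarity compares document vectors",
]
tokenized = [d.split() for d in docs]
N = len(docs)
vocab = sorted({w for doc in tokenized for w in doc})
df = {w: sum(1 for doc in tokenized if w in doc) for w in vocab}

def tfidf_vector(words):
    """w(j) = tf(j) * idf(j), with idf(j) = log(N / df(j)) as on the slides."""
    counts = Counter(words)
    return [counts[w] * math.log(N / df[w]) if w in counts else 0.0 for w in vocab]

def cosine(u, v):
    """cos(d1, d2) = sum of wd1(j)*wd2(j) divided by the product of vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vecs = [tfidf_vector(doc) for doc in tokenized]
query_vec = tfidf_vector("information retrieval".split())
for i, v in enumerate(vecs):
    print(docs[i], "->", round(cosine(query_vec, v), 3))
```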
  60. 60. Why Mine the Web? • Enormous wealth of textual information on the Web: book/CD/video stores (e.g., Amazon), restaurant information (e.g., Zagats), car prices (e.g., Carpoint) • Lots of data on user access patterns: Web logs contain the sequence of URLs accessed by users • Possible to retrieve "previously unknown" information: people who ski also frequently break their legs; restaurants that serve seafood in California are likely to be outside San Francisco
  61. 61. Mining the Web: a Web spider crawls the documents source; given a query, the IR/IE system returns a ranked list of documents (1. Doc1, 2. Doc2, 3. Doc3, ...).
  62. 62. Web-based Retrieval Additional information in Web documents  Link structure (e.g., Page Rank)  HTML structure  Link/anchor text  Title text, Etc.,  Can be leveraged for better retrieval Additional issues in Web retrieval  Scalability: size of “corpus” is huge (10 to 100 billion docs)  Constantly changing:  Crawlers to update document-term information  Need schemes for efficient updating indices  Evaluation is more difficult:  How is relevance measured?  How many documents in total are relevant?
  63. 63. Probabilistic Approaches to Retrieval • Compute P(q|d) for each document d; intuition: the relevance of d to q is related to how likely it is that q was generated by d, i.e., "how likely is q under a model for d?" • Simple model for P(q|d): Pe(q|d) = empirical frequency of words in document d; "tuned" to d, but likely to be sparse (will contain many zeros) • 2-stage probabilistic model (or linear interpolation model): P(q|d) = lambda * Pe(q|d) + (1 - lambda) * Pe(q|corpus); lambda can be fixed, e.g., tuned to a particular data set, or it can depend on d, e.g., lambda = nd / (nd + m), where nd = number of words in doc d and m = a constant (e.g., 1000) • Can also use more sophisticated models for P(q|d), e.g., topic-based models
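A toy sketch of the interpolated query-likelihood model described above, with lambda = nd/(nd + m). The two documents, the corpus, and the constant m are illustrative.

```python
from collections import Counter

docs = {
    "d1": "data mining finds patterns in data".split(),
    "d2": "information retrieval ranks documents for a query".split(),
}
corpus = [w for words in docs.values() for w in words]
corpus_counts, corpus_len = Counter(corpus), len(corpus)

def p_query(query, doc_words, m=1000):
    """P(q|d) under linear interpolation of the document and corpus language models.

    lambda = nd / (nd + m), as on the slide; unseen query words fall back to the
    corpus model instead of zeroing out the whole product.
    """
    doc_counts, n_d = Counter(doc_words), len(doc_words)
    lam = n_d / (n_d + m)
    prob = 1.0
    for w in query.split():
        p_doc = doc_counts[w] / n_d
        p_corpus = corpus_counts[w] / corpus_len
        prob *= lam * p_doc + (1 - lam) * p_corpus
    return prob

for d, words in docs.items():
    print(d, p_query("data mining", words))
```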
  64. 64. Information Retrieval  Web-Based Document Search  Page Rank  Anchor Text  Document Matching  Inverted Lists
  65. 65. Page Rank • PR(A): the page rank of page A • C(T): the number of outgoing links from page T • d: minimum value assigned to any page • Tj: a page pointing to A • PR(A) = d + (1 - d) * sum over j of PR(Tj) / C(Tj)
  66. 66. Algorithm of Page Rank 1. Use the PageRank Equation to compute PageRank for each page in the collection using latest PageRanks of pages. 2. Repeat step 1 until no significant change to any PageRank.
  67. 67. Example • Initial values: PR(A) = PR(B) = PR(C) = 1; d = 0.1 • In the first iteration: PR(A) = 0.1 + 0.9*(PR(B) + PR(C)) = 0.1 + 0.9*(1 + 1) = 1.9; PR(B) = 0.1 + 0.9*(PR(A)/2) = 0.1 + 0.9*(1.9/2) = 0.95; PR(C) = 0.1 + 0.9*(PR(A)/2) = 0.95 • After further iterations the values settle at PR(A) = 1.48, PR(B) = 0.76, PR(C) = 0.76
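A short sketch of the iterative computation for this example, using PR(A) = d + (1 - d) * sum of PR(T)/C(T) over the pages T that link to A. The link structure is the one implied by the example (A links to B and C; B and C each link back to A).

```python
# Link structure implied by the slide's example: A -> B, C;  B -> A;  C -> A.
links = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
d = 0.1                       # minimum value assigned to any page
pr = {p: 1.0 for p in links}  # initial PageRanks

for _ in range(200):  # repeat until there is no significant change
    new_pr = {}
    for page in links:
        incoming = [q for q in links if page in links[q]]
        new_pr[page] = d + (1 - d) * sum(pr[q] / len(links[q]) for q in incoming)
    if all(abs(new_pr[p] - pr[p]) < 1e-6 for p in links):
        pr = new_pr
        break
    pr = new_pr

print({p: round(v, 2) for p, v in pr.items()})
# -> roughly {'A': 1.47, 'B': 0.76, 'C': 0.76}, matching the slide's values up to rounding
```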
  68. 68. Anchor Text • The anchor text is the visible, clickable text in a hyperlink • For example: <a href="http://www.wikipedia.org">Wikipedia</a> • The anchor text is Wikipedia; the complex URL http://www.wikipedia.org/ displays on the web page as Wikipedia, contributing to clean, easy-to-read text
  69. 69. Anchor Text  Anchor text usually gives the user relevant descriptive or contextual information about the content of the link’s destination.  The anchor text may or may not be related to the actual text of the URL of the link.  The words contained in the Anchor Text can determine the ranking that the page will receive by search engines.
  70. 70. Common Misunderstanding  Webmasters sometimes tend to misunderstand anchor text.  Instead of turning appropriate words inside of a sentence into a clickable link, webmasters frequently insert extra text.
  71. 71. Anchor Text  This proper method of linking is beneficial not only to users, but also to the webmasters as anchor text holds significant weight in search engine ranking.  Most search engine optimization experts recommend against using “click here” to designate a link.
  72. 72. Document Matching • An arbitrarily long document is the query, not just a few key words • But the goal is still to rank and output an ordered list of relevant documents • The most similar documents are found using the measures described earlier • Search engines and document matchers are not focused on classification of new documents; their primary goal is to retrieve the most relevant documents
  73. 73. Generalization of searching • Matching a document to a collection of documents looks like a tedious and expensive operation • Even for a short query, comparison against every document in a large collection implies a relatively intensive computation
  74. 74. Example of document matching  Consider an online help desk, where a complete description of a problem is submitted.  That document could be matched to stored documents, hopefully finding descriptions of similar problems and solutions without having the user experiment with numerous key word searches.
  75. 75. Inverted Lists  Instead of documents pointing to words, a list of words pointing to documents is the primary internal representation for processing queries and matching documents.
  76. 76. Inverted Lists • The inverted list is the key to the efficiency of information retrieval systems • The inverted list has helped make nearest-neighbor methods a pragmatic possibility for prediction
  77. 77. Example • If the query contained words 100 and 200: 1) first process W(100), the inverted list of word 100, updating the similarity S(i) of each document i on the list: S(1) = 0 + 1, S(2) = 0 + 1, ...; 2) then process W(200) in the same way: S(2) = 1 + 1, ...
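A sketch of how the inverted lists drive this computation: only documents appearing on the posting lists of the query words are touched, and their scores are accumulated. The posting lists below are illustrative.

```python
from collections import defaultdict

# Inverted lists: each term points to the documents that contain it.
inverted = {
    100: [1, 2, 5],   # W(100): documents containing word 100
    200: [2, 7],      # W(200): documents containing word 200
}

def score(query_terms):
    """Accumulate one point per shared term, touching only posted documents."""
    scores = defaultdict(int)
    for term in query_terms:
        for doc_id in inverted.get(term, []):
            scores[doc_id] += 1
    return dict(scores)

print(score([100, 200]))  # -> {1: 1, 2: 2, 5: 1, 7: 1}
```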
  78. 78. Evaluating IE Accuracy  Always evaluate performance on independent, manually- annotated test data not used during system development.  Measure for each test document:  Total number of correct extractions in the solution template: N  Total number of slot/value pairs extracted by the system: E  Number of extracted slot/value pairs that are correct (i.e. in the solution template): C  Compute average value of metrics adapted from IR:  Recall = C/N  Precision = C/E  F-Measure = Harmonic mean of recall and precision
  79. 79. Related Types of Data Sparse high-dimensional data sets with counts, like document-term matrices, are common in data mining, e.g.,  “transaction data”  Rows = customers; Columns = products  Web log data (ignoring sequence)  Rows = Web surfers; Columns = Web pages Recommender systems  Given some products from user i, suggest other products to the user  e.g., Amazon.com’s book recommender  Collaborative filtering:  use k-nearest-individuals as the basis for predictions  Many similarities with querying and information retrieval
  80. 80. What is a Good IR System? • Minimize the overhead of a user locating needed information • Fast, accurate, comprehensive, easy to use, ... • Objective measures: Precision P = (no. of relevant documents retrieved) / (no. of all documents retrieved); Recall R = (no. of relevant documents retrieved) / (no. of all relevant documents in the data)
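A minimal sketch of the two measures; the retrieved set and the relevant set below are illustrative.

```python
def precision_recall(retrieved, relevant):
    """P = relevant retrieved / all retrieved; R = relevant retrieved / all relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Illustrative run: 4 documents retrieved, 5 relevant documents exist, 3 overlap.
print(precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d4", "d7", "d9"]))
# -> (0.75, 0.6)
```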
  81. 81. Measuring Retrieval Effectiveness • Information-retrieval systems save space by using index structures that support only approximate retrieval. This may result in: false negatives (false drops), where some relevant documents are not retrieved, and false positives, where some irrelevant documents are retrieved • For many applications a good index should not permit any false drops, but may permit a few false positives • Relevant performance metrics: precision, the percentage of retrieved documents that are relevant to the query; recall, the percentage of relevant documents that were retrieved
  82. 82. Measuring Retrieval Effectiveness Recall vs. precision tradeoff:  Can increase recall by retrieving many documents (down to a low level of relevance ranking), but many irrelevant documents would be fetched, reducing precision Measures of retrieval effectiveness:  Recall as a function of number of documents fetched, or  Precision as a function of recall Equivalently, as a function of number of documents fetched  E.g. “precision of 75% at recall of 50%, and 60% at a recall of 75%”
  83. 83. Applications of Information Retrieval  Classic application  Library catalogue e.g. The UofC library catalogue  Current applications  Digital library e.g. http://www.acm.org/dl  WWW search engines e.g. http://www.google.com
  84. 84. Other applications of IE Systems  Job resumes  Seminar announcements  Molecular biology information from MEDLINE, e.g, Extracting gene drug interactions from biomed texts  Summarizing medical patient records by extracting diagnoses, symptoms, physical findings, test results.  Gathering earnings, profits, board members, etc. [corporate information] from web, company reports  Verification of construction industry specifications documents (are the quantities correct/reasonable?)  Extraction of political/economic/business changes from newspaper articles
  85. 85. Conclusion 1. Information retrieval methods are specialized nearest-neighbor methods, which are well-known prediction methods. 2. IR methods typically process unlabeled data and order and display the retrieved documents. 3. The IR methods have no training and induce no new rules for classification.
