Artificial Intelligence


Published in: Education

  1. 1. CORPUS STRUCTURE, LANGUAGE MODELS, AND AD HOC INFORMATION RETRIEVAL Oren Kurland and Lillian Lee Department of Computer Science Cornell University Ithaca, NY
  2. 2. INFORMATION RETRIEVAL <ul><li>Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). </li></ul>
  4. 4. CLUSTERING <ul><li>Clustering: </li></ul><ul><li>the process of grouping a set of objects into classes of similar objects </li></ul><ul><ul><li>Documents within a cluster should be similar. </li></ul></ul><ul><ul><li>Documents from different clusters should be dissimilar. </li></ul></ul><ul><li>The commonest form of unsupervised learning </li></ul><ul><ul><ul><li>learning from raw data, as opposed to supervised data where a classification of examples is given </li></ul></ul></ul><ul><ul><li>A common and important task that finds many applications in IR and other places </li></ul></ul>
  8. 8. TERMINOLOGY <ul><li>An IR system looks for data matching some criteria defined by the users in their queries. </li></ul><ul><li>The language used to ask a question is called the query language. </li></ul><ul><li>These queries use keywords (atomic items characterizing some data). </li></ul><ul><li>The basic unit of data is a document (can be a file, an article, a paragraph, etc.). </li></ul><ul><li>A document corresponds to free text (may be unstructured). </li></ul><ul><li>All the documents are gathered into a collection (or corpus). </li></ul>
  9. 9. TERM FREQUENCY <ul><li>A document is treated as a bag (multiset) of words </li></ul><ul><li>Each word characterizes that document to some extent </li></ul><ul><li>When we have eliminated stop words, the most frequent words tend to be what the document is about </li></ul><ul><li>Therefore: f_kd (the number of occurrences of word k in document d) will be an important measure. </li></ul><ul><li> Also called the term frequency (tf) </li></ul>
  10. 10. DOCUMENT FREQUENCY <ul><li>What makes this document distinct from others in the corpus? </li></ul><ul><li>The terms which discriminate best are not those which occur with high document frequency! </li></ul><ul><li>Therefore: d_k (the number of documents in which word k occurs) will also be an important measure. </li></ul><ul><li> Also called the document frequency (df); the inverse document frequency (idf) is derived from it </li></ul>
  11. 11. TF.IDF <ul><li>This can all be summarized as: </li></ul><ul><ul><li>Words are best discriminators when : </li></ul></ul><ul><ul><ul><li>they occur often in this document (term frequency) </li></ul></ul></ul><ul><ul><ul><li>do not occur in a lot of documents (document frequency) </li></ul></ul></ul><ul><ul><li>One very common measure of the importance of a word to a document is : </li></ul></ul><ul><ul><ul><li>TF.IDF: term frequency x inverse document frequency </li></ul></ul></ul><ul><ul><li>There are multiple formulas for actually computing this. The underlying concept is the same in all of them. </li></ul></ul>
  12. 12. TERM WEIGHTS <ul><li>tf-score: tf_i,j = frequency of term i in document j </li></ul><ul><li>idf-score: idf_i = inverse document frequency of term i </li></ul><ul><ul><li>idf_i = log(N/n_i) with </li></ul></ul><ul><ul><ul><li>N, the size of the document collection (number of documents) </li></ul></ul></ul><ul><ul><ul><li>n_i, the number of documents in which term i occurs </li></ul></ul></ul><ul><ul><li>n_i/N is the proportion of the document collection in which term i occurs, so idf_i measures the rarity of term i in the collection </li></ul></ul><ul><li>Term weight of term i in document j (TF-IDF): </li></ul><ul><ul><li>tf_i,j · idf_i </li></ul></ul>
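The term weights on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming raw counts for tf and the natural logarithm for idf (the slide does not fix a log base); the function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, with weight = tf * log(N / n_i)."""
    n_docs = len(docs)  # N: size of the document collection
    # n_i: number of documents in which term i occurs at least once
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency of each term in this document
        weights.append({t: f * math.log(n_docs / doc_freq[t]) for t, f in tf.items()})
    return weights

# Toy collection (hypothetical): each document is a list of tokens.
docs = [
    ["text", "mining", "text"],
    ["food", "nutrition"],
    ["text", "food"],
]
weights = tf_idf(docs)
```

A term that occurs in every document gets idf = log(1) = 0 under this formula, which matches the intuition that such a term does not discriminate between documents.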
  13. 13. LANGUAGE MODELS FOR IR <ul><li>Language Modeling Approaches </li></ul><ul><ul><li>Attempt to model query generation process </li></ul></ul><ul><ul><li>Documents are ranked by the probability that a query would be observed as a random sample from the respective document model </li></ul></ul><ul><ul><ul><li>Multinomial approach </li></ul></ul></ul>
  14. 14. LANGUAGE MODELS <ul><li>A probability distribution over word sequences </li></ul><ul><ul><li>p(“Today is Wednesday”) ≈ 0.001 </li></ul></ul><ul><ul><li>p(“Today Wednesday is”) ≈ 0.0000000000001 </li></ul></ul><ul><ul><li>p(“The eigenvalue is positive”) ≈ 0.00001 </li></ul></ul><ul><li>Context/topic dependent! </li></ul><ul><li>Treat each document as the basis for a model (e.g., unigram sufficient statistics) </li></ul><ul><li>Rank document d based on P(d | q) </li></ul><ul><li>P(d | q) = P(q | d) x P(d) / P(q) </li></ul><ul><ul><li>P(q) is the same for all documents, so ignore </li></ul></ul><ul><ul><li>P(d) [the prior] is often treated as the same for all d </li></ul></ul><ul><ul><ul><li>But we could use criteria like authority, length, genre </li></ul></ul></ul><ul><ul><li>P(q | d) is the probability of q given d’s model </li></ul></ul><ul><li>Very general formal approach </li></ul>
  15. 15. THE SIMPLEST LANGUAGE MODEL (UNIGRAM MODEL) <ul><li>Generate a piece of text by generating each word independently </li></ul><ul><li>Thus, p(w_1 w_2 ... w_n) = p(w_1) p(w_2) … p(w_n) </li></ul><ul><li>Parameters: {p(w_i)}, with p(w_1) + … + p(w_N) = 1 (N is the vocabulary size) </li></ul><ul><li>A piece of text can be regarded as a sample drawn according to this word distribution </li></ul>
  16. 16. SMOOTHING <ul><li>Smoothing is an important issue, and distinguishes different approaches </li></ul><ul><li>Many smoothing methods are available </li></ul><ul><li>It depends on the data and the task! </li></ul><ul><li>Cross validation is generally used to choose the best method and/or set the smoothing parameters… </li></ul><ul><li>For retrieval, Dirichlet prior performs well… </li></ul><ul><li>Backoff smoothing [Katz 87] doesn’t work well due to a lack of 2nd-stage smoothing… </li></ul>
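As a rough illustration of the independence assumption, the sketch below scores a word sequence under a unigram model by summing log probabilities. The three-word vocabulary and its probabilities are made up for the example, and `unigram_log_prob` is a hypothetical helper name.

```python
import math

def unigram_log_prob(words, model):
    """log p(w_1 ... w_n) = sum_i log p(w_i), since words are generated independently."""
    return sum(math.log(model[w]) for w in words)

# Hypothetical toy model over a three-word vocabulary; probabilities sum to 1.
model = {"today": 0.2, "is": 0.5, "wednesday": 0.3}

in_order = unigram_log_prob(["today", "is", "wednesday"], model)
shuffled = unigram_log_prob(["today", "wednesday", "is"], model)
```

Because each word is generated independently, `in_order` and `shuffled` agree up to floating-point rounding: a unigram model cannot distinguish the two word orders, unlike the more general sequence model on the previous slide.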
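  17. 17. ANOTHER REASON FOR SMOOTHING <ul><li>Query = “the algorithms for data mining”; the content words are “algorithms”, “data” and “mining” </li></ul><ul><li>Probabilities below are listed in query-word order (the, algorithms, for, data, mining) </li></ul><ul><ul><li>p DML (w|d1): 0.04, 0.001, 0.02, 0.002, 0.003 </li></ul></ul><ul><ul><li>p DML (w|d2): 0.02, 0.001, 0.01, 0.003, 0.004 </li></ul></ul><ul><li>p(“algorithms”|d1) = p(“algorithms”|d2), p(“data”|d1) < p(“data”|d2), p(“mining”|d1) < p(“mining”|d2) </li></ul><ul><li>Intuitively, d2 should have a higher score, but p(q|d1) > p(q|d2)… </li></ul><ul><li>So we should make p(“the”) and p(“for”) less different for all docs, and smoothing helps achieve this goal… </li></ul><ul><ul><li>P(w|REF): 0.2, 0.00001, 0.2, 0.00001, 0.00001 </li></ul></ul><ul><ul><li>Smoothed p(w|d1): 0.184, 0.000109, 0.182, 0.000209, 0.000309 </li></ul></ul><ul><ul><li>Smoothed p(w|d2): 0.182, 0.000109, 0.181, 0.000309, 0.000409 </li></ul></ul>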
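The numbers on this slide can be checked with a short script. The sketch below assumes linear interpolation between the document model and the reference model with a weight of 0.1 on the document model; that weight is not stated on the slide, it is inferred from the fact that it reproduces the smoothed values shown. The names `smooth`, `likelihood` and `DOC_WEIGHT` are illustrative.

```python
import math

# Probabilities in query-word order: the, algorithms, for, data, mining
p_ml_d1 = [0.04, 0.001, 0.02, 0.002, 0.003]
p_ml_d2 = [0.02, 0.001, 0.01, 0.003, 0.004]
p_ref = [0.2, 0.00001, 0.2, 0.00001, 0.00001]

DOC_WEIGHT = 0.1  # assumption: inferred from the slide's smoothed values

def smooth(p_doc, p_background, doc_weight):
    """Linear interpolation of a document model with a reference model."""
    return [doc_weight * pd + (1 - doc_weight) * pb
            for pd, pb in zip(p_doc, p_background)]

def likelihood(word_probs):
    """Unigram query likelihood: product of the per-word probabilities."""
    return math.prod(word_probs)

smoothed_d1 = smooth(p_ml_d1, p_ref, DOC_WEIGHT)
smoothed_d2 = smooth(p_ml_d2, p_ref, DOC_WEIGHT)

unsmoothed_prefers_d1 = likelihood(p_ml_d1) > likelihood(p_ml_d2)
smoothed_prefers_d2 = likelihood(smoothed_d2) > likelihood(smoothed_d1)
```

Both flags come out true: without smoothing d1 wins because of its higher probabilities for “the” and “for”, while after smoothing those two words contribute almost equally for both documents and the content words decide the ranking in favour of d2.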
  18. 18. RETRIEVAL FRAMEWORK <ul><li>When we rank documents with respect to a query, we desire per-document scores that rely both on information drawn from the particular document's contents and on how the document is situated within the similarity structure of the ambient corpus. </li></ul><ul><li>Structure representation via overlapping clusters. Clusters can be represented as facets of the corpus that users might be interested in. Employing intersecting clusters may reduce information loss due to the generalization that clustering can introduce. </li></ul><ul><li>Information representation. Motivated by the empirical successes of language-modeling-based approaches, we use language models induced from documents and clusters as our information representation. Thus, p_d(q) and p_c(q) specify our initial knowledge of the relation between the query q and a particular document d or cluster c. </li></ul><ul><li>Information integration. To assign a ranking to the documents in a corpus C with respect to q, we want to score each d ∈ C against q in a way that incorporates information from query-relevant corpus facets to which d belongs. </li></ul>
  19. 19. CLUSTER-BASED SMOOTHING/SCORING <ul><li>Cluster-based query likelihood: Similar to the translation model, but “translate” the whole document to the query through a set of clusters. </li></ul><ul><ul><li>p(C|D): how likely doc D belongs to cluster C </li></ul></ul><ul><ul><li>p(Q|C): likelihood of Q given C </li></ul></ul><ul><li>Only effective when interpolated with the basic LM scores </li></ul>
  20. 20. RETRIEVAL ALGORITHM Baseline method: the documents are simply ranked by a probabilistic function of how frequently the query's words occur in them.
  21. 21. PROBABILISTIC IR [Figure: an information need is expressed as a query, which is matched against the documents d1, d2, …, dn of the document collection.]
  22. 22. BASIS SELECT This algorithm pools statistics from documents simply to decide whether a document is worth ranking or not. Only the basis documents that meet some minimum threshold frequency are allowed to appear in the final output list.
  23. 23. IR BASED ON LM [Figure: an information need is expressed as a query, which is related to the documents d1, d2, …, dn of the document collection by generation rather than matching.]
  24. 24. SET SELECT ALGORITHM In this case, all the documents may appear in the final output list. The idea is that any document in the “best” cluster, basis or not, is potentially relevant. BAG SELECT ALGORITHM Documents that appear in more than one cluster should get extra consideration. The name refers to the incorporation of each document’s multiplicity in the bag formed by the multiset union of the clusters.
  25. 25. ASPECT – X RATIO The contribution of a cluster to a document's score is based on the strength of association between d and c, where d is the document and c is the cluster. The uniform aspect-x variant assumes that every d ∈ c has the same degree of association.
  26. 26. A HYBRID ALGORITHM An interpolation algorithm combines the advantages of both the selection-only algorithms and the aspect-x algorithms. The algorithm can be derived by dropping the original aspect model's conditional independence assumption, namely that p(q|d, c) = p(q|c), and instead setting p(q|d, c) in Equation 1 to λ p(q|d) + (1 − λ) p(q|c), where λ indicates the degree of emphasis on individual-document information. If we do so, then via some algebra we get p(q|d) = λ p(q|d) + (1 − λ) Σ_c p(q|c) p(c|d). Finally, applying the same assumptions as described in our discussion of the aspect-x algorithm yields a score function that is the linear interpolation of the score of the standard LM approach and the score of the aspect-x algorithm.
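The interpolated score described on this slide can be sketched as follows. This is only an illustration of the formula before the further aspect-x simplifications are applied; the function names, cluster labels and all probabilities are hypothetical, and `lam` stands for the interpolation weight on individual-document information.

```python
def aspect_score(p_q_given_c, p_c_given_d):
    """Cluster-based part: sum over clusters c of p(q|c) * p(c|d)."""
    return sum(p_q_given_c[c] * p_c_given_d.get(c, 0.0) for c in p_q_given_c)

def interpolation_score(p_q_given_d, p_q_given_c, p_c_given_d, lam):
    """lam * p(q|d) + (1 - lam) * sum_c p(q|c) * p(c|d)."""
    return lam * p_q_given_d + (1 - lam) * aspect_score(p_q_given_c, p_c_given_d)

# Hypothetical values for one document d and two clusters.
p_q_given_d = 0.01
p_q_given_c = {"c1": 0.02, "c2": 0.005}
p_c_given_d = {"c1": 0.75, "c2": 0.25}

score = interpolation_score(p_q_given_d, p_q_given_c, p_c_given_d, lam=0.5)
```

Setting `lam` to 1 falls back to the standard document language model score, and setting it to 0 leaves only the cluster-based score, which is one way to see the two families of algorithms as endpoints of the same interpolation.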
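  27. 27. TEXT GENERATION WITH UNIGRAM LM <ul><li>(Unigram) language model θ: p(w|θ) </li></ul><ul><li>Topic 1 (text mining): … text 0.2, mining 0.1, association 0.01, clustering 0.02, … food 0.00001, … </li></ul><ul><ul><li>Sampling a document d from this model tends to yield a text mining paper </li></ul></ul><ul><li>Topic 2 (health): … food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, … </li></ul><ul><ul><li>Sampling a document d from this model tends to yield a food nutrition paper </li></ul></ul><ul><li>Given θ, p(d|θ) varies according to d </li></ul>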
  28. 28. ESTIMATION OF UNIGRAM LM <ul><li>(Unigram) language model θ: p(w|θ) = ? </li></ul><ul><li>Document word counts (total #words = 100): text 10, mining 5, association 3, database 3, algorithm 2, … query 1, efficient 1 </li></ul><ul><li>Estimated model: text 10/100, mining 5/100, association 3/100, database 3/100, … query 1/100, … </li></ul><ul><li>How good is the estimated model? It gives our document sample the highest probability, but it doesn’t generalize well… More about this later… </li></ul>
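The estimate on this slide is the relative frequency of each word in the document. A minimal sketch, assuming the counts listed on the slide and a placeholder token `<other>` (an invention for this example) standing in for the words the slide elides, so that the total comes to 100:

```python
from collections import Counter

def mle_unigram(words):
    """Maximum-likelihood unigram model: p(w|d) = count(w, d) / |d|."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Counts from the slide; "<other>" is a hypothetical filler for the elided words.
document = (
    ["text"] * 10 + ["mining"] * 5 + ["association"] * 3 + ["database"] * 3
    + ["algorithm"] * 2 + ["query", "efficient"] + ["<other>"] * 75
)
model = mle_unigram(document)

p_text = model["text"]            # 10 occurrences out of 100 words
p_food = model.get("food", 0.0)   # unseen word: zero probability
```

The zero probability for the unseen word “food” is the generalization problem the slide alludes to: any query containing a word absent from the document would get likelihood zero, which is what smoothing is meant to address.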
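  29. 29. THE BASIC LM APPROACH [PONTE & CROFT 98] <ul><li>Each document induces its own language model, whose word probabilities must be estimated: </li></ul><ul><ul><li>Text mining paper: … text ?, mining ?, association ?, clustering ?, … food ?, … </li></ul></ul><ul><ul><li>Food nutrition paper: … food ?, nutrition ?, healthy ?, diet ?, … </li></ul></ul><ul><li>Query = “data mining algorithms” </li></ul><ul><li>Which model would most likely have generated this query? </li></ul>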
  30. 30. <ul><li>Experimental Setup: </li></ul><ul><li>Data. They conducted the experiments on TREC data. They used titles (rather than full descriptions) as queries, resulting in an average length of 2-5 terms. Some characteristics of the three corpora are summarized in the following table. </li></ul>
  34. 34. Thank You