1. CORPUS STRUCTURE, LANGUAGE MODELS, AND AD HOC INFORMATION RETRIEVAL. Oren Kurland and Lillian Lee, Department of Computer Science, Cornell University, Ithaca, NY
17. Another Reason for Smoothing
Query = “the algorithms for data mining”

             the      algorithms  for      data      mining
p_ML(w|d1):  0.04     0.001       0.02     0.002     0.003
p_ML(w|d2):  0.02     0.001       0.01     0.003     0.004

Here p(“algorithms”|d1) = p(“algorithms”|d2), p(“data”|d1) < p(“data”|d2), and p(“mining”|d1) < p(“mining”|d2), so intuitively d2 should get the higher score. Yet p(q|d1) > p(q|d2), because d1 puts more probability on the function words “the” and “for”. We should therefore make p(“the”) and p(“for”) less different across documents, and smoothing with a reference model achieves this:

                    the      algorithms  for      data       mining
p(w|REF):           0.2      0.00001     0.2      0.00001    0.00001
smoothed p(w|d1):   0.184    0.000109    0.182    0.000209   0.000309
smoothed p(w|d2):   0.182    0.000109    0.181    0.000309   0.000409

After smoothing, p(q|d2) > p(q|d1), as desired.
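The smoothing step on this slide can be sketched in Python. The interpolation weight λ = 0.1 on the document model is an assumption chosen because it reproduces the slide's smoothed numbers (e.g., 0.1 · 0.04 + 0.9 · 0.2 = 0.184):

```python
from math import prod

# Maximum-likelihood and reference-model probabilities from the slide.
query = ["the", "algorithms", "for", "data", "mining"]
p_ml = {
    "d1": {"the": 0.04, "algorithms": 0.001, "for": 0.02, "data": 0.002, "mining": 0.003},
    "d2": {"the": 0.02, "algorithms": 0.001, "for": 0.01, "data": 0.003, "mining": 0.004},
}
p_ref = {"the": 0.2, "algorithms": 0.00001, "for": 0.2, "data": 0.00001, "mining": 0.00001}

def score(doc, lam):
    """p(q|d) with Jelinek-Mercer smoothing:
    p(w|d) = lam * p_ML(w|d) + (1 - lam) * p(w|REF)."""
    return prod(lam * p_ml[doc][w] + (1 - lam) * p_ref[w] for w in query)

# With lam = 1 (no smoothing), d1 wins because of "the" and "for";
# with lam = 0.1, d2 wins, matching the content-word intuition.
```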
20. RETRIEVAL ALGORITHM Baseline method: documents are ranked by a probabilistic score function based on the frequency with which the query words occur in each document.
21. Introduction: Probabilistic IR. An information need is expressed as a query, which is matched against the documents d1, d2, …, dn in the collection.
22. BASIS SELECT This algorithm uses pooled statistics from documents simply to decide whether a document is worth ranking or not. Only basis documents meeting some minimum threshold frequency are allowed to appear in the final output list.
23. Introduction: IR based on LM. The information need is still expressed as a query, but each document d1, d2, …, dn in the collection is now viewed as a model that may have generated the query.
24. SET SELECT ALGORITHM In this case all documents may appear in the final output list. The idea is that any document in a “best” cluster, basis or not, is potentially relevant. BAG SELECT Documents appearing in more than one cluster should get extra consideration. The name refers to incorporating each document's multiplicity in the bag formed by the multiset union of the selected clusters.
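A minimal sketch of how the two selection-only algorithms might work; the data layout and the rule of breaking bag-select ties by document score are illustrative assumptions, not taken from the paper's actual implementation:

```python
from collections import Counter

def top_clusters(scored_clusters, k):
    """scored_clusters: list of (cluster_score, set_of_doc_ids)."""
    return [c for _, c in sorted(scored_clusters, key=lambda sc: sc[0], reverse=True)[:k]]

def set_select(scored_clusters, doc_score, k):
    """Rank the plain set union of the top-k clusters' documents."""
    pool = set().union(*top_clusters(scored_clusters, k))
    return sorted(pool, key=doc_score, reverse=True)

def bag_select(scored_clusters, doc_score, k):
    """Like set_select, but documents in several top clusters come first:
    rank by multiplicity in the multiset union, then by document score."""
    mult = Counter(d for c in top_clusters(scored_clusters, k) for d in c)
    return sorted(mult, key=lambda d: (mult[d], doc_score(d)), reverse=True)
```

For example, a document that appears in two of the best clusters is placed ahead of one that appears in only one, even if the latter has the higher individual score.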
25. ASPECT-X RATIO The degree of relevance for a particular probability is based on the strength of association between d and c, where d is a document and c is a cluster. The uniform aspect-x ratio assumes that every d ∈ c has the same degree of association.
26. A HYBRID ALGORITHM An interpolation algorithm combines the advantages of both the selection-only algorithms and the aspect-x algorithm. It can be derived by dropping the original aspect model's conditional independence assumption, namely that p(q|d, c) = p(q|c), and instead setting p(q|d, c) in Equation 1 to λ·p(q|d) + (1 − λ)·p(q|c), where λ indicates the degree of emphasis on individual-document information. Some algebra then gives p(q|d) = λ·p(q|d) + (1 − λ)·Σ_c p(q|c)·p(c|d). Finally, applying the same assumptions as described in the discussion of the aspect-x algorithm yields a score function that is the linear interpolation of the score of the standard LM approach and the score of the aspect-x algorithm.
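The interpolated score can be sketched directly from the formula above; the dictionary-based inputs and names here are illustrative assumptions, standing in for whatever estimates of p(q|d), p(q|c), and p(c|d) are available:

```python
def hybrid_score(d, p_q_d, p_q_c, p_c_d, lam):
    """lam * p(q|d) + (1 - lam) * sum_c p(q|c) * p(c|d).

    lam = 1 recovers the standard LM score p(q|d);
    lam = 0 recovers the aspect-x score sum_c p(q|c) * p(c|d).
    """
    aspect = sum(p_q_c[c] * p_c_d[d][c] for c in p_q_c)
    return lam * p_q_d[d] + (1 - lam) * aspect
```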
27. TEXT GENERATION WITH UNIGRAM LM A (unigram) language model θ assigns a probability p(w|θ) to each word.
Topic 1 (Text mining): text 0.2, mining 0.1, clustering 0.02, association 0.01, …, food 0.00001, …
Topic 2 (Health): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, …
A document d is generated by sampling words from p(w|θ); given θ, p(d|θ) varies according to d. Topic 1 is likely to generate a text-mining paper, Topic 2 a food-nutrition paper.
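The generation process can be sketched as i.i.d. word sampling. The probabilities are the ones shown on the slide; since the full vocabularies are not given, `random.choices` treats them as relative weights:

```python
import random

# Toy unigram LMs from the slide (truncated vocabularies).
topic_text_mining = {"text": 0.2, "mining": 0.1, "clustering": 0.02,
                     "association": 0.01, "food": 0.00001}
topic_health = {"food": 0.25, "nutrition": 0.1, "healthy": 0.05, "diet": 0.02}

def sample_document(lm, length, seed=0):
    """Generate a document by drawing `length` words i.i.d. from p(w|theta)."""
    rng = random.Random(seed)
    words, weights = zip(*lm.items())
    return rng.choices(words, weights=weights, k=length)
```

A document sampled from `topic_text_mining` will be dominated by “text” and “mining”, one sampled from `topic_health` by “food” and “nutrition”.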
28. ESTIMATION OF UNIGRAM LM How do we estimate p(w|θ) from a document? Word counts in the document (total #words = 100): text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1, …
Maximum-likelihood estimates: p(text|θ) = 10/100, p(mining|θ) = 5/100, p(association|θ) = 3/100, p(database|θ) = 3/100, …, p(query|θ) = 1/100, …
How good is the estimated model? It gives our document sample the highest probability, but it doesn't generalize well. More about this later…
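The maximum-likelihood estimate is just relative frequency. A sketch with the slide's counts (note that the words shown account for only 25 of the 100 tokens, so the estimates here do not sum to 1):

```python
# Word counts from the slide; the document has 100 tokens in total,
# of which only these are listed.
counts = {"text": 10, "mining": 5, "association": 3, "database": 3,
          "algorithm": 2, "query": 1, "efficient": 1}
total = 100

# Maximum-likelihood estimate: p_ML(w|theta) = count(w) / total.
p_ml = {w: c / total for w, c in counts.items()}
```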
29. THE BASIC LM APPROACH [PONTE & CROFT 98] Estimate a language model for each document, e.g., one for a text-mining paper (text, mining, association, clustering, …, food, …) and one for a food-nutrition paper (food, nutrition, healthy, diet, …). Given the query “data mining algorithms”, ask: which model would most likely have generated this query?
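Query-likelihood ranking in this spirit can be sketched as follows. The per-document models are toy numbers assumed for illustration, and the small floor probability for unseen words is a stand-in for the proper smoothing discussed on slide 17:

```python
from math import prod

# Hypothetical document language models (illustrative, not from the paper).
models = {
    "text_mining_paper": {"text": 0.2, "mining": 0.1, "data": 0.05, "algorithms": 0.02},
    "food_nutrition_paper": {"food": 0.25, "data": 0.001, "algorithms": 0.0005},
}

def query_likelihood(query, lm, floor=1e-6):
    """p(q|d) = prod over query words of p(w|d);
    unseen words get a small floor probability instead of zero."""
    return prod(lm.get(w, floor) for w in query.split())

# Rank documents by how likely each model is to have generated the query.
ranking = sorted(models,
                 key=lambda d: query_likelihood("data mining algorithms", models[d]),
                 reverse=True)
```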