1. CORPUS STRUCTURE, LANGUAGE MODELS, AND AD HOC INFORMATION RETRIEVAL. Oren Kurland and Lillian Lee, Department of Computer Science, Cornell University, Ithaca, NY
17. Another Reason for Smoothing
Query = “the algorithms for data mining”

             the      algorithms  for      data      mining
p_ML(w|d1):  0.04     0.001       0.02     0.002     0.003
p_ML(w|d2):  0.02     0.001       0.01     0.003     0.004

Here p(“algorithms”|d1) = p(“algorithms”|d2), p(“data”|d1) < p(“data”|d2), and p(“mining”|d1) < p(“mining”|d2), so intuitively d2 should get the higher score. Yet p(q|d1) > p(q|d2), because d1 puts more probability on the function words “the” and “for”. We should therefore make p(“the”) and p(“for”) less different across documents, and smoothing with a reference model achieves this:

                    the      algorithms  for      data       mining
p(w|REF):           0.2      0.00001     0.2      0.00001    0.00001
smoothed p(w|d1):   0.184    0.000109    0.182    0.000209   0.000309
smoothed p(w|d2):   0.182    0.000109    0.181    0.000309   0.000409

After smoothing, p(q|d2) > p(q|d1), as desired.
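The smoothing step on this slide can be sketched in Python. The interpolation weight λ = 0.1 on the document model is an assumption chosen because it reproduces the slide's smoothed numbers (e.g., 0.1 · 0.04 + 0.9 · 0.2 = 0.184):

```python
from math import prod

# Maximum-likelihood and reference-model probabilities from the slide.
query = ["the", "algorithms", "for", "data", "mining"]
p_ml = {
    "d1": {"the": 0.04, "algorithms": 0.001, "for": 0.02, "data": 0.002, "mining": 0.003},
    "d2": {"the": 0.02, "algorithms": 0.001, "for": 0.01, "data": 0.003, "mining": 0.004},
}
p_ref = {"the": 0.2, "algorithms": 0.00001, "for": 0.2, "data": 0.00001, "mining": 0.00001}

def score(doc, lam):
    """p(q|d) with Jelinek-Mercer smoothing:
    p(w|d) = lam * p_ML(w|d) + (1 - lam) * p(w|REF)."""
    return prod(lam * p_ml[doc][w] + (1 - lam) * p_ref[w] for w in query)

# With lam = 1 (no smoothing), d1 wins because of "the" and "for";
# with lam = 0.1, d2 wins, matching the content-word intuition.
```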
20. RETRIEVAL ALGORITHM Baseline method: documents are ranked by a probabilistic score function based on the frequency with which the query words occur in each document.
21. Introduction: Probabilistic IR. An information need is expressed as a query, which is matched against the documents d1, d2, …, dn in the collection.
22. BASIS SELECT This algorithm uses pooled statistics from documents simply to decide whether a document is worth ranking or not. Only basis documents meeting some minimum threshold frequency are allowed to appear in the final output list.
23. Introduction: IR based on LM. The information need is still expressed as a query, but each document d1, d2, …, dn in the collection is now viewed as a model that may have generated the query.
24. SET SELECT ALGORITHM In this case all documents may appear in the final output list. The idea is that any document in a “best” cluster, basis or not, is potentially relevant. BAG SELECT Documents appearing in more than one cluster should get extra consideration. The name refers to incorporating each document's multiplicity in the bag formed by the multiset union of the selected clusters.
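A minimal sketch of how the two selection-only algorithms might work; the data layout and the rule of breaking bag-select ties by document score are illustrative assumptions, not taken from the paper's actual implementation:

```python
from collections import Counter

def top_clusters(scored_clusters, k):
    """scored_clusters: list of (cluster_score, set_of_doc_ids)."""
    return [c for _, c in sorted(scored_clusters, key=lambda sc: sc[0], reverse=True)[:k]]

def set_select(scored_clusters, doc_score, k):
    """Rank the plain set union of the top-k clusters' documents."""
    pool = set().union(*top_clusters(scored_clusters, k))
    return sorted(pool, key=doc_score, reverse=True)

def bag_select(scored_clusters, doc_score, k):
    """Like set_select, but documents in several top clusters come first:
    rank by multiplicity in the multiset union, then by document score."""
    mult = Counter(d for c in top_clusters(scored_clusters, k) for d in c)
    return sorted(mult, key=lambda d: (mult[d], doc_score(d)), reverse=True)
```

For example, a document that appears in two of the best clusters is placed ahead of one that appears in only one, even if the latter has the higher individual score.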
25. ASPECT-X RATIO The degree of relevance for a particular probability is based on the strength of association between d and c, where d is a document and c is a cluster. The uniform aspect-x ratio assumes that every d ∈ c has the same degree of association.
26. A HYBRID ALGORITHM An interpolation algorithm combines the advantages of both the selection-only algorithms and the aspect-x algorithm. It can be derived by dropping the original aspect model's conditional independence assumption, namely that p(q|d, c) = p(q|c), and instead setting p(q|d, c) in Equation 1 to λ·p(q|d) + (1 − λ)·p(q|c), where λ indicates the degree of emphasis on individual-document information. Some algebra then gives p(q|d) = λ·p(q|d) + (1 − λ)·Σ_c p(q|c)·p(c|d). Finally, applying the same assumptions as described in the discussion of the aspect-x algorithm yields a score function that is the linear interpolation of the score of the standard LM approach and the score of the aspect-x algorithm.
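The interpolated score can be sketched directly from the formula above; the dictionary-based inputs and names here are illustrative assumptions, standing in for whatever estimates of p(q|d), p(q|c), and p(c|d) are available:

```python
def hybrid_score(d, p_q_d, p_q_c, p_c_d, lam):
    """lam * p(q|d) + (1 - lam) * sum_c p(q|c) * p(c|d).

    lam = 1 recovers the standard LM score p(q|d);
    lam = 0 recovers the aspect-x score sum_c p(q|c) * p(c|d).
    """
    aspect = sum(p_q_c[c] * p_c_d[d][c] for c in p_q_c)
    return lam * p_q_d[d] + (1 - lam) * aspect
```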
27. TEXT GENERATION WITH UNIGRAM LM A (unigram) language model θ assigns a probability p(w|θ) to each word.
Topic 1 (Text mining): text 0.2, mining 0.1, clustering 0.02, association 0.01, …, food 0.00001, …
Topic 2 (Health): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, …
A document d is generated by sampling words from p(w|θ); given θ, p(d|θ) varies according to d. Topic 1 is likely to generate a text-mining paper, Topic 2 a food-nutrition paper.
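The generation process can be sketched as i.i.d. word sampling. The probabilities are the ones shown on the slide; since the full vocabularies are not given, `random.choices` treats them as relative weights:

```python
import random

# Toy unigram LMs from the slide (truncated vocabularies).
topic_text_mining = {"text": 0.2, "mining": 0.1, "clustering": 0.02,
                     "association": 0.01, "food": 0.00001}
topic_health = {"food": 0.25, "nutrition": 0.1, "healthy": 0.05, "diet": 0.02}

def sample_document(lm, length, seed=0):
    """Generate a document by drawing `length` words i.i.d. from p(w|theta)."""
    rng = random.Random(seed)
    words, weights = zip(*lm.items())
    return rng.choices(words, weights=weights, k=length)
```

A document sampled from `topic_text_mining` will be dominated by “text” and “mining”, one sampled from `topic_health` by “food” and “nutrition”.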
28. ESTIMATION OF UNIGRAM LM How do we estimate p(w|θ) from a document? Word counts in the document (total #words = 100): text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1, …
Maximum-likelihood estimates: p(text|θ) = 10/100, p(mining|θ) = 5/100, p(association|θ) = 3/100, p(database|θ) = 3/100, …, p(query|θ) = 1/100, …
How good is the estimated model? It gives our document sample the highest probability, but it doesn't generalize well. More about this later…
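The maximum-likelihood estimate is just relative frequency. A sketch with the slide's counts (note that the words shown account for only 25 of the 100 tokens, so the estimates here do not sum to 1):

```python
# Word counts from the slide; the document has 100 tokens in total,
# of which only these are listed.
counts = {"text": 10, "mining": 5, "association": 3, "database": 3,
          "algorithm": 2, "query": 1, "efficient": 1}
total = 100

# Maximum-likelihood estimate: p_ML(w|theta) = count(w) / total.
p_ml = {w: c / total for w, c in counts.items()}
```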
29. THE BASIC LM APPROACH [PONTE & CROFT 98] Estimate a language model for each document, e.g., one for a text-mining paper (text, mining, association, clustering, …, food, …) and one for a food-nutrition paper (food, nutrition, healthy, diet, …). Given the query “data mining algorithms”, ask: which model would most likely have generated this query?
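Query-likelihood ranking in this spirit can be sketched as follows. The per-document models are toy numbers assumed for illustration, and the small floor probability for unseen words is a stand-in for the proper smoothing discussed on slide 17:

```python
from math import prod

# Hypothetical document language models (illustrative, not from the paper).
models = {
    "text_mining_paper": {"text": 0.2, "mining": 0.1, "data": 0.05, "algorithms": 0.02},
    "food_nutrition_paper": {"food": 0.25, "data": 0.001, "algorithms": 0.0005},
}

def query_likelihood(query, lm, floor=1e-6):
    """p(q|d) = prod over query words of p(w|d);
    unseen words get a small floor probability instead of zero."""
    return prod(lm.get(w, floor) for w in query.split())

# Rank documents by how likely each model is to have generated the query.
ranking = sorted(models,
                 key=lambda d: query_likelihood("data mining algorithms", models[d]),
                 reverse=True)
```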