Lecture 9 - Machine Learning and Support Vector Machines (SVM)


MSU Lecture 9. Discussing machine learning and support vector machines (SVM).

Published in: Technology, Education
  • Di = 1: the product runs over the terms that have value 1. For example, in the index, a phrase that appeared in the document would have a one. Si belongs to the denominator, P(D|NR).
  • http://www.miislita.com/information-retrieval-tutorial/okapi-bm25-tutorial.pdf … Stands for Best Match. Developed in the 1980s. K normalizes by document length. b regulates the impact of the length normalization. b = 0.75 was found to be effective.
  • Summation over all terms in the query. Scoring a single document in the collection to see how it matches a query.
  • Language models are used in speech recognition, machine learning, etc.
  • Di = 1: the product runs over the terms that have value 1. For example, in the index, a phrase that appeared in the document would have a one. Qi is a query word and there are n words in the query.
  • For example, if we have a language model representing a document about computer games, the document should have a non-zero probability for the word RPG (role-playing game) even if the word does not appear in the document. The question is how much weight to give a document if it has ALL the words. Is it really MORE relevant because the word appeared in the document?
  • Taxonomy – Identifying and classifying things into groups or classes.
  • Di = 1: the product runs over the terms that have value 1. For example, in the index, a phrase that appeared in the document would have a one. Qi is a query word and there are n words in the query.
  • In this case we can use density and frequency…
  • Trying to maximize the width of the tube. If a point is on the right it is relevant; if it is on the left it is not. Then we define a decision function. How do we find the optimum? If we use the dotted line as our model, we just check whether the data is on the right- or left-hand side. Find a separating hyperplane. We are going to train this function until we get a good predictive model. Finding the general hyperplane w^T x + b = 0. Once we find w and b we can make predictions. If we put in a sample xi, the prediction is positive if w^T xi + b > 0. We will combine the 2 inequalities next.
  • Distance between two parallel lines is given by.
  • The subtraction of epsilon guarantees a separation in the data. C is a term for training errors.
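The decision rule and the tube width from the notes above can be sketched in a few lines. This is a minimal illustration, not the lecture's own code: it assumes the common canonical scaling in which the two margin lines are w^T x + b = +1 and w^T x + b = -1, and it uses the standard result that the parallel hyperplanes w^T x + b1 = 0 and w^T x + b2 = 0 are |b1 - b2| / ||w|| apart. The weights, bias, and sample points are hypothetical values chosen for the example.

```python
import math

def decision_value(w, x, b):
    """Return w^T x + b for a weight vector w, sample x, and bias b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, x, b):
    """Classify x as +1 (one side of the hyperplane) or -1 (the other side)."""
    return 1 if decision_value(w, x, b) > 0 else -1

def parallel_distance(w, b1, b2):
    """Distance between the hyperplanes w^T x + b1 = 0 and w^T x + b2 = 0."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return abs(b1 - b2) / norm_w

# Hypothetical trained parameters (not taken from the lecture).
w = [3.0, 4.0]
b = -5.0

label_right = predict(w, [3.0, 2.0], b)   # w^T x + b = 12, positive side
label_left = predict(w, [0.0, 0.0], b)    # w^T x + b = -5, negative side

# Margin lines w^T x + b = +1 and w^T x + b = -1, rewritten as
# w^T x + (b - 1) = 0 and w^T x + (b + 1) = 0.
margin_width = parallel_distance(w, b - 1.0, b + 1.0)  # equals 2 / ||w||
```

Under this scaling, making the tube wider is the same as making ||w|| smaller, which is why SVM training is usually phrased as minimizing the norm of w subject to the classification constraints.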

    1. Machine Learning & Support Vector Machines. Lecture 9. Sean A. Golliher
    2. Let a, b be two events. $p(a \mid b)\,p(b) = p(a \cap b) = p(b \mid a)\,p(a)$, so $p(a \mid b) = \dfrac{p(b \mid a)\,p(a)}{p(b)}$.
    3. Let D be a document in the collection. Let R represent relevance of a document w.r.t. a given (fixed) query and let NR represent non-relevance. Need to find p(R|D), the probability that a retrieved document D is relevant: $p(R \mid D) = \dfrac{p(D \mid R)\,p(R)}{p(D)}$ and $p(NR \mid D) = \dfrac{p(D \mid NR)\,p(NR)}{p(D)}$. p(R), p(NR): prior probability of retrieving a (non-)relevant document. p(D|R), p(D|NR): probability that if a relevant (non-relevant) document is retrieved, it is D.
    4. Suppose we have a vector representing the presence and absence of terms: (1,0,0,1,1). Terms 1, 4, and 5 are present. What is the probability of this document occurring in the relevant set? $p_i$ is the probability that term i occurs in the relevant set; $(1 - p_i)$ is the probability that term i is not included in the relevant set. This gives us: $p_1 \times (1 - p_2) \times (1 - p_3) \times p_4 \times p_5$
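The product on this slide generalizes to any binary term vector. The sketch below is a minimal illustration under the slide's independence assumption; the function name and the p_i values are hypothetical, chosen only to make the arithmetic concrete.

```python
from math import prod

def document_probability(term_vector, term_probs):
    """Probability of a binary term vector, assuming terms occur independently.

    term_vector: 0/1 flags, one per term (1 = term present in the document)
    term_probs:  p_i, the probability that term i occurs in the relevant set
    """
    if len(term_vector) != len(term_probs):
        raise ValueError("term_vector and term_probs must have the same length")
    # Present terms contribute p_i, absent terms contribute (1 - p_i).
    return prod(p if present else 1.0 - p
                for present, p in zip(term_vector, term_probs))

# Hypothetical p_i values for five terms (not from the lecture).
p = [0.5, 0.2, 0.1, 0.8, 0.4]
doc = [1, 0, 0, 1, 1]  # terms 1, 4, and 5 are present

# p1 * (1 - p2) * (1 - p3) * p4 * p5 = 0.5 * 0.8 * 0.9 * 0.8 * 0.4
bim_prob = document_probability(doc, p)
```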
    5. Popular and effective ranking algorithm based on the binary independence model. Adds document and query term weights. k1, k2 and K are parameters whose values are set empirically. dl is the document length. Typical TREC value for k1 is 1.2, k2 varies from 0 to 1000, b = 0.75.
    6. Query with two terms, “president lincoln” (qf = 1, the frequency of term i in the query). No relevance information (r and R are zero). N = 500,000 documents. “president” occurs in 40,000 documents (n1 = 40,000). “lincoln” occurs in 300 documents (n2 = 300). “president” occurs 15 times in the document (f1 = 15). “lincoln” occurs 25 times (f2 = 25). Document length is 90% of the average length (dl/avdl = 0.9). k1 = 1.2, b = 0.75, and k2 = 100. K = 1.2 · (0.25 + 0.75 · 0.9) = 1.11
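The worked example on this slide can be checked numerically. The scoring formula itself did not survive in the transcript, so the sketch below assumes the commonly published BM25 form that uses the same symbols (r, R, n, N, f, qf, k1, k2, K, b); treat it as an illustration under that assumption rather than as the exact formula from the slide. The function names are hypothetical.

```python
import math

def length_norm(k1, b, dl_over_avdl):
    """K = k1 * ((1 - b) + b * dl/avdl), the document-length normalizer."""
    return k1 * ((1.0 - b) + b * dl_over_avdl)

def bm25_term(n, f, qf, N, K, k1=1.2, k2=100.0, r=0, R=0):
    """Contribution of one query term (assumed standard BM25 form)."""
    # Relevance/idf-like weight; with r = R = 0 this reduces to
    # log((N - n + 0.5) / (n + 0.5)).
    weight = math.log(((r + 0.5) / (R - r + 0.5)) /
                      ((n - r + 0.5) / (N - n - R + r + 0.5)))
    doc_part = (k1 + 1.0) * f / (K + f)        # document term frequency part
    query_part = (k2 + 1.0) * qf / (k2 + qf)   # query term frequency part
    return weight * doc_part * query_part

N = 500_000
K = length_norm(k1=1.2, b=0.75, dl_over_avdl=0.9)     # 1.11, as on the slide

president = bm25_term(n=40_000, f=15, qf=1, N=N, K=K)
lincoln = bm25_term(n=300, f=25, qf=1, N=N, K=K)
score = president + lincoln
```

With these inputs the score comes out at about 20.6, and most of it comes from “lincoln”, the rarer term, even though both terms occur many times in the document.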
    7. Unigram language model (simplest form): a probability distribution over the words in a language. Generation of text consists of pulling words out of a “bucket” according to the probability distribution and replacing them. N-gram language model: some applications use bigram and trigram language models, where probabilities depend on previous words. An n-gram model is based on the previous n-1 words.
    8. A topic in a document or query can be represented as a language model, i.e., words that tend to occur often when discussing a topic will have high probabilities in the corresponding language model.
    9. Rank documents by the probability that the query could be generated by the document language model (i.e., same topic): P(Q|D), assuming a uniform, unigram model.
    10. Obvious estimate for unigram probabilities is $P(q_i \mid D) = f_{q_i,D} / |D|$, where $f_{q_i,D}$ is the number of times word $q_i$ occurs in the document and $|D|$ is the number of words in the document. If query words are missing from the document, the score will be zero. Missing 1 out of 4 query words is the same as missing 3 out of 4. Not good for long queries!
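The zero-score problem is easy to reproduce. The sketch below assumes the unigram query-likelihood score is the product of the per-word estimates (count of the word in the document divided by document length); the function name and the toy document are hypothetical.

```python
from collections import Counter

def query_likelihood(query_terms, doc_terms):
    """Unsmoothed P(Q|D): product over query words of count(word, D) / |D|."""
    doc_len = len(doc_terms)
    if doc_len == 0:
        return 0.0
    counts = Counter(doc_terms)
    score = 1.0
    for q in query_terms:
        score *= counts[q] / doc_len  # Counter returns 0 for unseen words
    return score

# Toy document of 7 words (hypothetical).
doc = "the president spoke and the president left".split()

all_present = query_likelihood(["president", "spoke"], doc)              # (2/7) * (1/7)
one_missing = query_likelihood(["the", "president", "spoke", "lincoln"], doc)
three_missing = query_likelihood(["president", "lincoln", "gettysburg", "address"], doc)
```

Both `one_missing` and `three_missing` are exactly zero, so the unsmoothed estimate cannot tell a near-match from a poor match.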
    11. Document texts are a sample from the language model. Missing words should not have zero probability of occurring (when calculating the probability that the query could be generated from the document). Smoothing is a technique for estimating probabilities for missing (or unseen) words: lower (or discount) the probability estimates for words that are seen in the document text, and assign that “left-over” probability to the estimates for the words that are not seen in the text.
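The slide does not say which smoothing method is used, so the sketch below picks one common choice as an assumption: linear interpolation with a collection language model (often called Jelinek-Mercer smoothing), where P(q|D) = (1 - λ) · f/|D| + λ · c/|C|. The function name, the value of λ, and the toy texts are hypothetical.

```python
from collections import Counter

def smoothed_query_likelihood(query_terms, doc_terms, collection_terms, lam=0.5):
    """P(Q|D) with linear-interpolation smoothing (an assumed method).

    Each word probability mixes the document estimate with the collection
    estimate, so words unseen in the document keep a non-zero probability
    as long as they occur somewhere in the collection.
    """
    doc_counts = Counter(doc_terms)
    coll_counts = Counter(collection_terms)
    doc_len = len(doc_terms)
    coll_len = len(collection_terms)
    score = 1.0
    for q in query_terms:
        p_doc = doc_counts[q] / doc_len if doc_len else 0.0
        p_coll = coll_counts[q] / coll_len if coll_len else 0.0
        score *= (1.0 - lam) * p_doc + lam * p_coll
    return score

# Toy collection of two documents (hypothetical).
doc_a = "computer games are fun computer games".split()  # never mentions "rpg"
doc_b = "rpg games on the computer".split()
collection = doc_a + doc_b

# "rpg" is missing from doc_a, yet the smoothed score stays above zero.
smoothed_score = smoothed_query_likelihood(["computer", "rpg"], doc_a, collection)

# A seen word is discounted: "fun" drops below its unsmoothed 1/6 estimate.
fun_smoothed = smoothed_query_likelihood(["fun"], doc_a, collection)
```

Here λ controls how much probability mass is moved from seen words to unseen ones; λ = 0 gives back the unsmoothed estimate.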
    12. Informational: finding information about some topic, which may be on one or more web pages (topical search). Navigational: finding a particular web page that the user has either seen before or is assumed to exist. Transactional: finding a site where a task such as shopping or downloading music can be performed. Broder (2002), http://www.sigir.org/forum/F2002/broder.pdf
    13. For effective navigational and transactional search, need to combine features that reflect user relevance. Commercial web search engines combine evidence from hundreds of features to generate a ranking score for a web page: page content, page metadata, anchor text, links (e.g., PageRank), and user behavior (click logs). Page metadata includes, e.g., “age”, how often the page is updated, the URL of the page, the domain name of its site, and the amount of text content.
    14. SEO: understanding the relative importance of features used in search and how they can be optimized to obtain better search rankings for a web page, e.g., improve the text used in the title tag, improve the text in heading tags, make sure that the domain name and URL contain important keywords, and try to improve the anchor text and link structure. Some of these techniques are regarded as not appropriate by search engine companies.
    15. Toolkit, written in Java, for experimenting with text: http://www.galagosearch.org/quick-start.html
    16. Considerable interaction between these fields. Arthur Samuel, 1959: checkers game, the world’s first self-learning program, on the IBM 701. Web query logs have generated a new wave of research, e.g., “Learning to Rank”.
    17. Supervised Learning: regression analysis. Classification problems: Support Vector Machines (SVM). Unsupervised Learning: http://www.youtube.com/watch?v=GWWIn29ZV4Q Reinforcement Learning. Learning Theory: How much training data do we need? How accurately can we predict an event to 99% accuracy?
    18. Papers: Boser et al., 1992. Standard SVM: [Cortes and Vapnik, 1995]