Topic Model: An Introduction

This is an introduction to topic modeling, covering tf-idf, LSA, pLSA, LDA, EM, and some other related material. There are certainly some mistakes; please correct them with your wisdom. Thank you~



1. Topic Model (≈ 1/2 of Text Mining)
   Yueshen Xu, xyshzjucs@zju.edu.cn
   Middleware, CCNT, ZJU (6/11/2014)
2. Outline
   - Basic Concepts
   - Application and Background
   - Famous Researchers
   - Language Model
   - Vector Space Model (VSM)
   - Term Frequency-Inverse Document Frequency (TF-IDF)
   - Latent Semantic Indexing (LSA)
   - Probabilistic Latent Semantic Indexing (pLSA)
   - Expectation-Maximization Algorithm (EM) & Maximum-Likelihood Estimation (MLE)
3. Outline (cont.)
   - Latent Dirichlet Allocation (LDA)
   - Conjugate Prior
   - Poisson Distribution
   - Variational Distribution and Variational Inference (VD & VI)
   - Markov Chain Monte Carlo (MCMC)
   - Metropolis-Hastings Sampling (MH)
   - Gibbs Sampling and Gibbs Sampling for LDA
   - Bayesian Theory vs. Probability Theory
4. Concepts
   - Latent Semantic Analysis
   - Topic Model
   - Text Mining
   - Natural Language Processing
   - Computational Linguistics
   - Information Retrieval
   - Dimension Reduction
   - Expectation-Maximization (EM)
   [Diagram: LSA/topic models at the intersection of information retrieval, computational linguistics, NLP, machine translation, text/data mining, dimension reduction, and machine learning (EM)]
   Aim: find the topic that a word or a document belongs to (the Latent Factor Model)
5. Application
   - The latent factor model (LFM) has become a fundamental technique in modern search engines, recommender systems, tag extraction, blog clustering, Twitter topic mining, news (text) summarization, etc.
   - Search Engine
     - PageRank → how important is this web page?
     - LFM → how relevant is this web page?
     - LFM → how relevant is the user's query to one document?
   - Recommender System
     - Opinion Extraction
     - Spam Detection
     - Tag Extraction
   - Text Summarization
     - Abstract Generation
     - Twitter Topic Mining
   Example text: "Steve Jobs left us about two years ago... Apple's price will fall..."
6. Famous Researchers
   - David Blei, Princeton, LDA
   - ChengXiang Zhai, UIUC, Presidential Early Career Award
   - W. Bruce Croft, UMass, Language Model
   - Bing Liu, UIC, Opinion Mining
   - John D. Lafferty, CMU, CRF & IBM
   - Thomas Hofmann, Brown, pLSA
   - Andrew McCallum, UMass, CRF & IBM
   - Susan Dumais, Microsoft, LSI
7. Language Model
   - Unigram Language Model == Zero-order Markov Chain: p(w|M) = ∏_{w_i ∈ s} p(w_i|M)
   - Bigram Language Model == First-order Markov Chain: p(w|M) = ∏_{w_i ∈ s} p(w_i|w_{i-1}, M)
   - N-gram Language Model == (N-1)-order Markov Chain
   - Mixture-unigram Language Model: p(w) = Σ_z p(z) ∏_{n=1}^{N} p(w_n|z)
   - Bag of Words (BoW): no order, no grammar, only multiplicity
   [Plate diagrams: the unigram model (w, N, M) and the mixture-unigram model with a latent topic z]
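The unigram and bigram probabilities above are just normalized counts. A minimal sketch (not from the slides, with an invented toy corpus) might look like this:

```python
# Minimal sketch: maximum-likelihood unigram and bigram language models
# estimated from a toy corpus. The corpus and variable names are assumptions.
from collections import Counter

corpus = ["the cat sat on the mat", "the dog sat on the log"]
tokens = [s.split() for s in corpus]

# Unigram model (zero-order Markov chain): p(w | M) = count(w) / total tokens
unigram = Counter(w for sent in tokens for w in sent)
total = sum(unigram.values())
p_unigram = {w: c / total for w, c in unigram.items()}

# Bigram model (first-order Markov chain): p(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})
bigram = Counter((sent[i - 1], sent[i]) for sent in tokens for i in range(1, len(sent)))
p_bigram = {(u, v): c / unigram[u] for (u, v), c in bigram.items()}

# Sentence probability under the unigram (bag-of-words) assumption
def p_sentence_unigram(sentence):
    p = 1.0
    for w in sentence.split():
        p *= p_unigram.get(w, 1e-12)  # tiny floor for unseen words
    return p

print(p_sentence_unigram("the cat sat on the log"))
```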
8. Vector Space Model
   - A document is represented as a vector of identifiers
   - Identifier
     - Boolean: 0, 1
     - Term Count: how many times the term occurs
     - Term Frequency: how frequent the term is in this document
     - TF-IDF: how important the term is in the corpus → most used
   - Relevance Ranking: d_j = (w_{1,j}, w_{2,j}, ..., w_{t,j}), q = (w_{1,q}, w_{2,q}, ..., w_{t,q}), cos θ = (d_j · q) / (‖d_j‖ ‖q‖)
   - First used in SMART (Gerard Salton, Cornell); Gerard Salton Award (SIGIR)
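A minimal sketch of the cosine-similarity ranking described above; the toy documents and query are assumptions, not from the slides:

```python
# Minimal sketch: documents and a query as term-count vectors, ranked by cosine similarity.
import math
from collections import Counter

docs = ["topic model for text mining", "language model for retrieval"]
query = "topic model"

vocab = sorted({w for d in docs + [query] for w in d.split()})

def to_vector(text):
    counts = Counter(text.split())
    return [counts[w] for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

q = to_vector(query)
for d in docs:
    print(d, "->", round(cosine(to_vector(d), q), 3))
```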
9. TF-IDF
   - Mixture language model
     - Linear combination of a certain distribution (e.g., Gaussian)
     - Better performance
   - TF (Term Frequency): tf_ij = n_ij / Σ_k n_kj, where n_ij is the count of term i in document j. How important the term is in this document.
   - IDF (Inverse Document Frequency): idf_i = log( N / (1 + |{d ∈ D : t_i ∈ d}|) ), with N documents in the corpus. How important the term is in this corpus.
   - TF-IDF: tf-idf(t_i, d_j, D) = tf_ij × idf_i
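A minimal sketch of the tf-idf weighting defined above, using the slide's formulas tf_ij = n_ij / Σ_k n_kj and idf_i = log(N / (1 + |{d : t_i ∈ d}|)); the toy corpus is an assumption:

```python
# Minimal sketch of tf-idf with the smoothed idf from the slide. Note that with
# the +1 in the denominator, idf can go slightly negative for very common terms.
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]
N = len(docs)
tokenized = [d.split() for d in docs]

# document frequency df_i = |{d in D : t_i in d}|
df = Counter()
for toks in tokenized:
    df.update(set(toks))

def tf_idf(doc_tokens):
    counts = Counter(doc_tokens)
    total = sum(counts.values())
    return {t: (c / total) * math.log(N / (1 + df[t])) for t, c in counts.items()}

for toks in tokenized:
    print(tf_idf(toks))
```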
10. Latent Semantic Indexing
   - Challenges (defects of VSM)
     - Compare documents in the same concept space
     - Compare documents across languages
     - Synonymy, e.g., buy - purchase, user - consumer
     - Polysemy, e.g., book - book, draw - draw
   - Key Idea
     - Dimensionality reduction of the word-document co-occurrence matrix
     - Construction of a latent semantic space
   [Diagram: VSM maps words directly to documents; LSI inserts a concept (aspect / topic / latent factor) layer between words and documents]
11. Singular Value Decomposition
   - LSI ~= SVD: N = U Σ V^T
     - U, V: orthogonal matrices
     - Σ: the diagonal matrix with the singular values of N
   - N is the t × d term-document matrix (entries: count, frequency, or TF-IDF); U is t × m, Σ is m × m, V^T is m × d
   - Truncation: keep only the k largest singular values (k < m or k << m), so U is t × k, Σ is k × k, V^T is k × d
   - Words are exchangeable
12. Singular Value Decomposition (cont.)
   - The k largest singular values
     - Distinguish the variance between words and documents to the greatest extent
   - Discarding the lowest dimensions
     - Reduces noise
   - Filling the matrix
     - Prediction & lower computational complexity
     - Enlarges the distinctiveness
   - Decomposition
     - Concept, semantic, topic (aspect)
   - Related: (Probabilistic) Matrix Factorization / factorization models; SVD has an analytic solution; unsupervised learning
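A minimal LSI sketch (not from the slides): truncated SVD of a small term-document matrix with numpy, keeping only the k largest singular values; the matrix values and k are assumptions:

```python
# Minimal sketch: LSI via truncated SVD of a term-document count matrix N = U Σ V^T.
import numpy as np

# rows = terms, columns = documents (illustrative counts)
N = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 0, 1],
    [0, 0, 1, 2],
], dtype=float)

U, s, Vt = np.linalg.svd(N, full_matrices=False)

k = 2  # keep the k largest singular values (k < m)
N_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k approximation (noise reduced)
doc_coords = np.diag(s[:k]) @ Vt[:k, :]       # documents in the k-dim latent space

print(np.round(N_k, 2))
print(np.round(doc_coords, 2))
```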
13. Probabilistic Latent Semantic Indexing
   - pLSI Model: p(d), p(z|d), p(w|z)
   [Diagram: documents d_1..d_M connect to latent topics z_1..z_K, which connect to words w_1..w_N]
   - Assumptions
     - Pairs (d, w) are generated independently
     - Conditioned on z, w is generated independently of d
     - Words in a document are exchangeable
     - Documents are exchangeable
     - Latent topics z are independent
   - Generative process/model:
     p(d, w) = p(d) p(w|d) = p(d) Σ_{z∈Z} p(w, z|d) = p(d) Σ_{z∈Z} p(z|d) p(w|z)
   - p(z|d) and p(w|z) are multinomial distributions; p(z|d) is local (per document), p(w|z) is global
   - Can be viewed as one layer of a 'deep neural network'
14. Probabilistic Latent Semantic Indexing (cont.)
   - Formulation 1: p(w|d) = Σ_{z∈Z} p(z|d) p(w|z)
   - Formulation 2: p(d, w) = Σ_{z∈Z} p(w, d, z) = Σ_{z∈Z} p(w, d|z) p(z) = Σ_{z∈Z} p(w|z) p(d|z) p(z)
   - These are two ways to formulate pLSA; they are equivalent under Bayes' rule but lead to two different inference processes
   - Probabilistic graphical model: a directed acyclic graph (DAG); d is exchangeable
   [Plate diagrams of the two formulations (d, z, w with plates N and M)]
15. Expectation-Maximization
   - EM is a general algorithm for maximum-likelihood estimation (MLE) where the data are 'incomplete' or contain latent variables: pLSA, GMM, HMM, ... (cross-domain :))
   - Derivation
     - θ: parameter to be estimated; θ^0: initialized randomly; θ^n: the current value; θ^{n+1}: the next value
     - Objective: θ^{n+1} = argmax_θ [ L(θ) − L(θ^n) ]
     - L(θ) = log p(X|θ)
     - L_c(θ) = log p(X, H|θ), where H is the latent variable
     - L_c(θ) = log p(X, H|θ) = log p(X|θ) + log p(H|X, θ) = L(θ) + log p(H|X, θ)
     - L(θ) − L(θ^n) = L_c(θ) − L_c(θ^n) + log [ p(H|X, θ^n) / p(H|X, θ) ]
16. Expectation-Maximization (cont.)
   - Taking the expectation with respect to p(H|X, θ^n):
     L(θ) − L(θ^n) = Σ_H L_c(θ) p(H|X, θ^n) − Σ_H L_c(θ^n) p(H|X, θ^n) + Σ_H p(H|X, θ^n) log [ p(H|X, θ^n) / p(H|X, θ) ]
   - The last term is a Kullback-Leibler divergence (relative entropy), which is non-negative, so
     L(θ) − L(θ^n) ≥ Σ_H L_c(θ) p(H|X, θ^n) − Σ_H L_c(θ^n) p(H|X, θ^n)   (lower bound)
   - Q-function: Q(θ; θ^n) = E_{p(H|X, θ^n)}[L_c(θ)] = Σ_H L_c(θ) p(H|X, θ^n)
   - E-step (expectation): compute Q; M-step (maximization): re-estimate θ by maximizing Q; iterate until convergence
   - How is EM used in pLSA?
17. EM in pLSA
   - Q-function, with the posterior p(z_k|d_i, w_j) computed from the current parameter values (random values at initialization); the likelihood comes from log ∏_{d,w} p(w|d)^{n(d,w)}:
     Q(θ; θ^n) = Σ_{i=1}^{N} Σ_{j=1}^{M} n(d_i, w_j) Σ_{k=1}^{K} p(z_k|d_i, w_j) log [ p(w_j|z_k) p(z_k|d_i) ]
   - Constraints: Σ_{j=1}^{M} p(w_j|z_k) = 1 and Σ_{k=1}^{K} p(z_k|d_i) = 1 → Lagrange multipliers:
     H = E[L_c] + Σ_{k=1}^{K} τ_k ( 1 − Σ_{j=1}^{M} p(w_j|z_k) ) + Σ_{i=1}^{N} ρ_i ( 1 − Σ_{k=1}^{K} p(z_k|d_i) )
   - Setting the partial derivatives to 0 gives the M-step:
     p(w_j|z_k) = Σ_{i=1}^{N} n(d_i, w_j) p(z_k|d_i, w_j) / Σ_{m=1}^{M} Σ_{i=1}^{N} n(d_i, w_m) p(z_k|d_i, w_m)
     p(z_k|d_i) = Σ_{j=1}^{M} n(d_i, w_j) p(z_k|d_i, w_j) / n(d_i)
   - E-step (Bayes' rule, with the associative and distributive laws):
     p(z_k|d_i, w_j) = p(w_j|z_k) p(z_k|d_i) / Σ_{l=1}^{K} p(w_j|z_l) p(z_l|d_i)
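A minimal numpy sketch of the pLSA EM updates above; the toy count matrix n(d, w), the number of topics, and the iteration count are assumptions:

```python
# Minimal sketch of EM for pLSA with randomly initialized p(z|d) and p(w|z).
import numpy as np

rng = np.random.default_rng(0)
n = rng.integers(0, 5, size=(6, 10)).astype(float)   # n(d_i, w_j): 6 docs x 10 words
M, V, K = n.shape[0], n.shape[1], 3                   # docs, vocabulary, topics

p_z_d = rng.random((M, K)); p_z_d /= p_z_d.sum(1, keepdims=True)   # p(z_k | d_i)
p_w_z = rng.random((K, V)); p_w_z /= p_w_z.sum(1, keepdims=True)   # p(w_j | z_k)

for _ in range(50):
    # E-step: p(z_k | d_i, w_j) ∝ p(w_j | z_k) p(z_k | d_i)
    post = p_z_d[:, :, None] * p_w_z[None, :, :]          # shape (M, K, V)
    post /= post.sum(axis=1, keepdims=True) + 1e-12
    # M-step
    nz = n[:, None, :] * post                              # n(d, w) * p(z | d, w)
    p_w_z = nz.sum(axis=0)
    p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12      # new p(w_j | z_k)
    p_z_d = nz.sum(axis=2)
    p_z_d /= n.sum(axis=1, keepdims=True) + 1e-12          # new p(z_k | d_i)

print(np.round(p_z_d, 3))
```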
18. Bayesian Theory vs. Probability Theory
   - Bayesian theory vs. probability (frequentist) theory
     - Estimate θ through the posterior vs. estimate θ through maximization of the likelihood
     - Bayesian theory → prior vs. probability theory → statistic
     - When the number of samples → ∞, Bayesian theory == probability theory
   - Parameter Estimation
     - p(θ|D) ∝ p(D|θ) p(θ) → what is p(θ)? → conjugate prior → the likelihood is helpful, but its role is limited → otherwise?
   - Non-parametric Bayesian methods (complicated)
     - Kernel methods: I just know a little...
     - VSM → CF → MF → pLSA → LDA → non-parametric Bayesian → deep learning
19. Latent Dirichlet Allocation
   - Latent Dirichlet Allocation (LDA)
     - David M. Blei (ACM-Infosys Award), Andrew Y. Ng, Michael I. Jordan
     - Journal of Machine Learning Research, 2003, cited > 3000
     - Hierarchical Bayesian model; Bayesian pLSI
   [Plate diagram: α → θ → z → w, with β; plates N (words) and M (documents)]
   - Generative process of a document d in a corpus according to LDA:
     1. Choose N ~ Poisson(ξ)  → why?
     2. For each document d = {w_1, w_2, ..., w_n}, choose θ ~ Dir(α)  → why?
     3. For each of the N words w_n in d:
        a) choose a topic z_n ~ Multinomial(θ)  → why?
        b) choose a word w_n from p(w_n|z_n, β), a multinomial probability conditioned on z_n  → why?
20. Latent Dirichlet Allocation (cont.)
   [Plate diagram: α → θ → z → w, with a topic-word plate φ of size K and hyper-parameter β]
   - Generative process of a document d in LDA:
     1. Choose N ~ Poisson(ξ)  → not important
     2. For each document d = {w_1, w_2, ..., w_n}, choose θ ~ Dir(α); θ = (θ_1, θ_2, ..., θ_K), |θ| = K, K is fixed, Σ_{k=1}^{K} θ_k = 1; the Dirichlet is the conjugate prior of the multinomial (Dirichlet-Multinomial)
     3. For each of the N words w_n in d:
        a) choose a topic z_n ~ Multinomial(θ)
        b) choose a word w_n from p(w_n|z_n, β), a multinomial probability conditioned on z_n
   - One word ↔ one topic; one document ↔ multiple topics; θ = (θ_1, ..., θ_K), z = (z_1, ..., z_K); for each word w_n there is a z_n
   - In pLSA the number of p(z|d) parameters grows linearly with the number of documents → overfitting; the Dirichlet prior acts as regularization
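A minimal simulation of the generative process above for a single document; alpha, beta, the vocabulary size, and the Poisson rate are assumptions (here beta is itself drawn from a Dirichlet, which corresponds to the smoothed variant rather than the exact model in the slide):

```python
# Minimal sketch: simulating LDA's generative process for one document.
import numpy as np

rng = np.random.default_rng(0)
K, V = 3, 8                      # topics, vocabulary size (assumed)
alpha = np.full(K, 0.5)          # Dirichlet hyper-parameter for theta
beta = rng.dirichlet(np.full(V, 0.1), size=K)   # K topic-word distributions (assumed)

N = rng.poisson(10)              # choose N ~ Poisson(xi)
theta = rng.dirichlet(alpha)     # choose theta ~ Dir(alpha)

doc = []
for _ in range(N):
    z = rng.choice(K, p=theta)          # z_n ~ Multinomial(theta)
    w = rng.choice(V, p=beta[z])        # w_n ~ Multinomial(beta_z)
    doc.append((z, w))

print("theta =", np.round(theta, 2))
print("document (topic, word id):", doc)
```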
21. Latent Dirichlet Allocation (cont.)
   [Figure-only slide]
22. Conjugate Prior & Distributions
   - Conjugate prior: if the posterior p(θ|x) is in the same family as the prior p(θ), the prior and posterior are called conjugate distributions, and the prior is called a conjugate prior of the likelihood p(x|θ): p(θ|x) ∝ p(x|θ) p(θ)
   - Distributions
     - Binomial Distribution ←→ Beta Distribution
     - Multinomial Distribution ←→ Dirichlet Distribution
   - Binomial & Beta Distribution
     - Binomial (likelihood): Bin(m|N, θ) = C(N, m) θ^m (1−θ)^{N−m}, with C(N, m) = N! / ((N−m)! m!)
     - Beta(θ|a, b) = Γ(a+b) / (Γ(a) Γ(b)) θ^{a−1} (1−θ)^{b−1}, where Γ(a) = ∫_0^∞ t^{a−1} e^{−t} dt
   - Why do the prior and posterior need to be conjugate distributions?
23. Conjugate Prior & Distributions (cont.)
   - Parameter estimation for the Binomial-Beta pair: with m successes and l failures,
     p(θ|m, l, a, b) ∝ C(m+l, m) θ^m (1−θ)^l × Γ(a+b) / (Γ(a) Γ(b)) θ^{a−1} (1−θ)^{b−1}
     p(θ|m, l, a, b) = Γ(m+a+l+b) / (Γ(m+a) Γ(l+b)) θ^{m+a−1} (1−θ)^{l+b−1}  → a Beta distribution!
   - Multinomial & Dirichlet Distribution
     - x is a multivariate indicator, e.g., x = (0, 0, 1, 0, 0, 0): the event x_3 happens
     - The probability distribution of x in only one event: p(x|θ) = ∏_{k=1}^{K} θ_k^{x_k}, θ = (θ_1, θ_2, ..., θ_K)
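A minimal sketch of the Beta-Binomial update above (prior Beta(a, b) plus m successes and l failures gives Beta(a + m, b + l)); the prior values and data are assumptions:

```python
# Minimal sketch: Beta prior + Binomial likelihood -> Beta posterior.
from scipy import stats

a, b = 2.0, 2.0          # Beta prior hyper-parameters (assumed)
m, l = 7, 3              # observed successes and failures (assumed)

posterior = stats.beta(a + m, b + l)
print("posterior mean:", posterior.mean())    # (a+m) / (a+m+b+l)
print("MLE estimate:  ", m / (m + l))         # ignores the prior
```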
24. Conjugate Prior & Distributions (cont.)
   - Multinomial & Dirichlet Distribution (cont.)
     - Mult(m_1, m_2, ..., m_K | θ, N) = N! / (m_1! m_2! ... m_K!) ∏_{k=1}^{K} θ_k^{m_k}: the likelihood function of θ
     - Mult is the exact probability distribution of p(z_k|d_j) and p(w_j|z_k)
     - In Bayesian theory we need a conjugate prior of θ for Mult, where 0 < θ_k < 1 and Σ_{k=1}^{K} θ_k = 1 → the Dirichlet distribution:
       Dir(θ|α) = Γ(α_0) / (Γ(α_1) ... Γ(α_K)) ∏_{k=1}^{K} θ_k^{α_k−1}, with α a vector and α_0 = Σ_k α_k
     - Hyper-parameter: a parameter in the probability distribution function (pdf) of the prior
25. Conjugate Prior & Distributions (cont.)
   - Multinomial & Dirichlet Distribution (cont.)
     - p(θ|m, α) ∝ p(m|θ) p(θ|α) ∝ ∏_{k=1}^{K} θ_k^{α_k+m_k−1}  → Dirichlet?
     - p(θ|m, α) = Dir(θ|m + α) = Γ(α_0 + N) / (Γ(α_1 + m_1) ... Γ(α_K + m_K)) ∏_{k=1}^{K} θ_k^{α_k+m_k−1}  → Dirichlet! (why? → the Gamma function Γ is a mysterious function)
   - Expectations
     - p ~ Beta(t|α, β) → E[p] = ∫_0^1 t · Γ(α+β)/(Γ(α)Γ(β)) t^{α−1}(1−t)^{β−1} dt = α / (α+β)
     - p ~ Dir(θ|α) → E[p] = ( α_1 / Σ_{i=1}^{K} α_i, α_2 / Σ_{i=1}^{K} α_i, ..., α_K / Σ_{i=1}^{K} α_i )
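The Dirichlet-Multinomial case works the same way; a tiny sketch with an assumed prior and assumed counts:

```python
# Minimal sketch: Dir(alpha) prior + multinomial counts m -> Dir(alpha + m),
# with E[theta_k] = (alpha_k + m_k) / (alpha_0 + N).
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])      # symmetric Dirichlet prior over K = 3 outcomes (assumed)
m = np.array([5, 2, 1])                # observed multinomial counts (assumed)

posterior_alpha = alpha + m
posterior_mean = posterior_alpha / posterior_alpha.sum()
print("E[theta | m, alpha] =", np.round(posterior_mean, 3))
```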
26. Poisson Distribution
   - Why a Poisson distribution?
     - The number of births per hour during a given day; the number of particles emitted by a radioactive source in a given time; the number of cases of a disease in different towns
     - For Bin(n, p), when n is large and p is small: p(X = k) ≈ ξ^k e^{−ξ} / k!, with ξ ≈ np
     - Gamma(x|α) = x^{α−1} e^{−x} / Γ(α); with α = k + 1: Gamma(x|α = k+1) = x^k e^{−x} / k!  (since Γ(k+1) = k!)
     - (Poisson → discrete; Gamma → continuous)
   - Poisson distribution: p(k|ξ) = ξ^k e^{−ξ} / k!
     - Many experimental situations occur in which we observe the counts of events within a set unit of time, area, volume, length, etc.
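A quick numerical check (not from the slides) of the Poisson approximation to Bin(n, p) with ξ = np; the chosen n and p are assumptions:

```python
# Minimal sketch: for large n and small p, Bin(n, p) ≈ Poisson(xi) with xi = n*p.
from scipy import stats

n, p = 1000, 0.004
xi = n * p
for k in range(8):
    print(k, round(stats.binom.pmf(k, n, p), 5), round(stats.poisson.pmf(k, xi), 5))
```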
27. Solution for LDA
   - LDA (cont.)
     - α, β: corpus-level parameters
     - θ: document-level variable
     - z, w: word-level variables
     - A conditionally independent hierarchical model; a parametric Bayesian model
   [Figure: K × n topic-word probability matrix (p_11 ... p_Kn) and topic assignments z_1 ... z_n for words w_1 ... w_n]
   - Solving process (p(z_i|θ) = θ_i):
     p(θ, z, w|α, β) = p(θ|α) ∏_{n=1}^{N} p(z_n|θ) p(w_n|z_n, β)
     p(w|α, β) = ∫ p(θ|α) ∏_{n=1}^{N} Σ_{z_n} p(z_n|θ) p(w_n|z_n, β) dθ   (a multiple integral)
     p(D|α, β) = ∏_{d=1}^{M} ∫ p(θ_d|α) ∏_{n=1}^{N_d} Σ_{z_dn} p(z_dn|θ_d) p(w_dn|z_dn, β) dθ_d
28. Solution for LDA (cont.)
   - The most significant generative model in the machine learning community in the recent ten years
   - Rewritten in terms of the model parameters:
     p(w|α, β) = ∫ p(θ|α) ∏_{n=1}^{N} Σ_{z_n} p(z_n|θ) p(w_n|z_n, β) dθ
     p(w|α, β) = Γ(Σ_i α_i) / ∏_i Γ(α_i) ∫ ( ∏_{i=1}^{k} θ_i^{α_i−1} ) ∏_{n=1}^{N} ∏_{i=1}^{k} ∏_{j=1}^{V} (θ_i β_ij)^{w_n^j} dθ
   - α = (α_1, α_2, ..., α_K) and β ∈ R^{K×V}: what we need to solve for
   - Inference
     - Variational inference (deterministic inference): why? → simplify the dependency structure
     - Gibbs sampling (stochastic inference): why sampling? → approximate the statistical properties of the population with those of the samples
29. Variational Inference
   - Variational inference (inference through a variational distribution), VI
     - VI uses an approximating distribution that has a simpler dependency structure than that of the exact posterior distribution: P(H|D) ≈ Q(H), where P(H|D) is the true posterior and Q(H) is the variational distribution
   - Dissimilarity between P and Q? → Kullback-Leibler divergence:
     KL(Q‖P) = ∫ Q(H) log [ Q(H) / P(H|D) ] dH = ∫ Q(H) log [ Q(H) / P(H, D) ] dH + log P(D)
   - Define the lower bound
     L ≝ ∫ Q(H) log P(H, D) dH − ∫ Q(H) log Q(H) dH = ⟨log P(H, D)⟩_{Q(H)} + H[Q]   (H[Q] is the entropy of Q)
30. Variational Inference (cont.)
   - For LDA: P(H|D) = p(θ, z|w, α, β); the variational distribution is Q(H) = q(θ, z|γ, φ) = q(θ|γ) q(z|φ) = q(θ|γ) ∏_{n=1}^{N} q(z_n|φ_n)
     (θ and z are treated as (approximately) independent to facilitate computation)
   - (γ*, φ*) = argmin_{γ, φ} KL( q(θ, z|γ, φ) ‖ p(θ, z|w, α, β) ): but we don't know the exact analytical form of this KL
   - Jensen's inequality gives a lower bound:
     log p(w|α, β) = log ∫ Σ_z p(θ, z, w|α, β) dθ = log ∫ Σ_z p(θ, z, w|α, β) q(θ, z) / q(θ, z) dθ
                   ≥ ∫ Σ_z q(θ, z) log [ p(θ, z, w|α, β) / q(θ, z) ] dθ = E_q[log p(θ, z, w|α, β)] − E_q[log q(θ, z)] = L(γ, φ; α, β)
   - log p(w|α, β) = L(γ, φ; α, β) + KL → minimizing the KL == maximizing L
31. Variational Inference (cont.)
   L(γ, φ; α, β) = E_q[log p(θ|α)] + E_q[log p(z|θ)] + E_q[log p(w|z, β)] − E_q[log q(θ)] − E_q[log q(z)]
   E_q[log p(θ|α)] = Σ_{i=1}^{K} (α_i − 1) E_q[log θ_i] + log Γ(Σ_{i=1}^{K} α_i) − Σ_{i=1}^{K} log Γ(α_i)
   E_q[log θ_i] = ψ(γ_i) − ψ(Σ_{j=1}^{K} γ_j)
   E_q[log p(z|θ)] = Σ_{n=1}^{N} Σ_{i=1}^{K} E_q[z_ni] E_q[log θ_i] = Σ_{n=1}^{N} Σ_{i=1}^{K} φ_ni ( ψ(γ_i) − ψ(Σ_{j=1}^{K} γ_j) )
   E_q[log p(w|z, β)] = Σ_{n=1}^{N} Σ_{i=1}^{K} Σ_{j=1}^{V} E_q[z_ni] w_n^j log β_ij = Σ_{n=1}^{N} Σ_{i=1}^{K} Σ_{j=1}^{V} φ_ni w_n^j log β_ij
32. Variational Inference (cont.)
   E_q[log q(θ|γ)] has the same form as E_q[log p(θ|α)] (with γ in place of α)
   E_q[log q(z|φ)] = E_q[ Σ_{n=1}^{N} Σ_{i=1}^{K} z_ni log φ_ni ] = Σ_{n=1}^{N} Σ_{i=1}^{K} φ_ni log φ_ni
   Maximize L with respect to φ_ni, using a Lagrange multiplier for Σ_{i=1}^{K} φ_ni = 1:
     L[φ_ni] = φ_ni ( ψ(γ_i) − ψ(Σ_{j=1}^{K} γ_j) ) + φ_ni log β_ij − φ_ni log φ_ni + λ ( Σ_{i=1}^{K} φ_ni − 1 )
   Taking the derivative with respect to φ_ni:
     ∂L/∂φ_ni = ψ(γ_i) − ψ(Σ_{j=1}^{K} γ_j) + log β_ij − log φ_ni − 1 + λ = 0
     → φ_ni ∝ β_ij exp( ψ(γ_i) − ψ(Σ_{j=1}^{K} γ_j) )
33. Variational Inference (cont.)
   - You can refer to the original paper for more details. :)
   - Variational EM algorithm
     - Aim: (α*, β*) = argmax Σ_{d=1}^{M} log p(w_d|α, β)
     - Initialize α, β
     - E-step: for each document, find the optimal variational parameters (γ_d*, φ_d*) by variational inference; this gives a tractable approximation (lower bound) of the likelihood
     - M-step: maximize the lower bound with respect to α, β
     - Repeat until convergence
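A minimal sketch of the per-document variational E-step, using the updates derived above (φ_ni ∝ β_{i,w_n} exp(ψ(γ_i)), γ_i = α_i + Σ_n φ_ni); α, β, and the document are assumptions, and the M-step is omitted:

```python
# Minimal sketch: variational E-step for a single LDA document.
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
K, V = 3, 10
alpha = np.full(K, 0.1)                         # assumed hyper-parameter
beta = rng.dirichlet(np.full(V, 0.1), size=K)   # assumed K x V topic-word matrix
doc = rng.integers(0, V, size=20)               # word ids of one assumed document

gamma = alpha + len(doc) / K                    # initialization
phi = np.full((len(doc), K), 1.0 / K)

for _ in range(100):
    # update phi for every word position n (the psi(sum gamma) term cancels in the normalization)
    log_phi = np.log(beta[:, doc].T + 1e-12) + digamma(gamma)[None, :]
    phi = np.exp(log_phi - log_phi.max(axis=1, keepdims=True))
    phi /= phi.sum(axis=1, keepdims=True)
    # update gamma
    gamma = alpha + phi.sum(axis=0)

print("gamma =", np.round(gamma, 2))            # variational Dirichlet parameters for this document
```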
34. Markov Chain Monte Carlo
   - MCMC → basics: Markov chain (first-order) → stationary distribution → the foundation of Gibbs sampling
   - General: P(X_{t+n} = x | X_1, X_2, ..., X_t) = P(X_{t+n} = x | X_t)
   - First-order: P(X_{t+1} = x | X_1, X_2, ..., X_t) = P(X_{t+1} = x | X_t)
   - One-step transition probability matrix:
     P = [ p(1|1)   p(2|1)   ...  p(|S||1)
           p(1|2)   p(2|2)   ...  p(|S||2)
           ...
           p(1||S|) p(2||S|) ...  p(|S|||S|) ]
   [Diagram: transition from X_m to X_{m+1}]
35. Markov Chain Monte Carlo (cont.)
   - Markov chain
     - Initial distribution: π_0 = {π_0(1), π_0(2), ..., π_0(|S|)}
     - π_n = π_{n−1} P = π_{n−2} P^2 = ... = π_0 P^n: the Chapman-Kolmogorov equation
     - Convergence theorem: under the premise of connectivity of P, lim_{n→∞} P^n_{ij} = π(j), and π(j) = Σ_{i=1}^{|S|} π(i) P_ij
     - lim_{n→∞} π_0 P^n has identical rows (π(1), ..., π(|S|)) → π = {π(1), π(2), ..., π(j), ..., π(|S|)} is the stationary distribution
   - X_0 ~ π_0(x) → X_1 ~ π_1(x) → ... → X_n ~ π(x) → X_{n+1} ~ π(x) → X_{n+2} ~ π(x) → ...: after convergence, every sample is drawn from the stationary distribution
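A minimal sketch (not from the slides) of π_n = π_0 P^n converging to the stationary distribution; the transition matrix and initial distribution are assumptions:

```python
# Minimal sketch: iterating pi_n = pi_{n-1} P until it converges to pi = pi P.
import numpy as np

P = np.array([
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.4, 0.5],
])
pi = np.array([1.0, 0.0, 0.0])     # pi_0: start in state 0 with probability 1

for n in range(100):
    pi = pi @ P                     # Chapman-Kolmogorov: pi_n = pi_0 P^n

print("stationary pi ≈", np.round(pi, 4))
print("check pi P    =", np.round(pi @ P, 4))   # unchanged once converged
```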
36. Markov Chain Monte Carlo (cont.)
   - MCMC sampling
     - We need to relate the target distribution π(x) to the MC transition process → detailed balance condition
     - In a common MC with transition matrix P, if π(i) P_ij = π(j) P_ji for all i, j, then π(x) is the stationary distribution of this MC (a sufficient condition)
     - Proof: Σ_i π(i) P_ij = Σ_i π(j) P_ji = π(j) → πP = π → π is the solution of the equation πP = π → done
     - For a common MC with proposal q(i, j) (also written q(j|i) or q(i→j)), and for any probability distribution p(x) (the dimension of x is arbitrary), introduce an acceptance probability α(i, j) so that
       p(i) q(i, j) α(i, j) = p(j) q(j, i) α(j, i), with α(i, j) = p(j) q(j, i) and α(j, i) = p(i) q(i, j);
       the transformed kernels are Q'(i, j) = q(i, j) α(i, j) and Q'(j, i) = q(j, i) α(j, i)
37. Markov Chain Monte Carlo (cont.)
   - MCMC sampling (cont.)
     Step 1: initialize X_0 = x_0
     Step 2: for t = 0, 1, 2, ...:
       X_t = x_t, sample y from q(x|x_t) (y ∈ domain of definition)
       sample u from Uniform[0, 1]
       if u < α(x_t, y) = p(y) q(x_t|y), accept the move x_t → y, i.e., X_{t+1} = y
       else X_{t+1} = x_t
   - Metropolis-Hastings sampling
     Step 1: initialize X_0 = x_0
     Step 2: for t = 0, 1, 2, ..., n, n+1, n+2, ... (burn-in period, then convergence):
       X_t = x_t, sample y from q(x|x_t) (y ∈ domain of definition); accept or reject with the MH acceptance ratio given on the next slide
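A minimal Metropolis-Hastings sketch (not from the slides), including the acceptance rule stated on the next slide, with a symmetric Gaussian proposal so the q terms cancel in the ratio; the target density, proposal width, and burn-in length are assumptions:

```python
# Minimal sketch: Metropolis-Hastings for a 1-D (unnormalized) mixture density p(x).
import numpy as np

rng = np.random.default_rng(0)

def p(x):   # unnormalized target density (assumed)
    return 0.3 * np.exp(-0.5 * (x + 2) ** 2) + 0.7 * np.exp(-0.5 * (x - 2) ** 2)

x = 0.0
samples = []
for t in range(20000):
    y = x + rng.normal(0.0, 1.0)                 # sample y from q(x | x_t)
    accept = min(p(y) / p(x), 1.0)               # symmetric proposal: q terms cancel
    if rng.uniform() < accept:                   # u < alpha(x_t, y)
        x = y
    if t >= 2000:                                # discard the burn-in period
        samples.append(x)

print("mean of target ≈", round(float(np.mean(samples)), 3))
```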
38. Gibbs Sampling
   - (Metropolis-Hastings, cont.)
       sample u from Uniform[0, 1]
       if u < α(x_t, y) = min{ p(y) q(x_t|y) / ( p(x_t) q(y|x_t) ), 1 }, accept x_t → y, i.e., X_{t+1} = y
       else X_{t+1} = x_t
   - MH is not well suited to high-dimensional variables
   - Gibbs sampling (two dimensions, starting from (x_1, y_1))
     - A(x_1, y_1), B(x_1, y_2):
       p(x_1, y_1) p(y_2|x_1) = p(x_1) p(y_1|x_1) p(y_2|x_1)
       p(x_1, y_2) p(y_1|x_1) = p(x_1) p(y_2|x_1) p(y_1|x_1)
       → p(x_1, y_1) p(y_2|x_1) = p(x_1, y_2) p(y_1|x_1), i.e., p(A) p(y_2|x_1) = p(B) p(y_1|x_1)
     - Similarly, for C(x_2, y_1): p(A) p(x_2|y_1) = p(C) p(x_1|y_1)
39. Gibbs Sampling (cont.)
   - We can construct the transition probability matrix Q accordingly (points A(x_1, y_1), B(x_1, y_2), C(x_2, y_1), D):
     Q(A → B) = p(y_B|x_1), if x_A = x_B = x_1
     Q(A → C) = p(x_C|y_1), if y_A = y_C = y_1
     Q(A → D) = 0, otherwise
   - The detailed balance condition p(X) Q(X → Y) = p(Y) Q(Y → X) holds √
   - Gibbs sampling (in two dimensions)
     Step 1: initialize X_0 = x_0, Y_0 = y_0
     Step 2: for t = 0, 1, 2, ...:
       1. y_{t+1} ~ p(y|x_t)
       2. x_{t+1} ~ p(x|y_{t+1})
40. Gibbs Sampling (cont.)
   - Gibbs sampling (in n dimensions)
     Step 1: initialize X_0 = x_0 = {x_i : i = 1, 2, ..., n}
     Step 2: for t = 0, 1, 2, ...:
       1. x_1^{(t+1)} ~ p(x_1 | x_2^{(t)}, x_3^{(t)}, ..., x_n^{(t)})
       2. x_2^{(t+1)} ~ p(x_2 | x_1^{(t+1)}, x_3^{(t)}, ..., x_n^{(t)})
       3. ...
       4. x_j^{(t+1)} ~ p(x_j | x_1^{(t+1)}, ..., x_{j−1}^{(t+1)}, x_{j+1}^{(t)}, ..., x_n^{(t)})
       5. ...
       6. x_n^{(t+1)} ~ p(x_n | x_1^{(t+1)}, x_2^{(t+1)}, ..., x_{n−1}^{(t+1)})
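A minimal Gibbs-sampling sketch (not from the slides) for a two-dimensional Gaussian with correlation ρ, alternating the two conditional draws exactly as in the two-dimensional algorithm above; the correlation and iteration counts are assumptions:

```python
# Minimal sketch: Gibbs sampling for a standard bivariate normal with correlation rho.
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8
x, y = 0.0, 0.0
samples = []

for t in range(10000):
    x = rng.normal(rho * y, np.sqrt(1 - rho ** 2))   # draw from p(x | y)
    y = rng.normal(rho * x, np.sqrt(1 - rho ** 2))   # draw from p(y | x)
    samples.append((x, y))

samples = np.array(samples[1000:])                    # drop the burn-in
print("empirical correlation ≈", round(float(np.corrcoef(samples.T)[0, 1]), 3))
```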
41. Gibbs Sampling for LDA
   - Gibbs sampling in LDA
     - Dir(p|α) = 1/Δ(α) ∏_{k=1}^{V} p_k^{α_k−1}, where Δ(α) is the normalization factor: Δ(α) = ∫ ∏_{k=1}^{V} p_k^{α_k−1} dp
     - p(z_m|α) = ∫ p(z_m|θ) p(θ|α) dθ = ∫ ∏_{k=1}^{V} θ_k^{n_k} Dir(θ|α) dθ
                = ∫ ∏_{k=1}^{V} θ_k^{n_k} · 1/Δ(α) ∏_{k=1}^{V} θ_k^{α_k−1} dθ = 1/Δ(α) ∫ ∏_{k=1}^{V} θ_k^{n_k+α_k−1} dθ = Δ(n_m + α) / Δ(α)
     - p(z|α) = ∏_{m=1}^{M} p(z_m|α) = ∏_{m=1}^{M} Δ(n_m + α) / Δ(α)
       → p(w, z|α, β) = ∏_{k=1}^{K} Δ(n_k + β) / Δ(β) · ∏_{m=1}^{M} Δ(n_m + α) / Δ(α)
42. Gibbs Sampling for LDA (cont.)
   - Gibbs sampling in LDA
     - p(θ_m | z_¬i, w_¬i) = Dir(θ_m | n_{m,¬i} + α),  p(φ_k | z_¬i, w_¬i) = Dir(φ_k | n_{k,¬i} + β)
     - p(z_i = k | z_¬i, w_¬i) ∝ p(z_i = k, w_i = t, θ_m, φ_k | z_¬i, w_¬i) = E[θ_mk] · E[φ_kt] = θ̂_mk · φ̂_kt
       θ̂_mk = ( n_{m,¬i}^{(k)} + α_k ) / Σ_{k=1}^{K} ( n_{m,¬i}^{(k)} + α_k ),   φ̂_kt = ( n_{k,¬i}^{(t)} + β_t ) / Σ_{t=1}^{V} ( n_{k,¬i}^{(t)} + β_t )
     - p(z_i = k | z_¬i, w) ∝ [ ( n_{m,¬i}^{(k)} + α_k ) / Σ_{k=1}^{K} ( n_{m,¬i}^{(k)} + α_k ) ] × [ ( n_{k,¬i}^{(t)} + β_t ) / Σ_{t=1}^{V} ( n_{k,¬i}^{(t)} + β_t ) ]
     - Resample z_i^{(t+1)} ~ p(z_i = k | z_¬i, w) for each token i
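A minimal collapsed Gibbs sampler built around the conditional p(z_i = k | z_¬i, w) above; the toy corpus and the scalar α, β values are assumptions:

```python
# Minimal sketch: collapsed Gibbs sampling for LDA on a tiny assumed corpus.
import numpy as np

rng = np.random.default_rng(0)
docs = [[0, 1, 2, 1], [2, 3, 3, 4], [0, 4, 1, 3]]   # word ids per document (assumed)
K, V = 2, 5
alpha, beta = 0.5, 0.1

n_mk = np.zeros((len(docs), K))      # topic counts per document
n_kt = np.zeros((K, V))              # word counts per topic
n_k = np.zeros(K)                    # total words per topic
z = []                               # current topic assignment of every token

for m, doc in enumerate(docs):       # random initialization
    z.append([])
    for w in doc:
        k = rng.integers(K)
        z[m].append(k)
        n_mk[m, k] += 1; n_kt[k, w] += 1; n_k[k] += 1

for _ in range(200):                 # Gibbs sweeps
    for m, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[m][i]              # remove token i from the counts (the "¬i" statistics)
            n_mk[m, k] -= 1; n_kt[k, w] -= 1; n_k[k] -= 1
            p = (n_mk[m] + alpha) * (n_kt[:, w] + beta) / (n_k + V * beta)
            k = rng.choice(K, p=p / p.sum())
            z[m][i] = k              # resample z_i and restore the counts
            n_mk[m, k] += 1; n_kt[k, w] += 1; n_k[k] += 1

print("per-document topic proportions:")
print(np.round((n_mk + alpha) / (n_mk + alpha).sum(1, keepdims=True), 2))
```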
43. Q&A

ร—