Graphical Models for the Internet

Transcript

  • 1. Graphical Models for the Internet Alexander Smola & Amr Ahmed Yahoo! Research & Australian National University Santa Clara, CA alex@smola.org blog.smola.org
  • 2. Outline • Part 1 - Motivation: automatic information extraction; application areas • Part 2 - Basic Tools: density estimation / conjugate distributions; directed graphical models and inference • Part 3 - Topic Models (our workhorse): statistical model; large scale inference (parallelization, particle filters) • Part 4 - Advanced Modeling: temporal dependence; mixing clustering and topic models; social networks; language models
  • 3. Part 1 - Motivation
  • 4. Data on the Internet • Webpages (content, graph) • Clicks (ad, page, social) • Users (OpenID, FB Connect) • e-mails (Hotmail, Y!Mail, Gmail) • Photos, movies (Flickr, YouTube, Vimeo ...) • Cookies / tracking info (see Ghostery) • Installed apps (Android market etc.) • Location (Latitude, Loopt, Foursquared) • User generated content (Wikipedia & co) • Ads (display, text, DoubleClick, Yahoo) • Comments (Disqus, Facebook) • Reviews (Yelp, Y!Local) • Third party features (e.g. Experian) • Social connections (LinkedIn, Facebook) • Purchase decisions (Netflix, Amazon) • Instant messages (YIM, Skype, Gtalk) • Search terms (Google, Bing) • Timestamp (everything) • News articles (BBC, NYTimes, Y!News) • Blog posts (Tumblr, Wordpress) • Microblogs (Twitter, Jaiku, Meme) • Finite resources: editors are expensive, editors don't know users, barrier to i18n, abuse (intrusions are novel); unlimited amounts of implicit feedback data • Consequently: data analysis (find interesting stuff rather than find x), integrating many systems, modular design for data integration, integrate with given prediction tasks; invest in modeling and naming rather than data generation
  • 5. Clustering documents
  • 6. Clustering documents airline university restaurant
  • 7. Today's mission: find hidden structure in the data; human understandable; improved knowledge for estimation
  • 8. Some applications
  • 9. Hierarchical Clustering NIPS 2010 Adams, Ghahramani, Jordan
  • 10. Topics in text. Latent Dirichlet Allocation; Blei, Ng, Jordan, JMLR 2003
  • 11. Word segmentation. Mochihashi, Yamada, Ueda, ACL 2009
  • 12. Language model automatically synthesized from Penn Treebank. Mochihashi, Yamada, Ueda, ACL 2009
  • 13. User model over time [plots: per-day topic proportions for two users (Baseball, Finance, Jobs, Dating, Celebrity, Health); table of top words per topic, e.g. Dating: women, men, dating, singles, personals; Baseball: league, baseball, basketball; Celebrity: Snooki, Tom Cruise, Katie Holmes; Health: skin, body, cells; Jobs: job, career, business, hiring; Finance: financial, chart, stock, trading, currency]. Ahmed et al., KDD 2011
  • 14. Face recognition from captions Jain, Learned-Miller, McCallum, ICCV 2007
  • 15. Storylines from news Ahmed et al, AISTATS 2011
  • 16. Ideology detection. Ahmed et al, 2010; Bitterlemons collection
  • 17. Hypertext topic extraction Gruber, Rosen-Zvi, Weiss; UAI 2008
  • 18. Alternatives
  • 19. Ontologies • continuous maintenance • no guarantee of coverage • difficult categories • expensive, small
  • 20. Face Classification • 100-1000 people • 10k faces • curated (not realistic) • expensive to generate
  • 21. Topic Detection & Tracking • editorially curated training data • expensive to generate • subjective in selection of threads • language specific
  • 22. Advertising Targeting • Needs training data in every language • Is it really relevant for better ads? • Does it cover relevant areas?
  • 23. Challenges• Scale • Millions to billions of instances (documents, clicks, users, messages, ads) • Rich structure of data (ontology, categories, tags) • Model description typically larger than memory of single workstation• Modeling • Usually clustering or topic models do not solve the problem • Temporal structure of data • Side information for variables • Solve problem. Don’t simply apply a model!• Inference • 10k-100k clusters for hierarchical model • 1M-100M words • Communication is an issue for large state space
  • 24. Summary - Part 1• Essentially infinite amount of data• Labeling is prohibitively expensive• Not scalable for i18n• Even for supervised problems unlabeled data abounds. Use it.• User-understandable structure for representation purposes• Solutions are often customized to problem We can only cover building blocks in tutorial.
  • 25. Part 2 - Basic Tools
  • 26. Statistics 101
  • 27. Probability • Space of events X: server status (working, slow, broken); income of the user (e.g. $95,000); search queries (e.g. “graphical models”) • Probability axioms (Kolmogorov): Pr(X) ∈ [0, 1], the whole event space has probability 1, and Pr(∪_i X_i) = ∑_i Pr(X_i) if X_i ∩ X_j = ∅ • Example queries: P(server working) = 0.999; P(90,000 ≤ income ≤ 100,000) = 0.1
  • 28. (In)dependence • Independence: Pr(x, y) = Pr(x) · Pr(y); e.g. login behavior of two users (approximately), disk crashes in different colos (approximately) • Dependent events: Pr(x, y) ≠ Pr(x) · Pr(y); e.g. emails, queries, news stream / Buzz / Tweets, IM communication, Russian roulette. Dependence is everywhere!
  • 29. Independence [2×2 joint probability table: 0.3, 0.2 / 0.3, 0.2; identical rows, so the two variables are independent]
  • 30. Dependence [2×2 joint probability table: 0.45, 0.05 / 0.05, 0.45; mass on the diagonal, so the two variables are dependent]
  • 31. A Graphical Model: Spam → Mail; p(spam, mail) = p(spam) · p(mail|spam)
  • 32. Bayes Rule • Joint probability: Pr(X, Y) = Pr(X|Y) Pr(Y) = Pr(Y|X) Pr(X) • Bayes rule: Pr(X|Y) = Pr(Y|X) · Pr(X) / Pr(Y) • Hypothesis testing • Reverse conditioning
  • 33. AIDS test (Bayes rule) • Data: approximately 0.1% are infected; the test detects all infections; the test reports positive for 1% of healthy people • Probability of having AIDS if the test is positive: Pr(a=1|t) = Pr(t|a=1) · Pr(a=1) / Pr(t) = Pr(t|a=1) · Pr(a=1) / [Pr(t|a=1) · Pr(a=1) + Pr(t|a=0) · Pr(a=0)] = 1 · 0.001 / (1 · 0.001 + 0.01 · 0.999) = 0.091
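The arithmetic on this slide is easy to check in a few lines. A minimal sketch in plain Python, using only the numbers stated above (0.1% prevalence, a test that catches every infection, 1% false positives):

```python
def posterior_infected(prior, sensitivity, false_positive_rate):
    """Pr(infected | test positive) via Bayes rule."""
    p_pos_and_infected = sensitivity * prior
    p_pos_and_healthy = false_positive_rate * (1.0 - prior)
    return p_pos_and_infected / (p_pos_and_infected + p_pos_and_healthy)

# Numbers from the slide: 0.1% prevalence, perfect detection, 1% false positives.
print(posterior_infected(prior=0.001, sensitivity=1.0, false_positive_rate=0.01))
# -> 0.0909..., i.e. roughly 9.1%
```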
  • 34. Improving the diagnosis • Use a follow-up test: test 2 reports positive for 90% of infections and for 5% of healthy people • With both tests positive, the probability of being healthy drops to 0.01 · 0.05 · 0.999 / (1 · 0.9 · 0.001 + 0.01 · 0.05 · 0.999) = 0.357, i.e. the probability of infection rises to about 0.64 • Why can't we use test 1 twice? Its outcomes are not independent, but tests 1 and 2 are conditionally independent: p(t_1, t_2|a) = p(t_1|a) · p(t_2|a)
  • 35. Application: Naive Bayes
  • 36. Naive Bayes Spam Filter • Key assumption: words occur independently of each other given the label of the document, p(w_1, ..., w_n|spam) = ∏_{i=1}^n p(w_i|spam) • Spam classification via Bayes rule: p(spam|w_1, ..., w_n) ∝ p(spam) ∏_{i=1}^n p(w_i|spam) • Parameter estimation: compute the spam probability and the word distributions for spam and ham
  • 37. A Graphical Model: spam → w_1, w_2, ..., w_n (plate over w_i, i = 1..n); how to estimate p(w|spam)? p(w_1, ..., w_n|spam) = ∏_{i=1}^n p(w_i|spam)
  • 38. Naive Naive Bayes Classifier • Two classes (spam/ham) • Binary features (e.g. presence of $$$, viagra) • Simplistic algorithm: count occurrences of each feature for spam/ham; count the number of spam/ham mails • Feature probability p(x_i = TRUE|y) = n(i, y)/n(y) and class probability p(y) = n(y)/n • Classification: p(y|x) ∝ (n(y)/n) · ∏_{i: x_i=TRUE} n(i, y)/n(y) · ∏_{i: x_i=FALSE} [n(y) − n(i, y)]/n(y)
  • 39. Naive Naive Bayes Classifier: what if n(i, y) = n(y)? What if n(i, y) = 0? p(y|x) ∝ (n(y)/n) · ∏_{i: x_i=TRUE} n(i, y)/n(y) · ∏_{i: x_i=FALSE} [n(y) − n(i, y)]/n(y)
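To make the counting concrete, here is a toy sketch of the count-based classifier from slides 38–39, with an additive smoothing knob that answers the "what if n(i, y) = 0?" question the slide raises (slides 44–45 make this precise). The vocabulary, documents, and interface are hypothetical, just one way to organize the counts:

```python
from collections import Counter

def train_nb(docs, labels, vocab, alpha=1.0):
    """Count-based naive Bayes with additive (Laplace) smoothing.

    alpha=0 reproduces the 'naive naive' estimator above, which assigns
    probability zero to any class with n(i,y)=0 or n(i,y)=n(y)."""
    n_y = Counter(labels)                       # n(y)
    n_iy = {y: Counter() for y in n_y}          # n(i, y)
    for words, y in zip(docs, labels):
        n_iy[y].update(set(words) & vocab)      # binary features: presence of word i
    return n_y, n_iy, alpha

def predict_nb(words, model, vocab):
    n_y, n_iy, alpha = model
    n = sum(n_y.values())
    scores = {}
    for y, ny in n_y.items():
        p = ny / n
        for i in vocab:
            p_true = (n_iy[y][i] + alpha) / (ny + 2 * alpha)
            p *= p_true if i in words else (1.0 - p_true)
        scores[y] = p
    return max(scores, key=scores.get)

# Hypothetical toy data.
vocab = {"viagra", "$$$", "meeting"}
docs = [{"viagra", "$$$"}, {"$$$"}, {"meeting"}, {"meeting", "viagra"}]
labels = ["spam", "spam", "ham", "ham"]
model = train_nb(docs, labels, vocab, alpha=1.0)
print(predict_nb({"viagra", "$$$"}, model, vocab))   # likely 'spam'
```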
  • 40. Estimating Probabilities
  • 41. Two outcomes (binomial) • Example: probability of ‘viagra’ in spam/ham • Data likelihood: p(X; π) = π^{n_1} (1 − π)^{n_0} • Maximum likelihood estimation: constraint π ∈ [0, 1]; taking derivatives yields π = n_1 / (n_0 + n_1)
  • 42. n outcomes (multinomial) • Example: USA, Canada, India, UK, NZ • Data likelihood: p(X; π) = ∏_i π_i^{n_i} • Maximum likelihood estimation: constrained optimization problem with ∑_i π_i = 1; using the log-transform yields π_i = n_i / ∑_j n_j
  • 43. Tossing a die [histograms of empirical outcome frequencies after 12, 24, 60, and 120 tosses]
  • 44. Conjugate Priors • Unless we have lots of data, estimates are weak • Usually we have an idea of what to expect: p(θ|X) ∝ p(X|θ) · p(θ); we might even have ‘seen’ such data before • Solution: add ‘fake’ observations, p(θ) ∝ p(X_fake|θ), hence p(θ|X) ∝ p(X|θ) p(X_fake|θ) = p(X ∪ X_fake|θ) • Inference (generalized Laplace smoothing): (1/n) ∑_{i=1}^n φ(x_i) → 1/(n+m) ∑_{i=1}^n φ(x_i) + m/(n+m) · μ_0, where m is the fake count and μ_0 the fake mean
  • 45. Conjugate Prior in action • Discrete distribution: p(x = i) = n_i/n → p(x = i) = (n_i + m_i)/(n + m), with m_i = m · [μ_0]_i • Tossing a die:
    Outcome         1     2     3     4     5     6
    Counts          3     6     2     1     4     4
    MLE             0.15  0.30  0.10  0.05  0.20  0.20
    MAP (m0 = 6)    0.15  0.27  0.12  0.08  0.19  0.19
    MAP (m0 = 100)  0.16  0.19  0.16  0.15  0.17  0.17
    • Rule of thumb: you need about 10 data points (or prior observations) per parameter
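The table above can be reproduced directly from the smoothing formula; a minimal check in Python, assuming a uniform fake mean μ0 = 1/6 as on the slide:

```python
# Laplace/MAP smoothing for a discrete distribution, reproducing the die table above.
counts = [3, 6, 2, 1, 4, 4]
n = sum(counts)

def smoothed(counts, m):
    mu0 = 1.0 / len(counts)                 # uniform fake mean
    return [(c + m * mu0) / (n + m) for c in counts]

print([round(p, 2) for p in smoothed(counts, m=0)])    # MLE:         [0.15, 0.3, 0.1, 0.05, 0.2, 0.2]
print([round(p, 2) for p in smoothed(counts, m=6)])    # MAP m0=6:    [0.15, 0.27, 0.12, 0.08, 0.19, 0.19]
print([round(p, 2) for p in smoothed(counts, m=100)])  # MAP m0=100:  [0.16, 0.19, 0.16, 0.15, 0.17, 0.17]
```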
  • 46. Honest die [figure: MLE vs. MAP estimates]
  • 47. Tainted die [figure: MLE vs. MAP estimates]
  • 48. Exponential Families
  • 49. Exponential Families • Density function: p(x; θ) = exp(⟨φ(x), θ⟩ − g(θ)), where g(θ) = log ∑_{x'} exp(⟨φ(x'), θ⟩) • The log-partition function generates cumulants: ∂_θ g(θ) = E[φ(x)] and ∂²_θ g(θ) = Var[φ(x)] • g is convex (its second derivative is positive semidefinite)
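A quick numerical sanity check of the cumulant property ∂_θ g(θ) = E[φ(x)] for a small discrete exponential family (numpy; the statistic φ and the value of θ are arbitrary choices for illustration):

```python
import numpy as np

# Check that the gradient of the log-partition function equals E[phi(x)]
# for a discrete exponential family over four states (hypothetical statistic phi).
phi = np.array([0.0, 1.0, 2.0, 3.0])          # phi(x) for each state x

def log_partition(theta):
    return np.log(np.sum(np.exp(phi * theta)))

def mean_phi(theta):
    p = np.exp(phi * theta - log_partition(theta))   # p(x; theta)
    return np.sum(p * phi)

theta, eps = 0.7, 1e-6
numeric_grad = (log_partition(theta + eps) - log_partition(theta - eps)) / (2 * eps)
print(numeric_grad, mean_phi(theta))   # the two numbers agree to high precision
```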
  • 50. Examples • Binomial distribution: φ(x) = x • Discrete distribution: φ(x) = e_x (e_x is the unit vector for x) • Gaussian: φ(x) = (x, ½ x x⊤) • Poisson (counting measure 1/x!): φ(x) = x • Dirichlet, Beta, Gamma, Wishart, ...
  • 51. Normal Distribution
  • 52. Poisson Distribution: p(x; λ) = λ^x e^{−λ} / x!
  • 53. Beta Distribution: p(x; α, β) = x^{α−1} (1 − x)^{β−1} / B(α, β)
  • 54. Dirichlet Distribution... this is a distribution over distributions ...
  • 55. Maximum Likelihood • Negative log-likelihood: −log p(X; θ) = n g(θ) − ∑_{i=1}^n ⟨φ(x_i), θ⟩ • Taking derivatives: −∂_θ log p(X; θ) = n [E[φ(x)] − (1/n) ∑_{i=1}^n φ(x_i)], i.e. we pick the parameter such that the model's expected sufficient statistics match the empirical average.
  • 56. Example: Gaussian Estimation • Sufficient statistics: x, x² • Mean and variance given by μ = E_x[x] and σ² = E_x[x²] − (E_x[x])² • Maximum likelihood estimate: μ̂ = (1/n) ∑_{i=1}^n x_i and σ̂² = (1/n) ∑_{i=1}^n x_i² − μ̂² • Maximum a posteriori estimate (with smoother n_0): μ̂ = 1/(n+n_0) ∑_{i=1}^n x_i and σ̂² = 1/(n+n_0) ∑_{i=1}^n x_i² + n_0/(n+n_0) · 1 − μ̂²
  • 57. Collapsing • Conjugate priors: p(θ) ∝ p(X_fake|θ), hence we know how to compute the normalization • Prediction: p(x|X) = ∫ p(x|θ) p(θ|X) dθ ∝ ∫ p(x|θ) p(X|θ) p(X_fake|θ) dθ = ∫ p({x} ∪ X ∪ X_fake|θ) dθ • Look up closed form expansions for the conjugate pairs (Beta, binomial), (Dirichlet, multinomial), (Gamma, Poisson), (Wishart, Gauss): http://en.wikipedia.org/wiki/Exponential_family
  • 58. Directed Graphical Models
  • 59. ... some Web 2.0 service: MySQL → Website ← Apache • Joint distribution (assume a and m are independent): p(m, a, w) = p(w|m, a) p(m) p(a) • Explaining away: p(m, a|w) = p(w|m, a) p(m) p(a) / ∑_{m', a'} p(w|m', a') p(m') p(a'); a and m are dependent conditioned on w
  • 60. ... some Web 2.0 service: MySQL → Website ← Apache. The website is broken, so at least one of the two services is broken; learning that MySQL is working makes it more likely that Apache is broken, so the two are not independent given the observation.
  • 61. Directed graphical model [diagrams over m, a, w and a user → action pair] • Easier estimation: 15 parameters for the full joint distribution vs. 1+1+3+1 for the factorizing distribution • Causal relations • Inference for unobserved variables
  • 62. No loops allowed: p(c|e) p(e|c) is not a valid factorization; use p(c|e) p(e) or p(e|c) p(c)
  • 63. Directed Graphical Model • Joint probability distribution: p(x) = ∏_i p(x_i | x_parents(i)) • Parameter estimation: if x is fully observed the likelihood breaks up, log p(x|θ) = ∑_i log p(x_i | x_parents(i), θ); if x is partially observed things get interesting (maximization, EM, variational methods, sampling ...)
  • 64. Clustering • Density estimation: p(x, θ) = p(θ) ∏_{i=1}^n p(x_i|θ) • Clustering: p(x, y, θ) = p(π) ∏_{k=1}^K p(θ_k) ∏_{i=1}^n p(y_i|π) p(x_i|θ, y_i)
  • 65. Chains • Markov chain (plate): past → present → future • Hidden Markov chain: the user's mindset is latent, the user action is observed • User model for traversal through search results
  • 66. Chains • Markov chain (plate): p(x; θ) = p(x_0; θ) ∏_{i=1}^{n−1} p(x_{i+1}|x_i; θ) • Hidden Markov chain (user's mindset latent, user action observed): p(x, y; θ) = p(x_0; θ) ∏_{i=1}^{n−1} p(x_{i+1}|x_i; θ) ∏_{i=1}^n p(y_i|x_i) • User model for traversal through search results
  • 67. Factor Graphs: latent factors and observed effects • Observed effects: click behavior, queries, watched news, emails • Latent factors: user profile, news content, hot keywords, social connectivity graph, events
  • 68. Recommender Systems (news, SearchMonkey, answers, social ranking, OMG, personals) • Users u • Movies m • Ratings r (but only for a subset of users) • Intersecting plates (like nested for loops)
  • 69. Challenges • How to design models (domain expert): common (engineering) sense; computational tractability • Inference (statistics): easy for fully observed situations; many algorithms if not fully observed; dynamic programming / message passing
  • 70. Summary - Part 2 • Probability theory to estimate events • Conjugate priors and Laplace smoothing • Conjugate prior = fantasy data • Collapsing • Directed graphical models
  • 71. Part 3 - Clustering Topic Models
  • 72. Inference Algorithms
  • 73. Clustering • Density estimation (log-concave in θ): p(x, θ) = p(θ) ∏_{i=1}^n p(x_i|θ); find θ • Clustering (general nonlinear): p(x, y, θ) = p(π) ∏_{k=1}^K p(θ_k) ∏_{i=1}^n p(y_i|π) p(x_i|θ, y_i)
  • 74. Clustering • Optimization problem: maximize over θ the marginal likelihood ∑_y p(x, y, θ), i.e. maximize_θ log p(π) + ∑_{k=1}^K log p(θ_k) + ∑_{i=1}^n log ∑_{y_i ∈ Y} p(y_i|π) p(x_i|θ, y_i) • Options: direct nonconvex optimization (e.g. BFGS); sampling (draw from the joint distribution); variational approximation (concave lower bounds, aka the EM algorithm)
  • 75. Clustering • Integrate out y: nonconvex optimization problem in θ; EM algorithm • Integrate out θ: the labels y become coupled; sampling; collapsed sampler p(y|x) ∝ p({x} | {x_i : y_i = y} ∪ X_fake) p(y | Y ∪ Y_fake)
  • 76. Gibbs sampling • Sampling: draw an instance x from distribution p(x) • Gibbs sampling: in most cases direct sampling is not possible, so draw one set of variables at a time conditioned on the rest • Example on the 2×2 table 0.45, 0.05 / 0.05, 0.45: start at (b,g), draw p(·,g) to get (g,g), draw p(g,·) to get (g,g), draw p(·,g) to get (b,g), draw p(b,·) to get (b,b), ...
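The alternating draws on this slide are easy to simulate. A minimal sketch in Python for the 2×2 table above; the long-run visit frequencies should approach the joint distribution:

```python
import random

# Gibbs sampling on the 2x2 joint distribution from the slide:
# two variables taking values 'b' or 'g', probability mass on the diagonal.
p = {('b', 'b'): 0.45, ('b', 'g'): 0.05, ('g', 'b'): 0.05, ('g', 'g'): 0.45}

def resample(fixed_value, position):
    """Draw one variable conditioned on the other (0 resamples x, 1 resamples y)."""
    if position == 0:
        weights = {v: p[(v, fixed_value)] for v in 'bg'}
    else:
        weights = {v: p[(fixed_value, v)] for v in 'bg'}
    total = sum(weights.values())
    return random.choices(list(weights), weights=[w / total for w in weights.values()])[0]

x, y = 'b', 'g'
counts = {}
for step in range(100000):
    x = resample(y, 0)          # draw p(x | y)
    y = resample(x, 1)          # draw p(y | x)
    counts[(x, y)] = counts.get((x, y), 0) + 1

print({k: round(v / 100000, 3) for k, v in counts.items()})  # approaches the table above
```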
  • 77. Gibbs sampling for clustering
  • 78. Gibbs sampling for clustering: random initialization
  • 79. Gibbs sampling for clustering: sample cluster labels
  • 80. Gibbs sampling for clustering: resample cluster model
  • 81. Gibbs sampling for clustering: resample cluster labels
  • 82. Gibbs sampling for clustering: resample cluster model
  • 83. Gibbs sampling for clustering: resample cluster labels
  • 84. Gibbs sampling for clustering: resample cluster model (e.g. Mahout Dirichlet Process Clustering)
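Slides 77–84 alternate two steps: resample cluster labels given the cluster models, then refresh the cluster models given the labels. A toy sketch of that loop for two 1-D Gaussian clusters with fixed unit variance; the data, the initialization, and the simplification of setting each mean to the average of its assigned points (rather than drawing it from its posterior) are all illustrative assumptions:

```python
import random, statistics

# Alternation from slides 77-84: resample labels, then update cluster models.
random.seed(0)
data = [random.gauss(-2, 1) for _ in range(50)] + [random.gauss(3, 1) for _ in range(50)]
means = [-1.0, 1.0]                                   # crude initialization
labels = [random.randrange(2) for _ in data]

def density(x, mu):
    return statistics.NormalDist(mu, 1.0).pdf(x)

for _ in range(100):
    # (re)sample cluster labels from p(label | x, means)
    for i, x in enumerate(data):
        w = [density(x, mu) for mu in means]
        labels[i] = random.choices([0, 1], weights=w)[0]
    # update the cluster models; simplified here to the mean of the assigned points
    for k in range(2):
        members = [x for x, z in zip(data, labels) if z == k]
        if members:
            means[k] = statistics.mean(members)

print(sorted(round(m, 2) for m in means))             # roughly [-2, 3]
```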
  • 85. Inference Algorithm ≠ Model Corollary: EM ≠ Clustering
  • 86. Topic models
  • 87. Grouping objects Singapore
  • 88. Grouping objects airline university restaurant
  • 89. Grouping objects: Australia, USA, Singapore
  • 90. Topic Models: Australia, Singapore, university, USA, airline, airline, Singapore, university, USA, Singapore, food, food
  • 91. Clustering vs. Topic Models • Clustering: group objects by prototypes • Topics: decompose objects into prototypes
  • 92. Clustering vs. Topic Models (Latent Dirichlet Allocation) • Clustering: α (prior) → θ (cluster probability) → y (cluster label) → x (instance) • LDA: α (prior) → θ (topic probability) → y (topic label) → x (instance)
  • 93. Clustering vs. Topic Models • Documents ≈ cluster/topic membership × distributions • Clustering: (0, 1) membership matrix • Topic model: stochastic matrix • LSI: arbitrary matrices
  • 94. Topics in text. Latent Dirichlet Allocation; Blei, Ng, Jordan, JMLR 2003
  • 95. Collapsed Gibbs Sampler
  • 96. Joint Probability Distribution (sample ψ, θ, and z independently; slow): p(θ, z, ψ, x|α, β) = ∏_{k=1}^K p(ψ_k|β) · ∏_{i=1}^m p(θ_i|α) · ∏_{i,j} p(z_ij|θ_i) p(x_ij|z_ij, ψ), where θ_i is the per-document topic probability, z_ij the topic label, x_ij the observed word, β the language prior, and ψ_k the per-topic word distribution
  • 97. Collapsed Sampler (integrate out θ and ψ, sample z sequentially; fast): p(z, x|α, β) = ∏_{i=1}^m p(z_i|α) · ∏_{k=1}^K p({x_ij : z_ij = k}|β)
  • 98. Collapsed Sampler (Griffiths & Steyvers, 2005): p(z, x|α, β) = ∏_{i=1}^m p(z_i|α) · ∏_{k=1}^K p({x_ij : z_ij = k}|β), with per-word resampling probability p(z_ij = t | rest) ∝ [n^{−ij}(t, d) + α_t] / [n^{−i}(d) + ∑_t α_t] · [n^{−ij}(t, w) + β_w] / [n^{−i}(t) + ∑_w β_w]
  • 99. Sequential Algorithm • Collapsed Gibbs sampler • For 1000 iterations do • For each document do • For each word in the document do • Resample the topic for the word • Update the local (document, topic) table • Update the global (word, topic) table • This kills parallelism
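A single-machine, toy-scale sketch of the sequential loop on slides 98–99 (pure Python; the corpus, vocabulary size, number of topics, and symmetric α, β are all hypothetical, and 200 sweeps stand in for the slide's 1000 iterations):

```python
import random
from collections import defaultdict

# Sequential collapsed Gibbs sampling for LDA (slides 98-99), toy-scale sketch.
docs = [[0, 1, 2, 1], [2, 3, 3, 4], [0, 1, 4, 4]]   # documents as lists of word ids
V, K, alpha, beta = 5, 2, 0.1, 0.01

n_dt = defaultdict(int)   # (doc, topic)  counts
n_tw = defaultdict(int)   # (topic, word) counts
n_t = defaultdict(int)    # topic counts
z = [[0] * len(d) for d in docs]

for d, doc in enumerate(docs):                 # random initialization
    for j, w in enumerate(doc):
        t = random.randrange(K)
        z[d][j] = t
        n_dt[d, t] += 1; n_tw[t, w] += 1; n_t[t] += 1

for _ in range(200):                           # "for 1000 iterations do", shortened
    for d, doc in enumerate(docs):
        for j, w in enumerate(doc):
            t_old = z[d][j]                    # remove the current assignment
            n_dt[d, t_old] -= 1; n_tw[t_old, w] -= 1; n_t[t_old] -= 1
            weights = [(n_dt[d, t] + alpha) * (n_tw[t, w] + beta) / (n_t[t] + V * beta)
                       for t in range(K)]      # Griffiths & Steyvers proportionality
            t_new = random.choices(range(K), weights=weights)[0]
            z[d][j] = t_new                    # update local and global tables
            n_dt[d, t_new] += 1; n_tw[t_new, w] += 1; n_t[t_new] += 1

print(z)
```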
  • 100. State of the art (UMass Mallet, UC Irvine, Google) • For 1000 iterations do • For each document do • For each word in the document do • Resample the topic for the word • Update the local (document, topic) table • Update the CPU-local (word, topic) table • Update the global (word, topic) table (problems: table out of sync, memory inefficient, blocking, network bound) • Sampling decomposition: p(t|w_ij) ∝ α_t β_w / (n(t) + β̄) + n(t, d=i) β_w / (n(t) + β̄) + n(t, w=w_ij) [n(t, d=i) + α_t] / (n(t) + β̄), where the first term changes slowly, the second moderately fast, and the third rapidly
  • 101. Our Approach • For 1000 iterations do (independently per computer) • For each thread/core do • For each document do • For each word in the document do • Resample the topic for the word • Update the local (document, topic) table • Generate a computer-local (word, topic) message • In parallel, update the local (word, topic) table • In parallel, update the global (word, topic) table • Instead of network bound, memory inefficient, table out of sync, and blocking, the design is concurrent across cpu, hdd, and net, keeps a minimal view, synchronizes continuously, and is barrier free
  • 102. Architecture details
  • 103. Multicore Architecture (Intel Threading Building Blocks): tokens read from file → samplers → combiner → topics written to file, with a count updater, diagnostics/output, and optimization all sharing a joint state table • Decoupling multithreaded sampling and updating (almost) avoids stalling for locks in the sampler • Joint state table: much less memory required; samplers stay synchronized (10 docs vs. millions of delay) • Hyperparameter update via stochastic gradient descent • No need to keep documents in memory (streaming)
  • 104. Cluster Architecture: samplers connected through distributed (key,value) storage via memcached • Background asynchronous synchronization • a single word at a time to avoid deadlocks • no need to have a joint dictionary • uses disk, network, and cpu simultaneously
  • 105. Cluster Architecture: each sampler paired with an ICE instance • Distributed (key,value) storage via ICE • Background asynchronous synchronization • a single word at a time to avoid deadlocks • no need to have a joint dictionary • uses disk, network, and cpu simultaneously
  • 106. Making it work• Startup • Randomly initialize topics on each node (read from disk if already assigned - hotstart) • Sequential Monte Carlo for startup much faster • Aggregate changes on the fly• Failover • State constantly being written to disk (worst case we lose 1 iteration out of 1000) • Restart via standard startup routine• Achilles heel: need to restart from checkpoint if even a single machine dies.
  • 107. Easily extensible• Better language model (topical n-grams) can process millions of users (vs 1000s)• Conditioning on side information (upstream) estimate topic based on authorship, source, joint user model ...• Conditioning on dictionaries (downstream) integrate topics between different languages• Time dependent sampler for user model approximate inference per episode
  • 108. System comparison (columns: Google LDA, Mallet, Irvine'08, Irvine'09, Yahoo LDA)
    Multicore:   no | yes | yes | yes | yes
    Cluster:     MPI | no | MPI | point 2 point | memcached
    State table: separate dictionary | separate | separate | split, sparse | joint, sparse
    Schedule:    asynchronous approximate messages | synchronous exact | synchronous exact | synchronous exact | asynchronous exact
  • 109. Speed • 1M documents per day on 1 computer (1000 topics per doc, 1000 words per doc) • 350k documents per day per node (context switches, memcached, stray reducers) • 8 million docs (Pubmed; the sampler does not burn in well, the documents are too short): Irvine 128 machines, 10 hours; Yahoo 1 machine, 11 days; Yahoo 20 machines, 9 hours • 20 million docs (Yahoo! News articles): Yahoo 100 machines, 12 hours
  • 110. Scalability: 200k documents per computer [plot: runtime in hours vs. number of CPUs (1, 10, 20, 50, 100); initial topics per word ×10] • Likelihood even improves with parallelism: -3.295 (1 node), -3.288 (10 nodes), -3.287 (20 nodes)
  • 111. The Competition [bar charts comparing Google, Irvine, and Yahoo on dataset size (millions of documents), throughput per hour, and cluster size]
  • 112. Design Principles
  • 113. Variable Replication • Global shared variable [figure: per-computer local copies of x, y, z synchronized against the global copy] • Make a local copy • Distributed (key,value) storage table for the global copy • Do all bookkeeping locally (store old versions) • Sync local copies asynchronously using message passing (no global locks are needed) • This is an approximation!
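A toy sketch of the bookkeeping this slide describes: each worker keeps its current local copy and the version it last synchronized, and pushes only the difference to the global (key,value) store. A plain dict stands in for memcached/ICE, and the class and method names are made up for illustration:

```python
from collections import Counter

# Toy variable-replication scheme (slide 113): workers keep a local copy and the
# version they last synchronized, and ship only deltas to a global store.
global_store = Counter()      # stand-in for the distributed (key,value) storage

class Worker:
    def __init__(self):
        self.local = Counter()      # current local counts
        self.synced = Counter()     # counts as of the last synchronization

    def update(self, key, delta):
        self.local[key] += delta    # local bookkeeping only, no locks

    def synchronize(self):
        # push the local delta, then refresh the local view from the global copy
        for key in set(self.local) | set(self.synced):
            global_store[key] += self.local[key] - self.synced[key]
        self.local = Counter(global_store)
        self.synced = Counter(self.local)

a, b = Worker(), Worker()
a.update("word:graphical", 3)
b.update("word:graphical", 2)
a.synchronize()
b.synchronize()
a.synchronize()
print(global_store["word:graphical"], a.local["word:graphical"])  # 5 5
```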
  • 114. Asymmetric Message Passing• Large global shared state space (essentially as large as the memory in computer)• Distribute global copy over several machines (distributed key,value storage) global state current copy old copy
  • 115. Out of core storage • Very large state space • Gibbs sampling requires us to traverse the data sequentially many times (think 1000×) • Stream local data from disk and update the coupling variable each time local data is accessed • This is exact [architecture diagram as on slide 103]
  • 116. Summary - Part 3• Inference in graphical models• Clustering• Topic models• Sampling• Implementation details
  • 117. Part 4 - Advanced Modeling
  • 118. Chinese Restaurant Process φ1 φ2 φ3
  • 119. Problem • How many clusters should we pick? • How about a prior for infinitely many clusters? • Finite model: p(y|Y, α) = [n(y) + α_y] / [n + ∑_y α_y] • Infinite model (assume that the total smoother weight is constant): p(y|Y, α) = n(y) / (n + α) and p(new|Y, α) = α / (n + α)
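The infinite model above is exactly the Chinese restaurant process of the next slides; a short simulation makes the rich-get-richer behavior visible (pure Python; α and the number of customers are arbitrary choices):

```python
import random

def sample_crp(n_customers, alpha):
    """Seat customers by the CRP: existing table with prob n(y)/(n+alpha), new table with prob alpha/(n+alpha)."""
    tables = []                                   # tables[k] = number of customers at table k
    assignments = []
    for n in range(n_customers):
        weights = tables + [alpha]                # rich get richer, plus the new-table option
        k = random.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append(0)                      # open a new table
        tables[k] += 1
        assignments.append(k)
    return tables, assignments

random.seed(0)
tables, _ = sample_crp(n_customers=100, alpha=2.0)
print(len(tables), sorted(tables, reverse=True))  # a few large tables, many small ones
```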
  • 120. Chinese Restaurant Metaphor (the rich get richer) • Generative process: for data point x_i, choose table j ∝ m_j and sample x_i ~ f(φ_j), or choose a new table K+1 ∝ α, sample φ_{K+1} ~ G_0 and sample x_i ~ f(φ_{K+1}). Pitman; Antoniak; Ishwaran; Jordan et al.; Teh et al.
  • 121. Evolutionary Clustering• Time series of objects, e.g. news stories• Stories appear / disappear• Want to keep track of clusters automatically
  • 122. Recurrent Chinese Restaurant Process T=1 φ1,1 φ2,1 φ3,1 m1,1=2 m2,1=3 m3,1=1 T=2
  • 123. Recurrent Chinese Restaurant Process T=1 φ1,1 φ2,1 φ3,1 m1,1=2 m2,1=3 m3,1=1 T=2 φ1,1 φ2,1 φ3,1
  • 124. Recurrent Chinese Restaurant Process T=1 φ1,1 φ2,1 φ3,1 m1,1=2 m2,1=3 m3,1=1 T=2 φ1,1 φ2,1 φ3,1
  • 125. Recurrent Chinese Restaurant Process T=1 φ1,1 φ2,1 φ3,1 m1,1=2 m2,1=3 m3,1=1 T=2 φ1,1 φ2,1 φ3,1 Sample  φ1,2  ~  P(.| φ1,1)
  • 126. Recurrent Chinese Restaurant Process T=1 φ1,1 φ2,1 φ3,1 m1,1=2 m2,1=3 m3,1=1 T=2 φ1,1 φ2,1 φ3,1
  • 127. Recurrent Chinese Restaurant Process T=1 φ1,1 φ2,1 φ3,1 m1,1=2 m2,1=3 m3,1=1 T=2 φ1,2 φ2,2 φ3,1 φ4,2 dead cluster new cluster
  • 128. Longer History: T=1: φ1,1, φ2,1, φ3,1 with m1,1=2, m2,1=3, m3,1=1; T=2: φ1,2, φ2,2, φ3,1, φ4,2; T=3: φ1,2, φ2,2, φ4,2 (with count m2,3)
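One way to read slides 122–128 as code: a cluster's prior weight at epoch t combines its decayed popularity in earlier epochs with its counts in the current epoch, while a new cluster can always be opened with weight α. The exponential decay kernel and all numbers below are illustrative assumptions, not the exact kernel used in the papers:

```python
import math

# Sketch of a recurrent-CRP-style prior: decayed history counts + current counts + alpha for "new".
def rcrp_weights(history_counts, current_counts, alpha, decay=0.5):
    """history_counts: list of per-epoch {cluster: count} dicts, most recent last."""
    weights = {}
    for age, epoch in enumerate(reversed(history_counts)):
        for k, m in epoch.items():
            weights[k] = weights.get(k, 0.0) + m * math.exp(-decay * (age + 1))
    for k, m in current_counts.items():
        weights[k] = weights.get(k, 0.0) + m
    weights["new"] = alpha
    return weights

history = [{"cluster1": 2, "cluster2": 3, "cluster3": 1},   # T=1
           {"cluster1": 4, "cluster2": 1, "cluster4": 2}]   # T=2
w = rcrp_weights(history, current_counts={"cluster4": 1}, alpha=1.0)
total = sum(w.values())
print({k: round(v / total, 2) for k, v in w.items()})
# stale clusters fade away, recently popular clusters dominate, 'new' stays possible
```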
  • 129. TDPM Generative Power • DPM: W = ∞, λ = ∞ • TDPM (power law): W = 4, λ = 0.4 • Independent DPMs: W = 0, λ = ? (any)
  • 130. User modeling [plot: per-day topic proportion (Baseball, Finance, Jobs, Dating) for one user]
  • 131. Buying a camera [timeline: show ads now vs. too late]
  • 132. User modeling: problem formulation [example query/interest streams: movies, theatre, art gallery; auto, car price, deals, used van, inspection; diet, calories, recipe, chocolate; hiring, job, salary; flight, London, hotel, weather; school, supplies, loan, college]
  • 133. User modeling: problem formulation [the same streams grouped into intents: Cars, Art/Movies, Jobs, Diet, Travel, Finance/College]
  • 134. User modeling: problem formulation • Input: queries issued by the user or tags of watched content; snippet of the page examined by the user; time stamp of each action (day resolution) • Output: users' daily distribution over intents; dynamic intent representation
  • 135. Time dependent models• LDA for topical model of users where • User interest distribution changes over time • Topics change over time• This is like a Kalman filter except that • Don’t know what to track (a priori) • Can’t afford a Rauch-Tung-Striebel smoother • Much more messy than plain LDA
  • 136. Graphical Model: plain LDA (α → θ_i → z_ij → w_ij ← φ_k ← β) next to the time-dependent version, in which the prior α^t, the user interest θ_i^t, and the per-topic action distributions φ_k^t (with priors β^t) all evolve across time steps t−1, t, t+1; z_ij and w_ij are the user actions
  • 137. Prior for user actions at time t: combines long-term components (all: μ; month: μ3; week: μ2) with a short-term component [example short-term words: food, recipe, chicken, pizza, cuisine; part-time, job, opening, hiring, salary; Kelly, mileage] • Example topics with top words: Diet (recipe, chocolate, pizza, food, chicken, milk, butter, powder), Cars (car, Blue Book, Kelley, prices, small, speed, large), Job (job, career, business, assistant, hiring, part-time, receptionist), Finance (bank, online, credit, card, debt, portfolio, finance, Chase)
  • 138. At time t and t+1: per-intent word distributions (Car: Altima, Accord, Blue Book, Kelley, prices, small, speed, mileage, Camry; Job: career, business, assistant, hiring, part-time, receptionist; Bank: online, credit, card, debt, portfolio, finance, Chase; Recipe: chocolate, pizza, food, chicken, milk, butter, powder) plus short-term priors • Generative process: for each user interaction, choose an intent from the local distribution and sample a word from that topic's word distribution; or choose a new intent ∝ α, sample the new intent from the global distribution, and sample a word from the new topic's word distribution
  • 139. [diagram: a global process over times t, t+1, t+2, t+3 with per-user processes (User 1, User 2, User 3) drawing from it]
  • 140. Sample users [plots: per-day topic proportions for two users (Baseball, Finance, Jobs, Dating, Celebrity, Health)] • Top words per topic: Dating (women, men, dating, singles, personals, seeking, match), Baseball (league, baseball, basketball, doubleheader, Bergesen, Griffey, bullpen, Greinke), Celebrity (Snooki, Tom Cruise, Katie Holmes, Pinkett, Kudrow, Hollywood), Health (skin, body, fingers, cells, toes, wrinkle, layers), Jobs (job, career, business, assistant, hiring, part-time, receptionist), Finance (financial, Thomson, chart, real, stock, trading, currency)
  • 141. Datasets
  • 142. ROC score improvement
  • 143. ROC score improvement [bar chart on dataset 2: baseline vs. TLDA vs. TLDA+baseline, ROC scores roughly 50-62, bucketed by user activity]
  • 144. LDA for user profiling (parallel schedule): each machine samples z for its users and writes counts to memcached; barrier; one machine collects the counts and samples while the others do nothing; barrier; all machines read back from memcached
  • 145. News
  • 146. News Stream
  • 147. News Stream• Over 1 high quality news article per second• Multiple sources (Reuters, AP, CNN, ...)• Same story from multiple sources• Stories are related• Goals • Aggregate articles into a storyline • Analyze the storyline (topics, entities)
  • 148. Clustering / RCRP • Assume active story distribution at time t • Draw story indicator • Draw words from story distribution • Down-weight story counts for next day Ahmed Xing, 2008
  • 149. Clustering / RCRP• Pro • Nonparametric model of story generation (no need to model frequency of stories) • No fixed number of stories • Efficient inference via collapsed sampler• Con • We learn nothing! • No content analysis
  • 150. Latent Dirichlet Allocation • Generate topic distribution per article • Draw topics per word from topic distribution • Draw words from topic specific word distribution Blei, Ng, Jordan, 2003
  • 151. Latent Dirichlet Allocation• Pro • Topical analysis of stories • Topical analysis of words (meaning, saliency) • More documents improve estimates• Con • No clustering
  • 152. More Issues• Named entities are special, topics less (e.g. Tiger Woods and his mistresses)• Some stories are strange (topical mixture is not enough - dirty models)• Articles deviate from general story (Hierarchical DP)
  • 153. StorylinesAmr Ahmed, Quirong Ho, Jake Eisenstein, Alex Smola, Choon Hui Teo, 2011
  • 154. Storylines Model • Topic model • Topics per cluster • RCRP for cluster • Hierarchical DP for article • Separate model for named entities • Story specific correction
  • 155. Storylines Model [spectrum from tightly-focused stories to high-level concepts]
  • 156. The Graphical Model: Storylines Model [tightly-focused stories vs. high-level concepts]
  • 157. The Graphical Model: Storylines Model. Each story has a distribution over words, a distribution over topics, and a distribution over named entities
  • 158. The Graphical Model: Storylines Model • A document's topic mix is sampled from its story prior • Words inside a document are either global or story specific
  • 159. The Generative Process
  • 160. The Generative Process
  • 161. The Generative Process
  • 162. The Generative Process
  • 163. Estimation • Sequential Monte Carlo (particle filter) • For each new time period draw stories s and topics z from p(s_{t+1}, z_{t+1} | x_{1...t+1}, s_{1...t}, z_{1...t}) using Gibbs sampling for each particle • Reweight each particle via p(x_{t+1} | x_{1...t}, s_{1...t}, z_{1...t}) • Regenerate particles if the ℓ2 norm of the weights gets too heavy
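A generic sketch of the reweight/resample loop described here, not the storylines model itself: particles carry a stand-in latent state, are reweighted by the likelihood of each new observation, and are resampled whenever the weights become too concentrated (the ℓ2 criterion mentioned above). The transition and likelihood functions below are hypothetical placeholders:

```python
import random

# Generic sequential Monte Carlo skeleton (slide 163): propagate, reweight, resample.
def propagate(state):
    return state + random.gauss(0.0, 1.0)                  # stand-in for the Gibbs update of (s, z)

def likelihood(obs, state):
    return max(1e-12, 1.0 / (1.0 + (obs - state) ** 2))    # stand-in for p(x_{t+1} | ...)

def smc(observations, num_particles=100):
    particles = [0.0] * num_particles
    weights = [1.0 / num_particles] * num_particles
    for obs in observations:
        particles = [propagate(s) for s in particles]
        weights = [w * likelihood(obs, s) for w, s in zip(weights, particles)]
        total = sum(weights)
        weights = [w / total for w in weights]
        # resample when the weights become too concentrated (l2 norm too heavy)
        if sum(w * w for w in weights) > 2.0 / num_particles:
            particles = random.choices(particles, weights=weights, k=num_particles)
            weights = [1.0 / num_particles] * num_particles
    return particles, weights

random.seed(1)
parts, w = smc([0.5, 1.0, 1.5, 2.0, 2.5])
print(round(sum(p * wi for p, wi in zip(parts, w)), 2))     # weighted state estimate
```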
  • 164. Numbers ... • TDT5 (Topic Detection and Tracking): macro-averaged minimum detection cost 0.714 (time: 0.84, entities: 0.90, topics: 0.86, story words: 0.75); this is the best performance on TDT5! • Yahoo News data: beats all other clustering algorithms
  • 165. Stories
  • 166.-170. [figures: example storylines extracted from the news stream; slide text not recoverable]
  • 171. Related Stories
  • 172.-173. [figures: related-story examples; slide text not recoverable]
  • 174. Detecting Ideologies Ahmed and Xing, 2010
  • 175. Problem Statement: Ideologies • Build a model to describe both collections of data • Visualization: how does each ideology view mainstream events? On which topics do they differ? On which topics do they agree?
  • 176. Problem Statement: Ideologies • Build a model to describe both collections of data • Visualization • Classification: given a new news article or blog post, the system should infer from which side it was written and justify its answer on a topical level (view on abortion, taxes, health care)
  • 177. Problem Statement: Ideologies • Build a model to describe both collections of data • Visualization • Classification • Structured browsing: given a new news article or blog post, the user can ask for examples of other articles from the same ideology about the same topic, or documents that exemplify alternative views from other ideologies
  • 178. Building a factored model [diagram: shared topics β_1 ... β_k, ideology-specific views φ_{1,k} and φ_{2,k}, and ideology priors Ω_1, Ω_2]
  • 179. Building a factored model [same diagram, with mixing weights λ and 1−λ between the shared topic and the ideology-specific view]
  • 180. Datasets • Bitterlemons: Middle East conflict, documents written by Israeli and Palestinian authors; ~300 documents from each view with average length 740; multi-author collection; 80-20 split for test and train • Political Blog-1: American political blogs (Democrat and Republican); 2040 posts with average post length = 100 words; follows the test and train split of (Yano et al., 2009) • Political Blog-2 (tests generalization to a new writing style): same as 1 but 6 blogs, 3 from each side; ~14k posts with ~200 words per post; 4 blogs for training and 2 blogs for test
  • 181. Example: Bitterlemons corpus [topics with view-specific word lists, e.g. US role (powell, minister, colin, visit, arafat, state, leader, roadmap, election, iraq, yasir, bush, president, american, policy, sharon, administration, prime, clinton, pressure, washington), Roadmap process (palestinian, settlement, process, terrorism, israeli, implementation, obligation, roadmap, phase, security, peace, ceasefire, plan, political, authority, occupation, timetable, conflict, negotiation), Arab involvement (plo, hizballah, leadership, islamic, syria, syrian, lebanon, withdrawal, negotiation, iran, asad, agreement, regional, intifada, initiative, jihad)]
  • 182. Classification accuracy
  • 183. Generalization to new blogs
  • 184. Getting the Alternative View (finding alternate views) • Given a document written in one ideology, retrieve the equivalent • Baseline: SVM + cosine similarity
  • 185. Can We Use Unlabeled Data? • In theory this is simple: add a step that samples the document view v • In practice it doesn't mix, because of the tight coupling between v and (x1, x2, z) • Solution: sample v and (x1, x2, z) as a block using a Metropolis-Hastings step • This is a huge proposal!
  • 186. Summary - Part 4• Chinese Restaurant Process• Recurrent CRP• User modeling• Storylines• Ideology detection