Graphical Models for the Internet

  1. 1. Graphical Models for the Internet Alexander Smola Yahoo! Research, Santa Clara, CA Australian National University alex@smola.org blog.smola.org credits to Amr Ahmed, Yahoo Research, CA
  2. 2. Outline 1. Systems • Hardware (computer architecture / networks / data centers) • Storage and processing (file systems, MapReduce, Dryad, S4) • Communication and synchronization (star, ring, hashtable, distributed star, tree) 2. Applications on the internet • User modeling (clustering, abuse, profiling) • Content analysis (webpages, links, news) • Search / sponsored search 3. Probabilistic modeling • Basic probability theory • Naive Bayes • Density estimation (exponential families) 4. Directed graphical models • Directed graph semantics (independence, factorization) • Clustering and Markov models (basic model, EM, sampling) • Dirichlet distribution
  3. 3. Outline 5. Scalable topic modeling • Latent Dirichlet Allocation • Sampling and parallel communication • Applications to user profiling 6. Applications of latent variable models • Time dependent / context dependent models • News articles / ideology estimation • Recommendation systems 7. Undirected graphical models • Conditional independence and graphs (vertices, chains, trellis) • Message passing (junction trees, variable elimination) • Variational methods, sampling, particle filtering, message passing 8. Applications of undirected graphical models (time permitting) • Conditional random fields • Information extraction • Unlabeled data
  4. 4. Part 1 - Systems
  5. 5. Hardware
  6. 6. Computers• CPU • 8-16 cores (Intel/AMD servers) • 2-3 GHz (close to 1 IPC per core peak) - 10-100 GFlops/socket • 8-16 MB Cache (essentially accessible at clock speed) • Vectorized multimedia instructions (128bit wide, e.g. add, multiply, logical) • Deep pipeline architectures (branching is expensive)• RAM • 8-128 GB depending on use • 2-3 memory banks (each 32bit wide - atomic writes!) • DDR3 (10GB/s per chip, random access >10x slower)• Harddisk • 2-3 TB/disk • 100 MB/s sequential read from SATA2 • 5ms latency (no change over 10 years), i.e. random access is slow• Solid State Drives • 500 MB/s sequential read • Random writes are really expensive (read-erase-write cycle for a block) • Latency is 0.5ms or lower (controller & flash cell dependent)• Anything you can do in bulk & sequence is at least 10x faster
  7. 7. Computers• Network interface • Gigabit network (10-100MB/s throughput) • Copying RAM across network takes 1-10 minutes • Copying 1TB across network takes 1 day • Dozens of computers in a rack (on same switch)• Power consumption • 200-400W per unit (don’t buy the fastest CPUs) • Energy density big issue (cooling!) • GPUs take much more power (150W each)• Systems designed to fail • Commodity hardware • Too many machines to ensure 100% uptime • Design software to deal with it (monitoring, scheduler, data storage)• Systems designed to grow
  8. 8. Server Centers• 10,000+ servers• Aggregated into rack (same switch)• Reduced bandwidth between racks• Some failure modes • OS crash • Disk dies (one of many) • Computer dies • Network switch dies (lose rack) • Packet loss / congestion / DNS• Several server centers worldwide • Applications running distributed (e.g. ranking, mail, images) • Load balancing • Data transfer between centers
  9. 9. Some indicative prices (on Amazon): storage, data transfer, server costs
  10. 10. Processing in the cloud
  11. 11. Data storage• Billions of webpages• Billions of queries in query / click log• Millions of data ranked by editors• Storing data is cheap • less than 100 Billion interesting webpages • assume 10kB per webpage - 1PB total • Amazon S3 price (1 month) $10k (at $0.01/GB)• Processing data is expensive • 10 Billion webpages • 10ms per page (only simple algorithms work) • 10 days on 1000 computers ($24k-$240k at $0.1-$1/h)• Crawling the data is very expensive • Assume 10 Gb/s link - takes >100 days to gather (the APCN2 cable has 2.5 Tb/s bandwidth) • Amazon EC2 price $100k (at $0.1/GB), with overhead $1M
  12. 12. File Systems (GoogleFS/HDFS): a replicated name node (not quite so cheap server) keeps track of blocks, which are replicated 3x across cheap chunk servers. Ghemawat, Gobioff, Leung, 2003
  13. 13. File Systems Details• Chunkservers • store 64MB (or larger) blocks • write to one, replicate transparently (3x)• Name node • keeps directory of chunks • replicated• Write • get chunkserver ID from name node • write block(s) to chunkserver (expensive to write parts of chunk) • largely write once scenario (read many times) • distributed write for higher bandwidth (each server 1 chunk)• Read • from any of the replicated chunks • higher replication for hotspots (many reads of a chunk)• Elasticity • Add additional chunkservers - name node migrates chunks to new machine • Node failure (name node keeps track of chunk server status) - requests replication
  14. 14. Comparison• HDFS/GoogleFS • Fast block writes / reads • Fault tolerant • Flexible size adjustment • Terrible for random writes • Not really a filesystem• NFS & co. • No distributed file management• Lustre • Proper filesystem • High bandwidth • Explicitly exploits fast interconnects • Cabinet servers replicated with RAID5 • Fails if cabinet dies • Difficult to add more storage
  15. 15. MapReduce• Scenario • Lots of data (much more than what a single computer can process) • Stored in a distributed fashion • Move computation to data• Map: apply function f to data as distributed over the mappers; this runs (if possible) where the data sits• Reduce: combine data given keys generated in the Map phase• Fault tolerance • If mapper dies, re-process the data • If reducer dies, re-send the data (cached) from mappers (requires considerable storage) Dean, Ghemawat, 2004
  16. 16. Item count in MapReduce• Task: object counting• Map • Each mapper gets (unsorted) chunk of data • Preferably local on chunkservers • Perform local item counts • Emit (item, count) data• Reduce • Aggregate all counts for a given item (all end up at same reducer) • Emit aggregate counts (image: gridgain.com)
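For concreteness, a minimal Python sketch of the item-count pattern above, simulated in-process (no Hadoop); the toy chunks and the mapper/reducer function names are illustrative, not from the slides.

    from collections import defaultdict

    def mapper(chunk):
        # Emit (item, count) pairs for one local chunk of documents.
        local = defaultdict(int)
        for doc in chunk:
            for item in doc.split():
                local[item] += 1              # local aggregation before emitting
        return local.items()

    def reducer(item, counts):
        # All counts for a given item end up at the same reducer.
        return item, sum(counts)

    if __name__ == "__main__":
        chunks = [["a b a", "b c"], ["a c c", "b"]]   # pretend these live on different chunkservers
        shuffled = defaultdict(list)
        for chunk in chunks:                          # map phase
            for item, cnt in mapper(chunk):
                shuffled[item].append(cnt)            # shuffle: group by key
        print(dict(reducer(k, v) for k, v in shuffled.items()))  # reduce phase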
  18. 18. k-means in MapReduce• Initialize random cluster centers• Map • Assign each data point to a cluster based on current model • Aggregate data per cluster • Send cluster aggregates to reducer (e.g. mean, variance, size)• Reduce • Aggregate all data per cluster • Update cluster centers • Send new cluster centers to new reducers• Repeat until converged (needs to re-read data from disk for each MapReduce iteration) (image: mathworks.com)
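A sketch of one such iteration in plain Python/NumPy; the two data chunks, k = 2 and the fixed iteration count are illustrative assumptions.

    import numpy as np

    def kmeans_map(points, centers):
        # Map: assign each point to its nearest center, emit per-cluster (sum, count).
        stats = {}
        for x in points:
            k = int(np.argmin(((centers - x) ** 2).sum(axis=1)))
            s, n = stats.get(k, (np.zeros_like(x), 0))
            stats[k] = (s + x, n + 1)
        return stats

    def kmeans_reduce(per_mapper_stats, centers):
        # Reduce: aggregate sums/counts per cluster and recompute the centers.
        new_centers = centers.copy()
        for k in range(len(centers)):
            total, count = np.zeros_like(centers[k]), 0
            for stats in per_mapper_stats:
                if k in stats:
                    total, count = total + stats[k][0], count + stats[k][1]
            if count > 0:
                new_centers[k] = total / count
        return new_centers

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        data = [rng.normal(size=(100, 2)) + c for c in ([0, 0], [5, 5])]  # two "chunks"
        centers = rng.normal(size=(2, 2))
        for _ in range(10):   # each iteration corresponds to one MapReduce pass over the data
            centers = kmeans_reduce([kmeans_map(chunk, centers) for chunk in data], centers)
        print(centers)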
  19. 19. Dryad• Data flow graph (DAG rather than bipartite graph)• Interface variety • Memory FIFO • Disk • Network (image: Microsoft Research)• Modular composition of computation graph Isard, Budiu, Yu, Birrell & Fetterly, 2007
  20. 20. S4 - Online processing: dispatch with distributed hash. Neumeyer, Robbins, Nair, Kesari, 2011
  21. 21. Dataflow Paradigms
  22. 22. Pipeline CPU CPU CPU CPU• Process data sequentially• Parallelizes up to number of tasks (disk read, feature extraction, logging, output)• Reasonable for a single machine• Parallelization per filter possible (see e.g. Intel Threading Building Blocks)
  23. 23. Pipeline multicore CPU CPU CPU CPU CPU CPU CPU• Process data sequentially• Parallelizes up to number of tasks (disk read, feature extraction, logging, output)• Reasonable for a single machine• Parallelization per filter possible (see e.g. Intel Threading Building Blocks)
  24. 24. Tree• Aggregate data hierarchically• Parallelizes at O(log n) cost (tree traversal)• Communication at the interface is important (e.g. network latency)• Good dataflow processing• Poor efficiency for batched processing O(1/log n)• Poor fault tolerance• Does not map well onto server centers (need to ensure that lower leaves are on the same rack)
  25. 25. Star• Aggregate data centrally• Does not parallelize at all if communication cost is high• Perfect parallelization if CPU bound• Latency is O(1) unless the network is congested• Network requirements are O(n)• Central node becomes hotspot• Synchronization is very easy• Trivial to add more resources (just add leaves)• Difficult to parallelize center
  26. 26. Distributed (key,value) storage• Caching problem • Store many (key,value) pairs • Linear scalability in clients and servers • Automatic key distribution mechanism• memcached • (key,value) servers • client access library distributes access patterns via m(key, M) = argmin_{m ∈ M} h(key, m) • randomized O(n) bandwidth • aggregate O(n) bandwidth • load balancing via hashing • no versioned writes • P2P uses similar routing tables
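In the spirit of m(key, M) = argmin_{m ∈ M} h(key, m), a small sketch (rendezvous-style hashing): every client hashes (key, machine) pairs and picks the machine with the smallest hash, so all clients agree on the owner without a directory; the server names and keys are made up.

    import hashlib

    def h(key, machine):
        # Deterministic hash of a (key, machine) pair.
        return int(hashlib.md5(f"{key}:{machine}".encode()).hexdigest(), 16)

    def assign(key, machines):
        # m(key, M) = argmin over machines of h(key, m): same answer on every client.
        return min(machines, key=lambda m: h(key, m))

    if __name__ == "__main__":
        machines = ["server-%d" % i for i in range(4)]
        for key in ["user:42", "query:nips", "ad:7"]:
            print(key, "->", assign(key, machines))

Removing a machine only remaps the keys that were assigned to it, which is what makes this style of hashing attractive for elastic (key,value) stores.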
  27. 27. Proportional hashing/caching• Machines with different capacity• Hotspots too hot for a single machine to handle• Retain nearest neighbor hashing properties• Sparse cover of keyspace proportional to machine capacities• Repeatedly hash a key until it hits a key-range, i.e. hn(key)• Keep busy table and advance to next hash if machine already used Chawla, Reed, Juhnke, Syed, 2011, USENIX
  28. 28. Distributed Star• Aggregate data centrally• Use a different center for each key, as selected by distributed hashing: m(key, M) = argmin_{m ∈ M} h(key, m)• Linear bandwidth for synchronization• Perfect scalability, O(n) bandwidth required• Each CPU performs local computation and stores a small fraction of the global data• Works best if all nodes are on the same switch / rack
  31. 31. Ringbuffer• Problem • Disk, RAM and CPU operate at different speeds (>10x difference) • Want to do maximum data processing (e.g. optimization)• Idea • Load data from disk into ringbuffer • Process data continuously on buffer • Chain ringbuffers• Yields consistently maximum throughput for each resource
  32. 32. Summary• Hardware Servers, networks, amounts of data• Processing paradigms MapReduce, Dryad, S4• Communication templates Stars, pipelines, distributed hash table, caching
  33. 33. Part 2 - Motivation
  34. 34. Data on the Internet (unlimited amounts of data)• Webpages (content, graph)• Clicks (ad, page, social)• Users (OpenID, FB Connect)• e-mails (Hotmail, Y!Mail, Gmail)• Photos, Movies (Flickr, YouTube, Vimeo ...)• Cookies / tracking info (see Ghostery)• Installed apps (Android market etc.)• Location (Latitude, Loopt, Foursquare)• User generated content (Wikipedia & co)• Ads (display, text, DoubleClick, Yahoo)• Comments (Disqus, Facebook)• Reviews (Yelp, Y!Local)• Third party features (e.g. Experian)• Social connections (LinkedIn, Facebook)• Purchase decisions (Netflix, Amazon)• Instant Messages (YIM, Skype, Gtalk)• Search terms (Google, Bing)• Timestamp (everything)• News articles (BBC, NYTimes, Y!News)• Blog posts (Tumblr, Wordpress)• Microblogs (Twitter, Jaiku, Meme) Finite resources: editors are expensive, editors don’t know users, barrier to i18n, abuse (intrusions are novel), implicit feedback, data analysis (find interesting stuff rather than find x), integrating many systems, modular design for data integration, integrate with given prediction tasks. Invest in modeling and naming rather than data generation.
  36. 36. Unsupervised Modeling
  37. 37. Hierarchical Clustering NIPS 2010 Adams, Ghahramani, Jordan
  38. 38. Topics in text. Latent Dirichlet Allocation; Blei, Ng, Jordan, JMLR 2003
  39. 39. Word segmentation. Mochihashi, Yamada, Ueda, ACL 2009
  40. 40. Language model automatically synthesized from Penn Treebank Mochihashi, Yamada, Ueda ACL 2009
  41. 41. User model over time [plots of topic proportion over roughly 40 days: one user dominated by Baseball and Dating, another mixing Baseball, Finance, Celebrity, Jobs, Dating and Health]. Top words per topic: Baseball: League, baseball, basketball, doublehead, Bergesen, Griffey, bullpen, Greinke. Celebrity: Snooki, Tom Cruise, Katie Holmes, Pinkett, Kudrow, Hollywood. Dating: women, men, dating, singles, personals, seeking, match. Health: skin, body, fingers, cells, toes, wrinkle, layers. Jobs: job, career, business, assistant, hiring, part-time, receptionist. Finance: financial, Thomson, chart, real, Stock, Trading, currency. Ahmed et al., KDD 2011
  43. 43. Face recognition from captions Jain, Learned-Miller, McCallum, ICCV 2007
  44. 44. Storylines from news Ahmed et al, AISTATS 2011
  45. 45. Ideology detection. Ahmed et al, 2010; Bitterlemons collection
  46. 46. Hypertext topic extraction Gruber, Rosen-Zvi, Weiss; UAI 2008
  47. 47. Supervised Modeling
  48. 48. Ontologies • continuous maintenance • no guarantee of coverage • difficult categories
  50. 50. Face Classification/Recognition • 100-1000 people • 10k faces • curated (not realistic) • expensive to generate
  51. 51. Topic Detection & Tracking • editorially curated training data • expensive to generate • subjective in selection of threads • language specific
  52. 52. Advertising Targeting• Needs training data in every language• Is it really relevant for better ads?• Does it cover relevant areas?
  54. 54. Collaborative Filtering
  55. 55. Challenges• Scale • Millions to billions of instances (documents, clicks, users, messages, ads) • Rich structure of data (ontology, categories, tags) • Model description typically larger than memory of single workstation• Modeling • Usually clustering or topic models do not solve the problem • Temporal structure of data • Side information for variables • Solve problem. Don’t simply apply a model!• Inference • 10k-100k clusters for hierarchical model • 1M-100M words • Communication is an issue for large state space
  56. 56. Summary• Essentially infinite amount of data• Labeling is (in many cases) prohibitively expensive• Editorial data not scalable for i18n• Even for supervised problems unlabeled data abounds. Use it.• User-understandable structure for representation purposes• Solutions are often customized to problem We can only cover building blocks in tutorial.
  57. 57. Part 3 - Basic Tools
  58. 58. Statistics 101
  59. 59. Probability• Space of events X • server working; slow response; server broken • income of the user (e.g. $95,000) • query text for search (e.g. “statistics tutorial”)• Probability axioms (Kolmogorov): Pr(A) ∈ [0, 1] for any event A, Pr(X) = 1 for the whole space X, Pr(∪_i A_i) = Σ_i Pr(A_i) if A_i ∩ A_j = ∅• Example queries • P(server working) = 0.999 • P(90,000 ≤ income ≤ 100,000) = 0.1
  60. 60. Venn Diagram: events X, X′ and their intersection X ∩ X′ within the space of all events. Pr(X ∪ X′) = Pr(X) + Pr(X′) − Pr(X ∩ X′)
  63. 63. (In)dependence• Independence: Pr(x, y) = Pr(x) · Pr(y) • Login behavior of two users (approximately) • Disk crash in different colos (approximately)• Dependent events: Pr(x, y) ≠ Pr(x) · Pr(y) • Emails • Queries • News stream / Buzz / Tweets • IM communication • Russian Roulette • ... everywhere!
  66. 66. Independence 0.25 0.25 0.25 0.25
  67. 67. Dependence 0.45 0.05 0.05 0.45
  68. 68. A Graphical Model: Spam → Mail, p(spam, mail) = p(spam) p(mail|spam)
  69. 69. Bayes Rule• Joint Probability Pr(X, Y ) = Pr(X|Y ) Pr(Y ) = Pr(Y |X) Pr(X)• Bayes Rule Pr(Y |X) · Pr(X) Pr(X|Y ) = Pr(Y )• Hypothesis testing• Reverse conditioning
  71. 71. AIDS test (Bayes rule)• Data • Approximately 0.1% are infected • Test detects all infections • Test reports positive for 1% healthy people• Probability of having AIDS if the test is positive: Pr(a = 1|t) = Pr(t|a = 1) · Pr(a = 1) / Pr(t) = Pr(t|a = 1) · Pr(a = 1) / [Pr(t|a = 1) · Pr(a = 1) + Pr(t|a = 0) · Pr(a = 0)] = 1 · 0.001 / (1 · 0.001 + 0.01 · 0.999) = 0.091
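The same computation in a few lines of Python, using exactly the numbers from the slide.

    p_infected = 0.001          # prior: 0.1% infected
    p_pos_given_infected = 1.0  # the test detects all infections
    p_pos_given_healthy = 0.01  # 1% false positives

    p_pos = (p_pos_given_infected * p_infected
             + p_pos_given_healthy * (1 - p_infected))
    posterior = p_pos_given_infected * p_infected / p_pos
    print(round(posterior, 3))  # 0.091, as above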
  75. 75. Improving the diagnosis• Use a follow-up test • Test 2 reports positive for 90% infections • Test 2 reports positive for 5% healthy people • Posterior of being healthy given two positive tests: 0.01 · 0.05 · 0.999 / (1 · 0.9 · 0.001 + 0.01 · 0.05 · 0.999) = 0.357, i.e. the probability of infection rises to 0.643• Why can’t we use Test 1 twice? Its outcomes are not independent, but tests 1 and 2 are conditionally independent: p(t1, t2|a) = p(t1|a) · p(t2|a)
  76. 76. Logarithms are good• Floating point numbers: store π = log p (1 sign bit, 11 exponent bits, 52 mantissa bits)• Probabilities can be very small, in particular products of many probabilities. Underflow!• Store data in mantissa, not exponent: p_i → π_i = log p_i and Σ_i p_i → max_i π_i + log Σ_i exp[π_i − max_i π_i]• Known bug e.g. in Mahout Dirichlet clustering
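A minimal log-sum-exp helper implementing the trick above; the example values are only there to show the underflow being avoided.

    import math

    def log_sum_exp(log_ps):
        # log(sum_i exp(pi_i)) computed by factoring out max_i pi_i.
        m = max(log_ps)
        return m + math.log(sum(math.exp(p - m) for p in log_ps))

    if __name__ == "__main__":
        log_ps = [-1000.0, -1001.0, -1002.0]   # exp() of these underflows to 0.0
        print(log_sum_exp(log_ps))             # about -999.59, still finite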
  77. 77. Application: Naive Bayes
  81. 81. Naive Bayes Spam Filter• Key assumption: words occur independently of each other given the label of the document, p(w_1, ..., w_n|spam) = ∏_{i=1}^n p(w_i|spam)• Spam classification via Bayes Rule: p(spam|w_1, ..., w_n) ∝ p(spam) ∏_{i=1}^n p(w_i|spam)• Parameter estimation: compute spam probability and word distributions for spam and ham
  82. 82. Naive Bayes Spam Filter Equally likely phrases• Get rich quick. Buy WWW stock.• Buy Viagra. Make your WWW experience last longer.• You deserve a PhD from WWW University. We recognize your expertise.• Make your rich WWW PhD experience last longer.
  87. 87. A Graphical Model: spam → w_1, w_2, ..., w_n; in plate notation spam → w_i, i = 1..n. p(w_1, ..., w_n|spam) = ∏_{i=1}^n p(w_i|spam). How to estimate p(w|spam)?
  89. 89. Simple Algorithm• For each document (x,y) do • Aggregate label counts given y • For each feature x_i in x do • Aggregate statistic for (x_i, y) for each y (single pass over all data)• For y estimate distribution p(y)• For each (x_i, y) pair do: estimate distribution p(x_i|y), e.g. Parzen Windows, Exponential family (Gauss, Laplace, Poisson, ...), Mixture (trivially parallel)• Given new instance compute p(y|x) ∝ p(y) ∏_j p(x_j|y)
  91. 91. MapReduce Algorithm• Map(document (x,y)) • For each mapper for each feature x_i in x do • Aggregate statistic for (x_i, y) for each y (local per chunkserver; only aggregates needed) • Send aggregate statistics to reducer• Reduce(x_i, y) • Aggregate over all messages from mappers • Estimate distribution p(x_i|y), e.g. Parzen Windows, Exponential family (Gauss, Laplace, Poisson, ...), Mixture • Send coordinate-wise model to global storage• Given new instance compute p(y|x) ∝ p(y) ∏_j p(x_j|y)
  92. 92. Naive NaiveBayes Classifier • Two classes (spam/ham) • Binary features (e.g. presence of $$$, viagra) • Simplistic Algorithm • Count occurrences of feature for spam/ham • Count number of spam/ham mails • Feature probability p(x_i = TRUE|y) = n(i, y)/n(y) and spam probability p(y) = n(y)/n • p(y|x) ∝ (n(y)/n) ∏_{i: x_i = TRUE} n(i, y)/n(y) ∏_{i: x_i = FALSE} (n(y) − n(i, y))/n(y)
  93. 93. Naive NaiveBayes Classifier: what if n(i,y) = n(y)? what if n(i,y) = 0? p(y|x) ∝ (n(y)/n) ∏_{i: x_i = TRUE} n(i, y)/n(y) ∏_{i: x_i = FALSE} (n(y) − n(i, y))/n(y)
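A toy sketch of the counting classifier with m 'fake' counts added to every feature so that no factor is ever exactly 0 or 1 (the questions above); the tiny spam/ham corpus and m = 1 are illustrative.

    from collections import defaultdict

    def train(docs):
        # docs: list of (set_of_features, label); count n(y) and n(i, y).
        n_y, n_iy = defaultdict(int), defaultdict(int)
        for features, y in docs:
            n_y[y] += 1
            for i in features:
                n_iy[(i, y)] += 1
        return n_y, n_iy

    def predict(features, vocab, n_y, n_iy, m=1.0):
        # p(y|x) proportional to p(y) * prod_i p(x_i|y), with smoothed feature probabilities.
        n = sum(n_y.values())
        scores = {}
        for y, ny in n_y.items():
            s = ny / n
            for i in vocab:
                p_true = (n_iy[(i, y)] + m) / (ny + 2 * m)   # smoothed p(x_i = TRUE | y)
                s *= p_true if i in features else (1 - p_true)
            scores[y] = s
        z = sum(scores.values())
        return {y: s / z for y, s in scores.items()}

    if __name__ == "__main__":
        docs = [({"viagra", "$$$"}, "spam"), ({"viagra"}, "spam"),
                ({"meeting"}, "ham"), ({"meeting", "report"}, "ham")]
        vocab = {"viagra", "$$$", "meeting", "report"}
        n_y, n_iy = train(docs)
        print(predict({"viagra", "report"}, vocab, n_y, n_iy))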
  95. 95. Estimating Probabilities
  96. 96. Binomial Distribution• Two outcomes (head, tail); (0,1)• Data likelihood p(X; π) = π^{n_1} (1 − π)^{n_0}• Maximum Likelihood Estimation • Constrained optimization problem π ∈ [0, 1] • Incorporate constraint via p(x; θ) = e^{xθ}/(1 + e^θ) • Taking derivatives yields θ = log(n_1/n_0) ⟺ p(x = 1) = n_1/(n_0 + n_1)
  98. 98. ... in detail ... p(X; θ) = ∏_{i=1}^n p(x_i; θ) = ∏_{i=1}^n e^{θ x_i}/(1 + e^θ) ⟹ log p(X; θ) = θ Σ_{i=1}^n x_i − n log(1 + e^θ) ⟹ ∂_θ log p(X; θ) = Σ_{i=1}^n x_i − n e^θ/(1 + e^θ) = 0 ⟺ (1/n) Σ_{i=1}^n x_i = e^θ/(1 + e^θ) = p(x = 1), the empirical probability of x = 1
  99. 99. Discrete Distribution• n outcomes (e.g. USA, Canada, India, UK, NZ)• Data likelihood p(X; π) = ∏_i π_i^{n_i}• Maximum Likelihood Estimation • Constrained optimization problem ... or ... • Incorporate constraint via p(x; θ) = exp(θ_x)/Σ_{x′} exp(θ_{x′}) • Taking derivatives yields θ_i = log(n_i/Σ_j n_j) ⟺ p(x = i) = n_i/Σ_j n_j
  100. 100. Tossing a Dice [histograms of empirical frequencies for 12, 24, 60, 120 samples]
  106. 106. Exponential Families• Density function p(x; θ) = exp(⟨φ(x), θ⟩ − g(θ)) where g(θ) = log Σ_{x′} exp(⟨φ(x′), θ⟩)• Log partition function generates cumulants: ∂_θ g(θ) = E[φ(x)] and ∂_θ² g(θ) = Var[φ(x)]• g is convex (second derivative is p.s.d.)
  107. 107. Examples• Binomial Distribution: φ(x) = x• Discrete Distribution: φ(x) = e_x (e_x is unit vector for x)• Gaussian: φ(x) = (x, ½ x xᵀ)• Poisson (counting measure 1/x!): φ(x) = x• Dirichlet, Beta, Gamma, Wishart, ...
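A quick numerical check of the cumulant identities for the binomial case (φ(x) = x, g(θ) = log(1 + e^θ)): finite differences of g reproduce the mean and variance. The value θ = 0.7 and the step size are arbitrary choices for the check.

    import math

    def g(theta):                       # log partition function of the binomial family
        return math.log(1.0 + math.exp(theta))

    theta, eps = 0.7, 1e-5
    mean = math.exp(theta) / (1.0 + math.exp(theta))              # E[phi(x)] = p(x = 1)
    var = mean * (1.0 - mean)                                     # Var[phi(x)]
    dg = (g(theta + eps) - g(theta - eps)) / (2 * eps)            # numerical first derivative
    d2g = (g(theta + eps) - 2 * g(theta) + g(theta - eps)) / eps ** 2
    print(dg, mean)    # first derivative of g matches the mean
    print(d2g, var)    # second derivative of g matches the variance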
  108. 108. Normal Distribution
  109. 109. Poisson Distribution: p(x; λ) = λ^x e^{−λ}/x!
  110. 110. Beta Distribution: p(x; α, β) = x^{α−1}(1 − x)^{β−1}/B(α, β)
  111. 111. Dirichlet Distribution... this is a distribution over distributions ...
  114. 114. Maximum Likelihood• Negative log-likelihood: −log p(X; θ) = m g(θ) − Σ_{i=1}^m ⟨φ(x_i), θ⟩• Taking derivatives: −∂_θ log p(X; θ) = m [E[φ(x)] − (1/m) Σ_{i=1}^m φ(x_i)], i.e. the gap between the model mean and the empirical average. We pick the parameter such that the distribution matches the empirical average.
  115. 115. Conjugate Priors• Unless we have lots of data, estimates are weak• Usually we have an idea of what to expect: p(θ|X) ∝ p(X|θ) · p(θ); we might even have ‘seen’ such data before• Solution: add ‘fake’ observations p(θ) ∝ p(X_fake|θ), hence p(θ|X) ∝ p(X|θ) p(X_fake|θ) = p(X ∪ X_fake|θ)• Inference (generalized Laplace smoothing): (1/n) Σ_{i=1}^n φ(x_i) → (1/(n + m)) Σ_{i=1}^n φ(x_i) + (m/(n + m)) μ_0, with fake count m and fake mean μ_0
  116. 116. Example: Gaussian Estimation• Sufficient statistics: x, x²• Mean and variance given by μ = E_x[x] and σ² = E_x[x²] − E_x[x]²• Maximum Likelihood Estimate: μ̂ = (1/n) Σ_{i=1}^n x_i and σ̂² = (1/n) Σ_{i=1}^n x_i² − μ̂²• Maximum a Posteriori Estimate (smoother): μ̂ = (1/(n + n_0)) Σ_{i=1}^n x_i and σ̂² = (1/(n + n_0)) Σ_{i=1}^n x_i² + n_0/(n + n_0) − μ̂²
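A sketch contrasting the two estimators on a tiny sample; reading the smoother as n_0 pseudo-observations with mean 0 and second moment 1 is an assumption about the formula above, not something the slide pins down.

    import numpy as np

    def gaussian_mle(x):
        # Maximum likelihood: match the empirical mean and second moment.
        mu = x.mean()
        return mu, (x ** 2).mean() - mu ** 2

    def gaussian_map(x, n0=10.0):
        # Smoothed estimate: blend in n0 pseudo-observations (assumed mean 0, second moment 1).
        n = len(x)
        mu = x.sum() / (n + n0)
        return mu, ((x ** 2).sum() + n0) / (n + n0) - mu ** 2

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        x = rng.normal(loc=2.0, scale=0.5, size=5)   # tiny sample: the MLE variance is unreliable
        print(gaussian_mle(x))
        print(gaussian_map(x))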
  117. 117. Collapsing • Conjugate priors p(θ) ∝ p(X_fake|θ): hence we know how to compute the normalization • Prediction: p(x|X) = ∫ p(x|θ) p(θ|X) dθ ∝ ∫ p(x|θ) p(X|θ) p(X_fake|θ) dθ = ∫ p({x} ∪ X ∪ X_fake|θ) dθ • Closed-form expansions exist for conjugate pairs (Beta, binomial), (Dirichlet, multinomial), (Gamma, Poisson), (Wishart, Gauss); see http://en.wikipedia.org/wiki/Exponential_family
  120. 120. Conjugate Prior in action• Discrete Distribution: p(x = i) = n_i/n → p(x = i) = (n_i + m_i)/(n + m) with m_i = m · [μ_0]_i• Tossing a dice: Outcome 1 2 3 4 5 6 | Counts 3 6 2 1 4 4 | MLE 0.15 0.30 0.10 0.05 0.20 0.20 | MAP (m0 = 6) 0.15 0.27 0.12 0.08 0.19 0.19 | MAP (m0 = 100) 0.16 0.19 0.16 0.15 0.17 0.17• Rule of thumb: need 10 data points (or prior) per parameter
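The dice table above, reproduced in a few lines (m = 0 recovers the MLE):

    counts = [3, 6, 2, 1, 4, 4]
    mu0 = [1.0 / 6] * 6                      # uniform prior mean

    def map_estimate(counts, m, mu0):
        # p(x = i) = (n_i + m * [mu0]_i) / (n + m)
        n = sum(counts)
        return [(c + m * p0) / (n + m) for c, p0 in zip(counts, mu0)]

    for m in (0, 6, 100):
        print(m, [round(p, 2) for p in map_estimate(counts, m, mu0)])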
  121. 121. Honest dice: MLE vs. MAP estimates [figure]
  122. 122. Tainted dice: MLE vs. MAP estimates [figure]
  123. 123. Priors (part deux)• Parameter smoothing: p(θ) ∝ exp(−λ‖θ‖_1) or p(θ) ∝ exp(−λ‖θ‖_2²)• Posterior: p(θ|x) ∝ ∏_{i=1}^m p(x_i|θ) p(θ) ∝ exp(Σ_{i=1}^m ⟨φ(x_i), θ⟩ − m g(θ) − ‖θ‖_2²/(2σ²))• Convex optimization problem (MAP estimation): minimize over θ of g(θ) − (1/m) Σ_{i=1}^m ⟨φ(x_i), θ⟩ + ‖θ‖_2²/(2mσ²)
  124. 124. Summary• Basic statistics tools• Estimating probabilities (mainly scalar)• Exponential family introduction
  125. 125. Part 4: Directed Graphical Models
  126. 126. Basics
  129. 129. ... some Web 2.0 service: MySQL, Apache → Website• Joint distribution (assume a and m are independent): p(m, a, w) = p(w|m, a) p(m) p(a)• Explaining away: p(m, a|w) = p(w|m, a) p(m) p(a) / Σ_{m′, a′} p(w|m′, a′) p(m′) p(a′); a and m are dependent conditioned on w
  132. 132. ... some Web 2.0 service: MySQL, Apache → Website. If the website is broken, then at least one of the two services is broken, so “MySQL is working” and “Apache is working” are no longer independent given that observation
  133. 133. Directed graphical model (m, a → w; user → action)• Easier estimation • 15 parameters for full joint distribution • 1+1+3+1 for factorizing distribution• Causal relations• Inference for unobserved variables
  134. 134. No loops allowed: a cycle c ⇄ e would give p(c|e) p(e|c), which is not a valid joint distribution; a DAG gives p(c|e)p(e) or p(e|c)p(c)
  141. 141. Directed Graphical Model• Joint probability distribution p(x) = ∏_i p(x_i|x_parents(i))• Parameter estimation • If x is fully observed the likelihood breaks up: log p(x|θ) = Σ_i log p(x_i|x_parents(i), θ) • If x is partially observed things get interesting: maximization, EM, variational, sampling ...
  147. 147. Chains• Markov Chain (plate notation): p(x; θ) = p(x_0; θ) ∏_{i=1}^{n−1} p(x_{i+1}|x_i; θ)• Hidden Markov Chain (latent user mindset, observed user action): p(x, y; θ) = p(x_0; θ) ∏_{i=1}^{n−1} p(x_{i+1}|x_i; θ) ∏_{i=1}^n p(y_i|x_i)• Example: user model for traversal through search results
  150. 150. Factor Graphs Latent Factors Observed Effects• Observed effects Click behavior, queries, watched news, emails• Latent factors User profile, news content, hot keywords, social connectivity graph, events
  154. 154. Recommender Systems (u, m → r with intersecting plates, like nested for loops; applications: news, SearchMonkey, answers, social ranking, OMG, personals)• Users u• Movies m• Ratings r (but only for a subset of users)
  158. 158. Challenges• How to design models (your job) • Common (engineering) sense • Computational tractability• Dependency analysis (my job) • Bayes ball (not in this lecture)• Inference • Easy for fully observed situations • Many algorithms if not fully observed • Dynamic programming / message passing
  159. 159. Dynamic Programming 101
  160. 160. Chains: x_0 → x_1 → x_2 → x_3 with p(x; θ) = p(x_0; θ) ∏_{i=1}^{n−1} p(x_{i+1}|x_i; θ). Computing a marginal by summing out one variable at a time: p(x_i) = Σ_{x_0,...,x_{i−1},x_{i+1},...,x_n} l_0(x_0) ∏_{j=1}^n p(x_j|x_{j−1}) with l_0(x_0) := p(x_0) = Σ_{x_1,...,x_{i−1},x_{i+1},...,x_n} l_1(x_1) ∏_{j=2}^n p(x_j|x_{j−1}) with l_1(x_1) := Σ_{x_0} l_0(x_0) p(x_1|x_0) = Σ_{x_2,...,x_{i−1},x_{i+1},...,x_n} l_2(x_2) ∏_{j=3}^n p(x_j|x_{j−1}) with l_2(x_2) := Σ_{x_1} l_1(x_1) p(x_2|x_1); and so on
  164. 164. Chains (continued): p(x_i) = Σ_{x_{i+1},...,x_n} l_i(x_i) ∏_{j=i}^{n−1} p(x_{j+1}|x_j) = Σ_{x_{i+1},...,x_{n−1}} l_i(x_i) ∏_{j=i}^{n−2} p(x_{j+1}|x_j) r_{n−1}(x_{n−1}) with r_{n−1}(x_{n−1}) := Σ_{x_n} p(x_n|x_{n−1}) = Σ_{x_{i+1},...,x_{n−2}} l_i(x_i) ∏_{j=i}^{n−3} p(x_{j+1}|x_j) r_{n−2}(x_{n−2}) with r_{n−2}(x_{n−2}) := Σ_{x_{n−1}} p(x_{n−1}|x_{n−2}) r_{n−1}(x_{n−1}); and so on
  165. 165. Chains• Forward recursion: l_0(x_0) := p(x_0) and l_i(x_i) := Σ_{x_{i−1}} l_{i−1}(x_{i−1}) p(x_i|x_{i−1})• Backward recursion: r_n(x_n) := 1 and r_i(x_i) := Σ_{x_{i+1}} r_{i+1}(x_{i+1}) p(x_{i+1}|x_i)• Marginalization: p(x_i) = l_i(x_i) r_i(x_i) and p(x_i, x_{i+1}) = l_i(x_i) p(x_{i+1}|x_i) r_{i+1}(x_{i+1})• Conditioning: p(x_{−i}|x_i) = p(x)/p(x_i)
  166. 166. Chains x_0 − x_1 − x_2 − x_3 − x_4 − x_5• Send forward messages starting from the left node: m_{i−1→i}(x_i) = Σ_{x_{i−1}} m_{i−2→i−1}(x_{i−1}) f(x_{i−1}, x_i)• Send backward messages starting from the right node: m_{i+1→i}(x_i) = Σ_{x_{i+1}} m_{i+2→i+1}(x_{i+1}) f(x_i, x_{i+1})
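A small NumPy sketch of these recursions for a homogeneous chain without observations; the initial distribution and transition matrix are made up, and with observations one would additionally multiply a local evidence term into each step.

    import numpy as np

    def chain_marginals(p0, P, n):
        # p0: initial distribution (K,), P[i, j] = p(x_{t+1} = j | x_t = i).
        K = len(p0)
        l = np.zeros((n, K))
        r = np.ones((n, K))
        l[0] = p0
        for t in range(1, n):             # forward:  l_t = sum_{x_{t-1}} l_{t-1} p(x_t | x_{t-1})
            l[t] = l[t - 1] @ P
        for t in range(n - 2, -1, -1):    # backward: r_t = sum_{x_{t+1}} p(x_{t+1} | x_t) r_{t+1}
            r[t] = P @ r[t + 1]
        return l * r                      # p(x_t) = l_t(x_t) r_t(x_t)

    if __name__ == "__main__":
        p0 = np.array([0.6, 0.4])
        P = np.array([[0.9, 0.1],
                      [0.2, 0.8]])
        print(chain_marginals(p0, P, 5))  # each row sums to 1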
  167. 167. Trees: a tree over x_0, ..., x_8 where vertex x_2 has neighbors x_1, x_3 and x_6• Forward/backward messages as for the chain• When a vertex has more than two edges, combine all incoming messages except the one from the target: m_{2→3}(x_3) = Σ_{x_2} m_{1→2}(x_2) m_{6→2}(x_2) f(x_2, x_3); m_{2→6}(x_6) = Σ_{x_2} m_{1→2}(x_2) m_{3→2}(x_2) f(x_2, x_6); m_{2→1}(x_1) = Σ_{x_2} m_{3→2}(x_2) m_{6→2}(x_2) f(x_1, x_2)
  174. 174. Junction Trees: cliques {1,2}, {2,3,4}, {2,4,5}, {4,5,7} connected by separators {2}, {2,4}, {4,5}. Message passing between cliques: m_{i→j}(x_j) = Σ_{x_i} f(x_i, x_j) ∏_{l≠j} m_{l→i}(x_i), e.g. m_{245→234}(x_{24}) = Σ_{x_5} f(x_{245}) m_{12→245}(x_2) m_{457→245}(x_{45}) (clique potentials, separator sets)
  189. 189. Generalized Distributive Law• Key Idea: dynamic programming uses only sums and multiplications, hence replace them with equivalent operations from other semirings• Semiring • ‘addition’ and ‘multiplication’ operations • Associative law: (a+b)+c = a+(b+c) • Distributive law: a(b+c) = ab + ac
  190. 190. Generalized Distributive Law• Integrating out probabilities (sum, product) a · (b + c) = a · b + a · c• Finding the maximum (max, +) a + max(b, c) = max(a + b, a + c)• Set algebra (union, intersection) A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)• Boolean semiring (AND, OR)• Probability semiring (log +, +)• Tropical semiring (min, +)
  191. 191. Chains ... again: for scores s (e.g. log probabilities) on the chain x_0 → x_1 → x_2 → x_3, s̄ = max_x [s(x_0) + Σ_{i=1}^{n−1} s(x_{i+1}|x_i)] = max_{x_1,...,x_n} [l_1(x_1) + Σ_{j=2}^n s(x_j|x_{j−1})] with l_1(x_1) := max_{x_0} [l_0(x_0) + s(x_1|x_0)] and l_0(x_0) := s(x_0) = max_{x_2,...,x_n} [l_2(x_2) + Σ_{j=3}^n s(x_j|x_{j−1})] with l_2(x_2) := max_{x_1} [l_1(x_1) + s(x_2|x_1)]; and so on
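The same recursion in code is the classic Viterbi algorithm: replace sums by max, keep back-pointers, and trace the best assignment. A sketch with made-up scores (log probabilities):

    import numpy as np

    def viterbi(s0, S, n):
        # s0[k] = s(x_0 = k); S[i, j] = s(x_{t+1} = j | x_t = i); returns (best score, best path).
        K = len(s0)
        l = np.zeros((n, K))
        back = np.zeros((n, K), dtype=int)
        l[0] = s0
        for t in range(1, n):
            cand = l[t - 1][:, None] + S      # cand[i, j] = l_{t-1}(i) + s(j | i)
            back[t] = cand.argmax(axis=0)     # best predecessor for each state j
            l[t] = cand.max(axis=0)           # l_t(j) = max_i [l_{t-1}(i) + s(j | i)]
        path = [int(l[-1].argmax())]
        for t in range(n - 1, 0, -1):         # trace the argmax back through the chain
            path.append(int(back[t][path[-1]]))
        return float(l[-1].max()), path[::-1]

    if __name__ == "__main__":
        s0 = np.log(np.array([0.6, 0.4]))
        S = np.log(np.array([[0.9, 0.1],
                             [0.2, 0.8]]))
        print(viterbi(s0, S, 5))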
  192. 192. Junction Trees (max version): m_{i→j}(x_j) = max_{x_i} [f(x_i, x_j) + Σ_{l≠j} m_{l→i}(x_i)], e.g. m_{245→234}(x_{24}) = max_{x_5} [f(x_{245}) + m_{12→245}(x_2) + m_{457→245}(x_{45})], with the same cliques {1,2}, {2,3,4}, {2,4,5}, {4,5,7} and separators {2}, {2,4}, {4,5} as before
