# Submodularity slides

Slide notes:
- Get a (p+1)-approximation for C = intersection of p matroids; (1 − 1/e) w.h.p. for any matroid using continuous greedy + pipage rounding.
- Random placement: 53%; uniform: 73%; optimized: 81%.

1. Beyond Convexity – Submodularity in Machine Learning. Andreas Krause, Carlos Guestrin, Carnegie Mellon University. International Conference on Machine Learning, July 5, 2008.
2. Acknowledgements. Thanks for slides and material to Mukund Narasimhan, Jure Leskovec and Manuel Reyes-Gomez. MATLAB toolbox and details for references available at http://www.submodularity.org.
3. Optimization in Machine Learning. Classify + from – by finding a separating hyperplane (parameters w). Which one should we choose? Define the loss L(w) = 1/(size of margin) and solve for the best vector w* = argmin_w L(w). Key observation: many problems in ML are convex → no local minima!
4. Feature selection. Given random variables Y, X_1, …, X_n, want to predict Y from a subset X_A = (X_{i1}, …, X_{ik}). Example: Naïve Bayes model with Y = "Sick" and features "Fever", "Rash", "Male". Want the k most informative features: A* = argmax IG(X_A; Y) s.t. |A| ≤ k, where IG(X_A; Y) = H(Y) − H(Y | X_A) is the uncertainty about Y before knowing X_A minus the uncertainty after knowing X_A.
5. Factoring distributions. Given random variables X_1, …, X_n, partition the variables V into sets A and V\A that are as independent as possible. Formally: A* = argmin_A I(X_A; X_{V\A}) s.t. 0 < |A| < n, where I(X_A; X_B) = H(X_B) − H(X_B | X_A). A fundamental building block in structure learning [Narasimhan & Bilmes, UAI '04].
6. Combinatorial problems in ML. Given a (finite) set V and a function F: 2^V → R, want A* = argmin F(A) subject to some constraints on A. Solving combinatorial problems: mixed integer programming? Often difficult to scale to large problems. Relaxations (e.g., L1 regularization, etc.)? Not clear when they work. This talk: fully combinatorial algorithms (spanning tree, matching, …) that exploit problem structure to get guarantees about the solution!
7. Example: Greedy algorithm for feature selection. Given: finite set V of features, utility function F(A) = IG(X_A; Y). Want: A* ⊆ V maximizing F(A) subject to |A| ≤ k. NP-hard! Greedy algorithm: start with A = ∅; for i = 1 to k: s* := argmax_s F(A ∪ {s}); A := A ∪ {s*}. How well can this simple heuristic do? (A sketch of this loop in code follows below.)
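The greedy loop on slide 7 is easy to state in code. A minimal Python sketch, assuming only that `F` is a black-box set function as on slide 13; the names `greedy_maximize`, `F`, and `V` are illustrative, not from the toolbox mentioned on slide 2:

```python
def greedy_maximize(F, V, k):
    """Greedily add, k times, the element with the largest
    marginal gain F(A ∪ {s}) − F(A)."""
    A = set()
    for _ in range(min(k, len(V))):
        best = max((s for s in V if s not in A),
                   key=lambda s: F(A | {s}) - F(A))
        A.add(best)
    return A
```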
8. Key property: Diminishing returns. Compare selection A = {} with selection B = {X_2, X_3}: adding X_1 ("Fever") to the empty set will help a lot; adding X_1 after "Rash" and "Male" doesn't help much. Theorem [Krause, Guestrin UAI '05]: the information gain F(A) in Naïve Bayes models is submodular! Submodularity: for A ⊆ B, F(A ∪ {s}) − F(A) ≥ F(B ∪ {s}) − F(B) — adding s to the small set gives a large improvement, adding it to the large set a small improvement.
9. Why is submodularity useful? Theorem [Nemhauser et al '78]: the greedy maximization algorithm returns A_greedy with F(A_greedy) ≥ (1 − 1/e) max_{|A| ≤ k} F(A), i.e., ~63% of optimal. The greedy algorithm gives a near-optimal solution! More details and the exact statement later. For info-gain, this guarantee is best possible unless P = NP! [Krause, Guestrin UAI '05]
10. Submodularity in Machine Learning. In this tutorial we will see that many ML problems are submodular, i.e., involve a submodular F in: Minimization, A* = argmin F(A): structure learning (A* = argmin I(X_A; X_{V\A})), clustering, MAP inference in Markov Random Fields, … Maximization, A* = argmax F(A): feature selection, active learning, ranking, …
11. Tutorial Overview. Examples and properties of submodular functions; submodularity and convexity; minimizing submodular functions; maximizing submodular functions; research directions, … LOTS of applications to Machine Learning!
12. Submodularity: Properties and Examples.
13. Set functions. Finite set V = {1, 2, …, n}; function F: 2^V → R. We will always assume F(∅) = 0 (w.l.o.g.), and assume a black box that can evaluate F for any input A; approximate (noisy) evaluation of F is OK (e.g., [37]). Example: F(A) = IG(X_A; Y) = H(Y) − H(Y | X_A) = ∑_{y,x_A} P(x_A) [log P(y | x_A) − log P(y)]. E.g., F({X_1, X_2}) = 0.9, F({X_2, X_3}) = 0.5.
14. Submodular set functions. A set function F on V is called submodular if for all A, B ⊆ V: F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B). Equivalent diminishing-returns characterization: for A ⊆ B and s ∉ B, F(A ∪ {s}) − F(A) ≥ F(B ∪ {s}) − F(B).
15. Submodularity and supermodularity. A set function F on V is submodular iff 1) for all A, B ⊆ V: F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B), equivalently 2) for all A ⊆ B and s ∉ B: F(A ∪ {s}) − F(A) ≥ F(B ∪ {s}) − F(B). F is called supermodular if −F is submodular, and modular if F is both sub- and supermodular; for modular ("additive") F, F(A) = ∑_{i ∈ A} w(i).
16. Example: Set cover. Place sensors in a building: want to cover the floorplan with discs centered at possible locations V, where each node predicts values of positions within some radius. For A ⊆ V: F(A) = "area covered by sensors placed at A". Formally: W a finite set, with a collection of n subsets S_i ⊆ W; for A ⊆ V = {1, …, n} define F(A) = |∪_{i ∈ A} S_i|.
17. Set cover is submodular. With A = {S_1, S_2} and B = {S_1, S_2, S_3, S_4}, a new disc S' adds at least as much newly covered area to A as to B: F(A ∪ {S'}) − F(A) ≥ F(B ∪ {S'}) − F(B). (A spot-check of this inequality in code follows below.)
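A quick numerical spot-check of the diminishing-returns inequality for coverage. The world and the subsets below are made-up stand-ins for the discs S1…S4 and S', not the floorplan data:

```python
# Made-up coverage sets: keys are disc indices, values the covered points.
S = {1: {"w1", "w2"}, 2: {"w2", "w3"}, 3: {"w4"},
     4: {"w5", "w6"}, 5: {"w3", "w4", "w7"}}   # 5 plays the role of S'

def coverage(A):
    """F(A) = size of the union of the chosen sets."""
    return len(set().union(*(S[i] for i in A))) if A else 0

A, B, s = {1, 2}, {1, 2, 3, 4}, 5              # A ⊆ B, s ∉ B
gain_A = coverage(A | {s}) - coverage(A)       # 5 - 3 = 2
gain_B = coverage(B | {s}) - coverage(B)       # 7 - 6 = 1
assert gain_A >= gain_B                        # diminishing returns holds
```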
18. Example: Mutual information. Given random variables X_1, …, X_n, let F(A) = I(X_A; X_{V\A}) = H(X_{V\A}) − H(X_{V\A} | X_A). Lemma: mutual information F(A) is submodular. Indeed, F(A ∪ {s}) − F(A) = H(X_s | X_A) − H(X_s | X_{V\(A ∪ {s})}); the first term is nonincreasing in A (A ⊆ B ⇒ H(X_s | X_A) ≥ H(X_s | X_B)) and the second is nondecreasing in A, so δ_s(A) = F(A ∪ {s}) − F(A) is monotonically nonincreasing, i.e., F is submodular.
19. Example: Influence in social networks [Kempe, Kleinberg, Tardos KDD '03]. A graph on Alice, Bob, Charlie, Dorothy, Eric, Fiona, with each edge labeled by the probability of influencing (0.2, 0.3, 0.4, 0.5, …). Who should get free cell phones? V = {Alice, Bob, Charlie, Dorothy, Eric, Fiona}; F(A) = expected number of people influenced when targeting A.
20. Influence in social networks is submodular [Kempe, Kleinberg, Tardos KDD '03]. Key idea: flip coins c in advance → "live" edges. F_c(A) = number of people influenced under outcome c (a set cover function!), so F(A) = ∑_c P(c) F_c(A) is submodular as well!
21. Closedness properties. Let F_1, …, F_m be submodular functions on V and λ_1, …, λ_m > 0. Then F(A) = ∑_i λ_i F_i(A) is submodular! Submodularity is closed under nonnegative linear combinations — an extremely useful fact: F_θ(A) submodular ⇒ ∑_θ P(θ) F_θ(A) submodular! Multicriterion optimization: F_1, …, F_m submodular, λ_i ≥ 0 ⇒ ∑_i λ_i F_i(A) submodular.
22. Submodularity and concavity. Suppose g: N → R and F(A) = g(|A|). Then F(A) is submodular if and only if g is concave! E.g., g could say "buying in bulk is cheaper".
23. Maximum of submodular functions. Suppose F_1(A) and F_2(A) are submodular. Is F(A) = max(F_1(A), F_2(A)) submodular? No — max(F_1, F_2) is not submodular in general!
24. Minimum of submodular functions. Well, maybe F(A) = min(F_1(A), F_2(A)) instead? Counterexample:

| A | F_1(A) | F_2(A) | F(A) = min |
|---|---|---|---|
| ∅ | 0 | 0 | 0 |
| {a} | 1 | 0 | 0 |
| {b} | 0 | 1 | 0 |
| {a,b} | 1 | 1 | 1 |

Here F({b}) − F(∅) = 0 < F({a,b}) − F({a}) = 1, so min(F_1, F_2) is not submodular in general! But stay tuned – we'll address min_i F_i later!
25. Duality. For F submodular on V, let G(A) = F(V) − F(V\A). G is supermodular and is called the dual of F. Details about its properties in [Fujishige '91].
26. Tutorial Overview. Examples and properties of submodular functions: many problems are submodular (mutual information, influence, …); SFs are closed under nonnegative linear combinations, but not under min and max. Next: submodularity and convexity; minimizing submodular functions; maximizing submodular functions; extensions and research directions.
27. Submodularity and Convexity.
28. Submodularity and convexity. For V = {1, …, n} and A ⊆ V, let w^A = (w_1^A, …, w_n^A) with w_i^A = 1 if i ∈ A and 0 otherwise. Key result [Lovász '83]: every submodular function F induces a function g on R^n_+ such that F(A) = g(w^A) for all A ⊆ V; g(w) is convex; and min_A F(A) = min_w g(w) s.t. w ∈ [0,1]^n. Let's see how one can define g(w).
29. The submodular polyhedron P_F. P_F = {x ∈ R^n : x(A) ≤ F(A) for all A ⊆ V}, where x(A) = ∑_{i ∈ A} x_i. Example: V = {a, b} with F(∅) = 0, F({a}) = −1, F({b}) = 2, F({a,b}) = 0; P_F is the region cut out by x({a}) ≤ F({a}), x({b}) ≤ F({b}), x({a,b}) ≤ F({a,b}).
30. Lovász extension. Claim: g(w) = max_{x ∈ P_F} w^T x, where P_F = {x ∈ R^n : x(A) ≤ F(A) for all A ⊆ V}; writing x_w = argmax_{x ∈ P_F} w^T x, we have g(w) = w^T x_w. Evaluating g(w) this way requires solving a linear program with exponentially many constraints…
31. Evaluating the Lovász extension. g(w) = max_{x ∈ P_F} w^T x. Theorem [Edmonds '71, Lovász '83]: for any given w, can get the optimal solution x_w to this LP using the following greedy algorithm: order V = {e_1, …, e_n} so that w(e_1) ≥ … ≥ w(e_n), and let x_w(e_i) = F({e_1, …, e_i}) − F({e_1, …, e_{i−1}}). Then w^T x_w = g(w) = max_{x ∈ P_F} w^T x. Sanity check: if w = w^A and A = {e_1, …, e_k}, the marginal gains telescope and g(w^A) = F(A). (A short implementation follows below.)
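Edmonds' greedy rule makes the Lovász extension cheap to evaluate: n sorted marginal gains instead of an exponential LP. A short sketch, assuming `F` takes a Python set with F(∅) = 0 as on slide 13, checked against the running example of slides 29 and 32:

```python
def lovasz_extension(F, V, w):
    """g(w) = max_{x in P_F} w^T x, computed by sorting elements by
    decreasing weight and accumulating marginal gains of F."""
    order = sorted(V, key=lambda e: -w[e])
    g, prefix, prev = 0.0, set(), 0.0          # F(∅) = 0
    for e in order:
        prefix.add(e)
        val = F(prefix)
        g += w[e] * (val - prev)               # x_w(e) = marginal gain
        prev = val
    return g

# Running example: F(∅)=0, F({a})=-1, F({b})=2, F({a,b})=0
F_tab = {frozenset(): 0, frozenset("a"): -1,
         frozenset("b"): 2, frozenset("ab"): 0}
F = lambda A: F_tab[frozenset(A)]
assert lovasz_extension(F, "ab", {"a": 0, "b": 1}) == 2   # = F({b})
assert lovasz_extension(F, "ab", {"a": 1, "b": 1}) == 0   # = F({a,b})
```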
32. Example: Lovász extension. With F(∅) = 0, F({a}) = −1, F({b}) = 2, F({a,b}) = 0, evaluate g(w) = max {w^T x : x ∈ P_F} at w = [0, 1]. Greedy ordering: e_1 = b, e_2 = a, since w(e_1) = 1 > w(e_2) = 0; then x_w(e_1) = F({b}) − F(∅) = 2 and x_w(e_2) = F({b,a}) − F({b}) = −2, so x_w = [−2, 2] and g([0,1]) = [0,1]^T [−2,2] = 2 = F({b}). Similarly g([1,1]) = [1,1]^T [−1,1] = 0 = F({a,b}).
33. Why is this useful? Theorem [Lovász '83]: g(w) attains its minimum over [0,1]^n at a corner! If we can minimize g on [0,1]^n, we can minimize F (at corners, g and F take the same values). So F(A) submodular ⇒ g(w) convex (and efficient to evaluate). Does the converse also hold? No: consider g(w_1, w_2, w_3) = max(w_1, w_2 + w_3); the corresponding F on {a, b, c} has F({a,b}) − F({a}) = 0 < F({a,b,c}) − F({a,c}) = 1.
34. Tutorial Overview. Examples and properties of submodular functions: many problems submodular (mutual information, influence, …); SFs closed under nonnegative linear combinations, not under min and max. Submodularity and convexity: every SF induces a convex function with the SAME minimum; special property: greedy solves an LP over an exponential polytope. Next: minimizing submodular functions; maximizing submodular functions; extensions and research directions.
35. Minimization of Submodular Functions.
36. Overview: minimization. Minimizing general submodular functions; minimizing symmetric submodular functions; applications to Machine Learning.
37. Minimizing a submodular function. Want to solve A* = argmin_A F(A). Need to solve min_w max_{x ∈ P_F} w^T x, i.e., min_w g(w) s.t. w ∈ [0,1]^n. Equivalently: min_{c,w} c s.t. c ≥ w^T x for all x ∈ P_F and w ∈ [0,1]^n.
38. Ellipsoid algorithm [Grötschel, Lovász, Schrijver '81]. Solve min_{c,w} c s.t. c ≥ w^T x for all x ∈ P_F, w ∈ [0,1]^n. Separation oracle: find the most violated constraint, max_x w^T x − c s.t. x ∈ P_F. Can solve the separation problem using the greedy algorithm!! ⇒ The ellipsoid algorithm minimizes SFs in poly-time!
39. Minimizing submodular functions. The ellipsoid algorithm is not very practical; want a combinatorial algorithm for minimization! Theorem [Iwata (2001)]: there is a fully combinatorial, strongly polynomial algorithm for minimizing SFs that runs in time O(n^8 log^2 n). Polynomial-time = practical???
40. A more practical alternative? [Fujishige '91, Fujishige et al '06]. Base polytope: B_F = P_F ∩ {x : x(V) = F(V)}. Minimum norm algorithm: 1) find x* = argmin ||x||_2 s.t. x ∈ B_F; 2) return A* = {i : x*(i) < 0}. In the running example (F(∅)=0, F({a})=−1, F({b})=2, F({a,b})=0): x* = [−1, 1], so A* = {a}. Theorem [Fujishige '91]: A* is an optimal solution! Note: step 1 can be solved using Wolfe's algorithm. Runtime finite but unknown!!
41. Empirical comparison [Fujishige et al '06]. (Plot: running time in seconds vs. problem size 64–1024, both log-scale, on cut functions from the DIMACS Challenge.) The minimum norm algorithm is orders of magnitude faster! Our implementation can solve n = 10k in < 6 minutes!
42. Checking optimality (duality). Theorem [Edmonds '70]: min_A F(A) = max {x^−(V) : x ∈ B_F}, where x^−(s) = min{x(s), 0}. Testing how close A' is to min_A F(A): run the greedy algorithm for w = w^{A'} to get x_w; then F(A') ≥ min_A F(A) ≥ x_w^−(V). Example: A = {a}, F(A) = −1; w = [1, 0]; x_w = [−1, 1]; x_w^− = [−1, 0]; x_w^−(V) = −1 ⇒ A is optimal!
43. Overview: minimization. Minimizing general submodular functions: can minimize in poly-time using the ellipsoid method; combinatorial, strongly polynomial algorithm in O(n^8); practical alternative: the minimum norm algorithm? Next: minimizing symmetric submodular functions; applications to Machine Learning.
44. What if we have special structure? Worst-case complexity of the best known algorithm: O(n^8 log^2 n). Can we do better for special cases? Example (again): given RVs X_1, …, X_n, F(A) = I(X_A; X_{V\A}) = I(X_{V\A}; X_A) = F(V\A). Functions F with F(A) = F(V\A) for all A are called symmetric.
45. Another example: Cut functions. Take a weighted graph on V = {a, b, c, d, e, f, g, h} and define F(A) = ∑ {w_{s,t} : s ∈ A, t ∈ V\A}. Example: F({a}) = 6; F({c,d}) = 10; F({a,b,c,d}) = 2. The cut function is symmetric and submodular! (A small implementation follows below.)
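A graph cut function in a few lines; the weighted graph below is made up for illustration, not the figure from the slide:

```python
# Undirected edge weights; any symmetric weight map works here.
weights = {("a", "b"): 2, ("a", "c"): 1, ("b", "d"): 2,
           ("c", "d"): 3, ("b", "c"): 2}
V = {"a", "b", "c", "d"}

def cut(A):
    """F(A) = total weight of edges with exactly one endpoint in A."""
    return sum(w for (s, t), w in weights.items() if (s in A) != (t in A))

A = {"a", "b"}
assert cut(A) == cut(V - A)   # symmetry: F(A) = F(V \ A)
```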
46. Minimizing symmetric functions. For any A, submodularity implies 2F(A) = F(A) + F(V\A) ≥ F(A ∩ (V\A)) + F(A ∪ (V\A)) = F(∅) + F(V) = 2F(∅) = 0. Hence any symmetric SF attains its minimum at ∅. In practice we want a nontrivial partition of V into A and V\A, i.e., we require that A is neither ∅ nor V: A* = argmin F(A) s.t. 0 < |A| < n. There is an efficient algorithm for doing that!
47. Queyranne's algorithm (overview) [Queyranne '98]. Theorem: there is a fully combinatorial, strongly polynomial algorithm for solving A* = argmin_A F(A) s.t. 0 < |A| < n for symmetric submodular functions F. It runs in time O(n^3) [instead of O(n^8)…]. Note: it also works for "posimodular" functions: F posimodular ⇔ for all A, B ⊆ V: F(A) + F(B) ≥ F(A\B) + F(B\A).
48. Gomory-Hu trees. A tree T is called a Gomory-Hu (GH) tree for an SF F if for any s, t ∈ V: min {F(A) : s ∈ A, t ∉ A} = min {w_{i,j} : (i,j) is an edge on the s-t path in T} — "min s-t cut in T = min s-t cut in G". Theorem [Queyranne '93]: GH trees exist for any symmetric SF F! But they are expensive to find in general.
49. Pendent pairs. For a function F on V and s, t ∈ V: (s, t) is a pendent pair if {s} ∈ argmin_A F(A) s.t. s ∈ A, t ∉ A. Pendent pairs always exist: take any leaf s of a Gomory-Hu tree T and its neighbor t; then (s, t) is pendent (e.g., (a,c), (b,c), (f,e), …). Theorem [Queyranne '95]: can find pendent pairs in O(n^2) (without needing the GH tree!).
50. Why are pendent pairs useful? Key idea: let (s, t) be pendent and A* = argmin F(A). Then EITHER s and t are separated by A* (e.g., s ∈ A*, t ∉ A*) — but then A* = {s}!! — OR s and t are not separated by A*, and then we can merge s and t…
51. Merging. Suppose F is a symmetric SF on V and we want to merge the pendent pair (s, t). Key idea: "if we pick s, we get t for free". Set V' = V\{t} and F'(A) = F(A ∪ {t}) if s ∈ A, F'(A) = F(A) if s ∉ A. Lemma: F' is still symmetric and submodular!
52. Queyranne's algorithm. Input: symmetric SF F on V, |V| = n. Output: A* = argmin F(A) s.t. 0 < |A| < n. Initialize F' ← F and V' ← V. For i = 1 : n−1: (s, t) ← pendentPair(F', V'); A_i = {s}; (F', V') ← merge(F', V', s, t). Return argmin_i F(A_i). Running time: O(n^3) function evaluations.
53. Note: Finding pendent pairs. Initialize v_1 ← x (x an arbitrary element of V). For i = 1 to n−1: W_i ← {v_1, …, v_i}; v_{i+1} ← argmin_v F(W_i ∪ {v}) − F({v}) s.t. v ∈ V\W_i. Return the pendent pair (v_{n−1}, v_n). Requires O(n^2) evaluations of F. (A sketch combining slides 52–53 in code follows below.)
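Putting slides 52 and 53 together gives a compact implementation. A sketch assuming `F` accepts a Python set of original elements; tracking merged elements as frozensets of originals is an implementation choice, not prescribed by the slides:

```python
def pendent_pair(F, V):
    """Slide 53: grow an ordering greedily; the last two elements
    form a pendent pair. O(n^2) evaluations of F."""
    V = list(V)
    W, rest = [V[0]], V[1:]
    while rest:
        v = min(rest, key=lambda u: F(set(W) | {u}) - F({u}))
        W.append(v)
        rest.remove(v)
    return W[-2], W[-1]

def queyranne(F, V):
    """Slide 52: minimize a symmetric submodular F over 0 < |A| < n."""
    groups = [frozenset([v]) for v in V]          # merged elements
    Fg = lambda A: F(set().union(*A)) if A else 0
    candidates = []
    while len(groups) > 1:
        s, t = pendent_pair(Fg, groups)
        candidates.append(set(s))                 # A_i = {s}, in originals
        groups = [g for g in groups if g not in (s, t)] + [s | t]
    return min(candidates, key=F)
```

For instance, `queyranne(cut, V)` on the cut example after slide 45 finds a minimum nontrivial cut.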
54. Overview: minimization. Minimizing general submodular functions: can minimize in poly-time using the ellipsoid method; combinatorial, strongly polynomial algorithm in O(n^8); practical alternative: the minimum norm algorithm? Minimizing symmetric submodular functions: many useful submodular functions are symmetric; Queyranne's algorithm minimizes symmetric SFs in O(n^3). Next: applications to Machine Learning.
55. Application: Clustering [Narasimhan, Jojic, Bilmes NIPS '05]. Group data points V into "homogeneous clusters": find a partition V = A_1 ∪ … ∪ A_k that minimizes F(A_1, …, A_k) = ∑_i E(A_i), where E(A_i) measures the "inhomogeneity of A_i". Examples for E(A): entropy H(A); cut function. Special case k = 2: F(A) = E(A) + E(V\A) is symmetric! If E is submodular, can use Queyranne's algorithm!
56. What if we want k > 2 clusters? [Zhao et al '05, Narasimhan et al '05]. Greedy splitting algorithm: start with the partition P = {V}; for i = 1 to k−1: for each member C_j ∈ P, split cluster C_j via A* = argmin E(A) + E(C_j\A) s.t. 0 < |A| < |C_j|, giving P_j ← P \ {C_j} ∪ {A, C_j\A} (the partition obtained by splitting the j-th cluster); then P ← argmin_j F(P_j). Theorem: F(P) ≤ (2 − 2/k) F(P_opt).
57. Example: Clustering species [Narasimhan et al '05]. Common genetic information = number of common substrings; can easily extend to sets of species.
58. Example: Clustering species [Narasimhan et al '05]. The common genetic information I_CG: does not require alignment; captures genetic similarity; is smallest for maximally evolutionarily diverged species; and is a symmetric submodular function! The greedy splitting algorithm yields a phylogenetic tree!
59. Example: SNPs [Narasimhan et al '05]. Study human genetic variation (for personalized medicine, …). Most human variation is due to point mutations that occur once in human history at a given base location: Single Nucleotide Polymorphisms (SNPs). Cataloging all variation is too expensive ($10K–$100K per individual!!).
60. SNPs in the ACE gene [Narasimhan et al '05]. Rows: individuals; columns: SNPs. Which columns should we pick in order to reconstruct the rest? Can find a near-optimal clustering (Queyranne's algorithm).
61. Reconstruction accuracy [Narasimhan et al '05]. (Plot: prediction accuracy vs. number of clusters.) Comparison with clustering based on entropy, pairwise correlation, and PCA.
62. Example: Speaker segmentation [Reyes-Gomez, Jojic '07]. Partition the spectrogram (time × frequency) of mixed waveforms into regions belonging to each speaker ("Alice", "Fiona", …). E(A) = −log p(X_A) is the likelihood of "region" A, and F(A) = E(A) + E(V\A) is symmetric and posimodular, so it can be minimized using Queyranne's algorithm.
63. Example: Image denoising.
64. Example: Image denoising. Pairwise Markov Random Field: P(x_1, …, x_n, y_1, …, y_n) = ∏_{i,j} ψ_{i,j}(y_i, y_j) ∏_i φ_i(x_i, y_i), where the X_i are the noisy pixels and the Y_i the "true" pixels. Want argmax_y P(y | x) = argmax_y log P(x, y) = argmin_y ∑_{i,j} E_{i,j}(y_i, y_j) + ∑_i E_i(y_i), where E_{i,j}(y_i, y_j) = −log ψ_{i,j}(y_i, y_j). When is this MAP inference efficiently solvable (in high-treewidth graphical models)?
65. MAP inference in Markov Random Fields [Kolmogorov et al, PAMI '04; see also Hammer, Ops Res '65]. Energy E(y) = ∑_{i,j} E_{i,j}(y_i, y_j) + ∑_i E_i(y_i). Suppose the y_i are binary, and define F(A) = E(y^A) where y^A_i = 1 iff i ∈ A. Then min_y E(y) = min_A F(A). Theorem: the MAP inference problem is solvable by graph cuts ⇔ for all i, j: E_{i,j}(0,0) + E_{i,j}(1,1) ≤ E_{i,j}(0,1) + E_{i,j}(1,0), i.e., each E_{i,j} is submodular. "Efficient if we prefer that neighboring pixels have the same color." (The condition is a one-line check; see below.)
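The graph-cut condition can be verified per pairwise term. A trivial helper, assuming each pairwise energy is a function of two binary labels; the Potts term below is a standard illustration, not from the slide:

```python
def is_pairwise_submodular(E_ij):
    """Slide 65's condition: E(0,0) + E(1,1) <= E(0,1) + E(1,0)."""
    return E_ij(0, 0) + E_ij(1, 1) <= E_ij(0, 1) + E_ij(1, 0)

# A Potts-style smoothness term, preferring equal neighboring labels:
potts = lambda yi, yj: 0.0 if yi == yj else 1.0
assert is_pairwise_submodular(potts)
```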
66. Constrained minimization. Have seen: if F is submodular on V, we can solve A* = argmin F(A) s.t. A ⊆ V. What about A* = argmin F(A) s.t. A ⊆ V and |A| ≤ k — e.g., clustering with a minimum number of points per cluster, …? In general, not much is known about constrained minimization. However, we can solve: A* = argmin F(A) s.t. 0 < |A| < n; s.t. |A| is odd/even [Goemans & Ramakrishnan '95]; s.t. A ∈ argmin G(A) for G submodular [Fujishige '91].
67. Overview: minimization. Minimizing general submodular functions: poly-time via the ellipsoid method; combinatorial, strongly polynomial O(n^8) algorithm; practical alternative: the minimum norm algorithm? Minimizing symmetric submodular functions: many useful SFs are symmetric; Queyranne's algorithm minimizes them in O(n^3). Applications to Machine Learning: clustering [Narasimhan et al '05]; speaker segmentation [Reyes-Gomez & Jojic '07]; MAP inference [Kolmogorov et al '04].
68. Tutorial Overview. Examples and properties of submodular functions: many problems submodular (mutual information, influence, …); SFs closed under nonnegative linear combinations, not under min and max. Submodularity and convexity: every SF induces a convex function with the SAME minimum; greedy solves an LP over an exponential polytope. Minimizing submodular functions: possible in polynomial time (but O(n^8)…); Queyranne's algorithm minimizes symmetric SFs in O(n^3); useful for clustering, MAP inference, structure learning, … Next: maximizing submodular functions; extensions and research directions.
69. Maximizing Submodular Functions.
70. Maximizing submodular functions. Minimizing convex functions: polynomial-time solvable! Minimizing submodular functions: polynomial-time solvable! Maximizing convex functions: NP-hard! Maximizing submodular functions: NP-hard — but we can get approximation guarantees.
71. Maximizing influence [Kempe, Kleinberg, Tardos KDD '03]. F(A) = expected number of people influenced when targeting A. F is monotonic: if A ⊆ B then F(A) ≤ F(B); hence V = argmax_A F(A) and the unconstrained problem is trivial. More interesting: argmax_A F(A) − Cost(A).
72. Maximizing non-monotonic functions. Suppose we want A* = argmax F(A) s.t. A ⊆ V for F that is not monotonic (the maximum may occur at intermediate |A|). Example: F(A) = U(A) − C(A), where U(A) is a submodular utility and C(A) a supermodular cost function — e.g., trading off utility and privacy in personalized search [Krause & Horvitz AAAI '08]. In general this is NP-hard; moreover, if F(A) can take negative values, it is as hard to approximate as maximum independent set.
73. Maximizing positive submodular functions [Feige, Mirrokni, Vondrak FOCS '07]. Theorem: there is an efficient randomized local search procedure that, given a positive submodular function F with F(∅) = 0, returns a set A_LS such that F(A_LS) ≥ (2/5) max_A F(A). Picking a random set gives a 1/4 approximation (1/2 if F is symmetric!), and we cannot get better than a 3/4 approximation unless P = NP.
74. Scalarization vs. constrained maximization. Given a monotonic utility F(A) and cost C(A), optimize either: Option 1 ("scalarization"): max_A F(A) − C(A) s.t. A ⊆ V — can get a 2/5 approximation if F(A) − C(A) ≥ 0 for all A ⊆ V, but positiveness is a strong requirement. Option 2 ("constrained maximization"): max_A F(A) s.t. C(A) ≤ B — coming up…
75. Constrained maximization: Outline. Maximize a monotonic submodular function of the selected set, subject to a selection cost within a budget. Subset selection: C(A) = |A|; robust optimization; complex constraints.
76. Monotonicity. A set function is called monotonic if A ⊆ B ⊆ V ⇒ F(A) ≤ F(B). Examples: influence in social networks [Kempe et al KDD '03]; for discrete RVs, the entropy F(A) = H(X_A) is monotonic (if B = A ∪ C, then F(B) = H(X_A, X_C) = H(X_A) + H(X_C | X_A) ≥ H(X_A) = F(A)); information gain F(A) = H(Y) − H(Y | X_A); set cover; matroid rank functions (dimension of vector spaces, …); …
77. Subset selection. Given: finite set V, monotonic submodular function F with F(∅) = 0. Want: A* = argmax F(A) s.t. |A| ≤ k. NP-hard!
78. Exact maximization of monotonic submodular functions. 1) Mixed integer programming [Nemhauser et al '81]: max η s.t. η ≤ F(B) + ∑_{s ∈ V\B} α_s δ_s(B) for all B ⊆ V; ∑_s α_s ≤ k; α_s ∈ {0,1}; where δ_s(B) = F(B ∪ {s}) − F(B). Solved using constraint generation. 2) Branch-and-bound: the "data-correcting algorithm" [Goldengorin et al '99]. Both algorithms are worst-case exponential!
79. Approximate maximization. Given: finite set V, monotonic submodular F(A). Want: A* = argmax F(A) s.t. |A| ≤ k. NP-hard! Greedy algorithm: start with A_0 = ∅; for i = 1 to k: s_i := argmax_s F(A_{i−1} ∪ {s}) − F(A_{i−1}); A_i := A_{i−1} ∪ {s_i}. (This is the loop sketched in code after slide 7.)
80. Performance of greedy algorithm. Theorem [Nemhauser et al '78]: given a monotonic submodular function F with F(∅) = 0, the greedy maximization algorithm returns A_greedy with F(A_greedy) ≥ (1 − 1/e) max_{|A| ≤ k} F(A), i.e., ~63% of optimal. Sidenote: the greedy algorithm gives a 1/2 approximation for maximization over any matroid constraint! [Fisher et al '78]
81. An "elementary" counterexample. X_1, X_2 ~ Bernoulli(0.5), Y = X_1 XOR X_2. Let F(A) = IG(X_A; Y) = H(Y) − H(Y | X_A). Y | X_1 and Y | X_2 are Bernoulli(0.5) (entropy 1), while Y | X_1, X_2 is deterministic (entropy 0). Hence F({1,2}) − F({1}) = 1 but F({2}) − F(∅) = 0, violating diminishing returns: F(A) is submodular only under some conditions! (later)
82. Example: Submodularity of info-gain. Y_1, …, Y_m, X_1, …, X_n discrete RVs; F(A) = IG(Y; X_A) = H(Y) − H(Y | X_A). F(A) is always monotonic; however, it is NOT always submodular. Theorem [Krause & Guestrin UAI '05]: if the X_i are all conditionally independent given Y, then F(A) is submodular! Hence the greedy algorithm works — and in fact NO algorithm can do better than the (1 − 1/e) approximation!
83. Building a Sensing Chair [Mutlu, Krause, Forlizzi, Guestrin, Hodgins UIST '07]. People sit a lot; activity recognition in assistive technologies; seating pressure as a user interface. A chair equipped with 1 sensor per cm² costs $16,000 but gives 82% accuracy on 10 postures (lean left, lean forward, slouch, …) [Tan et al]. Can we get similar accuracy with fewer, cheaper sensors?
84. How to place sensors on a chair? Treat the sensor readings at locations V as random variables, predict the posture Y using a probabilistic model P(Y, V), and pick sensor locations A* ⊆ V to minimize the entropy of Y given the readings. Placed the sensors and did a user study: before, 82% accuracy at a cost of $16,000; after, 79% accuracy at $100. Similar accuracy at < 1% of the cost!
85. Variance reduction (a.k.a. orthogonal matching pursuit, forward regression). Let Y = ∑_i α_i X_i + ε with (X_1, …, X_n, ε) ~ N(·; μ, Σ). Want to pick a subset X_A to predict Y. Var(Y | X_A = x_A) is the conditional variance of Y given X_A = x_A; the expected variance is Var(Y | X_A) = ∫ p(x_A) Var(Y | X_A = x_A) dx_A. Variance reduction: F_V(A) = Var(Y) − Var(Y | X_A). F_V(A) is always monotonic. Theorem [Das & Kempe, STOC '08]: F_V(A) is submodular under some conditions on Σ, so orthogonal matching pursuit is near-optimal! [see other analyses by Tropp, Donoho et al., and Temlyakov]
86. Batch mode active learning [Hoi et al, ICML '06]. Which data points should we label to minimize error? Want a batch A of k points to show an expert for labeling. F(A) selects examples that are uncertain (σ²(s) = π(s)(1 − π(s)) is large), diverse (points in A are as different as possible), and relevant (as close to V\A as possible, s^T s' large). F(A) is submodular and monotonic! [an approximation to the improvement in Fisher information]
87. Results on active learning [Hoi et al, ICML '06]. Batch mode active learning performs better than picking k points at random or picking the k points of highest entropy.
88. Monitoring water networks [Krause et al, J Wat Res Mgt 2008]. Contamination of drinking water could affect millions of people. Place sensors (e.g., the ~$14K Hach sensor) to detect contaminations, simulated with a water-flow simulator from the EPA; this was the setting of the "Battle of the Water Sensor Networks" competition. Where should we place sensors to quickly detect contamination?
89. Model-based sensing. The utility of placing sensors is based on a model of the world — for water networks, the EPA's water flow simulator. F(A) = expected impact reduction when placing sensors at A; the model predicts low, medium, or high impact for each contamination location, and a sensor reduces impact through early detection. E.g., one placement achieves a high impact reduction (F(A) = 0.9), another a low one (F(A) = 0.01). Theorem [Krause et al., J Wat Res Mgt '08]: the impact reduction F(A) in water networks is submodular!
90. Battle of the Water Sensor Networks competition. Real metropolitan-area network (12,527 nodes); water flow simulator provided by the EPA; 3.6 million contamination events; multiple objectives: detection time, affected population, …; place sensors that detect well "on average".
91. Bounds on optimal solution [Krause et al., J Wat Res Mgt '08]. (Plot: population protected F(A), higher is better, vs. number of sensors placed, on water networks data: the greedy solution vs. the offline (Nemhauser) bound.) The (1 − 1/e) bound is quite loose… can we get better bounds?
92. Data-dependent bounds [Minoux '78]. Suppose A is a candidate solution to argmax F(A) s.t. |A| ≤ k, and let A* = {s_1, …, s_k} be an optimal solution. Then F(A*) ≤ F(A ∪ A*) = F(A) + ∑_i [F(A ∪ {s_1, …, s_i}) − F(A ∪ {s_1, …, s_{i−1}})] ≤ F(A) + ∑_i (F(A ∪ {s_i}) − F(A)) = F(A) + ∑_i δ_{s_i}. For each s ∈ V\A, let δ_s = F(A ∪ {s}) − F(A) and order so that δ_1 ≥ δ_2 ≥ … ≥ δ_n; then F(A*) ≤ F(A) + ∑_{i=1}^k δ_i — a bound computable from the data. (A few lines of code below.)
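This bound is a few lines of code on top of any candidate solution A; here `F` is again a black-box set function:

```python
def minoux_bound(F, V, A, k):
    """Data-dependent upper bound on max_{|B| <= k} F(B):
    F(A) plus the sum of the k largest marginal gains over A."""
    A = set(A)
    fA = F(A)
    deltas = sorted((F(A | {s}) - fA for s in V if s not in A),
                    reverse=True)
    return fA + sum(deltas[:k])
```

Comparing F(A_greedy) against `minoux_bound(F, V, A_greedy, k)` yields exactly the certificate plotted on the next slide.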
93. Bounds on optimal solution [Krause et al., J Wat Res Mgt '08]. (Same plot as slide 91, now with the data-dependent bound added: it lies far below the offline (Nemhauser) bound.) Submodularity gives data-dependent bounds on the performance of any algorithm.
94. BWSN competition results [Ostfeld et al., J Wat Res Mgt 2008]. 13 participants; performance measured by 30 different criteria. Methods: G = genetic algorithm, D = domain knowledge, H = other heuristic, E = "exact" method (MIP). (Bar chart of total score per team; the individual team labels are not recoverable here.) 24% better performance than the runner-up!
95. What was the trick? Simulated all 3.6M contaminations in 2 weeks on 40 processors; 152 GB of data on disk, 16 GB in main memory (compressed) → very accurate computation of F(A), but very slow evaluation: 30 hours for 20 sensors, and exhaustive search (all subsets) would take 6 weeks for all 30 settings. Submodularity to the rescue…
96. Scaling up the greedy algorithm [Minoux '78]. In round i+1, having picked A_i = {s_1, …, s_i}, pick s_{i+1} = argmax_s F(A_i ∪ {s}) − F(A_i), i.e., maximize the "marginal benefit" δ_s(A_i) = F(A_i ∪ {s}) − F(A_i). Key observation: submodularity implies δ_s(A_i) ≥ δ_s(A_{i+1}), and more generally i ≤ j ⇒ δ_s(A_i) ≥ δ_s(A_j). Marginal benefits can never increase!
97. "Lazy" greedy algorithm [Minoux '78]. First iteration as usual. Keep an ordered list of the marginal benefits δ_i from the previous iteration; re-evaluate δ_i only for the top element; if δ_i stays on top, use it, otherwise re-sort. Note: very easy to compute online bounds, lazy evaluations, etc. [Leskovec et al. '07] (A heap-based sketch follows below.)
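A heap-based sketch of the lazy greedy idea: by submodularity, stale gains are upper bounds on current gains, so only the top of the heap ever needs re-evaluation. Assumes a monotonic submodular `F` with F(∅) = 0 and elements that are comparable (for heap tie-breaking):

```python
import heapq

def lazy_greedy(F, V, k):
    A, fA = set(), 0.0
    heap = [(-F({s}), s, 0) for s in V]     # (-gain, element, round stamp)
    heapq.heapify(heap)
    for rnd in range(1, min(k, len(V)) + 1):
        while True:
            neg_gain, s, stamp = heapq.heappop(heap)
            if stamp == rnd:                # fresh gain: safe to take
                A.add(s)
                fA += -neg_gain
                break
            gain = F(A | {s}) - fA          # stale: re-evaluate, push back
            heapq.heappush(heap, (-gain, s, rnd))
    return A
```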
98. Result of lazy evaluation. Same setup (3.6M contaminations simulated, 152 GB of data on disk, 16 GB in memory): naive greedy needs 30 hours for 20 sensors; with "lazy evaluations", 1 hour for 20 sensors — done after 2 days!
99. What about the worst case? [Krause et al., NIPS '07]. Knowing the sensor locations, an adversary contaminates at the worst spot! A placement that detects well on "average-case" (accidental) contamination can have a very different average-case impact but the same worst-case impact as another placement. Where should we place sensors to quickly detect in the worst case?
100. Constrained maximization: Outline. Maximize a utility function of the selected set, subject to a selection cost within a budget. Subset selection; robust optimization; complex constraints.
101. Optimizing for the worst case. Separate utility function F_i for each contamination i: F_i(A) = impact reduction by sensors A for contamination i. Want to solve A* = argmax_{|A| ≤ k} min_i F_i(A). For a contamination at node s, F_s(B) is high; for a contamination at node r, F_r(A) is high but F_r(B) is low. Each of the F_i is submodular — but unfortunately min_i F_i is not submodular! How can we solve this robust optimization problem?
102. How does the greedy algorithm do? Say V = {x, y, z} and we can only buy k = 2, with:

| Set A | F_1 | F_2 | min_i F_i |
|---|---|---|---|
| {x} | 1 | 0 | 0 |
| {y} | 0 | 2 | 0 |
| {z} | ε | ε | ε |
| {x,z} | 1 | ε | ε |
| {y,z} | ε | 2 | ε |
| {x,y} | 2 | 1 | 1 |

The optimal solution {x,y} scores 1, but greedy picks z first and can then only reach score ε — greedy does arbitrarily badly. Is there something better? Theorem [NIPS '07]: the problem max_{|A| ≤ k} min_i F_i(A) does not admit any approximation unless P = NP — so we can't find any approximation algorithm… or can we?
103. Alternative formulation. If somebody told us the optimal value c, could we recover the optimal solution A*? Then we would need to find A with min_i F_i(A) ≥ c and |A| ≤ k. Is this any easier? Yes, if we relax the constraint |A| ≤ k.
104. Solving the alternative problem. Trick: for each F_i and c, define the truncation F'_{i,c}(A) = min{F_i(A), c} — it remains submodular! Problem 1 (last slide), max_A min_i F_i(A), is non-submodular and we don't know how to solve it; Problem 2, maximizing the average of the truncations F'_{avg,c}(A) = (1/m) ∑_i F'_{i,c}(A), is submodular! The two problems have the same optimal solutions, so solving one solves the other. But what about the |A| ≤ k constraint?
105. Maximization vs. coverage. Previously: wanted A* = argmax F(A) s.t. |A| ≤ k. Now need to solve A* = argmin |A| s.t. F(A) ≥ Q. Greedy algorithm: start with A := ∅; while F(A) < Q and |A| < n: s* := argmax_s F(A ∪ {s}); A := A ∪ {s*}. (For the bound, assume F is integral; if not, just round it.) Theorem [Wolsey et al]: greedy returns A_greedy with |A_greedy| ≤ (1 + log max_s F({s})) |A*|. (A code sketch follows below.)
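The coverage variant differs from `greedy_maximize` (after slide 7) only in its stopping rule:

```python
def greedy_cover(F, V, Q):
    """Greedy for A = argmin |A| s.t. F(A) >= Q (slide 105)."""
    A = set()
    while F(A) < Q and len(A) < len(V):
        best = max((s for s in V if s not in A),
                   key=lambda s: F(A | {s}) - F(A))
        A.add(best)
    return A
```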
106. Solving the alternative problem. Trick: for each F_i and c, define the truncation F'_{i,c}(A) = min{F_i(A), c}. Problem 1 (last slide) is non-submodular and we don't know how to solve it; Problem 2 is submodular, so we can use the greedy (coverage) algorithm!
107. Back to our example. Guess c = 1 and use F'_{avg,1}:

| Set A | F_1 | F_2 | min_i F_i | F'_{avg,1} |
|---|---|---|---|---|
| {x} | 1 | 0 | 0 | ½ |
| {y} | 0 | 2 | 0 | ½ |
| {z} | ε | ε | ε | ε |
| {x,z} | 1 | ε | ε | (1+ε)/2 |
| {y,z} | ε | 2 | ε | (1+ε)/2 |
| {x,y} | 2 | 1 | 1 | 1 |

Greedy on F'_{avg,1} first picks x, then picks y — the optimal solution! How do we find c? Do binary search!
108. SATURATE algorithm [Krause et al, NIPS '07]. Given: set V, integer k, and monotonic SFs F_1, …, F_m. Initialize c_min = 0, c_max = min_i F_i(V). Do binary search: c = (c_min + c_max)/2; greedily find A_G such that F'_{avg,c}(A_G) = c; if |A_G| ≤ αk, increase c_min; if |A_G| > αk, decrease c_max; repeat until convergence. (A sketch follows below.)
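A sketch of SATURATE using `greedy_cover` from above. Treating "F'_{avg,c}(A_G) = c" as greedy coverage up to level c, and the `eps` tolerance, are implementation choices, not from the slide:

```python
def saturate(Fs, V, k, alpha, eps=1e-6):
    """Binary search on c; each step greedily covers the truncated
    average objective F'_avg,c (slide 108)."""
    F_avg = lambda A, c: sum(min(F(A), c) for F in Fs) / len(Fs)
    c_min, c_max = 0.0, min(F(set(V)) for F in Fs)
    best = set(V)
    while c_max - c_min > eps:
        c = (c_min + c_max) / 2
        A = greedy_cover(lambda S: F_avg(S, c), V, c)
        if len(A) <= alpha * k and F_avg(A, c) >= c:
            best, c_min = A, c   # c achievable with few sensors: raise it
        else:
            c_max = c            # too many sensors needed: lower c
    return best
```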
109. Theoretical guarantees [Krause et al, NIPS '07]. Theorem: the problem max_{|A| ≤ k} min_i F_i(A) does not admit any approximation unless P = NP. Theorem: SATURATE finds a solution A_S such that min_i F_i(A_S) ≥ OPT_k and |A_S| ≤ αk, where OPT_k = max_{|A| ≤ k} min_i F_i(A) and α = 1 + log max_s ∑_i F_i({s}). Theorem: if there were a poly-time algorithm with a better factor β < α, then NP ⊆ DTIME(n^{log log n}).
110. Example: Lake monitoring. Monitor pH values along a robotic sensor transect: given observations A, predict the (hidden) pH at unobserved locations, and use a probabilistic model (Gaussian processes) to estimate the prediction error Var(s | A) at each position s along the transect. Where should we sense to minimize our maximum error? Variance reduction is (often) submodular [Das & Kempe '08] → a robust submodular optimization problem!
111. Comparison with state of the art. The algorithm used in geostatistics is simulated annealing [Sacks & Schiller '88, van Groeningen & Stein '98, Wiens '05, …], with 7 parameters that need to be fine-tuned. (Plots: maximum marginal variance, lower is better, vs. number of sensors, on environmental monitoring and precipitation data, for greedy, simulated annealing, and SATURATE.) SATURATE is competitive with simulated annealing and 10x faster — with no parameters to tune!
112. Results on water networks. (Plot: maximum detection time in minutes, lower is better, vs. number of sensors, on water networks data.) Greedy shows no decrease until all contaminations are detected; SATURATE beats simulated annealing, with a 60% lower worst-case detection time!
113. Worst- vs. average case. Given: set V, submodular functions F_1, …, F_m. The average-case score (the mean of the F_i) may be too optimistic; the worst-case score min_i F_i(A) is very pessimistic! Want to optimize both average- and worst-case score — and we can modify SATURATE to solve this! Want F_ac(A) ≥ c_ac and F_wc(A) ≥ c_wc; truncate: min{F_ac(A), c_ac} + min{F_wc(A), c_wc} ≥ c_ac + c_wc.
114. Worst- vs. average case. (Plot: worst-case impact vs. average-case impact on water networks data, lower is better on both axes.) Optimizing only for the average case or only for the worst case is dominated by the tradeoff curve traced by SATURATE, which has a clear knee: we can find a good compromise between average- and worst-case score!
115. Constrained maximization: Outline. Utility function, selected set, selection cost, budget. Subset selection; robust optimization; complex constraints.
116. Other aspects: Complex constraints. max_A F(A) or max_A min_i F_i(A), subject to — so far — |A| ≤ k. In practice there are more complex constraints: different costs, C(A) ≤ B; locations that need to be connected by paths [Chekuri & Pal, FOCS '05] (e.g., lake monitoring); sensors that need to communicate, forming a routing tree [Singh et al, IJCAI '07] (e.g., building monitoring).
117. Non-constant cost functions. For each s ∈ V, let c(s) > 0 be its cost (e.g., feature acquisition costs, …); the cost of a set is C(A) = ∑_{s ∈ A} c(s) (a modular function!). Want to solve A* = argmax F(A) s.t. C(A) ≤ B. Cost-benefit greedy algorithm: start with A := ∅; while there is an affordable s ∈ V\A (i.e., C(A ∪ {s}) ≤ B), add the element with the best marginal-gain-to-cost ratio: s* := argmax_s [F(A ∪ {s}) − F(A)] / c(s); A := A ∪ {s*}. (A sketch follows below.)
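A sketch of the cost-benefit rule; `cost` is assumed to be a dict of per-element costs:

```python
def cost_benefit_greedy(F, V, cost, B):
    """Repeatedly add the affordable element with the best
    marginal-gain-per-unit-cost ratio (slide 117)."""
    A, spent = set(), 0.0
    while True:
        affordable = [s for s in V
                      if s not in A and spent + cost[s] <= B]
        if not affordable:
            return A
        s_best = max(affordable,
                     key=lambda s: (F(A | {s}) - F(A)) / cost[s])
        A.add(s_best)
        spent += cost[s_best]
```

Slide 119 below explains why one should also run the unit-cost variant and keep the better of the two solutions.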
118. Performance of cost-benefit greedy. Want max_A F(A) s.t. C(A) ≤ 1, with F({a}) = 2ε, C({a}) = ε and F({b}) = 1, C({b}) = 1. Cost-benefit greedy picks a (ratio 2 vs. 1) — and then cannot afford b! So cost-benefit greedy performs arbitrarily badly!
119. Cost-benefit optimization [Wolsey '82, Sviridenko '04, Leskovec et al '07]. Theorem [Leskovec et al. KDD '07]: let A_CB be the cost-benefit greedy solution and A_UC the unit-cost greedy solution (i.e., ignoring costs); then max{F(A_CB), F(A_UC)} ≥ ½ (1 − 1/e) OPT. Can still compute online bounds and speed up using lazy evaluations. Note: one can also get a (1 − 1/e) approximation in time O(n^4) [Sviridenko '04], and slightly better than ½ (1 − 1/e) in O(n^2) [Wolsey '82].