Slide notes
• Get p+1 approximation for C = intersection of p matroids; 1-1/e whp for any matroid using continuous greedy + pipage rounding
• Random placement: 53%; Uniform: 73% Optimized: 81%

Transcript

• 1. Beyond Convexity – Submodularity in Machine Learning. Andreas Krause, Carlos Guestrin, Carnegie Mellon University. International Conference on Machine Learning | July 5, 2008
• 2. Acknowledgements. Thanks for slides and material to Mukund Narasimhan, Jure Leskovec and Manuel Reyes-Gomez. A MATLAB toolbox implementing the algorithms below, and details for references, are available at http://www.submodularity.org
• 3. Optimization in Machine Learning. Example: classify + from − by finding a separating hyperplane (parameters w). Which of the many separating hyperplanes should we choose? Define the loss L(w) = "1/size of margin" and solve for the best vector w* = argminw L(w). Key observation: many problems in ML are convex — no local minima!
• 4. Feature selection. Given random variables Y, X1, …, Xn, want to predict Y from a subset XA = (Xi1, …, Xik). Example: a naïve Bayes model with Y = "Sick" and features X1 = "Fever", X2 = "Rash", X3 = "Male". Want the k most informative features: A* = argmax IG(XA; Y) s.t. |A| ≤ k, where IG(XA; Y) = H(Y) − H(Y | XA) is the uncertainty before knowing XA minus the uncertainty after knowing XA.
• 5. Factoring distributions. Given random variables X1, …, Xn, partition the variable set V into A and V\A so that the two parts are as independent as possible. Formally: A* = argminA I(XA; XV\A) s.t. 0 < |A| < n, where I(XA; XB) = H(XB) − H(XB | XA). A fundamental building block in structure learning [Narasimhan & Bilmes, UAI '04].
• 6. Combinatorial problems in ML. Given a (finite) set V and a function F: 2^V → R, want A* = argmin F(A) s.t. some constraints on A. Solving combinatorial problems: mixed integer programming? Often difficult to scale to large problems. Relaxations (e.g., L1 regularization, etc.)? Not clear when they work. This talk: fully combinatorial algorithms (spanning tree, matching, …) that exploit problem structure to get guarantees about the solution!
• 7. Example: Greedy algorithm for feature selection. Given: finite set V of features, utility function F(A) = IG(XA; Y). Want: A* ⊆ V with |A| ≤ k maximizing F — NP-hard! Greedy algorithm: start with A = ∅; for i = 1 to k: s* := argmaxs F(A ∪ {s}); A := A ∪ {s*}. How well can this simple heuristic do?
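The greedy loop above fits in a few lines of code. This is a generic sketch, not the authors' toolbox code; the coverage-style utility below is a made-up stand-in for the information gain F(A) = IG(XA; Y), which would need a probabilistic model to evaluate.

```python
def greedy_max(F, V, k):
    """Greedy maximization of a monotone set function F s.t. |A| <= k."""
    A = set()
    for _ in range(k):
        # s* := argmax_s F(A + {s});  A := A + {s*}
        s_star = max((s for s in V if s not in A),
                     key=lambda s: F(A | {s}) - F(A))
        A.add(s_star)
    return A

# Toy monotone submodular utility: which symptoms a feature "explains"
# (hypothetical stand-in for information gain).
covers = {1: {'fever'}, 2: {'rash'}, 3: {'fever', 'rash', 'male'}}
F = lambda A: len(set().union(*(covers[s] for s in A))) if A else 0
best = greedy_max(F, [1, 2, 3], k=2)
```

Note how greedy immediately picks the most informative feature 3 first, exactly the behavior the slide describes.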
• 8. Key property: Diminishing returns. Compare selection A = {} with selection B = {X2, X3}: adding X1 ("Fever") to the empty set helps a lot, but adding X1 to B doesn't help much. Theorem [Krause, Guestrin UAI '05]: information gain F(A) in naïve Bayes models is submodular! Submodularity: for A ⊆ B, adding s to the small set A gives a large improvement, adding s to the large set B gives a small improvement: F(A ∪ {s}) − F(A) ≥ F(B ∪ {s}) − F(B).
• 9. Why is submodularity useful? Theorem [Nemhauser et al '78]: the greedy maximization algorithm returns Agreedy with F(Agreedy) ≥ (1 − 1/e) max|A|≤k F(A) ≈ 63%. The greedy algorithm gives a near-optimal solution! (More details and exact statement later.) For info-gain, this guarantee is the best possible unless P = NP! [Krause, Guestrin UAI '05]
• 10. Submodularity in Machine Learning. In this tutorial we will see that many ML problems are submodular, i.e., involve a submodular F. Minimization: A* = argmin F(A) — structure learning (A* = argmin I(XA; XV\A)), clustering, MAP inference in Markov random fields, … Maximization: A* = argmax F(A) — feature selection, active learning, ranking, …
• 11. Tutorial Overview Examples and properties of submodular functions Submodularity and convexity Minimizing submodular functions Maximizing submodular functions Research directions, … LOTS of applications to Machine Learning!! 11
• 12. Submodularity: Properties and Examples
• 13. Set functions. Finite set V = {1, 2, …, n}, function F: 2^V → R. Will always assume F(∅) = 0 (w.l.o.g.). Assume a black box that can evaluate F for any input A; approximate (noisy) evaluation of F is ok (e.g., [37]). Example: F(A) = IG(XA; Y) = H(Y) − H(Y | XA) = ∑y,xA P(xA) [log P(y | xA) − log P(y)]. In the naïve Bayes example: F({X1, X2}) = 0.9 ("Fever", "Rash"), F({X2, X3}) = 0.5 ("Rash", "Male").
• 14. Submodular set functions. A set function F on V is called submodular if for all A, B ⊆ V: F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B). Equivalent diminishing-returns characterization: for A ⊆ B and s ∉ B, F(A ∪ {s}) − F(A) ≥ F(B ∪ {s}) − F(B) — adding s to the small set A gives a large improvement, adding it to the large set B a small one.
• 15. Submodularity and supermodularity. A set function F on V is submodular iff 1) for all A, B ⊆ V: F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B), equivalently 2) for all A ⊆ B, s ∉ B: F(A ∪ {s}) − F(A) ≥ F(B ∪ {s}) − F(B). F is called supermodular if −F is submodular. F is called modular if it is both sub- and supermodular; for modular ("additive") F, F(A) = ∑i∈A w(i).
• 16. Example: Set cover. Place sensors in a building: want to cover the floorplan with discs at possible locations V, where each node predicts values of positions within some radius. For A ⊆ V: F(A) = "area covered by sensors placed at A". Formally: W a finite set, with a collection of n subsets Si ⊆ W; for A ⊆ V = {1, …, n} define F(A) = |∪i∈A Si|.
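The coverage function is easy to test numerically. The sketch below (with made-up sensor-footprint sets Si) brute-force checks the submodularity inequality F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B) over all pairs of subsets:

```python
from itertools import combinations

# Coverage function F(A) = |union of S_i for i in A| on a toy collection
# of "sensor footprints" (the sets below are made up for illustration).
S = {1: {1, 2, 3}, 2: {3, 4}, 3: {4, 5, 6}, 4: {6, 7}}

def F(A):
    return len(set().union(*(S[i] for i in A))) if A else 0

# Check F(A) + F(B) >= F(A|B) + F(A&B) for every pair of subsets of V.
V = list(S)
subsets = [frozenset(c) for r in range(len(V) + 1) for c in combinations(V, r)]
assert all(F(A) + F(B) >= F(A | B) + F(A & B)
           for A in subsets for B in subsets)
```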
• 17. Set cover is submodular. With A = {S1, S2}, adding a new disc S' covers a lot of new area; with B = {S1, S2, S3, S4} ⊇ A, much of S' is already covered. Hence F(A ∪ {S'}) − F(A) ≥ F(B ∪ {S'}) − F(B).
• 18. Example: Mutual information. Given random variables X1, …, Xn, let F(A) = I(XA; XV\A) = H(XV\A) − H(XV\A | XA). Lemma: mutual information F(A) is submodular. F(A ∪ {s}) − F(A) = H(Xs | XA) − H(Xs | XV\(A∪{s})); the first term is nonincreasing in A (A ⊆ B ⇒ H(Xs | XA) ≥ H(Xs | XB)), the second is nondecreasing in A, so δs(A) = F(A ∪ {s}) − F(A) is monotonically nonincreasing — hence F is submodular.
• 19. Example: Influence in social networks [Kempe, Kleinberg, Tardos KDD '03]. A social network over V = {Alice, Bob, Charlie, Dorothy, Eric, Fiona}, with a probability of influencing (0.2, 0.3, 0.4, 0.5, …) attached to each edge. Who should get free cell phones? F(A) = expected number of people influenced when targeting A.
• 20. Influence in social networks is submodular [Kempe, Kleinberg, Tardos KDD '03]. Key idea: flip the coins c in advance, yielding "live" edges. Fc(A) = people influenced under outcome c — a set cover function! Then F(A) = ∑c P(c) Fc(A) is submodular as well!
• 21. Closedness properties. If F1, …, Fm are submodular functions on V and λ1, …, λm ≥ 0, then F(A) = ∑i λi Fi(A) is submodular! Submodularity is closed under nonnegative linear combinations — an extremely useful fact: Fθ(A) submodular ⇒ ∑θ P(θ) Fθ(A) submodular (expectations); and for multicriterion optimization, F1, …, Fm submodular and λi ≥ 0 ⇒ ∑i λi Fi(A) submodular.
• 22. Submodularity and concavity. Suppose g: N → R and F(A) = g(|A|). Then F(A) is submodular if and only if g is concave! E.g., g could say "buying in bulk is cheaper".
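Both directions of this equivalence can be spot-checked numerically. The sketch below (my own illustration, not from the slides) brute-force verifies diminishing returns for a concave g = √· and shows that a convex g = (·)² fails:

```python
import math
from itertools import combinations

def is_submodular(F, V):
    """Brute-force check of diminishing returns on all A <= B, s not in B."""
    subsets = [frozenset(c) for r in range(len(V) + 1)
               for c in combinations(V, r)]
    return all(F(A | {s}) - F(A) >= F(B | {s}) - F(B) - 1e-12
               for A in subsets for B in subsets if A <= B
               for s in V if s not in B)

V = list(range(5))
assert is_submodular(lambda A: math.sqrt(len(A)), V)   # concave g: submodular
assert not is_submodular(lambda A: len(A) ** 2, V)     # convex g: fails
```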
• 23. Maximum of submodular functions. Suppose F1(A) and F2(A) are submodular. Is F(A) = max(F1(A), F2(A)) submodular? No — max(F1, F2) is not submodular in general!
• 24. Minimum of submodular functions. Well, maybe F(A) = min(F1(A), F2(A)) instead? Counterexample on V = {a,b}: take F1 with values (∅, {a}, {b}, {a,b}) → (0, 1, 0, 1) and F2 → (0, 0, 1, 1); then F = min(F1, F2) has values (0, 0, 0, 1), so F({b}) − F(∅) = 0 < 1 = F({a,b}) − F({a}). min(F1, F2) is not submodular in general! But stay tuned — we'll address mini Fi later!
• 25. Duality. For F submodular on V, let G(A) = F(V) − F(V\A). G is supermodular and is called the dual of F. Details about its properties in [Fujishige '91].
• 26. Tutorial OverviewExamples and properties of submodular functions Many problems submodular (mutual information, influence, …) SFs closed under positive linear combinations; not under min, maxSubmodularity and convexityMinimizing submodular functionsMaximizing submodular functionsExtensions and research directions 26
• 27. Submodularity and Convexity
• 28. Submodularity and convexity. For V = {1, …, n} and A ⊆ V, let wA = (w1A, …, wnA) with wiA = 1 if i ∈ A, 0 otherwise. Key result [Lovász '83]: every submodular function F induces a function g on R+ⁿ such that F(A) = g(wA) for all A ⊆ V; g(w) is convex; and minA F(A) = minw g(w) s.t. w ∈ [0,1]ⁿ. Let's see how one can define g(w).
• 29. The submodular polyhedron PF. PF = {x ∈ Rⁿ: x(A) ≤ F(A) for all A ⊆ V}, where x(A) = ∑i∈A xi. Example for V = {a,b} with F(∅) = 0, F({a}) = −1, F({b}) = 2, F({a,b}) = 0: PF is cut out by x({a}) ≤ −1, x({b}) ≤ 2, x({a,b}) ≤ 0.
• 30. Lovász extension. Claim: g(w) = maxx∈PF wᵀx, with PF = {x ∈ Rⁿ: x(A) ≤ F(A) for all A ⊆ V}; the maximizer xw = argmaxx∈PF wᵀx gives g(w) = wᵀxw. But evaluating g(w) this way requires solving a linear program with exponentially many constraints…
• 31. Evaluating the Lovász extension. g(w) = maxx∈PF wᵀx, PF = {x ∈ Rⁿ: x(A) ≤ F(A) for all A ⊆ V}. Theorem [Edmonds '71, Lovász '83]: for any given w, can get the optimal solution xw to the LP using the following greedy algorithm: order V = {e1, …, en} so that w(e1) ≥ … ≥ w(en); let xw(ei) = F({e1,…,ei}) − F({e1,…,ei−1}). Then wᵀxw = g(w) = maxx∈PF wᵀx. Sanity check: if w = wA with A = {e1,…,ek}, the sum telescopes and wᵀxw = F(A).
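Edmonds' greedy evaluation is short enough to write out. The sketch below replays the slides' two-element example (F(∅) = 0, F({a}) = −1, F({b}) = 2, F({a,b}) = 0); it assumes F(∅) = 0, as stated on slide 13:

```python
def lovasz_extension(F, V, w):
    """g(w) = max_{x in P_F} w^T x via Edmonds' greedy algorithm."""
    order = sorted(V, key=lambda e: -w[e])   # w(e1) >= ... >= w(en)
    g, prefix, prev = 0.0, [], 0.0           # prev = F(emptyset) = 0
    for e in order:
        prefix.append(e)
        cur = F(prefix)
        g += w[e] * (cur - prev)             # x_w(e_i) = F(e1..ei) - F(e1..ei-1)
        prev = cur
    return g

# The running example on V = {a, b}.
table = {frozenset(): 0, frozenset('a'): -1,
         frozenset('b'): 2, frozenset('ab'): 0}
F = lambda A: table[frozenset(A)]
assert lovasz_extension(F, 'ab', {'a': 0, 'b': 1}) == 2   # = F({b})
assert lovasz_extension(F, 'ab', {'a': 1, 'b': 1}) == 0   # = F({a,b})
```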
• 32. Example: Lovász extension. With F as before (F(∅) = 0, F({a}) = −1, F({b}) = 2, F({a,b}) = 0), evaluate g(w) for w = [0,1]. Greedy ordering: e1 = b, e2 = a, since w(e1) = 1 > w(e2) = 0; xw(e1) = F({b}) − F(∅) = 2, xw(e2) = F({b,a}) − F({b}) = −2, so xw = [−2,2] and g([0,1]) = [0,1]ᵀ[−2,2] = 2 = F({b}). Similarly g([1,1]) = [1,1]ᵀ[−1,1] = 0 = F({a,b}).
• 33. Why is this useful? Theorem [Lovász '83]: g(w) attains its minimum over [0,1]ⁿ at a corner! So if we can minimize g on [0,1]ⁿ, we can minimize F (at corners, g and F take the same values). Thus F(A) submodular ⇒ g(w) convex (and efficient to evaluate). Does the converse also hold? No: consider g(w1,w2,w3) = max(w1, w2+w3); the induced set function on {a,b,c} has F({a,b}) − F({a}) = 0 < F({a,b,c}) − F({a,c}) = 1.
• 34. Tutorial OverviewExamples and properties of submodular functions Many problems submodular (mutual information, influence, …) SFs closed under positive linear combinations; not under min, maxSubmodularity and convexity Every SF induces a convex function with SAME minimum Special properties: Greedy solves LP over exponential polytopeMinimizing submodular functionsMaximizing submodular functionsExtensions and research directions 34
• 35. Minimization of submodular functions
• 36. Overview minimizationMinimizing general submodular functionsMinimizing symmetric submodular functionsApplications to Machine Learning 36
• 37. Minimizing a submodular function. Want to solve A* = argminA F(A). Need to solve minw maxx wᵀx s.t. w ∈ [0,1]ⁿ, x ∈ PF (i.e., minw g(w)). Equivalently: minc,w c s.t. c ≥ wᵀx for all x ∈ PF and w ∈ [0,1]ⁿ.
• 38. Ellipsoid algorithm [Grötschel, Lovász, Schrijver '81]. Solve minc,w c s.t. c ≥ wᵀx for all x ∈ PF, w ∈ [0,1]ⁿ. Separation oracle: find the most violated constraint, maxx wᵀx − c s.t. x ∈ PF — which we can solve using the greedy algorithm!! Hence the ellipsoid algorithm minimizes SFs in poly-time!
• 39. Minimizing submodular functions. The ellipsoid algorithm is not very practical; want a combinatorial algorithm for minimization! Theorem [Iwata (2001)]: there is a fully combinatorial, strongly polynomial algorithm for minimizing SFs that runs in time O(n⁸ log² n). But polynomial-time = practical???
• 40. A more practical alternative? [Fujishige '91, Fujishige et al '06]. Base polytope: BF = PF ∩ {x: x(V) = F(V)}. Minimum norm algorithm: 1. find x* = argmin ||x||² s.t. x ∈ BF; 2. return A* = {i: x*(i) < 0}. In the running example (F(∅) = 0, F({a}) = −1, F({b}) = 2, F({a,b}) = 0): x* = [−1,1], so A* = {a}. Theorem [Fujishige '91]: A* is an optimal solution! Note: step 1 can be solved using Wolfe's algorithm — whose runtime is finite but unknown!
• 41. Empirical comparison [Fujishige et al '06]. On cut functions from the DIMACS Challenge (problem sizes 64–1024; running time in seconds, both axes log-scale, lower is better), the minimum norm algorithm is orders of magnitude faster. Our implementation can solve n = 10k in < 6 minutes!
• 42. Checking optimality (duality). Theorem [Edmonds '70]: minA F(A) = maxx {x⁻(V): x ∈ BF}, where x⁻(s) = min{x(s), 0}. Testing how close a candidate A' is to minA F(A): run the greedy algorithm for w = wA' to get xw; then F(A') ≥ minA F(A) ≥ xw⁻(V). In the example: A = {a}, F(A) = −1; w = [1,0] gives xw = [−1,1], xw⁻ = [−1,0], xw⁻(V) = −1, so A is optimal!
• 43. Overview minimization. Minimizing general submodular functions: can minimize in polytime using the ellipsoid method; combinatorial, strongly polynomial O(n⁸) algorithm; practical alternative: minimum norm algorithm? Next: minimizing symmetric submodular functions; applications to Machine Learning.
• 44. What if we have special structure? Worst-case complexity of the best known general algorithm: O(n⁸ log² n). Can we do better for special cases? Example (again): given RVs X1,…,Xn, F(A) = I(XA; XV\A) = I(XV\A; XA) = F(V\A). Functions F with F(A) = F(V\A) for all A are called symmetric.
• 45. Another example: Cut functions. For a weighted graph on V = {a,b,c,d,e,f,g,h} (edge weights as in the figure), F(A) = ∑{ws,t: s ∈ A, t ∈ V\A}. Example: F({a}) = 6; F({c,d}) = 10; F({a,b,c,d}) = 2. The cut function is symmetric and submodular!
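A cut function is a one-liner. Since the slide's edge weights are not fully legible in this transcript, the sketch below uses a small made-up graph, and checks the symmetry F(A) = F(V\A):

```python
# Cut function F(A) = total weight of edges crossing (A, V \ A),
# on a small hypothetical graph (weights made up for illustration).
edges = {('a', 'b'): 2, ('b', 'c'): 1, ('c', 'd'): 3, ('a', 'd'): 2}
V = {'a', 'b', 'c', 'd'}

def cut(A):
    A = set(A)
    return sum(w for (u, v), w in edges.items() if (u in A) != (v in A))

# Symmetry: F(A) = F(V \ A).
assert cut({'a'}) == cut(V - {'a'}) == 4          # edges a-b and a-d cross
assert cut({'a', 'b'}) == 3                       # edges b-c and a-d cross
```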
• 46. Minimizing symmetric functions. For any A, submodularity implies 2F(A) = F(A) + F(V\A) ≥ F(A ∩ (V\A)) + F(A ∪ (V\A)) = F(∅) + F(V) = 2F(∅) = 0. Hence any symmetric SF attains its minimum at ∅. In practice we want a nontrivial partition of V into A and V\A, i.e., require that A is neither ∅ nor V: A* = argmin F(A) s.t. 0 < |A| < n. There is an efficient algorithm for doing that!
• 47. Queyranne's algorithm (overview) [Queyranne '98]. Theorem: there is a fully combinatorial, strongly polynomial algorithm for solving A* = argminA F(A) s.t. 0 < |A| < n for symmetric submodular functions, running in time O(n³) [instead of O(n⁸)…]. Note: it also works for "posimodular" functions: F posimodular ⟺ for all A, B ⊆ V: F(A) + F(B) ≥ F(A\B) + F(B\A).
• 48. Gomory-Hu trees. A tree T is called a Gomory-Hu (GH) tree for SF F if for any s, t ∈ V: min {F(A): s ∈ A, t ∉ A} = min {wi,j: (i,j) is an edge on the s-t path in T} — "min s-t cut in T = min s-t cut in G". Theorem [Queyranne '93]: GH trees exist for any symmetric SF F! But they are expensive to find in general…
• 49. Pendent pairs. For a function F on V and s, t ∈ V: (s,t) is a pendent pair if {s} ∈ argminA F(A) s.t. s ∈ A, t ∉ A. Pendent pairs always exist: in a Gomory-Hu tree T, take any leaf s and its neighbor t; then (s,t) is pendent! Theorem [Queyranne '95]: can find pendent pairs in O(n²) (without needing a GH tree!).
• 50. Why are pendent pairs useful? Key idea: let (s,t) be pendent and A* = argmin F(A). Then EITHER s and t are separated by A* (e.g., s ∈ A*, t ∉ A*) — but then A* = {s}!! — OR s and t are not separated by A*, and then we can merge s and t…
• 51. Merging. Suppose F is a symmetric SF on V and we want to merge the pendent pair (s,t). Key idea: "if we pick s, we get t for free": V' = V\{t}; F'(A) = F(A ∪ {t}) if s ∈ A, F'(A) = F(A) if s ∉ A. Lemma: F' is still symmetric and submodular!
• 52. Queyranne's algorithm. Input: symmetric SF F on V, |V| = n. Output: A* = argmin F(A) s.t. 0 < |A| < n. Initialize F' ← F and V' ← V. For i = 1 to n−1: (s,t) ← pendentPair(F',V'); Ai = {s}; (F',V') ← merge(F',V',s,t). Return argmini F(Ai). Running time: O(n³) function evaluations.
• 53. Note: Finding pendent pairs. Initialize v1 ← x (x an arbitrary element of V). For i = 1 to n−1: Wi ← {v1,…,vi}; vi+1 ← argminv F(Wi ∪ {v}) − F({v}) s.t. v ∈ V\Wi. Return the pendent pair (vn−1, vn). Requires O(n²) evaluations of F.
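Slides 52-53 combine into a compact implementation. The sketch below is a best-effort rendering (not the authors' toolbox code): merging is done by keeping "supernodes" as tuples of original elements, and it is tested on a small made-up cut function rather than the slide's graph.

```python
def pendent_pair(F, elems):
    """Queyranne's O(n^2) pendent-pair routine (slide 53)."""
    W, rest = [elems[0]], list(elems[1:])
    while rest:
        v = min(rest, key=lambda v: F(W + [v]) - F([v]))
        W.append(v)
        rest.remove(v)
    return W[-2], W[-1]   # (t, s): {s} minimizes over sets containing s, not t

def queyranne(F, V):
    """argmin F(A) s.t. 0 < |A| < |V| for symmetric submodular F (slide 52)."""
    merged = [(v,) for v in V]                # supernodes of original elements
    Fm = lambda nodes: F({x for nd in nodes for x in nd})
    candidates = []
    while len(merged) > 1:
        t, s = pendent_pair(Fm, merged)
        candidates.append((Fm([s]), set(s)))  # candidate A_i = {s}
        merged.remove(t)
        merged.remove(s)
        merged.append(t + s)                  # merge the pendent pair
    return min(candidates, key=lambda c: c[0])[1]

# Tiny made-up cut function: the minimum nontrivial cut is {c, d} (weight 1).
edges = {('a', 'b'): 3, ('b', 'c'): 1, ('c', 'd'): 3}
cut = lambda A: sum(w for (u, v), w in edges.items() if (u in A) != (v in A))
assert queyranne(cut, 'abcd') == {'c', 'd'}
```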
• 54. Overview minimization. Minimizing general submodular functions: can minimize in polytime using the ellipsoid method; combinatorial, strongly polynomial O(n⁸) algorithm; practical alternative: minimum norm algorithm? Minimizing symmetric submodular functions: many useful submodular functions are symmetric; Queyranne's algorithm minimizes symmetric SFs in O(n³). Next: applications to Machine Learning.
• 55. Application: Clustering [Narasimhan, Jojic, Bilmes NIPS '05]. Group data points V into "homogeneous clusters": find a partition V = A1 ∪ … ∪ Ak that minimizes F(A1,…,Ak) = ∑i E(Ai), where E(Ai) measures the "inhomogeneity of Ai". Examples for E(A): entropy H(A); cut function. Special case k = 2: F(A) = E(A) + E(V\A) is symmetric! If E is submodular, can use Queyranne's algorithm!
• 56. What if we want k > 2 clusters? [Zhao et al '05, Narasimhan et al '05]. Greedy splitting algorithm: start with partition P = {V}; for i = 1 to k−1: for each member Cj ∈ P, split cluster Cj via A* = argmin E(A) + E(Cj\A) s.t. 0 < |A| < |Cj|, and let Pj ← P \ {Cj} ∪ {A, Cj\A} (the partition we get by splitting the j-th cluster); then P ← argminj F(Pj). Theorem: F(P) ≤ (2 − 2/k) F(Popt).
• 57. Example: Clustering species [Narasimhan et al '05]. Common genetic information = number of common substrings; can easily extend to sets of species.
• 58. Example: Clustering species [Narasimhan et al ‘05]The common genetic information ICG does not require alignment captures genetic similarity is smallest for maximally evolutionarily diverged species is a symmetric submodular function! Greedy splitting algorithm yields phylogenetic tree! 58
• 59. Example: SNPs [Narasimhan et al '05]. Study human genetic variation (for personalized medicine, …). Most human variation is due to point mutations that occur once in human history at that base location: Single Nucleotide Polymorphisms (SNPs). Cataloging all variation is too expensive (\$10K-\$100K per individual!!).
• 60. SNPs in the ACE gene [Narasimhan et al '05]. Rows: individuals; columns: SNPs. Which columns should we pick to reconstruct the rest? Can find a near-optimal clustering (Queyranne's algorithm).
• 61. Reconstruction accuracy [Narasimhan et al '05]. Prediction accuracy vs. number of clusters: the submodular clustering compares favorably with clustering based on entropy, pairwise correlation, and PCA.
• 62. Example: Speaker segmentation [Reyes-Gomez, Jojic '07]. Partition the spectrogram of mixed waveforms (frequency × time) into "regions" belonging to each speaker ("Alice", "Fiona", …). E(A) = −log p(XA) is the likelihood of region A; F(A) = E(A) + E(V\A) is symmetric and posimodular, so it can be minimized using Queyranne's algorithm.
• 63. Example: Image denoising 63
• 64. Example: Image denoising. Pairwise Markov random field: P(x1,…,xn,y1,…,yn) = ∏i,j ψi,j(yi,yj) ∏i φi(xi,yi), where the Xi are the noisy pixels and the Yi the "true" pixels. Want argmaxy P(y | x) = argmaxy log P(x,y) = argminy ∑i,j Ei,j(yi,yj) + ∑i Ei(yi), with Ei,j(yi,yj) = −log ψi,j(yi,yj). When is this MAP inference efficiently solvable (in high-treewidth graphical models)?
• 65. MAP inference in Markov random fields [Kolmogorov et al, PAMI '04; see also Hammer, Ops Res '65]. Energy E(y) = ∑i,j Ei,j(yi,yj) + ∑i Ei(yi). Suppose the yi are binary; define F(A) = E(yA) where yAi = 1 iff i ∈ A. Then miny E(y) = minA F(A). Theorem: the MAP inference problem is solvable by graph cuts ⟺ for all i,j: Ei,j(0,0) + Ei,j(1,1) ≤ Ei,j(0,1) + Ei,j(1,0), i.e., each Ei,j is submodular. "Efficient if we prefer that neighboring pixels have the same color."
• 66. Constrained minimization. Have seen: if F is submodular on V, can solve A* = argmin F(A) s.t. A ⊆ V. What about A* = argmin F(A) s.t. A ⊆ V and |A| ≤ k — e.g., clustering with a minimum number of points per cluster? In general, not much is known about constrained minimization. However, can do: A* = argmin F(A) s.t. 0 < |A| < n; s.t. |A| is odd/even [Goemans & Ramakrishnan '95]; s.t. A ∈ argmin G(A) for G submodular [Fujishige '91].
• 67. Overview minimization. Minimizing general submodular functions: can minimize in polytime using the ellipsoid method; combinatorial, strongly polynomial O(n⁸) algorithm; practical alternative: minimum norm algorithm? Minimizing symmetric submodular functions: many useful submodular functions are symmetric; Queyranne's algorithm minimizes symmetric SFs in O(n³). Applications to Machine Learning: clustering [Narasimhan et al '05]; speaker segmentation [Reyes-Gomez & Jojic '07]; MAP inference [Kolmogorov et al '04].
• 68. Tutorial OverviewExamples and properties of submodular functions Many problems submodular (mutual information, influence, …) SFs closed under positive linear combinations; not under min, maxSubmodularity and convexity Every SF induces a convex function with SAME minimum Special properties: Greedy solves LP over exponential polytopeMinimizing submodular functions Minimization possible in polynomial time (but O(n8)…) Queyranne’s algorithm minimizes symmetric SFs in O(n3) Useful for clustering, MAP inference, structure learning, …Maximizing submodular functionsExtensions and research directions 68
• 69. Maximizing submodular functions
• 70. Maximizing submodular functions. Minimizing convex functions: polynomial-time solvable! Minimizing submodular functions: polynomial-time solvable! Maximizing convex functions: NP-hard! Maximizing submodular functions: NP-hard — but we can get approximation guarantees!
• 71. Maximizing influence [Kempe, Kleinberg, Tardos KDD '03]. F(A) = expected number of people influenced when targeting A. F is monotonic: if A ⊆ B then F(A) ≤ F(B) — hence V = argmaxA F(A), the trivial solution. More interesting: argmaxA F(A) − Cost(A).
• 72. Maximizing non-monotonic functions. Suppose we want A* = argmax F(A) s.t. A ⊆ V for a non-monotonic F (the maximum can occur at intermediate |A|). Example: F(A) = U(A) − C(A), where U(A) is a submodular utility and C(A) a supermodular cost function — e.g., trading off utility and privacy in personalized search [Krause & Horvitz AAAI '08]. In general: NP-hard. Moreover, if F(A) can take negative values, it is as hard to approximate as maximum independent set.
• 73. Maximizing positive submodular functions [Feige, Mirrokni, Vondrák FOCS '07]. Theorem: there is an efficient randomized local search procedure that, given a positive submodular function F with F(∅) = 0, returns a set ALS such that F(ALS) ≥ (2/5) maxA F(A). Picking a random set gives a 1/4 approximation (1/2 approximation if F is symmetric!), and we cannot get better than a 3/4 approximation unless P = NP.
• 74. Scalarization vs. constrained maximization. Given a monotonic utility F(A) and cost C(A), optimize either: Option 1 ("scalarization"): maxA F(A) − C(A) s.t. A ⊆ V — can get the 2/5 approximation if F(A) − C(A) ≥ 0 for all A ⊆ V, but positiveness is a strong requirement. Option 2 ("constrained maximization"): maxA F(A) s.t. C(A) ≤ B — coming up…
• 75. Constrained maximization: outline. Maximize a monotonic submodular function of the selected set, subject to a selection cost within budget. First: subset selection, C(A) = |A|. Then: robust optimization; complex constraints.
• 76. Monotonicity. A set function is called monotonic if A ⊆ B ⊆ V ⇒ F(A) ≤ F(B). Examples: influence in social networks [Kempe et al KDD '03]; for discrete RVs, entropy F(A) = H(XA) is monotonic (suppose B = A ∪ C; then F(B) = H(XA, XC) = H(XA) + H(XC | XA) ≥ H(XA) = F(A)); information gain F(A) = H(Y) − H(Y | XA); set cover; matroid rank functions (dimension of vector spaces, …); …
• 77. Subset selection. Given: finite set V, monotonic submodular function F with F(∅) = 0. Want: A* = argmax F(A) s.t. |A| ≤ k — NP-hard!
• 78. Exact maximization of monotonic submodular functions. 1) Mixed integer programming [Nemhauser et al '81]: max η s.t. η ≤ F(B) + ∑s∈V\B αs δs(B) for all B ⊆ V; ∑s αs ≤ k; αs ∈ {0,1}; where δs(B) = F(B ∪ {s}) − F(B). Solved using constraint generation. 2) Branch-and-bound: the "data-correcting algorithm" [Goldengorin et al '99]. Both algorithms are worst-case exponential!
• 79. Approximate maximization. Given: finite set V, monotonic submodular function F. Want: A* = argmax F(A) s.t. |A| ≤ k — NP-hard! Greedy algorithm: start with A0 = ∅; for i = 1 to k: si := argmaxs F(Ai−1 ∪ {s}) − F(Ai−1); Ai := Ai−1 ∪ {si}.
• 80. Performance of greedy algorithm. Theorem [Nemhauser et al '78]: given a monotonic submodular function F with F(∅) = 0, the greedy maximization algorithm returns Agreedy with F(Agreedy) ≥ (1 − 1/e) max|A|≤k F(A) ≈ 63%. Sidenote: the greedy algorithm gives a 1/2 approximation for maximization over any matroid constraint! [Fisher et al '78]
• 81. An "elementary" counterexample. X1, X2 ~ Bernoulli(0.5), Y = X1 XOR X2. Let F(A) = IG(XA; Y) = H(Y) − H(Y | XA). Y | X1 and Y | X2 ~ Bernoulli(0.5) (entropy 1), but Y | X1,X2 is deterministic (entropy 0). Hence F({1,2}) − F({1}) = 1, but F({2}) − F(∅) = 0 — diminishing returns fails, so info-gain is submodular only under some conditions! (later)
• 82. Example: Submodularity of info-gain. Y1,…,Ym, X1,…,Xn discrete RVs; F(A) = IG(Y; XA) = H(Y) − H(Y | XA). F(A) is always monotonic, but NOT always submodular. Theorem [Krause & Guestrin UAI '05]: if the Xi are all conditionally independent given Y, then F(A) is submodular! Hence the greedy algorithm works — and in fact NO algorithm can do better than the (1 − 1/e) approximation!
• 83. Building a Sensing Chair [Mutlu, Krause, Forlizzi, Guestrin, Hodgins UIST '07]. People sit a lot; activity recognition in assistive technologies; seating pressure as a user interface. A chair equipped with 1 sensor per cm² achieves 82% accuracy on 10 postures (lean left, lean forward, slouch, …) [Tan et al] — but costs \$16,000! Can we get similar accuracy with fewer, cheaper sensors?
• 84. How to place sensors on a chair? Treat sensor readings at locations V as random variables; predict posture Y using a probabilistic model P(Y,V); pick sensor locations A* ⊆ V to minimize the entropy of Y given the readings. Placed sensors and did a user study: before, 82% accuracy at \$16,000; after, 79% accuracy at \$100 — similar accuracy at < 1% of the cost!
• 85. Variance reduction (a.k.a. orthogonal matching pursuit, forward regression). Let Y = ∑i αi Xi + ε with (X1,…,Xn,ε) ~ N(·; μ, Σ). Want to pick a subset XA to predict Y. Var(Y | XA = xA) is the conditional variance of Y given XA = xA; expected variance: Var(Y | XA) = ∫ p(xA) Var(Y | XA = xA) dxA; variance reduction: FV(A) = Var(Y) − Var(Y | XA). FV(A) is always monotonic. Theorem [Das & Kempe, STOC '08]: FV(A) is submodular under some conditions on Σ — so orthogonal matching pursuit is near optimal! [see other analyses by Tropp, Donoho et al., and Temlyakov]
• 86. Batch mode active learning [Hoi et al, ICML '06]. Which data points should we label to minimize error? Want a batch A of k points to show an expert for labeling. F(A) selects examples that are uncertain (σ²(s) = π(s)(1 − π(s)) is large), diverse (points in A are as different as possible), and relevant (as close to V\A as possible, sᵀs' large). F(A) is submodular and monotonic! [an approximation to the improvement in Fisher information]
• 87. Results about active learning [Hoi et al, ICML '06]. Batch mode active learning performs better than picking k points at random or picking the k points of highest entropy.
• 88. Monitoring water networks [Krause et al, J Wat Res Mgt 2008]. Contamination of drinking water could affect millions of people. Place sensors (~\$14K each, e.g., the Hach sensor) to detect contaminations, simulated with a simulator from the EPA — the "Battle of the Water Sensor Networks" competition. Where should we place sensors to quickly detect contamination?
• 89. Model-based sensing. Utility of placing sensors is based on a model of the world; for water networks, a water flow simulator from the EPA. Over the set V of all network junctions, F(A) = expected impact reduction from placing sensors at A: the model predicts low/medium/high impact for each contamination location, and a sensor reduces impact through early detection (e.g., a good placement has high impact reduction F(A) = 0.9, a poor one F(A) = 0.01). Theorem [Krause et al., J Wat Res Mgt '08]: impact reduction F(A) in water networks is submodular!
• 90. Battle of the Water Sensor Networks CompetitionReal metropolitan area network (12,527 nodes)Water flow simulator provided by EPA3.6 million contamination eventsMultiple objectives: Detection time, affected population, …Place sensors that detect well “on average” 90
• 91. Bounds on optimal solution [Krause et al., J Wat Res Mgt '08]. Plot of population protected F(A) vs. number of sensors placed (0-20) on the water network data: the greedy solution sits well below the offline (Nemhauser) bound. The (1 − 1/e) bound is quite loose… can we get better bounds?
• 92. Data-dependent bounds [Minoux '78]. Suppose A is a candidate solution to argmax F(A) s.t. |A| ≤ k, and let A* = {s1,…,sk} be an optimal solution. Then F(A*) ≤ F(A ∪ A*) = F(A) + ∑i [F(A ∪ {s1,…,si}) − F(A ∪ {s1,…,si−1})] ≤ F(A) + ∑i (F(A ∪ {si}) − F(A)) = F(A) + ∑i δsi. So: for each s ∈ V\A, let δs = F(A ∪ {s}) − F(A); order so that δ1 ≥ δ2 ≥ … ≥ δn; then F(A*) ≤ F(A) + ∑i≤k δi.
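The bound above is one line of code once F is available. The sketch below checks it against the brute-force optimum on a made-up coverage instance (not the water network data):

```python
from itertools import combinations

def minoux_bound(F, V, A, k):
    """Data-dependent bound (slide 92):
    F(A*) <= F(A) + sum of the k largest marginals delta_s = F(A+{s}) - F(A)."""
    deltas = sorted((F(A | {s}) - F(A) for s in V if s not in A), reverse=True)
    return F(A) + sum(deltas[:k])

# Made-up coverage instance.
S = {1: {1, 2}, 2: {2, 3}, 3: {3, 4, 5}}
F = lambda A: len(set().union(*(S[i] for i in A))) if A else 0
opt = max(F(set(c)) for c in combinations(S, 2))     # brute force, k = 2
assert minoux_bound(F, set(S), {1}, 2) >= opt        # holds for any candidate A
```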
• 93. Bounds on optimal solution [Krause et al., J Wat Res Mgt '08]. On the water network data (sensing quality F(A) vs. number of sensors placed, 0-20), the data-dependent bound is much tighter than the offline (Nemhauser) bound: submodularity gives data-dependent bounds on the performance of any algorithm.
• 94. BWSN competition results [Ostfeld et al., J Wat Res Mgt 2008]. 13 participants; performance measured on 30 different criteria; entries based on genetic algorithms (G), domain knowledge (D), other heuristics (H), and "exact" methods/MIP (E). Our entry scored 24% better than the runner-up!
• 95. What was the trick? Simulated all 3.6M contaminations in 2 weeks on 40 processors: 152 GB of data on disk, 16 GB in main memory (compressed). Very accurate computation of F(A), but very slow evaluation: naive greedy takes 30 hours for 20 sensors, and exhaustive search (all subsets) would take 6 weeks for all 30 settings. Submodularity to the rescue…
• 96. Scaling up the greedy algorithm [Minoux '78]. In round i+1, having picked Ai = {s1,…,si}, we pick si+1 = argmaxs F(Ai ∪ {s}) − F(Ai), i.e., we maximize the "marginal benefit" δs(Ai) = F(Ai ∪ {s}) − F(Ai). Key observation: submodularity implies i ≤ j ⇒ δs(Ai) ≥ δs(Aj) — marginal benefits can never increase!
• 97. "Lazy" greedy algorithm [Minoux '78]. First iteration as usual; then keep an ordered list of marginal benefits δi from the previous iteration; re-evaluate δi only for the top element; if δi stays on top, use it, otherwise re-sort. Note: very easy to compute online bounds, lazy evaluations, etc. [Leskovec et al. '07]
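One way to maintain the ordered list is a max-heap of stale marginal gains; since gains only shrink, a stale value is an upper bound, so if the freshly re-evaluated top still beats the next stale value it is safe to take. This is a generic sketch, not the paper's implementation:

```python
import heapq

def lazy_greedy(F, V, k):
    """Minoux's lazy greedy (slide 97): re-evaluate only the top element."""
    A, fA = set(), 0.0                       # assumes F(emptyset) = 0
    heap = [(-F({s}), s) for s in V]         # stale (negated) marginal gains
    heapq.heapify(heap)
    while len(A) < k and heap:
        _, s = heapq.heappop(heap)
        fresh = F(A | {s}) - fA              # re-evaluate the top element
        if heap and fresh < -heap[0][0]:     # no longer on top: re-insert
            heapq.heappush(heap, (-fresh, s))
        else:                                # stays on top: safe to take
            A.add(s)
            fA += fresh
    return A

# Toy coverage utility (made up); lazy greedy matches naive greedy's value.
S = {1: {1, 2, 3}, 2: {3, 4}, 3: {4, 5, 6}, 4: {6, 7}}
F = lambda A: len(set().union(*(S[i] for i in A))) if A else 0
assert F(lazy_greedy(F, list(S), 2)) == 6
```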
• 98. Result of lazy evaluation. Same setup as before (3.6M simulated contaminations, 152 GB on disk, 16 GB in main memory): naive greedy takes 30 hours for 20 sensors; exhaustive search would take 6 weeks for all 30 settings. Using "lazy evaluations": 1 hour for 20 sensors — done after 2 days!
• 99. What about worst-case? [Krause et al., NIPS '07]. Knowing the sensor locations, an adversary contaminates right where detection is weakest! Two placements can detect equally well "on average" (accidental contamination) yet have very different worst-case impact. Where should we place sensors to quickly detect in the worst case?
• 100. Constrained maximization: outline. Utility function, selected set, selection cost, budget. Done: subset selection. Now: robust optimization. Then: complex constraints.
• 101. Optimizing for the worst case. Separate utility function Fi for each contamination i: Fi(A) = impact reduction by sensors A for contamination i. Want to solve A* = argmax|A|≤k mini Fi(A): for a contamination at node s, Fs(A) should be high; for a contamination at node r, Fr(A) should be high; and so on. Each of the Fi is submodular — but unfortunately mini Fi is not submodular! How can we solve this robust optimization problem?
• 102. How does the greedy algorithm do? V = {a, b, c}; can only buy k = 2.
Set A | F1 | F2 | mini Fi
{a} | 1 | 0 | 0
{b} | 0 | 2 | 0
{c} | ε | ε | ε
{a,c} | 1 | ε | ε
{b,c} | ε | 2 | ε
{a,b} | 2 | 1 | 1
Greedy picks c first; then it can choose only a or b, ending with score ε. The optimal solution {a,b} scores 1, so greedy does arbitrarily badly. Is there something better? Hence we can’t find any approximation algorithm… or can we? Theorem [NIPS ’07]: The problem max|A|≤k mini Fi(A) does not admit any approximation unless P = NP.
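The failure mode above can be reproduced directly; this is a toy sketch of mine, with the labels a, b, c standing in for the slide’s icons and the utilities tabulated exactly as in the table:

```python
eps = 0.01

# Tabulated utilities for the two contamination scenarios.
F1 = {frozenset(): 0, frozenset('a'): 1, frozenset('b'): 0, frozenset('c'): eps,
      frozenset('ab'): 2, frozenset('ac'): 1, frozenset('bc'): eps, frozenset('abc'): 2}
F2 = {frozenset(): 0, frozenset('a'): 0, frozenset('b'): 2, frozenset('c'): eps,
      frozenset('ab'): 1, frozenset('ac'): eps, frozenset('bc'): 2, frozenset('abc'): 2}

def worst_case(A):
    """The robust objective min_i F_i(A) -- not submodular."""
    key = frozenset(A)
    return min(F1[key], F2[key])

def greedy(V, obj, k):
    """Plain greedy applied (incorrectly) to the non-submodular objective."""
    A = set()
    for _ in range(k):
        A.add(max((s for s in V if s not in A), key=lambda s: obj(A | {s})))
    return A
```

Running greedy on `worst_case` picks c first (the only singleton with positive min-score) and gets stuck at ε, while the true optimum {a, b} achieves 1 — exactly the gap the slide describes.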
• 103. Alternative formulation If somebody told us the optimal value c, could we recover the optimal solution A*? Need to find some A with mini Fi(A) ≥ c and |A| ≤ k. Is this any easier? Yes, if we relax the constraint |A| ≤ k.
• 104. Solving the alternative problem Trick: for each Fi and c, define the truncation F’i,c(A) = min{Fi(A), c}. It remains submodular! [Plot: Fi(A) capped at c as |A| grows.] Problem 1 (last slide), with mini Fi as objective, is non-submodular — we don’t know how to solve it. Problem 2, with the truncations as constraint, is submodular. Solving one solves the other: same optimal solutions, but Problem 2 appears easier.
• 105. Maximization vs. coverage Previously: wanted A* = argmax F(A) s.t. |A| ≤ k. Now need to solve: A* = argmin |A| s.t. F(A) ≥ Q. Greedy algorithm: start with A := ∅; while F(A) < Q and |A| < n: s* := argmaxs F(A ∪ {s}); A := A ∪ {s*}. (For the bound, assume F is integral; if not, just round it.) Theorem [Wolsey et al.]: Greedy will return Agreedy with |Agreedy| ≤ (1 + log maxs F({s})) |Aopt|.
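The coverage-style greedy on this slide is a small variation on the maximization greedy; here is a minimal sketch of mine, assuming a monotone submodular F given as a callable on lists:

```python
def greedy_cover(V, F, Q):
    """Greedily build the (approximately) smallest A with F(A) >= Q.
    Assumes F is monotone submodular."""
    A = []
    val = F(A)
    while val < Q and len(A) < len(V):
        # Add the element with the largest marginal benefit.
        s = max((x for x in V if x not in A), key=lambda x: F(A + [x]) - val)
        A.append(s)
        val = F(A)
    return A
```

By Wolsey’s theorem quoted above, the returned set is at most a (1 + log maxs F({s})) factor larger than the smallest feasible set.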
• 106. Solving the alternative problem Trick: for each Fi and c, define the truncation F’i,c(A) = min{Fi(A), c}. Problem 1 (last slide) is non-submodular — we don’t know how to solve it. Problem 2 is submodular — we can use the greedy algorithm!
• 107. Back to our example Guess c = 1.
Set A | F1 | F2 | mini Fi | F’avg,1
{a} | 1 | 0 | 0 | ½
{b} | 0 | 2 | 0 | ½
{c} | ε | ε | ε | ε
{a,c} | 1 | ε | ε | (1+ε)/2
{b,c} | ε | 2 | ε | (1+ε)/2
{a,b} | 2 | 1 | 1 | 1
Greedy on F’avg,1 first picks a, then picks b — the optimal solution! How do we find c? Do binary search!
• 108. SATURATE Algorithm [Krause et al., NIPS ’07] Given: set V, integer k and monotonic SFs F1,…,Fm. Initialize cmin = 0, cmax = mini Fi(V). Do binary search: c = (cmin + cmax)/2; greedily find AG such that F’avg,c(AG) = c; if |AG| ≤ αk: increase cmin; if |AG| > αk: decrease cmax; until convergence. [Figure: truncation threshold shown in color.]
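A compact sketch of SATURATE as stated above; the stopping tolerance `eps` and the plain (non-lazy) inner greedy are simplifications of mine, and each Fi is assumed to be a monotone submodular callable on lists:

```python
def saturate(V, Fs, k, alpha=1.0, eps=1e-3):
    """Binary search on c; greedily cover the truncated average
    F'_avg,c(A) = (1/m) * sum_i min(F_i(A), c)."""
    def F_avg(A, c):
        return sum(min(F(A), c) for F in Fs) / len(Fs)

    c_min, c_max = 0.0, min(F(V) for F in Fs)
    best = []
    while c_max - c_min > eps:
        c = (c_min + c_max) / 2.0
        # Greedy coverage of the (submodular) truncated objective.
        A = []
        while F_avg(A, c) < c - 1e-12 and len(A) < len(V):
            A.append(max((s for s in V if s not in A),
                         key=lambda s: F_avg(A + [s], c)))
        if len(A) <= alpha * k:
            best, c_min = A, c     # c is achievable: try a larger value
        else:
            c_max = c              # c is too ambitious: lower it
    return best
```

On a toy instance with two scenario functions, the search settles on the largest achievable c and returns the corresponding greedy cover.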
• 109. Theoretical guarantees [Krause et al., NIPS ’07] Theorem: The problem max|A|≤k mini Fi(A) does not admit any approximation unless P = NP. Theorem: SATURATE finds a solution AS such that mini Fi(AS) ≥ OPTk and |AS| ≤ αk, where OPTk = max|A|≤k mini Fi(A) and α = 1 + log maxs ∑i Fi({s}). Theorem: If there were a polytime algorithm with a better factor β < α, then NP ⊆ DTIME(nlog log n).
• 110. Example: Lake monitoring Monitor pH values along a transect using a robotic sensor. Use a probabilistic model (Gaussian processes) to estimate the prediction error Var(s | A) at unobserved locations s from observations A. [Plot: true (hidden) pH values and predictions vs. position s along the transect.] Where should we sense to minimize our maximum error? The objective is (often) submodular [Das & Kempe ’08] → a robust submodular optimization problem!
• 111. Comparison with state of the art Algorithm used in geostatistics: simulated annealing [Sacks & Schiller ’88, van Groeningen & Stein ’98, Wiens ’05, …], with 7 parameters that need to be fine-tuned. [Plots: maximum marginal variance (lower is better) vs. number of sensors for Greedy, Simulated Annealing, and SATURATE, on environmental monitoring and precipitation data.] SATURATE is competitive and 10x faster, with no parameters to tune!
• 112. Results on water networks [Plot: maximum detection time in minutes (lower is better) vs. number of sensors on water network data. Greedy and Simulated Annealing show no decrease until all contaminations are detected; SATURATE is far lower.] 60% lower worst-case detection time!
• 113. Worst- vs. average case Given: set V, submodular functions F1,…,Fm. Average-case score Fac(A) = (1/m) ∑i Fi(A) — too optimistic? Worst-case score Fwc(A) = mini Fi(A) — very pessimistic! Want to optimize both average- and worst-case score. Can modify SATURATE to solve this problem: want Fac(A) ≥ cac and Fwc(A) ≥ cwc; truncate: min{Fac(A), cac} + min{Fwc(A), cwc} ≥ cac + cwc.
• 114. Worst- vs. average case [Plot: worst-case impact vs. average-case impact (lower is better on both axes) on water networks data. Optimizing only for the average case leaves high worst-case impact; optimizing only for the worst case leaves high average-case impact; SATURATE traces the tradeoff curve, with a clear knee.] Can find a good compromise between average- and worst-case score!
• 115. Constrained maximization: Outline Ingredients: utility function, selected set, selection cost, budget. Topics: subset selection; robust optimization; complex constraints.
• 116. Other aspects: Complex constraints maxA F(A) or maxA mini Fi(A), subject to constraints. So far: |A| ≤ k. In practice, more complex constraints arise: different costs, C(A) ≤ B; locations need to be connected by paths (lake monitoring) [Chekuri & Pal, FOCS ’05]; sensors need to communicate, forming a routing tree (building monitoring) [Singh et al., IJCAI ’07].
• 117. Non-constant cost functions For each s ∈ V, let c(s) > 0 be its cost (e.g., feature acquisition costs, …). Cost of a set: C(A) = ∑s∈A c(s) (a modular function!). Want to solve A* = argmax F(A) s.t. C(A) ≤ B. Cost-benefit greedy algorithm: start with A := ∅; while there is an s ∈ V∖A with C(A ∪ {s}) ≤ B, pick the affordable s* with the largest benefit-to-cost ratio (F(A ∪ {s}) − F(A)) / c(s); A := A ∪ {s*}.
• 118. Performance of cost-benefit greedy Want maxA F(A) s.t. C(A) ≤ 1.
Set A | F(A) | C(A)
{a} | 2ε | ε
{b} | 1 | 1
Cost-benefit greedy picks a (ratio 2 vs. 1 for b); then it cannot afford b! Cost-benefit greedy performs arbitrarily badly!
• 119. Cost-benefit optimization [Wolsey ’82, Sviridenko ’04, Leskovec et al. ’07] Theorem [Leskovec et al., KDD ’07]: Let ACB be the cost-benefit greedy solution and AUC the unit-cost greedy solution (i.e., ignoring costs). Then max{F(ACB), F(AUC)} ≥ ½(1 − 1/e) OPT. Can still compute online bounds and speed up using lazy evaluations. Note: can also get a (1 − 1/e) approximation in time O(n4) [Sviridenko ’04], and slightly better than ½(1 − 1/e) in O(n2) [Wolsey ’82].
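The theorem suggests a simple implementation: run both greedy variants and keep the better solution. A sketch of mine (function and parameter names are hypothetical; F is a monotone submodular callable on lists, `cost` a dict of per-element costs):

```python
def budgeted_greedy(V, F, cost, B):
    """Run unit-cost greedy and cost-benefit greedy under budget B;
    return the better solution. By [Leskovec et al. KDD '07],
    max{F(A_cb), F(A_uc)} >= (1/2)(1 - 1/e) * OPT."""
    def run(use_ratio):
        A, val = [], F([])
        while True:
            spent = sum(cost[x] for x in A)
            cand = [s for s in V if s not in A and spent + cost[s] <= B]
            if not cand:
                break
            s = max(cand, key=lambda s:
                    (F(A + [s]) - val) / (cost[s] if use_ratio else 1.0))
            A.append(s)
            val = F(A)
        return A, val

    A_uc, v_uc = run(use_ratio=False)
    A_cb, v_cb = run(use_ratio=True)
    return A_cb if v_cb >= v_uc else A_uc
```

On the slide-118 example (a: benefit 2ε, cost ε; b: benefit 1, cost 1; budget 1), cost-benefit greedy alone gets stuck at 2ε, but the combined rule recovers {b}.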
• 121. Water vs. Web Placing sensors in water networks vs. selecting informative blogs. In both problems we are given a graph with nodes (junctions / blogs) and edges (pipes / links), and cascades spreading dynamically over the graph (contamination / citations). Want to pick nodes to detect big cascades early. In both applications, the utility functions are submodular [generalizes Kempe et al., KDD ’03].
• 122. Performance on blog selection (~45k blogs) [Left plot: cascades captured (higher is better) vs. number of blogs; greedy outperforms heuristics based on in-links, all out-links, number of posts, and random selection. Right plot: running time in seconds (lower is better) vs. number of blogs selected; exhaustive search (all subsets) and naive greedy vs. fast greedy.] Outperforms state-of-the-art heuristics; 700x speedup using submodularity!
• 123. Cost of reading a blog Naïve approach: just pick the 10 best blogs. This selects big, well-known blogs (Instapundit, etc.), which contain many posts and take long to read! [Plot: cascades captured vs. Cost(A) = number of posts per day, comparing cost/benefit analysis with ignoring cost.] Cost-benefit optimization picks summarizer blogs!
• 124. Predicting the “hot” blogs Want blogs that will be informative in the future. Split the data set: train on historic data, test on future data. [Plot: cascades captured vs. Cost(A) = number of posts per day, for “cheating” (greedy on future, tested on future) vs. greedy on historic data, tested on future; inset shows detections per month Jan–May for Greedy vs. SATURATE.] Blog selection “overfits” the training data: it detects well on the training period but poorly afterwards — poor generalization! Why’s that? Let’s see what goes wrong. Want blogs that continue to do well!
• 125. Robust optimization Let Fi(A) = detections in time interval i (e.g., F1(A) = .5, F2(A) = .8, F3(A) = .6, F4(A) = .01, F5(A) = .02). [Plots: detections per month Jan–May.] The “overfit” greedy selection A detects many cascades early in the year and almost none later. Optimize the worst case over intervals using SATURATE to get a “robust” blog selection A* with evenly spread detections. Robust optimization ≈ regularization!
• 126. Predicting the “hot” blogs [Plot: cascades captured vs. Cost(A) = number of posts per day, for “cheating” (greedy on future, tested on future), the robust solution tested on future, and greedy on historic data tested on future.] 50% better generalization!
• 127. Other aspects: Complex constraints (skip) maxA F(A) or maxA mini Fi(A), subject to constraints. So far: |A| ≤ k. In practice, more complex constraints arise: different costs, C(A) ≤ B; locations need to be connected by paths (lake monitoring) [Chekuri & Pal, FOCS ’05]; sensors need to communicate, forming a routing tree (building monitoring) [Singh et al., IJCAI ’07].
• 128. Naïve approach: Greedy-connect Simple heuristic: greedily optimize the submodular utility function F(A), then add relay nodes to minimize communication cost C(A). [Floor-plan figure: the most informative nodes give high utility F(A) but very high communication cost C(A), while placements with efficient communication are not very informative.] Communication cost = expected number of transmission trials (learned using Gaussian processes). Want to find the optimal tradeoff between information and communication cost.
• 129. The pSPIEL Algorithm [Krause, Guestrin, Gupta, Kleinberg IPSN 2006] pSPIEL: efficient nonmyopic algorithm (padded Sensor Placements at Informative and cost-Effective Locations). 1) Decompose the sensing region into small, well-separated clusters C1,…,C4. 2) Solve a cardinality-constrained problem per cluster (greedy). 3) Combine the solutions using the k-MST algorithm.
• 130. Guarantees for pSPIEL [Krause, Guestrin, Gupta, Kleinberg IPSN 2006] Theorem: pSPIEL finds a tree T with submodular utility F(T) ≥ Ω(1) OPTF and communication cost C(T) ≤ O(log |V|) OPTC.
• 131. Proof of concept study Learned a model from a short deployment of 46 sensors at the Intelligent Workplace. Manually selected 20 sensors; used pSPIEL to place 12 and 19 sensors. Compared prediction accuracy over time on the initial deployment and the optimized placements.
• 132.–134. [Image-only slides.]
• 135. Proof of concept study (accuracy on 46 locations) pSPIEL improves the solution over intuitive manual placement: 50% better prediction and 20% less communication cost, or 20% better prediction and 40% less communication cost. Poor placements can hurt a lot! Good solutions can be unintuitive. [Plots: root mean squares error (Lux) and communication cost (ETX) for manual (M20) and pSPIEL (pS12, pS19) placements.]
• 136. Robust sensor placement [Krause, McMahan, Guestrin, Gupta ’07] What if the usage pattern changes? A placement optimal for old parameters θold may do poorly under θnew. Want a placement that does well under all possible parameters θ: maximize minθ Fθ(A). Unified view: robustness to change in parameters, robust experimental design, robustness to adversaries. Can use SATURATE for robust sensor placement!
• 137. Robust pSPIEL [Plots: RMS error (Lux), cost, and worst-case value for manual (M20), pSPIEL (pS19), and robust pSPIEL (RpS19) placements.] The robust placement is more intuitive, and still better than manual!
• 138. Tutorial Overview
Examples and properties of submodular functions: many problems are submodular (mutual information, influence, …); SFs are closed under positive linear combinations, but not under min and max.
Submodularity and convexity: every SF induces a convex function with the SAME minimum; special properties: greedy solves an LP over an exponential polytope.
Minimizing submodular functions: minimization is possible in polynomial time (but O(n8)…); Queyranne’s algorithm minimizes symmetric SFs in O(n3); useful for clustering, MAP inference, structure learning, …
Maximizing submodular functions: the greedy algorithm finds a near-optimal set of k elements; for more complex problems (robustness, constraints) greedy fails, but there still exist good algorithms (SATURATE, pSPIEL, …); can get online bounds, lazy evaluations, …; useful for feature selection, active learning, sensor placement, …
Extensions and research directions.
• 139. Extensions and research directions Carnegie Mellon
• 140. Learning submodular functions [Goemans, Harvey, Kleinberg, Mirrokni ’08] Pick m sets A1,…,Am and observe F(A1),…,F(Am). From this, want to approximate F by F’ s.t. 1/α ≤ F(A)/F’(A) ≤ α for all A. Theorem: Even if F is monotonic and we pick polynomially many Ai, chosen adaptively, we cannot approximate better than α = n½ / log(n) unless P = NP.
• 141. Sequential selection [Krause, Guestrin ’07] Thus far we assumed the submodular function F (model of environment) is known — a bad assumption: for pH data from the Merced river, we don’t know the correlations before we go. Active learning: simultaneous sensing (selection) and model (F) learning. Can use submodularity to analyze the exploration/exploitation tradeoff and obtain theoretical guarantees. [Plot: RMS error vs. number of observations for the a priori model vs. active learning.]
• 142. Online maximization of submodular functions [Golovin & Streeter ’07] At each time t, pick a set At, then observe SF Ft and receive reward rt = Ft(At); want to maximize the total reward ∑t rt. Theorem: Can efficiently choose A1,…,AT s.t. in expectation (1/T) ∑t Ft(At) ≥ (1/T)(1 − 1/e) max|A|≤k ∑t Ft(A) for any sequence Fi, as T → ∞. “Can asymptotically get ‘no-regret’ over the clairvoyant greedy.”
• 143. Game theoretic applications How can we fairly distribute a set V of “unsplittable” goods to m people? “Social welfare” problem: each person i has a submodular utility Fi(A). Want to partition V = A1 ∪ … ∪ Am to maximize F(A1,…,Am) = ∑i Fi(Ai). Theorem [Vondrak, STOC ’08]: Can get a (1 − 1/e) approximation!
• 144. Beyond Submodularity: Other notions Posimodularity? F(A) + F(B) ≥ F(A∖B) + F(B∖A) for all A, B; strictly generalizes symmetric submodular functions. Subadditive functions? F(A) + F(B) ≥ F(A ∪ B) for all A, B; strictly generalizes monotonic submodular functions. Crossing / intersecting submodularity? F(A) + F(B) ≥ F(A∪B) + F(A∩B) holds only for some sets A, B; submodular functions can be defined on arbitrary lattices. Bisubmodular functions? Set functions defined on pairs (A, A’) of disjoint sets with F(A,A’) + F(B,B’) ≥ F((A,A’)∨(B,B’)) + F((A,A’)∧(B,B’)). Discrete-convex analysis (L-convexity, M-convexity, …). Submodular flows…
• 145. Beyond submodularity: Non-submodular functions For F submodular and G supermodular, want A* = argminA F(A) + G(A). Example: −G(A) is the information gain for feature selection (Y = “Sick”, X1 = “MRI”, X2 = “ECG”); F(A) is the cost of computing features A, where “buying in bulk is cheaper”: F({X1,X2}) ≤ F({X1}) + F({X2}). In fact, any set function can be written this way!
• 146. An analogy For F submodular and G supermodular, want A* = argminA F(A) + G(A). Have seen: submodularity ~ convexity, supermodularity ~ concavity. Corresponding continuous problem: f convex, g concave, x* = argminx f(x) + g(x).
• 147. DC Programming / Convex Concave Procedure [Pham Dinh Tao ’85] x’ ← argmin f(x); while not converged do: 1) g’ ← linear upper bound of g, tight at x’; 2) x’ ← argmin f(x) + g’(x). Will converge to a local optimum; generalizes EM, … Clever idea [Narasimhan & Bilmes ’05]: also works for submodular and supermodular functions! Replace 1) by a “modular” upper bound, and 2) by submodular function minimization. Useful e.g. for discriminative structure learning! Many more details in their UAI ’05 paper.
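For tiny ground sets, this submodular-supermodular procedure can be sketched end to end. This is a simplified sketch of mine: the modular upper bound uses the standard chain construction (a tight modular lower bound of the submodular function −G, negated), and step 2’s submodular minimization is replaced by brute force over all subsets rather than a polynomial-time SFM routine:

```python
from itertools import combinations

def modular_upper_bound(G, V, A):
    """Tight modular upper bound of a supermodular G at set A:
    telescope G along a chain that lists the elements of A first."""
    order = list(A) + [s for s in V if s not in A]
    weights, prev, prev_val = {}, [], G(frozenset())
    for s in order:
        cur_val = G(frozenset(prev + [s]))
        weights[s] = cur_val - prev_val     # chain marginal of s
        prev.append(s)
        prev_val = cur_val
    g_empty = G(frozenset())
    return lambda B: g_empty + sum(weights[s] for s in B)

def submodular_supermodular(F, G, V, A=frozenset(), rounds=50):
    """Narasimhan & Bilmes-style iteration for min_A F(A) + G(A),
    with F submodular and G supermodular (brute-force SFM for tiny V)."""
    def all_subsets():
        return (frozenset(c) for r in range(len(V) + 1)
                for c in combinations(V, r))
    for _ in range(rounds):
        m = modular_upper_bound(G, V, A)
        # Minimize the submodular surrogate F + m (exhaustively here).
        A_new = min(all_subsets(), key=lambda B: F(B) + m(B))
        if F(A_new) + G(A_new) >= F(A) + G(A):
            break                           # no improvement: local optimum
        A = A_new
    return A
```

Since the surrogate upper-bounds the true objective and is tight at the current set, each accepted step strictly decreases F + G, so the procedure converges to a local optimum, mirroring the continuous convex-concave procedure.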
• 148. Structure in ML / AI problems ML, last 10 years: convexity — kernel machines, SVMs, GPs, MLE, … ML, “next 10 years”: submodularity — new structural properties. Structural insights help us solve challenging problems.
• 149. Open problems / directions Submodular optimization: improve on O(n8 log2 n) for minimization? Algorithms for constrained minimization of SFs? Extend results to more general notions (subadditive, …)? Applications to AI/ML: fast / near-optimal inference? Active learning? Structured prediction? Understanding generalization? Ranking? Utility / privacy? Lots of interesting open problems!
• 150. www.submodularity.org
Examples and properties of submodular functions: many problems are submodular (mutual information, influence, …); SFs are closed under positive linear combinations, but not under min and max.
Submodularity and convexity: every SF induces a convex function with the SAME minimum; special properties: greedy solves an LP over an exponential polytope.
Minimizing submodular functions: minimization is possible in polynomial time (but O(n8)…); Queyranne’s algorithm minimizes symmetric SFs in O(n3); useful for clustering, MAP inference, structure learning, …
Maximizing submodular functions: the greedy algorithm finds a near-optimal set of k elements; for more complex problems (robustness, constraints) greedy fails, but there still exist good algorithms (SATURATE, pSPIEL, …); can get online bounds, lazy evaluations, …; useful for feature selection, active learning, sensor placement, …
Extensions and research directions: sequential and online algorithms; optimizing non-submodular functions; …
Check out our Matlab toolbox! Functions include sfo_queyranne, sfo_min_norm_point, sfo_celf, sfo_sssp, sfo_greedy_splitting, sfo_greedy_lazy, sfo_saturate, sfo_max_dca_lazy.