Melding the Data-Decision Pipeline: Decision-Focused Learning for Combinatorial Optimization, from AAAI 2019.
I derive the math equations myself and arrive at the same results as the two CMU papers mentioned [Donti et al. 2017; Amos et al. 2017], applying the same derivation procedure.
Paper Study: Melding the Data-Decision Pipeline
1. Melding the Data-Decision Pipeline: Decision-Focused Learning for Combinatorial Optimization
Bryan Wilder, Bistra Dilkina and Milind Tambe
University of Southern California
AAAI 2019
2. Abstract
• Introduce a general framework for decision-focused learning, where
the machine learning model is directly trained in conjunction with the
optimization algorithm.
• Instantiate the framework for two broad classes of combinatorial
problems: linear programming and submodular maximization.
• Experiments show that the proposed method outperforms the traditional
two-stage method in terms of solution quality.
3. Introduction
• Machine learning: uses data to predict unknown quantities with the help of a loss function.
• Optimization algorithm: uses predictions to arrive at a decision that maximizes some objective.
• Training the model entirely separately from the optimization may result in bad decisions.
• The paper focuses on combinatorial optimization and proposes a decision-focused learning framework which integrates prediction and the optimization algorithm.
6. Implicit differentiation
• Example:
• We want to find the slope of the tangent line to the circle $x^2 + y^2 = 25$ at the point $(3, -4)$.
• One way to derive it:
• $y = -\sqrt{25 - x^2}$ (since $(3, -4)$ lies on the bottom semicircle)
• $\Rightarrow y' = -\frac{1}{2}(25 - x^2)^{-\frac{1}{2}} \cdot (-2x) = \frac{x}{\sqrt{25 - x^2}}$
• $m = y' = \frac{3}{\sqrt{25 - 3^2}} = \frac{3}{4}$
Source: https://www.math.ucdavis.edu/~kouba/CalcOneDIRECTORY/implicitdiffdirectory/ImplicitDiff.html
7. Implicit differentiation (cont’d)
• However, not every function can be explicitly written as a function of
another variable.
• In implicit differentiation, we differentiate each side of an equation with
two variables by treating one of the variables as a function of the other.
• Using implicit differentiation, we treat $y$ as an implicit function of $x$:
• $x^2 + y^2 = 25$
• $\Rightarrow 2x + 2y \frac{dy}{dx} = 0$
• $\Rightarrow y' = \frac{dy}{dx} = \frac{-2x}{2y} = \frac{-x}{y}$
• $m = y' = \frac{-x}{y} = \frac{-3}{-4} = \frac{3}{4}$
Source: https://www.khanacademy.org/math/ap-calculus-ab/ab-differentiation-2-new/ab-3-2/a/implicit-differentiation-review
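Both routes can be checked mechanically. A minimal sketch with SymPy (assuming it is installed; `idiff` is SymPy's implicit-differentiation helper):

```python
import sympy as sp

x, y = sp.symbols("x y")

# Explicit route: bottom semicircle y = -sqrt(25 - x^2), differentiated directly.
y_explicit = -sp.sqrt(25 - x**2)
slope_explicit = sp.diff(y_explicit, x).subs(x, 3)

# Implicit route: differentiate x^2 + y^2 - 25 = 0, treating y as a function of x.
circle = x**2 + y**2 - 25
slope_implicit = sp.idiff(circle, y, x).subs({x: 3, y: -4})

print(slope_explicit, slope_implicit)  # 3/4 3/4
```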
8. Lagrange Multiplier
• Consider the optimization problem
$\max f(x, y)$ subject to $g(x, y) = 0$
• Observing the graph, we find that at an optimum the contour of $f$ is tangent to the constraint, i.e.
$\nabla_{x,y} f(x, y) = -\lambda \nabla_{x,y} g(x, y) \qquad (1)$
where $\nabla_{x,y} f(x, y) = \left( \frac{\partial f(x,y)}{\partial x}, \frac{\partial f(x,y)}{\partial y} \right)^T$
• Let $\mathcal{L}(x, y, \lambda) = f(x, y) + \lambda g(x, y)$
• Solving $\nabla_{x,y,\lambda} \mathcal{L}(x, y, \lambda) = \mathbf{0}$ is equivalent to solving equation (1) together with the constraint $g(x, y) = 0$
[Figure: blue curves are contours of $f(x, y)$ with $d_1 > d_2 > d_3$; the red curve is the constraint $g(x, y) = c$]
Source: https://en.m.wikipedia.org/wiki/Lagrange_multiplier
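As a sanity check, the stationarity system $\nabla \mathcal{L} = \mathbf{0}$ can be solved symbolically; a small sketch on a toy problem of my own (not from the slides), again with SymPy:

```python
import sympy as sp

x, y, lam = sp.symbols("x y lam")

# Toy problem: maximize f(x, y) = x*y subject to g(x, y) = x + y - 4 = 0.
f = x * y
g = x + y - 4

L = f + lam * g                      # Lagrangian L(x, y, lambda)
stationary_points = sp.solve(
    [sp.diff(L, v) for v in (x, y, lam)], [x, y, lam], dict=True
)
print(stationary_points)  # [{x: 2, y: 2, lam: -2}] -> candidate optimum at (2, 2)
```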
10. KKT condition
• Consider the optimization problem
$\max f(\mathbf{x})$
subject to
$g_i(\mathbf{x}) \le 0$ for $i = 1, \ldots, m$,
$h_j(\mathbf{x}) = 0$ for $j = 1, \ldots, l$.
• If $\mathbf{x}^*$ is a local optimum, then there exist $\mu_i$ ($i = 1, \ldots, m$) and $\lambda_j$ ($j = 1, \ldots, l$) such that
• Stationarity
$\nabla f(\mathbf{x}^*) = \sum_{i=1}^{m} \mu_i \nabla g_i(\mathbf{x}^*) + \sum_{j=1}^{l} \lambda_j \nabla h_j(\mathbf{x}^*)$
• Primal feasibility
$g_i(\mathbf{x}^*) \le 0$ for $i = 1, \ldots, m$; $h_j(\mathbf{x}^*) = 0$ for $j = 1, \ldots, l$
• Dual feasibility
$\mu_i \ge 0$ for $i = 1, \ldots, m$
• Complementary slackness
$\mu_i g_i(\mathbf{x}^*) = 0$ for $i = 1, \ldots, m$
Source: https://en.m.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions
Source: https://www.cs.cmu.edu/~ggordon/10725-F12/slides/16-kkt.pdf
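A hedged numerical check of the four conditions on a toy instance of my own, using CVXPY (assuming it is installed) to recover the dual variable:

```python
import cvxpy as cp
import numpy as np

# Toy problem: maximize -(x1 - 1)^2 - (x2 - 2)^2 subject to x1 + x2 <= 2.
target = np.array([1.0, 2.0])
x = cp.Variable(2)
ineq = cp.sum(x) <= 2
prob = cp.Problem(cp.Maximize(-cp.sum_squares(x - target)), [ineq])
prob.solve()

x_star, mu = x.value, ineq.dual_value          # x* = [0.5, 1.5], mu = 1
grad_f = -2 * (x_star - target)                # gradient of f at x*
grad_g = np.ones(2)                            # gradient of g(x) = x1 + x2 - 2

print(np.allclose(grad_f, mu * grad_g))        # stationarity
print(mu >= 0)                                 # dual feasibility
print(np.isclose(mu * (x_star.sum() - 2), 0))  # complementary slackness
```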
11. Linear programming relaxation
• Example:
• In a 0-1 integer program, all variables are
• $x_i \in \{0, 1\}$
• After the relaxation,
• $x_i \in [0, 1]$
• The relaxation transforms an NP-hard
optimization problem into a problem
that can be solved in polynomial time.
Source: https://en.wikipedia.org/wiki/Linear_programming_relaxation
Source: https://en.wikipedia.org/wiki/Convex_hull
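A quick illustration on a toy 0-1 knapsack of my own, using SciPy's linprog (its `integrality` option needs SciPy >= 1.9); the relaxed optimum upper-bounds the integer one:

```python
import numpy as np
from scipy.optimize import linprog

values = np.array([6.0, 5.0, 4.0])        # maximize values @ x (linprog minimizes)
weights = np.array([[5.0, 4.0, 3.0]])     # subject to weights @ x <= 8
bounds = [(0, 1)] * 3

relaxed = linprog(-values, A_ub=weights, b_ub=[8.0], bounds=bounds, method="highs")
integer = linprog(-values, A_ub=weights, b_ub=[8.0], bounds=bounds, method="highs",
                  integrality=np.ones(3))  # x_i in {0, 1}

print(relaxed.x, -relaxed.fun)  # fractional x, objective 10.2
print(integer.x, -integer.fun)  # x = [1, 0, 1], objective 10.0
```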
13. Problem description
• Consider the combinatorial optimization problem
$\max_{x \in \mathcal{X}} f(x, \theta)$
where $\mathcal{X}$ is a discrete set containing all feasible solutions.
• Without loss of generality, $\mathcal{X} \subseteq \{0, 1\}^n$, and $x$ is a binary vector or decision vector.
• The objective $f$ depends on $\theta \in \Theta$. Consider $\theta$ unknown; it must be inferred from data.
• We observe a feature vector $y \in \mathcal{Y}$ which is correlated with $\theta$.
• Let $m: \mathcal{Y} \mapsto \Theta$ denote a model mapping observed features to parameters.
14. Problem description (cont’d)
• Use the training data $(y_1, \theta_1), \ldots, (y_N, \theta_N)$ drawn from a distribution $P$ to find the model $m$ (supervised manner).
• Define $x^*(\theta) = \arg\max_{x \in \mathcal{X}} f(x, \theta)$ to be the optimal $x$ for a given $\theta$.
• Objective:
$\max \mathbb{E}_{(y, \theta) \sim P}[f(x^*(m(y)), \theta)]$
• Example:
• $y$: user ratings of movies
• $\theta$: movie-actor assignments
• Predict which actors are associated with each movie.
15. • Classical solution (two-stage method)
1. Learn a model $m$ using a loss function:
$\min_{\omega} \mathbb{E}_{(y, \theta) \sim P}[\mathcal{L}(\theta, m(y, \omega))]$
2. Use the learned model to solve the optimization problem.
• Possible cons:
• The loss function does not consider how $\omega$ will affect the decision making (see the sketch below).
• Is it possible to do better?
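A schematic PyTorch sketch of this baseline (names and structure are my own, not the authors' code): the training signal is only the prediction loss, and the solver appears only at test time.

```python
import torch
import torch.nn.functional as F

def two_stage_step(model, optimizer, y, theta_true):
    """One training step of stage 1: fit m(y, omega) to theta with a standard loss."""
    theta_hat = model(y)                       # theta_hat = m(y, omega)
    loss = F.mse_loss(theta_hat, theta_true)   # L(theta, m(y, omega))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Stage 2 (test time only): x = solve(model(y_test)).
# The loss above never sees how errors in theta_hat change the downstream decision x.
```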
16. General framework
• $x^*(\theta) = \arg\max_{x \in \mathcal{X}} f(x, \theta)$
• $x^*$ is a decision from a binary set, which renders the output non-differentiable with respect to $\omega$.
• Consider the continuous relaxation of the original problem,
$x(\theta) = \arg\max_{x \in conv(\mathcal{X})} f(x, \theta)$
where $conv$ denotes the convex hull.
• Obtain a gradient by sampling a single $(y, \theta)$ from the training data:
$\frac{d f(x(\hat{\theta}), \theta)}{d\omega} = \frac{d f(x(\hat{\theta}), \theta)}{d x(\hat{\theta})} \frac{d x(\hat{\theta})}{d \hat{\theta}} \frac{d \hat{\theta}}{d \omega}$
where $\hat{\theta} = m(y, \omega)$, and the training objective is
$\max_{x \in conv(\mathcal{X})} f(x(\hat{\theta}), \theta) = \max_{x \in conv(\mathcal{X})} f(x(m(y, \omega)), \theta)$
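The chain rule above translates directly into a training loop, sketched here in PyTorch; `diff_solver` is a hypothetical stand-in for any differentiable argmax over $conv(\mathcal{X})$ (a concrete QP instance appears below under slide 24):

```python
import torch

def decision_focused_step(model, optimizer, diff_solver, f, y, theta_true):
    """One training step that backpropagates through the optimizer itself."""
    theta_hat = model(y)            # theta_hat = m(y, omega)
    x = diff_solver(theta_hat)      # x(theta_hat): differentiable argmax over conv(X)
    loss = -f(x, theta_true)        # maximize f(x(theta_hat), theta) under the true theta
    optimizer.zero_grad()
    loss.backward()                 # df/dx * dx/dtheta_hat * dtheta_hat/domega
    optimizer.step()
    return -loss.item()
```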
17. General framework (cont’d)
• $\frac{dx(\hat{\theta})}{d\hat{\theta}}$ measures how the optimal decision changes with respect to $\hat{\theta}$.
• For continuous problems, the optimal continuous decision must satisfy the KKT conditions.
• The constraint set is a convex hull, which can be represented as $\{x : Ax \le b\}$.
• Let $(x, \lambda)$ be the pair of primal and dual variables. Differentiating the stationarity condition $\nabla_x f(x, \hat{\theta}) - A^T \lambda = 0$ and the complementary slackness condition $\operatorname{diag}(\lambda)(Ax - b) = 0$ with respect to $\hat{\theta}$ yields the linear system
$\begin{bmatrix} \nabla_x^2 f(x, \hat{\theta}) & -A^T \\ \operatorname{diag}(\lambda) A & \operatorname{diag}(Ax - b) \end{bmatrix} \begin{bmatrix} \frac{dx}{d\hat{\theta}} \\ \frac{d\lambda}{d\hat{\theta}} \end{bmatrix} = \begin{bmatrix} -\frac{\partial \nabla_x f(x, \hat{\theta})}{\partial \hat{\theta}} \\ 0 \end{bmatrix}$
23. By solving this linear system, we can obtain the desired $\frac{dx}{d\hat{\theta}}$.
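To make the system concrete, here is a NumPy sketch for the regularized LP of the next slide, $\max \theta^T x - \gamma \|x\|_2^2$ s.t. $Ax \le b$, on a toy instance of my own; the $(x, \lambda)$ pair below is the true optimum for it, so the KKT conditions hold:

```python
import numpy as np

gamma = 1.0
theta = np.array([1.0, 2.0])
A = np.array([[1.0, 1.0]])   # single constraint: x1 + x2 <= 1
b = np.array([1.0])

x = np.array([0.25, 0.75])   # optimal primal (check: theta - 2*gamma*x - A.T @ lam = 0)
lam = np.array([0.5])        # optimal dual

n, m = len(x), len(b)
# Here grad_x f = theta - 2*gamma*x, so the Hessian is -2*gamma*I and
# d(grad_x f)/d(theta) is the identity.
lhs = np.block([[-2 * gamma * np.eye(n), -A.T],
                [np.diag(lam) @ A, np.diag(A @ x - b)]])
rhs = np.vstack([-np.eye(n), np.zeros((m, n))])

sol = np.linalg.solve(lhs, rhs)
dx_dtheta = sol[:n]          # the Jacobian dx/dtheta used in the chain rule
print(dx_dtheta)             # [[ 0.25 -0.25], [-0.25  0.25]]
```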
24. Linear programming
• Consider a linear program with equality and inequality constraints
$\max \theta^T x \quad \text{s.t.} \quad Ax = b, \; Gx \le h$
• Problem: $\nabla_x^2 f(x, \theta)$ is always zero, so the left-hand-side matrix of the linear system becomes singular.
• Solve the regularized problem instead:
$\max \theta^T x - \gamma \|x\|_2^2 \quad \text{s.t.} \quad Ax = b, \; Gx \le h$
• This transforms the LP into a quadratic program (QP), as sketched below.
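A hedged sketch of the resulting differentiable layer using qpth, the QP solver released with Amos et al. 2017 (the toy constraint set and tensor shapes are my assumptions):

```python
import torch
from qpth.qp import QPFunction

gamma, n = 0.1, 3
theta_hat = torch.tensor([0.2, 0.5, 0.3], requires_grad=True)  # from m(y, omega)

# max theta_hat' x - gamma * ||x||^2  s.t.  Gx <= h
# is equivalent to  min gamma * ||x||^2 - theta_hat' x  (a strongly convex QP).
Q = 2 * gamma * torch.eye(n)
p = -theta_hat
G = torch.cat([torch.ones(1, n), -torch.eye(n)])  # sum(x) <= 1 and x >= 0
h = torch.cat([torch.ones(1), torch.zeros(n)])
e = torch.empty(0)                                # no equality constraints

x = QPFunction(verbose=False)(Q, p, G, h, e, e)   # differentiable argmax x(theta_hat)
theta_true = torch.tensor([0.1, 0.7, 0.2])
decision_quality = x @ theta_true                 # f(x(theta_hat), theta), true theta
decision_quality.backward()                       # gradients reach theta_hat via the QP
print(theta_hat.grad)
```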
28. • All other terms can be derived from $(x, \lambda)$, which is the output of QP solvers.
29. Submodular maximization
• Consider the problem of maximizing a set function $f: 2^V \mapsto \mathbb{R}$, where $V$ is a ground set of items.
• A set function is submodular if it satisfies one of the following equivalent conditions (checked numerically in the sketch below):
• For every $A, B \subseteq V$ with $A \subseteq B$ and any $v \in V \setminus B$, we have
$f(A \cup \{v\}) - f(A) \ge f(B \cup \{v\}) - f(B)$.
• For every $A, B \subseteq V$, we have $f(A) + f(B) \ge f(A \cup B) + f(A \cap B)$.
• Focus on the cardinality-constrained optimization $\max_{|S| \le k} f(S)$.
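A brute-force check of the diminishing-returns condition on a tiny coverage function (a toy example of my own; coverage functions reappear under slide 31):

```python
from itertools import chain, combinations

COVERS = {"a": {1, 2}, "b": {2, 3}, "c": {3, 4}}  # action -> set of items it covers
V = sorted(COVERS)

def f(S):
    """f(S) = number of items covered by the actions in S."""
    return len(set().union(*(COVERS[v] for v in S))) if S else 0

def subsets(xs):
    return chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))

print(all(
    f(set(A) | {v}) - f(A) >= f(set(B) | {v}) - f(B)
    for B in subsets(V)
    for A in subsets(B)
    for v in V if v not in B
))  # True: coverage functions are submodular
```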
30. Submodular maximization (cont’d)
• View a set function as defined on the domain $\{0, 1\}^V$ (indicator view).
• The multilinear extension $F$ is defined on $[0, 1]^V$ (probability view):
$F(x) = \mathbb{E}[f(S)] = \sum_{S \subseteq V} f(S) \prod_{i \in S} x_i \prod_{i \notin S} (1 - x_i)$
where $x_i$ denotes the probability that item $i$ is independently included in $S$.
• Instead of solving $\max_{|S| \le k} f(S)$, we can solve
$\max_{x \in conv(\mathcal{X})} F(x)$
where $\mathcal{X} = \{x \in \{0, 1\}^V : \sum_i x_i \le k\}$
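For a small ground set the expectation can be computed exactly by summing over all $2^{|V|}$ subsets; a sketch using the same toy coverage function as above:

```python
import numpy as np
from itertools import chain, combinations

COVERS = {"a": {1, 2}, "b": {2, 3}, "c": {3, 4}}  # same toy coverage as above
V = sorted(COVERS)

def f(S):
    return len(set().union(*(COVERS[v] for v in S))) if S else 0

def multilinear(f, V, x):
    """F(x) = sum_S f(S) * prod_{i in S} x_i * prod_{i not in S} (1 - x_i)."""
    idx = range(len(V))
    total = 0.0
    for S in chain.from_iterable(combinations(idx, r) for r in range(len(V) + 1)):
        prob = np.prod([x[i] if i in S else 1 - x[i] for i in idx])
        total += f([V[i] for i in S]) * prob
    return total

print(multilinear(f, V, [0.5, 0.5, 0.5]))  # 2.5: expected coverage when each x_i = 1/2
```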
31. • The multilinear extension has a closed form for coverage functions.
• There is a set of items $U$, and each item $j \in U$ has a weight $w_j$.
• We choose from a set of actions $V$, and each action $a_i$ covers each item $j$ independently with probability $\theta_{ij}$:
$F(x, \theta) = \sum_{j \in U} w_j \Big( 1 - \prod_{i \in V} (1 - x_i \theta_{ij}) \Big)$
which specializes the general form $F(x) = \mathbb{E}[f(S)] = \sum_{S \subseteq V} f(S) \prod_{i \in S} x_i \prod_{i \notin S} (1 - x_i)$ above.
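The closed form in a few lines of NumPy (toy weights and probabilities of my choosing); it avoids the exponential sum of the general multilinear extension:

```python
import numpy as np

def coverage_extension(x, theta, w):
    """F(x, theta) = sum_j w_j * (1 - prod_i (1 - x_i * theta_ij))."""
    covered = 1 - np.prod(1 - x[:, None] * theta, axis=0)  # P(item j gets covered)
    return w @ covered

x = np.array([1.0, 0.5])                 # marginal probabilities over |V| = 2 actions
theta = np.array([[0.3, 0.0],
                  [0.6, 0.9]])           # theta[i, j] = P(action i covers item j)
w = np.array([1.0, 2.0])
print(coverage_extension(x, theta, w))   # 1.41
```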
35. Experiments
• For linear programming:
• Bipartite matching
• Feature vector: whether each word appeared in the paper.
• Objective: reconstruct the citation network.
• For submodular maximization:
• Budget allocation
• Model an advertiser's choice of how to divide a finite budget $k$ between a set of channels.
• Feature vector: the ground-truth $\theta$ passed through a DNN.
• Objective: the expected number of customers reached.
• Diverse recommendation
• Feature vector: user ratings of movies.
• Objective: predict which actors are associated with each movie.
36. Solution quality
• Quality: the objective value of a method's decision, evaluated using the true $\theta$.
(Legend in the results figures: NN2 = two-layer neural network, RF = random forest)
38. Conclusion
• Focus on combinatorial optimization and introduce a general
framework for decision-focused learning.
• Instantiate the framework for linear programming and submodular
maximization.
• Experiments show that the proposed method leads to better solution
quality, although it may lose some predictive accuracy.