Melding the Data-Decision Pipeline: Decision-Focused Learning for Combinatorial Optimization, from AAAI 2019.
I derive the math equations myself and arrive at the same results as the two CMU papers mentioned [Donti et al. 2017; Amos et al. 2017], applying the same derivation procedure.
Paper Study: Melding the Data-Decision Pipeline
1. Melding the Data-Decision Pipeline: Decision-Focused Learning for Combinatorial Optimization
Bryan Wilder, Bistra Dilkina and Milind Tambe
University of Southern California
AAAI 2019
2. Abstract
• Introduce a general framework for decision-focused learning, where
the machine learning model is directly trained in conjunction with the
optimization algorithm.
• Instantiate the framework for two broad classes of combinatorial
problems: linear programming and submodular maximization.
• Experiments show that the proposed method outperforms the traditional
two-stage method in terms of solution quality.
3. Introduction
• Machine learning: uses data to predict unknown quantities with the help of a loss function.
• Optimization algorithm: uses predictions to arrive at a decision that maximizes some objective.
• Training the model entirely separately from the optimization may result in bad decisions.
• The paper focuses on combinatorial optimization and proposes a decision-focused learning framework which integrates prediction and the optimization algorithm.
6. Implicit differentiation
• Example:
• We want to find the slope of the tangent line to the circle $x^2 + y^2 = 25$ at the point $(3, -4)$.
• One way to derive it:
• $y = -\sqrt{25 - x^2}$ (since $(3, -4)$ lies on the bottom semicircle)
• $\Rightarrow y' = -\frac{1}{2}(25 - x^2)^{-\frac{1}{2}} \cdot (-2x) = \frac{x}{\sqrt{25 - x^2}}$
• $m = y' = \frac{3}{\sqrt{25 - 3^2}} = \frac{3}{4}$
Source: https://www.math.ucdavis.edu/~kouba/CalcOneDIRECTORY/implicitdiffdirectory/ImplicitDiff.html
7. Implicit differentiation (cont’d)
• However, not every function can be explicitly written as a function of
another variable.
• In implicit differentiation, we differentiate each side of an equation with
two variables by treating one of the variables as a function of the other.
• Using implicit differentiation, we treat $y$ as an implicit function of $x$:
• $x^2 + y^2 = 25$
• $\Rightarrow 2x + 2y \frac{dy}{dx} = 0$
• $\Rightarrow y' = \frac{dy}{dx} = \frac{-2x}{2y} = \frac{-x}{y}$
• $m = y' = \frac{-x}{y} = \frac{-3}{-4} = \frac{3}{4}$
Source: https://www.khanacademy.org/math/ap-calculus-ab/ab-differentiation-2-new/ab-3-2/a/implicit-differentiation-review
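Both routes can be checked mechanically. A minimal sketch with SymPy (assuming it is installed; `idiff` is SymPy's implicit-differentiation helper):

```python
import sympy as sp

x, y = sp.symbols("x y")

# Explicit route: bottom semicircle y = -sqrt(25 - x^2), differentiated directly.
y_explicit = -sp.sqrt(25 - x**2)
slope_explicit = sp.diff(y_explicit, x).subs(x, 3)

# Implicit route: differentiate x^2 + y^2 - 25 = 0, treating y as a function of x.
circle = x**2 + y**2 - 25
slope_implicit = sp.idiff(circle, y, x).subs({x: 3, y: -4})

print(slope_explicit, slope_implicit)  # 3/4 3/4
```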
8. Lagrange Multiplier
• Consider the optimization problem
$\max f(x, y)$ subject to $g(x, y) = 0$
• Observing the graph, we find that at an optimum the contour of $f$ is tangent to the constraint, i.e.
$\nabla_{x,y} f(x, y) = -\lambda \nabla_{x,y} g(x, y) \qquad (1)$
where $\nabla_{x,y} f(x, y) = \left( \frac{\partial f(x,y)}{\partial x}, \frac{\partial f(x,y)}{\partial y} \right)^T$
• Let $\mathcal{L}(x, y, \lambda) = f(x, y) + \lambda g(x, y)$
• Solving $\nabla_{x,y,\lambda} \mathcal{L}(x, y, \lambda) = \mathbf{0}$ is equivalent to solving equation (1) together with the constraint $g(x, y) = 0$
[Figure: blue curves are contours of $f(x, y)$ with $d_1 > d_2 > d_3$; the red curve is the constraint $g(x, y) = c$]
Source: https://en.m.wikipedia.org/wiki/Lagrange_multiplier
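As a sanity check, the stationarity system $\nabla \mathcal{L} = \mathbf{0}$ can be solved symbolically; a small sketch on a toy problem of my own (not from the slides), again with SymPy:

```python
import sympy as sp

x, y, lam = sp.symbols("x y lam")

# Toy problem: maximize f(x, y) = x*y subject to g(x, y) = x + y - 4 = 0.
f = x * y
g = x + y - 4

L = f + lam * g                      # Lagrangian L(x, y, lambda)
stationary_points = sp.solve(
    [sp.diff(L, v) for v in (x, y, lam)], [x, y, lam], dict=True
)
print(stationary_points)  # [{x: 2, y: 2, lam: -2}] -> candidate optimum at (2, 2)
```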
10. KKT condition
• Consider the optimization problem
$\max f(\mathbf{x})$
subject to
$g_i(\mathbf{x}) \le 0$ for $i = 1, \ldots, m$,
$h_j(\mathbf{x}) = 0$ for $j = 1, \ldots, l$.
• If $\mathbf{x}^*$ is a local optimum, then there exist $\mu_i$ ($i = 1, \ldots, m$) and $\lambda_j$ ($j = 1, \ldots, l$) such that
• Stationarity
$\nabla f(\mathbf{x}^*) = \sum_{i=1}^{m} \mu_i \nabla g_i(\mathbf{x}^*) + \sum_{j=1}^{l} \lambda_j \nabla h_j(\mathbf{x}^*)$
• Primal feasibility
$g_i(\mathbf{x}^*) \le 0$ for $i = 1, \ldots, m$; $h_j(\mathbf{x}^*) = 0$ for $j = 1, \ldots, l$
• Dual feasibility
$\mu_i \ge 0$ for $i = 1, \ldots, m$
• Complementary slackness
$\mu_i g_i(\mathbf{x}^*) = 0$ for $i = 1, \ldots, m$
Source: https://en.m.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions
Source: https://www.cs.cmu.edu/~ggordon/10725-F12/slides/16-kkt.pdf
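A hedged numerical check of the four conditions on a toy instance of my own, using CVXPY (assuming it is installed) to recover the dual variable:

```python
import cvxpy as cp
import numpy as np

# Toy problem: maximize -(x1 - 1)^2 - (x2 - 2)^2 subject to x1 + x2 <= 2.
target = np.array([1.0, 2.0])
x = cp.Variable(2)
ineq = cp.sum(x) <= 2
prob = cp.Problem(cp.Maximize(-cp.sum_squares(x - target)), [ineq])
prob.solve()

x_star, mu = x.value, ineq.dual_value          # x* = [0.5, 1.5], mu = 1
grad_f = -2 * (x_star - target)                # gradient of f at x*
grad_g = np.ones(2)                            # gradient of g(x) = x1 + x2 - 2

print(np.allclose(grad_f, mu * grad_g))        # stationarity
print(mu >= 0)                                 # dual feasibility
print(np.isclose(mu * (x_star.sum() - 2), 0))  # complementary slackness
```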
11. Linear programming relaxation
• Example:
• In a 0-1 integer program, all variables are
• $x_i \in \{0, 1\}$
• After the relaxation,
• $x_i \in [0, 1]$
• The relaxation transforms an NP-hard
optimization problem into a problem
that can be solved in polynomial time.
Source: https://en.wikipedia.org/wiki/Linear_programming_relaxation
Source: https://en.wikipedia.org/wiki/Convex_hull
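A quick illustration on a toy 0-1 knapsack of my own, using SciPy's linprog (its `integrality` option needs SciPy >= 1.9); the relaxed optimum upper-bounds the integer one:

```python
import numpy as np
from scipy.optimize import linprog

values = np.array([6.0, 5.0, 4.0])        # maximize values @ x (linprog minimizes)
weights = np.array([[5.0, 4.0, 3.0]])     # subject to weights @ x <= 8
bounds = [(0, 1)] * 3

relaxed = linprog(-values, A_ub=weights, b_ub=[8.0], bounds=bounds, method="highs")
integer = linprog(-values, A_ub=weights, b_ub=[8.0], bounds=bounds, method="highs",
                  integrality=np.ones(3))  # x_i in {0, 1}

print(relaxed.x, -relaxed.fun)  # fractional x, objective 10.2
print(integer.x, -integer.fun)  # x = [1, 0, 1], objective 10.0
```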
13. Problem description
• Consider the combinatorial optimization problem
$\max_{x \in \mathcal{X}} f(x, \theta)$
where $\mathcal{X}$ is a discrete set containing all feasible solutions.
• Without loss of generality, $\mathcal{X} \subseteq \{0, 1\}^n$, and $x$ is a binary vector or decision vector.
• The objective $f$ depends on $\theta \in \Theta$. Consider $\theta$ unknown; it must be inferred from data.
• We observe a feature vector $y \in \mathcal{Y}$ which is correlated with $\theta$.
• Let $m: \mathcal{Y} \mapsto \Theta$ denote a model mapping observed features to parameters.
14. Problem description (cont’d)
• Use the training data $(y_1, \theta_1), \ldots, (y_N, \theta_N)$ drawn from a distribution $P$ to find the model $m$ (supervised manner).
• Define $x^*(\theta) = \arg\max_{x \in \mathcal{X}} f(x, \theta)$ to be the optimal $x$ for a given $\theta$.
• Objective:
$\max \mathbb{E}_{(y, \theta) \sim P}[f(x^*(m(y)), \theta)]$
• Example:
• $y$: user ratings of movies
• $\theta$: movie-actor assignments
• Predict which actors are associated with each movie.
15. • Classical solution (two-stage method)
1. Learn a model $m$ using a loss function:
$\min_{\omega} \mathbb{E}_{(y, \theta) \sim P}[\mathcal{L}(\theta, m(y, \omega))]$
2. Use the learned model to solve the optimization problem.
• Possible cons:
• The loss function does not consider how $\omega$ will affect the decision making (see the sketch below).
• Is it possible to do better?
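A schematic PyTorch sketch of this baseline (names and structure are my own, not the authors' code): the training signal is only the prediction loss, and the solver appears only at test time.

```python
import torch
import torch.nn.functional as F

def two_stage_step(model, optimizer, y, theta_true):
    """One training step of stage 1: fit m(y, omega) to theta with a standard loss."""
    theta_hat = model(y)                       # theta_hat = m(y, omega)
    loss = F.mse_loss(theta_hat, theta_true)   # L(theta, m(y, omega))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Stage 2 (test time only): x = solve(model(y_test)).
# The loss above never sees how errors in theta_hat change the downstream decision x.
```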
16. General framework
• $x^*(\theta) = \arg\max_{x \in \mathcal{X}} f(x, \theta)$
• $x^*$ is a decision from a binary set, which renders the output non-differentiable with respect to $\omega$.
• Consider the continuous relaxation of the original problem,
$x(\theta) = \arg\max_{x \in conv(\mathcal{X})} f(x, \theta)$
where $conv$ denotes the convex hull.
• Obtain a gradient by sampling a single $(y, \theta)$ from the training data:
$\frac{d f(x(\hat{\theta}), \theta)}{d\omega} = \frac{d f(x(\hat{\theta}), \theta)}{d x(\hat{\theta})} \frac{d x(\hat{\theta})}{d \hat{\theta}} \frac{d \hat{\theta}}{d \omega}$
where $\hat{\theta} = m(y, \omega)$, and the training objective is
$\max_{x \in conv(\mathcal{X})} f(x(\hat{\theta}), \theta) = \max_{x \in conv(\mathcal{X})} f(x(m(y, \omega)), \theta)$
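The chain rule above translates directly into a training loop, sketched here in PyTorch; `diff_solver` is a hypothetical stand-in for any differentiable argmax over $conv(\mathcal{X})$ (a concrete QP instance appears below under slide 24):

```python
import torch

def decision_focused_step(model, optimizer, diff_solver, f, y, theta_true):
    """One training step that backpropagates through the optimizer itself."""
    theta_hat = model(y)            # theta_hat = m(y, omega)
    x = diff_solver(theta_hat)      # x(theta_hat): differentiable argmax over conv(X)
    loss = -f(x, theta_true)        # maximize f(x(theta_hat), theta) under the true theta
    optimizer.zero_grad()
    loss.backward()                 # df/dx * dx/dtheta_hat * dtheta_hat/domega
    optimizer.step()
    return -loss.item()
```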
17. General framework (cont’d)
• $\frac{dx(\hat{\theta})}{d\hat{\theta}}$ measures how the optimal decision changes with respect to $\hat{\theta}$.
• For continuous problems, the optimal continuous decision must satisfy the KKT conditions.
• The constraint set is a convex hull, which can be represented as $\{x : Ax \le b\}$.
• Let $(x, \lambda)$ be the pair of primal and dual variables. Differentiating the stationarity condition $\nabla_x f(x, \hat{\theta}) - A^T \lambda = 0$ and the complementary slackness condition $\operatorname{diag}(\lambda)(Ax - b) = 0$ with respect to $\hat{\theta}$ yields the linear system
$\begin{bmatrix} \nabla_x^2 f(x, \hat{\theta}) & -A^T \\ \operatorname{diag}(\lambda) A & \operatorname{diag}(Ax - b) \end{bmatrix} \begin{bmatrix} \frac{dx}{d\hat{\theta}} \\ \frac{d\lambda}{d\hat{\theta}} \end{bmatrix} = \begin{bmatrix} -\frac{\partial \nabla_x f(x, \hat{\theta})}{\partial \hat{\theta}} \\ 0 \end{bmatrix}$
23. By solving this linear system, we can obtain the desired $\frac{dx}{d\hat{\theta}}$.
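To make the system concrete, here is a NumPy sketch for the regularized LP of the next slide, $\max \theta^T x - \gamma \|x\|_2^2$ s.t. $Ax \le b$, on a toy instance of my own; the $(x, \lambda)$ pair below is the true optimum for it, so the KKT conditions hold:

```python
import numpy as np

gamma = 1.0
theta = np.array([1.0, 2.0])
A = np.array([[1.0, 1.0]])   # single constraint: x1 + x2 <= 1
b = np.array([1.0])

x = np.array([0.25, 0.75])   # optimal primal (check: theta - 2*gamma*x - A.T @ lam = 0)
lam = np.array([0.5])        # optimal dual

n, m = len(x), len(b)
# Here grad_x f = theta - 2*gamma*x, so the Hessian is -2*gamma*I and
# d(grad_x f)/d(theta) is the identity.
lhs = np.block([[-2 * gamma * np.eye(n), -A.T],
                [np.diag(lam) @ A, np.diag(A @ x - b)]])
rhs = np.vstack([-np.eye(n), np.zeros((m, n))])

sol = np.linalg.solve(lhs, rhs)
dx_dtheta = sol[:n]          # the Jacobian dx/dtheta used in the chain rule
print(dx_dtheta)             # [[ 0.25 -0.25], [-0.25  0.25]]
```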
24. Linear programming
• Consider a linear program with equality and inequality constraints
$\max \theta^T x \quad \text{s.t.} \quad Ax = b, \; Gx \le h$
• Problem: $\nabla_x^2 f(x, \theta)$ is always zero, so the left-hand-side matrix of the linear system becomes singular.
• Solve the regularized problem instead:
$\max \theta^T x - \gamma \|x\|_2^2 \quad \text{s.t.} \quad Ax = b, \; Gx \le h$
• This transforms the LP into a quadratic program (QP), as sketched below.
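A hedged sketch of the resulting differentiable layer using qpth, the QP solver released with Amos et al. 2017 (the toy constraint set and tensor shapes are my assumptions):

```python
import torch
from qpth.qp import QPFunction

gamma, n = 0.1, 3
theta_hat = torch.tensor([0.2, 0.5, 0.3], requires_grad=True)  # from m(y, omega)

# max theta_hat' x - gamma * ||x||^2  s.t.  Gx <= h
# is equivalent to  min gamma * ||x||^2 - theta_hat' x  (a strongly convex QP).
Q = 2 * gamma * torch.eye(n)
p = -theta_hat
G = torch.cat([torch.ones(1, n), -torch.eye(n)])  # sum(x) <= 1 and x >= 0
h = torch.cat([torch.ones(1), torch.zeros(n)])
e = torch.empty(0)                                # no equality constraints

x = QPFunction(verbose=False)(Q, p, G, h, e, e)   # differentiable argmax x(theta_hat)
theta_true = torch.tensor([0.1, 0.7, 0.2])
decision_quality = x @ theta_true                 # f(x(theta_hat), theta), true theta
decision_quality.backward()                       # gradients reach theta_hat via the QP
print(theta_hat.grad)
```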
28. • All other terms can be derived from $(x, \lambda)$, which is the output of QP solvers.
29. Submodular maximization
• Consider the problem of maximizing a set function $f: 2^V \mapsto \mathbb{R}$, where $V$ is a ground set of items.
• A set function is submodular if it satisfies one of the following equivalent conditions (checked numerically in the sketch below):
• For every $A, B \subseteq V$ with $A \subseteq B$ and any $v \in V \setminus B$, we have
$f(A \cup \{v\}) - f(A) \ge f(B \cup \{v\}) - f(B)$.
• For every $A, B \subseteq V$, we have $f(A) + f(B) \ge f(A \cup B) + f(A \cap B)$.
• Focus on the cardinality-constrained optimization $\max_{|S| \le k} f(S)$.
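A brute-force check of the diminishing-returns condition on a tiny coverage function (a toy example of my own; coverage functions reappear under slide 31):

```python
from itertools import chain, combinations

COVERS = {"a": {1, 2}, "b": {2, 3}, "c": {3, 4}}  # action -> set of items it covers
V = sorted(COVERS)

def f(S):
    """f(S) = number of items covered by the actions in S."""
    return len(set().union(*(COVERS[v] for v in S))) if S else 0

def subsets(xs):
    return chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))

print(all(
    f(set(A) | {v}) - f(A) >= f(set(B) | {v}) - f(B)
    for B in subsets(V)
    for A in subsets(B)
    for v in V if v not in B
))  # True: coverage functions are submodular
```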
30. Submodular maximization (cont’d)
• View a set function as defined on the domain $\{0, 1\}^V$ (indicator view).
• The multilinear extension $F$ is defined on $[0, 1]^V$ (probability view):
$F(x) = \mathbb{E}[f(S)] = \sum_{S \subseteq V} f(S) \prod_{i \in S} x_i \prod_{i \notin S} (1 - x_i)$
where $x_i$ denotes the probability that item $i$ is independently included in $S$.
• Instead of solving $\max_{|S| \le k} f(S)$, we can solve
$\max_{x \in conv(\mathcal{X})} F(x)$
where $\mathcal{X} = \{x \in \{0, 1\}^V : \sum_i x_i \le k\}$
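For a small ground set the expectation can be computed exactly by summing over all $2^{|V|}$ subsets; a sketch using the same toy coverage function as above:

```python
import numpy as np
from itertools import chain, combinations

COVERS = {"a": {1, 2}, "b": {2, 3}, "c": {3, 4}}  # same toy coverage as above
V = sorted(COVERS)

def f(S):
    return len(set().union(*(COVERS[v] for v in S))) if S else 0

def multilinear(f, V, x):
    """F(x) = sum_S f(S) * prod_{i in S} x_i * prod_{i not in S} (1 - x_i)."""
    idx = range(len(V))
    total = 0.0
    for S in chain.from_iterable(combinations(idx, r) for r in range(len(V) + 1)):
        prob = np.prod([x[i] if i in S else 1 - x[i] for i in idx])
        total += f([V[i] for i in S]) * prob
    return total

print(multilinear(f, V, [0.5, 0.5, 0.5]))  # 2.5: expected coverage when each x_i = 1/2
```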
31. • The multilinear extension has a closed form for coverage functions.
• There is a set of items $U$, and each item $j \in U$ has a weight $w_j$.
• We choose from a set of actions $V$, and each action $a_i$ covers each item $j$ independently with probability $\theta_{ij}$:
$F(x, \theta) = \sum_{j \in U} w_j \Big( 1 - \prod_{i \in V} (1 - x_i \theta_{ij}) \Big)$
which specializes the general form $F(x) = \mathbb{E}[f(S)] = \sum_{S \subseteq V} f(S) \prod_{i \in S} x_i \prod_{i \notin S} (1 - x_i)$ above.
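The closed form in a few lines of NumPy (toy weights and probabilities of my choosing); it avoids the exponential sum of the general multilinear extension:

```python
import numpy as np

def coverage_extension(x, theta, w):
    """F(x, theta) = sum_j w_j * (1 - prod_i (1 - x_i * theta_ij))."""
    covered = 1 - np.prod(1 - x[:, None] * theta, axis=0)  # P(item j gets covered)
    return w @ covered

x = np.array([1.0, 0.5])                 # marginal probabilities over |V| = 2 actions
theta = np.array([[0.3, 0.0],
                  [0.6, 0.9]])           # theta[i, j] = P(action i covers item j)
w = np.array([1.0, 2.0])
print(coverage_extension(x, theta, w))   # 1.41
```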
35. Experiments
• For linear programming:
• Bipartite matching
• Feature vector: whether each word appeared in the paper.
• Objective: reconstruct the citation network.
• For submodular maximization:
• Budget allocation
• Model an advertiser's choice of how to divide a finite budget $k$ between a set of channels.
• Feature vector: the ground-truth $\theta$ passed through a DNN.
• Objective: the expected number of customers reached.
• Diverse recommendation
• Feature vector: user ratings of movies.
• Objective: predict which actors are associated with each movie.
36. Solution quality
• Quality: the objective value of a method's decision, evaluated using the true $\theta$.
(Legend in the results figures: NN2 = two-layer neural network, RF = random forest)
38. Conclusion
• Focus on combinatorial optimization and introduce a general
framework for decision-focused learning.
• Instantiate the framework for linear programming and submodular
maximization.
• Experiments show that the proposed method leads to better solution
quality, although it may lose some predictive accuracy.