CS592 Presentation #5
Sparse Additive Models
20173586 Jeongmin Cha
20174463 Jaesung Choe
20184144 Andries Bruno
Contents
1. Introduction
2. Notation and Assumptions
3. Sparse Backfitting
4. Choosing regularization parameter
5. Results
6. Discussion Points
1. Brief of Additive Models
1. Introduction
● Combine ideas from:
○ sparse linear models (sparsity constraint)
○ additive nonparametric regression (backfitting)
➝ Sparse Additive Models (SpAM)
1. Introduction
● SpAM ⋍ an additive nonparametric regression model
○ but with a sparsity constraint on the component functions
○ a functional version of the group lasso
● A nonparametric regression model relaxes the strong assumptions made by a linear model
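To make the combination concrete, here is a minimal LaTeX sketch of the two models being combined; the notation is the standard one for these models rather than a transcription of the slide's own equations.

% Sparse linear model (lasso): only a few coefficients beta_j are nonzero
Y = \sum_{j=1}^{p} \beta_j X_j + \varepsilon
% Additive nonparametric model: each covariate gets its own smooth function
Y = \sum_{j=1}^{p} f_j(X_j) + \varepsilon
% SpAM: the additive model above, with most components f_j identically zero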
1. Introduction
● The authors show two properties of the estimator:
● 1. Sparsistence (sparsity pattern consistency)
○ the SpAM backfitting algorithm recovers the correct sparsity pattern asymptotically
● 2. Persistence
○ the estimator is persistent: the predictive risk of the estimator converges
● Data Representation
● Additive Nonparametric model
2. Notation and Assumption
2. Notation and Assumption
● P = the joint distribution of (Xi, Yi)
● The L2(P) norm of a function f on [0, 1]: ‖f‖ = √E[f(X)²]
● On each dimension j, Hj denotes the Hilbert subspace of L2(P) of P-measurable functions fj(xj) with zero mean, E[fj(Xj)] = 0
● The Hilbert subspace has the inner product ⟨fj, f'j⟩ = E[fj(Xj) f'j(Xj)]
● H = H1 ⊕ ... ⊕ Hp : the Hilbert space of p-dimensional functions in the additive form f(x) = Σj fj(xj)
2. Notation and Assumption
● {ψjk} : a uniformly bounded, orthonormal basis on [0, 1]
● The one-dimensional function fj can be expanded in this basis: fj(xj) = Σk βjk ψjk(xj)
2. Notation and Assumption
3. Sparse Backfitting
#1. Formulate the population SpAM
Eq8. Standard form of the additive model optimization problem (the standard additive model).
Eq9. Penalized Lagrangian form (objective function).
Eq10. Design choice of SpAM: β (beta) is a scaling parameter and g is a function in the Hilbert space.
(Red) : the functional mapping (Xj ↦ g(Xj) ↦ β·g(Xj)).
(Green) : the coefficients β become sparse, as in the lasso.
Eq11. Penalized Lagrangian form of Eq10 and the sample version of Eq9: the sparse additive model (SpAM).
Ψ : basis functions that linearly span the function g, where q <= p.
Linearly dependent functions g are grouped through Ψ ➝ a functional version of the group lasso.
Solving the sampled objective yields a soft-thresholding step and a backfitting algorithm (Theorem 1), as sketched below.
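As a summary of the formulation above, the population SpAM objective can be sketched in LaTeX as follows; the reparameterization and constraints follow the usual presentation of SpAM and should be checked against the slides' Eq8-Eq11.

% Eq8/Eq9-style standard additive model (population version):
\min_{f_j \in \mathcal{H}_j}\; \mathbb{E}\Big(Y - \sum_{j=1}^{p} f_j(X_j)\Big)^2
% Eq10-style SpAM design choice: write f_j = \beta_j g_j with unit-norm g_j,
% and put an L1 constraint on the scalings \beta_j:
\min_{\beta,\, g}\; \mathbb{E}\Big(Y - \sum_{j=1}^{p} \beta_j g_j(X_j)\Big)^2
\quad \text{s.t.} \quad \sum_{j=1}^{p} |\beta_j| \le L, \qquad \mathbb{E}[g_j(X_j)^2] = 1, \qquad \mathbb{E}[g_j(X_j)] = 0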
3. Sparse Backfitting
#2. Theorem 1
From the penalized Lagrangian form, the minimizers (β*, g*) satisfy a soft-thresholded projection equation, where Pj denotes the projection of the residual Rj onto Hj and [·]+ denotes the positive part.
#2. Proof of Theorem 1
#2-1. The stationarity condition is obtained by setting the Fréchet directional derivative to zero.
#2-2. Using iterated expectations, the condition can be rewritten in terms of the conditional expectation of the residual given Xj.
#2-3. From this we obtain the stated equivalence.
#2-4. The soft-thresholding update for the function: only positive parts survive after thresholding.
Discussion point:
Why do you think Theorem 1 is important?
: Only positive parts survive after thresholding, so the estimated component functions become sparse (see the sketch below).
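A LaTeX sketch of the Theorem-1-style update discussed above; this is the standard soft-thresholding fixed point for SpAM, so the exact scaling of λ should be checked against the slide's equation.

% Residual for coordinate j and its projection onto H_j:
R_j = Y - \sum_{k \neq j} f_k(X_k), \qquad P_j = \mathbb{E}[\, R_j \mid X_j \,]
% Soft-thresholded fixed point for the minimizer f_j:
f_j = \Big[ 1 - \frac{\lambda}{\sqrt{\mathbb{E}[P_j^2]}} \Big]_+ P_j
% so f_j is set to zero whenever \sqrt{\mathbb{E}[P_j^2]} \le \lambda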
3. Sparse Backfitting
#3. Backfitting algorithm
- According to Theorem 1, each component is obtained by projecting the residual and soft-thresholding.
- In the sample version, the projection is estimated with a smoother (projection) matrix Sj.
- Flow of the backfitting algorithm: cycle over the coordinates, applying this update, until convergence (see the sketch below).
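Below is a minimal Python sketch of this backfitting flow, assuming a simple Nadaraya-Watson kernel smoother as a stand-in for the smoother matrix Sj; the function names (smoother, spam_backfit) and the bandwidth are illustrative choices, not the authors' implementation.

import numpy as np

def smoother(x, r, bandwidth=0.1):
    """Nadaraya-Watson kernel smoother: estimate E[r | x] at the sample points.
    A stand-in for the smoother matrix S_j; any linear smoother could be used."""
    d = x[:, None] - x[None, :]
    w = np.exp(-0.5 * (d / bandwidth) ** 2)
    w /= w.sum(axis=1, keepdims=True)
    return w @ r

def spam_backfit(X, y, lam, n_iter=50, bandwidth=0.1):
    """Sketch of SpAM backfitting: smooth the partial residual for each
    coordinate, soft-threshold its norm, and cycle until convergence."""
    n, p = X.shape
    f = np.zeros((n, p))          # fitted component functions at the sample points
    y_centered = y - y.mean()
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual for coordinate j
            r_j = y_centered - f.sum(axis=1) + f[:, j]
            # Project the residual onto H_j with the smoother
            p_j = smoother(X[:, j], r_j, bandwidth)
            # Estimate the norm of the projected residual
            s_j = np.sqrt(np.mean(p_j ** 2))
            # Soft-thresholding step from Theorem 1
            f_j = max(0.0, 1.0 - lam / s_j) * p_j if s_j > 0 else np.zeros(n)
            # Center the component so each f_j has mean zero
            f[:, j] = f_j - f_j.mean()
    return f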
3. Sparse Backfitting
#4. SpAM backfitting algorithm vs. coordinate descent (lasso)
The SpAM backfitting algorithm is the functional version of the coordinate descent algorithm for the lasso:
: functional mapping
: iterate through the coordinates
: smoothing matrix
Discussion point:
Now we understand that SpAM is a functional version of the group lasso. Is SpAM then always better than the lasso or the group lasso?
(Hint: the lasso is a linear model, and SpAM is ...?)
When there is non-linearity, SpAM can be effective.
For simplicity, think of SPLAM as a combination of SpAM (non-linearity) and the lasso (linearity): GAMSEL vs. SpAM vs. Lasso [1][2].
[1] Lou, Yin, et al. "Sparse Partially Linear Additive Models." Journal of Computational and Graphical Statistics 25.4 (2016): 1126-1140.
[2] Hastie, Trevor J. "Generalized Additive Models." Statistical Models in S. Routledge, 2017. 249-307.
4. Choosing the regularization parameter
#5. Risk estimation for the SpAM backfitting algorithm: the regularization parameter λ is chosen by minimizing an estimate of the predictive risk.
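A minimal sketch of how λ could be selected on a held-out split, reusing the spam_backfit sketch above; this validation-based rule is a generic stand-in for the risk estimate referenced on this slide, not the paper's estimator.

import numpy as np

def predict(f_train, X_train, X_val, bandwidth=0.1):
    """Predict on validation points by smoothing each fitted component
    from the training points (same kernel as in the spam_backfit sketch)."""
    preds = np.zeros(X_val.shape[0])
    for j in range(X_train.shape[1]):
        d = X_val[:, j][:, None] - X_train[:, j][None, :]
        w = np.exp(-0.5 * (d / bandwidth) ** 2)
        w /= w.sum(axis=1, keepdims=True)
        preds += w @ f_train[:, j]
    return preds

def choose_lambda(X_tr, y_tr, X_val, y_val, lambdas):
    """Fit the SpAM sketch for each candidate lambda and keep the one with
    the lowest validation mean squared error."""
    best_lam, best_risk = None, np.inf
    for lam in lambdas:
        f = spam_backfit(X_tr, y_tr, lam)               # from the sketch above
        preds = y_tr.mean() + predict(f, X_tr, X_val)   # add back the intercept
        risk = np.mean((y_val - preds) ** 2)
        if risk < best_risk:
            best_lam, best_risk = lam, risk
    return best_lam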
6.1. Synthetic Data
● Generate n = 150 samples from a 200-dimensional additive model with four relevant component functions.
● The remaining 196 features are irrelevant: their components are identically zero, and zero-mean Gaussian noise is added to the response.
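A sketch of such a data-generating process; the four nonzero component functions below are illustrative placeholders, not the exact functions used in the paper.

import numpy as np

def make_synthetic(n=150, p=200, sigma=1.0, seed=0):
    """Sparse additive data: only the first four components are nonzero.
    The component functions here are placeholders for illustration."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, p))
    f1 = np.sin(2 * np.pi * X[:, 0])
    f2 = (2 * X[:, 1] - 1) ** 2
    f3 = X[:, 2]
    f4 = np.exp(X[:, 3]) - np.e + 1
    y = f1 + f2 + f3 + f4 + sigma * rng.normal(size=n)
    return X, y

# Usage: X, y = make_synthetic(); f_hat = spam_backfit(X, y, lam=0.5)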
6.1. Synthetic Data
● Empirical probability of selecting the true four variables, as a function of the sample size n.
● The same thresholding phenomenon that was observed for the lasso is seen here.
6.2. Boston Housing
● There are 506 observations with 10 covariates.
● To explore the sparsistency properties of SpAM, 20 irrelevant variables are added:
○ ten are randomly drawn from Uniform(0, 1), and
○ the remaining ten are random permutations of the original 10 covariates.
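A short sketch of this augmentation step; the function name add_irrelevant is illustrative.

import numpy as np

def add_irrelevant(X, seed=0):
    """Append 20 irrelevant columns to X: 10 Uniform(0,1) draws and
    10 row-permuted copies of the original covariates."""
    rng = np.random.default_rng(seed)
    n, p = X.shape                      # e.g. 506 x 10 for Boston housing
    uniform_cols = rng.uniform(0.0, 1.0, size=(n, 10))
    permuted_cols = np.column_stack(
        [rng.permutation(X[:, j]) for j in range(p)]
    )
    return np.column_stack([X, uniform_cols, permuted_cols])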
6.2. Boston Housing
● SpAM identifies 6 of the nonzero components.
● Both types of irrelevant variables are correctly zeroed out.
6.3. SpAM for Spam
● The dataset consists of 3,065 emails, which serve as the training set.
● 57 attributes are available, all numeric.
● The attributes measure the percentage of specific words in an email and the average and maximum run lengths of uppercase letters.
● A subsample of 300 emails is drawn from the training set; the remainder is used as the test set.
6.3. SpAM for Spam
Best model
6.3. SpAM for Spam
The 33 selected variables cover 80% of the significant predictors.
6.4. Functional Sparse Coding
● Here we compare SpAM with the lasso on natural images.
● The problem setup is as follows:
● y is the data to be represented; X is an n×p matrix whose columns Xj are the vectors (codewords) to be learned. The L1 penalty encourages sparsity in the coefficients.
● Sparsity allows specialization of features and enforces capturing of the salient properties of the data.
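In LaTeX, the linear (lasso) sparse coding problem described above can be sketched as follows; this is the standard form implied by the bullets rather than the slide's exact equation.

% Linear sparse coding of a signal y with learned codewords X_j (columns of X):
\min_{\beta,\, X}\; \| y - X\beta \|_2^2 \;+\; \lambda \sum_{j=1}^{p} |\beta_j|
% the L1 penalty encourages a sparse coefficient vector \beta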
6.4. Functional Sparse Coding
● When solved with the lasso and SGD, 200 codewords are learned that capture edge features at different scales and spatial orientations.
6.4. Functional Sparse Coding
● In the functional version, no assumption of linearity is made between X and y; instead, an additive model over the codewords is used.
● This leads to a functional (SpAM-style) optimization problem, sketched below.
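One way to write the functional version is sketched here; the SpAM-style penalty on component norms is an assumption about the exact objective, which appears as an equation on the slide.

% Functional sparse coding: replace the linear map X\beta by an additive model
% over the codeword coordinates, with a SpAM-style penalty on component norms:
\min_{f_1, \dots, f_p}\; \Big\| y - \sum_{j=1}^{p} f_j(X_j) \Big\|_2^2
\;+\; \lambda \sum_{j=1}^{p} \sqrt{ \tfrac{1}{n} \sum_{i=1}^{n} f_j(X_{ij})^2 }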
6.4. Functional Sparse Coding
● Which model is the lasso and which is SpAM?
6.4. Functional Sparse Coding
● What about expressiveness?
6.4. Functional Sparse Coding
● The sparse linear model uses 8 codewords, while the functional version uses 7 with a lower residual sum of squares (RSS).
● Also, the linear and nonlinear versions learn different codewords.
7. Discussion Points (1)
● As the authors note, SpAM is essentially a functional version of the grouped lasso. Are there formulations for functional versions of other methods, e.g. ridge or the fused lasso? Finding a generalized functional version of the lasso family would be an interesting problem.
○ Functional logistic regression with a fused lasso penalty (FLR-FLP)
7. Discussion Points (1)
● Objective function = FLR loss + lasso penalty + fused lasso penalty (see the sketch below)
● FLR loss
● gamma = the coefficients of the functional parameters
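A sketch of what such an FLR-FLP objective could look like, with gamma the coefficient vector of the functional parameter; this is the generic fused lasso form, not necessarily the exact formulation in the referenced work.

% Generic fused-lasso-penalized objective for coefficients \gamma:
\min_{\gamma}\; \mathcal{L}_{\mathrm{FLR}}(\gamma)
\;+\; \lambda_1 \sum_{k} |\gamma_k|
\;+\; \lambda_2 \sum_{k \ge 2} |\gamma_k - \gamma_{k-1}|
% \mathcal{L}_{FLR}: functional logistic regression loss;
% the \lambda_1 term encourages sparsity, the \lambda_2 term
% encourages adjacent coefficients to stay close (fusion).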
7. Discussion Points (2)
● How might we handle group sparsity in additive models (GroupSpAM), by analogy with the group lasso?
○ First, assume G is a partition of {1, ..., p} and that the groups do not overlap.
○ The optimization problem then takes the additive-model loss with a group-level penalty.
● The regularization term becomes a sum of groupwise norms, as sketched below.
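One plausible form of such a group penalty, written by analogy with the group lasso; the exact groupwise weights may differ from the GroupSpAM formulation.

% Groupwise penalty on the component functions, so that all components
% in a group g are zeroed out together:
\min_{f}\; \mathbb{E}\Big(Y - \sum_{j=1}^{p} f_j(X_j)\Big)^2
\;+\; \lambda \sum_{g \in G} \sqrt{ \textstyle\sum_{j \in g} \mathbb{E}[\, f_j(X_j)^2 \,] }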
7. Discussion Points (3)
● What effect do you think smoothing has on the functions?
○ It turns out smoothing has a direct connection to the bias-variance tradeoff.
● Suppose we make our estimates too smooth. What may we expect then?
○ If our estimates are too smooth, we risk bias.
■ We make erroneous assumptions about the underlying functions.
■ In this case we miss relevant relations between features and targets; thus we underfit.
● What if we make our estimates too rough? What may we expect then?
○ We risk variance. What does this mean?
● The learned model becomes sensitive to small variations in the data; thus we overfit.
We must keep a balance between bias and variance by choosing an appropriate level of smoothing (see the sketch below).
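A toy Python illustration of this tradeoff with a kernel smoother; the bandwidths and the sine-curve target are illustrative choices, not taken from the paper.

import numpy as np

def kernel_fit(x_train, y_train, x_eval, bandwidth):
    """Nadaraya-Watson kernel regression: large bandwidths oversmooth
    (bias, underfitting), small bandwidths undersmooth (variance, overfitting)."""
    d = x_eval[:, None] - x_train[None, :]
    w = np.exp(-0.5 * (d / bandwidth) ** 2)
    w /= w.sum(axis=1, keepdims=True)
    return w @ y_train

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=100)
grid = np.linspace(0, 1, 200)
truth = np.sin(2 * np.pi * grid)

for h in (0.5, 0.05, 0.005):          # too smooth, about right, too rough
    err = np.mean((kernel_fit(x, y, grid, h) - truth) ** 2)
    print(f"bandwidth={h:>6}: mean squared error vs truth = {err:.3f}")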
7. Discussion Points (4)
● Some notes on practicality
With modern computing power, can you think of situations where a linear sparsity-inducing model such as the lasso may be preferred over sparse additive models?
● Our data analysis is guided by a credible scientific theory which asserts linear relationships among the variables we measure.
● Our data set is so massive that either the extra processing time or the extra computer memory needed to fit and store an additive rather than a linear model is prohibitive.
Thank you for listening
More Related Content

What's hot

Polynomial Kernel for Interval Vertex Deletion
Polynomial Kernel for Interval Vertex DeletionPolynomial Kernel for Interval Vertex Deletion
Polynomial Kernel for Interval Vertex DeletionAkankshaAgrawal55
 
Electrical Engineering Exam Help
Electrical Engineering Exam HelpElectrical Engineering Exam Help
Electrical Engineering Exam HelpLive Exam Helper
 
Time Series Analysis in Cryptocurrency Markets: The "Bitcoin Brothers" (Paper...
Time Series Analysis in Cryptocurrency Markets: The "Bitcoin Brothers" (Paper...Time Series Analysis in Cryptocurrency Markets: The "Bitcoin Brothers" (Paper...
Time Series Analysis in Cryptocurrency Markets: The "Bitcoin Brothers" (Paper...Keita Makino
 
Adomian Decomposition Method for Certain Space-Time Fractional Partial Differ...
Adomian Decomposition Method for Certain Space-Time Fractional Partial Differ...Adomian Decomposition Method for Certain Space-Time Fractional Partial Differ...
Adomian Decomposition Method for Certain Space-Time Fractional Partial Differ...IOSR Journals
 
Kernel for Chordal Vertex Deletion
Kernel for Chordal Vertex DeletionKernel for Chordal Vertex Deletion
Kernel for Chordal Vertex DeletionAkankshaAgrawal55
 
Position analysis and dimensional synthesis
Position analysis and dimensional synthesisPosition analysis and dimensional synthesis
Position analysis and dimensional synthesisPreetshah1212
 
Polylogarithmic approximation algorithm for weighted F-deletion problems
Polylogarithmic approximation algorithm for weighted F-deletion problemsPolylogarithmic approximation algorithm for weighted F-deletion problems
Polylogarithmic approximation algorithm for weighted F-deletion problemsAkankshaAgrawal55
 
Basics of Integration and Derivatives
Basics of Integration and DerivativesBasics of Integration and Derivatives
Basics of Integration and DerivativesFaisal Waqar
 
Colloquium presentation
Colloquium presentationColloquium presentation
Colloquium presentationbgeron
 
2015 CMS Winter Meeting Poster
2015 CMS Winter Meeting Poster2015 CMS Winter Meeting Poster
2015 CMS Winter Meeting PosterChelsea Battell
 
computervision project
computervision projectcomputervision project
computervision projectLianli Liu
 

What's hot (20)

Polynomial Kernel for Interval Vertex Deletion
Polynomial Kernel for Interval Vertex DeletionPolynomial Kernel for Interval Vertex Deletion
Polynomial Kernel for Interval Vertex Deletion
 
Electrical Engineering Exam Help
Electrical Engineering Exam HelpElectrical Engineering Exam Help
Electrical Engineering Exam Help
 
[Download] Rev-Chapter-3
[Download] Rev-Chapter-3[Download] Rev-Chapter-3
[Download] Rev-Chapter-3
 
Phase Responce of Pole zero
Phase Responce of Pole zeroPhase Responce of Pole zero
Phase Responce of Pole zero
 
Oct.22nd.Presentation.Final
Oct.22nd.Presentation.FinalOct.22nd.Presentation.Final
Oct.22nd.Presentation.Final
 
Time Series Analysis in Cryptocurrency Markets: The "Bitcoin Brothers" (Paper...
Time Series Analysis in Cryptocurrency Markets: The "Bitcoin Brothers" (Paper...Time Series Analysis in Cryptocurrency Markets: The "Bitcoin Brothers" (Paper...
Time Series Analysis in Cryptocurrency Markets: The "Bitcoin Brothers" (Paper...
 
Filter Designing
Filter DesigningFilter Designing
Filter Designing
 
Adomian Decomposition Method for Certain Space-Time Fractional Partial Differ...
Adomian Decomposition Method for Certain Space-Time Fractional Partial Differ...Adomian Decomposition Method for Certain Space-Time Fractional Partial Differ...
Adomian Decomposition Method for Certain Space-Time Fractional Partial Differ...
 
Kernel for Chordal Vertex Deletion
Kernel for Chordal Vertex DeletionKernel for Chordal Vertex Deletion
Kernel for Chordal Vertex Deletion
 
Position analysis and dimensional synthesis
Position analysis and dimensional synthesisPosition analysis and dimensional synthesis
Position analysis and dimensional synthesis
 
Polylogarithmic approximation algorithm for weighted F-deletion problems
Polylogarithmic approximation algorithm for weighted F-deletion problemsPolylogarithmic approximation algorithm for weighted F-deletion problems
Polylogarithmic approximation algorithm for weighted F-deletion problems
 
Computer Science Assignment Help
Computer Science Assignment HelpComputer Science Assignment Help
Computer Science Assignment Help
 
1406
14061406
1406
 
Basics of Integration and Derivatives
Basics of Integration and DerivativesBasics of Integration and Derivatives
Basics of Integration and Derivatives
 
Colloquium presentation
Colloquium presentationColloquium presentation
Colloquium presentation
 
2015 CMS Winter Meeting Poster
2015 CMS Winter Meeting Poster2015 CMS Winter Meeting Poster
2015 CMS Winter Meeting Poster
 
Digital Signal Processing Homework Help
Digital Signal Processing Homework HelpDigital Signal Processing Homework Help
Digital Signal Processing Homework Help
 
Fine Grained Complexity
Fine Grained ComplexityFine Grained Complexity
Fine Grained Complexity
 
computervision project
computervision projectcomputervision project
computervision project
 
Guarding Polygons via CSP
Guarding Polygons via CSPGuarding Polygons via CSP
Guarding Polygons via CSP
 

Similar to Sparse Additive Models (SPAM)

CSCI 2033 Elementary Computational Linear Algebra(Spring 20.docx
CSCI 2033 Elementary Computational Linear Algebra(Spring 20.docxCSCI 2033 Elementary Computational Linear Algebra(Spring 20.docx
CSCI 2033 Elementary Computational Linear Algebra(Spring 20.docxmydrynan
 
Forecasting Default Probabilities in Emerging Markets and Dynamical Regula...
Forecasting Default Probabilities  in Emerging Markets and   Dynamical Regula...Forecasting Default Probabilities  in Emerging Markets and   Dynamical Regula...
Forecasting Default Probabilities in Emerging Markets and Dynamical Regula...SSA KPI
 
(Slides) Efficient Evaluation Methods of Elementary Functions Suitable for SI...
(Slides) Efficient Evaluation Methods of Elementary Functions Suitable for SI...(Slides) Efficient Evaluation Methods of Elementary Functions Suitable for SI...
(Slides) Efficient Evaluation Methods of Elementary Functions Suitable for SI...Naoki Shibata
 
Building 3D Morphable Models from 2D Images
Building 3D Morphable Models from 2D ImagesBuilding 3D Morphable Models from 2D Images
Building 3D Morphable Models from 2D ImagesShanglin Yang
 
GlobalLogic Machine Learning Webinar “Advanced Statistical Methods for Linear...
GlobalLogic Machine Learning Webinar “Advanced Statistical Methods for Linear...GlobalLogic Machine Learning Webinar “Advanced Statistical Methods for Linear...
GlobalLogic Machine Learning Webinar “Advanced Statistical Methods for Linear...GlobalLogic Ukraine
 
Memory Polynomial Based Adaptive Digital Predistorter
Memory Polynomial Based Adaptive Digital PredistorterMemory Polynomial Based Adaptive Digital Predistorter
Memory Polynomial Based Adaptive Digital PredistorterIJERA Editor
 
A New Method For Solving Kinematics Model Of An RA-02
A New Method For Solving Kinematics Model Of An RA-02A New Method For Solving Kinematics Model Of An RA-02
A New Method For Solving Kinematics Model Of An RA-02IJERA Editor
 
Generalized Linear Models in Spark MLlib and SparkR by Xiangrui Meng
Generalized Linear Models in Spark MLlib and SparkR by Xiangrui MengGeneralized Linear Models in Spark MLlib and SparkR by Xiangrui Meng
Generalized Linear Models in Spark MLlib and SparkR by Xiangrui MengSpark Summit
 
Generalized Linear Models in Spark MLlib and SparkR
Generalized Linear Models in Spark MLlib and SparkRGeneralized Linear Models in Spark MLlib and SparkR
Generalized Linear Models in Spark MLlib and SparkRDatabricks
 
5.2 Least Squares Linear Regression.pptx
5.2  Least Squares Linear Regression.pptx5.2  Least Squares Linear Regression.pptx
5.2 Least Squares Linear Regression.pptxMaiEllahham1
 
Forecasting day ahead power prices in germany using fixed size least squares ...
Forecasting day ahead power prices in germany using fixed size least squares ...Forecasting day ahead power prices in germany using fixed size least squares ...
Forecasting day ahead power prices in germany using fixed size least squares ...Niklas Ignell
 
Data fitting in Scilab - Tutorial
Data fitting in Scilab - TutorialData fitting in Scilab - Tutorial
Data fitting in Scilab - TutorialScilab
 
On the Performance of the Pareto Set Pursuing (PSP) Method for Mixed-Variable...
On the Performance of the Pareto Set Pursuing (PSP) Method for Mixed-Variable...On the Performance of the Pareto Set Pursuing (PSP) Method for Mixed-Variable...
On the Performance of the Pareto Set Pursuing (PSP) Method for Mixed-Variable...Amir Ziai
 

Similar to Sparse Additive Models (SPAM) (20)

CSCI 2033 Elementary Computational Linear Algebra(Spring 20.docx
CSCI 2033 Elementary Computational Linear Algebra(Spring 20.docxCSCI 2033 Elementary Computational Linear Algebra(Spring 20.docx
CSCI 2033 Elementary Computational Linear Algebra(Spring 20.docx
 
Forecasting Default Probabilities in Emerging Markets and Dynamical Regula...
Forecasting Default Probabilities  in Emerging Markets and   Dynamical Regula...Forecasting Default Probabilities  in Emerging Markets and   Dynamical Regula...
Forecasting Default Probabilities in Emerging Markets and Dynamical Regula...
 
1108.1170
1108.11701108.1170
1108.1170
 
(Slides) Efficient Evaluation Methods of Elementary Functions Suitable for SI...
(Slides) Efficient Evaluation Methods of Elementary Functions Suitable for SI...(Slides) Efficient Evaluation Methods of Elementary Functions Suitable for SI...
(Slides) Efficient Evaluation Methods of Elementary Functions Suitable for SI...
 
Building 3D Morphable Models from 2D Images
Building 3D Morphable Models from 2D ImagesBuilding 3D Morphable Models from 2D Images
Building 3D Morphable Models from 2D Images
 
Computation Assignment Help
Computation Assignment Help Computation Assignment Help
Computation Assignment Help
 
Group Project
Group ProjectGroup Project
Group Project
 
GlobalLogic Machine Learning Webinar “Advanced Statistical Methods for Linear...
GlobalLogic Machine Learning Webinar “Advanced Statistical Methods for Linear...GlobalLogic Machine Learning Webinar “Advanced Statistical Methods for Linear...
GlobalLogic Machine Learning Webinar “Advanced Statistical Methods for Linear...
 
BPstudy sklearn 20180925
BPstudy sklearn 20180925BPstudy sklearn 20180925
BPstudy sklearn 20180925
 
Memory Polynomial Based Adaptive Digital Predistorter
Memory Polynomial Based Adaptive Digital PredistorterMemory Polynomial Based Adaptive Digital Predistorter
Memory Polynomial Based Adaptive Digital Predistorter
 
A New Method For Solving Kinematics Model Of An RA-02
A New Method For Solving Kinematics Model Of An RA-02A New Method For Solving Kinematics Model Of An RA-02
A New Method For Solving Kinematics Model Of An RA-02
 
Generalized Linear Models in Spark MLlib and SparkR by Xiangrui Meng
Generalized Linear Models in Spark MLlib and SparkR by Xiangrui MengGeneralized Linear Models in Spark MLlib and SparkR by Xiangrui Meng
Generalized Linear Models in Spark MLlib and SparkR by Xiangrui Meng
 
Generalized Linear Models in Spark MLlib and SparkR
Generalized Linear Models in Spark MLlib and SparkRGeneralized Linear Models in Spark MLlib and SparkR
Generalized Linear Models in Spark MLlib and SparkR
 
MUMS: Transition & SPUQ Workshop - Gradient-Free Construction of Active Subsp...
MUMS: Transition & SPUQ Workshop - Gradient-Free Construction of Active Subsp...MUMS: Transition & SPUQ Workshop - Gradient-Free Construction of Active Subsp...
MUMS: Transition & SPUQ Workshop - Gradient-Free Construction of Active Subsp...
 
5.2 Least Squares Linear Regression.pptx
5.2  Least Squares Linear Regression.pptx5.2  Least Squares Linear Regression.pptx
5.2 Least Squares Linear Regression.pptx
 
Bob
BobBob
Bob
 
Repair dagstuhl jan2017
Repair dagstuhl jan2017Repair dagstuhl jan2017
Repair dagstuhl jan2017
 
Forecasting day ahead power prices in germany using fixed size least squares ...
Forecasting day ahead power prices in germany using fixed size least squares ...Forecasting day ahead power prices in germany using fixed size least squares ...
Forecasting day ahead power prices in germany using fixed size least squares ...
 
Data fitting in Scilab - Tutorial
Data fitting in Scilab - TutorialData fitting in Scilab - Tutorial
Data fitting in Scilab - Tutorial
 
On the Performance of the Pareto Set Pursuing (PSP) Method for Mixed-Variable...
On the Performance of the Pareto Set Pursuing (PSP) Method for Mixed-Variable...On the Performance of the Pareto Set Pursuing (PSP) Method for Mixed-Variable...
On the Performance of the Pareto Set Pursuing (PSP) Method for Mixed-Variable...
 

More from Jeongmin Cha

차정민 (소프트웨어 엔지니어) 이력서 + 경력기술서
차정민 (소프트웨어 엔지니어) 이력서 + 경력기술서차정민 (소프트웨어 엔지니어) 이력서 + 경력기술서
차정민 (소프트웨어 엔지니어) 이력서 + 경력기술서Jeongmin Cha
 
Causal Effect Inference with Deep Latent-Variable Models
Causal Effect Inference with Deep Latent-Variable ModelsCausal Effect Inference with Deep Latent-Variable Models
Causal Effect Inference with Deep Latent-Variable ModelsJeongmin Cha
 
Composing graphical models with neural networks for structured representatio...
Composing graphical models with  neural networks for structured representatio...Composing graphical models with  neural networks for structured representatio...
Composing graphical models with neural networks for structured representatio...Jeongmin Cha
 
Waterful Application (iOS + AppleWatch)
Waterful Application (iOS + AppleWatch)Waterful Application (iOS + AppleWatch)
Waterful Application (iOS + AppleWatch)Jeongmin Cha
 
시스템 프로그램 설계 2 최종발표 (차정민, 조경재)
시스템 프로그램 설계 2 최종발표 (차정민, 조경재)시스템 프로그램 설계 2 최종발표 (차정민, 조경재)
시스템 프로그램 설계 2 최종발표 (차정민, 조경재)Jeongmin Cha
 
시스템 프로그램 설계1 최종발표
시스템 프로그램 설계1 최종발표시스템 프로그램 설계1 최종발표
시스템 프로그램 설계1 최종발표Jeongmin Cha
 
마이크로프로세서 응용(2013-2)
마이크로프로세서 응용(2013-2)마이크로프로세서 응용(2013-2)
마이크로프로세서 응용(2013-2)Jeongmin Cha
 
최종발표
최종발표최종발표
최종발표Jeongmin Cha
 

More from Jeongmin Cha (8)

차정민 (소프트웨어 엔지니어) 이력서 + 경력기술서
차정민 (소프트웨어 엔지니어) 이력서 + 경력기술서차정민 (소프트웨어 엔지니어) 이력서 + 경력기술서
차정민 (소프트웨어 엔지니어) 이력서 + 경력기술서
 
Causal Effect Inference with Deep Latent-Variable Models
Causal Effect Inference with Deep Latent-Variable ModelsCausal Effect Inference with Deep Latent-Variable Models
Causal Effect Inference with Deep Latent-Variable Models
 
Composing graphical models with neural networks for structured representatio...
Composing graphical models with  neural networks for structured representatio...Composing graphical models with  neural networks for structured representatio...
Composing graphical models with neural networks for structured representatio...
 
Waterful Application (iOS + AppleWatch)
Waterful Application (iOS + AppleWatch)Waterful Application (iOS + AppleWatch)
Waterful Application (iOS + AppleWatch)
 
시스템 프로그램 설계 2 최종발표 (차정민, 조경재)
시스템 프로그램 설계 2 최종발표 (차정민, 조경재)시스템 프로그램 설계 2 최종발표 (차정민, 조경재)
시스템 프로그램 설계 2 최종발표 (차정민, 조경재)
 
시스템 프로그램 설계1 최종발표
시스템 프로그램 설계1 최종발표시스템 프로그램 설계1 최종발표
시스템 프로그램 설계1 최종발표
 
마이크로프로세서 응용(2013-2)
마이크로프로세서 응용(2013-2)마이크로프로세서 응용(2013-2)
마이크로프로세서 응용(2013-2)
 
최종발표
최종발표최종발표
최종발표
 

Recently uploaded

Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
Sparse Additive Models (SPAM)

  • 1. CS592 Presentation #5 Sparse Additive Models 20173586 Jeongmin Cha 20174463 Jaesung Choe 20184144 Andries Bruno 1
  • 2. Contents 1. Introduction 2. Notation and Assumptions 3. Sparse Backfitting 4. Choosing regularization parameter 5. Results 6. Discussion Points
  • 3. 1. Brief of Additive Models
  • 4. 1. Brief of Additive Models
  • 5. 1. Brief of Additive Models
  • 6. 1. Brief of Additive Models
  • 7. 1. Brief of Additive Models
  • 8. 1. Introduction ● Combine ideas from Sparse Linear Models Additive Nonparametric regression Sparse Additive Models (SpAM) Backfitting sparsity constraint
  • 9. 1. Introduction ● SpAM ⋍ additive nonparametric regression model ○ but, + sparsity constraint on ○ functional version of group lasso ● Nonparametric regression model relaxes the strong assumptions made by a linear model
  • 10. 1. Introduction ● The authors show the estimator of ● 1. Sparsistence (Sparsity pattern consistency) ○ SpAM backfitting algorithm recovers the correct sparsity pattern asymptotically ● 2. Persistence ○ the estimator is persistent, predictive risk of estimator converges
  • 11. Contents 1. Introduction 2. Notation and Assumptions 3. Sparse Backfitting 4. Choosing regularization parameter 5. Results 6. Discussion Points
  • 12. ● Data Representation ● Additive Nonparametric model 2. Notation and Assumption
  • 13. 2. Notation and Assumption ● P = the joint distribution of ( Xi , Yi ) ● The definition of L2 (P) norm (f on [0, 1]):
  • 14. ● On each dimension, ● hilbert subspace of L2 (P) of P-measurable functions ● zero mean ● The hilbert subspace has the inner product ● hilbert space of dimensional functions in the additive form 2. Notation and Assumption
  • 15. ● uniformly bounded, orthonormal basis on [0,1] ● The dimensional function 2. Notation and Assumption
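To make the basis assumption concrete, here is a minimal numpy sketch of a uniformly bounded, orthonormal basis on [0, 1] (the cosine basis) and of a truncated expansion of one additive component. The helper names (cosine_basis, truncated_component) are made up for this illustration and are not taken from the slides or the paper.

    import numpy as np

    def cosine_basis(x, num_basis):
        # psi_1(x) = 1 and psi_k(x) = sqrt(2) * cos((k - 1) * pi * x) for k >= 2:
        # orthonormal on [0, 1] under the uniform measure, uniformly bounded by sqrt(2).
        k = np.arange(num_basis)
        Psi = np.sqrt(2.0) * np.cos(np.pi * k * x[:, None])
        Psi[:, 0] = 1.0
        return Psi  # shape (len(x), num_basis)

    def truncated_component(x, beta):
        # f_j(x) approximated by sum_k beta_jk * psi_k(x): a truncated expansion
        # of one additive component; beta holds the coefficients for that component.
        return cosine_basis(x, len(beta)) @ beta

Roughly speaking, the smoothness of a component in this parameterization is governed by how quickly its coefficients decay.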
  • 16. Contents 1. Introduction 2. Notation and Assumptions 3. Sparse Backfitting 4. Choosing regularization parameter 5. Results 6. Discussion Points
  • 17. #1. Formulate to the population SpAM 3. Sparse Backfitting Eq8. standard form of additive model optimization problem Eq9. Penalized Lagrangian form (objective function)
  • 18. #1. Formulate to the population SpAM 3. Sparse Backfitting Eq8. standard form of additive model optimization problem Eq9. Penalized Lagrangian form (objective function) Additive model optimization problem
  • 19. #1. Formulate to the population SpAM Eq8. standard form of additive model optimization problem Eq9. Penalized Lagrangian form (objective function) Eq10. Design choice of SpAM. (beta) is the scaling parameter and (function g) is a function in the Hilbert space. 3. Sparse Backfitting Sparse additive model optimization problem
  • 20. #1. Formulate to the population SpAM Eq8. standard form of additive model optimization problem Eq9. Penalized Lagrangian form (objective function) Eq10. Design choice of SpAM. (beta) is the scaling parameter and (function g) is a function in the Hilbert space. 3. Sparse Backfitting (Red): functional mapping (Xj ↦ g(Xj) ↦ β*g(Xj)) (Green): coefficients β would become sparse. : Lasso Sparse additive model optimization problem
  • 21. #1. Formulate to the population SpAM Eq8. standard form of additive model optimization problem Eq9. Penalized Lagrangian form (objective function) Eq10. Design choice of SpAM. (beta) is the scaling parameter and (function g) is a function in the Hilbert space. 3. Sparse Backfitting Eq11. Penalized Lagrangian form of Eq10 and sample version of Eq9. : Lasso Sparse(!!) additive model
  • 22. #1. Formulate to the population SpAM Eq8. standard form of additive model optimization problem Eq9. Penalized Lagrangian form (objective function) Eq10. Design choice of SpAM. (beta) is the scaling parameter and (function g) is a function in the Hilbert space. 3. Sparse Backfitting Ψ: basis functions that linearly span the function g, where q <= p. The functions g are grouped through their basis Ψ ➝ functional version of the group lasso. Eq11. Penalized Lagrangian form of Eq10 and sample version of Eq9. : Lasso
  • 23. #1. Formulate to the population SpAM Eq8. standard form of additive model optimization problem Eq9. Penalized Lagrangian form (objective function) Eq10. Design choice of SpAM. (beta) is the scaling parameter and (function g) is a function in the Hilbert space. Standard additive model 3. Sparse Backfitting Ψ: basis functions that linearly span the function g, where q <= p. The functions g are grouped through their basis Ψ ➝ functional version of the group lasso. Eq11. Penalized Lagrangian form of Eq10 and sample version of Eq9. : Lasso Sparse additive model (SpAM) Sampled!
  • 24. #1. Formulate to the population SpAM Eq8. standard form of additive model optimization problem Eq9. Penalized Lagrangian form (objective function) Eq10. Design choice of SpAM. (beta) is the scaling parameter and (function g) is a function in the Hilbert space. Standard additive model 3. Sparse Backfitting Ψ: basis functions that linearly span the function g, where q <= p. The functions g are grouped through their basis Ψ ➝ functional version of the group lasso. Eq11. Penalized Lagrangian form of Eq10 and sample version of Eq9. : Lasso Sparse additive model (SpAM) Soft thresholding
  • 25. #1. Formulate to the population SpAM Eq8. standard form of additive model optimization problem Eq9. Penalized Lagrangian form (objective function) Eq10. Design choice of SpAM. (beta) is the scaling parameter and (function g) is a function in the Hilbert space. Standard additive model 3. Sparse Backfitting Ψ: basis functions that linearly span the function g, where q <= p. The functions g are grouped through their basis Ψ ➝ functional version of the group lasso. Eq11. Penalized Lagrangian form of Eq10 and sample version of Eq9. : Lasso Sparse additive model (SpAM) Backfitting algorithm
  • 26. #1. Formulate to the population SpAM Eq8. standard form of additive model optimization problem Eq9. Penalized Lagrangian form (objective function) Eq10. Design choice of SpAM. (beta) is the scaling parameter and (function g) is a function in the Hilbert space. Standard additive model 3. Sparse Backfitting Ψ: basis functions that linearly span the function g, where q <= p. The functions g are grouped through their basis Ψ ➝ functional version of the group lasso. Eq11. Penalized Lagrangian form of Eq10 and sample version of Eq9. : Lasso Sparse additive model (SpAM) Backfitting algorithm (Theorem 1)
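As a purely illustrative sketch of the sample objective in the spirit of Eq. 10/Eq. 11 (the function name spam_objective and the matrix layout F are assumptions made here, not notation from the paper): the loss is the squared error of the additive fit, and each component is charged by its empirical L2 norm, which is what makes this a functional analogue of the group lasso.

    import numpy as np

    def spam_objective(y, F, lam):
        # F is an n x p matrix whose (i, j) entry is f_j(x_ij), the j-th fitted
        # component evaluated at the i-th sample.
        residual = y - F.sum(axis=1)                               # y_i - sum_j f_j(x_ij)
        loss = 0.5 * np.mean(residual ** 2)                        # squared-error part
        penalty = lam * np.sum(np.sqrt(np.mean(F ** 2, axis=0)))   # sum_j of empirical ||f_j||
        return loss + penalty

Because the penalty charges each component by its whole norm, a component is either kept or zeroed out as a block, just as a group is in the group lasso.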
  • 27. 3. Sparse Backfitting #2. Theorem 1 From the penalized Lagrangian form,
  • 28. 3. Sparse Backfitting #2. Theorem 1 says From the penalized Lagrangian form, the minimizers ( ) satisfy where denotes the projection matrix, represents the residual matrix, and means the positive part.
  • 29. #2. Proof of Theorem 1 (show : ) 3. Sparse Backfitting
  • 30. #2. Proof of Theorem 1 (show : ) 3. Sparse Backfitting #2-1. Stationary condition is obtained by setting to zero the Frechet directional derivative.
  • 31. #2. Proof of Theorem 1 (show : ) 3. Sparse Backfitting #2-1. Stationary condition is obtained by setting to zero the Frechet directional derivative. #2-2. Using iterated expectations, the above condition can be re-written as that is,
  • 32. #2. Proof of Theorem 1 (show : ) 3. Sparse Backfitting #2-1. Stationary condition is obtained by setting to zero the Frechet directional derivative. #2-2. Using iterated expectations, the above condition can be re-written as that is, #2-3. We can obtain the equivalence ( )
  • 33. #2. Proof of Theorem 1 (show : ) 3. Sparse Backfitting #2-1. Stationary condition is obtained by setting to zero the Frechet directional derivative. #2-2. Using iterated expectations, the above condition can be re-written as that is, #2-3. We can obtain the equivalence ( ) #2-4. The soft thresholding update for function Only positive parts survive after thresholding.
  • 34. #2. Proof of Theorem 1 (show : ) 3. Sparse Backfitting #2-1. Stationary condition is obtained by setting to zero the Frechet directional derivative. #2-2. Using iterated expectations, the above condition can be re-written as that is, #2-3. We can obtain the equivalence ( ) #2-4. The soft thresholding update for function Only positive parts survive after thresholding. Discussion points: Why do you think theorem 1 is important ??
  • 35. #2. Proof of Theorem 1 (show : ) 3. Sparse Backfitting #2-1. Stationary condition is obtained by setting to zero the Frechet directional derivative. #2-2. Using iterated expectations, the above condition can be re-written as that is, #2-3. We can obtain the equivalence ( ) #2-4. The soft thresholding update for function Only positive parts survive after thresholding. Discussion points: Why do you think theorem 1 is important ?? : Only positive parts survive such that function becomes sparse.
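A minimal sketch of the soft-thresholding step the slides attribute to Theorem 1, written at the sample level. The name soft_threshold_component is hypothetical, and the smoothed partial residual P_j (the sample analogue of the projection of the residual onto X_j) is assumed to be given.

    import numpy as np

    def soft_threshold_component(P_j, lam):
        # P_j: smoothed partial residual for coordinate j (one value per sample).
        # Its empirical norm stands in for sqrt(E[P_j^2]); when that norm falls
        # below lam the whole component is set to zero, since only the positive
        # part of (1 - lam / norm) survives. That is exactly where sparsity comes from.
        s_j = np.sqrt(np.mean(P_j ** 2))
        if s_j <= lam:
            return np.zeros_like(P_j)
        return (1.0 - lam / s_j) * P_j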
  • 36. 3. Sparse Backfitting #3. Backfitting algorithm - According to theorem 1, - Estimate smoother projection matrix where - Flow of the backfitting algorithm
  • 37. 3. Sparse Backfitting #4. SpAM Backfitting algorithm Coordinate descent algorithm (lasso) SpAM backfitting algorithm is the functional version of the coordinate descent algorithm. : Functional mapping. : Iterate through coordinate : Smoothing matrix
  • 38. 3. Sparse Backfitting #4. SpAM Backfitting algorithm Coordinate descent algorithm (lasso) SpAM backfitting algorithm is the functional version of the coordinate descent algorithm. : Functional mapping. : Iterate through coordinate : Smoothing matrix Discussion points: Now, we understand that SpAM is the functional version of group lasso. Is SpAM then always better than lasso or group lasso?
  • 39. 3. Sparse Backfitting #4. SpAM Backfitting algorithm Coordinate descent algorithm (lasso) SpAM backfitting algorithm is the functional version of the coordinate descent algorithm. : Functional mapping. : Iterate through coordinate : Smoothing matrix Discussion points: Now, we understand that SpAM is the functional version of group lasso. Is SpAM then always better than lasso or group lasso? (Hint: lasso is a linear model, and SpAM is …? )
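Putting the pieces together, here is a minimal, self-contained sketch of a SpAM-style backfitting loop: smooth the partial residual on each coordinate, soft-threshold the whole component, and cycle until convergence. The Nadaraya-Watson smoother stands in for the smoother matrix S_j, and all names (kernel_smooth, spam_backfitting, the bandwidth default) are illustrative choices, not the paper's implementation.

    import numpy as np

    def kernel_smooth(x, r, bandwidth=0.1):
        # Nadaraya-Watson estimate of E[r | x] at the sample points (stand-in for S_j).
        d = (x[:, None] - x[None, :]) / bandwidth
        w = np.exp(-0.5 * d ** 2)
        return (w @ r) / w.sum(axis=1)

    def spam_backfitting(X, y, lam, bandwidth=0.1, n_iter=20):
        n, p = X.shape
        F = np.zeros((n, p))                 # F[:, j] holds the fitted f_j at the samples
        y_centered = y - y.mean()
        for _ in range(n_iter):
            for j in range(p):
                R_j = y_centered - (F.sum(axis=1) - F[:, j])   # partial residual
                P_j = kernel_smooth(X[:, j], R_j, bandwidth)   # smooth it: P_j = S_j R_j
                s_j = np.sqrt(np.mean(P_j ** 2))               # empirical norm of P_j
                shrink = max(0.0, 1.0 - lam / s_j) if s_j > 0 else 0.0
                F[:, j] = shrink * P_j                         # soft-threshold the component
                F[:, j] -= F[:, j].mean()                      # keep each component centered
        return F

With lam = 0 this reduces to ordinary backfitting for an additive model; increasing lam zeroes out more components.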
  • 40. For simplicity, let's think of SPLAM as the combination of SpAM (non-linearity) and Lasso (linearity). 3. Sparse Backfitting [1] Lou, Yin, et al. "Sparse partially linear additive models." Journal of Computational and Graphical Statistics 25.4 (2016): 1126-1140. GAMSEL: vs SpAM: vs Lasso: [2] Hastie, Trevor J. "Generalized additive models." Statistical models in S. Routledge, 2017. 249-307.
  • 41. For simplicity, let's think of SPLAM as the combination of SpAM (non-linearity) and Lasso (linearity). 3. Sparse Backfitting [1] Lou, Yin, et al. "Sparse partially linear additive models." Journal of Computational and Graphical Statistics 25.4 (2016): 1126-1140. GAMSEL: vs SpAM: vs Lasso: [2] Hastie, Trevor J. "Generalized additive models." Statistical models in S. Routledge, 2017. 249-307. Discussion points: Now, we understand that SpAM is the functional version of group lasso. Is SpAM then always better than lasso or group lasso? (Hint: lasso is a linear model, and SpAM is …? ) When there is non-linearity, SpAM can be effective.
  • 42. Contents 1. Introduction 2. Notation and Assumptions 3. Sparse Backfitting 4. Choosing regularization parameter 5. Results 6. Discussion Points
  • 43. SpAM backfitting algorithm #5. Risk estimation
  • 44. Contents 1. Introduction 2. Notation and Assumptions 3. Sparse Backfitting 4. Choosing regularization parameter 5. Results 6. Discussion Points
  • 45. 6.1. Synthetic Data ● Generate 150 samples from a 200-dimensional additive model ● The remaining 196 component functions are irrelevant and set to zero; zero-mean Gaussian noise is added to the response.
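A possible way to generate data of this shape (150 samples, 200 features, 4 relevant additive components) is sketched below; the particular nonlinear functions and the noise level are illustrative stand-ins, not necessarily the ones used in the paper's experiment.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 150, 200

    X = rng.uniform(0.0, 1.0, size=(n, p))
    signal = (np.sin(2.0 * np.pi * X[:, 0])      # four nonzero components (illustrative)
              + (X[:, 1] - 0.5) ** 2
              + X[:, 2]
              + np.exp(-X[:, 3]))
    y = signal + rng.normal(0.0, 1.0, size=n)    # the remaining 196 components are zero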
  • 46. 6.1. Synthetic Data ● Empirical probability of selecting the true four variables as a function of the sample size n. The same thresholding phenomenon that was observed for the lasso appears here.
  • 47. 6.1. Synthetic Data ● Empirical probability of selecting the true four variables as a function of the sample size n
  • 48. 6.2. Boston Housing ● There are 506 observations with 10 covariates. ● To explore the sparsistency properties of SpAM, 20 irrelevant variables are added. ● Ten of those are randomly drawn from Uniform(0, 1) ● The remainder are permutations of the original 10 covariates.
  • 49. 6.2. Boston Housing ● SpAM identifies 6 of the nonzero components. ● Both types of irrelevant variables are correctly zeroed out.
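The augmentation described on the previous slides can be reproduced along the following lines; augment_with_irrelevant is a hypothetical helper, and it assumes the original design matrix has the 10 covariates as columns.

    import numpy as np

    def augment_with_irrelevant(X, rng=None):
        # Appends 20 irrelevant columns: 10 drawn i.i.d. from Uniform(0, 1) and
        # 10 random permutations of the original covariates (destroying any
        # relationship with the response while keeping the marginals).
        rng = np.random.default_rng() if rng is None else rng
        n, p = X.shape                      # p = 10 for the Boston housing setup
        uniform_cols = rng.uniform(0.0, 1.0, size=(n, p))
        permuted_cols = np.column_stack([rng.permutation(X[:, j]) for j in range(p)])
        return np.hstack([X, uniform_cols, permuted_cols])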
  • 50. 6.3. SpAM for Spam ● The dataset consists of 3,065 emails, which serve as the training set. ● 57 attributes are available; all of them are numeric. ● Attributes measure the percentage of specific words in an email, and the average and maximum run lengths of uppercase letters. ● Sample 300 emails from the training set and use the remainder as the test set.
  • 51. 6.3. SpAM for Spam Best model
  • 52. 6.3. SpAM for Spam The 33 selected variables cover 80% of the significant predictors.
  • 53. 6.3. Functional Sparse Coding ● Here we compare SpAM with lasso. We consider natural images. ● The problem setup is as follows: ● y is the data to be represented. X is an n×p matrix whose columns X_j are the vectors to be learned. The L1 penalty encourages sparsity in the coefficients.
  • 54. 6.3. Functional Sparse Coding ● y is the data to be represented. X is an n×p matrix whose columns X_j are the vectors to be learned. The L1 penalty encourages sparsity in the coefficients. ● Sparsity allows specialization of features and encourages the capture of salient properties of the data.
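For the linear (lasso) baseline, the coding step for a single patch y against a fixed dictionary X can be sketched with scikit-learn. The function sparse_code and the regularization value are illustrative, and the alternating dictionary update is omitted.

    import numpy as np
    from sklearn.linear_model import Lasso

    def sparse_code(y, X, lam):
        # Solve min_beta (1/2n) * ||y - X beta||^2 + lam * ||beta||_1 for one patch y.
        model = Lasso(alpha=lam, fit_intercept=False, max_iter=10000)
        model.fit(X, y)
        return model.coef_   # sparse coefficient vector; most entries are exactly zero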
  • 55. 6.3. Functional Sparse Coding ● When solved with lasso and SGD, 200 codewords that capture edge features at different scales and spatial orientations are learned:
  • 56. 6.3. Functional Sparse Coding ● In the functional version, no assumption of linearity is made between X and y. Instead, the following additive model is used: ● This leads to the following optimization problem:
  • 57. 6.3. Functional Sparse Coding ● Which model is lasso and which is SpAM?
  • 58. 6.3. Functional Sparse Coding ● What about expressiveness?
  • 59. 6.3. Functional Sparse Coding ● The sparse linear model uses 8 codewords while the functional version uses 7, with a lower residual sum of squares (RSS) ● Also, the linear and nonlinear versions learn different codewords.
  • 60. Contents 1. Introduction 2. Notation and Assumptions 3. Sparse Backfitting 4. Choosing regularization parameter 5. Results 6. Discussion Points
  • 61. 7. Discussion Points (1) ● As the authors said, SpAM is essentially a functional version of the grouped lasso. Then, are there any formulations for functional versions of other methods, e.g. ridge or the fused lasso? Finding a generalized functional version of the lasso family would be an interesting problem ○ Functional logistic regression with fused lasso penalty (FLR-FLP)
  • 62. 7. Discussion Points (1) ● Objective function = FLR loss + lasso penalty + fused lasso penalty ● FLR loss ● gamma = coefficient in functional parameters
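As a small illustration of the penalty structure mentioned above (an ordinary lasso term plus a fused-lasso term on adjacent coefficients), with flr_flp_penalty a made-up name and gamma standing for the coefficient vector of the functional parameter:

    import numpy as np

    def flr_flp_penalty(gamma, lam1, lam2):
        # lam1 * sum_k |gamma_k|              -> sparsity of individual coefficients
        # lam2 * sum_k |gamma_k - gamma_{k-1}| -> encourages piecewise-constant
        #                                         (fused) coefficient profiles
        return lam1 * np.sum(np.abs(gamma)) + lam2 * np.sum(np.abs(np.diff(gamma)))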
  • 63. 7. Discussion Points (2) ● How might we handle group sparsity in additive models (GroupSpAM) as an analogy to GroupLasso?
  • 64. 7. Discussion Points (2) ● How might we handle group sparsity in additive models (GroupSpAM) as an analogy to GroupLasso? ○ First, we assume G is a partition of {1, ..., p} and that the groups do not overlap ○ The optimization problem then becomes
  • 65. 7. Discussion Points (2) ● How might we handle group sparsity in additive models (GroupSpAM) as an analogy to GroupLasso? ● The regularization term becomes
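A minimal sketch of such a group-level regularizer, assuming the same n x p layout F of fitted components used earlier and a non-overlapping partition groups of the column indices (both names are illustrative):

    import numpy as np

    def group_spam_penalty(F, groups, lam):
        # Each group is charged by the joint empirical L2 norm of its components,
        # so either the whole group is kept or the whole group is zeroed out.
        penalty = 0.0
        for g in groups:
            penalty += np.sqrt(np.sum(np.mean(F[:, g] ** 2, axis=0)))
        return lam * penalty

    # e.g. group_spam_penalty(F, groups=[[0, 1], [2, 3, 4]], lam=0.1)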
  • 66. 7. Discussion Points (3) ● What effect do you think smoothing has on the functions?
  • 67. 7. Discussion Points (3) ● What effect do you think smoothing has on the functions? ○ It turns out smoothing has some connections to bias-variance tradeoffs.
  • 68. 7. Discussion Points (3) ● What effect do you think smoothing has on the functions? ○ It turns out smoothing has some connections to bias-variance tradeoffs. ● Let’s suppose we make our estimates too smooth. What may we expect then?
  • 69. 7. Discussion Points (3) ● What effect do you think smoothing has on the functions? ○ It turns out smoothing has some connections to bias-variance tradeoffs. ● Let's suppose we make our estimates too smooth. What may we expect then? ○ If our estimates are too smooth, we risk bias. ■ Thus we make erroneous assumptions about the underlying functions. ■ In this case we miss relevant relations between features and targets. Thus we underfit.
  • 70. 7. Discussion Points (3) ● What if we make rough estimates? What may we expect then?
  • 71. 7. Discussion Points (3) ● What if we make rough estimates? What may we expect then? ○ We risk variance. What does this mean?
  • 72. 7. Discussion Points (3) ● What if we make rough estimates? What may we expect then? ○ We risk variance. What does this mean? ● The learned model becomes sensitive to small variations in the data. Thus we overfit. We must keep a balance between bias and variance by using an appropriate level of smoothing.
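The bandwidth of the smoother is one concrete knob that trades bias against variance. The toy experiment below (all values arbitrary) typically shows both the over-smoothed and the under-smoothed fits doing worse on held-out data than an intermediate bandwidth.

    import numpy as np

    def nadaraya_watson(x_train, y_train, x_eval, bandwidth):
        # Gaussian-kernel smoother; bandwidth is the smoothing level.
        d = (x_eval[:, None] - x_train[None, :]) / bandwidth
        w = np.exp(-0.5 * d ** 2)
        return (w @ y_train) / w.sum(axis=1)

    rng = np.random.default_rng(1)
    x_train = np.sort(rng.uniform(0.0, 1.0, 200))
    y_train = np.sin(2.0 * np.pi * x_train) + rng.normal(0.0, 0.3, size=x_train.size)
    x_test = np.sort(rng.uniform(0.0, 1.0, 200))
    y_test = np.sin(2.0 * np.pi * x_test) + rng.normal(0.0, 0.3, size=x_test.size)

    for h in (0.005, 0.05, 0.5):   # too rough, moderate, too smooth
        pred = nadaraya_watson(x_train, y_train, x_test, h)
        print(h, np.mean((pred - y_test) ** 2))   # test MSE; usually lowest for the middle h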
  • 73. 7. Discussion Points (4) ● Some Notes on Practicality With modern computing power, can you think of situations where a linear sparsity inducing model such as lasso may be preferred over Sparse Additive Models?
  • 74. 7. Discussion Points (4) ● Some Notes on Practicality With modern computing power, can you think of situations where a linear sparsity inducing model such as lasso may be preferred over Sparse Additive Models? ● Our data analysis is guided by a credible scientific theory which asserts linear relationships among the variables we measure.
  • 75. 7. Discussion Points (4) ● Some Notes on Practicality With modern computing power, can you think of situations where a linear sparsity inducing model such as lasso may be preferred over Sparse Additive Models? ● Our data analysis is guided by a credible scientific theory which asserts linear relationships among the variables we measure. ● Our data set is so massive that either the extra processing time, or the extra computer memory needed to fit and store an additive rather than a linear model is prohibitive.
  • 76. Thank you for listening 76