1. Introduction
● Combine ideas from two lines of work:
  ○ sparse linear models, which contribute the sparsity constraint
  ○ additive nonparametric regression, which contributes backfitting
  ➝ Sparse Additive Models (SpAM)
● SpAM ≈ an additive nonparametric regression model
  ○ but with a sparsity constraint on the component functions
  ○ in effect, a functional version of the group lasso
● A nonparametric regression model relaxes the strong linearity assumptions made by a linear model.
● The authors show the estimator has two key properties:
  ○ 1. Sparsistence (sparsity pattern consistency): the SpAM backfitting algorithm recovers the correct sparsity pattern asymptotically.
  ○ 2. Persistence: the estimator is persistent, i.e. its predictive risk converges to that of the best predictor in the class considered.
2. Notation and Assumption
● P = the joint distribution of (X_i, Y_i).
● The L2(P) norm of a function f on [0, 1]: ||f|| = sqrt(E[f(X)^2]).
● On each dimension j:
  ○ H_j: the Hilbert subspace of L2(P) of P-measurable functions f_j(x_j) of the single variable x_j
  ○ with zero mean, E[f_j(X_j)] = 0.
  ○ The Hilbert subspace has the inner product <f_j, f'_j> = E[f_j(X_j) f'_j(X_j)].
● H = H_1 ⊕ ... ⊕ H_p: the Hilbert space of p-dimensional functions in the additive form m(x) = Σ_j f_j(x_j).
● ψ_jk: a uniformly bounded, orthonormal basis on [0, 1].
● Each component function can be expanded in this basis: f_j(x_j) = Σ_k β_jk ψ_jk(x_j). (A small empirical sketch follows.)
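To make these definitions concrete, here is a minimal Python sketch (illustrative only, not from the paper): it computes the empirical L2(P) norm of a component function and its coefficients in an orthonormal cosine basis on [0, 1], which is one example of a uniformly bounded orthonormal basis.

# Minimal empirical sketch of the notation above: the L2(P) norm of a component
# function estimated from data, and an expansion of f_j in an orthonormal basis
# (here a cosine basis on [0, 1]). Purely illustrative of the definitions.
import numpy as np

def empirical_l2_norm(f_vals):
    """Empirical version of ||f|| = sqrt(E[f(X)^2])."""
    return np.sqrt(np.mean(f_vals ** 2))

def cosine_basis(x, k):
    """Orthonormal cosine basis on [0, 1]: psi_0 = 1, psi_k = sqrt(2) cos(pi k x)."""
    return np.ones_like(x) if k == 0 else np.sqrt(2) * np.cos(np.pi * k * x)

rng = np.random.default_rng(0)
x = rng.uniform(size=1000)                            # X ~ Uniform(0, 1)
f_vals = np.sin(2 * np.pi * x)                        # some zero-mean component function
print("empirical norm:", empirical_l2_norm(f_vals))  # ~ 1/sqrt(2) for sin(2*pi*x)

# Basis coefficients beta_jk = E[f(X) psi_k(X)], estimated by sample averages.
beta = [np.mean(f_vals * cosine_basis(x, k)) for k in range(5)]
print("first basis coefficients:", np.round(beta, 3))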
3. Sparse Backfitting
#1. Formulating the population SpAM
● Eq. 8: the standard form of the additive model optimization problem (the standard additive model).
● Eq. 9: its penalized Lagrangian form, the objective function of the additive model optimization problem.
● Eq. 10: the design choice of SpAM, giving the sparse additive model optimization problem: β is a scaling parameter and g is a function in the Hilbert space.
  ○ (Red): the functional mapping Xj ↦ g(Xj) ↦ β·g(Xj).
  ○ (Green): the coefficients β become sparse, as in the lasso.
● Eq. 11: the penalized Lagrangian form of Eq. 10 and the sample version of Eq. 9: the sparse(!) additive model, SpAM.
  ○ Ψ: basis functions whose linear span contains the functions g, where q <= p; linearly dependent functions g are grouped into Ψ.
  ➝ Functional version of the group lasso.
● Minimizing this objective yields a soft-thresholding operator and a backfitting algorithm (Theorem 1). A small sketch of the sampled objective follows.
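As one concrete reading of the sampled objective (Eq. 11), here is a minimal sketch, assuming each component j is represented by an n x d basis matrix Psi_j with coefficient group beta_j; the names and shapes are illustrative assumptions, not the authors' code.

# Minimal sketch of the sampled SpAM objective (the group-lasso view of Eq. 11),
# assuming each component j is represented by an n x d basis matrix Psi[j] and a
# coefficient vector beta[j]. Names and shapes are illustrative.
import numpy as np

def spam_objective(y, Psi, beta, lam):
    """0.5/n * ||y - sum_j Psi_j beta_j||^2 + lam * sum_j sqrt(mean((Psi_j beta_j)^2))."""
    n = len(y)
    fit = sum(P @ b for P, b in zip(Psi, beta))           # additive fit, one group per feature
    rss = 0.5 / n * np.sum((y - fit) ** 2)                # squared-error term
    penalty = lam * sum(np.sqrt(np.mean((P @ b) ** 2))    # empirical L2(P) norm of each f_j
                        for P, b in zip(Psi, beta))
    return rss + penalty

# Toy usage: p = 3 components, each with a 4-dimensional basis.
rng = np.random.default_rng(0)
n, p, d = 50, 3, 4
Psi = [rng.normal(size=(n, d)) for _ in range(p)]
beta = [rng.normal(size=d) for _ in range(p)]
y = rng.normal(size=n)
print(spam_objective(y, Psi, beta, lam=0.1))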
3. Sparse Backfitting
#2. Theorem 1 says
From the penalized Lagrangian form, the minimizers f_j satisfy
f_j = [1 - λ / sqrt(E[P_j^2])]_+ · P_j   almost surely,
where P_j = E[R_j | X_j] denotes the projection of the residual onto X_j, R_j = Y - Σ_{k≠j} f_k represents the residual, and [·]_+ means the positive part.
3. Sparse Backfitting
#2. Proof of Theorem 1 (show the soft-thresholding form above)
#2-1. The stationary condition is obtained by setting the Fréchet directional derivative to zero.
#2-2. Using iterated expectations, the condition can be rewritten in terms of the projection P_j = E[R_j | X_j].
#2-3. From this we obtain the equivalence between the stationary condition and the thresholding condition.
#2-4. This gives the soft-thresholding update for the component function: only the positive part survives after thresholding.
Discussion points:
Why do you think Theorem 1 is important?
: Only positive parts survive, so component functions can be set exactly to zero and the estimate becomes sparse. (A small sketch of this update follows.)
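A minimal sketch of the soft-thresholding step of Theorem 1, applied to a sample estimate of the projection P_j; purely illustrative, not the authors' code.

# Minimal sketch of the soft-thresholding step in Theorem 1, applied to a sample
# estimate. P_j plays the role of E[R_j | X_j] evaluated at the data points; the
# norm is the empirical analogue of sqrt(E[P_j^2]).
import numpy as np

def soft_threshold_component(P_j, lam):
    """Return f_j = [1 - lam / ||P_j||]_+ * P_j."""
    norm = np.sqrt(np.mean(P_j ** 2))         # empirical sqrt(E[P_j^2])
    if norm == 0.0:
        return np.zeros_like(P_j)
    shrink = max(0.0, 1.0 - lam / norm)       # positive part [.]_+
    return shrink * P_j                       # the whole component is zeroed if norm <= lam

# If the component's norm is below lam, it is set exactly to zero.
P_j = np.array([0.3, -0.1, 0.2, 0.05])
print(soft_threshold_component(P_j, lam=0.5))   # -> all zeros for this small norm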
3. Sparse Backfitting
#3. Backfitting algorithm
- According to Theorem 1, each component update is a smooth-then-soft-threshold step.
- In the sample version, the conditional expectation E[R_j | X_j] is estimated by a smoother matrix S_j applied to the residual.
- Flow of the backfitting algorithm: cycle through the coordinates j, smoothing the current partial residual and soft-thresholding, until convergence (see the sketch below).
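A minimal sketch of the sample SpAM backfitting loop, assuming each smoother is given as an n x n linear smoother matrix S_j. It follows the smooth-then-soft-threshold recipe above and is not the authors' implementation.

# Minimal sketch of the SpAM backfitting loop (sample version), assuming each
# "smoother" S_j is an n x n linear smoother matrix applied to the partial residual.
import numpy as np

def spam_backfit(y, smoothers, lam, n_iter=50):
    n, p = len(y), len(smoothers)
    f = np.zeros((p, n))                          # fitted component functions at the data points
    y_c = y - y.mean()                            # work with a centered response
    for _ in range(n_iter):
        for j in range(p):
            R_j = y_c - (f.sum(axis=0) - f[j])    # partial residual, leaving out component j
            P_j = smoothers[j] @ R_j              # smooth the residual: estimate of E[R_j | X_j]
            norm = np.sqrt(np.mean(P_j ** 2))
            f[j] = max(0.0, 1.0 - lam / norm) * P_j if norm > 0 else 0.0
            f[j] -= f[j].mean()                   # re-center each component
    return f

# Toy usage with crude placeholder smoothers, just to show the interface.
rng = np.random.default_rng(1)
n, p = 30, 4
y = rng.normal(size=n)
smoothers = [np.full((n, n), 1.0 / n) + 0.5 * np.eye(n) for _ in range(p)]
fitted = spam_backfit(y, smoothers, lam=0.2)
print(fitted.shape)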
3. Sparse Backfitting
#4. SpAM backfitting algorithm vs. coordinate descent algorithm (lasso)
The SpAM backfitting algorithm is the functional version of the coordinate descent algorithm:
: a functional mapping replaces the scalar coefficient update,
: it iterates through the coordinates one at a time,
: a smoothing matrix S_j takes the place of a single covariate column.
Discussion points:
Now we understand that SpAM is the functional version of the group lasso.
Is SpAM then always better than the lasso or the group lasso?
(Hint: the lasso is a linear model, and SpAM is ...?)
(For reference, a minimal coordinate-descent sketch for the lasso follows.)
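For comparison, a minimal coordinate-descent sketch for the lasso (the linear case): each coordinate update is a scalar soft-threshold, which SpAM replaces with a smoothed, soft-thresholded function update. Illustrative only.

# Minimal coordinate-descent sketch for the lasso objective
# (1/2n)||y - X beta||^2 + lam * ||beta||_1.
import numpy as np

def lasso_cd(X, y, lam, n_iter=100):
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]            # partial residual excluding j
            rho = X[:, j] @ r_j / n
            denom = (X[:, j] ** 2).mean()
            beta[j] = np.sign(rho) * max(0.0, abs(rho) - lam) / denom   # scalar soft threshold
    return beta

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=100)
print(np.round(lasso_cd(X, y, lam=0.1), 2))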
3. Sparse Backfitting
For simplicity, think of SPLAM [1] as the combination of SpAM (non-linearity) and the lasso (linearity).
(Comparison: GAMSEL vs. SpAM vs. lasso.)
[1] Lou, Yin, et al. "Sparse partially linear additive models." Journal of Computational and Graphical Statistics 25.4 (2016): 1126-1140.
[2] Hastie, Trevor J. "Generalized additive models." Statistical Models in S. Routledge, 2017. 249-307.
Discussion points:
Now we understand that SpAM is the functional version of the group lasso. Is SpAM always better than the lasso or the group lasso? (Hint: the lasso is a linear model, and SpAM is ...?)
: When there is non-linearity, SpAM can be effective.
6.1. Synthetic Data
● Generate 150 samples from a 200-dimensional additive model with four relevant component functions.
● The remaining 196 features are irrelevant: their component functions are set to zero, and zero-mean Gaussian noise is added to the response. (A data-generation sketch follows.)
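A data-generation sketch in the spirit of this setup; the specific nonlinear functions below are illustrative assumptions, not the paper's exact choices.

# Synthetic setup in the spirit of the experiment: n = 150 samples, p = 200 features,
# only 4 relevant additive components, Gaussian noise on the response.
import numpy as np

rng = np.random.default_rng(0)
n, p = 150, 200
X = rng.uniform(size=(n, p))

f = [lambda x: np.sin(2 * np.pi * x),
     lambda x: (2 * x - 1) ** 2,
     lambda x: np.exp(x) - (np.e - 1),          # roughly centered on [0, 1]
     lambda x: np.cos(np.pi * x)]

signal = sum(fj(X[:, j]) for j, fj in enumerate(f))    # 4 relevant components
y = signal + rng.normal(scale=1.0, size=n)              # remaining 196 components are zero
print(X.shape, y.shape)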
6.1. Synthetic Data
● Empirical probability of selecting the true four variables, as a function of the sample size n.
● The same thresholding phenomenon that was shown in the lasso is observed.
6.2. Boston Housing
● There are 506 observations with 10 covariates.
● To explore the sparsistency properties of SpAM, 20 irrelevant variables are
added.
● Ten of those are randomly drawn from Uniform(0, 1)
● The remainder are permutations of the original 10 covariates.
6.2. Boston Housing
● SpAM identifies 6 of the nonzero components.
● Both types of irrelevant variables are correctly zeroed out.
6.3. SpAM for Spam
● The dataset consists of 3,065 emails, which serve as the training set.
● 57 attributes are available; these are all numeric.
● The attributes measure the percentage of specific words in an email and the average and maximum run lengths of uppercase letters.
● Subsample 300 emails from the training set and use the remainder as the test set.
6.3. SpAM for Spam
The 33 selected variables cover 80% of the significant predictors.
6.4. Functional Sparse Coding
● Here we compare SpAM with the lasso, on natural images.
● The problem setup is as follows:
● y is the data to be represented; X is an n×p matrix whose columns X_j are the vectors (codewords) to be learned. The L1 penalty encourages sparsity in the coefficients.
● Sparsity allows specialization of features and enforces capturing of salient properties of the data.
6.4. Functional Sparse Coding
● When solved with lasso and SGD, 200 codewords that capture edge
features at different scales and spatial orientations are learned:
6.4. Functional Sparse Coding
● In the functional version, no assumption of linearity is made between X and
y. Instead, the following additive model is used:
● This leads to the following optimization problem:
6.4. Functional Sparse Coding
● The sparse linear model uses 8 codewords, while the functional version uses 7 with a lower residual sum of squares (RSS).
● Also, the linear and nonlinear versions learn different codewords.
7. Discussion Points (1)
● As the authors note, SpAM is essentially a functional version of the grouped lasso. Are there formulations for functional versions of other methods, e.g. ridge or the fused lasso? Finding a generalized functional version of the lasso family would be an interesting problem.
  ○ Functional logistic regression with fused lasso penalty (FLR-FLP)
7. Discussion Points (1)
● Objective function = FLR loss + lasso penalty + fused lasso penalty
● FLR loss: the logistic (negative log-likelihood) loss of the functional logistic regression model
● gamma = the coefficients of the functional parameters (see the sketch below)
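A minimal sketch of an objective of the form "logistic loss + lasso penalty + fused lasso penalty" on a coefficient vector gamma; the design matrix B, standing in for basis evaluations of the functional covariates, is an assumption for illustration.

# Sketch of an FLR-FLP-style objective: logistic loss plus lasso and fused-lasso
# penalties on the coefficient vector gamma. All names are illustrative.
import numpy as np

def flr_flp_objective(B, y, gamma, lam1, lam2):
    """y in {0, 1}; B is n x d; gamma is the coefficient vector."""
    eta = B @ gamma
    log_lik = np.sum(y * eta - np.log1p(np.exp(eta)))     # logistic log-likelihood
    lasso = lam1 * np.sum(np.abs(gamma))                   # sparsity in gamma
    fused = lam2 * np.sum(np.abs(np.diff(gamma)))          # smoothness of adjacent coefficients
    return -log_lik + lasso + fused

rng = np.random.default_rng(3)
B = rng.normal(size=(40, 10))
y = (rng.uniform(size=40) < 0.5).astype(float)
print(flr_flp_objective(B, y, gamma=np.zeros(10), lam1=0.1, lam2=0.1))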
7. Discussion Points (2)
● How might we handle group sparsity in additive models (GroupSpAM), in analogy to the group lasso?
  ○ First, assume G is a partition of {1, ..., p} and that the groups do not overlap.
  ○ The optimization problem keeps the additive squared-error loss, and the regularization term becomes a sum over groups of the joint norm of the component functions in each group (see the sketch below).
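A minimal sketch of a GroupSpAM-style regularization term: each group of component functions is penalized through its joint empirical norm, so a whole group can be zeroed out together. The group-size weight is an optional convention; the structure is illustrative.

# GroupSpAM-style penalty: for each group g, penalize the joint empirical norm of
# the fitted component functions in that group.
import numpy as np

def group_spam_penalty(f, groups, lam):
    """f: p x n array of fitted component values; groups: list of index lists."""
    total = 0.0
    for g in groups:
        group_norm = np.sqrt(sum(np.mean(f[j] ** 2) for j in g))   # sqrt of summed ||f_j||^2
        total += lam * np.sqrt(len(g)) * group_norm                 # optional group-size weight
    return total

f = np.zeros((4, 30))
f[0] = 0.5                        # only feature 0 is active
print(group_spam_penalty(f, groups=[[0, 1], [2, 3]], lam=1.0))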
7. Discussion Points (3)
● What effect do you think smoothing has on the functions?
  ○ It turns out smoothing has close connections to the bias-variance tradeoff.
● Let's suppose we make our estimates too smooth. What may we expect then?
  ○ If our estimates are too smooth, we risk bias.
    ■ We make erroneous assumptions about the underlying functions.
    ■ In this case we miss relevant relations between features and targets; we underfit.
7. Discussion Points (3)
● What if we make our estimates too rough? What may we expect then?
  ○ We risk variance. What does this mean?
● The learned model becomes sensitive to small variations in the data, so we overfit.
We must keep a balance between bias and variance by using an appropriate level of smoothing (see the sketch below).
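A minimal sketch of this tradeoff using a Nadaraya-Watson kernel smoother with bandwidth h: a large h over-smooths (bias, underfitting), a small h under-smooths (variance, overfitting). Illustrative only; not tied to the paper's smoothers.

# Bias-variance behaviour of a simple Gaussian-kernel smoother as the bandwidth varies.
import numpy as np

def kernel_smooth(x, y, h):
    """Fitted values at the observed x using a Gaussian kernel of bandwidth h."""
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return (w @ y) / w.sum(axis=1)

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(size=100))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=100)

for h in (0.5, 0.05, 0.005):                     # too smooth, reasonable, too rough
    resid = y - kernel_smooth(x, y, h)
    print(f"h={h}: training RSS={np.sum(resid ** 2):.2f}")   # shrinks as h shrinks (overfitting)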
7. Discussion Points (4)
● Some Notes on Practicality
With modern computing power, can you think of situations where a linear sparsity-inducing model such as the lasso may be preferred over Sparse Additive Models?
● Our data analysis is guided by a credible scientific theory which asserts linear relationships among the variables we measure.
● Our data set is so massive that either the extra processing time or the extra computer memory needed to fit and store an additive rather than a linear model is prohibitive.