# Probing the Pareto Frontier for BP Solutions

Ewout van den Berg and Michael P. Friedlander

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.

Applications include image reconstruction, such as MRI [36, 37] and seismic [31, 32] images, and model selection in regression [26]. In many of these applications, the data sets are large, and the matrix $A$ is available only as an operator. In compressed sensing [10, 11, 12, 23], for example, the matrices are often fast operators such as Fourier or wavelet transforms. It is therefore crucial to develop algorithms for the sparse reconstruction problem that scale well and work effectively in a matrix-free context.

We present an algorithm, suitable for large-scale applications, that is capable of finding solutions of (BP$_\sigma$) for any value of $\sigma \ge 0$. Our approach is based on recasting (BP$_\sigma$) as a problem of finding the root of a single-variable nonlinear equation. At each iteration of our algorithm, an estimate of that variable is used to define a convex optimization problem whose solution yields derivative information that can be used by a Newton-based root-finding algorithm.

**1.1. One-norm regularization.** The convex optimization problem (BP$_\sigma$) is only one possible statement of the one-norm regularized least-squares problem. In fact, the BPDN label is typically applied to the penalized least-squares problem

$$(\text{QP}_\lambda) \qquad \min_{x} \; \|Ax - b\|_2^2 + \lambda \|x\|_1,$$

which is the problem statement proposed by Chen, Donoho, and Saunders [14, 15]. A third formulation,

$$(\text{LS}_\tau) \qquad \min_{x} \; \|Ax - b\|_2 \quad \text{subject to} \quad \|x\|_1 \le \tau,$$

has an explicit one-norm constraint and is often called the Lasso problem [46]. The parameter $\lambda$ is related to the Lagrange multiplier of the constraint in (LS$_\tau$) and to the reciprocal of the multiplier of the constraint in (BP$_\sigma$). Thus, for appropriate parameter choices of $\sigma$, $\lambda$, and $\tau$, the solutions of (BP$_\sigma$), (QP$_\lambda$), and (LS$_\tau$) coincide, and these problems are in some sense equivalent. However, except for special cases (such as $A$ orthogonal), the parameters that make these problems equivalent cannot be known a priori.
The formulation (QP$_\lambda$) is often preferred because of its close connection to convex quadratic programming, for which many algorithms and software are available; some examples include iteratively reweighted least squares [7, section 4.5] and gradient projection [27]. For the case where an estimate of the noise level $\sigma$ is known, Chen, Donoho, and Saunders [15, section 5.2] argue that the choice $\lambda = \sigma \sqrt{2 \log n}$ has important optimality properties. However, this argument hinges on the orthogonality of $A$.

We focus on the situation where $\sigma$ is approximately known, such as when we can estimate the noise levels inherent in an underlying system or in the measurements taken. In this case it is preferable to solve (BP$_\sigma$), and doing so is our primary goal here. An important consequence of our approach is that it can also be used to efficiently solve the closely related problems (BP) and (LS$_\tau$). Our algorithm also applies to all three problems in the complex domain, which can arise in signal processing applications.

**1.2. Approach.** At the heart of our approach is the ability to efficiently solve a sequence of (LS$_\tau$) problems using a spectral projected-gradient (SPG) algorithm [5, 6, 18]. As with (QP$_\lambda$), this problem is parameterized by a scalar; the crucial difference, however, is that the dual solution of (LS$_\tau$) yields vital information on how to update $\tau$ so that the next solution of (LS$_\tau$) is much closer to the solution of (BP$_\sigma$).
Let $x_\tau$ denote the optimal solution of (LS$_\tau$). The single-parameter function

$$(1.1) \qquad \phi(\tau) = \|r_\tau\|_2, \quad \text{with} \quad r_\tau := b - A x_\tau,$$

gives the optimal value of (LS$_\tau$) for each $\tau \ge 0$. As we describe in section 2, its derivative is given by $-\lambda_\tau$, where $\lambda_\tau \ge 0$ is the unique dual solution of (LS$_\tau$). Importantly, this dual solution can easily be obtained as a by-product of the minimization of (LS$_\tau$); this is discussed in section 2.1. Our approach is then based on applying Newton's method to find a root of the nonlinear equation

$$(1.2) \qquad \phi(\tau) = \sigma,$$

which defines a sequence of regularization parameters $\tau_k \to \tau_\sigma$, where $x_{\tau_\sigma}$ is a solution of (BP$_\sigma$). In other words, $\tau_\sigma$ is the parameter that causes (LS$_\tau$) and (BP$_\sigma$) to share the same solution.

There are four distinct components to this paper. The first two are related to the root-finding algorithm for (BP$_\sigma$). The third is an efficient algorithm for solving (LS$_\tau$), and hence for evaluating the function $\phi$ and its derivative $\phi'$. The fourth gives the results of a series of numerical experiments.

*Pareto curve (section 2).* The Pareto curve defines the optimal trade-off between the two-norm of the residual $r$ and the one-norm of the solution $x$. The problems (BP$_\sigma$) and (LS$_\tau$) are two distinct characterizations of the same curve. Our approach uses the function $\phi$ to parameterize the Pareto curve by $\tau$. We show that, for all points of interest, $\phi$ (and hence the Pareto curve) is continuously differentiable. We are also able to give an explicit expression for its derivative. This surprising result permits us to use a Newton-based root-finding algorithm to find roots of the nonlinear equation (1.2), which correspond to points on the Pareto curve. Thus we can find solutions of (BP$_\sigma$) for any $\sigma \ge 0$.

*Root finding (section 3).* Each iteration of the root-finding algorithm for (1.2) requires the evaluation of $\phi$ and $\phi'$ at some $\tau$, and hence the minimization of (LS$_\tau$).
This is a potentially expensive subproblem, and the effectiveness of our approach hinges on the ability to solve this subproblem only approximately. We present rate-of-convergence results for the case where $\phi$ and $\phi'$ are known only approximately. This is in contrast to the usual inexact-Newton analysis [22], which assumes that $\phi$ is known exactly. We also give an effective stopping rule for determining the required accuracy of each minimization of (LS$_\tau$).

*Projected gradient for Lasso (section 4).* We describe an SPG algorithm for solving (LS$_\tau$). Each iteration of this method requires an orthogonal projection of an $n$-vector onto the convex set $\|x\|_1 \le \tau$. In section 4.2 we give an algorithm for this projection with a worst-case complexity of $O(n \log n)$. In many important applications, $A$ is a Fourier-type operator, and matrix-vector products with $A$ and $A^T$ can be obtained with $O(n \log n)$ cost. The projection cost is typically much smaller than the worst case, and the dominant cost in our algorithm consists of the matrix-vector products, as it does in other algorithms for BPDN. We also show how the projection algorithm can easily be extended to project complex vectors, which allows us to extend the SPG algorithm to problems in the complex domain.

*Implementation and numerical experiments (sections 5 and 6).* To demonstrate the effectiveness of our approach, we apply our algorithm on a set of benchmark problems and compare it to other state-of-the-art solvers. In sections 6.1 and 6.2 we report numerical results on a series of (BP$_\sigma$) and (BP) problems, which are normally considered as distinct problems. In section 6.3 we report numerical results on a series of (LS$_\tau$) problems for various values of $\tau$, and compare against the equivalent (QP$_\lambda$) formulations.
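The one-norm projection described above can be sketched with the standard sort-and-threshold method; the NumPy sketch below (function names and details are ours, and the paper's own section 4.2 algorithm may differ) attains the quoted $O(n \log n)$ worst case, together with the complex extension that projects the moduli and preserves the phases.

```python
import numpy as np

def project_l1_ball(v, tau):
    """Euclidean projection of v onto {x : ||x||_1 <= tau} by sorting;
    O(n log n) in the worst case (a standard method, given here only as
    an illustration of the complexity bound quoted in the text)."""
    a = np.abs(v).astype(float)
    if a.sum() <= tau:
        return np.array(v, dtype=float, copy=True)   # already feasible
    u = np.sort(a)[::-1]                    # magnitudes, descending
    css = np.cumsum(u)
    k = np.arange(1, u.size + 1)
    rho = k[u - (css - tau) / k > 0].max()  # number of entries kept active
    theta = (css[rho - 1] - tau) / rho      # soft-threshold level
    return np.sign(v) * np.maximum(a - theta, 0.0)

def project_l1_ball_complex(v, tau):
    """Complex case: project the vector of moduli and keep the phases,
    as needed to extend the SPG method to complex problems."""
    mags = project_l1_ball(np.abs(v), tau)
    return mags * np.exp(1j * np.angle(v))
```

The projection soft-thresholds the magnitudes at a level chosen so the result lands exactly on the boundary of the ball whenever the input is infeasible.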
The dual of this convex problem is given by

$$(2.2) \qquad \max_{y,\lambda} \; L(y, \lambda) \quad \text{subject to} \quad \lambda \ge 0,$$

where

$$L(y, \lambda) = \inf_{x,r} \left\{ \|r\|_2 - y^T (Ax + r - b) + \lambda (\|x\|_1 - \tau) \right\}$$

is the Lagrange dual function, and the $m$-vector $y$ and scalar $\lambda$ are the Lagrange multipliers (e.g., dual variables) corresponding to the constraints in (2.1). We use the separability of the infimum in $r$ and $x$ to rearrange terms and arrive at the equivalent statement

$$L(y, \lambda) = b^T y - \tau \lambda - \sup_r \left\{ y^T r - \|r\|_2 \right\} - \sup_x \left\{ y^T A x - \lambda \|x\|_1 \right\}.$$

We recognize the suprema above as the conjugate functions of $\|r\|_2$ and of $\lambda \|x\|_1$, respectively. For an arbitrary norm $\|\cdot\|$ with dual norm $\|\cdot\|_*$, the conjugate function of $f(x) = \alpha \|x\|$ for any $\alpha \ge 0$ is given by

$$(2.3) \qquad f_*(y) := \sup_x \left\{ y^T x - \alpha \|x\| \right\} = \begin{cases} 0 & \text{if } \|y\|_* \le \alpha, \\ \infty & \text{otherwise}; \end{cases}$$

see Boyd and Vandenberghe [8, section 3.3.1]. With this expression of the conjugate function, it follows that (2.2) remains bounded if and only if the dual variables $y$ and $\lambda$ satisfy the constraints $\|y\|_2 \le 1$ and $\|A^T y\|_\infty \le \lambda$. The dual of (2.1), and hence of (LS$_\tau$), is then given by

$$(2.4) \qquad \max_{y,\lambda} \; b^T y - \tau \lambda \quad \text{subject to} \quad \|y\|_2 \le 1, \;\; \|A^T y\|_\infty \le \lambda;$$

the nonnegativity constraint on $\lambda$ is implicitly enforced by the second constraint.

Importantly, the dual variables $y$ and $\lambda$ can easily be computed from the optimal primal solutions. To derive $y$, first note that, from (2.3),

$$(2.5) \qquad \sup_r \left\{ y^T r - \|r\|_2 \right\} = 0 \quad \text{if } \|y\|_2 \le 1.$$

Therefore, $y = r / \|r\|_2$, and we can without loss of generality take $\|y\|_2 = 1$ in (2.4). To derive $\lambda$, note that, as long as $\tau > 0$, $\lambda$ must be at its lower bound, as implied by the constraint $\|A^T y\|_\infty \le \lambda$. Hence, we take $\lambda = \|A^T y\|_\infty$. (If $r = 0$ or $\tau = 0$, the choice of $y$ or $\lambda$, respectively, is arbitrary.)
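This dual recovery is easy to express in code. The NumPy sketch below (helper names are ours, for illustration) forms $y = r/\|r\|_2$ and $\lambda = \|A^T y\|_\infty$ from a primal point and then uses them as an optimality certificate: if $x$ is primal feasible and the primal and dual objective values agree, then $x$ solves (LS$_\tau$), since the recovered pair is dual feasible by construction.

```python
import numpy as np

def dual_from_primal(A, b, x):
    """Dual estimates from a primal point, as in (2.5):
    y = r/||r||_2 and lambda = ||A^T y||_inf (assumes r != 0)."""
    r = b - A @ x
    y = r / np.linalg.norm(r)
    return y, np.linalg.norm(A.T @ y, np.inf)

def is_optimal(A, b, x, tau, tol=1e-8):
    """Certificate: primal feasibility plus a zero duality gap
    (primal objective minus dual objective) imply x solves LS_tau."""
    r = b - A @ x
    y, lam = dual_from_primal(A, b, x)
    gap = np.linalg.norm(r) - (b @ y - tau * lam)
    return np.abs(x).sum() <= tau + tol and gap <= tol
```

For example, with $A = I$ and $b = (3, 1)$, the point $x = (1, 0)$ solves (LS$_\tau$) for $\tau = 1$, while the feasible point $x = (0.5, 0.5)$ does not, and the certificate distinguishes the two.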
The dual variable $y$ can then be eliminated, and we arrive at the following necessary and sufficient optimality conditions for the primal-dual solution $(r_\tau, x_\tau, \lambda_\tau)$ of (2.1):

$$(2.6\text{a}) \qquad A x_\tau + r_\tau = b, \quad \|x_\tau\|_1 \le \tau \qquad \text{(primal feasibility)};$$

$$(2.6\text{b}) \qquad \|A^T r_\tau\|_\infty \le \lambda_\tau \|r_\tau\|_2 \qquad \text{(dual feasibility)};$$

$$(2.6\text{c}) \qquad \lambda_\tau \left( \|x_\tau\|_1 - \tau \right) = 0 \qquad \text{(complementarity)}.$$

**2.2. Convexity and differentiability of the Pareto curve.** Let $\tau_{\mathrm{BP}}$ be the optimal objective value of the problem (BP). This corresponds to the smallest value of $\tau$ such that (LS$_\tau$) has a zero objective value. As we show below, $\phi$ is nonincreasing,
and therefore $\tau_{\mathrm{BP}}$ is the first point at which the graph of $\phi$ touches the horizontal axis. Our assumption that $0 \ne b \in \operatorname{range}(A)$ implies that (BP) is feasible and that $\tau_{\mathrm{BP}} > 0$. Therefore, at the endpoints of the interval of interest,

$$(2.7) \qquad \phi(0) = \|b\|_2 > 0 \quad \text{and} \quad \phi(\tau_{\mathrm{BP}}) = 0.$$

As the following result confirms, the function is convex and strictly decreasing over the interval $\tau \in [0, \tau_{\mathrm{BP}}]$. It is also continuously differentiable on the interior of this interval; this is a crucial property.

**Theorem 2.1.**
(a) The function $\phi$ is convex and nonincreasing.
(b) For all $\tau \in (0, \tau_{\mathrm{BP}})$, $\phi$ is continuously differentiable, $\phi'(\tau) = -\lambda_\tau$, and the optimal dual variable $\lambda_\tau = \|A^T y_\tau\|_\infty$, where $y_\tau = r_\tau / \|r_\tau\|_2$.
(c) For $\tau \in [0, \tau_{\mathrm{BP}}]$, $\|x_\tau\|_1 = \tau$, and $\phi$ is strictly decreasing.

*Proof.* (a) The function $\phi$ can be restated as

$$(2.8) \qquad \phi(\tau) = \inf_x f(x, \tau),$$

where

$$f(x, \tau) := \|Ax - b\|_2 + \psi_\tau(x) \quad \text{and} \quad \psi_\tau(x) := \begin{cases} 0 & \text{if } \|x\|_1 \le \tau, \\ \infty & \text{otherwise}. \end{cases}$$

Note that, by (2.3), $\psi_\tau(x) = \sup_z \{ x^T z - \tau \|z\|_\infty \}$, which is the pointwise supremum of an affine function in $(x, \tau)$. Therefore, it is convex in $(x, \tau)$. Together with the convexity of $\|Ax - b\|_2$, this implies that $f$ is convex in $(x, \tau)$. Consider any nonnegative scalars $\tau_1$ and $\tau_2$, and let $x_1$ and $x_2$ be the corresponding minimizers of (2.8). For any $\beta \in [0, 1]$,

$$\begin{aligned} \phi(\beta \tau_1 + (1 - \beta) \tau_2) &= \inf_x f\big(x, \beta \tau_1 + (1 - \beta) \tau_2\big) \\ &\le f\big(\beta x_1 + (1 - \beta) x_2, \; \beta \tau_1 + (1 - \beta) \tau_2\big) \\ &\le \beta f(x_1, \tau_1) + (1 - \beta) f(x_2, \tau_2) \\ &= \beta \phi(\tau_1) + (1 - \beta) \phi(\tau_2). \end{aligned}$$

Hence, $\phi$ is convex in $\tau$. Moreover, $\phi$ is nonincreasing because the feasible set enlarges as $\tau$ increases.

(b) The function $\phi$ is differentiable at $\tau$ if and only if its subgradient at $\tau$ is unique [43, Theorem 25.1]. By [4, Proposition 6.5.8(a)], $-\lambda_\tau \in \partial \phi(\tau)$. Therefore, to prove the differentiability of $\phi$, it is enough to show that $\lambda_\tau$ is unique. Note that $\lambda$ appears linearly in (2.4) with coefficient $-\tau < 0$, and thus $\lambda_\tau$ is not optimal unless it is at its lower bound, as implied by the constraint $\|A^T y\|_\infty \le \lambda$. Hence, $\lambda_\tau = \|A^T y_\tau\|_\infty$.
Moreover, the convexity of (LS$_\tau$) implies that its optimal value is unique, and so $r_\tau \equiv b - A x_\tau$ is unique. Also, $\|r_\tau\|_2 > 0$ because $\tau < \tau_{\mathrm{BP}}$ (cf. (2.7)). As discussed in connection with (2.5), we can then take $y_\tau = r_\tau / \|r_\tau\|_2$, and so the uniqueness of $r_\tau$ implies the uniqueness of $y_\tau$, and hence the uniqueness of $\lambda_\tau$, as required. The continuity of the gradient follows from the convexity of $\phi$.

(c) The assertion holds trivially for $\tau = 0$. For $\tau = \tau_{\mathrm{BP}}$, $\|x_{\tau_{\mathrm{BP}}}\|_1 = \tau_{\mathrm{BP}}$ by definition. It only remains to prove part (c) on the interior of the interval. Note that $\phi(\tau) \equiv \|r_\tau\|_2 > 0$ for all $\tau \in [0, \tau_{\mathrm{BP}})$. Then by part (b), $\lambda_\tau > 0$, and hence $\phi$ is strictly decreasing for $\tau < \tau_{\mathrm{BP}}$. But because $x_\tau$ and $\lambda_\tau$ both satisfy the complementarity condition in (2.6), it must hold that $\|x_\tau\|_1 = \tau$.
**2.3. Generic regularization.** The technique used to prove Theorem 2.1 does not in any way rely on the specific norms used in the objective and regularization functions, and it can be used to prove similar properties for the generic regularized fitting problem

$$(2.9) \qquad \min_x \; \|Ax - b\|_s \quad \text{subject to} \quad \|x\|_p \le \tau,$$

where $1 \le (p, s) \le \infty$ define the norms of interest, i.e., $\|x\|_p = (\sum_i |x_i|^p)^{1/p}$. More generally, the constraint here may appear as $\|Lx\|_p$, where $L$ may be rectangular. Such a constraint defines a seminorm, and it often arises in the discrete approximations of derivative operators. In particular, least squares with Tikhonov regularization [47], which corresponds to $p = s = 2$, is used extensively for the regularization of ill-posed problems; see Hansen [29] for a comprehensive study. In this case, the Pareto curve defined by the optimal trade-off between $\|x\|_2$ and $\|Ax - b\|_2$ is often called the L-curve because of its shape when plotted on a log-log scale [30].

If we define $\bar{p}$ and $\bar{s}$ such that

$$1/p + 1/\bar{p} = 1 \quad \text{and} \quad 1/s + 1/\bar{s} = 1,$$

then the dual of the generic regularization problem is given by

$$\max_{y,\lambda} \; b^T y - \tau \lambda \quad \text{subject to} \quad \|y\|_{\bar{s}} \le 1, \;\; \|A^T y\|_{\bar{p}} \le \lambda.$$

As with (2.4), the optimal dual variables are given by any $y$ that satisfies $y^T r = \|r\|_s$ with $\|y\|_{\bar{s}} \le 1$, together with $\lambda = \|A^T y\|_{\bar{p}}$. This is a generalization of the results obtained by Dax [21], who derives the dual for $p$ and $s$ strictly between 1 and $\infty$. The corollary below, which follows from a straightforward modification of Theorem 2.1, asserts that the Pareto curve defined for any $1 \le (p, s) \le \infty$ in (2.7) has the properties of convexity and differentiability.

**Corollary 2.2.** Let $\theta(\tau) := \|r_\tau\|_s$, where $r_\tau := b - A x_\tau$, and $x_\tau$ is the optimal solution of (2.9).
(a) The function $\theta$ is convex and nonincreasing.
(b) For all $\tau \in (0, \tau_{\mathrm{BP}})$, $\theta$ is continuously differentiable, $\theta'(\tau) = -\lambda_\tau$, and the optimal dual variable $\lambda_\tau = \|A^T y_\tau\|_{\bar{p}}$, where $y_\tau$ satisfies $y_\tau^T r_\tau = \|r_\tau\|_s$.
(c) For $\tau \in [0, \tau_{\mathrm{BP}}]$, $\|x_\tau\|_p = \tau$, and $\theta$ is strictly decreasing.
**3. Root finding.** As we briefly outlined in section 1.2, our algorithm generates a sequence of regularization parameters $\tau_k \to \tau_\sigma$ based on the Newton iteration

$$(3.1) \qquad \tau_{k+1} = \tau_k + \Delta \tau_k, \quad \text{with} \quad \Delta \tau_k := \big(\sigma - \phi(\tau_k)\big) / \phi'(\tau_k),$$

such that the corresponding solutions $x_{\tau_k}$ of (LS$_{\tau_k}$) converge to $x_\sigma$. For values of $\sigma \in (0, \|b\|_2)$, Theorem 2.1 implies that $\phi$ is convex, strictly decreasing, and continuously differentiable. In that case it is clear that $\tau_k \to \tau_\sigma$ superlinearly for all initial values $\tau_0 \in (0, \tau_{\mathrm{BP}})$ (see, e.g., Bertsekas [3, Proposition 1.4.1]).

The efficiency of our method, as with many Newton-type methods for large problems, ultimately relies on the ability to carry out the iteration described by (3.1) with only an approximation of $\phi(\tau_k)$ and $\phi'(\tau_k)$. Although the nonlinear equation (1.2) that we wish to solve involves only a single variable $\tau$, the evaluation of $\phi(\tau)$ involves the solution of (LS$_\tau$), which can be a large optimization problem that is expensive to solve to full accuracy.
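On a one-dimensional toy instance the iteration (3.1) can be exercised directly, because (LS$_\tau$) then has a closed-form solution and $\phi$ is explicit. The sketch below is only illustrative (the paper evaluates $\phi$ by solving (LS$_\tau$) with the SPG method of section 4; the scalars and function names here are ours): for $A = [a_0]$ and scalar $b_0$, the solution is $x_\tau = \operatorname{sign}(b_0/a_0)\min(\tau, |b_0/a_0|)$, so $\phi(\tau) = \max(0, |b_0| - |a_0|\tau)$ and $\phi'(\tau) = -|a_0| = -\lambda_\tau$ on $(0, \tau_{\mathrm{BP}})$.

```python
# Toy 1-D instance A = [a0], b = b0, for which phi is piecewise linear.
a0, b0 = 2.0, 10.0          # illustrative values; tau_BP = |b0/a0| = 5

def phi(tau):
    """Optimal value of LS_tau for the 1-D instance: max(0, |b0| - |a0| tau)."""
    return max(0.0, abs(b0) - abs(a0) * tau)

def dphi(tau):
    """phi'(tau) = -lambda_tau = -|a0| on the interior of [0, tau_BP]."""
    return -abs(a0)

def newton_root(sigma, tau=1.0, tol=1e-10, maxiter=50):
    """Iteration (3.1): tau <- tau + (sigma - phi(tau)) / phi'(tau),
    started from tau_0 in (0, tau_BP)."""
    for _ in range(maxiter):
        if abs(phi(tau) - sigma) <= tol:
            break
        tau += (sigma - phi(tau)) / dphi(tau)
    return tau
```

Because $\phi$ is affine on $[0, \tau_{\mathrm{BP}}]$ here, a single Newton step lands on $\tau_\sigma = (|b_0| - \sigma)/|a_0|$; in general the curve is only piecewise linear and several steps are needed.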
For systems of nonlinear equations in general, inexact Newton methods assume that the Newton system analogous to the equation

$$\phi'(\tau_k) \, \Delta \tau_k = \sigma - \phi(\tau_k)$$

is solved only approximately, with a residual that is a fraction of the right-hand side. A constant fraction yields a linear convergence rate, and a fraction tending to zero yields a superlinear convergence rate (see, e.g., Nocedal and Wright [40, Theorem 7.2]). However, the inexact-Newton analysis does not apply to the case where the right-hand side (i.e., the function itself) is known only approximately, and it is therefore not possible to know a priori the accuracy required to achieve an inexact Newton-type convergence rate. This is the situation that we are faced with if (LS$_\tau$) is solved approximately. As we show below, with only approximate knowledge of the function value $\phi$, this inexact version of Newton's method still converges, although the convergence rate is sublinear. The rate can be made arbitrarily close to superlinear by increasing the accuracy with which we compute $\phi$.

**3.1. Approximate primal-dual solutions.** In this section we use the duality gap to derive an easily computable expression that bounds the accuracy of the computed function value of $\phi$. The algorithm for solving (LS$_\tau$) outlined in section 4 maintains feasibility of the iterates at all iterations. Thus, an approximate solution $\bar{x}_\tau$ and its corresponding residual $\bar{r}_\tau := b - A \bar{x}_\tau$ satisfy

$$(3.2) \qquad \|\bar{x}_\tau\|_1 \le \tau \quad \text{and} \quad \|\bar{r}_\tau\|_2 \ge \|r_\tau\|_2 > 0,$$

where the second set of inequalities holds because $\bar{x}_\tau$ is suboptimal and $\tau < \tau_{\mathrm{BP}}$. We can thus construct the approximations

$$\bar{y}_\tau := \bar{r}_\tau / \|\bar{r}_\tau\|_2 \quad \text{and} \quad \bar{\lambda}_\tau := \|A^T \bar{y}_\tau\|_\infty$$

to the dual variables, which are dual feasible, i.e., they satisfy (2.6b). The value of the dual problem (2.2) at any feasible point gives a lower bound on the optimal value $\|r_\tau\|_2$, and the value of the primal problem (2.1) at any feasible point gives an upper bound on the optimal value.
Therefore,

$$(3.3) \qquad b^T \bar{y}_\tau - \tau \bar{\lambda}_\tau \le \|r_\tau\|_2 \le \|\bar{r}_\tau\|_2.$$

We use the duality gap

$$(3.4) \qquad \delta_\tau := \|\bar{r}_\tau\|_2 - \big(b^T \bar{y}_\tau - \tau \bar{\lambda}_\tau\big)$$

to measure the quality of an approximate solution $\bar{x}_\tau$. By (3.3), $\delta_\tau$ is necessarily nonnegative.

Let $\bar{\phi}(\tau) := \|\bar{r}_\tau\|_2$ be the objective value of (LS$_\tau$) at the approximate solution $\bar{x}_\tau$. The duality gap at $\bar{x}_\tau$ provides a bound on the difference between $\bar{\phi}(\tau)$ and $\phi(\tau)$. If we additionally assume that $A$ is full rank (so that its condition number is bounded), we can also use $\delta_\tau$ to provide a bound on the difference between the derivatives $\bar{\phi}'(\tau)$ and $\phi'(\tau)$. From (3.3)–(3.4) and from Theorem 2.1(b), for all $\tau \in (0, \tau_{\mathrm{BP}})$,

$$(3.5) \qquad \bar{\phi}(\tau) - \phi(\tau) < \delta_\tau \quad \text{and} \quad |\bar{\phi}'(\tau) - \phi'(\tau)| < \gamma \, \delta_\tau$$

for some positive constant $\gamma$ that is independent of $\tau$. It follows from the definition of $\phi'$ and from standard properties of matrix norms that $\gamma$ is proportional to the condition number of $A$.
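The bounds (3.3)–(3.5) are cheap to verify numerically on an instance whose exact solution is known. The NumPy sketch below (the instance and names are ours) uses $A = I$, $b = (3, 1)$, $\tau = 1$, for which $x_\tau = (1, 0)$ and hence $\phi(\tau) = \sqrt{5}$, and evaluates the gap (3.4) at two feasible approximate solutions.

```python
import numpy as np

# Tiny instance with known exact solution x_tau = (1, 0), phi(tau) = sqrt(5).
A = np.eye(2)
b = np.array([3.0, 1.0])
tau = 1.0
phi_exact = np.sqrt(5.0)

def gap_and_phibar(x_bar):
    """Duality gap (3.4) and objective phi_bar at a feasible point x_bar,
    using the dual estimates y_bar = r_bar/||r_bar||_2 and
    lambda_bar = ||A^T y_bar||_inf of section 3.1."""
    r = b - A @ x_bar
    y = r / np.linalg.norm(r)
    lam = np.linalg.norm(A.T @ y, np.inf)
    phibar = np.linalg.norm(r)
    return phibar - (b @ y - tau * lam), phibar

delta_far, phibar_far = gap_and_phibar(np.array([0.5, 0.5]))   # crude iterate
delta_near, phibar_near = gap_and_phibar(np.array([0.9, 0.1])) # better iterate
```

Both gaps are positive, the gap shrinks as the iterate improves, and in each case the gap bounds the objective error $\bar{\phi}(\tau) - \phi(\tau)$, as (3.5) asserts.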
**3.2. Local convergence rate.** The following theorem establishes the local convergence rate of an inexact Newton method for (1.2) where $\phi$ and $\phi'$ are known only approximately.

**Theorem 3.1.** Suppose that $A$ has full rank, $\sigma \in (0, \|b\|_2)$, and $\delta_k := \delta_{\tau_k} \to 0$. Then if $\tau_0$ is close enough to $\tau_\sigma$, the iteration (3.1), with $\phi$ and $\phi'$ replaced by $\bar{\phi}$ and $\bar{\phi}'$, generates a sequence $\tau_k \to \tau_\sigma$ that satisfies

$$(3.6) \qquad |\tau_{k+1} - \tau_\sigma| = \gamma_1 \delta_k + \eta_k |\tau_k - \tau_\sigma|,$$

where $\eta_k \to 0$ and $\gamma_1$ is a positive constant.

*Proof.* Because $\phi(\tau_\sigma) = \sigma \in (0, \|b\|_2)$, (2.7) implies that $\tau_\sigma \in (0, \tau_{\mathrm{BP}})$. By Theorem 2.1 we have that $\phi(\tau)$ is continuously differentiable for all $\tau$ close enough to $\tau_\sigma$, and so by Taylor's theorem,

$$\begin{aligned} \phi(\tau_k) - \sigma &= \int_0^1 \phi'\big(\tau_\sigma + \alpha[\tau_k - \tau_\sigma]\big) \, d\alpha \cdot (\tau_k - \tau_\sigma) \\ &= \phi'(\tau_k)(\tau_k - \tau_\sigma) + \int_0^1 \Big[ \phi'\big(\tau_\sigma + \alpha[\tau_k - \tau_\sigma]\big) - \phi'(\tau_k) \Big] \, d\alpha \, (\tau_k - \tau_\sigma) \\ &= \phi'(\tau_k)(\tau_k - \tau_\sigma) + \omega(\tau_k, \tau_\sigma), \end{aligned}$$

where the remainder $\omega$ satisfies

$$(3.7) \qquad \omega(\tau_k, \tau_\sigma) / |\tau_k - \tau_\sigma| \to 0 \quad \text{as} \quad |\tau_k - \tau_\sigma| \to 0.$$

By (3.5) and because (3.2) holds for $\tau = \tau_k$, there exist positive constants $\gamma_1$ and $\gamma_2$, independent of $\tau_k$, such that

$$\left| \frac{\phi(\tau_k) - \sigma}{\phi'(\tau_k)} - \frac{\bar{\phi}(\tau_k) - \sigma}{\bar{\phi}'(\tau_k)} \right| \le \gamma_1 \delta_k \quad \text{and} \quad |\phi'(\tau_k)|^{-1} < \gamma_2.$$

Then, because $\Delta \tau_k = \big(\sigma - \bar{\phi}(\tau_k)\big) / \bar{\phi}'(\tau_k)$,

$$\begin{aligned} |\tau_{k+1} - \tau_\sigma| &= |\tau_k - \tau_\sigma + \Delta \tau_k| \\ &= \left| -\frac{\bar{\phi}(\tau_k) - \sigma}{\bar{\phi}'(\tau_k)} + \frac{1}{\phi'(\tau_k)} \big( \phi(\tau_k) - \sigma - \omega(\tau_k, \tau_\sigma) \big) \right| \\ &\le \left| \frac{\phi(\tau_k) - \sigma}{\phi'(\tau_k)} - \frac{\bar{\phi}(\tau_k) - \sigma}{\bar{\phi}'(\tau_k)} \right| + \left| \frac{\omega(\tau_k, \tau_\sigma)}{\phi'(\tau_k)} \right| \\ &\le \gamma_1 \delta_k + \gamma_2 |\omega(\tau_k, \tau_\sigma)| = \gamma_1 \delta_k + \eta_k |\tau_k - \tau_\sigma|, \end{aligned}$$

where $\eta_k := \gamma_2 |\omega(\tau_k, \tau_\sigma)| / |\tau_k - \tau_\sigma|$. With $\tau_k$ sufficiently close to $\tau_\sigma$, (3.7) implies that $\eta_k < 1$. Apply the above inequality recursively $\ell \ge 1$ times to obtain

$$|\tau_{k+\ell} - \tau_\sigma| \le \gamma_1 \sum_{i=1}^{\ell} (\eta_k)^{\ell - i} \, \delta_{k+i-1} + (\eta_k)^{\ell} \, |\tau_k - \tau_\sigma|,$$

and because $\delta_k \to 0$ and $\eta_k < 1$, it follows that $\tau_{k+\ell} \to \tau_\sigma$ as $\ell \to \infty$. Thus $\tau_k \to \tau_\sigma$, as required. By again applying (3.7), we have that $\eta_k \to 0$.

Note that if (LS$_\tau$) is solved exactly at each iteration, such that $\delta_k = 0$, then Theorem 3.1 shows that the convergence rate is superlinear, as we expect of a standard Newton iteration.
In effect, the convergence rate of the algorithm depends on the rate at which $\delta_k \to 0$. If $A$ is rank deficient, then the constant $\gamma_1$ in (3.6) is infinite; we thus expect that ill-conditioning in $A$ leads to slow convergence unless $\delta_k = 0$, i.e., $\phi$ is evaluated accurately at every iteration.
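The role of $\delta_k$ can be seen by simulating the error recursion of Theorem 3.1 with illustrative constants (the values of $\gamma_1$, $\eta$, and the starting error below are made up, and $\eta_k$ is held fixed rather than driven to zero): exact subproblem solves give geometric contraction, a constant $\delta$ stalls the error near $\gamma_1 \delta / (1 - \eta)$, and $\delta_k \to 0$ restores convergence.

```python
def simulate(deltas, gamma1=2.0, eta=0.5, e0=1.0):
    """Iterate e <- gamma1*delta_k + eta*e, an upper-bound model of (3.6).
    gamma1, eta, and e0 are illustrative, not derived from any matrix A."""
    e, errs = e0, []
    for d in deltas:
        e = gamma1 * d + eta * e
        errs.append(e)
    return errs

exact = simulate([0.0] * 30)                         # delta_k = 0
fixed = simulate([1e-3] * 30)                        # constant solve accuracy
shrinking = simulate([0.5 ** k for k in range(30)])  # delta_k -> 0
```

After 30 steps the exact-solve error is at the level of $\eta^{30}$, the fixed-accuracy error has plateaued near $\gamma_1 \delta / (1 - \eta) = 4 \times 10^{-3}$, and the shrinking-accuracy error continues to decrease, mirroring the discussion above.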