5 Proximal Toolbox
The toolbox presented in the previous section is well adapted to solving a large
number of small and medium-scale sparse decomposition problems with the square
loss, which is typical of the classical dictionary learning framework. We now present
a new software package that is adapted to a wide range of
possibly large-scale learning problems, with several combinations of losses and
regularization terms. It implements the proximal methods
of [1], and includes the proximal solvers for the tree-structured
regularization of [13], as well as the solver of [20] for
general structured sparse regularization.
The solver for structured sparse regularization norms includes a C++ max-flow
implementation of the push-relabel algorithm of [11], with
the heuristics proposed by [5].
The implementation also provides robust stopping criteria based on
duality gaps, which are presented in Appendix A, and it can handle intercepts (unregularized variables). The general formulation that our software
can solve takes the form

min_{w ∈ ℝp} f(w) + λψ(w),

where f is a smooth loss function and ψ is a regularization function.
When one optimizes a matrix W in ℝp × r instead of a vector w in ℝp, we will write

min_{W ∈ ℝp × r} f(W) + λψ(W).

Note that the software can also handle nonnegativity constraints.
We start by presenting the types of regularization implemented in the software.
5.1 Regularization Functions
Our software can handle the following regularization functions ψ for vectors w in ℝp:
- The Tikhonov regularization: ψ(w) ≜ 1/2||w||22.
- The ℓ1-norm: ψ(w) ≜ ||w||1.
- The Elastic-Net: ψ(w) ≜ ||w||1 + γ||w||22.
- The Fused-Lasso: ψ(w) ≜ ||w||1 + γ||w||22 + γ2∑i=1p−1 |wi+1−wi|.
- The group Lasso: ψ(w) ≜ ∑g ∈ G ηg ||wg||2, where the groups in G are disjoint blocks of consecutive variables of equal size.
- The group Lasso with ℓ∞-norm: ψ(w) ≜ ∑g ∈ G ηg ||wg||∞, where the groups in G are disjoint blocks of consecutive variables of equal size.
- The sparse group Lasso: same as above, but with an additional ℓ1 term.
- The tree-structured sum of ℓ2-norms: ψ(w) ≜ ∑g ∈ G ηg ||wg||2, where G is a tree-structured set of groups [13], and the ηg are positive weights.
- The tree-structured sum of ℓ∞-norms: ψ(w) ≜ ∑g ∈ G ηg ||wg||∞. See [13].
- General sum of ℓ∞-norms: ψ(w) ≜ ∑g ∈ G ηg ||wg||∞, where no assumptions are made on the groups in G.
Our software also handles regularization functions ψ on matrices W in ℝp × r (note that W can be transposed in these formulations). In particular,
- The ℓ1/ℓ2-norm: ψ(W) ≜ ∑i=1p ||Wi||2, where Wi denotes the i-th row of W.
- The ℓ1/ℓ∞-norm: ψ(W) ≜ ∑i=1p ||Wi||∞.
- The ℓ1/ℓ2+ℓ1-norm: ψ(W) ≜ ∑i=1p ||Wi||2 + λ2∑i,j |Wij|.
- The ℓ1/ℓ∞+ℓ1-norm: ψ(W) ≜ ∑i=1p ||Wi||∞ + λ2∑i,j |Wij|.
- The ℓ1/ℓ∞-norm on rows and columns: ψ(W) ≜ ∑i=1p ||Wi||∞ + λ2∑j=1r ||Wj||∞, where Wj denotes the j-th column of W.
- The multi-task tree-structured sum of ℓ∞-norms:

ψ(W) ≜ ∑i=1r ∑g ∈ G ηg ||wgi||∞ + γ ∑g ∈ G ηg maxj ∈ g ||Wj||∞,    (23)

where the first double sum is in fact a sum of independent structured norms on the columns wi of W, and the second term is a tree-structured regularization norm applied to the ℓ∞-norms of the rows Wj of W, thereby inducing the tree-structured regularization at the row level. Here, G is a tree-structured set of groups.
- The multi-task general sum of ℓ∞-norms is the same as Eq. (23) except that the groups G are general overlapping groups.
- The trace norm: ψ(W) =▵ ||W||*.
Non-convex regularizations are also implemented with the ISTA algorithm (no duality gaps are of course provided in these cases):
- The ℓ0-pseudo-norm: ψ(w) ≜ ||w||0.
- The rank: ψ(W) ≜ rank(W).
- The tree-structured ℓ0-pseudo-norm: ψ(w) ≜ ∑g ∈ G δg(w), where δg(w)=1 if wg ≠ 0 and 0 otherwise, and G is a tree-structured set of groups.
All of these regularization terms for vectors or matrices can be coupled with
nonnegativity constraints. It is also possible to add an intercept, which is left
unregularized; we include this possibility in the next
sections. A few additional undocumented options are available in the source code.
We now present the three functions for computing the proximal operators associated with the previous regularization functions.
5.2 Function mexProximalFlat
This function computes the proximal operators associated with many regularization functions. Given an input matrix U=[u1,…,un] in ℝp × n, it returns a matrix V=[v1,…,vn] in ℝp × n such that:
• If one chooses a regularization function on vectors, for every column u of U, it computes one column v of V solving
min_{v ∈ ℝp} 1/2||u−v||22 + λ||v||0,    (24)

or

min_{v ∈ ℝp} 1/2||u−v||22 + λ||v||1,    (25)

or

min_{v ∈ ℝp} 1/2||u−v||22 + λ||v||22,    (26)

or

min_{v ∈ ℝp} 1/2||u−v||22 + λ||v||1 + λ2||v||22,    (27)

or

min_{v ∈ ℝp} 1/2||u−v||22 + λ∑i=1p−1 |vi+1−vi| + λ2||v||1 + λ3||v||22,    (28)

or

min_{v ∈ ℝp} 1/2||u−v||22 + λ∑g ∈ T δg(v),    (29)
where T is a tree-structured set of groups (see [14]), and δg(v) = 0 if vg=0 and 1 otherwise.
It can also solve
min_{v ∈ ℝp} 1/2||u−v||22 + λ∑g ∈ T ηg ||vg||2,    (30)

or

min_{v ∈ ℝp} 1/2||u−v||22 + λ∑g ∈ T ηg ||vg||∞,    (31)

or

min_{v ∈ ℝp} 1/2||u−v||22 + λ∑g ∈ G ηg ||vg||∞,    (32)
where G is any kind of set of groups.
This function can also solve the following proximal problems on matrices:
min_{V ∈ ℝp × n} 1/2||U−V||F2 + λ∑i=1p ||Vi||2,    (33)

where Vi is the i-th row of V, or

min_{V ∈ ℝp × n} 1/2||U−V||F2 + λ∑i=1p ||Vi||∞,    (34)

or

min_{V ∈ ℝp × n} 1/2||U−V||F2 + λ∑i=1p ||Vi||2 + λ2∑i,j |Vij|,    (35)

or

min_{V ∈ ℝp × n} 1/2||U−V||F2 + λ∑i=1p ||Vi||∞ + λ2∑i,j |Vij|,    (36)

or

min_{V ∈ ℝp × n} 1/2||U−V||F2 + λ∑i=1p ||Vi||∞ + λ2∑j=1n ||Vj||∞,    (37)
where Vj is the j-th column of V.
See usage details below:
%
% Usage: alpha=mexProximalFlat(U,param);
%
% Name: mexProximalFlat
%
% Description: mexProximalFlat computes proximal operators. Depending
% on the value of param.regul, it computes
%
% Given an input matrix U=[u^1,\ldots,u^n], it computes a matrix
% V=[v^1,\ldots,v^n] such that
% if one chooses a regularization function on vectors, it computes
% for each column u of U, a column v of V solving
% if param.regul='l0'
% argmin 0.5||u-v||_2^2 + lambda||v||_0
% if param.regul='l1'
% argmin 0.5||u-v||_2^2 + lambda||v||_1
% if param.regul='l2'
% argmin 0.5||u-v||_2^2 + lambda||v||_2^2
% if param.regul='elastic-net'
% argmin 0.5||u-v||_2^2 + lambda||v||_1 + lambda_2||v||_2^2
% if param.regul='fused-lasso'
% argmin 0.5||u-v||_2^2 + lambda FL(v) + ...
% ... lambda_2||v||_1 + lambda_3||v||_2^2
% if param.regul='linf'
% argmin 0.5||u-v||_2^2 + lambda||v||_inf
% if param.regul='l2-not-squared'
% argmin 0.5||u-v||_2^2 + lambda||v||_2
% if param.regul='group-lasso-l2'
% argmin 0.5||u-v||_2^2 + lambda sum_g ||v_g||_2
% where the groups are consecutive entries of v of size param.group_size
% if param.regul='group-lasso-linf'
% argmin 0.5||u-v||_2^2 + lambda sum_g ||v_g||_inf
% if param.regul='sparse-group-lasso-l2'
% argmin 0.5||u-v||_2^2 + lambda sum_g ||v_g||_2 + lambda_2 ||v||_1
% where the groups are consecutive entries of v of size param.group_size
% if param.regul='sparse-group-lasso-linf'
% argmin 0.5||u-v||_2^2 + lambda sum_g ||v_g||_inf + lambda_2 ||v||_1
% if param.regul='trace-norm-vec'
% argmin 0.5||u-v||_2^2 + lambda ||mat(v)||_*
% where mat(v) has param.group_size rows
%
% if one chooses a regularization function on matrices
% if param.regul='l1l2', V=
% argmin 0.5||U-V||_F^2 + lambda||V||_{1/2}
% if param.regul='l1linf', V=
% argmin 0.5||U-V||_F^2 + lambda||V||_{1/inf}
% if param.regul='l1l2+l1', V=
% argmin 0.5||U-V||_F^2 + lambda||V||_{1/2} + lambda_2||V||_{1/1}
% if param.regul='l1linf+l1', V=
% argmin 0.5||U-V||_F^2 + lambda||V||_{1/inf} + lambda_2||V||_{1/1}
% if param.regul='l1linf+row-column', V=
% argmin 0.5||U-V||_F^2 + lambda||V||_{1/inf} + lambda_2||V'||_{1/inf}
% if param.regul='trace-norm', V=
% argmin 0.5||U-V||_F^2 + lambda||V||_*
% if param.regul='rank', V=
% argmin 0.5||U-V||_F^2 + lambda rank(V)
% if param.regul='none', V=
% argmin 0.5||U-V||_F^2
%
% for all these regularizations, it is possible to enforce non-negativity constraints
% with the option param.pos, and to prevent the last row of U from being regularized with
% the option param.intercept
%
% Inputs: U: double m x n matrix (input signals)
% m is the signal size
% param: struct
% param.lambda (regularization parameter)
% param.regul (choice of regularization, see above)
% param.lambda2 (optional, regularization parameter)
% param.lambda3 (optional, regularization parameter)
% param.verbose (optional, verbosity level, false by default)
% param.intercept (optional, last row of U is not regularized,
% false by default)
% param.transpose (optional, transpose the matrix in the regularization function)
% param.group_size (optional, for regularization functions assuming a group
% structure)
% param.pos (optional, adds positivity constraints on the
% coefficients, false by default)
% param.numThreads (optional, number of threads for exploiting
% multi-core / multi-cpus. By default, it takes the value -1,
% which automatically selects all the available CPUs/cores).
%
% Output: alpha: double m x n matrix (output coefficients)
%
% Author: Julien Mairal, 2010
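As an illustration, here is a minimal MATLAB sketch of a call to mexProximalFlat following the usage documented above; the data and parameter values are hypothetical and only meant to show the calling convention:

% Minimal sketch (hypothetical data): prox of the l1-norm, Eq. (25),
% i.e., soft-thresholding applied to each column of U.
U = randn(100,10);             % 10 input signals of size p=100
param.regul = 'l1';            % choice of regularization
param.lambda = 0.1;            % regularization parameter
param.numThreads = -1;         % use all available cores
V = mexProximalFlat(U,param);  % V(:,i) is the soft-thresholded version of U(:,i)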
5.3 Function mexProximalTree
This function computes the proximal operators associated with tree-structured regularization functions. Given input signals U=[u1,…,un] in ℝp × n and a tree-structured set of groups T [13], it computes a matrix V=[v1,…,vn] in ℝp × n. When one uses a regularization function on vectors, it computes a column v of V for every column u of U by solving
min_{v ∈ ℝp} 1/2||u−v||22 + λ∑g ∈ T ηg ||vg||2,    (38)

or

min_{v ∈ ℝp} 1/2||u−v||22 + λ∑g ∈ T ηg ||vg||∞,    (39)

or

min_{v ∈ ℝp} 1/2||u−v||22 + λ∑g ∈ T δg(v),    (40)
where δg(v)=0 if vg=0 and 1 otherwise (see appendix of [14]).
When the multi-task tree-structured regularization function is used, it solves
min_{V ∈ ℝp × n} 1/2||U−V||F2 + λ∑i=1n ∑g ∈ T ηg ||vgi||∞ + λ2∑g ∈ T ηg maxj ∈ g ||Vj||∞,    (41)

where Vj denotes the j-th row of V. This formulation is presented in [20].
This function can also be used to compute the proximal operators addressed by mexProximalFlat (it will simply not take the tree structure into account). The way the tree is encoded is presented below; examples are given in the file test_ProximalTree.m, along with more usage details:
%
% Usage: alpha=mexProximalTree(U,tree,param);
%
% Name: mexProximalTree
%
% Description: mexProximalTree computes a proximal operator. Depending
% on the value of param.regul, it computes
%
% Given an input matrix U=[u^1,\ldots,u^n], and a tree-structured set of groups T,
% it returns a matrix V=[v^1,\ldots,v^n]:
%
% when the regularization function is for vectors,
% for every column u of U, it computes a column v of V solving
% if param.regul='tree-l0'
% argmin 0.5||u-v||_2^2 + lambda \sum_{g \in T} \delta^g(v)
% if param.regul='tree-l2'
% for all i, v^i =
% argmin 0.5||u-v||_2^2 + lambda\sum_{g \in T} \eta_g||v_g||_2
% if param.regul='tree-linf'
% for all i, v^i =
% argmin 0.5||u-v||_2^2 + lambda\sum_{g \in T} \eta_g||v_g||_inf
%
% when the regularization function is for matrices:
% if param.regul='multi-task-tree'
% V=argmin 0.5||U-V||_F^2 + lambda \sum_{i=1}^n\sum_{g \in T} \eta_g||v^i_g||_inf + ...
% lambda_2 \sum_{g \in T} \eta_g max_{j in g}||V_j||_{inf}
%
% it can also be used with any non-tree-structured regularization addressed by mexProximalFlat
%
% for all these regularizations, it is possible to enforce non-negativity constraints
% with the option param.pos, and to prevent the last row of U from being regularized with
% the option param.intercept
%
% Inputs: U: double m x n matrix (input signals)
% m is the signal size
% tree: struct
% with four fields, eta_g, groups, own_variables and N_own_variables.
%
% The tree structure requires a particular organization of groups and variables
% * Let us denote by N = |T|, the number of groups.
% the groups should be ordered T={g1,g2,\ldots,gN} such that if gi is included
% in gj, then j <= i. g1 should be the group at the root of the tree
% and contains every variable.
% * Every group is a set of contiguous indices for instance
% gi={3,4,5} or gi={4,5,6,7} or gi={4}, but not {3,5};
% * We define root(gi) as the indices of the variables that are in gi,
% but not in its descendants. For instance for
% T={ g1={1,2,3,4},g2={2,3},g3={4} }, then, root(g1)={1},
% root(g2)={2,3}, root(g3)={4},
% We assume that for all i, root(gi) is a set of contiguous variables
% * We assume that the smallest of root(gi) is also the smallest index of gi.
%
% For instance,
% T={ g1={1,2,3,4},g2={2,3},g3={4} }, is a valid set of groups.
% but we can not have
% T={ g1={1,2,3,4},g2={1,2},g3={3} }, since root(g1)={4} and 4 is not the
% smallest element in g1.
%
% We do not lose generality with these assumptions since they can be fulfilled for any
% tree-structured set of groups after a permutation of variables and a correct ordering of the
% groups.
% see more examples in test_ProximalTree.m of valid tree-structured sets of groups.
%
% The first field sets the weights of every group
% tree.eta_g double N vector
%
% The next field sets inclusion relations between groups
% (but not between groups and variables):
% tree.groups sparse (double or boolean) N x N matrix
% the (i,j) entry is non-zero if and only if i is different than j and
% gi is included in gj.
% the first column corresponds to the group at the root of the tree.
%
% The next field defines the smallest index of each group gi,
% which is also the smallest index of root(gi)
% tree.own_variables int32 N vector
%
% The next field defines, for each group gi, the size of root(gi)
% tree.N_own_variables int32 N vector
%
% examples are given in test_ProximalTree.m
%
% param: struct
% param.lambda (regularization parameter)
% param.regul (choice of regularization, see above)
% param.lambda2 (optional, regularization parameter)
% param.lambda3 (optional, regularization parameter)
% param.verbose (optional, verbosity level, false by default)
% param.intercept (optional, last row of U is not regularized,
% false by default)
% param.pos (optional, adds positivity constraints on the
% coefficients, false by default)
% param.numThreads (optional, number of threads for exploiting
% multi-core / multi-cpus. By default, it takes the value -1,
% which automatically selects all the available CPUs/cores).
%
% Output: alpha: double m x n matrix (output coefficients)
%
% Author: Julien Mairal, 2010
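To make the encoding concrete, here is a minimal MATLAB sketch building the valid tree T={ g1={1,2,3,4}, g2={2,3}, g3={4} } described above and calling mexProximalTree. The weights and parameter values are hypothetical, and the 0-based convention used for own_variables follows the examples of test_ProximalTree.m:

% Minimal sketch (hypothetical weights): encode T={g1={1,2,3,4},g2={2,3},g3={4}}.
% groups(i,j) is non-zero iff gi is included in gj: here g2 and g3 are included in g1.
tree.eta_g = [1 1 1];                          % one positive weight per group
tree.groups = sparse([0 0 0; 1 0 0; 1 0 0]);   % inclusion relations between groups
tree.own_variables = int32([0 1 3]);           % smallest index of each root(gi)
                                               % (0-based, as in test_ProximalTree.m)
tree.N_own_variables = int32([1 2 1]);         % |root(g1)|=1, |root(g2)|=2, |root(g3)|=1
U = randn(4,5);                                % 5 input signals of size p=4
param.regul = 'tree-l2';                       % Eq. (38)
param.lambda = 0.1;
V = mexProximalTree(U,tree,param);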
5.4 Function mexProximalGraph
This function computes the proximal operators associated with structured sparse regularization. Given input signals U=[u1,…,un] in ℝp × n and a set of groups G [20], it returns a matrix V=[v1,…,vn] in ℝp × n.
When one uses a regularization function on vectors, it computes a column v of V for every column u of U by solving

min_{v ∈ ℝp} 1/2||u−v||22 + λ∑g ∈ G ηg ||vg||∞,    (42)

or, with a regularization function on matrices, it computes V by solving

min_{V ∈ ℝp × n} 1/2||U−V||F2 + λ∑i=1n ∑g ∈ G ηg ||vgi||∞ + λ2∑g ∈ G ηg maxj ∈ g ||Vj||∞,    (43)

where Vj denotes the j-th row of V.
This function can also be used to compute the proximal operators addressed by mexProximalFlat. The way the graph is encoded is presented below (see also the example file test_ProximalGraph.m), with more usage details:
%
% Usage: alpha=mexProximalGraph(U,graph,param);
%
% Name: mexProximalGraph
%
% Description: mexProximalGraph computes a proximal operator. Depending
% on the value of param.regul, it computes
%
% Given an input matrix U=[u^1,\ldots,u^n], and a set of groups G,
% it computes a matrix V=[v^1,\ldots,v^n] such that
%
% if param.regul='graph'
% for every column u of U, it computes a column v of V solving
% argmin 0.5||u-v||_2^2 + lambda\sum_{g \in G} \eta_g||v_g||_inf
%
% if param.regul='graph+ridge'
% for every column u of U, it computes a column v of V solving
% argmin 0.5||u-v||_2^2 + lambda\sum_{g \in G} \eta_g||v_g||_inf + lambda_2||v||_2^2
%
%
% if param.regul='multi-task-graph'
% V=argmin 0.5||U-V||_F^2 + lambda \sum_{i=1}^n\sum_{g \in G} \eta_g||v^i_g||_inf + ...
% lambda_2 \sum_{g \in G} \eta_g max_{j in g}||V_j||_{inf}
%
% it can also be used with any regularization addressed by mexProximalFlat
%
% for all these regularizations, it is possible to enforce non-negativity constraints
% with the option param.pos, and to prevent the last row of U from being regularized with
% the option param.intercept
%
% Inputs: U: double p x n matrix (input signals)
% p is the signal size
% graph: struct
% with three fields, eta_g, groups, and groups_var
%
% The first field sets the weights of every group
% graph.eta_g double N vector
%
% The next field sets inclusion relations between groups
% (but not between groups and variables):
% graph.groups sparse (double or boolean) N x N matrix
% the (i,j) entry is non-zero if and only if i is different than j and
% gi is included in gj.
%
% The next field sets inclusion relations between groups and variables
% graph.groups_var sparse (double or boolean) p x N matrix
% the (i,j) entry is non-zero if and only if the variable i is included
% in gj, but not in any children of gj.
%
% examples are given in test_ProximalGraph.m
%
% param: struct
% param.lambda (regularization parameter)
% param.regul (choice of regularization, see above)
% param.lambda2 (optional, regularization parameter)
% param.lambda3 (optional, regularization parameter)
% param.verbose (optional, verbosity level, false by default)
% param.intercept (optional, last row of U is not regularized,
% false by default)
% param.pos (optional, adds positivity constraints on the
% coefficients, false by default)
% param.numThreads (optional, number of threads for exploiting
% multi-core / multi-cpus. By default, it takes the value -1,
% which automatically selects all the available CPUs/cores).
%
% Output: alpha: double p x n matrix (output coefficients)
%
% Author: Julien Mairal, 2010
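As an illustration, here is a minimal MATLAB sketch encoding two overlapping groups g1={1,2,3} and g2={3,4,5} over p=5 variables and calling mexProximalGraph; the sizes and parameter values are hypothetical:

% Minimal sketch (hypothetical sizes): two overlapping groups on p=5 variables.
% groups_var(i,j) is non-zero iff variable i belongs to gj (no group has children here).
graph.eta_g = [1 1];                                  % one positive weight per group
graph.groups = sparse(2,2);                           % no group is included in another
graph.groups_var = sparse([1 0; 1 0; 1 1; 0 1; 0 1]); % variable 3 belongs to both groups
U = randn(5,10);                                      % 10 input signals of size p=5
param.regul = 'graph';                                % Eq. (42)
param.lambda = 0.1;
V = mexProximalGraph(U,graph,param);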
Having presented the regularization terms that our software can handle, we now present the various formulations that it addresses.
5.5 Problems Addressed
We present here regression or classification formulations and their multi-task variants.
5.5.1 Regression Problems with the Square Loss
Given a training set {xi,yi}i=1n, with xi ∈ ℝp and yi ∈ ℝ for all i in [1;n], we address

min_{w ∈ ℝp, b ∈ ℝ} 1/2 ∑i=1n (yi − w⊤xi − b)2 + λψ(w),

where b is an optional variable acting as an “intercept”, which is not regularized, and ψ
can be any of the regularization functions presented above.
Let us consider the vector y in ℝn that carries the entries yi.
The problem without the intercept takes the following form, which we have already
encountered in the previous toolbox, albeit with different notation:

min_{w ∈ ℝp} 1/2||y−Xw||22 + λψ(w),

where X=[x1,…,xn]⊤ (the xi’s are here the rows of X).
5.5.2 Classification Problems with the Logistic Loss
The next formulation that our software can solve is the regularized logistic regression formulation.
We are again given a training set {xi,yi}i=1n, with xi ∈
ℝp, but the variables yi are now in {−1,+1} for all i in
[1;n]. The optimization problem we address is

min_{w ∈ ℝp, b ∈ ℝ} 1/n ∑i=1n log(1+e−yi(w⊤xi+b)) + λψ(w),
with again ψ taken to be one of the regularization functions presented above, and b an optional intercept.
5.5.3 Multi-class Classification Problems with the Softmax Loss
We have also implemented a multi-class logistic classifier (or softmax).
For a classification problem with r classes, we are given a training set {xi,yi}i=1n, where the variables xi are still vectors in ℝp, but the yi’s now take integer values in {1,2,…,r}. The formulation we address is the following multi-class learning problem:

min_{W ∈ ℝp × r, b ∈ ℝr} 1/n ∑i=1n log(∑j=1r e(wj−wyi)⊤xi + bj−byi) + λ∑j=1r ψ(wj),    (44)
where W = [w1,…,wr] and the optional vector b in ℝr carries intercepts for each class.
5.5.4 Multi-task Regression Problems with the Square Loss
We are now considering a problem with r tasks, and a training set
{xi,yi}i=1n, where the variables xi are still vectors in ℝp, and yi
is a vector in ℝr. We are looking for r regression vectors wj, for j ∈ [1;r], or equivalently for a matrix W=[w1,…,wr] in ℝp × r. The formulation we address is the following
multi-task regression problem:

min_{W ∈ ℝp × r, b ∈ ℝr} 1/2 ∑i=1n ||yi − W⊤xi − b||22 + λψ(W),

where ψ is any of the regularization functions on matrices we have presented in the previous section.
Note that by introducing the matrix Y=[y1,…,yn]⊤ in ℝn × r, the problem without intercept can be equivalently rewritten

min_{W ∈ ℝp × r} 1/2||Y−XW||F2 + λψ(W).
5.5.5 Multi-task Classification Problems with the Logistic Loss
The multi-task version of the logistic regression follows the same principle.
We consider r tasks, and a training set
{xi,yi}i=1n, with the xi’s in ℝp, and the yi’s
are vectors in {−1,+1}r. We look for a matrix W=[w1,…,wr] in ℝp × r. The formulation is the following
multi-task classification problem:

min_{W ∈ ℝp × r, b ∈ ℝr} 1/n ∑i=1n ∑j=1r log(1+e−yij(wj⊤xi+bj)) + λψ(W).
5.5.6 Multi-task and Multi-class Classification Problems with the Softmax Loss
The multi-task/multi-class version directly follows from the formulation of Eq. (44), but associates a task with each class and, as a consequence, regularizes the matrix W in a particular way:

min_{W ∈ ℝp × r, b ∈ ℝr} 1/n ∑i=1n log(∑j=1r e(wj−wyi)⊤xi + bj−byi) + λψ(W).
How duality gaps are computed for any of these formulations is presented in Appendix A.
We now present the main functions for solving these problems.
5.6 Function mexFistaFlat
Given a matrix X=[x1,…,xm]⊤ in ℝm × p and a matrix Y=[y1,…,yn] in ℝm × n, this function solves the optimization problems presented in the previous section, with the same regularization functions as mexProximalFlat.
See usage details below:
%
% Usage: W=mexFistaFlat(Y,X,W0,param);
%
% Name: mexFistaFlat
%
% Description: mexFistaFlat solves sparse regularized problems.
% X is a design matrix of size m x p
% X=[x^1,...,x^m]', where the x^i's are the rows of X
% Y=[y^1,...,y^n] is a matrix of size m x n
% It implements the algorithms FISTA, ISTA and subgradient descent.
%
% - if param.loss='square' and param.regul is a regularization function for vectors,
% the entries of Y are real-valued, W = [w^1,...,w^n] is a matrix of size p x n
% For all column y of Y, it computes a column w of W such that
% w = argmin 0.5||y- X w||_2^2 + lambda psi(w)
%
% - if param.loss='square' and param.regul is a regularization function for matrices
% the entries of Y are real-valued, W is a matrix of size p x n.
% It computes the matrix W such that
% W = argmin 0.5||Y- X W||_F^2 + lambda psi(W)
%
% - param.loss='square-missing' : same as param.loss='square', but handles missing data
% represented by NaN (not a number) in the matrix Y
%
% - if param.loss='logistic' and param.regul is a regularization function for vectors,
% the entries of Y are either -1 or +1, W = [w^1,...,w^n] is a matrix of size p x n
% For all column y of Y, it computes a column w of W such that
% w = argmin (1/m)sum_{j=1}^m log(1+e^(-y_j x^j' w)) + lambda psi(w),
% where x^j is the j-th row of X.
%
% - if param.loss='logistic' and param.regul is a regularization function for matrices
% the entries of Y are either -1 or +1, W is a matrix of size p x n
% W = argmin sum_{i=1}^n(1/m)sum_{j=1}^m log(1+e^(-y^i_j x^j' w^i)) + lambda psi(W)
%
% - if param.loss='multi-logistic' and param.regul is a regularization function for vectors,
% the entries of Y are in {0,1,...,N} where N is the total number of classes
% W = [W^1,...,W^n] is a matrix of size p x Nn, each submatrix W^i is of size p x N
% for each submatrix WW of W, and each column y of Y, it computes
% WW = argmin (1/m)sum_{j=1}^m log(sum_{k=1}^N e^(x^j'(ww^k-ww^{y_j}))) + lambda sum_{k=1}^N psi(ww^k),
% where ww^k is the k-th column of WW.
%
% - if param.loss='multi-logistic' and param.regul is a regularization function for matrices,
% the entries of Y are in {0,1,...,N} where N is the total number of classes
% W is a matrix of size p x N, it computes
% W = argmin (1/m)sum_{j=1}^m log(sum_{k=1}^N e^(x^j'(w^k-w^{y_j}))) + lambda psi(W)
% where w^k is the k-th column of W.
%
% - param.loss='cur' : useful to perform sparse CUR matrix decompositions,
% W = argmin 0.5||Y-X*W*X||_F^2 + lambda psi(W)
%
%
% The regularization functions psi are those used by mexProximalFlat (see documentation)
%
% This function can also handle intercepts (last row of W is not regularized),
% and/or non-negativity constraints on W, and sparse matrices for X
%
% Inputs: Y: double dense m x n matrix
% X: double dense or sparse m x p matrix
% W0: double dense p x n matrix or p x Nn matrix (for multi-logistic loss)
% initial guess
% param: struct
% param.loss (choice of loss, see above)
% param.regul (choice of regularization, see function mexProximalFlat)
% param.lambda (regularization parameter)
% param.lambda2 (optional, regularization parameter, 0 by default)
% param.lambda3 (optional, regularization parameter, 0 by default)
% param.verbose (optional, verbosity level, false by default)
% param.intercept (optional, last row of W is not regularized,
% false by default)
% param.pos (optional, adds positivity constraints on the
% coefficients, false by default)
% param.numThreads (optional, number of threads for exploiting
% multi-core / multi-cpus. By default, it takes the value -1,
% which automatically selects all the available CPUs/cores).
% param.max_it (optional, maximum number of iterations, 100 by default)
% param.tol (optional, tolerance for the stopping criterion, which is a relative duality gap
% if it is available, or a relative change of parameters).
% param.gamma (optional, multiplier for increasing the parameter L in fista, 1.5 by default)
% param.L0 (optional, initial parameter L in fista, 0.1 by default, should be small enough)
% param.fixed_step (optional, deactivates the line search for L in fista and uses param.L0 instead)
% param.compute_gram (optional, pre-compute X^TX, false by default).
% param.intercept (optional, do not regularize last row of W, false by default).
% param.ista (optional, use ista instead of fista, false by default).
% param.subgrad (optional, if not param.ista, use subgradient descent instead of fista, false by default).
% param.a, param.b (optional, if param.subgrad, the gradient step is a/(t+b))
% also similar options as mexProximalFlat
%
% the function also implements the ADMM algorithm via an option param.admm=true. It is not documented
% and you need to look at the source code to use it.
%
% Output: W: double dense p x n matrix or p x Nn matrix (for multi-logistic loss)
%
% Author: Julien Mairal, 2010
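For illustration, here is a minimal MATLAB sketch solving an ℓ1-regularized least-squares (Lasso) problem with mexFistaFlat; the data sizes and parameter values are hypothetical:

% Minimal sketch (hypothetical data): a Lasso problem solved with FISTA.
X = randn(100,200);                   % design matrix: m=100 samples, p=200 variables
w_true = [randn(10,1); zeros(190,1)]; % sparse ground-truth vector
Y = X*w_true + 0.01*randn(100,1);     % noisy observations (a single column, n=1)
W0 = zeros(200,1);                    % initial guess
param.loss = 'square';
param.regul = 'l1';
param.lambda = 0.1;
param.max_it = 200;
param.tol = 1e-4;
W = mexFistaFlat(Y,X,W0,param);       % W should be sparse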
5.7 Function mexFistaTree
Given a matrix X=[x1,…,xm]⊤ in ℝm × p and a matrix Y=[y1,…,yn] in ℝm × n, this function solves the optimization problems presented in the previous section, with the same regularization functions as mexProximalTree.
See usage details below:
%
% Usage: W=mexFistaTree(Y,X,W0,tree,param);
%
% Name: mexFistaTree
%
% Description: mexFistaTree solves sparse regularized problems.
% X is a design matrix of size m x p
% X=[x^1,...,x^m]', where the x^i's are the rows of X
% Y=[y^1,...,y^n] is a matrix of size m x n
% It implements the algorithms FISTA, ISTA and subgradient descent for solving
%
% min_W loss(W) + lambda psi(W)
%
% The regularization functions psi are those used by mexProximalTree (see documentation).
% For the loss functions, see the documentation of mexFistaFlat
%
% This function can also handle intercepts (last row of W is not regularized),
% and/or non-negativity constraints on W and sparse matrices X
%
% Inputs: Y: double dense m x n matrix
% X: double dense or sparse m x p matrix
% W0: double dense p x n matrix or p x Nn matrix (for multi-logistic loss)
% initial guess
% tree: struct (see documentation of mexProximalTree)
% param: struct
% param.loss (choice of loss, see above)
% param.regul (choice of regularization, see function mexProximalFlat)
% param.lambda (regularization parameter)
% param.lambda2 (optional, regularization parameter, 0 by default)
% param.lambda3 (optional, regularization parameter, 0 by default)
% param.verbose (optional, verbosity level, false by default)
% param.intercept (optional, last row of W is not regularized,
% false by default)
% param.pos (optional, adds positivity constraints on the
% coefficients, false by default)
% param.numThreads (optional, number of threads for exploiting
% multi-core / multi-cpus. By default, it takes the value -1,
% which automatically selects all the available CPUs/cores).
% param.max_it (optional, maximum number of iterations, 100 by default)
% param.tol (optional, tolerance for the stopping criterion, which is a relative duality gap
% if it is available, or a relative change of parameters).
% param.gamma (optional, multiplier for increasing the parameter L in fista, 1.5 by default)
% param.L0 (optional, initial parameter L in fista, 0.1 by default, should be small enough)
% param.fixed_step (optional, deactivates the line search for L in fista and uses param.L0 instead)
% param.compute_gram (optional, pre-compute X^TX, false by default).
% param.intercept (optional, do not regularize last row of W, false by default).
% param.ista (optional, use ista instead of fista, false by default).
% param.subgrad (optional, if not param.ista, use subgradient descent instead of fista, false by default).
% param.a, param.b (optional, if param.subgrad, the gradient step is a/(t+b))
% also similar options as mexProximalTree
%
% the function also implements the ADMM algorithm via an option param.admm=true. It is not documented
% and you need to look at the source code to use it.
%
% Output: W: double dense p x n matrix or p x Nn matrix (for multi-logistic loss)
%
% Author: Julien Mairal, 2010
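Reusing the tree structure built in the mexProximalTree example above, a tree-structured regularized regression could be solved as follows (a sketch with hypothetical data):

% Minimal sketch (hypothetical data): regression with the tree-structured
% regularization of Eq. (38); 'tree' is the struct built in the
% mexProximalTree example above (p=4 variables).
X = randn(50,4);                     % m=50 samples, p=4 variables
Y = randn(50,1);                     % a single response vector (n=1)
W0 = zeros(4,1);                     % initial guess
param.loss = 'square';
param.regul = 'tree-l2';
param.lambda = 0.05;
W = mexFistaTree(Y,X,W0,tree,param);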
5.8 Function mexFistaGraph
Given a matrix X=[x1,…,xm]⊤ in ℝm × p and a matrix Y=[y1,…,yn] in ℝm × n, this function solves the optimization problems presented in the previous section, with the same regularization functions as mexProximalGraph.
See usage details below:
%
% Usage: W=mexFistaGraph(Y,X,W0,graph,param);
%
% Name: mexFistaGraph
%
% Description: mexFistaGraph solves sparse regularized problems.
% X is a design matrix of size m x p
% X=[x^1,...,x^m]', where the x^i's are the rows of X
% Y=[y^1,...,y^n] is a matrix of size m x n
% It implements the algorithms FISTA, ISTA and subgradient descent for solving
%
% min_W loss(W) + lambda psi(W)
%
% The regularization functions psi are those used by mexProximalGraph (see documentation).
% For the loss functions, see the documentation of mexFistaFlat
%
% This function can also handle intercepts (last row of W is not regularized),
% and/or non-negativity constraints on W.
%
% Inputs: Y: double dense m x n matrix
% X: double dense or sparse m x p matrix
% W0: double dense p x n matrix or p x Nn matrix (for multi-logistic loss)
% initial guess
% graph: struct (see documentation of mexProximalGraph)
% param: struct
% param.loss (choice of loss, see above)
% param.regul (choice of regularization, see function mexProximalFlat)
% param.lambda (regularization parameter)
% param.lambda2 (optional, regularization parameter, 0 by default)
% param.lambda3 (optional, regularization parameter, 0 by default)
% param.verbose (optional, verbosity level, false by default)
% param.intercept (optional, last row of W is not regularized,
% false by default)
% param.pos (optional, adds positivity constraints on the
% coefficients, false by default)
% param.numThreads (optional, number of threads for exploiting
% multi-core / multi-cpus. By default, it takes the value -1,
% which automatically selects all the available CPUs/cores).
% param.max_it (optional, maximum number of iterations, 100 by default)
% param.tol (optional, tolerance for the stopping criterion, which is a relative duality gap
% if it is available, or a relative change of parameters).
% param.gamma (optional, multiplier for increasing the parameter L in fista, 1.5 by default)
% param.L0 (optional, initial parameter L in fista, 0.1 by default, should be small enough)
% param.fixed_step (optional, deactivates the line search for L in fista and uses param.L0 instead)
% param.compute_gram (optional, pre-compute X^TX, false by default).
% param.intercept (optional, do not regularize last row of W, false by default).
% param.ista (optional, use ista instead of fista, false by default).
% param.subgrad (optional, if not param.ista, use subgradient descent instead of fista, false by default).
% param.a, param.b (optional, if param.subgrad, the gradient step is a/(t+b))
% also similar options as mexProximalGraph
%
% the function also implements the ADMM algorithm via an option param.admm=true. It is not documented
% and you need to look at the source code to use it.
%
%
% Output: W: double dense p x n matrix or p x Nn matrix (for multi-logistic loss)
%
% Author: Julien Mairal, 2010
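Similarly, the graph structure from the mexProximalGraph example above can be plugged into mexFistaGraph (a sketch with hypothetical data):

% Minimal sketch (hypothetical data): regression with the graph regularization
% of Eq. (42); 'graph' is the struct built in the mexProximalGraph example
% above (p=5 variables).
X = randn(50,5);                      % m=50 samples, p=5 variables
Y = randn(50,1);                      % a single response vector (n=1)
W0 = zeros(5,1);                      % initial guess
param.loss = 'square';
param.regul = 'graph';
param.lambda = 0.05;
W = mexFistaGraph(Y,X,W0,graph,param);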