“Proximal Point with Mini Batches  Regularized Convex On Linear"
This post is part of the series "Proximal point", and is not selfcontained.
Series posts:

“Proximal Point  warmup"

“Proximal point  convex on linear losses"

“Proximal Point  regularized convex on linear I"

“Proximal Point  regularized convex on linear II"

“Proximal Point is, after all, yet another gradient method"

“Selective approximation  the proxlinear method for training arbitrary models"

“Proximal Point with Mini Batches"

“Proximal Point with Mini Batches  Convex On Linear"
 “Proximal Point with Mini Batches  Regularized Convex On Linear" (this post)

“Proximal Point to the Extreme  Factorization Machines"
Recap and intro
We continue our endaevor of extending the reach of efficiently implementable stochastic proximal point method in the minibatch setting:
\[x_{t+1} = \operatorname*{argmin}_x \Biggl \{ \frac{1}{B} \sum_{i \in B} f_i(x) + \frac{1}{2\eta} \x  x_t\_2^2 \Biggr \}.\]Last time we discussed the implementation for convexonlinear losses, which include linear least squares and linear logistic regression. Continuing the same journey we already went through before the minibatch setting, this time we add regularization, and consider losses of the form:
\[f_i(x)=\phi(a_i^T x + b_i) + r(x),\]where the regularizer \(r\) is the same for all training samples, and \(\phi\) is a scalar convex function. In that case, the method becomes:
\[x_{t+1} = \operatorname*{argmin}_x \Biggl \{ \frac{1}{B} \sum_{i \in B} \phi(a_i^T x + b_i) + r(x) + \frac{1}{2\eta} \x  x_t\_2^2 \Biggr \}.\]Our aim in this post is to derive an efficient implementation, in Python, of the above computational step.
Solving via duality
We again employ duality in an attempt to make the problem of computing \(x_{t+1}\) tractable. We replace that problem with its equivalent constrained variant:
\[\operatorname*{minimize}_{x,z} \quad \frac{1}{B} \sum_{i\in B} \phi(z_i) + r(x) + \frac{1}{2\eta}\x  x_t\_2^2 \quad \operatorname{\text{subject to}} \quad z_i = a_i^T x + b_i.\]We will not get into the tedious details, but embedding the vectors \(a_i\) into the rows of the batch matrix \(A_B\), and after some mathematical manipulations, \(q(s)\) is:
\[q(s)=\color{magenta}{\frac{1}{B}\sum_{i \in B} \phi^*(B s_i) + \underbrace{\min_x \left\{ r(x) + \frac{1}{2\eta} \x  (x_t  \eta A_B^T s)\_2^2\right\}}_{\text{Moreau envelope}}} + (A_B x_t + b_B)^T s  \frac{\eta}{2} \A_B^Ts\_2^2.\]And here we arrive at our problem, which appears in the magentacolored part. The function \(\phi^*\) is always concave, while the Moreau envelope is always convex. The sum of such functions may, in general, be nor convex nor concave. So, although we know from duality theory that the function \(q(s)\) is concave, by separating it into two components we cannot convey its concavity to a generic convex optimization solver such as CVX  these solvers require that we write functions in a way which obeys an explicit set of rules which convey the function’s curvature as either convex or concave. One such rule is  we can add convex functions together, and concave functions together. But we cannot mix both, for reasons which go beyond the scope of this blog post.
The conclusion is the same duality trick which served us well before in transforming a highdimensional problem on \(x\) to a lowdimensional problem on the dual variables \(s\), cannot serve us now. Can we do something else? It turns out we can, but first let’s explore another extension of duality  inequality constraints.
Solving via duality  take 2
We aim to compute
\[x_{t+1} = \operatorname*{argmin}_x \Biggl \{ \frac{1}{B} \sum_{i \in B} \phi(a_i^T x + b_i) + r(x) + \frac{1}{2\eta} \x  x_t\_2^2 \Biggr \}.\]Let’s rephraze the optimization problem in a slightly different manner, by including two auxiliary variables  the vectors \(z\) and \(w\), and by embedding the vectors \(a_i\) into the rows of the batch matrix \(A_B\):
\[\operatorname*{minimize}_{x,z,w} \quad \frac{1}{B} \sum_{i\in B} \phi(z_i) + r(w) + \frac{1}{2\eta} \x  x_t\_2^2 \quad \operatorname{\text{subject to}} \quad z = A_B x + b, \ \ x = w\]Let’s construct a dual by assigning prices \(\mu\) and \(\nu\) to the violation of each set of constraints, and separating the minimization over each variable:
\[\begin{aligned} q(\mu, \nu) &= \inf_{x,z,w} \left\{\frac{1}{B} \sum_{i\in B} \phi(z_i) + r(w) + \frac{1}{2\eta} \x  x_t\_2^2 + \mu^T(A_B x + b  z) + \nu^T (x  w) \right\} \\ &= \color{blue}{\inf_x \left\{ \frac{1}{2\eta} \x  x_t\_2^2 + (A_B^T \mu + \nu)^T x \right\}} + \color{purple}{\inf_z \left\{ \frac{1}{B} \sum_{i\in B} \phi(z_i)  \mu^T z \right\}} + \color{green}{\inf_w \left\{ r(w) \nu^T w \right\}} + \mu^T b. \end{aligned}\]We’ve already encountered the purple part in a previous post  it can be written in terms of the convex conjugate \(\phi^*\):
\[\text{purple} = \frac{1}{B}\sum_{i \in B} \phi^*(B \mu_i)\]The green part is also straightforward, and is exactly \(r^*(\nu)\), where \(r^*\) is the convex conjugate of the regularizer \(r\). Finally, the blue part, despite being some cumbersome, is a simple quadratic minimiation problem over \(x\), so let’s solve it by equating the gradient of the term inside the \(\inf\) with zero:
\[\frac{1}{\eta}(x  x_t) + A_B^T \mu  \nu = 0.\]By rearranging, we obtain that the equation is solved at \(x = x_t  \eta(A_B^T \mu  \nu)\). Recall, that if strong duality holds, that’s exactly the rule for computing the optimal \(x\) from the optimal \((\mu, \nu)\) pair. Let’s substitute the above \(x\) into the blue term, and after some algebraic manipulations obtain:
\[\text{blue} = \frac{1}{2\eta} \left\ \color{brown}{x_t  \eta (A_B^T \mu  \nu)}  x_t \right\_2^2 + (A_B^T \mu + \nu)^T [\color{brown}{x_t  \eta (A_B^T \mu  \nu)}] =  \frac{\eta}{2} \ A_B^T \mu  \nu \_2^2 + (A_B x_t)^T \mu + x_t^T \nu\]Summarizing everything, we have:
\[q(\mu, \nu) = \color{blue}{ \frac{\eta}{2} \ A_B^T \mu  \nu \_2^2 + (A_B x_t)^T \mu + x_t^T \nu} \color{purple}{\frac{1}{B}\sum_{i \in B} \phi^*(B \mu_i)} \color{green}{r^*(\nu)} + b^T \mu.\]The blue part is a concave quadratic, and \(\phi^*\) and \(r^*\) are both concave. Well, seem that we’ve done it, haven’t we? Not quite! Recall that the dimension of \(\nu\) is the same as the dimension of \(x\), so we haven’t reduced the problem’s dimension at all! If we have a huge model parameter vector \(x\), we’ll have a huge dual variable \(\nu\).
Have we reached the glass ceiling?
The above two failures to come up with an efficient algorithm for computing \(x_{t+1}\) for minibatches in the regularized convexonlinear losses might make us wonder  is it even possible to implement the method in this setting? Well, it turns out that if we insist on using an offtheshelf optimization solver, it might be very hard. But if we are willing to write our own, it is possible.
Indeed, consider the dual problem we derived in take 1. We know that the dual function being maximized is concave. If we can compute its gradient, we can write our own fast gradient method, i.e. FISTA^{1} or Nesterov’s accelerated gradient^{2} method, to solve it. Note that I am not referring to a stochastic optimization algorithm for training a model, but a fully deterministic one for solving a simple optimization problem. Such methods can be quite fast. Furthermore, if we can compute the Hessian matrix of \(q\) we can employ Newton’s method^{3}, and solve the dual problem even faster, in a matter of a few milliseconds. However, writing convex optimization solvers is beyond the scope of this post.
What’s next?
The entire post series was devoted to deriving efficient implementations of the proximal point methods to various generic problem classes. Contrary to the above, I would like to devote the next, and last blog post of this series to implementing the method on a specific, but interesting problem. Stay tuned!

Beck A. & Teboulle M. (2009) A fast iterative shrinkagethresholding algorithm for linear inverse problems. SIAM Journal on Imaging Science, 2(11), 183–202. ↩

Nesterov Y. (1983) A method for solving the convex programming problem with convergence rate O(1/k^2). Dokl. Akad. Nauk SSSR 269, 543547 ↩

https://en.wikipedia.org/wiki/Newton%27s_method_in_optimization ↩