“Proximal Point  regularized convex on linear II"
Last time we discussed applying the stochastic proximal point method
to losses of the form
where and are convex functions, and devised an efficient implementation for L2 regularized losses: . We now aim for a more general approach, which will allow us to deal with many other regularizers. One important example is L1 regularization , since it is wellknown that it promotes a sparse parameter vector . It is useful when we have a huge number of features, and the nonzero components of select only the features which have meaningful effect on the model’s predictive power.
Recall that implementing stochastic proximal point method amounts to solving the onedimensional problem dual to the optimizer’s step (S) by maximizing:
Having obtained the maximizer we compute .
The part highlighted in blue is where the regularizer comes in, and is a major barrier between a practitioner and the method’s advantages  the practitioner has to mathematically rederive the minimization of and reimplement the resulting derivation for every regularizer. Seems quite impractical, and since can be arbitrarily complex, it might even be impossible.
Unfortunately, we cannot remove this obstacle entirely, but we can express it in terms of a “textbook recipe”  something a practitioner can pick from a catalog in a textbook and use, instead of rederiving. In fact, that’s how most optimization methods work. For example, SGD is based on the textbook recipe of ‘gradient’ and we have explicit rules for computing them, the stochastic proximal point method we derived in the last post is based on the textbook recipe of ‘convex conjugate’, and in this post we will introduce and use yet another textbook recipe.
Highschool tricks
Our journey begins with one trick which is taught in highschool algebra classes is known as “completing a square”  we rearrange the formula for a square of a sum to the form
Such a trick is occasionally useful to express things in terms of squares only. We do a similar trick on the formula for the squared Euclidean norm:
Rearranging, we obtain
Now, let’s apply the trick to the term inside the infimum in the definition of the dual problem. It is a bit technical, but the endresult leads us to our desired texbook recipe.
Plugging the above expression for into the formula (A), results in:
The magenta part may seem unfamiliar, but it is a wellknown concept in optimization: the Moreau envelope^{1} of the function . Let’s get introduced to the concept properly.
Formally, the Moreau envelope of a convex function with parameter is denoted by and defined by
Consequently, we can write the function of the dual problem as:
Now is composed of two textbook concepts  the convex conjugate , and the Moreau envelope . A related concept is the minimizer of (c) above  Moreau’s proximal operator of with parameter :
Moreau envelopes and proximal operators, which were introduced in 1965 by the French mathematician Jean Jacques Moreau, create a “smoothed” version of arbitrary convex functions, and are nowdays ubiquitous in modern optimization theory and practice. Since then, the concepts have been used for many other purposes  just Google Scholar it. In fact, the stochastic proximal point method itself can be compactly written in terms of the operator: select and compute .
Before going deeper, let’s recap and explicitly write our “metaalgorithm” for computing using the above concepts:
 Compute using the Moreau envelope.
 Solve the dual problem: find a maximizer of .
 Compute .
Moreau envelope  tangible example
To make things less abstract, look at a onedimensional example to gain some more intuition: the absolute value function . Doing some lengthy calculus, which is out of scope of this post, we can compute:
That is, the envelope is a function which looks like a prabola when is close enough to the 0, and switches to behaving like the absolute value when we get far away. Some readers may recognize this function  this is the wellknown Huber function, which is commonly used in statistics as differentiable approximation of the absolute value. Let’s plot it:
import numpy as np
import matplotlib.pyplot as plt
import math
def huber_1d(eta, u):
if math.fabs(u) <= eta:
return (u ** 2) / (2 * eta)
else:
return math.fabs(u)  eta / 2
def huber(eta, u):
return np.array([huber_1d(eta, x) for x in u])
x = np.linspace(2, 2, 1000)
plt.plot(x, np.abs(x), label='x')
plt.plot(x, huber(1, x), label='eta=1')
plt.plot(x, huber(0.5, x), label='eta=0.5')
plt.plot(x, huber(0.1, x), label='eta=0.1')
plt.legend(loc='best')
plt.show()
Here is the resulting plot:
Viola! A smoothed version of the absolute value function. Smaller values of lead to a better, but less smooth approximation. As we said before, this behavior is not unique to the absolute value’s envelope  Moreau envelopes of convex functions are always differentiable, and their gradient is always continuous.
Another interesting thing we can see in the plot is that the envelopes approach the approximating function from below. It is not a coincidence as well, since:
that is, the envelope always lies below the function itself.
Maximizing
Now let’s get back to our metaalgorithm. We need to solve the dual problem by maximizing , and we typically do it by equating its derivative with zero. Hence, in practice, we are interested in the derivative of rather than its value (assuming is indeed continuously differentiable). Using the chain rule, we obtain:
Moreau’s exceptional work does not disappoint, and using some clever analysis he derived the following remarkable formula for the gradient of :
Substituting the formula for into (d), the derivative can be written as
To conclude, our ingredients for are: a formula for the proximal operator of , and a formula for the derivative of . Since proximal operators are ubiquitous in optimization theory and practice, entire book chapters about proximal operators were written, i.e. see here^{2} and here^{3}. The second reference contains, at the end of the chapter, a catalog of explicit formulas for for various functions summarized in a table. Here are a two important examples:
Remarks  

is a vector whose components are all 1. is the ‘positive part’ of . More details later in this post.  
0  u  no regularizer 
With the above in mind, the metaalgorithm for computing amounts to:
 Obtain a solution of the equation
 Compute
L2 regularization  again
Last time we used a lengthy mathematical derivation to obtain the computational steps for L2 regularized losses, namely, losses of the form
Let’s see if we can avoid lengthy and errorprone mathematics using the dualderivative formula (DD). According to the table of proximal operators, we have . Thus, to compute we plug the above into the formula and solve:
Looking carefully, we see that it is exactly the derivative of from the last post, but this time it was obtained by taking a formula from a textbook. No lengthy math this time!
Having obtained the solution of of the equation , we can proceed and compute
which is, again, the same formula we obtained in the last post, but without doing any lengthy math.
The only thing a practitioner wishing to derive a formula for has to do by herself is to find a way to solve the onedimensional equation . The rest is provided by our textbook recipes  the proximal operator, and the convex conjugate.
L1 regularized logistic regression  end to end optimizer tutorial
We consider losses of the form
namely, and . The vector comprises both the training sample and the label , since for a positive sample the incurred loss is , while for a negative sample the incurred loss is . Namely, we have , where the sign depends on the label .
Deriving and implementing the optimizer
We need to deal with two tasks: find and . In previous posts in the series we already saw that
and it is defined on the closed interval . The derivative is
and it is defined on the open interval . From the table of proximal operators above, we find that
The above formula may be familar to some of you with background in signal processing: this is the softthresholding function with . It is implemented in PyTorch as torch.nn.functional.softshrink
, while in NumPy it can be easily implemented as:
import numpy as np
# the softthresholding function with parameter `delta`
def soft_threshold(delta, u):
return np.clip(np.abs(u)  delta, 0, None) * np.sign(u)
Let’s plot it for the onedimensional case, to see what it looks like:
import matplotlib.pyplot as plt
x = np.linspace(5, 5, 1000)
plt.plot(x, soft_threshold(1, x), label='delta=1')
plt.plot(x, soft_threshold(3, x), label='delta=3')
plt.legend(loc='best')
plt.show()
Here is the result:
The function zeroesout inputs close to the origin, and behaves as a linear function when our distance from the origin is .
Now let’s discuss the implementation. Seems we have all our ingredients for the derivative formula (DD)  the proximal operator of and the derivative . Putting the ingredients together, we aim to solve:
At first glance it seems like a hard equation to solve, but we have already dealt with a similar challenge in a previous post. Recall that:

The dual function is aways concave, and therefore its derivative is decreasing. Moreover, it tends to negative infinity when , and to positive infinity when . In other words, looks something like this:

The function is defined on the interval and is continuous. Thus, we can employ the same bisection strategy we used in a previous post for nonregularized logistic regression:
 Find an initial interval where our solution must lie, by setting:
 for the smallest positive integer satisfying ,
 and for the smallest positive integer satisfying .
 Run the bisection method for finding a zero of in the interval .
 Find an initial interval where our solution must lie, by setting:
Finally, having solved the equaion we use its solution to compute:
Let’s implement the method, and then discuss one of its important properties:
import math
import torch
from torch.nn.functional import softshrink
class LogisticRegressionL1:
def __init__(self, x, eta, lambdaa):
self._x = x
self._eta = eta
self._lambda = lambdaa
def step(self, w, y):
# helper local variables
delta = self._eta * self._lambda
eta = self._eta
x = self._x
# extract vector `a` from features w and label y
if y == 0:
a = w
else:
a = w
# compute the incurred loss components
data_loss = math.log1p(math.exp(torch.dot(a, x).item())) # logistic loss
reg_loss = self._lambda * x.abs().sum().item(). # L1 regularization
# dual derivative
def qprime(s):
return torch.dot(a, softshrink(x  eta * s * a, delta)).item()  math.log(s) + math.log(1  s)
# find initial bisection interval
l = 0.5
while qprime(l) <= 0:
l /= 2
u = 0.5
while qprime(1  u) >= 0:
u /= 2
u = 1  u
# run bisection  find s_star
while u  l > 1E16: # should be accurate enough
mid = (u + l) / 2
if qprime(mid) == 0:
break
if qprime(l) * qprime(mid) > 0:
l = mid
else:
u = mid
s_star = (u + l) / 2
# perform the computational step
x.set_(softshrink(x  eta * s_star * a, delta))
# return loss components
return data_loss, reg_loss
Before running an experiment, let’s discuss an important property of our optimizer. Look at the last line in the code above  the next iterate is the result of the softthresholding operator, and we saw that it zeroesout entries of whose absolute value is very small (). Consequently, the algorithm itself reflects the sparsity promoting nature of L1 regularization  we zeroout insignificant entries!
The above property is exactly the competitive edge of the proximal point approach in contrast to blackbox approaches, such as SGD. Since we deal with the loss itself, rather with its firstorder approximation, we preserve its important properties. As we will see in the experiment below, AdaGrad does not produce sparse vecotrs, while the solver we implemented above does. So even if we did not have the benefit of stepsize stability, we still have the benefit of preserving our regularizer’s properties.
Using the optimizer
Since my computational resources are limited, and I do not wish to train models for several days, we will use a rather small dataset this time. I chose the spambase dataset available from here. It is composed of 57 numerical columns, signifying frequencies of various frequentlyoccuring words, and average runlengths of capital letters, and a 58th column with a spam indicator.
Let’s begin by loading the dataset
import pandas as pd
url = 'https://web.stanford.edu/~hastie/ElemStatLearn/datasets/spam.data'
df = pd.read_csv(url, delimiter=' ', header=None)
df.sample(4)
Here is a possible result of the sample
function, which samples 4 rows at random:
0 1 2 3 4 ... 53 54 55 56 57
4192 0.0 0.00 0.00 0.0 0.00 ... 0.000 2.939 51 97 0
1524 0.0 0.90 0.00 0.0 0.00 ... 0.000 6.266 41 94 1
1181 0.0 0.00 0.00 0.0 1.20 ... 0.000 50.166 295 301 1
669 0.0 0.26 0.26 0.0 0.39 ... 0.889 12.454 107 1096 1
Now, let’s normalize all our numerical columns to lie in , so that all our logistic regression coefficients we will be at the same scale (otherwise, L1 regularization will not be effective):
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
scaled = min_max_scaler.fit_transform(df.iloc[:, 0:56])
df.iloc[:, 0:56] = scaled
Now, let’s create our PyTorch dataset, which we will use for training:
from torch.utils.data.dataset import TensorDataset
W = torch.tensor(np.array(df.iloc[:, 0:56])) # features
Y = torch.tensor(np.array(df.iloc[:, 57])) # labels
ds = TensorDataset(W, Y)
And now, let’s run the optimizer we wrote above, LogisticRegressionL1
, to find the weights of the regularized logistic regression model, with regularization parameter . Since this post is on optimization, we refer readers to standard techniques for choosing regularization parameters, such as Kfold crossvalidation. Since our proximal point optimizers are quite stable w.r.t the step size choices, I just arbitrarily chose .
from torch.utils.data.dataloader import DataLoader
# init. model parameter vector
x = torch.empty(56, requires_grad=False, dtype=torch.float64)
torch.nn.init.normal_(x)
# create optimizer
step_size = 1
llambda = 3E4
opt = LogisticRegressionL1(x, step_size, llambda)
# run 40 epochs, print out data loss and reg. loss
for epoch in range(40):
data_loss = 0.0
reg_loss = 0.0
for w, y in DataLoader(ds, shuffle=True):
ww = w.squeeze(0)
yy = y.item()
step_data_loss, step_reg_loss = opt.step(ww, yy)
data_loss += step_data_loss
reg_loss += step_reg_loss
print('data loss = ', data_loss / len(ds),
' reg. loss = ', reg_loss / len(ds),
' loss = ', (data_loss + reg_loss) / len(ds))
# print the parameter vector
print(x)
Here is the output I got:
data loss = 0.3975306874061495 reg. loss = 0.036288109782517605 loss = 0.4338187971886671
data loss = 0.31726795987087547 reg. loss = 0.054970775631660425 loss = 0.3722387355025359
...
data loss = 0.26724176087574 reg. loss = 0.0789531172022866 loss = 0.34619487807802657
tensor([2.2331e01, 2.4050e+00, 1.9527e01, 1.3942e01, 2.5370e+00,
1.3878e+00, 1.3516e+01, 2.6608e+00, 2.0741e+00, 1.7322e01,
1.6017e+00, 2.9143e+00, 2.9302e01, 1.6100e02, 2.1631e+00,
1.1565e+01, 4.9527e+00, 1.7578e01, 1.1546e+00, 3.2810e+00,
1.6294e+00, 2.9166e+00, 1.1044e+01, 4.6498e+00, 3.0636e+01,
8.7392e+00, 2.6487e+01, 6.1895e02, 1.3150e+00, 1.7103e+00,
5.8073e02, 0.0000e+00, 8.5150e+00, 0.0000e+00, 0.0000e+00,
1.4386e02, 2.4498e+00, 0.0000e+00, 2.0155e+00, 6.6931e01,
2.7988e+00, 1.3649e+01, 2.0991e+00, 7.5787e+00, 1.4172e+01,
1.9791e+01, 1.1263e+00, 2.6178e+00, 3.5694e+00, 6.1839e+00,
1.4704e+00, 5.4873e+00, 2.1276e+01, 0.0000e+00, 3.6687e01,
1.8539e+00], dtype=torch.float64)
After 40 epochs we got a parameter vector with several zero entries. These are the entries which our regularized optimization problem zeroes out, due to their small effect on model accuracy. Note, that increasing puts more emphasis on regularization and less on model accuracy, and therefore would zero out more entries at the cost of even less accurate model. Namely, we trade simplicity (less features used) for accuracy w.r.t our training data.
Running the same code with produces training loss of , while the data_loss
we see above is . So we indeed have a model which fits the training data less, but is also simpler, since it uses less features.
Now, let’s optimize the same logistic regression model with PyTorch’s standard AdaGrad optimizer.
from torch.optim import Adagrad
x = torch.empty(56, requires_grad=True, dtype=torch.float64)
torch.nn.init.normal_(x)
llambda = 3E4
step_size = 1
opt = Adagrad([x], lr = step_size)
for epoch in range(40):
data_loss = 0.0
reg_loss = 0.0
for w, y in DataLoader(ds, shuffle=True):
ww = w.squeeze(0)
yy = y.item()
if yy == 0:
a = ww
else:
a = ww
# zeroout x.grad
opt.zero_grad()
# compute loss
sample_data_loss = torch.log1p(torch.exp(torch.dot(a, x)))
sample_reg_loss = llambda * x.abs().sum()
sample_loss = sample_data_loss + sample_reg_loss
# compute loss gradient and perform optimizer step
sample_loss.backward()
opt.step()
data_loss += sample_data_loss.item()
reg_loss += sample_reg_loss.item()
print('data loss = ', data_loss / len(ds),
' reg. loss = ', reg_loss / len(ds),
' loss = ', (data_loss + reg_loss) / len(ds))
print(x)
After some time, I got the following output:
data loss = 0.32022869947190186 reg. loss = 0.05981964910605969 loss = 0.3800483485779615
...
data loss = 0.2645752374292745 reg. loss = 0.07861101490768707 loss = 0.34318625233696154
tensor([9.3965e01, 2.4251e+00, 2.6659e02, 1.6550e01, 2.6870e+00,
1.1179e+00, 1.3851e+01, 2.9437e+00, 2.1336e+00, 5.9068e02,
2.2235e01, 3.2626e+00, 2.2777e01, 1.0366e01, 2.1678e+00,
1.1491e+01, 5.0057e+00, 5.1104e01, 1.1741e+00, 3.3976e+00,
1.6788e+00, 2.8254e+00, 1.1238e+01, 4.7773e+00, 3.0312e+01,
8.7872e+00, 2.6049e+01, 8.1157e03, 1.7401e+00, 1.7351e+00,
4.8288e05, 9.3150e05, 8.6179e+00, 7.2897e05, 6.7084e02,
2.4873e05, 2.6019e+00, 2.0619e06, 2.1924e+00, 4.1967e01,
2.8286e+00, 1.3638e+01, 2.2318e+00, 7.7308e+00, 1.4278e+01,
1.9813e+01, 1.1892e+00, 2.7827e+00, 3.7886e+00, 5.9871e+00,
1.6373e+00, 5.5717e+00, 2.1019e+01, 3.9419e03, 5.1873e01,
2.2150e+00], dtype=torch.float64, requires_grad=True)
Well, it seems that none of the vector’s components are zero! Some may think that maybe AdaGrad has not converged, and letting it run for more epochs, or with a different stepsize will make a difference. But, unfortunately, it does not. A blackbox optimization technique, which is based on a linear approximation of our losses, instead of exploiting the losses themselves, often cannot produce solutions which preserve important properties imposed by our loss functions. In this case, it does not preserve the featureselection property of L1 regularization.
What’s next?
Up until now we have laid the theoretical and computational foundations, which we can now use for non blackbox optimizers for a much broader variety of problems, much beyond the simple ‘convex on linear’ setup. Our foundations are: the proximal viewpoint of optimization, convex duality, convex conjugate, and Moreau envelope.
Two important examples in machine learning come to mind  factorization machines, and neural networks. Losses incurred by these models clearly do not fall into the ‘convex on linear’ category, and we will see in future posts how we can construct non blackbox optimizers, which exploit more information about the loss, to train such models.
Moreover, up until now we assumed the fully stochastic setting: at each iteration we select one training sample, and perform a computational step based on the loss the sample incurres. We will see that the concepts we developed so far let us find an efficient implementation for the stochastic proximal point algorithm for the minibatch setting, where we select a small subset of training samples , and perform a computational step based on the average loss incurred by these samples.
Before we proceed to the above interesting stuff, one foundational concept is missing. Despite the fact that it doesn’t seem so at first glance, the stochastic proximal point method is a gradient method: it makes a small step towards a negative gradient direction. But in contrast to SGDtype methods, the gradient is not , but taken at a different point, which is not . The next post will deal with this nature of the method. We will mostly have theoretical explanations and drawings illustrating them, and see why it is a gradient method after all. No code, just theory. And no lengthy math, just elementary math and handwaving with drawings and illustrations. Stay tuned!

JJ Moreau. (1965). Proximite et dualit´e dans un espace Hilbertien. Bulletin de la Société Mathématique de France 93. (pp. 273–299) ↩

N. Parikh, S. Boyd (2013). Proximal Algorithms. Foundations and Trends in Optimization Vol.1 No. 3 (pp. 123231) ↩

A. Beck (2017). FirstOrder Methods in Optimization. SIAM Books Ch.6 (pp. 129177) ↩