Models with a high-dimensional parameter space, such as large neural networks, often pose a challenge when deployed on edge devices, due to various constraints. Two remedies are often suggested: pruning and quantization. In this post I’d like to concentrate on the idea of pruning, which amounts to removing neurons that we believe have little or no contribution to the overall model performance. PyTorch provides various heuristics for model pruning, which are explained in a tutorial in the official documentation.
I’d like to discuss a decades-old alternative to those heuristics - L1 regularization. It is “known” to promote sparsity, and there is an endless amount of resources online explaining why. But there are very few resources explaining how this can be achieved in modern ML frameworks, such as PyTorch. I believe there are two major reasons for that.
The first reason is very direct - just adding an L1 regularization term to the cost we differentiate in each training loop, in conjunction with an optimizer such as SGD or Adam, will often not produce a sparse solution. You can find plenty of evidence online, such as here, here, or here. I want to avoid discussing why, and will just say that the reason lies in the optimizers - they were not designed to properly handle sparsity-inducing regularizers, and some trickery is required.
The second reason stems from how software engineering is done. People want to re-use components or patterns. There is a very clear pattern of how PyTorch training is implemented, and we either implement it manually in simple cases, or rely on a helper library, such as PyTorch Ignite or PyTorch Lightning, to do the job for us.
So can we use sparsity-inducing regularization with PyTorch in a way that nicely and easily integrates with the existing ecosystem? It turns out that there is an interesting stream of research that facilitates exactly that - the idea of sparse regularization by Hadamard parametrization. I first encountered it in a paper by Peter Hoff^{1}, and noticed that the idea has been further explored in several additional papers^{2}^{3}^{4}. I believe this stream of research hasn’t received the attention (pun intended!) it deserves, since it allows an extremely easy way of achieving sparse L1 regularization that seamlessly integrates into the existing PyTorch ecosystem of patterns and libraries. In fact, the code is so embarrassingly simple that I am surprised that such parametrizations haven’t become popular.
The basic idea is very simple. Suppose we aim to find model weights \(\mathbf{w}\) by minimizing the L1 regularized loss over our training set:
\[\tag{P} \min_{\mathbf{w}} \quad \frac{1}{n} \sum_{i=1}^n \ell_i(\mathbf{w}) + \lambda \|\mathbf{w}\|_1\]We reformulate the problem \((P)\) above by representing \(\mathbf{w}\) as a component-wise product of two vectors. Formally, \(\mathbf{w} = \mathbf{u} \odot \mathbf{v}\), where \(\odot\) is the component-wise (or Hadamard) product. Instead of solving \((P)\), we solve the problem below:
\[\tag{Q} \min_{\mathbf{u},\mathbf{v}} \quad \frac{1}{n} \sum_{i=1}^n \ell_i(\mathbf{u} \odot \mathbf{v}) + \lambda \left( \|\mathbf{u}\|_2^2 + \| \mathbf{v} \|_2^2 \right)\]Note that \((Q)\) uses L2 regularization! As it turns out^{1}, any local minimum of \((Q)\) is also a local minimum of \((P)\). L2 regularization is native to PyTorch in the form of the weight_decay
parameter to its optimizers. But more importantly, parametrizations are also a native beast in PyTorch!
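To get an intuition for where the equivalence comes from, here is a one-line sketch; it is just the arithmetic–geometric mean inequality applied component-wise:
\[\|\mathbf{u}\|_2^2 + \|\mathbf{v}\|_2^2 = \sum_j \left( u_j^2 + v_j^2 \right) \ge \sum_j 2 |u_j v_j| = 2 \|\mathbf{u} \odot \mathbf{v}\|_1,\]with equality exactly when \(|u_j| = |v_j|\) for every \(j\). At a minimizer of \((Q)\) the inequality must be tight: otherwise rescaling \(u_j \to t u_j\), \(v_j \to v_j / t\) would reduce the penalty without changing \(\mathbf{u} \odot \mathbf{v}\). So at optimality the L2 penalty of \((Q)\) collapses to an L1 penalty on \(\mathbf{w} = \mathbf{u} \odot \mathbf{v}\), up to a factor of two that amounts to rescaling \(\lambda\).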
We first begin with implementing this idea in PyTorch for a simple linear model, and then extend it to neural networks. This is, of course, not the best method to achieve sparsity. But it’s an extremely simple one, easy to try out on your model, and fun! As customary, the code is available in a notebook that you can deploy on Google Colab.
In this section we will demonstrate how to implement Hadamard parametrization in PyTorch to train a linear model on a data-set, and verify that we indeed achieve sparsity similar to a truly optimal solution of \((P)\). We regard the solution achieved by CVXPY, which is a well-known convex optimization package for Python, as an “exact” solution.
We begin from the data which we use throughout this section. We will use the Madelon dataset, which is a synthetic data-set that was used for the NeurIPS 2003 feature selection challenge. It’s available from openml as data-set 1485, and therefore we can use the fetch_openml
function from scikit-learn to fetch it:
from sklearn.datasets import fetch_openml
madelon = fetch_openml(data_id=1485, parser='auto')
To get a feel of what this data-set looks like, let’s print it:
print(madelon.frame)
The output is:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ... V492 V493 V494 \
0 485 477 537 479 452 471 491 476 475 473 ... 481 477 485
1 483 458 460 487 587 475 526 479 485 469 ... 478 487 338
2 487 542 499 468 448 471 442 478 480 477 ... 481 492 650
3 480 491 510 485 495 472 417 474 502 476 ... 480 474 572
4 484 502 528 489 466 481 402 478 487 468 ... 479 452 435
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2595 493 458 503 478 517 479 472 478 444 477 ... 475 485 443
2596 481 484 481 490 449 481 467 478 469 483 ... 485 508 599
2597 485 485 530 480 444 487 462 475 509 494 ... 474 502 368
2598 477 469 528 485 483 469 482 477 494 476 ... 476 453 638
2599 482 453 515 481 500 493 503 477 501 475 ... 478 487 694
V495 V496 V497 V498 V499 V500 Class
0 511 485 481 479 475 496 2
1 513 486 483 492 510 517 2
2 506 501 480 489 499 498 2
3 454 469 475 482 494 461 1
4 486 508 481 504 495 511 1
... ... ... ... ... ... ... ...
2595 517 486 474 489 506 506 1
2596 498 527 481 490 455 451 1
2597 453 482 478 481 484 517 1
2598 471 538 470 490 613 492 1
2599 493 499 474 494 536 526 2
So it’s a classification data-set with 500 numerical features, and two classes. Naturally, we will use the binary cross-entropy loss in our minimization problem.
At this stage our objective is just demonstrating properties of the model fitting procedure, rather than evaluating the performance of the model. Thus, for simplicity, we will not split into train / evaluation sets, and operate on the entire data-set.
To make it more friendly for model training, let’s first rescale it to zero mean and unit variance, and extract labels as values in \(\{0, 1\}\):
import numpy as np
from sklearn.preprocessing import StandardScaler
scaled_data = np.asarray(StandardScaler().fit_transform(madelon.data))
labels = np.asarray(madelon.target.cat.codes)
Now let’s find an optimal solution of the L1 regularized problem \((P)\) using CVXPY. For Logistic Regression, the loss of the sample \((\mathbf{x}_i, y_i)\) is:
\[\ell_i(\mathbf{w}) = \ln(1+\exp(\mathbf{w}^T \mathbf{x}_i)) - y_i \cdot \mathbf{w}^T \mathbf{x}_i\]This is obviously convex, due to the convexity of \(\ln(1+\exp(x))\), which is modeled by the cvxpy.logistic
function. The corresponding CVXPY code for constructing an object representing \((P)\) is:
import cvxpy as cp
# coef is the vector w; intercept is an unregularized bias term.
coef = cp.Variable(scaled_data.shape[1])
intercept = cp.Variable()
reg_coef = cp.Parameter(nonneg=True)
pred = scaled_data @ coef + intercept # <--- this is w^T x
loss = cp.logistic(pred) - cp.multiply(labels, pred)
mean_loss = cp.sum(loss) / len(scaled_data)
cost = mean_loss + reg_coef * cp.norm(coef, 1)
problem = cp.Problem(cp.Minimize(cost))
Of course, we don’t know at this stage which regularization coefficient to use to achieve sparsity, so let’s begin with \(10^{-4}\):
reg_coef.value = 1e-4
problem.solve()
print(f'Loss at optimum = {mean_loss.value:.4g}')
I got the following output:
Loss at optimum = 0.5466
Let’s also plot the coefficients. The plot_coefficients
function below just contains boilerplate to make the plot nice, and the ability to specify transparency and color for other parts of this post, where we want to make several plots on the same axes:
import matplotlib.pyplot as plt
def plot_coefficients(coefs, ax_coefs=None, alpha=1., color='blue', **kwargs):
    if ax_coefs is None:
        ax_coefs = plt.gca()
    markerline, stemlines, baseline = ax_coefs.stem(coefs, markerfmt='o', **kwargs)
    ax_coefs.set_xlabel('Feature')
    ax_coefs.set_ylabel('Weight')
    ax_coefs.set_yscale('asinh', linear_width=1e-6)  # linear near zero, logarithmic further from zero
    stemlines.set_linewidth(0.25)
    markerline.set_markerfacecolor('none')
    markerline.set_linewidth(0.1)
    markerline.set_markersize(2.)
    baseline.set_linewidth(0.1)
    stemlines.set_color(color)
    markerline.set_color(color)
    baseline.set_color(color)
    stemlines.set_alpha(alpha)
    markerline.set_alpha(alpha)
    baseline.set_alpha(alpha)
plot_coefficients(coef.value)
I got the following plot:
Note that the y-axis appears logarithmic due to the \(\operatorname{arcsinh}\) scale. Doesn’t look sparse at all! So let’s try a larger coefficient:
reg_coef.value = 1e-2
problem.solve()
print(f'Loss at optimum = {mean_loss.value:.4g}')
plot_coefficients(coef.value)
The output is
Loss at optimum = 0.6188
The plot I obtained:
Now it looks much sparser! Let’s store the coefficients vector, we will need it in the remainder of this section to compare it to the results we achieve with PyTorch:
cvxpy_sparse_coefs = coef.value.copy()
I don’t know if this is a ‘good’ feature selection strategy for this specific dataset, but that’s not our objective. Our objective is showing how to implement a Hadamard parametrization in PyTorch that recovers a similar sparsity pattern. So let’s do it!
Parametrizations in PyTorch allow representing any learnable parameter as a function of other learnable parameters. Typically, this is used to impose constraints. For example, we may represent a vector representing discrete event probabilities as the soft-max operation applied to a vector of arbitrary real values. A parametrization in PyTorch is just another module. Here is an example:
class SimplexParametrization(torch.nn.Module):
    def forward(self, x):
        return torch.softmax(x, dim=-1)
Now, suppose our model has a parameter called vec
which we’d like to constrain to lie in the probability simplex. It can be done in the following manner:
torch.nn.utils.parametrize.register_parametrization(model, 'vec', SimplexParametrization())
Voilà!
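To make this concrete, here is a minimal end-to-end sketch. It repeats the tiny parametrization class so the snippet runs standalone, and uses a hypothetical model whose only parameter is named vec:

```python
import torch
import torch.nn.utils.parametrize as parametrize

class SimplexParametrization(torch.nn.Module):
    def forward(self, x):
        # softmax maps arbitrary reals onto the probability simplex
        return torch.softmax(x, dim=-1)

class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.vec = torch.nn.Parameter(torch.randn(5))

model = Model()
parametrize.register_parametrization(model, 'vec', SimplexParametrization())
# model.vec is now nonnegative and sums to one
print(model.vec.min() >= 0, torch.isclose(model.vec.sum(), torch.tensor(1.)))
```

Reading model.vec now runs the parametrization’s forward, so the model always sees a valid probability vector, while the optimizer updates the unconstrained values stored under model.parametrizations.vec.original.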
Since a parametrization is just another module, it can have its own learnable weights! So we can use this fact to easily parametrize the weights of a torch.nn.Linear
module: we will regard its original weights as \(\mathbf{u}\), while the parametrization module will have its own weights \(\mathbf{v}\), and will compute \(\mathbf{u} \odot \mathbf{v}\). Here is the code:
import torch

class HadamardParametrization(torch.nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.v = torch.nn.Parameter(torch.ones(out_features, in_features))

    def forward(self, u):
        return u * self.v
Note that I initialized the \(\mathbf{v}\) vector to a vector of ones. This is because the first time a parametrization is applied, the forward
function is called to compute the parametrized value, and I want to use the deep mathematical fact that \(1\) is neutral w.r.t the multiplication operator to keep the original weight unchanged.
Let’s apply it to a linear layer:
layer = torch.nn.Linear(8, 1)
torch.nn.utils.parametrize.register_parametrization(layer, 'weight', HadamardParametrization(8, 1))
That’s it! Now if we train our linear model using PyTorch optimizers with weight_decay
, we will in fact apply L1 regularization to the original weights. The weight decay is exactly equivalent to the L1 regularization coefficient. Under the hood, the layer.weight
parameter is now represented as a Hadamard product of two tensors.
To get a feeling of what happens under the hood, let’s inspect our linear layer after applying the parametrization:
for name, param in layer.named_parameters():
    print(name, ': ', param)
The output I got is:
bias : Parameter containing:
tensor([-0.1233], requires_grad=True)
parametrizations.weight.original : Parameter containing:
tensor([[-0.0035, 0.2683, 0.0183, 0.3384, -0.0326, 0.1316, -0.1950, -0.0953]],
requires_grad=True)
parametrizations.weight.0.v : Parameter containing:
tensor([[1., 1., 1., 1., 1., 1., 1., 1.]], requires_grad=True)
We can see there are three trainable parameters: the bias of the linear layer; the original weight of the linear layer, which we now treat as the \(\mathbf{u}\) vector; and the weight of the HadamardParametrization
module, which is initialized to ones and which we treat as the \(\mathbf{v}\) vector. What happens if we try to access the weight
of the linear layer? Let’s see:
print(layer.weight)
Here is the output:
tensor([[-0.0035, 0.2683, 0.0183, 0.3384, -0.0326, 0.1316, -0.1950, -0.0953]],
grad_fn=<MulBackward0>)
It looks just like the original weight, but it has a MulBackward
gradient back-propagation function, because under the hood it is computed as a product of two tensors.
To see our parametrization in action, we will need three components. First, a function that implements a pretty standard PyTorch training loop. Something that looks familiar, and without any trickery. Second, a function that plots its results. Third, a function that integrates the two above ingredients to train a Hadamard-parametrized logistic regression model.
Here is our pretty-standard PyTorch training loop. It returns the training loss achieved in each epoch in a list, so that we can plot it:
from tqdm import trange

def train_model(dataset, model, criterion, optimizer, n_epochs=500, batch_size=8):
    epoch_losses = []
    for epoch in trange(n_epochs):
        epoch_loss = 0.
        for batch, batch_label in torch.utils.data.DataLoader(dataset, batch_size=batch_size):
            # compute prediction and loss
            batch_pred = model(batch)
            loss = criterion(batch_pred, batch_label)
            epoch_loss += loss.item() * torch.numel(batch_label)
            # invoke the optimizer using the gradients
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        epoch_losses.append(epoch_loss / len(dataset))
    return epoch_losses
Our second ingredient is a pair of plotting functions. Here is the one that plots the epoch losses:
def plot_convergence(epoch_losses, ax=None):
    if ax is None:
        ax = plt.gca()
    ax.set_xlabel('Epoch')
    ax.set_ylabel('Cost')
    ax.plot(epoch_losses)
    ax.set_yscale('log')
    last_iterate_loss = epoch_losses[-1]
    ax.axhline(last_iterate_loss, color='r', linestyle='--')
    ax.text(len(epoch_losses) / 2, last_iterate_loss, f'{last_iterate_loss:.4g}',
            fontsize=12, va='center', ha='center', backgroundcolor='w')
To get a feeling of what the output looks like, let’s plot a dummy list simulating a loss of \(\exp(-\sqrt{i})\) in the \(i\)-th epoch:
plot_convergence(np.exp(-np.sqrt(np.arange(100))))
We can see a plot of the achieved loss on a logarithmic scale, and a horizontal line denoting the loss at the last epoch. We would also like to see the coefficients of our trained model, just like we did with the CVXPY models. So here is a function that plots the losses on the left, and the coefficients on the right. The coefficients are plotted together with ‘reference’ coefficients, so that we can visually compare our model to some reference. In our case, the reference coefficients are the ones we obtained from CVXPY.
def plot_training_results(model, losses, ref_coefs):
    # create figure and decorate axis labels
    fig, (ax_conv, ax_coefs) = plt.subplots(1, 2, figsize=(12, 4))
    plot_coefficients(ref_coefs, ax_coefs, color='blue', label='Reference')
    plot_coefficients(model.weight.ravel().detach().numpy(), ax_coefs, color='orange', label='Hadamard')
    ax_coefs.legend()
    plot_convergence(losses, ax_conv)
    plt.tight_layout()
    plt.show()
Our third and last ingredient is the function that integrates it all. It trains a Hadamard parametrized model, and plots the coefficients and the epoch losses:
import torch.nn.utils.parametrize

def train_parameterized_model(alpha, optimizer_fn, ref_coefs, **train_kwargs):
    model_shape = (scaled_data.shape[1], 1)
    model = torch.nn.Linear(*model_shape)
    torch.nn.utils.parametrize.register_parametrization(
        model, 'weight', HadamardParametrization(*model_shape))  # <-- this applies Hadamard parametrization
    dataset = torch.utils.data.TensorDataset(
        torch.as_tensor(scaled_data).float(),
        torch.as_tensor(labels).float().unsqueeze(1))
    criterion = torch.nn.BCEWithLogitsLoss()  # <-- this is the loss for logistic regression
    optimizer = optimizer_fn(model.parameters(), weight_decay=alpha)
    epoch_losses = train_model(dataset, model, criterion, optimizer, **train_kwargs)
    plot_training_results(model, epoch_losses, ref_coefs)
Now let’s try it out. Unlike CVXPY, here we also need to choose an optimizer and its parameters. I used the Adam optimizer with a learning rate of \(10^{-4}\) and a weight decay of \(10^{-4}\), aiming to recover a sparsity pattern similar to the one CVXPY produced. And yes, I know^{5} that Adam’s weight decay is not exactly L2 regularization, but many use Adam as their go-to optimizer, and I want to demonstrate that the idea works with Adam as well:
from functools import partial
train_parameterized_model(alpha=1e-4,
                          optimizer_fn=partial(torch.optim.Adam, lr=1e-4),
                          ref_coefs=cvxpy_sparse_coefs)
Here is the result I got:
On the left, we can see the convergence plot. On the right, we can see coefficients from both CVXPY and the Hadamard parametrization. They almost coincide, with almost the same sparsity pattern. The training loss, 0.6194, is also pretty close to 0.6188, which is what we achieved with CVXPY.
Now, having seen that Hadamard parametrization indeed ‘induces sparsity’, just like its equivalent L1 regularization, we can do something more interesting, and apply it to neural networks.
The concept of sparsity doesn’t necessarily fit neural networks in the best way, but a related concept of group sparsity does. We introduce it here, show how it is seamlessly implemented using a Hadamard product parametrization in PyTorch, and conduct an experiment with the famous California housing prices dataset.
One caveat of the parametrization technique we saw is that it requires twice as many trainable parameters. For large neural networks this may be prohibitive in terms of time, space, or just the cost of training on the cloud. But with neural networks, it may be enough to produce a zero either at the neuron input level, or at the output level. For example, some of the neurons of a given layer produce a zero, whereas others do not.
This can be achieved through regularization that induces group sparsity, meaning that we would like entire groups of weights to become zero whenever the effect of the group on the loss is small enough. If we define the groups to be the columns of the weight matrices of our linear layers, we will achieve sparsity on neuron inputs. This is because in PyTorch the columns correspond to the input features of a linear layer.
One way to achieve this is using the sum of the column norms as the regularization term. For example, suppose we have a 3-layer neural network whose weight matrices are \(\mathbf{W}_1 \in \mathbb{R}^{8\times 3}, \mathbf{W}_2\in\mathbb{R}^{3\times 2}\), and \(\mathbf{W}_3 \in \mathbb{R}^{2\times 1}\), and we are training over a data-set with \(n\) samples with cost functions \(\ell_1, \dots, \ell_n\). Then to induce column sparsity, we should train by minimizing
\[\begin{align*} \min_{\mathbf{W}_1, \mathbf{W}_2, \mathbf{W}_3} \quad \frac{1}{n} \sum_{i=1}^n \ell_i(\mathbf{W}_1, \mathbf{W}_2, \mathbf{W}_3) + \lambda \Bigl( &\| \mathbf{W}_{1,1} \|_2 + \| \mathbf{W}_{1,2} \|_2 + \| \mathbf{W}_{1,3} \|_2 + \\ &\| \mathbf{W}_{2,1} \|_2 + \| \mathbf{W}_{2,2} \|_2 + \\ &\| \mathbf{W}_{3,1} \|_2 \Bigr), \end{align*}\]where \(\mathbf{W}_{k,i}\) denotes the \(i\)-th column of the matrix \(\mathbf{W}_k\). Seems a bit clumsy, but the regularizer just sums up the Euclidean norms of the weight matrix columns. Note that the norms are not squared, so this is not our friendly neighborhood L2 regularization.
It turns out^{2}^{3}^{4} that this is equivalent to a Hadamard product parametrization with our friendly neighborhood L2 regularization. This means that we can again use the weight_decay
feature of PyTorch optimizers to achieve column sparsity. As we would expect, the parametrization operates on matrix columns, rather than on individual components. A weight matrix \(\mathbf{W} \in \mathbb{R}^{m \times n}\) will be parametrized by a matrix \(\mathbf{U} \in \mathbb{R}^{m \times n}\) and a vector \(\mathbf{v} \in \mathbb{R}^n\) by multiplying each column of \(\mathbf{U}\) by the corresponding component of \(\mathbf{v}\):
\[\mathbf{W} = \mathbf{U} \operatorname{diag}(\mathbf{v}).\]
The implementation in PyTorch is embarrassingly simple:
class InputsHadamardParametrization(torch.nn.Module):
    def __init__(self, in_features):
        super().__init__()
        self.v = torch.nn.Parameter(torch.ones(1, in_features))

    def forward(self, u):
        return u * self.v
Note that we use the broadcasting ability of PyTorch to multiply each column of the argument u
by the corresponding component of v
.
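To see the broadcasting at work, here is a tiny standalone example with made-up numbers:

```python
import torch

u = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.]])   # a 2x3 "weight" matrix
v = torch.tensor([[1., 0., 2.]])   # shape (1, 3): one multiplier per column
print(u * v)
# tensor([[ 1.,  0.,  6.],
#         [ 4.,  0., 12.]])
```

Driving a component of v to zero wipes out an entire column of the product, which is exactly the group structure we want the regularizer to exploit.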
To see our idea in action, we shall use the California housing dataset, mainly due to its availability on Google Colab. It has 8 numerical features and a continuous regression target. Let’s load it:
import pandas as pd
train_df = pd.read_csv('sample_data/california_housing_train.csv')
test_df = pd.read_csv('sample_data/california_housing_test.csv')
The first 5 rows of the train data-set are:
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value
---|---|---|---|---|---|---|---|---
-114.31 | 34.19 | 15 | 5612 | 1283 | 1015 | 472 | 1.4936 | 66900
-114.47 | 34.4 | 19 | 7650 | 1901 | 1129 | 463 | 1.82 | 80100
-114.56 | 33.69 | 17 | 720 | 174 | 333 | 117 | 1.6509 | 85700
-114.57 | 33.64 | 14 | 1501 | 337 | 515 | 226 | 3.1917 | 73400
-114.57 | 33.57 | 20 | 1454 | 326 | 624 | 262 | 1.925 | 65500
The first 8 columns are the features, and the last column is the regression target. The data already comes split into training and test sets, so for preprocessing we just use Scikit-Learn’s StandardScaler
to standardize the numerical features. Then, we convert everything to PyTorch datasets:
# standardize features
scaler = StandardScaler().fit(train_df)
train_scaled = scaler.transform(train_df)
test_scaled = scaler.transform(test_df)
# convert to PyTorch objects
train_ds = torch.utils.data.TensorDataset(
torch.as_tensor(train_scaled[:, :-1]).float(),
torch.as_tensor(train_scaled[:, -1]).float().unsqueeze(1))
test_features = torch.as_tensor(test_scaled[:, :-1]).float()
test_labels = torch.as_tensor(test_scaled[:, -1]).float().unsqueeze(1)
Now we are ready. We will use the following simple four-layer neural network to fit the training set:
class Network(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(8, 32)
        self.fc2 = torch.nn.Linear(32, 64)
        self.fc3 = torch.nn.Linear(64, 32)
        self.fc4 = torch.nn.Linear(32, 1)
        self.relu = torch.nn.ReLU()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.relu(self.fc3(x))
        x = self.fc4(x)
        return x

    def linear_layers(self):
        return [self.fc1, self.fc2, self.fc3, self.fc4]
The first layer has 8 input features, since our data-set has 8 features. Later layers expand it to 64 hidden features, and then shrink back. We don’t know if we will need all those dimensions, but that’s what we have our sparsity inducing regularization for - so that we can find out. Note that I added a linear_layers()
method to be able to operate on all the linear layers of the network. It could be done in a generic manner by inspecting all modules and checking which ones are torch.nn.Linear
, but I want to make the subsequent code simpler.
Let’s inspect our network to see how many parameters it has. To that end, we shall use the torchinfo package:
import torchinfo
network = Network()
torchinfo.summary(network, input_size=(1, 8))
Most of the output is not interesting, but one line is:
Trainable params: 4,513
So our network has 4513 trainable parameters. As we shall see, using sparsity inducing regularization we can let gradient descent (or Adam) discover how many dimensions we need! Let’s proceed to training our network with column-parametrized weights:
def parametrize_neuron_inputs(network):
    for layer in network.linear_layers():
        num_inputs = layer.weight.shape[1]
        torch.nn.utils.parametrize.register_parametrization(
            layer, 'weight', InputsHadamardParametrization(num_inputs))

parametrize_neuron_inputs(network)
epoch_costs = train_model(train_ds, network, torch.nn.MSELoss(),
                          n_epochs=200, batch_size=128,
                          optimizer=torch.optim.Adam(network.parameters(), lr=0.002, weight_decay=0.001))
Reusing our plot_convergence
function from the previous section, we can see how the model trains:
plot_convergence(epoch_costs)
We can now also evaluate the performance on the test set:
def eval_network(network):
    network.eval()
    criterion = torch.nn.MSELoss()
    print(f'Test loss = {criterion(network(test_features), test_labels):.4g}')

eval_network(network)
The output is:
Test loss = 0.2997
So we did not over-fit: the test MSE is similar to the train MSE. Now let’s inspect our sparsity. To that end, I implemented a function that plots the sparsity patterns of the four weight matrices, where I regard any entry below some threshold as a zero. Nonzeros are white, whereas zeros are black. Here is the code:
def plot_network(network, zero_threshold=1e-5):
    fig, axs = plt.subplots(1, 4, figsize=(12, 3))
    layers = network.linear_layers()
    for i, (ax, layer) in enumerate(zip(axs.ravel(), layers), start=1):
        layer_weights = layer.weight.abs().detach().numpy()
        image = layer_weights > zero_threshold
        ax.imshow(image, cmap='gray', vmin=0, vmax=1)
        ax.set_title(f'Layer {i}')
    plt.tight_layout()
    plt.show()
plot_network(network)
Now here is a surprise! It is most apparent in the first layer. Our parametrization is supposed to induce sparsity on the columns of the matrices, but we see that it also induces sparsity on the rows. So what’s going on? Well, it turns out that if the inputs of some inner layer are unused, because the column weights are zero, we can also zero out the corresponding rows of the layer before. Since the outputs of the layer before are unused by the layer after, this has no effect on the training loss, but reduces the regularization term. Indeed, careful inspection will show that the rows of the first layer that were fully zeroed out exactly correspond to the columns of the second layer that were zeroed out. This is true for any consecutive pair of layers. What’s truly amazing is that we didn’t have to do anything - gradient descent (or Adam, in this case) ‘discovered’ this pattern on its own!
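We can illustrate this coupling with hand-set weights on a hypothetical pair of tiny layers (not the trained network); the masks of zero rows and zero columns line up exactly:

```python
import torch

fc_a = torch.nn.Linear(3, 4)  # hypothetical consecutive layer pair
fc_b = torch.nn.Linear(4, 2)
with torch.no_grad():
    fc_b.weight[:, 1] = 0.  # input 1 of the layer after is unused...
    fc_a.weight[1, :] = 0.  # ...so zeroing the matching row of the layer before is "free"

zero_rows_a = fc_a.weight.abs().sum(dim=1) == 0  # dead output neurons of the layer before
zero_cols_b = fc_b.weight.abs().sum(dim=0) == 0  # unused inputs of the layer after
print(torch.equal(zero_rows_a, zero_cols_b))  # True: the patterns coincide
```

In the trained network this alignment is not imposed by us; the optimizer zeroes the rows on its own, because doing so reduces the penalty at no cost to the loss.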
Now that we know exactly which rows and columns we can remove, let’s write a function that does it. It’s a bit technical, and I don’t want to go into the PyTorch details, but you can read the code and convince yourself that this is exactly what the function below does for a linear layer - it computes a mask of output neurons whose weights are negligibly small, receives the corresponding mask from the previous layer, and removes the matching rows and columns.
@torch.no_grad()
def shrink_linear_layer(layer, input_mask, threshold=1e-6):
    # compute mask of nonzero output neurons
    output_norms = torch.linalg.vector_norm(layer.weight, ord=1, dim=1)
    if layer.bias is not None:
        output_norms += layer.bias.abs()
    output_mask = output_norms > threshold
    # compute shrunk sizes
    in_features = torch.sum(input_mask).item()
    out_features = torch.sum(output_mask).item()
    # create a new shrunk layer
    has_bias = layer.bias is not None
    shrunk_layer = torch.nn.Linear(in_features, out_features, bias=has_bias)
    shrunk_layer.weight.set_(layer.weight[output_mask][:, input_mask])
    if has_bias:
        shrunk_layer.bias.set_(layer.bias[output_mask])
    return shrunk_layer, output_mask
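The core identity behind this function is worth spelling out: once a column of a weight matrix is exactly zero, dropping it together with the corresponding input leaves the matrix-vector product unchanged. A minimal standalone check, with made-up shapes:

```python
import torch

W = torch.randn(3, 5)
keep = torch.tensor([True, False, True, True, False])
W[:, ~keep] = 0.  # pretend the regularization zeroed these input columns

x = torch.randn(5)
full = W @ x                    # the original layer's computation
shrunk = W[:, keep] @ x[keep]   # dead inputs (and their columns) removed
print(torch.allclose(full, shrunk))  # True
```

The same reasoning applied to rows justifies removing dead output neurons, as long as the layer after drops the matching inputs.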
Now let’s apply it to all four layers:
mask = torch.ones(8, dtype=bool)
network.fc1, mask = shrink_linear_layer(network.fc1, mask)
network.fc2, mask = shrink_linear_layer(network.fc2, mask)
network.fc3, mask = shrink_linear_layer(network.fc3, mask)
network.fc4, mask = shrink_linear_layer(network.fc4, mask)
Note that we replace the linear layers of the network with new ones. These new layers do not have a Hadamard parametrization, so applying weight decay now applies the regular L2 regularization we are used to. Let’s see how many trainable weights our network has now:
torchinfo.summary(network, input_size=(1, 8))
Here is the output:
==========================================================================================
Layer (type:depth-idx) Output Shape Param #
==========================================================================================
Network [1, 1] --
├─Linear: 1-1 [1, 18] 162
├─ReLU: 1-2 [1, 18] --
├─Linear: 1-3 [1, 15] 285
├─ReLU: 1-4 [1, 15] --
├─Linear: 1-5 [1, 9] 144
├─ReLU: 1-6 [1, 9] --
├─Linear: 1-7 [1, 1] 10
==========================================================================================
Total params: 601
Trainable params: 601
Non-trainable params: 0
Total mult-adds (M): 0.00
==========================================================================================
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.00
Estimated Total Size (MB): 0.00
==========================================================================================
Only 601 trainable parameters! So let’s train its 601 remaining weights, now without any parametrizations:
epoch_costs = train_model(train_ds, network, torch.nn.MSELoss(),
                          n_epochs=200, batch_size=128,
                          optimizer=torch.optim.Adam(network.parameters(), lr=0.002, weight_decay=1e-6))
plot_convergence(epoch_costs)
We can also evaluate its performance:
eval_network(network)
The output is:
Test loss = 0.2385
You may have noticed that I manually chose the learning rate and the weight decay for the parametrized network, and a different learning rate and weight decay for the shrunk network. In practice, we should do hyperparameter tuning, and select the combination that optimizes some metric on an evaluation set. Namely, each hyperparameter tuning experiment performs two phases, just like what we did with our neural network in this post. In the first phase, it trains a parametrized network and shrinks it; the parametrization helps ‘discover’ the correct sparsity pattern. Then, in the second phase, it trains the shrunk network and evaluates its performance. This is because we have no way of knowing in advance which hyperparameters will induce the ‘optimal’ sparsity pattern. So pseudo-code for a hyperparameter tuning experiment may look like this:
def tuning_objective(phase_1_lr, phase_1_alpha, phase_2_lr, phase_2_alpha):
    network = create_network()
    apply_hadamard_parametrization(network)
    train(network, phase_1_lr, phase_1_alpha)
    network = shrink_network(network)
    train(network, phase_2_lr, phase_2_alpha)
    return evaluate_performance(network)
So I recommend relying on the hyperparameter tuner to discover good parameters for the above objective, just like we rely on gradient descent to discover the ‘right’ sparsity pattern.
The idea of training first with sparsity-inducing regularization, and then again without it, is not new. In fact, many statisticians working with Lasso do something similar: we first use Lasso for feature selection, and then re-train the model on the selected features without Lasso. This is because sparsity-inducing regularization typically hurts performance by shrinking the remaining model weights too aggressively. This was a kind of “craftsman’s knowledge”, but recently some papers^{6}^{7} formally analyzed this approach and made it more publicly known. This idea also has some resemblance to the relaxed Lasso^{8}.
Finally, if we have an inference “budget”, we may choose to inform our hyperparameter tuner that the cost for exceeding the budget is very high. For example, in the above tuning objective, we can replace the return statement by:
return evaluate_performance(network) + 1000 * max(number_of_parameters(network) - budget, 0)
This way the tuner will try to avoid exceeding the budget, because of the high cost of each additional model parameter. Of course, the cost doesn’t have to be that extreme, and we can make it much less than 1000 units for each additional parameter, depending on our requirements.
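The pseudo-code above relies on a number_of_parameters helper, which is hypothetical. A minimal sketch of such a helper in PyTorch, assuming pruned weights are exactly zero after the shrinking phase, could simply count the nonzero entries:

```python
import torch

def number_of_parameters(network: torch.nn.Module) -> int:
    # Count nonzero entries across all parameter tensors. After the
    # shrinking phase pruned weights are exactly zero, so this measures
    # the effective model size.
    return sum(int(p.count_nonzero()) for p in network.parameters())

# Example: a tiny linear layer with half of its weights zeroed out
layer = torch.nn.Linear(4, 2, bias=False)
with torch.no_grad():
    layer.weight[:, :2] = 0.0
print(number_of_parameters(layer))  # 4 of the 8 weights remain
```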
The beauty of sparsity inducing regularization is that we let our optimizer discover the sparsity patterns, instead of doing extremely expensive neural architecture search. And the beauty of Hadamard-product parametrization is that it lets us re-use existing optimizers of our ML frameworks to add sparsity-inducing regularizers, without having to write specialized custom optimizers. Maybe to some of you this may sound like Klingon, but for readers familiar with proximal minimization: writing a proximal operator for group sparsity inducing norm with componentwise learning rates using PyTorch, so that it is also GPU friendly, is extremely hard. But with Hadamard parametrization we don’t need to.
The idea extends beyond fully-connected networks. In convolutional nets we can make each filter a “group”, and let gradient descent discover how many filters, or channels, we need in each convolutional layer. We can also apply it to factorization machines^{9}, to discover the ‘right’ latent embedding dimension. The idea is extremely versatile!
I hope you had as much fun reading this post as I had writing it, and see you in the next post!
Hoff, Peter D. “Lasso, fractional norm and structured sparse estimation using a Hadamard product parametrization.” Computational Statistics & Data Analysis 115 (2017): 186-198. ↩ ↩^{2}
Ziyin, Liu, and Zihao Wang. “spred: Solving L1 Penalty with SGD.” International Conference on Machine Learning. PMLR, 2023. ↩ ↩^{2}
Kolb, Chris, et al. “Smoothing the edges: a general framework for smooth optimization in sparse regularization using Hadamard overparametrization.” arXiv preprint arXiv:2307.03571 (2023). ↩ ↩^{2}
Poon, Clarice, and Gabriel Peyré. “Smooth over-parameterized solvers for non-smooth structured optimization.” Mathematical programming 201.1 (2023): 897-952. ↩ ↩^{2}
Loshchilov, Ilya, and Frank Hutter. “Decoupled Weight Decay Regularization.” International Conference on Learning Representations (2019). ↩
Belloni, Alexandre, and Victor Chernozhukov. “ℓ1-penalized quantile regression in high-dimensional sparse models.” (2011): 82-130. ↩
Belloni, Alexandre, and Victor Chernozhukov. “Least squares after model selection in high-dimensional sparse models.” Bernoulli (2013): 521-547. ↩
Meinshausen, Nicolai. “Relaxed lasso.” Computational Statistics & Data Analysis 52.1 (2007): 374-393. ↩
Rendle, Steffen. “Factorization machines.” 2010 IEEE International conference on data mining. IEEE, 2010. ↩
Typically in machine learning we train a model by minimizing the average loss over the training set \(\mathbf{x}_1, \dots, \mathbf{x}_n\)^{1}, perhaps with some regularization. Mathematically, we solve the problem:
\[\min_\mathbf{w} \quad \frac{1}{n} \sum_{i=1}^n f(\mathbf{w}, \mathbf{x}_i)\]A recently published JMLR paper^{2} proposes an alternative, a tilted loss:
\[\min_\mathbf{w} \quad \frac{1}{t} \ln\left( \frac{1}{n} \sum_{i=1}^n \exp(t f(\mathbf{w}, \mathbf{x}_i)) \right)\]The LogSumExp function \(\mathbf{x} \to \ln\left(\sum_{i=1}^n \exp(x_i)\right)\) serves as the “aggregator” of losses over individual samples, instead of just the plain average. The parameter \(t\) can be thought of as a kind of ‘temperature’ of our aggregator. When \(t \to \infty\), it converges to the worst-case loss over the training set. This is useful when we want to make sure that we perform reasonably well on the difficult instances too, and not only on average. Conversely, when \(t \to -\infty\), it converges to the best-case loss over the training set. This is useful for the opposite - performing well on the easy instances, while being less sensitive to “outliers”, or more difficult instances. Taking \(t \to 0\), it converges to the regular average loss. Thus, it allows interpolating between fairness and robustness. For the sake of simplicity, in this post we shall assume that \(t > 0\).
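To build some intuition for the temperature, here is a small numerical sketch of the tilted aggregator (the helper name tilted_aggregate is mine; torch.logsumexp is used for numerical stability):

```python
import math
import torch

def tilted_aggregate(losses: torch.Tensor, t: float) -> torch.Tensor:
    # (1/t) * ln( (1/n) * sum_i exp(t * l_i) ), computed via logsumexp
    n = losses.shape[0]
    return torch.logsumexp(t * losses - math.log(n), dim=0) / t

losses = torch.tensor([0.1, 0.5, 2.0], dtype=torch.float64)
print(tilted_aggregate(losses, t=50.0))   # close to max(losses) = 2.0
print(tilted_aggregate(losses, t=-50.0))  # close to min(losses) = 0.1
print(tilted_aggregate(losses, t=1e-4))   # close to the plain mean
```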
Off-the-shelf methods available in PyTorch and TensorFlow are based on stochastic gradients, and are designed to minimize averages over individual samples. However, the tilted loss is not an average due to the LogSumExp aggregation, and hence model training becomes a bit tricky. The paper proposes to use ‘tilted averaging’ on mini-batches, instead of regular averaging, but without a mathematical justification to the best of my knowledge. Intuitively, such a strategy minimizes some approximation of the tilted loss, but not the tilted loss itself.
In theory, since a logarithm is monotonic, and \(\frac{1}{t}\) is positive, we could discard both and train on the tilted loss itself by minimizing an average of exponentials:
\[\frac{1}{n} \sum_{i=1}^n \exp(t f(\mathbf{w}, \mathbf{x}_i))\]Let’s call the above a stripped reformulation, since we stripped away the logarithm. Now it is just an average over the samples, so it should be easy, right? In practice this may cause severe numerical problems, since exponentials tend to ‘explode’ even for a moderate value of the loss \(f(\mathbf{w}, \mathbf{x}_i)\). The LogSumExp function itself, and its gradient, the SoftMax, are not hard to evaluate numerically - but the logarithm plays a crucial role. So can we devise a reformulation with better numerical properties, and still minimize the tilted loss itself, rather than an approximation?
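To get a feel for the scale of the problem: in double precision, \(\exp(z)\) overflows once \(z\) exceeds roughly 709, so with \(t = 1\) a single squared residual of a few hundred already makes the stripped objective, and hence its gradient, infinite:

```python
import numpy as np

t = 1.0
moderate_loss = 800.0  # e.g. a squared residual with |residual| of about 28

with np.errstate(over='ignore'):
    stripped_term = np.exp(t * moderate_loss)  # overflows in float64
print(stripped_term)  # inf

# The same quantity in the log domain is simply t * moderate_loss = 800.0,
# which is perfectly representable - this is why the logarithm matters.
```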
I found inspiration in Prof. Francis Bach’s fascinating post about the so-called “η-trick” and its use in iteratively reweighted methods. The trick allows transforming a function that is “hard” to deal with into a function that is “easy” to deal with by adding auxiliary variables. So it made me wonder - can we do the same with the tilted loss to mitigate the numerical issues? In this post we will try to understand how the numerical issue is manifested in model training, and explore an idea to try to mitigate it to some extent. As usual, the code is available in a notebook.
To understand how the ‘numerical problems’ we just discussed are manifested, let’s try to do a simple line fitting problem. We will generate data, and try out training a linear model to fit to noisy samples using the stripped reformulation on top of the MSE loss.
We will generate noisy measurements around the line \(y = 0.8x - 1\), and fit a line by minimizing the tilted loss of squared residuals.
Here is the sample generation code:
import numpy as np
# true line parameters
true_w = np.array([0.8, -1])
# sampling parameters
noise_strength = 0.3
n = 100
# sample random X and noisy Y coordinates
np.random.seed(42)
x = np.stack([np.random.randn(n), np.ones(n)], axis=-1)
y = x @ true_w + noise_strength * np.random.standard_t(df=3, size=n)
I used Student’s t distribution for the noise to take advantage of its ‘heavy tails’, meaning that occasionally some samples will deviate farther from the line than the majority. I also used 3 degrees of freedom to make sure we have a finite variance; otherwise, demonstrating what I want in this post becomes even harder. Let’s take a look at the data:
import matplotlib.pyplot as plt
plt.scatter(x[:, 0], y)
plt.show()
Indeed, most of the samples lie along the line, but a few of them go a bit farther away.
Now let’s try fitting with several PyTorch optimizers and several step sizes to see the behavior. Note that we would like to understand the behavior of the optimizer as a minimization algorithm, and not as a learning algorithm. Thus, there will be no division into train/validation/test. Instead, we will just see how well we managed to minimize the desired loss on the training set we just sampled.
Our first ingredient is a loss for our stripped reformulation - something that takes an existing loss, multiplies by \(t\), and exponentiates it:
class StrippedTiltedLoss:
    def __init__(self, underlying_loss, t):
        self.underlying_loss = underlying_loss
        self.t = t

    def __call__(self, pred, target):
        return torch.exp(self.t * self.underlying_loss(pred, target))
Next, we shall write a function that fits a line to the data using a given loss and a given optimizer. It’s just a standard PyTorch training loop:
import torch

def pytorch_fit(x, y, criterion, make_optim_fn, n_epochs=100):
    # convert numpy arrays to torch tensors
    x = torch.as_tensor(x)
    y = torch.as_tensor(y)

    # define initial w to be the zero vector
    w_fit = torch.nn.Parameter(torch.zeros_like(x[0]))

    # create optimizer
    optim = make_optim_fn([w_fit])

    # regular PyTorch training loop
    for epoch in range(n_epochs):
        for xi, yi in zip(x, y):
            pred = torch.dot(xi, w_fit)
            loss = criterion(pred, yi)
            optim.zero_grad()
            loss.backward()
            optim.step()

    return w_fit.detach()
To get a feeling, let’s try the stripped reformulation on top of the MSE loss:
from torch.nn import MSELoss
pytorch_fit(x, y,
criterion=StrippedTiltedLoss(MSELoss(), t=1),
make_optim_fn=lambda params: torch.optim.SGD(params, lr=1e-6))
The output I got is:
tensor([ 0.4708, -0.3032], dtype=torch.float64)
Doesn’t look like our line, so maybe we aren’t learning fast enough with such a small step size? Let’s try a larger step size:
pytorch_fit(x, y,
criterion=StrippedTiltedLoss(MSELoss(), t=1),
make_optim_fn=lambda params: torch.optim.SGD(params, lr=1e-4))
Now I got an output:
tensor([nan, nan], dtype=torch.float64)
We can add some printouts to understand what’s going on, but it’s quite simple. A large step size causes the weights to change sharply, which in turn causes large residuals, which make the gradients exponentially larger, causing even sharper changes to the learned weights.
We can conjecture, therefore, that there is a very narrow range of step sizes that perform reasonably well. A step size too small will make little progress, whereas a step size too large makes too much progress, causing exploding gradients. Consequently, hyper-parameter tuning becomes difficult and expensive, since pinpointing just the right step-size may require many training episodes, and waste precious time or money. Let’s verify our conjecture numerically, and plot the true tilted loss we obtain for every step size.
Our first component is a function that computes the tilted loss for a given dataset. To ensure numerical accuracy and stability, I want to reuse PyTorch’s built-in logsumexp function. Using the fact that \(\frac{1}{n} = \exp(-\ln(n))\), we can reformulate the tilted loss as:
\[\frac{1}{t} \ln\left( \frac{1}{n} \sum_{i=1}^n \exp(t f(\mathbf{w}, \mathbf{x}_i)) \right) = \frac{1}{t} \ln\left( \sum_{i=1}^n \exp\left(t f(\mathbf{w}, \mathbf{x}_i) - \ln(n)\right) \right)\]Now we can use logsumexp to compute the tilted loss without numerical issues:
import math

def compute_tilted_loss(w_fit, x, y, t):
    x = torch.as_tensor(x)
    y = torch.as_tensor(y)
    n = x.shape[0]
    squared_residuals = torch.square(x @ w_fit - y)
    return torch.logsumexp(t * squared_residuals - math.log(n), dim=-1) / t
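As a sanity check, for losses small enough that the naive computation does not overflow, the logsumexp formulation agrees with the direct definition of the tilted loss:

```python
import math
import torch

t = 1.0
residuals = torch.tensor([0.1, 0.2, 0.3], dtype=torch.float64)
n = residuals.shape[0]

# direct definition: (1/t) * ln( mean( exp(t * r_i^2) ) )
naive = torch.log(torch.mean(torch.exp(t * residuals ** 2))) / t
# logsumexp formulation, using 1/n = exp(-ln(n))
stable = torch.logsumexp(t * residuals ** 2 - math.log(n), dim=-1) / t

print(torch.allclose(naive, stable))  # True
```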
Next, we write a function that experiments with our stripped equivalent of the tilted loss for various step-sizes with SGD:
from functools import partial
from tqdm.auto import tqdm

def eval_sgd_exponential_loss(x, y, lrs, t=1):
    losses = []
    for lr in tqdm(lrs):
        optim_factory = partial(torch.optim.SGD, lr=lr)
        w_exp_fit = pytorch_fit(x, y, StrippedTiltedLoss(MSELoss(), t), optim_factory)
        w_exp_loss = compute_tilted_loss(w_exp_fit, x, y, t).item()
        losses.append(w_exp_loss)
    return losses
Let’s plot the results for a fine grid of step-sizes:
lrs = np.geomspace(1e-7, 1e-4, 60).tolist()
losses = eval_sgd_exponential_loss(x, y, lrs)
plt.plot(lrs, losses)
plt.xscale('log')
plt.yscale('log')
plt.xlim([np.min(lrs), np.max(lrs)])
The x-axis is the step size, whereas the y-axis is the achieved value of the tilted loss. We can see that only a tiny range, somewhere between \(10^{-5}\) and \(2 \times 10^{-5}\), results in reasonable performance. Above that, we have no data - that’s because the losses array contains NaNs. Our gradients exploded, and the optimizer failed for step sizes that are just a bit too large. Now let’s do some interesting tricks.
I do not recall exactly where, but in an exercise on convex optimization I encountered this simple fact:
\[\ln(z) = \min_v \{ z \exp(v) - v \} - 1\]The proof is a two liner - just take the derivative inside the \(\min\) w.r.t \(v\) and equate it with zero:
\[\begin{align*}\tag{V} &z \exp(v) - 1 = 0 \\ &v = -\ln(z) \end{align*}\]Substitute this \(v=-\ln(z)\) into the expression inside the \(\min\) to get the desired result. Remember this formula for \(v\) - it will be useful for PyTorch parameter initialization later in this post.
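A quick numerical check of this identity, using the formula for the minimizer we just derived:

```python
import numpy as np

z = 3.7
v_star = -np.log(z)  # the minimizer from equation (V)

# the objective attains ln(z) at the minimizer
value_at_min = z * np.exp(v_star) - v_star - 1.0
print(np.isclose(value_at_min, np.log(z)))  # True

# a coarse grid search confirms that v_star is indeed the minimizer
vs = np.linspace(v_star - 2.0, v_star + 2.0, 10001)
objective = z * np.exp(vs) - vs - 1.0
print(np.isclose(vs[np.argmin(objective)], v_star, atol=1e-3))  # True
```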
Writing a function as a minimum of a family of functions is called a variational formulation. So what we have is a variational formulation of the logarithm. Now let’s use it to do something useful. It’s a bit technical, but the end-result leads us in the right direction. We use the variational formulation of the logarithm in the tilted loss, and obtain the following:
\[\begin{aligned} \ln\left( \frac{1}{n} \sum_{i=1}^n \exp(t f(\mathbf{w}, \mathbf{x}_i)) \right) &= \min_v \left\{ \left( \frac{1}{n} \sum_{i=1}^n \exp(t f(\mathbf{w}, \mathbf{x}_i)) \right) \exp(v) - v \right\} - 1 \\ &= \min_v \left\{ \frac{1}{n} \sum_{i=1}^n \exp(t f(\mathbf{w}, \mathbf{x}_i) + v) - v \right\} - 1 \\ &= \min_v \left\{ \frac{1}{n} \sum_{i=1}^n \left( \exp(t f(\mathbf{w}, \mathbf{x}_i) + v) - v \right) \right\} - 1 \end{aligned}\]When minimizing, the constant \(-1\) at the end can also be stripped. Thus, training with a tilted loss amounts to solving the minimization problem
\[\min_{\mathbf{w}, v} \quad \frac{1}{n} \sum_{i=1}^n \left( \exp(t f(\mathbf{w}, \mathbf{x}_i) + v) - v \right)\]Let’s call this the variational formulation of the tilted loss. At first it appears we have not done anything useful - it’s again an average of exponentials. But a closer examination reveals that if \(v\) is negative it balances away large losses, and has a ‘stabilizing’ effect. So is it negative? Well, recalling equation \((V)\), the one I asked to remember, at the optimum we must have:
\[v = -\ln\left (\frac{1}{n} \sum_{i=1}^n \exp(t f(\mathbf{w}, \mathbf{x}_i)) \right )\]Except for extremely rare cases, losses are typically non-negative, their exponentials are at least 1, and therefore the argument of the logarithm is at least 1. This means that for any reasonable ML task, \(v\) at the optimum is indeed negative. But we do not care only about the optimum; we care about what happens throughout the entire training process. Thus, this variational formulation is just a heuristic which may help, and we need to check whether it indeed does so.
An important component of making this heuristic useful is initializing \(v\) properly, so that the first epochs don’t fail on large gradients. But it’s not hard - the formula for \(v\) above is great for initialization as well.
Note that we need to learn an additional parameter \(v\), which is conceptually not part of the model, but rather a part of the loss. Moreover, we will need a way to initialize our loss object, so that we can initialize \(v\). To facilitate the above, our losses will inherit from torch.nn.Module, and will have an additional initialize method. Here is the stripped loss. Note that its initialization method does nothing:
class StrippedTiltedLoss(torch.nn.Module):
    def __init__(self, underlying_loss, t):
        super().__init__()
        self.underlying_loss = underlying_loss
        self.t = t

    def initialize(self, preds, targets):
        pass

    def forward(self, pred, target):
        exp_losses = torch.exp(self.t * self.underlying_loss(pred, target))
        return exp_losses.mean()
Here is the variational loss we just derived:
class VariationalTiltedLoss(torch.nn.Module):
    def __init__(self, underlying_loss, t):
        super().__init__()
        self.underlying_loss = underlying_loss
        self.t = t
        self.v = torch.nn.Parameter(torch.tensor(0.))

    def initialize(self, preds, targets):
        with torch.no_grad():
            init_losses = self.underlying_loss(preds, targets)
            n = init_losses.shape[0]
            v_init = -torch.logsumexp(self.t * init_losses - math.log(n), dim=-1)
            self.v.copy_(v_init)

    def forward(self, pred, target):
        sample_losses = self.underlying_loss(pred, target)
        exp_tilted_losses = torch.exp(self.t * sample_losses + self.v) - self.v
        return exp_tilted_losses.mean()
It is assumed here that the underlying loss does not perform any reduction, such as averaging or summing the individual losses. We need the individual sample losses for our purposes, and thus we’re doing the averaging ourselves. In this post we train on individual samples, so no averaging is required, but in general people train on mini-batches of samples, and I wanted to make the code above re-usable in this scenario as well. This means that when passing the underlying loss, we need to tell it to avoid reducing, e.g. torch.nn.MSELoss(reduction='none').
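For example, here is what reduction='none' looks like with the MSE loss - we get one squared error per sample, and our loss classes take care of the averaging themselves:

```python
import torch

preds = torch.tensor([1.0, 2.0, 3.0])
targets = torch.tensor([1.5, 2.0, 2.0])

# one squared error per sample, no reduction
per_sample = torch.nn.MSELoss(reduction='none')(preds, targets)
print(per_sample)         # tensor([0.2500, 0.0000, 1.0000])
print(per_sample.mean())  # the value reduction='mean' would have returned
```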
To use losses that contain an additional parameter, and to call the initialize method properly, we make a small modification to the pytorch_fit function we wrote above:
from itertools import chain

def pytorch_fit(x, y, criterion, make_optim, n_epochs=100):
    # convert numpy arrays to torch tensors
    x = torch.as_tensor(x)
    y = torch.as_tensor(y)
    dim = x.shape[1]
    dtype = x.dtype

    # perform proper initialization
    w_fit = torch.nn.Parameter(torch.zeros(dim, dtype=dtype))
    criterion.initialize(x @ w_fit, y)

    # create optimizer - don't forget that the `criterion` now also has parameters!
    parameters_to_learn = chain(criterion.parameters(), [w_fit])
    optim = make_optim(parameters_to_learn)

    # regular PyTorch training loop
    for epoch in range(n_epochs):
        for xi, yi in zip(x, y):
            pred = torch.dot(xi, w_fit)
            loss = criterion(pred, yi)
            optim.zero_grad()
            loss.backward()
            optim.step()

    return w_fit.detach()
Now let’s compare our two tilted loss formulations, the stripped and the variational, in terms of their sensitivity to exactly pinpointing a narrow interval of step-sizes. We will expand our test a little bit, and try it for several values of the temperature \(t\) and two different optimizers - SGD and Adam. To that end, we write a function that tries both losses for a given set of temperatures, a set of learning rates, and a given optimizer. The results are gathered in a Pandas DataFrame.
from functools import partial
from itertools import product
import pandas as pd

def compare_tilted_formulations(x, y, ts, lrs, optim_ctor):
    records = []
    for t, lr in tqdm(list(product(ts, lrs))):
        make_optim_fn = partial(optim_ctor, lr=lr)
        mse_loss = MSELoss(reduction='none')
        w_exp_fit = pytorch_fit(x, y, StrippedTiltedLoss(mse_loss, t), make_optim_fn)
        w_tilted_fit = pytorch_fit(x, y, VariationalTiltedLoss(mse_loss, t), make_optim_fn)
        w_exp_loss = compute_tilted_loss(w_exp_fit, x, y, t).item()
        w_tilted_loss = compute_tilted_loss(w_tilted_fit, x, y, t).item()
        records.append(dict(t=t, lr=lr, loss_type='stripped', value=w_exp_loss))
        records.append(dict(t=t, lr=lr, loss_type='variational', value=w_tilted_loss))
    return pd.DataFrame.from_records(records)
Now let’s conduct our experiments. We begin with SGD:
lrs = np.geomspace(1e-7, 1e-4, 20).tolist()
ts = [0.25, 1, 2]
sgd_eval_recs = compare_tilted_formulations(x, y, ts=ts, lrs=lrs, optim_ctor=torch.optim.SGD)
To plot the results, it will be convenient to use seaborn:
import seaborn as sns
sns.set()
g = sns.relplot(data=sgd_eval_recs,
hue='loss_type', col='t', x='lr', y='value', alpha=0.5)
g.set(xscale='log')
g.set(ylim=[0, 10])
g.set(yscale='asinh')
We see that \(t=0.25\) is not very challenging for either formulation. With \(t=1\) we already begin to see the difference - the stripped variant works well only in a very narrow interval, whereas the variational variant works in a significantly larger range. With \(t=2\), SGD fails altogether with the stripped variant.
But maybe SGD is less robust, so let’s try Adam:
adam_lrs = np.geomspace(1e-5, 1e2, 30)
adam_eval_recs = compare_tilted_formulations(x, y, ts=ts, lrs=adam_lrs, optim_ctor=torch.optim.Adam)
g = sns.relplot(data=adam_eval_recs,
hue='loss_type', col='t', x='lr', y='value', alpha=0.5)
g.set(xscale='log')
g.set(ylim=[0, 20])
g.set(yscale='asinh')
Indeed, Adam is more robust. It does not fail miserably for larger values of \(t\), but we can see a similar phenomenon: as \(t\) increases, it becomes harder to ‘pinpoint’ just the right step-size. So indeed, this simple trick may reduce the computational cost of training a model with a tilted loss, and may significantly reduce the cost of training models when it’s important to perform well not just on average, but also close to the worst case.
The variational formulation of the logarithm indeed helped, at least on the line fitting exercise. I do not wish to invest the resources required to try it out with a neural network, but I hope the code here is generic enough for you to try it out on your own ML task.
A variational formulation for the logarithm is nice, but we might have been able to do much better if we had a useful variational formulation for the entire LogSumExp function. I personally do not know if such a closed-form formulation exists, but if you do - talk to me, and let’s write a paper!
I would like to thank Prof. Tian Li and her colleagues for their paper on tilted losses. It was enlightening, and I recommend you read it. I’d also like to thank Prof. Bach for providing the inspiration.
Throughout this series, beginning here, we demonstrated various properties and applications of polynomial regression on different datasets. We used the Bernstein basis to demonstrate the importance of choosing a “good” polynomial basis, and that other well-known bases may be unfit for machine learning tasks. In this post, which concludes the series, we will try to understand why this happens by studying various regularization properties of the bases we encountered, including the standard power basis, the Chebyshev basis, the Legendre basis, and the Bernstein basis.
In this post we will not prove theorems, but rather demonstrate using an example. Therefore, there will be plenty of code and plots. And along the way we’ll learn some interesting tricks with linear regression and polynomials. All the code from this post is available in this notebook. So let’s get started!
When fitting data representing some “truth”, typically we observe a finite number of samples, and fit a model based on these samples. But what if, in theory, we did this again and again, and every time obtained a different set of samples? Well, our hope is that the corresponding models would, on average, be “close” to the truth.
Let’s simulate this by fitting a univariate function using polynomial regression. We will sample the function at some randomly chosen points, fit a polynomial, and repeat the experiment again and again.
Let’s write the components that will facilitate our experiments. We start by defining some interesting function \(f\) to approximate:
import numpy as np
def f(x):
    first = np.sin(np.pi * (x + np.abs(x - 0.75) ** 1.5))
    second = np.cos(0.8 * np.pi * (np.abs(x + 0.75) ** 1.5 - x)) + 0.5
    return 2 * np.minimum(first, second)
Seems like we have some weird parameters there. To see why, let’s see what \(f\) looks like on \([-1, 1]\):
import matplotlib.pyplot as plt
plot_n = 10000
plot_xs = np.linspace(-1, 1, plot_n)
plt.plot(plot_xs, f(plot_xs))
plt.show()
So I played a bit with the code in f above until I got this interesting plot - \(f\) is composed of two functions joined at a “kink”. We will work with the interval \([-1, 1]\), since it’s easy to work with using NumPy’s built-in functions for polynomial bases.
To conduct our experiment, we will need a way to fit a polynomial using the basis of our choice by sampling some random points in \([-1, 1]\). Then, we would like to evaluate our polynomial on a dense grid of points in \([-1, 1]\) and compare it with the “truth”. To that end, we implemented a function that samples random fitting points, solves an (optionally regularized) least-squares problem for the polynomial coefficients, and evaluates the resulting polynomial at a given set of points. Let’s see its code:
def fit_eval(vander_fn, eval_at, deg=20, n=40, reg_coef=0., ax=None, **plot_kws):
    # sample points for fitting
    xs = np.random.uniform(-1, 1, n)
    ys = f(xs)

    # build matrix and vector for least-squares regression
    vander_mat = vander_fn(xs, deg)
    if reg_coef > 0:
        coef_mat = np.identity(1 + deg) * np.sqrt(reg_coef)
        vander_mat = np.concatenate([vander_mat, coef_mat], axis=0)
        ys = np.concatenate([ys, np.zeros(1 + deg)], axis=-1)

    # compute polynomial coefficients
    coef = np.linalg.lstsq(vander_mat, ys, rcond=None)[0]

    # evaluate the polynomial at `eval_at`
    return vander_fn(eval_at, deg) @ coef
We can see by the default parameters that by default we sample 40 points, and fit a polynomial of degree 20. We will not change that throughout the post, but you are welcome to play with the notebook as you wish.
Note the code in the if reg_coef > 0 block in the function above. There is an interesting trick there for re-using existing NumPy routines for least-squares regression, which are reliable and numerically stable, to solve a regularized least-squares problem. Note that:
\[\| A w - y\|^2 + \alpha \|w\|^2 =
\sum_{j=1}^m ( a_j^T w - y_j )^2 + \sum_{i=1}^n (\sqrt{\alpha} w_i - 0)^2 =
\left\|
\begin{bmatrix}
A \\ \sqrt{\alpha} I
\end{bmatrix}
w - \begin{bmatrix}
y \\ 0
\end{bmatrix}
\right\|^2\]
So, a regularized regression problem is reducible to a simple least-squares problem, with a data matrix padded by \(\sqrt{\alpha} I\), and the labels vector padded by zeros. That’s exactly what the code in the aforementioned block does. The reason I used this trick is to rely only on a small set of Python libraries, and avoid dependencies on scikit-learn and others.
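We can verify the padding trick numerically by comparing it with the closed-form ridge solution \(w = (A^T A + \alpha I)^{-1} A^T y\) on random data (the data below is made up just for this check):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))
y = rng.standard_normal(50)
alpha = 0.1

# padded least-squares formulation
A_pad = np.concatenate([A, np.sqrt(alpha) * np.identity(5)], axis=0)
y_pad = np.concatenate([y, np.zeros(5)])
w_pad = np.linalg.lstsq(A_pad, y_pad, rcond=None)[0]

# closed-form ridge solution
w_ridge = np.linalg.solve(A.T @ A + alpha * np.identity(5), A.T @ y)

print(np.allclose(w_pad, w_ridge))  # True
```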
So let’s see how it works:
from numpy.polynomial.polynomial import polyvander

ys = fit_eval(polyvander, plot_xs, deg=10)
plt.plot(plot_xs, ys, color='k')
plt.plot(plot_xs, f(plot_xs), color='r')
plt.show()
So doing it once, is nice, but we aim to repeat this experiment many times. So let’s write a function that does just that:
def fit_eval_samples(eval_at, vander_fn, n_iter=1000, **fit_eval_kwargs):
    y_samples = []
    for i in range(n_iter):
        ys = fit_eval(vander_fn, eval_at, **fit_eval_kwargs)
        y_samples.append(ys)
    y_samples = np.vstack(y_samples)
    y_true = f(eval_at)
    return y_samples, y_true
This function samples n_iter sets of points, computes n_iter least-squares fits, and evaluates each of the resulting polynomials at the evaluation points eval_at. The results are organized into the rows of the matrix y_samples - the \(i\)-th row contains the values of the \(i\)-th polynomial. For convenience, it also computes the values of our “true” function \(f(x)\) at the evaluation points.
Finally, since we will be working on \([-1, 1]\), while the natural approximation interval of the Bernstein basis is \([0, 1]\), let’s implement the Bernstein Vandermonde function we already encountered, with appropriate scaling:
from scipy.stats import binom as binom_dist
def bernvander(x, deg, lb=-1, ub=1):
    x = np.array(x)
    # rescale from [lb, ub] to [0, 1], the natural domain of the Bernstein basis
    x = np.clip((x - lb) / (ub - lb), 0.0, 1.0)
    return binom_dist.pmf(np.arange(1 + deg), deg, x.reshape(-1, 1))
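One nice sanity check: since binomial probabilities sum to one, every row of the Bernstein Vandermonde matrix sums to 1 - the basis forms a partition of unity. The function is restated below so the snippet is self-contained, with the rescaled argument clipped to \([0, 1]\):

```python
import numpy as np
from scipy.stats import binom as binom_dist

def bernvander(x, deg, lb=-1, ub=1):
    # rescale from [lb, ub] to [0, 1] and evaluate the Bernstein basis
    x = np.array(x)
    x = np.clip((x - lb) / (ub - lb), 0.0, 1.0)
    return binom_dist.pmf(np.arange(1 + deg), deg, x.reshape(-1, 1))

mat = bernvander(np.linspace(-1, 1, 7), deg=5)
print(np.allclose(mat.sum(axis=1), 1.0))  # True
```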
Now that we have all our ingredients in place, let’s visualize and analyze bias and variance.
Let’s use our fit_eval_samples function to plot a large number of fits according to a given polynomial basis. Our function plots the fits and the true function returned by fit_eval_samples, and also the average polynomial, obtained by averaging the returned y_samples array. Each one of the fit polynomials will be drawn in a transparent manner, so that we can see their density. Moreover, since badly fit polynomials “go crazy” near the boundaries, the function also accepts the y-axis limits for plotting.
def plot_basis_fits(n_iter, vander_fn, ylim=[-3, 3], alpha=0.1, ax=None, **fit_eval_kwargs):
    ax = ax or plt.gca()
    plot_xs = np.linspace(-1, 1, 10000)
    samples, y_true = fit_eval_samples(plot_xs, vander_fn, n_iter, **fit_eval_kwargs)
    mean_poly = np.mean(samples, axis=0)
    ax.plot(plot_xs, samples.T, 'r', alpha=alpha)
    ax.plot(plot_xs, mean_poly, color='blue', linewidth=2.)
    ax.plot(plot_xs, y_true, 'k--')
    ax.set_ylim(ylim)
Now we can use it to plot all our bases. The following function does just that by showing 100 different fit polynomials for every basis:
from numpy.polynomial.chebyshev import chebvander
from numpy.polynomial.legendre import legvander
from numpy.polynomial.polynomial import polyvander

def plot_all_bases(**plot_loop_kwargs):
    fig, axs = plt.subplots(2, 2, figsize=(10, 8))
    plot_basis_fits(100, polyvander, ax=axs[0, 0], **plot_loop_kwargs)
    axs[0, 0].set_title('Standard')
    plot_basis_fits(100, chebvander, ax=axs[0, 1], **plot_loop_kwargs)
    axs[0, 1].set_title('Chebyshev')
    plot_basis_fits(100, legvander, ax=axs[1, 0], **plot_loop_kwargs)
    axs[1, 0].set_title('Legendre')
    plot_basis_fits(100, bernvander, ax=axs[1, 1], **plot_loop_kwargs)
    axs[1, 1].set_title('Bernstein')
    plt.show()
Let’s use it to visualize bias and variance without regularization:
plot_all_bases(reg_coef=0.)
The dashed black line is the true function, the blue line is the average among the 100 polynomials, and the transparent red lines are the polynomials themselves. As expected, without regularization, the fit polynomials with all bases appear ‘crazy’. Moreover, even the average polynomial appears to be far away from the true function near the boundaries.
The difference between the average polynomial and the true function is called the bias, whereas the spread of the different polynomials around the average is called the variance. Of course, the bias and variance are different at every point. Near \(x=0\), they are pretty small, and as we approach the boundaries, both increase^{1}. The bias-variance tradeoff is a well-known concept in statistics, and a large body of research has been invested to its study. Here, we will try to approach it from a more empirical perspective.
Ideally, we would like both the bias and the variance to be small. A small bias means that on average over the samples, our polynomial represents the truth. A low variance means that regardless of the specific data-set, we will always be close to this truth, meaning that we will generalize well.
Typically, when measuring the bias, the deviation from the true function is squared. This is convenient, since the squared bias and the variance form a decomposition of the mean squared error. Informally speaking:
\[\mathbb{E}[\mathrm{error}^2] = \mathbb{E}[\mathrm{bias}^2] + \mathrm{variance}\]A more formal introduction can be found in the above-linked wikipedia article, and references therein.
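This decomposition is easy to verify numerically on simulated estimates of a scalar quantity (the numbers below are made up just for the check):

```python
import numpy as np

rng = np.random.default_rng(42)
truth = 1.0
# simulated estimates: biased upward by 0.3, with noise of standard deviation 0.5
estimates = truth + 0.3 + 0.5 * rng.standard_normal(100_000)

mse = np.mean((estimates - truth) ** 2)
bias_sq = (np.mean(estimates) - truth) ** 2
variance = np.var(estimates)

print(np.isclose(mse, bias_sq + variance))  # True
```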
Both bias and variance can be reduced either by a better estimation procedure or by more data. Let’s see what happens when we add more data - instead of using our default, and sampling 40 points for our least-squares regression, we will sample 200:
plot_all_bases(n=200, reg_coef=0.)
Appears much better! The mean polynomial almost coincides with the true function, and there’s little wiggling of individual polynomials around it. In this example, we have a very simple data-set with only one feature. In practice, data-sets are finite and contain plenty of features - not always enough to learn all coefficients of a model to the required precision.
As we pointed out, the bias-variance tradeoff also depends on the estimation procedure, and not only on the amount of data we have. Often in practice, our data-sets are finite, and we need to adapt the estimation procedure as well. Here we have two means to affect the estimation procedure - the choice of the basis, and the regularization coefficient. So let’s try using some regularization:
plot_all_bases(reg_coef=1e-3, ylim=[-1.5, 1.5])
We can see that with this regularization coefficient, the Bernstein and the power bases behave much better, with the Bernstein basis being a bit better in terms of variance. It even looks close to what we can achieve with more data. But why do the two other bases perform poorly? Maybe we’re under-regularizing them? Let’s try a larger coefficient:
plot_all_bases(reg_coef=1e-1, ylim=[-1.5, 1.5])
We can clearly see we are over-regularizing the standard and the Bernstein bases - the average polynomial, the blue line, begins to get smoother and farther away from the true function. This is the expected increase of bias as a result of regularization. But the Chebyshev and Legendre bases are still a bit wiggly - so let’s try an even more aggressive regularization:
plot_all_bases(reg_coef=1, ylim=[-1.5, 1.5])
It appears that these two bases do not improve with regularization - their bias increases, without a significant improvement to the variance. So both components of the estimation procedure are crucial - the regularization and the basis.
Visualization is nice, but let’s measure these effects. We will try several regularization strengths, and for each strength - we will compute the average bias and variance we encounter among the evaluation points. Since both the bias and the variance vary along the interval \([-1, 1]\), we will average the squared bias and variance over the interval. Computing the mean squared bias and the variance for a given basis is straightforward:
def bias_variance_tradeoff(vander_fn, reg_coefs, nx=1000, **fit_eval_kwargs):
    xs = np.linspace(-1, 1, nx)
    biases = []
    variances = []
    for reg_coef in reg_coefs:
        y_samples, y_true = fit_eval_samples(xs, vander_fn, reg_coef=reg_coef, **fit_eval_kwargs)
        # mean squared bias over the samples, averaged over the interval [-1, 1]
        bias_agg = np.mean((np.mean(y_samples, axis=0) - y_true) ** 2)
        # variance over the samples, averaged over the interval [-1, 1]
        variance_agg = np.mean(np.var(y_samples, axis=0))
        biases.append(bias_agg)
        variances.append(variance_agg)
    return biases, variances
So now let’s define regularization coefficients and conduct our experiment. It will be convenient to gather all the data into a Pandas dataframe, and plot it later:
import pandas as pd
reg_coefs = np.geomspace(1e-8, 1e2, 64)
biases, vars = bias_variance_tradeoff(polyvander, reg_coefs)
power_df = pd.DataFrame({'reg_coef': reg_coefs, 'bias': biases, 'variance': vars, 'basis': 'Power'})
biases, vars = bias_variance_tradeoff(chebvander, reg_coefs)
cheb_df = pd.DataFrame({'reg_coef': reg_coefs, 'bias': biases, 'variance': vars, 'basis': 'Chebyshev'})
biases, vars = bias_variance_tradeoff(legvander, reg_coefs)
leg_df = pd.DataFrame({'reg_coef': reg_coefs, 'bias': biases, 'variance': vars, 'basis': 'Legendre'})
biases, vars = bias_variance_tradeoff(bernvander, reg_coefs)
ber_df = pd.DataFrame({'reg_coef': reg_coefs, 'bias': biases, 'variance': vars, 'basis': 'Bernstein'})
all_df = pd.concat([power_df, cheb_df, leg_df, ber_df])
Let’s see a sample of our data-frame:
print(all_df)
Here is the result I got:
reg_coef bias variance basis
0 1.000000e-08 0.012936 4.517494 Power
1 1.441220e-08 0.024666 1.617035 Power
2 2.077114e-08 0.017855 2.934622 Power
3 2.993577e-08 0.025465 1.882831 Power
4 4.314402e-08 0.005049 2.150978 Power
.. ... ... ... ...
59 2.317818e+01 0.594062 0.000286 Bernstein
60 3.340485e+01 0.614740 0.000133 Bernstein
61 4.814372e+01 0.629843 0.000076 Bernstein
62 6.938568e+01 0.640700 0.000036 Bernstein
63 1.000000e+02 0.648413 0.000018 Bernstein
[256 rows x 4 columns]
For every regularization coefficient and basis choice, we have a bias and a variance measurement. Let’s plot these using a Seaborn scatterplot:
import seaborn as sns
to_plot = all_df.copy()
to_plot['size'] = np.log(to_plot['reg_coef'])
sns.scatterplot(data=to_plot, x='bias', y='variance', hue='basis', size='size')
ax = plt.gca()
ax.set_xscale('log')
ax.set_yscale('log')
ax.legend(bbox_to_anchor=(1.05,1), loc=2, borderaxespad=0.)
plt.show()
For each basis we have a different color, and the size of the points corresponds to the regularization coefficients, with larger points representing more aggressive regularization:
Now we begin to understand the full picture. First, polynomials need regularization. A mild coefficient results in both a high measured bias and a high variance. At some point, we land on a nice trade-off curve. Moreover, we now see why the Bernstein basis was so successful in all our experiments before - it achieves a much better bias-variance tradeoff. Looking at the bottom-left part, we can see that it can achieve both low bias and low variance.
Does this phenomenon have a formal proof? Well, I wasn’t able to find one. But I was able to find a proof for a different sampling procedure, where the noise doesn’t come from sampling different data-sets, but from introducing noise in the \(y\) coordinate in this elaborate stackexchange answer. I suppose a similar proof could be derived for the case of a random data-set selection. I don’t believe I discovered something new here, but merely learned something that is “known” but was not formally published. If you have found something - please email me, and I will be glad to update the post and give a proper credit.
To summarize, we now understand that not only the class of models is important, but also its representation. Indeed, the class of polynomial functions can be represented in different bases, but some bases perform better than others for machine-learning tasks. So I think the most important lesson from this series is that:
We cannot judge a class of models on its own, without considering a concrete representation, since its performance is often tightly coupled to the representation of choice.
Now let’s move to studying a different, but important theoretical property of the Bernstein basis.
Throughout this series we regularized the derivative of the polynomials to achieve a certain goal, either smoothness or monotonicity. I will concentrate on monotonicity, since it’s simpler to study. We saw earlier in this series a theorem with an interesting consequence - if the coefficients of a polynomial in Bernstein form are monotone increasing, then the polynomial is monotone increasing. A similar result holds for a decreasing sequence.
But what about the inverse implication? Does every monotone-increasing polynomial also have a monotone-increasing coefficient sequence when represented in the Bernstein basis? Well, the answer is NO. This means that when we fit a polynomial in the Bernstein basis with an increasing coefficient sequence, like we did in a previous post, we are not guaranteed to get the best-fit increasing polynomial. There may be another increasing polynomial, whose Bernstein coefficients are not increasing, that achieves a smaller training error.
So an interesting question begs to be answered - how far apart are increasing polynomials, and Bernstein polynomials with increasing coefficients? If this distance is small, the above fact should not bother us too much. But what if it is large? So let’s try to study this question empirically. This is not a formal proof, but rather a demonstration of some interesting phenomena.
What we will do is generate random increasing polynomials, and then find a least-squares fit to them using the Bernstein basis with increasing coefficients. So first we need to understand one important thing - how do we generate a polynomial that is guaranteed to be increasing on \([-1, 1]\)? Having understood that, we will be able to write a simple Python function that generates random increasing polynomials. Our plan is simple - we first learn how to generate a non-negative polynomial on \([-1, 1]\), and then compute its integral to obtain an increasing polynomial. To that end, let’s dive into century-old results on non-negative polynomials, initiated by none other than David Hilbert.
We begin by introducing the concept of a polynomial that is a sum of squares. A polynomial \(p(x)\) is a sum of squares if there exist polynomials \(q_1, \dots, q_m\) such that:
\[p(x) = q_1^2(x) + \dots + q_m^2(x)\]Obviously, any sum-of-squares polynomial is non-negative on the entire real line. But we are not interested in the entire real line - only in the interval \([-1, 1]\). It turns out that there is a theorem characterizing the non-negative polynomials on an interval.
Theorem (Blekherman et al. ^{2}, Theorem 3.72)
The polynomial \(p: \mathbb{R} \to \mathbb{R}\) is non-negative on \([a, b]\) if and only if:
- if the degree of \(p\) is the odd number \(2d+1\), then \(p(x) = (x - a) \cdot s(x) + (b-x) \cdot t(x)\) where \(s(x)\) and \(t(x)\) are sum of square polynomials of degree at most \(2d\).
- if the degree of \(p\) is the even number \(2d\), then \(p(x) = s(x) + (x-a) \cdot (b-x) \cdot t(x)\) where \(s(x)\) and \(t(x)\) are sum of squares polynomials of degrees at most \(2 d\) and \(2 d - 2\), respectively.
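Before writing the general sampler, here is a quick self-contained check of the even-degree case with \(a = -1, b = 1\) (the polynomials below are arbitrary toy choices of mine): any \(s(x) + (1 - x^2) t(x)\) with sum-of-squares \(s\) and \(t\) is indeed non-negative on \([-1, 1]\):

```python
import numpy as np
from numpy.polynomial.polynomial import Polynomial

rng = np.random.default_rng(0)
q1 = Polynomial(rng.standard_cauchy(3))  # degree-2 polynomial
q2 = Polynomial(rng.standard_cauchy(2))  # degree-1 polynomial

s = q1 * q1                         # sum-of-squares piece of degree 4
t = q2 * q2                         # sum-of-squares piece of degree 2
p = s + Polynomial([1, 0, -1]) * t  # s(x) + (1 - x^2) * t(x), even degree 4

xs = np.linspace(-1, 1, 2001)
# p may well be negative outside [-1, 1], but not inside it
```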
So it appears that all we have to do is generate two random sum-of-squares polynomials, and construct \(p\) of the desired degree by multiplying and adding polynomials. To that end, we will use the np.polynomial.polynomial.Polynomial class, which supports operations such as addition and multiplication on arbitrary polynomials. So let’s implement the nonneg_on_biunit function, which generates a non-negative polynomial on the bi-unit interval \([-1, 1]\). To generate the coefficients of the sum-of-squares polynomials, we will rely on the Cauchy distribution due to its heavy tails, so that we obtain a large variety of coefficients. Random numbers are generated by the np.random.standard_cauchy function.
from numpy.polynomial.polynomial import Polynomial

def sum_of_squares_poly(half_deg):
    num_coef = 1 + half_deg
    first_poly = Polynomial(np.random.standard_cauchy(num_coef))
    second_poly = Polynomial(np.random.standard_cauchy(num_coef))
    return first_poly * first_poly + second_poly * second_poly

def nonneg_on_biunit(deg):
    if deg == 0:
        return Polynomial(np.random.standard_cauchy(1)) ** 2
    if deg % 2 == 0:  # even degree: s(x) + (1 - x^2) t(x)
        s = sum_of_squares_poly(deg // 2)
        t = sum_of_squares_poly(deg // 2 - 1)
        return s + t * Polynomial(np.array([1, 0, -1]))
    else:  # odd degree: (1 - x) s(x) + (1 + x) t(x)
        s = sum_of_squares_poly((deg - 1) // 2)
        t = sum_of_squares_poly((deg - 1) // 2)
        return Polynomial(np.array([1, -1])) * s + \
               Polynomial(np.array([1, 1])) * t
Does it work? Let’s see! We will plot randomly generated non-negative polynomials of various degrees:
np.random.seed(42)
fig, ax = plt.subplots(3, 3, figsize=(10, 10))
xs = np.linspace(-1, 1, 1000)
for deg, ax in enumerate(ax.flatten()):
    ys = nonneg_on_biunit(deg)(xs)
    ax.plot(xs, ys)
    ax.set_title(f'deg = {deg}')
plt.show()
They indeed appear to be diverse, and non-negative. So let’s continue with our plan of creating increasing polynomials by integrating non-negative polynomials.
def increasing_on_biunit(deg):
    nonneg = nonneg_on_biunit(deg - 1)
    return nonneg.integ()
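A quick self-contained sanity check of the integrate-a-non-negative-polynomial idea, with a toy polynomial of my own choosing:

```python
import numpy as np
from numpy.polynomial.polynomial import Polynomial

q = Polynomial([0.5, -1.0, 2.0])  # arbitrary polynomial with no real roots
p = (q * q).integ()               # the integral of q^2 must be increasing

xs = np.linspace(-1, 1, 1001)
ys = p(xs)
# ys is strictly increasing over the grid
```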
Let’s plot them, and see what we got:
np.random.seed(42)
fig, ax = plt.subplots(3, 3, figsize=(10, 10))
xs = np.linspace(-1, 1, 1000)
for deg, ax in enumerate(ax.flatten(), start=1):
    ys = increasing_on_biunit(deg)(xs)
    ax.plot(xs, ys)
    ax.set_title(f'deg = {deg}')
plt.show()
Now that our code appears to work, let’s proceed to fitting Bernstein-form polynomials with increasing coefficients to these increasing polynomials. First, for every degree, we will fit a Bernstein-form polynomial of the same degree, to see if an increasing polynomial of degree \(d\) can be represented by a Bernstein polynomial with increasing coefficients of degree \(d\). To that end, we will use our beloved CVXPY package again, to constrain the Bernstein coefficients:
import cvxpy as cp
np.random.seed(42)
fig, ax = plt.subplots(4, 3, figsize=(10, 14))
xs = np.linspace(-1, 1, 1000)
for deg, ax in enumerate(ax.flatten(), start=1):
    true_ys = increasing_on_biunit(deg)(xs)
    vander_mat = bernvander(xs, deg)
    coef_var = cp.Variable(1 + deg)
    objective = cp.Minimize(cp.sum_squares(vander_mat @ coef_var - true_ys))
    prob = cp.Problem(objective, constraints=[cp.diff(coef_var) >= 0])
    prob.solve()
    coef = coef_var.value
    bern_ys = bernvander(xs, deg) @ coef
    ax.plot(xs, true_ys, 'k--')
    ax.plot(xs, bern_ys, 'r')
    ax.set_title(f'deg = {deg}')
plt.show()
We see that the fit is often very close, but doesn’t exactly match. Obviously, it is possible to find an increasing polynomial of the corresponding degree that fits each function, since each function is itself an increasing polynomial of that degree. But the constraint that the Bernstein coefficients increase restricts us to a subset of the increasing polynomials, and we are unable to fit exactly.
But what happens if we allow fitting Bernstein form polynomials of higher degrees? Say, for an increasing polynomial of degree d, we will fit a Bernstein polynomial with increasing coefficients of degree 2d. Maybe increasing the degree helps reduce the gap?
import cvxpy as cp
np.random.seed(42)
fig, ax = plt.subplots(4, 3, figsize=(10, 12))
xs = np.linspace(-1, 1, 1000)
for deg, ax in enumerate(ax.flatten(), start=1):
    true_ys = increasing_on_biunit(deg)(xs)
    fit_deg = 2 * deg  # <--- NOTE HERE
    vander_mat = bernvander(xs, fit_deg)
    coef_var = cp.Variable(1 + fit_deg)
    objective = cp.Minimize(cp.sum_squares(vander_mat @ coef_var - true_ys))
    prob = cp.Problem(objective, constraints=[cp.diff(coef_var) >= 0])
    prob.solve()
    coef = coef_var.value
    bern_ys = bernvander(xs, fit_deg) @ coef
    ax.plot(xs, true_ys, 'k--')
    ax.plot(xs, bern_ys, 'r')
    ax.set_title(f'deg = {deg}')
plt.show()
It appears it does. Empirically, we are able to fit increasing polynomials of degree d with polynomials of degree 2d with increasing Bernstein coefficients. In practice, this gap means that we may need to use a higher polynomial degree than is theoretically necessary when fitting a polynomial to an increasing function.
Is a factor of two always sufficient to close the representation gap? If not - what is the relationship between the degree required for a fit with increasing coefficients and the degree of the original polynomial? Personally, I don’t know. But it’s an interesting question. What is known is that among the bases that provide direct shape control via their coefficients, known as “normalized totally-positive bases”, the Bernstein basis is the unique basis with “optimal” shape control properties. The meaning of the optimality criterion is out of the scope of this post, but I refer the readers to the paper Shape preserving representations and optimality of the Bernstein basis^{3} for reference. As we mentioned before, these properties are extensively used in computer graphics to represent curves and shapes, including all the text you are reading on the screen.
It is possible to use the minimal degree via semidefinite optimization by exploiting the theory of sum-of-squares polynomials, but this is out of the scope of this post. Moreover, the heavier computational burden of semidefinite optimization typically makes this technique less applicable to fitting models to large amounts of data. Interested readers are referred to the book Semidefinite Optimization and Convex Algebraic Geometry^{2}.
This exploration of polynomial regression certainly taught me a lot. I learned that polynomials are not to be feared when designing regression models, when using a proper basis. The simplicity of polynomials is appealing - they have only one hyperparameter to tune, which is their degree. There are plenty of other function bases that can be used when fitting a nonlinear model using linear regression techniques, such as cubic splines, or radial basis functions. All of them are very useful, but they require more hyperparameter tuning, which may result in longer model fitting times and a slower model experimentation feedback loop. For example, splines, which are essentially piecewise polynomials with continuous (higher order) derivatives, require specifying their degree, the number of break-points, and the degree of derivative continuity. But given enough computational resources and data, these techniques probably perform better than polynomials.
I hope you enjoyed this series as much as I did, and the next posts will probably be on a different subject.
In theory, a least-squares estimator is unbiased. But it appears we don’t have enough samples so that our average polynomial approaches the true mean, and it appears as if we have bias. ↩
Grigoriy Blekherman, Pablo A. Parrilo, and Rekha R. Thomas. Semidefinite Optimization and Convex Algebraic Geometry. SIAM (2012). ↩ ↩^{2}
J.M. Carnicer and J.M. Peña. Shape preserving representations and optimality of the Bernstein basis. Advances in Computational Mathematics 1 (1993) ↩
We continue our journey in the land of polynomial regression and the Bernstein basis, which we began in this post, through another interesting landscape. There are many settings in which a model is trained to predict an abstract, meaningless score, which is later used for classification or ranking. For example, consider a linear support-vector machine (SVM) classifier. When classifying a sample, we only care about the sign of the score. If we take our SVM and multiply its weight vector by a positive factor, we obtain exactly the same classifier. The scores are meaningless - only their sign is meaningful. Another example is the learning-to-rank setting. Our model produces a score that is used to rank items, and present the top-\(k\) items to the user. The scores themselves are not meaningful - only their relative order is.
Statistically-inclined readers probably know that logistic regression tends to produce calibrated models out of the box. However, when the underlying model is a neural network rather than a linear model, this is not the case. Indeed, a well-known paper by Guo et al.^{1} shows that modern neural networks are typically miscalibrated.
In many applications we want the score to represent some interpretable confidence in the prediction, and one way to achieve this is calibration. A model is calibrated if the scores it produces are probabilities that are consistent with the empirical frequency of observing a positive sample. One formal way to define calibration is as follows:
A supervised model \(f\) trained on samples \((x, y)\sim \mathcal{D}\) with \(y \in \{0, 1\}\) is calibrated if
\[\mathbb{E}[y|f(x)] = f(x)\]
To make the discussion about calibration simpler, we avoid the discussion of models that produce multiple scores for a sample, such as multi-class and multi-label classifiers.
Calibrated models are important, for example, in online advertising. We truly care that a model produces the probability of a click, or the probability of a purchase, since these probabilities are used to compute expectations. Another context is safety-critical applications - there is a difference between a \(0.00001\%\) probability that our self-driving car observed a human, and a \(0.1\%\) probability.
One way to achieve calibration is to stack a calibrator model \(\omega: \mathbb{R} \to [0, 1]\) on top of an already-trained model \(f\), so that the predictions become:
\[\omega(f(x))\]If the calibrator \(\omega\) is an increasing function, classification or ranking remain unaffected, since the relative order of scores is preserved.
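It is easy to convince ourselves of this numerically - any strictly increasing \(\omega\) preserves the order of the scores, and hence argsort-based rankings. Here is a tiny sketch with a sigmoid standing in for \(\omega\), on made-up scores:

```python
import numpy as np

scores = np.array([-1.3, 0.2, 2.5, 0.9, -0.1])  # hypothetical model scores
calibrated = 1 / (1 + np.exp(-scores))          # a strictly increasing omega

# the ranking induced by the scores is unchanged
same_ranking = np.array_equal(np.argsort(scores), np.argsort(calibrated))
```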
In this post we will use the power of the Bernstein basis in controlling the function we fit to devise monotonic calibrators \(\omega\) that fit the requirements. Then, we compare the performance of our Bernstein calibrators to two built-in calibrators available in the Scikit-Learn package, which implement two well-known algorithms widely used to calibrate models throughout the industry. I recommend readers take a look at the model calibration tutorial of the Scikit-Learn package as well. As usual, the code is available in a notebook you can try in Google Colab.
The idea of using shape-restricted polynomial regression for probabilistic calibration was, to the best of my knowledge, first proposed by Wang et al.^{2} in 2019, so it’s quite new.
Throughout this post we will work with a support-vector machine classifier trained to predict diabetes on the CDC Diabetes Prediction Dataset. To easily access it, we can install the ucimlrepo package, which allows us to download it from the UCI machine-learning dataset repository:
pip install ucimlrepo
And now we can access it:
from ucimlrepo import fetch_ucirepo
# fetch dataset
cdc_diabetes_health_indicators = fetch_ucirepo(id=891)
# data (as pandas dataframes)
X = cdc_diabetes_health_indicators.data.features
y = cdc_diabetes_health_indicators.data.targets
Let’s print a summary of the data:
print(X.describe().transpose()[['min', '25%', '50%', '75%', 'max']])
The following output is produced:
min 25% 50% 75% max
HighBP 0.0 0.0 0.0 1.0 1.0
HighChol 0.0 0.0 0.0 1.0 1.0
CholCheck 0.0 1.0 1.0 1.0 1.0
BMI 12.0 24.0 27.0 31.0 98.0
Smoker 0.0 0.0 0.0 1.0 1.0
Stroke 0.0 0.0 0.0 0.0 1.0
HeartDiseaseorAttack 0.0 0.0 0.0 0.0 1.0
PhysActivity 0.0 1.0 1.0 1.0 1.0
Fruits 0.0 0.0 1.0 1.0 1.0
Veggies 0.0 1.0 1.0 1.0 1.0
HvyAlcoholConsump 0.0 0.0 0.0 0.0 1.0
AnyHealthcare 0.0 1.0 1.0 1.0 1.0
NoDocbcCost 0.0 0.0 0.0 0.0 1.0
GenHlth 1.0 2.0 2.0 3.0 5.0
MentHlth 0.0 0.0 0.0 2.0 30.0
PhysHlth 0.0 0.0 0.0 3.0 30.0
DiffWalk 0.0 0.0 0.0 0.0 1.0
Sex 0.0 0.0 0.0 1.0 1.0
Age 1.0 6.0 8.0 10.0 13.0
Education 1.0 4.0 5.0 6.0 6.0
Income 1.0 5.0 7.0 8.0 8.0
We see that most features are actually binary. Let’s print the number of unique values of the non-binary columns:
X[['BMI', 'GenHlth', 'MentHlth', 'PhysHlth', 'Age', 'Education', 'Income']].nunique()
We get the following output:
BMI 84
GenHlth 5
MentHlth 31
PhysHlth 31
Age 13
Education 6
Income 8
dtype: int64
Therefore, I decided to treat only a few of the non-binary features as numerical, and the rest as categorical:
categorical_cols = ['HighBP', 'HighChol', 'CholCheck', 'Smoker', 'Stroke',
'HeartDiseaseorAttack', 'PhysActivity', 'Fruits', 'Veggies',
'HvyAlcoholConsump', 'AnyHealthcare', 'NoDocbcCost', 'GenHlth',
'DiffWalk', 'Sex', 'Education',
'Income']
numerical_cols = ['Age', 'BMI', 'MentHlth', 'PhysHlth']
Now let’s do the usual magic, and split the data. However, in this post, in addition to train and test sets we will have a calibration set whose purpose is training the calibrator model \(\omega\). At this stage we will not use it, but let’s be prepared. We will use 15% for the test set, another 15% for the calibration set, and 70% for the train set:
from sklearn.model_selection import train_test_split
X_remain, X_test, y_remain, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
X_train, X_calib, y_train, y_calib = train_test_split(X_remain, y_remain, test_size=0.15/0.85, random_state=43)
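The second test_size is 0.15/0.85 because after removing 15% for the test set, 85% of the data remains, and we want the calibration set to be 15% of the original. A toy sketch of my own verifying the resulting proportions (sizes may be off by one due to rounding):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.zeros(1000)

# first split off 15% for the test set; 85% remains
X_rem_t, X_te_t, y_rem_t, y_te_t = train_test_split(X_toy, y_toy, test_size=0.15, random_state=0)
# 0.15 / 0.85 of the remaining 85% is 15% of the original data
X_tr_t, X_cal_t, y_tr_t, y_cal_t = train_test_split(X_rem_t, y_rem_t, test_size=0.15 / 0.85, random_state=0)
# roughly 700 / 150 / 150 for train / calibration / test
```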
And now let’s fit our linear support vector machine model. As usual, categorical features will be one-hot encoded, whereas numerical features will be min-max scaled. We use the LinearSVC class for the classifier, with the class_weight='balanced' option to handle our imbalanced dataset, and the dual=False option to make training faster in our case, where the samples greatly outnumber the features:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ('feature_transformer', ColumnTransformer(
        transformers=[
            ('categorical', OneHotEncoder(handle_unknown='infrequent_if_exist', min_frequency=10), categorical_cols),
            ('numerical', MinMaxScaler(), numerical_cols)
        ]
    )),
    ('classifier', LinearSVC(class_weight='balanced', dual=False))
])
Now let’s fit our model to the training data, and report its classification performance on the test set:
from sklearn.metrics import classification_report
pipeline.fit(X_train, y_train)
print(classification_report(y_test, pipeline.predict(X_test)))
I got the following output:
precision recall f1-score support
0 0.96 0.71 0.82 32840
1 0.30 0.79 0.44 5212
accuracy 0.72 38052
macro avg 0.63 0.75 0.63 38052
weighted avg 0.87 0.72 0.77 38052
Looking at the “macro avg” row, we see that it’s not the best classifier in the world, but it has some discriminative power - the precision, recall, and F1 score are indeed reasonable. Enough to move on.
Before explaining how calibration is evaluated, a short methodological note. Calibration should be evaluated on a held-out test set, not on the train set. If calibration is important for hyperparameter tuning, then we also need to evaluate it on the validation set. Now let’s talk about how we evaluate calibration.
One way to evaluate calibration is visually, using calibration curves, also known as calibration reliability diagrams^{3}. These curves attempt to directly visualize how far we are from our calibration criterion:
\[\mathbb{E}[y|f(x)] = f(x)\]In theory, we would like to plot the points
\[(\mathbb{E}[y|f(x)], f(x)) \qquad (x, y) \sim \mathcal{D},\]but we cannot, since we only have access to a finite data-set, not the distribution that generated it. Thus, in practice we resort to approximation by binning the outputs of \(f(x)\) into sub-intervals of \([0, 1]\) and using empirical averages instead of expectations. This is implemented by Scikit-Learn in the CalibrationDisplay class.
Let’s try it out with a very naive calibrator - we will just take the output of our SVM, and pass it through the sigmoid function \(\sigma(y) = (1+\exp(-y))^{-1}\). This will produce values in \([0, 1]\) that we can use:
from sklearn.calibration import CalibrationDisplay

y_pred = pipeline.decision_function(X_test)
y_pred = 1 / (1 + np.exp(-y_pred))
CalibrationDisplay.from_predictions(y_test, y_pred, n_bins=10, name='SVM + Sigmoid')
plt.show()
I got the following plot:
In a perfectly calibrated classifier, the blue calibration curve should align with the dotted black line - the average prediction in each bin should align with the empirical positive sample frequency.
Beyond visual means, we have metrics that quantify the miscalibration error. The simplest of these is the Expected Calibration Error (ECE), whose computation is similar to how calibration curves are constructed. It is just the weighted average of the calibration errors in the bins, where the weights are the number of samples in each bin. Since Scikit-Learn is an open-source project, I implemented the ECE metric based on its code for computing calibration curves:
# implementation based on the code of calibration_curve in sklearn:
# https://github.com/scikit-learn/scikit-learn/blob/872124551/sklearn/calibration.py#L927
def ece(y_true, y_prob, n_bins=10):
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    binids = np.searchsorted(bins[1:-1], y_prob)
    bin_sums = np.bincount(binids, weights=y_prob, minlength=len(bins))
    bin_true = np.bincount(binids, weights=y_true, minlength=len(bins))
    bin_total = np.bincount(binids, minlength=len(bins))
    nonzero = bin_total != 0
    prob_true = bin_true[nonzero] / bin_total[nonzero]
    prob_pred = bin_sums[nonzero] / bin_total[nonzero]
    return np.sum(np.abs(prob_true - prob_pred) * bin_total[nonzero]) / np.sum(bin_total)
Now we can use this function to print the ECE of our naive sigmoid calibrator:
print(f'ECE = {ece(y_test, y_pred)}')
The output is:
ECE = 0.30732883382620496
At this stage it doesn’t tell us much, until we begin improving it.
In addition to the ECE, the standard cross-entropy loss and the mean-squared error loss can also help us quantify miscalibration. In the context of probability calibration, the mean-squared error is known as the Brier score. However, they have an inherent weakness - they quantify both miscalibration and discriminative power^{4}. For example, if our classifier is better at differentiating positive and negative samples than a competing classifier, these two losses show an improvement, even if its calibration error remains the same. Alternatively, improving only the calibration error without improving discriminative power also reduces these losses. Since in this post the classifier remains identical due to the monotonic nature of the calibrators, and only its calibration error changes, these two metrics are useful. Both are implemented in Scikit-Learn, so we can use them:
from sklearn.metrics import (
brier_score_loss,
log_loss
)
brier_score_loss(y_test, y_pred), log_loss(y_test, y_pred)
The output is
(0.19391187428082382, 0.5748329800290517)
Now let’s improve those numbers using calibrators designed for the task. The ECE is not a very reliable metric due to the approximation by binning, but we still include it, since it is widely used in papers on model calibration.
The simplest well-known calibrator is the Platt calibrator^{5}, which essentially boils down to fitting a logistic regression model whose only feature is the original model’s prediction. Namely, the Platt calibrator is a function of the form
\[\omega(y) = \frac{1}{1 + \exp(a y + b)},\]where \(a\) and \(b\) are learned parameters. Where are these parameters learned from? That’s what we have the above-mentioned calibration set for. It is just the training set of the calibrator, and the training samples are \(\{ (f(x_i), y_i) \}_{i \in C}\), where \(C\) is the calibration set.
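Conceptually, fitting the Platt calibrator is just logistic regression with a single feature - the raw score. Here is a minimal self-contained sketch on synthetic scores (note that Platt’s original procedure also smooths the targets, which this sketch omits):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
raw_scores = rng.normal(size=500)           # hypothetical uncalibrated scores
p_true = 1 / (1 + np.exp(-3 * raw_scores))  # synthetic "true" link
labels = (rng.random(500) < p_true).astype(int)

# logistic regression with the raw score as the only feature
platt = LogisticRegression(C=1e4)           # effectively unregularized
platt.fit(raw_scores.reshape(-1, 1), labels)
probs = platt.predict_proba(raw_scores.reshape(-1, 1))[:, 1]
# probs = sigmoid(a * score + b) - a monotone map of the scores
```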
In Scikit-Learn, the Platt calibrator is implemented in the CalibratedClassifierCV class. This class is pretty versatile, and has various options for how a calibrator is trained, and what exactly is used as the calibration set. To keep this post simple, we will use the cv='prefit' option, which means that our model has been pre-fit, and we need to fit just the calibrator \(\omega\) itself. The Platt calibrator is chosen using the method='sigmoid' constructor option. So let’s try it out!
from sklearn.calibration import CalibratedClassifierCV
sigmoid_calib = CalibratedClassifierCV(pipeline, method='sigmoid', cv='prefit')
sigmoid_calib.fit(X_calib, y_calib)
To evaluate it, let’s implement a short function that will show all the three metrics we care about:
def estimator_errors(estimator, X_test, y_test):
    y_pred = estimator.predict_proba(X_test)[:, 1]
    return f'ECE = {ece(y_test, y_pred):.5f}, Brier = {brier_score_loss(y_test, y_pred):.5f}, LogLoss = {log_loss(y_test, y_pred):.5f}'
Now let’s plot the calibration curve and the metrics!
CalibrationDisplay.from_estimator(sigmoid_calib, X_test, y_test, n_bins=10)
plt.title(estimator_errors(sigmoid_calib, X_test, y_test))
plt.show()
Here is the output:
Looks a bit better. Some of the points lie on the diagonal line of perfect calibration, whereas others do not. However, looking at the metrics (in the title), we see that all of them improved significantly, by orders of magnitude. This means that the points that appear mis-calibrated in the curve probably have few samples in their bins, so their effect on the miscalibration error is small. It would be nice if Scikit-Learn could show the weight of each point via the point size, so we could see this visually - but unfortunately it does not.
The second well-known calibrator is a piecewise-constant function of the form
\[\omega(y) = \begin{cases} y_0 & y \leq x_1 \\ y_1 & x_1 < y \leq x_2 \\ \vdots & \\ y_{n-1} & x_{n-1} < y \leq x_n \\ y_n & y > x_n \end{cases},\]where \(y_0 < y_1 < \dots < y_n\), and \(x_1, \dots, x_n\) are learned from the calibration set. The mathematical procedure for fitting such a function to data is called isotonic regression^{6}^{7}, and using it for calibration is done by passing method='isotonic' to the CalibratedClassifierCV class. So let’s try it out as well!
isotonic_calib = CalibratedClassifierCV(pipeline, method='isotonic', cv='prefit')
isotonic_calib.fit(X_calib, y_calib)
CalibrationDisplay.from_estimator(isotonic_calib, X_test, y_test, n_bins=10)
plt.title(estimator_errors(isotonic_calib, X_test, y_test))
plt.show()
I obtained the following plot:
Looks much better! And the three metrics were improved as well. We can also plot the piecewise-constant function:
calibrator = isotonic_calib.calibrated_classifiers_[0].calibrators[0]
plt.plot(calibrator.f_.x, calibrator.f_.y)
plt.title(f'Calibrator with {len(calibrator.f_.x)} points')
plt.show()
We can see that our classifier produced scores approximately between -2 and 2 on the calibration set, and the best-fit piecewise constant function has 118 “jumps”.
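To build some intuition for the isotonic fit itself, here is isotonic regression applied directly to synthetic scores - a standalone sketch, separate from the CalibratedClassifierCV pipeline above (note that Scikit-Learn’s IsotonicRegression interpolates linearly between the fitted points at prediction time):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
scores = rng.normal(size=300)  # hypothetical raw model scores
labels = (rng.random(300) < 1 / (1 + np.exp(-2 * scores))).astype(float)

iso = IsotonicRegression(y_min=0, y_max=1, out_of_bounds='clip')
fitted = iso.fit_transform(scores, labels)  # monotone fit to the labels
```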
There are two interesting observations we can make. First, a piecewise-constant function can harm ranking and classification, since it’s not strictly increasing by definition. Two samples having different, but nearby scores might be mapped to the same output. Second, the number of jumps may become large as the size of the calibration data-set increases. This means that inference may also become expensive, since computing \(\omega(y)\) requires performing a lookup for the interval \(y\) belongs to. So can we do better?
In a previous post in our adventures with polynomial regression we saw an interesting theorem. Suppose our calibrator is:
\[\omega(y) = \sum_{i=0}^n u_i b_{i,n}(y),\]where \(\{ b_{i,n} \}_{i=0}^n\) is the \(n\)-degree Bernstein basis. Then having \(u_{i+1} \geq u_i\) implies that \(\omega\) is increasing. Moreover, if for at least one index \(j\) we have \(u_{j+1} > u_j\), then \(\omega\) is strictly increasing. Therefore, we can fit our calibrator to the calibration set \((\hat{y}_1, y_1), \dots, (\hat{y}_m, y_m)\) by solving a constrained polynomial regression problem using the Bernstein basis. As long as not all coefficients are equal, we will obtain a strictly increasing calibrator!
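The monotonicity claim is easy to verify numerically. A small sanity check, evaluating the Bernstein basis with SciPy's binomial PMF as we do throughout this series:

```python
import numpy as np
from scipy.stats import binom

n = 5
u = np.array([0.0, 0.1, 0.1, 0.4, 0.8, 1.0])  # non-decreasing, not all equal

ys = np.linspace(0, 1, 1001)
# row j holds the basis values b_{0,n}(ys[j]), ..., b_{n,n}(ys[j])
basis = binom.pmf(np.arange(n + 1), n, ys[:, None])
omega = basis @ u

# the resulting curve is strictly increasing on [0, 1]
assert np.all(np.diff(omega) > 0)
```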
Denoting \(\mathbf{b}(y) = (b_{0,n}(y), \dots, b_{n,n}(y))^T\), we need to solve the following constrained least-squares regression problem:
\[\begin{aligned} \min_{\mathbf{u}} &\quad \sum_{j=1}^m \left( \mathbf{b}(\hat{y}_j)^T \mathbf{u} - y_j \right)^2 \\ \text{s.t.} &\quad 0 \leq u_i \leq 1, & i = 0, \dots, n \\ &\quad u_{i} \geq u_{i-1}, & i = 1, \dots, n \end{aligned}\]Letting \(\hat{\mathbf{V}}\) be the Vandermonde matrix whose rows are \(\mathbf{b}(\hat{y}_j)^T\), we can write the above problem as:
\[\begin{aligned} \min_{\mathbf{u}} &\quad \| \hat{\mathbf{V}} \mathbf{u} - \mathbf{y} \|^2 \\ \text{s.t.} &\quad 0 \leq u_i \leq 1, & i = 0, \dots, n \\ &\quad u_{i} \geq u_{i-1}, & i = 1, \dots, n \end{aligned}\]Having found the optimal solution \(\mathbf{u}^*\) our calibrator’s prediction becomes:
\[\omega(y) = \mathbf{b}(y)^T \mathbf{u}^*.\]The only issue stems from the fact that the underlying model’s predictions \(y_j\) are not necessarily in \([0, 1]\), but the Bernstein basis requires inputs in that range. As we already saw, the remedy comes from a simple min-max scaling. So let’s code our Bernstein calibrator!
As you probably guessed, the Calibrator is just another Scikit-Learn classifier that applies the calibration procedure on top of a wrapped uncalibrated classifier. We use CVXPY, which we encountered before in this series, to solve the above-mentioned minimization problem. For the code in this post to work correctly, please make sure you have version 1.5 or above. So here it is:
import numpy as np
import cvxpy as cp
from scipy.stats import binom
from sklearn.base import ClassifierMixin, MetaEstimatorMixin, BaseEstimator
from sklearn.svm import LinearSVC

class BernsteinCalibrator(BaseEstimator, ClassifierMixin, MetaEstimatorMixin):
    def __init__(self, estimator=None, *, degree=20):
        self.estimator = estimator
        self.degree = degree

    def fit(self, X, y):
        pred = self._get_predictions(X)
        self.classes_ = self.estimator.classes_

        # compute min / max for scaling
        self.min_ = np.min(pred)
        self.max_ = np.max(pred)

        # compute Vandermonde matrix
        vander = self._bernvander(pred)

        # find Bernstein polynomial coefficients
        self.coef_ = self._fit_coef(vander, y)
        return self

    def _fit_coef(self, vander, y):
        coef = cp.Variable(self.degree + 1, bounds=[0, 1])
        objective = cp.norm(vander @ coef - y)
        constraints = [cp.diff(coef) >= 0]
        prob = cp.Problem(cp.Minimize(objective), constraints)
        prob.solve()
        return coef.value

    def predict_proba(self, X):
        pred = self._get_predictions(X)
        calibrated = self._calibrate_scores(pred).reshape(-1, 1)
        return np.concatenate([1 - calibrated, calibrated], axis=1)

    def _calibrate_scores(self, pred):
        vander = self._bernvander(pred)
        return np.clip(vander @ self.coef_, 0, 1)

    def _bernvander(self, pred):
        scaled = (pred - self.min_) / (self.max_ - self.min_)
        scaled = np.clip(scaled, 0, 1)
        basis_idx = np.arange(1 + self.degree)
        return binom.pmf(basis_idx, self.degree, scaled[:, None])

    def _get_predictions(self, X):
        estimator = self.estimator
        if estimator is None:
            estimator = LinearSVC(random_state=0, dual="auto")
        if hasattr(estimator, 'predict_proba'):
            pred = estimator.predict_proba(X)
            return pred[:, 1]
        elif hasattr(estimator, 'decision_function'):
            return estimator.decision_function(X)
        else:
            raise RuntimeError('Estimator must have either predict_proba or decision_function method')
The fit
method computes the minimum and maximum observed values for the min-max scaling mechanism. Then it fits the coefficients using CVXPY by calling the _fit_coef
method. The predict_proba
method just evaluates the fitted Bernstein polynomial after computing the predictions of the underlying estimator. The _bernvander
method computes the Vandermonde matrix for a vector of predictions after applying the min-max scaling. The rest of the code is straightforward boilerplate.
Now let’s try it out, and fit a polynomial calibrator of degree 20:
bernstein_calib = BernsteinCalibrator(pipeline, degree=20)
bernstein_calib.fit(X_calib, y_calib)
CalibrationDisplay.from_estimator(bernstein_calib, X_test, y_test, n_bins=10)
plt.title(estimator_errors(bernstein_calib, X_test, y_test))
plt.show()
I got the following result:
Nice! All error metrics became smaller. The calibration curve looks good. And our model has a much smaller number of parameters than the isotonic one - only 21 coefficients, instead of 118. We can also plot the calibration function \(\omega(y)\), with the Bernstein coefficients as control points:
xs = np.linspace(bernstein_calib.min_, bernstein_calib.max_, 1000)
ys = bernstein_calib._calibrate_scores(xs)
plt.plot(xs, ys, label='Calibrator', color='blue')
ctrl_xs = np.linspace(bernstein_calib.min_, bernstein_calib.max_, bernstein_calib.degree + 1)
ctrl_ys = bernstein_calib.coef_
plt.scatter(ctrl_xs, ctrl_ys, label='Coefficients', color='red')
plt.legend()
plt.show()
I got the following plot:
So maybe we can work with an even lower degree? Let’s try fitting a polynomial calibrator of degree 10:
bernstein_calib_lowdeg = BernsteinCalibrator(pipeline, degree=10)
bernstein_calib_lowdeg.fit(X_calib, y_calib)
CalibrationDisplay.from_estimator(bernstein_calib_lowdeg, X_test, y_test, n_bins=10)
plt.title(estimator_errors(bernstein_calib_lowdeg, X_test, y_test))
plt.show()
The result is below:
The curve looks a bit worse, but the metrics still outperform isotonic regression.
But wait! A calibrator is essentially a probability prediction model, and we know just the right tool for the task - logistic regression. In fact, we already saw that the Platt calibrator is a simple logistic regression model whose only feature is the underlying prediction. So maybe, using logistic rather than least-squares regression, we can work with an even lower degree polynomial and still achieve good calibration.
For logistic regression, our loss function, or the minimization objective, needs to be modified. Moreover, logistic regression coefficients may, in theory, go to infinity (or minus infinity) if the optimal prediction for some feature combinations is close to zero or one. This may cause the minimization procedure to declare that the problem is not solvable. Thus, we cap the Bernstein coefficients to the range \([-15, 15]\), in addition to the monotonicity constraint. This ensures that our model’s scores are also in that range, and the sigmoid function evaluated at the endpoints of this range is, for all practical purposes, 0 and 1. So, our modified convex optimization problem becomes:
\[\begin{aligned} \min_{\mathbf{u}} &\quad \sum_{j=1}^m \left( \ln(1+\exp(\mathbf{b}(\hat{y}_j)^T \mathbf{u})) - y_j\mathbf{b}(\hat{y}_j)^T \mathbf{u} \right) \\ \text{s.t.} &\quad -15 \leq u_i \leq 15, & i = 0, \dots, n \\ &\quad u_{i} \geq u_{i-1}, & i = 1, \dots, n \end{aligned}\]Inspecting the objective carefully - it is just the regular loss of the logistic regression problem. To implement it, we just override the _fit_coef
method to implement the above minimization problem as the fitting procedure, and the _calibrate_scores
method to apply the sigmoid function after computing the Bernstein polynomial. So here it is:
class BernsteinSigmoidCalibrator(BernsteinCalibrator):
    def _fit_coef(self, vander, y):
        coef = cp.Variable(self.degree + 1, bounds=[-15, 15])
        scores = vander @ coef
        objective = cp.sum(cp.logistic(scores) - cp.multiply(y, scores))
        constraints = [cp.diff(coef) >= 0]
        prob = cp.Problem(cp.Minimize(objective), constraints)
        prob.solve()
        return coef.value

    def _calibrate_scores(self, pred):
        vander = self._bernvander(pred)
        return self._sigmoid(vander @ self.coef_)

    @staticmethod
    def _sigmoid(scores):
        return np.piecewise(
            scores,
            [scores > 0],
            [lambda z: 1 / (1 + np.exp(-z)), lambda z: np.exp(z) / (1 + np.exp(z))]
        )
Note that to avoid overflows and other numerical issues, we carefully implemented the sigmoid function to handle positive and negative values differently. Now let’s try it out with a degree of 10.
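As a quick standalone sanity check of this sigmoid trick (independent of the calibrator classes above): the exponent is only ever evaluated at non-positive arguments, so no overflow can occur even for extreme scores.

```python
import numpy as np

def stable_sigmoid(scores):
    # each branch calls exp on a non-positive argument only
    return np.piecewise(
        scores,
        [scores > 0],
        [lambda z: 1 / (1 + np.exp(-z)), lambda z: np.exp(z) / (1 + np.exp(z))]
    )

z = np.array([-1000.0, -15.0, 0.0, 15.0, 1000.0])
with np.errstate(over='raise'):  # turn any overflow into an exception
    out = stable_sigmoid(z)
print(out)
```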
bernstein_sigmoid_calib = BernsteinSigmoidCalibrator(pipeline, degree=10)
bernstein_sigmoid_calib.fit(X_calib, y_calib)
CalibrationDisplay.from_estimator(bernstein_sigmoid_calib, X_test, y_test, n_bins=10)
plt.title(estimator_errors(bernstein_sigmoid_calib, X_test, y_test))
plt.show()
The result is:
Nice! With a degree of 10, we achieved a result similar to least-squares fitting with a degree of 20. To summarize, here are the metrics. The best metric is highlighted.
Calibrator | ECE | Brier | LogLoss |
---|---|---|---|
Platt | 0.01295 | 0.09746 | 0.31395 |
Isotonic | 0.00737 | 0.09707 | 0.31289 |
Bernstein (deg = 20) | 0.00622 | 0.09703 | 0.31195 |
Bernstein (deg = 10) | 0.00634 | 0.09708 | 0.31231 |
Bernstein logistic regression (deg = 10) | 0.00653 | 0.09704 | 0.31199 |
We saw an interesting application of the ability to control the derivative of polynomials represented in the Bernstein basis for model calibration. I welcome you to try it out in your own work, wherever controlling derivatives in the context of your machine-learned models is important.
As a side note, there are other bases that allow controlling derivatives in a similar manner, for example, the well-known B-Spline basis for polynomial splines. But that’s out of scope for our series - my objective was showing that polynomial regression is not that “scary overfitting monster”, but rather a useful tool in machine learning.
My next, and final post in the series will be of a more exploratory nature - of trying to understand why the Bernstein basis is useful for fitting polynomial models from a different, statistical perspective. Stay tuned!
Chuan Guo, Geoff Pleiss, Yu Sun, Kilian Q. Weinberger. On Calibration of Modern Neural Networks. Proceedings of the 34th International Conference on Machine Learning (2017) ↩
Yongqiao Wang, Lishuai Li, Chuangyin Dang. Calibrating Classification Probabilities with Shape-Restricted Polynomial Regression. IEEE Transactions on Pattern Analysis and Machine Intelligence 41.8 (2019) ↩
Morris H. Degroot, Stephen E. Fienberg. The Comparison and Evaluation of Forecasters. Journal of the Royal Statistical Society: Series D (The Statistician) (1983) ↩
Allan H. Murphy. A New Vector Partition of the Probability Score. Journal of Applied Meteorology and Climatology (1973). ↩
Platt, John. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. Advances in Large Margin Classifiers 10.3 (1999) ↩
R.E. Miles. The Complete Amalgamation into Blocks, by Weighted Means, of a Finite Set of Real Numbers. Biometrika 46.3 (1959) ↩
D. J. Bartholomew. A Test of Homogeneity for Ordered Alternatives II. Biometrika 46.3 (1959) ↩
In the previous post we built a Scikit-Learn component that you can already integrate into your pipelines to train models whose numerical features are represented in the Bernstein basis. Feature interaction is a simple and effective feature engineering trick, and this post builds upon this knowledge and improves the component we built by introducing pairwise interactions between numerical features. This post is a direct continuation of the previous post, and I will assume that you are familiar with what we built so far. If what you see here looks like Klingon, and you don’t know Klingon, please take your time to read the posts on polynomial features from the beginning. As previously, the code is available in a notebook that you can open in Google Colab. Due to the nature of this post, the notebook extends the code from the last post with additional experiments, rather than being written from scratch.
The BernsteinFeatures
component we created last time allowed us to construct a Scikit-Learn pipeline, train, and make predictions using the following simple lines of code:
categorical_features = [...] # the list of categorical feature names
numerical_features = [...] # the list of numerical feature names
my_estimator = ... # Ridge / Lasso / LogisticRegression / ...
pipeline = training_pipeline(BernsteinFeatures(), my_estimator, categorical_features, numerical_features)
pipeline.fit(train_df, train_df[label_column])
test_predictions = pipeline.predict(test_df)
We constructed pipelines of the following generic form to facilitate using polynomial bases over data in a compact interval, by first rescaling it:
Our BernsteinFeatures
transformer generated Bernstein basis features for each column separately. As a baseline, we also used the PowerBasisFeatures
transformer that generated the power-basis features. We will extend both classes in a way that allows us to construct pairwise interactions between numerical features by generating tensor product bases:
Such bases can be used to learn a function of any given pair of features \(x\) and \(y\) with linear coefficients:
\[f(x, y) = \sum_{i=0}^n \sum_{j=0}^n \alpha_{i,j} b_{i,n}(x) b_{j,n}(y)\]The basis \(b_{0,n}, \dots, b_{n,n}\), in this post, can be either the power-basis for n-th degree polynomials, or the Bernstein basis.
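The tensor product of two per-column bases is a row-wise outer product, which NumPy broadcasting expresses in one line; the transformer below uses exactly this trick. A small standalone illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
u = rng.random((4, 3))  # basis of feature x: 4 samples, n + 1 = 3 basis functions
v = rng.random((4, 3))  # basis of feature y

# row-wise outer product, flattened: one column per basis pair (i, j)
pair = (u[:, None, :] * v[:, :, None]).reshape(4, -1)
print(pair.shape)  # (4, 9)

# each row of `pair` is the flattened outer product of that row's bases
assert np.allclose(pair[1].reshape(3, 3), np.outer(v[1], u[1]))
```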
As an additional baseline, we will also use Scikit-Learn’s built-in PolynomialFeatures class, which does something similar, but not identical, with the power basis. Given \(m\) numerical features and degree \(n\), it allows learning functions of the form
\[f(x_1, \dots, x_m) = \sum_{\substack{i_1 + \dots + i_m \leq n \\ i_k \geq 0}} \alpha_{i_1, \dots, i_m} \left( \prod_{k=1}^m x_k^{i_k} \right).\]This looks “scary”, but it is essentially a generic multivariate polynomial of degree \(n\) in the variables \(x_1, \dots, x_m\). So let’s get started!
Without further ado, let’s extend the base-class for both polynomial feature transformers from the previous post to accept an additional interactions
argument in its constructor, and produce tensor-product features. Again, we need to take care not to introduce an additional “bias term”, and to that end, we eliminate the first basis function, as in the previous post:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from itertools import combinations

class PolynomialBasisTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, degree=5, bias=False, na_value=0., interactions=False):
        self.degree = degree
        self.bias = bias
        self.na_value = na_value
        self.interactions = interactions

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Check if X is a Pandas DataFrame and convert to NumPy array
        if hasattr(X, 'values'):
            X = X.values

        # Ensure X is a 2D array
        if X.ndim == 1:
            X = X.reshape(-1, 1)

        # Get the number of columns in the input array
        n_rows, n_features = X.shape

        # Compute the specific polynomial basis for each column
        basis_features = [
            self.feature_matrix(X[:, i])
            for i in range(n_features)
        ]

        # create interaction features - basis tensor products
        if self.interactions:
            interaction_features = [
                (u[:, None, :] * v[:, :, None]).reshape(n_rows, -1)
                for u, v in combinations(basis_features, 2)
            ]
            result_basis = interaction_features
        else:
            result_basis = basis_features

        if not self.bias:
            result_basis = [basis[:, 1:] for basis in result_basis]

        return np.hstack(result_basis)

    def feature_matrix(self, column):
        vander = self.vandermonde_matrix(column)
        return np.nan_to_num(vander, nan=self.na_value)

    def vandermonde_matrix(self, column):
        raise NotImplementedError("Subclasses must implement this method.")
Our concrete Bernstein and power basis transformers from the previous post remain the same - their job is implementing the vandermonde_matrix
method. We include them here for completeness:
import numpy.polynomial.polynomial as poly
from scipy.stats import binom

class BernsteinFeatures(PolynomialBasisTransformer):
    def vandermonde_matrix(self, column):
        basis_idx = np.arange(1 + self.degree)
        basis = binom.pmf(basis_idx, self.degree, column[:, None])
        return basis

class PowerBasisFeatures(PolynomialBasisTransformer):
    def vandermonde_matrix(self, column):
        return poly.polyvander(column, self.degree)
The rest of the components we built in the previous post remain the same. So let’s try them out, and add another experiment to our attempts to predict california housing prices!
Recall that we’re given a train and a test set already in Google Colab, and can load them:
train_df = pd.read_csv('sample_data/california_housing_train.csv')
test_df = pd.read_csv('sample_data/california_housing_test.csv')
print(train_df.head())
# longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
# 0 -114.31 34.19 15.0 5612.0 1283.0 1015.0 472.0 1.4936 66900.0
# 1 -114.47 34.40 19.0 7650.0 1901.0 1129.0 463.0 1.8200 80100.0
# 2 -114.56 33.69 17.0 720.0 174.0 333.0 117.0 1.6509 85700.0
# 3 -114.57 33.64 14.0 1501.0 337.0 515.0 226.0 3.1917 73400.0
# 4 -114.57 33.57 20.0 1454.0 326.0 624.0 262.0 1.9250 65500.0
The task is predicting the median_house_value
column based on the other columns. Let’s use the same categorical and numerical features as in our previous post:
categorical_features = ['housing_median_age']
numerical_features = ['longitude', 'latitude', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income']
target = ['median_house_value']
And use the same pipeline construction function as in the previous post:
from sklearn.linear_model import Ridge
from sklearn.compose import TransformedTargetRegressor

def california_housing_pipeline(basis_transformer):
    return training_pipeline(
        basis_transformer,
        TransformedTargetRegressor(
            regressor=Ridge(),
            transformer=MinMaxScaler()
        ),
        categorical_features,
        numerical_features
    )
So now, in addition to the linear, power basis, and Bernstein bases, we will add the pairwise power basis, pairwise Bernstein basis, and the built-in PolynomialFeatures
basis. Let’s begin with the pairwise Bernstein basis. Note the interactions=True
argument I give to the BernsteinFeatures component:
inter_param_space = {
    'preprocessor__numerical__basis__degree': hp.uniformint('degree', 1, 8),
    'model__regressor__alpha': hp.loguniform('C', -10, 5)
}
bernstein_inter_pipeline = california_housing_pipeline(BernsteinFeatures(interactions=True))
tune_and_evaluate_pipeline(bernstein_inter_pipeline, inter_param_space,
                           train_df, test_df, target,
                           'neg_root_mean_squared_error')
This time we’ll use lower polynomial degrees, up to 8, because the model becomes too large to finish tuning in a few minutes on Colab. I got the following result:
Tuning params
100%|██████████| 100/100 [14:10<00:00, 8.50s/trial, best loss: 55558.90453947183]
Best params = {'model__regressor__alpha': 0.003973626254894749, 'preprocessor__numerical__basis__degree': 8}
Refitting with best params on the entire training set
Test metric = -58131.78984
Now let’s try the power basis with interactions:
power_inter_pipeline = california_housing_pipeline(PowerBasisFeatures(interactions=True))
tune_and_evaluate_pipeline(power_inter_pipeline, inter_param_space,
                           train_df, test_df, target,
                           'neg_root_mean_squared_error')
After a few minutes I get the following output:
Tuning params
100%|██████████| 100/100 [12:25<00:00, 7.46s/trial, best loss: 56765.359651797495]
Best params = {'model__regressor__alpha': 0.00017748456793637552, 'preprocessor__numerical__basis__degree': 8}
Refitting with best params on the entire training set
Test metric = -59228.75478
So we can certainly see that even with interaction features, the power basis performs worse than the Bernstein basis.
And last but not least, let’s use the PolynomialFeatures
class that the Scikit-Learn package provides. To be fair, we need to choose its maximum degree so that the number of generated features is similar to that of the pairwise bases. We have 7 numerical features, and therefore \(\frac{1}{2} \cdot 7 \cdot 6 = 21\) feature pairs. With a maximum degree of 8, each feature contributes \(8 + 1 = 9\) basis functions, so each pair generates \(9 \cdot 9 - 1 = 80\) basis functions after discarding the bias. So the total number of generated features is \(21 \cdot 80 = 1680\).
A multivariate polynomial with \(7\) variables of degree \(d\) has
\[{7 + d \choose d}\]coefficients. This can be easily shown using the stars and bars technique in combinatorics. Choosing \(d = 6\) we get 1716 coefficients, which is pretty close. With \(d = 5\) we get only 792 coefficients, well below our target, so using polynomials of max degree 6 or 7 seems like a reasonable choice.
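We can verify this count against what PolynomialFeatures actually produces - an illustrative check, not part of the tuning code:

```python
import numpy as np
from math import comb
from sklearn.preprocessing import PolynomialFeatures

for d in (5, 6, 7):
    pf = PolynomialFeatures(degree=d).fit(np.zeros((1, 7)))
    # number of monomials of total degree <= d in 7 variables
    assert pf.n_output_features_ == comb(7 + d, d)
    print(d, pf.n_output_features_)
# 5 792
# 6 1716
# 7 3432
```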
Let’s try it out!
from sklearn.preprocessing import PolynomialFeatures

polyfeat_param_space = {
    'preprocessor__numerical__basis__degree': hp.uniformint('degree', 1, 7),
    'model__regressor__alpha': hp.loguniform('C', -10, 5)
}
polyfeat_pipeline = california_housing_pipeline(PolynomialFeatures(include_bias=False))
tune_and_evaluate_pipeline(polyfeat_pipeline, polyfeat_param_space,
                           train_df, test_df, target,
                           'neg_root_mean_squared_error')
After a few minutes, I got the following output:
Tuning params
100%|██████████| 100/100 [25:59<00:00, 15.60s/trial, best loss: 56986.28132958403]
Best params = {'model__regressor__alpha': 0.0003080552886505334, 'preprocessor__numerical__basis__degree': 6}
Refitting with best params on the entire training set
Test metric = -59155.28068
So, summarizing the results of the previous post, together with the results of this post, we obtain the following table:
 | Linear | Power basis | Bernstein basis | Pairwise Bernstein | Pairwise Power | Full polynomial |
---|---|---|---|---|---|---|
RMSE | 67627.17474 | 63534.49228 | 61559.04848 | 58131.78984 | 59228.75478 | 59155.28068 |
Improvement over Linear | 0% | 6.05% | 8.97% | 14.04% | 12.41% | 12.52% |
Tuned degree | 1 | 31 | 50 | 8 | 8 | 6 |
As we can see, the clear winners are the pairwise polynomial features. This, of course, will not always be the case. But there is a good reason why it may be an option worth exploring.
Let’s get formally introduced - the set of functions \(\mathbb{B}_{n,n} = \{ b_{i,n}(x) b_{j,n}(y) \}_{i,j=0}^n\) is the tensor product basis constructed from the \(n\)-th degree Bernstein basis. In general, tensor product bases are function bases composed of pairwise products of basis functions, but here we explore the special case of Bernstein basis functions. The basis \(\mathbb{B}_{n,n}\) shares two nice properties with the Bernstein basis: its functions are non-negative, and they sum to one at every point of the unit square.
Now let’s look at an arbitrary function \(f\) that is spanned by this basis:
\[f(x,y) = \sum_{i=0}^n \sum_{j=0}^n \alpha_{i,j} b_{i,n}(x) b_{j,n}(y)\]Due to the two properties above, like in the case of the univariate Bernstein basis, \(f\) is just a weighted sum of its coefficients \(\alpha_{i,j}\). The basis function values specify the weight of each coefficient.
Moreover, we have the same ‘controlling’ property as with the univariate basis - \(\alpha_{i,j}\) “controls” the value of the function \(f\) in the vicinity of the point \((\frac{i}{n}, \frac{j}{n})\). These properties make it easy to regularize \(f\), just as in the case of the univariate basis. We will not go into the details in this post, but just as is the case with the univariate basis, we can also control the first or second derivative of \(f\) by imposing constraints on its coefficients based on discrete analogues of first and second order differences.
Despite the name ‘basis’, it is important to note that the tensor product basis does not span all bivariate polynomials of degree \(2n\), but merely a very useful subspace. For example, the monomials \(x^{2n}\) and \(y^{2n}\) appear nowhere in the polynomial expansion of the \(f(x, y)\) defined above.
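The non-negativity and partition-of-unity properties of the tensor-product basis are easy to confirm numerically, again via the binomial PMF:

```python
import numpy as np
from scipy.stats import binom

n = 4
x = np.linspace(0, 1, 50)
y = np.linspace(0, 1, 50)

bx = binom.pmf(np.arange(n + 1), n, x[:, None])  # shape (50, n + 1)
by = binom.pmf(np.arange(n + 1), n, y[:, None])

# entry (p, q, i, j) = b_{i,n}(x[p]) * b_{j,n}(y[q])
tensor = bx[:, None, :, None] * by[None, :, None, :]

assert np.all(tensor >= 0)                        # non-negative everywhere
assert np.allclose(tensor.sum(axis=(2, 3)), 1.0)  # sums to one at each point
```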
Bases with such properties, such as the Bernstein basis, and also the well-known B-Spline basis are heavily used by computer aided design software to represent 2D surfaces embedded in 3D ^{1}. In the case of the Bernstein basis, the surfaces are known as Bézier surfaces, named after the French engineer Pierre Bézier. I like the idea of propagating knowledge established in one field to another field, and I believe this is one such case. I’d like to refer interested readers to the beautiful tutorial paper^{2} by Michael Floater and Kai Hormann.
This post concludes our adventures in designing a Scikit-Learn transformer. I’m happy to receive feedback, so please don’t hesitate to contact me if you have feedback to share. Next, we will explore a practical case when controlling polynomial derivatives is important, and write yet another Scikit-Learn component. Stay tuned!
When representing a 3D surface, we have three functions \(f_x, f_y, f_z\), one for the \(x\) coordinate, one for the \(y\), and one for the \(z\) coordinate. ↩
Michael S. Floater & Kai Hormann Surface Parameterization: a Tutorial and Survey. Mathematics and Visualization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-26808-1_9 ↩
In the last two posts we introduced the Bernstein basis as an alternative way to generate polynomial features from data. In this post we’ll be concerned with an implementation that we can use in our model training pipelines based on Scikit-Learn. The Scikit-Learn library has the concept of a transformer class that generates features from raw data, and we will indeed develop and test such a transformer class for the Bernstein basis. Contrary to previous posts, here we will have some math, but plenty of code, which is fully available in this Colab Notebook.
But beforehand, let’s do a short recap of what we learned in the last two posts:
Transformer classes in Scikit-Learn generate new features out of existing ones, and can be combined in a convenient way into pipelines that perform a set of transformations that eventually generate features for a trained model. We will implement a Scikit-Learn transformer class for Bernstein polynomials, called BernsteinFeatures
. As a baseline, we will also implement a similar transformer that generates the power basis, called PowerBasisFeatures
. We will combine them in a Pipeline
to build a mechanism that trains and evaluates a model using the well-known fit-transform paradigm. In this post, we will train a linear model on our generated features.
Since feature normalization is a must, we will always prepend our polynomial transformer by a normalization transformer. In this post, we will use the MinMaxScaler
class built into Scikit-Learn. For categorical features, we will use OneHotEncoder
. Therefore, our pipelines in this post will have the following generic form:
Before we begin - some expectations. The behavior of the functions we approximate on real data-sets is typically not as ‘crazy’ as the toy functions we approximated in previous posts. The wide oscillations and wiggling of the “true” function we are aiming to learn are not that common in practice. A harder challenge is modeling the interaction between several features, rather than the effect of each feature separately. Therefore, the advantage we will see from a simple application of Bernstein polynomials over the power basis isn’t that large, but it’s quite visible and consistent. Thus, when fitting a model with polynomial features, I’d go with Bernstein polynomials by default, instead of the power basis. It’s very easy, and we have nothing to lose - we can only gain.
A transformer class in Scikit-Learn needs to implement the basic fit-transform paradigm. Since polynomial features are the same regardless of the data, the fit
method is empty. The transform method, as expected, generates a Vandermonde matrix for each column and concatenates them. Note that we handle each column separately at this stage, and do not aim to compute any interaction terms between columns.
There is one mathematical issue we need to take care of. Since a polynomial basis can represent any polynomial, including those that do not pass through the origin, it implicitly contains a “bias” term. The power basis is even explicit about it - its first basis function is the constant \(1\). However, a typical linear model already has its own bias term, namely,
\[f(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x}\rangle + b.\]The bias is, of course, equivalent to having a constant feature. Thus, our data-matrix has two constant features, meaning it’s as ill-conditioned as it can be - its columns are linearly dependent. When several numerical features are used things become even worse - we have several implicit constant features.
To mitigate the above, we will add a bias
boolean flag to our transformers that instructs the transformer to generate a basis of polynomials going through the origin. This policy is in line with other transformers that are built-in into Scikit-Learn, such as the SplineTransformer
and the PolynomialFeatures
classes. For the power basis it amounts to discarding the first basis function. It turns out that the same idea works for the Bernstein basis as well, since \(b_{0,n}(0) = 1\), and \(b_{i,n}(0) = 0\) for all \(i \geq 1\).
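The Bernstein boundary claim is easy to check with SciPy:

```python
import numpy as np
from scipy.stats import binom

n = 5
at_zero = binom.pmf(np.arange(n + 1), n, 0.0)
print(at_zero)
# only b_{0,n}(0) is non-zero, so dropping the first basis function
# removes the implicit bias term
assert np.allclose(at_zero, [1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
```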
Besides the above mathematical aspect, we will also have to take care of several technical aspects. First, we will add support for Pandas data-frames, since they are ubiquitously used by many practitioners. Second, we will have to take care of one-dimensional arrays as input, and reshape them into a column. Finally, we will transform NaN values into constant (zero) vectors to model the fact that a missing numerical feature “has no effect”. This is not always the best course of action, but it’s useful in this post. The base class taking care of the above mathematical and technical aspects is written below:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class PolynomialBasisTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, degree=5, bias=False, na_value=0.):
        self.degree = degree
        self.bias = bias
        self.na_value = na_value

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Check if X is a Pandas DataFrame and convert to NumPy array
        if hasattr(X, 'values'):
            X = X.values

        # Ensure X is a 2D array
        if X.ndim == 1:
            X = X.reshape(-1, 1)

        # Get the number of columns in the input array
        n_rows, n_features = X.shape

        # Compute the specific polynomial basis for each column
        basis_features = [
            self.feature_matrix(X[:, i])
            for i in range(n_features)
        ]

        # no bias --> skip the first basis function
        if not self.bias:
            basis_features = [basis[:, 1:] for basis in basis_features]

        return np.hstack(basis_features)

    def feature_matrix(self, column):
        vander = self.vandermonde_matrix(column)
        return np.nan_to_num(vander, nan=self.na_value)

    def vandermonde_matrix(self, column):
        raise NotImplementedError("Subclasses must implement this method.")
The power and Bernstein bases are easily implemented by overriding the vandermonde_matrix
method of the above base-class:
import numpy.polynomial.polynomial as poly
from scipy.stats import binom

class BernsteinFeatures(PolynomialBasisTransformer):
    def vandermonde_matrix(self, column):
        basis_idx = np.arange(1 + self.degree)
        basis = binom.pmf(basis_idx, self.degree, column[:, None])
        return basis

class PowerBasisFeatures(PolynomialBasisTransformer):
    def vandermonde_matrix(self, column):
        return poly.polyvander(column, self.degree)
Let’s see how they work. We will use Pandas to display the results of our transformers as nicely formatted tables.
import pandas as pd
pbt = PowerBasisFeatures(degree=2).fit(np.empty(0))
bbt = BernsteinFeatures(degree=2).fit(np.empty(0))

# transform a column - output the Vandermonde matrix according to each basis
feature = np.array([0, 0.5, 1, np.nan])
print(pd.DataFrame.from_dict({
    'Feature': feature,
    'Power basis': list(pbt.transform(feature)),
    'Bernstein basis': list(bbt.transform(feature))
}))
#    Feature  Power basis             Bernstein basis
# 0      0.0   [0.0, 0.0]                  [0.0, 0.0]
# 1      0.5  [0.5, 0.25]  [0.5000000000000002, 0.25]
# 2      1.0   [1.0, 1.0]                  [0.0, 1.0]
# 3      NaN   [0.0, 0.0]                  [0.0, 0.0]
# transform two columns - concatenate the Vandermonde matrices
features = np.array([
    [0, 0.25],
    [0.5, 0.5],
    [np.nan, 0.75]
])
print(pd.DataFrame.from_dict({
    'Feature 0': features[:, 0],
    'Feature 1': features[:, 1],
    'Power basis': list(pbt.transform(features)),
    'Bernstein basis': list(bbt.transform(features))
}))
#    Feature 0  Feature 1                Power basis                                       Bernstein basis
# 0        0.0       0.25  [0.0, 0.0, 0.25, 0.0625]                            [0.0, 0.0, 0.375, 0.0625]
# 1        0.5       0.50   [0.5, 0.25, 0.5, 0.25]  [0.5000000000000002, 0.25, 0.5000000000000002, 0.25]
# 2        NaN       0.75  [0.0, 0.0, 0.75, 0.5625]                           [0.0, 0.0, 0.375, 0.5625]
Nice! Now let’s proceed to our example.
Let’s implement the pipeline structure we saw at the beginning of this post in code, along with a function to train models using this pipeline.
We will write a function that takes a basis transformer and a model as arguments, and constructs the components of the pipeline. Categorical features will be one-hot encoded, numerical features will be scaled and transformed using the given basis transformer, and the result will be fed into the given model.
To make sure our scaled numerical features never fall outside the \([0, 1]\) interval, even if the test set contains values larger or smaller than those seen in the training set, we clip the scaled values to \([0, 1]\). And to avoid inflating the dimension of our model by one-hot encoding rare categorical values, we group categories appearing fewer than 10 times into a single “infrequent” bucket. Here is the code:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.compose import ColumnTransformer
def training_pipeline(basis_transformer, model_estimator,
                      categorical_features, numerical_features):
    basis_feature_transformer = Pipeline([
        ('scaler', MinMaxScaler(clip=True)),
        ('basis', basis_transformer)
    ])
    categorical_transformer = OneHotEncoder(
        sparse_output=False,
        handle_unknown='infrequent_if_exist',
        min_frequency=10
    )
    preprocessor = ColumnTransformer(
        transformers=[
            ('numerical', basis_feature_transformer, numerical_features),
            ('categorical', categorical_transformer, categorical_features)
        ]
    )
    return Pipeline([
        ('preprocessor', preprocessor),
        ('model', model_estimator)
    ])
We can now use a pipeline combining Bernstein features with Ridge regression, as in:
pipeline = training_pipeline(BernsteinFeatures(), Ridge(), categorical_features, numerical_features)
test_predictions = pipeline.fit(train_df, train_df[target]).predict(test_df)
But wait! We need to know what polynomial degree to use, and maybe tune some hyperparameters of the trained model. Otherwise, the experimental results we observe may simply be due to a bad choice of hyperparameters.
We need two ingredients. One is technical - how to set hyperparameters of components hidden deep inside a Pipeline. The other is how we actually tune them. For setting hyperparameters, Scikit-Learn provides an interface consisting of two functions: get_params(), which returns a dictionary of all settable parameters, and set_params(), which can set the parameters of any component contained inside a pipeline. Let’s look at an example of a pipeline with BernsteinFeatures as the basis transformer, and Ridge as the model. Since Ridge has an alpha parameter, and BernsteinFeatures has a degree parameter, let’s look for those:
from sklearn.linear_model import Ridge
pipeline = training_pipeline(BernsteinFeatures(), Ridge(), [], [])
print({k: v for k, v in pipeline.get_params().items()
       if 'degree' in k or 'alpha' in k})
# prints: {'preprocessor__numerical__basis__degree': 5, 'model__alpha': 1.0}
There is a pattern here! Looking at our training_pipeline function above, we see that there is a component named “preprocessor”, inside of which there is a component named “numerical”, which contains a “basis”. That “basis” component is our transformer, so it has a “degree”. The full name is just the concatenation of these names with double underscores, and the same idea applies to the model. We can also set these parameters as follows:
pipeline.set_params(preprocessor__numerical__basis__degree=SOME_DEGREE,
                    model__alpha=SOME_REGULARIZATION_COEFFICIENT)
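To see the naming convention in isolation, here is a tiny self-contained example using stock Scikit-Learn components (not our training_pipeline, just a two-step toy pipeline):

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

toy = Pipeline([
    ('scaler', MinMaxScaler()),
    ('model', Ridge()),
])
# step name + '__' + parameter name addresses a nested parameter
print('model__alpha' in toy.get_params())  # True
toy.set_params(model__alpha=0.5)           # note: set_params takes keyword arguments
print(toy.get_params()['model__alpha'])    # 0.5
```

The same double-underscore scheme nests arbitrarily deep, which is exactly what we exploit above.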
So now that we know how to set hyperparameters of parts within a pipeline, let’s tune them. To that end, we will use hyperopt^{2}! It’s a nice hyperparameter tuner, very easy to use, and it implements the state-of-the-art Bayesian optimization paradigm, which can find high-quality hyperparameter configurations in a relatively small number of trials. It’s as easy to use as a grid search, available by default on Colab, and saves us precious time. And I certainly don’t want to wait long to see the results.
To use hyperopt, we need two ingredients: a tuning objective that evaluates the performance of a given hyperparameter configuration, and a search space for the hyperparameters. Writing such a tuning objective is quite easy - we will use a cross-validated score computed with Scikit-Learn’s built-in capabilities:
from sklearn.model_selection import cross_val_score
def tuning_objective(pipeline, metric, train_df, target, params):
    pipeline.set_params(**params)
    scores = cross_val_score(pipeline, train_df, train_df[target], scoring=metric)
    return -np.mean(scores)
Well, that wasn’t hard, but there’s an intricate detail - note that we are returning minus the average metric across folds. This is because Scikit-Learn’s metrics are built to be maximized, but hyperopt is built to minimize.
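As a tiny sanity check of this sign convention, here is a self-contained example using stock Scikit-Learn components: the neg_root_mean_squared_error scorer returns the negated RMSE, so larger (less negative) means better.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import get_scorer

X = np.zeros((4, 1))
y = np.array([0.0, 0.0, 2.0, 2.0])
model = DummyRegressor(strategy='mean').fit(X, y)  # always predicts the mean, 1.0
score = get_scorer('neg_root_mean_squared_error')(model, X, y)
print(score)  # -1.0: the RMSE is 1.0, and the scorer negates it
```

Negating such a score once more hands hyperopt a quantity it can happily minimize.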
Defining the hyperparameter search space is also easy - it’s just a dict specifying a distribution for each hyperparameter. For our example above with a Ridge model, we can use something like this:
from hyperopt import hp
param_space = {
    'preprocessor__numerical__basis__degree': hp.uniformint('degree', 1, 50),
    'model__alpha': hp.loguniform('alpha', -10, 5)
}
Hyperopt has uniform and uniformint functions for hyperparameters that we would normally tune using a uniform grid, such as the number of layers of an NN, or the degree of a polynomial. In the code above, the degree of the polynomial is a number between 1 and 50, and all values are equally likely. It also has a loguniform function for hyperparameters that we normally tune using a geometrically-spaced grid, such as a learning rate, or a regularization coefficient. In the example above, the regularization coefficient is between \(e^{-10}\) and \(e^5\), and all exponents are equally likely.
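To build intuition for loguniform, note that it is equivalent to exponentiating a uniformly drawn exponent. Here is a small numpy sketch of the sampling rule (this mimics the distribution, it is not hyperopt’s internal code):

```python
import numpy as np

rng = np.random.default_rng(42)
# loguniform(-10, 5) is distributed like exp(Uniform(-10, 5)):
# the *exponent* is uniform, so each order of magnitude is equally likely
exponents = rng.uniform(-10, 5, size=100_000)
samples = np.exp(exponents)
print(samples.min() >= np.exp(-10), samples.max() <= np.exp(5))  # True True
```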
Having specified the objective function and the parameter space, we can use fmin for tuning, like this:
from hyperopt import fmin, tpe
fmin(lambda params: tuning_objective(pipeline, metric, train_df, target, params),
     space=param_space,
     algo=tpe.suggest,
     max_evals=100)
We have given it a function to minimize, gave it the hyperparameter search space, told it to use the TPE algorithm for tuning^{3}, and limited it to 100 evaluations of our tuning objective. It will invoke our objective on hyperparameter configurations that it considers worth trying, and eventually return the best configuration it found. More on that can be found in hyperopt’s documentation.
So let’s write a function that tunes hyperparameters using the training set, re-trains the pipeline on the entire training set using the best configuration it found, and evaluates the resulting model’s performance on the test set.
from hyperopt import fmin, tpe
from sklearn.metrics import get_scorer
def tune_and_evaluate_pipeline(pipeline, param_space,
                               train_df, test_df, target, metric,
                               max_evals=50, random_seed=42):
    print('Tuning params')
    def bound_tuning_objective(params):
        return tuning_objective(pipeline, metric, train_df, target, params)
    params = fmin(fn=bound_tuning_objective,  # <-- this is the objective
                  space=param_space,  # <-- the search space
                  algo=tpe.suggest,  # <-- the algorithm to use. TPE is the most widely used.
                  max_evals=max_evals,  # <-- maximum number of configurations to try
                  rstate=np.random.default_rng(random_seed),
                  return_argmin=False)
    print(f'Best params = {params}')
    print('Refitting with best params on the entire training set')
    pipeline.set_params(**params)
    fit_result = pipeline.fit(train_df, train_df[target])
    scorer = get_scorer(metric)
    score = scorer(fit_result, test_df, test_df[target])
    print(f'Test metric = {score:.5f}')
    return fit_result
Now we have all the ingredients in place! We can now, for example, train a tuned Ridge regression model with Bernstein polynomial features that predicts the foo column in our data-set, measuring success with the Root Mean Squared Error metric, as follows:
train_df = ...
test_df = ...
categorical_features = [...]
numerical_features = [...]
pipeline = training_pipeline(BernsteinFeatures(), Ridge(), categorical_features, numerical_features)
model = tune_and_evaluate_pipeline(
    pipeline,
    param_space,
    train_df,
    test_df,
    'foo',
    'neg_root_mean_squared_error')
Now let’s put our work-horse to work!
The well-known California Housing price prediction data-set is available in the samples directory on Colab, so it will be convenient to use. Let’s load it, and print a sample:
train_df = pd.read_csv('sample_data/california_housing_train.csv')
test_df = pd.read_csv('sample_data/california_housing_test.csv')
print(train_df.head())
# longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
# 0 -114.31 34.19 15.0 5612.0 1283.0 1015.0 472.0 1.4936 66900.0
# 1 -114.47 34.40 19.0 7650.0 1901.0 1129.0 463.0 1.8200 80100.0
# 2 -114.56 33.69 17.0 720.0 174.0 333.0 117.0 1.6509 85700.0
# 3 -114.57 33.64 14.0 1501.0 337.0 515.0 226.0 3.1917 73400.0
# 4 -114.57 33.57 20.0 1454.0 326.0 624.0 262.0 1.9250 65500.0
The task is predicting the median_house_value column based on the other columns.
First, we can see that there are several feature columns with very large and diverse numbers. They probably have a very skewed distribution. Let’s plot those distributions:
skewed_columns = ['total_rooms', 'total_bedrooms', 'population', 'households']
axs = train_df.loc[:, skewed_columns].plot.hist(
    bins=20, subplots=True, layout=(2, 2), figsize=(8, 6))
axs.flat[0].get_figure().tight_layout()
Indeed very skewed! Typically, applying a logarithm helps. Let’s plot them after applying a logarithm (note the .apply(np.log)):
axs = train_df.loc[:, skewed_columns].apply(np.log).plot.hist(
    bins=20, subplots=True, layout=(2, 2), figsize=(8, 6))
axs.flat[0].get_figure().tight_layout()
Ah, much better! We also note that the housing_median_age variable, despite being numerical, is discrete. Indeed, it has only 52 unique values in the entire dataset, so we will treat it as a categorical variable. Let’s summarize our features in code:
categorical_features = ['housing_median_age']
numerical_features = ['longitude', 'latitude', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income']
target = ['median_house_value']
So we’re almost ready to fit a model. We can see that our target variable, median_house_value, has very large magnitude values. It is usually beneficial to scale them to a smaller range. However, we would like to measure the prediction error with respect to the original values. Fortunately, Scikit-Learn provides the TransformedTargetRegressor class, which scales the target variable for the regression model, and scales predictions back to the original range when producing an output.
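To make the mechanics concrete, here is a minimal self-contained sketch of TransformedTargetRegressor in isolation (toy data, not our housing pipeline): the inner regressor is fitted on targets scaled into \([0, 1]\), while predictions come back in the original range.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

X = np.arange(10, dtype=float).reshape(-1, 1)
y = 100_000.0 * np.arange(10)          # large-magnitude targets, like house prices
model = TransformedTargetRegressor(
    regressor=LinearRegression(),      # fitted on y scaled into [0, 1]
    transformer=MinMaxScaler(),
)
model.fit(X, y)
pred = model.predict([[5.0]])          # automatically mapped back to the original scale
print(pred)                            # close to [500000.]
```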
Now we’re ready to construct our model fitting pipeline that fits a Ridge model on scaled regression targets, and transformed features:
from sklearn.linear_model import Ridge
from sklearn.compose import TransformedTargetRegressor
def california_housing_pipeline(basis_transformer):
    return training_pipeline(
        basis_transformer,
        TransformedTargetRegressor(
            regressor=Ridge(),
            transformer=MinMaxScaler()
        ),
        categorical_features,
        numerical_features
    )
Beautiful! Now we can use our hyperparameter tuning function to train a tuned model on our dataset. Since it’s a regression task, we will measure the Root Mean Squared Error (RMSE), implemented by the neg_root_mean_squared_error Scikit-Learn metric. So let’s begin with Bernstein polynomial features:
poly_param_space = {
    'preprocessor__numerical__basis__degree': hp.uniformint('degree', 1, 50),
    'model__regressor__alpha': hp.loguniform('C', -10, 5)
}
bernstein_pipeline = california_housing_pipeline(BernsteinFeatures())
bernstein_fit_result = tune_and_evaluate_pipeline(
    bernstein_pipeline, poly_param_space,
    train_df, test_df, target,
    'neg_root_mean_squared_error')
After a few minutes I got the following output:
Tuning params
100%|██████████| 50/50 [03:37<00:00, 4.34s/trial, best loss: 60364.25845777496]
Best params = {'model__regressor__alpha': 0.0075549014272857686, 'preprocessor__numerical__basis__degree': 50}
Refitting with best params on the entire training set
Test metric = -61559.04848
The root mean squared error (RMSE) on the test-set of the tuned model is \(61559.04848\). Now let’s try the power basis:
power_basis_pipeline = california_housing_pipeline(PowerBasisFeatures())
power_basis_fit_result = tune_and_evaluate_pipeline(
    power_basis_pipeline, poly_param_space,
    train_df, test_df, target,
    metric='neg_root_mean_squared_error')
This time I got the following output:
Tuning params
100%|██████████| 50/50 [00:54<00:00, 1.10s/trial, best loss: 62205.78033504614]
Best params = {'model__regressor__alpha': 4.7685837926305776e-05, 'preprocessor__numerical__basis__degree': 31}
Refitting with best params on the entire training set
Test metric = -63534.49228
This time the RMSE is \(63534.49228\). The Bernstein basis got us a \(3.1\%\) improvement! If we look closer at the output, we can see that the tuned Bernstein polynomial is of degree 50, whereas the best tuned power basis polynomial is of degree 31. We already saw that high degree polynomials in the Bernstein basis are easy to regularize, and our tuner probably saw the same phenomenon, and cranked up the degree to 50.
How do our polynomial features compare to a simple linear model? Well, let’s see. To re-use all our existing code instead of writing a new pipeline, we’ll just use a “do nothing” feature transformer that implements the identity function. Note that this time there is no degree to tune.
class IdentityTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, na_value=0.):
        self.na_value = na_value

    def fit(self, input_array, y=None):
        return self

    def transform(self, input_array, y=None):
        # we are compatible with our polynomial features - NA values are
        # zeroed-out, and the rest are passed through
        return np.where(np.isnan(input_array), self.na_value, input_array)
linear_param_space = {
    'model__regressor__alpha': hp.loguniform('C', -10, 5)
}
linear_pipeline = california_housing_pipeline(IdentityTransformer())
linear_fit_result = tune_and_evaluate_pipeline(
    linear_pipeline, linear_param_space,
    train_df, test_df, target,
    'neg_root_mean_squared_error')
Here is the output:
Tuning params
100%|██████████| 50/50 [00:22<00:00, 2.22trial/s, best loss: 66571.41310284132]
Best params = {'model__regressor__alpha': 0.013724898474056764}
Refitting with best params on the entire training set
Test metric = -67627.17474
The RMSE is \(67627.17474\). So let’s summarize the results in the table:
| | Linear | Power basis | Bernstein basis |
|---|---|---|---|
| RMSE | 67627.17474 | 63534.49228 | 61559.04848 |
| Improvement over Linear | 0% | 6.05% | 8.97% |
| Tuned degree | 1 | 31 | 50 |
Impressive! Just changing the polynomial basis gives us a visible boost, and the high degree doesn’t appear to cause any harm.
Now we shall inspect our models a bit closer. That’s why we stored the fitted models in the bernstein_fit_result and power_basis_fit_result variables above. Following the structure of our pipelines, we can extract the coefficients with the following function:
def get_coefs(pipeline):
    transformed_target_regressor = pipeline.named_steps['model']
    # note the trailing underscore - regressor_ is the fitted regressor
    # inside TransformedTargetRegressor (regressor is the unfitted prototype)
    ridge_model = transformed_target_regressor.regressor_
    return ridge_model.coef_.ravel()
Now we can plot the polynomials! First, we will need to extract the coefficients of the numerical features, and ignore the ones corresponding to the categorical features. Next, we need to treat the coefficients of each numerical feature separately, and plot the polynomial they represent. Since our numerical features are always scaled to \([0, 1]\), plotting amounts to evaluating our polynomials on a dense grid in \([0, 1]\). So this is our plotting function:
import math
import matplotlib.pyplot as plt

def plot_feature_curves(pipeline, basis_transformer_ctor, numerical_features):
    # get the coefficients and the degree
    degree = pipeline.get_params()['preprocessor__numerical__basis__degree']
    coefs = get_coefs(pipeline)
    # extract the numerical features, and form a matrix, such that the
    # coefficients of each feature are in a separate row.
    numerical_slice = pipeline.get_params()['preprocessor'].output_indices_['numerical']
    feature_coefs = coefs[numerical_slice].reshape(-1, degree)
    # form the basis Vandermonde matrix on [0, 1]
    xs = np.linspace(0, 1, 1000)
    xs_vander = basis_transformer_ctor(degree=degree).fit_transform(xs)
    # do the plotting
    n_cols = 3
    n_rows = math.ceil(len(numerical_features) / n_cols)
    fig, axs = plt.subplots(n_rows, n_cols, figsize=(3 * n_cols, 3 * n_rows))
    for i, (ax, feat_coefs) in enumerate(zip(axs.ravel(), feature_coefs)):
        ax.plot(xs, xs_vander @ feat_coefs)
        ax.set_title(numerical_features[i])
    fig.show()
A bit lengthy, but understandable.
Recalling our previous post, we know that the coefficients in the Bernstein basis are actually “control points”, so let’s add the ability to plot them as well to the above function:
import math
import matplotlib.pyplot as plt

def plot_feature_curves(pipeline, basis_transformer_ctor, numerical_features,
                        plot_control_pts=True):
    # get the coefficients and the degree
    degree = pipeline.get_params()['preprocessor__numerical__basis__degree']
    coefs = get_coefs(pipeline)
    # extract the numerical features, and form a matrix, such that the
    # coefficients of each feature are in a separate row.
    numerical_slice = pipeline.get_params()['preprocessor'].output_indices_['numerical']
    feature_coefs = coefs[numerical_slice].reshape(-1, degree)
    # form the basis Vandermonde matrix on [0, 1]
    xs = np.linspace(0, 1, 1000)
    xs_vander = basis_transformer_ctor(degree=degree).fit_transform(xs)
    # do the plotting
    n_cols = 3
    n_rows = math.ceil(len(numerical_features) / n_cols)
    fig, axs = plt.subplots(n_rows, n_cols, figsize=(3 * n_cols, 3 * n_rows))
    for i, (ax, feat_coefs) in enumerate(zip(axs.ravel(), feature_coefs)):
        if plot_control_pts:
            control_xs = (1 + np.arange(len(feat_coefs))) / len(feat_coefs)
            ax.scatter(control_xs, feat_coefs, s=30, facecolors='none', edgecolor='b', alpha=0.5)
        ax.plot(xs, xs_vander @ feat_coefs)
        ax.set_title(numerical_features[i])
    fig.show()
Now let’s see our Bernstein polynomials!
plot_feature_curves(bernstein_fit_result, BernsteinFeatures, numerical_features)
What about the power basis? Let’s take a look as well. Note that we won’t plot the coefficients as “control points”, since the coefficients of the power basis are not control points in any way.
plot_feature_curves(power_basis_fit_result, PowerBasisFeatures, numerical_features, plot_control_pts=False)
Look at the “households” and “total_bedrooms” polynomials. It seems that they’re “going crazy” near the boundary of the domain. As we expected - the power basis was not specifically designed to approximate functions on \([0, 1]\), and it’s hard to regularize it into a good fit: it will either under-fit when over-regularized, or go out of control near the boundary.
In fact, we may recall that the “natural domain” of the power basis is the complex unit circle. It may be interesting to try representing periodic features, such as the time of day using the power basis, since such features naturally map to a point on a circle. However, there are other challenges involved, such as ensuring that our model will be real-valued rather than complex-valued, and this may be a nice subject for another post.
This was a nice adventure. I certainly learned a lot about Scikit-Learn while writing this post, and I hope the transformer for producing the Bernstein basis may be useful to you as well. We note that polynomial non-linear features have a nice property: they have only one tunable hyperparameter, so learning a tuned model should be computationally cheaper compared to other alternatives, such as radial basis functions.
Looking again at the Bernstein polynomials above, we see that they are a bit ‘wiggly’, and the control points seem like a mess; in the previous post we learned how to smooth them out by regularizing their second derivative. Moreover, in the beginning of this post we said something interesting - the predictive power of simple models may be improved by incorporating interactions between features. So in the next posts we’re going to do exactly that - enhance our transformer to model feature interactions, and write an enhanced version of the Ridge estimator to smooth polynomial features. Stay tuned!
I wouldn’t even call it extrapolation - in our context I think of the polynomial basis as “undefined” outside of its natural domain. ↩
Bergstra, James, Daniel Yamins, and David Cox. “Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures.” International conference on machine learning. PMLR, 2013. ↩
Watanabe, S., 2023. Tree-structured Parzen estimator: Understanding its algorithm components and their roles for better empirical performance. arXiv preprint arXiv:2304.11127. ↩
In the previous post we saw that the Bernstein polynomials can be used to fit a high-degree polynomial curve with ease, without its shape going out of control. In this post we’ll look at the Bernstein polynomials in more depth, both experimentally and theoretically. First, we will explore the Bernstein polynomials \(\mathbb{B}_n = \{ b_{0,n}, \dots, b_{n, n} \}\), where
\[b_{i,n}(x) = \binom{n}{i} x^i (1-x)^{n-i},\]empirically and visually. We will see how to use the coefficients to achieve a higher degree of control over the shape of the function we fit. Then, we’ll explore them more theoretically, and see that they are indeed a basis - they represent the same model class as the classical power basis \(\{1, x, x^2, \dots, x^n\}\). All the results are reproducible from this notebook.
To study the shape preserving properties, we will rely on the bernvander function we implemented in the last post, which, given the numbers \(x_1, \dots, x_m\), computes the Bernstein Vandermonde matrix of a given degree \(n\), containing all the basis polynomials evaluated at all the given points.
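The bernvander code itself is not reproduced in this excerpt; a minimal sketch consistent with how it is used below - relying on the fact that \(b_{i,n}(x)\) is exactly the Binomial\((n, x)\) pmf evaluated at \(i\) - would be:

```python
import numpy as np
from scipy.stats import binom

def bernvander(x, deg):
    """Bernstein Vandermonde matrix: column i holds b_{i,deg} evaluated at x."""
    i = np.arange(deg + 1)
    # b_{i,n}(x) = C(n, i) x^i (1 - x)^(n - i) is the Binomial(n, x) pmf at i
    return binom.pmf(i, deg, np.asarray(x, dtype=float)[:, None])
```

Each row of the resulting matrix sums to one - the partition-of-unity property we use repeatedly below.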
This is something we should have probably done earlier, but let’s plot the Bernstein polynomials to see what they look like. Below, we plot the basis \(\mathbb{B}_{7}\) using the bernvander function.
import matplotlib.pyplot as plt
import numpy as np
plt_xs = np.linspace(0, 1, 1000)
bernstein_basis = bernvander(plt_xs, deg=7)
plt.plot(plt_xs, bernstein_basis,
         label=[f'$b_{{{i},7}}$' for i in range(8)])
plt.legend(ncols=2)
plt.show()
We can see that each polynomial is a “hill” whose maxima appear equally spaced. So are they? Let’s add vertical bars using the axvline
function to verify:
plt_xs = np.linspace(0, 1, 1000)
bernstein_basis = bernvander(plt_xs, deg=7)
plt.plot(plt_xs, bernstein_basis,
         label=[f'$b_{{{i},7}}$' for i in range(8)])
for x in np.linspace(0, 1, 8):
    plt.axvline(x, color='gray', linestyle='dotted')
plt.legend(ncols=2)
plt.show()
It indeed appears so - the maxima of the polynomials are at \(\{ \tfrac{i}{n}\}_{i=0}^n\). We won’t prove it formally, but that’s not hard. Now we can have some interesting insights. Suppose we have a polynomial written in Bernstein form, namely, as a weighted sum of Bernstein polynomials:
\[f(x) = \sum_{i=0}^n u_i b_{i,n}(x)\]Recall from the previous post that the Bernstein polynomials sum to one, and therefore \(f(x)\) is just a weighted average of the coefficients \(u_0, \dots, u_n\). Thus, at \(x=\frac{i}{n}\), the weight of \(u_i\) in the weighted average dominates the weights of the other coefficients. In other words,
\(u_i\) controls the polynomial \(f(x)\) in the vicinity of the point \(\frac{i}{n}\).
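As a quick numerical sanity check of the claim about the maxima (a dense grid search; we won’t prove it formally here either):

```python
import numpy as np
from scipy.stats import binom

n = 7
x = np.linspace(0, 1, 10_001)
# column i holds b_{i,n} evaluated on the dense grid
B = binom.pmf(np.arange(n + 1), n, x[:, None])
peak_locations = x[np.argmax(B, axis=0)]
print(peak_locations)  # close to [0, 1/7, 2/7, ..., 1]
```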
In fact, the name often given to the coefficients \(u_0, \dots, u_n\) is “control points”. To visualize this observation, let’s see what happens if we change one coefficient, \(u_3\), of a 7-th degree polynomial using an animation:
from matplotlib.animation import FuncAnimation, PillowWriter

n = 7
n_frames = 50
ctrl_xs = np.linspace(0, 1, 1 + n)      # the points i / n
w_init = np.cos(2 * np.pi * ctrl_xs)    # initial coefficients
plt_vander = bernvander(plt_xs, deg=n)  # bernstein basis at plot points
fig, ax = plt.subplots()

def animate(i):
    # animate the coefficients "w"
    t = np.sin(2 * np.pi * i / n_frames)
    w = np.array(w_init)
    w[3] = (1 - t) * w[3] + t * 3
    # plot the Bernstein polynomial and the coefficients at i / n
    ax.clear()
    ax.set_xlim([-0.05, 1.05])
    ax.set_ylim([-3, 3])
    control_plot = ax.scatter(ctrl_xs, w, color='red')  # plot control points
    poly_plot = ax.plot(plt_xs, plt_vander @ w, color='blue')  # plot the polynomial
    return poly_plot, control_plot

ani = FuncAnimation(fig, animate, n_frames)
ani.save('control_coefficients.gif', dpi=300, writer=PillowWriter(fps=25))
We get the following result:
Looks nice! We can indeed see where the name “control points” comes from. But what can we say about it formally? Well, there are several results. The most famous one is the constructive proof of the Weierstrass approximation theorem:
Theorem [Lorentz^{1}, 1952] Suppose \(g(x)\) is continuous in \([0, 1]\). Then the polynomials \(\sum_{i=0}^n g(\tfrac{i}{n}) b_{i,n}(x)\) uniformly converge to \(g(x)\) as \(n \to \infty\).
As a consequence, we can interpret the Bernstein coefficient \(u_i\) as the value of some function \(g\) that our polynomial approximates at \(x=\frac{i}{n}\). Equipped with this idea, we can ask ourselves a simple question. What if the coefficients are increasing? Will the polynomial be an increasing function?
Well, it turns out the answer is yes - we can force the polynomial to be an increasing function of \(x\) by making sure the coefficients are increasing. In fact, we have even more interesting things we can formally say. To do that, let’s look at the derivatives of polynomials in Bernstein form. Suppose that
\[f(x) = \sum_{i=0}^n u_i b_{i,n}(x),\]then the first and second derivatives are:
\[\begin{align} f'(x) &= n \sum_{i=0}^{n-1} (u_{i+1} - u_i) b_{i,n-1}(x) \\ f''(x) &= n (n-1) \sum_{i=0}^{n-2} (u_{i+2} - 2 u_{i+1} + u_i) b_{i,n-2}(x) \end{align}\]The first derivative is a weighted sum of the coefficient first order differences \(u_{i+1}-u_i\), whereas the second derivative is a weighted sum of the second order differences \(u_{i+2}-2u_{i+1}+u_i\). Therefore, we can conclude that:
Theorem [Chang et al.^{2}, 2007, Proposition 1] Given \(f(x) = \sum_{i=0}^n u_i b_{i,n}(x)\), where each condition below is assumed to hold for all valid \(i\):
- If \(u_{i+1} - u_i \geq 0\), then \(f'(x) \geq 0\), and \(f\) is nondecreasing,
- If \(u_{i+1} - u_i \leq 0\), then \(f'(x) \leq 0\), and \(f\) is nonincreasing,
- If \(u_{i+2} - 2u_{i+1} + u_i \geq 0\), then \(f''(x) \geq 0\), and \(f\) is convex,
- If \(u_{i+2} - 2u_{i+1} + u_i \leq 0\), then \(f''(x) \leq 0\), and \(f\) is concave.
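Both the derivative formula and the first bullet above are easy to check numerically. Here is a small self-contained sketch (bernvander is re-defined inline so the snippet stands alone; a central finite difference stands in for the true derivative):

```python
import numpy as np
from scipy.stats import binom

def bernvander(x, deg):
    # Bernstein Vandermonde matrix: column i holds b_{i,deg}(x)
    i = np.arange(deg + 1)
    return binom.pmf(i, deg, np.asarray(x, dtype=float)[:, None])

n = 5
u = np.array([0.0, 1.0, -0.5, 2.0, 0.3, 1.5])  # arbitrary coefficients
x = np.linspace(0.05, 0.95, 50)

# closed form: f'(x) = n * sum_i (u_{i+1} - u_i) * b_{i,n-1}(x)
closed_form = n * (bernvander(x, n - 1) @ np.diff(u))

# central finite difference of f(x) = sum_i u_i * b_{i,n}(x)
h = 1e-6
finite_diff = (bernvander(x + h, n) @ u - bernvander(x - h, n) @ u) / (2 * h)
print(np.max(np.abs(closed_form - finite_diff)))  # tiny - the formula checks out

# first bullet: sorting the coefficients makes the polynomial nondecreasing
f = bernvander(np.linspace(0, 1, 500), n) @ np.sort(u)
print(np.all(np.diff(f) >= -1e-12))  # True
```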
An important application of fitting nondecreasing functions, for example, is fitting a CDF. One practical example of CDF fitting is the bid shading problem^{3}^{4}^{5} in online advertising. We are required to model the probability of winning an ad auction given a bid \(x\). Naturally, the winning probability should increase when the bid \(x\) increases. Another important example is calibration curves^{6}^{7}^{8} in classification models, which are functions that map the model’s score to a probability such that the mean predicted probability conforms to the true conditional probability of the label given the features. The curve should be increasing - the higher the score, the higher probability it represents. See this great tutorial in the SkLearn documentation.
The simplest way to impose constraints on the coefficients when fitting models on small-scale data is using the CVXPY library, which we already encountered in previous posts in this blog. The library allows solving arbitrary convex optimization problems, specified by the function to minimize, and a set of constraints. Let’s see how we can use CVXPY to fit a nondecreasing Bernstein polynomial. First, we define the function and use it to generate noisy data:
def nondecreasing_func(x):
    return (3 - 2 * x) * (x ** 2) * np.exp(x)

# define number of points and noise
m = 30
sigma = 0.2
np.random.seed(42)
x = np.random.rand(m)
y = nondecreasing_func(x) + sigma * np.random.randn(m)
Now, we define the model fitting as an optimization problem with constraints. Mathematically, we aim to minimize the L2 loss subject to coefficient monotonicity constraints:
\[\begin{align} \min_{\mathbf{u}} & \quad \| \mathbf{V} \mathbf{u} - \mathbf{y} \|^2 \\ \text{s.t.} &\quad u_{i+1} \geq u_i && i = 0, \dots, n-1 \end{align}\]The matrix \(\mathbf{V}\) is the Bernstein Vandermonde matrix at \(x_1, \dots, x_m\). When multiplied by \(\mathbf{u}\) we obtain the values of the polynomials in Bernstein form at each of the data points. The following CVXPY code is just a direct formulation of the above for fitting a polynomial of degree \(n=20\):
import cvxpy as cp

deg = 20
u = cp.Variable(deg + 1)  # a placeholder for the optimal Bernstein coefficients
loss = cp.sum_squares(bernvander(x, deg) @ u - y)  # the L2 loss - the sum of squared residuals
constraints = [cp.diff(u) >= 0]  # constraints: u_{i+1} - u_i >= 0
problem = cp.Problem(cp.Minimize(loss), constraints)
# solve the minimization problem and extract the optimal coefficients
problem.solve()
u_opt = u.value
Now, let’s plot the points, the original function, and the fitted polynomial:
plt.scatter(x, y, color='red')
plt.plot(plt_xs, nondecreasing_func(plt_xs), color='blue')
plt.plot(plt_xs, bernvander(plt_xs, deg) @ u_opt, color='green')
Not bad, given the level of noise, and the fact that we have no regularization whatsoever! For larger scale problems we will typically use an ML framework, such as PyTorch or Tensorflow, and they do not provide mechanisms to impose hard constraints on parameters. Therefore, when using such frameworks, we need to use a regularization term that penalizes violation of our desired constraints. For example, to penalize for violating the nondecreasing constraint, we can use the regularizer:
\[r(\mathbf{u}) = \sum_{i=0}^{n-1} \max(0, u_{i} - u_{i+1})^2\]Looking at the curve above, we see that it’s a bit wiggly. Can we do something about it? Looking at the second derivative formula above, we can “smooth out” the curve by adding a regularization term that penalizes the second order differences. This will, in turn, penalize the second derivative. Why second order? Because ideally, when the second order differences are zero, we get a straight line. So we’re “smoothing out” the curve towards a straight line.
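The soft-constraint penalty \(r(\mathbf{u})\) above is essentially a one-liner; here is a numpy sketch (a PyTorch version would use the analogous tensor ops on the parameter vector):

```python
import numpy as np

def monotonicity_penalty(u):
    # r(u) = sum_i max(0, u_i - u_{i+1})^2: only decreasing steps are penalized
    violations = np.maximum(0.0, -np.diff(u))
    return np.sum(violations ** 2)

print(monotonicity_penalty(np.array([0.0, 1.0, 2.0])))  # 0.0 - already nondecreasing
print(monotonicity_penalty(np.array([0.0, 2.0, 1.0])))  # 1.0 - one unit step down, squared
```

The penalty vanishes exactly on feasible (nondecreasing) coefficient vectors, so it only nudges the optimizer when a constraint is violated.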
Mathematically, we’ll need to solve:
\[\begin{align} \min_{\mathbf{u}} & \quad \| \mathbf{V} \mathbf{u} - \mathbf{y} \|^2 + \alpha \sum_{i=0}^{n-2} (u_{i+2} - 2 u_{i+1} + u_i)^2 \\ \text{s.t.} &\quad u_{i+1} \geq u_i && i = 0, \dots, n-1 \end{align}\]where \(\alpha\) is a tuned regularization parameter. The code in CVXPY, after tuning \(\alpha\), looks like this:
deg = 20
alpha = 2
u = cp.Variable(deg + 1)  # a placeholder for the optimal Bernstein coefficients
loss = cp.sum_squares(bernvander(x, deg) @ u - y)  # the L2 loss - the sum of squared residuals
reg = alpha * cp.sum_squares(cp.diff(u, 2))  # penalty for 2nd order differences
constraints = [cp.diff(u) >= 0]  # constraints: u_{i+1} - u_i >= 0
problem = cp.Problem(cp.Minimize(loss + reg), constraints)
After solving the problem and plotting the polynomial, I obtained this:
Not bad! Now we will study the Bernstein basis from a more theoretical perspective to understand their representation power.
So, is it really a basis? If it is, then there should be a simple transition matrix for going back and forth between the standard and the Bernstein basis. In this case, solving a regression problem with both bases should be equivalent. So why should we bother working with the Bernstein basis? We explore those questions below.
First, let’s begin by showing that it’s indeed a basis. Note that the set \(\mathbb{B}_n\) of n-th degree Bernstein polynomials indeed has \(n+1\) polynomial functions. So it remains to be convinced that any polynomial can be expressed as a weighted sum of these \(n+1\) functions. It turns out that for any \(k < n\), we can write: \(x^k = \sum_{j=k}^n \frac{\binom{j}{k}}{\binom{n}{k}} b_{j, n}(x) = \sum_{j=k}^n q_{j,k} b_{j,n}(x)\)
The proof is a bit technical and involved, and requires the inverse binomial transform, but it gives us our desired result: any power of \(x\) up to \(n\) can be expressed using Bernstein polynomials. Consequently, any polynomial of degree up to \(n\) can be expressed as a weighted sum of Bernstein polynomials, and therefore:
The representation power of Bernstein polynomials is identical to that of the standard basis. Both represent the same model class we fit to data.
Using Bernstein polynomials, in itself, does not restrict or regularize the model class, since any polynomial can be written in Bernstein form. The Bernstein form is just easier to regularize.
This observation leads to some interesting insights, which will be easier to describe by writing the standard and the Bernstein bases as vectors:
\[\mathbf{p}_n(x)=(1, x, x^2, \cdots, x^n)^T, \qquad \mathbf{b}_n(x)=(b_{0,n}(x), \cdots, b_{n,n}(x))^T\]We note that the standard and Bernstein Vandermonde matrix rows we saw in the previous post are exactly \(\mathbf{p}_n(x_i)\), and \(\mathbf{b}_n(x_i)\), respectively. Using this notation, we can write the powers of \(x\) in terms of the Bernstein basis in matrix form, by gathering the coefficients \(q_{j,k}\) above, assuming that \(q_{j,k}=0\) whenever \(j<k\), into a triangular matrix \(\mathbf{Q}_n\):
\[\mathbf{p}_n(x)^T = \mathbf{b}_n(x)^T \mathbf{Q}_n\]The matrix \(\mathbf{Q}_n\) is the basis transition matrix - it can transform any polynomial written using the standard basis to the same polynomial written in the Bernstein basis:
\[a_0 + a_1 x + \dots + a_n x^n = \mathbf{p}_n(x)^T \mathbf{a} = \mathbf{b}_n(x)^T \mathbf{Q}_n \mathbf{a}\]The vector \(\mathbf{Q}_n \mathbf{a}\) is the coefficient vector w.r.t the Bernstein basis. Does this mean we can actually fit a polynomial in the standard basis, but regularize it as if it was written in the Bernstein basis? Well, yes we can! Polynomial fitting in the Bernstein basis can be written as
\[\min_{\mathbf{w}} \quad \frac{1}{2}\sum_{i=1}^m (\mathbf{b}_n(x_i)^T \mathbf{w} - y_i)^2 + \frac{\alpha}{2} \| \mathbf{w} \|^2.\]The constants \(\frac{1}{2}\) are for convenience later, when taking derivatives. Introducing the change of variables \(\mathbf{w} = \mathbf{Q}_n \mathbf{a}\), the above problem becomes equivalent to:
\[\min_{\mathbf{a}} \quad \frac{1}{2} \sum_{i=1}^m (\mathbf{p}_n(x_i)^T \mathbf{a} - y_i)^2 + \frac{\alpha}{2} \| \mathbf{Q}_n \mathbf{a} \|^2. \tag{P}\]Thus, we can fit a polynomial in terms of its standard basis coefficients \(\mathbf{a}\), but regularize its Bernstein coefficients \(\mathbf{Q}_n \mathbf{a}\). So does it really work? Let’s check! First, let’s implement the transition matrix function:
import numpy as np
from scipy.special import binom
def basis_transition(n):
    ks = np.arange(0, 1 + n)
    js = np.arange(0, 1 + n).reshape(-1, 1)
    Q = binom(js, ks) / binom(n, ks)
    Q = np.tril(Q)
    return Q
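As a sanity check, we can verify numerically that multiplying the Bernstein Vandermonde matrix by \(\mathbf{Q}_n\) reproduces the standard Vandermonde matrix, i.e. that \(\mathbf{p}_n(x)^T = \mathbf{b}_n(x)^T \mathbf{Q}_n\) holds. The sketch below repeats the definitions of basis_transition and bernvander (from the previous post) so it is self-contained:

```python
import numpy as np
import numpy.polynomial.polynomial as poly
from scipy.special import binom
from scipy.stats import binom as binom_dist

def basis_transition(n):
    ks = np.arange(0, 1 + n)
    js = np.arange(0, 1 + n).reshape(-1, 1)
    return np.tril(binom(js, ks) / binom(n, ks))

def bernvander(x, deg):
    return binom_dist.pmf(np.arange(1 + deg), deg, x.reshape(-1, 1))

# The standard Vandermonde matrix should equal the Bernstein Vandermonde
# matrix times the basis transition matrix.
n = 5
x = np.linspace(0.05, 0.95, 7)
assert np.allclose(poly.polyvander(x, n), bernvander(x, n) @ basis_transition(n))
```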
The regularized least-squares problem (P) above is a convex problem that can be easily solved by equating the gradient w.r.t \(\mathbf{a}\) with zero. Putting all the \(\mathbf{p}_n(x_i)\) for the data points \(i = 1, \dots, m\) into the rows of the Vandermonde matrix \(\mathbf{V}\), setting the gradient to zero yields:
\[\mathbf{V}^T (\mathbf{V} \mathbf{a} - \mathbf{y}) + \alpha \mathbf{Q}_n^T \mathbf{Q}_n \mathbf{a} = 0.\]Re-arranging, and solving for the coefficients \(\mathbf{a}\), we obtain:
\[\mathbf{a} = (\mathbf{V}^T \mathbf{V} + \alpha \mathbf{Q}_n^T \mathbf{Q}_n)^{-1} \mathbf{V}^T \mathbf{y}\]So let’s implement the fitting procedure:
import numpy.polynomial.polynomial as poly
def fit_bernstein_reg(x, y, alpha, deg):
    """ Fit a polynomial in the standard basis to the data-points `(x[i], y[i])` with Bernstein
    regularization `alpha`, and degree `deg`.
    """
    V = poly.polyvander(x, deg)
    Q = basis_transition(deg)
    A = V.T @ V + alpha * Q.T @ Q
    b = V.T @ y
    # solve the linear system
    a = np.linalg.solve(A, b)
    return a
Now, let’s try reproducing the results of the previous post with degrees 50 and 100.
def true_func(x):
    return np.sin(8 * np.pi * x) / np.exp(x) + x
# define number of points and noise
m = 30
sigma = 0.1
deg = 50
# generate features
np.random.seed(42)
x = np.random.rand(m)
y = true_func(x) + sigma * np.random.randn(m)
# fit the polynomial
a = fit_bernstein_reg(x, y, 5e-4, deg=deg)
# plot the original function, the points, and the fit polynomial
plt_xs = np.linspace(0, 1, 1000)
polynomial_ys = poly.polyvander(plt_xs, deg) @ a
plt.scatter(x, y)
plt.plot(plt_xs, true_func(plt_xs), 'blue')
plt.plot(plt_xs, polynomial_ys, 'red')
plt.show()
I got the following plot, which appears pretty similar to what we got in the previous post, but slightly worse:
Let’s crank up the degree to 100 by setting deg = 100. I got the following image:
Again, slightly worse than what we achieved by directly fitting the Bernstein form, but appears close.
There are two technical issues with our idea. First, manually fitting models rather than relying on standard tools, such as SciKit-Learn, appears troublesome, and in terms of computational efficiency, we need to deal with the additional matrix \(\mathbf{Q}_n\). Second, and most importantly, the standard Vandermonde matrix and the basis transition matrix \(\mathbf{Q}_n\) are extremely ill conditioned^{9}. This makes it hard to actually solve the fitting problem and obtain coefficients that are close to the true optimal coefficients. This is true regardless of whether we choose direct matrix inversion, CVXPY, or an SGD-based optimizer from PyTorch or TensorFlow.
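To get a feel for how bad the ill conditioning is, here is a small sketch computing the condition number of the standard Vandermonde matrix on \([0, 1]\) for two degrees; the exact numbers depend on the sample points, but the explosive growth does not:

```python
import numpy as np
import numpy.polynomial.polynomial as poly

# Condition number of the standard Vandermonde matrix on [0, 1]
# for a low and a high degree - it explodes as the degree grows.
x = np.linspace(0, 1, 200)
cond_10 = np.linalg.cond(poly.polyvander(x, 10))
cond_40 = np.linalg.cond(poly.polyvander(x, 40))
print(cond_10, cond_40)  # cond_40 is many orders of magnitude larger
```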
Due to inefficiency and ill conditioning, this trick has little value in practice. But it provides us with an important insight: achieving good regularization requires a sophisticated non-diagonal matrix in the regularization term. It’s not a formal statement, but probably any “good” basis will have a non-diagonal transition matrix to the standard basis. This means that fitting a polynomial in the standard basis using typical ML tricks of rescaling the columns of the Vandermonde matrix has little chance of success. And it doesn’t matter if we rescale using min-max scaling, or standardization to zero mean and unit variance. To fit a polynomial, we need to use a “good” basis directly.
In this post we explored the ability of the Bernstein form to control the shape of the curve we’re fitting - making it smooth, increasing, decreasing, convex, or concave. Then, we saw that Bernstein polynomials are just polynomials - they have the same representation power as the standard basis, but are easier to regularize.
The next post will be more engineering oriented. We’ll see how to use the Bernstein basis for feature engineering and fitting models to some real-world data-sets, and we will write a SciKit-Learn transformer to do so. Stay tuned!
Lorentz, G. G. (1952). Bernstein Polynomials. University of Toronto Press. ↩
Chang, I. S., Chien, L. C., Hsiung, C. A., Wen, C. C., & Wu, Y. J. (2007). Shape restricted regression with random Bernstein polynomials. Lecture Notes-Monograph Series, 187-202. ↩
Sluis, S. (2019). Everything you need to know about bid shading. ↩
Karlsson, N., & Sang, Q. (2021, May). Adaptive bid shading optimization of first-price ad inventory. In 2021 American Control Conference (ACC) (pp. 4983-4990). IEEE. ↩
Gligorijevic, D., Zhou, T., Shetty, B., Kitts, B., Pan, S., Pan, J., & Flores, A. (2020, October). Bid shading in the brave new world of first-price auctions. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management (pp. 2453-2460). ↩
Niculescu-Mizil, A., & Caruana, R. (2005, August). Predicting good probabilities with supervised learning. In Proceedings of the 22nd international conference on Machine learning (pp. 625-632). ↩
Platt, J. (1999). Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers, 10(3), 61-74. ↩
Zadrozny, B., & Elkan, C. (2002, July). Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 694-699). ↩
Intuitively, a matrix is ill conditioned if numerical algorithms fail to accurately perform computations with this matrix, such as matrix multiplication, solving a linear system, or training a machine learned model. ↩
When fitting a non-linear model using linear regression, we typically generate new features using non-linear functions. We also know that, in theory, any function can be approximated by a sufficiently high degree polynomial. This result is known as the Weierstrass approximation theorem. But many blogs, papers, and even books tell us that high-degree polynomials should be avoided. They tend to oscillate and overfit, and regularization doesn’t help! They even scare us with images, such as the one below, where the polynomial fit using the data points (in red) is far away from the true function (in blue):
It turns out that it’s just a MYTH. There’s nothing inherently wrong with high degree polynomials, and in contrast to what is typically taught, high degree polynomials are easily controlled using standard ML tools, like regularization. The source of the myth stems mainly from two misconceptions about polynomials that we will explore here. In fact, not only are they great non-linear features, certain representations also provide us with powerful control over the shape of the function we wish to learn.
A colab notebook with the code for reproducing the above results is available here.
Vladimir Vapnik, in his famous book “The Nature of Statistical Learning Theory” which is cited more than 100,000 times as of today, coined the approximation vs. estimation balance. The approximation power of a model is its ability to represent the “reality” we would like to learn. Typically, approximation power increases with the complexity of the model - more parameters mean more power to represent any function to arbitrary precision. Polynomials are no different - higher degree polynomials can represent functions to higher accuracy. However, more parameters make it difficult to estimate these parameters from the data.
Indeed, higher degree polynomials have a higher capacity to approximate arbitrary functions. And since they have more coefficients, these coefficients are harder to estimate from data. But how does it differ from other non-linear features, such as the well-known radial basis functions? Why do polynomials have such a bad reputation? Are they truly hard to estimate from data?
It turns out that the primary source is the standard polynomial basis for n-degree polynomials \(\mathbb{E}_n = \{1, x, x^2, \dots, x^n\}\). Indeed, any degree \(n\) polynomial can be written as a linear combination of these functions:
\[\alpha_0 \cdot 1 + \alpha_1 \cdot x + \alpha_2 \cdot x^2 + \cdots + \alpha_n x^n\]But the standard basis \(\mathbb{E}_n\) is awful for estimating polynomials from data. In this post we will explore other ways to represent polynomials that are appropriate for machine learning, and are readily available in standard Python packages. We note that one advantage of polynomials over other non-linear feature bases is that the only hyperparameter is their degree. There is no “kernel width”, like in radial basis functions^{1}.
The second source of their bad reputation is a misunderstanding of Weierstrass’ approximation theorem. It’s usually cited as “polynomials can approximate arbitrary continuous functions”. But that’s not entirely true. They can approximate arbitrary continuous functions in an interval. This means that when using polynomial features, the data must be normalized to lie in an interval. It can be done using min-max scaling, computing empirical quantiles, or passing the feature through a sigmoid. But we should avoid using polynomials on raw un-normalized features.
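For example, a min-max rescaling into \([0, 1]\) is a one-liner (a sketch; at prediction time the min and max computed on the training set should of course be reused):

```python
import numpy as np

# Min-max scale a raw feature into [0, 1] before generating polynomial features.
def minmax_scale(x):
    return (x - x.min()) / (x.max() - x.min())

x_raw = np.array([3.0, 10.0, 7.5, -2.0])
x01 = minmax_scale(x_raw)
print(x01.min(), x01.max())  # 0.0 1.0
```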
In this post we will demonstrate fitting the function
\[f(x)=\sin(8 \pi x) / \exp(x)+x\]on the interval \([0, 1]\) by fitting to \(m=30\) samples corrupted by Gaussian noise. The following code implements the function and generates samples:
import numpy as np
def true_func(x):
    return np.sin(8 * np.pi * x) / np.exp(x) + x
m = 30
sigma = 0.1
# generate features
np.random.seed(42)
X = np.random.rand(m)
y = true_func(X) + sigma * np.random.randn(m)
For function plotting, we will use uniformly-spaced points in \([0, 1]\). The following code plots the true function and the sample points:
import matplotlib.pyplot as plt
plt_xs = np.linspace(0, 1, 1000)
plt.scatter(X.ravel(), y.ravel())
plt.plot(plt_xs, true_func(plt_xs), 'blue')
plt.show()
Now let’s fit a polynomial to the sampled points using the standard basis. Namely, we’re given the set of noisy points \(\{ (x_i, y_i) \}_{i=1}^m\), and we need to find the coefficients \(\alpha_0, \dots, \alpha_n\) that minimize:
\[\sum_{i=1}^m (\alpha_0 + \alpha_1 x_i + \dots + \alpha_n x_i^n - y_i)^2\]As expected, this is readily accomplished by transforming each sample \(x_i\) to a vector of features \(1, x_i, \dots, x_i^n\), and fitting a linear regression model to the resulting features. Fortunately, NumPy has the numpy.polynomial.polynomial.polyvander
function. It takes a vector containing \(x_1, \dots, x_m\) and produces the matrix
\[\mathbf{V} = \begin{pmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^n \\ 1 & x_2 & x_2^2 & \cdots & x_2^n \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_m & x_m^2 & \cdots & x_m^n \end{pmatrix}\]
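For example, on a tiny input:

```python
import numpy as np
import numpy.polynomial.polynomial as poly

# Each row contains the powers 1, x_i, x_i^2, ..., x_i^n of one sample.
V = poly.polyvander(np.array([2.0, 3.0]), deg=2)
print(V)
# [[1. 2. 4.]
#  [1. 3. 9.]]
```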
The name of the function comes from the name of the matrix - the Vandermonde matrix. Let’s use it to fit a polynomial of degree \(n=50\).
from sklearn.linear_model import LinearRegression
import numpy.polynomial.polynomial as poly
n = 50
model = LinearRegression(fit_intercept=False)
model.fit(poly.polyvander(X, deg=n), y)
The reason we use fit_intercept=False
is because the ‘intercept’ is provided by the first column of the Vandermonde matrix. Now we can plot the function we just fit:
plt.scatter(X.ravel(), y.ravel()) # plot the samples
plt.plot(plt_xs, true_func(plt_xs), 'blue') # plot the true function
plt.plot(plt_xs, model.predict(poly.polyvander(plt_xs, deg=n)), 'r') # plot the fit model
plt.ylim([-5, 5])
plt.show()
As expected, we got the “scary” image from the beginning of this post. Indeed, the standard basis is awful for model fitting! We hope that regularization provides a remedy, but it does not. Maybe adding some L2 regularization helps? Let’s use the Ridge
class from the sklearn.linear_model
package to fit an L2 regularized model:
from sklearn.linear_model import Ridge
reg_coef = 1e-7
model = Ridge(fit_intercept=False, alpha=reg_coef)
model.fit(poly.polyvander(X, deg=n), y)
plt.scatter(X.ravel(), y.ravel()) # plot the samples
plt.plot(plt_xs, true_func(plt_xs), 'blue') # plot the true function
plt.plot(plt_xs, model.predict(poly.polyvander(plt_xs, deg=n)), 'r') # plot the fit model
plt.ylim([-5, 5])
plt.show()
We get the following result:
The regularization coefficient of \(\alpha=10^{-7}\) is large enough to break the model in \([0,0.8]\) but not large enough to avoid over-fitting in \([0.8, 1]\). Increasing the coefficient clearly won’t help - the model will be broken even further in \([0, 0.8]\).
Since we will be trying several polynomial bases, it makes sense to write a more generic function for our experiments that will accept various “Vandermonde” matrix functions of the basis of our choice, fit the polynomial using the Ridge
class, and plot it with the original function and the sample points.
def fit_and_plot(vander, n, alpha):
    model = Ridge(fit_intercept=False, alpha=alpha)
    model.fit(vander(X, deg=n), y)
    plt.scatter(X.ravel(), y.ravel()) # plot the samples
    plt.plot(plt_xs, true_func(plt_xs), 'blue') # plot the true function
    plt.plot(plt_xs, model.predict(vander(plt_xs, deg=n)), 'r') # plot the fit model
    plt.ylim([-5, 5])
    plt.show()
Now we can reproduce our latest experiment by invoking:
fit_and_plot(poly.polyvander, n=50, alpha=1e-7)
It turns out that in our sister discipline, approximation theory, researchers also encountered similar difficulties with the standard basis \(\mathbb{E}_n\), and developed a theory for approximating functions by polynomials from different bases. Two prominent examples of bases of \(n\)-degree polynomials, and their corresponding NumPy modules, are:
Chebyshev polynomials - the numpy.polynomial.chebyshev module.
Legendre polynomials - the numpy.polynomial.legendre module.
They are the computational workhorse of a large variety of numerical algorithms that are enabled by approximating a function using a polynomial, and are well-known for their advantages in approximating functions in the \([-1, 1]\) interval^{2}. In particular, the corresponding “Vandermonde” matrices are provided by the chebvander and legvander functions in the corresponding modules above. Each row in these matrices contains the value of the basis functions at each point, just like the standard Vandermonde matrix of the standard basis. For example, the Chebyshev Vandermonde matrix is:
\[\begin{pmatrix} T_0(x_1) & T_1(x_1) & \cdots & T_n(x_1) \\ T_0(x_2) & T_1(x_2) & \cdots & T_n(x_2) \\ \vdots & \vdots & & \vdots \\ T_0(x_m) & T_1(x_m) & \cdots & T_n(x_m) \end{pmatrix}\]
I will not elaborate on their formulas and properties here for a reason that will immediately be revealed. However, I highly recommend Prof. Nick Trefethen’s “Approximation theory and approximation practice” online video course to get familiar with their advantages. His book of the same name is an excellent introduction to the subject.
It might be tempting to try fitting a Chebyshev polynomial using our fit_and_plot
method above directly:
import numpy.polynomial.chebyshev as cheb
fit_and_plot(cheb.chebvander, n=50, alpha=1e-7)
However, that’s not the best thing to do. We aim to fit a function sampled from \([0, 1]\), but the Chebyshev basis “lives” in \([-1, 1]\). Therefore, we will add the transformation \(x \to 2x-1\) before invoking the chebvander
function:
def scaled_chebvander(x, deg):
    return cheb.chebvander(2 * x - 1, deg=deg)
fit_and_plot(scaled_chebvander, n=50, alpha=1)
Note that a different basis requires a different regularization coefficient. We get the following result:
Whoa! Seems even worse than the standard basis! Maybe more regularization helps?
fit_and_plot(scaled_chebvander, n=50, alpha=10)
Appears that our polynomial is both a bad fit for the function, and extremely oscillatory. Even worse than the standard basis! Interested readers can repeat the experiment with Legendre polynomials and see a slightly better, but similar result. So what’s wrong? Is everything that approximation theory tries to teach us about polynomials wrong?
The answer stems from the fundamental difference between two tasks:
Interpolation - approximating a function given the ability to evaluate it exactly at carefully chosen points.
Fitting - estimating a function from noisy samples at points we do not get to choose.
The Chebyshev and Legendre bases perform extremely well at the interpolation task, but not at the fitting task. It turns out that the polynomial \(T_k\) in the Chebyshev basis, and the polynomial \(P_k\) in the Legendre basis, are both \(k\)-degree polynomials. For example, \(T_1\) is a linear function, whereas \(T_{50}\) is a polynomial of degree 50. These two functions are radically different. Thus, the coefficients of \(T_1\) and \(T_{50}\) have “different units”. This property is shared with the standard basis as well. Thus, we have two issues:
Both properties show that for the fitting, rather than the interpolation task, we need something else.
A remedy is provided by the Bernstein basis \(\mathbb{B}_n = \{ b_{0,n}, \dots, b_{n, n} \}\). These are \(n\)-degree polynomials defined on \([0, 1]\) by:
\[b_{i,n}(x) = \binom{n}{i} x^i (1-x)^{n-i}\]These polynomials are widely used in computer graphics to approximate curves and surfaces, but it appears that they’re less known in the machine learning community. In fact, all the text you see on the screen when reading this post is rendered using Bernstein polynomials^{3}. We will study them more in depth in the next posts, but at this stage I would like to point out two simple properties that give an intuitive explanation of why they’re useful in machine learning.
First, note that each \(b_{i,n}\) is an \(n\)-degree polynomial. Thus, when representing a polynomial using
\[p_n(x) = \alpha_0 b_{0,n}(x) + \alpha_1 b_{1,n}(x) + \dots + \alpha_n b_{n,n}(x),\]all the coefficients have the same “units”.
If the formula of \(b_{i,n}(x)\) seems familiar - you are correct. It is exactly the probability mass function of the binomial distribution for obtaining \(i\) successes in a sequence of \(n\) independent trials whose success probability is \(x\). Therefore, \(b_{i,n}(x) \geq 0\), and \(\sum_{i=0}^n b_{i,n}(x) = 1\) for any \(x \in [0, 1]\). Consequently, the polynomial \(p_n(x)\) is just a weighted average of the coefficients \(\alpha_0, \dots, \alpha_n\). So not only do the coefficients have the same “units”, their “units” are also the same as the model’s labels. Thus, they’re much easier to regularize - they’re all on the same “scale”.
Finally, due to the equivalence with the binomial distribution p.m.f, we can implement a “Vandermonde” matrix in Python using the scipy.stats.binom.pmf
function.
from scipy.stats import binom
def bernvander(x, deg):
    return binom.pmf(np.arange(1 + deg), deg, x.reshape(-1, 1))
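Before fitting, we can quickly verify the claim above - every row of this matrix is a probability vector (a sketch; the definition is repeated so the snippet is self-contained):

```python
import numpy as np
from scipy.stats import binom

# (Repeated here so the snippet is self-contained.)
def bernvander(x, deg):
    return binom.pmf(np.arange(1 + deg), deg, x.reshape(-1, 1))

# Each row is a binomial p.m.f.: entries are nonnegative and sum to one.
B = bernvander(np.linspace(0, 1, 11), 50)
assert (B >= 0).all()
assert np.allclose(B.sum(axis=1), 1.0)
```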
Let’s try and fit without regularization at all:
fit_and_plot(bernvander, n=50, alpha=0)
We see our regular over-fitting. Now let’s see that they’re indeed easy to regularize. After trying several regularization coefficients, I came up with this:
fit_and_plot(bernvander, n=50, alpha=5e-7)
Beautiful! This is a polynomial of degree 50! The fit is great, no oscillations, and the misfit near the right endpoint stems from the noise - I don’t believe there’s enough information in the data to convey the fact that it should “curve up” rather than “curve down”.
Let’s see what happens when we crank up the degree. Can we produce a nice non-oscillating polynomial?
fit_and_plot(bernvander, n=100, alpha=5e-4)
This is a polynomial of degree 100, that does not overfit!
The notorious reputation of high-degree polynomials in the machine learning community is primarily a myth. Despite this, papers, books, and blog posts are based on this premise as if it were an axiom. Bernstein polynomials are little known in the machine learning community, but there are a few papers^{4}^{5} using them to represent polynomial features. Their main advantage is ease of use - we can use high degree polynomials to exploit their approximation power, and easily control model complexity with just one hyperparameter - the regularization coefficient.
In the following posts we will explore the Bernstein basis in more detail. We will use it to create polynomial features for real-world datasets and test it versus the standard basis. Moreover, we will see how to regularize the coefficients to control the shape of the function we aim to represent. For example, what if we know that the function we’re aiming to fit is increasing? Stay tuned!
There are also kernel methods, and polynomial kernels. But polynomial kernels suffer from problems similar to the standard basis. ↩
The standard basis is not that awful. It’s a great basis for representing polynomials on the complex unit circle. In fact, the Fourier transform is based exactly on this observation. ↩
See Bézier curves and TrueType font outlines. ↩
Marco, Ana, and José-Javier Martı. “Polynomial least squares fitting in the Bernstein basis.” Linear Algebra and its Applications 433.7 (2010): 1254-1264. ↩
Wang, Jiangdian, and Sujit K. Ghosh. “Shape restricted nonparametric regression with Bernstein polynomials.” Computational Statistics & Data Analysis 56.9 (2012): 2729-2741. ↩
In this final episode of the proximal point series I would like to take the method to the extreme, and show that we can actually train a model which, composed with an appropriate loss, produces functions which are non-linear and non-convex: we’ll be training a factorization machine for classification problems without linearly approximating loss functions and without relying on loss gradients. Factorization machines and their variants are widely used in recommender systems, e.g. recommending movies to users. I assume readers are familiar with the basics, and below I provide only a brief introduction, so that throughout this post we have a consistent notation and terminology, and understand the assumptions we make.
I do not claim that it is the best method for training factorization machines, but it is indeed an interesting challenge to see the limits of efficiently implementable proximal point methods. We’ll have some more advanced optimization theory, even some advanced linear algebra, but most importantly, at the end of the journey we’ll have a github repo with code which you can run and try on your own dataset!
Since it’s an ‘academic’ experiment in nature, and I do not aim to implement the most efficient and robust code, we’ll make some simplifying assumptions. However, a production-ready training algorithm will not be far away from the implementation we construct in this post.
Let’s begin with a quick introduction to factorization machines. Factorization machines are usually trained on categorical data representing the users and the items. For example, age group and gender may be user features, while product category and price group may be item features. The model embeds each categorical feature into a latent space of some pre-defined dimension \(k\), and the model’s prediction comprises inner products of the latent vectors corresponding to the current sample. The simplest variant is second-order factorization machines, which are the focus of this post.
Formally, our second-order factorization machine \(\sigma(w, x)\) is given a binary input \(w \in \{0, 1\}^m\) which is a one-hot encoding of a subset of at most \(m\) categorical features. For example, suppose we would like to predict the affinity of people with chocolate. Assume, for simplicity, that we have only two gender values \(\{ \mathrm{male}, \mathrm{female} \}\), and two age groups \(\{ \mathrm{young}, \mathrm{old} \}\). For our items, suppose we have only one feature - the chocolate type, which may take the values \(\{\mathrm{dark}, \mathrm{milk}, \mathrm{white}\}\). In that case, the model’s input is the vector of zeros and ones encoding feature indicators:
\[w=(w_{\mathrm{male}}, w_{\mathrm{female}}, w_{\mathrm{young}}, w_{\mathrm{old}}, w_{\mathrm{dark}}, w_{\mathrm{milk}}, w_{\mathrm{white}}).\]A young male who tasted dark chocolate is represented by the vector
\[w = (1, 0, 1, 0, 1, 0, 0).\]In general, the vector \(w\) can be defined by arbitrary real numbers, but I promised that we’ll make simplifying assumptions :)
The model’s parameter vector \(x = (b_0, b_1, \dots, b_m, v_1, \dots, v_m)\) is composed of the model’s global bias \(b_0 \in \mathbb{R}\), the biases \(b_i \in \mathbb{R}\) for the features \(i\in \{1, \dots, m\}\), and the latent vectors \(v_i \in \mathbb{R}^k\) for the same features with \(k\) being the embedding dimension. The model computes:
\[\sigma(w, x) := b_0 + \sum_{i = 1}^m w_i b_i + \sum_{i = 1}^m\sum_{j = i + 1}^{m} (v_i^T v_j) w_i w_j.\]Let’s set up some notation which will become useful throughout this post. We will denote a set of consecutive integers by \(i..j=\{i, i+1, \dots, j\}\), and the set of distinct pairs of integers from a set \(J\) is denoted by \(P[J]=\{ (i,j) \in J\times J : i<j \}\). Consequently, we can re-write:
\[\sigma(w,x)=b_0 + \sum_{i\in 1..m} w_i b_i+\sum_{(i,j) \in P[1..m]} (v_i^T v_j) w_i w_j\]At this stage this notation does not seem useful, but it will simplify things later in this post. We’ll use this notation consistently throughout the post.
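To make the formula concrete, here is a direct, quadratic-time NumPy sketch of \(\sigma(w, x)\); the function name and the random parameter values are purely illustrative:

```python
import numpy as np

# A direct, quadratic-time evaluation of sigma(w, x) from the definition above.
def fm_predict(w, b0, b, V):
    m = len(w)
    linear = b0 + w @ b                        # global bias + per-feature biases
    pairwise = sum((V[i] @ V[j]) * w[i] * w[j]
                   for i in range(m) for j in range(i + 1, m))
    return linear + pairwise

rng = np.random.default_rng(0)
m, k = 5, 3
w = np.array([1.0, 0.0, 1.0, 1.0, 0.0])        # a one-hot encoded sample
b0, b, V = 0.1, rng.normal(size=m), rng.normal(size=(m, k))
score = fm_predict(w, b0, b, V)
```

We will shortly see how to avoid the quadratic cost of the pairwise sum.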
For completeness, let’s implement a factorization machine in PyTorch. To that end, recall a famous trick introduced by Steffen Rendle in his pioneering paper^{1} on factorization machines, based on the formula
\[\Bigl\| \sum_{i\in 1..m} w_i v_i \Bigr\|_2^2 = \sum_{i\in 1..m} \|w_i v_i\|_2^2 + 2 \sum_{(i,j)\in P[1..m]} (v_i^T v_j) w_i w_j.\]After re-arrangement, the above results in:
\[\sum_{(i,j)\in P[1..m]} (v_i^T v_j) w_i w_j= \frac{1}{2}\Bigl\| \sum_{i\in 1..m} w_i v_i \Bigr\|_2^2-\frac{1}{2}\sum_{i\in 1..m} \|w_i v_i\|_2^2. \tag{L}\]Since \(w\) is a binary vector, we can associate it with its non-zero indices \(\operatorname{nz}(w)\), and the right-hand side of the above term can be written as:
\[\frac{1}{2}\Bigl\| \sum_{i \in \operatorname{nz}(w)} v_i \Bigr\|_2^2-\frac{1}{2}\sum_{i \in \operatorname{nz}(w)} \| v_i\|_2^2.\]Consequently, the pairwise terms can be computed in time linear in the number of non-zero indicators in \(w\), instead of the quadratic time imposed by the naive way. The PyTorch implementation below uses the trick above.
import torch
from torch import nn
class FM(torch.nn.Module):
    def __init__(self, m, k):
        super(FM, self).__init__()
        self.bias = nn.Parameter(torch.zeros(1))
        self.biases = nn.Parameter(torch.zeros(m))
        self.vs = nn.Embedding(m, k)
        with torch.no_grad():
            torch.nn.init.normal_(self.vs.weight, std=0.01)
            torch.nn.init.normal_(self.biases, std=0.01)

    def forward(self, w_nz):  # since w are indicators, we simply use the non-zero indices
        vs = self.vs(w_nz)
        # in vs:
        #   dim = 0 is the mini-batch dimension. We would like to operate on each elem. of a mini-batch separately.
        #   dim = 1 are the embedding vectors
        #   dim = 2 are their components.
        pow_of_sum = vs.sum(dim=1).square().sum(dim=1)  # sum vectors -> square -> sum components
        sum_of_pow = vs.square().sum(dim=[1, 2])        # square -> sum vectors and components
        pairwise = 0.5 * (pow_of_sum - sum_of_pow)
        biases = self.biases
        linear = biases[w_nz].sum(dim=1)  # sum biases for each element of the mini-batch
        return pairwise + linear + self.bias
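As a quick sanity check of the pairwise trick used in forward, we can compare it against the naive double sum on a few random latent vectors (a NumPy sketch, for simplicity):

```python
import numpy as np

# Verify formula (L): the linear-time "power of sum minus sum of powers" trick
# equals the naive O(m^2) sum of pairwise inner products.
rng = np.random.default_rng(0)
vs = rng.normal(size=(4, 3))  # latent vectors of the active (non-zero) features

naive = sum(vs[i] @ vs[j] for i in range(4) for j in range(i + 1, 4))
trick = 0.5 * (np.square(vs.sum(axis=0)).sum() - np.square(vs).sum())
assert np.isclose(naive, trick)
```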
If we are interested in solving a regression problem, i.e. predicting arbitrary real values, such as a score a person would give to the chocolate, we can use \(\sigma\) directly to make predictions. If we are in the binary classification setup, i.e. predict the probability that a person likes the corresponding chocolate, we compose \(\sigma\) with a sigmoid, and predict \(p(w,x) = (1+e^{-\sigma(w,x)})^{-1}\).
In this post we are interested in the binary classification setup, with the binary cross-entropy loss. Namely, given a label \(y \in \{0,1\}\) the loss is:
\[-y \ln(p(w,x)) - (1 - y) \ln(1 - p(w,x)).\]For example, if we would like to predict which chocolate people like, we could train the model on a data-set with samples of people who liked a certain chocolate having the label \(y = 1\), and people who tasted but did not like it having the label \(y = 0\). Having trained the model, we can recommend chocolate to a person by choosing the one with the highest probability of being liked.
Using a simple transformation \(\hat{y} = 2y-1\) we can remap the labels to be in \(\{-1, 1\}\) instead. Then, it isn’t hard to verify that the binary cross-entropy loss above reduces to:
\[\ln(1+\exp(-\hat{y} \sigma(w,x))).\]Consequently, our aim will be training over the set \(\{ (w_i, \hat{y}_i) \}_{i=1}^n\) by minimizing the average loss
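We can verify this reduction numerically for both label values (a sketch with an arbitrary model output):

```python
import numpy as np

# Check that the binary cross-entropy with y in {0, 1} equals the logistic
# loss ln(1 + exp(-y_hat * s)) with y_hat = 2y - 1; `s` is an arbitrary score.
s = 1.7
for y in (0.0, 1.0):
    p = 1.0 / (1.0 + np.exp(-s))
    bce = -y * np.log(p) - (1 - y) * np.log(1 - p)
    y_hat = 2 * y - 1
    assert np.isclose(bce, np.log1p(np.exp(-y_hat * s)))
```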
\[\frac{1}{n} \sum_{i=1}^n \underbrace{\ln(1+\exp(-\hat{y}_i \sigma(w_i, x)))}_{f_i(x)}.\]Instead of using regular SGD-based methods for training, which construct a linear approximations of \(f_i\) and are able to use only the information provided by the gradient, we will avoid approximating and use the loss itself via the stochastic proximal point algorithm - at iteration \(t\) choose \(f \in \{f_1, \dots, f_n\}\) and compute:
\[x_{t+1} = \operatorname*{argmin}_x \left\{ f(x) + \frac{1}{2\eta} \|x - x_t\|_2^2 \right\}. \tag{P}\]Careful readers might notice that the formula above is total nonsense in general. Why? Well, each \(f\) is a non-convex function of \(x\). If \(f\) was convex, we would obtain a unique and well-defined minimizer \(x_{t+1}\). However, in general, the \(\operatorname{argmin}\) above is a set of minimizers, which might even be empty! In this post we will attempt to mitigate this issue:
- We will find conditions on the step-size \(\eta\) which make the function inside the \(\operatorname{argmin}\) convex, so that \(x_{t+1}\) is unique and well-defined.
- We will then use convex duality to compute \(x_{t+1}\) efficiently.
Having done the above, we’ll be able to construct an algorithm which can train classifying factorization machines which exploit the exact loss function, instead of just relying on its slope as in SGD.
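Before diving in, the reduction of the binary cross-entropy to the softplus form under the remap \(\hat{y}=2y-1\) is easy to verify numerically. Here is a small self-contained check, with arbitrary values of \(\sigma\):

```python
import math

def bce(y, sigma):
    # -y*ln(p) - (1-y)*ln(1-p) with p = sigmoid(sigma): the original loss
    p = 1.0 / (1.0 + math.exp(-sigma))
    return -y * math.log(p) - (1 - y) * math.log(1 - p)

def softplus_form(y, sigma):
    # ln(1 + exp(-y_hat * sigma)) with the remapped label y_hat = 2y - 1
    return math.log(1.0 + math.exp(-(2 * y - 1) * sigma))

for y in (0, 1):
    for sigma in (-2.5, -0.1, 0.0, 1.3, 4.0):
        assert abs(bce(y, sigma) - softplus_form(y, sigma)) < 1e-9
```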
In previous posts we heavily relied on duality in general, and convex conjugates in particular, and this post is no exception. Recall that the convex conjugate of a function \(h\) is defined by:
\[h^*(z) = \sup_x \{ x^T z - h(x) \},\]and recall also that in a previous post we saw that \(h(t)=\ln(1+\exp(t))\) is convex, and its convex conjugate is:
\[h^*(z) = \begin{cases} z\ln(z) + (1 - z) \ln(1 - z), & 0 < z < 1 \\ 0, & z \in \{0, 1\} \\ +\infty, & \text{otherwise}. \end{cases}\]An interesting result about conjugates is that under some technical conditions, which hold for \(h(t)\) above, we have \(h^{**} = h\), namely, the conjugate of \(h^*\) is \(h\). Moreover, in our case the \(\sup\) in the conjugate’s definition can be replaced with a \(\max\), since the supremum is always attained^{2}. Why is it useful? Since now we know that:
\[\ln(1+\exp(t))=\max_z \left\{ t z - z \ln(z) - (1-z) \ln(1-z) \right\}.\]Consequently, the term inside the \(\operatorname{argmin}\) of the proximal point step (P) can be written as:
\[\begin{aligned} f(x) &+ \frac{1}{2\eta} \|x - x_t\|_2^2 \\ &\equiv \ln(1+\exp(-\hat{y} \sigma(w,x))) + \frac{1}{2\eta} \|x - x_t\|_2^2 \\ &= \max_z \Bigl\{ \underbrace{ -z \hat{y} \sigma(w,x) + \frac{1}{2\eta} \|x - x_t\|_2^2 - z\ln(z) - (1-z)\ln(1-z) }_{\phi(x,z)} \Bigr\}. \end{aligned}\]Since we are interested in minimizing the above, we will be solving the saddle-point problem:
\[\min_x \max_z \phi(x,z). \tag{Q}\]Convex duality theory has another interesting form - it provides conditions on saddle-point problems which ensure that we can switch the order of \(\min\) and \(\max\) to obtain an equivalent problem. Why is it interesting? Because switching the order produces
\[\max_z \underbrace{ \min_x \phi(x,z)}_{q(z)},\]and finding the optimal \(z\) means maximizing the one dimensional function \(q\), which may even be as simple as a high-school calculus exercise.
So here is the relevant duality theorem, a simplification of Sion’s minimax theorem from 1958, adapted for this post:
Let \(\phi(x,z)\) be a continuous function which is convex in \(x\) and concave in \(z\). Suppose that the domain of \(\phi\) over \(z\) is compact, i.e. a closed and bounded set. Then,
\[\min_x \max_z \phi(x,z) = \max_z \min_x \phi(x,z)\]
In our case, it’s easy to see that \(\phi\) is indeed concave in \(z\), using the negativity of its second derivative, and its domain, the interval \([0,1]\), is indeed compact. What we require for the theorem’s conditions to hold is convexity in \(x\), which is what we explore next. Then, we’ll see that \(q\), despite not being so simple, can still be quite efficiently maximized. The theorem does not imply that a pair \((x, z)\) solving the max-min problem also solves the min-max problem, but in our case the max-min problem has a unique solution, and in that particular case it indeed solves the min-max problem as well.
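To make the theorem concrete before applying it, here is a tiny numerical illustration with a made-up \(\phi(x,z) = xz + x^2/2 - z^2\), which is convex in \(x\) and concave in \(z\). A brute-force grid search over \(x \in [-2, 2]\) and \(z \in [0, 1]\) shows that both orders of optimization agree:

```python
def phi(x, z):
    return x * z + 0.5 * x * x - z * z   # convex in x, concave in z

xs = [i / 100.0 for i in range(-200, 201)]   # grid for x in [-2, 2]
zs = [i / 100.0 for i in range(0, 101)]      # grid for z in [0, 1] (compact!)

min_max = min(max(phi(x, z) for z in zs) for x in xs)
max_min = max(min(phi(x, z) for x in xs) for z in zs)

# both sides equal 0, attained at x = 0, z = 0
assert abs(min_max - max_min) < 1e-9
```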
Consequently, having found \(z^*=\operatorname{argmax}_z q(z)\), we by construction obtain a formula for computing the optimal \(x\): \(x_{t+1} = \operatorname*{argmin}_x ~ \phi(x, z^*).\)
So let’s begin by ensuring that the conditions for Sion’s theorem hold. Ignoring the terms of \(\phi\) which do not depend on \(x\), we need to study the convexity of the following part as a function of \(x\):
\[(*) = -z \hat{y} \sigma(w,x) + \frac{1}{2\eta} \|x - x_t\|_2^2.\]To that end, we need to open the ‘black box’ and look inside \(\sigma\) again. That’s going to be a bit technical, but it gets us where we need. If you don’t wish to read all the details, you may skip to the conclusion below.
Recall the decomposition \(x = (b_0, b_1, \dots, b_m, v_1, \dots, v_m)\) and the definition
\[\sigma(w, b_0, \dots, b_m, v_1, \dots, v_m) = b_0 + \sum_{i\in1..m} w_i b_i + \sum_{(i,j)\in P[1..m]} (v_i^T v_j) w_i w_j.\]Consequently, we can re-write \((*)\) as:
\[\begin{aligned} (*) =& \color{blue}{-z \hat{y} \Bigl[ b_0 + \sum_{i\in1..m} w_i b_i \Bigr] + \frac{1}{2\eta} \|b - b_t\|_2^2} \\ & \color{brown}{- z \hat{y} \sum_{(i,j)\in P[1..m]} (v_i^T v_j) w_i w_j + \frac{1}{2\eta} \sum_{i\in1..m} \| v_i - v_{i,t} \|_2^2}. \end{aligned}\]The part colored in blue is always convex - it is the sum of a linear function and a convex-quadratic one. It remains to study the convexity of the brown part. Re-arranging the formula for \(\|v_i + v_j\|_2^2\), we obtain that:
\[v_i^T v_j = \frac{1}{2} \|v_i + v_j\|_2^2 - \frac{1}{2}\|v_i\|_2^2 - \frac{1}{2} \|v_j\|_2^2.\]Denoting \(\alpha_{ij} = -z \hat{y} w_i w_j\) we can re-write the brown part as: \(\begin{aligned} \color{brown}{\text{brown}} &= \sum_{(i,j)\in P[1..m]} |\alpha_{ij}| v_i^T ( \operatorname{sign}(\alpha_{ij}) v_j) + \frac{1}{2\eta} \sum_{i\in1..m} \| v_i - v_{i,t} \|_2^2 \\ &= \frac{1}{2}\sum_{(i,j)\in P[1..m]} |\alpha_{ij}| \left[ \|v_i + \operatorname{sign}(\alpha_{ij}) v_j\|_2^2 - \|v_i\|_2^2-\|v_j\|_2^2 \right] + \frac{1}{2\eta}\sum_{i\in1..m} \left[ \| v_i \|_2^2 \color{darkgray}{- 2 v_i^T v_{i,t} + \|v_{i,t}\|_2^2} \right] \end{aligned}\)
The grayed-out part on the right is linear in \(v_i\), so it’s convex. Since \(\alpha_{ij} = \alpha_{ji}\), to simplify notation we define \(\alpha_{ii}=0\), and the remaining non-greyed parts can be written as:
\[\frac{1}{2} \sum_{(i,j)\in P[1..m]} |\alpha_{ij}| \|v_i + \operatorname{sign}(\alpha_{ij}) v_j\|_2^2 + \sum_{i\in 1..m} \left(\frac{1}{2\eta} - \frac{1}{2}\sum_{j\in 1..m} |\alpha_{ij}|\right) \|v_i\|_2^2.\]Again, the first sum is a sum of convex-quadratic functions, and thus convex. For the second part to be convex, it suffices that for each \(i\) we have
\[\frac{1}{2\eta} \geq \sum_{j\in 1..m} |\alpha_{ij}|,\]or equivalently that the step-size \(\eta\) must satisfy
\[\eta \leq \frac{1}{2\sum_{j \in 1..m} |\alpha_{ij}|}.\]Since \(\vert \alpha_{ij} \vert \leq 1\), we can easily deduce that for any step-size \(\eta \leq \frac{1}{2m}\) we obtain a convex \(\phi\). A better bound is obtained if we can bound the number of indicators in the vector \(w\) which may be non-zero simultaneously. For example, if we have six categorical fields, we will have at most six non-zero elements in \(w\), and thus \(\eta \leq \frac{1}{12}\) suffices.
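In code, the bound above is a one-liner. The helper below is hypothetical (it is not part of the trainer we build), taking the maximal number of simultaneously non-zero indicators:

```python
def convex_safe_step_size(max_nnz):
    # eta <= 1/(2 * max_nnz) guarantees a convex phi; any smaller value works too
    return 1.0 / (2 * max_nnz)

# six categorical fields -> at most six non-zero indicators -> eta <= 1/12
assert convex_safe_step_size(6) == 1 / 12
```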
Convexity is nice if we want Sion’s theorem to hold, but if we want a unique minimizer \(x_{t+1}\) we need strict convexity, which is obtained by using a strict inequality - replace \(\leq\) with \(<\). In this post we will assume that we have at most \(d\) categorical features, and use step-sizes which satisfy
\[\eta \leq \frac{1}{2d+1} < \frac{1}{2d}.\]Suppose that Sion’s theorem holds, and that we can obtain a unique minimizer \(x_{t+1}\). How do we compute it? Well, Sion’s theorem lets us switch the order of \(\min\) and \(\max\), so we are aiming to solve:
\[\max_z \underbrace{ \min_x \phi(x,z)}_{q(z)},\]and explicitly writing \(\phi\) we have:
\[\begin{aligned} q(z) = \min_{b,v_i} \Bigl\{ &-z \hat{y} \Bigl[ b_0 + \sum_{i\in 1..m} w_i b_i \Bigr] + \frac{1}{2\eta} \|b - b_t\|_2^2 \\ &- z \hat{y} \sum_{(i,j) \in P[1..m]} (v_i^T v_j) w_i w_j + \frac{1}{2\eta} \sum_{i\in 1..m} \| v_i - v_{i,t} \|_2^2 \\ &- z \ln(z) - (1-z) \ln(1-z) \Bigr\} \end{aligned}\]From now on it becomes a bit technical, but the end result will be an algorithm to compute \(q(z)\) for any \(z\) by solving the minimization problem over \(x\). Afterwards, we’ll find a way to maximize \(q\) over \(z\).
Using separability^{3} we can separate the minimum above into a sum of three parts: the minimum over the biases \(b\), another minimum over the latent vectors \(v_1, \dots, v_m\), and the term \(-z \ln(z) - (1-z) \ln(1-z)\), namely:
\[\begin{aligned} q(z) &= \underbrace{\min_b \left\{ -z \hat{y} \left[ b_0 + \sum_{i\in 1..m} w_i b_i \right] + \frac{1}{2\eta} \|b - b_t\|_2^2 \right\}}_{q_1(z)} \\ &+ \underbrace{\min_{v_1, \dots, v_m} \left\{ - z \hat{y} \sum_{(i,j) \in P[1..m]} (v_i^T v_j) w_i w_j + \frac{1}{2\eta} \sum_{i\in 1..m} \| v_i - v_{i,t} \|_2^2 \right\}}_{q_2(z)} \\ &-z \ln(z) - (1-z) \ln(1-z) \end{aligned}\]We’ll analyze \(q_1\), and \(q_2\) shortly, but let’s take a short break and implement a skeleton of our training algorithm. A deeper analysis of \(q_1\), \(q_2\), and \(q\) will let us fill the skeleton. On construction, it receives a factorization machine object of the class we implemented above, and the step size. Then, each training step’s input is the set \(\operatorname{nz}(w)\) of the non-zero feature indicators, and the label \(\hat{y}\):
class ProxPtFMTrainer:
def __init__(self, fm, step_size):
# training parameters
self.b0 = fm.bias
self.bs = fm.biases
self.vs = fm.vs
self.step_size = step_size
def step(self, w_nz, y_hat):
pass # we'll replace it with actual code to train the model.
Defining \(\hat{w}=(1, w_1, \dots, w_m)^T\) and \(\hat{b}=(b_0, b_1, \dots, b_m)\), we obtain:
\[\begin{aligned} q_1(z) =&\min_{\hat{b}} \left\{ -z \hat{y} \hat{w}^T \hat{b} + \frac{1}{2\eta} \|\hat{b} - \hat{b}_t\|_2^2 \right\} \end{aligned}.\]The term inside the minimum is a simple convex quadratic, which is minimized by equating its gradient with zero: \(\hat{b}^* = \hat{b}_t + \eta z \hat{y} \hat{w}. \tag{A}\)
Consequently:
\[\begin{aligned} q_1(z) &= -z \hat{y} \hat{w}^T (\hat{b}_t + \eta z \hat{y} \hat{w}) + \frac{1}{2\eta} \| \eta z \hat{y} \hat{w} \|_2^2 \\ &= -\hat{y} (\hat{w}^T \hat{b}_t) z - \eta \hat{y}^2 \|\hat{w}\|_2^2 z^2 + \frac{\eta \hat{y}^2 \|\hat{w}\|_2^2}{2} z^2 \\ &= -\hat{y} (\hat{w}^T \hat{b}_t) z - \frac{\eta \hat{y}^2 \|\hat{w}\|_2^2}{2} z^2 \end{aligned}\]Since \(\hat{y} =\pm 1\) we have that \(\hat{y}^2 = 1\). Moreover, since \(w_i\) are indicators, the term \(\|\hat{w}\|_2^2\) is the number of non-zero entries of \(w\) plus one. So, to summarize, the above can be written as
\[q_1(z) = -\frac{\eta (1 + |\operatorname{nz}(w)|)}{2}z^2 -\hat{y} (w^T b_t + b_{0,t}) z.\]What a surprise - \(q_1\) is just a concave parabola!
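Since both \(q_1\) and the update (A) are closed-form, we can sanity-check them against the definition with arbitrary toy numbers (pure Python, no PyTorch needed):

```python
import math
import random

random.seed(0)
eta, y_hat, z = 0.05, 1.0, 0.3
w_hat = [1.0, 1.0, 0.0, 1.0]                 # (1, w_1, ..., w_m), indicators
b_t = [random.gauss(0, 1) for _ in w_hat]    # previous iterate of the biases

def objective(b):
    # the function inside the min defining q_1
    dot = sum(wi * bi for wi, bi in zip(w_hat, b))
    dist2 = sum((bi - bti) ** 2 for bi, bti in zip(b, b_t))
    return -z * y_hat * dot + dist2 / (2 * eta)

# closed-form minimizer (A), and the closed-form parabola q_1
b_star = [bti + eta * z * y_hat * wi for bti, wi in zip(b_t, w_hat)]
dot_t = sum(wi * bti for wi, bti in zip(w_hat, b_t))
nrm2 = sum(wi ** 2 for wi in w_hat)          # = 1 + |nz(w)|
q1 = -y_hat * dot_t * z - 0.5 * eta * nrm2 * z * z

assert abs(objective(b_star) - q1) < 1e-9
# b_star is indeed a minimizer: random perturbations only increase the objective
for _ in range(100):
    b_pert = [bi + random.gauss(0, 0.1) for bi in b_star]
    assert objective(b_pert) >= objective(b_star) - 1e-12
```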
So, to summarize, what we have here is an explicit expression for \(q_1\), and the formula (A) to update the biases once we have obtained the optimal \(z\).
Let’s implement the code for the two steps above. We’ll see below that the function \(q_1\) will have to be evaluated several times in order to find the optimal \(z\), and consequently it’s beneficial to cache various expensive-to-compute quantities so that every evaluation is quick and efficient. Consequently, the step function will store these parts in the class’s members.
# inside ProxPtFMTrainer
def step(self, w_nz, y_hat):
self.nnz = w_nz.numel() # |nz(w)|
self.bias_sum = self.bs[w_nz].sum().item() # w^T b_t
# TODO - this function will grow as we proceed
def q_one(self, y_hat, z):
return -0.5 * self.step_size * (1 + self.nnz) * (z ** 2) \
- y_hat * (self.bias_sum + self.b0.item()) * z
def update_biases(self, w_nz, y_hat, z):
self.bs[w_nz] = self.bs[w_nz] + self.step_size * z * y_hat
self.b0.add_(self.step_size * z * y_hat)
You might be asking yourself why we stored the bias sum in a member of self. The reason is that we’ll be calling the function q_one repeatedly, and we would like to avoid re-computing time-consuming quantities we can compute only once.
We are aiming to compute
\[q_2(z) = \min_{v_1, \dots, v_m} \left\{ Q(v_1, \dots, v_m, z) \equiv - z \hat{y} \sum_{(i,j)\in P[1..m]} (v_i^T v_j) w_i w_j + \frac{1}{2\eta} \sum_{i\in 1..m} \| v_i - v_{i,t} \|_2^2 \right\}.\]Of course, we assume that we indeed chose \(\eta\) such that \(Q\) inside the \(\min\) operator is strictly convex in \(v_1, \dots, v_m\), so that there is a unique minimizer.
Since \(w\) is a vector of indicators, we can write the function \(Q\) by separating out the part which corresponds to non-zero indicators in \(w\):
\[Q(v_1, \dots, v_m,z) = \underbrace{-z \hat{y} \sum_{(i,j)\in P[\operatorname{nz}(w)]} v_{i}^T v_{j}+\frac{1}{2\eta} \sum_{i\in \operatorname{nz}(w)} \|v_i - v_{i,t} \|_2^2}_{\hat{Q}} + \underbrace{\frac{1}{2\eta}\sum_{i \notin \operatorname{nz}(w)} \|v_i-v_{i,t}\|_2^2}_{R}.\]Looking at \(R\), clearly the minimizer must satisfy \(v_i^* = v_{i,t}\) for all \(i \notin \operatorname{nz}(w)\), and consequently \(R\) must be zero at optimum, independent of \(z\). Hence, we have:
\[q_2(z)=\min_{v_{\operatorname{nz}(w)}} \hat{Q}(v_{\operatorname{nz}(w)}, z),\]where \(v_{\operatorname{nz}(w)}\) is the set of the vectors \(v_i\) for \(i \in \operatorname{nz}(w)\). Since \(\hat{Q}\) is a quadratic function which we made sure is strictly convex, we can find our optimal \(v_{\operatorname{nz}(w)}^*\) by solving the linear system obtained by equating the gradient of \(\hat{Q}\) with zero.
So let’s see what the gradient looks like. We have a function of several vector variables \(v_{\operatorname{nz}(w)}\), and we imagine that they are all stacked into one big vector. Consequently, the gradient of \(\hat{Q}\) is a stacked vector comprising the gradients w.r.t. each of the vectors. So, let’s compute the gradient w.r.t. each \(v_i\) and equate it with zero:
\[\nabla_{v_i} \hat{Q} = -z \hat{y} \sum_{\substack{j \in \operatorname{nz}(w)\\j\neq i}} v_{j}+\frac{1}{\eta} (v_{i} - v_{i,t})=0.\]By re-arranging and putting constants on the RHS we can re-write the above as
\[-\eta z \hat{y} \sum_{\substack{j \in \operatorname{nz}(w)\\j\neq i}} v_{j} + v_{i} = v_{i,t}.\]The above system means that we are actually solving linear systems with the same coefficients for each coordinate of the embedding vectors. Equivalently written, we can stack the vectors \(v_{\operatorname{nz}(w)}\) into the rows of the matrix \(V\), and the vectors \(v_{\operatorname{nz}(w),t}\) into the rows of the matrix \(V_t\), and solve the linear system
\[\underbrace{\begin{pmatrix} 1 & -\eta z \hat{y} & \cdots & -\eta z \hat{y} \\ -\eta z \hat{y} & 1 & \cdots & -\eta z \hat{y} \\ \vdots & \vdots & \ddots & \vdots \\ -\eta z \hat{y} & -\eta z \hat{y} & \cdots & 1 \end{pmatrix}}_{S(z)} V = V_t\]Note that the matrix \(S(z)\) is small, since its dimensions only depend on the number of non-zero elements in \(w\). So, now we have an efficient algorithm for computing \(q_2(z)\) given the sample \(w\) and the latent vectors from the previous iterate \(v_{1,t}, \dots, v_{m,t}\):
Algorithm B
- Embed the latent vectors \(\{v_{i,t}\}_{i \in \operatorname{nz}(w)}\) into the rows of the matrix \(V_t\).
- Obtain a solution \(V^*\) of the linear system of equations \(S(z) V = V_t\), and use the rows of \(V^*\) as the vectors \(\{v_{i}^*\}_{i \in \operatorname{nz}(w)}\).
- Output: \(q_2(z)=-z \hat{y} \sum_{(i,j) \in P[\operatorname{nz}(w)]} ({v_{i}^*}^T v_{j}^*)+\frac{1}{2\eta} \sum_{i\in \operatorname{nz}(w)} \|v_{i}^* - v_{i,t} \|_2^2\)
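As a sketch (not the implementation we’ll end up with), Algorithm B can be written with a generic linear solver. `q2_naive` below is a hypothetical NumPy helper whose input matrix holds the latent vectors \(v_{i,t}\), \(i \in \operatorname{nz}(w)\), in its rows; it assumes \(\eta\) satisfies the convexity bound derived earlier, so that the solve recovers the unique minimizer:

```python
import numpy as np

def q2_naive(V_t, eta, y_hat, z):
    # build S(z): ones on the diagonal, -eta*z*y_hat off the diagonal
    n = V_t.shape[0]  # |nz(w)|
    S = np.eye(n) - eta * z * y_hat * (np.ones((n, n)) - np.eye(n))
    V_opt = np.linalg.solve(S, V_t)  # rows are the optimal latent vectors
    # pairwise inner-product sum via the pow-of-sum minus sum-of-pow trick
    row_sums = V_opt.sum(axis=0)
    pairwise = 0.5 * (row_sums @ row_sums - (V_opt * V_opt).sum())
    return -z * y_hat * pairwise + ((V_opt - V_t) ** 2).sum() / (2 * eta)
```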
However, we can avoid invoking a linear solver altogether, since it turns out that \(S(z)^{-1}\) can be computed directly and efficiently! The matrix \(S(z)\) can be written as:
\[S(z) = (1 + \eta z \hat{y}) I - \eta z \hat{y}(\mathbf{e} ~ \mathbf{e}^T)\]where \(\mathbf{e} \in \mathbb{R}^{\vert\operatorname{nz}(w)\vert}\) is a column vector whose components are all \(1\). Now, we’ll employ the Sherman-Morrison matrix inversion identity:
\[(A+u v^T)^{-1} = A^{-1} - \frac{A^{-1} u v^T A^{-1}}{1 + v^T A^{-1} u}.\]In our case, we’ll be taking \(A = (1 + \eta \hat{y} z) I\), \(u=-\eta \hat{y} z \mathbf{e}\), and \(v = \mathbf{e}\), and consequently we have:
\[S(z)^{-1} = \frac{1}{1 + \eta \hat{y} z} I + \frac{\eta \hat{y} z}{(1 + \eta \hat{y} z)^2 - \eta \hat{y} z(1 + \eta \hat{y} z) \mathbf{e}^T \mathbf{e}} \mathbf{e}~\mathbf{e}^T\]Now, note that \(\mathbf{e}~\mathbf{e}^T = \unicode{x1D7D9}\) is a matrix whose components are all \(1\), and that \(\mathbf{e}^T \mathbf{e} = \vert\operatorname{nz}(w)\vert\) by construction. Thus:
\[\begin{aligned} S(z)^{-1} &= \frac{1}{1 + \eta \hat{y} z} I + \frac{\eta \hat{y} z}{(1 + \eta \hat{y} z)^2 - \eta \hat{y} z(1 + \eta \hat{y} z) |\operatorname{nz}(w)|} \unicode{x1D7D9} \\ &= I - \frac{\eta \hat{y} z}{1+\eta \hat{y} z} I + \frac{\eta \hat{y} z}{(1 + \eta \hat{y} z)^2 - \eta \hat{y} z(1 + \eta \hat{y} z) |\operatorname{nz}(w)|} \unicode{x1D7D9} \\ &= I - \frac{\eta \hat{y} z}{1+\eta \hat{y} z} \left[ I - \frac{1}{1+\eta \hat{y} z (1- |\operatorname{nz}(w)| )} \unicode{x1D7D9} \right] \end{aligned}\]So the solution of the linear system \(S(z)V = V_t\) is:
\[V^*=S(z)^{-1} V_t = V_t - \frac{\eta \hat{y} z}{1+\eta \hat{y} z} \underbrace{ \left[ V_t - \frac{1}{1+\eta \hat{y} z (1- |\operatorname{nz}(w)| )} \unicode{x1D7D9} V_t \right]}_{(*)} \tag{C}\]Finally, we note that the matrix \(\unicode{x1D7D9} V_t\) is the matrix obtained by computing the sum of the rows of \(V_t\) and replicating the result \(\vert \operatorname{nz}(w)\vert\) times, so we don’t even need to invoke any matrix multiplication function at all!
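Formula (C) is easy to get wrong, so here is a quick numerical check, with arbitrary dimensions and toy numbers, that the matrix it produces indeed satisfies \(S(z) V^* = V_t\):

```python
import numpy as np

rng = np.random.default_rng(0)
eta, y_hat, z, nnz, k = 0.04, -1.0, 0.7, 5, 3
V_t = rng.normal(size=(nnz, k))
beta = eta * y_hat * z  # the recurring factor eta * y_hat * z

# S(z) = (1 + beta) I - beta * ones
S = (1 + beta) * np.eye(nnz) - beta * np.ones((nnz, nnz))

# formula (C): no solve and no matrix product - only row sums and broadcasting
row_sums = V_t.sum(axis=0, keepdims=True)  # plays the role of the all-ones matrix times V_t
V_opt = V_t - (beta / (1 + beta)) * (V_t - row_sums / (1 + beta * (1 - nnz)))

assert np.allclose(S @ V_opt, V_t)
```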
So, to summarize, we have Algorithm B above to compute \(q_2(z)\), where the solution of the linear system is obtained via formula (C) above. Moreover, formula (C) is used to update the latent vectors once the optimal \(z\) is found. Let’s implement the above:
# inside ProxPtFMTrainer
def step(self, w_nz, y_hat):
self.nnz = w_nz.numel() # |nz(w)|
self.bias_sum = self.bs[w_nz].sum().item() # w^T b_t
self.vs_nz = self.vs.weight[w_nz, :] # the matrix V_t
self.ones_times_vs_nnz = self.vs_nz.sum(dim=0, keepdim=True) # the sums of the rows of V_t
# TODO - this function will grow as we proceed
def q_two(self, y_hat, z):
if z == 0:
return 0
# solve the linear system - find the optimal vectors
vs_opt = self.solve_s_inv_system(y_hat, z)
# compute q_2
pairwise = (vs_opt.sum(dim=0).square().sum() - vs_opt.square().sum()) / 2 # the pow-of-sum - sum-of-pow trick
diff_squared = (vs_opt - self.vs_nz).square().sum()
return (-z * y_hat * pairwise + diff_squared / (2 * self.step_size)).item()
def update_vectors(self, w_nz, y_hat, z):  # use equation (C) to update the latent vectors
    if z == 0:
        return
    # advanced indexing returns a copy, so an in-place sub_() would modify
    # the copy only - assign the updated rows back instead
    self.vs.weight[w_nz, :] = self.vs.weight[w_nz, :] - self.vectors_update_dir(y_hat, z)
def solve_s_inv_system(self, y_hat, z):
return self.vs_nz - self.vectors_update_dir(y_hat, z)
def vectors_update_dir(self, y_hat, z): # marked with (*) in equation (C)
beta = self.step_size * y_hat * z
alpha = beta / (1 + beta)
return alpha * (self.vs_nz - self.ones_times_vs_nnz / (1 + beta * (1 - self.nnz)))
We need one last ingredient - a way to maximize \(q\) and compute the optimal \(z\).
Recall that
\[q(z) = q_1(z) + q_2(z) - z\ln(z) - (1-z)\ln(1-z).\]Now, consider two important properties of \(q\):
- \(q\) is strictly concave: it is the sum of the strictly concave parabola \(q_1\), the function \(q_2\), which is concave as a minimum of functions concave in \(z\), and the strictly concave binary entropy term.
- The domain of \(q\) is the interval \([0, 1]\), since the binary entropy term is defined only there.
So, if it has a maximizer, it must be unique, and must lie in the interval \([0,1]\). So, does it have a maximizer? Well, it does! Any concave function is continuous, and by the well-known Weierstrass theorem, any continuous function on a compact interval has a maximizer. What we have is a continuous function with a unique maximizer in a bounded interval, and that’s the classical setup for a well-known algorithm for one-dimensional maximization - the Golden Section Search method. For completeness, I copied the code from the above Wikipedia page:
"""Python program for golden section search. This implementation
reuses function evaluations, saving 1/2 of the evaluations per
iteration, and returns a bounding interval.
Source: https://en.wikipedia.org/wiki/Golden-section_search#Iterative_algorithm
"""
import math
invphi = (math.sqrt(5) - 1) / 2 # 1 / phi
invphi2 = (3 - math.sqrt(5)) / 2 # 1 / phi^2
def gss(f, a, b, tol=1e-8):
"""Golden-section search.
Given a function f with a single local minimum in
the interval [a,b], gss returns a subset interval
[c,d] that contains the minimum with d-c <= tol.
Example:
>>> f = lambda x: (x-2)**2
>>> a = 1
>>> b = 5
>>> tol = 1e-5
>>> (c,d) = gss(f, a, b, tol)
>>> print(c, d)
1.9999959837979107 2.0000050911830893
"""
(a, b) = (min(a, b), max(a, b))
h = b - a
if h <= tol:
return (a, b)
# Required steps to achieve tolerance
n = int(math.ceil(math.log(tol / h) / math.log(invphi)))
c = a + invphi2 * h
d = a + invphi * h
yc = f(c)
yd = f(d)
for k in range(n-1):
if yc < yd:
b = d
d = c
yd = yc
h = invphi * h
c = a + invphi2 * h
yc = f(c)
else:
a = c
c = d
yc = yd
h = invphi * h
d = a + invphi * h
yd = f(d)
if yc < yd:
return (a, d)
else:
return (c, b)
Having all the ingredients, we can finalize the implementation of the optimizer’s step method:
def neg_entr(z):
if z > 0:
return z * math.log(z)
else:
return 0
def loss_conjugate(z):
return neg_entr(z) + neg_entr(1 - z)
class ProxPtFMTrainer:
def step(self, w_nz, y_hat):
self.nnz = w_nz.numel()
self.bias_sum = self.bs[w_nz].sum().item()
self.vs_nz = self.vs.weight[w_nz, :]
self.ones_times_vs_nnz = self.vs_nz.sum(dim=0, keepdim=True)
        def q_neg(z):  # neg. of the maximization objective - since gss minimizes functions.
return -(self.q_one(y_hat, z) + self.q_two(y_hat, z) - loss_conjugate(z))
        opt_interval = gss(q_neg, 0, 1)
z_opt = sum(opt_interval) / 2
self.update_biases(w_nz, y_hat, z_opt)
self.update_vectors(w_nz, y_hat, z_opt)
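As a final sanity check of the foundation everything rests on, the conjugacy identity \(\ln(1+\exp(t)) = \max_{z \in [0,1]} \{tz - z\ln(z) - (1-z)\ln(1-z)\}\) can be verified by brute force. The helpers from above are repeated so the snippet stands alone:

```python
import math

def neg_entr(z):
    return z * math.log(z) if z > 0 else 0.0

def loss_conjugate(z):
    return neg_entr(z) + neg_entr(1 - z)

for t in (-3.0, -0.5, 0.0, 1.7, 4.0):
    # brute-force maximization over a fine grid of z in [0, 1]
    best = max(t * z - loss_conjugate(z) for z in (i / 10000.0 for i in range(10001)))
    assert abs(best - math.log(1.0 + math.exp(t))) < 1e-4
```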
Since the purpose of this post is “academic” in nature, i.e. to show the limits of what is possible with the proximal point approach rather than to write a production-ready training algorithm, we did not take the time to make it efficient, and thus we’ll test it on a toy dataset - MovieLens 100k. The dataset consists of ratings on a 1 to 5 scale that users gave to 1682 movies. For users, we use their integer age, gender, and occupation as features. For the movies, we use the genre and the movie id as features. A rating of 5 is considered positive, while lower ratings are considered negative.
For clarity, in the post itself we’ll skip the data loading code, and assume that the features are already given in the W_train tensor, whose rows are the vectors \(w_i\), and the corresponding labels are given in the y_train tensor. The full code is available in the simple_train_loop.py file in the repo. Let’s train our model using the maximal allowed step-size for ten epochs, using a factorization machine of embedding dimension \(k=20\):
from tqdm import tqdm
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
# MISSING - the code which loads the dataset and builds the tensors W_train and y_train
num_features = W_train.size(1)
max_nnz = W_train.sum(dim=1).max().item()
step_size = 1. / (2*max_nnz + 1)
print(f'Training with step_size={step_size:.4} computed using max_nnz = {max_nnz}')
embedding_dim = 20
fm = FM(num_features, embedding_dim)
dataset = TensorDataset(W_train, y_train)
trainer = ProxPtFMTrainer(fm, step_size)
for epoch in range(10):
sum_epoch_loss = 0.
sum_pred = 0.
sum_label = 0.
desc = f'Epoch = {epoch}, loss = 0, pred = 0, label = 0, bias = 0'
with tqdm(DataLoader(dataset, batch_size=1, shuffle=True), desc=desc) as pbar:
def report_progress(idx):
avg_epoch_loss = sum_epoch_loss / (idx + 1)
avg_pred = sum_pred / (idx + 1)
avg_label = sum_label / (idx + 1)
desc = f'Epoch = {epoch:}, loss = {avg_epoch_loss:.4}, pred = {avg_pred:.4}, ' \
f'label = {avg_label:.4}, bias = {fm.bias.item():.4}'
pbar.set_description(desc)
for i, (x_sample, y_sample) in enumerate(pbar):
(ignore, w_nz) = torch.nonzero(x_sample, as_tuple=True)
y = y_sample.squeeze(1)
with torch.no_grad():
# aggregate loss and prediction per epoch, so that we can monitor convergence
pred = fm.forward(w_nz.unsqueeze(0))
loss = F.binary_cross_entropy_with_logits(pred, y)
sum_epoch_loss += loss.item()
sum_pred += torch.sigmoid(pred).item()
sum_label += y.item()
# train the model
y_hat = (2 * y.item() - 1) # transform 0/1 labels into -1/1
trainer.step(w_nz, y_hat)
if (i > 0) and (i % 2000 == 0):
report_progress(i)
report_progress(i)
That’s what I got:
Training with step_size=0.04348 computed using max_nnz = 11.0
Epoch = 0, loss = 0.4695, pred = 0.2118, label = 0.2124, bias = -1.148: 100%|██████████| 99831/99831 [11:36<00:00, 143.37it/s]
Epoch = 1, loss = 0.4362, pred = 0.2114, label = 0.2121, bias = -1.468: 100%|██████████| 99831/99831 [11:34<00:00, 143.80it/s]
Epoch = 2, loss = 0.427, pred = 0.2115, label = 0.2122, bias = -1.294: 100%|██████████| 99831/99831 [11:20<00:00, 146.62it/s]
Epoch = 3, loss = 0.4224, pred = 0.2117, label = 0.2123, bias = -1.254: 100%|██████████| 99831/99831 [10:30<00:00, 158.33it/s]
Epoch = 4, loss = 0.4194, pred = 0.2114, label = 0.212, bias = -1.419: 100%|██████████| 99831/99831 [10:00<00:00, 166.12it/s]
Epoch = 5, loss = 0.4173, pred = 0.2112, label = 0.2117, bias = -1.301: 100%|██████████| 99831/99831 [09:48<00:00, 169.73it/s]
Epoch = 6, loss = 0.4167, pred = 0.2117, label = 0.2121, bias = -1.368: 100%|██████████| 99831/99831 [09:49<00:00, 169.40it/s]
Epoch = 7, loss = 0.4155, pred = 0.2115, label = 0.2119, bias = -1.467: 100%|██████████| 99831/99831 [09:51<00:00, 168.81it/s]
Epoch = 8, loss = 0.4145, pred = 0.2114, label = 0.2118, bias = -1.605: 100%|██████████| 99831/99831 [09:47<00:00, 169.81it/s]
Epoch = 9, loss = 0.4146, pred = 0.2121, label = 0.2125, bias = -1.365: 100%|██████████| 99831/99831 [09:47<00:00, 169.85it/s]
Seems that the loss is indeed being minimized. Let’s compare it with the Adam optimizer with default parameters. Here is the training loop:
optimizer = torch.optim.Adam(fm.parameters())
for epoch in range(10):
sum_epoch_loss = 0.
sum_pred = 0.
sum_label = 0.
desc = f'Epoch = {epoch}, loss = 0, pred = 0, label = 0, bias = 0'
with tqdm(DataLoader(dataset, batch_size=1, shuffle=True), desc=desc) as pbar:
def update_progress(idx):
avg_epoch_loss = sum_epoch_loss / (idx + 1)
avg_pred = sum_pred / (idx + 1)
avg_label = sum_label / (idx + 1)
desc = f'Epoch = {epoch}, loss = {avg_epoch_loss:.4}, pred = {avg_pred:.4}, ' \
f'label = {avg_label:.4}, bias = {fm.bias.item():.4}'
pbar.set_description(desc)
for i, (x_sample, y_sample) in enumerate(pbar):
(ignore, w_nz) = torch.nonzero(x_sample, as_tuple=True)
y = y_sample.squeeze(1)
optimizer.zero_grad()
pred = fm.forward(w_nz.unsqueeze(0))
loss = F.binary_cross_entropy_with_logits(pred, y)
loss.backward()
optimizer.step()
with torch.no_grad():
sum_epoch_loss += loss.item()
sum_pred += torch.sigmoid(pred).item()
sum_label += y.item()
if (i > 0) and (i % 2000 == 0):
update_progress(i)
update_progress(i)
And here is the result:
Epoch = 0, loss = 0.4655, pred = 0.21, label = 0.212, bias = 0.539: 100%|██████████| 99831/99831 [02:47<00:00, 596.25it/s]
Epoch = 1, loss = 0.4596, pred = 0.208, label = 0.212, bias = 1.586: 100%|██████████| 99831/99831 [03:09<00:00, 527.90it/s]
Epoch = 2, loss = 0.4655, pred = 0.2075, label = 0.2118, bias = 2.668: 100%|██████████| 99831/99831 [02:59<00:00, 556.33it/s]
Epoch = 3, loss = 0.471, pred = 0.2078, label = 0.2122, bias = 3.805: 100%|██████████| 99831/99831 [02:50<00:00, 585.09it/s]
Epoch = 4, loss = 0.4744, pred = 0.2071, label = 0.2119, bias = 5.116: 100%|██████████| 99831/99831 [02:42<00:00, 615.88it/s]
Epoch = 5, loss = 0.4747, pred = 0.2071, label = 0.212, bias = 6.48: 100%|██████████| 99831/99831 [02:55<00:00, 569.75it/s]
Epoch = 6, loss = 0.4777, pred = 0.2064, label = 0.2119, bias = 7.992: 100%|██████████| 99831/99831 [02:56<00:00, 567.10it/s]
Epoch = 7, loss = 0.4793, pred = 0.2071, label = 0.2121, bias = 9.433: 100%|██████████| 99831/99831 [02:47<00:00, 595.92it/s]
Epoch = 8, loss = 0.4802, pred = 0.2062, label = 0.212, bias = 11.15: 100%|██████████| 99831/99831 [02:43<00:00, 610.91it/s]
Epoch = 9, loss = 0.4824, pred = 0.2066, label = 0.212, bias = 12.72: 100%|██████████| 99831/99831 [02:44<00:00, 605.32it/s]
Whoa! It isn’t converging! The loss grows after a few epochs, and we can see that the bias keeps increasing. Seems like our efforts are paying off - a custom method with a deeper step-size analysis let us just ‘hit’ a good-enough step-size without any tuning, while with Adam we’ll probably have to do some tuning to find a good step-size.
Let’s now do a more thorough stability comparison - run our method, Adam, Adagrad, and SGD with various step-size parameters, and see what loss we are getting. Each method ran with several step-sizes for \(M=20\) epochs, and each step-size was tested \(N=20\) times to take into account the effect of randomness in the weight initialization and the data shuffling. Then, I produced a plot showing the best loss achieved for each step-size and each algorithm, averaged over the \(N\) attempts, with transparent uncertainty bands. The code resides in the stability_experiment.py file in the repo. Here is the result:
It’s quite apparent that the performance of the proximal point algorithm is quite consistent over the various step-size choices. We also see that Adam’s performance degrades when the step-size is too large. Consequently, to see the difference between the various algorithms more clearly, let’s plot the results without Adam:
Well, as we see, the proximal point’s performance is the most consistent across various step-sizes, but it is certainly not the best algorithm for training a factorization machine on this dataset. It appears that Adagrad is.
One possible explanation is that the proximal point algorithm converges more slowly, and requires more epochs to achieve good performance. Let’s test this hypothesis by running the proximal point algorithm for 50 epochs. After a few days, I got:
The situation doesn’t seem to improve much. The method is quite consistent in its performance, but it doesn’t seem to converge rapidly to an optimum.
We have developed an efficiently implementable proximal point step for a highly non-trivial and non-convex problem, and provided an implementation. To the best of my knowledge, this post sets foot in uncharted territory, and thus I am not sure what the method is converging to, but from these numerical experiments it doesn’t seem to minimize the average loss. It is my hope that the research community can provide such answers.
Writing this entire series about efficient implementation of incremental proximal point methods has been extremely fun, and I certainly learned a lot about Python and PyTorch, and better understood the essence of these methods. I hope that you, the readers, enjoyed it as much as I did. It’s time for new adventures! I don’t know what the next post will be about, but I’m sure it will be fun!
Steffen Rendle (2010), Factorization Machines, IEEE International Conference on Data Mining (pp. 995-1000) ↩
It follows from the fact that \(h(t)=\ln(1+\exp(t))\) and \(h^*(s)=s\ln(s)+(1-s)\ln(1-s)\) are both Legendre-type functions: essentially smooth and strictly convex. ↩
Separability is the fact that \(\displaystyle \min_{x_1,x_2} f(x_1) + g(x_2) = \min_{x_1} f(x_1) + \min_{x_2} g(x_2)\). ↩
We continue our endeavor of extending the reach of the efficiently implementable stochastic proximal point method to the mini-batch setting:
\[x_{t+1} = \operatorname*{argmin}_x \Biggl \{ \frac{1}{|B|} \sum_{i \in B} f_i(x) + \frac{1}{2\eta} \|x - x_t\|_2^2 \Biggr \}.\]Last time we discussed the implementation for convex-on-linear losses, which include linear least squares and linear logistic regression. Continuing the same journey we already went through before reaching the mini-batch setting, this time we add regularization, and consider losses of the form:
\[f_i(x)=\phi(a_i^T x + b_i) + r(x),\]where the regularizer \(r\) is the same for all training samples, and \(\phi\) is a scalar convex function. In that case, the method becomes:
\[x_{t+1} = \operatorname*{argmin}_x \Biggl \{ \frac{1}{|B|} \sum_{i \in B} \phi(a_i^T x + b_i) + r(x) + \frac{1}{2\eta} \|x - x_t\|_2^2 \Biggr \}.\]Our aim in this post is to derive an efficient implementation, in Python, of the above computational step.
We again employ duality in an attempt to make the problem of computing \(x_{t+1}\) tractable. We replace that problem with its equivalent constrained variant:
\[\operatorname*{minimize}_{x,z} \quad \frac{1}{|B|} \sum_{i\in B} \phi(z_i) + r(x) + \frac{1}{2\eta}\|x - x_t\|_2^2 \quad \operatorname{\text{subject to}} \quad z_i = a_i^T x + b_i.\]We will not get into the tedious details, but after embedding the vectors \(a_i\) into the rows of the batch matrix \(A_B\) and performing some mathematical manipulations, the dual function \(q(s)\) turns out to be:
\[q(s)=\color{magenta}{-\frac{1}{|B|}\sum_{i \in B} \phi^*(|B| s_i) + \underbrace{\min_x \left\{ r(x) + \frac{1}{2\eta} \|x - (x_t - \eta A_B^T s)\|_2^2\right\}}_{\text{Moreau envelope}}} + (A_B x_t + b_B)^T s - \frac{\eta}{2} \|A_B^Ts\|_2^2.\]And here we arrive at our problem, which appears in the magenta-colored part. The function \(-\phi^*\) is always concave, while the Moreau envelope is always convex. The sum of such functions may, in general, be neither convex nor concave. So, although we know from duality theory that the function \(q(s)\) is concave, by separating it into these two components we cannot convey its concavity to a generic convex optimization solver such as CVX - these solvers require that we write functions according to an explicit set of rules which convey the function's curvature as either convex or concave. One such rule is: we may add convex functions to convex functions, and concave functions to concave functions. But we cannot mix both, for reasons which go beyond the scope of this blog post.
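For some regularizers the Moreau envelope appearing in the magenta part is available in closed form. For instance, with \(r(x)=\lambda\|x\|_1\) the inner minimizer is the soft-thresholding operator. A small sketch to make this tangible (the function name is mine):

```python
import torch

def moreau_envelope_l1(v, eta, lam):
    """min_x { lam * ||x||_1 + ||x - v||^2 / (2 eta) }, evaluated via its
    closed-form minimizer: soft-thresholding of v at level eta * lam."""
    x = torch.sign(v) * torch.clamp(v.abs() - eta * lam, min=0.0)
    return lam * x.abs().sum() + (x - v).pow(2).sum() / (2 * eta)
```

Note that this is a convex function of \(v\) - exactly the convexity that clashes with the concave \(-\phi^*\) term when we try to express \(q(s)\) in a solver-friendly form.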
The conclusion is that the same duality trick which served us well before - transforming a high-dimensional problem over \(x\) into a low-dimensional problem over the dual variables \(s\) - cannot serve us now. Can we do something else? It turns out we can, but first let's explore another extension of duality - inequality constraints.
We aim to compute
\[x_{t+1} = \operatorname*{argmin}_x \Biggl \{ \frac{1}{|B|} \sum_{i \in B} \phi(a_i^T x + b_i) + r(x) + \frac{1}{2\eta} \|x - x_t\|_2^2 \Biggr \}.\]Let's rephrase the optimization problem in a slightly different manner, by introducing two auxiliary variables - the vectors \(z\) and \(w\) - and by embedding the vectors \(a_i\) into the rows of the batch matrix \(A_B\):
\[\operatorname*{minimize}_{x,z,w} \quad \frac{1}{|B|} \sum_{i\in B} \phi(z_i) + r(w) + \frac{1}{2\eta} \|x - x_t\|_2^2 \quad \operatorname{\text{subject to}} \quad z = A_B x + b, \ \ x = w\]Let’s construct a dual by assigning prices \(\mu\) and \(\nu\) to the violation of each set of constraints, and separating the minimization over each variable:
\[\begin{aligned} q(\mu, \nu) &= \inf_{x,z,w} \left\{\frac{1}{|B|} \sum_{i\in B} \phi(z_i) + r(w) + \frac{1}{2\eta} \|x - x_t\|_2^2 + \mu^T(A_B x + b - z) + \nu^T (x - w) \right\} \\ &= \color{blue}{\inf_x \left\{ \frac{1}{2\eta} \|x - x_t\|_2^2 + (A_B^T \mu + \nu)^T x \right\}} + \color{purple}{\inf_z \left\{ \frac{1}{|B|} \sum_{i\in B} \phi(z_i) - \mu^T z \right\}} + \color{green}{\inf_w \left\{ r(w)- \nu^T w \right\}} + \mu^T b. \end{aligned}\]We’ve already encountered the purple part in a previous post - it can be written in terms of the convex conjugate \(\phi^*\):
\[\text{purple} = -\frac{1}{|B|}\sum_{i \in B} \phi^*(|B| \mu_i)\]The green part is also straightforward - it is exactly \(-r^*(\nu)\), where \(r^*\) is the convex conjugate of the regularizer \(r\). Finally, the blue part, despite being somewhat cumbersome, is a simple quadratic minimization problem over \(x\), so let's solve it by equating the gradient of the term inside the \(\inf\) with zero:
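The identity \(\text{green}=-r^*(\nu)\) is easy to sanity-check numerically: \(r^*(\nu)=\sup_x\{\nu^T x - r(x)\}\) can be estimated by gradient ascent and compared against a known conjugate, e.g. \(r(x)=\frac{\lambda}{2}\|x\|_2^2\), whose conjugate is \(r^*(\nu)=\frac{1}{2\lambda}\|\nu\|_2^2\). A sketch, with names of my own choosing:

```python
import torch

def conjugate_numeric(nu, r, steps=2000, lr=0.1):
    """Estimate r*(nu) = sup_x { nu^T x - r(x) } by gradient ascent;
    adequate for smooth, strongly convex r. Illustration only."""
    x = torch.zeros_like(nu, requires_grad=True)
    opt = torch.optim.SGD([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        (r(x) - nu @ x).backward()  # descent on the negation = ascent on nu^T x - r(x)
        opt.step()
    with torch.no_grad():
        return (nu @ x - r(x)).item()
```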
\[\frac{1}{\eta}(x - x_t) + A_B^T \mu + \nu = 0.\]By re-arranging, we obtain that the equation is solved at \(x = x_t - \eta(A_B^T \mu + \nu)\). Recall that if strong duality holds, this is exactly the rule for computing the optimal \(x\) from the optimal \((\mu, \nu)\) pair. Substituting the above \(x\) into the blue term and performing some algebraic manipulations, we obtain:
\[\text{blue} = \frac{1}{2\eta} \left\| \color{brown}{x_t - \eta (A_B^T \mu + \nu)} - x_t \right\|_2^2 + (A_B^T \mu + \nu)^T [\color{brown}{x_t - \eta (A_B^T \mu + \nu)}] = - \frac{\eta}{2} \| A_B^T \mu + \nu \|_2^2 + (A_B x_t)^T \mu + x_t^T \nu\]Summarizing everything, we have:
\[q(\mu, \nu) = \color{blue}{- \frac{\eta}{2} \| A_B^T \mu + \nu \|_2^2 + (A_B x_t)^T \mu + x_t^T \nu} \color{purple}{-\frac{1}{|B|}\sum_{i \in B} \phi^*(|B| \mu_i)} \color{green}{-r^*(\nu)} + b^T \mu.\]The blue part is a concave quadratic, and \(-\phi^*\) and \(-r^*\) are both concave. Well, it seems we've done it, haven't we? Not quite! Recall that the dimension of \(\nu\) is the same as the dimension of \(x\), so we haven't reduced the problem's dimension at all! If we have a huge model parameter vector \(x\), we'll have a huge dual variable \(\nu\).
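To make the formula concrete, here is how \(q(\mu, \nu)\) can be evaluated for one specific instance: \(\phi(t)=t^2/2\), whose conjugate is \(\phi^*(s)=s^2/2\), and \(r(x)=\frac{\lambda}{2}\|x\|_2^2\), whose conjugate is \(r^*(\nu)=\frac{1}{2\lambda}\|\nu\|_2^2\). The coupling term \(A_B^T\mu+\nu\) comes from grouping the \(x\)-terms of the Lagrangian. Function names are mine:

```python
import torch

def dual_q(mu, nu, A_B, b_B, x_t, eta, lam):
    """q(mu, nu) for phi(t) = t^2/2 and r(x) = (lam/2) ||x||^2.
    One concrete instance of the dual above - not a general routine."""
    m = A_B.shape[0]
    u = A_B.T @ mu + nu                          # x-coefficients of the Lagrangian
    blue = -0.5 * eta * u.pow(2).sum() + (A_B @ x_t) @ mu + x_t @ nu
    purple = -(0.5 / m) * (m * mu).pow(2).sum()  # -(1/|B|) sum_i phi*(|B| mu_i)
    green = -nu.pow(2).sum() / (2 * lam)         # -r*(nu)
    return blue + purple + green + b_B @ mu
```

By weak duality, \(q\) never exceeds the primal optimal value, which is a useful sanity check.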
The above two failed attempts to come up with an efficient algorithm for computing \(x_{t+1}\) for mini-batches of regularized convex-on-linear losses might make us wonder - is it even possible to implement the method in this setting? Well, it turns out that if we insist on using an off-the-shelf optimization solver, it might be very hard. But if we are willing to write our own, it is possible.
Indeed, consider the dual problem we derived in take 1. We know that the dual function being maximized is concave. If we can compute its gradient, we can write our own fast gradient method, such as FISTA^{1} or Nesterov's accelerated gradient method^{2}, to solve it. Note that I am not referring to a stochastic optimization algorithm for training a model, but to a fully deterministic one for solving a simple optimization problem. Such methods can be quite fast. Furthermore, if we can compute the Hessian matrix of \(q\), we can employ Newton's method^{3} and solve the dual problem even faster, in a matter of a few milliseconds. However, writing convex optimization solvers is beyond the scope of this post.
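As a toy illustration of hand-rolling such a solver, the sketch below applies plain gradient ascent (via autograd) to the take-2 dual \(q(\mu, \nu)\), specialized to \(\phi(t)=t^2/2\) and \(r(x)=\frac{\lambda}{2}\|x\|_2^2\) so that both conjugates are explicit; a fast gradient or Newton variant would converge faster, but the structure is the same. All names are mine:

```python
import torch

def prox_step_via_dual(A_B, b_B, x_t, eta, lam, steps=3000, lr=0.01):
    """Maximize q(mu, nu) for phi(t) = t^2/2, r(x) = (lam/2) ||x||^2 by
    gradient ascent, then recover x from the stationarity condition of
    the Lagrangian's x-part. A sketch, not a production solver."""
    m = A_B.shape[0]
    mu = torch.zeros(m, requires_grad=True)
    nu = torch.zeros_like(x_t, requires_grad=True)
    opt = torch.optim.SGD([mu, nu], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        u = A_B.T @ mu + nu
        q = (-0.5 * eta * u.pow(2).sum() + (A_B @ x_t) @ mu + x_t @ nu
             - (0.5 / m) * (m * mu).pow(2).sum()   # -(1/|B|) sum phi*(|B| mu_i)
             - nu.pow(2).sum() / (2 * lam)         # -r*(nu)
             + b_B @ mu)
        (-q).backward()                            # ascent on q = descent on -q
        opt.step()
    with torch.no_grad():
        return x_t - eta * (A_B.T @ mu + nu)       # primal recovery from (mu, nu)
```

With strong duality, the maximizing pair \((\mu, \nu)\) yields the primal step through the same stationarity condition used to eliminate \(x\) from the Lagrangian.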
The entire post series was devoted to deriving efficient implementations of the proximal point methods to various generic problem classes. Contrary to the above, I would like to devote the next, and last blog post of this series to implementing the method on a specific, but interesting problem. Stay tuned!
Beck A. & Teboulle M. (2009) A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1), 183-202. ↩
Nesterov Y. (1983) A method for solving the convex programming problem with convergence rate O(1/k^2). Dokl. Akad. Nauk SSSR 269, 543-547 ↩
https://en.wikipedia.org/wiki/Newton%27s_method_in_optimization ↩