Alex Shtoff

Shape restricted function models

2024-10-14T00:00:00+00:00

Intro

Occasionally in practice we aim to train models that represent a function of restricted shape, when viewed as a function of one of the features. Formally, we are referring to fitting a function $f(\mathbf{x}, z)$, that is monotone, bounded, convex, or concave in $z$ for every $\mathbf{x}$. The feature $z$ is special in our context - the model $f$ has a special shape as a function of $z$. Here are some examples:

$f(\mathbf{x}, z)$ models insurance premium given features of the policy and the insured person in $\mathbf{x}$, and the coverage in $z$. We would like $f$ to be nondecreasing in $z$ for every $\mathbf{x}$: larger coverage incurs a potentially larger insurance premium.
$f(\mathbf{x}, z)$ models the probability of winning an auction described by features $\mathbf{x}$ and bid $z$. Here, $f$ must be bounded between 0 and 1, since it’s a probability, and nondecreasing in $z$, since higher bids mean potentially higher chances of winning.
$f(\mathbf{x}, z)$ models utility of an investment of $z$ dollars in a project described by features $\mathbf{x}$. Here it’s reasonable that $f$ is nondecreasing and concave, to model ‘diminishing returns’.

There is a vast amount of literature on learning $f(z)$ with constraints on the shape of $f$ for various families, especially when $f$ is a polynomial. In fact, there’s an entire field of polynomial optimization devoted just to polynomial shape constraints. See this playlist of video lectures for a great introduction, or just search the web for the term ‘polynomial optimization.’ However, many of the ideas require specialized ‘acrobatics’ that are hard to implement in the commodity ML packages we all love: PyTorch and TensorFlow.

There is also the idea of Lattice Networks¹, and a nice TensorFlow library that implements them called TensorFlow Lattice. They are designed for modeling functions of the form $f(\mathbf{x}, \mathbf{z})$, where $\mathbf{z}$ is a vector comprised of several features for which we want to constrain the shape of $f$. They are more generic than the idea I present here, but are also more expensive. This post is about a scalar $z$, meaning that we have only one shape-constrained feature. This lets us do something interesting and specialized for this case.

The idea I present here is probably not new, even though I couldn’t find literature on it, probably because I didn’t know which buzzwords to look for. So if you know some prior work I could cite, please let me know!

As customary, the code is available in a notebook you can deploy to Google Colab and play around with. So let’s dive in!

Bernstein polynomials strike again

We already met Bernstein polynomials in our series on polynomial features. So let’s make a short recap of what we learned. Given a degree $n$, we define the polynomials:

\[b_{i,n}(x) = \binom{n}{i} x^i (1-x)^{n-i}.\]

We can see that each $b_{i,n}(x)$ is indeed a polynomial function of $x$ of degree $n$. Moreover, we learned in the series that any polynomial $p(x)$ of degree $n$ can be written as:

\[p(x) = \sum_{i=0}^n a_{i} b_{i,n}(x).\]

In other words, these polynomials are actually a basis for all polynomials of degree $n$. We also learned in this series that this basis is useful for fitting functions on the unit interval $[0, 1]$ with machine learned models without the polynomials going ‘crazy’ and ‘wiggly’ with simple regularization tricks. Finally, we learned that their coefficients give us direct control over the shape of $p(x)$, and in particular:

If $a_0 \leq a_1 \leq \dots \leq a_n$, then $p(x)$ is nondecreasing on $[0, 1]$.
If $a_0 \geq a_1 \geq \dots \geq a_n$, then $p(x)$ is nonincreasing on $[0, 1]$.
If $a_i \in [a, b]$, then $p(x) \in [a, b]$ for any $x \in [0, 1]$.

In other words, nondecreasing or nonincreasing coefficients yield a nondecreasing or nonincreasing polynomial, and imposing a bound on the coefficients imposes the corresponding bound on the polynomial.
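To make these properties tangible, here is a small plain-Python sketch (not part of the original notebook) that evaluates a polynomial in the Bernstein basis by definition on a grid, and checks both claims numerically. The coefficient values and the helper name `bernstein_poly` are made up for illustration.

```python
from math import comb

def bernstein_poly(coefs, x):
    """Evaluate sum_i coefs[i] * b_{i,n}(x) by definition, where n = len(coefs) - 1."""
    n = len(coefs) - 1
    return sum(a * comb(n, i) * x**i * (1 - x)**(n - i)
               for i, a in enumerate(coefs))

# hypothetical nondecreasing coefficients, all inside [-1, 2]
coefs = [-1.0, -0.5, -0.5, 0.3, 2.0]
grid = [k / 200 for k in range(201)]
values = [bernstein_poly(coefs, x) for x in grid]

# nondecreasing coefficients -> nondecreasing polynomial on [0, 1]
is_nondecreasing = all(v1 <= v2 + 1e-12 for v1, v2 in zip(values, values[1:]))
# coefficients in [a, b] -> polynomial values in [a, b] on [0, 1]
is_bounded = all(min(coefs) - 1e-12 <= v <= max(coefs) + 1e-12 for v in values)
print(is_nondecreasing, is_bounded)
```

A grid check like this is of course not a proof, just a quick sanity test of the two properties for one coefficient vector.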

So the basic idea is simple assuming $z \in [0, 1]$. Choose a polynomial degree $n$, feed $\mathbf{x}$ to an arbitrary model that produces the coefficients vector $\mathbf{a} = (a_0, \dots, a_n)$ having the desired monotonicity properties, and let the model’s output be the corresponding polynomial in the Bernstein basis. The basic flow is illustrated below:

Observe that we don’t really care what the model consuming $\mathbf{x}$ looks like. For all we care, $\mathbf{x}$ can be a free-form text with a description of an insurance policy, and the model consuming $\mathbf{x}$ is our super-duper state-of-the-art transformer that understands insurance policies and produces an embedding vector. But the embedding vector is not arbitrary - it’s a coefficient vector for Bernstein polynomials satisfying a desired shape property. Thus, in this example, the model will have to be fine-tuned for the task of producing the appropriate Bernstein coefficients.

The basic idea of learning a model to predict the coefficient vector of a function is not new. To the best of my knowledge, it dates back to the 1993 paper of Hastie and Tibshirani², and more papers applying the idea appeared over the years³⁴⁵. That’s why it’s a blog post, rather than a paper. This is one of those posts where I want to understand something by implementing it, and share my understanding and learning experience with the readers.

Before developing the basic idea into a more concrete framework, let’s recall one more interesting fact we learned in the series about Bernstein polynomials. The Bernstein coefficients control the polynomial locally, in the vicinity of the points on a grid, or a lattice. In this sense, we can think of this basic idea as an enhancement of one-dimensional lattice networks.

Implementing the framework in PyTorch

To implement this idea we need to take care of two details: what happens if $z$ is not in $[0, 1]$, and how do we generate Bernstein coefficients satisfying our desired properties. Then, we shall implement everything in PyTorch.

First, we discuss what happens if $z$ is not in $[0, 1]$. As machine learning practitioners we have a pretty standard solution - feature scaling. For example, if $z$ is assumed to be bounded, we can use simple min-max scaling. For a potentially unbounded, but non-negative feature, such as duration or money, we could scale using $\tanh$, $\arctan$, or an algebraic function such as:

\[\phi_a(z) = \frac{a}{a + z}\]

The choice of the scaling function is where our domain knowledge about $z$ is useful, and this is the “feature engineering” part of our idea. Because feature scaling is typically a part of the data preparation components of a machine learning pipeline, rather than the model, we assume here that our model takes an already scaled $z$.
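As a rough illustration of these options, here is a plain-Python sketch of the three scalings for a non-negative feature. The function names are hypothetical, and the $\arctan$ variant is multiplied by $2 / \pi$, which is my own normalization so that the range fits into $[0, 1]$. Note that $\phi_a$ as written above is decreasing in $z$, so a model that is nondecreasing in the scaled feature is nonincreasing in the original $z$. The variant $1 - \phi_a(z) = z / (a + z)$ in the sketch keeps the original order, and is my addition rather than something from the post.

```python
import math

def scale_tanh(z):
    return math.tanh(z)

def scale_arctan(z):
    # arctan maps [0, inf) to [0, pi / 2), so we normalize by 2 / pi
    return 2.0 / math.pi * math.atan(z)

def scale_algebraic(z, a=1.0):
    # phi_a(z) = a / (a + z): equals 1 at z = 0 and decreases towards 0
    return a / (a + z)

def scale_algebraic_flipped(z, a=1.0):
    # 1 - phi_a(z) = z / (a + z): equals 0 at z = 0 and increases towards 1
    return z / (a + z)

zs = [0.0, 0.5, 1.0, 10.0, 1000.0]
for name, fn in [("tanh", scale_tanh), ("arctan", scale_arctan),
                 ("a/(a+z)", scale_algebraic), ("z/(a+z)", scale_algebraic_flipped)]:
    print(name, [round(fn(z), 4) for z in zs])
```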

Now, let’s discuss ensuring that the ‘embedding vector’ $\mathbf{a}$ that our model produces has the right properties (monotonicity / boundedness). This can be achieved by stacking an additional ‘coefficient transform’ layer on top of an existing model. For example, if the last layer of a given model produces a vector $\mathbf{u} = (u_0, \dots, u_n)$, our ‘coefficient transform’ layer produces a nondecreasing $\mathbf{a}$ using $\mathrm{ReLU}$:

\[a_i = u_0 + \sum_{j=1}^i \mathrm{ReLU}(u_j),\]

or using $\mathrm{SoftPlus}$:

\[a_i = u_0 + \sum_{j=1}^i \mathrm{SoftPlus}(u_j).\]

Below is a $\mathrm{SoftPlus}$ based implementation:

import torch
from torch import nn

class NondecreasingCoefTransform(nn.Module):
    def forward(self, u):
        # We assume that `u` has mini-batch dimensions,
        # and the 'coefficient' dimension is the last one.
        u_head = u[... ,0:1]
        u_tail_relu = nn.functional.softplus(u[..., 1:])
        head_tail = torch.cat([u_head, u_tail_relu], dim=-1)
        return torch.cumsum(head_tail, dim=-1)

Let’s try it out:

u = torch.tensor([-5., 3., 2., -1., 3.])
print(NondecreasingCoefTransform()(u))

tensor([-5.0000, -1.9514,  0.1755,  0.4888,  3.5374])

Now we can stack a NondecreasingCoefTransform on top of an existing network, and obtain nondecreasing coefficients.
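For comparison, here is a minimal sketch of the $\mathrm{ReLU}$ based variant from the first formula above. The class name is hypothetical. One thing to keep in mind with this variant: whenever $u_j < 0$ the corresponding increment is exactly zero and passes no gradient to $u_j$, whereas $\mathrm{SoftPlus}$ increments are always strictly positive.

```python
import torch
from torch import nn

class ReluNondecreasingCoefTransform(nn.Module):  # hypothetical name
    def forward(self, u):
        # the 'coefficient' dimension is assumed to be the last one
        head = u[..., 0:1]
        increments = nn.functional.relu(u[..., 1:])  # non-negative steps
        return torch.cumsum(torch.cat([head, increments], dim=-1), dim=-1)

u = torch.tensor([-5., 3., 2., -1., 3.])
a = ReluNondecreasingCoefTransform()(u)
print(a)  # the coefficients are -5, -2, 0, 0, 3
```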

Now let’s proceed to implementing the idea in PyTorch. First, we need to compute the Bernstein basis with PyTorch, using vectorized functions that run well on both CPU and GPU. For simplicity, even though it may not be the ‘best’ way to do it, we shall compute the basis by definition.

It turns out that PyTorch does not have a built-in function to compute the binomial coefficient $\binom{n}{i}$, so let’s implement one. Implementing it directly may cause overflow, since the binomial coefficient is defined in terms of factorials. Moreover, we would like a vectorized implementation that can take many values of $i$ at once. It turns out PyTorch does have the right tools, but in logarithmic space, using the torch.lgamma function that implements the logarithm of the Gamma function. Recall, that the Gamma function generalizes the factorial, since for an integer $n$ we have:

\[\Gamma(n + 1) = n!\]

Therefore,

\[\ln\left(\binom{n}{k}\right) = \ln\left( \frac{n!}{k!(n-k)!} \right) = \ln(\Gamma(n+1)) - \ln(\Gamma(k+1)) - \ln(\Gamma(n - k + 1))\]

So the code for the binomial coefficient in log-space is:

import torch

def log_binom_coef(n: torch.Tensor, k: torch.Tensor):
  return (
      torch.lgamma(n + 1)
      - torch.lgamma(k + 1) 
      - torch.lgamma(n - k + 1)
  )

Let’s see that it works by printing $\binom{5}{i}$ for $i = 0, \dots, 5$:

n = torch.tensor(5)
k = torch.arange(6)
print(log_binom_coef(n, k).exp())

tensor([ 1.,  5., 10., 10.,  5.,  1.])
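To see why log-space matters, here is a plain-Python sketch using `math.lgamma` and `math.comb` from the standard library, rather than PyTorch. The helper name `log_binom` and the values $n = 2000$, $k = 1000$ are chosen just for illustration: the exact binomial coefficient is an integer that no longer fits into a double-precision float, while its logarithm is a perfectly ordinary number.

```python
import math

def log_binom(n, k):
    # same formula as above, using the standard library's log-Gamma
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

# small case: exponentiating the log-space result recovers the exact values
exact_small = [math.comb(5, k) for k in range(6)]
approx_small = [round(math.exp(log_binom(5, k))) for k in range(6)]
print(exact_small, approx_small)

# large case: the exact integer is too large to convert to a float
n, k = 2000, 1000
try:
    float(math.comb(n, k))
    overflowed = False
except OverflowError:
    overflowed = True

print(overflowed, log_binom(n, k))
```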

Appears just right. Now we can implement the Bernstein basis in a naive manner, by definition:

def bernstein_basis(degree: int, z: torch.Tensor):
  """
  Computes a matrix containing the Bernstein basis of a given degree, where
  each row corresponds to an entry in the input tensor `z`.
  """

  # entries of `z` in rows, and basis indices in columns  
  z = z.view(-1, 1) 
  ks = torch.arange(degree + 1, device=z.device).view(1, -1)

  # degree in a tensor to call log_binom_coef
  degree_tensor = torch.as_tensor(degree, device=z.device)

  # now we compute the Bernstein basis by definition
  binom_coef = torch.exp(log_binom_coef(degree_tensor, ks))
  return binom_coef * (z ** ks) * ((1 - z) ** (degree_tensor - ks))

As stated above, this is not the most numerically ‘right’ way to work with the Bernstein basis, and it would be wiser to use the well-known De Casteljau’s algorithm, which is both efficient and numerically stable. In fact, in production-quality code that’s what we should do. Maybe even implement a custom CUDA kernel to make it efficient on the GPU. But I chose to avoid adding more complexity by introducing yet another algorithm, and keep this post as straightforward as possible.
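For the curious, here is a minimal plain-Python sketch of De Casteljau’s algorithm for a single evaluation point, compared against evaluation by definition. It is only meant to show the recurrence, which repeatedly takes convex combinations of neighboring coefficients. It is not vectorized and is not the code used in this post, and the function names and coefficient values are hypothetical.

```python
from math import comb

def de_casteljau(coefs, t):
    """Evaluate a polynomial given by its Bernstein coefficients at t in [0, 1]."""
    beta = list(coefs)
    n = len(beta)
    for r in range(1, n):
        # each pass replaces neighbors by their convex combination
        for i in range(n - r):
            beta[i] = (1 - t) * beta[i] + t * beta[i + 1]
    return beta[0]

def by_definition(coefs, t):
    n = len(coefs) - 1
    return sum(a * comb(n, i) * t**i * (1 - t)**(n - i)
               for i, a in enumerate(coefs))

coefs = [0.2, -1.0, 3.5, 0.0, 1.5]  # arbitrary example coefficients
grid = [k / 50 for k in range(51)]
max_diff = max(abs(de_casteljau(coefs, t) - by_definition(coefs, t)) for t in grid)
print(max_diff)
```

The two evaluations should agree up to floating point rounding, so the printed difference is expected to be tiny.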

So now that we have our ingredients in place, let’s implement a short PyTorch module implementing the idea in the nice diagram we saw above:

from torch import nn

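class BernsteinPolynomialModel(nn.Module):
  def __init__(self, x_model, coef_transformer):
    super().__init__()  # required before registering sub-modules
    # nn.Sequential takes modules as positional arguments, not a list
    self.coef_model = nn.Sequential(
        x_model,
        coef_transformer
    )

  def forward(self, x, z):
    coefs = self.coef_model(x)
    # n + 1 coefficients define a polynomial of degree n
    degree = coefs.shape[-1] - 1
    basis = bernstein_basis(degree, z)
    return torch.sum(coefs * basis, dim=-1)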

Coefficient transform components

We already saw a simple transform that takes a vector, and converts it to a vector with non-decreasing components based on the SoftPlus function. We can do something similar for non-increasing components using the negative of the SoftPlus function:

class NonincreasingCoefTransform(nn.Module):
    def forward(self, u):
        # We assume that `u` has mini-batch dimensions,
        # and the 'coefficient' dimension is the last one.
        u_head = u[..., 0:1]
        u_tail_neg = -nn.functional.softplus(u[..., 1:])
        head_tail = torch.cat([u_head, u_tail_neg], dim=-1)
        return torch.cumsum(head_tail, dim=-1)

Let’s test it:

print(NonincreasingCoefTransform()(torch.tensor([-5, 3., 2., -1., 3.])))

tensor([ -5.0000,  -8.0486, -10.1755, -10.4888, -13.5374])

Appears to do what we wanted - produces a nonincreasing vector. What if we’re modeling a CDF? Well, then we can add an additional Sigmoid layer on top of a NondecreasingCoefTransform. Since the sigmoid is increasing, it maps our non-decreasing coefficients, which are arbitrary numbers, to non-decreasing coefficients in $[0, 1]$, and therefore the polynomial is non-decreasing with values in $[0, 1]$. Namely, we can use:

nn.Sequential(
    NondecreasingCoefTransform(),
    nn.Sigmoid()
)

An interesting case is a CDF of a distribution whose support is known to be $[0, 1]$. Then we can model it directly with Bernstein polynomials whose coefficient vector $\mathbf{a}$ satisfies:

\[a_0 = 0 \leq a_1 \leq \dots \leq a_n = 1.\]

To that end, we can use the SoftMax function with a cumulative sum. Assuming that $\mathbf{u} \in \mathbb{R}^n$, we can define:

\[a_i = \frac{\sum_{j=1}^i \exp(u_j)}{\sum_{j=1}^n \exp(u_j)}, \qquad i = 0, \dots, n\]

Consequently, the corresponding layer is:

class CDFCoefTransform(nn.Module):
  def forward(self, u):
    zero = torch.zeros_like(u[..., :1])
    cum_softmax = torch.cumsum(nn.functional.softmax(u, dim=-1), dim=-1)
    cdf_coefs = torch.cat([zero, cum_softmax], dim=-1)
    return cdf_coefs

Let’s try it out:

print(CDFCoefTransform()(torch.tensor([-5, 3., 2., -1., 3.])))

tensor([0.0000e+00, 1.4056e-04, 4.1916e-01, 5.7331e-01, 5.8098e-01, 1.0000e+00])

Appears to do what we desire - a non-decreasing vector, going from 0 to 1. Now let’s try to use our components.
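As a cross-check that does not depend on PyTorch, here is a plain-Python sketch of the same softmax and cumulative sum formula, together with an evaluation of the resulting polynomial at the endpoints of $[0, 1]$. The helper names are hypothetical. Since $b_{0,n}(0) = 1$ and $b_{n,n}(1) = 1$ while all other basis polynomials vanish at these points, the polynomial equals $a_0 = 0$ at $z = 0$ and $a_n = 1$ at $z = 1$, as a CDF supported on $[0, 1]$ should.

```python
import math
from math import comb

def cdf_coefs(u):
    """Softmax followed by a cumulative sum, with a leading zero."""
    m = max(u)
    exps = [math.exp(v - m) for v in u]  # shift by the max for numerical stability
    total = sum(exps)
    coefs, running = [0.0], 0.0
    for e in exps:
        running += e
        coefs.append(running / total)
    return coefs

def bernstein_poly(coefs, x):
    n = len(coefs) - 1
    return sum(a * comb(n, i) * x**i * (1 - x)**(n - i)
               for i, a in enumerate(coefs))

a = cdf_coefs([-5.0, 3.0, 2.0, -1.0, 3.0])
print([round(v, 5) for v in a])
print(bernstein_poly(a, 0.0), bernstein_poly(a, 1.0))
```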

Example - learning an increasing function

At first, I wanted to demonstrate it on an application from a domain I know - learning the CDF of auction bids in online advertising. But the data-sets, such as the IPinYou data-set, are too large to handle quickly enough for a blog post. So instead, we shall fit a synthetic function. We’ll be using NumPy to implement the synthetic function $f(\mathbf{x}, z)$ we intend to fit, to make plotting and inspection straightforward. When we use it to generate a dataset, we shall transform the NumPy arrays into PyTorch tensors.

import numpy as np

def relu(x):
  return np.maximum(np.zeros_like(x), x)

def softshrink(x, a=0.3):
  return relu(x - a) - relu(-x - a)

def sgn_square(x):
  return x * np.abs(x)

def hairy_increasing_func(x, z):
  x1, x2, x3 = x[..., 0], x[..., 1], x[..., 2]
  return (relu(np.cos(x1 - x2 + x3)) * sgn_square(softshrink(z - np.sin(x1) ** 2))
          + (1 + np.cos(x2 + x3)) * sgn_square(softshrink(z - np.sin(x2) ** 2))
          + (1 + np.cos(x1 - x2)) * sgn_square(softshrink(z - np.sin(x3) ** 2))
          + np.cos(x1 + x2 + x3))

Indeed seems a bit ‘hairy’, so let’s inspect a few examples:

import matplotlib.pyplot as plt

zs = np.linspace(0, 1, 1000)
plt.plot(zs, hairy_increasing_func(np.array([-1, 0.1, 0.5]), zs), label='function 1')
plt.plot(zs, hairy_increasing_func(np.array([1, 0.5, -0.5]), zs), label='function 2')
plt.plot(zs, hairy_increasing_func(np.array([-1.5, 0.8, 0.1]), zs), label='function 3')
plt.legend()
plt.show()

The function uses a few powers of ‘soft-shrink’ that generate ‘flat’ plateaus, to make fitting a bit challenging. The center and slope of these soft-shrink functions are based on trigonometric functions of the components of $\mathbf{x}$. The squared soft-shrink terms have a discontinuous second derivative at the edges of the plateaus, and this shall make fitting a bit challenging, even with a small number of features. But it’s possible with polynomials of a high enough degree. As we saw in the polynomial features series - we are not afraid of fitting high-degree polynomials.

Using this function we can generate a PyTorch dataset. The function below generates a data-set of the specified size, and uploads it to the default CUDA GPU if it is available. This is to make our fitting experiments simple and fast when we have a GPU available:

def generate_dataset(n_rows, noise=0.1):
  xs = np.random.randn(n_rows, 3)
  zs = np.random.rand(n_rows)
  labels = hairy_increasing_func(xs, zs) + np.random.randn(n_rows) * noise

  xs = torch.as_tensor(xs).to(dtype=torch.float32)
  zs = torch.as_tensor(zs).to(dtype=torch.float32)
  labels = torch.as_tensor(labels).to(dtype=torch.float32)
  if torch.cuda.is_available():
    xs = xs.cuda()
    zs = zs.cuda()
    labels = labels.cuda()

  return xs, zs, labels

Our next ingredient is a function that builds a PyTorch model. We will be comparing a monotonic model using the BernsteinPolynomialModel class we just implemented, to a regular fully-connected $\mathrm{ReLU}$ network. So here is a function to create a model given layer dimensions that supports both cases:

def make_model(layer_dims, monotone=True):
  # create a fully connected ReLU network
  layers = [
      layer
      for in_dim, out_dim in zip(layer_dims[:-1], layer_dims[1:])
      for layer in [nn.Linear(in_dim, out_dim), nn.ReLU()]
  ]

  if monotone:
    # define a model for x - a ReLU network whose last layer is linear
    x_model = nn.Sequential(*layers[:-1])

    # construct a network for predicting non-decreasing functions;
    # the output dimension of the last layer is the number of
    # Bernstein coefficients, i.e. the polynomial degree plus one.
    return BernsteinPolynomialModel(
        x_model,
        NondecreasingCoefTransform()
    )
  else:
    # define a simple ReLU network - just add a linear layer
    # with one output on top of the ReLU network
    layers.append(nn.Linear(layer_dims[-1], 1))
    return nn.Sequential(*layers)

Let’s verify that even without training, our ‘monotone’ model indeed produces non-decreasing functions of $z$ for each $\mathbf{x}$.

from functools import partial
torch.manual_seed(2024)  # just to make this result reproducible
net = make_model([3, 10, 10, 10])

plot_zs = torch.linspace(0, 1, 100)

func = partial(net, torch.tensor([30., 20, 10]).repeat(100, 1))
plt.plot(plot_zs, func(plot_zs).detach().numpy(), label='Input = [30, 20, 10]')

func = partial(net, torch.tensor([10., 20, 30]).repeat(100, 1))
plt.plot(plot_zs, func(plot_zs).detach().numpy(), label='Input = [10, 20, 30]')

plt.legend()
plt.show()

Well, indeed the model appears to generate increasing functions of $z$.

Now to our last ingredient - model training. Here is a pretty-standard PyTorch training loop, but with a small customization to support monotonic models accepting the features as two parameters $\mathbf{x}, z$, and ‘regular’ models accepting only one features parameter:

from tqdm.auto import tqdm

def train_epoch(data_iter, model, loss_fn, optim, monotone):
  for x, z, label in data_iter:
    if monotone:
      pred = model(x, z)
    else:
      pred = model(torch.cat([x, z.reshape(-1, 1)], dim=-1)).squeeze()
    loss = loss_fn(pred, label)

    optim.zero_grad()
    loss.backward()
    optim.step()

And here is a pretty-standard evaluation loop, doing the same:

@torch.no_grad()
def valid_epoch(data_iter, model, loss_fn, monotone):
  epoch_loss = 0.
  num_samples = 0
  for x, z, label in data_iter:
    if monotone:
      pred = model(x, z)
    else:
      pred = model(torch.cat([x, z.reshape(-1, 1)], dim=-1)).squeeze()
    loss = loss_fn(pred, label)
    epoch_loss += loss * label.size(0)
    num_samples += label.size(0)
  return epoch_loss.cpu().item() / num_samples

Now let’s integrate all ingredients into one function that creates a model and an optimizer, and runs several train+evaluation epochs using the mean squared error loss:

def train_model(train_iter, valid_iter, layer_dims, monotone=True,
                optim_fn=torch.optim.SGD, optim_params=None, num_epochs=100):
  if optim_params is None:
    optim_params = {}

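  torch.manual_seed(2024)
  model = make_model(layer_dims, monotone=monotone)

  # move the model to the GPU before creating the optimizer, so the
  # optimizer is constructed over the parameters we actually train
  if torch.cuda.is_available():
    model = model.cuda()

  optim = optim_fn(model.parameters(), **optim_params)
  loss_fn = nn.MSELoss()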

  with tqdm(range(num_epochs)) as epoch_range:
    for epoch in epoch_range:
      train_epoch(train_iter, model, loss_fn, optim, monotone)
      epoch_loss = valid_epoch(valid_iter, model, loss_fn, monotone)
      epoch_range.set_description(f'Validation loss = {epoch_loss:.5f}')
  return model, epoch_loss

Now let’s train! First, we create the train and evaluation datasets:

from batch_iter import BatchIter

batch_size = 256
train_iter = BatchIter(*generate_dataset(50000), batch_size=batch_size)
valid_iter = BatchIter(*generate_dataset(10000), batch_size=batch_size)

Now we train a monotonic model. I chose its architecture, the optimizer, and its parameters using hyperparameter tuning with the validation set. But to make this post straightforward, I’m just writing the final hyper-parameters I selected:

lr = 3e-3
weight_decay = 1e-5
degree = 50
layer_dims = [3,
              4 * degree,
              3 * degree,
              2 * degree,
              degree]
model, val_loss = train_model(
    train_iter, valid_iter, layer_dims,
    optim_fn=torch.optim.AdamW,
    optim_params=dict(lr=lr, weight_decay=weight_decay))
model = model.cpu()

I got a validation loss of 0.0127. Now let’s plot some functions the model learned, and see how they compare to the “true” hairy function we designed. Here is code to produce the function for $\mathbf{x} = (1, 0.5, -0.5)$:

features = torch.tensor([1, 0.5, -0.5]).repeat(100, 1)
func = partial(model, features)
plt.plot(plot_zs, func(plot_zs).detach().numpy(), label='Model function')
plt.plot(plot_zs, hairy_increasing_func(features.numpy(), plot_zs.numpy()), label='True function')

plt.legend()
plt.show()

Seems pretty close. Let’s try another one with $\mathbf{x} = (-1.5, 0.8, 0.1)$:

features = torch.tensor([-1.5, 0.8, 0.1]).repeat(100, 1)
func = partial(model, features)
plt.plot(plot_zs, func(plot_zs).detach().numpy(), label='Model function')
plt.plot(plot_zs, hairy_increasing_func(features.numpy(), plot_zs.numpy()), label='True function')

plt.legend()
plt.show()

A bit farther away, but not too bad.

Now let’s try training a regular ReLU network on the same dataset and see what functions we have. Its architecture is going to be similar to the $\mathbf{x}$ network from the monotonic example above, but its input dimension is going to be four, instead of three features. This is because now $z$ is not handled separately from the other features. So here is the code to train the network:

lr = 3e-3
weight_decay = 1e-5
degree = 50 # there is no "degree" - it's here just to preserve model architecture.
layer_dims = [4,
              4 * degree,
              3 * degree,
              2 * degree,
              degree]
model, val_loss = train_model(
    train_iter, valid_iter, layer_dims, monotone=False,
    optim_fn=torch.optim.AdamW,
    optim_params=dict(lr=lr, weight_decay=weight_decay))
model = model.cpu()

I got a validation loss of $0.01404$ - slightly worse, but not by much. Let’s see what functions we’re getting for the same two vectors $\mathbf{x}$ we tried before. So here is the code for $\mathbf{x} =(1, 0.5, -0.5)$:

features = torch.cat([
    torch.tensor([1, 0.5, -0.5]).repeat(100, 1),
    plot_zs.reshape(-1, 1)
], axis=-1)
plt.plot(plot_zs, model(features).detach().numpy(), label='Model function')
plt.plot(plot_zs, hairy_increasing_func(features.numpy(), plot_zs.numpy()), label='True function')

plt.legend()
plt.show()

The model function appears monotonic. Is this a coincidence? Well, let’s try our second vector $\mathbf{x} = (-1.5, 0.8, 0.1)$:

features = torch.cat([
    torch.tensor([-1.5, 0.8, 0.1]).repeat(100, 1),
    plot_zs.reshape(-1, 1)
], axis=-1)
plt.plot(plot_zs, model(features).detach().numpy(), label='Model function')
plt.plot(plot_zs, hairy_increasing_func(features.numpy(), plot_zs.numpy()), label='True function')

plt.legend()
plt.show()

This one isn’t! If we think about it - there is a good reason. Our synthetic dataset was generated by random sampling of standard normal variables - vectors whose features are close to zero are more common than those with features farther away. The vector $\mathbf{x} = (1, 0.5, -0.5)$ has components closer to zero than $\mathbf{x} = (-1.5, 0.8, 0.1)$, so there was more training data similar to the former vector than to the latter. Consequently, the model could better learn to represent the functions in the neighbourhood of the former vector. However, when a model is monotone by design, we don’t rely on having enough data for the model to discover monotonic behavior. It’s built into the model.
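Instead of eyeballing plots, we can also count monotonicity violations on a grid. Below is a minimal sketch, assuming the model has been wrapped as a plain Python function of $z$ for a fixed $\mathbf{x}$. The helper name and the two toy functions, which stand in for trained models, are hypothetical.

```python
import math

def count_monotonicity_violations(f, grid, tol=1e-9):
    """Count consecutive grid points where f decreases by more than `tol`."""
    values = [f(z) for z in grid]
    return sum(1 for v1, v2 in zip(values, values[1:]) if v2 < v1 - tol)

grid = [k / 100 for k in range(101)]

def monotone_toy(z):
    # stand-in for a model that is nondecreasing by design
    return z ** 2

def wiggly_toy(z):
    # stand-in for an unconstrained model: its derivative
    # 1 + 2 * cos(20 * z) is negative on parts of [0, 1]
    return z + 0.1 * math.sin(20 * z)

n_mono = count_monotonicity_violations(monotone_toy, grid)
n_wiggly = count_monotonicity_violations(wiggly_toy, grid)
print(n_mono, n_wiggly)
```

Running such a check for many sampled feature vectors $\mathbf{x}$ gives a rough estimate of how often an unconstrained model breaks monotonicity, whereas for a model that is monotone by design the count should be zero up to rounding.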

Summary and discussion

In this post we saw an interesting combination of neural networks with Bernstein polynomials that allows learning functions under shape constraints. This is useful when the shape constraint is actually a constraint, i.e. required for the predictions of the model to be correct from a mathematical or business perspective. Moreover, it’s a form of regularization, since that’s what regularization often is - injecting prior knowledge about the hypothesis class into the fitting procedure.

The idea of constraining coefficients of a function in a given basis to constrain its shape works not only for Bernstein polynomials, but also for the B-Spline basis⁶. Probably also for a variety of other ‘shape-preserving’ bases that I never heard about. So you’re welcome to try this idea with those bases as well, if you believe they suit your needs.

An interesting variation could be designing a polynomial that is monotonic, non-negative, convex or concave over the entire real line $(-\infty, \infty)$. There is an interesting theorem that dates back to Hilbert’s 1888 paper⁷, that any polynomial $p(z)$ of degree $2d$ is non-negative over the entire real line if and only if it is a sum of squares of polynomials. Alternatively, this can be phrased as the existence of a positive-semidefinite matrix $\mathbf{P} \in \mathbb{R}^{(d+1) \times (d+1)}$ such that the polynomial can be written as

\[p(z;\mathbf{P}) = \begin{pmatrix}1 & z & \dots & z^d\end{pmatrix} \mathbf{P} \begin{pmatrix}1 \\ z \\ \vdots \\ z^d \end{pmatrix}.\]

Any positive semidefinite matrix $\mathbf{P}$ can be decomposed as $\mathbf{P} = \mathbf{V} \mathbf{V}^T$. So just like we predicted the Bernstein coefficient vector $\mathbf{a}$ based on the features $\mathbf{x}$, we could alternatively build a model that learns to predict $\mathbf{V}$.
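Here is a small plain-Python sketch of this parametrization. Writing $\mathbf{m}(z) = (1, z, \dots, z^d)^T$, we have $p(z) = \mathbf{m}(z)^T \mathbf{V} \mathbf{V}^T \mathbf{m}(z) = \|\mathbf{V}^T \mathbf{m}(z)\|^2$, which is non-negative for any $\mathbf{V}$ with $d + 1$ rows. The function names and the entries of $\mathbf{V}$ below are arbitrary and for illustration only. In a model, $\mathbf{V}$ would be predicted from $\mathbf{x}$.

```python
def sos_poly(V, z):
    """p(z) = || V^T m(z) ||^2 with m(z) = (1, z, ..., z^d); V has d + 1 rows."""
    d = len(V) - 1
    monomials = [z ** k for k in range(d + 1)]
    n_cols = len(V[0])
    w = [sum(V[k][j] * monomials[k] for k in range(d + 1)) for j in range(n_cols)]
    return sum(wj * wj for wj in w)

def quadratic_form(V, z):
    """The same polynomial, computed through P = V V^T explicitly."""
    d = len(V) - 1
    P = [[sum(vi * vk for vi, vk in zip(V[i], V[k])) for k in range(d + 1)]
         for i in range(d + 1)]
    m = [z ** k for k in range(d + 1)]
    return sum(m[i] * P[i][k] * m[k] for i in range(d + 1) for k in range(d + 1))

# arbitrary 3 x 2 matrix: d = 2, so p is a polynomial of degree 4
V = [[1.0, 0.0],
     [-2.0, 0.5],
     [0.0, 1.0]]
grid = [k / 10 for k in range(-50, 51)]
sos_values = [sos_poly(V, z) for z in grid]
print(min(sos_values))
```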

Since a polynomial is increasing if and only if its derivative is non-negative, we can just take an integral of a non-negative polynomial. Similarly, convexity can be represented using double-integration of a non-negative polynomial, since a polynomial is convex if and only if its second derivative is non-negative. In both cases, it’s just multiplying the matrix $\mathbf{P}$ by the corresponding constant representing integration or double integration. Similar “sum of squares” techniques can be used to construct polynomials over an interval, by integrating non-negative polynomials over an interval. See Blekherman et al.⁸, Theorem 3.72.
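The integration step can be sketched in plain Python on the level of monomial coefficients: build a non-negative polynomial as a sum of squares, integrate it term by term, and check numerically that the result is increasing. This only illustrates the principle. The helper names, the two squared factors, and the integration constant are all made up, and coefficient lists are stored lowest degree first.

```python
def poly_mul(p, q):
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

def poly_integrate(p, constant=0.0):
    # the integral of c * z^k is c / (k + 1) * z^(k + 1)
    return [constant] + [c / (k + 1) for k, c in enumerate(p)]

def poly_eval(p, z):
    result = 0.0
    for c in reversed(p):  # Horner's rule
        result = result * z + c
    return result

# sum of squares of (1 - 2z) and (0.5 + z^2): non-negative everywhere
factors = [[1.0, -2.0], [0.5, 0.0, 1.0]]
nonneg = [0.0] * 5
for f in factors:
    for k, c in enumerate(poly_mul(f, f)):
        nonneg[k] += c

increasing = poly_integrate(nonneg, constant=-1.0)

grid = [k / 10 for k in range(-30, 31)]
inc_values = [poly_eval(increasing, z) for z in grid]
print(all(v1 <= v2 for v1, v2 in zip(inc_values, inc_values[1:])))
```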

Now let’s get back to the realm of Bernstein polynomials. What happens if we want a polynomial that is both convex and increasing? Or both concave and increasing? This seems useful as well, if we would like to model a utility function that represents diminishing returns. But in this case, we need to impose two constraints on the coefficient vector of the polynomial: one for monotonicity, and another one for concavity. This appears easy with convex optimization solvers that support constraints out of the box, but harder to achieve if we want to train a neural network with PyTorch that produces a coefficient vector that satisfies several constraints. This is exactly what we shall explore in the next post!

1. You, S., Ding, D., Canini, K., Pfeifer, J., & Gupta, M. (2017). Deep lattice networks and partial monotonic functions. Advances in Neural Information Processing Systems, 30.
2. Hastie, T., & Tibshirani, R. (1993). Varying-coefficient models. Journal of the Royal Statistical Society Series B: Statistical Methodology, 55(4), 757-779.
3. Ghosal, R., Ghosh, S., Urbanek, J., Schrack, J. A., & Zipunnikov, V. (2023). Shape-constrained estimation in functional regression with Bernstein polynomials. Computational Statistics & Data Analysis, 178, 107614.
4. Hoover, D. R., Rice, J. A., Wu, C. O., & Yang, L. P. (1998). Nonparametric smoothing estimates of time-varying coefficient models with longitudinal data. Biometrika, 85(4), 809-822.
5. Huang, J. Z., Wu, C. O., & Zhou, L. (2004). Polynomial spline estimation and inference for varying coefficient models with longitudinal data. Statistica Sinica, 763-788.
6. de Boor, C. (1993). A practical guide to splines.
7. Hilbert, D. (1888). Über die Darstellung definiter Formen als Summe von Formenquadraten. Mathematische Annalen, 32(3), 342-350.
8. Blekherman, G., Parrilo, P. A., & Thomas, R. R. (2012). Semidefinite Optimization and Convex Algebraic Geometry. SIAM.

Mini-batching with in-memory datasets

2024-08-31T00:00:00+00:00

Intro

When doing research and quickly trying out ideas, speed is important. Waiting a long time until an experiment completes may keep us idle, and reduce our efficiency as researchers. Quick feedback from our experiments is typically crucial to keep our productivity, and this post may help us do exactly that - be more productive by quickly iterating experiments.

When reading typical tutorials about training models with PyTorch from datasets stored in PyTorch tensors, we see this pattern:

from torch.utils.data import TensorDataset, DataLoader

ds = TensorDataset(X, y)
for Xb, yb in DataLoader(ds, batch_size=..., shuffle=...):
  # inner training loop code: forward, backward, optimizer update, ...

However, when the training loop code is fast, such as when we’re training a small model, this pattern might not be a good idea in practice. Why? Well, DataLoader, as its name suggests, is optimized for data loading. It has plenty of logic for handling loading, collating, and batching data in a generic and parallel manner. And it does a pretty good job - these features are important for many applications. However, when the data fits in memory, and models are fast to compute, this overhead is quite significant. And even more so - when the data and model fit in GPU memory! This is oftentimes the case when we want to experiment with some idea on a small scale, before trying it out on a larger scale.

This post is devoted to demonstrating this overhead, and presenting an alternative that is easy to use and is fast. As usual, the code for this post is in this notebook you can deploy on Colab, and the utilities we develop are in this gist. The examples, however, are assumed to be run in a notebook, since we use the %%time magic keyword to measure running times. Moreover, the post assumes we have access to a GPU with at least 1GB of memory. I ran it on Colab with a T4 GPU.

I know typical posts on this blog are mathematically inclined, but not this one. This one is purely about coding, so let’s get started!

**Update**

The utilities developed in this post were converted to a small Python library you can install with

pip install batch-iter

The source code is in this GitHub repo.

DataLoader overhead

Let’s try to measure the overhead of the DataLoader class first, before trying to solve it. To that end, let’s generate a data-set for a nonlinear problem:

import torch

device = torch.device('cuda:0')
n_features = 1000
n_samples = 500000
X = torch.randn(n_samples, n_features, device=device)
y = torch.randn(n_samples, device=device)

Note, that the labels are completely random, since we don’t aim to actually learn anything. Our aim is only benchmarking the running times of our training code.

Now let’s define a network to learn it:

from torch import nn

def make_network():
  return nn.Sequential(
    nn.Linear(n_features, n_features // 2),
    nn.ReLU(),
    nn.Linear(n_features // 2, n_features // 8),
    nn.ReLU(),
    nn.Linear(n_features // 8, 1)
  )

Now let’s train it, and measure the time it takes:

net = make_network().to(device)
optim = torch.optim.SGD(net.parameters(), lr=1e-3)
criterion = nn.MSELoss()
ds = torch.utils.data.TensorDataset(X, y)

%%time
for Xb, yb in torch.utils.data.DataLoader(ds, batch_size=64, shuffle=True):
  loss = criterion(net(Xb).squeeze(), yb)
  loss.backward()
  optim.step()
  optim.zero_grad()

I got the following output:

CPU times: user 12.8 s, sys: 293 ms, total: 13.1 s
Wall time: 13.4 s

How much of it is the DataLoader’s work? Let’s replace the training loop with pass and see what happens:

%%time
for Xb, yb in torch.utils.data.DataLoader(ds, batch_size=64, shuffle=True):
	pass

The output is:

CPU times: user 4.13 s, sys: 19 ms, total: 4.15 s
Wall time: 4.15 s

Whoa! Approximately 30% of the time is spent by just iterating over the data! Now let’s try to do something about it. These four seconds don’t sound like much, but we have several training epochs. And probably some hyperparameter tuning cycles. Multiply these four seconds by the number of epochs and then by the number of hyperparameter configurations, and you will find yourself wasting plenty of time! So let’s try to be more productive for small-scale experiments.

Manual batch iteration

Typically, we want to iterate over batches from a set of tensors. In most cases, this set is of size two - the features tensor, and the labels tensor. But sometimes we want more, and that’s why TensorDataset also accepts a set of arbitrary size.

Iterating over a set of tensors is quite easy with PyTorch. We just need to be careful about not copying data from CPU to GPU and vice versa, so we need to make sure that everything is on the same device. So here is the function - it accepts an arbitrary number of tensors, checks which device they’re on, creates a list of indices on the device, and uses those to iterate over mini-batches:

def iter_tensors(*tensors, batch_size):
  device = tensors[0].device  # we assume all tensors are on the same device
  n = tensors[0].size(0)
  idxs = torch.arange(n, device=device).split(batch_size)
  for batch_idxs in idxs:
    yield tuple((x[batch_idxs, ...] for x in tensors))  

Well, let’s try it out:

%%time
for Xb, yb in iter_tensors(X, y, batch_size=64):
	pass	

CPU times: user 222 ms, sys: 925 µs, total: 223 ms
Wall time: 225 ms

Ah, much better! But this code does not support shuffling, so let’s add it using the torch.randperm() function:

def iter_tensors_with_shuffle(*tensors, batch_size, shuffle=False):
  device = tensors[0].device  # we assume all tensors are on the same device
  n = tensors[0].size(0)
  if shuffle:
    idxs = torch.randperm(n, device=device)  # random order
  else:
    idxs = torch.arange(n, device=device)  # original order
  idxs = idxs.split(batch_size)
  for batch_idxs in idxs:
    yield tuple((x[batch_idxs, ...] for x in tensors))

And let’s try it out:

%%time
for Xb, yb in iter_tensors_with_shuffle(X, y, batch_size=64, shuffle=True):
	pass

CPU times: user 226 ms, sys: 2.86 ms, total: 229 ms
Wall time: 231 ms

Well, pretty fast. Still much better than the 4.15 seconds with DataLoader.

And now for one more enhancement. In many cases we like to use the tqdm library when iterating over data. However, tqdm needs to know the number of items we’re iterating over. Unfortunately, Python generators used in our functions above don’t provide the required __len__() method. So let’s refactor our code into a class that has the required methods:

class BatchIter:
    def __init__(self, *tensors, batch_size, shuffle=True):
      """
      tensors: feature tensors (each with shape: num_samples x *)
      batch_size: int
      shuffle: bool (default: True) whether to iterate over randomly shuffled samples.
      """
      self.tensors = tensors

      device = tensors[0].device
      n = tensors[0].size(0)
      if shuffle:
          idxs = torch.randperm(n, device=device)
      else:
          idxs = torch.arange(n, device=device)

      self.idxs = idxs.split(batch_size)

    def __len__(self):
        return len(self.idxs)

    def __iter__(self):
        tensors = self.tensors
        for batch_idxs in self.idxs:
            yield tuple((x[batch_idxs, ...] for x in tensors))

Now let’s try it out:

from tqdm.auto import tqdm

%%time
for Xb, yb in tqdm(BatchIter(X, y, batch_size=64, shuffle=True)):
  pass

100%|██████████| 7813/7813 [00:00<00:00, 36521.03it/s]
CPU times: user 249 ms, sys: 1.88 ms, total: 251 ms
Wall time: 254 ms

Beautiful! We have built a small utility class that I called BatchIter to eliminate most of the overhead of DataLoader in simple cases, when all data is in-memory, and models are small and lean. I hope it is useful to your small experiments. But now let’s extend it.
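One detail worth keeping in mind: BatchIter draws its permutation once, in the constructor, so iterating over the same object twice visits the batches in the same order. Below is a minimal sketch of a multi-epoch loop that builds a fresh iterator in every epoch. The `train` helper, its arguments, and the small CPU tensors are hypothetical names chosen for illustration, and the class is a compact copy of the one above.

```python
import torch
from torch import nn


class BatchIter:  # compact copy of the class defined above
    def __init__(self, *tensors, batch_size, shuffle=True):
        self.tensors = tensors
        n = tensors[0].size(0)
        device = tensors[0].device
        idxs = torch.randperm(n, device=device) if shuffle else torch.arange(n, device=device)
        self.idxs = idxs.split(batch_size)

    def __len__(self):
        return len(self.idxs)

    def __iter__(self):
        for batch_idxs in self.idxs:
            yield tuple(x[batch_idxs, ...] for x in self.tensors)


def train(net, X, y, n_epochs=3, batch_size=64, lr=1e-3):
    """Hypothetical helper: plain SGD on the MSE loss, reshuffling every epoch."""
    optim = torch.optim.SGD(net.parameters(), lr=lr)
    criterion = nn.MSELoss()
    epoch_losses = []
    for _ in range(n_epochs):
        total = 0.0
        # a new BatchIter per epoch draws a new permutation
        for Xb, yb in BatchIter(X, y, batch_size=batch_size, shuffle=True):
            loss = criterion(net(Xb).squeeze(-1), yb)
            optim.zero_grad()
            loss.backward()
            optim.step()
            total += loss.item() * len(yb)
        epoch_losses.append(total / len(y))
    return epoch_losses


# small CPU example, just to exercise the loop
X_small = torch.randn(256, 10)
y_small = torch.randn(256)
losses = train(nn.Linear(10, 1), X_small, y_small)
```

Constructing the iterator is cheap compared to an epoch of training, so re-creating it per epoch should not bring the overhead back.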

Iterating over grouped data

There are applications where we want to iterate over mini-batches composed of groups of samples. One such case is the learning to rank problem: we are given a query and a corresponding list of candidate answers, each labeled with a score designating its relevance. Our objective is learning a function that scores items for a given query, such that more relevant items have a higher score. Methods that define a loss for the entire list of suggestions for a given query, known as list-wise methods, require all suggestions belonging to the same query to be grouped together.

Here, we will build a utility class for iterating over grouped samples. We assume that the input consists of samples, each having a group id, and that each group appears consecutively. The shuffling process shuffles entire groups, rather than individual samples. This is illustrated below - we have a group-id, and $n$ tensors $T_1, \dots, T_n$ that comprise our dataset:

Similarly, our utility assumes that the mini-batch size specifies the number of groups in each mini-batch, rather than the number of samples. This plays nicely with list-wise learning to rank, since each group produces one loss value for the entire group. Therefore, with a mini-batch of $k$ groups, we shall have a sample of $k$ losses.

Group shuffling

To shuffle entire groups, we need several utilities. Our main requirement for these utilities is that they are composed of primitive vectorized PyTorch functions, so that we can run them on the GPU as well. The first one is called lexical sort, and it does what you think it does - it returns the permutation for sorting several tensors in lexicographical order. There is a similar function in NumPy, called lexsort, and we shall implement our own for PyTorch. Fortunately, we don’t need to think too much about it - the developers of the PyTorch-Geometric¹ library already wrote one, so the implementation below is just a simplified version:

def lexsort(*keys, dim=-1):
    if len(keys) == 0:
        raise ValueError(f"Must have at least 1 key, but {len(keys)=}.")

    idx = keys[0].argsort(dim=dim, stable=True)
    for k in keys[1:]:
        idx = idx.gather(dim, k.gather(dim, idx).argsort(dim=dim, stable=True))

    return idx

It does what we would expect it to do - it computes the sorting order by each tensor separately using a stable sorting algorithm. It uses the PyTorch gather functions for reshuffling. Let’s see how it works - we shall sort the pairs $(5, 4), (3, 1), (5, 1), (3, 3), (5, 3), (5, 2), (3, 2)$ in lexicographic order - meaning, we compare by the first item of each pair, and among the pairs with equal first item, we compare by the second item. Conforming to the same convention as NumPy, we specify the tensors in reverse order, namely, first the tensor with the second components, and then the tensor with the first components, as below:

first = torch.tensor([5, 3, 5, 3, 5, 5, 3])
second = torch.tensor([4, 1, 1, 3, 3, 2, 2])
order = lexsort(second, first)
print(first[order], second[order])

tensor([3, 3, 3, 5, 5, 5, 5]) tensor([1, 2, 3, 1, 2, 3, 4])
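The trick behind lexsort, namely stable sorting by the least significant key first and by the most significant key last, can also be seen without PyTorch. Here is a small plain-Python sketch on the same pairs; it is only an illustration of the idea, not part of the utilities:

```python
pairs = [(5, 4), (3, 1), (5, 1), (3, 3), (5, 3), (5, 2), (3, 2)]

# pass 1: order by the least significant key (the second component)
order = sorted(range(len(pairs)), key=lambda i: pairs[i][1])
# pass 2: stable sort by the most significant key (the first component);
# ties keep the order established in pass 1
order = sorted(order, key=lambda i: pairs[i][0])

sorted_pairs = [pairs[i] for i in order]
print(sorted_pairs)
# [(3, 1), (3, 2), (3, 3), (5, 1), (5, 2), (5, 3), (5, 4)]
```

Python's sorted is stable, which plays the same role as stable=True in the argsort calls above.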

Why is it useful? One simple way of shuffling entire groups is sorting by a hash code of the query id, and breaking ties by the query id itself. Tie breaking is required due to hash collisions. Speaking of the devil, we will also need a function for component-wise hash codes in PyTorch, so I wrote my own which implements the FNV hash algorithm:

def fnv_hash(tensor):
    """
    Computes the FNV hash for each component of a PyTorch tensor of integers.
    Args:
      tensor: A PyTorch tensor of type int32 or int16
    Returns:
      A PyTorch tensor of the same size and dtype as the input tensor, containing the FNV hash for each element.
    """
    # Define the FNV prime and offset basis
    FNV_PRIME = torch.tensor(0x01000193, dtype=torch.int32)
    FNV_OFFSET = torch.tensor(0x811c9dc5, dtype=torch.int32)

    # Initialize the hash value with the FNV offset basis (same size and dtype as tensor)
    hash_value = torch.full_like(tensor, FNV_OFFSET)
    for byte in split_int_to_bytes(tensor):
        hash_value = torch.bitwise_xor(hash_value * FNV_PRIME, byte)

    # No need to reshape, output already has the same size and dtype as input
    return hash_value
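The split_int_to_bytes helper used in the loop is not shown here. A minimal sketch of what such a helper might look like is below. It is an assumption rather than the original implementation: it yields one tensor per byte of the integer elements, least significant byte first, so an int32 tensor produces four tensors.

```python
import torch


def split_int_to_bytes(tensor):
    """
    Hypothetical helper (an assumption, not the original implementation):
    yields one tensor per byte of the integer elements, least significant first.
    """
    for i in range(tensor.element_size()):
        # shift the i-th byte down to the lowest position and mask it out
        yield (tensor >> (8 * i)) & 0xFF


x = torch.tensor([0x01020304, 255], dtype=torch.int32)
byte_tensors = [b.tolist() for b in split_int_to_bytes(x)]
print(byte_tensors)
# [[4, 255], [3, 0], [2, 0], [1, 0]]
```

The byte order only changes which hash values come out, not the fact that equal group ids receive equal hashes, which is all the shuffling needs.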

Now we can obtain permutation indices that permute entire groups with a given seed, simply by sorting by the pairs (hash(group_id + seed), group_id). Here is an example:

group_id = torch.tensor([5, 5, 8, 8, 8, 8, 1, 1])
seed = 1
order = lexsort(group_id, fnv_hash(group_id + seed))
print(group_id[order])

tensor([5, 5, 1, 1, 8, 8, 8, 8])

Let’s try another seed:

seed = 2
order = lexsort(group_id, fnv_hash(group_id + seed))
print(group_id[order])

tensor([1, 1, 8, 8, 8, 8, 5, 5])

Note, that both lexsort and fnv_hash are composed of vectorized PyTorch functions, as desired. The only loop is in the fnv_hash function, that loops over the element bytes. For example, when computing a hash of an int32 tensor where each element has four bytes, the loop will have four iterations.

It appears that the shuffling problem has been addressed. Our next challenge is addressing the batching problem - how do we iterate over mini-batches of groups?

Mini-batches of groups

Suppose we have a group_id tensor that has been permuted using our shuffling code. Now we need to somehow divide it into mini-batches of groups. As with the previous challenge, we would like the code to be composed of vectorized PyTorch primitives, so that it is GPU friendly and fast.

Our first utility function is simple - it computes the start indices of the groups. For example, in the group-id tensor [8, 8, 8, 1, 1, 7, 7, 7, 7], we have three groups: the first begins at index 0, the second at index 3, and the last one at index 5. For convenience, we have an additional “empty” group after the end of the tensor, which is by definition after the last element, at index 9. The reason why it is convenient will be apparent soon.

Such indices are pretty straightforward to compute using the torch.unique_consecutive function, which returns the unique consecutive elements, and optionally their counts. The cumulative sum of the counts gives the start indices of all but the first group, together with the index right after the last element. The first group, by definition, is at index 0, and this is achieved by padding. So here is the function:

def group_idx(group_id):
  values, counts = group_id.unique_consecutive(return_counts=True)
  idx = torch.cumsum(counts, dim=-1)
  return torch.nn.functional.pad(idx, (1, 0))

Let’s test it:

group_id = torch.tensor([8, 8, 8, 1, 1, 7, 7, 7, 7])
indices = group_idx(group_id)
print(indices)

tensor([0, 3, 5, 9])

How does it help us? Well, suppose we want mini-batches of size two. The first mini-batch will be from sample 0 to sample 5. The next one will be from sample 5 to sample 9. Indeed, group_id[0:5] is the tensor of [8, 8, 8, 1, 1], containing two groups, and group_id[5:9] is the tensor of [7, 7, 7, 7], which is the last remaining group.

So let’s write a function that takes the result of group_idx as its input, and produces the start and end indices of each mini-batch. Suppose our batch size is 5. So it looks simple - just take items group_idx[0], group_idx[5], group_idx[10], ... for the start indices, and group_idx[5], group_idx[10], group_idx[15], ... for the end indices, right? Well, almost. There are certain special cases we need to take care of. First, what if we have fewer groups than our batch size? And second, what if the number of groups is not divisible by the batch size? In that case, this naive slicing would exclude the last batch. To make sure our code is correct, we will use the simple trick of padding with copies of the last index, and make sure that the number of elements is divisible by the batch size. It’s easy to see that it solves both special cases. So here is the function:

def batch_endpoint_indices(group_idx, batch_size):
  padding = batch_size - (len(group_idx) - batch_size * (len(group_idx) // batch_size))
  # repeat the last index `padding` times (replicate-style padding, written
  # with torch.cat so that it works for 1D integer tensors)
  group_idx = torch.cat([group_idx, group_idx[-1:].expand(padding)])
  start_points = group_idx[0:-1:batch_size]
  end_points = group_idx[batch_size::batch_size]
  return start_points, end_points
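To cross-check the endpoints this function is meant to produce, here is a plain-Python reference sketch that avoids padding by clamping the boundary positions instead. The function name and its list-based interface are hypothetical, and it is meant only as a sanity check of the intended result, not as a replacement for the vectorized version.

```python
def batch_endpoints_reference(group_starts, batch_size):
    """
    Hypothetical plain-Python reference. `group_starts` is the output of
    group_idx as a list, including the trailing "empty group" index.
    """
    n_groups = len(group_starts) - 1
    # boundary k sits at group number k * batch_size, clamped to the trailing index
    bounds = [group_starts[min(g, n_groups)]
              for g in range(0, n_groups + batch_size, batch_size)]
    return list(zip(bounds[:-1], bounds[1:]))


starts = [0, 3, 5, 9]  # group_idx of [8, 8, 8, 1, 1, 7, 7, 7, 7]
print(batch_endpoints_reference(starts, 2))  # [(0, 5), (5, 9)]
print(batch_endpoints_reference(starts, 3))  # [(0, 9)]
print(batch_endpoints_reference(starts, 5))  # [(0, 9)]
```

The last two calls cover the two special cases mentioned above: a group count that is not divisible by the batch size, and fewer groups than the batch size.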

Let’s try it out with our example:

group_id = torch.tensor([8, 8, 8, 1, 1, 7, 7, 7, 7])
from_idx, to_idx = batch_endpoint_indices(group_idx(group_id), batch_size=2)
for start, end in zip(from_idx, to_idx):
  print(start.item(), end.item())

0 5
5 9

As expected, 0 to 5, and 5 to 9. What if we try mini-batches of size 3?

group_id = torch.tensor([8, 8, 8, 1, 1, 7, 7, 7, 7])
from_idx, to_idx = batch_endpoint_indices(group_idx(group_id), batch_size=3)
for start, end in zip(from_idx, to_idx):
  print(start.item(), end.item())

0 9

As expected, one mini-batch, from 0 to 9. All three groups inside. So now we can put our utilities together into a class, similar to BatchIter, that will do the iteration for us:

class GroupBatchIter:
  def __init__(self, group_id, *tensors, batch_size=1, shuffle=True, shuffle_seed=42):
    self.group_id = group_id
    self.tensors = tensors
    
    if shuffle:
      self.idxs = lexsort(group_id, fnv_hash(group_id + shuffle_seed))
    else:
      self.idxs = torch.arange(len(group_id), device=group_id.device)
    
    group_start_indices = group_idx(group_id[self.idxs])
    self.batch_start, self.batch_end = batch_endpoint_indices(group_start_indices, batch_size)

  def __len__(self):
    return len(self.batch_start)
  

  def __iter__(self):
    # we create mini-batches containing both group-id, and the additional 
    # tensors
    tensors = (self.group_id,) + self.tensors

    # iterate over batch endpoints, and yield tensors
    for start, end in zip(self.batch_start, self.batch_end):
      batch_idxs = self.idxs[start:end]
      if len(batch_idxs) > 0:
        yield tuple(x[batch_idxs, ...] for x in tensors)

Now let’s try it out. First, we generate some data, and use Pandas for pretty-printing:

import pandas as pd

group_id = torch.tensor([8, 8, 8, 1, 1, 7, 7, 7, 7])
features = torch.arange(len(group_id) * 3).reshape(len(group_id), 3)
labels = torch.arange(len(group_id)) % 2

print(pd.DataFrame.from_dict({
    'group_id': group_id.tolist(),
    'features': features.tolist(),
    'labels': labels.tolist()
}))

   group_id      features  labels
       8     [0, 1, 2]       0
       8     [3, 4, 5]       1
       8     [6, 7, 8]       0
       1   [9, 10, 11]       1
       1  [12, 13, 14]       0
       7  [15, 16, 17]       1
       7  [18, 19, 20]       0
       7  [21, 22, 23]       1
       7  [24, 25, 26]       0

So we have three groups, and we are simulating some features of each sample, and binary labels. Now let’s try iterating with a batch size of two:

for gb, Xb, yb in GroupBatchIter(group_id, features, labels, batch_size=2, shuffle=True):
  print(pd.DataFrame.from_dict({
    'group_id': gb.tolist(),
    'features': Xb.tolist(),
    'labels': yb.tolist()
}))

   group_id      features  labels
       1   [9, 10, 11]       1
       1  [12, 13, 14]       0
       8     [0, 1, 2]       0
       8     [3, 4, 5]       1
       8     [6, 7, 8]       0
   group_id      features  labels
       7  [15, 16, 17]       1
       7  [18, 19, 20]       0
       7  [21, 22, 23]       1
       7  [24, 25, 26]       0

Indeed we see that the order has been changed, so shuffling happened. The first batch contains the samples from groups 1 and 8 - two groups, as specified by the batch size. The second batch contains samples from the remaining group 7. We also note that the order among the samples in each group is preserved.

So what about speed? Let’s try it out. We already have samples and labels from the previous batch iteration code without groups. So let’s just generate a group-id tensor, with 8 samples per group on average:

n_groups = n_samples // 8
group_id, _ = torch.multinomial(torch.ones(n_groups) / n_groups, n_samples, replacement=True).sort()
print(group_id[:50]) # print the first 50 group IDs

tensor([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6,
        6, 6])

Looks OK. Now let’s measure iteration speed with mini-batches of 64 groups:

%%time
for gb, Xb, yb in GroupBatchIter(group_id, X, y, batch_size=64, shuffle=True):
  pass

CPU times: user 178 ms, sys: 20 ms, total: 198 ms
Wall time: 199 ms

That’s fast, and it appears we are done :)

Summary

We wrote two batch iteration utilities - one for iterating over individual samples, and another one for iterating over groups of samples. Both are useful for different settings, and I hope you will find them useful to accelerate your experiments on a small scale, before you reach a larger scale. It certainly made me more productive, especially when working on experiments for papers. And most importantly, if you have a better way of implementing these utilities - please let me know!

References

Fey, M., & Lenssen, J. E. (2019). Fast Graph Representation Learning with PyTorch Geometric [Computer software]. https://github.com/pyg-team/pytorch_geometric

Fun with sparsity in PyTorch via Hadamard product parametrization

2024-07-07T00:00:00+00:00

Intro

Models having a high-dimensional parameter space, such as large neural networks, often pose a challenge when deployed on edge devices, due to various constraints. Two remedies are often suggested: pruning and quantization. In this post I’d like to concentrate on the idea of pruning, which amounts to removing neurons that we believe have little or no contribution to the overall model performance. PyTorch provides various heuristics for model pruning, that are explained in a tutorial in the official documentation.

I’d like to discuss a decades-old alternative idea to those heuristics - L1 regularization. It is “known” to promote sparsity, and there is an endless number of resources online explaining why. But there are very few resources explaining how this can be achieved in modern ML frameworks, such as PyTorch. I believe there are two major reasons for that.

The first reason is very direct - just adding an L1 regularization term to the cost we differentiate in each training loop, in conjunction with an optimizer such as SGD or Adam, will often not produce a sparse solution. You can find plenty of evidence online, such as here, here, or here. I want to avoid discussing why, and will just say that the reason is in the optimizers - they were not designed to properly handle sparsity-inducing regularizers, and some trickery is required.

The second reason stems from how software engineering is done. People want to re-use components or patterns. There is a very clear pattern of how PyTorch training is implemented, and we either implement it manually in simple cases, or rely on a helper library, such as PyTorch Ignite or PyTorch Lightning, to do the job for us.

So can we use sparsity-inducing regularization with PyTorch, in a way that nicely and easily integrates with the existing ecosystem? It turns out that there is an interesting stream of research that facilitates exactly that - the idea of sparse regularization by Hadamard parametrization. I first encountered it in a paper by Peter Hoff¹, and noticed that the idea has been further explored in several additional papers²³⁴. I believe this stream of research hasn’t received the attention (pun intended!) it deserves, since it allows an extremely easy way of achieving sparse L1 regularization that seamlessly integrates into the existing PyTorch ecosystem of patterns and libraries. In fact, the code is so embarrassingly simple that I am surprised that such parametrizations haven’t become popular.

The basic idea is very simple. Suppose we aim to find model weights $\mathbf{w}$ by minimizing the L1 regularized loss over our training set:

\[\tag{P} \min_{\mathbf{w}} \quad \frac{1}{n} \sum_{i=1}^n \ell_i(\mathbf{w}) + \lambda \|\mathbf{w}\|_1\]

We reformulate the problem $(P)$ above by representing $\mathbf{w}$ as a component-wise product of two vectors. Formally, $\mathbf{w} = \mathbf{u} \odot \mathbf{v}$ where $\odot$ is the component-wise (or Hadamard) product. And instead of solving $(P)$ we solve the problem below:

\[\tag{Q} \min_{\mathbf{u},\mathbf{v}} \quad \frac{1}{n} \sum_{i=1}^n \ell_i(\mathbf{u} \odot \mathbf{v}) + \lambda \left( \|\mathbf{u}\|_2^2 + \| \mathbf{v} \|_2^2 \right)\]

Note, that $(Q)$ uses L2 regularization! As it turns out¹, any local minimum of $(Q)$ is also a local minimum of $(P)$. L2 regularization is native to PyTorch in the form of the weight_decay parameter to its optimizers. But more importantly, parametrizations are also a native beast in PyTorch!
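A rough sketch of why a squared L2 penalty on the factors can act like an L1 penalty on the product, for intuition only (the precise statement and its proof are in the cited paper¹): for every component, the inequality of arithmetic and geometric means gives

\[u_i^2 + v_i^2 \geq 2 |u_i v_i| = 2 |w_i|,\]

with equality exactly when $|u_i| = |v_i|$. Summing over the components, and minimizing over all ways to factor a given $\mathbf{w}$, we obtain

\[\min_{\mathbf{u} \odot \mathbf{v} = \mathbf{w}} \left( \|\mathbf{u}\|_2^2 + \|\mathbf{v}\|_2^2 \right) = 2 \|\mathbf{w}\|_1.\]

So once the split of $\mathbf{w}$ into $\mathbf{u}$ and $\mathbf{v}$ is itself optimized, the squared L2 penalty behaves like an L1 penalty on $\mathbf{w}$, up to a constant factor that depends on how the penalty is scaled.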

We first begin with implementing this idea in PyTorch for a simple linear model, and then extend it to Neural Networks. This is, of course, not the best method to achieve sparsity. But it’s an extremely simple one, easy to try out for your model, and fun! As customary, the code is available in a notebook that you can deploy on Google Colab.

Parametrizing a linear model

In this section we will demonstrate how to implement Hadamard parametrization in PyTorch to train a linear model on a data-set, and verify that we indeed achieve sparsity similar to a truly optimal solution of $(P)$. We regard the solution achieved by CVXPY, which is a well-known convex optimization package for Python, as an “exact” solution.

Setting up the dataset

We begin from the data which we use throughout this section. We will use the Madelon dataset, which is a synthetic data-set that was used for the NeurIPS 2003 feature selection challenge. It’s available from openml as data-set 1485, and therefore we can use the fetch_openml function from scikit-learn to fetch it:

from sklearn.datasets import fetch_openml

madelon = fetch_openml(data_id=1485, parser='auto')

To get a feel of what this data-set looks like, let’s print it:

print(madelon.frame)

The output is:

       V1   V2   V3   V4   V5   V6   V7   V8   V9  V10  ...  V492  V493  V494  \
   485  477  537  479  452  471  491  476  475  473  ...   481   477   485   
   483  458  460  487  587  475  526  479  485  469  ...   478   487   338   
   487  542  499  468  448  471  442  478  480  477  ...   481   492   650   
   480  491  510  485  495  472  417  474  502  476  ...   480   474   572   
   484  502  528  489  466  481  402  478  487  468  ...   479   452   435   
...   ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...   ...   ...   ...   
493  458  503  478  517  479  472  478  444  477  ...   475   485   443   
481  484  481  490  449  481  467  478  469  483  ...   485   508   599   
485  485  530  480  444  487  462  475  509  494  ...   474   502   368   
477  469  528  485  483  469  482  477  494  476  ...   476   453   638   
482  453  515  481  500  493  503  477  501  475  ...   478   487   694   

      V495  V496  V497  V498  V499  V500  Class  
    511   485   481   479   475   496      2  
    513   486   483   492   510   517      2  
    506   501   480   489   499   498      2  
    454   469   475   482   494   461      1  
    486   508   481   504   495   511      1  
...    ...   ...   ...   ...   ...   ...    ...  
 517   486   474   489   506   506      1  
 498   527   481   490   455   451      1  
 453   482   478   481   484   517      1  
 471   538   470   490   613   492      1  
 493   499   474   494   536   526      2

So it’s a classification data-set with 500 numerical features, and two classes. Naturally, we will use the binary cross-entropy loss in our minimization problem.

At this stage our objective is just demonstrating properties of the model fitting procedure, rather than evaluating the performance of the model. Thus, for simplicity, we will not split into train / evaluation sets, and operate on the entire data-set.

To make it more friendly for model training, let’s first rescale it to zero mean and unit variance, and extract labels as values in $\{0, 1\}$:

import numpy as np
from sklearn.preprocessing import StandardScaler

scaled_data = np.asarray(StandardScaler().fit_transform(madelon.data))
labels = np.asarray(madelon.target.cat.codes)

Exact L1 regularization using CVXPY

Now let’s find an optimal solution of the L1 regularized problem $(P)$ using CVXPY, which is a Python framework for accurate solution of convex optimization problems. For Logistic Regression, the loss of the sample $(\mathbf{x}_i, y_i)$ is:

\[\ell_i(\mathbf{w}) = \ln(1+\exp(\mathbf{w}^T \mathbf{x}_i)) - y_i \cdot \mathbf{w}^T \mathbf{x}_i\]

This is obviously convex, due to the convexity of $\ln(1+\exp(x))$, which is modeled by the cvxpy.logistic function. The corresponding CVXPY code for constructing an object representing $(P)$ is:

import cvxpy as cp

# (coef, intercept) are the vector w.
coef = cp.Variable(scaled_data.shape[1])
intercept = cp.Variable()
reg_coef = cp.Parameter(nonneg=True)

pred = scaled_data @ coef + intercept  # <--- this is w^T x
loss = cp.logistic(pred) - cp.multiply(labels, pred)
mean_loss = cp.sum(loss) / len(scaled_data)
cost = mean_loss + reg_coef * cp.norm(coef, 1)  # scalar objective: mean loss plus the L1 penalty
problem = cp.Problem(cp.Minimize(cost))
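The per-sample loss used above is the familiar binary cross-entropy, just written in terms of the logit $\mathbf{w}^T \mathbf{x}_i$ instead of the predicted probability. The following plain-Python sketch checks this numerically on a few values; the function names are chosen here for illustration only.

```python
import math


def logistic_loss(z, y):
    # ln(1 + exp(z)) - y * z, where z = w^T x is the logit
    return math.log1p(math.exp(z)) - y * z


def binary_cross_entropy(z, y):
    p = 1.0 / (1.0 + math.exp(-z))  # sigmoid turns the logit into a probability
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


checks = [(z, y) for z in (-3.0, -0.5, 0.0, 0.5, 3.0) for y in (0, 1)]
max_gap = max(abs(logistic_loss(z, y) - binary_cross_entropy(z, y)) for z, y in checks)
print(max_gap)  # tiny, on the order of floating point rounding
```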

Of course, we don’t know at this stage which regularization coefficient to use to achieve sparsity, so let’s begin with $10^{-4}$:

reg_coef.value = 1e-4
problem.solve()
print(f'Loss at optimum = {mean_loss.value:.4g}')

I got the following output:

Loss at optimum = 0.5466

Let’s also plot the coefficients. The plot_coefficients function below just contains boilerplate to make the plot nice, and the ability to specify transparency and color for other parts of this post, where we want to make several plots on the same axes:

import matplotlib.pyplot as plt

def plot_coefficients(coefs, ax_coefs=None, alpha=1., color='blue', **kwargs):
  if ax_coefs is None:
    ax_coefs = plt.gca()
  markerline, stemlines, baseline = ax_coefs.stem(coefs, markerfmt='o', **kwargs)
  ax_coefs.set_xlabel('Feature')
  ax_coefs.set_ylabel('Weight')
  ax_coefs.set_yscale('asinh', linear_width=1e-6)  # linear near zero, logarithmic further from zero

  stemlines.set_linewidth(0.25)
  markerline.set_markerfacecolor('none')
  markerline.set_linewidth(0.1)
  markerline.set_markersize(2.)
  baseline.set_linewidth(0.1)

  stemlines.set_color(color)
  markerline.set_color(color)
  baseline.set_color(color)

  stemlines.set_alpha(alpha)
  markerline.set_alpha(alpha)
  baseline.set_alpha(alpha)
 
plot_coefficients(coef.value)

I got the following plot:

Note that the y-axis appears logarithmic due to the $\operatorname{arcsinh}$ scale. Doesn’t look sparse at all! So let’s try a larger coefficient:

reg_coef.value = 1e-2
problem.solve()
print(f'Loss at optimum = {mean_loss.value:.4g}')
plot_coefficients(coef.value)

The output is

Loss at optimum = 0.6188

The plot I obtained:

Now it looks much sparser! Let’s store the coefficients vector, we will need it in the remainder of this section to compare it to the results we achieve with PyTorch:

cvxpy_sparse_coefs = coef.value.copy()

I don’t know if this is a ‘good’ feature selection strategy for this specific dataset, but it’s not our objective. Our objective is showing how to implement Hadamard parametrization in PyTorch that recovers a similar sparsity pattern. So let’s do it!

Using PyTorch parametrization

Parametrizations in PyTorch allow representing any learnable parameter as a function of other learnable parameters. Typically, this is used to impose constraints. For example, we may represent a vector representing discrete event probabilities as the soft-max operation applied to a vector of arbitrary real values. A parametrization in PyTorch is just another module. Here is an example:

class SimplexParametrization(torch.nn.Module):
  def forward(self, x):
    return torch.softmax(x, dim=-1)  # softmax requires the dimension to normalize over

Now, suppose our model has a parameter called vec which we’d like to constrain to lie in the probability simplex. It can be done in the following manner:

torch.nn.utils.parametrize.register_parametrization(model, 'vec', SimplexParametrization())

Voilà!
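To see the whole thing end to end, here is a small self-contained sketch. The TinyModel class and the length of its vec parameter are hypothetical, made up only to have something to register the parametrization on.

```python
import torch
from torch import nn
from torch.nn.utils import parametrize


class SimplexParametrization(nn.Module):
    def forward(self, x):
        return torch.softmax(x, dim=-1)


class TinyModel(nn.Module):  # hypothetical model with a single parameter
    def __init__(self):
        super().__init__()
        self.vec = nn.Parameter(torch.randn(5))


model = TinyModel()
parametrize.register_parametrization(model, 'vec', SimplexParametrization())

# model.vec is now computed from an unconstrained tensor on every access
print(model.vec)                            # non-negative entries that sum to 1
print(model.parametrizations.vec.original)  # the unconstrained learnable tensor
```

The optimizer only ever sees the unconstrained tensor, while the rest of the model sees a point in the probability simplex.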

Since a parametrization is just another module, it can have its own learnable weights! So we can use this fact to easily parametrize the weights of a torch.nn.Linear module: we will regard its original weights as $\mathbf{u}$, the parametrization module will have its own weights $\mathbf{v}$, and will compute $\mathbf{u} \odot \mathbf{v}$. Here is the code:

import torch

class HadamardParametrization(torch.nn.Module):
  def __init__(self, in_features, out_features):
    super().__init__()
    self.in_features = in_features
    self.out_features = out_features
    self.v = torch.nn.Parameter(torch.ones(out_features, in_features))

  def forward(self, u):
    return u * self.v

Note, that I initialized the $\mathbf{v}$ vector to a vector of ones. This is because the first time a parametrization is applied, the forward function is called to compute the parametrized value, and I want to use the deep mathematical fact that $1$ is neutral w.r.t the multiplication operator to keep the original weight unchanged.

Let’s apply it to a linear layer:

layer = torch.nn.Linear(8, 1)
torch.nn.utils.parametrize.register_parametrization(layer, 'weight', HadamardParametrization(8, 1))

That’s it! Now if we train our linear model using PyTorch optimizers with weight_decay, we will in fact apply L1 regularization to the original weights. The weight decay is exactly equivalent to the L1 regularization coefficient. Under the hood, the layer.weight parameter is now represented as a Hadamard product of two tensors.

To get a feeling of what happens under the hood, let’s inspect our linear layer after applying the parametrization:

for name, param in layer.named_parameters():
  print(name, ': ', param)

The output I got is:

bias :  Parameter containing:
tensor([-0.1233], requires_grad=True)
parametrizations.weight.original :  Parameter containing:
tensor([[-0.0035,  0.2683,  0.0183,  0.3384, -0.0326,  0.1316, -0.1950, -0.0953]],
       requires_grad=True)
parametrizations.weight.0.v :  Parameter containing:
tensor([[1., 1., 1., 1., 1., 1., 1., 1.]], requires_grad=True)

We can see there are three trainable parameters. The bias of the linear layer, the original weight of the linear layer, which we now treat as the $\mathbf{u}$ vector, and the weight of the HadamardParametrization module, which is initialized to ones, which we treat as the $\mathbf{v}$ vector. What happens if we try to access the weight of the linear layer? Let’s see:

print(layer.weight)

Here is the output:

tensor([[-0.0035,  0.2683,  0.0183,  0.3384, -0.0326,  0.1316, -0.1950, -0.0953]],
       grad_fn=<MulBackward0>)

The values are identical to the original weights, since $\mathbf{v}$ is initialized to ones. But it has a MulBackward gradient back-propagation function, because under the hood it is computed as a product of two tensors.
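In the CVXPY formulation only coef is penalized, while the intercept is not. If we want the same behavior in PyTorch, one possible setup is to apply weight decay only to the two Hadamard factors, using optimizer parameter groups. This is a sketch of one option, not necessarily what the rest of this post does, and the weight decay and learning rate values are arbitrary placeholders.

```python
import torch
from torch import nn
from torch.nn.utils import parametrize


class HadamardParametrization(nn.Module):  # same module as above
    def __init__(self, in_features, out_features):
        super().__init__()
        self.v = nn.Parameter(torch.ones(out_features, in_features))

    def forward(self, u):
        return u * self.v


layer = nn.Linear(8, 1)
parametrize.register_parametrization(layer, 'weight', HadamardParametrization(8, 1))

# split the parameters: the factors u and v get weight decay, the bias does not
decay, no_decay = [], []
for name, param in layer.named_parameters():
    if name.startswith('parametrizations.weight'):
        decay.append(param)
    else:
        no_decay.append(param)

optimizer = torch.optim.SGD([
    {'params': decay, 'weight_decay': 1e-2},    # placeholder regularization strength
    {'params': no_decay, 'weight_decay': 0.0},
], lr=1e-2)
```

Passing weight_decay directly to the optimizer constructor is simpler, at the price of also shrinking the bias.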

Training a parametrized logistic regression model

To see our parametrization in action, we will need three components. First, a function that implements a pretty standard PyTorch training loop. Something that looks familiar, and without any trickery. Second, a function that plots its results. Third, a function that integrates the two above ingredients to train a Hadamard-parametrized logistic regression model.

Here is our pretty-standard PyTorch training loop. It returns the training loss achieved in each epoch in a list, so that we can plot it:

from tqdm import trange

def train_model(dataset, model, criterion, optimizer, n_epochs=500, batch_size=8):
  epoch_losses = []
  for epoch in trange(n_epochs):
    epoch_loss = 0.
    for batch, batch_label in torch.utils.data.DataLoader(dataset, batch_size=batch_size):
      # compute prediction and loss
      batch_pred = model(batch)
      loss = criterion(batch_pred, batch_label)
      epoch_loss += loss.item() * torch.numel(batch_label)
      
      # invoke the optimizer using the gradients.
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()
		
    epoch_losses.append(epoch_loss / len(dataset))
  return epoch_losses

Our second ingredient is a pair of plotting functions. Here is a function that plots the epoch losses:

def plot_convergence(epoch_losses, ax=None):
  if ax is None:
    ax = plt.gca()

  ax.set_xlabel('Epoch')
  ax.set_ylabel('Cost')
  ax.plot(epoch_losses)
  ax.set_yscale('log')
  last_iterate_loss = epoch_losses[-1]
  ax.axhline(last_iterate_loss, color='r', linestyle='--')
  ax.text(len(epoch_losses) / 2, last_iterate_loss, f'{last_iterate_loss:.4g}',
          fontsize=12, va='center', ha='center', backgroundcolor='w')

To get a feeling of what the output looks like, let’s plot a dummy list simulating a loss of $\exp(-\sqrt{i})$ in the $i$-th epoch:

plot_convergence(np.exp(-np.sqrt(np.arange(100))))

We can see a plot of the achieved loss on a logarithmic scale, and a horizontal line denoting the loss at the last epoch. We would also like to see the coefficients of our trained model, just like we did with the CVXPY models. So here is a function that plots the losses on the left, and the coefficients on the right. The coefficients are plotted together with ‘reference’ coefficients, so that we can visually compare our model to some reference. In our case, the reference coefficients are the ones we obtained from CVXPY.

def plot_training_results(model, losses, ref_coefs):
  # create figure and decorate axis labels
  fig, (ax_conv, ax_coefs) = plt.subplots(1, 2, figsize=(12, 4))
  plot_coefficients(ref_coefs, ax_coefs, color='blue', label='Reference')
  plot_coefficients(model.weight.ravel().detach().numpy(), ax_coefs, color='orange', label='Hadamard')
  ax_coefs.legend()
  plot_convergence(losses, ax_conv)
  plt.tight_layout()
  plt.show()

Our third and last ingredient is the function that integrates it all. It trains a Hadamard parametrized model, and plots the coefficients and the epoch losses:

import torch.nn.utils.parametrize

def train_parameterized_model(alpha, optimizer_fn, ref_coefs, **train_kwargs):
  model_shape = (scaled_data.shape[1], 1)
  model = torch.nn.Linear(*model_shape)
  torch.nn.utils.parametrize.register_parametrization(
      model, 'weight', HadamardParametrization(*model_shape)) # <-- this applies Hadamard parametrization

  dataset = torch.utils.data.TensorDataset(
      torch.as_tensor(scaled_data).float(),
      torch.as_tensor(labels).float().unsqueeze(1))  
  criterion = torch.nn.BCEWithLogitsLoss()  # <-- this is the loss for logistic regression
  optimizer = optimizer_fn(model.parameters(), weight_decay=alpha)
  epoch_losses = train_model(dataset, model, criterion, optimizer, **train_kwargs)
  
  plot_training_results(model, epoch_losses, ref_coefs)

Now let’s try it out with a regularization coefficient of $10^{-2}$. That is exactly the same coefficient we used to obtain the sparse coefficients with CVXPY. However, this is not CVXPY, and we also need to choose an optimizer and its parameters. I used the Adam optimizer with a learning rate of $10^{-4}$. And yes, I know⁵ that Adam’s weight decay is not exactly L2 regularization, but many use Adam as their go-to optimizer, and I want to demonstrate that the idea works with Adam as well:

from functools import partial

train_parameterized_model(alpha=1e-4,
                          optimizer_fn=partial(torch.optim.Adam, lr=1e-4),
                          ref_coefs=cvxpy_sparse_coefs)

Here is the result I got:

On the left, we can see the convergence plot. On the right, we can see coefficients from both CVXPY and the Hadamard parametrization. They almost coincide, with almost the same sparsity pattern. The training loss, 0.6194, is also pretty close to 0.6188, which is what we achieved with CVXPY.

Now, having seen that Hadamard parametrization indeed ‘induces sparsity’, just like its equivalent L1 regularization, we can do something more interesting, and apply it to neural networks.

Parametrizing a neural network

The concept of sparsity doesn’t necessarily fit neural networks in the best way, but a related concept of group sparsity does. We introduce it here, show how it is seamlessly implemented using a Hadamard product parametrization in PyTorch, and conduct an experiment with the famous California housing prices dataset.

Group sparsity

One caveat of the parametrization technique we saw is that it requires twice as many trainable parameters. For large neural networks this may be prohibitive in terms of time, space, or just the cost of training on the cloud. But with neural networks, it may be enough to produce a zero either at the neuron input level, or at the output level. For example, some of the neurons of a given layer produce a zero, whereas others do not.

This can be achieved through regularization that induces group sparsity, meaning that we would like entire groups of weights to be zero whenever the effect of the group on the loss is small enough. If we define the groups to be the columns of the weight matrices of our linear layers, we will achieve sparsity on neuron inputs. This is because in PyTorch the columns correspond to the input features of a linear layer.

One way to achieve this is to use the sum of the column norms as the regularization term. For example, suppose we have a 3-layer neural network whose weight matrices are $\mathbf{W}_1 \in \mathbb{R}^{8\times 3}, \mathbf{W}_2\in\mathbb{R}^{3\times 2}$, and $\mathbf{W}_3 \in \mathbb{R}^{2\times 1}$, and we are training over a data-set with $n$ samples with cost functions $\ell_1, \dots, \ell_n$. Then to induce column sparsity, we should train by minimizing

\[\begin{align*} \min_{\mathbf{W}_1, \mathbf{W}_2, \mathbf{W}_3} \quad \frac{1}{n} \sum_{i=1}^n \ell_i(\mathbf{W}_1, \mathbf{W}_2, \mathbf{W}_3) + \lambda \Bigl( &\| \mathbf{W}_{1,1} \|_2 + \| \mathbf{W}_{1,2} \|_2 + \| \mathbf{W}_{1,3} \|_2 + \\ &\| \mathbf{W}_{2,1} \|_2 + \| \mathbf{W}_{2,2} \|_2 + \\ &\| \mathbf{W}_{3,1} \|_2 \Bigr), \end{align*}\]

where $\mathbf{W}_{k,i}$ denotes the $i$-th column of the matrix $\mathbf{W}_k$. Seems a bit clumsy, but the regularizer just sums up the Euclidean norms of the weight matrix columns. Note, that the norms are not squared, so this is not our friendly neighborhood L2 regularization.
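For concreteness, here is a minimal sketch of how such a sum of column norms could be computed directly in PyTorch. The function name `column_norm_penalty` and the toy matrix are illustrative assumptions, not code from this post. This is the explicit regularizer that the parametrization trick lets us avoid writing.

```python
import torch

def column_norm_penalty(weights, lam):
  # sum of the (non-squared) Euclidean norms of all columns of all matrices
  total = sum(torch.linalg.vector_norm(w, ord=2, dim=0).sum() for w in weights)
  return lam * total

# toy matrix with two columns: (3, 4) has norm 5, and (0, 0) has norm 0
W = torch.tensor([[3.0, 0.0],
                  [4.0, 0.0]])
penalty = column_norm_penalty([W], lam=0.1)
print(penalty.item())
```

In a training loop this penalty would be added to the data loss, but it is not differentiable at columns that are exactly zero, which is precisely where we want the optimizer to land.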

It turns out²³⁴ that this is equivalent to a Hadamard product parametrization with our friendly neighborhood L2 regularization. This means that we can again use the weight_decay feature of PyTorch optimizers to achieve column sparsity. As we would expect, the parametrization operates on matrix columns, rather than individual components. A weight matrix $\mathbf{W} \in \mathbb{R}^{m \times n}$ will be parametrized by the matrix $\mathbf{U} \in \mathbb{R}^{m \times n}$ and the vector $\mathbf{v} \in \mathbb{R}^n$ by multiplying each column of $\mathbf{U}$ by the corresponding component of $\mathbf{v}$:

\[\mathbf{W} = \begin{bmatrix} \mathbf{U}_1 \cdot v_1 & \mathbf{U}_2 \cdot v_2 & \cdots & \mathbf{U}_n \cdot v_n \end{bmatrix}\]
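Here is a quick sanity check of why this works for a single column $j$. It is only a sketch, assuming the L2 penalty is written as $\frac{\lambda}{2}(\|\mathbf{U}_j\|_2^2 + v_j^2)$; the cited papers contain the full argument. Since the $j$-th column of $\mathbf{W}$ is $\mathbf{U}_j v_j$, the inequality between the arithmetic and the geometric mean gives

\[\frac{\lambda}{2}\left( \|\mathbf{U}_j\|_2^2 + v_j^2 \right) \geq \lambda \|\mathbf{U}_j\|_2 \, |v_j| = \lambda \|\mathbf{U}_j v_j\|_2,\]

with equality exactly when $\|\mathbf{U}_j\|_2 = |v_j|$. So among all the ways to factor the same column, the cheapest one costs $\lambda$ times the non-squared Euclidean norm of that column, which is the group sparsity penalty.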

The implementation in PyTorch is embarrassingly simple:

class InputsHadamardParametrization(torch.nn.Module):
  def __init__(self, in_features):
    super().__init__()
    self.v = torch.nn.Parameter(torch.ones(1, in_features))

  def forward(self, u):
    return u * self.v

Note, that we use the broadcasting ability of PyTorch to multiply each column of the argument u by the corresponding component of v.
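To see the broadcasting at work, here is a tiny sketch with made-up numbers. The tensors `u`, `v`, and `w` are illustrative only. A `(1, 3)` tensor multiplied by a `(2, 3)` tensor is stretched along the rows, so every column is scaled by its own component of `v`:

```python
import torch

u = torch.tensor([[0.0, 1.0, 2.0],
                  [3.0, 4.0, 5.0]])   # shape (2, 3): 2 outputs, 3 inputs
v = torch.tensor([[1.0, 0.0, 2.0]])   # shape (1, 3): one scale per input

w = u * v  # v is broadcast over the rows of u
print(w)
# tensor([[ 0.,  0.,  4.],
#         [ 3.,  0., 10.]])
```

Note how the zero in `v` wipes out the entire second column, which is exactly the group behavior we are after.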

Group sparsity in action

To see our idea in action, we shall use the California housing dataset, mainly due to its availability on Google Colab. It has 8 numerical features, and a continuous regression target. Let’s load it:

train_df = pd.read_csv('sample_data/california_housing_train.csv')
test_df = pd.read_csv('sample_data/california_housing_test.csv')

The first 5 rows of the train data-set are:

longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value
-114.31	34.19	15	5612	1283	1015	472	1.4936	66900
-114.47	34.4	19	7650	1901	1129	463	1.82	80100
-114.56	33.69	17	720	174	333	117	1.6509	85700
-114.57	33.64	14	1501	337	515	226	3.1917	73400
-114.57	33.57	20	1454	326	624	262	1.925	65500

The first 8 columns are the features, and the last column is the regression target. The data already comes split into a training set and a test set, so the only preprocessing we do is standardizing all the columns, including the target, using Scikit-Learn’s StandardScaler fitted on the training set. Then, we convert everything to PyTorch datasets:

# standardize features
scaler = StandardScaler().fit(train_df)
train_scaled = scaler.transform(train_df)
test_scaled = scaler.transform(test_df)

# convert to PyTorch objects
train_ds = torch.utils.data.TensorDataset(
    torch.as_tensor(train_scaled[:, :-1]).float(),
    torch.as_tensor(train_scaled[:, -1]).float().unsqueeze(1))
test_features = torch.as_tensor(test_scaled[:, :-1]).float()
test_labels = torch.as_tensor(test_scaled[:, -1]).float().unsqueeze(1)

Now we are ready. We will use the following simple four-layer neural network to fit the training set:

class Network(torch.nn.Module):
  def __init__(self):
    super().__init__()
    self.fc1 = torch.nn.Linear(8, 32)
    self.fc2 = torch.nn.Linear(32, 64)
    self.fc3 = torch.nn.Linear(64, 32)
    self.fc4 = torch.nn.Linear(32, 1)
    self.relu = torch.nn.ReLU()

  def forward(self, x):
    x = self.relu(self.fc1(x))
    x = self.relu(self.fc2(x))
    x = self.relu(self.fc3(x))
    x = self.fc4(x)
    return x

  def linear_layers(self):
    return [self.fc1, self.fc2, self.fc3, self.fc4]

The first layer has 8 input features, since our data-set has 8 features. Later layers expand it to 64 hidden features, and then shrink back. We don’t know if we will need all those dimensions, but that’s what we have our sparsity inducing regularization for - so that we can find out. Note that I added a linear_layers() method to be able to operate on all the linear layers of the network. It could be done in a generic manner by inspecting all modules and checking which ones are torch.nn.Linear, but I want to make the subsequent code simpler.
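For completeness, the generic alternative mentioned above could look roughly like the following sketch. The helper name `find_linear_layers` and the toy network are assumptions made for this example.

```python
import torch

def find_linear_layers(module):
  # modules() walks the whole module tree, including the module itself
  return [m for m in module.modules() if isinstance(m, torch.nn.Linear)]

toy = torch.nn.Sequential(
    torch.nn.Linear(8, 4), torch.nn.ReLU(), torch.nn.Linear(4, 1))
linear_layers = find_linear_layers(toy)
print([tuple(layer.weight.shape) for layer in linear_layers])
```

One caveat of this approach: the layers come back in the order they were registered, which is not necessarily the order in which `forward` uses them. That matters later, when masks are passed from one layer to the next.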

Let’s inspect our network to see how many parameters it has. To that end, we shall use the torchinfo package:

import torchinfo

network = Network()
torchinfo.summary(network, input_size=(1, 8))

Most of the output is not interesting, but one line is:

Trainable params: 4,513

So our network has 4513 trainable parameters. As we shall see, using sparsity inducing regularization we can let gradient descent (or Adam) discover how many dimensions we need! Let’s proceed to training our network with column-parametrized weights:

def parametrize_neuron_inputs(network):
  for layer in network.linear_layers():
    num_inputs = layer.weight.shape[1]
    torch.nn.utils.parametrize.register_parametrization(
        layer, 'weight', InputsHadamardParametrization(num_inputs))

parametrize_neuron_inputs(network)
epoch_costs = train_model(train_ds, network, torch.nn.MSELoss(),
                          n_epochs=200, batch_size=128,
                          optimizer=torch.optim.Adam(network.parameters(), lr=0.002, weight_decay=0.001))

Reusing our plot_convergence function from the previous section, we can see how the model trains:

plot_convergence(epoch_costs)

We can now also evaluate the performance on the test set:

def eval_network(network):
  network.eval()
  criterion = torch.nn.MSELoss()
  print(f'Test loss = {criterion(network(test_features), test_labels):.4g}')

eval_network(network)

The output is:

Test loss = 0.2997

So we did not over-fit. The test MSE is similar to the train MSE. Now let’s inspect our sparsity. To that end, I implemented a function that plots the matrix sparsity patterns of the four layers, where I regard any entry below some threshold as a zero. Nonzeros are white, whereas zeros are black. Here is the code:

def plot_network(network, zero_threshold=1e-5):
  fig, axs = plt.subplots(1, 4, figsize=(12, 3))
  layers = network.linear_layers()

  for i, (ax, layer) in enumerate(zip(axs.ravel(), layers), start=1):
    layer_weights = layer.weight.abs().detach().numpy()
    image = layer_weights > zero_threshold
    ax.imshow(image, cmap='gray', vmin=0, vmax=1)
    ax.set_title(f'Layer {i}')
  plt.tight_layout()
  plt.show()
 
plot_network(network)

Now here is a surprise! It is most apparent in the first layer. Our parametrization is supposed to induce sparsity on the columns of the matrices, but we see that it also induces sparsity on the rows. So what’s going on? Well, it turns out that if the inputs of some inner layer are unused, because the column weights are zero, we can also zero-out the corresponding rows of the layer before. Since the outputs of the layer before are unused by the layer after, it has no effect on the training loss, but reduces the regularization term. Indeed, careful inspection will show that the rows of the first layer that were fully zeroed out exactly correspond to the columns of the second layer that were zeroed out. This is true for any consecutive pair of layers. What’s truly amazing is that we didn’t have to do anything - gradient descent (or Adam, in this case) ‘discovered’ this pattern on its own!
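The ‘careful inspection’ can also be automated. Below is a minimal sketch with hand-made matrices. The helper names `dead_rows` and `dead_columns` and the threshold are assumptions made for this example. With a trained network, one would pass in something like `network.fc1.weight.detach()` and `network.fc2.weight.detach()` instead.

```python
import torch

def dead_rows(weight, threshold=1e-5):
  # True for output neurons whose incoming weights are all negligible
  return (weight.abs() <= threshold).all(dim=1)

def dead_columns(weight, threshold=1e-5):
  # True for inputs whose outgoing weights are all negligible
  return (weight.abs() <= threshold).all(dim=0)

# hand-made pair of consecutive layers: the middle hidden neuron is unused
w1 = torch.tensor([[1.0, 2.0],
                   [0.0, 0.0],
                   [3.0, 4.0]])        # 3 outputs, 2 inputs
w2 = torch.tensor([[5.0, 0.0, 6.0]])   # 1 output, 3 inputs

pattern_matches = torch.equal(dead_rows(w1), dead_columns(w2))
print(pattern_matches)
```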

Now that we know exactly which rows and columns we can remove, let’s write a function that does it. It’s a bit technical, and I don’t want to go into the PyTorch details, but you can read the code and convince yourself that this is exactly what the function below does for a linear layer - it computes a mask of the rows (output neurons) whose norm is not negligibly small, receives the mask of inputs from the previous layer, and keeps only the rows and columns selected by the two masks.

@torch.no_grad()
def shrink_linear_layer(layer, input_mask, threshold=1e-6):
  # compute mask of nonzero output neurons
  output_norms = torch.linalg.vector_norm(layer.weight, ord=1, dim=1)
  if layer.bias is not None:
    output_norms += layer.bias.abs()
  output_mask = output_norms > threshold

  # compute shrunk sizes
  in_features = torch.sum(input_mask).item()
  out_features = torch.sum(output_mask).item()

  # create a new shrunk layer
  has_bias = layer.bias is not None
  shrunk_layer = torch.nn.Linear(in_features, out_features, bias=has_bias)
  shrunk_layer.weight.set_(layer.weight[output_mask][:, input_mask])
  if has_bias:
    shrunk_layer.bias.set_(layer.bias[output_mask])
  return shrunk_layer, output_mask

Now let’s apply it to all four layers:

mask = torch.ones(8, dtype=bool)
network.fc1, mask = shrink_linear_layer(network.fc1, mask)
network.fc2, mask = shrink_linear_layer(network.fc2, mask)
network.fc3, mask = shrink_linear_layer(network.fc3, mask)
network.fc4, mask = shrink_linear_layer(network.fc4, mask)
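A useful sanity check after this kind of surgery is that the shrunk network computes the same function as the original one. Here is a self-contained toy sketch of that check, with exact zeros planted by hand. The two-layer setup and all the names are assumptions made for this example, and it does not use the function above.

```python
import torch

torch.manual_seed(0)
fc1, fc2 = torch.nn.Linear(4, 3), torch.nn.Linear(3, 1)

with torch.no_grad():
  # make hidden neuron 1 dead: no incoming weights, no bias, no outgoing weight
  fc1.weight[1].zero_()
  fc1.bias[1] = 0.0
  fc2.weight[:, 1].zero_()

keep = torch.tensor([True, False, True])  # mask of the surviving hidden neurons
small1, small2 = torch.nn.Linear(4, 2), torch.nn.Linear(2, 1)
with torch.no_grad():
  small1.weight.copy_(fc1.weight[keep])      # drop a row of the first layer
  small1.bias.copy_(fc1.bias[keep])
  small2.weight.copy_(fc2.weight[:, keep])   # drop a column of the second layer
  small2.bias.copy_(fc2.bias)

x = torch.randn(5, 4)
full_out = fc2(torch.relu(fc1(x)))
small_out = small2(torch.relu(small1(x)))
same_function = torch.allclose(full_out, small_out, atol=1e-6)
print(same_function)
```

With a trained network the removed weights are only approximately zero, so one would expect the outputs to agree up to a small tolerance rather than exactly.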

Note, that we replace the linear layers of the network with new ones. These new layers do not have a Hadamard parametrization, so now applying weight decay will apply the regular L2 regularization we are used to. Let’s see how many trainable weights our network has now:

torchinfo.summary(network, input_size=(1, 8))

Here is the output:

==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
Network                                  [1, 1]                    --
├─Linear: 1-1                            [1, 18]                   162
├─ReLU: 1-2                              [1, 18]                   --
├─Linear: 1-3                            [1, 15]                   285
├─ReLU: 1-4                              [1, 15]                   --
├─Linear: 1-5                            [1, 9]                    144
├─ReLU: 1-6                              [1, 9]                    --
├─Linear: 1-7                            [1, 1]                    10
==========================================================================================
Total params: 601
Trainable params: 601
Non-trainable params: 0
Total mult-adds (M): 0.00
==========================================================================================
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.00
Estimated Total Size (MB): 0.00
==========================================================================================

Only 601 trainable parameters! So let’s train its 601 remaining weights, now without any parametrizations:

epoch_costs = train_model(train_ds, network, torch.nn.MSELoss(),
                          n_epochs=200, batch_size=128,
                          optimizer=torch.optim.Adam(network.parameters(), lr=0.002, weight_decay=1e-6))
plot_convergence(epoch_costs)

We can also evaluate its performance:

eval_network(network)

The output is:

Test loss = 0.2385

How should we work in practice?

You may have noticed that I manually chose the learning rate and the weight decay for the parametrized network, and a different learning rate and weight-decay for the shrunk network. In practice, we should do hyperparameter tuning, and select the best combination that optimizes for some metric on an evaluation set. Namely, each hyperparameter tuning experiment performs two phases, just like what we did with our neural network in this post. In the first phase, it trains a parametrized network and shrinks it. The parametrization helps ‘discover’ the correct sparsity pattern. Then, in the second phase, it trains the shrunk network, and evaluates its performance. This is because we have no way of knowing in advance which hyperparameters will induce the ‘optimal’ sparsity pattern. So pseudo-code for a hyperparameter tuning experiment may look like this:

def tuning_objective(phase_1_lr, phase_1_alpha, phase_2_lr, phase_2_alpha):
  network = create_network()
  
  apply_hadamard_parametrization(network)
  train(network, phase_1_lr, phase_1_alpha)
  
  network = shrink_network(network)
  train(network, phase_2_lr, phase_2_alpha)
  
  return evaluate_performance(network)

So I recommend relying on the hyperparameter tuner to discover good parameters for the above objective, just like we rely on gradient descent to discover the ‘right’ sparsity pattern.

The idea of training first with sparsity inducing regularization, and then again without it, is not new. In fact, many statisticians working with Lasso do something similar: they first use Lasso for feature selection, and then re-train the model on the selected features without Lasso. This is because sparsity inducing regularization typically hurts performance by shrinking the remaining model weights too aggressively. This was a kind of “craftsman’s knowledge”, but recently some papers⁶⁷ formally analyzed this approach and made it more publicly known. This idea also has some resemblance to relaxed Lasso⁸.
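To illustrate the select-then-refit recipe on a linear model, here is a small sketch using Scikit-Learn on synthetic data. The data, the `alpha` value, and all variable names are assumptions made for this example. It is not the procedure from the cited papers, just the basic idea.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_coef = np.zeros(10)
true_coef[:3] = [2.0, -1.5, 1.0]          # only 3 features matter
y = X @ true_coef + 0.1 * rng.normal(size=200)

# phase 1: Lasso selects features, but also shrinks the surviving coefficients
lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)

# phase 2: plain least squares on the selected features only
refit = LinearRegression().fit(X[:, selected], y)
refit_coef = np.zeros(10)
refit_coef[selected] = refit.coef_

lasso_error = np.abs(lasso.coef_ - true_coef).max()
refit_error = np.abs(refit_coef - true_coef).max()
print(selected, lasso_error, refit_error)
```

The refitted coefficients should land closer to the true ones, because the second phase is no longer biased by the L1 penalty.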

Finally, if we have an inference “budget”, we may choose to inform our hyperparameter tuner that the cost for exceeding the budget is very high. For example, in the above tuning objective, we can replace the return statement by:

  return evaluate_performance(network) + 1000 * max(number_of_parameters(network) - budget, 0)

This way the tuner will try to avoid exceeding the budget, because of the high cost of each additional model parameter. Of course, the cost doesn’t have to be that extreme, and we can make it much less than 1000 units for each additional parameter, depending on our requirements.
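The helper `number_of_parameters` above is not defined anywhere in this post, so here is one way it could be sketched, together with the budget penalty. The function names, the default cost, and the toy network are assumptions made for this example.

```python
import torch

def number_of_parameters(network):
  # count only the parameters that are actually trained
  return sum(p.numel() for p in network.parameters() if p.requires_grad)

def budget_penalty(network, budget, cost_per_parameter=1000.0):
  # zero when within budget, linear in the number of excess parameters
  return cost_per_parameter * max(number_of_parameters(network) - budget, 0)

toy = torch.nn.Sequential(
    torch.nn.Linear(8, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
n_params = number_of_parameters(toy)   # 8 * 32 + 32 + 32 * 1 + 1 = 321
print(n_params, budget_penalty(toy, budget=300), budget_penalty(toy, budget=400))
```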

Conclusions

The beauty of sparsity inducing regularization is that we let our optimizer discover the sparsity patterns, instead of doing extremely expensive neural architecture search. And the beauty of Hadamard-product parametrization is that it lets us re-use existing optimizers of our ML frameworks to add sparsity-inducing regularizers, without having to write specialized custom optimizers. Maybe to some of you this may sound like Klingon, but for readers familiar with proximal minimization: writing a proximal operator for group sparsity inducing norm with componentwise learning rates using PyTorch, so that it is also GPU friendly, is extremely hard. But with Hadamard parametrization we don’t need to.

Beyond fully-connected networks, the idea can also be applied to convolutional nets - we can make each filter a “group”, and let gradient descent discover how many filters, or channels, we need in each convolutional layer. We can also apply it to factorization machines⁹, to discover the ‘right’ latent embedding dimension. The idea is extremely versatile!

I hope you had fun reading it as much as I had fun writing it, and see you in the next post!

1. Hoff, Peter D. “Lasso, fractional norm and structured sparse estimation using a Hadamard product parametrization.” Computational Statistics & Data Analysis 115 (2017): 186-198.
2. Ziyin, Liu, and Zihao Wang. “spred: Solving L1 Penalty with SGD.” International Conference on Machine Learning. PMLR, 2023.
3. Kolb, Chris, et al. “Smoothing the edges: a general framework for smooth optimization in sparse regularization using Hadamard overparametrization.” arXiv preprint arXiv:2307.03571 (2023).
4. Poon, Clarice, and Gabriel Peyré. “Smooth over-parameterized solvers for non-smooth structured optimization.” Mathematical Programming 201.1 (2023): 897-952.
5. Loshchilov, Ilya, and Frank Hutter. “Decoupled Weight Decay Regularization.” International Conference on Learning Representations (2019).
6. Belloni, Alexandre, and Victor Chernozhukov. “ℓ1-penalized quantile regression in high-dimensional sparse models.” (2011): 82-130.
7. Belloni, Alexandre, and Victor Chernozhukov. “Least squares after model selection in high-dimensional sparse models.” Bernoulli (2013): 521-547.
8. Meinshausen, Nicolai. “Relaxed lasso.” Computational Statistics & Data Analysis 52.1 (2007): 374-393.
9. Rendle, Steffen. “Factorization machines.” 2010 IEEE International Conference on Data Mining. IEEE, 2010.

Untilting the tilted loss

2024-06-14T00:00:00+00:00

Intro

Typically in machine learning we train a model by minimizing the average loss over the training set $\mathbf{x}_1, \dots, \mathbf{x}_n$¹, perhaps with some regularization. Mathematically, we solve the problem:

\[\min_\mathbf{w} \quad \frac{1}{n} \sum_{i=1}^n f(\mathbf{w}, \mathbf{x}_i)\]

A recently published JMLR paper² proposes an alternative, a tilted loss:

\[\min_\mathbf{w} \quad \frac{1}{t} \ln\left( \frac{1}{n} \sum_{i=1}^n \exp(t f(\mathbf{w}, \mathbf{x}_i)) \right)\]

The LogSumExp function $\mathbf{x} \to \ln(\sum_{i=1}^n \exp(x_i))$ serves as the “aggregator” of losses over individual samples, instead of just the plain average. The parameter $t$ can be thought of as a kind of ‘temperature’ of our aggregator. When $t \to \infty$, it converges to the worst-case loss over the training set. This is useful when we want to make sure that we perform reasonably well on the difficult instances as well, and not only on average. Conversely, when $t \to -\infty$, it converges to the best-case loss over the training set. This is useful for the opposite - perform well on most easy instances, and be less sensitive to “outliers”, or more difficult instances. Taking $t \to 0$, it converges to the regular average loss. Thus, it allows interpolating between fairness and robustness. For the sake of simplicity, in this post we shall assume that $t > 0$.
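These three limits are easy to check numerically. The sketch below uses four made-up per-sample losses. The function name `tilted_aggregate` and the specific values of $t$ are assumptions made for this example.

```python
import math
import torch

def tilted_aggregate(losses, t):
  # (1 / t) * ln( (1 / n) * sum_i exp(t * loss_i) ), computed stably
  n = losses.numel()
  return ((torch.logsumexp(t * losses, dim=0) - math.log(n)) / t).item()

losses = torch.tensor([0.1, 0.2, 0.3, 5.0], dtype=torch.float64)

near_mean = tilted_aggregate(losses, t=1e-6)    # close to the average, 1.4
near_max = tilted_aggregate(losses, t=100.0)    # close to the worst case, 5.0
near_min = tilted_aggregate(losses, t=-100.0)   # close to the best case, 0.1
print(near_mean, near_max, near_min)
```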

Off-the-shelf methods available in PyTorch and TensorFlow are based on stochastic gradients, and are designed to minimize averages over individual samples. However, the tilted loss is not an average due to the LogSumExp aggregation, and hence model training becomes a bit tricky. The paper proposes to use ‘tilted averaging’ on mini-batches, instead of regular averaging, but without a mathematical justification to the best of my knowledge. Intuitively, such a strategy minimizes some approximation of the tilted loss, but not the tilted loss itself.

In theory, since a logarithm is monotonic, and $\frac{1}{t}$ is positive, we could discard both and train on the tilted loss itself by minimizing an average of exponentials:

\[\frac{1}{n} \sum_{i=1}^n \exp(t f(\mathbf{w}, \mathbf{x}_i))\]

Let’s call the above a stripped reformulation, since we stripped away the logarithm. Now it is just an average over the samples, so it should be easy, right? In practice this may cause severe numerical problems, since exponentials tend to ‘explode’ even for a moderate value of the loss $f(\mathbf{w}, \mathbf{x}_i)$. The LogSumExp function itself, and its gradient, the SoftMax, are not hard to evaluate numerically - but the logarithm plays a crucial role. So can we devise a reformulation with better numerical properties, and still minimize the tilted loss itself, rather than an approximation?
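To make the ‘explosion’ concrete, here is a tiny sketch in single precision, with made-up values standing in for $t f(\mathbf{w}, \mathbf{x}_i)$. The variable names are illustrative. Exponentiating first overflows, whereas the built-in logsumexp evaluates the very same quantity without trouble.

```python
import math
import torch

scaled_losses = torch.tensor([1.0, 50.0, 100.0])  # float32 by default

# naive: exponentiate, average, take the logarithm
naive = torch.log(torch.mean(torch.exp(scaled_losses)))

# stable: let logsumexp handle the exponentials internally
stable = torch.logsumexp(scaled_losses, dim=0) - math.log(3)

print(naive.item(), stable.item())
```

The naive computation gives an infinity, since $\exp(100)$ does not fit in a 32-bit float, while the stable one gives a value of roughly $100 - \ln(3)$.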

I got interesting inspiration by remembering Prof. Francis Bach’s fascinating post about the so-called “η-trick” and its use in iteratively reweighted methods. The trick allows transforming a function that is “hard” to deal with to a function that is “easy” to deal with by adding additional auxiliary variables. So it made me wonder - can we do the same with the tilted loss to mitigate the numerical issues? In this post we will try to understand how the numerical issue is manifested in model training, and explore an idea to try to mitigate it to some extent. As usual, the code is available in a notebook.

Training on an average of exponentials

To understand how the ‘numerical problems’ we just discussed are manifested, let’s try to do a simple line fitting problem. We will generate data, and try out training a linear model to fit to noisy samples using the stripped reformulation on top of the MSE loss.

Data generation

We will generate noisy measurements around the line $y = 0.8x - 1$, and fit a line by minimizing the tilted loss of squared residuals.

Here is the sample generation code:

import numpy as np

# true line parameters
true_w = np.array([0.8, -1])

# sampling parameters
noise_strength = 0.3
n = 100

# sample random X and noisy Y coordinates
np.random.seed(42)
x = np.stack([np.random.randn(n), np.ones(n)], axis=-1)
y = x @ true_w + noise_strength * np.random.standard_t(df=3, size=n)

I used Student’s t-distribution for the noise to take advantage of its ‘heavy tails’, meaning that occasionally some samples will deviate further from the line than the majority of the samples. I also used 3 degrees of freedom, to make sure we have a finite variance, otherwise, demonstrating what I want in this post becomes even harder. Let’s take a look at the data:

import matplotlib.pyplot as plt

plt.scatter(x[:, 0], y)
plt.show()

Indeed, most of the samples lie along the line, but a few of them go a bit farther away.

PyTorch fitting

Now let’s try fitting with several PyTorch optimizers and several step sizes to see the behavior. Note, that we would like to understand the behavior of the optimizer as a minimization algorithm, and not as a learning algorithm. Thus, there will be no division into train/validation/test. Instead, we will just see how well we managed to minimize the desired loss on the training set we just sampled.

Our first ingredient is a loss for our stripped reformulation - something that takes an existing loss, multiplies by $t$, and exponentiates it:

class StrippedTiltedLoss:
  def __init__(self, underlying_loss, t):
    super().__init__()
    self.underlying_loss = underlying_loss
    self.t = t

  def __call__(self, pred, target):
    return torch.exp(self.t * self.underlying_loss(pred, target))

Next, we shall write a function that fits a line to the data using a given loss and a given optimizer. It’s just a standard PyTorch training loop:

import torch

def pytorch_fit(x, y, criterion, make_optim_fn, n_epochs=100):
  # convert numpy arrays to torch tensors
  x = torch.as_tensor(x)
  y = torch.as_tensor(y)

  # define initial w to be the zero vector
  w_fit = torch.nn.Parameter(torch.zeros_like(x[0]))

  # create optimizer
  optim = make_optim_fn([w_fit])

  # regular PyTorch training loop.
  for epoch in range(n_epochs):
    for xi, yi in zip(x, y):
      pred = torch.dot(xi, w_fit)
      loss = criterion(pred, yi)

      optim.zero_grad()
      loss.backward()
      optim.step()

  return w_fit.detach()

To get a feeling, let’s try the stripped tilted reformulation on top of the MSE loss:

from torch.nn import MSELoss

pytorch_fit(x, y, 
            criterion=StrippedTiltedLoss(MSELoss(), t=1), 
            make_optim_fn=lambda params: torch.optim.SGD(params, lr=1e-6))

The output I got is:

tensor([ 0.4708, -0.3032], dtype=torch.float64)

Doesn’t look like our line, so maybe we aren’t learning fast enough with such a small step size? Let’s try a larger step size:

pytorch_fit(x, y, 
            criterion=StrippedTiltedLoss(MSELoss(), t=1), 
            make_optim_fn=lambda params: torch.optim.SGD(params, lr=1e-4))

Now I got an output:

tensor([nan, nan], dtype=torch.float64)

We can add some printouts to understand what’s going on, but it’s quite simple. A large step size causes the weights to change sharply, which in turn causes large residuals, which in turn causes the gradients to become exponentially larger, causing even sharper changes to the learned weights.
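Here is a small sketch of how quickly the gradient grows with the residual. It assumes a single scalar weight, $t = 1$, and a loss of the form $\exp((w - r)^2)$, which is a simplification of the line-fitting setup made for this example. The gradient at $w = 0$ is $-2r\exp(r^2)$.

```python
import math
import torch

grads = {}
for residual in [1.0, 3.0, 5.0]:
  w = torch.tensor(0.0, dtype=torch.float64, requires_grad=True)
  loss = torch.exp((w - residual) ** 2)  # stripped tilted loss of one sample
  loss.backward()
  grads[residual] = w.grad.item()

for residual, grad in grads.items():
  print(f'residual = {residual}, gradient = {grad:.4g}')
```

Going from a residual of 1 to a residual of 5 multiplies the gradient magnitude by about eleven orders of magnitude, so a single badly-fit sample can throw the weights far away in one step.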

We can conjecture, therefore, that there is a very narrow range of step sizes that perform reasonably well. A step size too small will make little progress, whereas a step size too large makes too much progress, causing exploding gradients. Consequently, hyper-parameter tuning becomes difficult and expensive, since pinpointing just the right step-size may require many training episodes, and waste precious time or money. Let’s verify our conjecture numerically, and plot the true tilted loss we obtain for every step size.

Testing a set of step-sizes

Our first component is a function that computes the tilted loss for a given dataset. To ensure numerical accuracy and stability, I want to reuse PyTorch’s built-in logsumexp function. Using the fact that $\frac{1}{n} = \exp(-\ln(n))$, we can reformulate the tilted loss as:

\[\frac{1}{t} \ln\left( \frac{1}{n} \sum_{i=1}^n \exp(t f(\mathbf{w}, \mathbf{x}_i)) \right) = \frac{1}{t} \ln\left( \sum_{i=1}^n \exp(t f(\mathbf{w}, \mathbf{x}_i) - \ln(n)) \right)\]

Now we can use logsumexp to compute the tilted loss without numerical issues:

import math 

def compute_tilted_loss(w_fit, x, y, t):
  x = torch.as_tensor(x)
  y = torch.as_tensor(y)
  n = x.shape[0]
  squared_residuals = torch.square(x @ w_fit - y)
  return torch.logsumexp(t * squared_residuals - math.log(n), dim=-1) / t

Next, we write a function that experiments with our stripped equivalent of the tilted loss for various step-sizes with SGD:

from functools import partial
from tqdm.auto import tqdm

def eval_sgd_exponential_loss(x, y, lrs, t=1):
  losses = []
  for lr in tqdm(lrs):
    # train with the stripped reformulation using SGD with the given step size
    optim_factory = partial(torch.optim.SGD, lr=lr)
    w_exp_fit = pytorch_fit(x, y, StrippedTiltedLoss(MSELoss(), t), optim_factory)

    # evaluate the true tilted loss of the fitted line
    w_exp_loss = compute_tilted_loss(w_exp_fit, x, y, t).item()
    losses.append(w_exp_loss)

  return losses

Let’s plot the results for a fine grid of step-sizes:

lrs = np.geomspace(1e-7, 1e-4, 60).tolist()
losses = eval_sgd_exponential_loss(x, y, lrs)

plt.plot(lrs, losses)
plt.xscale('log')
plt.yscale('log')
plt.xlim([np.min(lrs), np.max(lrs)])

The x-axis is the step size, whereas the y-axis is the achieved value of the tilted loss. We can see that a tiny range, somewhere between $10^{-5}$ and $2 \times 10^{-5}$, results in a reasonable performance. Above that, we see that we have no data - that’s because the losses array contains NaNs. Our gradients exploded, and the optimizer failed for step sizes that are just a bit too large. Now let’s do some interesting tricks.

Trickery with logarithms

I do not recall exactly where, but in an exercise on convex optimization I encountered this simple fact:

\[\ln(z) = \min_v \{ z \exp(v) - v \} - 1\]

The proof is a two-liner - the expression inside the $\min$ is convex in $v$, so we just take its derivative w.r.t. $v$ and equate it with zero:

\[\begin{align*}\tag{V} &z \exp(v) - 1 = 0 \\ &v = -\ln(z) \end{align*}\]

Substitute this $v=-\ln(z)$ into the expression inside the $\min$ to get the desired result. Remember this formula for $v$ - it will be useful for PyTorch parameter initialization later in this post.
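
As a quick sanity check of this identity, here is a small NumPy sketch that replaces the exact minimization over $v$ with a brute-force search over a fine grid. The function name `variational_log` and the grid bounds are my own choices for the illustration, and the grid is assumed to be wide enough to contain $v = -\ln(z)$ for the tested values of $z$:

```python
import numpy as np

def variational_log(z, v_grid):
    # brute-force minimum of z * exp(v) - v over the grid, then subtract 1
    return np.min(z * np.exp(v_grid) - v_grid) - 1.0

# grid of candidate v values with a spacing of 1e-4
v_grid = np.linspace(-10.0, 10.0, 200_001)

for z in [0.01, 0.5, 1.0, 5.0, 1000.0]:
    # the two printed values should agree up to the grid resolution
    print(z, variational_log(z, v_grid), np.log(z))
```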

Writing a function as a minimum of a family of functions is called a variational formulation. So what we have is a variational formulation of the logarithm. Now let’s use it to do something useful. It’s a bit technical, but the end-result leads us in the right direction. We use the variational formulation of the logarithm in the tilted loss, and obtain the following:

\[\begin{aligned} \ln\left( \frac{1}{n} \sum_{i=1}^n \exp(t f(\mathbf{w}, \mathbf{x}_i)) \right) &= \min_v \left\{ \left( \frac{1}{n} \sum_{i=1}^n \exp(t f(\mathbf{w}, \mathbf{x}_i)) \right) \exp(v) - v \right\} - 1 \\ &= \min_v \left\{ \frac{1}{n} \sum_{i=1}^n \exp(t f(\mathbf{w}, \mathbf{x}_i) + v) - v \right\} - 1 \\ &= \min_v \left\{ \frac{1}{n} \sum_{i=1}^n \left( \exp(t f(\mathbf{w}, \mathbf{x}_i) + v) - v \right) \right\} - 1 \end{aligned}\]

When minimizing, the constant $-1$ at the end can also be stripped. Thus, training with a tilted loss amounts to solving the minimization problem

\[\min_{\mathbf{w}, v} \quad \frac{1}{n} \sum_{i=1}^n \left( \exp(t f(\mathbf{w}, \mathbf{x}_i) + v) - v \right)\]

Let’s call this the variational formulation of the tilted loss. At first it appears we have not done anything useful - it’s again an average of exponentials. But a closer examination reveals that if $v$ is negative it balances away large losses, and has a ‘stabilizing’ effect. So is it negative? Well, recalling equation $(V)$, the one I asked to remember, at the optimum we must have:

\[v = -\ln\left (\frac{1}{n} \sum_{i=1}^n \exp(t f(\mathbf{w}, \mathbf{x}_i)) \right )\]

Except for extremely rare cases, losses are non-negative, so for $t > 0$ their exponentials are at least 1, and therefore the argument of the logarithm is at least 1. This means that for any reasonable ML task, $v$ at the optimum indeed is negative. But we do not care only about the optimum, we care about what happens throughout the entire training process. Thus, this variational formulation is just a heuristic which may stabilize training, and we need to check whether it indeed does so.
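
The ‘stabilizing’ effect can be made concrete. Plugging the optimal $v$ into the exponentials $\exp(t f(\mathbf{w}, \mathbf{x}_i) + v)$ makes them average to exactly 1, so none of them can exceed $n$. Here is a minimal NumPy sketch with made-up loss values, where the helper name `optimal_v` is hypothetical:

```python
import numpy as np

def optimal_v(losses, t):
    # v = -ln(mean(exp(t * loss))), computed with the max-subtraction trick
    z = t * losses
    z_max = np.max(z)
    return -(z_max + np.log(np.mean(np.exp(z - z_max))))

# made-up per-sample losses spanning several orders of magnitude
losses = np.array([0.5, 3.0, 40.0, 120.0])
t = 1.0

v_star = optimal_v(losses, t)
stripped_terms = np.exp(t * losses)              # what the stripped loss averages
variational_terms = np.exp(t * losses + v_star)  # what the variational loss averages

print(v_star)                    # negative
print(stripped_terms.max())      # astronomically large
print(variational_terms.mean())  # equals 1 up to rounding
print(variational_terms.max())   # bounded by the number of samples
```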

An important component of making this heuristic useful is initializing $v$ properly, so that the first epochs don’t fail on large gradients. But it’s not hard - the formula for $v$ above is a great choice for initialization as well.

Testing our magic trick

Note that we need to learn an additional parameter $v$, which is conceptually not part of the model, but rather a part of the loss. Moreover, we will need a function to initialize our loss object, so that we can initialize $v$. To facilitate the above, our losses will inherit from torch.nn.Module, and will have an additional initialize method. Here is the stripped loss. Note that its initialization method does nothing:

class StrippedTiltedLoss(torch.nn.Module):
  def __init__(self, underlying_loss, t):
    super().__init__()
    self.underlying_loss = underlying_loss
    self.t = t

  def initialize(self, x, y):
    pass

  def forward(self, pred, target):
    exp_losses = torch.exp(self.t * self.underlying_loss(pred, target))
    return exp_losses.mean()

Here is the variational loss we just derived:

class VariationalTiltedLoss(torch.nn.Module):
  def __init__(self, underlying_loss, t):
    super().__init__()
    self.underlying_loss = underlying_loss
    self.t = t
    self.v = torch.nn.Parameter(torch.tensor(0.))

  def initialize(self, preds, targets):
    with torch.no_grad():
      init_losses = self.underlying_loss(preds, targets)
      n = init_losses.shape[0]

      v_init = -torch.logsumexp(self.t * init_losses - math.log(n), dim=-1)
      self.v.set_(v_init)

  def forward(self, pred, target):
    sample_loss = self.underlying_loss(pred, target)
    exp_tilted_losses = torch.exp(self.t * sample_loss + self.v) - self.v
    return exp_tilted_losses.mean()

It is assumed here that the underlying loss does not perform any reduction, such as averaging or summing the individual losses. We need the individual sample losses for our purposes, and thus we’re doing the averaging ourselves. In this post we train on individual samples, so no averaging is required, but in general people train on mini-batches of samples, and I wanted to make the code above re-usable in this scenario as well. This means that when passing the underlying loss, we need to tell it to avoid reducing, e.g. torch.nn.MSELoss(reduction='none')

To use losses that contain an additional parameter, and to call the initialize method properly, we make a small modification to the pytorch_fit function we wrote above:

from itertools import chain

def pytorch_fit(x, y, criterion, make_optim, n_epochs=100):
  # convert numpy arrays to torch tensors
  x = torch.as_tensor(x)
  y = torch.as_tensor(y)
  dim = x.shape[1]
  dtype = x.dtype

  # perform proper initialization
  w_fit = torch.nn.Parameter(torch.zeros(dim, dtype=dtype))
  criterion.initialize(x @ w_fit, y)

  # create optimizer - don't forget that the `criterion` now also has parameters!
  parameters_to_learn = chain(criterion.parameters(), [w_fit])
  optim = make_optim(parameters_to_learn)

  # regular PyTorch training loop.
  for epoch in range(n_epochs):
    for xi, yi in zip(x, y):
      pred = torch.dot(xi, w_fit)
      loss = criterion(pred, yi)

      optim.zero_grad()
      loss.backward()
      optim.step()

  return w_fit.detach()

Now let’s compare our two tilted loss formulations, the stripped and the variational, in terms of how precisely we need to pinpoint a narrow interval of step-sizes. We will expand our test a little bit, and try it for several values of the temperature $t$ and two different optimizers - SGD and Adam. To that end, we wrote a function that tries both losses for a given set of temperatures, a set of learning rates, and a given optimizer. The results are gathered in a Pandas DataFrame.

from functools import partial
from itertools import product
import pandas as pd

def compare_tilted_formulations(x, y, ts, lrs, optim_ctor):
  records = []
  for t, lr in tqdm(list(product(ts, lrs))):
      make_optim_fn = partial(optim_ctor, lr=lr)

      mse_loss = MSELoss(reduction='none')
      w_exp_fit = pytorch_fit(x, y, StrippedTiltedLoss(mse_loss, t), make_optim_fn)
      w_tilted_fit = pytorch_fit(x, y, VariationalTiltedLoss(mse_loss, t), make_optim_fn)

      w_exp_loss = compute_tilted_loss(w_exp_fit, x, y, t).item()
      w_tilted_loss = compute_tilted_loss(w_tilted_fit, x, y, t).item()
      records.append(dict(t=t, lr=lr, loss_type='stripped', value=w_exp_loss))
      records.append(dict(t=t, lr=lr, loss_type='variational', value=w_tilted_loss))

  return pd.DataFrame.from_records(records)

Now let’s conduct our experiments. We begin with SGD:

lrs = np.geomspace(1e-7, 1e-4, 20).tolist()
ts = [0.25, 1, 2]
sgd_eval_recs = compare_tilted_formulations(x, y, ts=ts, lrs=lrs, optim_ctor=torch.optim.SGD)

To plot the results, it will be convenient to use seaborn:

import seaborn as sns
sns.set()

g = sns.relplot(data=sgd_eval_recs,
                hue='loss_type', col='t', x='lr', y='value', alpha=0.5)
g.set(xscale='log')
g.set(ylim=[0, 10])
g.set(yscale='asinh')

We see that $t=0.25$ is not very challenging for either formulation. With $t=1$ we already begin to see the difference - the stripped variant works well only for a very narrow interval, whereas the variational variant works in a significantly larger range. With $t=2$, SGD fails altogether with the stripped variant.

But maybe SGD is less robust, so let’s try Adam:

adam_lrs = np.geomspace(1e-5, 1e2, 30)
adam_eval_recs = compare_tilted_formulations(x, y, ts=ts, lrs=adam_lrs, optim_ctor=torch.optim.Adam)

g = sns.relplot(data=adam_eval_recs,
                hue='loss_type', col='t', x='lr', y='value', alpha=0.5)
g.set(xscale='log')
g.set(ylim=[0, 20])
g.set(yscale='asinh')

Indeed, Adam is more robust. It does not fail miserably for larger values of $t$, but we can see a similar phenomenon. As $t$ increases, it becomes harder to ‘pinpoint’ just the right step-size. So indeed, this simple trick may significantly reduce the cost of training a model with a tilted loss, which matters when it’s important to perform well not just on average, but also close to the worst case.

Summary

The variational formulation of the logarithm indeed helped, at least on the line fitting exercise. I do not wish to invest the resources required to try it out with a neural network, but I hope the code here is generic enough for you to try it out on your own ML task.

A variational formulation for the logarithm is nice, but we might have been able to do much better if we had a useful variational formulation for the entire LogSumExp function. I personally do not know if such a closed-form formulation exists, but if you do - talk to me, and let’s write a paper!

I would like to thank Prof. Tian Li and her colleagues for their paper on tilted losses. It was enlightening, and I recommend you read it. Moreover, I thank Prof. Bach for providing the inspiration.

each $\mathbf{x}_i$ may be a pair consisting of features and label, so it subsumes supervised learning. ↩
Li, Tian, Ahmad Beirami, Maziar Sanjabi, and Virginia Smith. “On tilted losses in machine learning: Theory and applications.” Journal of Machine Learning Research 24, no. 142 (2023): 1-79. ↩

Regularization properties of polynomial bases

2024-06-03T00:00:00+00:00

Intro

Throughout this series, beginning here, we demonstrated various properties and applications of polynomial regression on different datasets. We used the Bernstein basis to demonstrate the importance of choosing a “good” polynomial basis, and that other well-known bases may be unfit for machine learning tasks. In this post, which concludes the series, we will try to understand why this happens by studying various regularization properties of the bases we encountered, including the standard power basis, the Chebyshev basis, the Legendre basis, and the Bernstein basis.

In this post we will not prove theorems, but rather demonstrate using an example. Therefore, there will be plenty of code and plots. And along the way we’ll learn some interesting tricks with linear regression and polynomials. All the code from this post is available in this notebook. So let’s get started!

The bias-variance tradeoff

When fitting data representing some “truth”, typically we observe a finite number of samples, and fit a model based on these samples. But what if, in theory, we did this again and again, and every time obtained a different set of samples? Well, our hope is that the corresponding models would, on average, be “close” to the truth.

Let’s simulate this by fitting a univariate function using polynomial regression. We will sample the function at some randomly chosen points, fit a polynomial, and repeat the experiment again and again.

Ingredients

Let’s write the components that will facilitate our experiments. We start by defining some interesting function $f$ to approximate:

import numpy as np

def f(x):
  first = np.sin(np.pi * (x + np.abs(x - 0.75) ** 1.5))
  second = np.cos(0.8 * np.pi * (np.abs(x + 0.75) ** 1.5 - x)) + 0.5
  return 2 * np.minimum(first, second)

Seems like we have some weird parameters there. To see why, let’s see what $f$ looks like on $[-1, 1]$:

import matplotlib.pyplot as plt

plot_n = 10000
plot_xs = np.linspace(-1, 1, plot_n)

plt.plot(plot_xs, f(plot_xs))
plt.show()

So I played a bit with the code in f above until I got this interesting plot - $f$ is composed of two functions joined at a “kink”. We will work with the interval $[-1, 1]$, since it’s easy to work with using NumPy’s built-in functions for polynomial bases.

To conduct our experiment, we will need a way to fit a polynomial using the basis of our choice by sampling some random points in $[-1, 1]$. Then, we would like to evaluate our polynomial on a dense grid of points in $[-1, 1]$ and compare it with the “truth”. To that end, we implemented a function that:

samples a set of points $\{ x_1, \dots, x_n \} \subseteq [-1, 1]$
fits a polynomial $p$ to $(x_1, f(x_1)), \dots, (x_n, f(x_n))$ of a given degree $d$, using a given basis $\mathbb{B}$, with a given L2 regularization coefficient $\alpha$.
evaluates $p$ at a given set of points.

Let’s see its code:

def fit_eval(vander_fn, eval_at, deg=20, n=40, reg_coef=0., ax=None, **plot_kws):
  # sample points for fitting
  xs = np.random.uniform(-1, 1, n)
  ys = f(xs)

  # build matrix and vector for least-squares regression
  vander_mat = vander_fn(xs, deg)
  if reg_coef > 0:
    coef_mat = np.identity(1 + deg) * np.sqrt(reg_coef)
    vander_mat = np.concatenate([vander_mat, coef_mat], axis=0)
    ys = np.concatenate([ys, np.zeros(1 + deg)], axis=-1)

  # compute polynomial coefficients
  coef = np.linalg.lstsq(vander_mat, ys, rcond=None)[0]

  # evaluate the polynomial at `eval_at`
  return vander_fn(eval_at, deg) @ coef

We can see by the default parameters that by default we sample 40 points, and fit a polynomial of degree 20. We will not change that throughout the post, but you are welcome to play with the notebook as you wish.

Note the code in the if reg_coef > 0 block in the function above. There is an interesting trick there for re-using existing NumPy libraries for least-squares regression, which are reliable and numerically stable, to solve a regularized least-squares problem. Note, that: $\| A w - y\|^2 + \alpha \|w\|^2 = \sum_{j=1}^m ( a_j^T w - y_j )^2 + \sum_{i=1}^n (\sqrt{\alpha} w_i - 0)^2 = \left\| \begin{bmatrix} A \\ \sqrt{\alpha} I \end{bmatrix} w - \begin{bmatrix} y \\ 0 \end{bmatrix} \right\|^2$

So, a regularized regression problem is reducible to a simple least-squares problem, with a data matrix padded by $\sqrt{\alpha} I$, and the labels vector padded by zeros. That’s exactly what the code in the aforementioned block does. The reason I used this trick is to rely only on a small set of Python libraries, and avoid dependencies on scikit-learn and others.
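
To convince ourselves that the padding trick is correct, here is a small self-contained check on made-up random data. It compares the solution of the padded least-squares problem with the well-known closed-form ridge solution $(A^T A + \alpha I)^{-1} A^T y$. The variable names and the data are assumptions made only for this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 5))   # made-up data matrix
y = rng.normal(size=30)        # made-up labels
alpha = 0.1                    # L2 regularization coefficient

# padded problem: stack sqrt(alpha) * I under A, and zeros under y
A_aug = np.concatenate([A, np.sqrt(alpha) * np.identity(5)], axis=0)
y_aug = np.concatenate([y, np.zeros(5)])
w_aug = np.linalg.lstsq(A_aug, y_aug, rcond=None)[0]

# closed-form ridge solution, for comparison only
w_closed = np.linalg.solve(A.T @ A + alpha * np.identity(5), A.T @ y)

print(np.max(np.abs(w_aug - w_closed)))  # tiny, up to rounding errors
```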

So let’s see how it works:

from numpy.polynomial.polynomial import polyvander

ys = fit_eval(polyvander, plot_xs, deg=10)
plt.plot(plot_xs, ys, color='k')
plt.plot(plot_xs, f(plot_xs), color='r')
plt.show()

Doing it once is nice, but we aim to repeat this experiment many times. So let’s write a function that does just that:

def fit_eval_samples(eval_at, vander_fn, n_iter=1000, **fit_eval_kwargs):
  y_samples = []
  for i in range(n_iter):
    ys = fit_eval(vander_fn, eval_at, **fit_eval_kwargs)
    y_samples.append(ys)
  y_samples = np.vstack(y_samples)
  y_true = f(eval_at)
  return y_samples, y_true

This function samples n_iter sets of points, computes n_iter least-squares fits, and evaluates each of the resulting polynomials at the evaluation points eval_at. The results are organized into the rows of the matrix y_samples - the $i$-th row contains the values of the $i$-th polynomial. For convenience, it also computes the values of our “true” function $f(x)$ at the evaluation points.

Finally, since we will be working on $[-1, 1]$, and the interval of approximation of the Bernstein basis is $[0, 1]$, let’s implement the Bernstein Vandermonde function we already encountered, with appropriate scaling:

from scipy.stats import binom as binom_dist

def bernvander(x, deg, lb=-1, ub=1):
  x = np.array(x)
  x = np.clip((x - lb) / (ub - lb), 0, 1)  # rescale to [0, 1], and clip to that interval
  return binom_dist.pmf(np.arange(1 + deg), deg, x.reshape(-1, 1))
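
As a sanity check of the rescaling, here is an alternative sketch of the same matrix that avoids SciPy and computes the Bernstein basis functions $\binom{d}{k} u^k (1-u)^{d-k}$ directly. The name `bernvander_comb` is hypothetical, and this is not the function used in the rest of the post. It lets us verify two well-known properties: every row sums to 1, and at the two endpoints only the first or the last basis function is active:

```python
from math import comb
import numpy as np

def bernvander_comb(x, deg, lb=-1.0, ub=1.0):
    # rescale x from [lb, ub] to u in [0, 1], as a column vector
    u = np.clip((np.asarray(x, dtype=float) - lb) / (ub - lb), 0.0, 1.0).reshape(-1, 1)
    k = np.arange(deg + 1)
    binom_coefs = np.array([comb(deg, int(i)) for i in k], dtype=float)
    # row i contains all the Bernstein basis functions evaluated at x[i]
    return binom_coefs * u ** k * (1.0 - u) ** (deg - k)

B = bernvander_comb(np.linspace(-1, 1, 7), deg=5)
print(B.shape)        # one row per point, one column per basis function
print(B.sum(axis=1))  # each row sums to 1
print(B[0], B[-1])    # only the first / last basis function is active
```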

Now that we have all our ingredients in place, let’s visualize and analyze bias and variance.

Visualizing bias and variance

Let’s use our fit_eval_samples function to plot a large number of fits according to a given polynomial basis. Our function plots the fits and the true function returned by fit_eval_samples, and also the average polynomial, by averaging the returned y_samples array. Each one of the fit polynomials will be drawn in a transparent manner, so that we can see their density. And moreover, since badly fit polynomials “go crazy” near the boundaries, the function also accepts the y-axis limits for plotting.

def plot_basis_fits(n_iter, vander_fn, ylim=[-3, 3], alpha=0.1, ax=None, **fit_eval_kwargs):
  ax = ax or plt.gca()
  plot_xs = np.linspace(-1, 1, 10000)
  samples, y_true = fit_eval_samples(plot_xs, vander_fn, n_iter, **fit_eval_kwargs)
  mean_poly = np.mean(samples, axis=0)
  ax.plot(plot_xs, samples.T, 'r', alpha=alpha)
  ax.plot(plot_xs, mean_poly, color='blue', linewidth=2.)
  ax.plot(plot_xs, y_true, 'k--')
  ax.set_ylim(ylim)

Now we can use it to plot all our bases. The following function does just that by showing 100 different fit polynomials for every basis:

from numpy.polynomial.chebyshev import chebvander
from numpy.polynomial.legendre import legvander
from numpy.polynomial.polynomial import polyvander

def plot_all_bases(**plot_loop_kwargs):
  fig, axs = plt.subplots(2, 2, figsize=(10, 8))

  plot_basis_fits(100, polyvander, ax=axs[0, 0], **plot_loop_kwargs)
  axs[0, 0].set_title('Standard')

  plot_basis_fits(100, chebvander, ax=axs[0, 1], **plot_loop_kwargs)
  axs[0, 1].set_title('Chebyshev')

  plot_basis_fits(100, legvander, ax=axs[1, 0], **plot_loop_kwargs)
  axs[1, 0].set_title('Legendre')

  plot_basis_fits(100, bernvander, ax=axs[1, 1], **plot_loop_kwargs)
  axs[1, 1].set_title('Bernstein')

  plt.show()

Let’s use it to visualize bias and variance without regularization:

plot_all_bases(reg_coef=0.)

The dashed black line is the true function, the blue line is the average among the 100 polynomials, and the transparent red lines are the polynomials themselves. As expected, without regularization, the fit polynomials with all bases appear ‘crazy’. Moreover, even the average polynomial appears to be far away from the true function near the boundaries.

The difference between the average polynomial and the true function is called the bias, whereas the spread of the different polynomials around the average is called the variance. Of course, the bias and variance are different at every point. Near $x=0$, they are pretty small, and as we approach the boundaries, both increase¹. The bias-variance tradeoff is a well-known concept in statistics, and a large body of research has been devoted to its study. Here, we will try to approach it from a more empirical perspective.

Ideally, we would like both the bias and the variance to be small. A small bias means that on average over the samples, our polynomial represents the truth. A low variance means that regardless of the specific data-set, we will be always close to this truth, meaning that we will generalize well.

Typically, when measuring the bias, the deviation from the true function is squared. This is convenient, since the squared bias and the variance are a decomposition of the mean squared error. Informally speaking:

\[\mathbb{E}[\mathrm{error}^2] = \mathbb{E}[\mathrm{bias}^2] + \mathrm{variance}\]

A more formal introduction can be found in the above-linked Wikipedia article, and references therein.
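
At a single evaluation point, the decomposition is easy to verify numerically. The sketch below uses a made-up estimator, which is the truth plus a constant offset plus Gaussian noise, rather than our polynomial fits. The mean squared error matches the squared bias plus the variance up to rounding:

```python
import numpy as np

rng = np.random.default_rng(0)
truth = 2.0

# made-up estimator: systematically off by 0.3, with noise of standard deviation 0.5
estimates = truth + 0.3 + rng.normal(scale=0.5, size=10_000)

mse = np.mean((estimates - truth) ** 2)
bias_sq = (np.mean(estimates) - truth) ** 2
variance = np.var(estimates)

print(mse, bias_sq + variance)  # the two numbers coincide
```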

Both bias and variance can be reduced either by a better estimation procedure or by more data. Let’s see what happens when we add more data - instead of using our default, and sampling 40 points for our least-squares regression, we will sample 200:

plot_all_bases(n=200, reg_coef=0.)

Appears much better! The mean polynomial almost coincides with the true function, and there’s little wiggling of individual polynomials around it. In this example, we have a very simple data-set with only one feature. In practice, data-sets are finite and contain plenty of features, so there is not always enough data to learn all the coefficients of a model to the required precision.

As we pointed out, the bias-variance tradeoff also depends on the estimation procedure, and not only on the amount of data we have. Often in practice, our data-sets are finite, and we need to adapt the estimation procedure as well. Here we have two means to affect the estimation procedure - the choice of the basis, and the regularization coefficient. So let’s try using some regularization:

plot_all_bases(reg_coef=1e-3, ylim=[-1.5, 1.5])

We can see that with this regularization coefficient, the Bernstein and the power bases behave much better, with the Bernstein basis being a bit better in terms of variance. It even looks close to what we can achieve with more data. But why do the two other bases perform poorly? Maybe we’re under-regularizing them? Let’s try a larger coefficient:

plot_all_bases(reg_coef=1e-1, ylim=[-1.5, 1.5])

We can clearly see we are over-regularizing the standard and the Bernstein bases - the average polynomial, the blue line, begins to get smoother and farther away from the true function. This is the expected increase of bias as a result of regularization. But the Chebyshev and Legendre bases are still a bit wiggly - so let’s try an even more aggressive regularization:

plot_all_bases(reg_coef=1, ylim=[-1.5, 1.5])

It appears that these two bases do not improve with regularization - their bias increases, without a significant improvement to the variance. So both components of the estimation procedure are crucial - the regularization and the basis.

Visualization is nice, but let’s measure these effects. We will try several regularization strengths, and for each strength - we will compute the average bias and variance we encounter among the evaluation points. Since both the bias and the variance vary along the interval $[-1, 1]$, we will average the squared bias and variance over the interval. Computing the mean squared bias and the variance for a given basis is straightforward:

def bias_variance_tradeoff(vander_fn, reg_coefs, nx=1000, **fit_eval_kwargs):
  xs = np.linspace(-1, 1, nx)
  biases = []
  vars = []
  for reg_coef in reg_coefs:
    y_samples, y_true = fit_eval_samples(xs, vander_fn, reg_coef=reg_coef, **fit_eval_kwargs)
    
    # mean squared bias over the samples, averaged over the interval [-1, 1]
    bias_agg = np.mean((np.mean(y_samples, axis=0) - y_true) ** 2)
    
    # variance over the samples, averaged over the interval [-1, 1]
    variance_agg = np.mean(np.var(y_samples, axis=0))
    
    biases.append(bias_agg)
    vars.append(variance_agg)

  return biases, vars

So now let’s define regularization coefficients and conduct our experiment. It will be convenient to gather all the data to a Pandas dataframe, and plot it later:

import pandas as pd

reg_coefs = np.geomspace(1e-8, 1e2, 64)

biases, vars = bias_variance_tradeoff(polyvander, reg_coefs)
power_df = pd.DataFrame({'reg_coef': reg_coefs, 'bias': biases, 'variance': vars, 'basis': 'Power'})

biases, vars = bias_variance_tradeoff(chebvander, reg_coefs)
cheb_df = pd.DataFrame({'reg_coef': reg_coefs, 'bias': biases, 'variance': vars, 'basis': 'Chebyshev'})

biases, vars = bias_variance_tradeoff(legvander, reg_coefs)
leg_df = pd.DataFrame({'reg_coef': reg_coefs, 'bias': biases, 'variance': vars, 'basis': 'Legendre'})

biases, vars = bias_variance_tradeoff(bernvander, reg_coefs)
ber_df = pd.DataFrame({'reg_coef': reg_coefs, 'bias': biases, 'variance': vars, 'basis': 'Bernstein'})

all_df = pd.concat([power_df, cheb_df, leg_df, ber_df])

Let’s see a sample of our data-frame:

print(all_df)

Here is the result I got:

        reg_coef      bias  variance      basis
 1.000000e-08  0.012936  4.517494      Power
 1.441220e-08  0.024666  1.617035      Power
 2.077114e-08  0.017855  2.934622      Power
 2.993577e-08  0.025465  1.882831      Power
 4.314402e-08  0.005049  2.150978      Power
..           ...       ...       ...        ...
2.317818e+01  0.594062  0.000286  Bernstein
3.340485e+01  0.614740  0.000133  Bernstein
4.814372e+01  0.629843  0.000076  Bernstein
6.938568e+01  0.640700  0.000036  Bernstein
1.000000e+02  0.648413  0.000018  Bernstein

[256 rows x 4 columns]

For every regularization coefficient and basis choice, we have a bias and a variance measurement. Let’s plot these using a seaborn scatterplot:

import seaborn as sns

to_plot = all_df.copy()
to_plot['size'] = np.log(to_plot['reg_coef'])

sns.scatterplot(data=to_plot, x='bias', y='variance', hue='basis', size='size')
ax = plt.gca()
ax.set_xscale('log')
ax.set_yscale('log')
ax.legend(bbox_to_anchor=(1.05,1), loc=2, borderaxespad=0.)
plt.show()

For each basis we have a different color, and the size of the points corresponds to the regularization coefficients, with larger points representing more aggressive regularization:

Now we begin to understand the full picture. First, polynomials need regularization. A mild coefficient results in both a high measured bias and a high variance. At some point, we land on a nice trade-off curve. Moreover, we now see why the Bernstein basis was so successful in all our experiments before - it achieves a much better bias-variance tradeoff. Looking at the bottom-left part, we can see that it can achieve both low bias and low variance.

Does this phenomenon have a formal proof? Well, I wasn’t able to find one. But in this elaborate stackexchange answer I was able to find a proof for a different sampling procedure, where the noise doesn’t come from sampling different data-sets, but from introducing noise in the $y$ coordinate. I suppose a similar proof could be derived for the case of a random data-set selection. I don’t believe I discovered something new here, but merely learned something that is “known” but was not formally published. If you have found something - please email me, and I will be glad to update the post and give proper credit.

To summarize, we now understand that not only the class of models is important, but also its representation. Indeed, the class of polynomial functions can be represented with different bases, but some bases perform better than others for machine-learning tasks. So I think the most important lesson from this series is that:

We cannot judge a class of models on its own, without considering a concrete representation, since its performance is often tightly coupled to the representation of choice.

Now let’s move to studying a different, but important theoretical property of the Bernstein basis.

Derivative sign regularization

Throughout this series we regularized the derivative of the polynomials to achieve a certain goal, either smoothness or monotonicity. I will concentrate on monotonicity, since it’s simpler to study. In an earlier post we saw a theorem with an interesting consequence - if the coefficients of a polynomial in Bernstein form are monotone increasing, then the polynomial is monotone increasing. A similar result is obtained for a decreasing sequence.

But what about the converse implication? Does every monotone-increasing polynomial also have a monotone-increasing coefficient sequence when represented according to the Bernstein basis? Well, the answer is NO. This means that when we fit a polynomial in the Bernstein basis with an increasing coefficient sequence, like we did in a previous post, we are not guaranteed to get the best-fit increasing polynomial. There may be another increasing polynomial, whose Bernstein coefficients are not increasing, but it achieves a smaller training error.
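
Here is one concrete counterexample I worked out by hand. It is an illustration of the claim, not something taken from a reference. On $[0, 1]$, the cubic $p(u) = u^3 - 1.5 u^2 + u$ has the derivative $3u^2 - 3u + 1$, which is positive everywhere, so $p$ is strictly increasing. Its Bernstein coefficients are $(0, \tfrac{1}{3}, \tfrac{1}{6}, \tfrac{1}{2})$, which are not increasing. The sketch below checks both facts numerically. It works directly on $[0, 1]$, since the affine rescaling from $[-1, 1]$ does not affect monotonicity:

```python
from math import comb
import numpy as np

coefs = np.array([0.0, 1 / 3, 1 / 6, 1 / 2])  # Bernstein coefficients of p
deg = len(coefs) - 1

# evaluate the Bernstein basis of degree 3 on a dense grid of [0, 1]
u = np.linspace(0.0, 1.0, 2001).reshape(-1, 1)
k = np.arange(deg + 1)
binom_coefs = np.array([comb(deg, int(i)) for i in k], dtype=float)
basis = binom_coefs * u ** k * (1.0 - u) ** (deg - k)
values = basis @ coefs

# the same polynomial in the power basis, for comparison
power_values = u.ravel() ** 3 - 1.5 * u.ravel() ** 2 + u.ravel()

coefs_increasing = bool(np.all(np.diff(coefs) >= 0))
poly_increasing = bool(np.all(np.diff(values) > 0))
print(coefs_increasing, poly_increasing)
```

The coefficient sequence fails the monotonicity test, while the polynomial values on the grid are strictly increasing.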

So an interesting question begs to be answered - how far apart are increasing polynomials, and Bernstein polynomials with increasing coefficients? If this distance is small, the above fact should not bother us too much. But what if it is large? So let’s try to study this question empirically. This is not a formal proof, but rather a demonstration of some interesting phenomena.

What we will do is generate random increasing polynomials, and then try to find a least-squares fit to these polynomials using the Bernstein basis with increasing coefficients. So first we need to understand one important thing - how do we generate a polynomial that is guaranteed to be increasing on $[-1, 1]$? Having understood that, we will be able to write a simple Python function to generate random increasing polynomials. Our plan is simple - we first learn how to generate a non-negative polynomial on $[-1, 1]$, and then compute its integral to obtain an increasing polynomial. To that end, let’s dive into century-old results on non-negative polynomials, initiated by none other than David Hilbert.

We begin by introducing the concept of a polynomial that is a sum of squares. A polynomial $p(x)$ is a sum of squares if there exist polynomials $q_1, \dots, q_m$ such that:

\[p(x) = q_1^2(x) + \dots + q_m^2(x)\]

Obviously, any sum-of-squares polynomial is non-negative on the entire real line. But we are not interested in the entire real line, only in the interval $[-1, 1]$. It turns out that there exists a theorem for characterizing non-negative polynomials on this interval.

Theorem (Blekherman et al. ², Theorem 3.72)

A polynomial $p: \mathbb{R} \to \mathbb{R}$ is non-negative on $[a, b]$ if and only if:

if the degree of $p$ is the odd number $2d+1$, then $p(x) = (x - a) \cdot s(x) + (b-x) \cdot t(x)$ where $s(x)$ and $t(x)$ are sum-of-squares polynomials of degree at most $2d$.

if the degree of $p$ is the even number $2d$, then $p(x) = s(x) + (x-a) \cdot (b-x) \cdot t(x)$ where $s(x)$ and $t(x)$ are sum-of-squares polynomials of degrees at most $2 d$ and $2 d - 2$, respectively.
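
To get a feeling for the even-degree case, here is a small hand-picked example with $d = 1$ on $[a, b] = [-1, 1]$. The choices $s(x) = (x - 0.5)^2$ and $t(x) = 2$ are mine, just for illustration. The resulting polynomial is non-negative on the interval, but becomes negative outside of it, so it is not a sum of squares by itself:

```python
import numpy as np
from numpy.polynomial.polynomial import Polynomial

a, b = -1.0, 1.0
s = Polynomial([-0.5, 1.0]) ** 2  # (x - 0.5)^2, a sum of squares of degree 2
t = Polynomial([2.0])             # the constant 2, a sum of squares of degree 0

# p(x) = s(x) + (x - a) * (b - x) * t(x)
p = s + Polynomial([-a, 1.0]) * Polynomial([b, -1.0]) * t

xs = np.linspace(a, b, 1001)
print(p.coef)          # coefficients of p in the power basis
print(np.min(p(xs)))   # non-negative on [-1, 1]
print(p(2.0))          # negative outside of the interval
```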

So it appears that all we have to do is generate two random polynomials that are sums of squares, and construct $p$ of the desired degree by multiplying and adding polynomials. To that end, we will use the np.polynomial.polynomial.Polynomial class that can represent operations such as addition and multiplication on arbitrary polynomials. So let’s implement the nonneg_on_biunit function, that generates a non-negative polynomial on the bi-unit interval $[-1, 1]$. To generate coefficients of the sum-of-squares polynomial, we will rely on the Cauchy distribution due to its heavy tails, so that we obtain a large variety of coefficients. Random numbers are generated by the np.random.standard_cauchy function.

from numpy.polynomial.polynomial import Polynomial

def sum_of_squares_poly(half_deg):
  num_coef = 1 + half_deg
  first_poly = Polynomial(np.random.standard_cauchy(num_coef))
  second_poly = Polynomial(np.random.standard_cauchy(num_coef))
  return first_poly * first_poly + second_poly * second_poly

def nonneg_on_biunit(deg):
  if deg == 0:
    return Polynomial(np.random.standard_cauchy()) ** 2
  if deg % 2 == 0: # even degree
    s = sum_of_squares_poly(deg // 2)
    t = sum_of_squares_poly(deg // 2 - 1)
    return s + t * Polynomial(np.array([1, 0, -1]))
  else: # odd degree
    s = sum_of_squares_poly((deg - 1) // 2)
    t = sum_of_squares_poly((deg - 1) // 2)
    return Polynomial(np.array([1, -1])) * s + \
           Polynomial(np.array([1, 1])) * t

Does it work? Let’s see! We will plot randomly generated non-negative polynomials of various degrees:

np.random.seed(42)
fig, ax = plt.subplots(3, 3, figsize=(10, 10))
xs = np.linspace(-1, 1, 1000)
for deg, ax in enumerate(ax.flatten()):
  ys = nonneg_on_biunit(deg)(xs)
  ax.plot(xs, ys)
  ax.set_title(f'deg = {deg}')
plt.show()

They indeed appear to be diverse, and non-negative. So let’s continue with our plan of creating increasing polynomials by integrating non-negative polynomials.

def increasing_on_biunit(deg):
  nonneg = nonneg_on_biunit(deg - 1)
  return nonneg.integ()
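
The idea can be checked on a simple deterministic example before trying random polynomials. Below, I integrate the hand-picked polynomial $1 - x^2$, which is non-negative on $[-1, 1]$, and verify numerically that the result is nondecreasing on a grid. This sketch does not use the random generators above:

```python
import numpy as np
from numpy.polynomial.polynomial import Polynomial

nonneg = Polynomial([1.0, 0.0, -1.0])  # 1 - x^2, non-negative on [-1, 1]
increasing = nonneg.integ()            # x - x^3 / 3, with integration constant 0

xs = np.linspace(-1, 1, 1001)
is_increasing = bool(np.all(np.diff(increasing(xs)) >= 0))
recovered = increasing.deriv()         # differentiating gives back 1 - x^2

print(increasing.coef, is_increasing, recovered.coef)
```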

Let’s plot them, and see what we got:

np.random.seed(42)
fig, ax = plt.subplots(3, 3, figsize=(10, 10))
xs = np.linspace(-1, 1, 1000)
for deg, ax in enumerate(ax.flatten(), start=1):
  ys = increasing_on_biunit(deg)(xs)
  ax.plot(xs, ys)
  ax.set_title(f'deg = {deg}')
plt.show()

Now that our code appears to work, let’s proceed to fitting Bernstein form polynomials with increasing coefficients to these increasing polynomials. First, for every degree, we will fit a Bernstein form polynomial of the same degree, to see if an increasing polynomial of degree $d$ can be represented by a Bernstein polynomial of degree $d$ with increasing coefficients. To that end, we will use our beloved CVXPY package again, to constrain the Bernstein coefficients:

import cvxpy as cp

np.random.seed(42)
fig, ax = plt.subplots(4, 3, figsize=(10, 14))
xs = np.linspace(-1, 1, 1000)
for deg, ax in enumerate(ax.flatten(), start=1):
  true_ys = increasing_on_biunit(deg)(xs)

  vander_mat = bernvander(xs, deg)
  coef_var = cp.Variable(1 + deg)
  objective = cp.Minimize(cp.sum_squares(vander_mat @ coef_var - true_ys))
  prob = cp.Problem(objective, constraints=[cp.diff(coef_var) >= 0])
  prob.solve()

  coef = coef_var.value
  bern_ys = bernvander(xs, deg) @ coef

  ax.plot(xs, true_ys, 'k--')
  ax.plot(xs, bern_ys, 'r')
  ax.set_title(f'deg = {deg}')
plt.show()

We see that the fit is often very close, but doesn’t exactly match. Obviously, it is possible to find an increasing polynomial of the corresponding degree to fit each function, since each function is itself a polynomial of that degree. But the constraint that the Bernstein coefficients increase reduces the space to a subset of increasing polynomials, and we are unable to fit exactly.

But what happens if we allow fitting Bernstein form polynomials of higher degrees? Say, for an increasing polynomial of degree d, we will fit a Bernstein polynomial with increasing coefficients of degree 2d. Maybe increasing the degree helps reduce the gap?

import cvxpy as cp

np.random.seed(42)
fig, ax = plt.subplots(4, 3, figsize=(10, 12))
xs = np.linspace(-1, 1, 1000)
for deg, ax in enumerate(ax.flatten(), start=1):
  true_ys = increasing_on_biunit(deg)(xs)

  fit_deg = 2 * deg  # <--- NOTE HERE
  vander_mat = bernvander(xs, fit_deg)
  coef_var = cp.Variable(1 + fit_deg)
  objective = cp.Minimize(cp.sum_squares(vander_mat @ coef_var - true_ys))
  prob = cp.Problem(objective, constraints=[cp.diff(coef_var) >= 0])
  prob.solve()

  coef = coef_var.value
  bern_ys = bernvander(xs, fit_deg) @ coef

  ax.plot(xs, true_ys, 'k--')
  ax.plot(xs, bern_ys, 'r')
  ax.set_title(f'deg = {deg}')
plt.show()

It appears it does. Empirically, we are able to fit increasing polynomials of degree d with polynomials of degree 2d with increasing Bernstein coefficients. In practice, this gap means that we may need to use a higher polynomial degree than we could, in theory, when fitting a polynomial to an increasing function.
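The representation gap can also be seen on a small hand-made example that does not come from the experiments above. The sketch below works directly on $[0, 1]$, the natural domain of the Bernstein basis, and converts a polynomial from the power basis to the Bernstein basis using the standard identity $x^k = \sum_{i \geq k} \binom{i}{k} / \binom{n}{k} \, b_{i,n}(x)$. The function name power_to_bernstein and the specific polynomial are choices made for this illustration.

```python
from math import comb
import numpy as np

def power_to_bernstein(a, n):
    """Bernstein coefficients (degree n, on [0, 1]) of sum_k a[k] x^k, for len(a) - 1 <= n."""
    d = len(a) - 1
    return np.array([
        sum(comb(i, k) / comb(n, k) * a[k] for k in range(min(i, d) + 1))
        for i in range(n + 1)
    ])

# p(x) = integral from 0 to x of (2t - 1)^2 + 1/2, so p'(x) >= 1/2 and p is increasing
a = [0.0, 1.5, -2.0, 4.0 / 3.0]

coef_deg3 = power_to_bernstein(a, 3)  # same degree as p
coef_deg4 = power_to_bernstein(a, 4)  # one degree higher

print(coef_deg3, np.all(np.diff(coef_deg3) >= 0))
print(coef_deg4, np.all(np.diff(coef_deg4) >= 0))
```

Here the degree-3 Bernstein coefficients are not monotone even though the polynomial is increasing, while the degree-4 coefficients of the very same polynomial are. For this particular polynomial, raising the degree by one is enough; the example says nothing about how much elevation is needed in general.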

Is a factor of two always sufficient to reduce the representation gap? If not - what is the relationship between the higher degree fit with increasing coefficients and the degree of the original polynomial? Personally, I don’t know. But it’s an interesting question. What is known is that among the bases that have direct shape control properties via their coefficients, which are colloquially known as “normalized totally-positive bases”, the Bernstein basis is the unique basis with “optimal” shape control properties. The meaning of the optimality criterion is out of the scope of this post, but I refer the readers to the paper Shape preserving representations and optimality of the Bernstein basis³ for reference. As we mentioned before, these properties are extensively used in computer graphics to represent curves and shapes, including all the text you are reading on the screen.

It is possible to use the minimal degree via semidefinite optimization by exploiting the theory of sum-of-squares polynomials, but this is out of the scope of this post. Moreover, the heavier computational burden of semidefinite optimization typically makes this technique less applicable to fitting models to large amounts of data. Interested readers are referred to the book Semidefinite Optimization and Convex Algebraic Geometry².

Concluding remarks

This exploration of polynomial regression certainly taught me a lot. I learned that polynomials are not to be feared when designing regression models, when using a proper basis. The simplicity of polynomials is appealing - they have only one hyperparameter to tune, which is their degree. There are plenty of other function bases that can be used when fitting a nonlinear model using linear regression techniques, such as cubic splines, or radial basis functions. All of them are very useful, but they require more hyperparameter tuning, which may result in longer model fitting times and a slower model experimentation feedback loop. For example, splines, which are essentially piecewise polynomials with continuous (higher order) derivatives, require specifying their degree, the number of break-points, and the degree of derivative continuity. But given enough computational resources and data, these techniques probably perform better than polynomials.

I hope you enjoyed this series as much as I did, and the next posts will probably be on a different subject.

In theory, a least-squares estimator is unbiased. But it appears we don’t have enough samples for our average polynomial to approach the true mean, so it looks as if we have bias.
Grigoriy Blekherman, Pablo A. Parrilo, and Rekha R. Thomas. Semidefinite Optimization and Convex Algebraic Geometry. SIAM (2012).
J.M. Carnicer and J.M. Peña. Shape preserving representations and optimality of the Bernstein basis. Advances in Computational Mathematics 1 (1993).

A Bernstein SkLearn model calibrator

2024-05-19T00:00:00+00:00

Intro

We continue our journey in the land of polynomial regression and the Bernstein basis, which we began in this post, through another interesting landscape. There are many settings in which a model is trained to predict an abstract, meaningless score, which is later used for classification or ranking. For example, consider a linear support-vector machine (SVM) classifier. When classifying a sample, we only care about the sign of the score. If we take our SVM, and multiply its weight vector by a positive factor - we obtain exactly the same classifier. The scores are meaningless - only their sign is meaningful. Another example is the learning-to-rank setting. Our model produces a score that is used to rank items, and select the top-$k$ items to show to the user. The scores themselves are not meaningful - only their relative order is.

Statistically-inclined readers probably know that logistic regression tends to produce calibrated models out of the box. However, when the underlying logistic-regression model is a neural network, rather than a linear model, this is not the case. Indeed, a well-known paper by Guo et al.¹ shows that modern neural networks are often poorly calibrated.

In many applications we want the score to represent some interpretable confidence in the prediction, and one way to achieve this is calibration. A model is calibrated if the scores it produces are probabilities that are consistent with the empirical frequency of observing a positive sample. One formal way to define calibration is as follows:

A supervised model $f$ trained on samples $(x, y)\sim \mathcal{D}$ with $y \in \{0, 1\}$ is calibrated if
\[\mathbb{E}[y|f(x)] = f(x)\]

To make the discussion about calibration simpler, we avoid the discussion of models that produce multiple scores for a sample, such as multi-class and multi-label classifiers.
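To make the definition concrete, here is a small simulation on synthetic data. The first "model" is calibrated by construction, since each label is drawn with exactly the predicted probability. The second applies an increasing distortion that pushes the scores toward 0 and 1. The helper bin_gaps, the distortion, and all constants are assumptions made for this sketch; binning is used as a crude estimate of $\mathbb{E}[y|f(x)]$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# calibrated by construction: P(label = 1 | score) = score
scores = rng.uniform(0, 1, size=n)
labels = rng.uniform(0, 1, size=n) < scores

# an overconfident variant: same ranking, scores pushed toward 0 and 1
overconfident = scores ** 3 / (scores ** 3 + (1 - scores) ** 3)

def bin_gaps(pred, y, n_bins=10):
    """Per-bin |mean label - mean prediction|, a binned estimate of |E[y | f(x)] - f(x)|."""
    bin_ids = np.minimum((pred * n_bins).astype(int), n_bins - 1)
    gaps = []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gaps.append(abs(y[mask].mean() - pred[mask].mean()))
    return np.array(gaps)

print(bin_gaps(scores, labels).max())         # small: only sampling noise remains
print(bin_gaps(overconfident, labels).max())  # large: predictions disagree with frequencies
```

Both score vectors rank the samples identically, yet only the first one can be read as a probability.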

Calibrated models are important, for example, in online advertising. We truly care that a model produces the probability of a click, or the probability of a purchase, since these probabilities are used to compute expectations. Another context is safety-critical applications - there might be a difference between a $0.00001\%$ probability that our self-driving car observed a human, and $0.1\%$.

One way to achieve calibration is to stack a calibrator model $\omega: \mathbb{R} \to [0, 1]$ on top of an already trained model $f$, so that the predictions become:

\[\omega(f(x))\]

If the calibrator $\omega$ is an increasing function, classification or ranking remain unaffected, since the relative order of scores is preserved.
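A quick way to convince ourselves of this is the sketch below. It uses made-up scores and the sigmoid as a stand-in for an arbitrary strictly increasing $\omega$, and checks that both the ranking and the thresholded decisions survive the transformation, provided the decision threshold is mapped through $\omega$ as well.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.permutation(np.linspace(-3, 3, 50))  # distinct raw scores

def omega(s):
    # any strictly increasing map into [0, 1] will do; the sigmoid is just an example
    return 1 / (1 + np.exp(-s))

calibrated = omega(scores)

# ranking: sorting by raw score and by calibrated score gives the same order
same_ranking = np.array_equal(np.argsort(scores), np.argsort(calibrated))

# classification: thresholding raw scores at 0 equals thresholding at omega(0)
same_decisions = np.array_equal(scores > 0, calibrated > omega(0.0))

print(same_ranking, same_decisions)
```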

In this post we will use the power of the Bernstein basis in controlling the shape of the function we fit to devise monotonic calibrators $\omega$ that fit the requirements. Then, we compare the performance of our Bernstein calibrators to two built-in calibrators available in the Scikit-Learn package, which implement two well-known algorithms that are widely used to calibrate models throughout the industry. I recommend that readers take a look at the model calibration tutorial of the Scikit-Learn package as well. As usual, the code is available in a notebook you can try in Google Colab.

The idea of using shape-restricted polynomial regression for probabilistic calibration was, to the best of my knowledge, first proposed by Wang et al.² in 2019, so it’s quite new.

Working example - diabetes prediction

Throughout this post we will work with a support-vector machine classifier trained to predict diabetes on the CDC Diabetes Prediction Dataset. To easily access it, we can install the ucimlrepo package, which allows us to download it from the UCI machine-learning dataset repository:

pip install ucimlrepo

And now we can access it:

from ucimlrepo import fetch_ucirepo

# fetch dataset
cdc_diabetes_health_indicators = fetch_ucirepo(id=891)

# data (as pandas dataframes)
X = cdc_diabetes_health_indicators.data.features
y = cdc_diabetes_health_indicators.data.targets

Let’s print a summary of the data:

print(X.describe().transpose()[['min', '25%', '50%', '75%', 'max']])

The following output is produced:

                       min   25%   50%   75%   max
HighBP                 0.0   0.0   0.0   1.0   1.0
HighChol               0.0   0.0   0.0   1.0   1.0
CholCheck              0.0   1.0   1.0   1.0   1.0
BMI                   12.0  24.0  27.0  31.0  98.0
Smoker                 0.0   0.0   0.0   1.0   1.0
Stroke                 0.0   0.0   0.0   0.0   1.0
HeartDiseaseorAttack   0.0   0.0   0.0   0.0   1.0
PhysActivity           0.0   1.0   1.0   1.0   1.0
Fruits                 0.0   0.0   1.0   1.0   1.0
Veggies                0.0   1.0   1.0   1.0   1.0
HvyAlcoholConsump      0.0   0.0   0.0   0.0   1.0
AnyHealthcare          0.0   1.0   1.0   1.0   1.0
NoDocbcCost            0.0   0.0   0.0   0.0   1.0
GenHlth                1.0   2.0   2.0   3.0   5.0
MentHlth               0.0   0.0   0.0   2.0  30.0
PhysHlth               0.0   0.0   0.0   3.0  30.0
DiffWalk               0.0   0.0   0.0   0.0   1.0
Sex                    0.0   0.0   0.0   1.0   1.0
Age                    1.0   6.0   8.0  10.0  13.0
Education              1.0   4.0   5.0   6.0   6.0
Income                 1.0   5.0   7.0   8.0   8.0

We see that most features are actually binary. Let’s print the number of unique values of the non-binary columns:

X[['BMI', 'GenHlth', 'MentHlth', 'PhysHlth', 'Age', 'Education', 'Income']].nunique()

We get the following output:

BMI          84
GenHlth       5
MentHlth     31
PhysHlth     31
Age          13
Education     6
Income        8
dtype: int64

Therefore, I decided to treat only a few of the non-binary features as numerical, and the rest as categorical:

categorical_cols = ['HighBP', 'HighChol', 'CholCheck', 'Smoker', 'Stroke',
       'HeartDiseaseorAttack', 'PhysActivity', 'Fruits', 'Veggies',
       'HvyAlcoholConsump', 'AnyHealthcare', 'NoDocbcCost', 'GenHlth',
       'DiffWalk', 'Sex', 'Education',
       'Income']
numerical_cols = ['Age', 'BMI', 'MentHlth', 'PhysHlth']

Now let’s do the usual magic, and split the data. However, in this post, in addition to train and test sets we will have a calibration set whose purpose is training the calibrator model $\omega$. At this stage we will not use it, but let’s be prepared. We will use 15% for the test set, another 15% for the calibration set, and 70% for the train set:

from sklearn.model_selection import train_test_split

X_remain, X_test, y_remain, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
X_train, X_calib, y_train, y_calib = train_test_split(X_remain, y_remain, test_size=0.15/0.85, random_state=43)

And now let’s fit our linear support vector machine model. As usual, categorical features will be one-hot encoded, whereas numerical features will be min-max scaled. We use the LinearSVC class for the classifier, with the class_weight='balanced' option to handle our imbalanced dataset, and the dual=False option to make it train faster in our case, where the samples greatly outnumber the features:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ('feature_transformer', ColumnTransformer(
        transformers=[
            ('categorical', OneHotEncoder(handle_unknown='infrequent_if_exist', min_frequency=10), categorical_cols),
            ('numerical', MinMaxScaler(), numerical_cols)
        ]
    )),
    ('classifier', LinearSVC(class_weight='balanced', dual=False))
])

Now let’s fit our model to the training data, and report its classification performance on the test set:

from sklearn.metrics import classification_report

pipeline.fit(X_train, y_train)
print(classification_report(y_test, pipeline.predict(X_test)))

I got the following output:

              precision    recall  f1-score   support

           0       0.96      0.71      0.82     32840
           1       0.30      0.79      0.44      5212

    accuracy                           0.72     38052
   macro avg       0.63      0.75      0.63     38052
weighted avg       0.87      0.72      0.77     38052

Looking at the “macro avg” row, we see that it’s not the best classifier in the world, but it has some discriminative power - the precision, recall, and F1 score are indeed reasonable. Enough to move on.

Evaluating calibration

Before explaining how calibration is evaluated, a short methodological note. Calibration should be evaluated on a held-out test set, not on the train set. If calibration is important for hyperparameter tuning, then we also need to evaluate it on the validation set. Now let’s talk about how we evaluate calibration.

One way to evaluate calibration is visually, using calibration curves, also known as reliability diagrams³. These curves attempt to directly visualize how far we are from our calibration criterion:

\[\mathbb{E}[y|f(x)] = f(x)\]

In theory, we would like to plot the points

\[(\mathbb{E}[y|f(x)], f(x)) \qquad (x, y) \sim \mathcal{D},\]

but we cannot, since we only have access to a finite data-set, not the distribution that generated it. Thus, in practice we resort to approximation by binning the outputs of $f(x)$ into sub-intervals of $[0, 1]$ and replacing expectations with sample averages within each bin. This is implemented by Scikit-Learn in the CalibrationDisplay class.

Let’s try it out with a very naive calibrator - we will just take the output of our SVM, and pass it through the sigmoid function $\sigma(y) = (1+\exp(-y))^{-1}$. This will produce values in $[0, 1]$ that we can use:

from sklearn.calibration import CalibrationDisplay

y_pred = pipeline.decision_function(X_test)
y_pred = 1 / (1 + np.exp(-y_pred))
CalibrationDisplay.from_predictions(y_test, y_pred, n_bins=10, name='SVM + Sigmoid')
plt.show()

I got the following plot:

In a perfectly calibrated classifier, the blue calibration curve should align with the dotted black line - the average prediction in each bin should align with the empirical positive sample frequency.

Beyond visual means, we have metrics that quantify the calibration error. The simplest of such metrics is the Expected Calibration Error (ECE), whose computation is similar to how calibration curves are constructed. It is just the weighted average of the absolute calibration errors in each bin, where the weight of each bin is proportional to the number of samples in it. Since Scikit-Learn is an open-source project, I implemented the ECE metric based on its code for computing calibration curves:

# implementation based on the code of calibration_curve in sklearn:
#   https://github.com/scikit-learn/scikit-learn/blob/872124551/sklearn/calibration.py#L927
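def ece(y_true, y_prob, n_bins=10):
  # flatten the inputs, so that single-column data frames are accepted as well
  y_true = np.asarray(y_true, dtype=float).ravel()
  y_prob = np.asarray(y_prob, dtype=float).ravel()

  # assign each prediction to one of n_bins equal-width bins of [0, 1]
  bins = np.linspace(0.0, 1.0, n_bins + 1)
  binids = np.searchsorted(bins[1:-1], y_prob)

  # per-bin sum of predictions, sum of labels, and sample count
  bin_sums = np.bincount(binids, weights=y_prob, minlength=len(bins))
  bin_true = np.bincount(binids, weights=y_true, minlength=len(bins))
  bin_total = np.bincount(binids, minlength=len(bins))

  # per-bin empirical frequency and average prediction, skipping empty bins
  nonzero = bin_total != 0
  prob_true = bin_true[nonzero] / bin_total[nonzero]
  prob_pred = bin_sums[nonzero] / bin_total[nonzero]

  # average of the per-bin gaps, weighted by the bin sizes
  return np.sum(np.abs(prob_true - prob_pred) * bin_total[nonzero]) / np.sum(bin_total)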

Now we can use this function to print the ECE of our naive sigmoid calibrator:

print(f'ECE = {ece(y_test, y_pred)}')

The output is:

ECE = 0.30732883382620496

At this stage it doesn’t tell us much, until we begin improving it.
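To build some intuition for what the number means, here is a tiny example that can be checked by hand. The binning scheme is restated in a compact function named ece_small, so the snippet runs on its own; the four data points and the choice of two bins are made up for the illustration.

```python
import numpy as np

def ece_small(y_true, y_prob, n_bins=10):
    # same binning scheme as in the ece function, in compact form
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    binids = np.searchsorted(bins[1:-1], y_prob)
    bin_sums = np.bincount(binids, weights=y_prob, minlength=n_bins)
    bin_true = np.bincount(binids, weights=y_true, minlength=n_bins)
    # |mean label - mean prediction| * count equals |sum of labels - sum of predictions|
    return np.sum(np.abs(bin_true - bin_sums)) / len(y_prob)

# bin [0, 0.5]: average prediction 0.1, positive frequency 0.5 -> gap 0.4
# bin (0.5, 1]: average prediction 0.9, positive frequency 1.0 -> gap 0.1
y_prob = [0.1, 0.1, 0.9, 0.9]
y_true = [0, 1, 1, 1]
value = ece_small(y_true, y_prob, n_bins=2)
print(value)  # (2 * 0.4 + 2 * 0.1) / 4 = 0.25
```

Both bins hold two samples each, so the ECE here is simply the average of the two gaps.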

In addition to the ECE, the standard cross-entropy loss and the mean-squared error loss can also help us quantify miscalibration. In the context of probability calibration, the mean-squared error is known as the Brier score. However, they have an inherent weakness - they quantify both miscalibration and discriminative power⁴. For example, if our classifier is better at differentiating positive and negative samples than a competing classifier, these two losses show an improvement, even if its calibration error remains the same. Alternatively, improving only the calibration error without improving discriminative power also reduces these losses. Since in this post the classifier remains identical due to the monotonic nature of calibrators, and only its calibration error changes, these two metrics are useful. Both are implemented in Scikit-Learn and we can use them:

from sklearn.metrics import (
    brier_score_loss,
    log_loss
)

brier_score_loss(y_test, y_pred), log_loss(y_test, y_pred)

The output is

(0.19391187428082382, 0.5748329800290517)

Now let’s improve those numbers using calibrators designed for the task. The ECE is not a very reliable metric due to the approximation by binning, but we still include it, since it is widely used in papers on model calibration.
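To see why the Brier score and the log loss remain informative when only the calibrator changes, consider the toy sketch below. Two made-up probability vectors rank the four samples identically, so their discriminative power is the same, yet both losses differ a lot between them. The helper names brier and cross_entropy are local to this example and simply spell out the two formulas.

```python
import numpy as np

y = np.array([0, 0, 1, 1])
sharp = np.array([0.1, 0.2, 0.8, 0.9])
squashed = 0.5 + 0.1 * (sharp - 0.5)  # increasing transform: the ranking is unchanged

def brier(y_true, y_prob):
    # mean-squared error between predicted probabilities and labels
    return np.mean((y_prob - y_true) ** 2)

def cross_entropy(y_true, y_prob):
    # average negative log-likelihood of the labels
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

print(np.array_equal(np.argsort(sharp), np.argsort(squashed)))
print(brier(y, sharp), brier(y, squashed))
print(cross_entropy(y, sharp), cross_entropy(y, squashed))
```

With the ranking held fixed, any change in these two losses comes from how well the probabilities match the labels, which is exactly the situation we are in when comparing monotonic calibrators on top of one classifier.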

Working with Scikit-Learn built-in calibrators

The simplest well-known calibrator is the Platt calibrator⁵, which essentially boils down to fitting a logistic regression model whose only feature is the original model prediction. Namely, the Platt calibrator is a function of the form

\[\omega(y) = \frac{1}{1 + \exp(a y + b)},\]

where $a$ and $b$ are learned parameters. Where are these parameters learned from? That’s what we have the above-mentioned calibration set for. It is just the training set of the calibrator, and the training samples are $\{ (f(x_i), y_i) \}_{i \in C}$, where $C$ is the calibration set.
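To make the fitting step tangible, here is a minimal sketch that learns $a$ and $b$ by minimizing the logistic loss directly with scipy.optimize.minimize. It is not how Scikit-Learn implements Platt scaling; the synthetic calibration set, the generating parameters, and the function name neg_log_likelihood are all assumptions made for the illustration.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 20_000

# synthetic calibration set: raw scores s and labels with P(y = 1 | s) = sigmoid(2 s - 1)
s = rng.normal(size=n)
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-(2 * s - 1)))).astype(float)

def neg_log_likelihood(params):
    a, b = params
    z = -(a * s + b)  # omega(s) = 1 / (1 + exp(a s + b)) = sigmoid(z)
    # logaddexp(0, z) = ln(1 + exp(z)), evaluated without overflow
    return np.mean(np.logaddexp(0, z) - y * z)

res = minimize(neg_log_likelihood, x0=np.zeros(2))
a_hat, b_hat = res.x
print(a_hat, b_hat)  # should be close to the generating values a = -2, b = 1
```

Note the sign convention: with the formula above, a model whose larger scores indicate the positive class ends up with a negative $a$.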

In Scikit-Learn, the Platt calibrator is implemented in the CalibratedClassifierCV class. This class is pretty versatile, and has various options for how a calibrator is trained, and what exactly is used as the calibration set. To make this post simple, we will use the cv='prefit' option, which means that our model has been pre-fit, and we need to fit just the calibrator $\omega$ itself. The Platt calibrator can be chosen using the method='sigmoid' constructor option. So let’s try it out!

from sklearn.calibration import CalibratedClassifierCV

sigmoid_calib = CalibratedClassifierCV(pipeline, method='sigmoid', cv='prefit')
sigmoid_calib.fit(X_calib, y_calib)

To evaluate it, let’s implement a short function that will show all the three metrics we care about:

def estimator_errors(estimator, X_test, y_test):
  y_pred = estimator.predict_proba(X_test)[:, 1]
  return f'ECE = {ece(y_test, y_pred):.5f}, Brier = {brier_score_loss(y_test, y_pred):.5f}, LogLoss = {log_loss(y_test, y_pred):.5f}'

Now let’s plot the calibration curve and the metrics!

CalibrationDisplay.from_estimator(sigmoid_calib, X_test, y_test, n_bins=10)
plt.title(estimator_errors(sigmoid_calib, X_test, y_test))
plt.show()

Here is the output:

Looks a bit better. Some of the points lie on the diagonal line of perfect calibration, whereas others do not. However, looking at the metrics (in the title), we see that all of them improved substantially: the ECE dropped by more than an order of magnitude, and the Brier score and the log loss were roughly halved. This suggests that the points that appear mis-calibrated in the curve probably correspond to bins with few samples. Therefore, it is likely that their effect on the miscalibration error is quite small. It would be nice if Scikit-Learn could show the weight of each point using the point size, so we could see it visually - but unfortunately it does not.

The second well-known family of calibrators consists of piecewise-constant functions of the form

\[\omega(y) = \begin{cases} y_0 & y \leq x_1 \\ y_1 & x_1 < y \leq x_2 \\ \vdots & \\ y_{n-1} & x_{n-1} < y \leq x_n \\ y_n & y > x_n \end{cases},\]

where $y_0 < y_1 < \dots < y_n$, and both the values $y_0, \dots, y_n$ and the break-points $x_1, \dots, x_n$ are learned from the calibration set. The mathematical procedure for fitting such a function to data is called isotonic regression⁶⁷, and using it for calibration is done by passing method='isotonic' to the CalibratedClassifierCV class. So let’s try it out as well!

isotonic_calib = CalibratedClassifierCV(pipeline, method='isotonic', cv='prefit')
isotonic_calib.fit(X_calib, y_calib)

CalibrationDisplay.from_estimator(isotonic_calib, X_test, y_test, n_bins=10)
plt.title(estimator_errors(isotonic_calib, X_test, y_test))
plt.show()

I obtained the following plot:

Looks much better! And the three metrics were improved as well. We can also plot the piecewise-constant function:

calibrator = isotonic_calib.calibrated_classifiers_[0].calibrators[0]
plt.plot(calibrator.f_.x, calibrator.f_.y)
plt.title(f'Calibrator with {len(calibrator.f_.x)} points')
plt.show()

We can see that our classifier produced scores approximately between -2 and 2 on the calibration set, and the best-fit piecewise constant function has 118 “jumps”.
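For readers curious about how such a piecewise-constant fit can be computed, below is a minimal sketch of the classical pool-adjacent-violators idea: scan the labels in order of increasing score, and whenever the mean of a block is smaller than the mean of the block before it, merge the two. The sketch assumes the labels are already sorted by score and that all samples have equal weight, and it makes no claim to mirror Scikit-Learn’s implementation. The function name pava is mine.

```python
def pava(y):
    """Least-squares non-decreasing fit to the sequence y (equal weights)."""
    blocks = []  # each block is [sum of values, number of values]
    for v in y:
        blocks.append([float(v), 1])
        # merge backwards while the last two block means are out of order
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    # every sample receives the mean of the block it ended up in
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted

print(pava([1, 3, 2, 4]))        # the violating pair (3, 2) is pooled into its mean
print(pava([0, 1, 0, 0, 1, 1]))  # binary labels sorted by score
```

Each block of the result corresponds to one constant piece of the calibrator, and the block means play the role of the values $y_0, \dots, y_n$.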

There are two interesting observations we can make. First, a piecewise-constant function can harm ranking and classification, since it’s not strictly increasing by definition. Two samples having different, but nearby scores might be mapped to the same output. Second, the number of jumps may become large as the size of the calibration data-set increases. This means that inference may also become expensive, since computing $\omega(y)$ requires performing a lookup for the interval $y$ belongs to. So can we do better?

Calibration with Bernstein polynomials

In a previous post in our adventures with polynomial regression we saw an interesting theorem. Suppose our calibrator is:

\[\omega(y) = \sum_{i=0}^n u_i b_{i,n}(y),\]

where $\{ b_{i,n} \}_{i=0}^n$ is the $n$-degree Bernstein basis. Then having $u_{i+1} \geq u_i$ for all $i$ implies that $\omega$ is non-decreasing. Moreover, if for at least one index $j$ we have $u_{j+1} > u_j$, then $\omega$ is strictly increasing. Therefore, we can fit our calibrator to the calibration set $(\hat{y}_1, y_1), \dots, (\hat{y}_m, y_m)$ by solving a constrained polynomial regression problem using the Bernstein basis. As long as not all coefficients are equal, we will obtain a strictly increasing calibrator!
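The theorem is easy to check numerically. The sketch below draws random non-decreasing coefficients in $[0, 1]$, evaluates the resulting Bernstein polynomial on a grid, and verifies that the values never decrease and stay between the first and last coefficients. The helper name bernstein_basis, the degree, and the random coefficients are choices made for this illustration.

```python
import numpy as np
from scipy.stats import binom

def bernstein_basis(y, degree):
    # b_{i,n}(y) = C(n, i) y^i (1 - y)^(n - i) is the binomial pmf of i successes in n trials
    idx = np.arange(degree + 1)
    return binom.pmf(idx, degree, np.asarray(y)[:, None])

rng = np.random.default_rng(0)
degree = 8
u = np.sort(rng.uniform(0, 1, size=degree + 1))  # non-decreasing coefficients in [0, 1]

ys = np.linspace(0, 1, 200)
omega = bernstein_basis(ys, degree) @ u

print(np.all(np.diff(omega) > -1e-12))  # values never decrease along the grid
print(omega[0], u[0], omega[-1], u[-1])  # the endpoints equal the first and last coefficient
```

Since the basis functions are non-negative and sum to one, coefficients in $[0, 1]$ also keep the calibrator’s output in $[0, 1]$.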

Denoting $\mathbf{b}(y) = (b_{0,n}(y), \dots, b_{n,n}(y))^T$, we need to solve the following constrained least-squares regression problem:

\[\begin{aligned} \min_{\mathbf{u}} &\quad \sum_{j=1}^m \left( \mathbf{b}(\hat{y}_j)^T \mathbf{u} - y_j \right)^2 \\ \text{s.t.} &\quad 0 \leq u_i \leq 1, & i = 0, \dots, n \\ &\quad u_{i} \geq u_{i-1}, & i = 1, \dots, n \end{aligned}\]

Letting $\hat{\mathbf{V}}$ be the Vandermonde matrix whose rows are $\mathbf{b}(\hat{y}_j)^T$, and $\mathbf{y} = (y_1, \dots, y_m)^T$ be the vector of labels, we can write the above problem as:

\[\begin{aligned} \min_{\mathbf{u}} &\quad \| \hat{\mathbf{V}} \mathbf{u} - \mathbf{y} \|^2 \\ \text{s.t.} &\quad 0 \leq u_i \leq 1, & i = 0, \dots, n \\ &\quad u_{i} \geq u_{i-1}, & i = 1, \dots, n \end{aligned}\]

Having found the optimal solution $\mathbf{u}^*$ our calibrator’s prediction becomes:

\[\omega(y) = \mathbf{b}(y)^T \mathbf{u}^*.\]
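Before turning to a full implementation, here is a self-contained sketch of the same constrained least-squares problem on synthetic data. It swaps in SciPy’s general-purpose SLSQP solver for a dedicated convex solver, which is fine for a handful of coefficients but is an assumption of this sketch. The synthetic scores, the generating probability $P(y = 1 \mid \hat{y}) = \hat{y}^2$, and the names V, objective and gradient are all made up for the illustration.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import binom

rng = np.random.default_rng(0)
m, degree = 5000, 5

# synthetic calibration set: scores already in [0, 1], with P(y = 1 | score) = score^2
y_hat = rng.uniform(0, 1, size=m)
y = (rng.uniform(size=m) < y_hat ** 2).astype(float)

def bernvander(t, degree):
    # Bernstein Vandermonde matrix: row j holds b(t_j)^T
    return binom.pmf(np.arange(degree + 1), degree, np.asarray(t)[:, None])

V = bernvander(y_hat, degree)

def objective(u):
    r = V @ u - y
    return r @ r / m

def gradient(u):
    return 2 * V.T @ (V @ u - y) / m

res = minimize(
    objective,
    x0=np.linspace(0, 1, degree + 1),                # feasible starting point
    jac=gradient,
    bounds=[(0, 1)] * (degree + 1),                  # 0 <= u_i <= 1
    constraints=[{'type': 'ineq', 'fun': np.diff}],  # u_i - u_{i-1} >= 0
    method='SLSQP',
)
u_star = res.x

grid = np.linspace(0, 1, 101)
omega = bernvander(grid, degree) @ u_star
print(np.round(u_star, 3))
print(np.mean(np.abs(omega - grid ** 2)))  # average distance from the generating curve
```

Dividing the objective by $m$ does not change the minimizer; it only keeps the numbers on a convenient scale for the solver.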

The only issue stems from the fact that the underlying model’s predictions $\hat{y}_j$ are not necessarily in $[0, 1]$, but the Bernstein basis requires inputs in that range. As we already saw, the remedy comes from a simple min-max scaling. So let’s code our Bernstein calibrator!

As you probably guessed, the calibrator is just another Scikit-Learn classifier that applies the calibration procedure on top of a wrapped uncalibrated classifier. We use CVXPY, which we encountered before in this series, to solve the above-mentioned minimization problem. For the code in this post to work correctly, please make sure you have CVXPY version 1.5 or above. So here it is:

from sklearn.base import ClassifierMixin, MetaEstimatorMixin, BaseEstimator
import cvxpy as cp
from scipy.stats import binom

class BernsteinCalibrator(BaseEstimator, ClassifierMixin, MetaEstimatorMixin):
  def __init__(self, estimator=None, *, degree=20):
    self.estimator = estimator
    self.degree = degree

  def fit(self, X, y):
    pred = self._get_predictions(X)
    self.classes_ = self.estimator.classes_

    # compute min / max for scaling
    self.min_ = np.min(pred)
    self.max_ = np.max(pred)

    # compute Vandermonde matrix
    vander = self._bernvander(pred)

    # find Bernstein polynomial coefficients
    self.coef_ = self._fit_coef(vander, y)
    return self

  def _fit_coef(self, vander, y):
    coef = cp.Variable(self.degree + 1, bounds=[0, 1])
    objective = cp.norm(vander @ coef - y)
    constraints = [cp.diff(coef) >= 0]
    prob = cp.Problem(cp.Minimize(objective), constraints)
    prob.solve()

    return coef.value

  def predict_proba(self, X):
    pred = self._get_predictions(X)
    calibrated = self._calibrate_scores(pred).reshape(-1, 1)
    return np.concatenate([1 - calibrated, calibrated], axis=1)

  def _calibrate_scores(self, pred):
    vander = self._bernvander(pred)
    return np.clip(vander @ self.coef_, 0, 1)

  def _bernvander(self, pred):
    scaled = (pred - self.min_) / (self.max_ - self.min_)
    scaled = np.clip(scaled, 0, 1)

    basis_idx = np.arange(1 + self.degree)
    return binom.pmf(basis_idx, self.degree, scaled[:, None])

  def _get_predictions(self, X):
    estimator = self.estimator
    if estimator is None:
      estimator = LinearSVC(random_state=0, dual="auto")
    if hasattr(estimator, 'predict_proba'):
      pred = estimator.predict_proba(X)
      return pred[:, 1]
    elif hasattr(estimator, 'decision_function'):
      return estimator.decision_function(X)
    else:
      raise RuntimeError('Estimator must have either predict_proba or decision_function method')

The fit method computes the minimum and maximum observed values for the min-max scaling mechanism. Then it fits the coefficients using CVXPY by calling the _fit_coef method. The predict_proba method just evaluates the fitted Bernstein polynomial after computing the predictions of the underlying estimator. The _bernvander method computes the Vandermonde matrix for a vector of predictions after applying the min-max scaling. The rest of the code is straightforward boilerplate.

Now let’s try it out, and fit a polynomial calibrator of degree 20:

bernstein_calib = BernsteinCalibrator(pipeline, degree=20)
bernstein_calib.fit(X_calib, y_calib)

CalibrationDisplay.from_estimator(bernstein_calib, X_test, y_test, n_bins=10)
plt.title(estimator_errors(bernstein_calib, X_test, y_test))
plt.show()

I got the following result:

Nice! All error metrics became smaller. The calibration curve looks good. And our model has a much smaller number of parameters than the isotonic one - only 21 coefficients, instead of 118. We can also plot the calibration function $\omega(y)$, with the Bernstein coefficients as control points:

xs = np.linspace(bernstein_calib.min_, bernstein_calib.max_, 1000)
ys = bernstein_calib._calibrate_scores(xs)
plt.plot(xs, ys, label='Calibrator', color='blue')

ctrl_xs = np.linspace(bernstein_calib.min_, bernstein_calib.max_, bernstein_calib.degree + 1)
ctrl_ys = bernstein_calib.coef_
plt.scatter(ctrl_xs, ctrl_ys, label='Coefficients', color='red')

plt.legend()
plt.show()

I got the following plot:

So maybe we can work with an even lower degree? Let’s try fitting a polynomial calibrator of degree 10:

bernstein_calib_lowdeg = BernsteinCalibrator(pipeline, degree=10)
bernstein_calib_lowdeg.fit(X_calib, y_calib)

CalibrationDisplay.from_estimator(bernstein_calib_lowdeg, X_test, y_test, n_bins=10)
plt.title(estimator_errors(bernstein_calib_lowdeg, X_test, y_test))
plt.show()

The result is below:

The curve looks a bit worse, but the ECE and the log loss are still better than those of isotonic regression, and the Brier score is practically the same.

But wait! A calibrator is essentially a probability prediction model, and we know just the right tool for the task - logistic regression. In fact, we already saw the Platt calibrator, which is a simple logistic regression model whose only feature is the underlying prediction. So maybe, by using logistic rather than least-squares regression, we can work with an even lower degree polynomial and still achieve good calibration.

For logistic regression, our loss function, or the minimization objective, needs to be modified. Moreover, logistic regression coefficients may, in theory, go to infinity (or minus infinity) if the optimal prediction for some feature combinations is close to zero or one. This may cause the minimization procedure to declare that the problem is not solvable. Thus, we cap the Bernstein coefficients to be in the range $[-15, 15]$, in addition to the monotonicity constraint. This ensures that the polynomial’s values are also in that range, and the sigmoid function evaluated at the endpoints is 0 and 1 for all practical purposes. So, our modified convex optimization problem becomes:

\[\begin{aligned} \min_{\mathbf{u}} &\quad \sum_{j=1}^m \left( \ln(1+\exp(\mathbf{b}(\hat{y}_j)^T \mathbf{u})) - y_j\mathbf{b}(\hat{y}_j)^T \mathbf{u} \right) \\ \text{s.t.} &\quad -15 \leq u_i \leq 15, & i = 0, \dots, n \\ &\quad u_{i} \geq u_{i-1}, & i = 1, \dots, n \end{aligned}\]
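As a sanity check on the form of this objective, the sketch below evaluates it on made-up scores and labels and compares it with the textbook cross-entropy of the probabilities $\sigma(s)$. The random scores stand in for the values $\mathbf{b}(\hat{y}_j)^T \mathbf{u}$ and are drawn from $[-15, 15]$ to match the coefficient bounds; none of the data comes from the diabetes example.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.uniform(-15, 15, size=1000)  # stand-ins for b(y_hat_j)^T u
labels = rng.integers(0, 2, size=1000)

# the objective in terms of raw scores; logaddexp(0, s) = ln(1 + exp(s)) without overflow
objective = np.sum(np.logaddexp(0, scores) - labels * scores)

# textbook cross-entropy of the probabilities sigmoid(s)
probs = 1 / (1 + np.exp(-scores))
cross_entropy = -np.sum(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))

print(objective, cross_entropy)
```

Writing the loss in terms of raw scores is also what keeps it convex in the coefficients, since $\ln(1 + \exp(\cdot))$ is convex and the score is linear in $\mathbf{u}$.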

Carefully inspecting the objective - it’s just the regular loss of the logistic regression problem. To implement it, we just override the _fit_coef method to implement the above minimization problem as the fitting procedure, and the _calibrate_scores method to apply the sigmoid function after computing the Bernstein polynomial. So here it is:

class BernsteinSigmoidCalibrator(BernsteinCalibrator):
  def _fit_coef(self, vander, y):
    coef = cp.Variable(self.degree + 1, bounds=[-15, 15])
    scores = vander @ coef
    objective = cp.sum(cp.logistic(scores) - cp.multiply(y, scores))
    constraints = [cp.diff(coef) >= 0]
    prob = cp.Problem(cp.Minimize(objective), constraints)
    prob.solve()

    return coef.value

  def _calibrate_scores(self, pred):
    vander = self._bernvander(pred)
    return self._sigmoid(vander @ self.coef_)

  @staticmethod
  def _sigmoid(scores):
    return np.piecewise(
        scores,
        [scores > 0],
        [lambda z: 1 / (1 + np.exp(-z)), lambda z: np.exp(z) / (1 + np.exp(z))]
    )

Note that to avoid overflows and other numerical issues, we carefully implemented the sigmoid function to handle positive and negative values differently. Now let’s try it out with a degree of 10.

bernstein_sigmoid_calib = BernsteinSigmoidCalibrator(pipeline, degree=10)
bernstein_sigmoid_calib.fit(X_calib, y_calib)

CalibrationDisplay.from_estimator(bernstein_sigmoid_calib, X_test, y_test, n_bins=10)
plt.title(estimator_errors(bernstein_sigmoid_calib, X_test, y_test))
plt.show()

The result is:

Nice! With a degree of 10, we achieved a result similar to that of least-squares fitting with a degree of 20. To summarize, here are the metrics (lower is better). The best value in each column is marked with an asterisk.

Calibrator                                ECE       Brier     LogLoss
Platt                                     0.01295   0.09746   0.31395
Isotonic                                  0.00737   0.09707   0.31289
Bernstein (deg = 20)                      0.00622*  0.09703*  0.31195*
Bernstein (deg = 10)                      0.00634   0.09708   0.31231
Bernstein logistic regression (deg = 10)  0.00653   0.09704   0.31199

Conclusion

We saw an interesting application of the ability to control the derivative of polynomials represented in the Bernstein basis for model calibration. I welcome you to try it out for your own work, where controlling derivatives in the context of your machine-learned models is important.

As a side note, there are other bases that allow controlling derivatives in a similar manner. For example, the well-known B-Spline basis for polynomial splines. But that’s out of scope for our series - my objective was showing that polynomial regression is not that “scary overfitting monster”, but rather a useful tool in machine learning.

My next and final post in the series will be of a more exploratory nature - trying to understand why the Bernstein basis is useful for fitting polynomial models from a different, statistical perspective. Stay tuned!

Chuan Guo, Geoff Pleiss, Yu Sun, Kilian Q. Weinberger. On Calibration of Modern Neural Networks. Proceedings of the 34th International Conference on Machine Learning (2017) ↩
Yongqiao Wang, Lishuai Li, Chuangyin Dang. Calibrating Classification Probabilities with Shape-Restricted Polynomial Regression. IEEE Transactions on Pattern Analysis and Machine Intelligence 41.8 (2019) ↩
Morris H. Degroot, Stephen E. Fienberg. The Comparison and Evaluation of Forecasters. Journal of the Royal Statistical Society: Series D (The Statistician) (1983) ↩
Allan H. Murphy. A New Vector Partition of the Probability Score. Journal of Applied Meteorology and Climatology (1973). ↩
John Platt. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. Advances in Large Margin Classifiers 10.3 (1999) ↩
R.E. Miles. The Complete Amalgamation into Blocks, by Weighted Means, of a Finite Set of Real Numbers. Biometrika 46.3 (1959) ↩
D. J. Bartholomew. A Test of Homogeneity for Ordered Alternatives. II. Biometrika 46.3 (1959) ↩

SkLearning with Bernstein Polynomials - continued

2024-05-13T00:00:00+00:00

Intro

In the previous post we built a Scikit-Learn component that you can already integrate into your pipelines to train models whose numerical features are represented in the Bernstein basis. Feature interactions are a simple and effective feature engineering trick, and this post builds upon this knowledge and improves the component we built by introducing pairwise interactions between numerical features. This post is a direct continuation of the previous post, and I will assume that you are familiar with what we built so far. If what you see here looks like Klingon, and you don’t know Klingon, please take your time to read the posts on polynomial features from the beginning. As previously, the code is available in a notebook that you can open in Google Colab. Due to the nature of this post, the notebook extends the code from the last post with additional experiments, rather than being written from scratch.

A recap

The BernsteinFeatures component we created last time allowed us to construct a Scikit-Learn pipeline, train, and make predictions using the following simple lines of code:

categorical_features = [...] # the list of categorical feature names
numerical_features = [...] # the list of numerical feature names
my_estimator = ... # Ridge / Lasso / LogisticRegression / ...
pipeline = training_pipeline(BernsteinFeatures(), my_estimator, categorical_features, numerical_features)

pipeline.fit(train_df, train_df[label_column])
test_predictions = pipeline.predict(test_df)

We constructed pipelines of the following generic form to facilitate using polynomial bases over data in a compact interval, by first rescaling it:

Our BernsteinFeatures transformer generated Bernstein basis features for each column separately. As a baseline, we also used the PowerBasisFeatures transformer that generated the power-basis features. We will extend both classes in a way that will allow us to construct pairwise interactions between numerical features by generating tensor product bases:

\[b_{i,j,n}(x, y) = b_{i,n}(x) b_{j,n}(y)\]

Such bases can be used to learn a function of any given pair of features $x$ and $y$ with linear coefficients:

\[f(x, y) = \sum_{i=0}^n \sum_{j=0}^n \alpha_{i,j} b_{i,n}(x) b_{j,n}(y)\]

The basis $b_{0,n}, \dots, b_{n,n}$, in this post, can be either the power-basis for n-th degree polynomials, or the Bernstein basis.

As an additional baseline, we will also use Scikit-Learn’s built-in PolynomialFeatures class, which does something similar, but not identical, with the power basis. Given $m$ numerical features and a degree $n$, it allows learning functions of the form

\[f(x_1, \dots, x_m) = \sum_{\substack{i_1 + \dots + i_m \leq n \\ i_k \geq 0}} \alpha_{i_1, \dots, i_m} \left( \prod_{k=1}^m x_k^{i_k} \right).\]

This looks “scary”, but essentially this is a generic multivariate polynomial of degree $n$ whose variables are $x_1, \dots , x_m$. So let’s get started!

The pairwise interaction transformers

Without further ado, let’s extend the base class for both polynomial feature transformers from the previous post to have an additional interactions argument in its constructor, and produce tensor-product features. Again, we need to take care not to introduce an additional “bias term”, and to that end, we eliminate the first basis function, as in the previous post:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from itertools import combinations


class PolynomialBasisTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, degree=5, bias=False, na_value=0., interactions=False):
        self.degree = degree
        self.bias = bias
        self.na_value = na_value
        self.interactions = interactions

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Check if X is a Pandas DataFrame and convert to NumPy array
        if hasattr(X, 'values'):
            X = X.values

        # Ensure X is a 2D array
        if X.ndim == 1:
            X = X.reshape(-1, 1)

        # Get the number of columns in the input array
        n_rows, n_features = X.shape

        # Compute the specific polynomial basis for each column
        basis_features = [
            self.feature_matrix(X[:, i])
            for i in range(n_features)
        ]

        # create interaction features - basis tensor products
        if self.interactions:
            interaction_features = [
                (u[:, None, :] * v[:, :, None]).reshape(n_rows, -1)
                for u, v in combinations(basis_features, 2)
            ]
            result_basis = interaction_features
        else:
            result_basis = basis_features

        # no bias --> skip the first column of each block
        if not self.bias:
            result_basis = [basis[:, 1:] for basis in result_basis]

        return np.hstack(result_basis)


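    def feature_matrix(self, column):
        vander = self.vandermonde_matrix(column)
        # replace NaN entries (missing values) by na_value;
        # the second positional argument of nan_to_num is `copy`, so `nan` must be passed by name
        return np.nan_to_num(vander, nan=self.na_value)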


    def vandermonde_matrix(self, column):
        raise NotImplementedError("Subclasses must implement this method.")
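The broadcasting expression that builds the tensor products is compact, so it may help to see what it computes. The sketch below uses small random matrices as stand-ins for the per-feature basis matrices, and checks the column layout against an explicit loop:

```python
import numpy as np

n_rows, k = 4, 3
rng = np.random.default_rng(0)
u = rng.random((n_rows, k))  # stand-in for the basis matrix of one feature
v = rng.random((n_rows, k))  # stand-in for the basis matrix of another feature

# same expression as in transform(): shape (n_rows, k, k) -> (n_rows, k * k)
tensor = (u[:, None, :] * v[:, :, None]).reshape(n_rows, -1)

# column j * k + i holds the product v[:, j] * u[:, i]
expected = np.stack(
    [v[:, j] * u[:, i] for j in range(k) for i in range(k)], axis=1
)
print(np.allclose(tensor, expected))  # True

# an equivalent and arguably more explicit formulation
tensor_einsum = np.einsum('rj,ri->rji', v, u).reshape(n_rows, -1)
print(np.allclose(tensor, tensor_einsum))  # True
```

In particular, column 0 is the product of the first basis function of each feature, which is the column dropped when `bias=False`.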

Our concrete Bernstein and power basis transformers from the previous post remain the same - their job is implementing the vandermonde_matrix method. We include them here for completeness:

import numpy.polynomial.polynomial as poly
from scipy.stats import binom


class BernsteinFeatures(PolynomialBasisTransformer):
    def vandermonde_matrix(self, column):
        basis_idx = np.arange(1 + self.degree)
        basis = binom.pmf(basis_idx, self.degree, column[:, None])
        return basis


class PowerBasisFeatures(PolynomialBasisTransformer):
    def vandermonde_matrix(self, column):
        return poly.polyvander(column, self.degree)

The rest of the components we built in the previous post remain the same. So let’s try them out, and add another experiment to our attempts to predict California housing prices!

California housing dataset with pairwise polynomial features

Recall that we’re given a train and a test set already in Google Colab, and can load them:

train_df = pd.read_csv('sample_data/california_housing_train.csv')
test_df = pd.read_csv('sample_data/california_housing_test.csv')
print(train_df.head())

#    longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value
# 0    -114.31     34.19                15.0       5612.0          1283.0      1015.0       472.0         1.4936             66900.0
# 1    -114.47     34.40                19.0       7650.0          1901.0      1129.0       463.0         1.8200             80100.0
# 2    -114.56     33.69                17.0        720.0           174.0       333.0       117.0         1.6509             85700.0
# 3    -114.57     33.64                14.0       1501.0           337.0       515.0       226.0         3.1917             73400.0
# 4    -114.57     33.57                20.0       1454.0           326.0       624.0       262.0         1.9250             65500.0

The task is predicting the median_house_value column based on the other columns. Let’s use the same categorical and numerical features as in our previous post:

categorical_features = ['housing_median_age']
numerical_features = ['longitude', 'latitude', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income']
target = ['median_house_value']

And use the same pipeline construction function as in the previous post:

from sklearn.linear_model import Ridge
from sklearn.compose import TransformedTargetRegressor

def california_housing_pipeline(basis_transformer):
    return training_pipeline(
        basis_transformer,
        TransformedTargetRegressor(
            regressor=Ridge(),
            transformer=MinMaxScaler()
        ),
        categorical_features,
        numerical_features
    )

So now, in addition to the linear, power, and Bernstein bases, we will add the pairwise power basis, the pairwise Bernstein basis, and the built-in PolynomialFeatures basis. Let’s begin with the pairwise Bernstein basis. Note the interactions=True argument I give to the BernsteinFeatures component:

inter_param_space = {
    'preprocessor__numerical__basis__degree': hp.uniformint('degree', 1, 8),
    'model__regressor__alpha': hp.loguniform('C', -10, 5)
}

bernstein_inter_pipeline = california_housing_pipeline(BernsteinFeatures(interactions=True))
tune_and_evaluate_pipeline(bernstein_inter_pipeline, inter_param_space,
                           train_df, test_df, target,
                           'neg_root_mean_squared_error')

This time we’ll use lower polynomial degrees, up to 8, because the model becomes too large to finish tuning in a few minutes on Colab. I got the following result:

Tuning params
100%|██████████| 100/100 [14:10<00:00,  8.50s/trial, best loss: 55558.90453947183]
Best params = {'model__regressor__alpha': 0.003973626254894749, 'preprocessor__numerical__basis__degree': 8}
Refitting with best params on the entire training set
Test metric = -58131.78984

Now let’s try the power basis with interactions:

power_inter_pipeline = california_housing_pipeline(PowerBasisFeatures(interactions=True))
tune_and_evaluate_pipeline(power_inter_pipeline, inter_param_space,
                           train_df, test_df, target,
                           'neg_root_mean_squared_error')

After a few minutes I get the following output:

Tuning params
100%|██████████| 100/100 [12:25<00:00,  7.46s/trial, best loss: 56765.359651797495]
Best params = {'model__regressor__alpha': 0.00017748456793637552, 'preprocessor__numerical__basis__degree': 8}
Refitting with best params on the entire training set
Test metric = -59228.75478

So we can certainly see that even with interaction features, the power basis performs worse than the Bernstein basis.

And last but not least, let’s use the PolynomialFeatures class that the Scikit-Learn package provides. To be fair, we need to choose its maximum degree so that the number of generated features is similar to that of the pairwise bases. We have 7 numerical features, and therefore $\frac{1}{2} \cdot 7 \cdot 6 = 21$ feature pairs. With a maximum degree of 8, each feature has 9 basis functions, so each pair generates at most $9 \cdot 9 - 1 = 80$ basis functions. So the total number of generated features is at most $21 \cdot 80 = 1680$.

A multivariate polynomial with $7$ variables of degree $d$ has

\[{7 + d \choose d}\]

coefficients. It can be easily shown using the stars and bars technique in combinatorics. Choosing $d = 6$ we get 1716 coefficients, which is pretty close. With $d = 5$ we get only 792 coefficients, and with $d = 7$ we get 3432, so a search space of degrees up to 7 covers the comparable model size.
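These counts are easy to check with a few lines of code. The sketch below counts columns from the definitions: the pairwise transformer shown earlier produces `degree + 1` basis functions per feature and drops one column per pair, and PolynomialFeatures with `include_bias=False` drops the constant monomial. The helper names are made up for this illustration:

```python
from math import comb

n_features = 7
n_pairs = comb(n_features, 2)  # 21 feature pairs

def pairwise_columns(degree):
    # (degree + 1)^2 tensor products per pair, minus the one dropped column
    return n_pairs * ((degree + 1) ** 2 - 1)

def full_polynomial_columns(degree):
    # monomials of total degree <= degree in n_features variables,
    # minus the constant, which include_bias=False drops
    return comb(n_features + degree, degree) - 1

print(pairwise_columns(8))  # 1680
for d in (5, 6, 7):
    print(d, full_polynomial_columns(d))  # 791, 1715 and 3431 respectively
```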

Let’s try it out!

from sklearn.preprocessing import PolynomialFeatures

polyfeat_param_space = {
    'preprocessor__numerical__basis__degree': hp.uniformint('degree', 1, 7),
    'model__regressor__alpha': hp.loguniform('C', -10, 5)
}


polyfeat_pipeline = california_housing_pipeline(PolynomialFeatures(include_bias=False))
tune_and_evaluate_pipeline(polyfeat_pipeline, polyfeat_param_space,
                           train_df, test_df, target,
                           'neg_root_mean_squared_error')

After a few minutes, I got the following output:

Tuning params
100%|██████████| 100/100 [25:59<00:00, 15.60s/trial, best loss: 56986.28132958403]
Best params = {'model__regressor__alpha': 0.0003080552886505334, 'preprocessor__numerical__basis__degree': 6}
Refitting with best params on the entire training set
Test metric = -59155.28068

So, summarizing the results of the previous post, together with the results of this post, we obtain the following table:

	Linear	Power basis	Bernstein basis	Pairwise Bernstein	Pairwise Power	Full polynomial
RMSE	67627.17474	63534.49228	61559.04848	58131.78984	59228.75478	59155.28068
Improvement over Linear	0%	6.05%	8.97%	14.04%	12.41%	12.52%
Tuned degree	1	31	50	8	8	6

As we can see, the clear winners are the pairwise polynomial features. This, of course, will not always be the case. But there is a good reason why it may be an option worth exploring.

Bernstein tensor products

Let’s get formally introduced - the set of functions $\mathbb{B}_{n,n} = \{ b_{i,n}(x) b_{j,n}(y) \}_{i,j=0}^n$ is the tensor product basis constructed from the $n$-th degree Bernstein basis. In general, tensor product bases are function bases that are composed of pairwise products of basis functions, but here we explore the special case of Bernstein basis functions. The basis $\mathbb{B}_{n,n}$ shares some nice properties with the Bernstein basis:

Non-negativity: $b_{i,n}(x) b_{j,n}(y) \geq 0$
Partition of unity: $\displaystyle \sum_{i=0}^n \sum_{j=0}^n b_{i,n}(x) b_{j,n}(y) = 1$
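Both properties can be verified numerically. The sketch below evaluates the tensor product basis at random points of the unit square, using the same binomial-pmf formula as our BernsteinFeatures class. The helper `bernstein_basis` is a name made up for this illustration:

```python
import numpy as np
from scipy.stats import binom

def bernstein_basis(x, n):
    # b_{i,n}(x) for i = 0..n, one row per entry of x
    return binom.pmf(np.arange(n + 1), n, np.asarray(x)[:, None])

n = 5
rng = np.random.default_rng(0)
x, y = rng.random(100), rng.random(100)

bx, by = bernstein_basis(x, n), bernstein_basis(y, n)
tensor = bx[:, :, None] * by[:, None, :]  # tensor[r, i, j] = b_i(x_r) * b_j(y_r)

print(tensor.min() >= 0)                          # non-negativity
print(np.allclose(tensor.sum(axis=(1, 2)), 1.0))  # partition of unity
```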

Now let’s look at an arbitrary function $f$ that is spanned by this basis:

\[f(x,y) = \sum_{i=0}^n \sum_{j=0}^n \alpha_{i,j} b_{i,n}(x) b_{j,n}(y)\]

Due to the two properties above, like in the case of the univariate Bernstein basis, $f$ is just a weighted average of its coefficients $\alpha_{i,j}$. The basis function values specify the weight of each coefficient.

Moreover, we have the same ‘controlling’ property as with the univariate basis - $\alpha_{i,j}$ “controls” the value of the function $f$ in the vicinity of the point $(\frac{i}{n}, \frac{j}{n})$. These properties make it easy to regularize $f$, just as in the case of the univariate basis. We will not go into the details in this post, but just as is the case with the univariate basis, we can also control the first or second derivative of $f$ by imposing constraints on its coefficients based on discrete analogues of first and second order differences.

Despite the name ‘basis’, it is important to note that the tensor product basis does not span all bivariate polynomials of degree $2n$, but merely a very useful subspace. For example, the monomials $x^{2n}$ and $y^{2n}$ appear nowhere in the polynomial expansion of the $f(x, y)$ defined above.

Bases with such properties, such as the Bernstein basis, and also the well-known B-Spline basis are heavily used by computer aided design software to represent 2D surfaces embedded in 3D ¹. In the case of the Bernstein basis, the surfaces are known as Bézier surfaces, named after the French engineer Pierre Bézier. I like the idea of propagating knowledge established in one field to another field, and I believe this is one such case. I’d like to refer interested readers to the beautiful tutorial paper² by Michael Floater and Kai Hormann.

Summary

This post concludes our adventures in designing a Scikit-Learn transformer. I’m happy to receive feedback, so please don’t hesitate to contact me if you have feedback to share. Next, we will explore a practical case when controlling polynomial derivatives is important, and write yet another Scikit-Learn component. Stay tuned!

When representing a 3D surface, we have three functions $f_x, f_y, f_z$, one for the $x$ coordinate, one for the $y$, and one for the $z$ coordinate. ↩
Michael S. Floater & Kai Hormann Surface Parameterization: a Tutorial and Survey. Mathematics and Visualization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-26808-1_9 ↩

SkLearning with Bernstein Polynomials

2024-02-11T00:00:00+00:00

Intro

In the last two posts we introduced the Bernstein basis as an alternative way to generate polynomial features from data. In this post we’ll be concerned with an implementation that we can use in our model training pipelines based on Scikit-Learn. The Scikit-Learn library has the concept of a transformer class that generates features from raw data, and we will indeed develop and test such a transformer class for the Bernstein basis. Contrary to previous posts, here we will have only some math, but plenty of code, which is fully available in this Colab Notebook.

But beforehand, let’s do a short recap of what we learned in the last two posts:

The Bernstein basis is a useful alternative to the standard polynomial basis $\{1, x, x^2, \dots, x^n\}$. It has a well conditioned Vandermonde matrix, and is easy to regularize.
Any polynomial basis has a “natural domain” where its approximation properties are well-known. Raw features must be normalized to that domain. The natural domain of the Bernstein basis is the interval $[0, 1]$.
Extrapolation outside the training data distribution is not a problem - we can impose smoothness via regularization.
Extrapolation outside the “natural domain” should be avoided!¹

Transformer classes in Scikit-Learn generate new features out of existing ones, and can be combined in a convenient way into pipelines that perform a set of transformations that eventually generate features for a trained model. We will implement a Scikit-Learn transformer class for Bernstein polynomials, called BernsteinFeatures. As a baseline, we will also implement a similar transformer that generates the power basis, called PowerBasisFeatures. We will combine them in a Pipeline to build a mechanism that trains and evaluates a model using the well-known fit-transform paradigm. In this post, we will train a linear model on our generated features.

Since feature normalization is a must, we will always prepend our polynomial transformer by a normalization transformer. In this post, we will use the MinMaxScaler class built into Scikit-Learn. For categorical features, we will use OneHotEncoder. Therefore, our pipelines in this post will have the following generic form:

Before we begin - some expectations. The behavior of the functions we approximate on real data-sets is typically not as ‘crazy’ as the toy functions we approximated in previous posts. The wide oscillations and wiggling of the “true” function we are aiming to learn are not that common in practice. A harder challenge is modeling the interaction between several features, rather than the effect of each feature separately. Therefore, the advantage we will see from a simple application of Bernstein polynomials over the power basis isn’t that large, but it’s quite visible and consistent. Thus, when fitting a model with polynomial features, I’d go with Bernstein polynomials by default, instead of a power basis. It’s very easy, and we have nothing to lose - we can only gain.

The transformer classes

A transformer class in Scikit-Learn needs to implement the basic fit-transform paradigm. Since polynomial features are the same regardless of the data, the fit method is empty. The transform method, as expected, will generate the Vandermonde matrix of each column and concatenate them. Note that we will be handling each column separately at this stage, and do not aim to compute any interaction terms between columns.

There is one mathematical issue we need to take care of. Since a polynomial basis can represent any polynomial, including those that do not pass through the origin, it implicitly contains a “bias” term. The power basis is even explicit about it - its first basis function is the constant $1$. However, a typical linear model already has its own bias term, namely,

\[f(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x}\rangle + b.\]

The bias is, of course, equivalent to having a constant feature. Thus, our data-matrix has two constant features, meaning it’s as ill-conditioned as it can be - its columns are linearly dependent. When several numerical features are used things become even worse - we have several implicit constant features.

To mitigate the above, we will add a bias boolean flag to our transformers that instructs the transformer to generate a basis of polynomials going through the origin. This policy is in line with other transformers that are built-in into Scikit-Learn, such as the SplineTransformer and the PolynomialFeatures classes. For the power basis it amounts to discarding the first basis function. It turns out that the same idea works for the Bernstein basis as well, since $b_{0,n}(0) = 1$, and $b_{i,n}(0) = 0$ for all $i \geq 1$.
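Both claims can be checked numerically. The sketch below, with an arbitrary degree and grid of points chosen for illustration, shows that only $b_{0,n}$ is non-zero at the origin, and that placing the full Bernstein basis next to a constant column produces a rank-deficient matrix, whereas dropping the first basis function restores full column rank:

```python
import numpy as np
from scipy.stats import binom

n = 4
x = np.linspace(0, 1, 50)
basis = binom.pmf(np.arange(n + 1), n, x[:, None])  # full Bernstein basis, shape (50, n + 1)
intercept = np.ones((len(x), 1))                     # the model's own bias as a constant column

with_first = np.hstack([intercept, basis])            # n + 2 = 6 columns
without_first = np.hstack([intercept, basis[:, 1:]])  # n + 1 = 5 columns

at_origin = binom.pmf(np.arange(n + 1), n, 0.0)
print(at_origin)                             # 1 for b_0, and 0 for the rest
print(np.linalg.matrix_rank(with_first))     # 5: one less than the number of columns
print(np.linalg.matrix_rank(without_first))  # 5: full column rank
```

The rank deficiency in the first case is exactly the partition of unity: the Bernstein columns sum to the constant column.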

Besides the above mathematical aspect, we will also have to take care of several technical aspects. First, we will add support for Pandas data-frames, since they are ubiquitously used by many practitioners. Second, we will have to take care of one-dimensional arrays as input, and reshape them into a column. Finally, we will transform NaN values to constant (zero) vectors to model the fact that a missing numerical feature “has no effect”. This is not always the best course of action, but it’s useful in this post. The base class taking care of the above mathematical and technical aspects is written below:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class PolynomialBasisTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, degree=5, bias=False, na_value=0.):
        self.degree = degree
        self.bias = bias
        self.na_value = na_value

    def fit(self, X, y=None):
        return self
      
    def transform(self, X, y=None):
        # Check if X is a Pandas DataFrame and convert to NumPy array
        if hasattr(X, 'values'):
            X = X.values

        # Ensure X is a 2D array
        if X.ndim == 1:
            X = X.reshape(-1, 1)

        # Get the number of columns in the input array
        n_rows, n_features = X.shape

        # Compute the specific polynomial basis for each column
        basis_features = [
            self.feature_matrix(X[:, i])
            for i in range(n_features)
        ]

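        # no bias --> skip the first basis function
        if not self.bias:
            basis_features = [basis[:, 1:] for basis in basis_features]

        return np.hstack(basis_features)

    def feature_matrix(self, column):
        vander = self.vandermonde_matrix(column)
        # replace NaN entries (missing values) by na_value;
        # the second positional argument of nan_to_num is `copy`, so `nan` must be passed by name
        return np.nan_to_num(vander, nan=self.na_value)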

    def vandermonde_matrix(self, column):
        raise NotImplementedError("Subclasses must implement this method.")

The power and Bernstein bases are easily implemented by overriding the vandermonde_matrix method of the above base-class:

import numpy.polynomial.polynomial as poly
from scipy.stats import binom

class BernsteinFeatures(PolynomialBasisTransformer):
    def vandermonde_matrix(self, column):
        basis_idx = np.arange(1 + self.degree)
        basis = binom.pmf(basis_idx, self.degree, column[:, None])
        return basis


class PowerBasisFeatures(PolynomialBasisTransformer):
    def vandermonde_matrix(self, column):
        return poly.polyvander(column, self.degree)

Let’s see how they work. We will use Pandas to display the results of our transformers as nicely formatted tables.

import pandas as pd

pbt = PowerBasisFeatures(degree=2).fit(np.empty(0))
bbt = BernsteinFeatures(degree=2).fit(np.empty(0))

# transform a column - output the Vandermonde matrix according to each basis
feature = np.array([0, 0.5, 1, np.nan])
print(pd.DataFrame.from_dict({
    'Feature': feature,
    'Power basis': list(pbt.transform(feature)),
    'Bernstein basis': list(bbt.transform(feature))
}))
#    Feature  Power basis             Bernstein basis
# 0      0.0   [0.0, 0.0]                  [0.0, 0.0]
# 1      0.5  [0.5, 0.25]  [0.5000000000000002, 0.25]
# 2      1.0   [1.0, 1.0]                  [0.0, 1.0]
# 3      NaN   [0.0, 0.0]                  [0.0, 0.0]

# transform two columns - concatenate the Vandermonde matrices
features = np.array([
    [0, 0.25],
    [0.5, 0.5],
    [np.nan, 0.75]
])
print(pd.DataFrame.from_dict({
    'Feature 0': features[:, 0],
    'Feature 1': features[:, 1],
    'Power basis': list(pbt.transform(features)),
    'Bernstein basis': list(bbt.transform(features))
}))
#    Feature 0  Feature 1               Power basis                                       Bernstein basis
# 0        0.0       0.25  [0.0, 0.0, 0.25, 0.0625]                             [0.0, 0.0, 0.375, 0.0625]
# 1        0.5       0.50    [0.5, 0.25, 0.5, 0.25]  [0.5000000000000002, 0.25, 0.5000000000000002, 0.25]
# 2        NaN       0.75  [0.0, 0.0, 0.75, 0.5625]                             [0.0, 0.0, 0.375, 0.5625]

Nice! Now let’s proceed to our example.

Model training components

Let’s implement the pipeline structure we saw at the beginning of this post in code, and a function to train models using this pipeline.

Training pipeline

We will write a function that takes a basis transformer and a model as arguments, and constructs the components of the pipeline. Categorical features will be one-hot encoded, numerical features will be scaled and transformed using the given basis transformer, and finally the result will be passed as an input of the given model.

To make sure our scaled numerical features never fall outside of the $[0, 1]$ interval, even if the test-set contains values larger or smaller than what we saw in the training set, we clip the scaled value to $[0, 1]$. And to make sure we don’t inflate the dimension of our model by one-hot encoding rare categorical values, we will group values that appear fewer than 10 times into a single “infrequent” category. Here is the code:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.compose import ColumnTransformer

def training_pipeline(basis_transformer, model_estimator,
                      categorical_features, numerical_features):
  basis_feature_transformer = Pipeline([
      ('scaler', MinMaxScaler(clip=True)),
      ('basis', basis_transformer)
  ])

  categorical_transformer = OneHotEncoder(
      sparse_output=False,
      handle_unknown='infrequent_if_exist',
      min_frequency=10
  )

  preprocessor = ColumnTransformer(
      transformers=[
        ('numerical', basis_feature_transformer, numerical_features),
        ('categorical', categorical_transformer, categorical_features)
      ]
  )

  return Pipeline([
    ('preprocessor', preprocessor),
    ('model', model_estimator)
  ])
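The effect of `clip=True` in the scaler can be seen in isolation. In the sketch below, with toy numbers chosen for illustration, the test data contains values below and above the training range:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[-5.0], [2.5], [15.0]])  # contains values outside the training range

clipped = MinMaxScaler(clip=True).fit(train).transform(test)
unclipped = MinMaxScaler().fit(train).transform(test)

print(clipped.ravel())    # values 0, 0.25 and 1: always inside [0, 1]
print(unclipped.ravel())  # values -0.5, 0.25 and 1.5: outside the natural domain
```

Without clipping, the basis transformer would be asked to evaluate polynomials outside of $[0, 1]$, which is exactly the kind of extrapolation we want to avoid.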

We can now build a pipeline with Bernstein features and Ridge regression, and use it as follows:

pipeline = training_pipeline(BernsteinFeatures(), Ridge(), categorical_features, numerical_features)
test_predictions = pipeline.fit(train_df, train_df[target]).predict(test_df)

But wait! We need to know what polynomial degree to use, and maybe tune some hyperparameters of the trained model. Otherwise, the experimental results we observe may simply be due to a bad choice of hyperparameters.

Tuning hyperparameters

We need two ingredients. One is technical - how we set hyperparameters of components hidden deep inside a Pipeline. The other is how we actually tune them. For setting hyperparameters, Scikit-Learn provides an interface. There are two functions: get_params(), which returns a dictionary of all settable parameters, and set_params(), which can set parameters of all the components contained inside a pipeline. Let’s look at an example of a pipeline with BernsteinFeatures as the basis transformer, and Ridge as the model. Since Ridge has an alpha parameter, and BernsteinFeatures has a degree parameter, let’s look for those:

from sklearn.linear_model import Ridge

pipeline = training_pipeline(BernsteinFeatures(), Ridge(), [], [])
print({k:v for k,v in pipeline.get_params().items() 
           if 'degree' in k or 'alpha' in k})
# prints: {'preprocessor__numerical__basis__degree': 5, 'model__alpha': 1.0}

There is a pattern here! Looking at our training_pipeline method above, we see that there is a component named “preprocessor”, inside of which there is a component named “numerical”, that contains a “basis”. That “basis” component is our transformer, so it has a “degree”. The full name is just a concatenation of the above with double underscores. The same idea for the model. We can also set these parameters as follows:

pipeline.set_params(preprocessor__numerical__basis__degree=SOME_DEGREE)
pipeline.set_params(model__alpha=SOME_REGULARIZATION_COEFFICIENT)
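Here is a self-contained sketch of the same naming mechanism on a small stand-in pipeline. The step names and the use of PolynomialFeatures are chosen here only for illustration:

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# a small stand-in pipeline with a nested pipeline as its first step
inner = Pipeline([
    ('scaler', MinMaxScaler()),
    ('basis', PolynomialFeatures()),
])
demo_pipeline = Pipeline([
    ('numerical', inner),
    ('model', Ridge()),
])

# parameters are passed as keyword arguments, or as a dict unpacked with **
demo_pipeline.set_params(numerical__basis__degree=3)
demo_pipeline.set_params(**{'model__alpha': 0.1})

params = demo_pipeline.get_params()
print(params['numerical__basis__degree'], params['model__alpha'])  # 3 0.1
```

The dict form is the convenient one for tuning, since a tuner typically hands us a dictionary of parameter names and values.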

So now that we know how to set hyperparameters of parts within a pipeline, let’s tune them. To that end, we will use hyperopt²! It’s a nice hyperparameter tuner, very easy to use, and implements the state-of-the-art Bayesian optimization paradigm that can obtain high-quality hyperparameter configurations in a relatively small number of trials. It’s as easy to use as a grid search, available by default on Colab, and saves us precious time. And I certainly don’t want to wait long until I see the results.

To use hyperopt, we need two ingredients: a tuning objective that evaluates the performance of a given hyperparameter configuration, and a search space for hyperparameters. Writing such a tuning objective is quite easy - we will use a cross-validated score using Scikit-Learn’s built-in capabilities:

from sklearn.model_selection import cross_val_score

def tuning_objective(pipeline, metric, train_df, target, params):
    pipeline.set_params(**params)
    scores = cross_val_score(pipeline, train_df, train_df[target], scoring=metric)
    return -np.mean(scores)

Well, that wasn’t hard, but there’s an intricate detail - note that we are returning minus the average metric across folds. This is because Scikit-Learn’s metrics are built to be maximized, but hyperopt is built to minimize.
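The sign convention is easy to observe on synthetic data. The sketch below uses a made-up regression problem and a plain Ridge model, and shows that the `neg_root_mean_squared_error` scores are non-positive, so that their negated mean is the positive RMSE we want the tuner to minimize:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(200)

# the default is 5-fold cross-validation, so we get five scores
scores = cross_val_score(Ridge(), X, y, scoring='neg_root_mean_squared_error')
loss = -np.mean(scores)

print(scores)  # five non-positive numbers, one per fold
print(loss)    # the positive average RMSE
```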

Defining the hyperparameter search space is also easy - it’s just a dict specifying a distribution for each hyperparameter. For our example above with a Ridge model we can use something like this:

from hyperopt import hp

param_space = {
    'preprocessor__numerical__basis__degree': hp.uniformint('degree', 1, 50),
    'model__alpha': hp.loguniform('alpha', -10, 5)
}

Hyperopt has uniform and uniformint functions for hyperparameters that we would normally tune using a uniform grid, such as the number of layers of an NN, or the degree of a polynomial. In the code above, the degree of the polynomial is a number between 1 and 50, and all are equally likely. It also has a loguniform function for hyperparameters that we normally tune using a geometrically-spaced grid, such as a learning rate, or a regularization coefficient. In the example above, the regularization coefficient is between $e^{-10}$ and $e^5$, and all exponents are equally likely.
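To get a feeling for these two distributions, the sketch below emulates them with plain NumPy. It does not call hyperopt itself. It only mimics the sampling described above, under the assumption that a log-uniform sample is the exponential of a uniformly drawn exponent:

```python
import numpy as np

rng = np.random.default_rng(0)

# emulate a log-uniform draw on [e^-10, e^5]: sample the exponent uniformly, then exponentiate
exponents = rng.uniform(-10, 5, size=100_000)
alphas = np.exp(exponents)

# emulate a uniform integer draw: every integer in 1..50 is equally likely
degrees = rng.integers(1, 51, size=100_000)

print(alphas.min() >= np.exp(-10), alphas.max() <= np.exp(5))
print(np.median(alphas), np.exp(-2.5))  # the median sits near e^{-2.5}, the midpoint in log-space
print(degrees.min(), degrees.max())
```

Half of the sampled regularization coefficients fall below roughly $e^{-2.5} \approx 0.08$, which is what we want from a geometrically-spaced search: small and large orders of magnitude get equal attention.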

Having specified the objective function and the parameter space, we can use fmin for tuning, like this:

from hyperopt import fmin, tpe

fmin(lambda params: tuning_objective(pipeline, metric, train_df, target, params),
     space=param_space,
     algo=tpe.suggest,
     max_evals=100)

We gave fmin a function to minimize and the hyperparameter search space, told it to use the TPE algorithm for tuning³, and limited it to 100 evaluations of our tuning objective. It invokes our objective on the hyperparameter configurations it considers worth trying, and eventually returns the best configuration it found. More details can be found in hyperopt’s documentation.
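As a rough mental model of the loop that such a tuner runs, here is a sketch that swaps TPE for plain random search: propose a configuration, evaluate the objective, remember the best one. This is not hyperopt's algorithm (TPE uses the results of past trials to choose the next proposal), and toy_objective and random_search are illustrative names only.

```python
import math
import random

def toy_objective(params):
    """Toy loss with its minimum at degree=7, alpha=e^-3 (illustrative only)."""
    return (params['degree'] - 7) ** 2 + (math.log(params['alpha']) + 3) ** 2

def random_search(objective, max_evals, seed=42):
    rng = random.Random(seed)
    best_params, best_loss = None, math.inf
    for _ in range(max_evals):
        # propose: a uniform integer degree and a log-uniform alpha
        params = {
            'degree': rng.randint(1, 50),
            'alpha': math.exp(rng.uniform(-10, 5)),
        }
        loss = objective(params)       # evaluate the proposal
        if loss < best_loss:           # keep the best configuration so far
            best_params, best_loss = params, loss
    return best_params, best_loss

best_params, best_loss = random_search(toy_objective, max_evals=100)
print(best_params, best_loss)
```

The interface is the same as in the real thing: an objective, a search space, and a budget of evaluations. A smarter proposal strategy simply tends to reach a good configuration with fewer trials.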

So let’s write a function that tunes hyperparameters using cross-validation on the training set, re-trains the pipeline on the entire training set using the best hyperparameters, and evaluates the resulting model on the test set.

from hyperopt import fmin, tpe
from sklearn.metrics import get_scorer

def tune_and_evaluate_pipeline(pipeline, param_space,
                               train_df, test_df, target, metric,
                               max_evals=50, random_seed=42):
  print('Tuning params')
  def bound_tuning_objective(params):
    return tuning_objective(pipeline, metric, train_df, target, params)

  params = fmin(fn=bound_tuning_objective, # <-- this is the objective
                space=param_space,         # <-- the search space
                algo=tpe.suggest,          # <-- the algorithm to use. TPE is the most widely used.
                max_evals=max_evals,       # <-- maximum number of configurations to try
                rstate=np.random.default_rng(random_seed),
                return_argmin=False)
  print(f'Best params = {params}')

  print('Refitting with best params on the entire training set')
  pipeline.set_params(**params)
  fit_result = pipeline.fit(train_df, train_df[target])

  scorer = get_scorer(metric)
  score = scorer(fit_result, test_df, test_df[target])
  print(f'Test metric = {score:.5f}')

  return fit_result

Now we have all the ingredients in place! We can now, for example, train a tuned Ridge regression model with Bernstein polynomial features that predicts the foo column in our data-set, and measure success using the Root Mean Squared Error metric as follows:

train_df = ...
test_df = ...
categorical_features = [...]
numerical_features = [...]

pipeline = training_pipeline(BernsteinFeatures(), Ridge(), categorical_features, numerical_features)
model = tune_and_evaluate_pipeline(
  pipeline,
  param_space,
  train_df,
  test_df,
  'foo',
  'neg_root_mean_squared_error')

Now let’s put our work-horse to work!

California housing price prediction

The well-known California Housing price prediction data-set is available in the samples directory on Colab, so it will be convenient to use. Let’s load it, and print a sample:

train_df = pd.read_csv('sample_data/california_housing_train.csv')
test_df = pd.read_csv('sample_data/california_housing_test.csv')
print(train_df.head())

#    longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value
# 0    -114.31     34.19                15.0       5612.0          1283.0      1015.0       472.0         1.4936             66900.0
# 1    -114.47     34.40                19.0       7650.0          1901.0      1129.0       463.0         1.8200             80100.0
# 2    -114.56     33.69                17.0        720.0           174.0       333.0       117.0         1.6509             85700.0
# 3    -114.57     33.64                14.0       1501.0           337.0       515.0       226.0         3.1917             73400.0
# 4    -114.57     33.57                20.0       1454.0           326.0       624.0       262.0         1.9250             65500.0

The task is predicting the median_house_value column based on the other columns.

First, we can see that there are several feature columns with very large and diverse numbers. They probably have a very skewed distribution. Let’s plot those distributions:

skewed_columns = ['total_rooms', 'total_bedrooms', 'population', 'households']
axs = train_df.loc[:, skewed_columns].plot.hist(
    bins=20, subplots=True, layout=(2, 2), figsize=(8, 6))
axs.flat[0].get_figure().tight_layout()

Indeed very skewed! Typically applying a logarithm helps. Let’s plot them after applying a logarithm (note the .apply(np.log)):

axs = train_df.loc[:, skewed_columns].apply(np.log).plot.hist(
    bins=20, subplots=True, layout=(2, 2), figsize=(8, 6))
axs.flat[0].get_figure().tight_layout()

Ah, much better! We also note that the housing_median_age variable, despite being numerical, is discrete. Indeed, it has only 52 unique values in the entire dataset. So we will treat it as a categorical variable. Let’s summarize our features in code:

categorical_features = ['housing_median_age']
numerical_features = ['longitude', 'latitude', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income']
target = ['median_house_value']

So we’re almost ready to fit a model. We can see that our target variable, median_house_value, has very large magnitude values. It is usually beneficial to scale them to a smaller range. However, we would like to measure the prediction error with respect to the original values. Fortunately, Scikit-Learn provides us with a TransformedTargetRegressor class that allows scaling the target variable for the regression model, and scaling it back to the original range when producing an output.
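Conceptually, such a wrapper does the following around the inner regressor. The sketch below imitates it with plain NumPy and min-max scaling. It only illustrates the idea and is not Scikit-Learn's implementation; the numbers are the first five targets from the sample printed above.

```python
import numpy as np

y = np.array([66900.0, 80100.0, 85700.0, 73400.0, 65500.0])

# "fit" the target transformer: remember the range of the training targets
y_min, y_max = y.min(), y.max()

def scale(values):
    return (values - y_min) / (y_max - y_min)   # map to [0, 1]

def unscale(values):
    return values * (y_max - y_min) + y_min     # map back to dollars

y_scaled = scale(y)               # what the inner regressor is trained on
y_restored = unscale(y_scaled)    # what its predictions are mapped back to
print(y_scaled)
print(y_restored)
```

Since predictions are mapped back to the original range, a metric such as the RMSE is reported in the original units.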

Now we’re ready to construct our model fitting pipeline that fits a Ridge model on scaled regression targets, and transformed features:

from sklearn.linear_model import Ridge
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import MinMaxScaler

def california_housing_pipeline(basis_transformer):
    return training_pipeline(
        basis_transformer,
        TransformedTargetRegressor(
            regressor=Ridge(),
            transformer=MinMaxScaler()
        ),
        categorical_features,
        numerical_features
    )

Beautiful! Now we can use our hyperparameter tuning function to train a tuned model on our dataset. Since it’s a regression task, we will measure the Root Mean Squared Error (RMSE), implemented by the neg_root_mean_squared_error Scikit-Learn metric. So let’s begin with Bernstein polynomial features:

poly_param_space = {
    'preprocessor__numerical__basis__degree': hp.uniformint('degree', 1, 50),
    'model__regressor__alpha': hp.loguniform('C', -10, 5)
}

bernstein_pipeline = california_housing_pipeline(BernsteinFeatures())
bernstein_fit_result = tune_and_evaluate_pipeline(
    bernstein_pipeline, poly_param_space,
    train_df, test_df, target,
    'neg_root_mean_squared_error')

After a few minutes I got the following output:

Tuning params
100%|██████████| 50/50 [03:37<00:00,  4.34s/trial, best loss: 60364.25845777496]
Best params = {'model__regressor__alpha': 0.0075549014272857686, 'preprocessor__numerical__basis__degree': 50}
Refitting with best params on the entire training set
Test metric = -61559.04848

The root mean squared error (RMSE) on the test-set of the tuned model is $61559.04848$. Now let’s try the power basis:

power_basis_pipeline = california_housing_pipeline(PowerBasisFeatures())
power_basis_fit_result = tune_and_evaluate_pipeline(
    power_basis_pipeline, poly_param_space,
    train_df, test_df, target,
    metric='neg_root_mean_squared_error')

This time I got the following output:

Tuning params
100%|██████████| 50/50 [00:54<00:00,  1.10s/trial, best loss: 62205.78033504614]
Best params = {'model__regressor__alpha': 4.7685837926305776e-05, 'preprocessor__numerical__basis__degree': 31}
Refitting with best params on the entire training set
Test metric = -63534.49228

This time the RMSE is $63534.49228$. The Bernstein basis got us a $3.1\%$ improvement! If we look closer at the output, we can see that the tuned Bernstein polynomial is of degree 50, whereas the best tuned power basis polynomial is of degree 31. We already saw that high degree polynomials in the Bernstein basis are easy to regularize, and our tuner probably saw the same phenomenon, and cranked up the degree to 50.

How do our polynomial features compare to a simple linear model? Well, let’s see. To re-use all our existing code instead of writing a new pipeline, we’ll just use a “do nothing” feature transformer that implements the identity function. Note that this time there is no degree to tune.

class IdentityTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, na_value=0.):
      self.na_value = na_value

    def fit(self, input_array, y=None):
        return self

    def transform(self, input_array, y=None):
        # we are compatible with our polynomial features - NA values are zeroed-out. The rest
        # are passed through
        return np.where(np.isnan(input_array), self.na_value, input_array)

linear_param_space = {
    'model__regressor__alpha': hp.loguniform('C', -10, 5)
}

linear_pipeline = california_housing_pipeline(IdentityTransformer())
linear_fit_result = tune_and_evaluate_pipeline(
    linear_pipeline, linear_param_space,
    train_df, test_df, target,
    'neg_root_mean_squared_error')

Here is the output:

Tuning params
100%|██████████| 50/50 [00:22<00:00,  2.22trial/s, best loss: 66571.41310284132]
Best params = {'model__regressor__alpha': 0.013724898474056764}
Refitting with best params on the entire training set
Test metric = -67627.17474

The RMSE is $67627.17474$. So let’s summarize the results in the table:

	Linear	Power basis	Bernstein basis
RMSE	67627.17474	63534.49228	61559.04848
Improvement over Linear	0%	6.05%	8.97%
Tuned degree	1	31	50
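The percentages are plain relative reductions in RMSE. A short sketch of the arithmetic, using only the test metrics reported above (the improvement helper is just an illustrative name):

```python
rmse = {'linear': 67627.17474, 'power': 63534.49228, 'bernstein': 61559.04848}

def improvement(baseline, candidate):
    """Relative RMSE reduction of `candidate` over `baseline`, in percent."""
    return 100 * (baseline - candidate) / baseline

power_vs_linear = improvement(rmse['linear'], rmse['power'])
bernstein_vs_linear = improvement(rmse['linear'], rmse['bernstein'])
bernstein_vs_power = improvement(rmse['power'], rmse['bernstein'])
print(f'{power_vs_linear:.2f}% {bernstein_vs_linear:.2f}% {bernstein_vs_power:.2f}%')
```

The last number is the improvement of the Bernstein basis over the power basis, rather than over the linear model.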

Impressive! Just changing the polynomial basis gives us a visible boost, and the high degree doesn’t appear to do any harm.

Now we shall inspect our models a bit closer. That’s why we stored the fit models in the bernstein_fit_result and power_basis_fit_result variables above. Following the structure of our pipelines, to get the coefficients we can use the following function:

def get_coefs(pipeline):
  transformed_target_regressor = pipeline.named_steps['model']
  # regressor_ is the fitted clone; the plain `regressor` attribute is the unfitted template
  ridge_model = transformed_target_regressor.regressor_
  return ridge_model.coef_.ravel()

Now we can plot the polynomials! First, we will need to extract the coefficients of the numerical features, and ignore the ones corresponding to the categorical features. Next, we need to treat the coefficients of each numerical feature separately, and plot the polynomial they represent. Since our numerical features are always scaled to $[0, 1]$, plotting amounts to evaluating our polynomials on a dense grid in $[0, 1]$. So this is our plotting function:

import math
import matplotlib.pyplot as plt

def plot_feature_curves(pipeline, basis_transformer_ctor, numerical_features):
  # get the coefficients and the degree
  degree = pipeline.get_params()['preprocessor__numerical__basis__degree']
  coefs = get_coefs(pipeline)

  # extract the numerical features, and form a matrix, such that the 
  # coefficients of each feature is in a separate row.
  numerical_slice = pipeline.get_params()['preprocessor'].output_indices_['numerical']
  feature_coefs = coefs[numerical_slice].reshape(-1, degree)

  # form the basis Vandermonde matrix on [0, 1]
  xs = np.linspace(0, 1, 1000)
  xs_vander = basis_transformer_ctor(degree=degree).fit_transform(xs)

  # do the plotting
  n_cols = 3
  n_rows = math.ceil(len(numerical_features) / n_cols)
  fig, axs = plt.subplots(n_rows, n_cols, figsize=(3 * n_cols, 3 * n_rows))
  for i, (ax, coefs) in enumerate(zip(axs.ravel(), feature_coefs)):
    ax.plot(xs, xs_vander @ coefs)
    ax.set_title(numerical_features[i])
  fig.show()

A bit lengthy, but understandable.

Recalling our previous post, we know that the coefficients in the Bernstein basis are actually “control points”, so let’s add the ability to plot them as well to the above function:

import math
import matplotlib.pyplot as plt

def plot_feature_curves(pipeline, basis_transformer_ctor, numerical_features,
                        plot_control_pts=True):
  # get the coefficients and the degree
  degree = pipeline.get_params()['preprocessor__numerical__basis__degree']
  coefs = get_coefs(pipeline)

  # extract the numerical features, and form a matrix, such that the 
  # coefficients of each feature is in a separate row.
  numerical_slice = pipeline.get_params()['preprocessor'].output_indices_['numerical']
  feature_coefs = coefs[numerical_slice].reshape(-1, degree)

  # form the basis Vandermonde matrix on [0, 1]
  xs = np.linspace(0, 1, 1000)
  xs_vander = basis_transformer_ctor(degree=degree).fit_transform(xs)

  # do the plotting
  n_cols = 3
  n_rows = math.ceil(len(numerical_features) / n_cols)
  fig, axs = plt.subplots(n_rows, n_cols, figsize=(3 * n_cols, 3 * n_rows))
  for i, (ax, coefs) in enumerate(zip(axs.ravel(), feature_coefs)):
    if plot_control_pts: 
      control_xs = (1 + np.arange(len(coefs))) / len(coefs)
      ax.scatter(control_xs, coefs, s=30, facecolors='none', edgecolor='b', alpha=0.5)
    ax.plot(xs, xs_vander @ coefs)
    ax.set_title(numerical_features[i])
  fig.show()

Now let’s see our Bernstein polynomials!

plot_feature_curves(bernstein_fit_result, BernsteinFeatures, numerical_features)

What about the power basis? Let’s take a look as well. Note, that we won’t plot the coefficients as “control points”, since the coefficients of the power basis are not control points in any way.

plot_feature_curves(power_basis_fit_result, PowerBasisFeatures, numerical_features, plot_control_pts=False)

Look at the “households” and “total_bedrooms” polynomials. It seems that they’re “going crazy” near the boundary of the domain. This is what we expected - the power basis was not specifically designed to approximate functions on $[0, 1]$, and it’s hard to regularize it to produce a good fit. It will either under-fit, or over-regularize.

In fact, we may recall that the “natural domain” of the power basis is the complex unit circle. It may be interesting to try representing periodic features, such as the time of day using the power basis, since such features naturally map to a point on a circle. However, there are other challenges involved, such as ensuring that our model will be real-valued rather than complex-valued, and this may be a nice subject for another post.

Summary

This was a nice adventure. I certainly learned a lot about Scikit-Learn while writing this post, and I hope that the transformer for producing the Bernstein basis may be useful to you as well. We note that polynomial non-linear features have a nice property - they have only one tunable hyperparameter, the degree, so learning a tuned model should be computationally cheaper compared to other alternatives, such as radial basis functions.

Looking again at the Bernstein polynomials above, we see that they are a bit ‘wiggly’ and the control points seem like a mess. In the previous post we learned how to smooth them out by regularizing their second derivative. Moreover, in the beginning of this post we said something interesting - the predictive power of simple models may be improved by incorporating interactions between features. So in the next posts we’re going to do exactly that - enhance our transformer to model feature interactions, and write an enhanced version of the Ridge estimator to smooth polynomial features. Stay tuned!

¹ I wouldn’t even call it extrapolation - in our context I think of the polynomial basis as “undefined” outside of its natural domain.
² Bergstra, James, Daniel Yamins, and David Cox. “Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures.” International conference on machine learning. PMLR, 2013.
³ Watanabe, S., 2023. Tree-structured Parzen estimator: Understanding its algorithm components and their roles for better empirical performance. arXiv preprint arXiv:2304.11127.

“Keeping the polynomial monster under control”

2024-01-25T00:00:00+00:00

A recap

In the previous post we saw that the Bernstein polynomials can be used to fit a high-degree polynomial curve with ease, without its shape going out of control. In this post we’ll look at the Bernstein polynomials in more depth, both experimentally and theoretically. First, we will explore the Bernstein polynomials $\mathbb{B}_n = \{ b_{0,n}, \dots, b_{n, n} \}$, where

\[b_{i,n}(x) = \binom{n}{i} x^i (1-x)^{n-i},\]

empirically and visually. We will see how to use the coefficients to achieve a higher degree of control over the shape of the function we fit. Then, we’ll explore them more theoretically, and see that they are indeed a basis - they represent the same model class as the classical power basis $\{1, x, x^2, \dots, x^n\}$. All the results are reproducible from this notebook.

Shape preserving properties

To study the shape preserving properties, we will rely on the bernvander function we’ve implemented in the last post, that given the numbers $x_1, \dots, x_m$, computes the Bernstein Vandermonde matrix of a given degree $n$, that contains all the polynomials evaluated at all the given points:

\[\begin{pmatrix} b_{0,n}(x_1) & b_{1, n}(x_1) &\dots & b_{n, n}(x_1) \\ b_{0,n}(x_2) & b_{1, n}(x_2) &\dots & b_{n, n}(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ b_{0,n}(x_m) & b_{1, n}(x_m) &\dots & b_{n, n}(x_m) \\ \end{pmatrix}\]

This is something we should have probably done earlier, but let’s plot the Bernstein polynomials to see what they look like. Below, we plot the basis $\mathbb{B}_{7}$ using the bernvander function.

import matplotlib.pyplot as plt
import numpy as np

plt_xs = np.linspace(0, 1, 1000)
bernstein_basis = bernvander(plt_xs, deg=7)

plt.plot(plt_xs, bernstein_basis, 
         label=[f'$b_{{{i},7}}$' for i in range(8)])
plt.legend(ncols=2)
plt.show()

We can see that each polynomial is a “hill” whose maxima appear equally spaced. So are they? Let’s add vertical bars using the axvline function to verify:

plt_xs = np.linspace(0, 1, 1000)
bernstein_basis = bernvander(plt_xs, deg=7)

plt.plot(plt_xs, bernstein_basis, 
         label=[f'$b_{{{i},7}}$' for i in range(8)])
for x in np.linspace(0, 1, 8):
  plt.axvline(x, color='gray', linestyle='dotted')
plt.legend(ncols=2)
plt.show()
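The same check can also be done numerically rather than by eye. The sketch below uses a small stand-in for bernvander, here called bernstein_basis (an assumed helper written for this example, not the function from the previous post), and locates the maximum of each basis polynomial on a fine grid.

```python
import math
import numpy as np

def bernstein_basis(x, n):
    """Stand-in for bernvander: column i holds b_{i,n} evaluated at x."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    i = np.arange(n + 1)
    coef = np.array([math.comb(n, k) for k in i], dtype=float)
    return coef * x ** i * (1 - x) ** (n - i)

n = 7
grid = np.linspace(0, 1, 7001)           # step 1/7000, so the points i/7 are on the grid
basis = bernstein_basis(grid, n)
argmax_x = grid[basis.argmax(axis=0)]    # where each column attains its maximum

print(argmax_x)
print(np.arange(n + 1) / n)
```

The two printed arrays should agree up to the grid resolution.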

It indeed appears so - the maxima of the polynomials are at $\{ \tfrac{i}{n}\}_{i=0}^n$. We won’t prove it formally, but that’s not hard. Now we can have some interesting insights. Suppose we have a polynomial written in Bernstein form, namely, as a weighted sum of Bernstein polynomials:

\[f(x) = \sum_{i=0}^n u_i b_{i,n}(x)\]

Recall from the previous post that the Bernstein polynomials sum to one. Since they are also nonnegative on $[0, 1]$, $f(x)$ is just a weighted average of the coefficients $u_0, \dots, u_n$. And since $b_{i,n}$ peaks at $x=\frac{i}{n}$, the weight of $u_i$ in the weighted average at that point dominates the weights of the other coefficients. In other words,

$u_i$ controls the polynomial $f(x)$ in the vicinity of the point $\frac{i}{n}$.
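Both facts are easy to verify numerically. The sketch below again uses an assumed stand-in, bernstein_basis, in place of bernvander, and an arbitrary coefficient vector chosen only for illustration.

```python
import math
import numpy as np

def bernstein_basis(x, n):
    """Stand-in for bernvander: column i holds b_{i,n} evaluated at x."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    i = np.arange(n + 1)
    coef = np.array([math.comb(n, k) for k in i], dtype=float)
    return coef * x ** i * (1 - x) ** (n - i)

n = 7
xs = np.linspace(0, 1, 201)
B = bernstein_basis(xs, n)

# the weights are nonnegative and sum to one at every x
row_sums = B.sum(axis=1)

# so f(x) is a weighted average of the coefficients and stays inside their range
u = np.array([0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 1.0, 0.3])
f = B @ u
print(f.min(), f.max(), u.min(), u.max())

# at x = i / n the largest weight belongs to u_i
i = 3
weights_at_node = bernstein_basis([i / n], n)[0]
dominant = int(weights_at_node.argmax())
print(dominant)
```

Note that the polynomial does not pass through the coefficients; it is only pulled towards $u_i$ near $\frac{i}{n}$.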

In fact, the name often given to the coefficients $u_0, \dots, u_n$ is “control points”. To visualize this observation, let’s see what happens if we change one coefficient, $u_3$, of a 7-th degree polynomial using an animation:

from matplotlib.animation import FuncAnimation, PillowWriter

n = 7
n_frames = 50

ctrl_xs = np.linspace(0, 1, 1 + n)      # the points i / n
w_init = np.cos(2 * np.pi * ctrl_xs)    # initial coefficients
plt_vander = bernvander(plt_xs, deg=n)  # bernstein basis at plot points

fig, ax = plt.subplots()
def animate(i):
  # animate the coefficients "w"
  t = np.sin(2 * np.pi * i / n_frames)
  w = np.array(w_init)
  w[3] = (1 - t) * w[3] + t * 3

  # plot the Bernstein polynomial and the coefficients at i / n
  ax.clear()
  ax.set_xlim([-0.05, 1.05])
  ax.set_ylim([-3, 3])
  control_plot = ax.scatter(ctrl_xs, w, color='red')        # plot control points
  poly_plot = ax.plot(plt_xs, plt_vander @ w, color='blue') # plot the polynomial
  return poly_plot, control_plot

ani = FuncAnimation(fig, animate, n_frames)
ani.save('control_coefficients.gif', dpi=300, writer=PillowWriter(fps=25))

We get the following result:

Looks nice! We can indeed see where the name “control points” comes from. But what can we say about it formally? Well, there are several results. The most famous one is the constructive proof of the Weierstrass approximation theorem:

Theorem [Lorentz¹, 1952] Suppose $g(x)$ is continuous in $[0, 1]$. Then the polynomials $\sum_{i=0}^n g(\tfrac{i}{n}) b_{i,n}(x)$ uniformly converge to $g(x)$ as $n \to \infty$.
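To see the theorem at work, we can plug the values $g(\tfrac{i}{n})$ in as coefficients and measure the largest error on a grid for growing $n$. The sketch below assumes $g(x) = \sin(2\pi x)$ purely as an example, and uses an assumed stand-in, bernstein_basis, for bernvander.

```python
import math
import numpy as np

def bernstein_basis(x, n):
    """Stand-in for bernvander: column i holds b_{i,n} evaluated at x."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    i = np.arange(n + 1)
    coef = np.array([math.comb(n, k) for k in i], dtype=float)
    return coef * x ** i * (1 - x) ** (n - i)

def g(x):
    return np.sin(2 * np.pi * x)

xs = np.linspace(0, 1, 501)
errors = {}
for n in (5, 20, 80):
    nodes = np.arange(n + 1) / n
    approx = bernstein_basis(xs, n) @ g(nodes)    # sum_i g(i/n) b_{i,n}(x)
    errors[n] = np.max(np.abs(approx - g(xs)))    # largest error on the grid
print(errors)
```

The error shrinks as the degree grows, but rather slowly. This construction is a proof device and an interpretation aid, not a recipe for the most accurate approximation.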

As a consequence, we can interpret the Bernstein coefficient $u_i$ as the value of some function $g$ that our polynomial approximates at $x=\frac{i}{n}$. Equipped with this idea, we can ask ourselves a simple question. What if the coefficients are increasing? Will the polynomial be an increasing function?

Well, it turns out the answer is yes - we can force the polynomial to be an increasing function of $x$ by making sure the coefficients are increasing. In fact, we have even more interesting things we can formally say. To do that, let’s look at the derivatives of polynomials in Bernstein form. Suppose that

\[f(x) = \sum_{i=0}^n u_i b_{i,n}(x),\]

then the first and second derivatives are:

\[\begin{align} f'(x) &= n \sum_{i=0}^{n-1} (u_{i+1} - u_i) b_{i,n-1}(x) \\ f''(x) &= n (n-1) \sum_{i=0}^{n-2} (u_{i+2} - 2 u_{i+1} + u_i) b_{i,n-2}(x) \end{align}\]

The first derivative is a weighted sum of the coefficient first order differences $u_{i+1}-u_i$, whereas the second derivative is a weighted sum of the second order differences $u_{i+2}-2u_{i+1}+u_i$. Therefore, we can conclude that:

Theorem [Chang et al.², 2007, Proposition 1] Given $f(x) = \sum_{i=0}^n u_i b_{i,n}(x)$ on $[0, 1]$:

If $u_{i+1} - u_i \geq 0$ for all $i$, then $f'(x) \geq 0$, and $f$ is nondecreasing.

If $u_{i+1} - u_i \leq 0$ for all $i$, then $f'(x) \leq 0$, and $f$ is nonincreasing.

If $u_{i+2} - 2u_{i+1} + u_i \geq 0$ for all $i$, then $f''(x) \geq 0$, and $f$ is convex.

If $u_{i+2} - 2u_{i+1} + u_i \leq 0$ for all $i$, then $f''(x) \leq 0$, and $f$ is concave.
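A quick numerical sanity check of two of these statements, on the interval $[0, 1]$. The coefficient vectors below are arbitrary examples, and bernstein_basis is an assumed stand-in for bernvander.

```python
import math
import numpy as np

def bernstein_basis(x, n):
    """Stand-in for bernvander: column i holds b_{i,n} evaluated at x."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    i = np.arange(n + 1)
    coef = np.array([math.comb(n, k) for k in i], dtype=float)
    return coef * x ** i * (1 - x) ** (n - i)

n = 10
xs = np.linspace(0, 1, 1001)
B = bernstein_basis(xs, n)
rng = np.random.default_rng(0)

# nondecreasing coefficients -> nondecreasing polynomial
u_inc = np.sort(rng.normal(size=n + 1))
f_inc = B @ u_inc

# coefficients with nonnegative second differences -> convex polynomial
u_cvx = (np.arange(n + 1) / n - 0.3) ** 2
f_cvx = B @ u_cvx

print(np.diff(f_inc).min())       # smallest step of f along the grid
print(np.diff(f_cvx, 2).min())    # smallest second difference of f along the grid
```

Both printed values should be nonnegative, up to floating point rounding. Keep in mind that these conditions are sufficient, not necessary: a monotone polynomial may well have non-monotone Bernstein coefficients.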

An important application of fitting nondecreasing functions, for example, is fitting a CDF. One practical example of CDF fitting is the bid shading problem³⁴⁵ in online advertising. We are required to model the probability of winning an ad auction given a bid $x$. Naturally, the winning probability should increase when the bid $x$ increases. Another important example is calibration curves⁶⁷⁸ in classification models, which are functions that map the model’s score to a probability such that the mean predicted probability conforms to the true conditional probability of the label given the features. The curve should be increasing - the higher the score, the higher the probability it represents. See this great tutorial in the Scikit-Learn documentation.

The simplest way to impose constraints on the coefficients when fitting models on small-scale data is using the CVXPY library, which we already encountered in previous posts in this blog. The library allows solving arbitrary convex optimization problems, specified by the function to minimize, and a set of constraints. Let’s see how we can use CVXPY to fit a nondecreasing Bernstein polynomial. First, we define the function and use it to generate noisy data:

def nondecreasing_func(x):
  return (3 - 2 * x) * (x ** 2) * np.exp(x)

# define number of points and noise
m = 30
sigma = 0.2

np.random.seed(42)
x = np.random.rand(m)
y = nondecreasing_func(x) + sigma * np.random.randn(m)

Now, we define the model fitting as an optimization problem with constraints. Mathematically, we aim to minimize the L2 loss subject to coefficient monotonicity constraints:

\[\begin{align} \min_{\mathbf{u}} & \quad \| \mathbf{V} \mathbf{u} - \mathbf{y} \|^2 \\ \text{s.t.} &\quad u_{i+1} \geq u_i && i = 0, \dots, n-1 \end{align}\]

The matrix $\mathbf{V}$ is the Bernstein Vandermonde matrix at $x_1, \dots, x_m$. When multiplied by $\mathbf{u}$ we obtain the values of the polynomial in Bernstein form at each of the data points. The following CVXPY code is just a direct formulation of the above for fitting a polynomial of degree $n=20$:

import cvxpy as cp

deg = 20
u = cp.Variable(deg + 1)                          # a placeholder for the optimal Bernstein coefficients
loss = cp.sum_squares(bernvander(x, deg) @ u - y) # The L2 loss - the sum of residual squares
constraints = [cp.diff(u) >= 0]                   # constraints - u_{i+1} - u_i >= 0
problem = cp.Problem(cp.Minimize(loss), constraints)

# solve the minimization problem and extract the optimal coefficients
problem.solve()
u_opt = u.value

Now, let’s plot the points, the original, and the fit functions:

plt.scatter(x, y, color='red')
plt.plot(plt_xs, nondecreasing_func(plt_xs), color='blue')
plt.plot(plt_xs, bernvander(plt_xs, deg) @ u_opt, color='green')

Not bad, given the level of noise, and the fact that we have no regularization whatsoever! For larger scale problems we will typically use an ML framework, such as PyTorch or TensorFlow, and they do not provide mechanisms to impose hard constraints on parameters. Therefore, when using such frameworks, we need to use a regularization term that penalizes violation of our desired constraints. For example, to penalize for violating the nondecreasing constraint, we can use the regularizer:

\[r(\mathbf{u}) = \sum_{i=0}^{n-1} \max(0, u_{i} - u_{i+1})^2\]
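Here is a minimal NumPy sketch of this penalty, summing over all consecutive pairs of coefficients. The function name monotonicity_penalty is made up for this example; in PyTorch or TensorFlow the same few lines would be written with the framework's own tensor operations and added to the loss.

```python
import numpy as np

def monotonicity_penalty(u):
    """Sum of squared violations of u[i] <= u[i+1] over consecutive pairs."""
    violations = np.maximum(0.0, u[:-1] - u[1:])   # positive only where u decreases
    return np.sum(violations ** 2)

u_ok = np.array([0.0, 0.5, 0.5, 2.0])    # nondecreasing: no penalty
u_bad = np.array([0.0, 1.0, 0.5, 2.0])   # one drop of 0.5 between index 1 and 2

print(monotonicity_penalty(u_ok))    # 0.0
print(monotonicity_penalty(u_bad))   # 0.25
```

Unlike a hard constraint, the penalty only discourages violations, so the fitted coefficients may still decrease slightly if the regularization weight is small.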

Looking at the curve above, we see that it’s a bit wiggly. Can we do something about it? Looking at the second derivative formula above, we can “smooth out” the curve by adding a regularization term that penalizes the second order differences. This will, in turn, penalize the second order derivative. Why second order? Because ideally, when the second order differences are zero, we’ll get a straight line. So we’re “smoothing out” the curve to be more straight.
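The “straight line” claim can be checked directly: equally spaced coefficients have zero second order differences, and the resulting polynomial is exactly a line. In the sketch below the line $1 + 2x$ is an arbitrary example, and bernstein_basis is an assumed stand-in for bernvander.

```python
import math
import numpy as np

def bernstein_basis(x, n):
    """Stand-in for bernvander: column i holds b_{i,n} evaluated at x."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    i = np.arange(n + 1)
    coef = np.array([math.comb(n, k) for k in i], dtype=float)
    return coef * x ** i * (1 - x) ** (n - i)

n = 20
u = 1.0 + 2.0 * np.arange(n + 1) / n     # u_i = 1 + 2 * (i / n): equally spaced
xs = np.linspace(0, 1, 101)
f = bernstein_basis(xs, n) @ u

print(np.abs(np.diff(u, 2)).max())          # second differences vanish (up to rounding)
print(np.abs(f - (1.0 + 2.0 * xs)).max())   # f coincides with the line 1 + 2x
```

So a very large penalty on the second order differences pushes the fit towards a linear function, and smaller values give something in between.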

Mathematically, we’ll need to solve:

\[\begin{align} \min_{\mathbf{u}} & \quad \| \mathbf{V} \mathbf{u} - \mathbf{y} \|^2 + \alpha \sum_{i=0}^{n-2} (u_{i+2} - 2 u_{i+1} + u_i)^2 \\ \text{s.t.} &\quad u_{i+1} \geq u_i && i = 0, \dots, n-1 \end{align}\]

where $\alpha$ is a tuned regularization parameter. The code in CVXPY, after tuning $\alpha$, looks like this:

deg = 20
alpha = 2

u = cp.Variable(deg + 1)                          # a placeholder for the optimal Bernstein coefficients
loss = cp.sum_squares(bernvander(x, deg) @ u - y) # The L2 loss - the sum of residual squares
reg = alpha * cp.sum_squares(cp.diff(u, 2))       # penalty for 2nd order differences
constraints = [cp.diff(u) >= 0]                   # constraints - u_{i+1} - u_i >= 0
problem = cp.Problem(cp.Minimize(loss + reg), constraints)

After solving the problem and plotting the polynomial, I obtained this:

Not bad! Now we will study the Bernstein basis from a more theoretical perspective to understand its representation power.

The Bernstein polynomials as a basis

So, is it really a basis? If it is, then there should be a simple transition matrix for going back and forth between the standard and the Bernstein basis. In this case, solving a regression problem with both bases should be equivalent. So why should we bother working with the Bernstein basis? We explore those questions below.

First, let’s begin by showing that it’s indeed a basis. Note that the set $\mathbb{B}_n$ of n-th degree Bernstein polynomials indeed has $n+1$ polynomial functions. So it remains to be convinced that any polynomial of degree at most $n$ can be expressed as a weighted sum of these $n+1$ functions. It turns out that for any $k \leq n$, we can write: $x^k = \sum_{j=k}^n \frac{\binom{j}{k}}{\binom{n}{k}} b_{j, n}(x) = \sum_{j=k}^n q_{j,k} b_{j,n}(x)$

The proof is a bit technical and involved, and requires the inverse binomial transform, but it gives us our desired result: any power of $x$ up to $n$ can be expressed using Bernstein polynomials. Consequently, any polynomial of degree up to $n$ can be expressed as a weighted sum of Bernstein polynomials, and therefore:

The representation power of Bernstein polynomials is identical to that of the standard basis. Both represent the same model class we fit to data.

Using Bernstein polynomials, in itself, does not restrict or regularize the model class, since any polynomial can be written in Bernstein form. The Bernstein form is just easier to regularize.
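The identity for the powers of $x$ is easy to check numerically for every $k$ from $0$ to $n$. The sketch below does so for $n = 6$, an arbitrary choice, with an assumed stand-in, bernstein_basis, for bernvander.

```python
import math
import numpy as np

def bernstein_basis(x, n):
    """Stand-in for bernvander: column i holds b_{i,n} evaluated at x."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    i = np.arange(n + 1)
    coef = np.array([math.comb(n, k) for k in i], dtype=float)
    return coef * x ** i * (1 - x) ** (n - i)

n = 6
xs = np.linspace(0, 1, 50)
B = bernstein_basis(xs, n)

max_err = 0.0
for k in range(n + 1):
    # q_{j,k} = C(j, k) / C(n, k); math.comb returns 0 when j < k
    q = np.array([math.comb(j, k) / math.comb(n, k) for j in range(n + 1)])
    max_err = max(max_err, np.max(np.abs(B @ q - xs ** k)))
print(max_err)
```

The printed error should be at the level of floating point rounding.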

This observation leads to some interesting insights, which will be easier to describe by writing the standard and the Bernstein bases as vectors:

\[\mathbf{p}_n(x)=(1, x, x^2, \cdots, x^n)^T, \qquad \mathbf{b}_n(x)=(b_{0,n}(x), \cdots, b_{n,n}(x))^T\]

We note that the standard and Bernstein Vandermonde matrix rows we saw in the previous post are exactly $\mathbf{p}_n(x_i)$, and $\mathbf{b}_n(x_i)$, respectively. Using this notation, we can write the powers of $x$ in terms of the Bernstein basis in matrix form, by gathering the coefficients $q_{j,k}$ above into a matrix $\mathbf{Q}_n$, assuming that $q_{j,k}=0$ whenever $j < k$:

\[\mathbf{p}_n(x)^T = \mathbf{b}_n(x)^T \mathbf{Q}_n\]

The matrix $\mathbf{Q}_n$ is the basis transition matrix - it can transform any polynomial written using the standard basis to the same polynomial written in the Bernstein basis:

\[a_0 + a_1 x + \dots + a_n x^n = \mathbf{p}_n(x)^T \mathbf{a} = \mathbf{b}_n(x)^T \mathbf{Q}_n \mathbf{a}\]

The vector $\mathbf{Q}_n \mathbf{a}$ is the coefficient vector w.r.t. the Bernstein basis. Does it mean we can actually fit a polynomial in the standard basis, but regularize it as if it was written in the Bernstein basis? Well, yes we can! Polynomial fitting in the Bernstein basis can be written as

\[\min_{\mathbf{w}} \quad \frac{1}{2}\sum_{i=1}^m (\mathbf{b}_n(x_i)^T \mathbf{w} - y_i)^2 + \frac{\alpha}{2} \| \mathbf{w} \|^2.\]

The constants $\frac{1}{2}$ are for convenience later, when taking derivatives. Introducing the change of variables $\mathbf{w} = \mathbf{Q}_n \mathbf{a}$, the above problem becomes equivalent to:

\[\min_{\mathbf{a}} \quad \frac{1}{2} \sum_{i=1}^m (\mathbf{p}_n(x_i)^T \mathbf{a} - y_i)^2 + \frac{\alpha}{2} \| \mathbf{Q}_n \mathbf{a} \|^2. \tag{P}\]

Thus, we can fit a polynomial in terms of its standard basis coefficients $\mathbf{a}$, but regularize its Bernstein coefficients $\mathbf{Q}_n \mathbf{a}$. So does it really work? Let’s check! First, let’s implement the transition matrix function:

import numpy as np
from scipy.special import binom

def basis_transition(n):
  ks = np.arange(0, 1 + n)
  js = np.arange(0, 1 + n).reshape(-1, 1)
  Q = binom(js, ks) / binom(n, ks)
  Q = np.tril(Q)
  return Q

The regularized least-squares problem (P) above is a convex problem that can be easily solved by equating the gradient w.r.t $\mathbf{a}$ with zero. Putting all the $\mathbf{p}_n(x_i)$ for the data points $i = 1, \dots, m$ into the rows of the Vandermonde matrix $\mathbf{V}$, equating the gradient to zero becomes:

\[\mathbf{V}^T (\mathbf{V} \mathbf{a} - \mathbf{y}) + \alpha \mathbf{Q}_n^T \mathbf{Q}_n \mathbf{a} = 0.\]

Re-arranging, and solving for the coefficients $\mathbf{a}$, we obtain:

\[\mathbf{a} = (\mathbf{V}^T \mathbf{V} + \alpha \mathbf{Q}_n^T \mathbf{Q}_n)^{-1} \mathbf{V}^T \mathbf{y}\]

So let’s implement the fitting procedure:

import numpy.polynomial.polynomial as poly

def fit_bernstein_reg(x, y, alpha, deg):
  """ Fit a polynomial in the standard basis to the data-points `(x[i], y[i])` with Bernstein
      regularization `alpha`, and degree `deg`.
  """
  V = poly.polyvander(x, deg)
  Q = basis_transition(deg)
  
  A = V.T @ V + alpha * Q.T @ Q
  b = V.T @ y
  
  # solve the linear system 
  a = np.linalg.solve(A, b)
  return a

Now, let’s try reproducing the results of the previous post with degrees 50 and 100.

def true_func(x):
  return np.sin(8 * np.pi * x) / np.exp(x) + x

# define number of points and noise
m = 30
sigma = 0.1
deg = 50

# generate features
np.random.seed(42)
x = np.random.rand(m)
y = true_func(x) + sigma * np.random.randn(m)

# fit the polynomial
a = fit_bernstein_reg(x, y, 5e-4, deg=deg)

# plot the original function, the points, and the fit polynomial
plt_xs = np.linspace(0, 1, 1000)
polynomial_ys = poly.polyvander(plt_xs, deg) @ a
plt.scatter(x, y)
plt.plot(plt_xs, true_func(plt_xs), 'blue')
plt.plot(plt_xs, polynomial_ys, 'red')
plt.show()

I got the following plot, which appears pretty similar to what we got in the previous post, but slightly worse:

Let’s crank up the degree to 100 by setting deg = 100. I got the following image:

Again, slightly worse than what we achieved by directly fitting the Bernstein form, but appears close.

There are two technical issues with our idea. First, manually fitting models rather than relying on standard tools, such as SciKit-Learn, is cumbersome, and in terms of computational efficiency, we need to deal with the additional matrix $\mathbf{Q}_n$. Second, and most importantly, the standard Vandermonde matrix and the basis transition matrix $\mathbf{Q}_n$ are extremely ill conditioned⁹. This makes it hard to actually solve the fitting problem and obtain coefficients that are close to the true optimal coefficients. This is true regardless of whether we choose direct matrix inversion, CVXPY, or an SGD-based optimizer from PyTorch or TensorFlow.
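
To get a feeling for the ill conditioning, here is a small sketch that computes condition numbers with `np.linalg.cond`. The degree 10 and the 200 uniformly spaced points are my own choices for illustration, and the Bernstein "Vandermonde" matrix is included only as a reference point. The exact numbers depend on these choices, so I do not quote them here.

```python
import numpy as np
import numpy.polynomial.polynomial as poly
from scipy import special, stats

deg = 10                                # modest degree, assumed for illustration
xs = np.linspace(0, 1, 200)

# standard Vandermonde matrix and its Bernstein counterpart
V_std = poly.polyvander(xs, deg)
V_bern = stats.binom.pmf(np.arange(deg + 1), deg, xs.reshape(-1, 1))

# basis transition matrix Q_n, built as in basis_transition above
ks = np.arange(deg + 1)
Q = np.tril(special.binom(ks.reshape(-1, 1), ks) / special.binom(deg, ks))

cond_std = np.linalg.cond(V_std)
cond_bern = np.linalg.cond(V_bern)
cond_Q = np.linalg.cond(Q)
print(f"standard Vandermonde:  {cond_std:.3e}")
print(f"Bernstein Vandermonde: {cond_bern:.3e}")
print(f"transition matrix Q:   {cond_Q:.3e}")
```

Even at this low degree the standard Vandermonde matrix should come out far worse conditioned than the Bernstein one, and the gap widens as the degree grows.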

Due to inefficiency and ill conditioning, this trick has little value in practice. But it provides us with an important insight: achieving good regularization requires a sophisticated non-diagonal matrix in the regularization term. It’s not a formal statement, but probably any “good” basis will have a non-diagonal transition matrix to the standard basis. This means that fitting a polynomial in the standard basis using typical ML tricks of rescaling the columns of the Vandermonde matrix has little chance of success. And it doesn’t matter if we rescale using min-max scaling, or standardization to zero mean and unit variance. To fit a polynomial, we need to use a “good” basis directly.

Conclusion

In this post we explored the ability of the Bernstein form to control the shape of the curve we’re fitting - either making it smooth, increasing, decreasing, convex, or concave. Then, we saw that Bernstein polynomials are just polynomials - they have the same representation power as the standard basis, but are just easier to regularize.

The next post will be more engineering oriented. We’ll see how to use the Bernstein basis for feature engineering and fitting models to some real-world data-sets, and we will write a SciKit-Learn transformer to do so. Stay tuned!

Lorentz, G. G. (1952). Bernstein Polynomials. University of Toronto Press. ↩
Chang, I. S., Chien, L. C., Hsiung, C. A., Wen, C. C., & Wu, Y. J. (2007). Shape restricted regression with random Bernstein polynomials. Lecture Notes-Monograph Series, 187-202. ↩
Sluis, S. (2019). Everything you need to know about bid shading. ↩
Karlsson, N., & Sang, Q. (2021, May). Adaptive bid shading optimization of first-price ad inventory. In 2021 American Control Conference (ACC) (pp. 4983-4990). IEEE. ↩
Gligorijevic, D., Zhou, T., Shetty, B., Kitts, B., Pan, S., Pan, J., & Flores, A. (2020, October). Bid shading in the brave new world of first-price auctions. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management (pp. 2453-2460). ↩
Niculescu-Mizil, A., & Caruana, R. (2005, August). Predicting good probabilities with supervised learning. In Proceedings of the 22nd international conference on Machine learning (pp. 625-632). ↩
Platt, J. (1999). Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers, 10(3), 61-74. ↩
Zadrozny, B., & Elkan, C. (2002, July). Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 694-699). ↩
Intuitively, a matrix is ill conditioned if numerical algorithms fail to accurately perform computations with this matrix, such as matrix multiplication, solving a linear system, or training a machine learned model. ↩

“Are polynomial features the root of all evil?”

2024-01-21T00:00:00+00:00

A myth

When fitting a non-linear model using linear regression, we typically generate new features using non-linear functions. We also know that any function, in theory, can be approximated by a sufficiently high degree polynomial. This result is known as the Weierstrass approximation theorem. But many blogs, papers, and even books tell us that high degree polynomials should be avoided. They tend to oscillate and overfit, and regularization doesn’t help! They even scare us with images, such as the one below, where the polynomial fit using the data points (in red) is far away from the true function (in blue):

It turns out that it’s just a MYTH. There’s nothing inherently wrong with high degree polynomials, and in contrast to what is typically taught, high degree polynomials are easily controlled using standard ML tools, like regularization. The myth stems mainly from two misconceptions about polynomials that we will explore here. In fact, not only are they great non-linear features, certain representations also provide us with powerful control over the shape of the function we wish to learn.

A colab notebook with the code for reproducing the above results is available here.

Approximation vs estimation

Vladimir Vapnik, in his famous book “The Nature of Statistical Learning Theory” which is cited more than 100,000 times as of today, coined the approximation vs. estimation balance. The approximation power of a model is its ability to represent the “reality” we would like to learn. Typically, approximation power increases with the complexity of the model - more parameters mean more power to represent any function to arbitrary precision. Polynomials are no different - higher degree polynomials can represent functions to higher accuracy. However, more parameters make it difficult to estimate these parameters from the data.

Indeed, higher degree polynomials have a higher capacity to approximate arbitrary functions. And since they have more coefficients, these coefficients are harder to estimate from data. But how does it differ from other non-linear features, such as the well-known radial basis functions? Why do polynomials have such a bad reputation? Are they truly hard to estimate from data?

It turns out that the primary source is the standard polynomial basis for $n$-degree polynomials $\mathbb{E}_n = \{1, x, x^2, \dots, x^n\}$. Indeed, any degree $n$ polynomial can be written as a linear combination of these functions:

\[\alpha_0 \cdot 1 + \alpha_1 \cdot x + \alpha_2 \cdot x^2 + \cdots + \alpha_n x^n\]

But the standard basis $\mathbb{E}_n$ is awful for estimating polynomials from data. In this post we will explore other ways to represent polynomials that are appropriate for machine learning, and are readily available in standard Python packages. We note that one advantage of polynomials over other non-linear feature bases is that the only hyperparameter is their degree. There is no “kernel width”, like in radial basis functions¹.

The second source of their bad reputation is a misunderstanding of Weierstrass’ approximation theorem. It’s usually cited as “polynomials can approximate arbitrary continuous functions”. But that’s not entirely true. They can approximate arbitrary continuous functions in an interval. This means that when using polynomial features, the data must be normalized to lie in an interval. It can be done using min-max scaling, computing empirical quantiles, or passing the feature through a sigmoid. But we should avoid the use of polynomials on raw un-normalized features.
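
As a minimal sketch of the min-max option, assuming a made-up raw feature and a hypothetical helper named `minmax_to_unit_interval`, normalization could look as follows. The minimum and maximum are taken from the training data, and values outside that range are clipped, so that new data also lands in $[0, 1]$.

```python
import numpy as np

def minmax_to_unit_interval(z, z_min, z_max):
  # map [z_min, z_max] linearly onto [0, 1]; clip anything outside the range
  return np.clip((z - z_min) / (z_max - z_min), 0.0, 1.0)

# hypothetical un-normalized training feature
raw = np.array([120.0, 35.0, 560.0, 980.0, 410.0])
z_min, z_max = raw.min(), raw.max()
scaled = minmax_to_unit_interval(raw, z_min, z_max)

# new data re-uses the training range; 1200 exceeds it and is clipped to 1
new_raw = np.array([1200.0, 500.0])
new_scaled = minmax_to_unit_interval(new_raw, z_min, z_max)
print(scaled, new_scaled)
```

Clipping is only one way to handle out-of-range values. Passing the feature through a sigmoid, as mentioned above, avoids the hard cut-off.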

Building the basics

In this post we will demonstrate fitting the function

\[f(x)=\sin(8 \pi x) / \exp(x)+x\]

on the interval $[0, 1]$ by fitting to $m=30$ samples corrupted by Gaussian noise. The following code implements the function and generates samples:

import numpy as np

def true_func(x):
  return np.sin(8 * np.pi * x) / np.exp(x) + x

m = 30
sigma = 0.1

# generate features
np.random.seed(42)
X = np.random.rand(m)
y = true_func(X) + sigma * np.random.randn(m)

For function plotting, we will use uniformly-spaced points in $[0, 1]$. The following code plots the true function and the sample points:

import matplotlib.pyplot as plt

plt_xs = np.linspace(0, 1, 1000)
plt.scatter(X.ravel(), y.ravel())
plt.plot(plt_xs, true_func(plt_xs), 'blue')
plt.show()

Now let’s fit a polynomial to the sampled points using the standard basis. Namely, we’re given the set of noisy points $\{ (x_i, y_i) \}_{i=1}^m$, and we need to find the coefficients $\alpha_0, \dots, \alpha_n$ that minimize:

\[\sum_{i=1}^m (\alpha_0 + \alpha_1 x_i + \dots + \alpha_n x_i^n - y_i)^2\]

As expected, this is readily accomplished by transforming each sample $x_i$ to a vector of features $1, x_i, \dots, x_i^n$, and fitting a linear regression model to the resulting features. Fortunately, NumPy has the numpy.polynomial.polynomial.polyvander function. It takes a vector containing $x_1, \dots, x_m$ and produces the matrix

\[\begin{pmatrix} 1 & x_1 & x_1^2 & \dots & x_1^n \\ 1 & x_2 & x_2^2 & \dots & x_2^n \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_m & x_m^2 & \dots & x_m^n \\ \end{pmatrix}\]
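
To make the matrix concrete, here is a tiny example with two made-up points and degree 3. Each row contains the powers of one point, starting from the zeroth power.

```python
import numpy as np
import numpy.polynomial.polynomial as poly

xs_small = np.array([0.5, 2.0])          # two arbitrary points
V_small = poly.polyvander(xs_small, 3)   # columns: x^0, x^1, x^2, x^3
print(V_small)
# the rows are (1, 0.5, 0.25, 0.125) and (1, 2, 4, 8)
```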

The name of the function comes from the name of the matrix - the Vandermonde matrix. Let’s use it to fit a polynomial of degree $n=50$.

from sklearn.linear_model import LinearRegression
import numpy.polynomial.polynomial as poly

n = 50
model = LinearRegression(fit_intercept=False)
model.fit(poly.polyvander(X, deg=n), y)

The reason we use fit_intercept=False is because the ‘intercept’ is provided by the first column of the Vandermonde matrix. Now we can plot the function we just fit:

plt.scatter(X.ravel(), y.ravel())                                    # plot the samples
plt.plot(plt_xs, true_func(plt_xs), 'blue')                          # plot the true function
plt.plot(plt_xs, model.predict(poly.polyvander(plt_xs, deg=n)), 'r') # plot the fit model
plt.ylim([-5, 5])
plt.show()

As expected, we got the “scary” image from the beginning of this post. Indeed, the standard basis is awful for model fitting! We might hope that regularization provides a remedy, but as we will see, it does not. Let’s try adding some L2 regularization, using the Ridge class from the sklearn.linear_model package to fit an L2 regularized model:

from sklearn.linear_model import Ridge

reg_coef = 1e-7
model = Ridge(fit_intercept=False, alpha=reg_coef)
model.fit(poly.polyvander(X, deg=n), y)

plt.scatter(X.ravel(), y.ravel())                                    # plot the samples
plt.plot(plt_xs, true_func(plt_xs), 'blue')                          # plot the true function
plt.plot(plt_xs, model.predict(poly.polyvander(plt_xs, deg=n)), 'r') # plot the fit model
plt.ylim([-5, 5])
plt.show()

We get the following result:

The regularization coefficient of $\alpha=10^{-7}$ is large enough to break the model in $[0,0.8]$ but not large enough to avoid over-fitting in $[0.8, 1]$. Increasing the coefficient clearly won’t help - the model will be broken even further in $[0, 0.8]$.

Since we will be trying several polynomial bases, it makes sense to write a more generic function for our experiments that will accept various “Vandermonde” matrix functions of the basis of our choice, fit the polynomial using the Ridge class, and plot it with the original function and the sample points.

def fit_and_plot(vander, n, alpha):
  model = Ridge(fit_intercept=False, alpha=alpha)
  model.fit(vander(X, deg=n), y)

  plt.scatter(X.ravel(), y.ravel())                           # plot the samples
  plt.plot(plt_xs, true_func(plt_xs), 'blue')                 # plot the true function
  plt.plot(plt_xs, model.predict(vander(plt_xs, deg=n)), 'r') # plot the fit model
  plt.ylim([-5, 5])
  plt.show()  

Now we can reproduce our latest experiment by invoking:

fit_and_plot(poly.polyvander, n=50, alpha=1e-7)

Polynomial bases

It turns out that in our sister discipline, approximation theory, researchers also encountered similar difficulties with the standard basis $\mathbb{E}_n$, and developed a theory for approximating functions by polynomials from different bases. Two prominent examples of bases of $n$-degree polynomials, and their NumPy implementations, are:

The Chebyshev polynomials $\mathbb{T}_n = \{ T_0, T_1, \dots, T_n \}$, implemented in the numpy.polynomial.chebyshev module.
The Legendre polynomials $\mathbb{P}_n = \{ P_0, P_1, \dots, P_n \}$, implemented in the numpy.polynomial.legendre module.

They are the computational workhorse of a large variety of numerical algorithms that are enabled by approximating a function using a polynomial, and are well-known for their advantages in approximating functions in the $[-1, 1]$ interval². In particular, the corresponding “Vandermonde” matrices are provided by the chebvander and legvander functions in the corresponding modules above. Each row in these matrices contains the values of the basis functions at one point, just like the standard Vandermonde matrix of the standard basis. For example, the Chebyshev Vandermonde matrix is:

\[\begin{pmatrix} T_0(x_1) & T_1(x_1) & \dots & T_n(x_1) \\ T_0(x_2) & T_1(x_2) & \dots & T_n(x_2) \\ \vdots & \vdots & \ddots& \vdots \\ T_0(x_m) & T_1(x_m) & \dots & T_n(x_m) \\ \end{pmatrix}\]

I will not elaborate on their formulas and properties here for a reason that will immediately be revealed. However, I highly recommend Prof. Nick Trefethen’s “Approximation theory and approximation practice” online video course to get familiar with their advantages. His book with the same name is an excellent introduction to the subject.

It might be tempting to try fitting a Chebyshev polynomial using our fit_and_plot method above directly:

import numpy.polynomial.chebyshev as cheb

fit_and_plot(cheb.chebvander, n=50, alpha=1e-7)

However, that’s not the best thing to do. We aim to fit a function sampled from $[0, 1]$, but the Chebyshev basis “lives” in $[-1, 1]$. Therefore, we will add the transformation $x \to 2x-1$ before invoking the chebvander function:

def scaled_chebvander(x, deg):
  return cheb.chebvander(2 * x - 1, deg=deg)

fit_and_plot(scaled_chebvander, n=50, alpha=1)

Note that a different basis requires a different regularization coefficient. We get the following result:

Whoa! Seems even worse than the standard basis! Maybe more regularization helps?

fit_and_plot(scaled_chebvander, n=50, alpha=10)

It appears that our polynomial is both a bad fit for the function, and extremely oscillatory. Even worse than the standard basis! Interested readers can repeat the experiment with Legendre polynomials and see a slightly better, but similar result. So what’s wrong? Is everything that approximation theory tries to teach us about polynomials wrong?

The answer stems from the fundamental difference between two tasks:

Interpolation - finding a polynomial that agrees with the approximated function $f(x)$ exactly at a set of carefully chosen points
Fitting - finding a polynomial that agrees approximately with a given noisy set of points, which are out of our control.
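
To illustrate the interpolation task, here is a hedged sketch that uses numpy.polynomial.chebyshev.chebinterpolate, which samples a function at Chebyshev points in $[-1, 1]$ that it chooses by itself. The helper `true_func_on_cheb_domain` is my own name for the function from this post after mapping $[-1, 1]$ to $[0, 1]$. Note that here we sample the noiseless function at points of our choosing, which is exactly what we cannot do in the fitting task.

```python
import numpy as np
import numpy.polynomial.chebyshev as cheb

def true_func(x):
  return np.sin(8 * np.pi * x) / np.exp(x) + x

def true_func_on_cheb_domain(t):
  # t in [-1, 1] is mapped to x = (t + 1) / 2 in [0, 1]
  return true_func((t + 1) / 2)

# degree-50 interpolant at Chebyshev points, with noiseless samples
coef = cheb.chebinterpolate(true_func_on_cheb_domain, 50)

# compare the interpolant to the true function on a dense grid in [0, 1]
xs = np.linspace(0, 1, 1000)
interp_vals = cheb.chebval(2 * xs - 1, coef)
max_err = np.max(np.abs(interp_vals - true_func(xs)))
print(max_err)                          # expected to be tiny
```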

The Chebyshev and Legendre bases perform extremely well at the interpolation task, but not at the fitting task. It turns out that the polynomial $T_k$ in the Chebyshev basis, and the polynomial $P_k$ in the Legendre basis, are both $k$-degree polynomials. For example, $T_1$ is a linear function, whereas $T_{50}$ is a polynomial of degree 50. These two functions are radically different. Thus, the coefficients of $T_1$ and $T_{50}$ have “different units”. This property is shared with the standard basis as well. Thus, we have two issues:

A small change of the coefficient of a high degree basis function, say the coefficient $\alpha_{50}$, has a huge effect on the shape of the polynomial. Thus, a small perturbation in the input data, be it from noise or a slightly different data point $x_i$, has a huge effect on the fit model.
L2 regularization makes no sense! For reasonable functions, the coefficient $\alpha_{50}$ should be much smaller than the coefficient $\alpha_1$. This is regardless of the choice of the basis!

Both properties show that for the fitting task, rather than the interpolation task, we need something else.
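
The “different units” issue can be seen by looking at the Chebyshev coefficients of a reasonable function. The sketch below is my own illustration: it re-uses the function from this post, computes its degree-50 Chebyshev coefficients with chebinterpolate, and compares the magnitude of the largest low-order coefficient with that of the last one. An L2 penalty that treats all coefficients equally ignores this difference in scale.

```python
import numpy as np
import numpy.polynomial.chebyshev as cheb

def true_func_on_cheb_domain(t):
  # the function from this post, with t in [-1, 1] mapped to x in [0, 1]
  x = (t + 1) / 2
  return np.sin(8 * np.pi * x) / np.exp(x) + x

coef = cheb.chebinterpolate(true_func_on_cheb_domain, 50)

largest_low_order = np.max(np.abs(coef[:20]))  # largest of alpha_0..alpha_19
last_coef = np.abs(coef[50])                   # magnitude of alpha_50
print(largest_low_order, last_coef)
```

The last coefficient should come out many orders of magnitude smaller than the low-order ones.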

The Bernstein basis

A remedy is provided by the Bernstein basis $\mathbb{B}_n = \{ b_{0,n}, \dots, b_{n, n} \}$. These are $n$-degree polynomials defined on $[0, 1]$ by:

\[b_{i,n}(x) = \binom{n}{i} x^i (1-x)^{n-i}\]

These polynomials are widely used in computer graphics to approximate curves and surfaces, but it appears that they’re less known in the machine learning community. In fact, all the text you see on the screen when reading this post is rendered using Bernstein polynomials³. We will study them more in depth in the next posts, but at this stage I would like to point out two simple properties that give an intuitive explanation of why they’re useful in machine learning.

First, note that each $b_{i,n}$ is an $n$-degree polynomial. Thus, when representing a polynomial using

\[p_n(x) = \alpha_0 b_{0,n}(x) + \alpha_1 b_{1,n}(x) + \dots + \alpha_n b_{n,n}(x),\]

all the coefficients have the same “units”.

If the formula of $b_{i,n}(x)$ seems familiar - you are correct. It is exactly the probability mass function of the binomial distribution for obtaining $i$ successes in a sequence of $n$ trials whose success probability is $x$. Therefore, $b_{i,n}(x) \geq 0$, and $\sum_{i=0}^n b_{i,n}(x) = 1$ for any $x \in [0, 1]$. Consequently, the polynomial $p_n(x)$ is just a weighted average of the coefficients $\alpha_0, \dots, \alpha_n$. So not only do the coefficients have the same “units”, their “units” are also the same as the model’s labels. Thus, they’re much easier to regularize - they’re all on the same “scale”.

Finally, due to the equivalence with the binomial distribution p.m.f, we can implement a “Vandermonde” matrix in Python using the scipy.stats.binom.pmf function.

from scipy.stats import binom

def bernvander(x, deg):
  return binom.pmf(np.arange(1 + deg), deg, x.reshape(-1, 1))
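
The two properties discussed above can be checked numerically. The following sketch is an illustration under my own assumptions: degree 8, an arbitrary coefficient vector, and a hypothetical helper `bernstein_by_formula` that evaluates the defining formula directly using scipy.special.comb. It verifies that the formula agrees with the binomial p.m.f., that the basis functions are non-negative and sum to one, and that the resulting polynomial stays between the smallest and largest coefficient.

```python
import numpy as np
from scipy.special import comb
from scipy.stats import binom

def bernstein_by_formula(x, deg):
  # b_{i,n}(x) = C(n, i) x^i (1 - x)^(n - i), one column per i
  i = np.arange(deg + 1)
  x = x.reshape(-1, 1)
  return comb(deg, i) * x ** i * (1 - x) ** (deg - i)

deg = 8
xs = np.linspace(0, 1, 21)
B_formula = bernstein_by_formula(xs, deg)
B_pmf = binom.pmf(np.arange(deg + 1), deg, xs.reshape(-1, 1))

same_basis = np.allclose(B_formula, B_pmf)       # formula matches the p.m.f.
nonnegative = np.all(B_pmf >= 0)                 # b_{i,n}(x) >= 0
row_sums = B_pmf.sum(axis=1)                     # should all equal 1

# arbitrary coefficients; p_n(x) is a weighted average of them
coefs = np.array([0.3, -1.2, 2.0, 0.7, -0.4, 1.5, 0.0, 0.9, -0.8])
p_vals = B_pmf @ coefs
print(same_basis, nonnegative, row_sums.min(), row_sums.max())
print(p_vals.min(), p_vals.max())
```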

Let’s try to fit without any regularization at all:

fit_and_plot(bernvander, n=50, alpha=0)

We see the usual over-fitting. Now let’s see that they’re indeed easy to regularize. After trying several regularization coefficients, I came up with this:

fit_and_plot(bernvander, n=50, alpha=5e-7)

Beautiful! This is a polynomial of degree 50! The fit is great, no oscillations, and the misfit near the right endpoint stems from the noise - I don’t believe there’s enough information in the data to convey the fact that it should “curve up” rather than “curve down”.

Let’s see what happens when we crank up the degree. Can we produce a nice non-oscillating polynomial?

fit_and_plot(bernvander, n=100, alpha=5e-4)

This is a polynomial of degree 100, that does not overfit!

Summary

The notorious reputation of high-degree polynomials in the machine learning community is primarily a myth. Despite it, papers, books, and blog posts are based on this premise as if it was an axiom. Bernstein polynomials are little known in the machine learning community, but there are a few papers⁴⁵ using them to represent polynomial features. Their main advantage is ease of use - we can use high degree polynomials to exploit their approximation power, and easily control model complexity with just one hyperparameter - the regularization coefficient.

In the following posts we will explore the Bernstein basis in more detail. We will use it to create polynomial features for real-world datasets and test it versus the standard basis. Moreover, we will see how to regularize the coefficients to control the shape of the function we aim to represent. For example, what if we know that the function we’re aiming to fit is increasing? Stay tuned!

There are also kernel methods, and polynomial kernels. But polynomial kernels suffer from problems similar to the standard basis. ↩
The standard basis is not that awful. It’s a great basis for representing polynomials on the complex unit circle. In fact, the Fourier transform is based exactly on this observation. ↩
See Bézier curves and TrueType font outlines. ↩
Marco, Ana, and José-Javier Martínez. “Polynomial least squares fitting in the Bernstein basis.” Linear Algebra and its Applications 433.7 (2010): 1254-1264. ↩
Wang, Jiangdian, and Sujit K. Ghosh. “Shape restricted nonparametric regression with Bernstein polynomials.” Computational Statistics & Data Analysis 56.9 (2012): 2729-2741. ↩