In the last two posts we introduced the Bernstein basis as an alternative way to generate polynomial features from data. In this post we’ll be concerned with an implementation that we can use in model training pipelines based on Scikit-Learn. Scikit-Learn has the concept of a transformer class that generates features from raw data, and we will develop and test such a transformer class for the Bernstein basis. Contrary to previous posts, here we will have only a little math, but plenty of code, which is fully available in this Colab Notebook.
But beforehand, let’s do a short recap of what we learned in the last two posts:
Transformer classes in Scikit-Learn generate new features out of existing ones, and can be combined conveniently into pipelines that perform a set of transformations that eventually generate features for a trained model. We will implement a Scikit-Learn transformer class for Bernstein polynomials, called BernsteinFeatures. As a baseline, we will also implement a similar transformer that generates the power basis, called PowerBasisFeatures. We will combine them in a Pipeline to build a mechanism that trains and evaluates a model using the well-known fit-transform paradigm. In this post, we will train a linear model on our generated features.
Since feature normalization is a must, we will always prepend our polynomial transformer with a normalization transformer. In this post, we will use the MinMaxScaler class built into Scikit-Learn. For categorical features, we will use OneHotEncoder. Therefore, our pipelines in this post will have the following generic form:
Before we begin - some expectations. The behavior of the functions we approximate on real data-sets is typically not as ‘crazy’ as the toy functions we approximated in previous posts. The wide oscillations and wiggling of the “true” function we are aiming to learn are not that common in practice. A harder challenge is modeling the interaction between several features, rather than the effect of each feature separately. Therefore, the advantage we will see from a simple application of Bernstein polynomials over the power basis isn’t that large, but it’s quite visible and consistent. Thus, when fitting a model with polynomial features, I’d go with Bernstein polynomials by default, instead of the power basis. It’s very easy, and we have nothing to lose - we can only gain.
A transformer class in Scikit-Learn needs to implement the basic fit-transform paradigm. Since polynomial features are the same regardless of the data, the fit method is empty. The transform method, as expected, will generate and concatenate the Vandermonde matrices of the columns. Note that we handle each column separately at this stage, and do not aim to compute any interaction terms between columns.
There is one mathematical issue we need to take care of. Since a polynomial basis can represent any polynomial, including those that do not pass through the origin, it implicitly contains a “bias” term. The power basis is even explicit about it - its first basis function is the constant \(1\). However, a typical linear model already has its own bias term, namely,
\[f(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x}\rangle + b.\]The bias is, of course, equivalent to having a constant feature. Thus, our data matrix has two constant features, meaning it’s as ill-conditioned as it can be - its columns are linearly dependent. When several numerical features are used things become even worse - we have several implicit constant features.
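A quick NumPy check makes the linear dependence concrete (an illustrative sketch, not part of the transformer): stacking the power-basis features of two columns, where each block contains the constant column \(1\), yields a rank-deficient matrix.

```python
import numpy as np
import numpy.polynomial.polynomial as poly

rng = np.random.default_rng(0)
x1, x2 = rng.random(100), rng.random(100)

# Degree-3 power-basis features of two columns; each block starts
# with the constant column [1, 1, ..., 1]
X = np.hstack([poly.polyvander(x1, 3), poly.polyvander(x2, 3)])
print(X.shape[1], np.linalg.matrix_rank(X))  # 8 columns, but rank only 7
```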
To mitigate the above, we will add a bias boolean flag to our transformers that instructs the transformer to generate a basis of polynomials going through the origin. This policy is in line with other transformers built into Scikit-Learn, such as the SplineTransformer and PolynomialFeatures classes. For the power basis it amounts to discarding the first basis function. It turns out that the same idea works for the Bernstein basis as well, since \(b_{0,n}(0) = 1\), and \(b_{i,n}(0) = 0\) for all \(i \geq 1\).
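We can sanity-check this property numerically with scipy.stats.binom, which we will also use below to compute the basis itself (a quick check, not part of the final code):

```python
import numpy as np
from scipy.stats import binom

# b_{i,n}(x) equals the Binomial(n, x) pmf at i; evaluate all of them at x = 0
n = 5
vals_at_zero = binom.pmf(np.arange(n + 1), n, 0.0)
print(vals_at_zero)  # only the first basis function is nonzero at the origin
```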
Besides the above mathematical aspect, we will also have to take care of several technical aspects. First, we will add support for Pandas data-frames, since they are ubiquitous among practitioners. Second, we will have to handle one-dimensional arrays as input, and reshape them into a single column. Finally, we will transform NaN values into constant (zero) vectors to model the fact that a missing numerical feature “has no effect”. This is not always the best course of action, but it’s useful in this post. The base class taking care of the above mathematical and technical aspects is written below:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class PolynomialBasisTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, degree=5, bias=False, na_value=0.):
        self.degree = degree
        self.bias = bias
        self.na_value = na_value

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Check if X is a Pandas DataFrame and convert to NumPy array
        if hasattr(X, 'values'):
            X = X.values

        # Ensure X is a 2D array
        if X.ndim == 1:
            X = X.reshape(-1, 1)

        # Get the number of columns in the input array
        n_rows, n_features = X.shape

        # Compute the specific polynomial basis for each column
        basis_features = [
            self.feature_matrix(X[:, i])
            for i in range(n_features)
        ]

        # no bias --> skip the first basis function
        if not self.bias:
            basis_features = [basis[:, 1:] for basis in basis_features]

        return np.hstack(basis_features)

    def feature_matrix(self, column):
        vander = self.vandermonde_matrix(column)
        return np.nan_to_num(vander, nan=self.na_value)

    def vandermonde_matrix(self, column):
        raise NotImplementedError("Subclasses must implement this method.")
The power and Bernstein bases are easily implemented by overriding the vandermonde_matrix
method of the above base-class:
import numpy.polynomial.polynomial as poly
from scipy.stats import binom

class BernsteinFeatures(PolynomialBasisTransformer):
    def vandermonde_matrix(self, column):
        basis_idx = np.arange(1 + self.degree)
        basis = binom.pmf(basis_idx, self.degree, column[:, None])
        return basis

class PowerBasisFeatures(PolynomialBasisTransformer):
    def vandermonde_matrix(self, column):
        return poly.polyvander(column, self.degree)
Let’s see how they work. We will use Pandas to display the results of our transformers as nicely formatted tables.
import pandas as pd
pbt = PowerBasisFeatures(degree=2).fit(np.empty(0))
bbt = BernsteinFeatures(degree=2).fit(np.empty(0))

# transform a column - output the Vandermonde matrix according to each basis
feature = np.array([0, 0.5, 1, np.nan])
print(pd.DataFrame.from_dict({
    'Feature': feature,
    'Power basis': list(pbt.transform(feature)),
    'Bernstein basis': list(bbt.transform(feature))
}))
#    Feature  Power basis             Bernstein basis
# 0      0.0  [0.0, 0.0]              [0.0, 0.0]
# 1      0.5  [0.5, 0.25]             [0.5000000000000002, 0.25]
# 2      1.0  [1.0, 1.0]              [0.0, 1.0]
# 3      NaN  [0.0, 0.0]              [0.0, 0.0]

# transform two columns - concatenate the Vandermonde matrices
features = np.array([
    [0, 0.25],
    [0.5, 0.5],
    [np.nan, 0.75]
])
print(pd.DataFrame.from_dict({
    'Feature 0': features[:, 0],
    'Feature 1': features[:, 1],
    'Power basis': list(pbt.transform(features)),
    'Bernstein basis': list(bbt.transform(features))
}))
#    Feature 0  Feature 1  Power basis               Bernstein basis
# 0        0.0       0.25  [0.0, 0.0, 0.25, 0.0625]  [0.0, 0.0, 0.375, 0.0625]
# 1        0.5       0.50  [0.5, 0.25, 0.5, 0.25]    [0.5000000000000002, 0.25, 0.5000000000000002, 0.25]
# 2        NaN       0.75  [0.0, 0.0, 0.75, 0.5625]  [0.0, 0.0, 0.375, 0.5625]
Nice! Now let’s proceed to our example.
Let’s implement the pipeline structure we saw at the beginning of this post in code, along with a function to train models using this pipeline.
We will write a function that takes a basis transformer and a model as arguments, and constructs the components of the pipeline. Categorical features will be one-hot encoded, numerical features will be scaled and transformed using the given basis transformer, and finally the result will be passed as input to the given model.
To make sure our scaled numerical features never fall outside of the \([0, 1]\) interval, even if the test set contains values larger or smaller than what we saw in the training set, we clip the scaled values to \([0, 1]\). And to make sure we don’t inflate the dimension of our model by one-hot encoding rare categorical values, we will require a minimum frequency of 10. Here is the code:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.compose import ColumnTransformer
def training_pipeline(basis_transformer, model_estimator,
                      categorical_features, numerical_features):
    basis_feature_transformer = Pipeline([
        ('scaler', MinMaxScaler(clip=True)),
        ('basis', basis_transformer)
    ])
    categorical_transformer = OneHotEncoder(
        sparse_output=False,
        handle_unknown='infrequent_if_exist',
        min_frequency=10
    )
    preprocessor = ColumnTransformer(
        transformers=[
            ('numerical', basis_feature_transformer, numerical_features),
            ('categorical', categorical_transformer, categorical_features)
        ]
    )
    return Pipeline([
        ('preprocessor', preprocessor),
        ('model', model_estimator)
    ])
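As a small aside, the clipping behavior of MinMaxScaler(clip=True) can be seen in isolation (a toy example with made-up numbers):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Fit on a training range of [0, 10]; out-of-range test values are clipped
scaler = MinMaxScaler(clip=True).fit(np.array([[0.0], [10.0]]))
out = scaler.transform(np.array([[-5.0], [5.0], [20.0]]))
print(out.ravel())  # -> [0.  0.5 1. ]
```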
We can now use a pipeline with Bernstein features and Ridge regression as in:
pipeline = training_pipeline(BernsteinFeatures(), Ridge(), categorical_features, numerical_features)
test_predictions = pipeline.fit(train_df, train_df[target]).predict(test_df)
But wait! We need to know what polynomial degree to use, and maybe tune some hyperparameters of the trained model. Otherwise, the experimental results we observe may simply be due to a bad choice of hyperparameters.
We need two ingredients. One is technical - how we set hyperparameters of components hidden deep inside a Pipeline. The other is how we actually tune them. For setting hyperparameters, Scikit-Learn provides an interface with two methods: get_params(), which returns a dictionary of all settable parameters, and set_params(), which can set parameters of all the components contained inside a pipeline. Let’s look at an example of a pipeline with BernsteinFeatures as the basis transformer, and Ridge as the model. Since Ridge has an alpha parameter, and BernsteinFeatures has a degree parameter, let’s look for those:
from sklearn.linear_model import Ridge
pipeline = training_pipeline(BernsteinFeatures(), Ridge(), [], [])
print({k: v for k, v in pipeline.get_params().items()
       if 'degree' in k or 'alpha' in k})
# prints: {'preprocessor__numerical__basis__degree': 5, 'model__alpha': 1.0}
There is a pattern here! Looking at our training_pipeline function above, we see that there is a component named “preprocessor”, inside of which there is a component named “numerical”, which contains a “basis”. That “basis” component is our transformer, so it has a “degree”. The full name is just a concatenation of the above with double underscores. The same idea applies to the model. We can also set these parameters as follows:
pipeline.set_params(preprocessor__numerical__basis__degree=SOME_DEGREE)
pipeline.set_params(model__alpha=SOME_REGULARIZATION_COEFFICIENT)
So now that we know how to set hyperparameters of parts within a pipeline, let’s tune them. To that end, we will use hyperopt^{2}! It’s a nice hyperparameter tuner, very easy to use, and implements the state-of-the-art Bayesian Optimization paradigm that can obtain high-quality hyperparameter configurations in a relatively small number of trials. It’s as easy to use as a grid search, available by default on Colab, and saves us precious time. And I certainly don’t want to wait long until I see the results.
To use hyperopt, we need two ingredients: a tuning objective that evaluates the performance of a given hyperparameter configuration, and a search space for the hyperparameters. Writing such a tuning objective is quite easy - we will use a cross-validated score using Scikit-Learn’s built-in capabilities:
from sklearn.model_selection import cross_val_score

def tuning_objective(pipeline, metric, train_df, target, params):
    pipeline.set_params(**params)
    scores = cross_val_score(pipeline, train_df, train_df[target], scoring=metric)
    return -np.mean(scores)
Well, that wasn’t hard, but there’s an intricate detail - note that we are returning minus the average metric across folds. This is because Scikit-Learn’s metrics are built to be maximized, but hyperopt is built to minimize.
Defining the hyperparameter search space is also easy - it’s just a dict specifying a distribution for each hyperparameter. For our example above with a Ridge model we can use something like this:
from hyperopt import hp
param_space = {
    'preprocessor__numerical__basis__degree': hp.uniformint('degree', 1, 50),
    'model__alpha': hp.loguniform('alpha', -10, 5)
}
Hyperopt has uniform and uniformint functions for hyperparameters that we would normally tune using a uniform grid, such as the number of layers of a neural network, or the degree of a polynomial. In the code above, the degree of the polynomial is a number between 1 and 50, and all values are equally likely. It also has a loguniform function for hyperparameters that we normally tune using a geometrically-spaced grid, such as a learning rate, or a regularization coefficient. In the example above, the regularization coefficient is between \(e^{-10}\) and \(e^5\), and all exponents are equally likely.
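Since we don’t need hyperopt itself to see what these distributions mean, here is a plain NumPy illustration (the variable names are just for this sketch): uniformint draws integers uniformly, while loguniform draws \(e^U\) for a uniformly distributed exponent \(U\), so the values are geometrically spread.

```python
import numpy as np

rng = np.random.default_rng(0)
degrees = rng.integers(1, 51, size=5)         # like hp.uniformint('degree', 1, 50)
alphas = np.exp(rng.uniform(-10, 5, size=5))  # like hp.loguniform('alpha', -10, 5)
print(degrees)
print(alphas)  # between e^-10 ~ 4.5e-5 and e^5 ~ 148
```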
Having specified the objective function and the parameter space, we can use fmin
for tuning, like this:
from hyperopt import fmin, tpe
fmin(lambda params: tuning_objective(pipeline, metric, train_df, target, params),
     space=param_space,
     algo=tpe.suggest,
     max_evals=100)
We have given it a function to minimize, the hyperparameter search space, told it to use the TPE algorithm for tuning^{3}, and limited it to 100 evaluations of our tuning objective. It will invoke our objective on hyperparameter configurations that it considers worth trying, and eventually give us the best configuration it found. More on that can be found in hyperopt’s documentation.
So let’s write a function that tunes hyperparameters using the training set, re-trains the pipeline on the entire training set using the best configuration, and evaluates the resulting model’s performance on the test set.
from hyperopt import fmin, tpe
from sklearn.metrics import get_scorer

def tune_and_evaluate_pipeline(pipeline, param_space,
                               train_df, test_df, target, metric,
                               max_evals=50, random_seed=42):
    print('Tuning params')
    def bound_tuning_objective(params):
        return tuning_objective(pipeline, metric, train_df, target, params)

    params = fmin(fn=bound_tuning_objective,  # <-- this is the objective
                  space=param_space,          # <-- the search space
                  algo=tpe.suggest,           # <-- the algorithm to use. TPE is the most widely used.
                  max_evals=max_evals,        # <-- maximum number of configurations to try
                  rstate=np.random.default_rng(random_seed),
                  return_argmin=False)
    print(f'Best params = {params}')

    print('Refitting with best params on the entire training set')
    pipeline.set_params(**params)
    fit_result = pipeline.fit(train_df, train_df[target])

    scorer = get_scorer(metric)
    score = scorer(fit_result, test_df, test_df[target])
    print(f'Test metric = {score:.5f}')
    return fit_result
Now we have all the ingredients in place! We can, for example, train a tuned Ridge regression model with Bernstein polynomial features that predicts the foo column in our data-set, measuring success using the Root Mean Squared Error metric, as follows:
train_df = ...
test_df = ...
categorical_features = [...]
numerical_features = [...]

pipeline = training_pipeline(BernsteinFeatures(), Ridge(), categorical_features, numerical_features)
model = tune_and_evaluate_pipeline(
    pipeline,
    param_space,
    train_df,
    test_df,
    'foo',
    'neg_root_mean_squared_error')
Now let’s put our work-horse to work!
The well-known California Housing price prediction data-set is available in the samples directory on Colab, so it will be convenient to use. Let’s load it, and print a sample:
train_df = pd.read_csv('sample_data/california_housing_train.csv')
test_df = pd.read_csv('sample_data/california_housing_test.csv')
print(train_df.head())
# longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
# 0 -114.31 34.19 15.0 5612.0 1283.0 1015.0 472.0 1.4936 66900.0
# 1 -114.47 34.40 19.0 7650.0 1901.0 1129.0 463.0 1.8200 80100.0
# 2 -114.56 33.69 17.0 720.0 174.0 333.0 117.0 1.6509 85700.0
# 3 -114.57 33.64 14.0 1501.0 337.0 515.0 226.0 3.1917 73400.0
# 4 -114.57 33.57 20.0 1454.0 326.0 624.0 262.0 1.9250 65500.0
The task is predicting the median_house_value
column based on the other columns.
First, we can see that there are several feature columns with very large and diverse numbers. They probably have a very skewed distribution. Let’s plot those distributions:
skewed_columns = ['total_rooms', 'total_bedrooms', 'population', 'households']
axs = train_df.loc[:, skewed_columns].plot.hist(
    bins=20, subplots=True, layout=(2, 2), figsize=(8, 6))
axs.flat[0].get_figure().tight_layout()
Indeed very skewed! Typically applying a logarithm helps. Let’s plot them after applying a logarithm (note the .apply(np.log)):
axs = train_df.loc[:, skewed_columns].apply(np.log).plot.hist(
    bins=20, subplots=True, layout=(2, 2), figsize=(8, 6))
axs.flat[0].get_figure().tight_layout()
Ah, much better! We also note that the housing_median_age variable, despite being numerical, is discrete. Indeed, it has only 52 unique values in the entire dataset. So we will treat it as a categorical variable. Let’s summarize our features in code:
categorical_features = ['housing_median_age']
numerical_features = ['longitude', 'latitude', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income']
target = 'median_house_value'
So we’re almost ready to fit a model. We can see that our target variable, median_house_value
, has very large magnitude values. It is usually beneficial to scale them to a smaller range. However, we would like to measure the prediction error with respect to the original values. Fortunately, Scikit-Learn provides us with a TransformedTargetRegressor
class that allows scaling the target variable for the regression model, and scaling it back to the original range when producing an output.
Now we’re ready to construct our model fitting pipeline that fits a Ridge model on scaled regression targets, and transformed features:
from sklearn.linear_model import Ridge
from sklearn.compose import TransformedTargetRegressor
def california_housing_pipeline(basis_transformer):
    return training_pipeline(
        basis_transformer,
        TransformedTargetRegressor(
            regressor=Ridge(),
            transformer=MinMaxScaler()
        ),
        categorical_features,
        numerical_features
    )
Beautiful! Now we can use our hyperparameter tuning function to train a tuned model on our dataset. Since it’s a regression task, we will measure the Root Mean Squared Error (RMSE), implemented by the neg_root_mean_squared_error
Scikit-Learn metric. So let’s begin with Bernstein polynomial features:
poly_param_space = {
    'preprocessor__numerical__basis__degree': hp.uniformint('degree', 1, 50),
    'model__regressor__alpha': hp.loguniform('alpha', -10, 5)
}

bernstein_pipeline = california_housing_pipeline(BernsteinFeatures())
bernstein_fit_result = tune_and_evaluate_pipeline(
    bernstein_pipeline, poly_param_space,
    train_df, test_df, target,
    'neg_root_mean_squared_error')
After a few minutes I got the following output:
Tuning params
100%|██████████| 50/50 [03:37<00:00, 4.34s/trial, best loss: 60364.25845777496]
Best params = {'model__regressor__alpha': 0.0075549014272857686, 'preprocessor__numerical__basis__degree': 50}
Refitting with best params on the entire training set
Test metric = -61559.04848
The root mean squared error (RMSE) on the test-set of the tuned model is \(61559.04848\). Now let’s try the power basis:
power_basis_pipeline = california_housing_pipeline(PowerBasisFeatures())
power_basis_fit_result = tune_and_evaluate_pipeline(
    power_basis_pipeline, poly_param_space,
    train_df, test_df, target,
    metric='neg_root_mean_squared_error')
This time I got the following output:
Tuning params
100%|██████████| 50/50 [00:54<00:00, 1.10s/trial, best loss: 62205.78033504614]
Best params = {'model__regressor__alpha': 4.7685837926305776e-05, 'preprocessor__numerical__basis__degree': 31}
Refitting with best params on the entire training set
Test metric = -63534.49228
This time the RMSE is \(63534.49228\). The Bernstein basis got us a \(3.1\%\) improvement! If we look closer at the output, we can see that the tuned Bernstein polynomial is of degree 50, whereas the best tuned power basis polynomial is of degree 31. We already saw that high degree polynomials in the Bernstein basis are easy to regularize, and our tuner probably saw the same phenomenon, and cranked up the degree to 50.
How do our polynomial features compare to a simple linear model? Well, let’s see. To re-use all our existing code instead of writing a new pipeline, we’ll just use a “do nothing” feature transformer that implements the identity function. Note that this time there is no degree to tune.
class IdentityTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, na_value=0.):
        self.na_value = na_value

    def fit(self, input_array, y=None):
        return self

    def transform(self, input_array, y=None):
        # we are compatible with our polynomial features - NA values are
        # zeroed-out, the rest are passed through
        return np.where(np.isnan(input_array), self.na_value, input_array)

linear_param_space = {
    'model__regressor__alpha': hp.loguniform('alpha', -10, 5)
}
linear_pipeline = california_housing_pipeline(IdentityTransformer())
linear_fit_result = tune_and_evaluate_pipeline(
    linear_pipeline, linear_param_space,
    train_df, test_df, target,
    'neg_root_mean_squared_error')
Here is the output:
Tuning params
100%|██████████| 50/50 [00:22<00:00, 2.22trial/s, best loss: 66571.41310284132]
Best params = {'model__regressor__alpha': 0.013724898474056764}
Refitting with best params on the entire training set
Test metric = -67627.17474
The RMSE is \(67627.17474\). So let’s summarize the results in a table:

|  | Linear | Power basis | Bernstein basis |
|---|---|---|---|
| RMSE | 67627.17474 | 63534.49228 | 61559.04848 |
| Improvement over Linear | 0% | 6.05% | 8.97% |
| Tuned degree | 1 | 31 | 50 |
Impressive! Just changing the polynomial basis gives us a visible boost, and the high degree doesn’t appear to cause any harm.
Now we shall inspect our models a bit closer. That’s why we stored the fitted models in the bernstein_fit_result and power_basis_fit_result variables above. Following the structure of our pipelines, to get the coefficients we can use the following function:
def get_coefs(pipeline):
    transformed_target_regressor = pipeline.named_steps['model']
    ridge_model = transformed_target_regressor.regressor_  # the fitted regressor
    return ridge_model.coef_.ravel()
Now we can plot the polynomials! First, we will need to extract the coefficients of the numerical features, and ignore the ones corresponding to the categorical features. Next, we need to treat the coefficients of each numerical feature separately, and plot the polynomial they represent. Since our numerical features are always scaled to \([0, 1]\), plotting amounts to evaluating our polynomials on a dense grid in \([0, 1]\). So this is our plotting function:
import math
import matplotlib.pyplot as plt

def plot_feature_curves(pipeline, basis_transformer_ctor, numerical_features):
    # get the coefficients and the degree
    degree = pipeline.get_params()['preprocessor__numerical__basis__degree']
    coefs = get_coefs(pipeline)

    # extract the numerical features, and form a matrix, such that the
    # coefficients of each feature are in a separate row.
    numerical_slice = pipeline.get_params()['preprocessor'].output_indices_['numerical']
    feature_coefs = coefs[numerical_slice].reshape(-1, degree)

    # form the basis Vandermonde matrix on [0, 1]
    xs = np.linspace(0, 1, 1000)
    xs_vander = basis_transformer_ctor(degree=degree).fit_transform(xs)

    # do the plotting
    n_cols = 3
    n_rows = math.ceil(len(numerical_features) / n_cols)
    fig, axs = plt.subplots(n_rows, n_cols, figsize=(3 * n_cols, 3 * n_rows))
    for i, (ax, curve_coefs) in enumerate(zip(axs.ravel(), feature_coefs)):
        ax.plot(xs, xs_vander @ curve_coefs)
        ax.set_title(numerical_features[i])
    fig.show()
A bit lengthy, but understandable.
Recalling our previous post, we know that the coefficients in the Bernstein basis are actually “control points”, so let’s add the ability to plot them as well to the above function:
import math
import matplotlib.pyplot as plt

def plot_feature_curves(pipeline, basis_transformer_ctor, numerical_features,
                        plot_control_pts=True):
    # get the coefficients and the degree
    degree = pipeline.get_params()['preprocessor__numerical__basis__degree']
    coefs = get_coefs(pipeline)

    # extract the numerical features, and form a matrix, such that the
    # coefficients of each feature are in a separate row.
    numerical_slice = pipeline.get_params()['preprocessor'].output_indices_['numerical']
    feature_coefs = coefs[numerical_slice].reshape(-1, degree)

    # form the basis Vandermonde matrix on [0, 1]
    xs = np.linspace(0, 1, 1000)
    xs_vander = basis_transformer_ctor(degree=degree).fit_transform(xs)

    # do the plotting
    n_cols = 3
    n_rows = math.ceil(len(numerical_features) / n_cols)
    fig, axs = plt.subplots(n_rows, n_cols, figsize=(3 * n_cols, 3 * n_rows))
    for i, (ax, curve_coefs) in enumerate(zip(axs.ravel(), feature_coefs)):
        if plot_control_pts:
            control_xs = (1 + np.arange(len(curve_coefs))) / len(curve_coefs)
            ax.scatter(control_xs, curve_coefs, s=30, facecolors='none', edgecolor='b', alpha=0.5)
        ax.plot(xs, xs_vander @ curve_coefs)
        ax.set_title(numerical_features[i])
    fig.show()
Now let’s see our Bernstein polynomials!
plot_feature_curves(bernstein_fit_result, BernsteinFeatures, numerical_features)
What about the power basis? Let’s take a look as well. Note that we won’t plot the coefficients as “control points”, since the coefficients of the power basis are not control points in any way.
plot_feature_curves(power_basis_fit_result, PowerBasisFeatures, numerical_features, plot_control_pts=False)
Look at the “households” and “total_bedrooms” polynomials. It seems that they’re “going crazy” near the boundary of the domain. As we expected - the power basis was not specifically designed to approximate functions on \([0, 1]\), and it’s hard to regularize it into a good fit: it will either over-fit near the boundary, or we will over-regularize and under-fit.
In fact, we may recall that the “natural domain” of the power basis is the complex unit circle. It may be interesting to try representing periodic features, such as the time of day using the power basis, since such features naturally map to a point on a circle. However, there are other challenges involved, such as ensuring that our model will be real-valued rather than complex-valued, and this may be a nice subject for another post.
This was a nice adventure. I certainly learned a lot about Scikit-Learn while writing this post, and I hope the transformer for producing the Bernstein basis may be useful to you as well. We note that polynomial non-linear features have a nice property: they have only one tunable hyperparameter, so learning a tuned model should be computationally cheaper compared to other alternatives, such as radial basis functions.
Looking again at the Bernstein polynomials above, we see that they are a bit ‘wiggly’ and the control points seem like a mess; in the previous post we learned how to smooth them out by regularizing their second derivative. Moreover, at the beginning of this post we said something interesting - the predictive power of simple models may be improved by incorporating interactions between features. So in the next posts we’re going to do exactly that - enhance our transformer to model feature interactions, and write an enhanced version of the Ridge estimator to smooth polynomial features. Stay tuned!
I wouldn’t even call it extrapolation - in our context I think of the polynomial basis as “undefined” outside of its natural domain. ↩
Bergstra, James, Daniel Yamins, and David Cox. “Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures.” International conference on machine learning. PMLR, 2013. ↩
Watanabe, S., 2023. Tree-structured Parzen estimator: Understanding its algorithm components and their roles for better empirical performance. arXiv preprint arXiv:2304.11127. ↩
In the previous post we saw that the Bernstein polynomials can be used to fit a high-degree polynomial curve with ease, without its shape going out of control. In this post we’ll look at the Bernstein polynomials in more depth, both experimentally and theoretically. First, we will explore the Bernstein polynomials \(\mathbb{B}_n = \{ b_{0,n}, \dots, b_{n, n} \}\), where
\[b_{i,n}(x) = \binom{n}{i} x^i (1-x)^{n-i},\]empirically and visually. We will see how to use the coefficients to achieve a higher degree of control over the shape of the function we fit. Then, we’ll explore them more theoretically, and see that they are indeed a basis - they represent the same model class as the classical power basis \(\{1, x, x^2, \dots, x^n\}\). All the results are reproducible from this notebook.
To study the shape preserving properties, we will rely on the bernvander function we’ve implemented in the last post, which, given the numbers \(x_1, \dots, x_m\), computes the Bernstein Vandermonde matrix of a given degree \(n\), containing all the basis polynomials evaluated at all the given points:
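The implementation from the last post isn’t reproduced here; a sketch consistent with how bernvander is used below (its signature is assumed from the calls) could look like this:

```python
import numpy as np
from scipy.stats import binom

def bernvander(x, deg):
    # Column i holds b_{i,deg}(x) = C(deg, i) x^i (1-x)^(deg-i),
    # which is exactly the Binomial(deg, x) pmf evaluated at i
    return binom.pmf(np.arange(deg + 1), deg, np.asarray(x)[:, None])

print(bernvander(np.array([0.0, 0.5, 1.0]), deg=2))
```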
This is something we should have probably done earlier, but let’s plot the Bernstein polynomials to see what they look like. Below, we plot the basis \(\mathbb{B}_{7}\) using the bernvander function.
import matplotlib.pyplot as plt
import numpy as np

plt_xs = np.linspace(0, 1, 1000)
bernstein_basis = bernvander(plt_xs, deg=7)
plt.plot(plt_xs, bernstein_basis,
         label=[f'$b_{{{i},7}}$' for i in range(8)])
plt.legend(ncols=2)
plt.show()
We can see that each polynomial is a “hill”, and the maxima appear to be equally spaced. Are they? Let’s add vertical bars using the axvline function to verify:
plt_xs = np.linspace(0, 1, 1000)
bernstein_basis = bernvander(plt_xs, deg=7)
plt.plot(plt_xs, bernstein_basis,
         label=[f'$b_{{{i},7}}$' for i in range(8)])
for x in np.linspace(0, 1, 8):
    plt.axvline(x, color='gray', linestyle='dotted')
plt.legend(ncols=2)
plt.show()
It indeed appears so - the maxima of the polynomials are at \(\{ \tfrac{i}{n}\}_{i=0}^n\). We won’t prove it formally, but it’s not hard. Now we can derive some interesting insights. Suppose we have a polynomial written in Bernstein form, namely, as a weighted sum of Bernstein polynomials:
\[f(x) = \sum_{i=0}^n u_i b_{i,n}(x)\]Recall from the previous post that the Bernstein polynomials sum to one, and therefore \(f(x)\) is just a weighted average of the coefficients \(u_0, \dots, u_n\). Thus, at \(x=\frac{i}{n}\), the weight of \(u_i\) in the weighted average dominates the weights of the other coefficients. In other words,
\(u_i\) controls the polynomial \(f(x)\) in the vicinity of the point \(\frac{i}{n}\).
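A quick numeric check of this claim, using the fact that \(b_{k,n}(x)\) is the Binomial\((n, x)\) pmf at \(k\):

```python
import numpy as np
from scipy.stats import binom

n, i = 7, 3
weights = binom.pmf(np.arange(n + 1), n, i / n)  # b_{k,n}(i/n) for k = 0..n
print(weights.sum())     # the weights sum to one - a weighted average
print(weights.argmax())  # the i-th weight is the largest at x = i/n
```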
In fact, the name often given to the coefficients \(u_0, \dots, u_n\) is “control points”. To visualize this observation, let’s see what happens if we change one coefficient, \(u_3\), of a 7-th degree polynomial using an animation:
from matplotlib.animation import FuncAnimation, PillowWriter
n = 7
n_frames = 50
ctrl_xs = np.linspace(0, 1, 1 + n) # the points i / n
w_init = np.cos(2 * np.pi * ctrl_xs) # initial coefficients
plt_vander = bernvander(plt_xs, deg=n) # bernstein basis at plot points
fig, ax = plt.subplots()
def animate(i):
    # animate the coefficients "w"
    t = np.sin(2 * np.pi * i / n_frames)
    w = np.array(w_init)
    w[3] = (1 - t) * w[3] + t * 3
    # plot the Bernstein polynomial and the coefficients at i / n
    ax.clear()
    ax.set_xlim([-0.05, 1.05])
    ax.set_ylim([-3, 3])
    control_plot = ax.scatter(ctrl_xs, w, color='red')  # plot control points
    poly_plot = ax.plot(plt_xs, plt_vander @ w, color='blue')  # plot the polynomial
    return poly_plot, control_plot
ani = FuncAnimation(fig, animate, n_frames)
ani.save('control_coefficients.gif', dpi=300, writer=PillowWriter(fps=25))
We get the following result:
Looks nice! We can indeed see where the name “control points” comes from. But what can we say about it formally? Well, there are several results. The most famous one is the constructive proof of the Weierstrass approximation theorem:
Theorem [Lorentz^{1}, 1952] Suppose \(g(x)\) is continuous in \([0, 1]\). Then the polynomials \(\sum_{i=0}^n g(\tfrac{i}{n}) b_{i,n}(x)\) uniformly converge to \(g(x)\) as \(n \to \infty\).
As a consequence, we can interpret the Bernstein coefficient \(u_i\) as the value of some function \(g\) that our polynomial approximates at \(x=\frac{i}{n}\). Equipped with this idea, we can ask ourselves a simple question. What if the coefficients are increasing? Will the polynomial be an increasing function?
Well, it turns out the answer is yes - we can force the polynomial to be an increasing function of \(x\) by making sure the coefficients are increasing. In fact, we have even more interesting things we can formally say. To do that, let’s look at the derivatives of polynomials in Bernstein form. Suppose that
\[f(x) = \sum_{i=0}^n u_i b_{i,n}(x),\]then the first and second derivatives are:
\[\begin{align} f'(x) &= n \sum_{i=0}^{n-1} (u_{i+1} - u_i) b_{i,n-1}(x) \\ f''(x) &= n (n-1) \sum_{i=0}^{n-2} (u_{i+2} - 2 u_{i+1} + u_i) b_{i,n-2}(x) \end{align}\]The first derivative is a weighted sum of the coefficient first order differences \(u_{i+1}-u_i\), whereas the second derivative is a weighted sum of the second order differences \(u_{i+2}-2u_{i+1}+u_i\). Therefore, we can conclude that:
Theorem [Chang et al.^{2}, 2007, Proposition 1] Given \(f(x) = \sum_{i=0}^n u_i b_{i,n}(x)\):
- If \(u_{i+1} - u_i \geq 0\), then \(f'(x) \geq 0\), and \(f\) is nondecreasing,
- If \(u_{i+1} - u_i \leq 0\), then \(f'(x) \leq 0\), and \(f\) is nonincreasing,
- If \(u_{i+2} - 2u_{i+1} + u_i \geq 0\), then \(f''(x) \geq 0\), and \(f\) is convex,
- If \(u_{i+2} - 2u_{i+1} + u_i \leq 0\), then \(f''(x) \leq 0\), and \(f\) is concave,
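As a quick numerical check of the first claim (a sketch; the Bernstein Vandermonde matrix is rebuilt here via the binomial p.m.f so the snippet is self-contained), a polynomial with nondecreasing coefficients should be nondecreasing on a dense grid:

```python
import numpy as np
from scipy.stats import binom

def bernvander(x, deg):
    # Bernstein basis values at each point, as a "Vandermonde" matrix
    return binom.pmf(np.arange(1 + deg), deg, x.reshape(-1, 1))

n = 7
rng = np.random.default_rng(0)
u = np.sort(rng.standard_normal(1 + n))  # nondecreasing coefficients
xs = np.linspace(0, 1, 500)
ys = bernvander(xs, n) @ u
assert np.all(np.diff(ys) >= -1e-12)  # f is nondecreasing on the grid
```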
An important application of fitting nondecreasing functions, for example, is fitting a CDF. One practical example of CDF fitting is the bid shading problem^{3}^{4}^{5} in online advertising. We are required to model the probability of winning an ad auction given a bid \(x\). Naturally, the winning probability should increase when the bid \(x\) increases. Another important example is calibration curves^{6}^{7}^{8} in classification models, which are functions that map the model’s score to a probability such that the mean predicted probability conforms to the true conditional probability of the label given the features. The curve should be increasing - the higher the score, the higher the probability it represents. See this great tutorial in the Scikit-Learn documentation.
The simplest way to impose constraints on the coefficients when fitting models on small-scale data is using the CVXPY library, which we already encountered in previous posts in this blog. The library allows solving arbitrary convex optimization problems, specified by the function to minimize, and a set of constraints. Let’s see how we can use CVXPY to fit a nondecreasing Bernstein polynomial. First, we define the function and use it to generate noisy data:
def nondecreasing_func(x):
    return (3 - 2 * x) * (x ** 2) * np.exp(x)
# define number of points and noise
m = 30
sigma = 0.2
np.random.seed(42)
x = np.random.rand(m)
y = nondecreasing_func(x) + sigma * np.random.randn(m)
Now, we define the model fitting as an optimization problem with constraints. Mathematically, we aim to minimize the L2 loss subject to coefficient monotonicity constraints:
\[\begin{align} \min_{\mathbf{u}} & \quad \| \mathbf{V} \mathbf{u} - \mathbf{y} \|^2 \\ \text{s.t.} &\quad u_{i+1} \geq u_i && i = 0, \dots, n-1 \end{align}\]The matrix \(\mathbf{V}\) is the Bernstein Vandermonde matrix at \(x_1, \dots, x_m\). When multiplied by \(\mathbf{u}\), we obtain the values of the polynomial in Bernstein form at each of the data points. The following CVXPY code is just a direct formulation of the above for fitting a polynomial of degree \(n=20\):
import cvxpy as cp
deg = 20
u = cp.Variable(deg + 1) # a placeholder for the optimal Bernstein coefficients
loss = cp.sum_squares(bernvander(x, deg) @ u - y) # The L2 loss - the sum of residual squares
constraints = [cp.diff(u) >= 0] # constraints - u_{i+1} - u_i >= 0
problem = cp.Problem(cp.Minimize(loss), constraints)
# solve the minimization problem and extract the coefficients
problem.solve()
u_opt = u.value
Now, let’s plot the points, the original function, and the fitted polynomial:
plt.scatter(x, y, color='red')
plt.plot(plt_xs, nondecreasing_func(plt_xs), color='blue')
plt.plot(plt_xs, bernvander(plt_xs, deg) @ u_opt, color='green')
Not bad, given the level of noise, and the fact that we have no regularization whatsoever! For larger scale problems we will typically use an ML framework, such as PyTorch or TensorFlow, and they do not provide mechanisms to impose hard constraints on parameters. Therefore, when using such frameworks, we need to use a regularization term that penalizes violation of our desired constraints. For example, to penalize for violating the nondecreasing constraint, we can use the regularizer:
\[r(\mathbf{u}) = \sum_{i=0}^{n-1} \max(0, u_{i} - u_{i+1})^2\]Looking at the curve above, we see that it’s a bit wiggly. Can we do something about it? Looking at the second derivative formula above, we can “smooth out” the curve by adding a regularization term that penalizes the second order differences. This will, in turn, penalize the second derivative. Why second order? Because ideally, when the second order differences are zero, we’ll get a straight line. So we’re “smoothing out” the curve to be more straight.
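In NumPy terms, the hinge-style monotonicity penalty could look like the following sketch (the function name is mine; in PyTorch or TensorFlow one would use the analogous tensor operations so the penalty stays differentiable):

```python
import numpy as np

def monotonicity_penalty(u):
    # differences u_{i+1} - u_i; negative entries violate the nondecreasing constraint
    diffs = np.diff(u)
    # squared hinge: only violations contribute to the penalty
    return np.sum(np.maximum(0.0, -diffs) ** 2)

assert monotonicity_penalty(np.array([0.0, 1.0, 2.0])) == 0.0  # nondecreasing: no penalty
assert monotonicity_penalty(np.array([0.0, 2.0, 1.0])) == 1.0  # one violation of size 1
```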
Mathematically, we’ll need to solve:
\[\begin{align} \min_{\mathbf{u}} & \quad \| \mathbf{V} \mathbf{u} - \mathbf{y} \|^2 + \alpha \sum_{i=0}^{n-2} (u_{i+2} - 2 u_{i+1} + u_i)^2 \\ \text{s.t.} &\quad u_{i+1} \geq u_i && i = 0, \dots, n-1 \end{align}\]where \(\alpha\) is a tuned regularization parameter. The code in CVXPY, after tuning \(\alpha\), looks like this:
deg = 20
alpha = 2
u = cp.Variable(deg + 1) # a placeholder for the optimal Bernstein coefficients
loss = cp.sum_squares(bernvander(x, deg) @ u - y) # The L2 loss - the sum of residual squares
reg = alpha * cp.sum_squares(cp.diff(u, 2)) # penalty for 2nd order differences
constraints = [cp.diff(u) >= 0] # constraints - u_{i+1} - u_i >= 0
problem = cp.Problem(cp.Minimize(loss + reg), constraints)
After solving the problem and plotting the polynomial, I obtained this:
Not bad! Now we will study the Bernstein basis from a more theoretical perspective to understand their representation power.
So, is it really a basis? First, let’s note that the set \(\mathbb{B}_n\) of n-th degree Bernstein polynomials indeed has \(n+1\) polynomial functions. So it remains to be convinced that any polynomial can be expressed as a weighted sum of these \(n+1\) functions. It turns out that for any \(k < n\), we can write:
\[x^k = \sum_{j=k}^n \frac{\binom{j}{k}}{\binom{n}{k}} b_{j, n}(x) = \sum_{j=k}^n q_{j,k} b_{j,n}(x)\]The proof is a bit technical and involved, and requires the inverse binomial transform, but it gives us our desired result: any power of \(x\) up to \(n\) can be expressed using Bernstein polynomials. Consequently, any polynomial of degree up to \(n\) can be expressed as a weighted sum of Bernstein polynomials, and therefore:
The representation power of Bernstein polynomials is identical to that of the standard basis. Both represent the same model class we fit to data.
Using Bernstein polynomials, in itself, does not restrict or regularize the model class, since any polynomial can be written in Bernstein form. The Bernstein form is just easier to regularize.
This observation leads to some interesting insights, which will be easier to describe by writing the standard and the Bernstein bases as vectors:
\[\mathbf{p}_n(x)=(1, x, x^2, \cdots, x^n)^T, \qquad \mathbf{b}_n(x)=(b_{0,n}(x), \cdots, b_{n,n}(x))^T\]We note that the standard and Bernstein Vandermonde matrix rows we saw in the previous post are exactly \(\mathbf{p}_n(x_i)\), and \(\mathbf{b}_n(x_i)\), respectively. Using this notation, we can write the powers of \(x\) in terms of the Bernstein basis in matrix form, by gathering the coefficients \(q_{j,k}\) above, assuming that \(q_{j,k}=0\) whenever \(j<k\), into a triangular matrix \(\mathbf{Q}_n\):
\[\mathbf{p}_n(x)^T = \mathbf{b}_n(x)^T \mathbf{Q}_n\]The matrix \(\mathbf{Q}_n\) is the basis transition matrix - it can transform any polynomial written using the standard basis to the same polynomial written in the Bernstein basis:
\[a_0 + a_1 x + \dots + a_n x^n = \mathbf{p}_n(x)^T \mathbf{a} = \mathbf{b}_n(x)^T \mathbf{Q}_n \mathbf{a}\]The vector \(\mathbf{Q}_n \mathbf{a}\) is the coefficient vector w.r.t the Bernstein basis. Does it mean we can actually fit a polynomial in the standard basis, but regularize it as if it was written in the Bernstein basis? Well, yes we can! Polynomial fitting in the Bernstein basis can be written as
\[\min_{\mathbf{w}} \quad \frac{1}{2}\sum_{i=1}^m (\mathbf{b}_n(x_i)^T \mathbf{w} - y_i)^2 + \frac{\alpha}{2} \| \mathbf{w} \|^2.\]The constants \(\frac{1}{2}\) are for convenience later, when taking derivatives. Introducing the change of variables \(\mathbf{w} = \mathbf{Q}_n \mathbf{a}\), the above problem becomes equivalent to:
\[\min_{\mathbf{a}} \quad \frac{1}{2} \sum_{i=1}^m (\mathbf{p}_n(x_i)^T \mathbf{a} - y_i)^2 + \frac{\alpha}{2} \| \mathbf{Q}_n \mathbf{a} \|^2. \tag{P}\]Thus, we can fit a polynomial in terms of its standard basis coefficients \(\mathbf{a}\), but regularize its Bernstein coefficients \(\mathbf{Q}_n \mathbf{a}\). So does it really work? Let’s check! First, let’s implement the transition matrix function:
import numpy as np
from scipy.special import binom
def basis_transition(n):
    ks = np.arange(0, 1 + n)
    js = np.arange(0, 1 + n).reshape(-1, 1)
    Q = binom(js, ks) / binom(n, ks)
    Q = np.tril(Q)
    return Q
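As a sanity check (a sketch; the bernvander function from the first post is rebuilt here so the snippet is self-contained), we can verify the identity \(\mathbf{p}_n(x)^T = \mathbf{b}_n(x)^T \mathbf{Q}_n\) numerically on a grid of points:

```python
import numpy as np
from scipy.special import binom
from scipy.stats import binom as binom_dist

def basis_transition(n):
    ks = np.arange(0, 1 + n)
    js = np.arange(0, 1 + n).reshape(-1, 1)
    return np.tril(binom(js, ks) / binom(n, ks))

def bernvander(x, deg):
    return binom_dist.pmf(np.arange(1 + deg), deg, x.reshape(-1, 1))

n = 6
xs = np.linspace(0, 1, 50)
P = np.vander(xs, 1 + n, increasing=True)  # rows are p_n(x_i)^T
B = bernvander(xs, n)                      # rows are b_n(x_i)^T
assert np.allclose(P, B @ basis_transition(n))
```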
The regularized least-squares problem (P) above is a convex problem that can be easily solved by equating the gradient w.r.t \(\mathbf{a}\) with zero. Putting all the \(\mathbf{p}_n(x_i)\) for the data points \(i = 1, \dots, m\) into the rows of the Vandermonde matrix \(\mathbf{V}\), equating the gradient to zero becomes:
\[\mathbf{V}^T (\mathbf{V} \mathbf{a} - \mathbf{y}) + \alpha \mathbf{Q}_n^T \mathbf{Q}_n \mathbf{a} = 0.\]Re-arranging, and solving for the coefficients \(\mathbf{a}\), we obtain:
\[\mathbf{a} = (\mathbf{V}^T \mathbf{V} + \alpha \mathbf{Q}_n^T \mathbf{Q}_n)^{-1} \mathbf{V}^T \mathbf{y}\]So let’s implement the fitting procedure:
import numpy.polynomial.polynomial as poly
def fit_bernstein_reg(x, y, alpha, deg):
    """Fit a polynomial in the standard basis to the data points `(x[i], y[i])`
    with Bernstein regularization `alpha`, and degree `deg`.
    """
    V = poly.polyvander(x, deg)
    Q = basis_transition(deg)
    A = V.T @ V + alpha * Q.T @ Q
    b = V.T @ y
    # solve the linear system
    a = np.linalg.solve(A, b)
    return a
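Before using it on our noisy samples, a quick sanity check (a sketch; both helpers above are repeated so the snippet stands alone): with no noise and a tiny regularization coefficient, the procedure should recover the standard-basis coefficients of a simple polynomial.

```python
import numpy as np
import numpy.polynomial.polynomial as poly
from scipy.special import binom

def basis_transition(n):
    ks = np.arange(0, 1 + n)
    js = np.arange(0, 1 + n).reshape(-1, 1)
    return np.tril(binom(js, ks) / binom(n, ks))

def fit_bernstein_reg(x, y, alpha, deg):
    V = poly.polyvander(x, deg)
    Q = basis_transition(deg)
    return np.linalg.solve(V.T @ V + alpha * Q.T @ Q, V.T @ y)

# noiseless data from a quadratic with standard-basis coefficients (1, 2, -3)
rng = np.random.default_rng(0)
x = rng.random(50)
y = 1 + 2 * x - 3 * x ** 2
a = fit_bernstein_reg(x, y, alpha=1e-12, deg=2)
assert np.allclose(a, [1, 2, -3], atol=1e-6)
```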
Now, let’s try reproducing the results of the previous post with degrees 50 and 100.
def true_func(x):
    return np.sin(8 * np.pi * x) / np.exp(x) + x
# define number of points and noise
m = 30
sigma = 0.1
deg = 50
# generate features
np.random.seed(42)
x = np.random.rand(m)
y = true_func(x) + sigma * np.random.randn(m)
# fit the polynomial
a = fit_bernstein_reg(x, y, 5e-4, deg=deg)
# plot the original function, the points, and the fit polynomial
plt_xs = np.linspace(0, 1, 1000)
polynomial_ys = poly.polyvander(plt_xs, deg) @ a
plt.scatter(x, y)
plt.plot(plt_xs, true_func(plt_xs), 'blue')
plt.plot(plt_xs, polynomial_ys, 'red')
plt.show()
I got the following plot, which appears pretty similar to what we got in the previous post, but slightly worse:
Let’s crank up the degree to 100 by setting deg = 100
. I got the following image:
Again, slightly worse than what we achieved by directly fitting the Bernstein form, but appears close.
There are two technical issues with our idea. First, manually fitting models rather than relying on standard tools, such as Scikit-Learn, appears to be troublesome, and in terms of computational efficiency, we need to deal with the additional matrix \(\mathbf{Q}_n\). Second, and most importantly, the standard Vandermonde matrix and the basis transition matrix \(\mathbf{Q}_n\) are extremely ill conditioned. This makes it hard to actually solve the fitting problem and obtain coefficients that are close to the true optimal coefficients. This is true regardless of whether we choose direct matrix inversion, CVXPY, or an SGD-based optimizer from PyTorch or TensorFlow.
Due to inefficiency and ill conditioning, this trick has little value in practice. But it provides us with an important insight: achieving good regularization requires a sophisticated non-diagonal matrix in the regularization term. It’s not a formal statement, but probably any “good” basis will have a non-diagonal transition matrix to the standard basis. This means that fitting a polynomial in the standard basis using typical ML tricks of rescaling the columns of the Vandermonde matrix has little chance of success. And it doesn’t matter if we rescale using min-max scaling, or standardization to zero mean and unit variance. To fit a polynomial, we need to use a “good” basis directly.
In this post we explored the ability of the Bernstein form to control the shape of the curve we’re fitting - making it smooth, increasing, decreasing, convex, or concave. Then, we saw that Bernstein polynomials are just polynomials - they have the same representation power as the standard basis, but are easier to regularize.
The next post will be more engineering oriented. We’ll see how to use the Bernstein basis for feature engineering and fitting models to some real-world data-sets, and we will write a SciKit-Learn transformer to do so. Stay tuned!
Lorentz, G. G. (1952). Bernstein Polynomials. University of Toronto Press. ↩
Chang, I. S., Chien, L. C., Hsiung, C. A., Wen, C. C., & Wu, Y. J. (2007). Shape restricted regression with random Bernstein polynomials. Lecture Notes-Monograph Series, 187-202. ↩
Sarah Sluis, S. (2019). Everything you need to know about bid shading. ↩
Karlsson, N., & Sang, Q. (2021, May). Adaptive bid shading optimization of first-price ad inventory. In 2021 American Control Conference (ACC) (pp. 4983-4990). IEEE. ↩
Gligorijevic, D., Zhou, T., Shetty, B., Kitts, B., Pan, S., Pan, J., & Flores, A. (2020, October). Bid shading in the brave new world of first-price auctions. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management (pp. 2453-2460). ↩
Niculescu-Mizil, A., & Caruana, R. (2005, August). Predicting good probabilities with supervised learning. In Proceedings of the 22nd international conference on Machine learning (pp. 625-632). ↩
Platt, J. (1999). Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers, 10(3), 61-74. ↩
Zadrozny, B., & Elkan, C. (2002, July). Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 694-699). ↩
When fitting a non-linear model using linear regression, we typically generate new features using non-linear functions. We also know that any function, in theory, can be approximated by a sufficiently high degree polynomial. This result is known as the Weierstrass approximation theorem. But many blogs, papers, and even books tell us that high degree polynomials should be avoided. They tend to oscillate and overfit, and regularization doesn’t help! They even scare us with images, such as the one below, where the polynomial fit using the data points (in red) is far away from the true function (in blue):
It turns out that it’s just a MYTH. There’s nothing inherently wrong with high degree polynomials, and in contrast to what is typically taught, high degree polynomials are easily controlled using standard ML tools, like regularization. The source of the myth stems mainly from two misconceptions about polynomials that we will explore here. In fact, not only are they great non-linear features, certain representations also provide us with powerful control over the shape of the function we wish to learn.
A colab notebook with the code for reproducing the above results is available here.
Vladimir Vapnik, in his famous book “The Nature of Statistical Learning Theory” which is cited more than 100,000 times as of today, coined the approximation vs. estimation balance. The approximation power of a model is its ability to represent the “reality” we would like to learn. Typically, approximation power increases with the complexity of the model - more parameters mean more power to represent any function to arbitrary precision. Polynomials are no different - higher degree polynomials can represent functions to higher accuracy. However, more parameters make it difficult to estimate these parameters from the data.
Indeed, higher degree polynomials have a higher capacity to approximate arbitrary functions. And since they have more coefficients, these coefficients are harder to estimate from data. But how does it differ from other non-linear features, such as the well-known radial basis functions? Why do polynomials have such a bad reputation? Are they truly hard to estimate from data?
It turns out that the primary source is the standard polynomial basis for n-degree polynomials \(\mathbb{E}_n = \{1, x, x^2, ..., x^n\}\). Indeed, any degree \(n\) polynomial can be written as a linear combination of these functions:
\[\alpha_0 \cdot 1 + \alpha_1 \cdot x + \alpha_2 \cdot x^2 + \cdots + \alpha_n x^n\]But the standard basis \(\mathbb{E}_n\) is awful for estimating polynomials from data. In this post we will explore other ways to represent polynomials that are appropriate for machine learning, and are readily available in standard Python packages. We note that one advantage of polynomials over other non-linear feature bases is that the only hyperparameter is their degree. There is no “kernel width”, like in radial basis functions^{1}.
The second source of their bad reputation is a misunderstanding of Weierstrass’ approximation theorem. It’s usually cited as “polynomials can approximate arbitrary continuous functions”. But that’s not entirely true. They can approximate arbitrary continuous functions in an interval. This means that when using polynomial features, the data must be normalized to lie in an interval. This can be done using min-max scaling, computing empirical quantiles, or passing the feature through a sigmoid. But we should avoid using polynomials on raw un-normalized features.
In this post we will demonstrate fitting the function
\[f(x)=\sin(8 \pi x) / \exp(x)+x\]on the interval \([0, 1]\) by fitting to \(m=30\) samples corrupted by Gaussian noise. The following code implements the function and generates samples:
import numpy as np
def true_func(x):
    return np.sin(8 * np.pi * x) / np.exp(x) + x
m = 30
sigma = 0.1
# generate features
np.random.seed(42)
X = np.random.rand(m)
y = true_func(X) + sigma * np.random.randn(m)
For function plotting, we will use uniformly-spaced points in \([0, 1]\). The following code plots the true function and the sample points:
import matplotlib.pyplot as plt
plt_xs = np.linspace(0, 1, 1000)
plt.scatter(X.ravel(), y.ravel())
plt.plot(plt_xs, true_func(plt_xs), 'blue')
plt.show()
Now let’s fit a polynomial to the sampled points using the standard basis. Namely, we’re given the set of noisy points \(\{ (x_i, y_i) \}_{i=1}^m\), and we need to find the coefficients \(\alpha_0, \dots, \alpha_n\) that minimize:
\[\sum_{i=1}^m (\alpha_0 + \alpha_1 x_i + \dots + \alpha_n x_i^n - y_i)^2\]As expected, this is readily accomplished by transforming each sample \(x_i\) to a vector of features \(1, x_i, \dots, x_i^n\), and fitting a linear regression model to the resulting features. Fortunately, NumPy has the numpy.polynomial.polynomial.polyvander
function. It takes a vector containing \(x_1, \dots, x_m\) and produces the matrix
\[\mathbf{V} = \begin{pmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^n \\ 1 & x_2 & x_2^2 & \cdots & x_2^n \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_m & x_m^2 & \cdots & x_m^n \end{pmatrix}\]
The name of the function comes from the name of the matrix - the Vandermonde matrix. Let’s use it to fit a polynomial of degree \(n=50\).
from sklearn.linear_model import LinearRegression
import numpy.polynomial.polynomial as poly
n = 50
model = LinearRegression(fit_intercept=False)
model.fit(poly.polyvander(X, deg=n), y)
The reason we use fit_intercept=False
is because the ‘intercept’ is provided by the first column of the Vandermonde matrix. Now we can plot the function we just fit:
plt.scatter(X.ravel(), y.ravel()) # plot the samples
plt.plot(plt_xs, true_func(plt_xs), 'blue') # plot the true function
plt.plot(plt_xs, model.predict(poly.polyvander(plt_xs, deg=n)), 'r') # plot the fit model
plt.ylim([-5, 5])
plt.show()
As expected, we got the “scary” image from the beginning of this post. Indeed, the standard basis is awful for model fitting! Maybe adding some L2 regularization helps? Let’s use the Ridge
class from the sklearn.linear_model
package to fit an L2 regularized model:
from sklearn.linear_model import Ridge
reg_coef = 1e-7
model = Ridge(fit_intercept=False, alpha=reg_coef)
model.fit(poly.polyvander(X, deg=n), y)
plt.scatter(X.ravel(), y.ravel()) # plot the samples
plt.plot(plt_xs, true_func(plt_xs), 'blue') # plot the true function
plt.plot(plt_xs, model.predict(poly.polyvander(plt_xs, deg=n)), 'r') # plot the fit model
plt.ylim([-5, 5])
plt.show()
We get the following result:
The regularization coefficient of \(\alpha=10^{-7}\) is large enough to break the model in \([0,0.8]\) but not large enough to avoid over-fitting in \([0.8, 1]\). Increasing the coefficient clearly won’t help - the model will be broken even further in \([0, 0.8]\).
Since we will be trying several polynomial bases, it makes sense to write a more generic function for our experiments that will accept various “Vandermonde” matrix functions of the basis of our choice, fit the polynomial using the Ridge
class, and plot it with the original function and the sample points.
def fit_and_plot(vander, n, alpha):
    model = Ridge(fit_intercept=False, alpha=alpha)
    model.fit(vander(X, deg=n), y)
    plt.scatter(X.ravel(), y.ravel())  # plot the samples
    plt.plot(plt_xs, true_func(plt_xs), 'blue')  # plot the true function
    plt.plot(plt_xs, model.predict(vander(plt_xs, deg=n)), 'r')  # plot the fit model
    plt.ylim([-5, 5])
    plt.show()
Now we can reproduce our latest experiment by invoking:
fit_and_plot(poly.polyvander, n=50, alpha=1e-7)
It turns out that in our sister discipline, approximation theory, researchers also encountered similar difficulties with the standard basis \(\mathbb{E}_n\), and developed a theory for approximating functions by polynomials from different bases. Two prominent examples of bases of \(n\)-degree polynomials, and their corresponding NumPy modules, are:
- Chebyshev polynomials, available in the numpy.polynomial.chebyshev module.
- Legendre polynomials, available in the numpy.polynomial.legendre module.

They are the computational workhorse of a large variety of numerical algorithms that are enabled by approximating a function using a polynomial, and are well-known for their advantages in approximating functions in the \([-1, 1]\) interval^{2}. In particular, the corresponding “Vandermonde” matrices are provided by the chebvander
and legvander
functions in the corresponding modules above. Each row in these matrices contains the value of the basis functions at each point, just like the standard Vandermonde matrix of the standard basis. For example, the Chebyshev Vandermonde matrix is:
\[\begin{pmatrix} T_0(x_1) & T_1(x_1) & \cdots & T_n(x_1) \\ \vdots & \vdots & & \vdots \\ T_0(x_m) & T_1(x_m) & \cdots & T_n(x_m) \end{pmatrix}\]
I will not elaborate on their formulas and properties here for a reason that will immediately be revealed. However, I highly recommend Prof. Nick Trefethen’s “Approximation theory and approximation practice” online video course to get familiar with their advantages. His book with the same name is an excellent introduction to the subject.
It might be tempting to try fitting a Chebyshev polynomial using our fit_and_plot
method above directly:
import numpy.polynomial.chebyshev as cheb
fit_and_plot(cheb.chebvander, n=50, alpha=1e-7)
However, that’s not the best thing to do. We aim to fit a function sampled from \([0, 1]\), but the Chebyshev basis “lives” in \([-1, 1]\). Therefore, we will add the transformation \(x \to 2x-1\) before invoking the chebvander
function:
def scaled_chebvander(x, deg):
    return cheb.chebvander(2 * x - 1, deg=deg)
fit_and_plot(scaled_chebvander, n=50, alpha=1)
Note that a different basis requires a different regularization coefficient. We get the following result:
Whoa! It seems even worse than the standard basis! Maybe more regularization helps?
fit_and_plot(scaled_chebvander, n=50, alpha=10)
It appears that our polynomial is both a bad fit for the function, and extremely oscillatory. Even worse than the standard basis! Interested readers can repeat the experiment with Legendre polynomials and see a slightly better, but similar result. So what’s wrong? Is everything that approximation theory tries to teach us about polynomials wrong?
The answer stems from the fundamental difference between two tasks:
- Interpolation - approximating a function whose values we can query exactly at any points of our choice.
- Fitting - estimating a function from noisy samples at points we do not control.
The Chebyshev and Legendre bases perform extremely well at the interpolation task, but not at the fitting task. It turns out that the polynomial \(T_k\) in the Chebyshev basis, and the polynomial \(P_k\) in the Legendre basis, are both \(k\)-degree polynomials. For example, \(T_1\) is a linear function, whereas \(T_{50}\) is a polynomial of degree 50. These two functions are radically different. Thus, the coefficients of \(T_1\) and \(T_{50}\) have “different units”. This property is shared with the standard basis as well. Thus, we have two issues:
- The coefficients have “different units”, so a single regularization coefficient cannot treat them all uniformly.
- The coefficients’ “units” also differ from the “units” of the labels we aim to fit.
Both properties show that for the fitting, rather the interpolation tasks we need something else.
A remedy is provided by the Bernstein basis \(\mathbb{B}_n = \{ b_{0,n}, \dots, b_{n, n} \}\). These are \(n\)-degree polynomials defined on \([0, 1]\) by:
\[b_{i,n}(x) = \binom{n}{i} x^i (1-x)^{n-i}\]These polynomials are widely used in computer graphics to approximate curves and surfaces, but it appears that they’re less known in the machine learning community. In fact, all the text you see on the screen when reading this post is rendered using Bernstein polynomials^{3}. We will study them more in depth in the next posts, but at this stage I would like to point out two simple properties that give an intuitive explanation of why they’re useful in machine learning.
First, note that each \(b_{i,n}\) is an \(n\)-degree polynomial. Thus, when representing a polynomial using
\[p_n(x) = \alpha_0 b_{0,n}(x) + \alpha_1 b_{1,n}(x) + \dots + \alpha_n b_{n,n}(x),\]all the coefficients have the same “units”.
If the formula of \(b_{i,n}(x)\) seems familiar - you are correct. It is exactly the probability mass function of the binomial distribution for obtaining \(i\) successes in a sequence of trials whose success probability is \(x\). Therefore, \(b_{i,n}(x) \geq 0\), and \(\sum_{i=0}^n b_{i,n}(x) = 1\) for any \(x \in [0, 1]\). Consequently, the polynomial \(p_n(x)\) is just a weighted average of the coefficients \(\alpha_0, \dots, \alpha_n\). So not only do the coefficients have the same “units”, their “units” are also the same as the model’s labels. Thus, they’re much easier to regularize - they’re all on the same “scale”.
Finally, due to the equivalence with the binomial distribution p.m.f, we can implement a “Vandermonde” matrix in Python using the scipy.stats.binom.pmf
function.
from scipy.stats import binom
def bernvander(x, deg):
    return binom.pmf(np.arange(1 + deg), deg, x.reshape(-1, 1))
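A quick check of the two properties mentioned above - nonnegativity and summing to one (a sketch; the function just defined is repeated so the snippet is self-contained):

```python
import numpy as np
from scipy.stats import binom

def bernvander(x, deg):
    return binom.pmf(np.arange(1 + deg), deg, x.reshape(-1, 1))

xs = np.linspace(0, 1, 100)
B = bernvander(xs, 10)
assert np.all(B >= 0)                   # b_{i,n}(x) >= 0
assert np.allclose(B.sum(axis=1), 1.0)  # sum_i b_{i,n}(x) = 1
```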
Let’s try to fit without regularization at all:
fit_and_plot(bernvander, n=50, alpha=0)
We see our regular over-fitting. Now let’s see that they’re indeed easy to regularize. After trying several regularization coefficients, I came up with this:
fit_and_plot(bernvander, n=50, alpha=5e-7)
Beautiful! This is a polynomial of degree 50! The fit is great, no oscillations, and the misfit near the right endpoint stems from the noise - I don’t believe there’s enough information in the data to convey the fact that it should “curve up” rather than “curve down”.
Let’s see what happens when we crank-up the degree. Can we produce a nice non-oscilating polynomial?
fit_and_plot(bernvander, n=100, alpha=5e-4)
This is a polynomial of degree 100, that does not overfit!
The notorious reputation of high-degree polynomials in the machine learning community is primarily a myth. Despite this, papers, books, and blog posts are based on this premise as if it were an axiom. Bernstein polynomials are little known in the machine learning community, but there are a few papers^{4}^{5} using them to represent polynomial features. Their main advantage is ease of use - we can use high degree polynomials to exploit their approximation power, and easily control model complexity with just one hyperparameter - the regularization coefficient.
In the following posts we will explore the Bernstein basis in more detail. We will use it to create polynomial features for real-world datasets and test it against the standard basis. Moreover, we will see how to regularize the coefficients to control the shape of the function we aim to represent. For example, what if we know that the function we’re aiming to fit is increasing? Stay tuned!
There are also kernel methods, and polynomial kernels. But polynomial kernels suffer from problems similar to the standard basis. ↩
The standard basis is not that awful. It’s a great basis for representing polynomials on the complex unit circle. In fact, the Fourier transform is based exactly on this observation. ↩
See Bézier curves and TrueType font outlines. ↩
Marco, Ana, and José-Javier Martínez. “Polynomial least squares fitting in the Bernstein basis.” Linear Algebra and its Applications 433.7 (2010): 1254-1264. ↩
Wang, Jiangdian, and Sujit K. Ghosh. “Shape restricted nonparametric regression with Bernstein polynomials.” Computational Statistics & Data Analysis 56.9 (2012): 2729-2741. ↩
In this final episode of the proximal point series I would like to take the method to the extreme, and show that we can actually train a model which, composed with an appropriate loss, produces functions which are non-linear and non-convex: we’ll be training a factorization machine for classification problems without linearly approximating loss functions and relying on loss gradients. Factorization machines and their variants are widely used in recommender systems, e.g. recommending movies to users. I assume readers are familiar with the basics, and below I provide only a brief introduction, so that throughout this post we have consistent notation and terminology, and understand the assumptions we make.
I do not claim that it is the best method for training factorization machines, but it is indeed an interesting challenge in order to see what the limits of efficiently implementable proximal point methods are. We’ll have some more advanced optimization theory, even some advanced linear algebra, but most importantly, at the end of the journey we’ll have a GitHub repo with code which you can run and try out on your own dataset!
Since it’s an ‘academic’ experiment in nature, and I do not aim to implement the most efficient and robust code, we’ll make some simplifying assumptions. However, a production-ready training algorithm will not be far from the implementation we construct in this post.
Let’s begin with a quick introduction to factorization machines. Factorization machines are usually trained on categorical data representing the users and the items; for example, age group and gender may be user features, while product category and price group may be item features. The model embeds each categorical feature into a latent space of some pre-defined dimension \(k\), and the model’s prediction comprises inner products of the latent vectors corresponding to the current sample’s features. The simplest variant is the second-order factorization machine, which is the focus of this post.
Formally, our second-order factorization machine \(\sigma(w, x)\) is given a binary input \(w \in \{0, 1\}^m\) which is a one-hot encoding of a subset of at most \(m\) categorical features. For example, suppose we would like to predict the affinity of people with chocolate. Assume, for simplicity, that we have only two user gender values \(\{ \mathrm{male}, \mathrm{female} \}\), and two age groups \(\{ \mathrm{young}, \mathrm{old} \}\). For our items, suppose we have only one feature - the chocolate type, which may take the values \(\{\mathrm{dark}, \mathrm{milk}, \mathrm{white}\}\). In that case, the model’s input is the vector of zeros and ones encoding feature indicators:
\[w=(w_{\mathrm{male}}, w_{\mathrm{female}}, w_{\mathrm{young}}, w_{\mathrm{old}}, w_{\mathrm{dark}}, w_{\mathrm{milk}}, w_{\mathrm{white}}).\]A young male who tasted dark chocolate is represented by the vector
\[w = (1, 0, 1, 0, 1, 0, 0).\]In general, the vector \(w\) can be defined by arbitrary real numbers, but I promised that we’ll make simplifying assumptions :)
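To make the encoding concrete, here is a minimal sketch of how such an indicator vector could be built; the `feature_index` mapping and the `one_hot` helper are hypothetical names used only for illustration:

```python
# Hypothetical index mapping for the toy chocolate example; these names
# are for illustration only and are not part of any library.
feature_index = {
    'male': 0, 'female': 1, 'young': 2, 'old': 3,
    'dark': 4, 'milk': 5, 'white': 6,
}

def one_hot(active_features, m=7):
    """Encode a set of active categorical feature values as a 0/1 vector w."""
    w = [0] * m
    for name in active_features:
        w[feature_index[name]] = 1
    return w

w = one_hot({'young', 'male', 'dark'})  # the young male who tasted dark chocolate
```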
The model’s parameter vector \(x = (b_0, b_1, \dots, b_m, v_1, \dots, v_m)\) is composed of the model’s global bias \(b_0 \in \mathbb{R}\), the biases \(b_i \in \mathbb{R}\) for the features \(i\in \{1, \dots, m\}\), and the latent vectors \(v_i \in \mathbb{R}^k\) for the same features, with \(k\) being the embedding dimension. The model computes:
\[\sigma(w, x) := b_0 + \sum_{i = 1}^m w_i b_i + \sum_{i = 1}^m\sum_{j = i + 1}^{m} (v_i^T v_j) w_i w_j.\]Let’s set up some notation which will become useful throughout this post. We will denote a set of consecutive integers by \(i..j=\{i, i+1, \dots, j\}\), and the set of distinct pairs of a set of integers \(J\) by \(P[J]=\{ (i,j) \in J\times J : i<j \}\). Consequently, we can re-write:
\[\sigma(w,x)=b_0 + \sum_{i\in 1..m} w_i b_i+\sum_{(i,j) \in P[1..m]} (v_i^T v_j) w_i w_j\]At this stage this notation does not seem useful, but we will use it consistently, and it will simplify things considerably later in this post.
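To illustrate the notation, here is a small sketch that evaluates \(\sigma\) directly from the definition above, with `pairs` playing the role of \(P[J]\); the function names are mine and not part of any library:

```python
import itertools

def pairs(J):
    """P[J]: all distinct pairs (i, j) from J with i < j."""
    return itertools.combinations(sorted(J), 2)

def sigma_naive(w, b0, b, v):
    """Evaluate sigma(w, x) directly: global bias + linear bias terms
    + pairwise inner products of the latent vectors."""
    m = len(w)
    linear = sum(w[i] * b[i] for i in range(m))
    pairwise = sum(
        sum(vi * vj for vi, vj in zip(v[i], v[j])) * w[i] * w[j]
        for (i, j) in pairs(range(m))
    )
    return b0 + linear + pairwise

# A tiny toy example: m = 3 features, embedding dimension k = 2.
w = [1, 0, 1]
b0, b = 0.5, [1.0, 2.0, 3.0]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
value = sigma_naive(w, b0, b, v)  # 0.5 + (1 + 3) + (v_1 . v_3) = 0.5 + 4 + 2
```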
For completeness, let’s implement a factorization machine in PyTorch. To that end, recall a famous trick introduced by Steffen Rendle in his pioneering paper^{1} on factorization machines, based on the formula
\[\Bigl\| \sum_{i\in 1..m} w_i v_i \Bigr\|_2^2 = \sum_{i\in 1..m} \|w_i v_i\|_2^2 + 2 \sum_{(i,j)\in P[1..m]} (v_i^T v_j) w_i w_j.\]After re-arrangement, the above results in:
\[\sum_{(i,j)\in P[1..m]} (v_i^T v_j) w_i w_j= \frac{1}{2}\Bigl\| \sum_{i\in 1..m} w_i v_i \Bigr\|_2^2-\frac{1}{2}\sum_{i\in 1..m} \|w_i v_i\|_2^2. \tag{L}\]Since \(w\) is a binary vector, we can associate it with its set of non-zero indices \(\operatorname{nz}(w)\), and the right-hand side above can be written as:
\[\frac{1}{2}\Bigl\| \sum_{i \in \operatorname{nz}(w)} v_i \Bigr\|_2^2-\frac{1}{2}\sum_{i \in \operatorname{nz}(w)} \| v_i\|_2^2.\]Consequently, the pairwise terms can be computed in time linear in the number of non-zero indicators in \(w\), instead of the quadratic time imposed by the naive way. The PyTorch implementation below uses the trick above.
import torch
from torch import nn

class FM(torch.nn.Module):
    def __init__(self, m, k):
        super(FM, self).__init__()
        self.bias = nn.Parameter(torch.zeros(1))
        self.biases = nn.Parameter(torch.zeros(m))
        self.vs = nn.Embedding(m, k)
        with torch.no_grad():
            torch.nn.init.normal_(self.vs.weight, std=0.01)
            torch.nn.init.normal_(self.biases, std=0.01)

    def forward(self, w_nz):  # since w are indicators, we simply use the non-zero indices
        vs = self.vs(w_nz)
        # in vs:
        #   dim = 0 is the mini-batch dimension. We operate on each elem. of a mini-batch separately.
        #   dim = 1 are the embedding vectors
        #   dim = 2 are their components.
        pow_of_sum = vs.sum(dim=1).square().sum(dim=1)  # sum vectors -> square -> sum components
        sum_of_pow = vs.square().sum(dim=[1, 2])        # square -> sum vectors and components
        pairwise = 0.5 * (pow_of_sum - sum_of_pow)
        biases = self.biases
        linear = biases[w_nz].sum(dim=1)                # sum biases for each element of the mini-batch
        return pairwise + linear + self.bias
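As a quick standalone sanity check of identity (L), the sketch below compares the naive quadratic-time pairwise sum with the linear-time pow-of-sum computation on a toy example; it uses plain Python lists, so it does not depend on the class above:

```python
# Toy example: 4 latent vectors of dimension 2, and binary indicators w.
v = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [-2.0, 1.0]]
w = [1, 0, 1, 1]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

k = len(v[0])
nz = [i for i in range(len(v)) if w[i]]  # nz(w)

# naive quadratic-time evaluation of the pairwise sum
naive = sum(dot(v[i], v[j]) for a, i in enumerate(nz) for j in nz[a + 1:])

# linear-time evaluation: (||sum of v_i||^2 - sum of ||v_i||^2) / 2 over nz(w)
s = [sum(v[i][t] for i in nz) for t in range(k)]
trick = 0.5 * (dot(s, s) - sum(dot(v[i], v[i]) for i in nz))
# both equal -3.0
```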
If we are interested in solving a regression problem, i.e. predicting arbitrary real values, such as the score a person would give to a chocolate, we can use \(\sigma\) directly to make predictions. If we are in the binary classification setup, i.e. predicting the probability that a person likes the corresponding chocolate, we compose \(\sigma\) with a sigmoid, and predict \(p(w,x) = (1+e^{-\sigma(w,x)})^{-1}\).
In this post we are interested in the binary classification setup, with the binary cross-entropy loss. Namely, given a label \(y \in \{0,1\}\) the loss is:
\[-y \ln(p(w,x)) - (1 - y) \ln(1 - p(w,x)).\]For example, if we would like to predict which chocolate people like, we could train the model on a data-set where people who liked a certain chocolate have the label \(y = 1\), and people who tasted it but did not like it have the label \(y = 0\). Having trained the model, we can recommend chocolate to a person by choosing the one with the highest probability of being liked.
Using a simple transformation \(\hat{y} = 2y-1\) we can remap the labels to be in \(\{-1, 1\}\) instead. Then, it isn’t hard to verify that the binary cross-entropy loss above reduces to:
\[\ln(1+\exp(-\hat{y} \sigma(w,x))).\]Consequently, our aim will be training over the set \(\{ (w_i, \hat{y}_i) \}_{i=1}^n\) by minimizing the average loss
\[\frac{1}{n} \sum_{i=1}^n \underbrace{\ln(1+\exp(-\hat{y}_i \sigma(w_i, x)))}_{f_i(x)}.\]Instead of using regular SGD-based methods for training, which construct linear approximations of \(f_i\) and are able to use only the information provided by the gradient, we will avoid approximating and use the loss itself via the stochastic proximal point algorithm - at iteration \(t\) choose \(f \in \{f_1, \dots, f_n\}\) and compute:
\[x_{t+1} = \operatorname*{argmin}_x \left\{ f(x) + \frac{1}{2\eta} \|x - x_t\|_2^2 \right\}. \tag{P}\]Careful readers might notice that the formula above is total nonsense in general. Why? Well, each \(f\) is a non-convex function of \(x\). If \(f\) were convex, we would obtain a unique and well-defined minimizer \(x_{t+1}\). However, in general, the \(\operatorname{argmin}\) above is a set of minimizers, which might even be empty! In this post we will mitigate this issue by choosing step-sizes which make the function inside the \(\operatorname{argmin}\) strictly convex, so that \(x_{t+1}\) is unique and well-defined.
Having done the above, we’ll be able to construct an algorithm that trains classifying factorization machines by exploiting the exact loss function, instead of just relying on its slope as SGD does.
In previous posts we heavily relied on duality in general, and convex conjugates in particular, and this post is no exception. Recall that the convex conjugate of a function \(h\) is defined by:
\[h^*(z) = \sup_x \{ x^T z - h(x) \},\]and recall also that in a previous post we saw that \(h(t)=\ln(1+\exp(t))\) is convex, and its convex conjugate is:
\[h^*(z) = \begin{cases} z\ln(z) + (1 - z) \ln(1 - z), & 0 < z < 1, \\ 0, & z \in \{0, 1\}, \\ +\infty, & \text{otherwise}. \end{cases}\]An interesting result about conjugates is that under some technical conditions, which hold for \(h(t)\) above, we have \(h^{**} = h\), namely, the conjugate of \(h^*\) is \(h\). Moreover, in our case the \(\sup\) in the conjugate’s definition can be replaced with a \(\max\), since the supremum is always attained^{2}. Why is it useful? Since now we know that:
\[\ln(1+\exp(t))=\max_z \left\{ t z - z \ln(z) - (1-z) \ln(1-z) \right\}.\]Consequently, the term inside the \(\operatorname{argmin}\) of the proximal point step (P) can be written as:
\[\begin{aligned} f(x) &+ \frac{1}{2\eta} \|x - x_t\|_2^2 \\ &\equiv \ln(1+\exp(-\hat{y} \sigma(w,x))) + \frac{1}{2\eta} \|x - x_t\|_2^2 \\ &= \max_z \Bigl\{ \underbrace{ -z \hat{y} \sigma(w,x) + \frac{1}{2\eta} \|x - x_t\|_2^2 - z\ln(z) - (1-z)\ln(1-z) }_{\phi(x,z)} \Bigr\}. \end{aligned}\]Since we are interested in minimizing the above, we will be solving the saddle-point problem:
\[\min_x \max_z \phi(x,z). \tag{Q}\]Convex duality theory has another useful result in store - it provides conditions on saddle-point problems which ensure that we can switch the order of \(\min\) and \(\max\) to obtain an equivalent problem. Why is it interesting? Because switching the order produces
\[\max_z \underbrace{ \min_x \phi(x,z)}_{q(z)},\]and finding the optimal \(z\) means maximizing the one dimensional function \(q\), which may even be as simple as a high-school calculus exercise.
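Before relying on it, we can numerically sanity-check the biconjugation identity \(\ln(1+\exp(t))=\max_z \{ tz - z\ln(z) - (1-z)\ln(1-z) \}\) with a brute-force grid over \(z\); this snippet is purely illustrative and the function names are mine:

```python
import math

def neg_entropy(z):
    # z*ln(z) + (1-z)*ln(1-z), with the convention 0*ln(0) = 0
    def xlnx(u):
        return u * math.log(u) if u > 0 else 0.0
    return xlnx(z) + xlnx(1 - z)

def softplus_via_conjugate(t, steps=100000):
    # maximize t*z - z*ln(z) - (1-z)*ln(1-z) over a fine grid of z in (0, 1)
    return max(t * (i / steps) - neg_entropy(i / steps) for i in range(1, steps))

t = 1.3
approx = softplus_via_conjugate(t)
exact = math.log(1 + math.exp(t))  # ln(1 + e^t)
```

The grid maximum agrees with \(\ln(1+e^t)\) up to the grid resolution, as the biconjugation result promises.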
So here is the relevant duality theorem, a simplification of Sion’s minimax theorem from 1958 which suffices for this post:
Let \(\phi(x,z)\) be a continuous function which is convex in \(x\) and concave in \(z\). Suppose that the domain of \(\phi\) over \(z\) is compact, i.e. a closed and bounded set. Then,
\[\min_x \max_z \phi(x,z) = \max_z \min_x \phi(x,z)\]
In our case, it’s easy to see that \(\phi\) is indeed concave in \(z\) using the negativity of its second derivative, and its domain, the interval \([0,1]\), is indeed compact. What we require for the theorem’s conditions to hold is convexity in \(x\), which is what we explore next. Then, we’ll see that \(q\), despite not being so simple, can still be maximized quite efficiently. The theorem does not imply that a pair \((x, z)\) solving the max-min problem also solves the min-max problem, but in our case the max-min problem has a unique solution, and in that particular case it indeed also solves the min-max problem.
Consequently, having found \(z^*=\operatorname{argmax}_z q(z)\), we by construction obtain a formula for computing the optimal \(x\): \(x_{t+1} = \operatorname*{argmin}_x ~ \phi(x, z^*).\)
So let’s begin by ensuring that the conditions for Sion’s theorem hold. Ignoring the terms of \(\phi\) which do not depend on \(x\), we need to study the convexity of the following part as a function of \(x\):
\[(*) = -z \hat{y} \sigma(w,x) + \frac{1}{2\eta} \|x - x_t\|_2^2.\]To that end, we need to open the ‘black box’ and look inside \(\sigma\) again. That’s going to be a bit technical, but it gets us where we need to go. If you don’t wish to read all the details, you may skip to the conclusion below.
Recall the decomposition \(x = (b_0, b_1, \dots, b_m, v_1, \dots, v_m)\) and the definition
\[\sigma(w, b_0, \dots, b_m, v_1, \dots, v_m) = b_0 + \sum_{i\in1..m} w_i b_i + \sum_{(i,j)\in P[1..m]} (v_i^T v_j) w_i w_j.\]Consequently, we can re-write \((*)\) as:
\[\begin{aligned} (*) =& \color{blue}{-z \hat{y} \Bigl[ b_0 + \sum_{i\in1..m} w_i b_i \Bigr] + \frac{1}{2\eta} \|b - b_t\|_2^2} \\ & \color{brown}{- z \hat{y} \sum_{(i,j)\in P[1..m]} (v_i^T v_j) w_i w_j + \frac{1}{2\eta} \sum_{i\in1..m} \| v_i - v_{i,t} \|_2^2}. \end{aligned}\]The part colored in blue is always convex - it is the sum of a linear function and a convex-quadratic one. It remains to study the convexity of the brown part. Re-arranging the formula for \(\|v_i + v_j\|_2^2\), we obtain that:
\[v_i^T v_j = \frac{1}{2} \|v_i + v_j\|_2^2 - \frac{1}{2}\|v_i\|_2^2 - \frac{1}{2} \|v_j\|_2^2.\]Denoting \(\alpha_{ij} = -z \hat{y} w_i w_j\) we can re-write the brown part as: \(\begin{aligned} \color{brown}{\text{brown}} &= \sum_{(i,j)\in P[1..m]} |\alpha_{ij}| v_i^T ( \operatorname{sign}(\alpha_{ij}) v_j) + \frac{1}{2\eta} \sum_{i\in1..m} \| v_i - v_{i,t} \|_2^2 \\ &= \frac{1}{2}\sum_{(i,j)\in P[1..m]} |\alpha_{ij}| \left[ \|v_i + \operatorname{sign}(\alpha_{ij}) v_j\|_2^2 - \|v_i\|_2^2-\|v_j\|_2^2 \right] + \frac{1}{2\eta}\sum_{i\in1..m} \left[ \| v_i \|_2^2 \color{darkgray}{- 2 v_i^T v_{i,t} + \|v_{i,t}\|_2^2} \right] \end{aligned}\)
The grayed-out part on the right is linear in \(v_i\), so it’s convex. Since \(\alpha_{ij} = \alpha_{ji}\), to simplify notation we define \(\alpha_{ii}=0\), and the remaining non-grayed parts can be written as:
\[\frac{1}{2} \sum_{(i,j)\in P[1..m]} |\alpha_{ij}| \|v_i + \operatorname{sign}(\alpha_{ij}) v_j\|_2^2 + \sum_{i\in 1..m} \left(\frac{1}{2\eta} - \frac{1}{2}\sum_{j\in 1..m} |\alpha_{ij}|\right) \|v_i\|_2^2.\]Again, the first sum is a sum of convex-quadratic functions, and thus convex. For the second part to be convex it suffices, with some margin to spare, that for each \(i\) we have
\[\frac{1}{2\eta} \geq \sum_{j\in 1..m} |\alpha_{ij}|,\]or equivalently that the step-size \(\eta\) must satisfy
\[\eta \leq \frac{1}{2\sum_{j \in 1..m} |\alpha_{ij}|}\]Since \(\vert \alpha_{ij} \vert \leq 1\), we can easily deduce that for any step-size \(\eta \leq \frac{1}{2m}\), we obtain a convex \(\phi\). A better bound is obtained if we have a bound on the number of indicators in the vector \(w\) which may be non-zero at the same time. For example, if we have six categorical fields, we will have at most six non-zero elements in \(w\), and thus \(\eta \leq \frac{1}{12}\).
Convexity is nice if we want Sion’s theorem to hold, but if we want a unique minimizer \(x_{t+1}\) we need strict convexity, which is obtained by using a strict inequality - replace \(\leq\) with \(<\). In this post we will assume that we have at most \(d\) categorical features, and use step-sizes which satisfy
\[\eta \leq \frac{1}{2d+1} < \frac{1}{2d}.\]Suppose that Sion’s theorem holds, and that we can obtain a unique minimizer \(x_{t+1}\). How do we compute it? Well, Sion’s theorem lets us switch the order of \(\min\) and \(\max\), so we are aiming to solve:
\[\max_z \underbrace{ \min_x \phi(x,z)}_{q(z)},\]and explicitly writing \(\phi\) we have:
\[\begin{aligned} q(z) = \min_{b,v_i} \Bigl\{ &-z \hat{y} \Bigl[ b_0 + \sum_{i\in 1..m} w_i b_i \Bigr] + \frac{1}{2\eta} \|b - b_t\|_2^2 \\ &- z \hat{y} \sum_{(i,j) \in P[1..m]} (v_i^T v_j) w_i w_j + \frac{1}{2\eta} \sum_{i\in 1..m} \| v_i - v_{i,t} \|_2^2 \\ &- z \ln(z) - (1-z) \ln(1-z) \Bigr\} \end{aligned}\]From here on it becomes a bit technical, but the end result will be an algorithm that computes \(q(z)\) for any \(z\) by solving the minimization problem over \(x\). Afterwards, we’ll find a way to maximize \(q\) over \(z\).
Using separability^{3} we can separate the minimum above into a sum of three parts: the minimum over the biases \(b\), another minimum over the latent vectors \(v_1, \dots, v_m\), and the term \(-z \ln(z) - (1-z) \ln(1-z)\), namely:
\[\begin{aligned} q(z) &= \underbrace{\min_b \left\{ -z \hat{y} \left[ b_0 + \sum_{i\in 1..m} w_i b_i \right] + \frac{1}{2\eta} \|b - b_t\|_2^2 \right\}}_{q_1(z)} \\ &+ \underbrace{\min_{v_1, \dots, v_m} \left\{ - z \hat{y} \sum_{(i,j) \in P[1..m]} (v_i^T v_j) w_i w_j + \frac{1}{2\eta} \sum_{i\in 1..m} \| v_i - v_{i,t} \|_2^2 \right\}}_{q_2(z)} \\ &-z \ln(z) - (1-z) \ln(1-z) \end{aligned}\]We’ll analyze \(q_1\), and \(q_2\) shortly, but let’s take a short break and implement a skeleton of our training algorithm. A deeper analysis of \(q_1\), \(q_2\), and \(q\) will let us fill the skeleton. On construction, it receives a factorization machine object of the class we implemented above, and the step size. Then, each training step’s input is the set \(\operatorname{nz}(w)\) of the non-zero feature indicators, and the label \(\hat{y}\):
class ProxPtFMTrainer:
    def __init__(self, fm, step_size):
        # training parameters
        self.b0 = fm.bias
        self.bs = fm.biases
        self.vs = fm.vs
        self.step_size = step_size

    def step(self, w_nz, y_hat):
        pass  # we'll replace this with actual code to train the model.
Defining \(\hat{w}=(1, w_1, \dots, w_m)^T\) and \(\hat{b}=(b_0, b_1, \dots, b_m)\), we obtain:
\[q_1(z) =\min_{\hat{b}} \left\{ -z \hat{y} \hat{w}^T \hat{b} + \frac{1}{2\eta} \|\hat{b} - \hat{b}_t\|_2^2 \right\}.\]The term inside the minimum is a simple convex quadratic, which is minimized by equating its gradient with zero: \(\hat{b}^* = \hat{b}_t + \eta z \hat{y} \hat{w}. \tag{A}\)
Consequently:
\[\begin{aligned} q_1(z) &= -z \hat{y} \hat{w}^T (\hat{b}_t + \eta z \hat{y} \hat{w}) + \frac{1}{2\eta} \| \eta z \hat{y} \hat{w} \|_2^2 \\ &= -\hat{y} (\hat{w}^T \hat{b}_t) z - \eta \hat{y}^2 \|\hat{w}\|_2^2 z^2 + \frac{\eta \hat{y}^2 \|\hat{w}\|_2^2}{2} z^2 \\ &= -\hat{y} (\hat{w}^T \hat{b}_t) z - \frac{\eta \hat{y}^2 \|\hat{w}\|_2^2}{2} z^2 \end{aligned}\]Since \(\hat{y} =\pm 1\) we have that \(\hat{y}^2 = 1\). Moreover, since \(w_i\) are indicators, the term \(\|\hat{w}\|_2^2\) is the number of non-zero entries of \(w\) plus one. So, to summarize, the above can be written as
\[q_1(z) = -\frac{\eta (1 + |\operatorname{nz}(w)|)}{2}z^2 -\hat{y} (w^T b_t + b_{0,t}) z.\]What a surprise - \(q_1\) is just a concave parabola!
So, to summarize, what we have here is an explicit expression for \(q_1\), and the formula (A) to update the biases once we have obtained the optimal \(z\).
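Both formulas are easy to verify numerically. The sketch below, with variable names of my own choosing, draws a random \(\hat{b}_t\) and binary \(\hat{w}\), and checks that the objective evaluated at the minimizer (A) matches the closed-form \(q_1(z)\):

```python
import random

random.seed(0)

eta, y_hat, z = 0.05, 1.0, 0.3
m_hat = 5  # length of the extended vector w_hat = (1, w_1, ..., w_{m_hat-1})
w_hat = [1.0] + [float(random.randint(0, 1)) for _ in range(m_hat - 1)]
b_t = [random.gauss(0, 1) for _ in range(m_hat)]

def objective(b):
    """The function inside the min defining q_1."""
    lin = -z * y_hat * sum(wi * bi for wi, bi in zip(w_hat, b))
    prox = sum((bi - bti) ** 2 for bi, bti in zip(b, b_t)) / (2 * eta)
    return lin + prox

# closed-form minimizer (A): b* = b_t + eta * z * y_hat * w_hat
b_star = [bti + eta * z * y_hat * wi for bti, wi in zip(b_t, w_hat)]

# closed-form q_1(z); sum(w_hat) equals 1 + |nz(w)|, since w_hat is binary
q1 = -0.5 * eta * sum(w_hat) * z ** 2 \
     - y_hat * sum(wi * bi for wi, bi in zip(w_hat, b_t)) * z
```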
Let’s implement the code for the two steps above. We’ll see below that the function \(q_1\) will have to be evaluated several times in order to find the optimal \(z\), so it’s beneficial to cache various expensive-to-compute quantities, making every evaluation quick and efficient. Consequently, the step function will store these parts in the class’s members.
# inside ProxPtFMTrainer
def step(self, w_nz, y_hat):
    self.nnz = w_nz.numel()                     # |nz(w)|
    self.bias_sum = self.bs[w_nz].sum().item()  # w^T b_t
    # TODO - this function will grow as we proceed

def q_one(self, y_hat, z):
    return -0.5 * self.step_size * (1 + self.nnz) * (z ** 2) \
           - y_hat * (self.bias_sum + self.b0.item()) * z

def update_biases(self, w_nz, y_hat, z):
    self.bs[w_nz] = self.bs[w_nz] + self.step_size * z * y_hat
    self.b0.add_(self.step_size * z * y_hat)
You might be asking yourself why we stored the bias sum in a member of self. It will become apparent shortly: we’ll be calling the function q_one repeatedly, and we would like to avoid re-computing time-consuming quantities we can compute only once.
We are aiming to compute
\[q_2(z) = \min_{v_1, \dots, v_m} \left\{ Q(v_1, \dots, v_m, z) \equiv - z \hat{y} \sum_{(i,j)\in P[1..m]} (v_i^T v_j) w_i w_j + \frac{1}{2\eta} \sum_{i\in 1..m} \| v_i - v_{i,t} \|_2^2 \right\}.\]Of course, we assume that we indeed chose \(\eta\) such that \(Q\) inside the \(\min\) operator is strictly convex in \(v_1, \dots, v_m\), so that there is a unique minimizer.
Since \(w\) is a vector of indicators, we can write the function \(Q\) by separating out the part which corresponds to non-zero indicators in \(w\):
\[Q(v_1, \dots, v_m,z) = \underbrace{-z \hat{y} \sum_{(i,j)\in P[\operatorname{nz}(w)]} v_{i}^T v_{j}+\frac{1}{2\eta} \sum_{i\in \operatorname{nz}(w)} \|v_i - v_{i,t} \|_2^2}_{\hat{Q}} + \underbrace{\frac{1}{2\eta}\sum_{i \notin \operatorname{nz}(w)} \|v_i-v_{i,t}\|_2^2}_{R}.\]Looking at \(R\), clearly the minimizer must satisfy \(v_i^* = v_{i,t}\) for all \(i \notin \operatorname{nz}(w)\), and consequently \(R\) must be zero at optimum, independent of \(z\). Hence, we have:
\[q_2(z)=\min_{v_{\operatorname{nz}(w)}} \hat{Q}(v_{\operatorname{nz}(w)}, z),\]where \(v_{\operatorname{nz}(w)}\) is the set of the vectors \(v_i\) for \(i \in \operatorname{nz}(w)\). Since \(\hat{Q}\) is a quadratic function which we made sure is strictly convex, we can find our optimal \(v_{\operatorname{nz}(w)}^*\) by solving the linear system obtained by equating the gradient of \(\hat{Q}\) with zero.
So let’s see what the gradient looks like. We have a function of several vector variables \(v_{\operatorname{nz}(w)}\), and we imagine that they are all stacked into one big vector. Consequently, the gradient of \(\hat{Q}\) is a stacked vector comprising the gradients w.r.t. each of the vectors. So, let’s compute the gradient w.r.t. each \(v_i\) and equate it with zero:
\[\nabla_{v_i} \hat{Q} = -z \hat{y} \sum_{\substack{j \in \operatorname{nz}(w)\\j\neq i}} v_{j}+\frac{1}{\eta} (v_{i} - v_{i,t})=0.\]By re-arranging and putting constants on the RHS we can re-write the above as
\[-\eta z \hat{y} \sum_{\substack{j \in \operatorname{nz}(w)\\j\neq i}} v_{j} + v_{i} = v_{i,t}.\]The above system means that we are actually solving linear systems with the same coefficients for each coordinate of the embedding vectors. Equivalently written, we can stack the vectors \(v_{\operatorname{nz}(w)}\) into the rows of the matrix \(V\), and the vectors \(v_{\operatorname{nz}(w),t}\) into the rows of the matrix \(V_t\), and solve the linear system
\[\underbrace{\begin{pmatrix} 1 & -\eta z \hat{y} & \cdots & -\eta z \hat{y} \\ -\eta z \hat{y} & 1 & \cdots & -\eta z \hat{y} \\ \vdots & \vdots & \ddots & \vdots \\ -\eta z \hat{y} & -\eta z \hat{y} & \cdots & 1 \end{pmatrix}}_{S(z)} V = V_t\]Note that the matrix \(S(z)\) is small, since its dimensions only depend on the number of non-zero elements in \(w\). So now we have an efficient algorithm for computing \(q_2(z)\) given the sample \(w\) and the latent vectors from the previous iterate \(v_{1,t}, \dots, v_{m,t}\):
Algorithm B
- Embed the latent vectors \(\{v_{i,t}\}_{i \in \operatorname{nz}(w)}\) into the rows of the matrix \(V_t\).
- Obtain a solution \(V^*\) of the linear system of equations \(S(z) V = V_t\), and use the rows of \(V^*\) as the vectors \(\{v_{i}^*\}_{i \in \operatorname{nz}(w)}\).
- Output: \(q_2(z)=-z \hat{y} \sum_{(i,j) \in P[\operatorname{nz}(w)]} ({v_{i}^*}^T v_{j}^*)+\frac{1}{2\eta} \sum_{i\in \operatorname{nz}(w)} \|v_{i}^* - v_{i,t} \|_2^2\)
However, let’s see how we can avoid invoking a linear-system solver altogether, since it turns out we can directly and efficiently compute \(S(z)^{-1}\)! The matrix \(S(z)\) can be written as:
\[S(z) = (1 + \eta z \hat{y}) I - \eta z \hat{y}(\mathbf{e} ~ \mathbf{e}^T)\]where \(\mathbf{e} \in \mathbb{R}^{\vert\operatorname{nz}(w)\vert}\) is a column vector whose components are all \(1\). Now, we’ll employ the Sherman-Morrison matrix inversion identity:
\[(A+u v^T)^{-1} = A^{-1} - \frac{A^{-1} u v^T A^{-1}}{1 + v^T A^{-1} u}.\]In our case, we’ll be taking \(A = (1 + \eta \hat{y} z) I\), \(u=-\eta \hat{y} z \mathbf{e}\), and \(v = \mathbf{e}\), and consequently we have:
\[S(z)^{-1} = \frac{1}{1 + \eta \hat{y} z} I + \frac{\eta \hat{y} z}{(1 + \eta \hat{y} z)^2 - \eta \hat{y} z(1 + \eta \hat{y} z) \mathbf{e}^T \mathbf{e}} \mathbf{e}~\mathbf{e}^T\]Now, note that \(\mathbf{e}~\mathbf{e}^T = \unicode{x1D7D9}\) is a matrix whose components are all \(1\), and that \(\mathbf{e}^T \mathbf{e} = \vert\operatorname{nz}(w)\vert\) by construction. Thus:
\[\begin{aligned} S(z)^{-1} &= \frac{1}{1 + \eta \hat{y} z} I + \frac{\eta \hat{y} z}{(1 + \eta \hat{y} z)^2 - \eta \hat{y} z(1 + \eta \hat{y} z) |\operatorname{nz}(w)|} \unicode{x1D7D9} \\ &= I - \frac{\eta \hat{y} z}{1+\eta \hat{y} z} I + \frac{\eta \hat{y} z}{(1 + \eta \hat{y} z)^2 - \eta \hat{y} z(1 + \eta \hat{y} z) |\operatorname{nz}(w)|} \unicode{x1D7D9} \\ &= I - \frac{\eta \hat{y} z}{1+\eta \hat{y} z} \left[ I - \frac{1}{1+\eta \hat{y} z (1- |\operatorname{nz}(w)| )} \unicode{x1D7D9} \right] \end{aligned}\]So the solution of the linear system \(S(z)V = V_t\) is:
\[V^*=S(z)^{-1} V_t = V_t - \frac{\eta \hat{y} z}{1+\eta \hat{y} z} \underbrace{ \left[ V_t - \frac{1}{1+\eta \hat{y} z (1- |\operatorname{nz}(w)| )} \unicode{x1D7D9} V_t \right]}_{(*)} \tag{C}\]Finally, we note that the matrix \(\unicode{x1D7D9} V_t\) is the matrix obtained by computing the sum of the rows of \(V_t\) and replicating the result \(\vert \operatorname{nz}(w)\vert\) times, so we don’t even need to invoke any matrix multiplication function at all!
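Formula (C) is easy to test against a generic solver. The following sketch, using NumPy for brevity (the trainer itself works with PyTorch tensors), builds \(S(z)\) explicitly and checks that the \(V^*\) produced by (C) satisfies \(S(z)V^* = V_t\):

```python
import numpy as np

rng = np.random.default_rng(0)
eta, y_hat, z = 0.04, -1.0, 0.6
n, k = 5, 3  # n = |nz(w)|, k = embedding dimension

V_t = rng.normal(size=(n, k))
beta = eta * y_hat * z  # the off-diagonal entries of S(z) are -beta

# S(z): ones on the diagonal, -eta*z*y_hat everywhere else
S = (1 + beta) * np.eye(n) - beta * np.ones((n, n))

# formula (C): V* = V_t - beta/(1+beta) * [V_t - (1 V_t) / (1 + beta*(1 - n))]
row_sums = V_t.sum(axis=0, keepdims=True)  # each row of the matrix "1 V_t"
V_star = V_t - (beta / (1 + beta)) * (V_t - row_sums / (1 + beta * (1 - n)))
```

If the check passes, the rank-one inversion formula and a general solve agree, which is exactly what lets us skip the solver in the trainer.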
So, to summarize, we have Algorithm B above to compute \(q_2(z)\), where the solution of the linear system is obtained via formula (C) above. Moreover, formula (C) is used to update the latent vectors once the optimal \(z\) is found. Let’s implement the above:
# inside ProxPtFMTrainer
def step(self, w_nz, y_hat):
    self.nnz = w_nz.numel()                                       # |nz(w)|
    self.bias_sum = self.bs[w_nz].sum().item()                    # w^T b_t
    self.vs_nz = self.vs.weight[w_nz, :]                          # the matrix V_t
    self.ones_times_vs_nnz = self.vs_nz.sum(dim=0, keepdim=True)  # the sums of the rows of V_t
    # TODO - this function will grow as we proceed

def q_two(self, y_hat, z):
    if z == 0:
        return 0
    # solve the linear system - find the optimal vectors
    vs_opt = self.solve_s_inv_system(y_hat, z)
    # compute q_2
    pairwise = (vs_opt.sum(dim=0).square().sum() - vs_opt.square().sum()) / 2  # the pow-of-sum minus sum-of-pow trick
    diff_squared = (vs_opt - self.vs_nz).square().sum()
    return (-z * y_hat * pairwise + diff_squared / (2 * self.step_size)).item()

def update_vectors(self, w_nz, y_hat, z):  # use equation (C) to update the latent vectors
    if z == 0:
        return
    # note: indexing a tensor with an index tensor returns a copy, so we
    # assign back with indexed assignment instead of calling sub_() on the copy
    self.vs.weight[w_nz, :] -= self.vectors_update_dir(y_hat, z)

def solve_s_inv_system(self, y_hat, z):
    return self.vs_nz - self.vectors_update_dir(y_hat, z)

def vectors_update_dir(self, y_hat, z):  # the subtracted term of (C): (eta*y*z)/(1+eta*y*z) times (*)
    beta = self.step_size * y_hat * z
    alpha = beta / (1 + beta)
    return alpha * (self.vs_nz - self.ones_times_vs_nnz / (1 + beta * (1 - self.nnz)))
We need one last ingredient - a way to maximize \(q\) and compute the optimal \(z\).
Recall that
\[q(z) = q_1(z) + q_2(z) - z\ln(z) - (1-z)\ln(1-z).\]Now, consider two important properties of \(q\): it is strictly concave, since \(q_1\) is a concave parabola, \(q_2\) is concave as a pointwise minimum of functions affine in \(z\), and the term \(-z\ln(z)-(1-z)\ln(1-z)\) is strictly concave; and its domain is the interval \([0,1]\).
So, if it has a maximizer, it must be unique, and must lie in the interval \([0,1]\). So, does it have a maximizer? Well, it does! Any concave function is continuous, and by the well-known Weierstrass theorem, any continuous function on a compact interval has a maximizer. What we have is a continuous function with a unique maximizer in a bounded interval, and that’s the classical setup for a well-known algorithm for one-dimensional maximization - the Golden Section Search method. For completeness, I copied the code from the Wikipedia page:
"""Python program for golden section search. This implementation
reuses function evaluations, saving 1/2 of the evaluations per
iteration, and returns a bounding interval.
Source: https://en.wikipedia.org/wiki/Golden-section_search#Iterative_algorithm
"""
import math
invphi = (math.sqrt(5) - 1) / 2 # 1 / phi
invphi2 = (3 - math.sqrt(5)) / 2 # 1 / phi^2
def gss(f, a, b, tol=1e-8):
"""Golden-section search.
Given a function f with a single local minimum in
the interval [a,b], gss returns a subset interval
[c,d] that contains the minimum with d-c <= tol.
Example:
>>> f = lambda x: (x-2)**2
>>> a = 1
>>> b = 5
>>> tol = 1e-5
>>> (c,d) = gss(f, a, b, tol)
>>> print(c, d)
1.9999959837979107 2.0000050911830893
"""
(a, b) = (min(a, b), max(a, b))
h = b - a
if h <= tol:
return (a, b)
# Required steps to achieve tolerance
n = int(math.ceil(math.log(tol / h) / math.log(invphi)))
c = a + invphi2 * h
d = a + invphi * h
yc = f(c)
yd = f(d)
for k in range(n-1):
if yc < yd:
b = d
d = c
yd = yc
h = invphi * h
c = a + invphi2 * h
yc = f(c)
else:
a = c
c = d
yc = yd
h = invphi * h
d = a + invphi * h
yd = f(d)
if yc < yd:
return (a, d)
else:
return (c, b)
Having all the ingredients, we can finalize the implementation of the optimizer’s step method:
def neg_entr(z):
    if z > 0:
        return z * math.log(z)
    else:
        return 0

def loss_conjugate(z):
    return neg_entr(z) + neg_entr(1 - z)

class ProxPtFMTrainer:
    def step(self, w_nz, y_hat):
        self.nnz = w_nz.numel()
        self.bias_sum = self.bs[w_nz].sum().item()
        self.vs_nz = self.vs.weight[w_nz, :]
        self.ones_times_vs_nnz = self.vs_nz.sum(dim=0, keepdim=True)

        def q_neg(z):  # neg. of the maximization objective - since min_gss minimizes functions
            return -(self.q_one(y_hat, z) + self.q_two(y_hat, z) - loss_conjugate(z))

        opt_interval = min_gss(q_neg, 0, 1)
        z_opt = sum(opt_interval) / 2

        self.update_biases(w_nz, y_hat, z_opt)
        self.update_vectors(w_nz, y_hat, z_opt)
Since the purpose of this post is “academic” in nature, i.e. to show the limits of what is possible with proximal point rather than to write a production-ready training algorithm, we did not take the time to make it efficient, and thus we’ll test it on a toy dataset - MovieLens 100k. The dataset consists of the ratings on a 1 to 5 scale that users gave to 1682 movies. For users, we use their integer age, gender, and occupation as features. For the movies, we use the genre and the movie id as features. A rating \(\geq 5\) is considered positive, while ratings below 5 are considered negative.
For clarity, in the post itself we’ll skip the data loading code, and assume that the features are already given in the W_train tensor, whose rows are the vectors \(w_i\), and that the corresponding labels are given in the y_train tensor. The full code is available in the simple_train_loop.py file in the repo. Let’s train our model for ten epochs using the maximal allowed step-size, with a factorization machine of embedding dimension \(k=20\):
from tqdm import tqdm
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

# MISSING - the code which loads the dataset and builds the tensors W_train and y_train

num_features = W_train.size(1)
max_nnz = W_train.sum(dim=1).max().item()
step_size = 1. / (2 * max_nnz + 1)
print(f'Training with step_size={step_size:.4} computed using max_nnz = {max_nnz}')

embedding_dim = 20
fm = FM(num_features, embedding_dim)
dataset = TensorDataset(W_train, y_train)
trainer = ProxPtFMTrainer(fm, step_size)

for epoch in range(10):
    sum_epoch_loss = 0.
    sum_pred = 0.
    sum_label = 0.
    desc = f'Epoch = {epoch}, loss = 0, pred = 0, label = 0, bias = 0'
    with tqdm(DataLoader(dataset, batch_size=1, shuffle=True), desc=desc) as pbar:
        def report_progress(idx):
            avg_epoch_loss = sum_epoch_loss / (idx + 1)
            avg_pred = sum_pred / (idx + 1)
            avg_label = sum_label / (idx + 1)
            desc = f'Epoch = {epoch:}, loss = {avg_epoch_loss:.4}, pred = {avg_pred:.4}, ' \
                   f'label = {avg_label:.4}, bias = {fm.bias.item():.4}'
            pbar.set_description(desc)

        for i, (x_sample, y_sample) in enumerate(pbar):
            (ignore, w_nz) = torch.nonzero(x_sample, as_tuple=True)
            y = y_sample.squeeze(1)
            with torch.no_grad():
                # aggregate loss and prediction per epoch, so that we can monitor convergence
                pred = fm.forward(w_nz.unsqueeze(0))
                loss = F.binary_cross_entropy_with_logits(pred, y)
                sum_epoch_loss += loss.item()
                sum_pred += torch.sigmoid(pred).item()
                sum_label += y.item()

                # train the model
                y_hat = (2 * y.item() - 1)  # transform 0/1 labels into -1/1
                trainer.step(w_nz, y_hat)
            if (i > 0) and (i % 2000 == 0):
                report_progress(i)
        report_progress(i)
That’s what I got:
Training with step_size=0.04348 computed using max_nnz = 11.0
Epoch = 0, loss = 0.4695, pred = 0.2118, label = 0.2124, bias = -1.148: 100%|██████████| 99831/99831 [11:36<00:00, 143.37it/s]
Epoch = 1, loss = 0.4362, pred = 0.2114, label = 0.2121, bias = -1.468: 100%|██████████| 99831/99831 [11:34<00:00, 143.80it/s]
Epoch = 2, loss = 0.427, pred = 0.2115, label = 0.2122, bias = -1.294: 100%|██████████| 99831/99831 [11:20<00:00, 146.62it/s]
Epoch = 3, loss = 0.4224, pred = 0.2117, label = 0.2123, bias = -1.254: 100%|██████████| 99831/99831 [10:30<00:00, 158.33it/s]
Epoch = 4, loss = 0.4194, pred = 0.2114, label = 0.212, bias = -1.419: 100%|██████████| 99831/99831 [10:00<00:00, 166.12it/s]
Epoch = 5, loss = 0.4173, pred = 0.2112, label = 0.2117, bias = -1.301: 100%|██████████| 99831/99831 [09:48<00:00, 169.73it/s]
Epoch = 6, loss = 0.4167, pred = 0.2117, label = 0.2121, bias = -1.368: 100%|██████████| 99831/99831 [09:49<00:00, 169.40it/s]
Epoch = 7, loss = 0.4155, pred = 0.2115, label = 0.2119, bias = -1.467: 100%|██████████| 99831/99831 [09:51<00:00, 168.81it/s]
Epoch = 8, loss = 0.4145, pred = 0.2114, label = 0.2118, bias = -1.605: 100%|██████████| 99831/99831 [09:47<00:00, 169.81it/s]
Epoch = 9, loss = 0.4146, pred = 0.2121, label = 0.2125, bias = -1.365: 100%|██████████| 99831/99831 [09:47<00:00, 169.85it/s]
Seems that the loss is indeed being minimized. Let’s compare it with the Adam optimizer with default parameters. Here is the training loop:
optimizer = torch.optim.Adam(fm.parameters())

for epoch in range(10):
    sum_epoch_loss = 0.
    sum_pred = 0.
    sum_label = 0.
    desc = f'Epoch = {epoch}, loss = 0, pred = 0, label = 0, bias = 0'
    with tqdm(DataLoader(dataset, batch_size=1, shuffle=True), desc=desc) as pbar:
        def update_progress(idx):
            avg_epoch_loss = sum_epoch_loss / (idx + 1)
            avg_pred = sum_pred / (idx + 1)
            avg_label = sum_label / (idx + 1)
            desc = f'Epoch = {epoch}, loss = {avg_epoch_loss:.4}, pred = {avg_pred:.4}, ' \
                   f'label = {avg_label:.4}, bias = {fm.bias.item():.4}'
            pbar.set_description(desc)

        for i, (x_sample, y_sample) in enumerate(pbar):
            (ignore, w_nz) = torch.nonzero(x_sample, as_tuple=True)
            y = y_sample.squeeze(1)

            optimizer.zero_grad()
            pred = fm.forward(w_nz.unsqueeze(0))
            loss = F.binary_cross_entropy_with_logits(pred, y)
            loss.backward()
            optimizer.step()

            with torch.no_grad():
                sum_epoch_loss += loss.item()
                sum_pred += torch.sigmoid(pred).item()
                sum_label += y.item()
            if (i > 0) and (i % 2000 == 0):
                update_progress(i)
        update_progress(i)
And here is the result:
Epoch = 0, loss = 0.4655, pred = 0.21, label = 0.212, bias = 0.539: 100%|██████████| 99831/99831 [02:47<00:00, 596.25it/s]
Epoch = 1, loss = 0.4596, pred = 0.208, label = 0.212, bias = 1.586: 100%|██████████| 99831/99831 [03:09<00:00, 527.90it/s]
Epoch = 2, loss = 0.4655, pred = 0.2075, label = 0.2118, bias = 2.668: 100%|██████████| 99831/99831 [02:59<00:00, 556.33it/s]
Epoch = 3, loss = 0.471, pred = 0.2078, label = 0.2122, bias = 3.805: 100%|██████████| 99831/99831 [02:50<00:00, 585.09it/s]
Epoch = 4, loss = 0.4744, pred = 0.2071, label = 0.2119, bias = 5.116: 100%|██████████| 99831/99831 [02:42<00:00, 615.88it/s]
Epoch = 5, loss = 0.4747, pred = 0.2071, label = 0.212, bias = 6.48: 100%|██████████| 99831/99831 [02:55<00:00, 569.75it/s]
Epoch = 6, loss = 0.4777, pred = 0.2064, label = 0.2119, bias = 7.992: 100%|██████████| 99831/99831 [02:56<00:00, 567.10it/s]
Epoch = 7, loss = 0.4793, pred = 0.2071, label = 0.2121, bias = 9.433: 100%|██████████| 99831/99831 [02:47<00:00, 595.92it/s]
Epoch = 8, loss = 0.4802, pred = 0.2062, label = 0.212, bias = 11.15: 100%|██████████| 99831/99831 [02:43<00:00, 610.91it/s]
Epoch = 9, loss = 0.4824, pred = 0.2066, label = 0.212, bias = 12.72: 100%|██████████| 99831/99831 [02:44<00:00, 605.32it/s]
Whoa! It isn’t converging! The loss grows after a few epochs, and the bias keeps increasing. It seems our efforts are paying off - our custom method, backed by a deeper step-size analysis, ‘hit’ a good-enough step-size without any tuning, while with Adam we would probably have to do some tuning to find a good step-size.
Let’s now do a more thorough stability comparison - run our method, Adam, Adagrad, and SGD with various step-size parameters, and see what loss we get. Each method ran with several step sizes for \(M=20\) epochs, and each step-size was tested \(N=20\) times to take into account the effect of randomness in the weight initialization and the data shuffling. Then, I produced a plot showing the best loss achieved for each step-size and each algorithm, averaged over the \(N=20\) attempts, with transparent uncertainty bands. The code resides in stability_experiment.py
in the repo. Here is the result:
It’s quite apparent that the performance of the proximal point algorithm is quite consistent over the various step-size choices. We also see that Adam’s performance degrades when the step-size is too large. Consequently, to see the difference between the various algorithms more clearly, let’s plot the results without Adam:
Well, as we see, the proximal point method’s performance is the most consistent across various step-sizes, but it is certainly not the best algorithm for training a factorization machine on this dataset. It appears that Adagrad is.
One possible explanation is that the proximal point algorithm converges more slowly, and requires more epochs to achieve good performance. Let’s test this hypothesis, and run the proximal point algorithm for 50 epochs. And after a few days, I got:
The situation doesn’t seem to improve much. The method is quite consistent in its performance, but it doesn’t seem to converge rapidly to an optimum.
We have developed an efficiently implementable proximal point step for a highly non-trivial and non-convex problem, and provided an implementation. To the best of my knowledge, this post sets foot in uncharted territory, so I am not sure what the method is converging to, but from these numerical experiments it doesn’t seem to minimize the average loss. It is my hope that the research community can provide such answers.
Writing this entire series about efficient implementation of incremental proximal point methods has been extremely fun. I certainly learned a lot about Python and PyTorch, and better understood the essence of these methods. I hope that you, the readers, enjoyed it as much as I did. It’s time for new adventures! I don’t know what the next post will be about, but I’m sure it will be fun!
Steffen Rendle (2010), Factorization Machines, IEEE International Conference on Data Mining (pp. 995-1000) ↩
It follows from the fact that \(h(t)=\ln(1+\exp(t))\) and \(h^*(s)=s\ln(s)+(1-s)\ln(1-s)\) are both Legendre-type functions: essentially smooth and strictly convex. ↩
Separability is the fact that \(\displaystyle \min_{x_1,x_2} f(x_1) + g(x_2) = \min_{x_1} f(x_1) + \min_{x_2} g(x_2)\). ↩
We continue our endeavor of extending the reach of the efficiently implementable stochastic proximal point method in the mini-batch setting:
\[x_{t+1} = \operatorname*{argmin}_x \Biggl \{ \frac{1}{|B|} \sum_{i \in B} f_i(x) + \frac{1}{2\eta} \|x - x_t\|_2^2 \Biggr \}.\]Last time we discussed the implementation for convex-on-linear losses, which include linear least squares and linear logistic regression. Continuing the same journey we already went through before the mini-batch setting, this time we add regularization, and consider losses of the form:
\[f_i(x)=\phi(a_i^T x + b_i) + r(x),\]where the regularizer \(r\) is the same for all training samples, and \(\phi\) is a scalar convex function. In that case, the method becomes:
\[x_{t+1} = \operatorname*{argmin}_x \Biggl \{ \frac{1}{|B|} \sum_{i \in B} \phi(a_i^T x + b_i) + r(x) + \frac{1}{2\eta} \|x - x_t\|_2^2 \Biggr \}.\]Our aim in this post is to derive an efficient implementation, in Python, of the above computational step.
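Before diving into duality, it is useful to have a ground-truth reference for a case we can solve directly. The sketch below (my own toy instance, with hypothetical sizes and random data) handles the special case \(\phi(t)=\frac{1}{2}t^2\) and \(r(x)=\frac{\lambda}{2}\|x\|_2^2\), where equating the gradient of the proximal objective with zero yields a linear system:

```python
import numpy as np

# Toy instance: squared loss phi(t) = 0.5 t^2 and ridge regularizer
# r(x) = 0.5 * lam * ||x||^2, so the prox step has a closed form.
rng = np.random.default_rng(0)
d, m, eta, lam = 5, 3, 0.5, 0.1
A = rng.standard_normal((m, d))   # rows are a_i^T
b = rng.standard_normal(m)
x_t = rng.standard_normal(d)

# Setting the gradient of the prox objective to zero gives the linear system
# (A^T A / m + lam*I + I/eta) x = -A^T b / m + x_t / eta
lhs = A.T @ A / m + lam * np.eye(d) + np.eye(d) / eta
rhs = -A.T @ b / m + x_t / eta
x_next = np.linalg.solve(lhs, rhs)

# check: the gradient of the prox objective vanishes at x_next
grad = A.T @ (A @ x_next + b) / m + lam * x_next + (x_next - x_t) / eta
assert np.allclose(grad, 0)
```

For general \(\phi\) and \(r\) no such closed form exists, which is exactly why the rest of the post turns to duality.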
We again employ duality in an attempt to make the problem of computing \(x_{t+1}\) tractable. We replace that problem with its equivalent constrained variant:
\[\operatorname*{minimize}_{x,z} \quad \frac{1}{|B|} \sum_{i\in B} \phi(z_i) + r(x) + \frac{1}{2\eta}\|x - x_t\|_2^2 \quad \operatorname{\text{subject to}} \quad z_i = a_i^T x + b_i.\]We will not get into the tedious details, but after embedding the vectors \(a_i\) into the rows of the batch matrix \(A_B\) and some mathematical manipulations, the dual function \(q(s)\) turns out to be:
\[q(s)=\color{magenta}{-\frac{1}{|B|}\sum_{i \in B} \phi^*(|B| s_i) + \underbrace{\min_x \left\{ r(x) + \frac{1}{2\eta} \|x - (x_t - \eta A_B^T s)\|_2^2\right\}}_{\text{Moreau envelope}}} + (A_B x_t + b_B)^T s - \frac{\eta}{2} \|A_B^Ts\|_2^2.\]And here we arrive at our problem, which appears in the magenta-colored part. The function \(-\phi^*\) is always concave, while the Moreau envelope is always convex. The sum of such functions may, in general, be neither convex nor concave. So, although we know from duality theory that the function \(q(s)\) is concave, by separating it into these two components we cannot convey its concavity to a generic convex optimization solver such as CVX - these solvers require that we write functions in a way that obeys an explicit set of rules conveying the function’s curvature as either convex or concave. One such rule: we can add convex functions together, and concave functions together, but we cannot mix both, for reasons which go beyond the scope of this blog post.
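To see why mixed curvature is a real obstacle, here is a tiny illustration of my own (not from the problem above): the sum of a convex and a concave function can be neither convex nor concave, which is detectable numerically through the sign of the discrete second differences.

```python
import numpy as np

# f(x) = exp(x) is convex and g(x) = -3x^2 is concave; their sum
# h(x) = exp(x) - 3x^2 has h''(x) = exp(x) - 6, which changes sign,
# so h is neither convex nor concave.
x = np.linspace(-3.0, 3.0, 601)
h = np.exp(x) - 3 * x**2
second_diff = h[:-2] - 2 * h[1:-1] + h[2:]   # discrete curvature
assert second_diff.min() < 0 < second_diff.max()
```

A DCP-style solver would refuse such an expression precisely because it cannot certify a single curvature for it.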
The conclusion is that the same duality trick which served us well before, in transforming a high-dimensional problem over \(x\) into a low-dimensional problem over the dual variables \(s\), cannot serve us now. Can we do something else? It turns out we can, but first let’s explore another extension of duality - inequality constraints.
We aim to compute
\[x_{t+1} = \operatorname*{argmin}_x \Biggl \{ \frac{1}{|B|} \sum_{i \in B} \phi(a_i^T x + b_i) + r(x) + \frac{1}{2\eta} \|x - x_t\|_2^2 \Biggr \}.\]Let’s rephrase the optimization problem in a slightly different manner, by including two auxiliary variables - the vectors \(z\) and \(w\), and by embedding the vectors \(a_i\) into the rows of the batch matrix \(A_B\):
\[\operatorname*{minimize}_{x,z,w} \quad \frac{1}{|B|} \sum_{i\in B} \phi(z_i) + r(w) + \frac{1}{2\eta} \|x - x_t\|_2^2 \quad \operatorname{\text{subject to}} \quad z = A_B x + b, \ \ x = w\]Let’s construct a dual by assigning prices \(\mu\) and \(\nu\) to the violation of each set of constraints, and separating the minimization over each variable:
\[\begin{aligned} q(\mu, \nu) &= \inf_{x,z,w} \left\{\frac{1}{|B|} \sum_{i\in B} \phi(z_i) + r(w) + \frac{1}{2\eta} \|x - x_t\|_2^2 + \mu^T(A_B x + b - z) + \nu^T (x - w) \right\} \\ &= \color{blue}{\inf_x \left\{ \frac{1}{2\eta} \|x - x_t\|_2^2 + (A_B^T \mu + \nu)^T x \right\}} + \color{purple}{\inf_z \left\{ \frac{1}{|B|} \sum_{i\in B} \phi(z_i) - \mu^T z \right\}} + \color{green}{\inf_w \left\{ r(w)- \nu^T w \right\}} + \mu^T b. \end{aligned}\]We’ve already encountered the purple part in a previous post - it can be written in terms of the convex conjugate \(\phi^*\):
\[\text{purple} = -\frac{1}{|B|}\sum_{i \in B} \phi^*(|B| \mu_i)\]The green part is also straightforward, and is exactly \(-r^*(\nu)\), where \(r^*\) is the convex conjugate of the regularizer \(r\). Finally, the blue part, despite being some cumbersome, is a simple quadratic minimiation problem over \(x\), so let’s solve it by equating the gradient of the term inside the \(\inf\) with zero:
\[\frac{1}{\eta}(x - x_t) + A_B^T \mu + \nu = 0.\]By re-arranging, we obtain that the equation is solved at \(x = x_t - \eta(A_B^T \mu + \nu)\). Recall that, if strong duality holds, that’s exactly the rule for computing the optimal \(x\) from the optimal \((\mu, \nu)\) pair. Let’s substitute the above \(x\) into the blue term, and after some algebraic manipulations obtain:
\[\text{blue} = \frac{1}{2\eta} \left\| \color{brown}{x_t - \eta (A_B^T \mu + \nu)} - x_t \right\|_2^2 + (A_B^T \mu + \nu)^T [\color{brown}{x_t - \eta (A_B^T \mu + \nu)}] = - \frac{\eta}{2} \| A_B^T \mu + \nu \|_2^2 + (A_B x_t)^T \mu + x_t^T \nu\]Summarizing everything, we have:
\[q(\mu, \nu) = \color{blue}{- \frac{\eta}{2} \| A_B^T \mu + \nu \|_2^2 + (A_B x_t)^T \mu + x_t^T \nu} \color{purple}{-\frac{1}{|B|}\sum_{i \in B} \phi^*(|B| \mu_i)} \color{green}{-r^*(\nu)} + b^T \mu.\]The blue part is a concave quadratic, and \(-\phi^*\) and \(-r^*\) are both concave. Well, it seems that we’ve done it, haven’t we? Not quite! Recall that the dimension of \(\nu\) is the same as the dimension of \(x\), so we haven’t reduced the problem’s dimension at all! If we have a huge model parameter vector \(x\), we’ll have a huge dual variable \(\nu\).
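The quadratic-minimization identity behind the blue term can be sanity-checked numerically. For any vector \(g\) (here it stands in for the linear coefficient of \(x\)), \(\min_x \{\frac{1}{2\eta}\|x - x_t\|_2^2 + g^T x\}\) is attained at \(x = x_t - \eta g\), with value \(g^T x_t - \frac{\eta}{2}\|g\|_2^2\). A quick check on synthetic data:

```python
import numpy as np

# Verify: min over x of (1/(2*eta))*||x - x_t||^2 + g^T x is attained at
# x = x_t - eta*g, with optimal value g^T x_t - (eta/2)*||g||^2.
rng = np.random.default_rng(0)
d, eta = 6, 0.7
g = rng.standard_normal(d)
x_t = rng.standard_normal(d)

def objective(x):
    return np.sum((x - x_t)**2) / (2 * eta) + g @ x

x_star = x_t - eta * g
closed_form = g @ x_t - 0.5 * eta * np.sum(g**2)
assert np.isclose(objective(x_star), closed_form)
# random perturbations never do better than the closed-form minimum
for _ in range(100):
    assert objective(x_star + 0.1 * rng.standard_normal(d)) >= closed_form
```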
The above two failures to come up with an efficient algorithm for computing \(x_{t+1}\) for mini-batches in the regularized convex-on-linear losses might make us wonder - is it even possible to implement the method in this setting? Well, it turns out that if we insist on using an off-the-shelf optimization solver, it might be very hard. But if we are willing to write our own, it is possible.
Indeed, consider the dual problem we derived in take 1. We know that the dual function being maximized is concave. If we can compute its gradient, we can write our own fast gradient method, e.g. FISTA^{1} or Nesterov’s accelerated gradient^{2} method, to solve it. Note that I am not referring to a stochastic optimization algorithm for training a model, but to a fully deterministic one for solving a simple optimization problem. Such methods can be quite fast. Furthermore, if we can compute the Hessian matrix of \(q\) we can employ Newton’s method^{3}, and solve the dual problem even faster, in a matter of a few milliseconds. However, writing convex optimization solvers is beyond the scope of this post.
The entire post series was devoted to deriving efficient implementations of the proximal point methods to various generic problem classes. Contrary to the above, I would like to devote the next, and last blog post of this series to implementing the method on a specific, but interesting problem. Stay tuned!
Beck A. & Teboulle M. (2009) A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1), 183–202. ↩
Nesterov Y. (1983) A method for solving the convex programming problem with convergence rate O(1/k^2). Dokl. Akad. Nauk SSSR 269, 543-547 ↩
https://en.wikipedia.org/wiki/Newton%27s_method_in_optimization ↩
Then, using a more general version of convex duality we saw that the above is equivalent to maximizing the dual function
\[q(s) = -\frac{\eta}{2} \|A_B^T s\|_2^2 + (A_B x_t + b_B)^T s - \frac{1}{|B|} \sum_{i \in B} \phi^*(|B| s_i),\]where \(A_B\) and \(b_B\) are obtained by vertical concatenation of \(a_i^T\) and \(b_i\) for \(i \in B\).
Before digging deeper, to simplify notation we first transform the maximization problem to an equivalent minimization problem. To that end, we do the following transformation:
\[q(s) = - \left( \frac{1}{2} \left \| \sqrt{\eta} A_B^T s \right \|_2^2 - (A_B x_t + b_B)^T s + \frac{1}{\vert B \vert} \sum_{i \in B} \phi^*(\vert B\vert s_i) \right)\]Recalling that maximizing a function is equivalent to minimizing its negation, and denoting \(P=\sqrt{\eta} A_B^T\), \(c = -(A_B x_t + b_B)\), and \(m = \vert B \vert\), we obtain that we aim to minimize functions of the form
\[\tilde{q}(s)=\frac{1}{2} \|P s\|_2^2 + c^T s + \frac{1}{m} \sum_{i \in B} \phi^*(m s_i). \tag{DM}\]Recall, also, that the algorithm for computing \(x_{t+1}\) comprises the following steps:
1. Compute the dual problem’s coefficients \(P = \sqrt{\eta} A_B^T\) and \(c = -(A_B x_t + b_B)\).
2. Find a minimizer \(s^*\) of \(\tilde{q}(s)\).
3. Perform the step \(x_{t+1} = x_t - \eta A_B^T s^*\).

The first and last steps can be computed by a generic optimizer, while the second step depends on \(\phi\) and has to be computed by specialized code. So, here is our generic optimizer:
import math
import torch

class MiniBatchConvLinOptimizer:
    def __init__(self, x, step_size, phi):
        self._x = x
        self._step_size = step_size
        self._phi = phi

    def step(self, A_batch, b_batch):
        # helper variables
        x = self._x
        step_size = self._step_size
        phi = self._phi
        # compute dual problem coefficients; note c_neg = A_B x_t + b_B = -c
        P = math.sqrt(step_size) * A_batch.t()
        c_neg = torch.addmv(b_batch, A_batch, x)
        # solve dual problem
        s_star = phi.solve_dual(P, c_neg)
        # perform step
        step_dir = torch.mm(A_batch.t(), s_star)
        x.sub_(step_size * step_dir.reshape(x.shape))
        # return the losses w.r.t the params before making the step
        return phi.eval(c_neg)
We can also immediately implement the corresponding phi
for \(L_2\) losses, namely, for least-squares problems. In that case, the dual problem amounts to solving the linear system
\[(P^T P + m I) s = -c,\]
or, equivalently, computing
\[s^* = -(P^T P + m I)^{-1} c\]Here is the code:
class L2Loss:
    def solve_dual(self, P, c_neg):
        m = P.shape[1]  # number of columns = batch size
        eye = torch.eye(m, dtype=P.dtype)
        lhs_mat = torch.addmm(eye, P.t(), P, beta=m)  # P^T P + m*I
        # solve the positive-definite linear system using a Cholesky factorization
        lhs_factor = torch.cholesky(lhs_mat)
        # make the rhs a column vector, so that cholesky_solve works;
        # note that c_neg = -c, so no negation is needed
        rhs_col = c_neg.unsqueeze(1)
        return torch.cholesky_solve(rhs_col, lhs_factor)

    def eval(self, lin):
        return 0.5 * (lin ** 2)
Now we can construct a least-squares mini-batch optimizer using:
opt = MiniBatchConvLinOptimizer(x, step_size, L2Loss())
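As a sanity check on the algebra (a numpy sketch on synthetic data, with arbitrary toy sizes), we can verify that the dual-based update \(x_{t+1} = x_t - \eta A_B^T s^*\), with \(s^* = (\eta A_B A_B^T + mI)^{-1}(A_B x_t + b_B)\), solves the primal proximal problem for squared losses. This is exactly what the code computes, since \(P^T P = \eta A_B A_B^T\) and `c_neg` \(= A_B x_t + b_B\).

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, eta = 8, 4, 0.3
A = rng.standard_normal((m, d))   # rows are a_i^T
b = rng.standard_normal(m)
x_t = rng.standard_normal(d)

# dual solution: P^T P + m*I = eta*A A^T + m*I, and the rhs is A x_t + b
s_star = np.linalg.solve(eta * A @ A.T + m * np.eye(m), A @ x_t + b)
x_next = x_t - eta * A.T @ s_star

# optimality: the gradient of the primal prox objective vanishes at x_next
grad = A.T @ (A @ x_next + b) / m + (x_next - x_t) / eta
assert np.allclose(grad, 0)
```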
Beyond least squares, e.g. for logistic regression, the minimum of \(\tilde{q}\) cannot be computed analytically by a simple closed-form formula. But we are in luck! Minimizing low-dimensional convex functions has been the subject of a few decades of research, and many efficient methods have emerged.
CVXPY is a Python package for convex optimization, which acts like a kind of a compiler. We specify the optimization problem we want to solve, and the package “compiles” it to a low-level representation which can be passed to a lower-level solver, usually written in C, which solves it. This low level solver is usually referred to as a backend. Examples of such solvers include the open-source solvers SCS and ECOS, the commercial solver MOSEK, and many more.
Entire courses can be, and are, taught about convex optimization, and I do not intend to teach another one in this blog. Rather, I intend to introduce unfamiliar readers to the mere existence of this technology; readers who are interested will be able to learn the subject through one of the available courses, from a book^{1}, or from any other source.
To install CVXPY we can use:
pip install cvxpy
Now we can import and use it. An optimization problem consists of two components - the function we wish to minimize or maximize, called the objective, and the set of constraints subject to which the optimization is done. Let’s solve a simple optimization problem with constraints:
import cvxpy as cp

# define the problem
x = cp.Variable(2)  # a two-dimensional variable
objective = cp.sum_squares(cp.vstack([x[0] - x[1] + 1,
                                      x[0] + x[1] - 1,
                                      x[0] - 2*x[1] + 3]))
constraints = [x >= 0, cp.sum(x) == 1]
problem = cp.Problem(cp.Minimize(objective), constraints)

# solve the problem, print the optimal value, and the optimal solution
result = problem.solve()
print(f'Optimal value = {result}, optimal x = {x.value}')
The code is, hopefully, self explanatory, and readers can see that it indeed aims to solve the following optimization problem:
\[\begin{aligned} \min_{x \in \mathbb{R}^2} &\quad (x_1-x_2+1)^2+(x_1+x_2-1)^2+(x_1-2x_2+3)^2 \\ \text{subject to} &\quad x_1,x_2 \geq 0 \\ &\quad x_1+x_2=1 \end{aligned}\]Here is the output of the code above:
Optimal value = 0.9999999999999997, optimal x = [-3.60317695e-19 1.00000000e+00]
This means that the minimum value is \(\approx 1\), and the vector \(x^*\approx(0, 1)\) attains it, meaning it is an optimal solution.
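Since the problem is tiny, we can double-check this result without CVXPY: the constraint \(x_1 + x_2 = 1\) lets us parametrize the feasible set as \(x = (t, 1-t)\) with \(t \in [0, 1]\) and scan the objective directly (a quick sketch of mine, not part of the original code):

```python
import numpy as np

# scan the feasible segment x = (t, 1-t), t in [0, 1]
t = np.linspace(0.0, 1.0, 100001)
obj = (t - (1 - t) + 1)**2 + (t + (1 - t) - 1)**2 + (t - 2*(1 - t) + 3)**2
assert np.isclose(obj.min(), 1.0, atol=1e-6)   # matches the solver's value
assert np.isclose(t[obj.argmin()], 0.0)        # attained at x = (0, 1)
```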
Let’s try to implement the logistic loss using CVXPY. We have \(\phi(z)=\ln(1+\exp(z))\), meaning that \(\phi^*(s) = s \ln(s) + (1-s) \ln(1-s)\). Fortunately, the convex function \(x \ln(x)\) is used in many optimization problems, and in many cases is recognized by convex optimization packages. It is referred to as the negative entropy function. So here is the code for the logistic loss implementation, which minimizes \(\tilde{q}(s)\) using CVXPY:
import torch
import cvxpy as cp

class LogisticLoss:
    def solve_dual(self, P, c_neg):
        # extract information and convert tensors to numpy, since CVXPY
        # works with numpy arrays
        m = P.shape[1]
        P = P.data.numpy()
        c_neg = c_neg.data.numpy()
        # define the dual optimization problem; `@` is a matrix product
        s = cp.Variable(m)
        objective = (0.5 * cp.sum_squares(P @ s)
                     - cp.sum(c_neg * s)
                     - (cp.sum(cp.entr(m * s)) + cp.sum(cp.entr(1 - m * s))) / m)
        prob = cp.Problem(cp.Minimize(objective))
        # solve the problem, and extract the optimal solution
        prob.solve()
        s_star = torch.tensor(s.value).unsqueeze(1)
        return s_star

    def eval(self, lin):
        return torch.log1p(torch.exp(lin))
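Before using it, we can sanity-check the conjugate formula itself: numerically, \(\phi^*(s) = s\ln(s) + (1-s)\ln(1-s)\) should match its definition \(\sup_t \{st - \ln(1+e^t)\}\) for \(s \in (0,1)\). A small standalone check:

```python
import numpy as np

# verify phi*(s) = s*ln(s) + (1-s)*ln(1-s) against sup_t { s*t - ln(1+e^t) },
# approximated by a fine grid over t
for s in [0.1, 0.3, 0.5, 0.9]:
    t = np.linspace(-30, 30, 200001)
    sup_numeric = np.max(s * t - np.logaddexp(0.0, t))  # logaddexp(0,t) = ln(1+e^t)
    conjugate = s * np.log(s) + (1 - s) * np.log(1 - s)
    assert abs(sup_numeric - conjugate) < 1e-6
```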
Now, we can also employ the mini-batch version of the stochastic proximal point method to solve logistic regression problems using the following optimizer:
opt = MiniBatchConvLinOptimizer(x, step_size, LogisticLoss())
First, let’s see that our optimizer produces reasonable results. We will use the same spambase data-set we used before. It is composed of 57 numerical columns, signifying frequencies of various frequently occurring words and average run-lengths of capital letters, and a 58th column with a spam indicator.
Let’s load the data-set, scale its features, and construct a PyTorch data-set object:
import numpy as np
import pandas as pd
import torch
from sklearn import preprocessing
from torch.utils.data.dataset import TensorDataset

url = 'https://web.stanford.edu/~hastie/ElemStatLearn/datasets/spam.data'
df = pd.read_csv(url, delimiter=' ', header=None)

min_max_scaler = preprocessing.MinMaxScaler()
scaled = min_max_scaler.fit_transform(df.iloc[:, 0:56])
df.iloc[:, 0:56] = scaled

W = torch.tensor(np.array(df.iloc[:, 0:56]))  # features
Y = torch.tensor(np.array(df.iloc[:, 57]))    # labels
ds = TensorDataset(W, Y)
Now, let’s train a logistic regression model for classifying spam based on the features in the data-set, with batches of 4 samples.
from torch.utils.data import DataLoader

# init. model parameter vector
x = torch.empty(56, requires_grad=False, dtype=torch.float64)
torch.nn.init.normal_(x)

# create optimizer
step_size = 1
opt = MiniBatchConvLinOptimizer(x, step_size, LogisticLoss())

# run 40 epochs, print out the average loss
for epoch in range(40):
    loss = 0.0
    for w, y in DataLoader(ds, shuffle=True, batch_size=4):
        A_batch = (1 - 2 * y).unsqueeze(1) * w  # convex-on-linear form
        b_batch = torch.zeros_like(y, dtype=x.dtype)
        losses = opt.step(A_batch, b_batch)
        loss += torch.sum(losses).item()
    print(f'epoch = {epoch}, loss = {loss / len(ds)}')
I obtained the following output:
epoch = 0, loss = 0.49779446008581707
epoch = 1, loss = 0.37720419193721605
...
epoch = 37, loss = 0.24126466392084467
epoch = 38, loss = 0.24089484987149695
epoch = 39, loss = 0.24035799118074533
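A word on the data preparation in the loop above: the construction \(A_B = (1-2y)\,w\) with \(b = 0\) encodes logistic regression in convex-on-linear form, since \(\phi((1-2y) w^T x) = \log(1+\exp((1-2y) w^T x))\) is exactly the binary cross-entropy with logits \(z = w^T x\) for labels \(y \in \{0, 1\}\). A quick standalone check:

```python
import numpy as np

rng = np.random.default_rng(2)
w = rng.standard_normal(5)
x = rng.standard_normal(5)
z = w @ x  # the logit

for y in (0, 1):
    # convex-on-linear form: phi(a^T x + b) with a = (1-2y)*w, b = 0
    phi_val = np.log1p(np.exp((1 - 2 * y) * z))
    # binary cross-entropy with logits
    sigma = 1 / (1 + np.exp(-z))
    bce = -y * np.log(sigma) - (1 - y) * np.log(1 - sigma)
    assert np.isclose(phi_val, bce)
```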
If you ran the code above, you may have noticed that it is quite slow. We can make it a bit faster by utilizing more powerful features of CVXPY. Recall that CVXPY is just a ‘compiler’ which transforms the problem we provide into a lower-level form. The optimization problems of minimizing \(\tilde{q}(s)\) share a similar structure - they only differ in the matrix \(P\) and the vector \(c\). But every call to the solve_dual
method constructs a new optimization problem, which is compiled every time we invoke that method.
CVXPY lets us compile a family of problems sharing the same structure once, and re-use it every time we need to solve it with different data. Objects of cvxpy.Parameter
are used as placeholders for the problem data, and can be used instead of the actual data. The problem can then be constructed once, and re-used by assigning values to the parameters.
Another improvement comes from elementary linear algebra:
\[\frac{1}{2} \|P s\|_2^2=\frac{1}{2} s^T (P^T P) s.\]The size of the matrix \(P\) is \(d \times m\), where \(d\) is the dimension of the training data and \(m\) is the mini-batch size. The above means that we can formulate our optimization problems in terms of the quadratic matrix \(P^T P\) of size \(m\times m\), which does not depend on the dimension of the training data. Consequently, we can compute this matrix as a PyTorch tensor, and use it to construct the CVXPY problem, which now also becomes independent of the dimension of the training data.
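A small numeric illustration of this identity, with arbitrary toy sizes (a large model dimension and a small batch):

```python
import numpy as np

rng = np.random.default_rng(3)
d, m = 10000, 4           # illustrative sizes: d >> m
P = rng.standard_normal((d, m))
s = rng.standard_normal(m)

PTP = P.T @ P             # computed once per step, O(d*m^2) work
assert PTP.shape == (m, m)   # only m x m, independent of d
# 0.5*||P s||^2 equals 0.5 * s^T (P^T P) s
assert np.isclose(0.5 * np.dot(P @ s, P @ s), 0.5 * s @ PTP @ s)
```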
Here is an implementation of the logistic loss based on the combination of both ideas:
class LogisticLoss:
    def __init__(self, batch_size):
        self.PTP = cp.Parameter((batch_size, batch_size), PSD=True)
        self.c_neg = cp.Parameter(batch_size)
        self.batch_size = batch_size
        self.prob, self.s = LogisticLoss._build_problem(self.PTP, self.c_neg, batch_size)

    @staticmethod
    def _build_problem(PTP, c_neg, m):
        s = cp.Variable(m)
        objective = (0.5 * cp.quad_form(s, PTP)
                     - cp.sum(c_neg * s)
                     - (cp.sum(cp.entr(m * s)) + cp.sum(cp.entr(1 - m * s))) / m)
        prob = cp.Problem(cp.Minimize(objective))
        return prob, s

    def solve_dual(self, P, c_neg):
        PTP = torch.mm(P.t(), P)
        m = PTP.shape[0]
        if m == self.batch_size:
            prob, s = self._reuse(PTP, c_neg)
        else:
            prob, s = self._build_new(PTP, c_neg, m)
        return self._solve(prob, s)

    @staticmethod
    def _solve(prob, s):
        # ECOS/SCS/MOSEK are capable of dealing with the dual problem at hand.
        # MOSEK is a commercial-grade solver, is the most reliable,
        # and is free for academic use!
        prob.solve(solver=cp.MOSEK)
        s_star = torch.from_numpy(s.value).unsqueeze(1)
        return s_star

    @staticmethod
    def _build_new(PTP, c_neg, m):
        prob, s = LogisticLoss._build_problem(PTP.data.numpy(), c_neg.data.numpy(), m)
        return prob, s

    def _reuse(self, PTP, c_neg):
        self.PTP.value = PTP.data.numpy()
        self.c_neg.value = c_neg.data.numpy()
        return self.prob, self.s

    def eval(self, lin):
        return torch.log1p(torch.exp(lin))
On my computer, it is approximately 1.5 times faster. Not too impressive, but any gain is a gain. We could make it faster by exploiting the lower-level facilities provided by CVXPY. But the only plausible way of making it really fast is building a custom optimization solver for the logistic losses’ dual problem, which fully exploits the structure of the problem. This is what we would have done if we were to implement a production-grade optimizer.
Building such a solver is beyond the scope of this blog post, but there are several resources^{1} interested readers might consider. A specialized solver could solve the dual problem, even for batches of size 256, in a few milliseconds.
Note, that computing \(P^T P\) costs \(\mathcal{O}(d m^2)\) operations. Assuming that model dimensions are large, while the mini-batch size is small, this part dominates the computational complexity of computing \(x_{t+1}\). To be fair, we should point out that regular SGD with mini-batches costs only \(\mathcal{O}(d m)\) time, so we should use the proximal point algorithm when its benefits, such as cheaper hyperparameter tuning, outweigh the above-mentioned cost.
Let’s see if for our logistic regression problem we are getting a stable algorithm. As previously, for each step size and for each batch size we run several experiments. Then, we show the results in a loss vs. step size plot.
I chose to use Python’s multiprocessing capabilities to make the experiment faster, and to be able to utilize a cloud service with many CPUs to parallelize the work. So here is the code which runs the experiment and plots the results. We use a multiprocessing pool to run an asynchronous task for every combination of the experiment parameters, and a queue to feed the results of each epoch from the parallel executions back to the main program.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import torch.multiprocessing as mp
from tqdm import tqdm

# define experiment setup
batch_sizes = [1, 2, 3, 4, 5, 6]
experiments = range(10)
epochs = range(10)
step_sizes = np.geomspace(0.001, 100, 10)

# construct multiprocessing data exchange objects
manager = mp.Manager()
queue = manager.Queue()

# run experiments
print('Starting parallel training experiments')
pool = mp.Pool(processes=6)
params = [
    (queue, dataset, epochs, step_size, batch_size, experiment)
    for step_size in step_sizes
    for batch_size in batch_sizes
    for experiment in experiments
]
res = pool.starmap_async(train, params, chunksize=1)

# wait for results to arrive
print('Gathering results from parallel experiments')
losses = []
total_epochs = len(batch_sizes) * len(experiments) * len(step_sizes) * len(epochs)
for i in tqdm(range(total_epochs), desc='Training', unit='epochs', ncols=160, smoothing=0.05):
    losses.append(queue.get())

print('Waiting for parallel jobs to end')
res.wait(timeout=1)

losses = pd.concat(losses)
losses.to_csv('losses.csv')
best_losses = losses[['batch_size', 'step_size', 'experiment', 'loss']]\
    .groupby(['batch_size', 'step_size', 'experiment'], as_index=False)\
    .min()

sns.set()
ax = sns.lineplot(x='step_size', y='loss', hue='batch_size', data=best_losses,
                  err_style='band', legend='full')
ax.set_yscale('log')
ax.set_xscale('log')
plt.show()
And here is the train
function, invoked via pool.starmap_async above, which actually trains a model with the given parameters, and feeds the results of each epoch through the queue back to the main program.
def train(queue, dataset, epochs, step_size, batch_size, experiment):
    x = torch.empty(56, requires_grad=False, dtype=torch.float64)
    torch.nn.init.normal_(x)
    optimizer = MiniBatchConvLinOptimizer(x, step_size, LogisticLoss(batch_size=batch_size))
    for epoch in epochs:
        epoch_loss = 0.
        for w, y in torch.utils.data.DataLoader(dataset, shuffle=True, batch_size=batch_size):
            # logistic regression input in "convex-on-linear" form
            sign = (1 - 2 * y.unsqueeze(1))
            A_batch = sign * w
            b_batch = torch.zeros_like(y, dtype=x.dtype)
            batch_losses = optimizer.step(A_batch, b_batch)
            epoch_loss += torch.sum(batch_losses).item()
        epoch_loss /= len(dataset)
        df = pd.DataFrame.from_dict(
            {'batch_size': [batch_size],
             'step_size': [step_size],
             'experiment': [experiment],
             'epoch': [epoch],
             'loss': [epoch_loss]})
        queue.put(df)
After a few hours, I obtained the following result:
When the step sizes are too small, in this case less than \(10\), convergence is slow and we are not able to converge to a solution achieving a training loss below \(3\times 10^{-1}\) - our 10 epochs are not enough. When the step size becomes larger, the benefit of mini-batching becomes apparent - larger batch size, which appears as a darker color, leads to a better training loss. The stability property remains - we do not diverge, and we obtain a solution with a reasonable training loss for a huge range of step sizes: from as small as 0.1 to as large as 100!
We have gone through a very long journey, and explored a variety of cases for which the proximal point approach of using the losses themselves, instead of approximating them, leads to a practically implementable algorithm. When training on a single sample at a time, our most general algorithm was aimed at convex-on-linear losses with regularization. Note that in this post we did not deal with regularized losses, and that is exactly the subject of the next post - mini-batch training for convex-on-linear losses with regularization.
As a final remark, I would like to thank the MOSEK team for their generous help with debugging some of the issues I had while writing the code for this blog post.
Since for a general function \(f\) the above is intractable, we covered special and interesting functions \(f\) of practical value to the machine learning community: convex on linear, and regularized variants with an L2 and a generic regularizer. For all of the above, we derived an efficient algorithm to compute \(x_{t+1}\), but our derivations shared one thing in common - we only considered training on one sample at each iteration.
Training by exploiting information from one arbitrary training sample out of the entire training set at each iteration is quite noisy. A standard practice is reducing noise by selecting a mini-batch of training samples, and their corresponding incurred losses, at each iteration. In the following sequence of posts we will derive the mini-batch version of the proximal point algorithm, following lines similar to our derivation for convex-on-linear losses.
Let’s begin by re-interpreting the mini-batch version of stochastic gradient descent: at each iteration, select a mini-batch of samples \(B \subseteq \{1, \dots, n\}\) and compute:
\[x_{t+1} = x_t - \frac{\eta}{|B|} \sum_{i \in B} \nabla f_i(x_t),\]Recalling the proximal view we discussed at the beginning of the series, the mini-batch SGD step can be written as
\[x_{t+1} = \operatorname*{argmin}_x \Biggl\{ \color{blue}{\frac{1}{|B|} \sum_{i \in B} f_i(x_t) + \left( \frac{1}{|B|} \sum_{i \in B} \nabla f_i(x_t) \right)^T(x - x_t)} + \color{red}{\frac{1}{2\eta} \|x - x_t\|_2^2} \Biggr\}.\]The red part is, as usual, a proximal term penalizing the distance from \(x_t\). The blue term is a linear approximation of \(\frac{1}{\vert B \vert} \sum_{i \in B} f_i(x)\), which is the average loss of the mini-batch. Consequently, we can interpret mini-batch SGD as:
find a point which balances between descending along the tangent of the mini-batch average \(\frac{1}{\vert B \vert} \sum_{i \in B} f_i(x)\) and staying close to \(x_t\).
The balance is, as previously, determined by the step-size parameter \(\eta\). If we avoid approximating and instead use the functions directly, we obtain the mini-batch version of the stochastic proximal point method:
\[x_{t+1} = \operatorname*{argmin}_x \Biggl \{ \color{blue}{\frac{1}{|B|} \sum_{i \in B} f_i(x)} + \color{red}{ \frac{1}{2\eta} \|x - x_t\|_2^2} \Biggr \}.\]The challenge is, as before, computing \(x_{t+1}\). In this and some of the following posts we will consider the convex-on-linear setup, and derive efficient algorithms for solving the proximal problem
\[x_{t+1} = \operatorname*{argmin}_x \left \{ \frac{1}{|B|} \sum_{i \in B} \phi(a_i^T x + b_i) + \frac{1}{2\eta} \|x - x_t\|_2^2 \right\}, \tag{PROX}\]where \(\phi\) is a convex function. Recall that linear least squares and linear logistic regression problems are special instances with \(\phi(t) = \frac{1}{2} t^2\) and \(\phi(t) = \log(1+\exp(t))\), respectively.
Previously, for \(\vert B \vert = 1\) we used a simple stripped-down version of convex duality to derive an efficient implementation, but for our current endeavor the stripped-down version is not enough.
Our encounter with convex duality led us to reformulate a minimization problem with a single constraint as a maximization problem with one variable. Now we present an extension covering minimization problems with several constraints.
Consider the minimization problem
\[\tag{P} \min_{x,t} \quad \sum_{i \in B} \phi(t_i) + g(x) \quad \text{s.t.} \quad t_i = a_i^T x + b_i, \quad i \in B\]Note that both \(x\) and \(t\) are vectors. We assume that there is an optimal solution \((x^*, t^*)\), and we are specifically interested in \(x^*\). Denote the optimal value by \(\mathcal{v}(P)\). Take a look at the following function:
\[q(s) = \inf_{x,t} \left\{ \sum_{i \in B} \phi(t_i) + g(x) + \sum_{i \in B} s_i(a_i^T x + b_i - t_i) \right\}.\]The function \(q(s)\) is defined by an optimization problem without constraints, which is parameterized by prices \(s_i\) for violating the constraints \(t_i = a_i^T x + b_i\). Now we can derive the following simple but careful result:
\[\begin{aligned} q(s) &= \inf_{x,t} \left\{ \sum_{i \in B} \phi(t_i) + g(x) + \sum_{i \in B} s_i(a_i^T x + b_i - t_i) \right\} \\ &\leq \inf_{x,t} \left\{ \sum_{i \in B} \phi(t_i) + g(x) + \sum_{i \in B} s_i(a_i^T x + b_i - t_i) : t_i = a_i^T x + b_i \right\} \\ &= \inf_{x,t} \left\{ \sum_{i \in B} \phi(t_i) + g(x) : t_i = a_i^T x + b_i \right\} = \mathcal{v}(P). \end{aligned}\]The inequality holds since minimizing over the entire space produces a smaller value than minimizing over a subset. The observation means that \(q(s)\) is a lower bound on the optimal value of our desired problem. The dual problem is about finding the “best” lower bound:
\[\max_s \quad q(s) \tag{D}\]Clearly, even the best lower bound \(\mathcal{v}(D)\) is still a lower bound, namely, \(\mathcal{v}(D) \leq \mathcal{v}(P)\). This is a well-known result, called weak duality. But we are interested in a stronger result, called strong duality:
Suppose that both \(\phi\) and \(g\) are closed^{1} convex functions, and that \(\mathcal{v}(P)\) is finite. Then,
(a) the dual problem (D) has an optimal solution \(s^*\), and the lower bound is tight: \(\mathcal{v}(D) = \mathcal{v}(P)\).
(b) if the minimization problem defining \(q(s^*)\) has a unique optimal solution \(x^*\), then \(x^*\) is the optimal solution of (P).
Conclusion (b) of the strong duality theorem lets us extract the optimal \(x^*\) given the optimal \(s^*\). So, if the dual problem can be solved quickly, we can obtain an efficient algorithm for solving the original problem. Readers interested in learning more about duality are referred to the excellent book Convex Optimization by Boyd & Vandenberghe, which is available online at no cost.
Note that the dimension of \(s\) is exactly the size of the mini-batch. In the extreme case of \(\vert B \vert=1\), the dual problem is one-dimensional. In general, mini-batches in machine learning are typically small (\(\leq 128\)), and there is plenty of literature and software aimed at solving low-dimensional optimization problems extremely quickly; we will discuss some of them in this and the following posts.
Let’s apply convex duality to derive an algorithm template for solving (PROX) defined above. A word of warning - this part is a bit technical.
Since duality requires constraints, let’s add them by defining the auxiliary variables \(t_i\):
\[x_{t+1} = \operatorname*{argmin}_{x,t} \Biggl \{ \frac{1}{|B|} \sum_{i \in B} \phi(t_i) + \frac{1}{2\eta} \|x - x_t\|_2^2 : t_i=a_i^T x + b_i \Biggr \}\]Consequently, the dual objective function \(q\) is:
\[\begin{align} q(s) &= \min_{x,t} \left\{ \frac{1}{|B|} \sum_{i \in B} \phi(t_i) + \frac{1}{2\eta} \|x - x_t\|_2^2 + \sum_{i \in B} s_i(a_i^T x + b_i - t_i) \right\} \\ &= \color{blue}{\min_x \left\{ \frac{1}{2\eta} \|x - x_t\|_2^2 + \left( \sum_{i \in B} s_i a_i \right)^T x \right\}} + \color{red}{\sum_{i \in B} \min_{t_i} \left\{ \frac{1}{|B|} \phi(t_i) - s_i t_i \right\}} + \sum_{i \in B} s_i b_i, \end{align}\]where the last equality follows from separability^{2}. Despite its ‘hairy’ appearance, the blue part is a simple quadratic optimization problem over \(x\). Taking the derivative of the term inside \(\min\) and equating it with zero, we obtain
\[x^* = x_t - \eta \sum_{i \in B} s_i a_i,\]while the optimal value obtained by plugging \(x^*\) into the formula inside the blue \(\min\), after some math, is
\[-\frac{\eta}{2} \Bigl \|\sum_{i \in B} s_i a_i \Bigr \|_2^2 + \Bigl(\sum_{i \in B} s_i a_i \Bigr)^T x_t\]The red part can be re-written as
\[\sum_{i \in B} \min_{t_i} \left\{ \frac{1}{|B|} \phi(t_i) - s_i t_i \right\} = \frac{1}{|B|} \sum_{i \in B} \underbrace{ \min_{t_i} \{ \phi(t_i) - |B| s_i t_i \} }_{-\phi^*(|B| s_i)},\]where \(\phi^*\) is the familiar convex conjugate of \(\phi\). To summarize, the dual aims to maximize
\[q(s) = \color{blue}{-\frac{\eta}{2} \Bigl \|\sum_{i \in B} s_i a_i \Bigr \|_2^2 + \Bigl(\sum_{i \in B} s_i a_i \Bigr)^T x_t} \color{red}{- \frac{1}{|B|} \sum_{i \in B} \phi^*(|B| s_i)} + \sum_{i \in B} s_i b_i \tag{PD}\]It might look a bit hairy, but all we have is a quadratic function of \(s\), a linear function of \(s\), and the term colored in red, which is the only part that depends on the function \(\phi\).
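Since (PD) is expressed through the convex conjugate \(\phi^*\), a quick reminder of how the conjugate is computed for the squared loss used later in this post (a standard derivation, stated here for convenience):

```latex
\phi(t) = \tfrac{1}{2}t^2
\quad\Longrightarrow\quad
\phi^*(z) = \sup_t \left\{ zt - \tfrac{1}{2}t^2 \right\} = \tfrac{1}{2}z^2,
```

where the supremum is attained at \(t = z\), obtained by differentiating \(zt - \tfrac{1}{2}t^2\) and equating with zero.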
Based on the strong duality theorem, we have the following algorithm template for computing \(x_{t+1}\):
The challenging part is, of course, finding a maximizer of \(q(s)\).
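To make the template concrete, here is a rough Python sketch of the two steps. The function name and the use of SciPy's generic numerical maximizer are my own illustration, not code from the post; a real implementation would exploit the structure of \(\phi^*\), as done below for least squares:

```python
import numpy as np
from scipy.optimize import minimize

def prox_point_step(x_t, A, b, eta, phi_conj):
    """One step of the template: (1) maximize q(s), (2) recover x_{t+1}.

    A has the vectors a_i in its rows, b holds the scalars b_i, and
    phi_conj is the convex conjugate phi^* of the outer loss phi.
    """
    m = A.shape[0]  # mini-batch size |B|

    def neg_q(s):  # minimize -q(s) to maximize q(s)
        v = A.T @ s  # v = sum_i s_i a_i
        conj = sum(phi_conj(m * si) for si in s) / m
        return 0.5 * eta * (v @ v) - v @ x_t + conj - s @ b

    s_star = minimize(neg_q, np.zeros(m)).x  # step (1): find a maximizer of q
    return x_t - eta * (A.T @ s_star)        # step (2): x_{t+1} = x_t - eta * sum_i s_i^* a_i

# Example with the least-squares outer loss phi(t) = t^2/2, phi^*(z) = z^2/2:
rng = np.random.default_rng(0)
A, b, x_t = rng.normal(size=(3, 2)), rng.normal(size=3), rng.normal(size=2)
x_next = prox_point_step(x_t, A, b, eta=0.5, phi_conj=lambda z: 0.5 * z ** 2)
```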
Before we jump into concrete examples and write code, let’s compare the method we obtained with mini-batch SGD. The mini-batch version of SGD would compute
\[x_{t+1} = x_t - \frac{\eta}{|B|} \sum_{i \in B} \phi'(a_i^T x + b_i) a_i,\]while according to step (2) in the algorithm above, the proximal point method is going to compute
\[x_{t+1} = x_t - \eta \sum_{i \in B} s_i^* a_i.\]Apparently, both methods step in a direction obtained from a linear combination of the vectors \(a_i\). But while SGD multiplies each vector by derivatives of \(\phi\), which correspond to a linear approximation, the proximal point method uses coefficients \(s_i^*\) obtained by considering exact losses via duality.
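For the least-squares case \(\phi(t) = \tfrac{1}{2}t^2\), whose dual is solved in closed form later in the post, this relationship can be checked numerically: as \(\eta \to 0\), the coefficients \(s_i^*\) approach the SGD coefficients \(\tfrac{1}{|B|}\phi'(a_i^T x_t + b_i)\), so the two methods coincide for vanishing step sizes and drift apart as \(\eta\) grows. A small illustrative check (the variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 3))   # rows are the a_i of a mini-batch of size 4
b = rng.normal(size=4)
x_t = rng.normal(size=3)

# SGD coefficients: phi'(a_i^T x_t + b_i) / |B| = (a_i^T x_t + b_i) / |B|
sgd_coeffs = (A @ x_t + b) / 4

# prox-point coefficients: s* = (eta * A A^T + |B| I)^{-1} (A x_t + b)
for eta in [1e-6, 0.1, 10.0]:
    s_star = np.linalg.solve(eta * A @ A.T + 4 * np.eye(4), A @ x_t + b)
    print(eta, np.linalg.norm(s_star - sgd_coeffs))
```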
Using some linear algebra, we can re-write the function \(q(s)\) above in a more compact form, which allows us to use the existing linear-algebra machinery built into many frameworks, such as PyTorch and NumPy, and to cooperate nicely with how mini-batches are provided by PyTorch’s DataLoader
class.
By embedding the vectors \(\{a_i\}_{i \in B}\) into the rows of the matrix \(A_B\), and the numbers \(\{ b_i \}_{i \in B}\) into the column vector \(b_B\), we can re-write (PD) as:
\[q(s) = -\frac{\eta}{2} \|A_B^T s\|_2^2 + (A_B x_t + b_B)^T s - \frac{1}{|B|} \sum_{i \in B} \phi^*(|B| s_i)\]Having found \(s^*\), we obtain
\[x_{t+1} = x_t - \eta A_B^T s^*\]Linear least-squares is interesting because it is one of those rare occasions where we get a closed-form formula for computing \(x_{t+1}\) in the mini-batch setting. We aim to minimize
\[\frac{1}{2n} \sum_{i=1}^n (a_i^T x + b_i)^2, \tag{LS}\]which corresponds to \(\phi(t) = \frac{1}{2} t^2\) and consequently \(\phi^*(z)=\frac{1}{2} z^2\). According to the formula (PD) above, the dual for the mini-batch \(B\) aims to maximize
\[q(s)= -\frac{\eta}{2} \|A_B^T s\|_2^2 + (A_B x_t + b_B)^T s - \frac{|B|}{2} \|s\|_2^2.\]We are in luck! Why? We obtained a concave quadratic \(q(s)\), and quadratic functions have linear gradients. Thus, maximizing \(q(s)\) is done by solving the linear system of equations \(\nabla q(s) = 0\):
\[\nabla q(s) = (-\eta A_B A_B^T - |B| I) s + A_B x_t + b_B = 0.\]Extracting \(s\), we obtain:
\[s^* = (\underbrace{\eta A_B A_B^T + |B| I}_{P_B})^{-1} (A_B x_t + b_B).\]The expression is well-defined since the matrix denoted by \(P_B\) above is symmetric and positive definite, and thus invertible. Usually, we don’t like inverting matrices, since it is an expensive numerical operation. But this time we have a linear system of only \(\vert B \vert\) variables, which can be solved in a few microseconds for small enough mini-batches.
Let’s implement a PyTorch version of the optimizer. To make it a bit more efficient, we will exploit the positive-definiteness of the matrix \(P_B\) to solve the linear system using Cholesky factorization, instead of just calling torch.solve()
.
import torch

class LeastSquaresProxPointOptimizer:
    def __init__(self, x, step_size):
        self._x = x
        self._step_size = step_size

    def step(self, A_batch, b_batch):
        # helper variables
        x = self._x
        step_size = self._step_size
        m = A_batch.shape[0]  # number of rows = batch size

        # compute linear system coefficients: P_batch = step_size * A A^T + m * I
        I = torch.eye(m, dtype=A_batch.dtype)
        P_batch = torch.addmm(I, A_batch, A_batch.t(), beta=m, alpha=step_size)
        rhs = torch.addmv(b_batch, A_batch, x)  # rhs = A x + b

        # solve positive-definite linear system using Cholesky factorization
        P_factor = torch.cholesky(P_batch)
        rhs_col = rhs.unsqueeze(1)  # make rhs a column vector, so that cholesky_solve works
        s_star = torch.cholesky_solve(rhs_col, P_factor)

        # perform step: x <- x - step_size * A^T s*
        step_dir = torch.mm(A_batch.t(), s_star)
        x.sub_(step_size * step_dir.reshape(x.shape))

        # return the losses w.r.t the params before making the step
        return 0.5 * (rhs ** 2)
We had to remember a bit of linear algebra, but now we have a mini-batch stochastic proximal point solver for least squares problems, which should be stable w.r.t its step size.
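Before moving to real data, here is a quick sanity check of the update rule on a tiny synthetic least-squares problem where a linear model fits the data exactly, so the optimal loss is zero. The helper below is my own standalone re-implementation of the step (using a plain solve instead of a Cholesky factorization), purely for illustration:

```python
import torch

def ls_prox_step(x, A, b, eta):
    # Same update as the optimizer above, written with a plain linear solve.
    m = A.shape[0]
    P = eta * (A @ A.t()) + m * torch.eye(m, dtype=A.dtype)
    rhs = A @ x + b
    s_star = torch.linalg.solve(P, rhs)
    x.sub_(eta * (A.t() @ s_star))
    return 0.5 * rhs ** 2

torch.manual_seed(0)
n, dim = 64, 4
A_full = torch.randn(n, dim, dtype=torch.float64)
x_true = torch.randn(dim, dtype=torch.float64)
b_full = -(A_full @ x_true)  # exact fit: the optimal loss is zero

x = torch.zeros(dim, dtype=torch.float64)
for epoch in range(50):
    for i in range(0, n, 4):  # mini-batches of size 4
        ls_prox_step(x, A_full[i:i + 4], b_full[i:i + 4], eta=1.0)

final_loss = (0.5 * (A_full @ x + b_full) ** 2).mean().item()
print(final_loss)  # should be essentially zero
```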
Working with a larger family of functions \(\phi\) will be addressed in future posts. In this post - let’s see if mini-batching helps us.
Let’s test our new shiny optimizer on the Boston housing dataset, and see how mini-batching affects its performance. The code is also available in this git repo. We will try to predict housing prices \(y\) based on the data vector \(p \in \mathbb{R}^3\) comprising the number of rooms, population lower status percentage, and average pupil-teacher ratio by the linear model: \(y = p^T \beta +\alpha\)
To that end, we will attempt to minimize the mean squared error over all our samples \((p_j, y_j)\), namely:
\[\min_{\alpha, \beta} \quad \frac{1}{2n} \sum_{j=1}^n (p_j^T \beta +\alpha-y_j)^2\]In terms of (LS) above, we have the parameters \(x = (\beta_1, \beta_2, \beta_3, \alpha)^T\), the data \(a_i = (p_{i,1}, p_{i,2}, p_{i,3}, 1)^T\), and \(b_i = -y_i\).
Since data extraction and cleaning is not the focus of this blog, we will assume our training data is already present in the boston.csv
file, and we start from there. First, let’s see that the optimizer works, and we can actually train the model.
import pandas as pd
import torch
from sklearn.preprocessing import minmax_scale
import numpy as np
# load the data, and form the tensor dataset
df = pd.read_csv('boston.csv')
inputs = minmax_scale(df[['RM','LSTAT','PTRATIO']].to_numpy()) # rescale inputs
inputs = np.hstack([inputs, np.ones((inputs.shape[0], 1))]) # add "1" to each sample
labels = minmax_scale(df['MEDV'].to_numpy())
dataset = torch.utils.data.TensorDataset(torch.tensor(inputs), -torch.tensor(labels))
Now let’s train our model with batches of size 1, as a sanity check, just to see what we get. I arbitrarily chose a step size of \(0.1\), since the method’s performance should be quite stable w.r.t the step size choice:
x = torch.empty(4, dtype=torch.float64, requires_grad=False)
torch.nn.init.normal_(x)
optimizer = LeastSquaresProxPointOptimizer(x, 0.1)
for epoch in range(10):
    epoch_loss = 0.
    for A_batch, b_batch in torch.utils.data.DataLoader(dataset, shuffle=True, batch_size=1):
        losses = optimizer.step(A_batch, b_batch)
        epoch_loss += torch.sum(losses).item()
    epoch_loss /= len(dataset)
    print(f'epoch = {epoch}, loss = {epoch_loss}')
I got the following output:
epoch = 0, loss = 0.0353296791886312
epoch = 1, loss = 0.0054689887762457805
epoch = 2, loss = 0.005244492954797097
epoch = 3, loss = 0.005063800404922291
epoch = 4, loss = 0.005067674384940123
epoch = 5, loss = 0.005074529328079233
epoch = 6, loss = 0.0049781901226854525
epoch = 7, loss = 0.004972211130983918
epoch = 8, loss = 0.005056279938532879
epoch = 9, loss = 0.005009527545457453
Seems that the model is indeed training. Now, let’s try increasing the batch size. Note that since the data-set itself is quite small, I wouldn’t try batches of more than \(4 \sim 6\) samples. Increasing the batch size to 4, I got the following output:
epoch = 0, loss = 0.02791556991645117
epoch = 1, loss = 0.009767545467004557
epoch = 2, loss = 0.006549118396497772
epoch = 3, loss = 0.005453119300548743
epoch = 4, loss = 0.005126206748396651
epoch = 5, loss = 0.004884885984465991
epoch = 6, loss = 0.004885505038333496
epoch = 7, loss = 0.004807177492127341
epoch = 8, loss = 0.004756576804948836
epoch = 9, loss = 0.00477020746145285
Ah! An improvement! Reducing the stochastic noise by increasing the mini-batch size indeed improves the algorithm - a mini-batch of training samples is a less noisy approximation of the entire training set than just one sample.
Now, let’s perform a stability experiment with batch sizes from 1 to 6, and see how the method performs. For each batch size and each step size, we will run 20 training experiments, consisting of 10 epochs each. Then, for each batch size, we will plot the best training loss from each experiment as a function of the step size, to see how the performance of the method varies with the step-size choice. The following code runs the experiment and populates the losses data-frame with the results.
from tqdm import tqdm

# setup experiment parameters
batch_sizes = [1, 2, 3, 4, 5, 6]
experiments = range(20)
epochs = range(10)
step_sizes = np.geomspace(0.001, 100, 30)

# run experiments and record results
losses = pd.DataFrame(columns=['batch_size', 'step_size', 'experiment', 'epoch', 'loss'])
total_epochs = len(batch_sizes) * len(experiments) * len(step_sizes) * len(epochs)
with tqdm(total=total_epochs, desc='batch_size = NA, step_size = NA, experiment = NA',
          unit='epochs', ncols=160) as pbar:
    for batch_size in batch_sizes:
        for step_size in step_sizes:
            for experiment in experiments:
                x = torch.empty(4, requires_grad=False, dtype=torch.float64)
                torch.nn.init.normal_(x)
                optimizer = LeastSquaresProxPointOptimizer(x, step_size)
                for epoch in epochs:
                    epoch_loss = 0.
                    for A_batch, b_batch in torch.utils.data.DataLoader(dataset, shuffle=True, batch_size=batch_size):
                        batch_losses = optimizer.step(A_batch, b_batch)
                        epoch_loss += torch.sum(batch_losses).item()
                    epoch_loss /= len(dataset)
                    losses = losses.append(pd.DataFrame.from_dict(
                        {'batch_size': [batch_size],
                         'step_size': [step_size],
                         'experiment': [experiment],
                         'epoch': [epoch],
                         'loss': [epoch_loss]}), sort=True)
                    pbar.update()
                    pbar.set_description(f'batch_size = {batch_size}, step_size = {step_size}, experiment = {experiment}')
After approximately 30 minutes the losses
data-frame contains the results, and now we can produce our plot:
import seaborn as sns
import matplotlib.pyplot as plt
best_losses = losses[['batch_size', 'step_size', 'experiment', 'loss']]\
.groupby(['batch_size', 'step_size', 'experiment'], as_index=False)\
.min()
sns.set()
ax = sns.lineplot(x='step_size', y='loss', hue='batch_size', data=best_losses, err_style='band')
ax.set_yscale('log')
ax.set_xscale('log')
plt.show()
Here is what I got:
Seaborn’s hue coding is a bit confusing, since we don’t really have a batch size of 0, but it is clear that darker is higher. We can see that for small enough step sizes the method does not perform well. It is not surprising - taking small steps leads to slow convergence, and we are going to need much more than 10 epochs to converge. However, for a large range of step sizes beginning at approximately \(0.1\), the performance of the method is quite stable. Even for step sizes as large as \(10^2\), the method does not diverge!
Moreover, we clearly see that mini-batches improve the performance whenever we do converge to something plausible. Let’s plot the same data, but focused on step sizes of at least \(0.1\).
focused_losses = best_losses[best_losses['step_size'] >= 0.1]
sns.set()
ax = sns.lineplot(x='step_size', y='loss', hue='batch_size', data=focused_losses, err_style='band')
ax.set_yscale('log')
ax.set_xscale('log')
plt.show()
Here is the resulting plot:
Indeed, darker colors, which correspond to higher batch sizes, lead to generally better results.
Implementing a mini-batch version of the stochastic proximal point method poses a real challenge. This time, instead of solving simple one-dimensional optimization problems in each iteration we need to work harder, and solve \(\vert B \vert\) dimensional optimization problems. But we are seeing an interesting pattern - solving a large-scale optimization problem using a stochastic algorithm is done by solving a sequence of very small-scale classical optimization problems.
Consequently, we are going to talk about methods for solving the small classical optimization problems of maximizing the function \(q(s)\) we derived in this post. Using the above, we will be able to build mini-batch optimizers for a variety of convex-on-linear losses. Stay tuned!
Our setup is minimizing the average loss of the form
\[\frac{1}{n} \sum_{i=1}^n \bigl[ \underbrace{\phi(g_i(x)) + r(x)}_{f_i(x)} \bigr],\]where \(\phi\) is a one-dimensional convex function, \(g_1, \dots, g_n\) are continuously differentiable, and the regularizer \(r\) is convex as well. Let’s take a minute to grasp the formula in the machine learning context: each of the functions \(g_i\) produces some affinity score of the \(i^{\text{th}}\) training sample with respect to the parameters \(x\), and the result is fed to \(\phi\) which produces the final loss, which is regularized with \(r\).
Before seeing some examples of why such a decomposition of the losses into the inner parts \(g_i\) and the outer part \(\phi\) is useful, one remark is in order. Note that since \(g_i\) can be an arbitrary function, we are dealing with a non-convex optimization problem, meaning that usually any hope of finding an optimum is lost. However, the practice of machine learning shows us that more often than not, stochastic optimization methods do produce reasonable solutions.
The first example is an arbitrary model with the L2 (mean squared error) loss for a regression problem. Namely, we have the training set \(\{ (w_i, \ell_i) \}_{i=1}^n\) with arbitrary labels \(\ell_i \in \mathbb{R}\), and we are training a model \(\sigma(w, x)\) which predicts the label of input \(w\) given parameters \(x\) by minimizing
\[\frac{1}{n} \sum_{i=1}^n \Bigl[ \tfrac{1}{2} \bigl( \underbrace{\sigma(w_i, x) - \ell_i}_{g_i(x)} \bigr)^2 + r(x) \Bigr].\]In this case, we have the outer loss \(\phi(t)=\tfrac{1}{2}t^2\). The model \(\sigma\) can be arbitrarily complex, i.e. a neural network or a factorization machine.
The second is an arbitrary model with the logistic loss. Namely, we have the training set \(\{(w_i, \ell_i)\}_{i=1}^n\) with binary labels \(\ell_i \in \{0,1\}\). We are training a model \(\sigma(w, x)\) whose prediction on input \(w\) given parameters \(x\) is the sigmoid
\[p(w,x)= \frac{1}{1+e^{-\sigma(w,x)}},\]which is trained by minimizing the regularized cross-entropy losses
\[\frac{1}{n} \sum_{i=1}^n \Bigl[ -\ell_i \ln(p(w_i, x)) - (1 - \ell_i) \ln(1 - p(w_i, x)) + r(x) \Bigr].\]At first glance the above formula does not appear to fall into our framework, but after some algebraic manipulation we can transform it into
\[\frac{1}{n} \sum_{i=1}^n \Bigl[\ln(1+\exp(\underbrace{(1-2\ell_i) \sigma(w_i, x)}_{g_i(x)})) + r(x) \Bigr].\]In this case, we have the convex outer loss \(\phi(t)=\ln(1+\exp(t))\). The model \(\sigma\) can be, again, arbitrarily complex.
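To double-check the manipulation, it is enough to verify the two label values separately, using \(1 - p(w, x) = 1/(1 + e^{\sigma(w, x)})\):

```latex
\ell_i = 1:\quad -\ln(p(w_i, x)) = \ln\bigl(1 + e^{-\sigma(w_i, x)}\bigr)
       = \ln\bigl(1 + \exp((1 - 2 \cdot 1)\,\sigma(w_i, x))\bigr), \\
\ell_i = 0:\quad -\ln(1 - p(w_i, x)) = \ln\bigl(1 + e^{\sigma(w_i, x)}\bigr)
       = \ln\bigl(1 + \exp((1 - 2 \cdot 0)\,\sigma(w_i, x))\bigr).
```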
Many other training setups fall into this framework, including training with the hinge loss, the mean absolute deviation, and many more. Note that \(\phi\) does not even have to be differentiable.
SGD-type methods eliminate the complexity imposed by arbitrary loss functions by using their linear approximations. We discuss a similar approach: eliminate the complexity imposed by the arbitrary functions \(g_i\) by approximating them using a tangent. However, we keep the regularizer and the outer loss intact, without approximating them. Concretely, at each iteration we select \(g \in \{ g_1, \dots, g_n \}\) and compute:
\[x_{t+1} =\operatorname*{argmin}_x \Biggl\{ \phi(\underbrace{g(x_t)+ \nabla g(x_t)^T( x - x_t) }_{\text{Linear approx. of } g(x)}) + r(x) + \frac{1}{2\eta} \|x - x_t\|_2^2 \Biggr\}.\]The partial approximation idea is not new, and in the non-stochastic setup dates back to the Gauss-Newton algorithm from 1809. The stochastic setup, which is what we are dealing with in this post, has been recently analyzed in several papers^{1}^{2}, and surprisingly has been shown to enjoy, under some technical assumptions, the same step-size stability properties we observed for the proximal point approach, provided that \(\phi\) is bounded below. For the two prototypical examples at the beginning of this post \(\phi\) is indeed bounded below: \(\phi(t) \geq 0\) for both of them. The boundedness property holds almost ubiquitously in machine learning.
In our approximation above, the term inside \(\phi\) is linear in \(x\), and can be explicitly re-written as
\[x_{t+1} = \operatorname*{argmin}_x \Biggl\{ \phi(\underbrace{\nabla g(x_t)}_a \ ^T x + \underbrace{g(x_t) - \nabla g(x_t)^T x_t}_b) + r(x) + \frac{1}{2\eta} \|x - x_t\|_2^2 \Biggr\}, \tag{PL}\]which falls exactly into the regularized convex-on-linear framework. For the above problem we have already devised an efficient algorithm for computing \(x_{t+1}\), and wrote a Python implementation. The algorithm we devised relies on the proximal operator of the regularizer, \(\operatorname{prox}_{\eta r}\), and on the convex conjugate of the outer loss, \(\phi^*\). It consists of the following steps, with \(a\) and \(b\) taken from equation (PL) above:
Before diving into an implementation, let’s take a short break to appreciate the method from an intuitive standpoint. In the last post we saw, visually, that the proximal point algorithm takes a step which differs from SGD both in its direction and its length. Let’s see what the prox-linear method does.
For simplicity, imagine we have no regularization. In that case, step (2) in the algorithm above becomes
\[x_{t+1} = x_t - \eta s^* \nabla g(x_t)\]If we used SGD instead, it would be
\[x_{t+1} = x_t - \eta \, \phi'(g(x_t)) \nabla g(x_t)\]Looks similar? It seems that the prox-linear method differs from SGD only in the step’s length, but not in its direction - both go in the direction of \(\nabla g(x_t)\). Personally, I would be surprised if this were not the case. We are relying on a linear approximation of \(g\), so it is only natural that the step is in the direction dictated by its gradient.
From a high-level perspective, the prox-linear method adapts a step length to each training sample, but makes no attempt to learn from the history of the training samples to adapt a step-size to each coordinate. In contrast, methods such as AdaGrad, Adam, and others of similar nature adapt a custom step-size to each coordinate based on the entire training history, but make no attempt to adapt to each training sample separately. These seem like two orthogonal concerns, and it might be interesting to see whether they can somehow be combined to construct an even better optimizer.
A discussion about adapting a step length to each iteration of the optimizer is not complete without recalling another famous approach - the Polyak step-size^{3}. It has nice convergence properties for deterministic convex optimization, and a variant has been recently analyzed in the stochastic, machine-learning centric setting in a paper by Loizou et al.^{4}.
Since it is simple to handle and widely used, let’s implement the algorithm for \(L2\) regularization \(r(x)=\frac{\alpha}{2} \|x\|_2^2\). Recalling that \(\operatorname{prox}_{\eta r}(u) = \frac{1}{1+\eta \alpha} u\), the algorithm amounts to:
Looking at the second step, some machine learning practitioners may recognize a variant of the well-known weight decay idea - the algorithm decays the parameters of the model by the factor \(1 + \eta \alpha\), after performing something that seems like a gradient step.
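Putting the two steps together, my own compact restatement of the \(L2\)-regularized update described above reads:

```latex
x_{t+1} = \frac{1}{1 + \eta\alpha}\left( x_t - \eta\, s^* \nabla g(x_t) \right),
```

a gradient-like step with the sample-adapted coefficient \(s^*\), followed by decay with the factor \(1 + \eta\alpha\) - exactly the weight-decay pattern mentioned above.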
The first step in the algorithm above depends on the outer loss \(\phi\), while the second step can be performed by a generic optimizer. Since we need some machinery to compute \(\nabla g_i\), we will implement the generic component as a full-fledged PyTorch optimizer and rely on the autograd mechanism built into PyTorch to compute \(\nabla g_i\).
import torch

class ProxLinearL2Optimizer(torch.optim.Optimizer):
    def __init__(self, params, step_size, alpha, outer):
        if not 0 <= step_size:
            raise ValueError(f"Invalid step size: {step_size}")
        if not 0 <= alpha:
            raise ValueError(f"Invalid regularization coefficient: {alpha}")
        defaults = dict(step_size=step_size, alpha=alpha)
        super(ProxLinearL2Optimizer, self).__init__(params, defaults)
        self._outer = outer

    def step(self, inner):
        inner_val = inner()  # g(x_t)
        outer = self._outer
        loss_noreg = outer.eval(inner_val)  # phi(g(x_t))

        # compute the coefficients (c, d) of the dual problem
        c = inner_val
        d = 0
        for group in self.param_groups:
            eta = group['step_size']
            alpha = group['alpha']
            for p in group['params']:
                if p.grad is None:
                    continue
                c -= eta * alpha * torch.sum(p.data * p.grad).item() / (1 + eta * alpha)
                d += eta * torch.sum(p.grad * p.grad).item() / (1 + eta * alpha)

        # solve the dual problem
        s_star = outer.solve_dual(d, c)

        # update the parameters: gradient-like step followed by decay
        for group in self.param_groups:
            eta = group['step_size']
            alpha = group['alpha']
            for p in group['params']:
                if p.grad is None:
                    continue
                p.data.sub_(eta * s_star * p.grad)
                p.data.div_(1 + eta * alpha)

        # return the loss without regularization terms, like other PyTorch optimizers do, so that we can compare them.
        return loss_noreg
We could adapt the optimizer above to support sparse tensors, and add special handling for the case of no regularization, namely \(\alpha = 0\), but for a blog post this is enough. We can now train any model we like using our optimizer with a variant of the standard PyTorch training loop, for example:
model = create_model()  # this is the model "sigma" from the beginning of this post
optimizer = ProxLinearL2Optimizer(model.parameters(), learning_rate, reg_coef, SomeOuterLoss())
for input, target in data_set:
    def closure():
        pred = model(input)
        inner = inner_loss(pred, target)  # this is g_i
        inner.backward()  # this computes the gradient of g_i
        return inner.item()
    optimizer.step(closure)
For our two prototypical scenarios we have already implemented the outer losses in previous posts, so I will just repeat them below for completeness:
import torch
import math

# 0.5 * t^2
class SquaredSPPLoss:
    def solve_dual(self, d, c):
        return c / (1 + d)

    def eval(self, inner):
        return 0.5 * (inner ** 2)

# log(1+exp(t))
class LogisticSPPLoss:
    def solve_dual(self, d, c):
        def qprime(s):
            return -d * s + c + math.log(1 - s) - math.log(s)

        # compute [l,u] containing a point with zero qprime
        l = 0.5
        while qprime(l) <= 0:
            l /= 2
        u = 0.5
        while qprime(1 - u) >= 0:
            u /= 2
        u = 1 - u

        # bisection
        while u - l > 1E-16:  # should be accurate enough
            mid = (u + l) / 2
            if qprime(mid) == 0:
                return mid
            if qprime(l) * qprime(mid) > 0:
                l = mid
            else:
                u = mid
        return (u + l) / 2

    def eval(self, inner):
        return math.log(1 + math.exp(inner))
Let’s see how our optimizer does for training a factorization machine. I will be using the results of my colleague Yoni Gottesman’s phenomenal tutorial Movie Recommender from Pytorch to Elasticsearch, where he trains a factorization machine on the MovieLens 1M data-set to create a movie recommendation engine. He also shared a Jupyter notebook with the code. Readers unfamiliar with factorization machines are encouraged to read this, or any other tutorial, since I assume a basic understanding of the concept and its PyTorch implementation.
Since my focus is on optimization, e.g. training a model, I will assume that we have done all the data preparation, and we have the table created in Yoni’s notebook in a file called dataset.csv
.
Each feature value is associated with an index, and \(m\) is the total number of distinct feature values.
We will be using a second-order factorization machine \(\sigma(w, x)\) for our recommendation prediction task. The model \(\sigma(w, x)\) is given a binary input \(w \in \{0, 1\}^m\) which encodes the feature values of the training sample, its parameter vector is \(x = (b_0, b_1, \dots, b_m, v_1, \dots, v_m)\) which is composed of the model’s bias \(b_0 \in \mathbb{R}\), the feature value biases \(b_1, \dots, b_m\in \mathbb{R}\), and the feature value latent vectors \(v_1, \dots, v_m \in \mathbb{R}^k\), where \(k\) is our embedding dimension. The model computes:
\[\sigma(w, x) := b_0 + \sum_{i = 1}^m w_i b_i + \sum_{i = 1}^m\sum_{j = i + 1}^{m} (v_i^T v_j) w_i w_j\]As the embedding dimension \(k\) increases, the model has more parameters to represent the data. Here is a PyTorch implementation of the model above, adapted from Yoni’s blog:
import torch
from torch import nn

def trunc_normal_(x, mean=0., std=1.):
    "Truncated normal initialization."
    # From https://discuss.pytorch.org/t/implementing-truncated-normal-initializer/4778/12
    return x.normal_().fmod_(2).mul_(std).add_(mean)

class FMModel(nn.Module):
    def __init__(self, m, k):
        super().__init__()
        self.b0 = nn.Parameter(torch.zeros(1))
        self.bias = nn.Embedding(m, 1)        # b_1, \dots, b_m
        self.embeddings = nn.Embedding(m, k)  # v_1, \dots, v_m
        # See https://arxiv.org/abs/1711.09160
        with torch.no_grad():
            trunc_normal_(self.embeddings.weight, std=0.01)
        with torch.no_grad():
            trunc_normal_(self.bias.weight, std=0.01)

    def forward(self, w):
        # Fast impl. of pairwise interactions. See lemma 3.1 from paper:
        # Steffen Rendle. Factorization Machines. In ICDM, 2010.
        emb = self.embeddings(w)  # tensor of size (1, num_input_features, embed_dim)
        pow_of_sum = emb.sum(dim=1).pow(2)
        sum_of_pow = emb.pow(2).sum(dim=1)
        pairwise = 0.5 * (pow_of_sum - sum_of_pow).sum(dim=1)
        bias_emb = self.bias(w)  # tensor of size (1, num_input_features, 1)
        bias = bias_emb.sum(dim=1)
        return self.b0 + bias.squeeze() + pairwise.squeeze()
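The forward pass relies on the pairwise-interaction identity from Rendle’s lemma 3.1, which the pow_of_sum / sum_of_pow lines implement. A quick numerical check of that identity on made-up vectors:

```python
import numpy as np

# Check: sum_{i<j} v_i^T v_j == 0.5 * (||sum_i v_i||^2 - sum_i ||v_i||^2)
rng = np.random.default_rng(0)
V = rng.normal(size=(5, 3))  # five latent vectors of dimension 3

naive = sum(V[i] @ V[j] for i in range(5) for j in range(i + 1, 5))
fast = 0.5 * ((V.sum(axis=0) ** 2).sum() - (V ** 2).sum())
print(abs(naive - fast))  # should be ~0
```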
Let’s try to train our model using the L2 loss. To save computational resources for this small example, I am going to train the model with embedding dimension \(k=8\), and only for 5 epochs. Since I rely on the stability properties of the prox-linear optimizer, I chose the step_size
arbitrarily.
import pandas as pd
import torch
from tqdm import tqdm

# create a Dataset object
df = pd.read_csv('dataset.csv')
inputs = torch.tensor(df[['userId_index','movieId_index','age_index','gender_index','occupation_index']].values)
labels = torch.tensor(df['rating'].values).float()
dataset = torch.utils.data.TensorDataset(inputs, labels)

# create the model
model = FMModel(m=inputs.max() + 1, k=8)

# run five training epochs
opt = ProxLinearL2Optimizer(model.parameters(), step_size=0.1, alpha=1E-5, outer=SquaredSPPLoss())
for epoch in range(5):
    train_loss = 0.
    for w, l in tqdm(torch.utils.data.DataLoader(dataset, shuffle=True)):
        opt.zero_grad()
        def closure():
            pred = model(w)
            inner = pred - l
            inner.backward()
            return inner.item()
        train_loss += opt.step(closure)
    print(f'epoch = {epoch}, loss = {train_loss / len(dataset)}')
I got the following output, discarding the tqdm
progress bars:
epoch = 0, loss = 0.5289862413141884
epoch = 1, loss = 0.5224190998653859
epoch = 2, loss = 0.5189374542805951
epoch = 3, loss = 0.5175283396570431
epoch = 4, loss = 0.5169355024401286
Note that our loss is half of the mean squared error, so to get RMSE (root mean-squared error) we need to multiply it by two, and take the square root. The last epoch corresponds roughly to training RMSE of \(\approx 1.02\), which seems reasonable for a low-dimensional model of \(k=8\) and a guessed step-size which was not tuned in any way.
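As a quick sanity check of that conversion, using the last epoch’s loss value from the output above:

```python
import math

# loss reported for the last epoch (half the mean squared error)
loss = 0.5169355024401286

# MSE = 2 * loss, RMSE = sqrt(MSE)
rmse = math.sqrt(2 * loss)
print(round(rmse, 2))  # 1.02
```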
Now, let’s do a more serious experiment which will test the stability of our optimizer w.r.t the step-size choice, and compare it to AdaGrad. As in our first post, we will train our model with various step-sizes; for each step-size we will perform three experiments, and each experiment trains the model for 10 epochs and takes the training loss from the best epoch. The multiple experiments exist to take into account the stochastic nature of training-sample selection. Then, we will plot a graph of the average regularized^{5} training loss achieved for each step-size among the multiple experiments. The code is derived from what we have above.
import numpy as np
import pandas as pd
import torch
from tqdm import tqdm
import seaborn as sns
import matplotlib.pyplot as plt

# create a Dataset object
df = pd.read_csv('dataset.csv')
inputs = torch.tensor(df[['userId_index', 'movieId_index', 'age_index', 'gender_index', 'occupation_index']].values)
labels = torch.tensor(df['rating'].values).float()
dataset = torch.utils.data.TensorDataset(inputs, labels)
sampler = torch.utils.data.RandomSampler(dataset)  # iterate over sample indices in random order

# setup experiment parameters and results
alpha = 1E-5
epochs = range(10)
step_sizes = np.geomspace(0.001, 100, num=10)
experiments = range(0, 3)
exp_results = pd.DataFrame(columns=['optimizer', 'step_size', 'experiment', 'epoch', 'loss'])

# run prox-linear experiment
for step_size in step_sizes:
    for experiment in experiments:
        model = FMModel(m=inputs.max() + 1, k=8)
        opt = ProxLinearL2Optimizer(model.parameters(), step_size=step_size, alpha=alpha, outer=SquaredSPPLoss())
        for epoch in epochs:
            train_loss = 0.
            train_loss_reg = 0.
            desc = f'ProxLinear: step_size = {step_size}, experiment = {experiment}, epoch = {epoch}'
            for idx in tqdm(sampler, desc=desc):
                w, l = dataset[idx]
                opt.zero_grad()

                def closure():
                    inner = model(w) - l
                    inner.backward()
                    return inner.item()

                loss = opt.step(closure)
                train_loss += loss
                train_loss_reg += (loss + alpha * model.params_l2().item())
            train_loss /= len(dataset)
            train_loss_reg /= len(dataset)
            print(f'train_loss = {train_loss}, train_loss_reg = {train_loss_reg}')
            exp_results = exp_results.append(pd.DataFrame.from_dict(
                {'optimizer': 'prox-linear',
                 'step_size': step_size,
                 'experiment': experiment,
                 'epoch': epoch,
                 'loss': [train_loss_reg]}), sort=True)

# run ada-grad experiment
for step_size in step_sizes:
    for experiment in experiments:
        model = FMModel(m=inputs.max() + 1, k=8)
        opt = torch.optim.Adagrad(model.parameters(), lr=step_size, weight_decay=alpha)
        for epoch in epochs:
            train_loss = 0.
            train_loss_reg = 0.
            desc = f'Adagrad: step_size = {step_size}, experiment = {experiment}, epoch = {epoch}'
            for idx in tqdm(sampler, desc=desc):
                w, l = dataset[idx]
                opt.zero_grad()

                def closure():
                    loss = 0.5 * ((model(w) - l) ** 2)
                    loss.backward()
                    return loss.item()

                loss = opt.step(closure)
                train_loss += loss
                train_loss_reg += (loss + alpha * model.params_l2().item())
            train_loss /= len(dataset)
            train_loss_reg /= len(dataset)
            print(f'train_loss = {train_loss}, train_loss_reg = {train_loss_reg}')
            exp_results = exp_results.append(pd.DataFrame.from_dict(
                {'optimizer': 'Adagrad',
                 'step_size': step_size,
                 'experiment': experiment,
                 'epoch': epoch,
                 'loss': [train_loss_reg]}), sort=True)

# display the results
best_results = exp_results[['optimizer', 'step_size', 'experiment', 'loss']].groupby(['optimizer', 'step_size', 'experiment'], as_index=False).min()
sns.set()
ax = sns.lineplot(x='step_size', y='loss', hue='optimizer', data=best_results, err_style='band')
ax.set_yscale('log')
ax.set_xscale('log')
plt.show()
And after a few days we obtain the following result:
Seems that the algorithm is indeed stable, and does not diverge even when taking extremely large step sizes. However, AdaGrad does achieve a better loss for its optimal step-size range. To see a better picture, let’s focus on a smaller step size range:
At this smaller range both algorithms seem stable, and AdaGrad even performs better. This is not surprising - we already saw that when using L2 regularization, AdaGrad tends to perform quite well. Let’s compare convergence rates for the optimal step sizes of each algorithm:
# extract convergence rate of each optimizer at its own optimal step size
convrate_pl = exp_results[(exp_results['step_size'] == 0.003593813663804626) & (exp_results['optimizer'] == 'prox-linear')].drop(['step_size'], axis=1)
convrate_adagrad = exp_results[(exp_results['step_size'] == 0.1668100537200059) & (exp_results['optimizer'] == 'Adagrad')].drop(['step_size'], axis=1)
convrate = convrate_pl.append(convrate_adagrad)
# plot the results
sns.lineplot(x='epoch', y='loss', hue='optimizer', data=convrate)
plt.show()
Indeed, the rate of AdaGrad’s convergence is substantially better. Seems that adapting a step size to each coordinate is more beneficial than adapting a step size to each training sample via the prox-linear algorithm. But we also saw that this approach really shines when the regularization is not L2, and we can actually benefit from not approximating the regularizer.
Despite its drawbacks, it’s still a reasonable algorithm to use when we don’t have the resources to tune the step size, but to get good results we will have to use more epochs to let it converge to a better solution. But our journey is not done - we will improve our optimizers using the framework we develop and see where it gets us.
For simplicity, suppose that we have no regularization (\(r=0\)). An interesting approach suggested in the paper^{1} by Asi & Duchi is using truncated approximations for losses which are bounded below, namely, \(f_i \geq 0\). Luckily, most losses in machine learning are bounded below, and we can replace each \(f_i\) with
\[\max(0, f_i(x)),\]and treat \(\phi(t)=\max(0, t)\) as the outer loss and the original loss \(f_i\) as the inner loss. The computational step of the prox-linear method using the above inner/outer decomposition for \(f \in \{f_1, \dots, f_n\}\) becomes:
\[x_{t+1} = \operatorname*{argmin}_x \left\{ \max(0, f(x_t) + \nabla f(x_t)^T (x - x_t)) + \frac{1}{2\eta} \|x - x_t\|_2^2 \right\}.\]The above formula differs from regular SGD only in the fact that the linear approximation of \(f\) is truncated at zero when it becomes negative, and is based on a simple intuition: if the loss is bounded below by zero, we shouldn’t allow its approximation to be unbounded.
Remarkably, the above simple idea was proven by Asi & Duchi to enjoy similar stability properties, when the only information about the loss we exploit is the fact that it has a lower bound. By following the convex-on-linear solution recipe, we obtain an explicit formula for computing \(x_{t+1}\):
\[x_{t+1} = x_t - \min\left(\eta, \frac{f(x_t)}{\|\nabla f(x_t) \|_2^2} \right) \nabla f(x_t).\]That is, when the ratio of \(f(x_t)\) to the squared norm of the gradient \(\nabla f(x_t)\) is large enough, we take a regular SGD step of size \(\eta\); otherwise, we shorten the step to that ratio. This remarkably simple formula is enough to substantially improve the stability properties of SGD, both in theory and in practice. The reason we do not deal with this approach further in this blog post is that Hilal Asi, one of the authors of the above-mentioned paper^{1}, has already provided truncated-approximation optimizers for both PyTorch and TensorFlow in this GitHub repo.
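The truncated step can be sketched in a few lines of NumPy. This is a minimal illustration of the formula, not the optimizers from the linked repo, and the function name truncated_sgd_step is mine:

```python
import numpy as np

def truncated_sgd_step(x, f_val, grad, eta):
    """One step of SGD with the loss truncated at zero:
    x_{t+1} = x_t - min(eta, f(x_t) / ||grad f(x_t)||^2) * grad f(x_t)."""
    grad_sq = np.dot(grad, grad)
    if grad_sq == 0:
        return x  # zero gradient: nothing to do
    step = min(eta, f_val / grad_sq)
    return x - step * grad

# example: f(x) = 0.5 * ||x||^2, so f(x) / ||grad f(x)||^2 = 0.5 everywhere;
# even with a huge eta the step never overshoots the minimizer at the origin
x = np.array([2.0, -2.0])
f_val = 0.5 * np.dot(x, x)   # = 4.0
grad = x                     # gradient of 0.5 * ||x||^2
print(truncated_sgd_step(x, f_val, grad, eta=10.0))  # [ 1. -1.]
```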
The convex-on-linear framework has proven to be powerful enough to allow us to efficiently train arbitrary models using the prox-linear algorithm when taking individual training samples. Each training sample can be thought of as an approximation of the entire average loss, but the error of this approximation can be quite large. A standard practice to reduce this error is to use a mini-batch of training samples. In the next post we will discuss an efficient implementation for a mini-batch of convex-on-linear functions. Then, we will be able to derive a more generic prox-linear optimizer which can train arbitrary models using mini-batches of training samples.
Footnotes and references
Asi, H. & Duchi, J. (2019). Stochastic (Approximate) Proximal Point Methods: Convergence, Optimality, and Adaptivity. SIAM Journal on Optimization 29(3) (pp. 2257–2290) ↩ ↩^{2} ↩^{3}
Davis, D. & Drusvyatskiy, D. (2019). Stochastic Model-Based Minimization of Weakly Convex Functions. SIAM Journal on Optimization 29(1) (pp. 207–239) ↩
Polyak B. T. (1987) Introduction to Optimization. Optimization and Software, Inc., New York ↩
Loizou N. & Vaswani S. & Laradji I. & Lacoste-Julien S. (2020) Stochastic Polyak Step-size for SGD: An Adaptive Learning Rate for Fast Convergence. Arxiv preprint: https://arxiv.org/abs/2002.10542 ↩
We use a regularized loss, since we are minimizing the regularized loss, and would like to appreciate the performance of optimizers as optimizers. The choice of the regularization parameter which achieves a good validation loss has to be done using standard techniques in machine learning. We may be able to avoid extensive step size tuning, but we still have to find the optimal regularization coefficient. ↩
while the other is the proximal point approach, which computes
\[x_{t+1} = \operatorname*{argmin}_x \Biggl \{ P_t(x) \equiv \underbrace{\color{blue}{ f(x) }}_{f \text{ itself}} + \frac{1}{2\eta}\|x - x_t\|_2^2 \Biggr \}. \tag{PP}\]The idea is that exploiting the loss \(f\) itself, instead of approximating it, can prove advantageous, despite the additional burden of handling \(f\) exactly. At first glance, the two methods are contrasting approaches that have nothing in common. But is it really so?
In previous posts we concentrated on devising efficient implementations of the proximal point approach. Here we try to understand what the proximal point method really is, beyond the simple idea of “let’s use the loss directly, instead of its linear approximation”. I will do a lot of hand-waving and simplifications, and avoid most mathematical rigor, to make the ideas as accessible as possible. Apologies to the mathematically inclined readers.
The proximal point approach aims, at each step, to compute \(x_{t+1}\) by minimizing \(P_t\) above. Assume, for simplicity, that \(f\) is differentiable. Recall, that in that case being a minimizer of \(P_t\) implies that the gradient at that point is zero, namely, \(\nabla P_t(x_{t+1}) = 0\). Writing explicitly, we obtain
\[\nabla f(x_{t+1}) + \frac{1}{\eta}(x_{t+1} - x_t) = 0,\]and by re-arranging we obtain
\[x_{t+1} = x_t - \eta \nabla f(x_{t+1}).\]Looks similar to SGD, but the gradient is taken at \(x_{t+1}\) and not at \(x_t\)! In other words, the proximal point method makes a step in the direction of the gradient at the next iterate, in contrast to SGD-type methods which use the gradient at the current iterate, hence the name ‘forward looking’.
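To see why this matters for stability, consider the one-dimensional quadratic \(f(x) = \frac{c}{2}x^2\): the implicit equation \(x_{t+1} = x_t - \eta c \, x_{t+1}\) solves to \(x_{t+1} = x_t/(1+\eta c)\), while explicit SGD gives \(x_{t+1} = (1-\eta c)x_t\). A tiny sketch (a toy example of my own) contrasts the two for a deliberately too-large step size:

```python
def explicit_step(x, c, eta):
    # SGD: gradient of (c/2) x^2 taken at the current iterate
    return x - eta * c * x

def proximal_step(x, c, eta):
    # proximal point: gradient taken at the next iterate, solved in closed form
    return x / (1 + eta * c)

x_sgd, x_prox = 1.0, 1.0
c, eta = 1.0, 3.0  # step size deliberately too large for SGD
for _ in range(5):
    x_sgd = explicit_step(x_sgd, c, eta)
    x_prox = proximal_step(x_prox, c, eta)
print(x_sgd)   # -32.0: multiplies by (1 - 3) = -2 each step, diverges
print(x_prox)  # 0.0009765625: multiplies by 1/4 each step, converges
```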
Recalling that the gradient is orthogonal (perpendicular) to its level contour, the following pair of images provides a visual insight into the difference between the two approaches (generated by this code).
Gradient step - perpendicular to the level contour at \(x_t\). | Proximal step - perpendicular to the level contour at \(x_{t+1}\). |
The proximal step makes a ‘straight shot’ toward the next, hopefully lower level contour. So, intuitively, it is less probable to miss with a step-size which is a bit too large, and thus is stable w.r.t the step-size choice.
Let’s assume, for simplicity, that \(f\) is convex, since the results provide a nice intuition. In the last post, we saw that the proximal point step in equation (PP) above can alternatively be written using Moreau’s proximal operator:
\[x_{t+1} = \operatorname{prox}_{\eta f}(x_t)\]We also saw that the gradient of \(M_{\eta} f\), the Moreau envelope of \(f\), is given by the formula
\[\nabla M_\eta f(x) = \frac{1}{\eta}(x - \operatorname{prox}_{\eta f}(x)),\]which can be re-arranged into
\[\operatorname{prox}_{\eta f}(x) = x - \eta \nabla M_\eta f(x).\]So, the proximal point method is, in fact:
\[x_{t+1} = x_t - \eta \nabla M_{\eta} f(x_t).\]Whoa! So, it seems that the proximal point step is nothing but the gradient step, but applied to the Moreau envelope of the loss, instead of the loss itself. In other words, it seems that the stochastic proximal point algorithm is just SGD with each loss replaced by its Moreau envelope. Why should we apply SGD to the envelopes of the incurred losses, instead of the losses themselves? That will become apparent in the next section.
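We can check the gradient identity numerically in a case where everything has a closed form. For \(r(x)=\frac{1}{2}\|x\|_2^2\) we have \(\operatorname{prox}_{\eta r}(u)=u/(1+\eta)\) and \(M_\eta r(u)=\|u\|_2^2/(2(1+\eta))\), whose gradient is \(u/(1+\eta)\); a short sketch verifies that the formula \(\frac{1}{\eta}(u-\operatorname{prox}_{\eta r}(u))\) agrees:

```python
import numpy as np

eta = 0.5
u = np.array([3.0, -1.0, 2.0])

prox = u / (1 + eta)          # closed-form prox of r(x) = 0.5 * ||x||^2
grad_direct = u / (1 + eta)   # gradient of M_eta r(u) = ||u||^2 / (2 * (1 + eta))
grad_formula = (u - prox) / eta  # Moreau's gradient identity

print(np.allclose(grad_direct, grad_formula))  # True
```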
Both SGD-type methods and the proximal point method share a common structure:
\[x_{t+1} = \operatorname*{argmin}_x \left\{ \color{blue}{\text{approx. of the loss}} + \color{red}{\text{proximity to } x_t} \right\}\]We know the role of the blue part - its role is to carry information about the loss w.r.t the current training sample. But what is the role of the proximity term? Well, recall that our aim is to optimize the entire average
\[F(x)= \frac{1}{n} \sum_{i=1}^n f_i(x)\]and not just the loss incurred by one training sample \(f \in \{ f_1,\dots, f_n \}\). How do we know how close we are to minimizing \(F\)? Well, at optimality we certainly should have \(\nabla F(x) = 0\), so one natural measure of ‘how far away from optimality are we’, which we call an optimality gap, is the gradient norm \(\| \nabla F(x) \|\). Another optimality gap may be \(F(x) - \inf_x F(x)\), namely, how far away is the function value at \(x\) from the optimal value?
In contrast to the blue part, the red part’s role is carrying information about the entire average \(F\) by providing a bound on ‘how much would we ruin the optimality gap by moving from \(x_t\) to \(x\)?’. So each iteration in fact is about balancing between decreasing the loss incurred by the current training sample, while avoiding ruining the optimality gap for the entire training set. A properly chosen step-size will create such balance between the two opposing forces such that the iteration sequence will drive \(F\) towards optimality.
Now let’s do some hand-waving, without any formal analysis. Both methods use the weighted squared Euclidean norm \(\frac{1}{2 \eta}\| x - x_t\|_2^2\) as their proximity measure, meaning that they assume it can somehow bound our optimality gap at \(x\) versus \(x_{t}\). This is not true in general, but it is true for functions whose gradient changes slowly: if we take two nearby points, the gradients at these points are close to each other as well. In formal language, papers occasionally say that \(F\) has a Lipschitz-continuous gradient.
So now let’s get back to the Moreau envelope viewpoint. A remarkable property of Moreau envelopes of convex functions is that they indeed possess the above-mentioned property of having a Lipschitz continuous gradient, and therefore comprise perfect candidates for applying SGD-type methods. Readers who are interested in reading a more formal and rigorous introduction to Moreau envelopes and proximal operators and their relationships to optimization are encouraged to read this or that book, or this excellent tutorial.
As promised, we are going to use the fundamentals we have so far to develop proximal-point based algorithms for more advanced scenarios, such as training factorization machines and neural networks, and handle the mini-batch setting. Stay tuned!
to losses of the form
\[f(x)=\underbrace{\phi(a^T x + b)}_{\text{data fitting}} + \underbrace{r(x)}_{\text{regularizer}},\]where \(\phi\) and \(r\) are convex functions, and devised an efficient implementation for L2 regularized losses: \(r(x) = (\lambda/2) \|x\|_2^2\). We now aim for a more general approach, which will allow us to deal with many other regularizers. One important example is L1 regularization \(r(x)=\lambda \|x\|_1 = \sum_{i=1}^d \vert x_i \vert\), since it is well-known that it promotes a sparse parameter vector \(x\). It is useful when we have a huge number of features, and the non-zero components of \(x\) select only the features which have meaningful effect on the model’s predictive power.
Recall that implementing the stochastic proximal point method amounts to solving the one-dimensional problem dual to the optimizer’s step (S) by maximizing:
\[q(s)=\color{blue}{\inf_x \left \{ \mathcal{Q}(x,s) = r(x)+\frac{1}{2\eta} \|x-x_t\|_2^2 + s a^T x \right \}} - \phi^*(s)+sb. \tag{A}\]Having obtained the maximizer \(s^*\) we compute \(x_{t+1} = \operatorname*{argmin}_x \left\{ \mathcal{Q}(x,s^*) \right\}\).
The part highlighted in blue is where the regularizer comes in, and is a major barrier between a practitioner and the method’s advantages - the practitioner has to mathematically re-derive the minimization of \(\mathcal{Q}(x,s)\) and re-implement the resulting derivation for every regularizer. Seems quite impractical, and since \(r(x)\) can be arbitrarily complex, it might even be impossible.
Unfortunately, we cannot remove this obstacle entirely, but we can express it in terms of a “textbook recipe” - something a practitioner can pick from a catalog in a textbook and use, instead of re-deriving. In fact, that’s how most optimization methods work. For example, SGD is based on the textbook recipe of the ‘gradient’, for which we have explicit computation rules; the stochastic proximal point method we derived in the last post is based on the textbook recipe of the ‘convex conjugate’; and in this post we will introduce and use yet another textbook recipe.
Our journey begins with a trick taught in high-school algebra classes, known as “completing the square” - we re-arrange the formula for the square of a sum \((a+b)^2=a^2+2ab+b^2\) into the form
\[a^2+2ab=(a+b)^2-b^2.\]Such a trick is occasionally useful to express things in terms of squares only. We do a similar trick on the formula for the squared Euclidean norm:
\[\frac{1}{2}\|a + b\|_2^2= \frac{1}{2}\|a\|_2^2+a^T b+\frac{1}{2}\|b\|_2^2. \tag{a}\]Re-arranging, we obtain
\[\frac{1}{2}\|a\|_2^2+a^T b = \frac{1}{2}\|a + b\|_2^2 - \frac{1}{2}\|b\|_2^2. \tag{b}\]Now, let’s apply the trick to the term \(\mathcal{Q}(x,s)\) inside the infimum in the definition of the dual problem. It is a bit technical, but the end result leads us to our desired textbook recipe.
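Identity (b) is easy to verify numerically; here is a tiny sanity-check sketch in the scalar case:

```python
import random

random.seed(0)
# scalar instance of identity (b): a^2/2 + a*b == (a + b)^2/2 - b^2/2
for _ in range(1000):
    a = random.uniform(-10, 10)
    b = random.uniform(-10, 10)
    lhs = a ** 2 / 2 + a * b
    rhs = (a + b) ** 2 / 2 - b ** 2 / 2
    assert abs(lhs - rhs) < 1e-9
print('identity (b) holds')
```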
\[\begin{align} \mathcal{Q}(x,s) &= r(x)+\frac{1}{2\eta} \|x-x_t\|_2^2 + s a^T x \\ &= \frac{1}{\eta} \left[ \eta r(x) + \frac{1}{2} \|x - x_t\|_2^2 + \eta s a^T x \right] & \leftarrow \text{Factoring out } \frac{1}{\eta} \\ &= \frac{1}{\eta} \left[ \eta r(x) + \color{orange}{\frac{1}{2} \|x\|_2^2 - (x_t - \eta s a )^T x} + \frac{1}{2} \|x_t\|_2^2 \right] & \leftarrow \text{opening } \frac{1}{2}\|x-x_t\|_2^2 \text{with (a)} \\ &= \frac{1}{\eta} \left[ \eta r(x) + \color{orange}{\frac{1}{2} \|x - (x_t - \eta s a)\|_2^2 - \frac{1}{2} \|x_t - \eta s a\|_2^2 } + \frac{1}{2} \|x_t\|_2^2 \right] & \leftarrow \text{square completion with (b)}\\ &= \left[ r(x)+\frac{1}{2\eta} \|x - (x_t - \eta s a)\|_2^2 \right] - \frac{1}{2\eta} \|x_t - \eta s a\|_2^2 + \frac{1}{2\eta} \|x_t\|_2^2 &\leftarrow{\text{Multiplying by }\frac{1}{\eta}} \\ &= \left[ r(x)+\frac{1}{2\eta} \|x - (x_t - \eta s a)\|_2^2 \right] + (a^T x_t) s - \frac{\eta \|a\|_2^2}{2} s^2 &\leftarrow{\text{applying (a) and canceling terms}} \end{align}\]Plugging the above expression for \(\mathcal{Q}(x,s)\) into the formula (A), results in:
\[q(s)= \color{magenta}{ \inf_x \left \{ r(x)+\frac{1}{2\eta} \| x - (x_t - \eta s a)\|_2^2 \right \}} + (a^T x_t + b) s - \frac{\eta \|a\|_2^2}{2} s^2 - \phi^*(s)\]The magenta part may seem unfamiliar, but it is a well-known concept in optimization: the Moreau envelope^{1} of the function \(r(x)\). Let’s get introduced to the concept properly.
Formally, the Moreau envelope of a convex function \(r\) with parameter \(\eta\) is denoted by \(M_\eta r\) and defined by
\[M_\eta r(u) = \inf_x \left\{ r(x) + \frac{1}{2\eta} \|x - u\|_2^2 \right\}. \tag{c}\]Consequently, we can write the function \(q(s)\) of the dual problem as:
\[q(s) = \color{magenta}{M_{\eta} r (x_t - \eta s a)} + (a^T x_t + b) s - \frac{\eta \|a\|_2^2}{2} s^2 - \phi^*(s).\]Now \(q(s)\) is composed of two textbook concepts - the convex conjugate \(\phi^*\), and the Moreau envelope \(M_\eta r\). A related concept is the minimizer of (c) above - Moreau’s proximal operator of \(r\) with parameter \(\eta\):
\[\operatorname{prox}_{\eta r}(u) = \operatorname*{argmin}_x \left\{ r(x)+\frac{1}{2\eta} \|x - u\|_2^2 \right\}.\]Moreau envelopes and proximal operators, introduced in 1965 by the French mathematician Jean Jacques Moreau, create a “smoothed” version of arbitrary convex functions, and are nowadays ubiquitous in modern optimization theory and practice. Since then, the concepts have been used for many other purposes - just Google Scholar it. In fact, the stochastic proximal point method itself can be compactly written in terms of the operator: select \(f \in \{f_1, \dots, f_n\}\) and compute \(x_{t+1} = \operatorname{prox}_{\eta f}(x_t)\).
Before going deeper, let’s recap and explicitly write our “meta-algorithm” for computing \(x_{t+1}\) using the above concepts:
To make things less abstract, let’s look at a one-dimensional example to gain some more intuition: the absolute value function \(r(x) = \lvert x \rvert\). Doing some lengthy calculus, which is out of the scope of this post, we can compute:
\[M_{\eta}r (u) = \inf_x \left\{ \vert x \vert + \frac{1}{2\eta}(x - u)^2\right\} = \begin{cases} \frac{u^2}{2\eta} & \vert u \vert \leq \eta \\ \vert u \vert - \frac{\eta}{2} & \vert u \vert>\eta \end{cases}\]That is, the envelope is a function which looks like a parabola when \(u\) is close enough to 0, and switches to behaving like the absolute value when we get far away. Some readers may recognize this function - it is the well-known Huber function, which is commonly used in statistics as a differentiable approximation of the absolute value. Let’s plot it:
import numpy as np
import matplotlib.pyplot as plt
import math

def huber_1d(eta, u):
    if math.fabs(u) <= eta:
        return (u ** 2) / (2 * eta)
    else:
        return math.fabs(u) - eta / 2

def huber(eta, u):
    return np.array([huber_1d(eta, x) for x in u])

x = np.linspace(-2, 2, 1000)
plt.plot(x, np.abs(x), label='|x|')
plt.plot(x, huber(1, x), label='eta=1')
plt.plot(x, huber(0.5, x), label='eta=0.5')
plt.plot(x, huber(0.1, x), label='eta=0.1')
plt.legend(loc='best')
plt.show()
Here is the resulting plot:
Voilà! A smoothed version of the absolute value function. Smaller values of \(\eta\) lead to a better, but less smooth, approximation. As we said before, this behavior is not unique to the absolute value’s envelope - Moreau envelopes of convex functions are always differentiable, and their gradient is always continuous.
Another interesting thing we can see in the plot is that the envelopes approach the approximated function from below. This is not a coincidence either, since:
\[M_\eta r(u) = \inf_x \left\{ r(x) + \frac{1}{2\eta} \|x - u\|_2^2 \right\} \underbrace{\leq}_{\text{taking }x=u} r(u)+\frac{1}{2\eta}\|u-u\|_2^2=r(u),\]that is, the envelope always lies below the function itself.
Now let’s get back to our meta-algorithm. We need to solve the dual problem by maximizing \(q(s)\), and we typically do it by equating its derivative \(q’(s)\) with zero. Hence, in practice, we are interested in the derivative of \(q\) rather than its value (assuming \(q\) is indeed continuously differentiable). Using the chain rule, we obtain:
\[q'(s) = -\eta a^T ~\nabla M_\eta r(x_t - \eta s a) + (a^T x_t + b) - \eta \|a\|_2^2 s - {\phi^*}'(s). \tag{d}\]Moreau’s exceptional work does not disappoint, and using some clever analysis he derived the following remarkable formula for the gradient of \(M_\eta r\):
\[\nabla M_\eta r(u) = \frac{1}{\eta} \left(u - \operatorname{prox}_{\eta r}(u) \right).\]Substituting this formula into (d), the derivative \(q’(s)\) can be written as
\[\begin{aligned} q'(s) &=-a^T(x_t - \eta s a - \operatorname{prox}_{\eta r}(x_t - \eta s a)) + (a^T x_t+b) - \eta \|a\|_2^2 s -{\phi^*}'(s) \\ &= a^T \operatorname{prox}_{\eta r}(x_t - \eta s a) - {\phi^*}'(s) + b \end{aligned} \tag{DD}\]To conclude, our ingredients for \(q’(s)\) are: a formula for the proximal operator of \(r\), and a formula for the derivative of \(\phi^*\). Since proximal operators are ubiquitous in optimization theory and practice, entire book chapters have been written about them, e.g. see here^{2} and here^{3}. The second reference contains, at the end of the chapter, a catalog of explicit formulas for \(\operatorname{prox}_{\eta r}\) for various functions \(r\), summarized in a table. Here are a few important examples:
| \(r(x)\) | \(\operatorname{prox}_{\eta r}(u)\) | Remarks |
|---|---|---|
| \((\lambda/2) \|x\|_2^2\) | \(\frac{1}{1+\eta \lambda} u\) | |
| \(\lambda \|x\|_1 = \lambda\sum_{i=1}^n \vert x_i \vert\) | \([\vert u \vert -\lambda \eta \mathbf{1}]_+ \cdot \operatorname{sign}(u)\) | \(\mathbf{1}\) is a vector whose components are all 1. \([a]_+\equiv\max(0, a)\) is the ‘positive part’ of \(a\). More details later in this post. |
| \(0\) | \(u\) | no regularizer |
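The catalog entries above can be sanity-checked against the defining minimization problem. Here is a hedged sketch (the helper names prox_l2 and prox_l1 are mine), using a one-dimensional grid search as ground truth:

```python
import numpy as np

def prox_l2(u, eta, lam):
    # prox of r(x) = (lam / 2) * ||x||_2^2
    return u / (1 + eta * lam)

def prox_l1(u, eta, lam):
    # prox of r(x) = lam * ||x||_1 (soft-thresholding)
    return np.maximum(np.abs(u) - lam * eta, 0) * np.sign(u)

# sanity check in 1-d: compare against brute-force minimization on a fine grid
eta, lam, u = 0.5, 2.0, np.array([1.3])
grid = np.linspace(-5, 5, 200001)

obj_l2 = lam / 2 * grid ** 2 + (grid - u[0]) ** 2 / (2 * eta)
obj_l1 = lam * np.abs(grid) + (grid - u[0]) ** 2 / (2 * eta)

assert abs(grid[obj_l2.argmin()] - prox_l2(u, eta, lam)[0]) < 1e-3
assert abs(grid[obj_l1.argmin()] - prox_l1(u, eta, lam)[0]) < 1e-3
print('catalog entries verified')
```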
With the above in mind, the meta-algorithm for computing \(x_{t+1}\) amounts to:
Last time we used a lengthy mathematical derivation to obtain the computational steps for L2 regularized losses, namely, losses of the form
\[f(x)=\phi(a^T x + b)+\underbrace{\frac{\lambda}{2} \|x\|_2^2}_{r(x)}.\]Let’s see if we can avoid lengthy and error-prone mathematics using the dual-derivative formula (DD). According to the table of proximal operators, we have \(\operatorname{prox}_{\eta r}(u)=\frac{1}{1+\eta \lambda} u\). Thus, to compute \(s^*\) we plug the above into the formula and solve:
\[\begin{align} q'(s) &=\frac{1}{1+\eta\lambda}a^T(x_t - \eta s a) - {\phi^*}'(s) + b \\ &= \frac{a^T x_t}{1+\eta\lambda} + b -\frac{\eta \|a\|_2^2}{1+\eta\lambda} s - {\phi^*}'(s) = 0 \end{align}\]Looking carefully, we see that it is exactly the derivative of \(q(s)\) from the last post, but this time it was obtained by taking a formula from a textbook. No lengthy math this time!
Having obtained the solution of \(s^*\) of the equation \(q’(s)=0\), we can proceed and compute
\[x_{t+1}=\operatorname{prox}_{\eta r}(x_t - \eta s^* a) = \frac{1}{1+\eta\lambda}(x_t - \eta s^* a),\]which is, again, the same formula we obtained in the last post, but without doing any lengthy math.
The only thing a practitioner wishing to derive a formula for \(x_{t+1}\) has to do by herself is to find a way to solve the one-dimensional equation \(q’(s)=0\). The rest is provided by our textbook recipes - the proximal operator, and the convex conjugate.
We consider losses of the form
\[f(x)=\ln(1+\exp(a^T x)) + \lambda\|x\|_1,\]namely, \(\phi(t)=\ln(1+\exp(t))\) and \(r(x)=\lambda \|x\|_1\). The vector \(a\) comprises both the training sample \(w\) and the label \(y \in \{0, 1\}\), since for a positive sample the incurred loss is \(\ln(1+\exp(-w^T x))\), while for a negative sample the incurred loss is \(\ln(1+\exp(w^T x))\). Namely, we have \(a = \pm w\), where the sign depends on the label \(y\).
We need to deal with two tasks: find \({\phi^*}’\) and \(\operatorname{prox}_{\eta r}\). In previous posts in the series we already saw that
\[\phi^*(s)=s \ln(s) + (1-s) \ln(1-s), \qquad 0 \ln(0) \equiv 0,\]and it is defined on the closed interval \([0,1]\). The derivative is
\[{\phi^*}'(s)=\ln(s)-\ln(1-s),\]and it is defined on the open interval \((0,1)\). From the table of proximal operators above, we find that
\[\operatorname{prox}_{\eta r}(u) = [\vert u \vert -\lambda \eta \mathbf{1}]_+ \cdot \operatorname{sign}(u).\]The above formula may be familiar to those of you with a background in signal processing: this is the soft-thresholding function \(S_\delta(u)\) with \(\delta = \eta \lambda\). It is implemented in PyTorch as torch.nn.functional.softshrink, while in NumPy it can be easily implemented as:
import numpy as np

# the soft-thresholding function with parameter `delta`
def soft_threshold(delta, u):
    return np.clip(np.abs(u) - delta, 0, None) * np.sign(u)
Let’s plot it for the one-dimensional case, to see what it looks like:
import matplotlib.pyplot as plt
x = np.linspace(-5, 5, 1000)
plt.plot(x, soft_threshold(1, x), label='delta=1')
plt.plot(x, soft_threshold(3, x), label='delta=3')
plt.legend(loc='best')
plt.show()
Here is the result:
The function zeroes out inputs close to the origin, and behaves as a linear function when the distance from the origin is \(\geq \delta\).
Now let’s discuss the implementation. Seems we have all our ingredients for the derivative formula (DD) - the proximal operator of \(r\) and the derivative \({\phi^*}’\). Putting the ingredients together, we aim to solve:
\[q'(s)=a^T S_{\eta \lambda}(x_t - \eta s a)-\ln(s)+\ln(1-s)=0\]At first glance it seems like a hard equation to solve, but we have already dealt with a similar challenge in a previous post. Recall that:
The dual function \(q\) is always concave, and therefore its derivative \(q’\) is decreasing. Moreover, it tends to negative infinity when \(s \to 1\), and to positive infinity when \(s \to 0\). In other words, \(q’\) looks something like this:
The function \(q’\) is defined on the interval \((0,1)\) and is continuous. Thus, we can employ the same bisection strategy we used in a previous post for non-regularized logistic regression:
Finally, having solved the equation \(q’(s)=0\), we use its solution \(s^*\) to compute:
\[x_{t+1} = S_{\eta \lambda}(x_t - \eta s^* a)\]Let’s implement the method, and then discuss one of its important properties:
import math
import torch
from torch.nn.functional import softshrink

class LogisticRegressionL1:
    def __init__(self, x, eta, lambdaa):
        self._x = x
        self._eta = eta
        self._lambda = lambdaa

    def step(self, w, y):
        # helper local variables
        delta = self._eta * self._lambda
        eta = self._eta
        x = self._x

        # extract vector `a` from features w and label y
        if y == 0:
            a = w
        else:
            a = -w

        # compute the incurred loss components
        data_loss = math.log1p(math.exp(torch.dot(a, x).item()))  # logistic loss
        reg_loss = self._lambda * x.abs().sum().item()  # L1 regularization

        # dual derivative
        def qprime(s):
            return torch.dot(a, softshrink(x - eta * s * a, delta)).item() - math.log(s) + math.log(1 - s)

        # find initial bisection interval
        l = 0.5
        while qprime(l) <= 0:
            l /= 2
        u = 0.5
        while qprime(1 - u) >= 0:
            u /= 2
        u = 1 - u

        # run bisection - find s_star
        while u - l > 1E-16:  # should be accurate enough
            mid = (u + l) / 2
            if qprime(mid) == 0:
                break
            if qprime(l) * qprime(mid) > 0:
                l = mid
            else:
                u = mid
        s_star = (u + l) / 2

        # perform the computational step
        x.set_(softshrink(x - eta * s_star * a, delta))

        # return loss components
        return data_loss, reg_loss
Before running an experiment, let’s discuss an important property of our optimizer. Look at the last line in the code above - the next iterate \(x_{t+1}\) is the result of the soft-thresholding operator, which we saw zeroes out entries of \(x\) whose absolute value is very small (\(\leq \eta\lambda\)). Consequently, the algorithm itself reflects the sparsity-promoting nature of L1 regularization - we zero out insignificant entries!
The above property is exactly the competitive edge of the proximal point approach over black-box approaches, such as SGD. Since we deal with the loss itself, rather than with its first-order approximation, we preserve its important properties. As we will see in the experiment below, AdaGrad does not produce sparse vectors, while the solver we implemented above does. So even if we did not have the benefit of step-size stability, we would still have the benefit of preserving our regularizer’s properties.
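To see the sparsity-promoting effect concretely, here is a small illustration with made-up numbers, reusing the NumPy soft_threshold function from above: a single soft-thresholding step zeroes out small coordinates exactly.

```python
import numpy as np

def soft_threshold(delta, u):
    return np.clip(np.abs(u) - delta, 0, None) * np.sign(u)

rng = np.random.default_rng(0)
x = rng.normal(scale=0.01, size=1000)  # a vector with many small entries

# one proximal update with eta * lambda = 0.02 produces exact zeros;
# a plain gradient step would only shrink the entries, never zero them
x_prox = soft_threshold(0.02, x)
print(np.mean(x_prox == 0))  # a large fraction of exact zeros (around 0.95)
```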
Since my computational resources are limited, and I do not wish to train models for several days, we will use a rather small data-set this time. I chose the spambase dataset available from here. It is composed of 57 numerical columns, signifying frequencies of various frequently occurring words and average run-lengths of capital letters, and a 58th column with a spam indicator.
Let’s begin by loading the data-set:
import pandas as pd
url = 'https://web.stanford.edu/~hastie/ElemStatLearn/datasets/spam.data'
df = pd.read_csv(url, delimiter=' ', header=None)
df.sample(4)
Here is a possible result of the sample function, which samples 4 rows at random:
0 1 2 3 4 ... 53 54 55 56 57
4192 0.0 0.00 0.00 0.0 0.00 ... 0.000 2.939 51 97 0
1524 0.0 0.90 0.00 0.0 0.00 ... 0.000 6.266 41 94 1
1181 0.0 0.00 0.00 0.0 1.20 ... 0.000 50.166 295 301 1
669 0.0 0.26 0.26 0.0 0.39 ... 0.889 12.454 107 1096 1
Now, let’s normalize all our numerical columns to lie in \([0,1]\), so that all our logistic regression coefficients will be at the same scale (otherwise, L1 regularization will not be effective):
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
scaled = min_max_scaler.fit_transform(df.iloc[:, 0:56])
df.iloc[:, 0:56] = scaled
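As a sanity check, after fitting a `MinMaxScaler` every column’s minimum should map to 0 and its maximum to 1. A minimal sketch on synthetic data (the frame below is a stand-in, not the spambase data):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# hypothetical frame standing in for the spambase numerical columns
df = pd.DataFrame(np.random.rand(100, 3) * 50.0)
df.iloc[:, :] = MinMaxScaler().fit_transform(df)

# per-column extrema after scaling
col_mins = df.min().to_numpy()
col_maxs = df.max().to_numpy()
```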
Now, let’s create our PyTorch data-set, which we will use for training:
import numpy as np
from torch.utils.data.dataset import TensorDataset
W = torch.tensor(np.array(df.iloc[:, 0:56])) # features
Y = torch.tensor(np.array(df.iloc[:, 57])) # labels
ds = TensorDataset(W, Y)
And now, let’s run the optimizer we wrote above, LogisticRegressionL1, to find the weights of the regularized logistic regression model, with regularization parameter \(\lambda=0.0003\). Since this post is about optimization, we refer readers to standard techniques for choosing regularization parameters, such as K-fold cross-validation. Since our proximal point optimizers are quite stable w.r.t. the step-size choice, I just arbitrarily chose \(\eta = 1\).
from torch.utils.data.dataloader import DataLoader

# init. model parameter vector
x = torch.empty(56, requires_grad=False, dtype=torch.float64)
torch.nn.init.normal_(x)

# create optimizer
step_size = 1
llambda = 3E-4
opt = LogisticRegressionL1(x, step_size, llambda)

# run 40 epochs, print out data loss and reg. loss
for epoch in range(40):
    data_loss = 0.0
    reg_loss = 0.0
    for w, y in DataLoader(ds, shuffle=True):
        ww = w.squeeze(0)
        yy = y.item()
        step_data_loss, step_reg_loss = opt.step(ww, yy)
        data_loss += step_data_loss
        reg_loss += step_reg_loss
    print('data loss = ', data_loss / len(ds),
          ' reg. loss = ', reg_loss / len(ds),
          ' loss = ', (data_loss + reg_loss) / len(ds))

# print the parameter vector
print(x)
Here is the output I got:
data loss = 0.3975306874061495 reg. loss = 0.036288109782517605 loss = 0.4338187971886671
data loss = 0.31726795987087547 reg. loss = 0.054970775631660425 loss = 0.3722387355025359
...
data loss = 0.26724176087574 reg. loss = 0.0789531172022866 loss = 0.34619487807802657
tensor([-2.2331e-01, -2.4050e+00, -1.9527e-01, 1.3942e-01, 2.5370e+00,
1.3878e+00, 1.3516e+01, 2.6608e+00, 2.0741e+00, 1.7322e-01,
1.6017e+00, -2.9143e+00, -2.9302e-01, -1.6100e-02, 2.1631e+00,
1.1565e+01, 4.9527e+00, 1.7578e-01, -1.1546e+00, 3.2810e+00,
1.6294e+00, 2.9166e+00, 1.1044e+01, 4.6498e+00, -3.0636e+01,
-8.7392e+00, -2.6487e+01, -6.1895e-02, -1.3150e+00, -1.7103e+00,
5.8073e-02, 0.0000e+00, -8.5150e+00, 0.0000e+00, 0.0000e+00,
-1.4386e-02, -2.4498e+00, 0.0000e+00, -2.0155e+00, -6.6931e-01,
-2.7988e+00, -1.3649e+01, -2.0991e+00, -7.5787e+00, -1.4172e+01,
-1.9791e+01, -1.1263e+00, -2.6178e+00, -3.5694e+00, -6.1839e+00,
-1.4704e+00, 5.4873e+00, 2.1276e+01, 0.0000e+00, 3.6687e-01,
1.8539e+00], dtype=torch.float64)
After 40 epochs we got a parameter vector with several zero entries. These are the entries which our regularized optimization problem zeroes out, due to their small effect on model accuracy. Note that increasing \(\lambda\) puts more emphasis on regularization and less on model accuracy, and would therefore zero out more entries at the cost of an even less accurate model. Namely, we trade accuracy w.r.t. our training data for simplicity (fewer features used).
Running the same code with \(\lambda=0\) produces a training loss of \(\approx 0.228\), while the data_loss we see above is \(\approx 0.267\). So we indeed have a model that fits the training data less well, but is also simpler, since it uses fewer features.
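To quantify this simplicity gain, we can count the exactly-zero entries of the trained parameter vector. A small sketch, using a short hypothetical weight vector rather than the full 56-dimensional one:

```python
import torch

# hypothetical trained parameter vector with some entries zeroed by soft-thresholding
x = torch.tensor([-0.22, 0.0, 1.39, 0.0, 0.0, 5.49, -2.62])
num_zero = int((x == 0.0).sum())
print(f'{num_zero} of {x.numel()} features were dropped by L1 regularization')
```

Note that the comparison is with exact zero - the proximal step produces true zeros, so no tolerance threshold is needed.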
Now, let’s optimize the same logistic regression model with PyTorch’s standard AdaGrad optimizer.
from torch.optim import Adagrad

x = torch.empty(56, requires_grad=True, dtype=torch.float64)
torch.nn.init.normal_(x)

llambda = 3E-4
step_size = 1
opt = Adagrad([x], lr=step_size)

for epoch in range(40):
    data_loss = 0.0
    reg_loss = 0.0
    for w, y in DataLoader(ds, shuffle=True):
        ww = w.squeeze(0)
        yy = y.item()
        if yy == 0:
            a = ww
        else:
            a = -ww
        # zero-out x.grad
        opt.zero_grad()
        # compute loss
        sample_data_loss = torch.log1p(torch.exp(torch.dot(a, x)))
        sample_reg_loss = llambda * x.abs().sum()
        sample_loss = sample_data_loss + sample_reg_loss
        # compute loss gradient and perform optimizer step
        sample_loss.backward()
        opt.step()
        data_loss += sample_data_loss.item()
        reg_loss += sample_reg_loss.item()
    print('data loss = ', data_loss / len(ds),
          ' reg. loss = ', reg_loss / len(ds),
          ' loss = ', (data_loss + reg_loss) / len(ds))
print(x)
After some time, I got the following output:
data loss = 0.32022869947190186 reg. loss = 0.05981964910605969 loss = 0.3800483485779615
...
data loss = 0.2645752374292745 reg. loss = 0.07861101490768707 loss = 0.34318625233696154
tensor([-9.3965e-01, -2.4251e+00, -2.6659e-02, 1.6550e-01, 2.6870e+00,
1.1179e+00, 1.3851e+01, 2.9437e+00, 2.1336e+00, 5.9068e-02,
2.2235e-01, -3.2626e+00, -2.2777e-01, -1.0366e-01, 2.1678e+00,
1.1491e+01, 5.0057e+00, 5.1104e-01, -1.1741e+00, 3.3976e+00,
1.6788e+00, 2.8254e+00, 1.1238e+01, 4.7773e+00, -3.0312e+01,
-8.7872e+00, -2.6049e+01, -8.1157e-03, -1.7401e+00, -1.7351e+00,
-4.8288e-05, 9.3150e-05, -8.6179e+00, -7.2897e-05, -6.7084e-02,
2.4873e-05, -2.6019e+00, 2.0619e-06, -2.1924e+00, -4.1967e-01,
-2.8286e+00, -1.3638e+01, -2.2318e+00, -7.7308e+00, -1.4278e+01,
-1.9813e+01, -1.1892e+00, -2.7827e+00, -3.7886e+00, -5.9871e+00,
-1.6373e+00, 5.5717e+00, 2.1019e+01, 3.9419e-03, 5.1873e-01,
2.2150e+00], dtype=torch.float64, requires_grad=True)
Well, it seems that none of the vector’s components are zero! Some may think that AdaGrad simply has not converged, and that running it for more epochs, or with a different step size, would make a difference. Unfortunately, it does not. A black-box optimization technique, which is based on a linear approximation of our losses instead of exploiting the losses themselves, often cannot produce solutions that preserve important properties imposed by our loss functions. In this case, it does not preserve the feature-selection property of L1 regularization.
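If sparsity is needed despite using a black-box optimizer, a common post-hoc heuristic is to truncate near-zero entries. This is a workaround, not a property of AdaGrad - both the vector and the threshold below are arbitrary illustrative choices:

```python
import torch

# hypothetical AdaGrad output: tiny but nonzero entries
x = torch.tensor([-9.4e-01, -4.8e-05, 9.3e-05, 2.1e-06, 5.5e+00])
threshold = 1e-3  # arbitrary cutoff - must be tuned, unlike the proximal step
x_sparse = torch.where(x.abs() <= threshold, torch.zeros_like(x), x)
```

The downside is clear: the threshold is an extra hyperparameter with no principled value, whereas the proximal point method zeroes entries as an intrinsic part of the optimization.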
Up until now we have laid the theoretical and computational foundations, which we can now use to build non black-box optimizers for a much broader variety of problems, well beyond the simple ‘convex on linear’ setup. Our foundations are: the proximal viewpoint of optimization, convex duality, the convex conjugate, and the Moreau envelope.
Two important examples from machine learning come to mind - factorization machines and neural networks. The losses incurred by these models clearly do not fall into the ‘convex on linear’ category, and we will see in future posts how we can construct non black-box optimizers, which exploit more information about the loss, to train such models.
Moreover, up until now we assumed the fully stochastic setting: at each iteration we select one training sample, and perform a computational step based on the loss that sample incurs. We will see that the concepts we developed so far let us derive an efficient implementation of the stochastic proximal point algorithm in the mini-batch setting, where we select a small subset of training samples \(B \subseteq \{1, \dots, n \}\), and perform a computational step based on the average loss \(\frac{1}{\vert B \vert} \sum_{j \in B} f_j(x)\) incurred by these samples.
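To make the mini-batch setting concrete, here is a sketch (on synthetic stand-in data, not the spambase set) of how `DataLoader` yields a batch \(B\) and how the average loss over it would be computed, using the same sign convention as the code above:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# synthetic stand-in for the features and 0/1 labels
W = torch.rand(100, 5, dtype=torch.float64)
Y = (torch.rand(100) < 0.5).to(torch.float64)
ds = TensorDataset(W, Y)

x = torch.zeros(5, dtype=torch.float64)  # model parameters
for w_batch, y_batch in DataLoader(ds, batch_size=8, shuffle=True):
    signs = 1.0 - 2.0 * y_batch             # y=0 -> +1, y=1 -> -1
    a_batch = signs.unsqueeze(1) * w_batch  # one a_j per sample in B
    # average logistic loss over the mini-batch: (1/|B|) sum_j log(1 + exp(a_j^T x))
    batch_loss = torch.log1p(torch.exp(a_batch @ x)).mean()
```

The challenge, which we will address later, is that the proximal step over such a batch no longer has a one-dimensional dual - its dual dimension grows with \(\vert B \vert\).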
Before we proceed to the above interesting stuff, one foundational concept is missing. Despite the fact that it doesn’t seem so at first glance, the stochastic proximal point method is a gradient method: it makes a small step in a negative gradient direction. But in contrast to SGD-type methods, the gradient is not \(\nabla f(x_t)\) - it is taken at a different point, which is not \(x_t\). The next post will deal with this aspect of the method. We will mostly have theoretical explanations and drawings illustrating them, and see why it is a gradient method after all. No code, just theory. And no lengthy math, just elementary math and hand-waving with drawings and illustrations. Stay tuned!
J. J. Moreau (1965). Proximité et dualité dans un espace Hilbertien. Bulletin de la Société Mathématique de France, 93, pp. 273–299. ↩
N. Parikh, S. Boyd (2013). Proximal Algorithms. Foundations and Trends in Optimization, Vol. 1, No. 3, pp. 123–231. ↩
A. Beck (2017). First-Order Methods in Optimization. SIAM Books, Ch. 6, pp. 129–177. ↩