# Mario Becerra

## Minibatch gradient descent for linear regression in Rcpp

In a previous post I showed how to implement gradient descent and stochastic gradient descent for linear regression and logistic regression. The problem with that implementation was that I used R’s for loops, which can be quite slow. In this version, I implement the functions in Rcpp, reducing the running time considerably.

I also implement a more general algorithm, called minibatch gradient descent, again for the case of linear regression.

## Loss function optimization

As mentioned in the previous post, in statistical learning one usually wants to find the parameters that minimize a loss function, which almost always measures the error of the model we’re using.

The general problem of unconstrained optimization is

$$$\min_{\theta \in \mathbb{R}^p} L(\theta).$$$

Usually, $$L$$ is a convex loss function such as quadratic loss, and $$\theta$$ is a parameter or vector of parameters.

I give a quick reminder of how gradient descent (GD) works; it is explained more thoroughly in my previous post.

In each iteration of GD, the algorithm finds a direction in which to move and then updates the current value according to that direction. That is, in the $$k$$-th iteration, we have a value $$\theta_k$$, and we look for a direction $$p_k$$ to update to a new value $$\theta_{k+1} = \theta_k + \alpha_k p_k$$, where $$\alpha_k > 0$$ is the ‘distance’ that the algorithm moves in direction $$p_k$$. Once the value of the parameter is updated, we find a new direction in which to move and then update the parameter value again. This is done until a stopping criterion is met, usually that the norm of the gradient vector is smaller than some small positive scalar. The direction $$p_k$$ in which the algorithm moves is the direction of steepest descent, that is, the negative of the gradient $$-\nabla L(\theta_k)$$. So, in each iteration we have

$$$\theta_{k+1} = \theta_k - \alpha_k \nabla L(\theta_k).$$$
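
To make the update rule concrete, here is a minimal sketch in plain R (my own illustration, not part of the Rcpp implementation below), assuming a hypothetical function grad_L that returns $$\nabla L(\theta)$$:

# Minimal sketch of gradient descent, assuming grad_L(theta) returns the gradient of L at theta
gradient_descent <- function(grad_L, theta_0, alpha = 0.01, tol = 1e-8, max_it = 10000){
theta <- theta_0
for(k in 1:max_it){
g <- grad_L(theta)             # direction of steepest descent is -g
if(sqrt(sum(g^2)) < tol) break # stop when the gradient norm is small
theta <- theta - alpha*g       # theta_{k+1} = theta_k - alpha_k * grad L(theta_k)
}
theta
}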

In linear regression we want to minimize the quadratic loss function

$$$L(x, \beta) = \frac{1}{n} \sum_{i = 1}^n \left( y_i - \beta_0 - \beta_1 x_{i1} - ... - \beta_p x_{ip} \right)^2.$$$

Taking its partial derivatives, we have for each $$j \in \left\{0, 1, ..., p \right\}$$

$$$\frac{\partial L}{\partial \beta_j} = -\frac{2}{n} \sum_{i = 1}^n x_{ij} \, \ell_i(x, \beta),$$$

where $$x_{i0} = 1$$ for each $$i \in \left\{1, ..., n \right\}$$ and

$$$\ell_i(x, \beta) = y_i - \beta_0 - \beta_1 x_{i1} - ... - \beta_p x_{ip}.$$$

From here, the gradient is

$$$\nabla_{\beta} L(x, \beta) = \left( \frac{\partial L}{\partial \beta_0}, ..., \frac{\partial L}{\partial \beta_p} \right)^T,$$$

and the descent direction used in each iteration is its negative.
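
As a plain-R sketch of these formulas (a hypothetical helper of my own, not the Rcpp function used later), the gradient can be computed by prepending a column of ones to the covariate matrix:

# Sketch in plain R: gradient of the quadratic loss with respect to (beta_0, ..., beta_p),
# assuming X is an n x p matrix of covariates (without the intercept column) and y a length-n vector
lm_gradient_r <- function(X, y, betas){
X1 <- cbind(1, as.matrix(X))           # prepend the column of ones (x_{i0} = 1)
l  <- as.numeric(y - X1 %*% betas)     # residuals l_i(x, beta)
as.numeric(-2/length(y) * t(X1) %*% l) # -(2/n) * sum_i x_ij * l_i
}

This is only meant to make the indexing explicit; the Rcpp version below does the same computation with explicit loops.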

As previously mentioned, in statistical learning it is common to find the need to solve optimization problems of the form

$$$\min_{\theta \in \mathbb{R}^p} L(x, \theta), \quad \text{with} \, \, L(x, \theta) = \frac{1}{n} \sum_{i = 1}^n { \psi_i(x, \theta) }.$$$

Gradient descent uses iterations of the form

$$$\theta_{k+1} = \theta_k - \alpha_k \nabla L(x, \theta_k) = \theta_k - \frac{\alpha_k}{n} \sum_{i = 1}^n \nabla \psi_i(x, \theta_k),$$$

which involves evaluating $$n$$ gradients and then taking an average. In large-scale machine learning, the number of observations $$n$$ can be very large, so computing all of those gradients in each iteration is expensive.

Minibatch gradient descent solves this problem because the number of gradients to compute doesn’t depend on $$n$$; it is constant. In minibatch gradient descent, one chooses a fixed integer $$l$$ and the dataset is divided into batches of size $$l$$, where the observations in each batch are chosen at random. Each of these batches is then used to compute the gradient and update the values of the parameters. Usually $$l$$ is small compared to the size of a big dataset, but big enough that the gradient estimate isn’t too noisy, such as $$l = 32$$ or $$l = 100$$. This way, each iteration is cheaper because it involves the computation of only $$l$$ gradients instead of $$n$$. Stochastic gradient descent (SGD) is just minibatch gradient descent with $$l = 1$$.
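
A minimal sketch in plain R of one epoch of minibatch gradient descent, assuming a gradient function with the same signature as the lm_gradient_r sketch above (hypothetical names, not the Rcpp implementation that follows):

# One epoch of minibatch gradient descent (sketch): shuffle, split into batches of size l,
# and take one gradient step per batch
one_epoch <- function(X, y, betas, alpha, l, grad_fun){
ix <- sample(nrow(X))                          # shuffle the observations
batches <- split(ix, ceiling(seq_along(ix)/l)) # indices of each minibatch of size l
for(b in batches){
g <- grad_fun(X[b, , drop = FALSE], y[b], betas) # gradient computed on l observations only
betas <- betas - alpha*g
}
betas
}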

## Implementation

I implemented minibatch gradient descent for linear regression using Rcpp, a package that compiles C++ code so that it can be called from R.

library(gridExtra)
library(Rcpp)
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## combine(): dplyr, gridExtra
## filter():  dplyr, stats
## lag():     dplyr, stats

The following lines show the C++ code that is compiled from R. The code has three main functions: the first one computes the L2 norm of a vector, the second one computes the gradient of the loss function of the data with respect to the beta parameters, and the last one (epoch_update) runs one full epoch of minibatch updates over the dataset.

# Show C++ file content
cat(readLines("sgd.cpp"), sep = '\n')
#include <RcppArmadillo.h>
#include <Rcpp.h>
#include <iostream>
using namespace std;
using namespace Rcpp;

//////////////////////////
/////// FUNCTION 1
//////////////////////////
// [[Rcpp::export]]
double norm(NumericVector g){
// Receives a numeric vector and returns L2 norm
int m = g.length();
NumericVector g2(m);
for(int i = 0; i < m; i++){
g2(i) = g(i)*g(i);
}
return(sqrt(sum(g2)));
}

//////////////////////////
/////// FUNCTION 2
//////////////////////////
// [[Rcpp::export]]
NumericVector gradient(NumericMatrix data, NumericVector betas){
// Receives a data matrix (covariates, with the response in the last column) and a vector of parameters.
// Returns the gradient of the quadratic loss function
// for the current value of the parameters (betas).
int n = data.nrow(); // number of data points
int p = data.ncol(); // number of coefficients (including intercept)
NumericVector li(n); // vector that will contain each observation's loss
NumericVector g_out(p); // output gradient vector
for(int i = 0; i < n; i++){ // iterate over rows
double sum_betas_xi  = 0.0; // accumulator
for(int j = 1; j < p; j++){ // iterate over columns
sum_betas_xi = sum_betas_xi + betas(j)*data(i, j-1);
}
li(i) = data(i, p-1) - betas(0) - sum_betas_xi;
}
NumericMatrix li_x(n, p-1);
for(int i = 0; i < n; i++){ // iterate over rows
for(int j = 0; j < p - 1; j++){ // iterate over columns
li_x(i, j) = data(i, j)*li(i);
}
}

g_out(0) = -2*sum(li)/n;

// gradient for the rest of the parameters
for(int j = 0; j < p-1; j++){
double sum_li_x = 0.0;
for(int i = 0; i < n; i++){
sum_li_x = sum_li_x + li_x(i, j);
}
g_out(j+1) = -2*sum_li_x/n;
}
return g_out;
}

//////////////////////////
/////// FUNCTION 3
//////////////////////////
// [[Rcpp::export]]
List epoch_update(NumericMatrix data, NumericVector betas_in, double alpha, int n_epoch, int minibatch_size){
// Updates the value of the parameters based on the gradients of minibatch iteration
// Receives a data matrix, vector of parameters, learning rate alpha,
// what epoch is currently running, and minibatch size.
// Important: we have to clone the vector betas_in, otherwise, the original input vector
// will be modified in the R environment because they have the same name (betas).
NumericVector betas = clone(betas_in); // output vector
int n = data.nrow(); // number of rows in data matrix
int p = betas.length(); // number of parameters (including intercept)
int num_iters; // number of iterations (based on minibatch size and data size)
if(n % minibatch_size == 0) num_iters = floor(n/minibatch_size);
else num_iters = floor(n/minibatch_size) + 1;
NumericMatrix epoch_values(num_iters, p+3); // values of parameters in this epoch

// iterate over minibatches
for(int k = 0; k < num_iters; k++){
NumericVector g(p); // gradient of the current minibatch
if((k + 1) != num_iters){ // all but last minibatch
NumericMatrix data_subset(minibatch_size, data.ncol()); // subsets the minibatch
int lwr = k*minibatch_size; // lower index for minibatch
for(int i = 0; i < data_subset.nrow(); i++){ // subset the data
for(int j = 0; j < data_subset.ncol(); j++){
data_subset(i, j) = data(lwr + i, j);
}
}
g = gradient(data_subset, betas); // gradient of this minibatch
} else { // last minibatch
int lwr = k*minibatch_size;
int size = data.nrow() - lwr; // size of last minibatch
if(size == 0) break; // if last minibatch is of size 0, break
NumericMatrix data_subset(size, data.ncol());
for(int i = 0; i < data_subset.nrow(); i++){ // subset data
for(int j = 0; j < data_subset.ncol(); j++){
data_subset(i, j) = data(lwr + i, j);
}
}
g = gradient(data_subset, betas); // gradient of the last minibatch
}

// Update parameter vector
for(int j = 0; j < p; j++){
betas(j) = betas(j) - alpha*g(j);
}

double g_norm = norm(g); // Gradient norm

// Fill matrix with parameter and norm values
epoch_values(k, 0) = n_epoch;
epoch_values(k, 1) = k + 1;
epoch_values(k, 2) = g_norm;
for(int j = 0; j < p; j++){
epoch_values(k, j+3) = betas(j);
}
} // end for

// output list of values
List ret;
ret["epoch_values"] = epoch_values;
ret["betas"] = betas;
return ret;
}

# Compile with Rcpp
sourceCpp("sgd.cpp")
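
A quick sanity check of the compiled code (my addition, assuming the file compiled without errors): the exported norm function should agree with base R.

# Quick check of the compiled norm() against base R (both should be 5)
norm(c(3, 4))
sqrt(sum(c(3, 4)^2))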

In the following lines, we can see the function that calls epoch_update iteratively to update the beta parameter values. It receives a matrix $$X$$ that represents the covariates and a vector $$y$$ that represents the response variable.

## Function that performs minibatch gradient descent on dataset
lm_minibatch <- function(X, y, minibatch_size = 15,
max_it = 3000,
initial_point = NA,
seed = 201802,
alpha = 0.001,
g_norm_tol = 1e-8,
beta_diff_norm_tol = 1e-8,
verbose = F){

# Parameters:
# X: covariate data
# y: response variable
# minibatch_size: Minibatch size
# max_it: maximum number of iterations of the algorithm
# initial_point: initial point for the beta parameters
# seed: seed for the data shuffling phase
# alpha: learning rate
# g_norm_tol: gradient norm tolerance
# beta_diff_norm_tol: beta difference norm tolerance
# verbose: whether to print each iteration number or not

data_matrix = as.matrix(cbind(X, y)) # Creates a data matrix
n <- nrow(data_matrix) # number of observations
p <- ncol(data_matrix) # number of parameters (including intercept)

# Default initial point
if(is.na(initial_point[1])) initial_point = rep(0.0, p)

# Number of iterations according to minibatch size
if(n %% minibatch_size == 0) {
num_iters = floor(n/minibatch_size)
} else {
num_iters = floor(n/minibatch_size) + 1
}

# Two data frames that keep track of the values of the parameters during the execution of the algorithm.
# This may consume a lot of memory if the data has too many observations and
# the max_it parameter is too big
betas_df <- as.data.frame(matrix(rep(0, max_it*num_iters*p), ncol = p))
names(betas_df) <- paste0("beta_", 0:(p-1))
iteration_values <- tibble(
epoch = rep(0, max_it*num_iters),
obs = 1:(max_it*num_iters),
gradient_norm = 0) %>%
bind_cols(betas_df)

i = 0
betas <- initial_point
# Start the algorithm
while(i < max_it){
i = i + 1
if(verbose) print(i)
# Shuffle data
set.seed(seed)
shuffle_ix <- sample(nrow(data_matrix))
shuffled_data = data_matrix[shuffle_ix,]
# Compute beta parameters for one epoch
epoch_betas <- epoch_update(shuffled_data, betas, alpha, i, minibatch_size)

# Update beta parameters
betas_old <- betas
betas <- epoch_betas$betas
# Save values of each iteration in the output dataframe
epoch_values_temp <- as.data.frame(epoch_betas$epoch_values)
names(epoch_values_temp) <- names(iteration_values)
iteration_values[((i-1)*num_iters + 1):((i)*num_iters),] <- epoch_values_temp
g_norm <- epoch_betas$epoch_values[num_iters, 3]
# Compute the squared norm of the beta differences
dif_betas_norm <- sum((betas - betas_old)^2)
# If the gradient norm is close to zero, or the parameters hardly change, exit the algorithm
if(g_norm < g_norm_tol | dif_betas_norm < beta_diff_norm_tol) break
}

# Keep only the values of the valid iterations
iteration_values <- iteration_values[1:(i*num_iters),]
iteration_values$it <- 1:nrow(iteration_values)

return(list(
iteration_values = iteration_values,
betas = betas
))
}
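
As a quick usage sketch (my addition, assuming sgd.cpp compiled and lm_minibatch is defined as above), the estimates can be compared against lm() on a small simulated dataset:

# Small sanity check (not part of the examples below): compare against lm()
set.seed(1)
x_chk <- rnorm(200)
y_chk <- 1 + 2*x_chk + rnorm(200, 0, 0.1)
fit_chk <- lm_minibatch(data.frame(x = x_chk), y_chk, minibatch_size = 20)
fit_chk$betas           # should be close to c(1, 2)
coef(lm(y_chk ~ x_chk)) # reference fit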

## Examples

I show the implementation in three different examples, each of which illustrates something different. The first one shows how the minibatch size ($$l$$) affects the result, the second one compares the running time of minibatch gradient descent and normal gradient descent on different sizes of a dataset, and the last one shows the implementation on the diamonds dataset, a dataset commonly used to illustrate linear regression.

#### Example 1

In this example, we see how the minibatch size affects the algorithm. First, we write a function that, given the output of the lm_minibatch function, plots the values of $$\beta_0$$ and $$\beta_1$$ across the iterations of the algorithm.

# Function that plots the values of the parameters in each iteration.
# It only works for simple linear regression (2 parameters).
plot_minibatches_2_params <- function(lm_minibatch){
data <- lm_minibatch$iteration_values
if(length(lm_minibatch$betas) == 2) {
n = nrow(data)
val_1 = data$beta_0[n]
val_2 = data$beta_1[n]
gg <- data %>%
ggplot(aes(beta_0, beta_1)) +
xlab("Beta 0") +
ylab("Beta 1") +
geom_path(size = 0.1, color = 'black') +
geom_point(size = 0.01, color = 'black', alpha = 0.2) +
geom_point(aes(x, y),
data = tibble(x = val_1,
y = val_2),
shape = 'x',
size = 5,
color = 'blue') +
theme_bw()
return(gg)
} else{
return("Error")
}
}

Next, we create some fake data to run the algorithm. We create a dataset with 500 observations.

N <- 500
beta_0 <- -2
beta_1 <- 4

data <- tibble(x = rnorm(N),
y = beta_0 + beta_1*x + rnorm(N, 0, 0.2))
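
As a reference (my addition), an ordinary least squares fit with lm() gives the coefficients the minibatch runs below should approach, close to the true $$\beta_0 = -2$$ and $$\beta_1 = 4$$:

# Reference fit with ordinary least squares; the minibatch runs below should approach these values
coef(lm(y ~ x, data = data))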

Now, we run the algorithm with different minibatch sizes, going from 1 to 256 in powers of 2. The plots show the iterations for each minibatch size. It can be seen that with a minibatch size of 1 or 2 the iterations are very noisy, but as the minibatch size grows, they behave less erratically.

# Runs the algorithm for different minibatch sizes
plots_size <- lapply(0:8, function(i){
mb_size = 2^i
mod_minibatch <- lm_minibatch(data[,"x"], data$y, 2^i, initial_point = c(-1, -1))
gg <- plot_minibatches_2_params(mod_minibatch) +
ggtitle(paste("Size:", mb_size))
return(gg)
})

# Didn't bother to look for a more elegant way to do this
grid.arrange(plots_size[[1]], plots_size[[2]], plots_size[[3]],
plots_size[[4]], plots_size[[5]], plots_size[[6]],
plots_size[[7]], plots_size[[8]], plots_size[[9]],
ncol = 3)

#### Example 2

In this example, we see how the running time of normal gradient descent grows with the dataset size, while the running time of minibatch gradient descent remains constant. First, we create a dataset with $$2^{21} = 2,097,152$$ observations.

# Create dataset of considerable size
data_2 <- tibble(x = rnorm(2^21),
y = 1 + 3*x + rnorm(2^21, 0, 0.5))

Now, we run the algorithm for different subsets of the data and measure the running time. We run it for minibatch sizes of 10, 100, 1000, and the whole dataset. The dataset sizes used are $$1024$$, $$2048$$, $$4096$$, $$8192$$, $$16384$$, $$32768$$, $$65536$$, $$131072$$, $$262144$$, $$524288$$ and $$1048576$$.

# Create folder to save objects and cache
dir.create("cache", showWarnings = FALSE)

# The if-else checks if this code has been run before and loads
# the resulting dataset if it has.
# If it hasn't been run, then it executes it and saves the output.
if("minibatch_times_10.rds" %in% list.files("cache/")){
minibatch_times_10 <- readRDS("cache/minibatch_times_10.rds")
} else {
# Output dataframe
minibatch_times_10 <- tibble(
log_data_size = seq(10, 20),
seconds = 0,
n_epoch = 0,
beta_0 = 0,
beta_1 = 0)
# Runs minibatch gradient descent with minibatches of size 10 for different sizes of dataset
for(i in 1:nrow(minibatch_times_10)){
# Chooses the size of the dataset
log_size = minibatch_times_10$log_data_size[i]
# subsets the data from the bigger dataset
dat = data_2[1:2^log_size,]
# computes the running time
time = system.time(temp <- lm_minibatch(dat$x, dat$y, 10, max_it = 60))
# Number of epochs run
n_epoch_temp = max(unique(temp$iteration_values$epoch))
# Fill the output dataframe
minibatch_times_10$seconds[i] = time[3]
minibatch_times_10$n_epoch[i] = n_epoch_temp
minibatch_times_10$beta_0[i] = temp$betas[1]
minibatch_times_10$beta_1[i] = temp$betas[2]
}
saveRDS(minibatch_times_10, "cache/minibatch_times_10.rds")
}

if("minibatch_times_100.rds" %in% list.files("cache/")){
} else {
minibatch_times_100 <- tibble(
log_data_size =seq(10, 20),
seconds = 0,
n_epoch = 0,
beta_0 = 0,
beta_1 = 0)

# Runs minibatch gradient descent with minibatches of size 100 for different sizes of dataset
for(i in 1:nrow(minibatch_times_100)){
# Chooses the size of the dataset
log_size = minibatch_times_100$log_data_size[i]
# subsets the data from the bigger dataset
dat = data_2[1:2^log_size,]
# computes the running time
time = system.time(temp <- lm_minibatch(dat$x, dat$y, 100, max_it = 500))
n_epoch_temp = max(unique(temp$iteration_values$epoch))
# Fill the output dataframe
minibatch_times_100$seconds[i] = time[3]
minibatch_times_100$n_epoch[i] = n_epoch_temp
minibatch_times_100$beta_0[i] = temp$betas[1]
minibatch_times_100$beta_1[i] = temp$betas[2]
}
saveRDS(minibatch_times_100, "cache/minibatch_times_100.rds")
}

if("minibatch_times_1000.rds" %in% list.files("cache/")){
minibatch_times_1000 <- readRDS("cache/minibatch_times_1000.rds")
} else {
# Runs minibatch gradient descent with minibatches of size 1000 for different sizes of dataset
minibatch_times_1000 <- tibble(
log_data_size = seq(10, 20),
seconds = 0,
n_epoch = 0,
beta_0 = 0,
beta_1 = 0)
for(i in 1:nrow(minibatch_times_1000)){
log_size = minibatch_times_1000$log_data_size[i]
dat = data_2[1:2^log_size,]
time = system.time(temp <- lm_minibatch(dat$x, dat$y, 1000, max_it = 2000))
n_epoch_temp = max(unique(temp$iteration_values$epoch))
minibatch_times_1000$seconds[i] = time[3]
minibatch_times_1000$n_epoch[i] = n_epoch_temp
minibatch_times_1000$beta_0[i] = temp$betas[1]
minibatch_times_1000$beta_1[i] = temp$betas[2]
}
saveRDS(minibatch_times_1000, "cache/minibatch_times_1000.rds")
}

if("gradient_descent_times.rds" %in% list.files("cache/")){
gradient_descent_times <- readRDS("cache/gradient_descent_times.rds")
} else {
# Runs gradient descent for different sizes of dataset
gradient_descent_times <- tibble(
log_data_size = seq(10, 20),
seconds = 0,
n_epoch = 0)
for(i in 1:nrow(gradient_descent_times)){
log_size = gradient_descent_times$log_data_size[i]
dat = data_2[1:2^log_size,]
time = system.time(temp <- lm_minibatch(dat$x, dat$y, nrow(dat), max_it = 300000))
n_epoch_temp = max(unique(temp$iteration_values$epoch))
gradient_descent_times$seconds[i] = time[3]
gradient_descent_times$n_epoch[i] = n_epoch_temp
}
saveRDS(gradient_descent_times, "cache/gradient_descent_times.rds")
}

Now, we visualize the running time of each algorithm for each dataset size. We can see that minibatch gradient descent’s running time remains constant, while gradient descent’s time grows as the dataset size grows. There is no appreciable difference between the three variants of minibatch gradient descent.

gradient_descent_times %>%
mutate(type = "GD") %>%
bind_rows(
minibatch_times_10 %>% select(-beta_0, -beta_1) %>% mutate(type = "minibatch_10")
) %>%
bind_rows(
minibatch_times_100 %>% select(-beta_0, -beta_1) %>% mutate(type = "minibatch_100")
) %>%
bind_rows(
minibatch_times_1000 %>% select(-beta_0, -beta_1) %>% mutate(type = "minibatch_1000")
) %>%
ggplot(aes(x = log_data_size, y = seconds, group = type, color = type)) +
geom_point() +
geom_line() +
xlab("Data size (log 2 scale)") +
ylab("Time (seconds)") +
theme_bw()

#### Example 3

The last example uses a dataset that contains the prices and other attributes of close to 54,000 diamonds. It is accessible through the ggplot2 package. We load the dataset and look at the first rows.

data(diamonds)
head(diamonds)
## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.230 Ideal     E     SI2      61.5  55.0   326  3.95  3.98  2.43
## 2 0.210 Premium   E     SI1      59.8  61.0   326  3.89  3.84  2.31
## 3 0.230 Good      E     VS1      56.9  65.0   327  4.05  4.07  2.31
## 4 0.290 Premium   I     VS2      62.4  58.0   334  4.20  4.23  2.63
## 5 0.310 Good      J     SI2      63.3  58.0   335  4.34  4.35  2.75
## 6 0.240 Very Good J     VVS2     62.8  57.0   336  3.94  3.96  2.48

For this example, we will fit a linear model on the logarithm of the price, using as covariates the log of the weight (carat) of the diamond and the clarity, encoded as one-hot dummy variables.

# Select and transform the variables we're gonna use
data_diamonds_temp <- diamonds %>%
select(-x, -y, -z, -depth, -table, -cut, -color) %>%
mutate(y = log(price),
x2 = log(carat),
clarity = as.factor(as.character(clarity)))

# Transform clarity to a one-hot matrix
clarity_dummy <- dummies::dummy(x = data_diamonds_temp$clarity) %>%
as.data.frame()
names(clarity_dummy) <- paste0("x1_", 1:ncol(clarity_dummy))

# Create the dataset of the log-carat and dummy-transformed clarity
data_diamonds <- clarity_dummy %>%
bind_cols(
data_diamonds_temp %>%
select(x2, y)
)

# Look at the transformed data
head(data_diamonds)
##   x1_1 x1_2 x1_3 x1_4 x1_5 x1_6 x1_7 x1_8        x2        y
## 1    0    0    0    1    0    0    0    0 -1.469676 5.786897
## 2    0    0    1    0    0    0    0    0 -1.560648 5.786897
## 3    0    0    0    0    1    0    0    0 -1.469676 5.789960
## 4    0    0    0    0    0    1    0    0 -1.237874 5.811141
## 5    0    0    0    1    0    0    0    0 -1.171183 5.814131
## 6    0    0    0    0    0    0    0    1 -1.427116 5.817111

We now run the minibatch gradient descent algorithm with a minibatch size of 32.

# Run algorithm
mod_minibatch_diamonds_100 <- lm_minibatch(data_diamonds[,1:9], data_diamonds$y, minibatch_size = 32, beta_diff_norm_tol = 1e-11)

The following plot shows the value of the parameters in each iteration. All of them appear to have converged to a fixed value.

mod_minibatch_diamonds_100$iteration_values %>%
select(grep("beta", names(.)), it) %>%
gather(param, value, -it) %>%
ggplot(aes(it, value, group = param)) +
geom_line(size = 0.5, alpha = 0.7) +
facet_wrap(~param, scales = 'free_y', ncol = 3) +
theme_bw()
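
As a final check (my addition, not in the original post), we can compare the minibatch solution with the closed-form lm() fit. Because all eight clarity dummies are included together with an intercept, the parameterization is not unique, so it is safer to compare fitted values rather than individual coefficients:

# Compare fitted values from the minibatch solution and from lm() on the same covariates
X1 <- cbind(1, as.matrix(data_diamonds[, 1:9]))
fitted_minibatch <- as.numeric(X1 %*% mod_minibatch_diamonds_100$betas)
fitted_lm <- fitted(lm(y ~ ., data = data_diamonds))
summary(fitted_minibatch - fitted_lm) # differences should be small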