Mario Becerra

Gradient Descent and Stochastic Gradient Descent for regression


Loss function optimization

In statistical learning, one usually wants to find the parameters that minimize a loss function, which almost always measures the error of the model we’re using. In many cases, as in linear regression, the first and second order conditions give a closed-form expression for the parameters. But there are many other cases in which this isn’t possible, and that’s when numerical optimization comes in. For this post, I use definitions from Numerical Optimization (2nd Ed.) by Nocedal & Wright and Optimization Methods for Large-Scale Machine Learning by Bottou, Curtis & Nocedal.

The general problem of unconstrained optimization is

\[\begin{equation} \min_{\theta \in \mathbb{R}^p} L(\theta). \end{equation}\]

For statistical learning, \(L\) is usually a convex loss function such as the quadratic loss, and \(\theta\) is a parameter or vector of parameters. Numerical optimization algorithms are generally iterative. A solution is a vector \(\theta^*\) called a local minimizer, which minimizes the function \(L\) in a neighbourhood around \(\theta^*\). Formally, a vector \(\theta^*\) is a local minimizer if there exists a neighbourhood \(\mathcal{N}\) of \(\theta^*\) such that \(L(\theta^*) \leq L(\theta)\) for all \(\theta \in \mathcal{N}\).

In numerical optimization, we make use of the sufficient second order conditions. Suppose that the Hessian matrix \(\nabla^2 L\) is continuous in an open neighbourhood of \(\theta^*\), that the gradient \(\nabla L(\theta^*) = 0\) and that \(\nabla^2 L(\theta^*)\) is positive definite; then \(\theta^*\) is a local minimizer of \(L\).

This is basic calculus, but it provides the basis of numerical optimization algorithms. In general, all of these algorithms search for a point \(\theta^*\) such that \(\nabla L(\theta^*) = 0\).
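
As a small illustration of these conditions (not tied to the examples below), here is a minimal sketch in R with a made-up quadratic loss \(L(\theta) = (\theta_1 - 1)^2 + 2(\theta_2 + 3)^2\), whose minimizer is \(\theta^* = (1, -3)^T\):

grad_L <- function(theta){
  # Gradient of the toy loss; it is zero at theta* = (1, -3)
  return(c(2*(theta[1] - 1), 4*(theta[2] + 3)))
}

hess_L <- function(theta){
  # Hessian of the toy loss (constant, since the loss is quadratic)
  return(diag(c(2, 4)))
}

theta_star <- c(1, -3)
grad_L(theta_star)               # first order condition: the gradient is (0, 0)
eigen(hess_L(theta_star))$values # both eigenvalues positive: positive definite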

Gradient descent (GD)

Gradient descent belongs to a family of algorithms called line search algorithms. In each iteration, these algorithms search for a direction in which to move and then update the current value according to that direction. That is, in the \(k\)-th iteration, we have a value \(\theta_k\), and we look for a direction \(p_k\) to update to a new value \(\theta_{k+1} = \theta_k + \alpha_k p_k\), where \(\alpha_k > 0\) is the ‘distance’ that the algorithm moves along direction \(p_k\), and is called the step length. Once the value of the parameter has been updated, we find a new direction in which to move and then update the parameter value again. This is done until a stopping criterion is met, usually that the norm of the gradient vector is smaller than a certain small positive scalar.

In gradient descent, the direction \(p_k\) in which the algorithm moves is the steepest descent direction, that is, the negative of the gradient, \(-\nabla L(\theta_k)\). So, in each iteration we have

\[\begin{equation} \theta_{k+1} = \theta_k - \alpha_k \nabla L(\theta_k). \end{equation}\]
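
Written as code, this update rule is just a loop. Here is a minimal sketch with a hypothetical gradient function grad_fn and a fixed step length (how to choose the step length is discussed next); both are placeholders and not part of the examples below:

gd_fixed_step <- function(theta, grad_fn, alpha = 0.1, tol = 1e-6, max_iter = 1000){
  for(k in 1:max_iter){
    g <- grad_fn(theta)
    if(sqrt(sum(g^2)) < tol) break # stop when the gradient norm is small
    theta <- theta - alpha*g       # move in the steepest descent direction
  }
  return(theta)
}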

We run into a problem when we want to compute the step length \(\alpha_k\): we want a value for which the function \(L\) decreases as much as possible, but we don’t want to spend too much time choosing it. The ideal choice is the global minimizer of the auxiliary function \(\phi(\alpha_k) = L(\theta_k + \alpha_k p_k)\), but it is too expensive to compute. In practice, heuristics are used to propose a sequence of candidate values for \(\alpha_k\) and accept one that satisfies certain conditions.

One of those conditions is the Armijo condition, which asks for an \(\alpha_k\) that produces a sufficient decrease in the function \(L\), measured as

\[ L(\theta_k + \alpha_k p_k) \leq L(\theta_k) + c_1 \alpha_k \nabla L(\theta_k)^T p_k, \]

for a constant \(c_1 \in (0, 1)\). Usually \(c_1\) is small, such as \(10^{-4}\). This condition alone may not be enough, because it can be satisfied by very small values of \(\alpha_k\), and we usually don’t want very small step lengths. One way to fix this is to backtrack: we choose a big value of \(\alpha_k\) (such as \(\alpha_k = 1\)) and then run an iterative sub-algorithm that decreases the value of \(\alpha_k\) until the Armijo condition is met.
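
A minimal sketch of this backtracking idea, assuming hypothetical loss_fn and grad_fn functions (a concrete version for logistic regression appears later in this post):

backtrack_sketch <- function(theta, loss_fn, grad_fn, alpha = 1, rho = 0.5, c1 = 1e-4){
  g <- grad_fn(theta)
  f0 <- loss_fn(theta)
  # For p = -g, the Armijo condition reads L(theta - alpha*g) <= L(theta) - c1*alpha*||g||^2
  while(loss_fn(theta - alpha*g) > f0 - c1*alpha*sum(g^2)){
    alpha <- alpha*rho # shrink the step length until there is sufficient descent
  }
  return(alpha)
}

In practice one also caps the number of times the step length is shrunk, as the concrete implementation later in the post does.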

There are other ways to choose the step length \(\alpha_k\), but those are beyond the scope of this post. Nocedal & Wright’s book has a lot more information on the topic.

Examples:

Now that we’ve covered some theory, to explain the concept of gradient descent further, we work through two examples: linear regression and logistic regression.

Linear regression

In linear regression we want to minimize the quadratic loss function

\[\begin{equation} L(x, \beta) = \frac{1}{n} \sum_{i = 1}^n \left( y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_p x_{ip} \right)^2. \end{equation}\]

Taking its partial derivatives, we have for each \(j \in \left\{0, ..., p \right\}\)

\[ \frac{\partial L}{\partial \beta_j} = -\frac{2}{n} \sum_{i = 1}^n \left( x_{ij} \ell_i(x, \beta) \right) \]

where \(x_{i0} = 1\) for each \(i \in \left\{1, ..., n \right\}\) and

\[ \ell_i(x, \beta) = \left( y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_p x_{ip} \right). \]

From here, the gradient used to compute the descent direction in each iteration is

\[ \nabla_{\beta} L(x, \beta) = \left( \frac{\partial L}{\partial \beta_0}, ..., \frac{\partial L}{\partial \beta_p} \right)^T. \]
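
Equivalently, if \(X\) is the \(n \times (p+1)\) design matrix whose first column is all ones, the gradient can be written compactly as \(-\frac{2}{n} X^T (y - X\beta)\). Here is a small vectorized sketch of that form, assuming X and y are already built (the example below uses the explicit component-wise version instead):

gradient_matrix_form <- function(X, y, betas){
  residuals <- y - X %*% betas # the l_i(x, beta) terms
  return(as.vector(-2 * t(X) %*% residuals / nrow(X)))
}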

In the implemented example, I generated a vector \(x \in \mathbb{R}^n\) with \(n = 1000\) such that \(x_i \sim N(0, 1)\) for each \(i \in \left\{1, ..., n \right\}\), and the response variable was \(y = \beta_0 + \beta_1x + \varepsilon\) with \(\varepsilon \sim N(0, 1)\), \(\beta_0 = 2\) and \(\beta_1 = 1\). The code used is shown below, and the generated data can be seen in the following image.

library(tidyverse)

theme_set(theme_bw())


# True parameter values
N <- 1000
beta_0 <- 2
beta_1 <- 1

# Simulate the data: x ~ N(0, 1), y = beta_0 + beta_1*x + noise
set.seed(20170909)
data <- tibble(x = rnorm(N),
               y = beta_0 + beta_1*x + rnorm(N))

# Reference fit with lm, used later for comparison
model <- lm(y~x, data = data)

# Plot the data together with the true regression line
(data %>% 
    ggplot(aes(x, y)) +
    geom_point(size = 0.7, alpha = 0.6) +
    geom_abline(slope = beta_1, intercept = beta_0)
) 

So, to minimize the quadratic loss function, we start with a vector \(\beta^0 \in \mathbb{R}^2\), and in each iteration we update as

\[ \beta^{k+1} = \beta^k - \alpha_k \nabla_{\beta} L(x, \beta) \]

until we meet a stopping criterion. In this case, the stopping criterion was that the norm of the gradient \(\nabla_{\beta} L(x, \beta)\) was smaller than \(0.000001\) or that the algorithm exceeded 100 iterations. The vector of initial parameters was \(\beta^0 = (0, 0)^T\), and I used a fixed \(\alpha_k\) of \(0.1\) for all \(k\), without taking the Armijo condition into account.

The gradient was implemented in R as the following function. It receives a dataframe and a vector of parameters, and returns the gradient vector.

gradient <- function(data, betas){
  n <- nrow(data)
  const <- rep(1, nrow(data))
  # Residuals l_i(x, beta)
  li <- data$y - betas[1]*const - betas[2]*data$x
  # Partial derivatives of the quadratic loss with respect to beta_0 and beta_1
  g1 <- -2*sum(li)/n
  g2 <- -2*sum(li*data$x)/n
  return(c(g1, g2))
}

We call the function and store the values of the parameters and the gradient norm of each iteration in a dataframe.

max_it <- 100
data_gradient_descent <- tibble(it = 1:max_it,
                                beta_0 = rep(0, max_it),
                                beta_1 = rep(0, max_it),
                                gradient_norm = rep(0, max_it))

i = 0
betas <- c(0, 0)
alpha = 0.1
while(i < max_it){
  i = i + 1
  g <- gradient(data, betas)
  g_norm <- sqrt(sum(g^2))
  g_unit <- g/g_norm
  # Keep track of the parameter values and the gradient norm in each iteration
  data_gradient_descent$beta_0[i] <- betas[1]
  data_gradient_descent$beta_1[i] <- betas[2]
  data_gradient_descent$gradient_norm[i] <- g_norm
  # Stop when the gradient norm is small enough
  if(g_norm < 0.000001) break
  # Fixed-step gradient descent update
  betas <- betas - alpha*g
}


data_gradient_descent <- data_gradient_descent[1:i,]
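
As a quick sanity check, we can compare the final iterate with the coefficients estimated by lm; the two should agree up to the stopping tolerance.

# Final gradient descent iterate vs. the lm estimates
rbind(gradient_descent = betas,
      lm = coef(model))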

We can see the values of the parameter vector in each iteration in the following image. The big dot is the real value of the parameters (2 and 1) and the x is the value obtained with the lm function in R. We can see that the implemented algorithm converges to these values; in fact, the estimates match the ones from lm up to the stopping tolerance.

data_gradient_descent %>% 
  ggplot(aes(beta_0, beta_1)) +
  xlab("Beta 0") +
  ylab("Beta 1") +
  geom_segment(
    aes(
      xend = c(tail(beta_0, n = -1), NA),
      yend = c(tail(beta_1, n = -1), NA)
    ),
    size = 0.4,
    color = '#919191',
    arrow = arrow(length = unit(0.18, "cm"))
  ) +
  geom_point(size = 0.4, color = 'black') +
  geom_point(aes(x, y),
             data = tibble(x = beta_0,
                           y = beta_1)) +
  geom_point(aes(x, y),
             data = tibble(x = model$coefficients[1],
                           y = model$coefficients[2]),
             shape = 'x',
             size = 5) +
  theme(
    panel.grid.minor = element_blank()
  )

And here we can see the norm of the gradient vector in each iteration; it keeps decreasing.

data_gradient_descent %>% 
  ggplot(aes(it, gradient_norm)) +
  geom_point(size = 0.7) +
  geom_line(size = 0.4)

Logistic regression

In the case of logistic regression, we want to minimize a loss function called the deviance, which is defined as

\[ L(x, \beta) = - \frac{2}{n} \sum_{i=1}^{n} \left[ y_i \log(h(\beta^T x_i)) + (1-y_i) \log(1-h(\beta^T x_i)) \right] = - \frac{2}{n} \sum_{i=1}^{n}{\ell_i(x, \beta)} \]

where

\[ \beta^T x_i = \sum_{j=0}^{p}{\beta_j x_{ij}}, \]

\[ h(w) = \frac{e^w}{1+e^w} \]

and

\[ \ell_i(x, \beta) = y_i \log(h(\beta^T x_i)) + (1-y_i) \log(1-h(\beta^T x_i)). \]

We get the partial derivatives

\[ \frac{\partial L}{\partial \beta_j} = -\frac{2}{n} \sum_{i = 1}^n { \frac{\partial \ell_i}{\partial \beta_j} } \]

and, using the fact that \(h'(w) = h(w)(1-h(w))\), we get

\[\begin{equation} \begin{split} \frac{\partial \ell_i}{\partial \beta_j} & = \frac{y_i h'(\beta^T x_i) x_{ij} } {h(\beta^T x_i)} + \frac{(1 - y_i) (-1) h'(\beta^T x_i) x_{ij}} {1 - h(\beta^T x_i)} \\ & = \frac{h'(\beta^T x_i) x_{ij} y_i}{h(\beta^T x_i)} - \frac{(1 - y_i) h'(\beta^T x_i) x_{ij}}{1 - h(\beta^T x_i)} \\ & = h'(\beta^T x_i) x_{ij} \left(\frac{y_i}{h(\beta^T x_i)} - \frac{1-y_i}{1-h(\beta^T x_i)} \right) \\ & = h'(\beta^T x_i) x_{ij} \left(\frac{y_i - y_i h(\beta^T x_i) - h(\beta^T x_i) + y_i h(\beta^T x_i)}{h(\beta^T x_i)(1-h(\beta^T x_i))} \right) \\ & = x_{ij}(y_i - h(\beta^T x_i)). \end{split} \end{equation}\]

So, we have that

\[ \frac{\partial L}{\partial \beta_j} = -\frac{2}{n} \sum_{i = 1}^n { x_{ij}(y_i - h(\beta^T x_i)) }, \]

where once more \(x_{i0} = 1\) for all \(i \in \left\{1, ..., n \right\}\).
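
As with linear regression, this gradient has a compact matrix form, \(-\frac{2}{n} X^T \left( y - h(X\beta) \right)\), with \(h\) applied element-wise. A small vectorized sketch, assuming X and y are already built (the code below uses the component-wise version instead):

gradient_logistic_matrix <- function(X, y, betas){
  p_hat <- 1/(1 + exp(-X %*% betas)) # h(X beta), the fitted probabilities
  return(as.vector(-2 * t(X) %*% (y - p_hat) / nrow(X)))
}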

We once again generate a vector \(x \in \mathbb{R}^n\) with \(n = 1000\) such that \(x_i \sim N(0, 1)\) for each \(i \in \left\{1, ..., n \right\}\), and then compute an auxiliary vector \(p_i = \frac{1}{1 + \exp \left( - \beta_0 - \beta_1 x_i \right)}\), with \(\beta_0 = 1\) and \(\beta_1 = 4\). Finally, the response variable \(y\) was built by simulating Bernoulli random variables, such that \(y_i \sim Bern(p_i)\). The generated data can be seen in the following image.

# True parameter values
N <- 1000
beta_0 <- 1
beta_1 <- 4

# Simulate binary data from a logistic model
set.seed(124362)
data <- tibble(x = rnorm(N),
               z = beta_0 + beta_1*x,
               pr = 1/(1 + exp(-z)),
               y = rbinom(N, 1, pr))

# Reference fit with glm, used later for comparison
model <- glm(y ~ x, data = data, family = "binomial")

# Plot the data together with the fitted logistic curve
data %>% 
  ggplot(aes(x, y)) +
  geom_point(size = 0.5, alpha = 0.4) +
  stat_smooth(method="glm", 
              method.args = list(family = "binomial"), 
              se = FALSE,
              size = 0.6,
              color = 'black')

To minimize the deviance, we do exactly as in the previous case: we start with an initial vector of parameters \(\beta^0 \in \mathbb{R}^2\), and in each iteration we update

\[ \beta^{k+1} = \beta^k - \alpha_k \nabla_{\beta} L(x, \beta) \]

until a certain criterion is met. In this case, the stopping criterion was stricter: the algorithm stops when the norm of the gradient \(\nabla_{\beta} L(x, \beta)\) is less than \(0.0001\) and the ratio of gradient norms between consecutive iterations is bigger than \(0.8\) (i.e. the norm is barely decreasing anymore), or when it exceeds 1000 iterations. The vector of initial parameters was \(\beta^0 = (-10, 10)^T\). We did backtracking with an initial \(\alpha_0 = 3\).

The backtracking function implemented is shown below.

backtrack <- function(betas, g, alpha, deviance, data, rho = 0.5, max_iter = 50, c1 = 10^-4){
  funct_value <- do.call(deviance, list(betas, data))
  # Squared norm of the gradient; for p = -g we have grad^T p = -||g||^2
  norm_g_sq <- sum(g^2)
  i = 0
  while(i < max_iter){
    i = i + 1
    left <- do.call(deviance, list(betas - alpha*g, data))
    # Armijo condition: L(beta - alpha*g) <= L(beta) - c1*alpha*||g||^2
    right <- funct_value - c1*alpha*norm_g_sq
    if(left <= right) break
    alpha = alpha*rho
  }
  return(list(alpha = alpha,
              iter = i))
}

The auxiliary \(h\) function, the gradient and the deviance are shown here.

# Logistic (sigmoid) function
h <- function(x){
  return(1/(1 + exp(-x)))
}

gradient <- function(data, betas){
  n <- nrow(data)
  const <- rep(1, nrow(data))
  bx <- betas[1]*const + betas[2]*data$x
  li <- data$y - h(bx)
  # Partial derivatives of the deviance with respect to beta_0 and beta_1
  g1 <- -2*sum(li)/n
  g2 <- -2*sum(li*data$x)/n
  return(c(g1, g2))
}


deviance <- function(betas, data = data){
  n <- nrow(data)
  const <- rep(1, nrow(data))
  bx <- betas[1]*const + betas[2]*data$x
  hbx <- h(bx)
  # (1 - y)*log(1 - h) can give 0*log(0) = NaN; treat those terms as 0
  aux <- (1 - data$y)*log(1 - hbx)
  aux2 <- ifelse(is.nan(aux), 0, aux)
  li <- data$y*log(hbx) + aux2
  d <- -2*sum(li)/n
  return(d)
}
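
As a quick check of the gradient formula, we can compare the analytic gradient with a central finite-difference approximation of the deviance at an arbitrary, made-up point; the two rows should be nearly identical.

betas_check <- c(0.5, -1)
eps <- 1e-6
# Central finite differences of the deviance with respect to beta_0 and beta_1
fd <- c((deviance(betas_check + c(eps, 0), data) - deviance(betas_check - c(eps, 0), data))/(2*eps),
        (deviance(betas_check + c(0, eps), data) - deviance(betas_check - c(0, eps), data))/(2*eps))
rbind(analytic = gradient(data, betas_check),
      finite_difference = fd)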

We call the functions to start the algorithm and save the parameters of interest in a dataframe.

max_it <- 1000
data_gradient_descent <- tibble(it = 1:max_it,
                                beta_0 = rep(0, max_it),
                                beta_1 = rep(0, max_it),
                                gradient_norm = rep(0, max_it),
                                deviance = rep(0, max_it),
                                alpha = rep(0, max_it))

i = 0
betas <- c(-10, 10) # true values are 1 and 4
g <- gradient(data, betas)
g_norm <- sqrt(sum(g^2))
while(i < max_it){
  i = i + 1
  g_norm_0 <- g_norm # old gradient norm
  g <- gradient(data, betas)
  g_norm <- sqrt(sum(g^2))
  g_unit <- g/g_norm
  data_gradient_descent$beta_0[i] <- betas[1]
  data_gradient_descent$beta_1[i] <- betas[2]
  data_gradient_descent$gradient_norm[i] <- g_norm
  data_gradient_descent$deviance[i] <- deviance(betas, data)
  # Stop when the gradient norm is small and barely decreasing
  if(g_norm < 0.0001 & g_norm/g_norm_0 > 0.8) break
  # Choose the step length by backtracking, starting from alpha = 3
  backtrack_result <- backtrack(betas, g, 3, deviance, data)
  alpha <- backtrack_result[[1]]
  data_gradient_descent$alpha[i] <- alpha
  betas <- betas - alpha*g
}


data_gradient_descent <- data_gradient_descent[1:i,]

We can see the values of the parameters in each iteration in the following image.

data_gradient_descent %>% 
  ggplot(aes(beta_0, beta_1)) +
  xlab("Beta 0") +
  ylab("Beta 1") +
  geom_segment(
    aes(
      xend = c(tail(beta_0, n = -1), NA),
      yend = c(tail(beta_1, n = -1), NA)
    ),
    size = 0.4,
    color = '#919191',
    arrow = arrow(length = unit(0.18, "cm"))
  ) +
  geom_point(size = 0.4, color = 'black') +
  geom_point(aes(x, y),
             data = tibble(x = beta_0,
                           y = beta_1)) +
  geom_point(aes(x, y),
             data = tibble(x = model$coefficients[1],
                           y = model$coefficients[2]),
             shape = 'x',
             size = 5) +
  theme(panel.grid.minor = element_blank())

The big point is the real value of the parameters (1 and 4), and the x is the value that we get by using the glm function in R. The algorithm converges to that value, and once again the estimates match the ones from glm up to the stopping tolerance.

The following images show the decreasing values of the gradient norm and the deviance with each iteration.

data_gradient_descent %>% 
  ggplot(aes(it, gradient_norm)) +
  geom_point(size = 0.7) +
  geom_line(size = 0.4)

data_gradient_descent %>% 
  ggplot(aes(it, deviance)) +
  geom_point(size = 0.7) +
  geom_line(size = 0.4)

Stochastic gradient descent (SGD)

In statistical learning it is common to find the need to solve optimization problems of the form

\[\begin{equation} \min_{\theta \in \mathbb{R}^p} L(x, \theta), \quad \text{with} \, \, L(x, \theta) = \frac{1}{n} \sum_{i = 1}^n { \psi_i(x, \theta) }. \end{equation}\]

In both of the previous examples, the loss function we wanted to minimize could be expressed in this way. Gradient descent uses iterations of the form

\[ \theta_{k+1} = \theta_k - \alpha_k \nabla L(\theta_k) :=\theta_k - \frac{\alpha_k}{n} \sum_{i = 1}^n \nabla \psi_i(\theta_k), \]

which involves evaluating \(n\) gradients and then taking an average. In large-scale machine learning, the number of observations \(n\) is very large, so computing all of those gradients in each iteration is expensive. This is why methods such as SGD arise, in which the number of gradients computed per iteration is constant and doesn’t depend on \(n\). SGD uses iterations of the form

\[ \theta_{k+1} = \theta_k - \alpha_k \nabla \psi_{i_k}(\theta_k), \]

where \(i_k \in \left\{1, 2, ..., n \right\}\) is chosen at random. The gradient \(\nabla \psi_{i_k}(\theta_k)\) is an unbiased estimator of \(\nabla L(\theta_k)\). This way, each iteration is very cheap because it involves the computation of only one gradient. A particular \(\nabla \psi_{i_k}(\theta_k)\) may not give a descent direction from \(\theta_k\), but on average we do get descent directions, so the sequence \(\left\{ \theta_0, \theta_1, ... \right\}\) can be guided toward a minimizer \(\theta^*\).
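
To illustrate the unbiasedness claim, here is a small self-contained sketch with made-up linear regression data (separate from the examples below): averaging single-observation gradients over many randomly drawn indices recovers the full-data gradient.

# Made-up data for a simple linear model
set.seed(1)
n_toy <- 200
x_toy <- rnorm(n_toy)
y_toy <- 2 + x_toy + rnorm(n_toy)
betas_toy <- c(0, 0)

grad_i <- function(i, betas){
  # Gradient of the i-th squared error with respect to (beta_0, beta_1)
  li <- y_toy[i] - betas[1] - betas[2]*x_toy[i]
  return(c(-2*li, -2*li*x_toy[i]))
}

# Full gradient: the average of the n per-observation gradients
full_grad <- rowMeans(sapply(1:n_toy, grad_i, betas = betas_toy))

# Average over 10,000 uniformly sampled indices (the SGD estimator in expectation)
sampled <- sample(1:n_toy, 10000, replace = TRUE)
approx_grad <- rowMeans(sapply(sampled, grad_i, betas = betas_toy))

rbind(full_grad, approx_grad) # the two rows should be close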

Examples:

To show this method, the same two examples are implemented using SGD.

Linear regression

As we saw before, we want to minimize the quadratic loss function. Here, each \(\psi_i(x, \theta)\) is

\[ \psi_i(x, \theta) = \left( y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_p x_{ip} \right)^2 = \ell_i^2(x, \beta), \]

with \(\ell_i(x, \beta)\) defined earlier as

\[ \ell_i(x, \beta) = \left( y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_p x_{ip} \right). \]

We can see that the partial derivative of \(\psi_i(x, \beta)\) with respect to each \(\beta_j\) with \(j \in \left\{ 0, ..., p \right\}\) is

\[ \frac{\partial \psi_i}{\partial \beta_j} = \frac{\partial \ell^2_i}{\partial \beta_j} = -2 x_{ij} \ell_i(x, \beta), \]

so the gradient used in each iteration is

\[ \nabla_{\beta} \psi_i(x, \beta) = \left( \frac{\partial \psi_i}{\partial \beta_0}, ..., \frac{\partial \psi_i}{\partial \beta_p} \right)^T. \]

Once again, the vector of initial parameters was \(\beta^0 = (0, 0)^T\). The stopping rule was that the norm of the last per-observation gradient in an epoch was less than \(0.000001\), that the squared norm of the change in the parameter vector over an epoch (i.e. \(||\beta^{k+1} - \beta^k||^2_2\)) was less than \(10^{-15}\), or that there were more than 300 epochs.

In SGD, an epoch is a set of \(n\) accesses to the dataset; that is, in each epoch the gradient of every observation in the dataset has been computed once.

The following code shows the functions implemented to get SGD running. I use for loops, which I know is not the best idea in R, but it’s for illustration purposes. Maybe later I’ll implement it in Rcpp and see the difference in speed.

(Update: I implemented it in Rcpp and it is much faster. Blog post here).

# Gradient for an observation
gradient_row <- function(data_row, betas){
  li <- data_row$y - betas[1] - betas[2]*data_row$x
  g1 <- -2*li
  g2 <- -2*li*data_row$x
  return(c(g1, g2))
}

epoch_update <- function(data, betas, alpha, n_epoch, verbose = 1, reordering = F){
  n <- nrow(data)
  if(reordering) data$ix <- sample(1:n)
  else data$ix <- 1:n
  epoch_values <- tibble(
    n_epoch = rep(n_epoch, n),
    obs = 1:n,
    beta_0 = rep(0, n),
    beta_1 = rep(0, n),
    gradient_norm = rep(0, n))
  
  # Iterate over rows in data
  # Not the best practice in R to use for loops, but it's to illustrate
  for(i in 1:n){
    # Update coefficients
    g <- gradient_row(data[data$ix == i,], betas) 
    betas <- betas - alpha*g
    # Print and write values in table to keep track and make plots
    g_norm <- sqrt(sum(g^2))
    g_unit <- g/g_norm
    epoch_values$beta_0[i] <- betas[1]
    epoch_values$beta_1[i] <- betas[2]
    epoch_values$gradient_norm[i] <- g_norm
    if(verbose == 2){
      cat(
        "\n\tEpoch: ", n_epoch,
        "\n\tObs: ", i, 
        "\n\tbeta_0: ", betas[1], 
        "\n\tbeta_1: ", betas[2],
        "\n\tgradient_norm: ", g_norm, 
        "\n\tDirection: ", g_unit[1], g_unit[2],
        "\n")
    }
  } # End for
  
  if(verbose == 1){
    cat(
      "\n\tEpoch: ", n_epoch,
      "\n\tbeta_0: ", epoch_values$beta_0[n],
      "\n\tbeta_1: ", epoch_values$beta_1[n],
      "\n\tgradient_norm: ", epoch_values$gradient_norm[n],
      "\n")
  }
  return(list(
    epoch_values = epoch_values,
    betas = betas
  ))
}
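
Since data, model, beta_0 and beta_1 currently hold the logistic regression objects from the previous section, we first rebuild the linear regression data and the lm fit, using the same seed and true parameter values as in the first example:

# Rebuild the linear regression data and reference fit from the first example
beta_0 <- 2
beta_1 <- 1
set.seed(20170909)
data <- tibble(x = rnorm(N),
               y = beta_0 + beta_1*x + rnorm(N))
model <- lm(y~x, data = data)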

We start the algorithm and save the parameters in a dataframe.

# Start the algorithm
max_it <- 300
n <- nrow(data)
data_gradient_descent <- tibble(
  epoch = rep(0, max_it*n),
  obs = 1:(max_it*n),
  beta_0 = rep(0, max_it*n),
  beta_1 = rep(0, max_it*n),
  gradient_norm = rep(0, max_it*n))

i = 0
betas <- c(0, 0)
alpha = 0.001
while(i < max_it){
  i = i + 1
  epoch_betas <- epoch_update(data, betas, alpha, i, verbose = 0)
  betas_old <- betas
  betas <- epoch_betas$betas
  data_gradient_descent[((i-1)*n + 1):((i)*n),] <- epoch_betas$epoch_values
  g_norm <- epoch_betas$epoch_values$gradient_norm[n]
  dif_betas_norm <- sum((betas - betas_old)^2)
  if(g_norm < 0.000001 | dif_betas_norm < 1e-15) break
}

data_gradient_descent <- data_gradient_descent[1:(i*n),]
data_gradient_descent$it <- 1:nrow(data_gradient_descent)
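
As before, a quick check: the final SGD iterate should be close to the lm coefficients (with a fixed step length, SGD typically only hovers around the minimizer rather than converging exactly).

# Final SGD iterate vs. the lm estimates
rbind(sgd = betas,
      lm = coef(model))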

We define an auxiliary function that helps us plot iterations and epochs.

plot_gd_iter <- function(data_gradient_descent, model, beta_0, beta_1, denom = 0){
  # Note: n (the number of observations per epoch) is taken from the global environment
  if(denom > 0) {
    # Keep roughly denom iterations per epoch (denom = 1 keeps the first iteration of each epoch)
    data <- data_gradient_descent %>% 
      filter(it %% floor(n/denom) == 1)
  } else {
    data <- data_gradient_descent
  }
  
  gg <- data %>% 
    ggplot(aes(beta_0, beta_1)) +
    xlab("Beta 0") +
    ylab("Beta 1") +
    geom_segment(
      aes(
        xend = c(tail(beta_0, n = -1), NA),
        yend = c(tail(beta_1, n = -1), NA)
      ),
      size = 0.4,
      color = '#919191',
      arrow = arrow(length = unit(0.18, "cm"))
    ) +
    geom_point(size = 0.3, color = 'black') +
    geom_point(aes(x, y),
               data = tibble(x = beta_0,
                             y = beta_1)) +
    geom_point(aes(x, y),
               data = tibble(x = model$coefficients[1],
                             y = model$coefficients[2]),
               shape = 'x',
               size = 5) +
    theme(
      panel.grid.minor = element_blank()
    )
  return(gg)
}

We plot the points of the parameters in each iteration and each epoch. The first plot shows the value at the beginning of each epoch, and the second one shows all iterations.

# Plots only at the beginning of each epoch
plot_gd_iter(data_gradient_descent, model, beta_0, beta_1, 1) 

# Plots all iterations in all epochs
data_gradient_descent %>% 
  ggplot(aes(beta_0, beta_1)) +
  xlab("Beta 0") +
  ylab("Beta 1") +
  geom_path(size = 0.1, color = 'black') +
  geom_point(size = 0.01, color = 'black', alpha = 0.2) +
  geom_point(aes(x, y),
             data = tibble(x = beta_0,
                           y = beta_1)) +
  geom_point(aes(x, y),
             data = tibble(x = model$coefficients[1],
                           y = model$coefficients[2]),
             shape = 'x',
             size = 5) +
  theme(
    panel.grid.minor = element_blank()
  )

We can see in the second figure that the directions zigzag, but in the end they reach the final solution. In GD the directions were much more uniform.

In both figures there’s once more the big dot that represents the real value of the parameters (\(\beta_0=2\) and \(\beta_1 = 1\)) and the x that represents the value from the lm function, but the second image isn’t as clear because of all the noise in the directions.

Logistic regression

For logistic regression, each \(\psi_i(x, \theta)\) is

\[ \psi_i(x, \theta) = -2 \left[ y_i \log(h(\beta^T x_i)) + (1-y_i) \log(1-h(\beta^T x_i)) \right] = -2 \, \ell_i(x, \beta). \]

Using the derivative of \(\ell_i\) computed earlier, we get

\[ \frac{\partial \psi_i}{\partial \beta_j} = -2 \, \frac{\partial \ell_i}{\partial \beta_j} = -2 \, x_{ij}(y_i - h(\beta^T x_i)), \]

which is exactly what the per-observation gradient code below computes.

In the implementation, the stopping criterion was the same as in SGD for linear regression.

This is the code used to compute the directions in which the algorithm moves.

gradient_row <- function(data_row, betas){
  bx <- betas[1] + betas[2]*data_row$x
  li <- data_row$y - h(bx)
  g1 <- -2*li
  g2 <- -2*li*data_row$x
  return(c(g1, g2))
}
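
The data, model, beta_0 and beta_1 objects now hold the linear regression versions again, so we rebuild the logistic regression data and the glm fit, with the same seed and true parameter values as in the gradient descent example:

# Rebuild the logistic regression data and reference fit from the second example
beta_0 <- 1
beta_1 <- 4
set.seed(124362)
data <- tibble(x = rnorm(N),
               z = beta_0 + beta_1*x,
               pr = 1/(1 + exp(-z)),
               y = rbinom(N, 1, pr))
model <- glm(y ~ x, data = data, family = "binomial")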

We start the algorithm.

max_it <- 300
n <- nrow(data)
data_gradient_descent <- tibble(
  epoch = rep(0, max_it*n),
  obs = 1:(max_it*n),
  beta_0 = rep(0, max_it*n),
  beta_1 = rep(0, max_it*n),
  gradient_norm = rep(0, max_it*n))

i = 0
betas <- c(-10, 10)
alpha = 0.01
while(i < max_it){
  i = i + 1
  epoch_betas <- epoch_update(data, betas, alpha, i, verbose = 0)
  betas_old <- betas
  betas <- epoch_betas$betas
  data_gradient_descent[((i-1)*n + 1):((i)*n),] <- epoch_betas$epoch_values
  g_norm <- epoch_betas$epoch_values$gradient_norm[n]
  dif_betas_norm <- sum((betas - betas_old)^2)
  if(g_norm < 0.000001 | dif_betas_norm < 1e-15) break
}

data_gradient_descent <- data_gradient_descent[1:(i*n),]
data_gradient_descent$it <- 1:nrow(data_gradient_descent)

The following figures show the values of the parameters in each iteration. As before, the first figure shows only the beginning of each epoch and the second shows all iterations in all epochs. The values converge to the real values and to the estimates we get from the glm function.

# Plots only at the beginning of each epoch
plot_gd_iter(data_gradient_descent, model, beta_0, beta_1, 1)

# Plots all iterations in all epochs
data_gradient_descent %>% 
  ggplot(aes(beta_0, beta_1)) +
  xlab("Beta 0") +
  ylab("Beta 1") +
  geom_point(size = 0.01, color = 'black', alpha = 0.2) +
  geom_point(aes(x, y),
             data = tibble(x = beta_0,
                           y = beta_1)) +
  geom_point(aes(x, y),
             data = tibble(x = model$coefficients[1],
                           y = model$coefficients[2]),
             shape = 'x',
             size = 5) +
  theme(
    panel.grid.minor = element_blank()
  )

That’s it. :D