Mario Becerra

Gradient descent for shallow neural networks

In this post, I show the implementation of gradient descent in Rcpp for shallow artificial neural networks (ANNs) used for classification. Rcpp is an R package that provides integration between R and C++. It compiles the functions, so it is much faster to use than base R code, especially in situations with loops.

Brief introduction to ANNs

The type of ANN used in this post is called a feedforward neural network or multilayer perceptron (MLP). These models are basically a composition of non-linear functions of the data, i.e., \(f^{(1)}(f^{(2)}(f^{(3)}(x)))\), where \(f^{(1)}\) is called the first layer, \(f^{(2)}\) the second layer, and etc. In this post, I call them shallow because they have only one hidden layer (two layers in total), in contrast to the deep models used in deep learning.

The type of problem considered in this post is a binary classification problem, so we consider a data matrix \(X \in \mathbb{R}^{m \times n_x}\) and a response variable \(y \in \{0,1\}^m\). Let \(p_k\) be the probability of an observation \(x_k \in \mathbb{R}^{n_x}\) of being 1. The single layer feedforward ANN models this quantity as \(\hat{p_k} = \sigma(\beta_0 + \sum_{i = 1}^q \beta_i a_i^{(k)})\) with \(a_i^{(k)} = \sigma(\theta_0^{(i)} + \sum_{l = 1}^{n_x} \theta_l^{(i)} x_l^{(k)})\), and \(\sigma(x)\) is a non-linear function. Common choices of \(\sigma(x)\) are the logistic function, \(\sigma(x) = (1 + e^{-x})^{-1}\), and \(\sigma(x) = \mathrm{tanh}(x)\). The \(x\) is called the input layer, the \(a_i^{(k)}\) is called the hidden layer, and the \(\hat{p_k}\) is called the output layer.

For example, the following image shows the diagram of a feedforward ANN with 4 input variables (\(x_1, x_2, x_3, x_4\)) and two hidden nodes (\(a_1, a_2\)).

Gradient descent

Partial derivatives

We want to find parameters \(\theta_l^{(j)}\) and \(\beta_j\) for \(l = 1, ..., n_x\) and \(j = 1, ..., q\) that minimize a certain loss function. Since we are working on a binary classification problem, a suitable loss function is the sometimes called cross-entropy loss or deviance loss:

\[ \mathcal{L}(\theta, \beta) = \frac{1}{m} \sum_{k = 1}^m \left[ -(y^{(k)} \mathrm{log}(\hat{p_k}) + (1 - y^{(k)}) \mathrm{log}(1 - \hat{p_k})) \right]. \]

Now, let’s rewrite de loss function in terms of another function \(\mathcal{l_k}\), defined as

\[ \mathcal{l_k}(\theta, \beta) = -(y^{(k)} \mathrm{log}(\hat{p_k}) + (1 - y^{(k)}) \mathrm{log}(1 - \hat{p_k})), \]

such that

\[ \mathcal{L}(\theta, \beta) = \frac{1}{m} \sum_{k = 1}^m \mathcal{l_k}(\theta, \beta). \]

To do gradient descent, we need to find the partial derivative of \(\mathcal{L}\) with respect to each parameter. We first take the partial derivative with respect to \(\mathcal{l_k}\) and then sum them and divide them by \(m\) to have the partial derivative with respect to \(\mathcal{L}\).

First, we go with the \(\beta\) parameters. For each \(l \in \left\{0, ..., q \right\}\), we have that according to the chain rule

\[ \frac{\partial \mathcal{l_k}}{\partial \beta_l} = \frac{\partial \mathcal{l_k}}{\partial p_k} \frac{\partial \mathcal{p_k}}{\partial w_k} \frac{\partial \mathcal{w_k}}{\partial \beta_l} \]

with \(w_k = \beta_0 + \sum_{l = 1}^q \beta_l a_l^{(k)}\). Then we have that

\[ \frac{\partial \mathcal{l_k}}{\partial \beta_l} = \left[ \frac{1 - y^{(k)}}{1 - \hat{p_k}} - \frac{y^{(k)}}{\hat{p_k}} \right] \left[\sigma(w_k) ( 1 - \sigma(w_k)) \right] \left[ a_l^{(k)} \right] = (p_k - y^{(k)}) a_l^{(k)}, \]

and the special case of \(\beta_0\), which happens to be

\[ \frac{\partial \mathcal{l_k}}{\partial \beta_0} = p_k - y^{(k)}. \]

And for the \(\theta\) parameters, for each \(n = 1, ..., n_x\) and \(l = 1, ..., q\) we have the following:

\[ \frac{\partial \mathcal{l_k}}{\partial \theta_l^{(n)}} = \frac{\partial \mathcal{l_k}}{\partial p_k} \frac{\partial p_k}{\partial w_k} \frac{\partial w_k}{\partial a_l^{(k)}} \frac{\partial a_l^{(k)}}{\partial z_l^{(k)}} \frac{\partial z_l^{(k)}}{\partial \theta_l^{(n)}}, \]

with \(z_l^{(k)} = \sum_{i = 1}^{n_x} \theta_l^{(i)} x_i^{(k)}\). So,

\[ \frac{\partial \mathcal{l_k}}{\partial \theta_l^{(n)}} = \left[ \frac{1 - y^{(k)}}{1 - \hat{p_k}} - \frac{y^{(k)}}{\hat{p_k}} \right] \left[\sigma(w_k) ( 1 - \sigma(w_k)) \right] \left[\beta_l \right] \left[ \sigma(z_l^{(k)}) \left(1 - \sigma(z_l^{(k)}) \right) \right] \left[ x_n^{(k)} \right] \]

Simplifying,

\[ \frac{\partial \mathcal{l_k}}{\partial \theta_l^{(n)}} = \left[ p_k - y^{(k)} \right] \left[ a_l^{(k)} (1 - a_l^{(k)}) \right] \beta_l x_n^{(k)}. \]

Then we have that

\[ \frac{\partial \mathcal{L}}{\partial \beta_l} = \frac{1}{m} \sum_{k = 1}^m \frac{\partial \mathcal{l_k}}{\partial \beta_l} = \frac{1}{m} \sum_{k = 1}^m (p_k - y^{(k)}) a_l^{(k)} \] and

\[ \frac{\partial \mathcal{L}}{\partial \theta_l^{(n)}} = \frac{1}{m} \sum_{k = 1}^m \frac{\partial \mathcal{l_k}}{\partial \theta_l^{(n)}} = \frac{1}{m} \sum_{k = 1}^m \left[ p_k - y^{(k)} \right] \left[ a_l^{(k)} (1 - a_l^{(k)}) \right] \beta_l x_n^{(k)}. \]

Algorithm

So, in order to do the gradient descent algorithm, we need to have an initial value for \(\theta\) and \(\beta\), and then update in the direction that the gradient dictates. So, for each iteration \(i = 1, ..., n_{\mathrm{max}}\), we update in the following way:

\[ \beta_i = \beta_{i-1} - \alpha \nabla_{\beta} \mathcal{L(\theta_{i-1}, \beta_{i-1})} \]

\[ \theta_i = \theta_{i-1} - \alpha \nabla_{\theta} \mathcal{L(\theta_{i-1}, \beta_{i-1})} \]

where \(\nabla_{\beta} \mathcal{L(\theta_{i-1}, \beta_{i-1})}\) denotes the gradient vector of the loss function with respect to \(\beta\), that is

\[ \nabla_{\beta} \mathcal{L(\theta_{i-1}, \beta_{i-1})} = \left( \frac{\partial \mathcal{L}}{\partial \beta_0}, \frac{\partial \mathcal{L}}{\partial \beta_1}, ..., \frac{\partial \mathcal{L}}{\partial \beta_q} \right)^T \]

and

\[ \nabla_{\theta} \mathcal{L(\theta_{i-1}, \beta_{i-1})} = \left( \frac{\partial \mathcal{L}}{\partial \theta_0^{(0)}}, \frac{\partial \mathcal{L}}{\partial \theta_0^{(1)}}, ..., \frac{\partial \mathcal{L}}{\partial \theta_0^{(n_x)}}, \frac{\partial \mathcal{L}}{\partial \theta_1^{(0)}}, \frac{\partial \mathcal{L}}{\partial \theta_1^{(1)}}, ..., \frac{\partial \mathcal{L}}{\partial \theta_q^{(n_x)}} \right)^T. \]

The parameter \(\alpha\) is usually small, such as \(0.01\), but there are several ways to choose it. One way to choose it is by line search, where we start with a relatively large value of \(\alpha\) (such as 1) and iteratively make it smaller until a certain amount of the loss function decreases. In this exercise, we will keep a fixed value for all iterations.

The gradient descent algorithm usually stops after a certain number of iterations (\(n_{\mathrm{max}}\)), or when the norm of the gradients is very close to 0, or when the change in the norm of the parameters is under a certain threshold close to 0.

R code

First, we write an Rcpp function that computes the gradient given the data \(X\), the current values of \(\beta_i\) and \(\theta_i\), the response variable values \(y\), and the vector of computed probabilities \(\hat{p}\) given \(\beta_i\) and \(\theta_i\). This function returns a list with the value of the gradients at that point.

# The X matrix is assumed to have a column vector of 1s

Rcpp::cppFunction(
" List compute_gradient(NumericMatrix X, 
                        NumericVector p_hat, 
                        NumericVector y, 
                        NumericVector beta,
                        NumericMatrix A) {
    int q = beta.size() - 1, nx = X.ncol(), m = X.nrow();
    
    NumericVector dL_dbeta(q+1);
    NumericMatrix dL_dtheta(nx, q);
    double dbeta;
    double pk_minus_yk;
    double sum_beta0;
    double sum_theta;
    double sum_beta;

    for(int l = 0; l < q; l++){
      for(int n = 0; n < nx; n++){

        sum_theta = 0;
        sum_beta = 0;
        sum_beta0 = 0;
        
        for(int k = 0; k < m; k++){
          pk_minus_yk = p_hat(k) - y(k);
          dbeta = pk_minus_yk*A(k,l);
          sum_theta = sum_theta + dbeta*beta(l+1)*(1-A(k,l))*X(k,n);
          sum_beta = sum_beta + dbeta;
          sum_beta0 = sum_beta0 + pk_minus_yk;
        } // end for k to m
        dL_dtheta(n,l) = sum_theta/m;
      } // end for n to nx
      dL_dbeta(l+1) = sum_beta/m;
    } // end for l to q
    dL_dbeta(0) = sum_beta0/m;
    
    return List::create(
    dL_dbeta,
    dL_dtheta
    );
  }
"
)

Now, we write the function that will iteratively call the gradient function and update the values. It takes as input the data matrix \(X\), the response variable vector \(y\), the number of nodes in the hidden layere \(q\), the \(\alpha\) parameter for gradient descent, the maximum number of iterations of the algorithm, the initial parameters \(\beta\) and \(\theta\), and the seed to replicate the results.

ann_rcpp <- function(X, 
                     y, 
                     q = 3, 
                     alpha = 0.01, 
                     n_iter = 200, 
                     init_beta = "random", 
                     init_theta = "random", 
                     seed = "random"){
  # X: data matrix.
  # y: response vector.
  # q: number of hidden nodes.
  # alpha: learning rate.
  # n_iter: number of iterations.
  # init_beta: initial beta vector. If "random", it will be chosen at random.
  # init_theta: initial theta matrix. If "random", it will be chosen at random.
  # seed: used to for initial beta and theta.
  
  m = nrow(X)
  X = cbind(rep(1, m), X)
  nx = ncol(X)
  
  if(init_beta == "random") {
    if(seed != "random") set.seed(seed)
    beta = 0.1*runif(q + 1, -0.5, 0.5)
    init_beta = beta
  } else{
    beta = init_beta
  }
  if(init_theta == "random"){
    if(seed != "random") set.seed(seed)
    theta = replicate(q, 0.1*runif(nx, -0.5, 0.5))
    init_theta = theta
  } else {
    theta = init_theta
  }
  
  # Create a dataframe that will show values of interest in each iteration
  convergence_monitor_df <- data_frame(
    iter = 1:n_iter,
    lossf = as.numeric(rep(NA, n_iter)),
    norm_diff_params = as.numeric(rep(NA, n_iter)),
    norm_gradient = as.numeric(rep(NA, n_iter))
  )
  
  i = 0
  while(i < n_iter){
    if(i %% 1000 == 0) cat("Iter:", i, "\n")
    i = i + 1
    z = X %*% theta
    A = sigma(z)
    A_aug = cbind(rep(1, m), A)
    p_hat = sigma(A_aug %*% beta)
    
    gradients <- compute_gradient(X, p_hat, y, beta, A)
    dL_dbeta <- gradients[[1]]
    dL_dtheta <- gradients[[2]]
    
    theta_old <- theta
    beta_old <- beta
    
    beta <- beta - alpha*dL_dbeta
    theta <- theta - alpha*dL_dtheta
    
    norm_diff_beta_sq <- sum((beta - beta_old)^2)
    norm_diff_theta_sq <- sum((theta - theta_old)^2)
    
    convergence_monitor_df$lossf[i] <- loss_function(p_hat, y)
    convergence_monitor_df$norm_diff_params[i] <- sqrt(norm_diff_beta_sq + norm_diff_theta_sq)
    convergence_monitor_df$norm_gradient[i] <- sqrt(sum(dL_dbeta^2) + sum(dL_dtheta^2))
    
  }
  
  out <- list(
    beta = beta,
    theta = theta, 
    init_theta = init_theta,
    init_beta = init_beta,
    convergence_monitor_df = convergence_monitor_df
  )
  return(out)
}

We also write the auxiliary function that computes \(\sigma(x)\), the logistic function.

sigma <- function(x){
  return(1/(1 + exp(-x)))
}

We write the loss function given a vector of response variables \(y\) and predicted probabilities \(\hat{p}\).

loss_function <- function(p_hat, y){
  lossf = - 2 * mean(y*log(p_hat) + (1-y)*log(1-p_hat))
  return(lossf)
}

And a function that, given a computed ANN, makes predictions for new data \(X_{\mathrm{new}}\).

predict <- function(ann, X){
  m = nrow(X)
  X = cbind(rep(1, m), X)
  zeta = X %*% ann$theta
  A = sigma(zeta)
  A_aug = cbind(rep(1, m), A)
  p_hat = sigma(A_aug %*% ann$beta)
  return(p_hat)
}

This next function plots the value of the loss function in each iteration of the gradient descent algorithm.

plot_loss_iter <- function(ann){
  
  gg_lossf <- ann$convergence_monitor_df %>% 
    ggplot(aes(iter, lossf)) + 
    geom_line(size = 0.5) +
    geom_line() +
    ylab("Loss function") +
    xlab("Iteration")

  return(gg_lossf)
}

Examples

The following examples were taken from the course taught by Felipe González at ITAM. They show how the code can approximate non-linear functions quite accurately.

Example 1

This example is a function of a single variable \(x\) such that for each \(x\), the probability of \(x\) being 1 is \(\sigma(2 - 3x^2)\), where \(\sigma(x)\) is the logistic function. The data generating function and the generated data can be seen in the following plot. Six hundred data points were generated.

set.seed(201801)

dat_1 <- data.frame(x_1 = runif(600, -2, 2)) %>% 
  mutate(y = rbinom(600, 1, sigma(2 - 3 * x_1^2)))

dat_p <- data.frame(x = seq(-2, 2, 0.01)) %>% 
  mutate(p = sigma(2 - 3 * x^2))

dat_p %>% 
  ggplot() +
  geom_line(aes(x, p), col = 'red') +
  geom_jitter(data = dat_1, aes(x = x_1, y = y), 
              alpha = 0.5,
              size = 0.8,
              position = position_jitter(height=0.01))

Now, we call the implemented function. We set it to 8000 iterations. This value was chosen manually. Ideally, the function would have some stopping criteria besides the number of iterations.

ann_1 <- ann_rcpp(as.matrix(dat_1$x_1), dat_1$y, q = 4, alpha = 0.3, n_iter = 8000, seed = 2018)
## Iter: 0 
## Iter: 1000 
## Iter: 2000 
## Iter: 3000 
## Iter: 4000 
## Iter: 5000 
## Iter: 6000 
## Iter: 7000

The next image shows the plot of the value of the loss function in each iteration. It can be seen that the value decreases with each iteration.

plot_loss_iter(ann_1)

And now, we compute the predicted values for the generated data. The plot shows the original data generating function in red and the computed function in blue. We can see that the adjusted values are quite good.

predictions_1 <- predict(ann_1, as.matrix(dat_1$x_1))

dat_1 %>% 
  mutate(pred = predictions_1) %>% 
  ggplot() + 
  geom_line(aes(x = x_1, y = pred),
            col = 'blue') +
  geom_jitter(
    aes(x_1, y),
    alpha = 0.5,
    size = 0.8,
    position = position_jitter(height=0.01)) +
  geom_line(data = dat_p, aes(x = x, y = p), col='red')