

Gradient Descent

Today, we continue our discussion of supervised learning, where we have labeled training data and our goal is to train a model that accurately predicts the labels of unseen testing data. Recall that our general approach to supervised learning is to use empirical risk minimization: We focus on a class of models, define a loss function that quantifies how accurately a particular model explains the training data, and search for a model with low loss.

Last week, we considered the class of linear models, i.e., models whose prediction is a weighted sum of the input features. We chose to measure loss via mean squared error, a choice rooted both in convenience and in a compelling modeling assumption (if the data is generated by a model in our class plus some Gaussian noise, then minimizing the mean squared error yields the maximum likelihood estimate). In order to find the best parameters of the linear model, we used our knowledge of gradients to exactly compute the parameters that minimize the mean squared error loss.

This week, we will address two of the nagging issues with computing the best parameters of a linear model. We begin with the issue of runtime: computing the optimal parameters requires building a large matrix and inverting it, which takes \(O(nd^2 + d^3)\) time, where \(n\) is the number of data points and \(d\) is the number of features. We will now see how gradient descent can speed up this process.

Gradient Descent

The mean squared error of a linear model is particularly well-behaved because it is convex, i.e., it has a single minimum. Previously, we computed the parameters where the gradient is 0, which, by convexity, immediately gave us the single optimal point. However, we could instead take a more relaxed approach: rather than jumping immediately to the best parameters, we can iterate towards better parameters by taking steps towards lower loss.

Gradient descent is an iterative method for moving in the direction of steepest descent. Concretely, the process produces a sequence of parameters \(\mathbf{w}^{(1)}, \mathbf{w}^{(2)}, \ldots\). At each step, we compute the direction of steepest ascent, i.e., the gradient of the loss function with respect to each parameter. The gradient quantifies how the loss function responds as we tweak each parameter. If the partial derivative is positive (increasing the parameter increases the loss), then we want to decrease the parameter. Analogously, if the partial derivative is negative (increasing the parameter decreases the loss), then we want to increase the parameter. In both cases, we move in the direction opposite the gradient. Hence, we reach the next parameter vector by subtracting a scaled gradient from the current parameter vector: \[ \begin{align*} \mathbf{w}^{(t+1)} \gets \mathbf{w}^{(t)} - \alpha \nabla_\mathbf{w} \mathcal{L}(\mathbf{w}^{(t)}), \end{align*} \] where \(\alpha\) is a small positive constant called the step size or learning rate.
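To make the update concrete, here is a minimal sketch of the gradient descent loop in Python. The function loss_gradient is a placeholder for whatever computes \(\nabla_\mathbf{w} \mathcal{L}(\mathbf{w})\), w_init is assumed to be a NumPy array of initial parameters, and the defaults for alpha and num_steps are illustrative choices:

    def gradient_descent(loss_gradient, w_init, alpha=0.01, num_steps=1000):
        """Repeatedly step opposite the gradient to reduce the loss."""
        w = w_init.copy()
        for _ in range(num_steps):
            w = w - alpha * loss_gradient(w)  # w^(t+1) = w^(t) - alpha * gradient
        return w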

Notice that this approach stops when we reach a point where the gradient is 0, i.e., when we have reached a local minimum.

Beyond the stopping condition, why does this work? Consider the one-dimensional setting. The derivative of the loss function is \[ \begin{align*} \frac{\partial \mathcal{L}(w)}{\partial w} = \lim_{\Delta \to 0} \frac{\mathcal{L}(w + \Delta) - \mathcal{L}(w)}{\Delta}, \end{align*} \] so, for small \(\Delta\), we can approximate the change in the loss function as \[ \begin{align*} \mathcal{L}(w+\Delta) - \mathcal{L}(w) &\approx \frac{\partial \mathcal{L}(w)}{\partial w} \cdot \Delta. \end{align*} \] We want \(\mathcal{L}(w+\Delta)\) to be smaller than \(\mathcal{L}(w)\), so we want \(\frac{\partial \mathcal{L}(w)}{\partial w} \Delta < 0\). This can be achieved by setting \(\Delta = -\alpha \frac{\partial \mathcal{L}(w)}{\partial w}\), where \(\alpha\) is a small positive constant. Then \(w^{(t+1)} = w^{(t)} - \alpha \frac{\partial \mathcal{L}(w^{(t)})}{\partial w}\) is a step in the direction of descent.
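As a quick sanity check, consider \(\mathcal{L}(w) = w^2\), so \(\frac{\partial \mathcal{L}(w)}{\partial w} = 2w\). Starting from \(w^{(0)} = 1\) with \(\alpha = 0.1\), each step gives \[ w^{(t+1)} = w^{(t)} - 0.1 \cdot 2 w^{(t)} = 0.8\, w^{(t)}, \] so \(w^{(t)} = 0.8^t\), which converges to the minimum at \(w = 0\).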

In the multi-dimensional setting, the gradient collects the partial derivatives of the loss function with respect to each parameter: \[ \frac{\partial \mathcal{L}(\mathbf{w})}{\partial w_i} = \lim_{\Delta \to 0} \frac{\mathcal{L}(\mathbf{w} + \Delta \mathbf{e}_i) - \mathcal{L}(\mathbf{w})}{\Delta}, \] where \(\mathbf{e}_i\) is the \(i\)th standard basis vector. Then, for small \(\Delta\), we can approximate the change in the loss function as \[ \mathcal{L}(\mathbf{w} + \Delta \mathbf{e}_i) - \mathcal{L}(\mathbf{w}) \approx \frac{\partial \mathcal{L}(\mathbf{w})}{\partial w_i} \cdot \Delta = \langle \nabla_\mathbf{w} \mathcal{L}(\mathbf{w}), \Delta \mathbf{e}_i \rangle. \] We can write a general vector \(\mathbf{v}\) as a linear combination of the standard basis vectors, i.e., \(\mathbf{v} = \sum_{i=1}^d v_i \mathbf{e}_i\). Then, we have \[ \begin{align*} \mathcal{L}(\mathbf{w} + \Delta \mathbf{v}) - \mathcal{L}(\mathbf{w}) &\approx \sum_{i=1}^d (\text{change in loss from stepping in the direction of } \mathbf{e}_i) \\&= \sum_{i=1}^d \Delta v_i \frac{\partial \mathcal{L}(\mathbf{w})}{\partial w_i}. \end{align*} \] If we want to move in the direction of steepest descent, we can set \(\Delta \mathbf{v} = -\alpha \nabla_\mathbf{w} \mathcal{L}(\mathbf{w})\), where \(\alpha\) is a small positive constant. Then, we have \(\mathcal{L}(\mathbf{w} - \alpha \nabla_\mathbf{w} \mathcal{L}(\mathbf{w})) - \mathcal{L}(\mathbf{w}) \approx -\alpha \langle \nabla_\mathbf{w} \mathcal{L}(\mathbf{w}), \nabla_\mathbf{w} \mathcal{L}(\mathbf{w}) \rangle = -\alpha \|\nabla_\mathbf{w} \mathcal{L}(\mathbf{w})\|^2\).

Could we choose a better direction \(\mathbf{v}'\) to move in? Well, recall that for any vectors \(\mathbf{a}\) and \(\mathbf{b}\), we have \(\langle \mathbf{a}, \mathbf{b} \rangle = \|\mathbf{a}\| \|\mathbf{b}\| \cos(\theta)\), where \(\theta\) is the angle between the two vectors. The smallest value of \(\cos(\theta)\) is \(-1\), which occurs when the two vectors point in opposite directions. So, among all directions of a fixed length, the change in loss \(\langle \nabla_\mathbf{w} \mathcal{L}(\mathbf{w}), \Delta \mathbf{v} \rangle\) is most negative when we take the step in the direction of the negative gradient, i.e., \(- \nabla_\mathbf{w} \mathcal{L}(\mathbf{w})\).

For linear models, we already know the gradient of the mean squared error loss: \[ \nabla_\mathbf{w} \mathcal{L}(\mathbf{w}) = \frac{2}{n} \mathbf{X}^\top (\mathbf{X}\mathbf{w} - \mathbf{y}). \] In contrast to the \(O(nd^2 + d^3)\) time required to compute the exact solution, we can now compute the gradient in \(O(nd)\) time, as long as we restrict ourselves to matrix-vector multiplications rather than matrix-matrix multiplications. The total time complexity of gradient descent is \(O(T nd)\), where \(T\) is the number of iterations of gradient descent.
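As a sketch, here is how this gradient computation might look in NumPy; the parenthesization keeps everything to matrix-vector products, so the cost stays \(O(nd)\):

    def mse_gradient(X, y, w):
        """Gradient of the mean squared error of a linear model, in O(nd) time."""
        n = X.shape[0]
        residual = X @ w - y               # matrix-vector product: O(nd)
        return (2 / n) * (X.T @ residual)  # matrix-vector product: O(nd)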

While we have achieved a significant speedup, \(O(T nd)\) could still be prohibitively large when we have a large number of data points \(n\) and/or a large number of features \(d\). Our solution will be a stochastic approach, where we only use a small random subset of the data to compute the gradient.

Stochastic Gradient Descent

Our approach will be similar to gradient descent, except now we will compute the gradient using only the data in a small random subset, or mini-batch, of the training data. For the mean squared error loss, we can write the loss function for a random subset \(S\) of the data as \[ \mathcal{L}_S(\mathbf{w}) = \frac{1}{|S|} \sum_{i \in S} (f(\mathbf{x}^{(i)}) - y^{(i)})^2. \] Then, for our linear model, the gradient of the loss function with respect to the parameters \(\mathbf{w}\) is given by \[ \nabla_\mathbf{w} \mathcal{L}_S(\mathbf{w}) = \frac{2}{|S|} \mathbf{X}_S^\top (\mathbf{X}_S \mathbf{w} - \mathbf{y}_S), \] where \(\mathbf{X}_S\) is the data matrix for the subset \(S\) and \(\mathbf{y}_S\) is the target vector for the subset \(S\). One iteration of stochastic gradient descent takes time \(O(|S|d)\), which can be much faster than the \(O(nd)\) time required to compute the gradient for the full dataset.
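Putting this together, a minimal mini-batch SGD loop might look as follows; the batch size, step size, and step count here are illustrative choices, not prescribed values:

    import numpy as np

    def sgd(X, y, w_init, alpha=0.01, batch_size=32, num_steps=1000, seed=0):
        """Stochastic gradient descent for linear regression with mean squared error."""
        rng = np.random.default_rng(seed)
        w = w_init.copy()
        n = X.shape[0]
        for _ in range(num_steps):
            S = rng.choice(n, size=batch_size, replace=False)    # random subset S
            X_S, y_S = X[S], y[S]
            grad = (2 / batch_size) * (X_S.T @ (X_S @ w - y_S))  # O(|S| d)
            w = w - alpha * grad
        return w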

Adaptive Step Sizes

The step size \(\alpha\) is a crucial hyperparameter in gradient descent. If \(\alpha\) is too small, then the algorithm takes a long time to converge because each step makes little progress towards the minimum. If \(\alpha\) is too large, then the algorithm may overshoot the minimum and diverge, repeatedly stepping in the right direction but by too much. We want a step size \(\alpha\) that is just right, allowing us to make progress towards the minimum without overshooting.
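Both failure modes are easy to see on the one-dimensional loss \(\mathcal{L}(w) = w^2\) from before, whose gradient is \(2w\); the step sizes below are illustrative:

    def run(alpha, w=1.0, steps=10):
        """Run a few steps of gradient descent on L(w) = w^2 starting from w = 1."""
        for _ in range(steps):
            w = w - alpha * 2 * w  # gradient of w^2 is 2w
        return w

    print(run(alpha=0.001))  # ~0.98: too small, barely moves from the start
    print(run(alpha=0.1))    # ~0.11: steadily shrinks towards the minimum at 0
    print(run(alpha=1.5))    # 1024.0: overshoots so badly that |w| doubles each step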

There are several strategies for choosing the step size:

  • Manual Search: We tune the step size by hand, typically adjusting it exponentially (e.g., multiplying by a factor of \(3\) or \(1/3\)). If the loss consistently improves, we try increasing the step size; if the loss diverges or oscillates, we decrease it.

  • Learning Rate Decay: This is an automated schedule where we start with a large step size to quickly find a good region of the parameter space, and then decrease it over time (e.g., \(\alpha^{(t)} = \alpha^{(0)} / (1 + kt)\)). This allows for coarse adjustments early on and fine-tuned adjustments as we approach the minimum.

  • Adaptive Step Sizes (Adagrad): A more advanced approach is to adapt the learning rate for each parameter individually based on the history of gradients. The intuition is as follows: if a parameter has a large gradient (steep slope), we are at risk of overshooting, so we should decrease its effective step size. If a parameter has a small gradient (flat slope), progress is slow, so we should maintain a larger step size to speed up convergence. One implementation of this idea (Adagrad) accumulates the sum of squared gradients: \[ G_i^{(t)} = \sum_{\tau=1}^{t} \left( \frac{\partial \mathcal{L}(\mathbf{w}^{(\tau)})}{\partial w_i} \right)^2 \] We then update the parameter \(w_i\) using a base learning rate \(\eta\), scaled by this history: \[ w_i^{(t+1)} \gets w_i^{(t)} - \frac{\eta}{\sqrt{G_i^{(t)} + \epsilon}} \cdot \frac{\partial \mathcal{L}(\mathbf{w}^{(t)})}{\partial w_i} \] Here, \(\epsilon\) is a small constant (e.g., \(10^{-8}\)) to ensure we never divide by zero. Notice that parameters with consistently steep gradients accumulate a large \(G_i\), naturally dampening their learning rate to prevent oscillation. A short sketch of this update in code appears after this list.
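Here is a minimal sketch of the Adagrad update in Python, again with a placeholder loss_gradient function and illustrative values for the base rate eta and for eps:

    import numpy as np

    def adagrad(loss_gradient, w_init, eta=0.1, eps=1e-8, num_steps=1000):
        """Adagrad: scale each parameter's step size by its accumulated squared gradients."""
        w = w_init.copy()
        G = np.zeros_like(w)  # per-parameter running sum of squared gradients
        for _ in range(num_steps):
            g = loss_gradient(w)
            G += g ** 2
            w = w - (eta / np.sqrt(G + eps)) * g  # elementwise: steep coordinates slow down
        return w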

In addition to the step size, the direction of each step is also important.

Momentum

The idea of gradient descent is to converge to a local minimum of the loss function, but things can go wrong even if we have the right step size: the gradient may not point towards the minimum if, for example, the loss function is much steeper in some directions than in others. The plot illustrates this issue for a convex loss function on two parameters \(w_1\) and \(w_2\), where the loss function is drawn via its level sets. In the plot, standard gradient descent takes many steps whose directions partially cancel out, resulting in a zig-zag pattern that slows convergence.

Our solution is to keep track of the direction we have been moving in and use it to inform our next step. This idea, called momentum, retains a running average of the gradients, which smooths out the direction of the steps. We can think of momentum as a ball rolling down a hill: even when the ball is pushed left or right, it continues to roll downwards. A standard implementation of momentum is as follows: \[ \begin{align*} \mathbf{v}^{(t+1)} &= \beta \mathbf{v}^{(t)} + \nabla_\mathbf{w} \mathcal{L}_S(\mathbf{w}^{(t)}) \\ \mathbf{w}^{(t+1)} &= \mathbf{w}^{(t)} - \alpha \mathbf{v}^{(t+1)} \end{align*} \] Here, \(\beta \in [0, 1)\) is a hyperparameter (often set to \(0.9\)) that acts like friction, controlling how much of the previous velocity is retained. By adding a fraction of the previous update to the current one, the oscillations in opposite directions cancel each other out, while the steps along the flat valley add up, accelerating convergence toward the minimum.
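A direct translation of these updates into Python might look like the following sketch, again with a placeholder loss_gradient (which could be the mini-batch gradient from before) and illustrative hyperparameters:

    import numpy as np

    def momentum_gd(loss_gradient, w_init, alpha=0.01, beta=0.9, num_steps=1000):
        """Gradient descent with momentum: a decaying sum of past gradients sets the direction."""
        w = w_init.copy()
        v = np.zeros_like(w)  # velocity: exponentially decaying sum of past gradients
        for _ in range(num_steps):
            v = beta * v + loss_gradient(w)  # v^(t+1) = beta * v^(t) + gradient
            w = w - alpha * v
        return w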