Gradient Descent and Polynomial Regression
Today, we continue our discussion of supervised learning, where we have labeled training data and our goal is to train a model that accurately predicts the labels of unseen testing data. Recall that our general approach to supervised learning is to use empirical risk minimization: We focus on a class of models, define a loss function that quantifies how accurately a particular model explains the training data, and search for a model with low loss.
Last week, we considered the class of linear models, i.e., the prediction is a weighted linear combination of the input features. We chose to measure loss via mean squared error, a choice rooted both in convenience and in a compelling modeling assumption (if the data is generated by a linear process with Gaussian noise, then minimizing the mean squared error yields the maximum likelihood estimate). In order to find the best parameters of the linear model, we used our knowledge of gradients to exactly compute the parameters that minimize the mean squared error loss.
This week, we will address two of the nagging issues with computing the best parameters of a linear model. We begin with the issue of runtime; computing the optimal parameters requires building a large matrix and inverting it, which can be computationally expensive. We will now see how we can use gradient descent to speed up this process.
Gradient Descent
The mean squared error of a linear model is particularly well-behaved because it is convex, i.e., there is a single minimum. Previously, we computed the parameters where the gradient is 0, which, by convexity, immediately gave us the single optimal point. However, we could instead use a more relaxed approach; rather than jumping immediately to the best parameters, we can iterate towards better parameters by taking steps towards lower loss.
Gradient descent is an iterative method for moving in the direction of steepest descent. Concretely, the process produces a sequence of parameters \(\mathbf{w}^{(1)}, \mathbf{w}^{(2)}, \ldots\). At each step, we compute the direction of steepest ascent, i.e., the gradient of the loss function with respect to each parameter. The gradient quantifies how the loss function responds as we tweak each parameter. If the partial derivative is positive (increasing the parameter increases the loss), then we will want to decrease the parameter. Analogously, if the partial derivative is negative (increasing the parameter decreases the loss), then we will want to increase the parameter. In both cases, we move in the direction opposite the gradient. Hence, we reach the next parameter vector by subtracting a scaled gradient from the current parameter vector: \[ \begin{align*} \mathbf{w}^{(t+1)} \gets \mathbf{w}^{(t)} - \alpha \nabla_\mathbf{w} \mathcal{L}(\mathbf{w}^{(t)}), \end{align*} \] where \(\alpha\) is a small positive constant called the step size or learning rate.
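As a rough sketch, this update rule can be written in a few lines of Python with NumPy; the function name, default step size, and fixed iteration count below are illustrative choices rather than part of the algorithm itself.

```python
import numpy as np

def gradient_descent(grad_loss, w_init, alpha=0.01, num_steps=1000):
    """Repeatedly step in the direction opposite the gradient.

    grad_loss: function returning the gradient of the loss at a parameter vector w.
    alpha:     step size (learning rate).
    """
    w = np.array(w_init, dtype=float)
    for _ in range(num_steps):
        w = w - alpha * grad_loss(w)   # w^(t+1) = w^(t) - alpha * grad L(w^(t))
    return w

# For L(w) = (w - 3)^2 the gradient is 2(w - 3); starting from 0, this converges to w = 3.
w_star = gradient_descent(lambda w: 2 * (w - 3.0), w_init=[0.0])
```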
Notice, for one, that this approach stops moving once it reaches a point where the gradient is 0, i.e., once we have reached a local minimum.
Beyond the stopping condition, why does this work? Consider the one-dimensional setting. The derivative of the loss function is \[ \begin{align*} \mathcal{L}'(w) = \lim_{\Delta \to 0} \frac{\mathcal{L}(w + \Delta) - \mathcal{L}(w)}{\Delta}, \end{align*} \] so, for small \(\Delta\), we can approximate the change in the loss function as \[ \begin{align*} \mathcal{L}(w+\Delta) - \mathcal{L}(w) \approx \mathcal{L}'(w) \cdot \Delta. \end{align*} \] We want \(\mathcal{L}(w+\Delta)\) to be smaller than \(\mathcal{L}(w)\), so we want \(\mathcal{L}'(w) \Delta < 0\). This can be achieved by setting \(\Delta = -\eta \mathcal{L}'(w)\), where \(\eta\) is a small positive constant. Then \(w^{(t+1)} = w^{(t)} - \eta \mathcal{L}'(w^{(t)})\) is a step in the direction of descent.
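For a concrete example, suppose \(\mathcal{L}(w) = (w - 3)^2\), so \(\mathcal{L}'(w) = 2(w - 3)\). Starting at \(w^{(1)} = 0\) with \(\eta = 0.1\), we have \(\mathcal{L}'(0) = -6\), so the update gives \(w^{(2)} = 0 - 0.1 \cdot (-6) = 0.6\); the loss drops from \(9\) to \((0.6 - 3)^2 = 5.76\), and repeated steps continue toward the minimum at \(w = 3\).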
In the multi-dimensional setting, the partial derivative of the loss function with respect to each parameter is given by the gradient: \[ \frac{\partial \mathcal{L}}{\partial w_i} = \lim_{\Delta \to 0} \frac{\mathcal{L}(\mathbf{w} + \Delta \mathbf{e}_i) - \mathcal{L}(\mathbf{w})}{\Delta}, \] where \(\mathbf{e}_i\) is the \(i\)-th standard basis vector. Then, for small \(\Delta\), we can approximate the change in the loss function as \[ \mathcal{L}(\mathbf{w} + \Delta \mathbf{e}_i) - \mathcal{L}(\mathbf{w}) \approx \frac{\partial \mathcal{L}}{\partial w_i} \cdot \Delta = \langle \nabla_\mathbf{w} \mathcal{L}(\mathbf{w}), \Delta \mathbf{e}_i \rangle. \] For a general vector \(\mathbf{v}\), we can write \[ \mathcal{L}(\mathbf{w} + \Delta \mathbf{v}) - \mathcal{L}(\mathbf{w}) \approx \langle \nabla_\mathbf{w} \mathcal{L}(\mathbf{w}), \Delta \mathbf{v} \rangle. \] If we want to move in the direction of steepest descent, we can set \(\Delta \mathbf{v} = -\eta \nabla_\mathbf{w} \mathcal{L}(\mathbf{w})\), where \(\eta\) is a small positive constant. Then, we have \(\mathcal{L}(\mathbf{w} - \eta \nabla_\mathbf{w} \mathcal{L}(\mathbf{w})) - \mathcal{L}(\mathbf{w}) \approx -\eta \langle \nabla_\mathbf{w} \mathcal{L}(\mathbf{w}), \nabla_\mathbf{w} \mathcal{L}(\mathbf{w}) \rangle = -\eta \|\nabla_\mathbf{w} \mathcal{L}(\mathbf{w})\|^2\). Why is this the right choice? Recall that for any vectors \(\mathbf{a}\) and \(\mathbf{b}\), we have \(\langle \mathbf{a}, \mathbf{b} \rangle = \|\mathbf{a}\| \|\mathbf{b}\| \cos(\theta)\), where \(\theta\) is the angle between the two vectors. The smallest value of \(\cos(\theta)\) is \(-1\), which occurs when the two vectors point in opposite directions. So, among all steps of a fixed length, the decrease in the loss is largest when the step points opposite to the gradient, i.e., along \(-\nabla_\mathbf{w} \mathcal{L}(\mathbf{w})\).
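For instance, if \(\nabla_\mathbf{w} \mathcal{L}(\mathbf{w}) = (3, 4)\), then over all unit-length directions \(\mathbf{v}\), the inner product \(\langle \nabla_\mathbf{w} \mathcal{L}(\mathbf{w}), \mathbf{v} \rangle\) ranges between \(-5\) and \(5\), and it attains its most negative value at \(\mathbf{v} = -(3/5, 4/5)\), which is exactly the normalized negative gradient.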
For linear models, we already know the gradient of the mean squared error loss: \[ \nabla_\mathbf{w} \mathcal{L}(\mathbf{w}) = \frac2{n} \mathbf{X}^\top (\mathbf{X w - y}). \] In contrast to the \(O(nd^2 + d^3)\) time required to compute the exact solution, we can now compute the gradient in \(O(nd)\) time, as long as we restrict ourselves to matrix-vector multiplications rather than matrix-matrix multiplications. The final time complexity of gradient descent is \(O(T nd)\), where \(T\) is the number of iterations of gradient descent.
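To see where the \(O(nd)\) cost comes from, here is a minimal NumPy sketch of this gradient computed with two matrix-vector products, never forming \(\mathbf{X}^\top \mathbf{X}\); the function name and argument shapes are illustrative.

```python
import numpy as np

def mse_gradient(X, y, w):
    """Mean squared error gradient for a linear model, via matrix-vector products.

    X: (n, d) data matrix, y: (n,) targets, w: (d,) parameters.
    Each product below costs O(nd), so the full gradient costs O(nd).
    """
    residual = X @ w - y                      # Xw - y, an (n,) vector
    return (2.0 / len(y)) * (X.T @ residual)  # (2/n) X^T (Xw - y), a (d,) vector
```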
While we have achieved a significant speedup, \(O(T nd)\) could still be prohibitively large when we have a large number of data points \(n\) and/or a large number of features \(d\). Our solution will be a stochastic approach, where we only use a small random subset of the data to compute the gradient.
Stochastic Gradient Descent
Our approach will be similar to gradient descent, except now we will compute the gradient using only the data in a small random batch. For the mean squared error loss, we can write the loss function for a random subset \(S\) of the data as \[ \mathcal{L}_S(\mathbf{w}) = \frac1{|S|} \sum_{i \in S} (f(\mathbf{x}^{(i)}) - y^{(i)})^2. \] Then, for our linear model, the gradient of the loss function with respect to the parameters \(\mathbf{w}\) is given by \[ \nabla_\mathbf{w} \mathcal{L}_S(\mathbf{w}) = \frac2{|S|} \mathbf{X}_S^\top (\mathbf{X}_S \mathbf{w} - \mathbf{y}_S), \] where \(\mathbf{X}_S\) is the data matrix for the subset \(S\) and \(\mathbf{y}_S\) is the target vector for the subset \(S\). One iteration of stochastic gradient descent takes time \(O(|S|d)\), which can be much faster than the \(O(nd)\) time required to compute the gradient for the full dataset.
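As a sketch of how the pieces fit together, the following NumPy loop samples a fresh mini-batch for every update; the function name, batch size, and epoch count are illustrative assumptions rather than fixed choices.

```python
import numpy as np

def sgd(X, y, w_init, alpha=0.01, batch_size=32, num_epochs=10, seed=0):
    """Stochastic gradient descent for a linear model with mean squared error.

    Each update estimates the gradient from a random mini-batch S of size
    batch_size, costing O(|S| d) instead of O(n d).
    """
    rng = np.random.default_rng(seed)
    w = np.array(w_init, dtype=float)
    n = len(y)
    for _ in range(num_epochs):
        for _ in range(max(n // batch_size, 1)):
            S = rng.choice(n, size=min(batch_size, n), replace=False)  # random subset of indices
            X_S, y_S = X[S], y[S]
            grad = (2.0 / len(S)) * (X_S.T @ (X_S @ w - y_S))          # mini-batch gradient
            w = w - alpha * grad
    return w
```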
Adaptive Step Sizes
The step size \(\alpha\) is a crucial hyperparameter in gradient descent. If \(\alpha\) is too small, then the algorithm will take a long time to converge because it will take small steps towards the minimum. If \(\alpha\) is too large, then the algorithm may overshoot the minimum and diverge: each step moves in the right direction but goes too far. Instead, we want to choose a step size \(\alpha\) that is just right, allowing us to make progress towards the minimum without overshooting.
There are several strategies for choosing the step size:
When searching manually, we often increase or decrease the step size exponentially, i.e., multiply by a factor of \(2\) or \(1/2\). If the loss consistently improves over several iterations of gradient descent, then we try increasing the step size; if the loss is unstable, then we try decreasing the step size.
Learning rate schedules offer a more automated approach, where we start with a large step size and then decrease it over time. This is often done by multiplying the step size by a factor less than \(1\) after each iteration. The intuition is that we want to take large steps at the beginning to quickly find a good region of the parameter space, and then take smaller steps so as to not overshoot the minima as we get closer.
An even more automated approach is to use an adaptive learning rate, where we adjust the step size based on the gradient. If the gradient is large, then we can decrease the step size to avoid overshooting; if the gradient is small, then we can increase the step size to speed up convergence. One implementation of this idea is as follows: \[ \alpha^{(t+1)} \gets \frac{\alpha^{(t)}}{(\nabla_\mathbf{w} \mathcal{L}(\mathbf{w}^{(t)}))^2}. \] Notice that the division is element-wise, so we are adjusting the step size for each parameter individually based on the squared partial derivative of that parameter.
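As a rough sketch of the last two ideas, the following pairs a multiplicative decay schedule with the element-wise adaptive rule above; the decay factor and the small \(\epsilon\) guarding against division by zero are illustrative additions, not part of the rule as written.

```python
import numpy as np

def gd_with_step_size_rules(grad_loss, w_init, alpha0=0.1, decay=0.99,
                            adaptive=False, num_steps=100, eps=1e-8):
    """Gradient descent with either a multiplicative decay schedule or the
    element-wise adaptive step-size rule described above.

    decay: factor (< 1) multiplying the step size after each iteration.
    eps:   small constant to avoid dividing by a zero gradient (an added assumption).
    """
    w = np.array(w_init, dtype=float)
    alpha = np.full_like(w, alpha0)            # one step size per parameter
    for _ in range(num_steps):
        grad = grad_loss(w)
        w = w - alpha * grad
        if adaptive:
            alpha = alpha / (grad ** 2 + eps)  # large gradient -> smaller step, small gradient -> larger step
        else:
            alpha = alpha * decay              # schedule: shrink the step size over time
    return w
```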
In addition to the step size, the direction of each step is also important.
Momentum
The idea of gradient descent is to converge to a local minimum of the loss function, but things can go wrong even if we have the right step size: the gradient may not point towards the minimum if, for example, the loss function is not symmetric. The plot illustrates this issue for a convex loss function of two parameters \(w_1\) and \(w_2\), where the loss function is drawn via its level sets. In the plot, a standard gradient descent approach takes many steps, but the directions partially cancel out, resulting in a zig-zag pattern that slows convergence.
Our solution is to keep track of the direction we have been moving in and use that to inform our next step. This idea, called momentum, retains a running average of the gradients, which allows us to smooth out the direction of the steps. We can think of momentum as a ball rolling down a hill: even when the ball is pushed left or right, it continues to roll downwards. An implementation of momentum is as follows: \[ \begin{align*} \mathbf{m}^{(t+1)} &= \beta \mathbf{m}^{(t)} + (1 - \beta) \nabla_\mathbf{w} \mathcal{L}_S(\mathbf{w}^{(t)}) \\ \mathbf{w}^{(t+1)} &= \mathbf{w}^{(t)} - \alpha \mathbf{m}^{(t+1)}, \end{align*} \] where \(\beta\) is a hyperparameter that controls how much history we keep in the momentum vector \(\mathbf{m}^{(t)}\).
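A minimal sketch of this update (with grad_loss standing in for either the full or mini-batch gradient, and \(\beta = 0.9\) as an illustrative default) could look like the following.

```python
import numpy as np

def gd_with_momentum(grad_loss, w_init, alpha=0.01, beta=0.9, num_steps=1000):
    """Gradient descent with momentum: step along an exponentially weighted
    running average of past gradients rather than the raw gradient."""
    w = np.array(w_init, dtype=float)
    m = np.zeros_like(w)                     # momentum vector, m^(0) = 0
    for _ in range(num_steps):
        grad = grad_loss(w)
        m = beta * m + (1.0 - beta) * grad   # m^(t+1) = beta m^(t) + (1 - beta) grad
        w = w - alpha * m                    # w^(t+1) = w^(t) - alpha m^(t+1)
    return w
```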