PageRank (The Power Method)
There is a rich mathematical foundation behind machine learning. Today, we will see how the eigenvalue decomposition of a matrix can be used to solve a real-world data science problem: finding the most important pages on the internet. This algorithm is known as PageRank, and it was the key technology that powered Google in its early days.
There are lots of webpages on the internet, and they are all interconnected by hyperlinks. Some pages, like wikipedia.org (website A) and mayoclinic.org (website B), are clearly important and authoritative. Other pages, like tealscats.com (website C), are less important. When we search for a topic, we’d like to return the most relevant pages, sorted by their importance.
Of course, we could manually order the pages, but that would be tedious, and way less fun. Instead, we will use the structure of the internet to determine which pages are most important. Consider a graph representation of the internet, where each page is a node and each hyperlink is a directed edge from one node to another.
PageRank
We roughly expect that important pages will have many incoming links (e.g., lots of people link to Wikipedia). Not only that, but we also expect that these links come from other important pages (e.g., even authoritative pages like Mayo Clinic link to Wikipedia).
With these ideas in mind, let’s define an iterative process for refining the importance of each page. At first, we will assign each page the same importance, say \(1/n\) where \(n\) is the number of pages. Let \(\mathbf{p}^{(0)} \in \mathbb{R}^n\) be the initial importance vector, where \(p^{(0)}_i = 1/n\) for each page \(i\). We will then iteratively update the importance vector so that the new importance of each page depends on how many pages link to it, and their importance in turn. We will use the following update rule: \[ p_i^{(t+1)} = \sum_{j=1}^n \mathbb{1}[j \text{ links to } i] \frac{p_j^{(t)}}{d_j}, \] where \(d_j\) is the out-degree of page \(j\), i.e., the number of pages that page \(j\) links to. Why do we divide by \(d_j\)? If we didn’t, then a page with many outgoing links would dominate the importance of the pages it links to.
Note: We will only consider pages with outgoing links, i.e., \(d_j > 0\) for all \(j\). Otherwise, the importance of pages with no outgoing links would be undefined.
Observe that the new importance is a linear combination of the previous importances. In the language of linear algebra, we are taking the dot product between the previous importance vector \(\mathbf{p}^{(t)}\) and a vector that only depends on the structure of the graph. For page \(i\), call this vector \([\mathbf{A}]_{i,} \in \mathbb{R}^n\) where \[ [\mathbf{A}]_{i,j} = \begin{cases} \frac{1}{d_j} & \text{if } j \text{ links to } i \\ 0 & \text{otherwise.} \end{cases} \] When we consider the entire importance vector, we can write the update rule as \[ \mathbf{p}^{(t+1)} = \mathbf{A} \mathbf{p}^{(t)}, \] where \(\mathbf{A} \in \mathbb{R}^{n \times n}\) is the matrix whose \(i\)th row is \(\mathbf{A}_i\). One useful observation is that each column of \(\mathbf{A}\) sums to 1, i.e., \(\sum_{i=1}^n [\mathbf{A}]_{i,j} = \sum_{i=1}^n \mathbb{1}[j \text{ links to } i ] \frac1{d_j} = \frac{d_j}{d_j} = 1\) for each \(j\). This means that \(\mathbf{p}^{(t+1)}\) will have the same sum as \(\mathbf{p}^{(t)}\) i.e., \[ \sum_{i=1}^n p_i^{(t+1)} = \sum_{i=1}^n \left( \sum_{j=1}^n [\mathbf{A}]_{i,j} p_j^{(t)} \right) = \sum_{j=1}^n p_j^{(t)} \sum_{i=1}^n [\mathbf{A}]_{i,j} = \sum_{j=1}^n p_j^{(t)}. \] Since we initialize \(\mathbf{p}^{(0)}\) to have sum 1, we can see that \(\mathbf{p}^{(t)}\) will always have sum 1. This enables us to interpret the importance vector \(\mathbf{p}^{(t)}\) as a probability distribution over the pages. In particular, \(p_i^{(t)}\) is the probability that a random user will land on page \(i\) after \(t\) iterations of following links, when starting from a uniform distribution over all the pages.
But now, we could have graphs with the following undesirable properties:
Reducible: Some groups of pages may form isolated clusters with no links in or out. For example, suppose pages X and Y link to each other, but no other pages link to either X or Y. Then, any probability that flows into this cluster will be trapped there, and the importance scores of all other pages will go to zero.
Periodic: Some pages may be part of a cycle, causing the importance scores to oscillate. For example, Page X links only to Page Y, and Page Y links only to Page X. In this case, we would bounce back and forth between these two pages, never settling on a stable importance score.
To avoid this, we will add a small amount of probability to each page at every iteration, which effectively simulates a user who jumps to a random page with some small probability. Let \(\alpha \in (0,1)\) be close to 1, and define the new importance update rule as \[ \mathbf{p}^{(t+1)} = \alpha \mathbf{A} \mathbf{p}^{(t)} + (1-\alpha) \mathbf{1} \frac{1}{n}. \] The resulting updates are:
Irreducible: With probability \((1-\alpha)\), the user jumps to a random page, ensuring that every page can be reached from any other page over time. This prevents isolated clusters from trapping probability.
Aperiodic: Since there’s always a chance of jumping randomly, cycles are “broken”, and the chain no longer gets stuck in oscillations.
Power Method
Our final update rule is slightly inelegant because it has two terms: one that depends on the previous importance vector and one that does not. But, we can rewrite it as a single matrix multiplication. Let \(\mathbf{1} \in \mathbb{R}^n\) be the vector of all ones. Recall that \(\sum_{i=1}^n p_i^{(t)} = 1 = \mathbf{1}^\top \mathbf{p}^{(t)}\). Then, we can rewrite the update rule as \[ \mathbf{p}^{(t+1)} = \alpha \mathbf{A} \mathbf{p}^{(t)} + (1-\alpha) \frac{1}{n} \mathbf{1} \mathbf{1}^\top \mathbf{p}^{(t)} = \left( \alpha \mathbf{A} + (1-\alpha) \frac{1}{n} \mathbf{1} \mathbf{1}^\top \right) \mathbf{p}^{(t)}. \]
Define the matrix \(\mathbf{M} = \alpha \mathbf{A} + (1-\alpha) \frac{1}{n} \mathbf{1} \mathbf{1}^\top\). The update rule can now be written as \[ \mathbf{p}^{(t+1)} = \mathbf{M} \mathbf{p}^{(t)} = \mathbf{M}^2 \mathbf{p}^{(t-1)} = \mathbf{M}^t \mathbf{p}^{(0)}. \]
Suppose that \(\mathbf{M}\) is diagonalizable. However, because \(\mathbf{M}\) is not necessarily symmetric, the left eigenvectors may be different from the right eigenvectors. Let \[ \mathbf{M} = \sum_{i=1}^r \mathbf{v}_i \lambda_i \mathbf{w}_i^\top \] be the eigenvalue decomposition of \(\mathbf{M}\), where \(\lambda_1 \geq \ldots \geq \lambda_r\) are the eigenvalues, \(\mathbf{w}_1, \ldots, \mathbf{w}_r\) are the corresponding left eigenvectors, and \(\mathbf{v}_1, \ldots, \mathbf{v}_r\) are the corresponding right eigenvectors. The right eigenvectors satisfy \[ \mathbf{M} \mathbf{v}_i = \lambda_i \mathbf{v}_i \] while the left eigenvectors satisfy \[ \mathbf{w}_i^\top \mathbf{M} = \lambda_i \mathbf{w}_i^\top. \] The left and right eigenvectors are orthonormal, i.e., \(\mathbf{w}_i^\top \mathbf{w}_j = 0 = \mathbf{v}_i^\top \mathbf{v}_j\) for \(i \neq j\) and \(1\) if \(i=j\). In addition, they satisfy a biorthonormal property, i.e., \(\mathbf{w}_i^\top \mathbf{v}_j = 0\) for \(i \neq j\). To see this, observe that \[ \mathbf{M} \mathbf{v}_i = \lambda_i \mathbf{v}_i \Leftrightarrow \mathbf{w}_j^\top \mathbf{M} \mathbf{v}_i = \lambda_i \mathbf{w}_j^\top \mathbf{v}_i \Leftrightarrow (\mathbf{w}_j^\top \lambda_j - \lambda_i \mathbf{w}_j^\top) \mathbf{v}_i = 0 \Leftrightarrow (\lambda_j - \lambda_i) \mathbf{w}_j^\top \mathbf{v}_i = 0. \] So, if \(\lambda_j \neq \lambda_i\), then \(\mathbf{w}_j^\top \mathbf{v}_i = 0\).
When we repeatedly multiply by \(\mathbf{M}\), the biorthonormal property gives us a surprisingly simple result: \[ \mathbf{M}^2 = \sum_{i=1}^r \mathbf{v}_i \lambda_i \mathbf{w}_i^\top \sum_{j=1}^r \mathbf{v}_j \lambda_j \mathbf{w}_j^\top = \sum_{i=1}^r \mathbf{v}_i \lambda_i \mathbf{w}_i^\top \mathbf{v}_i \lambda_i \mathbf{w}_i^\top = \sum_{i=1}^r \lambda_i^2 \mathbf{v}_i \mathbf{w}_i^\top. \] In general, \(\mathbf{M}^t = \sum_{i=1}^r \lambda_i^t \mathbf{v}_i \mathbf{w}_i^\top\).
Returning to the probability vector, we can write \[ \mathbf{p}^{(t)} = \mathbf{M}^t \mathbf{p}^{(0)} = \sum_{i=1}^r \lambda_i^t \mathbf{v}_i \mathbf{w}_i^\top \mathbf{p}^{(0)} = \lambda_1^t \sum_{i=1}^r \left( \frac{\lambda_i}{\lambda_1} \right)^t \mathbf{v}_i \mathbf{w}_i^\top \mathbf{p}^{(0)}. \] When \(\lambda_1 > \lambda_2\), the term \(\left( \frac{\lambda_i}{\lambda_1} \right)^t\) will go to 0 for all \(i \geq 2\) as \(t\) increases. As long as \(\mathbf{v}_1^\top \mathbf{p}^{(0)}\) is non-zero, the first term will dominate the sum, and \[ \lim_{t \to \infty} \mathbf{p}^{(t)} = \lambda_1^t \mathbf{v}_1 \left(\mathbf{w}_1^\top \mathbf{p}^{(0)}\right). \] Since every entry of \(\mathbf{M}\) is at most \(1\) and \(\mathbf{1}^\top \mathbf{M} = \mathbf{1}^\top\), it’s possible to show that \(\lambda_1 = 1\).
In summary, the vector that Google uses to measure the importance of a page is simply a multiple of the first eigenvector of the matrix \(\mathbf{M}\). Repeatedly multiplying matrices is known as the power method, and results in the dominant eigenvector(s) of a matrix.