
PageRank (The Power Method)

Behind the real-world impact of the machine learning algorithms we will study this semester lies a rich mathematical foundation. Today, we will see how the eigenvalue decomposition of a matrix (which we learned about in the previous lecture) can be used to solve a real-world problem: finding the most important pages on the internet. This algorithm is known as PageRank, and it was the key technology that powered Google in its early days.

There are lots of webpages on the internet, and they are all interconnected by hyperlinks. Some pages, like wikipedia.org and mayoclinic.org, are clearly important and authoritative. Other pages, like tealscats.com, are less important. When we search for a topic, we’d like to return the most relevant pages, sorted by their importance.

Of course, we could manually rank the pages, but that would be tedious, and way less fun. Instead, we will use the structure of the internet to determine which pages are most important. Consider a graph representation of the internet, where each page is a node and each hyperlink is a directed edge from one node to another.

We roughly expect that important pages will have many incoming links (e.g., lots of people link to Wikipedia). Not only that, but we also expect that these links come from other important pages (e.g., even authoritative pages like Mayo Clinic link to Wikipedia).

PageRank

With these ideas in mind, let’s define an iterative process for refining our ranks of the pages. At first, we will assign each page the same rank, say \(1/n\) where \(n\) is the number of pages. Let \(\mathbf{p}^{(0)} \in \mathbb{R}^n\) be the initial rank vector, where \(\mathbf{p}^{(0)}_i = 1/n\) for each page \(i\). We will then iteratively update the rank vector so that the new rank of each page depends on how many pages link to it, and their ranks. We will use the following update rule: \[ p_i^{(t+1)} = \sum_{j: j \text{ links to } i } \frac{p_j^{(t)}}{d_j}, \] where \(d_j\) is the out-degree of page \(j\), i.e., the number of pages that page \(j\) links to. Why do we divide by \(d_j\)? If we didn’t, then a page with many outgoing links would dominate the rank of the pages it links to.
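To make the update concrete, here is a minimal sketch in Python (the three-page link graph and the variable names are made up for illustration): each page passes an equal share of its current rank to every page it links to.

```python
# A tiny illustrative link graph (not from the lecture):
# links[j] lists the pages that page j links to.
links = {0: [1, 2], 1: [2], 2: [0]}
n = len(links)

p = [1.0 / n] * n  # p^(0): every page starts with rank 1/n

# One update: page i receives p[j] / d_j from each page j that links to it.
p_next = [0.0] * n
for j, targets in links.items():
    d_j = len(targets)  # out-degree of page j
    for i in targets:
        p_next[i] += p[j] / d_j
p = p_next
print(p)  # [0.333..., 0.166..., 0.5]; the ranks still sum to 1
```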

Notice that the new rank is a linear combination of the previous ranks. In the language of linear algebra, we are taking the dot product between the previous rank vector \(\mathbf{p}^{(t)}\) and a vector that depends only on the structure of the graph. For page \(i\), call this vector \([\mathbf{A}]_{i,:} \in \mathbb{R}^n\), where \[ \mathbf{A}_{i,j} = \begin{cases} \frac{1}{d_j} & \text{if } j \text{ links to } i \\ 0 & \text{otherwise.} \end{cases} \] When we consider the entire rank vector, we can write the update rule as \[ \mathbf{p}^{(t+1)} = \mathbf{A} \mathbf{p}^{(t)}, \] where \(\mathbf{A} \in \mathbb{R}^{n \times n}\) is the matrix whose \(i\)th row is \([\mathbf{A}]_{i,:}\). One useful observation is that each column of \(\mathbf{A}\) sums to 1: for each \(j\) (assuming page \(j\) has at least one outgoing link), \(\sum_{i=1}^n \mathbf{A}_{i,j} = \sum_{i : j \text{ links to } i} \frac{1}{d_j} = \frac{d_j}{d_j} = 1\). This means that \(\mathbf{p}^{(t+1)}\) will have the same sum as \(\mathbf{p}^{(t)}\), i.e., \[ \sum_{i=1}^n p_i^{(t+1)} = \sum_{i=1}^n \left( \sum_{j=1}^n \mathbf{A}_{i,j} p_j^{(t)} \right) = \sum_{j=1}^n p_j^{(t)} \sum_{i=1}^n \mathbf{A}_{i,j} = \sum_{j=1}^n p_j^{(t)}. \] Since we initialize \(\mathbf{p}^{(0)}\) to have sum 1, we can see that \(\mathbf{p}^{(t)}\) will always have sum 1. This enables us to interpret the rank vector \(\mathbf{p}^{(t)}\) as a probability distribution over the pages. In particular, \(p_i^{(t)}\) is the probability that a random user lands on page \(i\) after \(t\) iterations of following links, starting from a uniform distribution over all the pages.
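As a quick sanity check (a sketch using numpy; the toy graph is the same illustrative one as above), we can build \(\mathbf{A}\), confirm that each column sums to 1, and verify that one update is just a matrix-vector product:

```python
import numpy as np

# Same illustrative three-page graph as above.
links = {0: [1, 2], 1: [2], 2: [0]}
n = len(links)

# Build A with A[i, j] = 1/d_j if j links to i, and 0 otherwise.
A = np.zeros((n, n))
for j, targets in links.items():
    for i in targets:
        A[i, j] = 1.0 / len(targets)

print(A.sum(axis=0))  # [1. 1. 1.]: each column sums to 1
p = np.full(n, 1.0 / n)
print(A @ p)          # matches the loop-based update above
```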

But now, we have a problem. Suppose a webpage has no outgoing links. Then all the probability flowing into that page gets stuck there. To avoid this, we will add a small amount of probability to each page at every iteration, which effectively simulates a random user who, with some small probability, jumps to a random page. Let \(\alpha \in (0,1)\) be close to 1, and define the new rank update rule as \[ \mathbf{p}^{(t+1)} = \alpha \mathbf{A} \mathbf{p}^{(t)} + (1-\alpha) \frac{1}{n} \mathbf{1}, \] where \(\mathbf{1} \in \mathbb{R}^n\) is the vector of all ones. So, even if a page traps all the probability it receives, we still add a small amount of probability to every page.
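In code, the damped update is a one-liner (a sketch; the value \(\alpha = 0.85\) is a commonly used choice in practice, not one fixed by the lecture):

```python
import numpy as np

# Illustrative transition matrix for the same three-page graph.
A = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])
n = A.shape[0]
alpha = 0.85  # assumption: a commonly used damping value close to 1

p = np.full(n, 1.0 / n)
p = alpha * (A @ p) + (1 - alpha) / n  # every page gets (1 - alpha)/n for free
print(p, p.sum())  # the sum is still 1, so p remains a probability distribution
```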

Power Method

Our final update rule is slightly inelegant because it has two terms: one that depends on the previous rank vector and one that does not. But we can rewrite it as a single matrix multiplication. Recall that \(\sum_{i=1}^n p_i^{(t)} = 1 = \mathbf{1}^\top \mathbf{p}^{(t)}\). Then, we can rewrite the update rule as \[ \mathbf{p}^{(t+1)} = \alpha \mathbf{A} \mathbf{p}^{(t)} + (1-\alpha) \frac{1}{n} \mathbf{1} \mathbf{1}^\top \mathbf{p}^{(t)} = \left( \alpha \mathbf{A} + (1-\alpha) \frac{1}{n} \mathbf{1} \mathbf{1}^\top \right) \mathbf{p}^{(t)}, \] where we used that \(\mathbf{1}^\top \mathbf{p}^{(t)} = 1\).

Define the matrix \(\mathbf{M} = \alpha \mathbf{A} + (1-\alpha) \frac{1}{n} \mathbf{1} \mathbf{1}^\top\). The update rule can now be written as \[ \mathbf{p}^{(t+1)} = \mathbf{M} \mathbf{p}^{(t)} = \mathbf{M}^2 \mathbf{p}^{(t-1)} = \cdots = \mathbf{M}^{t+1} \mathbf{p}^{(0)}. \]
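A minimal sketch of the resulting iteration (same illustrative matrix as above): form \(\mathbf{M}\) once, then repeatedly multiply.

```python
import numpy as np

A = np.array([[0.0, 0.0, 1.0],   # illustrative transition matrix from above
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])
n, alpha = A.shape[0], 0.85

# One matrix that applies the damped update in a single multiplication.
M = alpha * A + (1 - alpha) / n * np.ones((n, n))

p = np.full(n, 1.0 / n)   # p^(0)
for _ in range(50):
    p = M @ p             # p^(t) = M^t p^(0)
print(p)                  # the iterates have essentially converged
```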

Let \(\mathbf{M} = \sum_{i=1}^r \lambda_i \mathbf{v}_i \mathbf{v}_i^\top\) be the eigenvalue decomposition of \(\mathbf{M}\), where \(\lambda_1 \geq \ldots \geq \lambda_r\) are the eigenvalues and \(\mathbf{v}_1, \ldots, \mathbf{v}_r\) are the corresponding orthonormal eigenvectors. When we repeatedly multiply by \(\mathbf{M}\), the orthonormal property of the eigenvectors gives us a surprisingly simple result: \[ \mathbf{M}^2 = \left( \sum_{i=1}^r \lambda_i \mathbf{v}_i \mathbf{v}_i^\top \right) \left( \sum_{j=1}^r \lambda_j \mathbf{v}_j \mathbf{v}_j^\top \right) = \sum_{i=1}^r \sum_{j=1}^r \lambda_i \lambda_j \mathbf{v}_i \left( \mathbf{v}_i^\top \mathbf{v}_j \right) \mathbf{v}_j^\top = \sum_{i=1}^r \lambda_i^2 \mathbf{v}_i \mathbf{v}_i^\top. \] Here, we used that \(\mathbf{v}_i^\top \mathbf{v}_j = 1\) if \(i = j\) and \(0\) otherwise, so only the \(i = j\) terms survive. In general, \(\mathbf{M}^t = \sum_{i=1}^r \lambda_i^t \mathbf{v}_i \mathbf{v}_i^\top\).
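This identity relies on the eigenvectors being orthonormal. Here is a quick numerical check on a small symmetric matrix, where orthonormal eigenvectors are guaranteed (the matrix itself is arbitrary, chosen only for illustration):

```python
import numpy as np

# Arbitrary small symmetric matrix: symmetry guarantees orthonormal eigenvectors.
S = np.array([[2.0, 1.0],
              [1.0, 3.0]])

lam, V = np.linalg.eigh(S)  # eigh: eigenvalues and orthonormal eigenvectors
t = 5

# Reconstruct S^t as sum_i lam_i^t v_i v_i^T and compare to direct computation.
S_t = sum(lam[i] ** t * np.outer(V[:, i], V[:, i]) for i in range(len(lam)))
print(np.allclose(S_t, np.linalg.matrix_power(S, t)))  # True
```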

Returning to the PageRank vector, we can write \[ \mathbf{p}^{(t)} = \mathbf{M}^t \mathbf{p}^{(0)} = \sum_{i=1}^r \lambda_i^t \mathbf{v}_i \mathbf{v}_i^\top \mathbf{p}^{(0)} = \lambda_1^t \sum_{i=1}^r \left( \frac{\lambda_i}{\lambda_1} \right)^t \mathbf{v}_i \mathbf{v}_i^\top \mathbf{p}^{(0)}. \] When \(\lambda_1 > \lambda_2\), the term \(\left( \frac{\lambda_i}{\lambda_1} \right)^t\) goes to 0 for all \(i \geq 2\) as \(t\) increases. As long as \(\mathbf{v}_1^\top \mathbf{p}^{(0)} \neq 0\), the first term dominates the sum, and for large \(t\), \[ \mathbf{p}^{(t)} \approx \lambda_1^t \left(\mathbf{v}_1^\top \mathbf{p}^{(0)}\right) \mathbf{v}_1. \] In summary, the PageRank that Google uses to measure the importance of a page is simply a multiple of the first eigenvector of the matrix \(\mathbf{M}\). Repeatedly multiplying a vector by a matrix in this way is known as the power method, and it converges to (a multiple of) the dominant eigenvector of the matrix.
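Putting it all together, here is a short sketch of the power method on the toy graph, compared against the dominant eigenvector computed directly with numpy (the graph and all variable names are illustrative):

```python
import numpy as np

A = np.array([[0.0, 0.0, 1.0],   # illustrative transition matrix from above
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])
n, alpha = A.shape[0], 0.85
M = alpha * A + (1 - alpha) / n * np.ones((n, n))

# Power method: repeatedly multiply by M until the iterates stop changing.
p = np.full(n, 1.0 / n)
for _ in range(100):
    p = M @ p

# Compare against the dominant eigenvector computed directly.
lam, V = np.linalg.eig(M)
v1 = np.real(V[:, np.argmax(np.abs(lam))])
v1 = v1 / v1.sum()  # rescale to sum to 1, like p
print(p)   # power-method result
print(v1)  # dominant eigenvector of M: the two agree
```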