Set Size Estimation

Introduction

Powered by repeated innovations in chip manufacturing, computers have grown exponentially more powerful over the last several decades. As a result, we have access to unparalleled computational resources and data. For example, a single NASA satellite collects 20 terabytes of satellite images, more than 8 billion searches are made on Google, and estimates suggest the internet creates more than 300 million terabytes of data every single day. Simultaneously, we are quickly approaching the physical limit of how many transistors can be packed on a single chip. In order to learn from the data we have and continue expanding our computational abilities into the future, fast and efficient algorithms are more important than ever.

At first glance, an algorithm that performs only a few operations per item in our data set is efficient. However, these algorithms can be too slow when we have lots and lots of data. Instead, we turn to randomized algorithms that can run even faster. Randomized algorithms typically exploit some source of randomness to run on only a small part of the data set (or use only a small amount of space) while still returning an approximately correct result.

We can run randomized algorithms in practice to see how well they work. But we also want to prove that they work and understand why. Today, we will solve a problem using randomized algorithms. Before we get to the problems and algorithms, we’ll build some helpful probability tools.

Probability Background

Consider a random variable \(X\). For example, \(X\) could be the outcome of a fair dice roll and be equal to \(1,2,3,4,5\) or \(6\), each with probability \(\frac{1}{6}\). Formally, we use \(\Pr(X=x)\) to represent the probability that the random variable \(X\) is equal to the outcome \(x\). The expectation of a discrete random variable is \[ \mathbb{E}[X] = \sum_{x} x \Pr(X=x). \] For example, the expected outcome of a fair dice roll is \(\mathbb{E}[X] = 1 \times \frac{1}{6} + 2 \times \frac{1}{6} + 3 \times \frac{1}{6} + 4 \times \frac{1}{6} + 5 \times \frac{1}{6} + 6 \times \frac{1}{6} = \frac{21}{6}\). Note: If the random variable is continuous, we can similarly define its expected value using an integral.

The expected value tells us where the random variable is on average but we’re also interested in how closely the random variable concentrates around its expectation. The variance of a random variable is \[ \textrm{Var}[X] = \mathbb{E}\left[(X - \mathbb{E}[X])^2\right]. \] Notice that the variance is larger when the random variable is often far from its expectation. In the figure below, can you identify the expected value for each of the three distributions? Which distribution has the largest variance? Which has the smallest?

There are a number of useful facts about the expected value and variance. For example,

\[ \mathbb{E}[\alpha X] = \alpha \mathbb{E}[X] \hspace{1em} \textrm{and} \hspace{1em} \textrm{Var}(\alpha X) = \alpha^2 \textrm{Var}(X) \] where \(\alpha \in \mathbb{R}\) is a real number. To see this, observe that \[ \mathbb{E}[\alpha X] = \sum_{x} \alpha x \Pr(X=x) = \alpha \sum_{x} x \Pr(X=x) = \alpha \mathbb{E}[X] \] and \[ \textrm{Var}(\alpha X) = \sum_x (\alpha x - \alpha \mathbb{E}[X])^2 = \alpha^2 \sum_x ( x - \mathbb{E}[X])^2 = \alpha^2 \textrm{Var}(X). \]

Independent Random Variables

Once we have defined random variables, we are often interested in events defined on their outcomes. Let \(A\) and \(B\) be two events. For example, \(A\) could be the event that the dice shows \(1\) or \(2\) while \(B\) could be the event that the dice shows an odd number. We use \(\Pr(A \cap B)\) to denote the probability that events \(A\) and \(B\) both happen. Often, we have information about one event and want to see how that changes the probability of another event. We use \(\Pr(A | B)\) to denote the conditional probability of event \(A\) given that \(B\) happened. We define

\[ \Pr(A | B) = \frac{\Pr(A \cap B)}{\Pr(B)}. \]

If information about event \(B\) does not give us information about event \(A\), we say that \(A\) and \(B\) are independent. Formally, events \(A\) and \(B\) are independent if \(\Pr(A|B) = \Pr(A)\). By the definition of conditional probability, an equivalent definition of independence is \(\Pr(A \cap B) = \Pr(A) \Pr(B)\).

Let’s figure out whether the event \(A\) that the dice shows 1 or 2 is independent of the event \(B\) that the dice shows an odd number. Well, \(\Pr(A \cap B) = \frac{1}{6}\) since the only outcome that satisfy both events is when the dice shows a 1. We also know that \(\Pr(A) \Pr(B) = \frac{2}{6} \times \frac{3}{6} = \frac{1}{6}\). So, by the second definition of independence, we can conclude that \(A\) and \(B\) are independent.

We’ve been talking about events defined on random variables, but we’ll also be interested in when random variables are independent. Consider random variables \(X\) and \(Y\). We say that \(X\) and \(Y\) are independent if, for all outcomes \(x\) and \(y\), \(\Pr(X=x \cap Y=y) = \Pr(X=x) \Pr(Y=y)\).

Linearity of Expectation

One of the most powerful theorems in all of probability is the linearity of expectation.

Theorem: Let \(X\) and \(Y\) be random variables. Then \[ \mathbb{E}[X+Y] = \mathbb{E}[X] + \mathbb{E}[Y]. \] The result is a powerful tool that requires no assumptions on the random variables.

Proof: Observe that \[ \mathbb{E}[X+Y] = \sum_{x,y}(x+y) \Pr(X=x \cap Y=y) \] Now, we’ll separate the equation into two terms and factor out the \(x\) and \(y\) terms, respectively. \[ = \sum_x x \sum_y \Pr(X=x \cap Y=y) + \sum_y y \sum_x \Pr(X=x \cap Y=y) \] Finally, using the law of total probability, we have \[ = \sum_x x \Pr(X=x) + \sum_y y \Pr(Y=y) = \mathbb{E}[X] + \mathbb{E}[Y]. \]

There are also several other useful facts about the expected value and variance.

Fact 1: When \(X\) and \(Y\) are independent, \(\mathbb{E}[XY] = \mathbb{E}[X] \mathbb{E}[Y]\).

Proof: Observe that \[ \mathbb{E}[XY] = \sum_{x,y} xy \Pr(X=x \cap Y=y) = \sum_{x,y} xy \Pr(X=x) \Pr(Y=y) \]

\[ = \sum_x x \Pr(X=x) \sum_y y \Pr(Y=y) = \mathbb{E}[X] \mathbb{E}[Y] \] where the second equality followed by the assumption that \(X\) and \(Y\) are independent.

Fact 2: Consider a random variable \(X\). Then \(\textrm{Var}(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2\).

Proof: Observe that \[ \textrm{Var}(X) = \mathbb{E}[(X-\mathbb{E}[X])^2] \] \[ = \mathbb{E}[X^2 - 2 X \mathbb{E}[X] + \mathbb{E}[X]^2] = \mathbb{E}[X^2] - \mathbb{E}[X]^2 \] where the first equality is by definition, the second equality is by foiling, and the third equality is by linearity of expectation and the observation that \(\mathbb{E}[X]\) is a scaler.

Fact 3: When \(X\) and \(Y\) are independent, \(\textrm{Var}(X+Y) = \textrm{Var}(X) + \textrm{Var}(Y)\).

Proof: Observe that

\[\begin{align*} \textrm{Var}(X+Y) &= \mathbb{E}\left[(X + Y - \mathbb{E}[X] - \mathbb{E}[Y])^2\right] \\ &= \mathbb{E}\left[(X- \mathbb{E}[X])^2 + 2(X-\mathbb{E}[X])(Y-\mathbb{E}[Y]) + (Y-\mathbb{E})^2\right] \\ &= \textrm{Var}(X) + 2\mathbb{E}[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])]+ \textrm{Var}(Y). \end{align*}\] Then, when \(X\) and \(Y\) are independent, \[ \mathbb{E}[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])] = \mathbb{E}[XY - \mathbb{E}[X]Y - X\mathbb{E}[Y] + \mathbb{E}[X]\mathbb{E}[Y]] = 0 \] where the last equality follows by Fact 1 when \(X\) and \(Y\) are independent.

Set Size Estimation

We’ll pose a problem that has applications in ecology, social networks, and internet indexing. However, while efficiently solving the problem is useful, our purpose is really to gain familiarity with linearity of expectation and learn Markov’s inequality.

Suppose you run a website that is considering contracting with a company to provide CAPTCHAs for login verification. The company claims to have a database with \(n=1000000\) unique CAPTCHAs. For each API call, they’ll return a CAPTCHA chosen uniformly at random from their database. Here’s our problem: How many queries \(m\) do we need to make to their API until we can independently verify that they do in fact have a million CAPTCHAs?

An obvious approach is to keep calling ther API until we find a million unique CAPTCHAs. Of course, the issue is that we have to make at least a million API calls. That’s not so good if we care about efficiency, they charge us per call, or the size they claim to have in their database is much bigger than a million.

A more clever approach is to call their API and count duplicates. Intuitively, the larger their database, the fewer duplicates we expect to see. Define a random variable \(D_{i,j}\) which is 1 if the \(i\)th and \(j\)th calls return the same CAPTCHA and 0 otherwise. (To avoid double counting, we’ll assume \(i < j\).) For example, in the figure below, the \(5\)th, \(6\)th, and \(7\)th calls returned the same CAPTCHA so \(D_{5,6}\), \(D_{5,7}\), and \(D_{6,7}\) are all 1.

When a random variable can only be 0 or 1, we call it an indicator random variable. Indicator random variables have the special property that their expected value is the probability they are 1. We can define the total number of duplicates \(D\) in terms of our indicator random variables \(D_{i,j}\).

\[ D = \sum_{\substack{i, j \in \{1, \ldots, m\} \\ i < j }} D_{i,j} \]

We can calculate the expected number of duplicates using linearity of expectation.

\[ \mathbb{E}[D] = \sum_{\substack{i, j \in \{1, \ldots, m\} \\ i < j }} \mathbb{E}[D_{i,j}] \]

Since \(D_{i,j}\) is an indicator random variable, we know \(\mathbb{E}[D_{i,j}]\) is the probability the \(i\)th and \(j\)th CAPTCHA are the same. Since each API call is a uniform and independent sample from the database, the probability the \(j\)th CAPTCHA is the same as the \(i\)th is \(\frac{1}{n}\). With this observation in hand,

\[ \mathbb{E}[D] = \sum_{\substack{i, j \in \{1, \ldots, m\} \\ i < j }} \frac{1}{n} = \binom{m}{2} \frac{1}{n} = \frac{m(m-1)}{2n}. \]

Suppose we take \(m=1000\) queries and see \(D=10\) duplicates. How does this compare to what we would expect if the database had \(n=1000000\) CAPTCHAs?

Well, the expectation would be \(\mathbb{E}[D] = \frac{1000 \times 999}{2 \times 1000000} = .4995\). Something seems wrong… we observed many more duplicates than we expect. Can we formalize this intuition?

Markov’s Inequality

Concentration inequalities are a powerful tool in the analysis of randomized algorithms. They tell us how likely it is that a random variable differs from its expectation.

There are many concentration inequalities. Some apply in general and some apply only under special assumptions. The concentration inequalities that apply only under special assumptions tend to give stronger results. We’ll start with one of the most simple and general concentration inequalities.

Theorem: For any non-negative random variable \(X\) and any positive threshold \(t\), \[ \Pr(X \geq t) \leq \frac{\mathbb{E}[X]}{t}. \]

Proof: We’ll prove the inequality directly. By the definition of expectation, we have \[ \mathbb{E}[X] = \sum_{x} x \Pr(X=x) = \sum_{\substack{x \\ x \geq t}} x \Pr(X=x) + \sum_{\substack{x \\ x < t}} x \Pr(X=x) \] \[ \geq \sum_{\substack{x \\ x \geq t}} t \Pr(X=x) + 0 = t \Pr(X \geq t). \] Rearranging the above inequality gives Markov’s. Can you see where we used that all outcomes \(x\) are non-negative?

Now let’s apply Markov’s inequality to our set size estimation problem. Since the number of duplicates \(D\) is always positive, we satisfy the assumption of the inequality. \[ \Pr(D \geq 10 ) \leq \frac{\mathbb{E}[D]}{10} = \frac{.4995}{10} = .04995 \] The probability of observing the 10 duplicates is less than \(5\%\)! We should probably start asking the CAPTCHA company some questions.

In practice, many of the set size estimation problems are slightly different. Instead of checking a claim about the set size, we want to estimate the set size directly. Notice that we computed \(\mathbb{E}[D] = \frac{m(m-1)}{2n}\). Rearranging, we see that \(n = \frac{m(m-1)}{2\mathbb{E}[D]}\). Given \(m\) samples, we can naturally build an estimator for the whole set size using the empirical number of duplicates we found in the sample. With a little more work, we can show the following.

Claim: If we make \(m \geq c \frac{\sqrt{n}}{\epsilon}\) samples for a particular constant \(c\), then the estimate \(\hat{n} = \frac{m(m-1)}{2D}\) satisfies \((1-\epsilon) n \leq \hat{n} \leq (1+\epsilon) n\) with probability \(9/10\).