We’ve seen the power of linearity of expectation, Markov’s inequality, and the union bound. This class, we will learn and use linearity of variance and Chebyshev’s inequality.

Chebyshev’s Inequality

Lemma: Let \(X\) be a random variable with expectation \(\mathbb{E}[X]\) and variance \(\sigma^2 = \textrm{Var}[X]\). Then for any \(k > 0\),

\[ \Pr(|X - \mathbb{E}[X]| \geq k \sigma) \leq \frac{1}{k^2}. \]

When the variance is smaller, we expect \(X\) to concentrate more closely around its expectation. Chebyshev’s inequality is one way to formalize this intuition.

There are two main benefits of Chebyshev’s inequality over Markov’s:

  • Chebyshev’s applies to any random variable \(X\). In contrast, Markov’s requires that the random variable \(X\) is non-negative.

  • Chebyshev’s gives a two-sided bound: we bound the probability that \(|X - \mathbb{E}[X]|\) is large, which means that \(X\) isn’t too far above or below its expectation. In contrast, Markov’s only bounds the probability that \(X\) is larger than \(\mathbb{E}[X]\).

While Chebyshev’s gives a more general bound, it also needs a bound on the variance of \(X\). In general, bounding the variance is harder than bounding the expectation.

Note: There’s no hard rule for which to apply! Both Markov’s and Chebyshev’s are useful in different settings.

Proof of Chebyshev’s Inequality: We’ll apply Markov’s inequality to the non-negative random variable \(S = (X - \mathbb{E}[X])^2\). By Markov’s, \[ \Pr\left(S \geq t \right) \leq \frac{\mathbb{E}[S]}{t}. \] Observe that \(\mathbb{E}[S] = \mathbb{E}[(X-\mathbb{E}[X])^2] = \textrm{Var}(X)\), plug in \(t = k^2 \sigma^2\), and use the definition of \(S\). Then

\[ \Pr \left((X - \mathbb{E}[X])^2 \geq k^2 \sigma^2\right) \leq \frac{\textrm{Var}(X)}{k^2 \sigma^2}. \] Taking square roots on both sides of the inequality inside the event (which preserves it, since both sides are non-negative) and using \(\textrm{Var}(X) = \sigma^2\) to simplify the right-hand side, we get

\[ \Pr\left(|X - \mathbb{E}[X]| \geq k \sigma\right) \leq \frac{1}{k^2}. \]

Linearity of Variance

In order to use Chebyshev’s inequality, we need to bound the variance. Luckily, there’s a helpful tool for computing the variance of a sum of independent random variables.

Fact: For pairwise independent random variables \(X_1, \ldots, X_m\),

\[ \textrm{Var}[X_1 + X_2 + \ldots + X_m] = \textrm{Var}[X_1] + \textrm{Var}[X_2] + \ldots + \textrm{Var}[X_m]. \]

Notice that we only require pairwise independence: for any distinct \(i, j \in \{1, \ldots, m\}\), \(X_i\) and \(X_j\) are independent. Pairwise independence is a strictly weaker requirement than \(k\)-wise independence for \(k>2\), which requires that for every choice of \(k\) of the variables \(X_{i_1}, \ldots, X_{i_k}\) and all values \(v_1, \ldots, v_k\), \[ \Pr(X_{i_1} = v_1, \ldots, X_{i_k}=v_k) = \Pr(X_{i_1}=v_1) \cdot \ldots \cdot \Pr(X_{i_k} = v_k). \]

Question: Can you think of three random variables that are pairwise independent but not 3-wise independent?

Here’s one answer: Let \(X_1\) be a random variable that is equally likely to be \(1\) and \(-1\), let \(X_2\) be an independent random variable that is also equally likely to be \(1\) and \(-1\), and let \(X_3 = X_1 X_2\). If we only look at two of the random variables, they appear independent. But if we look at all three of them, the values of any two determine the value of the third.
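To make the example concrete, here is a small simulation (a sketch, assuming NumPy is available) that checks the pairwise and three-way probabilities, and also confirms that linearity of variance still holds for \(X_1 + X_2 + X_3\):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# X1 and X2 are independent, uniform over {-1, +1}; X3 is their product.
x1 = rng.choice([-1, 1], size=n)
x2 = rng.choice([-1, 1], size=n)
x3 = x1 * x2

# Pairwise: each joint probability Pr(Xi = 1, Xj = 1) is ~1/4 = Pr(Xi = 1) * Pr(Xj = 1).
for a, b, name in [(x1, x2, "X1,X2"), (x1, x3, "X1,X3"), (x2, x3, "X2,X3")]:
    print(name, np.mean((a == 1) & (b == 1)))        # each ~0.25

# Three-way: Pr(X1 = 1, X2 = 1, X3 = -1) would be 1/8 under 3-wise independence,
# but it is exactly 0 here because X3 = X1 * X2.
print(np.mean((x1 == 1) & (x2 == 1) & (x3 == -1)))   # 0.0

# Linearity of variance only needs pairwise independence:
# Var(X1 + X2 + X3) = 3 even though the three variables are not mutually independent.
print(np.var(x1 + x2 + x3))                           # ~3.0
```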

Coin Example

In order to gain familiarity with Chebyshev’s inequality and linearity of variance, let’s go through an example with coin flips. Let \(C_1, \ldots, C_{100}\) be independent random variables that are \(1\) with probability \(\frac{1}{2}\) and \(0\) otherwise.

Let the number of heads \(H\) be \(\sum_{i=1}^{100} C_i\). By linearity of expectation, we know that \[ \mathbb{E}[H] = \sum_{i=1}^{100} \mathbb{E}[C_i] = \sum_{i=1}^{100} \frac{1}{2}=50. \] Since \(C_i\) is an indicator random variable, we know \[ \textrm{Var}(C_i) = \mathbb{E}[C_i^2] - \mathbb{E}[C_i]^2 = \frac{1}{2} - \left(\frac{1}{2}\right)^2 = \frac{1}{4}. \] Then, by linearity of variance (all the flips are independent), we know that \[ \textrm{Var}(H) = \sum_{i=1}^{100} \textrm{Var}(C_i) = \sum_{i=1}^{100} \frac{1}{4} = 25. \]

Now let’s apply Chebyshev’s inequality. We know \(\mathbb{E}[H] = 50\) and \(\textrm{Var}(H) = \sigma^2 = 25\), so \(\sigma = 5\). Taking \(k = 4\), \[ \Pr(|H - 50| \geq 4 \cdot 5 ) \leq \frac{1}{4^2} = \frac{1}{16}. \] So, with probability at least \(1 - \frac{1}{16} > 93\%\), we get strictly between 30 and 70 heads.
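Here is a quick simulation (again assuming NumPy) comparing the empirical probability of \(|H - 50| \geq 20\) to the Chebyshev bound of \(\frac{1}{16}\):

```python
import numpy as np

rng = np.random.default_rng(1)
trials = 100_000

# Each row is one experiment: 100 fair coin flips; H is the number of heads.
H = rng.integers(0, 2, size=(trials, 100)).sum(axis=1)

empirical = np.mean(np.abs(H - 50) >= 20)
print(f"empirical Pr(|H - 50| >= 20): {empirical:.5f}")  # typically ~0.0001
print(f"Chebyshev bound:              {1 / 16:.5f}")     # 0.06250
```

The empirical probability is far below the bound: Chebyshev’s inequality is always valid, but it is often quite loose.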

Distinct Elements

The first problem that we’ll consider is estimating distinct elements. The problem, like many we’ll see in this class (e.g. frequent items estimation), is posed in the streaming model. In this model, we have a massive data set that arrives in a sequential stream. There is far too much data to store or process in a single location but we still want to analyze the data. To do this, we must compress the data on the fly, storing some smaller data structure which still contains useful information.

The input to the distinct elements problem is \(x_1, \ldots, x_n\) where each item is in a huge universe of items \(\mathcal{U}\). The output is the number of distinct inputs \(D\). For example, if the input is \(1, 10, 2, 4, 9, 2, 10, 4\) then the output is \(D=5\).

The distinct elements problem has many applications: distinct users visiting a webpage, distinct values in a database column, distinct queries to a search engine, distinct motifs in a DNA sequence. Because the problem is so general, implementations of the algorithm we’ll describe today (and several refinements of it) are deployed in practice at companies like Google, Yahoo, Twitter, Facebook, and many more.

The naive approach to solve the problem is to build a dictionary of all items seen so far. Every time we see an element, we hash it and check if we already have it in the dictionary. Unfortunately, the space complexity is \(O(D)\) and, if we have millions or billions of distinct elements, this naive approach is infeasible.
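For concreteness, here is a minimal Python sketch of the naive approach; it simply stores every distinct item in a set, which is exactly what we can’t afford when \(D\) is huge:

```python
def count_distinct_naive(stream):
    """Exact count, but stores one entry per distinct element: O(D) space."""
    seen = set()
    for x in stream:
        seen.add(x)
    return len(seen)

print(count_distinct_naive([1, 10, 2, 4, 9, 2, 10, 4]))  # 5
```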

Our goal is to return an estimate \(\hat{D}\) that satisfies \[ (1-\epsilon) D \leq \hat{D} \leq (1+\epsilon) D \] but only uses \(O(1/\epsilon^2)\) space. This should be surprising: We can use space that is (basically) independent of the number of distinct elements!

The algorithm we’ll use to accomplish this is surprisingly simple. Choose a random hash function \(h: \mathcal{U} \rightarrow [0,1]\) and initialize an estimate \(S = 1\). For every item \(x_i\) we see in the stream, set \[ S \gets \min(S, h(x_i)). \] Once we’ve processed the stream, return \(\hat{D} = \frac{1}{S} - 1\).

The figure below describes the algorithm. Elements arrive in a stream and we hash each item to the real number line between \(0\) and \(1\). By the definition of the hash function, each repeated item hashes to the same point. We maintain \(S\), the smallest hashed value we’ve seen so far.
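Here is a sketch of the algorithm in Python. A truly random hash function \(h: \mathcal{U} \rightarrow [0,1]\) doesn’t exist in practice, so the hypothetical helper hash_to_unit_interval below stands in for it by deterministically mapping each item to a pseudorandom-looking point in \([0,1)\):

```python
import hashlib

def hash_to_unit_interval(x, seed=0):
    """Stand-in for a random hash h: U -> [0, 1): deterministically maps an item
    to a point that looks uniform on [0, 1). Repeated items hash to the same point."""
    digest = hashlib.sha256(f"{seed}:{x}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def estimate_distinct(stream, seed=0):
    S = 1.0                                    # smallest hashed value seen so far
    for x in stream:
        S = min(S, hash_to_unit_interval(x, seed))
    return 1.0 / S - 1.0                       # the estimate D_hat = 1/S - 1

# A single sketch uses tiny space but is very noisy.
print(estimate_distinct(range(10_000)))
```

Note that repeated items never change \(S\), which is exactly why the sketch counts distinct elements rather than total elements.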

Why do we return the estimate \(\hat{D} = \frac{1}{S}- 1\)? Intuitively, when \(D\) is larger, \(S\) will be smaller because we get more chances to see small hashed values. The exact form of the estimate comes from the following lemma.

Lemma: \(\mathbb{E}[S] = \frac{1}{D+1}\).

We’ll see two proofs.

Calculus Proof: We’ll use the identity \[ \mathbb{E}[X] = \int_{x=0}^\infty \Pr(X \geq x)\, dx \] for a continuous and non-negative random variable \(X\). To see this, notice that \[ X = \int_{x=0}^X dx = \int_{x=0}^\infty \mathbb{1}[X \geq x]\, dx. \] Taking expectations yields the identity. There are other ways to prove this identity described in notes here.

Now let’s deploy the identity for the continuous random variable \(S\): \[ \mathbb{E}[S] = \int_{s=0}^1 \Pr(S \geq s) ds = \int_{s=0}^1 (1-s)^D d s \] \[ = \left. \frac{-(1-s)^{D+1}}{D+1} \right|_{s=0}^1 = \frac{1}{D+1}. \] In the second equality, we used the observation that the minimum \(S\) is at least a value \(s\) if and only if all \(D\) hashed values are at least \(s\).
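As a sanity check on the lemma, here is a quick Monte Carlo estimate (a sketch, assuming NumPy) of \(\mathbb{E}[S]\) for the minimum of \(D\) uniform draws:

```python
import numpy as np

rng = np.random.default_rng(2)
D, trials = 20, 200_000

# S is the minimum of D i.i.d. Uniform[0, 1] draws; average it over many trials.
S = rng.random((trials, D)).min(axis=1)

print(S.mean())     # ~0.0476
print(1 / (D + 1))  # 1/21 = 0.047619...
```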

We can understand every step of the calculus proof but it doesn’t give us a deeper understanding for why the lemma is true. Let’s turn to another proof “from the book”. This phrase comes from Paul Erdős, a prolific Hungarian mathematician, who believed that there are proofs so simple and elegant that they were written in a divine book.

Proof from the Book: We’ll use the following observation: For any event \(A\) and random variable \(x\), \[\begin{align} \mathbb{E}_x[\Pr[A|x]] = \mathbb{E}_x[\mathbb{E}[\mathbb{1}[A] \mid x]] = \mathbb{E}[\mathbb{1}[A]] = \Pr[A]. \end{align}\] The first and last equality follow because \(\mathbb{1}[A]\) is an indicator random variable for event \(A\). The middle equality follows by the law of total expectation. After a little bit of thought, the law of total expectation is intuitive. But you can prove it to yourself by plugging in the definition of expectations and exchanging sums/integrals, or you can find the proof online here.

Back to our proof, we know that \(h_1,\ldots,h_D\) are i.i.d. (independent and identically distributed) uniform samples drawn from the interval \([0,1]\). Recall that \(S = \min_{i\in[D]} h_i\). Since all the \(h_i\) are drawn from the interval between \(0\) and \(1\), we can interpret \(S\) as the probability that the next hashed value is less than the minimum of the first \(D\) hashed values. Mathematically, \[\begin{align} S = \Pr[h_{D+1} \leq \min_{i\in[D]} h_i \mid h_1,\ldots,h_D] \end{align}\] Now we can compute the expectation of \(S\): \[\begin{align*} \mathbb{E}_{h_1,\ldots,h_D}[S] &= \mathbb{E}_{h_1,\ldots,h_D}[\Pr[h_{D+1} \leq \min_{i\in[D]} h_i \mid h_1,\ldots,h_D]]\\ &= \Pr_{h_1,\ldots,h_{D+1}}[h_{D+1} \leq \min_{i\in[D]} h_i] \\ &= \frac1{D+1}. \end{align*}\] The first equality follows by our alternative definition of \(S\), the second equality follows by the first observation in the proof, the final equality follows because each \(h_i\) is equally likely to be the minimum of all \(D+1\).

We now know the expectation of \(S\) (twice) but, in order to apply Chebyshev’s inequality, we need to also bound the variance. Recall that

\[ \textrm{Var}(S) = \mathbb{E}[S^2] - \mathbb{E}[S]^2. \]

We know \(\mathbb{E}[S]\) but we don’t yet know \(\mathbb{E}[S^2]\).

Lemma: \(\mathbb{E}[S^2] = \frac{2}{(D+1)(D+2)}\).

Calculus Proof: Using a calculus proof similar to the one for the expectation, we know \[ \mathbb{E}[S^2] = \int_{s=0}^1 \Pr(S^2 \geq s) ds = \int_{s=0}^1 \Pr(S \geq \sqrt{s}) ds = \int_{s=0}^1 (1-\sqrt{s})^D d s. \] Using the WolframAlpha query here, we can find that the last integral evaluates to \[ \frac{2}{(D+1)(D+2)}. \]
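If you’d rather not trust WolframAlpha, a few lines of sympy (assuming it’s installed) verify the integral against the closed form for small concrete values of \(D\):

```python
import sympy as sp

s = sp.symbols("s", nonnegative=True)
for D in range(1, 6):
    # Integrate (1 - sqrt(s))^D over [0, 1] and compare to 2 / ((D+1)(D+2)).
    integral = sp.integrate(sp.expand((1 - sp.sqrt(s)) ** D), (s, 0, 1))
    closed_form = sp.Rational(2, (D + 1) * (D + 2))
    print(D, integral, closed_form, sp.simplify(integral - closed_form) == 0)
```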

Again, the calculus proof is correct but it doesn’t give us a deeper understanding.

Proof from the Book: We can apply the same machinery that we used in the prior proof from the book. The key observation now is that \(S^2\) is the probability that the next two hashed values are both less than the minimum of the first \(D\) hashed values. Formally, \[ S^2 = \Pr(\max(h_{D+1}, h_{D+2}) \leq \min(h_1, \ldots, h_D) \mid h_1, \ldots, h_D). \] What’s the probability that both of the next two hashed values are less than the minimum of the first \(D\) hashed values? This happens exactly when \(h_{D+1}\) and \(h_{D+2}\) are the two smallest of all \(D+2\) hashed values. There are two ways this can occur: either \(h_{D+1}\) is the smallest of all \(D+2\) values and \(h_{D+2}\) is the smallest of the remaining \(D+1\), or \(h_{D+2}\) is the smallest of all \(D+2\) values and \(h_{D+1}\) is the smallest of the remaining \(D+1\). Each of these two events has probability \(\frac{1}{(D+1)(D+2)}\).

Using the same analysis from the prior proof from the book, we know that \[\begin{align} \mathbb{E}_{h_1, \ldots, h_D}[S^2] &= \Pr_{h_1, \ldots, h_{D+2}}(\max(h_{D+1}, h_{D+2}) \leq \min(h_1, \ldots, h_D)) \\ &= \frac{2}{(D+1)(D+2)}. \end{align}\]

Remember our goal is to compute the variance. With our expressions for \(\mathbb{E}[S]\) and \(\mathbb{E}[S^2]\), we know \[ \textrm{Var}(S) = \frac{2}{(D+1)(D+2)} - \frac{1}{(D+1)^2} \leq \frac{1}{(D+1)^2}. \]

Let’s try applying Chebyshev’s inequality. We know \(\mathbb{E}[S] = \frac{1}{D+1} = \mu\) and \(\textrm{Var}(S) \leq \frac{1}{(D+1)^2} = \mu^2\), so \(\sigma \leq \mu\). The bound we want is \(\Pr(|S - \mu | \geq \epsilon \mu) \leq \delta\) where \(\delta\) is a (small) probability of failure. Instead, Chebyshev’s inequality gives \[ \Pr(|S - \mu| \geq \epsilon \mu) \leq \Pr( |S - \mu| \geq \epsilon \sigma ) \leq \frac{1}{\epsilon^2}. \] But \(\epsilon\) is a number less than \(1\), so the bound is vacuous!

Variance Reduction

Just as we repeated the core subroutine in the count-min algorithm, we’ll repeat the core subroutine in this algorithm. The new algorithm is to choose \(k\) random hash functions \(h_1, \ldots, h_k: \mathcal{U} \rightarrow [0, 1]\). We’ll keep \(k\) independent sketches of the minimum value \(S_1, \ldots, S_k\), all initialized to \(1\). Then, when we see item \(x_i\), we update \[S_j \gets \min(S_j, h_j(x_i))\] for every \(j \in \{1, \ldots, k\}\). At the end of the stream, we take the average \(S = (S_1 + \ldots + S_k)/k\) and return the estimate \(\hat{D} = \frac{1}{S} - 1\).
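Here is a sketch of the variance-reduced algorithm in Python, reusing the hypothetical hash_to_unit_interval stand-in from the earlier sketch (redefined here so the block is self-contained; seed \(j\) plays the role of hash function \(h_j\)):

```python
import hashlib

def hash_to_unit_interval(x, seed=0):
    # Same stand-in hash as before; the seed j plays the role of hash function h_j.
    digest = hashlib.sha256(f"{seed}:{x}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def estimate_distinct_avg(stream, k=100):
    S = [1.0] * k                              # k independent minimums S_1, ..., S_k
    for x in stream:
        for j in range(k):
            S[j] = min(S[j], hash_to_unit_interval(x, seed=j))
    S_avg = sum(S) / k                         # averaging cuts the variance by a factor of k
    return 1.0 / S_avg - 1.0                   # D_hat = 1/S - 1

print(estimate_distinct_avg(range(10_000)))    # typically within ~10-20% of 10,000 for k = 100
```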

Our new algorithm is an example of a general strategy for variance reduction. We repeat many independent trials and take the mean. Given i.i.d. (independent, identically distributed) random variables, \(X_1, \ldots, X_k\) with mean \(\mu\) and variance \(\sigma^2\), we know that

\[ \mathbb{E} \left[ \frac{1}{k} \sum_{i=1}^k X_i \right] = \frac{1}{k} \sum_{i=1}^k \mathbb{E} \left[ X_i \right] = \mu \] and

\[ \textrm{Var}\left[ \frac{1}{k} \sum_{i=1}^k X_i\right] = \frac{1}{k^2}\sum_{i=1}^k \textrm{Var} \left[X_i\right] = \frac{\sigma^2}{k} \] where we used linearity of expectation and linearity of variance (since the \(X_i\) are independent).
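As a quick numeric illustration (assuming NumPy), averaging \(k\) i.i.d. samples really does cut the variance by a factor of \(k\):

```python
import numpy as np

rng = np.random.default_rng(3)
k, trials = 25, 100_000

# Each X_i is Uniform[0, 1], so sigma^2 = 1/12. The mean of k copies should have
# variance about sigma^2 / k.
means = rng.random((trials, k)).mean(axis=1)

print(np.var(means))   # ~0.00333
print((1 / 12) / k)    # 0.00333...
```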

Applying the variance reduction analysis to our algorithm, we have that \(\mathbb{E}[S] = \mu\) as before but now \(\textrm{Var}(S) \leq \frac{\mu^2}{k}\). Then Chebyshev’s inequality gives \[ \Pr\left(|S - \mu| \geq c \frac{\mu}{\sqrt{k}} \right) \leq \frac{1}{c^2}. \] Setting \(c = \frac{1}{\sqrt{\delta}}\) and \(k = \frac{1}{\epsilon^2 \delta}\), we have \[ \Pr(|S - \mu| \geq \epsilon \mu) \leq \delta. \]

We’re nearly done, except that we have only bounded \(S\) around its expectation; we really need to bound \(\hat{D}\) around \(D\). We know that with probability at least \(1-\delta\), \[ (1-\epsilon) \mathbb{E}[S] \leq S \leq (1+\epsilon) \mathbb{E}[S]. \] Inverting gives \[ \frac{1}{(1+\epsilon) \mathbb{E}[S]} \leq \frac{1}{S} \leq \frac{1}{(1-\epsilon) \mathbb{E}[S]}. \] We can use the facts (easily verified on a graphing calculator like Desmos) that \(\frac{1}{1+\epsilon} \geq 1 - 2\epsilon\) and, for \(\epsilon \leq \frac12\), \(\frac{1}{1-\epsilon} \leq 1 + 2\epsilon\). Subtracting 1 from each term gives \[ (1-2\epsilon) \frac{1}{\mathbb{E}[S]} - 1 \leq \frac{1}{S} - 1 \leq (1+2\epsilon) \frac{1}{\mathbb{E}[S]} - 1. \] Then, for \(\mathbb{E}[S] \leq \frac12\) (i.e., \(D \geq 1\)), we have \[ (1-4 \epsilon)\left( \frac{1}{\mathbb{E}[S]} - 1 \right) \leq \frac{1}{S} - 1 \leq (1+4\epsilon) \left( \frac{1}{\mathbb{E}[S]} - 1 \right). \] (You can check the inequalities by plotting here.) Since \(D = \frac{1}{\mathbb{E}[S]} - 1\) and \(\hat{D} = \frac{1}{S} - 1\), we have that with probability at least \(1-\delta\), \[ (1-4\epsilon) D \leq \hat{D} \leq (1+4\epsilon) D. \] We lose a small constant factor on \(\epsilon\), but this is absorbed into the big-O notation. So our final space complexity is \(O(k \log D) = O\left(\frac{\log D}{\epsilon^2 \delta}\right)\). The \(\log D\) factor comes from the roughly \(\log D\) bits needed to store each of the \(k\) hashed values. Impressively, the space complexity has no linear dependence on the number of distinct elements \(D\). We know that the \(\frac{1}{\epsilon^2}\) dependence cannot be improved, but we can get a better bound depending on \(\log(1/\delta)\) instead of \(1/\delta\) by using a stronger concentration inequality.
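Rather than a graphing calculator, a couple of lines of Python (assuming NumPy) confirm the two inequalities used above on a grid of \(\epsilon \in (0, \frac12]\):

```python
import numpy as np

eps = np.linspace(1e-6, 0.5, 1_000)

# The two facts used above, checked on a grid of epsilon in (0, 1/2].
print(np.all(1 / (1 + eps) >= 1 - 2 * eps))  # True
print(np.all(1 / (1 - eps) <= 1 + 2 * eps))  # True
```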

For our analysis, we assumed that we could hash to real numbers in \([0,1]\). In practice, the hash function maps to bit vectors, and the sketch tracks the maximum number of leading zeros seen in any hashed value. Intuitively, the more distinct hashes we see, the higher we expect this maximum to be.
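Here is a rough sketch of the bit-vector idea (a simplified Flajolet–Martin-style estimator for illustration only; the bias corrections and bucketing used in production refinements like HyperLogLog are omitted):

```python
import hashlib

def leading_zeros_32(x):
    """Hash x to 32 bits and count the number of leading zero bits."""
    h = int.from_bytes(hashlib.sha256(str(x).encode()).digest()[:4], "big")
    return 32 - h.bit_length()

def estimate_distinct_bits(stream):
    z = 0                                      # maximum number of leading zeros seen
    for x in stream:
        z = max(z, leading_zeros_32(x))
    return 2 ** (z + 1)                        # crude estimate: a power of two, correct only
                                               # up to a constant factor, with high variance

print(estimate_distinct_bits(range(10_000)))   # a power of two within a small factor of 10,000
```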

While the space complexity is very small, the true benefit of the algorithm is that it can be implemented in a distributed setting. Each machine keeps its own minimum hashed values; these can be sent to a central location, where the overall minimum \(S = \min_i S_i\) is computed. Use cases of the algorithm include counting the number of distinct users in New York City that made at least one search containing the word ‘car’ in the last month, or counting the number of distinct subject lines in emails sent by users that have registered in the last week (to detect spam). Answering such a query requires a distributed linear scan over the database that can be done in less than two seconds using Google’s implementation.