We saw that, without additional structure, we expect learning in \(d\) dimensions to require \(n=2^d\) data points. If we really had a data set that large, the JL lemma would be vacuous since \(\log n = d\).

The JL lemma tells us how we can preserve \(\ell_2\)-norm distances between points. Let’s see how we can preserve similarity between points.

Similarity Estimation

Let’s consider the following problem: How do services like Shazam match a song clip against a library of more than 11 million songs in a fraction of a second? To make the problem more challenging, the song clips we match generally have additional noise like background sound in a car.

We know how Shazam does this because they published a white paper on their method. The key idea is to use a spectrogram to represent the song clip. A spectrogram is a 2D matrix where the rows correspond to frequencies and the columns correspond to time. We then convert the spectrogram to a fingerprint by taking its peaks and representing them as a high-dimensional binary vector.

The process is a little more challenging in practice because a recorded clip may be slightly offset from the song clips in our database; to make the method work, Shazam uses anchor points to align the clips. But, for our purposes, we are primarily interested in how we can compress the high-dimensional binary vector.

Once we have a fingerprint from a song clip, we want to search our database for similar songs. Formally, given a binary vector \(\mathbf{x} \in \{0,1\}^d\) representing our song clip in \(d\) dimensions, we want to find a nearby fingerprint \(\mathbf{y} \in \{0,1\}^d\) in our database. The first challenge is that our database is possibly huge, with \(O(nd)\) bits where \(n\) is the number of song clips. The second challenge is that computing the distance between \(\mathbf{x}\) and \(\mathbf{y}\) directly is expensive, taking \(O(d)\) time.

In light of these challenges, our goal is to design a more compact sketch for comparing \(\mathbf{x}\) and \(\mathbf{y}\). Ideally, both the space for a sketch and the time to compare two sketches should be \(k \ll d\). We will compute compressions \(C(\mathbf{x})\) and \(C(\mathbf{y})\) of \(\mathbf{x}\) and \(\mathbf{y}\), respectively. As with the JL compression, we want \(C(\mathbf{x})\) and \(C(\mathbf{y})\) to be close when \(\mathbf{x}\) and \(\mathbf{y}\) are similar and far otherwise.

We will use the notion of Jaccard similarity to measure the similarity between \(\mathbf{x}\) and \(\mathbf{y}\).

Jaccard Similarity: The Jaccard similarity between \(\mathbf{x}\) and \(\mathbf{y}\) is \[\begin{align*} J(\mathbf{x}, \mathbf{y}) = \frac{|\mathbf{x} \cap \mathbf{y}|}{|\mathbf{x} \cup \mathbf{y}|} = \frac{\textrm{# non-zero entries in common}}{\textrm{# non-zero entries in total}}. \end{align*}\]
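To make the definition concrete, here is a minimal Python sketch that computes the Jaccard similarity of two binary vectors (the function name jaccard_similarity and the example vectors are ours, chosen purely for illustration):

```python
import numpy as np

def jaccard_similarity(x: np.ndarray, y: np.ndarray) -> float:
    """Jaccard similarity of two binary vectors: |x ∩ y| / |x ∪ y|."""
    intersection = np.sum((x == 1) & (y == 1))  # non-zero entries in common
    union = np.sum((x == 1) | (y == 1))         # non-zero entries in total
    return float(intersection / union) if union > 0 else 0.0

x = np.array([1, 0, 1, 1, 0])
y = np.array([1, 1, 1, 0, 0])
print(jaccard_similarity(x, y))  # 2 entries in common, 4 in total: 0.5
```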

Jaccard similarity can be applied to any data which has a natural binary representation:

  • We can use the “bag-of-words” model to represent a document as a binary vector where each entry is 1 if the corresponding word appears in the document and 0 otherwise. Then the Jaccard similarity is the number of words the two documents share divided by the total number of distinct words appearing in either document.

  • We can extract features from earthquake data for early detection. The approach is described in this paper.

  • We can compare cached web pages to determine if they are similar and avoid downloading the same content multiple times.

MinHash Algorithm

We’ll use the MinHash algorithm to build the compression function \(C:\{0,1\}^d \rightarrow \mathbb{R}^k\).

Consider an input \(\mathbf{x} \in \{0,1\}^d\). We start by choosing \(k\) random hash functions \(h_i: \{1, \ldots, d\} \rightarrow [0,1]\) for \(i \in [k]\). For each hash function \(h_i\) we’ll compute \[\begin{align*} c_i = \min_{j \in \{1, \ldots, d\}: \mathbf{x}_j = 1} h_i(j). \end{align*}\] In words, we hash each index where \(\mathbf{x}\) is non-zero to a random value in \([0,1]\). Then we take the minimum value of the hash function over all the non-zero indices. We call the compression \(C(\mathbf{x}) = (c_1, \ldots, c_k)\). We can see this process represented in the figure below.
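Below is a minimal Python sketch of this compression, under a simplifying assumption: each hash function \(h_i\) is simulated by a row of an explicit \(k \times d\) table of uniform random values (a real implementation would use cheap hash functions instead of storing such a table). The names minhash_sketch, H, and C_x are ours:

```python
import numpy as np

def minhash_sketch(x: np.ndarray, H: np.ndarray) -> np.ndarray:
    """MinHash compression C(x) = (c_1, ..., c_k) of a binary vector x.

    H is a k x d matrix of i.i.d. uniform values in [0, 1]; row i plays the
    role of the hash function h_i, so H[i, j] stands in for h_i(j).  The same
    H must be shared by every vector we want to compare.
    """
    nonzero = np.flatnonzero(x)        # indices j with x_j = 1
    return H[:, nonzero].min(axis=1)   # c_i = min over non-zero j of h_i(j)

rng = np.random.default_rng(0)
d, k = 1000, 400
H = rng.random((k, d))                 # shared simulated hash functions
x = rng.integers(0, 2, size=d)         # a random binary fingerprint
C_x = minhash_sketch(x, H)             # k numbers instead of d bits
```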

We’ll argue that for all \(i\), \[\begin{align*} \Pr \left( c_i(\mathbf{x}) = c_i(\mathbf{y}) \right) = J(\mathbf{x}, \mathbf{y}) = \frac{|\mathbf{x} \cap \mathbf{y}|}{|\mathbf{x} \cup \mathbf{y}|}. \end{align*}\]

Every non-zero index in \(\mathbf{x} \cup \mathbf{y}\) is equally likely to produce the lowest hash value. We have \(c_i(\mathbf{x}) = c_i(\mathbf{y})\) if and only if the index hashed to the lowest value is 1 in both \(\mathbf{x}\) and \(\mathbf{y}\). Since there are \(|\mathbf{x} \cap \mathbf{y}|\) such indices, the probability that \(c_i(\mathbf{x}) = c_i(\mathbf{y})\) is the Jaccard similarity.

Inspired by this observation, the MinHash algorithm returns an estimate for the Jaccard similarity between \(\mathbf{x}\) and \(\mathbf{y}\): \[\begin{align*} \tilde{J}(\mathbf{x}, \mathbf{y}) = \frac1{k} \sum_{i=1}^k \mathbb{1}[c_i(\mathbf{x}) = c_i(\mathbf{y})]. \end{align*}\] By linearity of expectation, we have \[\begin{align*} \mathbb{E}[\tilde{J}(\mathbf{x}, \mathbf{y})] = \frac1{k} \sum_{i=1}^k \Pr( c_i(\mathbf{x}) = c_i(\mathbf{y})) = J(\mathbf{x}, \mathbf{y}). \end{align*}\] We can reduce the variance of the estimator by increasing the number of hash functions \(k\). Since the hash functions are independent, the variance of the estimator is \[\begin{align*} \textrm{Var}( \tilde{J}(\mathbf{x}, \mathbf{y}) ) &= \frac1{k^2} \sum_{i=1}^k \textrm{Var}( \mathbb{1}[c_i(\mathbf{x}) = c_i(\mathbf{y})] ) \\ &= \frac1{k^2} \sum_{i=1}^k \left( J(\mathbf{x}, \mathbf{y}) - J(\mathbf{x}, \mathbf{y})^2 \right) \leq \frac1{k} J(\mathbf{x}, \mathbf{y}) \leq \frac1{k}. \end{align*}\]
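Continuing the snippets above (and reusing minhash_sketch, jaccard_similarity, rng, H, x, and d from them), the estimator simply counts how often the two sketches agree; the bit-flipping used to build y is just one way to produce a pair of similar fingerprints:

```python
def estimate_jaccard(C_x: np.ndarray, C_y: np.ndarray) -> float:
    """MinHash estimate of J(x, y): fraction of coordinates where the sketches collide."""
    return float(np.mean(C_x == C_y))

# Build a second fingerprint y that overlaps heavily with x.
y = x.copy()
y[rng.choice(d, size=100, replace=False)] ^= 1   # flip 100 random bits

C_y = minhash_sketch(y, H)                       # use the same hash functions H
print(estimate_jaccard(C_x, C_y), jaccard_similarity(x, y))  # the two values should be close
```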

How big should we choose \(k\) so that the estimate is within an additive error \(\epsilon\) of its expectation with probability \(1-\delta\)?

Chebyshev’s inequality, together with the fact that the standard deviation of \(\tilde{J}(\mathbf{x}, \mathbf{y})\) is at most \(1/\sqrt{k}\), tells us that \[\begin{align*} \Pr\left( | J(\mathbf{x}, \mathbf{y}) - \tilde{J}(\mathbf{x}, \mathbf{y}) | \geq \alpha \frac{1}{\sqrt{k}}\right) \leq \frac1{\alpha^2} = \delta. \end{align*}\]

The additive error is \(\epsilon= \alpha \frac{1}{\sqrt{k}}\). Since \(\alpha = 1/\sqrt{\delta}\), we have \(k = \frac1{\epsilon^2 \delta}\).
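For concreteness, with \(\epsilon = 0.1\) and \(\delta = 0.05\) (values chosen here purely for illustration), \[\begin{align*} k = \frac{1}{\epsilon^2 \delta} = \frac{1}{(0.1)^2 (0.05)} = 2000. \end{align*}\]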

We have shown that as long as \(k=O(\frac1{\epsilon^2 \delta})\), with probability \(1-\delta\), \[\begin{align*} J(\mathbf{x}, \mathbf{y}) - \epsilon \leq \tilde{J} (\mathbf{x}, \mathbf{y}) \leq J(\mathbf{x}, \mathbf{y}) + \epsilon. \end{align*}\] Notice that we only need \(O(k)\) time to compute the estimate, which is independent of the original fingerprint dimension \(d\).
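We can check this guarantee empirically with the snippets above (again, \(\epsilon\), \(\delta\), and the number of trials are illustrative choices) by repeating the sketching with fresh hash functions and counting how often the estimate misses by more than \(\epsilon\):

```python
eps, delta = 0.1, 0.05
k = int(1 / (eps**2 * delta))          # Chebyshev-based choice: k = 2000
J = jaccard_similarity(x, y)           # true Jaccard similarity

trials, failures = 100, 0
for _ in range(trials):
    H = rng.random((k, d))             # fresh hash functions each trial
    J_hat = estimate_jaccard(minhash_sketch(x, H), minhash_sketch(y, H))
    failures += abs(J_hat - J) > eps
print(failures / trials)               # empirical failure rate, well below delta
```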

We can improve the result to have a \(\log(1/\delta)\) dependence using a Chernoff bound or the biased coin bound we showed in the concentration inequalities lecture.

Biased Coin Bound: Let the probability that a coin lands heads be \(b\) and choose \(k \geq \frac{3\log(2/\delta)}{\epsilon^2}\). If we flip the coin \(k\) times and let \(S\) be the number of heads, then \(\mathbb{E}[S] = bk\) and \[ \Pr(| S - b k | \geq \epsilon k) \leq \delta. \]

Think of the indicator random variables \(\mathbb{1}[c_i(\mathbf{x}) = c_i(\mathbf{y})]\) as coin flips, where the probability the coin lands “heads” is \(J(\mathbf{x}, \mathbf{y})\). Applying the biased coin bound and dividing by \(k\), with probability \(1-\delta\) the fraction of “heads”, which is exactly \(\tilde{J}(\mathbf{x}, \mathbf{y})\), is within \(\epsilon\) of \(J(\mathbf{x}, \mathbf{y})\) if we set \(k=O(\frac{\log(1/\delta)}{\epsilon^2})\).
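To see how much the improved dependence matters, take \(\epsilon = 0.1\) and \(\delta = 0.001\) (again, illustrative values, with \(\log\) the natural logarithm): \[\begin{align*} \textrm{Chebyshev: } k &= \frac{1}{\epsilon^2 \delta} = \frac{1}{(0.1)^2 (0.001)} = 100{,}000, \\ \textrm{Chernoff: } k &\geq \frac{3\log(2/\delta)}{\epsilon^2} = \frac{3\log(2000)}{(0.1)^2} \approx 2{,}280, \end{align*}\] so the logarithmic dependence on \(1/\delta\) is a substantial saving when we want high confidence.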