# Dealing with High-Dimensional Data: Sketching & the JL-Lemma

*2025-06-09*

It's extremely common in computer science to be dealing with very high-dimensional vectors. A natural question is whether we can *sketch*, or reduce the dimensionality of, these vectors while still maintaining the larger structure of the data. A common application is clustering, where we group datapoints based on their distances to one another. For this kind of application, we need to maintain the pairwise distances between any two datapoints.

In particular, given vectors $v_1, \dots, v_n \in \mathbb{R}^d$, where the dimension $d$ is large, we want to find a function $\Pi : \mathbb{R}^d \to \mathbb{R}^k$ where $k \ll d$ and pairwise distances are preserved. In other words, we want:

**Definition 0.1 ($\epsilon$ Pairwise Distance Property).** For all pairs $v_i, v_j \in \{v_1, \dots, v_n\}$ and some small $\epsilon > 0$,

$$(1-\epsilon)\,\|v_i - v_j\| \;\le\; \|\Pi(v_i) - \Pi(v_j)\| \;\le\; (1+\epsilon)\,\|v_i - v_j\|,$$

which tells us that $\Pi$ should preserve the distance between any two vectors with greater and greater accuracy as the parameter $\epsilon$ approaches 0.

**Theorem 0.1 (Johnson-Lindenstrauss).** For some $k = O(\log(n)/\epsilon^2)$, there exists a $\Pi$ satisfying the property above. This means that $n$ points in a high-dimensional space can be mapped down to $k$ dimensions while still preserving pairwise distances. It turns out, moreover, that $\Pi$ is a linear map and can be computed efficiently.

**Algorithm idea.** A first attempt at dimensionality reduction might be to keep a random subset of the coordinates of each vector. This attempt fails because its accuracy depends heavily on the basis the vectors are represented in: if $v_i - v_j$ is concentrated on a few coordinates, a random subset will usually miss them entirely. The actual idea is as follows. We generate a vector $u \in \mathbb{R}^d$ pointing in a random direction and take the inner product $\langle v, u \rangle$. We do this $k = O(\log n)$ times, storing each result as one coordinate of a vector in $\mathbb{R}^k$. The main intuition is that $\langle v, u \rangle = \|v\|\,\|u\|\cos(\theta) \propto \|v\|$, and that viewing the vector from many different directions gives you a lot of information about it.

The way we generate a vector in a random direction is to draw each entry i.i.d. from a Gaussian $\mathcal{N}(0,1)$. We won't prove why this works, but the intuition is that applying any orthogonal matrix $U$ (i.e. a distance-preserving linear map, with $U^\top U = I$) to a Gaussian random vector yields another random vector with the same distribution. Since the distribution is invariant under rotation, Gaussian random vectors must be uniformly distributed across all directions. Putting this all together gives rise to the following algorithm.

**Algorithm.** Build the matrix $A \in \mathbb{R}^{k \times d}$ where each row is a random Gaussian vector, meaning each entry $A_{ij} \sim \mathcal{N}(0,1)$, and then scale every entry by $1/\sqrt{k}$. Then compute $\Pi(v) = Av$, which takes the inner product of $v$ with each of the scaled, random $d$-dimensional Gaussian rows.

## Johnson-Lindenstrauss Property

**Definition 0.2 ($(\epsilon,\delta)$ JL Property).** A distribution over matrices $A \in \mathbb{R}^{k \times d}$ satisfies this property if, whenever $k = O(\log(1/\delta)/\epsilon^2)$, for any $v \in \mathbb{R}^d$ the inequality

$$(1-\epsilon)\,\|v\| \;\le\; \|Av\| \;\le\; (1+\epsilon)\,\|v\|$$

holds with probability $1-\delta$.

**Remark 0.1.** Notice that if we can show that $A$ satisfies the JL property, then we can prove the JL theorem. Pick one of the $\binom{n}{2}$ pairs of vectors $v_i, v_j$ and form $v_i - v_j$. By linearity, $\|A(v_i - v_j)\| = \|Av_i - Av_j\| = \|\Pi(v_i) - \Pi(v_j)\|$. If $A$ satisfies the $(\epsilon,\delta)$ JL property, then this pair preserves distance with probability $1-\delta$, and fails to preserve distance with probability $\delta$. If we let $F_{ij}$ denote the event that the pair $(v_i, v_j)$ fails, then by the union bound the probability that any pair fails is

$$\Pr\Big(\bigcup_{i<j} F_{ij}\Big) \;\le\; \sum_{i<j} \Pr(F_{ij}) \;=\; \binom{n}{2}\,\delta.$$

So choosing $\delta < \delta^{\ast}/\binom{n}{2}$, if $A$ satisfies the $(\epsilon,\delta)$ property, then we preserve all pairwise distances with probability $1-\delta^{\ast}$, as long as $k = O(\log(n/\delta^{\ast})/\epsilon^2)$.
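To make this concrete, here is a minimal `numpy` sketch of the construction above together with the pairwise-distance check from Remark 0.1. The sizes and the constant $8$ in the choice of $k$ are hypothetical placeholders; the theory only promises that some $k = O(\log(n)/\epsilon^2)$ works.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 200, 10_000, 0.25
k = int(8 * np.log(n) / eps**2)   # hypothetical constant 8 in k = O(log(n)/eps^2)

V = rng.standard_normal((n, d))               # rows are v_1, ..., v_n
A = rng.standard_normal((k, d)) / np.sqrt(k)  # A_ij ~ N(0,1), scaled by 1/sqrt(k)
W = V @ A.T                                   # row i is Pi(v_i) = A v_i

def pairwise_dists(X):
    """Euclidean distances between all rows i < j of X, in condensed form."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    iu = np.triu_indices(len(X), k=1)
    return np.sqrt(np.maximum(d2[iu], 0.0))   # clamp tiny negatives from rounding

# Definition 0.1 says every ratio should land in [1 - eps, 1 + eps].
ratios = pairwise_dists(W) / pairwise_dists(V)
print(f"k = {k}; distance ratios in [{ratios.min():.3f}, {ratios.max():.3f}]")
```

On typical runs the ratios sit comfortably inside $[1-\epsilon, 1+\epsilon]$, even though $k$ is far smaller than $d$.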
**Remark 0.2.** We will show that our matrix $A$ satisfies the property, but it is not the only such matrix. For example, Rademacher matrices, where each entry is independently chosen to be either $+1/\sqrt{k}$ or $-1/\sqrt{k}$ with equal probability, satisfy the property. A random orthogonal matrix will also satisfy the property.

**Theorem 0.2.** Our construction of $A$ satisfies the $(\epsilon,\delta)$ JL property.

*Proof.* Consider some $v \in \mathbb{R}^d$. It suffices to prove the theorem for a unit vector with $\|v\| = 1$, since the general case then follows by scaling.

An important fact is the *stability* of Gaussian random variables: a linear combination of two independent Gaussians, $c_1\mathcal{N}(\mu_1,\sigma_1^2) + c_2\mathcal{N}(\mu_2,\sigma_2^2)$, follows the distribution $\mathcal{N}(c_1\mu_1 + c_2\mu_2,\; c_1^2\sigma_1^2 + c_2^2\sigma_2^2)$. We can use this to show that the inner product of $v$ with a random Gaussian vector $u$ is distributed as

$$\langle v, u\rangle = \sum_{i=1}^d [v]_i \cdot \mathcal{N}(0,1) = \mathcal{N}\Big(0,\; \sum_{i=1}^d [v]_i^2\Big) = \mathcal{N}(0, \|v\|^2) = \mathcal{N}(0,1).$$

If we let $u_i$ denote the $i$-th row of $A$ before the $1/\sqrt{k}$ scaling, this tells us that

$$Av = \frac{1}{\sqrt{k}}\big(\langle v, u_1\rangle, \dots, \langle v, u_k\rangle\big) = \frac{1}{\sqrt{k}}(y_1, \dots, y_k), \qquad y_i \sim \mathcal{N}(0,1) \text{ i.i.d.}$$

Now we want to show that $\Pr(\|Av\| \ge (1+\epsilon)\|v\|)$ and $\Pr(\|Av\| \le (1-\epsilon)\|v\|)$ are both small. We do this by proving the stronger inequalities on the squared norm (note $(1+\epsilon)^2 \ge 1+\epsilon$ and $(1-\epsilon)^2 \le 1-\epsilon$, so the events below contain the ones we care about). For any $0 < \lambda < 1/2$,

$$
\begin{aligned}
\Pr\big(\|Av\|^2 \ge (1+\epsilon)\|v\|^2\big)
&= \Pr\Big(\sum_{i=1}^k y_i^2 \ge (1+\epsilon)k\Big)
 = \Pr\Big(e^{\lambda \sum_i y_i^2} \ge e^{\lambda(1+\epsilon)k}\Big) \\
&\le \frac{\mathbb{E}\big[e^{\lambda \sum_i y_i^2}\big]}{e^{\lambda(1+\epsilon)k}}
 = \prod_{i=1}^k \frac{\mathbb{E}\big[e^{\lambda y_i^2}\big]}{e^{\lambda(1+\epsilon)}}
 = \Big(\frac{1}{\sqrt{1-2\lambda}\; e^{\lambda(1+\epsilon)}}\Big)^{k} \\
&\le \big((1+\epsilon)e^{-\epsilon}\big)^{k/2} \le \delta/2,
\end{aligned}
$$

where the first inequality is Markov's inequality, the product uses independence of the $y_i$, the closed form is the Gaussian moment generating function $\mathbb{E}[e^{\lambda y^2}] = 1/\sqrt{1-2\lambda}$, and the last two steps follow by choosing $\lambda = \epsilon/(2(1+\epsilon))$ and then $k = O(\log(1/\delta)/\epsilon^2)$. A nearly identical proof shows

$$
\begin{aligned}
\Pr\big(\|Av\|^2 \le (1-\epsilon)\|v\|^2\big)
&= \Pr\Big(e^{-\lambda \sum_i y_i^2} \ge e^{-\lambda(1-\epsilon)k}\Big)
 \le \frac{\mathbb{E}\big[e^{-\lambda \sum_i y_i^2}\big]}{e^{-\lambda(1-\epsilon)k}} \\
&= \Big(\frac{1}{\sqrt{1+2\lambda}\; e^{-\lambda(1-\epsilon)}}\Big)^{k}
 \le \big((1-\epsilon)e^{\epsilon}\big)^{k/2} \le \delta/2.
\end{aligned}
$$

So $\Pr(\text{out of bounds}) \le \delta$, proving that our construction satisfies the $(\epsilon,\delta)$ JL property. $\blacksquare$

## Algorithmic Speedup

The naive algorithm for computing all pairwise distances takes $O(n^2 d)$ time: $O(d)$ per distance for each of the $\binom{n}{2}$ pairs. Since our matrix has dimensions $O(\log n) \times d$, it takes $O(d \log n)$ time to project one vector down, and we do this $n$ times. Once we have the lower-dimensional vectors, however, we only need to compute the $\binom{n}{2} = O(n^2)$ pairwise distances in $O(\log n)$ time each. So the algorithm runs in $O(nd\log n + n^2\log n)$ time, which is a win whenever $d \gg \log n$. This says we can get a good approximation much more quickly than by brute force! (A rough timing sketch appears at the end of these notes.)

## Subspace Embedding

Let $V$ be a $d$-dimensional space. We're interested in a $k$-dimensional subspace of this space: given $v_1, v_2, \dots, v_k \in V$, let $U = \operatorname{span}(v_1, v_2, \dots, v_k)$.

**Theorem.** If we project each $v \in U$ to $Av$, where $A$ now has $m$ rows, then when $m = O\big((k\log(1/\epsilon) + \log(1/\delta))/\epsilon^2\big)$, we have for all $v \in U$

$$(1-\epsilon)\,\|v\| \;\le\; \|Av\| \;\le\; (1+\epsilon)\,\|v\|$$

with probability $1-\delta$.

*Proof idea.* Ideally, we would apply the union bound over all vectors in the subspace. However, there are infinitely many vectors, and we can't do that... The main idea is to look at a nearby neighborhood of each vector, via a finite set of representatives called the $\epsilon$-net.
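Finally, returning to the speedup discussion above: the following is a rough timing sketch, in the same hypothetical `numpy` setup as before, comparing brute-force pairwise distances against sketch-then-compare. Absolute times and the crossover point depend on $n$, $d$, the constant in $k$, and the underlying BLAS, so treat the numbers as illustrative only.

```python
import time
import numpy as np

rng = np.random.default_rng(1)
n, d, eps = 2_000, 10_000, 0.5
k = int(8 * np.log(n) / eps**2)   # hypothetical constant 8, as before

V = rng.standard_normal((n, d))

def pairwise_dists(X):
    """Euclidean distances between all rows i < j of X, in condensed form."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    iu = np.triu_indices(len(X), k=1)
    return np.sqrt(np.maximum(d2[iu], 0.0))

t0 = time.perf_counter()
exact = pairwise_dists(V)                     # brute force: O(n^2 d)
t1 = time.perf_counter()
A = rng.standard_normal((k, d)) / np.sqrt(k)
approx = pairwise_dists(V @ A.T)              # sketch first: O(ndk + n^2 k)
t2 = time.perf_counter()

worst = np.max(np.abs(approx / exact - 1.0))
print(f"brute force: {t1 - t0:.2f}s  sketched: {t2 - t1:.2f}s  "
      f"worst relative error: {worst:.3f}")
```

Since $k = O(\log n)$ while $d$ is in the tens of thousands, the sketched pass does far less arithmetic per pair, at the cost of a one-time $O(ndk)$ projection and a small, bounded relative error.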