Dealing with High-Dimensional Data

Sketching & the JL-Lemma

It's extremely common in computer science to work with very high-dimensional vectors. A natural question is whether we can sketch, or reduce the dimensionality of, these vectors while still maintaining the larger structure of the data. A common application is clustering, where we group datapoints based on their distances to other data. For this kind of application, we need to be able to maintain the pairwise distances between any two datapoints. In particular, given vectors $v_1, \dots, v_n \in \mathbb{R}^d$, where the dimension $d$ is large, we want to find a function $f : \mathbb{R}^d \to \mathbb{R}^m$ where $m \ll d$ and pairwise distances are preserved. In other words, we want:

Definition 0.1 ($\varepsilon$ Pairwise Distance Property): For all pairs $v_i, v_j \in \{v_1, \dots, v_n\}$ and some small $\varepsilon$,

$$(1 - \varepsilon)\|v_i - v_j\| \le \|f(v_i) - f(v_j)\| \le (1 + \varepsilon)\|v_i - v_j\|,$$

which tells us that $f$ should preserve the distance between any two vectors with greater and greater accuracy as our parameter $\varepsilon$ approaches 0.

Theorem 0.1 (Johnson-Lindenstrauss): For $m = O(\log(n)/\varepsilon^2)$, there exists an $f$ satisfying the property above. This means that $n$ points in a high-dimensional space can be mapped down to $m$ dimensions while still preserving pairwise distances. In fact, it turns out that $f$ is a linear map and can be computed efficiently.

Algorithm idea. A first attempt at dimensionality reduction might be to choose some subset of coordinates to keep from each vector at random. This attempt fails because the accuracy depends heavily on the basis the vectors are represented in: sampling coordinates of the standard basis vector $e_1 = (1, 0, \dots, 0)$, for instance, almost always yields the all-zeros vector. The actual idea is as follows. We generate a vector $g \in \mathbb{R}^d$ that points in a random direction and take the inner product $\langle v, g \rangle$. We do this $m = O(\log n)$ times, storing each result as one coordinate of a vector in $\mathbb{R}^m$. The main intuition is that $\langle v, g \rangle = \|v\| \|g\| \cos(\theta) \propto \|v\|$, and that viewing the vector from many different directions gives you a lot of information about the vector.

The way we generate a vector in a random direction is to draw each individual entry i.i.d. from a Gaussian $\mathcal{N}(0, 1)$. We won't prove why this works, but the intuition is that applying any orthogonal matrix (i.e. a distance-preserving linear map, where $U^T U = I$) to a Gaussian random vector yields another random vector with the same distribution. Since rotation preserves distances, the distribution is invariant to rotation, so Gaussian random vectors must be uniformly distributed across all directions. Putting this all together gives rise to the following algorithm.

Algorithm. Build the matrix $A \in \mathbb{R}^{m \times d}$ where each row is a random Gaussian vector, which just means that each entry $A_{ij} \sim \mathcal{N}(0, 1)$, and then scale every entry by $1/\sqrt{m}$. Then compute $f(v) = Av$, which takes the inner product of $v$ with each of the scaled, random $d$-dimensional Gaussian rows.
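To make the construction concrete, here is a minimal numpy sketch of the algorithm. The dimensions, the seed, and the constant 8 in the choice of $m$ are illustrative assumptions, not values dictated by the theorem.

```python
import numpy as np

def jl_sketch_matrix(m: int, d: int, rng: np.random.Generator) -> np.ndarray:
    """Rows are random Gaussian vectors; every entry is scaled by 1/sqrt(m)."""
    return rng.standard_normal((m, d)) / np.sqrt(m)

rng = np.random.default_rng(0)
n, d, eps = 50, 10_000, 0.2
m = int(np.ceil(8 * np.log(n) / eps**2))  # m = O(log(n)/eps^2); constant is a guess

A = jl_sketch_matrix(m, d, rng)
V = rng.standard_normal((n, d))           # rows are the points v_1, ..., v_n
FV = V @ A.T                              # row i is f(v_i) = A v_i

# Spot-check the pairwise distance property on one pair of points.
orig = np.linalg.norm(V[0] - V[1])
sketched = np.linalg.norm(FV[0] - FV[1])
print(f"||f(v_1) - f(v_2)|| / ||v_1 - v_2|| = {sketched / orig:.3f}")  # ~1 +/- eps
```

Note that because $A$ is linear, sketching a difference $v_i - v_j$ is the same as taking the difference of the sketches, which is exactly what the argument below exploits.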
Johnson-Lindenstrauss Property

Definition 0.2 ($(\varepsilon, \delta)$ JL Property): A distribution over matrices $A \in \mathbb{R}^{m \times d}$ satisfies this property if, whenever $m = O(\log(1/\delta)/\varepsilon^2)$, for any $v \in \mathbb{R}^d$ the inequality

$$(1 - \varepsilon)\|v\| \le \|Av\| \le (1 + \varepsilon)\|v\|$$

holds with probability $1 - \delta$.

Remark 0.1: Notice that if we can show that $A$ satisfies the JL property, then we will be able to prove the JL theorem. Consider picking one of the $\binom{n}{2}$ pairs of vectors $v_i, v_j$, and use that pair to form $v_i - v_j$. Then, by linearity, we know $\|A(v_i - v_j)\| = \|Av_i - Av_j\| = \|f(v_i) - f(v_j)\|$. If $A$ satisfies the $(\varepsilon, \delta)$ JL property, then this pair will preserve distance with probability $1 - \delta$, and will fail to preserve distance with probability $\delta$. If we let $F_k$ be an indicator random variable denoting whether the $k$-th pair fails, then by the union bound, the probability that any pair fails is

$$\Pr\left(F_1 \cup \dots \cup F_{\binom{n}{2}}\right) \le \sum_k \Pr(F_k) = \binom{n}{2}\delta.$$

So, choosing $\delta < \delta^*/\binom{n}{2}$, if $A$ satisfies the $(\varepsilon, \delta)$ property, then we preserve all pairwise distances with probability $1 - \delta^*$, as long as $m = O(\log(n/\delta^*)/\varepsilon^2)$.

Remark 0.2: We will show that our matrix $A$ satisfies the property, but it is not the only such matrix. For example, the Rademacher matrices, where each entry is independently chosen to be either $+1/\sqrt{m}$ or $-1/\sqrt{m}$ with equal probability, satisfy the property. A random orthogonal matrix (appropriately scaled) will also satisfy the property.

Theorem 0.2: Our construction of $A$ satisfies the $(\varepsilon, \delta)$ JL property.

Proof. Consider some $v \in \mathbb{R}^d$. It suffices to prove the theorem for a unit vector with $\|v\| = 1$, since $A$ is linear and the general case follows by scaling.

An important fact is the stability of Gaussian random variables: if we take a linear combination of two independent Gaussians, $a_1\mathcal{N}(\mu_1, \sigma_1^2) + a_2\mathcal{N}(\mu_2, \sigma_2^2)$ follows the distribution $\mathcal{N}(a_1\mu_1 + a_2\mu_2, a_1^2\sigma_1^2 + a_2^2\sigma_2^2)$. We can use this to show that the inner product of $v$ with a random Gaussian vector $g$ is distributed as

$$\langle v, g \rangle = \sum_{i=1}^d [v]_i \cdot \mathcal{N}(0, 1) = \mathcal{N}\!\left(0, \sum_{i=1}^d [v]_i^2\right) = \mathcal{N}(0, \|v\|^2) = \mathcal{N}(0, 1).$$

If we let $g_i$ denote the $i$-th row of $A$, this tells us that

$$Av = \frac{1}{\sqrt{m}}\left(\langle v, g_1 \rangle, \dots, \langle v, g_m \rangle\right) = \frac{1}{\sqrt{m}}(y_1, \dots, y_m), \qquad y_i \sim \mathcal{N}(0, 1) \text{ i.i.d.}$$

Now, we want to show that $\Pr(\|Av\| \ge (1+\varepsilon)\|v\|)$ and $\Pr(\|Av\| \le (1-\varepsilon)\|v\|)$ are both small. We do this by proving the stronger inequalities on the squared norm. For the upper tail, we apply Markov's inequality to the moment generating function $\mathbb{E}[e^{\lambda y_i^2}] = 1/\sqrt{1 - 2\lambda}$ (valid for $0 < \lambda < 1/2$) and choose $\lambda = \varepsilon/(2(1+\varepsilon))$:

$$\Pr(\|Av\|^2 \ge (1+\varepsilon)\|v\|^2) = \Pr\!\left(\sum_{i=1}^m y_i^2 \ge (1+\varepsilon)m\right) = \Pr\!\left(e^{\lambda \sum y_i^2} \ge e^{\lambda(1+\varepsilon)m}\right) \le \frac{\mathbb{E}[e^{\lambda \sum y_i^2}]}{e^{\lambda(1+\varepsilon)m}} = \prod_{i=1}^m \frac{\mathbb{E}[e^{\lambda y_i^2}]}{e^{\lambda(1+\varepsilon)}} = \left(\frac{1}{\sqrt{1-2\lambda}\, e^{\lambda(1+\varepsilon)}}\right)^{\!m} = \left((1+\varepsilon)e^{-\varepsilon}\right)^{m/2} \le \delta/2,$$

where the last inequality holds for $m = O(\log(1/\delta)/\varepsilon^2)$ because $(1+\varepsilon)e^{-\varepsilon} \le e^{-\Omega(\varepsilon^2)}$. A nearly identical proof with $\lambda = \varepsilon/(2(1-\varepsilon))$ shows:

$$\Pr(\|Av\|^2 \le (1-\varepsilon)\|v\|^2) = \Pr\!\left(e^{-\lambda \sum y_i^2} \ge e^{-\lambda(1-\varepsilon)m}\right) \le \frac{\mathbb{E}[e^{-\lambda \sum y_i^2}]}{e^{-\lambda(1-\varepsilon)m}} = \left(\frac{1}{\sqrt{1+2\lambda}\, e^{-\lambda(1-\varepsilon)}}\right)^{\!m} = \left((1-\varepsilon)e^{\varepsilon}\right)^{m/2} \le \delta/2.$$

So $\Pr(\text{out of bounds}) \le \delta$, proving that our construction satisfies the $(\varepsilon, \delta)$ JL property. ∎

Algorithmic Speedup

The naive algorithm to compute all pairwise distances takes $O(n^2 d)$ time: each of the $\binom{n}{2}$ pairs takes $O(d)$ time. With the sketch, since the matrix has dimensions $O(\log n) \times d$, it takes $O(d \log n)$ time to project a vector down, and we do this $n$ times. Once we have the lower-dimensional vectors, however, we just need to compute the $\binom{n}{2} = O(n^2)$ pairwise distances in $O(\log n)$ time each. So the algorithm runs in $O(dn\log n + n^2\log n)$ time, which is a win whenever $d \gg \log n$.
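As a sanity check of this pipeline, the following sketch projects every point once and then measures all $\binom{n}{2}$ distances in the low-dimensional space. The dimensions and constants are again illustrative assumptions, and the true distances are computed here only to report the distortion; a real application would skip that $O(n^2 d)$ step entirely.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, eps = 100, 5_000, 0.25
m = int(np.ceil(8 * np.log(n) / eps**2))      # m = O(log(n)/eps^2)

V = rng.standard_normal((n, d))               # the n high-dimensional points
A = rng.standard_normal((m, d)) / np.sqrt(m)  # the Gaussian sketch matrix
W = V @ A.T                                   # all n projections: O(n d log n) total

def pairwise_dists(X: np.ndarray) -> np.ndarray:
    """All pairwise Euclidean distances via ||x-y||^2 = ||x||^2 + ||y||^2 - 2<x,y>."""
    sq = (X * X).sum(axis=1)
    D2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.sqrt(np.maximum(D2, 0))

D_true, D_sketch = pairwise_dists(V), pairwise_dists(W)
iu = np.triu_indices(n, k=1)                  # indices of the (n choose 2) pairs
rel_err = np.abs(D_sketch[iu] - D_true[iu]) / D_true[iu]
print(f"max relative distortion over all pairs: {rel_err.max():.3f}")  # typically < eps
```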
In short, we can get a good approximation much more quickly than by brute force!

Subspace Embedding

Let $U$ be a $d$-dimensional space. We're interested in a $k$-dimensional subspace of this space: given $v_1, v_2, \dots, v_k \in U$, let $S = \mathrm{span}(v_1, v_2, \dots, v_k)$.

Theorem: If we project each $v \in U$ to $Av$, then when $m = O((k\log(1/\varepsilon) + \log(1/\delta))/\varepsilon^2)$, we have, for all $v \in S$ simultaneously,

$$(1-\varepsilon)\|v\| \le \|Av\| \le (1+\varepsilon)\|v\|$$

with probability $1 - \delta$.

Proof. Ideally, we would apply the union bound over all vectors in the subspace. However, there are infinitely many vectors, so we can't do that directly. The main idea is to instead union-bound over a finite set of representative vectors, chosen so that every vector in $S$ is close to one of them; such a set is called an $\varepsilon$-net.
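The theorem is easy to probe empirically. Below is a small numpy experiment; the dimensions, the constant 8 in $m$, and the number of test points are illustrative assumptions. It draws a random $k$-dimensional subspace, applies the same Gaussian sketch as before, and checks the norm ratio on many random points of $S$. Sampling finitely many points is of course weaker than the for-all guarantee; the $\varepsilon$-net argument is what closes that gap.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, eps, delta = 2_000, 5, 0.3, 0.01
# m = O((k log(1/eps) + log(1/delta)) / eps^2); the constant 8 is a guess
m = int(np.ceil(8 * (k * np.log(1 / eps) + np.log(1 / delta)) / eps**2))

B = np.linalg.qr(rng.standard_normal((d, k)))[0]  # orthonormal basis of a random S
A = rng.standard_normal((m, d)) / np.sqrt(m)      # the same Gaussian sketch as before

coeffs = rng.standard_normal((1_000, k))          # 1,000 random points of S
P = coeffs @ B.T                                  # each row lies in S
ratios = np.linalg.norm(P @ A.T, axis=1) / np.linalg.norm(P, axis=1)
print(f"||Av||/||v|| over sampled S: min={ratios.min():.3f}, max={ratios.max():.3f}")
```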