2. k-Means Clustering (12 points)
Consider $n$ data points $x_1, \dots, x_n \in \mathbb{R}^d$, and let $X \in \mathbb{R}^{n \times d}$ be the matrix with these data points as its rows. Consider any clustering $C = \{C_1, C_2, \dots, C_k\}$ of these data points. That is, each $C_j$ is a set (a cluster) of points,
and each data point $x_i$ is assigned to exactly one of these sets. Let $\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x$ be the centroid (i.e., the mean) of cluster $C_j$. The k-means clustering objective is to minimize
$\text{cost}(C, X) = \sum_{j=1}^k \sum_{x \in C_j} ||x - \mu_j||_2^2$.
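As a concrete reading of this definition, the following NumPy sketch evaluates the cost of a given clustering. The function name `kmeans_cost`, the integer `labels` array recording which cluster each row of $X$ belongs to, and the toy data are illustrative assumptions, not part of the problem statement.

```python
import numpy as np

def kmeans_cost(X, labels):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for j in np.unique(labels):
        members = X[labels == j]       # the points in cluster C_j
        mu = members.mean(axis=0)      # centroid mu_j
        total += np.sum((members - mu) ** 2)
    return total

# Toy data (made up for illustration): four points in R^2, two clusters.
X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [10.0, 4.0]])
labels = np.array([0, 0, 1, 1])

cost = kmeans_cost(X, labels)
print(cost)  # 10.0: the first cluster contributes 1 + 1, the second 4 + 4
```

Only non-empty clusters are visited here, since the centroid of an empty cluster is undefined.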
1. (2 points) Prove that $\text{cost}(C, X) = \sum_{j=1}^k \frac{1}{2|C_j|} \sum_{x \in C_j} \sum_{y \in C_j} ||x - y||_2^2$.
Hint: Prove that both expressions for the cost can be written as $\sum_{j=1}^k \left[ \left( \sum_{x \in C_j} ||x||_2^2 \right) - |C_j| \cdot ||\mu_j||_2^2 \right]$.
This will require some vector algebra. It will be helpful to use that for any vector $z$, $||z||_2^2 = \sum_{i=1}^d (z^{(i)})^2 = \langle z, z \rangle$, as well as the linearity of the inner product.
2. (2 points) Suppose that $\Pi \in \mathbb{R}^{m \times d}$ is a random projection matrix with each entry drawn independently from $N(0, 1/m)$. For each $x_i$ in the dataset, let $\tilde{x}_i = \Pi x_i$, and let $\tilde{X} \in \mathbb{R}^{n \times m}$ contain the compressed data points as its rows. By part (1), for $m = O(\frac{d}{\epsilon^2})$, with high probability, we have, for all clusterings $C$,
$(1 - \epsilon) \cdot \text{cost}(C, X) \le \text{cost}(C, \tilde{X}) \le (1 + \epsilon) \cdot \text{cost}(C, X)$.
Assuming that the above bound holds, prove that if we compute the optimal clustering on the compressed data, $\tilde{C} = \arg \min_C \text{cost}(C, \tilde{X})$, then it is near-optimal for the original data, i.e., that: $\text{cost}(\tilde{C}, X) \le \frac{1 + \epsilon}{1 - \epsilon} \min_C \text{cost}(C, X)$.
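Before attempting the proofs in parts (1) and (2), it may help to see the two statements numerically. The sketch below checks the pairwise-distance identity on random data and then compares the cost of one fixed clustering before and after a random projection. The sizes `n`, `d`, `k`, `m`, the random data, and the helper names are arbitrary assumptions made for illustration. A check on a single clustering is only a sanity test: it proves nothing and says nothing about the "for all clusterings" guarantee.

```python
import numpy as np

def kmeans_cost(X, labels):
    """cost(C, X): squared distances to cluster centroids."""
    total = 0.0
    for j in np.unique(labels):
        members = X[labels == j]
        total += np.sum((members - members.mean(axis=0)) ** 2)
    return total

def pairwise_cost(X, labels):
    """Right-hand side of part (1): sum_j 1/(2|C_j|) sum_{x,y in C_j} ||x - y||^2."""
    total = 0.0
    for j in np.unique(labels):
        members = X[labels == j]
        # diffs[a, b] = members[a] - members[b], over all ordered pairs
        diffs = members[:, None, :] - members[None, :, :]
        total += np.sum(diffs ** 2) / (2 * len(members))
    return total

rng = np.random.default_rng(0)
n, d, k, m = 200, 500, 5, 100            # arbitrary illustrative sizes
X = rng.normal(size=(n, d))
labels = rng.integers(0, k, size=n)      # an arbitrary (random) clustering

# Part (1): both expressions should agree up to floating-point error.
identity_holds = np.isclose(kmeans_cost(X, labels), pairwise_cost(X, labels))

# Part (2) setup: Pi has i.i.d. N(0, 1/m) entries; row i of X_tilde is Pi @ x_i.
Pi = rng.normal(scale=np.sqrt(1.0 / m), size=(m, d))
X_tilde = X @ Pi.T

# Ratio of compressed to original cost for this one clustering.
ratio = kmeans_cost(X_tilde, labels) / kmeans_cost(X, labels)
print(identity_holds, ratio)
```

If the bound in part (2) holds with a small $\epsilon$, the printed ratio should be close to 1.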
The next few questions focus on the connection between k-means clustering and low-rank matrix approximation/PCA.
3. (2 points) Let $X_C$ be the $n \times d$ matrix whose $i^{th}$ row is equal to $\mu_j$ if $x_i$ is assigned to cluster $C_j$ in $C$. Verify that the k-means cost function can be written as $\text{cost}(C, X) = ||X - X_C||_F^2$.
4. (2 points) Use (3) to prove that for any clustering $C$, $\text{cost}(C, X) \ge \min_{\text{rank}(B) \le k} ||X - B||_F^2$.
Hint: What is the rank of $X_C$?
5. (2 points) Show that we can write $X_C = VV^T X$ where $V \in \mathbb{R}^{n \times k}$ has orthonormal columns.
6. (2 points) Explain in a few sentences what parts (3)-(4) mean. How is k-means clustering similar to PCA? How is it different?
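The statements in parts (3) and (4) can also be sanity-checked numerically before proving them. The sketch below builds $X_C$ for an arbitrary clustering, compares the k-means cost with $||X - X_C||_F^2$, and compares both against the best rank-$k$ approximation error, computed from the singular values of $X$ via the standard Eckart-Young result. The data, sizes, and variable names are assumptions made for illustration, and the check is not a substitute for the proofs.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 60, 8, 3                      # arbitrary illustrative sizes
X = rng.normal(size=(n, d))
labels = np.arange(n) % k               # arbitrary clustering, no empty cluster

# Build X_C: row i is the centroid of the cluster that x_i is assigned to.
centroids = np.stack([X[labels == j].mean(axis=0) for j in range(k)])
X_C = centroids[labels]

# Part (3): the k-means cost versus the squared Frobenius norm of X - X_C.
cost = sum(np.sum((X[labels == j] - centroids[j]) ** 2) for j in range(k))
frob = np.linalg.norm(X - X_C, "fro") ** 2

# Part (4): the minimum of ||X - B||_F^2 over rank(B) <= k equals the sum of
# the squared singular values of X beyond the k-th (Eckart-Young).
singular_values = np.linalg.svd(X, compute_uv=False)
best_rank_k_error = np.sum(singular_values[k:] ** 2)

print(cost, frob, best_rank_k_error)
```

The first two printed values should agree, and the third should be no larger than either.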