Category Theory
Zulip Server
Archive



Stream: learning: questions

Topic: Linear algebra behind PCA


David Tanzer (Mar 03 2024 at 09:06):

Hi, I'm trying to understand principal component analysis in terms of general ideas from linear algebra and probability theory; I see these ideas on the web, but am trying to spell it out for myself in a compact and systematic way. Here's as far as I got, with a couple of missing links to the linear algebra part.

Working in the setting of some probability space, suppose we have real-valued random variables v_1, \ldots, v_k. We can form linear combinations of these to derive new variables. Let V be the span of these variables, a k-dimensional vector space of random variables. We can add in the ring structure given by multiplication of random variables, to make it an algebra.

Next, we have covariance, which is a symmetric bilinear form taking V \times V to \mathbb{R}. It is positive, but not positive definite. And variance is the associated quadratic form. (This makes V into something like a finite-dimensional Hilbert space, except that the form may be degenerate.)
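For concreteness, here is a small numpy sketch (illustrative only, with made-up sample data) of covariance as a symmetric bilinear form on V: if C is the covariance matrix of v_1, \ldots, v_k, then for linear combinations a \cdot v and b \cdot v we have \mathrm{Cov}(a \cdot v, b \cdot v) = a^\top C b.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n_samples = 3, 10_000
# Samples of k correlated real-valued random variables v_1, ..., v_k (columns).
samples = rng.normal(size=(n_samples, k)) @ rng.normal(size=(k, k))

C = np.cov(samples, rowvar=False)              # k x k covariance matrix (symmetric)
a, b = rng.normal(size=k), rng.normal(size=k)  # coefficients of two linear combinations

lhs = np.cov(samples @ a, samples @ b)[0, 1]   # Cov(a.v, b.v) estimated from the samples
rhs = a @ C @ b                                # the bilinear form evaluated on coefficients
print(np.isclose(lhs, rhs))                    # True (up to floating point)
```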

In the standard construction, the principal directions are defined iteratively. First find the direction u_1 in V which maximizes the variance (quadratic form) when restricted to the unit sphere. Then repeat to find the direction u_2 in u_1^{\perp} that maximizes the variance, still constrained to the unit sphere. Then repeat to find u_3 in \mathrm{span}(u_1, u_2)^{\perp}, etc.

The result is an orthonormal basis u_1, u_2, \ldots, u_k, called the principal directions or components, along with an associated list of variances z_1 \ge z_2 \ge \ldots \ge z_k.

That's all well and good, but it's not a computable procedure.

Fortunately, because the matrix of the symmetric form is symmetric, we can rely on some form or elaboration of the spectral theorem to find the principal components and their associated variances. So, we can diagonalize the matrix to get a basis of eigenvectors - which will be principal directions - and take the associated eigenvalues to give us the variances in those directions.
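For concreteness, a minimal numpy sketch of this computable procedure, assuming the samples sit in the rows of a matrix X (the names pca, X, etc. are just illustrative):

```python
import numpy as np

def pca(X):
    """Rows of X are samples; columns are the random variables v_1, ..., v_k.
    Returns (variances z_1 >= ... >= z_k, principal directions as columns)."""
    C = np.cov(X, rowvar=False)              # symmetric k x k covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)     # eigendecomposition for symmetric matrices
    order = np.argsort(eigvals)[::-1]        # sort so that z_1 >= z_2 >= ... >= z_k
    return eigvals[order], eigvecs[:, order]

# Example on synthetic correlated data:
rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 3)) @ rng.normal(size=(3, 3))
variances, directions = pca(X)
print(variances)                  # variances along the principal directions
print(directions.T @ directions)  # ~ identity: the principal directions are orthonormal
```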

My question is: what form or elaboration of the spectral theorem is needed to justify the claim that this computable procedure gives the same results as the iterative construction defined above?

It's good that we know the eigenvalues are real. And I see the standard claim that the eigenvector whose eigenvalue has the largest magnitude is the direction that maximizes the quadratic form, and the one with the smallest magnitude minimizes it. That's good too.

But what about all the principal directions in-between, and their associated variances? Here we don't have the obvious criteria of minimum or maximum to identify them, but there must be some special property that characterizes these intermediate directions as being "principal". The iterative construction does define them, by progressively taking the maximum over more and more restricted subspaces - but it seems these directions should be understandable in a non-procedural manner, by a more detailed elaboration of the spectral theorem.

It looks like, geometrically, every principal direction u_i has the special property that, when restricted to the unit sphere in a small neighborhood of u_i, the quadratic form has a critical point at u_i - the directional derivative is zero in every direction from u_i along the surface of the unit sphere. The intermediate directions are saddle points of the quadratic form on the unit sphere; this looks like a general property of ellipsoids and their axes.
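This can be made precise with the standard Lagrange multiplier computation (spelling it out here): writing the quadratic form as Q(v) = v^\top A v with A the symmetric matrix of the form, the critical points of Q on the unit sphere \{ v : v^\top v = 1 \} are the points where, for some multiplier \lambda,

\nabla Q(v) = \lambda \nabla (v^\top v), \quad \text{i.e.} \quad 2 A v = 2 \lambda v, \quad \text{i.e.} \quad A v = \lambda v,

and at such a point Q(v) = v^\top A v = \lambda. So the critical points are exactly the unit eigenvectors, and the critical values are the eigenvalues: the top and bottom eigenvalues give the maximum and minimum, and the intermediate eigenvectors are saddle points.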

Things could also get sticky, in that in the iterative procedure there may not be a unique direction that maximizes the quadratic form; and similarly, we know there can be degrees of freedom in choosing the eigenbasis, when there are eigenspaces of dimension greater than one.

As you can see, this is a kind of searching explanation that would need to be tightened up and completed. Can anyone help with a tighter statement of results from linear algebra that could tie this all together? Thanks!

P.S. Maybe it would go along these lines? (1) The eigenvectors of the matrix of the form are critical points (as described above). (2) Every eigenbasis that can be obtained through diagonalization can also be achieved through the iterative procedure.

Simon Burton (Mar 03 2024 at 15:02):

The way you are describing this makes me think of (algebraic) Morse theory. You have a scalar function on a high-dimensional sphere, and the critical points correspond to the u_i. So there should be a decomposition of this sphere along critical lines connecting these critical points, etc. etc. I have no idea if this is helpful or not for your purposes... but it sounds fun and/or profitable.

David Tanzer (Mar 03 2024 at 21:46):

Thanks, that sounds like an interesting subject, I'll check it out.

David Tanzer (Mar 04 2024 at 02:42):

Ok, new approach, simpler.

First diagonalize the matrix to get eigenvectors e_1, \ldots, e_n, with eigenvalues \lambda_1, \ldots, \lambda_n. Choose the order so that \lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_n \ge 0.

Then run the optimization-based algorithm. At each stage i, claim that e_i is a valid choice for the optimal vector, and so e_i can be chosen as the i-th principal component, with associated principal value \lambda_i.

Proof by induction. We know that e_1 is a valid choice for the optimal vector, with associated value \lambda_1. Then take V' = e_1^{\perp}, for which e_2, \ldots, e_n is an orthonormal basis, with associated values \lambda_2, \ldots, \lambda_n. And V' is the next space used by the optimization-based algorithm.
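The base case can be spelled out in one line: for a unit vector v = \sum_i c_i e_i (so \sum_i c_i^2 = 1),

Q(v, v) = \sum_i \lambda_i c_i^2 \le \lambda_1 \sum_i c_i^2 = \lambda_1,

with equality at v = e_1 (or any unit vector in the \lambda_1-eigenspace), so e_1 attains the maximum. The same computation inside V' = e_1^{\perp} handles the inductive step.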

Peva Blanchard (Mar 04 2024 at 07:18):

I think the argument is similar to the proof of the min-max theorem
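For reference, the min-max (Courant-Fischer) theorem, in the convention matching the decreasing order \lambda_1 \ge \ldots \ge \lambda_n used above, says

\lambda_i = \max_{\dim S = i} \; \min_{v \in S, \|v\| = 1} v^\top A v = \min_{\dim S = n - i + 1} \; \max_{v \in S, \|v\| = 1} v^\top A v,

where S ranges over linear subspaces of \mathbb{R}^n. This characterizes each intermediate eigenvalue variationally, without referring to the previously chosen directions.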

John Baez (Mar 04 2024 at 23:11):

Here are a couple of facts you're touching on but not saying out loud, David:

There's a 1-1 correspondence between self-adjoint n \times n real matrices A and symmetric bilinear forms Q : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R} given by

Q(v,w) = v \cdot Aw

Then the largest eigenvalue of A equals the maximum value of Q(v,v) where v lies on the unit sphere.
This maximum value, say \lambda_{\max}, may be attained by many vectors v, but these vectors will form the intersection of the unit sphere and some linear subspace L \subseteq \mathbb{R}^n. This subspace L is an eigenspace of A, namely the eigenspace consisting of all vectors v with

A v = \lambda_{\max} v
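A rough numerical check of this claim, using a random symmetric matrix (illustrative code, not part of the original argument):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
B = rng.normal(size=(n, n))
A = (B + B.T) / 2                               # a random self-adjoint (symmetric) matrix

eigvals, eigvecs = np.linalg.eigh(A)            # eigenvalues in ascending order
lam_max, u_top = eigvals[-1], eigvecs[:, -1]

print(np.isclose(u_top @ A @ u_top, lam_max))   # True: Q(v,v) attains lam_max at u_top

v = rng.normal(size=(100_000, n))
v /= np.linalg.norm(v, axis=1, keepdims=True)   # random points on the unit sphere
Q = np.einsum('ij,jk,ik->i', v, A, v)           # Q(v, v) for each sampled unit vector
print(bool(Q.max() <= lam_max + 1e-12))         # True: no sample exceeds lam_max
```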

John Baez (Mar 04 2024 at 23:13):

Knowing this, we can then restrict A to the subspace L^\perp consisting of all vectors orthogonal to L.
A will map this subspace to itself, and it restricts to a self-adjoint operator on this subspace. So we can repeat the process and get the second largest eigenvalue of A, and so on.
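A numerical sketch of this restrict-and-repeat process, for a covariance-style (positive semidefinite) matrix, with the top remaining direction found by power iteration at each stage (illustrative code; the check at the end compares against np.linalg.eigh):

```python
import numpy as np

rng = np.random.default_rng(3)
k = 4
B = rng.normal(size=(10, k))
C = B.T @ B                                   # a positive semidefinite symmetric matrix

def top_direction(M, iters=5000):
    """Power iteration: a unit eigenvector of M for its largest eigenvalue
    (assumes M is positive semidefinite with a nonzero top eigenvalue)."""
    v = rng.normal(size=M.shape[0])
    for _ in range(iters):
        v = M @ v
        v /= np.linalg.norm(v)
    return v

directions, values = [], []
P = np.eye(k)                                 # projector onto the remaining subspace L-perp
for _ in range(k):
    u = top_direction(P @ C @ P)              # maximize the form on the remaining subspace
    directions.append(u)
    values.append(u @ C @ u)                  # the corresponding eigenvalue / variance
    P = P - np.outer(u, u)                    # restrict to the orthogonal complement

print(np.round(values, 6))
print(np.round(np.linalg.eigh(C)[0][::-1], 6))  # should agree, in decreasing order
```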

John Baez (Mar 04 2024 at 23:14):

In fact these ideas are used in one common proof that a self-adjoint matrix has an orthonormal basis of eigenvectors!

David Tanzer (Mar 05 2024 at 08:54):

Great stuff, thanks John for the clear explanations, and putting it in perspective! The min-max theorem is very interesting, looking through it now.