Category Theory
Zulip Server
Archive



Stream: learning: questions

Topic: Linear algebra behind PCA


David Tanzer (Mar 03 2024 at 09:06):

Hi, I'm trying to understand principal component analysis in terms of general ideas from linear algebra and probability theory; I see these ideas on the web, but am trying to spell it out for myself in a compact and systematic way. Here's as far as I got, with a couple of missing links to the linear algebra part.

Working in the setting of some probability space, suppose we have real-valued random variables v_1, \ldots, v_k. We can form linear combinations of these to derive new variables. Let V be the span of these variables, a k-dimensional vector space of random variables. We can add in the ring structure given by multiplication of random variables, to make it an algebra.

Next, we have covariance, which is a symmetric bilinear form taking V \times V to \mathbb{R}. It is positive, but not positive definite. And variance is the associated quadratic form. (This makes V into something like a finite-dimensional Hilbert space, except that the form may be degenerate.)
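For concreteness, here is a small numpy sketch (illustrative only, with made-up sample data) of covariance as a symmetric bilinear form on V: if C is the covariance matrix of v_1, \ldots, v_k, then for linear combinations a \cdot v and b \cdot v we have \mathrm{Cov}(a \cdot v, b \cdot v) = a^\top C b.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n_samples = 3, 10_000
# Samples of k correlated real-valued random variables v_1, ..., v_k (columns).
samples = rng.normal(size=(n_samples, k)) @ rng.normal(size=(k, k))

C = np.cov(samples, rowvar=False)              # k x k covariance matrix (symmetric)
a, b = rng.normal(size=k), rng.normal(size=k)  # coefficients of two linear combinations

lhs = np.cov(samples @ a, samples @ b)[0, 1]   # Cov(a.v, b.v) estimated from the samples
rhs = a @ C @ b                                # the bilinear form evaluated on coefficients
print(np.isclose(lhs, rhs))                    # True (up to floating point)
```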

In the standard construction, the principal directions are defined iteratively. First find the direction u_1 in V which maximizes the variance (quadratic form) when restricted to the unit sphere. Then repeat to find the direction u_2 in u_1^{\perp} that maximizes the variance, still constrained to the unit sphere. Then repeat to find u_3 in \mathrm{span}(u_1, u_2)^{\perp}, etc.

The result is an orthonormal basis u_1, u_2, \ldots, u_k, called the principal directions or components, along with an associated list of variances z_1 \ge z_2 \ge \ldots \ge z_k.

That's all well and good, but it's not a computable procedure.

Fortunately, because the matrix of the symmetric form is symmetric, we can rely on some form or elaboration of the spectral theorem to find the principal components and their associated variances. So, we can diagonalize the matrix to get a basis of eigenvectors - which will be principal directions - and take the associated eigenvalues to give us the variances in those directions.
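For concreteness, a minimal numpy sketch of this computable procedure, assuming the samples sit in the rows of a matrix X (the names pca, X, etc. are just illustrative):

```python
import numpy as np

def pca(X):
    """Rows of X are samples; columns are the random variables v_1, ..., v_k.
    Returns (variances z_1 >= ... >= z_k, principal directions as columns)."""
    C = np.cov(X, rowvar=False)              # symmetric k x k covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)     # eigendecomposition for symmetric matrices
    order = np.argsort(eigvals)[::-1]        # sort so that z_1 >= z_2 >= ... >= z_k
    return eigvals[order], eigvecs[:, order]

# Example on synthetic correlated data:
rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 3)) @ rng.normal(size=(3, 3))
variances, directions = pca(X)
print(variances)                  # variances along the principal directions
print(directions.T @ directions)  # ~ identity: the principal directions are orthonormal
```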

My question is: what form or elaboration of the spectral theorem is needed to justify the claim that this computable procedure gives the same results as the iterative construction defined above?

It's good that we know the eigenvalues are real. And I see the standard claim that the eigenvector whose eigenvalue has the largest magnitude is the direction that maximizes the quadratic form, and the one with the smallest magnitude minimizes it. That's good too.

But what about all the principal directions in-between, and their associated variances? Here we don't have the obvious criteria of minimum or maximum to identify them, but there must be some special property that characterizes these intermediate directions as being "principal". The iterative construction does define them, by progressively taking the maximum over more and more restricted subspaces - but it seems these directions should be understandable in a non-procedural manner, by a more detailed elaboration of the spectral theorem.

It looks like, geometrically, every principal direction u_i has the special property that, when restricted to the unit sphere in a small neighborhood of u_i, the quadratic form has a critical point at u_i - the directional derivative is zero in every direction from u_i along the surface of the unit sphere. The intermediate directions are saddle points of the quadratic form on the unit sphere; this looks like a general property of ellipsoids and their axes.
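This can be made precise with the standard Lagrange multiplier computation (spelling it out here): writing the quadratic form as Q(v) = v^\top A v with A the symmetric matrix of the form, the critical points of Q on the unit sphere \{ v : v^\top v = 1 \} are the points where, for some multiplier \lambda,

\nabla Q(v) = \lambda \nabla (v^\top v), \quad \text{i.e.} \quad 2 A v = 2 \lambda v, \quad \text{i.e.} \quad A v = \lambda v,

and at such a point Q(v) = v^\top A v = \lambda. So the critical points are exactly the unit eigenvectors, and the critical values are the eigenvalues: the top and bottom eigenvalues give the maximum and minimum, and the intermediate eigenvectors are saddle points.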

Things could also get sticky, in that in the iterative procedure there may not be a unique direction that maximizes the quadratic form; and similarly, we know there can be degrees of freedom in choosing the eigenbasis, when there are eigenspaces of dimension greater than one.

As you can see, this is a kind of searching explanation that would need to be tightened up and completed. Can anyone help with a tighter statement of results from linear algebra that could tie this all together? Thanks!

P.S. Maybe it would go along these lines? (1) The eigenvectors of the matrix of the form are critical points (as described above). (2) Every eigenbasis that can be obtained through diagonalization can also be achieved through the iterative procedure.

Simon Burton (Mar 03 2024 at 15:02):

The way you are describing this makes me think of (algebraic) Morse theory. You have a scalar function on a high-dimensional sphere, and the critical points correspond to the u_i. So there should be a decomposition of this sphere along critical lines connecting these critical points, etc. etc. I have no idea if this is helpful or not for your purposes... but it sounds fun and/or profitable.

David Tanzer (Mar 03 2024 at 21:46):

Thanks, that sounds like an interesting subject, I'll check it out.

David Tanzer (Mar 04 2024 at 02:42):

Ok, new approach, simpler.

First diagonalize the matrix to get eigenvectors e_1, \ldots, e_n, with eigenvalues \lambda_1, \ldots, \lambda_n. Choose the order so that \lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_n \ge 0.

Then run the optimization-based algorithm. At each stage i, claim that e_i is a valid choice for the optimal vector, and so e_i can be chosen as the i-th principal component, with associated principal value \lambda_i.

Proof by induction. We know that e_1 is a valid choice for the optimal vector, with associated value \lambda_1. Then take V' = e_1^{\perp}, for which e_2, \ldots, e_n is an orthonormal basis, with associated values \lambda_2, \ldots, \lambda_n. And V' is the next space used by the optimization-based algorithm.
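The base case can be spelled out in one line: for a unit vector v = \sum_i c_i e_i (so \sum_i c_i^2 = 1),

Q(v, v) = \sum_i \lambda_i c_i^2 \le \lambda_1 \sum_i c_i^2 = \lambda_1,

with equality at v = e_1 (or any unit vector in the \lambda_1-eigenspace), so e_1 attains the maximum. The same computation inside V' = e_1^{\perp} handles the inductive step.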

Peva Blanchard (Mar 04 2024 at 07:18):

I think the argument is similar to the proof of the min-max theorem
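For reference, the min-max (Courant-Fischer) theorem, in the convention matching the decreasing order \lambda_1 \ge \ldots \ge \lambda_n used above, says

\lambda_i = \max_{\dim S = i} \; \min_{v \in S, \|v\| = 1} v^\top A v = \min_{\dim S = n - i + 1} \; \max_{v \in S, \|v\| = 1} v^\top A v,

where S ranges over linear subspaces of \mathbb{R}^n. This characterizes each intermediate eigenvalue variationally, without referring to the previously chosen directions.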

John Baez (Mar 04 2024 at 23:11):

Here are a couple of facts you're touching on but not saying out loud, David:

There's a 1-1 correspondence between self-adjoint n \times n real matrices A and symmetric bilinear forms Q : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R} given by

Q(v,w) = v \cdot Aw

Then the largest eigenvalue of A equals the maximum value of Q(v,v) where v lies on the unit sphere.
This maximum value, say \lambda_{\max}, may be attained by many vectors v, but these vectors will form the intersection of the unit sphere and some linear subspace L \subseteq \mathbb{R}^n. This subspace L is an eigenspace of A, namely the eigenspace consisting of all vectors v with

A v = \lambda_{\max} v
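A rough numerical check of this claim, using a random symmetric matrix (illustrative code, not part of the original argument):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
B = rng.normal(size=(n, n))
A = (B + B.T) / 2                               # a random self-adjoint (symmetric) matrix

eigvals, eigvecs = np.linalg.eigh(A)            # eigenvalues in ascending order
lam_max, u_top = eigvals[-1], eigvecs[:, -1]

print(np.isclose(u_top @ A @ u_top, lam_max))   # True: Q(v,v) attains lam_max at u_top

v = rng.normal(size=(100_000, n))
v /= np.linalg.norm(v, axis=1, keepdims=True)   # random points on the unit sphere
Q = np.einsum('ij,jk,ik->i', v, A, v)           # Q(v, v) for each sampled unit vector
print(bool(Q.max() <= lam_max + 1e-12))         # True: no sample exceeds lam_max
```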

John Baez (Mar 04 2024 at 23:13):

Knowing this, we can then restrict A to the subspace L^\perp consisting of all vectors orthogonal to L.
A will map this subspace to itself, and it restricts to a self-adjoint operator on this subspace. So we can repeat the process and get the second largest eigenvalue of A, and so on.
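A numerical sketch of this restrict-and-repeat process, for a covariance-style (positive semidefinite) matrix, with the top remaining direction found by power iteration at each stage (illustrative code; the check at the end compares against np.linalg.eigh):

```python
import numpy as np

rng = np.random.default_rng(3)
k = 4
B = rng.normal(size=(10, k))
C = B.T @ B                                   # a positive semidefinite symmetric matrix

def top_direction(M, iters=5000):
    """Power iteration: a unit eigenvector of M for its largest eigenvalue
    (assumes M is positive semidefinite with a nonzero top eigenvalue)."""
    v = rng.normal(size=M.shape[0])
    for _ in range(iters):
        v = M @ v
        v /= np.linalg.norm(v)
    return v

directions, values = [], []
P = np.eye(k)                                 # projector onto the remaining subspace L-perp
for _ in range(k):
    u = top_direction(P @ C @ P)              # maximize the form on the remaining subspace
    directions.append(u)
    values.append(u @ C @ u)                  # the corresponding eigenvalue / variance
    P = P - np.outer(u, u)                    # restrict to the orthogonal complement

print(np.round(values, 6))
print(np.round(np.linalg.eigh(C)[0][::-1], 6))  # should agree, in decreasing order
```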

John Baez (Mar 04 2024 at 23:14):

In fact these ideas are used in one common proof that a self-adjoint matrix has an orthonormal basis of eigenvectors!

David Tanzer (Mar 05 2024 at 08:54):

Great stuff, thanks John for the clear explanations, and putting it in perspective! The min-max theorem is very interesting, looking through it now.