From this MMDS lecture, and this parallel presentation.
Manifold learning addresses the problem of high-dimensional data. Dimensionality reduction improves computational analysis, and it is beneficial as long as it preserves most of the important information.
For a dataset of \(n = 698\) gray \(64 \times 64\) images of faces, where the faces vary only in head pose (moving up and down or left to right), a 2D manifold is defined. In this case the different orientations are shown on an Isomap embedding:
with each ellipse showing one of the \(698\) faces.
The idea is that PCA makes the implicit assumption that the manifold is a linear space and projects onto it. However, this is often not the case in high-dimensional datasets, which motivates the topic, i.e. non-linear dimensionality reduction or manifold learning.
Locally linear embedding (LLE) expresses every point as a linear combination of its neighbors (a weighted average).
Diffusion maps and Laplacian eigenmaps work by defining a random walk on a set of points and taking an eigendecomposition of the random walk.
Local tangent space alignment (LTSA) does local PCA and patches the local pieces together.
Isomap preserves geodesics, i.e. distances along the manifold.
The question that arises, then, is which of these embeddings is “correct.”
One type of diffusion map is the Laplacian eigenmap:
\[S = \left [ S_{pp'}\right ]_{p,p'\in \mathcal D}\]
with \(S_{pp'}= \exp\left(-\frac{1}{2} \Vert p - p' \Vert^2 \right)\) iff \(p,p'\) are neighbors.
\[L = I - T^{-1}S\]
with \(T=\text{diag}(S1).\)
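A minimal sketch of this construction in R; the toy data (a noisy one-dimensional curve), the k-nearest-neighbour rule and the kernel width are illustrative choices, not part of the lecture:
set.seed(1)
u  <- sort(runif(200, 0, 3*pi))
X  <- cbind(u, sin(u)) + matrix(rnorm(400, sd = 0.05), ncol = 2)  # noisy 1-D curve in 2-D
D2 <- as.matrix(dist(X))^2                                        # squared Euclidean distances
S  <- exp(-D2/2)                                                  # S_pp' = exp(-||p - p'||^2 / 2)
k  <- 10
for (i in 1:nrow(S)) S[i, rank(D2[i, ]) > k + 1] <- 0             # keep only the k nearest neighbours
S  <- pmax(S, t(S))                                               # symmetrise the neighbourhood relation
Tm <- diag(rowSums(S))                                            # T = diag(S 1)
L  <- diag(nrow(S)) - solve(Tm) %*% S                             # L = I - T^{-1} S
eL <- eigen(L)                                                    # eigenvalues come back in decreasing order
emb <- Re(eL$vectors[, ncol(L) - c(1, 2)])  # eigenvectors of the two smallest non-zero eigenvalues (spectrum is real)
plot(emb)                                   # the curve unfolds along the first embedding coordinate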
require(igraph)
g <- graph(edges=c('Alice','Sam','Sam','David','David','Alice','Frank','David','Frank','Paul','Paul','Tom','Tom','Frank'), directed = FALSE)
igraph_options(plot.layout=layout_nicely, vertex.size=20, vertex.frame.color=F, edge.width=5, edge.color='turquoise', vertex.label.dist=4, vertex.color='red4')
plot(g)
laplacian_matrix(g)
## 6 x 6 sparse Matrix of class "dgCMatrix"
## Alice Sam David Frank Paul Tom
## Alice 2 -1 -1 . . .
## Sam -1 2 -1 . . .
## David -1 -1 3 -1 . .
## Frank . . -1 3 -1 -1
## Paul . . . -1 2 -1
## Tom . . . -1 -1 2
Reading out the output:
From the linear algebra explanation by Professor Strang here, and this post.
The degree matrix: a square matrix with nodes in rows and columns. It forms the diagonal of the graph Laplacian and records the degree of each node: Alice has \(2\) edges; Sam has \(2\) edges; David and Frank have \(3\); and Paul and Tom have \(2.\) That is to say:
\[D=\begin{bmatrix} &\text{Alice}&\text{Sam}&\text{David}&\text{Frank}&\text{Paul}&\text{Tom}\\ \text{Alice}&2&0&0&0&0&0\\ \text{Sam} &0&2&0&0&0&0\\ \text{David}&0&0&3&0&0&0\\ \text{Frank}&0&0&0&3&0&0\\ \text{Paul}&0&0&0&0&2&0\\ \text{Tom}&0&0&0&0&0&2 \end{bmatrix} \]
The adjacency matrix: a square matrix of nodes. It gives the off-diagonal entries and tells us which node is connected to which; no node is connected to itself. In the case at hand, Alice is connected to Sam and David; Sam is connected to Alice and David; David is connected to Alice, Sam and Frank, etc. So the adjacency matrix is:
\[A = \begin{bmatrix} &\text{Alice}&\text{Sam}&\text{David}&\text{Frank}&\text{Paul}&\text{Tom}\\ \text{Alice}&0&1&1&0&0&0\\ \text{Sam} &1&0&1&0&0&0\\ \text{David}&1&1&0&1&0&0\\ \text{Frank}&0&0&1&0&1&1\\ \text{Paul}&0&0&0&1&0&1\\ \text{Tom}&0&0&0&1&1&0 \end{bmatrix} \]
Laplacian Matrix:
\[\cal L =D-A= \begin{bmatrix} &\text{Alice}&\text{Sam}&\text{David}&\text{Frank}&\text{Paul}&\text{Tom}\\ \text{Alice}&2&-1&-1&0&0&0\\ \text{Sam} &-1&2&-1&0&0&0\\ \text{David}&-1&-1&3&-1&0&0\\ \text{Frank}&0&0&-1&3&-1&-1\\ \text{Paul}&0&0&0&-1&2&-1\\ \text{Tom}&0&0&0&-1&-1&2 \end{bmatrix} \]
\[ L = \text{Degree matrix (diagonal)} - \text{Adjacency matrix}\] This is the matrix in the output above, which explains the negative signs.
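A quick sanity check of this identity with igraph's degree() and as_adjacency_matrix() on the graph g defined above:
D <- diag(degree(g))                    # degree matrix
A <- as.matrix(as_adjacency_matrix(g))  # adjacency matrix
D - A                                   # matches laplacian_matrix(g) above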
The Laplacian matrix is a positive semidefinite Gram matrix of the \(A^\top A\) type, resulting from multiplying the transpose of the incidence matrix by the incidence matrix itself:
The incidence matrix is rectangular, with edges in the rows and nodes in the columns:
\[I = \begin{bmatrix}
&\text{Alice}&\text{Sam}&\text{David}&\text{Frank}&\text{Paul}&\text{Tom}\\
\text{e1}&-1&0&1&0&0&0\\
\text{e2} &-1&1&0&0&0&0\\
\text{e3}&0&-1&1&0&0&0\\
\text{e4}&0&0&-1&1&0&0\\
\text{e5}&0&0&0&-1&1&0\\
\text{e6}&0&0&0&-1&0&1\\
\text{e7}&0&0&0&0&-1&1
\end{bmatrix}
\] For instance, the first row says that there is an edge leaving Alice and going to David. The second row denotes the edge from Alice to Sam, etc.
Running the \(A^\top A\)-type operation on the incidence matrix in R:
# Incidence matrix entered by rows, one row per edge; row e6 is the Frank-Tom edge.
I = matrix(c(-1,0,1,0,0,0,-1,1,0,0,0,0,0,-1,1,0,0,0,0,0,-1,1,0,0,0,0,0,-1,1,0,0,0,0,-1,0,1,0,0,0,0,-1,1),nrow=7,byrow = T)
# Now we recover the Laplacian:
(Laplacian = t(I) %*% I)
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 2 -1 -1 0 0 0
## [2,] -1 2 -1 0 0 0
## [3,] -1 -1 3 -1 0 0
## [4,] 0 0 -1 3 -1 -1
## [5,] 0 0 0 -1 2 -1
## [6,] 0 0 0 -1 -1 2
From which we can extract the eigenvectors and eigenvalues:
eig <- eigen(Laplacian)
The Laplacian matrix is not positive definite because it is singular, i.e. there is a vector in the null space: \(\cal L \vec x= 0.\) This is the vector of all ones (or any multiple of it). Why? Because the diagonal entry of each row collects all the \(+1\)’s produced by \(1 \times 1\) or \(-1\times -1\) in the dot product of a node’s column with itself, while the off-diagonal entries of the same row contain an equal number of \(-1\)’s, so every row sums to zero.
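This can be checked in one line with the Laplacian recovered above:
Laplacian %*% rep(1, 6)   # the zero vector: the all-ones vector lies in the null space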
The smallest eigenvalue will be zero, and its eigenvector a constant vector of ones (up to a multiple). Since eigen() returns eigenvalues in decreasing order, this is the last column:
round(eig$vec[,ncol(eig$vec)],2)
## [1] -0.41 -0.41 -0.41 -0.41 -0.41 -0.41
The last-but-one eigenvector (the one for the second-smallest eigenvalue) is the Fiedler vector, \(\begin{bmatrix}-0.46&-0.46&-0.26&0.26&0.46&0.46\end{bmatrix}^\top\):
round(eig$vec[,ncol(eig$vec)-1],2)
## [1] -0.46 -0.46 -0.26 0.26 0.46 0.46
which separates the negative nodes (Alice, Sam, David) from the positive nodes (Frank, Paul, Tom): two perfectly separated clusters!
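The same split can be read off by thresholding the Fiedler vector at zero, reusing the node order Alice, Sam, David, Frank, Paul, Tom:
nodes   <- c('Alice','Sam','David','Frank','Paul','Tom')
fiedler <- eig$vec[, ncol(eig$vec) - 1]
split(nodes, fiedler > 0)               # the two triangles land in separate groups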
An embedding of a manifold \(\mathcal M\) into \(\mathcal N\) is an injective map \(\phi: \mathcal M\to \mathcal N\) with \(\text{rank}(\phi)=\dim(\mathcal M),\) such that \(\mathcal M\) is homeomorphic to its image under \(\phi.\)
The rank of a smooth map is \(k\) if the Jacobian \(d\phi_p:T_p\mathcal M \to T_{\phi(p)}\mathcal N\) of the map has rank \(k\) for all points \(p\in \mathcal M.\)
The geometry is defined in terms of the Riemannian metric (RM), \(g,\) which is a symmetric, positive definite tensor field defining an inner product \(\left<\;,\;\right>_g\) on each tangent space \(T_p\mathcal M.\)
A Riemannian manifold \((\mathcal M, g)\) is a smooth manifold with an RM defined on \(T_p\mathcal M\) at every point \(p.\)
It allows defining angles between vectors in the tangent space:
\[\cos \theta=\frac{<u,v>_g}{\Vert u\Vert \Vert v \Vert }\]
The arc length of a curve \(c\) in \(\mathcal M\) is
\[l(c) = \int_a^b \sqrt{g_{ij}\frac{dx^i}{dt}\frac{dx^j}{dt}}dt\]
where \((x^1,\dots,x^d)\) are the coordinates of the chart \((U,\vec x)\) with \(c([a,b])\subset U.\)
Similarly the volume of \(W\subset U\) is given by
\[\text{vol}(W)= \int_W\sqrt{\det(g)}\; dx^1\dots dx^d\]
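For example, on the unit sphere \(S^2\) with spherical coordinates \((\theta,\varphi)\) the metric is \(g=\text{diag}(1,\sin^2\theta),\) so \(\sqrt{\det(g)}=\sin\theta\) and
\[\text{vol}(S^2)= \int_0^{2\pi}\!\!\int_0^{\pi}\sin\theta \; d\theta\, d\varphi = 4\pi.\]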
The smooth map \(f: \mathcal M \to \mathcal N\) between two Riemannian manifolds \((\mathcal M, g)\) and \((\mathcal N, h)\) is called an isometry iff for all \(p\in \mathcal M\) and all \(u,v\in T_p(\mathcal M),\) \(<u,v>_{g(p)}=\left< df_p u, df_p v \right>_{h(f(p))},\) where \(df_p\) denotes the Jacobian of \(f\) at \(p,\) i.e. the map \(df_p:T_p(\mathcal M)\to T_{f(p)}(\mathcal N).\)
For every \(u,v \in T_{f(p)}f(\mathcal M) \oplus T_{f(p)}f(\mathcal M)^\perp,\) the embedding pushforward metric \(h\) of an embedding \(f\) at point \(p \in \mathcal M\) is defined by the inner product \(\left<u,v\right>_{h(f(p))}\equiv\left < df_p^\dagger (u), df_p^\dagger (v) \right >_{g(p)},\) where \(df_p^\dagger: T_{f(p)}f(\mathcal M) \oplus T_{f(p)}f(\mathcal M)^\perp \to T_p\mathcal M\) is the pseudoinverse of the Jacobian \(df_p\) of \(f: \mathcal M \to \mathbb R^s.\)
It’s a way of carrying the inner product over to the embedding space so as to preserve the isometry.
The L-B operator \(\Delta_\mu\) acting on a twice differentiable function \(f: \mathcal M \to \mathbb R\) is defined as the divergence of the gradient:
\[\Delta_\mathcal M f \equiv \text{div grad}(f)\] In local coordinates, for chart \((U,x),\) the L-B operator is expressed by means of \(g:\)
\[\Delta_\mathcal M f = \frac{1}{\sqrt{\det(g)}}\frac{\partial}{\partial x^l}\left(\sqrt{\det(g)}\,g^{lk}\frac{\partial}{\partial x^k} f\right ).\]
The L-B operator contains all the geometry of the object, and at the same time, it is coordinate free.
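For instance, on \(S^2\) with the spherical metric \(g=\text{diag}(1,\sin^2\theta)\) we have \(g^{11}=1,\) \(g^{22}=1/\sin^2\theta\) and \(\sqrt{\det(g)}=\sin\theta,\) so the formula gives the familiar spherical Laplacian
\[\Delta_{S^2} f = \frac{1}{\sin\theta}\frac{\partial}{\partial\theta}\left(\sin\theta\,\frac{\partial f}{\partial\theta}\right) + \frac{1}{\sin^2\theta}\,\frac{\partial^2 f}{\partial\varphi^2}.\]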
Given a coordinate chart \((U,x)\) of a smooth Riemannian manifold \(M\) and \(\Delta_\mu\) defined on \(M,\) the inverse \(g(p)^{-1}\) of the RM at a point \(p\in U,\) expressed in local coordinates \(x,\) can be derived from
\[g^{ij}=\frac{1}{2}\Delta_\mu (x^i - x^i(p))\,(x^j - x^j (p))\rvert_{x^i=x^i(p),x^j=x^j(p)}\]
with \(i,j=1,\dots, d.\)
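As a sanity check, in flat Euclidean space, where \(\Delta = \sum_k \partial^2/\partial (x^k)^2,\) the formula recovers the identity metric:
\[\frac{1}{2}\Delta\left[(x^i-x^i(p))(x^j-x^j(p))\right] = \frac{1}{2}\sum_k \frac{\partial^2}{\partial (x^k)^2}\left[(x^i-x^i(p))(x^j-x^j(p))\right]=\delta^{ij},\]
so \(g^{ij}=\delta^{ij}\) and hence \(g(p)=I,\) as expected.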
For an embedding this takes the form
\[\tilde h^{ij}=\frac{1}{2}\Delta_M\; (f^i-f^i(p))\,(f^j-f^j(p))\rvert_{f^i=f^i(p),f^j=f^j(p)}\]
and so \(h(p)\) is given by the pseudoinverse of \(\tilde h(p).\)
The input to the embedding algorithm is a set of points \(\mathcal D =\{p_1,\dots,p_n\}\subset M.\) These points are assumed to be a sample from the distribution
\[p(x) = e^{-U(x)}\] on \(M,\) absolutely continuous with respect to the Lebesgue measure on \(M.\)
Embedding algorithms construct a map \(f_n: \mathcal D\to \mathbb R^s.\)
We go on to define the symmetric positive semidefinite embedding pushforward \(h_n\) on the set of points \(\mathcal D.\) We obtain it by defining a discrete estimator \(\tilde{ \mathcal L}_{n,\epsilon}\) that is consistent with \(\Delta_M;\) the idea is to obtain an approximation of the Laplacian.
Construct graph \(g_n\) to have adjacency matrix \(k_n =[k_\epsilon(p,p')]_{p,p'\in \mathcal D}\) given by \(k_\epsilon(p,p') = \exp(-\Vert p-p'\Vert^2/\epsilon),\) which is the Gaussian kernel.
If the sampling density is uniform over \(M,\) then one can immediately construct the random walk graph Laplacian:
\[\mathcal L_{n,\epsilon}\equiv \frac{I_n - T_n^{-1}k_n}{\epsilon}\]
where \(T_n\) is the diagonal matrix of outdegrees, i.e. \(T_n= \text{diag}\left\{t(p), p\in \mathcal D \right\}\) with \(t(p) =\sum_{p'\in \mathcal D}k_n(p,p').\)
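A compact sketch of this estimator in R; the data matrix X (points in rows) and the bandwidth eps are placeholders chosen here for illustration:
graph_laplacian <- function(X, eps) {
  K  <- exp(-as.matrix(dist(X))^2 / eps)    # Gaussian kernel k_eps(p, p')
  Tn <- diag(rowSums(K))                    # T_n = diag{ t(p) }, with t(p) = sum_p' k_eps(p, p')
  (diag(nrow(K)) - solve(Tn) %*% K) / eps   # L_{n,eps} = (I_n - T_n^{-1} K) / eps
}
# e.g. Ln <- graph_laplacian(X, eps = 0.5)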
A simple renormalization exists when \(p(x)\) is not uniform.
Using the pushforward metric \(h_p\) at point \(p \in M,\) we can construct an affine transformation of \(f\) that defines a local isometry at \(p:\)
Given the metric embedding \((f(\mathcal D), h(\mathcal D))\) of \(\mathcal D:\)
Display \(\mathcal D\) in the coordinates \(\tilde f.\)
Computing the geodesic distance between two points \(p\) and \(p'\) requires discretizing the arclength integral and finding the shortest path:
\[\hat d_\mathcal M \,(f(p),f(p'))= \min\left \{ \sum_{i=1}^l \mathcal H_{q_i,q_{i-1}}\,:\; l\geq 1,\ (q_0=p,\dots,q_l=p')\in\mathcal G \right\}\]
where \(\hat d_\mathcal M\) is the estimated geodesic (metric) distance, \(\Vert f(p) - f(p')\Vert\) is the Euclidean distance between two embedded points, and the minimum runs over paths in the graph \(\mathcal G\) (the shortest path \(d_\mathcal G\)). The edge terms are
\[\mathcal H_{q_i,q_{i-1}}= \frac{1}{2}\sqrt{(f(q_i)-f(q_{i-1}))'\,h_{q_i}(f(q_i)-f(q_{i-1}))}+ \frac{1}{2}\sqrt{(f(q_i)-f(q_{i-1}))'\,h_{q_{i-1}}(f(q_i)-f(q_{i-1}))} \]
The computation can be done regardless of the embedding.
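One way to carry out this computation with igraph; a sketch under the assumption that the embedded points \(f(p)\) are the rows of a matrix Fmat, that h is a list of per-point metric matrices \(h_p,\) and that gn is the neighbourhood graph on the same points (all three names are placeholders):
el <- as_edgelist(gn, names = FALSE)                 # edges as pairs of point indices
H  <- apply(el, 1, function(e) {
  d <- Fmat[e[1], ] - Fmat[e[2], ]                   # difference of embedded endpoints
  0.5 * sqrt(t(d) %*% h[[e[1]]] %*% d) + 0.5 * sqrt(t(d) %*% h[[e[2]]] %*% d)
})
dhat <- distances(gn, weights = H)                   # discretized geodesic distances (all-pairs shortest paths)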
By performing a Voronoi tessellation of a coordinate chart \((U,x)\) we can obtain the estimates \(\Delta x^1,\dots, \Delta x^d\) around \(p\) and multiply them by \(\sqrt{\det (h)}\) to obtain \(\Delta \text{Vol}\approx d\text{Vol}.\) Summing over all points in a set \(W\subset \mathcal M\) gives the estimator:
\[\small \text{Vol}(W) =\sum_{p\in W} \sqrt{\det(h_p)}\,\Delta x^1(p)\cdots \Delta x^d(p).\]
GPs can be extended to \(\mathcal M\) via SPDEs. Let \((k^2 - \Delta_{\mathbb R^d})^{\alpha/2}u(x) = \xi\) with \(\xi\) Gaussian white noise; then \(u(x)\) is a Matérn GP. Substituting \(\Delta_\mathcal M\) for \(\Delta_{\mathbb R^d}\) allows us to define a Matérn GP on \(\mathcal M.\)
Unlabelled points can be predicted by GP regression (semi-supervised learning) using the covariance matrix \(\Sigma=(k^2 - \Delta_{\mathbb R^d})^{-\alpha}\) of the GP \(u(x).\)
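As a rough illustrative sketch, the small graph Laplacian computed earlier (Laplacian) can serve as a discrete stand-in for \(-\Delta\) to build a Matérn-type covariance over the sampled points; kappa and alpha below are illustrative parameters:
kappa <- 1; alpha <- 2
E     <- eigen(Laplacian, symmetric = TRUE)
# Sigma plays the role of (kappa^2 - Delta)^(-alpha) restricted to the sampled points
Sigma <- E$vectors %*% diag((kappa^2 + E$values)^(-alpha)) %*% t(E$vectors)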
From this YouTube video.
The context is a metric \(g_{\alpha\beta},\) which in full notation would be a (0,2) tensor \(g_{\alpha\beta}\, dx^\alpha \otimes dx^\beta,\) taking two vectors of the \(T_p(M)\) to the real line. Since \(g_{\alpha\beta}\) is a function of space-time, i.e. \(g_{\alpha\beta}(x)\) with \(x=( x^0, x^1,x^2, x^3),\) it turns out that \(g_{\alpha\beta}\) is a field.
A Riemannian metric on a differentiable manifold \(M\) is a mapping that assigns to each point \(p\) of the manifold \(M\) an inner product \(g_p\) on the tangent space \(T_pM.\) Thus, for every two vector fields \(X, Y\) on \(M,\) the function \(p\mapsto g_p(X_p, Y_p)=\left<X_p,Y_p\right>_p\) is smooth. A manifold \(M\) equipped with a Riemannian metric is called a Riemannian manifold. The metric does not depend on the choice of local coordinates. In addition, it can be expressed as
\[g_p(X_p, Y_p) =\sum_{i,j=1}^n v_i\,w_j\,\left< \partial_i\big\vert_p, \partial_j\big\vert_p \right>_p\]
with \(\large \partial_i =\frac{\partial}{\partial x_i},\)
where \(v_i\) and \(w_j\) are the coefficients in the representation of \(X_p\) and \(Y_p\) in the canonical basis of the tangent space \(T_p(M)\) given by \(\{\partial_{1\vert p},\dots,\partial_{n\vert p}\}.\) The terms \(g_{ij}(p)\) represent the entries of the matrix \(g(p),\) which is symmetric and positive definite.
The existence of the metric allows defining the length of vectors and of curves. The curve that realizes the shortest distance between two points is called a geodesic.
Let \((M,g)\) be a Riemannian manifold. For any smooth function \(f\) over \(M,\) the gradient is defined as the vector field \(\text{grad}(f)\) in \(T(M)\) that satisfies \(\left<\text{grad}(f), X \right>_g= X(f)\) for all \(X\in T(M).\) In local coordinates, the gradient is written as
\[(\text{grad(f)})_i =\sum_j g^{ij}\,\frac{\partial f}{\partial x_j},\] where \(g^{ij}\) are the components of the inverse matrix of \(g=[g_{ij}].\) Then
\[\text{grad}(f) = \sum_{i=1}^n \underbrace{\left(\sum_{j=1}^n g^{ij} \frac{\partial f}{\partial x_j} \right)}_{(\text{grad}(f))_i}\partial_i\]
In the same local coordinates, the divergence of a vector field \(X\) is written
\[\text{div}(X) = \frac{1}{\sqrt{\det g}}\sum_i \frac{\partial}{\partial x_i}\left(\sqrt{\det g}\,X_i\right)\] The Laplacian or Laplace-Beltrami operator on \((M,g)\) of a smooth function \(f: M\to \mathbb R\) is defined as:
\[\Delta_g \, f = \text{div}(\text{grad}(f))= \frac{1}{\sqrt{\det g}}\sum_j \frac{\partial}{\partial x_j}\left(\sum_i g^{ij}\sqrt{\det g}\,\frac{\partial f}{\partial x_i}\right)\]
In the familiar Euclidean case these operators are expressed directly in terms of partial derivatives.
In the usual case, the gradient applied to a function \(f\) is
\[\text{grad}(f) =\left< \frac{\partial f}{\partial x},\frac{\partial f}{\partial y},\frac{\partial f}{\partial z}\right >\]
In this case the basis is \(\vec i, \vec j, \vec k;\) however, in the case under study the basis is formed by the partials \(\partial_i =\frac{\partial}{\partial x_i}.\)
This gradient determines a vectorial field \(X\)
\[X = \text{grad}(f) =\left< \frac{\partial f}{\partial x},\frac{\partial f}{\partial y},\frac{\partial f}{\partial z}\right >\]
Divergence applied to this vector field,
\[\text{div}(X)=\frac{\partial X_1}{\partial x}+\frac{\partial X_2}{\partial y}+\frac{\partial X_3}{\partial z}\]
produces the Laplacian of the function:
\[\Delta f = \text{div}\left(\text{grad}(f)\right)= \frac{\partial^2 f}{\partial x^2}+\frac{\partial^2 f}{\partial y^2}+\frac{\partial^2 f}{\partial z^2}\]
I presume we would need exponential maps to draw geodesic lines on the manifold.
NOTE: These are tentative notes on different topics for personal use - expect mistakes and misunderstandings.