NOTES ON STATISTICS, PROBABILITY and MATHEMATICS


Statistical Learning or Non-Linear Dimension-Reduction Algorithms:


From this MMDS lecture, and this parallel presentation.

Manifold learning addresses the problem of high-dimensional data. Dimensionality reduction improves computational analysis, and it is beneficial as long as it preserves most of the important information.

In an \(n = 698\) dataset of \(64 \times 64\) gray images of faces, where the faces vary only in terms of the head moving up and down or left to right, a 2D manifold is defined. In this case the different orientations are shown on an isomap embedding:

with each ellipse showing one of the \(698\) faces.


Steps in the process:

  1. Input data iid points \(p_1,p_2,\dots,p_n.\)
  2. Construct neighborhood graph where \(p,p'\) are neighbors iff \(\Vert p - p' \Vert^2\leq \epsilon.\) Neighbors will have an edge connecting them.
  3. Represent the graph as an \(n \times n\) matrix whose leading eigenvectors give the coordinates \(\phi(p_{1:n}).\) The only information taken into consideration is the set of distances between neighbors.
  4. Construct embedding (This step varies between algorithms):

The idea is that PCA makes the implicit assumption that the manifold is a linear subspace and projects onto it. However, this is often not the case in high-dimensional datasets, which motivates the topic, i.e. non-linear dimensionality reduction or manifold learning.
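As a minimal illustration of that point (my own sketch, not from the lecture), PCA applied to a synthetic "Swiss roll" projects the 3D points linearly and cannot unroll the underlying 2D manifold:

set.seed(1)
n <- 1000
theta <- 1.5 * pi * (1 + 2 * runif(n))                        # position along the roll
height <- 10 * runif(n)                                       # position across the roll
X <- cbind(theta * cos(theta), height, theta * sin(theta))    # 3D points lying on a 2D manifold
pca <- prcomp(X)                                              # PCA: a linear projection
plot(pca$x[, 1:2], col = rainbow(100)[cut(theta, 100)], pch = 16)  # the projection keeps the rolled-up spiral, not the flat 2D parameterization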


Types of embeddings:

  • Local Linear Embedding
  • Laplacian Eigenmap
  • Diffusion Map
  • Local Tangent Space Alignment
  • Isomap / MDS

Local linear embedding expresses every point as a linear combination of its neighbors (it’s a weighted average).

Diffusion Maps and Laplacian Eigenmaps work by defining a random walk on the set of points and performing an eigendecomposition of the random-walk matrix.

Local Tangent Space Alignment does local PCA on each neighborhood and patches the pieces together.

Isomap preserves geodesic distances, i.e. distances measured along the manifold.

The question that arises, then, is which of these embeddings is “correct.”

One type of diffusion map is the LAPLACIAN EIGENMAP:

  1. Construct the similarity matrix:

\[S = \left [ S_{pp'}\right ]_{p,p' \in \mathcal D}\]

with \(S_{pp'}= \exp\left(-\frac{1}{2} \Vert p - p' \Vert^2 \right)\) if \(p,p'\) are neighbors and \(0\) otherwise.

  2. Construct the Laplacian matrix

\[L = I - T^{-1}S\]

with \(T=\text{diag}(S\mathbf 1),\) i.e. the diagonal matrix of the row sums of \(S.\)

  3. Calculate the smallest eigenvectors of \(L,\) labelled \(\psi^{1,\dots,m}.\)
  4. Coordinates of \(p\in \mathcal D\) are \(\psi^1(p),\psi^2(p),\dots,\psi^m(p)\) (see the sketch below).
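A minimal R sketch of these four steps (my own illustration, not the lecture's code), assuming a point matrix X with one row per point, a neighborhood threshold eps on the squared distance, and a target dimension m:

laplacian_eigenmap <- function(X, eps, m = 2) {
  D2 <- as.matrix(dist(X))^2                           # squared distances ||p - p'||^2
  S  <- exp(-D2 / 2)                                   # similarities S_pp'
  S[D2 > eps] <- 0                                     # keep neighbors only (||p - p'||^2 <= eps)
  L  <- diag(nrow(S)) - sweep(S, 1, rowSums(S), "/")   # L = I - T^{-1} S, with T = diag(S 1)
  e  <- eigen(L)
  ord <- order(Re(e$values))                           # eigenvalues from smallest up
  Re(e$vectors[, ord[2:(m + 1)]])                      # drop the constant eigenvector; coordinates psi^1, ..., psi^m
}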

Example of Graph Laplacian in R:


require(igraph)

# Undirected graph on six people, given as an edge list:
g <- graph(edges = c('Alice','Sam', 'Sam','David', 'David','Alice',
                     'Frank','David', 'Frank','Paul', 'Paul','Tom', 'Tom','Frank'),
           directed = FALSE)
igraph_options(plot.layout = layout_nicely, vertex.size = 20, vertex.frame.color = F,
               edge.width = 5, edge.color = 'turquoise', vertex.label.dist = 4,
               vertex.color = 'red4')
plot(g)
laplacian_matrix(g)

## 6 x 6 sparse Matrix of class "dgCMatrix"
##       Alice Sam David Frank Paul Tom
## Alice     2  -1    -1     .    .   .
## Sam      -1   2    -1     .    .   .
## David    -1  -1     3    -1    .   .
## Frank     .   .    -1     3   -1  -1
## Paul      .   .     .    -1    2  -1
## Tom       .   .     .    -1   -1   2

Reading out the output:

From the linear algebra explanation by Professor Strang here, and this post.

Degree Matrix:

A square matrix with nodes in rows and columns.

This is the diagonal of the graph Laplacian matrix. It says that node Alice has \(2\) incident edges (degree \(2\)); Sam also has degree \(2\); David and Frank have degree \(3\); and Paul and Tom have degree \(2.\)

That is to say:

\[D=\begin{bmatrix} &\text{Alice}&\text{Sam}&\text{David}&\text{Frank}&\text{Paul}&\text{Tom}\\ \text{Alice}&2&0&0&0&0&0\\ \text{Sam} &0&2&0&0&0&0\\ \text{David}&0&0&3&0&0&0\\ \text{Frank}&0&0&0&3&0&0\\ \text{Paul}&0&0&0&0&2&0\\ \text{Tom}&0&0&0&0&0&2 \end{bmatrix} \]

Adjacency Matrix:

A square matrix of nodes.

This gives the off-diagonal entries, and tells us which node is connected to which; its diagonal is zero since no node is connected to itself. In the case at hand, Alice is connected to Sam and David; Sam is connected to Alice and David; David is connected to Alice, Sam and Frank, etc. So the adjacency matrix is:

\[A = \begin{bmatrix} &\text{Alice}&\text{Sam}&\text{David}&\text{Frank}&\text{Paul}&\text{Tom}\\ \text{Alice}&0&1&1&0&0&0\\ \text{Sam} &1&0&1&0&0&0\\ \text{David}&1&1&0&1&0&0\\ \text{Frank}&0&0&1&0&1&1\\ \text{Paul}&0&0&0&1&0&1\\ \text{Tom}&0&0&0&1&1&0 \end{bmatrix} \]


Laplacian Matrix:

\[\cal L =D-A= \begin{bmatrix} &\text{Alice}&\text{Sam}&\text{David}&\text{Frank}&\text{Paul}&\text{Tom}\\ \text{Alice}&2&-1&-1&0&0&0\\ \text{Sam} &-1&2&-1&0&0&0\\ \text{David}&-1&-1&3&-1&0&0\\ \text{Frank}&0&0&-1&3&-1&-1\\ \text{Paul}&0&0&0&-1&2&-1\\ \text{Tom}&0&0&0&-1&-1&2 \end{bmatrix} \]

\[ L = \underbrace{D}_{\text{degree (diagonal)}} - \underbrace{A}_{\text{adjacency}}\] which is exactly the matrix in the igraph output above (and explains the negative signs).

The Laplacian matrix is a positive semidefinite Gram matrix of the \(I^\top I\) type, where \(I\) here denotes the incidence matrix (not the identity): it results from multiplying the transpose of the incidence matrix by the incidence matrix itself.

The incidence matrix is rectangular with edges on the rows and nodes in the columns:

\[I = \begin{bmatrix} &\text{Alice}&\text{Sam}&\text{David}&\text{Frank}&\text{Paul}&\text{Tom}\\ \text{e1}&-1&0&1&0&0&0\\ \text{e2} &-1&1&0&0&0&0\\ \text{e3}&0&-1&1&0&0&0\\ \text{e4}&0&0&-1&1&0&0\\ \text{e5}&0&0&0&-1&1&0\\ \text{e6}&0&0&0&-1&0&1\\ \text{e7}&0&0&0&0&-1&1 \end{bmatrix} \] For instance, the first row says that there is an edge leaving Alice and going to David. The second row denotes the edge from Alice to Sam; the sixth row the edge from Frank to Tom, etc.

Running the \(I^\top I\) operation in R:

# One row per edge: Alice-David, Alice-Sam, Sam-David, David-Frank, Frank-Paul, Frank-Tom, Paul-Tom
I = matrix(c(-1,0,1,0,0,0, -1,1,0,0,0,0, 0,-1,1,0,0,0, 0,0,-1,1,0,0,
             0,0,0,-1,1,0, 0,0,0,-1,0,1, 0,0,0,0,-1,1), nrow = 7, byrow = T)

# Now we recover the Laplacian:

(Laplacian = t(I) %*% I)
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    2   -1   -1    0    0    0
## [2,]   -1    2   -1    0    0    0
## [3,]   -1   -1    3   -1    0    0
## [4,]    0    0   -1    3   -1   -1
## [5,]    0    0    0   -1    2   -1
## [6,]    0    0    0   -1   -1    2

From which we can extract the eigenvectors and eigenvalues:

eig <- eigen(Laplacian)

The Laplacian matrix is not positive definite because it is singular, i.e. there is a vector in its null space: \(\cal L \vec x= 0.\) This is the vector of all ones, or a multiple of it. Why? Because every row of the Laplacian sums to zero: the diagonal entry counts the \(1\)'s produced by the \(1 \times 1\) or \(-1\times -1\) products when a node's column is dotted with itself, and the off-diagonal entries of that row contribute an equal number of \(-1\)'s.

The smallest eigenvalue is zero, and the corresponding eigenvector (the last column returned by eigen(), which sorts eigenvalues in decreasing order) is a vector of ones, up to scaling:

round(eig$vec[,ncol(eig$vec)],2)
## [1] -0.41 -0.41 -0.41 -0.41 -0.41 -0.41

The last but one eigenvector (the one belonging to the second-smallest eigenvalue) is the Fiedler vector, here \(\begin{bmatrix}-0.46&-0.46&-0.26&0.26&0.46&0.46\end{bmatrix}^\top\) up to sign:

round(eig$vec[,ncol(eig$vec)-1],2)
## [1] -0.46 -0.46 -0.26  0.26  0.46  0.46

which separates the negative nodes (Alice, Sam, David) from the positive nodes (Frank, Paul, Tom): two perfectly separated clusters!
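As a quick follow-up (my own illustration), the two clusters can be read off directly from the signs of the Fiedler vector, using the node order of the matrices above:

nodes <- c('Alice','Sam','David','Frank','Paul','Tom')
fiedler <- eig$vec[, ncol(eig$vec) - 1]   # eigenvector of the second-smallest eigenvalue
split(nodes, fiedler > 0)                 # the two spectral clusters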


Embedding:


An embedding of a manifold \(\mathcal M\) into \(\mathcal N\) is an injective map \(\phi: \mathcal M\to \mathcal N\) with \(\text{rank}(\phi)=\dim(\mathcal M),\) such that \(\mathcal M\) is homeomorphic to its image under \(\phi.\)

The rank of a smooth map is \(k\) if the Jacobian \(d\phi_p:T_p\mathcal M \to T_{\phi(p)}\mathcal N\) of the map has rank \(k\) for all points \(p\in \mathcal M.\)

The geometry is defined in terms of the Riemannian metric (RM), \(g,\) a symmetric and positive definite tensor field which defines an inner product \(<\;,\;>_g\) on each tangent space \(T_p\mathcal M.\)

A Riemannian manifold \((\mathcal M, g)\) is a smooth manifold with an RM defined on \(T_p\mathcal M\) at every point \(p.\)

It allows defining angles between vectors in the tangent space:

\[\cos \theta=\frac{<u,v>_g}{\Vert u\Vert \Vert v \Vert }\]
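As a tiny numerical illustration (made-up numbers; I take the norms in the formula to be \(g\)-norms), the angle between two tangent vectors under a metric \(g\) can be computed as:

g  <- matrix(c(2, 0.5, 0.5, 1), 2, 2)                # a symmetric positive definite metric at p
ip <- function(a, b) as.numeric(t(a) %*% g %*% b)    # the inner product <a, b>_g
u  <- c(1, 0); v  <- c(1, 1)
acos(ip(u, v) / sqrt(ip(u, u) * ip(v, v)))           # the angle theta, in radians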

The arc length of a curve \(c\) on \(\mathcal M\) is:

\[l(c) = \int_a^b \sqrt{g_{ij}\frac{dx^i}{dt}\frac{dx^j}{dt}}dt\]

where \((x^1,\dots,x^d)\) are the coordinates of the chart \((U,\vec x)\) with \(c([a,b])\subset U.\)

Similarly the volume of \(W\subset U\) is given by

\[\text{vol}(W)= \int_W\sqrt{\det(g)}\; dx^1\dots dx^d\]


Isometry:

The smooth map \(f: \mathcal M \to \mathcal N\) between two Riemannian manifolds \((\mathcal M, g)\) and \((\mathcal N, h)\) is called an isometry iff for all \(p\in \mathcal M\) and all \(u,v\in T_p(\mathcal M),\) \(<u,v>_{g(p)}=\left< df_p u, df_p v \right>_{h(f(p))},\) where \(df_p\) denotes the Jacobian of \(f\) at \(p,\) i.e. the map \(df_p:T_p(\mathcal M)\to T_{f(p)}(\mathcal N).\)


Rank Deficient Matrix - The Embedding Pushforward Metric:

For every \(u,v \in T_{f(p)}f(\mathcal M) \oplus T_{f(p)}f(\mathcal M)^\perp,\) the embedding pushforward metric \(h\) of an embedding \(f\) at point \(p \in \mathcal M\) is defined by the inner product \(<u,v>_{h(f(p))}\equiv\left < df_p^\dagger (u), df_p^\dagger (v) \right >_{g(p)},\) where \(df_p^\dagger: T_{f(p)}f(\mathcal M) \oplus T_{f(p)}f(\mathcal M)^\perp \to T_p\mathcal M\) is the pseudoinverse of the Jacobian \(df_p\) of \(f: \mathcal M \to \mathbb R^\mathcal s.\)

It is a way of carrying the inner product over to the embedded space so that the isometry is preserved.


Laplacian on \(\mathcal M\): The Laplace-Beltrami Operator:


The L-B operator \(\Delta_\mathcal M\) acting on a twice differentiable function \(f: \mathcal M \to \mathbb R\) is defined as the divergence of the gradient:

\[\Delta_\mathcal M f \equiv \text{div grad}(f)\] In local coordinates, for chart \((U,x),\) the L-B operator is expressed by means of \(g:\)

\[\Delta_\mathcal M f = \frac{1}{\sqrt{\det(g)}}\frac{\partial}{\partial x^l}\left(\sqrt{\det(g)}\,g^{lk}\frac{\partial}{\partial x^k} f\right ),\] with summation over the repeated indices \(l\) and \(k\) implied.

The L-B operator contains all the geometry of the object, and at the same time, it is coordinate free.


Obtaining the Riemannian Metric:


Given a coordinate chart \((U,x)\) of a smooth Riemannian manifold \(M\) and \(\Delta_M\) defined on \(M,\) the inverse of the RM at a point \(p\in U,\) \(g(p)^{-1},\) expressed in the local coordinates \(x,\) can be derived from

\[g^{ij}=\frac{1}{2}\Delta_M (x^i - x^i(p))\,(x^j - x^j (p))\rvert_{x^i=x^i(p),x^j=x^j(p)}\]

with \(i,j=1,\dots, d.\)

For an embedding this takes the form

\[\tilde h^{ij}=\frac{1}{2}\Delta_M\; (f^i-f^i(p))\,(f^j-f^j(p))\rvert_{f^i=f^i(p),f^j=f^j(p)}\]

and so \(h(p)\) is given by the pseudoinverse of \(\tilde h(p).\)


Discretized problem:


The input to the embedding algorithm is a set of points \(\mathcal D =\{p_1,\dots,p_n\}\subset M.\) These points are assumed to be a sample from the distribution

\[p(x) = e^{-U(x)}\] on \(M,\) absolutely continuous with respect to the Lebesgue measure on \(M.\)

Embedding algorithms construct a map \(f_n: \mathcal D\to \mathbb R^\mathcal s.\)

We go on to define the symmetric positive semidefinite embedding pushforward \(h_n\) on the set of points \(\mathcal D.\) We obtain it by defining a discrete estimator \(\tilde{ \mathcal L}_{n,\epsilon}\) that is consistent with \(\Delta_M.\) The idea is to get an approximation of the Laplacian.


Random Walk Graph Laplacian:


Construct graph \(g_n\) to have adjacency matrix \(k_n =[k_\epsilon(p,p')]_{p,p'\in \mathcal D}\) given by \(k_\epsilon(p,p') = \exp(-\Vert p-p'\Vert^2/\epsilon),\) which is the Gaussian kernel.

If the sampling density is uniform over \(M,\) then one can immediately construct the random walk graph Laplacian:

\[\mathcal L_{n,\epsilon}\equiv \frac{I_n - T_n^{-1}k_n}{\epsilon}\]

where \(T_n\) is the diagonal matrix of outdegrees, i.e. \(T_n= \text{diag}\left\{t(p), p\in \mathcal D \right\}\) with \(t(p) =\sum_{p'\in \mathcal D}k_n(p,p').\)

A simple renormalization exists when \(p(x)\) is not uniform.
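A hedged R sketch of the uniform-density construction above, assuming a point matrix P with one row per point and a bandwidth eps (the function name and arguments are mine):

random_walk_laplacian <- function(P, eps) {
  K  <- exp(-as.matrix(dist(P))^2 / eps)           # Gaussian kernel k_eps(p, p')
  Tn <- rowSums(K)                                 # outdegrees t(p)
  (diag(nrow(K)) - sweep(K, 1, Tn, "/")) / eps     # (I_n - T_n^{-1} k_n) / eps
}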


Algorithm to learn the metric:


  1. Construct the graph Laplacian \(\tilde{\mathcal L}_{n,\epsilon}\)
  2. Obtain the embedding coordinates \(f_n(p) =\left( f_n^1(p),\dots,f_n^\mathcal s (p) \right)\) of each point \(p \in \mathcal D\) by \([f_n(p)]_{p\in \mathcal D}\equiv \text{GENERIC EMBED}(\mathcal D,\mathcal s, \epsilon_n)\)
  3. Calculate the embedding metric \(h_n(p)\) at each point, using \(\tilde{\mathcal L}_{n,\epsilon}\) (a sketch follows this list). Output \(\left( f_n(p), h_n(p) \right)_{p\in \mathcal D}.\)
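A possible sketch of step 3, using the discrete analogue of the \(\tilde h^{ij}\) formula above: apply the graph Laplacian to products of centered embedding coordinates, then take the pseudoinverse at each point. All names are mine (L is the \(n\times n\) graph Laplacian, Fc the \(n\times \mathcal s\) matrix of embedding coordinates), and the result is up to the sign convention relating the graph Laplacian to \(\Delta_M\):

embedding_metric <- function(L, Fc) {
  n <- nrow(Fc); s <- ncol(Fc)
  H <- array(NA, dim = c(n, s, s))
  for (p in 1:n) {
    ctr <- sweep(Fc, 2, Fc[p, ])                   # f - f(p), one column per coordinate
    htilde <- matrix(NA, s, s)
    for (i in 1:s) for (j in 1:s)
      htilde[i, j] <- 0.5 * sum(L[p, ] * ctr[, i] * ctr[, j])   # (1/2) [L ((f^i - f^i(p))(f^j - f^j(p)))](p)
    H[p, , ] <- MASS::ginv(htilde)                 # h_n(p) = pseudoinverse of h~(p)
  }
  H
}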

Locally Isometric Embedding:


Using the pushforward metric \(h_p\) at point \(p \in M,\) we can construct an affine transformation of \(f\) that defines a local isometry at \(p:\)

Given \((f(\mathcal D), h(\mathcal D))\) metric embedding of \(\mathcal D:\)

  1. Select a point \(p\in \mathcal D\) on the manifold
  2. Transform coordinates \(\tilde f(p') \leftarrow h_p^{-1/2}\,f(p')\) for \(p'\in \mathcal D\)

  3. Display \(\mathcal D\) in coordinates \(\tilde f\) (see the sketch below)
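A minimal sketch of the transformation, assuming Fc is the \(n \times \mathcal s\) matrix of embedding coordinates and Hp the \(\mathcal s \times \mathcal s\) metric \(h_p\) at the selected point (my notation):

e <- eigen(Hp, symmetric = TRUE)                                                      # h_p is symmetric positive semidefinite
Hp_inv_sqrt <- e$vectors %*% diag(1 / sqrt(pmax(e$values, 1e-12))) %*% t(e$vectors)   # h_p^{-1/2}
Fc_tilde <- Fc %*% Hp_inv_sqrt                                                        # each row is h_p^{-1/2} f(p')
plot(Fc_tilde[, 1:2])                                                                 # display D in the locally isometric coordinates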


Computing Geodesic Distance:


Computing the geodesic distance between two points \(p\) and \(p'\) requires discretizing the arclength integral and finding the shortest path:

\[\hat d_\mathcal M \,(f(p),f(p'))= \min\left \{ \sum_{i=1}^l \mathcal H_{q_i,q_{i-1}}:\; l\geq 1,\; (q_0=p,q_1,\dots,q_l=p') \text{ a path in } \mathcal G \right\}\]

where \(\hat d_\mathcal M\) is the estimated geodesic distance and the minimum is taken over all paths \((q_0,\dots,q_l)\) in the neighborhood graph \(\mathcal G\) joining \(p\) and \(p',\) with the length of each edge given by

\[\mathcal H_{q_i,q_{i-1}}= \frac{1}{2}\sqrt{(f(q_i)-f(q_{i-1}))'\,h_{q_i}(f(q_i)-f(q_{i-1}))}+ \frac{1}{2}\sqrt{(f(q_i)-f(q_{i-1}))'\,h_{q_{i-1}}(f(q_i)-f(q_{i-1}))} \]

The computation can be done regardless of which embedding was used.
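For instance (a hedged sketch; W is assumed to be a symmetric matrix holding the edge weights \(\mathcal H_{q_i,q_{i-1}}\) for neighboring points and zero elsewhere), the estimated geodesic distances are just weighted shortest paths on the neighborhood graph:

library(igraph)
gW <- graph_from_adjacency_matrix(W, mode = "undirected", weighted = TRUE)
distances(gW, algorithm = "dijkstra")    # pairwise estimates of the geodesic distance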


Computing Area and Volume:


By performing a Voronoi tessellation of a coordinate chart \((U,x)\) we can obtain the coordinate volume element \(\Delta x^1\cdots \Delta x^d\) around \(p\) and multiply it by \(\sqrt{\det (h_p)}\) to obtain \(\Delta \text{Vol}\approx d\text{Vol}.\) Summing over all points in a set \(W\subset \mathcal M\) gives the estimator:

\[\small \text{Vol}(W) =\sum_{p\in W} \sqrt{\det(h_p)}\,\Delta x^1(p)\cdots \Delta x^d(p).\]


Gaussian Processes on Manifolds:


GPs can be extended to \(\mathcal M\) via SPDEs. Let \((k^2 - \Delta_{\mathbb R^d})^{\alpha/2}u(x) = \xi\) with \(\xi\) Gaussian white noise; then \(u(x)\) is a Matérn GP. Substituting \(\Delta_\mathcal M\) for \(\Delta_{\mathbb R^d}\) allows us to define a Matérn GP on \(\mathcal M.\)

Unlabelled points can be learned by GP regression (semi-supervised learning) using the covariance matrix \(\Sigma=(k^2 - \Delta_{\mathbb R^d})^{-\alpha}\) of GP \(u(x).\)
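A hedged sketch of this idea in the discrete setting, replacing \(-\Delta\) by a symmetric graph Laplacian L built from the sampled points (the function name and parameters are my own):

graph_matern_cov <- function(L, kappa = 1, alpha = 2) {
  E <- eigen(L, symmetric = TRUE)                                       # L is symmetric positive semidefinite
  E$vectors %*% diag((kappa^2 + E$values)^(-alpha)) %*% t(E$vectors)    # (kappa^2 I + L)^(-alpha)
}

The resulting matrix can then be used as the covariance of a GP over all points, labelled and unlabelled.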


From this YouTube video.


Riemann metric:


The context is a metric \(g_{\alpha\beta},\) which in full notation would be a (0,2) tensor \(g_{\alpha\beta}\, dx^\alpha \otimes dx^\beta,\) taking two vectors of \(T_p(M)\) to the real line. Since \(g_{\alpha\beta}\) is a function of space-time, i.e. \(g_{\alpha\beta}(x)\) with \(x=( x^0, x^1,x^2, x^3),\) it turns out that \(g_{\alpha\beta}\) is a field.

A Riemannian metric on a differentiable manifold \(M\) is a mapping that assigns to each point \(p\) of the manifold \(M\) an inner product \(g_p\) on the tangent space \(T_pM.\) Thus, for every pair of vector fields \(X, Y\) on \(M\), the function \(p\mapsto g_p(X_p, Y_p)=\left<X_p,Y_p\right>_p\) is smooth. A manifold \(M\) provided with a Riemannian metric is called a Riemannian manifold. The metric does not depend on the choice of local coordinates. In addition it can be expressed as

\[g_p(X_p, Y_p) =\sum_{i,j=1}^n v_i\,w_j\,\left< \partial_i\vert_p, \partial_j\vert_p \right>_p\]

with \(\large \partial_i =\frac{\partial}{\partial x_i},\)

where \(v_i\) and \(w_j\) are the coefficients in the representation of \(X_p\) and \(Y_p\) in the canonical basis of the tangent space \(T_p(M)\) given by \(\{\partial_1\vert_p,\dots,\partial_n\vert_p\}.\) The terms \(g_{ij}(p)\) represent the entries of the matrix \(g(p),\) which is symmetric and positive definite.

The existence of the metric allows us to define the length of vectors and of curves. The curve along which the distance between two points is shortest is called a geodesic.


Laplacian operator:


Let \((M,g)\) be a Riemannian manifold. For any smooth function \(f\) over \(M,\) the gradient is defined as the vector field \(\text{grad}(f)\) in \(T(M)\) that satisfies \(\left<\text{grad}(f), X \right>_g= X(f)\) for all \(X\in T(M).\) In local coordinates, the gradient is written as

\[(\text{grad(f)})_i =\sum_j g^{ij}\,\frac{\partial f}{\partial x_j},\] where \(g^{ij}\) are the components of the inverse matrix of \(g=[g_{ij}].\) Then

\[\text{grad}(f) = \sum_{i=1}^n \underbrace{\left(\sum_{j=1}^n g^{ij} \frac{\partial f}{\partial x_j} \right)}_{(\text{grad}(f))_i}\partial_i\]

In the same local coordinates, the divergence of \(X\) is written

\[\text{div}(X) = \frac{1}{\sqrt{\det g}}\sum_i \frac{\partial}{\partial x_i}\left(\sqrt{\det g}\,X_i\right)\] The Laplacian or Laplace-Beltrami operator on \((M,g)\) of a smooth function \(f: M\to \mathbb R\) is defined as:

\[\Delta_g \, f = \text{div}(\text{grad}(f))= \frac{1}{\sqrt{\det g}}\sum_j \frac{\partial}{\partial x_j}\left(\sum_i g^{ij}\sqrt{\det g}\,\frac{\partial f}{\partial x_i}\right)\]

In the Euclidean case, the Laplacian operator reduces to the familiar combination of partial derivatives.

In the usual case, the gradient applied to a function \(f\) is

\[\text{grad}(f) =\left< \frac{\partial f}{\partial x},\frac{\partial f}{\partial y},\frac{\partial f}{\partial z}\right >\]

in this case the basis is \(\vec i, \vec j, \vec k;\) however in the case under study the basis is formed by the partials \(\partial_i =\frac{\partial}{\partial x_i}.\)

This gradient determines a vector field \(X\)

\[X = \text{grad}(f) =\left< \frac{\partial f}{\partial x},\frac{\partial f}{\partial y},\frac{\partial f}{\partial z}\right >\]

Divergence applied to this vector field,

\[\text{div}(X)=\frac{\partial X_1}{\partial x}+\frac{\partial X_2}{\partial y}+\frac{\partial X_3}{\partial z}\]

produces the Laplacian of the function:

\[\Delta f = \text{div}\left(\text{grad}(f)\right)= \frac{\partial^2 f}{\partial x^2}+\frac{\partial^2 f}{\partial y^2}+\frac{\partial^2 f}{\partial z^2}\]


I presume we would need exponential maps to draw geodesic lines on the manifold.



NOTE: These are tentative notes on different topics for personal use - expect mistakes and misunderstandings.