NOTES ON STATISTICS, PROBABILITY and MATHEMATICS


GAUSSIAN PROCESS REGRESSION:


From Wikipedia:

In probability theory and statistics, a Gaussian process is a stochastic process (a collection of random variables indexed by time or space), such that every finite collection of those random variables has a multivariate normal distribution.

From r-bloggers

Based on the book by C. E. Rasmussen & C. K. I. Williams, Gaussian Processes for Machine Learning.


options(warn=-1)

require(MASS)
require(plyr)
require(reshape2)
require(ggplot2)
require(RColorBrewer)
require(GPBayes)
set.seed(2023)

MULTIVARIATE GAUSSIAN DISTRIBUTION:


# Draw 20,000 samples from a bivariate normal with unit variances and correlation -0.6
bi <- mvrnorm(n = 2e4, mu = c(0,0), Sigma = matrix(c(1, -.6, -.6, 1), nrow = 2))
plot(bi, pch = 19, col = rgb(0, 0, 0.2, 0.02))
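As a quick check (my addition, not in the original code), the empirical moments of the simulated draws should be close to the parameters passed to mvrnorm:

# Empirical mean, covariance and correlation of the 20,000 draws; they should
# approximate mu = (0, 0), the unit variances and the correlation of -0.6.
colMeans(bi)
cov(bi)
cor(bi)[1, 2]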

EACH FUNCTION IS A SINGLE SAMPLE FROM AN INFINITE DIMENSIONAL GAUSSIAN:

The specification of the covariance function implies a distribution over functions. To see this, we can draw samples from the distribution of functions evaluated at any number of points; in detail, we choose a number of input points \(X_*\) and write out the corresponding covariance matrix using eq. (2.16) element-wise:

\[\begin{align} \text{cov}\left(f({\bf x}_p), f({\bf x}_q)\right) = k({\bf x}_p, {\bf x}_q) = \exp\left(-\frac{1}{2} \vert {\bf x}_p - {\bf x}_q \vert^2\right) \end{align}\]

Then we generate a random Gaussian vector with this covariance matrix, \(f_* \sim {\cal N}(0, K(X_*, X_*))\).

calcSigma <- function(X1, X2, theta=1) {
  Sigma <- matrix(rep(0, length(X1) * length(X2)), nrow=length(X1))
  for (i in 1:nrow(Sigma)) {
    for (j in 1:ncol(Sigma)) {
      Sigma[i,j] <- exp(-0.5 * (abs(X1[i] - X2[j]) / theta)^2) # Squared exponential
    }
  }
  return(Sigma)
}
  
# 1. Plot some sample functions from the Gaussian process

# Define the points at which we want to define the functions
x.star <- seq(-5, 5, len = 100)

# Calculate the covariance matrix
sigma <- calcSigma(x.star, x.star)

# Generate a number of functions from the process
n.samples <- 8

# Initialise a matrix with length(x.star) rows and n.samples columns:
fstar <- matrix(rep(0, length(x.star) * n.samples), ncol = n.samples)

for (i in 1:n.samples) {
  # Each column represents a sample from a multivariate normal distribution with zero mean and covariance sigma
  fstar[,i] <- mvrnorm(1, rep(0, length(x.star)), sigma)
}
values <- cbind(x = x.star, as.data.frame(fstar))
values <- melt(values, id="x")


co <- brewer.pal(n = n.samples, name = 'Spectral')
par(bg='gray5')
p <- ggplot(values, aes(x=x, y=value)) +
  geom_line(aes(group=variable), colour=co[as.numeric(values$variable)], linewidth=1) 
p

# If the length-scale l (theta in calcSigma) decreases, the correlation decays faster and the sample functions become wigglier:

l = 0.3
fst <- matrix(rep(0, length(x.star) * n.samples), ncol = n.samples)
sig <- calcSigma(x.star, x.star, theta=l)

for (i in 1:n.samples) {
  # Each column represents a sample from a multivariate normal distribution with zero mean and covariance sigma
  fst[,i] <- mvrnorm(1, rep(0, length(x.star)), sig)
}
va <- cbind(x = x.star, as.data.frame(fst))
va <- melt(va, id="x")


pp <- ggplot(va, aes(x=x, y=value)) +
  geom_line(aes(group=variable), colour=co[as.numeric(va$variable)], linewidth=1) 
pp

These functions are the prior distribution, before observing any data points. Each line is a draw from, in this case, a \(100\)-dimensional Gaussian distribution (x.star <- seq(-5, 5, len = 100)). The reason they form functions or curves is the covariance kernel between the points.
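To see the role of the kernel more concretely, here is a small check using the calcSigma function defined above: inputs that are close on the \(x\) axis have prior covariance near \(1\), while distant inputs are nearly uncorrelated, which is what forces neighbouring values of each draw to move together.

# Prior covariances between a few inputs (the diagonal is 1, so these are also
# correlations): nearby points are strongly correlated, far-apart points are not.
round(calcSigma(c(0, 0.1, 1, 3), c(0, 0.1, 1, 3)), 3)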

What is a kernel? The dot product between features (in machine learning, a feature is an independent variable that acts as input for a system). The feature space, i.e. the n dimensions where the variables live (not including the target variable, if present), is a Hilbert space: a vector space endowed with an inner product, which induces a norm.
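A tiny illustration (my own example, not from the sources above): the polynomial kernel \(k(u, v) = (u \cdot v)^2\) on \(\mathbb{R}^2\) is an ordinary dot product after mapping each point to the explicit feature vector \(\varphi(u) = (u_1^2, \sqrt{2}\, u_1 u_2, u_2^2)\):

# Explicit feature map for the kernel k(u, v) = (sum(u * v))^2 on R^2
phi <- function(z) c(z[1]^2, sqrt(2) * z[1] * z[2], z[2]^2)

u <- c(1, 2); v <- c(3, -1)
sum(u * v)^2          # kernel evaluated directly: (1*3 + 2*(-1))^2 = 1
sum(phi(u) * phi(v))  # the same value as an inner product of features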

A kernel has to be positive semidefinite as explained here.
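A quick numerical check of this property (a sketch using the calcSigma function above): the kernel matrix over any finite set of inputs should have no negative eigenvalues, up to floating-point error.

# The squared-exponential kernel matrix should be positive semidefinite:
# its smallest eigenvalue should be >= 0 up to numerical round-off.
K <- calcSigma(seq(-5, 5, len = 50), seq(-5, 5, len = 50))
min(eigen(K, symmetric = TRUE, only.values = TRUE)$values)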

From Wikipedia:

\[k(\mathbf {x} ,\mathbf {x'} )=\langle \varphi (\mathbf {x} ),\varphi (\mathbf {x'} )\rangle _{\mathcal {V}}\] The key restriction is that \(\langle \cdot ,\cdot \rangle _{\mathcal {V}}\) must be a proper inner product. On the other hand, an explicit representation for \(\varphi\) is not necessary, as long as \(\mathcal {V}\) is an inner product space. The alternative follows from Mercer’s theorem: an implicitly defined function \(\varphi\) exists whenever the space \(\mathcal {X}\) can be equipped with a suitable measure ensuring the function \(k\) satisfies Mercer’s condition. Mercer’s theorem is similar to a generalization of the result from linear algebra that associates an inner product to any positive-definite matrix.

Different kernels:

From Robert Gramacy Surrogates: Gaussian Process Modeling, Design and Optimization for the Applied Sciences 5.3.3. Stationary kernels:

If a Gaussian kernel isn’t working well, or if you want finer control on mean-square differentiability, then consider a Matérn family member, with

\[k_\nu(x,x')=\frac{2^{1-\nu}}{\Gamma(\nu)}\left(\frac{\sqrt{2\nu}\,\Vert x - x' \Vert}{l} \right)^\nu K_\nu \left(\frac{\sqrt{2\nu}\,\Vert x - x' \Vert}{l}\right)\]

Above, \(K_\nu\) is a modified Bessel function of the second kind, \(\nu\) controls smoothness, and \(l\) is a length scale (written as theta in the code) as before. As \(\nu \to \infty\), i.e., a very smooth parameterization, we get

\[k_\nu (x,x') \to k_\infty(x,x')=\exp\left( -\frac{\Vert x - x'\Vert^2}{2l^2}\right)\]

which can be recognized as (a re-parameterized) Gaussian family. Note that in the Matérn form itself, \(\Vert x - x'\Vert\) enters on the scale of ordinary (not squared) Euclidean distance.

The code below sets up this “full” Matérn in R so we can play with it a little.

matern <- function(X1, X2, theta=1, nu=1/2) {
  Sigma <- matrix(rep(0, length(X1) * length(X2)), nrow=length(X1))
  for (i in 1:nrow(Sigma)) {
    for (j in 1:ncol(Sigma)) {
      r <- abs(X1[i] - X2[j])
      Sigma[i,j] <- 2^(1 - nu)/gamma(nu) * (r * sqrt(2*nu)/theta)^nu * besselK(r * sqrt(2*nu)/theta, nu)
    }
  }
  # At r = 0 the expression above evaluates to 0 * Inf = NaN; the kernel equals 1 there.
  Sigma[is.nan(Sigma)] <- 1
  return(Sigma)
}
  
# 1. Plot some sample functions from the Gaussian process

# Define the points at which we want to define the functions
x.st <- seq(0, 10, len = 100)

# Generate a number of functions from the process
n.samples <- 8
set.seed(0)
par(mfrow=c(1,3))
par(bg='gray90')
nu <- c(1/2,2,10)
for (j in 1:3){
# Initialise a matrix with length(x.st) rows and n.samples columns:
fsta <- matrix(rep(0, length(x.st) * n.samples), ncol = n.samples)
for (i in 1:n.samples) {
  # Each column represents a sample from a multivariate normal distribution with zero mean and covariance sigma
  fsta[,i] <- mvrnorm(1, rep(0, length(x.st)), matern(x.st, x.st, theta=1, nu=nu[j]))
}
vl <- cbind(x = x.st, as.data.frame(fsta))
val <- melt(vl, id="x")

co <- brewer.pal(n = n.samples, name = 'Spectral')

plot(val$x, val$value, col=co[val$variable], pch=19, cex=0.5,
     xlab=bquote(nu == .(nu[j])), ylab='lines')

# Overlay each sample path (column 1 of vl is x, columns 2..(n.samples+1) are samples):
for(l in 2:(n.samples + 1)) lines(vl$x, vl[,l], col=co[l - 1])
}
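As a rough numerical check (my own addition, not from the book), the Matérn kernel with a large \(\nu\) should approach the squared-exponential kernel used earlier, both with length-scale \(1\); besselK() can lose precision for very large orders, so the agreement is only approximate:

# For large nu the Matérn kernel approaches the squared-exponential kernel.
# besselK() may lose accuracy at very large orders, so expect only rough agreement.
r.grid <- seq(0, 3, by = 0.5)
round(cbind(r        = r.grid,
            matern   = matern(r.grid, 0, theta = 1, nu = 100)[, 1],
            gaussian = calcSigma(r.grid, 0, theta = 1)[, 1]), 3)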


FITTING A DATASET:

Now let us assume that we have some known data points. In the book, the notation ‘f’ is used for f$y below.

f <- data.frame(x = c(-4,-3,-1,0,2), y = c(-2,0,1,2,-1))

The following matrix calculations correspond to equation (2.19) in the book: \[\begin{align}\large f_* \mid X_*, X, f \sim \mathcal{N} \left(K(X_*, X)K(X, X)^{-1} f, \quad K(X_*, X_*) - K(X_*, X) K(X, X)^{-1} K(X, X_*) \right) \end{align}\]

where \(X\) are the independent variables (inputs) of the sample and \(X_*\) are the values initially used to build the curves, i.e. not the sample but just the \(x\) axis grid (both \(x\) and \(x'\) are considered index variables, or the equivalent of independent variables).

According to mathematicalmonk this derivation is heavy; however, the result is simply the conditional distribution of a multivariate Gaussian.

In linear algebra terms, as explained here:

The sampled or known values in a GP are \(f(x_1), f(x_2), \dots, f(x_n)\) at inputs \(X\), the unknown values are \(f(x_*)\) at inputs \(X_*\), and jointly they follow \(N(0, \Sigma)\). The sampled inputs do not have to coincide with the points used to generate the prior curves (i.e. the \(x\) axis grid of the plot), and stacking the known and unknown values together gives a data vector such as

\[\begin{bmatrix}f(x)\\ f(x_*)\end{bmatrix}\sim \mathcal N\left ( 0, \begin{bmatrix} K(X,X) & K(X,X_*)\\ K(X_*,X) & K(X_*,X_*)\end{bmatrix}\right )\] The covariance matrix is a block matrix:

\[\begin{align}\Sigma = \large \left[\begin{array}{ccc|c} k(x_1,x_1) & \dots & k(x_1, x_n) & k(x_1,x) \\ \vdots & \ddots & \vdots & \vdots \\ k(x_n,x_1) & \dots & k(x_n, x_n) & k(x_n,x) \\ \hline k(x,x_1) & \dots & k(x, x_n) & k(x,x) \\ \end{array} \right] = \left[\begin{array}{c|c} K_{XX} & K_X(x) \\ \hline K_X(x)^\top & K(x,x) \end{array} \right] \end{align}\]

where \(X =\{x_1, \dots, x_n\}\), \([K_{XX}]_{ij}=k(x_i,x_j)\) is the Gram kernel matrix, and \([K_X(x)]_j=k(x_j, x).\)

The conditional distribution

\(f(x) \mid f(x_1), \dots, f(x_n) \sim {\cal N}(\mu(x), \mathbb V(x))\)

with

\(\mu(x) = K_X(x)^\top K_{XX}^{-1}\,\mathbf{f}\)

where \({\bf f} = [f(x_1), \dots, f(x_n)]^\top\) are the \(y\) (dependent variable) sample points, and

\(K_X(x)^\top= [k(x,x_1), \dots, k(x,x_n)]\)

Notice that this is a matrix multiplication with

\[\mu(x)=\small \begin{bmatrix}\text{points on }x\text{ axis}\\\times\\ \text{sample points}\end{bmatrix}\;\begin{bmatrix}\text{sample points}\\\times\\ \text{sample points}\end{bmatrix}\;\begin{bmatrix}\text{sample points}\\\times\\ 1\end{bmatrix} = \begin{bmatrix}\text{points on }x\text{ axis}\\\times\\ 1\end{bmatrix}\]

therefore yielding a different mean for each marginal in the conditional distribution.

The variance is

\(\mathbb V(x) = k(x,x) - K_X(x)^\top K_{XX}^{-1} K_X(x)\)

Note that each curve is a single sample of the (theoretically) infinite-dimensional multivariate Gaussian. Here we are conditioning on just a handful of these components or elements of the vector, i.e. on part of a single sample. It is the same as conditioning in a bivariate normal:
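For intuition (a sketch with made-up numbers, not from the sources above): conditioning one component of a bivariate normal on the other uses exactly the same formulas, \(\mu_{1\mid 2} = \Sigma_{12}\Sigma_{22}^{-1} y_2\) and \(\Sigma_{1\mid 2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\), just with \(1\times 1\) blocks:

# Bivariate normal with unit variances and correlation 0.7:
# given x2 = 1.5, the conditional distribution of x1 has
#   mean = 0.7 * 1.5 = 1.05 and variance = 1 - 0.7^2 = 0.51.
rho <- 0.7; x2 <- 1.5
mu.cond  <- rho * x2
var.cond <- 1 - rho^2

# Approximately the same numbers from simulation, keeping only the draws
# whose second coordinate falls near the conditioning value:
biv  <- mvrnorm(2e5, mu = c(0, 0), Sigma = matrix(c(1, rho, rho, 1), 2))
near <- abs(biv[, 2] - x2) < 0.05
c(analytic = mu.cond,  simulated = mean(biv[near, 1]))
c(analytic = var.cond, simulated = var(biv[near, 1]))

The code below applies the same conditioning formulas, with the full kernel matrices in place of the scalar blocks.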

options(warn=-1)
# Calculate the covariance matrices using the same x.star values as above
x      <- f$x
k.xx   <- calcSigma(x, x)           # K(X, X)
k.xxs  <- calcSigma(x, x.star)      # K(X, X*)
k.xsx  <- calcSigma(x.star, x)      # K(X*, X)
k.xsxs <- calcSigma(x.star, x.star) # K(X*, X*)


f.star.bar <- k.xsx %*% solve(k.xx) %*% f$y # The mean of the conditional distribution.
cov.f.star <- k.xsxs - k.xsx %*% solve(k.xx) %*% k.xxs
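As a quick sanity check (my addition), the dimensions of these objects match the matrix-multiplication picture above: a \(100 \times 5\) matrix times a \(5 \times 5\) matrix times a length-\(5\) vector yields the \(100 \times 1\) conditional mean.

# Dimension check: (100 x 5) %*% (5 x 5) %*% (5 x 1) -> 100 x 1 mean vector,
# plus a 100 x 100 conditional covariance matrix.
dim(k.xsx); dim(k.xx); length(f$y)
dim(f.star.bar)
dim(cov.f.star)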

# This time we'll plot more samples.  We could of course simply plot a +/- 2 standard deviation confidence interval
# as in the book but I want to show the samples explicitly here.

n.samples <- 50
values <- matrix(rep(0,length(x.star) * n.samples), ncol=n.samples)

for (i in 1:n.samples) {
  values[,i] <- mvrnorm(1, f.star.bar, cov.f.star)
}
values <- cbind(x=x.star, as.data.frame(values))

values2 <- melt(values, id="x")

# Plot the results including the mean function
# and constraining data points
fig2b <- ggplot(values2, aes(x=x, y=value)) +
  geom_line(aes(group=variable), colour= rgb(0.1,0,0.5,0.2)) +
  geom_line(data=values, aes(x=x.star, y=f.star.bar), colour="red", linewidth=1.4) + # Plots the mean, which depends on x
  geom_point(data=f,aes(x=x,y=y)) +
  theme_bw() +
  scale_y_continuous(lim=c(-3,3), name="output, f(x)") +
  xlab("input, x")
fig2b
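As mentioned in the code comment above, instead of plotting many posterior samples we could plot a pointwise mean \(\pm 2\) standard deviation band taken from the diagonal of the conditional covariance. A minimal sketch, reusing f.star.bar and cov.f.star from above (the data frame and plot below are my own additions):

# Pointwise ~95% band from the conditional mean and the diagonal of the
# conditional covariance; with noise-free observations the band pinches
# to zero width at the observed points.
band <- data.frame(x    = x.star,
                   mean = as.vector(f.star.bar),
                   sd   = sqrt(pmax(diag(cov.f.star), 0)))

ggplot(band, aes(x = x)) +
  geom_ribbon(aes(ymin = mean - 2*sd, ymax = mean + 2*sd), fill = "grey80") +
  geom_line(aes(y = mean), colour = "red", linewidth = 1) +
  geom_point(data = f, aes(x = x, y = y)) +
  theme_bw() +
  scale_y_continuous(lim = c(-3, 3), name = "output, f(x)") +
  xlab("input, x")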

In practice the observations are not right on top of the curve - there is noise:

\[y_i = f(x_i) + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal N(0, \sigma^2)\]

and

\[\begin{align}\Sigma = \large \left[\begin{array}{ccc|c} &&& k(x_1,x) \\ &K_{XX} + \sigma^2 I && \vdots \\ &&& k(x_n,x) \\ \hline k(x,x_1) & \dots & k(x, x_n) & k(x,x) \\ \end{array} \right] = \left[\begin{array}{c|c} K_{XX} + \sigma^2 I & K_X(x) \\ \hline K_X(x)^\top & k(x,x) \end{array} \right] \end{align}\]

and

\(\mu(x) = K_X(x)^\top \left( K_{XX} + \sigma^2 I \right)^{-1} \bf y\)

The variance is

\(\mathbb V(x) = k(x,x) - K_X(x)^\top \left( K_{XX} + \sigma^2 I \right)^{-1} K_X(x)\)

options(warn=-1)
# 3. Now assume that each of the observed data points has some
# normally-distributed noise.

# The standard deviation of the noise
sigma.n <- 0.1

# Recalculate the mean and covariance functions
f.bar.star <- k.xsx%*%solve(k.xx + sigma.n^2*diag(1, ncol(k.xx)))%*%f$y
cov.f.star <- k.xsxs - k.xsx%*%solve(k.xx + sigma.n^2*diag(1, ncol(k.xx)))%*%k.xxs

# Recalculate the sample functions
values <- matrix(rep(0, length(x.star)*n.samples), ncol=n.samples)
for (i in 1:n.samples) {
  values[,i] <- mvrnorm(1, f.bar.star, cov.f.star)
}
values <- cbind(x=x.star,as.data.frame(values))
values2 <- melt(values, id="x")

# Plot the result, including error bars on the observed points
gg <- ggplot(values2, aes(x=x, y=value)) + 
  geom_line(aes(group=variable), colour= rgb(0.1,0,0.5,0.2)) +
  geom_line(data=values, aes(x=x.star,y=f.bar.star),colour="red", linewidth=1.4) + 
  geom_errorbar(data=f,aes(x=x,y=NULL,ymin=y-2*sigma.n, ymax=y+2*sigma.n), width=0.2) +
  geom_point(data=f,aes(x=x,y=y)) +
  theme_bw() +
  scale_y_continuous(lim=c(-3,3), name="output, f(x)") +
  xlab("input, x")
gg


Interesting links:

Math of Gaussian regression

On the Matérn kernel

Some chunks of code

Medium article on kernels

Exploring kernels



NOTE: These are tentative notes on different topics for personal use - expect mistakes and misunderstandings.