Mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables, or the “amount of information” (in units such as shannons (bits), nats or hartleys) obtained about one random variable by observing the other random variable (Wikipedia).
It is considered more general than correlation and handles nonlinear dependencies.
Mathematically, it is defined as
\[I(X;Y)=D_{KL}\left(P_{(X,Y)}||P_X \otimes P_Y\right)\]
in which \(D_{KL}\) is the Kullback-Leibler divergence, or relative entropy, which measures how different two probability distributions are, and is calculated as
\[D_{KL}(P||Q)=\sum_{x\in \mathcal X} P(x) \log\left(\frac{P(x)}{Q(x)} \right)\]
which can be interpreted as the expected excess surprise from using \(Q\) as a model instead of \(P.\)
In the context of probability distributions, the tensor product of two distributions \(P_X\) and \(P_Y,\) denoted as \(P_X \otimes P_Y,\) creates a joint distribution that assumes \(X\) and \(Y\) are independent. Therefore it is just the product of the marginal distributions of the random variables in question.
The symbol \(\parallel\) is simply a separator between the two distributions that are the arguments of \(D_{KL}\).
Equivalently, mutual information can be expressed as
\[I(X;Y)=H(X,Y) - H(X\vert Y) - H(Y\vert X)\]
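As a toy numerical check (the small discrete joint table below is made up for illustration), we can compute MI in R both as the KL divergence between the joint and the product of the marginals, and via the entropy identity above; the two values agree.
p_xy <- matrix(c(0.30, 0.10,
                 0.15, 0.45), nrow = 2, byrow = TRUE)  # joint P(X, Y)
p_x <- rowSums(p_xy)                                   # marginal P(X)
p_y <- colSums(p_xy)                                   # marginal P(Y)
p_indep <- outer(p_x, p_y)                             # P_X (x) P_Y, i.e. independence
mi_kl <- sum(p_xy * log(p_xy / p_indep))               # D_KL(P_(X,Y) || P_X (x) P_Y)
H <- function(p) -sum(p * log(p))                      # entropy in nats
H_xy <- H(p_xy)
mi_entropy <- H_xy - (H_xy - H(p_y)) - (H_xy - H(p_x)) # H(X,Y) - H(X|Y) - H(Y|X)
c(mi_kl, mi_entropy)                                   # both ~ 0.126 nats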
For Gaussian r.v.’s: if \(\mathbf{x} = [X_{1}, X_{2}]\) follows a bivariate Gaussian distribution with mean vector \(\mathbf{0},\) unit variance for both variables and correlation coefficient \(\rho\), we write the pdf as:
\[f_X(\bf x)= \frac{1}{2\pi\vert \Sigma \vert^{1/2}}\exp\left[-1/2 \left( \bf x^\top\Sigma^{-1}\bf x\right)\right]\]
with \(\Sigma =\begin{bmatrix}1&\rho\\\rho& 1\end{bmatrix}\)
The entropy of the Gaussian is
The differential entropy of each unit-variance Gaussian marginal is
\[H(X)=H(Y)=1/2 \log(2\pi e)\]
with \[H(X,Y)= 1/2 \log\left((2\pi e)^2(1-\rho^2)\right)\]
And hence the MI is
\[I(X;Y) = -1/2\log(1-\rho^2)\]
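Spelling out the arithmetic with the identity \(I(X;Y)=H(X)+H(Y)-H(X,Y)\):
\[I(X;Y) = 1/2\log(2\pi e) + 1/2\log(2\pi e) - 1/2\log\left((2\pi e)^2(1-\rho^2)\right) = -1/2\log(1-\rho^2)\]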
This is explained here.
Since \(\rho\) appears in the equation, we can plot MI as a function of the correlation:
# MI of a bivariate Gaussian as a function of the correlation coefficient (in nats);
# mi is infinite at the endpoints rho = +/- 1
rho <- seq(-1, 1, by = 0.001)
mi <- -1/2 * log(1 - rho^2)
plot(rho, mi, type = "l", xlab = expression(rho), ylab = "mutual information")
As can be clearly seen, the amount of information that we get about one random variable when we observe the other increases slowly at first and then much, much faster around \(\rho = \pm 0.8\) and then even more significantly around \(\rho = \pm 0.9.\) When \(\rho =\pm 1\) the amount of information tends to infinity – if random variables are perfectly correlated then we know everything about one of them by observing the other one.
This is consistent with Taleb’s presentation on YouTube and in Fooled by Correlation: Common Misinterpretations in Social “Science”. He says that a correlation of \(0.5\) is closer to \(0\) than \(1.\)
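A quick simulation sanity check of the closed form (an illustrative sketch; the sample size and \(\rho\) are arbitrary): generate correlated standard normals and compare the MI implied by the sample correlation with the theoretical value.
set.seed(1)
n <- 1e5; rho_true <- 0.8
z1 <- rnorm(n)
z2 <- rho_true * z1 + sqrt(1 - rho_true^2) * rnorm(n)  # Corr(z1, z2) = rho_true
rho_hat <- cor(z1, z2)                                 # sample correlation
c(estimated   = -1/2 * log(1 - rho_hat^2),             # MI from the estimated rho
  theoretical = -1/2 * log(1 - rho_true^2))            # MI from the true rho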
Distinguishing two distributions parameterized by \(\Theta\) and separated by an infinitesimally small change \(\Delta\Theta\) in the parameters can be accomplished by calculating \(\mathbb E[\Delta^2]\), with \(\Delta\) being the pointwise relative difference between the two distributions:
\[\Delta = \frac{1}{p(x \mid \Theta)}\left(p(x \mid \Theta + \Delta \Theta) - p(x \mid \Theta)\right)\]
The Taylor expansion of the expression \(p(x \mid \Theta + \Delta \Theta)\) in the numerator above is
\[p(x \mid \Theta + \Delta \Theta) \approx p(x \mid \Theta) + \sum_{a} \frac{\partial p(x\mid \Theta)}{\partial \theta^a} \;\Delta \theta^a + \mathcal{O}(|\Delta \Theta|^2)\] Plugging this back results in
\[p(x \mid \Theta + \Delta \Theta) - p(x \mid \Theta) \approx \left(p(x \mid \Theta) + \frac{\partial p(x\mid \Theta)}{\partial \theta^a} \Delta \theta^a\right) - p(x \mid \Theta)\]
\[p(x \mid \Theta + \Delta \Theta) - p(x \mid \Theta) \approx \frac{\partial p(x\mid \Theta)}{\partial \theta^a} \Delta \theta^a\]
Now substitute this result into the expression for \(\Delta\):
\[\Delta \approx \frac{1}{p(x \mid \Theta)} \left(\frac{\partial p(x\mid \Theta)}{\partial \theta^a} \Delta \theta^a \right)\]
Using the chain rule, we recognize the derivative of the log-likelihood inside the expression above:
\[\frac{1}{p(x \mid \Theta)} \cdot \frac{\partial p(x\mid \Theta)}{\partial \theta^a}=\frac{\partial}{\partial \theta^a} \log\left(p(x \mid \Theta)\right)\]
Therefore,
\[\Delta \approx \left(\frac{\partial}{\partial \theta^a}\log\left(p(x \mid \Theta) \right)\right) \Delta\theta^a + \mathcal{O}(|\Delta \Theta|^2)\]
The score function, denoted \(\mathbf{s}(x; \boldsymbol{\theta})\) (or simply \(\mathbf{s}\)), is the gradient of the log-likelihood with respect to the parameters \(\boldsymbol{\theta}\).
\[s_a(x; \boldsymbol{\theta}) = \frac{\partial \log p(x \mid \boldsymbol{\theta})}{\partial \theta^a}\]
The Fisher Information Matrix (FIM), \(\mathbf{I}(\boldsymbol{\theta})\), is defined as the covariance (variance) of the score function:
\[I_{ab} = \text{Cov}[s_a, s_b] = \mathbb{E}[s_a s_b] - \mathbb{E}[s_a]\mathbb{E}[s_b]\]
Since \(\mathbb{E}[s_a] = 0\) and \(\mathbb{E}[s_b] = 0\), the covariance simplifies directly to the expected outer product:
\[\mathbf{I}_{ab} = \mathbb{E}\left[\frac{\partial \log p(x \mid \boldsymbol{\theta})}{\partial \theta^a} \frac{\partial \log p(x \mid \boldsymbol{\theta})}{\partial \theta^b}\right]= \int p(x\mid \boldsymbol{\theta})\, \frac{\partial \log p(x \mid \boldsymbol{\theta})}{\partial \theta^a}\, \frac{\partial \log p(x \mid \boldsymbol{\theta})}{\partial \theta^b} \;dx\]
This is explained here.
The derivation is at the bottom (**).
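Combining the FIM with \(\Delta \approx s_a\,\Delta\theta^a\) from above gives \(\mathbb E[\Delta^2] \approx I_{ab}\,\Delta\theta^a\Delta\theta^b\). Below is a small Monte Carlo illustration for a shift in the mean of a \(\mathcal N(\mu,\sigma^2)\) model (the numbers are arbitrary illustrative choices; it uses \(I_{\mu\mu} = 1/\sigma^2\), derived in the (**) section below).
set.seed(1)
mu <- 0; sigma <- 1; d_mu <- 0.1
x <- rnorm(1e6, mean = mu, sd = sigma)            # samples from p(x | theta)
Delta <- (dnorm(x, mu + d_mu, sigma) - dnorm(x, mu, sigma)) / dnorm(x, mu, sigma)
c(mc = mean(Delta^2), fisher = d_mu^2 / sigma^2)  # both ~ 0.01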
The FIM defines a Riemannian metric, i.e. a positive-definite symmetric bilinear form (an inner product), which allows describing geodesics in the parameter manifold of probability distributions. In the case of the normal distribution, these geodesics live in a hyperbolic geometry, that of the Poincaré half-plane. If the means are equal, the geodesics are vertical lines along which only the variance changes; if the means differ, the geodesic also passes through distributions of larger variance, bending away from the \(\mu\)-axis.
The Fisher Information Metric for the family of univariate Gaussian (Normal) distributions is known to induce the geometry of the Poincaré half-plane.
The parameter space for a Gaussian distribution \(\mathcal{N}(\mu, \sigma^2)\) is \(\mathbf{\theta} = (\mu, \sigma)\), where \(\mu \in \mathbb{R}\) is the mean and \(\sigma > 0\) is the standard deviation.
This space is the upper half-plane \(\mathbb{H} = \{(\mu, \sigma) \mid \sigma > 0\}\). The geodesics in this space are:
Vertical lines: For distributions with the same mean \(\mu\), the geodesic is the line \(\mu = \text{constant}\).
Semicircles centered on the \(\mu\)-axis (\(\sigma=0\)): For distributions with different means, the geodesic is a half-ellipse in the \((\mu, \sigma)\) plane, which corresponds to a semicircle in the rescaled Poincaré half-plane.
The Fisher Information Metric for the parameters \((\mu, \sigma)\) is:
\[G_F(\mu, \sigma) = \begin{pmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{2}{\sigma^2} \end{pmatrix}\] This metric is not identical to the Poincaré metric \(\frac{1}{\sigma^2} \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\), but it becomes proportional to it after rescaling \(\mu \mapsto \mu/\sqrt{2}\).
The geodesics in the Gaussian parameter space with the Fisher metric (sometimes called the Fisher half-plane) are vertical lines and half-ellipses centered on the \(\mu\)-axis.
A half-ellipse is defined by the equation: \[\frac{(\mu - \mu_0)^2}{a^2} + \frac{\sigma^2}{b^2} = 1 \quad \text{with } \sigma > 0\] where \(b = \frac{a}{\sqrt{2}}\). Substituting \(b\) gives: \[\frac{(\mu - \mu_0)^2}{a^2} + \frac{2\sigma^2}{a^2} = 1 \quad \text{or} \quad (\mu - \mu_0)^2 + 2\sigma^2 = a^2\]
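A quick R sketch of these geodesics in the \((\mu, \sigma)\) half-plane (with \(\mu_0 = 0\) and a few arbitrary values of \(a\)):
plot(0, 0, type = "n", xlim = c(-3, 3), ylim = c(0, 2.2),
     xlab = expression(mu), ylab = expression(sigma))
abline(v = c(-2, 0, 2), lty = 2)                # constant-mean geodesics (vertical lines)
for (a in c(1, 2, 3)) {                         # half-ellipses (mu - 0)^2 + 2*sigma^2 = a^2
  mu_vals <- seq(-a, a, length.out = 400)
  lines(mu_vals, sqrt((a^2 - mu_vals^2) / 2))
}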
The FIM can also be seen as the Hessian (second derivative) of the KL divergence between nearby distributions, or equivalently as the negative expected Hessian of the log-likelihood; the latter form is derived next:
Let \(p(x; \boldsymbol{\theta})\) be the PDF/PMF and \(L(\boldsymbol{\theta}; x) = \log p(x; \boldsymbol{\theta})\) be the log-likelihood.
Step 1: Prove the Zero Expectation of the score function. The score function is \(\mathbf{s}(\boldsymbol{\theta}) = \nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta}; X)\).
We start with the fundamental probability axiom that the total probability must be one for all \(\boldsymbol{\theta}\):
\[\int p(x; \boldsymbol{\theta}) \, dx = 1\]
Take the derivative with respect to \(\boldsymbol{\theta}\) on both sides. Assuming regularity conditions allow interchanging differentiation and integration:
\[\nabla_{\boldsymbol{\theta}} \left( \int p(x; \boldsymbol{\theta}) \, dx \right) = \nabla_{\boldsymbol{\theta}} (1)\] \[\int \nabla_{\boldsymbol{\theta}} p(x; \boldsymbol{\theta}) \, dx = \mathbf{0}\]
Now, use the identity \(\nabla_{\boldsymbol{\theta}} p = p \cdot \nabla_{\boldsymbol{\theta}} \log p\) (*) to rewrite the integrand:
\[\int \left( \nabla_{\boldsymbol{\theta}} \log p(x; \boldsymbol{\theta}) \right) p(x; \boldsymbol{\theta}) \, dx = \mathbf{0}\] By the definition of expectation \(\mathbb{E}[\cdot] = \int (\cdot) p(x; \boldsymbol{\theta}) dx\), this proves the identity:\[\mathbb{E}_{p(\cdot; \boldsymbol{\theta})}[\mathbf{s}(\boldsymbol{\theta})] = \mathbb{E}\left[\nabla_{\boldsymbol{\theta}} \log p(X; \boldsymbol{\theta})\right] = \mathbf{0}\]
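A quick numerical check of this identity for a \(\mathcal N(\mu, \sigma^2)\) model (illustrative values; the Gaussian score components used here are derived in the (**) section below):
mu <- 0.4; v <- 2                              # v = sigma^2
integrate(function(x) ((x - mu) / v) * dnorm(x, mu, sqrt(v)),
          -Inf, Inf)                           # E[score wrt mu]      ~ 0
integrate(function(x) (((x - mu)^2 - v) / (2 * v^2)) * dnorm(x, mu, sqrt(v)),
          -Inf, Inf)                           # E[score wrt sigma^2] ~ 0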
Step 2: Relate Variance to Negative Expected Hessian
The two standard forms of the Fisher Information Matrix (FIM), \(\mathbf{I}(\boldsymbol{\theta})\), are:
Variance of the Score (Form 1):
As defined above, the FIM is the covariance (variance) of the score function:\[I_{ab} = \text{Cov}[s_a, s_b] = \mathbb{E}[s_a s_b] - \mathbb{E}[s_a]\mathbb{E}[s_b]\] which, since \(\mathbb{E}[s_a] = 0\) and \(\mathbb{E}[s_b] = 0\), reduces to \(I_{ab} = \mathbb{E}[s_a s_b]\).
Negative Expected Hessian (Form 2):
We start from the zero expectation derived in Step 1:
\[\mathbb{E}[\mathbf{s}(\boldsymbol{\theta})] = \int \mathbf{s}(\boldsymbol{\theta}) p(x; \boldsymbol{\theta}) \, dx = \mathbf{0}\]
Take the derivative \(\nabla_{\boldsymbol{\theta}}\) of the entire zero expression again:
\[\nabla_{\boldsymbol{\theta}} \mathbb{E}[\mathbf{s}(\boldsymbol{\theta})] = \nabla_{\boldsymbol{\theta}} \left( \int \mathbf{s}(\boldsymbol{\theta}) p(x; \boldsymbol{\theta}) \, dx \right) = \mathbf{0}\]
Interchange differentiation and integration, and apply the product rule to the integrand:
\[\int \left[ (\nabla_{\boldsymbol{\theta}} \mathbf{s}(\boldsymbol{\theta})) p(x; \boldsymbol{\theta}) + \mathbf{s}(\boldsymbol{\theta}) (\nabla_{\boldsymbol{\theta}} p(x; \boldsymbol{\theta}))^T \right] \, dx = \mathbf{0}\]
Recognize that \(\nabla_{\boldsymbol{\theta}} \mathbf{s}(\boldsymbol{\theta})\) is the Hessian of the log-likelihood \(\mathbf{H}(\boldsymbol{\theta}) = \nabla_{\boldsymbol{\theta}}^2 L(\boldsymbol{\theta}; x)\), and substitute \(\nabla_{\boldsymbol{\theta}} p = p \cdot \mathbf{s}\):
\[\int \left[ \mathbf{H}(\boldsymbol{\theta}) p(x; \boldsymbol{\theta}) + \mathbf{s}(\boldsymbol{\theta}) (\mathbf{s}(\boldsymbol{\theta}) p(x; \boldsymbol{\theta}))^T \right] \, dx = \mathbf{0}\]
Separate the integrals:
\[\int \mathbf{H}(\boldsymbol{\theta}) p(x; \boldsymbol{\theta}) \, dx + \int \mathbf{s}(\boldsymbol{\theta}) \mathbf{s}(\boldsymbol{\theta})^T p(x; \boldsymbol{\theta}) \, dx = \mathbf{0}\]
Rewrite in terms of expectation:
\[\mathbb{E}[\mathbf{H}(\boldsymbol{\theta})] + \mathbb{E}[\mathbf{s}(\boldsymbol{\theta}) \mathbf{s}(\boldsymbol{\theta})^T] = \mathbf{0}\]
Rearranging gives the desired equivalence:
\[\mathbb{E}[\mathbf{s}(\boldsymbol{\theta}) \mathbf{s}(\boldsymbol{\theta})^T] = - \mathbb{E}[\mathbf{H}(\boldsymbol{\theta})]\]
Therefore, Form 1 (the variance of the score) is proven to be equivalent to Form 2 (the negative expected Hessian).
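A Monte Carlo check of this equivalence for a \(\mathcal N(\mu, \sigma^2)\) model with \(\boldsymbol{\theta} = (\mu, \sigma^2)\) (illustrative values; the score and Hessian components used here are derived in the (**) section below):
set.seed(42)
mu <- 1; v <- 2                                    # v = sigma^2
x <- rnorm(1e6, mean = mu, sd = sqrt(v))
s_mu <- (x - mu) / v                               # score wrt mu
s_v  <- ((x - mu)^2 - v) / (2 * v^2)               # score wrt sigma^2
E_ssT <- cov(cbind(s_mu, s_v))                     # ~ E[s s^T], since the score means are ~0
h22 <- 1 / (2 * v^2) - (x - mu)^2 / v^3            # Hessian entries of the log-likelihood
h12 <- -(x - mu) / v^2
neg_E_H <- -matrix(c(-1 / v, mean(h12),
                     mean(h12), mean(h22)), nrow = 2)
E_ssT                                              # both ~ diag(1/v, 1/(2 v^2))
neg_E_H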
(*) Start with the Logarithm Identity: We can always write a positive function as \(p = e^{\log p}\).
Apply the Gradient (Chain Rule): Take the gradient of both sides with respect to \(\boldsymbol{\theta}\):
\[\nabla_{\boldsymbol{\theta}} p = \nabla_{\boldsymbol{\theta}} \left( e^{\log p} \right)\] Differentiate the Exponential Function:Using the chain rule, the derivative of \(e^u\) is \(e^u\) times the derivative of the inner function \(u\):
\[\nabla_{\boldsymbol{\theta}} p = e^{\log p} \cdot \left( \nabla_{\boldsymbol{\theta}} \log p \right)\] Substitute Back \(p\): Since \(e^{\log p}\) is simply \(p\) itself:
\[\nabla_{\boldsymbol{\theta}} p = p \cdot \nabla_{\boldsymbol{\theta}} \log p\]
(**) The Fisher information metric \(g_{ij}(\theta)\) is a fundamental concept in Information Geometry, representing the Riemannian metric on the manifold of probability distributions parameterized by \(\boldsymbol{\theta}\). It is derived from the Fisher Information Matrix (FIM), \(I(\boldsymbol{\theta})\).
The derivation involves two main steps:
1. Find the log-likelihood function \(\ln p(x|\boldsymbol{\theta})\) for the Gaussian distribution.
2. Calculate the Fisher Information Matrix, which is the expected value of the negative second partial derivatives (Hessian) of the log-likelihood function.
The general formula for the elements of the FIM is:
\[I_{ij}(\boldsymbol{\theta}) = E\left[ \frac{\partial \ln p(x|\boldsymbol{\theta})}{\partial \theta_i} \frac{\partial \ln p(x|\boldsymbol{\theta})}{\partial \theta_j} \right] = -E\left[ \frac{\partial^2 \ln p(x|\boldsymbol{\theta})}{\partial \theta_i \partial \theta_j} \right]\]
The Fisher metric is then defined as \(g_{ij}(\boldsymbol{\theta}) = I_{ij}(\boldsymbol{\theta})\).
Here is the derivation for the univariate Gaussian distribution with parameters \(\boldsymbol{\theta} = (\mu, \sigma^2)\), where \(\mu\) is the mean and \(\sigma^2\) is the variance.
The probability density function (PDF) for the univariate Gaussian is:
\[p(x|\boldsymbol{\theta}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\]
We set the parameter vector as \(\boldsymbol{\theta} = (\theta_1, \theta_2) = (\mu, \sigma^2)\).
Step 1: Log-Likelihood Function
The natural logarithm of the PDF is:
\[\ln p(x|\boldsymbol{\theta}) = -\frac{1}{2} \ln(2\pi) - \frac{1}{2} \ln(\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2}\]
Step 2: First Partial Derivatives (The Score)
Calculate the first partial derivatives with respect to \(\theta_1 = \mu\) and \(\theta_2 = \sigma^2\):
\[\frac{\partial \ln p}{\partial \mu} = 0 - 0 - \frac{1}{2\sigma^2} \cdot 2(x-\mu)(-1) = \frac{x-\mu}{\sigma^2}\] \[\frac{\partial \ln p}{\partial \sigma^2} = 0 - \frac{1}{2\sigma^2} - \frac{(x-\mu)^2}{2} \cdot \frac{\partial}{\partial \sigma^2} \left(\frac{1}{\sigma^2}\right) = -\frac{1}{2\sigma^2} + \frac{(x-\mu)^2}{2(\sigma^2)^2} = \frac{(x-\mu)^2 - \sigma^2}{2(\sigma^2)^2}\]
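A finite-difference spot check of these two score components at arbitrary values of \(x\), \(\mu\) and \(\sigma^2\) (an illustrative sketch):
logp <- function(x, mu, v) -0.5 * log(2 * pi) - 0.5 * log(v) - (x - mu)^2 / (2 * v)
x <- 1.3; mu <- 0.4; v <- 2; h <- 1e-6
c(analytic = (x - mu) / v,                         # d log p / d mu
  numeric  = (logp(x, mu + h, v) - logp(x, mu - h, v)) / (2 * h))
c(analytic = ((x - mu)^2 - v) / (2 * v^2),         # d log p / d sigma^2
  numeric  = (logp(x, mu, v + h) - logp(x, mu, v - h)) / (2 * h))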
Step 3: Second Partial Derivatives
Calculate the second partial derivatives with respect to the mean \(\mu\):
\[\frac{\partial^2 \ln p}{\partial \mu^2} = \frac{\partial}{\partial \mu} \left(\frac{x-\mu}{\sigma^2}\right) = -\frac{1}{\sigma^2}\]
and wrt the variance \(\sigma^2\)
\[\small\frac{\partial^2 \ln p}{\partial (\sigma^2)^2} = \frac{\partial}{\partial \sigma^2} \left( \frac{(x-\mu)^2 - \sigma^2}{2(\sigma^2)^2} \right) = \frac{\partial}{\partial \sigma^2} \left( \frac{(x-\mu)^2}{2(\sigma^2)^2} - \frac{1}{2\sigma^2} \right) = \frac{(x-\mu)^2}{2} (-2(\sigma^2)^{-3}) - \frac{1}{2} (-1(\sigma^2)^{-2}) = \frac{1}{2(\sigma^2)^2} - \frac{(x-\mu)^2}{(\sigma^2)^3}\]
\[\frac{\partial^2 \ln p}{\partial \mu \partial \sigma^2} = \frac{\partial}{\partial \sigma^2} \left(\frac{x-\mu}{\sigma^2}\right) = (x-\mu) (-1(\sigma^2)^{-2}) = -\frac{x-\mu}{(\sigma^2)^2}\]
Step 4: Expected Value (The Fisher Metric)
The Fisher metric \(g\) is the negative expected value of the Hessian matrix \(H\): \(g = -E[H]\). Recall that for a Gaussian distribution \(X \sim \mathcal{N}(\mu, \sigma^2)\), we have \(E[X] = \mu\), \(E[X-\mu] = 0\), and \(E[(X-\mu)^2] = \sigma^2\). Component \(g_{11}\) (for \(\mu, \mu\)):
\[g_{11} = -E\left[ -\frac{1}{\sigma^2} \right] = \frac{1}{\sigma^2}\] Component \(g_{22}\) (for \(\sigma^2, \sigma^2\)):
\[g_{22} = -E\left[ \frac{1}{2(\sigma^2)^2} - \frac{(x-\mu)^2}{(\sigma^2)^3} \right] = -\left( \frac{1}{2(\sigma^2)^2} - \frac{E[(x-\mu)^2]}{(\sigma^2)^3} \right)\] \[g_{22} = -\left( \frac{1}{2(\sigma^2)^2} - \frac{\sigma^2}{(\sigma^2)^3} \right) = -\left( \frac{1}{2\sigma^4} - \frac{1}{\sigma^4} \right) = \frac{1}{2\sigma^4}\] Component \(g_{12} = g_{21}\) (for \(\mu, \sigma^2\)):
\[g_{12} = -E\left[ -\frac{x-\mu}{(\sigma^2)^2} \right] = \frac{E[x-\mu]}{(\sigma^2)^2} = \frac{0}{(\sigma^2)^2} = 0\]
Result: The Fisher Metric for Univariate Gaussian
The Fisher metric for the univariate Gaussian parameterized by \((\mu, \sigma^2)\) is:
\[\mathbf{g}(\mu, \sigma^2) = \begin{pmatrix} g_{11} & g_{12} \\ g_{21} & g_{22} \end{pmatrix} = \begin{pmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{1}{2\sigma^4} \end{pmatrix}\]
This diagonal form shows that the parameters \(\mu\) and \(\sigma^2\) are orthogonal in the sense of the Fisher metric.
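As a consistency check against the \((\mu, \sigma)\) form of the metric quoted earlier, \(\operatorname{diag}(1/\sigma^2,\, 2/\sigma^2)\), the metric transforms under the reparameterization \((\mu, \sigma) \mapsto (\mu, \sigma^2)\) with the Jacobian \(J\): \(G_F = J^\top \mathbf{g}\, J\). A short R sketch (with an arbitrary \(\sigma\)):
s <- 1.7                                     # arbitrary sigma > 0
g_var <- diag(c(1 / s^2, 1 / (2 * s^4)))     # metric in (mu, sigma^2) coordinates
J <- diag(c(1, 2 * s))                       # Jacobian of (mu, sigma) -> (mu, sigma^2)
t(J) %*% g_var %*% J                         # pullback; equals diag(1/s^2, 2/s^2)
diag(c(1 / s^2, 2 / s^2))                    # the (mu, sigma) metric G_F from above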
NOTE: These are tentative notes on different topics for personal use - expect mistakes and misunderstandings.