The anatomy of a neural network is as follows:
The index notation of a neural network is as follows:
Notice that \(\large x_1 =a_1^{(1)}\) and that \(\large x_2 = a_2^{(1)}.\)
\(L\) is the total number of layers. \(s_l\) is the number of units (excluding the bias unit) in layer \(l\).
\(a_j^{(l)}\) is the activation of node \(j\) in layer \(l\).
##### Example with regression (after this YouTube video):
We need to calculate \(\large \frac {\partial J}{\partial \Theta}\). The weights \(\large \Theta\) are spread across two matrices, \(\large \Theta^{(1)}\) and \(\large \Theta^{(2)}\):
\(\color{red}{\large \Theta^{(1)}=\begin{bmatrix}\theta_{11}^{(1)} & \theta_{12}^{(1)} & \theta_{13}^{(1)}\\\theta_{21}^{(1)} & \theta_{22}^{(1)} & \theta_{23}^{(1)} \end{bmatrix}}\) and \(\color{green}{\large \Theta^{(2)}=\begin{bmatrix}\theta_{11}^{(2)} \\ \theta_{21}^{(2)} \\ \theta_{31}^{(2)} \end{bmatrix}}\)
We need to calculate:
\(\color{red}{\large \frac{\partial J}{\partial \Theta^{(1)}} = \begin{bmatrix}\frac{\partial J}{\partial \theta_{11}^{(1)}} & \frac{\partial J}{\partial \theta_{12}^{(1)}} & \frac{\partial J}{\partial \theta_{13}^{(1)}}\\ \frac{\partial J}{\partial \theta_{21}^{(1)}} & \frac{\partial J}{\partial \theta_{22}^{(1)}} & \frac{\partial J}{\partial \theta_{23}^{(1)}}\end{bmatrix}}\) and \(\color{green}{\large \frac {\partial J}{\partial \Theta^{(2)}}=\begin{bmatrix}\frac{\partial J}{\partial\theta_{11}^{(2)}} \\ \frac{\partial J}{\partial\theta_{21}^{(2)}} \\ \frac{\partial J}{\partial\theta_{31}^{(2)}} \end{bmatrix}}\)
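As a quick NumPy sketch of the shapes involved (the random values are just placeholders for this 2-input, 3-hidden-unit, 1-output example):

```python
import numpy as np

# Network from the example: 2 inputs, 3 hidden units, 1 output.
rng = np.random.default_rng(0)
Theta1 = rng.standard_normal((2, 3))  # theta_ij^(1): input i -> hidden unit j
Theta2 = rng.standard_normal((3, 1))  # theta_i1^(2): hidden unit i -> output

# The gradients dJ/dTheta must have the same shapes as the weights themselves.
print(Theta1.shape)  # (2, 3)
print(Theta2.shape)  # (3, 1)
```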
The cost function sums the cost over all training examples. In what follows we will use the cost function of a regression model as \(\frac{1}{2}\sum (y-\hat y)^2\); averaging as \(\frac{1}{2m}\sum (y-\hat y)^2\) would be more accurate, but the constant factor does not change the derivation.
\(\large \color{green}{\frac{\partial J}{\partial \Theta^{(2)}}}=\frac{\partial \sum \frac{1}{2}(y-\hat y)^2}{\partial \Theta^{(2)}}=\sum \frac{\partial \frac{1}{2}(y-\hat y)^2}{\partial \Theta^{(2)}}\) by the sum rule of differentiation, which says that \(\frac{d}{dx}(u+v)=\frac{du}{dx}+\frac{dv}{dx}\).
Focusing on the expression inside the sum, we apply the chain rule, \(\large(f\circ g)'=(f'\circ g)\cdot g'\), or equivalently \(\large\frac{dz}{dx}=\frac{dz}{dy}\cdot\frac{dy}{dx}\):
\(\large \color{green}{\frac{\partial J}{\partial \Theta^{(2)}}}=\frac{\partial J}{\partial \hat y}\frac{\partial \hat y}{\partial \Theta^{(2)}}=\frac{\partial \left(\frac{1}{2}\sum (y-\hat y)^2\right)}{\partial \hat y}\frac{\partial \hat y}{\partial \Theta^{(2)}}=\color{fuchsia}{(y-\hat y)}\cdot \frac{\color{fuchsia}{-}\partial \hat y}{\partial \Theta^{(2)}}\)
Since \(\hat y\) is the sigmoid activation applied to \(z^{(3)}\), that is \(\hat y = g\big(z^{(3)}\big)\), we can apply the chain rule again:
\(\large \color{green}{\frac{\partial J}{\partial \Theta^{(2)}}}=(y-\hat y)\cdot \frac{-\partial \hat y}{\partial z^{(3)}}\cdot \frac{\partial z^{(3)}}{\partial \Theta^{(2)}}=\color{fuchsia}{-(y-\hat y)}\cdot \color{blue}{g'(z^{(3)})} \cdot \color{lime}{\frac{\partial z^{(3)}}{\partial \Theta^{(2)}}}\tag 1\)
The derivative of the sigmoid activation function with respect to \(z\) is:
\(\large \color{blue}{g'(z)}=\frac{e^{-z}}{\left(1+e^{-z}\right)^2}=\left(\frac{1}{1+e^{-z}}\right)\times \left(1 - \frac{1}{1+e^{-z}}\right)=\color{blue}{\hat y\left(1-\hat y\right)}\) (see embedded mathy picture below for derivation).
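As a quick sanity check of this identity, here is a minimal NumPy sketch comparing \(g'(z)\) against a central finite difference (the step size is arbitrary):

```python
import numpy as np

def g(z):
    """Sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))

def g_prime(z):
    """Derivative of the sigmoid, using g'(z) = g(z) * (1 - g(z))."""
    return g(z) * (1.0 - g(z))

z = np.linspace(-5, 5, 11)
h = 1e-6
finite_diff = (g(z + h) - g(z - h)) / (2 * h)  # central-difference approximation
print(np.allclose(g_prime(z), finite_diff))    # True
```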
Now, \(\large z^{(3)}=a^{(2)}\Theta^{(2)}\), and hence \(a^{(2)}\) is the slope of \(z^{(3)}\) with respect to \(\Theta^{(2)}\) for each synapse. Therefore \(\large \color{lime}{\frac{\partial z^{(3)}}{\partial \Theta^{(2)}}=\begin{bmatrix}a_{11}^{(2)}&a_{12}^{(2)}&a_{13}^{(2)}\\a_{21}^{(2)}&a_{22}^{(2)}&a_{23}^{(2)}\\a_{31}^{(2)}&a_{32}^{(2)}&a_{33}^{(2)}\end{bmatrix}}\)
\(\large \color{green}{\frac{\partial J}{\partial \Theta^{(2)}}}=\color{fuchsia}{-(y-\hat y)}\cdot \color{blue}{\left (\hat y\,(1-\hat y)\right)}\cdot \color{lime}{a^{(2)}}\tag{*}\)
In Hinton’s Coursera course on neural networks, this expression (*) is written as:
\[\frac{\partial E}{\partial w_i}=\sum_n \frac{\partial y^n}{\partial w_i}\color{orange}{\frac{\partial E}{\partial y^n}}=\color{orange}{-}\sum_n \color{red}{x_i^n}\,\color{blue}{y^n \,(1-y^n)}\,\color{orange}{(t^n-y^n)}\]
where \(E\) stands for the error, \(w_i\) is weight \(i\), \(n\) indexes the examples in the training set, and \(t^n\) is the target of training case \(n\).
because,
\[\frac{\partial y^n}{\partial w_i}=\color{red}{\frac{\partial z}{\partial w_i}}\color{blue}{\frac{dy}{dz}}=\color{red}{x_i^n}\,\color{blue}{y^n(1-y^n)}\]
the latter part, in blue, is the sigmoid derivative derived above;
and because
\[E=\frac{1}{2}\displaystyle\sum_{n\in\text{ex's train set}}(t^n-y^n)^2\]
and the derivative with respect to \(y^n\) is:
\[\color{orange}{\frac{\partial E}{\partial y^n}=-(t^n-y^n)}.\]
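Here is a minimal sketch of Hinton’s per-weight formula as an explicit loop over weights and examples, checked against the vectorized form (the toy data and weights are made up):

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up toy data: 4 examples, 2 features, plus a weight vector.
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.5]])
t = np.array([1.0, 0.0, 1.0, 0.5])  # targets t^n
w = np.array([0.1, -0.2])

y = g(X @ w)  # predictions y^n

# dE/dw_i = -sum_n x_i^n * y^n (1 - y^n) * (t^n - y^n)
grad = np.zeros_like(w)
for i in range(len(w)):
    for n in range(len(t)):
        grad[i] += -X[n, i] * y[n] * (1.0 - y[n]) * (t[n] - y[n])

# The same thing vectorized:
grad_vec = -X.T @ (y * (1.0 - y) * (t - y))
print(np.allclose(grad, grad_vec))  # True
```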
Going back to Eq. 1:
So we end up with:
\[\begin{align}\large \color{green}{\frac{\partial J}{\partial \Theta^{(2)}}}&=\color{gray}{(y-\hat y)}\cdot \color{purple}{\frac{-\partial \hat y}{\partial z^{(3)}}}\cdot \frac{\partial z^{(3)}}{\partial \Theta^{(2)}}\\[2ex]&=\color{gray}{(y-\hat y)}\cdot \color{purple}{\left(-g'(z^{(3)})\right)}\cdot \frac{\partial z^{(3)}}{\partial \Theta^{(2)}}\\[2ex]&=\color{orange}{\delta^{(3)}}\frac{\partial z^{(3)}}{\partial \Theta^{(2)}}\\[2ex]&=\left(a^{(2)}\right)^\top\,\color{orange}{\delta^{(3)}}\\[2ex]&=\large \begin{bmatrix}a_{11}^{(2)}&a_{21}^{(2)}&a_{31}^{(2)}\\a_{12}^{(2)}&a_{22}^{(2)}&a_{32}^{(2)}\\a_{13}^{(2)}&a_{23}^{(2)}&a_{33}^{(2)}\end{bmatrix} \begin{bmatrix}\delta^{(3)}_1\\\delta^{(3)}_2\\\delta^{(3)}_3\end{bmatrix}=\begin{bmatrix}a_{11}^{(2)}\delta^{(3)}_1+a_{21}^{(2)}\delta^{(3)}_2+a_{31}^{(2)}\delta^{(3)}_3\\a_{12}^{(2)}\delta^{(3)}_1+a_{22}^{(2)}\delta^{(3)}_2+a_{32}^{(2)}\delta^{(3)}_3\\a_{13}^{(2)}\delta^{(3)}_1+a_{23}^{(2)}\delta^{(3)}_2+a_{33}^{(2)}\delta^{(3)}_3\end{bmatrix}\end{align}\]
Then \(\large\color{orange}{\delta^{(3)}}=\color{gray}{-(y-\hat y)}\cdot \color{purple}{g'(z^{(3)})}=(\hat y - y)\cdot g'(z^{(3)})\)
Therefore, the matrix multiplication also takes care of the \(\sum\) over training examples in front of the first derivative formula.
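In NumPy terms, with made-up values in the 3-example layout above, this is a single matrix product:

```python
import numpy as np

# Shapes from the example: 3 training examples, 3 hidden units, 1 output.
a2 = np.array([[0.2, 0.5, 0.9],
               [0.1, 0.4, 0.8],
               [0.7, 0.3, 0.6]])            # a^(2): one row per example
delta3 = np.array([[0.1], [-0.2], [0.05]])  # delta^(3): one entry per example

# (a^(2))^T @ delta^(3): sums over examples, giving one gradient per weight.
dJ_dTheta2 = a2.T @ delta3
print(dJ_dTheta2.shape)  # (3, 1), same shape as Theta2
```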
Notice that the more general definition of \(\delta\) outside the last layer is:
\(\bbox[10px,border:2px solid purple]{\large \delta^{(l)}=\left(\Theta^{(l)}\right)^\top\delta^{(l+1)}\,g'(z^{(l)})}\) with \(g'(z^{(l)}) = a^{(l)}\cdot (1 - a^{(l)})\), where the product with \(g'(z^{(l)})\) is element-wise.
As for…
\(\large \color{red}{\large \frac{\partial J}{\partial \Theta^{(1)}}}=\color{orange}{\delta^{(3)}}\frac{\partial z^{(3)}}{\partial a^{(2)}}\frac{\partial a^{(2)}}{\partial \Theta^{(1)}}=\color{orange}{\delta^{(3)}} \left(\Theta^{(2)}\right)^\top \frac{\partial a^{(2)}}{\partial \Theta^{(1)}}\)
since, \(\large\frac{\partial z^{(3)}}{\partial a^{(2)}}=\Theta^{(2)}\).
Next,
\(\large \color{red}{\large \frac{\partial J}{\partial \Theta^{(1)}}}=\color{orange}{\delta^{(3)}} \left(\Theta^{(2)}\right)^\top \frac{\partial a^{(2)}}{\partial \Theta^{(1)}}=\color{orange}{\delta^{(3)}} \left(\Theta^{(2)}\right)^\top \frac{\partial a^{(2)}}{\partial z^{(2)}}\frac{\partial z^{(2)}}{\partial \Theta^{(1)}}=\color{orange}{\delta^{(3)}} \left(\Theta^{(2)}\right)^\top g'(z^{(2)})\frac{\partial z^{(2)}}{\partial \Theta^{(1)}}\)
Finally,
\(\large\frac{\partial z^{(2)}}{\partial \Theta^{(1)}}=X\). Therefore,
\(\large \color{red}{\large \frac{\partial J}{\partial \Theta^{(1)}}}=X^\top \color{orange}{\delta^{(3)}} \left(\Theta^{(2)}\right)^\top \color{purple}{g'(z^{(2)})}=X^\top \color{brown}{\delta^{(2)}}.\)
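Putting both gradients together, a minimal NumPy sketch of the forward and backward pass for this small regression network (data and weights are random placeholders; sigmoid activations throughout, squared-error cost as above):

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.random((3, 2))               # 3 examples, 2 inputs
y = rng.random((3, 1))               # targets
Theta1 = rng.standard_normal((2, 3))
Theta2 = rng.standard_normal((3, 1))

# Forward pass
z2 = X @ Theta1
a2 = g(z2)
z3 = a2 @ Theta2
y_hat = g(z3)

# Backward pass, following the derivation above
delta3 = -(y - y_hat) * y_hat * (1.0 - y_hat)   # delta^(3) = -(y - yhat) g'(z3)
delta2 = (delta3 @ Theta2.T) * a2 * (1.0 - a2)  # delta^(2)
dJ_dTheta2 = a2.T @ delta3                      # (a^(2))^T delta^(3)
dJ_dTheta1 = X.T @ delta2                       # X^T delta^(2)
```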
The cost function in logistic regression is:
\[\large J(\theta)=-\frac{1}{m}\displaystyle\sum_{i=1}^m \left[y^{(i)} \log\left(h_\theta(x^{(i)})\right) + \left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right)\right]\]
For the case of a multi-class logistic regression NN the cost function is:
\[\large J(\Theta)=-\frac{1}{m}\left[\displaystyle\sum_{i=1}^m \sum_{k=1}^K y_k^{(i)} \log\left(h_\Theta(x^{(i)})\right)_k + \left(1-y_k^{(i)}\right)\log\left(1-\left(h_\Theta\left(x^{(i)}\right)\right)_k\right) \right]+\frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(\Theta_{ji}^{(l)}\right)^2\]
What follows is explained in this video.
\(K\) is the number of classes or labels (output nodes).
\(L\) is the total number of layers in the network.
\(s_l\) is the number of units (without bias unit) in layer \(l\).
\(m\) is the number of examples or rows of the input matrix.
\(\lambda\) is the regularization constant.
The number of columns, \(i\), in our current theta matrix (nothing to do with the training-example index notation) is equal to the number of nodes in our current layer including the bias unit, i.e. \(s_l + 1\) columns, indexed \(i=0\) to \(s_{l}\). The number of rows, \(j\), in our current theta matrix is equal to the number of nodes in the next layer (excluding the bias unit): \(s_{l+1}\).
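As a sketch, the cost above transcribes almost directly into NumPy (assuming \(H\) is the \(m\times K\) matrix of outputs \(h_\Theta(x^{(i)})_k\), \(Y\) the one-hot label matrix, and bias weights stored in column 0 of each theta matrix; these are assumptions, not the course’s own code):

```python
import numpy as np

def nn_cost(H, Y, Thetas, lam):
    """Regularized cross-entropy cost J(Theta) for a multi-class network.

    H:      (m, K) matrix of hypothesis outputs h_Theta(x^(i))_k
    Y:      (m, K) one-hot matrix of labels y_k^(i)
    Thetas: list of weight matrices, bias weights in column 0
    lam:    regularization constant lambda
    """
    m = Y.shape[0]
    cost = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    # Regularize every weight except the bias column.
    reg = sum(np.sum(T[:, 1:] ** 2) for T in Thetas) * lam / (2 * m)
    return cost + reg
```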
In backpropagation we minimize \(J(\Theta)\) by computing its partial derivatives
\[\frac{\partial J(\Theta)}{\partial \Theta_{i,j}^{(l)}}.\]
\(\delta_j^{(l)}\) is the “error” of node \(j\) in layer \(l\).
For the last layer:
\(\large \delta^{(L)}= a^{(L)}-y\) with \(L\) being the total number of layers, and \(a^{(L)}\) the vector of outputs of the activation units in the last layer. So it’s simply the difference between the network’s output (the result of \(g(z^{(L)})\), i.e. \(h_\Theta(x)\)) and the actual value \(y\) of the training example.
The other layers are calculated as:
\(\large \delta^{(l)}=\left(\Theta^{(l)}\right)^\top\delta^{(l+1)}\,g'(z^{(l)})\) with \(g'(z^{(l)}) = a^{(l)}\cdot (1 - a^{(l)})\).
stopping at \(\delta^{(2)}\), since there is no error term for the input layer.
We can compute our partial derivative terms by multiplying our activation values and our error values for each training example \(t\):
\[\large \dfrac{\partial J(\Theta)}{\partial \Theta_{i,j}^{(l)}} = \frac{1}{m}\sum_{t=1}^m a_j^{(t)(l)} {\delta}_i^{(t)(l+1)}\]
Backpropagation Algorithm:
Given training set \(\{ (x^{(1)}, y^{(1)}), \cdots, (x^{(m)}, y^{(m)}) \}\):
Set \(\Delta_{i,j}^{(l)}:=0\) for all \((l, i, j)\).
For training example \(t=1\) to \(m\):
1. Set $a^{(1)}:=x^{(t)}$
2. Perform forward propagation to compute $a^{(l)}$ for $l=2,3,\cdots, L$.
3. Using $y^{(t)}$ compute $\delta^{(L)}=a^{(L)}-y^{(t)}$.
4. Compute $\delta^{(L-1)}, \delta^{(L-2)},\cdots,\delta^{(2)}$ using $\delta^{(l)} = ((\Theta^{(l)})^T \delta^{(l+1)}) \cdot a^{(l)}\cdot (1 - a^{(l)})$.
5. $\Delta^{(l)}_{i,j} := \Delta^{(l)}_{i,j} + a_j^{(l)} \delta_i^{(l+1)}$ or vectorized: $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)}(a^{(l)})^T$.
After the loop over all \(m\) examples, compute: \(D^{(l)}_{i,j} := \dfrac{1}{m}\left(\Delta^{(l)}_{i,j} + \lambda\Theta^{(l)}_{i,j}\right)\) if \(j \neq 0\).
\(D^{(l)}_{i,j} := \dfrac{1}{m}\Delta^{(l)}_{i,j}\) if \(j=0\).
The capital-delta matrix is used as an “accumulator” to add up our values as we go along and eventually compute our partial derivative. It can be proven that:
\(\large D_{i,j}^{(l)} = \dfrac{\partial J(\Theta)}{\partial \Theta_{i,j}^{(l)}}.\)
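A sketch of the whole algorithm in NumPy (assuming sigmoid activations, one-hot labels \(Y\), and bias weights in column 0 of each \(\Theta^{(l)}\); the function and variable names are illustrative, not from the course):

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(X, Y, Thetas, lam):
    """One pass of the backpropagation algorithm above.

    X: (m, n) inputs; Y: (m, K) one-hot labels.
    Thetas: list of weight matrices Theta^(1)..Theta^(L-1);
            Theta^(l) has shape (s_{l+1}, s_l + 1), bias weights in column 0.
    Returns the gradient matrices D^(l).
    """
    m = X.shape[0]
    Deltas = [np.zeros_like(T) for T in Thetas]

    for t in range(m):
        # 1.-2. Forward propagation, storing every activation a^(l).
        a = X[t]
        activations = []
        for T in Thetas:
            a = np.concatenate(([1.0], a))   # prepend the bias unit
            activations.append(a)
            a = g(T @ a)
        activations.append(a)                # a^(L); no bias on the output

        # 3. Error of the output layer: delta^(L) = a^(L) - y.
        delta = a - Y[t]

        # 4.-5. Walk backwards, accumulating Delta^(l) += delta^(l+1) (a^(l))^T.
        for l in range(len(Thetas) - 1, -1, -1):
            Deltas[l] += np.outer(delta, activations[l])
            if l > 0:
                a_l = activations[l]
                delta = (Thetas[l].T @ delta) * a_l * (1.0 - a_l)
                delta = delta[1:]            # drop the bias-unit error

    # D^(l): average, regularizing all but the bias column (j = 0).
    Ds = []
    for T, Delta in zip(Thetas, Deltas):
        D = Delta / m
        D[:, 1:] += (lam / m) * T[:, 1:]
        Ds.append(D)
    return Ds
```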
From the Coursera Wiki:
Now, time to evaluate these derivatives:
\[\large \frac{\partial J(\theta)}{\partial \theta^{(L-1)}} = \delta^{(L)} \frac{\partial z^{(L)}}{\partial \theta^{(L-1)}}\] using \[\large \delta^{(L)} = \frac{\partial J(\theta)}{\partial a^{(L)}} \frac{\partial a^{(L)}}{\partial z^{(L)}}.\]
We need to evaluate both partial derivatives.
Given \(\large J(\theta) = -y\log(a^{(L)}) - (1-y)\log(1-a^{(L)})\), where \(a^{(L)} = h_{\theta}(x)\), the partial derivative is: \[\large\frac{\partial J(\theta)}{\partial a^{(L)}} = \frac{y-1}{a^{(L)}-1} - \frac{y}{a^{(L)}}\]
And given \(a = g(z)\), where \(g = \frac{1}{1+e^{-z}}\), the partial derivative is:
\[\large \frac{\partial a^{(L)}}{\partial z^{(L)}} = a^{(L)}(1-a^{(L)})\]
\[\large\delta^{(L)} = \frac{\partial J(\theta)}{\partial a^{(L)}} \frac{\partial a^{(L)}}{\partial z^{(L)}}\]
\[\large\delta^{(L)} = \left(\frac{y-1}{a^{(L)}-1} - \frac{y}{a^{(L)}}\right) \left(a^{(L)}(1-a^{(L)})\right)\]
\[\large\delta^{(L)} = a^{(L)} - y\]
So, for a 3-layer network (\(L=3\)), \(\delta^{(3)} = a^{(3)} - y\).
Note that this is the correct equation, as given in our notes.
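The algebra above can be verified symbolically, e.g. with this small SymPy sketch:

```python
import sympy as sp

a, y = sp.symbols('a y')

# ((y-1)/(a-1) - y/a) * a*(1-a) should simplify to a - y.
delta_L = ((y - 1)/(a - 1) - y/a) * a * (1 - a)
print(sp.simplify(delta_L))  # a - y
```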
Now, given \(z = \theta \times \text{input}\), and in layer \(L\) the input is \(a^{(L-1)}\), the partial derivative is: \[\large \frac{\partial z^{(L)}}{\partial \theta^{(L-1)}} = a^{(L-1)}\]
Put it together for the output layer:
\[\large\frac{\partial J(\theta)}{\partial \theta^{(L-1)}} = \delta^{(L)} \frac{\partial z^{(L)}}{\partial \theta^{(L-1)}}\]
\[\large\frac{\partial J(\theta)}{\partial \theta^{(L-1)}} = (a^{(L)} - y)\,(a^{(L-1)})\]
Let’s continue on for the hidden layer (let’s assume we only have 1 hidden layer):
\[\large\frac{\partial J(\theta)}{\partial \theta^{(L-2)}} = \delta^{(L-1)} \frac{\partial z^{(L-1)}}{\partial \theta^{(L-2)}}\]
Let’s figure out \(\delta^{(L-1)}\). Once again, given \(z = \theta \times\text{input}\), the partial derivative is:
\[\large \frac{\partial z^{(L)}}{\partial a^{(L-1)}} = \theta^{(L-1)}\]
And:
\[\large\frac{\partial a^{(L-1)}}{\partial z^{(L-1)}} = a^{(L-1)}(1-a^{(L-1)})\]
So, let’s substitute these in for \(\delta^{(L-1)}\):
\[\large\delta^{(L-1)} = \delta^{(L)} \frac{\partial z^{(L)}}{\partial a^{(L-1)}} \frac{\partial a^{(L-1)}}{\partial z^{(L-1)}}\]
\[\large\delta^{(L-1)} = \delta^{(L)}\, \theta^{(L-1)}\, a^{(L-1)}(1-a^{(L-1)})\]
So, for a 3-layer network,
\[\large\delta^{(2)} = \delta^{(3)}\, \theta^{(2)}\, a^{(2)}(1-a^{(2)})\]
Put it together for the [last] hidden layer:
\[\large\frac{\partial J(\theta)}{\partial \theta^{(L-2)}} = \delta^{(L-1)} \frac{\partial z^{(L-1)}}{\partial \theta^{(L-2)}}\]
\[\large\frac{\partial J(\theta)}{\partial \theta^{(L-2)}} = \left(\delta^{(L)} \frac{\partial z^{(L)}}{\partial a^{(L-1)}} \frac{\partial a^{(L-1)}}{\partial z^{(L-1)}}\right)(a^{(L-2)})\]
\[\large\frac{\partial J(\theta)}{\partial \theta^{(L-2)}} = \left((a^{(L)} - y)(\theta^{(L-1)})(a^{(L-1)}(1-a^{(L-1)}))\right)(a^{(L-2)})\]
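Finally, these analytic gradients can be validated numerically against a finite-difference approximation of the cost. A minimal NumPy sketch (ignoring bias units and regularization; names and data are made up):

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(X, y, Theta1, Theta2):
    """Cross-entropy cost of the 3-layer network (no bias, no regularization)."""
    a2 = g(X @ Theta1)
    a3 = g(a2 @ Theta2)
    return -np.sum(y * np.log(a3) + (1 - y) * np.log(1 - a3))

rng = np.random.default_rng(1)
X = rng.random((4, 2))
y = rng.integers(0, 2, (4, 1)).astype(float)
Theta1 = rng.standard_normal((2, 3))
Theta2 = rng.standard_normal((3, 1))

# Analytic gradients, per the derivation above.
a2 = g(X @ Theta1)
a3 = g(a2 @ Theta2)
delta3 = a3 - y                               # delta^(3) = a^(3) - y
delta2 = (delta3 @ Theta2.T) * a2 * (1 - a2)  # delta^(2)
grad2 = a2.T @ delta3
grad1 = X.T @ delta2

# Finite-difference check on one entry of Theta1.
eps = 1e-6
T1p, T1m = Theta1.copy(), Theta1.copy()
T1p[0, 0] += eps
T1m[0, 0] -= eps
numeric = (cost(X, y, T1p, Theta2) - cost(X, y, T1m, Theta2)) / (2 * eps)
print(np.isclose(grad1[0, 0], numeric))  # True
```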
NOTE: These are tentative notes on different topics for personal use - expect mistakes and misunderstandings.