The anatomy of a neural network is as follows:
The index notation of a neural network is as follows:
Notice that \(\large x_1 =a_1^{(1)}\) and that \(\large x_2 = a_2^{(1)}.\)
\(L\) is the total number of layers. \(s_l\) is the number of units (excluding the bias unit) in layer \(l\).
\(a_j^{(l)}\) is the activation of node \(j\) in layer \(l\).
##### Example with regression (after this YouTube video):
We need to calculate \(\large \frac {\partial J}{\partial \Theta}\). The weights \(\large \Theta\) are spread across two matrices, \(\large \Theta^{(1)}\) and \(\large \Theta^{(2)}\):
\(\color{red}{\large \Theta^{(1)}=\begin{bmatrix}\theta_{11}^{(1)} & \theta_{12}^{(1)} & \theta_{13}^{(1)}\\\theta_{21}^{(1)} & \theta_{22}^{(1)} & \theta_{23}^{(1)} \end{bmatrix}}\) and \(\color{green}{\large \Theta^{(2)}=\begin{bmatrix}\theta_{11}^{(2)} \\ \theta_{21}^{(2)} \\ \theta_{31}^{(2)} \end{bmatrix}}\)
We need to calculate:
\(\color{red}{\large \frac{\partial J}{\partial \Theta^{(1)}} = \begin{bmatrix}\frac{\partial J}{\partial \theta_{11}^{(1)}} & \frac{\partial J}{\partial \theta_{12}^{(1)}} & \frac{\partial J}{\partial \theta_{13}^{(1)}}\\ \frac{\partial J}{\partial \theta_{21}^{(1)}} & \frac{\partial J}{\partial \theta_{22}^{(1)}} & \frac{\partial J}{\partial \theta_{23}^{(1)}}\end{bmatrix}}\) and \(\color{green}{\large \frac {\partial J}{\partial \Theta^{(2)}}=\begin{bmatrix}\frac{\partial J}{\partial\theta_{11}^{(2)}} \\ \frac{\partial J}{\partial\theta_{21}^{(2)}} \\ \frac{\partial J}{\partial\theta_{31}^{(2)}} \end{bmatrix}}\)
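As a quick NumPy sketch of the shapes involved (the random values are just placeholders for this 2-input, 3-hidden-unit, 1-output example):

```python
import numpy as np

# Network from the example: 2 inputs, 3 hidden units, 1 output.
rng = np.random.default_rng(0)
Theta1 = rng.standard_normal((2, 3))  # theta_ij^(1): input i -> hidden unit j
Theta2 = rng.standard_normal((3, 1))  # theta_i1^(2): hidden unit i -> output

# The gradients dJ/dTheta must have the same shapes as the weights themselves.
print(Theta1.shape)  # (2, 3)
print(Theta2.shape)  # (3, 1)
```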
The cost function sums the cost over all training examples. In what follows we will use the cost function of a regression model as \(\frac{1}{2}\sum (y-\hat y)^2\); averaging as \(\frac{1}{2m}\sum (y-\hat y)^2\) would be more accurate, but the constant factor does not change the derivation.
\(\large \color{green}{\frac{\partial J}{\partial \Theta^{(2)}}}=\frac{\partial \sum \frac{1}{2}(y-\hat y)^2}{\partial \Theta^{(2)}}=\sum \frac{\partial \frac{1}{2}(y-\hat y)^2}{\partial \Theta^{(2)}}\) by the sum rule of differentiation, which says that \(\frac{d}{dx}(u+v)=\frac{du}{dx}+\frac{dv}{dx}\).
Focusing on the expression inside the sum, we apply the chain rule, \(\large(f\circ g)'=(f'\circ g)\cdot g'\), or equivalently \(\large\frac{dz}{dx}=\frac{dz}{dy}\cdot\frac{dy}{dx}\):
\(\large \color{green}{\frac{\partial J}{\partial \Theta^{(2)}}}=\frac{\partial J}{\partial \hat y}\frac{\partial \hat y}{\partial \Theta^{(2)}}=\frac{\partial \left(\frac{1}{2}\sum (y-\hat y)^2\right)}{\partial \hat y}\frac{\partial \hat y}{\partial \Theta^{(2)}}=\color{fuchsia}{(y-\hat y)}\cdot \frac{\color{fuchsia}{-}\partial \hat y}{\partial \Theta^{(2)}}\)
Since \(\hat y\) is the sigmoid activation applied to \(z^{(3)}\), that is \(\hat y = g\big(z^{(3)}\big)\), we can apply the chain rule again:
\(\large \color{green}{\frac{\partial J}{\partial \Theta^{(2)}}}=(y-\hat y)\cdot \frac{-\partial \hat y}{\partial z^{(3)}}\cdot \frac{\partial z^{(3)}}{\partial \Theta^{(2)}}=\color{fuchsia}{-(y-\hat y)}\cdot \color{blue}{g'(z^{(3)})} \cdot \color{lime}{\frac{\partial z^{(3)}}{\partial \Theta^{(2)}}}\tag 1\)
The derivative of the sigmoid activation function with respect to \(z\) is:
\(\large \color{blue}{g'(z)}=\frac{e^{-z}}{\left(1+e^{-z}\right)^2}=\left(\frac{1}{1+e^{-z}}\right)\times \left(1 - \frac{1}{1+e^{-z}}\right)=\color{blue}{\hat y\left(1-\hat y\right)}\) (see embedded mathy picture below for derivation).
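As a quick sanity check of this identity, here is a minimal NumPy sketch comparing \(g'(z)\) against a central finite difference (the step size is arbitrary):

```python
import numpy as np

def g(z):
    """Sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))

def g_prime(z):
    """Derivative of the sigmoid, using g'(z) = g(z) * (1 - g(z))."""
    return g(z) * (1.0 - g(z))

z = np.linspace(-5, 5, 11)
h = 1e-6
finite_diff = (g(z + h) - g(z - h)) / (2 * h)  # central-difference approximation
print(np.allclose(g_prime(z), finite_diff))    # True
```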
Now, \(\large z^{(3)}=a^{(2)}\Theta^{(2)}\), and hence \(a^{(2)}\) is the slope of \(z^{(3)}\) with respect to \(\Theta^{(2)}\) for each synapse. Therefore \(\large \color{lime}{\frac{\partial z^{(3)}}{\partial \Theta^{(2)}}=\begin{bmatrix}a_{11}^{(2)}&a_{12}^{(2)}&a_{13}^{(2)}\\a_{21}^{(2)}&a_{22}^{(2)}&a_{23}^{(2)}\\a_{31}^{(2)}&a_{32}^{(2)}&a_{33}^{(2)}\end{bmatrix}}\)
\(\large \color{green}{\frac{\partial J}{\partial \Theta^{(2)}}}=\color{fuchsia}{-(y-\hat y)}\cdot \color{blue}{\left (\hat y\,(1-\hat y)\right)}\cdot \color{lime}{a^{(2)}}\tag{*}\)
In Hinton’s Coursera course on neural networks, this expression (*) is written as:
\[\frac{\partial E}{\partial w_i}=\sum_n \frac{\partial y^n}{\partial w_i}\color{orange}{\frac{\partial E}{\partial y^n}}=\color{orange}{-}\sum_n \color{red}{x_i^n}\,\color{blue}{y^n \,(1-y^n)}\,\color{orange}{(t^n-y^n)}\]
where \(E\) stands for the error, \(w_i\) is weight \(i\), \(n\) indexes the examples in the training set, and \(t^n\) is the target of training case \(n\).
because,
\[\frac{\partial y^n}{\partial w_i}=\color{red}{\frac{\partial z}{\partial w_i}}\color{blue}{\frac{dy}{dz}}=\color{red}{x_i^n}\,\color{blue}{y^n(1-y^n)}\]
the latter part, in blue, is the sigmoid derivative derived above;
and because
\[E=\frac{1}{2}\displaystyle\sum_{n\in\text{ex's train set}}(t^n-y^n)^2\]
and the derivative with respect to \(y^n\) is:
\[\color{orange}{\frac{\partial E}{\partial y^n}=-(t^n-y^n)}.\]
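Here is a minimal sketch of Hinton’s per-weight formula as an explicit loop over weights and examples, checked against the vectorized form (the toy data and weights are made up):

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up toy data: 4 examples, 2 features, plus a weight vector.
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.5]])
t = np.array([1.0, 0.0, 1.0, 0.5])  # targets t^n
w = np.array([0.1, -0.2])

y = g(X @ w)  # predictions y^n

# dE/dw_i = -sum_n x_i^n * y^n (1 - y^n) * (t^n - y^n)
grad = np.zeros_like(w)
for i in range(len(w)):
    for n in range(len(t)):
        grad[i] += -X[n, i] * y[n] * (1.0 - y[n]) * (t[n] - y[n])

# The same thing vectorized:
grad_vec = -X.T @ (y * (1.0 - y) * (t - y))
print(np.allclose(grad, grad_vec))  # True
```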
Going back to Eq. 1:
So we end up with:
\[\begin{align}\large \color{green}{\frac{\partial J}{\partial \Theta^{(2)}}}&=\color{gray}{(y-\hat y)}\cdot \color{purple}{\frac{-\partial \hat y}{\partial z^{(3)}}}\cdot \frac{\partial z^{(3)}}{\partial \Theta^{(2)}}\\[2ex]&=\color{gray}{(y-\hat y)}\cdot \color{purple}{\left(-g'(z^{(3)})\right)}\cdot \frac{\partial z^{(3)}}{\partial \Theta^{(2)}}\\[2ex]&=\color{orange}{\delta^{(3)}}\frac{\partial z^{(3)}}{\partial \Theta^{(2)}}\\[2ex]&=\left(a^{(2)}\right)^\top\,\color{orange}{\delta^{(3)}}\\[2ex]&=\large \begin{bmatrix}a_{11}^{(2)}&a_{21}^{(2)}&a_{31}^{(2)}\\a_{12}^{(2)}&a_{22}^{(2)}&a_{32}^{(2)}\\a_{13}^{(2)}&a_{23}^{(2)}&a_{33}^{(2)}\end{bmatrix} \begin{bmatrix}\delta^{(3)}_1\\\delta^{(3)}_2\\\delta^{(3)}_3\end{bmatrix}=\begin{bmatrix}a_{11}^{(2)}\delta^{(3)}_1+a_{21}^{(2)}\delta^{(3)}_2+a_{31}^{(2)}\delta^{(3)}_3\\a_{12}^{(2)}\delta^{(3)}_1+a_{22}^{(2)}\delta^{(3)}_2+a_{32}^{(2)}\delta^{(3)}_3\\a_{13}^{(2)}\delta^{(3)}_1+a_{23}^{(2)}\delta^{(3)}_2+a_{33}^{(2)}\delta^{(3)}_3\end{bmatrix}\end{align}\]
Then \(\large\color{orange}{\delta^{(3)}}=\color{gray}{-(y-\hat y)}\cdot \color{purple}{g'(z^{(3)})}=(\hat y - y)\cdot g'(z^{(3)})\)
Therefore, the matrix multiplication also takes care of the \(\sum\) over training examples in front of the first derivative formula.
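In NumPy terms, with made-up values in the 3-example layout above, this is a single matrix product:

```python
import numpy as np

# Shapes from the example: 3 training examples, 3 hidden units, 1 output.
a2 = np.array([[0.2, 0.5, 0.9],
               [0.1, 0.4, 0.8],
               [0.7, 0.3, 0.6]])            # a^(2): one row per example
delta3 = np.array([[0.1], [-0.2], [0.05]])  # delta^(3): one entry per example

# (a^(2))^T @ delta^(3): sums over examples, giving one gradient per weight.
dJ_dTheta2 = a2.T @ delta3
print(dJ_dTheta2.shape)  # (3, 1), same shape as Theta2
```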
Notice that the more general definition of \(\delta\) outside the last layer is:
\(\bbox[10px,border:2px solid purple]{\large \delta^{(l)}=\left(\Theta^{(l)}\right)^\top\delta^{(l+1)}\,g'(z^{(l)})}\) with \(g'(z^{(l)}) = a^{(l)}\cdot (1 - a^{(l)})\), where the product with \(g'(z^{(l)})\) is element-wise.
As for…
\(\large \color{red}{\large \frac{\partial J}{\partial \Theta^{(1)}}}=\color{orange}{\delta^{(3)}}\frac{\partial z^{(3)}}{\partial a^{(2)}}\frac{\partial a^{(2)}}{\partial \Theta^{(1)}}=\color{orange}{\delta^{(3)}} \left(\Theta^{(2)}\right)^\top \frac{\partial a^{(2)}}{\partial \Theta^{(1)}}\)
since, \(\large\frac{\partial z^{(3)}}{\partial a^{(2)}}=\Theta^{(2)}\).
Next,
\(\large \color{red}{\large \frac{\partial J}{\partial \Theta^{(1)}}}=\color{orange}{\delta^{(3)}} \left(\Theta^{(2)}\right)^\top \frac{\partial a^{(2)}}{\partial \Theta^{(1)}}=\color{orange}{\delta^{(3)}} \left(\Theta^{(2)}\right)^\top \frac{\partial a^{(2)}}{\partial z^{(2)}}\frac{\partial z^{(2)}}{\partial \Theta^{(1)}}=\color{orange}{\delta^{(3)}} \left(\Theta^{(2)}\right)^\top g'(z^{(2)})\frac{\partial z^{(2)}}{\partial \Theta^{(1)}}\)
Finally,
\(\large\frac{\partial z^{(2)}}{\partial \Theta^{(1)}}=X\). Therefore,
\(\large \color{red}{\large \frac{\partial J}{\partial \Theta^{(1)}}}=X^\top \color{orange}{\delta^{(3)}} \left(\Theta^{(2)}\right)^\top \color{purple}{g'(z^{(2)})}=X^\top \color{brown}{\delta^{(2)}}.\)
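Putting both gradients together, a minimal NumPy sketch of the forward and backward pass for this small regression network (data and weights are random placeholders; sigmoid activations throughout, squared-error cost as above):

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.random((3, 2))               # 3 examples, 2 inputs
y = rng.random((3, 1))               # targets
Theta1 = rng.standard_normal((2, 3))
Theta2 = rng.standard_normal((3, 1))

# Forward pass
z2 = X @ Theta1
a2 = g(z2)
z3 = a2 @ Theta2
y_hat = g(z3)

# Backward pass, following the derivation above
delta3 = -(y - y_hat) * y_hat * (1.0 - y_hat)   # delta^(3) = -(y - yhat) g'(z3)
delta2 = (delta3 @ Theta2.T) * a2 * (1.0 - a2)  # delta^(2)
dJ_dTheta2 = a2.T @ delta3                      # (a^(2))^T delta^(3)
dJ_dTheta1 = X.T @ delta2                       # X^T delta^(2)
```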
The cost function in logistic regression is:
\[\large J(\theta)=-\frac{1}{m}\displaystyle\sum_{i=1}^m \left[y^{(i)} \log\left(h_\theta(x^{(i)})\right) + \left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right)\right]\]
For the case of a multi-class logistic regression NN the cost function is:
\[\large J(\Theta)=-\frac{1}{m}\left[\displaystyle\sum_{i=1}^m \sum_{k=1}^K y_k^{(i)} \log\left(h_\Theta(x^{(i)})\right)_k + \left(1-y_k^{(i)}\right)\log\left(1-\left(h_\Theta\left(x^{(i)}\right)\right)_k\right) \right]+\frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(\Theta_{ji}^{(l)}\right)^2\]
What follows is explained in this video.
\(K\) is the number of classes or labels (output nodes).
\(L\) is the total number of layers in the network.
\(s_l\) is the number of units (without bias unit) in layer \(l\).
\(m\) is the number of examples or rows of the input matrix.
\(\lambda\) is the regularization constant.
The number of columns, \(i\), in our current theta matrix (nothing to do with the training-example index notation) is equal to the number of nodes in our current layer including the bias unit, i.e. \(s_l + 1\) columns, indexed \(i=0\) to \(s_{l}\). The number of rows, \(j\), in our current theta matrix is equal to the number of nodes in the next layer (excluding the bias unit): \(s_{l+1}\).
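As a sketch, the cost above transcribes almost directly into NumPy (assuming \(H\) is the \(m\times K\) matrix of outputs \(h_\Theta(x^{(i)})_k\), \(Y\) the one-hot label matrix, and bias weights stored in column 0 of each theta matrix; these are assumptions, not the course’s own code):

```python
import numpy as np

def nn_cost(H, Y, Thetas, lam):
    """Regularized cross-entropy cost J(Theta) for a multi-class network.

    H:      (m, K) matrix of hypothesis outputs h_Theta(x^(i))_k
    Y:      (m, K) one-hot matrix of labels y_k^(i)
    Thetas: list of weight matrices, bias weights in column 0
    lam:    regularization constant lambda
    """
    m = Y.shape[0]
    cost = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    # Regularize every weight except the bias column.
    reg = sum(np.sum(T[:, 1:] ** 2) for T in Thetas) * lam / (2 * m)
    return cost + reg
```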
In backpropagation we minimize \(J(\Theta)\) by computing its partial derivatives
\[\frac{\partial J(\Theta)}{\partial \Theta_{i,j}^{(l)}}.\]
\(\delta_j^{(l)}\) is the “error” of node \(j\) in layer \(l\).
For the last layer:
\(\large \delta^{(L)}= a^{(L)}-y\) with \(L\) being the total number of layers, and \(a^{(L)}\) the vector of outputs of the activation units in the last layer. So it’s simply the difference between the network’s output (the result of \(g(z^{(L)})\), i.e. \(h_\Theta(x)\)) and the actual value \(y\) of the training example.
The other layers are calculated as:
\(\large \delta^{(l)}=\left(\Theta^{(l)}\right)^\top\delta^{(l+1)}\,g'(z^{(l)})\) with \(g'(z^{(l)}) = a^{(l)}\cdot (1 - a^{(l)})\).
stopping at \(\delta^{(2)}\), since there is no error term for the input layer.
We can compute our partial derivative terms by multiplying our activation values and our error values for each training example \(t\):
\[\large \dfrac{\partial J(\Theta)}{\partial \Theta_{i,j}^{(l)}} = \frac{1}{m}\sum_{t=1}^m a_j^{(t)(l)} {\delta}_i^{(t)(l+1)}\]
Backpropagation Algorithm:
Given training set \(\{ (x^{(1)}, y^{(1)}), \cdots, (x^{(m)}, y^{(m)}) \}\):
Set \(\Delta_{i,j}^{(l)}:=0\) for all \((l, i, j)\).
For training example \(t=1\) to \(m\):
1. Set $a^{(1)}:=x^{(t)}$
2. Perform forward propagation to compute $a^{(l)}$ for $l=2,3,\cdots, L$.
3. Using $y^{(t)}$ compute $\delta^{(L)}=a^{(L)}-y^{(t)}$.
4. Compute $\delta^{(L-1)}, \delta^{(L-2)},\cdots,\delta^{(2)}$ using $\delta^{(l)} = ((\Theta^{(l)})^T \delta^{(l+1)}) \cdot a^{(l)}\cdot (1 - a^{(l)})$.
5. $\Delta^{(l)}_{i,j} := \Delta^{(l)}_{i,j} + a_j^{(l)} \delta_i^{(l+1)}$ or vectorized: $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)}(a^{(l)})^T$.
After the loop over all \(m\) examples, compute: \(D^{(l)}_{i,j} := \dfrac{1}{m}\left(\Delta^{(l)}_{i,j} + \lambda\Theta^{(l)}_{i,j}\right)\) if \(j \neq 0\).
\(D^{(l)}_{i,j} := \dfrac{1}{m}\Delta^{(l)}_{i,j}\) if \(j=0\).
The capital-delta matrix is used as an “accumulator” to add up our values as we go along and eventually compute our partial derivative. It can be proven that:
\(\large D_{i,j}^{(l)} = \dfrac{\partial J(\Theta)}{\partial \Theta_{i,j}^{(l)}}.\)
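A sketch of the whole algorithm in NumPy (assuming sigmoid activations, one-hot labels \(Y\), and bias weights in column 0 of each \(\Theta^{(l)}\); the function and variable names are illustrative, not from the course):

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(X, Y, Thetas, lam):
    """One pass of the backpropagation algorithm above.

    X: (m, n) inputs; Y: (m, K) one-hot labels.
    Thetas: list of weight matrices Theta^(1)..Theta^(L-1);
            Theta^(l) has shape (s_{l+1}, s_l + 1), bias weights in column 0.
    Returns the gradient matrices D^(l).
    """
    m = X.shape[0]
    Deltas = [np.zeros_like(T) for T in Thetas]

    for t in range(m):
        # 1.-2. Forward propagation, storing every activation a^(l).
        a = X[t]
        activations = []
        for T in Thetas:
            a = np.concatenate(([1.0], a))   # prepend the bias unit
            activations.append(a)
            a = g(T @ a)
        activations.append(a)                # a^(L); no bias on the output

        # 3. Error of the output layer: delta^(L) = a^(L) - y.
        delta = a - Y[t]

        # 4.-5. Walk backwards, accumulating Delta^(l) += delta^(l+1) (a^(l))^T.
        for l in range(len(Thetas) - 1, -1, -1):
            Deltas[l] += np.outer(delta, activations[l])
            if l > 0:
                a_l = activations[l]
                delta = (Thetas[l].T @ delta) * a_l * (1.0 - a_l)
                delta = delta[1:]            # drop the bias-unit error

    # D^(l): average, regularizing all but the bias column (j = 0).
    Ds = []
    for T, Delta in zip(Thetas, Deltas):
        D = Delta / m
        D[:, 1:] += (lam / m) * T[:, 1:]
        Ds.append(D)
    return Ds
```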
From the Coursera Wiki:
Now, time to evaluate these derivatives:
\[\large \frac{\partial J(\theta)}{\partial \theta^{(L-1)}} = \delta^{(L)} \frac{\partial z^{(L)}}{\partial \theta^{(L-1)}}\] using \[\large \delta^{(L)} = \frac{\partial J(\theta)}{\partial a^{(L)}} \frac{\partial a^{(L)}}{\partial z^{(L)}}.\]
We need to evaluate both partial derivatives.
Given \(\large J(\theta) = -y\log(a^{(L)}) - (1-y)\log(1-a^{(L)})\), where \(a^{(L)} = h_{\theta}(x)\), the partial derivative is: \[\large\frac{\partial J(\theta)}{\partial a^{(L)}} = \frac{y-1}{a^{(L)}-1} - \frac{y}{a^{(L)}}\]
And given \(a = g(z)\), where \(g = \frac{1}{1+e^{-z}}\), the partial derivative is:
\[\large \frac{\partial a^{(L)}}{\partial z^{(L)}} = a^{(L)}(1-a^{(L)})\]
\[\large\delta^{(L)} = \frac{\partial J(\theta)}{\partial a^{(L)}} \frac{\partial a^{(L)}}{\partial z^{(L)}}\]
\[\large\delta^{(L)} = \left(\frac{y-1}{a^{(L)}-1} - \frac{y}{a^{(L)}}\right) \left(a^{(L)}(1-a^{(L)})\right)\]
\[\large\delta^{(L)} = a^{(L)} - y\]
So, for a 3-layer network (\(L=3\)), \(\delta^{(3)} = a^{(3)} - y\).
Note that this is the correct equation, as given in our notes.
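The algebra above can be verified symbolically, e.g. with this small SymPy sketch:

```python
import sympy as sp

a, y = sp.symbols('a y')

# ((y-1)/(a-1) - y/a) * a*(1-a) should simplify to a - y.
delta_L = ((y - 1)/(a - 1) - y/a) * a * (1 - a)
print(sp.simplify(delta_L))  # a - y
```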
Now, given \(z = \theta \times \text{input}\), and in layer \(L\) the input is \(a^{(L-1)}\), the partial derivative is: \[\large \frac{\partial z^{(L)}}{\partial \theta^{(L-1)}} = a^{(L-1)}\]
Put it together for the output layer:
\[\large\frac{\partial J(\theta)}{\partial \theta^{(L-1)}} = \delta^{(L)} \frac{\partial z^{(L)}}{\partial \theta^{(L-1)}}\]
\[\large\frac{\partial J(\theta)}{\partial \theta^{(L-1)}} = (a^{(L)} - y)\,(a^{(L-1)})\]
Let’s continue on for the hidden layer (let’s assume we only have 1 hidden layer):
\[\large\frac{\partial J(\theta)}{\partial \theta^{(L-2)}} = \delta^{(L-1)} \frac{\partial z^{(L-1)}}{\partial \theta^{(L-2)}}\]
Let’s figure out \(\delta^{(L-1)}\). Once again, given \(z = \theta \times\text{input}\), the partial derivative is:
\[\large \frac{\partial z^{(L)}}{\partial a^{(L-1)}} = \theta^{(L-1)}\]
And:
\[\large\frac{\partial a^{(L-1)}}{\partial z^{(L-1)}} = a^{(L-1)}(1-a^{(L-1)})\]
So, let’s substitute these in for \(\delta^{(L-1)}\):
\[\large\delta^{(L-1)} = \delta^{(L)} \frac{\partial z^{(L)}}{\partial a^{(L-1)}} \frac{\partial a^{(L-1)}}{\partial z^{(L-1)}}\]
\[\large\delta^{(L-1)} = \delta^{(L)}\, \theta^{(L-1)}\, a^{(L-1)}(1-a^{(L-1)})\]
So, for a 3-layer network,
\[\large\delta^{(2)} = \delta^{(3)}\, \theta^{(2)}\, a^{(2)}(1-a^{(2)})\]
Put it together for the [last] hidden layer:
\[\large\frac{\partial J(\theta)}{\partial \theta^{(L-2)}} = \delta^{(L-1)} \frac{\partial z^{(L-1)}}{\partial \theta^{(L-2)}}\]
\[\large\frac{\partial J(\theta)}{\partial \theta^{(L-2)}} = \left(\delta^{(L)} \frac{\partial z^{(L)}}{\partial a^{(L-1)}} \frac{\partial a^{(L-1)}}{\partial z^{(L-1)}}\right)(a^{(L-2)})\]
\[\large\frac{\partial J(\theta)}{\partial \theta^{(L-2)}} = \left((a^{(L)} - y)(\theta^{(L-1)})(a^{(L-1)}(1-a^{(L-1)}))\right)(a^{(L-2)})\]
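Finally, these analytic gradients can be validated numerically against a finite-difference approximation of the cost. A minimal NumPy sketch (ignoring bias units and regularization; names and data are made up):

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(X, y, Theta1, Theta2):
    """Cross-entropy cost of the 3-layer network (no bias, no regularization)."""
    a2 = g(X @ Theta1)
    a3 = g(a2 @ Theta2)
    return -np.sum(y * np.log(a3) + (1 - y) * np.log(1 - a3))

rng = np.random.default_rng(1)
X = rng.random((4, 2))
y = rng.integers(0, 2, (4, 1)).astype(float)
Theta1 = rng.standard_normal((2, 3))
Theta2 = rng.standard_normal((3, 1))

# Analytic gradients, per the derivation above.
a2 = g(X @ Theta1)
a3 = g(a2 @ Theta2)
delta3 = a3 - y                               # delta^(3) = a^(3) - y
delta2 = (delta3 @ Theta2.T) * a2 * (1 - a2)  # delta^(2)
grad2 = a2.T @ delta3
grad1 = X.T @ delta2

# Finite-difference check on one entry of Theta1.
eps = 1e-6
T1p, T1m = Theta1.copy(), Theta1.copy()
T1p[0, 0] += eps
T1m[0, 0] -= eps
numeric = (cost(X, y, T1p, Theta2) - cost(X, y, T1m, Theta2)) / (2 * eps)
print(np.isclose(grad1[0, 0], numeric))  # True
```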
NOTE: These are tentative notes on different topics for personal use - expect mistakes and misunderstandings.