NOTES ON STATISTICS, PROBABILITY and MATHEMATICS


Backpropagation:


Example with hyperbolic tangent:

Another illuminating example can be found here.

The input matrix (\(X\) or \(I\)) is: \(X =I=\begin{bmatrix}.9&.9\\.9&-.9\\-.9&.9\\-.9&-.9\end{bmatrix}\)

But we’ll add a bias column to \(X\):

(X = matrix(c(.9,.9,-.9,-.9,.9,-.9,.9,-.9), nrow = 4))
##      [,1] [,2]
## [1,]  0.9  0.9
## [2,]  0.9 -0.9
## [3,] -0.9  0.9
## [4,] -0.9 -0.9

The target matrix is: \(T=\begin{bmatrix}-.9\\.9\\.9\\-.9\end{bmatrix}.\)

(true = matrix(c(-.9,.9,.9,-.9), nrow= 4))
##      [,1]
## [1,] -0.9
## [2,]  0.9
## [3,]  0.9
## [4,] -0.9

Forward Pass:

  1. Add bias to input matrix:

\[\begin{bmatrix}.9&.9&1\\.9&-.9&1\\-.9&.9&1\\-.9&-.9&1\end{bmatrix}\]

# Introducing a bias term to the input data:
(X_bias = cbind(X, rep(1, nrow(X))))
##      [,1] [,2] [,3]
## [1,]  0.9  0.9    1
## [2,]  0.9 -0.9    1
## [3,] -0.9  0.9    1
## [4,] -0.9 -0.9    1

  2. Dot product with the weight matrix \(W_1 = W_h\) gives the input of the hidden layer (hidden net):

\[\text{hidden}_{\text{net}}= I \, W_1 = \begin{bmatrix}.9&.9&1\\.9&-.9&1\\-.9&.9&1\\-.9&-.9&1\end{bmatrix}\begin{bmatrix}.23&-1.13\\.73&-.48\\.23&-.24\end{bmatrix}=\begin{bmatrix}1.09&-1.69\\-.22&-.82\\.68&.34\\-.63&1.21 \end{bmatrix}\]

# First set of weights (to produce the h = hidden layer):
(W_h= matrix (c(.23,.73,.23,-1.13,-.48,-.24), nrow = 3))
##      [,1]  [,2]
## [1,] 0.23 -1.13
## [2,] 0.73 -0.48
## [3,] 0.23 -0.24
# Getting the hidden layer:
(hidd_net = round(X_bias %*% W_h, 2))
##       [,1]  [,2]
## [1,]  1.09 -1.69
## [2,] -0.22 -0.82
## [3,]  0.68  0.34
## [4,] -0.63  1.21

  3. The activation is \(\tanh\), giving the output of the hidden layer (hidden out):

\[\text{hidden}_{\text{out}}=\tanh\left(\begin{bmatrix}1.09&-1.69\\-.22&-.82\\.68&.34\\-.63&1.21 \end{bmatrix}\right)=\begin{bmatrix} .8&-.93\\-.22&-.68\\.59&.33\\-.56&.84\end{bmatrix}\]

# Activation (tanh) of the hidden layer:
(hidd_out = round(tanh(hidd_net),2))
##       [,1]  [,2]
## [1,]  0.80 -0.93
## [2,] -0.22 -0.68
## [3,]  0.59  0.33
## [4,] -0.56  0.84

  4. The input of the outer layer (outer net) results from multiplying by the second set of weights \(W_o= W_{out}=W_2\), after adding a column of ones for the bias:

\[\text{outer}_{\text{net}}=\begin{bmatrix} .8&-.93&1\\-.22&-.68&1\\.59&.33&1\\-.56&.84&1\end{bmatrix}\begin{bmatrix}.22\\-.34\\.54\end{bmatrix}=\begin{bmatrix}1.03\\.72\\.56\\.13\end{bmatrix}\]

# Adding the bias AFTER the activation of the hidden layer:
(hidd_out_bias = cbind(hidd_out, rep(1, nrow(hidd_out))))
##       [,1]  [,2] [,3]
## [1,]  0.80 -0.93    1
## [2,] -0.22 -0.68    1
## [3,]  0.59  0.33    1
## [4,] -0.56  0.84    1
# Second matrix of weights (a given):
(W_out = matrix(c(.22,-.34,.54), nrow =3))
##       [,1]
## [1,]  0.22
## [2,] -0.34
## [3,]  0.54
(outer_net = round(hidd_out_bias %*% W_out,2))
##      [,1]
## [1,] 1.03
## [2,] 0.72
## [3,] 0.56
## [4,] 0.13

  5. The activation yields the outer out (the output of the outer layer):

\[\text{outer}_{\text{out}}=\tanh\left( \begin{bmatrix} 1.03\\.72\\.56\\.13 \end{bmatrix}\right) = \begin{bmatrix} .77\\.62\\.51\\.13 \end{bmatrix}\]

# Activation (tanh) of the outer layer:
(outer_out = round(tanh(outer_net),2))
##      [,1]
## [1,] 0.77
## [2,] 0.62
## [3,] 0.51
## [4,] 0.13

Backpropagation pass to the weights of the outer layer \((W_2 = W_o)\):

  1. Computing the error as simply \(E_{outer}=T - Y_{outer}\), the difference between the target and the output:

\[E_o=T-Y_0=\begin{bmatrix}-.9\\.9\\.9\\-.9\end{bmatrix} - \begin{bmatrix}.77\\.62\\.51\\.13\end{bmatrix}=\begin{bmatrix}-1.67\\.28\\.39\\-1.03\end{bmatrix}\]

# Simple calculation of the actual error at the end of forward pass:
(E_o = true - outer_out)
##       [,1]
## [1,] -1.67
## [2,]  0.28
## [3,]  0.39
## [4,] -1.03

  2. We want to update \(W_2\) by backpropagating the loss \((L)\) through the network. So we need to see how much the loss \((L)\) changes with changes in \(W_2\). This is the \(\Delta_o\) (output delta):

\[\frac{\partial L}{\partial W_2}=\Delta_0=\color{blue}{\large\frac{\partial \text{outer}_{input}}{\partial W_2}}\,\color{red}{\large \frac{\partial L}{\partial \text{outer}_{input}}}\tag 1\]

\[\color{red}{\delta_0}=\color{red}{\large \frac{\partial L}{\partial \text{outer}_{input}}} = \color{orange}{E_0} \circ \color{brown}{D_0}= \color{orange}{\frac{\partial L}{\partial \text{outer}_{output}}}\,\circ\color{brown}{\frac{\partial \text{outer}_{output}}{\partial \text{outer}_{input}}}\]

where \(E_0\) is the error computed in step 1 and \(D_0\) is the derivative of the \(\tanh\) activation in the outer layer.

The derivative of the \(\tanh\) is \(D_0 = 1 - Y_0^2\):

\[D_0 = 1 - Y_0^2=\begin{bmatrix}1\\1\\1\\1\end{bmatrix}-\begin{bmatrix}.77\\.62\\.51\\.13\end{bmatrix}^2=\begin{bmatrix}.41\\.62\\.74\\.98 \end{bmatrix}\]

# Calculating the changes in the activation (tanh) results wrt input outer layer:
(D_o = round(c(rep(1, nrow(outer_out))) - outer_out^2, 2))
##      [,1]
## [1,] 0.41
## [2,] 0.62
## [3,] 0.74
## [4,] 0.98

And \(\circ\) stands for the Hadamard product.

\[\color{red}{\delta_0} = E_0\circ D_0 = E_0\circ \left(1 - Y_0^2 \right)= \begin{bmatrix}-1.67\\.28\\.39\\-1.03\end{bmatrix} \circ \begin{bmatrix}.41\\.62\\.74\\.98 \end{bmatrix} = \begin{bmatrix}-.68 \\ .17 \\.29 \\-1.01 \end{bmatrix}\]

# Calculating the change in the error wrt input of the outer layer (small delta):
(delta_o = round(E_o * D_o, 2))
##       [,1]
## [1,] -0.68
## [2,]  0.17
## [3,]  0.29
## [4,] -1.01

Now, to finally calculate (1) or big delta:

\[\begin{align}\Delta_0 = \color{blue}{\large\frac{\partial \text{outer}_{input}}{\partial W_2}}\,\color{red}{\delta_0}&= \text{hidden}_{out}^\top\quad\delta_0\\[2ex] &= \begin{bmatrix} .8 & -.22 & .59 & -.56\\ -.93 & -.68 &.33 & .84\\1&1&1&1\end{bmatrix} \begin{bmatrix} -.68 \\ .17 \\.29 \\-1.01\end{bmatrix}=\begin{bmatrix}.16\\-.24\\-1.23 \end{bmatrix} \end{align}\]

# Calculating loss wrt to W2 (Delta outer layer):
(Delta_o = round(t(hidd_out_bias) %*% delta_o, 2))
##       [,1]
## [1,]  0.16
## [2,] -0.24
## [3,] -1.23

  3. Updating the weights:

With a learning rate of \(\eta = 0.03\), the update is:

\[W_2:= W_2 + \eta\Delta_0= \begin{bmatrix}.22\\-.34\\.54 \end{bmatrix}+0.03\,\begin{bmatrix} .16\\-.24\\-1.23\end{bmatrix}=\begin{bmatrix}.22\\-.35\\.50\end{bmatrix}\]

# learning rate:
eta = .03
# update weights:
(W_out_update = round(W_out + eta * Delta_o, 2))
##       [,1]
## [1,]  0.22
## [2,] -0.35
## [3,]  0.50

Backpropagation pass to update the weights of the hidden layer \((W_1 = W_h)\):

  1. Computing the error:

The error in the output layer can be propagated back to the output of the hidden layer:

\[\begin{align}E_H=E_h=E_{\text{hidden}}&=\frac{\partial L}{\partial \text{hidden}_{output}}\\[2ex] &=\color{orange}{\frac{\partial L}{\partial \text{outer}_{output}}}\,\circ\color{brown}{\frac{\partial \text{outer}_{output}}{\partial \text{outer}_{input}}}\,\frac{\partial\text{outer}_{input}}{\partial \text{hidden}_{output}}\\[2ex] &=\color{red}{\delta_0}\cdot W_2^\top= \color{orange}{E_0}\circ \color{brown}{D_0} \cdot W_2^\top\\[2ex] &=\begin{bmatrix} -.68\\.17\\.29\\-1.01\end{bmatrix}\begin{bmatrix}.22&-.34\end{bmatrix} =\color{aqua}{\begin{bmatrix} -.15&.23\\.04&-.06\\.06&-.1\\-.22&.34\end{bmatrix}} \end{align}\]

# It wouldn't be fair to blame the hidden layer for the bias 
# (introduced after activating the layer).
# Notice also that we use the original W2 or W outer weights - not the updated!

(W_out_minus_bias = W_out[-length(W_out)]) 
## [1]  0.22 -0.34
(E_h = round(delta_o %*% t(W_out_minus_bias), 2))
##       [,1]  [,2]
## [1,] -0.15  0.23
## [2,]  0.04 -0.06
## [3,]  0.06 -0.10
## [4,] -0.22  0.34

  2. Calculating delta for the hidden layer (change in error wrt input in the hidden layer):

First we need to calculate the derivative of the activation function:

\[\begin{align} D_H &= \frac{\partial\text{hidden}_{output}}{\partial\text{hidden}_{input}}\\[2ex] &= 1 - Y_H^2\\[2ex] &= \begin{bmatrix}1&1\\1&1\\1&1\\1&1 \end{bmatrix} - \begin{bmatrix} .8&-.93\\-.22&-.68\\.59&.33\\-.56&.84\end{bmatrix}^2 = \begin{bmatrix}.36&.14\\.95&.54\\.65&.89\\.69&.29 \end{bmatrix} \end{align}\]

(D_h = round(c(rep(1, nrow(hidd_out))) - hidd_out^2, 2))
##      [,1] [,2]
## [1,] 0.36 0.14
## [2,] 0.95 0.54
## [3,] 0.65 0.89
## [4,] 0.69 0.29

… and the change in the error wrt the input of the hidden layer (small \(\delta\)):

\[\begin{align}\color{red}{\delta_H}&=\color{orange}{\frac{\partial L}{\partial \text{hidden}_{output}}}\,\circ\color{brown}{\frac{\partial \text{hidden}_{output}}{\partial\text{hidden}_{input}}}\\[2ex] &=\color{orange}{E_H}\circ \color{brown}{D_H}\\[2ex] &=E_h \circ \left(1 - Y_H^2\right)\\[2ex] &=\color{aqua}{\begin{bmatrix}-.15&.23\\.04&-.06\\.06&-.1\\-.22&.34\end{bmatrix}}\circ\begin{bmatrix}.36&.14\\.95&.54\\.65&.89\\.69&.29\end{bmatrix}=\begin{bmatrix}-.05&.03\\.04&-.03\\.04&-.09\\-.15&.1\end{bmatrix} \end{align}\]

# Calculating small delta of the hidden layer:
(delta_h = round(E_h * D_h, 2))
##       [,1]  [,2]
## [1,] -0.05  0.03
## [2,]  0.04 -0.03
## [3,]  0.04 -0.09
## [4,] -0.15  0.10

  3. Compute the big delta weight matrix (change in error wrt \(W_1\)):

\[\begin{align}\Delta_H&=\color{orange}{\frac{\partial L}{\partial \text{hidden}_{output}}}\,\circ\,\color{brown}{\frac{\partial \text{hidden}_{output}}{\partial\text{hidden}_{input}}}\frac{\partial \text{hidden}_{input}}{\partial W_1}\\[2ex] &=I^\top \cdot \delta_H\\[2ex] &=\begin{bmatrix}.9&.9&-.9&-.9\\.9&-.9&.9&-.9\\1&1&1&1 \end{bmatrix}\cdot\begin{bmatrix}-.05&.03\\.04&-.03\\.04&-.09\\-.15&.1 \end{bmatrix}=\begin{bmatrix}.09&-.01\\.09&-.12\\-.12&.01\end{bmatrix} \end{align}\]

# Big Delta for the W1 or weights in the hidden layer (W Hidden):
(Delta_h = round(t(X_bias) %*% delta_h, 2))
##       [,1]  [,2]
## [1,]  0.09 -0.01
## [2,]  0.09 -0.12
## [3,] -0.12  0.01

  4. Update \(W_1\) using the delta weight change matrix:

\[W_1 := W_1 + \eta \Delta_H\]

\[W_1:= \begin{bmatrix}.23&-1.13\\.73&-.48\\.23&-.24\end{bmatrix}+0.03 \begin{bmatrix}.09&-.01\\.09&-.12\\-.12&.01\end{bmatrix}\]

Because the adjustments are at most \(0.03 \times 0.12 \approx 0.004\), the updated weights look unchanged after rounding to two decimals:

# Updating the first set of weights:
(W_h_update = round(W_h + eta * Delta_h, 2))
##      [,1]  [,2]
## [1,] 0.23 -1.13
## [2,] 0.73 -0.48
## [3,] 0.23 -0.24
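
The single iteration above can be wrapped into a loop. The sketch below is not part of the original example: it reuses the same data, initial weights, and learning rate, drops the intermediate rounding so the updates can accumulate, and tracks the sum of squared errors so that its decrease can be checked. The function name and the `epochs`/`sse` bookkeeping are introduced here for illustration.

# Sketch (not in the original example): repeat the forward and backward pass.
# Assumes X, true, W_h, W_out as defined above; rounding is dropped.
train_tanh = function(X, target, W_h, W_out, eta = 0.03, epochs = 1000) {
  sse = numeric(epochs)
  for (i in 1:epochs) {
    # Forward pass:
    X_bias        = cbind(X, 1)
    hidd_out      = tanh(X_bias %*% W_h)
    hidd_out_bias = cbind(hidd_out, 1)
    outer_out     = tanh(hidd_out_bias %*% W_out)
    # Backward pass:
    E_o     = target - outer_out
    sse[i]  = sum(E_o^2)
    delta_o = E_o * (1 - outer_out^2)
    Delta_o = t(hidd_out_bias) %*% delta_o
    E_h     = delta_o %*% t(W_out[-nrow(W_out), , drop = FALSE])  # drop the bias weight
    delta_h = E_h * (1 - hidd_out^2)
    Delta_h = t(X_bias) %*% delta_h
    # Updates (note that the pre-update W_out was already used for E_h above):
    W_out = W_out + eta * Delta_o
    W_h   = W_h   + eta * Delta_h
  }
  list(W_h = W_h, W_out = W_out, prediction = outer_out, sse = sse)
}
# e.g. plot(train_tanh(X, true, W_h, W_out, epochs = 5000)$sse, type = "l")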


Example with logistic activation function:

From this source.

For this tutorial, we’re going to use a neural network with two inputs, \(i_1\) and \(i_2\), two hidden neurons, \(h_1\) and \(h_2\), and two output neurons, \(o_1\) and \(o_2\). Additionally, the hidden and output neurons will include a bias.

Here’s the basic structure:

In order to have some numbers to work with, here are the \(\color{red}{\text{initial weights}}\), the \(\color{orange}{\text{biases}}\), and \(\color{blue}{\text{training inputs/outputs}}\):

The goal of backpropagation is to optimize the weights so that the neural network can learn how to correctly map arbitrary inputs to outputs.

For the rest of this tutorial we’re going to work with a single training set: given inputs 0.05 and 0.10, we want the neural network to output 0.01 and 0.99.

Training set: \(\{0.05, 0.10 \}\mapsto \{0.01, 0.99\}\).
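
To mirror the hand calculations in R, here is a minimal setup sketch. The inputs, targets, \(w_1\), \(w_2\), \(b_1\), \(w_5\), \(w_6\), and \(b_2\) appear in the text below; \(w_3 = 0.25\), \(w_4 = 0.30\), \(w_7 = 0.50\), and \(w_8 = 0.55\) are not printed in these notes (they come from the tutorial's figure), so they are assumptions here, though they are consistent with the updated weights reported further down.

# Setup (sketch): inputs, targets, initial weights and biases.
# w3, w4, w7 and w8 are assumed from the tutorial's figure.
i1 = 0.05; i2 = 0.10                                      # inputs
t1 = 0.01; t2 = 0.99                                      # targets
w1 = 0.15; w2 = 0.20; w3 = 0.25; w4 = 0.30; b1 = 0.35     # input  -> hidden
w5 = 0.40; w6 = 0.45; w7 = 0.50; w8 = 0.55; b2 = 0.60     # hidden -> output
sigmoid = function(x) 1 / (1 + exp(-x))                   # logistic activation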


The Forward Pass:

To begin, let’s see what the neural network currently predicts given the weights and biases above and inputs of 0.05 and 0.10. To do this we’ll feed those inputs forward through the network.

We figure out the total net input to each hidden layer neuron, squash the total net input using an activation function (here we use the logistic function), then repeat the process with the output layer neurons.

Total net input is also referred to as just net input by some sources.

Here’s how we calculate the total net input for \(h_1\):

\(\text{net}_{h1} = w_1 \times i_1 + w_2 \times i_2 + b_1 \times 1\)

\(\text{net}_{h1} = 0.15 \times 0.05 + 0.2 \times 0.1 + 0.35 \times 1 = 0.3775\)

We then squash it using the logistic function to get the output of \(h_1\):

\(\text{out}_{h1} = \frac{1}{1+e^{-\text{net}_{h1}}} = \frac{1}{1+e^{-0.3775}} = 0.593269992\)

Carrying out the same process for \(h_2\) we get:

\(\text{out}_{h2} = 0.596884378\)
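
The same two calculations in R, continuing from the setup sketch above:

# Total net input and logistic output of the hidden neurons:
(net_h1 = w1 * i1 + w2 * i2 + b1 * 1)   # 0.3775
(out_h1 = sigmoid(net_h1))              # 0.593269992
(net_h2 = w3 * i1 + w4 * i2 + b1 * 1)   # uses the assumed w3, w4
(out_h2 = sigmoid(net_h2))              # 0.596884378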


We repeat this process for the output layer neurons, using the output from the hidden layer neurons as inputs.

Here’s the output for \(o_1\):

\(\text{net}_{o1} = w_5 \times \text{out}_{h1} + w_6 \times \text{out}_{h2} + b_2 \times 1\)

\(\text{net}_{o1} = 0.4 \times 0.593269992 + 0.45 \times 0.596884378 + 0.6 \times 1 = 1.105905967\)

\(\text{out}_{o1} = \frac{1}{1+e^{-net_{o1}}} = \frac{1}{1+e^{-1.105905967}} = 0.75136507\)

And carrying out the same process for \(o_2\) we get:

\(\text{out}_{o2} = 0.772928465\)
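
And in R, continuing the sketch:

# Net input and logistic output of the output neurons:
(net_o1 = w5 * out_h1 + w6 * out_h2 + b2 * 1)   # 1.105905967
(out_o1 = sigmoid(net_o1))                      # 0.75136507
(net_o2 = w7 * out_h1 + w8 * out_h2 + b2 * 1)   # uses the assumed w7, w8
(out_o2 = sigmoid(net_o2))                      # 0.772928465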


TOTAL ERROR of the NEURAL NETWORK (output of the outer layer vs. the true values)

We can now calculate the error for each output neuron using the squared error function and sum them to get the total error:

\(E_{\text{total}} = \sum \frac{1}{2}(\text{target} - \text{output})^{2}\)

For example, the target output for \(o_1\) is \(0.01\) but the neural network output \(0.75136507\), therefore its error is:

\(E_{o1} = \frac{1}{2}(\text{target}_{o1} - \text{out}_{o1})^{2} = \frac{1}{2}(0.01 - 0.75136507)^{2} = 0.274811083\)

Repeating this process for \(o_2\) (remembering that the target is \(0.99\)) we get:

\(E_{o2} = 0.023560026\)

The total error for the neural network is the sum of these errors:

\(E_{\text{total}} = E_{o1} + E_{o2} = 0.274811083 + 0.023560026 = 0.298371109\)
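
In R, continuing the sketch:

# Squared error of each output neuron and the total error:
(E_o1 = 0.5 * (t1 - out_o1)^2)   # 0.274811083
(E_o2 = 0.5 * (t2 - out_o2)^2)   # 0.023560026
(E_total = E_o1 + E_o2)          # 0.298371109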


The Backwards Pass:


Our goal with backpropagation is to update each of the weights in the network so that they cause the actual output to be closer to the target output, thereby minimizing the error for each output neuron and the network as a whole.

Output Layer


Example: \(\Large w_5\):

Consider \(w_5\). We want to know how much a change in \(w_5\) affects the total error, aka \(\large\frac{\partial E_{\text{total}}}{\partial w_{5}}\) (the partial derivative of \(E_{\text{total}}\) with respect to \(w_{5}\) or the gradient with respect to \(w_{5}\)).

Applying the chain rule we know that:

\(\large\frac{\partial\, E_{\text{total}}}{\partial \,w_{5}} = \frac{\partial\, E_{\text{total}}}{\partial\, \text{out}_{o1}} \frac{\partial\, \text{out}_{o1}}{\partial\, \text{net}_{o1}} \frac{\partial\, \text{net}_{o1}}{\partial\, w_{5}}\)

Visually, here’s what we’re doing:

We need to figure out each piece in this equation.

First, how much does the total error change with respect to the output?


CHANGE IN ERROR WRT OUTPUT OF OUTER LAYER NODE 1:

\(E_{total} = \frac{1}{2}(\text{target}_{o1} - \text{out}_{o1})^{2} + \frac{1}{2}(\text{target}_{o2} - \text{out}_{o2})^{2}\)

\(\frac{\partial E_{\text{total}}}{\partial \text{out}_{o1}} = 2 * \frac{1}{2}(\text{target}_{o1} - \text{out}_{o1})^{2 - 1} * -1 + 0\)

\(\color{red}{\frac{\partial E_{\text{total}}}{\partial \text{out}_{o1}}} = -(\text{target}_{o1} - \text{out}_{o1}) = -(0.01 - 0.75136507) = \color{red}{0.74136507}\)

When we take the partial derivative of the total error with respect to \(\text{out}_{o1}\), the term \(\frac{1}{2}(\text{target}_{o2} - \text{out}_{o2})^{2}\) vanishes because \(\text{out}_{o1}\) does not affect it, so we are taking the derivative of a constant, which is zero.


CHANGE IN OUTPUT OUTER LAYER NODE 1 WRT ITS INPUT (SIGMOID ACTIVATION):

Next, how much does the output of \(o_1\) change with respect to its total net input?

The partial derivative of the logistic function is the output multiplied by \(1\) minus the output:

\(\text{out}_{o1} = \frac{1}{1+e^{-\text{net}_{o1}}}\)

\(\color{blue}{\frac{\partial \text{out}_{o1}}{\partial \text{net}_{o1}}} = \text{out}_{o1}(1 - \text{out}_{o1}) = 0.75136507(1 - 0.75136507) = \color{blue}{0.186815602}\)


CHANGE IN INPUT OF OUTER LAYER NODE 1 WRT \(w_5\) (Logit linear equation):

Finally, how much does the total net input of \(o_1\) change with respect to \(w_5\)?

\(\text{net}_{o1} = w_5 \times \text{out}_{h1} + w_6 \times \text{out}_{h2} + b_2 \times 1\)

\(\frac{\partial \text{net}_{o1}}{\partial w_{5}} = 1 \times \text{out}_{h1} \times w_5^{(1 - 1)} + 0 + 0 = \text{out}_{h1} = 0.593269992\)

Putting it all together:

\(\large\frac{\partial\, E_{\text{total}}}{\partial \,w_{5}} = \frac{\partial\, E_{\text{total}}}{\partial\, \text{out}_{o1}} \frac{\partial\, \text{out}_{o1}}{\partial\, \text{net}_{o1}} \frac{\partial\, \text{net}_{o1}}{\partial\, w_{5}}\)

\(\frac{\partial E_{\text{total}}}{\partial w_{5}} = 0.74136507 * 0.186815602 * 0.593269992 = 0.082167041\)
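
The three factors of the chain rule, computed in R (continuing the sketch; the variable names just mirror the notation above):

# Chain rule for the gradient of the total error wrt w5:
(dE_dout_o1   = -(t1 - out_o1))          # 0.74136507
(dout_dnet_o1 = out_o1 * (1 - out_o1))   # 0.186815602
(dnet_o1_dw5  = out_h1)                  # 0.593269992
(dE_dw5 = dE_dout_o1 * dout_dnet_o1 * dnet_o1_dw5)   # 0.082167041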


You’ll often see this calculation combined in the form of the delta rule:

\(\frac{\partial E_{\text{total}}}{\partial w_{5}} = -(\text{target}_{o1} - \text{out}_{o1}) \times \text{out}_{o1}(1 - \text{out}_{o1}) \times \text{out}_{h1}\)

Alternatively, we have \(\frac{\partial E_{\text{total}}}{\partial \text{out}_{o1}}\) and \(\frac{\partial \text{out}_{o1}}{\partial \text{net}_{o1}}\) which can be written as \(\frac{\partial E_{\text{total}}}{\partial \text{net}_{o1}}\), aka \(\delta_{o1}\) (the Greek letter delta) aka the node delta. We can use this to rewrite the calculation above:

\(\delta_{o1} = \frac{\partial E_{\text{total}}}{\partial \text{out}_{o1}} \times \frac{\partial out_{o1}}{\partial \text{net}_{o1}} = \frac{\partial E_{\text{total}}}{\partial \text{net}_{o1}}\)

\(\delta_{o1} = -(\text{target}_{o1} - \text{out}_{o1}) \times \text{out}_{o1}(1 - \text{out}_{o1})\)

Therefore:

\(\frac{\partial E_{\text{total}}}{\partial w_{5}} = \delta_{o1} \text{out}_{h1}\)

Some sources extract the negative sign from \(\delta\), so it would be written as:

\(\frac{\partial E_{\text{total}}}{\partial w_{5}} = -\delta_{o1} \text{out}_{h1}\)


To decrease the error, we then subtract this value from the current weight (optionally multiplied by some learning rate, eta, which we’ll set to \(0.5\)):

\(w_5^{+} = w_5 - \eta \times \frac{\partial E_{\text{total}}}{\partial w_{5}} = 0.4 - 0.5 \times 0.082167041 = 0.35891648\)

Some sources use \(\alpha\) (alpha) to represent the learning rate, others use \(\eta\) (eta), and others even use \(\epsilon\) (epsilon).

We can repeat this process to get the new weights \(w_6, w_7\), and \(w_8\):

\(w_6^{+} = 0.408666186\)

\(w_7^{+} = 0.511301270\)

\(w_8^{+} = 0.561370121\)

We perform the actual updates in the neural network after we have the new weights leading into the hidden layer neurons (i.e., we use the original weights, not the updated weights, when we continue the backpropagation algorithm below).
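
In R, using the node deltas \(\delta_{o1}\) and \(\delta_{o2}\) defined above (a sketch continuing from the earlier chunks; the `_new` names are introduced here so the original weights stay available for the hidden-layer pass):

# Node deltas of the output neurons, then gradient = delta * incoming activation:
eta = 0.5
delta_o1 = -(t1 - out_o1) * out_o1 * (1 - out_o1)
delta_o2 = -(t2 - out_o2) * out_o2 * (1 - out_o2)
(w5_new = w5 - eta * delta_o1 * out_h1)   # 0.35891648
(w6_new = w6 - eta * delta_o1 * out_h2)   # 0.408666186
(w7_new = w7 - eta * delta_o2 * out_h1)   # 0.511301270
(w8_new = w8 - eta * delta_o2 * out_h2)   # 0.561370121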


Hidden Layer

Next, we’ll continue the backwards pass by calculating new values for \(w_1, w_2, w_3\), and \(w_4\).

Example: Calculating \(\Large w_1\):

Big picture, here’s what we need to figure out:

\(\large \frac{\partial E_{\text{total}}}{\partial w_{1}} = \frac{\partial \,E_{\text{total}}}{\partial \,\text{out}_{h1}} \frac{\partial \,\text{out}_{h1}}{\partial \,\text{net}_{h1}} \frac{\partial \,\text{net}_{h1}}{\partial \,w_{1}}\)

Visually:

We’re going to use a similar process as we did for the output layer, but slightly different to account for the fact that the output of each hidden layer neuron contributes to the output (and therefore error) of multiple output neurons. We know that \(\text{out}_{h1}\) affects both \(\text{out}_{o1}\) and \(\text{out}_{o2}\), therefore \(\frac{\partial E_{\text{total}}}{\partial \text{out}_{h1}}\) needs to take into consideration its effect on both output neurons:

\(\frac{\partial E_{\text{total}}}{\partial \text{out}_{h1}} = \frac{\partial E_{o1}}{\partial \text{out}_{h1}} + \frac{\partial E_{o2}}{\partial \text{out}_{h1}}\)


CHANGE IN ERROR OF TARGET NEURON 1 OUTER LAYER WRT OUTPUT OF HIDDEN LAYER NODE 1:

Starting with \(\frac{\partial\, E_{o1}}{\partial \,\text{out}_{h1}}\):

\(\large \frac{\partial\, E_{o1}}{\partial\, \text{out}_{h1}} = \frac{\partial\, E_{o1}}{\partial\, \text{net}_{o1}} \frac{\partial\, \text{net}_{o1}}{\partial\, \text{out}_{h1}}\)

We can calculate \(\frac{\partial \,E_{o1}}{\partial \,\text{net}_{o1}}\) using values we calculated earlier:

\(\frac{\partial\, E_{o1}}{\partial\, \text{net}_{o1}} = \color{red}{\frac{\partial\, E_{o1}}{\partial\, \text{out}_{o1}}} \, \color{blue}{\frac{\partial\, \text{out}_{o1}}{\partial\, \text{net}_{o1}}} = 0.74136507 \times 0.186815602 = 0.138498562\)

And \(\frac{\partial \text{net}_{o1}}{\partial \text{out}_{h1}}\) is equal to \(w_5\):

\(\text{net}_{o1} = w_5 \times \text{out}_{h1} + w_6 \times \text{out}_{h2} + b_2 \times 1\)

\(\frac{\partial \text{net}_{o1}}{\partial \text{out}_{h1}} = w_5 = 0.40\)

Plugging them in:

\(\large \frac{\partial E_{o1}}{\partial \text{out}_{h1}} = \frac{\partial E_{o1}}{\partial \text{net}_{o1}} \frac{\partial \text{net}_{o1}}{\partial \text{out}_{h1}} = 0.138498562 \times 0.40 = 0.055399425\)


CHANGE IN ERROR OF TARGET OUTER LAYER NEURON 2 WRT OUTPUT OF HIDDEN LAYER NODE 1:

Following the same process for \(\frac{\partial E_{o2}}{\partial \text{out}_{h1}}\), we get:

\(\frac{\partial E_{o2}}{\partial \text{out}_{h1}} = -0.019049119\)

Therefore:

\(\frac{\partial E_{\text{total}}}{\partial \text{out}_{h1}} = \frac{\partial E_{o1}}{\partial \text{out}_{h1}} + \frac{\partial E_{o2}}{\partial \text{out}_{h1}} = 0.055399425 + -0.019049119 = 0.036350306\)
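
The same bookkeeping in R, continuing the sketch:

# Error signal reaching out_h1 from both output neurons:
(dEo1_douth1 = delta_o1 * w5)                  #  0.055399425
(dEo2_douth1 = delta_o2 * w7)                  # -0.019049119 (uses the assumed w7)
(dEtotal_douth1 = dEo1_douth1 + dEo2_douth1)   #  0.036350306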


Now that we have \(\frac{\partial E_{total}}{\partial \text{out}_{h1}}\), we need to figure out \(\frac{\partial \text{out}_{h1}}{\partial \text{net}_{h1}}\) and then \(\frac{\partial \text{net}_{h1}}{\partial w}\) for each weight:


CHANGE IN OUTPUT NODE 1 HIDDEN LAYER WRT INPUT:

\(\text{out}_{h1} = \frac{1}{1+e^{-\text{net}_{h1}}}\)

\(\frac{\partial\, \text{out}_{h1}}{\partial \,\text{net}_{h1}} = \text{out}_{h1}(1 - \text{out}_{h1}) = 0.59326999(1 - 0.59326999 ) = 0.241300709\)


CHANGE IN INPUT NODE 1 HIDDEN LAYER WRT \(w_1\):

We calculate the partial derivative of the total net input to \(h_1\) with respect to \(w_1\) the same as we did for the output neuron:

\(\text{net}_{h1} = w_1 \times i_1 + w_2 \times i_2 + b_1 \times 1\)

\(\frac{\partial \text{net}_{h1}}{\partial w_1} = i_1 = 0.05\)

Putting it all together:

\(\frac{\partial E_{\text{total}}}{\partial w_{1}} = \frac{\partial E_{\text{total}}}{\partial \text{out}_{h1}} \, \frac{\partial \text{out}_{h1}}{\partial \text{net}_{h1}} \, \frac{\partial \text{net}_{h1}}{\partial w_{1}}\)

\(\frac{\partial E_{\text{total}}}{\partial w_{1}} = 0.036350306 \times 0.241300709 \times 0.05 = 0.000438568\)
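
In R, continuing the sketch:

# Remaining factors and the gradient of the total error wrt w1:
(douth1_dneth1 = out_h1 * (1 - out_h1))   # 0.241300709
(dneth1_dw1    = i1)                      # 0.05
(dE_dw1 = dEtotal_douth1 * douth1_dneth1 * dneth1_dw1)   # 0.000438568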

You might also see this written as:

\(\frac{\partial E_{\text{total}}}{\partial w_{1}} = (\sum\limits_{o}{\frac{\partial E_{\text{total}}}{\partial \text{out}_{o}} \frac{\partial \text{out}_{o}}{\partial \text{net}_{o}} \frac{\partial \text{net}_{o}}{\partial \text{out}_{h1}}}) * \frac{\partial \text{out}_{h1}}{\partial \text{net}_{h1}} \frac{\partial \text{net}_{h1}}{\partial w_{1}}\)

\(\frac{\partial E_{\text{total}}}{\partial w_{1}} = (\sum\limits_{o}{\delta_{o} w_{ho}}) \text{out}_{h1}(1 - \text{out}_{h1}) * i_{1}\)

\(\frac{\partial E_{\text{total}}}{\partial w_{1}} = \delta_{h1}i_{1}\)

We can now update \(w_1\):

\(w_1^{+} = w_1 - \eta * \frac{\partial E_{\text{total}}}{\partial w_{1}} = 0.15 - 0.5 * 0.000438568 = 0.149780716\)

Repeating this for \(w_2, w_3\), and \(w_4\):

\(w_2^{+} = 0.19956143\)

\(w_3^{+} = 0.24975114\)

\(w_4^{+} = 0.29950229\)
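
And in R, continuing the sketch (the propagation to \(\text{out}_{h2}\) uses the assumed \(w_8\)):

# Hidden-layer node deltas and the four weight updates:
dEtotal_douth2 = delta_o1 * w6 + delta_o2 * w8    # uses the assumed w8
delta_h1 = dEtotal_douth1 * out_h1 * (1 - out_h1)
delta_h2 = dEtotal_douth2 * out_h2 * (1 - out_h2)
(w1_new = w1 - eta * delta_h1 * i1)   # 0.149780716
(w2_new = w2 - eta * delta_h1 * i2)   # 0.19956143
(w3_new = w3 - eta * delta_h2 * i1)   # 0.24975114
(w4_new = w4 - eta * delta_h2 * i2)   # 0.29950229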

Finally, we’ve updated all of our weights! When we fed forward the \(0.05\) and \(0.1\) inputs originally, the error on the network was \(0.298371109\). After this first round of backpropagation, the total error is now down to \(0.291027924\). It might not seem like much, but after repeating this process \(10,000\) times, for example, the error plummets to \(0.000035085\). At this point, when we feed forward \(0.05\) and \(0.1\), the two output neurons generate \(0.015912196\) (vs \(0.01\) target) and \(0.984065734\) (vs \(0.99\) target).
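
A compact loop sketch that repeats the whole procedure, starting from the setup chunk above. The biases are held fixed here (as in the weight updates shown), so the exact figures may differ slightly from the ones quoted, but the total error should drop by several orders of magnitude.

# Sketch: repeated forward/backward passes on the single training pair.
train_logistic = function(n_iter, eta = 0.5) {
  w = c(w1, w2, w3, w4, w5, w6, w7, w8)   # start from the initial weights above
  for (k in 1:n_iter) {
    out_h1 = sigmoid(w[1] * i1 + w[2] * i2 + b1)
    out_h2 = sigmoid(w[3] * i1 + w[4] * i2 + b1)
    out_o1 = sigmoid(w[5] * out_h1 + w[6] * out_h2 + b2)
    out_o2 = sigmoid(w[7] * out_h1 + w[8] * out_h2 + b2)
    # Node deltas, all computed with the current (not yet updated) weights:
    d_o1 = -(t1 - out_o1) * out_o1 * (1 - out_o1)
    d_o2 = -(t2 - out_o2) * out_o2 * (1 - out_o2)
    d_h1 = (d_o1 * w[5] + d_o2 * w[7]) * out_h1 * (1 - out_h1)
    d_h2 = (d_o1 * w[6] + d_o2 * w[8]) * out_h2 * (1 - out_h2)
    grad = c(d_h1 * i1, d_h1 * i2, d_h2 * i1, d_h2 * i2,
             d_o1 * out_h1, d_o1 * out_h2, d_o2 * out_h1, d_o2 * out_h2)
    w = w - eta * grad
  }
  c(out_o1 = out_o1, out_o2 = out_o2,
    E_total = 0.5 * (t1 - out_o1)^2 + 0.5 * (t2 - out_o2)^2)
}
# train_logistic(10000) should give outputs close to the 0.01 / 0.99 targets.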

