BACKPROPAGATION:



Example with hyperbolic tangent:

Another illuminating example can be found here.

The input matrix (\(X\) or \(I\)) is: \(X =I=\begin{bmatrix}.9&.9\\.9&-.9\\-.9&.9\\-.9&-.9\end{bmatrix}\)

But we’ll add a bias column to \(X\):

(X = matrix(c(.9,.9,-.9,-.9,.9,-.9,.9,-.9), nrow = 4))
##      [,1] [,2]
## [1,]  0.9  0.9
## [2,]  0.9 -0.9
## [3,] -0.9  0.9
## [4,] -0.9 -0.9

The target matrix is: \(T=\begin{bmatrix}-.9\\.9\\.9\\-.9\end{bmatrix}.\)

(true = matrix(c(-.9,.9,.9,-.9), nrow= 4))
##      [,1]
## [1,] -0.9
## [2,]  0.9
## [3,]  0.9
## [4,] -0.9

Forward Pass:

  1. Add a bias column to the input matrix:

\[\begin{bmatrix}.9&.9&1\\.9&-.9&1\\-.9&.9&1\\-.9&-.9&1\end{bmatrix}\]

# Introducing a bias term to the input data:
(X_bias = cbind(X, rep(1, nrow(X))))
##      [,1] [,2] [,3]
## [1,]  0.9  0.9    1
## [2,]  0.9 -0.9    1
## [3,] -0.9  0.9    1
## [4,] -0.9 -0.9    1

  2. Dot product with the weight matrix \(W_1 = W_h\) gives the input of the hidden layer (hidden net):

\[\text{hidden}_{\text{net}}= I \, W_1 = \begin{bmatrix}.9&.9&1\\.9&-.9&1\\-.9&.9&1\\-.9&-.9&1\end{bmatrix}\begin{bmatrix}.23&-1.13\\.73&-.48\\.23&-.24\end{bmatrix}=\begin{bmatrix}1.09&-1.69\\-.22&-.82\\.68&.34\\-.63&1.21 \end{bmatrix}\]

# First set of weights (to produce the h = hidden layer):
(W_h= matrix (c(.23,.73,.23,-1.13,-.48,-.24), nrow = 3))
##      [,1]  [,2]
## [1,] 0.23 -1.13
## [2,] 0.73 -0.48
## [3,] 0.23 -0.24
# Getting the hidden layer:
(hidd_net = round(X_bias %*% W_h, 2))
##       [,1]  [,2]
## [1,]  1.09 -1.69
## [2,] -0.22 -0.82
## [3,]  0.68  0.34
## [4,] -0.63  1.21

  3. The activation is \(\tanh\), giving the output of the hidden layer (hidden out):

\[\text{hidden}_{\text{out}}=\tanh\left(\begin{bmatrix}1.09&-1.69\\-.22&-.82\\.68&.34\\-.63&1.21 \end{bmatrix}\right)=\begin{bmatrix} .8&-.93\\-.22&-.68\\.59&.33\\-.56&.84\end{bmatrix}\]

# Activation (tanh) of the hidden layer:
(hidd_out = round(tanh(hidd_net),2))
##       [,1]  [,2]
## [1,]  0.80 -0.93
## [2,] -0.22 -0.68
## [3,]  0.59  0.33
## [4,] -0.56  0.84

  4. The input of the outer layer (outer net) results from the matrix multiplication by the second set of weights \(W_o = W_{out} = W_2\), after adding a column of ones for the bias:

\[\text{outer}_{\text{net}}=\begin{bmatrix} .8&-.93&1\\-.22&-.68&1\\.59&.33&1\\-.56&.84&1\end{bmatrix}\begin{bmatrix}.22\\-.34\\.54\end{bmatrix}=\begin{bmatrix}1.03\\.72\\.56\\.13\end{bmatrix}\]

# Adding the bias AFTER the activation of the hidden layer:
(hidd_out_bias = cbind(hidd_out, rep(1, nrow(hidd_out))))
##       [,1]  [,2] [,3]
## [1,]  0.80 -0.93    1
## [2,] -0.22 -0.68    1
## [3,]  0.59  0.33    1
## [4,] -0.56  0.84    1
# Second matrix of weights (a given):
(W_out = matrix(c(.22,-.34,.54), nrow =3))
##       [,1]
## [1,]  0.22
## [2,] -0.34
## [3,]  0.54
(outer_net = round(hidd_out_bias %*% W_out,2))
##      [,1]
## [1,] 1.03
## [2,] 0.72
## [3,] 0.56
## [4,] 0.13

  5. The activation yields the outer out (the output of the outer layer):

\[\text{outer}_{\text{out}}=\tanh\left( \begin{bmatrix} 1.03\\.72\\.56\\.13 \end{bmatrix}\right) = \begin{bmatrix} .77\\.62\\.51\\.13 \end{bmatrix}\]

# Activation (tanh) of the outer layer:
(outer_out = round(tanh(outer_net),2))
##      [,1]
## [1,] 0.77
## [2,] 0.62
## [3,] 0.51
## [4,] 0.13

Backpropagation pass to update the weights of the outer layer \((W_2 = W_o)\):

  1. Computing the error simply as \(E_{outer}=T - Y_{outer}\), the difference between the target and the output:

\[E_o=T-Y_0=\begin{bmatrix}-.9\\.9\\.9\\-.9\end{bmatrix} - \begin{bmatrix}.77\\.62\\.51\\.13\end{bmatrix}=\begin{bmatrix}-1.67\\.28\\.39\\-1.03\end{bmatrix}\]

# Simple calculation of the actual error at the end of forward pass:
(E_o = true - outer_out)
##       [,1]
## [1,] -1.67
## [2,]  0.28
## [3,]  0.39
## [4,] -1.03

  2. We want to update \(W_2\) by backpropagating the loss \((L)\) through the network. So we need to see how much the loss \((L)\) changes with changes in \(W_2\). This is \(\Delta_o\) (the output delta):

\[\frac{\partial L}{\partial W_2}=\Delta_0=\color{blue}{\large\frac{\partial \text{outer}_{input}}{\partial W_2}}\,\color{red}{\large \frac{\partial L}{\partial \text{outer}_{input}}}\tag 1\]

\[\color{red}{\delta_0}=\color{red}{\large \frac{\partial L}{\partial \text{outer}_{input}}} = \color{orange}{E_0} \circ \color{brown}{D_0}= \color{orange}{\frac{\partial L}{\partial \text{outer}_{output}}}\,\circ\color{brown}{\frac{\partial \text{outer}_{output}}{\partial \text{outer}_{input}}}\]

where \(E_0\) is the error calculated in step 1 above and \(D_0\) is the derivative of the \(\tanh\) activation in the outer layer.

The derivative of the \(\tanh\) is \(D_0 = 1 - Y_0^2\):

\[D_0 = 1 - Y_0^2=\begin{bmatrix}1\\1\\1\\1\end{bmatrix}-\begin{bmatrix}.77\\.62\\.51\\.13\end{bmatrix}^2=\begin{bmatrix}.41\\.62\\.74\\.98 \end{bmatrix}\]

# Calculating the changes in the activation (tanh) results wrt input outer layer:
(D_o = round(c(rep(1, nrow(outer_out))) - outer_out^2, 2))
##      [,1]
## [1,] 0.41
## [2,] 0.62
## [3,] 0.74
## [4,] 0.98
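
As a quick check (not part of the original example), the identity \(\frac{d}{dx}\tanh(x)=1-\tanh(x)^2\) can be verified numerically with a central finite difference; the test point \(x = 0.5\) is an arbitrary choice:

# Numerical check (sketch) that the derivative of tanh(x) equals 1 - tanh(x)^2:
x = 0.5      # arbitrary test point
h = 1e-6     # step size for the central difference
(numeric_deriv  = (tanh(x + h) - tanh(x - h)) / (2 * h))
(analytic_deriv = 1 - tanh(x)^2)
# Both values agree (approximately 0.786) up to numerical precision.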

And \(\circ\) stands for the Hadamard product.

\[\color{red}{\delta_0} = E_0\circ D_0 = E_0\circ \left(1 - Y_0^2 \right)= \begin{bmatrix}-1.67\\.28\\.39\\-1.03\end{bmatrix} \circ \begin{bmatrix}.41\\.62\\.74\\.98 \end{bmatrix} = \begin{bmatrix}-.68 \\ .17 \\.29 \\-1.01 \end{bmatrix}\]

# Calculating the change in the error wrt input of the outer layer (small delta):
(delta_o = round(E_o * D_o, 2))
##       [,1]
## [1,] -0.68
## [2,]  0.17
## [3,]  0.29
## [4,] -1.01

Now we can finally calculate (1), the big delta:

\[\begin{align}\Delta_0 = \color{blue}{\large\frac{\partial \text{outer}_{input}}{\partial W_2}}\,\color{red}{\delta_0}&= \text{hidden}_{out}^\top\quad\delta_0\\[2ex] &= \begin{bmatrix} .8 & -.22 & .59 & -.56\\ -.93 & -.68 &.33 & .84\\1&1&1&1\end{bmatrix} \begin{bmatrix} -.68 \\ .17 \\.29 \\-1.01\end{bmatrix}=\begin{bmatrix}.16\\-.24\\-1.23 \end{bmatrix} \end{align}\]

(Here \(\text{hidden}_{out}\) includes the bias column added after the activation.)

# Calculating loss wrt to W2 (Delta outer layer):
(Delta_o = round(t(hidd_out_bias) %*% delta_o, 2))
##       [,1]
## [1,]  0.16
## [2,] -0.24
## [3,] -1.23
  3. Updating weights:

With a learning rate of \(\eta = 0.03\), the weights are updated with:

\[W_2:= W_2 + \eta\Delta_0= \begin{bmatrix}.22\\-.34\\.54 \end{bmatrix}+0.03\,\begin{bmatrix} .16\\-.24\\-1.23\end{bmatrix}=\begin{bmatrix} .22\\-.35\\.5\end{bmatrix}\]

# learning rate:
eta = .03
# update weights:
(W_out_update = round(W_out + eta * Delta_o, 2))
##       [,1]
## [1,]  0.22
## [2,] -0.35
## [3,]  0.50

Backpropagation pass to update the weights of the hidden layer \((W_1 = W_h)\):

  1. Computing the error:

The error in the output layer can be propagated back to the output of the hidden layer:

\[\begin{align}E_H=E_h=E_{\text{hidden}}&=\frac{\partial L}{\partial \text{hidden}_{output}}\\[2ex] &=\color{orange}{\frac{\partial L}{\partial \text{outer}_{output}}}\,\circ\color{brown}{\frac{\partial \text{outer}_{output}}{\partial \text{outer}_{input}}}\,\frac{\partial\text{outer}_{input}}{\partial \text{hidden}_{output}}\\[2ex] &=\color{red}{\delta_0}\cdot W_2^\top= \left(\color{orange}{E_0}\circ \color{brown}{D_0}\right) W_2^\top\\[2ex] &=\begin{bmatrix} -.68\\.17\\.29\\-1.01\end{bmatrix}\begin{bmatrix}.22&-.34\end{bmatrix} =\color{aqua}{\begin{bmatrix} -.15&.23\\.04&-.06\\.06&-.1\\-.22&.34\end{bmatrix}} \end{align}\]

# It wouldn't be fair to blame the hidden layer for the bias 
# (introduced after activating the layer).
# Notice also that we use the original W2 or W outer weights - not the updated!

(W_out_minus_bias = W_out[-length(W_out)]) 
## [1]  0.22 -0.34
(E_h = round(delta_o %*% t(W_out_minus_bias), 2))
##       [,1]  [,2]
## [1,] -0.15  0.23
## [2,]  0.04 -0.06
## [3,]  0.06 -0.10
## [4,] -0.22  0.34

  2. Calculating delta for the hidden layer (change in the error wrt the input of the hidden layer):

First we need to calculate the derivative of the activation function:

\[\begin{align} D_H &= \frac{\partial\text{hidden}_{output}}{\partial\text{hidden}_{input}}\\[2ex] &= 1 - Y_H^2\\[2ex] &= \begin{bmatrix}1&1\\1&1\\1&1\\1&1 \end{bmatrix} - \begin{bmatrix} .8&-.93\\-.22&-.68\\.59&.33\\-.56&.84\end{bmatrix}^2 = \begin{bmatrix}.36&.14\\.95&.54\\.65&.89\\.69&.29 \end{bmatrix} \end{align}\]

(D_h = round(c(rep(1, nrow(hidd_out))) - hidd_out^2, 2))
##      [,1] [,2]
## [1,] 0.36 0.14
## [2,] 0.95 0.54
## [3,] 0.65 0.89
## [4,] 0.69 0.29

… and the change in the error wrt the input of the hidden layer (small \(\delta\)):

\[\begin{align}\color{red}{\delta_H}&=\color{orange}{\frac{\partial L}{\partial \text{hidden}_{output}}}\,\circ\color{brown}{\frac{\partial \text{hidden}_{output}}{\partial\text{hidden}_{input}}}\\[2ex] &=\color{orange}{E_H}\circ \color{brown}{D_H}\\[2ex] &=E_h \circ \left(1 - Y_H^2\right)\\[2ex] &=\color{aqua}{\begin{bmatrix}-.15&.23\\.04&-.06\\.06&-.1\\-.22&.34\end{bmatrix}}\circ\begin{bmatrix}.36&.14\\.95&.54\\.65&.89\\.69&.29\end{bmatrix}=\begin{bmatrix}-.05&.03\\.04&-.03\\.04&-.09\\-.15&.1\end{bmatrix} \end{align}\]

# Calculating small delta of the hidden layer:
(delta_h = round(E_h * D_h, 2))
##       [,1]  [,2]
## [1,] -0.05  0.03
## [2,]  0.04 -0.03
## [3,]  0.04 -0.09
## [4,] -0.15  0.10
  3. Compute the big delta weight matrix (change in error wrt \(W_1\)):

\[\begin{align}\Delta_H&=\color{orange}{\frac{\partial L}{\partial \text{hidden}_{output}}}\,\circ\,\color{brown}{\frac{\partial \text{hidden}_{output}}{\partial\text{hidden}_{input}}}\frac{\partial \text{hidden}_{input}}{\partial W_1}\\[2ex] &=I^\top \cdot \delta_H\\[2ex] &=\begin{bmatrix}.9&.9&-.9&-.9\\.9&-.9&.9&-.9\\1&1&1&1 \end{bmatrix}\cdot\begin{bmatrix}-.05&.03\\.04&-.03\\.04&-.09\\-.15&.1 \end{bmatrix}=\begin{bmatrix}.09&-.01\\.09&-.12\\-.12&.01\end{bmatrix} \end{align}\]

# Big Delta for the W1 or weights in the hidden layer (W Hidden):
(Delta_h = round(t(X_bias) %*% delta_h, 2))
##       [,1]  [,2]
## [1,]  0.09 -0.01
## [2,]  0.09 -0.12
## [3,] -0.12  0.01
  4. Update \(W_1\) using the delta weight change matrix:

\[W_1 := W_1 + \eta \Delta_H\]

\[W_1:= \begin{bmatrix}.23&-1.13\\.73&-.48\\.23&-.24\end{bmatrix}+0.03 \begin{bmatrix}.09&-.01\\.09&-.12\\-.12&.01\end{bmatrix}\]

With \(\eta = 0.03\) the adjustments are at most about \(0.03 \times 0.12 \approx 0.004\), so after rounding to two decimals the updated weights look identical to the original \(W_1\):

# Updating the first set of weights:
(W_h_update = round(W_h + eta * Delta_h, 2))
##      [,1]  [,2]
## [1,] 0.23 -1.13
## [2,] 0.73 -0.48
## [3,] 0.23 -0.24
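
The two updates above complete one training iteration. As a minimal sketch (not part of the original example), the forward and backward passes can be wrapped in a loop and repeated, reusing the objects already defined (X_bias, true, W_h, W_out, eta); the 1,000 epochs and the names W1 and W2 are arbitrary choices:

# Sketch of a full training loop built from the steps above (no rounding inside):
W1 = W_h                          # start from the given first set of weights
W2 = W_out                        # start from the given second set of weights
for (epoch in 1:1000) {           # number of epochs is an arbitrary choice
  # Forward pass:
  h_out      = tanh(X_bias %*% W1)
  h_out_bias = cbind(h_out, rep(1, nrow(h_out)))
  y          = tanh(h_out_bias %*% W2)
  # Backward pass:
  delta_o = (true - y) * (1 - y^2)              # E_o * D_o
  Delta_o = t(h_out_bias) %*% delta_o
  W2_no_bias = W2[-length(W2)]                  # exclude the bias weight
  delta_h = (delta_o %*% t(W2_no_bias)) * (1 - h_out^2)   # E_h * D_h
  Delta_h = t(X_bias) %*% delta_h
  # Weight updates:
  W2 = W2 + eta * Delta_o
  W1 = W1 + eta * Delta_h
}
round(tanh(cbind(tanh(X_bias %*% W1), 1) %*% W2), 2)  # predictions should approach the targets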


Example with Logistic activation function:

From this source.

For this tutorial, we’re going to use a neural network with two inputs, \(i_1\) and \(i_2\), two hidden neurons, \(h_1\) and \(h_2\), and two output neurons, \(o_1\) and \(o_2\). Additionally, the hidden and output neurons will include a bias.

Here’s the basic structure:

In order to have some numbers to work with, here are the \(\color{red}{\text{initial weights}}\), the \(\color{orange}{\text{biases}}\), and \(\color{blue}{\text{training inputs/outputs}}\):

The goal of backpropagation is to optimize the weights so that the neural network can learn how to correctly map arbitrary inputs to outputs.

For the rest of this tutorial we’re going to work with a single training set: given inputs 0.05 and 0.10, we want the neural network to output 0.01 and 0.99.

Training set: \(\{0.05, 0.10 \}\mapsto \{0.01, 0.99\}\).


The Forward Pass:

To begin, let's see what the neural network currently predicts given the weights and biases above and inputs of 0.05 and 0.10. To do this we'll feed those inputs forward through the network.

We figure out the total net input to each hidden layer neuron, squash the total net input using an activation function (here we use the logistic function), then repeat the process with the output layer neurons.

Total net input is also referred to as just net input by some sources.
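
Since the figure with the initial weights and biases is not reproduced here, below is only a generic R sketch of this forward pass for the 2-2-2 architecture; the weight and bias values are hypothetical placeholders, not the values from the tutorial's figure:

# Generic forward pass sketch with the logistic activation (placeholder weights):
logistic = function(x) 1 / (1 + exp(-x))   # the logistic (sigmoid) function

i = c(0.05, 0.10)                                    # inputs i1, i2 from the training set
W_hidden = matrix(c(0.1, 0.3, 0.2, 0.4), nrow = 2)   # hypothetical weights, 2 inputs -> 2 hidden
b_hidden = 0.5                                       # hypothetical hidden-layer bias
W_output = matrix(c(0.6, 0.8, 0.7, 0.9), nrow = 2)   # hypothetical weights, 2 hidden -> 2 outputs
b_output = 0.5                                       # hypothetical output-layer bias

(h_net = as.vector(i %*% W_hidden) + b_hidden)       # total net input of h1, h2
(h_out = logistic(h_net))                            # squashed with the logistic function
(o_net = as.vector(h_out %*% W_output) + b_output)   # total net input of o1, o2
(o_out = logistic(o_net))                            # current predictions for o1, o2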

Here’s how we calculate the total net input for \(h_1\):