Back Propagation
We introduce backpropagation with NumPy and PyTorch, respectively.
If you have any questions or suggestions about backpropagation with NumPy, contact Jiaxin Zhuang (zhuangjx5@mail2.sysu.edu.cn).
1. Simple expressions and interpretation of the gradient
1.1 Simple expressions
Let's start simple so that we can develop the notation and conventions for more complex expressions. Consider a simple multiplication function of two numbers $f(x,y)=xy$. It is a matter of simple calculus to derive the partial derivative for either input:
# set some inputs
x = -2; y = 5
# perform the forward pass
f = x * y # f becomes -10
# perform the backward pass (backpropagation) in reverse order:
# backprop through f = x * y
dfdx = y # df/dx = y, so gradient on x becomes 5
print("gradient on x is {:2}".format(dfdx))
dfdy = x # df/dy = x, so gradient on y becomes -2
print('gradient on y is {:2}'.format(dfdy))
gradient on x is 5
gradient on y is -2
1.2 Interpretation of the gradient
Interpretation: the derivative tells you the rate of change of the function with respect to that variable in an infinitesimally small region around a particular point. In other words, the derivative on each variable tells you the sensitivity of the whole expression to its value. As mentioned, the gradient $\nabla f$ is the vector of partial derivatives, so we have $\nabla f = [\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}] = [y, x]$.
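As a quick sanity check (an illustrative sketch, not part of the original exercise; the helper f_mul and the step size h = 1e-5 are arbitrary choices), the analytic gradient can be compared against a centered-difference estimate, which should print values close to y = 5 and x = -2:
# numerical gradient check with centered differences (illustrative sketch)
h = 1e-5
f_mul = lambda x, y: x * y
x, y = -2.0, 5.0
num_dfdx = (f_mul(x + h, y) - f_mul(x - h, y)) / (2 * h)  # should be close to y = 5
num_dfdy = (f_mul(x, y + h) - f_mul(x, y - h)) / (2 * h)  # should be close to x = -2
print(num_dfdx, num_dfdy)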
2. Compound expressions with chain rule
2.1 Simple examples for chain rule
Let's now consider more complicated expressions that involve multiple composed functions, such as $f(x,y,z) = (x + y) z$.
This expression is still simple enough to differentiate directly, but we’ll take a particular approach to it that will be helpful with understanding the intuition behind backpropagation.
In particular, note that this expression can be broken down into two expressions: $q=x+y$ and $f=qz$. As seen in the previous section, $f$ is just multiplication of $q$ and $z$, so $\frac{\partial f}{\partial q} = z, \frac{\partial f}{\partial z} = q$, and $q$ is addition of $x$ and $y$ so $\frac{\partial q}{\partial x} = 1, \frac{\partial q}{\partial y} = 1$.
However, we don’t necessarily care about the gradient on the intermediate value $q$ - the value of $\frac{\partial f}{\partial q}$ is not useful. Instead, we are ultimately interested in the gradient of $f$ with respect to its inputs $x$,$y$,$z$.
The chain rule tells us that the correct way to “chain” these gradient expressions together is through multiplication. For example, $\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \frac{\partial q}{\partial x}$. In practice this is simply a multiplication of the two numbers that hold the two gradients. Let's see this with an example (note that the code cell below uses the slightly modified intermediate $q = 2x + y$, so there $\frac{\partial q}{\partial x} = 2$):
# set some inputs
x = -2; y = 5; z = -4
# perform the forward pass
q = 2*x + y # q becomes 1
f = q * z # f becomes -4
print(q, f)
1 -4
# perform the backward pass (backpropagation) in reverse order:
# first backprop through f = q * z = (2*x+y) * z
dfdz = q # df/dz = q, so gradient on z becomes 1
dfdq = z # df/dq = z, so gradient on q becomes -4
# now backprop through q = 2*x + y
dfdx = 2.0 * dfdq # dq/dx = 2. And the multiplication here is the chain rule!
dfdy = 1.0 * dfdq # dq/dy = 1
print('df/dx is {:2}'.format(dfdx))
print('df/dy is {:2}'.format(dfdy))
df/dx is -8.0
df/dy is -4.0
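The same numerical check also confirms the chained gradients (again an illustrative sketch, using the modified expression f = (2x + y)z from the code cell above); the three printed values should be close to -8, -4 and 1:
# centered-difference check of df/dx, df/dy, df/dz for f = (2*x + y) * z (illustrative)
h = 1e-5
f_xyz = lambda x, y, z: (2 * x + y) * z
x, y, z = -2.0, 5.0, -4.0
print((f_xyz(x + h, y, z) - f_xyz(x - h, y, z)) / (2 * h))  # ~ -8.0, matches dfdx
print((f_xyz(x, y + h, z) - f_xyz(x, y - h, z)) / (2 * h))  # ~ -4.0, matches dfdy
print((f_xyz(x, y, z + h) - f_xyz(x, y, z - h)) / (2 * h))  # ~  1.0, matches dfdz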
2.2 Intuitive understanding of backpropagation
Notice that backpropagation is a beautifully local process. Every gate in a circuit diagram gets some inputs and can right away compute two things:
- its output value and
- the local gradient of its output with respect to its inputs (see the small gate sketch below).
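A minimal sketch of this local view (the MultiplyGate class and its method names are illustrative, not part of the exercise code): during the forward pass the gate caches its inputs, and during the backward pass it multiplies the upstream gradient by its local gradients, exactly as the chain rule prescribes.
class MultiplyGate(object):
    """Illustrative gate computing f = x * y and its local gradients."""
    def forward(self, x, y):
        self.x, self.y = x, y      # cache the inputs for the backward pass
        return x * y
    def backward(self, dout):
        dx = self.y * dout         # local gradient df/dx = y, chained with the upstream gradient
        dy = self.x * dout         # local gradient df/dy = x, chained with the upstream gradient
        return dx, dy

gate = MultiplyGate()
out = gate.forward(-2, 5)          # -10
dx, dy = gate.backward(1.0)        # (5.0, -2.0)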
3. Practice: Writing a simple Feedforward Neural Network
3.1 Outline
We will implement a simple feedforward neural network using NumPy. To do that, we need to define the network and implement both the forward pass and the backward propagation.
- Define a simple feedforward neural network with 1 hidden layer, and implement its forward and backward passes.
- Load data from local csv files with pandas; the files contain training and testing points generated from 3 different Gaussian distributions (different means and standard deviations).
- Define some functions for visualization and training.
- Train the model and make predictions every epoch.
- Plot the distribution of the points' labels and the predictions.
# Load necessary module for later
import numpy as np
import pandas as pd
np.random.seed(1024)
3.2 Define a Feedforward Neural Network, implement forward and backward
A simple Neural Network with 1 hidden layer.
Network structure:

 | Input | Weights | Output
---|---|---|---
Hidden layer | [batch_size, 2] | [2, 5] | [batch_size, 5]
Activation (sigmoid) | [batch_size, 5] | - | [batch_size, 5]
Classification layer | [batch_size, 5] | [5, 3] | [batch_size, 3]
Activation (sigmoid) | [batch_size, 3] | - | [batch_size, 3]
In the training and testing data, each point lies in two-dimensional space and belongs to one of three categories, so the labels (and the predictions) are represented as one-hot vectors such as [0 0 1], [1 0 0], [0 1 0].
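A quick shape check (an illustrative sketch with randomly generated dummy inputs; the batch size of 4 is an arbitrary choice) confirms how the two matrix multiplications map the shapes listed above:
# dummy forward pass to verify the shapes: [4, 2] x [2, 5] -> [4, 5], then [4, 5] x [5, 3] -> [4, 3]
x_dummy = np.random.randn(4, 2)
w1_dummy = np.random.randn(2, 5)
w2_dummy = np.random.randn(5, 3)
hidden = 1 / (1 + np.exp(-x_dummy.dot(w1_dummy)))   # sigmoid of [4, 5]
output = 1 / (1 + np.exp(-hidden.dot(w2_dummy)))    # sigmoid of [4, 3]
print(hidden.shape, output.shape)                   # (4, 5) (4, 3)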
w1_initialization = np.random.randn(2, 5)
w2_initialization = np.random.randn(5, 3)
w2_initialization
array([[-0.06510141, 0.80681666, -0.5778176 ],
[ 0.57306064, -0.33667496, 0.29700734],
[-0.37480416, 0.15510474, 0.70485719],
[ 0.8452178 , -0.65818079, 0.56810558],
[ 0.51538125, -0.61564998, 0.92611427]])
class FeedForward_Neural_Network(object):
def __init__(self, learning_rate):
self.input_channel = 2 # number of input neurons
self.output_channel = 3 # number of output neurons
self.hidden_channel = 5 # number of hidden neurons
self.learning_rate = learning_rate
# weights initialization
        # Usually, we use random or uniform initialization to initialize the weights
        # For simplicity, here we use the same pre-generated arrays to initialize them
# np.random.randn(self.input_channel, self.hidden_channel)
# (2x5) weight matrix from input to hidden layer
self.weight1 = np.array([[ 2.12444863, 0.25264613, 1.45417876, 0.56923979, 0.45822365],
[-0.80933344, 0.86407349, 0.20170137, -1.87529904, -0.56850693]])
# (5x3) weight matrix from hidden to output layer
# np.random.randn(self.hidden_channel, self.output_channel)
self.weight2 = np.array([ [-0.06510141, 0.80681666, -0.5778176 ],
[ 0.57306064, -0.33667496, 0.29700734],
[-0.37480416, 0.15510474, 0.70485719],
[ 0.8452178 , -0.65818079, 0.56810558],
[ 0.51538125, -0.61564998, 0.92611427]])
def forward(self, X):
"""forward propagation through our network
"""
        # dot product of X (input) and the first set of (2x5) weights
self.h1 = np.dot(X, self.weight1)
# activation function
self.z1 = self.sigmoid(self.h1)
        # dot product of the hidden layer output (z1) and the second set of (5x3) weights
self.h2 = np.dot(self.z1, self.weight2)
# final activation function
o = self.sigmoid(self.h2)
return o
def backward(self, X, y, o):
"""Backward, compute gradient and update parameters
Inputs:
X: data, [batch_size, 2]
y: label, one-hot vector, [batch_size, 3]
o: predictions, [batch_size, 3]
"""
# backward propgate through the network
self.o_error = y - o # error in output
        # applying the derivative of the sigmoid to the output error
self.o_delta = self.o_error * self.sigmoid_prime(o)
# z1 error: how much our hidden layer weights contributed to output error
self.z1_error = self.o_delta.dot(self.weight2.T)
# applying derivative of sigmoid to z1 error
self.z1_delta = self.z1_error * self.sigmoid_prime(self.z1)
# adjusting first set (input --> hidden) weights
self.weight1 += X.T.dot(self.z1_delta) * self.learning_rate
# adjusting second set (hidden --> output) weights
self.weight2 += self.z1.T.dot(self.o_delta) * self.learning_rate
def sigmoid(self, s):
"""activation function
"""
return 1 / (1 + np.exp(-s))
def sigmoid_prime(self, s):
"""derivative of sigmoid
"""
return s * (1 - s)
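Before touching real data, a quick smoke test (purely illustrative; the dummy inputs and one-hot labels here are randomly generated) exercises one forward/backward cycle of the class defined above:
# one forward/backward step on random dummy data (illustrative only)
toy_model = FeedForward_Neural_Network(learning_rate=0.1)
x_toy = np.random.randn(4, 2)                        # [batch_size, 2]
y_toy = np.eye(3)[np.random.randint(0, 3, size=4)]   # random one-hot labels, [batch_size, 3]
o_toy = toy_model.forward(x_toy)                     # predictions, [batch_size, 3]
toy_model.backward(x_toy, y_toy, o_toy)              # updates weight1 and weight2 in place
print(o_toy.shape)                                   # (4, 3)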
3.3 Loading Data from Local CSV Files Using Pandas
# Import Module
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import math
train_csv_file = './labels/train.csv'
test_csv_file = './labels/test.csv'
# Load data from csv file, without header
train_frame = pd.read_csv(train_csv_file, encoding='utf-8', header=None)
test_frame = pd.read_csv(test_csv_file, encoding='utf-8', header=None)
# show data in Dataframe format (defined in pandas)
train_frame
 | 0 | 1 | 2 |
---|---|---|---|
0 | 11.834241 | 11.866105 | 1 |
1 | 8.101150 | 9.324800 | 1 |
2 | 11.184679 | 1.196726 | 2 |
3 | 8.911888 | -0.044024 | 2 |
4 | 9.863982 | 0.151162 | 2 |
5 | 9.427897 | -0.598807 | 2 |
6 | 10.038352 | 2.133938 | 2 |
7 | 11.149009 | -0.726649 | 2 |
8 | 9.041540 | 2.972213 | 2 |
9 | 13.413336 | -3.174030 | 2 |
10 | -0.385824 | 0.388751 | 0 |
11 | -0.192905 | 1.562469 | 0 |
12 | 10.735249 | 7.702754 | 1 |
13 | -3.024363 | 2.518729 | 0 |
14 | 10.694739 | 11.442958 | 1 |
15 | 10.672035 | 0.163851 | 2 |
16 | 9.717515 | -0.673383 | 2 |
17 | 7.757028 | -2.540235 | 2 |
18 | 0.195954 | 0.843201 | 0 |
19 | 10.359054 | 11.489937 | 1 |
20 | 10.245470 | 10.873774 | 1 |
21 | 9.767327 | 9.450749 | 1 |
22 | 12.402497 | 11.861342 | 1 |
23 | 0.980769 | -1.524264 | 0 |
24 | -2.113837 | 2.111235 | 0 |
25 | 0.076416 | 0.650588 | 0 |
26 | 0.670296 | -0.344045 | 0 |
27 | 10.452718 | 9.419734 | 1 |
28 | 10.647860 | 8.271140 | 1 |
29 | -0.095686 | 2.692840 | 0 |
... | ... | ... | ... |
180 | 0.239345 | -2.378022 | 0 |
181 | 1.497582 | -2.700999 | 0 |
182 | -0.471785 | 0.856114 | 0 |
183 | 13.690628 | 11.552953 | 1 |
184 | 10.652533 | 10.357309 | 1 |
185 | 8.714084 | 9.839341 | 1 |
186 | 12.177913 | 10.932641 | 1 |
187 | 10.049335 | 8.478106 | 1 |
188 | 1.370425 | 2.321562 | 0 |
189 | 2.189643 | 0.012325 | 0 |
190 | 7.425213 | 10.904103 | 1 |
191 | 6.836717 | 10.750923 | 1 |
192 | 8.911069 | 11.032682 | 1 |
193 | 8.819191 | 11.310835 | 1 |
194 | -0.807627 | -1.435569 | 0 |
195 | -1.687238 | 1.345539 | 0 |
196 | 9.856732 | 10.116610 | 1 |
197 | 9.648434 | 8.059552 | 1 |
198 | -0.223917 | 1.003647 | 0 |
199 | 10.004307 | 8.482203 | 1 |
200 | 12.090931 | 9.942670 | 1 |
201 | 10.983798 | 10.193395 | 1 |
202 | 0.109491 | -1.238625 | 0 |
203 | -1.068244 | -0.996179 | 0 |
204 | 0.341772 | -0.582299 | 0 |
205 | -1.344687 | -0.894215 | 0 |
206 | -0.711753 | -2.676756 | 0 |
207 | -0.625906 | -2.659784 | 0 |
208 | 9.685143 | 10.292463 | 1 |
209 | 9.921518 | 12.654102 | 1 |
210 rows × 3 columns
# obtain data from specific columns
# obtain data from the first and second columns and convert to an ndarray
train_data = train_frame.iloc[:,0:2].values
# obtain labels from the third column and convert to an ndarray
train_labels = train_frame.iloc[:,2].values
# obtain data from the first and second columns and convert to an ndarray
test_data = test_frame.iloc[:,0:2].values
# obtain labels from the third column and convert to an ndarray
test_labels = test_frame.iloc[:,2].values
# train & test data shape
print(train_data.shape)
print(test_data.shape)
# train & test labels shape
print(train_labels.shape)
print(test_labels.shape)
(210, 2)
(90, 2)
(210,)
(90,)
3.4 Define some functions for visualization and training
def plot(data, labels, caption):
"""plot the data distribution, !!YOU CAN READ THIS LATER, if you are interested
"""
colors = cm.rainbow(np.linspace(0, 1, len(set(labels))))
for i in set(labels):
xs = []
ys = []
for index, label in enumerate(labels):
if label == i:
xs.append(data[index][0])
ys.append(data[index][1])
        plt.scatter(xs, ys, color=colors[int(i)])  # pass the colour by keyword; scatter's third positional argument is the marker size
plt.title(caption)
plt.xlabel('x')
plt.ylabel('y')
plt.show()
plot(train_data, train_labels, 'train_dataset')
plot(test_data, test_labels, 'test_dataset')
def int2onehot(label):
"""conver labels into one-hot vector, !!YOU CAN READ THIS LATER, if you are interested
Args:
label: [batch_size]
Returns:
onehot: [batch_size, categories]
"""
dims = len(set(label))
imgs_size = len(label)
onehot = np.zeros((imgs_size, dims))
onehot[np.arange(imgs_size), label] = 1
return onehot
# convert labels into one hot vector
train_labels_onehot = int2onehot(train_labels)
test_labels_onehot = int2onehot(test_labels)
print(train_labels_onehot.shape)
print(test_labels_onehot.shape)
(210, 3)
(90, 3)
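For example (a tiny illustrative call, not part of the original notebook), three integer labels 0, 2, 1 are mapped to the rows of the corresponding one-hot matrix:
# small illustration of the one-hot encoding
int2onehot(np.array([0, 2, 1]))
# expected result:
# array([[1., 0., 0.],
#        [0., 0., 1.],
#        [0., 1., 0.]])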
def get_accuracy(predictions, labels):
"""Compute accuracy, !!YOU CAN READ THIS LATER, if you are interested
Inputs:
predictions:[batch_size, categories] one-hot vector
labels: [batch_size, categories]
"""
predictions = np.argmax(predictions, axis=1)
labels = np.argmax(labels, axis=1)
all_imgs = len(labels)
predict_true = np.sum(predictions == labels)
return predict_true/all_imgs
# Please read this function carefully, related to implementation of GD, SGD, and mini-batch
def generate_batch(train_data, train_labels, batch_size):
"""Generate batch
when batch_size=len(train_data), it's GD
when batch_size=1, it's SGD
when batch_size>1 & batch_size<len(train_data), it's mini-batch, usually, batch_size=2,4,8,16...
"""
iterations = math.ceil(len(train_data)/batch_size)
for i in range(iterations):
index_from = i*batch_size
index_end = (i+1)*batch_size
yield (train_data[index_from:index_end], train_labels[index_from:index_end])
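# An illustrative example (not part of the original notebook; batch_size = 64 is an arbitrary choice):
# one pass over the 210 training points yields math.ceil(210/64) = 4 batches, the last one holding the remaining 18 points.
for index, (xs, ys) in enumerate(generate_batch(train_data, train_labels_onehot, 64)):
    print(index, xs.shape, ys.shape)
# 0 (64, 2) (64, 3)
# 1 (64, 2) (64, 3)
# 2 (64, 2) (64, 3)
# 3 (18, 2) (18, 3)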
def show_curve(ys, title):
"""plot curve for Loss and Accuacy, !!YOU CAN READ THIS LATER, if you are interested
Args:
ys: loss or acc list
title: Loss or Accuracy
"""
x = np.array(range(len(ys)))
y = np.array(ys)
plt.plot(x, y, c='b')
plt.axis()
plt.title('{} Curve:'.format(title))
plt.xlabel('Epoch')
plt.ylabel('{} Value'.format(title))
plt.show()
3.5 Training the model and making predictions
learning_rate = 0.1
epochs = 400 # training epoch
batch_size = len(train_data) # GD
# batch_size = 1 # SGD
# batch_size = 8 # mini-batch
model = FeedForward_Neural_Network(learning_rate) # declare a simple feedforward neural model
losses = []
accuracies = []
for i in range(epochs):
loss = 0
for index, (xs, ys) in enumerate(generate_batch(train_data, train_labels_onehot, batch_size)):
predictions = model.forward(xs) # forward phase
loss += 1/2 * np.mean(np.sum(np.square(ys-predictions), axis=1)) # Mean square error
model.backward(xs, ys, predictions) # backward phase
losses.append(loss)
# train dataset acc computation
predictions = model.forward(train_data)
# compute acc on train dataset
accuracy = get_accuracy(predictions, train_labels_onehot)
accuracies.append(accuracy)
if i % 50 == 0:
print('Epoch: {}, has {} iterations'.format(i, index+1))
print('\tLoss: {:.4f}, \tAccuracy: {:.4f}'.format(loss, accuracy))
test_predictions = model.forward(test_data)
# compute acc on test dataset
test_accuracy = get_accuracy(test_predictions, test_labels_onehot)
print('Test Accuracy: {:.4f}'.format(test_accuracy))
Epoch: 0, has 1 iterations
Loss: 0.4185, Accuracy: 0.3381
Epoch: 50, has 1 iterations
Loss: 0.0309, Accuracy: 0.9571
Epoch: 100, has 1 iterations
Loss: 0.0334, Accuracy: 0.9714
Epoch: 150, has 1 iterations
Loss: 0.0233, Accuracy: 1.0000
Epoch: 200, has 1 iterations
Loss: 0.0044, Accuracy: 1.0000
Epoch: 250, has 1 iterations
Loss: 0.0955, Accuracy: 0.8286
Epoch: 300, has 1 iterations
Loss: 0.0322, Accuracy: 0.9667
Epoch: 350, has 1 iterations
Loss: 0.0151, Accuracy: 0.9476
Test Accuracy: 0.9111
3.6 Show results
# Draw losses curve using losses
show_curve(losses, 'Loss')
# Draw Accuracy curve using accuracies
show_curve(accuracies, 'Accuracy')
4. Problems
4.1 Problem 1
Describe the training procedure, based on codes above.
The procedure above trains a simple feedforward neural network to classify 2-D points into three categories.
4.1.1 Data Process
We use pandas to load the csv files and convert the raw data into DataFrame format.
4.1.1.1 Files
label.csv contains all observations of the whole dataset, which is then split into two parts, train.csv and test.csv. train.csv is the training set of this neural network: it contains 210 two-dimensional observations, each with a label ranging from 0 to 2. test.csv is the test set of this neural network: it contains 90 two-dimensional observations, each with a label ranging from 0 to 2.
4.1.1.2 Data and Labels
After the I/O, we further split the raw data into two parts, i.e. the data points and the labels. From the plots above, we can clearly see the whole dataset in a 2-D figure, where three clusters emerge according to the labels.
In this first run we use only one iteration per epoch; in other words, the batch is the whole training set (batch gradient descent). We train for 400 epochs, meaning the training data is used 400 times, and we print the loss and accuracy of the network every 50 epochs. The learning rate $\alpha$ is 0.1.
4.1.1.3 One-Hot
One-hot encoding is a useful trick: it treats each possible state as a separate bit (feature).
- Advantages: firstly, it solves the problem that many classifiers cannot handle discrete data well; secondly, to some extent it also expands the feature representation (here, the single integer label column is expanded into 3 one-hot columns).
- Disadvantages: for text features in particular the shortcomings are prominent. Firstly, it amounts to a bag-of-words model that ignores the order between words (word order carries important information); secondly, it assumes that words are mutually independent (in most cases words interact with each other); lastly, the features it produces are discrete and sparse.
4.1.2 Neural Network
Our neural network takes 2-D input data, has 5 neurons in the hidden layer, and classifies the data into 3 classes. Note that although weight matrices are usually initialized randomly, for simplicity we use two pre-set matrices here.
4.1.2.1 Forward
The neural network predicts labels through the forward procedure. We multiply the original input by weight1 and apply the sigmoid activation to obtain the intermediate (hidden) representation; we then multiply that intermediate result by weight2 and apply the sigmoid again to obtain the final prediction.
4.1.2.2 Backward
Notice that the training set carries the crucial information: the labels. We calculate the mean-square error and use the gradient descent method to minimize it. The negative gradient gives the direction in which the MSE decreases fastest, and the weights are updated along this direction at each step.
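In symbols (a sketch of the gradients the backward code implements, writing $\sigma$ for the sigmoid, $z_1 = \sigma(XW_1)$ and $o = \sigma(z_1 W_2)$ as in the forward pass, and $\odot$ for element-wise multiplication), the loss $L = \frac{1}{2}\lVert y - o\rVert^2$ gives

$$\delta_o = -(y - o)\odot o\,(1-o), \qquad \frac{\partial L}{\partial W_2} = z_1^{\top}\delta_o, \qquad \delta_1 = \big(\delta_o W_2^{\top}\big)\odot z_1(1 - z_1), \qquad \frac{\partial L}{\partial W_1} = X^{\top}\delta_1,$$

with the update $W \leftarrow W - \alpha\,\frac{\partial L}{\partial W}$. The code folds the minus sign into o_error = y - o, which is why it can use += when adjusting weight1 and weight2; it also sums the per-sample gradients over the batch rather than averaging them, which only rescales the effective learning rate.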
4.1.3 Test
To verify the correctness of our method, we plot the curves over the 400 epochs. We can see that, in general, the accuracy is increasing and the loss is decreasing. However, strangely, the loss curve changes abruptly around epochs 150, 300 and 350. Why?
After further exploration, we find the cause: the learning step is too large, so the method overshoots and cannot settle at the optimum. That is a really interesting observation! We obtain better results in Problem 2 below.
4.2 Problem 2
Set learning rate = 0.01, train the model, and show the two curves below.
learning_rate = 0.01
epochs = 400 # training epoch
batch_size = len(train_data) # GD
# batch_size = 1 # SGD
# batch_size = 8 # mini-batch
model = FeedForward_Neural_Network(learning_rate) # declare a simple feedforward neural model
losses = []
accuracies = []
for i in range(epochs):
loss = 0
for index, (xs, ys) in enumerate(generate_batch(train_data, train_labels_onehot, batch_size)):
predictions = model.forward(xs) # forward phase
loss += 1/2 * np.mean(np.sum(np.square(ys-predictions), axis=1)) # Mean square error
model.backward(xs, ys, predictions) # backward phase
losses.append(loss)
# train dataset acc computation
predictions = model.forward(train_data)
# compute acc on train dataset
accuracy = get_accuracy(predictions, train_labels_onehot)
accuracies.append(accuracy)
if i % 50 == 0:
# print('Epoch: {}, has {} iterations'.format(i, index+1))
# print('\tLoss: {:.4f}, \tAccuracy: {:.4f}'.format(loss, accuracy))
pass
test_predictions = model.forward(test_data)
# compute acc on test dataset
test_accuracy = get_accuracy(test_predictions, test_labels_onehot)
print('Test Accuracy: {:.4f}'.format(test_accuracy))
# Draw losses curve using losses
show_curve(losses, 'Loss')
# Draw Accuracy curve using accuracies
show_curve(accuracies, 'Accuracy')
Test Accuracy: 1.0000
See? With the smaller learning rate the loss decreases steadily over the epochs and the accuracy rises to 1.0!
4.3 Problem 3
Use SGD and mini-batch training to train the model and show the four curves below.
4.3.1 SGD
learning_rate = 0.1
epochs = 400 # training epoch
# batch_size = len(train_data) # GD
batch_size = 1 # SGD
# batch_size = 8 # mini-batch
model = FeedForward_Neural_Network(learning_rate) # declare a simple feedforward neural model
losses = []
accuracies = []
for i in range(epochs):
loss = 0
for index, (xs, ys) in enumerate(generate_batch(train_data, train_labels_onehot, batch_size)):
predictions = model.forward(xs) # forward phase
loss += 1/2 * np.mean(np.sum(np.square(ys-predictions), axis=1)) # Mean square error
model.backward(xs, ys, predictions) # backward phase
losses.append(loss)
# train dataset acc computation
predictions = model.forward(train_data)
# compute acc on train dataset
accuracy = get_accuracy(predictions, train_labels_onehot)
accuracies.append(accuracy)
if i % 50 == 0:
# print('Epoch: {}, has {} iterations'.format(i, index+1))
# print('\tLoss: {:.4f}, \tAccuracy: {:.4f}'.format(loss, accuracy))
pass
test_predictions = model.forward(test_data)
# compute acc on test dataset
test_accuracy = get_accuracy(test_predictions, test_labels_onehot)
print('Test Accuracy: {:.4f}'.format(test_accuracy))
# Draw losses curve using losses
show_curve(losses, 'Loss')
# Draw Accuracy curve using accuracies
show_curve(accuracies, 'Accuracy')
Test Accuracy: 1.0000
4.3.2 Mini-Batch
learning_rate = 0.1
epochs = 400 # training epoch
# batch_size = len(train_data) # GD
# batch_size = 1 # SGD
batch_size = 8 # mini-batch
model = FeedForward_Neural_Network(learning_rate) # declare a simple feedforward neural model
losses = []
accuracies = []
for i in range(epochs):
loss = 0
for index, (xs, ys) in enumerate(generate_batch(train_data, train_labels_onehot, batch_size)):
predictions = model.forward(xs) # forward phase
loss += 1/2 * np.mean(np.sum(np.square(ys-predictions), axis=1)) # Mean square error
model.backward(xs, ys, predictions) # backward phase
losses.append(loss)
# train dataset acc computation
predictions = model.forward(train_data)
# compute acc on train dataset
accuracy = get_accuracy(predictions, train_labels_onehot)
accuracies.append(accuracy)
if i % 50 == 0:
# print('Epoch: {}, has {} iterations'.format(i, index+1))
# print('\tLoss: {:.4f}, \tAccuracy: {:.4f}'.format(loss, accuracy))
pass
test_predictions = model.forward(test_data)
# compute acc on test dataset
test_accuracy = get_accuracy(test_predictions, test_labels_onehot)
print('Test Accuracy: {:.4f}'.format(test_accuracy))
# Draw losses curve using losses
show_curve(losses, 'Loss')
# Draw Accuracy curve using accuracies
show_curve(accuracies, 'Accuracy')
Test Accuracy: 1.0000
From this perspective, we can see that one remedy for a problematic step size is to increase the number of weight updates per epoch. SGD and mini-batch training compensate for the problem in exactly this way.