Back Propagation

We introduce back propagation in numpy and pytorch respectively.

If you have some questions or suggestion about BackPropagation with Numpy, contact Jiaxin Zhuang or email(zhuangjx5@mail2.sysu.edu.cn)

1. Simple expressions and interpretation of the gradient

1.1 Simple expressions

Lets start simple so that we can develop the notation and conventions for more complex expressions. Consider a simple multiplication function of two numbers $f(x,y)=xy$. It is a matter of simple calculus to derive the partial derivative for either input:

$f(x,y) = x y \hspace{0.5in} \rightarrow \hspace{0.5in} \frac{\partial f}{\partial x} = y \hspace{0.5in} \frac{\partial f}{\partial y} = x$

# set some inputs
x1 = -2; x2 = 5;

# perform the forward pass
f = x1 * x2 # f becomes -10

# perform the backward pass (backpropagation) in reverse order:
# backprop through f = x * y
dfdx1 = x2 # df/dx = y, so gradient on x becomes 5
print("gradient on x is {:2}".format(dfdx1))
dfdx2 = x1 # df/dy = x, so gradient on y becomes -2
print('gradient on y is {:2}'.format(dfdx2))

gradient on x is  5
gradient on y is -2

1.2 interpretation of the gradient

Interpretation:The derivatives indicate the rate of change of a function with respect to that variable surrounding an infinitesimally small region near a particular point: $\frac{df(x)}{dx} = \lim_{h\ \to 0} \frac{f(x + h) - f(x)}{h}$ In other words, the derivative on each variable tells you the sensitivity of the whole expression on its value.As mentioned, the gradient $\nabla f$ is the vector of partial derivatives, so we have that $\nabla f = [\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}] = [y, x]$.

2. Compound expressions with chain rule

2.1 Simple examples for chain rule

Lets now start to consider more complicated expressions that involve multiple composed functions, such as $f(x,y,z) = (x + y) z$.

This expression is still simple enough to differentiate directly, but we’ll take a particular approach to it that will be helpful with understanding the intuition behind backpropagation.

In particular, note that this expression can be broken down into two expressions: $q=x+y$ and $f=qz$. As seen in the previous section,$f$ is just multiplication of $q$ and $z$, so $\frac{\partial f}{\partial q} = z, \frac{\partial f}{\partial z} = q$,and $q$ is addition of $x$ and $y$ so $\frac{\partial q}{\partial x} = 1, \frac{\partial q}{\partial y} = 1$.

However, we don’t necessarily care about the gradient on the intermediate value $q$ - the value of $\frac{\partial f}{\partial q}$ is not useful. Instead, we are ultimately interested in the gradient of $f$ with respect to its inputs $x$,$y$,$z$.

The chain rule tells us that the correct way to “chain” these gradient expressions together is through multiplication. For example, $\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \frac{\partial q}{\partial x}$. In practice this is simply a multiplication of the two numbers that hold the two gradients. Lets see this with an example:

# set some inputs
x = -2; y = 5; z = -4  

# perform the forward pass
q = 2*x + y # q becomes 1
f = q * z # f becomes -4
print(q, f)

1 -4

# perform the backward pass (backpropagation) in reverse order:
# first backprop through f = q * z = (2*x+y) * z
dfdz = q # df/dz = q, so gradient on z becomes 3
dfdq = z # df/dq = z, so gradient on q becomes -4
# now backprop through q = x + y
dfdx = 2.0 * dfdq # dq/dx = 2. And the multiplication here is the chain rule!
dfdy = 1.0 * dfdq # dq/dy = 1
print('df/dx is {:2}'.format(dfdx))
print('df/dy is {:2}'.format(dfdy))

df/dx is -8.0
df/dy is -4.0

2.2 Intuitive understanding of backpropagation

Notice that backpropagation is a beautifully local process. Every gate in a circuit diagram gets some inputs and can right away compute two things:

its output value and
the local gradient of its inputs with respect to its output value.

3. Practice: Writing a simple Feedforward Neural Network

3.1 Outline

We would implement a simple feedforward neural network by using numpy. Thus, we need to define network and implement the forward pass as well as the backword propagation.

Define a simpel feedforward neural netork, with 1 hidden layer. Implement forward and backward
Load data from local csv file with pandas, which contains some training and testing dots, generated by 3 different gaussian distribution.(different mean and std).
Define some functions for visualization and training
Training and predicting every epoch
plot the distribution of the points’ label and the predictions

# Load necessary module for later
import numpy as np
import pandas as pd
np.random.seed(1024)

3.2 Define a Feedforward Neural Netowk, implement forward and backward

A simple Neural Network with 1 hidden layer.

                                   Networks Structure
                         
                                    Input        Weights            Output
Hidden Layer                     [batch_size, 2] x [2,5]   ->   [batch_size, 5]
activation function(sigmoid)     [batch_size, 5]           ->   [batch_size, 5]
Classification Layer             [batch_size, 5] x [5,3]   ->   [batch_size, 3]
activation function(sigmoid)     [batch_size, 3]           ->   [batch_size, 3]

According to training and testing data. Each points is in two-dimension space, and there is three categories. And predictions would be a one-hot vector, like [0 0 1] , [1 0 0], [0 1 0]

w1_initialization = np.random.randn(2, 5)  
w2_initialization = np.random.randn(5, 3)  

w2_initialization

array([[-0.06510141,  0.80681666, -0.5778176 ],
       [ 0.57306064, -0.33667496,  0.29700734],
       [-0.37480416,  0.15510474,  0.70485719],
       [ 0.8452178 , -0.65818079,  0.56810558],
       [ 0.51538125, -0.61564998,  0.92611427]])

class FeedForward_Neural_Network(object):
    def __init__(self, learning_rate):
        self.input_channel = 2  #  number of input neurons
        self.output_channel = 3 #  number of output neurons
        self.hidden_channel = 5 # number of hidden neurons
        self.learning_rate = learning_rate
        
        # weights initialization
        # Usually, we use random or uniform initialzation to initialize weight
        # For simplicity, here we use same array to initialze 
#         np.random.randn(self.input_channel, self.hidden_channel) 
        # (2x5) weight matrix from input to hidden layer
        self.weight1 = np.array([[ 2.12444863,  0.25264613,  1.45417876,  0.56923979,  0.45822365],
                                 [-0.80933344,  0.86407349,  0.20170137, -1.87529904, -0.56850693]])
         
        # (5x3) weight matrix from hidden to output layer
#         np.random.randn(self.hidden_channel, self.output_channel)  
        self.weight2 = np.array([ [-0.06510141,  0.80681666, -0.5778176 ],
                               [ 0.57306064, -0.33667496,  0.29700734],
                               [-0.37480416,  0.15510474,  0.70485719],
                               [ 0.8452178 , -0.65818079,  0.56810558],
                               [ 0.51538125, -0.61564998,  0.92611427]])

    
    def forward(self, X):
        """forward propagation through our network
        """
        # dot product of X (input) and first set of 3x2 weights
        self.h1 = np.dot(X, self.weight1)  
        # activation function
        self.z1 = self.sigmoid(self.h1)  
        # dot product of hidden layer (z2) and second set of 3x1 weights
        self.h2 = np.dot(self.z1, self.weight2) 
        # final activation function
        o = self.sigmoid(self.h2)
        return o
    
    def backward(self, X, y, o):
        """Backward, compute gradient and update parameters
        Inputs:
            X: data, [batch_size, 2]
            y: label, one-hot vector, [batch_size, 3]
            o: predictions, [batch_size, 3]
        """
        # backward propgate through the network
        self.o_error = y - o  # error in output
         # applying derivative of sigmoid to error  delata L
        self.o_delta = self.o_error * self.sigmoid_prime(o) 

        # z1 error: how much our hidden layer weights contributed to output error
        self.z1_error = self.o_delta.dot(self.weight2.T)  
        # applying derivative of sigmoid to z1 error
        self.z1_delta = self.z1_error * self.sigmoid_prime(self.z1)  

        # adjusting first set (input --> hidden) weights
        self.weight1 += X.T.dot(self.z1_delta) * self.learning_rate  
        # adjusting second set (hidden --> output) weights
        self.weight2 += self.z1.T.dot(self.o_delta) * self.learning_rate 
        
    def sigmoid(self, s):
        """activation function
        """
        return 1 / (1 + np.exp(-s))

    def sigmoid_prime(self, s):
        """derivative of sigmoid
        """
        return s * (1 - s)

3.3 Loading Data From local csv by using Pandas

# Import Module
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import math

train_csv_file = './labels/train.csv'
test_csv_file = './labels/test.csv'
# Load data from csv file, without header
train_frame = pd.read_csv(train_csv_file, encoding='utf-8', header=None)
test_frame = pd.read_csv(test_csv_file, encoding='utf-8', header=None)

# show data in Dataframe format (defined in pandas)
train_frame

	0	1	2
0	11.834241	11.866105	1
1	8.101150	9.324800	1
2	11.184679	1.196726	2
3	8.911888	-0.044024	2
4	9.863982	0.151162	2
5	9.427897	-0.598807	2
6	10.038352	2.133938	2
7	11.149009	-0.726649	2
8	9.041540	2.972213	2
9	13.413336	-3.174030	2
10	-0.385824	0.388751	0
11	-0.192905	1.562469	0
12	10.735249	7.702754	1
13	-3.024363	2.518729	0
14	10.694739	11.442958	1
15	10.672035	0.163851	2
16	9.717515	-0.673383	2
17	7.757028	-2.540235	2
18	0.195954	0.843201	0
19	10.359054	11.489937	1
20	10.245470	10.873774	1
21	9.767327	9.450749	1
22	12.402497	11.861342	1
23	0.980769	-1.524264	0
24	-2.113837	2.111235	0
25	0.076416	0.650588	0
26	0.670296	-0.344045	0
27	10.452718	9.419734	1
28	10.647860	8.271140	1
29	-0.095686	2.692840	0
...	...	...	...
180	0.239345	-2.378022	0
181	1.497582	-2.700999	0
182	-0.471785	0.856114	0
183	13.690628	11.552953	1
184	10.652533	10.357309	1
185	8.714084	9.839341	1
186	12.177913	10.932641	1
187	10.049335	8.478106	1
188	1.370425	2.321562	0
189	2.189643	0.012325	0
190	7.425213	10.904103	1
191	6.836717	10.750923	1
192	8.911069	11.032682	1
193	8.819191	11.310835	1
194	-0.807627	-1.435569	0
195	-1.687238	1.345539	0
196	9.856732	10.116610	1
197	9.648434	8.059552	1
198	-0.223917	1.003647	0
199	10.004307	8.482203	1
200	12.090931	9.942670	1
201	10.983798	10.193395	1
202	0.109491	-1.238625	0
203	-1.068244	-0.996179	0
204	0.341772	-0.582299	0
205	-1.344687	-0.894215	0
206	-0.711753	-2.676756	0
207	-0.625906	-2.659784	0
208	9.685143	10.292463	1
209	9.921518	12.654102	1

210 rows × 3 columns

# obtain data from specific columns

# obtain data from first and second columns and convert into narray
train_data = train_frame.iloc[:,0:2].values 
# obtain labels from third columns and convert into narray
train_labels = train_frame.iloc[:,2].values 
# obtain data from first and second columns and convert into narray
test_data = test_frame.iloc[:,0:2].values
# obtain labels from third columns and convert into narray
test_labels = test_frame.iloc[:,2].values

# train & test data shape
print(train_data.shape)
print(test_data.shape)
# train & test labels shape
print(train_labels.shape)
print(test_labels.shape)

(210, 2)
(90, 2)
(210,)
(90,)

3.4 Define some function for visualization and training

def plot(data, labels, caption):
    """plot the data distribution, !!YOU CAN READ THIS LATER, if you are interested
    """
    colors = cm.rainbow(np.linspace(0, 1, len(set(labels))))
    for i in set(labels):
        xs = []
        ys = []
        for index, label in enumerate(labels):
            if label == i:
                xs.append(data[index][0])
                ys.append(data[index][1])
        plt.scatter(xs, ys, colors[int(i)])    
    plt.title(caption)
    plt.xlabel('x')
    plt.ylabel('y')
    plt.show()

plot(train_data, train_labels, 'train_dataset')

plot(test_data, test_labels, 'test_dataset')

def int2onehot(label):
    """conver labels into one-hot vector, !!YOU CAN READ THIS LATER, if you are interested
    Args:
        label: [batch_size]
    Returns:
        onehot: [batch_size, categories]
    """
    dims = len(set(label))
    imgs_size = len(label)
    onehot = np.zeros((imgs_size, dims))
    onehot[np.arange(imgs_size), label] = 1
    return onehot

# convert labels into one hot vector
train_labels_onehot = int2onehot(train_labels)
test_labels_onehot = int2onehot(test_labels)
print(train_labels_onehot.shape)
print(train_labels_onehot.shape)

(210, 3)
(210, 3)

def get_accuracy(predictions, labels):
    """Compute accuracy, !!YOU CAN READ THIS LATER, if you are interested
    Inputs: 
        predictions:[batch_size, categories] one-hot vector
        labels: [batch_size, categories]
    """
    predictions = np.argmax(predictions, axis=1)
    labels = np.argmax(labels, axis=1)
    all_imgs = len(labels)
    predict_true = np.sum(predictions == labels)
    return predict_true/all_imgs

# Please read this function carefully, related to implementation of GD, SGD, and mini-batch
def generate_batch(train_data, train_labels, batch_size):
    """Generate batch
    when batch_size=len(train_data), it's GD
    when batch_size=1, it's SGD
    when batch_size>1 & batch_size<len(train_data), it's mini-batch, usually, batch_size=2,4,8,16...
    """
    iterations = math.ceil(len(train_data)/batch_size)
    for i in range(iterations):
        index_from = i*batch_size
        index_end = (i+1)*batch_size
        yield (train_data[index_from:index_end], train_labels[index_from:index_end])

def show_curve(ys, title):
    """plot curve for Loss and Accuacy, !!YOU CAN READ THIS LATER, if you are interested
    Args:
        ys: loss or acc list
        title: Loss or Accuracy
    """
    x = np.array(range(len(ys)))
    y = np.array(ys)
    plt.plot(x, y, c='b')
    plt.axis()
    plt.title('{} Curve:'.format(title))
    plt.xlabel('Epoch')
    plt.ylabel('{} Value'.format(title))
    plt.show()

3.5 Training model and make predictions

learning_rate = 0.1

epochs = 400 # training epoch

batch_size = len(train_data) # GD
# batch_size = 1               # SGD
# batch_size = 8               # mini-batch

model = FeedForward_Neural_Network(learning_rate) # declare a simple feedforward neural model

losses = []
accuracies = []

for i in range(epochs):
    loss = 0
    for index, (xs, ys) in enumerate(generate_batch(train_data, train_labels_onehot, batch_size)):
        predictions = model.forward(xs) # forward phase
        loss += 1/2 * np.mean(np.sum(np.square(ys-predictions), axis=1)) # Mean square error
        model.backward(xs, ys, predictions) # backward phase
        
    losses.append(loss)
    
    # train dataset acc computation
    predictions = model.forward(train_data)
    # compute acc on train dataset
    accuracy = get_accuracy(predictions, train_labels_onehot)
    accuracies.append(accuracy)
    
    if i % 50 == 0:
        print('Epoch: {}, has {} iterations'.format(i, index+1))
        print('\tLoss: {:.4f}, \tAccuracy: {:.4f}'.format(loss, accuracy))
        
test_predictions = model.forward(test_data)
# compute acc on test dataset
test_accuracy = get_accuracy(test_predictions, test_labels_onehot)
print('Test Accuracy: {:.4f}'.format(test_accuracy))

Epoch: 0, has 1 iterations
	Loss: 0.4185, 	Accuracy: 0.3381
Epoch: 50, has 1 iterations
	Loss: 0.0309, 	Accuracy: 0.9571
Epoch: 100, has 1 iterations
	Loss: 0.0334, 	Accuracy: 0.9714
Epoch: 150, has 1 iterations
	Loss: 0.0233, 	Accuracy: 1.0000
Epoch: 200, has 1 iterations
	Loss: 0.0044, 	Accuracy: 1.0000
Epoch: 250, has 1 iterations
	Loss: 0.0955, 	Accuracy: 0.8286
Epoch: 300, has 1 iterations
	Loss: 0.0322, 	Accuracy: 0.9667
Epoch: 350, has 1 iterations
	Loss: 0.0151, 	Accuracy: 0.9476
Test Accuracy: 0.9111

3.6 Show results

# Draw losses curve using losses 
show_curve(losses, 'Loss')

# Draw Accuracy curve using accuracies
show_curve(accuracies, 'Accuracy')

4. Problems

4.1 Problem 1

Describe the training procedure, based on codes above.

The procedure above uses a feedforward neural network to complete a process.

4.1.1 Data Process

Python uses Pandas to load csv files to convert the raw data into dataframe format.

4.1.1.1 Files

label.csv includes the total observations of the whole dataset, which then split into two parts, train.csv and test.csv.
train.csv is the training set of this neural network. This file includes 210 observations which are 2-D, and each observation in it owns a label. Labels range from 0 to 2.
test.csv is the test set of this neural network. This file includes 90 observations which are 2-D, and each observation in it owns a label. Labels range from 0 to 2.

4.1.1.2 Data and Lable

After the I/O, we further split the raw data into two classes, i.e. data and labels. From the plots above, we can clearly see that the whole data are shown in a 2-D figure, and there emerges 3 clusters according to labels.

In this process, we only use one iteration. In other words, the batch is all of the test as a whole. And we have 400 epochs, meaning the train data is used 400 times for training. We check the loss and accuracy of our network every 50 ephoch. Now we has 0.1 $\alpha$, learning rate.

4.1.1.3 One-Hot

There is a trick of one-hot. One-hot coding treats each state bit as a feature.

Advantages: Firstly, it solves the problem that the classifier can not handle discrete data well, and secondly, to some extent, it also plays the role of expanding features (the number of features of the above samples is expanded from 3 to 9).
Disadvantage: There are some shortcomings in the representation of text features, which are very prominent. Firstly, it is a bag of words model, which does not consider the order between words (the order information of words in text is also very important); secondly, it assumes that words and words are independent (in most cases, words and words interact with each other); lastly, the features it obtains are discrete and sparse.

4.1.2 Neural Network

Our neural network requires 2-D input data, 5 neurals in the hidden layer classify the data into 3 classes. Notice that although we usually use random matrix to initialize the weight matrices. For simplicity, here we use two pre-set matrices.

4.1.2.1 Forward

The neural network predict labels by forward procedure. By multiplying the origin input and nn.weight1, we activate the result to get the intermediate product.By multiplying the origin intermediate product and nn.weight2, we activate the result and get the final prediction.

4.1.2.2 Backward

Notice that the training set has important value - labels. We calculate mean-square errors and use grediant descent method to minimize them. To decrease the grediant, we need to find the direction which can decrease the MSE in the faster manner. Finally, weights are renewen in each step.

4.1.3 Test

To verify the correctness of our method, we plot the curve of 400 epochs. We can see that the accuracy are imcreasing and the loss is decreasing, generally. However, strangely, we can see that the loss curve has a huge improvement in epoch 150, 300 and 350. Why?

After further exploration, we finally get the problem that the learing step is too long, our method cannot find the global optimization. That’s a really an interesting thing! We are able to get the optimized answers below in problem2.

4.2 Problem 2

Set learning rate = 0.01 to train the model and show two curve below.

learning_rate = 0.01
epochs = 400 # training epoch
batch_size = len(train_data) # GD
# batch_size = 1               # SGD
# batch_size = 8               # mini-batch

model = FeedForward_Neural_Network(learning_rate) # declare a simple feedforward neural model

losses = []
accuracies = []

for i in range(epochs):
    loss = 0
    for index, (xs, ys) in enumerate(generate_batch(train_data, train_labels_onehot, batch_size)):
        predictions = model.forward(xs) # forward phase
        loss += 1/2 * np.mean(np.sum(np.square(ys-predictions), axis=1)) # Mean square error
        model.backward(xs, ys, predictions) # backward phase
        
    losses.append(loss)
    
    # train dataset acc computation
    predictions = model.forward(train_data)
    # compute acc on train dataset
    accuracy = get_accuracy(predictions, train_labels_onehot)
    accuracies.append(accuracy)
    
    if i % 50 == 0:
#         print('Epoch: {}, has {} iterations'.format(i, index+1))
#         print('\tLoss: {:.4f}, \tAccuracy: {:.4f}'.format(loss, accuracy))
        pass
        
test_predictions = model.forward(test_data)
# compute acc on test dataset
test_accuracy = get_accuracy(test_predictions, test_labels_onehot)
print('Test Accuracy: {:.4f}'.format(test_accuracy))

# Draw losses curve using losses 
show_curve(losses, 'Loss')

# Draw Accuracy curve using accuracies
show_curve(accuracies, 'Accuracy')

Test Accuracy: 1.0000

See? Loss decreases positive correlate with epochs and the accuracy is increasing to one!

4.3 Problem 3

Use SGD and mini-batch to train model and show four curve below.

4.3.1 SGD

learning_rate = 0.1
epochs = 400 # training epoch
# batch_size = len(train_data) # GD
batch_size = 1               # SGD
# batch_size = 8               # mini-batch

model = FeedForward_Neural_Network(learning_rate) # declare a simple feedforward neural model

losses = []
accuracies = []

for i in range(epochs):
    loss = 0
    for index, (xs, ys) in enumerate(generate_batch(train_data, train_labels_onehot, batch_size)):
        predictions = model.forward(xs) # forward phase
        loss += 1/2 * np.mean(np.sum(np.square(ys-predictions), axis=1)) # Mean square error
        model.backward(xs, ys, predictions) # backward phase
        
    losses.append(loss)
    
    # train dataset acc computation
    predictions = model.forward(train_data)
    # compute acc on train dataset
    accuracy = get_accuracy(predictions, train_labels_onehot)
    accuracies.append(accuracy)
    
    if i % 50 == 0:
#         print('Epoch: {}, has {} iterations'.format(i, index+1))
#         print('\tLoss: {:.4f}, \tAccuracy: {:.4f}'.format(loss, accuracy))
        pass
        
test_predictions = model.forward(test_data)
# compute acc on test dataset
test_accuracy = get_accuracy(test_predictions, test_labels_onehot)
print('Test Accuracy: {:.4f}'.format(test_accuracy))

# Draw losses curve using losses 
show_curve(losses, 'Loss')

# Draw Accuracy curve using accuracies
show_curve(accuracies, 'Accuracy')

Test Accuracy: 1.0000

4.3.2 Mini-Batch

learning_rate = 0.1
epochs = 400 # training epoch
# batch_size = len(train_data) # GD
# batch_size = 1               # SGD
batch_size = 8               # mini-batch

model = FeedForward_Neural_Network(learning_rate) # declare a simple feedforward neural model

losses = []
accuracies = []

for i in range(epochs):
    loss = 0
    for index, (xs, ys) in enumerate(generate_batch(train_data, train_labels_onehot, batch_size)):
        predictions = model.forward(xs) # forward phase
        loss += 1/2 * np.mean(np.sum(np.square(ys-predictions), axis=1)) # Mean square error
        model.backward(xs, ys, predictions) # backward phase
        
    losses.append(loss)
    
    # train dataset acc computation
    predictions = model.forward(train_data)
    # compute acc on train dataset
    accuracy = get_accuracy(predictions, train_labels_onehot)
    accuracies.append(accuracy)
    
    if i % 50 == 0:
#         print('Epoch: {}, has {} iterations'.format(i, index+1))
#         print('\tLoss: {:.4f}, \tAccuracy: {:.4f}'.format(loss, accuracy))
        pass
        
test_predictions = model.forward(test_data)
# compute acc on test dataset
test_accuracy = get_accuracy(test_predictions, test_labels_onehot)
print('Test Accuracy: {:.4f}'.format(test_accuracy))

# Draw losses curve using losses 
show_curve(losses, 'Loss')

# Draw Accuracy curve using accuracies
show_curve(accuracies, 'Accuracy')

Test Accuracy: 1.0000

From this perspective, we can see that a remedy to the defficiency of step length is to increase iterations. We are able to copensate this problem by adopting Mini-batch and SGD.

前馈神经网络的 Numpy 实现

深度学习笔记（三）