# A NumPy Implementation of a Feedforward Neural Network

## Deep Learning Notes (3)

Posted by Nino Lau on March 8, 2019

# Back Propagation

We introduce backpropagation in NumPy and PyTorch respectively.

If you have questions or suggestions about backpropagation with NumPy, contact Jiaxin Zhuang by email (zhuangjx5@mail2.sysu.edu.cn).

## 1. Simple expressions and interpretation of the gradient

### 1.1 Simple expressions

Let's start simple so that we can develop the notation and conventions for more complex expressions. Consider a simple multiplication function of two numbers, $f(x,y)=xy$. It is a matter of simple calculus to derive the partial derivative for either input: $\frac{\partial f}{\partial x} = y$ and $\frac{\partial f}{\partial y} = x$.

# set some inputs
x1 = -2; x2 = 5;

# perform the forward pass
f = x1 * x2 # f becomes -10

# perform the backward pass (backpropagation) in reverse order:
# backprop through f = x * y
dfdx1 = x2 # df/dx = y, so gradient on x becomes 5
dfdx2 = x1 # df/dy = x, so gradient on y becomes -2

gradient on x is  5


### 1.2 Interpretation of the gradient

Interpretation: the derivative indicates the rate of change of a function with respect to a variable in an infinitesimally small region around a particular point: $\frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$. In other words, the derivative on each variable tells you the sensitivity of the whole expression to its value. As mentioned, the gradient $\nabla f$ is the vector of partial derivatives, so we have $\nabla f = [\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}] = [y, x]$.
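The limit definition above can be checked numerically. The following is a minimal sketch that approximates both partial derivatives of $f(x,y)=xy$ with a small finite step `h` and compares them with the analytic gradient $[y, x]$; the helper name `numerical_partials` and the step size are illustrative choices, not part of the original notebook.

```python
def f(x, y):
    """The product function from section 1.1."""
    return x * y

def numerical_partials(func, x, y, h=1e-6):
    """Approximate (df/dx, df/dy) with the forward difference (f(x + h) - f(x)) / h."""
    dfdx = (func(x + h, y) - func(x, y)) / h
    dfdy = (func(x, y + h) - func(x, y)) / h
    return dfdx, dfdy

x, y = -2.0, 5.0
num_dx, num_dy = numerical_partials(f, x, y)
ana_dx, ana_dy = y, x  # analytic gradient [y, x]

print('df/dx: numerical {:.4f}, analytic {:.4f}'.format(num_dx, ana_dx))
print('df/dy: numerical {:.4f}, analytic {:.4f}'.format(num_dy, ana_dy))
```

Because $f$ is linear in each input, the finite difference agrees with the analytic value up to floating-point rounding.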

## 2. Compound expressions with chain rule

### 2.1 Simple examples for chain rule

Let's now consider more complicated expressions that involve multiple composed functions, such as $f(x,y,z) = (x + y) z$.

This expression is still simple enough to differentiate directly, but we’ll take a particular approach to it that will be helpful for understanding the intuition behind backpropagation.

In particular, note that this expression can be broken down into two expressions: $q=x+y$ and $f=qz$. As seen in the previous section, $f$ is just the multiplication of $q$ and $z$, so $\frac{\partial f}{\partial q} = z, \frac{\partial f}{\partial z} = q$, and $q$ is the addition of $x$ and $y$, so $\frac{\partial q}{\partial x} = 1, \frac{\partial q}{\partial y} = 1$.

However, we don’t necessarily care about the gradient on the intermediate value $q$ - the value of $\frac{\partial f}{\partial q}$ is not useful. Instead, we are ultimately interested in the gradient of $f$ with respect to its inputs $x$, $y$, $z$.

The chain rule tells us that the correct way to “chain” these gradient expressions together is through multiplication. For example, $\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \frac{\partial q}{\partial x}$. In practice this is simply a multiplication of the two numbers that hold the two gradients. Let's see this with an example:

# set some inputs
x = -2; y = 5; z = -4

# perform the forward pass
q = 2*x + y # q becomes 1
f = q * z # f becomes -4
print(q, f)

1 -4

# perform the backward pass (backpropagation) in reverse order:
# first backprop through f = q * z = (2*x+y) * z
dfdz = q # df/dz = q, so gradient on z becomes 1
dfdq = z # df/dq = z, so gradient on q becomes -4
# now backprop through q = 2*x + y (the code uses 2*x here, unlike q = x + y in the text)
dfdx = 2.0 * dfdq # dq/dx = 2. And the multiplication here is the chain rule!
dfdy = 1.0 * dfdq # dq/dy = 1
print('df/dx is {:2}'.format(dfdx))
print('df/dy is {:2}'.format(dfdy))

df/dx is -8.0
df/dy is -4.0


### 2.2 Intuitive understanding of backpropagation

Notice that backpropagation is a beautifully local process. Every gate in a circuit diagram gets some inputs and can right away compute two things:

1. its output value and
2. the local gradient of its output value with respect to its inputs.
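The locality described above can be sketched with two tiny gate objects. Each gate knows only its own inputs and its own local gradient, and the chain rule stitches them together. The class names `AddGate` and `MultiplyGate` are hypothetical, and the example uses $f(x,y,z) = (x+y)z$ from section 2.1.

```python
class AddGate:
    """q = a + b; the local gradient is 1 for both inputs."""
    def forward(self, a, b):
        return a + b

    def backward(self, dout):
        # chain rule: upstream gradient times local gradient (1.0)
        return dout * 1.0, dout * 1.0


class MultiplyGate:
    """f = a * b; the local gradients are b (for a) and a (for b)."""
    def forward(self, a, b):
        self.a, self.b = a, b  # cache the inputs for the backward pass
        return a * b

    def backward(self, dout):
        return dout * self.b, dout * self.a


x, y, z = -2.0, 5.0, -4.0
add_gate, mul_gate = AddGate(), MultiplyGate()

# forward pass
q = add_gate.forward(x, y)
f = mul_gate.forward(q, z)

# backward pass in reverse order, starting from df/df = 1
dfdq, dfdz = mul_gate.backward(1.0)
dfdx, dfdy = add_gate.backward(dfdq)

print(f, dfdx, dfdy, dfdz)
```

Neither gate needs to know anything about the rest of the circuit, which is what makes backpropagation easy to compose.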

## 3. Practice: Writing a simple Feedforward Neural Network

### 3.1 Outline

We will implement a simple feedforward neural network using NumPy. To do so, we need to define the network and implement both the forward pass and the backward propagation.

1. Define a simple feedforward neural network with 1 hidden layer, and implement forward and backward.
2. Load data from local CSV files with pandas. They contain training and testing points generated by 3 different Gaussian distributions (with different means and standard deviations).
3. Define some functions for visualization and training.
4. Train and predict every epoch.
5. Plot the distribution of the points’ labels and the predictions.

# Load necessary module for later
import numpy as np
import pandas as pd
np.random.seed(1024)


### 3.2 Define a Feedforward Neural Network, implement forward and backward

A simple Neural Network with 1 hidden layer.

                                   Network Structure

Layer                            Input             Weights      Output
Hidden Layer                     [batch_size, 2] x [2, 5]   ->  [batch_size, 5]
Activation function (sigmoid)    [batch_size, 5]            ->  [batch_size, 5]
Classification Layer             [batch_size, 5] x [5, 3]   ->  [batch_size, 3]
Activation function (sigmoid)    [batch_size, 3]            ->  [batch_size, 3]


According to the training and testing data, each point lies in a two-dimensional space and there are three categories. Labels are represented as one-hot vectors such as [0 0 1], [1 0 0], [0 1 0], and the predicted class is the index of the largest output.

w1_initialization = np.random.randn(2, 5)
w2_initialization = np.random.randn(5, 3)

w2_initialization

array([[-0.06510141,  0.80681666, -0.5778176 ],
[ 0.57306064, -0.33667496,  0.29700734],
[-0.37480416,  0.15510474,  0.70485719],
[ 0.8452178 , -0.65818079,  0.56810558],
[ 0.51538125, -0.61564998,  0.92611427]])

class FeedForward_Neural_Network(object):
    def __init__(self, learning_rate):
        self.input_channel = 2   # number of input neurons
        self.output_channel = 3  # number of output neurons
        self.hidden_channel = 5  # number of hidden neurons
        self.learning_rate = learning_rate

        # weights initialization
        # Usually, we use random or uniform initialization to initialize the weights.
        # For simplicity, here we use the same fixed arrays on every run.
        # (2x5) weight matrix from input to hidden layer
        # np.random.randn(self.input_channel, self.hidden_channel)
        self.weight1 = np.array([[ 2.12444863,  0.25264613,  1.45417876,  0.56923979,  0.45822365],
                                 [-0.80933344,  0.86407349,  0.20170137, -1.87529904, -0.56850693]])

        # (5x3) weight matrix from hidden to output layer
        # np.random.randn(self.hidden_channel, self.output_channel)
        self.weight2 = np.array([[-0.06510141,  0.80681666, -0.5778176 ],
                                 [ 0.57306064, -0.33667496,  0.29700734],
                                 [-0.37480416,  0.15510474,  0.70485719],
                                 [ 0.8452178 , -0.65818079,  0.56810558],
                                 [ 0.51538125, -0.61564998,  0.92611427]])

    def forward(self, X):
        """forward propagation through our network
        """
        # dot product of X (input) and the first (2x5) weight matrix
        self.h1 = np.dot(X, self.weight1)
        # activation function
        self.z1 = self.sigmoid(self.h1)
        # dot product of the hidden activations (z1) and the second (5x3) weight matrix
        self.h2 = np.dot(self.z1, self.weight2)
        # final activation function
        o = self.sigmoid(self.h2)
        return o

    def backward(self, X, y, o):
        """Backward, compute gradient and update parameters
        Inputs:
            X: data, [batch_size, 2]
            y: label, one-hot vector, [batch_size, 3]
            o: predictions, [batch_size, 3]
        """
        # backward propagate through the network
        self.o_error = y - o  # error in output
        # apply the derivative of sigmoid to the error (delta of the output layer)
        self.o_delta = self.o_error * self.sigmoid_prime(o)

        # z1 error: how much our hidden layer weights contributed to output error
        self.z1_error = self.o_delta.dot(self.weight2.T)
        # apply the derivative of sigmoid to the z1 error
        self.z1_delta = self.z1_error * self.sigmoid_prime(self.z1)

        # adjusting first set (input --> hidden) weights
        self.weight1 += X.T.dot(self.z1_delta) * self.learning_rate
        # adjusting second set (hidden --> output) weights
        self.weight2 += self.z1.T.dot(self.o_delta) * self.learning_rate

    def sigmoid(self, s):
        """activation function
        """
        return 1 / (1 + np.exp(-s))

    def sigmoid_prime(self, s):
        """derivative of sigmoid, where s is already the sigmoid output
        """
        return s * (1 - s)
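One way to sanity-check a backward pass like the one above is a numerical gradient check. The sketch below is not part of the original notebook: it re-implements the same forward and delta computations as standalone functions (the names `loss_fn`, `analytic_grads` and `numeric_grad` are hypothetical), assumes the summed squared-error loss $\frac{1}{2}\sum (y - o)^2$, and compares the analytic gradients with central finite differences on small random data.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def loss_fn(X, y, w1, w2):
    """Assumed loss: 0.5 * sum of squared errors over the whole batch."""
    o = sigmoid(sigmoid(X @ w1) @ w2)
    return 0.5 * np.sum((y - o) ** 2)

def analytic_grads(X, y, w1, w2):
    """Gradients of loss_fn, using the same deltas as backward() above."""
    z1 = sigmoid(X @ w1)
    o = sigmoid(z1 @ w2)
    o_delta = (y - o) * o * (1 - o)
    z1_delta = (o_delta @ w2.T) * z1 * (1 - z1)
    # backward() ADDS these products to the weights, so the loss gradient is their negative
    return -(X.T @ z1_delta), -(z1.T @ o_delta)

def numeric_grad(f, w, h=1e-5):
    """Central-difference gradient of the scalar f() with respect to w (w is restored afterwards)."""
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + h
        plus = f()
        w[idx] = old - h
        minus = f()
        w[idx] = old
        grad[idx] = (plus - minus) / (2 * h)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))                  # 4 toy points
y = np.eye(3)[rng.integers(0, 3, size=4)]    # random one-hot labels
w1 = rng.normal(size=(2, 5))
w2 = rng.normal(size=(5, 3))

g1, g2 = analytic_grads(X, y, w1, w2)
n1 = numeric_grad(lambda: loss_fn(X, y, w1, w2), w1)
n2 = numeric_grad(lambda: loss_fn(X, y, w1, w2), w2)

err1 = np.max(np.abs(g1 - n1))
err2 = np.max(np.abs(g2 - n2))
print('max abs difference:', err1, err2)
```

If the two differences are tiny, the deltas are consistent with gradient descent on the summed squared error. Note that the update in `backward` sums over the batch and is not divided by the batch size.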


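### 3.3 Load data

# Import Module
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import math

train_csv_file = './labels/train.csv'
test_csv_file = './labels/test.csv'

# Load the CSV files into DataFrames.
# Assumption: the files have no header row, which matches the integer column labels (0, 1, 2) shown below.
train_frame = pd.read_csv(train_csv_file, header=None)
test_frame = pd.read_csv(test_csv_file, header=None)

# show data in DataFrame format (defined in pandas)
train_frame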

0 1 2
0 11.834241 11.866105 1
1 8.101150 9.324800 1
2 11.184679 1.196726 2
3 8.911888 -0.044024 2
4 9.863982 0.151162 2
5 9.427897 -0.598807 2
6 10.038352 2.133938 2
7 11.149009 -0.726649 2
8 9.041540 2.972213 2
9 13.413336 -3.174030 2
10 -0.385824 0.388751 0
11 -0.192905 1.562469 0
12 10.735249 7.702754 1
13 -3.024363 2.518729 0
14 10.694739 11.442958 1
15 10.672035 0.163851 2
16 9.717515 -0.673383 2
17 7.757028 -2.540235 2
18 0.195954 0.843201 0
19 10.359054 11.489937 1
20 10.245470 10.873774 1
21 9.767327 9.450749 1
22 12.402497 11.861342 1
23 0.980769 -1.524264 0
24 -2.113837 2.111235 0
25 0.076416 0.650588 0
26 0.670296 -0.344045 0
27 10.452718 9.419734 1
28 10.647860 8.271140 1
29 -0.095686 2.692840 0
... ... ... ...
180 0.239345 -2.378022 0
181 1.497582 -2.700999 0
182 -0.471785 0.856114 0
183 13.690628 11.552953 1
184 10.652533 10.357309 1
185 8.714084 9.839341 1
186 12.177913 10.932641 1
187 10.049335 8.478106 1
188 1.370425 2.321562 0
189 2.189643 0.012325 0
190 7.425213 10.904103 1
191 6.836717 10.750923 1
192 8.911069 11.032682 1
193 8.819191 11.310835 1
194 -0.807627 -1.435569 0
195 -1.687238 1.345539 0
196 9.856732 10.116610 1
197 9.648434 8.059552 1
198 -0.223917 1.003647 0
199 10.004307 8.482203 1
200 12.090931 9.942670 1
201 10.983798 10.193395 1
202 0.109491 -1.238625 0
203 -1.068244 -0.996179 0
204 0.341772 -0.582299 0
205 -1.344687 -0.894215 0
206 -0.711753 -2.676756 0
207 -0.625906 -2.659784 0
208 9.685143 10.292463 1
209 9.921518 12.654102 1

210 rows × 3 columns

# obtain data from specific columns

# obtain data from the first and second columns and convert into a NumPy ndarray
train_data = train_frame.iloc[:,0:2].values
# obtain labels from the third column and convert into a NumPy ndarray
train_labels = train_frame.iloc[:,2].values
# obtain data from the first and second columns and convert into a NumPy ndarray
test_data = test_frame.iloc[:,0:2].values
# obtain labels from the third column and convert into a NumPy ndarray
test_labels = test_frame.iloc[:,2].values

# train & test data shape
print(train_data.shape)
print(test_data.shape)
# train & test labels shape
print(train_labels.shape)
print(test_labels.shape)

(210, 2)
(90, 2)
(210,)
(90,)


### 3.4 Define some functions for visualization and training

def plot(data, labels, caption):
    """plot the data distribution, !!YOU CAN READ THIS LATER, if you are interested
    """
    colors = cm.rainbow(np.linspace(0, 1, len(set(labels))))
    for i in set(labels):
        xs = []
        ys = []
        for index, label in enumerate(labels):
            if label == i:
                xs.append(data[index][0])
                ys.append(data[index][1])
        # pass the color by keyword: the third positional argument of scatter is the marker size
        plt.scatter(xs, ys, color=colors[int(i)])
    plt.title(caption)
    plt.xlabel('x')
    plt.ylabel('y')
    plt.show()

plot(train_data, train_labels, 'train_dataset')


plot(test_data, test_labels, 'test_dataset')


def int2onehot(label):
    """convert labels into one-hot vectors, !!YOU CAN READ THIS LATER, if you are interested
    Args:
        label: [batch_size]
    Returns:
        onehot: [batch_size, categories]
    """
    dims = len(set(label))
    imgs_size = len(label)
    onehot = np.zeros((imgs_size, dims))
    onehot[np.arange(imgs_size), label] = 1
    return onehot
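As a small illustration of the indexing trick used in `int2onehot`, the sketch below encodes four toy labels and then recovers them with `argmax`. Unlike the function above, it assumes the number of classes is passed in explicitly (`num_classes` is an assumption of this example), which avoids a wrong width when some class happens to be missing from the labels.

```python
import numpy as np

labels = np.array([1, 2, 0, 2])
num_classes = 3  # assumed known in advance instead of inferred via len(set(labels))

onehot = np.zeros((len(labels), num_classes))
# row i gets a 1 in column labels[i] (NumPy integer-array indexing)
onehot[np.arange(len(labels)), labels] = 1
print(onehot)

# argmax along each row inverts the encoding
recovered = np.argmax(onehot, axis=1)
print(recovered)
```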

# convert labels into one hot vector
train_labels_onehot = int2onehot(train_labels)
test_labels_onehot = int2onehot(test_labels)
print(train_labels_onehot.shape)
print(test_labels_onehot.shape)

(210, 3)
(90, 3)

def get_accuracy(predictions, labels):
    """Compute accuracy, !!YOU CAN READ THIS LATER, if you are interested
    Inputs:
        predictions: [batch_size, categories] network outputs, one score per category
        labels: [batch_size, categories] one-hot vectors
    """
    predictions = np.argmax(predictions, axis=1)
    labels = np.argmax(labels, axis=1)
    all_imgs = len(labels)
    predict_true = np.sum(predictions == labels)
    return predict_true/all_imgs

# Please read this function carefully, related to implementation of GD, SGD, and mini-batch
def generate_batch(train_data, train_labels, batch_size):
    """Generate batch
    when batch_size=len(train_data), it's GD
    when batch_size=1, it's SGD
    when batch_size>1 & batch_size<len(train_data), it's mini-batch, usually, batch_size=2,4,8,16...
    """
    iterations = math.ceil(len(train_data)/batch_size)
    for i in range(iterations):
        index_from = i*batch_size
        index_end = (i+1)*batch_size
        yield (train_data[index_from:index_end], train_labels[index_from:index_end])
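To make the three regimes concrete, here is a small sketch that runs the same slicing scheme on 10 toy samples and records the size of every batch. The generator is repeated (with shortened names) only so the example is self-contained; the toy arrays and the `batch_sizes_seen` dictionary are illustrative.

```python
import math
import numpy as np

def generate_batch(data, labels, batch_size):
    """Same slicing scheme as above, repeated here so the example runs on its own."""
    iterations = math.ceil(len(data) / batch_size)
    for i in range(iterations):
        start, end = i * batch_size, (i + 1) * batch_size
        yield data[start:end], labels[start:end]

data = np.arange(20).reshape(10, 2)  # 10 toy samples with 2 features each
labels = np.arange(10)

batch_sizes_seen = {}
for batch_size in (10, 4, 1):  # GD, mini-batch, SGD
    sizes = [len(xs) for xs, _ in generate_batch(data, labels, batch_size)]
    batch_sizes_seen[batch_size] = sizes
    print('batch_size={}: {} iterations, sizes {}'.format(batch_size, len(sizes), sizes))
```

Note that the last mini-batch can be smaller than `batch_size`, and that the samples are always visited in the same order because this generator does not shuffle.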

def show_curve(ys, title):
    """plot curve for Loss and Accuracy, !!YOU CAN READ THIS LATER, if you are interested
    Args:
        ys: loss or acc list
        title: Loss or Accuracy
    """
    x = np.array(range(len(ys)))
    y = np.array(ys)
    plt.plot(x, y, c='b')
    plt.axis()
    plt.title('{} Curve:'.format(title))
    plt.xlabel('Epoch')
    plt.ylabel('{} Value'.format(title))
    plt.show()


### 3.5 Train the model and make predictions

learning_rate = 0.1

epochs = 400 # training epoch

batch_size = len(train_data) # GD
# batch_size = 1               # SGD
# batch_size = 8               # mini-batch

model = FeedForward_Neural_Network(learning_rate) # declare a simple feedforward neural model

losses = []
accuracies = []

for i in range(epochs):
    loss = 0
    for index, (xs, ys) in enumerate(generate_batch(train_data, train_labels_onehot, batch_size)):
        predictions = model.forward(xs) # forward phase
        loss += 1/2 * np.mean(np.sum(np.square(ys-predictions), axis=1)) # Mean square error
        model.backward(xs, ys, predictions) # backward phase

    losses.append(loss)

    # train dataset acc computation
    predictions = model.forward(train_data)
    # compute acc on train dataset
    accuracy = get_accuracy(predictions, train_labels_onehot)
    accuracies.append(accuracy)

    if i % 50 == 0:
        print('Epoch: {}, has {} iterations'.format(i, index+1))
        print('\tLoss: {:.4f}, \tAccuracy: {:.4f}'.format(loss, accuracy))

test_predictions = model.forward(test_data)
# compute acc on test dataset
test_accuracy = get_accuracy(test_predictions, test_labels_onehot)
print('Test Accuracy: {:.4f}'.format(test_accuracy))

Epoch: 0, has 1 iterations
Loss: 0.4185, 	Accuracy: 0.3381
Epoch: 50, has 1 iterations
Loss: 0.0309, 	Accuracy: 0.9571
Epoch: 100, has 1 iterations
Loss: 0.0334, 	Accuracy: 0.9714
Epoch: 150, has 1 iterations
Loss: 0.0233, 	Accuracy: 1.0000
Epoch: 200, has 1 iterations
Loss: 0.0044, 	Accuracy: 1.0000
Epoch: 250, has 1 iterations
Loss: 0.0955, 	Accuracy: 0.8286
Epoch: 300, has 1 iterations
Loss: 0.0322, 	Accuracy: 0.9667
Epoch: 350, has 1 iterations
Loss: 0.0151, 	Accuracy: 0.9476
Test Accuracy: 0.9111


### 3.6 Show results

# Draw losses curve using losses
show_curve(losses, 'Loss')


# Draw Accuracy curve using accuracies
show_curve(accuracies, 'Accuracy')


## 4. Problems

### 4.1 Problem 1

Describe the training procedure, based on the code above.

The procedure above trains a feedforward neural network to classify 2-D points into three classes.

#### 4.1.1 Data Process

We use pandas to load the CSV files and convert the raw data into DataFrame format.

##### 4.1.1.1 Files

• label.csv contains all observations of the whole dataset, which is then split into two parts, train.csv and test.csv.
• train.csv is the training set of this neural network. This file contains 210 two-dimensional observations, each with a label. Labels range from 0 to 2.
• test.csv is the test set of this neural network. This file contains 90 two-dimensional observations, each with a label. Labels range from 0 to 2.

##### 4.1.1.2 Data and Label

After the I/O, we further split the raw data into two parts, i.e. data and labels. From the plots above, we can clearly see that the data lie in a 2-D plane and form 3 clusters according to their labels.

In this run, each epoch consists of only one iteration. In other words, the batch is the whole training set. We run 400 epochs, meaning the training data is used 400 times for training. We check the loss and accuracy of our network every 50 epochs. The learning rate $\alpha$ is 0.1.

##### 4.1.1.3 One-Hot

There is a trick called one-hot encoding. One-hot encoding treats each state bit as a feature.

• Advantages: first, it solves the problem that classifiers cannot handle discrete data well; second, to some extent, it also expands the features (here each integer label becomes a 3-dimensional vector).
• Disadvantages: when used to represent text features, its shortcomings are very prominent. First, it is a bag-of-words model, which does not consider the order of words (the order information of words in text is also very important); second, it assumes that words are independent of each other (in most cases, words interact with each other); lastly, the features it produces are discrete and sparse.

#### 4.1.2 Neural Network

Our neural network takes 2-D input data and uses 5 neurons in the hidden layer to classify the data into 3 classes. Note that although weight matrices are usually initialized randomly, for simplicity we use two preset matrices here.

##### 4.1.2.1 Forward

The neural network predicts labels through the forward procedure. We multiply the original input by `weight1` and apply the activation function to get the intermediate result. We then multiply this intermediate result by `weight2` and apply the activation function again to get the final prediction.
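The two steps of the forward procedure can be traced shape by shape. The sketch below uses random stand-in weights and a toy batch of 4 points (both are assumptions of this example, not the preset matrices from the class) just to show how the shapes flow from `[batch_size, 2]` to `[batch_size, 3]`.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))        # toy batch: 4 two-dimensional points
weight1 = rng.normal(size=(2, 5))  # stand-in for the input -> hidden weights
weight2 = rng.normal(size=(5, 3))  # stand-in for the hidden -> output weights

z1 = sigmoid(X @ weight1)    # hidden activations, one row per point
o = sigmoid(z1 @ weight2)    # output scores, one column per class
pred = np.argmax(o, axis=1)  # predicted class index for each point

print(z1.shape, o.shape, pred.shape)
```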

##### 4.1.2.2 Backward

Notice that the training set carries important information: the labels. We calculate the mean squared error between predictions and labels and use gradient descent to minimize it. To decrease the error, we move the weights in the direction that decreases the MSE fastest, i.e. along the negative gradient. Finally, the weights are updated in each step.
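Written out, the backward step can be summarized as follows. This is a sketch under the assumption that the quantity being minimized is the summed squared error $E$ over the batch; the symbols $\delta_o$ and $\delta_{z_1}$ correspond to `o_delta` and `z1_delta` in the code, $\odot$ is the element-wise product, and $\alpha$ is the learning rate.

```latex
\begin{aligned}
h_1 &= X W_1, \qquad z_1 = \sigma(h_1), \qquad h_2 = z_1 W_2, \qquad o = \sigma(h_2), \\
E &= \tfrac{1}{2} \sum_{n} \lVert y_n - o_n \rVert^2, \\
\delta_o &= (y - o) \odot o \odot (1 - o) = -\frac{\partial E}{\partial h_2}, \\
\delta_{z_1} &= \left(\delta_o W_2^{\top}\right) \odot z_1 \odot (1 - z_1) = -\frac{\partial E}{\partial h_1}, \\
W_1 &\leftarrow W_1 + \alpha\, X^{\top} \delta_{z_1} = W_1 - \alpha \frac{\partial E}{\partial W_1}, \\
W_2 &\leftarrow W_2 + \alpha\, z_1^{\top} \delta_o = W_2 - \alpha \frac{\partial E}{\partial W_2}.
\end{aligned}
```

The plus signs in the code's updates are therefore gradient descent steps, because the deltas already carry the minus sign of the gradient. The reported loss averages over the batch, while the update itself sums over the batch.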

#### 4.1.3 Test

To verify the correctness of our method, we plot the loss and accuracy curves over the 400 epochs. We can see that, in general, the accuracy is increasing and the loss is decreasing. However, strangely, the loss curve changes abruptly around epochs 150, 300 and 350, and the log above shows the loss rising from 0.0044 at epoch 200 to 0.0955 at epoch 250. Why?

After further exploration, we found the cause: the learning rate is too large, so the updates overshoot and our method cannot settle at an optimum. That’s a really interesting thing! Problem 2 below repeats the training with a smaller learning rate.
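The overshooting effect can be reproduced on a toy problem. The sketch below runs plain gradient descent on the one-dimensional quadratic $f(w) = 5w^2$, which is only an illustrative stand-in for the network's loss surface; the function name `gradient_descent` and both learning rates are choices made for this example.

```python
def gradient_descent(lr, steps=20, w0=1.0):
    """Minimise f(w) = 5 * w**2 (gradient 10 * w), starting from w0."""
    w = w0
    history = [w]
    for _ in range(steps):
        w = w - lr * 10.0 * w  # w is multiplied by (1 - 10 * lr) at every step
        history.append(w)
    return history

small = gradient_descent(lr=0.01)  # factor 0.9: shrinks steadily towards 0
large = gradient_descent(lr=0.21)  # factor -1.1: flips sign and grows

print('lr=0.01 -> |w| after 20 steps: {:.4f}'.format(abs(small[-1])))
print('lr=0.21 -> |w| after 20 steps: {:.4f}'.format(abs(large[-1])))
```

With the small step the iterate moves steadily towards the minimum, while with the large step every update jumps across the minimum and lands farther away than before.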

### 4.2 Problem 2

Set learning rate = 0.01 to train the model and show the two curves below.

learning_rate = 0.01
epochs = 400 # training epoch
batch_size = len(train_data) # GD
# batch_size = 1               # SGD
# batch_size = 8               # mini-batch

model = FeedForward_Neural_Network(learning_rate) # declare a simple feedforward neural model

losses = []
accuracies = []

for i in range(epochs):
    loss = 0
    for index, (xs, ys) in enumerate(generate_batch(train_data, train_labels_onehot, batch_size)):
        predictions = model.forward(xs) # forward phase
        loss += 1/2 * np.mean(np.sum(np.square(ys-predictions), axis=1)) # Mean square error
        model.backward(xs, ys, predictions) # backward phase

    losses.append(loss)

    # train dataset acc computation
    predictions = model.forward(train_data)
    # compute acc on train dataset
    accuracy = get_accuracy(predictions, train_labels_onehot)
    accuracies.append(accuracy)

    if i % 50 == 0:
        # print('Epoch: {}, has {} iterations'.format(i, index+1))
        # print('\tLoss: {:.4f}, \tAccuracy: {:.4f}'.format(loss, accuracy))
        pass

test_predictions = model.forward(test_data)
# compute acc on test dataset
test_accuracy = get_accuracy(test_predictions, test_labels_onehot)
print('Test Accuracy: {:.4f}'.format(test_accuracy))

# Draw losses curve using losses
show_curve(losses, 'Loss')

# Draw Accuracy curve using accuracies
show_curve(accuracies, 'Accuracy')

Test Accuracy: 1.0000


See? The loss now decreases steadily as the epochs go on, and the accuracy increases to one!

### 4.3 Problem 3

Use SGD and mini-batch to train the model and show the four curves below.

#### 4.3.1 SGD

learning_rate = 0.1
epochs = 400 # training epoch
# batch_size = len(train_data) # GD
batch_size = 1               # SGD
# batch_size = 8               # mini-batch

model = FeedForward_Neural_Network(learning_rate) # declare a simple feedforward neural model

losses = []
accuracies = []

for i in range(epochs):
    loss = 0
    for index, (xs, ys) in enumerate(generate_batch(train_data, train_labels_onehot, batch_size)):
        predictions = model.forward(xs) # forward phase
        loss += 1/2 * np.mean(np.sum(np.square(ys-predictions), axis=1)) # Mean square error
        model.backward(xs, ys, predictions) # backward phase

    losses.append(loss)

    # train dataset acc computation
    predictions = model.forward(train_data)
    # compute acc on train dataset
    accuracy = get_accuracy(predictions, train_labels_onehot)
    accuracies.append(accuracy)

    if i % 50 == 0:
        # print('Epoch: {}, has {} iterations'.format(i, index+1))
        # print('\tLoss: {:.4f}, \tAccuracy: {:.4f}'.format(loss, accuracy))
        pass

test_predictions = model.forward(test_data)
# compute acc on test dataset
test_accuracy = get_accuracy(test_predictions, test_labels_onehot)
print('Test Accuracy: {:.4f}'.format(test_accuracy))

# Draw losses curve using losses
show_curve(losses, 'Loss')

# Draw Accuracy curve using accuracies
show_curve(accuracies, 'Accuracy')

Test Accuracy: 1.0000


#### 4.3.2 Mini-Batch

learning_rate = 0.1
epochs = 400 # training epoch
# batch_size = len(train_data) # GD
# batch_size = 1               # SGD
batch_size = 8               # mini-batch

model = FeedForward_Neural_Network(learning_rate) # declare a simple feedforward neural model

losses = []
accuracies = []

for i in range(epochs):
    loss = 0
    for index, (xs, ys) in enumerate(generate_batch(train_data, train_labels_onehot, batch_size)):
        predictions = model.forward(xs) # forward phase
        loss += 1/2 * np.mean(np.sum(np.square(ys-predictions), axis=1)) # Mean square error
        model.backward(xs, ys, predictions) # backward phase

    losses.append(loss)

    # train dataset acc computation
    predictions = model.forward(train_data)
    # compute acc on train dataset
    accuracy = get_accuracy(predictions, train_labels_onehot)
    accuracies.append(accuracy)

    if i % 50 == 0:
        # print('Epoch: {}, has {} iterations'.format(i, index+1))
        # print('\tLoss: {:.4f}, \tAccuracy: {:.4f}'.format(loss, accuracy))
        pass

test_predictions = model.forward(test_data)
# compute acc on test dataset
test_accuracy = get_accuracy(test_predictions, test_labels_onehot)
print('Test Accuracy: {:.4f}'.format(test_accuracy))

# Draw losses curve using losses
show_curve(losses, 'Loss')

# Draw Accuracy curve using accuracies
show_curve(accuracies, 'Accuracy')

Test Accuracy: 1.0000


From this perspective, we can see that one remedy for a poorly chosen step size is to increase the number of iterations per epoch. We are able to compensate for this problem by adopting mini-batch and SGD.
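To put numbers on "more iterations", the short sketch below counts the weight updates each setting performs, using the training-set size (210) and epoch count (400) reported above. The dictionary name `updates` is illustrative.

```python
import math

n_train = 210  # training-set size reported above
epochs = 400

updates = {}
for name, batch_size in [('GD', n_train), ('mini-batch', 8), ('SGD', 1)]:
    per_epoch = math.ceil(n_train / batch_size)  # iterations produced by generate_batch
    updates[name] = (per_epoch, per_epoch * epochs)
    print('{:>10}: {:3d} updates per epoch, {:5d} in total'.format(name, per_epoch, per_epoch * epochs))
```

A further point that may matter here: `backward` sums the gradient over the batch without dividing by the batch size, so at the same `learning_rate` a full-batch step is built from 210 per-sample contributions, whereas an SGD step uses only one. This could be part of why a learning rate of 0.1 behaves well with SGD and mini-batch but oscillates with full-batch GD.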