RNN-1

Deep Learning Notes (12)

Posted by Nino Lau on May 11, 2019

0. Introduction

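This lab is adapted mainly from the official PyTorch tutorials, which have been merged into a single notebook. If anything in this material is unclear, contact me directly on WeChat or by email at cuizhiying.csu@gmail.com
Original author: `Sean Robertson <https://github.com/spro/practical-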

0.1 Experimental content and requirements

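This lab has two main parts, word classification and word generation. The requirements are:

  1. Run the RNN end to end and get familiar with the basic code structure; combine this with the lecture material to deepen your understanding of RNNs
  2. Answer the questions posed in the lab guide on your own (brief answers are fine)
  3. Following the lab guide, fill in the missing code so that the program runs correctly
  4. Work independently; plagiarism is prohibited
  5. When you finish, download the whole folder (keep the program output), compress it, and upload it to the supercomputing classroom website (use zip format only).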

These are all good additions to the understanding of RNNs.

0.3 Setup CUDA

We can run the whole experiment without CUDA, but it is naturally better with CUDA. First, check and configure the CUDA environment.

import torch
is_cuda = False
print(is_cuda)
False

Select an idle GPU.

id = 0
torch.cuda.set_device(id)
print( torch.cuda.current_device() )
0

1. Classifying Names with a Character-Level RNN


We will be building and training a basic character-level RNN to classify words. A character-level RNN reads words as a series of characters - outputting a prediction and “hidden state” at each step, feeding its previous hidden state into each next step. We take the final prediction to be the output, i.e. which class the word belongs to.

Specifically, we’ll train on a few thousand surnames from 18 languages of origin, and predict which language a name is from based on the spelling:

1.1 Preparing the Data

Download the data from here and extract it to the current directory.

Included in the data/names directory are 18 text files named as “[Language].txt”. Each file contains a bunch of names, one name per line, mostly romanized (but we still need to convert from Unicode to ASCII).

We’ll end up with a dictionary of lists of names per language, {language: [names ...]}. The generic variables “category” and “line” (for language and name in our case) are used for later extensibility.

from __future__ import unicode_literals, print_function, division
from io import open
import glob
import os

def findFiles(path): return glob.glob(path)

print(findFiles('data/names/*.txt'))

import unicodedata
import string

all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)

# Turn a Unicode string to plain ASCII, thanks to https://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )

print(unicodeToAscii('Ślusàrski'))

# Build the category_lines dictionary, a list of names per language
all_categories = []
category_lines = {}

# Split it into training set and validation set
training_lines = {}
validation_lines = {}

# Read a file and split into lines
def readLines(filename):
    lines = open(filename, encoding='utf-8').read().strip().split('\n')
    return [unicodeToAscii(line) for line in lines]

for filename in findFiles('data/names/*.txt'):
    category = os.path.splitext(os.path.basename(filename))[0]
    all_categories.append(category)
    lines = readLines(filename)
    category_lines[category] = lines
    
    num_of_training_set = int(len(lines)*0.8)
    training_lines[category]   = lines[:num_of_training_set]
    validation_lines[category] = lines[num_of_training_set:] 

n_categories = len(all_categories)
print(n_categories)
['data/names/Polish.txt', 'data/names/Italian.txt', 'data/names/Vietnamese.txt', 'data/names/Czech.txt', 'data/names/German.txt', 'data/names/Spanish.txt', 'data/names/Japanese.txt', 'data/names/Russian.txt', 'data/names/Chinese.txt', 'data/names/Korean.txt', 'data/names/Irish.txt', 'data/names/French.txt', 'data/names/Dutch.txt', 'data/names/Greek.txt', 'data/names/Arabic.txt', 'data/names/English.txt', 'data/names/Scottish.txt', 'data/names/Portuguese.txt']
Slusarski
18

Now we have category_lines, a dictionary mapping each category (language) to a list of lines (names). We also kept track of all_categories (just a list of languages) and n_categories for later reference.

print(category_lines['Italian'][:5])
['Abandonato', 'Abatangelo', 'Abatantuono', 'Abate', 'Abategiovanni']

1.2 Turning Names into Tensors

Now that we have all the names organized, we need to turn them into Tensors to make any use of them.

To represent a single letter, we use a “one-hot vector” of size <1 x n_letters>. A one-hot vector is filled with 0s except for a 1 at index of the current letter, e.g. "b" = <0 1 0 0 0 ...>.

To make a word we stack a bunch of those into a 3D tensor of size <line_length x 1 x n_letters>.

That extra 1 dimension is because PyTorch assumes everything is in batches - we’re just using a batch size of 1 here.

import torch

# Find letter index from all_letters, e.g. "a" = 0
def letterToIndex(letter):
    return all_letters.find(letter)

# Just for demonstration, turn a letter into a <1 x n_letters> Tensor
def letterToTensor(letter):
    tensor = torch.zeros(1, n_letters)
    tensor[0][letterToIndex(letter)] = 1
    if is_cuda:
        tensor = tensor.cuda()
    return tensor

# Turn a line into a <line_length x 1 x n_letters>,
# or an array of one-hot letter vectors
def lineToTensor(line):
    tensor = torch.zeros(len(line), 1, n_letters)
    for li, letter in enumerate(line):
        tensor[li][0][letterToIndex(letter)] = 1
    if is_cuda:
        tensor = tensor.cuda()
    return tensor

# Tensor here, someone else may call it vector. 
print(letterToTensor('J'))

print(lineToTensor('Jones').size())
tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0.]])
torch.Size([5, 1, 57])
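
As a sanity check on the output above, the position of the 1 can be worked out by hand without PyTorch. The sketch below uses only the standard library; `one_hot` is a helper introduced here for illustration and is not part of the tutorial code.

```python
import string

# Same alphabet as in section 1.1: 52 ASCII letters plus 5 punctuation marks.
all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)

def one_hot(letter):
    """Return a plain-list one-hot vector of length n_letters (illustrative helper)."""
    vec = [0.0] * n_letters
    vec[all_letters.find(letter)] = 1.0
    return vec

# Lowercase letters occupy indices 0-25 and uppercase 26-51, so 'J' sits at 26 + 9.
j_index = all_letters.find('J')
word = [[one_hot(c)] for c in 'Jones']  # nested as line_length x 1 x n_letters

print(n_letters, j_index)                        # 57 35
print(len(word), len(word[0]), len(word[0][0]))  # 5 1 57
```

Index 35 is the last entry of the second printed row above, which is where the 1 appears. Note that `str.find` returns -1 for a character outside `all_letters`, which would silently mark the last position; `unicodeToAscii` filters such characters out before this point.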

1.3 Creating the Network

Before autograd, creating a recurrent neural network in Torch involved cloning the parameters of a layer over several timesteps. The layers held hidden state and gradients which are now entirely handled by the graph itself. This means you can implement an RNN in a very “pure” way, as regular feed-forward layers.

Inside the forward function, we need to loop over the letters of the word, calling step once per letter and carrying the hidden state from one step to the next.

import torch.nn as nn

class BaseRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(BaseRNN, self).__init__()

        self.hidden_size = hidden_size

        # input to hidden
        self.i2h = nn.Linear(input_size,  hidden_size)
        # hidden to hidden
        self.h2h = nn.Linear(hidden_size, hidden_size)
        self.activation = nn.ReLU()
        # hidden to output
        self.h2o = nn.Linear(hidden_size, output_size)
        
    def step(self, letter, hidden):
        i2h = self.i2h(letter)
        h2h = self.h2h(hidden)
        hidden = self.activation( h2h+i2h )
        
        output = self.h2o(hidden)
        return output, hidden
    
    def forward(self, word):
        # Create the initial hidden state on the same device as the input
        hidden = self.initHidden(word.is_cuda)
        for i in range(word.size()[0]):
            # Only the last output will be used to predict
            output, hidden = self.step(word[i], hidden)
        return output
    
    def initHidden(self, is_cuda=True):
        if is_cuda:
            return torch.zeros(1, self.hidden_size).cuda()
        else:
            return torch.zeros(1, self.hidden_size)

n_hidden = 128
rnn = BaseRNN(n_letters, n_hidden, n_categories)
if is_cuda:
    rnn = rnn.cuda()

class DeeperRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(DeeperRNN, self).__init__()
        self.hidden1_size = hidden_size
        self.hidden2_size = hidden_size
        self.layer1 = BaseRNN(input_size, hidden_size, hidden_size)
        self.layer2 = BaseRNN(hidden_size, hidden_size, output_size)
        
    def step(self, letter, hidden1, hidden2):
        output1, hidden1 = self.layer1.step(letter, hidden1)
        output2, hidden2 = self.layer2.step(output1, hidden2)
        
        return output2, hidden1, hidden2
    
    def forward(self, word):
        # Create both initial hidden states on the same device as the input
        hidden1, hidden2 = self.initHidden(word.is_cuda)
        for i in range(word.size()[0]):
            # Only the last output will be used to predict
            output, hidden1, hidden2 = self.step(word[i], hidden1, hidden2)
        return output

        
    def initHidden(self, is_cuda=True):
        if is_cuda:
            return torch.zeros(1, self.hidden1_size).cuda(), torch.zeros(1, self.hidden2_size).cuda()
        else:
            return torch.zeros(1, self.hidden1_size), torch.zeros(1, self.hidden2_size)

n_hidden = 128
rnn = DeeperRNN(n_letters, n_hidden, n_categories)
rnn = rnn.cuda() if is_cuda else rnn

To run a step of this network we need to pass an input (in our case, the Tensor for the current letter) and a previous hidden state (which we initialize as zeros at first). We’ll get back the output (a score for each language) and a next hidden state (which we keep for the next step).
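
Written as equations, one call to step in BaseRNN computes the following. The symbols W and b are names chosen here for the weights and biases of the i2h, h2h and h2o linear layers, and \phi stands for whichever activation is configured (nn.ReLU in the code above).

```latex
h_t = \phi\left(W_{ih}\, x_t + b_{ih} + W_{hh}\, h_{t-1} + b_{hh}\right), \qquad
o_t = W_{ho}\, h_t + b_{ho}, \qquad h_0 = 0
```

For a word of T letters the classifier keeps only the last output o_T. DeeperRNN applies a second recurrence of the same form, with the first layer's o_t taking the place of x_t.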

input = letterToTensor('A')
hidden =torch.zeros(1, n_hidden)
hidden2 =torch.zeros(1, n_hidden)
hidden = hidden.cuda() if is_cuda else hidden
hidden2 = hidden2.cuda() if is_cuda else hidden2

output, next_hidden, next_hidden2 = rnn.step(input, hidden,hidden2)
print(output.shape)
print(next_hidden.shape)
torch.Size([1, 18])
torch.Size([1, 128])

For the sake of efficiency we don’t want to be creating a new Tensor for every step, so we will use lineToTensor instead of letterToTensor and use slices. This could be further optimized by pre-computing batches of Tensors.

Each training iteration will:

  • Create input and target tensors
  • Create a zeroed initial hidden state
  • Read each letter in and keep the hidden state for the next letter
  • Compare the final output to the target
  • Back-propagate
  • Return the output and loss

input = lineToTensor('Albert')
print(n_hidden)

output = rnn(input)
print(output)
print(output.shape)
128
tensor([[ 0.0510,  0.0372, -0.0115,  0.0182, -0.0876,  0.1243, -0.0355, -0.0904,
          0.0170,  0.0767, -0.0053,  0.1697, -0.0931,  0.0245, -0.0771,  0.0039,
         -0.0405, -0.0546]], grad_fn=<AddmmBackward>)
torch.Size([1, 18])

As you can see the output is a <1 x n_categories> Tensor, where every item is an unnormalized score for that category (higher is more likely).

1.4 Training

1.4.1 Preparing for Training

Before going into training we should make a few helper functions. The first is to interpret the output of the network, which we know to be a likelihood of each category. We can use Tensor.topk to get the index of the greatest value:

def categoryFromOutput(output):
    top_n, top_i = output.topk(1)
    category_i = top_i[0].item()
    return all_categories[category_i], category_i

print(categoryFromOutput(output))
('French', 11)
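
The same lookup can be reproduced by hand from the scores printed for 'Albert' in section 1.3. The sketch below is plain Python rather than PyTorch: it takes the arg-max, which is what topk(1) returns, and also applies a softmax to turn the raw scores into probabilities, as predict does in section 1.6. The helper name `softmax` is introduced here for illustration.

```python
import math

# Scores printed for 'Albert' above, copied from the output of rnn(input).
scores = [0.0510, 0.0372, -0.0115, 0.0182, -0.0876, 0.1243, -0.0355, -0.0904,
          0.0170, 0.0767, -0.0053, 0.1697, -0.0931, 0.0245, -0.0771, 0.0039,
          -0.0405, -0.0546]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

best_index = max(range(len(scores)), key=lambda i: scores[i])
probs = softmax(scores)

print(best_index)            # 11, the index reported for 'French' above
print(round(sum(probs), 6))  # 1.0
```

Because the network is untrained at this point, the scores are all close to zero and the resulting probabilities are close to uniform over the 18 languages.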

We will also want a quick way to get a training example (a name and its language):

import random

def randomChoice(l):
    return l[random.randint(0, len(l) - 1)]

def randomTrainingExample():
    category = randomChoice(all_categories)
    # attention: split training set 
    line = randomChoice(training_lines[category])
    category_tensor = torch.tensor([all_categories.index(category)], dtype=torch.long)
    category_tensor = category_tensor.cuda() if is_cuda else category_tensor
    line_tensor = lineToTensor(line)
    return category, line, category_tensor, line_tensor

def randomValidationExample():
    category = randomChoice(all_categories)
    # attention: split validation set
    line = randomChoice(validation_lines[category])
    category_tensor = torch.tensor([all_categories.index(category)], dtype=torch.long)
    category_tensor = category_tensor.cuda() if is_cuda else category_tensor
    line_tensor = lineToTensor(line)
    return category, line, category_tensor, line_tensor

for i in range(10):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    print('category =', category, '/ line =', line)
category = Vietnamese / line = Dao
category = Italian / line = Pesaro
category = English / line = Greenleaf
category = Italian / line = Nieri
category = Irish / line = Bradach
category = Russian / line = Esipovich
category = Dutch / line = Romeijnders
category = Vietnamese / line = Chau
category = Portuguese / line = Castro
category = Greek / line = Kefalas

1.4.2 Training the Network

Now all it takes to train this network is show it a bunch of examples, have it make guesses, and tell it if it’s wrong.

criterion = nn.CrossEntropyLoss()

learning_rate = 0.005 # If you set this too high, it might explode. If too low, it might not learn

def train(category_tensor, line_tensor):
    output = rnn(line_tensor)
    rnn.zero_grad()
    loss = criterion(output, category_tensor)
    loss.backward()

    # Gradient-descent update: subtract each parameter's gradient, scaled by the learning rate
    for p in rnn.parameters():
        if p.grad is not None:
            p.data.add_(p.grad.data, alpha=-learning_rate)

    return output, loss.item()
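
The parameter update inside train is plain gradient descent: each parameter moves against its gradient by learning_rate times the gradient. A minimal sketch of the same rule on a one-parameter problem, in plain Python; the quadratic loss and the names `w`, `grad` and `sgd_step` are illustrative assumptions, not part of the tutorial.

```python
def sgd_step(w, grad, learning_rate):
    """One gradient-descent update: move against the gradient."""
    return w - learning_rate * grad

# Toy loss L(w) = (w - 3)^2 with gradient dL/dw = 2 * (w - 3).
w = 0.0
learning_rate = 0.1
for _ in range(100):
    grad = 2.0 * (w - 3.0)
    w = sgd_step(w, grad, learning_rate)

print(round(w, 4))  # 3.0
```

In this toy problem each step multiplies the distance to the minimum by (1 - 2 * learning_rate), so a learning rate above 1.0 makes the iterates diverge and a very small one makes progress slow, which are the two failure modes mentioned in the learning_rate comment.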

Now we just have to run that with a bunch of examples. Since the train function returns both the output and loss we can print its guesses and also keep track of loss for plotting. Since there are 1000s of examples we print only every print_every examples, and take an average of the loss.

import time
import math

n_iters = 100000
print_every = 5000
plot_every = 1000

# Keep track of losses for plotting
current_loss = 0
all_losses = []

def timeSince(since):
    now = time.time()
    s = now - since
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

start = time.time()

for iter in range(1, n_iters + 1):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    output, loss = train(category_tensor, line_tensor)
    current_loss += loss

    # Print iter number, loss, name and guess
    if iter % print_every == 0:
        guess, guess_i = categoryFromOutput(output)
        correct = '✓' if guess == category else '✗ (%s)' % category
        print('%d %d%% (%s) %.4f %s / %s %s' % (iter, iter / n_iters * 100, timeSince(start), loss, line, guess, correct))

    # Add current loss avg to list of losses
    if iter % plot_every == 0:
        all_losses.append(current_loss / plot_every)
        current_loss = 0
5000 5% (0m 29s) 2.8270 Kakutama / Italian ✗ (Japanese)
10000 10% (1m 1s) 2.2744 Delgado / Japanese ✗ (Portuguese)
15000 15% (1m 29s) 3.3385 Kaneko / Portuguese ✗ (Japanese)
20000 20% (1m 57s) 4.2050 Seto / Spanish ✗ (Chinese)
25000 25% (2m 25s) 0.9056 Hui / Chinese ✓
30000 30% (2m 54s) 2.2105 Kool / German ✗ (Dutch)
35000 35% (3m 23s) 2.6755 Lupo / Portuguese ✗ (Italian)
40000 40% (3m 52s) 0.0967 Alvarez / Spanish ✓
45000 45% (4m 20s) 0.1633 Hanraets / Dutch ✓
50000 50% (4m 47s) 0.6894 Dickson / Scottish ✓
55000 55% (5m 15s) 1.4523 Delgado / Spanish ✗ (Portuguese)
60000 60% (5m 42s) 0.5448 Pei / Chinese ✓
65000 65% (6m 10s) 3.9038 Dannenberg / German ✗ (Russian)
70000 70% (6m 41s) 1.1801 Kelly / Scottish ✗ (Irish)
75000 75% (7m 11s) 0.6070 Gajos / Polish ✓
80000 80% (7m 38s) 0.1557 Beloshitsky / Russian ✓
85000 85% (8m 8s) 3.6895 Agilera / Spanish ✗ (Russian)
90000 90% (8m 36s) 0.0096 Shammas / Arabic ✓
95000 95% (9m 2s) 0.0291 Magalhaes / Portuguese ✓
100000 100% (9m 31s) 0.0076 Haritopoulos / Greek ✓

1.4.3 Plotting the Results


Plotting the historical loss from all_losses shows the network learning:

import matplotlib.pyplot as plt
plt.plot(all_losses)
plt.show()

1.5 Evaluating the Results

To see how well the network performs on different categories, we will create a confusion matrix, indicating for every actual language (rows) which language the network guesses (columns). To calculate the confusion matrix a bunch of samples are run through the network with evaluate(), which is the same as train() minus the backprop.

# Keep track of correct guesses in a confusion matrix
confusion_training   = torch.zeros(n_categories, n_categories)
confusion_validation = torch.zeros(n_categories, n_categories)
n_confusion = 5000

# Just return an output given a line
def evaluate(line_tensor):
    rnn.eval()
    output = rnn(line_tensor)
    return output

# Go through a bunch of examples and record which are correctly guessed
for i in range(n_confusion):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    output = evaluate(line_tensor)
    guess, guess_i = categoryFromOutput(output)
    category_i = all_categories.index(category)
    confusion_training[category_i][guess_i] += 1

    
# Go through a bunch of examples and record which are correctly guessed
for i in range(n_confusion):
    category, line, category_tensor, line_tensor = randomValidationExample()
    output = evaluate(line_tensor)
    guess, guess_i = categoryFromOutput(output)
    category_i = all_categories.index(category)
    confusion_validation[category_i][guess_i] += 1
    
    
# Calculate accuracy
right_train = 0
right_valid = 0
for i in range(n_categories):
    right_train += confusion_training[i][i]
    right_valid += confusion_validation[i][i]
acc_train = right_train / n_confusion
acc_valid = right_valid / n_confusion

# Normalize by dividing every row by its sum
for i in range(n_categories):
    confusion_training[i] = confusion_training[i] / confusion_training[i].sum()
    confusion_validation[i] = confusion_validation[i] / confusion_validation[i].sum()


# Set up plot
fig = plt.figure()
ax1 = fig.add_subplot(121)
cax1 = ax1.matshow(confusion_training.numpy())

ax2 = fig.add_subplot(122)
cax2 = ax2.matshow(confusion_validation.numpy())


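# Set up axes: one tick per category so that every label lines up with its row/column
ax1.set_xticks(range(n_categories))
ax1.set_yticks(range(n_categories))
ax1.set_xticklabels(all_categories, rotation=90)
ax1.set_yticklabels(all_categories)
ax2.set_xticks(range(n_categories))
ax2.set_xticklabels(all_categories, rotation=90)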

# sphinx_gallery_thumbnail_number = 2
plt.show()

print("Training set Acc is", acc_train.item())
print("validation set Acc is", acc_valid.item())

Training set Acc is 0.732200026512146
validation set Acc is 0.44699999690055847

You can pick out bright spots off the main axis that show which languages it guesses incorrectly, e.g. Chinese for Korean, and Spanish for Italian. It seems to do very well with Greek, and very poorly with English (perhaps because of overlap with other languages).
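
To make the two quantities computed above concrete, here is a small worked example in plain Python. The 3 x 3 counts are made up for illustration; the tutorial's matrices are 18 x 18 tensors.

```python
# Rows are actual classes, columns are guesses (counts are made up for illustration).
confusion = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 5, 5],
]

# Accuracy is the diagonal (correct guesses) divided by the number of samples.
total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(confusion)))
accuracy = correct / total

# Row-normalize so each row shows how one actual class was spread over the guesses.
normalized = [[count / sum(row) for count in row] for row in confusion]

print(round(accuracy, 4))  # 0.6333
print(normalized[2])       # [0.0, 0.5, 0.5]
```

The accuracy has to be computed from the raw counts, before the rows are normalized, which is the order used in the code above. The bright off-diagonal entry in the last row (0.5 in column 1) is the kind of spot the plot makes visible.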

1.6 Running on User Input

def predict(input_line, n_predictions=3):
    print('\n> %s' % input_line)
    with torch.no_grad():
        output = evaluate(lineToTensor(input_line))
        output = torch.nn.functional.softmax(output, dim=1)

        # Get top N categories
        topv, topi = output.topk(n_predictions, 1, True)
        predictions = []

        for i in range(n_predictions):
            value = topv[0][i].item()
            category_index = topi[0][i].item()
            print('Probability (%.2f) %s' % (value, all_categories[category_index]))
            predictions.append([value, all_categories[category_index]])

predict('Dovesky')
predict('Jackson')
predict('Satoshi')
predict("Cui")
predict("Zhuang")
predict("Xue")
predict("Wang")
predict("Liu")
predict("Shuo")
predict("Nino")
predict("Lau")
> Dovesky
Probability (0.35) Czech
Probability (0.29) English
Probability (0.14) Greek

> Jackson
Probability (0.99) Scottish
Probability (0.01) English
Probability (0.00) French

> Satoshi
Probability (0.63) Italian
Probability (0.30) Japanese
Probability (0.06) Arabic

> Cui
Probability (0.83) Vietnamese
Probability (0.14) Chinese
Probability (0.01) Japanese

> Zhuang
Probability (0.80) Vietnamese
Probability (0.19) Chinese
Probability (0.00) Korean

> Xue
Probability (0.93) Chinese
Probability (0.02) German
Probability (0.01) Korean

> Wang
Probability (0.87) Chinese
Probability (0.05) Korean
Probability (0.03) Vietnamese

> Liu
Probability (0.44) Chinese
Probability (0.36) Vietnamese
Probability (0.18) Polish

> Shuo
Probability (0.97) Chinese
Probability (0.01) Japanese
Probability (0.01) Vietnamese

> Nino
Probability (0.52) Spanish
Probability (0.35) Italian
Probability (0.07) Portuguese

> Lau
Probability (0.98) Vietnamese
Probability (0.02) Chinese
Probability (0.00) German

1.7 Exercises

==============================================================================================================

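1. In section 1.4.3, write code to show how the loss changes during training. (The loss is already recorded during training; find it and plot it as a curve. You may also modify the training code yourself to produce a nicer curve.)

2. In section 1.6, enter the pinyin of each character of your own name and check whether the model predicts correctly (keep the prediction results above). For example, if your name is 张三 (Zhang San), call predict('Zhang') and predict('San').

3. Edit this cell and answer on the line below. Change the activation function used in the current BaseRNN to ReLU (in section 1.3), then retrain the model and see whether the performance changes.

The current activation function is ( Tanh ); training set accuracy: ( 0.732200026512146 ), validation set accuracy: ( 0.35659543275654894 )
With ReLU as the activation function, training accuracy: ( 0.7541999887751331 ), validation set accuracy: ( 0.41819998666409793 )

4. Edit this cell and record the current model's accuracy on the line below. Then go back to section 1.3 and edit the DeeperRNN class so that it is one layer deeper than BaseRNN, retrain this deeper network, and see whether the performance improves (remember to uncomment the last two lines of the DeeperRNN cell). For more on adding hidden layers to an RNN, see the supplementary material in the next cell.

BaseRNN training set accuracy: ( 0.732200026512146 ), validation set accuracy: ( 0.44699999690055847 )
DeeperRNN training accuracy: ( 0.823405736449854 ), validation set accuracy: ( 0.4836000054871004 )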

Supplementary material:

RNNs are neural networks and everything works monotonically better (if done right) if you put on your deep learning hat and start stacking models up like pancakes. For instance, we can form a 2-layer recurrent network as follows:

y1 = rnn1.step(x)
y = rnn2.step(y1)

In other words we have two separate RNNs: One RNN is receiving the input vectors and the second RNN is receiving the output of the first RNN as its input. Except neither of these RNNs know or care - it’s all just vectors coming in and going out, and some gradients flowing through each module during backpropagation.
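
The stacking described in the quote can be sketched with two scalar recurrent cells in plain Python. `TinyRNN` and its weights are made up for illustration; like the quoted rnn.step(x), it keeps its hidden state internally, whereas BaseRNN.step in section 1.3 takes the hidden state as an explicit argument.

```python
import math

class TinyRNN:
    """Scalar recurrent cell: h_t = tanh(w_in * x_t + w_rec * h_{t-1})."""
    def __init__(self, w_in, w_rec):
        self.w_in = w_in
        self.w_rec = w_rec
        self.h = 0.0  # hidden state, carried between calls to step

    def step(self, x):
        self.h = math.tanh(self.w_in * x + self.w_rec * self.h)
        return self.h

rnn1 = TinyRNN(w_in=0.5, w_rec=0.9)
rnn2 = TinyRNN(w_in=1.5, w_rec=-0.3)

outputs = []
for x in [1.0, 0.0, -1.0]:
    y1 = rnn1.step(x)   # first layer reads the input
    y = rnn2.step(y1)   # second layer reads the first layer's output
    outputs.append(y)

print(len(outputs))  # 3
```

Each layer only ever sees a number coming in and produces a number going out, which is the point of the quote: the second cell does not need to know that its input came from another recurrent cell.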

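Click here to read the original article.

The multi-layer hidden structure is shown in the figure below, where an output drawn with a dashed line means that the output value is discarded and not used further: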

2. Generating Names with a Character-Level RNN


We are still hand-crafting a small RNN with a few linear layers. The big difference is instead of predicting a category after reading in all the letters of a name, we input a category and output one letter at a time. Recurrently predicting characters to form language (this could also be done with words or other higher order constructs) is often referred to as a “language model”.

2.1 Creating the Network

This network extends the last tutorial’s RNN with an extra argument for the category tensor, which is added with the others. The category tensor is a one-hot vector just like the letter input.

We will interpret the output as the probability of the next letter. When sampling, the most likely output letter is used as the next input letter.

A linear layer h2o then maps the combined hidden state to the output scores. There’s also a dropout layer, which randomly zeros parts of its input with a given probability (here 0.2) and is usually used to fuzz inputs to prevent overfitting. Here we’re using it towards the end of the network to purposely add some chaos and increase sampling variety.
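
To see what the dropout layer does to the scores, here is a plain-Python sketch of inverted dropout, the scheme nn.Dropout uses: in training mode each value is zeroed with probability p and the survivors are scaled by 1 / (1 - p); in evaluation mode the input passes through unchanged. The function `dropout` and the example scores are illustrative, not the library implementation.

```python
import random

def dropout(values, p, training, rng):
    """Inverted dropout: zero each value with probability p, rescale survivors by 1/(1-p)."""
    if not training:
        return list(values)  # evaluation mode: identity
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * keep_scale for v in values]

rng = random.Random(0)
scores = [1.0, 2.0, 3.0, 4.0, 5.0]

train_a = dropout(scores, p=0.2, training=True, rng=rng)
train_b = dropout(scores, p=0.2, training=True, rng=rng)
eval_out = dropout(scores, p=0.2, training=False, rng=rng)

print(eval_out)  # [1.0, 2.0, 3.0, 4.0, 5.0]
```

Since dropout is applied to the output scores in GenerateRNN, two forward passes in training mode can rank the letters differently, while passes made after calling rnn.eval() are deterministic.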

import torch
import torch.nn as nn

class GenerateRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(GenerateRNN, self).__init__()
        self.hidden_size = hidden_size
        self.c2h = nn.Linear(n_categories, hidden_size)
        self.i2h = nn.Linear(input_size,   hidden_size)
        self.h2h = nn.Linear(hidden_size,  hidden_size)
        self.activation = nn.Tanh()
        self.h2o = nn.Linear(hidden_size,  output_size)
        self.dropout = nn.Dropout(0.2)

        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, category, input, hidden):
        c2h = self.c2h(category)
        i2h = self.i2h(input)
        h2h = self.h2h(hidden)
        
        hidden = self.activation( c2h+i2h+h2h )
        

        dropout = self.dropout(self.h2o(hidden))
        output = self.softmax(dropout)
        return output, hidden

    def initHidden(self, is_cuda=True):
        if is_cuda:
            return torch.zeros(1, self.hidden_size).cuda()
        else:
            return torch.zeros(1, self.hidden_size)

2.2 Training

2.2.1 Preparing for Training

import random

# Random item from a list
def randomChoice(l):
    return l[random.randint(0, len(l) - 1)]

# Get a random category and random line from that category
def randomTrainingPair():
    category = randomChoice(all_categories)
    line = randomChoice(category_lines[category])
    return category, line

For each timestep (that is, for each letter in a training word) the inputs of the network will be (category, current letter, hidden state) and the outputs will be (next letter, next hidden state). So for each training set, we’ll need the category, a set of input letters, and a set of output/target letters.

Since we are predicting the next letter from the current letter for each timestep, the letter pairs are groups of consecutive letters from the line - e.g. for "ABCD<EOS>" we would create (“A”, “B”), (“B”, “C”), (“C”, “D”), (“D”, “EOS”).
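
The pairing can be written out in a few lines of plain Python. `letter_pairs` and the `"<EOS>"` string are illustrative; the tutorial code below builds the same pairs as tensors of indices.

```python
EOS = "<EOS>"  # illustrative end-of-sequence marker

def letter_pairs(line):
    """Return (input letter, target letter) pairs; the last target is EOS."""
    targets = list(line[1:]) + [EOS]
    return list(zip(line, targets))

print(letter_pairs("ABCD"))
# [('A', 'B'), ('B', 'C'), ('C', 'D'), ('D', '<EOS>')]
```

The inputs are the letters of the line and the targets are the same letters shifted left by one, with EOS appended, so both sequences have the same length.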


The category tensor is a one-hot tensor of size <1 x n_categories>. When training we feed it to the network at every timestep - this is a design choice, it could have been included as part of initial hidden state or some other strategy.

Or you could look here for another description. I think the image there tells you everything.

# One-hot vector for category
def categoryTensor(category):
    li = all_categories.index(category)
    tensor = torch.zeros(1, n_categories)
    tensor[0][li] = 1
    if is_cuda:
        tensor = tensor.cuda()
    return tensor

# One-hot matrix of first to last letters (not including EOS) for input
def inputTensor(line):
    tensor = torch.zeros(len(line), 1, n_letters)
    for li in range(len(line)):
        letter = line[li]
        tensor[li][0][all_letters.find(letter)] = 1
    if is_cuda:
        tensor = tensor.cuda()
    return tensor

# LongTensor of second letter to end (EOS) for target
def targetTensor(line):
    letter_indexes = [all_letters.find(line[li]) for li in range(1, len(line))]
    letter_indexes.append(n_letters - 1) # EOS
    tensor = torch.LongTensor(letter_indexes)
    if is_cuda:
        tensor = tensor.cuda()
    return tensor

For convenience during training we’ll make a randomTrainingExample function that fetches a random (category, line) pair and turns them into the required (category, input, target) tensors.

# Make category, input, and target tensors from a random category, line pair
def randomTrainingExample():
    category, line = randomTrainingPair()
    category_tensor = categoryTensor(category)
    input_line_tensor = inputTensor(line)
    target_line_tensor = targetTensor(line)
    return category_tensor, input_line_tensor, target_line_tensor

2.2.2 Training the Network

In contrast to classification, where only the last output is used, we are making a prediction at every step, so we are calculating loss at every step.

The magic of autograd allows you to simply sum these losses at each step and call backward at the end.

For the loss function nn.NLLLoss is appropriate, since the last layer of the RNN is nn.LogSoftmax.

To see how cross entropy relates to nn.NLLLoss and nn.LogSoftmax, we can look at the source code. The most important line is quoted below, in simplified form:

def cross_entropy(input, target):
    return nll_loss(log_softmax(input, 1), target)

In a word, cross entropy combines log_softmax and nll_loss in a single function.
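
The identity can be checked numerically for a single example in plain Python. The functions below are simplified, unbatched stand-ins written for this sketch, not the PyTorch implementations, and the scores are made up.

```python
import math

def log_softmax(xs):
    """log(softmax(xs)), computed stably via the log-sum-exp trick."""
    m = max(xs)
    log_sum = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_sum for x in xs]

def nll_loss(log_probs, target):
    """Negative log-likelihood of the target class."""
    return -log_probs[target]

def cross_entropy(scores, target):
    return nll_loss(log_softmax(scores), target)

scores = [2.0, 1.0, 0.1]
target = 0

# Direct definition: -log of the softmax probability assigned to the target.
direct = -math.log(math.exp(scores[target]) / sum(math.exp(s) for s in scores))

print(math.isclose(cross_entropy(scores, target), direct))  # True
```

This is why section 1 could feed raw scores to nn.CrossEntropyLoss, while the generator here ends in nn.LogSoftmax and therefore pairs with nn.NLLLoss.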

criterion = nn.NLLLoss()

learning_rate = 0.0005

def train(category_tensor, input_line_tensor, target_line_tensor):
    target_line_tensor.unsqueeze_(-1)
    # Create the initial hidden state on the same device as the inputs
    hidden = rnn.initHidden(category_tensor.is_cuda)

    rnn.zero_grad()

    loss = 0

    # Take care of the loss function, 
    # it could be visualized as the following figure 
    for i in range(input_line_tensor.size(0)):
        output, hidden = rnn(category_tensor, input_line_tensor[i], hidden)
        l = criterion(output, target_line_tensor[i])
        loss += l

    loss.backward()

    # Gradient-descent update: subtract each parameter's gradient, scaled by the learning rate
    for p in rnn.parameters():
        if p.grad is not None:
            p.data.add_(p.grad.data, alpha=-learning_rate)

    return output, loss.item() / input_line_tensor.size(0)

The loss computation can be pictured as follows: the network runs forward through the entire sequence to compute the loss, then backward through the entire sequence to compute the gradient.

To keep track of how long training takes I am adding a timeSince(timestamp) function which returns a human readable string:

import time
import math

def timeSince(since):
    now = time.time()
    s = now - since
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

Training is business as usual - call train a bunch of times and wait a few minutes, printing the current time and loss every print_every examples, and keeping store of an average loss per plot_every examples in all_losses for plotting later.

is_cuda = True
rnn = GenerateRNN(n_letters, 128, n_letters)
rnn = rnn.cuda() if is_cuda else rnn

n_iters = 100000
print_every = 5000
plot_every = 500
all_losses = []
total_loss = 0 # Running total, never reset; divided by iter below to get the average loss so far

start = time.time()

for iter in range(1, n_iters + 1):
    output, loss = train(*randomTrainingExample())
    total_loss += loss

    if iter % print_every == 0:
        print('%s (%d %d%%) %.4f' % (timeSince(start), iter, iter / n_iters * 100, total_loss/iter))

    if iter % plot_every == 0:
        all_losses.append(total_loss / iter)
0m 44s (5000 5%) 3.1420
1m 29s (10000 10%) 3.0273
2m 13s (15000 15%) 2.9634
2m 57s (20000 20%) 2.9156
3m 41s (25000 25%) 2.8781
4m 26s (30000 30%) 2.8456
5m 10s (35000 35%) 2.8196
5m 54s (40000 40%) 2.7973
6m 38s (45000 45%) 2.7777
7m 22s (50000 50%) 2.7586
8m 6s (55000 55%) 2.7420
8m 49s (60000 60%) 2.7266
9m 33s (65000 65%) 2.7130
10m 18s (70000 70%) 2.6991
11m 3s (75000 75%) 2.6872
11m 49s (80000 80%) 2.6760
12m 34s (85000 85%) 2.6660
13m 19s (90000 90%) 2.6557
14m 5s (95000 95%) 2.6466
14m 50s (100000 100%) 2.6371

2.2.3 Plotting the Losses

Plotting the historical loss from all_losses shows the network learning:

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

plt.figure()
plt.plot(all_losses)

2.3 Sampling the Network

To sample we give the network a letter and ask what the next one is, feed that in as the next letter, and repeat until the EOS token.

  • Create tensors for input category, starting letter, and empty hidden state
  • Create a string output_name with the starting letter
  • Up to a maximum output length,

    • Feed the current letter to the network
    • Get the next letter from highest output, and next hidden state
    • If the letter is EOS, stop here
    • If a regular letter, add to output_name and continue
  • Return the final name

Note: Rather than having to give it a starting letter, another strategy would have been to include a “start of string” token in training and have the network choose its own starting letter.
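
The loop described above can be sketched independently of the network. In the plain-Python example below a made-up table of next-letter scores, `TOY_SCORES`, stands in for the trained RNN, and `greedy_sample` is a name chosen for this illustration.

```python
EOS = "<EOS>"
MAX_LENGTH = 20

# Made-up next-letter scores standing in for the trained network.
TOY_SCORES = {
    "L": {"i": 0.6, "a": 0.3, EOS: 0.1},
    "i": {"n": 0.5, "u": 0.4, EOS: 0.1},
    "n": {EOS: 0.7, "g": 0.3},
    "a": {"n": 0.8, EOS: 0.2},
    "u": {EOS: 1.0},
    "g": {EOS: 1.0},
}

def greedy_sample(start_letter):
    """Repeatedly take the highest-scoring next letter until EOS or MAX_LENGTH."""
    name = start_letter
    letter = start_letter
    for _ in range(MAX_LENGTH):
        scores = TOY_SCORES[letter]
        letter = max(scores, key=scores.get)
        if letter == EOS:
            break
        name += letter
    return name

print(greedy_sample("L"))  # Lin
```

Greedy decoding is deterministic for fixed scores, so any run-to-run variation in the real sampler has to come from the model itself, for example from a dropout layer that is still active.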

max_length = 20

# Sample from a category and starting letter
def sample(category, start_letter='A'):
    with torch.no_grad():  
        # no need to track history in sampling
        category_tensor = categoryTensor(category)
        input = inputTensor(start_letter)
        # Create the initial hidden state on the same device as the inputs
        hidden = rnn.initHidden(category_tensor.is_cuda)

        output_name = start_letter

        for i in range(max_length):
            output, hidden = rnn(category_tensor, input[0], hidden)
            topv, topi = output.topk(1)
            topi = topi[0][0]
            if topi == n_letters - 1:
                break
            else:
                letter = all_letters[topi]
                output_name += letter
            input = inputTensor(letter)

        return output_name

# Get multiple samples from one category and multiple starting letters
def samples(category, start_letters='ABC'):
    for start_letter in start_letters:
        print(sample(category, start_letter))
    print("\n")


samples('Russian', 'CCZZYY')

samples('German',  'CCZZYY')

samples('Spanish', 'CCZZYY')

samples('Chinese', 'CCZZYY')

samples('Chinese', 'LS')

samples('Chinese', 'LS')
Chanhin
Chanhov
Zhinhovov
Zantov
Yakinov
Yoverin


Cherrer
Chaner
Zerter
Zentener
Yerten
Yenter


Coraner
Calla
Zoner
Zoner
Yanera
Yaner


Cau
Cha
Zhang
Zung
Yan
Yan


Lia
Sin


Lan
Sin

2.4 Exercises

=======================================================================================================

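1. Repeatedly enter the first letter of your own name and observe whether the network generates the same name each time (keep the results produced with your own initial as the program input).
No.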

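2. Please answer: why does this model, once trained, produce different results for the same input? Verify your guess in the program above.

Because the network contains a dropout layer and rnn.eval() is not called before sampling, the model is still in training mode, so dropout randomly zeroes some of the output scores on every forward pass. The hidden state is reset to zeros at the start of each sample, so it does not explain the difference.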

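3. This is a generative model: given only one letter, it can predict a whole word. Once it has started, when does the model stop predicting new characters? In other words, how does this generative model determine the length of the generated word?

Generation stops when the predicted index is n_letters - 1, the position of the apostrophe (') in all_letters, which this code uses as the EOS marker. Otherwise it stops after max_length letters.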

4. Following the hints in the next cell, use importance sampling so that the program can generate different names for the same input.

As follows.

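Importance sampling here means the following. When generating each letter of a word, a typical generative model picks the letter with the highest output probability as its prediction, whereas importance sampling picks the output letter at random according to each letter's output probability. For example, suppose the program is given the letter C and predicts the next letter to be [a, o, e] with probabilities [0.7, 0.2, 0.1]. The usual approach simply picks a, the most probable letter. Importance sampling instead picks a with probability 70%, o with probability 20%, and e with probability 10%, so the final choice has some randomness. Based on your understanding, complete the importance sampling prediction below.

Tips: You can use the numpy.random.choice function; click here for the documentation.

Tips: This code differs from the code above only in how the predicted value is chosen, so you only need to edit the code in the region indicated below.

Tips: Using the model's output probabilities directly is awkward because they are very small (there are more than 50 characters). A smarter approach is to choose only among the 3 or 5 most probable letters; remember to renormalize the probabilities with softmax or a similar function so that they sum to 1.

Tips: An even simpler approach is to use the model's output probabilities only for ranking and then choose with fixed probabilities, for example 0.5 for the most probable letter, 0.3 for the second and 0.2 for the third, passing [0.5, 0.3, 0.2] directly.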
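
The selection step suggested in the tips can be sketched on its own, outside the network. The example below is plain Python and swaps in the standard library's random.choices in place of numpy.random.choice so that it runs without NumPy; the function `sample_top_k` and the log-probabilities are made up for illustration.

```python
import math
import random

def sample_top_k(log_probs, k, rng):
    """Sample an index among the k highest log-probabilities, renormalized to sum to 1."""
    top = sorted(range(len(log_probs)), key=lambda i: log_probs[i], reverse=True)[:k]
    weights = [math.exp(log_probs[i]) for i in top]
    total = sum(weights)
    probs = [w / total for w in weights]
    choice = rng.choices(top, weights=probs, k=1)[0]
    return choice, top, probs

rng = random.Random(0)
# Made-up log-probabilities for five letters (they exponentiate to 0.5, 0.2, 0.15, 0.1, 0.05).
log_probs = [math.log(p) for p in (0.5, 0.2, 0.15, 0.1, 0.05)]

index, top, probs = sample_top_k(log_probs, k=3, rng=rng)
print(top)                   # [0, 1, 2]
print(round(sum(probs), 6))  # 1.0
```

Because the model's output is a log-softmax, exponentiating the top-k values and dividing by their sum is the same as applying a softmax restricted to those k entries, which is what the tip about renormalizing asks for.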

import numpy as np
max_length = 20
def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    return np.exp(x) / np.sum(np.exp(x), axis=0)

def sample(category, start_letter='A'):
    with torch.no_grad():  
        # no need to track history in sampling
        category_tensor = categoryTensor(category)
        input = inputTensor(start_letter)
        output_name = start_letter
        # Create the initial hidden state on the same device as the inputs
        hidden = rnn.initHidden(category_tensor.is_cuda)
        for i in range(max_length):
            output, hidden = rnn(category_tensor, input[0], hidden)
            prob_list = output.tolist()[0]
            topv, topi = output.topk(5)
            prob_list = softmax([prob_list[i] for i in topi[0]])
            index = np.random.choice([i for i in range(len(prob_list))], p=prob_list)
            topi = topi[0][index]
            if topi == n_letters - 1:
                break
            else:
                letter = all_letters[topi]
                output_name += letter
            input = inputTensor(letter)

        return output_name

# Get multiple samples from one category and multiple starting letters
def samples(category, start_letters='ABC'):
    for start_letter in start_letters:
        print(sample(category, start_letter))
    print("\n")

rnn.eval()

samples('Russian', 'CYY')

samples('German', 'CCY')

samples('Spanish', 'CCZZYY')

samples('Chinese', 'CCCZZZYYY')

samples('Chinese', 'LS')

samples('Chinese', 'LS')
Chovin
Yinkiven
Yantev


Counerre
Chenentrri
Yuner


Cheilleni
Cherer
Zollus
Zouna
Yanterara
Yenseraz


Chen
Chening
Cho
Zin
Zia
Ziang
Yaing
You
Ying


Lenin
Sha


Ling
Shan