预备知识:

会使用pytorch搭建简单的cnn
熟悉神经网络的训练过程与优化方法
结合理论课的内容，了解目标检测的几种经典算法(如Faster RCNN/YOLO/SSD)的内容和原理

声明：

本次实验课的代码来源于github上的一个开源项目，链接为:https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection#training
在该项目的基础上，为了便于同学们更好地去理解代码，我们在此基础上对代码做了略微的修改
由于目标检测任务整个代码逻辑比较复杂，需要理解的细节非常多，因此在本次实验课内容设计过程中我们有幸邀请到了李伟鹏同学，他全程参与了课件的制作过程。

网络结构

SSD采用VGG16作为基础模型，然后在VGG16的基础上新增了卷积层来获得更多的特征图以用于检测。SSD的网络结构如图所示。采用VGG16做基础模型，分别将VGG16的全连接层fc6和fc7转换成 $3\times3$ 卷积层 conv6和 $1\times1$ 卷积层conv7，同时将池化层pool5由原来的stride=2的 $2\times 2$ 变成stride=1的 $3\times 3$ （猜想是不想reduce特征图大小），为了配合这种变化，采用了一种Atrous Algorithm，其实就是conv6采用扩展卷积或带孔卷积（Dilation Conv），然后移除dropout层和fc8层，并新增一系列卷积层，在检测数据集上做finetuing。
其中VGG16中的Conv4_3层将作为用于检测的第一个特征图。conv4_3层特征图大小是 $38\times38$ ，但是该层比较靠前，其norm较大，所以在其后面增加了一个L2 Normalization层.

先验框

SSD借鉴了Faster R-CNN中anchor的理念，每个单元设置尺度或者长宽比不同的先验框，预测的边界框（bounding boxes）是以这些先验框为基准的，在一定程度上减少训练难度。一般情况下，每个单元会设置多个先验框，其尺度和长宽比存在差异，如图所示，可以看到每个单元使用了4个不同的先验框，图片中猫和狗分别采用最适合它们形状的先验框来进行训练。

Dataset

目标检测任务的数据集的构成形式与之前学习的分类任务有很大的区别，传统的分类问题的的dataset里面大致包含
[image,label],
由于目标检测既要做检测框的回归任务又要做检测框内物体的分割任务，因此数据集的构成形式大致如下
[{‘boxes’:[[ground_truth坐标1],[ground_truth坐标2]，…]},{‘labels’:[ground_truth标签1，ground_truth标签2，…]}]
由于数据集并没有一个规整的格式，处理此类问题我们通常考虑使用Json文件来做存储

首先要将数据集提供的txt文件转换成.json文件，方便后面的重写的dataset函数load数据 create-data_lists()主要的功能就是将图片和它的ground_truth_box以及box对应的标签连接起来存到json文件中。
__注意：__此函数必须运行一次。

from utils import *
create_data_lists(voc07_path='./data1/VOC2007',output_folder='./json1/')

There are 200 training images containing a total of 600 objects. Files have been saved to /home/jovyan/Week8/json1.

There are 200 validation images containing a total of 600 objects. Files have been saved to /home/jovyan/Week8/json1.

import torch
from torch.utils.data import Dataset
import json
import os
from PIL import Image
from utils import transform


class PascalVOCDataset(Dataset):
    """
    A PyTorch Dataset class to be used in a PyTorch DataLoader to create batches.
    """

    def __init__(self, data_folder, split, keep_difficult=False):
        """
        :param data_folder: folder where data files are stored
        :param split: split, one of 'TRAIN' or 'TEST'
        :param keep_difficult: keep or discard objects that are considered difficult to detect?
        """
        self.split = split.upper()

        assert self.split in {'TRAIN', 'TEST'}

        self.data_folder = data_folder
        self.keep_difficult = keep_difficult

        # Read data files
        with open(os.path.join(data_folder, self.split + '_images.json'), 'r') as j:
            self.images = json.load(j)
        with open(os.path.join(data_folder, self.split + '_objects.json'), 'r') as j:
            self.objects = json.load(j)

        assert len(self.images) == len(self.objects)

    def __getitem__(self, i):
        # Read image
        image = Image.open(self.images[i], mode='r')
        image = image.convert('RGB')

        # Read objects in this image (bounding boxes, labels, difficulties)
        objects = self.objects[i]
        boxes = torch.FloatTensor(objects['boxes'])  # (n_objects, 4)
        labels = torch.LongTensor(objects['labels'])  # (n_objects)
        difficulties = torch.ByteTensor(objects['difficulties'])  # (n_objects)

        # Discard difficult objects, if desired
        if not self.keep_difficult:
            boxes = boxes[1 - difficulties]
            labels = labels[1 - difficulties]
            difficulties = difficulties[1 - difficulties]

        # Apply transformations
        image, boxes, labels, difficulties = transform(image, boxes, labels, difficulties, split=self.split)

        return image, boxes, labels, difficulties

    def __len__(self):
        return len(self.images)

    def collate_fn(self, batch):
        """
        Since each image may have a different number of objects, we need a collate function (to be passed to the DataLoader).

        This describes how to combine these tensors of different sizes. We use lists.

        Note: this need not be defined in this Class, can be standalone.

        :param batch: an iterable of N sets from __getitem__()
        :return: a tensor of images, lists of varying-size tensors of bounding boxes, labels, and difficulties
        """

        images = list()
        boxes = list()
        labels = list()
        difficulties = list()

        for b in batch:
            images.append(b[0])
            boxes.append(b[1])
            labels.append(b[2])
            difficulties.append(b[3])

        images = torch.stack(images, dim=0)

        return images, boxes, labels, difficulties  # tensor (N, 3, 300, 300), 3 lists of N tensors each

重写完dataset函数之后，让我们看看目标检测任务的训练数据具体是以何种形式存储的

data_folder = './json1/'
keep_difficult = True
batch_size = 1
workers = 1


train_dataset = PascalVOCDataset(data_folder,
                                     split='train',
                                     keep_difficult=keep_difficult)
val_dataset = PascalVOCDataset(data_folder,
                                   split='test',
                                   keep_difficult=keep_difficult)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True,
                                               collate_fn=train_dataset.collate_fn, num_workers=workers,
                                               pin_memory=True)  
                                                # note that we're passing the collate function here
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=batch_size, shuffle=True,
                                             collate_fn=val_dataset.collate_fn, num_workers=workers,
                                             pin_memory=True)
    
# for data in train_loader:
#     images, boxes, labels, difficulties = data
#     print('images---->', images)
#     print('boxes---->', boxes)
#     print('labels---->',labels)
#     print('difficulties---->',difficulties)

Loss

ssd的loss分为两部分,置信度误差confidence loss和位置location loss. 其中,confidence loss是对bbox的分类误差,使用cross entropy loss;而location是bbox的位置与ground truth的回归误差,使用smooth l1 loss.

对于location loss, 公式如下图, 其中$g_j^{cx}$, $g_j^{cx}$, $g_j^{w}$, $g_j^{h}$ 是第j个groud truth bbox的4个位置值(中心点x,y坐标以及bbox的宽,高). $d_i^{cx}$, $d_i^{cx}$, $d_i^{w}$, $d_i^{h}$ 则是第i个先验框(prior)的4个位置值(中心点x,y坐标以及bbox的宽,高). 而 $\hat{g}_j^{cx}$, $\hat{g}_j^{cx}$, $\hat{g}_j^{w}$, $\hat{g}_j^{h}$ 是由ground truth bbox j和先验框(prior) i 算出的transform(或者叫offset)值.

$\hat{g}_{j}^{c x}=\left(g_{j}^{c x}-d_{i}^{c x}\right) / d_{i}^{w}, \hat{g}_{j}^{c y}=\left(g_{j}^{c y}-d_{i}^{c y}\right) / d_{i}^{h}$ $\hat{g}_{j}^{w}=\log \left(\frac{g_{j}^{w}}{d_{i}^{w}}\right), \hat{g}_{j}^{h}=\log \left(\frac{g_{j}^{h}}{d_{i}^{h}}\right)$

我们的目的是使得我们的CNN网络学习到这些transform(或者叫offset)值(即让输出的loc值逼近它们), 而当模型训练好后,进行目标检测时,我们只要将CNN输出的loc值与先验框(prior)的位置值做一个decode即可.在decode时, 公式如下,其中对于第i个prior,$d_i^{cx}$, $d_i^{cx}$, $d_i^{w}$, $d_i^{h}$是prior的位置值,$l_{i}^{cx}$, $l_{i}^{cy}$, $l_{i}^{w}$, $l_{i}^{h}$是我们模型输出的transform/offset值, $b_{i}^{cx}$, $b_{i}^{cy}$, $b_{i}^{w}$, $b_{i}^{h}$是我们检测到的物体对应图片的位置值.

$b_{i}^{w}=d_{i}^{w}\exp{(l_{i}^{w})}, b_{i}^{h}=d_{i}^{h}\exp{(l_{i}^{h})}$ $b_{i}^{cx}=d_{i}^{w}l^{cx} + d_{i}^{cx}, b_{i}^{cy}= d_{i}^{h}l^{cy} + d_{i}^{cy}$

location loss的公式如下,其中,$l^{m},m\in{cx, cy, w, h}$表示CNN对于每个先验框输出的loc值, $\hat{g}^{m}$表示由ground truth box j与先验框i算出的transform值. $x_{ij}^k \in {0,1}$是一个指示参数, $x_{ij}^k=1$时表示先验框i与ground truth box j匹配,且ground truth box j的类别为k. 这里使用smooth l1 loss来是模型学习到的loc值逼近由先验框与ground truth box得到的transform值.其中,Pos表示非背景的先验框的集合(计算每个prior与每个ground truth box的IOU,最大的IOU小于某个阈值的prior可以视为Negative(背景), 反之视为Positive(非背景)).

$L_{l o c}(x, l, g)=\sum_{i \in P o s}^{N} \quad \sum_{m \in\{c x, c y, w, h\}} x_{i j}^{k} \operatorname{smooth}_{\mathrm{Ll}}\left(l_{i}^{m}-\hat{g}_{j}^{m}\right)$

对于confidence loss, 如下图, $x_{ij}^p \in {0,1}$是一个指示参数, $x_{ij}^p=1$时表示先验框i与ground truth box j,且ground truth box j的类别为p(即label).这里直接使用cross entropy loss来计算它们的置信度误差. $\vec{c_i}$表示对于先验框i模型输出的(经过softmax)在每个类上的置信度输出.其中,Pos表示非背景的先验框的集合,而Neg表示为背景的先验框的集合. \begin{equation} L_{conf} = \sum_{i \in Pos}x_{ij}^pCrossEntropy(\vec{c_i}, p) + \sum_{i \in Neg}CrossEntropy(\vec{c_i}, 0) \end{equation}

在一般情况下,由于在目标检测中,背景的先验框的数量会远大于有object的先验框的数量,为了解决这个问题,在SSD的代码中使用了hard negative mining.即只选择negative(视为背景的prior)中选择loss值较大的项.

import torch.nn as nn
class MultiBoxLoss(nn.Module):
    """
    The MultiBox loss, a loss function for object detection.

    This is a combination of:
    (1) a localization loss for the predicted locations of the boxes, and
    (2) a confidence loss for the predicted class scores.
    """

    def __init__(self, priors_cxcy, threshold=0.5, neg_pos_ratio=3, alpha=1.):
        super(MultiBoxLoss, self).__init__()
        self.priors_cxcy = priors_cxcy
        self.priors_xy = cxcy_to_xy(priors_cxcy)
        self.threshold = threshold
        self.neg_pos_ratio = neg_pos_ratio
        self.alpha = alpha

        self.smooth_l1 = nn.SmoothL1Loss()
        self.cross_entropy = nn.CrossEntropyLoss(reduce=False)

    def forward(self, predicted_locs, predicted_scores, boxes, labels):
        """
        Forward propagation.

        :param predicted_locs: predicted locations/boxes w.r.t the 8732 prior boxes, a tensor of dimensions (N, 8732, 4)
        :param predicted_scores: class scores for each of the encoded locations/boxes, a tensor of dimensions (N, 8732, n_classes)
        :param boxes: true  object bounding boxes in boundary coordinates, a list of N tensors
        :param labels: true object labels, a list of N tensors
        :return: multibox loss, a scalar
        """
        batch_size = predicted_locs.size(0)
        n_priors = self.priors_cxcy.size(0)
        n_classes = predicted_scores.size(2)

        assert n_priors == predicted_locs.size(1) == predicted_scores.size(1)

        true_locs = torch.zeros((batch_size, n_priors, 4), dtype=torch.float).to(device)  # (N, 8732, 4)
        true_classes = torch.zeros((batch_size, n_priors), dtype=torch.long).to(device)  # (N, 8732)

        # For each image
        for i in range(batch_size):
            n_objects = boxes[i].size(0)

            overlap = find_jaccard_overlap(boxes[i],
                                           self.priors_xy)  # (n_objects, 8732)

            # For each prior, find the object that has the maximum overlap
            overlap_for_each_prior, object_for_each_prior = overlap.max(dim=0)  # (8732)

            # We don't want a situation where an object is not represented in our positive (non-background) priors -
            # 1. An object might not be the best object for all priors, and is therefore not in object_for_each_prior.
            # 2. All priors with the object may be assigned as background based on the threshold (0.5).

            # To remedy this -
            # First, find the prior that has the maximum overlap for each object.
            _, prior_for_each_object = overlap.max(dim=1)  # (N_o)

            # Then, assign each object to the corresponding maximum-overlap-prior. (This fixes 1.)
            object_for_each_prior[prior_for_each_object] = torch.LongTensor(range(n_objects)).to(device)

            # To ensure these priors qualify, artificially give them an overlap of greater than 0.5. (This fixes 2.)
            overlap_for_each_prior[prior_for_each_object] = 1.

            # Labels for each prior
            label_for_each_prior = labels[i][object_for_each_prior]  # (8732)
            # Set priors whose overlaps with objects are less than the threshold to be background (no object)
            label_for_each_prior[overlap_for_each_prior < self.threshold] = 0  # (8732)

            # Store
            true_classes[i] = label_for_each_prior

            # Encode center-size object coordinates into the form we regressed predicted boxes to
            true_locs[i] = cxcy_to_gcxgcy(xy_to_cxcy(boxes[i][object_for_each_prior]), self.priors_cxcy)  # (8732, 4)

        # Identify priors that are positive (object/non-background)
        positive_priors = true_classes != 0  # (N, 8732)

        # LOCALIZATION LOSS

        # Localization loss is computed only over positive (non-background) priors
        loc_loss = self.smooth_l1(predicted_locs[positive_priors], true_locs[positive_priors])  # (), scalar

        # Note: indexing with a torch.uint8 (byte) tensor flattens the tensor when indexing is across multiple dimensions (N & 8732)
        # So, if predicted_locs has the shape (N, 8732, 4), predicted_locs[positive_priors] will have (total positives, 4)

        # CONFIDENCE LOSS

        # Confidence loss is computed over positive priors and the most difficult (hardest) negative priors in each image
        # That is, FOR EACH IMAGE,
        # we will take the hardest (neg_pos_ratio * n_positives) negative priors, i.e where there is maximum loss
        # This is called Hard Negative Mining - it concentrates on hardest negatives in each image, and also minimizes pos/neg imbalance

        # Number of positive and hard-negative priors per image
        n_positives = positive_priors.sum(dim=1)  # (N)
        n_hard_negatives = self.neg_pos_ratio * n_positives  # (N)

        # First, find the loss for all priors
        conf_loss_all = self.cross_entropy(predicted_scores.view(-1, n_classes), true_classes.view(-1))  # (N * 8732)
        conf_loss_all = conf_loss_all.view(batch_size, n_priors)  # (N, 8732)

        # We already know which priors are positive
        conf_loss_pos = conf_loss_all[positive_priors]  # (sum(n_positives))

        # Next, find which priors are hard-negative
        # To do this, sort ONLY negative priors in each image in order of decreasing loss and take top n_hard_negatives
        conf_loss_neg = conf_loss_all.clone()  # (N, 8732)
        conf_loss_neg[positive_priors] = 0.  # (N, 8732), positive priors are ignored (never in top n_hard_negatives)
        conf_loss_neg, _ = conf_loss_neg.sort(dim=1, descending=True)  # (N, 8732), sorted by decreasing hardness
        hardness_ranks = torch.LongTensor(range(n_priors)).unsqueeze(0).expand_as(conf_loss_neg).to(device)  # (N, 8732)
        hard_negatives = hardness_ranks < n_hard_negatives.unsqueeze(1)  # (N, 8732)
        conf_loss_hard_neg = conf_loss_neg[hard_negatives]  # (sum(n_hard_negatives))

        # As in the paper, averaged over positive priors only, although computed over both positive and hard-negative priors
        conf_loss = (conf_loss_hard_neg.sum() + conf_loss_pos.sum()) / n_positives.sum().float()  # (), scalar

        
        return conf_loss + self.alpha * loc_loss

模型训练

目标检测模型的训练过程和分类模型的主要区别体现在loss函数输入的区别，一般的分类模型的loss函数输入的是（预测结果，标签），而ssd算法的loss函数输入的是（预测框的数值，预测分类的分数，ground_truth框，分类标签）
注意:以下代码是简化版的train()函数，省略了对其他数据的一些统计操作，主要是为了让大家理解对比ssd和分类网络训练过程中的异同点。

请勿运行train_model()函数!!!

def train_model(train_loader, model, criterion, optimizer, epoch):
    """
    One epoch's training.

    :param train_loader: DataLoader for training data
    :param model: model
    :param criterion: MultiBox loss
    :param optimizer: optimizer
    :param epoch: epoch number
    """
    model.train()  # training mode enables dropout

    # Batches
    for i, (images, boxes, labels, _) in enumerate(train_loader):

        # Move to default device
        images = images.to(device)  # (batch_size (N), 3, 300, 300)
        boxes = [b.to(device) for b in boxes]
        labels = [l.to(device) for l in labels]

        # Forward prop.
        predicted_locs, predicted_scores = model(images)  # (N, 8732, 4), (N, 8732, n_classes)

        # Loss
        loss = criterion(predicted_locs, predicted_scores, boxes, labels)  # scalar

        # Backward prop.
        optimizer.zero_grad()
        loss.backward()

        # Update model
        optimizer.step()

        # Print status
        if i % print_freq == 0:
            print('Loss {loss.val:.4f} ({loss.avg:.4f})\t'.format( loss=losses))
    # free some memory since their histories may be stored
    del predicted_locs, predicted_scores, images, boxes, labels  

作业：

请补充完整训练过程中缺少的代码(回忆第三周训练一个简单的分类网络的步骤)
补充：
loss函数缺少的参数(阅读上面loss函数的代码，理解需要计算loss需要参数)
反向传播
更新模型的参数

模型的参数初始化

import time
import torch.backends.cudnn as cudnn
import torch.optim
import torch.utils.data
from model import SSD300, MultiBoxLoss
from datasets import PascalVOCDataset
from utils import *
from utils1 import *

data_folder = './json1'  # folder with data files
keep_difficult = True  # use objects considered difficult to detect?

# Model parameters
# Not too many here since the SSD300 has a very specific structure
n_classes = len(label_map)  # number of different types of objects
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Learning parameters
checkpoint = "checkpoint_ssd300.pth.tar"  # path to model checkpoint, None if none
batch_size = 1  # batch size
start_epoch = 0  # start at this epoch
epochs = 5  # number of epochs to run without early-stopping
epochs_since_improvement = 0  # number of epochs since there was an improvement in the validation metric
best_loss = 100.  # assume a high loss at first
workers = 1  # number of workers for loading data in the DataLoader
print_freq = 20  # print training or validation status every __ batches
lr = 1e-3  # learning rate
momentum = 0.9  # momentum
weight_decay = 5e-4  # weight decay
grad_clip = None  # clip if gradients are exploding, which may happen at larger batch sizes (sometimes at 32) - you will recognize it by a sorting error in the MuliBox loss calculation

cudnn.benchmark = True

模型的训练以及评估

此部分是整个项目的主体结构

def main():
    """
    Training and validation.
    """
    global epochs_since_improvement, start_epoch, label_map, best_loss, epoch, checkpoint
    
    optimizer, model = init_optimizer_and_model()
    
    # Move to default device
    model = model.to(device)
    criterion = MultiBoxLoss(priors_cxcy=model.priors_cxcy).to(device)
    
    # Epochs
    for epoch in range(start_epoch, epochs):
        # Paper describes decaying the learning rate at the 80000th, 100000th, 120000th 'iteration', i.e. model update or batch
        # The paper uses a batch size of 32, which means there were about 517 iterations in an epoch
        # Therefore, to find the epochs to decay at, you could do,
        # if epoch in {80000 // 517, 100000 // 517, 120000 // 517}:
        #     adjust_learning_rate(optimizer, 0.1)

        # In practice, I just decayed the learning rate when loss stopped improving for long periods,
        # and I would resume from the last best checkpoint with the new learning rate,
        # since there's no point in resuming at the most recent and significantly worse checkpoint.
        # So, when you're ready to decay the learning rate, just set checkpoint = 'BEST_checkpoint_ssd300.pth.tar' above
        # and have adjust_learning_rate(optimizer, 0.1) BEFORE this 'for' loop

        # One epoch's training
        train(train_loader=train_loader,
              model=model,
              criterion=criterion,
              optimizer=optimizer,
              epoch=epoch)

        # One epoch's validation
        val_loss = validate(val_loader=val_loader,
                            model=model,
                            criterion=criterion)

        # Did validation loss improve?
        is_best = val_loss < best_loss
        best_loss = min(val_loss, best_loss)

        if not is_best:
            epochs_since_improvement += 1
            print("\nEpochs since last improvement: %d\n" % (epochs_since_improvement,))

        else:
            epochs_since_improvement = 0

        # Save checkpoint
        save_checkpoint(epoch, epochs_since_improvement, model, optimizer, val_loss, best_loss, is_best)
        
if __name__ == '__main__':
    main()

Loaded base model.

/opt/conda/lib/python3.6/site-packages/torch/nn/_reduction.py:49: UserWarning: size_average and reduce args will be deprecated, please use reduction='none' instead.
  warnings.warn(warning.format(ret))

Epoch: [0][0/200]	Batch Time 1.336 (1.336)	Data Time 0.145 (0.145)	Loss 21.1961 (21.1961)	
[0/200]	Batch Time 0.127 (0.127)	Loss 4.8454 (4.8454)	

 * LOSS - 13.931

Epoch: [1][0/200]	Batch Time 0.188 (0.188)	Data Time 0.137 (0.137)	Loss 7.1951 (7.1951)	
[0/200]	Batch Time 0.127 (0.127)	Loss 53.9689 (53.9689)	

 * LOSS - 15.314

Epochs since last improvement: 1
Epoch: [2][0/200] Batch Time 0.201 (0.201) Data Time 0.153 (0.153) Loss 58.9795 (58.9795) [0/200] Batch Time 0.134 (0.134) Loss 4.1666 (4.1666)

 * LOSS - 14.845

Epochs since last improvement: 2
Epoch: [3][0/200] Batch Time 0.185 (0.185) Data Time 0.139 (0.139) Loss 3.7385 (3.7385) [0/200] Batch Time 0.123 (0.123) Loss 41.0693 (41.0693)

 * LOSS - 19.491

Epochs since last improvement: 3
Epoch: [4][0/200] Batch Time 0.192 (0.192) Data Time 0.148 (0.148) Loss 4.8312 (4.8312) [0/200] Batch Time 0.123 (0.123) Loss 4.1246 (4.1246)

 * LOSS - 12.118

目标检测

将训练好的模型用以检测图片中的物体并分类，用bounding_box显示出修改img_path变量，改变要检测的图片
测试集可用的图片从./json1/TEST_images.json

from detect import *
from PIL import Image
from torchvision import transforms
from matplotlib import pyplot as plt
%matplotlib inline

if __name__ == '__main__':
    img_path = './data1/VOC2007/JPEGImages/000220.jpg'
    original_image = Image.open(img_path, mode='r')
    original_image = original_image.convert('RGB')
    img = detect(original_image, min_score=0.2, max_overlap=0.1, top_k=200)
    plt.imshow(img)
    plt.show()

作业：

查看源码中各个参数的具体含义之后，尝试修改 min_score , max_overlap, top_k 三个参数值,分析改动三个参数之后检测结果的变化。

Each frame is a cluster. Each cluster is trained with a threshold min_score. If this threshold is too large, each pixel is classified as a unique cluster, resulting in no frames. If this threshold is too small, each pixel is over-classified, resulting in many frames.
Two frames may have overlap parts, which is determined by max_overlap. Note that max_overlap is useful only if you have larger than 2 frames. If max_overlap is too large, overlap in frames is relatively lenient - two frames may be even same if you set max_overlap = 1.. When max_overlap is small, frames are more independent.
The top_k defines the number of frames. If you only have one cluster, a huge top_k is of no use. But smaller top_k can compensate the inaccuracy of classification by min_score.

答：

__min_score__是一个识别框被认为是一个类，然后显示出来的最小阈值，如果调小的话，会出现很多错误的识别框，但是如果调高的话，则连正确的识别框都不会出现。
__max_overlap__代表着两个识别框之间可以有的最大重叠，如果调小的话，会发现图中有两个识别框，但其实其中它们表达的意思重叠了。
__top_k__代表着显示识别框的数量，由于图中原本只有一个识别框，只要这个值不调为0，则不会有任何影响。

模型评估

计算模型分类的准确率，由于我们的数据集做了删减，只使用了VOC中的两类，因此只有两类会有准确率，其他类准确率为0

from eval import *
if __name__ == '__main__':
    evaluate(test_loader, model)

{'aeroplane': 0.0,
 'bicycle': 0.0,
 'bird': 0.0,
 'boat': 0.0,
 'bottle': 0.0,
 'bus': 0.0,
 'car': 0.162913516163826,
 'cat': 0.0,
 'chair': 0.0,
 'cow': 0.0,
 'diningtable': 0.0,
 'dog': 0.0,
 'horse': 0.0,
 'motorbike': 0.0,
 'person': 0.0,
 'pottedplant': 0.0,
 'sheep': 0.0,
 'sofa': 0.0,
 'train': 0.0,
 'tvmonitor': 0.0}

Mean Average Precision (mAP): 0.008

关于代码部分的补充

以上代码是从源码中提取出来，并且做了一些必要的修改之后的内容，主要是为了能够将目标检测任务的训练，评估，检测的过程以较为清晰的逻辑结构展示给大家。如果同学们已经基本掌握了以上内容，以下则是源码以.py文件的正确运行方式。
__注意:__请勿在实验课上运行以下代码，因为源码内的参数是训练这个模型的默认参数，使用这组参数可以训练出一较为理想的检测模型，但是会占用大量GPU资源，如果同学们有自己的gpu，可以课后在自己的设备上运行，课上请勿使用该代码浪费GPU资源，谢谢配合。

import train

# train model
# Setting the parameters you want in train.py file
train.main()

import detect
# detect
# Setting the parameters you want in detect.py file
detect.main()

Loaded checkpoint from epoch 11. Best loss so far is 5.796.

<Figure size 640x480 with 1 Axes>

import eval

# evaluate the model
eval.main()

课后阅读部分

此部分是提供想要进一步了解ssd算法，有兴趣做目标检测任务的同学一些在代码方面更详细的解释，由于目标检测任务过程中会使用到非常多的数据处理，统计的工具函数，而此部分内容又基本都放在了utils.py文件中，因此我们将该文件做了一个大致的介绍，并且挑选出比较重要的部分为大家详细解释。

utils.py文件解读

包括的函数有parse_annotation(), create_data_lists(), decimate(), calculate_mAP(), xy_to_cxcy(), cxcy_to_xy(), cxcy_to_gcxgcy(), gcxgcy_to_cxcy(), find_intersection(), find_jaccard_overlap(), expand(), random_crop(), flip(), resize(), photometric_distort(), transform(), adjust_learning_rate(), accuracy(), save_checkpoint(), clip_gradient()等。
其中有些函数这里就不详细讲了希望大家有兴趣的可以课下仔细阅读下，比如用于图像数据增强的函数expand(), random_crop(), flip(), resize(), photometric_distort(), transform()，这些函数不仅可以用于目标检测还可以用在分类等其他领域。
adjust_learning_rate(), accuracy(), save_checkpoint(), clip_gradient()等函数则是用于常规深度学习工具函数，这里也不再详细介绍。
parse_annotation()函数主要是辅助create_data_lists()这个函数完成VOC2007数据集XML文件解析的，而create_data_lists()函数则是解析原始VOC2007数据集生成对应实际训练中载入的Json文件即TRAIN_images.json, TRAIN_objects.json, TEST_images.json, TEST_objects.json, lable_map.json。
decimate()函数主要是在进行全连接层转化为卷积的时候进行间隔抽样，以达成空洞卷积的目的。
calculate_mAP()函数是计算mAP即Mean Average Precision，这一指标是近年来用来衡量目标检测算法性能的重要指标，它的核心原理如下：
- 将所有的detection_box按detection_score进行排序
- 计算每个detection_box与所有groundtruth_box的IOU
- 取IOU最大(max_IOU)的groundtruth_box作为这个detection_box的预测结果是否正确的判断依据，然后根据max_IOU的结果判断预测结果是TP还是FP进而画出PR曲线，最后再在每一类上做平均得到mAP。
- 一个不错的深入了解链接.mAP详解
find_intersection()以及find_jaccard_overlap()函数即为计算IoU的，这个在之前分割实验课上有介绍过。其实现如下：

def find_intersection(set_1, set_2):
    """
    Find the intersection of every box combination between two sets of boxes that are in boundary coordinates.

    :param set_1: set 1, a tensor of dimensions (n1, 4)
    :param set_2: set 2, a tensor of dimensions (n2, 4)
    :return: intersection of each of the boxes in set 1 with respect to each of the boxes in set 2, a tensor of dimensions (n1, n2)
    """

    # PyTorch auto-broadcasts singleton dimensions
    lower_bounds = torch.max(set_1[:, :2].unsqueeze(1), set_2[:, :2].unsqueeze(0))  # (n1, n2, 2)
    upper_bounds = torch.min(set_1[:, 2:].unsqueeze(1), set_2[:, 2:].unsqueeze(0))  # (n1, n2, 2)
    intersection_dims = torch.clamp(upper_bounds - lower_bounds, min=0)  # (n1, n2, 2)
    return intersection_dims[:, :, 0] * intersection_dims[:, :, 1]  # (n1, n2)

计算完相交的部分后，计算IoU便比较简单，只需要用相交部分除以相并的部分

def find_jaccard_overlap(set_1, set_2):
    """
    Find the Jaccard Overlap (IoU) of every box combination between two sets of boxes that are in boundary coordinates.

    :param set_1: set 1, a tensor of dimensions (n1, 4)
    :param set_2: set 2, a tensor of dimensions (n2, 4)
    :return: Jaccard Overlap of each of the boxes in set 1 with respect to each of the boxes in set 2, a tensor of dimensions (n1, n2)
    """

    # Find intersections
    intersection = find_intersection(set_1, set_2)  # (n1, n2)

    # Find areas of each box in both sets
    areas_set_1 = (set_1[:, 2] - set_1[:, 0]) * (set_1[:, 3] - set_1[:, 1])  # (n1)
    areas_set_2 = (set_2[:, 2] - set_2[:, 0]) * (set_2[:, 3] - set_2[:, 1])  # (n2)

    # Find the union
    # PyTorch auto-broadcasts singleton dimensions
    union = areas_set_1.unsqueeze(1) + areas_set_2.unsqueeze(0) - intersection  # (n1, n2)

    return intersection / union  # (n1, n2)

非极大值抑制（Non-Maximum Suppression，NMS）

NMS是目标检测的重要算法，它的作用是用来去掉模型预测后的多余框。如下图所示:

NMS算法处理前
NMS算法处理后
算法流程
- 设定一个阈值IOU假设为0.5，选取每一类box中scores最大的那一个，记为box_best，并保留它
- 计算box_best与其余的box的IOU，如果其IOU>0.5了，那么就舍弃这个box（由于可能这两个box表示同一目标，所以保留分数高的哪一个）
- 从最后剩余的boxes中，再找出最大scores的哪一个，如此循环往复
一个简单的例子
- 比如现在滑动窗口有：A、B、C、D、E、F、G、H、I、J个，假设A是得分最高的，IOU＞0.7淘汰。第一轮：与A计算IOU，BEG＞0.7，剔除，剩余CDFHIJ 第二轮：假设CDFHIJ中F得分最高，与F计算IOU，DHI＞0.7，剔除，剩余CJ 第三轮：假设CJ中C得分最高，J与C计算IOU，若结果＞0.7，则AFC就是选择出来的窗口。
SSD中非极大值抑制的实现

def NMS(n_classes, predicted_scores, min_score, decoded_locs, max_overlap, image_boxes,  
        image_labels, image_scores):
    for c in range(1, n_classes):
        # Keep only predicted boxes and scores where scores for this class are above the minimum score
        class_scores = predicted_scores[i][:, c]  # (8732)
        score_above_min_score = class_scores > min_score  # torch.uint8 (byte) tensor, for indexing
        n_above_min_score = score_above_min_score.sum().item()
        if n_above_min_score == 0:
            continue
        class_scores = class_scores[score_above_min_score]  # (n_qualified), n_min_score <= 8732
        class_decoded_locs = decoded_locs[score_above_min_score]  # (n_qualified, 4)

        # Sort predicted boxes and scores by scores
        class_scores, sort_ind = class_scores.sort(dim=0, descending=True)  # (n_qualified), (n_min_score)
        class_decoded_locs = class_decoded_locs[sort_ind]  # (n_min_score, 4)

        # Find the overlap between predicted boxes
        overlap = find_jaccard_overlap(class_decoded_locs, class_decoded_locs)  # (n_qualified, n_min_score)

        # Non-Maximum Suppression (NMS)

        # A torch.uint8 (byte) tensor to keep track of which predicted boxes to suppress
        # 1 implies suppress, 0 implies don't suppress
        suppress = torch.zeros((n_above_min_score), dtype=torch.uint8).to(device)  # (n_qualified)

        # Consider each box in order of decreasing scores
        for box in range(class_decoded_locs.size(0)):
            # If this box is already marked for suppression
            if suppress[box] == 1:
                continue

            # Suppress boxes whose overlaps (with this box) are greater than maximum overlap
            # Find such boxes and update suppress indices
            suppress = torch.max(suppress, overlap[box] > max_overlap)
            # The max operation retains previously suppressed boxes, like an 'OR' operation

            # Don't suppress this box, even though it has an overlap of 1 with itself
            suppress[box] = 0

        # Store only unsuppressed boxes for this class
        image_boxes.append(class_decoded_locs[1 - suppress])
        image_labels.append(torch.LongTensor((1 - suppress).sum().item() * [c]).to(device))
        image_scores.append(class_scores[1 - suppress])

课后作业（不要求提交，有兴趣的同学可以去尝试一下）

实现并分析NMS算法使用VOC2007或者VOC2012数据集中的任意一张图像，自己模拟目标检测中NMS函数的输入，即输入自己人工设置一些detection box（数量大于5，类别数大于等于2）的位置和大小同时给定任意的score。实现NMS算法的函数，并可视化NMS算法处理效果如上面的示例图。

目标检测

深度学习笔记（九）

预备知识:

声明：

网络结构

先验框

Dataset

Loss

模型训练

作业：

模型的参数初始化

模型的训练以及评估

目标检测

作业：

答：

模型评估

关于代码部分的补充

课后阅读部分

utils.py文件解读

非极大值抑制（Non-Maximum Suppression，NMS）

课后作业（不要求提交，有兴趣的同学可以去尝试一下）

CATALOG

FEATURED TAGS

FRIENDS