[target detection – R-CNN, Fast R-CNN, Faster R-CNN] https://www.cnblogs.com/yanghailin/p/14767995.html
[target detection SSD] https://www.cnblogs.com/yanghailin/p/14769384.html
[detailed explanation of anchor generation of ssd] https://www.cnblogs.com/yanghailin/p/14868575.html
[detailed explanation of ssd network] https://www.cnblogs.com/yanghailin/p/14871296.html
[detailed explanation of ssd loss] https://www.cnblogs.com/yanghailin/p/14882807.html
Actually, I read through the ssd source code a while ago and wrote the blog posts above. Then after a while I had forgotten almost all of it... So I recently read it again, and this time the source code went much faster.
In particular, I now have a better understanding of some places that were previously ambiguous.
ssd network part
I drew the diagram again...
The diagram above, which I drew based on the code, is consistent with the figure in the paper:
At the end, my loc [bn, 8732, 4]
and conf [bn, 8732, num_classes]
are the outputs of the network.
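As a quick sanity check on the 8732: it is the sum over the six SSD300 feature maps of (map size)² × (default boxes per location), with the sizes and per-location counts taken from the paper:

feature_maps = [38, 19, 10, 5, 3, 1]   # SSD300 feature map sizes
boxes_per_loc = [4, 6, 6, 6, 4, 4]     # default boxes per location on each map
total = sum(f * f * n for f, n in zip(feature_maps, boxes_per_loc))
print(total)  # 8732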
ssd loss part – box encoding and decoding
As for the encoding and decoding, I was confused when I first read it. How should I put it... The formula in the ssd paper is as follows:
Then I found a detailed introduction of this in the context of Fast R-CNN:
https://blog.csdn.net/weixin_42782150/article/details/110522380
To make the predicted box approach the gt, you need to add an offset to the anchor's center point (x, y) and multiply the width and height by a (non-zero!) scale factor so that the adjusted box gets close to the gt; that is where this mathematical formula comes from.
In the Fast R-CNN paper:
You can see that there are two sets of values, t and t*.
I didn't understand this at first, but then it suddenly clicked!
The formula for t* below gives the offset between the gt and the anchor. What the network learns is exactly such an offset t! Look at the loss formula above: it compares t with t*, and the goal is to push the network's output closer to t*.
These t values are what you get after encoding, and the encoding relation is exactly the formula above.
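To make this concrete, here is a sketch of the encode function along the lines of box_utils.encode in the ssd.pytorch code (written from memory, so treat the details as assumptions): it turns the matched gt box and the prior into the t* targets, i.e. center offsets normalized by the prior size plus log ratios of width and height; the "variance" values only rescale the targets.

import torch

def encode(matched, priors, variances):
    # matched: matched gt boxes in (xmin, ymin, xmax, ymax) form, shape [num_priors, 4]
    # priors:  default boxes in (cx, cy, w, h) form, shape [num_priors, 4]
    # variances: e.g. [0.1, 0.2], only rescale the targets
    # center offset of the gt relative to the prior, normalized by the prior size
    g_cxcy = (matched[:, :2] + matched[:, 2:]) / 2 - priors[:, :2]
    g_cxcy /= (variances[0] * priors[:, 2:])
    # log of the size ratio between the gt and the prior
    g_wh = (matched[:, 2:] - matched[:, :2]) / priors[:, 2:]
    g_wh = torch.log(g_wh) / variances[1]
    # the regression targets t* = (t_cx, t_cy, t_w, t_h), shape [num_priors, 4]
    return torch.cat([g_cxcy, g_wh], 1)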
About how the IoU (intersection over union) is computed: I did not see any operation that maps the boxes back to the original image.
And when computing the IoU, I did not see a for loop either.
def point_form(boxes):
    """ Convert prior_boxes to (xmin, ymin, xmax, ymax)
    representation for comparison to point form ground truth data.
    Args:
        boxes: (tensor) center-size default boxes from priorbox layers.
    Return:
        boxes: (tensor) Converted xmin, ymin, xmax, ymax form of boxes.
    """
    return torch.cat((boxes[:, :2] - boxes[:, 2:]/2,      # xmin, ymin
                      boxes[:, :2] + boxes[:, 2:]/2), 1)  # xmax, ymax


def intersect(box_a, box_b):
    """ We resize both tensors to [A,B,2] without new malloc:
    [A,2] -> [A,1,2] -> [A,B,2]
    [B,2] -> [1,B,2] -> [A,B,2]
    Then we compute the area of intersect between box_a and box_b.
    Args:
      box_a: (tensor) bounding boxes, Shape: [A,4].
      box_b: (tensor) bounding boxes, Shape: [B,4].
    Return:
      (tensor) intersection area, Shape: [A,B].
    """
    A = box_a.size(0)
    B = box_b.size(0)
    # n1 = box_a[:, 2:]
    # n1_1 = box_a[:, 2:].unsqueeze(1)
    # n1_2 = box_a[:, 2:].unsqueeze(1).expand(A, B, 2)
    #
    # n2 = box_b[:, 2:]
    # n2_1 = box_b[:, 2:].unsqueeze(0)
    # n2_2 = box_b[:, 2:].unsqueeze(0).expand(A, B, 2)
    #
    # n3 = torch.min(n1_2, n2_2)
    max_xy = torch.min(box_a[:, 2:].unsqueeze(1).expand(A, B, 2),
                       box_b[:, 2:].unsqueeze(0).expand(A, B, 2))
    min_xy = torch.max(box_a[:, :2].unsqueeze(1).expand(A, B, 2),
                       box_b[:, :2].unsqueeze(0).expand(A, B, 2))
    # sub_ = max_xy - min_xy
    inter = torch.clamp((max_xy - min_xy), min=0)
    return inter[:, :, 0] * inter[:, :, 1]


def jaccard(box_a, box_b):
    """Compute the jaccard overlap of two sets of boxes.  The jaccard overlap
    is simply the intersection over union of two boxes.  Here we operate on
    ground truth boxes and default boxes.
    E.g.:
        A ∩ B / A ∪ B = A ∩ B / (area(A) + area(B) - A ∩ B)
    Args:
        box_a: (tensor) Ground truth bounding boxes, Shape: [num_objects,4]
        box_b: (tensor) Prior boxes from priorbox layers, Shape: [num_priors,4]
    Return:
        jaccard overlap: (tensor) Shape: [box_a.size(0), box_b.size(0)]
    """
    inter = intersect(box_a, box_b)
    area_a = ((box_a[:, 2]-box_a[:, 0]) *
              (box_a[:, 3]-box_a[:, 1])).unsqueeze(1).expand_as(inter)  # [A,B]
    area_b = ((box_b[:, 2]-box_b[:, 0]) *
              (box_b[:, 3]-box_b[:, 1])).unsqueeze(0).expand_as(inter)  # [A,B]
    union = area_a + area_b - inter
    return inter / union  # [A,B]


overlaps = jaccard(
    truths,
    point_form(priors)
)
After looking at the code: all coordinates are relative values between 0 and 1. Multiplying these values by the original image size maps them back onto the original image, so working with relative values in the code is equivalent to operating on the original image.
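A tiny check of this (made-up numbers, reusing the jaccard function above): the IoU is the same whether the boxes are in relative coordinates or scaled up to pixel coordinates, because the scale factor cancels between intersection and union.

import torch

box_a = torch.tensor([[0.1, 0.1, 0.5, 0.5]])   # relative (xmin, ymin, xmax, ymax)
box_b = torch.tensor([[0.3, 0.3, 0.7, 0.7]])
print(jaccard(box_a, box_b))                   # ~0.1429
print(jaccard(box_a * 300.0, box_b * 300.0))   # same ~0.1429, as if on a 300x300 image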
Prior box matching rules
When computing the IoU between gt and anchors: an image contains very few gt boxes, typically one or two targets, against 8732 anchors.
The strategy in the code is: first make sure each gt is matched to the anchor it has the largest IoU with, and then let every anchor match the gt it has the largest IoU with!
Look at this figure: first, along each row, find the anchor with the largest IoU for each gt. This has the highest priority, and there is logic in the code to guarantee it is always satisfied.
best_truth_overlap.index_fill_(0, best_prior_idx, 2)  # ensure best prior
# TODO refactor: index best_prior_idx with long tensor
# ensure every gt matches with its prior of max overlap
for j in range(best_prior_idx.size(0)):
    best_truth_idx[best_prior_idx[j]] = j
Then take the maximum along each column, i.e. for each anchor find the gt with which it has the largest IoU. The 8732 anchors are then encoded against their matched gt.
Anchors whose IoU with the gt is below a certain threshold are assigned to the background class.
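Putting the matching rules together, the match function in box_utils looks roughly like this (a sketch from memory of the ssd.pytorch code, so treat names and details as assumptions):

def match(threshold, truths, priors, variances, labels, loc_t, conf_t, idx):
    # IoU between every gt box and every prior, shape [num_objects, num_priors]
    overlaps = jaccard(truths, point_form(priors))
    # for each gt, the prior with the largest IoU (row-wise max)
    best_prior_overlap, best_prior_idx = overlaps.max(1, keepdim=True)
    # for each prior, the gt with the largest IoU (column-wise max)
    best_truth_overlap, best_truth_idx = overlaps.max(0, keepdim=True)
    best_truth_idx.squeeze_(0)
    best_truth_overlap.squeeze_(0)
    best_prior_idx.squeeze_(1)
    best_prior_overlap.squeeze_(1)
    # make sure the best prior for each gt always stays a positive match
    best_truth_overlap.index_fill_(0, best_prior_idx, 2)
    for j in range(best_prior_idx.size(0)):
        best_truth_idx[best_prior_idx[j]] = j
    matches = truths[best_truth_idx]              # matched gt box per prior, [num_priors, 4]
    conf = labels[best_truth_idx] + 1             # class label per prior (0 is background)
    conf[best_truth_overlap < threshold] = 0      # low-IoU priors become background
    loc_t[idx] = encode(matches, priors, variances)  # regression targets t*
    conf_t[idx] = conf                               # classification targets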
I have already analyzed this in my blog post:
https://www.cnblogs.com/yanghailin/p/14882807.html
This article also has a good explanation of prior box matching: https://zhuanlan.zhihu.com/p/33544892
Hard negative mining:
To be honest, there is one passage here that I still don't fully understand, but the design feels very clever! Sorting twice.
# loss_c [26196,1]
# https://zhuanlan.zhihu.com/p/153535799
loss_c = log_sum_exp(batch_conf) - batch_conf.gather(1, conf_t.view(-1, 1))
Here is the code with my detailed comments:
# -*- coding: utf-8 -*-
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
from data import coco as cfg
from ..box_utils import match, log_sum_exp

# criterion = MultiBoxLoss(cfg['num_classes'], 0.5, True, 0, True, 3, 0.5,
#                          False, args.cuda)
class MultiBoxLoss(nn.Module):
    """SSD Weighted Loss Function
    Compute Targets:
        1) Produce Confidence Target Indices by matching ground truth boxes
           with (default) 'priorboxes' that have jaccard index > threshold parameter
           (default threshold: 0.5).
        2) Produce localization target by 'encoding' variance into offsets of ground
           truth boxes and their matched 'priorboxes'.
        3) Hard negative mining to filter the excessive number of negative examples
           that comes with using a large number of default bounding boxes.
           (default negative:positive ratio 3:1)
    Objective Loss:
        L(x,c,l,g) = (Lconf(x, c) + αLloc(x,l,g)) / N
        Where, Lconf is the CrossEntropy Loss and Lloc is the SmoothL1 Loss
        weighted by α which is set to 1 by cross val.
        Args:
            c: class confidences,
            l: predicted boxes,
            g: ground truth boxes
            N: number of matched default boxes
        See: https://arxiv.org/pdf/1512.02325.pdf for more details.
    """

    def __init__(self, num_classes, overlap_thresh, prior_for_matching,
                 bkg_label, neg_mining, neg_pos, neg_overlap, encode_target,
                 use_gpu=True):
        super(MultiBoxLoss, self).__init__()
        self.use_gpu = use_gpu
        self.num_classes = num_classes
        self.threshold = overlap_thresh
        self.background_label = bkg_label
        self.encode_target = encode_target
        self.use_prior_for_matching = prior_for_matching
        self.do_neg_mining = neg_mining
        self.negpos_ratio = neg_pos
        self.neg_overlap = neg_overlap
        self.variance = cfg['variance']

    def forward(self, predictions, targets):
        """Multibox Loss
        Args:
            predictions (tuple): A tuple containing loc preds, conf preds,
                and prior boxes from SSD net.
                conf shape:   torch.size(batch_size, num_priors, num_classes)
                loc shape:    torch.size(batch_size, num_priors, 4)
                priors shape: torch.size(num_priors, 4)
            targets (tensor): Ground truth boxes and labels for a batch,
                shape: [batch_size, num_objs, 5] (last idx is the label).

            loc_data  [3,8732,4]
            conf_data [3,8732,21]
            priors    [8732,4]
        """
        loc_data, conf_data, priors = predictions
        num = loc_data.size(0)                 # num is the batch size
        priors = priors[:loc_data.size(1), :]  # priors [8732,4]
        num_priors = (priors.size(0))          # num_priors = 8732
        num_classes = self.num_classes         # 21

        # match priors (default boxes) and ground truth boxes
        loc_t = torch.Tensor(num, num_priors, 4)    # [3,8732,4]
        conf_t = torch.LongTensor(num, num_priors)  # [3,8732]
        for idx in range(num):
            truths = targets[idx][:, :-1].data
            labels = targets[idx][:, -1].data
            defaults = priors.data
            match(self.threshold, truths, defaults, self.variance, labels,
                  loc_t, conf_t, idx)
        if self.use_gpu:
            loc_t = loc_t.cuda()
            conf_t = conf_t.cuda()
        # Variables with the suffix _t hold the (truth) targets.
        # match() fills loc_t [3,8732,4] and conf_t [3,8732]:
        # loc_t [3,8732,4] holds the encoded offsets between each anchor and its matched gt
        # conf_t [3,8732] holds labels in 0-20; it is 0 (background) almost everywhere and
        # non-zero only where the IoU between the anchor and some gt exceeds the threshold
        # wrap targets
        loc_t = Variable(loc_t, requires_grad=False)    # [3,8732,4]
        conf_t = Variable(conf_t, requires_grad=False)  # [3,8732]

        # pos [3,8732] False/True; important! True marks the positive-sample positions,
        # everything else is False
        # num_pos shape [3,1] | e.g. [23],[4],[9]: number of positive samples per image
        pos = conf_t > 0                        # pos [3,8732] False/True
        num_pos = pos.sum(dim=1, keepdim=True)  # num_pos shape [3,1] | [23],[4],[9]

        # Localization Loss (Smooth L1)
        # pos [3,8732]; pos.dim() = 2; pos.unsqueeze(pos.dim()) -> [3,8732,1]
        # loc_data [3,8732,4]
        # pos_idx [3,8732,4] True/False: the True/False in pos broadcast to [3,8732,4]
        pos_idx = pos.unsqueeze(pos.dim()).expand_as(loc_data)
        # loc_data [3,8732,4]; aa has shape [144]
        aa = loc_data[pos_idx]
        # loc_t [3,8732,4]; bb has shape [144]
        bb = loc_t[pos_idx]
        # pos_idx [3,8732,4] True/False
        # loc_data [3,8732,4]: the network's localization predictions
        # loc_data[pos_idx] shape [144]   (23+4+9)*4 = 144
        # loc_t [3,8732,4]
        # loc_t[pos_idx] shape [144]      (23+4+9)*4 = 144
        loc_p = loc_data[pos_idx].view(-1, 4)
        loc_t = loc_t[pos_idx].view(-1, 4)
        # loss_l tensor(14.0165, grad_fn=<SmoothL1LossBackward>)
        loss_l = F.smooth_l1_loss(loc_p, loc_t, size_average=False)

        # Compute max conf across batch for hard negative mining
        # conf_data [3,8732,21]: the network's class predictions
        # batch_conf [3*8732,21] = [26196,21]
        # tt_1 = batch_conf.max()  # 35
        # tt_2 = batch_conf.min()  # -13
        # Taking the max/min here is just a quick way to get a feel for the values;
        # these are the raw class scores used to predict the label, no softmax applied.
        batch_conf = conf_data.view(-1, self.num_classes)
        b1 = log_sum_exp(batch_conf)                   # [26196,1]
        # conf_t [3,8732]; conf_t.view(-1, 1) [26196,1]
        b2 = batch_conf.gather(1, conf_t.view(-1, 1))  # [26196,1]
        # https://blog.csdn.net/liyu0611/article/details/100547145
        # https://zhuanlan.zhihu.com/p/35709485
        # https://zhuanlan.zhihu.com/p/153535799
        # loss_c [26196,1]
        loss_c = log_sum_exp(batch_conf) - batch_conf.gather(1, conf_t.view(-1, 1))

        # Hard Negative Mining
        # loss_c[pos] = 0  # filter out pos boxes for now
        # loss_c = loss_c.view(num, -1)
        loss_c = loss_c.view(num, -1)  # loss_c [3,8732]
        # pos [3,8732]; loss_c [3,8732]
        loss_c[pos] = 0  # set the loss of the positive samples to 0
        # Sorting twice: https://blog.csdn.net/laizi_laizi/article/details/103482634
        # loss_idx [3,8732]
        tmp1, loss_idx = loss_c.sort(1, descending=True)  # _, loss_idx = loss_c.sort(1, descending=True)
        # idx_rank [3,8732]
        tmp2, idx_rank = loss_idx.sort(1)                 # _, idx_rank = loss_idx.sort(1)
        # pos [3,8732]
        # num_pos shape [3,1] | [23],[4],[9]
        num_pos = pos.long().sum(1, keepdim=True)
        # aaaaa shape [3,1] | [23*3],[4*3],[9*3]
        # aaaaa = self.negpos_ratio * num_pos
        # num_neg shape [3,1] | [23*3],[4*3],[9*3]
        num_neg = torch.clamp(self.negpos_ratio*num_pos, max=pos.size(1)-1)  # [3,1] | [69],[12],[27]
        # num_neg shape [3,1]; idx_rank [3,8732]
        # neg [3,8732] True/False: True where that position's loss_c ranks within the
        # top num_neg when sorted from large to small
        neg = idx_rank < num_neg.expand_as(idx_rank)

        # Confidence Loss Including Positive and Negative Examples
        # pos [3,8732]; pos.unsqueeze(2) [3,8732,1]
        # conf_data [3,8732,21]
        # pos_idx [3,8732,21]
        pos_idx = pos.unsqueeze(2).expand_as(conf_data)   # pos [3,8732], conf_data [3,8732,21]
        # neg [3,8732]; neg_idx [3,8732,21]
        neg_idx = neg.unsqueeze(2).expand_as(conf_data)   # neg [3,8732], conf_data [3,8732,21]
        # tmp_1111 = pos_idx + neg_idx  # [3,8732,21] True/False
        # pos_idx and neg_idx have the same shape [3,8732,21] and hold True/False values;
        # adding them acts like a logical OR: the result is True wherever either one is True
        # tmp_2222 = (pos_idx+neg_idx).gt(0)  # [3,8732,21] True/False
        # conf_data [3,8732,21]
        # tmp_333 = conf_data[(pos_idx + neg_idx).gt(0)]  # [3696]
        # conf_p [176,21]
        # or conf_p [144,21]: 144 is the total number of True entries selected by pos_idx
        # and neg_idx, i.e. 69+12+27 + 23+4+9 = 144
        conf_p = conf_data[(pos_idx+neg_idx).gt(0)].view(-1, self.num_classes)
        # pos [3,8732]; neg [3,8732]; conf_t [3,8732]
        # targets_weighted [176] | [144]
        targets_weighted = conf_t[(pos+neg).gt(0)]
        # loss_c tensor(58.0656, grad_fn=<NllLossBackward>)
        loss_c = F.cross_entropy(conf_p, targets_weighted, size_average=False)

        # Sum of losses: L(x,c,l,g) = (Lconf(x, c) + αLloc(x,l,g)) / N
        N = num_pos.data.sum()  # N = 36, i.e. the sum of num_pos: 23 + 4 + 9
        loss_l /= N
        loss_c /= N
        return loss_l, loss_c
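For reference, a minimal usage sketch (shapes only; the numbers are made up, and it assumes the ssd.pytorch repo layout so that MultiBoxLoss and its data config import correctly):

criterion = MultiBoxLoss(21, 0.5, True, 0, True, 3, 0.5, False, use_gpu=False)
loc = torch.randn(3, 8732, 4)      # fake loc output of the network
conf = torch.randn(3, 8732, 21)    # fake conf output of the network
priors = torch.rand(8732, 4)       # fake priors in (cx, cy, w, h), relative coordinates
# one gt box per image: (xmin, ymin, xmax, ymax, label)
targets = [torch.tensor([[0.1, 0.1, 0.4, 0.4, 14.0]]) for _ in range(3)]
loss_l, loss_c = criterion((loc, conf, priors), targets)
loss = loss_l + loss_c             # total loss to backpropagate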
There are two things to say:
The first one is:
loss_c = log_sum_exp(batch_conf) - batch_conf.gather(1, conf_t.view(-1, 1))
This is actually just cross entropy: log_sum_exp computes the log Σ exp(x_k) term in the formula below, and batch_conf.gather(1, conf_t.view(-1, 1)) picks out x_j, the score of the ground-truth class.
The formula comes from the following blog.
https://blog.csdn.net/liyu0611/article/details/100547145
Why log_sum_exp? Because something like e to the 1000th power would overflow the numeric range. See https://zhuanlan.zhihu.com/p/153535799 for details.
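For reference, the log_sum_exp helper in box_utils is roughly the following (a sketch; the point is the standard trick of subtracting the max before exponentiating):

import torch

def log_sum_exp(x):
    # numerically stable log(sum(exp(x))) over the class dimension:
    # subtracting the max first keeps exp() from overflowing
    x_max = x.data.max()
    return torch.log(torch.sum(torch.exp(x - x_max), 1, keepdim=True)) + x_max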
I just don't understand the following:
loss_c = log_sum_exp(batch_conf) - batch_conf.gather(1, conf_t.view(-1, 1))
loss_c = F.cross_entropy(conf_p, targets_weighted, size_average=False)
What is the difference between these two? Why write the first one at all?
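As far as I can tell from the code, both lines compute exactly the same per-box cross entropy; the difference is what they are used for. The first one is evaluated on all 26196 boxes purely to rank the negatives for hard negative mining (its gradients are never used), while the F.cross_entropy call at the end is the actual confidence loss, computed only on the selected positive and hard-negative boxes. A small check of the equivalence (made-up numbers):

import torch
import torch.nn.functional as F

x = torch.randn(5, 21)                     # 5 boxes, 21 class scores each
t = torch.randint(0, 21, (5,))             # ground-truth labels
manual = torch.logsumexp(x, dim=1) - x.gather(1, t.view(-1, 1)).squeeze(1)
builtin = F.cross_entropy(x, t, reduction='none')
print(torch.allclose(manual, builtin))     # True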
The second one is the two sorts:
https://blog.csdn.net/laizi_laizi/article/details/103482634
The figure there is very clear. The purpose of sorting twice is to conveniently pick out the top-k largest values and get their positions. Very clever!
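A tiny made-up example of the trick: after the two sorts, idx_rank tells you, for every position, where its loss ranks within that row, so comparing the rank against num_neg directly gives a True/False mask over the original positions.

import torch

loss = torch.tensor([[0.3, 0.9, 0.1, 0.5]])
_, loss_idx = loss.sort(1, descending=True)   # loss_idx = [[1, 3, 0, 2]]
_, idx_rank = loss_idx.sort(1)                # idx_rank = [[2, 0, 3, 1]]
neg = idx_rank < 2                            # [[False, True, False, True]]: the top-2 losses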