Dissecting the construction of a one-stage detection model in mmdetection, module by module (taking FCOS as an example)

Introduction

For me, the most important part of learning mmdetection is learning how to DIY your own model, so it is essential to understand how a classic model is built step by step. This chapter starts from scratch and deconstructs how a one-stage detection model is assembled in mmdetection, taking FCOS as an example. The FCOS-specific code is explained and commented in detail, so readers who, like me, want to DIY their own models can read it carefully~

Composition of a one-stage detection model

A one-stage detector densely predicts object locations and categories directly on the extracted feature maps. Its composition is relatively simple; the key components are the backbone, the neck and the bbox_head:

Generally speaking, different one-stage detectors differ mainly in two parts, the bbox_head and the loss function, which are also the focus of this chapter.

Construction of the FCOS model in mmdet

Method introduction

Paper: FCOS: Fully Convolutional One-Stage Object Detection

(figure above from the original paper)
FCOS adopts a fully convolutional network structure and adds an extra center-ness prediction branch alongside the classification branch of the detection head to predict how far a location deviates from the target center, which improves the quality of the predicted boxes. It is a very interesting piece of work; readers interested in the details of the model can read the original paper. This article mainly discusses how FCOS is implemented in mmdetection.
Returning to the "three elements" of the one-stage detector structure mentioned above: FCOS has no special requirements on the backbone, the neck uses the feature pyramid network (FPN) most commonly used in detection tasks, and the innovation is mainly reflected in the bbox_head.

General construction process of an mmdet model

In mmdetection/tools/train.py, the model is constructed as follows:

from mmdet.models import build_detector

# As you can see, the model is built from three config dicts:
# cfg.model, cfg.train_cfg and cfg.test_cfg
model = build_detector(
    cfg.model,
    train_cfg=cfg.get('train_cfg'),
    test_cfg=cfg.get('test_cfg'))
# Initialize the model parameters
model.init_weights()

Next, let's look at the FCOS configuration file; the important parts are annotated below:

"""model part"""
model = dict(
	"""type Indicates the class of the model(class)"""
    type='FCOS',
    """appoint backbone,backbone It is generally the most flexible part of the model,It can be easily replaced with other networks"""
    backbone=dict(
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        frozen_stages=1,
        norm_cfg=dict(type='BN', requires_grad=False),
        norm_eval=True,
        style='caffe',
        init_cfg=dict(
            type='Pretrained',
            checkpoint='open-mmlab://detectron/resnet50_caffe')),
    """Use here FPN As neck,And specified FPN Number of input and output channels and whether to use relu Equal parameters"""
    neck=dict(
        type='FPN',
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        start_level=1,
        add_extra_convs='on_output',  # use P5
        num_outs=5,
        relu_before_extra_convs=True),
    """It's specified here bbox_head,Can see type by'FCOShead',yes FCOS Core components of"""
    bbox_head=dict(
        type='FCOSHead',
        num_classes=80,
        in_channels=256,
        stacked_convs=4,
        feat_channels=256,
        strides=[8, 16, 32, 64, 128],
        """loss Attributable to bbox_head part,The loss function for detecting the first three branches is specified here"""
        loss_cls=dict(
            type='FocalLoss',
            use_sigmoid=True,
            gamma=2.0,
            alpha=0.25,
            loss_weight=1.0),
        loss_bbox=dict(type='IoULoss', loss_weight=1.0),
        loss_centerness=dict(
            type='CrossEntropyLoss', use_sigmoid=True, loss_weight=1.0)),
        """Configuration part of training and testing"""
        train_cfg=dict(
        assigner=dict(
            type='MaxIoUAssigner',
            pos_iou_thr=0.5,
            neg_iou_thr=0.4,
            min_pos_iou=0,
            ignore_iof_thr=-1),
        allowed_border=-1,
        pos_weight=-1,
        debug=False),
    test_cfg=dict(
        nms_pre=1000,
        min_bbox_size=0,
        score_thr=0.05,
        nms=dict(type='nms', iou_threshold=0.5),
        max_per_img=100))

The figure below shows the basic structure of the configuration file; FCOSHead is the key part of the model.
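
In practice you rarely write such a config from scratch: you inherit from an existing config via _base_ and override only what changes. A minimal hedged sketch (the base file name is assumed to follow the official FCOS config naming):

_base_ = './fcos_r50_caffe_fpn_gn-head_1x_coco.py'
# Override only the fields that differ, e.g. the number of classes
# for a hypothetical 10-class custom dataset
model = dict(bbox_head=dict(num_classes=10))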

Next, look at the detector build function build_detector():

import warnings

from mmcv.cnn import MODELS as MMCV_MODELS
from mmcv.utils import Registry

MODELS = Registry('models', parent=MMCV_MODELS)
DETECTORS = MODELS

def build_detector(cfg, train_cfg=None, test_cfg=None):
    """Build detector."""
    if train_cfg is not None or test_cfg is not None:
        warnings.warn(
            'train_cfg and test_cfg is deprecated, '
            'please specify them in model', UserWarning)
    assert cfg.get('train_cfg') is None or train_cfg is None, \
        'train_cfg specified in both outer field and model field '
    assert cfg.get('test_cfg') is None or test_cfg is None, \
        'test_cfg specified in both outer field and model field '
    return DETECTORS.build(
        cfg, default_args=dict(train_cfg=train_cfg, test_cfg=test_cfg))

DETECTORS is an mmcv Registry instance, and DETECTORS.build(cfg) instantiates a model according to the model class given in cfg (the 'type' key). In this example type='FCOS', so this function builds and returns an FCOS model.
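
To make the Registry mechanism concrete, here is a toy sketch (ToyDetector is made up for illustration; only Registry itself is real mmcv API):

from mmcv.utils import Registry

# A standalone toy registry, unrelated to mmdet's DETECTORS
TOY_MODELS = Registry('toy_models')

@TOY_MODELS.register_module()
class ToyDetector:
    def __init__(self, depth=50):
        self.depth = depth

# build() looks up the class via the 'type' key and instantiates it with
# the remaining keys as keyword arguments
model = TOY_MODELS.build(dict(type='ToyDetector', depth=101))
print(type(model).__name__, model.depth)  # ToyDetector 101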

FCOS

In mmdetection/mmdet/models/detectors/fcos.py we find the definition of the FCOS class:

# Copyright (c) OpenMMLab. All rights reserved.
from ..builder import DETECTORS
from .single_stage import SingleStageDetector

@DETECTORS.register_module()
class FCOS(SingleStageDetector):
    """Implementation of `FCOS <https://arxiv.org/abs/1904.01355>`_"""
    def __init__(self,
                 backbone,
                 neck,
                 bbox_head,
                 train_cfg=None,
                 test_cfg=None,
                 pretrained=None,
                 init_cfg=None):
        super(FCOS, self).__init__(backbone, neck, bbox_head, train_cfg,
                                   test_cfg, pretrained, init_cfg)

Perhaps surprisingly, the definition of FCOS is very simple: it just inherits from SingleStageDetector and passes the arguments through.
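
Because FCOS only changes the head, creating your own one-stage detector can be just as thin. A hedged sketch (MyDetector is a made-up name; the import paths assume the mmdet 2.x layout):

from mmdet.models.builder import DETECTORS
from mmdet.models.detectors.single_stage import SingleStageDetector

@DETECTORS.register_module()
class MyDetector(SingleStageDetector):
    """A DIY one-stage detector that reuses the SingleStageDetector logic;
    the actual behavior is determined by the backbone/neck/bbox_head configs."""
    def __init__(self, backbone, neck, bbox_head, train_cfg=None,
                 test_cfg=None, pretrained=None, init_cfg=None):
        super(MyDetector, self).__init__(backbone, neck, bbox_head,
                                         train_cfg, test_cfg, pretrained,
                                         init_cfg)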

SingleStageDetector

Let's take a look at SingleStageDetector; again, the key parts are annotated:

@DETECTORS.register_module()
class SingleStageDetector(BaseDetector):
    """Base class of one-stage detectors.

    A one-stage detector performs dense bounding box prediction directly
    on the output of backbone+neck.
    """

    def __init__(self,
                 backbone,
                 neck=None,
                 bbox_head=None,
                 train_cfg=None,
                 test_cfg=None,
                 pretrained=None,
                 init_cfg=None):
        super(SingleStageDetector, self).__init__(init_cfg)
        if pretrained:
            warnings.warn('DeprecationWarning: pretrained is deprecated, '
                          'please use "init_cfg" instead')
            backbone.pretrained = pretrained
        # Build the backbone; build_backbone works just like build_detector
        self.backbone = build_backbone(backbone)
        if neck is not None:
            # Build the neck; in detection this is usually FPN or a variant
            self.neck = build_neck(neck)
        bbox_head.update(train_cfg=train_cfg)
        bbox_head.update(test_cfg=test_cfg)
        # Build the bbox_head
        self.bbox_head = build_head(bbox_head)
        self.train_cfg = train_cfg
        self.test_cfg = test_cfg

    def extract_feat(self, img):
        """Extract features with the backbone and neck."""
        x = self.backbone(img)
        if self.with_neck:
            x = self.neck(x)
        return x

    def forward_dummy(self, img):
        """Forward pass: x ---> backbone+neck ---> feat ---> bbox_head ---> outs"""
        x = self.extract_feat(img)
        outs = self.bbox_head(x)
        return outs

    def forward_train(self,
                      img,
                      img_metas,
                      gt_bboxes,
                      gt_labels,
                      gt_bboxes_ignore=None):
        """
        Args:
            img (Tensor): Input images of shape (N, C, H, W),
                usually normalized.
            img_metas (list[dict]): List of dicts holding meta information
                such as 'img_shape', 'scale_factor', 'flip', 'filename'
                and 'ori_shape'.
            gt_bboxes (list[Tensor]): Ground-truth bounding boxes in
                (xmin, ymin, xmax, ymax) format.
            gt_labels (list[Tensor]): Class labels of the bounding boxes.
            gt_bboxes_ignore (None | list[Tensor]): Bounding boxes that can
                be ignored when computing the loss.

        Returns:
            dict[str, Tensor]: A dictionary of loss components.
        """
        super(SingleStageDetector, self).forward_train(img, img_metas)

        # Extract features and obtain the losses from bbox_head's
        # forward_train function
        x = self.extract_feat(img)
        losses = self.bbox_head.forward_train(x, img_metas, gt_bboxes,
                                              gt_labels, gt_bboxes_ignore)
        return losses

The above is the training part of SingleStageDetector; below is the testing part:

    def simple_test(self, img, img_metas, rescale=False):
        """Simple test function without test-time augmentation.

        Args:
            rescale (bool, optional): Whether to rescale the results back to
                the original image size. Defaults to False.
        Returns:
            list[list[np.ndarray]]: Detection results for each class in each
            image; the first list dimension indexes images, the second
            indexes classes.
        """
        # Extract features and run bbox_head's simple_test function
        feat = self.extract_feat(img)
        results_list = self.bbox_head.simple_test(
            feat, img_metas, rescale=rescale)
        # Convert the detections to the result format with bbox2result
        bbox_results = [
            bbox2result(det_bboxes, det_labels, self.bbox_head.num_classes)
            for det_bboxes, det_labels in results_list
        ]
        return bbox_results

    def aug_test(self, imgs, img_metas, rescale=False):
        """Basically the same as simple_test, except that test-time
        augmentation is used.
        """
        assert hasattr(self.bbox_head, 'aug_test'), \
            f'{self.bbox_head.__class__.__name__}' \
            ' does not support test-time augmentation'

        feats = self.extract_feats(imgs)
        # Run bbox_head's aug_test function
        results_list = self.bbox_head.aug_test(
            feats, img_metas, rescale=rescale)
        bbox_results = [
            bbox2result(det_bboxes, det_labels, self.bbox_head.num_classes)
            for det_bboxes, det_labels in results_list
        ]
        return bbox_results

From the SingleStageDetector code we can see that key steps such as forward propagation and testing all delegate to the training and testing functions of bbox_head, which shows that for a one-stage detector the bbox_head is the most important part.
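
Since bbox2result appears in both test functions, a quick hedged illustration of what it does may help (the box values are made up):

import torch
from mmdet.core import bbox2result

# Three hypothetical detections: (x1, y1, x2, y2, score) plus class labels
det_bboxes = torch.tensor([[10., 10., 50., 50., 0.9],
                           [20., 20., 60., 60., 0.8],
                           [ 5.,  5., 15., 15., 0.7]])
det_labels = torch.tensor([0, 2, 0])

results = bbox2result(det_bboxes, det_labels, num_classes=3)
# results is a list of length num_classes; results[c] is a (k, 5) array
# holding the boxes predicted for class c
for c, res in enumerate(results):
    print(c, res.shape)  # 0 (2, 5) / 1 (0, 5) / 2 (1, 5)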

Backbone and Neck

Here is just a brief introduction to the backbone and the neck.

backbone: ResNet

The backbone network generally consists of many convolution, pooling, normalization and activation layers and is used for image feature extraction. FCOS uses the classic ResNet-50 as its backbone. In mmdetection, the backbone, like every other component of a detection model, is written as a class and registered:

"""
codes from mmdet/models/bakcbones/resnet.py
"""
@BACKBONES.register_module()
class ResNet(BaseModule):
    """ResNet backbone."""

ResNet is very simple to use: it accepts image input of shape (N, C, H, W) and returns the outputs of four stages. Here is an official example:

from mmdet.models import ResNet
import torch

m = ResNet(18)
m.eval()

input = torch.rand((1, 3, 32, 32))
level_outputs = m.forward(input)
for level_output in level_outputs:
    print(level_output.shape)

The results obtained are:

torch.Size([1, 64, 8, 8])
torch.Size([1, 128, 4, 4])
torch.Size([1, 256, 2, 2])
torch.Size([1, 512, 1, 1])

neck: FPN

The feature pyramid network (FPN) is the most commonly used neck in detection tasks; it mainly performs multi-scale feature fusion:

@NECKS.register_module()
class FPN(BaseModule):

FPN accepts multiple inputs at different scales and returns the same number of multi-scale features, all with the same number of channels:

from mmdet.models import ResNet
from mmdet.models import FPN
import torch

m = ResNet(18)
m.eval()

input = torch.rand((1, 3, 32, 32))
level_outputs = m.forward(input)

in_channels = [64, 128, 256, 512]

fpn = FPN(in_channels=in_channels, out_channels=256,
          num_outs=len(level_outputs))
fpn.eval()
outputs = fpn.forward(level_outputs)

for output in outputs:
    print(output.shape)

The results of the above test code are:

torch.Size([1, 256, 8, 8])
torch.Size([1, 256, 4, 4])
torch.Size([1, 256, 2, 2])
torch.Size([1, 256, 1, 1])

As you can see, the features at every scale are mapped to 256 channels.

bbox_head

Finally, the highlight of this article: a detailed analysis of the one-stage detection head in mmdetection.
First, note that in mmdetection the FCOS detection head follows the inheritance chain BaseDenseHead -> AnchorFreeHead -> FCOSHead:

BaseDenseHead

This class is the base class of dense detection heads. Every head that predicts directly on feature maps (including those of one-stage detectors and RPN) ultimately inherits from it. Let's look at the code with comments; see mmdet/models/dense_heads/base_dense_head.py for the source:

"""This category is abstract"""
class BaseDenseHead(BaseModule, metaclass=ABCMeta):
    """Base class for DenseHeads."""

    def __init__(self, init_cfg=None):
        super(BaseDenseHead, self).__init__(init_cfg)
	"""This category contains loss and get_bboxes Two abstract methods, subclasses must be replicated"""
    @abstractmethod
    def loss(self, **kwargs):
        """Calculate the loss of the detection head."""
        pass

    @abstractmethod
    def get_bboxes(self, **kwargs):
        """Convert the batch output of the test head to frame prediction."""
        pass

    def forward_train(self,
                      x,
                      img_metas,
                      gt_bboxes,
                      gt_labels=None,
                      gt_bboxes_ignore=None,
                      proposal_cfg=None,
                      **kwargs):
        """
        Args:
            x (list[Tensor]): Different scale inputs from the feature pyramid.
            img_metas (list[dict]): Meta information of picture.
            gt_bboxes (Tensor): Bounding box true value, shape(N,4)
            gt_labels (Tensor): The real category label of the bounding box in the shape of(N,)
            gt_bboxes_ignore (Tensor): Ground truth bboxes to be
                ignored, shape (num_ignored_gts, 4).
            proposal_cfg (mmcv.Config): Test / postprocessing configuration,
                if None, test_cfg would be used

        Returns:
            tuple:
                losses: (dict[str, Tensor]): Loss function dictionary.
                proposal_list (list[Tensor]): Generated on each picture proposal.
        """
        outs = self(x)
        """If given gt_labels Add the category label to the calculation of the loss function"""
        if gt_labels is None:
            loss_inputs = outs + (gt_bboxes, img_metas)
        else:
            loss_inputs = outs + (gt_bboxes, gt_labels, img_metas)
        """Calculation loss function"""
        losses = self.loss(*loss_inputs, gt_bboxes_ignore=gt_bboxes_ignore)
        # Returns the loss function dictionary and, if necessary, the generated proposal
        if proposal_cfg is None:
            return losses
        else:
            proposal_list = self.get_bboxes(*outs, img_metas, cfg=proposal_cfg)
            return losses, proposal_list
    def simple_test(self, feats, img_metas, rescale=False):
        """Simple test function.
        Returns:
            list[tuple[Tensor, Tensor]]: Each item in the list is a tuple of length 2. First tensor in tuple
            Shape is(N,5), Dimension 5 represents (tl_x, tl_y, br_x, br_y, score).The second tensor is the category prediction result,
            The shape of its tensor is(N,)
        """  
        return self.simple_test_bboxes(feats, img_metas, rescale=rescale)

BaseDenseHead is an abstract class; its subclasses must override the two abstract methods loss() and get_bboxes().
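
As a hedged sketch of what such a subclass must provide (ToyHead and its bodies are made up purely for illustration):

from mmdet.models.dense_heads.base_dense_head import BaseDenseHead

class ToyHead(BaseDenseHead):
    """A made-up minimal subclass showing the two required overrides."""

    def loss(self, cls_scores, bbox_preds, gt_bboxes, gt_labels, img_metas,
             gt_bboxes_ignore=None):
        # A real head compares predictions against targets here and returns
        # a dict of named loss tensors
        return dict(loss_cls=cls_scores[0].sum() * 0,
                    loss_bbox=bbox_preds[0].sum() * 0)

    def get_bboxes(self, cls_scores, bbox_preds, img_metas, cfg=None):
        # A real head decodes the raw outputs into per-image
        # (bboxes, labels) tuples here
        return []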

AnchorFreeHead

An anchor-free method does not lay out anchor boxes of preset sizes in advance; instead it predicts target locations directly. Let's first look at the initialization function of the AnchorFreeHead class:

from abc import abstractmethod

import torch
import torch.nn as nn
from mmcv.cnn import ConvModule
from mmcv.runner import force_fp32

from mmdet.core import multi_apply
from ..builder import HEADS, build_loss
from .base_dense_head import BaseDenseHead
from .dense_test_mixins import BBoxTestMixin

# The initialization function has many parameters; each gets a brief comment
@HEADS.register_module()
class AnchorFreeHead(BaseDenseHead, BBoxTestMixin):
    """Anchor-free head (FCOS, Fovea, RepPoints, etc.).
    """  # noqa: W605

    _version = 1

    def __init__(self,
                 num_classes,  # number of classes (excluding background)
                 in_channels,  # number of channels of the input feature map
                 feat_channels=256,  # number of hidden channels
                 stacked_convs=4,  # number of stacked conv layers in the head
                 strides=(4, 8, 16, 32, 64),  # downsampling factor of each scale
                 dcn_on_last_conv=False,  # whether to use DCN in the last conv layer
                 conv_bias='auto',  # whether conv layers use bias; 'auto' lets norm_cfg decide
                 loss_cls=dict(  # classification loss, FocalLoss by default
                     type='FocalLoss',
                     use_sigmoid=True,
                     gamma=2.0,
                     alpha=0.25,
                     loss_weight=1.0),
                 loss_bbox=dict(type='IoULoss', loss_weight=1.0),  # box regression loss, IoULoss by default
                 conv_cfg=None,
                 norm_cfg=None,
                 train_cfg=None,
                 test_cfg=None,
                 init_cfg=dict(  # initialization config
                     type='Normal',
                     layer='Conv2d',
                     std=0.01,
                     override=dict(
                         type='Normal',
                         name='conv_cls',
                         std=0.01,
                         bias_prob=0.01))):
        super(AnchorFreeHead, self).__init__(init_cfg)
        self.num_classes = num_classes
        self.cls_out_channels = num_classes
        self.in_channels = in_channels
        self.feat_channels = feat_channels
        self.stacked_convs = stacked_convs
        self.strides = strides
        self.dcn_on_last_conv = dcn_on_last_conv
        assert conv_bias == 'auto' or isinstance(conv_bias, bool)
        self.conv_bias = conv_bias
        # Build the loss functions
        self.loss_cls = build_loss(loss_cls)
        self.loss_bbox = build_loss(loss_bbox)

        self.train_cfg = train_cfg
        self.test_cfg = test_cfg
        self.conv_cfg = conv_cfg
        self.norm_cfg = norm_cfg
        self.fp16_enabled = False
        # Initialize the layers
        self._init_layers()

    def _init_layers(self):
        """Initialize the layers of the detection head."""
        self._init_cls_convs()
        self._init_reg_convs()
        self._init_predictor()

The composition of the AnchorFreeHead detection head is shown in the following figure:

Let's take a look at the initialization functions of each layer:

    def _init_cls_convs(self):
        """Initialize the conv layers of the classification branch."""
        self.cls_convs = nn.ModuleList()
        # The number of conv layers in the classification branch is given
        # by the self.stacked_convs parameter
        for i in range(self.stacked_convs):
            chn = self.in_channels if i == 0 else self.feat_channels
            if self.dcn_on_last_conv and i == self.stacked_convs - 1:
                conv_cfg = dict(type='DCNv2')
            else:
                conv_cfg = self.conv_cfg
            # ConvModule is mmcv's convolution-normalization-activation
            # block; the activation defaults to ReLU
            self.cls_convs.append(
                ConvModule(
                    chn,
                    self.feat_channels,
                    3,
                    stride=1,
                    padding=1,
                    conv_cfg=conv_cfg,
                    norm_cfg=self.norm_cfg,
                    bias=self.conv_bias))

    def _init_reg_convs(self):
        """Initialize the box regression branch; similar to _init_cls_convs."""
        self.reg_convs = nn.ModuleList()
        for i in range(self.stacked_convs):
            chn = self.in_channels if i == 0 else self.feat_channels
            if self.dcn_on_last_conv and i == self.stacked_convs - 1:
                conv_cfg = dict(type='DCNv2')
            else:
                conv_cfg = self.conv_cfg
            self.reg_convs.append(
                ConvModule(
                    chn,
                    self.feat_channels,
                    3,
                    stride=1,
                    padding=1,
                    conv_cfg=conv_cfg,
                    norm_cfg=self.norm_cfg,
                    bias=self.conv_bias))

    def _init_predictor(self):
        """Initialize the prediction layers."""
        # conv_cls outputs self.cls_out_channels = num_classes channels,
        # i.e. one per class
        self.conv_cls = nn.Conv2d(
            self.feat_channels, self.cls_out_channels, 3, padding=1)
        # conv_reg outputs 4 channels, one per box coordinate
        self.conv_reg = nn.Conv2d(self.feat_channels, 4, 3, padding=1)

Next, the forward functions:

    def forward(self, feats):
        """Forward function.

        Args:
            feats (tuple[Tensor]): Feature maps from the upstream network;
                each tensor corresponds to a different scale.
        Returns:
            tuple: Usually contains classification scores and box predictions.
                cls_scores (list[Tensor]): Classification predictions for
                    each scale; the number of channels is
                    num_points * num_classes, i.e. a score for every class
                    at every point.
                bbox_preds (list[Tensor]): Box offsets for each scale; the
                    number of channels is num_points * 4.
        """
        # multi_apply applies forward_single to every item of feats
        return multi_apply(self.forward_single, feats)[:2]

    def forward_single(self, x):
        """Forward function for a single feature level.

        Args:
            x (Tensor): FPN feature map of the specified stride.

        Returns:
            tuple: Scores for each class, box predictions, and the features
            after the cls_convs and reg_convs branches.
        """
        cls_feat = x
        reg_feat = x
        # Obtain the classification scores
        for cls_layer in self.cls_convs:
            cls_feat = cls_layer(cls_feat)
        cls_score = self.conv_cls(cls_feat)
        # Obtain the box predictions
        for reg_layer in self.reg_convs:
            reg_feat = reg_layer(reg_feat)
        bbox_pred = self.conv_reg(reg_feat)
        return cls_score, bbox_pred, cls_feat, reg_feat
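
The multi_apply helper used above is worth a quick illustration: it maps a function over corresponding items of one or more lists and transposes the per-item results into a tuple of lists (toy example):

from mmdet.core import multi_apply

def square_and_double(x):
    # Returns two values per input, just like forward_single returning
    # (cls_score, bbox_pred, ...) per feature level
    return x * x, x * 2

squares, doubles = multi_apply(square_and_double, [1, 2, 3])
print(squares)  # [1, 4, 9]
print(doubles)  # [2, 4, 6]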

You can see that the forward pass of AnchorFreeHead is fairly simple. Now let's look at the loss computation and the processing of test results:

"""loss It is an abstract class method, which means that the specific model needs to define its own loss function"""
	@abstractmethod
    @force_fp32(apply_to=('cls_scores', 'bbox_preds'))
    def loss(self,
             cls_scores,
             bbox_preds,
             gt_bboxes,
             gt_labels,
             img_metas,
             gt_bboxes_ignore=None):
        """Calculate the loss function of the detection head.

        Args:
            cls_scores (list[Tensor]): Classification and prediction results for different scale feature maps.
            bbox_preds (list[Tensor]): Frame prediction results for different scale feature maps.
            gt_bboxes (list[Tensor]): True value of the bounding box of the target to be detected.
            gt_labels (list[Tensor]): Category true value of each border
            img_metas (list[dict]): Meta information of each image, e.g.,
                image size, scaling factor, etc.
            gt_bboxes_ignore (None | list[Tensor]): specify which bounding
                boxes can be ignored when computing the loss.
        """
        raise NotImplementedError
        
	@abstractmethod
    @force_fp32(apply_to=('cls_scores', 'bbox_preds'))
    def get_bboxes(self,
                   cls_scores,
                   bbox_preds,
                   img_metas,
                   cfg=None,
                   rescale=None):
        """This function is in BaseDenseHead I've seen it in. It's also an abstract method. I won't repeat it.
        """
        raise NotImplementedError
    
 	@abstractmethod
    def get_targets(self, points, gt_bboxes_list, gt_labels_list):
        """Calculate the prediction objectives of classification, regression and other tasks(target),For calculation of loss function.

        Args:
            points (list[Tensor]): The anchor point of each layer of the feature pyramid, and each item in the list is(num_points,2).
            gt_bboxes_list (list[Tensor]): Real border information,
                The shape of each item is (num_gt, 4).
            gt_labels_list (list[Tensor]): True category label for the border,
                The shape of each item is (num_gt,).
        """
        raise NotImplementedError

Since AnchorFreeHead does not use anchor boxes, anchor points are used instead to compute the target values to regress; the loss and the predicted boxes are then computed from these targets. The anchor points are generated as follows:

    def _get_points_single(self,
                           featmap_size,
                           stride,
                           dtype,
                           device,
                           flatten=False):
        """Get the points of a single-scale feature map."""
        h, w = featmap_size
        # Generate the x and y coordinate ranges; every feature-map
        # location becomes a point
        x_range = torch.arange(w, device=device).to(dtype)
        y_range = torch.arange(h, device=device).to(dtype)
        # Use meshgrid to generate the 2-D grid of coordinates
        y, x = torch.meshgrid(y_range, x_range)
        # Optionally flatten into 1-D vectors
        if flatten:
            y = y.flatten()
            x = x.flatten()
        return y, x

    def get_points(self, featmap_sizes, dtype, device, flatten=False):
        """Generate the anchor points of multi-scale feature maps.

        Args:
            featmap_sizes (list[tuple]): Sizes of the multi-scale feature maps.
            dtype (torch.dtype): Data type.
            device (torch.device): Computing device (cpu or gpu).

        Returns:
            tuple: Points of each feature map.
        """
        mlvl_points = []
        for i in range(len(featmap_sizes)):
            mlvl_points.append(
                self._get_points_single(featmap_sizes[i], self.strides[i],
                                        dtype, device, flatten))
        return mlvl_points

Here is a small example of the torch.meshgrid() function:

import torch

# meshgrid generates multi-dimensional coordinate grids
w = 5
h = 5
x_range = torch.arange(w)
y_range = torch.arange(h)

y, x = torch.meshgrid(y_range, x_range)
print(y)

# Result: a 5x5 grid of 2-D coordinates; tensor y holds the y coordinate
# of each point
# tensor([[0, 0, 0, 0, 0],
#         [1, 1, 1, 1, 1],
#         [2, 2, 2, 2, 2],
#         [3, 3, 3, 3, 3],
#         [4, 4, 4, 4, 4]])

To summarize the components of AnchorFreeHead:

FCOSHead

The FCOSHead code is explained below, which involves the details of how the FCOS model is constructed. Interested readers may want to read the paper first.

Initialization

First, the initialization part of FCOSHead:

@HEADS.register_module()
class FCOSHead(AnchorFreeHead):

    def __init__(self,
                 num_classes,  # number of classes (excluding background)
                 in_channels,  # number of channels of the input feature map
                 regress_ranges=((-1, 64), (64, 128), (128, 256), (256, 512),
                                 (512, INF)),  # regression range of each scale level (each point only regresses targets within the range of its level)
                 center_sampling=False,
                 center_sample_radius=1.5,
                 norm_on_bbox=False,
                 centerness_on_reg=False,
                 loss_cls=dict(  # classification loss, FocalLoss by default
                     type='FocalLoss',
                     use_sigmoid=True,
                     gamma=2.0,
                     alpha=0.25,
                     loss_weight=1.0),
                 loss_bbox=dict(type='IoULoss', loss_weight=1.0),  # box regression loss
                 loss_centerness=dict(  # center-ness prediction loss
                     type='CrossEntropyLoss',
                     use_sigmoid=True,
                     loss_weight=1.0),
                 norm_cfg=dict(type='GN', num_groups=32, requires_grad=True),
                 init_cfg=dict(
                     type='Normal',
                     layer='Conv2d',
                     std=0.01,
                     override=dict(
                         type='Normal',
                         name='conv_cls',
                         std=0.01,
                         bias_prob=0.01)),
                 **kwargs):
        self.regress_ranges = regress_ranges
        self.center_sampling = center_sampling
        self.center_sample_radius = center_sample_radius
        self.norm_on_bbox = norm_on_bbox
        self.centerness_on_reg = centerness_on_reg
        super().__init__(
            num_classes,
            in_channels,
            loss_cls=loss_cls,
            loss_bbox=loss_bbox,
            norm_cfg=norm_cfg,
            init_cfg=init_cfg,
            **kwargs)
        self.loss_centerness = build_loss(loss_centerness)  # build the center-ness loss

    def _init_layers(self):
        """Initialize the layers.

        First call the parent class AnchorFreeHead._init_layers() to create
        self.conv_cls, self.conv_reg, self.cls_convs and self.reg_convs.
        """
        super()._init_layers()
        # A single conv layer predicts the center-ness value
        self.conv_centerness = nn.Conv2d(self.feat_channels, 1, 3, padding=1)
        # One learnable Scale per FPN level
        self.scales = nn.ModuleList([Scale(1.0) for _ in self.strides])

Forward propagation

Next, the forward pass:

    def forward(self, feats):
        """Forward function."""
        return multi_apply(self.forward_single, feats, self.scales,
                           self.strides)

    def forward_single(self, x, scale, stride):
        """Forward function for a single-scale feature map."""
        # First call the parent's forward_single to get cls_score,
        # bbox_pred, cls_feat and reg_feat
        cls_score, bbox_pred, cls_feat, reg_feat = super().forward_single(x)
        # centerness_on_reg chooses which feature branch predicts the
        # center-ness value
        if self.centerness_on_reg:
            centerness = self.conv_centerness(reg_feat)
        else:
            centerness = self.conv_centerness(cls_feat)
        # scale the bbox_pred of different level
        # float to avoid overflow when enabling FP16
        bbox_pred = scale(bbox_pred).float()
        # bbox_pred is non-negative in theory, so ReLU or exp is used to
        # keep it non-negative
        if self.norm_on_bbox:
            bbox_pred = F.relu(bbox_pred)
            if not self.training:
                bbox_pred *= stride
        else:
            bbox_pred = bbox_pred.exp()
        # Return the classification scores, box predictions and center-ness
        # predictions
        return cls_score, bbox_pred, centerness
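
A small smoke test of the head's forward pass may make the shapes concrete (a minimal sketch assuming a recent mmdet that exports FCOSHead; 4 classes and 256-channel features are arbitrary choices):

import torch
from mmdet.models import FCOSHead

head = FCOSHead(num_classes=4, in_channels=256)
# Five feature levels matching the default strides (4, 8, 16, 32, 64)
feats = tuple(torch.rand(1, 256, s, s) for s in [32, 16, 8, 4, 2])
cls_scores, bbox_preds, centernesses = head.forward(feats)
for cls_score, bbox_pred, centerness in zip(cls_scores, bbox_preds,
                                            centernesses):
    print(cls_score.shape, bbox_pred.shape, centerness.shape)
# First level, e.g.: torch.Size([1, 4, 32, 32]) torch.Size([1, 4, 32, 32])
#                    torch.Size([1, 1, 32, 32])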

Generating the prediction targets

The loss function measures the difference between the predicted value and the target value:

$L = L(pred, target)$

For classification, the target is simply the class label of the object. For box regression, however, different models regress to different targets, and the target depends on the scale; this is one of the biggest difficulties in implementing a detection model.

FCOS predicts directly at points on the feature maps. Let's first see how FCOSHead generates these points:

    def _get_points_single(self,
                           featmap_size,
                           stride,
                           dtype,
                           device,
                           flatten=False):
        """Get the points of a single-scale feature map."""
        # First call the parent AnchorFreeHead method to generate the
        # base grid
        y, x = super()._get_points_single(featmap_size, stride, dtype, device)
        # Stack the two coordinate dimensions together; points has shape
        # (h*w, 2). Multiplying by the stride maps each point back onto the
        # input image, and adding stride // 2 shifts it to the cell center.
        points = torch.stack((x.reshape(-1) * stride, y.reshape(-1) * stride),
                             dim=-1) + stride // 2
        return points

The effect of this function is easiest to see with a small worked example:
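
Below, assuming a toy 2x2 feature map with stride 8, the generated points land at the centers of the corresponding 8x8 cells of the input image:

import torch

h, w, stride = 2, 2, 8
y, x = torch.meshgrid(torch.arange(h), torch.arange(w))
points = torch.stack((x.reshape(-1) * stride, y.reshape(-1) * stride),
                     dim=-1) + stride // 2
print(points)
# tensor([[ 4,  4],
#         [12,  4],
#         [ 4, 12],
#         [12, 12]])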

Next, the _get_target_single() function:

    def _get_target_single(self, gt_bboxes, gt_labels, points, regress_ranges,
                           num_points_per_lvl):
        """Compute the classification and regression targets for a single
        image (note: a single image, not a single-scale feature map; the
        batch dimension is handled by the caller).
        Feature maps of different scales have different numbers of points;
        denote them num_points1, ..., num_points5 (decreasing), and let
        sum_points = num_points1 + num_points2 + ... + num_points5.
        Args:
            gt_bboxes (Tensor): shape (num_gts, 4)
            gt_labels (Tensor): shape (num_gts,)
            points (Tensor): points of all scale levels, shape (sum_points, 2)
            regress_ranges (Tensor): regression range of each point; same
                first dimension as points
            num_points_per_lvl (list[int]): number of points on each level
        """
        num_points = points.size(0)
        num_gts = gt_labels.size(0)
        # If there are no gt_bboxes, return background labels and zero targets
        if num_gts == 0:
            return gt_labels.new_full((num_points,), self.num_classes), \
                   gt_bboxes.new_zeros((num_points, 4))
        # Compute the area of each gt_bbox
        areas = (gt_bboxes[:, 2] - gt_bboxes[:, 0]) * (
            gt_bboxes[:, 3] - gt_bboxes[:, 1])
        # TODO: figure out why these two are different
        # areas = areas[None].expand(num_points, num_gts)
        # Expand areas: areas[None] adds a leading dimension, so after this
        # step areas has shape (num_points, num_gts)
        areas = areas[None].repeat(num_points, 1)
        # Expand regress_ranges to shape (num_points, num_gts, 2)
        regress_ranges = regress_ranges[:, None, :].expand(
            num_points, num_gts, 2)
        # Expand gt_bboxes to shape (num_points, num_gts, 4)
        gt_bboxes = gt_bboxes[None].expand(num_points, num_gts, 4)
        # Get the x, y coordinates of the generated points and expand them
        # to shape (num_points, num_gts)
        xs, ys = points[:, 0], points[:, 1]
        xs = xs[:, None].expand(num_points, num_gts)
        ys = ys[:, None].expand(num_points, num_gts)
        # Compute the regression target of every point w.r.t. every gt_bbox.
        # This is why the leading dimensions of every tensor were expanded
        # to (num_points, num_gts): each point computes its distances to
        # every gt_bbox so that the regression task can be assigned next.
        # The target is the distance from the point to the left, right, top
        # and bottom edges of the gt_bbox.
        left = xs - gt_bboxes[..., 0]
        right = gt_bboxes[..., 2] - xs
        top = ys - gt_bboxes[..., 1]
        bottom = gt_bboxes[..., 3] - ys
        # Stack the four distances; note that stack with dim=-1 adds a
        # dimension, so bbox_targets has shape (num_points, num_gts, 4),
        # i.e. dense targets of every point against every gt_bbox
        bbox_targets = torch.stack((left, top, right, bottom), -1)
        # Next, decide which bbox_targets count as positive training samples
        # 1. If center_sampling is True, sample within radius*stride of the
        #    gt center (i.e. near the center); points inside are positives
        if self.center_sampling:
            # condition1: inside a `center bbox`
            # The default radius is 1.5
            radius = self.center_sample_radius
            # Compute the center of each gt_bbox; center_xs and center_ys
            # both have shape (num_points, num_gts)
            center_xs = (gt_bboxes[..., 0] + gt_bboxes[..., 2]) / 2
            center_ys = (gt_bboxes[..., 1] + gt_bboxes[..., 3]) / 2
            center_gts = torch.zeros_like(gt_bboxes)
            stride = center_xs.new_zeros(center_xs.shape)

            # Project the points of different scale levels back onto the
            # input image
            lvl_begin = 0
            for lvl_idx, num_points_lvl in enumerate(num_points_per_lvl):
                lvl_end = lvl_begin + num_points_lvl
                # Scale the original stride by the radius (1.5x by default)
                stride[lvl_begin:lvl_end] = self.strides[lvl_idx] * radius
                lvl_begin = lvl_end
            # Determine the positive-sample region (xmin, ymin, xmax, ymax)
            # from the center and the scaled stride
            x_mins = center_xs - stride
            y_mins = center_ys - stride
            x_maxs = center_xs + stride
            y_maxs = center_ys + stride
            # torch.where(condition, a, b) takes a where the condition holds
            # and b otherwise. The four calls below keep only points that
            # satisfy two conditions at once:
            # 1. inside the center sampling region (xmin, ymin, xmax, ymax)
            # 2. inside the gt_bbox
            # Hence x_mins is clipped to max(x_mins, gt_bbox[0]) and
            # x_maxs to min(x_maxs, gt_bbox[2]), and likewise for y.
            center_gts[..., 0] = torch.where(x_mins > gt_bboxes[..., 0],
                                             x_mins, gt_bboxes[..., 0])
            center_gts[..., 1] = torch.where(y_mins > gt_bboxes[..., 1],
                                             y_mins, gt_bboxes[..., 1])
            center_gts[..., 2] = torch.where(x_maxs > gt_bboxes[..., 2],
                                             gt_bboxes[..., 2], x_maxs)
            center_gts[..., 3] = torch.where(y_maxs > gt_bboxes[..., 3],
                                             gt_bboxes[..., 3], y_maxs)
            # center_gts has shape (num_points, num_gts, 4) and defines the
            # region whose points may serve as positive samples for the
            # regression
            cb_dist_left = xs - center_gts[..., 0]
            cb_dist_right = center_gts[..., 2] - xs
            cb_dist_top = ys - center_gts[..., 1]
            cb_dist_bottom = center_gts[..., 3] - ys
            center_bbox = torch.stack(
                (cb_dist_left, cb_dist_top, cb_dist_right, cb_dist_bottom), -1)
            # Only points inside center_gts get a positive mask
            inside_gt_bbox_mask = center_bbox.min(-1)[0] > 0
        else:
            # If center_sampling is False, simply treat all points inside
            # the gt_bbox as positive samples
            inside_gt_bbox_mask = bbox_targets.min(-1)[0] > 0

        # Compute the maximum regression distance of each target and
        # exclude targets outside the regression range of the level
        max_regress_distance = bbox_targets.max(-1)[0]
        inside_regress_range = (
            (max_regress_distance >= regress_ranges[..., 0])
            & (max_regress_distance <= regress_ranges[..., 1]))

        # Set the area of (point, gt_bbox) pairs that fail the checks to INF
        areas[inside_gt_bbox_mask == 0] = INF
        areas[inside_regress_range == 0] = INF
        # If a point still matches multiple gt_bboxes, pick the gt_bbox with
        # the smallest area as its regression target.
        # min_area_inds has shape (num_points,); each element is the index
        # of the assigned gt_bbox.
        min_area, min_area_inds = areas.min(dim=1)
        labels = gt_labels[min_area_inds]
        # Points that match no gt_bbox become background samples
        labels[min_area == INF] = self.num_classes  # set as BG
        # range(num_points) and min_area_inds have equal length: each point
        # selects at most one gt_bbox
        bbox_targets = bbox_targets[range(num_points), min_area_inds]
        # Final return values:
        #   labels (Tensor): shape (num_points,)
        #   bbox_targets (Tensor): shape (num_points, 4)
        return labels, bbox_targets

As you can see, _get_target_single() selects the qualified positive sample points for a single image and computes the classification and regression targets used in the subsequent loss computation.
Next, the get_targets() function:

    def get_targets(self, points, gt_bboxes_list, gt_labels_list):
        """Compute the classification and regression targets for multiple
        images.

        Args:
            points (list[Tensor]): Anchor points of each feature level,
                list(Tensor(num_points, 2)).
            gt_bboxes_list (list[Tensor]): Each tensor has shape (num_gt, 4).
            gt_labels_list (list[Tensor]): Each tensor has shape (num_gt,).
        Returns:
            tuple:
                concat_lvl_labels (list[Tensor]): Labels of each level.
                concat_lvl_bbox_targets (list[Tensor]): BBox targets of each
                    level.
        """
        assert len(points) == len(self.regress_ranges)
        num_levels = len(points)
        # Expand regress_ranges to align with the shape of points
        expanded_regress_ranges = [
            points[i].new_tensor(self.regress_ranges[i])[None].expand_as(
                points[i]) for i in range(num_levels)
        ]
        # Concatenate all regress_ranges and points across levels to match
        # what _get_target_single expects; concat_points has shape
        # (h1*w1 + h2*w2 + ... + h5*w5, 2)
        concat_regress_ranges = torch.cat(expanded_regress_ranges, dim=0)
        concat_points = torch.cat(points, dim=0)
        # Count the number of anchor points on each level
        num_points = [center.size(0) for center in points]
        # For each image, generate labels (shape (num_all_points,)) and
        # bbox_targets (shape (num_all_points, 4)), collected into two lists
        labels_list, bbox_targets_list = multi_apply(
            self._get_target_single,
            gt_bboxes_list,
            gt_labels_list,
            points=concat_points,
            regress_ranges=concat_regress_ranges,
            num_points_per_lvl=num_points)

        # num_points is a list of per-level point counts. The split below
        # separates the points of each image by level, equivalent to
        # torch.split(labels, [num_lvl1, num_lvl2, ..., num_lvl5], dim=0)
        labels_list = [labels.split(num_points, 0) for labels in labels_list]
        bbox_targets_list = [
            bbox_targets.split(num_points, 0)
            for bbox_targets in bbox_targets_list
        ]

        # Concatenate the labels and bbox_targets of the same level across
        # different images
        concat_lvl_labels = []
        concat_lvl_bbox_targets = []
        for i in range(num_levels):
            # Tensors in concat_lvl_labels have shape
            # (num_points_leveli * num_imgs,); tensors in
            # concat_lvl_bbox_targets have shape
            # (num_points_leveli * num_imgs, 4)
            concat_lvl_labels.append(
                torch.cat([labels[i] for labels in labels_list]))
            bbox_targets = torch.cat(
                [bbox_targets[i] for bbox_targets in bbox_targets_list])
            if self.norm_on_bbox:
                bbox_targets = bbox_targets / self.strides[i]
            concat_lvl_bbox_targets.append(bbox_targets)
        # Two lists are returned; the list dimension indexes feature levels,
        # not images
        return concat_lvl_labels, concat_lvl_bbox_targets

So far we have generated the classification and regression targets. But don't forget that FCOS also has a center-ness prediction branch, which measures the offset between an anchor point and the center of its gt_bbox; this branch needs a target as well:

    def centerness_target(self, pos_bbox_targets):
        """
        Args:
            pos_bbox_targets (Tensor): The center-ness target is computed
                only for positive samples, shape (num_pos_bbox, 4).
        """
        left_right = pos_bbox_targets[:, [0, 2]]
        top_bottom = pos_bbox_targets[:, [1, 3]]
        if len(left_right) == 0:
            centerness_targets = left_right[..., 0]
        else:
            centerness_targets = (
                left_right.min(dim=-1)[0] / left_right.max(dim=-1)[0]) * (
                    top_bottom.min(dim=-1)[0] / top_bottom.max(dim=-1)[0])
        # The returned shape is (num_pos_bbox,)
        return torch.sqrt(centerness_targets)

The center-ness target is computed as

$centerness^* = \sqrt{\frac{\min(l^*, r^*)}{\max(l^*, r^*)} \times \frac{\min(t^*, b^*)}{\max(t^*, b^*)}}$

When $l^* = r^*$ and $t^* = b^*$, i.e. when the point lies exactly at the center, the center-ness equals 1.
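
A quick numeric check of this formula (values made up): a point with equal distances to all four sides gets center-ness 1, while an off-center point gets a smaller value:

import torch

# Two hypothetical positive-sample targets in (left, top, right, bottom) order
pos_bbox_targets = torch.tensor([[10., 10., 10., 10.],   # exactly centered
                                 [ 2., 10., 18., 10.]])  # off-center in x
left_right = pos_bbox_targets[:, [0, 2]]
top_bottom = pos_bbox_targets[:, [1, 3]]
centerness = torch.sqrt(
    (left_right.min(dim=-1)[0] / left_right.max(dim=-1)[0]) *
    (top_bottom.min(dim=-1)[0] / top_bottom.max(dim=-1)[0]))
print(centerness)  # tensor([1.0000, 0.3333])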

Calculation of loss

    def loss(self,
             cls_scores,
             bbox_preds,
             centernesses,
             gt_bboxes,
             gt_labels,
             img_metas,
             gt_bboxes_ignore=None):
        """Compute the losses of the bbox_head.

        Args:
            cls_scores (list[Tensor]): Classification scores of each level;
                each tensor has shape (B, num_points * num_classes, H, W).
            bbox_preds (list[Tensor]): Box offsets of each level; each
                tensor has shape (B, num_points * 4, H, W).
            centernesses (list[Tensor]): Center-ness predictions of each
                level; each tensor has shape (B, num_points, H, W).
            gt_bboxes (list[Tensor]): Ground-truth boxes of each image,
                shape (num_gts, 4), in (xmin, ymin, xmax, ymax) format.
            gt_labels (list[Tensor]): Class label of each gt_bbox,
                shape (num_gts,).

        Returns:
            dict[str, Tensor]: A dictionary of loss components.
        """
        assert len(cls_scores) == len(bbox_preds) == len(centernesses)
        # size()[-2:] takes (H, W): the spatial size of each level's
        # feature map
        featmap_sizes = [featmap.size()[-2:] for featmap in cls_scores]
        # Generate the anchor points
        all_level_points = self.get_points(featmap_sizes, bbox_preds[0].dtype,
                                           bbox_preds[0].device)
        # Generate the prediction targets from the points and the ground
        # truth; labels and bbox_targets are lists whose first dimension
        # indexes the feature scales
        labels, bbox_targets = self.get_targets(all_level_points, gt_bboxes,
                                                gt_labels)
        # Number of images processed together, i.e. batch_size
        num_imgs = cls_scores[0].size(0)
        # Take flatten_cls_scores as an example: 4-D tensors are
        # inconvenient for the loss computation, so they are flattened
        # first. Each cls_score in cls_scores is the per-level class
        # prediction from conv_cls, with shape
        # (B, num_points(=1)*num_classes, H, W). permute(0, 2, 3, 1) first
        # changes the shape to (B, H, W, num_classes), then
        # reshape(-1, self.cls_out_channels) gives the final shape
        # (B*H*W, self.cls_out_channels), i.e. the predictions of all points
        # laid out linearly.
        # flatten_bbox_preds and flatten_centerness are handled the same way.
        flatten_cls_scores = [
            cls_score.permute(0, 2, 3, 1).reshape(-1, self.cls_out_channels)
            for cls_score in cls_scores
        ]
        flatten_bbox_preds = [
            bbox_pred.permute(0, 2, 3, 1).reshape(-1, 4)
            for bbox_pred in bbox_preds
        ]
        flatten_centerness = [
            centerness.permute(0, 2, 3, 1).reshape(-1)
            for centerness in centernesses
        ]
        # Concatenate everything; all of the flattened tensors now have
        # equal length
        flatten_cls_scores = torch.cat(flatten_cls_scores)
        flatten_bbox_preds = torch.cat(flatten_bbox_preds)
        flatten_centerness = torch.cat(flatten_centerness)
        flatten_labels = torch.cat(labels)
        flatten_bbox_targets = torch.cat(bbox_targets)
        # The anchor points are repeated for each image in the batch; they
        # are needed later to decode the box predictions
        flatten_points = torch.cat(
            [points.repeat(num_imgs, 1) for points in all_level_points])

        # Foreground class indices: [0, num_classes - 1]; background label:
        # num_classes
        bg_class_ind = self.num_classes
        # Tensor.nonzero() returns the indices of nonzero entries; pos_inds
        # holds the indices of the positive samples
        pos_inds = ((flatten_labels >= 0)
                    & (flatten_labels < bg_class_ind)).nonzero().reshape(-1)

        num_pos = torch.tensor(
            len(pos_inds), dtype=torch.float, device=bbox_preds[0].device)
        num_pos = max(reduce_mean(num_pos), 1.0)
        # Classification loss
        loss_cls = self.loss_cls(
            flatten_cls_scores, flatten_labels, avg_factor=num_pos)
        # Box regression and center-ness losses are computed only on
        # positive samples
        pos_bbox_preds = flatten_bbox_preds[pos_inds]
        pos_centerness = flatten_centerness[pos_inds]
        pos_bbox_targets = flatten_bbox_targets[pos_inds]
        pos_centerness_targets = self.centerness_target(pos_bbox_targets)
        # centerness weighted iou loss
        centerness_denorm = max(
            reduce_mean(pos_centerness_targets.sum().detach()), 1e-6)

        if len(pos_inds) > 0:
            pos_points = flatten_points[pos_inds]
            # IoULoss needs the IoU of two boxes, so the distance predictions
            # are first decoded into boxes with distance2bbox
            pos_decoded_bbox_preds = distance2bbox(pos_points, pos_bbox_preds)
            pos_decoded_target_preds = distance2bbox(pos_points,
                                                     pos_bbox_targets)
            loss_bbox = self.loss_bbox(
                pos_decoded_bbox_preds,
                pos_decoded_target_preds,
                weight=pos_centerness_targets,
                avg_factor=centerness_denorm)
            loss_centerness = self.loss_centerness(
                pos_centerness, pos_centerness_targets, avg_factor=num_pos)
        else:
            loss_bbox = pos_bbox_preds.sum()
            loss_centerness = pos_centerness.sum()
        # Return the loss dictionary
        return dict(
            loss_cls=loss_cls,
            loss_bbox=loss_bbox,
            loss_centerness=loss_centerness)
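
Putting it together, a hedged end-to-end smoke test of the loss (toy data; assumes a recent mmdet where the forward example above works and where FocalLoss has a CPU fallback):

import torch
from mmdet.models import FCOSHead

head = FCOSHead(num_classes=4, in_channels=256)
feats = tuple(torch.rand(1, 256, s, s) for s in [32, 16, 8, 4, 2])
cls_scores, bbox_preds, centernesses = head.forward(feats)
# One made-up ground-truth box of class 2 in a 128x128 image
gt_bboxes = [torch.tensor([[20., 20., 90., 110.]])]
gt_labels = [torch.tensor([2])]
img_metas = [dict(img_shape=(128, 128, 3), scale_factor=1.0)]
losses = head.loss(cls_scores, bbox_preds, centernesses,
                   gt_bboxes, gt_labels, img_metas)
print(losses.keys())  # dict_keys(['loss_cls', 'loss_bbox', 'loss_centerness'])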

Summary

The training process is summarized in the chart below:

Besides the above, FCOSHead also contains get_bboxes() and other methods used to generate the final boxes; they are not covered in detail here, and interested readers can read the code themselves~

Conclusion

This chapter explained the code construction and running process of the FCOS method in great detail, including the overall structure of the model, the generation of the prediction targets, and the computation of the loss. I really admire the mmdetection contributors here: they split the model very cleanly, and it is not hard to understand when read carefully. That's all for this chapter; readers with questions can leave a message in the comments~
