Reprinted from https://blog.csdn.net/dz4543/article/details/90049377
Detailed analysis of YOLOv3 network structure
Based on keras yolov3, understand the principle and code details
Thesis address: https://pjreddie.com/media/files/papers/YOLOv3.pdf
yolov3 official website: https://pjreddie.com/darknet/yolo/
Keras version recommendation: https://github.com/qqwweee/keras-yolo3
And interpretation of keras version: https://danielack.github.io/2018/08/25/yolov3Keras Implementation interpretation/
This article only describes the network structure of YOLO.
1
YOLOv3 itself uses full convolution layer. Even the size modification of the graph or feature graph is realized through convolution layer. Structure chart of YOLO paper:
Another display of YOLO output:
layer filters size input output 0 conv 32 3 x 3 / 1 416 x 416 x 3 -> 416 x 416 x 32 0.299 BF 1 conv 64 3 x 3 / 2 416 x 416 x 32 -> 208 x 208 x 64 1.595 BF 2 conv 32 1 x 1 / 1 208 x 208 x 64 -> 208 x 208 x 32 0.177 BF 3 conv 64 3 x 3 / 1 208 x 208 x 32 -> 208 x 208 x 64 1.595 BF 4 Shortcut Layer: 1 5 conv 128 3 x 3 / 2 208 x 208 x 64 -> 104 x 104 x 128 1.595 BF 6 conv 64 1 x 1 / 1 104 x 104 x 128 -> 104 x 104 x 64 0.177 BF 7 conv 128 3 x 3 / 1 104 x 104 x 64 -> 104 x 104 x 128 1.595 BF 8 Shortcut Layer: 5 9 conv 64 1 x 1 / 1 104 x 104 x 128 -> 104 x 104 x 64 0.177 BF 10 conv 128 3 x 3 / 1 104 x 104 x 64 -> 104 x 104 x 128 1.595 BF 11 Shortcut Layer: 8 12 conv 256 3 x 3 / 2 104 x 104 x 128 -> 52 x 52 x 256 1.595 BF 13 conv 128 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BF 14 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BF 15 Shortcut Layer: 12 16 conv 128 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BF 17 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BF 18 Shortcut Layer: 15 19 conv 128 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BF 20 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BF 21 Shortcut Layer: 18 22 conv 128 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BF 23 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BF 24 Shortcut Layer: 21 25 conv 128 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BF 26 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BF 27 Shortcut Layer: 24 28 conv 128 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BF 29 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BF 30 Shortcut Layer: 27 31 conv 128 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BF 32 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BF 33 Shortcut Layer: 30 34 conv 128 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BF 35 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BF 36 Shortcut Layer: 33 37 conv 512 3 x 3 / 2 52 x 52 x 256 -> 26 x 26 x 512 1.595 BF 38 conv 256 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BF 39 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BF 40 Shortcut Layer: 37 41 conv 256 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BF 42 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BF 43 Shortcut Layer: 40 44 conv 256 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BF 45 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BF 46 Shortcut Layer: 43 47 conv 256 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BF 48 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BF 49 Shortcut Layer: 46 50 conv 256 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BF 51 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BF 52 Shortcut Layer: 49 53 conv 256 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BF 54 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BF 55 Shortcut Layer: 52 56 conv 256 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BF 57 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BF 58 Shortcut Layer: 55 59 conv 256 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BF 60 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BF 61 Shortcut Layer: 58 62 conv 1024 3 x 3 / 2 26 x 26 x 512 -> 13 x 13 x1024 1.595 BF 63 conv 512 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 512 0.177 BF 64 conv 1024 3 x 3 / 1 13 x 13 x 512 -> 13 x 13 x1024 1.595 BF 65 Shortcut Layer: 62 66 conv 512 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 512 0.177 BF 67 conv 1024 3 x 3 / 1 13 x 13 x 512 -> 13 x 13 x1024 1.595 BF 68 Shortcut Layer: 65 69 conv 512 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 512 0.177 BF 70 conv 1024 3 x 3 / 1 13 x 13 x 512 -> 13 x 13 x1024 1.595 BF 71 Shortcut Layer: 68 72 conv 512 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 512 0.177 BF 73 conv 1024 3 x 3 / 1 13 x 13 x 512 -> 13 x 13 x1024 1.595 BF 74 Shortcut Layer: 71 75 conv 512 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 512 0.177 BF 76 conv 1024 3 x 3 / 1 13 x 13 x 512 -> 13 x 13 x1024 1.595 BF 77 conv 512 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 512 0.177 BF 78 conv 1024 3 x 3 / 1 13 x 13 x 512 -> 13 x 13 x1024 1.595 BF 79 conv 512 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 512 0.177 BF 80 conv 1024 3 x 3 / 1 13 x 13 x 512 -> 13 x 13 x1024 1.595 BF 81 conv 18 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 18 0.006 BF 82 yolo 83 route 79 84 conv 256 1 x 1 / 1 13 x 13 x 512 -> 13 x 13 x 256 0.044 BF 85 upsample 2x 13 x 13 x 256 -> 26 x 26 x 256 86 route 85 61 87 conv 256 1 x 1 / 1 26 x 26 x 768 -> 26 x 26 x 256 0.266 BF 88 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BF 89 conv 256 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BF 90 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BF 91 conv 256 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BF 92 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BF 93 conv 18 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 18 0.012 BF 94 yolo 95 route 91 96 conv 128 1 x 1 / 1 26 x 26 x 256 -> 26 x 26 x 128 0.044 BF 97 upsample 2x 26 x 26 x 128 -> 52 x 52 x 128 98 route 97 36 99 conv 128 1 x 1 / 1 52 x 52 x 384 -> 52 x 52 x 128 0.266 BF 100 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BF 101 conv 128 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BF 102 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BF 103 conv 128 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BF 104 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BF 105 conv 18 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 18 0.025 BF 106 yolo
In fact, this has told us the output of each layer. Size of characteristic diagram of each floor:
On the basis of the network mentioned above, comments are made in red. Residual uses the residual structure. What is the residual structure? For example, in the first layer residual structure (its output is 208208128), its input is 20820864. After the convolution of 3211 and 6433, the generated characteristic map is superimposed with the input. Its structure is as follows:
The superimposed feature map is input to the next layer as a new input. The main body of YOLO is composed of many residual modules, which reduces the risk of gradient explosion and strengthens the learning ability of the network.
It can be seen that YOLO has three scales of output, which are 52 × 52,26 × 26,13 × 13. Well, they are all odd numbers, so that the grid will have a central position. At the same time, YOLO output is divided into three scales, and there is a connection between each scale. For example, 13 × 13 this scale output is used to detect large targets, corresponding to 26 × 26 medium, 52 × 52 for detecting small targets. I think the last picture is very detailed and understandable.
This detects COCO (80 classes), so its output needs to be constructed as: s × S × three × (5+class_number). Explain why.
YOLO divides the image into S × S grid. When the target center falls in a grid, use this grid to detect it. This is s × The origin of S. The reason why it is 3 is that each grid needs to detect 3 anchor boxes (note that there are 3 scales), so for each scale, its output is s × S × three ×???
For an anchor box, it contains coordinate information (x, y, W, H) and confidence, and there are five information; At the same time, it will also include the information of whether all categories are included, using one hot coding. For example, there are three classes: person, car and dog. If the detection result is human, it is coded as [1,0,0]. It can be seen that all category information will be encoded. If COCO has 80 categories, it is 5 + 80. Therefore, for the output of each dimension, the result is:
S
×
S
×
3
×
(
5
+
80
)
=
S
×
S
×
255
S × S × 3 × ( 5 + 80 ) = S × S × 255
S×S×3×(5+80)=S×S×255
Construct the output like this. The characteristic maps of different scales are superimposed together to increase the output information. You can have a good look at this picture.