TI Deep Learning (TIDL) -- 3

1.4. Training
As long as the layers are supported and the parameter constraints are met, existing Caffe and TensorFlow-Slim models can be imported. However, these models usually contain dense weight matrices. To take full advantage of the TIDL library and obtain a 3x-4x performance improvement (for convolution layers), the training process has to be repeated with the Caffe-Jacinto fork of Caffe: https://github.com/tidsp/caffe-jacinto. The largest share of the computing load of a convolutional neural network comes from the convolution layers (usually in the range of 80-90%), so special attention is paid to optimizing convolution layer processing.
Dataset preparation should follow the standard Caffe approach, usually by creating LMDB files. Training is then divided into three steps, followed by an acceptance check (a command-line sketch of the flow is given after the list below):
• initial training (usually using L2 regularization) to create a dense model.
This stage is essentially ordinary desktop training. At the end of this phase, the accuracy of the model should be verified. The weight tensors are still dense, so the performance goal may not yet be met, but the following steps will improve performance. If the accuracy is not sufficient, it is not recommended to continue with the further steps (they will not improve accuracy; in fact, a small drop of 1-2% in accuracy is expected). Instead, modify the training parameters or augment the dataset, and repeat the training until the accuracy target is reached.
• L1 regularization
This step is necessary because L1 regularization (as opposed to L2) favors fewer, larger weights at the expense of the rest, driving most of the weights toward small values. The remaining larger weights act as the feature extractors needed for the next step.
• sparsification ("thinning")
The sparsity target (e.g. 70% or 80%) is approached in steps by gradually raising the weight threshold (from low to high). This process zeroes out the smaller weights and leaves only the larger contributors. Note that this applies only to the convolution layers.
• define acceptance criteria based on accuracy degradation
Because of the conversion from the FP32 representation to an 8-12 bit weight representation (and 8-bit activations), an accuracy drop in the range of 1-2% (depending on the model) should be considered acceptable. For example, if the classification accuracy of the Caffe-Jacinto desktop model is 70% (using the model after the initial stage), the accuracy of the sparse and quantized model should not fall below 68%.
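All three stages can be driven with the standard caffe command-line front end; a minimal sketch is shown below (solver and weight file names are placeholders, the sparsification options in the third stage are Caffe-Jacinto-specific solver settings, and the ready-made scripts in the caffe-jacinto-models repository automate a flow like this):

# Stage 1: initial dense training with L2 regularization
# (regularization_type: "L2" in the solver prototxt)
caffe train --solver=solver_initial.prototxt --gpu=0

# Stage 2: fine-tune with L1 regularization, starting from the stage-1 weights
# (regularization_type: "L1" in the solver prototxt)
caffe train --solver=solver_l1.prototxt --weights=stage1_dense.caffemodel --gpu=0

# Stage 3: sparsification -- fork-specific solver settings select the sparsity
# target (e.g. 80%) and gradually raise the weight threshold
caffe train --solver=solver_sparse.prototxt --weights=stage2_l1.caffemodel --gpu=0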
1.4.1. Example of training procedure
• setup for a dataset of specific small objects
In addition to the many publicly available image datasets, it is often necessary to collect new datasets for specific use cases. For example, an industrial environment is usually more predictable, and a controlled environment with good lighting can usually be ensured. For pick-and-place applications, the set of objects that can appear in the camera's field of view is not open-ended, but limited to a few or a few dozen classes. Such a dataset can be collected quickly using a turntable and a photo booth with good lighting.
• collect the dataset using the AM57xx
The dataset images can be recorded with an external camera device or even with a camera daughtercard on the AM57xx. The recommended recording format is H.264, which provides good quality and can be decoded efficiently with a GStreamer pipeline. A clip only needs to last 15-20 seconds (one rotation cycle of the turntable). At a modest frame rate (10-15 fps) this provides 200-300 frames. The process can be repeated at different distances and elevations (3-4 times), so the total image count per class can reach 2000-3000 frames. In this way, the collection time for a single class can be limited to 5-10 minutes.
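As an illustration, a recording pipeline along the following lines can be used (the device node, resolution and encoder element are assumptions that depend on the camera and on the GStreamer plugins installed on the board; ducatih264enc is the hardware H.264 encoder element in TI's SDK builds, and x264enc can be substituted as a software fallback):

# Record ~20 s (300 frames at 15 fps) of H.264 video from the camera into an MP4 file
gst-launch-1.0 -e v4l2src device=/dev/video1 num-buffers=300 \
    ! video/x-raw,width=1280,height=720,framerate=15/1 \
    ! videoconvert ! ducatih264enc \
    ! h264parse ! mp4mux \
    ! filesink location=class01_take1.mp4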
• post-processing
The video clips should be copied to a Linux x86 machine for offline post-processing. The FFmpeg package makes it easy to split a video clip into individual images. Since the recordings are made against a uniform background, an automatic labeling procedure can also be applied. Image augmentation scripts can be used to further enlarge the dataset, easily increasing the number of images by 10-20 times.
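A minimal sketch of this step (file names are placeholders; the ImageMagick commands are just examples of simple augmentations such as mirroring, small rotations and brightness changes):

# Split one clip into individual frames at 10 fps (FFmpeg)
ffmpeg -i class01_take1.mp4 -vf fps=10 class01_%04d.png

# Simple augmentations with ImageMagick
convert class01_0001.png -flop             class01_0001_mirror.png
convert class01_0001.png -rotate 10        class01_0001_rot10.png
convert class01_0001.png -modulate 120,100 class01_0001_bright.png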
• prepare LMDB files for training
Please refer to the scripts available at github.com/tidsp/caffe-jacinto-models/scripts
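If the dataset is organized as image files plus a label list, the stock Caffe convert_imageset tool can also be used directly; a sketch with placeholder paths and sizes:

# train.txt contains lines of the form "relative/path/to/image.png <label-index>"
convert_imageset --shuffle --backend=lmdb \
    --resize_height=256 --resize_width=256 \
    /data/objects/ train.txt train_lmdb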
• training from scratch or transfer learning (fine tuning)
In general, it is best to start training from initial weights created on a generic dataset such as ImageNet. The bottom layers behave like feature extractors, so only the top layer or a few top layers need to be fine-tuned with the newly collected dataset (as described in the previous steps). For JacintoNet11, a good starting point is the model created after the "initial" phase. The initial phase needs to be repeated, but now with the new dataset, keeping the same names for the layers whose weights we want to preload from the earlier model. The base_lr (in the solver prototxt) can be reduced to temper the overall training, while the learning rate of the top layer or layers is increased. In this way, the bottom layers change only superficially, while the top layers adapt as needed.
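A minimal sketch of such a fine-tuning run (file names are placeholders; Caffe preloads weights for every layer whose name matches the pretrained model):

# Preload the "initial"-stage weights and fine-tune on the newly collected dataset.
# In the solver prototxt: lower base_lr; in train.prototxt: raise lr_mult on the new top layer(s).
caffe train \
    --solver=solver_finetune.prototxt \
    --weights=jacintonet11_initial.caffemodel \
    --gpu=0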
1.4.2. Where do the benefits of sparsity come from
• initially, deep learning networks were implemented with single-precision floating-point arithmetic (FP32). In the past few years there has been a lot of research into the effect of quantization and reduced-precision arithmetic. In many cases 8 bits or fewer (as low as 2-4 bits) are considered sufficient for correct operation. This can be explained by the large number of parameters (weights), all of which contribute to the accuracy of the operation. For inference on the DSP and EVE, the weights can be quantized to 8-12 bit precision (controlled by a parameter in the import tool configuration file). The activation outputs (neuron outputs) are stored in memory with 8-bit precision (a single byte). Accumulation is done with 40-bit precision, but the final output is shifted right before a single byte is stored to memory. The right-shift count is determined dynamically, is unique for each layer, and is set once per frame (a simplified numeric illustration is given after this list). For more details, see https://openaccess.thecvf.com/content_cvpr_2017_workshops/w4/papers/Mathew_Sparse_Quantized_Full_CVPR_2017_paper.pdf
• additional optimization (described in the paper above) is based on sparsification of the convolution layer weights. During training, individual weights are forced to zero. This is achieved in the "L1 regularization" stage (which favors fewer, larger weights at the expense of the others) and in the "sparsification" stage (where small weights are clamped to zero). The desired training target can be specified (for example, 70% or 80% of all weights being zero). During inference, the computation is reorganized so that a single weight parameter is multiplied across all input values. If the weight is zero, the multiplications over all input data (for that input channel) are skipped. All computations are done on blocks preloaded into local L2 memory (using "shadow" EDMA transfers).
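As a simplified numeric illustration of the dynamic right shift (an assumed scheme that captures the idea; the exact shift selection used by TIDL may differ): if A_max is the largest 40-bit accumulator magnitude observed for a layer in the current frame, a shift s is chosen so that the shifted values fit into a signed byte, for example

s = ceil(log2(A_max / 127))
y = clamp(acc >> s, -128, 127)

For A_max = 1,000,000 this gives s = 13, so an accumulator value of 500,000 is stored as 500000 >> 13 = 61.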
1.5. Performance data
1.5.1. Verified computational performance of the networks

• JacintoNet11, JSegNet21, JDetNet, MobileNet, SqueezeNet:
Network topology	ROI size	MMAC (million MAC)	Sparsity (%)	EVE using sparse model	EVE using dense model	DSP using sparse model	DSP using dense model	EVE + DSP (optimal model)
MobileNet	224x224	567.70	1.42	•	682.63ms	•	717.11ms	•
SqueezeNet	227x227	390.8	1.46	•	289.76ms	•	1008.92ms	•
InceptionNetV1	224x224	1497.37	2.48	•	785.43ms	•	2235.99ms	•
JacintoNet11	224x224	405.81	73.15	125.9ms	235.70ms	115.91ms	370.64ms	73.55ms
JSegNet21	1024x512	8506.5	76.47	378.18ms	1236.84ms	1101.12ms	3825.95ms	•
JDetNet	768x320	2191.44	61.84	•	•	•	•	197.55ms

• the sparsity reported in the table above is the average sparsity across all convolution layers.
• optimal model - optimal placement of layers between EVE and DSP (some NN layers run faster on the DSP, such as SoftMax; the ARP32 in the EVE emulates floating-point operations in software, which can be quite slow).
• the next release will improve performance by using the best layer placement (the SoftMax layer implementation on EVE is slower) and will enable, for example, JacintoNet11 at up to 28-30 fps on the AM5749.
1.5.2. Accuracy of selected networks
For convenience, the following tables are copied from the https://github.com/tidsp/caffe-jacinto-models documentation.
• image classification: top-1 classification accuracy is the probability that the ground-truth class is the top-ranked candidate. Top-5 classification accuracy is the probability that the ground-truth class is among the top five candidates.

Configuration (dataset: ImageNet, 1000 classes)	Top-1 accuracy
JacintoNet11 non-sparse	60.9%
JacintoNet11 layerwise threshold sparse (80%)	57.3%
JacintoNet11 channelwise threshold sparse (80%)	59.7%

• image segmentation: mean intersection over union (IoU) is the ratio of true positives to the sum of true positives, false negatives and false positives.

Configuration (dataset: Cityscapes, 5 classes)	Pixel accuracy	Mean IoU
Initial L2 regularized training	96.20%	83.23%
L1 regularized training	96.32%	83.94%
Sparse fine tuned (~80% zero coefficients)	96.11%	82.85%
Sparse (80%), Quantized (8-bit dynamic fixed point)	95.91%	82.15%

• object detection: validation accuracy can be measured as classification accuracy or mean average precision (mAP). Note the small accuracy difference between the initial (dense) and sparse models, while the performance improvement can be 2x-4x:

Configuration (dataset: VOC0712)	mAP
Initial L2 regularized training	68.66%
L1 regularized fine tuning	68.07%
Sparse fine tuned (~61% zero coefficients)	65.77%

1.6. Troubleshooting
• verify that the OpenCL stack comes up during Linux boot and that the OpenCL firmware is downloaded to the DSPs and EVEs. Since the OpenCL monitor for IPU1 (which controls the EVEs) is newly added, check its trace by entering the following command on the target: cat /sys/kernel/debug/remoteproc/remoteproc0/trace0. The expected output below indicates the number of available EVE accelerators (the trace shown is from an AM5729 and reports 4 EVEs; devices with fewer EVEs will report fewer):

[0][      0.000] 17 Resource entries at 0x3000
[0][      0.000] [t=0x000aa3b3] xdc.runtime.Main: 4 EVEs Available
[0][      0.000] [t=0x000e54bf] xdc.runtime.Main: Creating msg queue...
[0][      0.000] [t=0x000fb885] xdc.runtime.Main: OCL:EVEProxy:MsgQ ready
[0][      0.000] [t=0x0010a1a1] xdc.runtime.Main: Heap for EVE ready
[0][      0.000] [t=0x00116903] xdc.runtime.Main: Booting EVEs...
[0][      0.000] [t=0x00abf9a9] xdc.runtime.Main: Starting BIOS...
[0][      0.000] registering rpmsg-proto:rpmsg-proto service on 61 with HOST
[0][      0.000] [t=0x00b23903] xdc.runtime.Main: Attaching to EVEs...
[0][      0.007] [t=0x00bdf757] xdc.runtime.Main: EVE1 attached
[0][      0.010] [t=0x00c7eff5] xdc.runtime.Main: EVE2 attached
[0][      0.013] [t=0x00d1b41d] xdc.runtime.Main: EVE3 attached
[0][      0.016] [t=0x00db9675] xdc.runtime.Main: EVE4 attached
[0][      0.016] [t=0x00dc967f] xdc.runtime.Main: Opening MsgQ on EVEs...
[0][      1.017] [t=0x013b958a] xdc.runtime.Main: OCL:EVE1:MsgQ opened
[0][      2.019] [t=0x019ae01a] xdc.runtime.Main: OCL:EVE2:MsgQ opened
[0][      3.022] [t=0x01fa62bf] xdc.runtime.Main: OCL:EVE3:MsgQ opened
[0][      4.026] [t=0x025a4a1f] xdc.runtime.Main: OCL:EVE4:MsgQ opened
[0][      4.026] [t=0x025b4143] xdc.runtime.Main: Pre-allocating msgs to EVEs...
[0][      4.027] [t=0x0260edc5] xdc.runtime.Main: Done OpenCL runtime initialization. Waiting for messages...

• verify that CMEM is active and running:

cat /proc/cmem
lsmod | grep cmem

• the default CMEM size may not be enough for devices with more than 2 EVEs (allow roughly 56-64MB of CMEM per EVE).
• verify the model preparation procedure
• if importing an external model fails, the import process may not give much information.
For example, if the format is not recognized, you may see a report like the following (in this case, from an attempt to import a Keras model):

$ ./tidl_model_import.out ./modelInput/tidl_import_mymodel.txt
TF Model File : ./modelInput/mymodel
Num of Layer Detected :   0
Total Giga Macs : 0.0000

Processing config file ./tempDir/qunat_stats_config.txt !
  0, TIDL_DataLayer                ,  0,   0 ,  0 ,  x ,  x ,  x ,  x ,  x ,  x ,  x ,  x ,  0 ,    0 ,    0 ,    0 ,    0 ,    0 ,    0 ,    0 ,    0 ,

Processing Frame Number : 0

End of config list found !

• dataset preparation issues
• good lighting is ideal when preparing a training set.
• augmentation
• equivalence between desktop Caffe execution and target execution
For this purpose, the simulation tool can be used, since it is bit-exact with EVE or DSP execution. The traces generated by the simulation tool can be compared directly with the blobs saved after desktop Caffe inference. If the final result is not correct, it is worth comparing the intermediate results. Keep in mind that numerical equivalence between desktop Caffe computation (using single-precision FP32) and target computation (using 8-bit activations and 8-12 bit weights) is not possible; still, the intermediate feature maps (middle layers) should be quite similar. If there is a significant difference, try changing the number of bits used for the weights, or repeat the import process with a more representative input image. This problem should rarely be encountered.
• typical runtime errors (when to restart the platform)

... inc/executor.h:199: T* tidl::malloc_ddr(size_t) [with T = char; size_t = unsigned int]: Assertion `val != nullptr' failed.
This means that a previous run failed to de-allocate CMEM memory. Rebooting is one option; restarting the ti-mctd daemon is another.
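A recovery sketch (the service name and init system are assumptions about the default Processor SDK filesystem; adjust to match your setup):

# Option 1: restart the ti-mctd daemon (per the note above) so the heap it manages is re-created
systemctl restart ti-mctd

# Option 2: reboot the board
reboot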

Keywords: neural networks Deep Learning caffe

Added by mikeatrpi on Sun, 03 Oct 2021 02:06:04 +0300