[model inference] Quantization implementation sharing 3: the ACIQ symmetric quantization algorithm explained in detail



  Hello, I'm Jizhi Horizon. This article analyzes the implementation of the ACIQ symmetric quantization algorithm, taking Tengine's implementation as the example.

  This is the third part of the quantization implementation series. Parts one and two came before it; interested readers can consult them:

  (1) <[model inference] Quantization implementation sharing 1: the Min-Max symmetric quantization algorithm explained in detail>;

  (2) <[model inference] Quantization implementation sharing 2: the KL symmetric quantization algorithm explained in detail>;

  ACIQ is similar to the previous quantization strategies: it picks a clipping threshold T and maps [-T, T] onto the quantized value domain. The difference lies in the process of finding T. This article covers not only the principle but also the implementation of the strategy in Tengine. Let's start.
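  To make that mapping concrete, here is a minimal sketch (my own illustration, not Tengine code) of symmetric int8 quantization with a clipping threshold T:

#include <algorithm>
#include <cmath>
#include <cstdint>

/* clip to [-T, T], then map linearly onto [-127, 127] (zero point = 0) */
static int8_t quantize_symmetric(float x, float T)
{
    float scale = T / 127.f;
    float clipped = std::max(-T, std::min(x, T));
    return (int8_t)std::round(clipped / scale);
}

/* recover an approximation of the real value from the quantized one */
static float dequantize_symmetric(int8_t q, float T)
{
    return q * (T / 127.f);
}

  Everything that follows is about choosing T (called α below) well.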

1. Principle of ACIQ quantization strategy

  The ACIQ quantization strategy was proposed in the paper Post training 4-bit quantization of convolutional networks for rapid-deployment. First, the headline results (the paper's comparison figure is not reproduced here):

  In those results, 8-bit weight quantization and 4-bit activation value quantization are used throughout. In terms of quantization efficiency, ACIQ is about 4000 times faster than the KL quantization process (unbelievable ~). In terms of quantization accuracy, every tested network except ResNet-101 quantizes better than with KL quantization, so ACIQ gives up nothing in either efficiency or effect.

  At the beginning of the paper, the authors write: "Unlike traditional approaches that focus on the quantization at the network level, in this work we propose to minimize the quantization effect at the tensor level." So ACIQ is a quantization strategy that starts from the tensor level, and the derivation proceeds in three steps:

  (1) first, derive a generic expression for the expected MSE of any given distribution as a function of the clipping value;

  (2) then, use this expression to develop a specific expression for each distribution;

  (3) finally, establish the optimal clipping values by solving the equations for which the derivative with respect to the clipping value is set to zero;

  Generally, clipping is needed during quantization to deal with the long tail of the original data. Assuming α is the truncation (clipping) value, the clipping function can be expressed as:

       clip(x, α) = x if |x| ≤ α,  and  sign(x)·α if |x| > α

  ACIQ requires a strong prior assumption: the tensor (feature map) follows a Laplace distribution or a Gaussian distribution. It then uses an optimization argument to solve for the truncation value that minimizes the quantization loss. The whole quantization process maps values from the original distribution onto 2^M discrete quantized values, where M is the number of quantization bits; that is, the range [−α, α] above is divided into 2^M equal bins, as shown in the following figure:

       (figure from the paper: the range [−α, α] divided into 2^M equal quantization bins)
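  As a concrete numeric example (my own arithmetic, not from the paper): with M = 8 and a clipping value α = 4.0, the bin width is Δ = 2α / 2^M = 8 / 256 = 0.03125, and the representative value of bin i is its midpoint q_i = −α + (i − 1/2)·Δ, so q_1 = −3.984375 and q_256 = +3.984375.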

  Assuming the probability density function of the original distribution is f(x), with truncation value α and quantization function Q(x), the L2 loss between the tensor before and after quantization is:

       E[(X − Q(X))^2] = ∫_{−∞}^{−α} f(x)·(x + α)^2 dx + Σ_{i=1}^{2^M} ∫_{l_{i−1}}^{l_i} f(x)·(x − q_i)^2 dx + ∫_{α}^{+∞} f(x)·(x − α)^2 dx

  where l_i = −α + i·Δ are the bin edges, Δ = 2α/2^M is the bin width, and q_i is the representative value of bin i.

  The above formula obviously splits into three parts:

  (1) (−∞, −α];

  (2) [−α, α];

  (3) [α, +∞);

  For a Gaussian distribution N(0, σ^2) or a Laplace distribution Laplace(0, b), both symmetric about 0, parts (1) and (3) are equal by symmetry: each is the expected squared error between |x| and α over the clipped tail. After bisecting [−α, α] into 2^M bins, each quantized value q_1, q_2, ..., q_{2^M} sits at the midpoint of its bin, and part (2) is the accumulated rounding error inside the truncated range. The whole quantization problem thus reduces to finding the truncation value α that minimizes E[(X − Q(X))^2] (deep learning turns into a math problem in the end ~~). Substituting the prior distribution and applying some equivalent transformations of the formula gives the final overall quantization-loss objective:

       E[(X − Q(X))^2] ≈ 2b^2·e^(−α/b) + α^2/(3·2^(2M))    (Laplace prior)

       E[(X − Q(X))^2] ≈ (α^2 + σ^2)·[1 − erf(α/(√2·σ))] − α·σ·√(2/π)·e^(−α^2/(2σ^2)) + α^2/(3·2^(2M))    (Gaussian prior)

  Mathematically, minimizing the objective function ==> take the partial derivative with respect to α and set it to 0.

  For the Laplace distribution, setting the partial derivative to zero gives:

       ∂E/∂α = (2α)/(3·2^(2M)) − 2b·e^(−α/b) = 0

  For the Gaussian distribution, setting the partial derivative to zero gives:

       ∂E/∂α = 2α·[1 − erf(α/(√2·σ))] − 2σ·√(2/π)·e^(−α^2/(2σ^2)) + (2α)/(3·2^(2M)) = 0

  Finally, for either the Laplace or the Gaussian distribution, M is the bit width you want to quantize to, and the distribution parameters b (Laplace scale) and σ (Gaussian standard deviation) can be estimated from the data, so the truncation value α we want follows directly; numerically, the solution has the form α* = c(M)·b or α* = c(M)·σ, where c(M) depends only on M. For symmetric quantization, the truncation value is all that is needed.
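  Those optimal ratios can be checked numerically. The following standalone sketch (my own check, not part of Tengine; it assumes the Gaussian objective reconstructed above) solves the Gaussian stationarity condition by bisection for σ = 1, reproducing ratios close to the table used later in the Tengine code (e.g. α* ≈ 3.924 for M = 8):

#include <cmath>
#include <cstdio>

/* dE/dalpha for N(0, 1): clipping-noise derivative plus rounding-noise derivative */
static double dE(double a, int M)
{
    const double pi = 3.14159265358979323846;
    double clip = 2.0 * a * (1.0 - std::erf(a / std::sqrt(2.0)))
                - 2.0 * std::sqrt(2.0 / pi) * std::exp(-a * a / 2.0);
    double rnd = 2.0 * a / (3.0 * std::pow(4.0, M));
    return clip + rnd;
}

int main()
{
    for (int M = 2; M <= 8; M++)
    {
        /* dE < 0 near 0 and dE > 0 for large alpha, so bisect the sign change */
        double lo = 0.1, hi = 20.0;
        for (int it = 0; it < 100; it++)
        {
            double mid = 0.5 * (lo + hi);
            if (dE(mid, M) < 0.0) lo = mid; else hi = mid;
        }
        printf("M = %d: alpha* = %.8f * sigma\n", M, 0.5 * (lo + hi));
    }
    return 0;
}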

2. Implementation of ACIQ quantization strategy

  Let's look at the implementation of ACIQ in Tengine.

  The main entry code of the quantization flow:

case ALGORITHM_ACIQ:{
    if (quant_tool.scale_file.empty()){
        quant_tool.scale_file = "table_aciq.scale";
        quant_tool.activation_quant_tool();
    }
    save_graph_i8_perchannel(quant_tool.model_file.c_str(), quant_tool.scale_file.c_str(), quant_tool.output_file, quant_tool.inplace, false);
    /* Evaluate quantitative losses */
    if (quant_tool.evaluate){
        fprintf(stderr, "[Quant Tools Info]: Step Evaluate, evaluate quantitative losses\n");
        quant_tool.assess_quant_loss(0);
    }
    break;
}

2.1 Quantization of activation values

  The entry point for activation value quantization:

quant_tool.activation_quant_tool();

  First, the min and max values are found; this follows the same logic as the quantization strategies described earlier, so I won't repeat it. Then comes the ACIQ step:

for (int i = 0; i < ir_graph->tensor_num; i++){
    struct tensor* t = ir_graph->tensor_list[i];
    if (t->tensor_type == TENSOR_TYPE_VAR || t->tensor_type == TENSOR_TYPE_INPUT){
        float absmax = 0.f;
        float act_scale = 1.f;
        int act_zero_point = 0;
        int element_num = t->elem_num;

        /* ACIQ: derive the clip threshold from |max| and the element count */
        absmax = std::max(std::abs(max_activation[i]), std::abs(min_activation[i]));
        float threshold = compute_aciq_gaussian_clip(absmax, element_num, 8);
        act_scale = threshold / 127.f;

        /* the scale of softmax is always scale = 1 / 127.f */
        for (int j = 0; j < ir_graph->node_num; j++){
            struct node* noden = ir_graph->node_list[j];
            struct tensor* tensor_tmp = get_ir_graph_tensor(ir_graph, noden->output_tensors[0]);

            if (!(tensor_tmp->tensor_type == TENSOR_TYPE_INPUT || tensor_tmp->tensor_type == TENSOR_TYPE_VAR))
                continue;

            std::string tmp_op_name = get_op_name_from_type(noden->op.type);
            std::string cur_name = t->name;
            std::string tmp_name = tensor_tmp->name;

            if ((cur_name == tmp_name) && tmp_op_name == "Softmax"){
                act_scale = 1 / 127.f;
                break;
            }
        }

        /* record one "name scale zero_point" line per activation tensor */
        fprintf(fp_aciq, "%s %f %d\n", ir_graph->tensor_list[i]->name, act_scale, act_zero_point);
    }
}
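  For reference, each line written to table_aciq.scale has the form "tensor_name act_scale zero_point"; a hypothetical line (tensor name and values made up for illustration) would look like:

conv1 0.038101 0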

  The key is this function. Tengine's default prior is the Gaussian distribution, with int8 quantization:

float threshold = compute_aciq_gaussian_clip(absmax, element_num, 8);

  Let's take a look at its implementation:

static float compute_aciq_gaussian_clip(float absmax, int N, int num_bits)
{
    /* optimal clip ratios alpha* / sigma, indexed by (num_bits - 1);
       for 8-bit quantization, alpha* = 3.92403714 * sigma */
    const float alpha_gaussian[8] = {0, 1.71063519, 2.15159277, 2.55913646, 2.93620062, 3.28691474, 3.6151146, 3.92403714};

    const double gaussian_const = (0.5 * 0.35) * (1 + sqrt(3.14159265358979323846 * log(4)));

    /* estimate sigma from the observed range (~2 * absmax) of N Gaussian samples */
    double std = (absmax * 2 * gaussian_const) / sqrt(2 * log(N));

    return (float)(alpha_gaussian[num_bits - 1] * std);
}
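  As a quick numeric illustration (input values made up, result approximate): for a tensor with 1,000,000 elements whose recorded absmax is 6.0, gaussian_const ≈ 0.5402, so std ≈ (6.0 · 2 · 0.5402) / sqrt(2 · ln(10^6)) ≈ 6.48 / 5.26 ≈ 1.23, and threshold ≈ 3.92403714 · 1.23 ≈ 4.84, noticeably below absmax: the Gaussian tail gets clipped.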

  This gives the truncation value, from which the scale follows:

act_scale = threshold / 127.f;

  That completes the quantization of the activation values.

2.2 Weight & bias quantization

  The quantization of weights & biases follows the same logic as the Min-Max and KL quantization introduced earlier, so it is not repeated here; a rough sketch follows below.
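  For completeness, a minimal per-channel min-max sketch (my own illustration under the assumption of per-output-channel scales; see parts 1 and 2 of this series for the actual Tengine code):

#include <algorithm>
#include <cmath>

/* per-output-channel symmetric int8 weight quantization: one scale per channel */
static void quantize_weight_perchannel(const float* w, int ch_num, int ch_size,
                                       signed char* q, float* scales)
{
    for (int c = 0; c < ch_num; c++)
    {
        float absmax = 0.f;
        for (int k = 0; k < ch_size; k++)
            absmax = std::max(absmax, std::abs(w[c * ch_size + k]));
        scales[c] = (absmax > 0.f) ? absmax / 127.f : 1.f;   /* guard all-zero channels */
        for (int k = 0; k < ch_size; k++)
            q[c * ch_size + k] = (signed char)std::round(w[c * ch_size + k] / scales[c]);
    }
}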

  Finally, practice confirms that the ACIQ quantization process is very fast, on the order of 4000 times faster than KL quantization; that is no exaggeration. The speed comes mainly from the Gaussian prior: alpha_gaussian, gaussian_const and std are computed in closed form, so no search is needed.

  That is my share of the principle and implementation of ACIQ quantization. I hope it is of some help to your learning.


