Machine translation based on Transformer
Machine translation is the process of using computers to convert one natural language (source language) into another natural language (target language).
This project is a PaddlePaddle implementation of Transformer, the mainstream model for machine translation. It covers model training, prediction, and the use of custom data, so users can build their own translation model from the material published here.
Transformer is a network architecture proposed in the paper Attention Is All You Need for sequence-to-sequence (Seq2Seq) learning tasks such as machine translation. It relies entirely on the attention mechanism to perform sequence-to-sequence modeling.
Figure 1: Transformer network structure
Compared with the recurrent neural networks (RNNs) widely used in earlier Seq2Seq models, using self-attention to map the input sequence to the output sequence has the following main advantages:
- Low computational complexity
- For a sequence of length n with feature dimension d, the per-layer complexity of an RNN is O(n·d²) (n time steps, each performing a d-dimensional matrix-vector multiplication), while that of self-attention is O(n²·d) (every pair of positions computes a d-dimensional dot product or another correlation function). Since n is usually smaller than d, self-attention is cheaper; see the quick comparison sketched after this list.
- High computational parallelism
- In an RNN, the computation at the current time step depends on the result of the previous time step; in self-attention, the computation at each position depends only on the input, not on the output of earlier positions, so all positions can be computed fully in parallel.
- Easy to learn long-range dependencies
- In an RNN, establishing an association between two positions that are n steps apart takes n computation steps; in self-attention, any two positions are directly connected. The shorter the path, the easier it is for the signal to propagate.
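As a back-of-the-envelope illustration of the complexity comparison above (the sequence length n = 50 and feature dimension d = 512 below are example values only, not taken from this project):

# Rough per-layer operation counts: self-attention is O(n^2 * d),
# a recurrent layer is O(n * d^2). n and d are illustrative values.
n, d = 50, 512                      # sequence length, feature dimension
self_attention_ops = n * n * d      # pairwise dot products of d-dim vectors
rnn_ops = n * d * d                 # one d x d matrix-vector product per step

print(f"self-attention: {self_attention_ops:,} ops")   # 1,280,000
print(f"recurrent:      {rnn_ops:,} ops")              # 13,107,200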
The self-attention-based sequence modeling module introduced in Transformer has since been widely used in semantic representation models such as BERT and has achieved remarkable results.
More detailed interpretation of Transformer
Environment introduction
- PaddlePaddle framework: the latest 2.1 version is installed by default on the AI Studio platform.
- PaddleNLP: deeply compatible with framework 2.1, it is the best practice of the PaddlePaddle framework 2.1 in the NLP field.
!unzip -o transformer_mt.zip
%cd transformer_mt/
[Errno 2] No such file or directory: 'transformer_mt/' /home/aistudio/transformer_mt
# Install dependencies
!pip install --upgrade paddlenlp -i https://pypi.tuna.tsinghua.edu.cn/simple
!pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already up-to-date: paddlenlp in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (2.1.0)
... (remaining pip output omitted: all other requirements already satisfied)
Pipeline
Figure 2: Pipeline
import os
import time
import yaml
import logging
import argparse
from pprint import pprint
from functools import partial

import numpy as np
from attrdict import AttrDict
import jieba

import paddle
import paddle.distributed as dist
from paddle.io import DataLoader, BatchSampler
from paddlenlp.data import Vocab, Pad
from paddlenlp.datasets import load_dataset
from paddlenlp.transformers import TransformerModel, InferTransformerModel, CrossEntropyCriterion, position_encoding_init
from paddlenlp.utils.log import logger

from utils import post_process_seq
1. Data preprocessing
This tutorial uses the Chinese-English parallel data from the CWMT dataset as the training corpus.
The CWMT dataset contains more than 9 million sentence pairs and is of high quality, making it well suited for training Transformer machine translation models.
Chinese requires Jieba word segmentation followed by BPE; English requires only BPE.
BPE (Byte Pair Encoding)
Benefits of BPE:
- it compresses the vocabulary;
- it alleviates the OOV (out-of-vocabulary) problem to a certain extent.
Figure 3: learn BPE
Figure 5: Jieba+BPE
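For reference, the Chinese side of this preprocessing can be sketched in Python with the jieba and subword_nmt packages already listed in requirements.txt. The BPE codes file bpe.ch and the sample sentence below are hypothetical; the authoritative pipeline is the one in preprocess.sh.

# A minimal sketch of the Chinese preprocessing pipeline (jieba + BPE).
# 'bpe.ch' and the sample sentence are placeholders; the real paths and
# steps are defined in preprocess.sh.
import jieba
from subword_nmt.apply_bpe import BPE

# Load BPE merge operations learned beforehand with subword-nmt learn-bpe.
with open('bpe.ch', encoding='utf8') as codes:
    bpe = BPE(codes)

def preprocess_zh(line):
    # 1) Jieba word segmentation: split the raw sentence into words.
    segmented = ' '.join(jieba.cut(line.strip()))
    # 2) BPE: further split words into subword units (marked with '@@').
    return bpe.process_line(segmented)

print(preprocess_zh('今天天气不错。'))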
# Data preprocessing: jieba word segmentation, BPE subword segmentation, and vocabulary construction.
!bash preprocess.sh
jieba tokenize...
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.705 seconds.
Prefix dict has been built successfully.
source learn-bpe and apply-bpe...
no pair has frequency >= 2. Stopping
target learn-bpe and apply-bpe...
no pair has frequency >= 2. Stopping
source get-vocab. if loading pretrained model, use its vocab.
target get-vocab. if loading pretrained model, use its vocab.
Over.
# Download the pre-trained model
!bash get_data_and_model.sh
Over.
2. Construct Dataloader
The create_data_loader function below creates the DataLoader objects for the training and validation sets, and the create_infer_loader function creates the DataLoader object for the prediction set.
A DataLoader yields data batch by batch. The PaddleNLP built-in utilities called in these functions are briefly described below:
- paddlenlp.data.Vocab.load_vocabulary: Vocab is a vocabulary class that collects a set of methods for mapping between text tokens and ids, and supports building a vocabulary from files, dictionaries, JSON, and other sources.
- paddlenlp.datasets.load_dataset: when creating a dataset from local files, it is recommended to write a read function suited to the format of the local data and pass it to load_dataset() to create the dataset.
- paddlenlp.data.Pad: pads a batch of sequences to the same length.
For details, please refer to the PaddleNLP documentation.
Figure 6: process of constructing Dataloader
Figure 7: Dataloader details
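Before looking at the full code, here is a minimal illustration of how paddlenlp.data.Pad collates a batch; the token ids below are made up for the example.

# A minimal sketch of paddlenlp.data.Pad: pad a batch of variable-length
# token-id lists to the same length with a given pad value.
from paddlenlp.data import Pad

word_pad = Pad(pad_val=0)                      # 0 is used as the padding id here
batch = [[5, 8, 13], [2, 7], [4, 9, 11, 6]]    # hypothetical token ids
print(word_pad(batch))
# -> array of shape (3, 4); shorter rows are padded with 0 on the right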
# Custom function for reading local data
def read(src_path, tgt_path, is_predict=False):
    if is_predict:
        with open(src_path, 'r', encoding='utf8') as src_f:
            for src_line in src_f.readlines():
                src_line = src_line.strip()
                if not src_line:
                    continue
                yield {'src': src_line, 'tgt': ''}
    else:
        with open(src_path, 'r', encoding='utf8') as src_f, open(tgt_path, 'r', encoding='utf8') as tgt_f:
            for src_line, tgt_line in zip(src_f.readlines(), tgt_f.readlines()):
                src_line = src_line.strip()
                if not src_line:
                    continue
                tgt_line = tgt_line.strip()
                if not tgt_line:
                    continue
                yield {'src': src_line, 'tgt': tgt_line}

# Filter out samples whose length (plus special tokens) is below min_len or above max_len
def min_max_filer(data, max_len, min_len=0):
    # 1 for special tokens.
    data_min_len = min(len(data[0]), len(data[1])) + 1
    data_max_len = max(len(data[0]), len(data[1])) + 1
    return (data_min_len >= min_len) and (data_max_len <= max_len)
# Create the dataloaders for the training and validation sets
def create_data_loader(args):
    train_dataset = load_dataset(read,
                                 src_path=args.training_file.split(',')[0],
                                 tgt_path=args.training_file.split(',')[1],
                                 lazy=False)
    dev_dataset = load_dataset(read,
                               src_path=args.validation_file.split(',')[0],
                               tgt_path=args.validation_file.split(',')[1],
                               lazy=False)

    src_vocab = Vocab.load_vocabulary(
        args.src_vocab_fpath,
        bos_token=args.special_token[0],
        eos_token=args.special_token[1],
        unk_token=args.special_token[2])
    trg_vocab = Vocab.load_vocabulary(
        args.trg_vocab_fpath,
        bos_token=args.special_token[0],
        eos_token=args.special_token[1],
        unk_token=args.special_token[2])

    # Round the vocabulary sizes up to a multiple of pad_factor
    padding_vocab = (
        lambda x: (x + args.pad_factor - 1) // args.pad_factor * args.pad_factor
    )
    args.src_vocab_size = padding_vocab(len(src_vocab))
    args.trg_vocab_size = padding_vocab(len(trg_vocab))

    def convert_samples(sample):
        source = sample['src'].split()
        target = sample['tgt'].split()
        source = src_vocab.to_indices(source)
        target = trg_vocab.to_indices(target)
        return source, target

    # Training set dataloader and validation set dataloader
    data_loaders = []
    for i, dataset in enumerate([train_dataset, dev_dataset]):
        dataset = dataset.map(convert_samples, lazy=False).filter(
            partial(min_max_filer, max_len=args.max_length))
        # BatchSampler: https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/io/BatchSampler_cn.html
        batch_sampler = BatchSampler(dataset, batch_size=args.batch_size,
                                     shuffle=True, drop_last=False)
        # DataLoader: https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/io/DataLoader_cn.html
        data_loader = DataLoader(
            dataset=dataset,
            batch_sampler=batch_sampler,
            collate_fn=partial(
                prepare_train_input,
                bos_idx=args.bos_idx,
                eos_idx=args.eos_idx,
                pad_idx=args.bos_idx),
            num_workers=0,
            return_list=True)
        data_loaders.append(data_loader)
    return data_loaders


def prepare_train_input(insts, bos_idx, eos_idx, pad_idx):
    """
    Put all padded data needed by training into a list.
    """
    word_pad = Pad(pad_idx)
    src_word = word_pad([inst[0] + [eos_idx] for inst in insts])
    trg_word = word_pad([[bos_idx] + inst[1] for inst in insts])
    lbl_word = np.expand_dims(
        word_pad([inst[1] + [eos_idx] for inst in insts]), axis=2)

    data_inputs = [src_word, trg_word, lbl_word]
    return data_inputs
# Create the dataloader for the test set. The principle and steps are the same
# as above (creating the dataloaders for the training and validation sets).
def create_infer_loader(args):
    dataset = load_dataset(read,
                           src_path=args.predict_file,
                           tgt_path=None,
                           is_predict=True,
                           lazy=False)

    src_vocab = Vocab.load_vocabulary(
        args.src_vocab_fpath,
        bos_token=args.special_token[0],
        eos_token=args.special_token[1],
        unk_token=args.special_token[2])
    trg_vocab = Vocab.load_vocabulary(
        args.trg_vocab_fpath,
        bos_token=args.special_token[0],
        eos_token=args.special_token[1],
        unk_token=args.special_token[2])

    padding_vocab = (
        lambda x: (x + args.pad_factor - 1) // args.pad_factor * args.pad_factor
    )
    args.src_vocab_size = padding_vocab(len(src_vocab))
    args.trg_vocab_size = padding_vocab(len(trg_vocab))

    def convert_samples(sample):
        source = sample['src'].split()
        target = sample['tgt'].split()
        source = src_vocab.to_indices(source)
        target = trg_vocab.to_indices(target)
        return source, target

    dataset = dataset.map(convert_samples, lazy=False)

    # BatchSampler: https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/io/BatchSampler_cn.html
    batch_sampler = BatchSampler(dataset, batch_size=args.infer_batch_size, drop_last=False)

    # DataLoader: https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/io/DataLoader_cn.html
    data_loader = DataLoader(
        dataset=dataset,
        batch_sampler=batch_sampler,
        collate_fn=partial(
            prepare_infer_input,
            bos_idx=args.bos_idx,
            eos_idx=args.eos_idx,
            pad_idx=args.bos_idx),
        num_workers=0,
        return_list=True)
    return data_loader, trg_vocab.to_tokens


def prepare_infer_input(insts, bos_idx, eos_idx, pad_idx):
    """
    Put all padded data needed by beam search decoder into a list.
    """
    word_pad = Pad(pad_idx)
    src_word = word_pad([inst[0] + [eos_idx] for inst in insts])
    return [src_word, ]
3. Build model
PaddleNLP provides a Transformer API that can be called directly:
- paddlenlp.transformers.TransformerModel: the implementation of the Transformer model
- paddlenlp.transformers.InferTransformerModel: the Transformer model used for generation (inference)
- paddlenlp.transformers.CrossEntropyCriterion: computes the cross-entropy loss
- paddlenlp.transformers.position_encoding_init: initializes the Transformer position encoding
Figure 8: Model Construction
Figure 9: Example
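As a rough sketch, the two model classes are instantiated as follows with the hyperparameters from transformer.base.yaml used later in this tutorial; this simply mirrors the calls inside do_train and do_predict below rather than introducing anything new.

# A minimal sketch of instantiating the PaddleNLP Transformer APIs with the
# hyperparameter values from transformer.base.yaml (see the args dump below).
from paddlenlp.transformers import TransformerModel, InferTransformerModel

common = dict(
    src_vocab_size=10000, trg_vocab_size=10000, max_length=257,
    num_encoder_layers=6, num_decoder_layers=6, n_head=8,
    d_model=512, d_inner_hid=2048, dropout=0.1,
    weight_sharing=False, bos_id=0, eos_id=1)

# Training model: takes source and target tokens and returns logits.
model = TransformerModel(**common)

# Inference model: additionally configures beam search decoding.
infer_model = InferTransformerModel(**common, beam_size=5, max_out_len=256)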
4. Training model
Run the do_train function. Inside do_train, the optimizer, the loss function, and the evaluation metric (perplexity) are configured.
Perplexity is commonly used to measure the quality of a language model, i.e. how fluent its sentences are, and is widely used in machine translation and text generation. The lower the perplexity, the more fluent the sentences and the better the language model.
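Concretely, perplexity is the exponential of the average per-token cross-entropy loss, which is exactly how the training loop below reports it (np.exp of the clipped average loss). A tiny worked example with made-up loss values:

# Perplexity = exp(average per-token cross-entropy), as in the training loop
# below (np.exp(min(avg_loss, 100))). The loss values here are made up.
import numpy as np

for avg_loss in (10.5, 5.0, 2.0):           # hypothetical average losses
    ppl = np.exp(min(avg_loss, 100))        # clip to avoid overflow, as below
    print(f"avg loss: {avg_loss:.1f} -> ppl: {ppl:.2f}")
# e.g. a loss of 10.5 corresponds to a perplexity of about 36316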
Figure 10: Training Model
def do_train(args):
    if args.use_gpu:
        place = "gpu"
    else:
        place = "cpu"
    paddle.set_device(place)

    # Set seed for CE
    random_seed = eval(str(args.random_seed))
    if random_seed is not None:
        paddle.seed(random_seed)

    # Define data loader
    (train_loader), (eval_loader) = create_data_loader(args)

    # Define model
    transformer = TransformerModel(
        src_vocab_size=args.src_vocab_size,
        trg_vocab_size=args.trg_vocab_size,
        max_length=args.max_length + 1,
        num_encoder_layers=args.n_layer,
        num_decoder_layers=args.n_layer,
        n_head=args.n_head,
        d_model=args.d_model,
        d_inner_hid=args.d_inner_hid,
        dropout=args.dropout,
        weight_sharing=args.weight_sharing,
        bos_id=args.bos_idx,
        eos_id=args.eos_idx)

    # Define loss
    criterion = CrossEntropyCriterion(args.label_smooth_eps, args.bos_idx)

    scheduler = paddle.optimizer.lr.NoamDecay(
        args.d_model, args.warmup_steps, args.learning_rate, last_epoch=0)

    # Define optimizer
    optimizer = paddle.optimizer.Adam(
        learning_rate=scheduler,
        beta1=args.beta1,
        beta2=args.beta2,
        epsilon=float(args.eps),
        parameters=transformer.parameters())

    step_idx = 0

    # Train loop
    for pass_id in range(args.epoch):
        batch_id = 0
        for input_data in train_loader:
            (src_word, trg_word, lbl_word) = input_data
            logits = transformer(src_word=src_word, trg_word=trg_word)
            sum_cost, avg_cost, token_num = criterion(logits, lbl_word)

            # Compute gradients
            avg_cost.backward()
            # Update parameters
            optimizer.step()
            # Clear gradients
            optimizer.clear_grad()

            if (step_idx + 1) % args.print_step == 0 or step_idx == 0:
                total_avg_cost = avg_cost.numpy()
                logger.info(
                    "step_idx: %d, epoch: %d, batch: %d, avg loss: %f, "
                    " ppl: %f " %
                    (step_idx, pass_id, batch_id, total_avg_cost,
                     np.exp([min(total_avg_cost, 100)])))

            if (step_idx + 1) % args.save_step == 0:
                # Validation
                transformer.eval()
                total_sum_cost = 0
                total_token_num = 0
                with paddle.no_grad():
                    for input_data in eval_loader:
                        (src_word, trg_word, lbl_word) = input_data
                        logits = transformer(
                            src_word=src_word, trg_word=trg_word)
                        sum_cost, avg_cost, token_num = criterion(logits, lbl_word)
                        total_sum_cost += sum_cost.numpy()
                        total_token_num += token_num.numpy()
                        total_avg_cost = total_sum_cost / total_token_num
                    logger.info("validation, step_idx: %d, avg loss: %f, "
                                " ppl: %f" %
                                (step_idx, total_avg_cost,
                                 np.exp([min(total_avg_cost, 100)])))
                transformer.train()

                if args.save_model:
                    model_dir = os.path.join(args.save_model,
                                             "step_" + str(step_idx))
                    if not os.path.exists(model_dir):
                        os.makedirs(model_dir)
                    paddle.save(transformer.state_dict(),
                                os.path.join(model_dir, "transformer.pdparams"))
                    paddle.save(optimizer.state_dict(),
                                os.path.join(model_dir, "transformer.pdopt"))

            batch_id += 1
            step_idx += 1
            scheduler.step()

    if args.save_model:
        model_dir = os.path.join(args.save_model, "step_final")
        if not os.path.exists(model_dir):
            os.makedirs(model_dir)
        paddle.save(transformer.state_dict(),
                    os.path.join(model_dir, "transformer.pdparams"))
        paddle.save(optimizer.state_dict(),
                    os.path.join(model_dir, "transformer.pdopt"))
# Read in the parameters
yaml_file = 'transformer.base.yaml'
with open(yaml_file, 'rt') as f:
    args = AttrDict(yaml.safe_load(f))
pprint(args)
{'batch_size': 50,
 'beam_size': 5,
 'beta1': 0.9,
 'beta2': 0.997,
 'bos_idx': 0,
 'd_inner_hid': 2048,
 'd_model': 512,
 'dropout': 0.1,
 'eos_idx': 1,
 'epoch': 1,
 'eps': '1e-9',
 'infer_batch_size': 50,
 'init_from_params': 'trained_models/CWMT2021_step_345000/',
 'label_smooth_eps': 0.1,
 'learning_rate': 2.0,
 'max_length': 256,
 'max_out_len': 256,
 'n_best': 1,
 'n_head': 8,
 'n_layer': 6,
 'output_file': 'train_dev_test/predict.txt',
 'pad_factor': 8,
 'predict_file': 'train_dev_test/ccmt2019-news.zh2en.source_bpe',
 'print_step': 10,
 'random_seed': 'None',
 'save_model': 'trained_models',
 'save_step': 20,
 'special_token': ['<s>', '<e>', '<unk>'],
 'src_vocab_fpath': 'train_dev_test/vocab.ch.src',
 'src_vocab_size': 10000,
 'training_file': 'train_dev_test/train.ch.bpe,train_dev_test/train.en.bpe',
 'trg_vocab_fpath': 'train_dev_test/vocab.en.tgt',
 'trg_vocab_size': 10000,
 'unk_idx': 2,
 'use_gpu': True,
 'validation_file': 'train_dev_test/dev.ch.bpe,train_dev_test/dev.en.bpe',
 'warmup_steps': 8000,
 'weight_sharing': False}
do_train(args)
[2021-10-20 18:45:23,800] [ INFO] - step_idx: 0, epoch: 0, batch: 0, avg loss: 10.526473, ppl: 37289.726562
[2021-10-20 18:45:24,991] [ INFO] - step_idx: 9, epoch: 0, batch: 9, avg loss: 10.517828, ppl: 36968.742188
[2021-10-20 18:45:26,296] [ INFO] - step_idx: 19, epoch: 0, batch: 19, avg loss: 10.475711, ppl: 35444.054688
[2021-10-20 18:45:26,404] [ INFO] - validation, step_idx: 19, avg loss: 10.480215, ppl: 35604.062500
5. Prediction and evaluation
The final quality of the trained model is generally evaluated on a test set; in machine translation, the BLEU score is the standard metric.
Figure 11: Prediction and evaluation
def do_predict(args):
    if args.use_gpu:
        place = "gpu"
    else:
        place = "cpu"
    paddle.set_device(place)

    # Define data loader
    test_loader, to_tokens = create_infer_loader(args)

    # Define model
    transformer = InferTransformerModel(
        src_vocab_size=args.src_vocab_size,
        trg_vocab_size=args.trg_vocab_size,
        max_length=args.max_length + 1,
        num_encoder_layers=args.n_layer,
        num_decoder_layers=args.n_layer,
        n_head=args.n_head,
        d_model=args.d_model,
        d_inner_hid=args.d_inner_hid,
        dropout=args.dropout,
        weight_sharing=args.weight_sharing,
        bos_id=args.bos_idx,
        eos_id=args.eos_idx,
        beam_size=args.beam_size,
        max_out_len=args.max_out_len)

    # Load the trained model
    assert args.init_from_params, (
        "Please set init_from_params to load the infer model.")
    model_dict = paddle.load(
        os.path.join(args.init_from_params, "transformer.pdparams"))

    # To avoid a longer length than training, reset the size of position
    # encoding to max_length
    model_dict["encoder.pos_encoder.weight"] = position_encoding_init(
        args.max_length + 1, args.d_model)
    model_dict["decoder.pos_encoder.weight"] = position_encoding_init(
        args.max_length + 1, args.d_model)
    transformer.load_dict(model_dict)

    # Set evaluate mode
    transformer.eval()

    f = open(args.output_file, "w")
    with paddle.no_grad():
        for (src_word, ) in test_loader:
            finished_seq = transformer(src_word=src_word)
            finished_seq = finished_seq.numpy().transpose([0, 2, 1])
            for ins in finished_seq:
                for beam_idx, beam in enumerate(ins):
                    if beam_idx >= args.n_best:
                        break
                    id_list = post_process_seq(beam, args.bos_idx, args.eos_idx)
                    word_list = to_tokens(id_list)
                    sequence = " ".join(word_list) + "\n"
                    f.write(sequence)
    f.close()
do_predict(args)
Model evaluation
In the prediction file, each line is the highest-scoring translation for the corresponding input line. Since the data were encoded with BPE, the predicted translations are also in BPE form and must be restored to the original (here, tokenized) text before they can be evaluated correctly.
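For illustration only, the restoration performed by the sed command in the next cell can be expressed in Python as a simple regex substitution; the sample sentence is made up.

# Python equivalent of the sed command below: remove the BPE continuation
# markers ('@@ ' inside a line, or a trailing '@@') to restore tokenized text.
import re

def restore_bpe(line):
    return re.sub(r'(@@ )|(@@ ?$)', '', line)

print(restore_bpe('The wea@@ ther is ni@@ ce today .'))
# -> 'The weather is nice today .'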
# Restore the prediction results in predict.txt to tokenized text
! sed -r 's/(@@ )|(@@ ?$)//g' train_dev_test/predict.txt > train_dev_test/predict.tok.txt

# The BLEU evaluation tool comes from https://github.com/moses-smt/mosesdecoder.git
! tar -zxf mosesdecoder.tar.gz

# Compute multi-bleu
! perl mosesdecoder/scripts/generic/multi-bleu.perl train_dev_test/ccmt2019-news.zh2en.ref*.txt < train_dev_test/predict.tok.txt
BLEU = 38.11, 74.5/49.1/32.5/21.7 (BP=0.951, ratio=0.952, hyp_len=22252, ref_len=23371)
It is not advisable to publish scores from multi-bleu.perl. The scores depend on your tokenizer, which is unlikely to be reproducible from your paper or consistent across research groups. Instead you should detokenize then use mteval-v14.pl, which has a standard tokenization. Scores from multi-bleu.perl can still be used for internal purposes when you have a consistent tokenizer.