I recently tried to fine-tune BERT for text classification with the HuggingFace 🤗 Transformers library under PyTorch. Most of the Chinese blog posts I found do not describe the data processing in any detail, and I did not know what format the dataset should take, so I am recording the process here.
Dependencies
pytorch
transformers
scikit-learn
Loading the pre-trained model
Loading a pre-trained model is well encapsulated by the HuggingFace Transformers library, so there is not much to say here.
args.pretrain is either a model name or the path to a pre-trained model you have prepared yourself. Pre-trained models can be found and downloaded from https://huggingface.co/models; a model needs to include three files (config.json, vocab.txt, pytorch_model.bin). Taking BERT as an example, prepare the bert-base-chinese model; args.pretrain is then the path of the directory where these three files are stored.
Because the downstream task is text classification, the model used here is BertForSequenceClassification from transformers; you can also select other models as needed.
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained(args.pretrain)
model = BertForSequenceClassification.from_pretrained(args.pretrain, num_labels=2, output_hidden_states=False)
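As a quick sanity check (not part of the original pipeline), you can push a single sentence through the tokenizer and the still-untrained classification head; the example sentence is made up, and on recent transformers versions the model output exposes a .logits attribute:

import torch

# Hypothetical sanity check: encode one sentence and look at the raw logits.
inputs = tokenizer("这部电影很好看", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits   # shape: (1, num_labels)
print(logits.softmax(dim=-1))         # class probabilities before any fine-tuning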
Model fine-tuning
For fine-tuning, the Trainer module provided by Transformers is used, and the meaning of most parameters is clear at a glance. Here early stopping is enabled and the best model is loaded back according to precision. It is worth noting that multiple checkpoints are saved during training, so evaluation_strategy and save_total_limit should be set to avoid filling up the hard disk: a single BERT checkpoint takes almost 1 GB.
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback
from sklearn.metrics import classification_report, precision_score, \
    recall_score, f1_score, accuracy_score, precision_recall_fscore_support


def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }


training_args = TrainingArguments(
    output_dir=args.save_path,          # directory where result files are stored
    overwrite_output_dir=True,
    num_train_epochs=args.epoch,
    per_device_train_batch_size=args.batch_size,
    per_device_eval_batch_size=args.batch_size,
    learning_rate=1e-5,
    eval_steps=500,
    load_best_model_at_end=True,
    metric_for_best_model="precision",  # criterion for the best model; here the checkpoint with the highest precision is kept
    weight_decay=0.01,
    warmup_steps=500,
    evaluation_strategy="steps",        # evaluate every eval_steps steps; set to "epoch" to evaluate once per epoch
    logging_strategy="steps",
    save_strategy='steps',
    logging_steps=100,
    save_total_limit=3,
    seed=2021,
    logging_dir=args.logging_dir        # directory where logs are stored
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_set,
    eval_dataset=valid_set,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # early stopping callback
)
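To see what compute_metrics actually receives, here is a small standalone sketch with made-up logits and labels; EvalPrediction is the object the Trainer passes in during evaluation:

import numpy as np
from transformers import EvalPrediction

# Made-up logits for 4 examples and their true labels, just to illustrate the shapes involved.
fake_logits = np.array([[0.2, 0.8], [0.9, 0.1], [0.3, 0.7], [0.6, 0.4]])
fake_labels = np.array([1, 0, 1, 1])

pred = EvalPrediction(predictions=fake_logits, label_ids=fake_labels)
print(compute_metrics(pred))  # e.g. {'accuracy': 0.75, 'f1': ..., 'precision': ..., 'recall': ...}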
Data preprocessing
Finally, let's talk about the data preprocessing stage. Why is the first step of the pipeline discussed last? Because, as readers may have noticed, Transformers 🤗 is so well encapsulated that data preprocessing is the only part of the whole pipeline that really needs to be customized. It is actually quite simple: when creating the Trainer you need to pass in train_dataset and eval_dataset, both of type torch.utils.data.Dataset (please refer to another article for dataset processing in PyTorch). What we need to do is make the __getitem__ method return a dict containing the elements required as BERT input.
Talk is cheap, here is the code:
import os

import torch
from torch.utils.data import Dataset


class MyDataset(Dataset):
    def __init__(self, file, tokenizer, max_len=512):
        assert os.path.exists(file)
        data = open(file, 'r', encoding='utf-8').read().strip().split('\n')
        texts = [x.split('\t')[0][:max_len - 2] for x in data]
        labels = [int(x.split('\t')[1]) for x in data]
        self.encodings = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
        self.labels = torch.tensor(labels)

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)
The assumed data format is "text\tlabel", i.e. the text and the label are separated by a tab character \t, with one example per line, as in:
My dream is the sea of stars\t1
Ambition lies in bed\t0
In __init__, the data is read and each text is truncated to the maximum length max_len (since the tokenizer automatically adds [CLS] and [SEP], the maximum length has to be reduced by 2). After the text and label are split from each line, tokenizer(texts, padding=True, truncation=True, return_tensors='pt') produces the input encodings required by BERT (a transformers.BatchEncoding object; encoding.data contains three tensors, 'input_ids', 'token_type_ids' and 'attention_mask', which usually need no additional processing in a fine-tuning task). Turning the labels into a tensor then basically completes the dataset initialization.
The key part is the __getitem__ method. It needs to return a dict containing four keys: input_ids, token_type_ids, attention_mask and labels (taking BERT as an example; other models may differ slightly. Note that even though each item is a single example, the label key is called "labels"). That dict, the item in the code example, is what gets returned.
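Putting it together, a minimal usage sketch; the file names train.tsv and dev.tsv are placeholders for any tab-separated "text\tlabel" files:

# Hypothetical file names; any files in the format described above will work.
train_set = MyDataset('train.tsv', tokenizer)
valid_set = MyDataset('dev.tsv', tokenizer)

item = train_set[0]
print(item.keys())               # dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'labels'])
print(item['input_ids'].shape)   # (padded_seq_len,)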
Start training!
The last step is to start training! Just wait for the model to finish training.
trainer.train()
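After training, a minimal sketch of what you might do next, assuming the setup above (trainer.evaluate, trainer.save_model and trainer.state are standard Trainer members):

# Evaluate the best model that was loaded back at the end of training.
print(trainer.evaluate())

# Save the final (best) model and tokenizer to the output directory.
trainer.save_model(args.save_path)
print(trainer.state.best_model_checkpoint)  # path of the checkpoint with the highest precision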
Code example
It hasn't been pushed to GitHub yet; I will update this later.
Reference
Fine-tuning pretrained NLP models with Huggingface's Trainer