I recently tried to fine-tune BERT for text classification with the HuggingFace 🤗 Transformers library under PyTorch. Most of the Chinese blog posts I found do not describe the data processing in any detail, and I did not know what format the dataset should take, so I am recording the process here.
Dependencies
pytorch
transformers
scikit-learn
Loading the pre-trained model
Loading a pre-trained model is well encapsulated by the HuggingFace Transformers library, so there is not much to say here.
args.pretrain is either a model name or the path to a pre-trained model you have prepared yourself. Pre-trained models can be found and downloaded from https://huggingface.co/models; a model needs to include three files (config.json, vocab.txt, pytorch_model.bin). Taking BERT as an example, prepare the bert-base-chinese model; args.pretrain is then the path of the directory where these three files are stored.
Because the downstream task is text classification, the model used here is BertForSequenceClassification from transformers; you can also select other models as needed.
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained(args.pretrain)
model = BertForSequenceClassification.from_pretrained(args.pretrain, num_labels=2, output_hidden_states=False)
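As a quick sanity check (not part of the original pipeline), you can push a single sentence through the tokenizer and the still-untrained classification head; the example sentence is made up, and on recent transformers versions the model output exposes a .logits attribute:

import torch

# Hypothetical sanity check: encode one sentence and look at the raw logits.
inputs = tokenizer("这部电影很好看", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits   # shape: (1, num_labels)
print(logits.softmax(dim=-1))         # class probabilities before any fine-tuning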
Model fine-tuning
For fine-tuning, the Trainer module provided by Transformers is used, and the meaning of most parameters is clear at a glance. Here early stopping is enabled and the best model is loaded back according to precision. It is worth noting that multiple checkpoints are saved during training, so evaluation_strategy and save_total_limit should be set to avoid filling up the hard disk: a single BERT checkpoint takes almost 1 GB.
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback
from sklearn.metrics import classification_report, precision_score, \
    recall_score, f1_score, accuracy_score, precision_recall_fscore_support


def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }


training_args = TrainingArguments(
    output_dir=args.save_path,          # directory where result files are stored
    overwrite_output_dir=True,
    num_train_epochs=args.epoch,
    per_device_train_batch_size=args.batch_size,
    per_device_eval_batch_size=args.batch_size,
    learning_rate=1e-5,
    eval_steps=500,
    load_best_model_at_end=True,
    metric_for_best_model="precision",  # criterion for the best model; here the checkpoint with the highest precision is kept
    weight_decay=0.01,
    warmup_steps=500,
    evaluation_strategy="steps",        # evaluate every eval_steps steps; set to "epoch" to evaluate once per epoch
    logging_strategy="steps",
    save_strategy='steps',
    logging_steps=100,
    save_total_limit=3,
    seed=2021,
    logging_dir=args.logging_dir        # directory where logs are stored
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_set,
    eval_dataset=valid_set,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # early stopping callback
)
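To see what compute_metrics actually receives, here is a small standalone sketch with made-up logits and labels; EvalPrediction is the object the Trainer passes in during evaluation:

import numpy as np
from transformers import EvalPrediction

# Made-up logits for 4 examples and their true labels, just to illustrate the shapes involved.
fake_logits = np.array([[0.2, 0.8], [0.9, 0.1], [0.3, 0.7], [0.6, 0.4]])
fake_labels = np.array([1, 0, 1, 1])

pred = EvalPrediction(predictions=fake_logits, label_ids=fake_labels)
print(compute_metrics(pred))  # e.g. {'accuracy': 0.75, 'f1': ..., 'precision': ..., 'recall': ...}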
Data preprocessing
Finally, let's talk about the data preprocessing stage. Why is the first step of the pipeline discussed last? Because, as readers may have noticed, Transformers 🤗 is so well encapsulated that data preprocessing is the only part of the whole pipeline that really needs to be customized. It is actually quite simple: when creating the Trainer you need to pass in train_dataset and eval_dataset, both of type torch.utils.data.Dataset (please refer to another article for dataset processing in PyTorch). What we need to do is make the __getitem__ method return a dict containing the elements required as BERT input.
Talk is cheap, here is the code:
import os

import torch
from torch.utils.data import Dataset


class MyDataset(Dataset):
    def __init__(self, file, tokenizer, max_len=512):
        assert os.path.exists(file)
        data = open(file, 'r', encoding='utf-8').read().strip().split('\n')
        texts = [x.split('\t')[0][:max_len - 2] for x in data]
        labels = [int(x.split('\t')[1]) for x in data]
        self.encodings = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
        self.labels = torch.tensor(labels)

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)
The assumed data format is "text\tlabel", i.e. the text and the label are separated by a tab character \t, with one example per line, as in:
My dream is the sea of stars\t1
Ambition lies in bed\t0
In __init__, the data is read and each text is truncated to the maximum length max_len (since the tokenizer automatically adds [CLS] and [SEP], the maximum length has to be reduced by 2). After the text and label are split from each line, tokenizer(texts, padding=True, truncation=True, return_tensors='pt') produces the input encodings required by BERT (a transformers.BatchEncoding object; encoding.data contains three tensors, 'input_ids', 'token_type_ids' and 'attention_mask', which usually need no additional processing in a fine-tuning task). Turning the labels into a tensor then basically completes the dataset initialization.
The key part is the __getitem__ method. It needs to return a dict containing four keys: input_ids, token_type_ids, attention_mask and labels (taking BERT as an example; other models may differ slightly. Note that even though each item is a single example, the label key is called "labels"). That dict, the item in the code example, is what gets returned.
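Putting it together, a minimal usage sketch; the file names train.tsv and dev.tsv are placeholders for any tab-separated "text\tlabel" files:

# Hypothetical file names; any files in the format described above will work.
train_set = MyDataset('train.tsv', tokenizer)
valid_set = MyDataset('dev.tsv', tokenizer)

item = train_set[0]
print(item.keys())               # dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'labels'])
print(item['input_ids'].shape)   # (padded_seq_len,)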
Start training!
The last step is to start training! Just wait for the model to finish training.
trainer.train()
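After training, a minimal sketch of what you might do next, assuming the setup above (trainer.evaluate, trainer.save_model and trainer.state are standard Trainer members):

# Evaluate the best model that was loaded back at the end of training.
print(trainer.evaluate())

# Save the final (best) model and tokenizer to the output directory.
trainer.save_model(args.save_path)
print(trainer.state.best_model_checkpoint)  # path of the checkpoint with the highest precision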
Code example
It hasn't been pushed to GitHub yet; I will update this later.
Reference
Fine-tuning pretrained NLP models with Huggingface's Trainer