Automatic Title Generation Based on BERT

🐱 Text title generation based on BERT

A good title is based on the ingenious refinement of the article content, which can quickly arouse the interest of readers.

In order to generate news headlines quickly and accurately, this project uses the classic BERT model to automatically complete the generation of news headlines.

📖 0 project background

The purpose of headline news is to make a short sentence to attract readers to read the news. The news articles that users are interested in can usually contain multiple headlines with different keywords. Quickly generating appropriate news headlines can help us media practitioners quickly obtain a large amount of traffic from hot information. In addition, it can also be used to summarize the key information of civil servants in the examination. Therefore, the process automation has a wide application prospect.

🍌 1 data set

lcsts summary data is compiled by Harbin Institute of technology. The data set is created based on the news summary released by news media on microblog. Each short article is about 100 characters and each summary is about 20 characters. A total of 2108915 data

Due to the large amount of data, this project only uses part of the data for training

🍉 2 interpretation of the paper


BERT (Bidirectional Encoder Representations from Transformers) means Bidirectional Encoder Representations from Transformers. Unlike the recent language representation model, BERT aims to pre train deep two-way representation based on the left and right contexts of all layers. Therefore, the pre trained BERT representation can be fine tuned with only one additional output layer, so as to create the current optimal model for many tasks (such as question answering and language reasoning), without making a lot of modifications to the task specific architecture.

  • Bert proposed a new pre training method, which can effectively improve the performance of the pre training model in downstream tasks.


Language model pre training has been proved to be effective in improving many natural language processing tasks.

There are two existing strategies for applying pre training language representation to downstream tasks: feature-based and fine tuning.

The main limitation of the fine tuning based method is that the standard language model is one-way, which greatly limits the types of architectures that can be used during pre training.

BERT solves the current one-way limitation by proposing a new pre training goal: masked language model (MLM).

The MLM goal allows representation to fuse left and right contexts. In addition to the masking language model, BERT also introduces a "next sentence prediction" task to jointly pre train the representation of text pairs.

The contributions of this paper are as follows:

    1. It proves the importance of two-way pre training to language representation.
    1. It is shown that the amount of pre training representation can eliminate the needs of many heavy engineering task specific architectures.

(BERT is the first representation model based on fine tuning, which achieves the most advanced performance on a large number of sentence level and chunk level tasks)

    1. BERT advanced the highest level of 11 NLP tasks.

Model framework

BERT model architecture is a multi-layer bidirectional Transformer encoder.

In this paper, the size of feedforward / filter is 4H, such as 3072 when H=768 and 4096 when H=1024.

The results are mainly shown in two model sizes:

  • BERTBASE: L=12, H=768, A=12, total parameter = 110M

  • BERTLARGE: L=24, H=1024, A=16, total parameter = 340M

  • Note: the BERT converter uses a bidirectional self attention mechanism

Input characterization

For each token, its representation consists of corresponding token embedding, segment embedding and position embedding. The sequence length supported by position representation is up to 512 word blocks.
(if you use the bert pre training model, you can't change it. If you train yourself, you can define it yourself. It doesn't have to follow 512. The original text here is to balance efficiency and performance)

Among them, position embedding is proposed by Transformer to encode the information of different locations.

There are two implementation methods: splicing and summation. See for details Positive embedding in Transformer

Pre training task

The authors use two new unsupervised prediction tasks to pre train BERT

  • Masking language model

Randomly mask some parts of the input lexical chunks, and then predict only those blocked lexical chunks. Bert calls this process "masking LM" (MLM).

The specific implementation methods are as follows:
The training data generator randomly selects 15% of the lexical chunks. Then complete the following process:

Instead of always replacing the selected word with [MASK], the data generator will do the following:

  1. 80% of training data: replace words with [MASK] chunks, for example, [my dog is furry!] [my dog is [MASK]]
  2. 10% of training data: replace masking words with random words, for example, [my dog is furry!] [my dog is an apple]
  3. 10% of training data: keep the words unchanged, for example, [my dog is furry!] [my dog is furry!] The purpose of this is to bias the representation towards the actually observed words.
  • Next sentence prediction model

In order to train a model for understanding sentence relations, we pre trained a binary next sentence prediction task, which can be easily generated from any monolingual corpus.

The specific implementation method is as follows: select sentences A and B as the pre training samples: 50% of B may be the next sentence of A, and 50% may be random sentences from the corpus

input=[CLS]Men go[MASK]Shop[SEP]He bought a gallon[MASK]milk[SEP]

Label= IsNext

input=[CLS]man[Mask]Go to the store[SEP]penguin[Mask]It's flying##Shaoniao [SEP]

Label= NotNext


BERT uses a more efficient Transformer structure to efficiently obtain the two-way representation information of training data.

The article has done a very detailed ablation experiment. It is suggested that interested students can read the original text in detail.

reference material





🥝 3 model training

# Install required library functions
!pip install sumeval
# Decompress data set
!unzip  -o  data/data127041/ -d /home/aistudio/work/
Archive:  data/data127041/
  inflating: /home/aistudio/work/lcsts_data.json  
import json
f = open('work/lcsts_data.json')
data = json.load(f)
# View data
item = data[520]
print('title:',item['title'],'\n Content:',item['content'])
Title: Haikou citizens can take the bus with their bank card 
Content: Hainan baodaotong Co., Ltd. launched the "ICBC baodaotong joint name", and citizens can take the bus by swiping their bank card. The card not only has the comprehensive financial service function of ICBC debit card, but also has the electronic payment function of baodaotong card. With this joint name card, cardholders can not only take the bus in Haikou City, but also carry out business operations such as swiping card consumption, deposit and withdrawal
from paddlenlp.transformers import BertModel, BertForTokenClassification
import paddle.nn as nn
from paddle.nn import Layer, Linear, Softmax
import paddle

class TitleBertModel(paddle.nn.Layer):
    be based on BERT Generation model of pre training model
    def __init__(self, pretrained_bert_model: str, input_length: int):
        super(TitleBertModel, self).__init__()
        bert_model = BertModel.from_pretrained(pretrained_bert_model)
        self.vocab_size, self.hidden_size = bert_model.embeddings.word_embeddings.parameters()[0].shape
        self.bert_for_class = BertForTokenClassification(bert_model, self.vocab_size)
        # Generate a lower triangular matrix to mask the information behind the sentence
        self.sequence_length = input_length
        # lower_triangle_mask is input_ length * input_ The lower triangular matrix of length (including the main diagonal), which is part of the attention mask (in the forward direction
        # The part with 0 in the processing will be processed into an infinitesimal quantity to ensure that the weight of the masked part is approximately equal to 0 when calculating the attention weight). The reason why it is written in the form of lower triangular matrix, and
        # transformer's multi head attention calculation mechanism is related, and the details can be learned from relevant papers.
        self.lower_triangle_mask = paddle.tril(paddle.tensor.full((input_length, input_length), 1, 'float32'))

    def forward(self, token, token_type, input_mask, input_length=None):
        # Calculate the attention mask
        mask_left = paddle.reshape(input_mask, input_mask.shape + [1])
        mask_right = paddle.reshape(input_mask, [input_mask.shape[0], 1, input_mask.shape[1]])
        # Enter a valid position in the sentence
        mask_left = paddle.cast(mask_left, 'float32')
        mask_right = paddle.cast(mask_right, 'float32')
        attention_mask = paddle.matmul(mask_left, mask_right)
        # Effective position in the calculation of attention mechanism
        if input_length is not None:
            # The reason why we need to calculate again is that when it is used for reasoning and prediction, the possible input length is not the length set during instantiation. The model here assumes the input during training
            # The length is filled to be consistent - this step is not necessary, but it is more convenient to process it into a consistent length (correspondingly, the cost of video memory is increased).
            lower_triangle_mask = paddle.tril(paddle.tensor.full((input_length, input_length), 1, 'float32'))
            lower_triangle_mask = self.lower_triangle_mask

        attention_mask = attention_mask * lower_triangle_mask

        # Invalid position set to minimum
        attention_mask = (1 - paddle.unsqueeze(attention_mask, axis=[1])) * -1e10
        attention_mask = paddle.cast(attention_mask, self.bert_for_class.parameters()[0].dtype)

        output_logits = self.bert_for_class(token, token_type_ids=token_type, attention_mask=attention_mask)
        return output_logits
# Read pre training model
from paddlenlp.transformers import BertTokenizer
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
[2022-02-08 15:54:15,141] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/bert-base-chinese/bert-base-chinese-vocab.txt
import paddle
import numpy as np
# Custom data reading class
class TitleGenerateData(
    Construction dataset, inheritance
        data(dict): The title and corresponding text are not encoded
            title(str): title
            content(str): text

        max_len: Maximum length received
    def __init__(self, data, tokenizer,max_len = 128,mode='train'):
        super(TitleGenerateData, self).__init__()
        self.data_ = data
        self.tokenizer = tokenizer
        self.max_len = max_len
        scale = 0.8 # 80% training
        if mode=='train':
   = self.data_[:int(scale*len(self.data_))]
   = self.data_[int(scale*len(self.data_)):]

    def __getitem__(self, idx):
        item =[idx]
        content = item['content']
        title = item['title']

        token_content = self.tokenizer.encode(content)
        token_title = self.tokenizer.encode(title)

        token_c, token_typec_c = token_content['input_ids'], token_content['token_type_ids']
        token_t, token_typec_t = token_title['input_ids'], token_title['token_type_ids']

        if len(token_c) > self.max_len + 1:
            token_c = token_c[:self.max_len] + token_c[-1:]
            token_typec_c = token_typec_c[:self.max_len] + token_typec_c[-1:]

        if len(token_t) > self.max_len + 1:
            token_t = token_t[:self.max_len] + token_t[-1:]
            token_typec_t = token_typec_t[:self.max_len] + token_typec_t[-1:]

        input_token, input_token_type = token_c, token_typec_c
        label_token = np.array((token_t + [0] * self.max_len)[:self.max_len], dtype='int64')

        # Input fill
        input_token = np.array((input_token + [0] * self.max_len)[:self.max_len], dtype='int64')
        input_token_type = np.array((input_token_type + [0] * self.max_len)[:self.max_len], dtype='int64')
        input_pad_mask = (input_token != 0).astype('float32')
        label_pad_mask = (label_token != 0).astype('float32')
        return input_token, input_token_type, input_pad_mask, label_token, label_pad_mask
    def __len__(self):
        return len(
# Define loss function
class Cross_entropy_loss(Layer):
    def forward(self, pred_logits, label, label_pad_mask):
        loss = paddle.nn.functional.cross_entropy(pred_logits, label, ignore_index=0, reduction='none')
        masked_loss = paddle.mean(loss * label_pad_mask, axis=0)
        return paddle.sum(masked_loss)
# Test data reading class
dataset = TitleGenerateData(data,bert_tokenizer,mode='train')
print('=============train dataset=============')

input_token, input_token_type, input_pad_mask, label_token, label_pad_mask = dataset[1]
print(input_token, input_token_type, input_pad_mask, label_token, label_pad_mask)
# View the model structure, define the performance evaluation index and parameters
from paddle.static import InputSpec
from paddlenlp.metrics import Perplexity
from paddle.optimizer import AdamW

net = TitleBertModel('bert-base-chinese', 128)

token_ids = InputSpec((-1, 128), 'int64', 'token')
token_type_ids = InputSpec((-1, 128), 'int64', 'token_type')
input_mask = InputSpec((-1, 128), 'float32', 'input_mask')

label = InputSpec((-1, 128), 'int64', 'label')
label_mask = InputSpec((-1, 128), 'int64', 'label')

inputs = [token_ids, token_type_ids, input_mask]
labels = [label,label_mask]

model = paddle.Model(net, inputs, labels)
model.summary(inputs, [input.dtype for input in inputs])
# Training parameters
epochs = 20
context_length = 128
lr = 1e-3
# model training
from import DataLoader
from tqdm import tqdm
from paddlenlp.metrics import Perplexity

train_dataset = TitleGenerateData(data,bert_tokenizer,mode='train')
dev_dataset = TitleGenerateData(data,bert_tokenizer,mode='dev') 

train_loader =, batch_size=128, shuffle=True)
dev_loader =, batch_size=64, shuffle=True)

model = TitleBertModel('bert-base-chinese', context_length)

# Set optimizer
# Set loss function
loss_fn = Cross_entropy_loss()

perplexity = Perplexity()

for epoch in range(epochs):
    for data in tqdm(train_loader(),desc='epoch:'+str(epoch+1)):

        input_token, input_token_type, input_pad_mask, label_token, label_pad_mask = data[0],data[1],data[2],data[3], data[4]  # data

        predicts = model(input_token, input_token_type, input_pad_mask)    # Prediction results

        # The calculated loss is equivalent to the loss setting in prepare
        loss = loss_fn(predicts, label_token , label_pad_mask)

        predicts = paddle.to_tensor(predicts)
        label =  paddle.to_tensor(label_token)

        # Calculating the confusion degree is equivalent to the setting of metrics in prepare
        correct = perplexity.compute(predicts, label)
        ppl = perplexity.accumulate()
        # Back propagation

        # Update parameters

        # Gradient clearing

    print("epoch: {}, loss is: {}, Perplexity is: {}".format(epoch+1, loss.item(),ppl))

    # Save the model parameters with the file name Unet_model.pdparams, 'work/model.pdparams')

🎖️ 4 model test

import numpy as np

class TitleGen(object):
    Define a class that automatically generates text and generate text as required
    model: Trained prediction model
    tokenizer: Word segmentation coding tool
    max_length: The maximum length of generated text must be less than or equal to model Maximum length allowed
    def __init__(self, model, tokenizer, max_length=128):
        self.model = model
        self.tokenizer = tokenizer
        self.puncs = [',', '. ', '?', ';']
        self.max_length = max_length

    def generate(self, head='', topk=2):
        Generate titles as required
        head (str, list): Text entered
        topk (int): From predicted topk Select results from
        poetry_ids = self.tokenizer.encode(head)['input_ids']

        # Remove the start and end marks
        poetry_ids = poetry_ids[1:-1]
        break_flag = False
        while len(poetry_ids) <= self.max_length:
            next_word = self._gen_next_word(poetry_ids, topk)
            # For some symbols, such as [UNK], [PAD], [CLS], etc., they have no meaning to the poem after they are generated, so they can be skipped directly
            if next_word in self.tokenizer.convert_tokens_to_ids(['[UNK]', '[PAD]', '[CLS]']):

                new_ids = [next_word]
            if next_word == self.tokenizer.convert_tokens_to_ids(['[SEP]'])[0]:
            poetry_ids += new_ids
            if break_flag:
        return ''.join(self.tokenizer.convert_ids_to_tokens(poetry_ids))

    def _gen_next_word(self, known_ids, topk):
        type_token = [0] * len(known_ids)
        mask = [1] * len(known_ids)
        sequence_length = len(known_ids)
        known_ids = paddle.to_tensor([known_ids], dtype='int64')
        type_token = paddle.to_tensor([type_token], dtype='int64')
        mask = paddle.to_tensor([mask], dtype='float32')
        logits =, type_token, mask, sequence_length)
        # The output corresponding to the last word in logits is the probability of the next word
        words_prob = logits[0, -1, :].numpy()
        # In reverse order of probability, select the first topk words
        words_to_be_choosen = words_prob.argsort()[::-1][:topk]
        probs_to_be_choosen = words_prob[words_to_be_choosen]
        # normalization
        probs_to_be_choosen = probs_to_be_choosen / sum(probs_to_be_choosen)
        word_choosen = np.random.choice(words_to_be_choosen, p=probs_to_be_choosen)
        return word_choosen
# Load the trained model
net = TitleBertModel('bert-base-chinese', 128)
model = paddle.Model(net)
title_gen = TitleGen(model, bert_tokenizer)
# Using rouge evaluation index
import json
from sumeval.metrics.rouge import RougeCalculator

text_ = 'At 6 o'clock last night, a helicopter crashed into Dongpu reservoir in Hefei'
ref_content = title_gen.generate(head=text_)


summary_content = 'The helicopter crashed into a reservoir in Hefei, Anhui Province '

rouge = RougeCalculator(lang="zh")

# Output rouge-1, rouge-2, rouge-l indicators
sum_rouge_1 = 0
sum_rouge_2 = 0
sum_rouge_l = 0
for i, (summary, ref) in enumerate(zip(summary_content, ref_content)):
    summary = summary.lower().replace(" ", "")
    rouge_1 = rouge.rouge_n(
    rouge_2 = rouge.rouge_n(
    rouge_l = rouge.rouge_l(

    sum_rouge_1 += rouge_1
    sum_rouge_2 += rouge_2
    sum_rouge_l += rouge_l
    print(i, rouge_1, rouge_2, rouge_l, summary, ref)

print(f"avg rouge-1: {sum_rouge_1/len(summary_content)}\n"
      f"avg rouge-2: {sum_rouge_2/len(summary_content)}\n"
      f"avg rouge-l: {sum_rouge_l/len(summary_content)}")

🥑 5 Summary

Due to the large amount of data, this project only uses a small amount of data for effect demonstration to ensure the successful operation of the project.

The work of the project is mainly reflected in:

  • The project makes a detailed interpretation of the original text of BERT.

  • The project demonstrates and uses the lcsts summary data set

  • rouge evaluation method is used in the project

  • Fix the Bug of the reference item: when the source data and target data use different MASK processing

If you have any questions, please leave a message in the comment area.

Keywords: AI Deep Learning NLP paddlepaddle BERT

Added by sethi_kapil on Sun, 13 Feb 2022 08:37:06 +0200