© Original author | crazy Max
ERNIE code interpretation
Since ERNIE uses BERT as its base model, to help NLP practitioners without prior background follow the code, the author first gives a brief walkthrough of the BERT model structure. See [1] for the complete code.
01 structural composition of BERT
The BERT code consists mainly of a tokenization module, training data preprocessing, and the model structure module.
1.1 tokenization module
Before training, the model needs to tokenize the input text and convert the resulting subwords into their corresponding IDs. This is handled by BertTokenizer, implemented mainly in /models/bert/tokenization_bert.py.
BertTokenizer is a tokenizer built on top of BasicTokenizer and WordPieceTokenizer:
- BasicTokenizer splits sentences on punctuation, whitespace, etc., optionally lowercases the text, and cleans up illegal characters.
- WordPieceTokenizer further decomposes the resulting words into subwords.
It provides the following methods (a short usage sketch follows the list):
- from_pretrained: initializes a tokenizer from a directory containing the vocabulary file (vocab.txt);
- tokenize: splits text into a list of subwords;
- convert_tokens_to_ids: converts subwords into their corresponding vocabulary indices;
- convert_ids_to_tokens: converts vocabulary indices back into subwords;
- encode: for a single sentence, tokenizes it, adds the special tokens to form the structure "[CLS], x, [SEP]", and converts it into the corresponding list of vocabulary indices;
- decode: converts the output of encode back into a sentence.
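As an illustration, here is a minimal usage sketch based on the Hugging Face transformers package (where tokenization_bert.py lives); the checkpoint name "bert-base-uncased" is only an assumption, any directory containing a BERT vocab.txt works:

from transformers import BertTokenizer

# Initialize the tokenizer from a vocabulary (checkpoint name is illustrative)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("ERNIE injects knowledge into BERT")
ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokenizer.convert_ids_to_tokens(ids))   # back to the subword list

encoded = tokenizer.encode("ERNIE injects knowledge into BERT")  # adds [CLS]/[SEP], maps to ids
print(tokenizer.decode(encoded))              # "[CLS] ... [SEP]"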
1.2 training data preprocessing
How the training data is constructed depends on the pre-training tasks. BERT's pre-training tasks are next sentence prediction (deciding whether two sentences are consecutive) and masked language modeling, so building its training data involves randomly replacing the second sentence and randomly masking subwords. This is implemented by the function create_instances_from_document in create_pretraining_data.py.
This function first builds the sentence pair, splices in the IDs of the special tokens [CLS] and [SEP] to construct a sequence of length 512, and then selects the subwords to be masked with the probabilities used in the paper. The masking step is implemented by the function create_masked_lm_predictions.
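To make the masking step concrete, below is a minimal sketch of the selection logic (not the original create_masked_lm_predictions); the 80% [MASK] / 10% random word / 10% unchanged scheme follows the BERT paper, while the function and argument names here are illustrative:

import random

def mask_tokens(tokens, vocab_words, masked_lm_prob=0.15, max_predictions=20, seed=12345):
    """Pick up to max_predictions positions (skipping [CLS]/[SEP]) and apply
    the 80% [MASK] / 10% random word / 10% unchanged replacement scheme."""
    rng = random.Random(seed)
    cand = [i for i, t in enumerate(tokens) if t not in ("[CLS]", "[SEP]")]
    rng.shuffle(cand)
    num_to_predict = min(max_predictions, max(1, int(round(len(tokens) * masked_lm_prob))))
    out = list(tokens)
    positions, labels = [], []
    for i in sorted(cand[:num_to_predict]):
        positions.append(i)
        labels.append(tokens[i])
        p = rng.random()
        if p < 0.8:
            out[i] = "[MASK]"
        elif p < 0.9:
            out[i] = rng.choice(vocab_words)
        # otherwise the original token is kept unchanged
    return out, positions, labels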
1.3 model structure
The BERT model consists mainly of the BertEmbeddings class and the BertEncoder class. The former projects subwords, positions, and segments into vectors; the latter performs the text encoding.
The encoder BertEncoder is a stack of 12 identical coding blocks (BertLayer). Each block consists of a self-attention layer BertSelfAttention, a feedforward layer BertIntermediate, and an output layer BertOutput, implemented in /models/bert/modeling_bert.py.
The structure and functions of each coding layer are as follows:
- BertSelfAttention: implements mutual attention between subwords. Note that multi-head self-attention is realized by splitting the hidden_size-dimensional representation vector into n vectors of dimension hidden_size / n, encoding each split vector separately, and finally concatenating the encoded vectors (see the sketch after this list);
- BertIntermediate: applies a linear transformation and a nonlinearity to the batch data (a three-dimensional tensor);
- BertOutput: implements normalization and the residual connection.
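The head-splitting step can be illustrated in a few lines of PyTorch; the sizes below (hidden_size 768, 12 heads) match the BERT-base configuration, and the helper function itself is only an illustration:

import torch

def split_heads(x, num_heads):
    # (batch, seq_len, hidden_size) -> (batch, num_heads, seq_len, hidden_size // num_heads)
    batch, seq_len, hidden_size = x.size()
    head_dim = hidden_size // num_heads
    return x.view(batch, seq_len, num_heads, head_dim).permute(0, 2, 1, 3)

x = torch.randn(2, 128, 768)
print(split_heads(x, 12).shape)   # torch.Size([2, 12, 128, 64])

Each of the 12 slices is attended over independently, and BertSelfAttention concatenates the per-head outputs back into a 768-dimensional vector afterwards.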
Engineering tip: if the model needs to combine different encoding methods while learning representation vectors, for example a graph neural network layer combined with a Transformer coding layer, the author suggests using the same parameter initialization method for both and giving both residual connections, which helps avoid gradient explosion during training.
In addition, whether the attention weights need to be rescaled depends on the number of graph neural network layers; generally, with two or fewer graph neural network layers there is no need to change the attention weights.
Concretely, check whether the representation vectors produced by the graph neural network layer and by the Transformer coding layer are of the same order of magnitude. If they are, the attention weights can be left alone; if gradients explode, the attention weights can be scaled down (a sketch follows).
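A minimal sketch of this tip, assuming a simple adjacency-matrix GNN step and PyTorch's built-in nn.MultiheadAttention as the Transformer part (the class name and the attn_scale knob are illustrative, not the author's code):

import torch
import torch.nn as nn

class MixedBlock(nn.Module):
    def __init__(self, hidden_size, num_heads=8, attn_scale=1.0):
        super().__init__()
        self.gnn_proj = nn.Linear(hidden_size, hidden_size)   # stand-in for a GNN layer
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(hidden_size)
        self.norm2 = nn.LayerNorm(hidden_size)
        self.attn_scale = attn_scale
        # same (Xavier) initialization as MultiheadAttention uses internally
        nn.init.xavier_uniform_(self.gnn_proj.weight)

    def forward(self, x, adj):
        # GNN step: aggregate neighbours (adj is a normalized adjacency matrix), residual connection
        x = self.norm1(x + torch.bmm(adj, self.gnn_proj(x)))
        # Transformer step: the attention output can be scaled down if its magnitude
        # dwarfs the GNN output; residual connection again
        attn_out, _ = self.attn(x, x, x)
        return self.norm2(x + self.attn_scale * attn_out)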
02 from BERT to ERNIE
ERNIE is an improvement built on BERT: at the data level it constructs the entity sequence corresponding to the text, and at the pre-training level it adds a new pre-training task, so the code changes lie in training data preprocessing and the model structure. The author therefore focuses on these two aspects. For the complete code, see [2].
The code consists of two main modules: the training data preprocessing module and the model construction module.
2.1 training data preprocessing module
ERNIE's knowledge injection depends on locating the entities that appear in the text. An entity here is a meaningful abstract or concrete noun or noun phrase, also called a mention. One entity can have multiple aliases, which means one entity can correspond to multiple mentions in the text.
To find the entities in the training corpus, the author uses Wikipedia as ERNIE's pre-training corpus and treats the hyperlinked nouns and phrases in Wikipedia as entity mentions. Using this existing resource greatly simplifies entity retrieval.
2.1.1 training data construction
After the corpus and entity-name files have been obtained with the existing extraction tools, the training data is built by pretrain_data/create_insts.py.
Recall that before training, the corpus is first tokenized into subwords, which are then mapped to index IDs via the vocabulary; the model later projects these indices into vectors. From BERT's code we know that BERT first constructs the sentence pair required for next sentence prediction and randomly selects tokens to mask, producing a mask list for the self-attention stage.
To inject the corresponding entities into each sentence, ERNIE must, in this same process, create an entity-ID tensor of the same length as the token sequence, together with the corresponding mask list.
The author only writes the entity ID at the position of the first subword of each mention, which means the model uses only the first subword's vector to predict the entity. This way BERT's code can be reused directly, without building separate training data for the entity sequence, which reduces the engineering effort, as the following snippet shows.
for i, x in enumerate(vec):
    if x == "#UNK#":
        vec[i] = -1
    elif x[0] == "Q":
        if x in d:
            vec[i] = d[x]
            if i != 0 and vec[i] == vec[i-1]:
                # For an entity spanning several subwords, e.g. Q123 Q123 -> d[Q123], -1:
                # only the first subword records the entity ID, the other positions are set to -1
                vec[i] = -1
        else:
            vec[i] = -1
# Function create_instances_from_document
# Get the subwords and entities of sentences A and B
tokens = [101] + tokens_a + [102] + tokens_b + [102]
entity = [-1] + entity_a + [-1] + entity_b + [-1]
# Build the indexed dataset object ds and store the training data in it:
# the input corpus id list and mask list, the entity id list and mask list, etc.
ds.add_item(torch.IntTensor(input_ids + input_mask + segment_ids
                            + masked_lm_labels + entity + entity_mask
                            + [next_sentence_label]))
2.1.2 entity vector loading
Because BERT comes with a pre-trained vector table, the subword IDs can be projected into vectors with the nn.Embedding module.
The entity vectors, however, are learned with the TransE representation method, so how does the model obtain their projections? In the data iterator code under code/, when data is returned, torch.utils.data.DataLoader is constructed with the function collate_fn, which is responsible for projecting the entity vectors, so the model obtains the entity representation vectors while loading the data.
# Class EpochBatchIterator(object):
return CountingIterator(torch.utils.data.DataLoader(
    self.dataset,
    # collate_fn is the key to passing in the entity vectors
    collate_fn=self.collate_fn,
    batch_sampler=batches,
))
# Function collate_fn:
def collate_fn(x):
    x = torch.LongTensor([xx for xx in x])
    entity_idx = x[:, 4*args.max_seq_length:5*args.max_seq_length]
    # embed = torch.nn.Embedding.from_pretrained(embed)
    # embed is loaded from the pre-trained two-dimensional entity tensor
    uniq_idx = np.unique(entity_idx.numpy())
    ent_candidate = embed(torch.LongTensor(uniq_idx + 1))
2.2 model structure module
For the model, the author still uses 12 Transformer coding layers. Unlike BERT, only the first six layers are standard BERT Transformer coding layers. The 7th layer is a user-defined knowledge fusion layer, BertLayerMix, which for the first time sums the aligned entity vectors and mention vectors and passes the result to the knowledge-encoding and text-encoding paths respectively. The remaining five layers are user-defined knowledge coding layers, BertLayer, in which the entity sequence and the text sequence, each now carrying the fused information, are encoded separately with the self-attention mechanism.
The first six layers of the model form the text encoder described in the paper, and the last six coding layers (the fusion layer plus the five knowledge coding layers) constitute the knowledge encoder.
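This layout can be summarized with the following sketch. It is a simplification, not the repository's BertEncoder: BertLayerMix and BertLayer are the classes interpreted below, while BertLayer_simple stands for a plain text-only BERT coding layer (the name and its forward signature are assumptions):

import torch.nn as nn

class ErnieStyleEncoder(nn.Module):
    def __init__(self, config):
        super().__init__()
        # layers 1-6: text encoder (plain BERT coding layers)
        self.text_layers = nn.ModuleList([BertLayer_simple(config) for _ in range(6)])
        # layer 7: first fusion of the aligned entity and token vectors
        self.mix_layer = BertLayerMix(config)
        # layers 8-12: knowledge coding layers
        self.know_layers = nn.ModuleList([BertLayer(config) for _ in range(5)])

    def forward(self, hidden, attn_mask, hidden_ent, attn_mask_ent, ent_mask):
        for layer in self.text_layers:
            hidden = layer(hidden, attn_mask)                     # text-only encoding
        hidden, hidden_ent = self.mix_layer(hidden, attn_mask,
                                            hidden_ent, attn_mask_ent, ent_mask)
        for layer in self.know_layers:
            hidden, hidden_ent = layer(hidden, attn_mask,
                                       hidden_ent, attn_mask_ent, ent_mask)
        return hidden, hidden_ent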
BERT's Transformer coding layer was already covered in the first part, so it is not repeated here. The following is a detailed interpretation of the coding layers defined by the author.
2.2.1 knowledge fusion layer BertLayerMix
Specifically, the knowledge fusion layer BertLayerMix consists of the self-attention layer BertAttention_simple, the fusion layer BertIntermediate, and the output layer BertOutput.
class BertLayerMix(nn.Module):
    def __init__(self, config):
        super(BertLayerMix, self).__init__()
        self.attention = BertAttention_simple(config)
        self.intermediate = BertIntermediate(config)
        self.output = BertOutput(config)

    # This layer performs self-attention, matrix multiplication and residual
    # connection for the text only
    def forward(self, hidden_states, attention_mask, hidden_states_ent,
                attention_mask_ent, ent_mask):
        attention_output = self.attention(hidden_states, attention_mask)
        attention_output_ent = hidden_states_ent * ent_mask
        # The intermediate layer sums the entity and text vectors and applies
        # a nonlinear transformation to the sum
        intermediate_output = self.intermediate(attention_output, attention_output_ent)
        # The output layer then normalizes and adds residual connections again
        layer_output, layer_output_ent = self.output(intermediate_output,
                                                     attention_output,
                                                     attention_output_ent)
        return layer_output, layer_output_ent
The self-attention layer BertAttention_simple is composed of BertSelfAttention and BertSelfOutput. The former performs the self-attention operation over the text and is implemented exactly as in BERT, so its code is not shown. The latter applies a matrix transformation and a residual connection to produce attention_output.
class BertAttention_simple(nn.Module):
    def __init__(self, config):
        super(BertAttention_simple, self).__init__()
        self.self = BertSelfAttention(config)
        self.output = BertSelfOutput(config)

    def forward(self, input_tensor, attention_mask):
        self_output = self.self(input_tensor, attention_mask)
        attention_output = self.output(self_output, input_tensor)
        return attention_output
class BertSelfOutput(nn.Module):
    def __init__(self, config):
        super(BertSelfOutput, self).__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states, input_tensor):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.LayerNorm(hidden_states + input_tensor)
        return hidden_states
The feedforward layer BertIntermediate linearly projects the two vectors into the same dimension, sums them, and applies a nonlinear transformation.
class BertIntermediate(nn.Module):
    def __init__(self, config):
        super(BertIntermediate, self).__init__()
        self.dense = nn.Linear(config.hidden_size, config.intermediate_size)
        self.dense_ent = nn.Linear(100, config.intermediate_size)
        self.intermediate_act_fn = ACT2FN[config.hidden_act] \
            if isinstance(config.hidden_act, str) else config.hidden_act

    def forward(self, hidden_states, hidden_states_ent):
        # Linearly project both inputs into the same dimension
        hidden_states_ = self.dense(hidden_states)
        hidden_states_ent_ = self.dense_ent(hidden_states_ent)
        # Sum and apply the nonlinearity intermediate_act_fn
        hidden_states = self.intermediate_act_fn(hidden_states_ + hidden_states_ent_)
        return hidden_states
Finally, BertOutput applies a linear transformation to the fused vector for the text branch and the entity branch separately, adds the residual connections, and normalizes the results.
class BertOutput(nn.Module):
    def __init__(self, config):
        super(BertOutput, self).__init__()
        self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
        self.dense_ent = nn.Linear(config.intermediate_size, 100)
        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)
        self.LayerNorm_ent = BertLayerNorm(100, eps=1e-12)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states_, input_tensor, input_tensor_ent):
        # Linear transformation of the text branch
        hidden_states = self.dense(hidden_states_)
        hidden_states = self.dropout(hidden_states)
        # Residual connection and normalization of the text branch
        hidden_states = self.LayerNorm(hidden_states + input_tensor)
        # Linear transformation, residual connection and normalization of the entity branch
        hidden_states_ent = self.dense_ent(hidden_states_)
        hidden_states_ent = self.dropout(hidden_states_ent)
        hidden_states_ent = self.LayerNorm_ent(hidden_states_ent + input_tensor_ent)
        return hidden_states, hidden_states_ent
2.2.2 knowledge coding layer BertLayer
This coding layer applies self-attention separately to the fused entity vectors and the fused text vectors, so that the entities in the entity sequence can also attend to one another.
The entity vectors are then summed with the text vectors at the corresponding positions, passing the entity information into the text vectors, so that in the next coding layer the whole text sequence can attend to the entity sequence.
class BertLayer(nn.Module):
    def __init__(self, config):
        super(BertLayer, self).__init__()
        self.attention = BertAttention(config)
        self.intermediate = BertIntermediate(config)
        self.output = BertOutput(config)

    def forward(self, hidden_states, attention_mask, hidden_states_ent,
                attention_mask_ent, ent_mask):
        attention_output, attention_output_ent = self.attention(
            hidden_states, attention_mask, hidden_states_ent, attention_mask_ent)
        attention_output_ent = attention_output_ent * ent_mask
        intermediate_output = self.intermediate(attention_output, attention_output_ent)
        layer_output, layer_output_ent = self.output(intermediate_output,
                                                     attention_output,
                                                     attention_output_ent)
        # layer_output_ent = layer_output_ent * ent_mask
        return layer_output, layer_output_ent
This coding layer defines its own self-attention layer, in which the self-attention over entities uses only four attention heads.
class BertAttention(nn.Module):
    def __init__(self, config):
        super(BertAttention, self).__init__()
        self.self = BertSelfAttention(config)
        self.output = BertSelfOutput(config)
        config_ent = copy.deepcopy(config)
        config_ent.hidden_size = 100
        config_ent.num_attention_heads = 4
        self.self_ent = BertSelfAttention(config_ent)
        self.output_ent = BertSelfOutput(config_ent)

    def forward(self, input_tensor, attention_mask, input_tensor_ent, attention_mask_ent):
        # BertSelfAttention performs self-attention over the text vectors
        self_output = self.self(input_tensor, attention_mask)
        # ... and, with the entity config, over the entity vectors
        self_output_ent = self.self_ent(input_tensor_ent, attention_mask_ent)
        attention_output = self.output(self_output, input_tensor)
        attention_output_ent = self.output_ent(self_output_ent, input_tensor_ent)
        return attention_output, attention_output_ent
Like the knowledge fusion layer, the output layer uses BertOutput for normalization and the residual connections.
03 source code reference
[1] https://github.com/google-research/bert
[2] https://github.com/thunlp/ERNIE