1. Project introduction
1.1 What is KBQA?
This project implements a simple knowledge-base question answering (KBQA) system. Given an existing knowledge graph, the system performs semantic understanding of simple questions, retrieves the relevant knowledge automatically, and returns the answer to the question.
A traditional search-engine-based question answering system can only return a collection of web pages; the user still has to spend time reading and analyzing those documents to extract an answer. A KBQA system, by contrast, can locate a precise answer directly in the knowledge graph and return it to the user, satisfying precise information needs and providing personalized knowledge services.
1.2 Introduction to the project method
This project divides KBQA into the following four core algorithm modules (a code sketch of the overall data flow follows the list):
- Topic word recognition: identify the topic entity that the question is asking about
- Candidate triple retrieval: use an index to return the triple knowledge related to the topic entity
- Candidate triple classification: run a binary classification over the retrieved candidate triples and filter out the many triples that do not match the question's intent
- Fine ranking of candidate answers: rank the remaining triples by semantic similarity between the question and each candidate triple, so that the best one can serve as the answer
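A minimal sketch of how these four modules chain together. The function names below are placeholders for the components implemented in Sections 4-7 of this notebook, not the project's real API:

```python
# Minimal sketch of the KBQA data flow; all function names are placeholders
# for the modules implemented in Sections 4-7 of this notebook.
def kbqa_answer(question):
    mention = recognize_topic_word(question)    # Section 4: NER over the question
    entities = link_entities(mention)           # Section 5.1: mention -> KB entities
    triples = retrieve_triples(entities)        # Section 5.2: index-based lookup
    kept = [t for t in triples
            if is_relevant(question, t)]        # Section 6: binary classification filter
    best = max(kept,                            # Section 7: fine ranking by similarity
               key=lambda t: similarity(question, t[1]))
    return best[2]                              # the tail entity is the answer
```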
1.3 Reading guide
You can follow this notebook step by step.
Alternatively, after running the first three sections, you can jump straight to Section 8 and run the complete KBQA prediction pipeline.
2. Environment configuration
[Important note] This project needs about 10 GB of memory. To ensure the program runs correctly, please use the advanced or premium GPU environment of AIStudio.
Taking the AIStudio advanced edition as an example, the environment is as follows:
- CPU: 2 cores
- RAM: 16 GB
- GPU: Tesla V100, 16 GB
- Python version: Python 3.7
- Framework version: PaddlePaddle 2.2.1
In addition, execute the following commands to install the gensim library (for loading word2vec) and the python_Levenshtein library (for computing edit distance):

```
!pip install gensim==3.8.1
!pip install python_Levenshtein==0.12.2
```
```
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting gensim==3.8.1
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/44/93/.../gensim-3.8.1-cp37-cp37m-manylinux1_x86_64.whl (24.2MB)
Requirement already satisfied: numpy>=1.11.3 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from gensim==3.8.1) (1.20.3)
Collecting smart-open>=1.8.1 (from gensim==3.8.1)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/cd/11/.../smart_open-5.2.1-py3-none-any.whl (58kB)
Requirement already satisfied: scipy>=0.18.1 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from gensim==3.8.1) (1.6.3)
Requirement already satisfied: six>=1.5.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from gensim==3.8.1) (1.15.0)
Installing collected packages: smart-open, gensim
Successfully installed gensim-3.8.1 smart-open-5.2.1
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting python_Levenshtein==0.12.2
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/2a/dc/.../python-Levenshtein-0.12.2.tar.gz (50kB)
Requirement already satisfied: setuptools in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from python_Levenshtein==0.12.2) (56.2.0)
Building wheels for collected packages: python-Levenshtein
  Building wheel for python-Levenshtein (setup.py) ... done
Successfully built python-Levenshtein
Installing collected packages: python-Levenshtein
Successfully installed python-Levenshtein-0.12.2
```
3. Data loading
3.1 Introduction to the knowledge base dataset
The question answering dataset and knowledge base used in this project come from Task 7 of the NLPCC 2018 shared tasks: Open Domain Question Answering (knowledge-based QA, KBQA). It includes the following two files:
- The Chinese encyclopedia knowledge base file, containing about 20 million entities and 60 million triples, 3.37 GB in size
- The entity mention mapping table of the knowledge base, which maps common real-world entity mentions to entities in the knowledge base
The data above are public: the original dataset can be downloaded directly from the official website of the competition, or a lightly preprocessed version can be obtained from the author's AIStudio dataset.
3.2 Loading the knowledge base dataset
This project mounts the preprocessed knowledge base dataset directly, so it can be read from file and loaded into memory.
```python
from work.TopicWordRecognization.run_ner import predict as ner_predict
from work.CandidateTriplesSelection.run_cls import predict as cls_predict
from work.CandidateTriplesLookup.knowledge_retrieval import entity_linking, search_triples_by_index
from work.AnswerRanking.ranking import span_question, score_similarity
from work.config import KGConfig, CLSConfig, NERConfig
import jieba
import gensim
import datetime
import json
import re
from functools import partial
import paddle
from paddlenlp.datasets import load_dataset
from paddlenlp.transformers import ErnieTokenizer, ErnieModel
from paddlenlp.data import Stack, Pad, Tuple

KGconfig = KGConfig()
mention2entity_clean_path = KGconfig.mention2entity_clean_path
knowledge_graph_path = KGconfig.knowledge_graph_path

print('Loading mention2entity table', datetime.datetime.now())
with open(mention2entity_clean_path, 'r', encoding='utf-8') as f:
    mention2entity_dict = json.loads(f.read())

print('Loading knowledge base', datetime.datetime.now())
forward_KG_f = open(knowledge_graph_path, 'rb')
print('Knowledge base loaded', datetime.datetime.now())
```
```
/opt/conda/envs/python35-paddle120-env/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
  return f(*args, **kwds)
Loading mention2entity table 2021-12-22 17:30:39.294507
Loading knowledge base 2021-12-22 17:31:10.639150
Knowledge base loaded 2021-12-22 17:31:10.641600
```
```python
# Print a sample of the mention2entity mapping table
for idx, (key, value) in enumerate(list(mention2entity_dict.items())[80:90]):
    print(key, '-->', value)
```
```
try to help the shoots grow by pulling them upward --> ['try to help the shoots grow by pulling them upward(Mitchell Hurwitz directed American films)', 'try to help the shoots grow by pulling them upward(idiom)', 'try to help the shoots grow by pulling them upward']
Anterior branch of thoracic nerve --> ['Anterior branch of thoracic nerve']
anteriorbranchofthoracicnerves --> ['Anterior branch of thoracic nerve']
closed loop --> ['closed loop']
htcmytouch4gslide --> ['HTC myTouch 4G Slide']
Class A tertiary hospital --> ['Class A tertiary hospital']
Britannia(Roman province ) --> ['Britannia(Roman province )']
Britannia --> ['Britannia(Roman province )', 'Britannia(The virtual empire of Lu Lu Xiu, a rebel of Japanese animation)', 'Britannia(English goddess)', 'Britannia(Roman province)']
britannia --> ['Britannia(Roman province )', '<Great Britain', 'Britannia(English goddess)']
white collar --> ['white collar', 'White collar workers', 'white collar(A general term for staff)', 'white collar(Network novel created by Drunken Beauty knee)', 'white collar(1962 Korean films in)']
```
3.3 Building and loading the knowledge base index table
In the candidate triple retrieval stage of KBQA, the relevant triples must be retrieved from the knowledge base according to the topic entity of the question. To avoid the huge overhead of traversing the whole knowledge base on every triple query, we build an index table over the knowledge base file, which greatly reduces the time cost.
The knowledge base data file mounted in this project has already been sorted by head entity. Given a query entity, the index table can quickly locate the entity's position in the knowledge base and return all triples that have it as the head entity.
In the implementation, we open the knowledge base file as a byte stream, locate the starting position of each entity with Python's tell() method, and record the total length (in bytes) of all triples that have it as the head entity.
The constructed index table is a hash structure (a dictionary), which we save to a specified directory; an example entry is shown below.
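Concretely, each entry maps a head entity to the byte offset where its block of triples starts and the block's length. The 'Yao Ming' numbers below are the ones printed by the query cell later in this section:

```python
# Shape of the forward index: one entry per head entity in the knowledge base.
# The 'Yao Ming' values match the query output shown later in this section.
forward_index = {
    'Yao Ming': {'start_pos': 2231973201, 'length': 2598},
    # ... one entry per head entity
}
```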
```python
def make_KG_index(knowledge_graph_path, forward_index_path):
    """
    Read the KG file and build a forward index keyed by head entity, in
    dictionary format: {mention: {'start_pos': int, 'length': int}, ...}
    To read the KG with the index:
        with open(knowledge_graph_path, 'rb') as f:
            f.seek(223)
            readresult = f.read(448).decode('utf-8')
    """
    def make_index(graph_path, index_path):
        print('begin to read KG', datetime.datetime.now())
        index_dict = dict()
        with open(graph_path, 'r', encoding='utf-8') as f:
            previous_entity = ''
            previous_start = 0
            while True:
                start_pos = f.tell()
                line = f.readline()
                if not line:
                    break
                entity = line.split(' ||| ')[0]
                if entity != previous_entity and previous_entity:
                    tmp_dict = dict()
                    tmp_dict['start_pos'] = previous_start
                    tmp_dict['length'] = start_pos - previous_start
                    index_dict[previous_entity] = tmp_dict
                    previous_start = start_pos
                previous_entity = entity
            # record the final entity, whose triples run to the end of file
            if previous_entity:
                index_dict[previous_entity] = {'start_pos': previous_start,
                                               'length': start_pos - previous_start}
        print('finish reading KG, begin to write', datetime.datetime.now())
        with open(index_path, 'w', encoding='utf-8') as f:
            f.write(json.dumps(index_dict, ensure_ascii=False))
        print('finish writing', datetime.datetime.now())

    make_index(knowledge_graph_path, forward_index_path)
```
In this project, the index table of the knowledge base file has been built in advance and mounted in the project's dataset, so it can be loaded and used directly.
```python
print('Loading index table', datetime.datetime.now())
forward_index_path = KGconfig.forward_index_path
with open(forward_index_path, 'r', encoding='utf-8') as f:
    forward_index = json.loads(f.read())
print('Index table loaded', datetime.datetime.now())

# Query the index table with a given entity, read its triples from the
# knowledge base, and print the first 20 characters of the result
entity = 'Yao Ming'
read_index, read_size = forward_index[entity]['start_pos'], forward_index[entity]['length']
print(read_index, read_size)
forward_KG_f.seek(read_index)
readresult = forward_KG_f.read(read_size).decode('utf-8')
print(readresult[:20])
```
```
Loading index table 2021-12-22 17:31:31.591060
Index table loaded 2021-12-22 17:31:49.429286
2231973201 2598
Yao Ming ||| alias ||| Yao Ming
Yao Ming
```
3.4 Loading the word2vec model
Besides the knowledge base dataset above, the answer ranking module of this project uses pretrained word2vec word vectors, which can be preloaded before the main program runs.
The word2vec model used here comes from an open-source word2vec repository on GitHub; we use the 300-dimensional word vectors trained on Baidu Encyclopedia with Chinese characters + words as context features.
The word vector model has been renamed sgns.target.word-character and mounted in the data/data122049 directory of the project. It can be loaded with the following code.
```python
from work.config import Word2VecConfig
from gensim.models import KeyedVectors

def load_word2vec():
    word2vec_model_path = Word2VecConfig().model_path  # location of the word vector file
    print('Preloading word2vec word vectors, expected to take about 2 min', datetime.datetime.now())
    word2vec_model = KeyedVectors.load_word2vec_format(word2vec_model_path, binary=False, unicode_errors='ignore')
    print('word2vec word vectors loaded', datetime.datetime.now())
    return word2vec_model

word2vec_model = load_word2vec()
```
```
Preloading word2vec word vectors, expected to take about 2 min 2021-12-22 17:31:53.311305
word2vec word vectors loaded 2021-12-22 17:33:19.258214
```
4. Topic word recognition of questions
Given a question, the system needs to determine what its core query object is. The topic word of the question is that core object; it helps us find the corresponding entities in the knowledge graph and then obtain the answer. For example, for the question "Who is the founder of Microsoft?", the topic word is "Microsoft".
4.1 Model structure
The topic word recognition module adopts an entity recognition model based on Baidu's pretrained model ERNIE. After the question is encoded by ERNIE, a BIO tag is predicted for each token, where B marks the beginning character of a topic word, I marks a character inside a topic word, and O means the token does not belong to any topic word. In the model, the last hidden state of each text token passes through a fully connected layer and is then projected to a three-way classification output layer.
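As an illustration of the tagging scheme (a toy English example; the actual training data is Chinese and labeled at the character level):

```python
# BIO tags for "Who is the founder of Microsoft?" with topic word "Microsoft"
tokens = ['Who', 'is', 'the', 'founder', 'of', 'Micro', '##soft', '?']
labels = ['O',   'O',  'O',   'O',       'O',  'B',     'I',      'O']
```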
```python
import paddle
from paddle import nn
from paddlenlp.transformers import ErniePretrainedModel

class ErnieNER(ErniePretrainedModel):
    def __init__(self, ernie, label_dim, dropout=None):
        super(ErnieNER, self).__init__()
        self.label_num = label_dim
        self.ernie = ernie  # allow ernie to be config
        self.dropout = nn.Dropout(dropout if dropout is not None else self.ernie.config["hidden_dropout_prob"])
        self.classifier = nn.Linear(self.ernie.config['hidden_size'], self.label_num)
        self.hidden = nn.Linear(self.ernie.config['hidden_size'], self.ernie.config['hidden_size'])

    def forward(self, words_ids, token_type_ids=None, position_ids=None, attention_mask=None, history_ids=None):
        sequence_output, pooled_output = self.ernie(
            words_ids,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            attention_mask=attention_mask)
        sequence_output = nn.functional.relu(self.hidden(self.dropout(sequence_output)))
        sequence_output = self.dropout(sequence_output)
        logits = self.classifier(sequence_output)
        return logits
```
4.2 Constructing numerical features
The training data of the model comes from topic word NER annotations built on the NLPCC2018 KBQA question answering dataset.
For each question, the topic word is marked with BIO tags. We organize the data as (token, label) pairs, split them into training and validation sets, and save them to file. The constructed data can be found in the project's work/TopicWordRecognization/data directory.
Before training, we read the text-format data from file, use ERNIE's tokenizer to convert the sentence text into the numerical features the model takes as input, and splice in the pretrained model's special tokens. The label sequence of each sentence also needs to be converted into numerical features and padded at the positions of the special tokens.
```python
def read(data_path):
    all_sample_words, all_sample_labels = [], []
    with open(data_path, 'r', encoding='utf-8') as f:
        tmp_sample_words, tmp_sample_labels = [], []
        for line in f.readlines():
            if line == '\n' and tmp_sample_words and tmp_sample_labels:
                all_sample_words.append(tmp_sample_words)
                all_sample_labels.append(tmp_sample_labels)
                tmp_sample_words, tmp_sample_labels = [], []
            else:
                word, label = line.strip().split(' ')[0], line.strip().split(' ')[1]
                tmp_sample_words.append(word)
                tmp_sample_labels.append(label)
    for idx in range(len(all_sample_words)):
        yield {"words": all_sample_words[idx], "labels": all_sample_labels[idx]}

def convert_example_to_feature(example, tokenizer, label2id, pad_default_tag="O", max_seq_len=512):
    features = tokenizer(example["words"], is_split_into_words=True, max_seq_len=max_seq_len)
    label_ids = [label2id[label] for label in example["labels"][:max_seq_len - 2]]
    # pad the label sequence with the default tag at the [CLS] and [SEP] positions
    label_ids = [label2id[pad_default_tag]] + label_ids + [label2id[pad_default_tag]]
    assert len(features["input_ids"]) == len(label_ids)
    return features["input_ids"], features["token_type_ids"], label_ids
```
4.3 Model training
Only the core code of model training is shown here. Running run_ner.py in the work/TopicWordRecognization directory reproduces the complete training process.
The trained model ~/data/data122049/ernie_ner_best.pdparams is attached to this project and can be used directly for prediction in the KBQA pipeline.
```python
def train():
    train_ds = load_dataset(read, data_path=train_path, lazy=False)  # file -> example
    dev_ds = load_dataset(read, data_path=dev_path, lazy=False)
    tokenizer = ErnieTokenizer.from_pretrained(model_name)
    trans_func = partial(convert_example_to_feature, tokenizer=tokenizer, label2id=label2id,
                         pad_default_tag="O", max_seq_len=max_seq_len)
    train_ds = train_ds.map(trans_func, lazy=False)  # example -> feature
    dev_ds = dev_ds.map(trans_func, lazy=False)

    batchify_fn = lambda samples, fn=Tuple(
        Pad(axis=0, pad_val=tokenizer.pad_token_id),
        Pad(axis=0, pad_val=tokenizer.pad_token_type_id),
        Pad(axis=0, pad_val=label2id["O"], dtype='int64'),
    ): fn(samples)

    train_batch_sampler = paddle.io.BatchSampler(train_ds, batch_size=batch_size, shuffle=True)
    dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=batch_size, shuffle=False)
    train_loader = paddle.io.DataLoader(dataset=train_ds, batch_sampler=train_batch_sampler,
                                        collate_fn=batchify_fn, return_list=True)
    dev_loader = paddle.io.DataLoader(dataset=dev_ds, batch_sampler=dev_batch_sampler,
                                      collate_fn=batchify_fn, return_list=True)

    ernie = ErnieModel.from_pretrained(model_name)
    model = ErnieNER(ernie, len(label2id), dropout=0.1)

    num_training_steps = len(train_loader) * num_epoch
    lr_scheduler = LinearDecayWithWarmup(learning_rate, num_training_steps, warmup_proportion)
    decay_params = [p.name for n, p in model.named_parameters()
                    if not any(nd in n for nd in ["bias", "norm"])]
    grad_clip = paddle.nn.ClipGradByGlobalNorm(max_grad_norm)
    optimizer = paddle.optimizer.AdamW(learning_rate=lr_scheduler, parameters=model.parameters(),
                                       weight_decay=weight_decay,
                                       apply_decay_param_fun=lambda x: x in decay_params,
                                       grad_clip=grad_clip)
    loss_model = paddle.nn.CrossEntropyLoss()
    ner_metric = SeqEntityScore(id2label)

    global_step, ner_best_f1 = 0, 0.
    model.train()
    for epoch in range(1, num_epoch + 1):
        for batch_data in train_loader:
            input_ids, token_type_ids, labels = batch_data
            logits = model(input_ids, token_type_ids=token_type_ids)
            loss = loss_model(logits, labels)

            loss.backward()
            lr_scheduler.step()
            optimizer.step()
            optimizer.clear_grad()

            if global_step > 0 and global_step % log_step == 0:
                print(f"epoch: {epoch} - global_step: {global_step}/{num_training_steps} - loss:{loss.numpy().item():.6f}")
            if global_step > 0 and global_step % eval_step == 0:
                ner_results = evaluate(model, dev_loader, ner_metric)
                ner_result = ner_results["Total"]
                model.train()
                ner_f1 = ner_result["F1"]
                # if ner_f1 > ner_best_f1:
                #     paddle.save(model.state_dict(), f"{save_path}/ernie_ner_best.pdparams")
                if ner_f1 > ner_best_f1:
                    print(f"\nner best F1 performance has been updated: {ner_best_f1:.5f} --> {ner_f1:.5f}")
                    ner_best_f1 = ner_f1
                print(f'\nner evaluation result: precision: {ner_result["Precision"]:.5f}, recall: {ner_result["Recall"]:.5f}, F1: {ner_result["F1"]:.5f}, current best {ner_best_f1:.5f}\n')
            global_step += 1
```
5. Candidate triple retrieval
5.1 Entity linking based on fuzzy query
After obtaining the topic word of a question, we still need to map it to the relevant entity nodes in the knowledge graph. In most cases a topic word corresponds to multiple entities, which we call candidate entities. For example, after identifying the topic word "Ma Yun" in a question, it needs to be linked to entities such as "Ma Yun (founder of Alibaba)" and "Ma Yun (associate professor at Yunnan University for Nationalities)" in the knowledge graph.
Although the knowledge base dataset provides a mention2entity mapping table, natural language is diverse, so the topic word of a question may not exactly match any mention in the table. To improve the accuracy of entity lookup, the following rule- and edit-distance-based fuzzy query is applied on top of exact matching.
For a topic word obtained by entity recognition on the question, first try exact matching to see whether an identical mention exists in the mapping table. If exact matching retrieves nothing, traverse all mentions in the mapping table, normalizing the characters of the topic word and of each mention with simple rules. Exact matching is then replaced by computing the edit distance between each mention and the topic word, recording the result for each mention; the smaller the edit distance, the closer the mention is to the topic word. If a mention is of combined type (it contains commas, enumeration marks, "or", or other possible separator characters), it is first split into multiple mentions by the separators, each of which is compared with the topic word. Finally, the edit distances of all mentions are compared, and the mentions with the minimum edit distance are returned as the query result.
```python
import Levenshtein
import re
import unicodedata

def entity_linking(mention2entity_dict, input_mention):
    """
    Given the NER result of a question (input_mention), find the closely
    related mentions in mention2entity_dict and return their entities.
    Some rules are used to match more mentions.
    :param mention2entity_dict:
    :param input_mention:
    :return:
    """
    if input_mention == 'NONE':
        # nothing was recognized: return an empty candidate entity list (same below)
        return []
    input_mention = input_mention.replace(" ", "")  # mentions in mention2entity are de-blanked, so de-blank the NER result too
    relative_entities = mention2entity_dict.get(input_mention, [])  # try an exact lookup first

    if not relative_entities:  # exact lookup failed: fall back to fuzzy query
        # The fuzzy query traverses the whole mapping table, computes the edit
        # distance of every mention considered similar, and keeps the smallest ones
        fuzzy_query_relative_entities = dict()
        input_mention = unify_char_format(input_mention)
        for mention_key in mention2entity_dict.keys():
            prim_mention = mention_key
            _find = False
            # normalize the data format first
            mention_key = unify_char_format(mention_key)
            if len(mention_key) == 0:
                continue
            if '\\' == mention_key[-1]:
                mention_key = mention_key[:-1] + '"'
            # combined mention
            if ',' in mention_key or '，' in mention_key or '\\\\' in mention_key or ';' in mention_key \
                    or ('or' in mention_key and 'or' not in input_mention):
                mention_splits = re.split(r'[,;，]|or|\\\\', mention_key)
                for _mention in mention_splits:
                    if (len(input_mention) < 6 and Levenshtein.distance(input_mention, _mention) <= 1) \
                            or (len(input_mention) >= 6 and Levenshtein.distance(input_mention, _mention) <= 4) \
                            or (len(input_mention) >= 20 and Levenshtein.distance(input_mention, _mention) <= 10):
                        _find = True
                        fuzzy_query_relative_entities[prim_mention] = Levenshtein.distance(input_mention, _mention)
            # non-combined mention
            else:
                if (len(input_mention) < 6 and Levenshtein.distance(input_mention, mention_key) <= 1) \
                        or (len(input_mention) >= 6 and Levenshtein.distance(input_mention, mention_key) <= 4) \
                        or (len(input_mention) >= 20 and Levenshtein.distance(input_mention, mention_key) <= 10):
                    _find = True
                    fuzzy_query_relative_entities[prim_mention] = Levenshtein.distance(input_mention, mention_key)

        if fuzzy_query_relative_entities:  # the fuzzy query found results
            min_key = min(fuzzy_query_relative_entities.keys(), key=fuzzy_query_relative_entities.get)  # minimum edit distance
            min_similar_score = fuzzy_query_relative_entities[min_key]
            for prim_mention in fuzzy_query_relative_entities.keys():
                if fuzzy_query_relative_entities[prim_mention] == min_similar_score:
                    relative_entities.extend(mention2entity_dict[prim_mention])
                    # print('fuzzy query matched; topic word and table mention:', input_mention, prim_mention)
        else:
            # the fuzzy query still found nothing
            # print('fuzzy query still found nothing:', input_mention)
            pass

    if input_mention not in relative_entities:
        # some common words are not in the mention2entity table; add the mention itself as a candidate
        relative_entities.append(input_mention)
    return relative_entities

def unify_char_format(string):
    """
    Normalize a string before comparing two strings.
    :param string:
    :return:
    """
    string = unicodedata.normalize('NFKC', string)
    string = string.replace('【', '[').replace('】', ']')  # unify full-width bracket variants
    string = string.lower()
    return string

input_mention = 'Stephen Hawking'
rela_ents = entity_linking(mention2entity_dict, input_mention)
print('Candidate entities matched in the knowledge base:', rela_ents)
```
```
Candidate entities matched in the knowledge base: ['Stephen·gold', 'Steven·gold', 'Steven·Hawking', 'Stephen Hawking']
```
5.2 Index-based candidate triple retrieval
As described in Section 3.3, the pre-built knowledge base index table lets us return the triples related to a given entity.
```python
def search_triples_by_index(relative_entitis, index, raw_graph_f):
    """
    :param relative_entitis: list
    :param index: dict
    :param raw_graph_f: the file pointer of the raw graph file; the content read from it needs post-processing
    :return: list of all the triples related to the input entities (a list of lists)
    """
    relative_triples = []
    for entity in relative_entitis:
        index_entity = index.get(entity, None)
        if index_entity:
            read_index, read_size = index[entity]['start_pos'], index[entity]['length']
            raw_graph_f.seek(read_index)
            readresult = raw_graph_f.read(read_size).decode('utf-8')
            for line in readresult.strip().split('\n'):
                triple = line.strip().split(' ||| ')
                relative_triples.append(triple)
    return relative_triples

input_mention = 'Stephen Hawking'
rela_ents = entity_linking(mention2entity_dict, input_mention)
print('Candidate entities matched in the knowledge base:', rela_ents)
rel_triples = search_triples_by_index(rela_ents, forward_index, forward_KG_f)
print('Retrieved {} triples in total'.format(len(rel_triples)))
print('Printing up to 20 triples:')
print('\n'.join(map(str, rel_triples[:20])))
```
```
Candidate entities matched in the knowledge base: ['Stephen·gold', 'Steven·gold', 'Steven·Hawking', 'Stephen Hawking']
Retrieved 68 triples in total
Printing up to 20 triples:
['Stephen·gold', 'alias', 'Stephen·gold']
['Stephen·gold', 'Chinese name', 'Steven·gold']
['Stephen·gold', 'Alias', 'John·Shi huaisen/ Richard.Buckman']
['Stephen·gold', 'birthplace', 'Maine, USA']
['Stephen·gold', 'occupation', 'writer']
['Stephen·gold', 'Major achievements', 'Rich list in literary and Art Circles']
['Stephen·gold', 'Spouse', 'Tabitha ·gold/ naomi ·Rachel·gold']
['Stephen·gold', 'Son', 'Joe·Hill·gold/Irving·Philip·gold']
['Stephen·gold', 'Daughter', 'naomi ·Rachel·gold']
['Stephen·gold', 'nation', 'American nation']
['Stephen·gold', 'Foreign name', 'Stephen Edwin King']
['Stephen·gold', 'Nationality', 'U.S.A']
['Stephen·gold', 'date of birth', '1947 year']
['Stephen·gold', 'University one is graduated from', 'Department of English, University of Maine']
['Stephen·gold', 'Representative works', '<Shawshank Redemption']
['Stephen·gold', 'Children', 'Joe·Hill·gold/Irving·Philip·gold']
['Stephen·gold', 'Stephen·Kim. February 2007', 'Stephen·Kim. February 2007']
['Stephen·gold', 'Pseudonym', 'Richard Bachman John Swithen']
['Stephen·gold', 'birth', '1947 September 21, 2004 (67 years old) Portland, Maine, USA']
['Stephen·gold', 'occupation', 'writer']
```
6. Classification of candidate triples
6.1 Model structure
The triple classification module adopts a sentence-pair classification model based on Baidu's pretrained model ERNIE. The triple and question information are paired and encoded by ERNIE; the hidden output vector of the [CLS] token passes through a fully connected layer and is then projected to a binary classification output layer.
```python
from paddle import nn
from paddlenlp.transformers import ErniePretrainedModel

class ErnieCLS(ErniePretrainedModel):
    def __init__(self, ernie, label_dim, dropout=None):
        super(ErnieCLS, self).__init__()
        self.label_num = label_dim
        self.ernie = ernie  # allow ernie to be config
        self.dropout = nn.Dropout(dropout if dropout is not None else self.ernie.config["hidden_dropout_prob"])
        self.classifier = nn.Linear(self.ernie.config['hidden_size'], self.label_num)
        self.hidden = nn.Linear(self.ernie.config['hidden_size'], self.ernie.config['hidden_size'])

    def forward(self, words_ids, token_type_ids=None, position_ids=None, attention_mask=None, history_ids=None):
        sequence_output, pooled_output = self.ernie(
            words_ids,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            attention_mask=attention_mask)
        pooled_output = nn.functional.relu(self.hidden(self.dropout(pooled_output)))
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)
        return logits
```
6.2 Constructing numerical features
We take the question as sentence A and the concatenation of a candidate triple's head entity and relation name as sentence B, and use whether the triple is the annotated answer of the question as the positive/negative label of the sentence pair (A, B). This yields the training data for the candidate triple classification model.
The constructed data can be found in the project's work/CandidateTriplesSelection/data directory.
In the numerical feature construction stage, ERNIE's tokenizer can encode the sentence pair directly and splice in the special tokens automatically; a quick check is shown after the next code cell.
```python
def read(data_path):
    all_sample_text1, all_sample_text2, all_sample_labels = [], [], []
    with open(data_path, 'r', encoding='utf-8') as f:
        for line in f.readlines():
            text1, text2, label = line.strip().split('\t')
            all_sample_text1.append(text1)
            all_sample_text2.append(text2)
            all_sample_labels.append(label)
    for idx in range(len(all_sample_labels)):
        yield {"text1": all_sample_text1[idx], "text2": all_sample_text2[idx], "label": all_sample_labels[idx]}

def convert_example_to_feature(example, tokenizer, label2id, max_seq_len=512):
    features = tokenizer(example["text1"], example["text2"], max_seq_len=max_seq_len)
    label_ids = label2id[example["label"]]
    return features["input_ids"], features["token_type_ids"], label_ids
```
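A quick sanity check of the sentence-pair encoding (assuming the ernie-1.0 weights that this project's logs show being downloaded): the tokenizer inserts [CLS] and [SEP] automatically, and token_type_ids switches from 0 to 1 at the boundary between the question and the triple text.

```python
from paddlenlp.transformers import ErnieTokenizer

tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')
# sentence A: the question; sentence B: head entity + relation name of a triple
features = tokenizer('Who is the founder of Microsoft?', 'Microsoft founder', max_seq_len=64)
# input_ids:      [CLS] question tokens [SEP] triple tokens [SEP]
# token_type_ids: 0 for the question segment, 1 for the triple segment
print(features['input_ids'])
print(features['token_type_ids'])
```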
6.3 Model training
As in Section 4.3, only the core training code is shown here. Running run_cls.py in the work/CandidateTriplesSelection directory reproduces the complete training process.
```python
def train():
    train_ds = load_dataset(read, data_path=train_path, lazy=False)  # file -> example
    dev_ds = load_dataset(read, data_path=dev_path, lazy=False)
    tokenizer = ErnieTokenizer.from_pretrained(model_name)
    trans_func = partial(convert_example_to_feature, tokenizer=tokenizer, label2id=label2id,
                         max_seq_len=max_seq_len)
    train_ds = train_ds.map(trans_func, lazy=False)  # example -> feature
    dev_ds = dev_ds.map(trans_func, lazy=False)

    batchify_fn = lambda samples, fn=Tuple(
        Pad(axis=0, pad_val=tokenizer.pad_token_id),
        Pad(axis=0, pad_val=tokenizer.pad_token_type_id),
        Stack(axis=0, dtype='int64'),
    ): fn(samples)

    train_batch_sampler = paddle.io.BatchSampler(train_ds, batch_size=batch_size, shuffle=True)
    dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=batch_size, shuffle=False)
    train_loader = paddle.io.DataLoader(dataset=train_ds, batch_sampler=train_batch_sampler,
                                        collate_fn=batchify_fn, return_list=True)
    dev_loader = paddle.io.DataLoader(dataset=dev_ds, batch_sampler=dev_batch_sampler,
                                      collate_fn=batchify_fn, return_list=True)

    ernie = ErnieModel.from_pretrained(model_name)
    model = ErnieCLS(ernie, len(label2id), dropout=0.1)

    num_training_steps = len(train_loader) * num_epoch
    lr_scheduler = LinearDecayWithWarmup(learning_rate, num_training_steps, warmup_proportion)
    decay_params = [p.name for n, p in model.named_parameters()
                    if not any(nd in n for nd in ["bias", "norm"])]
    grad_clip = paddle.nn.ClipGradByGlobalNorm(max_grad_norm)
    optimizer = paddle.optimizer.AdamW(learning_rate=lr_scheduler, parameters=model.parameters(),
                                       weight_decay=weight_decay,
                                       apply_decay_param_fun=lambda x: x in decay_params,
                                       grad_clip=grad_clip)
    loss_model = paddle.nn.CrossEntropyLoss()
    cls_metric = ClassificationScore(id2label)

    global_step, cls_best_f1 = 0, 0.
    model.train()
    for epoch in range(1, num_epoch + 1):
        for batch_data in train_loader:
            input_ids, token_type_ids, labels = batch_data
            logits = model(input_ids, token_type_ids=token_type_ids)
            loss = loss_model(logits, labels)

            loss.backward()
            lr_scheduler.step()
            optimizer.step()
            optimizer.clear_grad()

            if global_step > 0 and global_step % log_step == 0:
                print(f"epoch: {epoch} - global_step: {global_step}/{num_training_steps} - loss:{loss.numpy().item():.6f}")
            if global_step > 0 and global_step % eval_step == 0:
                cls_results = evaluate(model, dev_loader, cls_metric)
                cls_result = cls_results["1"]
                model.train()
                cls_f1 = cls_result["F1"]
                if cls_f1 > cls_best_f1:
                    paddle.save(model.state_dict(), f"{save_path}/ernie_cls_best.pdparams")
                if cls_f1 > cls_best_f1:
                    print(f"\ncls best F1 performance has been updated: {cls_best_f1:.5f} --> {cls_f1:.5f}")
                    cls_best_f1 = cls_f1
                print(f'\ncls evaluation result: precision: {cls_result["Precision"]:.5f}, recall: {cls_result["Recall"]:.5f}, F1: {cls_result["F1"]:.5f}, current best {cls_best_f1:.5f}\n')
            global_step += 1
```
7. Fine ranking of candidate answers
The triple classifier may return one or more triples predicted as positive, while the goal of this project is for the system to return a single correct answer. To improve the accuracy of that single returned answer, multiple remaining answers need to be ranked. The ranking principle is to measure the similarity between the question's attribute words and each candidate triple's relation name.
7.1 Obtaining the attribute information of questions and triples
The question attribute word indicates which attribute of the topic word the question asks about. For example, in "Who is the husband of Princess Xiangcheng?", the topic word is "Princess Xiangcheng" and the question attribute is "husband".
A rule-based method is used to extract the question attribute words, as follows:
- Remove the topic word obtained by entity recognition. For example, for "Who is the husband of Princess Xiangcheng?", remove the topic word "Princess Xiangcheng".
- Remove stop words, interrogative auxiliary words, and punctuation marks from the question. Interrogative auxiliary words include "which", "how much", "how", etc. Common question openers such as "I want to know", "excuse me", and "I'm curious" are also removed.
The relation name of a triple, also called its attribute name, represents the relationship between two entities in the knowledge base. For example, in the triple "Lake Ogawara - perimeter - 67.4 km", "perimeter" is the relation name.
```python
def span_question(question, ner_result):
    """
    Used in the answer ranking stage to delete information irrelevant to
    ranking, such as the topic word and question words.
    """
    question = question.replace(ner_result, '').replace('<', '').replace('>', '')
    for delete_word in ['I want to know', "I'd like to ask", 'Excuse me', 'Excuse me?', 'You know?',
                        'Who knows', 'know', "I'm curious", 'You ask for me', 'Has anyone seen it',
                        'Is there anyone', 'Yes?', 'this', 'How many', 'What are there', 'Which?',
                        'which one?', 'How many?', 'who', 'By whom', 'also', 'Do you', 'ah',
                        'bar', 'means', 'of', 'yes', 'And', 'Yes', '？', '?', 'what']:
        question = question.replace(delete_word, '')
    return question

span_res = span_question('Who is the husband of Princess Xiangcheng?', 'Princess Xiangcheng')
print(span_res)
```
```
husband
```
7.2 Computing the similarity between the question attribute and the triple relation name
After obtaining the question attribute words and the triple relation name, we compute the similarity between them. Concretely, we compute the character-level Jaccard similarity and the word2vec similarity separately and add them up as the overall similarity score.
```python
def score_similarity(word2vec_model, string1, string2):
    """
    Compare the similarity of two strings with a combined score of character
    overlap and word2vec similarity; used to compare the question with the
    triple relation name when ranking answers.
    :return: similarity score
    """
    return char_overlap(string1, string2) + word2vec_sim(word2vec_model, string1, string2)

def char_overlap(string1, string2):
    # character-level Jaccard similarity
    char_intersection = set(string1) & set(string2)
    char_union = set(string1) | set(string2)
    return len(char_intersection) / len(char_union)

def word2vec_sim(word2vec_model, string1, string2):
    # n_similarity averages the two groups of word vectors, L2-normalizes
    # them, and then computes the inner product (see its source)
    words1 = jieba.cut(string1)
    words2 = jieba.cut(string2)
    de_seg1 = []
    de_seg2 = []
    for seg in words1:
        if seg not in word2vec_model.vocab:
            # fall back to the in-vocabulary characters of an OOV word
            _ws = [_w for _w in seg if _w in word2vec_model.vocab]
            de_seg1.extend(_ws)
        else:
            de_seg1.append(seg)
    for seg in words2:
        if seg not in word2vec_model.vocab:
            _ws = [_w for _w in seg if _w in word2vec_model.vocab]
            de_seg2.extend(_ws)
        else:
            de_seg2.append(seg)
    if de_seg1 and de_seg2:
        score = word2vec_model.n_similarity(de_seg1, de_seg2)
    else:
        score = 0
    return score
```
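The Jaccard part of the score can be verified by hand. For an illustrative English pair (the real comparison runs over Chinese characters):

```python
# set('husband') & set('Spouse') = {'u', 's'}   -> 2 shared characters
# set('husband') | set('Spouse') has 11 distinct characters in total
print(char_overlap('husband', 'Spouse'))  # 2 / 11 ≈ 0.1818
```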
8. The complete KBQA pipeline
The complete KBQA pipeline is implemented as follows:
- For a question entered at the console, predict the question's topic word with the topic word recognition model. If the model predicts nothing, try to extract the topic word with rules. Print the topic word to the console;
- After obtaining the topic word, get a group of candidate entities through the entity linking module and print them to the console;
- Using the pre-built index, retrieve from the knowledge base all triples whose head entity is a candidate entity, i.e., the candidate triples. Since there may be many of them, only the first 20 are printed to the console;
- Run the triple coarse classification model to make a binary prediction over all candidate triples, keeping only the ones predicted positive;
- For the kept triples, use the answer ranking module to compare the similarity between each relation name and the question, scoring and ranking every triple;
- Take the highest-scoring triple as the best triple and return its tail entity as the best answer.
Notes:
- After the relevant libraries in the first three sections of this notebook have been loaded, the pipeline code below can be run directly. Enter a simple question in the input box and observe the KBQA system's predicted answer along with the intermediate results of every step.
- All fine-tuned models of the KBQA system are mounted in the data/data122049 directory and are loaded directly by the pipeline code below, so the code in Section 4 and after does not need to be executed.
```python
def pipeline_predict(question):
    ner_results = ner_predict(NERConfig().best_model_path, question)
    ner_results = set([_result.replace("<", "").replace(">", "") for _result in ner_results])
    # ner_results is a set with 0, 1 or more elements; if it is empty, try the
    # following rules to see whether an entity can still be extracted
    if not ner_results:
        if '<' in question and '>' in question:
            ner_results = re.search(r'(.*)of.*yes.*', question).group(1)
        elif re.search(r'', question):
            ner_results = re.search(r'(.*)of.*yes.*', question).group(1)
        else:
            print('No topic word extracted!')
            return ()
    print('■ Recognized topic words:', ner_results, datetime.datetime.now())

    candidate_entities = []
    for mention in ner_results:
        candidate_entities.extend(entity_linking(mention2entity_dict, mention))
    print('■ Candidate entities found:', candidate_entities, datetime.datetime.now())

    forward_candidate_triples = search_triples_by_index(candidate_entities, forward_index, forward_KG_f)
    candidate_triples = forward_candidate_triples
    candidate_triples = list(filter(lambda x: len(x) == 3, candidate_triples))
    candidate_triples_num = len(candidate_triples)
    print('■ {} candidate triples in total'.format(candidate_triples_num), datetime.datetime.now())
    show_num = 20 if candidate_triples_num > 20 else candidate_triples_num
    print('■ Showing the first {} candidate triples: {}'.format(show_num, candidate_triples[:show_num]))

    candidate_triples_labels = cls_predict(CLSConfig().best_model_path,
                                           [question] * len(candidate_triples),
                                           [triple[0] + triple[1] for triple in candidate_triples])
    predict_triples = [candidate_triples[i] for i in range(len(candidate_triples))
                       if candidate_triples_labels[i] == '1']
    print('■ Coarse classification keeps the following triples:', predict_triples)

    predict_answers = [_triple[2] for _triple in predict_triples]
    if len(predict_answers) == 0:
        print('■ No relevant knowledge was retrieved from the knowledge base. Please try another question......')
        return ()
    elif len(set(predict_answers)) == 1:
        # only one distinct answer, although several triples may provide it
        print('■ The predicted answer is unique; output it directly......')
        best_triple = predict_triples[0]
        best_answer = predict_answers[0]
        print('■ Best answer:', best_answer)
    else:
        # multiple answers predicted: they need to be ranked
        print('■ Multiple answers detected; ranking answers......')
        max_ner = ''  # strip the question with the longest of all NER results
        for _ner in ner_results:
            if len(_ner) > len(max_ner):
                max_ner = _ner
        fine_question = span_question(question, max_ner)
        rel_scores = [score_similarity(word2vec_model, _triple[1].replace(' ', ''), fine_question)
                      for _triple in predict_triples]
        triples_with_score = list(zip(map(tuple, predict_triples), rel_scores))
        triples_with_score.sort(key=lambda x: x[1], reverse=True)
        print('■ Triple ranking result:\n{}'.format(
            "\n".join([str(pair[0]) + '-->' + str(pair[1]) for pair in triples_with_score])))
        best_answer = triples_with_score[0][0][-1]
        print('■ Best answer:', best_answer)

input_question = input('■ Please enter a question:')
# input_question = 'Who is the president of Harbin Institute of Technology?'
# input_question = 'Which nationality is Hugo?'
# input_question = 'Which work is Bai Xiaosheng from?'
print('■ Question entered:', input_question)
pipeline_predict(input_question)
```
```
■ Please enter a question: Which country is Hugo from?
■ Question entered: Which country is Hugo from?
[2021-12-22 17:33:35,001] [ INFO] - Downloading https://paddlenlp.bj.bcebos.com/models/transformers/ernie/ernie_v1_chn_base.pdparams and saved to /home/aistudio/.paddlenlp/models/ernie-1.0
[2021-12-22 17:33:35,003] [ INFO] - Downloading ernie_v1_chn_base.pdparams from https://paddlenlp.bj.bcebos.com/models/transformers/ernie/ernie_v1_chn_base.pdparams
100%|██████████| 392507/392507 [00:09<00:00, 41759.84it/s]
W1222 17:33:44.524525   101 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.0, Runtime API Version: 10.1
W1222 17:33:44.528751   101 device_context.cc:465] device: 0, cuDNN Version: 7.6.
[2021-12-22 17:33:47,166] [ INFO] - Weights from pretrained model not used in ErnieModel: ['cls.predictions.layer_norm.weight', 'cls.predictions.decoder_bias', 'cls.predictions.transform.bias', 'cls.predictions.transform.weight', 'cls.predictions.layer_norm.bias']
[2021-12-22 17:33:47,783] [ INFO] - Downloading https://paddlenlp.bj.bcebos.com/models/transformers/ernie/vocab.txt and saved to /home/aistudio/.paddlenlp/models/ernie-1.0
[2021-12-22 17:33:47,786] [ INFO] - Downloading vocab.txt from https://paddlenlp.bj.bcebos.com/models/transformers/ernie/vocab.txt
100%|██████████| 90/90 [00:00<00:00, 3720.30it/s]
■ Recognized topic words: {'Hugo'} 2021-12-22 17:33:47.919246
■ Candidate entities found: ["Hugo's Secret", 'Hugo(2011 Oscar winning film)', 'Hugo(2011 Martin·Scorsese directed American films)', 'Hugo(2011 year Martin Scorsese Director film)', 'Victor(French writer)', 'Hugo(Fighting game "street overlord" characters)', 'Hugo'] 2021-12-22 17:33:47.921878
■ 109 candidate triples in total 2021-12-22 17:33:47.925449
■ Showing the first 20 candidate triples: [["Hugo's Secret", 'alias', "Hugo's Secret"], ["Hugo's Secret", 'Chinese name', 'Hugo'], ["Hugo's Secret", 'Foreign language name', 'Hugo'], ["Hugo's Secret", 'Other translated names', "Hugo's Paris fantasy adventure"], ["Hugo's Secret", 'Production company', 'Paramount Pictures '], ["Hugo's Secret", 'director', 'Martin·Scorsese'], ["Hugo's Secret", 'Production cost', '1.7 USD100mn'], ["Hugo's Secret", 'Shooting date', '2010 year'], ["Hugo's Secret", 'Film length', '126 minute'], ["Hugo's Secret", 'classification', 'USA: PG'], ["Hugo's Secret", 'color', 'colour'], ["Hugo's Secret", 'to star', 'Asha·Butterfield, colo·Moritz'], ["Hugo's Secret", 'type', 'Plot, science fiction, biography'], ["Hugo's Secret", 'Production time', '2011 November 23'], ["Hugo's Secret", 'Production area', 'U.S.A'], ["Hugo's Secret", 'Screenwriter', 'John·Logan'], ["Hugo's Secret", 'Shooting location', 'U.S.A'], ["Hugo's Secret", 'IMDB score', '7.6'], ["Hugo's Secret", 'Release time', '2012 May 31, 2014 (China)'], ["Hugo's Secret", 'Dialogue language', 'English']]
[2021-12-22 17:33:48,496] [ INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-1.0/ernie_v1_chn_base.pdparams
[2021-12-22 17:33:49,823] [ INFO] - Weights from pretrained model not used in ErnieModel: ['cls.predictions.layer_norm.weight', 'cls.predictions.decoder_bias', 'cls.predictions.transform.bias', 'cls.predictions.transform.weight', 'cls.predictions.layer_norm.bias']
[2021-12-22 17:33:50,497] [ INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-1.0/vocab.txt
■ Coarse classification keeps the following triples: [["Hugo's Secret", 'Title Translation', 'Hugo'], ['Hugo(2011 Oscar winning film)', 'Production area', 'U.S.A'], ['Hugo(2011 Martin·Scorsese directed American films)', 'Production area', 'U.S.A'], ['Hugo(2011 year Martin Scorsese Director film)', 'Production area', 'U.S.A'], ['Victor(French writer)', 'nationality', 'France']]
■ Multiple answers detected; ranking answers......
Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
Loading model cost 0.828 seconds.
Prefix dict has been built successfully.
■ Triple ranking result:
('Victor(French writer)', 'nationality', 'France')-->0.49243852496147156
("Hugo's Secret", 'Title Translation', 'Hugo')-->0.39723604917526245
('Hugo(2011 Oscar winning film)', 'Production area', 'U.S.A')-->0.22878356277942657
('Hugo(2011 Martin·Scorsese directed American films)', 'Production area', 'U.S.A')-->0.22878356277942657
('Hugo(2011 year Martin Scorsese Director film)', 'Production area', 'U.S.A')-->0.22878356277942657
■ Best answer: France
```
9. Evaluation results
Before being migrated to AIStudio, this project was implemented with the PyTorch framework; the topic word recognition module and the triple classification module used the Chinese pretrained PyTorch models of SpanBERT and BERT-base respectively.
Data analysis revealed natural discrepancies between the answer entities in the original question answering dataset and those in the knowledge base, including English letter case, decimal precision, time and date formats, and missing or omitted symbols. Treating an entity as a wrong answer purely because of such format differences introduces errors into the evaluation. The author therefore corrected this kind of problem, obtained a revised question answering dataset, and trained and tested on it in the same way.
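The exact correction rules are not listed in this notebook, so the following is only a hedged sketch of the kind of normalization that aligns gold answers with knowledge base entities:

```python
import unicodedata

def normalize_answer(text):
    # Hypothetical normalization; the revised dataset's actual rules are not
    # published here. It unifies full/half-width forms, English letter case,
    # and spacing, which are among the discrepancy types mentioned above.
    text = unicodedata.normalize('NFKC', text)  # full-width -> half-width, etc.
    text = text.lower().strip()                 # unify English letter case
    text = text.replace(' ', '')                # drop internal spaces
    return text
```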
On the NLPCC2018 test set, with answer accuracy as the evaluation metric, the test results are as follows:

| | Original QA dataset | Revised QA dataset |
| --- | --- | --- |
| This KBQA system | 78.03% | 90.32% |