1. Project introduction
1.1 What is KBQA?
This project implements a simple knowledge-base question answering (KBQA) system. Given an existing knowledge graph, the system performs semantic understanding of simple questions, retrieves the relevant knowledge automatically, and returns the answer to the question.
A traditional search-engine-based question answering system can only return a collection of web pages; the user still has to spend time reading and analyzing those documents to extract an answer. A KBQA system, by contrast, can locate a precise answer directly in the knowledge graph and return it to the user, satisfying precise information needs and providing personalized knowledge services.
1.2 Introduction to the project method
This project divides KBQA into the following four core algorithm modules (a code sketch of the overall data flow follows the list):
- Topic word recognition: identify the topic entity that the question is asking about
- Candidate triple retrieval: use an index to return the triple knowledge related to the topic entity
- Candidate triple classification: run a binary classification over the retrieved candidate triples and filter out the many triples that do not match the question's intent
- Fine ranking of candidate answers: rank the remaining triples by semantic similarity between the question and each candidate triple, so that the best one can serve as the answer
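A minimal sketch of how these four modules chain together. The function names below are placeholders for the components implemented in Sections 4-7 of this notebook, not the project's real API:

```python
# Minimal sketch of the KBQA data flow; all function names are placeholders
# for the modules implemented in Sections 4-7 of this notebook.
def kbqa_answer(question):
    mention = recognize_topic_word(question)    # Section 4: NER over the question
    entities = link_entities(mention)           # Section 5.1: mention -> KB entities
    triples = retrieve_triples(entities)        # Section 5.2: index-based lookup
    kept = [t for t in triples
            if is_relevant(question, t)]        # Section 6: binary classification filter
    best = max(kept,                            # Section 7: fine ranking by similarity
               key=lambda t: similarity(question, t[1]))
    return best[2]                              # the tail entity is the answer
```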
1.3 Reading guide
You can follow this notebook step by step.
Alternatively, after running the first three sections, you can jump straight to Section 8 and run the complete KBQA prediction pipeline.
2. Environment configuration
[Important note] This project needs about 10 GB of memory. To ensure the program runs correctly, please use the advanced or premium GPU environment of AIStudio.
Taking the AIStudio advanced edition as an example, the environment is as follows:
- CPU: 2 cores
- RAM: 16 GB
- GPU: Tesla V100, 16 GB
- Python version: Python 3.7
- Framework version: PaddlePaddle 2.2.1
In addition, execute the following commands to install the gensim library (for loading word2vec) and the python_Levenshtein library (for computing edit distance):

```
!pip install gensim==3.8.1
!pip install python_Levenshtein==0.12.2
```
```
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting gensim==3.8.1
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/44/93/.../gensim-3.8.1-cp37-cp37m-manylinux1_x86_64.whl (24.2MB)
Requirement already satisfied: numpy>=1.11.3 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from gensim==3.8.1) (1.20.3)
Collecting smart-open>=1.8.1 (from gensim==3.8.1)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/cd/11/.../smart_open-5.2.1-py3-none-any.whl (58kB)
Requirement already satisfied: scipy>=0.18.1 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from gensim==3.8.1) (1.6.3)
Requirement already satisfied: six>=1.5.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from gensim==3.8.1) (1.15.0)
Installing collected packages: smart-open, gensim
Successfully installed gensim-3.8.1 smart-open-5.2.1
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting python_Levenshtein==0.12.2
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/2a/dc/.../python-Levenshtein-0.12.2.tar.gz (50kB)
Requirement already satisfied: setuptools in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from python_Levenshtein==0.12.2) (56.2.0)
Building wheels for collected packages: python-Levenshtein
  Building wheel for python-Levenshtein (setup.py) ... done
Successfully built python-Levenshtein
Installing collected packages: python-Levenshtein
Successfully installed python-Levenshtein-0.12.2
```
3. Data loading
3.1 Introduction to the knowledge base dataset
The question answering dataset and knowledge base used in this project come from Task 7 of the NLPCC 2018 shared tasks: Open Domain Question Answering (knowledge-based QA, KBQA). It includes the following two files:
- The Chinese encyclopedia knowledge base file, containing about 20 million entities and 60 million triples, 3.37 GB in size
- The entity mention mapping table of the knowledge base, which maps common real-world entity mentions to entities in the knowledge base
The data above are public: the original dataset can be downloaded directly from the official website of the competition, or a lightly preprocessed version can be obtained from the author's AIStudio dataset.
3.2 Loading the knowledge base dataset
This project mounts the preprocessed knowledge base dataset directly, so it can be read from file and loaded into memory.
```python
from work.TopicWordRecognization.run_ner import predict as ner_predict
from work.CandidateTriplesSelection.run_cls import predict as cls_predict
from work.CandidateTriplesLookup.knowledge_retrieval import entity_linking, search_triples_by_index
from work.AnswerRanking.ranking import span_question, score_similarity
from work.config import KGConfig, CLSConfig, NERConfig
import jieba
import gensim
import datetime
import json
import re
from functools import partial
import paddle
from paddlenlp.datasets import load_dataset
from paddlenlp.transformers import ErnieTokenizer, ErnieModel
from paddlenlp.data import Stack, Pad, Tuple

KGconfig = KGConfig()
mention2entity_clean_path = KGconfig.mention2entity_clean_path
knowledge_graph_path = KGconfig.knowledge_graph_path

print('Loading mention2entity table', datetime.datetime.now())
with open(mention2entity_clean_path, 'r', encoding='utf-8') as f:
    mention2entity_dict = json.loads(f.read())

print('Loading knowledge base', datetime.datetime.now())
forward_KG_f = open(knowledge_graph_path, 'rb')
print('Knowledge base loaded', datetime.datetime.now())
```
```
/opt/conda/envs/python35-paddle120-env/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
  return f(*args, **kwds)
Loading mention2entity table 2021-12-22 17:30:39.294507
Loading knowledge base 2021-12-22 17:31:10.639150
Knowledge base loaded 2021-12-22 17:31:10.641600
```
```python
# Print a sample of the mention2entity mapping table
for idx, (key, value) in enumerate(list(mention2entity_dict.items())[80:90]):
    print(key, '-->', value)
```
```
try to help the shoots grow by pulling them upward --> ['try to help the shoots grow by pulling them upward(Mitchell Hurwitz directed American films)', 'try to help the shoots grow by pulling them upward(idiom)', 'try to help the shoots grow by pulling them upward']
Anterior branch of thoracic nerve --> ['Anterior branch of thoracic nerve']
anteriorbranchofthoracicnerves --> ['Anterior branch of thoracic nerve']
closed loop --> ['closed loop']
htcmytouch4gslide --> ['HTC myTouch 4G Slide']
Class A tertiary hospital --> ['Class A tertiary hospital']
Britannia(Roman province ) --> ['Britannia(Roman province )']
Britannia --> ['Britannia(Roman province )', 'Britannia(The virtual empire of Lu Lu Xiu, a rebel of Japanese animation)', 'Britannia(English goddess)', 'Britannia(Roman province)']
britannia --> ['Britannia(Roman province )', '<Great Britain', 'Britannia(English goddess)']
white collar --> ['white collar', 'White collar workers', 'white collar(A general term for staff)', 'white collar(Network novel created by Drunken Beauty knee)', 'white collar(1962 Korean films in)']
```
3.3 Building and loading the knowledge base index table
In the candidate triple retrieval stage of KBQA, the relevant triples must be retrieved from the knowledge base according to the topic entity of the question. To avoid the huge overhead of traversing the whole knowledge base on every triple query, we build an index table over the knowledge base file, which greatly reduces the time cost.
The knowledge base data file mounted in this project has already been sorted by head entity. Given a query entity, the index table can quickly locate the entity's position in the knowledge base and return all triples that have it as the head entity.
In the implementation, we open the knowledge base file as a byte stream, locate the starting position of each entity with Python's tell() method, and record the total length (in bytes) of all triples that have it as the head entity.
The constructed index table is a hash structure (a dictionary), which we save to a specified directory; an example entry is shown below.
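Concretely, each entry maps a head entity to the byte offset where its block of triples starts and the block's length. The 'Yao Ming' numbers below are the ones printed by the query cell later in this section:

```python
# Shape of the forward index: one entry per head entity in the knowledge base.
# The 'Yao Ming' values match the query output shown later in this section.
forward_index = {
    'Yao Ming': {'start_pos': 2231973201, 'length': 2598},
    # ... one entry per head entity
}
```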
```python
def make_KG_index(knowledge_graph_path, forward_index_path):
    """
    Read the KG file and build a forward index keyed by head entity, in
    dictionary format: {mention: {'start_pos': int, 'length': int}, ...}
    To read the KG with the index:
        with open(knowledge_graph_path, 'rb') as f:
            f.seek(223)
            readresult = f.read(448).decode('utf-8')
    """
    def make_index(graph_path, index_path):
        print('begin to read KG', datetime.datetime.now())
        index_dict = dict()
        with open(graph_path, 'r', encoding='utf-8') as f:
            previous_entity = ''
            previous_start = 0
            while True:
                start_pos = f.tell()
                line = f.readline()
                if not line:
                    break
                entity = line.split(' ||| ')[0]
                if entity != previous_entity and previous_entity:
                    tmp_dict = dict()
                    tmp_dict['start_pos'] = previous_start
                    tmp_dict['length'] = start_pos - previous_start
                    index_dict[previous_entity] = tmp_dict
                    previous_start = start_pos
                previous_entity = entity
            # record the final entity, whose triples run to the end of file
            if previous_entity:
                index_dict[previous_entity] = {'start_pos': previous_start,
                                               'length': start_pos - previous_start}
        print('finish reading KG, begin to write', datetime.datetime.now())
        with open(index_path, 'w', encoding='utf-8') as f:
            f.write(json.dumps(index_dict, ensure_ascii=False))
        print('finish writing', datetime.datetime.now())

    make_index(knowledge_graph_path, forward_index_path)
```
In this project, the index table of the knowledge base file has been built in advance and mounted in the project's dataset, so it can be loaded and used directly.
```python
print('Loading index table', datetime.datetime.now())
forward_index_path = KGconfig.forward_index_path
with open(forward_index_path, 'r', encoding='utf-8') as f:
    forward_index = json.loads(f.read())
print('Index table loaded', datetime.datetime.now())

# Query the index table with a given entity, read its triples from the
# knowledge base, and print the first 20 characters of the result
entity = 'Yao Ming'
read_index, read_size = forward_index[entity]['start_pos'], forward_index[entity]['length']
print(read_index, read_size)
forward_KG_f.seek(read_index)
readresult = forward_KG_f.read(read_size).decode('utf-8')
print(readresult[:20])
```
```
Loading index table 2021-12-22 17:31:31.591060
Index table loaded 2021-12-22 17:31:49.429286
2231973201 2598
Yao Ming ||| alias ||| Yao Ming
Yao Ming
```
3.4 Loading the word2vec model
Besides the knowledge base dataset above, the answer ranking module of this project uses pretrained word2vec word vectors, which can be preloaded before the main program runs.
The word2vec model used here comes from an open-source word2vec repository on GitHub; we use the 300-dimensional word vectors trained on Baidu Encyclopedia with Chinese characters + words as context features.
The word vector model has been renamed sgns.target.word-character and mounted in the data/data122049 directory of the project. It can be loaded with the following code.
```python
from work.config import Word2VecConfig
from gensim.models import KeyedVectors

def load_word2vec():
    word2vec_model_path = Word2VecConfig().model_path  # location of the word vector file
    print('Preloading word2vec word vectors, expected to take about 2 min', datetime.datetime.now())
    word2vec_model = KeyedVectors.load_word2vec_format(word2vec_model_path, binary=False, unicode_errors='ignore')
    print('word2vec word vectors loaded', datetime.datetime.now())
    return word2vec_model

word2vec_model = load_word2vec()
```
```
Preloading word2vec word vectors, expected to take about 2 min 2021-12-22 17:31:53.311305
word2vec word vectors loaded 2021-12-22 17:33:19.258214
```
4. Topic word recognition of questions
Given a question, the system needs to determine what its core query object is. The topic word of the question is that core object; it helps us find the corresponding entities in the knowledge graph and then obtain the answer. For example, for the question "Who is the founder of Microsoft?", the topic word is "Microsoft".
4.1 Model structure
The topic word recognition module adopts an entity recognition model based on Baidu's pretrained model ERNIE. After the question is encoded by ERNIE, a BIO tag is predicted for each token, where B marks the beginning character of a topic word, I marks a character inside a topic word, and O means the token does not belong to any topic word. In the model, the last hidden state of each text token passes through a fully connected layer and is then projected to a three-way classification output layer.
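As an illustration of the tagging scheme (a toy English example; the actual training data is Chinese and labeled at the character level):

```python
# BIO tags for "Who is the founder of Microsoft?" with topic word "Microsoft"
tokens = ['Who', 'is', 'the', 'founder', 'of', 'Micro', '##soft', '?']
labels = ['O',   'O',  'O',   'O',       'O',  'B',     'I',      'O']
```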
```python
import paddle
from paddle import nn
from paddlenlp.transformers import ErniePretrainedModel

class ErnieNER(ErniePretrainedModel):
    def __init__(self, ernie, label_dim, dropout=None):
        super(ErnieNER, self).__init__()
        self.label_num = label_dim
        self.ernie = ernie  # allow ernie to be config
        self.dropout = nn.Dropout(dropout if dropout is not None else self.ernie.config["hidden_dropout_prob"])
        self.classifier = nn.Linear(self.ernie.config['hidden_size'], self.label_num)
        self.hidden = nn.Linear(self.ernie.config['hidden_size'], self.ernie.config['hidden_size'])

    def forward(self, words_ids, token_type_ids=None, position_ids=None, attention_mask=None, history_ids=None):
        sequence_output, pooled_output = self.ernie(
            words_ids,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            attention_mask=attention_mask)
        sequence_output = nn.functional.relu(self.hidden(self.dropout(sequence_output)))
        sequence_output = self.dropout(sequence_output)
        logits = self.classifier(sequence_output)
        return logits
```
4.2 Constructing numerical features
The training data of the model comes from topic word NER annotations built on the NLPCC2018 KBQA question answering dataset.
For each question, the topic word is marked with BIO tags. We organize the data as (token, label) pairs, split them into training and validation sets, and save them to file. The constructed data can be found in the project's work/TopicWordRecognization/data directory.
Before training, we read the text-format data from file, use ERNIE's tokenizer to convert the sentence text into the numerical features the model takes as input, and splice in the pretrained model's special tokens. The label sequence of each sentence also needs to be converted into numerical features and padded at the positions of the special tokens.
```python
def read(data_path):
    all_sample_words, all_sample_labels = [], []
    with open(data_path, 'r', encoding='utf-8') as f:
        tmp_sample_words, tmp_sample_labels = [], []
        for line in f.readlines():
            if line == '\n' and tmp_sample_words and tmp_sample_labels:
                all_sample_words.append(tmp_sample_words)
                all_sample_labels.append(tmp_sample_labels)
                tmp_sample_words, tmp_sample_labels = [], []
            else:
                word, label = line.strip().split(' ')[0], line.strip().split(' ')[1]
                tmp_sample_words.append(word)
                tmp_sample_labels.append(label)
    for idx in range(len(all_sample_words)):
        yield {"words": all_sample_words[idx], "labels": all_sample_labels[idx]}

def convert_example_to_feature(example, tokenizer, label2id, pad_default_tag="O", max_seq_len=512):
    features = tokenizer(example["words"], is_split_into_words=True, max_seq_len=max_seq_len)
    label_ids = [label2id[label] for label in example["labels"][:max_seq_len - 2]]
    # pad the label sequence with the default tag at the [CLS] and [SEP] positions
    label_ids = [label2id[pad_default_tag]] + label_ids + [label2id[pad_default_tag]]
    assert len(features["input_ids"]) == len(label_ids)
    return features["input_ids"], features["token_type_ids"], label_ids
```
4.3 Model training
Only the core code of model training is shown here. Running run_ner.py in the work/TopicWordRecognization directory reproduces the complete training process.
The trained model ~/data/data122049/ernie_ner_best.pdparams is attached to this project and can be used directly for prediction in the KBQA pipeline.
```python
def train():
    train_ds = load_dataset(read, data_path=train_path, lazy=False)  # file -> example
    dev_ds = load_dataset(read, data_path=dev_path, lazy=False)
    tokenizer = ErnieTokenizer.from_pretrained(model_name)
    trans_func = partial(convert_example_to_feature, tokenizer=tokenizer, label2id=label2id,
                         pad_default_tag="O", max_seq_len=max_seq_len)
    train_ds = train_ds.map(trans_func, lazy=False)  # example -> feature
    dev_ds = dev_ds.map(trans_func, lazy=False)

    batchify_fn = lambda samples, fn=Tuple(
        Pad(axis=0, pad_val=tokenizer.pad_token_id),
        Pad(axis=0, pad_val=tokenizer.pad_token_type_id),
        Pad(axis=0, pad_val=label2id["O"], dtype='int64'),
    ): fn(samples)

    train_batch_sampler = paddle.io.BatchSampler(train_ds, batch_size=batch_size, shuffle=True)
    dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=batch_size, shuffle=False)
    train_loader = paddle.io.DataLoader(dataset=train_ds, batch_sampler=train_batch_sampler,
                                        collate_fn=batchify_fn, return_list=True)
    dev_loader = paddle.io.DataLoader(dataset=dev_ds, batch_sampler=dev_batch_sampler,
                                      collate_fn=batchify_fn, return_list=True)

    ernie = ErnieModel.from_pretrained(model_name)
    model = ErnieNER(ernie, len(label2id), dropout=0.1)

    num_training_steps = len(train_loader) * num_epoch
    lr_scheduler = LinearDecayWithWarmup(learning_rate, num_training_steps, warmup_proportion)
    decay_params = [p.name for n, p in model.named_parameters()
                    if not any(nd in n for nd in ["bias", "norm"])]
    grad_clip = paddle.nn.ClipGradByGlobalNorm(max_grad_norm)
    optimizer = paddle.optimizer.AdamW(learning_rate=lr_scheduler, parameters=model.parameters(),
                                       weight_decay=weight_decay,
                                       apply_decay_param_fun=lambda x: x in decay_params,
                                       grad_clip=grad_clip)
    loss_model = paddle.nn.CrossEntropyLoss()
    ner_metric = SeqEntityScore(id2label)

    global_step, ner_best_f1 = 0, 0.
    model.train()
    for epoch in range(1, num_epoch + 1):
        for batch_data in train_loader:
            input_ids, token_type_ids, labels = batch_data
            logits = model(input_ids, token_type_ids=token_type_ids)
            loss = loss_model(logits, labels)

            loss.backward()
            lr_scheduler.step()
            optimizer.step()
            optimizer.clear_grad()

            if global_step > 0 and global_step % log_step == 0:
                print(f"epoch: {epoch} - global_step: {global_step}/{num_training_steps} - loss:{loss.numpy().item():.6f}")
            if global_step > 0 and global_step % eval_step == 0:
                ner_results = evaluate(model, dev_loader, ner_metric)
                ner_result = ner_results["Total"]
                model.train()
                ner_f1 = ner_result["F1"]
                # if ner_f1 > ner_best_f1:
                #     paddle.save(model.state_dict(), f"{save_path}/ernie_ner_best.pdparams")
                if ner_f1 > ner_best_f1:
                    print(f"\nner best F1 performance has been updated: {ner_best_f1:.5f} --> {ner_f1:.5f}")
                    ner_best_f1 = ner_f1
                print(f'\nner evaluation result: precision: {ner_result["Precision"]:.5f}, recall: {ner_result["Recall"]:.5f}, F1: {ner_result["F1"]:.5f}, current best {ner_best_f1:.5f}\n')
            global_step += 1
```
5. Candidate triple retrieval
5.1 Entity linking based on fuzzy query
After obtaining the topic word of a question, we still need to map it to the relevant entity nodes in the knowledge graph. In most cases a topic word corresponds to multiple entities, which we call candidate entities. For example, after identifying the topic word "Ma Yun" in a question, it needs to be linked to entities such as "Ma Yun (founder of Alibaba)" and "Ma Yun (associate professor at Yunnan University for Nationalities)" in the knowledge graph.
Although the knowledge base dataset provides a mention2entity mapping table, natural language is diverse, so the topic word of a question may not exactly match any mention in the table. To improve the accuracy of entity lookup, the following rule- and edit-distance-based fuzzy query is applied on top of exact matching.
For a topic word obtained by entity recognition on the question, first try exact matching to see whether an identical mention exists in the mapping table. If exact matching retrieves nothing, traverse all mentions in the mapping table, normalizing the characters of the topic word and of each mention with simple rules. Exact matching is then replaced by computing the edit distance between each mention and the topic word, recording the result for each mention; the smaller the edit distance, the closer the mention is to the topic word. If a mention is of combined type (it contains commas, enumeration marks, "or", or other possible separator characters), it is first split into multiple mentions by the separators, each of which is compared with the topic word. Finally, the edit distances of all mentions are compared, and the mentions with the minimum edit distance are returned as the query result.
```python
import Levenshtein
import re
import unicodedata

def entity_linking(mention2entity_dict, input_mention):
    """
    Given the NER result of a question (input_mention), find the closely
    related mentions in mention2entity_dict and return their entities.
    Some rules are used to match more mentions.
    :param mention2entity_dict:
    :param input_mention:
    :return:
    """
    if input_mention == 'NONE':
        # nothing was recognized: return an empty candidate entity list (same below)
        return []
    input_mention = input_mention.replace(" ", "")  # mentions in mention2entity are de-blanked, so de-blank the NER result too
    relative_entities = mention2entity_dict.get(input_mention, [])  # try an exact lookup first

    if not relative_entities:  # exact lookup failed: fall back to fuzzy query
        # The fuzzy query traverses the whole mapping table, computes the edit
        # distance of every mention considered similar, and keeps the smallest ones
        fuzzy_query_relative_entities = dict()
        input_mention = unify_char_format(input_mention)
        for mention_key in mention2entity_dict.keys():
            prim_mention = mention_key
            _find = False
            # normalize the data format first
            mention_key = unify_char_format(mention_key)
            if len(mention_key) == 0:
                continue
            if '\\' == mention_key[-1]:
                mention_key = mention_key[:-1] + '"'
            # combined mention
            if ',' in mention_key or '，' in mention_key or '\\\\' in mention_key or ';' in mention_key \
                    or ('or' in mention_key and 'or' not in input_mention):
                mention_splits = re.split(r'[,;，]|or|\\\\', mention_key)
                for _mention in mention_splits:
                    if (len(input_mention) < 6 and Levenshtein.distance(input_mention, _mention) <= 1) \
                            or (len(input_mention) >= 6 and Levenshtein.distance(input_mention, _mention) <= 4) \
                            or (len(input_mention) >= 20 and Levenshtein.distance(input_mention, _mention) <= 10):
                        _find = True
                        fuzzy_query_relative_entities[prim_mention] = Levenshtein.distance(input_mention, _mention)
            # non-combined mention
            else:
                if (len(input_mention) < 6 and Levenshtein.distance(input_mention, mention_key) <= 1) \
                        or (len(input_mention) >= 6 and Levenshtein.distance(input_mention, mention_key) <= 4) \
                        or (len(input_mention) >= 20 and Levenshtein.distance(input_mention, mention_key) <= 10):
                    _find = True
                    fuzzy_query_relative_entities[prim_mention] = Levenshtein.distance(input_mention, mention_key)

        if fuzzy_query_relative_entities:  # the fuzzy query found results
            min_key = min(fuzzy_query_relative_entities.keys(), key=fuzzy_query_relative_entities.get)  # minimum edit distance
            min_similar_score = fuzzy_query_relative_entities[min_key]
            for prim_mention in fuzzy_query_relative_entities.keys():
                if fuzzy_query_relative_entities[prim_mention] == min_similar_score:
                    relative_entities.extend(mention2entity_dict[prim_mention])
                    # print('fuzzy query matched; topic word and table mention:', input_mention, prim_mention)
        else:
            # the fuzzy query still found nothing
            # print('fuzzy query still found nothing:', input_mention)
            pass

    if input_mention not in relative_entities:
        # some common words are not in the mention2entity table; add the mention itself as a candidate
        relative_entities.append(input_mention)
    return relative_entities

def unify_char_format(string):
    """
    Normalize a string before comparing two strings.
    :param string:
    :return:
    """
    string = unicodedata.normalize('NFKC', string)
    string = string.replace('【', '[').replace('】', ']')  # unify full-width bracket variants
    string = string.lower()
    return string

input_mention = 'Stephen Hawking'
rela_ents = entity_linking(mention2entity_dict, input_mention)
print('Candidate entities matched in the knowledge base:', rela_ents)
```
```
Candidate entities matched in the knowledge base: ['Stephen·gold', 'Steven·gold', 'Steven·Hawking', 'Stephen Hawking']
```
5.2 Index-based candidate triple retrieval
As described in Section 3.3, the pre-built knowledge base index table lets us return the triples related to a given entity.
```python
def search_triples_by_index(relative_entitis, index, raw_graph_f):
    """
    :param relative_entitis: list
    :param index: dict
    :param raw_graph_f: the file pointer of the raw graph file; the content read from it needs post-processing
    :return: list of all the triples related to the input entities (a list of lists)
    """
    relative_triples = []
    for entity in relative_entitis:
        index_entity = index.get(entity, None)
        if index_entity:
            read_index, read_size = index[entity]['start_pos'], index[entity]['length']
            raw_graph_f.seek(read_index)
            readresult = raw_graph_f.read(read_size).decode('utf-8')
            for line in readresult.strip().split('\n'):
                triple = line.strip().split(' ||| ')
                relative_triples.append(triple)
    return relative_triples

input_mention = 'Stephen Hawking'
rela_ents = entity_linking(mention2entity_dict, input_mention)
print('Candidate entities matched in the knowledge base:', rela_ents)
rel_triples = search_triples_by_index(rela_ents, forward_index, forward_KG_f)
print('Retrieved {} triples in total'.format(len(rel_triples)))
print('Printing up to 20 triples:')
print('\n'.join(map(str, rel_triples[:20])))
```
```
Candidate entities matched in the knowledge base: ['Stephen·gold', 'Steven·gold', 'Steven·Hawking', 'Stephen Hawking']
Retrieved 68 triples in total
Printing up to 20 triples:
['Stephen·gold', 'alias', 'Stephen·gold']
['Stephen·gold', 'Chinese name', 'Steven·gold']
['Stephen·gold', 'Alias', 'John·Shi huaisen/ Richard.Buckman']
['Stephen·gold', 'birthplace', 'Maine, USA']
['Stephen·gold', 'occupation', 'writer']
['Stephen·gold', 'Major achievements', 'Rich list in literary and Art Circles']
['Stephen·gold', 'Spouse', 'Tabitha ·gold/ naomi ·Rachel·gold']
['Stephen·gold', 'Son', 'Joe·Hill·gold/Irving·Philip·gold']
['Stephen·gold', 'Daughter', 'naomi ·Rachel·gold']
['Stephen·gold', 'nation', 'American nation']
['Stephen·gold', 'Foreign name', 'Stephen Edwin King']
['Stephen·gold', 'Nationality', 'U.S.A']
['Stephen·gold', 'date of birth', '1947 year']
['Stephen·gold', 'University one is graduated from', 'Department of English, University of Maine']
['Stephen·gold', 'Representative works', '<Shawshank Redemption']
['Stephen·gold', 'Children', 'Joe·Hill·gold/Irving·Philip·gold']
['Stephen·gold', 'Stephen·Kim. February 2007', 'Stephen·Kim. February 2007']
['Stephen·gold', 'Pseudonym', 'Richard Bachman John Swithen']
['Stephen·gold', 'birth', '1947 September 21, 2004 (67 years old) Portland, Maine, USA']
['Stephen·gold', 'occupation', 'writer']
```
6. Classification of candidate triples
6.1 Model structure
The triple classification module adopts a sentence-pair classification model based on Baidu's pretrained model ERNIE. The triple and question information are paired and encoded by ERNIE; the hidden output vector of the [CLS] token passes through a fully connected layer and is then projected to a binary classification output layer.
```python
from paddle import nn
from paddlenlp.transformers import ErniePretrainedModel

class ErnieCLS(ErniePretrainedModel):
    def __init__(self, ernie, label_dim, dropout=None):
        super(ErnieCLS, self).__init__()
        self.label_num = label_dim
        self.ernie = ernie  # allow ernie to be config
        self.dropout = nn.Dropout(dropout if dropout is not None else self.ernie.config["hidden_dropout_prob"])
        self.classifier = nn.Linear(self.ernie.config['hidden_size'], self.label_num)
        self.hidden = nn.Linear(self.ernie.config['hidden_size'], self.ernie.config['hidden_size'])

    def forward(self, words_ids, token_type_ids=None, position_ids=None, attention_mask=None, history_ids=None):
        sequence_output, pooled_output = self.ernie(
            words_ids,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            attention_mask=attention_mask)
        pooled_output = nn.functional.relu(self.hidden(self.dropout(pooled_output)))
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)
        return logits
```
6.2 Constructing numerical features
We take the question as sentence A and the concatenation of a candidate triple's head entity and relation name as sentence B, and use whether the triple is the annotated answer of the question as the positive/negative label of the sentence pair (A, B). This yields the training data for the candidate triple classification model.
The constructed data can be found in the project's work/CandidateTriplesSelection/data directory.
In the numerical feature construction stage, ERNIE's tokenizer can encode the sentence pair directly and splice in the special tokens automatically; a quick check is shown after the next code cell.
```python
def read(data_path):
    all_sample_text1, all_sample_text2, all_sample_labels = [], [], []
    with open(data_path, 'r', encoding='utf-8') as f:
        for line in f.readlines():
            text1, text2, label = line.strip().split('\t')
            all_sample_text1.append(text1)
            all_sample_text2.append(text2)
            all_sample_labels.append(label)
    for idx in range(len(all_sample_labels)):
        yield {"text1": all_sample_text1[idx], "text2": all_sample_text2[idx], "label": all_sample_labels[idx]}

def convert_example_to_feature(example, tokenizer, label2id, max_seq_len=512):
    features = tokenizer(example["text1"], example["text2"], max_seq_len=max_seq_len)
    label_ids = label2id[example["label"]]
    return features["input_ids"], features["token_type_ids"], label_ids
```
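A quick sanity check of the sentence-pair encoding (assuming the ernie-1.0 weights that this project's logs show being downloaded): the tokenizer inserts [CLS] and [SEP] automatically, and token_type_ids switches from 0 to 1 at the boundary between the question and the triple text.

```python
from paddlenlp.transformers import ErnieTokenizer

tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')
# sentence A: the question; sentence B: head entity + relation name of a triple
features = tokenizer('Who is the founder of Microsoft?', 'Microsoft founder', max_seq_len=64)
# input_ids:      [CLS] question tokens [SEP] triple tokens [SEP]
# token_type_ids: 0 for the question segment, 1 for the triple segment
print(features['input_ids'])
print(features['token_type_ids'])
```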
6.3 Model training
As in Section 4.3, only the core training code is shown here. Running run_cls.py in the work/CandidateTriplesSelection directory reproduces the complete training process.
```python
def train():
    train_ds = load_dataset(read, data_path=train_path, lazy=False)  # file -> example
    dev_ds = load_dataset(read, data_path=dev_path, lazy=False)
    tokenizer = ErnieTokenizer.from_pretrained(model_name)
    trans_func = partial(convert_example_to_feature, tokenizer=tokenizer, label2id=label2id,
                         max_seq_len=max_seq_len)
    train_ds = train_ds.map(trans_func, lazy=False)  # example -> feature
    dev_ds = dev_ds.map(trans_func, lazy=False)

    batchify_fn = lambda samples, fn=Tuple(
        Pad(axis=0, pad_val=tokenizer.pad_token_id),
        Pad(axis=0, pad_val=tokenizer.pad_token_type_id),
        Stack(axis=0, dtype='int64'),
    ): fn(samples)

    train_batch_sampler = paddle.io.BatchSampler(train_ds, batch_size=batch_size, shuffle=True)
    dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=batch_size, shuffle=False)
    train_loader = paddle.io.DataLoader(dataset=train_ds, batch_sampler=train_batch_sampler,
                                        collate_fn=batchify_fn, return_list=True)
    dev_loader = paddle.io.DataLoader(dataset=dev_ds, batch_sampler=dev_batch_sampler,
                                      collate_fn=batchify_fn, return_list=True)

    ernie = ErnieModel.from_pretrained(model_name)
    model = ErnieCLS(ernie, len(label2id), dropout=0.1)

    num_training_steps = len(train_loader) * num_epoch
    lr_scheduler = LinearDecayWithWarmup(learning_rate, num_training_steps, warmup_proportion)
    decay_params = [p.name for n, p in model.named_parameters()
                    if not any(nd in n for nd in ["bias", "norm"])]
    grad_clip = paddle.nn.ClipGradByGlobalNorm(max_grad_norm)
    optimizer = paddle.optimizer.AdamW(learning_rate=lr_scheduler, parameters=model.parameters(),
                                       weight_decay=weight_decay,
                                       apply_decay_param_fun=lambda x: x in decay_params,
                                       grad_clip=grad_clip)
    loss_model = paddle.nn.CrossEntropyLoss()
    cls_metric = ClassificationScore(id2label)

    global_step, cls_best_f1 = 0, 0.
    model.train()
    for epoch in range(1, num_epoch + 1):
        for batch_data in train_loader:
            input_ids, token_type_ids, labels = batch_data
            logits = model(input_ids, token_type_ids=token_type_ids)
            loss = loss_model(logits, labels)

            loss.backward()
            lr_scheduler.step()
            optimizer.step()
            optimizer.clear_grad()

            if global_step > 0 and global_step % log_step == 0:
                print(f"epoch: {epoch} - global_step: {global_step}/{num_training_steps} - loss:{loss.numpy().item():.6f}")
            if global_step > 0 and global_step % eval_step == 0:
                cls_results = evaluate(model, dev_loader, cls_metric)
                cls_result = cls_results["1"]
                model.train()
                cls_f1 = cls_result["F1"]
                if cls_f1 > cls_best_f1:
                    paddle.save(model.state_dict(), f"{save_path}/ernie_cls_best.pdparams")
                if cls_f1 > cls_best_f1:
                    print(f"\ncls best F1 performance has been updated: {cls_best_f1:.5f} --> {cls_f1:.5f}")
                    cls_best_f1 = cls_f1
                print(f'\ncls evaluation result: precision: {cls_result["Precision"]:.5f}, recall: {cls_result["Recall"]:.5f}, F1: {cls_result["F1"]:.5f}, current best {cls_best_f1:.5f}\n')
            global_step += 1
```
7. Fine ranking of candidate answers
The triple classifier may return one or more triples predicted as positive, while the goal of this project is for the system to return a single correct answer. To improve the accuracy of that single returned answer, multiple remaining answers need to be ranked. The ranking principle is to measure the similarity between the question's attribute words and each candidate triple's relation name.
7.1 Obtaining the attribute information of questions and triples
The question attribute word indicates which attribute of the topic word the question asks about. For example, in "Who is the husband of Princess Xiangcheng?", the topic word is "Princess Xiangcheng" and the question attribute is "husband".
A rule-based method is used to extract the question attribute words, as follows:
- Remove the topic word obtained by entity recognition. For example, for "Who is the husband of Princess Xiangcheng?", remove the topic word "Princess Xiangcheng".
- Remove stop words, interrogative auxiliary words, and punctuation marks from the question. Interrogative auxiliary words include "which", "how much", "how", etc. Common question openers such as "I want to know", "excuse me", and "I'm curious" are also removed.
The relation name of a triple, also called its attribute name, represents the relationship between two entities in the knowledge base. For example, in the triple "Lake Ogawara - perimeter - 67.4 km", "perimeter" is the relation name.
```python
def span_question(question, ner_result):
    """
    Used in the answer ranking stage to delete information irrelevant to
    ranking, such as the topic word and question words.
    """
    question = question.replace(ner_result, '').replace('<', '').replace('>', '')
    for delete_word in ['I want to know', "I'd like to ask", 'Excuse me', 'Excuse me?', 'You know?',
                        'Who knows', 'know', "I'm curious", 'You ask for me', 'Has anyone seen it',
                        'Is there anyone', 'Yes?', 'this', 'How many', 'What are there', 'Which?',
                        'which one?', 'How many?', 'who', 'By whom', 'also', 'Do you', 'ah',
                        'bar', 'means', 'of', 'yes', 'And', 'Yes', '？', '?', 'what']:
        question = question.replace(delete_word, '')
    return question

span_res = span_question('Who is the husband of Princess Xiangcheng?', 'Princess Xiangcheng')
print(span_res)
```
```
husband
```
7.2 Computing the similarity between the question attribute and the triple relation name
After obtaining the question attribute words and the triple relation name, we compute the similarity between them. Concretely, we compute the character-level Jaccard similarity and the word2vec similarity separately and add them up as the overall similarity score.
```python
def score_similarity(word2vec_model, string1, string2):
    """
    Compare the similarity of two strings with a combined score of character
    overlap and word2vec similarity; used to compare the question with the
    triple relation name when ranking answers.
    :return: similarity score
    """
    return char_overlap(string1, string2) + word2vec_sim(word2vec_model, string1, string2)

def char_overlap(string1, string2):
    # character-level Jaccard similarity
    char_intersection = set(string1) & set(string2)
    char_union = set(string1) | set(string2)
    return len(char_intersection) / len(char_union)

def word2vec_sim(word2vec_model, string1, string2):
    # n_similarity averages the two groups of word vectors, L2-normalizes
    # them, and then computes the inner product (see its source)
    words1 = jieba.cut(string1)
    words2 = jieba.cut(string2)
    de_seg1 = []
    de_seg2 = []
    for seg in words1:
        if seg not in word2vec_model.vocab:
            # fall back to the in-vocabulary characters of an OOV word
            _ws = [_w for _w in seg if _w in word2vec_model.vocab]
            de_seg1.extend(_ws)
        else:
            de_seg1.append(seg)
    for seg in words2:
        if seg not in word2vec_model.vocab:
            _ws = [_w for _w in seg if _w in word2vec_model.vocab]
            de_seg2.extend(_ws)
        else:
            de_seg2.append(seg)
    if de_seg1 and de_seg2:
        score = word2vec_model.n_similarity(de_seg1, de_seg2)
    else:
        score = 0
    return score
```
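The Jaccard part of the score can be verified by hand. For an illustrative English pair (the real comparison runs over Chinese characters):

```python
# set('husband') & set('Spouse') = {'u', 's'}   -> 2 shared characters
# set('husband') | set('Spouse') has 11 distinct characters in total
print(char_overlap('husband', 'Spouse'))  # 2 / 11 ≈ 0.1818
```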
8. The complete KBQA pipeline
The complete KBQA pipeline is implemented as follows:
- For a question entered at the console, predict the question's topic word with the topic word recognition model. If the model predicts nothing, try to extract the topic word with rules. Print the topic word to the console;
- After obtaining the topic word, get a group of candidate entities through the entity linking module and print them to the console;
- Using the pre-built index, retrieve from the knowledge base all triples whose head entity is a candidate entity, i.e., the candidate triples. Since there may be many of them, only the first 20 are printed to the console;
- Run the triple coarse classification model to make a binary prediction over all candidate triples, keeping only the ones predicted positive;
- For the kept triples, use the answer ranking module to compare the similarity between each relation name and the question, scoring and ranking every triple;
- Take the highest-scoring triple as the best triple and return its tail entity as the best answer.
Notes:
- After the relevant libraries in the first three sections of this notebook have been loaded, the pipeline code below can be run directly. Enter a simple question in the input box and observe the KBQA system's predicted answer along with the intermediate results of every step.
- All fine-tuned models of the KBQA system are mounted in the data/data122049 directory and are loaded directly by the pipeline code below, so the code in Section 4 and after does not need to be executed.
```python
def pipeline_predict(question):
    ner_results = ner_predict(NERConfig().best_model_path, question)
    ner_results = set([_result.replace("<", "").replace(">", "") for _result in ner_results])
    # ner_results is a set with 0, 1 or more elements; if it is empty, try the
    # following rules to see whether an entity can still be extracted
    if not ner_results:
        if '<' in question and '>' in question:
            ner_results = re.search(r'(.*)of.*yes.*', question).group(1)
        elif re.search(r'', question):
            ner_results = re.search(r'(.*)of.*yes.*', question).group(1)
        else:
            print('No topic word extracted!')
            return ()
    print('■ Recognized topic words:', ner_results, datetime.datetime.now())

    candidate_entities = []
    for mention in ner_results:
        candidate_entities.extend(entity_linking(mention2entity_dict, mention))
    print('■ Candidate entities found:', candidate_entities, datetime.datetime.now())

    forward_candidate_triples = search_triples_by_index(candidate_entities, forward_index, forward_KG_f)
    candidate_triples = forward_candidate_triples
    candidate_triples = list(filter(lambda x: len(x) == 3, candidate_triples))
    candidate_triples_num = len(candidate_triples)
    print('■ {} candidate triples in total'.format(candidate_triples_num), datetime.datetime.now())
    show_num = 20 if candidate_triples_num > 20 else candidate_triples_num
    print('■ Showing the first {} candidate triples: {}'.format(show_num, candidate_triples[:show_num]))

    candidate_triples_labels = cls_predict(CLSConfig().best_model_path,
                                           [question] * len(candidate_triples),
                                           [triple[0] + triple[1] for triple in candidate_triples])
    predict_triples = [candidate_triples[i] for i in range(len(candidate_triples))
                       if candidate_triples_labels[i] == '1']
    print('■ Coarse classification keeps the following triples:', predict_triples)

    predict_answers = [_triple[2] for _triple in predict_triples]
    if len(predict_answers) == 0:
        print('■ No relevant knowledge was retrieved from the knowledge base. Please try another question......')
        return ()
    elif len(set(predict_answers)) == 1:
        # only one distinct answer, although several triples may provide it
        print('■ The predicted answer is unique; output it directly......')
        best_triple = predict_triples[0]
        best_answer = predict_answers[0]
        print('■ Best answer:', best_answer)
    else:
        # multiple answers predicted: they need to be ranked
        print('■ Multiple answers detected; ranking answers......')
        max_ner = ''  # strip the question with the longest of all NER results
        for _ner in ner_results:
            if len(_ner) > len(max_ner):
                max_ner = _ner
        fine_question = span_question(question, max_ner)
        rel_scores = [score_similarity(word2vec_model, _triple[1].replace(' ', ''), fine_question)
                      for _triple in predict_triples]
        triples_with_score = list(zip(map(tuple, predict_triples), rel_scores))
        triples_with_score.sort(key=lambda x: x[1], reverse=True)
        print('■ Triple ranking result:\n{}'.format(
            "\n".join([str(pair[0]) + '-->' + str(pair[1]) for pair in triples_with_score])))
        best_answer = triples_with_score[0][0][-1]
        print('■ Best answer:', best_answer)

input_question = input('■ Please enter a question:')
# input_question = 'Who is the president of Harbin Institute of Technology?'
# input_question = 'Which nationality is Hugo?'
# input_question = 'Which work is Bai Xiaosheng from?'
print('■ Question entered:', input_question)
pipeline_predict(input_question)
```
```
■ Please enter a question: Which country is Hugo from?
■ Question entered: Which country is Hugo from?
[2021-12-22 17:33:35,001] [ INFO] - Downloading https://paddlenlp.bj.bcebos.com/models/transformers/ernie/ernie_v1_chn_base.pdparams and saved to /home/aistudio/.paddlenlp/models/ernie-1.0
[2021-12-22 17:33:35,003] [ INFO] - Downloading ernie_v1_chn_base.pdparams from https://paddlenlp.bj.bcebos.com/models/transformers/ernie/ernie_v1_chn_base.pdparams
100%|██████████| 392507/392507 [00:09<00:00, 41759.84it/s]
W1222 17:33:44.524525   101 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.0, Runtime API Version: 10.1
W1222 17:33:44.528751   101 device_context.cc:465] device: 0, cuDNN Version: 7.6.
[2021-12-22 17:33:47,166] [ INFO] - Weights from pretrained model not used in ErnieModel: ['cls.predictions.layer_norm.weight', 'cls.predictions.decoder_bias', 'cls.predictions.transform.bias', 'cls.predictions.transform.weight', 'cls.predictions.layer_norm.bias']
[2021-12-22 17:33:47,783] [ INFO] - Downloading https://paddlenlp.bj.bcebos.com/models/transformers/ernie/vocab.txt and saved to /home/aistudio/.paddlenlp/models/ernie-1.0
[2021-12-22 17:33:47,786] [ INFO] - Downloading vocab.txt from https://paddlenlp.bj.bcebos.com/models/transformers/ernie/vocab.txt
100%|██████████| 90/90 [00:00<00:00, 3720.30it/s]
■ Recognized topic words: {'Hugo'} 2021-12-22 17:33:47.919246
■ Candidate entities found: ["Hugo's Secret", 'Hugo(2011 Oscar winning film)', 'Hugo(2011 Martin·Scorsese directed American films)', 'Hugo(2011 year Martin Scorsese Director film)', 'Victor(French writer)', 'Hugo(Fighting game "street overlord" characters)', 'Hugo'] 2021-12-22 17:33:47.921878
■ 109 candidate triples in total 2021-12-22 17:33:47.925449
■ Showing the first 20 candidate triples: [["Hugo's Secret", 'alias', "Hugo's Secret"], ["Hugo's Secret", 'Chinese name', 'Hugo'], ["Hugo's Secret", 'Foreign language name', 'Hugo'], ["Hugo's Secret", 'Other translated names', "Hugo's Paris fantasy adventure"], ["Hugo's Secret", 'Production company', 'Paramount Pictures '], ["Hugo's Secret", 'director', 'Martin·Scorsese'], ["Hugo's Secret", 'Production cost', '1.7 USD100mn'], ["Hugo's Secret", 'Shooting date', '2010 year'], ["Hugo's Secret", 'Film length', '126 minute'], ["Hugo's Secret", 'classification', 'USA: PG'], ["Hugo's Secret", 'color', 'colour'], ["Hugo's Secret", 'to star', 'Asha·Butterfield, colo·Moritz'], ["Hugo's Secret", 'type', 'Plot, science fiction, biography'], ["Hugo's Secret", 'Production time', '2011 November 23'], ["Hugo's Secret", 'Production area', 'U.S.A'], ["Hugo's Secret", 'Screenwriter', 'John·Logan'], ["Hugo's Secret", 'Shooting location', 'U.S.A'], ["Hugo's Secret", 'IMDB score', '7.6'], ["Hugo's Secret", 'Release time', '2012 May 31, 2014 (China)'], ["Hugo's Secret", 'Dialogue language', 'English']]
[2021-12-22 17:33:48,496] [ INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-1.0/ernie_v1_chn_base.pdparams
[2021-12-22 17:33:49,823] [ INFO] - Weights from pretrained model not used in ErnieModel: ['cls.predictions.layer_norm.weight', 'cls.predictions.decoder_bias', 'cls.predictions.transform.bias', 'cls.predictions.transform.weight', 'cls.predictions.layer_norm.bias']
[2021-12-22 17:33:50,497] [ INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-1.0/vocab.txt
■ Coarse classification keeps the following triples: [["Hugo's Secret", 'Title Translation', 'Hugo'], ['Hugo(2011 Oscar winning film)', 'Production area', 'U.S.A'], ['Hugo(2011 Martin·Scorsese directed American films)', 'Production area', 'U.S.A'], ['Hugo(2011 year Martin Scorsese Director film)', 'Production area', 'U.S.A'], ['Victor(French writer)', 'nationality', 'France']]
■ Multiple answers detected; ranking answers......
Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
Loading model cost 0.828 seconds.
Prefix dict has been built successfully.
■ Triple ranking result:
('Victor(French writer)', 'nationality', 'France')-->0.49243852496147156
("Hugo's Secret", 'Title Translation', 'Hugo')-->0.39723604917526245
('Hugo(2011 Oscar winning film)', 'Production area', 'U.S.A')-->0.22878356277942657
('Hugo(2011 Martin·Scorsese directed American films)', 'Production area', 'U.S.A')-->0.22878356277942657
('Hugo(2011 year Martin Scorsese Director film)', 'Production area', 'U.S.A')-->0.22878356277942657
■ Best answer: France
```
9. Evaluation results
Before being migrated to AIStudio, this project was implemented with the PyTorch framework; the topic word recognition module and the triple classification module used the Chinese pretrained PyTorch models of SpanBERT and BERT-base respectively.
Data analysis revealed natural discrepancies between the answer entities in the original question answering dataset and those in the knowledge base, including English letter case, decimal precision, time and date formats, and missing or omitted symbols. Treating an entity as a wrong answer purely because of such format differences introduces errors into the evaluation. The author therefore corrected this kind of problem, obtained a revised question answering dataset, and trained and tested on it in the same way.
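The exact correction rules are not listed in this notebook, so the following is only a hedged sketch of the kind of normalization that aligns gold answers with knowledge base entities:

```python
import unicodedata

def normalize_answer(text):
    # Hypothetical normalization; the revised dataset's actual rules are not
    # published here. It unifies full/half-width forms, English letter case,
    # and spacing, which are among the discrepancy types mentioned above.
    text = unicodedata.normalize('NFKC', text)  # full-width -> half-width, etc.
    text = text.lower().strip()                 # unify English letter case
    text = text.replace(' ', '')                # drop internal spaces
    return text
```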
On the NLPCC2018 test set, with answer accuracy as the evaluation metric, the test results are as follows:

| | Original QA dataset | Revised QA dataset |
| --- | --- | --- |
| This KBQA system | 78.03% | 90.32% |