5-minute NLP: three pre-trained libraries for implementing NER quickly

In natural language processing (NLP), named entity recognition (NER) is a fundamental task for automatic text understanding. An NER model identifies named entities in a text corpus, such as person names, organizations, locations, and languages.

An NER model helps in understanding the meaning of a sentence or phrase. It recognizes the words that are likely to represent the who, what, and where in the text, as well as other main entities the text refers to.

This article introduces three techniques for performing NER on text data. They cover both pre-trained and custom-trained named entity recognition models.

  • NLTK-based pre-trained NER
  • spaCy-based pre-trained NER
  • Custom NER based on BERT

NLTK-based pre-trained NER model:

The NLTK package provides a pre-trained NER model that can perform NER with a few lines of Python code. Its ne_chunk function takes a binary parameter: it either marks all named entities with a single label or labels each named entity with its type, such as person, organization, or location.

If binary=True, the model only marks whether a chunk is a named entity, using the single label NE. Otherwise, with binary=False, each named entity is assigned a label for its type, such as PERSON, ORGANIZATION, or GPE.

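import nltk
import pandas as pd

# Requires the NLTK tokenizer, POS tagger, and NE chunker resources,
# installed once via nltk.download() (resource names vary by NLTK version).
# `text` is the input string to analyze.
entities = []
tags = []

sentences = nltk.sent_tokenize(text)
for sent in sentences:
    # Tokenize, POS-tag, then chunk; entity chunks are subtrees with a label
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)), binary=False):
        if hasattr(chunk, 'label'):
            entities.append(' '.join(c[0] for c in chunk))
            tags.append(chunk.label())

# Deduplicate the (entity, tag) pairs
entities_tags = list(set(zip(entities, tags)))

entities_df = pd.DataFrame(entities_tags, columns=["Entities", "Tags"])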
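
To make the effect of the binary option and of the hasattr(chunk, 'label') check concrete, here is a minimal plain-Python sketch. It does not call NLTK: the Chunk class is a hypothetical stand-in for the labeled subtrees that ne_chunk returns, the chunked sentence is written by hand, and extract_entities is an illustrative helper name. In NLTK itself, binary is passed to ne_chunk, which then emits NE labels directly; the sketch only imitates that by relabeling.

```python
class Chunk(list):
    """Hypothetical stand-in for a labeled entity subtree (not nltk.Tree itself)."""

    def __init__(self, label, leaves):
        super().__init__(leaves)
        self._label = label

    def label(self):
        return self._label


def extract_entities(chunks, binary=False):
    """Collect (entity text, label) pairs the same way the loop above does."""
    pairs = []
    for chunk in chunks:
        # Entity chunks expose label(); plain tokens are (word, POS tag) tuples.
        if hasattr(chunk, "label"):
            text = " ".join(word for word, _tag in chunk)
            pairs.append((text, "NE" if binary else chunk.label()))
    return pairs


# Hand-written example of a chunked sentence (not real NLTK output).
chunked = [
    Chunk("PERSON", [("Ada", "NNP"), ("Lovelace", "NNP")]),
    ("visited", "VBD"),
    Chunk("GPE", [("London", "NNP")]),
    (".", "."),
]

typed = extract_entities(chunked, binary=False)
flat = extract_entities(chunked, binary=True)
print(typed)
print(flat)
```

With binary=False each entity keeps its type; with binary=True every entity collapses to the single NE label, which is enough when only the entity boundaries matter.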

Sample input text:

The results are as follows:

spaCy-based pre-trained NER

The spaCy package provides pre-trained deep learning models for performing NER on text data. It ships several trained English pipelines that include an NER component, among them en_core_web_sm, en_core_web_md, and en_core_web_lg.

The model can be downloaded with python -m spacy download en_core_web_sm and loaded with spacy.load("en_core_web_sm").

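# Notebook shell command; in a terminal, run it without the leading "!"
!python -m spacy download en_core_web_sm
import pandas as pd
import spacy
from spacy import displacy  # optional: displacy.render(doc, style="ent") visualizes the entities

nlp = spacy.load("en_core_web_sm")

# `text` is the input string to analyze
doc = nlp(text)

entities, labels, position_start, position_end = [], [], [], []

for ent in doc.ents:
    entities.append(ent.text)  # store the entity string, not the Span object
    labels.append(ent.label_)
    position_start.append(ent.start_char)  # character offsets into `text`
    position_end.append(ent.end_char)

df = pd.DataFrame({'Entities': entities, 'Labels': labels, 'Position_Start': position_start, 'Position_End': position_end})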
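
The start and end positions collected above are character offsets into the original text, so text[start:end] recovers the entity string. The following plain-Python sketch uses such offsets to mark entities inline. It does not call spaCy: the sentence and spans are made up for illustration, mark_entities is a hypothetical helper name, and the spans are assumed not to overlap.

```python
def mark_entities(text, spans):
    """Wrap each (start, end, label) character span as [entity](LABEL)."""
    pieces, cursor = [], 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])               # untouched text before the entity
        pieces.append(f"[{text[start:end]}]({label})")  # the entity itself
        cursor = end
    pieces.append(text[cursor:])                        # remainder after the last entity
    return "".join(pieces)


# Made-up sentence and spans, in the shape (start_char, end_char, label).
sample = "Ada Lovelace visited London."
spans = [(0, 12, "PERSON"), (21, 27, "GPE")]

marked = mark_entities(sample, spans)
print(marked)
```

Keeping offsets rather than only the entity strings makes it possible to highlight or replace entities in place, even when the same string occurs more than once.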

For the same text as above, the results are as follows:

Custom NER based on BERT

The first two approaches, NLTK and spaCy, use pre-trained models, and both packages provide APIs for performing NER through Python functions.

In some specialized domains, a pre-trained model may not perform well or may not assign the relevant labels. In such cases, a transformer can be used to train a custom NER model based on BERT.

# Import necessary packages
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from simpletransformers.ner import NERModel, NERArgs

# Read sample training NER data
data = pd.read_csv("sample_ner_dataset.csv", encoding="latin1")
data = data.ffill()  # forward-fill missing values

# Label Encode
data["Sentence #"] = LabelEncoder().fit_transform(data["Sentence #"] )
data.rename(columns={"Sentence #":"sentence_id","Word":"words","Tag":"labels"}, inplace =True)
data["labels"] = data["labels"].str.upper()

# Train test split by sentence, so that the words of each sentence stay together
sentence_ids = data["sentence_id"].unique()
train_ids, test_ids = train_test_split(sentence_ids, test_size=0.2)

# Building up train data and test data
columns = ["sentence_id", "words", "labels"]
train_data = data.loc[data["sentence_id"].isin(train_ids), columns]
test_data = data.loc[data["sentence_id"].isin(test_ids), columns]

# Initializing NER model configurations
label = data["labels"].unique().tolist()
args = NERArgs()
args.num_train_epochs = 1
args.learning_rate = 1e-4
args.overwrite_output_dir =True
args.train_batch_size = 32
args.eval_batch_size = 32

# Train BERT based NER model
model = NERModel('bert', 'bert-base-cased', labels=label, args=args)
model.train_model(train_data, eval_data=test_data, acc=accuracy_score)

# Evaluate the performance of NER model
result, model_outputs, preds_list = model.eval_model(test_data)

# Perform NER for inference text
inference_text = "What is the new name of Bangalore"
prediction, model_output = model.predict([inference_text])
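
The prediction returned by model.predict is a list with one entry per input sentence, each entry being a list of single-item dictionaries that map a word to its predicted tag. If the training labels follow the common B-/I-/O scheme (an assumption here, since the dataset's labels are not shown), consecutive tokens can be merged into whole entities. A minimal plain-Python sketch, with a hypothetical helper name merge_bio and hand-written tags rather than real model output:

```python
def merge_bio(tagged_words):
    """Group [{word: tag}, ...] into (entity text, type) pairs using B-/I- prefixes."""
    entities, words, current = [], [], None
    for item in tagged_words:
        (word, tag), = item.items()            # each dict holds exactly one word
        prefix, _, etype = tag.partition("-")  # "B-GEO" -> ("B", "-", "GEO")
        if prefix == "I" and etype == current:
            words.append(word)                 # continue the open entity
            continue
        if words:
            entities.append((" ".join(words), current))  # close the open entity
        if prefix in ("B", "I"):
            words, current = [word], etype     # start a new entity
        else:
            words, current = [], None          # "O": outside any entity
    if words:
        entities.append((" ".join(words), current))
    return entities


# Hand-written tags for illustration only (not produced by the model above).
example_prediction = [
    [{"New": "B-GEO"}, {"Delhi": "I-GEO"}, {"is": "O"}, {"in": "O"}, {"India": "B-GEO"}]
]

merged = [merge_bio(sentence) for sentence in example_prediction]
print(merged)
```

Merging at the entity level makes the output directly comparable to the entity tables produced by the NLTK and spaCy examples.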

The results are as follows:

Summary

Of the three approaches, the spaCy-based pre-trained NER model appears to perform best: its predicted labels are closest to how a human would label the text. The spaCy NER model also takes only a few lines of code and is easy to use.

The custom-trained NER model based on BERT gives similar performance. Custom training also suits tasks in specific domains.

There are various other NER implementations that are not discussed in this article, such as the pre-trained NER model from Stanford NLP. Interested readers may want to take a look.

https://www.overfit.cn/post/b7a368f1282149338a1afc20a5a6afcc

Keywords: AI neural networks Deep Learning NLP

Added by Trafalger on Mon, 21 Feb 2022 03:23:06 +0200