Master Named Entity Recognition Techniques

Sai Kamal

Software Engineer

The goal of Named Entity Recognition is to locate and classify named entities in a sequence. The named entities are pre-defined categories chosen according to the use case such as names of people, organizations, places, codes, time notations, monetary values, etc. Essentially, NER aims to assign a class to each token (usually a single word) in a sequence. Because of this, NER is also referred to as token classification.
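
To make the token-classification view concrete, here is a minimal sketch in Python. The sentence and its labels are made up for illustration, using the B-/I-/O label style described later in this post; they are not the output of any model.

```python
# Hypothetical example sentence, already split into tokens (one word each).
tokens = ["Alice", "works", "in", "London", "for", "Acme", "Corp"]

# One label per token: this is what "token classification" means.
# The labels are hand-written for illustration, not model predictions.
labels = ["B-per", "O", "O", "B-geo", "O", "B-org", "I-org"]

# Every token must have exactly one label.
assert len(tokens) == len(labels)

# Pair each token with its class and keep only the tokens inside an entity.
tagged = list(zip(tokens, labels))
entity_tokens = [(tok, lab) for tok, lab in tagged if lab != "O"]

for tok, lab in tagged:
    print(f"{tok:<10} {lab}")
```

Tokens labelled O are ordinary words; everything else belongs to a named entity of the type given by the label suffix.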

Model Development Process

For this, we use pre-trained models from Simple Transformers, a library built on top of the Hugging Face Transformers library.

Implementation Process

The process of performing Named Entity Recognition in Simple Transformers does not deviate from the standard pattern.

Initialize a NERModel
Train the model with train_model()
Evaluate the model with eval_model()
Make predictions on (unlabelled) data with predict()
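
The four steps above can be sketched roughly as follows. This is an illustrative outline, not the exact code behind this post: the function name run_ner_workflow and its parameters are hypothetical, the data is expected from the caller, and it assumes the simpletransformers package is installed.

```python
def run_ner_workflow(train_df, eval_df, sentences, labels):
    """Run the four steps on caller-supplied data and return the results.

    Assumes the simpletransformers package is installed; the import is kept
    inside the function so the sketch can be loaded without it.
    """
    from simpletransformers.ner import NERModel

    # 1. Initialize a NERModel: model type, pre-trained checkpoint, label list.
    model = NERModel(
        "bert",
        "bert-base-cased",
        labels=labels,
        args={"overwrite_output_dir": True},
        use_cuda=False,  # assumption: CPU only; set to True if a GPU is available
    )

    # 2. Train the model on the labelled training DataFrame.
    model.train_model(train_df)

    # 3. Evaluate the model on a labelled evaluation DataFrame.
    result, model_outputs, preds_list = model.eval_model(eval_df)

    # 4. Make predictions on unlabelled sentences (plain strings).
    predictions, raw_outputs = model.predict(sentences)

    return result, predictions
```

The first call downloads the pre-trained checkpoint, so it needs network access or a local copy of the model.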

Model Types Supported by Simple Transformers

            
    1.  ALBERT        = albert
    2.  BERT          = bert
    3.  BERTweet      = bertweet
    4.  BigBird       = bigbird
    5.  CamemBERT     = camembert
    6.  DeBERTa       = deberta
    7.  DeBERTaV2     = deberta-v2
    8.  DistilBERT    = distilbert
    9.  ELECTRA       = electra
    10. HerBERT       = herbert
    11. LayoutLM      = layoutlm
    12. Longformer    = longformer
    13. MobileBERT    = mobilebert
    14. MPNet         = mpnet
    15. RoBERTa       = roberta
    16. SqueezeBERT   = squeezebert
    17. XLM           = xlm
    18. XLM-RoBERTa   = xlmroberta
    19. XLNet         = xlnet

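All of the models above are based on the Transformer architecture. For token classification, each one acts as an encoder that produces a contextual representation for every token, with a classification layer on top that assigns the label.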

Data Formats

The training and evaluation data is passed as a Pandas DataFrame containing three columns: sentence_id, words, and labels. Each value in words has a corresponding value in labels. The sentence_id determines which words belong to a given sentence, i.e. all words from the same sequence should be assigned the same unique sentence_id.
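
As a rough illustration of this layout, the sketch below builds the rows in plain Python. The two sentences are invented for the example, and their labels are taken from the label set used in this post.

```python
# Hypothetical tokenised sentences, each paired with one label per word.
sentences = [
    (["Alice", "lives", "in", "London"], ["B-per", "O", "O", "B-geo"]),
    (["Acme", "Corp", "hired", "Bob"], ["B-org", "I-org", "O", "B-per"]),
]

rows = []
for sentence_id, (words, sentence_labels) in enumerate(sentences):
    # Each word needs exactly one label.
    if len(words) != len(sentence_labels):
        raise ValueError(f"sentence {sentence_id}: words and labels differ in length")
    # All words of one sentence share the same sentence_id.
    for word, label in zip(words, sentence_labels):
        rows.append({"sentence_id": sentence_id, "words": word, "labels": label})

for row in rows:
    print(row)
```

With pandas installed, pandas.DataFrame(rows) turns this list into a DataFrame with the three expected columns.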

Named entity recognition depends on the labels. A model can be trained with different label schemes; here we use the following set of labels: ['O', 'B-geo', 'B-gpe', 'B-per', 'I-geo', 'B-org', 'I-org', 'B-tim', 'B-art', 'I-art', 'I-per', 'I-gpe', 'I-tim', 'B-nat', 'B-eve', 'I-eve', 'I-nat']

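Explanation of the labels

The labels follow the B-/I-/O tagging convention. The list below illustrates it with four common entity types (miscellaneous, person, organisation, location); the entity-type suffixes in the label set used here (geo, gpe, per, org, tim, art, nat, eve) are read the same way.

O = Outside of a named entity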
B-MIS = Beginning of a miscellaneous entity right after another miscellaneous entity
I-MIS = Miscellaneous entity
B-PER = Beginning of a person's name right after another person's name
I-PER = Person's name
B-ORG = Beginning of an organisation right after another organisation
I-ORG = Organisation
B-LOC = Beginning of a location right after another location
I-LOC = Location
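
To show how these prefixes are read in practice, here is a small sketch that groups tagged words back into whole entities. The function name extract_entities and the example sentence are hypothetical. The decoder is deliberately lenient: it starts a new entity on any B- label, and also on an I- label whose type differs from the entity in progress, so it copes both with data where B- is used only right after another entity of the same type and with data where every entity starts with B-.

```python
def extract_entities(words, labels):
    """Group B-/I- tagged words into (entity_type, text) spans."""
    entities = []          # finished (entity_type, text) spans
    current_type = None    # type of the entity being built, if any
    current_words = []     # words collected for that entity

    for word, label in zip(words, labels):
        prefix, _, entity_type = label.partition("-")
        # A B- label always opens a new entity; an I- label opens one only
        # when no entity of the same type is currently in progress.
        starts_new = prefix == "B" or (prefix == "I" and entity_type != current_type)

        if label == "O" or starts_new:
            # Close the entity in progress before moving on.
            if current_words:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []

        if label != "O":
            if starts_new:
                current_type = entity_type
            current_words.append(word)

    # Close an entity that runs to the end of the sentence.
    if current_words:
        entities.append((current_type, " ".join(current_words)))
    return entities


words = ["Alice", "Smith", "visited", "New", "York", "and", "Acme", "Corp"]
labels = ["B-per", "I-per", "O", "B-geo", "I-geo", "O", "B-org", "I-org"]
print(extract_entities(words, labels))
```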

Model used for Named entity recognition

bert-base-cased

Arguments.num_train_epochs = 3
Arguments.train_batch_size = 32
Arguments.eval_batch_size = 32
Arguments.learning_rate = 4e-5
Arguments.max_seq_length = 128
Arguments.adam_epsilon = 1e-8
Arguments.do_lower_case = True
Arguments.n_gpu = 1
Arguments.overwrite_output_dir = True
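
One way to express these settings in code is as a plain dictionary, which Simple Transformers model classes accept through their args parameter. The variable names below are hypothetical, and passing the dictionary this way is an assumption about how the settings were applied; the values themselves are the ones listed above.

```python
# Model type and pre-trained checkpoint named in this post.
model_type = "bert"
model_name = "bert-base-cased"

# Training settings as listed above, collected in one dictionary.
train_args = {
    "num_train_epochs": 3,
    "train_batch_size": 32,
    "eval_batch_size": 32,
    "learning_rate": 4e-5,
    "max_seq_length": 128,
    "adam_epsilon": 1e-8,
    # Listed as True in this post; lowercasing is normally paired with
    # uncased checkpoints, so it is worth double-checking for a cased model.
    "do_lower_case": True,
    "n_gpu": 1,
    "overwrite_output_dir": True,
}

for name, value in train_args.items():
    print(f"{name} = {value}")
```

The dictionary would then be handed over when the model is created, for example NERModel(model_type, model_name, labels=labels, args=train_args), where labels is the label list shown earlier.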

See the technical report.docx file for a complete explanation of the models used and of the dataset.

Sai Kamal

Software Engineer with a specialization in Python, Statistics, machine learning and deep learning. He has also worked extensively with Natural Language Processing and Computer Vision. Sai has also worked on Flask micro-services, which are used to run large-scale web applications.