Create new features for NLP with Named Entity Recognition

Yannawut Kimnaruk
4 min read · Jul 4, 2022

What is NLP?

The objective of NLP (Natural Language Processing) is to make computers understand text and spoken words in much the same way human beings can.

NLP applies statistical, machine learning, and deep learning models to large amounts of text to infer the speaker’s or writer’s intent and sentiment.

Usually, the input to an NLP model is built from the words in the text. However, we can add more features to make the model more accurate.

What is Named Entity Recognition (NER)?

Named-entity recognition (NER) (also known as (named) entity identification, entity chunking, and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
(from https://en.wikipedia.org/wiki/Named-entity_recognition)

We can use NER to create new feature columns for an NLP model, and sometimes a visualization of the named entities alone is enough to tell what type of text it is.
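
To make the input and output of NER concrete, here is a toy sketch. It is not how a trained model works: the lookup table GAZETTEER and the function toy_ner are hypothetical names invented for this illustration, and a real NER model predicts labels from context instead of looking words up.

```python
# Toy illustration only: a hand-written lookup table stands in for a trained model.
GAZETTEER = {
    "australia": "GPE",  # geopolitical entity
    "qantas": "ORG",     # organization
}


def toy_ner(text):
    """Return (word, label) pairs for the words found in the lookup table."""
    words = text.lower().split()
    return [(word, GAZETTEER[word]) for word in words if word in GAZETTEER]


print(toy_ner("australia is locked into war timetable opp"))
# [('australia', 'GPE')]
```

The point is only the shape of the result: pieces of text paired with a category label, which can then be counted and turned into features.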

Get to know spaCy

spaCy is a free, open-source Python library for Natural Language Processing. It features NER, POS tagging, dependency parsing, word vectors, and more.

In this article, I will show you step by step how to perform NER with the spaCy library.

You can see all supported languages and the models available for each language on the spaCy website.

Step summary

  1. Install and import libraries
  2. Load an NER model
  3. Create a tag list column
  4. Create new features

1. Install and import libraries

  • Open Anaconda Prompt

Search for Anaconda Prompt and click to open it.

A black terminal window will appear.

  • Install the spaCy library by typing the following command in Anaconda Prompt:
pip install spacy
  • Download the pre-trained NER model:
python -m spacy download en_core_web_sm
  • Import required libraries in Python
import spacy
import pandas as pd
  • Load a dataset

You can download the news headlines dataset as an example.

df = pd.read_csv("abcnews-date-text.csv")
df = df[:10]  # Keep only the first 10 rows to reduce computation time

2. Load an NER model

  • Use spacy.load to load the pre-trained NER model. (Make sure that you have already downloaded this model in step 1.)
ner = spacy.load("en_core_web_sm")

This is how to use the loaded model.

  • Run the model on a text
doc = ner(df['headline_text'][9])

doc is a spaCy Doc object, a sequence of tokens. You can loop over it and read the attributes of each token.

  • Display

Loop over the tokens in doc.

.text returns the text of the token

.pos_ returns the part of speech of the token

.ent_type_ returns the named-entity type of the token (an empty string if the token is not part of an entity)

In the example “australia is locked into war timetable opp”, you will see that the model detects australia as GPE (geopolitical entity).

print("Text is: " + doc.text + "\n")
for token in doc:
    print(token.text + "\t" + token.pos_ + "\t" + token.ent_type_)
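
The loop above prints one label per token, so an entity made of several words appears on several lines. spaCy also exposes whole entities through doc.ents, where each entity has .text and .label_. The sketch below shows the idea behind that grouping without needing spaCy installed: the tuples are hand-written stand-ins for token.text, token.ent_iob_ and token.ent_type_, the headline is made up, and group_entities is a hypothetical helper that assumes well-formed B/I/O tags.

```python
def group_entities(tokens):
    """Merge consecutive tokens of one entity into (text, label) spans.

    tokens: iterable of (text, iob, label) tuples, mirroring spaCy's
    token.text, token.ent_iob_ and token.ent_type_.
    """
    spans = []
    for text, iob, label in tokens:
        if iob == "B":              # first token of a new entity
            spans.append([text, label])
        elif iob == "I" and spans:  # continuation of the current entity
            spans[-1][0] += " " + text
        # iob == "O": the token is outside any entity, so skip it
    return [tuple(span) for span in spans]


# Hand-written stand-in for model output on a made-up headline.
tokens = [
    ("new", "B", "GPE"),
    ("york", "I", "GPE"),
    ("mayor", "O", ""),
    ("visits", "O", ""),
    ("australia", "B", "GPE"),
]

print(group_entities(tokens))
# [('new york', 'GPE'), ('australia', 'GPE')]
```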

spaCy has a useful display tool, displacy, that highlights the named entities in the text.

spacy.displacy.render(doc, style="ent")

The list of all NER labels is available through ner.pipe_labels, and spacy.explain returns a short description of each label.

ner_list = ner.pipe_labels['ner']
print("Number of NER: " + str(len(ner_list)))
for label in ner_list:
    print(label + " : " + spacy.explain(label))

This model has 18 NER labels.

3. Create a tag list column

First, create a function add_tag_column that takes a text and returns a dictionary with the number of occurrences of each tag.

def add_tag_column(text):
    doc = ner(text)
    tag_dict = dict.fromkeys(ner_list, 0)  # start every label at a count of 0
    for token in doc:
        if token.ent_type_ != "":
            tag_dict[token.ent_type_] += 1
    return tag_dict

Apply the add_tag_column function to all rows in the dataframe. The resulting dictionaries are temporarily stored in the tag column.

df["tag"] = df["headline_text"].apply(add_tag_column)
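
Note that add_tag_column counts tokens, so an entity made of two words adds 2 to its label. If you would rather count each entity once, one option is to count only the first token of each entity, which spaCy marks with token.ent_iob_ == "B". The sketch below compares both counts on hand-written stand-in data so it runs without spaCy; LABELS is a shortened stand-in for ner_list, the headline is made up, and count_tags with its per_entity flag is a hypothetical helper.

```python
LABELS = ["GPE", "ORG", "PERSON"]  # shortened stand-in for ner_list


def count_tags(tokens, per_entity=False):
    """Count labels per token, or once per entity when per_entity is True.

    tokens: iterable of (text, iob, label) tuples, mirroring spaCy's
    token.text, token.ent_iob_ and token.ent_type_.
    """
    counts = dict.fromkeys(LABELS, 0)
    for _text, iob, label in tokens:
        if not label:
            continue  # token is not part of any entity
        if per_entity and iob != "B":
            continue  # skip continuation tokens of a multi-word entity
        counts[label] += 1
    return counts


# Hand-written stand-in for model output on a made-up headline.
tokens = [
    ("new", "B", "GPE"),
    ("york", "I", "GPE"),
    ("mayor", "O", ""),
    ("visits", "O", ""),
    ("australia", "B", "GPE"),
]

print(count_tags(tokens))                   # {'GPE': 3, 'ORG': 0, 'PERSON': 0}
print(count_tags(tokens, per_entity=True))  # {'GPE': 2, 'ORG': 0, 'PERSON': 0}
```

Either count can work as a feature; the token count also reflects how long the entities are, while the entity count reflects how many there are.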

4. Create new features

Create one new column per NER label, named tag_ followed by the label (for example tag_GPE). Each column holds the number of occurrences of that label in the text.

for tag in ner_list:
    df["tag_" + tag] = df['tag'].apply(lambda count: count[tag])
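
The loop above turns one column of dictionaries into one numeric column per label. The same reshaping can be sketched with plain Python dictionaries, which makes the result easy to inspect. The rows and counts below are made-up stand-ins for the dataframe, and expand_tags is a hypothetical helper, not part of pandas or spaCy.

```python
# Made-up stand-in for two dataframe rows after the tag column was added.
rows = [
    {"headline_text": "first headline", "tag": {"GPE": 1, "ORG": 0}},
    {"headline_text": "second headline", "tag": {"GPE": 0, "ORG": 2}},
]


def expand_tags(row, prefix="tag_"):
    """Replace the nested tag dictionary with one prefixed key per label."""
    flat = {key: value for key, value in row.items() if key != "tag"}
    for label, count in row["tag"].items():
        flat[prefix + label] = count
    return flat


features = [expand_tags(row) for row in rows]
print(features[1])
# {'headline_text': 'second headline', 'tag_GPE': 0, 'tag_ORG': 2}
```

In the dataframe version, the temporary tag column is no longer needed once the tag_ columns exist, so it can be dropped with df.drop(columns="tag") before training.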

You can use these columns to train an NLP model.

If you find this article helpful, follow me for more articles about data science.
