According to the spaCy documentation for Named Entity Recognition, this is how to extract named entities:
import spacy
nlp = spacy.load('en')  # spaCy 2.x shortcut; install the 'en' model with: python3 -m spacy download en
doc = nlp("Alphabet is a new startup in China")
print('Name Entity: {0}'.format(doc.ents))
Result
Name Entity: (China,)
To have "Alphabet" treated as a noun (and so picked up as an entity), prepend "The" to it:
doc = nlp("The Alphabet is a new startup in China")
print('Name Entity: {0}'.format(doc.ents))
Name Entity: (Alphabet, China)
In spaCy version 3, Transformers from Hugging Face are fine-tuned for the tasks that spaCy supported in previous versions, with better results.
Transformers are currently (2020) the state of the art in Natural Language Processing. Roughly, the field moved from word representations (one-hot encoding -> word2vec -> GloVe | fastText), to recurrent architectures (recurrent neural networks, recursive neural networks, gated recurrent units, long short-term memory, bidirectional long short-term memory, etc.), and now to Transformers + Attention (BERT, RoBERTa, XLNet, XLM, CTRL, ALBERT, T5, BART, GPT, GPT-2, GPT-3). This is just to give context for why you should consider Transformers; I know there is a lot I didn't mention, such as Fuzz, Knowledge Graphs and so on.
Install the dependencies:
sudo apt install libncurses5
pip install spacy-transformers --pre -f https://download.pytorch.org/whl/torch_stable.html
pip install spacy-nightly # I'm using 3.0.0rc2
Download a model:
python -m spacy download en_core_web_trf # English Transformer pipeline, RoBERTa base
Here's a list of available models.
And then use it as you would normally do:
import spacy
text = 'Type something here which can be related to something, e.g Stack Over Flow organization'
nlp = spacy.load('en_core_web_trf')
document = nlp(text)
print(document.ents)
References:
Learn about Transformers and Attention.
Read a summary of the different Transformer architectures.
Learn about the Transformer fine-tuning done by spaCy.
The issue with model accuracy
The problem with all models is that none of them has 100% accuracy, and even using a bigger model doesn't help to recognize dates. Here are the accuracy values (F-score, precision, recall) for NER models; they are all around 86%.
document_string = """
Electronically signed : Wes Scott, M.D.; Jun 26 2010 11:10AM CST
The patient was referred by Dr. Jacob Austin.
Electronically signed by Robert Clowson, M.D.; Janury 15 2015 11:13AM CST
Electronically signed by Dr. John Douglas, M.D.; Jun 16 2017 11:13AM CST
The patient was referred by
Dr. Jayden Green Olivia.
"""
With the small model, two date items are labelled as 'PERSON':
import spacy
nlp = spacy.load('en')
sents = nlp(document_string)
[ee for ee in sents.ents if ee.label_ == 'PERSON']
# Out:
# [Wes Scott,
# Jun 26,
# Jacob Austin,
# Robert Clowson,
# John Douglas,
# Jun 16 2017,
# Jayden Green Olivia]
With the larger en_core_web_md model the results are even worse in terms of precision, as there are three misclassified entities:
nlp = spacy.load('en_core_web_md')
sents = nlp(document_string)
[ee for ee in sents.ents if ee.label_ == 'PERSON']
# Out:
# [Wes Scott,
# Jun 26,
# Jacob Austin,
# Robert Clowson,
# Janury,
# John Douglas,
# Jun 16 2017,
# Jayden Green Olivia]
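To put a number on the precision remark, here is a minimal sketch that scores the two PERSON lists shown above against a hand-made gold set. The gold set (the five names this example treats as real people) and the helper name `precision_recall_f1` are my own assumptions, not something spaCy provides. The scores are for this one toy document only, so they are not comparable to the benchmark figures mentioned earlier.

```python
def precision_recall_f1(predicted, gold):
    """Return (precision, recall, F1) for two collections of entity strings."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Assumed gold standard: the five people named in document_string.
gold_people = ["Wes Scott", "Jacob Austin", "Robert Clowson",
               "John Douglas", "Jayden Green Olivia"]

# PERSON entities copied from the two outputs above.
small_model = ["Wes Scott", "Jun 26", "Jacob Austin", "Robert Clowson",
               "John Douglas", "Jun 16 2017", "Jayden Green Olivia"]
medium_model = ["Wes Scott", "Jun 26", "Jacob Austin", "Robert Clowson",
                "Janury", "John Douglas", "Jun 16 2017", "Jayden Green Olivia"]

for name, predicted in [("small", small_model), ("en_core_web_md", medium_model)]:
    p, r, f1 = precision_recall_f1(predicted, gold_people)
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# small: precision=0.714 recall=1.000 f1=0.833
# en_core_web_md: precision=0.625 recall=1.000 f1=0.769
```

Both models find every person (recall 1.0); the difference is only in how many non-persons they let through.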
I also tried other models (xx_ent_wiki_sm, en_core_web_md) and they don't bring any improvement either.
What about using rules to improve accuracy?
In this small example, not only does the document have a clear structure, but the misclassified entities are all dates. So why not combine the initial model with a rule-based component?
The good news is that in spaCy:
it's possible to combine statistical and rule-based components in a variety of ways. Rule-based components can be used to improve the accuracy of statistical models
(from https://spacy.io/usage/rule-based-matching#models-rules)
So, by following the example and using the dateparser library (a parser for human-readable dates), I've put together a rule-based component that works very well on this example:
from spacy.tokens import Span
import dateparser

TITLES = ("Dr", "Dr.", "Mr", "Mr.", "Ms", "Ms.")

def expand_person_entities(doc):
    new_ents = []
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            # If the person is preceded by a title, include the title in the entity
            # (only possible when the entity is not the first token)
            if ent.start != 0 and doc[ent.start - 1].text in TITLES:
                new_ents.append(Span(doc, ent.start - 1, ent.end, label=ent.label))
            # If the entity can be parsed as a date, it's not a person, so drop it
            elif dateparser.parse(ent.text) is None:
                new_ents.append(ent)
        else:
            # Leave all non-PERSON entities untouched
            new_ents.append(ent)
    doc.ents = new_ents
    return doc
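# Add the component after the named entity recognizer (spaCy 2.x API).
# In spaCy 3, decorate the function with @Language.component("expand_person_entities")
# and add it by name: nlp.add_pipe("expand_person_entities", after="ner")
# nlp.remove_pipe('expand_person_entities')
nlp.add_pipe(expand_person_entities, after='ner')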
doc = nlp(document_string)
[(ent.text, ent.label_) for ent in doc.ents if ent.label_=='PERSON']
# Out:
# [('Wes Scott', 'PERSON'),
# ('Dr. Jacob Austin', 'PERSON'),
# ('Robert Clowson', 'PERSON'),
# ('Dr. John Douglas', 'PERSON'),
# ('Dr. Jayden Green Olivia', 'PERSON')]
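The date check is the part doing most of the work here. Below is a dependency-free sketch of that idea, applied to the plain strings from the small model's output rather than to spaCy spans. The regular expression is a crude stand-in for `dateparser.parse` (my assumption; it only recognizes "month day [year]" forms and would not catch a bare "Janury"), and `looks_like_date` and `filter_people` are hypothetical helper names.

```python
import re

# Month prefixes; trailing letters are allowed so "June" or "Janury" prefixes pass.
MONTHS = "jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec"

# Matches e.g. "Jun 26", "Jun 16 2017", "June 16, 2017".
DATE_LIKE = re.compile(
    rf"^(?:{MONTHS})[a-z]*\.?\s+\d{{1,2}}(?:,?\s+\d{{4}})?$",
    re.IGNORECASE,
)


def looks_like_date(text):
    """True if the text has a simple 'month day [year]' shape."""
    return DATE_LIKE.match(text.strip()) is not None


def filter_people(candidates):
    """Keep only PERSON candidates that do not look like dates."""
    return [c for c in candidates if not looks_like_date(c)]


# PERSON entities as returned by the small model in the example above.
candidates = ["Wes Scott", "Jun 26", "Jacob Austin", "Robert Clowson",
              "John Douglas", "Jun 16 2017", "Jayden Green Olivia"]

print(filter_people(candidates))
# ['Wes Scott', 'Jacob Austin', 'Robert Clowson', 'John Douglas', 'Jayden Green Olivia']
```

A real parser such as dateparser covers far more formats, which is why the component above uses it; the regex only illustrates the rule "if it parses as a date, it is not a person".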
Try this:
import spacy
en = spacy.load('en')
with open('input.txt') as f:  # close the file once it has been read
    sents = en(f.read())
people = [ee for ee in sents.ents if ee.label_ == 'PERSON']