python - Return all possible entity types from spaCy model? - Stack Overflow
Entity Recognition from Search Queries
Advanced entity extraction (NER) with GPT-NeoX 20B without annotation, and a comparison with spaCy
python - Removing named entities from a document using spacy - Stack Overflow
The statistical pipeline components like ner provide their labels under .labels:
import spacy
nlp = spacy.load("en_core_web_sm")
nlp.get_pipe("ner").labels
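This might not be the most general answer, but for en_core_web_sm this returns the named entity types. It reads the per-type evaluation scores stored in the model's metadata, so the key names depend on the spaCy version (in spaCy v3 these scores appear under 'performance' rather than 'accuracy').
import spacy
model = spacy.load("en_core_web_sm")
# model.meta is the public accessor for the metadata dictionary
list(model.meta['accuracy']['ents_per_type'].keys())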
['ORG', 'CARDINAL', 'DATE', 'GPE', 'PERSON', 'MONEY', 'PRODUCT', 'TIME', 'PERCENT', 'WORK_OF_ART', 'QUANTITY', 'NORP', 'LOC', 'EVENT', 'ORDINAL', 'FAC', 'LAW', 'LANGUAGE']
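To list labels for every component rather than only ner, one possible sketch is to loop over nlp.pipeline, which holds (name, component) pairs, and read each component's .labels where present. The helper name labels_by_component is hypothetical, and the stand-in pipeline below only mimics the shape of a loaded model so that the snippet runs without downloading one.

```python
from types import SimpleNamespace


def labels_by_component(nlp):
    """Map each pipeline component name to its labels, skipping components that have none."""
    result = {}
    for name, component in nlp.pipeline:
        # Not every component exposes labels, so fall back to an empty tuple.
        labels = tuple(getattr(component, "labels", ()))
        if labels:
            result[name] = labels
    return result


# Stand-in object (an assumption for illustration) shaped like a loaded spaCy pipeline.
fake_nlp = SimpleNamespace(
    pipeline=[
        ("tok2vec", SimpleNamespace()),
        ("ner", SimpleNamespace(labels=("GPE", "ORG", "PERSON"))),
    ]
)
print(labels_by_component(fake_nlp))
```

With an installed model, passing the object returned by spacy.load("en_core_web_sm") instead of fake_nlp should work the same way.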
Hello fellow data scientists,
Many NLP practitioners don't know (yet!) that data annotation is no longer needed in an entity extraction project.
So I made a video comparing spaCy and GPT-NeoX 20B for NER, in which I show how GPT models can efficiently extract new entities without any training!
https://www.youtube.com/watch?v=E-qZDwXpeY0
You may also want to read this TDS article, which shows in detail how to leverage few-shot learning for entity extraction: https://towardsdatascience.com/advanced-ner-with-gpt-3-and-gpt-j-ce43dc6cdb9c#4010-fa6647c13fbe-reply
When I see how much time is spent on data annotation and model training in so many NER projects, I really think that these large generative language models (GPT, OPT, Bloom, etc.) are the future.
What do you think?
Julien
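As a rough illustration of the few-shot idea mentioned in the post above, the sketch below only assembles a prompt string from a handful of worked examples and leaves the last entity open for the model to complete. The prompt layout, the "###" separator, the example sentences, and the function name build_few_shot_prompt are all assumptions made for illustration; they are not taken from the video or the article, and no model is called.

```python
def build_few_shot_prompt(examples, query, entity_name="job title"):
    """Build a plain-text few-shot prompt for entity extraction.

    examples: list of (sentence, entity) pairs shown to the model as demonstrations.
    query: the new sentence whose entity the model is expected to complete.
    """
    blocks = []
    for sentence, entity in examples:
        blocks.append(f"[Text]: {sentence}\n[{entity_name}]: {entity}")
    # The final block is left open so that the model fills in the entity.
    blocks.append(f"[Text]: {query}\n[{entity_name}]:")
    return "\n###\n".join(blocks)


# Made-up demonstration sentences (hypothetical data).
examples = [
    ("Maria has worked as a data engineer since 2019.", "data engineer"),
    ("We are looking for a senior accountant in Lyon.", "senior accountant"),
]
prompt = build_few_shot_prompt(examples, "He was hired as a pastry chef last month.")
print(prompt)
```

The resulting string would then be sent to whichever text-generation model is being evaluated, and the completion read back as the extracted entity.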
Comparing each token's text against the entity texts will not handle entities covering multiple tokens, as this example shows:
import spacy
nlp = spacy.load('en_core_web_sm')
text_data = 'New York is in USA'
document = nlp(text_data)
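text_no_namedentities = []
# Full entity strings, here "New York" and "USA"
ents = [e.text for e in document.ents]
for item in document:
    # Each single token is compared against whole entity strings
    if item.text in ents:
        pass
    else:
        text_no_namedentities.append(item.text)
print(" ".join(text_no_namedentities))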
Output
'New York is in'
Here USA is correctly removed, but New York is not: the entity text "New York" never equals the individual tokens "New" or "York".
Solution
import spacy
nlp = spacy.load('en_core_web_sm')
text_data = 'New York is in USA'
document = nlp(text_data)
# Keep only tokens that are not part of any entity (empty ent_type_)
print(" ".join([token.text for token in document if not token.ent_type_]))
Output
'is in'
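Joining tokens with spaces also changes the spacing around punctuation. An alternative sketch, assuming entity positions are available as character offsets (spaCy exposes these as ent.start_char and ent.end_char), is to cut those ranges out of the original string. The helper name remove_spans is hypothetical, and the offsets below are written by hand so that the snippet runs without a model.

```python
def remove_spans(text, spans):
    """Remove the character ranges [start, end) from text and tidy the whitespace."""
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        # Keep the text between the previous span and this one.
        pieces.append(text[cursor:start])
        # max() guards against overlapping or nested spans.
        cursor = max(cursor, end)
    pieces.append(text[cursor:])
    # Collapse the runs of whitespace left behind by the removed spans.
    return " ".join("".join(pieces).split())


text_data = "New York is in USA"
# Hand-written offsets for "New York" and "USA"; with spaCy these could come from
# [(ent.start_char, ent.end_char) for ent in document.ents]
entity_spans = [(0, 8), (15, 18)]
print(remove_spans(text_data, entity_spans))
```

Because whole character ranges are removed, multi-token entities such as New York need no special handling.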
This will get you the result you're asking for. Reviewing spaCy's Named Entity Recognition documentation should help you going forward.
import spacy
nlp = spacy.load('en_core_web_sm')
text_data = 'This is a text document that speaks about entities like Sweden and Nokia'
document = nlp(text_data)
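text_no_namedentities = []
# Full entity strings, here "Sweden" and "Nokia"
ents = [e.text for e in document.ents]
for item in document:
    # Skip any token whose text matches an entity string
    if item.text in ents:
        pass
    else:
        text_no_namedentities.append(item.text)
print(" ".join(text_no_namedentities))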
Output:
This is a text document that speaks about entities like and