🌐
Medium
medium.com › @sanskrutikhedkar09 › mastering-information-extraction-from-unstructured-text-a-deep-dive-into-named-entity-recognition-4aa2f664a453
Mastering Information Extraction from Unstructured Text: A Deep Dive into Named Entity Recognition with spaCy | by Sanskrutikhedkar | Medium
October 27, 2023 - SpaCy, our trusty companion in this adventure, offers a plethora of pre-trained models tailored for diverse language processing tasks. Among them are the agile ‘en_core_web_sm,’ the balanced ‘en_core_web_md,’ and the robust ‘en_core_web_lg,’ each catering to specific needs and preferences. These models, with their components like tokenization rules, Part-of-Speech Tagging, Named Entity Recognition, and more, form the bedrock of our exploration into unstructured data.
🌐
spaCy
spacy.io › usage › linguistic-features
Linguistic Features · spaCy Usage Documentation
entities labeled as MONEY, and then uses the dependency parse to find the noun phrase they are referring to – for example "Net income" → "$9.4 million". ... For more examples of how to write rule-based information extraction logic that takes advantage of the model’s predictions produced by the different components, see the usage guide on combining models and rules.
🌐
spaCy
spacy.io › universe › project › video-spacys-ner-model-alt
Named Entity Recognition (NER) using spaCy · spaCy Universe
spaCy is a free open-source library for Natural Language Processing in Python. It features NER, POS tagging, dependency parsing, word vectors and more.
🌐
Kaggle
kaggle.com › code › curiousprogrammer › entity-extraction-and-classification-using-spacy
Entity Extraction and Classification using SpaCy
🌐
Quanteda
spacyr.quanteda.io › reference › spacy_extract_entity.html
Extract named entities from texts using spaCy — spacy_extract_entity • spacyr
This function extracts named entities from texts, based on the entity tag ent attributes of documents objects parsed by spaCy (see https://spacy.io/usage/linguistic-features#section-named-entities).
🌐
spaCy
spacy.io › api › entityrecognizer
EntityRecognizer · spaCy API Documentation
A transition-based named entity recognition component. The entity recognizer identifies non-overlapping labelled spans of tokens.
🌐
GeeksforGeeks
geeksforgeeks.org › python › python-named-entity-recognition-ner-using-spacy
Python | Named Entity Recognition (NER) using spaCy - GeeksforGeeks
July 12, 2025 - By tagging these entities, we can transform raw text into structured data that can be analyzed, indexed or used in applications. ... Optimized performance: spaCy is built for high-speed text processing making it ideal for large-scale NLP tasks.
🌐
spaCy
spacy.io › usage › spacy-101
spaCy 101: Everything you need to know · spaCy Usage Documentation
A named entity is a “real-world object” that’s assigned a name – for example, a person, a country, a product or a book title. spaCy can recognize various types of named entities in a document, by asking the model for a prediction.
🌐
Sematext
sematext.com › home › blog › entity extraction with spacy
Entity Extraction with spaCy
🌐
CRAN
cran.r-project.org › web › packages › spacyr › vignettes › using_spacyr.html
A Guide to Using spacyr
If a user’s only goal is entity or noun phrase extraction, then two functions make this easy without first parsing the entire text:

spacy_extract_entity(txt)
##   doc_id text           ent_type start_id length
## 1 d2     Smith          PERSON   2        1
## 2 d2     two years      DATE     4        2
## 3 d2     North Carolina GPE      7        2

spacy_extract_nounphrases(txt)
##   doc_id text                             root_text  start_id root_id length
## 1 d1     fast natural language processing processing 5        8       4
## 2 d2     Mr.
🌐
Textanalysisonline
textanalysisonline.com › spacy-named-entity-recognition-ner
spaCy Named Entity Recognizer (NER) - API & Demo | Text Analysis Online | TextAnalysis
🌐
spaCy
spacy.io
spaCy · Industrial-strength Natural Language Processing in Python
spaCy excels at large-scale information extraction tasks. It's written from the ground up in carefully memory-managed Cython. If your application needs to process entire web dumps, spaCy is the library you want to be using.
🌐
Stack Overflow
stackoverflow.com › questions › 60621365 › spacy-extract-named-entity-relations-from-trained-model
python - Spacy Extract named entity relations from trained model - Stack Overflow
My goal is to extract the number of cases of a given disease/virus from a news article, and then later also the number of deaths. I now use this newly created model trying to find the dependencies between CASES and CARDINAL: ... import plac import spacy TEXTS = [ "Net income was $9.4 million compared to the prior year of $2.7 million.
Top answer
1 of 3
9

The issue with model accuracy

No model is 100% accurate, and even using a bigger model doesn't help to recognize the dates here. The reported accuracy values (F-score, precision, recall) for the NER models are all around 86%.

document_string = """ 
Electronically signed : Wes Scott, M.D.; Jun 26 2010 11:10AM CST 
 The patient was referred by Dr. Jacob Austin.   
Electronically signed by Robert Clowson, M.D.; Janury 15 2015 11:13AM CST 
Electronically signed by Dr. John Douglas, M.D.; Jun 16 2017 11:13AM CST 
The patient was referred by 
Dr. Jayden Green Olivia.   
"""  

With the small model, two date items are labelled as 'PERSON':

import spacy

# Small English model (the 'en' shortcut no longer exists in spaCy v3)
nlp = spacy.load('en_core_web_sm')
sents = nlp(document_string)
[ee for ee in sents.ents if ee.label_ == 'PERSON']
# Out:
# [Wes Scott,
#  Jun 26,
#  Jacob Austin,
#  Robert Clowson,
#  John Douglas,
#  Jun 16 2017,
#  Jayden Green Olivia]

With the larger model, en_core_web_md, the results are even worse in terms of precision: there are three misclassified entities.

nlp = spacy.load('en_core_web_md')
sents = nlp(document_string)
[ee for ee in sents.ents if ee.label_ == 'PERSON']
# Out:
# [Wes Scott,
#  Jun 26,
#  Jacob Austin,
#  Robert Clowson,
#  Janury,
#  John Douglas,
#  Jun 16 2017,
#  Jayden Green Olivia]

I also tried other models (xx_ent_wiki_sm, en_core_web_md), and they don't bring any improvement either.
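
To make the precision comparison concrete, here is a small sketch that scores the two PERSON lists shown above. The gold list is an assumption: it is simply the five names that appear in the sample document, and `precision_recall` is a hypothetical helper, not part of spaCy.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for a list of predicted entity strings."""
    true_positives = len(set(predicted) & set(gold))
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Assumed gold standard: the five people named in document_string
GOLD = ["Wes Scott", "Jacob Austin", "Robert Clowson",
        "John Douglas", "Jayden Green Olivia"]

# PERSON entities as listed in the two outputs above
SMALL = ["Wes Scott", "Jun 26", "Jacob Austin", "Robert Clowson",
         "John Douglas", "Jun 16 2017", "Jayden Green Olivia"]
MEDIUM = ["Wes Scott", "Jun 26", "Jacob Austin", "Robert Clowson",
          "Janury", "John Douglas", "Jun 16 2017", "Jayden Green Olivia"]

for name, predicted in (("small", SMALL), ("medium", MEDIUM)):
    p, r = precision_recall(predicted, GOLD)
    print(f"{name}: precision={p:.3f} recall={r:.3f}")
# small: precision=0.714 recall=1.000
# medium: precision=0.625 recall=1.000
```

Both models find all five people, so recall is the same; the extra date fragments are what pull precision down.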

What about using rules to improve accuracy?

In this small example, not only does the document have a clear structure, but the misclassified entities are all dates. So why not combine the initial model with a rule-based component?

The good news is that in spaCy:

it's possible to combine statistical and rule-based components in a variety of ways. Rule-based components can be used to improve the accuracy of statistical models

(from https://spacy.io/usage/rule-based-matching#models-rules)

So, by following the example and using the dateparser library (a parser for human readable dates) I've put together a rule-based component that works very well on this example:

from spacy.tokens import Span
import dateparser

TITLES = ("Dr", "Dr.", "Mr", "Mr.", "Ms", "Ms.")

def expand_person_entities(doc):
    new_ents = []
    for ent in doc.ents:
        if ent.label_ != "PERSON":
            # Keep non-person entities unchanged
            new_ents.append(ent)
        elif ent.start != 0 and doc[ent.start - 1].text in TITLES:
            # If the person is preceded by a title, include the title in the entity
            new_ents.append(Span(doc, ent.start - 1, ent.end, label=ent.label))
        elif dateparser.parse(ent.text) is None:
            # If the entity can be parsed as a date, it's not a person, so only
            # keep it when parsing fails (this also covers a person at token 0,
            # which the nested version silently dropped)
            new_ents.append(ent)
    doc.ents = new_ents
    return doc

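# Add the component after the named entity recognizer (spaCy v2 API).
# In spaCy v3, register the function with @Language.component("expand_person_entities")
# and add it by name instead: nlp.add_pipe("expand_person_entities", after="ner")
# nlp.remove_pipe('expand_person_entities')
nlp.add_pipe(expand_person_entities, after='ner')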

doc = nlp(document_string)
[(ent.text, ent.label_) for ent in doc.ents if ent.label_=='PERSON']
# Out:
# [('Wes Scott', 'PERSON'),
#  ('Dr. Jacob Austin', 'PERSON'),
#  ('Robert Clowson', 'PERSON'),
#  ('Dr. John Douglas', 'PERSON'),
#  ('Dr. Jayden Green Olivia', 'PERSON')]
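
If you'd rather not depend on dateparser, the same "drop anything that looks like a date" rule can be sketched with the standard library alone. This is only a rough stand-in under my own assumptions: `looks_like_date` and `drop_date_like` are hypothetical helpers, and the pattern only covers "month name + day (+ optional year)", so a bare or misspelled month such as "Janury" is not matched.

```python
import re

# Hypothetical, stdlib-only stand-in for the dateparser check: matches
# "<month name> <day>" with an optional four-digit year, e.g. "Jun 26".
_MONTH = (
    r"(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?|"
    r"aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)"
)
_DATE_RE = re.compile(rf"^{_MONTH}\.?\s+\d{{1,2}}(?:,?\s+\d{{4}})?$", re.IGNORECASE)

def looks_like_date(text):
    """True if the whole string is a month name followed by a day (and year)."""
    return _DATE_RE.match(text.strip()) is not None

def drop_date_like(candidates):
    """Keep only the candidate strings that do not look like dates."""
    return [c for c in candidates if not looks_like_date(c)]

# PERSON entities from the small-model output above
predicted = ["Wes Scott", "Jun 26", "Jacob Austin", "Robert Clowson",
             "John Douglas", "Jun 16 2017", "Jayden Green Olivia"]
print(drop_date_like(predicted))
```

Requiring a day number after the month keeps first names like "May" or "June" from being thrown away, at the cost of missing looser date fragments.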
2 of 3
1

Try this:

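import spacy
# Small English model (the 'en' shortcut no longer exists in spaCy v3)
en = spacy.load('en_core_web_sm')

# Read the input file and close it once the text has been parsed
with open('input.txt') as f:
    sents = en(f.read())
people = [ee for ee in sents.ents if ee.label_ == 'PERSON']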
🌐
Medium
manivannan-ai.medium.com › spacy-named-entity-recognizer-4a1eeee1d749
spaCy Named Entity Recognizer. How to extract the entity from text… | by Manivannan Murugavel | Medium
March 29, 2019 - The library is published under the MIT license and currently offers statistical neural network models for English, German, Spanish, Portuguese, French, Italian, Dutch and multi-language NER, as well as tokenization for various other languages. spaCy v2.0 features new neural models for tagging, parsing and entity recognition.