in other words, the value [of spaCy's EntityLinker] is from disambiguating when you have multiple exact-string matches and not from disambiguating from multiple very-fuzzy matches
That's right, and while I was reading about your use-case I came to the same idea that perhaps the EL, as implemented in spaCy's core, is not exactly what you need. It sounds to me like you'll want to leverage more fuzzy-based matching while at the same time maximising the probability of certain terms occurring together, and exploiting the relationship between the different parts of your location, e.g. "state should be in country". These are constraints that you could enforce for this specific use-case, and perhaps then you'd really need to run some kind of multi-constraint optimization framework to find the most optimal and coherent interpretation for each specific location occurrence.
then maybe it makes sense to put all my effort into the gazetteer and skip the NER?
The way you've described the NER step, I do think that you've overcomplicated it. Like you say, there is minimal context like "Location: ...". But for an NER system, especially spaCy's transition-based model, there's not that much of a difference between
Location: Mission, San Francisco
and
Location: San Francisco, CA
Whether a specific part is a street, a state or a country feels more like a dictionary-based lookup to me, rather than a pure NER challenge. I think I would personally advise just tagging Mission, San Francisco and CA as LOC, then dealing with the combination of various LOC entities in post-processing.
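To make that suggestion concrete, here is a minimal sketch (not the setup from the thread), assuming spaCy's pretrained en_core_web_sm model, whose GPE/LOC/FAC labels stand in for the single LOC label discussed above, with a plain comma split as the post-processing step:

```python
# Minimal sketch of "tag everything as LOC, combine in post-processing".
# Assumes the pretrained en_core_web_sm model; GPE/LOC/FAC stand in for LOC.
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_place_parts(line: str) -> list[str]:
    """Return the comma-separated location parts found after 'Location:'."""
    text = line.split("Location:", 1)[-1].strip()
    doc = nlp(text)
    loc_like = {"GPE", "LOC", "FAC"}
    ents = [ent.text for ent in doc.ents if ent.label_ in loc_like]
    # Post-processing: fall back to a plain comma split if NER missed parts,
    # then hand the ordered parts to the gazetteer / linking step.
    parts = [p.strip() for p in text.split(",") if p.strip()]
    return ents or parts

print(extract_place_parts("Location: Mission, San Francisco"))
print(extract_place_parts("Location: San Francisco, CA"))
```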
Thanks to Vimal for the thoughts. As I suspected, and Vishal confirmed, I needed to extract the string first, and then process it with a separate, non-NLP algorithm.
I ended up solving this in two ways, and wanted to document my findings.
- With some testing I found that an LLM (GPT) was actually pretty effective at determining the "administrative hierarchy" given the extracted string. This prompt, for example:
Evaluate the following place identifier and determine the most likely place. List the administrative entities from the lowest-level to the highest-level. Explain your reasoning. """Castro, San Francisco, U.S."""
returns this (plus some additional explanation):
The place identifier "Castro, San Francisco, U.S." likely refers to a specific location within the city of San Francisco in the United States.
I can't find the final version of my prompt at the moment, but I was able to tweak it to get it to return JSON with the administrative entities in order (national, first level, second level, etc.), plus I asked for any "geographical feature" as a catch-all. (In some cases my extracted term was something like "Bay Area, United States", and GPT was able to sort that out with the right prompting.) There was a little bit of hallucination in my testing, which worried me. (A rough sketch of this kind of call appears after this list.)
- With all that being said, I ended up going with a much lower-tech approach, along the lines of my original gazetteer idea. My original plan was to use the gazetteer and then sort out the remaining strings with GPT, but I ended up matching about 98% with the gazetteer alone, and the unmatched strings were pretty objectively wrong and not worth sending to GPT. (Like a city paired with an incorrect country.)
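For reference, here is a rough reconstruction of the LLM variant from the first bullet; this is not my original prompt or code, and it assumes the openai Python client with a placeholder model name:

```python
# Rough reconstruction of the LLM approach (not the original prompt or code).
# Assumes the `openai` package; the model name is a placeholder.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Evaluate the following place identifier and determine the most likely place. "
    "Return JSON with a key 'administrative_entities' listing the entities from the "
    "lowest level to the highest level, and a key 'geographical_feature' for any "
    "non-administrative feature (or null). Identifier: \"\"\"{place}\"\"\""
)

def llm_hierarchy(place: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any chat-capable model would do
        messages=[{"role": "user", "content": PROMPT.format(place=place)}],
        temperature=0,
    )
    # May need extra cleanup if the model wraps the JSON in prose or code fences.
    return json.loads(resp.choices[0].message.content)

print(llm_hierarchy("Castro, San Francisco, U.S."))
```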
To create the gazetteer:
This approach used a dictionary that I compiled from the Wikidata search API. I did a first pass and sent all the strings through Wikidata search to get the top 10 matching entities for each search string. E.g.,
https://www.wikidata.org/w/api.php?action=query&list=search&srsearch=castro%20san%20francisco%20california&srwhat=text&srlimit=10&srprop=titlesnippet|categorysnippet&srsort=incoming_links_desc
I did some pre-filtering to exclude any entity that wasn't an instance or subclass of a geographical feature or administrative region (using lists that I searched for and downloaded manually).
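A simplified sketch of that fetch-and-filter pass (the allowed-type Q-IDs are illustrative stand-ins for the lists I downloaded, and only "instance of" is checked here; the subclass expansion is omitted):

```python
# Sketch of the gazetteer-building pass: Wikidata full-text search, then a
# filter on P31 ("instance of"). ALLOWED_TYPES is an illustrative stand-in
# for the geographical-feature / administrative-region class lists.
import requests

API = "https://www.wikidata.org/w/api.php"
ALLOWED_TYPES = {"Q515", "Q6256"}  # e.g. city, country (illustrative)

def search_candidates(text: str, limit: int = 10) -> list[str]:
    """Return the top Q-IDs from Wikidata full-text search for a location string."""
    params = {
        "action": "query", "list": "search", "format": "json",
        "srsearch": text, "srwhat": "text", "srlimit": limit,
        "srsort": "incoming_links_desc",
    }
    hits = requests.get(API, params=params).json()["query"]["search"]
    return [hit["title"] for hit in hits]  # item titles are Q-IDs on wikidata.org

def is_place(qid: str) -> bool:
    """Keep only entities whose P31 value is in the allowed type list."""
    params = {"action": "wbgetentities", "ids": qid, "props": "claims", "format": "json"}
    claims = requests.get(API, params=params).json()["entities"][qid]["claims"]
    types = {c["mainsnak"]["datavalue"]["value"]["id"]
             for c in claims.get("P31", []) if c["mainsnak"].get("datavalue")}
    return bool(types & ALLOWED_TYPES)

candidates = [q for q in search_candidates("castro san francisco california") if is_place(q)]
print(candidates)
```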
Then I ran all those entities through a method that stored the data (including name, aliases, lat/lng, etc) and walked up through the "administrative entity" and "country" paths to collect the family tree.
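The walk itself follows Wikidata's P131 ("located in the administrative territorial entity") and P17 ("country") claims. Roughly, and without the caching and error handling the real version needs:

```python
# Simplified sketch of collecting an entity's "family tree" by walking
# P131 (located in the administrative territorial entity) and P17 (country).
import requests

API = "https://www.wikidata.org/w/api.php"

def get_entity(qid: str) -> dict:
    # Labels and aliases are what the real version stores alongside the tree.
    params = {"action": "wbgetentities", "ids": qid,
              "props": "labels|aliases|claims", "format": "json"}
    return requests.get(API, params=params).json()["entities"][qid]

def first_value(claims: dict, prop: str) -> str | None:
    """Return the first Q-ID value of a property, if any."""
    for claim in claims.get(prop, []):
        dv = claim["mainsnak"].get("datavalue")
        if dv:
            return dv["value"]["id"]
    return None

def family_tree(qid: str, max_depth: int = 10) -> list[str]:
    """Return the chain of parent Q-IDs: administrative parents first, country last."""
    chain, current = [], qid
    for _ in range(max_depth):
        claims = get_entity(current)["claims"]
        parent = first_value(claims, "P131")
        if parent is None:
            country = first_value(claims, "P17")
            if country and country not in chain:
                chain.append(country)
            break
        chain.append(parent)
        current = parent
    return chain
```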
To search for place names:
- Compiled a dictionary where the keys were entity names and aliases.
- I took the list of places (castro, san francisco, california) and started with the lowest-level string (castro), doing a fuzzy search against the dictionary keys to look for a match. Anything that matched over, e.g., 85% was chosen as a candidate.
- Then I created a list of all the candidate's parent (+grandparent/etc.) names and aliases, looped through the remaining place names to try to match each one against a name in the parents list, and got those scores.
- Added the scores up, with some other operations: divide by the number of places I was looking at, bias toward places with fewer parents (so Castro Valley, California would score higher than Castro, San Francisco, California), etc. (Sketched below.)
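A simplified sketch of that matching and scoring step, using difflib from the standard library for the fuzzy comparison; the gazetteer structure, IDs, and weights are illustrative, not exactly what I used:

```python
# Simplified sketch of candidate matching/scoring. The gazetteer structure,
# IDs and weights are illustrative, not the exact ones I used.
from difflib import SequenceMatcher

# name/alias -> list of (entity id, parent/grandparent names and aliases)
GAZETTEER = {
    "castro": [("castro_sf", ["San Francisco", "SF", "California", "United States"])],
    "castro valley": [("castro_valley", ["Alameda County", "California", "United States"])],
}

def fuzzy(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def score_candidates(places: list[str], threshold: float = 0.85) -> list[tuple[str, float]]:
    """Match the lowest-level place against gazetteer keys, then score the
    remaining places against each candidate's parent names."""
    lowest, rest = places[0], places[1:]
    scored = []
    for key, entries in GAZETTEER.items():
        key_score = fuzzy(lowest, key)
        if key_score < threshold:
            continue  # not a candidate
        for entity_id, parents in entries:
            parent_scores = [max((fuzzy(p, parent) for parent in parents), default=0.0)
                             for p in rest]
            total = key_score + sum(parent_scores)
            # Normalise by the number of places and bias toward fewer parents.
            total = total / len(places) * (1 / (1 + 0.05 * len(parents)))
            scored.append((entity_id, total))
    return sorted(scored, key=lambda s: s[1], reverse=True)

print(score_candidates(["castro", "san francisco", "california"]))
```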
All in all, this was surprisingly effective.