Answer from V12 on Stack Overflow
Sematext
Entity Extraction with spaCy
spaCy
EntityRecognizer · spaCy API Documentation
The transition-based algorithm used encodes certain assumptions that are effective for “traditional” named entity recognition tasks, but may not be a good fit for every span identification problem. Specifically, the loss function optimizes for whole entity accuracy, so if your inter-annotator agreement on boundary tokens is low, the component will likely perform poorly on your problem.
Discussions

Extracting and Identifying locations with NLP + Spacy - Stack Overflow
spaCy nlp - positions of entities in string, extracting nearby words
stackoverflow.com
Advanced entity extraction (NER) with GPT-NeoX 20B without annotation, and a comparison with spaCy
Ah. Classic conundrum. Good results, but not really easy to use in production!
r/LanguageTechnology, March 3, 2022
[D] Entity Extraction with LLMs
Giving it an out, so that it doesn't have to come up with an answer, is important; otherwise they treat everything like multiple choice even when it isn't. You can ask for a confidence interval (if you train it or give examples). Ask for 5 labels that fit on a first pass, then have it narrow them down on a second pass. Ask it why a label fits. Double-check with code. I think entity extraction is a subset of unstructured data extraction, and I find that using multiple models fed the same prompt, which reach the same results on the same data, is more stable than using a single model that happens to work.
r/MachineLearning, July 5, 2024
[D] Named Entity Recognition (NER) Libraries
If spaCy's NER isn't picking up what you need, you'll probably need to look into creating your own annotations and fine-tuning a model or training a custom model. It isn't too hard using BIO/BILOU tags. Things like "raw materials" and particularly niche models and brands are unlikely to be picked up by off-the-shelf solutions.
r/MachineLearning, January 2, 2023
Medium
Mastering Information Extraction from Unstructured Text: A Deep Dive into Named Entity Recognition with spaCy | by Sanskrutikhedkar | Medium
October 27, 2023 - Named Entity Recognition (NER): SpaCy can identify named entities in text, such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
GeeksforGeeks
Python | Named Entity Recognition (NER) using spaCy - GeeksforGeeks
July 12, 2025 - (NLP) to identify and classify important information within unstructured text. These "named entities" include proper nouns like people, organizations, locations and other meaningful categories such as dates, monetary values and products.
spaCy
spaCy 101: Everything you need to know · spaCy Usage Documentation
When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the processing pipeline.
Analytics Vidhya
Named Entity Recognition (NER) in Python with Spacy
May 1, 2025 - It automatically identifies and categorizes named entities (e.g., persons, organizations, locations, dates) in text data. spaCy NER is valuable for information extraction, entity recognition in documents, and improving the understanding of text ...
Top answer
1 of 2

in other words, the value [of spaCy's EntityLinker] is from disambiguating when you have multiple exact-string matches and not from disambiguating from multiple very-fuzzy matches

That's right, and while I was reading about your use-case I came to the same conclusion: perhaps the EL, as implemented in spaCy's core, is not exactly what you need. It sounds like you'll want to lean more on fuzzy matching, while at the same time maximising the probability of certain terms occurring together and exploiting the relationships between the different parts of your location, e.g. "state should be in country". These are constraints you could enforce for this specific use-case, and then you might really need to run some kind of multi-constraint optimisation framework to find the most coherent interpretation for each specific location occurrence.
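As a toy illustration of that kind of coherence constraint, the sketch below keeps only candidate interpretations whose parts are mutually consistent. All the candidate data and names here are invented; a real system would score candidates rather than hard-filter them.

```python
# Hypothetical candidate table: each ambiguous mention part maps to possible
# gazetteer entries with a known administrative parent (None = top level).
CANDIDATES = {
    "CA": [
        {"name": "California", "parent": "United States"},
        {"name": "Canada", "parent": None},
    ],
    "San Francisco": [
        {"name": "San Francisco", "parent": "California"},
    ],
}

def coherent_pairs(city_part, region_part):
    """Yield (city, region) readings where the city's parent is the region,
    i.e. the 'state should be in country'-style constraint holds."""
    for city in CANDIDATES.get(city_part, []):
        for region in CANDIDATES.get(region_part, []):
            if city["parent"] == region["name"]:
                yield (city["name"], region["name"])

print(list(coherent_pairs("San Francisco", "CA")))
```

Here the constraint alone is enough to resolve "CA" to California rather than Canada.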

then maybe it makes sense to put all my effort into the gazetteer and skip the NER?

The way you've described the NER step, I do think you've overcomplicated it. Like you say, there is minimal context, like "Location: ...". But for an NER system, especially spaCy's transition-based model, there isn't much of a difference between

Location: Mission, San Francisco

and

Location: San Francisco, CA

Whether a specific part is a street, a state or a country feels more like a dictionary-based lookup to me than a pure NER challenge. I would personally advise just tagging "Mission", "San Francisco" and "CA" as LOC, then dealing with the combination of the various LOC entities in post-processing.
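The post-processing step could start with something as simple as this sketch: assume an NER component has already tagged each comma-separated part of the "Location: ..." line as LOC, and normalise those spans into an ordered list (lowest level first) for a later gazetteer lookup. The input format is an assumption, not spaCy's actual output.

```python
def loc_parts(location_line):
    """Split 'Location: Mission, San Francisco' into ['Mission', 'San Francisco'].

    Assumes the parts appear lowest-level first, as in the examples above.
    """
    _, _, value = location_line.partition(":")
    return [part.strip() for part in value.split(",") if part.strip()]

print(loc_parts("Location: Mission, San Francisco"))
print(loc_parts("Location: San Francisco, CA"))
```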

2 of 2

Thanks to Vimal for the thoughts. As I suspected, and Vimal confirmed, I needed to extract the string first and then process it with a separate, non-NLP algorithm.

I ended up solving this in two ways, and wanted to document my findings.

  1. With some testing I found that an LLM (GPT) was actually pretty effective at determining the "administrative hierarchy" given the extracted string. This prompt, for example:
Evaluate the following place identifier and determine the most likely place. List the administrative entities from the lowest-level to the highest-level. Explain your reasoning. """Castro, San Francisco, U.S."""

returns this (plus some additional explanation):


The place identifier "Castro, San Francisco, U.S." likely refers to a specific location within the city of San Francisco in the United States. 

I can't find the final version of my prompt at the moment, but I was able to tweak it to get it to provide JSON with the administrative entities in order (national, first level, second level, etc), plus I asked for any "geographical feature" as a catch-all. (In some cases, I found that my extracted term was something like "Bay Area, United States", and GPT was able to sort that out with the right prompting.) There was a little bit of hallucination in my testing, which worried me.
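For illustration, parsing such a response is straightforward once it is JSON. The field names and shape below are assumptions about what a tweaked prompt might return (ordered administrative entities plus the "geographical feature" catch-all), not the actual output of the final prompt.

```python
import json

# Hypothetical model response -- schema is an assumption for illustration.
response = """
{
  "administrative_entities": ["Castro", "San Francisco", "California", "United States"],
  "geographical_feature": null
}
"""

parsed = json.loads(response)
print(parsed["administrative_entities"][0])   # lowest-level entity
print(parsed["administrative_entities"][-1])  # highest-level entity
```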

  2. With all that being said, I ended up with a much lower-tech approach, along the lines of my original gazetteer idea. My original plan was to use the gazetteer and then sort out the remaining strings with GPT. I ended up matching something like 98% using this approach, and the unmatched strings were pretty objectively wrong and not worth sending to GPT. (Like a city with an incorrect country.)

To create the gazetteer:

This approach used a dictionary that I compiled from the Wikidata search API. I did a first pass and sent all the strings through Wikidata search to get the top 10 matching entities for each search string. E.g.,

https://www.wikidata.org/w/api.php?action=query&list=search&srsearch=castro%20san%20francisco%20california&srwhat=text&srlimit=10&srprop=titlesnippet|categorysnippet&srsort=incoming_links_desc
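The same request can be built programmatically. This sketch only constructs the URL and leaves the actual HTTP call out; the added format=json parameter (not in the URL above) is what makes the API return machine-readable output.

```python
from urllib.parse import urlencode

# Mirror the query parameters from the URL above.
params = {
    "action": "query",
    "list": "search",
    "srsearch": "castro san francisco california",
    "srwhat": "text",
    "srlimit": 10,
    "srprop": "titlesnippet|categorysnippet",
    "srsort": "incoming_links_desc",
    "format": "json",  # assumption: added for machine-readable output
}
url = "https://www.wikidata.org/w/api.php?" + urlencode(params)
print(url)
```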

I did some pre-filtering to exclude any entity that wasn't an instance or subclass of a geographical feature or administrative region (using lists that I searched for and downloaded manually).

Then I ran all those entities through a method that stored the data (including name, aliases, lat/lng, etc) and walked up through the "administrative entity" and "country" paths to collect the family tree.
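As a rough sketch of that walk: the records below stand in for the stored Wikidata data (real entries would also carry aliases, lat/lng, etc.), and the walk simply follows each entity's administrative parent until it reaches the top.

```python
# Illustrative stand-in for the stored Wikidata records.
ENTITIES = {
    "Q62": {"name": "San Francisco", "admin_parent": "Q99"},
    "Q99": {"name": "California", "admin_parent": "Q30"},
    "Q30": {"name": "United States", "admin_parent": None},
}

def family_tree(entity_id):
    """Collect the entity and all its administrative ancestors, lowest first."""
    chain = []
    while entity_id is not None:
        record = ENTITIES[entity_id]
        chain.append(record["name"])
        entity_id = record["admin_parent"]
    return chain

print(family_tree("Q62"))
```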

To search for place names:

  1. I compiled a dictionary where the keys were entity names and aliases.
  2. I took the list of places (castro, san francisco, california) and started with the lowest-level string (castro) and did a fuzzy search against the dictionary keys to look for a match. Anything that matched over, e.g., 85% was chosen as a candidate.
  3. Then I created a list of all the candidate's parent (and grandparent, etc.) names and aliases, looped through the remaining place names trying to match each one against a name in that parents list, and recorded those scores.
  4. I added the scores up, with some other operations: dividing by the number of places I was looking at, biasing toward places with fewer parents (so Castro Valley, California would score higher than Castro, San Francisco, California), etc.

All in all, this was surprisingly effective.

Find elsewhere
spaCy
Linguistic Features · spaCy Usage Documentation
entities labeled as MONEY, and then uses the dependency parse to find the noun phrase they are referring to – for example "Net income" → "$9.4 million". ... For more examples of how to write rule-based information extraction logic that takes advantage of the model’s predictions produced by the different components, see the usage guide on combining models and rules.
Kaggle
Entity Extraction and Classification using SpaCy
spaCy
spaCy · Industrial-strength Natural Language Processing in Python
Components for named entity recognition, part-of-speech tagging, dependency parsing, sentence segmentation, text classification, lemmatization, morphological analysis, entity linking and more · Easily extensible with custom components and attributes · Support for custom models in PyTorch, TensorFlow and other frameworks ... The spacy-llm package integrates Large Language Models (LLMs) into spaCy, featuring a modular system for fast prototyping and prompting, and turning unstructured responses into robust outputs for various NLP tasks, no training data required.
Towards Data Science
Named Entity Recognition with NLTK and SpaCy
March 5, 2025 - Combined with the NLP power of the spacy python package, R can be used to locate geographical entities within a text and geocode those results. This is a helpful tool in digital humanities research, as well as HGIS.
spaCy
Projects · spaCy Usage Documentation
spaCy is a free open-source library for Natural Language Processing in Python. It features NER, POS tagging, dependency parsing, word vectors and more.
Towards Data Science
Custom Named Entity Recognition Using spaCy
June 18, 2025 - Your home for data science and AI. The world’s leading publication for data science, data analytics, data engineering, machine learning, and artificial intelligence professionals.
Medium
A basic Named entity recognition (NER) with SpaCy in 10 lines of ...
May 11, 2020 - I have another article discussing more interesting rule-based matching with EntityRuler A Closer Look at EntityRuler in SpaCy Rule-based Matching · Let’s start with the simpler one: Phrase Matcher. nlp = spacy.load('en_core_web_sm') phraseMatcher = PhraseMatcher(nlp.vocab, attr='LOWER') terms = ["cloud computing", "it", "information"] patterns = [nlp.make_doc(text) for text in terms] phraseMatcher.add("Match_By_Phrase", None, *patterns) doc = nlp(content) matches = phraseMatcher(doc) for match_id, start, end in matches: span = doc[start:end] print(span.text)
Medium
spaCy Named Entity Recognizer. How to extract the entity from text…
March 29, 2019 - The spaCy pretrained model has list of entity classes. I mentioned the classes and its descriptions below. ... $ python >>> import spacy >>> nlp = spacy.load("en") >>> text = "But Google is starting from behind.
Towards Data Science
Named Entity Recognition with Spacy and the Mighty roBERTa | Towards Data Science
March 5, 2025 - Now that we have installed all the libraries and defined our named entity extraction function, we can proceed with the analysis. An open-source library for industrial-strength NLP, trained on the OntoNotes 5.0 corpus. ... Traditional spaCy successfully identified CNN as an Organisation (ORG), Amy Schneider as a PERSON, and Oakland and California as Geo-Political Entities (GPE), etc.
Towards Data Science
Named Entity Recognition NER using spaCy | NLP | Part 4
March 5, 2025 - is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary ...