in other words, the value [of spaCy's EntityLinker] is from disambiguating when you have multiple exact-string matches and not from disambiguating from multiple very-fuzzy matches
That's right, and while I was reading about your use-case I came to the same idea that perhaps the EL, as implemented in spaCy's core, is not exactly what you need. It sounds to me like you'll want to leverage more fuzzy-based matching while at the same time maximising the probability of certain terms occurring together, and exploiting the relationship between the different parts of your location, e.g. "state should be in country". These are constraints that you could enforce for this specific use-case, and perhaps then you'd really need to run some kind of multi-constraint optimization framework to find the most optimal and coherent interpretation for each specific location occurrence.
then maybe it makes sense to put all my effort into the gazetteer and skip the NER?
The way you've described the NER step, I do think that you've overcomplicated it. Like you say, there is minimal context like "Location: ...". But for an NER system, especially spaCy's transition-based model, there's not that much of a difference between
Location: Mission, San Francisco
and
Location: San Francisco, CA
Whether a specific part is a street, a state or a country feels more like a dictionary-based lookup to me, rather than a pure NER challenge. I think I would personally advise just tagging Mission, San Francisco and CA as LOC, then dealing with the combination of various LOC entities in post-processing.
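To make that suggestion concrete, here is a minimal sketch (not the setup from the thread), assuming spaCy's pretrained en_core_web_sm model, whose GPE/LOC/FAC labels stand in for the single LOC label discussed above, with a plain comma split as the post-processing step:

```python
# Minimal sketch of "tag everything as LOC, combine in post-processing".
# Assumes the pretrained en_core_web_sm model; GPE/LOC/FAC stand in for LOC.
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_place_parts(line: str) -> list[str]:
    """Return the comma-separated location parts found after 'Location:'."""
    text = line.split("Location:", 1)[-1].strip()
    doc = nlp(text)
    loc_like = {"GPE", "LOC", "FAC"}
    ents = [ent.text for ent in doc.ents if ent.label_ in loc_like]
    # Post-processing: fall back to a plain comma split if NER missed parts,
    # then hand the ordered parts to the gazetteer / linking step.
    parts = [p.strip() for p in text.split(",") if p.strip()]
    return ents or parts

print(extract_place_parts("Location: Mission, San Francisco"))
print(extract_place_parts("Location: San Francisco, CA"))
```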
Thanks to Vimal for the thoughts. As I suspected, and Vishal confirmed, I needed to extract the string first, and then process it with a separate, non-NLP algorithm.
I ended up solving this in two ways, and wanted to document my findings.
- With some testing I found that an LLM (GPT) was actually pretty effective at determining the "administrative hierarchy" given the extracted string. This prompt, for example:
Evaluate the following place identifier and determine the most likely place. List the administrative entities from the lowest-level to the highest-level. Explain your reasoning. """Castro, San Francisco, U.S."""
returns this (plus some additional explanation):
The place identifier "Castro, San Francisco, U.S." likely refers to a specific location within the city of San Francisco in the United States.
I can't find the final version of my prompt at the moment, but I was able to tweak it to get it to return JSON with the administrative entities in order (national, first level, second level, etc.), plus I asked for any "geographical feature" as a catch-all. (In some cases my extracted term was something like "Bay Area, United States", and GPT was able to sort that out with the right prompting.) There was a little bit of hallucination in my testing, which worried me. (A rough sketch of this kind of call appears after this list.)
- With all that being said, I ended up going with a much lower-tech approach, along the lines of my original gazetteer idea. My original plan was to use the gazetteer and then sort out the remaining strings with GPT, but I ended up matching about 98% with the gazetteer alone, and the unmatched strings were pretty objectively wrong and not worth sending to GPT. (Like a city paired with an incorrect country.)
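For reference, here is a rough reconstruction of the LLM variant from the first bullet; this is not my original prompt or code, and it assumes the openai Python client with a placeholder model name:

```python
# Rough reconstruction of the LLM approach (not the original prompt or code).
# Assumes the `openai` package; the model name is a placeholder.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Evaluate the following place identifier and determine the most likely place. "
    "Return JSON with a key 'administrative_entities' listing the entities from the "
    "lowest level to the highest level, and a key 'geographical_feature' for any "
    "non-administrative feature (or null). Identifier: \"\"\"{place}\"\"\""
)

def llm_hierarchy(place: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any chat-capable model would do
        messages=[{"role": "user", "content": PROMPT.format(place=place)}],
        temperature=0,
    )
    # May need extra cleanup if the model wraps the JSON in prose or code fences.
    return json.loads(resp.choices[0].message.content)

print(llm_hierarchy("Castro, San Francisco, U.S."))
```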
To create the gazetteer:
This approach used a dictionary that I compiled from the Wikidata search API. I did a first pass and sent all the strings through Wikidata search to get the top 10 matching entities for each search string. E.g.,
https://www.wikidata.org/w/api.php?action=query&list=search&srsearch=castro%20san%20francisco%20california&srwhat=text&srlimit=10&srprop=titlesnippet|categorysnippet&srsort=incoming_links_desc
I did some pre-filtering to exclude any entity that wasn't an instance or subclass of a geographical feature or administrative region (using lists that I searched for and downloaded manually).
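A simplified sketch of that fetch-and-filter pass (the allowed-type Q-IDs are illustrative stand-ins for the lists I downloaded, and only "instance of" is checked here; the subclass expansion is omitted):

```python
# Sketch of the gazetteer-building pass: Wikidata full-text search, then a
# filter on P31 ("instance of"). ALLOWED_TYPES is an illustrative stand-in
# for the geographical-feature / administrative-region class lists.
import requests

API = "https://www.wikidata.org/w/api.php"
ALLOWED_TYPES = {"Q515", "Q6256"}  # e.g. city, country (illustrative)

def search_candidates(text: str, limit: int = 10) -> list[str]:
    """Return the top Q-IDs from Wikidata full-text search for a location string."""
    params = {
        "action": "query", "list": "search", "format": "json",
        "srsearch": text, "srwhat": "text", "srlimit": limit,
        "srsort": "incoming_links_desc",
    }
    hits = requests.get(API, params=params).json()["query"]["search"]
    return [hit["title"] for hit in hits]  # item titles are Q-IDs on wikidata.org

def is_place(qid: str) -> bool:
    """Keep only entities whose P31 value is in the allowed type list."""
    params = {"action": "wbgetentities", "ids": qid, "props": "claims", "format": "json"}
    claims = requests.get(API, params=params).json()["entities"][qid]["claims"]
    types = {c["mainsnak"]["datavalue"]["value"]["id"]
             for c in claims.get("P31", []) if c["mainsnak"].get("datavalue")}
    return bool(types & ALLOWED_TYPES)

candidates = [q for q in search_candidates("castro san francisco california") if is_place(q)]
print(candidates)
```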
Then I ran all those entities through a method that stored the data (including name, aliases, lat/lng, etc) and walked up through the "administrative entity" and "country" paths to collect the family tree.
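The walk itself follows Wikidata's P131 ("located in the administrative territorial entity") and P17 ("country") claims. Roughly, and without the caching and error handling the real version needs:

```python
# Simplified sketch of collecting an entity's "family tree" by walking
# P131 (located in the administrative territorial entity) and P17 (country).
import requests

API = "https://www.wikidata.org/w/api.php"

def get_entity(qid: str) -> dict:
    # Labels and aliases are what the real version stores alongside the tree.
    params = {"action": "wbgetentities", "ids": qid,
              "props": "labels|aliases|claims", "format": "json"}
    return requests.get(API, params=params).json()["entities"][qid]

def first_value(claims: dict, prop: str) -> str | None:
    """Return the first Q-ID value of a property, if any."""
    for claim in claims.get(prop, []):
        dv = claim["mainsnak"].get("datavalue")
        if dv:
            return dv["value"]["id"]
    return None

def family_tree(qid: str, max_depth: int = 10) -> list[str]:
    """Return the chain of parent Q-IDs: administrative parents first, country last."""
    chain, current = [], qid
    for _ in range(max_depth):
        claims = get_entity(current)["claims"]
        parent = first_value(claims, "P131")
        if parent is None:
            country = first_value(claims, "P17")
            if country and country not in chain:
                chain.append(country)
            break
        chain.append(parent)
        current = parent
    return chain
```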
To search for place names:
- Compiled a dictionary where the keys were entity names and aliases.
- I took the list of places (castro, san francisco, california) and started with the lowest-level string (castro), doing a fuzzy search against the dictionary keys to look for a match. Anything that matched over, e.g., 85% was chosen as a candidate.
- Then I created a list of all the candidate's parent (+grandparent/etc.) names and aliases, looped through the remaining place names to try to match each one against a name in the parents list, and got those scores.
- Added the scores up, with some other operations: divide by the number of places I was looking at, bias toward places with fewer parents (so Castro Valley, California would score higher than Castro, San Francisco, California), etc. (Sketched below.)
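A simplified sketch of that matching and scoring step, using difflib from the standard library for the fuzzy comparison; the gazetteer structure, IDs, and weights are illustrative, not exactly what I used:

```python
# Simplified sketch of candidate matching/scoring. The gazetteer structure,
# IDs and weights are illustrative, not the exact ones I used.
from difflib import SequenceMatcher

# name/alias -> list of (entity id, parent/grandparent names and aliases)
GAZETTEER = {
    "castro": [("castro_sf", ["San Francisco", "SF", "California", "United States"])],
    "castro valley": [("castro_valley", ["Alameda County", "California", "United States"])],
}

def fuzzy(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def score_candidates(places: list[str], threshold: float = 0.85) -> list[tuple[str, float]]:
    """Match the lowest-level place against gazetteer keys, then score the
    remaining places against each candidate's parent names."""
    lowest, rest = places[0], places[1:]
    scored = []
    for key, entries in GAZETTEER.items():
        key_score = fuzzy(lowest, key)
        if key_score < threshold:
            continue  # not a candidate
        for entity_id, parents in entries:
            parent_scores = [max((fuzzy(p, parent) for parent in parents), default=0.0)
                             for p in rest]
            total = key_score + sum(parent_scores)
            # Normalise by the number of places and bias toward fewer parents.
            total = total / len(places) * (1 / (1 + 0.05 * len(parents)))
            scored.append((entity_id, total))
    return sorted(scored, key=lambda s: s[1], reverse=True)

print(score_candidates(["castro", "san francisco", "california"]))
```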
All in all, this was surprisingly effective.