Hey atalnarayan,

Your overall approach seems to be on the right track, but the question is a bit too broad to answer in a discussion here. Let me point you to some relevant material:

  1. Finding video game titles with sense2vec: https://www.youtube.com/watch?v=EoYHbUHr0fM
  2. A detailed example of using the entity ruler to find museum names: https://www.youtube.com/watch?v=Ds18bQAzygo
  3. Going forward, we recommend the SpanRuler over the EntityRuler: https://spacy.io/api/spanruler
  4. Using the dependency tree to extract information: https://www.youtube.com/watch?v=BoyLPiXXEYA&t=429s
  5. For more in-depth information about entity extraction, I recommend this chapter: https://web.stanford.edu/~jurafsky/slp3/8.pdf
  6. For practical examples of machine-learning-based named entity recognition with spaCy, you can check out the relevant projects here: https://github.com/explosion/projects
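
To make points 2 and 3 a bit more concrete, here is a minimal sketch of rule-based matching with the SpanRuler. It assumes spaCy 3.3.1 or later. The MUSEUM label, the two patterns and the example sentence are made up for illustration and are not taken from the linked video.

```python
import spacy

# A blank English pipeline: tokenizer only, no trained components needed.
nlp = spacy.blank("en")

# The span ruler stores its matches in doc.spans["ruler"] by default.
ruler = nlp.add_pipe("span_ruler")

# Hypothetical patterns for illustration:
#  - a token pattern: one title-cased token followed by the word "museum"
#  - a phrase pattern: an exact string match
patterns = [
    {"label": "MUSEUM", "pattern": [{"IS_TITLE": True}, {"LOWER": "museum"}]},
    {"label": "MUSEUM", "pattern": "Rijksmuseum"},
]
ruler.add_patterns(patterns)

doc = nlp("We visited the Rijksmuseum and the British Museum last week.")
museum_spans = [(span.text, span.label_) for span in doc.spans["ruler"]]
print(museum_spans)
```

If you would rather have the matches in doc.ents, the span ruler can be configured for that (see the annotate_ents setting in the API docs linked in point 3).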
🌐
GitHub
github.com › egerber › spaCy-entity-linker
GitHub - egerber/spaCy-entity-linker: spaCy module for linking text to Wikidata items
Spacy Entity Linker is a pipeline for spaCy that performs Linked Entity Extraction with Wikidata on a given Document. The Entity Linking System operates by matching potential candidates from each sentence (subject, object, prepositional phrase, ...
Starred by 241 users
Forked by 34 users
Languages   Python 99.4% | Shell 0.6%
🌐
GitHub
github.com › cloudera › CML_AMP_SpaCy_Entity_Extraction
GitHub - cloudera/CML_AMP_SpaCy_Entity_Extraction: A Jupyter notebook demonstrating entity extraction on headlines with SpaCy.
SpaCy wraps industrial-strength natural language processing capabilities into a Python library with an elegant and powerful API. The notebook in this repo demonstrates its use for Named Entity Recognition (NER) on a real world news dataset.
Starred by 4 users
Forked by 5 users
Languages   Jupyter Notebook
🌐
GitHub
github.com › jenojp › extractacy
GitHub - jenojp/extractacy: Spacy pipeline object for extracting values that correspond to a named entity (e.g., birth dates, account numbers, laboratory results)
Starred by 54 users
Forked by 9 users
Languages   Python
🌐
GitHub
github.com › niraj1234567890 › entity_extraction_spaCy
GitHub - niraj1234567890/entity_extraction_spaCy: Entity_Extraction_using_Spacy
Entity_Extraction_using_Spacy. Contribute to niraj1234567890/entity_extraction_spaCy development by creating an account on GitHub.
Author   niraj1234567890
🌐
GitHub
github.com › osamadev › Named-Entity-Recognition-Using-Spacy
GitHub - osamadev/Named-Entity-Recognition-Using-Spacy: Named Entity Recognition Using Spacy
Named Entity Recognition Using Spacy. Contribute to osamadev/Named-Entity-Recognition-Using-Spacy development by creating an account on GitHub.
Author   osamadev
🌐
GitHub
github.com › ByUnal › Custom-Entity-Extraction-w-SpaCy
GitHub - ByUnal/Custom-Entity-Extraction-w-SpaCy: In this repo, SpaCy is used for entity extraction and categorization. We are customizing spacy to extract entities from the data. At the end, entities are categorized and similarity scores are calculated.
Author   ByUnal
🌐
GitHub
github.com › akash-kaul › Using-scispaCy-for-Named-Entity-Recognition
GitHub - akash-kaul/Using-scispaCy-for-Named-Entity-Recognition: A beginner's guide to using Named-Entity Recognition for data extraction from biomedical literature
Starred by 22 users
Forked by 13 users
Languages   Jupyter Notebook
🌐
GitHub
github.com › sulaihasubi › Named-Entity-Recognition-spaCy
GitHub - sulaihasubi/Named-Entity-Recognition-spaCy: 📖 This will be a complete end-to-end demonstration of the entire process, including both labeling and model training by @sulaihasubi
For this we use displacy which will display the entities in the text. from spacy import displacy example = "service postings marathon petroleum co said it reduced the contract price it will pay for all grades of service oil one dlr a barrel effective today the decrease brings marathon s posted price for both west texas intermediate and west texas sour to dlrs a bbl the south louisiana sweet grade of service was reduced to dlrs a bbl the company last changed its service postings on jan reuter" doc = nlp(example) displacy.render(doc, style='ent')
Author   sulaihasubi
🌐
spaCy
spacy.io › usage › linguistic-features
Linguistic Features · spaCy Usage Documentation
The standard way to access entity annotations is the doc.ents property, which produces a sequence of Span objects. The entity type is accessible either as a hash value or as a string, using the attributes ent.label and ent.label_.
🌐
GitHub
github.com › mpuig › spacy-lookup
GitHub - mpuig/spacy-lookup: Named Entity Recognition based on dictionaries
Named Entities are matched using the python module flashtext, and looks up in the data provided by different dictionaries. ... First, you need to download a language model. ... Import the component and initialise it with the shared nlp object ...
Starred by 242 users
Forked by 38 users
Languages   Python
🌐
GitHub
github.com › topics › entity-extraction
entity-extraction · GitHub Topics · GitHub
awesome entity-resolution ... knowledge-graphs llm mmkg domain-specific-knowledge ... Web-based Named Entity Recognition (NER) app using Flask and spaCy, featuring multilingual support, entity filtering, an API endpoint, ...
🌐
GitHub
github.com › explosion › spaCy › discussions › 11128
Using entity relation extraction to establish an entity hierarchy · explosion/spaCy · Discussion #11128
Hello, I’m looking to extract entity relations in spacy. For my use-case, I want to label two types of relations from text involving chess matches. I wish to relate a PERSON entity to a CHESS_PIECE...
Author   explosion
🌐
GitHub
github.com › RaThorat › entity-extraction-01
GitHub - RaThorat/entity-extraction-01: Entity Extraction from PDF files using spacy NLP model
Author   RaThorat
Top answer

Both of your questions can be answered in a similar way. Both the named entity recognition and part-of-speech tagging pipelines use machine learning models. Such models will make mistakes and these mistakes are hard for us to correct in general, because models are not a deterministic set of rules. The accuracy of a model depends on several factors, including:

  • The size of the training data;
  • the quality of the training data;
  • the size of the model.

Taking your named entity recognition example, the en_core_web_sm model is (as the name suggests) a small model. It uses a relatively small convolutional network and does not use static embeddings pretrained on a large corpus. Since the model is relatively limited, it may have picked up patterns like: capitalized words are typically names, except at the beginning of a sentence (since all sentence-initial words are capitalized). This may be the reason that it fails to annotate your example correctly. However, if you use the en_core_web_lg model instead, you will see that it returns the correct annotation:

('Dumbledore', 'PERSON')

You'd have to dive deeper to understand why it works in this case. But en_core_web_lg is a larger model that uses pretrained word embeddings. So it may, for example, be the case that _Dumbledore_ occurs in the set of word embeddings and that its vector is similar to the vectors of names the model has seen in the training data, allowing the model to infer that _Dumbledore_ must also be a name.

Similar reasoning applies to your second question: the models are a trade-off between size, speed and accuracy. In this case, too, en_core_web_lg consistently predicts 'VERB' as the tag for _finished_.

So what does this mean in practice? First, models make mistakes. Second, if the error rate is not acceptable, you may want to look at larger models (such as md/lg/trf) or, if you are working in a very specific domain, annotate more training data. Finally, do not underestimate the power of a set of rules. If you are working in a particular domain, say processing Harry Potter novels, you could get a lot of mileage out of making a small set of rules to recognize names since it is a finite set (using e.g. the attribute ruler).
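
As a rough sketch of the rule-based route for names: the example below uses the entity ruler rather than the attribute ruler, since the entity ruler is the component that writes matches to doc.ents. The name list and the sentence are hypothetical and only serve as an illustration.

```python
import spacy

# A blank English pipeline, so no trained model needs to be downloaded.
nlp = spacy.blank("en")

# The entity ruler writes its matches to doc.ents.
ruler = nlp.add_pipe("entity_ruler")

# Hypothetical, hand-maintained list of character names (a finite set).
CHARACTER_NAMES = ["Dumbledore", "Harry Potter", "Hermione Granger"]
ruler.add_patterns(
    [{"label": "PERSON", "pattern": name} for name in CHARACTER_NAMES]
)

# The rule matches the name even at the start of a sentence,
# where capitalization alone is not a reliable cue.
doc = nlp("Dumbledore finished his speech before Harry Potter arrived.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

In a trained pipeline you could add the same component before the statistical one, e.g. nlp.add_pipe("entity_ruler", before="ner"), so that the rules take precedence and the model fills in the rest.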

🌐
GitHub
github.com › chawla201 › Custom-Named-Entity-Recognition
GitHub - chawla201/Custom-Named-Entity-Recognition: NLP | NER | SpaCy
Lists of company names and addresses are stored in a dictionary format and are searched through if the NER model fails to identify the entity. Evaluation metric used to measure the model performance is F1 score.
Starred by 27 users
Forked by 10 users
Languages   Jupyter Notebook 94.3% | Python 5.7%
🌐
GitHub
github.com › explosion › spaCy › issues › 3303
Information Extraction (Knowledge Triples) · Issue #3303 · explosion/spaCy
September 14, 2018 - For each entity, extract all the possible knowledge triples.
Published   Feb 20, 2019
🌐
GitHub
github.com › explosion › spacy-llm
GitHub - explosion/spacy-llm: 🦙 Integrating LLMs into structured NLP pipelines
With only a few (and sometimes no) examples, an LLM can be prompted to perform custom NLP tasks such as text categorization, named entity recognition, coreference resolution, information extraction and more. spaCy is a well-established library for building systems that need to work with language ...
Starred by 1.4K users
Forked by 106 users
Languages   Python 96.7% | Jinja 3.3%
🌐
GitHub
github.com › DataTurks-Engg › Entity-Recognition-In-Resumes-SpaCy
GitHub - DataTurks-Engg/Entity-Recognition-In-Resumes-SpaCy: Automatic Summarization of Resumes with NER -> Evaluate resumes at a glance through Named Entity Recognition
The above dataset consisting of 220 annotated resumes can be found [here](https://dataturks.com/projects/abhishek.narayanan/Entity Recognition in Resumes). We train the model with 200 resume data and test it on 20 resume data. We use python’s spaCy module for training the NER model.
Starred by 448 users
Forked by 216 users
Languages   Python