Both of your questions have a similar answer. Both the named entity recognizer and the part-of-speech tagger are machine learning models. Such models make mistakes, and those mistakes are hard for us to correct in general, because a model is not a deterministic set of rules that can simply be edited. The accuracy of a model depends on several factors, including:

  • the size of the training data;
  • the quality of the training data;
  • the size of the model.

Taking your named entity recognition example: en_core_web_sm is, as the name suggests, a small model. It uses a relatively small convolutional network, and it does not use static embeddings pretrained on a large corpus. Because the model is relatively limited, it may have picked up a pattern such as: capitalized words are typically names, except at the beginning of a sentence (where every word is capitalized, so capitalization carries no information). This may be why it fails to annotate your example correctly. If you use the en_core_web_lg model instead, you will see that it returns the correct annotation:

('Dumbledore', 'PERSON')
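One way to check this yourself is to run the same text through both pipelines and print the entities side by side. This is only a minimal sketch: the example sentence is a placeholder (not the sentence from your question), the helper name `ner_by_model` is made up for illustration, and it assumes the models were installed with `python -m spacy download <name>`. Models that are not installed are skipped.

```python
import spacy


def ner_by_model(text, model_names):
    """Return {model_name: [(entity_text, label), ...]} for each installed model."""
    results = {}
    for name in model_names:
        try:
            nlp = spacy.load(name)
        except OSError:
            # spacy.load raises OSError when the model package is not installed.
            continue
        doc = nlp(text)
        results[name] = [(ent.text, ent.label_) for ent in doc.ents]
    return results


if __name__ == "__main__":
    # Placeholder sentence (assumption): substitute the text you are debugging.
    text = "Dumbledore walked into the Great Hall."
    models = ["en_core_web_sm", "en_core_web_lg"]
    for name, ents in ner_by_model(text, models).items():
        print(name, ents)
```

The predictions depend on the model version you have installed, so treat any difference you see as a property of those specific models.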

You would have to dig deeper to understand exactly why it works in this case, but en_core_web_lg is a larger model that uses pretrained word embeddings. It may be, for example, that Dumbledore occurs in the set of word embeddings and that its vector is similar to the vectors of names the model has seen in the training data, allowing the model to infer that Dumbledore must also be a name.

Similar reasoning applies to your second question: the models are a trade-off between size, speed and accuracy. In this case, too, en_core_web_lg consistently predicts 'VERB' as the tag for 'finished'.
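To see how consistent a pipeline is for one word, you can collect the coarse-grained tag (`token.pos_`) for every occurrence of that word across a few sentences. Again a sketch under assumptions: the sentences are placeholders rather than the ones from your question, and `pos_tags_for` is a hypothetical helper name.

```python
import spacy


def pos_tags_for(nlp, texts, word):
    """Collect the coarse POS tag predicted for every occurrence of `word`."""
    tags = []
    for doc in nlp.pipe(texts):
        tags.extend(tok.pos_ for tok in doc if tok.lower_ == word.lower())
    return tags


if __name__ == "__main__":
    # Placeholder sentences (assumption): replace with your own examples.
    texts = ["She finished the book.", "He has finished his homework."]
    for name in ("en_core_web_sm", "en_core_web_lg"):
        try:
            nlp = spacy.load(name)
        except OSError:
            print(name, "is not installed, skipping")
            continue
        print(name, pos_tags_for(nlp, texts, "finished"))
```

If the list for a model contains more than one distinct tag, that model is not tagging the word consistently on your examples.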

So what does this mean in practice? First, models make mistakes. Second, if the error rate is not acceptable, you may want to look at larger models (such as md/lg/trf) or, if you are working in a very specific domain, annotate more training data. Finally, do not underestimate the power of a set of rules. If you are working in a particular domain, say processing Harry Potter novels, you could get a lot of mileage out of a small set of rules for recognizing names, since the character names form a finite set (using e.g. the entity ruler for entities, or the attribute ruler for token attributes such as part-of-speech tags).
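As a rough illustration of the rule-based route, the sketch below builds a blank English pipeline that contains only rules: an entity ruler that labels a hand-written list of names as PERSON, and an attribute ruler that forces a part-of-speech tag for one word. The name list, the `build_rule_pipeline` helper and the blanket "finished is always a VERB" rule are assumptions made for the example, not recommendations.

```python
import spacy

# Hypothetical, hand-written name list; a real project would extend it.
CHARACTER_NAMES = ["Dumbledore", "Hermione", "Harry Potter"]


def build_rule_pipeline():
    """Blank English pipeline with rule-based components only (no trained model)."""
    nlp = spacy.blank("en")

    # Entity ruler: mark every listed name as a PERSON entity.
    entity_ruler = nlp.add_pipe("entity_ruler")
    entity_ruler.add_patterns(
        [{"label": "PERSON", "pattern": name} for name in CHARACTER_NAMES]
    )

    # Attribute ruler: set the coarse POS tag for tokens matching a pattern.
    # Tagging every "finished" as a verb is deliberately naive.
    attribute_ruler = nlp.add_pipe("attribute_ruler")
    attribute_ruler.add(patterns=[[{"LOWER": "finished"}]], attrs={"POS": "VERB"})
    return nlp


if __name__ == "__main__":
    nlp = build_rule_pipeline()
    doc = nlp("Dumbledore finished the lesson with Harry Potter.")
    print([(ent.text, ent.label_) for ent in doc.ents])
    print([(tok.text, tok.pos_) for tok in doc if tok.pos_])
```

In a trained pipeline you would more likely combine rules with the statistical components, for example by adding the entity ruler with `nlp.add_pipe("entity_ruler", before="ner")` so that the rule matches are set before the model makes its own predictions.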
