Both of your questions can be answered in a similar way. Both the named entity recognition and part-of-speech tagging pipelines use machine learning models. Such models make mistakes, and these mistakes are hard to correct in general, because a model is not a deterministic set of rules. The accuracy of a model depends on several factors, including:
- the size of the training data;
- the quality of the training data;
- the size of the model.
Taking your named entity recognition example, the en_core_web_sm model is (as the name suggests) a small model. It uses a relatively small convolutional network and does not use static embeddings pretrained on a large corpus. Since the model is relatively limited, it may have picked up patterns like: words that are capitalized are typically names, except when they occur at the beginning of a sentence (since all sentence-initial words are capitalized). This may be why it fails to annotate your example correctly. However, if you use the en_core_web_lg model instead, you will see that it returns the correct annotation:
('Dumbledore', 'PERSON')
You'd have to dive deeper to understand why it works in this case, but en_core_web_lg is a larger model that uses pretrained word embeddings. So it may, for example, be the case that Dumbledore occurs in the set of word embeddings and that its vector is similar to those of other names, allowing the model to extrapolate that, since the vector of Dumbledore resembles those of names it has seen in the training data, Dumbledore must also be a name.
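As a quick way to check this yourself, here is a minimal sketch (the exact sentence from your question isn't reproduced here, so the example sentence is my own):

import spacy

# assumes the large model is installed: python -m spacy download en_core_web_lg
nlp = spacy.load("en_core_web_lg")
doc = nlp("Dumbledore raised his wand.")  # stand-in for your sentence
print([(ent.text, ent.label_) for ent in doc.ents])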
Similar reasoning applies to your second question: the models are a trade-off between size, speed, and accuracy. In this case too, en_core_web_lg consistently predicts 'VERB' as the tag for finished.
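Again as a sketch only (using a stand-in sentence, since your original one isn't shown here):

import spacy

nlp = spacy.load("en_core_web_lg")
doc = nlp("She finished the book yesterday.")  # stand-in sentence containing "finished"
for token in doc:
    print(token.text, token.pos_, token.tag_)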
So what does this mean in practice? First, models make mistakes. Second, if the error rate is not acceptable, you may want to look at larger models (such as md/lg/trf), or, if you are working in a very specific domain, at annotating more training data. Finally, do not underestimate the power of a set of rules. If you are working in a particular domain, say processing Harry Potter novels, you could get a lot of mileage out of making a small set of rules to recognize names, since it is a finite set (using e.g. the attribute ruler).
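As a sketch of the rule-based route: the attribute ruler mentioned above sets token attributes, while for adding entities by rule spaCy 3 provides the entity_ruler component. The names and patterns below are purely illustrative:

import spacy

nlp = spacy.load("en_core_web_sm")
# add the entity ruler before the statistical NER so its matches take precedence
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    {"label": "PERSON", "pattern": "Dumbledore"},
    {"label": "PERSON", "pattern": [{"LOWER": "hermione"}, {"LOWER": "granger"}]},
])
doc = nlp("Dumbledore spoke with Hermione Granger.")
print([(ent.text, ent.label_) for ent in doc.ents])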
As per the spaCy documentation for Named Entity Recognition, here is the way to extract named entities:
import spacy
nlp = spacy.load('en') # install 'en' model (python3 -m spacy download en)
doc = nlp("Alphabet is a new startup in China")
print('Name Entity: {0}'.format(doc.ents))
Result
Name Entity: (China,)
To make "Alphabet" a 'Noun' append it with "The".
doc = nlp("The Alphabet is a new startup in China")
print('Name Entity: {0}'.format(doc.ents))
Name Entity: (Alphabet, China)
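If you also want to see which labels are assigned (whether Alphabet comes out as an organization depends on the model and version, so treat this as a sketch):

import spacy

nlp = spacy.load('en')
doc = nlp("The Alphabet is a new startup in China")
# print each entity together with its label
for ent in doc.ents:
    print(ent.text, ent.label_)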
In spaCy version 3, Transformers from Hugging Face are fine-tuned for the same operations that spaCy provided in previous versions, but with better results.
Transformers are currently (2020) the state of the art in Natural Language Processing. Roughly speaking, we first had static representations (one-hot encoding -> word2vec -> GloVe | fastText), then recurrent architectures (recurrent neural networks, recursive neural networks, gated recurrent units, long short-term memory, bi-directional LSTMs, etc.), and now Transformers plus attention (BERT, RoBERTa, XLNet, XLM, CTRL, ALBERT, T5, BART, GPT, GPT-2, GPT-3). This is just to give context for why you should consider Transformers; there is a lot of other stuff I didn't mention, like Fuzz, Knowledge Graphs, and so on.
Install the dependencies:
sudo apt install libncurses5
pip install spacy-transformers --pre -f https://download.pytorch.org/whl/torch_stable.html
pip install spacy-nightly # I'm using 3.0.0rc2
Download a model:
python -m spacy download en_core_web_trf # English transformer pipeline, RoBERTa base
Here's a list of available models.
And then use it as you would normally do:
import spacy
text = 'Type something here which can be related to something, e.g Stack Over Flow organization'
nlp = spacy.load('en_core_web_trf')
document = nlp(text)
print(document.ents)
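Continuing from the snippet above, you can also print the entity labels and check that the loaded pipeline actually contains a transformer component (component names may differ between versions, so this is a sketch):

# print each entity together with its label
print([(ent.text, ent.label_) for ent in document.ents])

# list the pipeline components; en_core_web_trf should include a 'transformer' component
print(nlp.pipe_names)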
References:
Learn about Transformers and Attention.
Read a summary about the different Transformers architectures.
Learn about the Transformers fine-tuning done by spaCy.