The pipeline object can do that for you when you set the parameter:
- transformers < 4.7.0: grouped_entities to True.
- transformers >= 4.7.0: aggregation_strategy to simple
from transformers import pipeline
#transformers < 4.7.0
#ner = pipeline("ner", grouped_entities=True)
ner = pipeline("ner", aggregation_strategy='simple')
sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very close to the Manhattan Bridge which is visible from the window."
output = ner(sequence)
print(output)
Output:
[{'entity_group': 'I-ORG', 'score': 0.9970663785934448, 'word': 'Hugging Face Inc'}
, {'entity_group': 'I-LOC', 'score': 0.9993778467178345, 'word': 'New York City'}
, {'entity_group': 'I-LOC', 'score': 0.9571147759755453, 'word': 'DUMBO'}
, {'entity_group': 'I-LOC', 'score': 0.9838141202926636, 'word': 'Manhattan Bridge'}
, {'entity_group': 'I-LOC', 'score': 0.9838141202926636, 'word': 'Manhattan Bridge'}]
Answer from cronoik on Stack Overflow
Quick update: grouped_entities has been deprecated.
UserWarning:
grouped_entities is deprecated and will be removed in version v5.0.0, defaulted to aggregation_strategy="AggregationStrategy.SIMPLE" instead.
f'grouped_entities is deprecated and will be removed in version v5.0.0, defaulted to aggregation_strategy="{aggregation_strategy}" instead.'
You will have to change your code to:
ner = pipeline("ner", aggregation_strategy="simple")
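If the same script has to run on both sides of the 4.7.0 boundary, the keyword can be picked from the installed version. This is only a sketch: aggregation_kwargs is a hypothetical helper name (not part of transformers), and it assumes the version string starts with numeric major and minor components.

```python
def aggregation_kwargs(transformers_version):
    """Return the pipeline keyword argument that groups entities.

    Hypothetical helper: picks the argument name based on the 4.7.0
    boundary mentioned above. Only the major and minor parts are parsed,
    so suffixes such as ".dev0" are ignored.
    """
    major, minor = (int(part) for part in transformers_version.split(".")[:2])
    if (major, minor) >= (4, 7):
        return {"aggregation_strategy": "simple"}
    return {"grouped_entities": True}
```

It would be called as pipeline("ner", **aggregation_kwargs(transformers.__version__)), with transformers imported beforehand.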
I'm trying to fine-tune BERT to do named-entity recognition (i.e. token classification with some extra steps). Most of my documents are longer than BERT's 512-token max length, so I can't evaluate the whole doc in one go.
In theory, I think what I want to do is have a sliding window that averages the logits for the overlapping sections. I am not sure how to accomplish this using TokenClassificationPipeline (source), which seems to automatically truncate the input text to the model's max length.
Does anyone know an easy way to accomplish this? Or should I make a feature request to HuggingFace? Or is there a third option?
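To make the sliding-window idea concrete, here is a minimal sketch of the averaging step on its own, using only NumPy. It is not how TokenClassificationPipeline works internally. predict_logits is a hypothetical callable standing in for one forward pass of the model over a chunk of at most window tokens, and the window and stride values are assumptions. Special tokens such as [CLS] and [SEP] are ignored here for brevity.

```python
import numpy as np


def sliding_window_logits(token_ids, predict_logits, window=512, stride=256):
    """Average per-token logits over overlapping windows.

    predict_logits is a hypothetical callable: it receives a list of at most
    `window` token ids and returns logits of shape (len(chunk), num_labels).
    """
    n = len(token_ids)
    if n == 0:
        raise ValueError("token_ids is empty")
    if not 0 < stride <= window:
        raise ValueError("stride must be in (0, window] so every token is covered")

    summed = None                          # running sum of logits per token
    counts = np.zeros(n, dtype=np.int64)   # number of windows covering each token
    start = 0
    while True:
        end = min(start + window, n)
        logits = np.asarray(predict_logits(token_ids[start:end]), dtype=np.float64)
        if summed is None:
            summed = np.zeros((n, logits.shape[1]), dtype=np.float64)
        summed[start:end] += logits
        counts[start:end] += 1
        if end == n:                       # last window reached the end of the document
            break
        start += stride

    # Tokens in an overlap were summed more than once; divide to get the mean.
    return summed / counts[:, None]
```

Taking argmax over the last axis of the result gives one label id per token for the whole document. With a real model, predict_logits would wrap the tokenized chunk in a batch, run the token-classification model, and return the logits for that chunk.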