named entity recognition dataset

Dataset for Named Entity Recognition on Informal Text

datascience.stackexchange.com › questions › 641 › dataset-for-named-entity-recognition-on-informal-text

As I understand it, these are the properties that you're seeking in a sample dataset:

Text data
It should be informal, i.e. contain typos and slang, and not be professionally edited
Something other than Twitter (I don't blame you, Twitter is a useful yet heavily overused example data source in text mining)

Here are some recommendations:

Emails from the SpamAssassin corpus -- note that both "ham" (non-spam) and spam datasets are available
microblogPCU data set from UCI, which is data scraped from the microblogs of Sina Weibo users -- note, the raw text data is a mix of Chinese and English (you could perform machine translation of the Chinese, filter to only English, or use it as-is)
Amazon Commerce reviews dataset from UCI
Within the bag-o-words dataset, try using the Enron emails
The Twenty Newsgroups dataset
This nice collection of SMS spam
You can always scrape (extract) your own text data from the Internet; I'm not sure which language or statistical package you're using, but XPath-based packages are available in R (rvest, scrapeR, etc.) and Python to accomplish this

Answer from Hack-R on Stack Exchange
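
The scraping suggestion in the answer above can be sketched in Python. This is a minimal illustration, not the answerer's own method: it uses the standard library's xml.etree.ElementTree, which supports only a limited XPath subset and requires well-formed markup. The PAGE string, the "post" class name, and the extract_posts helper are hypothetical stand-ins for a downloaded page. Real web pages are rarely well-formed XML, so a third-party HTML parser with fuller XPath support (lxml is a common choice) would likely be needed in practice.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML snippet standing in for a downloaded page.
PAGE = """<html><body>
<div class="post"><p>omg cant wait 2 see u in nyc!!</p></div>
<div class="ad"><p>Buy now</p></div>
<div class="post"><p>lol ok, meet at Joes Pizza</p></div>
</body></html>"""


def extract_posts(markup: str) -> list[str]:
    """Return the text of every <p> directly inside a <div class="post">."""
    root = ET.fromstring(markup)
    # ElementTree's XPath subset supports attribute predicates like this one.
    paragraphs = root.findall(".//div[@class='post']/p")
    return [p.text.strip() for p in paragraphs if p.text]


posts = extract_posts(PAGE)
for post in posts:
    print(post)
```

The advertisement block is skipped because its class attribute does not match the predicate, which is the main benefit of selecting by XPath rather than grabbing all paragraph text.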

Kaggle

kaggle.com › datasets › naseralqaydeh › named-entity-recognition-ner-corpus

Named Entity Recognition (NER) Corpus

GitHub

github.com › juand-r › entity-recognition-datasets

GitHub - juand-r/entity-recognition-datasets: A collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.

Starred by 1.6K users

Forked by 248 users

Languages Python 99.2% | Shell 0.8%

Videos

KazNERD: Kazakh Named Entity Recognition Dataset - YouTube

How to Use spaCy to Create an NER training set (Named Entity ...

December 2, 2020

Kaggle Live-Coding: Named Entity Recognition | Kaggle - YouTube

July 28, 2018

How to do efficient Named Entity Recognition (NER) based ...

Named Entity Recognition Using BERT Transformers-@shahzaib_hamid ...

February 22, 2023

Learn How to Build a Custom Named Entity Recognition (NER) model ...

metatext.io › datasets-list › ner-task

+86 Ner Datasets - NLP Database

Dataset contains a total of 13.6M articles across several languages: English, Spanish, Italian, German, French and Arabic. The annotations include labeled text mentions mapping to entities (represented by their Freebase machine ids) as well as the type of entity.

Kaggle

kaggle.com › datasets › debasisdotcom › name-entity-recognition-ner-dataset

Name Entity Recognition (NER) Dataset

GitHub

github.com › davidsbatista › NER-datasets

GitHub - davidsbatista/NER-datasets: Datasets to train supervised classifiers for Named-Entity Recognition in different languages (Portuguese, German, Dutch, French, English)

Starred by 348 users

Forked by 82 users

Languages Python

arXiv

arxiv.org › abs › 2310.14282

[2310.14282] NERetrieve: Dataset for Next Generation Named Entity Recognition and Retrieval

October 22, 2023 - NERetrieve: Dataset for Next Generation Named Entity Recognition and Retrieval, by Uri Katz and 3 other authors

ACL Anthology

aclanthology.org › 2021.acl-long.248

Few-NERD: A Few-shot Named Entity Recognition Dataset - ACL Anthology

The Few-NERD dataset and the baselines will be publicly available to facilitate the research on this problem. ... Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) ... Ning Ding, Guangwei Xu, Yulin Chen, Xiaobin Wang, Xu Han, Pengjun Xie, Haitao Zheng, and Zhiyuan Liu. 2021. Few-NERD: A Few-shot Named Entity Recognition Dataset.

Stack Exchange

datascience.stackexchange.com › questions › 641 › dataset-for-named-entity-recognition-on-informal-text

nlp - Dataset for Named Entity Recognition on Informal Text - Data Science Stack Exchange

2 of 3

Check these:

Repository of Test Domains for Information Extraction: http://www.isi.edu/info-agents/RISE/repository.html

DBpedia: http://wiki.dbpedia.org/Downloads32 (mirror)

Updated links:

http://www.isi.edu/integration/RISE/

https://github.com/dbpedia/extraction-framework/wiki/The-DBpedia-Data-Set

Springer

link.springer.com › home › international journal of computational intelligence systems › article

Named Entity Recognition Datasets: A Classification Framework | International Journal of Computational Intelligence Systems

March 28, 2024 - In this thesis, we review the development of named entity recognition datasets over the years and describe them in terms of the language of the dataset, the domain of research, the type of entity, the granularity of the entity, and the annotation of the entity.

ScienceDirect

sciencedirect.com › science › article › pii › S2949719123000146

A survey on Named Entity Recognition — datasets, tools, and methodologies - ScienceDirect

May 26, 2023 - We examine the most relevant datasets, tools, and deep learning approaches like Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Bidirectional Long Short Term Memory, Transfer learning approaches, and numerous other approaches currently being used in present-day NER problem environments and their applications.

Aksw

svn.aksw.org › papers › 2014 › LREC_N3NIFNERNED › public.pdf

N3 - A Collection of Datasets for Named Entity Recognition and

Defined

defined.ai › datasets › named-entity-recognition

Named Entity Recognition Dataset | Defined.ai

Discover our Named Entity Recognition Dataset, a broad collection featuring 150,000 sentences annotated across 24 named entity categories in ten languages, including Norwegian (Bokmål), Finnish, Turkish, Hindi, Arabic, Danish, Swedish, Hebrew, Russian, and Czech.

Nature

nature.com › scientific data › data descriptors › article

Gold standard, multi-genre dataset for named entity recognition and linking | Scientific Data

June 13, 2025 - Scientific Data - Gold standard, multi-genre dataset for named entity recognition and linking

NLP-progress

nlpprogress.com › english › named_entity_recognition.html

Named entity recognition | NLP-progress

The OntoNotes v5 NER dataset (of interest here) includes 18 tags, consisting of 11 types (PERSON, ORGANIZATION, etc.) and 7 values (DATE, PERCENT, etc.), and contains 2 million tokens. The common data split used in NER is defined in Pradhan et al. 2013. Few-NERD is a large-scale, fine-grained manually annotated named entity recognition dataset, which contains 8 coarse-grained types, 66 fine-grained types, 188,200 sentences, 491,711 entities and 4,601,223 tokens.

ACL Anthology

aclanthology.org › P19-1510

NNE: A Dataset for Nested Named Entity Recognition in English Newswire - ACL Anthology

Named entity recognition (NER) is widely used in natural language processing applications and downstream tasks. However, most NER tools target flat annotation from popular datasets, eschewing the semantic information available in nested entity ...

Kaggle

kaggle.com › datasets › abhinavwalia95 › entity-annotated-corpus

Annotated Corpus for Named Entity Recognition | Kaggle

September 21, 2017 - Feature Engineered Corpus annotated with IOB and POS tags
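
As a rough illustration of what IOB tagging means, the sketch below groups token-level tags into entity spans. It assumes the common IOB2 convention, where B- opens an entity, I- continues it, and O marks tokens outside any entity. The iob_to_spans helper, the example sentence, and the PER/LOC tag names are illustrative assumptions and are not taken from this corpus.

```python
def iob_to_spans(tokens, tags):
    """Group IOB2-tagged tokens into (entity_text, entity_type) pairs."""
    spans = []
    current_tokens, current_type = [], None

    def flush():
        # Emit the entity collected so far, if any.
        if current_tokens:
            spans.append((" ".join(current_tokens), current_type))

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            # Tolerate an I- tag that does not continue the open entity.
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:  # "O": outside any entity
            flush()
            current_tokens, current_type = [], None
    flush()
    return spans


# Illustrative sentence and tags (assumed, not from the dataset).
tokens = ["Barack", "Obama", "visited", "New", "York", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O"]
spans = iob_to_spans(tokens, tags)
print(spans)
```

Treating a stray I- tag as the start of a new entity is one lenient design choice; a stricter reader could raise an error instead.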

ScienceDirect

sciencedirect.com › science › article › pii › S2352340924007881

FloraNER: A new dataset for species and morphological terms named entity recognition in French botanical text - ScienceDirect

August 10, 2024 - FloraNER comprises separate sub-datasets for the recognition of plant species names, as well as coarse-grained and fine-grained botanical morphological terms. The resulting datasets are in CSV format, displaying textual data, identified named entities, and their annotations, covering one named entity type “Species” (Espèce in French) for species name identification, two named entity types “Organ” and “Descriptor” for coarse-grained morphological term identification, and eight named entity types for fine-grained morphological term identification: Organ, Descriptor, Form, Color, Development, Structure, Surface, Position, Disposition, and Measure.

Papers with Code

paperswithcode.com › datasets

Papers with Code - Machine Learning Datasets

... Few-NERD is a large-scale, fine-grained manually annotated named entity recognition dataset, which contains 8 coarse-grained types, 66 fine-grained types, 188,200 sentences, 491,711 entities, and 4,601,223 tokens.

Hugging Face

huggingface.co › dslim › bert-base-NER

dslim/bert-base-NER · Hugging Face

Specifically, this model is a bert-base-cased model that was fine-tuned on the English version of the standard CoNLL-2003 Named Entity Recognition dataset.
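
For orientation, CoNLL-2003 data is usually distributed as plain text with one token per line, four whitespace-separated columns (word, part-of-speech tag, chunk tag, NER tag), blank lines between sentences, and -DOCSTART- lines between documents. The sketch below reads that layout with the standard library only. The SAMPLE text and the read_conll helper are illustrative assumptions, and the tagging scheme (IOB1 or IOB2) can differ between distributions of the data.

```python
# Illustrative sample in the four-column layout: word, POS, chunk, NER tag.
SAMPLE = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""


def read_conll(text):
    """Parse CoNLL-style text into sentences of (word, ner_tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        # Blank lines end a sentence; -DOCSTART- lines mark document breaks.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        # First column is the word, last column is the NER tag.
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


sentences = read_conll(SAMPLE)
for sentence in sentences:
    print(sentence)
```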