As I understand it, these are the properties that you're seeking in a sample dataset:
- Text data
- It should be informal, i.e. contain typos and slang and, basically, not be professionally edited
- Something other than Twitter (I don't blame you, Twitter is a useful yet way overused example data source in text mining)
Here are some recommendations:
- Emails from the SpamAssassin corpus -- note that both "ham" (non-spam) and spam datasets are available
- microblogPCU data set from UCI, which is data scraped from the microblogs of Sina Weibo users -- note, the raw text data is a mix of Chinese and English (you could perform machine translation of the Chinese, filter to only English, or use it as-is)
- Amazon Commerce reviews dataset from UCI
- Within the bag-o-words dataset, try using the Enron emails
- The Twenty Newsgroups dataset
- This nice collection of SMS spam
- You can always scrape (extract) your own text data from the Internet; I'm not sure which language or statistical package you're using, but XPath-based packages are available in R (rvest, scrapeR, etc.) and Python to accomplish this
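To make the last suggestion a bit more concrete, here is a minimal sketch of XPath-based extraction in Python. It is only an illustration under some assumptions: the page content, the `post` class name, and the `extract_posts` helper are all made up for the example, and it uses the standard library's `xml.etree.ElementTree`, which supports only a limited XPath subset and needs well-formed markup. Real web pages are often not well-formed, so in practice you would probably download the page first and parse it with a more forgiving third-party parser such as lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML standing in for a downloaded forum page.
PAGE = """<html>
  <body>
    <div class="post"><p>omg this phone is gr8, battery lasts 4ever</p></div>
    <div class="ad"><p>Buy now!</p></div>
    <div class="post"><p>dont buy it, screen brokee after 2 days lol</p></div>
  </body>
</html>"""


def extract_posts(xhtml: str) -> list[str]:
    """Return the text of every <p> directly inside a <div class="post">."""
    root = ET.fromstring(xhtml)
    # ElementTree understands a small XPath subset, including [@attr='value'].
    paragraphs = root.findall(".//div[@class='post']/p")
    # Skip empty paragraphs and trim surrounding whitespace.
    return [p.text.strip() for p in paragraphs if p.text]


posts = extract_posts(PAGE)
for post in posts:
    print(post)
```

The XPath expression selects only the user-written posts and skips the ad block, which is usually the main job when building an informal-text corpus from scraped pages.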
Kaggle
kaggle.com › datasets › naseralqaydeh › named-entity-recognition-ner-corpus
Named Entity Recognition (NER) Corpus
GitHub
github.com › juand-r › entity-recognition-datasets
GitHub - juand-r/entity-recognition-datasets: A collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.
Metatext
metatext.io › datasets-list › ner-task
+86 Ner Datasets - NLP Database
Dataset contains a total of 13.6M articles across several languages: English, Spanish, Italian, German, French and Arabic. The annotations include labeled text mentions mapping to entities (represented by their Freebase machine ids) as well as the type of entity.
Kaggle
kaggle.com › datasets › debasisdotcom › name-entity-recognition-ner-dataset
Name Entity Recognition (NER) Dataset
GitHub
github.com › davidsbatista › NER-datasets
GitHub - davidsbatista/NER-datasets: Datasets to train supervised classifiers for Named-Entity Recognition in different languages (Portuguese, German, Dutch, French, English)
arXiv
arxiv.org › abs › 2310.14282
[2310.14282] NERetrieve: Dataset for Next Generation Named Entity Recognition and Retrieval
October 22, 2023 - By Uri Katz and 3 other authors
ACL Anthology
aclanthology.org › 2021.acl-long.248
Few-NERD: A Few-shot Named Entity Recognition Dataset - ACL Anthology
The Few-NERD dataset and the baselines will be publicly available to facilitate the research on this problem. ... Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) ... Ning Ding, Guangwei Xu, Yulin Chen, Xiaobin Wang, Xu Han, Pengjun Xie, Haitao Zheng, and Zhiyuan Liu. 2021. Few-NERD: A Few-shot Named Entity Recognition Dataset.
Top answer 1 of 3
6
2 of 3
3
Check these:
Repository of Test Domains for Information Extraction: http://www.isi.edu/info-agents/RISE/repository.html
DBpedia: http://wiki.dbpedia.org/Downloads32 (mirror)
Updated links:
http://www.isi.edu/integration/RISE/
https://github.com/dbpedia/extraction-framework/wiki/The-DBpedia-Data-Set
ScienceDirect
sciencedirect.com › science › article › pii › S2949719123000146
A survey on Named Entity Recognition — datasets, tools, and methodologies - ScienceDirect
May 26, 2023 - We examine the most relevant datasets, tools, and deep learning approaches like Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Bidirectional Long Short Term Memory, Transfer learning approaches, and numerous other approaches currently being used in present-day NER problem environments and their applications.
Aksw
svn.aksw.org › papers › 2014 › LREC_N3NIFNERNED › public.pdf
N3 - A Collection of Datasets for Named Entity Recognition and
Defined
defined.ai › datasets › named-entity-recognition
Named Entity Recognition Dataset | Defined.ai
Discover our Named Entity Recognition Dataset, a broad collection featuring 150,000 sentences annotated across 24 named entity categories in ten languages, including Norwegian (Bokmål), Finnish, Turkish, Hindi, Arabic, Danish, Swedish, Hebrew, Russian, and Czech.
NLP-progress
nlpprogress.com › english › named_entity_recognition.html
Named entity recognition | NLP-progress
The NER dataset (of interest here) includes 18 tags, consisting of 11 types (PERSON, ORGANIZATION, etc) and 7 values (DATE, PERCENT, etc), and contains 2 million tokens. The common datasplit used in NER is defined in Pradhan et al 2013 and can be found here. Few-NERD is a large-scale, fine-grained manually annotated named entity recognition dataset, which contains 8 coarse-grained types, 66 fine-grained types, 188,200 sentences, 491,711 entities and 4,601,223 tokens.
ScienceDirect
sciencedirect.com › science › article › pii › S2352340924007881
FloraNER: A new dataset for species and morphological terms named entity recognition in French botanical text - ScienceDirect
August 10, 2024 - FloraNER comprises separate sub-datasets for the recognition of plant species names, as well as coarse-grained and fine-grained botanical morphological terms. The resulting datasets are in CSV format, displaying textual data, identified named entities, and their annotations, covering one named entity type “Species” (Espèce in French) for species name identification, two named entity types “Organ” and “Descriptor” for coarse-grained morphological term identification, and eight named entity types for fine-grained morphological term identification: Organ, Descriptor, Form, Color, Development, Structure, Surface, Position, Disposition, and Measure.
Papers with Code
paperswithcode.com › datasets
Papers with Code - Machine Learning Datasets
... Few-NERD is a large-scale, fine-grained manually annotated named entity recognition dataset, which contains 8 coarse-grained types, 66 fine-grained types, 188,200 sentences, 491,711 entities, and 4,601,223 tokens.