As I understand it, these are the properties that you're seeking in a sample dataset:

  1. Text data
  2. It should be informal, i.e. contain typos and slang, and generally not be professionally edited
  3. Something other than Twitter (I don't blame you; Twitter is a useful yet heavily overused example data source in text mining)

Here are some recommendations:

  1. Emails from the SpamAssassin corpus -- note that both "ham" (non-spam) and spam datasets are available
  2. microblogPCU data set from UCI, which is data scraped from the microblogs of Sina Weibo users -- note, the raw text data is a mix of Chinese and English (you could perform machine translation of the Chinese, filter to only English, or use it as-is)
  3. Amazon Commerce reviews dataset from UCI
  4. Within the Bag of Words dataset, try using the Enron emails
  5. The Twenty Newsgroups dataset
  6. This nice collection of SMS spam
  7. You can always scrape (extract) your own text data from the Internet. I'm not sure which language or statistical package you're using, but XPath-based packages for this are available in both R (rvest, scrapeR, etc.) and Python
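
To make the last suggestion concrete, here is a minimal sketch of XPath-style extraction in Python. It uses only the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset and requires well-formed markup. The page snippet, the `post` class name, and the `extract_posts` helper are all made up for illustration. Real pages are rarely well-formed, so in practice you would probably fetch the HTML yourself and parse it with a more forgiving third-party package such as lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for a downloaded page.
PAGE = """<html><body>
<div class="post"><p>omg this phone is gr8, luv it!!</p></div>
<div class="ad"><p>Buy now</p></div>
<div class="post"><p>battery died after 2 days... not happy tbh</p></div>
</body></html>"""


def extract_posts(markup: str) -> list[str]:
    """Return the text of every <p> directly inside a <div class="post">."""
    root = ET.fromstring(markup)
    # ElementTree understands a small XPath subset, including
    # attribute predicates such as [@class='post'].
    paragraphs = root.findall(".//div[@class='post']/p")
    # itertext() also collects text nested in child tags, if any.
    return ["".join(p.itertext()).strip() for p in paragraphs]


posts = extract_posts(PAGE)
for text in posts:
    print(text)
```

The predicate filters out the `ad` block, so only the informal user-written posts are kept, which is the kind of text you are after.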
Answer from Hack-R on Stack Exchange
🌐
Kaggle
kaggle.com › datasets › naseralqaydeh › named-entity-recognition-ner-corpus
Named Entity Recognition (NER) Corpus
🌐
GitHub
github.com › juand-r › entity-recognition-datasets
GitHub - juand-r/entity-recognition-datasets: A collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.
🌐
Metatext
metatext.io › datasets-list › ner-task
+86 Ner Datasets - NLP Database
Dataset contains a total of 13.6M articles across several languages: English, Spanish, Italian, German, French and Arabic. The annotations include labeled text mentions mapping to entities (represented by their Freebase machine ids) as well as the type of entity.
🌐
Kaggle
kaggle.com › datasets › debasisdotcom › name-entity-recognition-ner-dataset
Name Entity Recognition (NER) Dataset
🌐
GitHub
github.com › davidsbatista › NER-datasets
GitHub - davidsbatista/NER-datasets: Datasets to train supervised classifiers for Named-Entity Recognition in different languages (Portuguese, German, Dutch, French, English)
🌐
arXiv
arxiv.org › abs › 2310.14282
[2310.14282] NERetrieve: Dataset for Next Generation Named Entity Recognition and Retrieval
October 22, 2023 - Paper by Uri Katz and 3 other authors.
🌐
ACL Anthology
aclanthology.org › 2021.acl-long.248
Few-NERD: A Few-shot Named Entity Recognition Dataset - ACL Anthology
The Few-NERD dataset and the baselines will be publicly available to facilitate the research on this problem. ... Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) ... Ning Ding, Guangwei Xu, Yulin Chen, Xiaobin Wang, Xu Han, Pengjun Xie, Haitao Zheng, and Zhiyuan Liu. 2021. Few-NERD: A Few-shot Named Entity Recognition Dataset.
🌐
Springer
link.springer.com › home › international journal of computational intelligence systems › article
Named Entity Recognition Datasets: A Classification Framework | International Journal of Computational Intelligence Systems
March 28, 2024 - In this thesis, we review the development of named entity recognition datasets over the years and describe them in terms of the language of the dataset, the domain of research, the type of entity, the granularity of the entity, and the annotation of the entity.
🌐
ScienceDirect
sciencedirect.com › science › article › pii › S2949719123000146
A survey on Named Entity Recognition — datasets, tools, and methodologies - ScienceDirect
May 26, 2023 - We examine the most relevant datasets, tools, and deep learning approaches like Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Bidirectional Long Short Term Memory, Transfer learning approaches, and numerous other approaches currently being used in present-day NER problem environments and their applications.
🌐
Defined
defined.ai › datasets › named-entity-recognition
Named Entity Recognition Dataset | Defined.ai
Discover our Named Entity Recognition Dataset, a broad collection featuring 150,000 sentences annotated across 24 named entity categories in ten languages, including Norwegian (Bokmål), Finnish, Turkish, Hindi, Arabic, Danish, Swedish, Hebrew, Russian, and Czech.
🌐
NLP-progress
nlpprogress.com › english › named_entity_recognition.html
Named entity recognition | NLP-progress
The NER dataset (of interest here) includes 18 tags, consisting of 11 types (PERSON, ORGANIZATION, etc) and 7 values (DATE, PERCENT, etc), and contains 2 million tokens. The common datasplit used in NER is defined in Pradhan et al 2013 and can be found here. Few-NERD is a large-scale, fine-grained manually annotated named entity recognition dataset, which contains 8 coarse-grained types, 66 fine-grained types, 188,200 sentences, 491,711 entities and 4,601,223 tokens.
🌐
ACL Anthology
aclanthology.org › P19-1510
NNE: A Dataset for Nested Named Entity Recognition in English Newswire - ACL Anthology
Named entity recognition (NER) is widely used in natural language processing applications and downstream tasks. However, most NER tools target flat annotation from popular datasets, eschewing the semantic information available in nested entity ...
🌐
ScienceDirect
sciencedirect.com › science › article › pii › S2352340924007881
FloraNER: A new dataset for species and morphological terms named entity recognition in French botanical text - ScienceDirect
August 10, 2024 - FloraNER comprises separate sub-datasets for the recognition of plant species names, as well as coarse-grained and fine-grained botanical morphological terms. The resulting datasets are in CSV format, displaying textual data, identified named entities, and their annotations, covering one named entity type “Species” (Espèce in French) for species name identification, two named entity types “Organ” and “Descriptor” for coarse-grained morphological term identification, and eight named entity types for fine-grained morphological term identification: Organ, Descriptor, Form, Color, Development, Structure, Surface, Position, Disposition, and Measure.
🌐
Papers with Code
paperswithcode.com › datasets
Papers with Code - Machine Learning Datasets
... Few-NERD is a large-scale, fine-grained manually annotated named entity recognition dataset, which contains 8 coarse-grained types, 66 fine-grained types, 188,200 sentences, 491,711 entities, and 4,601,223 tokens.
🌐
Hugging Face
huggingface.co › dslim › bert-base-NER
dslim/bert-base-NER · Hugging Face
Specifically, this model is a bert-base-cased model that was fine-tuned on the English version of the standard CoNLL-2003 Named Entity Recognition dataset.