Try https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2 (smaller option) or https://huggingface.co/openai/whisper-large-v3-turbo. Run the Parakeet model using the NeMo library, and use something like https://github.com/SYSTRAN/faster-whisper or https://github.com/ggml-org/whisper.cpp for the Whisper model. (Answer from Few-Welcome3297 on reddit.com.)
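A minimal sketch of the two suggested routes, assuming the nemo_toolkit[asr] and faster-whisper packages are installed and a local audio.wav file exists; exact output formats vary by library version:

    # Parakeet via NVIDIA NeMo (usage pattern from the model card)
    import nemo.collections.asr as nemo_asr

    asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/parakeet-tdt-0.6b-v2")
    results = asr_model.transcribe(["audio.wav"])
    # recent NeMo returns Hypothesis objects; older versions return plain strings
    print(results[0].text if hasattr(results[0], "text") else results[0])

    # Whisper via faster-whisper (CTranslate2 backend)
    from faster_whisper import WhisperModel

    # "large-v3-turbo" assumes a faster-whisper release that ships this size; fall back to "large-v3" otherwise
    # compute_type="float16" targets GPU; use "int8" on CPU
    model = WhisperModel("large-v3-turbo", device="auto", compute_type="float16")
    segments, info = model.transcribe("audio.wav", beam_size=5)
    print("Detected language:", info.language)
    for segment in segments:
        print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")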
Hugging Face · huggingface.co › models
Automatic Speech Recognition Models – Hugging Face
Text-to-Speech · Text-to-Audio · Automatic Speech Recognition · Audio-to-Audio · Audio Classification · Voice Activity Detection · Tabular · Tabular Classification · Tabular Regression · Time Series Forecasting · Reinforcement Learning
Hugging Face · huggingface.co › docs › transformers › en › model_doc › speech_to_text
Speech2Text
The Speech2Text model was proposed in fairseq S2T: Fast Speech-to-Text Modeling with fairseq by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino. It’s a transformer-based seq2seq (encoder-decoder) model designed for end-to-end ...
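For reference, a short transcription sketch with this model family following the usual Transformers pattern; the facebook/s2t-small-librispeech-asr checkpoint and the dummy LibriSpeech split are illustrative choices, and sentencepiece plus torchaudio need to be installed:

    from datasets import load_dataset
    from transformers import Speech2TextForConditionalGeneration, Speech2TextProcessor

    model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-small-librispeech-asr")
    processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr")

    # tiny LibriSpeech sample commonly used in Transformers docs and tests
    ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
    inputs = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt")

    generated_ids = model.generate(inputs["input_features"], attention_mask=inputs["attention_mask"])
    print(processor.batch_decode(generated_ids, skip_special_tokens=True))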
Discussions

What's the Best Speech-to-Text Model Right Now?
r/LocalLLaMA · September 13, 2025
Looking to run local speech to text model.
Whisper large is great....
r/huggingface · November 19, 2023
🎧 Listen and Compare 12 Open-Source Text-to-Speech Models (Hugging Face Space)
Nice to have it all in one place. It'd be even nicer to have an apples-to-apples comparison, i.e. all female or all male voices, instead of mixed like it is now. Maybe both? The CSM example sounds like it's full of artifacts, just like F5-TTS, and both were highlighted for speech quality. Maybe something went wrong during generation? At least Sesame can sound way better. The Llasa sample seems slightly broken; maybe that's a hint this happens more often? Same with the background noise for MegaTTS3. Orpheus was probably standing in a large room during the generation 😉.
r/LocalLLaMA · July 6, 2025
Improved Text to Speech model: Parler TTS v1 by Hugging Face
Where can I find the full list of the 34 voice names, and do you have quick audio samples for them to get an idea of each one?
r/LocalLLaMA · August 8, 2024
People also ask

What models can I use for Text-to-Speech?
The KittenML/kitten-tts-nano-0.1, ResembleAI/chatterbox, fishaudio/fish-speech-1.5, and nari-labs/Dia-1.6B-0626 models can be used for Text-to-Speech.
Source: What is Text-to-Speech? - Hugging Face (huggingface.co › tasks › text-to-speech)
What is Text-to-Speech?
Text-to-Speech (TTS) is the task of generating natural sounding speech given text input. TTS models can be extended to have a single model that generates speech for multiple speakers and multiple languages.
Source: What is Text-to-Speech? - Hugging Face (huggingface.co › tasks › text-to-speech)
What models can I use for Automatic Speech Recognition?
The openai/whisper-large-v3, facebook/w2v-bert-2.0, facebook/seamless-m4t-v2-large, nvidia/canary-1b, and pyannote/speaker-diarization-3.1 models can be used for Automatic Speech Recognition.
Source: What is Automatic Speech Recognition? - Hugging Face (huggingface.co › tasks › automatic-speech-recognition)
Hugging Face · huggingface.co › openai › whisper-large-v3
openai/whisper-large-v3 · Hugging Face
Our studies show that, over many existing ASR systems, the models exhibit improved robustness to accents, background noise, and technical language, as well as zero-shot translation from multiple languages into English, and that accuracy on speech recognition and translation is near the state-of-the-art level. However, because the models are trained in a weakly supervised manner using large-scale noisy data, the predictions may include texts that are not actually spoken in the audio input (i.e. hallucination).
Medium · medium.com › latinxinai › heres-to-the-crazy-ones-the-misfits-45f2132623c7
Here’s to the crazy ones, the misfits: Automatic Speech Recognition with PyTorch & Hugging Face
April 17, 2024 - One of the first things I noticed when I checked out the Transformers page was the ability to convert audio into text, demonstrated by a 60-second audio extract from one of the most inspiring speeches ever: the 1963 “I have a dream” speech from Martin Luther King.

    from transformers import pipeline
    transcriber = pipeline(task="automatic-speech-recognition", model="openai/whisper-small")
    transcription_results = transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")
    print(transcription_results)
Hugging Face · huggingface.co › tasks › automatic-speech-recognition
What is Automatic Speech Recognition? - Hugging Face
Notes from the task page: an end-to-end model that performs ASR and speech translation by Meta AI; a powerful multilingual ASR and speech translation model by Nvidia; a powerful speaker diarization model; plus 1,679 browsable ASR datasets, including one with 31,175 hours of multilingual audio-text data in 108 languages.
Find elsewhere
Hugging Face · huggingface.co › models
Text-to-Speech Models – Hugging Face
GitHub · github.com › huggingface › speech-to-speech
GitHub - huggingface/speech-to-speech: Speech To Speech: an effort for an open-sourced and modular GPT4-o
The pipeline provides a fully open and modular approach, with a focus on leveraging models available through the Transformers library on the Hugging Face hub.
4.2K stars · 485 forks · Python 99.7% | Dockerfile 0.3%
KDnuggets · kdnuggets.com › use-hugging-face-transformers-text-to-speech-applications
How to Use Hugging Face Transformers for Text-to-Speech Applications - KDnuggets
October 24, 2024 - Hugging Face provides a variety of pre-trained models that can turn text into speech. For TTS applications, you can use models like Tacotron2 or FastSpeech2. These models have been trained to convert text into human-like speech.
Hugging Face · huggingface.co › tasks › text-to-speech
What is Text-to-Speech? - Hugging Face
The Hub contains over 1500 TTS models that you can use right away by trying out the widgets directly in the browser or calling the models as a service using Inference Endpoints. Here is a simple code snippet to get you started: import json import ...
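The snippet above is cut off; as a rough local alternative (not the page's own example), a text-to-speech pipeline call might look like the sketch below, with suno/bark-small chosen only as an illustrative checkpoint:

    import scipy.io.wavfile as wavfile
    from transformers import pipeline

    # "text-to-speech" is an alias of the text-to-audio pipeline in recent Transformers releases
    tts = pipeline("text-to-speech", model="suno/bark-small")
    out = tts("Hugging Face hosts well over a thousand text-to-speech models.")

    # the pipeline returns a dict with a waveform and its sampling rate
    wavfile.write("tts_sample.wav", rate=out["sampling_rate"], data=out["audio"].squeeze())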
Hugging Face · huggingface.co › docs › transformers › en › tasks › text-to-speech
Text to speech
Text-to-speech (TTS) is the task ... text-to-speech models are currently available in 🤗 Transformers, such as Dia, CSM, Bark, MMS, VITS and SpeechT5....
Hugging Face · huggingface.co › learn › audio-course › en › chapter6 › pre-trained_models
Pre-trained models for text-to-speech - Hugging Face Audio Course
Just like any other Transformer, the encoder-decoder network models a sequence-to-sequence transformation using hidden representations. This Transformer backbone is the same for all tasks SpeechT5 supports.
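A condensed version of the SpeechT5 text-to-speech recipe from the course and docs; the speaker embeddings come from the CMU Arctic x-vector dataset, and index 7306 is just an arbitrary voice:

    import soundfile as sf
    import torch
    from datasets import load_dataset
    from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

    processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
    model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
    vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

    inputs = processor(text="SpeechT5 handles speech and text with one shared Transformer backbone.", return_tensors="pt")

    # x-vector speaker embeddings; pick any row to select a voice
    embeddings = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
    speaker_embedding = torch.tensor(embeddings[7306]["xvector"]).unsqueeze(0)

    speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)
    sf.write("speecht5_demo.wav", speech.numpy(), samplerate=16000)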
GitHub · github.com › huggingface › parler-tts
GitHub - huggingface/parler-tts: Inference and training library for high-quality TTS models.
August 8, 2024 - Inference and training library for high-quality TTS models. - huggingface/parler-tts
5.5K stars · 582 forks · Python
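Inference with the parler-tts library roughly follows the pattern in its README; in this sketch the parler-tts/parler-tts-mini-v1 checkpoint and the voice description are examples, and soundfile is assumed for writing the WAV:

    import soundfile as sf
    import torch
    from parler_tts import ParlerTTSForConditionalGeneration
    from transformers import AutoTokenizer

    device = "cuda:0" if torch.cuda.is_available() else "cpu"

    model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1").to(device)
    tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")

    prompt = "Where can I find the full list of the 34 voice names?"
    description = "A calm female speaker with a clear voice, recorded in a quiet room."

    # Parler-TTS conditions generation on a text description of the target voice
    input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
    prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

    audio = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
    sf.write("parler_demo.wav", audio.cpu().numpy().squeeze(), model.config.sampling_rate)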
Hugging Face · huggingface.co › docs › transformers › en › model_doc › speech_to_text_2
Speech2Text2
Speech2Text2 is a decoder-only transformer model that can be used with any speech encoder-only, such as Wav2Vec2 or HuBERT for Speech-to-Text tasks.
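In Transformers this encoder/decoder pairing is exposed through SpeechEncoderDecoderModel; a rough sketch using the facebook/s2t-wav2vec2-large-en-de speech-translation checkpoint (chosen here only as an illustration) might look like:

    from datasets import load_dataset
    from transformers import Speech2Text2Processor, SpeechEncoderDecoderModel

    # Wav2Vec2 encoder + Speech2Text2 decoder, here for English speech -> German text
    model = SpeechEncoderDecoderModel.from_pretrained("facebook/s2t-wav2vec2-large-en-de")
    processor = Speech2Text2Processor.from_pretrained("facebook/s2t-wav2vec2-large-en-de")

    ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
    inputs = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt")

    generated_ids = model.generate(inputs["input_values"], attention_mask=inputs["attention_mask"])
    print(processor.batch_decode(generated_ids, skip_special_tokens=True))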
Hugging Face · huggingface.co › collections › SamuraiBarbi › speech-to-text-models
Speech to Text Models - a SamuraiBarbi Collection
Unlock the magic of AI with handpicked models, awesome datasets, papers, and mind-blowing Spaces from SamuraiBarbi
Hugging Face · huggingface.co › collections › unsloth › text-to-speech-tts-models
Text-to-Speech (TTS) models - a unsloth Collection
A collection of 4-bit, Dynamic 4-bit and 16-bit voice models including Sesame-CSM, OpenAI's Whisper, Orpheus. Fine-tune them with Unsloth now!
Hugging Face · huggingface.co › docs › transformers › en › tasks › asr
Automatic speech recognition
Automatic speech recognition (ASR) converts a speech signal to text, mapping a sequence of audio inputs to text outputs. Virtual assistants like Siri and Alexa use ASR models to help users every day, and there are many other useful user-facing ...
Reddit · reddit.com › r/huggingface › looking to run local speech to text model.
r/huggingface on Reddit: Looking to run local speech to text model.
November 19, 2023 -

I've been using the OpenAI API for speech to text. It works well, but the cost can start getting high. I have no experience running a local speech-to-text model. Can someone offer guidance on both:

  1. Best models

  2. How to host and run the models locally

Reddit · reddit.com › r/localllama › 🎧 listen and compare 12 open-source text-to-speech models (hugging face space)
r/LocalLLaMA on Reddit: 🎧 Listen and Compare 12 Open-Source Text-to-Speech Models (Hugging Face Space)
July 6, 2025 -

Hey everyone!

We have been exploring various open-source Text-to-Speech (TTS) models, and decided to create a Hugging Face demo space that makes it easy to compare their quality side-by-side.

The demo features 12 popular TTS models, all tested using a consistent prompt, so you can quickly hear and compare their synthesized speech and choose the best one for your audio projects.

Would love to get feedback or suggestions!

👉 Check out the demo space and detailed comparison here!

👉 Check out the blog: Choosing the Right Text-to-Speech Model: Part 2

Share your use-case and we will update this space as required!

Which TTS model sounds most natural to you?

Cheers!

Hugging Face · huggingface.co › models
Models – Hugging Face