Try https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2 (the smaller option) or https://huggingface.co/openai/whisper-large-v3-turbo. Run the Parakeet model with the NeMo library; for Whisper, use something like https://github.com/SYSTRAN/faster-whisper or https://github.com/ggml-org/whisper.cpp.
Answer from Few-Welcome3297 on reddit.com
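A minimal sketch of the Whisper route suggested in the answer above. It assumes the `faster-whisper` package is installed and that the installed release accepts the `large-v3-turbo` size name (worth checking against your version). The helper `format_segment`, the function `transcribe`, and the CPU/int8 settings are illustrative choices, not part of the original answer.

```python
from types import SimpleNamespace


def format_segment(segment) -> str:
    """Render one segment as '[start -> end] text' (times in seconds)."""
    return f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text.strip()}"


def transcribe(path: str, model_size: str = "large-v3-turbo") -> list:
    """Transcribe an audio file with faster-whisper and return formatted lines."""
    # Imported lazily so the formatting helper works without the package.
    from faster_whisper import WhisperModel

    # Assumption: CPU with int8 weights for modest hardware. On a GPU,
    # device="cuda" with compute_type="float16" is a common alternative.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(path, beam_size=5)
    # `segments` is a generator; decoding happens as it is consumed.
    return [format_segment(s) for s in segments]


# Stand-in segment with the attributes the formatter reads (no model needed).
demo = SimpleNamespace(start=0.0, end=2.5, text=" Hello world. ")
print(format_segment(demo))  # [0.00s -> 2.50s] Hello world.
```

Calling `transcribe("audio.wav")` (a hypothetical file name) downloads the model on first use. For the Parakeet option, the model card describes a comparable flow through NeMo's `ASRModel.from_pretrained`.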
🌐
Hugging Face
huggingface.co › models
Text-to-Speech Models – Hugging Face
🌐
Hugging Face
huggingface.co › docs › transformers › en › model_doc › speech_to_text
Speech2Text
Check out the from_pretrained() method to load the model weights. The bare Speech2Text model outputting raw hidden-states without any specific head on top.
🌐
Hugging Face
huggingface.co › models
Automatic Speech Recognition Models – Hugging Face
🌐
Hugging Face
huggingface.co › docs › transformers › en › tasks › text-to-speech
Text to speech
We encourage you to log in to your Hugging Face account to upload and share your model with the community. When prompted, enter your token to log in: ... VoxPopuli is a large-scale multilingual speech corpus consisting of data sourced from 2009-2020 European Parliament event recordings. It contains labelled audio-transcription data for 15 European languages. In this guide, we are using the Dutch language subset, feel free to pick another subset.
🌐
GitHub
github.com › huggingface › speech-to-speech
GitHub - huggingface/speech-to-speech: Speech To Speech: an effort for an open-sourced and modular GPT4-o
Starred by 4.3K users
Forked by 485 users
Languages   Python 99.7% | Dockerfile 0.3%
🌐
Hugging Face
huggingface.co › learn › audio-course › en › chapter6 › pre-trained_models
Pre-trained models for text-to-speech - Hugging Face Audio Course
Loading the vocoder is as easy as any other 🤗 Transformers model. ...

    from transformers import SpeechT5HifiGan

    vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

Now all you need to do is pass it as an argument when generating speech, and the outputs will be automatically converted to the speech waveform. ... Let’s listen to the result. The sample rate used by SpeechT5 is always 16 kHz. ... Feel free to play with the SpeechT5 text-to-speech demo, explore other voices, experiment with inputs.
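The snippet above describes passing the vocoder when generating speech and notes the fixed 16 kHz sample rate. The following is a rough sketch of that flow, not the course's own code. It assumes the `microsoft/speecht5_tts` checkpoint and a caller-supplied `speaker_embeddings` tensor of shape (1, 512); the helper names `save_wav` and `synthesize` are hypothetical.

```python
import struct
import wave

SPEECHT5_SAMPLE_RATE = 16_000  # the snippet states SpeechT5 always uses 16 kHz


def save_wav(samples, path: str, sample_rate: int = SPEECHT5_SAMPLE_RATE) -> None:
    """Write float samples in [-1, 1] to a 16-bit mono PCM WAV file."""
    pcm = bytearray()
    for x in samples:
        clipped = max(-1.0, min(1.0, float(x)))  # guard against overshoot
        pcm += struct.pack("<h", int(clipped * 32767))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes = 16-bit samples
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(pcm))


def synthesize(text: str, speaker_embeddings):
    """Generate a waveform with SpeechT5, passing the HiFi-GAN vocoder."""
    # Imported lazily so save_wav stays usable without transformers installed.
    from transformers import (
        SpeechT5ForTextToSpeech,
        SpeechT5HifiGan,
        SpeechT5Processor,
    )

    processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
    model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
    vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
    inputs = processor(text=text, return_tensors="pt")
    # With `vocoder=` supplied, the result is a waveform, not a spectrogram.
    return model.generate_speech(
        inputs["input_ids"], speaker_embeddings, vocoder=vocoder
    )
```

A possible use, given an embedding tensor `emb`, would be `save_wav(synthesize("Hello there", emb).numpy(), "speech.wav")`. Writing through the standard-library `wave` module avoids an extra audio dependency for this illustration.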
🌐
Hugging Face
huggingface.co › spaces
Spaces - Hugging Face
TTS demo for T5Gemma-TTS model · Generate expressive speech from text with emotion control · Chatterbox TTS supporting 23 languages · Free Text-To-Speech generator with Emotion control (OpenAI)
🌐
Hugging Face
huggingface.co › models
Models – Hugging Face
🌐
Hugging Face
huggingface.co › docs › transformers › en › model_doc › speech_to_text_2
Speech2Text2
Speech2Text2 is a decoder-only transformer model that can be used with any encoder-only speech model, such as Wav2Vec2 or HuBERT, for speech-to-text tasks.
🌐
Hugging Face
huggingface.co › learn › audio-course › en › chapter5 › asr_models
Pre-trained models for automatic speech recognition - Hugging Face Audio Course
Now that we know we can toggle between speech recognition and speech translation, we can pick our task depending on our needs. Either we recognise from audio in language X to text in the same language X (e.g. Spanish audio to Spanish text), or we translate from audio in any language X to text in English (e.g. Spanish audio to English text). To read more about how the "task" argument is used to control the properties of the generated text, refer to the model card for the Whisper base model.
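The toggle between recognition and translation described above can be sketched as follows. This is an illustration under assumptions: it uses the `transformers` ASR pipeline with the `openai/whisper-base` checkpoint and forwards the task through `generate_kwargs`; the helper names `whisper_generate_kwargs` and `run_whisper` are hypothetical.

```python
from typing import Optional


def whisper_generate_kwargs(task: str, language: Optional[str] = None) -> dict:
    """Build generation arguments for Whisper's two tasks.

    "transcribe": audio in language X -> text in language X.
    "translate":  audio in language X -> text in English.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unsupported task: {task!r}")
    kwargs = {"task": task}
    if language is not None:
        # Source language of the audio, e.g. "spanish"; optional because
        # Whisper can detect it when it is not given.
        kwargs["language"] = language
    return kwargs


def run_whisper(audio_path: str, task: str, language: Optional[str] = None) -> str:
    """Run Whisper on one file with the chosen task and return the text."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")
    result = asr(audio_path, generate_kwargs=whisper_generate_kwargs(task, language))
    return result["text"]


print(whisper_generate_kwargs("translate"))
```

For Spanish audio, `run_whisper("clip.wav", "transcribe", "spanish")` would aim for Spanish text, while `run_whisper("clip.wav", "translate")` would aim for English text; `clip.wav` is a placeholder file name.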
🌐
Reddit
reddit.com › r/localllama › 🎧 listen and compare 12 open-source text-to-speech models (hugging face space)
r/LocalLLaMA on Reddit: 🎧 Listen and Compare 12 Open-Source Text-to-Speech Models (Hugging Face Space)
July 6, 2025

Hey everyone!

We have been exploring various open-source Text-to-Speech (TTS) models, and decided to create a Hugging Face demo space that makes it easy to compare their quality side-by-side.

The demo features 12 popular TTS models, all tested using a consistent prompt, so you can quickly hear and compare their synthesized speech and choose the best one for your audio projects.

Would love to get feedback or suggestions!

👉 Check out the demo space and detailed comparison here!

👉 Check out the blog: Choosing the Right Text-to-Speech Model: Part 2

Share your use-case and we will update this space as required!

Which TTS model sounds most natural to you?

Cheers!

🌐
Medium
medium.com › latinxinai › heres-to-the-crazy-ones-the-misfits-45f2132623c7
Here’s to the crazy ones, the misfits: Automatic Speech Recognition with PyTorch & Hugging Face
April 17, 2024 - One of the first things I noticed when I checked out the Transformers page was the ability to convert audio into text, demonstrated by a 60-second audio extract from one of the most inspiring speeches ever: the 1963 “I have a dream” speech from Martin Luther King.

    from transformers import pipeline

    transcriber = pipeline(task="automatic-speech-recognition", model="openai/whisper-small")
    transcription_results = transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")
    print(transcription_results)
🌐
GitHub
github.com › huggingface › parler-tts
GitHub - huggingface/parler-tts: Inference and training library for high-quality TTS models.
It is a reproduction of work from ... and Edinburgh University respectively. Contrary to other TTS models, Parler-TTS is a fully open-source release....
Starred by 5.5K users
Forked by 582 users
Languages   Python
🌐
Hugging Face
huggingface.co › tasks › text-to-speech
What is Text-to-Speech? - Hugging Face
    from transformers import pipeline

    synthesizer = pipeline("text-to-speech", "suno/bark")
    synthesizer("Look I am generating speech in three lines of code!")

You can use huggingface.js to run inference with text-to-speech models on the Hugging Face Hub.
🌐
Hugging Face
huggingface.co › docs › transformers › en › tasks › asr
Automatic speech recognition
Fine-tune Wav2Vec2 on the MInDS-14 dataset to transcribe audio to text. Use your fine-tuned model for inference. To see all architectures and checkpoints compatible with this task, we recommend checking the task page. Before you begin, make sure you have all the necessary libraries installed: ... We encourage you to log in to your Hugging Face account so you can upload and share your model with the community.
🌐
Hugging Face
huggingface.co › collections › SamuraiBarbi › speech-to-text-models
Speech to Text Models - a SamuraiBarbi Collection
Unlock the magic of AI with handpicked models, awesome datasets, papers, and mind-blowing Spaces from SamuraiBarbi
🌐
Hugging Face
huggingface.co › Nithu › text-to-speech
Nithu/text-to-speech · Hugging Face
Text-to-Speech · Fairseq · ljspeech · English · audio · arxiv: 2006.04558 · arxiv: 2109.06912 · FastSpeech 2 text-to-speech model from fairseq S^2 (paper/code): ...
🌐
Hugging Face
huggingface.co › tasks › automatic-speech-recognition
What is Automatic Speech Recognition? - Hugging Face
The following detailed blog post shows how to fine-tune a pre-trained Whisper checkpoint on labeled data for ASR. With the right data and strategy, you can fine-tune a high-performing model on a free Google Colab instance too.
🌐
Hugging Face
huggingface.co › collections › CIMAI › speech-to-text-models
Speech-to-Text Models - a CIMAI Collection
Speech-to-Text Models · updated 10 days ago · Leaderboard: https://huggingface.co/spaces/hf-audio/open_asr_leaderboard · Note: See benchmark scores here: https://mistral.ai/news/voxtral