Try https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2 (the smaller option) or https://huggingface.co/openai/whisper-large-v3-turbo . Run the Parakeet model with the NeMo library; for the latter, use something like https://github.com/SYSTRAN/faster-whisper or https://github.com/ggml-org/whisper.cpp . (Answer from Few-Welcome3297 on reddit.com)
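A minimal sketch of both routes suggested in that answer. The package and checkpoint names come from the linked repos; the exact return shapes (NeMo hypothesis objects, faster-whisper segment objects) are assumptions based on their documentation, so treat this as a starting point rather than a definitive implementation:

```python
def transcribe_with_parakeet(audio_path):
    # Requires: pip install -U "nemo_toolkit[asr]"
    # Downloads nvidia/parakeet-tdt-0.6b-v2 on first use.
    import nemo.collections.asr as nemo_asr

    model = nemo_asr.models.ASRModel.from_pretrained(
        model_name="nvidia/parakeet-tdt-0.6b-v2"
    )
    # Recent NeMo versions return hypothesis objects with a .text field.
    return model.transcribe([audio_path])[0].text


def transcribe_with_faster_whisper(audio_path):
    # Requires: pip install faster-whisper
    from faster_whisper import WhisperModel

    model = WhisperModel("large-v3", compute_type="int8")
    segments, _info = model.transcribe(audio_path)
    return " ".join(seg.text.strip() for seg in segments)
```

Both functions take a path to an audio file and return the transcript as a string; imports are kept inside the functions so you only need the library for the route you actually use.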
🌐
Hugging Face
huggingface.co › models
Text-to-Speech Models – Hugging Face
A listing of trending Text-to-Speech models, each entry showing the task tag, parameter count where available (e.g. 3B, 5B), last-update time, download count, and like count.
🌐
Hugging Face
huggingface.co › models
Automatic Speech Recognition Models – Hugging Face
Text-to-Speech · Text-to-Audio · Automatic Speech Recognition · Audio-to-Audio · Audio Classification · Voice Activity Detection · Tabular Classification · Tabular Regression · Time Series Forecasting · Reinforcement Learning ·
🌐
Hugging Face
huggingface.co › docs › transformers › model_doc › speech_to_text
Speech2Text
Check out the from_pretrained() method to load the model weights. The bare Speech2Text model outputs raw hidden-states without any specific head on top.
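A short sketch of that loading pattern, assuming the facebook/s2t-small-librispeech-asr checkpoint referenced in the Speech2Text docs; the audio array and sampling rate are placeholders you supply:

```python
def transcribe_s2t(audio_array, sampling_rate=16000):
    # Requires transformers + torch; downloads the checkpoint on first use.
    from transformers import (
        Speech2TextForConditionalGeneration,
        Speech2TextProcessor,
    )

    name = "facebook/s2t-small-librispeech-asr"
    processor = Speech2TextProcessor.from_pretrained(name)
    model = Speech2TextForConditionalGeneration.from_pretrained(name)

    # The processor turns raw audio into log-mel input features.
    inputs = processor(audio_array, sampling_rate=sampling_rate, return_tensors="pt")
    generated_ids = model.generate(
        inputs["input_features"], attention_mask=inputs["attention_mask"]
    )
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

The conditional-generation class adds the language-model head on top of the bare model, which is what you want for actual transcription.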
🌐
Reddit
reddit.com › r/localllama › 🎧 listen and compare 12 open-source text-to-speech models (hugging face space)
r/LocalLLaMA on Reddit: 🎧 Listen and Compare 12 Open-Source Text-to-Speech Models (Hugging Face Space)
July 6, 2025 -

Hey everyone!

We have been exploring various open-source Text-to-Speech (TTS) models, and decided to create a Hugging Face demo space that makes it easy to compare their quality side-by-side.

The demo features 12 popular TTS models, all tested using a consistent prompt, so you can quickly hear and compare their synthesized speech and choose the best one for your audio projects.

Would love to get feedback or suggestions!

👉 Check out the demo space and detailed comparison here!

👉 Check out the blog: Choosing the Right Text-to-Speech Model: Part 2

Share your use-case and we will update this space as required!

Which TTS model sounds most natural to you?

Cheers!

🌐
Hugging Face
huggingface.co › docs › transformers › en › tasks › text-to-speech
Text to speech
In our experience, obtaining satisfactory results from this model can be challenging. The quality of the speaker embeddings appears to be a significant factor. Since SpeechT5 was pre-trained with English x-vectors, it performs best when using English speaker embeddings.
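A hedged sketch of that setup: the checkpoint names come from the SpeechT5 docs, and the speaker embedding is assumed to be a (1, 512) English x-vector tensor, e.g. one row of the Matthijs/cmu-arctic-xvectors dataset:

```python
def speecht5_tts(text, speaker_embedding):
    # Requires transformers + torch. speaker_embedding should be a
    # torch.FloatTensor of shape (1, 512) holding an English x-vector,
    # per the observation that SpeechT5 was pre-trained on English x-vectors.
    from transformers import (
        SpeechT5ForTextToSpeech,
        SpeechT5HifiGan,
        SpeechT5Processor,
    )

    processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
    model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
    vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

    inputs = processor(text=text, return_tensors="pt")
    # Returns a 1-D waveform tensor at 16 kHz.
    return model.generate_speech(
        inputs["input_ids"], speaker_embedding, vocoder=vocoder
    )
```

Swapping in a non-English or low-quality speaker embedding is exactly where, per the quote above, results tend to degrade.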
🌐
Hugging Face
huggingface.co › learn › audio-course › en › chapter6 › pre-trained_models
Pre-trained models for text-to-speech - Hugging Face Audio Course
You can’t take a fine-tuned ASR model and swap out the pre-nets and post-net to get a working TTS model, for example. SpeechT5 is flexible, but not that flexible ;) Let’s see what are the pre- and post-nets that SpeechT5 uses for the TTS task specifically: Text encoder pre-net: A text embedding layer that maps text tokens to the hidden representations that the encoder expects.
🌐
Hugging Face
huggingface.co › collections › SamuraiBarbi › speech-to-text-models
Speech to Text Models - a SamuraiBarbi Collection
Unlock the magic of AI with handpicked models, awesome datasets, papers, and mind-blowing Spaces from SamuraiBarbi
🌐
Modal
modal.com › blog › open-source-tts
The Top Open-Source Text to Speech (TTS) Models
This article explores the top open-source TTS models, based on Hugging Face’s trending models and insights from our developer community.
🌐
Hugging Face
discuss.huggingface.co › models
Real-Time Text-to-Speech Model - Models - Hugging Face Forums
January 4, 2025 - Greetings everyone, I’m currently looking for a real-time TTS model that can create audio as soon as I type. Kindly guide me in this regard.
🌐
Medium
medium.com › latinxinai › heres-to-the-crazy-ones-the-misfits-45f2132623c7
Here’s to the crazy ones, the misfits: Automatic Speech Recognition with PyTorch & Hugging Face
April 17, 2024 - One of the first things I noticed when I checked out the Transformers page was the ability to convert audio into text, demonstrated by a 60-second audio extract from one of the most inspiring speeches ever: the 1963 “I have a dream” speech from Martin Luther King.

from transformers import pipeline

transcriber = pipeline(task="automatic-speech-recognition", model="openai/whisper-small")
transcription_results = transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")
print(transcription_results)
🌐
Hugging Face
huggingface.co › collections › CIMAI › speech-to-text-models
Speech-to-Text Models - a CIMAI Collection
Leaderboard: https://huggingface.co/spaces/hf-audio/open_asr_leaderboard · Note See benchmark scores here: https://mistral.ai/news/voxtral
🌐
Hugging Face
huggingface.co › learn › audio-course › en › chapter5 › asr_models
Pre-trained models for automatic speech recognition - Hugging Face Audio Course
Todos nosotros nos descarriamos como bejas, cada cual se apartó por su camino,'}, {'timestamp': (26.4, 32.48), 'text': ' mas Jehová cargó en él el pecado de todos nosotros. No es que partas tu pan con el'}, {'timestamp': (32.48, 38.4), 'text': ' hambriento y a los hombres herrantes metas en casa, que cuando vieres al desnudo lo cubras y no'}, ... And voila! We have our predicted text as well as corresponding timestamps. Whisper is a strong pre-trained model for speech recognition and translation.
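The timestamped chunks shown above come from running the ASR pipeline with return_timestamps enabled. A sketch of that call, plus a small helper to render the chunks; the model name and output format follow the Transformers pipeline documentation:

```python
def format_chunks(chunks):
    # Render pipeline chunks like [{"timestamp": (26.4, 32.48), "text": "..."}]
    # as "start-end: text" lines.
    lines = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        lines.append(f"{start:.2f}-{end:.2f}: {chunk['text'].strip()}")
    return "\n".join(lines)


def transcribe_with_timestamps(audio):
    # Requires transformers + torch; downloads openai/whisper-small on first use.
    from transformers import pipeline

    asr = pipeline(
        task="automatic-speech-recognition",
        model="openai/whisper-small",
        return_timestamps=True,
    )
    # With return_timestamps=True the result dict gains a "chunks" key.
    return asr(audio)["chunks"]
```

For example, format_chunks([{"timestamp": (26.4, 32.48), "text": " mas Jehová cargó..."}]) yields a line starting with "26.40-32.48:".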
🌐
Hugging Face
huggingface.co › docs › transformers › en › tasks › asr
Automatic speech recognition
Fine-tune Wav2Vec2 on the MInDS-14 dataset to transcribe audio to text. Use your fine-tuned model for inference. To see all architectures and checkpoints compatible with this task, we recommend checking the task-page · Before you begin, make sure you have all the necessary libraries installed: ... We encourage you to login to your Hugging Face account so you can upload and share your model with the community.
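Once fine-tuning is done, inference can be a single pipeline call. A sketch; the model id below is a hypothetical placeholder for your own fine-tuned checkpoint on the Hub:

```python
def transcribe_wav2vec2(audio_array, sampling_rate=16000,
                        model_id="your-username/wav2vec2-minds14"):
    # model_id is a placeholder; substitute the repo id of the checkpoint
    # you pushed after fine-tuning. Requires transformers + torch.
    from transformers import pipeline

    asr = pipeline(task="automatic-speech-recognition", model=model_id)
    result = asr({"array": audio_array, "sampling_rate": sampling_rate})
    return result["text"]
```

Wav2Vec2 checkpoints use CTC decoding under the hood, so the pipeline returns plain text with no timestamps unless you ask for them.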
🌐
Hugging Face
huggingface.co › fractalego › personal-speech-to-text-model
fractalego/personal-speech-to-text-model · Hugging Face
YAML Metadata Warning: empty or missing yaml metadata in repo card (https://huggingface.co/docs/hub/model-cards#model-card-metadata)
🌐
GitHub
github.com › huggingface › parler-tts
GitHub - huggingface/parler-tts: Inference and training library for high-quality TTS models.
Inference and training library for high-quality TTS models. - huggingface/parler-tts
Starred by 5.5K users
Forked by 582 users
Languages   Python
🌐
GitHub
github.com › huggingface › speech-to-speech
GitHub - huggingface/speech-to-speech: Speech To Speech: an effort for an open-sourced and modular GPT4-o
The pipeline provides a fully open and modular approach, with a focus on leveraging models available through the Transformers library on the Hugging Face hub.
Starred by 4.3K users
Forked by 485 users
Languages   Python 99.7% | Dockerfile 0.3%
🌐
Hugging Face
huggingface.co › blog › arena-tts
TTS Arena: Benchmarking Text-to-Speech Models in the Wild
Just submit some text, listen to two different models speak it out, and vote on which model you think is the best. The results will be organized into a leaderboard that displays the community’s highest-rated models. The field of speech synthesis has long lacked an accurate method to measure ...
🌐
Hugging Face
discuss.huggingface.co › 🤗hub
Which hugging face llm is best for voice recognition - 🤗Hub - Hugging Face Forums
February 29, 2024 - How do I fine-tune a Hugging Face LLM for a voice recognition project? Which model is best?
🌐
Modal
modal.com › blog › open-source-stt
The Top Open Source Speech-to-Text (STT) Models in 2025
Original Canary-1B (April 2024): This multilingual model supports English, German, French, and Spanish with bidirectional translation capabilities. It was trained on 85,000 hours of speech data and achieved a 6.67% word error rate on the HuggingFace ...