I am looking for the best speech-to-text / speech recognition models. Could anyone recommend any?
Hi all, I recently ran a benchmark comparing a bunch of speech-to-text APIs and models under real-world conditions such as background noise, non-native accents, and technical vocabulary.
It includes all the big players like Google, AWS, MS Azure, open source models like Whisper (small and large), speech recognition startups like AssemblyAI / Deepgram / Speechmatics, and newer LLM-based models like Gemini 2.0 Flash/Pro and GPT-4o. I've benchmarked the real time streaming versions of some of the APIs as well.
I mostly did this to decide the best API to use for an app I'm building but figured this might be helpful for other builders too. Would love to know what other cases would be useful to include too.
Link here: https://voicewriter.io/speech-recognition-leaderboard
TLDR if you don't want to click on the link: the best model right now seems to be GPT-4o-transcribe, followed by ElevenLabs, Whisper-large, and the Gemini models. All the startups and AWS/Microsoft are decent, with performance varying across situations. Google (the original, not Gemini) is extremely bad.
I know OpenAI recently released Whisper V3 Turbo, but I remember hearing about some other models that are a lot better. I just can't remember which ones.
OpenAI’s Whisper was released more than two years ago, and it seems that no other model has seriously challenged its position since then. While Whisper has received updates over time, its performance in languages other than English—such as Chinese—is not ideal for me. I’m looking for an alternative model to generate subtitles for videos and real-time subtitles for live streams.
I have also tried Alibaba’s FunASR, but it was also released more than a year ago and does not seem to offer satisfactory performance.
I am aware of some LLM-based speech models, but their hardware requirements are too high for my use case.
In other AI fields, new models are released almost every month, but there seems to be less attention on advancements in speech recognition. Are there any recent models worth looking into?
Hi, I'd like to share my findings on several speech-to-text API providers based on real-world testing.
GPT-4o Transcribe
- 25 MB file limit. Not practical for real-world use cases.
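A common workaround for a per-request size cap is to split long recordings into chunks before uploading. The sketch below only does the arithmetic: given an assumed constant bitrate, it estimates how many seconds of audio fit under the limit and returns chunk boundaries. The function names, the 64 kbps bitrate, and the 5% safety margin are illustrative assumptions, not part of any provider's API, and 25 MB is taken here as 25 * 1024 * 1024 bytes.

```python
import math


def max_chunk_seconds(limit_bytes: int, bitrate_kbps: float, safety: float = 0.95) -> int:
    """Longest chunk, in whole seconds, expected to stay under limit_bytes.

    Assumes constant-bitrate audio; `safety` leaves headroom for container overhead.
    """
    bytes_per_second = bitrate_kbps * 1000 / 8
    return int(limit_bytes * safety // bytes_per_second)


def chunk_boundaries(total_seconds: float, chunk_seconds: int) -> list[tuple[float, float]]:
    """Return (start, end) pairs, in seconds, that cover the whole recording in order."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    count = math.ceil(total_seconds / chunk_seconds)
    return [
        (i * chunk_seconds, min((i + 1) * chunk_seconds, total_seconds))
        for i in range(count)
    ]


# Hypothetical example: a 3-hour recording encoded at an assumed 64 kbps.
LIMIT = 25 * 1024 * 1024
seconds = max_chunk_seconds(LIMIT, bitrate_kbps=64)
chunks = chunk_boundaries(3 * 3600, seconds)
```

Fixed boundaries like these can cut a word in half, so in practice it probably helps to cut at a silence near each boundary or to overlap adjacent chunks slightly and merge the transcripts afterwards.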
Gemini 2.5 Pro (via Prompt)
- Not tested yet. Based on its documentation, it doesn’t seem well-suited for long recordings.
Google Cloud Speech-to-Text V2
- The API setup is complex. You need to specify the region, language, ... explicitly.
- It fails to process .m4a audio files exported from iOS apps, even though the same files work fine with other services.
Sample configuration used:
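# Import path assumed from Google's speech_v2 Python client samples
from google.cloud.speech_v2.types import cloud_speech

config = cloud_speech.RecognitionConfig(
    # Let the service detect the audio encoding instead of declaring it
    auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
    language_codes=["en-US"],
    model="chirp_2",
)
Self-hosted WhisperX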
- Performs well for recordings over 3 hours.
- Issues: occasional word repetitions or hallucinations.
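One cheap way to reduce the repetition problem is to post-process the transcript and collapse phrases that are repeated back to back. This is a minimal sketch, not something WhisperX provides: `collapse_repeats` and its `max_ngram` parameter are hypothetical names, and the approach is a plain text heuristic that does nothing about hallucinated content that is not a repeat.

```python
def collapse_repeats(text: str, max_ngram: int = 5) -> str:
    """Collapse immediately repeated word sequences of up to max_ngram words."""
    out: list[str] = []
    for word in text.split():
        out.append(word)
        # After each new word, check whether the tail now reads "X X" for some
        # phrase X of n words. If so, drop the second copy and check again.
        changed = True
        while changed:
            changed = False
            for n in range(1, max_ngram + 1):
                if len(out) >= 2 * n and out[-n:] == out[-2 * n:-n]:
                    del out[-n:]
                    changed = True
                    break
    return " ".join(out)


cleaned = collapse_repeats("thank you thank you thank you for watching")
```

The comparison is exact and case-sensitive, and it will also collapse legitimate repeats such as "very very", so it is probably best applied only to segments already flagged as suspicious.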
AssemblyAI
- Reasonable performance.
- Lacks accurate punctuation for some non-English languages, such as Chinese.
Deepgram
- Similar to AssemblyAI: works okay but struggles with sentence-level punctuation in languages like Chinese.
Next Steps
I plan to test ElevenLabs next, based on https://www.reddit.com/r/speechtech/comments/1kd9abp/i_benchmarked_12_speechtotext_apis_under_various/
Hey guys, I would like to add speech-to-text transcription. Hosting an open source model in the cloud so I can do this anywhere would be good.
Do you guys know any highly accurate open source STT models?
Also, can it run on CPU, or is a GPU a must?
Hi everyone, hope you’re doing well. I’m currently working on a project where I need to convert audio conversations between a customer and agents into text.
Since most recordings involve up to three speakers, could you please suggest some top open-source models suited for this task, particularly those that support speaker diarization?
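Whichever models are used, a typical setup produces two separate outputs: timestamped text segments from the STT model and timestamped speaker turns from the diarization model. A minimal sketch of how the two might be combined is to give each text segment the speaker whose turns overlap it the most. The `Segment` and `Turn` classes, the speaker labels, and the sample data are hypothetical and do not match any particular library's output format.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A transcribed span of audio (times in seconds)."""
    start: float
    end: float
    text: str


@dataclass
class Turn:
    """A span of audio attributed to one speaker by a diarization model."""
    start: float
    end: float
    speaker: str


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(segments, turns, unknown="UNKNOWN"):
    """Pair each segment's text with the speaker that overlaps it the longest."""
    labeled = []
    for seg in segments:
        totals = {}
        for turn in turns:
            shared = overlap(seg.start, seg.end, turn.start, turn.end)
            if shared > 0:
                totals[turn.speaker] = totals.get(turn.speaker, 0.0) + shared
        speaker = max(totals, key=totals.get) if totals else unknown
        labeled.append((speaker, seg.text))
    return labeled


# Made-up data for illustration only.
segments = [
    Segment(0.0, 2.5, "Hello, how can I help?"),
    Segment(2.5, 6.0, "My order never arrived."),
    Segment(6.0, 7.0, "Let me check."),
    Segment(8.0, 9.0, "Thanks."),
]
turns = [
    Turn(0.0, 2.4, "agent"),
    Turn(2.4, 6.1, "customer"),
    Turn(6.1, 7.0, "agent"),
]
labeled = label_segments(segments, turns)
```

Segments that no turn covers fall back to the `unknown` label rather than being guessed, which makes gaps in the diarization output easy to spot.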
Hey everyone!
We have been exploring various open-source Text-to-Speech (TTS) models, and decided to create a Hugging Face demo space that makes it easy to compare their quality side-by-side.
The demo features 12 popular TTS models, all tested using a consistent prompt, so you can quickly hear and compare their synthesized speech and choose the best one for your audio projects.
Would love to get feedback or suggestions!
👉 Check out the demo space and detailed comparison here!
👉 Check out the blog: Choosing the Right Text-to-Speech Model: Part 2
Share your use-case and we will update this space as required!
Which TTS model sounds most natural to you?
Cheers!
I am looking for open source speech-to-text tools. I am not familiar with the progress in this field, but ideally I would like something fast and reliable that handles English as well as other languages such as French and Spanish, and that is also easy to use. Are there any recommendations?
Hey Redditors,
I’m currently exploring different text-to-speech (TTS) AI tools and was wondering what everyone here is using. I’m looking for something that delivers natural-sounding voices, supports multiple languages, and ideally offers some customization features like speed, pitch, or emotion.
I am working on custom data-management software, and for a while now I've been looking into the possibility of integrating and modifying existing local conversational AIs in it (or at least making that possible in the future). The first thing I've been struggling with is that information is somewhat hard to come by: searches often lead me back here to r/LocalLLaMA/ and to year-old threads in r/MachineLearning. Is anyone keeping track of what is out there and what is worth the attention? I am posting this here in the hope of finding some info while also sharing what I know for anyone who finds it useful or is interested.
I've noticed that most open source projects are based on OpenAI's Whisper and its re-implemented versions like:
Faster Whisper (MIT license)
Insanely fast Whisper (Apache-2.0 license)
Distil-Whisper (MIT license)
WhisperSpeech by github.com/collabora (MIT license, Added here 03/2025)
WhisperLive (MIT license, Added here 03/2025)
WhisperFusion, which is WhisperSpeech+WhisperLive in one package. (Added here 03/2025)
Coqui AI's TTS and STT models (MPL-2.0 license) have gained some traction, but on their site they have stated that they're shutting down.
Tortoise TTS (Apache-2.0 license) and its re-implemented versions such as:
Tortoise-TTS-fast (AGPL-3.0, Apache-2.0 licenses) and its slightly faster(?) fork (AGPL-3.0 license).
StyleTTS and its newer version:
StyleTTS2 (MIT license)
Alibaba Group's Tongyi SpeechTeam's SenseVoice (STT) [MIT license+possibly others] and CosyVoice (TTS) [Apache-2.0 license].
(11.2.2025): I will try to maintain this list so will begin adding new ones as well.
1/2025 Kokoro TTS (MIT License)
2/2025 Zonos by Zyphra (Apache-2.0 license)
3/2025 added: Metavoice (Apache-2.0 license)
3/2025 added: F5-TTS (MIT license)
3/2025 added: Orpheus-TTS by canopylabs.ai (Apache-2.0 license)
3/2025 added: MegaTTS3 (Apache-2.0 license)
4/2025 added: Index-tts (Apache-2.0 license). [Can be tried here.]
4/2025 added: Dia TTS (Apache-2.0 license) [Can be tried here.]
5/2025 added: Spark-TTS (Apache-2.0 license)[Can be tried here.]
5/2025 added: Parakeet TDT 0.6B V2 (CC-BY-4.0 license), STT English only [Can be tried here.], update: V3 is multilingual and has an ONNX version.
8/2025 added: Verbify-TTS (MIT License) by reddit user u/MattePalte. Described as a simple, locally run screen-reader-style app.
8/2025 added: Chatterbox-TTS (MIT License) [Can be tried here.]
8/2025 added: Microsoft's VibeVoice TTS (MIT License) for generating consistent long-form dialogues. Comes in 1.5B and 7B sizes. Both models can be tried here. A 0.5B model is also on the way. This one also already has a ComfyUI wrapper by u/Fabix84/ (additional info here). Quantized versions by u/teachersecret can be found here.
8/2025 added: BosonAI's Higgs Audio TTS (Apache-2.0 license). Can be tried here and further tested here. This one supports complex long-form dialogues. Extra prompting is supposed to allow setting the scene and adjusting expressions. Also has a quantized (4bit fork) version.
8/2025 added: StepFun AI's (Chinese AI-team source) Step-Audio 2 Mini Speech-To-Speech (Apache-2.0 license), an 8B "speech-to-speech" (Audio-To-Tokens + Tokens-To-Audio) model. Added because it is related, even if it bypasses the "to-text" part.
---------------------------------------------------------
Edit1: Added Distil-Whisper because "insanely fast whisper" is not a model, but these were shipped together.
Edit2: StyleTTS2FineTune is not actually a different version of StyleTTS2, but rather a framework for fine-tuning it.
Edit3(11.2.2025): as suggested by u/caidong I added Kokoro TTS + also added Zonos to the list.
Edit4(20.3.2025): as suggested by u/Trysem , added WhisperSpeech, WhisperLive, WhisperFusion, Metavoice and F5-TTS.
Edit5(22.3.2025): Added Orpheus-TTS.
Edit6(28.3.2025): Added MegaTTS3.
Edit7(11.4.2025): as suggested by u/Trysem/, added Index-tts.
Edit8(24.4.2025): Added Dia TTS (Nari-labs).
Edit9(02.5.2025): Added Spark-TTS as suggested by u/Tandulim (here)
Edit9(02.5.2025): Added Parakeet TDT 0.6B V2. More info in this thread.
Edit10(29.8.2025): As originally suggested by u/Trysem and later by u/Nitroedge added Chatterbox-TTS to the list.
Edit10(29.8.2025): u/MattePalte asked me to add his own TTS called Verbify-TTS to the list.
Edit10(29.8.2025): Added Microsoft's recently released VibeVoice TTS, BosonAI's Higgs Audio TTS and StepFun's STS. +Extra info.
Edit11+12(1.9.2025): Added VibeVoice TTS's quantized versions and Parakeet V3.
Hello, everyone!
I just bought a new PC for gaming, but I mostly like to play building games like Cities Skylines. Now I'm trying a bit of Enshrouded and also having fun.
The thing is: my mind doesn't stop, and I keep thinking about work sometimes. Usually it's good ideas that I would like to remember later in detail, the way they are "printed" in my head at that moment of epiphany.
That being said, what I'm looking for is a tool that can take notes from my voice so I can organize and summarize them better later.
What would be your go-to tool for speech-to-text, to take these kinds of notes?
Been playing around with Whisper and a few other models for live transcription, but even on a decent GPU, the delay’s still a bit much for anything interactive.
I’m curious what others here are using when low latency actually matters, like under 2 seconds, ideally even faster. Bonus if it works well with accents or in noisy environments.
Would love to hear what’s working for folks in production (or even fun side projects). Commercial or open source, I'm open to both!
From what I can see OpenAI's Whisper is currently one of the best-performing models, but the repo doesn't offer a way to train it, only to perform inference. How does DeepSpeech hold up? It seems pretty well documented and easy to use, but is quite old so I don't know if the actual model is up-to-date.
I'm looking for a high quality, free text to speech (TTS) AI that sounds as close to a real human as possible — preferably with multiple voice options (male/female, accents, etc). It could be an app, website, or downloadable software. I’ve tried a few that sound robotic or overly synthetic, and I’m hoping there’s something better out there that still doesn’t require payment.
Any recommendations?
I’ve been using an app called Spokenly lately and it’s made me curious about what everyone thinks is the best speech-to-text API and model available right now.
There are so many options. For example, Cartesia has Ink-Whisper, Deepgram has Nova-3, Soniox has their own system, and then there are models like Parakeet, Whisper variants, etc. Some apps seem to mix different APIs together, and I’m trying to figure out what gives the best combination of accuracy, speed, and low latency for real-time dictation.
If you’ve tested different providers, which one has worked best for you and why? I’m especially interested in:
• Accuracy with normal everyday speech
• Speed and responsiveness
• Latency during dictation
• How well it handles punctuation and capitalization
• Pricing or free tier experiences
• Any issues with reliability
Feel free to share your setup or any comparisons you’ve done. I want to see what people think is the best choice.
We've been using Unmute, and it's the best open source real-time STT -> LLM -> TTS model/system that I know of so far.
Now we're looking for a more accurate STT while maintaining real-time speed and high throughput. Ideally the model is speech-to-speech directly so the AI can provide feedback on the input voice itself and not just the transcription.
We want to try Qwen3-Omni, but AFAIK there's no speech-to-speech support in vLLM yet. There's a hosted model, but we want to use the open source one if possible.
We're building a free real-time AI app for people to practice their English speaking skills.
I'm building a tool to programmatically generate transcripts from sermons.
I have access to hundreds of sermon transcripts (and 100x more very similar in-domain data) but fewer than 40 transcripts with audio (~30 hours).
I want the lowest WER (Word Error Rate) possible and can budget 100 hours for this project in 2023.
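For reference, WER is usually defined as the word-level edit distance between the reference transcript and the model output (substitutions + deletions + insertions), divided by the number of words in the reference. A minimal sketch of that calculation follows. The function name and the sample sentences are made up for illustration, and the only normalization is lowercasing and whitespace splitting, whereas published benchmarks typically also normalize punctuation and numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


reference = "the lord is my shepherd"
perfect = word_error_rate(reference, "the lord is my shepherd")
one_deletion = word_error_rate(reference, "the lord is shepherd")
sub_and_insert = word_error_rate(reference, "the lord was my shepherd too")
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many inserted words.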
1. Train my own acoustic model with the best open source offering
- OpenAI's Whisper seems to be the best available today?
- How much supervised data (e.g. hours of sermons with perfect transcripts) would I need to develop a model that would be more accurate than Google/AWS for my specific domain?
- Can I take a model already trained and "tune" it by augmenting the data I have?
2. Use the best cloud speech-to-text API that I can tune with in-domain data
- AWS Transcribe and Google Speech to Text seem to be big players
- I've gone with AWS Transcribe since it can be tuned more easily with custom domain data (just upload text files) than Google's (which requires building phrase dictionaries with weights).
Is there anything out there that's better for my use case?
Any other thoughts on this on the whole?
----------------------------
UPDATE three months later after trying both
Whisper blew my custom AWS model, tuned with tons of domain-specific text, out of the water. Long story short: use Whisper for similar cases.