Reddit
reddit.com › r/speechtech › i benchmarked 12+ speech-to-text apis under various real-world conditions
r/speechtech on Reddit: I benchmarked 12+ speech-to-text APIs under various real-world conditions
May 2, 2025 -

Hi all, I recently ran a benchmark comparing a bunch of speech-to-text APIs and models under real-world conditions like noise robustness, non-native accents, and technical vocab, etc.

It includes all the big players like Google, AWS, MS Azure, open source models like Whisper (small and large), speech recognition startups like AssemblyAI / Deepgram / Speechmatics, and newer LLM-based models like Gemini 2.0 Flash/Pro and GPT-4o. I've benchmarked the real time streaming versions of some of the APIs as well.

I mostly did this to decide the best API to use for an app I'm building but figured this might be helpful for other builders too. Would love to know what other cases would be useful to include too.

Link here: https://voicewriter.io/speech-recognition-leaderboard

TLDR if you don't want to click on the link: the best model right now seems to be GPT-4o-transcribe, followed by Eleven Labs, Whisper-large, and the Gemini models. All the startups and AWS/Microsoft are decent with varying performance in different situations. Google (the original, not Gemini) is extremely bad.

TELUS Digital
telusdigital.com › insights › data-and-ai › article › 10-speech-to-text-models-tested
We Tested 10 Speech-to-Text Models, See Which Perform Best | TELUS Digital
November 20, 2024 - Last, we evaluated whisper-large-v3-local on an Apple MacBook Pro running an M3 Max chip, 36 GB of memory, and macOS Sequoia 15.1. Overall, assemblyai-universal-2 appeared to be the best speech-to-text model we tested.
Discussions

🎧 Listen and Compare 12 Open-Source Text-to-Speech Models (Hugging Face Space)
Nice to have it all in one place. It'd be even nicer to have an apples to apples comparison, thus all female or all male voices, instead of mixed like it's now. Maybe both? The CSM example sounds like it's full of artifacts, just like F5-TTS - and both were highlighted for speech quality. Maybe something went wrong during generation? At least Sesame can sound way better. The Llasa sample seems slightly broken - that's maybe a hint that this happens more often? Same with the background noise for MegaTTS3. Orpheus was probably standing in a large room during the generation 😉. More on reddit.com
r/LocalLLaMA
30
155
July 6, 2025
Different Speech to Text models offered by Google
There is the old one on https://cloud.google.com/speech-to-text This is not an LLM but a purpose-built “old” type architecture, right? Then from the ElevenLabs announcement I see that one of the best, state-of-the-art Speech to Text models is Gemini 2.0 Flash? But how is it doing Speech ... More on discuss.ai.google.dev
discuss.ai.google.dev
1
February 28, 2025
What's the best open source speech to text model
whisper-v3-turbo because of its wide compatibility with open source ecosystem (not necessarily because of its WER) The architecture is plug and play. You can typically add some LLMs along with whisper to correct stuff for you and customize as you need. I wrote a guide here just now: Creating Very High-Quality Transcripts with Open-Source Tools: An 100% automated workflow guide More on reddit.com
r/LocalLLaMA
16
22
August 3, 2024
[D]What is the best speech recognition model now?
Whisper is still the highest quality one in general and can be adopted for live recognition More on reddit.com
r/MachineLearning
21
29
February 1, 2025
People also ask

Some other speech models are not mentioned in your article
Please review the listicle criteria mentioned earlier to understand why we made our choices. Ultimately, we may have missed a few of them, but all of those mentioned are the top ones indeed in the market at the time of writing this article.

You are always welcome to leave us a comment about an addition that you think should be made to this article.
fosspost.org
fosspost.org › home › open source for developers › top 15 open source speech recognition/tts/stt/ systems
Top 15 Open Source Speech Recognition/TTS/STT/ Systems
Why did you not mention the DeepSpeech project by Mozilla?
DeepSpeech by Mozilla was abandoned many years ago and it is no longer under active development.

We recommend using other open-source models on this page that are still maintained.
fosspost.org
fosspost.org › home › open source for developers › top 15 open source speech recognition/tts/stt/ systems
Top 15 Open Source Speech Recognition/TTS/STT/ Systems
How about you compare the performance of these models?
That could be nice for a research paper project or a PhD thesis.

However, this is only a small listicle article to help you get started with voice and text recognition, and can not handle the weight of such a project.

Setting up these models and trying them with real data may take a lot of time, and it's up to you as a developer to choose the best one that fits your needs.
fosspost.org
fosspost.org › home › open source for developers › top 15 open source speech recognition/tts/stt/ systems
Top 15 Open Source Speech Recognition/TTS/STT/ Systems
AssemblyAI
assemblyai.com › blog › the-top-free-speech-to-text-apis-and-open-source-engines
The top free Speech-to-Text APIs, AI Models, and Open Source Engines
October 23, 2025 - This post compares the best free Speech-to-Text APIs and AI models on the market today, including APIs that have a free tier. We’ll also look at several free open-source Speech-to-Text engines and explore why you might choose an API vs. an open-source library, or vice versa.
Hugging Face
huggingface.co › models
Text-to-Speech Models – Hugging Face
20 hours ago - Text-to-Speech • 5B • Updated 1 day ago • 634 • 33 · Text-to-Speech • Updated Apr 10 • 3.99M • • 5.4k · Text-to-Speech • Updated 5 days ago • 17.9k • 439 · Text-to-Speech • Updated Dec 11, 2023 • 6.4M • 3.23k · Text-to-Speech • 3B • Updated Nov 12 • 74.3k • • 821 ·
Hugging Face
huggingface.co › models
Automatic Speech Recognition Models – Hugging Face
Text-to-Speech · Text-to-Audio · Automatic Speech Recognition · Audio-to-Audio · Audio Classification · Voice Activity Detection · Tabular · Tabular Classification · Tabular Regression · Time Series Forecasting · Reinforcement Learning · Reinforcement Learning ·
Stanford University
web.stanford.edu › ~jurafsky › slp3
Speech and Language Processing
This release has: preference alignment with DPO in the posttraining Chapter 9; completely new ASR (Whisper) and TTS (EnCodec and VALL-E) material in Chapter 15 and 16; a restructuring of earlier chapters to fit how we are teaching now: move Naive Bayes to the Appendix and instead using Logistic ...
Find elsewhere
Cloudflare
developers.cloudflare.com › directory › workers ai › models
Models · Cloudflare Workers AI docs
... LLaVA is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model, based on the transformer architecture. ... Cybertron 7B v2 is a 7B MistralAI ...
Deepgram
deepgram.com › learn › benchmarking-top-open-source-speech-models
3 Best Open-Source ASR Models Compared: Whisper, wav2vec 2.0, Kaldi – Insights & Usability
May 30, 2025 - Explore the top 3 open-source speech models, including Kaldi, wav2letter++, and OpenAI's Whisper, trained on 700,000 hours of speech. Discover insights on usability, accuracy, and speed. Click to find the right ASR model for your needs!
The Open Source Post
fosspost.org › home › open source for developers › top 15 open source speech recognition/tts/stt/ systems
Top 15 Open Source Speech Recognition/TTS/STT/ Systems
August 1, 2024 - So if you are looking just for the basic usage of converting speech to text, then you’ll find it easy to accomplish that via either Python or Bash. You may also wish to check Kaldi Active Grammar, which is a Python pre-built engine with English-trained models already ready for usage.
Fingoweb
fingoweb.com › home › top 6 speech to text ai solutions in 2026
Top 6 speech to text AI solutions in 2026 - Fingoweb
June 5, 2025 - Whether used for meeting transcriptions, voice commands, or live captions, the best speech to text AI solutions in 2026 deliver fast and precise results. ... “Until recently, transcription was handled by more primitive models that often struggled with even minor variations in pronunciation ...
Oscar
docs.ccv.brown.edu › ai-tools › services › transcribe › comparing-speech-to-text-models
Comparing Speech-to-text Models | CCV AI Services
October 24, 2025 - The CCV AI Transcribe service uses state-of-the-art speech-to-text and voice activity detection (VAD) models to provide high-quality and fast transcriptions. Currently, we offer a proprietary speech-to-text model Google Gemini and an open-source ...
Lemonfox
lemonfox.ai › blog › best-speech-to-text-software
Best Speech to Text Software a Comprehensive Guide | Blog
October 5, 2025 - For developers and businesses focused on top-tier accuracy without breaking the bank, Lemonfox.ai is a standout choice for API-driven tasks. Then you have giants like Google Speech-to-Text with its massive, scalable infrastructure, and OpenAI's ...
Inferless
inferless.com › learn › comparing-different-text-to-speech---tts--models-for-different-use-cases
Comprehensive Guide to Text-to-Speech (TTS) Models: How to Select the Best Model for Your Application
October 31, 2024 - The acoustic model converts linguistic features into mel-spectrograms, visually representing speech sounds over time. Models like Tacotron2 and Glow-TTS predict the relationship between text and sound, capturing the rhythm, tone, and emotional ...
Reddit
reddit.com › r/localllama › 🎧 listen and compare 12 open-source text-to-speech models (hugging face space)
r/LocalLLaMA on Reddit: 🎧 Listen and Compare 12 Open-Source Text-to-Speech Models (Hugging Face Space)
July 6, 2025 -

Hey everyone!

We have been exploring various open-source Text-to-Speech (TTS) models, and decided to create a Hugging Face demo space that makes it easy to compare their quality side-by-side.

The demo features 12 popular TTS models, all tested using a consistent prompt, so you can quickly hear and compare their synthesized speech and choose the best one for your audio projects.

Would love to get feedback or suggestions!

👉 Check out the demo space and detailed comparison here!

👉 Check out the blog: Choosing the Right Text-to-Speech Model: Part 2

Share your use-case and we will update this space as required!

Which TTS model sounds most natural to you?

Cheers!

PR Newswire
prnewswire.com › news-releases › rad-ai-unveils-next-generation-speech-recognition-that-redefines-radiology-reporting-302628659.html
Rad AI Unveils Next-Generation Speech Recognition That Redefines Radiology Reporting
2 weeks ago - Fine-tuned language models handle radiology-specific terminology, measurements, modifiers and even context-based interpretation cues such as laterality and sequence timing. In early evaluations of Rad AI's speech model, radiologists reported ...
Google AI
discuss.ai.google.dev › google ai studio
Different Speech to Text models offered by Google - Google AI Studio - Google AI Developers Forum
February 28, 2025 - There is the old one on ... announcement I see that one of the best, state-of-the-art Speech to Text models is Gemini 2.0 Flash? But how is it doing Speech ......
Gartner
gartner.com › all categories › speech-to-text solutions
Best Speech-to-Text Solutions Reviews 2025 | Gartner Peer Insights
Microsoft Security helps protect people and data against cyberthreats to give peace of mind. ... Intelligent Voice focuses on providing advanced speech and natural language processing solutions. The main problem it addresses is the need for privacy and regulatory compliance in fields requiring speech processing. Primarily, this includes the process of transforming multiple-language audio, video and text content into a structured, process-ready format.
GitHub
github.com › FunAudioLLM › SenseVoice
GitHub - FunAudioLLM/SenseVoice: Multilingual Voice Understanding Model
Multilingual Speech Recognition: Trained with over 400,000 hours of data, supporting more than 50 languages, the recognition performance surpasses that of the Whisper model. ... Possess excellent emotion recognition capabilities, achieving and ...
Starred by 7.2K users
Forked by 659 users
Languages: Python 97.2% | Shell 2.8%
BentoML
bentoml.com › blog › exploring-the-world-of-open-source-text-to-speech-models
The Best Open-Source Text-to-Speech Models in 2026
5 days ago - Multi-speaker control: VibeVoice can generate conversations involving up to four speakers. Speaker identities remain consistent across long passages, and the model handles natural turn-taking without degrading into monotony or repetition. Lightweight streaming variant: Microsoft also provides VibeVoice-Realtime-0.5B, which produces audible speech in roughly 300 ms and supports streaming text input for real-time narration or agent responses.