Hi all, I recently ran a benchmark comparing a bunch of speech-to-text APIs and models under real-world conditions such as background noise, non-native accents, and technical vocabulary.
It includes the big players (Google, AWS, MS Azure), open-source models like Whisper (small and large), speech recognition startups like AssemblyAI / Deepgram / Speechmatics, and newer LLM-based models like Gemini 2.0 Flash/Pro and GPT-4o. I've also benchmarked the real-time streaming versions of some of the APIs.
I mostly did this to decide on the best API for an app I'm building, but figured it might be helpful for other builders too. Would love to know what other cases would be useful to include.
Link here: https://voicewriter.io/speech-recognition-leaderboard
TLDR if you don't want to click on the link: the best model right now seems to be GPT-4o-transcribe, followed by Eleven Labs, Whisper-large, and the Gemini models. All the startups and AWS/Microsoft are decent with varying performance in different situations. Google (the original, not Gemini) is extremely bad.
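If you want to sanity-check a couple of these models on your own audio, here is a minimal sketch of how a comparison like this could be scored. It assumes word error rate (WER) as the metric, which is the usual choice for speech-to-text but is an assumption here, since the post itself doesn't say how the leaderboard scores models. The function name and the simple lowercase/whitespace normalization are illustrative choices, not the leaderboard's method.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: text is normalized only by lowercasing and splitting on
    whitespace. Real benchmarks usually also strip punctuation and
    normalize numbers, which changes the scores.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before this row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (word missing from hypothesis)
                curr[j - 1] + 1,                  # insertion (extra word in hypothesis)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts, made up for illustration only.
    reference = "please restart the kubernetes cluster"
    outputs = {
        "model_a": "please restart the kubernetes cluster",
        "model_b": "please restart the cuban eighties cluster",
    }
    for name, text in sorted(outputs.items(), key=lambda kv: word_error_rate(reference, kv[1])):
        print(name, round(word_error_rate(reference, text), 3))
```

Lower is better, and WER can go above 1.0 when a model hallucinates lots of extra words. Averaging it per condition (noisy, accented, technical) would give the kind of breakdown described above.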
🎧 Listen and Compare 12 Open-Source Text-to-Speech Models (Hugging Face Space)
Some other speech models are not mentioned in your article
You are always welcome to leave us a comment about an addition that you think should be made to this article.
Why did you not mention the DeepSpeech project by Mozilla?
DeepSpeech is no longer actively maintained, so we recommend using the other open-source models on this page that still are.
How about you compare the performance of these models?
This is only a small listicle article to help you get started with voice and text recognition, and it cannot carry the weight of such a project.
Setting up these models and trying them with real data can take a lot of time, and it's up to you as a developer to choose the one that best fits your needs.
Hey everyone!
We have been exploring various open-source Text-to-Speech (TTS) models, and decided to create a Hugging Face demo space that makes it easy to compare their quality side-by-side.
The demo features 12 popular TTS models, all tested using a consistent prompt, so you can quickly hear and compare their synthesized speech and choose the best one for your audio projects.
Would love to get feedback or suggestions!
Check out the demo space and detailed comparison here!
Check out the blog: Choosing the Right Text-to-Speech Model: Part 2
Share your use-case and we will update this space as required!
Which TTS model sounds most natural to you?
Cheers!