best speech-to-text models

reddit.com › r/speechtech › i benchmarked 12+ speech-to-text apis under various real-world conditions

r/speechtech on Reddit: I benchmarked 12+ speech-to-text APIs under various real-world conditions

May 2, 2025 -

Hi all, I recently ran a benchmark comparing a bunch of speech-to-text APIs and models under real-world conditions such as noisy audio, non-native accents, and technical vocabulary.

It includes the big players (Google, AWS, MS Azure), open-source models like Whisper (small and large), speech recognition startups like AssemblyAI / Deepgram / Speechmatics, and newer LLM-based models like Gemini 2.0 Flash/Pro and GPT-4o. I've also benchmarked the real-time streaming versions of some of the APIs.

I mostly did this to decide the best API to use for an app I'm building, but figured it might be helpful for other builders too. I'd love to know what other cases would be useful to include.

Link here: https://voicewriter.io/speech-recognition-leaderboard

TL;DR if you don't want to click on the link: the best model right now seems to be GPT-4o-transcribe, followed by ElevenLabs, Whisper-large, and the Gemini models. The startups and AWS/Microsoft are all decent, with performance that varies by situation. Google (the original, not Gemini) is extremely bad.
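
For anyone who wants to run a similar comparison on their own audio: the post above does not say how scores were computed, but word error rate (WER) is a common metric for this kind of benchmark. Here is a minimal sketch in plain Python, assuming simple lowercase-and-whitespace tokenization. The function name `word_error_rate` and the sample sentences are made up for illustration, and real benchmarks usually apply heavier text normalization (punctuation, numbers, casing) before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: tokens are lowercase, whitespace-separated words. This is a
    simplification; punctuation and number formatting are not normalized.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein, one row at a time).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into zero words costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One inserted word against a two-word reference.
    print(word_error_rate("hello world", "hello there world"))  # 0.5
```

Lower is better, and the value can exceed 1.0 when the transcript contains many extra words. To compare APIs, you would run each one on the same recordings and average this score over the reference transcripts.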

Videos

26:24

YouTube

Streaming Speech to Text Models - Kyutai vs Whisper - YouTube

August 26, 2025

08:21

YouTube

BEST Speech to Text AI Revealed - YouTube

April 2, 2025

23:58

YouTube

The Most Accurate Speech-to-text APIs in 2025 - YouTube

February 6, 2025

youtube.com

What is OpenAI Whisper? (Best Speech to Text AI Model)

15:48

YouTube

Possibly THE BEST Open Source Text-to-Speech Model - VibeVoice ...

September 2, 2025

21:59

YouTube

RIP ELEVENLABS! Here's The BEST TTS AI Voices LOCALLY For FREE! ...

April 23, 2025

AssemblyAI

assemblyai.com › blog › the-top-free-speech-to-text-apis-and-open-source-engines

The top free Speech-to-Text APIs, AI Models, and Open Source Engines

October 23, 2025 - This post compares the best free Speech-to-Text APIs and AI models on the market today, including APIs that have a free tier. We’ll also look at several free open-source Speech-to-Text engines and explore why you might choose an API vs. an open-source library, or vice versa.

Hugging Face

huggingface.co › models

Text-to-Speech Models – Hugging Face

Hugging Face

huggingface.co › models

Automatic Speech Recognition Models – Hugging Face

Stanford University

web.stanford.edu › ~jurafsky › slp3

Speech and Language Processing

This release has preference alignment with DPO in the posttraining Chapter 9; completely new ASR (Whisper) and TTS (EnCodec and VALL-E) material in Chapters 15 and 16; a restructuring of earlier chapters to fit how we are teaching now: move Naive Bayes to the Appendix and instead using Logistic ...

Cloudflare

developers.cloudflare.com › directory › workers ai › models

Models · Cloudflare Workers AI docs

... LLaVA is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model, based on the transformer architecture. ... Cybertron 7B v2 is a 7B MistralAI ...

Deepgram

deepgram.com › learn › benchmarking-top-open-source-speech-models

3 Best Open-Source ASR Models Compared: Whisper, wav2vec 2.0, Kaldi – Insights & Usability

May 30, 2025 - Explore the top 3 open-source speech models, including Kaldi, wav2letter++, and OpenAI's Whisper, trained on 700,000 hours of speech. Discover insights on usability, accuracy, and speed. Click to find the right ASR model for your needs!

The Open Source Post

fosspost.org › home › open source for developers › top 15 open source speech recognition/tts/stt/ systems

Top 15 Open Source Speech Recognition/TTS/STT/ Systems

August 1, 2024 - So if you are looking just for the basic usage of converting speech to text, then you’ll find it easy to accomplish that via either Python or Bash. You may also wish to check Kaldi Active Grammar, which is a Python pre-built engine with English-trained models already ready for usage.

Fingoweb

fingoweb.com › home › top 6 speech to text ai solutions in 2026

Top 6 speech to text AI solutions in 2026 - Fingoweb

June 5, 2025 - Whether used for meeting transcriptions, voice commands, or live captions, the best speech to text AI solutions in 2026 deliver fast and precise results. ... “Until recently, transcription was handled by more primitive models that often struggled with even minor variations in pronunciation ...

Oscar

docs.ccv.brown.edu › ai-tools › services › transcribe › comparing-speech-to-text-models

Comparing Speech-to-text Models | CCV AI Services

October 24, 2025 - The CCV AI Transcribe service uses state-of-the-art speech-to-text and voice activity detection (VAD) models to provide high-quality and fast transcriptions. Currently, we offer a proprietary speech-to-text model, Google Gemini, and an open-source ...

Lemonfox

lemonfox.ai › blog › best-speech-to-text-software

Best Speech to Text Software a Comprehensive Guide | Blog

October 5, 2025 - For developers and businesses focused on top-tier accuracy without breaking the bank, Lemonfox.ai is a standout choice for API-driven tasks. Then you have giants like Google Speech-to-Text with its massive, scalable infrastructure, and OpenAI's ...

Inferless

inferless.com › learn › comparing-different-text-to-speech---tts--models-for-different-use-cases

Comprehensive Guide to Text-to-Speech (TTS) Models: How to Select the Best Model for Your Application

October 31, 2024 - The acoustic model converts linguistic features into mel-spectrograms, visually representing speech sounds over time. Models like Tacotron2 and Glow-TTS predict the relationship between text and sound, capturing the rhythm, tone, and emotional ...

reddit.com › r/localllama › 🎧 listen and compare 12 open-source text-to-speech models (hugging face space)

r/LocalLLaMA on Reddit: 🎧 Listen and Compare 12 Open-Source Text-to-Speech Models (Hugging Face Space)

July 6, 2025 -

Hey everyone!

We have been exploring various open-source Text-to-Speech (TTS) models, and decided to create a Hugging Face demo space that makes it easy to compare their quality side-by-side.

The demo features 12 popular TTS models, all tested using a consistent prompt, so you can quickly hear and compare their synthesized speech and choose the best one for your audio projects.

Would love to get feedback or suggestions!

👉 Check out the demo space and detailed comparison here!

👉 Check out the blog: Choosing the Right Text-to-Speech Model: Part 2

Share your use-case and we will update this space as required!

Which TTS model sounds most natural to you?

Cheers!