What's the Best Speech-to-Text Model Right Now?

Try https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2 (smaller option) or https://huggingface.co/openai/whisper-large-v3-turbo. Run the Parakeet model using the NeMo library, and use something like https://github.com/SYSTRAN/faster-whisper or https://github.com/ggml-org/whisper.cpp for the latter.

Answer from Few-Welcome3297 on reddit.com
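
As a rough illustration of the Whisper route suggested above, here is a minimal sketch using faster-whisper. It assumes `pip install faster-whisper` and a release recent enough to accept the `large-v3-turbo` model name; the `format_segment` and `transcribe` helpers are illustrative names, not part of the library.

```python
def format_segment(start: float, end: float, text: str) -> str:
    """Render one segment as '[start -> end] text', times in seconds."""
    return f"[{start:.2f}s -> {end:.2f}s] {text.strip()}"


def transcribe(path: str, model_name: str = "large-v3-turbo") -> list[str]:
    """Transcribe an audio file and return one formatted line per segment."""
    # Imported lazily so format_segment works without the package installed.
    from faster_whisper import WhisperModel

    # int8 on CPU is an assumption for modest hardware; on a GPU,
    # device="cuda" with compute_type="float16" is a common choice.
    model = WhisperModel(model_name, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(path, beam_size=5)
    # `segments` is a generator; the work happens as it is consumed.
    return [format_segment(s.start, s.end, s.text) for s in segments]
```

Calling `transcribe("audio.wav")` (a hypothetical file name) downloads the model on first use and returns the timestamped lines.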

r/LocalLLaMA on Reddit: What's the Best Speech-to-Text Model Right Now?

September 13, 2025

I am looking for the best Speech-to-Text/Speech Recognition Models, anyone could recommend any?

Top answer

2 of 13

Check out the HF ASR leaderboard: https://huggingface.co/spaces/hf-audio/open_asr_leaderboard. I assume you are looking for an open-source one? I am a fan of the NVIDIA Parakeet series, but it depends on your use case.

r/speechtech on Reddit: I benchmarked 12+ speech-to-text APIs under various real-world conditions

May 2, 2025

Hi all, I recently ran a benchmark comparing a bunch of speech-to-text APIs and models under real-world conditions such as background noise, non-native accents, and technical vocabulary.

It includes all the big players like Google, AWS, and MS Azure; open-source models like Whisper (small and large); speech recognition startups like AssemblyAI / Deepgram / Speechmatics; and newer LLM-based models like Gemini 2.0 Flash/Pro and GPT-4o. I've benchmarked the real-time streaming versions of some of the APIs as well.

I mostly did this to decide the best API to use for an app I'm building but figured this might be helpful for other builders too. Would love to know what other cases would be useful to include too.

Link here: https://voicewriter.io/speech-recognition-leaderboard

TL;DR if you don't want to click on the link: the best model right now seems to be GPT-4o-transcribe, followed by ElevenLabs, Whisper-large, and the Gemini models. All the startups and AWS/Microsoft are decent, with varying performance in different situations. Google (the original, not Gemini) is extremely bad.

Top answer

1 of 4

Welcome to the painful world of benchmarking ML models. How confident are you that the audio, text, and TTS you used aren't in the training data of the models? If you can't prove that, then your benchmark isn't worth that much. It's a big reason why you can't have open data to benchmark against: it's too easy to cheat. If your task is to run ASR on old TED videos and TTS/read speech of Wikipedia articles, then these numbers may be valid. Otherwise I wouldn't trust them. Also, streaming WERs depend a lot on the desired latency, and I can't see that information anywhere. And by the way, Speechmatics has updated its pricing.
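
For readers unfamiliar with the WER (word error rate) figures discussed in this thread, here is a minimal sketch of the usual definition: word-level edit distance divided by the number of reference words. It only lowercases and splits on whitespace, which is a simplifying assumption; real benchmarks typically also normalize punctuation and numerals before scoring. The function name is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # Word-level Levenshtein distance, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.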

2 of 4

You should test the new 2.5 models; they blow everything out of the water, even on diarization.

Videos

- Streaming Speech to Text Models - Kyutai vs Whisper (YouTube, 26:24, August 26, 2025)
- The Most Accurate Speech-to-text APIs in 2025 (YouTube, 23:58, February 6, 2025)
- BEST Speech to Text AI Revealed (YouTube, 08:21, April 2, 2025)
- What is OpenAI Whisper? (Best Speech to Text AI Model) (youtube.com)
- Possibly THE BEST Open Source Text-to-Speech Model - VibeVoice ... (YouTube, 15:48, September 2, 2025)
- RIP ELEVENLABS! Here's The BEST TTS AI Voices LOCALLY For FREE! ... (YouTube, 21:59, April 23, 2025)

r/LocalLLaMA on Reddit: What's the best open source speech to text model

August 1, 2024

I know OpenAI recently released Whisper V3 Turbo, but I remember hearing about some other models that are a lot better, and I can't remember which.

Top answer

1 of 9

whisper-v3-turbo, because of its wide compatibility with the open-source ecosystem (not necessarily because of its WER). The architecture is plug and play. You can typically add some LLMs along with Whisper to correct stuff for you and customize as you need. I wrote a guide here just now: Creating Very High-Quality Transcripts with Open-Source Tools: An 100% automated workflow guide

2 of 9

You might be talking about https://huggingface.co/Revai. Here is the post you might be remembering: https://x.com/reach_vb/status/1841885263766945930

r/MachineLearning on Reddit: [D]What is the best speech recognition model now?

February 1, 2025

OpenAI’s Whisper was released more than two years ago, and it seems that no other model has seriously challenged its position since then. While Whisper has received updates over time, its performance in languages other than English—such as Chinese—is not ideal for me. I’m looking for an alternative model to generate subtitles for videos and real-time subtitles for live streams.

I have also tried Alibaba’s FunASR, but it was released more than one year ago as well and does not seem to offer satisfactory performance.

I am aware of some LLM-based speech models, but their hardware requirements are too high for my use case.

In other AI fields, new models are released almost every month, but there seems to be less attention on advancements in speech recognition. Are there any recent models worth looking into?

Top answer

1 of 13

Moonshine on Hugging Face is something worth checking out.

2 of 13

Whisper is still the highest-quality one in general and can be adapted for live recognition.

r/speechtech on Reddit: Comparative Review of Speech-to-Text APIs (2025)

July 16, 2025

Hi, I'd like to share my findings on several speech-to-text API providers based on real-world testing.

GPT-4o Transcribe

- 25 MB file limit. Not practical for real-world use cases.
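
One common workaround for a per-request size cap like this is to split a long recording into smaller files and transcribe them one by one. The sketch below does that for uncompressed PCM WAV input using only the standard library; the 25 MB default comes from the note above, while the function name, output naming scheme, and the choice of WAV are assumptions for illustration. It cuts at fixed byte offsets, so a word may be split across two chunks; cutting at silences would be more robust.

```python
import wave
from pathlib import Path

WAV_HEADER_BYTES = 44  # size of the canonical PCM header the `wave` module writes


def split_wav(path: str, out_dir: str, max_bytes: int = 25 * 1024 * 1024) -> list[Path]:
    """Split a PCM WAV file into consecutive chunks of at most max_bytes each."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    chunks: list[Path] = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        bytes_per_frame = params.nchannels * params.sampwidth
        frames_per_chunk = (max_bytes - WAV_HEADER_BYTES) // bytes_per_frame
        if frames_per_chunk <= 0:
            raise ValueError("max_bytes is too small to hold any audio")
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            chunk_path = out / f"chunk_{index:04d}.wav"
            with wave.open(str(chunk_path), "wb") as dst:
                dst.setparams(params)    # same channels, sample width, and rate
                dst.writeframes(frames)  # header frame count is fixed up on close
            chunks.append(chunk_path)
            index += 1
    return chunks
```

The returned paths are in playback order, so the per-chunk transcripts can simply be concatenated.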

Gemini 2.5 Pro (via Prompt)

- Not tested yet. Based on its documentation, it doesn’t seem well-suited for long recordings.

Google Cloud Speech-to-Text V2

- The API setup is complex. You need to specify region, language, ... explicitly.

- It fails to process .m4a audio files exported from iOS apps, even though the same files work fine with other services.

Sample configuration used:

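# Import path assumed from the google-cloud-speech client library (Speech-to-Text V2).
from google.cloud.speech_v2.types import cloud_speech

config = cloud_speech.RecognitionConfig(
    # Let the service detect the audio encoding instead of specifying it.
    auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
    language_codes=["en-US"],
    model="chirp_2",
)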

Self-hosted WhisperX

- Performs well for recordings over 3 hours.

- Issues: occasional word repetitions or hallucinations.

AssemblyAI

- Reasonable performance.

- Lacks accurate punctuation for some non-English languages, such as Chinese.

Deepgram

- Similar to AssemblyAI: works okay but struggles with sentence-level punctuation in languages like Chinese.

Next Steps

I plan to test ElevenLabs next, based on https://www.reddit.com/r/speechtech/comments/1kd9abp/i_benchmarked_12_speechtotext_apis_under_various/

Top answer

1 of 5

Need to see a review of Speechmatics and ElevenLabs too please - would love to hear what you find!

2 of 5

I would love to see you review Soniox (realtime or async) - we are constantly collecting feedback from the community so we can improve the service further.

r/LocalLLaMA on Reddit: Best speech to text transcription? Local model or api

July 4, 2024

Hey guys, I would like to add speech to text transcription. Hosting an open source model on cloud so I can do this anywhere would be good.

Do you guys know of any highly accurate open-source STT models?

Also, can it run on CPU, or is a GPU a must?

Top answer

1 of 5

Whisper is really good, at least for single speaker text. It's less good for multi-speaker text, but still plenty usable.

2 of 5

Whisper.cpp runs on CPU, is pretty fast if you have decent hardware, and in my experience recognizes almost all words and sentences correctly. I was definitely very satisfied with the results. I never tested it with languages other than English; that may require thorough testing before deciding on a model of a specific size.

r/LocalLLaMA on Reddit: Best Open source Speech to text+ diarization models

May 8, 2025

Hi everyone, hope you’re doing well. I’m currently working on a project where I need to convert audio conversations between a customer and agents into text.

Since most recordings involve up to three speakers, could you please suggest some top open-source models suited for this task, particularly those that support speaker diarization?

Top answer

2 of 5

At the moment, you’re going to get your best transcript by splitting the audio into each voice (isolate) with https://github.com/pyannote/pyannote-audio. Once split, run STT on each individual stream through a timestamp-capable model like Parakeet (https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2). Finally, reassemble the conversation by speaker, interleaving the speech based on timestamps in the final transcript.
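
The final reassembly step described above can be sketched in plain Python. This assumes the diarization and STT stages have already produced, for each speaker, a list of `(start, end, text)` tuples in seconds; it does not call pyannote or Parakeet itself, and the `Segment`, `interleave`, and `render` names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str
    start: float  # seconds from the beginning of the recording
    end: float
    text: str


def interleave(per_speaker: dict[str, list[tuple[float, float, str]]]) -> list[Segment]:
    """Merge per-speaker (start, end, text) lists into one time-ordered transcript."""
    segments = [
        Segment(speaker, start, end, text)
        for speaker, items in per_speaker.items()
        for start, end, text in items
    ]
    segments.sort(key=lambda s: (s.start, s.end))

    merged: list[Segment] = []
    for seg in segments:
        # Join consecutive segments from the same speaker into one turn.
        if merged and merged[-1].speaker == seg.speaker:
            last = merged[-1]
            last.end = max(last.end, seg.end)
            last.text = f"{last.text} {seg.text}"
        else:
            merged.append(Segment(seg.speaker, seg.start, seg.end, seg.text))
    return merged


def render(transcript: list[Segment]) -> str:
    """Format the merged turns as '[start] speaker: text' lines."""
    return "\n".join(f"[{s.start:7.2f}] {s.speaker}: {s.text}" for s in transcript)
```

Sorting by start time is the simplest policy; overlapping speech (two people talking at once) will still appear as alternating turns rather than being marked as overlap.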

r/LocalLLaMA on Reddit: 🎧 Listen and Compare 12 Open-Source Text-to-Speech Models (Hugging Face Space)

July 6, 2025

Hey everyone!

We have been exploring various open-source Text-to-Speech (TTS) models, and decided to create a Hugging Face demo space that makes it easy to compare their quality side-by-side.

The demo features 12 popular TTS models, all tested using a consistent prompt, so you can quickly hear and compare their synthesized speech and choose the best one for your audio projects.

Would love to get feedback or suggestions!

👉 Check out the demo space and detailed comparison here!

👉 Check out the blog: Choosing the Right Text-to-Speech Model: Part 2

Share your use-case and we will update this space as required!

Which TTS model sounds most natural to you?

Cheers!

Top answer

1 of 5

Nice to have it all in one place. It'd be even nicer to have an apples to apples comparison, thus all female or all male voices, instead of mixed like it's now. Maybe both? The CSM example sounds like it's full of artifacts, just like F5-TTS - and both were highlighted for speech quality. Maybe something went wrong during generation? At least Sesame can sound way better. The Llasa sample seems slightly broken - that's maybe a hint that this happens more often? Same with the background noise for MegaTTS3. Orpheus was probably standing in a large room during the generation 😉.

2 of 5

Cool project! And thanks for the work you put into it and making it a useful tool for others! I'd love to see Chatterbox and Kyutai added to the mix as well. At least, assuming they are open-source, if they aren't, of course, ignore this.

r/MachineLearning on Reddit: [D] What are the best speech to text tools currently available ?

January 6, 2023

I am looking for open-source speech-to-text tools. I am not familiar with the progress in this field, but ideally I would like something fast and reliable that handles English as well as other languages such as French and Spanish, and that is also easy to use. Are there any recommendations?

Top answer

1 of 4

Whisper. You can run it locally for free or use the Whisper API on OpenAI that's quite cheap and doesn't require any special hardware. There is also the option to run it on Google colab for free and that also doesn't require any special hardware.

2 of 4

You can use Hugging Face’s speechbox library, built on top of Whisper: https://github.com/huggingface/speechbox. They have a Colab and a Gradio demo in their README.

r/3CX on Reddit: What’s the Best Text-to-Speech AI Tool You’re Using Right Now?

January 12, 2025

Hey Redditors,

I’m currently exploring different text-to-speech (TTS) AI tools and was wondering what everyone here is using. I’m looking for something that delivers natural-sounding voices, supports multiple languages, and ideally offers some customization features like speed, pitch, or emotion.

Google Speech Synthesis or Amazon Polly. But we often use real voices instead. Feels more natural.

r/LocalLLaMA on Reddit: Best local open source Text-To-Speech and Speech-To-Text?

August 24, 2024

I am working on custom data-management software, and for a while now I've been looking into the possibility of integrating and modifying existing local conversational AIs into it (or at least developing the possibility of doing so in the future). The first thing I've been struggling with is that information is somewhat hard to come by: searches often lead me back here to r/LocalLLaMA/ and to year-old threads in r/MachineLearning. Is anyone keeping track of what is out there and what is worth the attention? I am posting this here in the hope of finding some info while also sharing what I know for anyone who finds it useful or is interested.

I've noticed that most open source projects are based on OpenAI's Whisper and its re-implemented versions like:

Faster Whisper (MIT license)
Insanely fast Whisper (Apache-2.0 license)
Distil-Whisper (MIT license)
WhisperSpeech by github.com/collabora (MIT license, Added here 03/2025)
WhisperLive (MIT license, Added here 03/2025)
WhisperFusion, which is WhisperSpeech+WhisperLive in one package. (Added here 03/2025)

Coqui AI's TTS and STT -models (MPL-2.0 license) have gained some traction, but on their site they have stated that they're shutting down.

Tortoise TTS (Apache-2.0 license) and its re-implemented versions such as:

Tortoise-TTS-fast (AGPL-3.0, Apache-2.0 licenses) and its slightly faster(?) fork (AGPL-3.0 license).

StyleTTS and its newer version:

StyleTTS2 (MIT license)

Alibaba Group's Tongyi SpeechTeam's SenseVoice (STT) [MIT license+possibly others] and CosyVoice (TTS) [Apache-2.0 license].

(11.2.2025): I will try to maintain this list so will begin adding new ones as well.

1/2025 Kokoro TTS (MIT License)
2/2025 Zonos by Zyphra (Apache-2.0 license)
3/2025 added: Metavoice (Apache-2.0 license)
3/2025 added: F5-TTS (MIT license)
3/2025 added: Orpheus-TTS by canopylabs.ai (Apache-2.0 license)
3/2025 added: MegaTTS3 (Apache-2.0 license)
4/2025 added: Index-tts (Apache-2.0 license). [Can be tried here.]
4/2025 added: Dia TTS (Apache-2.0 license) [Can be tried here.]
5/2025 added: Spark-TTS (Apache-2.0 license)[Can be tried here.]
5/2025 added: Parakeet TDT 0.6B V2 (CC-BY-4.0 license), STT English only [Can be tried here.], update: V3 is multilingual and has an onnx -version.

8/2025 added: Verbify-TTS (MIT License) by reddit user u/MattePalte. Described as simple locally run screen-reader-style app.

8/2025 added: Chatterbox-TTS (MIT License) [Can be tried here.]

8/2025 added: Microsoft's VibeVoice TTS (MIT Licence) for generating consistent long-form dialogues. Comes in 1.5B and 7B sizes. Both models can be tried here. 0.5B model is also on the way. This one also already has a ComfyUI wrapper by u/Fabix84/ (additional info here). Quantized versions by u/teachersecret can be found here

8/2025 added: BosonAI's Higgs Audio TTS (Apache-2.0 license). Can be tried here and further tested here. This one supports complex long-form dialogues. Extra prompting is supposed to allow setting the scene and adjusting expressions. Also has a quantized (4bit fork) version.

8/2025 added: StepFun AI's (Chinese AI-team ^source) Step-Audio 2 Mini Speech-To-Speech (Apache-2.0 license), an 8B "speech-to-speech" (Audio-To-Tokens + Tokens-To-Audio) -model. Added because it is related, even if it bypasses the "to-text" -part.

---------------------------------------------------------

Edit1: Added Distil-Whisper because "insanely fast whisper" is not a model, but these were shipped together.
Edit2: StyleTTS2FineTune is not actually a different version of StyleTTS2, but rather a framework to finetuning it.
Edit3(11.2.2025): as suggested by u/caidong I added Kokoro TTS + also added Zonos to the list.
Edit4(20.3.2025): as suggested by u/Trysem , added WhisperSpeech, WhisperLive, WhisperFusion, Metavoice and F5-TTS.
Edit5(22.3.2025): Added Orpheus-TTS.
Edit6(28.3.2025): Added MegaTTS3.
Edit7(11.4.2025): as suggested by u/Trysem/, added Index-tts.
Edit8(24.4.2025): Added Dia TTS (Nari-labs).
Edit9(02.5.2025): Added Spark-TTS as suggested by u/Tandulim (here)
Edit9(02.5.2025): Added Parakeet TDT 0.6B V2. More info in this thread.

Edit10(29.8.2025): As originally suggested by u/Trysem and later by u/Nitroedge added Chatterbox-TTS to the list.

Edit10(29.8.2025): u/MattePalte asked me to add his own TTS called Verbify-TTS to the list.

Edit10(29.8.2025): Added Microsoft's recently released VibeVoice TTS, BosonAI's Higgs Audio TTS and StepFun's STS. +Extra info.

Edit11+12(1.9.2025): Added VibeVoice TTS's quantized versions and Parakeet V3.

Top answer

1 of 38

I’ve been trying to keep a list of TTS solutions. Here you go:

Text to Speech Solutions

11labs - Commercial
xtts
xtts2
Alltalk
Styletts2
Fish-Speech
PiperTTS - A fast, local neural text to speech system that is optimized for the Raspberry Pi 4.
PiperUI
Paroli - Streaming mode implementation of the Piper TTS with RK3588 NPU acceleration support.
Bark
Tortoise TTS
LMNT
AlwaysReddy - (uses Piper)
Open-LLM-VTuber
MeloTTS
OpenVoice
Sherpa-onnx
Silero
Neuro-sama
Parler TTS
Chat TTS
VallE-X
Coqui TTS
Daswers XTTS GUI
VoiceCraft - Zero-Shot Speech Editing and Text-to-Speech

2 of 38

I’ve been using alltalktts ( https://github.com/erew123/alltalk_tts ), which is based on Coqui and supports XTTS2, Piper, and some others. I’m on a Mac so my options are pretty limited, and this worked fairly well. If XTTS is the model you want to go with, then maybe https://github.com/daswer123/xtts-api-server would work even better. Unfortunately most of my cases are in SillyTavern, for narration and character TTS, so these may not match your use case. The last link I shared might give you ideas for how to implement that in a real application, though. Are you a dev-like person, or just enthusiastic about it? I ask because if you’re a dev with some Python knowledge, or a willingness to follow code, the latter link is actually pretty useful for ideas, in spite of being targeted towards SillyTavern. If not, this whole space might be kind of hard to navigate at this point in time, and it will also depend a lot on the hardware where you’ll be deploying this.

r/LocalLLaMA on Reddit: What's the best model for Speech-to-Text?

November 4, 2024

title

Top answer

1 of 5

In my opinion, whisperX ( https://github.com/m-bain/whisperX ) is the fastest and the most accurate.

2 of 5

That probably depends on what you mean by "best". How CPU/GPU demanding can it be? What is the expected latency? Streaming or offline? What domains (legal, medical, etc.)? What languages?

For English non-streaming there are plenty of models with comparable recognition rates:

- Whisper (and many variants)
- Nvidia NeMo Parakeet or Canary
- the new one from Rev.ai

For other languages and streaming models, the options are more limited to Whisper small Vosk and some wav2vec models on huggingface.

We are planning to release some models of our own, under a non-commercial license (and an affordable commercial license). They are fast, have low WER, can do streaming, and we have support for a dozen languages. You can try them here: https://banafo.ai/en/realtime-demo (all running on CPU). Contact me on [email protected] if you'd like to be one of our beta testers for the on-premise versions.

r/ArtificialInteligence on Reddit: Whats your best Speech-to-text tool?

February 12, 2024

Hello, everyone!

I just bought a new PC for gaming, but I mostly like to play building games like Cities Skylines. Now I'm trying a bit of Enshrouded and also having fun.

The thing is: my mind doesn't stop, and I keep thinking about work sometimes. Usually these are good ideas that I would like to remember later, in detail, the way they are "printed" in my head at that moment of epiphany.

That being said, what I'm looking for is a tool that can take notes from my voice so I can organize and summarize them better later.

What would be your go-to tool for speech-to-text, to take this kind of notes?

Top answer

1 of 10

I use Apple’s dictation feature. It’s pretty good. There should be something similar for Windows.

2 of 10

Go with the OG https://www.nuance.com/dragon.html

r/speechtech on Reddit: What are people using for real-time speech recognition with low latency?

July 23, 2025

Been playing around with Whisper and a few other models for live transcription, but even on a decent GPU, the delay’s still a bit much for anything interactive.

I’m curious what others here are using when low latency actually matters, like under 2 seconds, ideally even faster. Bonus if it works well with accents or in noisy environments.

Would love to hear what’s working for folks in production (or even fun side projects). Commercial or open source - am open to both!

Top answer

1 of 13

We’re using Azure Speech for realtime phone conversations. It performs quite well.

2 of 13

Speechmatics has been dominating here.

r/learnmachinelearning on Reddit: What's the current best speech-to-text model that I can fine-tune for a different language?

May 7, 2022

From what I can see OpenAI's Whisper is currently one of the best-performing models, but the repo doesn't offer a way to train it, only to perform inference. How does DeepSpeech hold up? It seems pretty well documented and easy to use, but is quite old so I don't know if the actual model is up-to-date.

Top answer

1 of 1

https://huggingface.co/blog/fine-tune-whisper

r/software on Reddit: What's the best free text to speech AI tool that sounds really natural?

August 4, 2025

I'm looking for a high quality, free text to speech (TTS) AI that sounds as close to a real human as possible — preferably with multiple voice options (male/female, accents, etc). It could be an app, website, or downloadable software. I’ve tried a few that sound robotic or overly synthetic, and I’m hoping there’s something better out there that still doesn’t require payment.

Any recommendations?

Top answer

1 of 16

If you're a bit technical, Chatterbox and F5-TTS are pretty good models that you can run locally in Docker containers. Both support voice cloning: just provide a 5-second sample.

2 of 16

The built-in ones that Microsoft Edge uses are pretty good; they sound much more natural than the ones used elsewhere in Windows. You can save a document as a PDF and get Edge to read it out, choose different voices, etc. The main downside is that it's not as convenient as typing stuff in and getting an MP3 back.

r/macapps on Reddit: Spokenly: what’s the best speech-to-text API and model right now?

1 month ago

I’ve been using an app called Spokenly lately and it’s made me curious about what everyone thinks is the best speech-to-text API and model available right now.

There are so many options. For example, Cartesia has Ink Whisperer, Deepgram has Nova 3, Soniox has their own system, and then there are models like Parakeet, Whisper variants, etc. Some apps seem to mix different APIs together and I’m trying to figure out what gives the best combination of accuracy, speed, and low latency for real-time dictation.

If you’ve tested different providers, which one has worked best for you and why? I’m especially interested in:

• Accuracy with normal everyday speech

• Speed and responsiveness

• Latency during dictation

• How well it handles punctuation and capitalization

• Pricing or free tier experiences

• Any issues with reliability

Feel free to share your setup or any comparisons you’ve done. I want to see what people think is the best choice.

Spokenly with local Apple's model works fine. UI could have been better though.

r/LocalLLaMA on Reddit: Best real-time speech-to-speech model?

September 30, 2025

We've been using unmute, and it's the best open source real-time STT -> LLM -> TTS model/system that I know so far.

Now we're looking for a more accurate STT while maintaining real-time speed and high throughput. Ideally the model is speech-to-speech directly so the AI can provide feedback on the input voice itself and not just the transcription.

We want to try the Qwen3-Omni but AFAIK there's no speech-to-speech support in vLLM yet. There's a hosted model but we want to use the open source if possible.

We're building a free real-time AI app for people to practice their English speaking skills.

Top answer

1 of 8

"Ideally the model is speech-to-speech directly so the AI can provide feedback on the input voice itself and not just the transcription"

We have yet to see this kind of sorcery.

2 of 8

“Moshi,” by the same authors as unmute, is the only one I’m aware of. It’s impressive that its novel design works at all, but it’s a year old now and I don’t think it ever matched the intelligence of simply running a fast LLM in the TTS/STT setup you described

Murf AI

Best Text to Speech Software According to Reddit

ElevenLabs is another AI text to speech Reddit users recommend that is perfect for creating videos, gaming, audiobooks, and chatbots. Its text to voice features allow users to alter or amplify vocal style, stability, and clarity. Voice generation in over 29 languages. The free plan offers up to three custom voices and 10000 characters. Offers a text read-aloud feature. Bark is a text to audio model that can generate realistic speech and sound effects from any text input.

r/learnmachinelearning on Reddit: Best Speech-to-Text Model - Open Source vs. Paid Services

December 27, 2022

I'm building a tool to programmatically generate transcripts from sermons.

I have access to hundreds of sermon transcripts (and 100x more very similar in-domain data) but fewer than 40 transcripts with audio (~30 hours).

I want the lowest WER (Word Error Rate) possible and can budget 100 hours for this project in 2023.

Train my own acoustic model with the best open-source offering

- OpenAI's Whisper seems to be the best available today?
- How much supervised data (e.g. hours of sermons with perfect transcripts) would I need to develop a model that would be more accurate than Google/AWS for my specific domain?
- Can I take a model already trained and "tune" it by augmenting the data I have?

Use the best cloud speech-to-text API that I can tune with in-domain data

- AWS Transcribe and Google Speech to Text seem to be the big players.
- I've gone with AWS Transcribe since it can be tuned more easily with custom domain data (just upload text files) than Google's (which requires building phrase dictionaries with weights).
- Is there anything out there that's better for my use case?

Any other thoughts on this on the whole?

----------------------------

UPDATE three months later after trying both

Whisper blew my custom AWS model with tons of domain-specific text out of the water. Long story short is to use Whisper for similar cases.

Top answer

1 of 4

You should check out AssemblyAI or Deepgram. They perform better than the big tech options.

2 of 4

Hey, just wanted to say thanks for sharing your findings after 3 months of testing.