Hi guys,
I'm looking for some software or an app that can turn an mp3 recording into text. There are a lot of text-to-speech solutions out there, but I can't find anything that goes the other way, that is, speech-to-text.
I have some mp3 recordings of lectures that I would like to turn into text and then PDF.
I am working on custom data-management software, and for a while now I've been looking into the possibility of integrating and modifying existing local conversational AIs into it (or at least developing the possibility of doing so in the future). The first thing I've been struggling with is that information is somewhat hard to come by: searches often lead me back here to r/LocalLLaMA/ and to year-old threads in r/MachineLearning. Is anyone keeping track of what is out there and what is worth the attention? I am posting this here in the hope of finding some info while also sharing what I know for anyone who finds it useful or is interested.
I've noticed that most open source projects are based on OpenAI's Whisper and its re-implemented versions, such as:
Faster Whisper (MIT license)
Insanely fast Whisper (Apache-2.0 license)
Distil-Whisper (MIT license)
WhisperSpeech by github.com/collabora (MIT license, Added here 03/2025)
WhisperLive (MIT license, Added here 03/2025)
WhisperFusion, which is WhisperSpeech+WhisperLive in one package. (Added here 03/2025)
Coqui AI's TTS and STT models (MPL-2.0 license) have gained some traction, but on their site they have stated that they're shutting down.
Tortoise TTS (Apache-2.0 license) and its re-implemented versions such as:
Tortoise-TTS-fast (AGPL-3.0, Apache-2.0 licenses) and its slightly faster(?) fork (AGPL-3.0 license).
StyleTTS and its newer version:
StyleTTS2 (MIT license)
Alibaba Group's Tongyi SpeechTeam's SenseVoice (STT) [MIT license+possibly others] and CosyVoice (TTS) [Apache-2.0 license].
(11.2.2025): I will try to maintain this list, so I will begin adding new entries as well.
1/2025 Kokoro TTS (MIT License)
2/2025 Zonos by Zyphra (Apache-2.0 license)
3/2025 added: Metavoice (Apache-2.0 license)
3/2025 added: F5-TTS (MIT license)
3/2025 added: Orpheus-TTS by canopylabs.ai (Apache-2.0 license)
3/2025 added: MegaTTS3 (Apache-2.0 license)
4/2025 added: Index-tts (Apache-2.0 license). [Can be tried here.]
4/2025 added: Dia TTS (Apache-2.0 license) [Can be tried here.]
5/2025 added: Spark-TTS (Apache-2.0 license) [Can be tried here.]
5/2025 added: Parakeet TDT 0.6B V2 (CC-BY-4.0 license), STT, English only [Can be tried here.]. Update: V3 is multilingual and has an ONNX version.
8/2025 added: Verbify-TTS (MIT License) by reddit user u/MattePalte. Described as a simple, locally run, screen-reader-style app.
8/2025 added: Chatterbox-TTS (MIT License) [Can be tried here.]
8/2025 added: Microsoft's VibeVoice TTS (MIT License) for generating consistent long-form dialogues. Comes in 1.5B and 7B sizes. Both models can be tried here. A 0.5B model is also on the way. This one also already has a ComfyUI wrapper by u/Fabix84/ (additional info here). Quantized versions by u/teachersecret can be found here.
8/2025 added: BosonAI's Higgs Audio TTS (Apache-2.0 license). Can be tried here and further tested here. This one supports complex long-form dialogues. Extra prompting is supposed to allow setting the scene and adjusting expressions. Also has a quantized (4bit fork) version.
8/2025 added: StepFun AI's (Chinese AI-team source) Step-Audio 2 Mini Speech-To-Speech (Apache-2.0 license), an 8B "speech-to-speech" (Audio-To-Tokens + Tokens-To-Audio) model. Added because it is related, even if it bypasses the "to-text" part.
---------------------------------------------------------
Edit1: Added Distil-Whisper because "insanely fast whisper" is not a model, but these were shipped together.
Edit2: StyleTTS2FineTune is not actually a different version of StyleTTS2, but rather a framework for fine-tuning it.
Edit3(11.2.2025): as suggested by u/caidong I added Kokoro TTS + also added Zonos to the list.
Edit4(20.3.2025): as suggested by u/Trysem, added WhisperSpeech, WhisperLive, WhisperFusion, Metavoice and F5-TTS.
Edit5(22.3.2025): Added Orpheus-TTS.
Edit6(28.3.2025): Added MegaTTS3.
Edit7(11.4.2025): as suggested by u/Trysem/, added Index-tts.
Edit8(24.4.2025): Added Dia TTS (Nari-labs).
Edit9(02.5.2025): Added Spark-TTS as suggested by u/Tandulim (here)
Edit9(02.5.2025): Added Parakeet TDT 0.6B V2. More info in this thread.
Edit10(29.8.2025): As originally suggested by u/Trysem and later by u/Nitroedge added Chatterbox-TTS to the list.
Edit10(29.8.2025): u/MattePalte asked me to add his own TTS called Verbify-TTS to the list.
Edit10(29.8.2025): Added Microsoft's recently released VibeVoice TTS, BosonAI's Higgs Audio TTS and StepFun's STS. +Extra info.
Edit11+12(1.9.2025): Added VibeVoice TTS's quantized versions and Parakeet V3.
I know OpenAI recently released Whisper V3 Turbo, but I remember hearing about some other models that are a lot better, and I can't remember what they were called.
I am searching for a fully open-source Python script or an application to seamlessly transcribe audio into text in French.
I dislike websites since they usually come with restrictions and limitations in most instances (approximately 99% of the time).
Do you know where I could find something suitable?
I am building an LLM infrastructure that is missing one thing: text to speech. I know there are really good APIs like MURF.AI out there, but I haven't been able to find any decent open source TTS that sounds more natural than the system one.
If you know of any, please leave a comment.
Thanks
I'd like something I can use to transcribe speech to text as part of a larger program. Googling "open source speech to text", I see CMU Sphinx and Open Mind Speech. Any other options I should be aware of? Which is the most accurate?
Hey guys, I would like to add speech-to-text transcription. Hosting an open source model in the cloud so I can do this anywhere would be good.
Do you guys know any highly accurate open source STT models?
Also, can it run on CPU, or is a GPU a must?
Hi, I am looking for a speech-to-text converter for Windows, but I have not been able to find anything.
I know there are two popular models, Vosk and Whisper, but I am unable to find anything I can use on Windows for live transcription.
Any suggestions please?
Hi everyone, hope you’re doing well. I’m currently working on a project where I need to convert audio conversations between a customer and agents into text.
Since most recordings involve up to three speakers, could you please suggest some top open-source models suited for this task, particularly those that support speaker diarization?
I am looking for open source speech-to-text tools. I am not familiar with the progress in this field, but ideally I would like something fast, reliable and easy to use that handles English as well as other languages such as French and Spanish. Are there any recommendations?
I'm currently using ElevenLabs for text to speech, but the voice quality is not good in Hindi and it is also costly, so I am thinking of moving to open source TTS. Please suggest a good open source alternative to ElevenLabs with low latency and good Hindi voice quality.
I've been working on a project that needs reliable speech-to-text conversion with the potential for multiple active speakers in a conversation. I've only used the OpenAI Whisper model, released long ago, which was pretty terrible with 3+ people and often had trouble consistently attributing the correct speaker tag to each voice.
It's a pretty old model (given how fast everything moves). What models are you guys using, and what are some of the pros and cons?
Years ago I searched unsuccessfully for human-sounding TTS software (German voice output) for Linux.
Is there really still (in 2024) nothing in the Linux world comparable to Balabolka and Read Aloud?
Hey guys, I'm looking to make an application that uses neural text to speech for my Python program. I'm not sure what open source SOTA is like, would love to get some reference repositories to check out, especially if they have demos.
It would also be nice to have a model that can come with multiple voices :)
Cheers
Hi
In 2023, what are the best solutions for converting English speech to English text, either from a live recording or from a saved audio file? Online solutions are OK too, but I would prefer offline ones for privacy, so that my speech and text are not going anywhere and cannot be seen by someone else.
Which ones are the best (most accurate, etc.) in 2023?
(And if there is a free option that is quite good, throw that in too :-) )
Thank you