r/opensource on Reddit: Speech-to-text software
February 15, 2023 -

Hi guys,

I'm looking for some software or an app that can turn an mp3 recording into text. There's a lot of text-to-speech solutions out there but I can't find anything that goes the other way, that is speech-to-text.

I have some mp3 recordings of lectures that I would like to turn into text and then PDF.
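Whisper-family tools (discussed in the threads below) are the usual local answer for this, and they return timestamped segments rather than a finished document. As a hedged sketch, here is one way to group such segments into paragraphs before exporting to PDF - the segment shape (dicts with start, end, and text keys) mirrors what Whisper-style tools emit, but the helper itself is a hypothetical example:

```python
def segments_to_paragraphs(segments, pause=2.0):
    """Group timestamped transcript segments into paragraphs, starting a new
    paragraph whenever the silence before a segment exceeds `pause` seconds."""
    paragraphs, current, last_end = [], [], None
    for seg in segments:
        if last_end is not None and seg["start"] - last_end > pause:
            paragraphs.append(" ".join(current))
            current = []
        current.append(seg["text"].strip())
        last_end = seg["end"]
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)

# Hypothetical output from a Whisper-style transcriber:
segs = [
    {"start": 0.0, "end": 4.2, "text": "Welcome to the lecture."},
    {"start": 4.5, "end": 9.0, "text": "Today we cover acoustics."},
    {"start": 14.0, "end": 18.0, "text": "First, some definitions."},
]
print(segments_to_paragraphs(segs))
```

The resulting plain text can then be fed to any text-to-PDF converter.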

r/LocalLLaMA on Reddit: Best local open source Text-To-Speech and Speech-To-Text?
August 24, 2024 -

I am working on custom data-management software, and for a while now I've been looking into the possibility of integrating and modifying existing local conversational AIs into it (or at least developing the possibility of doing so in the future). The first thing I've been struggling with is that information is somewhat hard to come by - searches often lead me back here to r/LocalLLaMA/ and to year-old threads in r/MachineLearning. Is anyone keeping track of what is out there and what is worth the attention? I am posting this here in the hope of finding some info, while also sharing what I know for anyone who finds it useful or is interested.

I've noticed that most open source projects are based on OpenAI's Whisper and its reimplementations, like:

  • Faster Whisper (MIT license)

  • Insanely fast Whisper (Apache-2.0 license)

  • Distil-Whisper (MIT license)

  • WhisperSpeech by github.com/collabora (MIT license, Added here 03/2025)

  • WhisperLive (MIT license, Added here 03/2025)

  • WhisperFusion, which is WhisperSpeech+WhisperLive in one package. (Added here 03/2025)

Coqui AI's TTS and STT models (MPL-2.0 license) have gained some traction, but they have stated on their site that they're shutting down.

Tortoise TTS (Apache-2.0 license) and its re-implemented versions such as:

  • Tortoise-TTS-fast (AGPL-3.0, Apache-2.0 licenses) and its slightly faster(?) fork (AGPL-3.0 license).

StyleTTS and its newer version:

  • StyleTTS2 (MIT license)

Alibaba Group's Tongyi Speech Team's SenseVoice (STT) [MIT license, possibly others] and CosyVoice (TTS) [Apache-2.0 license].

(11.2.2025): I will try to maintain this list, so I will begin adding new entries as well.

1/2025 Kokoro TTS (MIT License)
2/2025 Zonos by Zyphra (Apache-2.0 license)
3/2025 added: Metavoice (Apache-2.0 license)
3/2025 added: F5-TTS (MIT license)
3/2025 added: Orpheus-TTS by canopylabs.ai (Apache-2.0 license)
3/2025 added: MegaTTS3 (Apache-2.0 license)
4/2025 added: Index-tts (Apache-2.0 license). [Can be tried here.]
4/2025 added: Dia TTS (Apache-2.0 license) [Can be tried here.]
5/2025 added: Spark-TTS (Apache-2.0 license)[Can be tried here.]
5/2025 added: Parakeet TDT 0.6B V2 (CC-BY-4.0 license), STT, English only [Can be tried here.] Update: V3 is multilingual and has an ONNX version.

8/2025 added: Verbify-TTS (MIT License) by reddit user u/MattePalte. Described as a simple, locally run screen-reader-style app.

8/2025 added: Chatterbox-TTS (MIT License) [Can be tried here.]

8/2025 added: Microsoft's VibeVoice TTS (MIT License) for generating consistent long-form dialogues. Comes in 1.5B and 7B sizes; both models can be tried here. A 0.5B model is also on the way. This one also already has a ComfyUI wrapper by u/Fabix84/ (additional info here). Quantized versions by u/teachersecret can be found here.

8/2025 added: BosonAI's Higgs Audio TTS (Apache-2.0 license). Can be tried here and further tested here. This one supports complex long-form dialogues. Extra prompting is supposed to allow setting the scene and adjusting expressions. There is also a quantized 4-bit fork.

8/2025 added: StepFun AI's Step-Audio 2 Mini (Apache-2.0 license), an 8B "speech-to-speech" (audio-to-tokens + tokens-to-audio) model from a Chinese AI team. Added because it is related, even if it bypasses the "to-text" part.

---------------------------------------------------------

Edit1: Added Distil-Whisper because "insanely fast whisper" is not a model, but these were shipped together.
Edit2: StyleTTS2FineTune is not actually a different version of StyleTTS2, but rather a framework for fine-tuning it.
Edit3(11.2.2025): as suggested by u/caidong I added Kokoro TTS + also added Zonos to the list.
Edit4(20.3.2025): as suggested by u/Trysem , added WhisperSpeech, WhisperLive, WhisperFusion, Metavoice and F5-TTS.
Edit5(22.3.2025): Added Orpheus-TTS.
Edit6(28.3.2025): Added MegaTTS3.
Edit7(11.4.2025): as suggested by u/Trysem/, added Index-tts.
Edit8(24.4.2025): Added Dia TTS (Nari-labs).
Edit9(02.5.2025): Added Spark-TTS as suggested by u/Tandulim (here)
Edit9(02.5.2025): Added Parakeet TDT 0.6B V2. More info in this thread.

Edit10(29.8.2025): As originally suggested by u/Trysem and later by u/Nitroedge added Chatterbox-TTS to the list.

Edit10(29.8.2025): u/MattePalte asked me to add his own TTS called Verbify-TTS to the list.

Edit10(29.8.2025): Added Microsoft's recently released VibeVoice TTS, BosonAI's Higgs Audio TTS and StepFun's STS. +Extra info.

Edit11+12(1.9.2025): Added VibeVoice TTS's quantized versions and Parakeet V3.

Top answer (1 of 38, 72 points)
I’ve been trying to keep a list of TTS solutions. Here you go:

Text to Speech Solutions

  • 11labs - Commercial
  • xtts
  • xtts2
  • Alltalk
  • Styletts2
  • Fish-Speech
  • PiperTTS - A fast, local neural text to speech system that is optimized for the Raspberry Pi 4.
  • PiperUI
  • Paroli - Streaming-mode implementation of Piper TTS with RK3588 NPU acceleration support.
  • Bark
  • Tortoise TTS
  • LMNT
  • AlwaysReddy (uses Piper)
  • Open-LLM-VTuber
  • MeloTTS
  • OpenVoice
  • Sherpa-onnx
  • Silero
  • Neuro-sama
  • Parler TTS
  • Chat TTS
  • VallE-X
  • Coqui TTS
  • Daswers XTTS GUI
  • VoiceCraft - Zero-Shot Speech Editing and Text-to-Speech
Answer 2 of 38 (15 points)
I’ve been using alltalktts (https://github.com/erew123/alltalk_tts), which is based on Coqui and supports XTTS2, Piper, and some others. I’m on a Mac, so my options are pretty limited, and this worked fairly well. If XTTS is the model you want to go with, then maybe https://github.com/daswer123/xtts-api-server would work even better. Unfortunately, most of my use cases are in SillyTavern, for narration and character TTS, so these may not fit your use case. The last link I shared might give you ideas for how to implement this in a real application, though. Are you a dev-like person, or just enthusiastic about it? I ask because if you’re a dev with some Python knowledge, or a willingness to follow code, the latter link is actually pretty useful for ideas, in spite of being targeted toward SillyTavern. If not, this whole space might be kind of hard to navigate at this point in time, and a lot will also depend on the hardware where you’ll be deploying this.
r/LocalLLaMA on Reddit: Open source AI voice dictation app with a fully customizable STT and LLM pipeline
3 days ago - Tambourine is an open source, cross-platform voice dictation app that uses a configurable STT and LLM pipeline to turn natural speech into clean, formatted text in any app.
r/MachineLearning on Reddit: [D] What is the best open source text to speech model?
April 13, 2023 -

I am building an LLM infrastructure that is missing one thing - text to speech. I know there are really good APIs like MURF.AI out there, but I haven't been able to find any decent open source TTS that is more natural than the system one.

If you know of any, please leave a comment.

Thanks

Top answer (1 of 10, 46 points)
I have a whole list of TTS models (repos & white papers):

Neural TTS Models

  • Tacotron (submitted Mar 29, 2017). Paper: https://arxiv.org/pdf/1703.10135.pdf. GitHub: https://github.com/keithito/tacotron (not the official implementation, but the one cited most).
  • Tacotron2 (submitted Dec 16, 2017). Paper: https://arxiv.org/pdf/1712.05884.pdf. GitHub: https://github.com/NVIDIA/tacotron2
  • Transformer TTS ** (submitted Sept 19, 2018). Paper: https://arxiv.org/pdf/1809.08895.pdf. GitHub: N/A
  • Flowtron (submitted May 12, 2020). Paper: https://arxiv.org/pdf/2005.05957.pdf. GitHub: https://github.com/NVIDIA/flowtron
  • FastSpeech2 (submitted Jun 8, 2020). Paper: https://arxiv.org/pdf/2006.04558.pdf. GitHub: https://github.com/ming024/FastSpeech2 (not the official implementation, but the one cited most).
  • FastPitch (submitted Jun 11, 2020). Paper: https://arxiv.org/pdf/2006.06873.pdf. GitHub: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/FastPitch
  • TalkNet (1/2) (submitted May 12, 2020 / Apr 16, 2021). Papers: https://arxiv.org/pdf/2005.05514.pdf / https://arxiv.org/pdf/2104.08189.pdf. GitHub: https://github.com/NVIDIA/NeMo
  • RadTTS (submitted Aug 18, 2021; NVIDIA page, not arXiv). Paper: https://openreview.net/pdf?id=0NQwnnwAORi. GitHub: https://github.com/NVIDIA/radtts
  • MixerTTS (submitted Oct 7, 2021). Paper: https://arxiv.org/pdf/2110.03584.pdf. GitHub: https://github.com/NVIDIA/NeMo
  • GradTTS (diffusion TTS) (submitted May 13, 2021). Paper: https://arxiv.org/pdf/2105.06337.pdf. GitHub: https://github.com/huawei-noah/Speech-Backbones/tree/main/Grad-TTS
  • VITS (submitted Jun 11, 2021). Paper: https://arxiv.org/pdf/2106.06103.pdf. GitHub: https://github.com/jaywalnut310/vits
  • GlowTTS (submitted May 22, 2020). Paper: https://arxiv.org/pdf/2005.11129v1.pdf. GitHub: https://github.com/jaywalnut310/glow-tts
  • STYLER (submitted Mar 17, 2021). Paper: https://arxiv.org/pdf/2103.09474.pdf. GitHub: https://github.com/keonlee9420/STYLER
  • TorToiseTTS (submitted N/A). Paper: N/A. GitHub: https://github.com/neonbjb/tortoise-tts
  • DiffTTS (DiffSinger) (submitted Apr 3, 2021). Paper: https://arxiv.org/pdf/2104.01409v1.pdf. GitHub: https://github.com/keonlee9420/DiffSinger

MOS (Mean Opinion Score) is not included because each paper has a different score for each model.

** This model is not to be considered for implementation. It can be a reference, but it does not have an official GitHub implementation that I am aware of, nor is it very well known.

Vocoders (Mel-spectrogram to audio)

  • WaveNet (submitted Sept 12, 2016). Paper: https://arxiv.org/pdf/1609.03499v2.pdf. GitHub: N/A
  • WaveGlow (submitted Oct 31, 2018). Paper: https://arxiv.org/pdf/1811.00002.pdf. GitHub: https://github.com/NVIDIA/waveglow
  • HiFiGAN (submitted Oct 12, 2020). Paper: https://arxiv.org/pdf/2010.05646.pdf. GitHub: https://github.com/jik876/hifi-gan

Amendments

  • TalkNet source code has been removed from the NVIDIA/NeMo repo (commit #4082).
  • The NVIDIA/NeMo repo now links to FastPitch, MixerTTS, Tacotron2, and RadTTS for text-to-Mel-spectrogram models, and to HiFiGAN, UnivNet, and WaveGlow for vocoder models.
  • RadTTS seems to be similar to or based around Flowtron (autoregressive model).
  • MixerTTS seems to be similar to or based around FastPitch.
  • A number of models are heavily reliant on this monotonic align module. Such models currently include VITS, RadTTS, GradTTS, and GlowTTS.
  • Regarding GlowTTS, there is actually a Tensorflow implementation available here (), which may prove helpful for other models that use similar components.
  • STYLER and DiffTTS rely on the Montreal Forced Aligner (MFA) package.
  • Presentation from Microsoft: https://www.microsoft.com/en-us/research/uploads/prod/2022/12/Generative-Models-for-TTS.pdf
Answer 2 of 10 (13 points)
Tortoise TTS is supposed to be good. However, inference can take a while if not on GPUs, so it might not produce the real-time text-to-speech effect you want.
r/opensource on Reddit: Speech-to-text with AI?
August 31, 2023 -

I am searching for a fully open-source Python script or an application to seamlessly transcribe audio into text in French.

I dislike websites since they usually come with restrictions and limitations in most instances (approximately 99% of the time).

Do you know where I could find something suitable?

r/programming on Reddit: Open Source Speech to Text
February 12, 2010 -

I'd like something I can use to transcribe speech to text as part of a larger program. Googling open source speech to text, I see CMU Sphinx and Open Mind Speech. Any other options I should be aware of? Which is the most accurate?

Top answer (1 of 5, 139 points)
Hi, I'm a doctoral student studying speech recognition. Some of the most popular open source ASR engines are (including the Sphinx project, which you mention):

  • CMU Sphinx
  • HTK (Edit: as several people have pointed out, HTK is more accurately described as 'source available'. The situation is a bit complicated, so please refer to the website for details. My operational understanding is that you can modify HTK source and train up acoustic models with HTK for whatever purpose you like, but you cannot repackage and ship HTK source code.)
  • Julius
  • Juicer

The Sphinx and HTK projects contain software for training acoustic models from audio data, as well as the decoders. The other two are just decoders and require models trained with some other system (HTK). In addition to acoustic models, if you are planning to train up your own system you will also need to build or obtain:

  • A pronunciation dictionary, which contains entries consisting of the words you wish to transcribe and their corresponding pronunciation(s)
  • A language model or expert grammar to constrain the input speech

There are open source tools for building language models as well, but they vary in terms of their licenses. Some of the most popular language modeling tools are:

  • CMU SLM Toolkit
  • MIT LM
  • SRILM

There are also a variety of pre-trained open source acoustic models and language models available on the web:

  • CMU Open Source AMs
  • Keith Vertanen's models
  • VoxForge

In terms of tutorial and help information, the following are all good places to start:

  • VoxForge
  • CMU Robust Group Tutorial
  • HTK Book (requires registration)

I've never heard of Open Mind Speech, and it does not appear to be very well-maintained, so I'm not sure I can recommend it.

If you are interested in understanding what is actually going on behind the scenes in most modern ASR systems, I recommend the following papers:

  • A tutorial on hidden Markov models and selected applications in speech recognition (pdf)
  • Speech Recognition with Weighted Finite State Transducers (pdf)

In general, even with the many open source tools currently available, training up a high-quality large-vocabulary continuous speech recognition system from scratch is a non-trivial (read: pretty complicated) exercise. If you are determined to do it, my strongest recommendation is to try out the Sphinx tutorial, as it should take you from beginning to end on an open source toy corpus and leave you with the ability to modify or alter it yourself afterwards. Also, Sphinx is well-supported, completely open source, and has 3 different decoders with various strengths and weaknesses, one written in Java and 2 in C. Good luck!

Edit: If you are trying to include this in a VoIP application, there are also Sphinx plugins for popular open source software PBXs like Asterisk and FreeSwitch, and even a maturing MRCP server, UniMRCP. And Voxeo provides VXML-based solutions for this.
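To make the "language model" ingredient concrete, here is a toy bigram model in plain Python. The three-sentence corpus is made up for illustration; real toolkits like the CMU SLM Toolkit or SRILM add smoothing, backoff, and operate at a vastly larger scale:

```python
from collections import Counter, defaultdict

def train_bigram_lm(sentences):
    """Count bigram frequencies over whitespace-tokenized sentences,
    with <s>/</s> sentence-boundary markers."""
    bigrams = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            bigrams[prev][word] += 1
    return bigrams

def bigram_prob(bigrams, prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 if prev is unseen."""
    total = sum(bigrams[prev].values())
    return bigrams[prev][word] / total if total else 0.0

# Hypothetical toy corpus of the kind a decoder would be constrained by.
lm = train_bigram_lm(["open the door", "open the window", "close the door"])
print(bigram_prob(lm, "open", "the"))   # 1.0: "open" is always followed by "the"
print(bigram_prob(lm, "the", "door"))   # 2/3: "the" precedes "door" twice out of three
```

The decoder multiplies probabilities like these along candidate word sequences to prefer transcriptions that look like real language.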
Answer 2 of 5 (6 points)
Hi, thanks for bringing this topic up. I'm a marketing consultant who's nearly deaf. Using a phone or video chat is really difficult (often impossible) for me, which means I have to spend more time and resources on attracting new clients than a hearing consultant would. Being unable to phone also has a serious impact on job chances and my career. I've been thinking that speech-to-text software could help me with phone calls and video conferences. What I'm looking for would be speech-to-text software which:

  • plugs into Skype, VoIP with a Fritzbox or another SIP software, or even Asterisk (I have no experience with Asterisk yet, but would be willing to dive in)
  • has a roughly 50% success rate (I don't need perfect transcription, just enough to get the gist)
  • can handle unknown voices (clients, business partners, coworkers)
  • ideally works on Mac or Linux (Windows is fine, too, if nothing else is supported)
  • supports English and German

I've looked into Dragon NaturallySpeaking and MacSpeech, but they seem to only work with one's own voice, only after extensive training, and are more geared toward authoring texts. I'm looking for speech-to-text software that works on streaming / live audio (such as Skype). Does anybody have experience with what I'm trying to achieve? Does anybody know of such software, or could anybody give me some hints about which projects in blackkettle's post (e.g. Sphinx?) I should try out first? Also: deaf redditors, how do you handle business phone calls?

Additionally, I'm looking for speech-to-text software for subtitling videos. I've been thinking of creating some videos to place on YouTube, and I'd like to offer subtitles to fellow deaf people. Does anybody know of specialized speech-to-text software for video subtitling? Thanks a lot for any help and suggestions!
r/ChatGPTCoding on Reddit: I built an open source AI voice dictation app with fully customizable STT and LLM pipelines
3 days ago -

Tambourine is an open source, cross-platform voice dictation app that uses configurable STT and LLM pipelines to turn natural speech into clean, formatted text in any app.

I have been building this on the side for the past few weeks. The motivation was wanting something like Wispr Flow, but with full control over the models and prompts. I wanted to be able to choose which STT and LLM providers were used, tune formatting behavior, and experiment without being locked into a single black box setup.

The back end is a local Python server built on Pipecat. Pipecat provides a modular voice agent framework that makes it easy to stitch together different STT models and LLMs into a real-time pipeline. Swapping providers, adjusting prompts, or adding new processing steps does not require changing the desktop app, which makes experimentation much faster.

Speech is streamed in real time from the desktop app to the server. After transcription, the raw text is passed through an LLM that handles punctuation, filler word removal, formatting, list structuring, and personal dictionary rules. The formatting prompt is fully editable, so you can tailor the output to your own writing style or domain-specific language.
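As a rough sketch of that post-transcription cleanup stage - here done with plain string rules rather than an LLM, and with a made-up filler list and personal dictionary - the logic looks something like:

```python
import re

FILLERS = {"um", "uh"}                        # hypothetical filler-word list
PERSONAL_DICT = {"tamborine": "Tambourine"}   # hypothetical dictionary rules

def format_transcript(raw: str) -> str:
    """Clean a raw STT transcript: drop filler words, apply personal-dictionary
    corrections, then fix sentence casing and terminal punctuation."""
    words = []
    for word in raw.split():
        bare = word.lower().strip(",.")
        if bare in FILLERS:
            continue
        words.append(PERSONAL_DICT.get(bare, word))
    text = re.sub(r"\s+", " ", " ".join(words)).strip()
    return text[:1].upper() + text[1:] + ("" if text.endswith(".") else ".")

print(format_transcript("um so tamborine turns uh speech into text"))
# -> So Tambourine turns speech into text.
```

Swapping this function for an LLM call is what makes the real pipeline's behavior fully promptable rather than hard-coded.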

The desktop app is built with Tauri, with a TypeScript front end and Rust handling system level integration. This allows global hotkeys, audio device control, and text input directly at the cursor across platforms.

I shared an early version with friends and presented it at my local Claude Code meetup, and the feedback encouraged me to share it more widely.

This project is still under active development while I work through edge cases, but most core functionality already works well and is immediately useful for daily work. I would really appreciate feedback from people interested in voice interfaces, prompting strategies, latency tradeoffs, or model selection.

Happy to answer questions or go deeper into the pipeline.

https://github.com/kstonekuan/tambourine-voice

r/opensource on Reddit: TTS: Text-to-Speech for All. Open-sourced by Mozilla
July 4, 2020 - Since many of the key people are now at Coqui and because the rest of the AI people at Mozilla have been asked to work on other projects, I would expect the center of gravity for new development to shift to coqui. DeepSpeech was the state of the art for speech recognition in 2015 and still pretty cutting edge for certain contexts. Certainly one of the most accurate open source models.
r/speechtech on Reddit: I benchmarked 12+ speech-to-text APIs under various real-world conditions
May 2, 2025 -

Hi all, I recently ran a benchmark comparing a bunch of speech-to-text APIs and models under real-world conditions like noise robustness, non-native accents, and technical vocab, etc.

It includes all the big players like Google, AWS, MS Azure, open source models like Whisper (small and large), speech recognition startups like AssemblyAI / Deepgram / Speechmatics, and newer LLM-based models like Gemini 2.0 Flash/Pro and GPT-4o. I've benchmarked the real time streaming versions of some of the APIs as well.

I mostly did this to decide the best API to use for an app I'm building but figured this might be helpful for other builders too. Would love to know what other cases would be useful to include too.

Link here: https://voicewriter.io/speech-recognition-leaderboard

TLDR if you don't want to click on the link: the best model right now seems to be GPT-4o-transcribe, followed by Eleven Labs, Whisper-large, and the Gemini models. All the startups and AWS/Microsoft are decent with varying performance in different situations. Google (the original, not Gemini) is extremely bad.
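For anyone wanting to sanity-check such rankings themselves, leaderboards like this typically score models by word error rate (WER): the word-level edit distance between a reference transcript and the model's hypothesis, divided by the reference length. A small self-contained version, no ASR library needed:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference length,
    computed via Levenshtein distance on word sequences."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion -> 1/6
```

Lower is better; a WER of 0.0 means a perfect transcript, and values above 1.0 are possible when the hypothesis has many spurious insertions.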

Best Text to Speech Software According to Reddit
According to numerous Reddit forums, Murf AI, Lovo AI, Amazon Polly, and IBM Watson text to speech offer superior multilingual support compared to others on this list. ... Bark, a text to speech solution based on the Hugging Face platform, is ...