I am working on custom data-management software, and for a while now I've been looking into the possibility of integrating and modifying existing local conversational AIs into it (or at least making that possible in the future). The first thing I've struggled with is that information is hard to come by: searches often lead me back here to r/LocalLLaMA/ and to year-old threads in r/MachineLearning. Is anyone keeping track of what is out there and what is worth the attention? I am posting this in the hope of finding some info while also sharing what I know for anyone who finds it useful or is interested.
I've noticed that most open-source projects are based on OpenAI's Whisper and its re-implementations, such as:
Faster Whisper (MIT license)
Insanely fast Whisper (Apache-2.0 license)
Distil-Whisper (MIT license)
WhisperSpeech by github.com/collabora (MIT license, Added here 03/2025)
WhisperLive (MIT license, Added here 03/2025)
WhisperFusion, which is WhisperSpeech+WhisperLive in one package. (Added here 03/2025)
Coqui AI's TTS and STT models (MPL-2.0 license) have gained some traction, but their site states that they're shutting down.
Tortoise TTS (Apache-2.0 license) and its re-implemented versions such as:
Tortoise-TTS-fast (AGPL-3.0, Apache-2.0 licenses) and its slightly faster(?) fork (AGPL-3.0 license).
StyleTTS and its newer version:
StyleTTS2 (MIT license)
Alibaba Group's Tongyi SpeechTeam's SenseVoice (STT) [MIT license+possibly others] and CosyVoice (TTS) [Apache-2.0 license].
(11.2.2025): I will try to maintain this list, so I will begin adding new ones as well.
1/2025 Kokoro TTS (MIT License)
2/2025 Zonos by Zyphra (Apache-2.0 license)
3/2025 added: Metavoice (Apache-2.0 license)
3/2025 added: F5-TTS (MIT license)
3/2025 added: Orpheus-TTS by canopylabs.ai (Apache-2.0 license)
3/2025 added: MegaTTS3 (Apache-2.0 license)
4/2025 added: Index-tts (Apache-2.0 license). [Can be tried here.]
4/2025 added: Dia TTS (Apache-2.0 license) [Can be tried here.]
5/2025 added: Spark-TTS (Apache-2.0 license) [Can be tried here.]
5/2025 added: Parakeet TDT 0.6B V2 (CC-BY-4.0 license), STT, English only [Can be tried here.]. Update: V3 is multilingual and has an ONNX version.
8/2025 added: Verbify-TTS (MIT License) by reddit user u/MattePalte. Described as a simple, locally run screen-reader-style app.
8/2025 added: Chatterbox-TTS (MIT License) [Can be tried here.]
8/2025 added: Microsoft's VibeVoice TTS (MIT license) for generating consistent long-form dialogues. Comes in 1.5B and 7B sizes. Both models can be tried here. A 0.5B model is also on the way. This one also already has a ComfyUI wrapper by u/Fabix84/ (additional info here). Quantized versions by u/teachersecret can be found here.
8/2025 added: BosonAI's Higgs Audio TTS (Apache-2.0 license). Can be tried here and further tested here. This one supports complex long-form dialogues. Extra prompting is supposed to allow setting the scene and adjusting expressions. A quantized 4-bit fork is also available.
8/2025 added: Step-Audio 2 Mini by StepFun AI, a Chinese AI team (Apache-2.0 license), an 8B speech-to-speech (audio-to-tokens + tokens-to-audio) model. Added because it is related, even though it bypasses the "to-text" part.
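For anyone who wants to try one of the Whisper-based entries near the top of the list, here is a minimal sketch of transcribing a file with Faster Whisper. It assumes the `faster-whisper` package is installed; the model size, the CPU/int8 settings, and the helper names `format_segment` and `transcribe_file` are my own choices for illustration, not anything prescribed by the project.

```python
from typing import List


def format_segment(start: float, end: float, text: str) -> str:
    """Render one timestamped segment as a single printable line."""
    return f"[{start:7.2f}s -> {end:7.2f}s] {text.strip()}"


def transcribe_file(audio_path: str, model_size: str = "small") -> List[str]:
    """Transcribe an audio file and return one formatted line per segment."""
    # Imported lazily so format_segment works without the package installed.
    from faster_whisper import WhisperModel

    # Assumption: CPU with int8 weights to keep memory use low. On a GPU you
    # would typically pass device="cuda" and compute_type="float16" instead.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")

    # transcribe() returns a lazy generator of segments plus an info object.
    segments, info = model.transcribe(audio_path, beam_size=5)
    print(f"Detected language: {info.language}")

    # The actual decoding happens while the generator is consumed here.
    return [format_segment(s.start, s.end, s.text) for s in segments]
```

Calling `transcribe_file("audio.wav")` (the file name is a placeholder) downloads the model on first use and then returns the timestamped lines.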
---------------------------------------------------------
Edit1: Added Distil-Whisper because "insanely fast whisper" is not a model, but these were shipped together.
Edit2: StyleTTS2FineTune is not actually a different version of StyleTTS2, but rather a framework for fine-tuning it.
Edit3(11.2.2025): as suggested by u/caidong I added Kokoro TTS + also added Zonos to the list.
Edit4(20.3.2025): as suggested by u/Trysem , added WhisperSpeech, WhisperLive, WhisperFusion, Metavoice and F5-TTS.
Edit5(22.3.2025): Added Orpheus-TTS.
Edit6(28.3.2025): Added MegaTTS3.
Edit7(11.4.2025): as suggested by u/Trysem/, added Index-tts.
Edit8(24.4.2025): Added Dia TTS (Nari-labs).
Edit9(02.5.2025): Added Spark-TTS as suggested by u/Tandulim (here)
Edit9(02.5.2025): Added Parakeet TDT 0.6B V2. More info in this thread.
Edit10(29.8.2025): As originally suggested by u/Trysem and later by u/Nitroedge added Chatterbox-TTS to the list.
Edit10(29.8.2025): u/MattePalte asked me to add his own TTS called Verbify-TTS to the list.
Edit10(29.8.2025): Added Microsoft's recently released VibeVoice TTS, BosonAI's Higgs Audio TTS and StepFun's STS. +Extra info.
Edit11+12(1.9.2025): Added VibeVoice TTS's quantized versions and Parakeet V3.
I've got a 4090 and some stuff that I think it would be fun to have narrated. I've looked at some of the paid online options, and $20-$30/mo for 2 hours of AI TTS is not gonna cut it. Can anyone point me to software that I can run locally that'll give me high quality?
It seems like if people are making billions of waifus in stable diffusion there ought to be something like this out there.
Hi,
I am the developer of the SpeechPulse speech recognition application available for Windows.
SpeechPulse uses offline Whisper AI models and Whisper APIs for real-time speech recognition. It can type into any text input area, including text editors, web browsers, and office applications.
You can also use AI language models and OpenAI-compatible LLM APIs to enhance/transform your dictations in real time. SpeechPulse supports customizable AI templates so you can prompt your AI models and APIs for your requirements. Example use cases include grammar correction and text enhancement, Email formatting, text summarization, and code generation.
SpeechPulse also supports batch file transcription and subtitle generation. I also recently added automatic speaker diarization to the file mode. Now SpeechPulse can automatically detect how many speakers are in the audio file and then automatically segment the transcription for each individual speaker.
SpeechPulse has a one-time fee. You can also try SpeechPulse with its 30-day free trial.
I would appreciate hearing your feedback and suggestions!
Thanks.
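On the subtitle-generation point above: for readers curious what turning timestamped transcription segments into subtitles involves, here is a generic sketch that writes SRT text. It is an illustration only and says nothing about how SpeechPulse is implemented; the `(start, end, text)` tuple layout and the function names are assumptions made for the example.

```python
from typing import Iterable, Tuple

# Assumed segment layout: (start_seconds, end_seconds, text)
Segment = Tuple[float, float, str]


def srt_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT 'HH:MM:SS,mmm' timestamp form."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: Iterable[Segment]) -> str:
    """Build SRT text: a numbered cue per segment, blank line between cues."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"
        )
    return "\n".join(blocks)
```

A diarized transcript could reuse the same function by prefixing each segment's text with a speaker label before passing it in.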
My office laptop has blocked the Windows+H combination, which would otherwise let me dictate text seamlessly so that I don't have to type by hand. I'm looking for a similar tool, hopefully a portable one, that I can use on my office laptop. Could you please help?
Hi there, the Whisper model is the most powerful and most capable publicly available speech-to-text (STT) implementation I have ever seen. Is there an app that will place the transcription directly at my cursor in Windows and/or macOS?
The closest things I have seen to what I am asking for are:
Windows https://github.com/Const-me/Whisper
macOS https://superwhisper.com/
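If no ready-made app fits, the "place it at my cursor" part can be roughed out in a few lines. The sketch below assumes the transcription already exists as a list of segment strings (from Whisper or anything else) and uses the third-party `pynput` package to send keystrokes to whichever window has focus. The function names and the three-second delay are made up for the example, and this has none of the hotkey or microphone handling a real tool would need.

```python
import re
import time
from typing import Iterable


def join_segments(texts: Iterable[str]) -> str:
    """Merge segment texts into one string with single spaces between words."""
    return re.sub(r"\s+", " ", " ".join(texts)).strip()


def type_at_cursor(text: str, delay_seconds: float = 3.0) -> None:
    """Type text into the currently focused input field."""
    # pynput is third-party (pip install pynput); imported lazily so that
    # join_segments can be used without it.
    from pynput.keyboard import Controller

    # Give yourself time to click into the target text field first.
    time.sleep(delay_seconds)
    Controller().type(text)
```

Usage would be something like `type_at_cursor(join_segments(segment_texts))`, where `segment_texts` is a placeholder for whatever your STT step returns. On macOS the terminal or app running the script typically needs accessibility permission before synthetic keystrokes work.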