I've got a 4090 and some stuff that I think it would be fun to have narrated. I've looked at some of the paid online options and $20-$30/mo for 2 hours of AI TTS is not gonna cut it. Can anyone point me to software that I can run locally that'll give me high quality?
It seems like if people are making billions of waifus in stable diffusion there ought to be something like this out there.
I am working on custom data-management software, and for a while now I've been looking into the possibility of integrating and modifying existing local conversational AIs into it (or at least keeping that option open for the future). The first thing I've been struggling with is that information is somewhat hard to come by - searches often lead me back here to r/LocalLLaMA/ and to year-old threads in r/MachineLearning. Is anyone keeping track of what's out there and what's worth attention? I am posting this here in hope of finding some info, while also sharing what I know for anyone who finds it useful or is interested.
I've noticed that most open source projects are based on OpenAI's Whisper and its re-implemented versions like:
Faster Whisper (MIT license)
Insanely fast Whisper (Apache-2.0 license)
Distil-Whisper (MIT license)
WhisperSpeech by github.com/collabora (MIT license, Added here 03/2025)
WhisperLive (MIT license, Added here 03/2025)
WhisperFusion, which is WhisperSpeech+WhisperLive in one package. (Added here 03/2025)
Coqui AI's TTS and STT -models (MPL-2.0 license) have gained some traction, but on their site they have stated that they're shutting down.
Tortoise TTS (Apache-2.0 license) and its re-implemented versions such as:
Tortoise-TTS-fast (AGPL-3.0, Apache-2.0 licenses) and its slightly faster(?) fork (AGPL-3.0 license).
StyleTTS and its newer version:
StyleTTS2 (MIT license)
Alibaba Group's Tongyi SpeechTeam's SenseVoice (STT) [MIT license+possibly others] and CosyVoice (TTS) [Apache-2.0 license].
(11.2.2025): I will try to maintain this list, so I will begin adding new ones as well.
1/2025 Kokoro TTS (MIT License)
2/2025 Zonos by Zyphra (Apache-2.0 license)
3/2025 added: Metavoice (Apache-2.0 license)
3/2025 added: F5-TTS (MIT license)
3/2025 added: Orpheus-TTS by canopylabs.ai (Apache-2.0 license)
3/2025 added: MegaTTS3 (Apache-2.0 license)
4/2025 added: Index-tts (Apache-2.0 license). [Can be tried here.]
4/2025 added: Dia TTS (Apache-2.0 license) [Can be tried here.]
5/2025 added: Spark-TTS (Apache-2.0 license)[Can be tried here.]
5/2025 added: Parakeet TDT 0.6B V2 (CC-BY-4.0 license), STT English only [Can be tried here.], update: V3 is multilingual and has an ONNX version.
8/2025 added: Verbify-TTS (MIT License) by reddit user u/MattePalte. Described as simple locally run screen-reader-style app.
8/2025 added: Chatterbox-TTS (MIT License) [Can be tried here.]
8/2025 added: Microsoft's VibeVoice TTS (MIT license) for generating consistent long-form dialogues. Comes in 1.5B and 7B sizes; both models can be tried here, and a 0.5B model is also on the way. It already has a ComfyUI wrapper by u/Fabix84/ (additional info here). Quantized versions by u/teachersecret can be found here.
8/2025 added: BosonAI's Higgs Audio TTS (Apache-2.0 license). Can be tried here and further tested here. This one supports complex long-form dialogues. Extra prompting is supposed to allow setting the scene and adjusting expressions. Also has a quantized (4bit fork) version.
8/2025 added: StepFun AI's (Chinese AI team) Step-Audio 2 Mini (Apache-2.0 license), an 8B speech-to-speech (Audio-To-Tokens + Tokens-To-Audio) model. Added because it's related, even though it bypasses the "to-text" part.
---------------------------------------------------------
Edit1: Added Distil-Whisper because "insanely fast whisper" is not a model, but these were shipped together.
Edit2: StyleTTS2FineTune is not actually a different version of StyleTTS2, but rather a framework for fine-tuning it.
Edit3(11.2.2025): as suggested by u/caidong I added Kokoro TTS + also added Zonos to the list.
Edit4(20.3.2025): as suggested by u/Trysem , added WhisperSpeech, WhisperLive, WhisperFusion, Metavoice and F5-TTS.
Edit5(22.3.2025): Added Orpheus-TTS.
Edit6(28.3.2025): Added MegaTTS3.
Edit7(11.4.2025): as suggested by u/Trysem/, added Index-tts.
Edit8(24.4.2025): Added Dia TTS (Nari-labs).
Edit9(02.5.2025): Added Spark-TTS as suggested by u/Tandulim (here)
Edit9(02.5.2025): Added Parakeet TDT 0.6B V2. More info in this thread.
Edit10(29.8.2025): As originally suggested by u/Trysem and later by u/Nitroedge added Chatterbox-TTS to the list.
Edit10(29.8.2025): u/MattePalte asked me to add his own TTS called Verbify-TTS to the list.
Edit10(29.8.2025): Added Microsoft's recently released VibeVoice TTS, BosonAI's Higgs Audio TTS and StepFun's STS. +Extra info.
Edit11+12(1.9.2025): Added VibeVoice TTS's quantized versions and Parakeet V3.
Hey all,
I'm looking for a good text-to-speech software with as natural sounding voices as possible, preferably with option to train on local machine with AMD GPU. It doesn't "have" to be free, but would be ideal if it was.
Basically, in 2024 I've got ~40 long-form papers I need to complete, each with ~150,000 characters or more, basically technical essays on certain subjects in my work. These projects need to have a voice-over, as they will be distributed to certain vision-impaired groups.
I've checked out quite a few platforms that offer cloud-based TTS, but the majority of them either don't sound natural or their pricing is completely off the charts. The best ones I've found so far are genny.lovo.ai and ElevenLabs.ai, with Genny being the better-sounding of the two, but both of their prices are completely insane. Basically, with either platform I'm looking at ~$2,500-3,500 to complete all of the already planned work, and there's a possibility even more will be assigned, so the price will only increase.
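For context, here's a rough back-of-the-envelope from the numbers in this post (the implied per-character rates are my own derivation from the quoted totals, not any provider's published pricing):

```python
# Rough cost arithmetic from the figures above.
papers = 40
chars_per_paper = 150_000
total_chars = papers * chars_per_paper  # 6,000,000 characters

# Quoted cloud-TTS totals for the whole workload (USD)
low_quote, high_quote = 2_500, 3_500

# Implied price per 1,000 characters
per_1k_low = low_quote / (total_chars / 1_000)
per_1k_high = high_quote / (total_chars / 1_000)

print(total_chars)            # 6000000
print(round(per_1k_low, 3))   # 0.417 (USD per 1k chars)
print(round(per_1k_high, 3))  # 0.583 (USD per 1k chars)
```

At roughly $0.40-0.60 per thousand characters, it's easy to see why a one-time local setup looks attractive for a workload this size.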
Another option I've discovered is Descript, which has freemium tier, but their voice selection is pretty poor, and not ideal for my project.
Ideally, I'd like the ability to train my own voice model on my local machine, but the problem is that the majority of models, like RVCv2, MangioRVC, ApollioRVC, TortoiseTTS etc., require a GPU with CUDA support, AKA an Nvidia GPU. And even old and decrepit stuff like 1070s in my area exceeds $300-400 per card (the economy is screwed around here), so that's not an option.
Does anyone know if there even exists any decent TTS software with the ability to locally train a voice model on an AMD GPU that doesn't impose length limits on projects?
Any help appreciated.
Hi everyone! I'm a developer who also listens to audiobooks. I use AI text-to-speech and voice cloning for my personal projects and sometimes to read fiction stories out loud.
I tested ElevenLabs, speechify, play.ht, Fish Audio, murf ai, resemble ai, and a couple others... Fish Audio honestly blew me away with the quality of their voices.
I cloned myself and it sounded indistinguishable from real life. Their text-to-speech sounds as natural as real human speech and you can inject pauses and emotional tones to perfect it.
They also offer a free plan - you can check them out at https://fish.audio !
If you want tips, settings I used, or anything else let me know!
Disclaimer: I am NOT affiliated with any of these companies in any way
Does anyone have a good local setup for text-to-speech? A while back I used Tortoise TTS, which had reasonable quality (still behind 11labs) but was very slow. I see XTTS might have taken this further. Unfortunately, there have been no updates to Tortoise TTS since the developer was snapped up by OpenAI (though I guess that's great for the development of TTS in GPT-4o).
Hey folks! Text-to-Speech (TTS) models have been pretty popular recently, and one way to customize them (e.g. cloning a voice) is by fine-tuning the model. There are other methods, but however you do training, if you want speaking speed, phrasing, vocal quirks, and the subtleties of prosody - the things that give a voice its personality and uniqueness - you'll need to create a dataset and do a bit of training. You can do it completely locally (as we're open-source), and training is ~1.5x faster with 50% less VRAM compared to all other setups: https://github.com/unslothai/unsloth
Our showcase examples aren't the 'best' - they were only trained for 60 steps on an average open-source dataset. Of course, the longer you train and the more effort you put into your dataset, the better it will be. We use female voices just to show that it works (as they're the only decent public open-source datasets available), but you can actually use any voice you want - e.g. Jinx from League of Legends - as long as you make your own dataset.
We support models like OpenAI/whisper-large-v3 (which is a Speech-to-Text STT model), Sesame/csm-1b, CanopyLabs/orpheus-3b-0.1-ft, and pretty much any Transformer-compatible models including LLasa, Outte, Spark, and others. The goal is to clone voices, adapt speaking styles and tones, support new languages, handle specific tasks and more.
We’ve made notebooks to train, run, and save these models for free on Google Colab. Some models aren’t supported by llama.cpp and will be saved only as safetensors, but others should work. See our TTS docs and notebooks: https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning
The training process is similar to SFT, but the dataset includes audio clips with transcripts. We use a dataset called ‘Elise’ that embeds emotion tags like <sigh> or <laughs> into transcripts, triggering expressive audio that matches the emotion.
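To make the tagged-transcript idea concrete, here is a minimal sketch of turning (audio, transcript, emotion) records into transcripts with embedded emotion tags. The field names and tag placement are hypothetical illustrations, not the actual 'Elise' dataset schema:

```python
# Sketch: embedding emotion tags like <sigh> or <laughs> into transcripts,
# as described above. Field names here are hypothetical, not Elise's schema.
def tag_transcript(text, emotion):
    """Prefix the transcript with an emotion tag, if one is present."""
    if emotion is None:
        return text
    return f"<{emotion}> {text}"

records = [
    {"audio": "clip_001.wav", "text": "I can't believe it worked.", "emotion": "laughs"},
    {"audio": "clip_002.wav", "text": "Fine, I'll do it myself.", "emotion": "sigh"},
    {"audio": "clip_003.wav", "text": "Hello there.", "emotion": None},
]

# Pair each audio clip with its tagged transcript for training
dataset = [
    {"audio": r["audio"], "text": tag_transcript(r["text"], r["emotion"])}
    for r in records
]

print(dataset[0]["text"])  # <laughs> I can't believe it worked.
```

During training the model learns to associate the tag token with the expressive audio in the clip, so at inference time writing `<laughs>` in the prompt triggers matching audio.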
Since TTS models are usually small, you can train them using 16-bit LoRA, or go with full fine-tuning (FFT). Loading a 16-bit LoRA model is simple.
And here are our TTS notebooks:
| Sesame-CSM (1B) | Orpheus-TTS (3B) | Whisper Large V3 | Spark-TTS (0.5B) |
|---|---|---|---|
Thank you for reading and please do ask any questions - I will be replying to every single one!
I know OpenAI recently released Whisper V3 Turbo, but I remember hearing about some other ones that are a lot better - I just can't remember which.
Looking to have voice cloning, text-to-speech, different voices etc... do we have anything like that that is GUI-based?
I'm looking for a high quality, free text to speech (TTS) AI that sounds as close to a real human as possible — preferably with multiple voice options (male/female, accents, etc). It could be an app, website, or downloadable software. I’ve tried a few that sound robotic or overly synthetic, and I’m hoping there’s something better out there that still doesn’t require payment.
Any recommendations?
Hey Redditors,
I’m currently exploring different text-to-speech (TTS) AI tools and was wondering what everyone here is using. I’m looking for something that delivers natural-sounding voices, supports multiple languages, and ideally offers some customization features like speed, pitch, or emotion.
Last week, the Supertone team released Supertonic, an extremely fast and high-quality text-to-speech model. So, I created a demo for it that uses Transformers.js and ONNX Runtime Web to run the model 100% locally in the browser on WebGPU. The original authors made a web demo too, and I did my best to optimize the model as much as possible (up to ~40% faster in my tests, see below).
I was even able to generate a ~5 hour audiobook in under 3 minutes. Amazing, right?!
Link to demo (+ source code): https://huggingface.co/spaces/webml-community/Supertonic-TTS-WebGPU
* From my testing, for the same 226-character paragraph (on the same device): the newly-optimized model ran at ~1750.6 characters per second, while the original ran at ~1255.6 characters per second.
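Checking my own numbers above, the throughput figures work out to the claimed speedup:

```python
# Verifying the "~40% faster" claim from the throughput numbers above.
optimized = 1750.6  # characters/second (newly-optimized model)
original = 1255.6   # characters/second (original model)

speedup = optimized / original
percent_faster = (speedup - 1) * 100

print(round(percent_faster, 1))  # 39.4 -> "up to ~40% faster"
```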
I tried using Edge's built-in Read Aloud and it always gets confused reading a Reddit post. I like to do aaaalot of research on Reddit, so I figure if I can find a good app, I can multitask and do other stuff while the program is speaking to me. I use Android and Windows 10.
I tried to research this a while ago but couldn't find any answers.
Don't know if this is the right place to ask, but... I was looking for a text-to-speech alternative to the quite expensive online services I've been looking at recently.
I'm partially blind and it would be of great help to have a recorded, narrated version of some technical e-books I own.
As I was saying, models like ElevenLabs and similar are really quite good, but absolutely too expensive in terms of €/time for what I need to do (and the books are quite long too).
Because of that, I was wondering if there is a good alternative to run locally (the normal TTS is quite abysmal and distracting) that can turn a book into audio and let me save an MP3 or similar file for later use.
I have to say, also, that I'm not a programmer whatsoever, so I should be able to follow simple instructions but, sadly, nothing more. So... a ready-to-use solution would be quite nice (or a detailed, like-I'm-a-3yo set of instructions).
I'm using Ollama + Docker and the free Open WebUI for playing (literally) with some offline models, and I'm also thinking about using something compatible with this already-running system... hopefully, possibly?
Another complication is that I'm Italian, so... the probably nonexistent model should be capable of handling Italian too...
The following are my PC specs, if needed:
Processor: intel i7 13700k
MB: Asus ROG Z790-H
Ram: 64gb Corsair 5600 MT/S
Gpu: RTX 4070TI 12gb - MSI Ventus 3X
Storage: Samsung 970EVO NVME SSD + others
Windows 11 PRO 64bit
Sorry for the long post and thank you for any help :)
I think Llama 4 will have multimodality (including audio input/output), but until then, what do people use for going from text to speech, and how do they run it locally? (Ollama does not support these kinds of models, does it?)
I am looking for the best Speech-to-Text/Speech Recognition models - can anyone recommend any?
I can't find anything well built enough to enable voice assistants. Right now I am just using Open WebUI plus an OpenAI TTS endpoint, but it isn't really that great.
Is there anything out there already?
I built a simple, offline speech-to-text app after breaking my finger - now open sourcing it
TL;DR: Made a cross-platform speech-to-text app using whisper.cpp that runs completely offline. Press shortcut, speak, get text pasted anywhere. It's rough around the edges but works well and is designed to be easily modified/extended - including adding LLM calls after transcription.
Background
I broke my finger a while back and suddenly couldn't type properly. Tried existing speech-to-text solutions but they were either subscription-based, cloud-dependent, or I couldn't modify them to work exactly how I needed for coding and daily computer use.
So I built Handy - intentionally simple speech-to-text that runs entirely on your machine using whisper.cpp (Whisper Small model). No accounts, no subscriptions, no data leaving your computer.
What it does
Press keyboard shortcut → speak → press again (or use push-to-talk)
Transcribes with whisper.cpp and pastes directly into whatever app you're using
Works across Windows, macOS, Linux
GPU accelerated where available
Completely offline
That's literally it. No fancy UI, no feature creep, just reliable local speech-to-text.
Why I'm sharing this
This was my first Rust project and there are definitely rough edges, but the core functionality works well. More importantly, I designed it to be easily forkable and extensible because that's what I was looking for when I started this journey.
The codebase is intentionally simple - you can understand the whole thing in an afternoon. If you want to add LLM integration (calling an LLM after transcription to rewrite/enhance the text), custom post-processing, or whatever else, the foundation is there and it's straightforward to extend.
I'm hoping it might be useful for:
People who want reliable offline speech-to-text without subscriptions
Developers who want to experiment with voice computing interfaces
Anyone who prefers tools they can actually modify instead of being stuck with someone else's feature decisions
Project Reality
There are known bugs and architectural decisions that could be better. I'm documenting issues openly because I'd rather have people know what they're getting into. This isn't trying to compete with polished commercial solutions - it's trying to be the most hackable and modifiable foundation for people who want to build their own thing.
If you're looking for something perfect out of the box, this probably isn't it. If you're looking for something you can understand, modify, and make your own, it might be exactly what you need.
Would love feedback from anyone who tries it out, especially if you run into issues or see ways to make the codebase cleaner and more accessible for others to build on.
I just got Applio to try out its TTS voice synthesiser. However, the problem is that the base TTS Applio uses before it converts the voices is incredibly robotic and stiff, making the TTS basically useless.
Are there any actually halfway decent TTS voices I can use so I can import them into Applio?
From what I've seen, only ElevenLabs has natural-sounding voices, and they also have an emotional range. However, you still need to figure things out through trial and error, which means you're burning through the allowed words you've paid for just trying to get something decent.