Try TortoiseTTS on the highest quality setting. (Answer from Royal-Landscape9353 on reddit.com)
r/LocalLLaMA on Reddit: Best local open source Text-To-Speech and Speech-To-Text?
August 23, 2024 -

I am working on custom data-management software, and for a while now I've been looking into the possibility of integrating and modifying existing local conversational AIs into it (or at least keeping that possibility open for the future). The first thing I've struggled with is that information is somewhat hard to come by - searches often lead me back here to r/LocalLLaMA and to year-old threads in r/MachineLearning. Is anyone keeping track of what is out there and what is worth attention? I am posting this here in hope of finding some info, while also sharing what I know for anyone who finds it useful or is interested.

I've noticed that most open source projects are based on OpenAI's Whisper and its reimplementations, such as:

  • Faster Whisper (MIT license)

  • Insanely fast Whisper (Apache-2.0 license)

  • Distil-Whisper (MIT license)

  • WhisperSpeech by github.com/collabora (MIT license, Added here 03/2025)

  • WhisperLive (MIT license, Added here 03/2025)

  • WhisperFusion, which is WhisperSpeech+WhisperLive in one package. (Added here 03/2025)

Coqui AI's TTS and STT models (MPL-2.0 license) have gained some traction, but they have stated on their site that they're shutting down.

Tortoise TTS (Apache-2.0 license) and its re-implemented versions such as:

  • Tortoise-TTS-fast (AGPL-3.0, Apache-2.0 licenses) and its slightly faster(?) fork (AGPL-3.0 license).

StyleTTS and its newer version:

  • StyleTTS2 (MIT license)

Alibaba Group's Tongyi SpeechTeam's SenseVoice (STT) [MIT license+possibly others] and CosyVoice (TTS) [Apache-2.0 license].

(11.2.2025): I will try to maintain this list, so I will begin adding new ones as well.

1/2025 Kokoro TTS (MIT License)
2/2025 Zonos by Zyphra (Apache-2.0 license)
3/2025 added: Metavoice (Apache-2.0 license)
3/2025 added: F5-TTS (MIT license)
3/2025 added: Orpheus-TTS by canopylabs.ai (Apache-2.0 license)
3/2025 added: MegaTTS3 (Apache-2.0 license)
4/2025 added: Index-tts (Apache-2.0 license). [Can be tried here.]
4/2025 added: Dia TTS (Apache-2.0 license) [Can be tried here.]
5/2025 added: Spark-TTS (Apache-2.0 license)[Can be tried here.]
5/2025 added: Parakeet TDT 0.6B V2 (CC-BY-4.0 license), STT, English only [Can be tried here.] Update: V3 is multilingual and has an ONNX version.

8/2025 added: Verbify-TTS (MIT License) by reddit user u/MattePalte. Described as a simple, locally run screen-reader-style app.

8/2025 added: Chatterbox-TTS (MIT License) [Can be tried here.]

8/2025 added: Microsoft's VibeVoice TTS (MIT License) for generating consistent long-form dialogues. Comes in 1.5B and 7B sizes; both models can be tried here. A 0.5B model is also on the way. This one also already has a ComfyUI wrapper by u/Fabix84/ (additional info here). Quantized versions by u/teachersecret can be found here.

8/2025 added: BosonAI's Higgs Audio TTS (Apache-2.0 license). Can be tried here and further tested here. This one supports complex long-form dialogues; extra prompting is supposed to allow setting the scene and adjusting expressions. Also has a quantized (4-bit fork) version.

8/2025 added: StepFun AI's (a Chinese AI team) Step-Audio 2 Mini (Apache-2.0 license), an 8B "speech-to-speech" (audio-to-tokens + tokens-to-audio) model. Added because it's related, even though it bypasses the "to-text" part.

---------------------------------------------------------

Edit1: Added Distil-Whisper because "insanely fast whisper" is not a model, but these were shipped together.
Edit2: StyleTTS2FineTune is not actually a different version of StyleTTS2, but rather a framework for finetuning it.
Edit3(11.2.2025): as suggested by u/caidong I added Kokoro TTS + also added Zonos to the list.
Edit4(20.3.2025): as suggested by u/Trysem , added WhisperSpeech, WhisperLive, WhisperFusion, Metavoice and F5-TTS.
Edit5(22.3.2025): Added Orpheus-TTS.
Edit6(28.3.2025): Added MegaTTS3.
Edit7(11.4.2025): as suggested by u/Trysem/, added Index-tts.
Edit8(24.4.2025): Added Dia TTS (Nari-labs).
Edit9(02.5.2025): Added Spark-TTS as suggested by u/Tandulim (here)
Edit9(02.5.2025): Added Parakeet TDT 0.6B V2. More info in this thread.

Edit10(29.8.2025): As originally suggested by u/Trysem and later by u/Nitroedge added Chatterbox-TTS to the list.

Edit10(29.8.2025): u/MattePalte asked me to add his own TTS called Verbify-TTS to the list.

Edit10(29.8.2025): Added Microsoft's recently released VibeVoice TTS, BosonAI's Higgs Audio TTS and StepFun's STS. +Extra info.

Edit11+12(1.9.2025): Added VibeVoice TTS's quantized versions and Parakeet V3.

Top answer (1 of 38, score 72)

I’ve been trying to keep a list of TTS solutions. Here you go:

Text to Speech Solutions

  • 11labs - Commercial
  • xtts / xtts2
  • Alltalk
  • Styletts2
  • Fish-Speech
  • PiperTTS - A fast, local neural text-to-speech system optimized for the Raspberry Pi 4
  • PiperUI
  • Paroli - Streaming-mode implementation of Piper TTS with RK3588 NPU acceleration support
  • Bark
  • Tortoise TTS
  • LMNT
  • AlwaysReddy (uses Piper)
  • Open-LLM-VTuber
  • MeloTTS
  • OpenVoice
  • Sherpa-onnx
  • Silero
  • Neuro-sama
  • Parler TTS
  • Chat TTS
  • VallE-X
  • Coqui TTS
  • Daswers XTTS GUI
  • VoiceCraft - Zero-Shot Speech Editing and Text-to-Speech
Answer 2 of 38 (score 15)
I’ve been using alltalktts ( https://github.com/erew123/alltalk_tts ), which is based off of Coqui and supports XTTS2, Piper and some others. I’m on a Mac, so my options are pretty limited, and this worked fairly well. If xtts is the model you want to go with, then maybe https://github.com/daswer123/xtts-api-server would work even better. Unfortunately most of my use cases are in SillyTavern, for narration and character TTS, so these may not match your use case. The last link I shared might give you ideas for how to implement that in a real application though. Are you a dev-like person, or just enthusiastic about it? I ask because if you’re a dev with some Python knowledge, or a willingness to follow code, the latter link is actually pretty useful for ideas, in spite of being targeted at SillyTavern. If not, this whole space might be kind of hard to navigate at this point in time, and it will also depend a lot on the hardware where you’ll be deploying this.
r/MachineLearning on Reddit: [P]Natural sounding text-to-speech, preferably free, with option to train voice on local machine with AMD GPU?
January 1, 2024 -

Hey all,

I'm looking for a good text-to-speech software with as natural sounding voices as possible, preferably with option to train on local machine with AMD GPU. It doesn't "have" to be free, but would be ideal if it was.

Basically, in 2024 I've got ~40 long-form papers I need to complete, each with ~150,000 characters or more, basically technical essays on certain subjects in my work. These projects need to have a voice-over, as they will be distributed to certain vision-impaired groups.

I've checked out quite a few platforms that offer cloud-based TTS, but the majority of them either don't sound natural, or the pricing is completely off the charts. The best ones I've found so far are genny.lovo.ai and ElevenLabs.ai, with Genny being the better-sounding of the two, but both of their prices are completely insane. With either platform I'm looking at ~$2,500-3,500 to complete all of the already planned work, and there's a possibility even more will be assigned, so the price will only increase.
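For scale, a rough back-of-the-envelope check of that figure lines up with the quoted range. The per-character rate below is an assumption for illustration (roughly the ballpark of commercial TTS pricing), not a quoted price from any specific provider:

```python
# Rough cost estimate for cloud TTS at a per-character rate.
# The $0.0005/character rate is a hypothetical assumption,
# not an actual price from Genny or ElevenLabs.
papers = 40
chars_per_paper = 150_000
rate_per_char = 0.0005  # USD per character, assumed

total_chars = papers * chars_per_paper
estimated_cost = total_chars * rate_per_char

print(f"{total_chars:,} characters")   # 6,000,000 characters
print(f"${estimated_cost:,.0f}")       # $3,000
```

At that assumed rate, 6 million characters comes to about $3,000 - squarely inside the ~$2,500-3,500 range quoted above, before any additional assignments.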

Another option I've discovered is Descript, which has freemium tier, but their voice selection is pretty poor, and not ideal for my project.

Ideally, I'd like the ability to train my own voice model on my local machine, but the problem is that the majority of models, like RVCv2, MangioRVC, ApollioRVC, TortoiseTTS etc., require a GPU with CUDA support, AKA an Nvidia GPU. And even old and decrepit stuff like 1070's exceeds $300-400 per card in my area (the economy is screwed around here), so that's not an option.

Does anyone know if there even exist any decent TTS software with the ability to locally train a voice model on AMD GPU, that doesn't impose length-size limits on projects?

Any help appreciated.

r/TextToSpeech on Reddit: I tested 10 AI text-to-speech voice tools — this one was the best, natural and expressive (with free version)
3 weeks ago -

Hi everyone! I'm a developer who also listens to audiobooks. I use AI text-to-speech and voice cloning for my personal projects and sometimes to read fiction stories out loud.

I tested ElevenLabs, Speechify, Play.ht, Fish Audio, Murf AI, Resemble AI, and a couple of others... Fish Audio honestly blew me away with the quality of their voices.

I cloned myself and it sounded indistinguishable from real life. Their text-to-speech sounds as natural as real human speech and you can inject pauses and emotional tones to perfect it.

They also offer a free plan - you can check them out at https://fish.audio !

If you want tips, settings I used, or anything else let me know!

Disclaimer: I am NOT affiliated with any of these companies in any way

r/LocalLLaMA on Reddit: Local Text To Speech
May 27, 2024 -

Does anyone have a good local setup for text to speech? A while back, I used Tortoise TTS, which had reasonable quality (still behind 11labs) but was very slow. I see xtts might have taken this further. Unfortunately, there have been no updates to Tortoise TTS since the developer was snapped up by OpenAI (though I guess that's great for the development of TTS in GPT-4o).

r/artificial on Reddit: You can now train your own Text-to-Speech (TTS) models locally!
May 28, 2025 -

Hey folks! Text-to-Speech (TTS) models have been pretty popular recently, and one way to customize them (e.g. cloning a voice) is by fine-tuning the model. There are other methods, but training is the way to go if you want speaking speed, phrasing, vocal quirks, and the subtleties of prosody - the things that give a voice its personality and uniqueness. So, you'll need to create a dataset and do a bit of training. You can do it completely locally (as we're open-source), and training is ~1.5x faster with 50% less VRAM compared to all other setups: https://github.com/unslothai/unsloth

  • Our showcase examples aren't the 'best': they were only trained for 60 steps on an average open-source dataset. Of course, the longer you train and the more effort you put into your dataset, the better it will be. We use female voices just to show that it works (as they're the only decent public open-source datasets available), but you can actually use any voice you want - e.g. Jinx from League of Legends - as long as you make your own dataset.

  • We support models like OpenAI/whisper-large-v3 (which is a Speech-to-Text, STT, model), Sesame/csm-1b, CanopyLabs/orpheus-3b-0.1-ft, and pretty much any Transformer-compatible model, including LLasa, Outte, Spark, and others.

  • The goal is to clone voices, adapt speaking styles and tones, support new languages, handle specific tasks and more.

  • We’ve made notebooks to train, run, and save these models for free on Google Colab. Some models aren’t supported by llama.cpp and will be saved only as safetensors, but others should work. See our TTS docs and notebooks: https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning

  • The training process is similar to SFT, but the dataset includes audio clips with transcripts. We use a dataset called ‘Elise’ that embeds emotion tags like <sigh> or <laughs> into transcripts, triggering expressive audio that matches the emotion.
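As a toy illustration of what such a dataset row might look like, embedding emotion tags into a transcript is just string markup. The helper function, tag set, and file path below are hypothetical, not the actual schema of the 'Elise' dataset or Unsloth's pipeline:

```python
# Hypothetical helper that embeds emotion tags into a transcript,
# in the spirit of the 'Elise' dataset described above.
# The tag names and this function are illustrative assumptions.
ALLOWED_TAGS = {"sigh", "laughs", "gasps"}

def tag_transcript(text: str, tag: str, position: int = 0) -> str:
    """Insert an <emotion> tag into a transcript at a word boundary."""
    if tag not in ALLOWED_TAGS:
        raise ValueError(f"unknown emotion tag: {tag}")
    words = text.split()
    words.insert(position, f"<{tag}>")
    return " ".join(words)

row = {
    "audio": "clips/utterance_042.wav",  # path is illustrative
    "text": tag_transcript("well, that could have gone better", "sigh"),
}
print(row["text"])  # <sigh> well, that could have gone better
```

During fine-tuning, pairs like this teach the model to produce expressive audio whenever a tag appears in the input text.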

  • Since TTS models are usually small, you can train them using 16-bit LoRA, or go with full fine-tuning (FFT). Loading a 16-bit LoRA model is simple.

And here are our TTS notebooks:

  • Sesame-CSM (1B)
  • Orpheus-TTS (3B)
  • Whisper Large V3
  • Spark-TTS (0.5B)

Thank you for reading and please do ask any questions - I will be replying to every single one!

r/3CX on Reddit: What’s the Best Text-to-Speech AI Tool You’re Using Right Now?
January 12, 2025 -

Hey Redditors,

I’m currently exploring different text-to-speech (TTS) AI tools and was wondering what everyone here is using. I’m looking for something that delivers natural-sounding voices, supports multiple languages, and ideally offers some customization features like speed, pitch, or emotion.

r/LocalLLaMA on Reddit: Supertonic WebGPU: blazingly fast text-to-speech running 100% locally in your browser.
3 weeks ago -

Last week, the Supertone team released Supertonic, an extremely fast and high-quality text-to-speech model. So, I created a demo for it that uses Transformers.js and ONNX Runtime Web to run the model 100% locally in the browser on WebGPU. The original authors made a web demo too, and I did my best to optimize the model as much as possible (up to ~40% faster in my tests, see below).

I was even able to generate a ~5 hour audiobook in under 3 minutes. Amazing, right?!

Link to demo (+ source code): https://huggingface.co/spaces/webml-community/Supertonic-TTS-WebGPU

* From my testing, for the same 226-character paragraph (on the same device): the newly-optimized model ran at ~1750.6 characters per second, while the original ran at ~1255.6 characters per second.
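Those two throughput numbers are consistent with the "up to ~40% faster" claim above; the check is simple arithmetic:

```python
# Verify the reported speedup from the throughput measurements above.
optimized = 1750.6  # characters per second, optimized model
original = 1255.6   # characters per second, original model

speedup = optimized / original - 1.0
print(f"{speedup:.1%}")  # 39.4%
```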

r/TextToSpeech on Reddit: What are good free text to speech programs with natural voices that can actually read reddit posts?
March 6, 2025 -

I tried using Microsoft Edge's Read Aloud and it always gets confused reading a Reddit post. I do a LOT of research on Reddit, so I figure if I can find a good app, I can multitask and do other stuff while the program is speaking to me. I use Android and Windows 10.

I tried to research this a while ago but couldn't find any answers.

r/ollama on Reddit: Local TTS (text-to-speech) AI model with a human voice and file output?
February 10, 2025 -

Don't know if this is the right place to ask, but... I was looking for a text-to-speech alternative to the quite expensive online ones I've looked at recently.

I'm partially blind and it would be of great help to have a recorded and narrated version of some technical e-books i own.

As I was saying, services like ElevenLabs and similar are really quite good, but absolutely too expensive in terms of €/time for what I need to do (and the books are quite long too).

I was wondering, because of that, if there is a good alternative to run locally (normal system TTS is quite abysmal and distracting) that can turn a book into audio and let me save an MP3 or similar file for later use.

I have to say, also, that I'm not a programmer whatsoever, so I should be able to follow simple instructions but, sadly, nothing more. So... a ready-to-use solution would be quite nice (or a detailed, explain-like-I'm-3 set of instructions).

I'm using ollama + Docker and the free Open WebUI for playing (literally) with some offline models, and I'm also thinking about using something compatible with this already-running system... hopefully, possibly?

Another complication is that I'm Italian, so... this probably nonexistent model should be capable of Italian too...

The following are my PC specs, if needed:

  • Processor: intel i7 13700k

  • MB: Asus ROG Z790-H

  • Ram: 64gb Corsair 5600 MT/S

  • Gpu: RTX 4070TI 12gb - MSI Ventus 3X

  • Storage: Samsung 970EVO NVME SSD + others

  • Windows 11 PRO 64bit

Sorry for the long post and thank you for any help :)

Text-To-Speech
September 13, 2013 - r/TextToSpeech: Discussion about text-to-speech engines, virtual assistants, and related topics.
r/LocalLLaMA on Reddit: Handy - a simple, open-source offline speech-to-text app written in Rust using whisper.cpp
June 17, 2025 -

I built a simple, offline speech-to-text app after breaking my finger - now open sourcing it

TL;DR: Made a cross-platform speech-to-text app using whisper.cpp that runs completely offline. Press shortcut, speak, get text pasted anywhere. It's rough around the edges but works well and is designed to be easily modified/extended - including adding LLM calls after transcription.

Background

I broke my finger a while back and suddenly couldn't type properly. Tried existing speech-to-text solutions but they were either subscription-based, cloud-dependent, or I couldn't modify them to work exactly how I needed for coding and daily computer use.

So I built Handy - intentionally simple speech-to-text that runs entirely on your machine using whisper.cpp (Whisper Small model). No accounts, no subscriptions, no data leaving your computer.

What it does

  • Press keyboard shortcut → speak → press again (or use push-to-talk)

  • Transcribes with whisper.cpp and pastes directly into whatever app you're using

  • Works across Windows, macOS, Linux

  • GPU accelerated where available

  • Completely offline

That's literally it. No fancy UI, no feature creep, just reliable local speech-to-text.
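The shortcut-driven flow above (press to start, press again to stop, or hold for push-to-talk) boils down to a tiny state machine. Here's a minimal sketch in Python (Handy itself is written in Rust; this class and its method names are illustrative, not Handy's actual code):

```python
# Illustrative toggle / push-to-talk state machine, mirroring the
# "press shortcut -> speak -> press again" flow described above.
class RecorderState:
    def __init__(self):
        self.recording = False

    def toggle(self) -> bool:
        """Toggle mode: each shortcut press flips recording on/off."""
        self.recording = not self.recording
        return self.recording

    def push_to_talk(self, key_held: bool) -> bool:
        """Push-to-talk mode: record only while the key is held."""
        self.recording = key_held
        return self.recording

state = RecorderState()
state.toggle()   # shortcut pressed: start capturing audio
state.toggle()   # pressed again: stop, then transcribe and paste
```

Everything else in the pipeline (capturing audio, running whisper.cpp, pasting the text) hangs off these two state transitions.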

Why I'm sharing this

This was my first Rust project and there are definitely rough edges, but the core functionality works well. More importantly, I designed it to be easily forkable and extensible because that's what I was looking for when I started this journey.

The codebase is intentionally simple - you can understand the whole thing in an afternoon. If you want to add LLM integration (calling an LLM after transcription to rewrite/enhance the text), custom post-processing, or whatever else, the foundation is there and it's straightforward to extend.

I'm hoping it might be useful for:

  • People who want reliable offline speech-to-text without subscriptions

  • Developers who want to experiment with voice computing interfaces

  • Anyone who prefers tools they can actually modify instead of being stuck with someone else's feature decisions

Project Reality

There are known bugs and architectural decisions that could be better. I'm documenting issues openly because I'd rather have people know what they're getting into. This isn't trying to compete with polished commercial solutions - it's trying to be the most hackable and modifiable foundation for people who want to build their own thing.

If you're looking for something perfect out of the box, this probably isn't it. If you're looking for something you can understand, modify, and make your own, it might be exactly what you need.

Would love feedback from anyone who tries it out, especially if you run into issues or see ways to make the codebase cleaner and more accessible for others to build on.

r/OpenAI on Reddit: Any Text to Speech that doesn't suck and is free?
September 10, 2024 -

I just got Applio to try out its TTS voice synthesiser. However, the problem is that the base TTS Applio uses before it converts the voices is incredibly robotic and stiff, making the TTS basically useless.

Are there any actually halfway decent TTS voices I can use so I can import them into Applio?

From what I've seen, only ElevenLabs has natural-sounding voices, and they also have emotional range. However, you still need to figure things out through trial and error, which means you are burning through the allowed words you've paid for just trying to get something decent.