I've got a 4090 and some stuff that I think it would be fun to have narrated. I've looked at some of the paid online options and $20-$30/mo for 2 hours of AI TTS is not gonna cut it. Can anyone point me to software that I can run locally that'll give me high quality?
It seems like if people are making billions of waifus in stable diffusion there ought to be something like this out there.
I am working on custom data-management software, and for a while now I've been looking into the possibility of integrating and modifying existing local conversational AIs into it (or at least keeping that option open for the future). The first thing I've been struggling with is that information is somewhat hard to come by - searches often lead me back here to r/LocalLLaMA/ and to year-old threads in r/MachineLearning. Is anyone keeping track of what's out there and what's worth attention? I am posting this here in hope of finding some info, while also sharing what I know for anyone who finds it useful or is interested.
I've noticed that most open source projects are based on OpenAI's Whisper and its re-implemented versions like:
Faster Whisper (MIT license)
Insanely fast Whisper (Apache-2.0 license)
Distil-Whisper (MIT license)
WhisperSpeech by github.com/collabora (MIT license, Added here 03/2025)
WhisperLive (MIT license, Added here 03/2025)
WhisperFusion, which is WhisperSpeech+WhisperLive in one package. (Added here 03/2025)
Coqui AI's TTS and STT -models (MPL-2.0 license) have gained some traction, but on their site they have stated that they're shutting down.
Tortoise TTS (Apache-2.0 license) and its re-implemented versions such as:
Tortoise-TTS-fast (AGPL-3.0, Apache-2.0 licenses) and its slightly faster(?) fork (AGPL-3.0 license).
StyleTTS and its newer version:
StyleTTS2 (MIT license)
Alibaba Group's Tongyi SpeechTeam's SenseVoice (STT) [MIT license+possibly others] and CosyVoice (TTS) [Apache-2.0 license].
(11.2.2025): I will try to maintain this list, so I will begin adding new ones as well.
1/2025 Kokoro TTS (MIT License)
2/2025 Zonos by Zyphra (Apache-2.0 license)
3/2025 added: Metavoice (Apache-2.0 license)
3/2025 added: F5-TTS (MIT license)
3/2025 added: Orpheus-TTS by canopylabs.ai (Apache-2.0 license)
3/2025 added: MegaTTS3 (Apache-2.0 license)
4/2025 added: Index-tts (Apache-2.0 license). [Can be tried here.]
4/2025 added: Dia TTS (Apache-2.0 license) [Can be tried here.]
5/2025 added: Spark-TTS (Apache-2.0 license)[Can be tried here.]
5/2025 added: Parakeet TDT 0.6B V2 (CC-BY-4.0 license), STT English only [Can be tried here.], update: V3 is multilingual and has an ONNX version.
8/2025 added: Verbify-TTS (MIT License) by reddit user u/MattePalte. Described as simple locally run screen-reader-style app.
8/2025 added: Chatterbox-TTS (MIT License) [Can be tried here.]
8/2025 added: Microsoft's VibeVoice TTS (MIT license) for generating consistent long-form dialogues. Comes in 1.5B and 7B sizes; both models can be tried here, and a 0.5B model is also on the way. It already has a ComfyUI wrapper by u/Fabix84/ (additional info here). Quantized versions by u/teachersecret can be found here.
8/2025 added: BosonAI's Higgs Audio TTS (Apache-2.0 license). Can be tried here and further tested here. This one supports complex long-form dialogues. Extra prompting is supposed to allow setting the scene and adjusting expressions. Also has a quantized (4bit fork) version.
8/2025 added: StepFun AI's (Chinese AI team) Step-Audio 2 Mini (Apache-2.0 license), an 8B speech-to-speech (Audio-To-Tokens + Tokens-To-Audio) model. Added because it's related, even though it bypasses the "to-text" part.
---------------------------------------------------------
Edit1: Added Distil-Whisper because "insanely fast whisper" is not a model, but these were shipped together.
Edit2: StyleTTS2FineTune is not actually a different version of StyleTTS2, but rather a framework for fine-tuning it.
Edit3(11.2.2025): as suggested by u/caidong I added Kokoro TTS + also added Zonos to the list.
Edit4(20.3.2025): as suggested by u/Trysem , added WhisperSpeech, WhisperLive, WhisperFusion, Metavoice and F5-TTS.
Edit5(22.3.2025): Added Orpheus-TTS.
Edit6(28.3.2025): Added MegaTTS3.
Edit7(11.4.2025): as suggested by u/Trysem/, added Index-tts.
Edit8(24.4.2025): Added Dia TTS (Nari-labs).
Edit9(02.5.2025): Added Spark-TTS as suggested by u/Tandulim (here)
Edit9(02.5.2025): Added Parakeet TDT 0.6B V2. More info in this thread.
Edit10(29.8.2025): As originally suggested by u/Trysem and later by u/Nitroedge added Chatterbox-TTS to the list.
Edit10(29.8.2025): u/MattePalte asked me to add his own TTS called Verbify-TTS to the list.
Edit10(29.8.2025): Added Microsoft's recently released VibeVoice TTS, BosonAI's Higgs Audio TTS and StepFun's STS. +Extra info.
Edit11+12(1.9.2025): Added VibeVoice TTS's quantized versions and Parakeet V3.
Hey all,
I'm looking for a good text-to-speech software with as natural sounding voices as possible, preferably with option to train on local machine with AMD GPU. It doesn't "have" to be free, but would be ideal if it was.
Basically, in 2024 I've got ~40 long-form papers I need to complete, each with ~150,000 characters or more, basically technical essays on certain subjects in my work. These projects need to have a voice-over, as they will be distributed to certain vision-impaired groups.
I've checked out quite a few platforms that offer cloud-based TTS, but the majority of them either don't sound natural or their pricing is completely off the charts. The best ones I've found so far are genny.lovo.ai and ElevenLabs.ai, with Genny being the better-sounding of the two, but both of their prices are completely insane. Basically, with either platform I'm looking at ~$2,500-3,500 to complete all of the already planned work, and there's a possibility even more will be assigned, so the price will only increase.
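For context, here's a rough back-of-the-envelope from the numbers in this post (the implied per-character rates are my own derivation from the quoted totals, not any provider's published pricing):

```python
# Rough cost arithmetic from the figures above.
papers = 40
chars_per_paper = 150_000
total_chars = papers * chars_per_paper  # 6,000,000 characters

# Quoted cloud-TTS totals for the whole workload (USD)
low_quote, high_quote = 2_500, 3_500

# Implied price per 1,000 characters
per_1k_low = low_quote / (total_chars / 1_000)
per_1k_high = high_quote / (total_chars / 1_000)

print(total_chars)            # 6000000
print(round(per_1k_low, 3))   # 0.417 (USD per 1k chars)
print(round(per_1k_high, 3))  # 0.583 (USD per 1k chars)
```

At roughly $0.40-0.60 per thousand characters, it's easy to see why a one-time local setup looks attractive for a workload this size.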
Another option I've discovered is Descript, which has freemium tier, but their voice selection is pretty poor, and not ideal for my project.
Ideally, I'd like the ability to train my own voice model on my local machine, but the problem is that the majority of models, like RVCv2, MangioRVC, ApollioRVC, TortoiseTTS etc., require a GPU with CUDA support, AKA an Nvidia GPU. And even old and decrepit stuff like 1070s in my area exceeds $300-400 per card (the economy is screwed around here), so that's not an option.
Does anyone know if there even exists any decent TTS software with the ability to locally train a voice model on an AMD GPU that doesn't impose length limits on projects?
Any help appreciated.
Hi everyone! I'm a developer who also listens to audiobooks. I use AI text-to-speech and voice cloning for my personal projects and sometimes to read fiction stories out loud.
I tested ElevenLabs, speechify, play.ht, Fish Audio, murf ai, resemble ai, and a couple others... Fish Audio honestly blew me away with the quality of their voices.
I cloned myself and it sounded indistinguishable from real life. Their text-to-speech sounds as natural as real human speech and you can inject pauses and emotional tones to perfect it.
They also offer a free plan - you can check them out at https://fish.audio !
If you want tips, settings I used, or anything else let me know!
Disclaimer: I am NOT affiliated with any of these companies in any way
Does anyone have a good local setup for text-to-speech? A while back I used Tortoise TTS, which had reasonable quality (still behind 11labs) but was very slow. I see XTTS might have taken this further. Unfortunately, there have been no updates to Tortoise TTS since the developer was snapped up by OpenAI (though I guess that's great for the development of TTS in GPT-4o).
Hey folks! Text-to-Speech (TTS) models have been pretty popular recently, and one way to customize them (e.g. cloning a voice) is by fine-tuning the model. There are other methods, but however you do training, if you want speaking speed, phrasing, vocal quirks, and the subtleties of prosody - the things that give a voice its personality and uniqueness - you'll need to create a dataset and do a bit of training. You can do it completely locally (as we're open-source), and training is ~1.5x faster with 50% less VRAM compared to all other setups: https://github.com/unslothai/unsloth
Our showcase examples aren't the 'best' - they were only trained for 60 steps on an average open-source dataset. Of course, the longer you train and the more effort you put into your dataset, the better it will be. We use female voices just to show that it works (as they're the only decent public open-source datasets available), but you can actually use any voice you want - e.g. Jinx from League of Legends - as long as you make your own dataset.
We support models like OpenAI/whisper-large-v3 (which is a Speech-to-Text STT model), Sesame/csm-1b, CanopyLabs/orpheus-3b-0.1-ft, and pretty much any Transformer-compatible models including LLasa, Outte, Spark, and others. The goal is to clone voices, adapt speaking styles and tones, support new languages, handle specific tasks and more.
We’ve made notebooks to train, run, and save these models for free on Google Colab. Some models aren’t supported by llama.cpp and will be saved only as safetensors, but others should work. See our TTS docs and notebooks: https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning
The training process is similar to SFT, but the dataset includes audio clips with transcripts. We use a dataset called ‘Elise’ that embeds emotion tags like <sigh> or <laughs> into transcripts, triggering expressive audio that matches the emotion.
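To make the tagged-transcript idea concrete, here is a minimal sketch of turning (audio, transcript, emotion) records into transcripts with embedded emotion tags. The field names and tag placement are hypothetical illustrations, not the actual 'Elise' dataset schema:

```python
# Sketch: embedding emotion tags like <sigh> or <laughs> into transcripts,
# as described above. Field names here are hypothetical, not Elise's schema.
def tag_transcript(text, emotion):
    """Prefix the transcript with an emotion tag, if one is present."""
    if emotion is None:
        return text
    return f"<{emotion}> {text}"

records = [
    {"audio": "clip_001.wav", "text": "I can't believe it worked.", "emotion": "laughs"},
    {"audio": "clip_002.wav", "text": "Fine, I'll do it myself.", "emotion": "sigh"},
    {"audio": "clip_003.wav", "text": "Hello there.", "emotion": None},
]

# Pair each audio clip with its tagged transcript for training
dataset = [
    {"audio": r["audio"], "text": tag_transcript(r["text"], r["emotion"])}
    for r in records
]

print(dataset[0]["text"])  # <laughs> I can't believe it worked.
```

During training the model learns to associate the tag token with the expressive audio in the clip, so at inference time writing `<laughs>` in the prompt triggers matching audio.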
Since TTS models are usually small, you can train them using 16-bit LoRA, or go with full fine-tuning (FFT). Loading a 16-bit LoRA model is simple.
And here are our TTS notebooks:
| Sesame-CSM (1B) | Orpheus-TTS (3B) | Whisper Large V3 | Spark-TTS (0.5B) |
|---|---|---|---|
Thank you for reading and please do ask any questions - I will be replying to every single one!
I know OpenAI recently released Whisper V3 Turbo, but I remember hearing about some other ones that are a lot better - I just can't remember which.
Looking to have voice cloning, text-to-speech, different voices etc... do we have anything like that that is GUI-based?
I'm looking for a high quality, free text to speech (TTS) AI that sounds as close to a real human as possible — preferably with multiple voice options (male/female, accents, etc). It could be an app, website, or downloadable software. I’ve tried a few that sound robotic or overly synthetic, and I’m hoping there’s something better out there that still doesn’t require payment.
Any recommendations?
Hey Redditors,
I’m currently exploring different text-to-speech (TTS) AI tools and was wondering what everyone here is using. I’m looking for something that delivers natural-sounding voices, supports multiple languages, and ideally offers some customization features like speed, pitch, or emotion.
Last week, the Supertone team released Supertonic, an extremely fast and high-quality text-to-speech model. So, I created a demo for it that uses Transformers.js and ONNX Runtime Web to run the model 100% locally in the browser on WebGPU. The original authors made a web demo too, and I did my best to optimize the model as much as possible (up to ~40% faster in my tests, see below).
I was even able to generate a ~5 hour audiobook in under 3 minutes. Amazing, right?!
Link to demo (+ source code): https://huggingface.co/spaces/webml-community/Supertonic-TTS-WebGPU
* From my testing, for the same 226-character paragraph (on the same device): the newly-optimized model ran at ~1750.6 characters per second, while the original ran at ~1255.6 characters per second.
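Checking my own numbers above, the throughput figures work out to the claimed speedup:

```python
# Verifying the "~40% faster" claim from the throughput numbers above.
optimized = 1750.6  # characters/second (newly-optimized model)
original = 1255.6   # characters/second (original model)

speedup = optimized / original
percent_faster = (speedup - 1) * 100

print(round(percent_faster, 1))  # 39.4 -> "up to ~40% faster"
```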
I tried using Edge's built-in Read Aloud and it always gets confused reading a Reddit post. I like to do aaaalot of research on Reddit, so I figure if I can find a good app, I can multitask and do other stuff while the program is speaking to me. I use Android and Windows 10.
I tried to research this a while ago but couldn't find any answers.
Don't know if this is the right place to ask, but... I was looking for a text-to-speech alternative to the quite expensive online services I've been looking at recently.
I'm partially blind and it would be of great help to have a recorded, narrated version of some technical e-books I own.
As I was saying, models like ElevenLabs and similar are really quite good, but absolutely too expensive in terms of €/time for what I need to do (and the books are quite long too).
Because of that, I was wondering if there is a good alternative to run locally (the normal TTS is quite abysmal and distracting) that can turn a book into audio and let me save an MP3 or similar file for later use.
I have to say, also, that I'm not a programmer whatsoever, so I should be able to follow simple instructions but, sadly, nothing more. So... a ready-to-use solution would be quite nice (or a detailed, like-I'm-a-3yo set of instructions).
I'm using Ollama + Docker and the free Open WebUI for playing (literally) with some offline models, and I'm also thinking about using something compatible with this already-running system... hopefully, possibly?
Another complication is that I'm Italian, so... the probably nonexistent model should be capable of handling Italian too...
The following are my PC specs, if needed:
Processor: intel i7 13700k
MB: Asus ROG Z790-H
Ram: 64gb Corsair 5600 MT/S
Gpu: RTX 4070TI 12gb - MSI Ventus 3X
Storage: Samsung 970EVO NVME SSD + others
Windows 11 PRO 64bit
Sorry for the long post and thank you for any help :)
I think Llama 4 will have multimodality (including audio input/output), but until then, what do people use for going from text to speech, and how do they run it locally? (Ollama does not support these kinds of models, does it?)
I am looking for the best Speech-to-Text/Speech Recognition models - can anyone recommend any?
I can't find anything well built enough to enable voice assistants. Right now I am just using Open WebUI plus an OpenAI TTS endpoint, but it isn't really that great.
Is there anything out there already?
I built a simple, offline speech-to-text app after breaking my finger - now open sourcing it
TL;DR: Made a cross-platform speech-to-text app using whisper.cpp that runs completely offline. Press shortcut, speak, get text pasted anywhere. It's rough around the edges but works well and is designed to be easily modified/extended - including adding LLM calls after transcription.
Background
I broke my finger a while back and suddenly couldn't type properly. Tried existing speech-to-text solutions but they were either subscription-based, cloud-dependent, or I couldn't modify them to work exactly how I needed for coding and daily computer use.
So I built Handy - intentionally simple speech-to-text that runs entirely on your machine using whisper.cpp (Whisper Small model). No accounts, no subscriptions, no data leaving your computer.
What it does
Press keyboard shortcut → speak → press again (or use push-to-talk)
Transcribes with whisper.cpp and pastes directly into whatever app you're using
Works across Windows, macOS, Linux
GPU accelerated where available
Completely offline
That's literally it. No fancy UI, no feature creep, just reliable local speech-to-text.
Why I'm sharing this
This was my first Rust project and there are definitely rough edges, but the core functionality works well. More importantly, I designed it to be easily forkable and extensible because that's what I was looking for when I started this journey.
The codebase is intentionally simple - you can understand the whole thing in an afternoon. If you want to add LLM integration (calling an LLM after transcription to rewrite/enhance the text), custom post-processing, or whatever else, the foundation is there and it's straightforward to extend.
I'm hoping it might be useful for:
People who want reliable offline speech-to-text without subscriptions
Developers who want to experiment with voice computing interfaces
Anyone who prefers tools they can actually modify instead of being stuck with someone else's feature decisions
Project Reality
There are known bugs and architectural decisions that could be better. I'm documenting issues openly because I'd rather have people know what they're getting into. This isn't trying to compete with polished commercial solutions - it's trying to be the most hackable and modifiable foundation for people who want to build their own thing.
If you're looking for something perfect out of the box, this probably isn't it. If you're looking for something you can understand, modify, and make your own, it might be exactly what you need.
Would love feedback from anyone who tries it out, especially if you run into issues or see ways to make the codebase cleaner and more accessible for others to build on.
I just got Applio to try out its TTS voice synthesiser. However, the problem is that the base TTS Applio uses before it converts the voices is incredibly robotic and stiff, making the TTS basically useless.
Are there any actually halfway decent TTS voices I can use so I can import them into Applio?
From what I've seen, only ElevenLabs has natural-sounding voices, and they also have an emotional range. However, you still need to figure things out through trial and error, which means you're burning through the allowed words you've paid for just trying to get something decent.