speech to text ai open source reddit - Brave Search

Best local open source Text-To-Speech and Speech-To-Text?

reddit.com › r › LocalLLaMA › comments › 1f0awd6 › best_local_open_source_texttospeech_and

I’ve been trying to keep a list of TTS solutions. Here you go: Text to Speech Solutions 11labs - Commercial xtts xtts2 Alltalk Styletts2 Fish-Speech PiperTTS - A fast, local neural text to speech system that is optimized for the Raspberry Pi 4. PiperUI Paroli - Streaming mode implementation of the Piper TTS with RK3588 NPU acceleration support. Bark Tortoise TTS LMNT AlwaysReddy - (uses Piper) Open-LLM-VTuber MeloTTS OpenVoice Sherpa-onnx Silero Neuro-sama Parler TTS Chat TTS VallE-X Coqui TTS Daswers XTTS GUI VoiceCraft - Zero-Shot Speech Editing and Text-to-Speech Answer from jpummill2 on reddit.com

reddit.com › r/opensource › speech-to-text software

r/opensource on Reddit: Speech-to-text software

February 15, 2023 -

Hi guys,

I'm looking for some software or an app that can turn an mp3 recording into text. There's a lot of text-to-speech solutions out there but I can't find anything that goes the other way, that is speech-to-text.

I have some mp3 recordings of lectures that I would like to turn into text and then PDF.

You won't find a better transcriber than the new whisper AI: https://whisper.ggerganov.com/

You can try Vibe It's free, open source, supports Windows / macOS / Linux. Works offline and supports up to 100 languages.

reddit.com › r/localllama › best local open source text-to-speech and speech-to-text?

r/LocalLLaMA on Reddit: Best local open source Text-To-Speech and Speech-To-Text?

August 24, 2024 -

I am working on a custom data-management software and for a while now I've been working and looking into possibility of integrating and modifying existing local conversational AI's into it (or at least developing the possibility of doing so in the future). The first thing I've been struggling with is that information is somewhat hard to come by - searches often lead me back here to r/LocalLLaMA/ and a year old threads in r/MachineLearning. Is anyone keeping track of what is out there what is worth the attention? I am posting this here in hope of finding some info while also sharing what I know for anyone who finds it useful or is interested.

I've noticed that most open source projects are based on Open AI's Whisper and it's re-implemented versions like:

Faster Whisper (MIT license)
Insanely fast Whisper (Apache-2.0 license)
Distil-Whisper (MIT license)
WhisperSpeech by github.com/collabora (MIT license, Added here 03/2025)
WhisperLive (MIT license, Added here 03/2025)
WhisperFusion, which is WhisperSpeech+WhisperLive in one package. (Added here 03/2025)

Coqui AI's TTS and STT -models (MPL-2.0 license) have gained some traction, but on their site they have stated that they're shutting down.

Tortoise TTS (Apache-2.0 license) and its re-implemented versions such as:

Tortoise-TTS-fast (AGPL-3.0, Apache-2.0 licenses) and its slightly faster(?) fork (AGPL-3.0 license).

StyleTTS and it's newer version:

StyleTTS2 (MIT license)

Alibaba Group's Tongyi SpeechTeam's SenseVoice (STT) [MIT license+possibly others] and CosyVoice (TTS) [Apache-2.0 license].

(11.2.2025): I will try to maintain this list so will begin adding new ones as well.

1/2025 Kokoro TTS (MIT License)
2/2025 Zonos by Zyphra (Apache-2.0 license)
3/2025 added: Metavoice (Apache-2.0 license)
3/2025 added: F5-TTS (MIT license)
3/2025 added: Orpheus-TTS by canopylabs.ai (Apache-2.0 license)
3/2025 added: MegaTTS3 (Apache-2.0 license)
4/2025 added: Index-tts (Apache-2.0 license). [Can be tried here.]
4/2025 added: Dia TTS (Apache-2.0 license) [Can be tried here.]
5/2025 added: Spark-TTS (Apache-2.0 license)[Can be tried here.]
5/2025 added: Parakeet TDT 0.6B V2 (CC-BY-4.0 license), STT English only [Can be tried here.], update: V3 is multilingual and has an onnx -version.

8/2025 added: Verbify-TTS (MIT License) by reddit user u/MattePalte. Described as simple locally run screen-reader-style app.

8/2025 added: Chatterbox-TTS (MIT License) [Can be tried here.]

8/2025 added: Microsoft's VibeVoice TTS (MIT Licence) for generating consistent long-form dialogues. Comes in 1.5B and 7B sizes. Both models can be tried here. 0.5B model is also on the way. This one also already has a ComfyUI wrapper by u/Fabix84/ (additional info here). Quantized versions by u/teachersecret can be found here

8/2025 added: BosonAI's Higgs Audio TTS (Apache-2.0 license). Can be tried here and further tested here. This one supports complex long-form dialogues. Extra prompting is supposed to allow setting the scene and adjusting expressions. Also has a quantized (4bit fork) version.

8/2025 added: StepFun AI's (Chinese AI-team ^source) Step-Audio 2 Mini Speech-To-Speech (Apache-2.0 license) a 8B "speech-to-speech" (Audio-To-Tokens + Tokens-To-Audio) -model. Added because related, even if bypasses the "to-text" -part.

---------------------------------------------------------

Edit1: Added Distil-Whisper because "insanely fast whisper" is not a model, but these were shipped together.
Edit2: StyleTTS2FineTune is not actually a different version of StyleTTS2, but rather a framework to finetuning it.
Edit3(11.2.2025): as suggested by u/caidong I added Kokoro TTS + also added Zonos to the list.
Edit4(20.3.2025): as suggested by u/Trysem , added WhisperSpeech, WhisperLive, WhisperFusion, Metavoice and F5-TTS.
Edit5(22.3.2025): Added Orpheus-TTS.
Edit6(28.3.2025): Added MegaTTS3.
Edit7(11.4.2025): as suggested by u/Trysem/, added Index-tts.
Edit8(24.4.2025): Added Dia TTS (Nari-labs).
Edit9(02.5.2025): Added Spark-TTS as suggested by u/Tandulim (here)
Edit9(02.5.2025): Added Parakeet TDT 0.6B V2. More info in this thread.

Edit10(29.8.2025): As originally suggested by u/Trysem and later by u/Nitroedge added Chatterbox-TTS to the list.

Edit10(29.8.2025): u/MattePalte asked me to add his own TTS called Verbify-TTS to the list.

Edit10(29.8.2025): Added Microsoft's recently released VibeVoice TTS, BosonAI's Higgs Audio TTS and StepFun's STS. +Extra info.

Edit11+12(1.9.2025): Added VibeVoice TTS's quantized versions and Parakeet V3.

I’ve been trying to keep a list of TTS solutions. Here you go: Text to Speech Solutions 11labs - Commercial xtts xtts2 Alltalk Styletts2 Fish-Speech PiperTTS - A fast, local neural text to speech system that is optimized for the Raspberry Pi 4. PiperUI Paroli - Streaming mode implementation of the Piper TTS with RK3588 NPU acceleration support. Bark Tortoise TTS LMNT AlwaysReddy - (uses Piper) Open-LLM-VTuber MeloTTS OpenVoice Sherpa-onnx Silero Neuro-sama Parler TTS Chat TTS VallE-X Coqui TTS Daswers XTTS GUI VoiceCraft - Zero-Shot Speech Editing and Text-to-Speech

I’ve been using alltalktts ( https://github.com/erew123/alltalk_tts ) which is based off of coqui and supports XTTS2, piper and some others. I’m on a Mac so my options are pretty limited, and this worked fairly well. If xtts is the model you want to go with, then maybe https://github.com/daswer123/xtts-api-server would work even better. Unfortunately most of my cases are in SillyTavern, for narration, and character tts, so these may not be the use case for you. The last link I shared might give you ideas for how to implement that on a real application though. Are you a dev-like person, or just enthusiastic about it? I ask because if you’re a dev with some Python knowledge, or willingness to follow code, the later link is actually pretty useful for ideas, in spite of being targeted towards SillyTavern. If not, this is whole space might be kind of hard to navigate at this point in time, and also will depend a lot on the hardware where you’ll be deploying this.

Videos

Possibly THE BEST Open Source Text-to-Speech Model - VibeVoice ...

September 2, 2025

My Top 5 Open-Source AI Text-to-Speech Models - YouTube

February 12, 2025

The Most Accurate Speech-to-text APIs in 2025 - YouTube

February 6, 2025

Local and Open Source Speech to Speech Assistant - YouTube

September 12, 2024

SpeechBrain - Speech to text model - YouTube

August 14, 2024

The MOST EXPRESSIVE Open Source Text-to-Speech of 2025 - YouTube

September 8, 2025

reddit.com › r/localllama › what's the best open source speech to text model

r/LocalLLaMA on Reddit: What's the best open source speech to text model

August 2, 2024 -

I know OpenAI recently released whisper V3 Turbo but I remember hearing about some other ones that's a lot better but I can't remember

whisper-v3-turbo because of its wide compatibility with open source ecosystem (not necessarily because of its WER) The architecture is plug and play. You can typically add some LLMs along with whisper to correct stuff for you and customize as you need. I wrote a guide here just now: Creating Very High-Quality Transcripts with Open-Source Tools: An 100% automated workflow guide

You might be talking about https://huggingface.co/Revai Here is the post you might be remembering https://x.com/reach_vb/status/1841885263766945930

reddit.com › r/localllama › open source ai voice dictation app with a fully customizable stt and llm pipeline

r/LocalLLaMA on Reddit: Open source AI voice dictation app with a fully customizable STT and LLM pipeline

3 days ago - Tambourine is an open source, cross-platform voice dictation app that uses a configurable STT and LLM pipeline to turn natural speech into clean, formatted text in any app.

reddit.com › r/machinelearning › [d] what is the best open source text to speech model?

r/MachineLearning on Reddit: [D] What is the best open source text to speech model?

April 13, 2023 -

I am building a LLMs infrastructure that misses one thing - text to speech. I know there are really good apis like MURF.AI out there, but I haven't been able to find any decent open source TTS, that is more natural than the system one.

If you know any of these, please leave a comment

Thanks

I have a whole list of TTS models (repos & white papers): Neural TTS Models Tacotron submitted: Mar 29, 2017 paper: https://arxiv.org/pdf/1703.10135.pdf github: https://github.com/keithito/tacotron (Not the official implementation but is the once cited the most) Tacotron2 submitted: Dec 16, 2017 paper: https://arxiv.org/pdf/1712.05884.pdf github: https://github.com/NVIDIA/tacotron2 Transformer TTS ** submitted: Sept 19, 2018 paper: https://arxiv.org/pdf/1809.08895.pdf github: N/A Flowtron submitted: May 12 2020 paper: https://arxiv.org/pdf/2005.05957.pdf github: https://github.com/NVIDIA/flowtron FastSpeech2 submitted: Jun 8, 2020 paper: https://arxiv.org/pdf/2006.04558.pdf github: https://github.com/ming024/FastSpeech2 (Not the official implementation but is the once cited the most) FastPitch submitted: Jun 11, 2020 paper: https://arxiv.org/pdf/2006.06873.pdf github: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/FastPitch TalkNet (1/2) submitted: May 12, 2020/Apr16, 2021 paper: https://arxiv.org/pdf/2005.05514.pdf / https://arxiv.org/pdf/2104.08189.pdf github: https://github.com/NVIDIA/NeMo MOS (Mean Opinion Score) is not included because each paper has a different score for each model ** This model is not to be considered for implementation. It can be a reference but does not have an official GitHub implementation that I am aware of, nor is it very well known. Vocoders (Mel-spec to audio) WaveNet submitted: Sept 12, 2016 paper: https://arxiv.org/pdf/1609.03499v2.pdf github: N/A WaveGlow submitted: Oct 31, 2018 paper: https://arxiv.org/pdf/1811.00002.pdf github: https://github.com/NVIDIA/waveglow HiFiGAN submitted: Oct 12, 2020 paper: https://arxiv.org/pdf/2010.05646.pdf github: https://github.com/jik876/hifi-gan Amendments • TalkNet source code from NVIDIA/NeMo repo has been removed (commit #4082) • NVIDIA/NeMo repo now links to: • FastPitch, MixerTTS, Tacotron2, RadTTS for text to Mel-spectrogram models • HiFiGAN, UnivNet, WaveGlow for Vocoder models • RadTTS seems to be similar to or based around Flowtron (Autoregressive model) • MixerTTS seems to be similar to or based around FastPitch • There are a number of models that are heavily reliant on this monotonic align module. Such models currently include: • VITS • RadTTS • GradTTS • GlowTTS • Regarding GlowTTS, there is actually a Tensorflow implementation available here () which may prove helpful for other models that may use similar components • STYLER and DiffTTS relies on Montreal forced aligner (MFA) package • Presentation from Microsoft: https://www.microsoft.com/en-us/research/uploads/prod/2022/12/Generative-Models-for-TTS.pdf RadTTS submitted: Aug 18, 2021 (NVIDIA page, not Arxiv) paper: https://openreview.net/pdf?id=0NQwnnwAORi github: https://github.com/NVIDIA/radtts MixerTTS submitted: Oct 7, 2021 paper: https://arxiv.org/pdf/2110.03584.pdf github: https://github.com/NVIDIA/NeMo GradTTS (Diffusion TTS) submitted: May 13, 2021 paper: https://arxiv.org/pdf/2105.06337.pdf github: https://github.com/huawei-noah/Speech-Backbones/tree/main/Grad-TTS VITS submitted: Jun 11, 2021 paper: https://arxiv.org/pdf/2106.06103.pdf github: https://github.com/jaywalnut310/vits GlowTTS submitted: May 22, 2020 paper: https://arxiv.org/pdf/2005.11129v1.pdf github: https://github.com/jaywalnut310/glow-tts STYLER submitted: Mar 17, 2021 paper: https://arxiv.org/pdf/2103.09474.pdf github: https://github.com/keonlee9420/STYLER TorToiseTTS submitted: N/A paper: N/A github: https://github.com/neonbjb/tortoise-tts DiffTTS (DiffSinger) submitted: Apr 3, 2021 paper: https://arxiv.org/pdf/2104.01409v1.pdf github: https://github.com/keonlee9420/DiffSinger

Tortoise TTS is supposed to be good. However inference can take a while if not on GPU's, so might not produce the real-time text-to-speech effect you want.

reddit.com › r/opensource › speech-to-text with ai?

r/opensource on Reddit: Speech-to-text with AI?

August 31, 2023 -

I am searching for a fully open-source Python script or an application to seamlessly transcribe audio into text in French.

I dislike websites since they usually come with restrictions and limitations in most instances (approximately 99% of the time).

Do you know where I could find something suitable?

If you‘re willing to dabble a bit with Python, a colleague of mine has written a great post for whisper: https://www.css.cnrs.fr/whisper-for-transcribing-interviews/

reddit.com › r/programming › open source speech to text

r/programming on Reddit: Open Source Speech to Text

February 12, 2010 -

I'd like something I can use to transcribe speech to text as part of a larger program. Google-ing open source speech to text I see CMU Sphinx and Open Mind Speech. Any other options I should be aware of? Which is the most accurate?

Hi, I'm a doctoral student studying speech recognition. Some of the most popular open source ASR engines are, including the Sphinx project which you mention, CMU Sphinx HTK Edit: as several people have pointed out, HTK more accurately described as 'source available'. The situation is a bit complicated so please refer to the website for details. My operational understanding is that you can modify HTK source and train up acoustic models with HTK for whatever purpose you like, but you cannot repackage and ship HTK source code. Julius Juicer The Sphinx and HTK projects contain software appropriate for training Acoustic Models from audio data, as well as the decoders. The other two are just decoders, and require models trained with some other system (HTK). In addition to acoustic models, if you are planning to train up your own system you will also need to build or obtain A pronunciation dictionary which contains entries consisting of the words you wish to transcribe and their corresponding pronunciation(s) A language model or expert grammar to constrain the input speech There are open source tools for building language models as well, but they vary in terms of their licenses. Some of the most popular language modeling tools are, CMU SLM Toolkit MIT LM SRILM There are also a variety of pre-trained open source acoustic models and language models available on the web, CMU Open Source AMs Keith Vertanen models VoxForge In terms of tutorial and help information the following are all good places to start, VoxForge CMU Robust Group Tutorial HTK Book (requires registration) I've never heard of Open Mind Speech and it does not appear to be very well-maintained so I'm not sure I can recommend that. If you are interested in understanding what is actually going on behind the scenes in most modern ASR systems I recommend the following papers, A tutorial on hidden markov models and selected applications in speech recognition (pdf) Speech Recognition with Weighted Finite State Transducers (pdf) In general, even with the many open source tools currently available, training up a high quality large vocabulary continuous speech recognition system from scratch is a non-trivial (read pretty complicated) exercise. If you are determined to do it, my strongest recommendation is to try out the Sphinx tutorial as it should take you through from beginning to end on an open source toy corpus, and leave you with the ability to modify or alter that yourself afterwards. Also, Sphinx is well-supported, completely open source, and has 3 different decoders with various strengths and weaknesses, one written in Java and 2 in C. Good luck! Edit: If you are trying to include this in a VoIP application there are also Sphinx plugins for popular open source software PBXs like Asterisk and FreeSwitch , and even a maturing MRCP server, UniMRCP . And Voxeo provides VXML based solutions for this.

Hi, thanks for bringing this topic up. I'm a marketing consultant, who's nearly deaf. Using a phone or video chat is really difficult (often impossible) for me, which means I have to spend more time and resources for attracting new clients than a hearing consultant would. Unable to phone also has a serious impact on job chances and my career. I've been thinking if a speech to text software could help me with phone calls and video conferences. What I'm looking for would be a speech to text software which plugs into skype, VoIP with a Fritzbox or another SIP software, or even Asteristik (I have no experience with Asteristik yet, but would be willing to dive in) has a roughly 50% success rate (I don't need perfect transcription, just enough to get the gist) can handle unknown voices (clients, business partners, coworkers) ideally works on Mac or Linux (Windows is fine, too, if nothing else is supported) supports English and German I've looked into Dragon Naturally Speaking and MacSpeech, but they seem to only work with one's own voice and only with extensive training and are more geared to authoring texts. I'm looking for a speech to text software from streaming / live audio (such as Skype). Does anybody have experience with what I'm trying to achieve? Does anybody know of a software or could anybody give me some hints which projects in blackkettle's post (e.g. Sphinx?) I should try out first? Also: Deaf redditors, how do you handle business phone calls? Additionally I'm looking for a speech-to-text software for subtitling videos, too. I've been thinking of creating some videos to place on Youtube and I'd like to offer subtitles to fellow deaf people. Does anybody know of specialized speech-text-software for video subtitling? Thanks a lot for any help and suggestions!

reddit.com › r/chatgptcoding › i built an open source ai voice dictation app with fully customizable stt and llm pipelines

r/ChatGPTCoding on Reddit: I built an open source AI voice dictation app with fully customizable STT and LLM pipelines

3 days ago -

Tambourine is an open source, cross-platform voice dictation app that uses configurable STT and LLM pipelines to turn natural speech into clean, formatted text in any app.

I have been building this on the side for the past few weeks. The motivation was wanting something like Wispr Flow, but with full control over the models and prompts. I wanted to be able to choose which STT and LLM providers were used, tune formatting behavior, and experiment without being locked into a single black box setup.

The back end is a local Python server built on Pipecat. Pipecat provides a modular voice agent framework that makes it easy to stitch together different STT models and LLMs into a real-time pipeline. Swapping providers, adjusting prompts, or adding new processing steps does not require changing the desktop app, which makes experimentation much faster.

Speech is streamed in real time from the desktop app to the server. After transcription, the raw text is passed through an LLM that handles punctuation, filler word removal, formatting, list structuring, and personal dictionary rules. The formatting prompt is fully editable, so you can tailor the output to your own writing style or domain-specific language.

The desktop app is built with Tauri, with a TypeScript front end and Rust handling system level integration. This allows global hotkeys, audio device control, and text input directly at the cursor across platforms.

I shared an early version with friends and presented it at my local Claude Code meetup, and the feedback encouraged me to share it more widely.

This project is still under active development while I work through edge cases, but most core functionality already works well and is immediately useful for daily work. I would really appreciate feedback from people interested in voice interfaces, prompting strategies, latency tradeoffs, or model selection.

Happy to answer questions or go deeper into the pipeline.

https://github.com/kstonekuan/tambourine-voice

This is great. Good work. One area that I suggest you always keep in mind is the response time. The time it takes for the first word to appear after it is spoken. I know that in your approach you need to capture the whole voice segment first and then start processing. To reduce response time in this approach, all you can do is to make the processing faster. Based on my experience running the whisper model locally on GPU reduces the response time significantly and is safe from network delays and interruptions.

Find elsewhere

Google Bing Mojeek

reddit.com › r/opensource › any live speech to text converter for windows (opensource & works offline)?

r/opensource on Reddit: Any live speech to text converter for Windows (opensource & works offline)?

April 10, 2024 -

Hi I am looking for some speech to text converter for windows, but not able to get anything.

I know there are two popular models, vosk and whisper, but I am unable to find anything which I can use on windows for live transcription.

Any suggestions please?

u/Ambitionless_Nihil FWIW, the Whisper.cpp repository's README has a list of example implementations that include a live transcription tool: https://whisper.ggerganov.com/stream/ You need to open the page in your web browser. But once you do that, the real time transcription is run offline and locally.

Coqui maybe?: https://docs.coqui.ai/en/latest/installation.html

reddit.com › r/localllama › best speech to text transcription? local model or api

r/LocalLLaMA on Reddit: Best speech to text transcription? Local model or api

July 4, 2024 -

Hey guys, I would like to add speech to text transcription. Hosting an open source model on cloud so I can do this anywhere would be good.

Do you guys know any highly accurate STT open source models that is highly accurate?

Also, can it run on CPU or GPU is a must?

Whisper is really good, at least for single speaker text. It's less good for multi-speaker text, but still plenty usable.

Whisper.cpp runs on CPU, is pretty fast if you have decent hardware, and in my experience fantastically recognizes almost all words and sentences correctly. I was definitely very satisfied with the results. Never tested it with languages other than English, it may require thorough testing before deciding to use a model of specific size.

reddit.com › r/localllama › best open source speech to text+ diarization models

r/LocalLLaMA on Reddit: Best Open source Speech to text+ diarization models

May 8, 2025 -

Hi everyone, hope you’re doing well. I’m currently working on a project where I need to convert audio conversations between a customer and agents into text.

Since most recordings involve up to three speakers, could you please suggest some top open-source models suited for this task, particularly those that support speaker diarization?

I've had a similar need a few months ago, and the best I could find was GitHub - MahmoudAshraf97/whisper-diarization: Automatic Speech Recognition with Speaker Diarization based on OpenAI Whisper It's still not ideal, especially when people talk over each other, but works fairly well. Of course, if the conversation happens over the phone/internet, you can record agent and customer into separate streams and just use normal whisper.

At the moment, you’re going to get your best transcript by splitting the audio into each voice (isolate) https://github.com/pyannote/pyannote-audio Once split, stt each individual stream through a timestamp capable model like parakeet https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2 Finally, reassemble the conversation by speaker, interleaving the speech based on time stamps in the final transcript.

reddit.com › r/opensource › tts: text-to-speech for all. open-sourced by mozilla

r/opensource on Reddit: TTS: Text-to-Speech for All. Open-sourced by Mozilla

July 4, 2020 - Since many of the key people are now at Coqui and because the rest of the AI people at Mozilla have been asked to work on other projects, I would expect the center of gravity for new development to shift to coqui. DeepSpeech was the state of the art for speech recognition in 2015 and still pretty cutting edge for certain contexts. Certainly one of the most accurate open source models.

reddit.com › r/localllama › speech to text - whisper alternatives?

r/LocalLLaMA on Reddit: Speech to Text - Whisper alternatives?

June 3, 2024 -

I've been working on a project that needs reliable Speech to text conversion with the potential for multiple active individuals in a conversation. I've only used the long ago released OpenAI Whisper model which was pretty terrible at 3+ people and often had issues consistently attributing correct voice to tag.

It's a pretty old model (given how fast everything is) what models are you guys using and what are some of the pro/cons?

The Whisper model is still the best open source model I've found. But as far as multiple speakers, don't use Whisper by itself - you need to combine it with a good diarization model. I would take a look at the whisperX project which uses faster-whisper (4x speed increase over openAI/whisper) and has VAD and diarization capability included.

Whisper still is probably the best open source one. (Or its various distillations) But it's not meant to be used like you are doing. You should first identify different voices and separate them into different files. And then whisper separately on each.

reddit.com › r/localllama › suggest me open source text to speech for real time streaming

r/LocalLLaMA on Reddit: Suggest me open source text to speech for real time streaming

May 25, 2025 -

currently using elevenlabs for text to speech the voice quality is not good in hindi and also it is costly.So i thinking of moving to open source TTS.Suggest me good open source alternative for eleven labs with low latency and good hindi voice result.

Kokoro is the best I’ve come across with good Voices and low latency on minimal resources

For me, coqui tts with the Xttsv2 model worked best. You are able to clone voices and it can speak in so many languages. It also allows streaming inference, so you don't have to wait untill everything is generated. I only have a latency of 200 micro seconds. And it sounds Pretty good!

reddit.com › r/opensource › i am looking for tts software

r/opensource on Reddit: I am looking for TTS software

January 21, 2024 -

Years ago I searched unsuccessfully for human-sounding TTS software (German voice output) for Linux. Nothing was found.

Is there really still (in year 2024) nothing comparable to Balabolka and Read Aloud and in Linux-world?

piper-tts is hands down the best https://github.com/rhasspy/piper they have some decent voices for sure, but I highly recommend making one and sharing it. I have one here https://github.com/sweetbbak/Neural-Amy-TTS that explains how to use it and how you can add it as a system voice so you can use TTS features in the browser. The nice thing about Piper is that it sounds good, its faster than real time and it can run on a potato. It uses a mix of phoneme synthesis and AI generation so its like a merge of old and new methods. I hate most AI implementations of TTS but piper does it right imo. Its also a stand-alone binary made with C++ (or a python lib you can pipx install if you want) and its cross platform. No GPU required either. It uses CPU and is still faster than real-time Theres a cool Glados voice and some others too. I personally use the ancient Ivona TTS software under Wine as well, I can teach you what to do there if you want

Check out Coqui TTS - https://www.reddit.com/r/selfhosted/s/Yucmb19hNI

reddit.com › r/machinelearning › [d] what are the best speech to text tools currently available ?

r/MachineLearning on Reddit: [D] What are the best speech to text tools currently available ?

January 8, 2023 -

I am looking for open source speech to text tools, I am not familiar with the progress in this field but Ideally I would like something fast and reliable, that does english as well as other languages as french and spanish, that is also easy to use. Are there any recommendations ?

Whisper. You can run it locally for free or use the Whisper API on OpenAI that's quite cheap and doesn't require any special hardware. There is also the option to run it on Google colab for free and that also doesn't require any special hardware.

You can use HuggingFace’s speechbox library built on top of whisper. https://github.com/huggingface/speechbox They have a colab and a gradio demo on their readme

reddit.com › r/speechtech › i benchmarked 12+ speech-to-text apis under various real-world conditions

r/speechtech on Reddit: I benchmarked 12+ speech-to-text APIs under various real-world conditions

May 2, 2025 -

Hi all, I recently ran a benchmark comparing a bunch of speech-to-text APIs and models under real-world conditions like noise robustness, non-native accents, and technical vocab, etc.

It includes all the big players like Google, AWS, MS Azure, open source models like Whisper (small and large), speech recognition startups like AssemblyAI / Deepgram / Speechmatics, and newer LLM-based models like Gemini 2.0 Flash/Pro and GPT-4o. I've benchmarked the real time streaming versions of some of the APIs as well.

I mostly did this to decide the best API to use for an app I'm building but figured this might be helpful for other builders too. Would love to know what other cases would be useful to include too.

Link here: https://voicewriter.io/speech-recognition-leaderboard

TLDR if you don't want to click on the link: the best model right now seems to be GPT-4o-transcribe, followed by Eleven Labs, Whisper-large, and the Gemini models. All the startups and AWS/Microsoft are decent with varying performance in different situations. Google (the original, not Gemini) is extremely bad.

Welcome to the painful world of benchmarking ML models. How confident are you that the audios, text, and tts you used isn't in the training data of the models? If you can't prove that then your benchmark isn't worth that much. It's a big reason why you can't have open data to benchmark against, because it's too easy to cheat. If your task to to run ASR on old TED videos and TTS/read speech of wikipedia articles, then these numbers may be valid. Otherwise I wouldn't trust them. Also, streaming WERs depend a lot on the desired latency, I can't see the information anywhere. And btw, Speechmatics has updated its pricing.

U should do new 2.5 models it blows everything out of water even the Dirization

murf.ai › blog › best-text-to-speech-according-to-reddit

Best Text to Speech Software According to Reddit

According to numerous Reddit forums, Murf AI, Lovo AI, Amazon Polly, and IBM Watson text to speech offer superior multilingual support compared to others on this list. ... Bark, a text to speech solution based on the Hugging Face platform, is ...

reddit.com › r/texttospeech › what is the best open source text to speech without robotic voices?

r/TextToSpeech on Reddit: What is the best open source text to speech without robotic voices?

November 27, 2024 -

hello guys, it turns out that I want to develop a simple project where given the audio transcription it takes between 10 and 15 minutes to synthesize it, elevenlabs has good voices but it has many limitations with the amount of text, I tried coqui tts and the voices still sound very robotic to me as well The project is with a voice in Spanish. If anyone please recommend one that adapts to what I am publishing, thank you very much.

If you don't care about inference time and have a GPU I recommend tortoise. There's probably some better modern ones I haven't tested though

You could definitely use VoicePal - Text to Speech (Android app), free no limits. It does have natural-sounding Spanish voices both in Spanish and Mexican accents. When you use it, you can either load a document, or create a custom note (in the Notes folder) where you can type or paste your own text. Then you can simply choose "save audio file" from the menu and it will generate an mp3 audio file for that text. Here is the link for the app: https://play.google.com/store/apps/details?id=com.ttstools.voicepal

reddit.com › r/localllama › best realtime open source stt model?

r/LocalLLaMA on Reddit: Best realtime open source STT model?

June 19, 2025 -

What's the best model to transcribe a conversation in realtime, meaning that the words have to appear as the person is talking.

If you have GPU, check out whisper If u wanna run transcription through mobile application like flutter, try Sherpa onnx, I wouldn't bet too much on it, but it's good enough For web streaming try whisper base model, example or is already available open source Even for CPU I can see that whisper is doing good... Every application which I mentioned is available for streaming

whisperv3 turbo. Its my daily driver.