speech to text ai open source reddit - Brave Search

Best local open source Text-To-Speech and Speech-To-Text?

reddit.com › r › LocalLLaMA › comments › 1f0awd6 › best_local_open_source_texttospeech_and

I’ve been trying to keep a list of TTS solutions. Here you go:

Text to Speech Solutions

11labs - Commercial
xtts
xtts2
Alltalk
Styletts2
Fish-Speech
PiperTTS - A fast, local neural text to speech system that is optimized for the Raspberry Pi 4.
PiperUI
Paroli - Streaming mode implementation of the Piper TTS with RK3588 NPU acceleration support.
Bark
Tortoise TTS
LMNT
AlwaysReddy - (uses Piper)
Open-LLM-VTuber
MeloTTS
OpenVoice
Sherpa-onnx
Silero
Neuro-sama
Parler TTS
Chat TTS
VallE-X
Coqui TTS
Daswers XTTS GUI
VoiceCraft - Zero-Shot Speech Editing and Text-to-Speech

Answer from jpummill2 on reddit.com

reddit.com › r/opensource › speech-to-text software

r/opensource on Reddit: Speech-to-text software

February 15, 2023 -

Hi guys,

I'm looking for some software or an app that can turn an mp3 recording into text. There are a lot of text-to-speech solutions out there, but I can't find anything that goes the other way, that is, speech-to-text.

I have some mp3 recordings of lectures that I would like to turn into text and then PDF.

You won't find a better transcriber than the new whisper AI: https://whisper.ggerganov.com/

You can try Vibe. It's free, open source, and supports Windows / macOS / Linux. It works offline and supports up to 100 languages.

reddit.com › r/localllama › best local open source text-to-speech and speech-to-text?

r/LocalLLaMA on Reddit: Best local open source Text-To-Speech and Speech-To-Text?

August 23, 2024 -

I am working on custom data-management software, and for a while now I've been looking into the possibility of integrating and modifying existing local conversational AIs into it (or at least developing the possibility of doing so in the future). The first thing I've been struggling with is that information is somewhat hard to come by: searches often lead me back here to r/LocalLLaMA/ and to year-old threads in r/MachineLearning. Is anyone keeping track of what is out there and what is worth the attention? I am posting this here in the hope of finding some info, while also sharing what I know for anyone who finds it useful or is interested.

I've noticed that most open source projects are based on OpenAI's Whisper and its re-implemented versions, like:

Faster Whisper (MIT license)
Insanely fast Whisper (Apache-2.0 license)
Distil-Whisper (MIT license)
WhisperSpeech by github.com/collabora (MIT license, Added here 03/2025)
WhisperLive (MIT license, Added here 03/2025)
WhisperFusion, which is WhisperSpeech+WhisperLive in one package. (Added here 03/2025)

Coqui AI's TTS and STT models (MPL-2.0 license) have gained some traction, but on their site they have stated that they're shutting down.

Tortoise TTS (Apache-2.0 license) and its re-implemented versions such as:

Tortoise-TTS-fast (AGPL-3.0, Apache-2.0 licenses) and its slightly faster(?) fork (AGPL-3.0 license).

StyleTTS and its newer version:

StyleTTS2 (MIT license)

Alibaba Group's Tongyi SpeechTeam's SenseVoice (STT) [MIT license+possibly others] and CosyVoice (TTS) [Apache-2.0 license].

(11.2.2025): I will try to maintain this list so will begin adding new ones as well.

1/2025 Kokoro TTS (MIT License)
2/2025 Zonos by Zyphra (Apache-2.0 license)
3/2025 added: Metavoice (Apache-2.0 license)
3/2025 added: F5-TTS (MIT license)
3/2025 added: Orpheus-TTS by canopylabs.ai (Apache-2.0 license)
3/2025 added: MegaTTS3 (Apache-2.0 license)
4/2025 added: Index-tts (Apache-2.0 license). [Can be tried here.]
4/2025 added: Dia TTS (Apache-2.0 license) [Can be tried here.]
5/2025 added: Spark-TTS (Apache-2.0 license) [Can be tried here.]
5/2025 added: Parakeet TDT 0.6B V2 (CC-BY-4.0 license), STT English only [Can be tried here.], update: V3 is multilingual and has an ONNX version.

8/2025 added: Verbify-TTS (MIT License) by reddit user u/MattePalte. Described as a simple, locally run screen-reader-style app.

8/2025 added: Chatterbox-TTS (MIT License) [Can be tried here.]

8/2025 added: Microsoft's VibeVoice TTS (MIT License) for generating consistent long-form dialogues. Comes in 1.5B and 7B sizes. Both models can be tried here. A 0.5B model is also on the way. This one also already has a ComfyUI wrapper by u/Fabix84/ (additional info here). Quantized versions by u/teachersecret can be found here.

8/2025 added: BosonAI's Higgs Audio TTS (Apache-2.0 license). Can be tried here and further tested here. This one supports complex long-form dialogues. Extra prompting is supposed to allow setting the scene and adjusting expressions. Also has a quantized (4bit fork) version.

8/2025 added: StepFun AI's (Chinese AI team) Step-Audio 2 Mini Speech-To-Speech (Apache-2.0 license), an 8B "speech-to-speech" (Audio-To-Tokens + Tokens-To-Audio) model. Added because it is related, even if it bypasses the "to-text" part.

---------------------------------------------------------

Edit1: Added Distil-Whisper because "insanely fast whisper" is not a model, but these were shipped together.
Edit2: StyleTTS2FineTune is not actually a different version of StyleTTS2, but rather a framework for fine-tuning it.
Edit3(11.2.2025): as suggested by u/caidong I added Kokoro TTS + also added Zonos to the list.
Edit4(20.3.2025): as suggested by u/Trysem, added WhisperSpeech, WhisperLive, WhisperFusion, Metavoice and F5-TTS.
Edit5(22.3.2025): Added Orpheus-TTS.
Edit6(28.3.2025): Added MegaTTS3.
Edit7(11.4.2025): as suggested by u/Trysem/, added Index-tts.
Edit8(24.4.2025): Added Dia TTS (Nari-labs).
Edit9(02.5.2025): Added Spark-TTS as suggested by u/Tandulim (here)
Edit9(02.5.2025): Added Parakeet TDT 0.6B V2. More info in this thread.

Edit10(29.8.2025): As originally suggested by u/Trysem and later by u/Nitroedge added Chatterbox-TTS to the list.

Edit10(29.8.2025): u/MattePalte asked me to add his own TTS called Verbify-TTS to the list.

Edit10(29.8.2025): Added Microsoft's recently released VibeVoice TTS, BosonAI's Higgs Audio TTS and StepFun's STS. +Extra info.

Edit11+12(1.9.2025): Added VibeVoice TTS's quantized versions and Parakeet V3.

I’ve been using alltalktts ( https://github.com/erew123/alltalk_tts ), which is based off of coqui and supports XTTS2, piper and some others. I’m on a Mac, so my options are pretty limited, and this worked fairly well. If xtts is the model you want to go with, then maybe https://github.com/daswer123/xtts-api-server would work even better. Unfortunately, most of my cases are in SillyTavern, for narration and character tts, so these may not be the use case for you. The last link I shared might give you ideas for how to implement that in a real application though. Are you a dev-like person, or just enthusiastic about it? I ask because if you’re a dev with some Python knowledge, or a willingness to follow code, the latter link is actually pretty useful for ideas, in spite of being targeted towards SillyTavern. If not, this whole space might be kind of hard to navigate at this point in time, and it will also depend a lot on the hardware where you’ll be deploying this.

People also ask

Are there any open-source text to speech solutions highly regarded on Reddit?

Bark, a text to speech solution based on the Hugging Face platform, is considered to be one of the best open-source TTS solutions.

murf.ai › blog › best-text-to-speech-according-to-reddit

Best Text to Speech Software According to Reddit

What text to speech software do Reddit users recommend the most?

The best text to speech software Reddit users recommend is Murf, followed by ElevenLabs, Bark, Lovi AI, and Speechify. The software mentioned above garnered praise from Reddit users for its ability to generate realistic voices in many languages and accents.

How user-friendly are the text to speech interfaces according to Reddit discussions?

Reddit users, based on their experience, described the interfaces of the text to speech tools to be a breeze to use, intuitive, and convenient.

reddit.com › r/localllama › what's the best open source speech to text model

r/LocalLLaMA on Reddit: What's the best open source speech to text model

August 2, 2024 -

I know OpenAI recently released Whisper V3 Turbo, but I remember hearing about some other ones that are a lot better, and I can't remember which.

whisper-v3-turbo, because of its wide compatibility with the open source ecosystem (not necessarily because of its WER). The architecture is plug and play. You can typically add some LLMs along with Whisper to correct stuff for you and customize as you need. I wrote a guide here just now: Creating Very High-Quality Transcripts with Open-Source Tools: An 100% automated workflow guide
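For anyone unfamiliar with the WER (word error rate) mentioned above: it is the word-level edit distance between a reference transcript and the model's output, divided by the number of words in the reference. Below is a minimal sketch in plain Python. The function name and the lowercase/whitespace normalization are my own assumptions; real evaluations usually apply more careful text normalization first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    # Assumption: lowercase + whitespace split is enough normalization here.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Note that WER can exceed 1.0 when the hypothesis contains many inserted words, which is one reason a single WER number does not tell the whole story when comparing models.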

You might be talking about https://huggingface.co/Revai Here is the post you might be remembering https://x.com/reach_vb/status/1841885263766945930

reddit.com › r/localllama › open source ai voice dictation app with a fully customizable stt and llm pipeline

r/LocalLLaMA on Reddit: Open source AI voice dictation app with a fully customizable STT and LLM pipeline

3 days ago - Tambourine is an open source, cross-platform voice dictation app that uses a configurable STT and LLM pipeline to turn natural speech into clean, formatted text in any app.

reddit.com › r/opensource › speech-to-text with ai?

r/opensource on Reddit: Speech-to-text with AI?

August 31, 2023 -

I am searching for a fully open-source Python script or an application to seamlessly transcribe audio into text in French.

I dislike websites since they usually come with restrictions and limitations in most instances (approximately 99% of the time).

Do you know where I could find something suitable?

If you‘re willing to dabble a bit with Python, a colleague of mine has written a great post for whisper: https://www.css.cnrs.fr/whisper-for-transcribing-interviews/

reddit.com › r/machinelearning › [d] what is the best open source text to speech model?

r/MachineLearning on Reddit: [D] What is the best open source text to speech model?

April 13, 2023 -

I am building an LLM infrastructure that is missing one thing: text to speech. I know there are really good APIs like MURF.AI out there, but I haven't been able to find any decent open source TTS that sounds more natural than the system one.

If you know any of these, please leave a comment

Thanks

I have a whole list of TTS models (repos & white papers):

Neural TTS Models

Tacotron
submitted: Mar 29, 2017
paper: https://arxiv.org/pdf/1703.10135.pdf
github: https://github.com/keithito/tacotron (not the official implementation, but the one cited the most)

Tacotron2
submitted: Dec 16, 2017
paper: https://arxiv.org/pdf/1712.05884.pdf
github: https://github.com/NVIDIA/tacotron2

Transformer TTS **
submitted: Sept 19, 2018
paper: https://arxiv.org/pdf/1809.08895.pdf
github: N/A

Flowtron
submitted: May 12, 2020
paper: https://arxiv.org/pdf/2005.05957.pdf
github: https://github.com/NVIDIA/flowtron

FastSpeech2
submitted: Jun 8, 2020
paper: https://arxiv.org/pdf/2006.04558.pdf
github: https://github.com/ming024/FastSpeech2 (not the official implementation, but the one cited the most)

FastPitch
submitted: Jun 11, 2020
paper: https://arxiv.org/pdf/2006.06873.pdf
github: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/FastPitch

TalkNet (1/2)
submitted: May 12, 2020 / Apr 16, 2021
paper: https://arxiv.org/pdf/2005.05514.pdf / https://arxiv.org/pdf/2104.08189.pdf
github: https://github.com/NVIDIA/NeMo

MOS (Mean Opinion Score) is not included because each paper has a different score for each model.

** This model is not to be considered for implementation. It can be a reference but does not have an official GitHub implementation that I am aware of, nor is it very well known.

Vocoders (Mel-spec to audio)

WaveNet
submitted: Sept 12, 2016
paper: https://arxiv.org/pdf/1609.03499v2.pdf
github: N/A

WaveGlow
submitted: Oct 31, 2018
paper: https://arxiv.org/pdf/1811.00002.pdf
github: https://github.com/NVIDIA/waveglow

HiFiGAN
submitted: Oct 12, 2020
paper: https://arxiv.org/pdf/2010.05646.pdf
github: https://github.com/jik876/hifi-gan

Amendments

• TalkNet source code from NVIDIA/NeMo repo has been removed (commit #4082)
• NVIDIA/NeMo repo now links to:
  • FastPitch, MixerTTS, Tacotron2, RadTTS for text to Mel-spectrogram models
  • HiFiGAN, UnivNet, WaveGlow for Vocoder models
• RadTTS seems to be similar to or based around Flowtron (Autoregressive model)
• MixerTTS seems to be similar to or based around FastPitch
• There are a number of models that are heavily reliant on this monotonic align module. Such models currently include:
  • VITS
  • RadTTS
  • GradTTS
  • GlowTTS
• Regarding GlowTTS, there is actually a Tensorflow implementation available here which may prove helpful for other models that may use similar components
• STYLER and DiffTTS rely on the Montreal Forced Aligner (MFA) package
• Presentation from Microsoft: https://www.microsoft.com/en-us/research/uploads/prod/2022/12/Generative-Models-for-TTS.pdf

RadTTS
submitted: Aug 18, 2021 (NVIDIA page, not Arxiv)
paper: https://openreview.net/pdf?id=0NQwnnwAORi
github: https://github.com/NVIDIA/radtts

MixerTTS
submitted: Oct 7, 2021
paper: https://arxiv.org/pdf/2110.03584.pdf
github: https://github.com/NVIDIA/NeMo

GradTTS (Diffusion TTS)
submitted: May 13, 2021
paper: https://arxiv.org/pdf/2105.06337.pdf
github: https://github.com/huawei-noah/Speech-Backbones/tree/main/Grad-TTS

VITS
submitted: Jun 11, 2021
paper: https://arxiv.org/pdf/2106.06103.pdf
github: https://github.com/jaywalnut310/vits

GlowTTS
submitted: May 22, 2020
paper: https://arxiv.org/pdf/2005.11129v1.pdf
github: https://github.com/jaywalnut310/glow-tts

STYLER
submitted: Mar 17, 2021
paper: https://arxiv.org/pdf/2103.09474.pdf
github: https://github.com/keonlee9420/STYLER

TorToiseTTS
submitted: N/A
paper: N/A
github: https://github.com/neonbjb/tortoise-tts

DiffTTS (DiffSinger)
submitted: Apr 3, 2021
paper: https://arxiv.org/pdf/2104.01409v1.pdf
github: https://github.com/keonlee9420/DiffSinger

Tortoise TTS is supposed to be good. However, inference can take a while if not on GPUs, so it might not produce the real-time text-to-speech effect you want.

reddit.com › r/programming › open source speech to text

r/programming on Reddit: Open Source Speech to Text

February 12, 2010 -

I'd like something I can use to transcribe speech to text as part of a larger program. Google-ing open source speech to text I see CMU Sphinx and Open Mind Speech. Any other options I should be aware of? Which is the most accurate?

Hi, I'm a doctoral student studying speech recognition. Some of the most popular open source ASR engines, including the Sphinx project which you mention, are:

CMU Sphinx
HTK
Julius
Juicer

Edit: as several people have pointed out, HTK is more accurately described as 'source available'. The situation is a bit complicated, so please refer to the website for details. My operational understanding is that you can modify HTK source and train up acoustic models with HTK for whatever purpose you like, but you cannot repackage and ship HTK source code.

The Sphinx and HTK projects contain software appropriate for training acoustic models from audio data, as well as the decoders. The other two are just decoders, and require models trained with some other system (HTK).

In addition to acoustic models, if you are planning to train up your own system you will also need to build or obtain:

A pronunciation dictionary, which contains entries consisting of the words you wish to transcribe and their corresponding pronunciation(s)
A language model or expert grammar to constrain the input speech

There are open source tools for building language models as well, but they vary in terms of their licenses. Some of the most popular language modeling tools are:

CMU SLM Toolkit
MIT LM
SRILM

There are also a variety of pre-trained open source acoustic models and language models available on the web:

CMU Open Source AMs
Keith Vertanen models
VoxForge

In terms of tutorial and help information, the following are all good places to start:

VoxForge
CMU Robust Group Tutorial
HTK Book (requires registration)

I've never heard of Open Mind Speech and it does not appear to be very well-maintained, so I'm not sure I can recommend that.

If you are interested in understanding what is actually going on behind the scenes in most modern ASR systems, I recommend the following papers:

A tutorial on hidden markov models and selected applications in speech recognition (pdf)
Speech Recognition with Weighted Finite State Transducers (pdf)

In general, even with the many open source tools currently available, training up a high quality large vocabulary continuous speech recognition system from scratch is a non-trivial (read: pretty complicated) exercise. If you are determined to do it, my strongest recommendation is to try out the Sphinx tutorial, as it should take you through from beginning to end on an open source toy corpus, and leave you with the ability to modify or alter that yourself afterwards. Also, Sphinx is well-supported, completely open source, and has 3 different decoders with various strengths and weaknesses, one written in Java and 2 in C. Good luck!

Edit: If you are trying to include this in a VoIP application, there are also Sphinx plugins for popular open source software PBXs like Asterisk and FreeSwitch, and even a maturing MRCP server, UniMRCP. And Voxeo provides VXML based solutions for this.
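To make the two extra ingredients from the comment above a bit more concrete (a pronunciation dictionary and a language model), here is a toy sketch in plain Python. Everything in it is an assumption for illustration: the phoneme spellings are rough ARPAbet-style examples rather than entries copied from a real dictionary, and the add-one-smoothed bigram model is far simpler than what the toolkits listed above build.

```python
import math
from collections import Counter

# Toy pronunciation dictionary: word -> list of alternative pronunciations.
# Phoneme symbols are illustrative only, not taken from a real lexicon.
PRONUNCIATIONS = {
    "read": [["R", "IY", "D"], ["R", "EH", "D"]],
    "the": [["DH", "AH"], ["DH", "IY"]],
    "book": [["B", "UH", "K"]],
    "paper": [["P", "EY", "P", "ER"]],
}


def train_bigram_lm(sentences):
    """Count word pairs so that likely word sequences score higher."""
    contexts = Counter()
    bigrams = Counter()
    vocab = {"</s>"}
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(words[1:])
        contexts.update(words[:-1])
        bigrams.update(zip(words, words[1:]))
    return contexts, bigrams, len(vocab)


def log_prob(sentence, model):
    """Log-probability of a sentence under the add-one-smoothed bigram model."""
    contexts, bigrams, vocab_size = model
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    total = 0.0
    for prev_word, word in zip(words, words[1:]):
        numerator = bigrams[(prev_word, word)] + 1
        denominator = contexts[prev_word] + vocab_size
        total += math.log(numerator / denominator)
    return total


if __name__ == "__main__":
    model = train_bigram_lm(["read the book", "read the paper"])
    for candidate in ["read the book", "book the read"]:
        print(candidate, round(log_prob(candidate, model), 3))
```

A decoder uses the dictionary to map acoustic phoneme hypotheses to words and the language model score to prefer plausible word orders, which is the "constrain the input speech" part.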

Hi, thanks for bringing this topic up. I'm a marketing consultant who's nearly deaf. Using a phone or video chat is really difficult (often impossible) for me, which means I have to spend more time and resources on attracting new clients than a hearing consultant would. Being unable to phone also has a serious impact on job chances and my career. I've been wondering whether speech to text software could help me with phone calls and video conferences.

What I'm looking for would be speech to text software which:

plugs into Skype, VoIP with a Fritzbox or another SIP software, or even Asterisk (I have no experience with Asterisk yet, but would be willing to dive in)
has a roughly 50% success rate (I don't need perfect transcription, just enough to get the gist)
can handle unknown voices (clients, business partners, coworkers)
ideally works on Mac or Linux (Windows is fine, too, if nothing else is supported)
supports English and German

I've looked into Dragon Naturally Speaking and MacSpeech, but they seem to only work with one's own voice and only with extensive training, and are more geared to authoring texts. I'm looking for speech to text software for streaming / live audio (such as Skype). Does anybody have experience with what I'm trying to achieve? Does anybody know of such software, or could anybody give me some hints about which projects in blackkettle's post (e.g. Sphinx?) I should try out first? Also: Deaf redditors, how do you handle business phone calls?

Additionally, I'm looking for speech-to-text software for subtitling videos, too. I've been thinking of creating some videos to place on Youtube and I'd like to offer subtitles to fellow deaf people. Does anybody know of specialized speech-to-text software for video subtitling? Thanks a lot for any help and suggestions!

murf.ai › blog › best-text-to-speech-according-to-reddit

Best Text to Speech Software According to Reddit

According to numerous Reddit forums, Murf AI, Lovo AI, Amazon Polly, and IBM Watson text to speech offer superior multilingual support compared to others on this list. ... Bark, a text to speech solution based on the Hugging Face platform, is ...

reddit.com › r/localllama › best speech to text transcription? local model or api

r/LocalLLaMA on Reddit: Best speech to text transcription? Local model or api

July 4, 2024 -

Hey guys, I would like to add speech to text transcription. Hosting an open source model in the cloud so I can do this anywhere would be good.

Do you guys know any highly accurate open source STT models?

Also, can it run on CPU, or is a GPU a must?

Whisper is really good, at least for single speaker text. It's less good for multi-speaker text, but still plenty usable.

Whisper.cpp runs on CPU, is pretty fast if you have decent hardware, and in my experience recognizes almost all words and sentences correctly. I was definitely very satisfied with the results. I've never tested it with languages other than English, so it may require thorough testing before deciding which model size to use.

reddit.com › r/opensource › any live speech to text converter for windows (opensource & works offline)?

r/opensource on Reddit: Any live speech to text converter for Windows (opensource & works offline)?

April 10, 2024 -

Hi, I am looking for a speech to text converter for Windows, but have not been able to find anything.

I know there are two popular models, vosk and whisper, but I am unable to find anything which I can use on Windows for live transcription.

Any suggestions please?

u/Ambitionless_Nihil FWIW, the Whisper.cpp repository's README has a list of example implementations that include a live transcription tool: https://whisper.ggerganov.com/stream/ You need to open the page in your web browser. But once you do that, the real time transcription is run offline and locally.

Coqui maybe?: https://docs.coqui.ai/en/latest/installation.html

reddit.com › r/localllama › best open source speech to text+ diarization models

r/LocalLLaMA on Reddit: Best Open source Speech to text+ diarization models

May 8, 2025 -

Hi everyone, hope you’re doing well. I’m currently working on a project where I need to convert audio conversations between a customer and agents into text.

Since most recordings involve up to three speakers, could you please suggest some top open-source models suited for this task, particularly those that support speaker diarization?

I had a similar need a few months ago, and the best I could find was GitHub - MahmoudAshraf97/whisper-diarization: Automatic Speech Recognition with Speaker Diarization based on OpenAI Whisper. It's still not ideal, especially when people talk over each other, but it works fairly well. Of course, if the conversation happens over the phone/internet, you can record agent and customer into separate streams and just use normal whisper.

At the moment, you’re going to get your best transcript by first splitting the audio into one stream per voice (isolation) with pyannote-audio ( https://github.com/pyannote/pyannote-audio ). Once split, run STT on each individual stream through a timestamp-capable model like Parakeet ( https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2 ). Finally, reassemble the conversation by speaker, interleaving the speech based on the timestamps in the final transcript.
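The last step described above (reassembling per-speaker transcripts by timestamp) can be sketched in plain Python. This is only an illustration under assumptions: the `Segment` type, the function names and the `(start, end, text)` tuple layout are hypothetical, and each speaker's segments are assumed to already be sorted by start time, which is what a timestamp-capable STT model would normally give you.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start: float   # seconds from the beginning of the recording
    end: float
    speaker: str
    text: str


def interleave(streams):
    """Merge per-speaker segment lists into one list ordered by start time.

    `streams` maps a speaker label to a list of (start, end, text) tuples,
    each list already sorted by start time.
    """
    per_speaker = [
        [Segment(start, end, speaker, text) for start, end, text in segments]
        for speaker, segments in streams.items()
    ]
    return list(heapq.merge(*per_speaker, key=lambda seg: seg.start))


def render(segments):
    """Collapse consecutive segments from the same speaker into one turn."""
    turns = []
    for seg in segments:
        if turns and turns[-1][0] == seg.speaker:
            turns[-1] = (seg.speaker, turns[-1][1] + " " + seg.text)
        else:
            turns.append((seg.speaker, seg.text))
    return [f"{speaker}: {text}" for speaker, text in turns]


if __name__ == "__main__":
    streams = {
        "agent": [(0.0, 1.5, "Hello, how can I help?"), (4.0, 5.0, "Sure.")],
        "customer": [(1.8, 3.9, "My order is late.")],
    }
    for line in render(interleave(streams)):
        print(line)
```

Overlapping speech is not handled specially here: segments are simply ordered by when they start, so two people talking over each other will show up as alternating turns.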

reddit.com › r/machinelearning › [d] what are the best speech to text tools currently available ?

r/MachineLearning on Reddit: [D] What are the best speech to text tools currently available ?

January 7, 2023 -

I am looking for open source speech to text tools. I am not familiar with the progress in this field, but ideally I would like something fast and reliable that handles English as well as other languages such as French and Spanish, and that is also easy to use. Are there any recommendations?

Whisper. You can run it locally for free or use the Whisper API on OpenAI that's quite cheap and doesn't require any special hardware. There is also the option to run it on Google colab for free and that also doesn't require any special hardware.
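For the "run it locally" option, a minimal sketch using the `openai-whisper` Python package could look like the following. It assumes the package is installed (`pip install openai-whisper`) and that ffmpeg is available on the PATH. The file name `lecture.mp3`, the `base` model size and the helper function names are placeholders of mine, not anything recommended in the thread.

```python
def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as HH:MM:SS."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def segments_to_lines(segments):
    """Turn Whisper-style segment dicts into '[HH:MM:SS] text' lines."""
    return [
        f"[{format_timestamp(seg['start'])}] {seg['text'].strip()}"
        for seg in segments
    ]


def transcribe(path: str, model_name: str = "base"):
    """Transcribe an audio file locally and return timestamped lines."""
    # Imported lazily so the helpers above work without the package installed.
    import whisper

    model = whisper.load_model(model_name)  # downloads weights on first use
    result = model.transcribe(path)         # dict with "text" and "segments"
    return segments_to_lines(result["segments"])


if __name__ == "__main__":
    # "lecture.mp3" is a hypothetical file name.
    for line in transcribe("lecture.mp3"):
        print(line)
```

Larger model sizes are slower but usually more accurate, so it is worth trying a couple of sizes on a sample of your own audio.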

You can use HuggingFace’s speechbox library built on top of whisper. https://github.com/huggingface/speechbox They have a colab and a gradio demo on their readme

reddit.com › r/opensource › tts: text-to-speech for all. open-sourced by mozilla

r/opensource on Reddit: TTS: Text-to-Speech for All. Open-sourced by Mozilla

July 3, 2020 - Since many of the key people are now at Coqui and because the rest of the AI people at Mozilla have been asked to work on other projects, I would expect the center of gravity for new development to shift to coqui. DeepSpeech was the state of the art for speech recognition in 2015 and still pretty cutting edge for certain contexts. Certainly one of the most accurate open source models.

reddit.com › r/localllama › suggest me open source text to speech for real time streaming

r/LocalLLaMA on Reddit: Suggest me open source text to speech for real time streaming

May 25, 2025 -

I'm currently using ElevenLabs for text to speech, but the voice quality is not good in Hindi and it is also costly. So I'm thinking of moving to an open source TTS. Please suggest a good open source alternative to ElevenLabs with low latency and good Hindi voice results.

Kokoro is the best I’ve come across, with good voices and low latency on minimal resources.

For me, coqui tts with the Xttsv2 model worked best. You are able to clone voices and it can speak in so many languages. It also allows streaming inference, so you don't have to wait until everything is generated. I only have a latency of 200 micro seconds. And it sounds pretty good!

reddit.com › r/localllama › speech to text - whisper alternatives?

r/LocalLLaMA on Reddit: Speech to Text - Whisper alternatives?

June 3, 2024 -

I've been working on a project that needs reliable Speech to text conversion with the potential for multiple active individuals in a conversation. I've only used the long ago released OpenAI Whisper model which was pretty terrible at 3+ people and often had issues consistently attributing correct voice to tag.

It's a pretty old model (given how fast everything is) what models are you guys using and what are some of the pro/cons?

The Whisper model is still the best open source model I've found. But as far as multiple speakers, don't use Whisper by itself - you need to combine it with a good diarization model. I would take a look at the whisperX project which uses faster-whisper (4x speed increase over openAI/whisper) and has VAD and diarization capability included.

Whisper is probably still the best open source one (or its various distillations). But it's not meant to be used the way you are using it. You should first identify the different voices and separate them into different files, and then run Whisper separately on each.

reddit.com › r/opensource › i am looking for tts software

r/opensource on Reddit: I am looking for TTS software

January 21, 2024 -

Years ago I searched unsuccessfully for human-sounding TTS software (German voice output) for Linux. Nothing was found.

Is there really still (in the year 2024) nothing comparable to Balabolka and Read Aloud in the Linux world?

piper-tts is hands down the best: https://github.com/rhasspy/piper They have some decent voices for sure, but I highly recommend making one and sharing it. I have one here, https://github.com/sweetbbak/Neural-Amy-TTS , that explains how to use it and how you can add it as a system voice so you can use TTS features in the browser. The nice thing about Piper is that it sounds good, it's faster than real time and it can run on a potato. It uses a mix of phoneme synthesis and AI generation, so it's like a merge of old and new methods. I hate most AI implementations of TTS, but piper does it right imo. It's also a stand-alone binary made with C++ (or a python lib you can pipx install if you want) and it's cross platform. No GPU required either. It uses the CPU and is still faster than real time. There's a cool Glados voice and some others too. I personally use the ancient Ivona TTS software under Wine as well; I can teach you what to do there if you want.
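Since Piper is described above as a stand-alone binary, one simple way to use it from a program is to pipe text into it. The sketch below does that from Python with `subprocess`. It assumes a `piper` executable on the PATH and uses the `--model` and `--output_file` options from the project's README; `voice.onnx` and `out.wav` are placeholder paths, so substitute a voice model you have actually downloaded.

```python
import subprocess


def build_piper_command(model_path, output_path, executable="piper"):
    """Build the argument list for a Piper run that writes a WAV file."""
    return [executable, "--model", model_path, "--output_file", output_path]


def speak_to_file(text, model_path, output_path):
    """Send text to Piper on stdin and let it write the audio file."""
    subprocess.run(
        build_piper_command(model_path, output_path),
        input=text.encode("utf-8"),
        check=True,  # raise if Piper exits with an error
    )


if __name__ == "__main__":
    # Both file names are hypothetical placeholders.
    speak_to_file("Guten Tag, das ist ein Test.", "voice.onnx", "out.wav")
```

Passing the command as a list (rather than a shell string) avoids quoting problems when the text or paths contain spaces.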

Check out Coqui TTS - https://www.reddit.com/r/selfhosted/s/Yucmb19hNI

reddit.com › r/machinelearning › [d] best open source text to speech networks?

r/MachineLearning on Reddit: [D] Best open source Text to Speech networks?

March 17, 2018 -

Hey guys, I'm looking to make an application that uses neural text to speech for my Python program. I'm not sure what open source SOTA is like, would love to get some reference repositories to check out, especially if they have demos.

It would also be nice to have a model that can come with multiple voices :)

Cheers

https://github.com/keithito/tacotron - a good implementation, pretrained weights are available.

I didn't do anything in that direction, but to me the following sounded interesting: Mozilla: Common Voice Mozilla: TTS

reddit.com › r/software › best "speech-to-text" software/solution...?

r/software on Reddit: Best "speech-to-text" software/solution...?

March 30, 2023 -

Hi

In 2023 which one is (are) the best solutions to convert english speech to english text... Either from "live" recording or via saved audio file. Online solutions are ok too, but i would probably prefer some "offline" solutions (in order to have "safety" that my speech/text is not going anywhere - cannot be seen by someone else - privacy). But website/online solutions are ok too i guess, just would prefer the offline more.

Which ones are the BEST (most accurate etc.) in 2023...?

(And if there is even some free option that is super/quite good, throw that in too :-) )

Thank you

The app Reppi has by far worked best for me. It's unlimited (most charge per minute) and built on Whisper AI, so extremely accurate.

You can try Dictation Daddy. It works online and is really accurate. You can just search Dictation Daddy on Google and you will find it.

reddit.com › r/python › speech to text that actually works! my first impressions of "whisper" from openai.

r/Python on Reddit: Speech to Text that actually works! My first impressions of "Whisper" from OpenAI.

September 28, 2022 - ... OpenAI's Whisper: an open-sourced neural net "that approaches human level robustness and accuracy on English speech recognition." Can be used as a Python package or from the command line

assemblyai.com › blog › the-top-free-speech-to-text-apis-and-open-source-engines

The top free Speech-to-Text APIs, AI Models, and Open Source Engines

This post compares the best free Speech-to-Text APIs and AI models on the market today, including APIs that have a free tier. We’ll also look at several free open-source Speech-to-Text engines and explore why you might choose an API vs. an open-source library, or vice versa.