speech-to-text llm open source - Brave Search

What's the best open source speech to text model

reddit.com › r › LocalLLaMA › comments › 1g2shx7 › whats_the_best_open_source_speech_to_text_model

whisper-v3-turbo, because of its wide compatibility with the open source ecosystem (not necessarily because of its WER). The architecture is plug and play. You can typically add some LLMs along with Whisper to correct stuff for you and customize as you need. I wrote a guide here just now: Creating Very High-Quality Transcripts with Open-Source Tools: A 100% automated workflow guide. Answer from phoneixAdi on reddit.com

reddit.com › r/localllama › what's the best open source speech to text model

r/LocalLLaMA on Reddit: What's the best open source speech to text model

August 3, 2024 - I know OpenAI recently released Whisper V3 Turbo, but I remember hearing about some other ones that are a lot better and I can't remember which.

You might be talking about https://huggingface.co/Revai Here is the post you might be remembering https://x.com/reach_vb/status/1841885263766945930

northflank.com › blog › best-open-source-text-to-speech-models-and-how-to-run-them

Best open source text-to-speech models and how to run them | Blog — Northflank

Explore the best open source text-to-speech models like XTTS-v2, Mozilla TTS, and Bark. Learn how to choose, deploy, and scale them for production with GPU support using Northflank.

Videos

Local and Open Source Speech to Speech Assistant - YouTube

September 12, 2024

Creating Low Latency Voice Agents - Open Source 🗣️🗣️🗣️ ...

August 16, 2024

GLM-4 Voice: Talk to AI in Realtime using Voice! (Open source) ...

October 31, 2024

Build an LLM-Powered Voice Agent in Python - YouTube

The Most Accurate Speech-to-text APIs in 2025 - YouTube

February 6, 2025

Speech LLMs: Models that listen and talk back - YouTube

November 4, 2024

bentoml.com › blog › exploring-the-world-of-open-source-text-to-speech-models

The Best Open-Source Text-to-Speech Models in 2026

Here is a code example of serving XTTS-v2 with BentoML: Deploy XTTS-v2 · Deploy XTTS-v2 with a streaming endpoint · Developed by Neuphonic, NeuTTS Air is the world’s first on-device, ...

gladia.io › blog › best-open-source-speech-to-text-models

Gladia - Top 5 Open-Source Speech-to-Text Models for Enterprises

In this article, we will cover the most advanced open-source ASR models available, including Whisper ASR, DeepSpeech, Kaldi, Wav2vec, and SpeechBrain, highlighting their key strengths and technical requirements. Modern ASR can very reliably transcribe ...

assemblyai.com › blog › the-top-free-speech-to-text-apis-and-open-source-engines

The top free Speech-to-Text APIs, AI Models, and Open Source Engines

This post compares the best free Speech-to-Text APIs and AI models on the market today, including APIs that have a free tier. We’ll also look at several free open-source Speech-to-Text engines and explore why you might choose an API vs. an open-source library, or vice versa.

edenai.co › post › top-free-speech-to-text-tools-apis-and-open-source-models

Top Free Speech to text tools, APIs, and Open Source models | Eden AI

DeepSpeech is an open-source, embedded speech-to-text engine that operates in real-time on a variety of devices, ranging from high-powered GPUs to a Raspberry Pi 4.

northflank.com › blog › best-open-source-speech-to-text-stt-model-in-2025-benchmarks

Best open source speech-to-text (STT) model in 2025 (with benchmarks) | Blog — Northflank

The hybrid design pairs a FastConformer encoder optimized for speech recognition with an unmodified Qwen3-1.7B LLM decoder. This enables dual operation: pure transcription mode and intelligent analysis mode supporting summarization and question answering. ... Word Error Rate: 5.63% (Open ASR Leaderboard average), 1.6% (LibriSpeech Clean), 3.1% (LibriSpeech Other)
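
The Word Error Rate figures quoted above can be read as (substitutions + deletions + insertions) divided by the number of reference words. The following is a minimal sketch of that calculation, assuming simple lowercasing and whitespace tokenization; the function name `word_error_rate` is illustrative, and published leaderboards typically apply heavier text normalization, so their numbers will not match this exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 4))
```

A WER of 5.63% therefore means roughly 5 to 6 word-level errors per 100 reference words under that benchmark's own normalization rules.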

modal.com › blog › open-source-stt

The Top Open Source Speech-to-Text (STT) Models in 2025

What sets Canary apart is its new hybrid architecture that combines automatic speech recognition (ASR) with large language model (LLM) capabilities. This makes Canary Qwen 2.5B the first open-source Speech-Augmented Language Model (SALM).

github.com › vndee › local-talking-llm

GitHub - vndee/local-talking-llm: A talking LLM that runs on your own computer without needing the internet.

Speech Recognition: Utilizing OpenAI's Whisper, we convert spoken language into text.

Starred by 735 users

Forked by 147 users

Languages Python 95.8% | Makefile 4.2%

modal.com › blog › open-source-tts

The Top Open-Source Text to Speech (TTS) Models

It’s an open sourced model that was built on top of Llama 3.2 3B, pre-trained on over 10 million hours of audio data. This model provides industry-leading expressive audio generation and multilingual voice cloning.

github.com › mozilla › DeepSpeech

GitHub - mozilla/DeepSpeech: DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.

DeepSpeech is an open-source Speech-To-Text engine, using a model trained by machine learning techniques based on Baidu's Deep Speech research paper.

Starred by 26.7K users

Forked by 4.1K users

Languages C++ 47.0% | Python 21.4% | C 11.2% | Shell 10.8% | C# 2.8% | Swift 1.8%

reddit.com › r/singularity › building a local speech-to-speech interface for llms (open source)

r/singularity on Reddit: Building a Local Speech-to-Speech Interface for LLMs (Open Source)

March 29, 2025 - I wanted a straightforward way to interact with local LLMs using voice, similar to some research projects (think Sesame, which was a huge disappointment, and Orpheus) but packaged into something easier to run. Existing options often involved cloud APIs or complex setups.

I built Persona Engine, an open-source tool that bundles the components for a local speech-to-speech loop:

It uses Whisper .NET for speech recognition.
Connects to any OpenAI-compatible LLM API (so your local models work fine or cloud if you prefer).
Uses a TTS pipeline (with optional real-time voice cloning) for the audio output.
It also includes Live2D avatar rendering and Spout output for streaming/visualization.

The goal was to create a self-contained system where the ASR, TTS, and optional RVC could all run locally (using an NVIDIA GPU for performance).
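
The ASR, LLM and TTS stages described above can be pictured as three pluggable functions chained into one turn. This is only a rough sketch in Python, not Persona Engine's actual code (that project is built on .NET); `run_voice_turn` and the three `fake_*` stages are hypothetical placeholders so the example runs without any model installed.

```python
from typing import Callable, Iterable


def run_voice_turn(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    generate: Callable[[str], Iterable[str]],
    synthesize: Callable[[str], bytes],
) -> bytes:
    """Run one speech-in, speech-out turn through three pluggable stages."""
    text = transcribe(audio)         # ASR stage: audio -> text
    reply = "".join(generate(text))  # LLM stage: collect the streamed reply
    return synthesize(reply)         # TTS stage: text -> audio


# Placeholder stages (assumptions for illustration, not real models).
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_generate(prompt: str) -> Iterable[str]:
    yield from ["You said: ", prompt]


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


print(run_voice_turn(b"hello", fake_transcribe, fake_generate, fake_synthesize))
# prints b'You said: hello'
```

Keeping each stage behind a plain callable is one way to let a local model or a cloud API be swapped in without touching the loop itself.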

Making this kind of real-time, local voice interaction more accessible feels like a useful step as AI becomes more integrated. It allows for private, conversational interaction without constant cloud reliance.

If you're interested in this kind of local AI interface:

Code/Details: https://github.com/fagenorn/handcrafted-persona-engine
Demo: https://www.youtube.com/watch?v=4V2DgI7OtHE (forgive the cheesiness, I was having a bit of fun with capcut)

Curious about your thoughts 😊

My thought is that we need a proper local speech-to-speech model. The way OpenAI is doing it doesn't use stuff like Whisper or TTS; instead they have a single model that gets speech as the input and outputs speech again. That's the only way to get perfect latency, the ability to interrupt the AI while it's speaking, etc.

Wow! Amazing project, will definitely try this tomorrow when I have free time, on Windows or even Linux (yes, with CUDA). I am working on another side project at https://github.com/moeru-ai/airi (it's already live on the web, shipped with a dedicated Electron app for desktop stream use, and migrating to Tauri these days to reduce the installation size). I am also preparing the first stream (a DevStream, actually) with a new model. The project aims to build something similar to Neuro-sama in the field of AI VTubering. Is there any chance that we could cooperate to bring the end-to-end STS pipeline to our project so that we both benefit?

github.com › KoljaB › RealtimeTTS

GitHub - KoljaB/RealtimeTTS: Converts text to speech in realtime

It lets you control your environment by speaking and is one of the most capable and sophisticated open-source assistants currently available. Low Latency · almost instantaneous text-to-speech conversion · compatible with LLM outputs

Starred by 3.7K users

Forked by 354 users

Languages Python 96.2% | Shell 1.8% | Batchfile 1.5%

Towards Data Science

towardsdatascience.com › build-a-locally-running-voice-assistant-2f2ead904fe9

Build a Locally Running Voice Assistant | Towards Data Science

March 5, 2025 - The chat service runs the open-source LLM called HuggingFaceH4/zephyr-7b-alpha.

mistral.ai › news › voxtral

Voxtral | Mistral AI

July 15, 2025 - Voxtral comprehensively outperforms Whisper large-v3, the current leading open-source Speech Transcription model.

kdnuggets.com › top-5-text-to-speech-open-source-models

Top 5 Text-to-Speech Open Source Models - KDnuggets

Orpheus TTS is a cutting-edge, Llama-based speech LLM designed for high-quality and empathetic text-to-speech applications. It is fine-tuned to deliver human-like speech with exceptional clarity and expressiveness, making it suitable for real-time ...

reddit.com › r/openai › looking for locally run open-source llms with real-time speech-to-speech capabilities - any recommendations?

r/OpenAI on Reddit: Looking for locally run open-source LLMs with real-time speech-to-speech capabilities - any recommendations?

October 24, 2024 - I'd like to know if there are any recent open-source large language models that can be deployed locally on my computer. I want it to have speech-to-speech capabilities, like voice chat, and ideally with real-time interruption capabilities. Are there any such open-source models available?

I would really appreciate any GitHub address or advice.

Just use whisper large v3 turbo, send the output to an LLM server hosted locally, stream the output of that and split on sentences, and send each sentence as it's completed to a TTS service. This works really well - not real time, but close.
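
The "split on sentences" step in the answer above is the part that keeps latency low: each sentence goes to TTS as soon as it is complete instead of waiting for the full reply. Here is a minimal sketch of that step, assuming the LLM server streams plain text chunks and that a sentence ends at `.`, `!` or `?` followed by whitespace; `sentences_from_stream` is an illustrative name, and abbreviations such as "e.g. " would be split too early by this simple rule.

```python
import re
from typing import Iterable, Iterator

# Sentence boundary: whitespace that directly follows ., ! or ?
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a stream of text chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _BOUNDARY.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        # The last part may still be growing, so keep it buffered.
        buffer = parts[-1]
    # Flush whatever remains once the stream ends.
    if buffer.strip():
        yield buffer.strip()


# Simulated LLM output arriving in arbitrary chunks.
stream = ["Hello the", "re. How are", " you? I am", " fine"]
for sentence in sentences_from_stream(stream):
    print(sentence)  # in a real pipeline, hand each sentence to the TTS service here
```

With the simulated stream this yields "Hello there.", "How are you?" and "I am fine" in order, so speech synthesis for the first sentence can start while the model is still generating the rest.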

Not sure what you are expecting in terms of performance, but this pretty recent release can be installed and run locally. It’s not FOSS, though. It’s speech to speech and I believe has interruption capabilities. Also a little unhinged. https://moshi-ai.com/

github.com › openai › whisper

GitHub - openai/whisper: Robust Speech Recognition via Large-Scale Weak Supervision

Robust Speech Recognition via Large-Scale Weak Supervision - openai/whisper

Starred by 92.1K users

Forked by 11.5K users

Languages Python

microsoft.github.io › VibeVoice

VibeVoice: A Frontier Open-Source Text-to-Speech Model

Model (LLM) to understand textual ... distinct speakers, surpassing the typical 1-2 speaker limits of many prior models. 2025-09-05: VibeVoice is an open-source research framework ......

siliconflow.com › articles › en › best-open-source-speech-to-text-models

The Best Open Source Speech-to-Text Models in 2025

Ultimate guide to 2025's best open source speech-to-text models: 1. Fish Speech V1.5; 2. CosyVoice2-0.5B; 3. IndexTTS-2. Compare TTS performance, multilingual support, latency, and duration control for speech synthesis applications.