Hey everyone!
I hope you're having a great day.
I recently compared all the open source whisper-based packages that support long-form transcription.
Long-form transcription is basically transcribing audio files that are longer than Whisper's input limit, which is 30 seconds. This can be useful if you want to chat with a YouTube video or podcast, etc.
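For context, here's roughly what long-form transcription looks like from the user side with one of these packages - faster-whisper in this sketch; the model size and the file name are just placeholders:

```python
from faster_whisper import WhisperModel

# Placeholder model size and audio file; faster-whisper works through audio
# longer than 30 seconds window by window internally.
model = WhisperModel("small", device="cuda", compute_type="float16")  # or device="cpu"
segments, info = model.transcribe("podcast.mp3", vad_filter=True)

for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```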
I compared the following packages:
OpenAI's official whisper package
Huggingface Transformers
Huggingface BetterTransformer (aka Insanely-fast-whisper)
FasterWhisper
WhisperX
Whisper.cpp
I compared them in the following areas:
Accuracy - using word error rate (WER) and character error rate (CER); see the quick sketch after this list
Efficiency - using VRAM usage and latency
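If you want to compute these metrics yourself, here's a quick sketch using the jiwer package (just one common choice, not necessarily what the benchmark used; the reference/hypothesis strings are toy examples):

```python
import jiwer

# Toy reference transcript and model output, purely for illustration.
reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown box jumps over a lazy dog"

wer = jiwer.wer(reference, hypothesis)  # word error rate
cer = jiwer.cer(reference, hypothesis)  # character error rate
print(f"WER: {wer:.3f}  CER: {cer:.3f}")
```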
I've written a detailed blog post about this. If you just want the results, here they are:
For all metrics, lower is better. If you have any comments or questions, please leave them below.
Hi everyone! Some quick questions for today:
Why do most streaming-based implementations of Whisper process incoming audio in chunks and then stitch the transcript together?
Why not cache the encoded content and then keep that in memory and simply encode more incoming audio?
If Whisper is an autoregressive model, and it encodes audio in a sequential manner... why not just keep a running KV cache of encoded audio and update it? Why process in separate batches?
We see this kind of run-on caching a lot in LLM backends - Llama.cpp and MLX_lm, for instance, both implement prompt caching. The encoded KV cache is saved so that the next time a prompt is passed in, the already encoded part of the conversation history doesn't need to be calculated again.
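To illustrate the pattern I mean (not how Llama.cpp or MLX_lm implement it internally), here's a rough sketch of KV-cache reuse with Hugging Face transformers, using GPT-2 purely as a stand-in:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# 1) Run the conversation history through the model once and keep the KV cache.
history_ids = tok("The conversation so far.", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(input_ids=history_ids, use_cache=True)
cache = out.past_key_values  # keys/values for every token processed so far

# 2) When new text arrives, only the new tokens go through the model;
#    the cached history is reused instead of being recomputed.
new_ids = tok(" A new message arrives.", return_tensors="pt").input_ids
attn = torch.ones(1, history_ids.shape[1] + new_ids.shape[1], dtype=torch.long)
with torch.no_grad():
    out = model(input_ids=new_ids, attention_mask=attn,
                past_key_values=cache, use_cache=True)
cache = out.past_key_values  # cache now covers history + new tokens
```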
And yet I can't find any open source implementations of Whisper that do this - unless I'm just really misunderstanding the code (which is very possible). From what I can see of the codebase, Whisper.cpp seems to do sliding chunks and stitch them together. And you can see the pitfalls when you use it for live transcription: there are clear errors introduced where the chunks overlap and get stitched together.
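As a toy sketch of that chunk-and-stitch pattern (this is not Whisper.cpp's actual code - the overlap length and the naive stitching are made up, just to show where the boundary errors come from):

```python
import numpy as np

SAMPLE_RATE = 16_000
CHUNK_S = 30      # Whisper's fixed 30-second input window
OVERLAP_S = 5     # hypothetical overlap between consecutive windows

def sliding_chunks(audio: np.ndarray):
    """Yield overlapping 30 s windows from a 1-D 16 kHz waveform."""
    size = CHUNK_S * SAMPLE_RATE
    step = (CHUNK_S - OVERLAP_S) * SAMPLE_RATE
    for start in range(0, len(audio), step):
        yield audio[start:start + size]
        if start + size >= len(audio):
            break

def transcribe_long(audio: np.ndarray, transcribe_chunk) -> str:
    """transcribe_chunk: any function that turns a <=30 s window into text."""
    pieces = [transcribe_chunk(chunk) for chunk in sliding_chunks(audio)]
    # Naive stitching of the overlapping transcripts - exactly where the
    # boundary/overlap errors described above get introduced.
    return " ".join(pieces)
```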
I've yet to get deep into WhisperKit, but considering it has those same hallmark errors when shifting from one chunk to the next, I can only assume it too has a stitch-together implementation.
KV cache reuse / keeping a running KV cache would eliminate those errors. It would also greatly reduce the complexity of implementing custom logic for processing multiple chunks and stitching them together in a sliding-window fashion. You could just have one stream of audio coming in, and one stream of decoded text coming out.
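Purely as a sketch of the kind of interface I'm imagining - nothing here is a real API, and the extend method is made up:

```python
class StreamingTranscriber:
    """Hypothetical: audio streams in, text streams out, one running cache."""

    def __init__(self, model):
        self.model = model
        self.cache = None  # running encoder/decoder state (doesn't exist today)

    def feed(self, new_audio) -> str:
        # Update the cache with only the newly arrived audio and return
        # only the newly decoded text - no re-chunking, no stitching.
        self.cache, new_text = self.model.extend(self.cache, new_audio)
        return new_text
```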
Cleaner code, no need to compute overlapping sections more than once, no reduction in transcript accuracy versus doing inference on a static file... IMO it seems too good to be true. It leads me to think that maybe run-on prompt caching like we see with LLMs just isn't possible with Whisper? That seems like the simplest explanation, but I don't understand why that would be the case. Does anyone happen to know?