🌐
GitHub
github.com › m-bain › whisperX
GitHub - m-bain/whisperX: WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization) - m-bain/whisperX
Starred by 19.2K users
Forked by 2K users
Languages   Python
🌐
GitHub
github.com › m-bain › whisperX › issues › 476
Streaming with whisperx · Issue #476 · m-bain/whisperX
September 19, 2023 - Is there a repo or code that allows for real-time streaming with whisperx? Thank you!
Published   Sep 19, 2023
🌐
GitHub
github.com › ufal › whisper_streaming
GitHub - ufal/whisper_streaming: Whisper realtime streaming for long speech-to-text transcription and translation
Whisper realtime streaming for long speech-to-text transcription and translation - ufal/whisper_streaming
Starred by 3.5K users
Forked by 410 users
Languages   Python
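The repo wraps a Whisper backend (typically faster-whisper) in an online processor that buffers incoming audio and incrementally commits text once it stabilizes. A minimal sketch based on the usage documented in the project's README (the audio_chunks source is a placeholder you would supply yourself):

from whisper_online import FasterWhisperASR, OnlineASRProcessor

asr = FasterWhisperASR("en", "large-v2")    # source language, Whisper model size
online = OnlineASRProcessor(asr)            # buffers audio and commits stable text

for chunk in audio_chunks:                  # placeholder: your stream of 16 kHz float32 chunks
    online.insert_audio_chunk(chunk)
    partial = online.process_iter()         # partial output confirmed so far
    print(partial)

print(online.finish())                      # flush whatever remains when the stream ends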
🌐
Modal
modal.com › blog › open-source-stt
The Top Open Source Speech-to-Text (STT) Models in 2025
WhisperX (Community Project): This is a third-party transcription pipeline designed specifically to work with the Whisper family of models. It adds precise word-level time stamps, speaker diarization, and much better handling of larger audio files through intelligent chunking.
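The three stages that snippet describes (batched transcription, wav2vec2 forced alignment for word timestamps, and pyannote-based speaker diarization) map onto the WhisperX API roughly as below. This is a sketch in the spirit of the project's README; exact module paths vary between versions, and the input file name and Hugging Face token are placeholders:

import whisperx

device = "cuda"
audio = whisperx.load_audio("audio.mp3")    # placeholder input file

# 1. Batched transcription on a faster-whisper backend
model = whisperx.load_model("large-v2", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# 2. wav2vec2 forced alignment -> word-level timestamps
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# 3. Speaker diarization (pyannote) and assigning speakers to words
diarize_model = whisperx.DiarizationPipeline(use_auth_token="HF_TOKEN", device=device)  # placeholder token
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)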
🌐
GitHub
github.com › SYSTRAN › faster-whisper
GitHub - SYSTRAN/faster-whisper: Faster Whisper transcription with CTranslate2
Feel free to add your project to the list! speaches is an OpenAI compatible server using faster-whisper. It's easily deployable with Docker, works with OpenAI SDKs/CLI, supports streaming, and live transcription. WhisperX is an award-winning Python library that offers speaker diarization and accurate word-level timestamps using wav2vec2 alignment
Starred by 19.5K users
Forked by 1.6K users
Languages   Python 99.9% | Dockerfile 0.1%
🌐
YouTube
youtube.com › watch
Best FREE Speech to Text AI - WhisperX - w/ Speaker Detection - YouTube
Discover WhisperX - the powerful open-source transcription tool that automatically detects speakers and transcribes audio/video 5x faster than regular Whispe...
Published   May 17, 2025
🌐
Replicate
replicate.com › daanelson › whisperx
daanelson/whisperx | Run with an API on Replicate
WhisperX provides fast automatic speech recognition (70x realtime with large-v2) with word-level timestamps and speaker diarization.
🌐
PyPI
pypi.org › project › faster-whisper
faster-whisper · PyPI
Feel free to add your project to the list! speaches is an OpenAI compatible server using faster-whisper. It's easily deployable with Docker, works with OpenAI SDKs/CLI, supports streaming, and live transcription. WhisperX is an award-winning Python library that offers speaker diarization and accurate word-level timestamps using wav2vec2 alignment
      » pip install faster-whisper
    
Published   Oct 31, 2025
Version   1.2.1
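To go with the install line above, a minimal faster-whisper usage sketch following the package's README (model size, device, and file name are placeholders):

from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# transcribe() returns a lazy generator of segments plus detected-language info
segments, info = model.transcribe("audio.mp3", beam_size=5)
print("Detected language:", info.language)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")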
🌐
YouTube
youtube.com › watch
Can Whisper be used for real-time streaming ASR? - YouTube
Try Voice Writer - speak your thoughts and let AI handle the grammar: https://voicewriter.io Whisper is a robust Automatic Speech Recognition (ASR) model by O…
Published   March 31, 2024
🌐
Reddit
reddit.com › r/learnmachinelearning › i used whisperx model to give me real time transcription of a user when they speak through the mic but i dont get how to get through few problems
r/learnmachinelearning on Reddit: I used whisperx model to give me real time transcription of a user when they speak through the mic but i dont get how to get through few problems
July 1, 2024 -

import sounddevice as sd
import numpy as np
import whisperx
import queue
import threading

# Initialize the WhisperX model (choose the model size you prefer)
model = whisperx.load_model("base")  # Change "base" to "tiny", "small", "medium", or "large" as needed

# Parameters
sample_rate = 16000          # WhisperX works best with 16kHz audio
chunk_duration = 5           # Duration of each audio chunk in seconds
buffer_size = int(sample_rate * chunk_duration)
audio_queue = queue.Queue()

def audio_callback(indata, frames, time, status):
    """Callback function to process audio chunks."""
    if status:
        print(status, flush=True)
    audio_queue.put(indata.copy())

def transcribe_audio():
    """Thread function to transcribe audio chunks."""
    while True:
        audio_chunk = audio_queue.get()
        if audio_chunk is None:
            break
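The snippet is cut off inside transcribe_audio. A hedged sketch of how the loop and the microphone stream could be completed; the transcription call, thread start, and InputStream setup below are assumptions, not part of the quoted post:

        # assumed continuation of the while-loop body in transcribe_audio()
        audio = np.squeeze(audio_chunk).astype(np.float32)   # mono float32 at 16 kHz
        result = model.transcribe(audio)                     # returns {"segments": [...], "language": ...}
        for segment in result.get("segments", []):
            print(segment["text"], flush=True)

# start the worker thread and the microphone stream
threading.Thread(target=transcribe_audio, daemon=True).start()
with sd.InputStream(samplerate=sample_rate, channels=1,
                    blocksize=buffer_size, callback=audio_callback):
    input("Press Enter to stop...\n")
audio_queue.put(None)   # signal the worker to exit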
🌐
Medium
medium.com › @david.richards.tech › how-to-build-a-streaming-whisper-websocket-service-1528b96b1235
How to Build a Streaming Open-Source Whisper WebSocket Service | by David Richards | Medium
January 22, 2024 - How to Build a Streaming Open-Source Whisper WebSocket Service In today’s fast-paced digital world, the ability to convert spoken language into text in real-time has become essential for various …
🌐
Modal
modal.com › blog › faster-transcription
5 Ways to Speed Up Whisper Transcription
If you need real-time transcription, the standard open-source Whisper library (which processes 30-second chunks) won’t cut it. Instead, use Whisper Streaming, which enables:
🌐
Stack Overflow
stackoverflow.com › questions › 78496897 › how-to-create-a-live-voice-activity-detection-vad-loop-for-whisperx
python - How to create a live voice activity detection (VAD) loop for whisperX? - Stack Overflow
I am using whisperX speech-to-text model to convert my voice into text input for a locally hosted LLM. Right now, I have it set up where I can record an audio file, and then load it into whisperX. ...
🌐
Reddit
reddit.com › r/localllama › i compared the different open source whisper packages for long-form transcription
r/LocalLLaMA on Reddit: I compared the different open source whisper packages for long-form transcription
March 30, 2024 -

Hey everyone!

I hope you're having a great day.

I recently compared all the open source whisper-based packages that support long-form transcription.

Long-form transcription is basically transcribing audio files that are longer than whisper's input limit, which is 30 seconds. This can be useful if you want to chat with a youtube video or podcast etc.

I compared the following packages:

  1. OpenAI's official whisper package

  2. Huggingface Transformers

  3. Huggingface BetterTransformer (aka Insanely-fast-whisper)

  4. FasterWhisper

  5. WhisperX

  6. Whisper.cpp

I compared between them in the following areas:

  1. Accuracy - using word error rate (WER) and character error rate (CER)

  2. Efficiency - using VRAM usage and latency

I've written a detailed blog post about this. If you just want the results, here they are:

For all metrics, lower is better

If you have any comments or questions please leave them below.

🌐
OpenAI Developer Community
community.openai.com › api
Whisper Streaming Strategy - API - OpenAI Developer Community
July 29, 2024 - The Whisper speech-to-text API does not yet support streaming. This would be a great feature. I’m trying to think of ways I can take advantage of Whisper with my Assistant. A moderate response can take 7-10 sec to process, which is a bit slow.
🌐
AWS
aws.amazon.com › blogs › containers › host-the-whisper-model-with-streaming-mode-on-amazon-eks-and-ray-serve
Host the Whisper Model with Streaming Mode on Amazon EKS and Ray Serve | Amazon Web Services
June 20, 2024 - In the post, we explore how to build an ML inference solution based on pure Python and Ray Serve that can run locally on a single Amazon Elastic Compute Cloud (Amazon EC2) instance, and expose the streaming ASR service through WebSocket protocol.
🌐
Reddit
reddit.com › r/localllama › whisper (whisper.cpp/whisperkit) for live transcription - why no prompt caching?
r/LocalLLaMA on Reddit: Whisper (Whisper.cpp/WhisperKit) for live transcription - why no prompt caching?
November 29, 2024 -

Hi everyone! Some quick questions for today:

  1. Why do most streaming-based implementations of Whisper process incoming audio in chunks and then stitch the transcript together?

  2. Why not cache the encoded content and then keep that in memory and simply encode more incoming audio?

  3. If Whisper is an autoregressive model, and it encodes audio in a sequential manner... why not just keep a running KV cache of encoded audio and update it? Why process in separate batches?

We see this kind of run-on caching a lot in e.g. LLM backends - Llama.cpp and MLX_lm for instance both implement prompt caching. The encoded KV cache is saved so that next time a prompt is passed in, the already encoded part of the conversation history doesn't need to be calculated again.

And yet I can't find any open source implementations of Whisper that do this - unless I'm just really misunderstanding the code (which is very possible). From what I can see of the codebase, Whisper.cpp seems to do sliding chunks and stitch them together. And you can see the pitfalls when you use it for live transcription; there are clear errors introduced where the chunks overlap and get stitched together.

I've yet to get deep into WhisperKit, but considering it has those same hallmark errors when shifting from one chunk to the next, I can only assume it too has a stitch-together implementation.

KV cache reuse / keeping a running KV cache would eliminate those errors. It would also greatly reduce the complexity of implementing custom logic for processing multiple chunks and stitching them together in a sliding-window fashion. You could just have one stream of audio coming in, and one stream of decoded text coming out.

Cleaner code, no having to compute overlapping sections more than once, no reduction in transcript accuracy versus doing inference on a static file... IMO it seems too good to be true. It leads me to think that maybe run-on prompt caching like we see with LLMs simply isn't possible with Whisper? That seems the simplest explanation. But I don't understand why that's the case. Anyone happen to know?
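To make the stitching problem concrete, here is a toy sketch (illustrative only, not whisper.cpp's actual code) of the sliding-window approach described above; transcribe stands in for any Whisper backend that returns plain text:

# Toy illustration of chunked streaming with overlap-and-stitch.
CHUNK_S, OVERLAP_S = 30.0, 5.0

def stream_transcribe(audio, sample_rate, transcribe):
    step = int((CHUNK_S - OVERLAP_S) * sample_rate)
    size = int(CHUNK_S * sample_rate)
    pieces = []
    for start in range(0, len(audio), step):
        chunk = audio[start:start + size]
        pieces.append(transcribe(chunk))   # the overlap region gets decoded twice
    # Each piece repeats ~OVERLAP_S seconds of the previous one; reconciling the
    # repeated words heuristically is where the boundary errors described above come from.
    return pieces

A running KV cache, as the post suggests, would avoid re-decoding the overlap entirely, at the cost of keeping the encoder/decoder state resident between chunks.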