Hey everyone!
I hope you're having a great day.
I recently compared all the open source whisper-based packages that support long-form transcription.
Long-form transcription is basically transcribing audio files that are longer than Whisper's input limit, which is 30 seconds. This can be useful if you want to chat with a YouTube video or podcast, etc.
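For context, here's roughly what long-form transcription looks like from the user side with one of these packages - faster-whisper in this sketch; the model size and the file name are just placeholders:

```python
from faster_whisper import WhisperModel

# Placeholder model size and audio file; faster-whisper works through audio
# longer than 30 seconds window by window internally.
model = WhisperModel("small", device="cuda", compute_type="float16")  # or device="cpu"
segments, info = model.transcribe("podcast.mp3", vad_filter=True)

for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```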
I compared the following packages:
OpenAI's official whisper package
Huggingface Transformers
Huggingface BetterTransformer (aka Insanely-fast-whisper)
FasterWhisper
WhisperX
Whisper.cpp
I compared them in the following areas:
Accuracy - using word error rate (WER) and character error rate (CER); see the quick sketch after this list
Efficiency - using VRAM usage and latency
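If you want to compute these metrics yourself, here's a quick sketch using the jiwer package (just one common choice, not necessarily what the benchmark used; the reference/hypothesis strings are toy examples):

```python
import jiwer

# Toy reference transcript and model output, purely for illustration.
reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown box jumps over a lazy dog"

wer = jiwer.wer(reference, hypothesis)  # word error rate
cer = jiwer.cer(reference, hypothesis)  # character error rate
print(f"WER: {wer:.3f}  CER: {cer:.3f}")
```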
I've written a detailed blog post about this. If you just want the results, here they are:
For all metrics, lower is better. If you have any comments or questions, please leave them below.
Hi everyone! Some quick questions for today:
Why do most streaming-based implementations of Whisper process incoming audio in chunks and then stitch the transcript together?
Why not cache the encoded content and then keep that in memory and simply encode more incoming audio?
If Whisper is an autoregressive model, and it encodes audio in a sequential manner... why not just keep a running KV cache of encoded audio and update it? Why process in separate batches?
We see this kind of run-on caching a lot in LLM backends - Llama.cpp and MLX_lm, for instance, both implement prompt caching. The encoded KV cache is saved so that the next time a prompt is passed in, the already encoded part of the conversation history doesn't need to be calculated again.
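To illustrate the pattern I mean (not how Llama.cpp or MLX_lm implement it internally), here's a rough sketch of KV-cache reuse with Hugging Face transformers, using GPT-2 purely as a stand-in:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# 1) Run the conversation history through the model once and keep the KV cache.
history_ids = tok("The conversation so far.", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(input_ids=history_ids, use_cache=True)
cache = out.past_key_values  # keys/values for every token processed so far

# 2) When new text arrives, only the new tokens go through the model;
#    the cached history is reused instead of being recomputed.
new_ids = tok(" A new message arrives.", return_tensors="pt").input_ids
attn = torch.ones(1, history_ids.shape[1] + new_ids.shape[1], dtype=torch.long)
with torch.no_grad():
    out = model(input_ids=new_ids, attention_mask=attn,
                past_key_values=cache, use_cache=True)
cache = out.past_key_values  # cache now covers history + new tokens
```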
And yet I can't find any open source implementations of Whisper that do this - unless I'm just really misunderstanding the code (which is very possible). From what I can see of the codebase, Whisper.cpp seems to do sliding chunks and stitch them together. And you can see the pitfalls when you use it for live transcription: there are clear errors introduced where the chunks overlap and get stitched together.
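As a toy sketch of that chunk-and-stitch pattern (this is not Whisper.cpp's actual code - the overlap length and the naive stitching are made up, just to show where the boundary errors come from):

```python
import numpy as np

SAMPLE_RATE = 16_000
CHUNK_S = 30      # Whisper's fixed 30-second input window
OVERLAP_S = 5     # hypothetical overlap between consecutive windows

def sliding_chunks(audio: np.ndarray):
    """Yield overlapping 30 s windows from a 1-D 16 kHz waveform."""
    size = CHUNK_S * SAMPLE_RATE
    step = (CHUNK_S - OVERLAP_S) * SAMPLE_RATE
    for start in range(0, len(audio), step):
        yield audio[start:start + size]
        if start + size >= len(audio):
            break

def transcribe_long(audio: np.ndarray, transcribe_chunk) -> str:
    """transcribe_chunk: any function that turns a <=30 s window into text."""
    pieces = [transcribe_chunk(chunk) for chunk in sliding_chunks(audio)]
    # Naive stitching of the overlapping transcripts - exactly where the
    # boundary/overlap errors described above get introduced.
    return " ".join(pieces)
```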
I've yet to get deep into WhisperKit, but considering it has those same hallmark errors when shifting from one chunk to the next, I can only assume it too has a stitch-together implementation.
KV cache reuse / keeping a running KV cache would eliminate those errors. It would also greatly reduce the complexity of implementing custom logic for processing multiple chunks and stitching them together in a sliding-window fashion. You could just have one stream of audio coming in, and one stream of decoded text coming out.
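Purely as a sketch of the kind of interface I'm imagining - nothing here is a real API, and the extend method is made up:

```python
class StreamingTranscriber:
    """Hypothetical: audio streams in, text streams out, one running cache."""

    def __init__(self, model):
        self.model = model
        self.cache = None  # running encoder/decoder state (doesn't exist today)

    def feed(self, new_audio) -> str:
        # Update the cache with only the newly arrived audio and return
        # only the newly decoded text - no re-chunking, no stitching.
        self.cache, new_text = self.model.extend(self.cache, new_audio)
        return new_text
```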
Cleaner code, no need to compute overlapping sections more than once, no reduction in transcript accuracy versus doing inference on a static file... IMO it seems too good to be true. It leads me to think that maybe run-on prompt caching like we see with LLMs just isn't possible with Whisper? That seems like the simplest explanation, but I don't understand why that would be the case. Does anyone happen to know?