I compared the different open source whisper packages for long-form transcription
Hey everyone!
I hope you're having a great day.
I recently compared all the open-source Whisper-based packages that support long-form transcription.
Long-form transcription means transcribing audio files that are longer than Whisper's input limit of 30 seconds. This is useful if you want to chat with a YouTube video, a podcast, and so on.
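To make the 30-second limit concrete, here is a minimal sketch of the most naive long-form strategy: cutting the audio into consecutive fixed-length windows and transcribing each one separately. The function name `chunk_spans` and its parameters are made up for illustration, and the 16 kHz default is an assumption about the input audio. The packages compared here each use their own, smarter chunking, so don't read this as how any of them works.

```python
def chunk_spans(num_samples, sample_rate=16_000, window_s=30.0):
    """Return (start, end) sample indices of consecutive fixed-length windows.

    Hypothetical helper: plain fixed-window splitting with no overlap and
    no voice activity detection. The last window may be shorter.
    """
    if num_samples < 0 or sample_rate <= 0 or window_s <= 0:
        raise ValueError("num_samples must be >= 0; sample_rate and window_s must be > 0")
    window = int(sample_rate * window_s)  # samples per window
    return [
        (start, min(start + window, num_samples))
        for start in range(0, num_samples, window)
    ]


# A 70-second clip at 16 kHz splits into two full windows and one short one.
spans = chunk_spans(num_samples=16_000 * 70)
print(spans)
```

Cutting at fixed positions can split a word in half, which is one reason real implementations add overlap or cut on silence instead.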
I compared the following packages:
OpenAI's official whisper package
Hugging Face Transformers
Hugging Face BetterTransformer (aka Insanely-fast-whisper)
FasterWhisper
WhisperX
Whisper.cpp
I compared them in the following areas:
Accuracy - using word error rate (WER) and character error rate (CER)
Efficiency - using VRAM usage and latency
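For anyone unfamiliar with the accuracy metrics: WER and CER are both an edit distance divided by the length of the reference, counted in words or in characters. Below is a minimal sketch of that definition. The function names are my own, and there is no text normalization (lowercasing, punctuation stripping), which real evaluations usually apply first. It is not necessarily the exact tooling used for the numbers in this comparison.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute/insert/delete cost 1)."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference items to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One dropped word out of six reference words.
print(wer("the cat sat on the mat", "the cat sat on mat"))
print(cer("abcd", "abxd"))
```

Both can go above 1.0 when the hypothesis has many extra words or characters, so they are error rates rather than percentages capped at 100.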
I've written a detailed blog post about this. If you just want the results, here they are:
For all metrics, lower is better.
If you have any comments or questions, please leave them below.