A month ago, Meta AI released W2v-BERT 2.0, one of the building blocks of their Seamless models.
It's been pretrained on 4.5M hours of unlabeled audio data, covering more than 143 languages.
Pros:
- Enables low-resource fine-tuning
- Faster and lighter than Whisper
- MIT license
- Can be fine-tuned for other audio tasks (see the encoder sketch after this list)
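For a concrete starting point, here's a minimal sketch of loading the pretrained encoder with the transformers library and extracting speech embeddings. The facebook/w2v-bert-2.0 checkpoint and model classes come from the docs linked below; the random audio and printed shape are just for illustration:

```python
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2BertModel

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/w2v-bert-2.0")
model = Wav2Vec2BertModel.from_pretrained("facebook/w2v-bert-2.0")

# One second of random noise at 16 kHz stands in for real speech here.
audio = torch.randn(16_000).numpy()
inputs = feature_extractor(audio, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Frame-level embeddings, usable as features for downstream audio tasks.
print(outputs.last_hidden_state.shape)  # (1, num_frames, 1024)
```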
Cons:
- CTC-based, so fine-tuned models produce normalized transcriptions (no casing or punctuation)
- Needs to be fine-tuned before use, since it's a pretrained encoder rather than an off-the-shelf ASR model (see the CTC setup sketch after this list)
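And a hedged sketch of the usual ASR setup: attaching a CTC head for fine-tuning, following the pattern in the Mongolian blog post linked below. The processor path is a hypothetical placeholder; you'd first build a character-level tokenizer for your target language as the post describes:

```python
from transformers import Wav2Vec2BertForCTC, Wav2Vec2BertProcessor

# Hypothetical processor repo: pairs the feature extractor with a
# character-level CTC tokenizer built for the target language.
processor = Wav2Vec2BertProcessor.from_pretrained("your-username/your-processor")

model = Wav2Vec2BertForCTC.from_pretrained(
    "facebook/w2v-bert-2.0",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),  # size the CTC head to your vocab
)
# From here, train on (audio, normalized transcription) pairs,
# e.g. with the Trainer API as in the blog post.
```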
Resources:
- Original repository: https://github.com/facebookresearch/seamless_communication?tab=readme-ov-file#whats-new
- Transformers docs: https://huggingface.co/docs/transformers/main/en/model_doc/wav2vec2-bert
- Blog post on fine-tuning for Mongolian ASR: https://huggingface.co/blog/fine-tune-w2v2-bert