[N] Meta open-sourced a wav2vec2 model pre-trained on 4.5M hours
A month ago, Meta AI released W2V-BERT 2.0, one of the building blocks of their Seamless models.
It has been pre-trained on 4.5M hours of unlabeled audio data covering more than 143 languages.
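To get a feel for what the pre-trained encoder gives you, here is a minimal sketch that extracts speech representations with transformers (assuming the facebook/w2v-bert-2.0 checkpoint on the Hub and transformers >= 4.37):

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2BertModel

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/w2v-bert-2.0")
model = Wav2Vec2BertModel.from_pretrained("facebook/w2v-bert-2.0")

# One second of silent 16 kHz audio as a stand-in for a real waveform.
waveform = np.zeros(16000, dtype=np.float32)
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    # Contextual speech representations, one hidden vector per audio frame.
    hidden_states = model(**inputs).last_hidden_state
```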
Pros:
- Enables low-resource fine-tuning (see the sketch after this list)
- Faster and lighter than Whisper
- MIT license
- Can be fine-tuned for other audio tasks
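As a rough illustration of the low-resource fine-tuning setup (following the recipe in the blog post linked under Resources), this sketch attaches a fresh CTC head to the pre-trained encoder. Note that vocab.json is a hypothetical character vocabulary built from your own transcripts:

```python
from transformers import (
    SeamlessM4TFeatureExtractor,
    Wav2Vec2BertForCTC,
    Wav2Vec2BertProcessor,
    Wav2Vec2CTCTokenizer,
)

# Hypothetical character-level vocab built from your labeled transcripts.
tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)
feature_extractor = SeamlessM4TFeatureExtractor.from_pretrained("facebook/w2v-bert-2.0")
processor = Wav2Vec2BertProcessor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# A randomly initialized CTC head is added on top of the pre-trained encoder;
# its output size must match the tokenizer's vocabulary.
model = Wav2Vec2BertForCTC.from_pretrained(
    "facebook/w2v-bert-2.0",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
```

From here, training proceeds like any other CTC fine-tuning run (e.g. with Trainer), which the Mongolian blog post walks through end to end.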
Cons:
- CTC-based, so it produces normalized transcriptions (typically no casing or punctuation; see the inference sketch after this list)
- Needs to be fine-tuned before it can be used
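To see the normalization caveat concretely, here is a minimal inference sketch; "your-org/w2v-bert-2.0-mongolian" is a hypothetical fine-tuned checkpoint id, so substitute your own:

```python
import numpy as np
import torch
from transformers import Wav2Vec2BertForCTC, Wav2Vec2BertProcessor

# Hypothetical fine-tuned checkpoint saved with processor.save_pretrained().
ckpt = "your-org/w2v-bert-2.0-mongolian"
processor = Wav2Vec2BertProcessor.from_pretrained(ckpt)
model = Wav2Vec2BertForCTC.from_pretrained(ckpt)

waveform = np.zeros(16000, dtype=np.float32)  # stand-in for real 16 kHz audio
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Greedy decoding: argmax per frame, then the CTC tokenizer collapses
# repeats and strips blanks. The output is normalized text (typically
# lowercase, no punctuation), so restore casing/punctuation downstream.
ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(ids)[0])
```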
Resources:
- Original repository: https://github.com/facebookresearch/seamless_communication?tab=readme-ov-file#whats-new
- Transformers docs: https://huggingface.co/docs/transformers/main/en/model_doc/wav2vec2-bert
- Blog post on fine-tuning for Mongolian ASR: https://huggingface.co/blog/fine-tune-w2v2-bert