🌐
Hugging Face
huggingface.co › docs › transformers › en › model_doc › wav2vec2-bert
Wav2Vec2-BERT
Wav2Vec2-BERT follows the same architecture as Wav2Vec2-Conformer, but employs a causal depthwise convolutional layer and uses as input a mel-spectrogram representation of the audio instead of the raw waveform.
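Concretely, that means preprocessing goes through a feature extractor that converts the waveform into log-mel filterbank frames before the encoder sees it. A minimal sketch, assuming the facebook/w2v-bert-2.0 checkpoint and 16 kHz mono audio:

```python
import numpy as np
from transformers import AutoFeatureExtractor

# W2v-BERT 2.0's extractor turns raw audio into stacked log-mel filterbank frames.
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/w2v-bert-2.0")

waveform = np.zeros(16000, dtype=np.float32)  # placeholder: 1 s of 16 kHz audio
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")

# Note the key is `input_features` (mel frames), not `input_values` (raw samples)
# as in the original Wav2Vec2.
print(inputs.input_features.shape)  # (batch, num_frames, feature_dim)
```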
🌐
Kavyamanohar
kavyamanohar.com › home › wav2vec2 bert lm transcribing speech and evaluating models using huggingface transformers
Wav2Vec2-BERT+LM: Transcribing Speech and Evaluating Models using Huggingface Transformers
August 20, 2024 - Wav2Vec2-BERT predicts text tokens in a single pass, making it much faster than Whisper. The Wav2Vec2-BERT model is available in Hugging Face Transformers and can be fine-tuned for any low-resource ASR task with a custom token vocabulary.
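That single-pass prediction is plain CTC inference; roughly, it looks like this (the model id below is a placeholder for any CTC-fine-tuned checkpoint, not a real repo):

```python
import numpy as np
import torch
from transformers import Wav2Vec2BertForCTC, Wav2Vec2BertProcessor

checkpoint = "your-org/w2v-bert-2.0-ctc-finetuned"  # hypothetical fine-tuned checkpoint
processor = Wav2Vec2BertProcessor.from_pretrained(checkpoint)
model = Wav2Vec2BertForCTC.from_pretrained(checkpoint)

audio = np.zeros(16000, dtype=np.float32)  # stand-in for real 16 kHz speech
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # one forward pass, no autoregressive decoding loop

predicted_ids = torch.argmax(logits, dim=-1)              # greedy CTC path
transcription = processor.batch_decode(predicted_ids)[0]  # collapse repeats, drop blanks
print(transcription)
```

Adding an n-gram LM, as the post's title suggests, replaces the greedy argmax with beam-search decoding (e.g. via pyctcdecode), but the forward pass stays single-shot.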
🌐
arXiv
arxiv.org › abs › 2108.06209
[2108.06209] W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training
September 13, 2021 - Motivated by the success of masked ... for self-supervised speech representation learning. w2v-BERT is a framework that combines contrastive learning and MLM, where the former trains the model to discretize input continuous speech ...
🌐
Hugging Face
huggingface.co › facebook › w2v-bert-2.0
facebook/w2v-bert-2.0 · Hugging Face
wav2vec2-bert · arxiv: 2312.05187 · License: MIT · We are open-sourcing our Conformer-based W2v-BERT 2.0 speech encoder as described in Section 3.2.1 of the paper, which is at the core of our Seamless models.
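Using the released checkpoint as a plain speech encoder (no ASR head) looks roughly like this; a minimal sketch:

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2BertModel

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/w2v-bert-2.0")
model = Wav2Vec2BertModel.from_pretrained("facebook/w2v-bert-2.0")

audio = np.zeros(16000, dtype=np.float32)  # stand-in for real 16 kHz speech
inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Frame-level contextual embeddings, usable as features for downstream tasks.
print(outputs.last_hidden_state.shape)  # (batch, frames, hidden_size)
```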
🌐
Semantic Scholar
semanticscholar.org › papers › fusing wav2vec2.0 and bert into end-to-end model for low-resource speech recognition
Fusing Wav2vec2.0 and BERT into End-to-end Model for Low-resource Speech Recognition | Semantic Scholar
In this work, we propose an end-to-end model for the low-resource speech recognition, which fuses a pre-trained audio encoder (wav2vec2.0) and a pre-trained text decoder (BERT). The two modules are connected by a linear attention mechanism without ...
🌐
Hugging Face
huggingface.co › blog › fine-tune-w2v2-bert
Fine-Tune W2V2-Bert for low-resource ASR with 🤗 Transformers
Wav2Vec2-BERT is the result of a series of improvements based on an original model: Wav2Vec2, a pre-trained model for Automatic Speech Recognition (ASR) released in September 2020 by Alexei Baevski, Michael Auli, and Alex Conneau.
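At a high level, the blog initializes the CTC model on top of the pretrained encoder like this; a sketch where `processor` is assumed to already wrap a character tokenizer built from your training transcripts (see the post for the vocabulary-building step):

```python
from transformers import Wav2Vec2BertForCTC

# `processor` is assumed: a Wav2Vec2BertProcessor whose tokenizer was built
# from the characters of your training transcripts.
model = Wav2Vec2BertForCTC.from_pretrained(
    "facebook/w2v-bert-2.0",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),  # fresh CTC head sized to the new vocabulary
)
```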
🌐
Dataloop
dataloop.ai › home › library › models › wav2vec2 bert cv16 en
Wav2vec2 Bert CV16 En · Models · Dataloop
The Wav2vec2 Bert CV16 En model is a fine-tuned version of ylacombe/w2v-bert-2.0, trained on the MOZILLA-FOUNDATION/COMMON_VOICE_16_0 - EN dataset. It achieves a loss of 0.2427, WER of 0.1455, and CER of 0.0580 on the evaluation set.
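For reference, WER and CER as reported here can be computed with the 🤗 Evaluate library; a toy sketch:

```python
import evaluate  # pip install evaluate jiwer

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

predictions = ["the cat sat on the mat"]
references = ["the cat sat on a mat"]

# WER: word-level edits / reference word count; CER: the same at character level.
print(wer_metric.compute(predictions=predictions, references=references))
print(cer_metric.compute(predictions=predictions, references=references))
```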
🌐
Hugging Face
huggingface.co › spygaurad › wav2vec2-bert
spygaurad/wav2vec2-bert · Hugging Face
This model is a fine-tuned version of facebook/w2v-bert-2.0 on the common_voice_16_0 dataset.
🌐
Reddit
reddit.com › r/machinelearning › [n] meta open-sourced a wav2vec2 model pre-trained on 4.5m hours
r/MachineLearning on Reddit: [N] Meta open-sourced a wav2vec2 model pre-trained on 4.5M hours
October 1, 2023 -

A month ago, Meta AI released W2V-Bert, one of the building blocks of their Seamless models. 

It's been pretrained on 4.5M hours of unlabeled audio data, covering more than 143 languages.

Pros:

  • Enables low-resource fine-tuning

  • Faster and lighter than Whisper

  • MIT license

  • Can be fine-tuned for other audio tasks

Cons:

  • CTC-based, so it produces normalized transcriptions (lowercase, no punctuation; see the sketch after this list)

  • Needs to be fine-tuned before use
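To unpack the CTC point: the model emits one label per audio frame plus a blank token, and decoding just merges repeats and drops blanks, so the output is limited to the flat (typically lowercase, unpunctuated) character vocabulary it was fine-tuned on. A toy sketch of greedy CTC collapse, not the Transformers implementation:

```python
def ctc_greedy_collapse(frame_labels, blank="_"):
    """Collapse per-frame CTC labels: merge consecutive repeats, then drop blanks."""
    out, prev = [], None
    for label in frame_labels:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return "".join(out)

# Repeated frames merge; the blank between the two l's keeps them distinct.
print(ctc_greedy_collapse(list("hheel_loo")))  # -> "hello"
```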

Resources:

  • Original repository: https://github.com/facebookresearch/seamless_communication?tab=readme-ov-file#whats-new

  • Transformers docs: https://huggingface.co/docs/transformers/main/en/model_doc/wav2vec2-bert

  • ASR fine-tuning on Mongolian blog post: https://huggingface.co/blog/fine-tune-w2v2-bert

Top answer (1 of 2, score 8):
Paper: "Seamless: Multilingual Expressive and Streaming Speech Translation" , Barrault et al 2023: Large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model-SeamlessM4T v2. This newer model, incorporating an updated UnitY2 framework, was trained on more low-resource language data. SeamlessM4T v2 provides the foundation on which our next two models are initiated. SeamlessExpressive enables translation that preserves vocal styles and prosody. Compared to previous efforts in expressive speech research, our work addresses certain underexplored aspects of prosody, such as speech rate and pauses, while also preserving the style of one's voice. As for SeamlessStreaming, our model leverages the Efficient Monotonic Multihead Attention mechanism to generate low-latency target translations without waiting for complete source utterances. As the first of its kind, SeamlessStreaming enables simultaneous speech-to-speech/text translation for multiple source and target languages. To ensure that our models can be used safely and responsibly, we implemented the first known red-teaming effort for multimodal machine translation, a system for the detection and mitigation of added toxicity, a systematic evaluation of gender bias, and an inaudible localized watermarking mechanism designed to dampen the impact of deepfakes. Consequently, we bring major components from SeamlessExpressive and SeamlessStreaming together to form Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real-time. The contributions to this work are publicly released and accessible at this https URL .
Reply (2 of 2, score 1):
Isn't CTC-based an advantage, as it makes it much faster than an autoregressive model?
🌐
arXiv
arxiv.org › abs › 2207.04697
[2207.04697] Multi-level Fusion of Wav2vec 2.0 and BERT for Multimodal Emotion Recognition
July 12, 2022 - Abstract page for arXiv paper 2207.04697: Multi-level Fusion of Wav2vec 2.0 and BERT for Multimodal Emotion Recognition
🌐
Hugging Face
huggingface.co › docs › transformers › model_doc › wav2vec2
Wav2Vec2
Note: Meta (FAIR) released a new version, Wav2Vec2-BERT 2.0, pretrained on 4.5M hours of audio. We especially recommend using it for fine-tuning tasks, e.g.
🌐
Restack
restack.io › p › fine-tuning-answer-wav2vec2-bert-cat-ai
Fine-Tuning Wav2Vec2-BERT | Restackio
This approach allows for effective semi-supervised training, where the model is initially pre-trained on unlabeled data before being fine-tuned on a smaller labeled dataset. Notably, Wav2Vec2 can achieve state-of-the-art performance with as little as one hour of labeled data, showcasing its efficiency and practicality in real-world applications.
🌐
Hacker News
news.ycombinator.com › item
Fine-tune Wav2Vec2-BERT for low resource speech recognition | Hacker News
October 1, 2023 - The checkpoint is MIT licensed and available in Hugging Face Transformers, where you can fine-tune it to get comparable speech recognition results to Whisper, but with 10x faster inference. You only need 10 hours of audio data, and training can be run on a single Colab GPU
🌐
Lightning AI
lightning.ai › pashanitw › studios › w2v-bert-2-0-asr-finetuning
W2V-BERT-2.0-ASR-FineTuning
🌐
ResearchGate
researchgate.net › publication › 348589634_Fusing_Wav2vec20_and_BERT_into_End-to-end_Model_for_Low-resource_Speech_Recognition
Fusing Wav2vec2.0 and BERT into End-to-end Model for Low-resource Speech Recognition | Request PDF
January 17, 2021 - It indicates that the pretrain-and-finetune paradigm is a promising direction. In this work, we propose an end-to-end model for low-resource speech recognition, which fuses a pre-trained audio encoder (wav2vec2.0) and a pre-trained text decoder ...