🌐
Hugging Face
huggingface.co › docs › transformers › en › model_doc › wav2vec2
Wav2Vec2
The Wav2Vec2 model was proposed in wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
🌐
arXiv
arxiv.org › abs › 2006.11477
[2006.11477] wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
October 22, 2020 - We show for the first time that ... wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned....
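To make the quoted objective concrete, here is a minimal PyTorch sketch of one contrastive step (all tensor names are illustrative, and the paper's full loss additionally includes a codebook diversity term):

import torch
import torch.nn.functional as F

def contrastive_loss(c_t, q_t, distractors, temperature=0.1):
    """Sketch of the wav2vec 2.0 contrastive objective for one masked
    time step: identify the true quantized latent q_t among distractors.

    c_t:         (dim,)    context network output at the masked step
    q_t:         (dim,)    true quantized latent for that step
    distractors: (K, dim)  quantized latents sampled from other steps
    """
    candidates = torch.cat([q_t.unsqueeze(0), distractors], dim=0)  # (K+1, dim)
    # cosine similarity between the context vector and each candidate
    sims = F.cosine_similarity(c_t.unsqueeze(0), candidates, dim=-1) / temperature
    # the true latent sits at index 0; cross-entropy = -log softmax there
    target = torch.zeros(1, dtype=torch.long)
    return F.cross_entropy(sims.unsqueeze(0), target)

# usage with random stand-ins: 256-dim vectors, 100 distractors
loss = contrastive_loss(torch.randn(256), torch.randn(256), torch.randn(100, 256))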
Discussions

[N] Meta open-sourced a wav2vec2 model pre-trained on 4.5M hours
Paper: "Seamless: Multilingual Expressive and Streaming Speech Translation" , Barrault et al 2023: Large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model-SeamlessM4T v2. This newer model, incorporating an updated UnitY2 framework, was trained on more low-resource language data. SeamlessM4T v2 provides the foundation on which our next two models are initiated. SeamlessExpressive enables translation that preserves vocal styles and prosody. Compared to previous efforts in expressive speech research, our work addresses certain underexplored aspects of prosody, such as speech rate and pauses, while also preserving the style of one's voice. As for SeamlessStreaming, our model leverages the Efficient Monotonic Multihead Attention mechanism to generate low-latency target translations without waiting for complete source utterances. As the first of its kind, SeamlessStreaming enables simultaneous speech-to-speech/text translation for multiple source and target languages. To ensure that our models can be used safely and responsibly, we implemented the first known red-teaming effort for multimodal machine translation, a system for the detection and mitigation of added toxicity, a systematic evaluation of gender bias, and an inaudible localized watermarking mechanism designed to dampen the impact of deepfakes. Consequently, we bring major components from SeamlessExpressive and SeamlessStreaming together to form Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real-time. The contributions to this work are publicly released and accessible at this https URL . More on reddit.com
🌐 r/MachineLearning
3
45
October 2, 2023
How can ASR models like wav2vec2.0 handle arbitrary audio input length but whisper can't?
It comes down to wav2vec2 being an encoder-only model, while Whisper is an encoder-decoder. More on reddit.com
🌐 r/speechtech
7
3
January 27, 2024
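Concretely, because wav2vec2's convolutional encoder and Transformer have no fixed input window, the same checkpoint accepts waveforms of any length, memory permitting. A minimal sketch with the facebook/wav2vec2-base-960h checkpoint:

import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Two dummy waveforms of very different lengths: 3 s and 45 s at 16 kHz.
for seconds in (3, 45):
    wav = torch.randn(16000 * seconds).numpy()
    inputs = processor(wav, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    print(seconds, "s ->", logits.shape)  # time dimension scales with input

Self-attention memory still grows quadratically with length, which is why very long files are often chunked in practice; Whisper, by contrast, always pads or trims its input to fixed 30-second windows.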
Adapted Wav2Vec2 for ECG Classification: Help Needed!
I've tried to get stuff like this to work. Commenting in order to come back and see how this goes More on reddit.com
🌐 r/deeplearning
2
5
February 9, 2024
[D] Wav2Vec2 maximum inputs are audios of 10 sec?
I am currently using facebook/wav2vec2-base model for an audio classification task. More on reddit.com
🌐 r/MachineLearning
3
1
October 31, 2023
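For clips longer than the ~10 s the poster mentions, the limit is usually memory rather than the model itself. A common workaround (not something prescribed in the thread) is to chunk the waveform and pool the per-chunk predictions; a hypothetical helper for a classifier such as Wav2Vec2ForSequenceClassification, with illustrative names throughout:

import torch

def classify_long_audio(model, processor, waveform, sr=16000, chunk_s=10):
    """Hypothetical helper: split a long 1-D numpy waveform into
    fixed-size chunks, classify each, and average the logits."""
    chunk = sr * chunk_s
    pieces = [waveform[i:i + chunk] for i in range(0, len(waveform), chunk)]
    all_logits = []
    for piece in pieces:
        inputs = processor(piece, sampling_rate=sr, return_tensors="pt")
        with torch.no_grad():
            all_logits.append(model(inputs.input_values).logits)
    return torch.cat(all_logits).mean(dim=0)  # pooled class scores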
🌐
Meta
ai.meta.com › blog › wav2vec-20-learning-the-structure-of-speech-from-raw-audio
Wav2vec 2.0: Learning the structure of speech from raw audio
September 24, 2020 - Facebook AI is releasing code and models for wav2vec 2.0, a self-supervised algorithm that enables automatic speech recognition models with just 10 minutes of transcribed speech data.
🌐
Mohitmayank
mohitmayank.com › a_lazy_data_science_guide › audio_intelligence › wav2vec2
Wav2Vec2 Model - A Lazy Data Science Guide
To pre-train the model, Wav2Vec2 masks a portion of the feature encoder's output time steps, similar to masked language modelling.
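A rough sketch of that span masking over encoder time steps, using the mask probability (0.065) and span length (10) reported in the paper:

import torch

def make_span_mask(num_steps, mask_prob=0.065, span_len=10):
    """Sketch of wav2vec 2.0-style time-step masking: sample starting
    indices and mask the span that follows each one."""
    mask = torch.zeros(num_steps, dtype=torch.bool)
    num_starts = int(mask_prob * num_steps)
    starts = torch.randperm(num_steps - span_len)[:num_starts]
    for s in starts:
        mask[s:s + span_len] = True
    return mask  # masked positions are replaced by a learned vector

mask = make_span_mask(500)
print(mask.float().mean())  # spans overlap, so roughly half the steps are masked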
🌐
Hugging Face
huggingface.co › facebook › wav2vec2-base-960h
facebook/wav2vec2-base-960h · Hugging Face
January 16, 2024 -
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
from jiwer import wer

librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

def map_to_pred(batch):
    input_values = processor(batch["audio"]["array"], return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to("cuda")).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    batch["transcription"] = transcription
    return batch

result = librispeech_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["audio"])
print("WER:", wer(result["text"], result["transcription"]))
🌐
GitHub
github.com › oliverguhr › wav2vec2-live
GitHub - oliverguhr/wav2vec2-live: Live speech recognition using Facebook's wav2vec 2.0 model.
Live speech recognition using Facebook's wav2vec 2.0 model. - oliverguhr/wav2vec2-live
Starred by 374 users
Forked by 58 users
Languages   Python
🌐
HackerNoon
hackernoon.com › wav2vec2-for-automatic-speech-recognition-in-plain-english
wav2vec2 for Automatic Speech Recognition In Plain English | HackerNoon
March 13, 2024 - Plain English description of how Meta AI Research's wav2vec2 model works with respect to automatic speech recognition (ASR).
🌐
Hugging Face
huggingface.co › facebook › wav2vec2-base
facebook/wav2vec2-base · Hugging Face
July 25, 2025 - wav2vec2 · pretraining · speech · arxiv: 2006.11477 · License: apache-2.0 · Facebook's Wav2Vec2: the base model, pretrained on 16 kHz sampled speech audio.
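Since the checkpoint expects 16 kHz input, audio at other sample rates should be resampled before it reaches the model; a minimal torchaudio sketch (the file name is hypothetical):

import torchaudio

waveform, sr = torchaudio.load("clip.wav")  # hypothetical input file
if sr != 16000:
    # wav2vec2 checkpoints are pretrained on 16 kHz speech
    waveform = torchaudio.functional.resample(waveform, sr, 16000)
waveform = waveform.mean(dim=0)  # downmix to mono if the file is stereo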
🌐
GeeksforGeeks
geeksforgeeks.org › nlp › wav2vec2-self-a-supervised-learning-technique-for-speech-representations
Wav2Vec2: A Self-Supervised Learning Technique for Speech Representations - GeeksforGeeks
July 23, 2025 - Wav2Vec2 stands as a testament to the transformative potential of self-supervised training, particularly in speech processing. Its architecture is tailored to harness vast amounts of unlabeled speech data, distilling intricate patterns and nuances to create a rich and generalized understanding of spoken language.
🌐
Medium
medium.com › @shiryc › from-wav2vec2-to-decoded-sentences-9078e5b56d1f
From Wav2Vec2 to Decoded Sentences | by Shiry Yonash | Medium
June 6, 2022 - The first component of Wav2Vec2 consists of a stack of CNN layers that are used to extract acoustically meaningful — but contextually independent — features from the raw speech signal. According to the Wav2Vec2 paper, this part of the model has already been sufficiently trained during pre-training and does not need to be fine-tuned anymore.
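The transformers implementation exposes this convention directly: freezing the CNN feature encoder so that fine-tuning only updates the Transformer layers and the task head. A minimal sketch:

from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base")
# Freeze the CNN feature encoder: per the paper, it is not fine-tuned.
model.freeze_feature_encoder()

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters after freezing: {trainable:,}")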
🌐
YouTube
youtube.com › watch
Wav2vec2 A Framework for Self-Supervised Learning of ...
🌐
AWS
aws.amazon.com › blogs › machine-learning › fine-tune-and-deploy-a-wav2vec2-model-for-speech-recognition-with-hugging-face-and-amazon-sagemaker
Fine-tune and deploy a Wav2Vec2 model for speech recognition with Hugging Face and Amazon SageMaker | Artificial Intelligence
May 25, 2022 - Then the model is fine-tuned on labeled data with the Connectionist Temporal Classification (CTC) algorithm for specific ASR tasks. The base model we use in this post is Wav2Vec2-Base-960h, fine-tuned on 960 hours of Librispeech on 16 kHz sampled speech audio.
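In the transformers API, passing labels to Wav2Vec2ForCTC makes the forward pass compute the CTC loss; a minimal sketch of one training step, with random audio standing in for real data:

import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

audio = torch.randn(16000 * 2).numpy()  # 2 s of dummy 16 kHz audio
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
# Tokenize the target transcription with the CTC character tokenizer.
labels = processor.tokenizer("HELLO WORLD", return_tensors="pt").input_ids

outputs = model(inputs.input_values, labels=labels)  # loss is the CTC loss
outputs.loss.backward()
print("CTC loss:", outputs.loss.item())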
🌐
Meta
ai.meta.com › research › impact › wav2vec
Wav2vec
🌐
PyTorch
docs.pytorch.org › audio › main › generated › torchaudio.models.Wav2Vec2Model.html
Wav2Vec2Model — Torchaudio 2.8.0 documentation
Tutorials using Wav2Vec2Model: Speech Recognition with Wav2Vec2 · ASR Inference with CTC Decoder · Forced Alignment with Wav2Vec2 · Wav2Vec2Model.forward(waveforms: Tensor, lengths: Optional[Tensor] = None) → Tuple[Tensor, Optional[Tensor]]
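The matching torchaudio usage, via the bundled ASR pipeline whose forward signature is the one quoted above:

import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model()

# forward(waveforms, lengths) -> (emissions, lengths), per the docs above
waveform = torch.randn(1, int(bundle.sample_rate * 3))  # 3 s of dummy audio
with torch.inference_mode():
    emissions, lengths = model(waveform)
print(emissions.shape)  # (batch, frames, vocabulary)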
🌐
Wolfram
resources.wolframcloud.com › NeuralNetRepository › resources › Wav2Vec2-Trained-on-LibriSpeech-Data
Wav2Vec2 - Wolfram Neural Net Repository
June 12, 2023 - This family of models was trained using self-supervised learning in order to learn powerful representations from speech audio alone, followed by fine-tuning on transcribed speech. At training time, Wav2Vec2 encodes raw speech audio into latent speech representations via a multilayer convolutional neural network.
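Both stages are visible in the transformers output object: extract_features holds the convolutional latents and last_hidden_state the contextual Transformer representations. A minimal sketch:

import torch
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor

fe = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

wav = torch.randn(16000)  # 1 s of dummy 16 kHz audio
inputs = fe(wav.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    out = model(inputs.input_values)
print(out.extract_features.shape)   # CNN latents: about 50 frames (20 ms stride), 512-dim
print(out.last_hidden_state.shape)  # contextual representations: same frames, 768-dim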
🌐
GitHub
github.com › openvinotoolkit › open_model_zoo › blob › master › models › public › wav2vec2-base › README.md
open_model_zoo/models/public/wav2vec2-base/README.md at master · openvinotoolkit/open_model_zoo
Wav2Vec2.0-base is a model pre-trained to learn speech representations from unlabeled data, as described in the wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations paper, and fine-tuned for the speech recognition task ...
Author   openvinotoolkit
🌐
Reddit
reddit.com › r/machinelearning › [n] meta open-sourced a wav2vec2 model pre-trained on 4.5m hours
r/MachineLearning on Reddit: [N] Meta open-sourced a wav2vec2 model pre-trained on 4.5M hours
October 2, 2023 -

A month ago, Meta AI released W2V-Bert, one of the building blocks of their Seamless models. 

It's been pretrained on 4.5M hours of unlabeled audio data, covering more than 143 languages.

Pros:

  • Enables low-resource fine-tuning

  • Faster and lighter than Whisper

  • MIT-license

  • Can be fine-tuned for other audio tasks

Cons:

  • CTC-based, so it's limited to normalized transcriptions

  • Needs to be fine-tuned before use

Resources:

  • Original repository: https://github.com/facebookresearch/seamless_communication?tab=readme-ov-file#whats-new

  • Transformers docs: https://huggingface.co/docs/transformers/main/en/model_doc/wav2vec2-bert

  • ASR fine-tuning on Mongolian blog post: https://huggingface.co/blog/fine-tune-w2v2-bert

Top answer
1 of 2
8
Paper: "Seamless: Multilingual Expressive and Streaming Speech Translation" , Barrault et al 2023: Large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model-SeamlessM4T v2. This newer model, incorporating an updated UnitY2 framework, was trained on more low-resource language data. SeamlessM4T v2 provides the foundation on which our next two models are initiated. SeamlessExpressive enables translation that preserves vocal styles and prosody. Compared to previous efforts in expressive speech research, our work addresses certain underexplored aspects of prosody, such as speech rate and pauses, while also preserving the style of one's voice. As for SeamlessStreaming, our model leverages the Efficient Monotonic Multihead Attention mechanism to generate low-latency target translations without waiting for complete source utterances. As the first of its kind, SeamlessStreaming enables simultaneous speech-to-speech/text translation for multiple source and target languages. To ensure that our models can be used safely and responsibly, we implemented the first known red-teaming effort for multimodal machine translation, a system for the detection and mitigation of added toxicity, a systematic evaluation of gender bias, and an inaudible localized watermarking mechanism designed to dampen the impact of deepfakes. Consequently, we bring major components from SeamlessExpressive and SeamlessStreaming together to form Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real-time. The contributions to this work are publicly released and accessible at this https URL .
2 of 2
1
Isn't CTC-based an advantage as it makes it much faster than an autoregressive model ?
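For reference, a sketch of loading that checkpoint for CTC fine-tuning; the class and checkpoint names are taken from the transformers docs and blog post linked in the resources above, and the vocab size is a placeholder that must match your own tokenizer:

from transformers import Wav2Vec2BertForCTC, AutoFeatureExtractor

# Names as referenced in the linked docs/blog post; vocab_size is hypothetical.
model = Wav2Vec2BertForCTC.from_pretrained(
    "facebook/w2v-bert-2.0",
    vocab_size=32,          # must match your fine-tuning tokenizer
    ctc_loss_reduction="mean",
)
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/w2v-bert-2.0")
# As noted in the cons above, the model must be fine-tuned on labeled
# audio before it can transcribe anything.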
🌐
TensorFlow
tensorflow.org › hub › fine-tuning wav2vec2 with an lm head
Fine-tuning Wav2Vec2 with an LM head | TensorFlow Hub
March 23, 2024 - Originally, wav2vec2 was pre-trained with a masked language modelling approach, with the objective of identifying the true quantized latent speech representation for a masked time step.
🌐
arXiv
arxiv.org › abs › 2403.01369
[2403.01369] A Closer Look at Wav2Vec2 Embeddings for On-Device Single-Channel Speech Enhancement
March 3, 2024 - Abstract page for arXiv paper 2403.01369: A Closer Look at Wav2Vec2 Embeddings for On-Device Single-Channel Speech Enhancement