🌐
Hugging Face
huggingface.co › docs › transformers › en › model_doc › wav2vec2
Wav2Vec2
The Wav2Vec2 model was proposed in wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
🌐
arXiv
arxiv.org › abs › 2006.11477
[2006.11477] wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
October 22, 2020 - We show for the first time that ... wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned....
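That one-sentence objective packs in three ideas: mask latent frames, quantize the targets, and train the context network to pick the true quantized latent from distractors. A minimal sketch of the contrastive part, assuming simplified shapes and sampling (the paper draws its ~100 distractors from other masked time steps of the same utterance and adds a codebook-diversity term on top):

import torch
import torch.nn.functional as F

def contrastive_loss(context, quantized, masked_idx, num_negatives=100, temperature=0.1):
    # context:    (T, D) transformer outputs c_t at each time step
    # quantized:  (T, D) quantized latent targets q_t
    # masked_idx: indices of the masked time steps
    losses = []
    T = quantized.size(0)
    for t in masked_idx:
        # simplification: sample distractors uniformly over all time steps
        neg_idx = torch.randint(0, T, (num_negatives,))
        candidates = torch.cat([quantized[t : t + 1], quantized[neg_idx]])  # true target first
        # cosine similarity between c_t and each candidate, scaled by a temperature
        sims = F.cosine_similarity(context[t].unsqueeze(0), candidates) / temperature
        # cross-entropy against index 0: identify the true quantized latent
        losses.append(F.cross_entropy(sims.unsqueeze(0), torch.zeros(1, dtype=torch.long)))
    return torch.stack(losses).mean()

# toy check: 50 frames of 256-dim features, every 5th frame starts a masked step
print(contrastive_loss(torch.randn(50, 256), torch.randn(50, 256), torch.arange(0, 50, 5)))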
Discussions

HuggingFace wav2vec2 for multitask training?
🌐 r/speechrecognition
WhyML - Wav2vec2 A Framework for Self-Supervised Learning of Speech Representations
Hi guys, I have made a video on YouTube here where I go through the Wav2Vec2 paper and explain each section. This is a new series on my channel that…
🌐 r/learnmachinelearning
1 · November 5, 2022
[N] Meta open-sourced a wav2vec2 model pre-trained on 4.5M hours
Paper: "Seamless: Multilingual Expressive and Streaming Speech Translation" , Barrault et al 2023: Large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model-SeamlessM4T v2. This newer model, incorporating an updated UnitY2 framework, was trained on more low-resource language data. SeamlessM4T v2 provides the foundation on which our next two models are initiated. SeamlessExpressive enables translation that preserves vocal styles and prosody. Compared to previous efforts in expressive speech research, our work addresses certain underexplored aspects of prosody, such as speech rate and pauses, while also preserving the style of one's voice. As for SeamlessStreaming, our model leverages the Efficient Monotonic Multihead Attention mechanism to generate low-latency target translations without waiting for complete source utterances. As the first of its kind, SeamlessStreaming enables simultaneous speech-to-speech/text translation for multiple source and target languages. To ensure that our models can be used safely and responsibly, we implemented the first known red-teaming effort for multimodal machine translation, a system for the detection and mitigation of added toxicity, a systematic evaluation of gender bias, and an inaudible localized watermarking mechanism designed to dampen the impact of deepfakes. Consequently, we bring major components from SeamlessExpressive and SeamlessStreaming together to form Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real-time. The contributions to this work are publicly released and accessible at this https URL . More on reddit.com
🌐 r/MachineLearning
3 · 45 · October 2, 2023
How can ASR models like wav2vec2.0 handle arbitrary audio input length but whisper can't?
It comes down to wav2vec2 being an encoder-only model, while Whisper is an encoder-decoder; a sketch contrasting the two follows this entry.
🌐 r/speechtech
7 · 3 · January 27, 2024
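The thread's point is easy to verify: wav2vec2's convolutional front end and transformer encoder emit roughly one frame per 20 ms of input regardless of duration, whereas Whisper's encoder expects fixed 30-second mel windows, so longer audio must be chunked externally. A quick sketch, assuming the public facebook/wav2vec2-base-960h checkpoint:

import torch
from transformers import Wav2Vec2ForCTC

# encoder-only: the output length simply scales with the input length
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

for seconds in (1, 7, 30):
    wav = torch.randn(1, 16_000 * seconds)  # stand-in for 16 kHz audio
    with torch.no_grad():
        logits = model(wav).logits
    print(seconds, "s ->", logits.shape[1], "frames")  # ~50 frames per second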
🌐
Hugging Face
huggingface.co › facebook › wav2vec2-base-960h
facebook/wav2vec2-base-960h · Hugging Face
January 16, 2024 -

from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
from jiwer import wer

# evaluate facebook/wav2vec2-base-960h on LibriSpeech test-clean
librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

def map_to_pred(batch):
    input_values = processor(batch["audio"]["array"], return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to("cuda")).logits
    # greedy decode: most likely token per frame; batch_decode handles the CTC collapse
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    batch["transcription"] = transcription
    return batch

result = librispeech_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["audio"])
print("WER:", wer(result["text"], result["transcription"]))
🌐
GeeksforGeeks
geeksforgeeks.org › nlp › wav2vec2-self-a-supervised-learning-technique-for-speech-representations
Wav2Vec2: A Self-Supervised Learning Technique for Speech Representations - GeeksforGeeks
July 23, 2025 - Wav2Vec2 demonstrates the potential of self-supervised training for speech processing. Its architecture is designed to harness vast amounts of unlabeled speech data, distilling patterns and nuances into a rich, generalized representation of spoken language.
🌐
Meta
ai.meta.com › blog › wav2vec-20-learning-the-structure-of-speech-from-raw-audio
Wav2vec 2.0: Learning the structure of speech from raw audio
September 24, 2020 - Facebook AI is releasing code and models for wav2vec 2.0, a self-supervised algorithm that enables automatic speech recognition models with just 10 minutes of transcribed speech data.
🌐
GitHub
github.com › oliverguhr › wav2vec2-live
GitHub - oliverguhr/wav2vec2-live: A live speech recognition using Facebooks wav2vec 2.0 model.
Live speech recognition using Facebook's wav2vec 2.0 model.
Starred by 374 users
Forked by 58 users
Languages: Python
🌐
Mohitmayank
mohitmayank.com › a_lazy_data_science_guide › audio_intelligence › wav2vec2
Wav2Vec2 Model - A Lazy Data Science Guide
To pre-train the model, Wav2Vec2 masks spans of consecutive time steps in the feature encoder output, similar to masking in a masked language model.
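A sketch of that span masking, using the wav2vec 2.0 paper's defaults (start probability p = 0.065, span length M = 10 frames); this is illustrative, not the transformers implementation:

import numpy as np

def sample_span_mask(num_frames, mask_prob=0.065, span_len=10, rng=np.random):
    # choose span starts independently with probability mask_prob,
    # then mask span_len consecutive frames from each start (spans may overlap)
    mask = np.zeros(num_frames, dtype=bool)
    starts = rng.rand(num_frames) < mask_prob
    for s in np.flatnonzero(starts):
        mask[s : s + span_len] = True
    return mask

mask = sample_span_mask(500)  # ~10 s of audio at ~50 frames/s
print(mask.mean())            # close to the paper's ~49% of frames masked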
🌐
Hugging Face
huggingface.co › facebook › wav2vec2-base
facebook/wav2vec2-base · Hugging Face
July 25, 2025 - wav2vec2 · pretraining · speech · arxiv: 2006.11477 · License: apache-2.0 · Facebook's Wav2Vec2: the base model pretrained on 16kHz sampled speech audio.
🌐
HackerNoon
hackernoon.com › wav2vec2-for-automatic-speech-recognition-in-plain-english
wav2vec2 for Automatic Speech Recognition In Plain English | HackerNoon
March 13, 2024 - Plain English description of how Meta AI Research's wav2vec2 model works with respect to automatic speech recognition (ASR).
🌐
PyTorch
docs.pytorch.org › audio › main › generated › torchaudio.models.Wav2Vec2Model.html
Wav2Vec2Model — Torchaudio 2.8.0 documentation
Tutorials using Wav2Vec2Model: Speech Recognition with Wav2Vec2 · ASR Inference with CTC Decoder · Forced Alignment with Wav2Vec2 · Wav2Vec2Model.forward(waveforms: Tensor, lengths: Optional[Tensor] = None) → Tuple[Tensor, Optional[Tensor]]
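A minimal usage sketch of this torchaudio API, assuming the bundled WAV2VEC2_ASR_BASE_960H pipeline and a placeholder audio path:

import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model()

waveform, sample_rate = torchaudio.load("speech.wav")  # placeholder path
if sample_rate != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

with torch.inference_mode():
    emissions, _ = model(waveform)  # forward() returns (emissions, lengths)

# greedy CTC decode: collapse repeats, drop blanks ('-'); '|' marks word boundaries
indices = torch.unique_consecutive(emissions[0].argmax(dim=-1))
labels = bundle.get_labels()
print("".join(labels[i] for i in indices if labels[i] != "-").replace("|", " "))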
🌐
Wolfram
resources.wolframcloud.com › NeuralNetRepository › resources › Wav2Vec2-Trained-on-LibriSpeech-Data
Wav2Vec2 - Wolfram Neural Net Repository
June 12, 2023 - This family of models was trained using self-supervised learning in order to learn powerful representations from speech audio alone, followed by a fine-tuning on transcribed speech. At training time, Wav2Vec2 encodes raw speech audio into latent speech representations via a multilayer convolutional neural network.
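The "multilayer convolutional neural network" mentioned here downsamples aggressively. A back-of-the-envelope check, assuming the base configuration's seven conv layers with strides (5, 2, 2, 2, 2, 2, 2):

import math

strides = [5, 2, 2, 2, 2, 2, 2]   # feature-encoder strides, base config
hop = math.prod(strides)
print(hop)                         # 320 samples per latent frame

sample_rate = 16_000
print(1000 * hop / sample_rate)    # 20.0 ms per frame, i.e. ~50 frames/s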
🌐
IEEE Xplore
ieeexplore.ieee.org › document › 10122501
A WAV2VEC2-Based Experimental Study on Self-Supervised Learning Methods to Improve Child Speech Recognition | IEEE Journals & Magazine | IEEE Xplore
In this work, we explore using the ASR model, wav2vec2, with different pretraining and finetuning configurations for self-supervised learning (SSL) toward improving automatic child speech recognition. The pretrained wav2vec2 models were finetuned using different amounts of child speech training ...
🌐
Jonathan Bgn
jonathanbgn.com › 2021 › 09 › 30 › illustrated-wav2vec-2.html
An Illustrated Tour of Wav2vec 2.0 | Jonathan Bgn
September 30, 2021 - Self-supervised learning of speech representations explained visually.
🌐
Medium
medium.com › @shradrobo › wav2vec2-model-for-child-speech-recognition-eef1d142bcd2
Wav2Vec2 Model for Child Speech recognition | by Shradrobo | Medium
January 30, 2023 - In September 2020, Alexei Baevski, Michael Auli, and Alex Conneau published Wav2Vec2 as a pretrained model for Automatic Speech Recognition (ASR).
🌐
GitHub
github.com › khanld › Wav2vec2-Pretraining
GitHub - khanld/Wav2vec2-Pretraining: Wav2vec 2.0 Self-Supervised Pretraining
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
import torch
import librosa

# load audio at 16 kHz
wav, sr = librosa.load(<audio_path>, sr=16000)

# load the pretrained feature extractor and model from the training output
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("<output_dir>/saved_model/epoch_10")
model = Wav2Vec2Model.from_pretrained("<output_dir>/saved_model/epoch_10")

# run forward pass
inputs = feature_extractor(wav, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

last_hidden_state = outputs.last_hidden_state
print(last_hidden_state.shape)
Starred by 57 users
Forked by 9 users
Languages: Python 96.7% | Shell 3.3%
🌐
Meta
ai.meta.com › research › impact › wav2vec
Wav2vec
🌐
AWS
aws.amazon.com › blogs › machine-learning › fine-tune-and-deploy-a-wav2vec2-model-for-speech-recognition-with-hugging-face-and-amazon-sagemaker
Fine-tune and deploy a Wav2Vec2 model for speech recognition with Hugging Face and Amazon SageMaker | Artificial Intelligence
May 25, 2022 - Then the model is fine-tuned on labeled data with the Connectionist Temporal Classification (CTC) algorithm for specific ASR tasks. The base model we use in this post is Wav2Vec2-Base-960h, fine-tuned on 960 hours of Librispeech on 16 kHz sampled speech audio.
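The CTC fine-tuning step the post describes reduces, in transformers, to passing labels alongside the audio so the model returns a CTC loss. A minimal sketch with random audio and a made-up transcript (real fine-tuning would of course loop over a labeled dataset):

import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

wav = torch.randn(16_000 * 3).numpy()  # stand-in for 3 s of 16 kHz audio
inputs = processor(wav, sampling_rate=16_000, return_tensors="pt")
labels = processor(text="HELLO WORLD", return_tensors="pt").input_ids

out = model(inputs.input_values, labels=labels)  # labels trigger the CTC loss
out.loss.backward()                              # gradients for one training step
print(float(out.loss))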
🌐
arXiv
arxiv.org › abs › 2202.05993
[2202.05993] Wav2Vec2.0 on the Edge: Performance Evaluation
February 12, 2022 - Abstract: Wav2Vec2.0 is a state-of-the-art model which learns speech representations from unlabeled speech data, i.e., self-supervised learning. The pretrained model is then fine-tuned on small amounts of labeled data to use it for speech-to-text ...
🌐
TensorFlow
tensorflow.org › hub › fine-tuning wav2vec2 with an lm head
Fine-tuning Wav2Vec2 with an LM head | TensorFlow Hub
March 23, 2024 - Originally, wav2vec2 was pre-trained with a masked language modelling approach, with the objective of identifying the true quantized latent speech representation for a masked time step.
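That "quantized latent speech representation" comes from a Gumbel-softmax codebook. A single-group sketch (the paper uses two groups of 320 entries each, concatenated, with the temperature annealed from 2.0 down to 0.5); the names here are illustrative, not the fairseq or transformers classes:

import torch
import torch.nn.functional as F

class GumbelQuantizer(torch.nn.Module):
    def __init__(self, dim=256, num_entries=320, temperature=2.0):
        super().__init__()
        self.scores = torch.nn.Linear(dim, num_entries)       # logits per codebook entry
        self.codebook = torch.nn.Embedding(num_entries, dim)  # learned code vectors
        self.temperature = temperature

    def forward(self, z):  # z: (T, dim) feature-encoder latents
        logits = self.scores(z)
        # hard one-hot sample, differentiable via the straight-through estimator
        onehot = F.gumbel_softmax(logits, tau=self.temperature, hard=True)
        return onehot @ self.codebook.weight                  # quantized targets q_t

q = GumbelQuantizer()(torch.randn(50, 256))
print(q.shape)  # torch.Size([50, 256])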