Hugging Face
huggingface.co › docs › transformers › en › model_doc › wav2vec2
Wav2Vec2
The Wav2Vec2 model was proposed in wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
arXiv
arxiv.org › abs › 2006.11477
[2006.11477] wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
October 22, 2020 - We show for the first time that ... wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned....
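The contrastive task mentioned in this abstract can be illustrated with a small sketch. It assumes the usual formulation: cosine similarity between a context vector and each candidate quantized vector, scaled by a temperature, with the true target competing against distractors. The function names and the default temperature of 0.1 are illustrative choices, not code from the paper or from any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def contrastive_loss(context, target, distractors, temperature=0.1):
    """Negative log-probability of picking the true target among all candidates.

    context:     context vector for one masked time step
    target:      true quantized latent for that time step
    distractors: quantized latents sampled from other time steps
    temperature: assumed scaling value for this sketch
    """
    candidates = [target] + list(distractors)
    logits = [cosine_similarity(context, q) / temperature for q in candidates]
    # Log-sum-exp with max subtraction for numerical stability.
    max_logit = max(logits)
    log_denominator = max_logit + math.log(
        sum(math.exp(value - max_logit) for value in logits)
    )
    # The true target sits at index 0.
    return log_denominator - logits[0]
```

The loss is small when the context vector points in the same direction as the true target, and large when it is closer to a distractor.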
HuggingFace wav2vec2 for multitask training? : speechrecognition
WhyML - Wav2vec2 A Framework for Self-Supervised Learning of Speech Representations
Hi guys, I have made a video on YouTube here where I go through the Wav2Vec2 paper and explain each section. This is a new series on my channel that…
[N] Meta open-sourced a wav2vec2 model pre-trained on 4.5M hours
Paper: "Seamless: Multilingual Expressive and Streaming Speech Translation", Barrault et al. 2023: Large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model, SeamlessM4T v2. This newer model, incorporating an updated UnitY2 framework, was trained on more low-resource language data. SeamlessM4T v2 provides the foundation on which our next two models are initiated. SeamlessExpressive enables translation that preserves vocal styles and prosody. Compared to previous efforts in expressive speech research, our work addresses certain underexplored aspects of prosody, such as speech rate and pauses, while also preserving the style of one's voice. As for SeamlessStreaming, our model leverages the Efficient Monotonic Multihead Attention mechanism to generate low-latency target translations without waiting for complete source utterances. As the first of its kind, SeamlessStreaming enables simultaneous speech-to-speech/text translation for multiple source and target languages. To ensure that our models can be used safely and responsibly, we implemented the first known red-teaming effort for multimodal machine translation, a system for the detection and mitigation of added toxicity, a systematic evaluation of gender bias, and an inaudible localized watermarking mechanism designed to dampen the impact of deepfakes. Consequently, we bring major components from SeamlessExpressive and SeamlessStreaming together to form Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real-time. The contributions to this work are publicly released and accessible at this https URL.
How can ASR models like wav2vec2.0 handle arbitrary audio input length but whisper can't?
It comes down to wav2vec2 being an encoder model while Whisper is encoder-decoder.
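On the wav2vec2 side, the convolutional feature encoder has no fixed input size, so the number of output frames scales with the number of input samples. The sketch below assumes the kernel widths and strides reported in the wav2vec 2.0 paper and no padding; the constant and function names are made up for illustration.

```python
# Kernel widths and strides of the seven convolutional layers, as reported
# in the wav2vec 2.0 paper (an assumption for this sketch).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)

def num_output_frames(num_samples):
    """Number of frames the feature encoder emits for a raw waveform.

    Applies the unpadded 1-D convolution length formula layer by layer:
    out = floor((in - kernel) / stride) + 1
    """
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        if length < kernel:
            raise ValueError("input is too short for the convolution stack")
        length = (length - kernel) // stride + 1
    return length
```

Under these assumptions, one second of 16 kHz audio (16000 samples) yields 49 frames and ten seconds yields 499. The sketch says nothing about the memory cost of self-attention on very long inputs, which is a separate practical limit.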
Videos
12:59
Extract Embedding Features from Audio Data | Wav2Vec2 | Python ...
26:18
Speech Recognition in Python | finetune wav2vec2 model for a custom ...
Wav2vec2 A Framework for Self-Supervised Learning of ...
11:45
Fine-Tuning Wav2Vec2 using HuggingFace | Audio Classification - ...
11:55
Build Facebook's Wav2Vec2 Model For Speech To Text Application ...
Deploy Wav2Vec2.0 based Speech Recognition Service in ...
Hugging Face
huggingface.co › facebook › wav2vec2-base-960h
facebook/wav2vec2-base-960h · Hugging Face
January 16, 2024 -
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
from jiwer import wer

librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

def map_to_pred(batch):
    input_values = processor(batch["audio"]["array"], return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to("cuda")).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    batch["transcription"] = transcription
    return batch

result = librispeech_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["audio"])

print("WER:", wer(result["text"], result["transcription"]))
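The argmax-then-decode step in the snippet above is greedy CTC decoding: take the most likely token per frame, merge consecutive repeats, then drop the blank token. A rough sketch of that collapse rule on plain integer IDs follows. The function name and the blank ID of 0 are assumptions for illustration; this is not the library's implementation.

```python
def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse per-frame token IDs into an output sequence, CTC style.

    frame_ids: most likely token ID for each frame (the argmax output)
    blank_id:  ID of the CTC blank token (assumed to be 0 here)
    """
    decoded = []
    previous = None
    for token_id in frame_ids:
        # Keep a token only when it starts a new run and is not the blank.
        if token_id != previous and token_id != blank_id:
            decoded.append(token_id)
        previous = token_id
    return decoded
```

A blank between two identical IDs keeps both of them, which is how CTC can emit doubled letters: the frame sequence 0, 1, 1, 0, 1, 2, 2, 0 collapses to 1, 1, 2.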
PyTorch
docs.pytorch.org › audio › main › tutorials › speech_recognition_pipeline_tutorial.html
Speech Recognition with Wav2Vec2 — Torchaudio 2.8.0 documentation
The Wav2Vec2 model provides a method to perform feature extraction and classification in one step.
GeeksforGeeks
geeksforgeeks.org › nlp › wav2vec2-self-a-supervised-learning-technique-for-speech-representations
Wav2Vec2: Self-A Supervised Learning Technique for Speech Representations - GeeksforGeeks
July 23, 2025 - Wav2Vec2 stands as a testament to the transformative potential of self-supervised training, particularly in the realm of Natural Language Processing (NLP). Its architecture is tailored to harness vast amounts of unlabeled speech data, distilling intricate patterns and nuances to create a rich and generalized understanding of spoken language.
GitHub
github.com › oliverguhr › wav2vec2-live
GitHub - oliverguhr/wav2vec2-live: Live speech recognition using Facebook's wav2vec 2.0 model.
Starred by 374 users
Forked by 58 users
Languages Python
Mohitmayank
mohitmayank.com › a_lazy_data_science_guide › audio_intelligence › wav2vec2
Wav2Vec2 Model - A Lazy Data Science Guide
To pre-train the model, Wav2Vec2 masks a portion of the time steps in the feature encoder's output, similar to masked language modeling.
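The masking step can be sketched as follows, assuming the span-masking scheme described in the wav2vec 2.0 paper: a proportion of time steps is sampled as span starts, and a fixed number of consecutive steps is masked from each start, with overlaps allowed. The function name, the rounding of the number of starts, and the default values (0.065 and 10) are choices made for this illustration.

```python
import random

def sample_span_mask(num_frames, start_prob=0.065, span_length=10, seed=None):
    """Return a list of booleans marking which time steps are masked.

    num_frames:  number of feature-encoder output frames
    start_prob:  proportion of frames sampled as span starts (assumed default)
    span_length: number of consecutive frames masked per start (assumed default)
    seed:        optional seed for reproducible sampling
    """
    rng = random.Random(seed)
    # Rounding the number of starts this way is a simplification of this sketch.
    num_starts = int(round(start_prob * num_frames))
    # Starts are sampled without replacement.
    starts = rng.sample(range(num_frames), num_starts)
    mask = [False] * num_frames
    for start in starts:
        # Spans may overlap and are clipped at the end of the sequence.
        for t in range(start, min(start + span_length, num_frames)):
            mask[t] = True
    return mask
```

Because spans can overlap, the fraction of masked frames is usually lower than start_prob multiplied by span_length.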
PyTorch
docs.pytorch.org › audio › main › generated › torchaudio.models.Wav2Vec2Model.html
Wav2Vec2Model — Torchaudio 2.8.0 documentation
Tutorials using Wav2Vec2Model: Speech Recognition with Wav2Vec2 · ASR Inference with CTC Decoder · Forced Alignment with Wav2Vec2 · Wav2Vec2Model.forward(waveforms: Tensor, lengths: Optional[Tensor] = None) → Tuple[Tensor, Optional[Tensor]]
Wolfram
resources.wolframcloud.com › NeuralNetRepository › resources › Wav2Vec2-Trained-on-LibriSpeech-Data
Wav2Vec2 - Wolfram Neural Net Repository
June 12, 2023 - This family of models was trained using self-supervised learning in order to learn powerful representations from speech audio alone, followed by a fine-tuning on transcribed speech. At training time, Wav2Vec2 encodes raw speech audio into latent speech representations via a multilayer convolutional neural network.
GitHub
github.com › khanld › Wav2vec2-Pretraining
GitHub - khanld/Wav2vec2-Pretraining: Wav2vec 2.0 Self-Supervised Pretraining
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
import torch
import librosa

# load audio, resampled to 16 kHz
wav, sr = librosa.load("<audio_path>", sr=16000)

# load the pretrained feature extractor and model from a saved checkpoint
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("<output_dir>/saved_model/epoch_10")
model = Wav2Vec2Model.from_pretrained("<output_dir>/saved_model/epoch_10")

# run a forward pass without tracking gradients
inputs = feature_extractor(wav, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden states of the last transformer layer
last_hidden_state = outputs.last_hidden_state
print(last_hidden_state.shape)
Starred by 57 users
Forked by 9 users
Languages Python 96.7% | Shell 3.3%
Meta
ai.meta.com › research › impact › wav2vec
Wav2vec
arXiv
arxiv.org › abs › 2202.05993
[2202.05993] Wav2Vec2.0 on the Edge: Performance Evaluation
February 12, 2022 - Abstract: Wav2Vec2.0 is a state-of-the-art model which learns speech representations from unlabeled speech data, i.e., through self-supervised learning. The pretrained model is then fine-tuned on small amounts of labeled data to use it for speech-to-text ...
TensorFlow
tensorflow.org › hub › fine-tuning wav2vec2 with an lm head
Fine-tuning Wav2Vec2 with an LM head | TensorFlow Hub
March 23, 2024 - Originally, wav2vec2 was pre-trained with a masked language modelling approach with the objective to identify the true quantized latent speech representation for a masked time step.