🌐
Hugging Face
huggingface.co › docs › transformers › en › model_doc › wav2vec2
Wav2Vec2
The Wav2Vec2 model was proposed in wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
🌐
arXiv
arxiv.org › abs › 2006.11477
[2006.11477] wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
October 22, 2020 - We show for the first time that ... wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned....
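That one-sentence objective packs in three ideas: mask latent frames, quantize the targets, and train the context network to pick the true quantized latent from distractors. A minimal sketch of the contrastive part, assuming simplified shapes and sampling (the paper draws its ~100 distractors from other masked time steps of the same utterance and adds a codebook-diversity term on top):

import torch
import torch.nn.functional as F

def contrastive_loss(context, quantized, masked_idx, num_negatives=100, temperature=0.1):
    # context:    (T, D) transformer outputs c_t at each time step
    # quantized:  (T, D) quantized latent targets q_t
    # masked_idx: indices of the masked time steps
    losses = []
    T = quantized.size(0)
    for t in masked_idx:
        # simplification: sample distractors uniformly over all time steps
        neg_idx = torch.randint(0, T, (num_negatives,))
        candidates = torch.cat([quantized[t : t + 1], quantized[neg_idx]])  # true target first
        # cosine similarity between c_t and each candidate, scaled by a temperature
        sims = F.cosine_similarity(context[t].unsqueeze(0), candidates) / temperature
        # cross-entropy against index 0: identify the true quantized latent
        losses.append(F.cross_entropy(sims.unsqueeze(0), torch.zeros(1, dtype=torch.long)))
    return torch.stack(losses).mean()

# toy check: 50 frames of 256-dim features, every 5th frame starts a masked step
print(contrastive_loss(torch.randn(50, 256), torch.randn(50, 256), torch.arange(0, 50, 5)))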
Discussions

HuggingFace wav2vec2 for multitask training?
🌐 r/speechrecognition
WhyML - Wav2vec2 A Framework for Self-Supervised Learning of Speech Representations
Hi guys, I have made a video on YouTube here where I go through the Wav2Vec2 paper and explain each section. This is a new series on my channel that…
🌐 r/learnmachinelearning
1 · November 5, 2022
[N] Meta open-sourced a wav2vec2 model pre-trained on 4.5M hours
Paper: "Seamless: Multilingual Expressive and Streaming Speech Translation" , Barrault et al 2023: Large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model-SeamlessM4T v2. This newer model, incorporating an updated UnitY2 framework, was trained on more low-resource language data. SeamlessM4T v2 provides the foundation on which our next two models are initiated. SeamlessExpressive enables translation that preserves vocal styles and prosody. Compared to previous efforts in expressive speech research, our work addresses certain underexplored aspects of prosody, such as speech rate and pauses, while also preserving the style of one's voice. As for SeamlessStreaming, our model leverages the Efficient Monotonic Multihead Attention mechanism to generate low-latency target translations without waiting for complete source utterances. As the first of its kind, SeamlessStreaming enables simultaneous speech-to-speech/text translation for multiple source and target languages. To ensure that our models can be used safely and responsibly, we implemented the first known red-teaming effort for multimodal machine translation, a system for the detection and mitigation of added toxicity, a systematic evaluation of gender bias, and an inaudible localized watermarking mechanism designed to dampen the impact of deepfakes. Consequently, we bring major components from SeamlessExpressive and SeamlessStreaming together to form Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real-time. The contributions to this work are publicly released and accessible at this https URL . More on reddit.com
🌐 r/MachineLearning
3 · 45 · October 2, 2023
How can ASR models like wav2vec2.0 handle arbitrary audio input length but whisper can't?
It comes down to wav2vec2 being an encoder-only model, while Whisper is an encoder-decoder; a sketch contrasting the two follows this entry.
🌐 r/speechtech
7 · 3 · January 27, 2024
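The thread's point is easy to verify: wav2vec2's convolutional front end and transformer encoder emit roughly one frame per 20 ms of input regardless of duration, whereas Whisper's encoder expects fixed 30-second mel windows, so longer audio must be chunked externally. A quick sketch, assuming the public facebook/wav2vec2-base-960h checkpoint:

import torch
from transformers import Wav2Vec2ForCTC

# encoder-only: the output length simply scales with the input length
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

for seconds in (1, 7, 30):
    wav = torch.randn(1, 16_000 * seconds)  # stand-in for 16 kHz audio
    with torch.no_grad():
        logits = model(wav).logits
    print(seconds, "s ->", logits.shape[1], "frames")  # ~50 frames per second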
🌐
Hugging Face
huggingface.co › facebook › wav2vec2-base-960h
facebook/wav2vec2-base-960h · Hugging Face
January 16, 2024 -

from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
from jiwer import wer

# evaluate facebook/wav2vec2-base-960h on LibriSpeech test-clean
librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

def map_to_pred(batch):
    input_values = processor(batch["audio"]["array"], return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to("cuda")).logits
    # greedy decode: most likely token per frame; batch_decode handles the CTC collapse
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    batch["transcription"] = transcription
    return batch

result = librispeech_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["audio"])
print("WER:", wer(result["text"], result["transcription"]))
🌐
GeeksforGeeks
geeksforgeeks.org › nlp › wav2vec2-self-a-supervised-learning-technique-for-speech-representations
Wav2Vec2: A Self-Supervised Learning Technique for Speech Representations - GeeksforGeeks
July 23, 2025 - Wav2Vec2 demonstrates the potential of self-supervised training for speech processing. Its architecture is designed to harness vast amounts of unlabeled speech data, distilling patterns and nuances into a rich, generalized representation of spoken language.
🌐
Meta
ai.meta.com › blog › wav2vec-20-learning-the-structure-of-speech-from-raw-audio
Wav2vec 2.0: Learning the structure of speech from raw audio
September 24, 2020 - Facebook AI is releasing code and models for wav2vec 2.0, a self-supervised algorithm that enables automatic speech recognition models with just 10 minutes of transcribed speech data.
🌐
GitHub
github.com › oliverguhr › wav2vec2-live
GitHub - oliverguhr/wav2vec2-live: A live speech recognition using Facebooks wav2vec 2.0 model.
Live speech recognition using Facebook's wav2vec 2.0 model.
Starred by 374 users
Forked by 58 users
Languages: Python
🌐
Mohitmayank
mohitmayank.com › a_lazy_data_science_guide › audio_intelligence › wav2vec2
Wav2Vec2 Model - A Lazy Data Science Guide
To pre-train the model, Wav2Vec2 masks spans of consecutive time steps in the feature encoder output, similar to masking in a masked language model.
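A sketch of that span masking, using the wav2vec 2.0 paper's defaults (start probability p = 0.065, span length M = 10 frames); this is illustrative, not the transformers implementation:

import numpy as np

def sample_span_mask(num_frames, mask_prob=0.065, span_len=10, rng=np.random):
    # choose span starts independently with probability mask_prob,
    # then mask span_len consecutive frames from each start (spans may overlap)
    mask = np.zeros(num_frames, dtype=bool)
    starts = rng.rand(num_frames) < mask_prob
    for s in np.flatnonzero(starts):
        mask[s : s + span_len] = True
    return mask

mask = sample_span_mask(500)  # ~10 s of audio at ~50 frames/s
print(mask.mean())            # close to the paper's ~49% of frames masked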
🌐
Hugging Face
huggingface.co › facebook › wav2vec2-base
facebook/wav2vec2-base · Hugging Face
July 25, 2025 - wav2vec2 · pretraining · speech · arxiv: 2006.11477 · License: apache-2.0 · Facebook's Wav2Vec2: the base model pretrained on 16kHz sampled speech audio.
🌐
HackerNoon
hackernoon.com › wav2vec2-for-automatic-speech-recognition-in-plain-english
wav2vec2 for Automatic Speech Recognition In Plain English | HackerNoon
March 13, 2024 - Plain English description of how Meta AI Research's wav2vec2 model works with respect to automatic speech recognition (ASR).
🌐
PyTorch
docs.pytorch.org › audio › main › generated › torchaudio.models.Wav2Vec2Model.html
Wav2Vec2Model — Torchaudio 2.8.0 documentation
Tutorials using Wav2Vec2Model: Speech Recognition with Wav2Vec2 · ASR Inference with CTC Decoder · Forced Alignment with Wav2Vec2 · Wav2Vec2Model.forward(waveforms: Tensor, lengths: Optional[Tensor] = None) → Tuple[Tensor, Optional[Tensor]]
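A minimal usage sketch of this torchaudio API, assuming the bundled WAV2VEC2_ASR_BASE_960H pipeline and a placeholder audio path:

import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model()

waveform, sample_rate = torchaudio.load("speech.wav")  # placeholder path
if sample_rate != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

with torch.inference_mode():
    emissions, _ = model(waveform)  # forward() returns (emissions, lengths)

# greedy CTC decode: collapse repeats, drop blanks ('-'); '|' marks word boundaries
indices = torch.unique_consecutive(emissions[0].argmax(dim=-1))
labels = bundle.get_labels()
print("".join(labels[i] for i in indices if labels[i] != "-").replace("|", " "))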
🌐
Wolfram
resources.wolframcloud.com › NeuralNetRepository › resources › Wav2Vec2-Trained-on-LibriSpeech-Data
Wav2Vec2 - Wolfram Neural Net Repository
June 12, 2023 - This family of models was trained using self-supervised learning in order to learn powerful representations from speech audio alone, followed by a fine-tuning on transcribed speech. At training time, Wav2Vec2 encodes raw speech audio into latent speech representations via a multilayer convolutional neural network.
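The "multilayer convolutional neural network" mentioned here downsamples aggressively. A back-of-the-envelope check, assuming the base configuration's seven conv layers with strides (5, 2, 2, 2, 2, 2, 2):

import math

strides = [5, 2, 2, 2, 2, 2, 2]   # feature-encoder strides, base config
hop = math.prod(strides)
print(hop)                         # 320 samples per latent frame

sample_rate = 16_000
print(1000 * hop / sample_rate)    # 20.0 ms per frame, i.e. ~50 frames/s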
🌐
IEEE Xplore
ieeexplore.ieee.org › document › 10122501
A WAV2VEC2-Based Experimental Study on Self-Supervised Learning Methods to Improve Child Speech Recognition | IEEE Journals & Magazine | IEEE Xplore
In this work, we explore using the ASR model, wav2vec2, with different pretraining and finetuning configurations for self-supervised learning (SSL) toward improving automatic child speech recognition. The pretrained wav2vec2 models were finetuned using different amounts of child speech training ...
🌐
Jonathan Bgn
jonathanbgn.com › 2021 › 09 › 30 › illustrated-wav2vec-2.html
An Illustrated Tour of Wav2vec 2.0 | Jonathan Bgn
September 30, 2021 - Self-supervised learning of speech representations explained visually.
🌐
Medium
medium.com › @shradrobo › wav2vec2-model-for-child-speech-recognition-eef1d142bcd2
Wav2Vec2 Model for Child Speech recognition | by Shradrobo | Medium
January 30, 2023 - In September 2020, Alexei Baevski, Michael Auli, and Alex Conneau published Wav2Vec2 as a pretrained model for Automatic Speech Recognition (ASR).
🌐
GitHub
github.com › khanld › Wav2vec2-Pretraining
GitHub - khanld/Wav2vec2-Pretraining: Wav2vec 2.0 Self-Supervised Pretraining
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
import torch
import librosa

# load audio at 16 kHz
wav, sr = librosa.load(<audio_path>, sr=16000)

# load the pretrained feature extractor and model from the training output
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("<output_dir>/saved_model/epoch_10")
model = Wav2Vec2Model.from_pretrained("<output_dir>/saved_model/epoch_10")

# run forward pass
inputs = feature_extractor(wav, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

last_hidden_state = outputs.last_hidden_state
print(last_hidden_state.shape)
Starred by 57 users
Forked by 9 users
Languages: Python 96.7% | Shell 3.3%
🌐
Meta
ai.meta.com › research › impact › wav2vec
Wav2vec
🌐
AWS
aws.amazon.com › blogs › machine-learning › fine-tune-and-deploy-a-wav2vec2-model-for-speech-recognition-with-hugging-face-and-amazon-sagemaker
Fine-tune and deploy a Wav2Vec2 model for speech recognition with Hugging Face and Amazon SageMaker | Artificial Intelligence
May 25, 2022 - Then the model is fine-tuned on labeled data with the Connectionist Temporal Classification (CTC) algorithm for specific ASR tasks. The base model we use in this post is Wav2Vec2-Base-960h, fine-tuned on 960 hours of Librispeech on 16 kHz sampled speech audio.
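The CTC fine-tuning step the post describes reduces, in transformers, to passing labels alongside the audio so the model returns a CTC loss. A minimal sketch with random audio and a made-up transcript (real fine-tuning would of course loop over a labeled dataset):

import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

wav = torch.randn(16_000 * 3).numpy()  # stand-in for 3 s of 16 kHz audio
inputs = processor(wav, sampling_rate=16_000, return_tensors="pt")
labels = processor(text="HELLO WORLD", return_tensors="pt").input_ids

out = model(inputs.input_values, labels=labels)  # labels trigger the CTC loss
out.loss.backward()                              # gradients for one training step
print(float(out.loss))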
🌐
arXiv
arxiv.org › abs › 2202.05993
[2202.05993] Wav2Vec2.0 on the Edge: Performance Evaluation
February 12, 2022 - Abstract: Wav2Vec2.0 is a state-of-the-art model which learns speech representations from unlabeled speech data, i.e., self-supervised learning. The pretrained model is then fine-tuned on small amounts of labeled data to use it for speech-to-text ...
🌐
TensorFlow
tensorflow.org › hub › fine-tuning wav2vec2 with an lm head
Fine-tuning Wav2Vec2 with an LM head | TensorFlow Hub
March 23, 2024 - Originally, wav2vec2 was pre-trained with a masked language modelling approach, with the objective of identifying the true quantized latent speech representation for a masked time step.
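That "quantized latent speech representation" comes from a Gumbel-softmax codebook. A single-group sketch (the paper uses two groups of 320 entries each, concatenated, with the temperature annealed from 2.0 down to 0.5); the names here are illustrative, not the fairseq or transformers classes:

import torch
import torch.nn.functional as F

class GumbelQuantizer(torch.nn.Module):
    def __init__(self, dim=256, num_entries=320, temperature=2.0):
        super().__init__()
        self.scores = torch.nn.Linear(dim, num_entries)       # logits per codebook entry
        self.codebook = torch.nn.Embedding(num_entries, dim)  # learned code vectors
        self.temperature = temperature

    def forward(self, z):  # z: (T, dim) feature-encoder latents
        logits = self.scores(z)
        # hard one-hot sample, differentiable via the straight-through estimator
        onehot = F.gumbel_softmax(logits, tau=self.temperature, hard=True)
        return onehot @ self.codebook.weight                  # quantized targets q_t

q = GumbelQuantizer()(torch.randn(50, 256))
print(q.shape)  # torch.Size([50, 256])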