🌐
Hugging Face
huggingface.co › docs › transformers › en › model_doc › wav2vec2
Wav2Vec2
The Wav2Vec2 model was proposed in wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
🌐
arXiv
arxiv.org › abs › 2006.11477
[2006.11477] wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
October 22, 2020 - We show for the first time that ... wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned....
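To make the quoted objective concrete, here is a minimal PyTorch sketch of one contrastive step (all tensor names are illustrative, and the paper's full loss additionally includes a codebook diversity term):

import torch
import torch.nn.functional as F

def contrastive_loss(c_t, q_t, distractors, temperature=0.1):
    """Sketch of the wav2vec 2.0 contrastive objective for one masked
    time step: identify the true quantized latent q_t among distractors.

    c_t:         (dim,)    context network output at the masked step
    q_t:         (dim,)    true quantized latent for that step
    distractors: (K, dim)  quantized latents sampled from other steps
    """
    candidates = torch.cat([q_t.unsqueeze(0), distractors], dim=0)  # (K+1, dim)
    # cosine similarity between the context vector and each candidate
    sims = F.cosine_similarity(c_t.unsqueeze(0), candidates, dim=-1) / temperature
    # the true latent sits at index 0; cross-entropy = -log softmax there
    target = torch.zeros(1, dtype=torch.long)
    return F.cross_entropy(sims.unsqueeze(0), target)

# usage with random stand-ins: 256-dim vectors, 100 distractors
loss = contrastive_loss(torch.randn(256), torch.randn(256), torch.randn(100, 256))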
Discussions

[N] Meta open-sourced a wav2vec2 model pre-trained on 4.5M hours
Paper: "Seamless: Multilingual Expressive and Streaming Speech Translation" , Barrault et al 2023: Large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model-SeamlessM4T v2. This newer model, incorporating an updated UnitY2 framework, was trained on more low-resource language data. SeamlessM4T v2 provides the foundation on which our next two models are initiated. SeamlessExpressive enables translation that preserves vocal styles and prosody. Compared to previous efforts in expressive speech research, our work addresses certain underexplored aspects of prosody, such as speech rate and pauses, while also preserving the style of one's voice. As for SeamlessStreaming, our model leverages the Efficient Monotonic Multihead Attention mechanism to generate low-latency target translations without waiting for complete source utterances. As the first of its kind, SeamlessStreaming enables simultaneous speech-to-speech/text translation for multiple source and target languages. To ensure that our models can be used safely and responsibly, we implemented the first known red-teaming effort for multimodal machine translation, a system for the detection and mitigation of added toxicity, a systematic evaluation of gender bias, and an inaudible localized watermarking mechanism designed to dampen the impact of deepfakes. Consequently, we bring major components from SeamlessExpressive and SeamlessStreaming together to form Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real-time. The contributions to this work are publicly released and accessible at this https URL . More on reddit.com
🌐 r/MachineLearning
3
45
October 2, 2023
How can ASR models like wav2vec2.0 handle arbitrary audio input length but whisper can't?
It comes down to wav2vec2 being an encoder-only model, while Whisper is an encoder-decoder. More on reddit.com
🌐 r/speechtech
7
3
January 27, 2024
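Concretely, because wav2vec2's convolutional encoder and Transformer have no fixed input window, the same checkpoint accepts waveforms of any length, memory permitting. A minimal sketch with the facebook/wav2vec2-base-960h checkpoint:

import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Two dummy waveforms of very different lengths: 3 s and 45 s at 16 kHz.
for seconds in (3, 45):
    wav = torch.randn(16000 * seconds).numpy()
    inputs = processor(wav, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    print(seconds, "s ->", logits.shape)  # time dimension scales with input

Self-attention memory still grows quadratically with length, which is why very long files are often chunked in practice; Whisper, by contrast, always pads or trims its input to fixed 30-second windows.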
Adapted Wav2Vec2 for ECG Classification: Help Needed!
I've tried to get stuff like this to work. Commenting in order to come back and see how this goes More on reddit.com
🌐 r/deeplearning
2
5
February 9, 2024
[D] Wav2Vec2 maximum inputs are audios of 10 sec?
I am currently using facebook/wav2vec2-base model for an audio classification task. More on reddit.com
🌐 r/MachineLearning
3
1
October 31, 2023
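For clips longer than the ~10 s the poster mentions, the limit is usually memory rather than the model itself. A common workaround (not something prescribed in the thread) is to chunk the waveform and pool the per-chunk predictions; a hypothetical helper for a classifier such as Wav2Vec2ForSequenceClassification, with illustrative names throughout:

import torch

def classify_long_audio(model, processor, waveform, sr=16000, chunk_s=10):
    """Hypothetical helper: split a long 1-D numpy waveform into
    fixed-size chunks, classify each, and average the logits."""
    chunk = sr * chunk_s
    pieces = [waveform[i:i + chunk] for i in range(0, len(waveform), chunk)]
    all_logits = []
    for piece in pieces:
        inputs = processor(piece, sampling_rate=sr, return_tensors="pt")
        with torch.no_grad():
            all_logits.append(model(inputs.input_values).logits)
    return torch.cat(all_logits).mean(dim=0)  # pooled class scores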
🌐
Meta
ai.meta.com › blog › wav2vec-20-learning-the-structure-of-speech-from-raw-audio
Wav2vec 2.0: Learning the structure of speech from raw audio
September 24, 2020 - Facebook AI is releasing code and models for wav2vec 2.0, a self-supervised algorithm that enables automatic speech recognition models with just 10 minutes of transcribed speech data.
🌐
Mohitmayank
mohitmayank.com › a_lazy_data_science_guide › audio_intelligence › wav2vec2
Wav2Vec2 Model - A Lazy Data Science Guide
To pre-train the model, Wav2Vec2 masks a portion of the feature encoder's output time steps, similar to masked language modelling.
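A rough sketch of that span masking over encoder time steps, using the mask probability (0.065) and span length (10) reported in the paper:

import torch

def make_span_mask(num_steps, mask_prob=0.065, span_len=10):
    """Sketch of wav2vec 2.0-style time-step masking: sample starting
    indices and mask the span that follows each one."""
    mask = torch.zeros(num_steps, dtype=torch.bool)
    num_starts = int(mask_prob * num_steps)
    starts = torch.randperm(num_steps - span_len)[:num_starts]
    for s in starts:
        mask[s:s + span_len] = True
    return mask  # masked positions are replaced by a learned vector

mask = make_span_mask(500)
print(mask.float().mean())  # spans overlap, so roughly half the steps are masked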
🌐
Hugging Face
huggingface.co › facebook › wav2vec2-base-960h
facebook/wav2vec2-base-960h · Hugging Face
January 16, 2024 -
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
from jiwer import wer

librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

def map_to_pred(batch):
    input_values = processor(batch["audio"]["array"], return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to("cuda")).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    batch["transcription"] = transcription
    return batch

result = librispeech_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["audio"])
print("WER:", wer(result["text"], result["transcription"]))
🌐
GitHub
github.com › oliverguhr › wav2vec2-live
GitHub - oliverguhr/wav2vec2-live: Live speech recognition using Facebook's wav2vec 2.0 model.
Live speech recognition using Facebook's wav2vec 2.0 model. - oliverguhr/wav2vec2-live
Starred by 374 users
Forked by 58 users
Languages   Python
🌐
HackerNoon
hackernoon.com › wav2vec2-for-automatic-speech-recognition-in-plain-english
wav2vec2 for Automatic Speech Recognition In Plain English | HackerNoon
March 13, 2024 - Plain English description of how Meta AI Research's wav2vec2 model works with respect to automatic speech recognition (ASR).
🌐
Hugging Face
huggingface.co › facebook › wav2vec2-base
facebook/wav2vec2-base · Hugging Face
July 25, 2025 - wav2vec2 · pretraining · speech · arxiv: 2006.11477 · License: apache-2.0 · Facebook's Wav2Vec2: the base model, pretrained on 16 kHz sampled speech audio.
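Since the checkpoint expects 16 kHz input, audio at other sample rates should be resampled before it reaches the model; a minimal torchaudio sketch (the file name is hypothetical):

import torchaudio

waveform, sr = torchaudio.load("clip.wav")  # hypothetical input file
if sr != 16000:
    # wav2vec2 checkpoints are pretrained on 16 kHz speech
    waveform = torchaudio.functional.resample(waveform, sr, 16000)
waveform = waveform.mean(dim=0)  # downmix to mono if the file is stereo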
🌐
GeeksforGeeks
geeksforgeeks.org › nlp › wav2vec2-self-a-supervised-learning-technique-for-speech-representations
Wav2Vec2: A Self-Supervised Learning Technique for Speech Representations - GeeksforGeeks
July 23, 2025 - Wav2Vec2 stands as a testament to the transformative potential of self-supervised training, particularly in speech processing. Its architecture is tailored to harness vast amounts of unlabeled speech data, distilling intricate patterns and nuances to create a rich and generalized understanding of spoken language.
🌐
Medium
medium.com › @shiryc › from-wav2vec2-to-decoded-sentences-9078e5b56d1f
From Wav2Vec2 to Decoded Sentences | by Shiry Yonash | Medium
June 6, 2022 - The first component of Wav2Vec2 consists of a stack of CNN layers that are used to extract acoustically meaningful — but contextually independent — features from the raw speech signal. According to the Wav2Vec2 paper, this part of the model has already been sufficiently trained during pre-training and does not need to be fine-tuned anymore.
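The transformers implementation exposes this convention directly: freezing the CNN feature encoder so that fine-tuning only updates the Transformer layers and the task head. A minimal sketch:

from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base")
# Freeze the CNN feature encoder: per the paper, it is not fine-tuned.
model.freeze_feature_encoder()

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters after freezing: {trainable:,}")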
🌐
YouTube
youtube.com › watch
Wav2vec2 A Framework for Self-Supervised Learning of ...
🌐
AWS
aws.amazon.com › blogs › machine-learning › fine-tune-and-deploy-a-wav2vec2-model-for-speech-recognition-with-hugging-face-and-amazon-sagemaker
Fine-tune and deploy a Wav2Vec2 model for speech recognition with Hugging Face and Amazon SageMaker | Artificial Intelligence
May 25, 2022 - Then the model is fine-tuned on labeled data with the Connectionist Temporal Classification (CTC) algorithm for specific ASR tasks. The base model we use in this post is Wav2Vec2-Base-960h, fine-tuned on 960 hours of Librispeech on 16 kHz sampled speech audio.
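In the transformers API, passing labels to Wav2Vec2ForCTC makes the forward pass compute the CTC loss; a minimal sketch of one training step, with random audio standing in for real data:

import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

audio = torch.randn(16000 * 2).numpy()  # 2 s of dummy 16 kHz audio
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
# Tokenize the target transcription with the CTC character tokenizer.
labels = processor.tokenizer("HELLO WORLD", return_tensors="pt").input_ids

outputs = model(inputs.input_values, labels=labels)  # loss is the CTC loss
outputs.loss.backward()
print("CTC loss:", outputs.loss.item())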
🌐
Meta
ai.meta.com › research › impact › wav2vec
Wav2vec
🌐
PyTorch
docs.pytorch.org › audio › main › generated › torchaudio.models.Wav2Vec2Model.html
Wav2Vec2Model — Torchaudio 2.8.0 documentation
Tutorials using Wav2Vec2Model: Speech Recognition with Wav2Vec2 · ASR Inference with CTC Decoder · Forced Alignment with Wav2Vec2 · Wav2Vec2Model.forward(waveforms: Tensor, lengths: Optional[Tensor] = None) → Tuple[Tensor, Optional[Tensor]]
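The matching torchaudio usage, via the bundled ASR pipeline whose forward signature is the one quoted above:

import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model()

# forward(waveforms, lengths) -> (emissions, lengths), per the docs above
waveform = torch.randn(1, int(bundle.sample_rate * 3))  # 3 s of dummy audio
with torch.inference_mode():
    emissions, lengths = model(waveform)
print(emissions.shape)  # (batch, frames, vocabulary)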
🌐
Wolfram
resources.wolframcloud.com › NeuralNetRepository › resources › Wav2Vec2-Trained-on-LibriSpeech-Data
Wav2Vec2 - Wolfram Neural Net Repository
June 12, 2023 - This family of models was trained using self-supervised learning in order to learn powerful representations from speech audio alone, followed by fine-tuning on transcribed speech. At training time, Wav2Vec2 encodes raw speech audio into latent speech representations via a multilayer convolutional neural network.
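Both stages are visible in the transformers output object: extract_features holds the convolutional latents and last_hidden_state the contextual Transformer representations. A minimal sketch:

import torch
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor

fe = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

wav = torch.randn(16000)  # 1 s of dummy 16 kHz audio
inputs = fe(wav.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    out = model(inputs.input_values)
print(out.extract_features.shape)   # CNN latents: about 50 frames (20 ms stride), 512-dim
print(out.last_hidden_state.shape)  # contextual representations: same frames, 768-dim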
🌐
GitHub
github.com › openvinotoolkit › open_model_zoo › blob › master › models › public › wav2vec2-base › README.md
open_model_zoo/models/public/wav2vec2-base/README.md at master · openvinotoolkit/open_model_zoo
Wav2Vec2.0-base is a model pre-trained to learn speech representations from unlabeled data, as described in the wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations paper, and fine-tuned for the speech recognition task ...
Author   openvinotoolkit
🌐
Reddit
reddit.com › r/machinelearning › [n] meta open-sourced a wav2vec2 model pre-trained on 4.5m hours
r/MachineLearning on Reddit: [N] Meta open-sourced a wav2vec2 model pre-trained on 4.5M hours
October 2, 2023 -

A month ago, Meta AI released W2V-Bert, one of the building blocks of their Seamless models. 

It's been pretrained on 4.5M hours of unlabeled audio data, covering more than 143 languages.

Pros:

  • Enables low-resource fine-tuning

  • Faster and lighter than Whisper

  • MIT-license

  • Can be fine-tuned for other audio tasks

Cons:

  • CTC-based, so it's limited to normalized transcriptions

  • Needs to be fine-tuned before use

Resources:

  • Original repository: https://github.com/facebookresearch/seamless_communication?tab=readme-ov-file#whats-new

  • Transformers docs: https://huggingface.co/docs/transformers/main/en/model_doc/wav2vec2-bert

  • ASR fine-tuning on Mongolian blog post: https://huggingface.co/blog/fine-tune-w2v2-bert

Top answer
1 of 2
8
Paper: "Seamless: Multilingual Expressive and Streaming Speech Translation" , Barrault et al 2023: Large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model-SeamlessM4T v2. This newer model, incorporating an updated UnitY2 framework, was trained on more low-resource language data. SeamlessM4T v2 provides the foundation on which our next two models are initiated. SeamlessExpressive enables translation that preserves vocal styles and prosody. Compared to previous efforts in expressive speech research, our work addresses certain underexplored aspects of prosody, such as speech rate and pauses, while also preserving the style of one's voice. As for SeamlessStreaming, our model leverages the Efficient Monotonic Multihead Attention mechanism to generate low-latency target translations without waiting for complete source utterances. As the first of its kind, SeamlessStreaming enables simultaneous speech-to-speech/text translation for multiple source and target languages. To ensure that our models can be used safely and responsibly, we implemented the first known red-teaming effort for multimodal machine translation, a system for the detection and mitigation of added toxicity, a systematic evaluation of gender bias, and an inaudible localized watermarking mechanism designed to dampen the impact of deepfakes. Consequently, we bring major components from SeamlessExpressive and SeamlessStreaming together to form Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real-time. The contributions to this work are publicly released and accessible at this https URL .
2 of 2
1
Isn't CTC-based an advantage as it makes it much faster than an autoregressive model ?
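For reference, a sketch of loading that checkpoint for CTC fine-tuning; the class and checkpoint names are taken from the transformers docs and blog post linked in the resources above, and the vocab size is a placeholder that must match your own tokenizer:

from transformers import Wav2Vec2BertForCTC, AutoFeatureExtractor

# Names as referenced in the linked docs/blog post; vocab_size is hypothetical.
model = Wav2Vec2BertForCTC.from_pretrained(
    "facebook/w2v-bert-2.0",
    vocab_size=32,          # must match your fine-tuning tokenizer
    ctc_loss_reduction="mean",
)
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/w2v-bert-2.0")
# As noted in the cons above, the model must be fine-tuned on labeled
# audio before it can transcribe anything.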
🌐
TensorFlow
tensorflow.org › hub › fine-tuning wav2vec2 with an lm head
Fine-tuning Wav2Vec2 with an LM head | TensorFlow Hub
March 23, 2024 - Originally, wav2vec2 was pre-trained with a masked language modelling approach, with the objective of identifying the true quantized latent speech representation for a masked time step.
🌐
arXiv
arxiv.org › abs › 2403.01369
[2403.01369] A Closer Look at Wav2Vec2 Embeddings for On-Device Single-Channel Speech Enhancement
March 3, 2024 - Abstract page for arXiv paper 2403.01369: A Closer Look at Wav2Vec2 Embeddings for On-Device Single-Channel Speech Enhancement