Try https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2 (smaller option) or https://huggingface.co/openai/whisper-large-v3-turbo. Run the Parakeet model using the NeMo library, and use something like https://github.com/SYSTRAN/faster-whisper or https://github.com/ggml-org/whisper.cpp for the Whisper model. (Answer from Few-Welcome3297 on reddit.com.)
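A minimal sketch of the two suggested routes, assuming the nemo_toolkit[asr] and faster-whisper packages are installed and a local audio.wav file exists; exact output formats vary by library version:

    # Parakeet via NVIDIA NeMo (usage pattern from the model card)
    import nemo.collections.asr as nemo_asr

    asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/parakeet-tdt-0.6b-v2")
    results = asr_model.transcribe(["audio.wav"])
    # recent NeMo returns Hypothesis objects; older versions return plain strings
    print(results[0].text if hasattr(results[0], "text") else results[0])

    # Whisper via faster-whisper (CTranslate2 backend)
    from faster_whisper import WhisperModel

    # "large-v3-turbo" assumes a faster-whisper release that ships this size; fall back to "large-v3" otherwise
    # compute_type="float16" targets GPU; use "int8" on CPU
    model = WhisperModel("large-v3-turbo", device="auto", compute_type="float16")
    segments, info = model.transcribe("audio.wav", beam_size=5)
    print("Detected language:", info.language)
    for segment in segments:
        print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")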
Hugging Face · huggingface.co › models
Automatic Speech Recognition Models – Hugging Face
Text-to-Speech · Text-to-Audio · Automatic Speech Recognition · Audio-to-Audio · Audio Classification · Voice Activity Detection · Tabular · Tabular Classification · Tabular Regression · Time Series Forecasting · Reinforcement Learning
Hugging Face · huggingface.co › docs › transformers › en › model_doc › speech_to_text
Speech2Text
The Speech2Text model was proposed in fairseq S2T: Fast Speech-to-Text Modeling with fairseq by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino. It’s a transformer-based seq2seq (encoder-decoder) model designed for end-to-end ...
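For reference, a short transcription sketch with this model family following the usual Transformers pattern; the facebook/s2t-small-librispeech-asr checkpoint and the dummy LibriSpeech split are illustrative choices, and sentencepiece plus torchaudio need to be installed:

    from datasets import load_dataset
    from transformers import Speech2TextForConditionalGeneration, Speech2TextProcessor

    model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-small-librispeech-asr")
    processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr")

    # tiny LibriSpeech sample commonly used in Transformers docs and tests
    ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
    inputs = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt")

    generated_ids = model.generate(inputs["input_features"], attention_mask=inputs["attention_mask"])
    print(processor.batch_decode(generated_ids, skip_special_tokens=True))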
Discussions

What's the Best Speech-to-Text Model Right Now?
r/LocalLLaMA · September 13, 2025
Looking to run local speech to text model.
Whisper large is great....
r/huggingface · November 19, 2023
🎧 Listen and Compare 12 Open-Source Text-to-Speech Models (Hugging Face Space)
Nice to have it all in one place. It'd be even nicer to have an apples-to-apples comparison, i.e. all female or all male voices, instead of mixed like it is now. Maybe both? The CSM example sounds like it's full of artifacts, just like F5-TTS, and both were highlighted for speech quality. Maybe something went wrong during generation? At least Sesame can sound way better. The Llasa sample seems slightly broken; maybe that's a hint this happens more often? Same with the background noise for MegaTTS3. Orpheus was probably standing in a large room during the generation 😉.
r/LocalLLaMA · July 6, 2025
Improved Text to Speech model: Parler TTS v1 by Hugging Face
Where can I find the full list of the 34 voice names, and do you have quick audio samples for them to get an idea of each one?
r/LocalLLaMA · August 8, 2024
People also ask

What models can I use for Text-to-Speech?
The KittenML/kitten-tts-nano-0.1, ResembleAI/chatterbox, fishaudio/fish-speech-1.5, and nari-labs/Dia-1.6B-0626 models can be used for Text-to-Speech.
Source: What is Text-to-Speech? - Hugging Face (huggingface.co › tasks › text-to-speech)
What is Text-to-Speech?
Text-to-Speech (TTS) is the task of generating natural sounding speech given text input. TTS models can be extended to have a single model that generates speech for multiple speakers and multiple languages.
Source: What is Text-to-Speech? - Hugging Face (huggingface.co › tasks › text-to-speech)
What models can I use for Automatic Speech Recognition?
The openai/whisper-large-v3, facebook/w2v-bert-2.0, facebook/seamless-m4t-v2-large, nvidia/canary-1b, and pyannote/speaker-diarization-3.1 models can be used for Automatic Speech Recognition.
Source: What is Automatic Speech Recognition? - Hugging Face (huggingface.co › tasks › automatic-speech-recognition)
Hugging Face · huggingface.co › openai › whisper-large-v3
openai/whisper-large-v3 · Hugging Face
Our studies show that, over many existing ASR systems, the models exhibit improved robustness to accents, background noise, and technical language, as well as zero-shot translation from multiple languages into English, and that accuracy on speech recognition and translation is near the state-of-the-art level. However, because the models are trained in a weakly supervised manner using large-scale noisy data, the predictions may include texts that are not actually spoken in the audio input (i.e. hallucination).
Medium · medium.com › latinxinai › heres-to-the-crazy-ones-the-misfits-45f2132623c7
Here’s to the crazy ones, the misfits: Automatic Speech Recognition with PyTorch & Hugging Face
April 17, 2024 - One of the first things I noticed when I checked out the Transformers page was the ability to convert audio into text, demonstrated by a 60-second audio extract from one of the most inspiring speeches ever: the 1963 “I have a dream” speech from Martin Luther King.

    from transformers import pipeline
    transcriber = pipeline(task="automatic-speech-recognition", model="openai/whisper-small")
    transcription_results = transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")
    print(transcription_results)
Hugging Face · huggingface.co › tasks › automatic-speech-recognition
What is Automatic Speech Recognition? - Hugging Face
Notes from the task page: an end-to-end model that performs ASR and speech translation by Meta AI; a powerful multilingual ASR and speech translation model by Nvidia; a powerful speaker diarization model; plus 1,679 browsable ASR datasets, including one with 31,175 hours of multilingual audio-text data in 108 languages.
Find elsewhere
Hugging Face · huggingface.co › models
Text-to-Speech Models – Hugging Face
GitHub · github.com › huggingface › speech-to-speech
GitHub - huggingface/speech-to-speech: Speech To Speech: an effort for an open-sourced and modular GPT4-o
The pipeline provides a fully open and modular approach, with a focus on leveraging models available through the Transformers library on the Hugging Face hub.
4.2K stars · 485 forks · Python 99.7% | Dockerfile 0.3%
KDnuggets · kdnuggets.com › use-hugging-face-transformers-text-to-speech-applications
How to Use Hugging Face Transformers for Text-to-Speech Applications - KDnuggets
October 24, 2024 - Hugging Face provides a variety of pre-trained models that can turn text into speech. For TTS applications, you can use models like Tacotron2 or FastSpeech2. These models have been trained to convert text into human-like speech.
Hugging Face · huggingface.co › tasks › text-to-speech
What is Text-to-Speech? - Hugging Face
The Hub contains over 1500 TTS models that you can use right away by trying out the widgets directly in the browser or calling the models as a service using Inference Endpoints. Here is a simple code snippet to get you started: import json import ...
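The snippet above is cut off; as a rough local alternative (not the page's own example), a text-to-speech pipeline call might look like the sketch below, with suno/bark-small chosen only as an illustrative checkpoint:

    import scipy.io.wavfile as wavfile
    from transformers import pipeline

    # "text-to-speech" is an alias of the text-to-audio pipeline in recent Transformers releases
    tts = pipeline("text-to-speech", model="suno/bark-small")
    out = tts("Hugging Face hosts well over a thousand text-to-speech models.")

    # the pipeline returns a dict with a waveform and its sampling rate
    wavfile.write("tts_sample.wav", rate=out["sampling_rate"], data=out["audio"].squeeze())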
Hugging Face · huggingface.co › docs › transformers › en › tasks › text-to-speech
Text to speech
Text-to-speech (TTS) is the task ... text-to-speech models are currently available in 🤗 Transformers, such as Dia, CSM, Bark, MMS, VITS and SpeechT5....
Hugging Face · huggingface.co › learn › audio-course › en › chapter6 › pre-trained_models
Pre-trained models for text-to-speech - Hugging Face Audio Course
Just like any other Transformer, the encoder-decoder network models a sequence-to-sequence transformation using hidden representations. This Transformer backbone is the same for all tasks SpeechT5 supports.
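A condensed version of the SpeechT5 text-to-speech recipe from the course and docs; the speaker embeddings come from the CMU Arctic x-vector dataset, and index 7306 is just an arbitrary voice:

    import soundfile as sf
    import torch
    from datasets import load_dataset
    from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

    processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
    model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
    vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

    inputs = processor(text="SpeechT5 handles speech and text with one shared Transformer backbone.", return_tensors="pt")

    # x-vector speaker embeddings; pick any row to select a voice
    embeddings = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
    speaker_embedding = torch.tensor(embeddings[7306]["xvector"]).unsqueeze(0)

    speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)
    sf.write("speecht5_demo.wav", speech.numpy(), samplerate=16000)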
GitHub · github.com › huggingface › parler-tts
GitHub - huggingface/parler-tts: Inference and training library for high-quality TTS models.
August 8, 2024 - Inference and training library for high-quality TTS models. - huggingface/parler-tts
5.5K stars · 582 forks · Python
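Inference with the parler-tts library roughly follows the pattern in its README; in this sketch the parler-tts/parler-tts-mini-v1 checkpoint and the voice description are examples, and soundfile is assumed for writing the WAV:

    import soundfile as sf
    import torch
    from parler_tts import ParlerTTSForConditionalGeneration
    from transformers import AutoTokenizer

    device = "cuda:0" if torch.cuda.is_available() else "cpu"

    model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1").to(device)
    tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")

    prompt = "Where can I find the full list of the 34 voice names?"
    description = "A calm female speaker with a clear voice, recorded in a quiet room."

    # Parler-TTS conditions generation on a text description of the target voice
    input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
    prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

    audio = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
    sf.write("parler_demo.wav", audio.cpu().numpy().squeeze(), model.config.sampling_rate)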
Hugging Face · huggingface.co › docs › transformers › en › model_doc › speech_to_text_2
Speech2Text2
Speech2Text2 is a decoder-only transformer model that can be used with any speech encoder-only, such as Wav2Vec2 or HuBERT for Speech-to-Text tasks.
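In Transformers this encoder/decoder pairing is exposed through SpeechEncoderDecoderModel; a rough sketch using the facebook/s2t-wav2vec2-large-en-de speech-translation checkpoint (chosen here only as an illustration) might look like:

    from datasets import load_dataset
    from transformers import Speech2Text2Processor, SpeechEncoderDecoderModel

    # Wav2Vec2 encoder + Speech2Text2 decoder, here for English speech -> German text
    model = SpeechEncoderDecoderModel.from_pretrained("facebook/s2t-wav2vec2-large-en-de")
    processor = Speech2Text2Processor.from_pretrained("facebook/s2t-wav2vec2-large-en-de")

    ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
    inputs = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt")

    generated_ids = model.generate(inputs["input_values"], attention_mask=inputs["attention_mask"])
    print(processor.batch_decode(generated_ids, skip_special_tokens=True))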
Hugging Face · huggingface.co › collections › SamuraiBarbi › speech-to-text-models
Speech to Text Models - a SamuraiBarbi Collection
Unlock the magic of AI with handpicked models, awesome datasets, papers, and mind-blowing Spaces from SamuraiBarbi
Hugging Face · huggingface.co › collections › unsloth › text-to-speech-tts-models
Text-to-Speech (TTS) models - a unsloth Collection
A collection of 4-bit, Dynamic 4-bit and 16-bit voice models including Sesame-CSM, OpenAI's Whisper, Orpheus. Fine-tune them with Unsloth now!
Hugging Face · huggingface.co › docs › transformers › en › tasks › asr
Automatic speech recognition
Automatic speech recognition (ASR) converts a speech signal to text, mapping a sequence of audio inputs to text outputs. Virtual assistants like Siri and Alexa use ASR models to help users every day, and there are many other useful user-facing ...
Reddit · reddit.com › r/huggingface › looking to run local speech to text model.
r/huggingface on Reddit: Looking to run local speech to text model.
November 19, 2023 -

I've been using the OpenAI API for speech to text. It works well, but the cost can start getting high. I have no experience running a local speech-to-text model. Can someone offer guidance on both:

  1. Best models

  2. How to host and run the models locally

Reddit · reddit.com › r/localllama › 🎧 listen and compare 12 open-source text-to-speech models (hugging face space)
r/LocalLLaMA on Reddit: 🎧 Listen and Compare 12 Open-Source Text-to-Speech Models (Hugging Face Space)
July 6, 2025 -

Hey everyone!

We have been exploring various open-source Text-to-Speech (TTS) models, and decided to create a Hugging Face demo space that makes it easy to compare their quality side-by-side.

The demo features 12 popular TTS models, all tested using a consistent prompt, so you can quickly hear and compare their synthesized speech and choose the best one for your audio projects.

Would love to get feedback or suggestions!

👉 Check out the demo space and detailed comparison here!

👉 Check out the blog: Choosing the Right Text-to-Speech Model: Part 2

Share your use-case and we will update this space as required!

Which TTS model sounds most natural to you?

Cheers!

Hugging Face · huggingface.co › models
Models – Hugging Face