In my limited testing, I found StyleTTS 2 to be incredibly good. It is very fast compared to other high-end models, and that is a key advantage. For me, StyleTTS 2 has made fast, low-resource TTS models such as FastSpeech (and FastSpeech 2) and Tacotron obsolete. XTTS 2 provides slightly better audio quality, but it is significantly slower and consumes more resources. I recently tested MetaVoice and the quality is amazing, very natural (a little news-reporter-style narration, though). It produces some small artifacts at the end of sentences, but nothing to worry about. In my opinion, StyleTTS 2 offers the best balance between speed and quality at the moment. Tortoise TTS provides a very natural sound, but in my tests it had a lot of artifacts and strange mid-sentence pitch and intonation changes. There is also this project, which aims to optimize and accelerate Tortoise TTS inference, but I haven't tried it yet: https://github.com/152334H/tortoise-tts-fast Answer from RYSKZ on reddit.com
🌐
GitHub
github.com › yl4579 › StyleTTS2
GitHub - yl4579/StyleTTS2: StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis.
Starred by 6.1K users
Forked by 642 users
Languages   Python 77.8% | Jupyter Notebook 22.2%
🌐
arXiv
arxiv.org › abs › 2306.07691
[2306.07691] StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
November 20, 2023 - Abstract:In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis.
Discussions

Anyone played and experimented with StyleTTS2?
In my limited testing, I found StyleTTS 2 to be incredibly good. It is very fast compared to other high-end models, and that is a key advantage. More on reddit.com
🌐 r/speechtech
13
7
February 14, 2024
StyleTTS 2 - Closes gap further on TTS quality + Voice generation from samples
It would be the future if it supported European languages More on reddit.com
🌐 r/LocalLLaMA
21
112
September 29, 2023
[D] What is the best text-to-speech tool currently?
Coqui gives you access to many through a single Python API. XTTS is probably the best for reasonably fast synthesis; you can get real-time on an NVIDIA graphics card with only 4GB VRAM lol. Bark can also be accessed through it for voice cloning, but it will hallucinate, and styletts2 More on reddit.com
🌐 r/MachineLearning
72
45
January 13, 2024
[D] Are there any open source TTS model that can rival 11labs?
XTTS 2 is as good as open source TTS gets. More on reddit.com
🌐 r/MachineLearning
26
68
December 17, 2023
🌐
Styletts2
styletts2.github.io
Audio Samples from StyleTTS 2
Abstract: In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis.
🌐
NeurIPS
proceedings.neurips.cc › paper_files › paper › 2023 › hash › 3eaad2a0b62b5ed7a2e66c2188bb1449-Abstract-Conference.html
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
December 15, 2023 - In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis.
🌐
OpenReview
openreview.net › forum
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models | OpenReview
November 2, 2023 - Abstract: In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis.
🌐
Reddit
reddit.com › r/speechtech › anyone played and experimented with styletts2?
r/speechtech on Reddit: Anyone played and experimented with StyleTTS2?
February 14, 2024 -

Hello redditors,

Recently I've been playing with StyleTTS 2, and I have to say the inference-speed-to-quality ratio is quite good. It's fast, and the quality is not bad by any means.

For example, inference with the pre-trained LJSpeech model is great. The quality of the speech isn't the best, but the intonation, pauses and everything else are quite natural; if not for the quality of the LJSpeech dataset itself, I think it would be great.

I have a very old video card with only 4GB, and I am still able to run inference on quite a bit of text in a reasonable amount of time. It is impressive for sure.

I'm curious: for anyone who pre-trained their own models with this, what is your opinion?

I'm posting here not only to get the opinion from people who used it, but also to ask if anyone is willing to share their pre-trained model with me. I'm gonna give you two reasons below why I need this. And I would absolutely appreciate anyone's help in this matter.

1. I am blind and I desperately need a more natural text-to-speech system than SAPI on Windows or the standard text-to-speech output on iOS. I'm telling you folks, using such systems is demotivating for reading anything.

2. I don't have the budget to buy an RTX 4090 GPU, or the skills just yet to pre-train my own model.

ElevenLabs is definitely too expensive for converting longer text, say a textbook, to audio. That's for damn sure. play.ht isn't cheap either; I suppose I could pay 99 dollars or so for unlimited conversions, but that isn't feasible for me either.

tortoise-tts is way too computationally expensive for any text-to-audiobook workflow, that's for sure.

Then I thought about RVC, but for that you also need a decent TTS solution, and from my testing I think that if I have a good enough pre-trained StyleTTS model, I could experiment further with RVC if needed.

Yeah, those are my thoughts. If anyone is willing to help me out, DM me, because I suppose nobody wants to share their models publicly.

I perfectly understand the issues surrounding sharing pre-trained models or audio. So I can promise 3 things for anyone who is willing to help in my situation.

1. I will never share your model with anybody.

2. I will never share the audio generated with your model publicly.

3. It will be used only for my reading activities, because that's my intention.

I perfectly understand that the post title is a bit of clickbait, I suppose, but I want people to actually read the post, and asking for help in a title is discouraging. So sorry for that...

I appreciate any comments and opinions, particularly from people who can evaluate StyleTTS 2's performance against the other available options, because it is above my pay grade and knowledge to judge how good it is compared to other implementations, particularly where diffusion is concerned...

🌐
Reddit
reddit.com › r/localllama › styletts 2 - closes gap further on tts quality + voice generation from samples
r/LocalLLaMA on Reddit: StyleTTS 2 - Closes gap further on TTS quality + Voice generation from samples
September 29, 2023 -

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

Yinghao Aaron Li, Cong Han, Vinay S. Raghavan, Gavin Mischler, Nima Mesgarani

In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models to generate the most suitable style for the text without requiring reference speech, achieving efficient latent diffusion while benefiting from the diverse speech synthesis offered by diffusion models. Furthermore, we employ large pre-trained SLMs, such as WavLM, as discriminators with our novel differentiable duration modeling for end-to-end training, resulting in improved speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches it on the multispeaker VCTK dataset as judged by native English speakers. Moreover, when trained on the LibriTTS dataset, our model outperforms previous publicly available models for zero-shot speaker adaptation. This work achieves the first human-level TTS synthesis on both single and multispeaker datasets, showcasing the potential of style diffusion and adversarial training with large SLMs.

Paper: https://arxiv.org/abs/2306.07691

Audio samples: https://styletts2.github.io/

Find elsewhere
🌐
Hugging Face
huggingface.co › spaces › Wismut › StyleTTS2_Studio
StyleTTS2 Studio - a Hugging Face Space by Wismut
Use this app to generate speech from text with predefined voices. Customize the voice using sliders for features like gender, tone, and pace. Save and reuse your customized voices.
🌐
Hugging Face
huggingface.co › papers › 2306.07691
Paper page - StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
June 13, 2023 - StyleTTS 2 uses style diffusion and adversarial training with large pre-trained speech language models to achieve human-level text-to-speech synthesis, improving naturalness and zero-shot speaker adaptation.
🌐
ChatPaper
chatpaper.com › fr › chatpaper › paper › 10609
StyleTTS 2: Towards Human-Level Text-to-Speech through ...
You can explore daily latest papers, get AI summaries! Or AI chat with your files directly, get instant AI answers with cited sources. Ultimate ChatPDF tool for efficient academic research, effortless literature review!
🌐
YouTube
youtube.com › watch
StyleTTS2 - Easiest Local Installation of Text to Speech AI Model - YouTube
This video shows how to locally install StyleTTS 2 which is a text-to-speech (TTS) model with human level quality for voice cloning and text to speech.🔥 Buy...
Published   August 29, 2024
🌐
PyPI
pypi.org › project › styletts2
styletts2 · PyPI
In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis.
» pip install styletts2
Published   Jan 11, 2024
Version   0.1.6
🌐
Unite.AI
unite.ai › styletts-2-human-level-text-to-speech-with-large-speech-language-models
StyleTTS 2: Human-Level Text-to-Speech with Large Speech Language Models – Unite.AI
December 4, 2023 - StyleTTS2 is an innovative Text to Speech synthesis model that takes the next step towards building human-level TTS frameworks, and it is built upon StyleTTS, a style-based text to speech generative model.
🌐
Semantic Scholar
semanticscholar.org › papers › styletts: a style-based generative model for natural and diverse text-to-speech synthesis
[PDF] StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis | Semantic Scholar
In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS
🌐
arXiv
arxiv.org › html › 2504.10309v1
AutoStyle-TTS: Retrieval-Augmented Generation based Automatic Style Matching Text-to-Speech Synthesis
April 11, 2025 - Text-to-Speech, Retrieval-Augmented Generation, Automatic Style Matching · Existing language model (LM) based TTS models [1, 2, 3, 4, 5] have achieved highly realistic results in speech generation and voice cloning, and podcast synthesis is one of the most important application scenarios[6]. Whether it is storytelling or interview dialogue, an engaging podcast is often accompanied by changes of subject, interactions between speakers, and emotional fluctuations[7]. As a result, speech synthesis technology must not only be able to control the speaker’s timbre and style of speech, but also be able to dynamically adapt the style of speech according to different contextual situations in order to create more natural and immersive user experience.
🌐
Smallest.ai
smallest.ai › blog › run-styletts2-on-google-colab
StyleTTS 2 : TTS model using Style Vector and Diffusion ...
StyleTTS 2 builds on StyleTTS, which was a non-autoregressive model using a style vector from reference audio, enabling natural and expressive speech generation. Its goal was to synthesize speech that can capture full para-linguistic information ...
🌐
Replicate
replicate.com › adirik › styletts2
adirik/styletts2 | Run with an API on Replicate
November 20, 2023 - StyleTTS 2 is a text-to-speech model that can generate speech from text and text + a reference speech to copy its style (speaker adaptation).
🌐
Unreal Speech
blog.unrealspeech.com › styletts2-simplified
StyleTTS2 Simplified
November 24, 2023 - A complete guide to understanding StyleTTS, Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models. What is StyleTTS? StyleTTS 2 is a state-of-the-art text-to-speech (TTS) model that represents ...
🌐
Scrapbox
scrapbox.io › work4ai › StyleTTS_2
StyleTTS 2 - work4ai
https://arxiv.org/abs/2306.07691 StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models · https://github.com/yl4579/StyleTTS2 yl4579/StyleTTS2 · Demo https://huggingface.co/spaces/styletts2/styletts2 · We introduce StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis ·
🌐
ACM Digital Library
dl.acm.org › doi › 10.5555 › 3666122.3666982
StyleTTS 2 | Proceedings of the 37th International Conference on Neural Information Processing Systems
In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis.