In my limited testing, I found StyleTTS 2 to be incredibly good. It is very fast compared to other high-end models, and that is a key advantage. For me, StyleTTS 2 has made fast, low-resource TTS models such as FastSpeech (and FastSpeech 2) and Tacotron obsolete. XTTS v2 provides slightly better audio quality, but it is significantly slower and consumes more resources. I recently tested MetaVoice, and the quality is amazing and very natural (albeit with a slightly news-reporter style of narration). It produces some small artifacts at the ends of sentences, but nothing to worry about. In my opinion, StyleTTS 2 offers the best balance between speed and quality at the moment. Tortoise TTS sounds very natural, but in my tests it produced a lot of artifacts and strange mid-sentence pitch and intonation changes. There is also a project that aims to optimize and accelerate Tortoise TTS inference, but I haven't tried it yet: https://github.com/152334H/tortoise-tts-fast — Answer from RYSKZ on reddit.com
🌐
GitHub
github.com › yl4579 › StyleTTS2
GitHub - yl4579/StyleTTS2: StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis.
Starred by 6.1K users
Forked by 642 users
Languages   Python 77.8% | Jupyter Notebook 22.2%
🌐
arXiv
arxiv.org › abs › 2306.07691
[2306.07691] StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
November 20, 2023 - In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis.
🌐
Styletts2
styletts2.github.io
Audio Samples from StyleTTS 2
Abstract: In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis.
🌐
OpenReview
openreview.net › forum
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models | OpenReview
November 2, 2023 - Abstract: In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis.
🌐
PyPI
pypi.org › project › styletts2
styletts2 · PyPI
In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis.
      » pip install styletts2

Published   Jan 11, 2024
Version   0.1.6
🌐
Hugging Face
huggingface.co › spaces › styletts2 › styletts2
StyleTTS 2 - a Hugging Face Space by styletts2
Enter text and select a voice to generate high-quality, natural-sounding speech. The app supports multiple voices and allows customization of synthesis parameters for better control over the output.
🌐
Hugging Face
huggingface.co › papers › 2306.07691
Paper page - StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
StyleTTS 2 uses style diffusion and adversarial training with large pre-trained speech language models to achieve human-level text-to-speech synthesis, improving naturalness and zero-shot speaker adaptation.
🌐
Reddit
reddit.com › r/localllama › styletts 2 - closes gap further on tts quality + voice generation from samples
r/LocalLLaMA on Reddit: StyleTTS 2 - Closes gap further on TTS quality + Voice generation from samples
September 28, 2023 -

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

Yinghao Aaron Li, Cong Han, Vinay S. Raghavan, Gavin Mischler, Nima Mesgarani

In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models to generate the most suitable style for the text without requiring reference speech, achieving efficient latent diffusion while benefiting from the diverse speech synthesis offered by diffusion models. Furthermore, we employ large pre-trained SLMs, such as WavLM, as discriminators with our novel differentiable duration modeling for end-to-end training, resulting in improved speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches it on the multispeaker VCTK dataset as judged by native English speakers. Moreover, when trained on the LibriTTS dataset, our model outperforms previous publicly available models for zero-shot speaker adaptation. This work achieves the first human-level TTS synthesis on both single and multispeaker datasets, showcasing the potential of style diffusion and adversarial training with large SLMs.

Paper: https://arxiv.org/abs/2306.07691

Audio samples: https://styletts2.github.io/
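The abstract above describes modeling style as a latent variable sampled through diffusion. As a purely conceptual illustration of that idea (this is not the authors' implementation; the dummy denoiser, dimensions, and step count are all invented to show the sampling structure), a DDPM-style loop that draws a style vector from Gaussian noise conditioned on a text embedding looks roughly like this:

```python
# Toy sketch of the "style diffusion" sampling loop: start from pure noise
# and iteratively denoise toward a text-conditioned style vector.
# All shapes and the dummy denoiser are illustrative, NOT from StyleTTS 2.
import numpy as np

STYLE_DIM = 128  # invented style-vector size
STEPS = 50       # invented number of denoising steps

def dummy_denoiser(x, t, text_embedding):
    # Stand-in for the learned denoising network: nudges the noisy style
    # vector toward a (fake) target derived from the text conditioning.
    return (text_embedding - x) * (t / STEPS)

def sample_style(text_embedding, rng):
    # Ancestral-sampling-like loop: denoise step by step, re-injecting a
    # little noise at every step except the last.
    x = rng.standard_normal(STYLE_DIM)
    for t in range(STEPS, 0, -1):
        x = x + dummy_denoiser(x, t, text_embedding)
        if t > 1:
            x = x + 0.01 * rng.standard_normal(STYLE_DIM)
    return x

rng = np.random.default_rng(0)
text_embedding = rng.standard_normal(STYLE_DIM)  # fake "text" conditioning
style = sample_style(text_embedding, rng)
print(style.shape)  # (128,)
```

The point is only the control flow: no reference speech is needed, because the style vector is generated from noise given the text, which is what lets the model produce a suitable style without a speaker sample.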

Find elsewhere
🌐
Reddit
reddit.com › r/speechtech › anyone played and experimented with styletts2?
r/speechtech on Reddit: Anyone played and experimented with StyleTTS2?
February 14, 2024 -

Hello redditors,

Recently I've been playing with StyleTTS 2, and I have to say the inference speed versus quality is quite good. It's fast, and the quality is not bad by any means.

For example, inference with the pre-trained LJSpeech model is great. The speech quality isn't the best, but the intonation, pauses, and everything else are quite natural; if not for the quality of the LJSpeech dataset itself, I think it would be great.

I have a very old video card with only 4 GB, and I am still able to run inference on quite a bit of text in a reasonably short time. It is impressive for sure.

I'm curious: for anyone who has pre-trained their own models with this, what is your opinion?

I'm posting here not only to get opinions from people who have used it, but also to ask if anyone is willing to share their pre-trained model with me. I'll give you two reasons below why I need this, and I would absolutely appreciate anyone's help in this matter.

1. I am blind, and I desperately need a more natural text-to-speech system than SAPI on Windows or the standard text-to-speech output on iOS. I'm telling you folks, using such systems makes reading anything demotivating.

2. I don't have the budget to buy an RTX 4090 GPU, or the skills just yet to pre-train my own model.

ElevenLabs is definitely too expensive for converting longer text, say a textbook, to audio. That's for damn sure. play.ht isn't cheap either; I suppose I could pay 99 dollars or so for unlimited conversions, but that isn't feasible for me either.

Tortoise TTS is way too computationally expensive for any text-to-audiobook workflow, that's for sure.

Then I thought about RVC, but for that you also need a decent TTS solution, and from my testing I think that if I had a good enough pre-trained StyleTTS model, I could experiment further with RVC if needed.

So those are my thoughts. If anyone is willing to help me out, DM me, because I suppose nobody wants to share their models publicly.

I perfectly understand the issues surrounding sharing pre-trained models or audio. So I can promise three things to anyone who is willing to help in my situation.

1. I will never share your model with anybody.

2. I will never publicly share audio generated with your model.

3. It will be used for my reading activities, because that is my intention.

I perfectly understand that the post title is a bit of clickbait, I suppose, but I want people to actually read the post, and asking for help in a title is discouraging. So sorry for that...

I appreciate any comments and opinions, particularly from people who can evaluate StyleTTS 2's performance against the other available options, because it is above my pay grade and knowledge to judge how good it is compared to other implementations, particularly where diffusion is concerned...
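A practical note on the textbook-to-audio use case described above: most local TTS models, StyleTTS 2 included, are typically fed sentence-sized inputs, so long documents are usually split into chunks before inference. A minimal, model-agnostic sketch (the 400-character limit is an arbitrary assumption for illustration, not a StyleTTS 2 constraint):

```python
import re

def chunk_text(text: str, max_chars: int = 400) -> list[str]:
    """Split text into sentence groups of at most max_chars characters."""
    # Naive sentence split on ., !, ? followed by whitespace; a real
    # pipeline might use a proper sentence tokenizer instead.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)  # current group is full; start a new one
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

chunks = chunk_text("First sentence. Second one! A third? " * 10, max_chars=60)
assert all(len(c) <= 60 for c in chunks)  # every chunk fits the limit
```

Each chunk can then be synthesized in turn and the resulting audio concatenated, which also keeps memory use low on a small GPU like the 4 GB card mentioned above.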

🌐
GitHub
github.com › sidharthrajaram › StyleTTS2
GitHub - sidharthrajaram/StyleTTS2: 🐍🤖 Pip installable package for StyleTTS 2 human-level text-to-speech and voice cloning
In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis.
Starred by 160 users
Forked by 36 users
Languages   Python
🌐
NeurIPS
proceedings.neurips.cc › paper_files › paper › 2023 › hash › 3eaad2a0b62b5ed7a2e66c2188bb1449-Abstract-Conference.html
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
December 15, 2023 - In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis.
🌐
DagsHub
dagshub.com › blog › styletts2
StyleTTS2: A Quest To Improve Zero-Shot Performance
March 27, 2024 - Discover StyleTTS2's breakthrough in text-to-speech technology: an innovative architecture that captures the essence of any voice with a brief reference. This blog delves into our journey enhancing zero-shot performance, tackling dataset challenges, ...
🌐
Replicate
replicate.com › adirik › styletts2
adirik/styletts2 | Run with an API on Replicate
StyleTTS 2 is a text-to-speech model that can generate speech from text and text + a reference speech to copy its style (speaker adaptation).
🌐
Unreal Speech
blog.unrealspeech.com › style
StyleTTS2 Tutorial - The Ultimate Guide to Getting Started
November 21, 2023 - StyleTTS 2 is a state-of-the-art text-to-speech (TTS) model that represents a significant leap in the field of speech synthesis. It is designed to produce human-like speech by incorporating advanced techniques such as style diffusion and adversarial ...
🌐
GIGAZINE
gigazine.net › gsc_news › en › 20231122-style-tts-2
Introducing 'StyleTTS 2', an open source reading AI that can express emotions and synthesize human-like speech - GIGAZINE
Researchers at Columbia University have developed StyleTTS 2, a text-to-speech AI that can synthesize human-level speech using adversarial learning with a large-scale speech language model (SLM) and a diffusion model.
🌐
YouTube
youtube.com › watch
StyleTTS2 - Easiest Local Installation of Text to Speech AI Model - YouTube
This video shows how to locally install StyleTTS 2 which is a text-to-speech (TTS) model with human level quality for voice cloning and text to speech. 🔥 Buy...
Published   August 29, 2024
🌐
GoPenAI
blog.gopenai.com › code-examples-and-insights-from-tortoise-tts-and-styletts-2-133814647f87
Hands-On with Voice Cloning: Code Examples and Insights from TorToise-TTS and StyleTTS 2 | by Kirouane Ayoub | GoPenAI
August 13, 2024 - We will examine its core features, ... 2 distinguishes itself through its innovative application of style diffusion and adversarial training with large speech language models (SLMs)....
🌐
GitHub
github.com › yl4579 › StyleTTS
GitHub - yl4579/StyleTTS: Official Implementation of StyleTTS
The model will be saved in the format "epoch_1st_d.pth" and "epoch_2nd_d.pth". Checkpoints and Tensorboard logs will be saved at log_dir. The data list format needs to be filename.wav|transcription, see val_list_libritts.txt as an example. Please refer to inference.ipynb for details. The pretrained StyleTTS and Hifi-GAN on LJSpeech corpus in 24 kHz can be downloaded at StyleTTS Link and Hifi-GAN Link.
Starred by 456 users
Forked by 67 users
Languages   Python
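The data-list format mentioned in the snippet above (`filename.wav|transcription`, one entry per line) is easy to sanity-check before training. A small validator sketch — the sample lines are made up for illustration, and real lists such as val_list_libritts.txt may carry additional pipe-separated fields:

```python
def parse_data_list(lines):
    """Parse `filename.wav|transcription` lines; raise on malformed entries."""
    entries = []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split("|")
        if len(parts) < 2 or not parts[0].endswith(".wav"):
            raise ValueError(
                f"line {lineno}: expected 'filename.wav|transcription', got {line!r}"
            )
        entries.append((parts[0], parts[1]))
    return entries

# Hypothetical sample entries, not from any real dataset:
sample = [
    "clip_0001.wav|Hello there, this is a test sentence.",
    "clip_0002.wav|And a second one.",
]
print(parse_data_list(sample)[0][0])  # clip_0001.wav
```

Running a check like this over a training or validation list before kicking off a multi-hour training run catches malformed lines early, when they are cheap to fix.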
🌐
YouTube
youtube.com › watch
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion - YouTube
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models https://github.com/yl4579/St...
Published   June 30, 2024
🌐
ACM Digital Library
dl.acm.org › doi › 10.5555 › 3666122.3666982
StyleTTS 2 | Proceedings of the 37th International Conference on Neural Information Processing Systems
In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis.