Hey everyone!
We have been exploring various open-source Text-to-Speech (TTS) models, and decided to create a Hugging Face demo space that makes it easy to compare their quality side-by-side.
The demo features 12 popular TTS models, all tested using a consistent prompt, so you can quickly hear and compare their synthesized speech and choose the best one for your audio projects.
Would love to get feedback or suggestions!
Check out the demo space and detailed comparison here!
Check out the blog: Choosing the Right Text-to-Speech Model: Part 2
Share your use-case and we will update this space as required!
Which TTS model sounds most natural to you?
Cheers!
What I'm looking for
Yes, "best" is subjective - but specifically what I'm looking for in a text-to-speech API is one that is as cheap as possible without sacrificing the qualities below:
A good selection of voices and voice customization (speaking rate, tonality, etc.)
A company that's easy to work with, one that can make fairly reasonable deals on pricing.
Easy to use API
As a bonus, it would be nice for the API to have some sort of caching mechanism, so that repeating the same line doesn't incur additional usage costs.
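If a provider doesn't offer caching itself, the same effect can be approximated on the client side. Below is a minimal sketch, not any vendor's actual API: `cached_tts`, the `tts_cache` directory, and the `synthesize` callable are all hypothetical names, and `synthesize` stands in for whatever billed request your chosen service exposes. It assumes the service returns the audio as raw bytes.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_tts(
    text: str,
    voice: str,
    synthesize: Callable[[str, str], bytes],
    cache_dir: Path = Path("tts_cache"),
) -> bytes:
    """Return audio for (text, voice), calling the paid API only on a cache miss."""
    # Key on both voice and text, so the same line in a different voice
    # is stored as a separate entry.
    key = hashlib.sha256(f"{voice}\n{text}".encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.bin"

    if path.exists():
        # Cache hit: no request is made, so no usage cost is incurred.
        return path.read_bytes()

    # Cache miss: this is the (hypothetical) billed call to the provider.
    audio = synthesize(text, voice)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_bytes(audio)
    return audio
```

With something like this in front of the API, a repeated line is only billed the first time it is requested. Any other voice settings you vary (rate, pitch, and so on) would need to be folded into the cache key as well.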
Context for why I'm looking
I'm creating a website that relies heavily on text to speech. I've been using the Web Speech API, which has been great, especially because it's free. However, the voices don't sound natural at all, and I'd like to use something like ElevenLabs for my use-case (though, again, I'm open to any alternatives people have had success with).
Or, if people have advice on creating my own text to speech model, and it's low effort - please advise 🤣 Although my assumption is that it will be a lot of effort and spendy.
I just want to dabble in creating videos that are basically video essays about video games. If you're familiar with Noah Caldwell-Gervais, check out his YouTube channel; that's probably the closest example of what I'd like to aim for. I'm recording the gameplay via Nvidia Share (aka ShadowPlay) and editing the video in DaVinci Resolve, so basically all that's left is the voice-over narration.
A few problems though: 1) I don't plan on monetizing, so it would not be cost-efficient to hire someone to do the voice-over. 2) I don't like my voice and don't have the talent to do it myself. Hell - I don't even like talking. 3) Even if #2 weren't a problem, I don't have good gear to record a good voice-over.
Solution? I think I'm fine with a Text-to-Speech service.
I've looked up a bunch of services and I think Amazon Polly and Google TTS are the most natural sounding I've encountered. My problem with the former is that it requires an AWS account. I tried to sign up, but I got stuck at the phone verification step - the code from AWS wouldn't reach my phone - so that's a no go so far. My problem with the latter is that it's paid, apparently? So again, the same reason why I don't want to hire a voice actor.
So the best alternative I have so far is IBM Watson. It doesn't sound as natural as Polly and Google, but it's the least robotic of the other TTS services I've tried. I'm not 100% satisfied with it, so I thought I'd post here to ask if I'm missing anything.
Can anyone suggest any app or service that provides a natural, lifelike TTS that isn't Amazon Polly, isn't Google TTS, and is better than IBM Watson? Thanks!