Why does nobody care about AI voices like they do for text or image generation? We're almost a full year behind them and there's been no progress since then.
Fewer people jerk off to that.
Progress isn't completely dead, we have XTTS now for example, but yeah, it leaves a lot to be desired if you want expressiveness and don't simply want to voice clone. I want to help push this field along more but I'm very moronic
baka, you've clearly never thought about the possibility of your futa mommy dragon chatbot wife whispering into your ear while jerking you off
AI-powered brainwashing! AI-powered brainwashing! Come on nerds make it real already!
God hypno audio generation would be so fricking good
Also this.
Definitely seems we're closer than further now, then. Gotta hope some other people pick it up.
AI voice is adjacent to the music industry, which is very litigious, and that's what is stopping development.
YouTube is completely riddled with and overrun by AI-narrated content. In this day and age, if it's a faceless video channel then chances are the narration is AI. Especially if it's a science channel.
Especially if it is a true crime or another morbid scenario gleaned off wikipedia. Fixed it for you.
Why is AI on BOT anyways?
I don't want to be here either but /ai/ (or /gai/) is clearly not happening
>AI is technology
>Development of AI and AI tools involves programming and/or technical knowledge
>Using local AI requires a computer, typically a good one
Why wouldn't it be on BOT?
BOT is dicky board. Make mesugaki voice synth and it will become BOT related.
>/b/ and BOT are the dicky boards
fixed your typos
moron. BOT was named after technology, be a tourist elsewhere
I hope this is bait. The SOTA is a StyleTTS2 finetuned model. It's a b***h to finetune but you get an elevenlabs-tier TTS
Can a mere mortal run that locally?
Yeah it uses 4-5GB of VRAM
Cool I'd have 20gb to spare then. If it works with other languages I'm sold.
The provided samples are amazing, sometimes better than ground truth. But the demo at
https://huggingface.co/spaces/styletts2/styletts2
is totally hosed. Full blown speech impediment and dyslexia on anything I put in. What gives?
That's 0-shot from the default voices. You need a finetuned model to get something good.
Wait, what's the technical reason? I can get the intonation not being great out-of-distribution, but if it's this phonetically unreliable, it's worse than 80s TTS tech.
"It's a b***h to finetune" — when people talk about LoRAs on image gen this usually means the settings to do it are esoteric and people rarely get a good result. Is that what you mean? Because if that's the case it isn't very usable.
No, the settings are straightforward; what's hard is setting up the whole thing and gathering a large cleaned audio dataset. Also you need to rent a GPU as it needs ~75GB of VRAM.
Oh I see, well maybe with 5000 series
>Also you need to rent the GPU as it needs ~75GB of VRAM
What??
Finetuning that shit isn't cheap
i shat bricks when i saw it too anon
>Impossible on a 24GB card
It's fricking over.
Impossible to train yes but you can run it
And there's no other way to get a finetuned model but training it yourself?
You can find some on huggingface but there are only a few of them
And what is the reason it can't be trained on 24GB of VRAM? Shouldn't it just take much, much longer? I don't mind letting it run for a few days, there's no rush to have it in 3 hours like that thing says.
As you can see here
you can train it if you set it with max_len: 100 with Style Diffusion/SLM disabled or with Style Diffusion only. The issue is that the output quality will be somewhat passable instead of 'very good'.
You can train on a 24GB card. Just set max_len to 280 and batch size to 2. Enter virtual console mode by pressing Ctrl+Alt+F1. Type in nvtop and close as many programs as possible in order to reduce the VRAM usage.
Start training.
accelerate launch --mixed_precision=fp16 --num_processes=1 train_finetune_accelerate.py --config_path ./Configs/config_ft.yml
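For reference, the 24GB settings above would land in config_ft.yml roughly like this. Key names are assumed from the StyleTTS2 repo's finetuning config and may differ between versions, so check against your checkout:

```yaml
# Partial config_ft.yml sketch for a 24GB card, per this thread.
# Only the fields discussed here are shown; keep the rest of the
# repo's default finetuning config as-is.
batch_size: 2
max_len: 280     # max audio length per training sample (frames)
```

Lower max_len (e.g. 100, with Style Diffusion/SLM disabled) reduces VRAM further at the cost of output quality, as noted above.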
>huggingface
What an innocent and adorable name that couldn't possibly mean anything nefarious.
does BOT have a guide to finetuning styletts2? how much data would you need for a decent finetune?
You can look at this:
https://github.com/yl4579/StyleTTS2/discussions/81
Mods shut down discussion about AI voice on /vt/ because they didn't want their favorite girls to be impersonated.
> trannies worried people will pretend to be something they're not.
Oh the irony
>Hey Don, have you heard, of this new technology
https://soundcloud.com/user-535691776/dialog
https://arxiv.org/pdf/2403.03100.pdf
The code isn't there anon.
there's a few decent models for AI voices but they aren't getting the software support they need.
WhisperSpeech has been out for ages and it's basically just an inversion of OpenAI's Whisper model (speech to text). It should be trivial to modify whisper.cpp to work with the model but the devs aren't interested in working on it, and every time anyone brings it up they get ignored.
which is a running theme for text-to-speech models. Piper is another model for voice cloning (though it isn't as good at one-shotting a voice clone), it at least has a CPP library... which doesn't compile easily and hasn't been worked on for over 3 months.
It's peak uncanny valley. Even hand-pitched vocaloid stuff sounds far more natural than AI ever will, and the algorithms don't seem to get any better.
Same as music.
>Even hand-pitched vocaloid stuff sounds far more natural than AI ever will
Nah that's cope.
I wish they did. AI voices are important for LARPs or video game modding, ESPECIALLY when it comes to e-girl because there are not nearly enough e-girl voice actors around for "moral" reasons, or they are willing to do it but just don't want it on their VA resume....
There are enough JA voices for that
But if an English-speaking, non-Japanese-speaking creator wants to hire a JA VA, how difficult would it be because of the language barrier? Pretty difficult I'd imagine, since I've seen none anywhere. Where would you even find cheap ones when you just need a VA for a game mod?
You have a point. Still, you can get a good enough EN e-girl voice with current TTS tech; I can think of Neuro-sama's voice for example.
What we need is S tier voices like from Shondo or even Gura. Shondo in particular has the perfect e-girl voice, so if we can get AI to replicate that perfectly then we are good. And no one would complain, there are no laws against the AI batman.
There are many copies of Gura's voice. Youtube is filled with Gura AI songs
sauce to the AI voices?
https://huggingface.co/sail-rvc/Gawr_Gura__Hoe-girlve_EN__RVC_v1
i just want real time ai audio porn
Supposedly Bark can generate audio in real time, but that hasn't been my experience with it. It's still the superior choice when it comes to expressiveness, but it's slower than they claim in their GitHub.
Maybe it's not actually using my GPU, because I don't see any utilization in the task manager, but every GPU option is set to ON so I don't know what is going on.
hm... not bad... although these being able to moan, suck, slap, etc is absolutely necessary for this particular usecase IMO. I'm assuming bark can't do that...
Not yet, but it can laugh, sigh and clear its throat which is miles ahead than all other (local) voice generators. It's way more human.
The only issue is those lazy fricking Black folk haven't updated the thing in 10 months.
Bark is theoretically closer to The Dream because it will do text-to-audio tasks alongside your speech (e.g. laughing, crying; non-verbal and seggs sounds if you finetune it) but in reality the model is so fricking bad I've come to assume training was intentionally botched for safety or profit.
There's CONSTANTLY either other shit in the background or just plain awful output.
>There's CONSTANTLY either other shit in the background or just plain awful output.
My moronic solution is to get the entire audio, run it through Ultimate Vocal Remover AI to separate the background noise from the voice and then run it again through RVC to convert it to a different voice. It's fricking bullshit, but it works and it ends up sounding real.
But it involves a lot of trimming and pasting audio together to get the best result. Why the FRICK does Bark have a 14 second limit?
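One way around the 14-second cap is to split the script into chunks short enough for a single generation, run each through Bark separately, and concatenate the audio before the UVR/RVC passes. A minimal sketch of the splitting step; the ~2.5 words-per-second speaking rate is my assumption, tune it to the voice:

```python
import re

WORDS_PER_SECOND = 2.5  # assumed average speaking rate
MAX_SECONDS = 13        # stay safely under Bark's ~14-second cap

def chunk_script(text: str) -> list[str]:
    """Split text into sentence groups short enough for one Bark call."""
    max_words = int(WORDS_PER_SECOND * MAX_SECONDS)
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        words = len(sentence.split())
        if current and count + words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk then goes through Bark's generate_audio and the resulting arrays get concatenated with numpy; it cuts way down on the manual trimming and pasting, though you still want to check chunk boundaries by ear.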
Nvidia broadcast would be more efficient to filter all the noise I think.
AI voices are already completely believably real, and have been for a long time, when trained on a single person's voice. The surprising thing is that coomers haven't really done much with the technology. I think what's missing for you is a voice generator where you can put in tags like "sexy black hentai succubus whispering" and get a good output. Maybe throw a bitcoin or two at it and someone will make that software. In the meantime, people will be making fake political speeches every day
Entered this thread as a tourist. I use AI for imagegen and text at the moment; how easy is it to duplicate someone's voice? Is there a good site where people share them, like civit?
Not too difficult to train at all with RVC, but the issue is the source audio it converts from. Most autists don't want to record themselves to convert to someone else's voice (They live with their parents.)
So it's not text to speech?
RVC is speech to speech.
RTVC exists and is not hard to train on modest hardware. You can make an excellent model with dialogue ripped from a game, for example. Do you require more hand holding than that?
Because no corpo decided to pour billions into scaling up some random well-known model. It has far fewer practical uses and is much more likely to cause some scandals, so it's harder to bait investors into it.
>progress
Hype doesn't always correlate with progress
imo it's pretty much come far enough that I don't think there's much you can do aside from making the required resources far smaller than they are now. TTS doesn't always generate how I want to hear it, but RVC and real-time voice changing give as much as you put into them. The only issue I've come across with some models is that they only do so much. It isn't enough to have the sound; you need to emphasize tones, pauses and accents to make it sound authentic to the model.
That and new voices take resources to form models out of.
you homosexuals need to follow chinks more closely
https://github.com/RVC-Boss/GPT-SoVITS
have fun
damn
Oh sweet. Thanks Anon.
https://vocaroo.com/12pT31vC7bcN
Cute
Can you post some samples?
Nah I'm not interested in acting as a salesman for it any more than I already have (it's not my project). If someone's a local voice model enthusiast they should be trying it for themselves.
Come on. I don't even know who is Bella
Bah, ok
https://files.catbox.moe/rar1y0.wav
Sounds nice, does it work with other languages?
Hypothetically, if someone was moronic and didn't speak Chinese, is there a tard's guide?
What are good (free) ones we can use right now?
Bark
>Forgetting BOT, BOT and yes, BOT
You didn't fix shit
>You didn't fix shit
no, you're just another shitposting tourist that shouldn't be here
>I'M the tourist
>Implying
yet neither of you chuckleheads remembered /vt/
>Newboard for redditors I don't care about
I didn't forget anything
you can't post audio files on this board.
>He doesn't know
newbie moron.
People find AI voices more disturbing than Microsoft Sam according to my sister.
I totally get it though, the uncanny valley shit and the "i'm just too lazy to use my voice and don't care about how real it sounds" aspect. AI voices are too realistic to most and it creeps people out.
I, though, am not like those people. I find vtubers that use AI voices interesting.
We care but no one with the top tier tech is letting plebs have access anymore, so it’s strictly for rich people and the CIA.
XTTS2 is a genuinely huge leap, I don't know why so few people know about it. It's not very far behind 11L and it's fast as frick, better than realtime on my 3090 and with low latency.
The requirement to toss a few seconds of example wav files into a folder for inference is not a significant inconvenience. I use a few samples of the 11L Bella voice and it imitates her basically perfectly. The only issue is the occasional little hallucinations but they're pretty minor. We're not there yet but it's a lie to say large progress hasn't happened recently.
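For anyone curious what that "folder of example wavs" workflow looks like: with Coqui's TTS package it's only a few lines. The helper below is mine, and the model name and API shape are assumed from the TTS library's docs, so treat this as a sketch rather than gospel:

```python
from pathlib import Path

def pick_reference_wavs(folder: str, limit: int = 3) -> list[str]:
    """Collect up to `limit` reference .wav clips from a voice folder."""
    wavs = sorted(str(p) for p in Path(folder).glob("*.wav"))
    if not wavs:
        raise FileNotFoundError(f"no .wav samples in {folder}")
    return wavs[:limit]

def synthesize(voice_dir: str, text: str, out_path: str = "out.wav") -> None:
    """Clone a voice from a folder of samples (needs the `TTS` package)."""
    from TTS.api import TTS  # deferred import: heavy optional dependency
    # Model name assumed from Coqui's model zoo; "cuda" for realtime speed.
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")
    tts.tts_to_file(
        text=text,
        speaker_wav=pick_reference_wavs(voice_dir)[0],
        language="en",
        file_path=out_path,
    )
```

A few seconds of clean speech per clip is enough; more clips mostly just makes the imitation more stable.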
>XTTS2
Is it better than this?
wasn't Amazon showing off an Alexa feature a couple of years ago where you could play a voice to the Alexa and it would learn the voice, so that would then become the new voice for your echo?
That was giga based, but they never officially released it. Just imagine how much more use Amazon Echos would get if it had your favorite celebrity's voice or something.
>Just imagine how much more use Amazon Echos would get if it had your favorite celebrity's voice or something
And they would have to pay licensing fees for that. Out of their own pocket? No, that'll be a $59.99 a month for you.
But can it roleplay nonconsensual handholding?
Anon, Elevenlabs is a thing.
>Paying for voice generation or getting 3 free sentences a month
No thanks.
Don't worry, once text-to-video as good as Sora is everywhere, people will start caring about AI voices again.
>make audiobook with tortoise
>it's 97% of the way there immediately
>fixing the last 3% would take days of work
What can Tortoise do that the others don't?
not just tortoise but mrq's whole system makes it the most convenient option i've seen yet for both training and reading boatloads of text.
Is it English only?
don't know i never tried anything else
pi.ai is genuinely insane in how the voice works. The voice AI doesn't just read the words, it actually takes into account the context of what the chatbot writes. If the answer is sarcastic, the voice AI will use a sarcastic tone
Is it paid? Does it have generation limits?
If the answer is yes to either of those it's shit.
Oh shit it actually said it
Never mind it's fricking pozzed.
It's fricking over.
It would defuse the bomb
This is obviously just ChatGPT.
It understands sneed since I last used it.
It's fricking trash. Never shill this israeli piece of shit ever again.
It won't even load anymore just because I said Black person.
I hate leftists.
You should have made it say homierdly
i've been out of the loop on AI, but do we have anything on ElevenLabs level? i remember having some fun back when it started out and was free, then playing with dubs some time later. now it's all paid-only.
The powers that be would never allow that
this is just a fun concept that uses style transfers pretty well
https://podcast.ai/
I care about ai voices. In fact they’re the only kind of ai product I care about because they’re basically just a better kind of tts and not goyslop machines.