Why does nobody care about AI voices like they do for text or image generation? We're almost a full year behind them and there's been no progress since then.
Fewer people jerk off to that.
Progress isn't completely dead, we have XTTS now for example, but yeah, it leaves a lot to be desired if you want expressiveness and don't simply want to voice clone. I want to help push this field along more but I'm very moronic
baka, you've clearly never thought about the possibility of your futa mommy dragon chatbot wife whispering into your ear while jerking you off
AI-powered brainwashing! AI-powered brainwashing! Come on nerds make it real already!
God hypno audio generation would be so fricking good
Also this.
Definitely seems we're closer than further now, then. Gotta hope some other people pick it up.
AI voice is adjacent to the music industry, which is very litigious, and that's what is stopping development.
YouTube is completely riddled with and overrun by AI-narrated content. In this day and age, if it's a faceless video channel then chances are the narration is AI. Especially if it's a science channel.
Especially if it is a true crime or another morbid scenario gleaned off wikipedia. Fixed it for you.
Why is AI on BOT anyways?
I don't want to be here either but /ai/ (or /gai/) is clearly not happening
>AI is technology
>Development of AI and AI tools involves programming and/or technical knowledge
>Using local AI requires a computer, typically a good one
Why wouldn't it be on BOT?
BOT is dicky board. Make mesugaki voice synth and it will become BOT related.
>/b/ and BOT are the dicky boards
fixed your typos
moron. BOT was named after technology, be a tourist elsewhere
I hope this is bait. The SOTA is a StyleTTS2 finetuned model. It's a b***h to finetune but you get an elevenlabs-tier TTS
Can a mere mortal run that locally?
Yeah it uses 4-5GB of VRAM
Cool I'd have 20gb to spare then. If it works with other languages I'm sold.
The provided samples are amazing, sometimes better than ground truth. But the demo at
https://huggingface.co/spaces/styletts2/styletts2
is totally hosed. Full blown speech impediment and dyslexia on anything I put in. What gives?
That's 0-shot from the default voices. You need a finetuned model to get something good.
Wait, what's the technical reason? I can get the intonation not being great out-of-distribution, but if it's this phonetically unreliable, it's worse than 80s TTS tech.
"It's a b***h to finetune" — when people talk about LoRAs on image gen this usually means the settings to do it are esoteric and people rarely get a good result. Is that what you mean? Because if that's the case it isn't very usable.
No, the settings are straightforward; what's hard is setting up the whole thing and gathering a large cleaned audio dataset. Also you need to rent a GPU as it needs ~75GB of VRAM.
Oh I see, well maybe with 5000 series
>Also you need to rent the GPU as it needs ~75GB of VRAM
What??
Finetuning that shit isn't cheap
i shat bricks when i saw it too anon
>Impossible on a 24GB card
It's fricking over.
Impossible to train yes but you can run it
And there's no other way to get a finetuned model but training it yourself?
You can find some on huggingface but there are only a few of them
And what is the reason it can't be trained on 24GB of VRAM? Shouldn't it just take much, much longer? I don't mind letting it run for a few days, there's no rush to have it in 3 hours like that thing says.
As you can see here
you can train it if you set it with max_len: 100 with Style Diffusion/SLM disabled or with Style Diffusion only. The issue is that the output quality will be somewhat passable instead of 'very good'.
You can train on a 24GB card. Just set max_len to 280 and batch size to 2. Enter virtual console mode by pressing Ctrl+Alt+F1. Type in nvtop and close as many programs as possible in order to reduce the VRAM usage.
Start training.
accelerate launch --mixed_precision=fp16 --num_processes=1 train_finetune_accelerate.py --config_path ./Configs/config_ft.yml
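For reference, the 24GB settings above would land in config_ft.yml roughly like this. Key names are assumed from the StyleTTS2 repo's finetuning config and may differ between versions, so check against your checkout:

```yaml
# Partial config_ft.yml sketch for a 24GB card, per this thread.
# Only the fields discussed here are shown; keep the rest of the
# repo's default finetuning config as-is.
batch_size: 2
max_len: 280     # max audio length per training sample (frames)
```

Lower max_len (e.g. 100, with Style Diffusion/SLM disabled) reduces VRAM further at the cost of output quality, as noted above.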
>huggingface
What an innocent and adorable name that couldn't possibly mean anything nefarious.
does BOT have a guide to finetuning styletts2? how much data would you need for a decent finetune?
You can look at this:
https://github.com/yl4579/StyleTTS2/discussions/81
Mods shut down discussion about AI voice on /vt/ because they didn't want their favorite girls to be impersonated.
> trannies worried people will pretend to be something they're not.
Oh the irony
>Hey Don, have you heard, of this new technology
https://soundcloud.com/user-535691776/dialog
https://arxiv.org/pdf/2403.03100.pdf
The code isn't there anon.
there's a few decent models for AI voices but they aren't getting the software support they need.
WhisperSpeech has been out for ages and it's basically just an inversion of OpenAI's Whisper model (speech to text). It should be trivial to modify whisper.cpp to work with the model but the devs aren't interested in working on it, and every time anyone brings it up they get ignored.
which is a running theme for text-to-speech models. Piper is another model for voice cloning (though it isn't as good at one-shotting a voice clone), it at least has a CPP library... which doesn't compile easily and hasn't been worked on for over 3 months.
It's peak uncanny valley. Even hand-pitched vocaloid stuff sounds far more natural than AI ever will, and the algorithms don't seem to get any better.
Same as music.
>Even hand-pitched vocaloid stuff sounds far more natural than AI ever will
Nah that's cope.
I wish they did. AI voices are important for LARPs or video game modding, ESPECIALLY when it comes to e-girl because there are not nearly enough e-girl voice actors around for "moral" reasons, or they are willing to do it but just don't want it on their VA resume....
There are enough JA voices for that
But if an English-speaking, non-Japanese-speaking creator wants to hire a JA VA, how difficult would it be because of the language barrier? Pretty difficult I'd imagine, since I've seen none anywhere. Where would you even find cheap ones when you just need a VA for a game mod?
You have a point. Still, you can get a good enough EN e-girl voice with current TTS tech; I can think of Neuro-sama's voice for example.
What we need is S tier voices like from Shondo or even Gura. Shondo in particular has the perfect e-girl voice, so if we can get AI to replicate that perfectly then we are good. And no one would complain, there are no laws against the AI batman.
There are many copies of Gura's voice. Youtube is filled with Gura AI songs
sauce to the AI voices?
https://huggingface.co/sail-rvc/Gawr_Gura__Hoe-girlve_EN__RVC_v1
i just want real time ai audio porn
Supposedly Bark can generate audio in real time, but that hasn't been my experience with it. It's still the superior choice when it comes to expressiveness, but it's slower than they claim in their GitHub.
Maybe it's not actually using my GPU, because I don't see any utilization in the task manager, but every GPU option is set to ON so I don't know what is going on.
hm... not bad... although these being able to moan, suck, slap, etc is absolutely necessary for this particular usecase IMO. I'm assuming bark can't do that...
Not yet, but it can laugh, sigh and clear its throat which is miles ahead than all other (local) voice generators. It's way more human.
The only issue is those lazy fricking Black folk haven't updated the thing in 10 months.
Bark is theoretically closer to The Dream because it will do text-to-audio tasks alongside your speech (e.g. laughing, crying; non-verbal and seggs sounds if you finetune it) but in reality the model is so fricking bad I've come to assume training was intentionally botched for safety or profit.
There's CONSTANTLY either other shit in the background or just plain awful output.
>There's CONSTANTLY either other shit in the background or just plain awful output.
My moronic solution is to get the entire audio, run it through Ultimate Vocal Remover AI to separate the background noise from the voice and then run it again through RVC to convert it to a different voice. It's fricking bullshit, but it works and it ends up sounding real.
But it involves a lot of trimming and pasting audio together to get the best result. Why the FRICK does Bark have a 14 second limit?
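One way around the 14-second cap is to split the script into chunks short enough for a single generation, run each through Bark separately, and concatenate the audio before the UVR/RVC passes. A minimal sketch of the splitting step; the ~2.5 words-per-second speaking rate is my assumption, tune it to the voice:

```python
import re

WORDS_PER_SECOND = 2.5  # assumed average speaking rate
MAX_SECONDS = 13        # stay safely under Bark's ~14-second cap

def chunk_script(text: str) -> list[str]:
    """Split text into sentence groups short enough for one Bark call."""
    max_words = int(WORDS_PER_SECOND * MAX_SECONDS)
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        words = len(sentence.split())
        if current and count + words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk then goes through Bark's generate_audio and the resulting arrays get concatenated with numpy; it cuts way down on the manual trimming and pasting, though you still want to check chunk boundaries by ear.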
Nvidia broadcast would be more efficient to filter all the noise I think.
AI voices are already completely believably real, and have been for a long time, when trained on a single person's voice. The surprising thing is that coomers haven't really done much with the technology. I think what's missing for you is a voice generator where you can put in tags like "sexy black hentai succubus whispering" and get a good output. Maybe throw a bitcoin or two at it and someone will make that software. In the meantime, people will be making fake political speeches every day
Entered this thread as a tourist. I use AI for imagegen and text at the moment; how easy is it to duplicate someone's voice? Is there a good site where people share them, like civit?
Not too difficult to train at all with RVC, but the issue is the source audio it converts from. Most autists don't want to record themselves to convert to someone else's voice (They live with their parents.)
So it's not text to speech?
RVC is speech to speech.
RTVC exists and is not hard to train on modest hardware. You can make an excellent model with dialogue ripped from a game, for example. Do you require more hand holding than that?
Because no corpo decided to pour billions into scaling up some random well-known model. It has far fewer practical uses and is much more likely to cause some scandals, so it's harder to bait investors into it.
>progress
Hype doesn't always correlate with progress
imo it's pretty much come far enough that I don't think there's much you can do aside from making the required resources far smaller than they are now. TTS doesn't always generate how I want to hear it, but RVC and real-time voice changing give as much as you put into them. The only issue I've come across with some models is that they only do so much. It isn't enough to have the sound; you need to emphasize tones, pauses and accents to make it sound authentic to the model.
That and new voices take resources to form models out of.
you homosexuals need to follow chinks more closely
https://github.com/RVC-Boss/GPT-SoVITS
have fun
damn
Oh sweet. Thanks Anon.
https://vocaroo.com/12pT31vC7bcN
Cute
Can you post some samples?
Nah I'm not interested in acting as a salesman for it any more than I already have (it's not my project). If someone's a local voice model enthusiast they should be trying it for themselves.
Come on. I don't even know who is Bella
Bah, ok
https://files.catbox.moe/rar1y0.wav
Sounds nice, does it work with other languages?
Hypothetically, if someone was moronic and didn't speak Chinese, is there a tard's guide?
What are good (free) ones we can use right now?
Bark
>Forgetting BOT, BOT and yes, BOT
You didn't fix shit
>You didn't fix shit
no, you're just another shitposting tourist that shouldn't be here
>I'M the tourist
>Implying
yet neither of you chuckleheads remembered /vt/
>Newboard for redditors I don't care about
I didn't forget anything
you can't post audio files on this board.
>He doesn't know
newbie moron.
People find AI voices more disturbing than Microsoft Sam according to my sister.
I totally get it though, the uncanny valley shit and the "i'm just too lazy to use my voice and don't care about how real it sounds" aspect. AI voices are too realistic to most and it creeps people out.
I, though, am not like those people. I find vtubers that use AI voices interesting.
We care but no one with the top tier tech is letting plebs have access anymore, so it’s strictly for rich people and the CIA.
XTTS2 is a genuinely huge leap, I don't know why so few people know about it. It's not very far behind 11L and it's fast as frick, better than realtime on my 3090 and with low latency.
The requirement to toss a few seconds of example wav files into a folder for inference is not a significant inconvenience. I use a few samples of the 11L Bella voice and it imitates her basically perfectly. The only issue is the occasional little hallucinations but they're pretty minor. We're not there yet but it's a lie to say large progress hasn't happened recently.
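For anyone curious what that "folder of example wavs" workflow looks like: with Coqui's TTS package it's only a few lines. The helper below is mine, and the model name and API shape are assumed from the TTS library's docs, so treat this as a sketch rather than gospel:

```python
from pathlib import Path

def pick_reference_wavs(folder: str, limit: int = 3) -> list[str]:
    """Collect up to `limit` reference .wav clips from a voice folder."""
    wavs = sorted(str(p) for p in Path(folder).glob("*.wav"))
    if not wavs:
        raise FileNotFoundError(f"no .wav samples in {folder}")
    return wavs[:limit]

def synthesize(voice_dir: str, text: str, out_path: str = "out.wav") -> None:
    """Clone a voice from a folder of samples (needs the `TTS` package)."""
    from TTS.api import TTS  # deferred import: heavy optional dependency
    # Model name assumed from Coqui's model zoo; "cuda" for realtime speed.
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")
    tts.tts_to_file(
        text=text,
        speaker_wav=pick_reference_wavs(voice_dir)[0],
        language="en",
        file_path=out_path,
    )
```

A few seconds of clean speech per clip is enough; more clips mostly just makes the imitation more stable.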
>XTTS2
Is it better than this?
wasn't Amazon showing off an Alexa feature a couple of years ago where you could play a voice to the Alexa and it would learn the voice, so that would then become the new voice for your echo?
That was giga based, but they never officially released it. Just imagine how much more use Amazon Echos would get if it had your favorite celebrity's voice or something.
>Just imagine how much more use Amazon Echos would get if it had your favorite celebrity's voice or something
And they would have to pay licensing fees for that. Out of their own pocket? No, that'll be a $59.99 a month for you.
But can it roleplay nonconsensual handholding?
Anon, Elevenlabs is a thing.
>Paying for voice generation or getting 3 free sentences a month
No thanks.
Don't worry, once text-to-video as good as Sora is everywhere, people will start caring about AI voices again.
>make audiobook with tortoise
>it's 97% of the way there immediately
>fixing the last 3% would take days of work
What can Tortoise do that the others don't?
not just tortoise but mrq's whole system makes it the most convenient option i've seen yet for both training and reading boatloads of text.
Is it English only?
don't know i never tried anything else
pi.ai is genuinely insane in how the voice works. The voice AI doesn't just read the words, it actually takes into account the context of what the chatbot writes. If the answer is sarcastic, the voice AI will use a sarcastic tone
Is it paid? Does it have generation limits?
If the answer is yes to either of those it's shit.
Oh shit it actually said it
Never mind it's fricking pozzed.
It's fricking over.
It would defuse the bomb
This is obviously just ChatGPT.
It understands sneed since I last used it.
It's fricking trash. Never shill this israeli piece of shit ever again.
It won't even load anymore just because I said Black person.
I hate leftists.
You should have made it say homierdly
i've been out of the loop on AI, but do we have anything on ElevenLabs level? i remember having some fun back when it started out and was free, then playing with dubs some time later. now it's all paid-only.
The powers that be would never allow that
this is just a fun concept that uses style transfers pretty well
https://podcast.ai/
I care about ai voices. In fact they’re the only kind of ai product I care about because they’re basically just a better kind of tts and not goyslop machines.