Why does nobody care about AI voices like they do for text or image generation?

Why does nobody care about AI voices like they do for text or image generation? We're almost a full year behind them and there's been no progress since then.

  1. 2 months ago
    Anonymous

    Fewer people jerk off to that.

    • 2 months ago
      Anonymous

      Progress isn't completely dead, we have XTTS now for example, but yeah, it leaves a lot to be desired if you want expressiveness and don't simply want to voice clone. I want to help push this field along more but I'm very moronic

      baka, you've clearly never thought about the possibility of your futa mommy dragon chatbot wife whispering into your ear while jerking you off

      • 2 months ago
        Anonymous

        AI-powered brainwashing! AI-powered brainwashing! Come on nerds make it real already!

        • 2 months ago
          Anonymous

          God hypno audio generation would be so fricking good

        • 2 months ago
          Anonymous

          >God hypno audio generation would be so fricking good
          Also this.

          >Not yet, but it can laugh, sigh and clear its throat, which is miles ahead of all other (local) voice generators. It's way more human.
          >The only issue is those lazy fricking Black folk haven't updated the thing in 10 months.
          Definitely seems we're closer than farther now, then. Gotta hope some other people pick it up.

  2. 2 months ago
    Anonymous

    AI voice generation is adjacent to the music industry, which is very litigious, and that is what's stopping development.

  3. 2 months ago
    Anonymous

    YouTube is completely overrun with AI-narrated content. In this day and age, if it's a faceless video channel then chances are the narration is AI. Especially if it's a science channel.

    • 2 months ago
      Anonymous

      Especially if it is a true crime or another morbid scenario gleaned off wikipedia. Fixed it for you.

  4. 2 months ago
    Anonymous

    Why is AI on BOT anyways?

    • 2 months ago
      Anonymous

      I don't want to be here either but /ai/ (or /gai/) is clearly not happening

    • 2 months ago
      Anonymous

      >AI is technology
      >Development of AI and AI tools involves programming and/or technical knowledge
      >Using local AI requires a computer, typically a good one
      Why wouldn't it be on BOT?

      • 2 months ago
        Anonymous

        BOT is a dicky board. Make a mesugaki voice synth and it will become BOT related.

        • 2 months ago
          Anonymous

          >/b/ and BOT are the dicky boards
          fixed your typos

          • 2 months ago
            Anonymous

            moron. BOT was named Technology, be a tourist elsewhere

  5. 2 months ago
    Anonymous

    I hope this is bait. The SOTA is a StyleTTS2 finetuned model. It's a b***h to finetune but you get an elevenlabs-tier TTS

    • 2 months ago
      Anonymous

      Can a mere mortal run that locally?

      • 2 months ago
        Anonymous

        Yeah it uses 4-5GB of VRAM

        • 2 months ago
          Anonymous

          Cool, I'd have 20GB to spare then. If it works with other languages I'm sold.

    • 2 months ago
      Anonymous

      The provided samples are amazing, sometimes better than ground truth. But the demo at
      https://huggingface.co/spaces/styletts2/styletts2
      is totally hosed. Full blown speech impediment and dyslexia on anything I put in. What gives?

      • 2 months ago
        Anonymous

        That's 0-shot from the default voices. You need a finetuned model to get something good.

        • 2 months ago
          Anonymous

          Wait, what's the technical reason? I can get the intonation not being great out-of-distribution, but if it's this phonetically unreliable, it's worse than 80s TTS tech.

    • 2 months ago
      Anonymous

      "It's a b***h to finetune", when people talk about LORAs on image gen this usually means th settings to do it are esoteric and people rarely get a good result. Is that what you mean? Because if that's the case that isn't very usable

      • 2 months ago
        Anonymous

        No, the settings are straightforward; what's hard is setting up the whole thing and gathering a large, cleaned audio dataset. Also you need to rent a GPU, as it needs ~75GB of VRAM.

        • 2 months ago
          Anonymous

          Oh I see, well maybe with the 5000 series.

        • 2 months ago
          Anonymous

          >Also you need to rent the GPU as it needs ~75GB of VRAM
          What??

          • 2 months ago
            Anonymous

            Finetuning that shit isn't cheap

          • 2 months ago
            Anonymous

            i shat bricks when i saw it too anon

            • 2 months ago
              Anonymous

              >Impossible on a 24GB card
              It's fricking over.

              • 2 months ago
                Anonymous

                Impossible to train, yes, but you can run it.

              • 2 months ago
                Anonymous

                And there's no other way to get a finetuned model than training it yourself?

              • 2 months ago
                Anonymous

                You can find some on huggingface but there are only a few of them

              • 2 months ago
                Anonymous

                And what is the reason it can't be trained on 24GB of VRAM? Shouldn't it just take much, much longer? I don't mind letting it run for a few days, there's no rush to have it in 3 hours like that thing says.

              • 2 months ago
                Anonymous

                As you can see here

                https://i.imgur.com/KBlwR3H.png

                >i shat bricks when i saw it too anon

                you can train it if you set max_len: 100 with Style Diffusion/SLM disabled, or with Style Diffusion only. The issue is that the output quality will be somewhat passable instead of 'very good'.

              • 2 months ago
                Anonymous

                >And what is the reason it can't be trained on 24GB of VRAM? Shouldn't it just take much, much longer? I don't mind letting it run for a few days, there's no rush to have it in 3 hours like that thing says.

                You can train on a 24GB card. Just set max_len to 280 and batch size to 2. Switch to a virtual console by pressing Ctrl + Alt + F1, run nvtop, and close as many programs as possible to reduce VRAM usage.
                Start training:
                accelerate launch --mixed_precision=fp16 --num_processes=1 train_finetune_accelerate.py --config_path ./Configs/config_ft.yml
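                If it helps, a minimal sketch of that config tweak in Python (a hypothetical helper, assuming the repo's stock ./Configs/config_ft.yml with top-level max_len and batch_size keys, and that PyYAML is installed):

                # Hypothetical helper: patch StyleTTS2's finetune config for a 24GB card.
                import yaml  # PyYAML

                cfg_path = "./Configs/config_ft.yml"
                with open(cfg_path) as f:
                    cfg = yaml.safe_load(f)

                cfg["max_len"] = 280   # shorter training segments -> lower VRAM use
                cfg["batch_size"] = 2  # tiny batch; expect much longer training runs

                with open(cfg_path, "w") as f:
                    yaml.safe_dump(cfg, f, sort_keys=False)
                print("patched", cfg_path)

                Then launch with the accelerate command above and keep an eye on nvtop to make sure you stay under 24GB.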

              • 2 months ago
                Anonymous

                >huggingface
                What an innocent and adorable name that couldn't possibly mean anything nefarious.

    • 2 months ago
      Anonymous

      does BOT have a guide to finetuning styletts2? how much data would you need for a decent finetune?

      • 2 months ago
        Anonymous

        You can look at this:
        https://github.com/yl4579/StyleTTS2/discussions/81

  6. 2 months ago
    Anonymous

    Mods shut down discussion about AI voice on /vt/ because they didn't want their favorite girls to be impersonated.

    • 2 months ago
      Anonymous

      > trannies worried people will pretend to be something they're not.
      Oh the irony

  7. 2 months ago
    Anonymous

    >Hey Don, have you heard, of this new technology
    https://soundcloud.com/user-535691776/dialog

  8. 2 months ago
    Anonymous

    [...]

    https://arxiv.org/pdf/2403.03100.pdf

    • 2 months ago
      Anonymous

      The code isn't there anon.

  9. 2 months ago
    Anonymous

    there are a few decent models for AI voices but they aren't getting the software support they need.

    WhisperSpeech has been out for ages and it's basically just an inversion of OpenAI's Whisper model (speech to text). It should be trivial to modify whisper.cpp to work with the model, but the devs aren't interested in working on it and every time anyone brings it up they get ignored.

    Which is a running theme for text-to-speech models. Piper is another model for voice cloning (though it isn't as good at one-shotting a voice clone); it at least has a C++ library... which doesn't compile easily and hasn't been worked on for over 3 months.
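    if you just want to poke at WhisperSpeech from Python without waiting on whisper.cpp support, a rough sketch (assuming the whisperspeech pip package and its Pipeline.generate_to_file helper, per its README; weights download on first run and a CUDA GPU helps a lot):

    # Rough sketch: WhisperSpeech TTS via its Python package (pip install whisperspeech).
    from whisperspeech.pipeline import Pipeline

    pipe = Pipeline()  # default models; downloads weights on first run
    pipe.generate_to_file("output.wav", "Progress isn't completely dead, anon.")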

  10. 2 months ago
    Anonymous

    It's peak uncanny valley. Even hand-pitched vocaloid stuff sounds far more natural than AI ever will, and the algorithms don't seem to get any better.
    Same as music.

    • 2 months ago
      Anonymous

      >Even hand-pitched vocaloid stuff sounds far more natural than AI ever will
      Nah that's cope.

  11. 2 months ago
    Anonymous

    I wish they did. AI voices are important for LARPs or video game modding, ESPECIALLY when it comes to e-girl stuff, because there are not nearly enough e-girl voice actors around, whether for "moral" reasons or because they are willing to do it but just don't want it on their VA resume....

    • 2 months ago
      Anonymous

      There are enough JA voices for that

      • 2 months ago
        Anonymous

        But if an English-speaking, non-Japanese-speaking creator wants to hire a JA VA, how difficult would it be because of the language barrier? Pretty difficult I'd imagine, since I've seen none anywhere. Where would you even find cheap ones when you just need a VA for a game mod?

        • 2 months ago
          Anonymous

          You have a point. Still, you can get a good enough EN e-girl voice with the current TTS tech; I can think of Neuro-sama's voice, for example.

          • 2 months ago
            Anonymous

            What we need is S-tier voices like Shondo's or even Gura's. Shondo in particular has the perfect e-girl voice, so if we can get AI to replicate that perfectly then we are good. And no one would complain, there are no laws against the AI batman.

            • 2 months ago
              Anonymous

              There are many copies of Gura's voice. YouTube is filled with Gura AI songs.

              • 2 months ago
                Anonymous

                sauce to the AI voices?

              • 2 months ago
                Anonymous

                https://huggingface.co/sail-rvc/Gawr_Gura__Hoe-girlve_EN__RVC_v1

  12. 2 months ago
    Anonymous

    i just want real time ai audio porn

    • 2 months ago
      Anonymous

      Supposedly Bark can generate audio in real time, but that hasn't been my experience with it. It's still the superior choice when it comes to expressiveness, but it's also slower than they claim in their GitHub.
      Maybe it's not actually using my GPU, because I don't see any utilization in the task manager, but every GPU option is set to ON so I don't know what is going on.

      • 2 months ago
        Anonymous

        hm... not bad... although being able to moan, suck, slap, etc. is absolutely necessary for this particular use case IMO. I'm assuming Bark can't do that...

        • 2 months ago
          Anonymous

          Not yet, but it can laugh, sigh and clear its throat, which is miles ahead of all other (local) voice generators. It's way more human.
          The only issue is those lazy fricking Black folk haven't updated the thing in 10 months.

      • 2 months ago
        Anonymous

        >hm... not bad... although being able to moan, suck, slap, etc. is absolutely necessary for this particular use case IMO. I'm assuming Bark can't do that...

        Bark is theoretically closer to The Dream because it will do text-to-audio tasks alongside your speech (e.g. laughing, crying; non-verbal and seggs sounds if you finetune it), but in reality the model is so fricking bad I've come to assume training was intentionally botched for safety or profit.
        There's CONSTANTLY either other shit in the background or just plain awful output.
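        For reference, basic Bark usage with the non-speech cues looks roughly like this (a sketch following the suno-ai/bark README; the bracketed tags are loose hints the model may or may not honor):

        # Sketch of Bark generation with non-speech cues (per the suno-ai/bark README).
        from bark import SAMPLE_RATE, generate_audio, preload_models
        from scipy.io.wavfile import write as write_wav

        preload_models()  # downloads/loads the text, coarse and fine models

        text = "Hello anon... [sighs] it's been a long day. [laughs]"
        audio = generate_audio(text)  # numpy array at SAMPLE_RATE (24 kHz), ~13-14s max per generation

        write_wav("bark_out.wav", SAMPLE_RATE, audio)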

        • 2 months ago
          Anonymous

          >There's CONSTANTLY either other shit in the background or just plain awful output.
          My moronic solution is to get the entire audio, run it through Ultimate Vocal Remover AI to separate the background noise from the voice and then run it again through RVC to convert it to a different voice. It's fricking bullshit, but it works and it ends up sounding real.

          But it involves a lot of trimming and pasting audio together to get the best result. Why the FRICK does Bark have a 14 second limit?

          • 2 months ago
            Anonymous

            Nvidia Broadcast would be more efficient to filter all the noise, I think.

  13. 2 months ago
    Anonymous

    AI voices are already completely believably real and have been for a long time when trained on a single person's voice. The surprising thing is that coomers haven't really done much with the technology. I think what's missing for you is a voice generator where you can put in tags like "sexy black hentai succubus whispering" and get a good output. Maybe throw a bitcoin or two at it and someone will make that software. In the meantime, people will be making fake political speeches every day.

    • 2 months ago
      Anonymous

      Entered this thread as a tourist. I use AI for imagegen and text at the moment; how easy is it to duplicate someone's voice? Is there a good site where people share them, like civit?

      • 2 months ago
        Anonymous

        Not too difficult to train at all with RVC, but the issue is the source audio it converts from. Most autists don't want to record themselves to convert to someone else's voice (they live with their parents).

        • 2 months ago
          Anonymous

          So it's not text to speech?

          • 2 months ago
            Anonymous

            RVC is speech to speech.

  14. 2 months ago
    Anonymous

    RTVC exists and is not hard to train on modest hardware. You can make an excellent model with dialogue ripped from a game, for example. Do you require more hand holding than that?

  15. 2 months ago
    Anonymous

    Because no corpo decided to pour billions into scaling up some random well-known model. It has far fewer practical uses and is much more likely to cause scandals, so it's harder to bait investors into it.

    >progress
    Hype doesn't always correlate with progress

  16. 2 months ago
    Anonymous

    imo it's already far enough along that I don't think there's much you can do aside from making the resources required far less than they are now. TTS doesn't always generate how I want to hear it, but RVC and real-time voice changing give as much as you put into them. The only issue I've come across with some models is that they only do so much. It isn't enough to have the sound; you need to emphasize tones, pauses and accents to make it sound authentic to the model.

    That and new voices take resources to form models out of.

  17. 2 months ago
    Anonymous

    you homosexuals need to follow chinks more closely
    https://github.com/RVC-Boss/GPT-SoVITS
    have fun

    • 2 months ago
      Anonymous

      damn

    • 2 months ago
      Anonymous

      Oh sweet. Thanks Anon.
      https://vocaroo.com/12pT31vC7bcN

      • 2 months ago
        Anonymous

        Cute

        >XTTS2 is a genuinely huge leap, I don't know why so few people know about it. It's not very far behind 11L and it's fast as frick, better than realtime on my 3090 and with low latency.

        >The requirement to toss a few seconds of example wav files into a folder for inference is not a significant inconvenience. I use a few samples of the 11L Bella voice and it imitates her basically perfectly. The only issue is the occasional little hallucinations but they're pretty minor. We're not there yet but it's a lie to say large progress hasn't happened recently.

        Can you post some samples?

        • 2 months ago
          Anonymous

          Nah I'm not interested in acting as a salesman for it any more than I already have (it's not my project). If someone's a local voice model enthusiast they should be trying it for themselves.

          • 2 months ago
            Anonymous

            Come on. I don't even know who Bella is.

            • 2 months ago
              Anonymous

              Bah, ok

              https://files.catbox.moe/rar1y0.wav

              • 2 months ago
                Anonymous

                Sounds nice, does it work with other languages?

    • 2 months ago
      Anonymous

      Hypothetically, if someone was moronic and didn't speak Chinese, is there a tard's guide?

  18. 2 months ago
    Anonymous

    What are good (free) ones we can use right now?

    • 2 months ago
      Anonymous

      Bark

      >/b/ and BOT are the dicky boards
      fixed your typos

      >Forgetting BOT, BOT and yes, BOT
      You didn't fix shit

      • 2 months ago
        Anonymous

        >You didn't fix shit
        no, you're just another shitposting tourist that shouldn't be here

        • 2 months ago
          Anonymous

          >I'M the tourist
          >Implying

      • 2 months ago
        Anonymous

        yet neither of you chuckleheads remembered /vt/

        • 2 months ago
          Anonymous

          >Newboard for redditors I don't care about
          I didn't forget anything

  19. 2 months ago
    Anonymous

    you can't post audio files on this board.

    • 2 months ago
      Anonymous

      >He doesn't know

    • 2 months ago
      Anonymous

      newbie moron.

  20. 2 months ago
    Anonymous

    People find AI voices more disturbing than Microsoft Sam according to my sister.
    I totally get it though, the uncanny valley shit and the "i'm just too lazy to use my voice and don't care about how real it sounds" aspect. AI voices are too realistic to most and it creeps people out.

    I, though, am not like those people. I find vtubers that use AI voices interesting.

  21. 2 months ago
    Anonymous

    We care but no one with the top tier tech is letting plebs have access anymore, so it’s strictly for rich people and the CIA.

  22. 2 months ago
    Anonymous

    XTTS2 is a genuinely huge leap, I don't know why so few people know about it. It's not very far behind 11L and it's fast as frick, better than realtime on my 3090 and with low latency.

    The requirement to toss a few seconds of example wav files into a folder for inference is not a significant inconvenience. I use a few samples of the 11L Bella voice and it imitates her basically perfectly. The only issue is the occasional little hallucinations but they're pretty minor. We're not there yet but it's a lie to say large progress hasn't happened recently.
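    If anyone wants to try it, a minimal sketch of that workflow with Coqui's TTS package (pip install TTS; the xtts_v2 model id is the published one, but "bella_sample.wav" is just a placeholder for whatever short reference clip you drop in):

    # Minimal XTTS v2 voice-cloning sketch using Coqui TTS.
    import torch
    from TTS.api import TTS

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

    tts.tts_to_file(
        text="We're not there yet, but it's a lie to say progress hasn't happened.",
        speaker_wav="bella_sample.wav",  # a few seconds of clean reference audio
        language="en",
        file_path="xtts_out.wav",
    )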

    • 2 months ago
      Anonymous

      >XTTS2
      Is it better than this?

      https://i.imgur.com/F6uodxn.png

      >you homosexuals need to follow chinks more closely
      >https://github.com/RVC-Boss/GPT-SoVITS
      >have fun

  23. 2 months ago
    Anonymous

    wasn't Amazon showing off an Alexa feature a couple of years ago where you could play a voice to the Alexa and it would learn the voice, so that would then become the new voice for your Echo?

    That was giga based, but they never officially released it. Just imagine how much more use Amazon Echos would get if they had your favorite celebrity's voice or something.

    • 2 months ago
      Anonymous

      >Just imagine how much more use Amazon Echos would get if they had your favorite celebrity's voice or something
      And they would have to pay licensing fees for that. Out of their own pocket? No, that'll be $59.99 a month for you.

  24. 2 months ago
    Anonymous

    But can it roleplay nonconsensual handholding?

  25. 2 months ago
    Anonymous

    Anon, Elevenlabs is a thing.

    • 2 months ago
      Anonymous

      >Paying for voice generation or getting 3 free sentences a month
      No thanks.

  26. 2 months ago
    Anonymous

    Don't worry, once text-to-video as good as Sora is everywhere, people will start caring about AI voices again.

  27. 2 months ago
    Anonymous

    >make audiobook with tortoise
    >it's 97% of the way there immediately
    >fixing the last 3% would take days of work

    • 2 months ago
      Anonymous

      What can Tortoise do that the others don't?

      • 2 months ago
        Anonymous

        not just tortoise but mrq's whole system makes it the most convenient option i've seen yet for both training and reading boatloads of text.
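        if you want a baseline before touching mrq's fork, plain tortoise-tts inference is roughly this (a sketch following the upstream neonbjb/tortoise-tts README; the clip paths are placeholders):

        # Rough sketch of plain tortoise-tts inference (per the neonbjb/tortoise-tts README).
        import torchaudio
        from tortoise.api import TextToSpeech
        from tortoise.utils.audio import load_audio

        clips = ["voice/clip1.wav", "voice/clip2.wav"]  # placeholder reference clips
        reference = [load_audio(p, 22050) for p in clips]

        tts = TextToSpeech()
        pcm = tts.tts_with_preset("Chapter one. It was a dark and stormy night.",
                                  voice_samples=reference, preset="fast")
        torchaudio.save("tortoise_out.wav", pcm.squeeze(0).cpu(), 24000)  # 24 kHz output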

        • 2 months ago
          Anonymous

          Is it English only?

          • 2 months ago
            Anonymous

            don't know i never tried anything else

  28. 2 months ago
    Anonymous

    pi.ai is genuinely insane in how the voice works. The voice AI doesn't just read the words, it actually takes into account the context of what the chatbot writes. If the answer is sarcastic, the voice AI will use a sarcastic tone.

    • 2 months ago
      Anonymous

      Is it paid? Does it have generation limits?
      If the answer is yes to either of those it's shit.

    • 2 months ago
      Anonymous

      Oh shit it actually said it

    • 2 months ago
      Anonymous

      https://i.imgur.com/v2rUbyF.png

      >Oh shit it actually said it

      Never mind it's fricking pozzed.

      • 2 months ago
        Anonymous

        It's fricking over.

      • 2 months ago
        Anonymous

        It would defuse the bomb

        • 2 months ago
          Anonymous

          This is obviously just ChatGPT.

          • 2 months ago
            Anonymous

            It understands sneed since I last used it.

            • 2 months ago
              Anonymous

              https://i.imgur.com/SYrRWNc.png

              >It would defuse the bomb

              It's fricking trash. Never shill this israeli piece of shit ever again.

              • 2 months ago
                Anonymous

                It won't even load anymore just because I said Black person.
                I hate leftists.

              • 2 months ago
                Anonymous

                You should have made it say homierdly

  29. 2 months ago
    Anonymous

    i've been out of the loop on AI, but do we have anything on ElevenLabs level? i remember having some fun back when it started out and was free, then playing with dubs some time later. now it's all paid-only.

    • 2 months ago
      Anonymous

      The powers that be would never allow that

  30. 2 months ago
    Anonymous

    this is just a fun concept that uses style transfers pretty well

    https://podcast.ai/

  31. 2 months ago
    Anonymous

    I care about ai voices. In fact they’re the only kind of ai product I care about because they’re basically just a better kind of tts and not goyslop machines.
