Why does nobody care about AI voices like they do for text or image generation?

Why does nobody care about AI voices like they do for text or image generation? We're almost a full year behind those fields and there's been no real progress in the meantime.

  1. 1 month ago
    Anonymous

    Fewer people jerk off to that.

    • 1 month ago
      Anonymous

      Progress isn't completely dead, we have XTTS now for example, but yeah, it leaves a lot to be desired if you want expressiveness and don't simply want to voice clone. I want to help push this field along more but I'm very moronic

      baka, you've clearly never thought about the possibility of your futa mommy dragon chatbot wife whispering into your ear while jerking you off

      • 1 month ago
        Anonymous

        AI-powered brainwashing! AI-powered brainwashing! Come on nerds make it real already!

        • 1 month ago
          Anonymous

          God hypno audio generation would be so fricking good

        • 1 month ago
          Anonymous

          >God hypno audio generation would be so fricking good
          Also this.

          >Not yet, but it can laugh, sigh and clear its throat
          Definitely seems we're closer than farther now. Gotta hope some other people pick it up.

  2. 1 month ago
    Anonymous

    AI voice generation is adjacent to the music industry, which is very litigious, and that's what is stopping development.

  3. 1 month ago
    Anonymous

    YouTube is completely overrun with AI-narrated content. These days, if it's a faceless video channel, chances are the narration is AI. Especially if it's a science channel.

    • 1 month ago
      Anonymous

      Especially if it's true crime or another morbid scenario gleaned off Wikipedia. Fixed it for you.

  4. 1 month ago
    Anonymous

    Why is AI on BOT anyways?

    • 1 month ago
      Anonymous

      I don't want to be here either but /ai/ (or /gai/) is clearly not happening

    • 1 month ago
      Anonymous

      >AI is technology
      >Development of AI and AI tools involves programming and/or technical knowledge
      >Using local AI requires a computer, typically a good one
      Why wouldn't it be on BOT?

      • 1 month ago
        Anonymous

        BOT is a dicky board. Make a mesugaki voice synth and it will become BOT related.

        • 1 month ago
          Anonymous

          >/b/ and BOT are the dicky boards
          fixed your typos

          • 1 month ago
            Anonymous

            Moron. BOT was named technology, go be a tourist elsewhere.

  5. 1 month ago
    Anonymous

    I hope this is bait. The SOTA is a finetuned StyleTTS2 model. It's a b***h to finetune, but you get an ElevenLabs-tier TTS.

    • 1 month ago
      Anonymous

      Can a mere mortal run that locally?

      • 1 month ago
        Anonymous

        Yeah it uses 4-5GB of VRAM

        • 1 month ago
          Anonymous

          Cool, I'd have 20GB to spare then. If it works with other languages I'm sold.

    • 1 month ago
      Anonymous

      The provided samples are amazing, sometimes better than ground truth. But the demo at
      https://huggingface.co/spaces/styletts2/styletts2
      is totally hosed. Full blown speech impediment and dyslexia on anything I put in. What gives?

      • 1 month ago
        Anonymous

        That's 0-shot from the default voices. You need a finetuned model to get something good.

        • 1 month ago
          Anonymous

          Wait, what's the technical reason? I can get the intonation not being great out-of-distribution, but if it's this phonetically unreliable, it's worse than 80s TTS tech.

    • 1 month ago
      Anonymous

      "It's a b***h to finetune", when people talk about LORAs on image gen this usually means th settings to do it are esoteric and people rarely get a good result. Is that what you mean? Because if that's the case that isn't very usable

      • 1 month ago
        Anonymous

        No, the settings are straightforward; what's hard is setting up the whole thing and gathering a large, cleaned audio dataset. Also you need to rent a GPU, as it needs ~75GB of VRAM.
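
        For reference, a rough sketch of building the training list from a folder of paired clip.wav / clip.txt files. The pipe-separated wav|phonemes|speaker layout and the phonemizer step are assumptions based on the repo's example Data lists, so double-check against the finetuning discussion thread before trusting it:
        import os
        from phonemizer import phonemize  # pip install phonemizer (needs espeak-ng installed)

        data_dir = "my_dataset"  # hypothetical folder of clip.wav + clip.txt pairs
        lines = []
        for name in sorted(os.listdir(data_dir)):
            if not name.endswith(".wav"):
                continue
            with open(os.path.join(data_dir, name[:-4] + ".txt"), encoding="utf-8") as f:
                text = f.read().strip()
            # StyleTTS2's example lists store phonemized text rather than raw orthography
            phones = phonemize(text, language="en-us", backend="espeak", strip=True)
            lines.append(f"{name}|{phones}|0")  # single speaker -> id 0

        with open("Data/train_list.txt", "w", encoding="utf-8") as f:
            f.write("\n".join(lines) + "\n")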

        • 1 month ago
          Anonymous

          Oh I see. Well, maybe with the 5000 series.

        • 1 month ago
          Anonymous

          >Also you need to rent the GPU as it needs ~75GB of VRAM
          What??

          • 1 month ago
            Anonymous

            Finetuning that shit isn't cheap

          • 1 month ago
            Anonymous

            i shat bricks when i saw it too anon

            • 1 month ago
              Anonymous

              >Impossible on a 24GB card
              It's fricking over.

              • 1 month ago
                Anonymous

                Impossible to train, yes, but you can run it.

              • 1 month ago
                Anonymous

                And there's no other way to get a finetuned model but training it yourself?

              • 1 month ago
                Anonymous

                You can find some on huggingface but there are only a few of them

              • 1 month ago
                Anonymous

                And what's the reason it can't be trained on 24GB of VRAM? Shouldn't it just take much, much longer? I don't mind letting it run for a few days; there's no rush to have it done in 3 hours like that thing says.

              • 1 month ago
                Anonymous

                >i shat bricks when i saw it too anon
                As you can see here

                https://i.imgur.com/KBlwR3H.png

                you can train it if you set max_len: 100 with Style Diffusion/SLM disabled, or with Style Diffusion only. The issue is that the output quality will be somewhat passable instead of 'very good'.

              • 1 month ago
                Anonymous

                >And what's the reason it can't be trained on 24GB of VRAM?

                You can train on a 24GB card. Just set max_len to 280 and batch_size to 2. Enter a virtual console by pressing Ctrl + Alt + F1, run nvtop, and close as many programs as possible to reduce VRAM usage.
                Start training:
                accelerate launch --mixed_precision=fp16 --num_processes=1 train_finetune_accelerate.py --config_path ./Configs/config_ft.yml
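
                If you don't want to hand-edit the yaml every time, a quick sketch of the same tweak in Python. This assumes max_len and batch_size are top-level keys in Configs/config_ft.yml like in the reference config, and note that safe_dump drops the file's comments:
                import yaml  # pip install pyyaml

                cfg_path = "Configs/config_ft.yml"
                with open(cfg_path) as f:
                    cfg = yaml.safe_load(f)

                cfg["max_len"] = 280   # shorter training segments -> less VRAM
                cfg["batch_size"] = 2

                with open(cfg_path, "w") as f:
                    yaml.safe_dump(cfg, f)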

              • 1 month ago
                Anonymous

                >huggingface
                What an innocent and adorable name that couldn't possibly mean anything nefarious.

    • 1 month ago
      Anonymous

      does BOT have a guide to finetuning styletts2? how much data would you need for a decent finetune?

      • 1 month ago
        Anonymous

        You can look at this:
        https://github.com/yl4579/StyleTTS2/discussions/81

  6. 1 month ago
    Anonymous

    Mods shut down discussion about AI voice on /vt/ because they didn't want their favorite girls to be impersonated.

    • 1 month ago
      Anonymous

      > trannies worried people will pretend to be something they're not.
      Oh the irony

  7. 1 month ago
    Anonymous

    >Hey Don, have you heard, of this new technology
    https://soundcloud.com/user-535691776/dialog

  8. 1 month ago
    Anonymous

    [...]

    https://arxiv.org/pdf/2403.03100.pdf

    • 1 month ago
      Anonymous

      The code isn't there anon.

  9. 1 month ago
    Anonymous

    there are a few decent models for AI voices but they aren't getting the software support they need.

    WhisperSpeech has been out for ages and it's basically just an inversion of OpenAI's Whisper model (speech to text). It should be trivial to modify whisper.cpp to work with the model, but the devs aren't interested in working on it and every time anyone brings it up they get ignored.

    Which is a running theme for text-to-speech models. Piper is another model for voice cloning (though it isn't as good at one-shotting a voice clone); it at least has a CPP library... which doesn't compile easily and hasn't been worked on for over 3 months.

  10. 1 month ago
    Anonymous

    It's peak uncanny valley. Even hand-pitched vocaloid stuff sounds far more natural than AI ever will, and the algorithms don't seem to get any better.
    Same as music.

    • 1 month ago
      Anonymous

      >Even hand-pitched vocaloid stuff sounds far more natural than AI ever will
      Nah that's cope.

  11. 1 month ago
    Anonymous

    I wish they did. AI voices are important for larps or video game modding, ESPECIALLY when it comes to e-girl voices, because there are not nearly enough e-girl voice actors around for "moral" reasons, or they are willing to do it but just don't want it on their VA resume...

    • 1 month ago
      Anonymous

      There are enough JA voices for that

      • 1 month ago
        Anonymous

        But if an English-speaking, non-Japanese-speaking creator wants to hire a JA VA, how difficult would it be because of the language barrier? Pretty difficult I'd imagine, since I've seen none anywhere. Where would you even find cheap ones when you just need a VA for a game mod?

        • 1 month ago
          Anonymous

          You have a point. Still, you can get a good enough EN e-girl voice with current TTS tech; Neuro-sama's voice, for example.

          • 1 month ago
            Anonymous

            What we need is S-tier voices like Shondo's or even Gura's. Shondo in particular has the perfect e-girl voice, so if we can get AI to replicate that perfectly then we're good. And no one would complain, there are no laws against the AI batman.

            • 1 month ago
              Anonymous

              There are many copies of Gura's voice. Youtube is filled with Gura AI songs

              • 1 month ago
                Anonymous

                sauce to the AI voices?

              • 1 month ago
                Anonymous

                https://huggingface.co/sail-rvc/Gawr_Gura__Hoe-girlve_EN__RVC_v1

  12. 1 month ago
    Anonymous

    i just want real time ai audio porn

    • 1 month ago
      Anonymous

      Supposedly Bark can generate audio in real time, but that hasn't been my experience with it. It's still the superior choice when it comes to expressiveness, but it's still slower than they claim on their GitHub.
      Maybe it's not actually using my GPU, because I don't see any utilization in Task Manager, but every GPU option is set to ON so I don't know what is going on.
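
      Quick sanity check before blaming Bark itself: confirm torch can actually see the card and that CPU offload isn't on. A rough sketch assuming the standard suno-ai/bark install (the SUNO_* env vars are the ones from their README):
      import os
      import torch

      os.environ["SUNO_OFFLOAD_CPU"] = "False"       # keep models on the GPU
      os.environ["SUNO_USE_SMALL_MODELS"] = "False"  # full-size models

      # If this prints False, Bark is silently falling back to CPU, which would
      # explain the "no GPU utilization" symptom.
      print("CUDA available:", torch.cuda.is_available())
      if torch.cuda.is_available():
          print("Device:", torch.cuda.get_device_name(0))

      from bark import SAMPLE_RATE, generate_audio, preload_models
      from scipy.io.wavfile import write as write_wav

      preload_models()  # loads the text, coarse and fine models
      audio = generate_audio("Testing whether this actually runs on the GPU.")
      write_wav("bark_test.wav", SAMPLE_RATE, audio)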

      • 1 month ago
        Anonymous

        hm... not bad... although these being able to moan, suck, slap, etc is absolutely necessary for this particular usecase IMO. I'm assuming bark can't do that...

        • 1 month ago
          Anonymous

          Not yet, but it can laugh, sigh and clear its throat, which is miles ahead of all other (local) voice generators. It's way more human.
          The only issue is those lazy fricking Black folk haven't updated the thing in 10 months.

      • 1 month ago
        Anonymous

        >these being able to moan, suck, slap, etc is absolutely necessary for this particular usecase

        Bark is theoretically closer to The Dream because it will do text-to-audio tasks alongside your speech (e.g. laughing, crying; non-verbal and seggs sounds if you finetune it) but in reality the model is so fricking bad I've come to assume training was intentionally botched for safety or profit.
        There's CONSTANTLY either other shit in the background or just plain awful output.

        • 1 month ago
          Anonymous

          >There's CONSTANTLY either other shit in the background or just plain awful output.
          My moronic solution is to get the entire audio, run it through Ultimate Vocal Remover AI to separate the background noise from the voice and then run it again through RVC to convert it to a different voice. It's fricking bullshit, but it works and it ends up sounding real.

          But it involves a lot of trimming and pasting audio together to get the best result. Why the FRICK does Bark have a 14 second limit?
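
          The usual workaround for the ~14 second cap is to split the script into sentences, generate each chunk with the same speaker preset so the voice stays consistent, and concatenate the arrays. Rough sketch; the preset name is just one of Bark's built-in v2 speakers:
          import numpy as np
          import nltk
          from bark import SAMPLE_RATE, generate_audio, preload_models
          from scipy.io.wavfile import write as write_wav

          nltk.download("punkt", quiet=True)  # sentence tokenizer data
          preload_models()

          script = "Long text goes here. It gets split into sentences. Each one is generated on its own."
          silence = np.zeros(int(0.25 * SAMPLE_RATE), dtype=np.float32)  # short pause between chunks

          pieces = []
          for sentence in nltk.sent_tokenize(script):
              # reusing the same history_prompt keeps the voice consistent across chunks
              pieces += [generate_audio(sentence, history_prompt="v2/en_speaker_6"), silence]

          write_wav("long_output.wav", SAMPLE_RATE, np.concatenate(pieces))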

          • 1 month ago
            Anonymous

            Nvidia Broadcast would be more efficient for filtering out the noise, I think.

  13. 1 month ago
    Anonymous

    AI voices are already completely believably real, and have been for a long time, when trained on a single person's voice. The surprising thing is that coomers haven't really done much with the technology. I think what's missing for you is a voice generator where you can put in tags like "sexy black hentai succubus whispering" and get a good output. Maybe throw a bitcoin or two at it and someone will make that software. In the meantime, people will be making fake political speeches every day.

    • 1 month ago
      Anonymous

      Entered this thread as a tourist. I use AI for imagegen and text at the moment; how easy is it to duplicate someone's voice? Is there a good site where people share them, like civit?

      • 1 month ago
        Anonymous

        Not too difficult to train at all with RVC, but the issue is the source audio it converts from. Most autists don't want to record themselves to convert to someone else's voice (they live with their parents).

        • 1 month ago
          Anonymous

          So it's not text to speech?

          • 1 month ago
            Anonymous

            RVC is speech to speech.
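
            So for text-to-a-cloned-voice you chain the two: some TTS produces the base speech, then RVC converts that recording to the target voice (that's what the Bark + RVC anon above is doing). A sketch below; the Coqui call follows their documented API, but the RVC step is left as a placeholder since it's normally driven through the WebUI rather than a stable Python API:
            from TTS.api import TTS  # pip install TTS

            # Step 1: plain TTS to get base speech in *some* voice
            tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")
            tts.tts_to_file(text="Hello anon, this is the base take.", file_path="base.wav")

            # Step 2 (placeholder): load base.wav plus your trained .pth voice model
            # into RVC's inference tab (or its CLI) to convert it to the target voice.
            # There's no widely-used Python API for this step, so it isn't shown here.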

  14. 1 month ago
    Anonymous

    RTVC exists and is not hard to train on modest hardware. You can make an excellent model with dialogue ripped from a game, for example. Do you require more hand holding than that?

  15. 1 month ago
    Anonymous

    Because no corpo decided to pour billions into scaling up some random well-known model. It has far fewer practical uses and is much more likely to cause scandals, so it's harder to bait investors into it.

    >progress
    Hype doesn't always correlate with progress

  16. 1 month ago
    Anonymous

    imo it's pretty much far enough along that there isn't much left to do aside from cutting down the resources required. TTS doesn't always generate how I want to hear it, but RVC and real-time voice changing give back as much as you put in. The only issue I've come across with some models is that they only do so much: it isn't enough to have the sound, you need to emphasize tones, pauses and accents to make it sound authentic to the model.

    That, and new voices take resources to build models out of.

  17. 1 month ago
    Anonymous

    you homosexuals need to follow chinks more closely
    https://github.com/RVC-Boss/GPT-SoVITS
    have fun

    • 1 month ago
      Anonymous

      damn

    • 1 month ago
      Anonymous

      Oh sweet. Thanks Anon.
      https://vocaroo.com/12pT31vC7bcN

      • 1 month ago
        Anonymous

        Cute

        >XTTS2 is a genuinely huge leap, I don't know why so few people know about it.

        Can you post some samples?

        • 1 month ago
          Anonymous

          Nah I'm not interested in acting as a salesman for it any more than I already have (it's not my project). If someone's a local voice model enthusiast they should be trying it for themselves.

          • 1 month ago
            Anonymous

            Come on. I don't even know who Bella is.

            • 1 month ago
              Anonymous

              Bah, ok

              https://files.catbox.moe/rar1y0.wav

              • 1 month ago
                Anonymous

                Sounds nice, does it work with other languages?

    • 1 month ago
      Anonymous

      Hypothetically, if someone was moronic and didn't speak Chinese, is there a tard's guide?

  18. 1 month ago
    Anonymous

    What are good (free) ones we can use right now?

    • 1 month ago
      Anonymous

      Bark

      >/b/ and BOT are the dicky boards
      >fixed your typos

      >Forgetting BOT, BOT and yes, BOT
      You didn't fix shit

      • 1 month ago
        Anonymous

        >You didn't fix shit
        no, you're just another shitposting tourist that shouldn't be here

        • 1 month ago
          Anonymous

          >I'M the tourist
          >Implying

      • 1 month ago
        Anonymous

        yet neither of you chuckleheads remembered /vt/

        • 1 month ago
          Anonymous

          >Newboard for redditors I don't care about
          I didn't forget anything

  19. 1 month ago
    Anonymous

    you can't post audio files on this board.

    • 1 month ago
      Anonymous

      >He doesn't know

    • 1 month ago
      Anonymous

      newbie moron.

  20. 1 month ago
    Anonymous

    People find AI voices more disturbing than Microsoft Sam, according to my sister.
    I totally get it though: the uncanny valley shit, and the "I'm just too lazy to use my voice and don't care about how real it sounds" aspect. AI voices are too realistic for most people and it creeps them out.

    I, though, am not like those people. I find vtubers that use AI voices interesting.

  21. 1 month ago
    Anonymous

    We care but no one with the top tier tech is letting plebs have access anymore, so it’s strictly for rich people and the CIA.

  22. 1 month ago
    Anonymous

    XTTS2 is a genuinely huge leap, I don't know why so few people know about it. It's not very far behind 11L and it's fast as frick, better than realtime on my 3090 and with low latency.

    The requirement to toss a few seconds of example wav files into a folder for inference is not a significant inconvenience. I use a few samples of the 11L Bella voice and it imitates her basically perfectly. The only issue is the occasional little hallucinations but they're pretty minor. We're not there yet but it's a lie to say large progress hasn't happened recently.
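
    If anyone wants to try it, a minimal sketch with the Coqui TTS package (the model string is the published XTTS v2 id; the reference wav path is just an example, point it at whatever few-second clip you're cloning):
    import torch
    from TTS.api import TTS  # pip install TTS

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

    tts.tts_to_file(
        text="Progress isn't completely dead, anon.",
        speaker_wav="samples/bella_01.wav",  # example path: a short, clean reference clip
        language="en",
        file_path="xtts_out.wav",
    )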

    • 1 month ago
      Anonymous

      >XTTS2
      Is it better than this?

      https://i.imgur.com/F6uodxn.png

      >you homosexuals need to follow chinks more closely
      >https://github.com/RVC-Boss/GPT-SoVITS

  23. 1 month ago
    Anonymous

    wasn't Amazon showing off an Alexa feature a couple of years ago where you could play a voice to the Alexa and it would learn it, so that would then become the new voice for your Echo?

    That was giga based, but they never officially released it. Just imagine how much more use Amazon Echos would get if they had your favorite celebrity's voice or something.

    • 1 month ago
      Anonymous

      >Just imagine how much more use Amazon Echos would get if it had your favorite celebrity's voice or something
      And they would have to pay licensing fees for that. Out of their own pocket? No, that'll be $59.99 a month for you.

  24. 1 month ago
    Anonymous

    But can it roleplay nonconsensual handholding?

  25. 1 month ago
    Anonymous

    Anon, Elevenlabs is a thing.

    • 1 month ago
      Anonymous

      >Paying for voice generation or getting 3 free sentences a month
      No thanks.

  26. 1 month ago
    Anonymous

    Don't worry. Once text-to-video as good as Sora is everywhere, people will start caring about AI voices again.

  27. 1 month ago
    Anonymous

    >make audiobook with tortoise
    >it's 97% of the way there immediately
    >fixing the last 3% would take days of work

    • 1 month ago
      Anonymous

      What can Tortoise do that the others don't?

      • 1 month ago
        Anonymous

        not just tortoise but mrq's whole system makes it the most convenient option i've seen yet for both training and reading boatloads of text.

        • 1 month ago
          Anonymous

          Is it English only?

          • 1 month ago
            Anonymous

            don't know i never tried anything else

  28. 1 month ago
    Anonymous

    pi.ai is genuinely insane in how the voice works. The voice AI doesn't just read the words, it actually takes into account the context of what the chatbot writes. If the answer is sarcastic, the voice AI will use a sarcastic tone.

    • 1 month ago
      Anonymous

      Is it paid? Does it have generation limits?
      If the answer is yes to either of those it's shit.

    • 1 month ago
      Anonymous

      Oh shit it actually said it

    • 1 month ago
      Anonymous

      https://i.imgur.com/v2rUbyF.png

      >Oh shit it actually said it

      Never mind, it's fricking pozzed.

      • 1 month ago
        Anonymous

        It's fricking over.

      • 1 month ago
        Anonymous

        It would defuse the bomb

        • 1 month ago
          Anonymous

          This is obviously just ChatGPT.

          • 1 month ago
            Anonymous

            It understands sneed since I last used it.

            • 1 month ago
              Anonymous

              https://i.imgur.com/SYrRWNc.png

              >It would defuse the bomb

              It's fricking trash. Never shill this israeli piece of shit ever again.

              • 1 month ago
                Anonymous

                It won't even load anymore just because I said Black person.
                I hate leftists.

              • 1 month ago
                Anonymous

                You should have made it say homierdly

  29. 1 month ago
    Anonymous

    i've been out of the loop on AI, but do we have anything on ElevenLabs level? i remember having some fun back when it started out and was free, then playing with dubs some time later. now it's all paid-only.

    • 1 month ago
      Anonymous

      The powers that be would never allow that

  30. 1 month ago
    Anonymous

    this is just a fun concept that uses style transfers pretty well

    https://podcast.ai/

  31. 1 month ago
    Anonymous

    I care about AI voices. In fact they're the only kind of AI product I care about, because they're basically just a better kind of TTS and not goyslop machines.
