Riffusion AI Music Generation

Anyone already tried out riffusion.com?
Managed to create anything catchy?
Explainer:
https://www.riffusion.com/about

  1. 1 year ago
    Anonymous

    Pretty lackluster imo, give it 4 months and producers will be seething.

    • 1 year ago
      Anonymous

      Yeah, these things develop dramatically fast

  2. 1 year ago
    Anonymous

    > Github:
    https://github.com/hmartiro/riffusion-app

    > Webui:
    https://github.com/enlyth/sd-webui-riffusion

    > Subreddit:
    https://www.reddit.com/r/riffusion

    > Discord:
    https://discord.gg/v7DGUUzz

  3. 1 year ago
    Anonymous

    Is there a way to use this in games? Like having it generate music based on the player's current context, e.g. music that dynamically escalates and de-escalates depending on how intense the situation is?

    • 1 year ago
      Anonymous

      No idea, ask on discord or plebbit
      This will only develop from here on, so it's clear those features will be implemented; the question is when.

    • 1 year ago
      Anonymous

      That would require it to generate live during the game. And yes, if you just add "intense" or some other keywords that produce harsher, more dramatic sounds, that should work, but that means generating on the fly, which is resource intensive.
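      Something like this, purely as a sketch of the idea -- generate_clip() here is a made-up stand-in for whatever backend (local model, inference server, etc.) would actually render the audio:

      ```python
      # Illustrative sketch only: map a game "intensity" value to a text prompt
      # and keep a short audio buffer filled with freshly generated clips.
      # generate_clip() is hypothetical, not part of any real API.

      def prompt_for_intensity(intensity: float) -> str:
          """Pick prompt keywords from a 0.0-1.0 intensity value."""
          if intensity < 0.3:
              return "calm ambient synth pads, slow tempo"
          if intensity < 0.7:
              return "driving electronic beat, mid tempo"
          return "intense aggressive drums, distorted bass, fast tempo"

      def update_music(game_state, audio_queue, generate_clip):
          """Called periodically by the game loop to keep the audio buffer ahead."""
          prompt = prompt_for_intensity(game_state.intensity)
          clip = generate_clip(prompt=prompt, seconds=5)  # the expensive GPU call
          audio_queue.push(clip)
      ```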

    • 1 year ago
      Anonymous

      eventually yeah

      right now it's far too unreliable, even if you could package and plug it into a game

    • 1 year ago
      Anonymous

      The inference server is supposed to work live. I've tried it on an A100 and it can handle 4 users at once, all with live-generating songs. It creates 5-second segments every 2 seconds on the A100. It's only going to get faster with TensorRT. The site is overloaded according to the main dev.
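      Roughly how a client could drive it for live playback -- note the endpoint path and payload below are placeholders for illustration, NOT the actual inference server's API, so check the repo for the real schema:

      ```python
      # Hypothetical client loop: keep requesting ~5 s clips and hand them to a
      # playback callback. The endpoint name and JSON fields are assumptions.
      import requests

      SERVER = "http://localhost:3013"  # assumed address of the inference server

      def stream_clips(prompt: str, play_clip):
          while True:
              resp = requests.post(f"{SERVER}/generate", json={"prompt": prompt}, timeout=30)
              resp.raise_for_status()
              # Assumed: the response body is the rendered audio clip.
              # Each request takes ~2 s for ~5 s of audio, so the buffer stays ahead.
              play_clip(resp.content)
      ```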

      >Listened to a bunch of samples, they're all complete garbage. Maybe in 5 years.
      >5 months

      I am pretty sure the dataset used was small. However, this shouldn't be any different to fine-tune, as it's basically just spectrograms.

      I am downloading all of the Snail House and Moe Shop music, gonna convert it into the required spectrograms and train on an 8xA100 cluster to see how it goes.

      In the meantime there's a Colab that has audio2audio, or you can use my repo to toy with it more extensively: https://github.com/chavinlo/riffusion-manipulation
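      For anyone doing the same, this is roughly what the conversion step looks like -- a minimal sketch with librosa, not the exact Riffusion pipeline, and the spectrogram parameters here are assumptions:

      ```python
      # Rough illustration: render ~5 s audio clips as greyscale mel-spectrogram
      # PNGs that a Stable Diffusion fine-tune could train on. The n_mels,
      # n_fft and hop_length values are assumptions, not Riffusion's settings.
      import numpy as np
      import librosa
      from PIL import Image

      def audio_to_spectrogram_png(wav_path: str, out_path: str, sr: int = 44100) -> None:
          y, sr = librosa.load(wav_path, sr=sr, mono=True, duration=5.0)
          mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                               hop_length=512, n_mels=512)
          mel_db = librosa.power_to_db(mel, ref=np.max)  # log-amplitude
          # Normalize to 0..255 and flip so low frequencies sit at the bottom.
          img = 255 * (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8)
          Image.fromarray(np.flipud(img).astype(np.uint8)).save(out_path)
      ```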

      • 1 year ago
        Anonymous

        >this shouldn't be any different to fine-tune, as it's basically just spectrograms.
        I have extensive experience with diffusion models, especially for non-image data. Doing music tasks using the spectrogram as an image is a very old trick. In my latest work where I use mass spectrograms (not audio), diffusions perform quite poorly despite the massive datasets I leveraged. In the end, ad-hoc auto-regressive models worked best.
        It's not at all true that it "shouldn't be any different", because the data *distribution* is completely different -- the patterns in a music spectrogram or a mass spectrogram and the data in an image are wildly different (image data is far closer to a gaussian on average, there are strong conditional biases in music, etc.). Playing with noise schedules, denoising models, and noise models has not yielded anything promising, either on my end or in the literature. The d3pm paper, while it focuses on discrete data, exposes some of this and notes the difficulty of making non-standard noise schedules work, but also, conversely, that this might be required to make diffusions work on non-image data.
        LDM held the promise of being encoder/decoder agnostic, but that doesn't hold at all in practice, which is why, for example, those guys are not using a sequence encoder/decoder but rather a spectrogram input with an image encoder/decoder: nobody has been able to make the LDM paradigm's promise of input/output format independence (using VAEs to get gaussian latents) actually hold.

        • 1 year ago
          Anonymous

          I don't really understand your point...
          If you mean the difference between the audio spectrogram and the output image, then yes, you are right, but I was referring to doing the same thing the original devs did:

          >This is the v1.5 stable diffusion model with no modifications, just fine-tuned on images of spectrograms paired with text.

          Riffusion only uses 1 channel to encode the amplitude info, hence the B&W. Using all 3 channels (RGB) could let us encode more amplitude info, maybe 24-bit amplitude. Those 2 channels are not being used right now and the data is just being duplicated.
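          Just to illustrate the idea, something like packing the amplitude into 24 bits across R, G and B -- whether the model could actually learn such an encoding is another question:

          ```python
          # Sketch: pack a normalized amplitude (0..1) into 24 bits spread over
          # the R, G, B channels instead of duplicating one 8-bit channel.
          import numpy as np

          def amplitude_to_rgb(amp: np.ndarray) -> np.ndarray:
              """amp: float array in [0, 1], shape (H, W) -> uint8 (H, W, 3)."""
              q = (np.clip(amp, 0.0, 1.0) * (2**24 - 1)).astype(np.uint32)
              r = (q >> 16) & 0xFF
              g = (q >> 8) & 0xFF
              b = q & 0xFF
              return np.stack([r, g, b], axis=-1).astype(np.uint8)

          def rgb_to_amplitude(rgb: np.ndarray) -> np.ndarray:
              """Inverse: uint8 (H, W, 3) -> float amplitude in [0, 1]."""
              q = (rgb[..., 0].astype(np.uint32) << 16) | \
                  (rgb[..., 1].astype(np.uint32) << 8) | rgb[..., 2].astype(np.uint32)
              return q / (2**24 - 1)
          ```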

          Also, they use img2img with the seed image just to follow the tempo and keep some kind of coherency. Some users tried outpainting, but because this isn't an outpainting model, it gets lost in the long run.

          Although, now that you mention you have experience: what do you suggest to make this better, if there's even room for improvement?

          • 1 year ago
            Anonymous

            >If you mean the difference between the audio spectogram and [natural] image[s]
            Yes. Diffusions are good when the data has the right properties. In practice (because that's the only real context where they've ever worked so far), it means the data must be gaussian enough on average. It also means that the dependency patterns in the image have to be fairly limited. Obviously that's the precise opposite of what you get in music spectrograms, as you can see in

            https://i.imgur.com/iTnQKwa.png


            >Riffusion only uses 1 channel to encode the amplitude info hence the B&W. Using the 3 channels (RGB) could let us encode more amplitude info, 24bit amplitude maybe.
            That'll just make the problem harder (in the sense that more data will be needed).

            > Some users tried outpainting, but because this isn't an outpainting model, it will get lost in the long run.
            Diffusions can natively do outpainting, inpainting, or guided generation even when trained just in the purely unsupervised context. It's one of their interesting strengths. That it doesn't work is not because 'it's not an outpainting model' but because the model is stuck on a bad local minimum and unable to learn the data distribution correctly. This is not something that can be solved by more data alone (it COULD be solved by more parameters, but that's a toss up).
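            To be clear about what I mean by "natively": the standard trick is to splice a correctly-noised copy of the known region back in at every reverse step, with no special training needed. A minimal sketch, where model_step and q_sample are stand-ins for your sampler's own functions:

            ```python
            # Mask-and-replace inpainting/outpainting with an unconditionally
            # trained diffusion model. model_step and q_sample are placeholders.
            import torch

            def inpaint(x_known, mask, model_step, q_sample, num_steps):
                """
                x_known: reference image (valid where mask == 1)
                mask:    1 = keep from x_known, 0 = let the model generate
                model_step(x_t, t) -> x_{t-1}: one reverse diffusion step
                q_sample(x_0, t)   -> x_t:     forward-noise a clean image to step t
                """
                x = torch.randn_like(x_known)  # start from pure noise
                for t in reversed(range(1, num_steps + 1)):
                    x = model_step(x, t)  # model fills in the unknown region
                    if t > 1:
                        x_known_t = q_sample(x_known, t - 1)  # noise known region to match
                        x = mask * x_known_t + (1 - mask) * x
                    else:
                        x = mask * x_known + (1 - mask) * x   # final: paste clean pixels
                return x
            ```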

            >what do you suggest to make this better if theres even room for improvement
            This is not an incremental question. Besides brute-forcing like everyone else (i.e. more compute, more data, more parameters), a major breakthrough is needed to make this work in more general contexts.
            The usual directions are:
            - Noise design
            - Noise scheduling (a concrete example below)
            - Denoising model
            - Noise model
            With LDM you can add
            - Encoder/decoder models
            In particular, recurrent VAEs. However I haven't been able to get those to produce nice results myself.
            Otherwise, forget diffusions and look into autoregressive models instead, or find a way to combine them (e.g., via the encoder/decoder model). All very non-trivial to do.
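            To make the noise-scheduling item concrete, the kind of thing that gets tried is swapping the original linear beta schedule for e.g. the cosine alpha-bar schedule from the improved-DDPM paper -- a tiny sketch:

            ```python
            # Two common noise schedules for a DDPM-style model: the original
            # linear betas and the cosine alpha-bar schedule (Nichol & Dhariwal).
            import numpy as np

            def linear_betas(T, beta_start=1e-4, beta_end=0.02):
                return np.linspace(beta_start, beta_end, T)

            def cosine_betas(T, s=0.008, max_beta=0.999):
                # alpha_bar(t) = cos^2(((t/T + s) / (1 + s)) * pi/2), scaled so alpha_bar(0) = 1
                t = np.arange(T + 1) / T
                alpha_bar = np.cos(((t + s) / (1 + s)) * np.pi / 2) ** 2
                alpha_bar = alpha_bar / alpha_bar[0]
                betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
                return np.clip(betas, 0, max_beta)
            ```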

            • 1 year ago
              Anonymous

              Thanks
              Good to know that there are still smart people on BOT

  4. 1 year ago
    Anonymous

    Yes, been trying it for 2 days. It's pretty green, needs a lot of work and has low variety, but sometimes you get beats that sound good enough.

    • 1 year ago
      Devs

      What is your PC config?

    • 1 year ago
      Anonymous

      hi kitler

  5. 1 year ago
    Anonymous

    this is a dead end. you'd need to produce images with a height of 14-6000px for this to start being viable.

  6. 1 year ago
    Anonymous

    Can I get MIDI from it?

  7. 1 year ago
    Anonymous

    Listened to a bunch of samples, they're all complete garbage. Maybe in 5 years.

    • 1 year ago
      Anonymous

      5 months

      • 1 year ago
        Anonymous

        t. moron who doesn't understand shit about the tech.

  8. 1 year ago
    Anonymous

    Can it make DAW files?

  9. 1 year ago
    Anonymous
  10. 1 year ago
    Anonymous

    I can imagine this music playing in the Backrooms.

    https://www.riffusion.com/?&prompt=swing+jazz+trumpet&seed=124&denoising=0.75&seedImageId=og_beat

  11. 1 year ago
    Anonymous

    I tested it, it's pretty shit right now. Hopefully we get a better version soon. I want to see the corrupted music industry collapse.

  12. 1 year ago
    Anonymous

    https://www.riffusion.com/?&prompt=dry+techno+music&denoising=0.95&seedImageId=marim
