Riffusion AI Music Generation

Anyone already tried out riffusion.com?
Managed to create anything catchy?
Pretty lackluster imo, give it 4 months and producers will be seething.
Yeah, these things develop dramatically fast
Is there a way to use this in games? Like have it generate music based on the current context of the player, such as music that dynamically escalates and deescalates depending on how intense the situation is?
No idea, ask on discord or plebbit
This will only develop from here on, so it's clear those features will be implemented; the question is when
That would require it to generate live during the game. And yes, if you just add "intense" or some keyword that produces harsher, more dramatic sounds, that should work, but it means generating on the fly, which is resource-intensive
Right now it's far too unreliable, even if you could package it and plug it into a game
The inference server is supposed to work live. I've tried it on an A100 and it can handle 4 users at once, all with live-generating songs; it creates a 5-second segment every 2 seconds on the A100. It's just gonna be faster with TensorRT. The site is overloaded according to the main dev
I am pretty sure the dataset used was small. However, this shouldn't be any different to fine-tune as it's basically just spectrograms.
I am downloading all of snail house and moe shop music, gonna convert it into the required spectrograms and train it on an 8×A100 cluster to see how it goes
In the meantime there's a Colab that has audio2audio, or you can use my repo to toy with it more extensively: https://github.com/chavinlo/riffusion-manipulation
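For anyone wondering what "converting it into the required spectrograms" looks like, here's a minimal numpy sketch: an STFT magnitude spectrogram rendered to an 8-bit grayscale image. The window/hop sizes here are made-up defaults, not Riffusion's actual parameters.

```python
import numpy as np

def spectrogram_image(audio: np.ndarray, n_fft: int = 512, hop: int = 128) -> np.ndarray:
    """STFT magnitude -> log-scaled 8-bit grayscale image (freq x time)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1)).T    # (n_fft//2 + 1, n_frames)
    db = 20 * np.log10(mag + 1e-6)                 # log-amplitude (dB-like scale)
    db = np.clip((db - db.min()) / (db.max() - db.min()), 0, 1)
    return (db * 255).astype(np.uint8)             # ready to save as a B&W PNG

# usage: a 1-second 440 Hz tone at 22.05 kHz lights up the bin near 440 Hz
tone = np.sin(2 * np.pi * 440 * np.arange(22050) / 22050).astype(np.float32)
img = spectrogram_image(tone)
```

Going the other way (image back to audio) needs phase reconstruction, e.g. Griffin-Lim, which is the lossy part of the pipeline.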
>this shouldn't be any different to fine-tune as it's basically just spectrograms.
I have extensive experience with diffusion models, especially for non-image data. Doing music tasks using the spectrogram as an image is a very old trick. In my latest work where I use mass spectrograms (not audio), diffusions perform quite poorly despite the massive datasets I leveraged. In the end, ad-hoc auto-regressive models worked best.
It's not at all true that it "shouldn't be any different", because the data *distribution* is completely different -- the patterns in a music spectrogram or a mass spectrogram and the data in an image are wildly different (image data is far closer to a gaussian on average, there are strong conditional biases in music, etc.). Playing with noise schedules, denoising models, and noise models has not yielded anything promising either on my end or in the literature. The d3pm paper, while it focuses on discrete data, exposes some of this and notes the difficulty of making non-standard noise schedules work -- but also, conversely, that this might be required to make diffusions work on non-image data.
LDM held the promise of being encoder/decoder agnostic, but that doesn't hold at all in practice. That's why, for example, those guys are not using a sequence encoder/decoder but rather a spectrogram input with an image encoder/decoder: nobody has been able to make the LDM paradigm's promise of input/output format independence hold, not even by using VAEs to push the latents toward gaussians.
I don't really understand your point...
If you mean the difference between an audio spectrogram and the output image, then yes, you're right, but I was referring to doing the same thing the original devs did:
>This is the v1.5 stable diffusion model with no modifications, just fine-tuned on images of spectrograms paired with text.
Riffusion only uses 1 channel to encode the amplitude info, hence the B&W. Using all 3 channels (RGB) could let us encode more amplitude info, maybe 24-bit amplitude. Right now those 2 extra channels are not being used and the data is just duplicated across them.
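The 24-bit idea could be sketched like this (a toy sketch, not Riffusion code: a float amplitude in [0, 1] quantized to 24 bits and split into one byte per RGB channel):

```python
import numpy as np

def amp_to_rgb(amp: np.ndarray) -> np.ndarray:
    """Quantize amplitudes in [0, 1] to 24-bit ints, one byte per channel."""
    q = np.round(np.clip(amp, 0.0, 1.0) * (2**24 - 1)).astype(np.uint32)
    r = (q >> 16) & 0xFF          # most significant byte
    g = (q >> 8) & 0xFF
    b = q & 0xFF                  # least significant byte
    return np.stack([r, g, b], axis=-1).astype(np.uint8)

def rgb_to_amp(rgb: np.ndarray) -> np.ndarray:
    """Inverse: recombine the three bytes into a float amplitude."""
    q = ((rgb[..., 0].astype(np.uint32) << 16)
         | (rgb[..., 1].astype(np.uint32) << 8)
         | rgb[..., 2])
    return q / (2**24 - 1)
```

The round trip is exact to within one quantization step (about 6e-8). The catch, as pointed out below, is that a diffusion model would then have to learn the byte-boundary discontinuities, which makes the distribution harder, not easier.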
Also, they use img2img with the seed image just to follow the tempo and keep some kind of coherency. Some users tried outpainting, but because this isn't an outpainting model, it gets lost in the long run.
Although, now that you mention you have experience: what do you suggest to make this better, if there's even room for improvement?
>If you mean the difference between the audio spectrogram and [natural] image[s]
Yes. Diffusions are good when the data has the right properties. In practice (because that's the only real context where it's ever worked so far), that means the data must be gaussian-enough on average. This also means the dependency patterns in the image have to be generally limited. Obviously, music spectrograms are the precise opposite.
>Riffusion only uses 1 channel to encode the amplitude info hence the B&W. Using the 3 channels (RGB) could let us encode more amplitude info, 24bit amplitude maybe.
That'll just make the problem harder (in the sense that more data will be needed).
> Some users tried outpainting, but because this isn't an outpainting model, it will get lost in the long run.
Diffusions can natively do outpainting, inpainting, or guided generation even when trained in a purely unsupervised context. It's one of their interesting strengths. That it doesn't work here is not because 'it's not an outpainting model' but because the model is stuck in a bad local minimum and unable to learn the data distribution correctly. This is not something that can be solved by more data alone (it COULD be solved by more parameters, but that's a toss-up).
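For reference, the "native" inpainting/outpainting trick works roughly like the masking below: at every denoising step the known region is overwritten with a correspondingly noised copy of the original, so only the masked-out region is actually generated. A toy numpy sketch with a stand-in denoiser (the real thing would be a trained model; the noise levels here are a crude assumption, not any particular paper's schedule):

```python
import numpy as np

def inpaint(x_known, mask, denoise_step, timesteps=50, rng=None):
    """mask == 1 where data is known; the sampler only fills in mask == 0."""
    if rng is None:
        rng = np.random.default_rng(0)
    x = rng.standard_normal(x_known.shape)     # start from pure noise
    for t in range(timesteps, 0, -1):
        level = t / timesteps                  # crude noise level in (0, 1]
        # re-noise the KNOWN region to the current step's noise level...
        known_t = (np.sqrt(1 - level**2) * x_known
                   + level * rng.standard_normal(x_known.shape))
        # ...and paste it back in, so only the unknown region is generated
        x = mask * known_t + (1 - mask) * x
        x = denoise_step(x, level)             # one reverse step of the model
    return mask * x_known + (1 - mask) * x     # final paste at noise level 0

# stand-in "denoiser" that just shrinks toward zero; a trained model goes here
toy_step = lambda x, level: x * (1 - 0.05 * level)
```

The point is that no special outpainting training is involved: the unconditional model is simply forced to stay consistent with the known half at every step. If the model hasn't learned the data distribution, this still falls apart, which is the failure mode described above.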
>what do you suggest to make this better if theres even room for improvement
This is not an incremental question. Besides brute-forcing like everyone else (i.e. more compute, more data, more parameters), a major breakthrough is needed to make this work in more general contexts.
The usual directions are:
- Noise design
- Noise scheduling
- Denoising model
- Noise model
With LDM you can add
- Encoder/decoder models
In particular, recurrent VAEs. However, I haven't been able to get those to produce nice results myself.
Otherwise, forget diffusions and look into autoregressive models instead, or find a way to combine the two (e.g. via the encoder/decoder model). All very non-trivial to do.
Good to know there are still smart people on BOT
Yes, been trying it for 2 days. It's pretty green, needs a lot of work, and has low variety, but sometimes you get beats that sound good enough.
what is your pc config?
this is a dead end. you'd need to produce images with a height of 14-6000px for this to start being viable.
Can I get midi from it?
Listened to a bunch of samples, they're all complete garbage. Maybe in 5 years.
t. retard who doesn't understand shit about the tech.
Can it make DAW files?
I can imagine this music playing in the Backrooms.
I tested it, it's pretty shit right now. Hopefully we get a better version soon. I want to see the corrupt music industry collapse.