Riffusion AI Music Generation

Anyone already tried out riffusion.com?
Managed to create anything catchy?
Pretty lackluster imo, give it 4 months and producers will be seething.
Yeah, these things develop dramatically fast
Is there a way to use this in games? Like have it generate music based on the current context of the player, such as music that dynamically escalates and deescalates depending on how intense the situation is?
No idea, ask on discord or plebbit
This will only develop from here on, so it's clear those features will be implemented; the question is when
That would require it to generate live during the game. And yes, if you just add "intense" or some keyword that produces harsher, more dramatic sounds, that should work, but it means generating on the fly, which is resource-intensive
Right now it's far too unreliable, even if you could package it and plug it into a game
The inference server is supposed to work live. I've tried it on an A100 and it can handle 4 users at once, all with live-generating songs; it creates a 5-second segment every 2 seconds on the A100. It's just gonna be faster with TensorRT. The site is overloaded according to the main dev
I am pretty sure the dataset used was small. However, this shouldn't be any different to fine-tune as it's basically just spectrograms.
I am downloading all of snail house and moe shop music, gonna convert it into the required spectrograms and train it on an 8×A100 cluster to see how it goes
In the meantime there's a Colab that has audio2audio, or you can use my repo to toy with it more extensively: https://github.com/chavinlo/riffusion-manipulation
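For anyone wondering what "converting it into the required spectrograms" looks like, here's a minimal numpy sketch: an STFT magnitude spectrogram rendered to an 8-bit grayscale image. The window/hop sizes here are made-up defaults, not Riffusion's actual parameters.

```python
import numpy as np

def spectrogram_image(audio: np.ndarray, n_fft: int = 512, hop: int = 128) -> np.ndarray:
    """STFT magnitude -> log-scaled 8-bit grayscale image (freq x time)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1)).T    # (n_fft//2 + 1, n_frames)
    db = 20 * np.log10(mag + 1e-6)                 # log-amplitude (dB-like scale)
    db = np.clip((db - db.min()) / (db.max() - db.min()), 0, 1)
    return (db * 255).astype(np.uint8)             # ready to save as a B&W PNG

# usage: a 1-second 440 Hz tone at 22.05 kHz lights up the bin near 440 Hz
tone = np.sin(2 * np.pi * 440 * np.arange(22050) / 22050).astype(np.float32)
img = spectrogram_image(tone)
```

Going the other way (image back to audio) needs phase reconstruction, e.g. Griffin-Lim, which is the lossy part of the pipeline.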
>this shouldn't be any different to fine-tune as it's basically just spectrograms.
I have extensive experience with diffusion models, especially for non-image data. Doing music tasks using the spectrogram as an image is a very old trick. In my latest work where I use mass spectrograms (not audio), diffusions perform quite poorly despite the massive datasets I leveraged. In the end, ad-hoc auto-regressive models worked best.
It's not at all true that it "shouldn't be any different", because the data *distribution* is completely different -- the patterns in a music spectrogram or a mass spectrogram and the data in an image are wildly different (image data is far closer to a gaussian on average, there are strong conditional biases in music, etc.). Playing with noise schedules, denoising models, and noise models has not yielded anything promising either on my end or in the literature. The d3pm paper, while it focuses on discrete data, exposes some of this and notes the difficulty of making non-standard noise schedules work -- but also, conversely, that this might be required to make diffusions work on non-image data.
LDM held the promise of being encoder/decoder agnostic, but that doesn't hold at all in practice. That's why, for example, those guys are not using a sequence encoder/decoder but rather a spectrogram input with an image encoder/decoder: nobody has been able to make the LDM paradigm's promise of input/output format independence hold, not even by using VAEs to push the latents toward gaussians.
I don't really understand your point...
If you mean the difference between an audio spectrogram and the output image, then yes, you're right, but I was referring to doing the same thing the original devs did:
>This is the v1.5 stable diffusion model with no modifications, just fine-tuned on images of spectrograms paired with text.
Riffusion only uses 1 channel to encode the amplitude info, hence the B&W. Using all 3 channels (RGB) could let us encode more amplitude info, maybe 24-bit amplitude. Right now those 2 extra channels are not being used and the data is just duplicated across them.
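The 24-bit idea could be sketched like this (a toy sketch, not Riffusion code: a float amplitude in [0, 1] quantized to 24 bits and split into one byte per RGB channel):

```python
import numpy as np

def amp_to_rgb(amp: np.ndarray) -> np.ndarray:
    """Quantize amplitudes in [0, 1] to 24-bit ints, one byte per channel."""
    q = np.round(np.clip(amp, 0.0, 1.0) * (2**24 - 1)).astype(np.uint32)
    r = (q >> 16) & 0xFF          # most significant byte
    g = (q >> 8) & 0xFF
    b = q & 0xFF                  # least significant byte
    return np.stack([r, g, b], axis=-1).astype(np.uint8)

def rgb_to_amp(rgb: np.ndarray) -> np.ndarray:
    """Inverse: recombine the three bytes into a float amplitude."""
    q = ((rgb[..., 0].astype(np.uint32) << 16)
         | (rgb[..., 1].astype(np.uint32) << 8)
         | rgb[..., 2])
    return q / (2**24 - 1)
```

The round trip is exact to within one quantization step (about 6e-8). The catch, as pointed out below, is that a diffusion model would then have to learn the byte-boundary discontinuities, which makes the distribution harder, not easier.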
Also, they use img2img with the seed image just to follow the tempo and keep some kind of coherency. Some users tried outpainting, but because this isn't an outpainting model, it gets lost in the long run.
Although, now that you mention you have experience: what do you suggest to make this better, if there's even room for improvement?
>If you mean the difference between the audio spectrogram and [natural] image[s]
Yes. Diffusions are good when the data has the right properties. In practice (because that's the only real context where it's ever worked so far), that means the data must be gaussian-enough on average. This also means the dependency patterns in the image have to be generally limited. Obviously, music spectrograms are the precise opposite.
>Riffusion only uses 1 channel to encode the amplitude info hence the B&W. Using the 3 channels (RGB) could let us encode more amplitude info, 24bit amplitude maybe.
That'll just make the problem harder (in the sense that more data will be needed).
> Some users tried outpainting, but because this isn't an outpainting model, it will get lost in the long run.
Diffusions can natively do outpainting, inpainting, or guided generation even when trained in a purely unsupervised context. It's one of their interesting strengths. That it doesn't work here is not because 'it's not an outpainting model' but because the model is stuck in a bad local minimum and unable to learn the data distribution correctly. This is not something that can be solved by more data alone (it COULD be solved by more parameters, but that's a toss-up).
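For reference, the "native" inpainting/outpainting trick works roughly like the masking below: at every denoising step the known region is overwritten with a correspondingly noised copy of the original, so only the masked-out region is actually generated. A toy numpy sketch with a stand-in denoiser (the real thing would be a trained model; the noise levels here are a crude assumption, not any particular paper's schedule):

```python
import numpy as np

def inpaint(x_known, mask, denoise_step, timesteps=50, rng=None):
    """mask == 1 where data is known; the sampler only fills in mask == 0."""
    if rng is None:
        rng = np.random.default_rng(0)
    x = rng.standard_normal(x_known.shape)     # start from pure noise
    for t in range(timesteps, 0, -1):
        level = t / timesteps                  # crude noise level in (0, 1]
        # re-noise the KNOWN region to the current step's noise level...
        known_t = (np.sqrt(1 - level**2) * x_known
                   + level * rng.standard_normal(x_known.shape))
        # ...and paste it back in, so only the unknown region is generated
        x = mask * known_t + (1 - mask) * x
        x = denoise_step(x, level)             # one reverse step of the model
    return mask * x_known + (1 - mask) * x     # final paste at noise level 0

# stand-in "denoiser" that just shrinks toward zero; a trained model goes here
toy_step = lambda x, level: x * (1 - 0.05 * level)
```

The point is that no special outpainting training is involved: the unconditional model is simply forced to stay consistent with the known half at every step. If the model hasn't learned the data distribution, this still falls apart, which is the failure mode described above.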
>what do you suggest to make this better if theres even room for improvement
This is not an incremental question. Besides brute-forcing like everyone else (i.e. more compute, more data, more parameters), a major breakthrough is needed to make this work in more general contexts.
The usual directions are:
- Noise design
- Noise scheduling
- Denoising model
- Noise model
With LDM you can add
- Encoder/decoder models
In particular, recurrent VAEs. However, I haven't been able to get those to produce nice results myself.
Otherwise, forget diffusions and look into autoregressive models instead, or find a way to combine the two (e.g. via the encoder/decoder model). All very non-trivial to do.
Good to know there are still smart people on BOT
Yes, been trying it for 2 days. It's pretty green, needs a lot of work, and has low variety, but sometimes you get beats that sound good enough.
what is your pc config?
this is a dead end. you'd need to produce images with a height of 14-6000px for this to start being viable.
Can I get midi from it?
Listened to a bunch of samples, they're all complete garbage. Maybe in 5 years.
t. retard who doesn't understand shit about the tech.
Can it make DAW files?
I can imagine this music playing in the Backrooms.
I tested it, it's pretty shit right now. Hopefully we get a better version soon. I want to see the corrupt music industry collapse.