How many more parameters and training until AI models can create movies indistinguishable from reality?
bro really thought reddit was a board on bot I'm dead
it's impossible. we're already at the limits of what AI can do. things are just gonna get faster and higher res but not better quality
Wrong
Use AI for lip syncing on 3D models and animations. I saw a game that was indistinguishable from real life, I forget what it was called.
Hey you could start with Willy Wonka remake why not.
Oh, I forgot: make the AI make the 3D models too. The AI-generated 3D models right now are not that great.
>How many more parameters and training
to give a sense of scale, let's consider the following:
the context window of a modern LLM is maybe on the order of 100k tokens
that means that a good document that it can understand during training is probably that length too
100k tokens is roughly 400k bytes, and roughly the size of a decent image for training an image generator too
a video, for comparison, might be 1000x bigger, with some balance between bitrate and duration
to sense check this, we currently have video gen models that can produce fairly decent ~5 second clips, but a movie is more like ~5000 seconds
so to answer your question, we need about 3 OOMs more of scaling
according to Epoch, scaling compute that much will take 5 more years, and scaling parameter count that much will take about 6 more years
https://epochai.org/trends
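the back-of-envelope math above can be sketched in Python. every constant here is a rough assumption lifted straight from this post (order-of-magnitude guesses, not measured values):

```python
import math

# rough assumptions from the post above, not measured values
context_tokens = 100_000      # context window of a modern LLM
bytes_per_token = 4           # so a good training document is ~400k bytes
doc_bytes = context_tokens * bytes_per_token

clip_seconds = 5              # decent clips from current video-gen models
movie_seconds = 5000          # a feature film, roughly

# orders of magnitude of scaling still needed to go from clip to movie
ooms_needed = math.log10(movie_seconds / clip_seconds)

print(doc_bytes)    # 400000
print(ooms_needed)  # 3.0
```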
>100k tokens
Huh? I thought it was much less.
let me guess, you need more?
https://dev.to/maximsaplin/gpt-4-128k-context-it-is-not-big-enough-1h02
That article says 128k tokens is enough for 1684 tweets. Assuming 280 characters per tweet, that's 471,520 characters; at six characters per average word (including the trailing space), that's 78,586 words. Pretty impressive to be honest. Though the article says hallucinations are much more likely once you cross 50% of context usage. Anyway, it's not ready to make a movie yet, but in a few years, maybe (it needs a lot more context than just the film script on its own to make a good movie). But when it gets to that point, it could make a whole feature film pretty much instantly.
>Assume six characters per average word, including trailing space
fwiw, your comment was 600 characters and 105 words, so six characters per word is a very good estimate
One massive limitation AI has that's not often discussed here is the quality of the training data.
More effort and resources are poured into collecting, labeling, and processing the data than into anything to do with the AI models themselves.
If we want to make coherent movies with AI, someone needs to label thousands of movies scene by scene. And how would you even label stuff like that?
labelling is a lot easier than generation, and GPT-4 Vision is getting close to doing that at a human level
with a mixture of synthetic data and a corpus the size of youtube or TV stations' back catalogs, there should be enough training data to last another 5 years no problem
>quality
AI-generated labeling is not of any quality. Models are only as good as the training data.
Consider current anime models. They're restricted largely to Danbooru tags, which causes most pics to be very rigid in composition and such.
If you want to do anything interesting you gotta dabble with LoRAs, style-overfitted models, and ControlNets.
>AI-generated labeling is not of any quality.
do you have a benchmark that compares current labelling systems to the best human labellers?
it's hard to take your claim seriously when you make such an extreme statement as "not of any quality"
My point is that all labeling is shit, which puts a hard limit on the quality of AI generated media.
i'm not convinced
an AI model can find statistical patterns that represent categories like "dog", for example, and all it needs is one label to know how to convert that region in visual latent space into the equivalent region in textual latent space (the vicinity of "dog", "puppy", "canine", etc.)
in fact, we've seen that text-only models can learn to convert between two different languages without any supervised training data at all, just by looking at the "shapes" of the connections within each language
i don't see why a multi-modal model couldn't end up learning patterns in the same way, which is also probably what human babies do when they are learning to understand the world and language at the same time
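the "shapes of the connections" idea can be demoed with a toy alignment between two latent spaces. this is a minimal sketch under strong assumptions: both "languages" are point clouds sharing the same geometry up to a rotation, and we use classic orthogonal Procrustes with a tiny seed dictionary (real unsupervised translation methods bootstrap even that seed from structure alone):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy "language A": 50 concepts embedded in a 3-D latent space
A = rng.normal(size=(50, 3))

# "language B": same concepts, same geometry, just rotated
Q_true, _ = np.linalg.qr(rng.normal(size=(3, 3)))
B = A @ Q_true

# seed dictionary: only 3 known pairs (the lone "dog"-style labels)
seed = [0, 1, 2]

# orthogonal Procrustes: best rotation mapping A's seeds onto B's seeds
U, _, Vt = np.linalg.svd(A[seed].T @ B[seed])
Q_est = U @ Vt

# translate every other concept by nearest neighbour in B's space
mapped = A @ Q_est
dists = np.linalg.norm(mapped[:, None, :] - B[None, :, :], axis=-1)
pred = dists.argmin(axis=1)

accuracy = (pred == np.arange(len(A))).mean()
print(accuracy)  # 1.0 on this noise-free toy
```

obviously real latent spaces are noisy and not exactly rotations of each other, but the point stands: shared structure plus a handful of anchors is enough to line the spaces up.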
And all you can generate is generic stock-photo-like dogs in uninteresting poses and compositions. And if you try to prompt something more specific, it most likely won't listen because those concepts were not labelled in the training data.
Your AI can only do what the training data allows it to.
>Your AI can only do what the training data allows it to.
sure, but that's not a problem with the labelling, it's a problem with the training data and parameter count being too small for the space of all possible images that you might want to generate
i fully expect there to be emergent abilities unlocked at bigger model sizes as the model starts to grok higher level concepts like occlusion and 3D curvature
>but that's not a problem with the labelling, it's a problem with the training data
What do you mean by labeling, if not the training data?
>What do you mean by labeling, if not the training data?
the training data needed to recognize "hand" as a category of object is much smaller than the training data needed to generate realistic hands
my claim is that the latter is the bottleneck
so there is already enough data/parameters in the labelling part of the network to recognize hands correctly, but more data/parameters will improve image generation considerably
>And how would you even label stuff like that?
With another AI, of course.
Do you want an AI without those limitations?
Would you pay me decently to build it?
>someone needs to label thousands of movies scene by scene. And how would you even label stuff like that?
Audio description of scenes for the blind. Description of noise from the subtitles for the deaf.
>create movies indistinguishable from reality
I don't think an LLM could do that. An inherent limitation is the amount of context it can handle (the size of the prompt). An LLM that could ingest an entire movie script as context would be fairly improbable.
I just want to wear GLASSES that will display the following things to me when I talk to a girl:
1. Answers I can give, or questions I can ask, to manipulate her into getting interested in me.
2. How can I make her laugh so much?
3. Tease her.
4. Say things to her that will not put me in the friendzone.
Please someone make an AI to do this.
like a jarvis
There's nothing that will stop the awkwardness though. You can't fix timing. You gotta be more confident bro.
>bro really thought reddit was a board on BOT I'm dead
newbie genAlpha, you must be 18+ to post on this site.