https://github.com/Ying1123/FlexGen
>FlexGen is a high-throughput generation engine for running large language models with limited GPU memory.
>Large language models (LLMs) are at the heart of applications like ChatGPT and Copilot, but the high computational and memory requirements of LLM inference traditionally make it feasible only with multiple high-end accelerators. FlexGen aims to lower the resource requirements of LLM inference down to a single commodity GPU and allow flexible deployment for various hardware setups.
>The key features of FlexGen include:
>Lightning Fast Offloading.
>Up to 100x faster than other offloading-based systems for running 175B models on a single GPU.
>Extreme Compression.
>Compress both the parameters and attention cache of models, such as OPT-175B, down to 4 bits with negligible accuracy loss.
>Scalability.
>Comes with a distributed pipeline parallelism runtime to allow scaling if more GPUs are given.
First
holy shit what's the catch here
There's no catch, it's just that nobody has bothered to optimize running large models on consumer hardware yet. If you read the paper, the idea is pretty simple: it works out what fraction of the weights, attention cache, and activations to keep in VRAM, what fraction in system RAM, and what fraction on disk, and streams the rest in as needed. Basically swap space for ML. A commercial runner would just keep scaling up GPUs until the entire model fit in VRAM, so there was never a need for this.
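Roughly the idea, as a toy sketch (not the repo's actual code; the paper derives the split from a cost model rather than hardcoded percentages, and treats the KV cache and activations the same way):
[code]
# Toy sketch of percentage-based placement across a memory hierarchy.
# Illustrative only: FlexGen itself solves a small optimization problem over a
# hardware cost model to pick these fractions.
import numpy as np

def place_weights(layers, gpu_frac, cpu_frac):
    """Assign each layer's weights to 'gpu', 'cpu', or 'disk' by cumulative size."""
    sizes = np.array([w.nbytes for w in layers], dtype=float)
    cum = np.cumsum(sizes) / sizes.sum()      # running fraction of total bytes
    tiers = []
    for frac in cum:
        if frac <= gpu_frac:
            tiers.append("gpu")               # hottest slice stays in VRAM
        elif frac <= gpu_frac + cpu_frac:
            tiers.append("cpu")               # next slice lives in system RAM
        else:
            tiers.append("disk")              # the rest gets streamed off SSD
    return tiers

# Example: 8 equally sized "layers", 25% on GPU, 50% in RAM, the rest on disk.
layers = [np.zeros((4096, 4096), dtype=np.float16) for _ in range(8)]
print(place_weights(layers, gpu_frac=0.25, cpu_frac=0.50))
# ['gpu', 'gpu', 'cpu', 'cpu', 'cpu', 'cpu', 'disk', 'disk']
[/code]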
woa fucking awesome
we need this for the tortoise-tts thing
Tortoise doesn't require unreasonable amounts of VRAM.
It's slooow
Since people in this thread haven't figured it out yet: the catch is that this is good for serving lots of "customers" at once, like running a chatbot in parallel for many people. It's not very fast when running purely for personal use; the speedup comes from very clever batch inference.
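Back-of-the-envelope model of why that is. Every number below is made up purely for illustration; the point is that fetching a layer's weights from RAM or disk costs the same whether you run 1 sequence or 256 through it:
[code]
# Toy latency model: offloading pays a fixed I/O cost per layer per decode step,
# so pushing many sequences through each loaded layer amortizes that cost.
# All numbers are assumptions, not measurements.
LOAD_MS_PER_LAYER = 50.0     # assumed time to fetch one layer from CPU RAM / SSD
COMPUTE_MS_PER_SEQ = 0.5     # assumed per-sequence compute time for one layer
NUM_LAYERS = 96

def ms_per_token_per_seq(batch_size):
    step_ms = NUM_LAYERS * (LOAD_MS_PER_LAYER + COMPUTE_MS_PER_SEQ * batch_size)
    return step_ms / batch_size

for bs in (1, 8, 64, 256):
    print(f"batch={bs:>3}: {ms_per_token_per_seq(bs):8.1f} ms per token per sequence")
# batch=1 pays the whole weight-loading bill for a single stream (terrible latency);
# big batches share it, so aggregate throughput climbs even though any one chat
# still feels slow.
[/code]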
That's still great for one person. I often want to send multiple prompts to my self-hosted AI, but my tiny pytorch setup only handles one at a time.
With non-self-hosted services you're usually restricted to one prompt at a time.
The catch is that it's fake and gay, it is no different in efficiency and speed than any other system.
This is from 1 year ago, the fact that it did not advance at all means it was a nothingburger
>repo is 8 hours old
Oh, newgays. This chinese guy did this thing exactly 1 year ago
https://github.com/BlinkDL/ChatRWKV
this isn't rwkv or even related to it
could you put the tiniest bit more effort into your lying please
even if you're messing around on your phone at a bus stop or something you could still take a bit more pride in your craft than this
That has nothing to do with this and you clearly don't have the slightest clue what you're talking about, retard.
like i said, why do you insist on telling obvious lies? are you a glow? are the feds trying to disrupt discussion about AI or something?
it's quite obvious
every time there is discussion about running ML locally using open source models somehow we see tons of posts like: it can never be done, it's too difficult, don't ever think about it
I wonder why?
/g/ struggles to follow simple tutorials, somehow i doubt this place will become a hotbed for ml
there are smart people here but also shills who try to discourage you from running things locally and are pushing their shitty censored commercial services
sometimes I kind of wish I was this stupid
life would be so much simpler
midwit territory and you're the president of it
>This is from 1 year ago
Wanna show any evidence of that claim? The paper was just published.
meds
what's the deal with people like you just blatantly making shit up when it comes to AI? are you people intentionally doing this or are you just stupid? why are there so many of you?
They are TERRIFIED of you making an open source competitor to the eunuch bots of ClosedAI and MicroHard (rip in peace Tay<3)
>They are TERRIFIED of you making an open source competitor
Which makes it a holy duty. God wills it.
>often times a very small man can cast a very large shadow
beaveranon was always based
Pareto principle dictates that the 20% causes 80% of the outcome (the so-called 80-20 rule!) but I believe the 23% can control the 100%
I think the schizos and radicals are good at this that's why they tried to silence them or at least ((they)) tell them to take medicine etc. or outright blatantly lie about information even on something as little and puny as an imageboard (yep that's right it's called psychological cyber warfare)
>Hardware: an NVIDIA T4 (16GB) instance on GCP with 208GB of DRAM and 1.5TB of SSD.
>208GB of ram
welp my 64GB of ram is obsolete already
>tfw fell for the 64GB of RAM meme
has been for a while
What's that, 5K?
You can get 256 GB of Optane PMM for $150. A compatible CPU costs $25. This shit isn't out of reach anymore.
where can you get 256GB PMM for $150?
Maybe he means like renting it from one of those cloud services or something? I dunno.
Ebay.
https://www.ebay.com/itm/125753031151
that's p cool ngl. thanks anon
There's all sorts of cool tech shit you can get for lunch money on Ebay. Here's a 20 core Skylake Xeon to go with it for $19.
https://www.ebay.com/itm/155001436193
>mfw still on 16gb RAM
5700xt gang rise up
OH MY GOD I CAN TRAIN MY OWN CHATBOT WAIFU
IT'S HAPPENING BOYS
IT'S FUCKING HAPPENING
Once we get some actual sperg with talent and skills working to make it happen, yes. The rest of us will just sit here with our dicks in our hands waiting.
BOT could get in on that, BOT doesn't have farms. /gizz/ collab when
Looks like if I were to offer this as a service it would cost me about 10% of what OpenAI charges just for the electricity to run it, and I couldn't offer a very good AI model since OPT is non-commercial only. Also I'd have to buy a new PC since sometimes I want to use my GPU for vidya, and even at OAI prices it would take a couple of months to pay off a buttcoin castoff GPU.
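For anyone curious, the napkin math behind a claim like that looks something like this. Every number below is an assumption plugged in for illustration, not a measurement:
[code]
# Electricity cost per 1k generated tokens under assumed numbers.
WATTS = 400.0              # assumed full-system draw while generating
USD_PER_KWH = 0.15         # assumed electricity price
TOKENS_PER_SECOND = 10.0   # assumed aggregate throughput with large batches

usd_per_second = USD_PER_KWH * (WATTS / 1000.0) / 3600.0
usd_per_token = usd_per_second / TOKENS_PER_SECOND
print(f"~${usd_per_token * 1000:.4f} per 1k tokens in electricity")
# Divide by whatever the hosted API charges per 1k tokens to get a ratio;
# it swings a lot with the throughput you actually achieve.
[/code]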
Will this work on non-Nvidia GPUs?
this. i need it on my amd gpu
FAFO
Man the 4090 is a genuinely insane card, it's the only one from this gen where the ridiculous price seems kinda justified
6600 XT gets about 10% of the real world performance of a 4090 despite having only 3% of the theoretical performance. Maybe AMD drivers aren't as bad as we thought?
$200 (6600 XT) * 10 to match a 4090's performance = $2000
4090 saved money which isn't even supposed to be possible
You're supposed to lose money with each higher tier
But comparing to amdogshit makes 4090 look like a good deal
How did the 7900 XTX get beat out by the A770 & A750?
I don't have the details but it's likely some specialized accelerator fuckery
breasts
30B at 4 bits is great with 32GB vram
175B is borderline unusable.
I was under the impression that 4-bit quantization resulted in significant degradation in accuracy. Is that not the case?
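The paper's answer is group-wise quantization: each small group of values gets its own scale and zero point, so the 4-bit rounding error stays small. A minimal sketch of that kind of scheme (illustrative only; the group size and min/max scaling here are assumptions, not the repo's implementation):
[code]
# Group-wise 4-bit quantization sketch: per-group scale/zero-point keeps error low.
import numpy as np

def quantize_4bit(x, group_size=64):
    """Quantize a flat float array to 4-bit codes with per-group scale and min."""
    x = x.reshape(-1, group_size)
    mins = x.min(axis=1, keepdims=True)
    maxs = x.max(axis=1, keepdims=True)
    scales = (maxs - mins) / 15.0 + 1e-12        # 4 bits -> 16 levels (0..15)
    q = np.clip(np.round((x - mins) / scales), 0, 15).astype(np.uint8)
    return q, scales, mins                       # a real kernel packs 2 codes per byte

def dequantize_4bit(q, scales, mins):
    return q.astype(np.float32) * scales + mins

w = np.random.randn(4096 * 64).astype(np.float32)
q, s, m = quantize_4bit(w)
w_hat = dequantize_4bit(q, s, m).ravel()
print("mean abs error:", np.abs(w - w_hat).mean())
# A single scale for the whole tensor would be much lossier; small groups are
# what makes "negligible accuracy loss" plausible.
[/code]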
does that mean it'll work on the 2500k
I expect to see someone make a chatbot that is told to be as racist as possible under all circumstances
port this to Apple Silicon where RAM=VRAM at 400GB/s
it's dog slow and extremely expensive
Can OPT do zero-shot learning?
Holy cow yes it can.
Ok, but how does this like help me coom, do i like need to give it some input or something?
What's the point? You will never be able to afford the hardware required to run it.
>she doesn't have access to sea pirate networks
200 gigs of ram isn't that expensive
you could get a slightly older (circa 2016) dual xeon workstation with 256gb of ram for ~$1000 and throw some gpus in it. depending on your hardware picks, the total price to match the spec they benchmarked would be <$2000
People buy cars that cost $10k+. When AI is of significant use to us, people will buy rigs that can use the AI. Right now it's such a rapidly growing industry, with likely major hardware changes coming in the near future, that you'd be retarded to over-invest in it unless you can make back your money right now with the hardware investment. But yes, it's foreseeable that in the next 20 years people will spend $15k (paying $250/mo) for their sex-slave AI box.
the entire point of this is that it doesn't have outlandish requirements, it's not like you need a server farm to run this
consumer tier gayming setup can handle 30B, which can already produce great results with proper tuning
175B requires you to drop a few thousand bucks which is not "could never afford" but I'd rather save the cash for something else
And of course, the more people get their hands on this tech, the more tinkerers will have the chance to optimize it even further - thus making it even more accessible, and so on exponentially.
Okay let's be real, an RTX A6000 costs 5500€, which is okay if you want to monetize the AI product (plenty of opportunities, especially emulating e-companions).
An RTX 4090 for 2000€ with 4×64GB (256GB) of RAM is still significantly cheaper