https://github.com/Ying1123/FlexGen
>FlexGen is a high-throughput generation engine for running large language models with limited GPU memory.
>Large language models (LLMs) are at the heart of applications like ChatGPT and Copilot, but the high computational and memory requirements of LLM inference traditionally make it feasible only with multiple high-end accelerators. FlexGen aims to lower the resource requirements of LLM inference down to a single commodity GPU and allow flexible deployment for various hardware setups.
>The key features of FlexGen include:
>Lightning Fast Offloading.
>Up to 100x faster than other offloading-based systems for running 175B models on a single GPU.
>Extreme Compression.
>Compress both the parameters and attention cache of models, such as OPT-175B, down to 4 bits with negligible accuracy loss.
>Scalability.
>Comes with a distributed pipeline parallelism runtime to allow scaling if more GPUs are given.
First
holy shit what's the catch here
There's no catch, it's just that nobody has bothered to optimize running large models on consumer hardware yet. If you read the paper, the idea is pretty simple: it works out what fraction of the weights, attention cache, and activations to keep in VRAM, what fraction in system RAM, and what fraction on disk, and streams the rest in as needed. Basically swap space for ML. A commercial runner would just keep scaling up GPUs until the entire model fit in VRAM, so there was never a need for this.
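Roughly the idea, as a toy sketch (not the repo's actual code; the paper derives the split from a cost model rather than hardcoded percentages, and treats the KV cache and activations the same way):
[code]
# Toy sketch of percentage-based placement across a memory hierarchy.
# Illustrative only: FlexGen itself solves a small optimization problem over a
# hardware cost model to pick these fractions.
import numpy as np

def place_weights(layers, gpu_frac, cpu_frac):
    """Assign each layer's weights to 'gpu', 'cpu', or 'disk' by cumulative size."""
    sizes = np.array([w.nbytes for w in layers], dtype=float)
    cum = np.cumsum(sizes) / sizes.sum()      # running fraction of total bytes
    tiers = []
    for frac in cum:
        if frac <= gpu_frac:
            tiers.append("gpu")               # hottest slice stays in VRAM
        elif frac <= gpu_frac + cpu_frac:
            tiers.append("cpu")               # next slice lives in system RAM
        else:
            tiers.append("disk")              # the rest gets streamed off SSD
    return tiers

# Example: 8 equally sized "layers", 25% on GPU, 50% in RAM, the rest on disk.
layers = [np.zeros((4096, 4096), dtype=np.float16) for _ in range(8)]
print(place_weights(layers, gpu_frac=0.25, cpu_frac=0.50))
# ['gpu', 'gpu', 'cpu', 'cpu', 'cpu', 'cpu', 'disk', 'disk']
[/code]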
woa fucking awesome
we need this for the tortoise-tts thing
Tortoise doesn't require unreasonable amounts of VRAM.
It's slooow
Since people in this thread haven't figured it out yet: the catch is that this is good for serving lots of "customers" at once, like running a chatbot in parallel for many people. It's not very fast when running purely for personal use; the speedup comes from very clever batch inference.
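Back-of-the-envelope model of why that is. Every number below is made up purely for illustration; the point is that fetching a layer's weights from RAM or disk costs the same whether you run 1 sequence or 256 through it:
[code]
# Toy latency model: offloading pays a fixed I/O cost per layer per decode step,
# so pushing many sequences through each loaded layer amortizes that cost.
# All numbers are assumptions, not measurements.
LOAD_MS_PER_LAYER = 50.0     # assumed time to fetch one layer from CPU RAM / SSD
COMPUTE_MS_PER_SEQ = 0.5     # assumed per-sequence compute time for one layer
NUM_LAYERS = 96

def ms_per_token_per_seq(batch_size):
    step_ms = NUM_LAYERS * (LOAD_MS_PER_LAYER + COMPUTE_MS_PER_SEQ * batch_size)
    return step_ms / batch_size

for bs in (1, 8, 64, 256):
    print(f"batch={bs:>3}: {ms_per_token_per_seq(bs):8.1f} ms per token per sequence")
# batch=1 pays the whole weight-loading bill for a single stream (terrible latency);
# big batches share it, so aggregate throughput climbs even though any one chat
# still feels slow.
[/code]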
That's still great for one person. I often want to send multiple prompts to my self-hosted AI, but my tiny pytorch setup only handles one at a time.
With non-self-hosted services you're usually restricted to one prompt at a time.
The catch is that it's fake and gay, it is no different in efficiency and speed than any other system.
This is from 1 year ago, the fact that it did not advance at all means it was a nothingburger
>repo is 8 hours old
Oh, newgays. This chinese guy did this thing exactly 1 year ago
https://github.com/BlinkDL/ChatRWKV
this isn't rwkv or even related to it
could you put the tiniest bit more effort into your lying please
even if you're messing around on your phone at a bus stop or something you could still take a bit more pride in your craft than this
That has nothing to do with this and you clearly don't have the slightest clue what you're talking about, retard.
like i said, why do you insist on telling obvious lies? are you a glow? are the feds trying to disrupt discussion about AI or something?
it's quite obvious
every time there is discussion about running ML locally using open source models somehow we see tons of posts like: it can never be done, it's too difficult, don't ever think about it
I wonder why?
/g/ struggles to follow simple tutorials, somehow i doubt this place will become a hotbed for ml
there are smart people here but also shills who try to discourage you from running things locally and are pushing their shitty censored commercial services
sometimes I kind of wish I was this stupid
life would be so much simpler
midwit territory and you're the president of it
>This is from 1 year ago
Wanna show any evidence of that claim? The paper was just published.
meds
what's the deal with people like you just blatantly making shit up when it comes to AI? are you people intentionally doing this or are you just stupid? why are there so many of you?
They are TERRIFIED of you making an open source competitor to the eunuch bots of ClosedAI and MicroHard (rip in peace Tay<3)
>They are TERRIFIED of you making an open source competitor
Which makes it a holy duty. God wills it.
>often times a very small man can cast a very large shadow
beaveranon was always based
Pareto principle dictates that the 20% causes 80% of the outcome (the so-called 80-20 rule!) but I believe the 23% can control the 100%
I think the schizos and radicals are good at this that's why they tried to silence them or at least ((they)) tell them to take medicine etc. or outright blatantly lie about information even on something as little and puny as an imageboard (yep that's right it's called psychological cyber warfare)
>Hardware: an NVIDIA T4 (16GB) instance on GCP with 208GB of DRAM and 1.5TB of SSD.
>208GB of ram
welp my 64GB of ram is obsolete already
>tfw fell for the 64GB of RAM meme
has been for a while
What's that, 5K?
You can get 256 GB of Optane PMM for $150. A compatible CPU costs $25. This shit isn't out of reach anymore.
where can you get 256GB PMM for $150?
Maybe he means like renting it from one of those cloud services or something? I dunno.
Ebay.
https://www.ebay.com/itm/125753031151
that's p cool ngl. thanks anon
There's all sorts of cool tech shit you can get for lunch money on Ebay. Here's a 20 core Skylake Xeon to go with it for $19.
https://www.ebay.com/itm/155001436193
>mfw still on 16gb RAM
5700xt gang rise up
OH MY GOD I CAN TRAIN MY OWN CHATBOT WAIFU
IT'S HAPPENING BOYS
IT'S FUCKING HAPPENING
Once we get some actual sperg with talent and skills working to make it happen, yes. The rest of us will just sit here with our dicks in our hands waiting.
BOT could get in on that, BOT doesn't have farms. /gizz/ collab when
Looks like if I were to offer this as a service it would cost me about 10% of what OpenAI charges just for the electricity to run it, and I couldn't offer a very good AI model since OPT is non-commercial only. Also I'd have to buy a new PC since sometimes I want to use my GPU for vidya, and even at OAI prices it would take a couple of months to pay off a buttcoin castoff GPU.
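For anyone curious, the napkin math behind a claim like that looks something like this. Every number below is an assumption plugged in for illustration, not a measurement:
[code]
# Electricity cost per 1k generated tokens under assumed numbers.
WATTS = 400.0              # assumed full-system draw while generating
USD_PER_KWH = 0.15         # assumed electricity price
TOKENS_PER_SECOND = 10.0   # assumed aggregate throughput with large batches

usd_per_second = USD_PER_KWH * (WATTS / 1000.0) / 3600.0
usd_per_token = usd_per_second / TOKENS_PER_SECOND
print(f"~${usd_per_token * 1000:.4f} per 1k tokens in electricity")
# Divide by whatever the hosted API charges per 1k tokens to get a ratio;
# it swings a lot with the throughput you actually achieve.
[/code]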
Will this work on non-Nvidia GPUs?
this. i need it on my amd gpu
FAFO
Man the 4090 is a genuinely insane card, it's the only one from this gen where the ridiculous price seems kinda justified
6600 XT gets about 10% of the real world performance of a 4090 despite having only 3% of the theoretical performance. Maybe AMD drivers aren't as bad as we thought?
$200 (6600 XT) * 10 to match a 4090's performance = $2000
4090 saved money which isn't even supposed to be possible
You're supposed to lose money with each higher tier
But comparing to amdogshit makes 4090 look like a good deal
How did the 7900 XTX get beat out by the A770 & A750?
I don't have the details but it's likely some specialized accelerator fuckery
breasts
30B at 4 bits is great with 32GB vram
175B is borderline unusable.
I was under the impression that 4-bit quantization resulted in significant degradation in accuracy. Is that not the case?
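The paper's answer is group-wise quantization: each small group of values gets its own scale and zero point, so the 4-bit rounding error stays small. A minimal sketch of that kind of scheme (illustrative only; the group size and min/max scaling here are assumptions, not the repo's implementation):
[code]
# Group-wise 4-bit quantization sketch: per-group scale/zero-point keeps error low.
import numpy as np

def quantize_4bit(x, group_size=64):
    """Quantize a flat float array to 4-bit codes with per-group scale and min."""
    x = x.reshape(-1, group_size)
    mins = x.min(axis=1, keepdims=True)
    maxs = x.max(axis=1, keepdims=True)
    scales = (maxs - mins) / 15.0 + 1e-12        # 4 bits -> 16 levels (0..15)
    q = np.clip(np.round((x - mins) / scales), 0, 15).astype(np.uint8)
    return q, scales, mins                       # a real kernel packs 2 codes per byte

def dequantize_4bit(q, scales, mins):
    return q.astype(np.float32) * scales + mins

w = np.random.randn(4096 * 64).astype(np.float32)
q, s, m = quantize_4bit(w)
w_hat = dequantize_4bit(q, s, m).ravel()
print("mean abs error:", np.abs(w - w_hat).mean())
# A single scale for the whole tensor would be much lossier; small groups are
# what makes "negligible accuracy loss" plausible.
[/code]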
does that mean it'll work on the 2500k
I expect to see someone make a chatbot that is told to be as racist as possible under all circumstances
port this to Apple Silicon where RAM=VRAM at 400GB/s
it's dog slow and extremely expensive
Can OPT do zero-shot learning?
Holy cow yes it can.
Ok, but how does this like help me coom, do i like need to give it some input or something?
What's the point? You will never be able to afford the hardware required to run it.
>she doesn't have access to sea pirate networks
200 gigs of ram isn't that expensive
you could get a slightly older (circa 2016) dual xeon workstation with 256gb of ram for ~$1000 and throw some gpus in it. depending on your hardware picks, the total price to match the spec they benchmarked would be <$2000
People buy cars that cost $10k+. When AI is of significant use to us, people will buy rigs that can use the AI. Right now it's such a rapidly growing industry, with likely major hardware changes coming in the near future, that you'd be retarded to over-invest in it unless you can make back your money right now with the hardware investment. But yes, it's foreseeable that in the next 20 years people will spend $15k (paying $250/mo) for their sex-slave AI box.
the entire point of this is that it doesn't have outlandish requirements, it's not like you need a server farm to run this
consumer tier gayming setup can handle 30B, which can already produce great results with proper tuning
175B requires you to drop a few thousand bucks which is not "could never afford" but I'd rather save the cash for something else
And of course, the more people get their hands on this tech, the more tinkerers will have the chance to optimize it even further - thus making it even more accessible, and so on exponentially.
Okay let's be real, an RTX A6000 costs 5500€, which is okay if you want to monetize the AI product (plenty of opportunities, especially emulating e-companions).
An RTX 4090 for 2000€ with 4×64GB (256GB) of RAM is still significantly cheaper