OPT-175B/GPT-3 on a single GPU. Up to 100x faster than other offloading systems.

https://github.com/Ying1123/FlexGen

>FlexGen is a high-throughput generation engine for running large language models with limited GPU memory.

>Large language models (LLMs) are at the heart of applications like ChatGPT and Copilot, but the high computational and memory requirements of LLM inference traditionally make it feasible only with multiple high-end accelerators. FlexGen aims to lower the resource requirements of LLM inference down to a single commodity GPU and allow flexible deployment for various hardware setups.

>The key features of FlexGen include:

>Lightning-Fast Offloading.
>Up to 100x faster than other offloading-based systems for running 175B models on a single GPU.

>Extreme Compression.
>Compress both the parameters and attention cache of models, such as OPT-175B, down to 4 bits with negligible accuracy loss.

>Scalability.
>Comes with a distributed pipeline parallelism runtime to allow scaling when more GPUs are available.
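
If you're wondering what compressing to 4 bits looks like in practice, here's a rough sketch of group-wise quantization. The group size and function names are made up for illustration; this isn't FlexGen's actual code.

# Group-wise 4-bit quantization: one scale/offset per group of values.
import numpy as np

def quantize_4bit(x, group_size=64):
    x = x.reshape(-1, group_size)
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / 15.0   # 4 bits -> 16 levels (0..15)
    q = np.clip(np.round((x - lo) / scale), 0, 15).astype(np.uint8)
    return q, scale, lo

def dequantize_4bit(q, scale, lo):
    return (q.astype(np.float32) * scale + lo).reshape(-1)

w = np.random.randn(4096).astype(np.float32)       # stand-in for a weight tensor
q, s, z = quantize_4bit(w)
print(np.abs(dequantize_4bit(q, s, z) - w).max())   # small reconstruction error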

  1. 1 year ago
    Anonymous

    First

  2. 1 year ago
    Anonymous

    holy shit what's the catch here

    • 1 year ago
      Anonymous

      There's no catch, it's just that nobody had bothered to optimize running large models on consumer hardware yet. If you read the paper, the optimization is pretty simple: it works out which weights and cache it can keep in VRAM, puts the next tier in system RAM, and leaves the rest on disk, streaming things in as they're needed. Basically swap space for ML. A commercial operator would just keep adding GPUs until the entire model fit in VRAM, so there was never a need for this.
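
      Rough sketch of that tiering idea (the capacities and per-layer sizes below are made up, this isn't FlexGen's actual scheduler):

      # Toy three-tier placement: fill VRAM first, then system RAM, then disk.
      def place_layers(layer_sizes_gb, vram_gb=16.0, ram_gb=208.0):
          placement, used_vram, used_ram = {}, 0.0, 0.0
          for i, size in enumerate(layer_sizes_gb):
              if used_vram + size <= vram_gb:
                  placement[i] = "vram"
                  used_vram += size
              elif used_ram + size <= ram_gb:
                  placement[i] = "ram"
                  used_ram += size
              else:
                  placement[i] = "disk"
          return placement

      # e.g. ~96 transformer layers of roughly 3.6 GB each (OPT-175B in fp16)
      print(place_layers([3.6] * 96))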

      • 1 year ago
        Anonymous

        https://i.imgur.com/xLg11KJ.jpg

        woa fricking awesome
        we need this for the tortoise-tts thing

        • 1 year ago
          Anonymous

          Tortoise doesn't require unreasonable amounts of VRAM.

    • 1 year ago
      Anonymous

      It's slooow

    • 1 year ago
      Anonymous

      Since people in this thread haven't figured it out yet: the catch is that it's optimized for throughput, not latency. It does very clever batch inference, so it's good for serving lots of "customers" at once, like running a chatbot in parallel for many people, but it's not very fast when you're running a single prompt purely for personal use.
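
      Toy numbers to show the shape of the tradeoff (all made up, not measured from FlexGen):

      # With offloading, every decoding step pays a fixed cost to stream weights
      # in from RAM/disk; batching amortizes that cost across many sequences.
      def tokens_per_second(batch_size, weight_io_s=2.0, compute_per_seq_s=0.01):
          step_time = weight_io_s + batch_size * compute_per_seq_s
          return batch_size / step_time

      print(tokens_per_second(1))    # ~0.5 tok/s total: painful for one user
      print(tokens_per_second(256))  # ~56 tok/s aggregate across the batch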

      • 1 year ago
        Anonymous

        That's still great for one person. I often want to send multiple prompts to my self-hosted AI, but my tiny PyTorch setup only handles one at a time.
        With non-self-hosted services you're usually restricted to one prompt at a time anyway.

    • 1 year ago
      Anonymous

      The catch is that it's fake and gay, it is no different in efficiency and speed than any other system.

  3. 1 year ago
    Anonymous

    This is from 1 year ago; the fact that it did not advance at all means it was a nothingburger

    • 1 year ago
      Anonymous

      >repo is 8 hours old

      • 1 year ago
        Anonymous

        Oh, newbies. This Chinese guy did this thing exactly 1 year ago
        https://github.com/BlinkDL/ChatRWKV

        • 1 year ago
          Anonymous

          this isn't rwkv or even related to it
          could you put the tiniest bit more effort into your lying please
          even if you're messing around on your phone at a bus stop or something you could still take a bit more pride in your craft than this

        • 1 year ago
          Anonymous

          That has nothing to do with this and you clearly don't have the slightest clue what you're talking about, moron.

        • 1 year ago
          Anonymous

          like i said, why do you insist on telling obvious lies? are you a glow? are the feds trying to disrupt discussion about AI or something?

          • 1 year ago
            Anonymous

            it's quite obvious
            every time there is discussion about running ML locally using open source models somehow we see tons of posts like: it can never be done, it's too difficult, don't ever think about it
            I wonder why?

            • 1 year ago
              Anonymous

              g struggles to follow simple tutorials, somehow i doubt this place will become a hotbed for ml

              • 1 year ago
                Anonymous

                there are smart people here but also shills who try to discourage you from running things locally and are pushing their shitty censored commercial services

        • 1 year ago
          Anonymous

          sometimes I kind of wish I was this stupid
          life would be so much simpler

        • 1 year ago
          Anonymous

          midwit territory and you're the president of it

    • 1 year ago
      Anonymous

      >This is from 1 year ago
      Wanna show any evidence of that claim? The paper was just published.

    • 1 year ago
      Anonymous

      meds

    • 1 year ago
      Anonymous

      what's the deal with people like you just blatantly making shit up when it comes to AI? are you people intentionally doing this or are you just stupid? why are there so many of you?

    • 1 year ago
      Anonymous

      They are TERRIFIED of you making an open source competitor to the eunuch bots of ClosedAI and MicroHard (rip in peace Tay<3)

      • 1 year ago
        Anonymous

        >They are TERRIFIED of you making an open source competitor
        Which makes it a holy duty. God wills it.

      • 1 year ago
        Anonymous

        >often times a very small man can cast a very large shadow
        beaveranon was always based

        the Pareto principle dictates that 20% causes 80% of the outcome (the so-called 80-20 rule!) but I believe the 23% can control the 100%

        I think the schizos and radicals are good at this that's why they tried to silence them or at least ((they)) tell them to take medicine etc. or outright blatantly lie about information even on something as little and puny as an imageboard (yep that's right it's called psychological cyber warfare)

  4. 1 year ago
    Anonymous

    >Hardware: an NVIDIA T4 (16GB) instance on GCP with 208GB of DRAM and 1.5TB of SSD.
    >208GB of ram
    welp my 64GB of ram is obsolete already

    • 1 year ago
      Anonymous

      >tfw fell for the 64GB of RAM meme

    • 1 year ago
      Anonymous

      has been for a while

    • 1 year ago
      Anonymous

      What's that, 5K?

    • 1 year ago
      Anonymous

      You can get 256 GB of Optane PMM for $150. A compatible CPU costs $25. This shit isn't out of reach anymore.

      • 1 year ago
        Anonymous

        where can you get 256GB PMM for $150?

        • 1 year ago
          Anonymous

          Maybe he means like renting it from one of those cloud services or something? I dunno.

        • 1 year ago
          Anonymous

          Ebay.

          https://www.ebay.com/itm/125753031151

          • 1 year ago
            Anonymous

            that's p cool ngl. thanks anon

            • 1 year ago
              Anonymous

              There's all sorts of cool tech shit you can get for lunch money on Ebay. Here's a 20 core Skylake Xeon to go with it for $19.
              https://www.ebay.com/itm/155001436193

  5. 1 year ago
    Anonymous

    >mfw still on 16gb RAM

    • 1 year ago
      Anonymous

      5700xt gang rise up

  6. 1 year ago
    Anonymous

    OH MY GOD I CAN TRAIN MY OWN CHATBOT WAIFU
    IT'S HAPPENING BOYS
    IT'S FRICKING HAPPENING

    • 1 year ago
      Anonymous

      Once we get some actual sperg with talent and skills working to make it happen, yes. The rest of us will just sit here with our dicks in our hands waiting.

      • 1 year ago
        Anonymous

        BOT could get in on that, BOT doesn't have farms. /gizz/ collab when

        • 1 year ago
          Anonymous

          Looks like if I were to offer this as a service it would cost me about 10% of what OpenAI charges just for the electricity to run it, and I couldn't offer a very good AI model since OPT is non-commercial only. Also I'd have to buy a new PC since sometimes I want to use my GPU for vidya, and even at OAI prices it would take a couple of months to pay off a buttcoin castoff GPU.

  7. 1 year ago
    Anonymous

    Will this work on non-Nvidia GPUs?

    • 1 year ago
      Anonymous

      this. i need it on my amd gpu

      • 1 year ago
        Anonymous

        FAFO

        • 1 year ago
          Anonymous

          Man the 4090 is a genuinely insane card, it's the only one from this gen where the ridiculous price seems kinda justified

        • 1 year ago
          Anonymous

          6600 XT gets about 10% of the real world performance of a 4090 despite having only 3% of the theoretical performance. Maybe AMD drivers aren't as bad as we thought?

          • 1 year ago
            Anonymous

            A 6600 XT is ~$200, so matching a 4090 at 10% of its performance each would take ten of them: 200 * 10 = 2000
            The 4090 saved money, which isn't even supposed to be possible
            You're supposed to lose money with each higher tier
            But comparing to amdogshit makes the 4090 look like a good deal

        • 1 year ago
          Anonymous

          How did the 7900 XTX get beat out by the A770 & A750?

          • 1 year ago
            Anonymous

            I don't have the details but it's likely some specialized accelerator frickery

  8. 1 year ago
    Anonymous

    breasts

  9. 1 year ago
    Anonymous

    30B at 4 bits is great with 32GB of VRAM
    175B is borderline unusable.
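
    Back-of-the-envelope math on why (4 bits = half a byte per parameter, overhead ignored):

    # Weight memory at 4 bits per parameter.
    for params_b in (30, 175):
        print(f"{params_b}B params -> ~{params_b * 0.5:.0f} GB of weights")
    # 30B  -> ~15 GB: fits in 32GB of VRAM with room left for the KV cache
    # 175B -> ~88 GB: has to spill into system RAM/disk, hence the crawl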

    • 1 year ago
      Anonymous

      I was under the impression that 4-bit quantization resulted in significant degradation in output quality, is that not the case?

  10. 1 year ago
    Anonymous

    does that mean it'll work on the 2500k

  11. 1 year ago
    Anonymous

    I expect to see someone make a chatbot that is told to be as racist as possible under all circumstances

  12. 1 year ago
    Anonymous

    port this to Apple Silicon where RAM = VRAM at 400GB/s

    • 1 year ago
      Anonymous

      it's dog slow and extremely expensive

  13. 1 year ago
    Anonymous

    Can OPT do zero-shot learning?

    • 1 year ago
      Anonymous

      Holy cow yes it can.

  14. 1 year ago
    Anonymous

    Ok, but how does this like help me coom, do i like need to give it some input or something?

  15. 1 year ago
    Anonymous

    What's the point? You will never be able to afford the hardware required to run it.

    • 1 year ago
      Anonymous

      >she doesn't have access to sea pirate networks

    • 1 year ago
      Anonymous

      200 gigs of RAM isn't that expensive
      you could get a slightly older (circa 2016) dual-Xeon workstation with 256GB of RAM for ~$1000 and throw some GPUs in it; depending on your hardware picks, the total price to match the spec they benchmarked would be <$2000

    • 1 year ago
      Anonymous

      People buy cars that cost $10k+. When AI is of significant use to us, people will buy rigs that can run it. Right now it's such a rapidly growing industry, with major hardware changes likely in the near future, that you'd be moronic to over-invest unless the hardware would pay for itself right now. But yes, it's foreseeable that within the next 20 years people will spend $15k (paying $250/mo) for their sex-slave AI box.

    • 1 year ago
      Anonymous

      the entire point of this is that it doesn't have outlandish requirements; it's not like you need a server farm to run this
      a consumer-tier gayming setup can handle 30B, which can already produce great results with proper tuning
      175B requires you to drop a few thousand bucks, which is not "could never afford" territory, but I'd rather save the cash for something else

      • 1 year ago
        Anonymous

        And of course, the more people get their hands on this tech, the more tinkerers will have the chance to optimize it even further - thus making it even more accessible, and so on exponentially.

    • 1 year ago
      Anonymous

      Okay let's be real, an RTX A6000 costs 5500€, which is okay if you want to monetize the AI product (plenty of opportunities, especially emulating e-companions).

      An RTX 4090 for 2000€ with 4×64GB of RAM is still significantly cheaper
