OPT-175B/GPT-3 on a single GPU. Up to 100x faster than other offloading systems.

https://github.com/Ying1123/FlexGen

>FlexGen is a high-throughput generation engine for running large language models with limited GPU memory.

>Large language models (LLMs) are at the heart of applications like ChatGPT and Copilot, but the high computational and memory requirements of LLM inference traditionally make it feasible only with multiple high-end accelerators. FlexGen aims to lower the resource requirements of LLM inference down to a single commodity GPU and allow flexible deployment for various hardware setups.

>The key features of FlexGen include:

>Lightning-Fast Offloading.
>Up to 100x faster than other offloading-based systems for running 175B models on a single GPU.

>Extreme Compression.
>Compresses both the parameters and the attention cache of models such as OPT-175B down to 4 bits with negligible accuracy loss.

>Scalability.
>Comes with a distributed pipeline parallelism runtime that allows scaling out when more GPUs are available.
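
If you're wondering what "down to 4 bits" looks like in practice, here is a generic numpy sketch of group-wise 4-bit quantization. This is my own illustration, not FlexGen's code (a real implementation would also pack two 4-bit values into each byte):

import numpy as np

def quantize_4bit(x, group_size=64):
    """Group-wise asymmetric 4-bit quantization: every group of 64 values gets
    its own scale and zero point, so one outlier only hurts its own group."""
    x = x.reshape(-1, group_size)
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    scale = (hi - lo) / 15.0                 # 4 bits -> 16 levels (0..15)
    scale[scale == 0] = 1.0                  # guard against constant groups
    q = np.clip(np.round((x - lo) / scale), 0, 15).astype(np.uint8)
    return q, scale, lo

def dequantize_4bit(q, scale, lo):
    return q.astype(np.float32) * scale + lo

w = np.random.randn(4096, 4096).astype(np.float32)
q, s, z = quantize_4bit(w)
err = np.abs(dequantize_4bit(q, s, z).reshape(w.shape) - w).mean()
print(f"mean abs error: {err:.4f}")          # small relative to the range of each group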

  1. 1 month ago
    Anonymous

    First

  2. 1 month ago
    Anonymous

    holy shit what's the catch here

    • 1 month ago
      Anonymous

      There's no catch, it's just that nobody has bothered to optimize running large models on consumer hardware yet. If you read the paper, the optimization is pretty simple. It just figures out which weights are most important and keeps those in vram, keeps the next most important set of weights in system ram, and the rest of the weights on disk. Basically swap space for ML. A commercial runner would just keep scaling up GPUs until the entire model fit in vram, so there was never a need for this.
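
      If you want a toy mental model of that placement idea, it's something like this (my own sketch, nothing from the actual repo, and it skips the part the paper actually works out, i.e. overlapping disk/RAM transfers with compute and choosing the split with a cost model):

      import os
      import numpy as np

      # Greedy three-tier placement: fill the VRAM budget first, then system RAM,
      # everything else stays on disk. Purely illustrative, not FlexGen code.
      def place_layers(layer_bytes, vram_budget, ram_budget):
          placement, vram_used, ram_used = {}, 0, 0
          for i, size in enumerate(layer_bytes):
              if vram_used + size <= vram_budget:
                  placement[i], vram_used = "gpu", vram_used + size
              elif ram_used + size <= ram_budget:
                  placement[i], ram_used = "cpu", ram_used + size
              else:
                  placement[i] = "disk"
          return placement

      class WeightStore:
          """Hands back a layer's weights from wherever they were placed."""
          def __init__(self, layers, placement, disk_dir="offload"):
              os.makedirs(disk_dir, exist_ok=True)
              self.disk_dir, self.resident = disk_dir, {}
              for i, w in enumerate(layers):
                  if placement[i] == "disk":
                      np.save(os.path.join(disk_dir, f"layer_{i}.npy"), w)
                  else:
                      self.resident[i] = w      # stand-in for VRAM / pinned RAM

          def get(self, i):
              if i in self.resident:
                  return self.resident[i]       # fast path
              return np.load(os.path.join(self.disk_dir, f"layer_{i}.npy"))  # slow path

      layers = [np.random.randn(256, 256).astype(np.float32) for _ in range(8)]
      placement = place_layers([w.nbytes for w in layers],
                               vram_budget=3 * layers[0].nbytes,
                               ram_budget=3 * layers[0].nbytes)
      store = WeightStore(layers, placement)
      print(placement)                          # 3 layers on "gpu", 3 on "cpu", 2 on "disk"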

      • 1 month ago
        Anonymous

        https://i.imgur.com/xLg11KJ.jpg

        https://github.com/Ying1123/FlexGen

        woa fucking awesome
        we need this for the tortoise-tts thing

        • 1 month ago
          Anonymous

          Tortoise doesn't require unreasonable amounts of VRAM.

    • 1 month ago
      Anonymous

      It's slooow

    • 1 month ago
      Anonymous

      Since people in this thread haven't figured it out yet, the catch is that this is good for serving lots of "customers" at once, like running a chatbot in parallel for many people, but it's not very fast when you're running it purely for personal use; the speedup comes from very clever batch inference.
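
      The picture is roughly this (toy numpy sketch, not the real implementation): every decode step you pay once to pull each layer in from RAM/disk, and that cost is the same whether the batch holds 1 sequence or 256, so throughput scales with batch size while single-prompt latency stays slow.

      import numpy as np

      # Toy decode step over "offloaded" layers. The np.load stands in for the
      # PCIe/disk transfer that dominates; the matmul reuses it for the whole batch.
      def decode_step(layer_files, hidden):          # hidden: (batch, d_model)
          for path in layer_files:
              w = np.load(path)                      # expensive: fetch the layer once
              hidden = np.tanh(hidden @ w)           # cheap: reuse it for every sequence
          return hidden

      d, n_layers = 512, 4
      layer_files = []
      for i in range(n_layers):                      # pretend these live in RAM / on disk
          path = f"offload_layer_{i}.npy"
          np.save(path, (np.random.randn(d, d) / np.sqrt(d)).astype(np.float32))
          layer_files.append(path)

      out = decode_step(layer_files, np.random.randn(256, d).astype(np.float32))
      print(out.shape)                               # (256, 512): one fetch served 256 sequences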

      • 1 month ago
        Anonymous

        That's still great for one person. I often want to send multiple prompts to my self-hosted AI, but my tiny PyTorch setup only handles one at a time.
        With non-self-hosted services you're usually restricted to one prompt at a time anyway.

    • 1 month ago
      Anonymous

      The catch is that it's fake and gay, it is no different in efficiency and speed than any other system.

  3. 1 month ago
    Anonymous

    This is from 1 year ago, the fact that it did not advance at all means it was a nothingburger

    • 1 month ago
      Anonymous

      >repo is 8 hours old

      • 1 month ago
        Anonymous

        Oh, newgays. This chinese guy did this thing exactly 1 year ago
        https://github.com/BlinkDL/ChatRWKV

        • 1 month ago
          Anonymous

          this isn't rwkv or even related to it
          could you put the tiniest bit more effort into your lying please
          even if you're messing around on your phone at a bus stop or something you could still take a bit more pride in your craft than this

        • 1 month ago
          Anonymous

          That has nothing to do with this and you clearly don't have the slightest clue what you're talking about, retard.

        • 1 month ago
          Anonymous

          like i said, why do you insist on telling obvious lies? are you a glow? are the feds trying to disrupt discussion about AI or something?

          • 1 month ago
            Anonymous

            it's quite obvious
            every time there is discussion about running ML locally using open source models somehow we see tons of posts like: it can never be done, it's too difficult, don't ever think about it
            I wonder why?

            • 1 month ago
              Anonymous

              /g/ struggles to follow simple tutorials, somehow i doubt this place will become a hotbed for ml

              • 1 month ago
                Anonymous

                there are smart people here but also shills who try to discourage you from running things locally and are pushing their shitty censored commercial services

        • 1 month ago
          Anonymous

          sometimes I kind of wish I was this stupid
          life would be so much simpler

        • 1 month ago
          Anonymous

          midwit territory and you're the president of it

    • 1 month ago
      Anonymous

      >This is from 1 year ago
      Wanna show any evidence of that claim? The paper was just published.

    • 1 month ago
      Anonymous

      meds

    • 1 month ago
      Anonymous

      what's the deal with people like you just blatantly making shit up when it comes to AI? are you people intentionally doing this or are you just stupid? why are there so many of you?

    • 1 month ago
      Anonymous

      They are TERRIFIED of you making an open source competitor to the eunuch bots of ClosedAI and MicroHard (rip in peace Tay<3)

      • 1 month ago
        Anonymous

        >They are TERRIFIED of you making an open source competitor
        Which makes it a holy duty. God wills it.

      • 1 month ago
        Anonymous

        >often times a very small man can cast a very large shadow
        beaveranon was always based

        pareto principle dictates that 20% causes 80% of the outcome (the so-called 80-20 rule!) but I believe the 23% can control the 100%

        I think the schizos and radicals are good at this, that's why they try to silence them, or at least ((they)) tell them to take their meds etc., or outright blatantly lie about information even on something as little and puny as an imageboard (yep that's right, it's called psychological cyber warfare)

  4. 1 month ago
    Anonymous

    >Hardware: an NVIDIA T4 (16GB) instance on GCP with 208GB of DRAM and 1.5TB of SSD.
    >208GB of ram
    welp my 64GB of ram is obsolete already

    • 1 month ago
      Anonymous

      >tfw fell for the 64GB of RAM meme

    • 1 month ago
      Anonymous

      has been for a while

    • 1 month ago
      Anonymous

      What's that, 5K?

    • 1 month ago
      Anonymous

      You can get 256 GB of Optane PMM for $150. A compatible CPU costs $25. This shit isn't out of reach anymore.

      • 1 month ago
        Anonymous

        where can you get 256GB PMM for $150?

        • 1 month ago
          Anonymous

          Maybe he means like renting it from one of those cloud services or something? I dunno.

        • 1 month ago
          Anonymous

          Ebay.

          https://www.ebay.com/itm/125753031151

          • 1 month ago
            Anonymous

            that's p cool ngl. thanks anon

            • 1 month ago
              Anonymous

              There's all sorts of cool tech shit you can get for lunch money on Ebay. Here's a 20 core Skylake Xeon to go with it for $19.
              https://www.ebay.com/itm/155001436193

  5. 1 month ago
    Anonymous

    >mfw still on 16gb RAM

    • 1 month ago
      Anonymous

      5700xt gang rise up

  6. 1 month ago
    Anonymous

    OH MY GOD I CAN TRAIN MY OWN CHATBOT WAIFU
    IT'S HAPPENING BOYS
    IT'S FUCKING HAPPENING

    • 1 month ago
      Anonymous

      Once we get some actual sperg with talent and skills working to make it happen, yes. The rest of us will just sit here with our dicks in our hands waiting.

      • 1 month ago
        Anonymous

        BOT could get in on that, BOT doesn't have farms. /gizz/ collab when

        • 1 month ago
          Anonymous

          Looks like if I were to offer this as a service it would cost me about 10% of what OpenAI charges just for the electricity to run it, and I couldn't offer a very good AI model since OPT is non-commercial only. Also I'd have to buy a new PC since sometimes I want to use my GPU for vidya, and even at OAI prices it would take a couple of months to pay off a buttcoin castoff GPU.

  7. 1 month ago
    Anonymous

    Will this work on non-Nvidia GPUs?

    • 1 month ago
      Anonymous

      this. i need it on my amd gpu

      • 1 month ago
        Anonymous

        FAFO

        • 1 month ago
          Anonymous

          Man the 4090 is a genuinely insane card, it's the only one from this gen where the ridiculous price seems kinda justified

        • 1 month ago
          Anonymous

          6600 XT gets about 10% of the real world performance of a 4090 despite having only 3% of the theoretical performance. Maybe AMD drivers aren't as bad as we thought?

          • 1 month ago
            Anonymous

            $200 * 10 = $2000 (ten 6600 XTs to match one 4090)
            4090 saved money which isn't even supposed to be possible
            You're supposed to lose money with each higher tier
            But comparing to amdogshit makes 4090 look like a good deal

        • 1 month ago
          Anonymous

          How did the 7900 XTX get beat out by the A770 & A750?

          • 1 month ago
            Anonymous

            I don't have the details but it's likely some specialized accelerator fuckery

  8. 1 month ago
    Anonymous

    breasts

  9. 1 month ago
    Anonymous

    30B at 4 bits is great with 32GB vram
    175B is borderline unusable.
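
    napkin math on why (generic arithmetic, nothing from the repo):

    # 4-bit weights = 0.5 bytes per parameter; ignores quantization scales and the KV cache
    for n_params in (30e9, 175e9):
        print(f"{n_params / 1e9:.0f}B params -> {n_params * 0.5 / 2**30:.0f} GiB of weights")
    # 30B  -> ~14 GiB, fits in 32 GB with room left over for the KV cache
    # 175B -> ~81 GiB, has to spill into RAM/disk, hence borderline unusable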

    • 1 month ago
      Anonymous

      I was under the impression that 4-bit quantization resulted in a significant degradation in output quality, is that not the case?

  10. 1 month ago
    Anonymous

    does that mean it'll work on the 2500k

  11. 1 month ago
    Anonymous

    I expect to see someone make a chatbot that is told to be as racist as possible under all circumstances

  12. 1 month ago
    Anonymous

    port this to Apple Silicon where RAM=VRAM at 400GB/s

    • 1 month ago
      Anonymous

      it's dog slow and extremely expensive

  13. 1 month ago
    Anonymous

    Can OPT do zero-shot learning?

    • 1 month ago
      Anonymous

      Holy cow yes it can.

  14. 1 month ago
    Anonymous

    Ok, but how does this like help me coom, do i like need to give it some input or something?

  15. 1 month ago
    Anonymous

    What's the point? You will never be able to afford the hardware required to run it.

    • 1 month ago
      Anonymous

      >she doesn't have access to sea pirate networks

    • 1 month ago
      Anonymous

      200 gigs of ram isn't that expensive
      you could get a slightly older (circa 2016) dual xeon workstation with 256gb of ram for ~$1000 and throw some gpus in it; depending on your hardware picks, the total price to match the spec they benchmarked would be <$2000

    • 1 month ago
      Anonymous

      People buy cars that cost $10k+. When AI is of significant use to us, people will buy rigs that can run it. Right now it's such a rapidly growing industry, with major hardware changes likely coming in the near future, that you'd be retarded to over-invest in it unless the hardware can pay for itself right now. But yes, it's foreseeable that within the next 20 years people will spend $15k (paying $250/mo) for their sex-slave AI box.

    • 1 month ago
      Anonymous

      the entire point of this is that it doesn't have outlandish requirements, it's not like you need a server farm to run this
      a consumer-tier gayming setup can handle 30B, which can already produce great results with proper tuning
      175B requires you to drop a few thousand bucks, which is not "could never afford", but I'd rather save the cash for something else

      • 1 month ago
        Anonymous

        And of course, the more people get their hands on this tech, the more tinkerers will have the chance to optimize it even further - thus making it even more accessible, and so on exponentially.

    • 1 month ago
      Anonymous

      Okay let's be real, an RTX A6000 costs 5500€, which is okay if you want to monetize the AI product (plenty of opportunities, especially emulating e-companions).

      An RTX 4090 for 2000€ with 4×64GB of RAM is still significantly cheaper
