from Hacker News

Punica: Serving multiple LoRA finetuned LLM as one

by abcdabcd987 on 11/8/23, 8:42 PM with 26 comments

by huac on 11/9/23, 4:20 AM
I think this is one of the most important possible works for open source LLM's, really glad y'all pushed this forward!
That's not hyperbole. Why is OpenAI able to charge so little for their API's? I have heard rival mega LLM company CEO's complain that OpenAI's prices would be a loss for their rivals. But I think it's still positive margin, and that they can charge low prices for API because they've invested more into managing the infra, sure, but most importantly because they have the best utilization of their existing hardware.
If it costs everyone $X/gpu/hr to serve models, the company that has the most throughput wins on price. In a world without finetunes, the most capable model, the one that can zero- or few-shot the most tasks will have the most usage. Finetuned open models can reach parity with GPT on narrow tasks, but until now, having public providers serve the models was expensive. Your private finetune is only going to be queried by you, not everyone, so it's super expensive to serve on a per token level. With hot swappable LoRA adapters, that calculus changes, and the cost per token can go way down. Super, super exciting!
by kcorbitt on 11/9/23, 12:11 AM
Awesome work! Here's a recent paper released yesterday, also focused on efficiently serving many LoRAs simultaneously: https://arxiv.org/abs/2311.03285
Really looking forward to these innovations becoming more widespread -- I expect we're very close to a world where training a LoRA on a one-off task like "review every HN post from the last 3 years and flag any of them that contain informed speculation about the architecture of GPT-4" will be easy, cheap and routine.
by Palmik on 11/9/23, 5:11 AM
This is amazing, and will unlock many possibilities. I just recently read the S-LoRA paper, which is related, but it's even better to have a working (and extremely efficient!) implementation.
How hard would it be to adapt your kernels to work with the new-gen quants like AWQ or EXL2?
by vlovich123 on 11/9/23, 2:30 AM
Am I correct in understanding that LoRA is basically a way to cheaply create “delta” LLMs that apply onto the main large one to create a specialization? In other words, this would obviate all the vector DB stuff that people are doing right?
by yyding on 11/8/23, 9:23 PM
Good job! I observed that you implemented many cuda kernels by yourselves. Just wondering your consideration or trade-off between implementating the kernels via pure CUDA code vs. implementing based on compiler like TVM/Triton.
by j0057 on 11/9/23, 11:51 AM
That name is easy to confuse with the unrelated LoRa and LoRaWAN.
by lmeyerov on 11/9/23, 11:51 AM
Super cool!
I'm curious if there is a quality argument to be made: imagine needing to finetune k different classifiers...
Before this work, we could train a single multi-label classifier by pooling the training sets, and deploy as 1 LoRa
Now, we can have k distinct classifiers, and not risk them interfering with one another
Any sense of, in realistic scenarios, when the quality of k distinct LoRas would be better?
by kkielhofner on 11/8/23, 11:51 PM
Nice!
Any thoughts as to how this would come together with serving frameworks like vLLM, lmdeploy, Triton Inference Server, etc?
by junrushao1994 on 11/8/23, 8:54 PM
This is great! Have you guys considered integrating with one of the existing systems?
by ruihangl on 11/9/23, 12:25 AM
Great work! I am curious that how much effort it would take to support LoRAs with different ranks?
by busssard on 11/9/23, 1:15 PM
there was a word on GPT4 just being 8 different GPT3 in a trenchcoat finetuned on different topics. If we can do this now with 8x finetuned Vicuna 13b for the price of running Vicuna once, this is huge!