by abcdabcd987 on 11/8/23, 8:42 PM with 26 comments
by huac on 11/9/23, 4:20 AM
That's not hyperbole. Why is OpenAI able to charge so little for its APIs? I've heard CEOs of rival large LLM companies complain that OpenAI's prices would be a loss for them. But I think the margin is still positive: they can charge low API prices partly because they've invested more into managing the infra, sure, but most importantly because they get the best utilization out of their existing hardware.
If it costs everyone $X/GPU/hr to serve models, the company with the most throughput wins on price. In a world without finetunes, the most capable model, the one that can zero- or few-shot the most tasks, will get the most usage. Finetuned open models can reach parity with GPT on narrow tasks, but until now, having public providers serve them was expensive: your private finetune is only queried by you, not by everyone, so it's very expensive to serve on a per-token basis. With hot-swappable LoRA adapters, that calculus changes, and the cost per token can go way down. Super, super exciting!
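To make the shared-base / per-tenant-delta point concrete, here's a minimal PyTorch sketch (not the project's actual kernels; shapes and names are illustrative): the expensive base-weight matmul is shared across every request in the batch, while each request only adds a tiny low-rank correction from its own adapter.

    import torch

    # Minimal sketch of multi-tenant LoRA in one batch; sizes are illustrative.
    d_in, d_out, rank, batch = 4096, 4096, 16, 3

    W = torch.randn(d_out, d_in) / d_in**0.5        # shared base weight
    A = torch.randn(batch, rank, d_in) * 0.01       # per-request LoRA A_i
    B = torch.randn(batch, d_out, rank) * 0.01      # per-request LoRA B_i
    x = torch.randn(batch, d_in)                    # one token per request

    base = x @ W.T                                  # shared compute, amortized over all tenants
    delta = torch.einsum('bor,bri,bi->bo', B, A, x) # tiny per-adapter correction B_i(A_i x)
    y = base + delta
    print(y.shape)                                  # torch.Size([3, 4096])

The base matmul dominates the cost and is identical for every tenant, which is why per-token price can approach that of serving the vanilla model.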
by kcorbitt on 11/9/23, 12:11 AM
Really looking forward to these innovations becoming more widespread -- I expect we're very close to a world where training a LoRA on a one-off task like "review every HN post from the last 3 years and flag any of them that contain informed speculation about the architecture of GPT-4" will be easy, cheap and routine.
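For a sense of how routine this already is on the training side, here's a rough sketch with Hugging Face PEFT; the base model and hyperparameters are illustrative assumptions, not a recommendation.

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    # Illustrative base model; swap in whatever you actually use.
    base = "meta-llama/Llama-2-7b-hf"
    model = AutoModelForCausalLM.from_pretrained(base)
    tokenizer = AutoTokenizer.from_pretrained(base)

    config = LoraConfig(
        r=16,                                  # low-rank dimension
        lora_alpha=32,
        target_modules=["q_proj", "v_proj"],   # attach adapters to attention projections
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, config)
    model.print_trainable_parameters()         # typically well under 1% of the base model
    # ...train with your usual Trainer/dataset, then model.save_pretrained("my-adapter")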
by Palmik on 11/9/23, 5:11 AM
How hard would it be to adapt your kernels to work with the new-gen quants like AWQ or EXL2?
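Conceptually the two seem largely orthogonal, since the LoRA delta can stay in fp16 and be added after the quantized base matmul. A hand-wavy sketch (not the project's kernels; dequantize() is a hypothetical stand-in for an AWQ/EXL2-style int4 group dequant):

    import torch

    def dequantize(w_q, scales):
        # Placeholder for a real group-wise int4 dequant kernel.
        return w_q.float() * scales

    def lora_linear(x, w_q, scales, A, B, alpha=1.0):
        y = x @ dequantize(w_q, scales).T    # quantized base path
        y = y + alpha * ((x @ A.T) @ B.T)    # fp16 low-rank delta, A: (r, d_in), B: (d_out, r)
        return y

The hard part is presumably fusing the two paths into one efficient kernel rather than the math itself.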
by vlovich123 on 11/9/23, 2:30 AM
by yyding on 11/8/23, 9:23 PM
by j0057 on 11/9/23, 11:51 AM
by lmeyerov on 11/9/23, 11:51 AM
I'm curious if there is a quality argument to be made: imagine needing to finetune k different classifiers...
Before this work, we could train a single multi-label classifier by pooling the training sets and deploy it as one LoRA.
Now we can have k distinct classifiers and not risk them interfering with one another.
Any sense of when, in realistic scenarios, the quality of k distinct LoRAs would be better?
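For the deployment half of that comparison, the "k distinct LoRAs on one base model" pattern looks roughly like this with Hugging Face PEFT; the base model and adapter paths are made up for illustration.

    from transformers import AutoModelForSequenceClassification
    from peft import PeftModel

    # One shared base model, k separate classifier adapters (paths are hypothetical).
    base = AutoModelForSequenceClassification.from_pretrained("roberta-base")

    model = PeftModel.from_pretrained(base, "adapters/classifier-0", adapter_name="task-0")
    for i in range(1, 3):
        model.load_adapter(f"adapters/classifier-{i}", adapter_name=f"task-{i}")

    def classify(inputs, task: str):
        model.set_adapter(task)        # route each request to its own adapter
        return model(**inputs).logits

Per-request adapter switching like this is what batched LoRA serving makes cheap; whether the k separate heads actually win on quality over one pooled multi-label head is the empirical question.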
by kkielhofner on 11/8/23, 11:51 PM
Any thoughts as to how this would come together with serving frameworks like vLLM, lmdeploy, Triton Inference Server, etc?
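For what it's worth, later vLLM releases added multi-LoRA serving along these lines; a rough sketch of what that looks like (the adapter path is made up and exact arguments may differ between versions):

    from vllm import LLM, SamplingParams
    from vllm.lora.request import LoRARequest

    # Base model served once; each request can name its own LoRA adapter.
    llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)

    outputs = llm.generate(
        ["Summarize this ticket: ..."],
        SamplingParams(max_tokens=128),
        lora_request=LoRARequest("support-summarizer", 1, "/path/to/adapter"),  # hypothetical adapter
    )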
by junrushao1994 on 11/8/23, 8:54 PM
by ruihangl on 11/9/23, 12:25 AM
by busssard on 11/9/23, 1:15 PM