from Hacker News

A guide to open-source LLM inference and performance

by varunshenoy on 11/20/23, 8:33 PM with 14 comments

  • by abcdabcd987 on 11/21/23, 12:45 AM

    Related discussion on serving finetuned LLMs: https://news.ycombinator.com/item?id=38196661
  • by llwu on 11/21/23, 2:56 AM

    Question on the "Batching memory-bound processes on a GPU" section - it says "This enables us to reuse parts of the model that we’ve already loaded into the GPU’s SRAM", but the 10 GB we are loading goes into HBM, right? How did we overcome the HBM <-> SRAM bottleneck?

    More generally, how can we find out the size of the SRAM?
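    (On the second question, a minimal sketch, assuming the CUDA runtime API is available; the filename is just illustrative. It queries the on-chip shared-memory ("SRAM") and HBM sizes the device reports:

        // query_gpu_memory.cu
        // Compile with: nvcc query_gpu_memory.cu -o query_gpu_memory
        #include <cstdio>
        #include <cuda_runtime.h>

        int main() {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, /*device=*/0);

            // The "SRAM" in the post corresponds to on-chip shared memory/L1 per SM;
            // the model weights themselves live in off-chip HBM (totalGlobalMem).
            printf("GPU: %s\n", prop.name);
            printf("HBM (global memory): %.1f GB\n", prop.totalGlobalMem / 1e9);
            printf("Shared memory per SM: %zu KB\n", prop.sharedMemPerMultiprocessor / 1024);
            printf("SM count: %d\n", prop.multiProcessorCount);
            printf("Total on-chip shared memory: %.2f MB\n",
                   prop.sharedMemPerMultiprocessor * (double)prop.multiProcessorCount / 1e6);
            printf("L2 cache: %.1f MB\n", prop.l2CacheSize / 1e6);
            return 0;
        }

    On current data-center GPUs this totals only tens of megabytes of SRAM versus tens of gigabytes of HBM, which suggests the batching win is about amortizing HBM reads of the weights across requests rather than keeping the whole model resident in SRAM.)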

  • by joaquincabezas on 11/21/23, 12:11 AM

    Thanks a lot for the material, Varun. Neat presentation with exhaustive computations that make it easy to follow. Question on the serving part: vLLM, DeepSpeed, TensorRT-LLM...? Thanks!
  • by bicepjai on 11/20/23, 11:38 PM

    That’s a really detailed explanation. Can we do something like this for M1 Ultra/M2 Ultra/M3 Max with large RAM?
  • by alanaan on 11/21/23, 3:11 AM

    Great post. Could you apply this same framework to optimize training as well?
  • by seth_ on 11/20/23, 10:57 PM

    love the deep dive here
  • by samspenc on 11/20/23, 8:52 PM

    Likely trending on the home page since this is directly relevant to LLM costs, i.e., questions like "how much would it cost to rebuild ChatGPT from scratch".