from Hacker News

Efficient Memory Management for Large Language Model Serving with PagedAttention

by jmorgan on 9/14/23, 2:42 PM with 16 comments

  • by maccam912 on 9/14/23, 3:37 PM

    Without understanding most of that paper, here's a question for someone who might know more: can PagedAttention make CPU inference faster too?
  • by heliophobicdude on 9/14/23, 7:09 PM

    Ah, I see. This isn't virtualizing the static weights so much as the variable-sized, data-dependent key-value caches, which are built up as you go through the sequence of tokens. Makes sense. (A rough sketch of the block-table idea is at the end of this comment.)

    Why doesn't paging hurt speed, though? If you're making more trips to memory, are you just trading speed for VRAM savings?

    Also, I see that vLLM, which implements PagedAttention, also uses better scheduling. Couldn't the speed improvements be coming from that instead, i.e. from not putting an expected short input and output in the same batch as a big input and big output?

    What are the results of using the sequence-length-aware scheduling alone, without the virtualization?
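
    For anyone picturing how the "virtualized" KV cache fits together: below is a minimal sketch of the block-table idea, assuming a fixed block size and a shared pool of physical KV blocks. BLOCK_SIZE, BlockPool, and SequenceKVCache are illustrative names, not vLLM's actual API. The block table is just an index that lives in GPU memory next to the cache, so the extra indirection is relatively cheap; the savings come mainly from not reserving a maximum-length contiguous buffer per sequence up front.

      # Illustrative sketch only, not vLLM's implementation.
      BLOCK_SIZE = 16  # tokens stored per KV block (assumed value)

      class BlockPool:
          """Shared pool of physical KV blocks (represented here as integer IDs)."""
          def __init__(self, num_blocks: int):
              self.free = list(range(num_blocks))

          def allocate(self) -> int:
              if not self.free:
                  # In a real server the scheduler would preempt or swap a sequence here.
                  raise MemoryError("out of KV blocks")
              return self.free.pop()

          def release(self, block_id: int) -> None:
              self.free.append(block_id)

      class SequenceKVCache:
          """Per-sequence block table: logical block index -> physical block ID."""
          def __init__(self, pool: BlockPool):
              self.pool = pool
              self.block_table = []   # grows only as tokens are actually generated
              self.num_tokens = 0

          def append_token(self) -> None:
              # Allocate a new physical block only when the current block fills up.
              if self.num_tokens % BLOCK_SIZE == 0:
                  self.block_table.append(self.pool.allocate())
              self.num_tokens += 1

          def free(self) -> None:
              for block_id in self.block_table:
                  self.pool.release(block_id)
              self.block_table.clear()

      pool = BlockPool(num_blocks=1024)
      seq = SequenceKVCache(pool)
      for _ in range(40):        # generate 40 tokens
          seq.append_token()
      print(seq.block_table)     # holds ceil(40 / 16) = 3 blocks, not a max-length reservation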

  • by notpublic on 9/14/23, 5:28 PM