by jmorgan on 9/14/23, 2:42 PM with 16 comments
by maccam912 on 9/14/23, 3:37 PM
by heliophobicdude on 9/14/23, 7:09 PM
How does paging not worsen speed, though? If you're making more trips to memory, are you really just saving VRAM?
Also, I see that vLLM, which implements PagedAttention, also uses better scheduling. Wouldn't the speed improvements come from that instead, i.e., from not putting a request with an expected short input and output in the same batch as one with a big input and big output?
What are the results of using the sequence-length-aware scheduling alone, without the virtualization?
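For what it's worth, the "extra trip" in paged KV-cache management is an index lookup in a per-sequence block table, not additional traffic to VRAM; the attention kernel still reads the same key/value data, just through one level of indirection. Here's a toy sketch of that indirection (all names, sizes, and the pool layout are illustrative assumptions, not vLLM's actual implementation):

```python
import numpy as np

BLOCK_SIZE = 16   # tokens per KV-cache block (illustrative; vLLM uses a similar fixed block size)
HEAD_DIM = 8      # toy head dimension

# A shared pool of physical blocks. Each sequence owns a "block table"
# mapping its logical block index -> physical block index in the pool.
pool = np.zeros((64, BLOCK_SIZE, HEAD_DIM), dtype=np.float32)
free_blocks = list(range(64))

def append_token(block_table, seq_len, key_vec):
    """Write one token's key vector, allocating a new block only when the
    current block is full. Memory is claimed on demand, block by block,
    instead of reserving max-sequence-length contiguous space up front."""
    if seq_len % BLOCK_SIZE == 0:
        block_table.append(free_blocks.pop())
    phys = block_table[seq_len // BLOCK_SIZE]
    pool[phys, seq_len % BLOCK_SIZE] = key_vec
    return seq_len + 1

def read_token(block_table, pos):
    """Fetch a token's key vector: one table lookup, then a direct index."""
    return pool[block_table[pos // BLOCK_SIZE], pos % BLOCK_SIZE]

table, n = [], 0
for t in range(40):                  # 40 tokens -> ceil(40 / 16) = 3 blocks
    n = append_token(table, n, np.full(HEAD_DIM, float(t)))

assert len(table) == 3               # only 3 blocks allocated, not a full reservation
assert read_token(table, 39)[0] == 39.0
```

The point of the sketch: the saving is that unused reserved space (internal fragmentation) disappears, so more sequences fit in the same VRAM and batches can be larger, which is where the throughput gain comes from.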
by notpublic on 9/14/23, 5:28 PM