Has anyone gotten 8K context with a 33B model on a 4090?
Unfortunately there's just no way. The KV cache actually scales linearly with context length, so at 8k it's 4 times larger than at 2k; for 33b that works out to about 13GB in fp16 (roughly 26GB if the cache is kept in fp32), and the 4-bit weights are already around 17GB, so it doesn't fit in 24GB even before other buffers.
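If you want to sanity-check the numbers yourself, here's a back-of-the-envelope sketch (assuming the published LLaMA-33B shapes of 60 layers and a 6656 hidden size, and an fp16 cache; bump bytes_per_elem to 4 for fp32):

```python
def kv_cache_bytes(n_tokens, n_layers=60, hidden=6656, bytes_per_elem=2):
    # 2x for keys and values; one vector of length `hidden` per layer per token
    return 2 * n_layers * hidden * bytes_per_elem * n_tokens

for ctx in (2048, 4096, 8192):
    print(f"{ctx:5d} tokens: {kv_cache_bytes(ctx) / 1e9:.1f} GB")
# 2048 tokens:  3.3 GB
# 4096 tokens:  6.5 GB
# 8192 tokens: 13.1 GB
```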
Wow that’s crazy. Would it be possible to offload the KV cache onto system RAM and keep model weights in VRAM or would that just slow everything down too much? I guess that’s kind of what llama.cpp does with GPU offload of layers? I’m still trying to figure out how this stuff actually works.
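As far as I know, none of the common backends let you park just the KV cache in system RAM while keeping all the weights in VRAM; what llama.cpp does is split whole transformer layers between GPU and CPU with its n_gpu_layers setting. A minimal sketch of that through the llama-cpp-python bindings, assuming a build compiled with GPU support (the model filename and layer count are just placeholders):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./wizardlm-33b.ggml.q4_0.bin",  # hypothetical local file
    n_ctx=2048,        # context window
    n_gpu_layers=40,   # how many of the 60 layers to keep in VRAM
)

out = llm("Q: What is the capital of France? A:", max_tokens=32)
print(out["choices"][0]["text"])
```

The fewer layers you keep on the GPU, the more VRAM you free up, but every token then has to pass through the CPU-resident layers, so it slows down quickly.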
I tried with WizardLM uncensored, but 8K seems to be too much for 4090, it runs out of VRAM and dies.
I also tried with just 4K, but that doesn't seem to work either.
When I run it with 2K, it doesn't crash but the output is garbage.
I hope llama.cpp supports SuperHOT at some point. I never use GPTQ, but I may need to make an exception to try out the larger context sizes. Are you using exllama? Curious why you're getting garbage output.
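One likely culprit for the garbage at 2K: the SuperHOT 8K merges were finetuned with interpolated ("compressed") RoPE positions, so the loader has to apply the same compression factor (exposed as compress_pos_emb in exllama / text-generation-webui, if I recall the name right); without it, the positions at inference don't match what the model saw during finetuning and the output falls apart. A rough numpy sketch of the position-interpolation idea, not any particular loader's implementation (the head dim of 128 and scale factor of 4 are just the usual LLaMA/SuperHOT values):

```python
import numpy as np

def rope_angles(position, head_dim=128, base=10000.0, scale=1.0):
    # Standard RoPE frequencies; SuperHOT-style interpolation divides the
    # position index by `scale` (e.g. 4.0 to stretch a 2k-trained model to 8k)
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    return (position / scale) * inv_freq

# With scale=4, position 8000 produces the same angles the model saw at
# position 2000 within its original 2k training range:
print(rope_angles(8000, scale=4.0)[:4])
print(rope_angles(2000, scale=1.0)[:4])
```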