Huggingface Text Generation Inference adds exllama support

Release v0.9.4 · huggingface/text-generation-inference

This is actually a pretty big deal: exllama is by far the most performant inference engine out there for CUDA. The strangest part is that the PR claims it also works for StarCoder, which is a non-LLaMA model:
https://github.com/huggingface/text-generation-inference/pull/553
So I'm extremely curious to see what this brings.
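For anyone wanting to try it, a minimal deployment sketch follows. This assumes the exllama kernels are picked up automatically when serving a GPTQ-quantized model with TGI's existing `--quantize gptq` option (the model ID below is just an illustrative placeholder, and the port/volume choices are arbitrary):

```shell
# Pull the release that ships the exllama kernels
docker pull ghcr.io/huggingface/text-generation-inference:0.9.4

# Serve a GPTQ-quantized model; the exllama path should kick in
# for the quantized linear layers (assumption based on the PR)
docker run --gpus all -p 8080:80 -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:0.9.4 \
  --model-id TheBloke/Llama-2-7B-GPTQ \
  --quantize gptq

# Query the server over TGI's standard /generate endpoint
curl http://localhost:8080/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "def fib(n):", "parameters": {"max_new_tokens": 32}}'
```

Whether the same path works for StarCoder-family checkpoints is exactly what the linked PR suggests, and would be worth verifying firsthand.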