Huggingface Text Generation Inference adds exllama support

Release v0.9.4 · huggingface/text-generation-inference

This is actually a pretty big deal: exllama is by far the most performant inference engine out there for CUDA. The strangest part is that the PR claims it also works for StarCoder, which is a non-LLaMA model:
https://github.com/huggingface/text-generation-inference/pull/553
So I'm extremely curious to see what this brings.
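For anyone wanting to try it, a minimal deployment sketch follows. This assumes the exllama kernels are picked up automatically when serving a GPTQ-quantized model with TGI's existing `--quantize gptq` option (the model ID below is just an illustrative placeholder, and the port/volume choices are arbitrary):

```shell
# Pull the release that ships the exllama kernels
docker pull ghcr.io/huggingface/text-generation-inference:0.9.4

# Serve a GPTQ-quantized model; the exllama path should kick in
# for the quantized linear layers (assumption based on the PR)
docker run --gpus all -p 8080:80 -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:0.9.4 \
  --model-id TheBloke/Llama-2-7B-GPTQ \
  --quantize gptq

# Query the server over TGI's standard /generate endpoint
curl http://localhost:8080/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "def fib(n):", "parameters": {"max_new_tokens": 32}}'
```

Whether the same path works for StarCoder-family checkpoints is exactly what the linked PR suggests, and would be worth verifying firsthand.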