But since it takes 10% of the space (VRAM, etc.), it sounds like they could just start with a larger model and still come out ahead.
There's actually a perplexity improvement parameter-for-parameter for BitNet-1.58, and it grows as the model scales up.
So yes, post-training quantization does hurt perplexity, but if you train with quantization in from the start, it comes out better than full precision (FP).
Which makes sense through the lens of the superposition hypothesis, where the weights are actually representing a hyperdimensional virtual vector space. If the weights have too much precision, competing features might compromise on fuzzier representations instead of restructuring the virtual network around better-matching nodes.
Looking at the data so far, constrained weight precision is probably going to be the future of pretraining within a generation or two.
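For anyone wondering what the ternary part looks like mechanically, here's a rough sketch of absmean rounding to {-1, 0, +1}, which is my understanding of the scheme described for BitNet b1.58. The function names are my own, not from any library, and this only shows the rounding step: actual quantization-aware training also keeps full-precision latent weights and passes gradients through the rounding, which isn't shown here. The second function is just the back-of-envelope math behind the "10% of the space" figure, assuming a 16-bit baseline.

```python
import math


def absmean_ternary(weights, eps=1e-8):
    """Round a list of floats to {-1, 0, +1} using one shared scale.

    Illustrative helper (hypothetical name): scale by the mean absolute
    value, round to the nearest integer, then clip to [-1, 1].
    """
    gamma = sum(abs(w) for w in weights) / len(weights)  # mean |w|
    quantized = [max(-1, min(1, round(w / (gamma + eps)))) for w in weights]
    return quantized, gamma


def ternary_size_ratio(baseline_bits=16):
    """Ideal storage of a 3-level weight relative to an assumed baseline width."""
    return math.log2(3) / baseline_bits  # log2(3) is about 1.58 bits


if __name__ == "__main__":
    q, gamma = absmean_ternary([0.9, -0.05, -1.2, 0.3])
    print(q, gamma)
    print(ternary_size_ratio())
```

Three levels carry log2(3) ≈ 1.58 bits of information, which is where the name comes from, and against 16-bit weights that works out to roughly a tenth of the storage in the ideal case (real packing formats may use a bit more).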
Making AI more efficient will just mean more AI.
Generative AI is great if used as a tool instead of a solution.
Since I find AIs to be useful that sounds fine to me.
So?
Smaller and speedier means larger token windows and a greater variety of models.
Know what uses less? No LLMs
Yay, I'm doing my part!