llama2.c: Inference Llama 2 in one file of pure C by Andrej Karpathy


Have you ever wanted to inference a baby Llama 2 model in pure C? No? Well, now you can!
With this code you can train the Llama 2 LLM architecture from scratch in PyTorch, then save the weights to a raw binary file, then load that into one simple ~500-line C file (run.c) that inferences the model, simply in fp32 for now. On my cloud Linux devbox a dim 288, 6-layer, 6-head model (15M params) inferences at ~100 tok/s in fp32, and about the same on my M1 MacBook Air. I was somewhat pleasantly surprised that one can run reasonably sized models (a few tens of millions of params) at highly interactive rates with an approach this simple.
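To make the "save the weights to a raw binary file, then load that in C" step concrete, here is a minimal sketch of a checkpoint loader. The actual file format is defined in run.c; this sketch assumes a hypothetical, simplified layout of a small struct of int32 hyperparameters followed by a flat blob of fp32 weights.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical header layout for illustration: a few int32 hyperparameters
   followed by raw fp32 weights. The real layout lives in run.c. */
typedef struct {
    int dim;       /* transformer embedding dimension, e.g. 288 */
    int n_layers;  /* number of transformer layers, e.g. 6 */
    int n_heads;   /* number of attention heads, e.g. 6 */
} Config;

/* Read the config header and the flat fp32 weight blob from a checkpoint.
   Returns a malloc'd array of *n_weights floats, or NULL on failure. */
float *load_checkpoint(const char *path, Config *cfg, long *n_weights) {
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;
    if (fread(cfg, sizeof(Config), 1, f) != 1) { fclose(f); return NULL; }
    /* everything after the header is the weight blob */
    long header_end = ftell(f);
    fseek(f, 0, SEEK_END);
    long file_size = ftell(f);
    fseek(f, header_end, SEEK_SET);
    *n_weights = (file_size - header_end) / (long)sizeof(float);
    float *w = malloc((size_t)*n_weights * sizeof(float));
    if (!w || fread(w, sizeof(float), (size_t)*n_weights, f) != (size_t)*n_weights) {
        free(w); fclose(f); return NULL;
    }
    fclose(f);
    return w;
}
```

Because everything is plain fp32 in a known order, the PyTorch side only has to `tensor.numpy().tofile(...)` each weight in the same order the C side reads them; no serialization library is needed on either end.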