The Chinese AI lab also released a smaller, “distilled” version of its new R1, DeepSeek-R1-0528-Qwen3-8B, that DeepSeek claims beats comparably sized models on certain benchmarks.
Most models come in 1B, 7-8B, 12-14B, and 27+B parameter variants. According to the docs, they benchmarked the 8B model on an NVIDIA H20 (96 GB VRAM) and got between 144 and 1198 tokens/sec. Most consumer GPUs probably aren’t going to be able to keep up with
On my Mac mini running LM Studio, it managed 1702 tokens at 17.19 tok/sec and thought for 1 minute. If accurate, high-performance models were better able to run on consumer hardware, I would use my 3060 as a dedicated inference device.
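As a quick sanity check on those numbers: 1702 tokens at 17.19 tok/sec works out to roughly 99 seconds of generation in total, so the minute of thinking would be a bit over half of it. That assumes the reported speed covers the whole response, which may not be how LM Studio counts it. Here is a minimal Python sketch of the arithmetic, with the H20 range from the docs thrown in for scale. The function name is just illustrative, and the H20 figures may well be batched throughput rather than single-stream speed.

```python
def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    """Time to emit `tokens` at a constant decode rate (ignores prompt processing)."""
    return tokens / tokens_per_second


# Figures quoted in this thread; the H20 range is from the docs and may
# reflect batched serving rather than a single chat session.
rates = {
    "Mac mini (LM Studio)": 17.19,
    "H20, low end": 144.0,
    "H20, high end": 1198.0,
}

for name, rate in rates.items():
    print(f"{name}: {generation_seconds(1702, rate):.1f} s for 1702 tokens")
```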
8B is small enough to run in FP8 or a Marlin quant with SGLang/vLLM/TensorRT, so you can probably get very close to the H20 on a 3090 or 4090 (or even a 3060) if you know a little Docker.
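To put rough numbers on "small enough", here is a back-of-the-envelope estimate of weight memory at different precisions. It assumes about 8 billion parameters (going by the model name), a 12 GB 3060, and 24 GB on the 3090/4090. It counts weights only, so KV cache, activations, and quantization scales all come on top of this. Treat it as a sketch, not a guarantee that a given setup will fit.

```python
GIB = 1024 ** 3


def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / GIB


# Assumed VRAM sizes in GiB; the 3060 also ships in an 8 GB variant.
cards = {"RTX 3060": 12, "RTX 3090": 24, "RTX 4090": 24}

for label, bits in [("FP16", 16), ("FP8", 8), ("4-bit", 4)]:
    size = weight_gib(8, bits)  # 8B parameters is an assumption from the model name
    fits = [name for name, vram in cards.items() if size < vram]
    print(f"{label}: {size:.1f} GiB of weights, fits (weights only) on: {fits}")
```

By this estimate FP16 weights are already too big for a 12 GB card, while FP8 or a 4-bit quant leaves some headroom for context.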
Above is what I can do with DeepSite by pasting in the first page of your Lemmy profile and the prompt:
"This is double_quack, a lemmy user on Lemmy, a new social media platform. Create a cool profile page in a style that they'll like based on the front page of their lemmy account (pasted in a ctrl + a, ctrl + c, ctrl + v of your profile)."
It's not perfect by any stretch of the imagination, but like, it's not a bad starting point.