
9 comments
  • The article says nothing really relevant about the architecture or the implementation. "Unified memory" on its own tells us nothing. Is it using a GPU-like math coprocessor, or just extra cores? If it is just the CPU, it will hit the same cache and memory bus width limitations as any other CPU. If it is a split workload, it will be limited to tools that can split the math across devices.

The article also compares this to ancient standards of compute, like 32 GB of system memory, when AI has been in the public space for nearly 2 years now; for an AI setup, 64 GB to 128 GB is pretty standard. And it talks about running models that are impossible to fit on even a dual-station setup in their full form. You generally need roughly twice the model's parameter count in GB of memory to load the full version (rough arithmetic in the sketch below). You need the full version for any kind of training, but not for general use. Even two of these systems, at 256 GB of combined memory, are not going to load a 405B model at full precision. Sure, they can run a quantized version, and that will be like having your own ChatGPT 3.5, but that is not training or some developer use case.

Best case scenario, this would load a 70B at full precision. For comparison, an Intel 12th-gen CPU with 20 logical cores, 64 GB of DDR5 at max spec speed, and a 16 GB 3080 Ti loads a 70B with Q4L quantization and streams slightly slower than my natural reading pace. There is no chance that could be used for anything like an agent or for more complex tasks that require interaction.

My entire reason for clicking on the article was to see the potential architectural difference that might enable larger models despite the extremely limited CPU cache bus width problem present in all CPU architectures. The real-world design life cycle of hardware is about 10 years, so real AI-specific hardware is still at least 8 years away. Anything in the present is marketing wank and hackery. Hackery is interesting, but there is none of it in this article.
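To put rough numbers on the memory claim, here is a minimal sketch (my own back-of-the-envelope figures, not from the article), assuming 2 bytes per parameter for FP16 "full precision" and about 0.5 bytes per parameter for 4-bit quantization, and ignoring KV cache, activations, and runtime overhead, which all add more on top:

```python
# Ballpark memory needed just to hold the weights, ignoring KV cache,
# activations, and runtime overhead. Figures vary by runtime; treat as rough.
BYTES_PER_PARAM = {
    "fp32": 4.0,   # full 32-bit precision
    "fp16": 2.0,   # "full precision" as models are commonly shipped
    "q8":   1.0,   # 8-bit quantization
    "q4":   0.5,   # 4-bit quantization (Q4-class quants)
}

def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Approximate GB required for the weights alone."""
    return params_billions * BYTES_PER_PARAM[precision]

for size in (70, 405):
    for prec in ("fp16", "q4"):
        print(f"{size:>3}B @ {prec}: ~{weight_memory_gb(size, prec):.0f} GB")

# Output:
#  70B @ fp16: ~140 GB   (needs the dual 256 GB setup, not a single 128 GB unit)
#  70B @ q4:   ~35 GB
# 405B @ fp16: ~810 GB   (nowhere near 256 GB)
# 405B @ q4:   ~202 GB   (quantized only, as noted above)
```

This is the same "twice the parameter count" rule of thumb stated above: at FP16, GB of weights is approximately 2 times the parameter count in billions.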
