In the ‘Medium’ difficulty category, OpenAI’s o4-mini-high model scored the highest at 53.5%.
This fits my observation of such models. o4-mini-high is able to help me with 80-90% of the problems at work. For the remaining problems, it comes up with a nonsensical solution, and no matter how much I prompt it, it tunnel-visions on that specific approach. It never second-guesses itself, realises that its initial solution is completely off the mark, and tries an entirely different approach. That's where I usually step in and do the work myself.
It still saves me time with the trivial stuff though.
I can't say the same for the rest of the LLMs. They are simply no good at coding and just waste my time.
I didn’t see Claude 4 Sonnet in the tests, and that is the one I use. In my experience it is in about the same category as o4-mini.
It is a nice tool to have in my belt, but these LLM-based agents are still very far from being able to do advanced, hard tasks. To me it is probably more important to communicate and learn about the limitations of these tools, so that we don't lose time instead of gaining it.
In fact, I am not even sure they are good enough to generate production-ready code. But they are nice for pre-reviewing, building simple scripts that don’t need to be highly reliable, analysing a project, asking specific questions, etc. The game changer for me was using Clojure-MCP. Having a REPL at its disposal really enhances the quality of most answers.
For me, Claude Code is where everything finally clicked. For advanced stuff, sure, they’re shit when they’re left alone. But as long as I treat it as a junior developer (breaking tasks down into easy bites, having a clear plan at all times, steering it away from pitfalls), I find myself enjoying other stuff while it’s doing the monkey work. Just be sure you provide it with tools, MCP, RAG and some patience.
It's a rubber ducky that talks back. If you don't take it seriously, it can reach the level of usefulness just above a wheezing piece of yellow rubber.
For instance, if an AI model could complete a one-hour task with 50% success, it only had a 25% chance of successfully completing a two-hour task. This indicates that for 99% reliability, task duration must be reduced by a factor of 70.
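A quick check of that arithmetic, as a rough Python sketch. The assumption that success probability halves with every extra hour is mine; it simply happens to match the quoted 50% and 25% figures. The function names are made up for illustration.

```python
import math

def success_probability(hours: float, half_life_hours: float = 1.0) -> float:
    """Chance of completing a task, assuming success halves every half_life_hours."""
    return 0.5 ** (hours / half_life_hours)

def max_hours_for(reliability: float, half_life_hours: float = 1.0) -> float:
    """Longest task that still succeeds with the given probability under that model."""
    return half_life_hours * math.log(reliability) / math.log(0.5)

two_hour = success_probability(2.0)   # 0.25, matching the quoted two-hour figure
limit = max_hours_for(0.99)           # roughly 0.0145 hours, i.e. under a minute
shrink_factor = 1.0 / limit           # roughly 69, close to the quoted factor of 70

print(f"two-hour task: {two_hour:.2f}")
print(f"99% reliable task length: {limit * 60:.2f} minutes")
print(f"shrink factor: {shrink_factor:.1f}")
```

Under that model, a task the AI manages half the time at one hour has to shrink to under a minute before it succeeds 99% of the time.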
This is interesting. I have noticed this myself. Generally, when an LLM boosts productivity, it shoots back a solution very quickly, and after a quick sanity check, I can accept it and move on. When it has trouble, that's something of a red flag. You might get there eventually by probing it more and more, but there is good reason for pessimism if it's taking too long.
In the worst case scenario where you ask it a coding problem for which there is no solution—it's just not possible to do what you're asking—it may nevertheless engage you indefinitely until you eventually realize it's running you around in circles. I've wasted a whole afternoon with that nonsense.
Anyway, I worry that companies are no longer hiring junior devs. Today's juniors are tomorrow's elites and there is going to be a talent gap in a decade that LLMs—in their current state at least—seem unlikely to fill.
Sadly, the lack of junior devs means my job is probably safe until I am ready to retire. I have mixed feelings about that. On the one hand, yay for me. On the other, it's sad for the new grads, and sad for software as a whole. But software truly sucks, and it has only been enshittifying more and more. Could a shake-up like this somehow help with that? I don't see how, but who knows.
In the worst case scenario where you ask it a coding problem for which there is no solution—it's just not possible to do what you're asking—it may nevertheless engage you indefinitely until you eventually realize it's running you around in circles.
Exactly this, and it's frustrating as a Jr dev to be fed this bs when you're learning. I've had multiple scenarios where it blatantly told me wrong things. Like using string interpolation in a Terraform file to try to set a dynamic source: what it gave me looked totally viable. It wasn't until I dug around some more that I found out that terraform init can't use variables in the source field.
On the positive side, it helps give me some direction when I don't know where to start. I use it with a highly pessimistic and cautious approach. I understand that today is the worst it's going to be, and that I will be required to use it as a tool in my job going forward, so I'm making an effort to get to grips with it.
Well, this kind of AI won't ever be useful as a programmer. It doesn't think. It doesn't reason. It cannot make decisions besides using a ton of computational power and enormous deep neural networks to shit out a series of words that seem like they should follow your prompt. An LLM is just a really, really good next-word guesser.
So when you ask it to solve the Tower of Hanoi problem, great, it can do that. Because it saw someone else's answer. But if you ask it to solve it for a tower that is 20 disks high, it will fail, because no one ever talks about going that far, and it flounders. It's not actually reasoning to solve the problem - it's regurgitating answers it has ingested from stolen internet conversations. It's not even attempting to solve the general case because it's not trying to solve the problem, it's responding to your prompt.
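For what it's worth, the general case being referred to is a short recursion that works for any disk count. A minimal Python sketch; the function names and the peg labels "A", "B", "C" are my own choices for illustration:

```python
from typing import Iterable, Iterator, Tuple

Move = Tuple[str, str]

def hanoi(n: int, source: str = "A", target: str = "C", spare: str = "B") -> Iterator[Move]:
    """Yield (from_peg, to_peg) moves that transfer n disks from source to target."""
    if n == 0:
        return
    # Park the n-1 smaller disks on the spare peg.
    yield from hanoi(n - 1, source, spare, target)
    # Move the largest disk to its destination.
    yield (source, target)
    # Bring the n-1 smaller disks back on top of it.
    yield from hanoi(n - 1, spare, target, source)

def is_valid_solution(n: int, moves: Iterable[Move]) -> bool:
    """Replay the moves and check that no larger disk ever lands on a smaller one."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}
    for src, dst in moves:
        if not pegs[src]:
            return False
        disk = pegs[src].pop()
        if pegs[dst] and pegs[dst][-1] < disk:
            return False
        pegs[dst].append(disk)
    return not pegs["A"] and not pegs["B"] and pegs["C"] == list(range(n, 0, -1))

moves_for_3 = list(hanoi(3))
print(len(moves_for_3), is_valid_solution(3, moves_for_3))
```

The same function handles 20 disks without any change; it just takes 2**20 - 1 = 1,048,575 moves, so writing them all out is a matter of length rather than difficulty.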
That said - an LLM is also great as an interface to allow natural language and code as prompts for other tools. This is where the actually productive advancements will be made. Those tools are garbage today but they'll certainly improve.
They have their uses. For instance, the other day I needed to read some assembly and decompiled C, and you know how fun that can be. The LLM proved quite good at translating it to English, and it really sped up the process.
Writing it back wasn't that good though, just good enough to point in a direction but I still ended up writing the patcher mostly by myself.
Fortunately, 90% of coding is not hard problems. We write the same crap over and over. How many different create-an-account and sign-in flows do we really need? Yet there seems to be an infinite number, each with its own bugs.
The hard problems are the only reason I like programming. If 90% of my job was repetitive boilerplate, I'd probably be looking elsewhere.
I really dislike how LLMs are flooding the internet with a seemingly infinite amount of half-broken TODO-app style programs with no care at all for improving things or doing something actually unique.
A lot of people don't realize how many times the problem they are solving has already been solved. After three decades in the industry, I can say that very few things people are working on haven't been done before. They just get put together in different combinations.
As for AI, I have found it decent at writing one-time scripts to gather information I need to make design decisions. And it's a little quicker when I need to look up the syntax for a language or, say, a resource name for Terraform. But even with one-off scripts I sometimes have to ask it whether a while loop wouldn't be better, and such.
There aren't that many, if you're talking specifically about LLMs, but ML+AI is more than LLMs.
Not a defence or indictment of either side; people just tend to confuse the terms "LLM" and "AI".
I think there could be worth in AI for identification (what insect is this, find the photo I took of the receipt for my train ticket last month, order these chemicals from lowest to highest pH...) - but LLMs are only part of that stack - the input and output - and that part isn't going to make many massive breakthroughs week to week.
The recent boom in neural net research will have real applicable results that are genuine progress: signal processing (e.g. noise removal), optical character recognition, transcription, and more.
However, the biggest hype area, with what I see as the smallest real return, is the huge-model LLM space, which basically tries to portray AGI as just around the corner.
LLMs will have real applications in summarization, but otherwise they largely just generate asymptotically plausible babble: very good for filling the Internet with slop, but not actually useful for replacing all the positions OAI et al. need them to replace (for their funding to be justified).
I've found that AI is only good at solving programming problems that are relatively "small picture", or that have to do with the basics of a language. Any solution it provides for anything else, you will have to rewrite completely once you consult the language's standards and best practices.
Well, I recently did kind of an experiment: writing a kids' game in Kotlin without ever having used the language before. And it was surprisingly easy to do. I guess it helps that I'm fluent in ~5 other programming languages, because I could tell what looked obviously wrong.
My conclusion kinda is that it's a really great help if you know programming in general.