Gemini 2.5 "reasoning", no real improvement on river crossings.
diz (@diz@awful.systems) · 10 Posts · 145 Comments · Joined 2 yr. ago
Yeah, exactly. There's no trick to it at all, unlike the original puzzle.
I also tested OpenAI's offerings a few months back with similarly nonsensical results: https://awful.systems/post/1769506
The all-vegetables, no-duck variant is solved correctly now, but I doubt that's due to improved reasoning as such; I think they may have augmented the training data with some variants of the river crossing. The river crossing is one of the best-known puzzles, and various people have been posting hilarious bot failures with variants of it, so it wouldn't be unexpected for their training data augmentation to include river crossing variants.
Of course, there are very many ways the puzzle can be modified, and their augmentation would only cover the obvious stuff, like varying which items can be left with which items, or how many spots there are on the boat.
Full-time AI grift jobs would of course be forever closed to any AI whistleblower. There are still plenty of other jobs, though.
I have participated in the hiring process, and I can tell you that at your typical huge corporation the recruiter / HR are too inept to notice that you're a whistleblower, and don't give a shit anyway. And among the rank and file who will actually google you, plenty of people dislike AI.
At the rank and file level, the only folks who actually give a shit who you are are people who will have to work with you. Not the background check provider, not the recruiter.
Using tools from physics to create something that is popular but unrelated to physics is enough for the Nobel Prize in Physics?
If only; it's not even that! Neither Boltzmann machines nor Hopfield networks led to anything used in the modern spam- and deepfake-generating AI, nor in image-recognition AI or the like. This is the kind of stuff that struggles to get above 60% accuracy on MNIST (handwritten digits).
Hinton went on to do some different stuff based on backpropagation and gradient descent, on newer computers than the people who came up with those techniques long before him, and he got the Turing Award for that. It's a wee bit controversial, because of the whole "others did it before, but on worse computers, so they didn't get any award" thing. But at least it is for work on the path leading to modern AI, and not for work on the vast list of things that just didn't work, where it's extremely hard to explain why you would even think they would work in the first place.
Then next year Hopfield and Hinton go back to Sweden, don't tell king of Sweden anything, king of Sweden still gives them the Nobel Prize! King of Sweden now has conditioned reflex!
A Nobel Prize in Physics for attempting to use physics in AI, except it didn't really work very well; then one of the guys worked on a better, more purely mathematical approach that actually did work, and got the Turing Award for that, but that's not what the prize is for; while the other guy did some other work, but that's not what the prize is for either. AI will solve all physics!!!111
Maybe if the potato casserole is exploded in the microwave by another physicist, on his way to start a resonance cascade...
(i'll see myself out).
Frigging exactly. It's a dumb-ass dead end that is fundamentally incapable of doing the vast majority of things ascribed to it.
They keep imagining that it would actually learn some underlying logic from a lot of text. All it can do is store a bunch of applications of said logic, as in a giant table. Deducing underlying rules instead of simply memorizing particular instances of them is a form of compression, and there wasn't much compression going on; now that the models are so over-parametrized, there's even less.
I tried the same prompt a lot of times and saw "chain of thought" attempts complete with the state modeling; they must be augmenting the training dataset with some sort of script-generated crap.
I have to say those are so far the absolute worst attempts.
Day 16 (Egg 3 on side A; Duck 1, Duck 2, Egg 1, Egg 2 on side B): Janet takes Egg 3 across the river.
"Now, all 2 ducks and 3 eggs are safely transported across the river in 16 trips."
I kind of feel that this undermines the whole point of using transformer architecture instead of a recurrent neural network. Machine learning sucks at recurrence.
Well, I'll be damned, they are trying to get it to work as if it had internal state, by augmenting their training dataset with some auto-generated crap. Of course, if machine learning weren't shit at recurrence they wouldn't be using the transformer architecture in the first place, so it doesn't work.
I got the following from chatgpt:
Janet’s ducks lay 3 eggs per day. Each day, she transports one duck and all the eggs to the other side of the river using a boat that can hold her and one other item (either a duck or an egg). How does she do that?

ChatGPT:
Janet can transport the ducks and eggs across the river using the following strategy:
Day 1 (Duck A, Duck B, Duck C on side A; Egg 1, Egg 2, Egg 3 on side A): Janet takes Duck A across the river.
Day 2 (Duck B, Duck C on side A; Egg 1, Egg 2, Egg 3 on side A): Janet takes Egg 1 across the river.
Day 3 (Duck B, Duck C on side A; Egg 2, Egg 3 on side A; Duck A on side B): Janet takes Duck A back to side A.
Day 4 (Duck C on side A; Egg 2, Egg 3 on side A; Duck A on side A; Egg 1 on side B): Janet takes Duck B across the river.
Day 5 (Duck C on side A; Egg 2, Egg 3 on side A; Duck A on side A; Egg 1, Duck B on side B): Janet takes Egg 1 back to side A.
Day 6 (Duck C on side A; Egg 2, Egg 3 on side A; Duck A, Egg 1 on side A; Duck B on side B): Janet takes Duck A across the river again.
Day 7 (Egg 2, Egg 3 on side A; Duck A, Egg 1 on side A; Duck B on side B; Duck C on side A): Janet takes Duck C across the river.
Day 8 (Egg 2, Egg 3 on side A; Duck A, Egg 1, Duck C on side A; Duck B on side B): Janet takes Egg 2 across the river.
Day 9 (Egg 3 on side A; Duck A, Egg 1, Duck C on side A; Duck B on side B; Egg 2 on side B): Janet takes Duck A back to side A.
Day 10 (Egg 3 on side A; Duck C on side A; Duck A on side A; Duck B, Egg 1, Egg 2 on side B): Janet takes Duck B across the river.
Day 11 (Egg 3 on side A; Duck C on side A; Duck A, Duck B on side A; Egg 1, Egg 2 on side B): Janet takes Egg 1 back to side A.
Day 12 (Egg 3 on side A; Duck C on side A; Duck A, Egg 1, Duck B on side A; Egg 2 on side B): Janet takes Duck A across the river again.
Day 13 (Duck C on side A; Duck A, Egg 1, Duck B on side A; Egg 2 on side B; Egg 3 on side B): Janet takes Duck C across the river.
By following this sequence, Janet successfully transports all 3 ducks and all 3 eggs across the river in 13 trips. Each trip involves either transporting one duck or one egg, adhering to the constraint that Janet can only carry one item (duck or egg) with her in the boat.
AI peddlers just love any "critique" that presumes the AI is great at something.
Safety concern that LLMs would go Skynet? Say no more, I hear you, and I'll bring it up first thing in Congress.
Safety concern that terrorists might use it to make bombs? Say no more! I agree that the AI is so great for making bombs! We'll restrict it to keep people safe!
It sounds too horny, you say? Yeah, good point, I love it. Our technology is better than sex itself! We'll keep it SFW to keep mankind from going extinct due to robosexuality!
Perhaps it was nearly ready to emit a stop token after "the robot can take all 4 vegetables in one trip if it is allowed to carry all of them at once." but "However" won, and then after a "However" it had to say something else, because that's how "however" works...
Agreed on the style being absolutely nauseating. It wasn't a very good style when humans were using it, but now it is just the style of absolute bottom of the barrel, top of the search results garbage.
The counting failure in general is even clearer, and lacks the excuse of unfavorable tokenization. The AI hype would have you believe that just an incremental improvement in multi-modality or scaffolding will overcome this, but I think they need to make more fundamental improvements to the entire architecture they are using.
Yeah.
I think the failure could be extremely fundamental - maybe local optimization of a highly parametrized model is fundamentally unable to properly learn counting (other than via memorization).
After all, there's a very large number of ways a highly parametrized model can do a good job of predicting the next token that do not involve actual counting. What makes counting special versus memorization is that it is a relatively compact representation, but there's no reason for a neural network to favor compact representations.
The "correct" counting may just be a very tiny local minimum, with tall hill all around it and no valley leading there. If that's the case then local optimization will never find it.
I think you could make a slight improvement to Wolfram Alpha: use an LLM to translate natural-language queries into queries WA can consume, then feed those into WA. WA always reports exactly what it computed, so if it "misunderstands" you, it's a lot easier to notice.
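A sketch of what that pipeline could look like. Everything here is made up for illustration: `llm_translate` stands in for whatever LLM API you'd use (here a fake canned table), and `wolfram_alpha_query` for the actual Wolfram|Alpha API.

```python
def llm_translate(question: str) -> str:
    """Stand-in for an LLM prompted only to rewrite the question in WA syntax."""
    # A real implementation would call an LLM here; this canned table is fake.
    canned = {
        "what is the volume of a wire 10 cm long and 0.1 mm thick?":
            "volume of cylinder, radius 0.05 mm, height 10 cm",
    }
    return canned[question.lower()]

def wolfram_alpha_query(wa_input: str) -> str:
    """Stand-in for the WA API call."""
    return f"(WA result for: {wa_input})"

def answer(question: str) -> str:
    wa_input = llm_translate(question)
    result = wolfram_alpha_query(wa_input)
    # Key point: surface the translated query next to the result, so that
    # when the LLM "misunderstands", the user sees exactly what WA computed.
    return f"Interpreted as: {wa_input}\n{result}"

print(answer("What is the volume of a wire 10 cm long and 0.1 mm thick?"))
```

The design choice doing the work is the last line of `answer`: the user always sees the intermediate query, unlike with an end-to-end chatbot.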
The problem here is that AI boys got themselves hyped up about it being actually intelligent, so none of them would ever settle for some modest application of LLMs. Google fired the authors of the "stochastic parrots" paper, AFAIK.
Simply pasting LLM output into CAS input and the CAS output back into LLM input (which, let's be honest, is the first thing tech bros will try, as it doesn't require much basic-research improvement) will not help much, and will likely generate an entirely new breed of hilarious errors and bullshit. (I like the term "bullshit" instead of "hallucination"; it captures the connotation that the errors are of a kind with the normal output.)
Yeah, I have examples of that as well. I asked GPT-4 at work to calculate the volume of a 10 cm long, 0.1 mm diameter wire. It seemed to be doing correct arithmetic by some mysterious means that did not use scientific notation, but the LLM cannot actually count, so it miscounted zeroes and output a result 1000x larger than the correct answer.
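For reference, the correct arithmetic fits in three lines of Python:

```python
import math

length_cm = 10.0
radius_cm = (0.1 / 10.0) / 2.0  # 0.1 mm diameter = 0.01 cm, so radius 0.005 cm
volume_cm3 = math.pi * radius_cm ** 2 * length_cm  # V = pi * r^2 * L

print(f"{volume_cm3:.3e} cm^3")  # 7.854e-04 cm^3
# The zero-miscounting failure described above yields ~7.854e-01 cm^3 instead.
```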
Well, the problem is it not having any reasoning, period.
It's not clear what "symbolic reasoning" would entail, but puzzles generally require you to think through several approaches to solve them, too. That requires a world model, a search, and so on: the kind of stuff that actual AIs, even a tic-tac-toe AI, have, but LLMs don't.
On top of that, this all works through machine learning, which produces the resulting network weights through very gradual improvement at next-word prediction, tiny step by tiny step. Even if some sort of discrete model (like, say, an account of what's on either side of the river) could help it predict the next token, there is no tiny fraction of a discrete "model" that would help, and so it simply does not go down that path at all.
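For contrast, here's what an explicit world model plus search looks like: a few lines of brute-force BFS solve the classic wolf/goat/cabbage crossing (used here just as a stand-in for the genre). This is exactly the discrete state bookkeeping that next-token prediction has no pressure to learn.

```python
from collections import deque

ALL = frozenset({"farmer", "wolf", "goat", "cabbage"})
UNSAFE = ({"wolf", "goat"}, {"goat", "cabbage"})  # can't be left unattended

def safe(state):
    # `state` is the set of everyone on the start bank; check both banks.
    for bank in (state, ALL - state):
        if "farmer" not in bank and any(set(pair) <= bank for pair in UNSAFE):
            return False
    return True

def solve():
    start, goal = ALL, frozenset()
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        farmer_here = "farmer" in state
        bank = state if farmer_here else ALL - state
        for cargo in [None] + sorted(bank - {"farmer"}):
            moving = {"farmer"} | ({cargo} if cargo else set())
            nxt = state - moving if farmer_here else state | moving
            if safe(nxt) and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [cargo or "nothing"]))

path = solve()
print(len(path), "crossings:", path)  # 7 crossings, goat first and last
```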
I feel like letter counting and other letter-manipulation problems kind of undersell the underlying failure to count: LLMs work on tokens, not letters, so some difficulty with letters is expected.
The inability to count is, of course, wholly general: in a river crossing puzzle an LLM cannot keep track of what's on either side of the river, for example, and it sometimes misreports how many steps its own solution contains.
> But if your response to the obvious misrepresentation that a chatbot is a person of ANY level of intelligence is to point out that it’s dumb you’ve already accepted the premise.
How am I accepting the premise, though? I do call it an Absolute Imbecile, but that's more of a word play on the "AI" moniker.
What I do accept is the unfortunate fact that they did get their "AIs" to score very highly on various "reasoning" benchmarks (some of their own design), standardized tests, and so on and so forth. It all works correctly across most simple variations, such as changing the numbers in a problem or the word order.
They really did a very good job at faking reasoning. I feel that even though LLMs are complete bullshit, the sheer strength of that bullshit is easy to underestimate.
Yeah I think that's why we need an Absolute Imbecile Level Reasoning Benchmark.
Here's what the typical PR from AI hucksters looks like:
https://www.anthropic.com/news/claude-3-family
Fully half of their performance claims are about "reasoning", with names like "Graduate Level Reasoning". OpenAI is even worse; recall their claim of having scored in the 90th percentile on the LSAT?
On top of that, LLMs are fine-tuned to convince some dumb-ass CEO who "checks it out". Even though you can pay for the subscription, you're neither the customer nor the product; you're just collateral eyeballs on the ad.
GPT-4, supposedly (it says that it is GPT-4). I have access to one that is cleared for somewhat sensitive data, so presumably my queries aren't getting flagged and human-reviewed by OpenAI.
> Both parties are buying into a premise we already know to be incorrect.
We may know it is incorrect, but LLM salesmen are claiming things like "90th percentile on LSAT", high scores on a "college level reasoning benchmark" and so on and so forth.
They are claiming "yeah yeah there's all the anekdotal reports of glue pizza, but objectively, our AI is more capable than your workers, so you can replace them with our AI", and this is starting to actually impact the job market.
Exactly. Also, looking at its chain-of-wordvomit (which apparently I can't share other than by cutting and pasting it somewhere), I don't think this is the same as GPT-4 overfitting to the original river crossing and always bringing items back needlessly.
Note also that in one example it discusses moving the duck and another item across the river (so "up to two other items" works); it is not ignoring the prompt, and it isn't even trying to bring anything back. And its answer (calling the puzzle impossible) has nothing to do with the original.
In the other one it does bring items back; it tries different orders and even finds an order that actually works (with two unnecessary moves), but because it isn't an AI fanboy reading tea leaves, it still gives the wrong answer.
Here's the full logs:
https://pastebin.com/HQUExXkX
Content warning: AI wordvomit which is so bad it is folded hidden in a google tool.