The blog post from the researcher is a more interesting read.
Important points here about benchmarking:
o3 finds the Kerberos authentication vulnerability in the benchmark in 8 of the 100 runs. In another 66 of the runs o3 concludes there is no bug present in the code (false negatives), and the remaining 28 reports are false positives. For comparison, Claude Sonnet 3.7 finds it 3 out of 100 runs and Claude Sonnet 3.5 does not find it in 100 runs.
o3 finds the Kerberos authentication vulnerability in 1 out of 100 runs with this larger number of input tokens, so a clear drop in performance, but it does still find it. More interestingly however, in the output from the other runs I found a report for a similar, but novel, vulnerability that I did not previously know about. This vulnerability is also due to a free of sess->user, but this time in the session logoff handler.
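To make the bug class concrete: the quoted report describes a use-after-free, where sess->user is freed in the logoff path but can still be reached afterwards. Below is a deliberately minimal C sketch of that pattern. All the names here (struct session, handle_logoff, handle_request) are hypothetical stand-ins, not the actual ksmbd code, and the real bug involves concurrent request handling that this sketch compresses into two sequential calls.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical, simplified types -- not the real ksmbd structures. */
struct user    { char *name; };
struct session { struct user *user; };

/* The logoff path frees the user attached to the session... */
static void handle_logoff(struct session *sess)
{
    free(sess->user);
    /* ...but the dangling pointer is left in place (no sess->user = NULL,
     * no check elsewhere), so later handlers can still reach it. */
}

/* Another handler that still dereferences sess->user afterwards. */
static void handle_request(struct session *sess)
{
    printf("user: %s\n", sess->user->name);  /* use-after-free */
}

int main(void)
{
    struct session sess = { .user = malloc(sizeof(struct user)) };
    sess.user->name = "example";

    handle_logoff(&sess);
    handle_request(&sess);  /* undefined behaviour: reads freed memory */
    return 0;
}
```

In the kernel the consequences are obviously far worse than a garbled printf, but the shape of the mistake is the same: an object freed on one path while another path still holds a pointer to it.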
I'm not sure if a signal-to-noise ratio of 1:100 is, uh... great...
If the researcher had spent as much time auditing the code as he did evaluating the merit of hundreds of incorrect LLM reports, he would no doubt have found the second vulnerability himself.
I agree it's not brilliant, but it's early days. If you're looking to mechanise a process like finding bugs, you have to start somewhere: determine how to measure success, set performance baselines, and all that.
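On the measuring-success point, here's a rough sketch of what turning the quoted numbers into baseline metrics might look like, using the first experiment's counts (8 runs found the bug, 66 reported no bug, 28 were false positives). Treating each run as a single verdict is my assumption, not something the blog post defines.

```c
#include <stdio.h>

int main(void)
{
    /* Run counts quoted from the blog post's first o3 experiment (100 runs).
     * Treating each run as one verdict is an assumption for illustration. */
    const double true_positives  = 8.0;   /* runs that found the Kerberos bug */
    const double false_negatives = 66.0;  /* runs that concluded "no bug"     */
    const double false_positives = 28.0;  /* runs that reported spurious bugs */

    double precision = true_positives / (true_positives + false_positives);
    double recall    = true_positives / (true_positives + false_negatives);

    printf("precision: %.2f\n", precision);  /* ~0.22 */
    printf("recall:    %.2f\n", recall);     /* ~0.11 */
    return 0;
}
```

Numbers like these are what make the comparison across models in the quote (o3 vs Sonnet 3.7 vs Sonnet 3.5) meaningful, even if the absolute values are unimpressive so far.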
The problem is motivation. As someone with ADHD, I definitely understand that having an interesting project makes tedious stuff much more likely to get done. LOL