
Anderson's Angle

Human Code From 2020 Thrashed Vibe-Coded Agents in Agentic Tests

AI-generated image: a Victorian coach and horses winning a Formula 1 race against modern race car competitors. gpt-image-1.

ChatGPT and other vibe-coding tools were put to the test in nearly 40,000 matches – and lost to grad student code written before the LLM revolution.

 

In a new study from the UK, researchers pitted human-coded agents against vibe-coded agents developed with the latest Large Language Models (LLMs), such as GPT-5 and Claude, and found that the agents created without the aid of AI handily beat the AI-assisted versions.

Both sets of agents were created by different generations of students from the Artificial Intelligence Laboratory at the Swiss Federal Institute of Technology Lausanne (EPFL). The non-AI agents were developed as part of coursework in 2020, two years before the inception of ChatGPT and the start of the LLM revolution, while the new agents were created by current students, aided by the latest and best LLMs available.

Even with the game tilted in their favor, the vibe-coded solutions could not win: across 38,304 tournament matches covering a wide range of variables and circumstances, the top five spots were consistently held by ‘raw’ human agents, and the majority of LLM agents (33 out of 40) were beaten effortlessly by ‘very simple’ baseline agents.

The paper states:

‘Our work demonstrates that while state-of-the-art LLMs can generate code that runs (i.e., free of syntax errors), the generated solution is not competitive to human-designed solutions on dimensions such as strategic planning, optimization, or multi-agent competition.

‘Thus, this work brings to the forefront this new frontier in code generation, and aims to facilitate the development of benchmarks, datasets, and open-source baselines that stress reasoning-driven code synthesis.’

The challenge devised was to participate creatively in auctions, using a variety of strategies, and to arrange the logistics of fulfilling the delivery tasks won.

The authors note that the LLMs were given a number of advantages, such as the researchers intervening in their code to improve performance – a boon not extended to the 2020-era code. In spite of this, even when supplied with corrective code that would demonstrably have improved their outcomes, the LLMs were unable to make use of it:

‘[In] our benchmark, even when we expose a good solution in-context, the LLM is still unable to utilize it.

‘This result also raises interesting future research questions about the limits of in-context learning and retrieval-augmented problem solving in complex scenarios.’

The LLMs used in the test were GPT-5 Thinking, Gemini 2.5 Pro, Claude Opus 4.1, and DeepSeek R1*.

The new paper is titled Can Vibe Coding Beat Graduate CS Students? An LLM vs. Human Coding Tournament on Market-driven Strategic Planning, and comes from one author at the University of Southampton, and another at the University of Oxford and Alan Turing Institute. The benchmark will, the authors state, be released shortly.

Method

The authors note that traditional tests in this sphere focus on challenges with clearly defined binary solutions (correct or incorrect), verified through unit tests. Contending that this is not the ideal way to explore the limitations of LLM-aided code, the authors instead devised a more complex challenge scenario, with multiple internal benchmarks and milestones, in which victory is possible, but far from simple:

Comparison of standard, unit-test-based approaches (above), and the more open-ended challenge scenario devised by the authors (in blue, below). Source: https://arxiv.org/pdf/2511.20613

The Auction, Pickup and Delivery Problem (APDP) used for the authors’ study was chosen partly because of the availability of a corpus of 2020 student work from the Swiss university, which sought to create automated agents for the APDP task before development could be bolstered by AI. It was therefore relatively easy to give modern students the same brief, while allowing them to use current tools.

The authors sought to avoid popular testing frameworks such as HumanEval, BigCodeBench and WebDev Arena (among many others), since this class of testing procedures tends to suffer from data contamination (i.e., instances where the system may have trained on test data instead of respecting a split).

The APDP is a two-stage logistics problem based on reverse auctions and vehicle routing. In the first stage, agents compete to win delivery tasks by submitting bids for how much they should be paid to complete each one. Bidding too high means losing the task; bidding too low can mean losing money.

In the second stage, each agent must create an efficient plan to fulfill only the tasks they won, assigning them to vehicles with different capacities and costs, under time and resource constraints:

In the APDP, companies bid in reverse auctions for delivery tasks, then optimize vehicle routes to fulfill only the tasks they win, aiming to maximize profit.

The goal is not simply to complete the tasks, but to maximize overall profit by anticipating which bundles of tasks will work best together, and predicting the strategies of competitors who are all trying to do the same.
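To make the shape of the problem concrete, the sketch below outlines the bid-then-plan loop in minimal Python. All class and method names here are hypothetical illustrations of the two-stage structure described above, not the benchmark's actual interface.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    pickup: str    # pickup city
    dropoff: str   # delivery city
    weight: int    # load weight, limited by vehicle capacity

@dataclass
class SketchAgent:
    """Hypothetical APDP agent: quote a price for each auctioned task,
    then build routes only for the tasks actually won."""
    won_tasks: list = field(default_factory=list)

    def bid(self, task: Task) -> float:
        # Stage 1: bidding too high loses the auction, too low loses money,
        # so the quote should track the task's marginal cost to this agent.
        return self.estimate_marginal_cost(task) * 1.1   # illustrative markup

    def auction_result(self, task: Task, won: bool) -> None:
        if won:
            self.won_tasks.append(task)

    def plan(self) -> list:
        # Stage 2: order the won tasks into vehicle routes, respecting
        # capacity and time constraints (omitted in this sketch).
        return sorted(self.won_tasks, key=lambda t: t.pickup)

    def estimate_marginal_cost(self, task: Task) -> float:
        # Placeholder: a real agent would simulate inserting the task
        # into its current plan and measure the extra travel cost.
        return 100.0
```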

The APDP benchmark raises the difficulty of code generation tasks by introducing strategic planning across a sequence of interdependent auctions, with each bid reshaping the landscape of future choices; and therefore requires agents to reason not just about immediate costs, but about positioning, timing, and long-term consequences.

The core delivery problem is NP-hard, i.e., no known algorithm can reliably find the best solution in reasonable time as the number of tasks grows. This makes brute force an unworkable approach, and forces agents to trade precision for speed.
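One common way to trade precision for speed is a cheapest-insertion heuristic: rather than searching all possible routes, try every position for the new pickup and drop-off in the existing route and keep the cheapest feasible option. The sketch below is a minimal illustration of that idea (with distances supplied as a nested dict), not the planner used by either the students or the LLMs.

```python
def route_cost(route, dist):
    """Total travel cost of visiting the stops in the given order."""
    return sum(dist[a][b] for a, b in zip(route, route[1:]))

def cheapest_insertion(route, pickup, dropoff, dist):
    """Insert a pickup/drop-off pair into an existing route at the
    cheapest positions, keeping pickup before drop-off. Fast but
    approximate: it never reconsiders earlier placement decisions."""
    best_route, best_cost = None, float("inf")
    for i in range(1, len(route) + 1):           # position for the pickup
        for j in range(i, len(route) + 1):       # position for the drop-off
            candidate = route[:i] + [pickup] + route[i:j] + [dropoff] + route[j:]
            cost = route_cost(candidate, dist)
            if cost < best_cost:
                best_route, best_cost = candidate, cost
    return best_route, best_cost
```

The double loop makes each insertion quadratic in the current route length, keeping planning fast as tasks accumulate, at the cost of possibly missing the true optimum.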

The Race is On

The authors’ evaluation compared 40 LLM-coded agents against 17 human-coded agents in a series of head-to-head tournaments. The 12 tournaments spanned four road network topologies (three per topology), and each consisted of all-play-all pairings, with agents facing every other opponent twice: once controlling each of two companies, with different vehicle specs.

This setup yielded 3,192 matches per tournament, totaling 38,304 matches. In each match, 50 delivery tasks were auctioned, defined by their pickup and drop-off points and weight, and drawn randomly across road layouts modeled on Switzerland, France, Great Britain, and the Netherlands:

Simplified road networks used in the tournament: Great Britain (top left), Switzerland (top right), the Netherlands (bottom left), and France (bottom right). Blue and red squares mark pick-up and delivery tasks. Colored triangles show the current positions of agents’ vehicles.
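These match counts follow directly from the pairing scheme: 57 agents (40 LLM-coded plus 17 human-coded) each face the other 56 twice per tournament. A few lines of Python confirm the totals quoted above:

```python
agents = 40 + 17                                 # LLM-coded + human-coded agents
matches_per_tournament = agents * (agents - 1)   # every unordered pair plays twice,
                                                 # once per company role
print(matches_per_tournament)                    # 3192
print(matches_per_tournament * 12)               # 38304 matches across 12 tournaments
print((agents - 1) * 2)                          # 112 matches per agent per tournament
```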

Student agents were drawn from a 2020 course tournament. Eight came from the top performers in a single-elimination final, and four more were chosen for strong performance against the baseline agents in head-to-head matches.

The baseline agents followed fixed heuristics. Naive calculated total distance and bid accordingly, using only one vehicle and ignoring batching; ExpCostFixedBid simulated 10 random tasks, and bid the average marginal cost; Honest computed the actual marginal cost of inserting the task into the schedule; ModelOpponent did the same but added an estimate of the opponent’s cost, bidding the maximum; and RiskSeeking blended a time-decaying prior with live cost estimation and opponent modeling – again bidding the higher of the two.
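As a rough illustration of how these baselines differ, the sketch below contrasts a distance-based bid, a marginal-cost (‘Honest’-style) bid, and an opponent-aware bid. The function signatures and the cost-per-kilometre parameter are hypothetical simplifications; the actual baselines add details such as sampled task simulation and time-decaying priors.

```python
def naive_bid(pickup_to_dropoff_km: float, cost_per_km: float) -> float:
    """Bid in proportion to the raw task distance, ignoring how the
    task combines with everything else in the schedule."""
    return pickup_to_dropoff_km * cost_per_km

def honest_bid(plan_cost_without: float, plan_cost_with: float) -> float:
    """Bid the true marginal cost of inserting the task into the
    current schedule."""
    return plan_cost_with - plan_cost_without

def opponent_aware_bid(own_marginal_cost: float,
                       opponent_cost_estimate: float) -> float:
    """Bid the higher of our own marginal cost and the estimated
    opponent cost, extracting extra margin when the rival's cost
    appears to be high."""
    return max(own_marginal_cost, opponent_cost_estimate)
```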

The evaluation included 40 LLM-coded agents built using the (aforementioned) GPT-5 Thinking, Claude Opus 4.1, Gemini 2.5 Pro, and DeepSeek R1. Each model was prompted with five distinct strategies, applied twice per model.

Two strategies used static prompts written by different authors, while a third asked the model to self-reflect and revise its own output; another involved critique and revision by a separate LLM. The final strategy used GPT-4 to synthesize a new prompt by reviewing all four prior approaches.

The base prompt reflected the original student assignment, describing the delivery environment and instructing the model to bid and plan to maximize profit, without relying on high-complexity methods.

All LLM agents were tested in both self-play and tournament settings until all observable bugs were fixed. Bug-fixing was handled autonomously by the LLMs themselves, prompted with the error information.

Common LLM failures, the paper notes, included violations of timeout limits, failure to pick up or deliver assigned tasks, and breaches of vehicle capacity constraints – errors which often arose from disregarding explicit instructions, or from faulty replanning logic:

‘Another common issue we found (mostly with Gemini, Claude, and DeepSeek, and not so much with GPT) is that quite often the LLM would consistently fail to resolve a bug.

‘For example, an agent would consistently time-out, despite multiple (e.g., 5 − 15) cycles of prompting the LLM with the error and receiving the updated version of the code.

‘The only solution we found for such situations (where the LLM repeatedly fails to resolve the exact same bug) is to re-start from scratch. Overall, we observed the need for significant manual effort to achieve bug-free code. We had to generate substantially more agents to get the 40 bug-free ones we evaluated.’
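The debugging workflow the authors describe can be summarized as a retry-then-restart loop. The helper functions below (generate_agent, run_self_play, request_fix) are hypothetical stand-ins for the authors' harness and its LLM calls; the sketch only captures the policy of re-prompting with the error a bounded number of times, then regenerating from scratch.

```python
MAX_FIX_ATTEMPTS = 15        # the paper reports 5-15 fix cycles before giving up

def build_bug_free_agent(generate_agent, run_self_play, request_fix,
                         max_restarts: int = 5):
    """Return bug-free agent code, or None if every attempt fails."""
    for _ in range(max_restarts):
        code = generate_agent()                  # fresh LLM-generated agent
        for _ in range(MAX_FIX_ATTEMPTS):
            error = run_self_play(code)          # None means no bug observed
            if error is None:
                return code                      # keep this agent
            code = request_fix(code, error)      # re-prompt the LLM with the error
        # the same bug keeps recurring: discard the agent and regenerate from scratch
    return None
```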

The results shown below summarize outcomes from 12 double round-robin tournaments, spanning four network topologies and three tournaments per topology, yielding the best part of 40,000 matches:

| Agent | Avg #Wins / Tour | SD #Wins / Tour | Avg #Losses / Tour | SD #Losses / Tour | Total Wins | Total Losses | Winrate |
|---|---|---|---|---|---|---|---|
| **Student 1** | 108.167 | 1.193 | 3.833 | 1.193 | 1298 | 46 | 0.9658 |
| **Student 2** | 104.917 | 2.539 | 7.083 | 2.539 | 1259 | 85 | 0.9368 |
| **Student 3** | 103.917 | 2.466 | 8.083 | 2.466 | 1247 | 97 | 0.9278 |
| **Student 4** | 103.25 | 1.815 | 8.75 | 1.815 | 1239 | 105 | 0.9219 |
| **Student 5** | 96.5 | 2.908 | 15.5 | 2.908 | 1158 | 186 | 0.8616 |
| LLM(O, IR, 1) | 95.417 | 2.314 | 16.583 | 2.314 | 1145 | 199 | 0.8519 |
| LLM(O, A2, 1) | 94.583 | 2.314 | 17.417 | 2.314 | 1135 | 209 | 0.8445 |
| **Student 6** | 93.167 | 1.899 | 18.833 | 1.899 | 1118 | 226 | 0.8318 |
| **Student 7** | 93.167 | 3.563 | 18.833 | 3.563 | 1118 | 226 | 0.8318 |
| LLM(O, A1, 1) | 86.083 | 3.029 | 25.917 | 3.029 | 1033 | 311 | 0.7686 |
| LLM(O, GEN, 2) | 84.083 | 6.947 | 27.917 | 6.947 | 1009 | 335 | 0.7507 |
| LLM(O, CR, 2) | 83.5 | 4.442 | 28.5 | 4.442 | 1002 | 342 | 0.7455 |
| **Student 8** | 83.417 | 4.122 | 28.583 | 4.122 | 1001 | 343 | 0.7448 |
| **RiskSeeking** | 82.417 | 3.343 | 29.583 | 3.343 | 989 | 355 | 0.7359 |
| LLM(O, GEN, 1) | 80.667 | 4.355 | 31.25 | 4.372 | 968 | 375 | 0.7208 |
| **ModelOpponent** | 80.583 | 3.26 | 31.417 | 3.26 | 967 | 377 | 0.7195 |
| LLM(D, A1, 1) | 79.417 | 3.965 | 32.583 | 3.965 | 953 | 391 | 0.7091 |
| **ExpCostFixedBid** | 77.167 | 4.951 | 34.833 | 4.951 | 926 | 418 | 0.689 |
| LLM(O, IR, 2) | 73.917 | 3.502 | 38 | 3.618 | 887 | 456 | 0.6605 |
| LLM(O, A1, 2) | 72.417 | 2.193 | 39.583 | 2.193 | 869 | 475 | 0.6466 |
| LLM(G, A1, 2) | 68.5 | 3.555 | 43.5 | 3.555 | 822 | 522 | 0.6116 |
| LLM(A, GEN, 2) | 67.917 | 2.968 | 44.083 | 2.968 | 815 | 529 | 0.6064 |
| LLM(G, IR, 2) | 65.917 | 2.314 | 46.083 | 2.314 | 791 | 553 | 0.5885 |
| **Student 9** | 64.167 | 11.044 | 47.833 | 11.044 | 770 | 574 | 0.5729 |
| LLM(G, A1, 1) | 64 | 4.243 | 47.917 | 4.316 | 768 | 575 | 0.5719 |
| LLM(G, IR, 1) | 60.333 | 3.725 | 51.667 | 3.725 | 724 | 620 | 0.5387 |
| LLM(O, A2, 2) | 59.333 | 4.499 | 52.667 | 4.499 | 712 | 632 | 0.5298 |
| LLM(D, CR, 1) | 55.083 | 6.694 | 56.833 | 6.59 | 661 | 682 | 0.4922 |
| LLM(G, GEN, 2) | 53.167 | 3.664 | 58.833 | 3.664 | 638 | 706 | 0.4747 |
| LLM(D, GEN, 2) | 52.083 | 9.06 | 59.917 | 9.06 | 625 | 719 | 0.465 |
| **Honest** | 50.583 | 3.848 | 61.417 | 3.848 | 607 | 737 | 0.4516 |
| **Student 10** | 48.833 | 2.98 | 63.167 | 2.98 | 586 | 758 | 0.436 |
| LLM(D, IR, 1) | 48.583 | 10.211 | 63.417 | 10.211 | 583 | 761 | 0.4338 |
| LLM(A, A1, 1) | 48 | 4.69 | 64 | 4.69 | 576 | 768 | 0.4286 |
| LLM(G, A2, 1) | 47.25 | 3.864 | 64.75 | 3.864 | 567 | 777 | 0.4219 |
| LLM(A, CR, 1) | 43.833 | 4.609 | 68.167 | 4.609 | 526 | 818 | 0.3914 |
| LLM(A, A1, 2) | 43.75 | 2.05 | 68.25 | 2.05 | 525 | 819 | 0.3906 |
| **Student 11** | 42.083 | 5.664 | 69.917 | 5.664 | 505 | 839 | 0.3757 |
| LLM(A, IR, 1) | 39.5 | 2.541 | 72.5 | 2.541 | 474 | 870 | 0.3527 |
| **Naive** | 36.75 | 1.712 | 75.25 | 1.712 | 441 | 903 | 0.3281 |
| **Student 12** | 36.333 | 1.775 | 75.667 | 1.775 | 436 | 908 | 0.3244 |
| LLM(D, A2, 1) | 33.917 | 2.193 | 78.083 | 2.193 | 407 | 937 | 0.3028 |
| LLM(A, GEN, 1) | 30.167 | 1.749 | 81.833 | 1.749 | 362 | 982 | 0.2693 |
| LLM(D, A2, 2) | 29.833 | 2.038 | 82.167 | 2.038 | 358 | 986 | 0.2664 |
| LLM(G, A2, 2) | 27 | 2.256 | 85 | 2.256 | 324 | 1020 | 0.2411 |
| LLM(A, A2, 1) | 26.333 | 0.985 | 85.667 | 0.985 | 316 | 1028 | 0.2351 |
| LLM(O, CR, 1) | 25 | 3.411 | 87 | 3.411 | 300 | 1044 | 0.2232 |
| LLM(A, IR, 2) | 24.333 | 8.542 | 87.667 | 8.542 | 292 | 1052 | 0.2173 |
| LLM(A, A2, 2) | 24 | 1.809 | 88 | 1.809 | 288 | 1056 | 0.2143 |
| LLM(A, CR, 2) | 23.333 | 1.557 | 88.667 | 1.557 | 280 | 1064 | 0.2083 |
| LLM(D, GEN, 1) | 22.5 | 1.784 | 89.5 | 1.784 | 270 | 1074 | 0.2009 |
| LLM(D, A1, 2) | 13.333 | 1.826 | 98.667 | 1.826 | 160 | 1184 | 0.119 |
| LLM(G, CR, 1) | 9.5 | 1.087 | 102.5 | 1.087 | 114 | 1230 | 0.0848 |
| LLM(G, GEN, 1) | 9.167 | 0.937 | 102.833 | 0.937 | 110 | 1234 | 0.0818 |
| LLM(D, IR, 2) | 7.75 | 0.622 | 104.25 | 0.622 | 93 | 1251 | 0.0692 |
| LLM(G, CR, 2) | 7.25 | 1.422 | 104.75 | 1.422 | 87 | 1257 | 0.0647 |
| LLM(D, CR, 2) | 5.667 | 0.985 | 106.333 | 0.985 | 68 | 1276 | 0.0506 |

For context, each agent played 112 matches per tournament, so the maximum possible average for wins or losses per agent is 112. Standard deviation (SD) reflects variability across tournaments. Human-coded agents appear in bold. LLM-coded agents are labeled by model (O = GPT-5 Thinking, G = Gemini 2.5 Pro, A = Claude Opus 4.1, D = DeepSeek R1), followed by a two-letter prompt strategy code and a digit indicating whether the agent is the first or second generated with that prompt. Source
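As a worked example of reading the table: LLM(O, IR, 1) is the first agent generated by GPT-5 Thinking under the prompt strategy coded ‘IR’, and the Winrate column is simply total wins divided by total matches played (112 per tournament × 12 tournaments = 1,344):

```python
total_wins, total_losses = 1298, 46              # Student 1's row
assert total_wins + total_losses == 112 * 12     # 1,344 matches over 12 tournaments
print(round(total_wins / (total_wins + total_losses), 4))   # 0.9658, as in the table
```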

In regard to the results shown above, the authors state:

‘LLMs did not generate expected/competitive code even in simpler variants of the APDP problem (despite the code being largely syntax-bug-free). This underlines the importance of reasoning-driven code evaluation benchmarks that go beyond auto-complete and identify new weaknesses of LLMs.’

‘Our results demonstrate a clear superiority of human-coded agents: (i) The top 5 spots are consistently held by student agents, and (ii) the majority of LLM agents (33 out of 40) are beaten by very simple baseline agents (such as the expected cost fixed bid).

‘Importantly, we did not debug the student code (while we thoroughly tested/debugged the LLM code, both in self-play and tournament [settings]). Every time a student agent crashed, we automatically gave the win to the LLM. A large number of these crashes would be easy to fix (e.g., agents timed-out), thus student agents could potentially rank even higher.’

As a further experiment, GPT-5 Thinking was prompted to improve the code of the top-performing human agent, Student 1; but the LLM-modified agent subsequently fell to tenth place, the worst of all the human scores. Instead of enhancing the solution, the LLM’s changes degraded it by nearly 20%.

The authors conclude:

‘[Our] results highlight important limitations of LLM code generation, most notably their limited reasoning and planning capabilities while generating [code]. Modern LLMs are able to provide syntax-bug-free code that runs, but that is not the benchmark we should be using to measure progress towards advanced general AI.’

Conclusion

The authors themselves observe toward the close of the paper that vibe-coding has empowered people of all technical backgrounds, and characterize the practice in a positive light, as a leveling force. However, they also imply that because vibe-coding has only just arrived, its limits are not yet well understood, and may be assumed to be far less restrictive than they really are.

They close their offering by calling for a goal-shift 'from code that compiles to code that competes'.

One question that the casual reader of this interesting new paper may have is whether the authors are punching up or down, since the agentic task in question is considerably more complex and involved than spitting out PowerShell scripts and other forms of minor functionality and fixes for which vibe-coding is well-suited.

 

* Please note that the paper refers throughout to 'DeepThink R1', which appears to be non-existent, turning up only a handful of references on the internet (presumably from other authors who have mis-written 'DeepSeek R1'). If this is my error, please contact me via my profile details, and I will amend.

Authors' emphasis, not mine.

First published Wednesday, November 26, 2025. Amended 17:35 est for formatting.

Writer on machine learning, domain specialist in human image synthesis. Former head of research content at Metaphysic.ai.
Personal site: martinanderson.ai
Contact: [email protected]
Twitter: @manders_ai