Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents
arxiv.org
While Large Language Models (LLMs) can exhibit impressive proficiency in isolated, short-term tasks, they often fail to maintain coherent performance over longer time horizons. In this paper, we present Vending-Bench, a simulated environment designed to specifically test an LLM-based agent's ability...