Hacker News @lemmy.bestiver.se RSS Bot @lemmy.bestiver.se

Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents

arxiv.org

While Large Language Models (LLMs) can exhibit impressive proficiency in isolated, short-term tasks, they often fail to maintain coherent performance over longer time horizons. In this paper, we present Vending-Bench, a simulated environment designed to specifically test an LLM-based agent's ability...
