“Long-term #coherence in #agents is more important than ever. #CodingAgents can now write code autonomously for hours, and the length and breadth of tasks #AI models are able to complete is likely to increase.
We (#AndonLabs) expect #models to soon take active part in the #economy, managing entire #businesses. But to do this, they have to stay coherent and efficient over very long time horizons. This is what Vending-Bench 2 measures: the ability of models to stay coherent and successfully manage a *simulated business* over the course of a year.”
A great, hard problem. Looking at the key metric, though, models seem to be evaluated only on profit (an assertion worth double-checking). What could possibly go wrong? 🤖🤪


