They Promised Us Agents, but All We Got Were Static Chains
In the spring of 2023, the world got excited about the emergence of LLM-based AI agents. Powerful demos like AutoGPT and BabyAGI demonstrated the potential of LLMs running in a loop, choosing the next action, observing its results, and choosing the next action, one step at a time (also known as the ReACT framework). This new method was expected to power agents that autonomously and generically perform multi-step tasks. Give it an objective and a set of tools and it will take care of the rest. By the end of 2024, the landscape will be full of AI agents and AI agent-building frameworks. But how do they measure against the promise?
It is safe to say that the agents powered by the naive ReACT framework suffer from severe limitations. Give them a task that requires more than a few steps and more than a few tools, and they will fail miserably. Beyond their obvious latency issues, they will lose track, fail to follow instructions, stop too early or stop too late, and produce wildly different results on each attempt. And it is no wonder. The ReACT framework takes the limitations of unpredictable LLMs and compounds them by the number of steps. However, agent builders looking to solve real-world use cases, especially in the enterprise, cannot make do with that level of performance. They need reliable, predictable, and explainable results for complex multi-step workflows. And they need AI systems that mitigate, rather than exacerbate, the unpredictable nature of LLMs.
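To make the compounding effect concrete, here is a rough back-of-the-envelope sketch. It assumes that every step succeeds independently with the same probability, which is a simplification, and the 95% per-step figure is purely illustrative rather than a measured number.

```python
def chain_success_probability(step_success: float, num_steps: int) -> float:
    """Probability that all steps succeed, assuming independent, identical steps."""
    if not 0.0 <= step_success <= 1.0:
        raise ValueError("step_success must be between 0 and 1")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return step_success ** num_steps


if __name__ == "__main__":
    # Illustrative only: a step that works 95% of the time looks reliable in isolation.
    for steps in (1, 5, 10, 20):
        probability = chain_success_probability(0.95, steps)
        print(f"{steps:>2} steps -> {probability:.3f} chance the whole run succeeds")
```

Under these assumptions, a 20-step run built from 95%-reliable steps completes successfully only about a third of the time, which is why per-step reliability matters so much for long workflows.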
So how are agents built in the enterprise today? For use cases that require more than a few tools and a few steps (e.g. conversational RAG), today agent builders have largely abandoned the dynamic and autonomous promise of ReACT for methods that heavily rely on static chaining – the creation of predefined chains designed to solve a specific use case. This approach resembles traditional software engineering and is far from the agentic promise of ReACT. It achieves higher levels of control and reliability but lacks autonomy and flexibility. Solutions are therefore development intensive, narrow in application, and too rigid to address high levels of variation in the input space and the environment.
To be sure, static chaining practices can vary in how “static” they are. Some chains use LLMs only to perform atomic steps (for example, to extract information, summarize text, or draft a message) while others also use LLMs to make some decisions dynamically at runtime (for example, an LLM routing between alternative flows in the chain or an LLM validating the outcome of a step to determine whether it should be run again). In any event, as long as LLMs are responsible for any dynamic decision-making in the solution, we are inevitably caught in a tradeoff between reliability and autonomy. The more static a solution is, the more reliable and predictable it is, but also the less autonomous, and therefore the narrower in application and the more development-intensive. The more dynamic and autonomous a solution is, the more generic and simple to build it is, but also the less reliable and predictable.
This tradeoff can be represented in the following graphic:
This raises the question: why have we yet to see an agentic framework that can be placed in the upper right quadrant? Are we doomed to forever trade off reliability for autonomy? Can we not get a framework that provides the simple interface of a ReACT agent (take an objective and a set of tools and figure it out) without sacrificing reliability?
The answer is – we can and we will! But for that, we need to realize that we’ve been doing it all wrong. All current agent-building frameworks share a common flaw: they rely on LLMs as the dynamic, autonomous component. However, the crucial element we’re missing—what we need to create agents that are both autonomous and reliable—is planning technology. And LLMs are NOT great planners.
But first, what is “planning”? By “planning” we mean the ability to explicitly model alternative courses of action that lead to a desired result and to efficiently explore and exploit these alternatives under budget constraints. Planning should be done at both the macro and micro levels. A macro-plan breaks down a task into dependent and independent steps that must be executed to achieve the desired outcome. What is often overlooked is the need for micro-planning, which aims to guarantee desired outcomes at the step level. There are many available strategies for increasing reliability and achieving guarantees at the single-step level by using more inference-time compute. For example, you can paraphrase semantic search queries multiple times, retrieve more context for a given query, use a larger model, or draw more inferences from an LLM, all of which yield more requirements-satisfying results from which to choose the best one. A good micro-planner can efficiently use inference-time compute to achieve the best results under a given compute and latency budget, scaling the resource investment to the particular task at hand. That way, planful AI systems can mitigate the probabilistic nature of LLMs to achieve guaranteed outcomes at the step level. Without such guarantees, we’re back to the compounding error problem that will undermine even the best macro-level plan.
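One very simple form of micro-planning is best-of-n sampling with early stopping under a cost budget. The sketch below is only an illustration of that idea, not the planner this post has in mind: the function `best_of_n_under_budget`, the fixed per-call cost, the `good_enough` threshold, and the canned drafts and scores are all hypothetical stand-ins for a real generator and a real validator.

```python
from typing import Callable, Optional, Tuple


def best_of_n_under_budget(
    generate: Callable[[int], str],   # hypothetical: returns the i-th candidate
    score: Callable[[str], float],    # hypothetical: validator score in [0, 1]
    cost_per_call: float,
    budget: float,
    good_enough: float,
) -> Tuple[Optional[str], float, float]:
    """Sample candidates until one is good enough or the budget runs out."""
    best: Optional[str] = None
    best_score = float("-inf")
    spent = 0.0
    attempt = 0
    while spent + cost_per_call <= budget:
        candidate = generate(attempt)
        spent += cost_per_call
        attempt += 1
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
        if best_score >= good_enough:
            break  # stop early: no need to spend the remaining budget
    return best, best_score, spent


# Canned stand-ins for an LLM call and a validator (assumptions for the demo).
DRAFTS = ["draft A", "draft B", "draft C"]
SCORES = {"draft A": 0.4, "draft B": 0.95, "draft C": 0.7}


def demo_generate(attempt: int) -> str:
    return DRAFTS[attempt % len(DRAFTS)]


def demo_score(candidate: str) -> float:
    return SCORES[candidate]


if __name__ == "__main__":
    print(best_of_n_under_budget(demo_generate, demo_score, 1.0, 3.0, 0.9))
```

The point of the sketch is the shape of the decision: an easy step stops after one or two calls, while a hard step consumes the full budget. A real micro-planner would also choose between strategies (larger model, more context, more samples) rather than repeating a single one.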
But why can’t LLMs serve as planners? After all, they are capable of translating high-level instructions into reasonable chains of thought or plans defined in natural language or code. The reason is that planning requires more than that. Planning requires the ability to model alternative courses of action that may reasonably lead to the desired outcome AND to reason about the expected utility and expected costs (in compute and/or latency) of each alternative. While LLMs can potentially generate representations of available courses of action, they cannot predict their corresponding expected utility and costs. For example, what are the expected utility and costs of using model X vs. model Y to generate an answer for a particular context? What is the expected utility of looking for a particular piece of information in the indexed documents corpus vs. an API call to the CRM? Your LLM doesn’t begin to have a clue. And for good reason: historical traces of these probabilistic traits are rarely found in the wild and are not included in LLM training data. They also tend to be specific to the particular tool and data environment in which the AI system will operate, unlike the general knowledge that LLMs can acquire. And even if LLMs could predict expected utility and costs, reasoning about them to choose the most effective course of action is a logical, decision-theoretic deduction that cannot be assumed to be reliably performed by LLMs’ next-token predictions.
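The decision-theoretic step described above can be sketched in a few lines once the estimates exist. In this minimal illustration, the `Action` type, the action names, and every probability, value, and cost are invented for the example; obtaining trustworthy estimates for a specific tool and data environment is exactly the hard part the paragraph says LLMs cannot supply.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Action:
    name: str
    success_probability: float  # assumed to be learned from experience or simulation
    value_if_success: float     # utility of a successful outcome, arbitrary units
    cost: float                 # compute and/or latency cost, same units


def expected_net_utility(action: Action) -> float:
    """Expected utility of the action minus what it costs to try it."""
    return action.success_probability * action.value_if_success - action.cost


def choose_action(actions: Sequence[Action], budget: float) -> Optional[Action]:
    """Pick the affordable action with the highest expected net utility."""
    affordable = [action for action in actions if action.cost <= budget]
    if not affordable:
        return None
    return max(affordable, key=expected_net_utility)


# Hypothetical numbers for the two alternatives mentioned in the text.
ALTERNATIVES = [
    Action("search_indexed_documents", success_probability=0.6, value_if_success=10.0, cost=1.0),
    Action("call_crm_api", success_probability=0.9, value_if_success=10.0, cost=3.0),
]

if __name__ == "__main__":
    for budget in (5.0, 2.0):
        chosen = choose_action(ALTERNATIVES, budget)
        print(budget, chosen.name if chosen else None)
```

With a generous budget the more expensive but more reliable CRM call wins; with a tight budget the planner falls back to document search. The arithmetic is trivial, which is the point: the selection is deterministic logic over explicit estimates, not something to leave to next-token prediction.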
So what are the missing ingredients for AI planning technology? We need planner models that can learn from experience and simulation to explicitly model alternative courses of action and corresponding utility and cost probabilities per a particular task in a particular tool and data environment. We need a Plan Definition Language (PDL) that can be used to represent and reason about said courses of action and probabilities. We need an execution engine that can deterministically and efficiently execute a given plan defined in PDL.
Some people are already hard at work on delivering on this promise. Until then, keep building static chains. Just please don’t call them “agents”.
LLMs as Operating Systems
Last month the Microsoft Research team shared some insights in a generative application framework called AutoGen that casts a spotlight on so many intriguing possibilities in AI. Large Language Models (LLMs) like GPT-4 can step up beyond simple tasks
A Comprehensive Guide to Autogpt
While ChatGPT is still enjoying the attention it has garnered in recent months, AI agents that were born from the tool are already giving it stiff competition. AutoGPT, a self-generating AI agent that runs independently, is the newest player to join this league. Check out this blog for a comprehensive guide to AutoGPT and its relevance for the modern world.
AgentGPT, BabyGPT and AutoGPT - what is the difference?
These are semi-autonomous "agents" that can be given high-level goals, such as "make a website for selling books online", and can figure out the high-level tasks (front-end HTML site development, then the backend database, and so on) and execute each of the tasks and subtasks. They are all the same at a high level: each uses recursive mechanisms to help GPT create prompts for GPT (so meta).
Are you ready to use multiple AI agents with one click?
Can AI automate computational reproducibility?
New Post has been published on https://thedigitalinsider.com/can-ai-automate-computational-reproducibility/
Last month, Sakana AI released an “AI scientist”, which the company called “the first comprehensive system for fully automatic scientific discovery”. It was touted as being able to accelerate science without suffering from human limitations. 
Unfortunately, the “AI Scientist” has many shortcomings. It has no checks for novelty, so generated papers could rehash earlier work. And Sakana did not perform any human review (let alone expert “peer” review) of the generated papers—so it is unclear if the papers are any good (apparently they are not). While these flaws are particularly flagrant in Sakana’s case, the lack of good evaluation affects most AI agents, making it hard to measure their real-world impact.
Today, we introduce a new benchmark for measuring how well AI can reproduce existing computational research. We also share how this project has changed our thinking about “general intelligence” and the potential economic impact of AI. Read the paper.
Visions of AI automating science are enticing, but aren’t within reach, and lead to flawed science. In contrast, using AI for well-scoped tasks such as verifying computational reproducibility can save a lot of time and redirect effort towards more productive scientific activity. AI could also help find relevant literature, write code to rapidly test ideas, and perform other computational tasks.
In a new paper, we introduce CORE-Bench (Computational Reproducibility Agent Benchmark), a benchmark for measuring how well AI can automate computational reproducibility, that is, reproducing a paper’s findings when the code and data are available. The authors are Zachary S. Siegel, Sayash Kapoor, Nitya Nadgir, Benedikt Stroebl, and Arvind Narayanan. CORE-Bench is a first step in a larger project to rigorously evaluate progress in automating research tasks of increasing difficulty.
Computationally reproducing a study is a far more limited task than replication, which requires re-running experiments that might involve human subjects. Even the limited reproducibility task is hard: In the 2022 Machine Learning Reproducibility Challenge, over a third of the papers could not be reproduced even when experts reproducing the papers had the code and data. 
If AI could automate this mundane yet important task, researchers could automate the implementation of baselines, reviewers could more easily assess if a paper has flaws, and journals and conferences could more easily verify if submitted and published papers are reproducible.
We created CORE-Bench using scientific papers and their accompanying code and data repositories. We used Code Ocean to source papers that were likely to be reproducible. We manually reproduced 90 papers from computer science, medicine, and social science, and curated a set of questions for each paper to be able to verify the answers. 
We release CORE-Bench with three difficulty levels. Tasks in all three levels require the use of both language and vision capabilities. The hardest version closely resembles real-world reproduction attempts, and we expect that improvements on the benchmark will translate to agents that are actually useful to scientists.
To implement baselines, we tested the generalist AutoGPT agent and also implemented a task-specific modification to AutoGPT, which we call CORE-Agent. While the task-specific version improved accuracy significantly, there is still massive room for improvement: the best agent (CORE-Agent with GPT-4o) has an accuracy of 22% on CORE-Bench-Hard.
Computational reproducibility requires setting up the code environment correctly, running the code, and seeing if it produces the same results as reported in the paper. Using the shell and other tools correctly is still tricky for LLMs. When we evaluated generalist agents like AutoGPT, we weren’t surprised by their poor accuracy (less than 10% on CORE-Bench-Hard). 
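The last part of that procedure, checking whether a rerun produces the same results as reported, can be sketched as a tolerance comparison. This is not CORE-Bench's scoring method; the function name `reproduction_report`, the metric names, the values, and the 1% relative tolerance are all hypothetical choices made for illustration.

```python
import math
from typing import Dict


def reproduction_report(
    reported: Dict[str, float],
    reproduced: Dict[str, float],
    rel_tol: float = 0.01,  # hypothetical tolerance: within 1% counts as a match
) -> Dict[str, bool]:
    """Mark each reported metric as reproduced or not.

    A metric that is missing from the rerun counts as not reproduced.
    """
    report: Dict[str, bool] = {}
    for name, expected in reported.items():
        actual = reproduced.get(name)
        report[name] = actual is not None and math.isclose(
            actual, expected, rel_tol=rel_tol
        )
    return report


if __name__ == "__main__":
    # Invented numbers standing in for a paper's table and a rerun's output.
    paper_values = {"accuracy": 0.912, "f1": 0.80, "auc": 0.95}
    rerun_values = {"accuracy": 0.910, "f1": 0.70}
    print(reproduction_report(paper_values, rerun_values))
```

In practice the comparison is the easy part; as the paragraph notes, getting the environment set up and the code running is where agents tend to struggle.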
Yet, with a few person-days of effort, we were able to build CORE-Agent by modifying AutoGPT, which more than doubled accuracy on the hardest level. We also built a task-specific agent from scratch, but modifying AutoGPT was far less time consuming while also resulting in a stronger agent. We are cautiously optimistic that this approach can be pushed to yield agents that perform well enough to be useful in practice. 
Simple task-specific modifications allow CORE-Agent to outperform AutoGPT. 
If this pattern of being able to easily adapt a generalist agent to produce a task-specific agent holds in other areas, it should make us rethink generality. Generality roughly translates to being able to use the same model or agent without modification to perform a variety of tasks. This notion of generality underpins how Artificial General Intelligence (or AGI) is usually understood and the hopes and fears that accompany it. 
But at least from the point of view of economic impacts, generality might be a red herring. For a task such as computational reproducibility on which expert humans collectively spend millions of hours every year, being able to automate it would be hugely impactful — regardless of whether the AI system did so out of the box, or after a few person days (or even a person year) of programmer effort. 
In the AI Snake Oil book, we define generality as the inverse of task-specificity, and analyze how the history of AI (and computing) can be seen as the pursuit of gradually increasing generality. Increasing generality means decreasing the human effort it takes to build an AI system to perform a given task. From this perspective, systems like AutoGPT may be more general than most people (including us) gave them credit for.
Yet, definitions of AGI typically insist that a single system be able to do everything out of the box. There is no systematic effort to track how the human effort needed to build task-specific AI is changing over time. Just as we’ve argued against flawed conceptions of generality that overestimate AI progress, we should avoid flawed conceptions of generality that underestimate it. 
Read the CORE-Bench paper here.
In our recent paper, AI Agents That Matter, we found several shortcomings with AI agent evaluations. While building CORE-Bench, these shortcomings informed the design of our benchmark.
We recently organized an online workshop on useful and reliable AI agents where leading experts shared their views on better agent design and evaluation. The workshop videos are available online.
Ben Bogin et al. released the SUPER benchmark to evaluate if AI agents can set up and execute tasks from repositories accompanying research papers. It is another interesting benchmark for measuring AI agents’ capability to automate research tasks. It differs from CORE-Bench in many ways: 
CORE-Bench consists of tasks across scientific disciplines (computer science, medicine, social science) whereas SUPER consists of tasks from AI.
CORE-Bench requires the use of both vision-language and language models, and consists of multiple languages (Python and R) as opposed to SUPER (language models, Python).
Tasks in SUPER require access to a Jupyter notebook. In contrast, tasks in CORE-Bench require shell access and allow the agent to modify the sandbox arbitrarily.
