Large language models are powering a new wave of digital agents that handle complex web-based tasks. These agents are expected to interpret user instructions, navigate interfaces, and execute multi-step commands in changing environments. The difficulty lies not in understanding language but in translating that understanding into precise, well-sequenced actions while adapting to a dynamic environment. The success of long-horizon tasks, such as booking a trip or retrieving specific web data, depends on managing a series of steps that evolves with each action. Despite significant progress in language capabilities, building an agent that can effectively plan and adapt at every step remains an unsolved problem.
The central problem in building such agents is translating broad goals into concrete actions. When a user asks an agent to “follow the top contributor to this GitHub project”, the agent must interpret the command, navigate to the contributors section, identify the relevant person, and initiate the follow action. The task becomes harder in dynamic environments where content may shift between executions. Without a clear planning and updating strategy, an agent can make inconsistent decisions or fail entirely. The scarcity of training data that demonstrates how to plan and execute long tasks correctly adds another layer of difficulty.
Previously, researchers tried to address these problems with models that either rely on single-agent strategies or apply reinforcement learning to guide actions. Single-agent systems such as ReAct interleave reasoning and execution in one model, which often overloads it, since the model must think and act at the same time. Reinforcement learning methods have shown promise but proved unstable and highly sensitive to environment-specific tuning. Collecting training data for these methods requires extensive interaction with the environment, making it time-consuming and impractical. These methods also struggle to maintain consistent performance when task workflows change.
Researchers at the University of California, Berkeley, the University of Tokyo, and ICSI have introduced a new framework called Plan-and-Act, with support from companies including Apple, NVIDIA, Microsoft, and Intel. The framework divides the task into two modules: a Planner and an Executor. The Planner’s job is to create a structured plan from the user’s request, essentially outlining which steps need to be taken. The Executor then translates each step into environment-specific actions. By separating these responsibilities, the system lets the Planner focus on strategy while the Executor handles execution, improving the reliability of both components. This modular design marks a significant shift from previous approaches.
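The planner/executor split described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper’s actual code: the `llm` callable, the prompt wording, and the `env` interface are all assumptions made for the example.

```python
# Hypothetical sketch of the Planner/Executor separation described above.
# `llm` is any text-in/text-out model call; `env` exposes observe()/act().

def plan(llm, user_request: str) -> list[str]:
    """Planner: turn a user request into an ordered list of high-level steps."""
    prompt = f"Break this task into high-level steps:\n{user_request}"
    return llm(prompt).splitlines()

def execute(llm, step: str, page_state: str) -> str:
    """Executor: ground one high-level step into a concrete environment action."""
    prompt = f"Page:\n{page_state}\n\nStep to perform: {step}\nAction:"
    return llm(prompt)

def run_agent(llm, user_request: str, env) -> None:
    """Plan once, then execute each step against the live environment."""
    for step in plan(llm, user_request):
        action = execute(llm, step, env.observe())
        env.act(action)
```

The key design point is that the Planner never sees raw page HTML and the Executor never sees the global goal; each model handles only the part of the problem it is specialized for.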
The methodology behind Plan-and-Act is detailed, with a focus on scalable training. Because human-annotated data is limited, the researchers introduced a synthetic data generation pipeline. They first collect action trajectories from agents interacting with the environment: the sequences of clicks, inputs, and responses. Large language models then analyze these trajectories to reconstruct the high-level plans that explain the observed results. For example, a plan might specify identifying the top contributor, and the actions grounded to that step would involve clicking the Contributors tab and parsing the resulting HTML. The team expanded the dataset with 10,000 additional synthetic plans, and then generated 5,000 more targeted plans based on failure analysis. This synthetic training approach saves time and produces high-quality data that reflects real execution needs.
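The core of this pipeline, reverse-engineering a high-level plan from a recorded trajectory, can be sketched as follows. The function names, prompt text, and episode format are illustrative assumptions, not the authors’ implementation.

```python
# Illustrative sketch of the synthetic-plan pipeline: given a recorded
# action trajectory, ask an LLM to reconstruct the high-level plan that
# explains it. All names and prompt wording here are assumptions.

def annotate_trajectory(llm, goal: str, trajectory: list[str]) -> str:
    """Reverse-engineer a high-level plan from a successful trajectory."""
    actions = "\n".join(trajectory)
    prompt = (
        f"Goal: {goal}\n"
        f"Observed actions:\n{actions}\n"
        "Write the high-level plan these actions carried out:"
    )
    return llm(prompt)

def build_dataset(llm, episodes):
    """episodes: iterable of (goal, trajectory) pairs collected from an agent.
    Returns (goal, reconstructed_plan, trajectory) training triples."""
    return [(goal, annotate_trajectory(llm, goal, traj), traj)
            for goal, traj in episodes]
```

Because the plan is grounded in actions that actually succeeded, the resulting (goal, plan, trajectory) triples reflect real execution needs rather than guessed step sequences.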
In testing, Plan-and-Act achieved a task success rate of 53.94% on the WebArena-Lite benchmark, surpassing the previous best result of 49.1% set by WebRL. Without any planner, the base executor reached only 9.85%. Adding a planner without fine-tuning raised performance to 29.63%, and fine-tuning on the 10,000 synthetic plans brought the result up to 44.24%. Incorporating dynamic replanning added a final 10.31% performance gain. Across all experiments, the data showed that most of the improvement came from strengthening the Planner rather than the Executor. Even with a basic Executor, a strong Planner raised success rates, validating the researchers’ hypothesis that separating planning from execution yields better task outcomes.
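The dynamic replanning that contributed the final gain can be sketched as a simple loop: instead of committing to one upfront plan, the planner is re-invoked after every step with the latest observation. The interfaces below are hypothetical, chosen only to make the idea concrete.

```python
# Minimal sketch of dynamic replanning: re-plan after every executed step
# so the plan can adapt when the page changes. Hypothetical interfaces,
# not the paper's code.

def run_with_replanning(planner, executor, env, goal, max_steps=20):
    """planner(goal, observation, history) -> remaining steps ([] when done);
    executor(step, observation) -> concrete action string."""
    history = []
    for _ in range(max_steps):
        steps = planner(goal, env.observe(), history)  # fresh plan each turn
        if not steps:  # planner signals the goal is complete
            break
        action = executor(steps[0], env.observe())
        env.act(action)
        history.append((steps[0], action))
    return history
```

The contrast with the static loop is that a stale plan (for example, one referencing an element that disappeared after a page reload) is discarded and rebuilt from the current observation rather than executed blindly.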
In summary, this work highlights how bridging the gap between goal understanding and environment interaction leads to more effective AI systems. By focusing on structured planning and scalable data generation, the researchers proposed a solution to a specific problem and demonstrated a framework that can extend to a wider range of applications. Plan-and-Act shows that effective planning, not just execution, is crucial to the success of AI agents in complex environments.
Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 85k+ ML SubReddit.

Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who researches applications in fields such as biomaterials and biomedical science. With a strong background in materials science, he is exploring new advancements and creating opportunities to contribute.