The use of chatbots like ChatGPT and Claude has risen dramatically in the past three years because they can help with a wide range of tasks. Whether you’re writing a Shakespearean sonnet, debugging code, or answering an obscure trivia question, AI systems seem to have you covered. The source of this versatility? The billions, if not trillions, of text data points available on the Internet.
However, that data is not enough to teach a robot to be a useful assistant in the home or factory. To understand how to handle, stack, and place various objects across different environments, robots need demonstrations. You can think of robot training data as a collection of how-to videos that walk the systems through each motion of a task. Collecting these demonstrations on real robots is time-consuming and not perfectly repeatable, so engineers have created training data by using AI to generate simulations (which often fail to reflect real-world physics), or by tediously handcrafting each digital environment from scratch.
Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and the Toyota Research Institute may have found a way to create the diverse, realistic training grounds robots need. Their “steerable scene generation” approach creates digital scenes of places like kitchens, living rooms, and restaurants that engineers can use to simulate a wide range of real-world interactions and scenarios. Trained on more than 44 million 3D rooms filled with models of objects such as tables and plates, the tool places existing assets in new scenes and then refines each one into a physically accurate, lifelike environment.
Steerable scene generation creates these 3D worlds by “steering” a diffusion model, an AI system that generates visuals from random noise, toward scenes you would find in everyday life. The researchers use this generative system to “in-paint” an environment, filling in particular elements throughout the scene. Picture a blank canvas suddenly transforming into a kitchen dotted with 3D objects, which gradually rearrange into a scene that mimics real-world physics. For example, the system ensures that a fork does not pass through a bowl on a table, a common glitch in 3D graphics known as “clipping,” in which models overlap or intersect.
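To make the “clipping” check concrete, here is a minimal, hypothetical sketch in Python of how a generated scene could be screened for interpenetrating objects using axis-aligned bounding boxes. The class and function names are invented for illustration; the actual system relies on a full physics-based scene representation rather than this simplification.

```python
from dataclasses import dataclass

@dataclass
class AABB:
    """Axis-aligned bounding box: minimum and maximum corners, in meters."""
    min_xyz: tuple
    max_xyz: tuple

def boxes_overlap(a: AABB, b: AABB, tol: float = 1e-4) -> bool:
    # Two boxes interpenetrate only if they overlap on all three axes.
    return all(
        a.min_xyz[i] + tol < b.max_xyz[i] and b.min_xyz[i] + tol < a.max_xyz[i]
        for i in range(3)
    )

def scene_is_clipping_free(boxes: list) -> bool:
    # Reject a candidate scene if any pair of placed objects overlaps.
    return not any(
        boxes_overlap(boxes[i], boxes[j])
        for i in range(len(boxes))
        for j in range(i + 1, len(boxes))
    )

# Example: a fork resting on top of a bowl (touching, not intersecting) passes.
bowl = AABB(min_xyz=(0.00, 0.00, 0.00), max_xyz=(0.20, 0.20, 0.08))
fork = AABB(min_xyz=(0.05, 0.05, 0.08), max_xyz=(0.25, 0.07, 0.10))
print(scene_is_clipping_free([bowl, fork]))  # True
```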
How exactly steerable scene generation steers its creations toward realism depends on the strategy you choose. Its main strategy is Monte Carlo tree search (MCTS), in which the model builds a series of alternative scenes, filling them out in different ways toward a particular objective (such as making a scene more physically realistic, or including as many edible items as possible). MCTS is the same technique the AI program AlphaGo used to defeat human opponents in Go (a board game similar to chess): the system considers potential sequences of moves before choosing the most advantageous one.
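The sketch below illustrates the general MCTS loop (selection, expansion, rollout, backpropagation) applied to scene building as a sequential decision process, with a toy reward that favors placing more distinct objects. It is not the paper’s implementation; the object list, reward, and scene representation are stand-ins chosen for brevity.

```python
import math
import random

OBJECTS = ["plate", "fork", "bowl", "mug", "napkin"]
MAX_OBJECTS = 4  # depth of the search tree

class Node:
    """One node per partial scene (the list of objects placed so far)."""
    def __init__(self, scene, parent=None):
        self.scene = scene
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0  # sum of rollout rewards seen through this node

    def expand(self):
        # One child per object that could be added next.
        if len(self.scene) < MAX_OBJECTS:
            for obj in OBJECTS:
                self.children.append(Node(self.scene + [obj], parent=self))

    def ucb(self, c=1.4):
        # Upper confidence bound: balances exploiting good branches and exploring new ones.
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(
            math.log(self.parent.visits) / self.visits
        )

def reward(scene):
    # Stand-in objective: prefer scenes containing more distinct objects.
    return len(set(scene))

def rollout(scene):
    # Finish the partial scene randomly, then score the result.
    while len(scene) < MAX_OBJECTS:
        scene = scene + [random.choice(OBJECTS)]
    return reward(scene)

def mcts(iterations=300):
    root = Node([])
    for _ in range(iterations):
        # Selection: descend to a leaf by repeatedly picking the highest-UCB child.
        node = root
        while node.children:
            node = max(node.children, key=Node.ucb)
        # Expansion: grow the tree once a leaf has been visited before.
        if node.visits > 0 and len(node.scene) < MAX_OBJECTS:
            node.expand()
            node = random.choice(node.children)
        # Simulation and backpropagation.
        score = rollout(node.scene)
        while node is not None:
            node.visits += 1
            node.value += score
            node = node.parent
    # Read out the most-visited path as the final scene.
    node, best_scene = root, []
    while node.children:
        node = max(node.children, key=lambda n: n.visits)
        best_scene = node.scene
    return best_scene

if __name__ == "__main__":
    random.seed(0)
    print(mcts())  # prints the scene along the most-visited branch
```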
“We are the first to apply MCTS to scene generation by framing the scene generation task as a sequential decision-making process,” said Nicholas Pfaff, a doctoral student in MIT’s Department of Electrical Engineering and Computer Science (EECS), a CSAIL researcher, and lead author of a paper presenting the work. “We keep building on top of partial scenes to produce better or more desired scenes over time. As a result, MCTS creates scenes that are more complex than what the diffusion model was trained on.”
In one particularly telling experiment, MCTS was tasked with adding the maximum number of objects to a simple restaurant scene. Although it had been trained on scenes containing only 17 objects on average, it produced tables holding as many as 34 items, including piles of dim sum dishes.
Steerable scene generation also lets you generate diverse training scenarios via reinforcement learning, essentially teaching the diffusion model to fulfill an objective through trial and error. After training on the initial data, the system goes through a second training stage in which you outline a reward (basically, a desired outcome paired with a score indicating how close you are to it). The model automatically learns to create higher-scoring scenes, often producing scenarios quite different from those it was trained on.
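As a rough illustration of the reward interface described above, the snippet below defines a scoring function for a desired outcome (here, a target object count, which is an invented example) and uses simple best-of-N sampling from a stand-in generator. The actual reinforcement-learning fine-tuning of the diffusion model is not reproduced here.

```python
import random

def sample_scene(rng):
    # Stand-in "generator": in the real system this would be a scene sampled
    # from the diffusion model; here it is just a random object count on a table.
    return {"num_objects": rng.randint(1, 30)}

def reward(scene, target=20):
    # Desired outcome with a score: closer to `target` objects scores higher (max 1.0).
    return 1.0 - abs(scene["num_objects"] - target) / target

rng = random.Random(0)
candidates = [sample_scene(rng) for _ in range(64)]
best = max(candidates, key=reward)
print(best, round(reward(best), 2))
```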
Users can also prompt the system directly by typing in a specific visual description (such as “there are four apples in the kitchen and a bowl on the table”). Then, steerable scene generation can bring the request to life with precision. For example, the tool accurately followed user prompts 98% of the time when building scenes of pantry shelves, and 86% of the time for messy breakfast tables. Both marks are at least a 10% improvement over comparable methods such as “MiDiffusion” and “DiffuScene.”
The system can also complete specific scenes from prompts or light directions (such as “come up with a different scene arrangement using the same objects”). For example, you could ask it to place apples on several plates on a kitchen table, or to put board games and books on a shelf. It essentially “fills in the blanks,” slotting items into empty spaces while leaving the rest of a scene intact.
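Here is a toy sketch of that “fill in the blanks” behavior: existing objects stay fixed while requested items are placed only into unoccupied spots. The coarse 2D grid is an invented simplification; the real system reasons over full 3D poses.

```python
import random

GRID = 4  # a 4x4 tabletop grid; each cell holds at most one object
existing = {(0, 0): "plate", (1, 2): "bowl"}   # objects already in the scene (kept fixed)
new_items = ["apple", "apple", "book"]         # items the prompt asks to add

# Cells not already occupied by an existing object.
free_cells = [(r, c) for r in range(GRID) for c in range(GRID) if (r, c) not in existing]

random.seed(0)
placement = dict(existing)  # start from the untouched scene
for item, cell in zip(new_items, random.sample(free_cells, len(new_items))):
    placement[cell] = item  # new items go only into empty cells

print(placement)  # the original plate and bowl are unchanged; new items fill the gaps
```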
The researchers say the strength of their project lies in its ability to create many scenes that roboticists can actually use. “A key insight from our findings is that it’s OK for the scenes we pre-trained on to not exactly resemble the scenes we actually want,” said Pfaff. “Using our steering methods, we can move beyond that broad distribution and sample from a ‘better’ one. In other words, we can generate the diverse, realistic, task-aligned scenes that we actually want to train our robots in.”
Such vast scenes became testing grounds where the researchers could record a virtual robot interacting with different items. For example, the machine carefully placed forks and knives into a utensil holder and repositioned bread onto plates in various 3D settings. Each simulation looked fluid and realistic, resembling the real-world, adaptable robots that steerable scene generation could one day help train.
While the system could be an encouraging path toward generating lots of diverse training data for robots, the researchers say their work is more of a proof of concept. In the future, they hope to use generative AI to create entirely new objects and scenes, rather than drawing from a fixed library of assets. They also plan to incorporate articulated objects that a robot could open or twist, such as cupboards or jars filled with food, to make the scenes even more interactive.
To make their virtual environments even more realistic, Pfaff and his colleagues also plan to incorporate real-world objects by drawing on a library of objects and scenes pulled from images on the Internet, building on their previous work, “Scalable Real2Sim.” By expanding how diverse and lifelike AI-constructed robot testing grounds can be, the team hopes to build a community of users that will create vast amounts of data, which could then be used as a massive dataset to teach dexterous robots different skills.
“Today, creating realistic scenes for simulation can be quite challenging; procedural generation can readily produce a large number of scenes, but they may not be representative of the environments robots will encounter in the real world, and manually creating custom scenes is both time-consuming and expensive,” said Jeremy Binagia, an applied scientist at Amazon Robotics who was not involved in the paper. “Steerable scene generation offers a better approach: train a generative model on a large collection of pre-existing scenes and adapt it (using strategies such as reinforcement learning) to specific downstream applications. Compared to previous work that leveraged off-the-shelf vision-language models or focused solely on arranging objects in a 2D grid, this approach guarantees physical feasibility and accounts for full 3D translation and rotation, enabling the generation of much more interesting scenes.”
“Steerable scene generation with post-training and inference-time search provides a novel and efficient framework for automating scene generation at scale,” said Rick Cory SM ’08, PhD ’10, a roboticist at the Toyota Research Institute who was also not involved in the paper. “Moreover, it can generate ‘never-before-seen’ scenes that are deemed important for downstream tasks. In the future, combining this framework with vast amounts of Internet data could mark an important milestone toward efficiently training robots for deployment in the real world.”
Pfaff co-authored the paper with senior author Russ Tedrake, the Toyota Professor of Electrical Engineering and Computer Science, Aeronautics and Astronautics, and Mechanical Engineering at MIT, who is also senior vice president of Large Behavior Models at the Toyota Research Institute and a CSAIL principal investigator. Other authors include Toyota Research Institute robotics researcher Hongkai Dai SM ’12, PhD ’16; team lead and senior research scientist Sergey Zakharov; and Carnegie Mellon University doctoral student Shun Iwase. Their work was supported in part by Amazon and the Toyota Research Institute. The researchers presented their work at the Conference on Robot Learning (CoRL) in September.