
Large language models (LLMs) are adept at using textual reasoning to understand the context of a document and provide logical answers about its content. But these same LLMs often struggle to correctly answer even the simplest math problems.

Textual reasoning is usually a less-than-ideal way to work through computational or algorithmic tasks. While some LLMs can generate code, such as Python, to handle symbolic queries, the models don’t always know when to use code, or what kind of code would work best.

It seems LLMs may need a coach to steer them toward the best technique.

Enter CodeSteer, a smart assistant developed by MIT researchers that guides an LLM to switch between code and text generation until it correctly answers a query.

CodeSteer, itself a smaller LLM, automatically generates a series of prompts to iteratively steer a larger LLM. It reviews the model’s current and previous answers after each round and offers guidance on how to fix or refine the solution until it deems the answer correct.

The researchers found that augmenting a larger LLM with CodeSteer boosted its accuracy on symbolic tasks, such as multiplying numbers, playing Sudoku, and stacking blocks, by more than 30 percent. It also enabled less sophisticated models to outperform more advanced models with enhanced reasoning skills.

This advance could improve the problem-solving capabilities of LLMs on complex tasks that are especially difficult to solve with textual reasoning alone, such as generating paths for robots in uncertain environments or scheduling shipments in an international supply chain.

“There is a race to develop better and better models that are capable of doing everything, but we’ve taken a complementary approach. Researchers have spent years developing effective technologies and tools to tackle problems in many domains. We want to enable LLMs to select the right tools and methods, and make use of others’ expertise to enhance their own capabilities,” says Chuchu Fan, an associate professor of aeronautics and astronautics (AeroAstro) and principal investigator in the MIT Laboratory for Information and Decision Systems (LIDS).

Fan, the senior author of the study, is joined on a paper about the work by LIDS graduate student Yongchao Chen; AeroAstro graduate student Yilun Hao; University of Illinois at Urbana-Champaign graduate student Yueying Liu; and MIT-IBM Watson AI Laboratory research scientist Yang Zhang. The research will be presented at the International Conference on Machine Learning.

An LLM “trainer”

Ask an LLM which number is larger, 9.11 or 9.9, and it will often answer incorrectly using textual reasoning. But ask it to answer the same question using code, and it can generate and execute a Python script that compares the two numbers, easily solving the problem.
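To see why code sidesteps this pitfall, consider a minimal sketch of the kind of script an LLM might produce when steered toward code (this example is illustrative, not taken from the paper): treating the values as numbers, rather than as version-like strings where “9.11” can look larger, makes the comparison trivial.

```python
# Illustrative sketch: comparing 9.11 and 9.9 numerically instead of
# reasoning about them as text.
a, b = 9.11, 9.9
print(max(a, b))  # prints 9.9, the correct answer
```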

Initially trained to understand and predict human language, an LLM is more likely to answer a query using text, even when code would be more effective. And while LLMs have learned to generate code through fine-tuning, these models often produce an incorrect or less efficient version of that code.

Rather than trying to retrain a powerful LLM like Claude to improve these capabilities, the MIT researchers fine-tune a smaller, lightweight LLM to guide a larger model between text and code. Fine-tuning the smaller model doesn’t change the larger LLM, so there is no risk of undermining the larger model’s other abilities.

“We were also inspired by humans. In sports, a trainer may not be better than the star athlete on the team, but the trainer can still give helpful suggestions to guide the athlete. This steering method works for LLMs, too,” Chen says.

CodeSteer, the coach, works in tandem with the larger LLM. It first reviews a query and determines whether text or code is better suited for the problem, and which sort of code would be best.

It then generates a prompt for the larger LLM, telling it to use a coding method or textual reasoning to answer the query. The larger model follows this prompt to answer the query and sends the result back to CodeSteer, which reviews it.

If the answer is not correct, CodeSteer will continue prompting the LLM to try different approaches that might fix the problem, such as incorporating a search algorithm or constraint into its Python code, until the answer is correct.
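Based on this description, the interaction can be pictured as a review-and-steer loop like the hypothetical sketch below; the names `steer_model`, `next_prompt`, `solve`, and `review` are illustrative placeholders, not the researchers’ actual API.

```python
# Hypothetical sketch of CodeSteer's review-and-steer loop, based only on
# the description in this article. All names and methods are illustrative.
def code_steer(query, large_llm, steer_model, max_rounds=5):
    history = []
    for _ in range(max_rounds):
        # The smaller steering LLM decides whether text or code suits the
        # query and drafts a prompt for the larger model accordingly.
        prompt = steer_model.next_prompt(query, history)

        # The larger LLM follows the prompt and returns a candidate answer.
        answer = large_llm.solve(prompt)
        history.append((prompt, answer))

        # The steering model reviews the candidate; if it deems the answer
        # correct, the loop stops. Otherwise its next prompt suggests a fix,
        # e.g., adding a search algorithm or constraint to the Python code.
        if steer_model.review(query, answer, history):
            return answer
    return history[-1][1]  # best effort once the round budget is spent
```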

“We found that oftentimes, larger LLMs will try to be lazy and use shorter, less efficient code that won’t carry the correct symbolic calculations. We designed CodeSteer to avoid this phenomenon,” Chen says.

A symbolic checker evaluates the code’s complexity and sends a signal to CodeSteer if the code is too simple or inefficient. The researchers also incorporated a self-answer checker into CodeSteer, which prompts the LLM to generate code that recomputes the answer, to verify it is correct.
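A hedged illustration of how such safeguards might look is sketched below; the heuristics and helper names are invented for illustration and are not the paper’s actual checks.

```python
# Invented-for-illustration versions of the two safeguards described above.
def symbolic_check(code: str) -> bool:
    """Flag code that looks too simple to be doing real symbolic work,
    such as a one-liner that hard-codes an answer instead of computing it."""
    lines = [ln for ln in code.splitlines() if ln.strip()]
    has_computation = any(kw in code for kw in ("for", "while", "def"))
    return len(lines) > 2 and has_computation

def self_answer_check(llm, query, answer) -> bool:
    """Prompt the LLM to generate verification code that independently
    recomputes the answer, run it, and compare the results."""
    verifier = llm.generate_code(
        f"Write Python that independently computes the answer to: {query}"
    )
    return llm.execute(verifier) == answer
```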

Tackling complex tasks

When the researchers designed CodeSteer, they couldn’t find suitable symbolic datasets to fine-tune and test the model, because many existing benchmarks don’t indicate whether a certain query could be best solved with text or code.

So they gathered a corpus of 37 complex symbolic tasks, including spatial reasoning, mathematics, order reasoning, and optimization, and built their own dataset, called SymBench. They implemented a fine-tuning approach that leverages SymBench to maximize the performance of CodeSteer.

In their experiments, CodeSteer outperformed all nine baseline methods they evaluated and boosted average accuracy from 53.3 percent to 86.4 percent. It maintained similar performance even on unseen tasks, and across a variety of LLMs.

In addition, a general-purpose model augmented with CodeSteer can achieve higher accuracy than state-of-the-art models designed for complex reasoning and planning, while requiring much less computation.

“Our method uses an LLM’s own capabilities. By augmenting an LLM with the ability to smartly use coding, we can take a model that is already very strong and improve its performance even more,” Chen says.

In the future, the researchers hope to streamline CodeSteer to speed up its iterative prompting process. In addition, they are studying how to effectively fine-tune a unified model that can switch between textual reasoning and code generation on its own, rather than relying on a separate assistant.

“The authors present an elegant solution to the critical challenge of tool utilization in LLMs. This simple yet impactful method enables state-of-the-art LLMs to achieve significant performance improvements without direct fine-tuning,” says Jinsung Yoon, a staff research scientist at Google Cloud AI, who was not involved with this work. “This research represents a substantial contribution that promises to significantly enhance the application of LLMs to the diverse range of tasks they currently struggle with.”

“Their success in training smaller, specialized models is particularly impactful,” adds Chi Wang, a senior staff scientist at Google DeepMind, who was not involved with this work. “This clever collaboration among different AI ‘agents’ paves the way for more robust and versatile applications in complex, real-world scenarios.”

The study was supported in part by the U.S. Office of Naval Research and the MIT-IBM Watson AI Laboratory.
