
Despite their impressive abilities, large language models (LLMs) often struggle when faced with challenging new tasks that require complex reasoning skills.

While an accounting firm's LLM might excel at summarizing financial reports, that same model could fail unexpectedly if tasked with predicting market trends or identifying fraudulent transactions.

To make LLMs more adaptable, MIT researchers investigated how a certain training technique can be strategically deployed to boost a model's performance on unfamiliar, difficult problems.

They show that test-time training, a method that involves temporarily updating some of a model's inner workings during deployment, can lead to a sixfold improvement in accuracy. The researchers developed a framework for implementing a test-time training strategy that uses examples of the new task to maximize these gains.

Their work could improve a model's flexibility, enabling an off-the-shelf LLM to adapt to complex tasks that require planning or abstraction. This could lead to LLMs that are more accurate in many applications requiring logical deduction, from medical diagnostics to supply chain management.

“Genuine learning, which is what we do with test-time training, is something these models can't do on their own once they are shipped. They can't gain new skills or get better at a task. But if you push the model to do a little actual learning, you see that huge improvements in performance can happen,” says Akyürek.

Akyürek is joined on the paper by graduate students Mehul Damani, Linlu Qiu, Han Guo, and Jyothish Pari; undergraduate Adam Zweiger; Yoon Kim, an assistant professor of electrical engineering and computer science (EECS) and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL); and Jacob Andreas, an associate professor of EECS and a CSAIL member. The research will be presented at the International Conference on Machine Learning.

Tackling hard domains

LLM users often try to improve a model's performance on a new task using a technique called in-context learning: they feed the model a few examples of the new task as text prompts that guide the model's outputs.

But in-context learning doesn't always help with problems that require logic and reasoning.

The MIT researchers investigated how test-time training can be combined with in-context learning to boost performance on these challenging tasks. Test-time training involves updating some of a model's parameters, the internal variables it uses to make predictions, using a small amount of new data specific to the task at hand.

The researchers explored how test-time training interacts with in-context learning, studying design choices that maximize the performance improvements one can coax out of a general-purpose LLM.

“We find that test-time training is a much stronger form of learning. While simply providing examples can modestly boost accuracy, actually updating the model with those examples can lead to significantly better performance, particularly in challenging domains,” Damani says.

In-context learning requires a small set of task examples, including problems and their solutions. The researchers used these examples to create the task-specific dataset needed for test-time training.

To expand the size of this dataset, they create new inputs by slightly altering the problems and solutions in the examples, such as by flipping some input data horizontally. They found that training the model on the outputs of this new dataset led to the best performance.
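The augmentation idea can be sketched in a few lines. This is a hypothetical illustration only: the 2-D grid format is an assumption (loosely modeled on IQ-puzzle-style tasks), and the paper's actual transformations may differ.

```python
# Hypothetical sketch: expanding a handful of task examples into a larger
# test-time training set via a simple geometric transformation (horizontal
# flip). The grid-based task format is an assumption for illustration.

def hflip(grid):
    """Flip a 2-D grid (a list of rows) horizontally."""
    return [row[::-1] for row in grid]

def augment(examples):
    """Given (input, output) example pairs, add horizontally flipped copies."""
    augmented = list(examples)
    for inp, out in examples:
        augmented.append((hflip(inp), hflip(out)))
    return augmented

# One demonstration pair becomes two training pairs.
examples = [([[1, 0], [0, 2]], [[0, 1], [2, 0]])]
dataset = augment(examples)
```

Other label-preserving transformations (rotations, color permutations, and so on) could be added the same way to further grow the dataset.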

In addition, the researchers updated only a small number of model parameters using a technique called low-rank adaptation, which improves the efficiency of the test-time training process.
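The core idea of low-rank adaptation (LoRA) is that instead of updating a full weight matrix W, one trains only two small factors A and B, so the effective weight becomes W + AB. The following minimal sketch uses plain Python lists with illustrative dimensions and values; it is not the paper's implementation.

```python
# Minimal sketch of low-rank adaptation: a frozen d x d weight matrix W is
# adapted through two small trainable factors A (d x r) and B (r x d).
# Only 2*d*r parameters are trained instead of d*d.

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def effective_weight(W, A, B):
    """Return W + A @ B, the adapted weight used at inference."""
    AB = matmul(A, B)
    return [[w + ab for w, ab in zip(w_row, ab_row)]
            for w_row, ab_row in zip(W, AB)]

d, r = 4, 1
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen
A = [[0.1] for _ in range(d)]   # d x r, trainable
B = [[0.2] * d]                 # r x d, trainable
W_eff = effective_weight(W, A, B)
```

Here the adapter trains 2 * 4 * 1 = 8 parameters instead of the 16 in W; for the large matrices inside an LLM, that ratio is what makes per-task updates cheap.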

“This is important because our method needs to be efficient if it is going to be deployed in the real world. We find that you can get huge improvements in accuracy with a very small amount of parameter training.”

Developing new skills

Streamlining the process is key, since test-time training is applied on a per-task basis, meaning a user needs to do this for each individual task. The updates to the model are only temporary; the model reverts to its original form after making a prediction.
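The temporary, per-query nature of the update can be sketched as a snapshot-adapt-restore loop. In this hypothetical sketch the "model" is a stand-in dictionary and the gradient step is a placeholder; the names are invented for illustration.

```python
# Sketch of the per-task, temporary nature of test-time training: parameters
# are snapshotted, briefly updated on task examples, and restored after the
# prediction is made. The "model" here is a stand-in dict, not a real LLM.

import copy

def answer_with_test_time_training(params, task_examples, query):
    snapshot = copy.deepcopy(params)   # save the original weights
    for _example in task_examples:     # placeholder for gradient steps
        params["bias"] += 0.1          # temporary task-specific update
    prediction = ("adapted", params["bias"], query)
    params.clear()
    params.update(snapshot)            # revert to the original model
    return prediction

model = {"bias": 0.0}
pred = answer_with_test_time_training(model, [1, 2, 3], "puzzle")
# After answering, the model is back in its original state.
```

With low-rank adapters, the restore step is even simpler in practice: the adapter weights can be discarded while the frozen base model is never touched.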

Akyürek adds that a model that usually takes less than a minute to answer a query might take five to 10 minutes to provide an answer with test-time training.

“We wouldn't want to do this for all user queries, but it is useful if you have a very hard task that you want the model to solve well. There also might be tasks that are too challenging for an LLM to solve without this method,” he says.

The researchers tested their approach on two benchmark datasets of extremely complex problems, such as IQ puzzles. It boosted accuracy as much as sixfold over techniques that use only in-context learning.

Tasks that involved structured patterns or completely unfamiliar types of data showed the largest performance improvements.

“For simpler tasks, in-context learning might be OK. But updating the parameters themselves might develop a new skill in the model,” Damani says.

In the future, the researchers hope to use these insights to develop models that learn continually.

The long-term goal is an LLM that, given a query, can automatically determine whether it needs to update its parameters with test-time training or whether it can solve the task with in-context learning, and then implement the best test-time training strategy without human intervention.

This work is supported, in part, by the MIT-IBM Watson AI Lab and the National Science Foundation.
