
Language models have made significant strides in solving reasoning tasks, and even small-scale supervised fine-tuning (SFT) approaches such as LIMO and s1 have demonstrated marked improvements in mathematical problem-solving capabilities. However, fundamental questions about these advances remain: do these models genuinely generalize beyond their training data, or are they merely overfitting to the test set? The research community faces the challenge of understanding which capabilities small-scale SFT actually enhances and which limitations persist despite these improvements. Strong scores on popular benchmarks notwithstanding, the specific strengths and weaknesses of these fine-tuned models remain incompletely understood, creating a critical gap in knowledge about their true capabilities.

Various attempts have been made to understand the effects of reasoning-based supervised fine-tuning beyond simple benchmark scores. Researchers have questioned whether SFT merely improves performance on previously seen problem types or genuinely enables models to transfer problem-solving strategies to new settings, such as applying coordinate-based techniques in geometry. Existing methods focus on factors such as correctness, solution length, and response diversity, which preliminary studies suggest play important roles in model improvement. However, these approaches lack the granularity needed to determine exactly which types of previously unsolvable problems become solvable after fine-tuning, and which problem categories remain resistant to improvement despite extensive training. The research community still struggles to determine whether the observed improvements reflect deeper learning or simply memorization of training trajectories, highlighting the need for more fine-grained analytical approaches.

Researchers from the University of California, Berkeley and the Allen Institute for AI have proposed a tiered analysis framework to investigate how supervised fine-tuning affects reasoning capabilities in language models. The approach uses the AIME24 dataset, chosen for its complexity and widespread use in reasoning research, and reveals a ladder-like structure in which models that solve higher-tier problems generally also succeed on lower tiers. By dividing the problems into four difficulty tiers, Easy, Medium, Hard, and Extremely Hard (Exh), the study systematically examines the specific requirements for advancing between levels. The analysis shows that progression from Easy to Medium mainly requires adopting an R1-style, long chain-of-thought reasoning format, while Hard-level problems demand greater computational stability during deep exploration. Exh-level problems present fundamentally different challenges, requiring unconventional problem-solving strategies that current models uniformly lack. The study also identifies four key insights: the performance gap between potential and stability in small-scale SFT models, the minimal benefits of careful dataset curation, diminishing returns from scaling SFT datasets, and potential intelligence barriers that may not be overcome by SFT alone.

The method uses the AIME24 dataset as the primary evaluation benchmark. This choice stems from three key attributes: the dataset's hierarchical difficulty, which challenges even state-of-the-art models; its diverse coverage of mathematical domains; and its focus on high-school mathematics, which isolates pure reasoning ability from domain-specific knowledge. Qwen2.5-32B-Instruct serves as the base model due to its widespread adoption and its inherent cognitive behaviors, including verification, backtracking, and subgoal setting. The fine-tuning data consists of problem-response pairs from the OpenR1-Math-220k dataset, specifically CoT trajectories generated by DeepSeek R1 for NuminaMath 1.5 problems, with incorrect solutions filtered out. The training configuration uses a learning rate of 1×10⁻⁵, weight decay of 1×10⁻⁴, a batch size of 32, and 5 epochs. Performance is evaluated with avg@n (average accuracy over multiple attempts) and cov@n (coverage, i.e., whether any attempt succeeds) metrics. Based on model performance, problems are categorized into four difficulty levels: Easy, Medium, Hard, and Extremely Hard.
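To make the evaluation protocol concrete, the following is a minimal sketch (not the authors' code) of how avg@n and cov@n can be computed from a binary matrix of per-attempt outcomes; the `results` array and its shape are assumptions for illustration.

```python
import numpy as np

def avg_at_n(results: np.ndarray) -> float:
    """avg@n: mean accuracy over all n attempts on all problems."""
    return float(results.mean())

def cov_at_n(results: np.ndarray) -> float:
    """cov@n: fraction of problems solved by at least one of the n attempts."""
    return float(results.any(axis=1).mean())

# Toy usage: 3 problems, n = 4 attempts each (1 = correct attempt, 0 = incorrect).
results = np.array([
    [1, 0, 1, 0],   # solved on 2 of 4 attempts
    [0, 0, 0, 1],   # solved once
    [0, 0, 0, 0],   # never solved
])
print(avg_at_n(results))  # 0.25
print(cov_at_n(results))  # ~0.67
```

The gap between the two numbers is exactly the potential-versus-stability distinction the study draws: cov@n rewards solving a problem at least once, while avg@n penalizes inconsistency across attempts.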

The results show that progressing from Easy-level to Medium-level mathematical problem solving requires minimal but specific conditions. The study systematically examined multiple training variables, including foundational knowledge across different mathematical categories, dataset size variation (100-1000 examples per category), trajectory length (short, normal, or long), and trajectory style (comparing DeepSeek-R1 with Gemini-Flash). Through comprehensive ablations, the researchers isolated the effect of each dimension on model performance, expressed as P = f(C, N, L, S), where C is the mathematical category, N the number of trajectories, L the trajectory length, and S the trajectory style. The results show that achieving ≥90% performance on Medium-level problems requires at least 500 normal or long R1-style trajectories, regardless of the specific mathematical category. Models trained with fewer trajectories, shorter trajectories, or Gemini-style trajectories never reach this threshold. This suggests that the length and number of reasoning trajectories are the key factors in developing mathematical reasoning ability, while the specific subject matter of the trajectories proves less important than their structural characteristics.
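The ablation can be pictured as a grid search over the four dimensions of P = f(C, N, L, S). The sketch below is a hypothetical illustration rather than the authors' pipeline: the config fields, category labels, and the `predicted_to_pass` rule are assumptions that simply encode the reported finding that ≥500 normal- or long-length R1-style trajectories are needed to clear the 90% Medium-level threshold.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class SFTConfig:
    category: str          # C: mathematical category of the training problems
    num_trajectories: int  # N: number of CoT trajectories
    length: str            # L: "short", "normal", or "long"
    style: str             # S: "r1" or "gemini-flash"

def predicted_to_pass(cfg: SFTConfig) -> bool:
    """Encodes the reported finding: >=500 normal- or long-length R1-style
    trajectories are needed to reach >=90% on Medium-level problems,
    regardless of mathematical category."""
    return (
        cfg.num_trajectories >= 500
        and cfg.length in ("normal", "long")
        and cfg.style == "r1"
    )

# Hypothetical ablation grid over the four dimensions.
grid = product(
    ["algebra", "geometry", "number_theory", "combinatorics"],  # C (assumed labels)
    [100, 250, 500, 1000],                                       # N
    ["short", "normal", "long"],                                 # L
    ["r1", "gemini-flash"],                                      # S
)
configs = [SFTConfig(*combo) for combo in grid]
passing = [cfg for cfg in configs if predicted_to_pass(cfg)]
print(len(passing))  # only the >=500 / normal-or-long / R1-style configurations remain
```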

The study shows that models with small-scale supervised fine-tuning can potentially solve as many problems as more sophisticated models like DeepSeek-R1, although significant challenges remain. The main limitation identified is instability in mathematical reasoning rather than capability: experimental results show that geometry-trained models can achieve a coverage score of 90, matching R1's performance when allowed multiple attempts, yet their overall accuracy lags by more than 20%. This gap stems primarily from instability in deep exploration and computational limitations during complex problem solving. Increasing the SFT dataset size offers one path forward, but performance gains follow a logarithmic scaling trend with diminishing returns. Notably, the study challenges recent assertions about the importance of careful dataset curation, showing that performance across different mathematical categories remains consistent within a narrow 55±4% range, with only marginal differences between carefully constructed datasets and randomly assembled ones. This indicates that the quantity and quality of reasoning trajectories matter more than subject-specific content for developing reliable mathematical reasoning abilities.
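The potential-versus-stability gap can be illustrated with a simple probabilistic toy example (not data from the paper): if a model solves a problem with probability p on any single attempt, coverage over n attempts grows as 1 − (1 − p)^n while per-attempt accuracy stays at p, so coverage can match a stronger model long before accuracy does.

```python
def coverage_at_n(p: float, n: int) -> float:
    """Probability that at least one of n independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** n

p = 0.3  # assumed per-attempt success rate on a hard problem (illustrative only)
for n in (1, 4, 16, 64):
    print(f"n={n:3d}  avg@1={p:.2f}  cov@n={coverage_at_n(p, n):.2f}")
# Per-attempt accuracy never moves, but cov@64 approaches 1.0: coverage can match
# a stronger model long before single-attempt accuracy does, which is the
# stability gap the study highlights.
```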


Check out the Paper and GitHub page for this research.



Asjad is an intern consultant at Marktechpost. He is pursuing a B.Tech in mechanical engineering at the Indian Institute of Technology, Kharagpur. Asjad is a machine learning and deep learning enthusiast who studies the applications of machine learning in healthcare.
