Value-free methods such as GRPO and DAPO have proven highly effective for reinforcement learning (RL) training of large language models (LLMs). The greater potential, however, lies in value-based approaches, which allow more precise credit assignment by accurately tracking the impact of each action on subsequent returns. This precision is crucial for complex reasoning, where subtle errors can lead to catastrophic failures. Yet training effective value models for long chain-of-thought (CoT) tasks faces several challenges: achieving low bias despite long trajectories, handling the differing optimization preferences of short and long responses, and coping with the sparsity of reward signals. Despite their theoretical advantages, these difficulties have so far hindered the full realization of value-based approaches.
Value-based reinforcement learning faces three major challenges when applied to long chain-of-thought reasoning tasks. First is value model bias: as identified in VC-PPO, initializing the value model from a reward model introduces positive bias. Second, the heterogeneous sequence lengths of complex reasoning tasks are difficult for standard methods such as GAE with fixed parameters, which cannot adapt effectively to responses ranging from very short to very long. Third, reward signals are sparse in verifier-based tasks that provide only binary feedback rather than continuous values, and long CoT responses worsen this sparsity, creating a difficult exploration-exploitation trade-off during optimization.
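To make the fixed-parameter issue concrete, the sketch below shows standard GAE with a single fixed λ. It is a minimal, illustrative implementation (not the authors' code): with one λ shared across all responses, the degree of bootstrapping is identical for a 20-token answer and a 20,000-token chain of thought, which is exactly the mismatch described above.

```python
# Minimal sketch of standard GAE with a fixed lambda (illustrative only).
import numpy as np

def gae_fixed_lambda(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation with a single fixed lambda.

    rewards: per-token rewards (mostly zeros; a single verifier reward at the
             end in the sparse, binary-feedback setting described above).
    values:  value-model predictions per token position plus a final bootstrap
             value (len(values) == len(rewards) + 1).
    """
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        gae = delta + gamma * lam * gae                          # exponential mixing
        advantages[t] = gae
    return advantages

# Sparse verifier reward: only the final token carries the 0/1 outcome.
T = 8
rewards = np.zeros(T); rewards[-1] = 1.0
values = np.zeros(T + 1)
print(gae_fixed_lambda(rewards, values))
```

With the end-of-sequence verifier reward used in the example, the advantage signal decays geometrically (roughly as λ^k) toward earlier tokens, so a fixed λ tuned for short responses leaves very long responses with heavily attenuated credit.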
Researchers from ByteDance Seed propose Value-based Augmented Proximal Policy Optimization (VAPO), a value-based RL training framework that addresses the challenges of long-CoT reasoning tasks. VAPO makes three key contributions: a detailed value-based training framework with superior performance and efficiency, a length-adaptive GAE mechanism that adjusts the GAE parameter according to response length to optimize advantage estimation, and a systematic integration of techniques from prior work. VAPO combines these components into a system whose collective improvement exceeds what the individual enhancements could achieve independently. Using the Qwen2.5-32B model without SFT data, VAPO raises the AIME24 score from 5 to 60, surpassing the previous state-of-the-art method by 10 points.
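The article does not spell out how the length-adaptive GAE mechanism schedules its parameter, so the snippet below is only a hedged sketch of the idea: λ grows with response length, so longer responses bootstrap less and inherit less bias from an imperfect value model. The formula 1 − 1/(α·length) and the values of `alpha`, `lam_min`, and `lam_max` are illustrative assumptions, not the paper's exact setting.

```python
# Hedged sketch of a length-adaptive GAE schedule (assumed parameterization).
def length_adaptive_lambda(response_length: int, alpha: float = 0.05,
                           lam_min: float = 0.9, lam_max: float = 1.0) -> float:
    """Return a GAE lambda that grows toward 1.0 for longer responses.

    Long chains of thought get lambda closer to 1 (less bootstrapping, lower
    bias from an imperfect value model); short responses keep a smaller lambda
    (more bootstrapping, lower variance).
    """
    lam = 1.0 - 1.0 / (alpha * max(response_length, 1))
    return min(max(lam, lam_min), lam_max)

for length in (16, 256, 4096, 16384):
    print(length, round(length_adaptive_lambda(length), 4))
```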
VAPO is built on the PPO algorithm with a number of key modifications to enhance mathematical reasoning capabilities. An analysis of training dynamics reveals favorable properties of VAPO compared to DAPO, including a smoother and more stable training curve, better length scaling, stronger generalization, faster score growth due to the fine-grained signals provided by the value model, and lower entropy in later training phases. Although reduced entropy can limit exploration, the approach balances this trade-off effectively, with minimal impact on performance while improving reproducibility and stability. This demonstrates how VAPO's design choices directly address the core challenges of value-based RL in complex reasoning tasks.
Whereas DeepSeek R1 using GRPO scored 47 points on AIME24 and DAPO reached 50, VAPO matches DAPO's performance on Qwen2.5-32B with only 60% of the update steps and achieves a new state-of-the-art score of 60.4 within just 5,000 steps. Vanilla PPO scores only 5 points because of value model learning collapse, while VAPO ends at 60 points. Ablation studies verify the effectiveness of seven proposed modifications: value pretraining prevents collapse, decoupled GAE enables full optimization of long responses, length-adaptive GAE balances the optimization of short and long responses, Clip-Higher encourages thorough exploration, token-level loss increases the weight of longer responses, positive-example LM loss adds 6 points, and group sampling contributes 5 points.
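Two of the listed modifications lend themselves to a compact sketch: Clip-Higher (an asymmetric clip range with extra headroom on the upside to encourage exploration) and token-level averaging of the policy loss (so long responses contribute in proportion to their length). The PyTorch-style function below is a minimal illustration under assumed hyperparameters (`eps_low` and `eps_high` are placeholders), not the paper's implementation.

```python
# Illustrative sketch of Clip-Higher plus token-level loss averaging.
import torch

def clip_higher_token_loss(logp_new, logp_old, advantages, mask,
                           eps_low=0.2, eps_high=0.28):
    """PPO-style clipped policy loss with asymmetric clipping, averaged per token.

    logp_new, logp_old: log-probs of sampled tokens under the new and behavior
                        policies, shape [batch, seq_len].
    advantages:         per-token advantage estimates, same shape.
    mask:               float tensor, 1.0 for response tokens, 0.0 elsewhere.
    """
    ratio = torch.exp(logp_new - logp_old)
    clipped_ratio = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    per_token_loss = -torch.min(ratio * advantages, clipped_ratio * advantages)
    # Token-level averaging: every valid token in the batch counts equally, so
    # long responses are not down-weighted by a per-sequence mean.
    return (per_token_loss * mask).sum() / mask.sum().clamp(min=1.0)

# Tiny usage example with random tensors.
if __name__ == "__main__":
    b, t = 2, 8
    logp_old = torch.randn(b, t)
    logp_new = logp_old + 0.1 * torch.randn(b, t)
    adv = torch.randn(b, t)
    mask = torch.ones(b, t)
    print(clip_higher_token_loss(logp_new, logp_old, adv, mask))
```

Averaging over tokens rather than over sequences means a 10,000-token solution carries roughly 100 times the gradient weight of a 100-token one, which is the mechanism behind the "token-level loss increases longer responses" ablation result.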
In conclusion, the researchers introduce VAPO, an algorithm that uses the Qwen2.5-32B model to achieve state-of-the-art performance on the AIME24 benchmark. By introducing seven innovative techniques on top of the PPO framework, VAPO significantly improves value learning and strikes a better balance between exploration and exploitation. This value-based approach decisively outperforms value-free approaches such as GRPO and DAPO, setting a new performance ceiling for reasoning tasks. It addresses the fundamental challenges of training value models for long-CoT scenarios and provides a strong foundation for advancing LLMs in reasoning-heavy RL applications.
Check out the Paper. All credit for this research goes to the researchers of this project.

Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a technology enthusiast, he explores the practical applications of AI with a focus on understanding AI technologies and their real-world impact. He aims to articulate complex AI concepts in a clear and accessible way.
