
LLMs have transformed artificial intelligence and reshaped applications across industries. Autoregressive (AR) models dominate current text generation, with leading systems such as GPT-4, DeepSeek, and Claude all relying on sequential left-to-right architectures. Despite their impressive capabilities, fundamental questions about the next generation of architectural paradigms have emerged as the limitations of AR models become apparent. These challenges include difficulty with complex reasoning, weak long-term planning, and trouble maintaining coherence across extended contexts. They are especially problematic for emerging applications in embodied AI, autonomous agents, and long-horizon decision-making systems, where sustained reasoning and contextual understanding are critical to success.

Discrete diffusion models (DMs) are a promising alternative to the autoregressive approach to sequence generation. Unlike AR models, which generate tokens one at a time, DMs refine the entire sequence in parallel starting from a fully noised state. This difference brings significant advantages: bidirectional context modeling improves global coherence, flexible and controllable generation arises naturally through iterative refinement, and there is potential for fundamental sampling acceleration through efficient noise-to-data mapping. Recent advances show diffusion's growing potential for language tasks, with models such as DiffuLLaMA and LLaDA scaling to 7B parameters, while Mercury Coder demonstrates impressive inference efficiency in code generation.
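
To make the contrast with left-to-right decoding concrete, here is a minimal, self-contained sketch of mask-based parallel refinement: the sequence starts fully masked, and at each step the most confident positions are unmasked in parallel. The toy_model, MASK_ID, and step schedule are illustrative stand-ins, not Dream's actual implementation.

```python
# Illustrative sketch of mask-based discrete diffusion decoding (not Dream's
# actual code): start from a fully masked sequence and iteratively reveal the
# highest-confidence positions in parallel. The toy "model" returns random
# logits so the loop runs end to end.
import numpy as np

VOCAB_SIZE = 1000
MASK_ID = 0          # hypothetical id reserved for the mask token
SEQ_LEN = 16
NUM_STEPS = 8        # fewer refinement steps = faster but coarser generation

rng = np.random.default_rng(0)

def toy_model(tokens: np.ndarray) -> np.ndarray:
    """Stand-in for a bidirectional denoiser: returns logits over the vocabulary
    for every position, conditioned on the full (partially masked) sequence."""
    return rng.normal(size=(len(tokens), VOCAB_SIZE))

tokens = np.full(SEQ_LEN, MASK_ID)               # fully noised (all-mask) start state
for step in range(NUM_STEPS):
    masked_idx = np.flatnonzero(tokens == MASK_ID)
    if masked_idx.size == 0:
        break
    logits = toy_model(tokens)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    confidence = probs.max(-1)[masked_idx]       # model confidence at masked positions
    # Reveal an equal share of the remaining masks each step, most confident first.
    n_reveal = int(np.ceil(masked_idx.size / (NUM_STEPS - step)))
    chosen = masked_idx[np.argsort(-confidence)[:n_reveal]]
    tokens[chosen] = probs.argmax(-1)[chosen]

print(tokens)  # every position is filled after NUM_STEPS parallel refinement passes
```

The number of refinement steps in a loop like this is the natural quality-speed knob: more steps mean more careful, better-conditioned predictions, while fewer steps mean faster generation.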

Researchers at the University of Hong Kong and Huawei Noah's Ark Lab have released Dream 7B (Diffusion reasoning model), the most powerful open diffusion large language model to date. The model matches or exceeds similarly sized AR models on general, mathematical, and coding benchmarks. Dream 7B shows strong zero-shot planning ability and inference flexibility, outperforming much larger models such as DeepSeek V3 (671B) on structured tasks. The model was trained on 580B tokens from corpora including Dolma and OpenCoder, adopts mask-based diffusion, and initializes its weights from the autoregressive Qwen2.5 7B model. Its architecture enables powerful bidirectional context processing, arbitrary-order generation, infilling, and adjustable quality-speed trade-offs during inference.
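
As a hedged usage sketch, the released checkpoints can presumably be loaded through Hugging Face transformers with remote code enabled; the diffusion-specific generation call shown in the comment is an assumed interface, not a documented signature, so consult the model card for the actual method and arguments.

```python
# Hedged usage sketch for the released checkpoints (repo ids as linked at the
# end of this article). The diffusion-specific generation call is custom,
# model-specific code, so the commented call below is only an assumption.
from transformers import AutoModel, AutoTokenizer

model_id = "Dream-org/Dream-v0-Instruct-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

prompt = "Plan a 3-step route that visits A, B, and C exactly once."
inputs = tokenizer(prompt, return_tensors="pt")
# Hypothetical call: the number of refinement steps is the quality/speed knob
# described above. Check the model card for the real method name and arguments.
# output_ids = model.diffusion_generate(inputs.input_ids, steps=128, max_new_tokens=64)
```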

Dream 7B builds on previous work in diffusion language modeling, drawing on the theoretical foundations of RDM and the adaptation strategy of DiffuLLaMA. It implements the mask-diffusion paradigm with an architecture designed for diverse applications. The training data spans text, mathematics, and code, including Dolma v1.7, OpenCoder, and DCLM-Baseline. Pretraining consumed 580 billion tokens and ran on 96 NVIDIA H800 GPUs over 256 hours without unrecoverable loss spikes. Extensive design experiments at the 1B-parameter scale identified key components, including weight initialization from autoregressive models such as Qwen2.5 and LLaMA3, together with context-adaptive token-level noise rescheduling, both essential for Dream 7B training.
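
The exact noise-rescheduling recipe is not spelled out here, but the generic mask-diffusion training objective it builds on can be sketched as follows: sample a noise level, mask that fraction of tokens, and train the bidirectional model to recover only the masked positions. This is an illustrative, RDM/MDLM-style loss, not Dream 7B's exact implementation; the model argument is assumed to be any sequence model returning per-position logits.

```python
# Minimal sketch of a generic mask-based diffusion training objective.
# Dream's context-adaptive token-level noise rescheduling would adjust the
# effective noise level per token; that refinement is omitted here.
import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, tokens, mask_id):
    """tokens: (batch, seq_len) LongTensor of clean token ids."""
    b, n = tokens.shape
    t = torch.rand(b, 1, device=tokens.device).clamp(min=1e-3)   # noise level per sequence
    mask = torch.rand(b, n, device=tokens.device) < t            # positions to corrupt
    noised = tokens.masked_fill(mask, mask_id)                   # corrupted input sequence
    logits = model(noised)                                       # (b, n, vocab), bidirectional
    ce = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        tokens.reshape(-1),
        reduction="none",
    ).reshape(b, n)
    # Score only the masked positions; the 1/t weighting is the standard ELBO-style weight.
    return (ce * mask.float() / t).sum() / mask.float().sum().clamp(min=1.0)
```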

The model was evaluated on Countdown and Sudoku tasks against LLaDA 8B, Qwen2.5 7B, LLaMA3 8B, and DeepSeek V3 671B. It performs better than similarly sized baselines, with both diffusion models surpassing their autoregressive counterparts. The diffusion models even occasionally exceed DeepSeek V3 despite its far larger parameter count, showing the effectiveness of diffusion models for multi-constraint problem solving and goal-directed planning tasks. For supervised fine-tuning, the method trains for three epochs on 1.8 million instruction pairs drawn from the Tulu 3 and SmolLM2 datasets. The results show that Dream can match the performance of autoregressive models.

In summary, the researchers introduced Dream 7B, a family of diffusion language models characterized by efficiency, scalability, and flexibility, built through carefully developed training methods. These models are comparable to leading autoregressive models of similar size on general, mathematical, and coding tasks. Dream's most distinctive advantages lie in advanced planning scenarios and flexible inference, where its diffusion-based architecture offers clear benefits over traditional autoregressive methods. This achievement establishes diffusion models as a compelling alternative path in language model development.


Check out Dream-org/Dream-v0-Instruct-7B and Dream-org/Dream-v0-Base-7B. All credit for this research goes to the researchers of this project.



Sajjad Ansari is a final year undergraduate student from IIT Kharagpur. As a technology enthusiast, he delves into the practical application of AI, focusing on understanding AI technology and its real-world impact. He aims to express complex AI concepts in a clear and easy way.
