
Autoregressive Transformers have become the dominant approach to sequence modeling thanks to their strong in-context learning and the parallelizable training enabled by softmax attention. However, softmax attention has quadratic complexity in sequence length, leading to high computational and memory demands, especially for long sequences. Although GPU optimizations can mitigate this for short sequences, inference remains costly at scale. Researchers have therefore explored recurrent architectures with compressed states, which offer linear complexity and constant memory usage. Advances in linear attention and state space models (SSMs) have shown promise, with RNN-based approaches such as RWKV-4 achieving competitive performance while significantly reducing inference costs.
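To make the scaling argument concrete, the minimal sketch below (not taken from the paper; the shapes and the toy decay are illustrative assumptions) contrasts the quadratic cost of causal softmax attention with a recurrent update that keeps a fixed-size compressed state.

```python
# Illustrative sketch only: per-step cost of softmax attention vs. a
# compressed-state recurrence.  Shapes and the toy decay are assumptions.
import numpy as np

T, d = 1024, 64                                   # sequence length, head dimension
q = np.random.randn(T, d)
k = np.random.randn(T, d)
v = np.random.randn(T, d)

# Softmax attention: every token attends to the full history, so total work
# grows as O(T^2) and the KV cache grows with T.
scores = q @ k.T                                  # (T, T) score matrix
mask = np.tril(np.ones((T, T)))                   # causal mask
scores = np.where(mask == 1, scores, -np.inf)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
out_attention = attn @ v                          # (T, d)

# Recurrent alternative: a fixed-size matrix state is updated once per token,
# so per-token compute and memory stay constant regardless of T.
S = np.zeros((d, d))                              # compressed state
out_recurrent = np.zeros((T, d))
for t in range(T):
    S = 0.99 * S + np.outer(k[t], v[t])           # toy decay + write (illustrative)
    out_recurrent[t] = q[t] @ S                   # read from the state
```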

Researchers from multiple institutions, including the RWKV Project, EleutherAI, and Tsinghua University, introduced RWKV-7 “Goose”, a new sequence modeling architecture that establishes new state-of-the-art (SoTA) performance at the 3-billion-parameter scale across a variety of tasks. Despite being trained on far fewer tokens than competing models, RWKV-7 achieves comparable English-language performance while maintaining constant memory usage and constant inference time per token. The architecture extends the delta rule with vector-valued state gating, adaptive in-context learning rates, and a relaxed value replacement mechanism. These improvements increase expressivity, enable effective state tracking, and allow recognition of all regular languages, exceeding the theoretical capabilities of Transformers under standard complexity assumptions. To support its development, the researchers released a 3.1-trillion-token multilingual corpus, along with a range of pre-trained RWKV-7 models from 0.19 to 2.9 billion parameters, all available under the open-source Apache 2.0 license.
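For context, the sketch below illustrates the classical delta rule that RWKV-7 generalizes: the recurrent state acts as a fast-weight matrix, and each token replaces the value bound to its key at a rate set by a learning-rate term. It is a simplified illustration under those assumptions, not the paper's exact formulation.

```python
# Minimal sketch (illustrative, not RWKV-7's exact update) of the classical
# delta rule: the state S is a fast-weight matrix, and each token replaces
# the old value associated with its key, scaled by a scalar learning rate.
import torch

def delta_rule_step(S, k, v, beta):
    """One recurrent step.  S: (d, d) state; k, v: (d,) vectors; beta: scalar."""
    v_old = S @ k                                 # value currently bound to key k
    S = S + beta * torch.outer(v - v_old, k)      # replace it, scaled by beta
    return S

d = 8
S = torch.zeros(d, d)
keys, values = torch.randn(16, d), torch.randn(16, d)
for k, v in zip(keys, values):
    S = delta_rule_step(S, k, v, beta=0.5)
out = S @ keys[0]                                 # query the state with a key
```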

RWKV-7 introduces key innovations layered on the RWKV-6 architecture, including token shift, a bonus mechanism, and a ReLU² feedforward network. The model’s training corpus, RWKV World v3, strengthens its English, code, and multilingual capabilities. In addition to releasing the trained models, the team provides proofs that RWKV-7 can solve problems beyond TC⁰ complexity, including S₅ state tracking and recognition of all regular languages, demonstrating its ability to handle computationally complex tasks more efficiently than Transformers. Furthermore, the researchers propose a cost-effective method for upgrading the RWKV architecture without full retraining, facilitating incremental improvements. Larger datasets and models will continue to be developed under the open-source license, ensuring broad accessibility and reproducibility.
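As a rough illustration of two of the components named above, the sketch below combines a token-shift step with a ReLU² feedforward block; the parameter names and dimensions are assumptions for illustration, not the released implementation.

```python
# Hedged sketch of token shift + a ReLU^2 feed-forward block.
# Dimensions and parameter names are illustrative assumptions.
import torch
import torch.nn as nn

class ReLUSquaredFFN(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.mix = nn.Parameter(torch.rand(d_model))          # token-shift mixing weights
        self.w_in = nn.Linear(d_model, d_hidden, bias=False)
        self.w_out = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        x_prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)  # previous token
        x_mixed = x * self.mix + x_prev * (1 - self.mix)       # token shift
        h = torch.relu(self.w_in(x_mixed)) ** 2                # ReLU^2 activation
        return self.w_out(h)

ffn = ReLUSquaredFFN(d_model=64, d_hidden=256)
y = ffn(torch.randn(2, 16, 64))                                # (2, 16, 64)
```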

The RWKV-7 model takes a structured approach to sequence modeling, denoting the model dimension as D and using trainable matrices for its computations. It introduces vector-valued state gating, in-context learning rates, and a refined delta rule formulation. The time-mixing process prepares weights with low-rank MLPs and involves key components such as removal keys, decay factors, and learning rates designed for effective state evolution. The weighted key-value (WKV) mechanism drives dynamic state transitions, approximating a forget gate. In addition, RWKV-7 improves expressivity through per-channel modifications and a two-layer MLP, improving computational stability and efficiency while retaining state-tracking capabilities.
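The single-head sketch below is one possible reading of the ingredients described above (a per-channel decay, a removal key, a vector-valued in-context learning rate, and a key-value write); the exact composition in RWKV-7 differs, so treat it only as a shape-level illustration of how such a state transition keeps per-token cost constant.

```python
# Shape-level sketch of a WKV-style state transition.  The exact RWKV-7
# formulation differs; decay, removal key, and learning rate here are
# assumptions used only to illustrate the general structure.
import torch

def wkv_step(S, w, kappa, a, k, v, q):
    """S: (d, d) state; w, kappa, a, k, v, q: (d,) vectors."""
    kappa = kappa / kappa.norm().clamp(min=1e-6)               # normalized removal key
    transition = torch.diag(w) - torch.outer(a * kappa, kappa) # decay + targeted removal
    S = S @ transition + torch.outer(v, k)                     # evolve state, write new pair
    return S, S @ q                                            # new state and readout

d = 8
S = torch.zeros(d, d)
for _ in range(16):
    w = torch.sigmoid(torch.randn(d))        # per-channel decay in (0, 1), illustrative
    a = torch.sigmoid(torch.randn(d))        # in-context learning rate, illustrative
    kappa, k, v, q = (torch.randn(d) for _ in range(4))
    S, out = wkv_step(S, w, kappa, a, k, v, q)
```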

The RWKV-7 models were evaluated on a range of English and multilingual benchmarks using the LM Evaluation Harness, demonstrating competitive performance against state-of-the-art models while using fewer training tokens. Notably, RWKV-7 outperforms its predecessor on MMLU and improves significantly on multilingual tasks. Furthermore, evaluations on recent internet data confirm its effectiveness at handling new information. The model also performs well in associative recall, mechanistic architecture design, and long-context retention. Despite limited training resources, RWKV-7 exhibits high efficiency, achieving strong benchmark results while requiring fewer FLOPs than leading Transformer models.
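For readers who want to reproduce this style of evaluation, the hedged example below shows how benchmarks such as MMLU are typically run with EleutherAI's LM Evaluation Harness; the checkpoint identifier is a placeholder (not necessarily the released RWKV-7 name), and the harness API may vary between versions.

```python
# Hedged example of running benchmarks with EleutherAI's lm-evaluation-harness.
# The model identifier is a placeholder; API details may differ across versions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                   # Hugging Face backend
    model_args="pretrained=<rwkv7-checkpoint>",   # placeholder checkpoint id
    tasks=["mmlu", "lambada_openai"],             # English benchmarks of the kind cited above
    batch_size=8,
)
print(results["results"])
```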

In summary, RWKV-7 is an RNN-based architecture that achieves state-of-the-art results on multiple benchmarks while requiring fewer training tokens. It maintains high parameter efficiency, linear time complexity, and constant memory usage, making it a strong alternative to Transformers. However, it still faces limitations such as numerical precision sensitivity, a lack of instruction tuning, prompt sensitivity, and constrained computational resources. Future work includes optimizing speed, incorporating chain-of-thought reasoning, and scaling to larger datasets. The RWKV-7 models and training code are publicly available under the Apache 2.0 license to encourage research and development in efficient sequence modeling.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, and don’t forget to join our 85k+ ML SubReddit.


Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
