
Large language models (LLMs) have become central to modern AI, powering high-performance applications such as natural language generation, scientific research assistance, and dialogue agents. Underlying these advances is the transformer architecture, in which alternating attention and feed-forward network (FFN) layers process tokenized inputs sequentially. As models grow in size and complexity, the computation required for inference rises sharply, creating efficiency bottlenecks. Efficient inference is now a central concern, and many research groups focus on strategies that reduce latency, increase throughput, and lower computational cost while maintaining or improving model quality.

At the heart of this efficiency problem is the transformer's inherently sequential structure. Each layer's output feeds the next, enforcing strict ordering and synchronization, which becomes especially problematic at scale. As model size grows, the cost of sequential computation and communication across GPUs rises, reducing utilization and increasing deployment costs. The challenge is amplified in settings that demand fast, multi-turn interaction, such as real-time AI assistants. Reducing this sequential load while preserving model capability is a key technical barrier: unlocking parallelization strategies that maintain accuracy but significantly reduce computational depth is critical to making LLMs more accessible and scalable.

Several techniques have emerged to improve efficiency. Quantization reduces the precision of numerical representations to cut memory and compute requirements, though it often incurs accuracy loss, especially at low bit-widths. Pruning removes redundant parameters and simplifies the model, but can degrade accuracy if applied carelessly. Mixture-of-Experts (MoE) models activate only a subset of parameters per input, making them efficient for certain workloads, yet they can still underperform at intermediate batch sizes due to low hardware utilization. Each of these strategies is valuable but carries trade-offs that limit its universal applicability. The field is therefore seeking efficiency improvements with fewer compromises, particularly for dense architectures, which are simpler to train, deploy, and maintain.

NVIDIA researchers introduce FFN Fusion, an architectural optimization technique that addresses the sequential bottleneck in transformers by identifying FFN sequences that can be executed in parallel. The approach grew out of an observation: when the Puzzle tool is used to remove attention layers, models often retain long runs of consecutive FFN layers. These sequences show minimal interdependence and can therefore be processed simultaneously. By analyzing the structure of LLMs such as Llama-3.1-405B-Instruct, the researchers created a new model, Ultra-253B-Base, by pruning and restructuring the base model through FFN Fusion. The result is a significantly more efficient model that maintains competitive performance.

FFN Fusion merges multiple consecutive FFN layers into a single, wider FFN. The process rests on a mathematical equivalence: by concatenating the weights of several FFNs, one can construct a single module whose output equals the sum of the original layers' outputs, yet can be computed in parallel. For example, if three FFNs are stacked sequentially, each depending on the output of the one before it, fusion removes these dependencies by running all three on the same input and summing their outputs. The theoretical analysis shows that the fused FFN retains the same representational capacity. To decide where fusion is safe, the researchers performed a dependency analysis using the cosine distance between FFN outputs, identifying regions of low interdependence. These regions are the best candidates for fusion, since minimal change in token direction between layers indicates that parallel processing is feasible.
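The weight-concatenation idea can be illustrated with a minimal sketch, assuming a Llama-style gated FFN; names like GatedFFN and fuse_ffns are illustrative, not taken from the paper's code. Note that the exact equivalence holds when every original FFN sees the same input; replacing a residual chain of FFN blocks with this fused module is the approximation that FFN Fusion applies in low-dependency regions.

```python
# Minimal sketch of fusing several gated FFNs (silu(x W_gate^T) * (x W_up^T)) W_down^T
# into one wider FFN whose output equals the sum of the individual FFN outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFFN(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden, bias=False)
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

def fuse_ffns(ffns):
    """Concatenate the weights of several FFNs into a single wider FFN."""
    d_model = ffns[0].gate.in_features
    d_hidden_total = sum(f.gate.out_features for f in ffns)
    fused = GatedFFN(d_model, d_hidden_total)
    with torch.no_grad():
        # Stack the input projections along the hidden dimension...
        fused.gate.weight.copy_(torch.cat([f.gate.weight for f in ffns], dim=0))
        fused.up.weight.copy_(torch.cat([f.up.weight for f in ffns], dim=0))
        # ...and the output projection along its input dimension, so the
        # down-projection sums the contributions of the original FFNs.
        fused.down.weight.copy_(torch.cat([f.down.weight for f in ffns], dim=1))
    return fused

# Sanity check: the fused FFN on x matches the sum of the original FFNs on x.
ffns = [GatedFFN(64, 128) for _ in range(3)]
x = torch.randn(4, 64)
fused = fuse_ffns(ffns)
assert torch.allclose(fused(x), sum(f(x) for f in ffns), atol=1e-5)
```

Because the fused module is one wide matrix multiplication instead of three dependent ones, it can saturate the hardware in a single pass rather than waiting on a chain of smaller layers.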

Applying FFN Fusion to the Llama-405B model produced Ultra-253B-Base, with substantial gains in speed and resource efficiency. Specifically, the new model achieves a 1.71x improvement in inference latency and a 35x reduction in per-token computational cost at a batch size of 32. This efficiency does not come at the expense of capability: Ultra-253B-Base scores 85.17% on MMLU, 72.25% on MMLU-Pro, 84.92% on Arena Hard, 86.58% on HumanEval, and 9.19 on MT-Bench. These results generally match or exceed the original 405B-parameter model, even though Ultra-253B-Base contains only 253 billion parameters. Memory usage also improves, with KV-cache requirements reduced by 2x. The training process involved distillation on 54 billion tokens with an 8K context window, followed by fine-tuning at 16K, 32K, and 128K contexts. These steps ensure the fused model retains high accuracy while benefiting from its reduced size.

The study shows how thoughtful architectural redesign can unlock significant efficiency gains. The researchers demonstrate that FFN layers in transformer architectures are often more independent of one another than previously assumed. Their method for quantifying inter-layer dependence and transforming the model structure applies across models of various sizes, and the technique was also validated on a 70B-parameter model, demonstrating generalizability. Further experiments show that while FFN layers can usually be fused with minimal impact, full block parallelization that includes attention introduces greater performance degradation due to stronger interdependence.

Key takeaways from the FFN Fusion research:

  • FFN Fusion parallelizes computation that transformers would otherwise perform sequentially, by fusing runs of low-dependency FFN layers.
  • Fusion is achieved by replacing a sequence of FFNs with a single wider FFN built from their concatenated weights.
  • Ultra-253B-Base, derived from Llama-3.1-405B, achieves 1.71x faster inference and a 35x reduction in per-token cost.
  • Benchmark results: 85.17% (MMLU), 72.25% (MMLU-Pro), 86.58% (HumanEval), 84.92% (Arena Hard), and 9.19 (MT-Bench).
  • Memory usage is cut in half thanks to KV-cache optimizations.
  • FFN Fusion is more effective at larger model scales and composes well with techniques such as pruning and quantization.
  • Full transformer block parallelization (including attention) shows potential but needs further research due to stronger interdependencies.
  • A systematic cosine-distance analysis determines which FFN sequences can be safely fused (see the sketch after this list).
  • The technique is validated across model sizes, including 49B, 70B, and 253B.
  • The approach lays the groundwork for more hardware-friendly and efficient LLM designs.
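
The cosine-distance analysis mentioned above can be sketched roughly as follows. This is a simplified illustration under assumed conventions: the threshold value, the exact metric, and helper names such as find_fusible_runs are not from the paper. The idea is to measure, on a calibration set, how much each FFN block changes the direction of its token hidden states, and to treat runs of layers with small distances as fusion candidates.

```python
# Hedged sketch: use per-layer cosine distance between a block's input and
# output hidden states as a proxy for how strongly later layers depend on it.
import torch
import torch.nn.functional as F

def block_cosine_distance(h_in: torch.Tensor, h_out: torch.Tensor) -> float:
    """Mean (1 - cosine similarity) between per-token hidden states
    before and after a block; both tensors have shape (tokens, d_model)."""
    cos = F.cosine_similarity(h_in, h_out, dim=-1)
    return (1.0 - cos).mean().item()

def find_fusible_runs(distances, threshold=0.05, min_len=2):
    """Return (start, end) index ranges of consecutive layers whose
    cosine distance stays below the threshold."""
    runs, start = [], None
    for i, d in enumerate(distances):
        if d < threshold:
            start = i if start is None else start
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i - 1))
            start = None
    if start is not None and len(distances) - start >= min_len:
        runs.append((start, len(distances) - 1))
    return runs

# Example with made-up per-layer distances measured on a calibration set.
distances = [0.20, 0.03, 0.02, 0.04, 0.18, 0.01, 0.02]
print(find_fusible_runs(distances))  # [(1, 3), (5, 6)]
```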

Check out the Paper. All credit for this research goes to the researchers of the project. Also, feel free to follow us on Twitter and don't forget to join our 85k+ ML SubReddit.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform that stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among readers.
