
Multimodal large language models (MLLMs) have advanced the integration of visual and textual modalities, driving progress in tasks such as image captioning, visual question answering, and document interpretation. However, a lack of transparency often hinders the replication and further development of these models. Many state-of-the-art MLLMs do not release critical components, including training code, data curation methodologies, and pre-training datasets. Furthermore, the substantial computational resources required to train these models pose a significant barrier, especially for academic researchers with limited infrastructure. This lack of accessibility impedes reproducibility and slows the spread of new techniques within the research community.

Researchers from UC Santa Barbara, ByteDance, and NVIDIA have released Open-Qwen2VL, a 2-billion-parameter multimodal model pre-trained on 29 million image-text pairs using approximately 220 A100-40G GPU hours. Open-Qwen2VL aims to address the reproducibility and resource constraints that hamper MLLM research. The project provides a complete suite of open-source resources, including the training codebase, data filtering scripts, WebDataset-formatted pre-training data, and both base and instruction-tuned model checkpoints. This comprehensive release is designed to support transparent experimentation and method development in multimodal learning.
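Because the pre-training corpus is distributed in WebDataset format, the released shards can be streamed with the open-source webdataset library. The sketch below is illustrative only; the shard naming pattern and field keys ("jpg", "txt") are assumptions, not the project's documented layout.

```python
# Minimal sketch of streaming WebDataset-formatted image-text shards.
# Shard paths and field keys are hypothetical placeholders.
import webdataset as wds

shards = "pretrain-data/shard-{000000..000099}.tar"   # hypothetical shard pattern
dataset = (
    wds.WebDataset(shards)
    .decode("pil")               # decode images to PIL.Image objects
    .to_tuple("jpg", "txt")      # yield (image, caption) pairs
)

for image, caption in dataset:
    print(image.size, caption[:60])
    break
```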

Open-Qwen2VL is built on the Qwen2.5-1.5B-Instruct LLM backbone combined with a SigLIP-SO-400M vision encoder. An adaptive average-pooling visual projector reduces the number of visual tokens from 729 to 144 during pre-training, improving computational efficiency. During the supervised fine-tuning (SFT) stage, the token count is increased back to 729. This low-to-high resolution strategy maintains image comprehension while optimizing resource usage.
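As a rough illustration of how such a projector can work, the PyTorch sketch below pools a 27×27 grid of SigLIP patch tokens (729 total) down to a 12×12 grid (144 tokens) and projects them into the LLM embedding space. The hidden sizes and MLP structure are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class AvgPoolVisualProjector(nn.Module):
    """Illustrative adaptive average-pooling projector: pools the vision encoder's
    patch-token grid to a smaller grid, then maps tokens into the LLM embedding
    space. Dimensions (1152 for SigLIP, 1536 for Qwen2.5-1.5B) are assumptions."""

    def __init__(self, vision_dim=1152, llm_dim=1536, out_grid=12):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(out_grid)   # 27x27 -> out_grid x out_grid
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_tokens):                 # (B, 729, vision_dim)
        b, n, d = patch_tokens.shape
        side = int(n ** 0.5)                         # 27 for 729 patch tokens
        x = patch_tokens.transpose(1, 2).reshape(b, d, side, side)
        x = self.pool(x)                             # (B, d, 12, 12)
        x = x.flatten(2).transpose(1, 2)             # (B, 144, d)
        return self.proj(x)                          # (B, 144, llm_dim)

# Pre-training would use out_grid=12 (144 tokens); setting out_grid=27 keeps all
# 729 tokens, matching the higher-resolution SFT stage described above.
```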

To further improve training efficiency, Open-Qwen2VL implements multimodal sequence packing, concatenating multiple image-text pairs into sequences of approximately 4096 tokens to minimize padding and computational overhead. The vision encoder parameters remain frozen during pre-training to save resources and can optionally be unfrozen during SFT to improve downstream performance.
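A simple way to picture the packing step is a first-fit scheme: given per-sample token counts (visual plus text tokens), samples are grouped into bins of at most ~4096 tokens so that little padding is needed. The sketch below is a simplified illustration, not the project's exact packing code.

```python
def pack_sequences(samples, max_len=4096):
    """First-fit sketch of multimodal sequence packing. Each sample carries a
    precomputed token count; samples are grouped into bins whose total length
    stays within max_len, minimizing padding."""
    bins, bin_lens = [], []
    for s in samples:
        for i, used in enumerate(bin_lens):
            if used + s["num_tokens"] <= max_len:
                bins[i].append(s)
                bin_lens[i] += s["num_tokens"]
                break
        else:  # no existing bin had room: open a new one
            bins.append([s])
            bin_lens.append(s["num_tokens"])
    return bins

# Example: 100 pairs, each with 144 visual tokens plus ~60 caption tokens
samples = [{"id": i, "num_tokens": 144 + 60} for i in range(100)]
packed = pack_sequences(samples)
print(f"{len(packed)} packed sequences instead of 100 padded ones")
```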

Open-Qwen2VL is trained on only 0.36% of the token count used by Qwen2-VL, yet it shows comparable or superior performance on several benchmarks. The model scored 80.9 on MMBench and performed competitively on SEED-Bench (72.5), MMStar (49.7), and MathVista (53.1). Ablation studies showed that a small subset (5M samples) of high-quality image-text pairs, filtered using MLLM-based techniques, can yield measurable performance improvements, highlighting the importance of data quality over volume.
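To make the data-quality point concrete, the snippet below sketches quality filtering in the abstract: each image-text pair is scored by some quality model (here an unspecified `score_fn` placeholder) and only high-scoring pairs are kept. This is a generic illustration of score-and-filter curation, not Open-Qwen2VL's actual filtering pipeline.

```python
def filter_pairs(pairs, score_fn, threshold=0.85, keep_top_k=None):
    """Generic score-and-filter sketch: keep pairs whose quality score passes a
    threshold, or optionally keep only the top-k highest-scoring pairs.
    `score_fn`, `threshold`, and `keep_top_k` are illustrative placeholders."""
    scored = [(score_fn(p["image"], p["caption"]), p) for p in pairs]
    if keep_top_k is not None:
        scored.sort(key=lambda item: item[0], reverse=True)
        return [p for _, p in scored[:keep_top_k]]
    return [p for score, p in scored if score >= threshold]
```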

In addition, Open-Qwen2VL demonstrates strong few-shot multimodal in-context learning capabilities. When evaluated on datasets such as GQA and TextVQA, the model shows accuracy improvements when moving from 0-shot to 8-shot settings. Fine-tuning performance also scales predictably with the size of the instruction-tuning dataset, with gains observed up to approximately 8 million examples from the MAmmoTH-VL-10M dataset.

Open-Qwen2VL introduces a reproducible and resource-efficient pipeline for training multimodal large language models. By systematically addressing the openness and compute limitations of previous models, it enables broader participation in MLLM research. The model's design choices, including efficient visual token processing, multimodal sequence packing, and judicious data selection, offer a viable avenue for academic institutions aiming to contribute to the field. Open-Qwen2VL establishes a reproducible baseline and lays the groundwork for future work on scalable, high-performance MLLMs in compute-constrained environments.


Check out the Paper, Model, Data, and Code. All credit for this research goes to the researchers of this project.



Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform known for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform receives more than 2 million monthly views, illustrating its popularity among readers.
