
Despite rapid advances in vision-language modeling, much of the progress in this field is shaped by models trained on proprietary datasets, often relying on distillation from closed-source systems. This dependence creates barriers to scientific transparency and reproducibility, particularly for tasks involving fine-grained image and video understanding. Benchmark performance may reflect the training data and the capabilities of black-box models more than genuine architectural or methodological improvements, making it difficult to assess real research progress.

To address these limitations, Meta AI introduces PLM (Perception Language Model), which is designed to support both image and video inputs and is trained without using the outputs of proprietary models. Instead, it relies on large-scale synthetic data and newly collected human-labeled datasets, allowing model behavior and training dynamics to be evaluated in detail under transparent conditions.

The PLM framework couples a vision encoder (the Perception Encoder) with language decoders of different sizes: 1B, 3B, and 8B parameters. It employs a multi-stage training pipeline: an initial warm-up on low-resolution synthetic images, large-scale training on diverse synthetic datasets, and supervised fine-tuning on high-resolution data with precise annotations. The pipeline emphasizes training stability and scalability while maintaining control over the sources and content of the data.
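As a rough sketch, this staged schedule can be expressed as a simple configuration that is executed in order. The stage names mirror the description above, but the field names, the choice of trainable components per stage, and the `run_pipeline` helper are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of PLM's three-stage training schedule.
# Stage names follow the article; which components are trainable
# in each stage is an assumption for illustration.
TRAINING_STAGES = [
    {"name": "warmup",
     "data": "low-resolution synthetic images",
     "trainable": ["projector"]},                           # assumption
    {"name": "large_scale_training",
     "data": "diverse synthetic image/video datasets",
     "trainable": ["projector", "llm"]},                     # assumption
    {"name": "supervised_finetuning",
     "data": "high-resolution, precisely annotated data",
     "trainable": ["projector", "llm", "vision_encoder"]},   # assumption
]

def run_pipeline(model, stages=TRAINING_STAGES):
    """Run each stage in order, resuming from the previous stage's weights."""
    for stage in stages:
        print(f"stage={stage['name']} data={stage['data']}")
        # train(model, stage)  # placeholder for the actual training loop
```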

A key contribution of this work is the release of two large-scale, high-quality video datasets that address existing gaps in temporal and spatial understanding. The PLM-FGQA dataset contains 2.4 million question-answer pairs capturing fine-grained details of human actions, such as object manipulation, movement direction, and spatial relationships, spanning diverse video domains. Complementing it is PLM-STC, a dataset of 476,000 spatio-temporal captions linked to segmentation masks that track subjects over time, allowing models to reason about the "what", "when", and "where" of complex video scenes.
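To make the two datasets concrete, the records below sketch what a single PLM-FGQA and PLM-STC sample might look like; the field names and values are hypothetical illustrations for exposition, not the official release schema.

```python
# Hypothetical example records; field names and values are illustrative only.
fgqa_example = {
    "video_id": "clip_000123",
    "question": "Which hand does the person use to pick up the cup?",
    "answer": "The left hand.",
    "domain": "kitchen activity",            # fine-grained human action ("what")
}

stc_example = {
    "video_id": "clip_000456",
    "caption": "A cyclist in a red jacket rides from the curb into the crosswalk.",
    "segment_seconds": [4.0, 9.5],           # temporal extent ("when")
    "mask_track": "masks/clip_000456.json",  # per-frame segmentation masks ("where")
}
```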

Technically, PLM adopts a modular architecture that supports high-resolution image tiling (up to 36 tiles) and multi-frame video input (up to 32 frames). A 2-layer MLP projector connects the vision encoder to the LLM, and both the synthetic and human-labeled data are constructed to support a wide range of tasks, including captioning, visual question answering, and dense region-based reasoning. The synthetic data engine is built entirely with open-source models, generating approximately 64.7 million samples across natural images, charts, documents, and videos, ensuring diversity while avoiding any reliance on proprietary sources.
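The 2-layer MLP projector is a small component, and the sketch below shows one plausible PyTorch implementation. Only the fact that a 2-layer MLP bridges the vision encoder and the LLM comes from the description above; the hidden sizes, per-frame token count, and GELU activation are assumptions.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Minimal sketch of a 2-layer MLP that maps vision-encoder features
    into the LLM embedding space. Dimensions and the GELU activation are
    assumptions; the article only states that a 2-layer MLP is used."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (batch, num_tokens, vision_dim), e.g. tokens from
        # up to 36 image tiles or up to 32 video frames concatenated along dim 1
        return self.proj(vision_tokens)

# Example: project features for a batch of 32-frame videos
tokens = torch.randn(2, 32 * 256, 1024)   # 256 tokens per frame is an assumption
llm_inputs = VisionProjector()(tokens)    # -> shape (2, 8192, 4096)
```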

Meta AI also introduces PLM-VideoBench, a new benchmark designed to evaluate aspects of video understanding not captured by existing benchmarks. It includes tasks such as fine-grained activity question answering (FGQA), smart-glasses video QA (SGQA), region-based dense captioning (RDCap), and region temporal localization (RTLoc). These tasks require models to perform temporally grounded and spatially explicit reasoning.
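For localization tasks like RTLoc, a model must predict when an event occurs in a video, and such predictions are commonly scored with temporal intersection-over-union between the predicted and ground-truth segments. The snippet below is only a sketch of that general idea, not the official PLM-VideoBench scoring code.

```python
def temporal_iou(pred: tuple[float, float], gold: tuple[float, float]) -> float:
    """Illustrative temporal IoU between a predicted and a ground-truth
    (start, end) segment in seconds. The actual PLM-VideoBench scoring
    details may differ; this only sketches the underlying idea."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0

# A model that localizes "person picks up the cup" at 4.2-9.1 s against a
# ground-truth segment of 4.0-9.5 s would score:
print(round(temporal_iou((4.2, 9.1), (4.0, 9.5)), 3))  # 0.891
```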

Empirical evaluations show that the PLM models, especially at the 8B parameter scale, perform competitively across more than 40 image and video benchmarks. In video captioning, PLM achieves average gains of +39.8 CIDEr over open baselines. On PLM-VideoBench, the 8B variant narrows the gap to human performance on structured tasks such as FGQA and shows improved results on spatio-temporal localization and dense captioning. Notably, all results are obtained without distilling from closed models, underscoring the feasibility of open and transparent VLM development.

In summary, PLM offers a rigorous and fully open approach to training and evaluating vision-language models. Its release includes not only models and code, but also the largest curated datasets for fine-grained video understanding and a benchmark suite targeting previously underserved capabilities. PLM is positioned both as a foundation for reproducible research in multimodal AI and as a resource for future work on detailed visual reasoning in open settings.


Check out the Paper, Model and Code.



Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform that stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among readers.
