Hugging Face (HF) has released Smol2Operator, a reproducible end-to-end recipe that turns a small vision-language model (VLM) with no prior UI grounding into a GUI-operating, tool-using agent. The release covers the data conversion utilities, training scripts, transformed datasets, and the resulting 2.2B-parameter model checkpoint, making it a complete blueprint for building a GUI agent from scratch rather than a single benchmark result.
What's new?
- Two-phase training on a small VLM: starting from SmolVLM2-2.2B-Instruct, a model with no prior grounding for GUI tasks, Smol2Operator first instills perception/grounding and then layers on agentic reasoning via supervised fine-tuning (SFT).
- Unified action space across heterogeneous sources: a transformation pipeline normalizes the differing GUI action taxonomies of mobile, desktop, and web datasets into a single function API (e.g., `click`, `type`, `drag` with normalized (0,1) coordinates), so data can be pooled across datasets. An Action Space Converter supports remapping to a custom vocabulary; a sketch of the idea follows this list.
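To make the conversion idea concrete, here is a minimal sketch of what such a normalization step can look like. It is illustrative only: the `SOURCE_TO_UNIFIED` table and the `convert_action` helper are hypothetical names, not the actual Smol2Operator converter API.

```python
# Hypothetical sketch of unifying heterogeneous GUI action vocabularies.
# Neither SOURCE_TO_UNIFIED nor convert_action is the real Smol2Operator API;
# they only illustrate the normalization described above.

SOURCE_TO_UNIFIED = {
    "tap": "click",          # common in mobile datasets
    "left_click": "click",   # common in desktop datasets
    "input_text": "type",    # common in web datasets
}

def convert_action(name: str, args: dict, img_w: int, img_h: int) -> str:
    """Rewrite a source-specific action into the unified, resolution-free form."""
    unified = SOURCE_TO_UNIFIED.get(name, name)
    # Convert pixel coordinates to normalized (0,1) values so the same call
    # stays valid regardless of the screenshot resolution.
    if "x" in args and "y" in args:
        args = {**args,
                "x": round(args["x"] / img_w, 3),
                "y": round(args["y"] / img_h, 3)}
    arg_str = ", ".join(f"{k}={v!r}" for k, v in args.items())
    return f"{unified}({arg_str})"

print(convert_action("tap", {"x": 540, "y": 1170}, img_w=1080, img_h=2340))
# -> click(x=0.5, y=0.5)
```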
Why do you need Smol2Operator?
Most GUI-agent pipelines are hampered by fragmented action schemas and coordinates that do not survive preprocessing. Smol2Operator's unified action space and normalized-coordinate strategy make datasets interoperable and keep training stable under image resizing, which is routine in VLM preprocessing. This reduces the engineering overhead of assembling multi-source GUI data and lowers the barrier to reproducing agent behavior with small models.
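The value of normalized coordinates is easy to see with a toy example (a sketch, not project code): a raw pixel target silently breaks as soon as the preprocessor resizes the screenshot, while a (0,1) target can always be projected back onto whatever resolution the image ends up at.

```python
# Sketch: normalized targets survive resizing; pixel targets do not.

def to_normalized(x_px: int, y_px: int, width: int, height: int) -> tuple[float, float]:
    return x_px / width, y_px / height

def to_pixels(x_n: float, y_n: float, width: int, height: int) -> tuple[int, int]:
    return round(x_n * width), round(y_n * height)

# A button at pixel (960, 540) on a 1920x1080 screenshot.
x_n, y_n = to_normalized(960, 540, 1920, 1080)   # (0.5, 0.5)

# After the VLM preprocessor resizes the image to 1152x648, the raw pixel
# coordinate (960, 540) would miss the button entirely, but the normalized
# target still projects onto it.
print(to_pixels(x_n, y_n, 1152, 648))            # (576, 324)
```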
How does it work? Training stack and data paths
- Data standardization:
  - Function calls and coordinates in the source datasets (for example, the AGUVIS stages) are normalized into a unified signature set: redundant actions are removed, parameter names are standardized, and pixel coordinates are converted to normalized (0,1) values.
- Stage 1 (Perception/Grounding):
  - SFT on the unified action dataset to teach element localization and basic UI affordances, measured on ScreenSpot-v2 (localizing elements on screenshots).
- Stage 2 (Cognitive/Agentic Reasoning):
  - A further SFT pass that converts grounded perception into agentic reasoning and action prediction aligned with the unified action API (a hedged sketch of the sample format follows this list).
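As referenced in Stage 2 above, a grounding record rendered as an SFT sample might look roughly like the following. This is a hedged sketch: the exact prompt template and field names used by Smol2Operator may differ.

```python
# Hypothetical shape of one training sample; field names are assumptions,
# not the actual Smol2Operator data schema.

record = {
    "image": "screenshot_000.png",
    "instruction": "Open the settings menu",
    "action": "click(x=0.912, y=0.048)",  # unified API, normalized coordinates
}

sft_sample = {
    "messages": [
        {"role": "user", "content": [
            {"type": "image", "path": record["image"]},
            {"type": "text", "text": record["instruction"]},
        ]},
        # Stage 1 supervises the grounded action call directly; Stage 2 would
        # prepend step-by-step reasoning text before the same function call.
        {"role": "assistant", "content": [
            {"type": "text", "text": record["action"]},
        ]},
    ],
}
```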
The HF team reports a clean performance trajectory on ScreenSpot-v2 (a grounding benchmark) as grounding is learned, and shows that the same training strategy transfers down to a ~460M-parameter nanoVLM, indicating the method's portability across model capacities (figures are given in the tables of the original post).
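ScreenSpot-style grounding is commonly scored as click accuracy: a prediction counts as correct when the predicted point lands inside the target element's bounding box. Below is a minimal sketch of that metric, assuming normalized coordinates throughout; it is not the official evaluation harness.

```python
def click_accuracy(preds: list[tuple[float, float]],
                   boxes: list[tuple[float, float, float, float]]) -> float:
    """Fraction of predicted points falling inside their target boxes.

    preds: predicted (x, y) points in (0, 1).
    boxes: ground-truth (x_min, y_min, x_max, y_max) boxes in (0, 1).
    """
    hits = sum(
        x0 <= x <= x1 and y0 <= y <= y1
        for (x, y), (x0, y0, x1, y1) in zip(preds, boxes)
    )
    return hits / len(preds)

print(click_accuracy([(0.50, 0.51)], [(0.45, 0.45, 0.55, 0.55)]))  # 1.0
```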
Scope, Limits and Next Steps
- Not a push for state-of-the-art at all costs: the HF team positions the work as a process blueprint (data conversion → grounding → reasoning) rather than a chase for leaderboard peaks.
- Evaluation focus: the demonstrations center on ScreenSpot-v2 perception and qualitative end-to-end task videos; broader cross-environment or long-horizon task benchmarks remain future work. The HF team notes RL/DPO as potential next steps beyond SFT for on-policy adaptation.
- Ecosystem trajectory: ScreenEnv's roadmap includes wider OS coverage (Android/macOS/Windows), which would increase the external validity of trained policies.
Summary
Smol2Operator is a fully open-source, reproducible pipeline that upgrades SmolVLM2-2.2B-Instruct, a VLM with zero GUI grounding, into an agentic GUI operator through a two-phase SFT process. The release normalizes heterogeneous GUI action schemas into a unified API with normalized coordinates, provides AGUVIS-based transformed datasets, publishes the training notebooks and preprocessing code, and ships the final checkpoint plus a demo Space. It targets process transparency and portability rather than leaderboard chasing, and it plugs into the smolagents runtime and ScreenEnv for evaluation, providing a practical blueprint for teams building small GUI-operating agents.
