Vision-language models (VLMs) are fundamentally changing how humans and robots collaborate in manufacturing environments, moving industrial automation from scripted programming to contextual understanding. A new survey published in Frontiers of Engineering Management provides the first comprehensive mapping of how these AI systems are reshaping human-robot collaboration in smart manufacturing.
The research team from The Hong Kong Polytechnic University and KTH Royal Institute of Technology analyzed 109 studies published between 2020 and 2024 to examine how VLMs, AI systems that jointly process images and language, enable robots to plan tasks, navigate complex environments, perform manipulation, and learn new skills directly from multimodal demonstrations. The complete findings are available at https://doi.org/10.1007/s42524-025-4136-9.
Traditional industrial robots have been constrained by brittle programming, limited perception, and minimal understanding of human intent, struggling to adapt to dynamic manufacturing environments. VLMs address these limitations by adding a powerful cognitive layer that allows robots to interpret both what they see and what they are told. These models learn to align images and text through contrastive objectives, generative modeling, and cross-modal matching, creating shared semantic spaces that enable more intuitive human-robot interaction.
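To make the alignment idea concrete, the sketch below implements a CLIP-style symmetric contrastive loss in plain PyTorch. The batch size, embedding dimension, and temperature are illustrative assumptions, not details taken from the survey or any specific model.

```python
# Minimal sketch of contrastive image-text alignment (CLIP-style) in PyTorch.
# Assumes image/text encoders have already produced paired embeddings.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(image_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss that pulls matching image/text pairs together
    in a shared embedding space and pushes mismatched pairs apart."""
    # L2-normalize so the dot product is cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix: row i compares image i with every caption
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image-to-text and text-to-image)
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Example: a batch of 8 paired image/text embeddings of dimension 512
loss = contrastive_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```

Training on large corpora of paired images and captions with an objective of this form is what yields the shared semantic space the survey describes; generative and cross-modal matching objectives play complementary roles.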
In practical applications, VLMs help robots interpret human commands, analyze real-time scenes, break down multi-step instructions, and generate executable action sequences. Systems built on architectures like CLIP, GPT-4V, BERT, and ResNet have achieved success rates above 90% in collaborative assembly and tabletop manipulation tasks. For navigation, VLMs allow robots to translate natural-language goals into movement, mapping visual cues to spatial decisions that enable robust autonomy in industrial environments.
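As a rough illustration of how a pretrained vision-language model can ground a spoken or written command against a workcell scene and hand off to action primitives, here is a hedged sketch using the Hugging Face transformers CLIP interface. The checkpoint name, candidate labels, placeholder image, and action names are assumptions for demonstration, not components of the systems reviewed in the survey.

```python
# Hedged sketch: zero-shot grounding of candidate workpiece labels against a
# scene image with CLIP (Hugging Face transformers), then a toy action plan.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Stand-in for a workcell camera frame; a real system would use the live scene.
scene = Image.new("RGB", (224, 224), color="gray")
candidate_labels = ["a hex bolt", "a torque wrench", "an empty parts bin"]

inputs = processor(text=candidate_labels, images=scene,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Higher probability = stronger match between the scene and the label text.
probs = outputs.logits_per_image.softmax(dim=1)
best = candidate_labels[int(probs.argmax())]
print(f"Most likely object in view: {best}")

# Hypothetical mapping from the grounded object to robot action primitives.
action_sequence = [f"locate({best})", "approach()", "grasp()", "place(bin)"]
print(action_sequence)
```

In deployed systems the grounding step would run over detected object crops and the action sequence would come from a planner or a language model, but the basic pattern of matching language to vision and emitting executable steps is the same.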
The safety implications are particularly significant for manufacturing leaders. In manipulation tasks, VLMs help robots recognize objects, evaluate affordances, and adjust to human motion, capabilities that are essential for safe collaboration on factory floors. This dual-modality reasoning makes interaction both more intuitive and safer for human workers, addressing long-standing concerns about human-robot proximity in industrial settings.
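The "adjust to human motion" behaviour can be pictured with a very small sketch: a velocity-scaling gate driven by an estimated human-robot distance. The distance source, thresholds, and linear ramp below are assumptions chosen for illustration, not a policy prescribed by the survey or by safety standards.

```python
# Minimal sketch (not from the survey): scale robot speed as a person
# approaches the shared workspace, based on a perception-derived distance.

def speed_scale(human_distance_m: float,
                stop_dist: float = 0.5,
                slow_dist: float = 1.5) -> float:
    """Return a velocity scaling factor in [0, 1] from the estimated
    human-robot separation (e.g., produced by a vision/VLM tracking pipeline)."""
    if human_distance_m <= stop_dist:
        return 0.0                      # inside the protective stop zone
    if human_distance_m >= slow_dist:
        return 1.0                      # far enough away for full-speed motion
    # Linear ramp between the stop and slow-down distances
    return (human_distance_m - stop_dist) / (slow_dist - stop_dist)

for d in (0.3, 0.9, 2.0):
    print(f"distance {d:.1f} m -> speed factor {speed_scale(d):.2f}")
```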
Beyond immediate applications, the survey highlights emerging work in multimodal skill transfer, where robots learn directly from visual-language demonstrations rather than labor-intensive coding. This capability could dramatically reduce the time and expertise required to reprogram industrial robots for new tasks, potentially lowering barriers to automation adoption for small and medium-sized manufacturers.
The authors caution that achieving large-scale deployment will require addressing challenges in model efficiency, robustness, and data collection, as well as developing industrial-grade multimodal benchmarks for reliable evaluation. Breakthroughs in efficient VLM architectures, high-quality multimodal datasets, and dependable real-time processing will be key to unlocking their full industrial impact.
Looking forward, VLM-enabled robots could become central to future smart factories, capable of adjusting to changing tasks, assisting workers in assembly, retrieving tools, managing logistics, conducting equipment inspections, and coordinating multi-robot systems. As these technologies mature, robots may learn new procedures from video-and-language demonstrations, reason through long-horizon plans, and collaborate fluidly with humans without extensive reprogramming, potentially ushering in a new era of adaptive, human-centric manufacturing.


