Vision LLMs in Multi-Agent
Collaborative Systems: Architecture
and Integration
Niyati Prajapati
ML/Gen AI Lead
Google
Overview
Agenda
● Context
● Intersection of machine vision, large language models (LLMs) and human-computer interaction (HCI)
● Multi-agent collaborative systems (MACS) architecture, integration, performance
● Vision LLMs into MACS
● Real world use cases
○ Sensor fabrication analysis systems
○ Warehouse robotics systems
● Future trends and challenges
Context
Context
Urgency
● Rising complexity of data, automation, robotics, logistics and manufacturing
● Increasing demand for autonomous decision-making
● Limitations of legacy systems that lack advanced vision, reasoning and efficient collaboration capabilities
● What makes vision LLMs + multi-agent collaborative systems (MACS) groundbreaking? A next-gen architecture that fuses visual reasoning with distributed intelligence.
● "Vision LLMs = smarter cameras" — they understand scenes semantically and describe them in natural language
● MACS — lets distributed agents reason collectively, with intelligent perception and coordination across agents
● Central outcome: an "autonomous decision-making system"
Intersection of machine vision, LLMs and HCI
Intersection of machine vision, LLMs and HCI
[Figure: three component pipelines — machine vision (image acquisition → image processing → feature extraction → object recognition), LLMs (text input → language processing → context understanding), and HCI (user interaction → feedback collection → interface design → usability testing) — converge through an integration step. In the integrated architecture, data sources (audio, textual, visual, sensor) feed a processing layer containing the machine vision and LLM modules, which drives an HCI layer providing real-time feedback, gesture control and an interactive UI.]
Intersection of machine vision, LLMs and HCI
What makes the vision LLM + MACS synergy practical with HCI?
● Multimodal LLMs, real-time multi-agent frameworks and distributed systems that support these functionalities at scale
● Autonomous systems that perceive and interpret their environments to coordinate actions and interact with humans intuitively
● A triad that forms the core of next-gen AI-driven industrial and human-interactive systems, built on the key technologies below (see the sketch after this list):
○ Vision-language models (GPT-4V, PaLI, Kosmos-2)
○ Reinforcement learning for agent coordination
○ Distributed multi-agent frameworks (ROS2, Ray)
○ Edge AI & vision hardware (Jetson, RealSense)
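To ground the stack above, here is a minimal sketch of two collaborating agents on Ray, one of the frameworks listed. The vision-LLM call is a stub (`describe_scene` returns a canned caption), and the agent and field names are illustrative assumptions, not an API from this deck.

```python
# Minimal sketch: two collaborating agents as Ray actors.
# Assumes `pip install ray`; the vision-LLM call is stubbed out.
import ray

ray.init(ignore_reinit_error=True)

@ray.remote
class VisionAgent:
    """Perceives a scene and reports a natural-language summary."""
    def describe_scene(self, image_id: str) -> dict:
        # Placeholder for a real vision-LLM call (e.g., a hosted VLM endpoint).
        caption = f"scene {image_id}: two pallets, one blocked aisle"
        return {"agent": "vision", "image_id": image_id, "caption": caption}

@ray.remote
class CoordinatorAgent:
    """Turns perceptual reports into task-level decisions."""
    def decide(self, report: dict) -> str:
        if "blocked" in report["caption"]:
            return f"reroute robots away from {report['image_id']}"
        return "continue current plan"

vision = VisionAgent.remote()
coordinator = CoordinatorAgent.remote()

report = ray.get(vision.describe_scene.remote("cam-07"))
print(ray.get(coordinator.decide.remote(report)))  # reroute robots away from cam-07
```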
Multi-agent collaborative systems (MACS) architecture, integration and performance
Multi-agent collaborative systems architecture, integration and performance
Core concepts
● Vision LLMs[1]
○ Inputs: Image + optional text
○ Outputs: Descriptive reasoning, decision support
○ Capabilities: Q&A, scene interpretation, captioning, semantic analysis
● MACS:
○ Collaborative distributed agents performing autonomous/semi-autonomous operations
○ Communication: Agents exchange visual, linguistic and task-level signals (see the message sketch after this list)
○ Perception: Each agent interprets local sensor data using vision LLMs
○ Shared objectives: All agents operate toward a unified goal under adaptive decision-making protocols
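As a concrete reading of the communication bullet, here is a small, hypothetical message schema for the visual, linguistic and task-level signals agents exchange; every field name is an assumption for illustration, not a published MACS protocol.

```python
# Illustrative inter-agent message carrying visual, linguistic and
# task-level signals; the schema is an assumption, not a standard.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AgentMessage:
    sender: str                      # e.g., "perception-agent-3"
    image_ref: Optional[str] = None  # visual signal: pointer to a frame/crop
    caption: Optional[str] = None    # linguistic signal: vision-LLM summary
    task: Optional[str] = None       # task-level signal: proposed action
    confidence: float = 1.0          # sender's certainty in [0, 1]
    meta: dict = field(default_factory=dict)

msg = AgentMessage(
    sender="perception-agent-3",
    image_ref="frames/cam07/000123.png",
    caption="misaligned package on belt 2",
    task="flag-for-resort",
    confidence=0.87,
)
print(msg.caption)
```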
Multi-agent collaborative systems architecture, integration and performance
MACS Architecture
● Perception layer (computer vision, sensors)
● Communication layer
● Action layer
● Shared knowledge via LLMs (see the wiring sketch after the figure)
[Figure: MACS layering — perception layer (computer vision, sensors) → communication layer (message passing; shared knowledge via LLMs) → action layer (execution).]
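A minimal sketch of how the layers could be wired, assuming simple in-process classes; a real deployment would place a message bus (e.g., ROS2 topics) between layers, and the class and method names here are illustrative.

```python
# Sketch: perception -> communication -> action, with an LLM-backed
# shared-knowledge store. All components are illustrative stand-ins.
class SharedKnowledge:
    """Blackboard all agents can read and append to (LLM-backed in practice)."""
    def __init__(self):
        self.facts: list[str] = []

class PerceptionLayer:
    def sense(self) -> dict:
        # Placeholder for camera/sensor reads + vision-LLM interpretation.
        return {"caption": "conveyor jam near station 4"}

class CommunicationLayer:
    def __init__(self, knowledge: SharedKnowledge):
        self.knowledge = knowledge
    def broadcast(self, observation: dict):
        # Message passing collapsed to a shared append for brevity.
        self.knowledge.facts.append(observation["caption"])

class ActionLayer:
    def execute(self, knowledge: SharedKnowledge):
        print(f"executing response to: {knowledge.facts[-1]}")

knowledge = SharedKnowledge()
CommunicationLayer(knowledge).broadcast(PerceptionLayer().sense())
ActionLayer().execute(knowledge)  # executing response to: conveyor jam near station 4
```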
Multi-agent collaborative systems architecture, integration and performance
Vision LLMs into MACS
● Shared visual understanding through natural language
● Vision LLMs = perceptual interpreters and planners
● Decoupled sensing + reasoning (see the sketch below)
Performance Benefits
● Better perception in complex environments
● Scalable, robust and explainable
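One way to picture decoupled sensing + reasoning: a queue between a fast sensing loop and a slower reasoning loop, so each runs at its own rate. A toy sketch with threads; the frame source and the interpretation step are stubs.

```python
# Toy sketch: sensing pushes frames at its own rate; reasoning consumes
# them independently through a queue. Stops after five frames.
import queue
import threading
import time

frames: "queue.Queue[str | None]" = queue.Queue(maxsize=8)

def sensor_loop():
    for i in range(5):          # stand-in for a camera stream
        frames.put(f"frame-{i}")
        time.sleep(0.01)        # fast sensing cadence
    frames.put(None)            # sentinel: stream ended

def reasoner_loop():
    while (frame := frames.get()) is not None:
        # Placeholder for a slower vision-LLM interpretation step.
        print(f"interpreted {frame}")

t1 = threading.Thread(target=sensor_loop)
t2 = threading.Thread(target=reasoner_loop)
t1.start(); t2.start()
t1.join(); t2.join()
```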
Multi-agent collaborative systems architecture, integration and performance
MACS with vs without Vision LLMs[2]
| Metric | MACS without Vision LLMs | MACS with Vision LLMs | Improvement | Analogy |
|---|---|---|---|---|
| Package sorting accuracy | 90% | 98% | +8 points | Specialized workers with fixed signals vs. intelligent workers with language |
| Handling time per package | 15 seconds | 12 seconds | -20% | Step-by-step visual check vs. contextual understanding |
| Adaptability to new package types | Weeks | Hours/minutes | Significantly faster | Manual retraining for each type vs. understanding through description |
| Communication efficiency | Moderate | High | Increased | Fixed codes vs. natural conversation |
| Handling ambiguity | Low | High | Significantly better | Relying solely on appearance vs. using language for clarification |
Real world use case 1: Sensor fabrication analysis system
Real world use case - Sensor fabrication analysis system
MACS Architecture
● Scanning electron microscope (SEM) image of carbon nanotube (CNT) fabrication [scanning / transmission electron microscopy]
● Domain use cases:
○ Precision + automation in micro-nano manufacturing
○ High-resolution imagery + interpretation
Sensor fabrication analysis system
CASE A: Automated quality control and defect detection[3]
● Agent 1: Acquires SEM/TEM images
● Agent 2: LLM detects defects from etching and polymerization process results
● Agent 3: Predicts performance quality using linear coefficients over identified species in the plasma and geometrical parameters such as undercut, line-edge roughness, sidewall damage and linewidth
● Outcome: Real-time quality control and feedback
Three agents acting collectively: acquire images into the analysis software, perform defect analysis, and predict overall fabrication and performance quality with decision-making
Sensor fabrication analysis system
CASE A: Automated quality control and defect detection[3]
Agent functionalities:
● Agent 1: Captures high-resolution images from electron microscopes (SEM/TEM) and surface analysis instruments during CNT synthesis and deposition stages [required pre-processing: denoising, contrast enhancement, edge detection]
● Agent 2: Detects and categorizes CNT imperfections [e.g., misaligned bundles, structural defects]
○ Vision-language interpretation: describes detected defects in natural-language terms (e.g., "surface irregularity, misalignment of CNT bundles")
● Agent 3: Predicts quality performance by integrating data from fabrication, visual scans and defect reports, and triggers corrective actions (e.g., adjusting chemical vapor deposition parameters)
● Outcome: Real-time quality control and feedback (see the pipeline sketch below)
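A minimal sketch of the three-agent flow above, under stated assumptions: the preprocessing, the defect description and the linear quality coefficients are placeholders; real coefficients would be fitted to the fab's process data.

```python
# Sketch: SEM image -> defect report -> quality prediction -> corrective action.
# Thresholds and coefficients are illustrative placeholders.
def agent1_acquire(image_path: str) -> dict:
    # Stand-in for SEM/TEM capture + denoising/contrast/edge pre-processing.
    return {"image": image_path, "preprocessed": True}

def agent2_detect_defects(frame: dict) -> dict:
    # Stand-in for a vision-LLM defect description.
    return {"defects": ["misaligned CNT bundle"],
            "caption": "surface irregularity near top-left edge"}

def agent3_predict_quality(report: dict,
                           line_edge_roughness_nm: float,
                           undercut_nm: float) -> dict:
    # Toy linear model: quality falls with roughness, undercut and defect count.
    score = (1.0
             - 0.01 * line_edge_roughness_nm
             - 0.005 * undercut_nm
             - 0.1 * len(report["defects"]))
    action = "adjust CVD parameters" if score < 0.8 else "continue"
    return {"quality": round(score, 2), "action": action}

frame = agent1_acquire("sem/run42/tile_003.png")
report = agent2_detect_defects(frame)
print(agent3_predict_quality(report, line_edge_roughness_nm=6.0, undercut_nm=20.0))
# {'quality': 0.74, 'action': 'adjust CVD parameters'}
```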
Real world use case - Sensor fabrication analysis system
CASE B: Nano-scale pattern recognition for CNT structural analysis[4]
● Agent 1: Captures structural imagery
● Agent 2: LLM interprets morphology
● Agent 3: Material classification and analysis
● Outcome: AI-enhanced feedback loop for research and development
Three agents collect CNT imagery, provide categorical and structural interpretation of the identified CNT type, and perform material analysis with AI-driven data pipelines
Sensor fabrication analysis system
CASE B: Nano-scale pattern recognition for CNT structural analysis[4]
Agent functionalities:
● Agent 1: Captures high-resolution images of CNT structures using imaging tools such as SEM or AFM
● Agent 2: Vision LLM interprets and categorizes CNT morphological patterns [e.g., alignment, defects, density]
● Agent 3: Classifies the material type and predicts performance properties [e.g., conductivity, strength]
● Continuous AI-driven pipeline: sensing → interpretation → material analysis → feedback, accelerating R&D cycles (see the sketch below)
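A minimal sketch of the continuous sensing → interpretation → analysis → feedback loop; the morphology labels and the property map are assumptions for illustration.

```python
# Sketch: sensing -> interpretation -> material analysis -> feedback loop.
# Morphology labels and the property map are illustrative.
PROPERTIES = {
    "aligned": {"conductivity": "high", "strength": "high"},
    "tangled": {"conductivity": "low", "strength": "medium"},
}

def sense() -> str:
    return "afm/batch7/scan_001.png"   # stand-in for SEM/AFM capture

def interpret(scan_path: str) -> str:
    # Stand-in for vision-LLM morphology categorization.
    return "aligned"

def analyze(morphology: str) -> dict:
    return PROPERTIES.get(morphology, {})

def feedback(props: dict) -> str:
    if props.get("conductivity") == "high":
        return "promote recipe to next experiment"
    return "revise growth parameters"

for _ in range(3):                      # a few R&D iterations
    print(feedback(analyze(interpret(sense()))))
```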
Real world use case 2: Warehouse robotics
Real world use case - Warehouse robotics
Introduction
● Need for visual intelligence and coordination in logistics
● Dynamic, cluttered environments
Figure A: Robotics warehouse[5]
Real world use case - Warehouse robotics
CASE A: Inventory and tracking
● Agent 1: Pan-tilt-zoom (PTZ) camera scanning
● Agent 2: Vision LLM analyzes and learns inventory status
● Agent 3: Robot fleet coordination
● Outcome: Efficient pathing, fewer errors
Inventory tracking, object recognition and dynamic path optimization: multi-agent system workflow
Agent workflow: multi-camera scanning, visual reasoning through interpretation, and adaptive learning by coordinating the robot fleet
Real world use case - Warehouse robotics
CASE A: Inventory and tracking
Agent functionalities:
● Agent 1: Continuous wide-area pan-tilt-zoom (PTZ) camera scanning to capture and update the warehouse's visual inventory
● Agent 2: Vision LLM interprets visual data, identifies inventory items, detects misplaced or missing goods, and updates the inventory
● Agent 3: Manages the robot fleet's routes for picking, stocking, and repositioning items
● Vision Acquisition Agent → Perception and Reasoning Agent → Action Coordination Agent (see the sketch below)
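A minimal sketch of the Vision Acquisition → Perception and Reasoning → Action Coordination chain; the camera ID, inventory store and routing message are all illustrative.

```python
# Sketch: PTZ scan -> vision-LLM inventory read -> robot-fleet routing.
inventory = {"SKU-123": "aisle-4-bin-2"}   # illustrative inventory store

def agent1_acquire(camera_id: str) -> dict:
    # Stand-in for a PTZ camera sweep.
    return {"camera": camera_id, "frame": "frame.png"}

def agent2_reason(observation: dict) -> dict:
    # Stand-in for a vision LLM reading shelf labels and spotting anomalies.
    return {"sku": "SKU-123", "seen_at": "aisle-7-bin-1", "misplaced": True}

def agent3_coordinate(finding: dict) -> str:
    if finding["misplaced"]:
        correct = inventory[finding["sku"]]
        return (f"dispatch robot: move {finding['sku']} "
                f"from {finding['seen_at']} to {correct}")
    return "no action"

print(agent3_coordinate(agent2_reason(agent1_acquire("ptz-02"))))
```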
Real world use case - Warehouse robotics
CASE B: Real-time fault detection[5]
● Agent 1: Detects faults visually
● Agent 2: Root-cause reasoning
● Agent 3: Dispatch + system update
● Outcome: Downtime reduction, proactive maintenance
Real-time fault detection and maintenance: multi-agent system workflow
Agent workflow: vision-based fault detection, root-cause analysis and real-time updates on maintenance status
Real world use case - Warehouse robotics
CASE B: Real-time fault detection[5]
Agent functionalities:
● Agent 1: Continuously monitors machinery, conveyor belts and robotic systems through visual data to detect anomalies
● Agent 2: Vision LLM reasoning diagnoses the underlying cause of detected anomalies (e.g., misalignment, mechanical wear)
● Agent 3: Dispatches maintenance bots and updates system logs to prioritize and track repair tasks
● Fault Detection Agent → Root Cause Analysis Agent → Maintenance Coordination Agent (see the sketch below)
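A minimal sketch of the fault-detection chain; the anomaly score, cause labels and dispatch message are placeholders rather than a real maintenance API.

```python
# Sketch: fault detection -> root-cause reasoning -> maintenance dispatch.
def agent1_detect(frame: str) -> dict:
    # Stand-in for a visual anomaly detector on conveyor footage.
    return {"asset": "belt-2", "anomaly": "belt drift", "score": 0.91}

def agent2_diagnose(event: dict) -> dict:
    # Stand-in for vision-LLM root-cause reasoning.
    cause = "roller misalignment" if "drift" in event["anomaly"] else "unknown"
    return {**event, "cause": cause}

def agent3_dispatch(diagnosis: dict) -> str:
    priority = "high" if diagnosis["score"] > 0.8 else "low"
    # A real system would also append to maintenance logs here.
    return (f"[{priority}] send maintenance bot to {diagnosis['asset']} "
            f"for {diagnosis['cause']}")

print(agent3_dispatch(agent2_diagnose(agent1_detect("cam-belt2/000917.png"))))
```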
Future trends and challenges
Future trends and challenges
Future Trends
● Vision-language graph agents
● Edge deployment of LLMs
● Neuro-symbolic MACS
● Closed-loop learning
Challenges
● Real-time performance
● Agent safety and control
● Communication protocol standardization
Conclusion
● Vision LLM + MACS = future of intelligent automation
● CNT fabrication and warehouse logistics as real-world applications
● Synergy of vision, language and collaboration
● Vision LLM improves interpretability for human operators
| Case | Where applied | Improvements quantified |
|---|---|---|
| CNT fabrication - quality control | Pilot / research labs | 15–25% defect detection ↑, 2–3x inspection speed ↑, 8–12% yield ↑ |
| CNT fabrication - pattern recognition | R&D labs | 20% structural ID ↑, 30–40% cycle time ↓ |
| Warehouse - inventory and tracking | Deployed | 35% inventory error ↓, 20–30% efficiency ↑, 15–25% order speed ↑ |
| Warehouse - fault detection | Deployed (vision LLM) | 50–60% faster fault detection, 20–40% downtime ↓ |
References
References
[1] X. Chen, J.-B. Alayrac, et al. (Google Research), "PaLI: A Jointly-Scaled Multilingual Language-Image Model," 2023.
[2] OpenAI, "GPT-4 Technical Report" (multimodal capabilities of GPT-4V for real-world vision-language tasks), 2023.
[3] "Automated SEM/TEM Defect Analysis in CNT Electronics Fabrication," Samsung Research Whitepaper, Nanoelectronics and Materials Division, 2022.
[4] "Automated Characterization of Carbon Nanotube Arrays Using Computer Vision Technique," IEEE Trans. Nanotech., vol. 21, 2022, DOI: 10.1109/TNANO.2022.3149572.
[5] Amazon Robotics, "Warehouse Robotics and Vision-based Automation for Inventory Management," whitepaper "Towards Fully Autonomous Fulfillment Centers," 2022.
Questions and discussion
