Copyright 2021 Verizon. All Rights Reserved. Information contained herein is provided AS IS and subject to change without notice.
Unifying Computer Vision and
Natural Language
Understanding for
Autonomous Systems
Mumtaz Vauhkonen
Lead Distinguished Scientist
AI & D
Verizon
Computer Vision (CV) and Natural Language
Processing (NLP)
Combining computer vision and NLP will lead to more integrated, intelligent autonomous systems
CV and NLP Integration:
• Generate reasoning in language from visual input
• From language input, generate a visual representation
Example Applications
1. Robotic delivery systems
2. Service industry robots (hospitality, medical)
3. Educational systems
4. Disaster recovery systems
Approaches in Integration of CV and NLP
Some of the integration methods
• Visual description generation
• Visual reasoning
• Visual question answering (VQA)
• Visual generation from text
• Visual dialog
• Visual storytelling
• Multimodal machine translation
Areas of Focus Today
VISUAL DESCRIPTION
Identifying objects and providing captions describing their attributes and, sometimes, the actions of individual objects
VISUAL REASONING
In addition to visual description, inferring the most likely interactions between the objects and their attributes
A rule-based approach combined with deep learning for visual reasoning is presented
Visual Description
1. Generate basic descriptions or captions
   Process: Detect objects -> Generate captions
2. Generate a sentence-level description of a scene
History of models: n-grams, templates, dependency parsing, sequence-to-sequence models, encoder-decoder models
Sample image from the COCO dataset, with generated description:
{desc: Airplane in the sky. Sky is partly cloudy}
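As a rough illustration of the encoder-decoder family mentioned above (a minimal sketch, not the presenter's model; the vocabulary size and training loop are assumed to exist elsewhere), the snippet below pairs a pretrained CNN image encoder with an LSTM caption decoder.

```python
# Hedged sketch of an encoder-decoder captioner (illustrative only).
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, feat_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.init_h = nn.Linear(feat_dim, hidden_dim)   # image features -> initial LSTM state
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feats, caption_tokens):
        h0 = self.init_h(image_feats).unsqueeze(0)       # (1, B, hidden)
        c0 = torch.zeros_like(h0)
        emb = self.embed(caption_tokens)                  # (B, T, embed)
        hidden, _ = self.lstm(emb, (h0, c0))
        return self.out(hidden)                           # per-step vocabulary logits

# Encoder: a pretrained CNN with its classification head removed.
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
encoder = nn.Sequential(*list(resnet.children())[:-1])   # (B, 2048, 1, 1)

image = torch.randn(1, 3, 224, 224)                       # stand-in for a COCO image
feats = encoder(image).flatten(1)                          # (1, 2048)
logits = CaptionDecoder(vocab_size=10000)(feats, torch.zeros(1, 5, dtype=torch.long))
print(logits.shape)                                        # torch.Size([1, 5, 10000])
```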
Visual Reasoning
• Visual reasoning involves:
• Detecting the objects
• Identifying the attributes of the objects (size, color, shape, features, etc.)
• Localizing the objects in relation to each other
• Reasoning over these to give a logical explanation of the scene in the image
Object detection example:
{Airplane is on runway in a city. Airplane is about to take off or approach the airport gate.}
A System with Visual Reasoning Should Be Capable Of
• Understanding language
• Detecting objects
• Identifying objects’ attributes
• Establishing meaningful relationships between objects
• Translating those relationships into sentences
Example Model on Visual Reasoning
• TbD-Net, the Transparency by Design Network (1), composes multiple submodules in a series of steps that, when combined, create a logical sequence of reasoning.
• Example: “What color is the cube to the right of the large metal sphere?”
• Identify the large metal sphere (attributes: color, shape, etc.)
• Identify the cube’s attributes (color, size) and its location/direction (what is right vs. left)
TbD-Net modules (Mascharka et al., MIT, 2018): Attention, Query, Relate, Same, And, Compare, Or
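To make the module chain concrete, here is a toy, purely symbolic stand-in for this composition on the example question. The real network chains learned attention modules over image feature maps; this sketch only filters a hand-written scene list.

```python
# Toy, symbolic illustration of TbD-Net-style composition (not the authors' code).
scene = [
    {"shape": "sphere", "size": "large", "material": "metal",  "color": "gray", "x": 2},
    {"shape": "cube",   "size": "small", "material": "rubber", "color": "red",  "x": 5},
]

def attend(objs, **attrs):                 # Attention module: filter by attributes
    return [o for o in objs if all(o[k] == v for k, v in attrs.items())]

def relate_right(objs, anchor):            # Relate module: objects to the right of anchor
    return [o for o in objs if o["x"] > anchor["x"]]

def query(objs, attr):                     # Query module: read out an attribute
    return objs[0][attr] if objs else None

anchor = attend(scene, size="large", material="metal", shape="sphere")[0]
answer = query(attend(relate_right(scene, anchor), shape="cube"), "color")
print(answer)                              # -> "red"
```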
Example Model of Visual Reasoning (continued)
TbD-Net MIT Mascharka et al 2018
Compositional Language and Elementary Visual Reasoning Dataset (CLEVR)
A New Approach: Rule-Based Lingual Model
with Deep Learning
Present Work: Deep Learning Rule-based Approach
for Visual Reasoning
• The method developed uses a combined CNN architecture and a rule-based approach to provide visual reasoning
• The method builds confidence in object-to-object relationships, used to reason about their interactions, as more examples are encountered
• This is achieved by providing a Universal Lingual model
• The model can be localized to specific domains (hospital, tourist center, retail or manufacturing facility) using a distributed AI model with 5G
Architecture Modules
1. Derive object rules from corpus: use a method similar to word embeddings to identify the attributes, rules and actions of objects.
2. Detect objects and identify attributes, actions and functional rules: a CNN-based object detector (YOLO or other); for each detected object, identify the attributes, actions and rules from step 1, plus the spatial mapping of each object in the frame.
3. CNN to match actions based on rules from the corpus: train a model to establish links between objects through their actions, rules and attributes.
4. Create a Universal Lingual model with codes for actions and attributes: map actions and attributes to codes to establish the reasoning connection rather than a purely lingual one.
5. Translate into language representation (transformer model): train a transformer language model to create sentences based on the matched attributes, rules and actions.
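A skeleton of how these modules might connect is sketched below. Every component here is a trivial stub with placeholder names and return values, purely to show the data flow from detection through universal codes to a sentence; it is not the production implementation.

```python
# Skeleton of the module flow above (all components are illustrative stubs).
def describe_frame(frame, corpus_rules, detect, relate, generate):
    objects = [{"label": d["label"], "box": d["box"],
                "rules": corpus_rules.get(d["label"], [])} for d in detect(frame)]
    codes = relate(objects)               # CNN matching actions/rules -> universal codes
    return generate(codes)                # transformer model renders codes as sentences

sentence = describe_frame(
    frame=None,
    corpus_rules={"dog": ["sit"], "bicycle": ["parked"]},
    detect=lambda f: [{"label": "dog", "box": (0, 0, 50, 50)},
                      {"label": "bicycle", "box": (60, 0, 120, 80)}],
    relate=lambda objs: ["DOG.ACT.4", "NEXT_TO", "BIKE.OBJ"],
    generate=lambda codes: "Dog sitting next to bicycle",
)
print(sentence)
```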
Derive Object Relationship from Language Corpus
• Dog walking with human
• Person biking on the road
• Attributes: (bicycle: two wheels, seat, pedal)
Examples of derived rules, actions and attributes
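One plausible way to derive such relations from a corpus (an assumption on my part, not the presenter's stated tooling) is dependency parsing. The sketch below assumes spaCy with the en_core_web_sm model and pulls out (subject, action, related object) triples.

```python
# Sketch: derive (object, action, related object) triples from corpus sentences.
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_relations(sentence):
    relations = []
    for token in nlp(sentence):
        if token.pos_ != "VERB":
            continue
        subjects = [c.lemma_ for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
        partners = [c.lemma_ for c in token.children if c.dep_ == "dobj"]
        partners += [gc.lemma_ for c in token.children if c.dep_ == "prep"
                     for gc in c.children if gc.dep_ == "pobj"]
        for subj in subjects:
            relations.append((subj, token.lemma_, partners))
    return relations

print(extract_relations("A dog is walking with a human."))   # roughly [('dog', 'walk', ['human'])]
print(extract_relations("A person is biking on the road."))  # roughly [('person', 'bike', ['road'])]
```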
Object Detection and Attributes
• Use YOLO (or another detector) for object detection
• Match with the derived object attributes
• For each object create an attribute table
• Identify the orientation of each object in relation to the others
• Relative pixel-wise distance between objects
• Map the actions of each object vs. the others
Example attribute table:
Dog: furry, legs, ears, tail
Bicycle: two wheels, pedal, seat, handle
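A minimal sketch of this step, assuming the ultralytics YOLO package for detection; the image filename, attribute table and relation wording are illustrative placeholders.

```python
# Sketch: detect objects, attach corpus-derived attributes, compute spatial relations.
from ultralytics import YOLO

ATTRIBUTES = {                       # example attribute table derived from the corpus
    "dog": ["furry", "legs", "ears", "tail"],
    "bicycle": ["two wheels", "pedal", "seat", "handle"],
}

model = YOLO("yolov8n.pt")
result = model("street_scene.jpg")[0]

objects = []
for box, cls in zip(result.boxes.xyxy.tolist(), result.boxes.cls.tolist()):
    label = result.names[int(cls)]
    x1, y1, x2, y2 = box
    objects.append({
        "label": label,
        "attributes": ATTRIBUTES.get(label, []),
        "center": ((x1 + x2) / 2, (y1 + y2) / 2),
    })

# Relative orientation and pixel-wise distance between each pair of objects.
for a in objects:
    for b in objects:
        if a is b:
            continue
        dx = b["center"][0] - a["center"][0]
        dy = b["center"][1] - a["center"][1]
        side = "right of" if dx > 0 else "left of"
        dist = (dx ** 2 + dy ** 2) ** 0.5
        print(f'{b["label"]} is {side} {a["label"]} ({dist:.0f} px apart)')
```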
Creating the Universal Lingual Model
• Match with the actions and functional rules of the detected object
• Learn functional rules for each object
• Examples: (bird can fly; a car can drive forward or backward, park or crash; a dog cannot fly)
• Derive functional rules of objects and encode them with universal codes for actions
• Map each code to each language
Functional rules of objects derived from a large corpus, for example:
Dog: run(1) / runs(1a) / is running(1b); walk / walks / walking; jump / jumps / jumping; sit / …; bark / …
Bicycle: ride(1a) / riding(1b); parked; fall; broke; empty; rider
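Below is a toy illustration of what such a code table could look like: actions are stored as language-independent codes and only rendered into a specific language at the end. All codes and translations here are made up for illustration, not taken from the presented system.

```python
# Toy universal lingual code table (codes and renderings are illustrative).
UNIVERSAL_ACTIONS = {
    ("dog", "run"):        "DOG.ACT.1",
    ("dog", "sit"):        "DOG.ACT.4",
    ("bicycle", "parked"): "BIKE.ACT.2",
}

RENDER = {
    "en": {"DOG.ACT.4": "dog is sitting", "BIKE.ACT.2": "bicycle is parked"},
    "fi": {"DOG.ACT.4": "koira istuu",    "BIKE.ACT.2": "polkupyörä on pysäköity"},
}

def render(codes, lang="en"):
    # Reasoning happens over codes; language is chosen only at render time.
    return ". ".join(RENDER[lang][c] for c in codes)

print(render(["DOG.ACT.4", "BIKE.ACT.2"]))        # -> "dog is sitting. bicycle is parked"
print(render(["DOG.ACT.4", "BIKE.ACT.2"], "fi"))  # same reasoning, Finnish rendering
```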
Establish Meaningful Relationships between Objects
• Dog action + dog functional rule + location in relation to bicycle
+ bicycle functional rule
• Feed the attributes, actions and functional rules into a CNN-based system and train it to associate reasoning between objects based on the highest-probability matches
• The more relationships the model correctly identifies, the higher the probability assigned to those associations in the future
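As a minimal sketch of the relation model, an object pair's encoded attributes, actions and relative position are scored against candidate interactions. A small fully connected network stands in here for the CNN described above; the dimensions and feature encoding are assumptions.

```python
# Sketch of scoring candidate interactions for one object pair (architecture illustrative).
import torch
import torch.nn as nn

NUM_RELATIONS = 32          # e.g. "next to", "riding", "walking with", ...
PAIR_FEAT_DIM = 128         # encoded attributes + actions + relative position

relation_net = nn.Sequential(
    nn.Linear(PAIR_FEAT_DIM, 256), nn.ReLU(),
    nn.Linear(256, NUM_RELATIONS),
)

pair_features = torch.randn(1, PAIR_FEAT_DIM)            # one dog-bicycle pair
probs = torch.softmax(relation_net(pair_features), dim=-1)
print(probs.argmax(dim=-1))                               # highest-probability interaction
```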
System Understanding Language
• Map the matched attributes and functions onto the universal codes
• Translate the matched attribute and function codes into lingual sentences
Example output: Dog sitting next to bicycle
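A template-based stand-in for this translation step is sketched below (in the presented system a trained transformer model would perform it); the code names mirror the earlier toy table and are illustrative only.

```python
# Sketch: turn matched codes back into a sentence (template stand-in for the transformer).
def codes_to_sentence(subject, action_code, relation, obj, action_table):
    action = action_table[action_code]                 # e.g. "DOG.ACT.4" -> "sitting"
    return f"{subject.capitalize()} {action} {relation} {obj}"

ACTIONS = {"DOG.ACT.4": "sitting"}
print(codes_to_sentence("dog", "DOG.ACT.4", "next to", "bicycle", ACTIONS))
# -> "Dog sitting next to bicycle"
```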
Overview of Architecture
Localization with 5G and Mobile Edge Compute
Localization: can be localized to a domain
Training: trained models for multiple domains can exist in the cloud or on Mobile Edge Compute (MEC)
Connectivity: 5G enables fast connectivity and data transfer with low latency and high compute accessibility
Derive high value: compute resources on a mobile autonomous system are limited; a basic robot can be connected to the trained models wherever needed and operate fully in that environment
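A sketch of how a compute-limited robot might call a domain-specific model hosted on an MEC node over 5G; the endpoint URL and the JSON response schema are hypothetical.

```python
# Sketch: lightweight robot client offloading frames to a domain-specific MEC model.
import requests

MEC_ENDPOINT = "https://mec.example.net/hospital/visual-reasoning"   # hypothetical

def describe_remote(jpeg_bytes: bytes) -> str:
    resp = requests.post(MEC_ENDPOINT, data=jpeg_bytes,
                         headers={"Content-Type": "image/jpeg"}, timeout=1.0)
    resp.raise_for_status()
    return resp.json()["description"]    # assumed response field
```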
Advantages
• Deriving rules, attributes and functional rules and mapping them to detected objects cuts down the large computations needed for reasoning
• Specialized training for high accuracy in localized environments vs. overall generic training
• Rules can be constantly updated independently of the object detection or mapping CNN architecture
• Encoding is done from a reasoning perspective rather than a purely language-based one
Metrics and Results
• The metric used for the final output is Consensus-based Image Description Evaluation (CIDEr)*. It measures the similarity of a generated sentence to a human-derived ground-truth sentence; the metric shows, by consensus, how close the auto-generated sentences are to human-generated ones.
• Compared with ROUGE, which uses an n-gram method to compare an automatic summary to a human-created summary.
* Ramakrishna Vedantam, C. Lawrence Zitnick, Devi Parikh. "CIDEr: Consensus-based Image Description Evaluation," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
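For reference, a minimal scoring sketch assuming the pycocoevalcap package, which provides the CIDEr implementation from the cited paper; inputs are dictionaries mapping image IDs to lists of sentences, and the example sentences are illustrative.

```python
# Sketch: score generated descriptions against human references with CIDEr.
from pycocoevalcap.cider.cider import Cider

references = {"img1": ["a dog is sitting next to a bicycle"]}   # human ground truth
candidates = {"img1": ["dog sitting next to bicycle"]}          # system output

score, per_image = Cider().compute_score(references, candidates)
print(f"CIDEr: {score:.3f}")
```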
Thank You
References
1. Mascharka, David; Tran, Philip; Soklaski, Ryan; Majumdar, Arjun. "Transparency by Design: Closing the Gap Between Performance and Interpretability in Visual Reasoning," The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
2. Ramakrishna Vedantam, C. Lawrence Zitnick, Devi Parikh. "CIDEr: Consensus-based Image Description Evaluation," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.