“Unifying Computer Vision and Natural Language Understanding for Autonomous Systems,” a Presentation from Verizon

For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2022/09/unifying-computer-vision-and-natural-language-understanding-for-autonomous-systems-a-presentation-from-verizon/

Mumtaz Vauhkonen, Lead Distinguished Scientist and Head of Computer Vision for Cognitive AI in AI&D at Verizon, presents the “Unifying Computer Vision and Natural Language Understanding for Autonomous Systems” tutorial at the May 2022 Embedded Vision Summit.

As the applications of autonomous systems expand, many such systems need the ability to perceive using both vision and language, coherently. For example, some systems need to translate a visual scene into language. Others may need to follow language-based instructions when operating in environments that they understand visually. Or, they may need to combine visual and language inputs to understand their environments.

In this talk, Vauhkonen introduces popular approaches to joint language-vision perception. She also presents a unique deep learning rule-based approach utilizing a universal language object model. This new model derives rules and learns a universal language of object interaction and reasoning structure from a corpus, which it then applies to the objects detected visually. She shows that this approach works reliably for frequently occurring actions. She also shows that this type of model can be localized for specific environments and can communicate with humans and other autonomous systems.

“Unifying Computer Vision and Natural Language Understanding for Autonomous Systems,” a Presentation from Verizon

  1. Unifying Computer Vision and Natural Language Understanding for Autonomous Systems
     Mumtaz Vauhkonen, Lead Distinguished Scientist, AI & D, Verizon
  2. Computer Vision (CV) and Natural Language Processing (NLP)
     Combining computer vision and NLP will lead to more integrated, intelligent autonomous systems.
     CV and NLP integration:
     • Generate reasoning in language from visual input
     • From language input, generate a visual representation
  3. Example Applications
     1. Robotic delivery systems
     2. Service industry robots (hospitality, medical)
     3. Educational systems
     4. Disaster recovery systems
  4. Approaches in Integration of CV and NLP
     Some of the integration methods:
     • Visual description generation
     • Visual reasoning
     • Visual question answering (VQA)
     • Visual generation from text
     • Visual dialog
     • Visual storytelling
     • Multimodal machine translation
  5. Areas of Focus Today
     Visual description: identifying objects and generating captions describing their attributes and, in some cases, the actions of individual objects.
     Visual reasoning: in addition to visual description, inferring the most probable interactions between each object and their attributes.
     A rule-based approach combined with deep learning for visual reasoning is presented.
  6. Visual Description
     1. Generate basic descriptions or captions. Process: detect objects -> generate captions.
     2. Generate a sentence-level description of a scene.
     History of models: n-grams, templates, dependency parsing, sequence-to-sequence models, encoder-decoder models.
     (Sample images from the COCO dataset.) {desc: Airplane in the sky. Sky is partly cloudy.}
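
As a rough illustration of the encoder-decoder family mentioned on this slide, here is a minimal sketch (my own, not the speaker's implementation) that pairs a CNN image encoder with an LSTM decoder emitting caption tokens; the vocabulary size and dimensions are placeholder values.

```python
# Minimal encoder-decoder captioning skeleton (illustrative only).
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        cnn = models.resnet18(weights=None)            # image encoder backbone
        cnn.fc = nn.Linear(cnn.fc.in_features, embed_dim)
        self.encoder = cnn
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # Encode the image and prepend it as the first "token" of the sequence.
        img_feat = self.encoder(images).unsqueeze(1)   # (B, 1, E)
        tok_feat = self.embed(captions)                # (B, T, E)
        seq = torch.cat([img_feat, tok_feat], dim=1)   # (B, T+1, E)
        hidden, _ = self.decoder(seq)
        return self.out(hidden)                        # logits over the vocabulary

model = CaptionModel(vocab_size=10000)
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 13, 10000])
```
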
  7. Visual Reasoning
     Visual reasoning involves:
     • Detecting the objects
     • Identifying the attributes of the objects (size, color, shape, features, etc.)
     • Localizing the objects in relation to each other
     • Using reasoning to give a logical explanation of the scene in the image
     Object detection example: {Airplane is on a runway in a city. Airplane is about to take off or approach the airport gate.}
  8. A System with Visual Reasoning Should Be Capable Of
     • Understanding language
     • Detecting objects
     • Identifying objects' attributes
     • Establishing meaningful relationships between objects
     • Translating those relationships into sentences
  9. Example Model of Visual Reasoning
     • TbD-Net, the Transparency by Design Network (1), creates multiple submodules in a series of steps that, when combined, create a logical sequence of reasoning.
     • Example: "What color is the cube to the right of the large metal sphere?"
       • Identify the large metal sphere (attributes, color, shape, etc.)
       • Identify the cube's attributes (color, size) and location/direction (what is right vs. left)
     • Module types: Attention, Query, Relate, Same, And, Compare, Or
     (TbD-Net, MIT, Mascharka et al. 2018)
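
To make the compositional idea concrete, here is a toy sketch of chaining such modules over a symbolic scene to answer the example question. This is my own illustration, not TbD-Net code; the real modules operate on attention maps over image features rather than on symbolic object lists.

```python
# Toy composition of reasoning modules over a symbolic scene description.
scene = [
    {"shape": "sphere", "size": "large", "material": "metal", "x": 2},
    {"shape": "cube", "size": "small", "material": "rubber", "x": 5, "color": "red"},
]

def attend(objs, **criteria):
    """Attention module: keep objects matching all given attributes."""
    return [o for o in objs if all(o.get(k) == v for k, v in criteria.items())]

def relate_right_of(objs, anchors):
    """Relate module: objects to the right of any anchor (larger x)."""
    return [o for o in objs if any(o["x"] > a["x"] for a in anchors)]

def query(objs, attribute):
    """Query module: read an attribute off the single attended object."""
    return objs[0][attribute]

# "What color is the cube to the right of the large metal sphere?"
anchor = attend(scene, shape="sphere", size="large", material="metal")
candidates = relate_right_of(attend(scene, shape="cube"), anchor)
print(query(candidates, "color"))  # red
```
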
  10. Example Model of Visual Reasoning (continued)
      Compositional Language and Elementary Visual Reasoning dataset (CLEVR).
      (TbD-Net, MIT, Mascharka et al. 2018)
  11. A New Approach: Rule-Based Lingual Model with Deep Learning
  12. Present Work: Deep Learning Rule-Based Approach for Visual Reasoning
      • The method developed uses a combined CNN architecture and a rule-based approach to provide visual reasoning.
      • This method gains confidence in object-to-object relationships, and in reasoning about their interactions, as more examples are encountered.
      • This is achieved by providing a universal lingual model.
      • This model has the ability to localize to specific domains using a distributed AI model with 5G (hospital, tourist centers, retail or manufacturing facility).
  13. Architecture Modules
      1. Derive object rules from corpus: use a method similar to word embeddings; identify the attributes, rules and actions of objects.
      2. Detect objects; identify attributes, actions and functional rules: a CNN-based object detector (YOLO or other); for each detected object, identify the attributes, actions and rules from step 1, plus the spatial mapping of each object in the frame.
      3. CNN to match the actions based on rules from the corpus: train a model to establish links between objects with their actions, rules and attributes.
      4. Translate into a language representation (transformer model): train a transformer language model to create sentences based on the matched attributes, rules and actions.
      5. Create a universal lingual model with codes for actions and attributes: map actions and attributes to codes to establish a reasoning connection rather than a purely lingual one.
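
A hypothetical end-to-end skeleton of this pipeline is sketched below, with each module reduced to a stub; every function name, code value and data shape is a placeholder of mine, not the speaker's code.

```python
# Hypothetical skeleton of the five architecture modules (stubs only).

def derive_object_rules(corpus):
    """Module 1: learn attributes, actions and functional rules per object
    class from a large text corpus (word-embedding-style statistics)."""
    return {"dog": {"actions": ["run", "walk", "sit"], "attributes": ["furry", "legs"]},
            "bicycle": {"actions": ["ride", "park", "fall"], "attributes": ["two wheels", "seat"]}}

def detect_objects(frame, rules):
    """Module 2: CNN detector (e.g. YOLO) plus attribute/rule lookup and
    spatial mapping of every detected object."""
    return [{"label": "dog", "box": (40, 80, 120, 200), **rules["dog"]},
            {"label": "bicycle", "box": (150, 60, 320, 220), **rules["bicycle"]}]

def match_interactions(objects):
    """Module 3: model linking object pairs through their actions, rules
    and attributes (trivial stand-in here)."""
    return [("dog", "sit_next_to", "bicycle")]

def encode_universal(relations):
    """Module 5: map matched actions/attributes to universal reasoning codes."""
    codes = {"sit_next_to": "A17"}
    return [(s, codes[r], o) for s, r, o in relations]

def generate_sentence(coded_relations):
    """Module 4: transformer (or template) decoder from codes to language."""
    return "Dog sitting next to bicycle."

rules = derive_object_rules(corpus="...")
objs = detect_objects(frame=None, rules=rules)
print(generate_sentence(encode_universal(match_interactions(objs))))
```
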
  14. Derive Object Relationship from Language Corpus
      Examples of derived rules, actions and attributes:
      • Dog walking with human
      • Person biking on the road
      • Attributes: (bicycle: two wheels, seat, pedal)
  15. Object Detection and Attributes
      • Use YOLO (or another detector) for object detection
      • Match with derived object attributes
      • For each object create an attribute table
      • Identify the orientation of each object in relation to the others
      • Compute the relative pixel-wise distance between objects
      • Map the actions of each object vs. the others
      Example attributes: Dog: furry, legs, ears, tail. Bicycle: two wheels, pedal, seat, handle.
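
As a concrete (hypothetical) example of the spatial bookkeeping described on this slide, the snippet below builds a small attribute table and computes the relative pixel-wise distance and a coarse orientation between two detected boxes; the box coordinates are made up.

```python
# Hypothetical attribute table plus pairwise spatial relations between
# detected bounding boxes; boxes are (x1, y1, x2, y2) in pixels.
import math

detections = {
    "dog":     {"box": (40, 120, 160, 260), "attributes": ["furry", "legs", "ears", "tail"]},
    "bicycle": {"box": (180, 100, 420, 280), "attributes": ["two wheels", "pedal", "seat", "handle"]},
}

def center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def spatial_relation(box_a, box_b):
    (ax, ay), (bx, by) = center(box_a), center(box_b)
    dist = math.hypot(bx - ax, by - ay)              # relative pixel-wise distance
    side = "left of" if ax < bx else "right of"      # coarse orientation
    return dist, side

dist, side = spatial_relation(detections["dog"]["box"], detections["bicycle"]["box"])
print(f"dog is {side} bicycle, {dist:.0f}px apart")  # dog is left of bicycle, 200px apart
```
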
  16. Creating the Universal Lingual Model
      • Match with the actions and functional rules of the detected object
      • Learn functional rules for each object
      • Examples: (bird can fly; car can drive forward or backward, park or crash; dog cannot fly)
      • Derive functional rules of objects and encode with universal codes for actions
      • Map codes to each language
      Derive functional rules of objects from a large corpus, e.g.:
      Dog: run(1)/runs(1a)/is running(1b); walk/walks/walking; jump/jumps/jumping; sit/...; bark/...
      Bicycle: ride(1a)/riding(1b); parked; fall; broke; empty; rider
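
A toy illustration of what such universal action codes might look like follows; the code values and language mappings are invented for illustration and do not come from the presentation.

```python
# Invented universal code table: each object/action pair gets a
# language-independent code, and each language maps codes back to surface forms.
ACTION_CODES = {
    ("dog", "run"): "DOG.ACT.1",
    ("dog", "sit"): "DOG.ACT.4",
    ("bicycle", "ride"): "BIKE.ACT.1",
}

SURFACE_FORMS = {
    "en": {"DOG.ACT.4": ["sit", "sits", "sitting"]},
    "es": {"DOG.ACT.4": ["sentarse", "sentado"]},
}

code = ACTION_CODES[("dog", "sit")]
print(code, "->", SURFACE_FORMS["en"][code][2])  # DOG.ACT.4 -> sitting
```
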
  17. Establish Meaningful Relationships Between Objects
      • Dog action + dog functional rule + location in relation to bicycle + bicycle functional rule
      • Feed the attributes, actions and functional rules into a CNN-based system and train it to associate reasoning between objects based on the highest-probability matches
      • The more relationships the model correctly identifies, the higher the probability assigned to those associations in the future
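
One plausible way to realize the "train to associate" step, sketched below with made-up feature sizes and interaction vocabulary (not the speaker's architecture), is a small network that scores candidate interactions from concatenated object and spatial features.

```python
# Sketch of a relationship scorer: concatenate encoded attributes/actions of
# two objects with their spatial relation and predict an interaction class.
import torch
import torch.nn as nn

NUM_INTERACTIONS = 32   # e.g. "sitting next to", "riding", "chasing", ...

class RelationScorer(nn.Module):
    def __init__(self, obj_dim=64, spatial_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obj_dim + spatial_dim, 128),
            nn.ReLU(),
            nn.Linear(128, NUM_INTERACTIONS),
        )

    def forward(self, obj_a, obj_b, spatial):
        # obj_a/obj_b: encoded attributes+actions; spatial: relative geometry
        return self.net(torch.cat([obj_a, obj_b, spatial], dim=-1))

scorer = RelationScorer()
logits = scorer(torch.randn(4, 64), torch.randn(4, 64), torch.randn(4, 8))
probs = logits.softmax(dim=-1)   # confidence over candidate interactions
print(probs.shape)               # torch.Size([4, 32])
```
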
  18. System Understanding Language
      • Map the matched attributes and functions onto the universal codes
      • Translate the matched attribute and function codes into lingual sentences -> "Dog sitting next to bicycle"
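
A minimal sketch of this final decoding step is shown below, turning a matched code back into a sentence. The codes and templates are invented; in the presented system this step is performed by a trained transformer language model rather than templates.

```python
# Invented decoding step: a coded relation is rendered as an English sentence.
relation = ("dog", "DOG.ACT.4", "bicycle")      # subject, action code, object

TEMPLATES_EN = {"DOG.ACT.4": "{subj} sitting next to {obj}"}

def render(rel, templates):
    subj, code, obj = rel
    return templates[code].format(subj=subj.capitalize(), obj=obj)

print(render(relation, TEMPLATES_EN))            # Dog sitting next to bicycle
```
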
  19. Overview of Architecture
  20. Localization with 5G and Mobile Edge Compute Cloud
      • Localization: can be localized to a domain
      • Training: trained models for multiple domains can exist in the cloud or on Mobile Edge Compute (MEC)
      • Connectivity: 5G enables fast connectivity and data transfer with low latency and high compute accessibility
      • Derive high value: compute resources on a mobile autonomous system are limited; a basic robot can be connected to the trained models wherever needed and operate fully in that environment
  21. Advantages
      • Deriving rules, attributes and functional rules and mapping them to detected objects cuts down the large amount of computation needed for reasoning
      • Specialized training for high accuracy in localized environments vs. overall generic training
      • Rules can be constantly updated independently of the object detection or mapping CNN architecture
      • Encoding is done from the perspective of reasoning rather than being purely language-based
  22. Metrics and Results
      • The metric used for the final output is Consensus-based Image Description Evaluation (CIDEr) (2). It measures the similarity of a generated sentence to human-derived ground-truth sentences, showing by consensus how close the automatically generated sentences are to human-generated ones.
      • Compared to ROUGE, which uses an n-gram method to compare an automatic summarization to a human-created summary.
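
For reference, the CIDEr_n score from Vedantam et al. (2) is a TF-IDF-weighted n-gram cosine similarity between the candidate caption and each of the m reference sentences, averaged over n-gram orders (typically uniform weights w_n with N = 4):

```latex
% g^n(.) is the vector of TF-IDF weights over n-grams of order n.
\mathrm{CIDEr}_n(c_i, S_i) = \frac{1}{m} \sum_{j=1}^{m}
  \frac{g^n(c_i) \cdot g^n(s_{ij})}{\lVert g^n(c_i)\rVert \, \lVert g^n(s_{ij})\rVert},
\qquad
\mathrm{CIDEr}(c_i, S_i) = \sum_{n=1}^{N} w_n \, \mathrm{CIDEr}_n(c_i, S_i)
```
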
  23. Thank You
  24. References
      1. Mascharka, David; Tran, Philip; Soklaski, Ryan; Majumdar, Arjun. "Transparency by Design: Closing the Gap Between Performance and Interpretability in Visual Reasoning." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
      2. Vedantam, Ramakrishna; Zitnick, C. Lawrence; Parikh, Devi. "CIDEr: Consensus-based Image Description Evaluation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
