Data Con LA 2019 - State of the Art of Innovation in Computer Vision by Christian Siagian

State of the Art Innovations in
Computer Vision
Christian Siagian
DataCon LA
August 16, 2019

Presentation Structure
• 10 minutes: background to set the context
• 20 minutes: current Computer Vision topics
• 10 minutes: summary and questions

My Background
• Academic:
– Publications in Computer Vision, Robotic Vision,
Human Vision
– Beobot 2.0:
• parallel, high-performance mobile platform for
robotic vision
• full software architecture with vision-based localization and
navigation
• Startup: AIO Robotics, Inc.
– fully integrated 3D printer, scanner, editor, object
search
– 2 patents and CES Innovation Awards 2016 & 2017
• Startup: Eyenuk, Inc. (medical deep learning)
– retinal image lesion detection and segmentation
– end-to-end robotic system to automate eye
screening, monitoring, diagnosis, reporting
– Patent and Grant applications
• Competition Robotics:
– RoboCup Soccer Robot & AUVSI Autonomous
Submarine
• Teaching:
– After school robotics program, USC robotics courses
• Learning:
– Academics, sports journalism, nutrition, art, music

Artificial Intelligence
• Fields: Machine Learning (ML), Computer
Vision (CV), Natural Language Processing
(NLP), and Robotics
– They operate both digitally and in the real world
– They are combined for particular applications
• We will focus on Computer Vision and related
topics

Connection with Data Science
• Computer Vision (CV) processes raw data to be used for
data science
• Raw input data: images (regular cameras, thermal cameras,
etc.), text, audio
– These data do not have direct semantic meaning:
• They do not measure a specific (or isolated) characteristic
– We create models to understand what is in the images, etc.
• Advantage of raw data:
– General purpose/richer source of information
– Target events can be obtained by further processing later
– Less reliant on manual entry, more natural interactions (with
customers)

Connection with Data Science
• Disadvantage of raw data:
– Systems/Infrastructure (hardware & software
environment, tools) are expensive
– Models are more complex
– Data are of higher dimension, massive, and need
annotations (for learning)

Deep Learning: AlexNet 2012
• Trying to solve Object Recognition:
– Given an image (massive number of
pixels), determine the object (1
label)
• Given a labeled training dataset,
we would like to learn a function that
maps images to labels
• Data should capture variation, so the
learned mapping is invariant to:
– Appearance
– Interaction with the world
– Perspective (2D – 3D), including
size
– Occlusion
– Lighting

Deep Learning: AlexNet 2012
• Data: Caltech 101, ImageNet: 2005
– About 1,000,000 images (1,000 categories,
about 1,000 images per category)
– The number of object categories in real life
is in the thousands
• Model: 1989
– Convolutional Neural Network:
Yann LeCun, 1989: handwritten digit
recognition (as in MNIST)
– Deep network that jointly trains
both the feature extraction and
classification stages
• Systems/Infrastructure: 2010
– From video games (Sony
PlayStation): GPUs and CUDA give a 60–100
times speedup
• Blog:
https://adeshpande3.github.io/adeshpande3.github.io/A-Beginner's-Guide-To-Understanding-Convolutional-Neural-Networks/

Data-driven features within a
compositional architecture
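
As a rough illustration of one feature-extraction stage, the sketch below applies a single 3x3 filter followed by a ReLU to a tiny image, in plain Python. The filter values are hand-picked (a vertical-edge detector) purely for illustration; in a CNN such as AlexNet they would be learned from data and stacked into many layers. The helper names conv2d and relu are hypothetical and do not come from any library.

```python
def conv2d(image, kernel):
    """Valid (no padding) 2D cross-correlation, the core operation of a CNN layer."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0.0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        out.append(row)
    return out


def relu(feature_map):
    """Element-wise max(0, x) non-linearity."""
    return [[max(0.0, v) for v in row] for row in feature_map]


# Toy 4x4 image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hand-picked filter that responds to dark-to-bright vertical edges (assumption).
edge_kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = relu(conv2d(image, edge_kernel))
print(feature_map)
```

Flipping the sign of the filter makes the response negative everywhere, so the ReLU zeroes it out: the filter only "fires" on the edge polarity it encodes.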

Solving Other Computer Vision
Problems
• Data-driven features are key to moving ALMOST ALL other difficult
Computer Vision tasks forward
– Note: Basic single-image object/person/background recognition has
moved to Enterprise AI (e.g. Amazon Rekognition)
– Mature tasks, such as tracking, are available in many free libraries
(OpenCV, etc.)
• Complex algorithms hinge on: architecture & training
– Papers focus on architecture; training is tribal knowledge
• Whether the data is noisy
• Whether we need more data
• Training regimen: hyper-parameter grid, fine-tuning, multiple stages, etc.
• Visualization
• Evaluation
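
The "hyper-parameter grid" item can be sketched as an exhaustive loop over all combinations of candidate settings. This is a minimal sketch under stated assumptions: the grid values are arbitrary, and evaluate is a hypothetical placeholder that stands in for "train a model, then measure validation performance".

```python
from itertools import product

# Candidate values per hyper-parameter (illustrative assumptions).
grid = {
    "learning_rate": [0.1, 0.01, 0.001],
    "batch_size": [32, 64],
}


def evaluate(learning_rate, batch_size):
    """Placeholder score; a real version would train and validate a model."""
    return -abs(learning_rate - 0.01) - abs(batch_size - 64) / 1000.0


def grid_search(grid, score_fn):
    """Try every combination in the grid and keep the best-scoring one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


best_params, best_score = grid_search(grid, evaluate)
print(best_params, best_score)
```

The cost grows multiplicatively with each added hyper-parameter (here 3 x 2 = 6 training runs), which is one reason the training recipe tends to be expensive to rediscover.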

Solving Other Computer Vision
Problems
• Additional key concepts
in architecture:
– Adding dependencies to
the past (recurrence):
• Recurrent Neural
Network (RNN)
Long-range dependency:
“When I was in Paris I got
lost because I couldn’t
ask for directions in
_____”
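
A minimal sketch of recurrence, assuming the classic update h_t = tanh(W_xh x_t + W_hh h_(t-1) + b). The weights below are hand-set for illustration (a trained network would learn them), and rnn_step and run_rnn are hypothetical helper names. The point is that the final hidden state still depends on the first input, several steps later.

```python
import math


def rnn_step(x, h_prev, W_xh, W_hh, b):
    """One recurrent update: the new state mixes the current input and the previous state."""
    h = []
    for i in range(len(b)):
        total = b[i]
        total += sum(W_xh[i][j] * x[j] for j in range(len(x)))
        total += sum(W_hh[i][k] * h_prev[k] for k in range(len(h_prev)))
        h.append(math.tanh(total))
    return h


def run_rnn(sequence, W_xh, W_hh, b):
    """Feed a sequence one element at a time, carrying the hidden state forward."""
    h = [0.0] * len(b)
    for x in sequence:
        h = rnn_step(x, h, W_xh, W_hh, b)
    return h


# Hand-set weights (assumption): unit 0 reads the input, unit 1 remembers unit 0.
W_xh = [[1.0], [0.0]]
W_hh = [[0.0, 0.0], [1.0, 0.5]]
b = [0.0, 0.0]

with_early_signal = run_rnn([[1.0], [0.0], [0.0]], W_xh, W_hh, b)
without_signal = run_rnn([[0.0], [0.0], [0.0]], W_xh, W_hh, b)
print(with_early_signal, without_signal)
```

The trace left by the early input shrinks at every step in this toy setup, which hints at why long-range dependencies like the "Paris" example are hard for a plain recurrent update.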

Solving Other Computer Vision
Problems
• Additional key concepts in
architecture:
– Adding dependencies to the
past (recurrence):
• Recurrent Neural Network
(RNN)
– Undoing the dimensional
collapse to get more
detail:
• Fully Convolutional
Network
Used for segmentation tasks and neural
network visualization
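
To make "undoing the dimensional collapse" concrete, the sketch below upsamples a coarse 2x2 map back to 4x4. Note the swap: nearest-neighbor upsampling is used here only because it is the simplest fixed scheme; it is an illustrative stand-in, not the specific upsampling layer of any particular fully convolutional architecture. The name upsample_nearest is hypothetical.

```python
def upsample_nearest(feature_map, factor):
    """Repeat every value `factor` times along both axes."""
    out = []
    for row in feature_map:
        wide_row = [v for v in row for _ in range(factor)]
        for _ in range(factor):
            out.append(list(wide_row))
    return out


# Coarse per-cell class scores, e.g. after several downsampling stages (toy values).
coarse = [
    [0, 1],
    [1, 0],
]

dense = upsample_nearest(coarse, 2)
for row in dense:
    print(row)
```

The output has one value per original pixel position, which is what a segmentation task needs, although the detail lost during downsampling is not recovered by repetition alone.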

Solving Other Computer Vision
Problems
• Additional key concepts in
architecture:
– Adding dependencies to the past
(recurrence):
• Recurrent Neural Network (RNN)
– Undoing the dimensional collapse:
• Fully Convolutional Network
– Using multiple networks:
• Joint Learning: jointly learn inter-
related tasks
• Generative Adversarial Network
(GAN): learning using competing
networks
Joint learning can improve
performance on each individual
task
GANs are used for synthetic data
generation
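
The "competing networks" idea can be sketched through the two losses alone. This is a minimal sketch, assuming the discriminator outputs a probability that a sample is real and using the common non-saturating form of the generator loss; the networks themselves are omitted, and the function names are hypothetical.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Cross-entropy: real samples should score near 1, generated samples near 0."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake):
    """Non-saturating form: the generator is rewarded when fakes score as real."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)


# d_* are discriminator outputs in (0, 1); the values below are made up.
confident_d = discriminator_loss(d_real=[0.9, 0.9], d_fake=[0.1, 0.1])
confused_d = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
fooling_g = generator_loss(d_fake=[0.9, 0.9])
failing_g = generator_loss(d_fake=[0.1, 0.1])
print(confident_d, confused_d, fooling_g, failing_g)
```

The two objectives pull in opposite directions on the same quantity (the discriminator's score for generated samples), which is the competition the slide refers to.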

Contemporary Computer Vision
• Topics:
– Deep Learning Theory: accuracy, efficiency
– Recognition: robustness, more detail, larger context
– Reconstruction: WILL NOT DELVE DEEP INTO THIS
• 6DOF pose, clothing, hair, light, deformation, mesh, depth, joint
• GANs are advancing: the generator receives control signals at multiple layers
– https://www.youtube.com/watch?v=kSLJriaOumA
• Inputs:
– Images, Videos, 3D data, special cameras (thermal, event cameras)
– Video combined with: audio, text (language), robots
• Applications:
– Medical
– Robots: language/semantic navigation, interacting with objects

Deep Learning Theory
• Graph Neural Networks
– Relationships: objects, joints
• Few-shot, one-shot, zero-shot
learning. Weakly supervised/unsupervised
learning
• Measuring uncertainty & class
imbalance
• Active/online learning
• Open-set learning
• Architecture search
• Component analyses:
– ReLU, augmentation strategy
• Resource allocation/compression
• Stability/sensitivity/adversarial

Deep Learning Theory
• Graph Convolutional Networks [https://arxiv.org/pdf/1609.02907.pdf]
– http://openaccess.thecvf.com/content_CVPR_2019/papers/Kim_Edge-Labeling_Graph_Neural_Network_for_Few-Shot_Learning_CVPR_2019_paper.pdf
• Few-shot, one-shot, zero-shot learning. Weakly supervised/unsupervised
learning
– http://openaccess.thecvf.com/content_CVPR_2019/papers/Wang_Few-Shot_Adaptive_Faster_R-CNN_CVPR_2019_paper.pdf
• Active learning
• Measuring uncertainty & class imbalance
– http://openaccess.thecvf.com/content_CVPR_2019/papers/Khan_Striking_the_Right_Balance_With_Uncertainty_CVPR_2019_paper.pdf
• Online learning, open-set learning
• Architecture search:
– http://openaccess.thecvf.com/content_CVPR_2019/papers/Liu_Auto-DeepLab_Hierarchical_Neural_Architecture_Search_for_Semantic_Image_Segmentation_CVPR_2019_paper.pdf
• Component analyses:
– ReLU, augmentation strategy
– http://openaccess.thecvf.com/content_CVPR_2019/papers/Cubuk_AutoAugment_Learning_Augmentation_Strategies_From_Data_CVPR_2019_paper.pdf
• Resource allocation/compression:
– http://openaccess.thecvf.com/content_CVPR_2019/papers/Qiao_Neural_Rejuvenation_Improving_Deep_Network_Training_by_Enhancing_Computational_Resource_CVPR_2019_paper.pdf
• Stability/sensitivity/adversarial
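
For the Graph Convolutional Networks item, a single layer can be sketched as H' = ReLU(A_norm H W), where A_norm is the adjacency matrix with self-loops added and then symmetrically normalized by node degree, following the propagation rule in the arXiv paper linked above. The graph, features, and weights below are toy assumptions, and the helper names are hypothetical.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def normalized_adjacency(adj):
    """Add self-loops, then scale entry (i, j) by 1 / sqrt(deg_i * deg_j)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def gcn_layer(adj, features, weights):
    """One graph convolution: aggregate neighbor features, project, apply ReLU."""
    propagated = matmul(normalized_adjacency(adj), features)
    projected = matmul(propagated, weights)
    return [[max(0.0, v) for v in row] for row in projected]


# Toy graph: two nodes joined by one edge, one feature per node (assumptions).
adj = [[0, 1], [1, 0]]
features = [[1.0], [3.0]]
weights = [[1.0]]  # identity projection, so only the aggregation is visible

out = gcn_layer(adj, features, weights)
print(out)
```

With these toy values each node ends up with the average of its own feature and its neighbor's, which shows how relationships (objects, joints) enter the computation through the adjacency matrix.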

Recognition
• Image: detection, recognition,
segmentation, landmarking,
identification in the crowd/wild:
– Face, hand & body pose estimation
• Skeleton, joint localization
• Dense pose
– Panoptic segmentation, RCNN-family
• Video (person, object, background,
and combinations):
– Action recognition (1 person):
• the most active area in recognition
• Still limited to about 80 actions: the space of
actions is unknown
• Segmenting actions in the wild and handling
simultaneous multiple actions are difficult
– Social relationships (multiple people)
– Video object segmentation: Faster R-CNN,
etc. (multiple objects)

Recognition, cont.
• Visual Question
Answering (VQA): words
& image connection:
– Visual dialog
– Video Captioning
• Video and Audio:
– Audio video event
recognition
– Video enhancement:
diarization

Overarching Trends
• Datasets dictate
research activity
– The largest datasets come from
large entities (Facebook,
Google DeepMind, etc.)
– Examples:
• Cityscapes (dashboard
camera): semantic and
instance segmentation
• COCO datasets:
semantic and instance
segmentation
• Kinetics Human Action
Dataset
• Social interaction capture:
CMU
• Person Re-identification

Trends/Predictions Moving Forward
• Training on smaller manually annotated datasets catches
up in performance
– Few-shot, one-shot, zero-shot training
– Mixed use of real & synthetic data
• Grounded recognition and reconstruction (adding
more modules to solve a problem robustly):
– Image: recognition – segmentation (panoptic) – 3D object
reconstruction – space understanding
– Video: pose estimation – action recognition – action
forecasting – reconstruction
• The next superior building block should direct the field
again (following SIFT 2004, and DL features 2012)

How Do We Apply All This
Information?
• Have a working knowledge of the ML/CV
fundamentals:
– theory, software, hardware, models (CNN, RNN)
• Start with your use-case:
– Find keywords in the papers
– Search blogs for definitions and background
• Run the open-source code
– Understand the limitations
– Are they acceptable to your business?
