Data Con LA 2019 - State of the Art of Innovation in Computer Vision by Christian Siagian
1. State of the Art Innovations in
Computer Vision
Christian Siagian
DataCon LA
August 16, 2019
2. Presentation Structure
• 10 minutes of background to set the context
• 20 minutes current Computer Vision topics
• 10 minutes summary and questions
3. My Background
• Academic:
– Publications in Computer Vision, Robotic Vision,
Human Vision
– Beobot 2.0:
• parallel high-performing robotics vision mobile
platform
• full software architecture with vision localization and
navigation
• Start Up: AIO Robotics, Inc.
– fully integrated 3D printer, scanner, editor, object
search
– 2 patents and CES Innovation Awards 2016 & 2017
• Start Up: Eyenuk, Inc. Medical Deep Learning
– retinal image lesion detection and segmentation
– end-to-end robotic system to automate eye
screening, monitoring, diagnosis, reporting
– Patent and Grant applications
• Competition Robotics:
– RoboCup Soccer Robot & AUVSI Autonomous
Submarine
• Teaching:
– After school robotics program, USC robotics courses
• Learning:
– Academics, sports journalism, nutrition, art, music
4. Artificial Intelligence
• Fields: Machine Learning (ML), Computer
Vision (CV), Natural Language Processing
(NLP), and Robotics
– Digitally and in the real world
– They are connected for particular applications
• We will focus on Computer Vision and related
topics
5. Connection with Data Science
• Computer Vision (CV) processes raw data to be used for
data science
• Raw input data: images (regular cameras, thermal cameras,
etc.), text, audio
– These data do not have direct semantic meaning:
• Not measuring a specific (or isolated) characteristic
– Create models to understand what is in the images, etc.
• Advantage of raw data:
– General purpose/richer source of information
– Target events can be obtained by further processing later
– Less reliant on manual entry, more natural interactions (with
customers)
6. Connection with Data Science
• Disadvantage of raw data:
– Systems/Infrastructure (hardware & software
environment tools): are expensive
– Models: are more complex
– Data: are of higher dimension, massive, and need
data annotations (for learning)
7. Deep Learning: AlexNet 2012
• Trying to solve Object Recognition:
– Given an image (massive number of
pixels), determine the object (1
label)
• Given a labeled training dataset,
we would like to learn the mapping
function
• Data should encapsulate invariance
in the presence of:
– Appearance
– Interaction with the world
– Perspective (2D – 3D), including
size
– Occlusion
– Lighting
8. Deep Learning: AlexNet 2012
• Data: CalTech 101, ImageNet: 2005
– 1,000,000 images (1,000 categories,
1,000 images per category)
– The set of all objects in real life is in
the thousands
• Model: 1989
– Convolutional Neural Network:
Yann LeCun 1989: MNIST digit
recognition
– Deep network that jointly trains
both the feature extraction and
classification stage
• Systems/Infrastructure: 2010
– From Video games (Sony
PlayStations): GPU, CUDA: 60 – 100
times speed up
• BLOG:
https://adeshpande3.github.io/adeshpande3.github.io/A-Beginner's-Guide-To-Understanding-Convolutional-Neural-Networks/
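As a minimal sketch of the feature-extraction stage of a Convolutional Neural Network, the pure-Python example below slides a 3x3 filter over a tiny image. The filter values are hand-picked (a vertical edge detector) purely as an assumption for illustration; in a CNN these weights are learned jointly with the classifier. The names conv2d and VERTICAL_EDGE are hypothetical.

```python
def conv2d(image, kernel):
    """Valid (no padding) 2D cross-correlation, the core operation of a CNN layer."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out

# Hand-picked filter (assumption): responds to dark-to-bright vertical edges.
VERTICAL_EDGE = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

# 4x5 toy image: dark left side (0), bright right side (1).
image = [
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]

feature_map = conv2d(image, VERTICAL_EDGE)
print(feature_map)  # [[0, 3, 3], [0, 3, 3]]
```

The feature map is zero over the flat dark region and positive where the filter window straddles the edge.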
10. Solving Other Computer Vision
Problems
• Data-driven features are key to moving ALMOST ALL other difficult
Computer Vision tasks forward
– Note: Basic single image object/person/background recognition has
moved to Enterprise AI (e.g. Amazon Rekognition)
– Mature tasks, such as tracking are available in many free libraries
(OpenCV, etc.)
• Complex algorithms hinge on: architecture & training
– Papers focus on architecture; training is tribal knowledge
• Whether the data is noisy
• Whether we need more data
• Training regimen: hyper-parameter grid, fine-tuning, multiple stages, etc.
• Visualization
• Evaluation
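The hyper-parameter grid mentioned above can be sketched as an exhaustive loop over all combinations. This is a minimal illustration, not a recommended training recipe: grid_search and fake_train_and_score are hypothetical names, and the scoring function is a made-up stand-in for a real training and validation run.

```python
from itertools import product

def grid_search(train_and_score, grid):
    """Try every combination in `grid`; return (best_score, best_params)."""
    names = sorted(grid)
    best = None
    for values in product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        score = train_and_score(**params)
        if best is None or score > best[0]:
            best = (score, params)
    return best

# Stand-in for a real training run (assumption): a made-up score surface
# that peaks at learning_rate=0.01 and batch_size=32.
def fake_train_and_score(learning_rate, batch_size):
    return -abs(learning_rate - 0.01) * 100 - abs(batch_size - 32) / 32

grid = {
    "learning_rate": [0.1, 0.01, 0.001],
    "batch_size": [16, 32, 64],
}

best_score, best_params = grid_search(fake_train_and_score, grid)
print(best_params)
```

The number of runs grows multiplicatively with each added hyper-parameter, which is one reason training know-how tends to stay tribal.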
11. Solving Other Computer Vision
Problems
• Additional key concepts
in architecture:
– Adding dependencies to
the past (recurrence):
• Recurrent Neural
Network (RNN)
Long-range dependency:
“When I was in Paris I got
lost because I couldn’t
ask for directions in
_____”
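A minimal sketch of recurrence, assuming a scalar hidden state and hand-picked weights (w_h, w_x, and b are illustrative values, not learned ones). It shows how an input seen at the first step still influences the state several steps later, which is what lets a network use an early cue such as "Paris" at the end of the sentence.

```python
import math

def rnn_step(h_prev, x, w_h, w_x, b):
    """One simple recurrence: h_t = tanh(w_h * h_prev + w_x * x + b)."""
    return math.tanh(w_h * h_prev + w_x * x + b)

def run_rnn(inputs, w_h=0.9, w_x=1.0, b=0.0):
    """Feed a sequence through the recurrence, returning every hidden state."""
    h = 0.0
    states = []
    for x in inputs:
        h = rnn_step(h, x, w_h, w_x, b)
        states.append(h)
    return states

# Two sequences that differ only in the FIRST element (the early cue).
with_cue = run_rnn([1.0, 0.0, 0.0, 0.0])
without_cue = run_rnn([0.0, 0.0, 0.0, 0.0])

print(with_cue[-1] > without_cue[-1])  # True: the cue is still visible at the end
```

In this toy setting the cue's trace shrinks at every step, which hints at why long-range dependencies are hard for a plain recurrence.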
12. Solving Other Computer Vision
Problems
• Additional key concepts in
architecture:
– Adding dependencies to the
past (recurrence):
• Recurrent Neural Network
(RNN)
– Undoing the dimensional
collapse to get more
details:
• Fully Convolutional
Network
Segmentation tasks, Neural
Network visualization
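A minimal sketch of "undoing the dimensional collapse": a coarse label map is expanded back to a denser grid so that every pixel gets a label. Nearest-neighbor upsampling is used here as a simple stand-in (an assumption for illustration); segmentation networks typically use learned or interpolating upsampling layers instead. The name upsample_nearest is hypothetical.

```python
def upsample_nearest(coarse, factor):
    """Repeat each cell `factor` times along both axes."""
    out = []
    for row in coarse:
        wide = [v for v in row for _ in range(factor)]
        for _ in range(factor):
            out.append(list(wide))
    return out

# 2x2 coarse label map (0 = background, 1 = object), e.g. after downsampling.
coarse_labels = [
    [0, 1],
    [0, 0],
]

dense_labels = upsample_nearest(coarse_labels, 2)
for row in dense_labels:
    print(row)
```

The dense map has the right resolution but blocky boundaries, which is why recovering fine detail needs more than plain replication.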
13. Solving Other Computer Vision
Problems
• Additional key concepts in
architecture:
– Adding dependencies to the past
(recurrence):
• Recurrent Neural Network (RNN)
– Undoing the dimensional collapse:
• Fully Convolutional Network
– Using multiple networks:
• Joint Learning: jointly learn inter-
related tasks
• Generative Adversarial Network
(GAN): learning using competing
networks
Learning jointly can improve the
performance of each individual
task
GANs are used for synthetic data
generation
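The "competing networks" idea can be sketched with the two standard GAN objectives written as binary cross-entropy. This is a minimal illustration of the losses only, with no networks or training loop; the probability values are made-up numbers, and the generator loss uses the common non-saturating form.

```python
import math

def discriminator_loss(d_real, d_fake):
    """D wants d_real -> 1 and d_fake -> 0 (binary cross-entropy)."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def generator_loss(d_fake):
    """G wants D to score its samples as real (non-saturating form)."""
    return -math.log(d_fake)

# D's probability-of-real outputs (made-up numbers for illustration).
early = {"d_real": 0.9, "d_fake": 0.1}  # D easily spots the fakes
later = {"d_real": 0.6, "d_fake": 0.5}  # G's samples have improved

g_early = generator_loss(early["d_fake"])
g_later = generator_loss(later["d_fake"])
d_early = discriminator_loss(**early)
d_later = discriminator_loss(**later)

print(g_later < g_early, d_later > d_early)  # True True
```

As the generator improves, its loss falls while the discriminator's loss rises: the two objectives pull against each other.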
14. Contemporary Computer Vision
• Topics:
– Deep Learning Theory: accuracy, efficiency
– Recognition: robustness, more detail, larger context
– Reconstruction: WILL NOT DELVE DEEP INTO THIS
• 6DOF pose, clothing, hair, light, deformation, mesh, depth, joint
• GAN is moving forward: generates control signals at multiple layers
– https://www.youtube.com/watch?v=kSLJriaOumA
• Inputs:
– Images, Videos, 3D data, special cameras (thermal, event cameras)
– Video combined with: audio, text (language), robots
• Applications:
– Medical
– Robots: language/semantic navigation, interacting with objects
15. Deep Learning Theory
• Graph Neural Networks
– Relationships: objects, joints
• Few-shot, one-shot, zero-shot
learning. Weakly supervised and
unsupervised learning
• Measuring uncertainty & class
imbalance
• Active/online learning
• Open-set learning
• Architectural search
• Component analyses:
– ReLU, augmentation strategy
• Resource allocation/compression
• Stability/sensitivity/adversarial
16. Deep Learning Theory
• Graph Convolutional Networks [https://arxiv.org/pdf/1609.02907.pdf]
– http://openaccess.thecvf.com/content_CVPR_2019/papers/Kim_Edge-Labeling_Graph_Neural_Network_for_Few-Shot_Learning_CVPR_2019_paper.pdf
• Few-shot, one-shot, zero-shot learning. Weakly supervised and
unsupervised learning
– http://openaccess.thecvf.com/content_CVPR_2019/papers/Wang_Few-Shot_Adaptive_Faster_R-CNN_CVPR_2019_paper.pdf
• Active Learning
• Measuring uncertainty & class imbalance
– http://openaccess.thecvf.com/content_CVPR_2019/papers/Khan_Striking_the_Right_Balance_With_Uncertainty_CVPR_2019_paper.pdf
• Online learning, open-set learning
• Architectural search:
– http://openaccess.thecvf.com/content_CVPR_2019/papers/Liu_Auto-DeepLab_Hierarchical_Neural_Architecture_Search_for_Semantic_Image_Segmentation_CVPR_2019_paper.pdf
• Component analyses:
– ReLU, augmentation strategy
– http://openaccess.thecvf.com/content_CVPR_2019/papers/Cubuk_AutoAugment_Learning_Augmentation_Strategies_From_Data_CVPR_2019_paper.pdf
• Resource allocation/compression:
– http://openaccess.thecvf.com/content_CVPR_2019/papers/Qiao_Neural_Rejuvenation_Improving_Deep_Network_Training_by_Enhancing_Computational_Resource_CVPR_2019_paper.pdf
• Stability/sensitivity/adversarial
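Two generic building blocks behind "uncertainty & class imbalance" can be sketched as follows: the entropy of a predicted class distribution as an uncertainty score, and inverse-frequency class weights for an imbalanced label set. This is a minimal sketch of common techniques, not the method of the paper linked above; the function names and the "healthy"/"lesion" labels are hypothetical.

```python
import math
from collections import Counter

def predictive_entropy(probs):
    """Shannon entropy (in nats) of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def inverse_frequency_weights(labels):
    """Weight each class by total / (num_classes * count): rare classes count more."""
    counts = Counter(labels)
    total = len(labels)
    return {c: total / (len(counts) * n) for c, n in counts.items()}

confident = predictive_entropy([0.98, 0.01, 0.01])
uncertain = predictive_entropy([0.34, 0.33, 0.33])

# Toy imbalanced label set: 8 "healthy" vs 2 "lesion" (hypothetical classes).
weights = inverse_frequency_weights(["healthy"] * 8 + ["lesion"] * 2)

print(confident < uncertain)  # True
print(weights)                # {'healthy': 0.625, 'lesion': 2.5}
```

High-entropy predictions are natural candidates for human review or for active learning, and the class weights can scale a per-sample loss so that rare classes are not ignored.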
17. Recognition
• Image: detection, recognition,
segmentation, landmarking,
identification in the crowd/wild:
– Face, hand & body pose estimation
• Skeleton, joint localization
• Dense pose
– Panoptic segmentation, RCNN-family
• Video: (person, object, background,
and combination):
– Action Recognition (1 person):
• the most active area in recognition
• Still at around 80 actions: the full space of
actions is unknown
• Segmenting actions in the wild and handling
multiple simultaneous actions are difficult
– Social relationships (multiple people)
– Video object segmentation: Faster R-CNN,
etc. (multiple objects)
– Surveillance: tracking & Re-identification
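Detection, segmentation, and tracking results are commonly scored by how well a predicted region overlaps the annotated one. A minimal sketch of Intersection over Union (IoU) for axis-aligned boxes follows; the (x1, y1, x2, y2) box layout and the example coordinates are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so non-overlapping boxes give an empty intersection.
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

predicted = (0, 0, 10, 10)
ground_truth = (5, 0, 15, 10)
score = iou(predicted, ground_truth)
print(round(score, 3))  # 0.333
```

A tracker can apply the same overlap score between consecutive frames to decide which detection continues which track.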
18. Recognition, cont.
• Visual Question
Answering (VQA): words
& image connection:
– Visual dialog
– Video Captioning
• Video and Audio:
– Audio video event
recognition
– Video enhancement:
diarization
19. Overarching Trends
• Datasets dictate
research activity
– Largest datasets are from
large entities (Facebook,
Google DeepMind, etc.)
– Examples:
• Cityscapes: Dashboard
Cam: Segmentation:
semantic, instance
• COCO datasets:
Segmentation: semantic,
instance
• Kinetics Human Action
Dataset
• Social interaction capture:
CMU
• Person Re-identification
20. Trends/Predictions Moving Forward
• Training on smaller manually annotated datasets catches
up in performance
– Few-, one-, and zero-shot training
– Mixed use of real & synthetic data
• Grounded recognition and reconstruction (adding
more modules to solve a problem robustly):
– Image: recognition – segmentation (panoptic) – 3D object
reconstruction – space understanding
– Video: pose estimation – action recognition – action
forecasting – reconstruction
• The next superior building block should direct the field
again (following SIFT 2004, and DL features 2012)
21. How Do We Apply All This
Information?
• Have a working knowledge of the ML/CV
fundamentals:
– theory, software, hardware, models (CNN, RNN)
• Start with your use-case:
– find keywords in the papers
– search blogs for definitions, background
• Run the open-source code
– Understand the limitations
– Are they acceptable to your business?
Editor's Notes
Evaluation - Sports Recruiting
Self Improvements
Robotics, Computer Vision, ML, AI,
Robotics is sensor-driven & uses Bayesian models
The future is in this field
Flight cameras,
Won’t talk a lot on Infrastructure,
Edges, texture, more complex textures, objects
creating new item from distribution
Training
Training can be difficult
GAN paper: http://openaccess.thecvf.com/content_CVPR_2019/papers/Karras_A_Style-Based_Generator_Architecture_for_Generative_Adversarial_Networks_CVPR_2019_paper.pdf
Animation from Single image
Robotics: grasping
http://openaccess.thecvf.com/content_CVPR_2019/papers/Huang_Neural_Task_Graphs_Generalizing_to_Unseen_Tasks_From_a_Single_CVPR_2019_paper.pdf
“Three Strong Accept” paper: semantic navigation: in the kitchen
Interacting with people
http://openaccess.thecvf.com/content_CVPR_2019/papers/Wortsman_Learning_to_Learn_How_to_Learn_Self-Adaptive_Visual_Navigation_Using_CVPR_2019_paper.pdf
Architectural search:
Network comprised of spatial computation & within-layer computation
Scaling policies:
http://openaccess.thecvf.com/content_CVPR_2019/papers/Wang_ELASTIC_Improving_CNNs_With_Dynamic_Scaling_Policies_CVPR_2019_paper.pdf
http://ikea.csail.mit.edu/
Pose estimation is moving forward with dense pose
http://densepose.org/
https://github.com/facebookresearch/DensePose
http://openaccess.thecvf.com/content_CVPR_2019/papers/Guler_HoloPose_Holistic_3D_Human_Reconstruction_In-The-Wild_CVPR_2019_paper.pdf
Pose estimation: Hand & Pose
http://openaccess.thecvf.com/content_CVPR_2019/papers/Ge_3D_Hand_Shape_and_Pose_Estimation_From_a_Single_RGB_CVPR_2019_paper.pdf
http://openaccess.thecvf.com/content_CVPR_2019/papers/Pavllo_3D_Human_Pose_Estimation_in_Video_With_Temporal_Convolutions_and_CVPR_2019_paper.pdf
Mask-R-CNN
http://openaccess.thecvf.com/content_CVPR_2019/papers/Huang_Mask_Scoring_R-CNN_CVPR_2019_paper.pdf
Panoptic Segmentation [https://arxiv.org/pdf/1801.00868.pdf]
Action recognition:
Flow Representation
http://openaccess.thecvf.com/content_CVPR_2019/papers/Piergiovanni_Representation_Flow_for_Action_Recognition_CVPR_2019_paper.pdf
Video Salient object detection
http://openaccess.thecvf.com/content_CVPR_2019/papers/Fan_Shifting_More_Attention_to_Video_Salient_Object_Detection_CVPR_2019_paper.pdf
Object Relationship:
http://openaccess.thecvf.com/content_CVPR_2019/papers/Zhan_On_Exploring_Undetermined_Relationships_for_Visual_Relationship_Detection_CVPR_2019_paper.pdf
Video Classification:
http://openaccess.thecvf.com/content_CVPR_2019/papers/Bhardwaj_Efficient_Video_Classification_Using_Fewer_Frames_CVPR_2019_paper.pdf
Relationship:
http://openaccess.thecvf.com/content_CVPR_2019/papers/Sun_Relational_Action_Forecasting_CVPR_2019_paper.pdf
Performance/Action Quality
http://openaccess.thecvf.com/content_CVPR_2019/papers/Parmar_What_and_How_Well_You_Performed_A_Multitask_Learning_Approach_CVPR_2019_paper.pdf
http://openaccess.thecvf.com/content_CVPR_2019/papers/Doughty_The_Pros_and_Cons_Rank-Aware_Temporal_Attention_for_Skill_Determination_CVPR_2019_paper.pdf
Larger context in visual reasoning from language:
https://arxiv.org/abs/1906.08237
https://github.com/zihangdai/xlnet
http://openaccess.thecvf.com/content_CVPR_2019/papers/Hudson_GQA_A_New_Dataset_for_Real-World_Visual_Reasoning_and_Compositional_CVPR_2019_paper.pdf
Cityscapes Dataset:
https://www.cityscapes-dataset.com/
COCO datasets:
http://cocodataset.org
Kinetics Human Action Dataset:
https://deepmind.com/research/open-source/kinetics
Panoptic Studio Dataset:
https://www.cs.cmu.edu/~hanbyulj/panoptic-studio/
Person Re-identification Dataset:
https://amberer.gitlab.io/papers_in_ai/person-reid.html
Reconstruction: mechanistic understanding
Vs. Recognition: discriminative, deeper understanding