Artificial Intelligence for Vision:
A walkthrough of recent breakthroughs
February 2024
Nikolas Markou
Electi Consulting
Nikolas Markou
● Head of AI / Partner @ Electi Consulting
● Consultant for various AI startups
● Senior AI Engineer, Autonomous Cars (NavInfo
Europe)
● Principal Engineer, Yodigram, Yodiwo
● SW Engineer, Intelligence / Drone / Geolocation
projects, Verint
● 2010-2011 - MSc, Imperial College London
● 2005-2010 - BEng, MEng, University of Patras
Introduction
Welcome to a journey through time and
pixels.
Today, we embark on a quest to uncover
the fascinating evolution of computer
vision, from humble beginnings to the
cutting-edge marvels of Vision
Transformers.
In just a few minutes I will try to give you a
historical overview of computer vision and
the bright future ahead.
Computer Vision
Computer vision is a field within artificial
intelligence focused on: enabling
machines to interpret and understand
visual information from the real world.
It encompasses tasks such as:
● image recognition,
● object detection,
● and scene understanding,
aiming to replicate human-like vision
capabilities using algorithms and
computational techniques.
The Dawn of Computer Vision 1960-1970
Researchers explore the idea of
teaching computers to "see" and
interpret visual data.
Early Challenges: Limited
computational power and lack of
data hinder progress.
Milestones: Development of
foundational concepts like edge
detection and pattern recognition.
The pioneer / Yann LeCun
LeNet, introduced by Yann LeCun in 1998,
was a pioneering convolutional neural
network (CNN) architecture.
Its compact design and hierarchical feature
extraction revolutionized handwritten digit
recognition, marking a pivotal moment in
deep learning history.
LeNet's breakthrough paved the way for
modern CNNs, igniting the era of deep
learning-based computer vision.
The pioneer / LeNet
What is Convolution ?
Convolution refers to a mathematical
operation applied to input data, typically
images, using a convolutional filter or kernel.
This operation involves sliding the filter over
the input data and computing the dot product
between the filter and the overlapping
portions of the input.
Convolutional layers in deep learning models
use this operation to extract meaningful
features from the input data.
What is Convolution ?
The convolution operation is characterized by parameters such as the size of the filter, the
stride (step size) of the filter as it moves across the input, and padding to control the spatial
dimensions of the output.
Through training, CNNs learn to adjust the weights of the convolutional filters to extract
relevant features from the input data, enabling the model to make accurate predictions or
classifications based on the learned representations.
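As a minimal sketch of the operation described above (not code from the talk), the NumPy function below slides a filter over a 2D input with a configurable stride and zero padding. The names `conv2d`, `image` and `edge_kernel` are illustrative assumptions. As in most deep learning libraries, the filter is not flipped, so strictly speaking this computes a cross-correlation.

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Slide `kernel` over `image` and take the dot product at each position."""
    if padding > 0:
        # Zero padding on all four sides controls the output size.
        image = np.pad(image, padding, mode="constant")
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            patch = image[top:top + kh, left:left + kw]
            out[i, j] = np.sum(patch * kernel)  # dot product of filter and patch
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge_kernel = np.array([[-1.0, 0.0, 1.0]] * 3)  # hand-written horizontal-gradient filter

same = conv2d(image, edge_kernel, stride=1, padding=1)     # output stays 5x5
strided = conv2d(image, edge_kernel, stride=2, padding=0)  # output shrinks to 2x2
```

With a 3x3 filter, a padding of 1 preserves the 5x5 input size, while a stride of 2 without padding shrinks the output to 2x2. In a CNN the filter values are learned rather than written by hand.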
What does convolution learn ?
The kernels or filters learn (from the training dataset) abstractions of increasing
complexity, so stacked convolutions (in a specific setup) activate on different kinds of features.
Innovation struggles through the 2000s
Following the AI winter of the 1980s, neural networks fell out of favor, leading to a reliance on
handcrafted features and classical techniques like support vector machines (SVMs) and
decision trees in computer vision.
Between the debut of LeNet in 1998 and the breakthrough of AlexNet in 2012, deep learning
and computer vision navigated through a challenging landscape shaped by the AI winter and
the dominance of traditional computer vision methods.
Despite progress in applications such as facial recognition and optical character recognition,
these methods faced limitations in scalability and generalization.
The big break
In 2012 we got the big break with AlexNet, the
first “deep” neural network.
AlexNet competed in the ImageNet Large
Scale Visual Recognition Challenge on
September 30, 2012.
The network achieved a top-5 error of
15.3%, more than 10.8 percentage points
lower than that of the runner up.
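For readers unfamiliar with the metric: a prediction counts as correct under top-5 error if the true class is among the five highest-scoring classes. The sketch below is an illustrative NumPy version with made-up scores, not the official ImageNet evaluation code; `top_k_error`, `scores` and `labels` are hypothetical names.

```python
import numpy as np

def top_k_error(scores, labels, k=5):
    """Fraction of samples whose true label is NOT among the k highest scores."""
    top_k = np.argsort(scores, axis=1)[:, -k:]        # indices of the k best classes
    hits = (top_k == labels[:, None]).any(axis=1)     # is the true label among them?
    return 1.0 - hits.mean()

# Two samples, six classes (toy numbers, purely for illustration).
scores = np.array([
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    [0.6, 0.5, 0.4, 0.3, 0.2, 0.1],
])
labels = np.array([1, 0])

top5_error = top_k_error(scores, labels, k=5)
top1_error = top_k_error(scores, labels, k=1)
```

In this toy case the first sample is wrong under top-1 but correct under top-5, which is why top-5 error is always less than or equal to top-1 error.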
The big break / AlexNet
The big break / AlexNet
The big break / AlexNet
The persistence and innovation of
researchers / engineers / tinkerers set the
stage for the pivotal moment in 2012 when
AlexNet showcased the transformative power
of deep CNNs on the ImageNet Large Scale
Visual Recognition Challenge.
This resurgence marked a turning point,
reigniting widespread interest in deep
learning for computer vision and catalyzing a
new era of rapid advancement and
innovation in the field.
The big break / AlexNet
● Extensive Data: Trained on ImageNet's vast dataset of 1.2 million images across 1,000
classes, facilitating diverse feature learning.
● Deep Architecture: Featuring eight layers, including five convolutional and three fully
connected layers, enabling hierarchical feature extraction.
● Convolutional Layers: Utilizing spatially invariant features crucial for image
classification via convolutional and pooling layers.
● ReLU Activation: Employing ReLU activation, accelerating learning and mitigating
vanishing gradient issues.
● Data Augmentation: Employing techniques like cropping, flipping, and scaling to
enhance training data and combat overfitting.
● Dropout Regularization: Implementing dropout to randomly deactivate neurons during
training, preventing overfitting and enhancing generalization.
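Three of the ingredients listed above fit in a few lines of NumPy. This is a simplified sketch under stated assumptions, not the AlexNet implementation: pooling here uses non-overlapping 2x2 windows (the original network used overlapping windows), and dropout is written in the common "inverted" form that rescales at training time. All function names are illustrative.

```python
import numpy as np

def relu(x):
    """Keep positive values, zero out the rest."""
    return np.maximum(x, 0.0)

def max_pool2d(x, size=2):
    """Non-overlapping max pooling over a 2D feature map."""
    h, w = x.shape
    h, w = h - h % size, w - w % size                 # crop to a multiple of `size`
    blocks = x[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero units with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

feature_map = np.array([
    [1.0, -2.0, 3.0, -4.0],
    [5.0, 6.0, -7.0, 8.0],
    [-1.0, 0.0, 2.0, 1.0],
    [3.0, -3.0, 4.0, -5.0],
])
pooled = max_pool2d(relu(feature_map))  # 4x4 -> 2x2

rng = np.random.default_rng(0)
train_out = dropout(np.ones((100, 100)), rate=0.5, rng=rng)
eval_out = dropout(np.ones((100, 100)), rate=0.5, rng=rng, training=False)
```

At inference time dropout is switched off, so `eval_out` is identical to its input.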
The big break / AlexNet / Parameters
The big break / AlexNet / MaxPooling
The big break / AlexNet / ReLU activation
The big break / AlexNet / Dropout
The CNN revolution
Between 2012 and ~2019
there was a mad rush to
release the best CNN
model, hitting new top-1
accuracy on ImageNet.
The CNN revolution
The holy grail is a family of
architectures that scales well with
the number of parameters,
i.e. it performs better as we
increase the number of
parameters.
In practice, every architecture
exhibits asymptotic behavior: beyond
some size it stops improving
and the computational demands
become too large.
The CNN revolution / VGG
The CNN revolution / ResNet
The CNN revolution / UNet
● U-shaped Architecture:
○ U-Net features a distinctive "U"-shaped design, consisting of
both a contracting and an expanding path.
○ The contracting path employs convolutional and pooling layers
to decrease spatial resolution and increase feature maps.
○ In contrast, the expanding path utilizes upsampling and
convolutional layers to progressively enhance spatial
resolution.
● Skip Connections:
○ U-Net incorporates skip connections between the contracting
and expanding paths to retain information from earlier layers.
○ Similar to ResNets, these connections mitigate the vanishing
gradient problem in deep networks.
○ They facilitate accurate segmentation by enabling the network
to leverage information from various scales.
● Multi-scale Feature Maps:
○ Leveraging multi-scale feature maps, U-Net captures
information at different abstraction levels.
○ The contracting path reduces input resolution while increasing
feature maps, while the expanding path upscales resolution.
○ Skip connections merge information from diverse scales,
enhancing segmentation accuracy and overall performance.
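The shape bookkeeping behind the contracting path, the expanding path and a skip connection can be sketched without any learned weights. This is only an illustration of the data flow: a real U-Net applies convolutions at every level and changes the channel count, which is omitted here. The helper names and tensor sizes are assumptions.

```python
import numpy as np

def downsample(x):
    """Halve spatial resolution with 2x2 max pooling; x has shape (C, H, W)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def upsample(x):
    """Double spatial resolution by nearest-neighbour repetition."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

rng = np.random.default_rng(0)
encoder_features = rng.normal(size=(8, 32, 32))  # contracting-path features
bottleneck = downsample(encoder_features)        # (8, 16, 16): coarser scale
decoder_features = upsample(bottleneck)          # (8, 32, 32): back to full size

# Skip connection: concatenate fine encoder detail with upsampled coarse features.
merged = np.concatenate([encoder_features, decoder_features], axis=0)  # (16, 32, 32)
```

The concatenation is what lets the expanding path see both the coarse context and the fine detail that pooling discarded.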
The CNN revolution / Inception
The CNN revolution / EfficientNet
The CNN revolution / EfficientNet
The CNN revolution / Activations
Different activations and their
gradients promote different
behaviours:
● ReLU
● Sigmoid
● Tanh
● SELU
● ELU
● GELU
● Swish
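For reference, the activations listed above can each be written in a line or two of NumPy. This is a sketch using the commonly published formulas: GELU is given in its tanh approximation, and the SELU constants are the standard ones from the self-normalizing networks paper.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def elu(x, alpha=1.0):
    # exp is only evaluated on the non-positive part to avoid overflow.
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

def selu(x):
    alpha, scale = 1.6732632423543772, 1.0507009873554805
    return scale * elu(x, alpha)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def swish(x):
    return x * sigmoid(x)

x = np.linspace(-3.0, 3.0, 7)  # sample points, including 0
outputs = {f.__name__: f(x) for f in (relu, sigmoid, tanh, elu, selu, gelu, swish)}
```

Plotting `outputs` against `x` shows the main differences: ReLU is exactly zero for negative inputs, while ELU, SELU, GELU and Swish let a small negative signal through.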
The CNN revolution / Normalizations
Different normalizations and
their gradients allow faster
convergence and deeper
models:
● Batch Normalization
● Layer Normalization
● Group Normalization
● Instance Normalization
● Many more
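The four normalizations named above differ mainly in which axes the mean and variance are computed over. The NumPy sketch below makes that explicit for an input of shape (N, C, H, W). It is a simplification: the learnable scale and shift parameters and the running statistics used at inference are left out, and the function names are illustrative.

```python
import numpy as np

def _normalize(x, axes, eps=1e-5):
    """Subtract the mean and divide by the standard deviation over `axes`."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm(x):
    return _normalize(x, (0, 2, 3))   # per channel, across the batch

def layer_norm(x):
    return _normalize(x, (1, 2, 3))   # per sample, across all channels

def instance_norm(x):
    return _normalize(x, (2, 3))      # per sample and per channel

def group_norm(x, groups):
    n, c, h, w = x.shape
    grouped = x.reshape(n, groups, c // groups, h, w)
    return _normalize(grouped, (2, 3, 4)).reshape(n, c, h, w)  # per sample and group

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 6, 5, 5))  # (N, C, H, W)
bn, ln = batch_norm(x), layer_norm(x)
```

Group normalization sits between the other two per-sample variants: with one group it matches layer normalization, and with one group per channel it matches instance normalization.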
Downstream Tasks
Downstream computer vision tasks
refer to specific applications or
problems solved using pre-trained
models or transfer learning.
They enable the adaptation of existing
models to new tasks, reducing the
need for extensive data and
computational resources.
Typical examples are object detection, image segmentation,
facial recognition, and scene
understanding.
Downstream Tasks / Object Detection
Object detection involves
identifying and locating objects
within an image or video.
Challenges: Scale variation,
occlusion, and object deformation.
Transfer Learning: Pre-trained
models like Faster R-CNN, YOLO,
or SSD can be fine-tuned on
custom datasets for specific object
detection tasks.
Downstream Tasks / Image Segmentation
Image segmentation partitions an image
into multiple segments or regions based
on pixel similarity.
Applications: Medical imaging,
autonomous driving, and image editing.
Transfer Learning: Models such as U-Net
and Mask R-CNN, pre-trained on large
datasets like COCO or Pascal VOC, can
be adapted to segment specific objects or
classes.
Downstream Tasks / Facial Recognition
Facial recognition identifies and verifies
individuals based on facial features.
Use Cases: Security systems, access
control, and personalized user
experiences.
Transfer Learning: Pre-trained models like
VGG-Face or FaceNet can be fine-tuned
on custom datasets to recognize specific
individuals or facial expressions.
Downstream Tasks / Facial Recognition
Downstream Tasks / SLAM
Simultaneous localization and
mapping attempts to make a
robot or other autonomous
vehicle map an unfamiliar area
while, at the same time,
determining where within that
area the robot itself is located.
Downstream Tasks / Depth Estimation
Downstream Tasks / Human Pose Estimation
Downstream Tasks / Scene Understanding
Scene understanding involves analyzing
and comprehending the content and
context of an entire scene or image.
Applications: Autonomous navigation,
augmented reality, and content
understanding.
Transfer Learning: Models such as
ResNet or Inception, pre-trained on
large-scale image classification tasks, can
be utilized for scene understanding tasks
by fine-tuning on relevant datasets.
Enter the Transformers
Just when we thought CNNs had
reached their peak, along comes a
disruptor.
“Attention is all you need” is the 2017
paper that changed NLP forever.
It introduced the Transformer
architecture and a novel attention
mechanism.
This was a very big divergence off the
usual way of doing NLP.
Enter the Transformers
Transformers did not become an overnight success; GPT and BERT were what made them
immensely popular. Here is a timeline of events:
● Attention is all you need: 2017
● ELMo (LSTM-based): 2018
● ULMFiT (LSTM-based): 2018
● GPT (Transformer-based): 2018
● BERT (Transformer-based): 2018
● Transformers revolutionizing the world of NLP, Speech, and Vision: 2018
onwards
Enter the Transformers
Transformers revolutionized NLP: they could
scale immensely, and soon we reached the
era of Large Language Models (LLMs).
Models like GPT-2 and GPT-3 have up to 175
billion parameters and were trained on
hundreds of billions of tokens.
These models have been trained on
a good portion of all human textual
knowledge.
This innovation sparked the latest AI
innovation explosion.
Enter the Transformers / Encoder
The Transformer's encoder transforms input
sequences into machine-readable representations
by capturing word similarities and positions. It
utilizes input embeddings and positional encoding
to prepare the sequence for processing.
Stacked encoder layers analyze relationships
between words through multi-head attention blocks.
This understanding is enhanced by residual
connections and layer normalizations. A
feed-forward network further refines the sequence.
The encoded knowledge is then passed to the
decoder for generating the final output sequence.
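The positional encoding step can be illustrated with the fixed sinusoidal scheme from the original Transformer paper, where even dimensions use a sine and odd dimensions a cosine of the position at different frequencies. The sketch below assumes an even `d_model`; the sequence length, width and the random stand-in for token embeddings are arbitrary choices for illustration.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding of shape (seq_len, d_model); d_model must be even."""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # even dimension indices
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions
    return pe

rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(10, 16))           # stand-in for learned embeddings
pe = positional_encoding(seq_len=10, d_model=16)
encoder_input = token_embeddings + pe                  # what the first encoder layer sees
```

Because the encoding is simply added to the embeddings, identical tokens at different positions reach the encoder as different vectors.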
Enter the Transformers / Decoder
The decoder utilizes knowledge from the
encoder. In the first prediction cycle, it
starts with a "start of sentence" token.
Similar to the encoder, the decoder layer
analyzes encoder information and previous
predictions.
Initially, it relies on prediction knowledge
alone, then combines it with encoder output
for further analysis.
The output is the probability of the next
word in the sequence.
Enter the Transformers / Attention
Enter the Transformers / Attention
The Transformer uses "scaled dot product attention," a self-attention mechanism that assesses token
dependencies within a sequence.
Unlike global attention, which considers each word's importance relative to the entire sequence,
self-attention examines relationships between tokens.
For instance, in the sentence "I went to the store and bought tons of fruits along with some furniture.
They tasted amazing," self-attention can learn that "they" refers to "fruits," not "furniture."
This contrasts with global attention, which might assign higher importance to unrelated words without
understanding their connections. Self-attention's comprehensive analysis of word interactions enables
accurate interpretation of meaning.
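Scaled dot-product self-attention is compact enough to write out directly. The sketch below is a single-head NumPy version with random, untrained projection matrices; the names `w_q`, `w_k`, `w_v` and the sizes are assumptions for illustration. Each token is compared with every other token, the scores are scaled by the square root of the key dimension, and a softmax turns them into mixing weights.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """Return the attended values and the (tokens x tokens) attention weights."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)           # similarity of every query with every key
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))              # 5 tokens with 8-dimensional embeddings
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))

# Self-attention: queries, keys and values all come from the same sequence.
output, weights = scaled_dot_product_attention(tokens @ w_q, tokens @ w_k, tokens @ w_v)
```

Row `i` of `weights` shows how much token `i` draws on every other token. In a trained model, this is where a pronoun would place most of its weight on the noun it refers to.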
Enter the Transformers
Enter the Transformers / Pretraining
Pre-training is essential for transformer models due to their vast parameter space
and complexity.
Pre-training on large datasets allows models to learn general features and
representations, facilitating transfer learning to downstream tasks with smaller
datasets.
This initial training phase enables transformers to capture intricate patterns and
relationships in data, leading to improved performance and faster convergence
during fine-tuning.
Additionally, pre-training mitigates the risk of overfitting on limited task-specific
data by providing a solid foundation of knowledge.
Transformers with a twist: Vision-Transformers (ViT)
It was only a matter of time before the
Transformer architecture took over the
computer vision domain.
In the seminal paper:
“An Image is Worth 16x16 Words:
Transformers for Image Recognition at
Scale”, the vision transformer was
introduced and changed everything
again.
Transformers with a twist: Vision-Transformers (ViT)
The Vision Transformer (ViT) model takes a
sequence of flattened 2D patches derived from an
image as input.
The image, denoted as x, with pixels in the [0, 255]
range and in the dimension of (H×W×C), is
reshaped into a sequence of patches.
Each patch is flattened into a vector and
mapped to the model dimension using a
trainable linear projection.
The outputs of this projection are referred to as
the patch embeddings.
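The reshaping of an image into patches and the linear projection can be sketched in NumPy as follows. The 224x224 input, 16x16 patches and embedding width of 192 are illustrative assumptions, and the projection matrix is random here rather than trained.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    p = patch_size
    patches = image.reshape(h // p, p, w // p, p, c)
    patches = patches.transpose(0, 2, 1, 3, 4)         # (rows, cols, p, p, c)
    return patches.reshape(-1, p * p * c)              # (num_patches, p*p*c)

rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(224, 224, 3))      # values in [0, 255]
image = pixels.astype(np.float32) / 255.0

patches = image_to_patches(image, patch_size=16)       # (196, 768)
projection = rng.normal(scale=0.02, size=(768, 192))   # stand-in for the trainable projection
patch_embeddings = patches @ projection                # (196, 192)
```

A 224x224 image with 16x16 patches yields 14 x 14 = 196 patches, so the Transformer sees a sequence of 196 tokens, much like a 196-word sentence.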
Transformers with a twist: Vision-Transformers (ViT)
ViT treats images as sequences of patches, employing a
Transformer encoder like those used in NLP.
Despite its simplicity, this approach, when paired with
pre-training on large datasets, proves remarkably
effective.
ViT rivals or surpasses the state of the art
on numerous image classification tasks, while requiring
less compute to pre-train than comparable CNNs.
Its self-attention mechanism enables information
integration across the entire image, even in lower layers,
representing a significant advantage.
Transformers with a twist: Vision-Transformers (ViT)
The Vision Transformer has several variants, each with different sizes and
configurations:
Transformers with a twist: Vision-Transformers (ViT)
Transformers with a twist: Vision-Transformers (ViT)
Transformers sparked a new generation of models that borrow earlier
innovations such as multiscale designs, a small number of convolutions, different types of
activations, normalizations, loss functions, pretraining tasks and types of attention.
Transformers with a twist: Vision-Transformers (ViT)
Multiscale Vision Transformers (MViT) leverage
the idea of combining multi-scale feature
hierarchies with vision transformer models. In
practice, starting from the initial image size with
3 channels, the authors gradually expand
(hierarchically) the channel capacity while
reducing the spatial resolution.
As a result, a multiscale pyramid of features is
created. Intuitively, early layers operate at
high spatial resolution and learn simple low-level visual
information, while deeper layers are responsible
for complex, high-dimensional features.
Transformers with a twist: Vision-Transformers (ViT)
The pivotal shift in Transformers lies in their
integration of images as a distinct
'language,' introducing a profound modality.
This breakthrough enables the fusion of
diverse modalities, creating a frontier in AI
where mapping between realms holds
sway.
With images now in their arsenal,
Transformers redefine boundaries of
comprehension, hinting at new levels of
creativity and understanding.
CNNs fight back
Despite ViT-based models achieving
state-of-the-art performance in many
vision-related tasks, convolutional
neural networks were not abandoned
and experienced a resurgence with the
ConvNeXt variant.
CNNs fight back
ConvNeXt and ConvNeXt V2 came out in
2022 and 2023.
They sparked a renewed interest in
convolutional neural networks.
CNNs and attention
Since Transformers came out there has
been a strong push for adding attention
mechanisms to CNNs as well:
● Squeeze-and-Excitation
● Additive attention
● Self-Attention
● CBAM
● and many more
Current SOTA: Vision-Language Models (VLMs)
The paradigm Pre-training, Fine-tuning
and Prediction has demonstrated great
effectiveness in a wide range of visual
recognition tasks.
Under this paradigm, a DNN model
is first pre-trained on
off-the-shelf large-scale training data,
either annotated or unannotated, and
the pre-trained model is then fine-tuned
with task-specific annotated training
data.
Current SOTA: Vision-Language Models (VLMs)
A new deep learning paradigm named Vision-Language Model Pre-training and Zero-shot
Prediction has attracted increasing attention recently.
In this paradigm, a vision-language model (VLM) is pre-trained with large-scale image-text
pairs that are almost infinitely available on the internet, and the pre-trained VLM can be
directly applied to downstream visual recognition tasks without fine-tuning.
Current SOTA: Vision-Language Models (VLMs)
Zero-shot prediction directly applies pre-trained VLMs to downstream tasks without any task-specific fine-tuning.
Image Classification aims to classify images into predefined categories. VLMs achieve zero-shot image
classification by comparing the embeddings of images and texts, where “prompt engineering” is often employed to
generate task-related prompts like “a photo of a [label].”
Semantic Segmentation aims to assign a category label to each pixel in images. Pre-trained VLMs achieve
zero-shot prediction for segmentation tasks by comparing the embeddings of the given image pixels and texts.
Object Detection aims to localize and classify objects in images, which is important for various vision applications.
With the object locating ability learned from auxiliary datasets, pre-trained VLMs achieve zero-shot prediction for
object detection tasks by comparing the embeddings of the given object proposals and texts.
Image-Text Retrieval aims to retrieve the demanded samples from one modality given the cues from another
modality, which consists of two tasks, i.e., text-to-image retrieval that retrieves images based on texts and
image-to-text retrieval that retrieves texts based on images.
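Zero-shot image classification by comparing embeddings can be sketched as below. No real VLM is loaded: the random vectors stand in for the outputs of hypothetical image and text encoders, and the `temperature` value is an arbitrary assumption. Only the comparison step (cosine similarity followed by a softmax over the prompts) is shown.

```python
import numpy as np

def zero_shot_classify(image_embedding, text_embeddings, temperature=0.01):
    """Softmax over cosine similarities between one image and several text prompts."""
    img = image_embedding / np.linalg.norm(image_embedding)
    txt = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    logits = txt @ img / temperature          # one cosine similarity per prompt
    logits = logits - logits.max()            # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

labels = ["cat", "dog", "car"]
prompts = [f"a photo of a {label}" for label in labels]   # prompt template from the slide

# Stand-ins for encoder outputs: one 64-d vector per prompt, and an image
# embedding that lies close to the "dog" prompt.
rng = np.random.default_rng(0)
text_embeddings = rng.normal(size=(len(prompts), 64))
image_embedding = text_embeddings[1] + 0.1 * rng.normal(size=64)

probs = zero_shot_classify(image_embedding, text_embeddings)
predicted = labels[int(np.argmax(probs))]
```

Changing the set of classes only means changing the list of prompts; no retraining is involved, which is the point of the zero-shot setting.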
Current SOTA: Vision-Language Models (VLMs)
Current SOTA: Vision-Language Models (VLMs)
Current SOTA: Vision-Language Models (VLMs)
The most famous zero-shot segmentation model is the Segment Anything Model (SAM) family.
It came out in early 2023 and has since been extended from natural images to many other types of tasks.
Current SOTA: Vision-Language Models (VLMs)
It takes points, bounding boxes, or text as prompts and produces a segmentation of the original
image.
Current SOTA: Vision-Language Models (VLMs)
Current SOTA: Vision-Language Models (VLMs)
Current SOTA: Vision-Language Models (VLMs)
A recent and very attractive application of VLMs is YOLO-World, which came out on the 31st of January 2024.
YOLO-World tackles the challenges faced by traditional open-vocabulary detection models, which often
rely on cumbersome Transformer models requiring extensive computational resources.
Current SOTA: Vision-Language Models (VLMs)
It allows us to interact with the
detector using natural language
instead of a fixed set of classes.
Try it and you will see how
incredible this is, especially if you have trained fixed-class
object detectors before:
https://huggingface.co/spaces/stevengrove/YOLO-World
Conclusion
● It’s been a long and arduous journey.
● Data and Compute power are more important than ever.
● Each innovation becomes the enabler of a huge number of
applications. So, stay curious and keep exploring.
● The barrier to entry is quite low. It's nice to know how we arrived
here, but that history is not required in order to use the current methods.
● The evolution continues, so be part of it.
Electi / Capabilities
● We have been involved in the AI
space for over a decade.
● We help companies transition to
the AI age.
● Reach out at:
ai@electiconsulting.com
LinkedIn
Select Papers / Resources
● AlexNet: ImageNet Classification with Deep Convolutional Neural Networks
● Deep Learning (MIT Press)
● Deep Residual Learning for Image Recognition
● You Only Look Once: Unified, Real-Time Object Detection
● Attention Is All You Need
● An Image is Worth 16x16 Words: Transformers for Image Recognition at
Scale
● Vision-Language Models for Vision Tasks: A Survey
● YOLO-World: Real-Time Open-Vocabulary Object Detection
● Understanding Deep Learning (MIT Press)
Thank you
Nikolas Markou
Head of AI @ Electi Consulting
LinkedIn

 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhYasamin16
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
办理(UC毕业证书)堪培拉大学毕业证成绩单原版一比一
办理(UC毕业证书)堪培拉大学毕业证成绩单原版一比一办理(UC毕业证书)堪培拉大学毕业证成绩单原版一比一
办理(UC毕业证书)堪培拉大学毕业证成绩单原版一比一z xss
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingsocarem879
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 

Recently uploaded (20)

Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
SWOT Analysis Slides Powerpoint Template.pptx
SWOT Analysis Slides Powerpoint Template.pptxSWOT Analysis Slides Powerpoint Template.pptx
SWOT Analysis Slides Powerpoint Template.pptx
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
办理(UC毕业证书)堪培拉大学毕业证成绩单原版一比一
办理(UC毕业证书)堪培拉大学毕业证成绩单原版一比一办理(UC毕业证书)堪培拉大学毕业证成绩单原版一比一
办理(UC毕业证书)堪培拉大学毕业证成绩单原版一比一
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processing
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 

Artificial Intelligence for Vision: A walkthrough of recent breakthroughs

  • 1. Artificial Intelligence for Vision: A walkthrough of recent breakthroughs February 2024 Nikolas Markou Electi Consulting
  • 2. Nikolas Markou ● Head of AI / Partner @ Electi Consulting ● Consultant for various AI startups ● Senior AI Engineer, Autonomous Cars (NavInfo Europe) ● Principal Engineer, Yodigram, Yodiwo ● SW Engineer, Intelligence / Drone / Geolocation projects, Verint ● 2010-2011 - MSc, Imperial College London ● 2005-2010 - BEng, MEng, University of Patras
  • 3. Introduction Welcome to a journey through time and pixels. Today, we embark on a quest to uncover the fascinating evolution of computer vision, from humble beginnings to the cutting-edge marvels of Vision Transformers. In just a few minutes I will try to give you a historical overview of computer vision and the bright future ahead.
  • 4. Computer Vision Computer vision is a field within artificial intelligence focused on enabling machines to interpret and understand visual information from the real world. It encompasses tasks such as: ● image recognition, ● object detection, ● and scene understanding, aiming to replicate human-like vision capabilities using algorithms and computational techniques.
  • 5. The Dawn of Computer Vision 1960-1970 Researchers explore the idea of teaching computers to "see" and interpret visual data. Early Challenges: Limited computational power and lack of data hinder progress. Milestones: Development of foundational concepts like edge detection and pattern recognition. 5
  • 6. The pioneer / Yann LeCun LeNet, introduced by Yann LeCun in 1998, was a pioneering convolutional neural network (CNN) architecture. Its compact design and hierarchical feature extraction revolutionized handwritten digit recognition, marking a pivotal moment in deep learning history. LeNet's breakthrough paved the way for modern CNNs, igniting the era of deep learning-based computer vision. 6
  • 7. The pioneer / LeNet 7
  • 8. What is Convolution ? Convolution refers to a mathematical operation applied to input data, typically images, using a convolutional filter or kernel. This operation involves sliding the filter over the input data and computing the dot product between the filter and the overlapping portions of the input. Convolutional layers in deep learning models use this operation to extract meaningful features from the input data. 8
  • 9. What is Convolution ? The convolution operation is characterized by parameters such as the size of the filter, the stride (step size) of the filter as it moves across the input, and padding to control the spatial dimensions of the output. Through training, CNNs learn to adjust the weights of the convolutional filters to extract relevant features from the input data, enabling the model to make accurate predictions or classifications based on the learned representations. 9
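The sliding dot product, stride and padding described in the two slides above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not code from the talk: the function name `conv2d`, the 5×5 ramp image and the Sobel-style kernel are assumptions chosen for the example. As in most deep learning frameworks, the kernel is not flipped, so strictly speaking this computes a cross-correlation.

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Slide `kernel` over a 2D `image` and take the dot product at each position."""
    if padding > 0:
        # Zero-pad all four borders so the output can keep the input size.
        image = np.pad(image, padding, mode="constant")
    kh, kw = kernel.shape
    # Output size per axis: (input + 2 * padding - kernel) // stride + 1
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            patch = image[top:top + kh, left:left + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# Hypothetical input: a 5x5 ramp whose values grow by 1 per column.
image = np.arange(25, dtype=float).reshape(5, 5)
# A hand-written horizontal edge filter; in a CNN these weights are learned.
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)

valid = conv2d(image, sobel_x)                        # no padding
same = conv2d(image, sobel_x, padding=1)              # padding keeps 5x5
strided = conv2d(image, sobel_x, stride=2, padding=1) # stride 2 downsamples
print(valid.shape, same.shape, strided.shape)
```

With no padding the output shrinks by the kernel size minus one, padding of 1 preserves the 5×5 size for a 3×3 kernel, and a stride of 2 roughly halves each spatial dimension.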
  • 10. What does convolution learn? Through the training dataset, the kernels (filters) learn abstractions of increasing complexity, so stacked convolutions (in a suitable setup) activate on different kinds of features.
  • 11. Innovation struggles through the 2000s Following the AI winter of the 1980s, neural networks fell out of favor, leading to a reliance on handcrafted features and classical techniques like support vector machines (SVMs) and decision trees in computer vision. Between the debut of LeNet in 1998 and the breakthrough of AlexNet in 2012, deep learning and computer vision navigated through a challenging landscape shaped by the AI winter and the dominance of traditional computer vision methods. Despite progress in applications such as facial recognition and optical character recognition, these methods faced limitations in scalability and generalization. 11
  • 12. The big break In 2012 we got the big break with AlexNet, the first “deep” neural network. AlexNet competed in the ImageNet Large Scale Visual Recognition Challenge on September 30, 2012. The network achieved a top-5 error of 15.3%, more than 10.8 percentage points lower than that of the runner-up.
  • 13. The big break / AlexNet 13
  • 14. The big break / AlexNet 14
  • 15. The big break / AlexNet The persistence and innovation of researchers / engineers / tinkerers set the stage for the pivotal moment in 2012 when AlexNet showcased the transformative power of deep CNNs on the ImageNet Large Scale Visual Recognition Challenge. This resurgence marked a turning point, reigniting widespread interest in deep learning for computer vision and catalyzing a new era of rapid advancement and innovation in the field. 15
  • 16. The big break / AlexNet ● Extensive Data: Trained on ImageNet's vast dataset of 1.2 million images across 1,000 classes, facilitating diverse feature learning. ● Deep Architecture: Featuring eight layers, including five convolutional and three fully connected layers, enabling hierarchical feature extraction. ● Convolutional Layers: Utilizing spatially invariant features crucial for image classification via convolutional and pooling layers. ● ReLU Activation: Employing ReLU activation, accelerating learning and mitigating vanishing gradient issues. ● Data Augmentation: Employing techniques like cropping, flipping, and scaling to enhance training data and combat overfitting. ● Dropout Regularization: Implementing dropout to randomly deactivate neurons during training, preventing overfitting and enhancing generalization. 16
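Two of the ingredients listed above, ReLU activation and dropout, are simple enough to sketch directly. The snippet below is a minimal NumPy illustration under stated assumptions: the function names are hypothetical, and the dropout shown is the "inverted" variant common in current frameworks (rescaling during training), which is not necessarily the exact scheme used in the original AlexNet.

```python
import numpy as np

def relu(x):
    """ReLU keeps positive values and zeroes out negatives."""
    return np.maximum(x, 0.0)

def dropout(x, rate, rng, training=True):
    """Inverted dropout (an assumption): zero a random fraction `rate` of
    activations during training and rescale the survivors so the expected
    activation stays the same. At inference time the input passes through."""
    if not training or rate == 0.0:
        return x
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

rng = np.random.default_rng(0)
activations = relu(np.array([[-2.0, -0.5, 0.0, 1.5, 3.0]]))
dropped = dropout(activations, rate=0.5, rng=rng)
print(activations)
print(dropped)
```

Because ReLU has a gradient of 1 for every positive input, it does not saturate the way sigmoid or tanh do, which is the usual explanation for the faster training mentioned above.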
  • 17. The big break / AlexNet / Parameters 17
  • 18. The big break / AlexNet / MaxPooling 18
  • 19. The big break / AlexNet / ReLU activation 19
  • 20. The big break / AlexNet / Dropout 20
  • 21. The CNN revolution Between 2012 and ~2019 there was a mad rush to release the best CNN model, each hitting a new top-1 accuracy on ImageNet.
  • 22. The CNN revolution The holy grail is a family of architectures that scales well with the number of parameters, i.e. one that performs better as we increase the number of parameters. In practice every architecture exhibits asymptotic behavior: beyond a certain size it stops improving and the computational demands become too large.
  • 23. The CNN revolution / VGG 23
  • 24. The CNN revolution / ResNet 24
  • 25. The CNN revolution / UNet ● U-shaped Architecture: ○ U-Net features a distinctive "U"-shaped design, consisting of both a contracting and an expanding path. ○ The contracting path employs convolutional and pooling layers to decrease spatial resolution and increase feature maps. ○ In contrast, the expanding path utilizes upsampling and convolutional layers to progressively enhance spatial resolution. ● Skip Connections: ○ U-Net incorporates skip connections between the contracting and expanding paths to retain information from earlier layers. ○ Similar to ResNets, these connections mitigate the vanishing gradient problem in deep networks. ○ They facilitate accurate segmentation by enabling the network to leverage information from various scales. ● Multi-scale Feature Maps: ○ Leveraging multi-scale feature maps, U-Net captures information at different abstraction levels. ○ The contracting path reduces input resolution while increasing feature maps, while the expanding path upscales resolution. ○ Skip connections merge information from diverse scales, enhancing segmentation accuracy and overall performance. 25
  • 26. The CNN revolution / Inception 26
  • 27. The CNN revolution / EfficientNet 27
  • 28. The CNN revolution / EfficientNet 28
  • 29. The CNN revolution / Activations Different activations and their gradients promote different behaviours: ● ReLU ● Sigmoid ● Tanh ● SELU ● ELU ● GELU ● Swish
  • 30. The CNN revolution / Normalizations Different normalizations and their gradients allow faster convergence and deeper models: ● Batch Normalization ● Layer Normalization ● Group Normalization ● Instance Normalization ● Many many more 30
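The main difference between the normalizations listed above is the axis over which statistics are computed. The sketch below contrasts batch and layer normalization on a 2D `(batch, features)` array. It is a simplified illustration: the learnable scale and shift parameters and the running statistics that real layers keep are left out, and the function names are assumptions.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize each feature using statistics computed across the batch."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def layer_norm(x, eps=1e-5):
    """Normalize each sample using statistics computed across its features."""
    mean = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(6, 4))  # 6 samples, 4 features

print(batch_norm(x).mean(axis=0))  # close to 0 for every feature
print(layer_norm(x).mean(axis=1))  # close to 0 for every sample
```

Batch normalization depends on the other samples in the batch, while layer normalization does not, which is one reason Transformers typically use the latter.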
  • 31. Downstream Tasks Downstream computer vision tasks refer to specific applications or problems solved using pre-trained models or transfer learning. They enable the adaptation of existing models to new tasks, reducing the need for extensive data and computational resources. Examples include object detection, image segmentation, facial recognition, and scene understanding.
  • 32. Downstream Tasks / Object Detection Object detection involves identifying and locating objects within an image or video. Challenges: Scale variation, occlusion, and object deformation. Transfer Learning: Pre-trained models like Faster R-CNN, YOLO, or SSD can be fine-tuned on custom datasets for specific object detection tasks. 32
  • 33. Downstream Tasks / Image Segmentation Image segmentation partitions an image into multiple segments or regions based on pixel similarity. Applications: Medical imaging, autonomous driving, and image editing. Transfer Learning: Models such as U-Net and Mask R-CNN, pre-trained on large datasets like COCO or Pascal VOC, can be adapted to segment specific objects or classes. 33
  • 34. Downstream Tasks / Facial Recognition Facial recognition identifies and verifies individuals based on facial features. Use Cases: Security systems, access control, and personalized user experiences. Transfer Learning: Pre-trained models like VGG-Face or FaceNet can be fine-tuned on custom datasets to recognize specific individuals or facial expressions. 34
  • 35. Downstream Tasks / Facial Recognition 35
  • 36. Downstream Tasks / SLAM Simultaneous localization and mapping attempts to make a robot or other autonomous vehicle map an unfamiliar area while, at the same time, determining where within that area the robot itself is located. 36
  • 37. Downstream Tasks / Depth Estimation 37
  • 38. Downstream Tasks / Human Pose Estimation 38
  • 39. Downstream Tasks / Scene Understanding Scene understanding involves analyzing and comprehending the content and context of an entire scene or image. Applications: Autonomous navigation, augmented reality, and content understanding. Transfer Learning: Models such as ResNet or Inception, pre-trained on large-scale image classification tasks, can be utilized for scene understanding tasks by fine-tuning on relevant datasets. 39
  • 40. Enter the Transformers Just when we thought CNNs had reached their peak, along comes a disruptor. “Attention is all you need” is the 2017 paper that changed NLP forever. It introduced the Transformer architecture and a novel attention mechanism. This was a major departure from the usual way of doing NLP.
  • 41. Enter the Transformers Transformers were not an overnight success; it was GPT and BERT that made them immensely popular. Here is a timeline of events: ● Attention is all you need: 2017 ● ELMo (LSTM-based): 2018 ● ULMFiT (LSTM-based): 2018 ● GPT (Transformer-based): 2018 ● BERT (Transformer-based): 2018 ● Transformers revolutionizing the world of NLP, Speech, and Vision: 2018 onwards
  • 42. Enter the Transformers Transformers revolutionized NLP: they scaled remarkably well, and soon we reached the era of Large Language Models (LLMs). Models like GPT-2 and GPT-3 have billions of parameters (GPT-3 has 175 billion) and were trained on hundreds of billions of tokens. These models have been trained on a good portion of all human textual knowledge. This innovation sparked the latest AI innovation explosion.
  • 43. Enter the Transformers / Encoder The Transformer's encoder transforms input sequences into machine-readable representations by capturing word similarities and positions. It utilizes input embeddings and positional encoding to prepare the sequence for processing. Stacked encoder layers analyze relationships between words through multi-head attention blocks. This understanding is enhanced by residual connections and layer normalizations. A feed-forward network further refines the sequence. The encoded knowledge is then passed to the decoder for generating the final output sequence.
  • 44. Enter the Transformers / Decoder The decoder utilizes knowledge from the encoder. In the first prediction cycle, it starts with a "start of sentence" token. Similar to the encoder, the decoder layer analyzes encoder information and previous predictions. Initially, it relies on prediction knowledge alone, then combines it with encoder output for further analysis. The output is the probability of the next word in the sequence. 44
  • 45. Enter the Transformers / Attention 45
  • 46. Enter the Transformers / Attention The Transformer uses "scaled dot product attention," a self-attention mechanism that assesses token dependencies within a sequence. Unlike global attention, which considers each word's importance relative to the entire sequence, self-attention examines relationships between tokens. For instance, in the sentence "I went to the store and bought tons of fruits along with some furniture. They tasted amazing," self-attention would correctly identify "they" refers to "fruits," not "furniture." This contrasts with global attention, which might assign higher importance to unrelated words without understanding their connections. Self-attention's comprehensive analysis of word interactions enables accurate interpretation of meaning. 46
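Scaled dot-product attention, as named in the slide above, computes softmax(QKᵀ / √d_k)·V. The following NumPy sketch shows a single attention head without masking or learned projections. The random inputs and the shapes chosen are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """Single-head attention: softmax(q @ k.T / sqrt(d_k)) @ v."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)      # (n_queries, n_keys) similarity
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
n_tokens, d_k, d_v = 5, 8, 16
# In self-attention q, k and v are all projections of the same token sequence;
# random matrices stand in for those projections here.
q = rng.normal(size=(n_tokens, d_k))
k = rng.normal(size=(n_tokens, d_k))
v = rng.normal(size=(n_tokens, d_v))

output, weights = scaled_dot_product_attention(q, k, v)
print(output.shape, weights.shape)
```

Each row of `weights` says how much one token attends to every other token. In the "fruits" example above, a trained model would be expected to give the row for "they" a large weight on "fruits".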
  • 48. Enter the Transformers / Pretraining Pre-training is essential for transformer models due to their vast parameter space and complexity. Pre-training on large datasets allows models to learn general features and representations, facilitating transfer learning to downstream tasks with smaller datasets. This initial training phase enables transformers to capture intricate patterns and relationships in data, leading to improved performance and faster convergence during fine-tuning. Additionally, pre-training mitigates the risk of overfitting on limited task-specific data by providing a solid foundation of knowledge. 48
  • 49. Transformers with a twist: Vision-Transformers (ViT) It was only a matter of time before the Transformer architecture took over the computer vision domain. In the seminal paper “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”, the vision transformer was introduced and changed everything again.
  • 50. Transformers with a twist: Vision-Transformers (ViT) The Vision Transformer (ViT) model takes a sequence of flattened 2D patches derived from an image as input. The image x, with pixel values in the [0, 255] range and dimensions (H×W×C), is split into fixed-size patches, and each patch is flattened into a vector. The flattened patches are then mapped to the model's embedding dimension by a trainable linear projection, which results in a set of patch embeddings.
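The patch extraction and linear projection described above can be sketched as follows. This is an illustrative NumPy sketch, not the reference implementation: the 224×224 input, 16×16 patch size and embedding size of 64 are assumptions, and the random matrix stands in for the trainable projection. The class token and position embeddings that ViT also uses are omitted.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches.
    Returns an array of shape (num_patches, patch_size * patch_size * C)."""
    h, w, c = image.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "image size must be divisible by patch size"
    # (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (num_patches, p*p*C)
    grid = image.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(-1, p * p * c)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(224, 224, 3)).astype(np.float32)

patches = patchify(image, patch_size=16)   # 14 * 14 = 196 patches of 768 values
embed_dim = 64                             # hypothetical embedding size
projection = rng.normal(scale=0.02, size=(patches.shape[1], embed_dim))
patch_embeddings = patches @ projection    # one embedding vector per patch

print(patches.shape, patch_embeddings.shape)
```

The resulting sequence of patch embeddings plays the same role as the sequence of word embeddings in an NLP Transformer.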
  • 51. Transformers with a twist: Vision-Transformers (ViT) ViT treats images as sequences of patches, employing a Transformer encoder like those used in NLP. Despite its simplicity, this approach, when paired with pre-training on large datasets, proves remarkably effective. The ViT rivals or surpasses state-of-the-art performance on numerous image classification tasks, with cost-effective pre-training. Its self-attention mechanism enables information integration across the entire image, even in lower layers, representing a significant advantage. 51
  • 52. Transformers with a twist: Vision-Transformers (ViT) The Vision Transformer has several variants, each with different sizes and configurations: 52
  • 53. Transformers with a twist: Vision-Transformers (ViT) 53
  • 54. Transformers with a twist: Vision-Transformers (ViT) Transformers sparked a new generation of models, borrowing previous innovations like multiscale models, small number of convolutions, different types of activations, normalizations, loss functions, pretraining tasks and types of attention. 54
  • 55. Transformers with a twist: Vision-Transformers (ViT) Multiscale Vision Transformers (MViT) leverage the idea of combining multi-scale feature hierarchies with vision transformer models. In practice, starting from the initial image size with 3 channels, the authors gradually expand (hierarchically) the channel capacity while reducing the spatial resolution. As a result, a multiscale pyramid of features is created. Intuitively, early layers operate at high spatial resolution and learn simple low-level visual information, while deeper layers are responsible for complex, high-dimensional features.
  • 56. Transformers with a twist: Vision-Transformers (ViT) The pivotal shift in Transformers lies in their integration of images as a distinct 'language,' introducing a profound modality. This breakthrough enables the fusion of diverse modalities, creating a frontier in AI where mapping between realms holds sway. With images now in their arsenal, Transformers redefine boundaries of comprehension, hinting at new levels of creativity and understanding. 56
  • 57. CNNs fight back Despite ViT-based models achieving state-of-the-art performance in many vision-related tasks, convolutional neural networks were not abandoned and experienced a resurgence with the ConvNeXt variant.
  • 58. CNNs fight back ConvNeXt and ConvNeXt V2 came out in 2022 and 2023. They sparked a renewed interest in convolutional neural networks.
  • 59. CNNs and attention Since Transformers came out there has been a strong push for adding attention mechanisms to CNNs as well: ● Squeeze-and-Excitation ● Additive attention ● Self-Attention ● CBAM ● and many more
  • 60. Current SOTA: Vision-Language Models (VLMS) The paradigm Pre-training, Fine-tuning and Prediction has demonstrated great effectiveness in a wide range of visual recognition tasks. Under this new paradigm, a DNN model is first pre-trained with certain off-the-shelf large-scale training data, being annotated or unannotated, and the pre-trained model is then fine-tuned with task-specific annotated training data. 60
  • 61. Current SOTA: Vision-Language Models (VLMS) A new deep learning paradigm named Vision-Language Model Pre-training and Zero-shot Prediction has attracted increasing attention recently. In this paradigm, a vision-language model (VLM) is pre-trained with large-scale image-text pairs that are almost infinitely available on the internet, and the pre-trained VLM can be directly applied to downstream visual recognition tasks without fine-tuning. 61
  • 62. Current SOTA: Vision-Language Models (VLMS) Zero-shot prediction directly applies pre-trained VLMs to downstream tasks without any task-specific fine-tuning. ● Image Classification aims to classify images into predefined categories. VLMs achieve zero-shot image classification by comparing the embeddings of images and texts, where “prompt engineering” is often employed to generate task-related prompts like “a photo of a [label]”. ● Semantic Segmentation aims to assign a category label to each pixel in images. Pre-trained VLMs achieve zero-shot prediction for segmentation tasks by comparing the embeddings of the given image pixels and texts. ● Object Detection aims to localize and classify objects in images, which is important for various vision applications. With the object locating ability learned from auxiliary datasets, pre-trained VLMs achieve zero-shot prediction for object detection tasks by comparing the embeddings of the given object proposals and texts. ● Image-Text Retrieval aims to retrieve the demanded samples from one modality given the cues from another modality. It consists of two tasks: text-to-image retrieval, which retrieves images based on texts, and image-to-text retrieval, which retrieves texts based on images.
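Zero-shot image classification by comparing embeddings, as described above, reduces to a cosine similarity followed by a softmax. The sketch below shows only that comparison step. The embeddings are hand-made stand-ins, since no real image or text encoder is loaded, and the function name, label set and temperature value are all assumptions for illustration.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, temperature=0.01):
    """Return class probabilities from cosine similarity between one image
    embedding and one text embedding per candidate label."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = text_embs @ image_emb / temperature
    logits = logits - logits.max()          # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

labels = ["cat", "dog", "car"]
# Prompt engineering: wrap each label in a template before text encoding.
prompts = [f"a photo of a {label}" for label in labels]

# Stand-ins for encoder outputs (assumption): in a real VLM these vectors would
# come from the text encoder applied to `prompts` and the image encoder.
text_embs = np.eye(3, 8)
image_emb = text_embs[1] + 0.1 * np.ones(8)  # an image that lies near "dog"

probs = zero_shot_classify(image_emb, text_embs)
predicted = labels[int(np.argmax(probs))]
print(dict(zip(prompts, probs.round(3))), predicted)
```

Changing the set of classes only requires encoding new prompts, with no retraining, which is what makes the approach zero-shot.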
  • 63. Current SOTA: Vision-Language Models (VLMS) 63
  • 64. Current SOTA: Vision-Language Models (VLMS) 64
  • 65. Current SOTA: Vision-Language Models (VLMs) The most famous of the zero-shot VLMs is the Segment Anything Model (SAM) family. It came out in early 2023 and has moved from natural images to all types of tasks.
  • 66. Current SOTA: Vision-Language Models (VLMs) It takes points / bounding boxes / text to create a segmentation of the original image 66
  • 67. Current SOTA: Vision-Language Models (VLMs) 67
  • 68. Current SOTA: Vision-Language Models (VLMs) 68
  • 69. Current SOTA: Vision-Language Models (VLMS) A recent and very attractive application of VLMs is YOLO-World, which came out on the 31st of January 2024. YOLO-World tackles the challenges faced by traditional open-vocabulary detection models, which often rely on cumbersome Transformer models requiring extensive computational resources.
  • 70. Current SOTA: Vision-Language Models (VLMS) It allows us to interact with the detector using natural language instead of a fixed number of classes. Try it and you will understand how incredible this is, especially if you have trained fixed-class object detectors before: https://huggingface.co/spaces/stevengrove/YOLO-World
  • 71. Conclusion ● It’s been a long and arduous journey. ● Data and compute power are more important than ever. ● Each innovation becomes the enabler of a huge number of applications. So, stay curious and keep exploring. ● The barrier to entry is quite low. It's nice to know how we arrived here, but that knowledge is not necessary in order to use the current methods. ● The evolution continues, so be part of it.
  • 72. Electi / Capabilities ● We have been involved in the AI space for over a decade. ● We help companies transition to the AI age. ● Reach out at: ai@electiconsulting.com LinkedIn
  • 73. Select Papers / Resources ● AlexNet / Imagenet-classification-with-deep-convolutional-neural-networks ● Deep Learning (MIT Press) ● Deep Residual Learning for Image Recognition ● You Only Look Once: Unified, Real-Time Object Detection ● Attention Is All You Need ● An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale ● Vision-Language Models for Vision Tasks: A Survey ● YOLO-World: Real-Time Open-Vocabulary Object Detection ● Understanding Deep Learning (MIT Press)
  • 74. Thank you Nikolas Markou Head of AI @ Electi Consulting LinkedIn