1. Artificial Intelligence for Vision:
A walkthrough of recent breakthroughs
February 2024
Nikolas Markou
Electi Consulting
2. Nikolas Markou
● Head of AI / Partner @ Electi Consulting
● Consultant for various AI startups
● Senior AI Engineer, Autonomous Cars (NavInfo
Europe)
● Principal Engineer, Yodigram, Yodiwo
● SW Engineer, Intelligence / Drone / Geolocation
projects, Verint
● 2010-2011 - MSc, Imperial College London
● 2005-2010 - BEng, MEng, University of Patras
3. Introduction
Welcome to a journey through time and
pixels.
Today, we embark on a quest to uncover
the fascinating evolution of computer
vision, from humble beginnings to the
cutting-edge marvels of Vision
Transformers.
In just a few minutes I will try to give you a historical overview of computer vision and the bright future ahead.
4. Computer Vision
Computer vision is a field within artificial intelligence focused on enabling machines to interpret and understand visual information from the real world.
It encompasses tasks such as:
● image recognition,
● object detection,
● and scene understanding,
aiming to replicate human-like vision
capabilities using algorithms and
computational techniques.
5. The Dawn of Computer Vision (1960s-1970s)
Researchers explore the idea of
teaching computers to "see" and
interpret visual data.
Early Challenges: Limited
computational power and lack of
data hinder progress.
Milestones: Development of
foundational concepts like edge
detection and pattern recognition.
6. The pioneer / Yann LeCun
LeNet, introduced by Yann LeCun in 1998,
was a pioneering convolutional neural
network (CNN) architecture.
Its compact design and hierarchical feature
extraction revolutionized handwritten digit
recognition, marking a pivotal moment in
deep learning history.
LeNet's breakthrough paved the way for
modern CNNs, igniting the era of deep
learning-based computer vision.
8. What is Convolution?
Convolution refers to a mathematical
operation applied to input data, typically
images, using a convolutional filter or kernel.
This operation involves sliding the filter over
the input data and computing the dot product
between the filter and the overlapping
portions of the input.
Convolutional layers in deep learning models
use this operation to extract meaningful
features from the input data.
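To make the sliding-window idea concrete, here is a minimal sketch of a stride-1, no-padding convolution in NumPy (strictly, the cross-correlation used by CNNs); the toy image and kernel values are made up for illustration.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image and take dot products (valid mode, stride 1)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            # dot product between the kernel and the overlapping image patch
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)      # horizontal edge-detecting kernel
print(conv2d(image, sobel_x).shape)                # (3, 3)
```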
9. What is Convolution?
The convolution operation is characterized by parameters such as the size of the filter, the
stride (step size) of the filter as it moves across the input, and padding to control the spatial
dimensions of the output.
Through training, CNNs learn to adjust the weights of the convolutional filters to extract
relevant features from the input data, enabling the model to make accurate predictions or
classifications based on the learned representations.
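As a quick illustration (not from the slides), the spatial output size follows floor((input + 2*padding - kernel) / stride) + 1; the sketch below checks this with a hypothetical PyTorch layer.

```python
import torch
import torch.nn as nn

# Hypothetical layer: 3x3 kernel, stride 2, padding 1 on a 1x3x224x224 input.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=2, padding=1)
x = torch.randn(1, 3, 224, 224)
y = conv(x)

expected = (224 + 2 * 1 - 3) // 2 + 1  # floor((n + 2p - k) / s) + 1 = 112
print(y.shape, expected)               # torch.Size([1, 16, 112, 112]) 112
```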
10. What does convolution learn?
The kernels or filters learn (through the training dataset) abstractions of increasing complexity, so stacked convolutions (in specific setups) activate on different kinds of features.
11. Innovation struggles through the 2000s
Following the AI winter of the 1980s, neural networks fell out of favor, leading to a reliance on
handcrafted features and classical techniques like support vector machines (SVMs) and
decision trees in computer vision.
Between the debut of LeNet in 1998 and the breakthrough of AlexNet in 2012, deep learning
and computer vision navigated through a challenging landscape shaped by the AI winter and
the dominance of traditional computer vision methods.
Despite progress in applications such as facial recognition and optical character recognition,
these methods faced limitations in scalability and generalization.
12. The big break
In 2012 we got the big break with AlexNet, the first “deep” neural network.
AlexNet competed in the ImageNet Large
Scale Visual Recognition Challenge on
September 30, 2012.
The network achieved a top-5 error of
15.3%, more than 10.8 percentage points
lower than that of the runner up.
15. The big break / AlexNet
The persistence and innovation of
researchers / engineers / tinkerers set the
stage for the pivotal moment in 2012 when
AlexNet showcased the transformative power
of deep CNNs on the ImageNet Large Scale
Visual Recognition Challenge.
This resurgence marked a turning point,
reigniting widespread interest in deep
learning for computer vision and catalyzing a
new era of rapid advancement and
innovation in the field.
16. The big break / AlexNet
● Extensive Data: Trained on ImageNet's vast dataset of 1.2 million images across 1,000
classes, facilitating diverse feature learning.
● Deep Architecture: Featuring eight layers, including five convolutional and three fully
connected layers, enabling hierarchical feature extraction.
● Convolutional Layers: Utilizing spatially invariant features crucial for image
classification via convolutional and pooling layers.
● ReLU Activation: Employing ReLU activation, accelerating learning and mitigating
vanishing gradient issues.
● Data Augmentation: Employing techniques like cropping, flipping, and scaling to
enhance training data and combat overfitting.
● Dropout Regularization: Implementing dropout to randomly deactivate neurons during
training, preventing overfitting and enhancing generalization.
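A hedged sketch of how one might load such a network today, assuming torchvision is available; the weights choice and the 10-class head are illustrative, not part of the original AlexNet.

```python
import torch
import torchvision

# Load an ImageNet-pretrained AlexNet and adapt its classifier to a hypothetical 10-class task.
model = torchvision.models.alexnet(weights="IMAGENET1K_V1")
model.classifier[6] = torch.nn.Linear(4096, 10)

x = torch.randn(1, 3, 224, 224)  # dummy image batch
print(model(x).shape)            # torch.Size([1, 10])
```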
21. The CNN revolution
Between 2012 and ~2019 there was a mad rush to release the best CNN model, hitting new top-1 accuracy on ImageNet.
22. The CNN revolution
The holy grail is a family of architectures that scales well with the number of parameters, i.e., it performs better as we increase the number of parameters.
In practice every architecture exhibits asymptotic behavior, after which it stops improving and the computational demands become too large.
25. The CNN revolution / UNet
● U-shaped Architecture:
○ U-Net features a distinctive "U"-shaped design, consisting of
both a contracting and an expanding path.
○ The contracting path employs convolutional and pooling layers
to decrease spatial resolution and increase feature maps.
○ In contrast, the expanding path utilizes upsampling and
convolutional layers to progressively enhance spatial
resolution.
● Skip Connections:
○ U-Net incorporates skip connections between the contracting
and expanding paths to retain information from earlier layers.
○ Similar to ResNets, these connections mitigate the vanishing
gradient problem in deep networks.
○ They facilitate accurate segmentation by enabling the network
to leverage information from various scales.
● Multi-scale Feature Maps:
○ Leveraging multi-scale feature maps, U-Net captures
information at different abstraction levels.
○ The contracting path reduces input resolution while increasing
feature maps, while the expanding path upscales resolution.
○ Skip connections merge information from diverse scales,
enhancing segmentation accuracy and overall performance.
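A minimal sketch of the U-shaped pattern with a single skip connection, assuming PyTorch; real U-Nets use more levels and channels.

```python
import torch
import torch.nn as nn

def block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU())

class TinyUNet(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.enc1 = block(3, 32)                            # contracting path
        self.enc2 = block(32, 64)
        self.pool = nn.MaxPool2d(2)                         # halve spatial resolution
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)   # expanding path: upsample
        self.dec1 = block(64, 32)                           # 64 = 32 (skip) + 32 (upsampled)
        self.head = nn.Conv2d(32, num_classes, 1)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        d1 = self.dec1(torch.cat([self.up(e2), e1], dim=1))  # skip connection
        return self.head(d1)

print(TinyUNet()(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 2, 64, 64])
```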
29. The CNN revolution / Activations
Different activations and their
gradients promote different
behaviours:
● ReLU
● Sigmoid
● Tanh
● SELU
● ELU
● GELU
● Swish
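A quick sketch comparing these activations on the same inputs, assuming PyTorch (Swish is exposed there as SiLU):

```python
import torch

x = torch.linspace(-3, 3, 7)
activations = {
    "ReLU": torch.nn.ReLU(),
    "Sigmoid": torch.nn.Sigmoid(),
    "Tanh": torch.nn.Tanh(),
    "SELU": torch.nn.SELU(),
    "ELU": torch.nn.ELU(),
    "GELU": torch.nn.GELU(),
    "Swish/SiLU": torch.nn.SiLU(),
}
for name, fn in activations.items():
    # Same inputs, different output shapes around zero and for negative values.
    print(f"{name:>10}:", [round(v, 2) for v in fn(x).tolist()])
```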
30. The CNN revolution / Normalizations
Different normalizations and
their gradients allow faster
convergence and deeper
models:
● Batch Normalization
● Layer Normalization
● Group Normalization
● Instance Normalization
● Many, many more
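A sketch of where these layers sit in PyTorch, applied to a hypothetical feature map of shape (N, C, H, W):

```python
import torch
import torch.nn as nn

x = torch.randn(8, 32, 28, 28)  # (batch, channels, height, width)

batch_norm = nn.BatchNorm2d(32)        # normalizes over (N, H, W) per channel
layer_norm = nn.GroupNorm(1, 32)       # layer norm over all channels = GroupNorm with 1 group
group_norm = nn.GroupNorm(8, 32)       # 8 groups of 4 channels each
instance_norm = nn.InstanceNorm2d(32)  # normalizes each sample and channel independently

for norm in (batch_norm, layer_norm, group_norm, instance_norm):
    y = norm(x)
    print(type(norm).__name__, round(y.mean().item(), 3), round(y.std().item(), 3))  # roughly 0 and 1
```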
31. Downstream Tasks
Downstream computer vision tasks
refer to specific applications or
problems solved using pre-trained
models or transfer learning.
They enable the adaptation of existing
models to new tasks, reducing the
need for extensive data and
computational resources.
Examples include object detection, image segmentation, facial recognition, and scene understanding.
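As an illustration of the transfer-learning recipe (not part of the original slides), a hedged sketch: freeze a pretrained backbone and train only a new classification head.

```python
import torch
import torchvision

# Start from an ImageNet-pretrained ResNet-18 and adapt it to a hypothetical 5-class task.
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
for p in model.parameters():
    p.requires_grad = False                                # freeze the pretrained backbone
model.fc = torch.nn.Linear(model.fc.in_features, 5)        # new, trainable classification head

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
print(model(torch.randn(2, 3, 224, 224)).shape)            # torch.Size([2, 5])
```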
32. Downstream Tasks / Object Detection
Object detection involves
identifying and locating objects
within an image or video.
Challenges: Scale variation,
occlusion, and object deformation.
Transfer Learning: Pre-trained
models like Faster R-CNN, YOLO,
or SSD can be fine-tuned on
custom datasets for specific object
detection tasks.
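A minimal sketch of running a COCO-pretrained detector from torchvision; the random input is purely to show the API shape.

```python
import torch
import torchvision

# COCO-pretrained Faster R-CNN; in practice you would fine-tune it on your own classes.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

images = [torch.rand(3, 480, 640)]          # list of CHW tensors with values in [0, 1]
with torch.no_grad():
    outputs = model(images)

# Each output dict holds 'boxes' (x1, y1, x2, y2), 'labels', and 'scores'.
print(outputs[0]["boxes"].shape, outputs[0]["scores"][:5])
```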
33. Downstream Tasks / Image Segmentation
Image segmentation partitions an image
into multiple segments or regions based
on pixel similarity.
Applications: Medical imaging,
autonomous driving, and image editing.
Transfer Learning: Models such as U-Net
and Mask R-CNN, pre-trained on large
datasets like COCO or Pascal VOC, can
be adapted to segment specific objects or
classes.
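In the same spirit, a hedged sketch of adapting a COCO-pretrained Mask R-CNN to a hypothetical dataset with 3 classes plus background, following the usual torchvision head-replacement pattern:

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
num_classes = 4  # 3 custom classes + background

# Swap the box-classification head for the new number of classes.
in_feats = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_feats, num_classes)

# Swap the mask-prediction head as well.
in_feats_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_feats_mask, 256, num_classes)
```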
34. Downstream Tasks / Facial Recognition
Facial recognition identifies and verifies
individuals based on facial features.
Use Cases: Security systems, access
control, and personalized user
experiences.
Transfer Learning: Pre-trained models like
VGG-Face or FaceNet can be fine-tuned
on custom datasets to recognize specific
individuals or facial expressions.
36. Downstream Tasks / SLAM
Simultaneous localization and mapping (SLAM) attempts to make a robot or other autonomous vehicle map an unfamiliar area while, at the same time, determining where within that area the robot itself is located.
39. Downstream Tasks / Scene Understanding
Scene understanding involves analyzing
and comprehending the content and
context of an entire scene or image.
Applications: Autonomous navigation,
augmented reality, and content
understanding.
Transfer Learning: Models such as
ResNet or Inception, pre-trained on
large-scale image classification tasks, can
be utilized for scene understanding tasks
by fine-tuning on relevant datasets.
40. Enter the Transformers
Just when we thought CNNs had reached their peak, along comes a disruptor.
“Attention is all you need” is the 2017
paper that changed NLP forever.
It introduced the Transformer
architecture and a novel attention
mechanism.
This was a very big divergence from the usual way of doing NLP.
41. Enter the Transformers
Transformers were not an overnight success; it took GPT and BERT to popularize them immensely. Here is a timeline of events:
● Attention is all you need: 2017
● ELMo (LSTM-based): 2018
● ULMFiT (LSTM-based): 2018
● GPT (Transformer-based): 2018
● BERT (Transformer-based): 2018
● Transformers revolutionizing the world of NLP, Speech, and Vision: 2018
onwards
42. Enter the Transformers
Transformers revolutionized NLP; they could scale immensely, and soon we reached the era of Large Language Models (LLMs).
Models like GPT-2 and GPT-3 have billions of parameters and training datasets of hundreds of billions of tokens.
These models have been trained on a good portion of all human textual knowledge.
This sparked the latest explosion of AI innovation.
43. Enter the Transformers / Encoder
The Transformer's encoder transforms input
sequences into machine-readable representations
by capturing word similarities and positions. It
utilizes input embeddings and positional encoding
to prepare the sequence for processing.
Stacked encoder layers analyze relationships between words through multi-head attention blocks.
This understanding is enhanced by residual
connections and layer normalizations. A
feed-forward network further refines the sequence.
The encoded knowledge is then passed to the
decoder for generating the final output sequence.
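A minimal sketch of a stacked Transformer encoder in PyTorch; the embedding size, number of heads, and sequence length are arbitrary.

```python
import torch
import torch.nn as nn

# Six stacked encoder layers, each with multi-head attention, residual connections,
# layer normalization, and a feed-forward network inside.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)

tokens = torch.randn(2, 10, 512)   # (batch, sequence length, embedding dim), assumed position-encoded
encoded = encoder(tokens)
print(encoded.shape)               # torch.Size([2, 10, 512])
```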
44. Enter the Transformers / Decoder
The decoder utilizes knowledge from the
encoder. In the first prediction cycle, it
starts with a "start of sentence" token.
Similar to the encoder, the decoder layer
analyzes encoder information and previous
predictions.
Initially, it relies on prediction knowledge
alone, then combines it with encoder output
for further analysis.
The output is the probability of the next
word in the sequence.
46. Enter the Transformers / Attention
The Transformer uses "scaled dot product attention," a self-attention mechanism that assesses token
dependencies within a sequence.
Unlike global attention, which considers each word's importance relative to the entire sequence,
self-attention examines relationships between tokens.
For instance, in the sentence "I went to the store and bought tons of fruits along with some furniture.
They tasted amazing," self-attention would correctly identify "they" refers to "fruits," not "furniture."
This contrasts with global attention, which might assign higher importance to unrelated words without
understanding their connections. Self-attention's comprehensive analysis of word interactions enables
accurate interpretation of meaning.
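A minimal sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, written out in PyTorch:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """q, k, v: (batch, seq_len, d_k) tensors."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # token-to-token similarity
    weights = scores.softmax(dim=-1)                    # attention weights per query token
    return weights @ v, weights

q = k = v = torch.randn(1, 5, 64)   # self-attention: queries, keys, values from the same sequence
out, attn = scaled_dot_product_attention(q, k, v)
print(out.shape, attn.shape)        # torch.Size([1, 5, 64]) torch.Size([1, 5, 5])
```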
48. Enter the Transformers / Pretraining
Pre-training is essential for transformer models due to their vast parameter space
and complexity.
Pre-training on large datasets allows models to learn general features and
representations, facilitating transfer learning to downstream tasks with smaller
datasets.
This initial training phase enables transformers to capture intricate patterns and
relationships in data, leading to improved performance and faster convergence
during fine-tuning.
Additionally, pre-training mitigates the risk of overfitting on limited task-specific
data by providing a solid foundation of knowledge.
49. Transformers with a twist: Vision-Transformers (ViT)
It was only a matter of time before the Transformer architecture took over the computer vision domain.
In the seminal paper:
“An Image is Worth 16x16 Words:
Transformers for Image Recognition at
Scale”, the vision transformer was
introduced and changed everything
again.
50. Transformers with a twist: Vision-Transformers (ViT)
The Vision Transformer (ViT) model takes a
sequence of flattened 2D patches derived from an
image as input.
The image, denoted as x, with pixels in the [0, 255] range and dimensions (H×W×C), is reshaped into a sequence of patches.
The patches derived from the image are
transformed into a lower-dimensional space using a
trainable linear projection.
This transformation process, which we refer to as
“flattening,” results in a set of patch embeddings.
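A minimal sketch of this patchify-and-project step, assuming PyTorch; with 16×16 patches a 224×224 image yields 196 patch tokens.

```python
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)       # (B, C, H, W), already scaled from [0, 255]
patch, dim = 16, 768

# A strided convolution is equivalent to: split into 16x16 patches, flatten, linear projection.
to_patches = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
tokens = to_patches(image).flatten(2).transpose(1, 2)   # (1, 196, 768) patch embeddings

cls_token = nn.Parameter(torch.zeros(1, 1, dim))        # learnable [class] token, as in ViT
tokens = torch.cat([cls_token, tokens], dim=1)
print(tokens.shape)                                     # torch.Size([1, 197, 768])
```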
51. Transformers with a twist: Vision-Transformers (ViT)
ViT treats images as sequences of patches, employing a
Transformer encoder like those used in NLP.
Despite its simplicity, this approach, when paired with
pre-training on large datasets, proves remarkably
effective.
The ViT rivals or surpasses state-of-the-art performance
on numerous image classification tasks, with
cost-effective pre-training.
Its self-attention mechanism enables information
integration across the entire image, even in lower layers,
representing a significant advantage.
52. Transformers with a twist: Vision-Transformers (ViT)
The Vision Transformer has several variants, such as ViT-Base, ViT-Large, and ViT-Huge, each with a different depth, hidden size, number of attention heads, and parameter count.
54. Transformers with a twist: Vision-Transformers (ViT)
Transformers sparked a new generation of models, borrowing previous innovations like multiscale designs, a small number of convolutions, and different types of activations, normalizations, loss functions, pretraining tasks, and attention.
55. Transformers with a twist: Vision-Transformers (ViT)
Multiscale Vision Transformers (MViT) leverages
the idea of combining multi-scale feature
hierarchies with vision transformer models. In
practice, starting from the initial image size with
3 channels, the authors gradually expand
(hierarchically) the channel capacity while
reducing the spatial resolution.
As a result, a multiscale pyramid of features is created. Intuitively, early layers operate at high spatial resolution to capture simple, low-level visual information, while deeper layers are responsible for complex, high-dimensional features.
56. Transformers with a twist: Vision-Transformers (ViT)
The pivotal shift in Transformers lies in their integration of images as a distinct 'language', introducing a profound new modality.
This breakthrough enables the fusion of
diverse modalities, creating a frontier in AI
where mapping between realms holds
sway.
With images now in their arsenal,
Transformers redefine boundaries of
comprehension, hinting at new levels of
creativity and understanding.
57. CNNs fight back
Despite ViT-based models achieving state-of-the-art performance in many vision-related tasks, convolutional neural networks were not disregarded and experienced a resurgence with the ConvNeXt variant.
58. CNNs fight back
ConvNeXt and ConvNeXt V2 came out in 2022 and 2023, respectively.
They sparked a renewed interest in convolutional neural networks.
59. CNNs and attention
Since Transformers came out, there has been a strong push for adding attention mechanisms to CNNs as well:
● Squeeze and Excite
● Additive attention
● Self-Attention
● CBAM
● and much more
60. Current SOTA: Vision-Language Models (VLMs)
The paradigm of pre-training, fine-tuning, and prediction has demonstrated great effectiveness in a wide range of visual recognition tasks.
Under this paradigm, a DNN model is first pre-trained on off-the-shelf large-scale training data, annotated or unannotated, and the pre-trained model is then fine-tuned with task-specific annotated training data.
61. Current SOTA: Vision-Language Models (VLMs)
A new deep learning paradigm named Vision-Language Model Pre-training and Zero-shot
Prediction has attracted increasing attention recently.
In this paradigm, a vision-language model (VLM) is pre-trained with large-scale image-text
pairs that are almost infinitely available on the internet, and the pre-trained VLM can be
directly applied to downstream visual recognition tasks without fine-tuning.
62. Current SOTA: Vision-Language Models (VLMs)
Zero-shot prediction directly applies pre-trained VLMs to downstream tasks without any task-specific fine-tuning.
Image Classification aims to classify images into predefined categories. VLMs achieve zero-shot image
classification by comparing the embeddings of images and texts, where “prompt engineering” is often employed to
generate task-related prompts like “a photo of a [label]”.
Semantic Segmentation aims to assign a category label to each pixel in images. Pre-trained VLMs achieve
zero-shot prediction for segmentation tasks by comparing the embeddings of the given image pixels and texts.
Object Detection aims to localize and classify objects in images, which is important for various vision applications.
With the object locating ability learned from auxiliary datasets, pre-trained VLMs achieve zero-shot prediction for
object detection tasks by comparing the embeddings of the given object proposals and texts.
Image-Text Retrieval aims to retrieve the demanded samples from one modality given the cues from another
modality, which consists of two tasks, i.e., text-to-image retrieval that retrieves images based on texts and
image-to-text retrieval that retrieves texts based on images.
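A hedged sketch of zero-shot image classification with a CLIP-style VLM via the Hugging Face transformers library; the checkpoint name, labels, and image path are illustrative.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["cat", "dog", "car"]
prompts = [f"a photo of a {label}" for label in labels]    # prompt engineering
image = Image.open("example.jpg")                          # hypothetical input image

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image              # image-text similarity scores
print(dict(zip(labels, logits.softmax(dim=-1)[0].tolist())))
```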
65. Current SOTA: Vision-Language Models (VLMs)
The most famous of the zero-shot VLMs is the Segment Anything Model (SAM) family.
It came out in early 2023 and has since moved from natural images to all types of tasks.
66. Current SOTA: Vision-Language Models (VLMs)
It takes points, bounding boxes, or text prompts to create a segmentation of the original image.
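A hedged sketch of prompting SAM with a single point, assuming the segment_anything package and a downloaded checkpoint; the checkpoint path, stand-in image, and point are illustrative.

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Hypothetical checkpoint path; SAM checkpoints are published by Meta AI.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

image = np.zeros((480, 640, 3), dtype=np.uint8)   # stand-in for a real RGB image
predictor.set_image(image)

# One foreground point prompt (x, y); label 1 = foreground, 0 = background.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
print(masks.shape, scores)   # (3, 480, 640) and one confidence score per mask
```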
69. Current SOTA: Vision-Language Models (VLMs)
A recent and very attractive application of VLMs is YOLO-World, which came out on the 31st of January 2024.
YOLO-World tackles the challenges faced by traditional open-vocabulary detection models, which often rely on cumbersome Transformer models requiring extensive computational resources.
70. Current SOTA: Vision-Language Models (VLMs)
What it does is allow us to interact with the detector not through a fixed set of classes, but through natural language.
You can try it and you will understand how incredible this is, especially if you have trained fixed object detectors before:
https://huggingface.co/spaces/stevengrove/YOLO-World
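A hedged sketch of that open-vocabulary workflow using the ultralytics package's YOLO-World wrapper; the weights name, class descriptions, and image path are illustrative.

```python
from ultralytics import YOLOWorld

# Hypothetical weights file; ultralytics resolves published YOLO-World checkpoints by name.
model = YOLOWorld("yolov8s-world.pt")

# No fixed label set: describe the classes you care about in natural language.
model.set_classes(["person wearing a helmet", "forklift", "wooden pallet"])

results = model.predict("warehouse.jpg")   # hypothetical image path
results[0].show()                          # draw the detected boxes
```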
71. Conclusion
● It’s been a long and arduous journey.
● Data and Compute power are more important than ever.
● Each innovation becomes the enabler of a huge number of
applications. So, stay curious and keep exploring.
● The entry point is quite low. It's nice to know how we arrived here, but it's not necessary in order to use the current methods.
● The evolution continues, so be part of it.
72. Electi / Capabilities
● We have been involved in the AI
space for over a decade.
● We help companies transition to
the AI age.
● Reach out at:
ai@electiconsulting.com
LinkedIn
73. Select Papers / Resources
● AlexNet: ImageNet Classification with Deep Convolutional Neural Networks
● Deep Learning (MIT Press)
● Deep Residual Learning for Image Recognition
● You Only Look Once: Unified, Real-Time Object Detection
● Attention Is All You Need
● An Image is Worth 16x16 Words: Transformers for Image Recognition at
Scale
● Vision-Language Models for Vision Tasks: A Survey
● YOLO-World: Real-Time Open-Vocabulary Object Detection
● Understanding Deep Learning (MIT Press)