Comparative study of three state-of-the-art papers on semantic segmentation, the task of assigning every pixel of an image to a class. Our approach uses Transformers, and the papers compared employ different techniques: the Swin Transformer, Segmenter, and SegFormer. The trained models still leave considerable room for improvement.
1. PROJECT
Applied AI and ML Cohort 2
SEMANTIC SEGMENTATION
AARUNI PARIMAL
aarupari@gmail.com
Presented by:
HITESH KUMAR
proudhitesh@gmail.com
GROUP 9
Project Guide:
Prof. ADITYA NIGAM
aditya@iitmandi.ac.in
Prof. ARNAV BHAVASAR
arnav@iitmandi.ac.in
TA Mentors:
ANOUSHKA BANARJEE
s19016@students.iitmandi.ac.in
RANJEET RANJAN
ranjanjharanjeet@gmail.com
2. OBJECTIVE
The objective of this presentation is to fulfil the
programme requirements for completing this course and to
learn something practical by solving a real problem while
following research methodology.
Towards research, the objective of this study is to solve
a computer vision problem called semantic segmentation
and to perform it using Transformers, a topic covered in
the syllabus of this programme.
PURPOSE
3. SEMANTIC SEGMENTATION
Segmentation is a computer vision problem: identify and label
every pixel of an image with a class.
INTRODUCTION
Identification and classification of each object in an image
Classification of each pixel according to the corresponding object
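The per-pixel labelling above can be illustrated with a minimal NumPy sketch. The label map, class IDs, and class names here are hypothetical toy values, not taken from any dataset used in the project:

```python
import numpy as np

# Hypothetical 4x4 segmentation output: one class ID per pixel.
# 0 = background, 1 = person, 2 = car (example class IDs only).
label_map = np.array([
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [2, 2, 0, 1],
    [2, 2, 0, 0],
])

# Every pixel carries exactly one class label, so the mask has the
# same spatial shape as the image.
assert label_map.shape == (4, 4)

# Pixels belonging to the "person" class form a binary mask.
person_mask = (label_map == 1)
print(int(person_mask.sum()))  # 6 pixels labelled "person"
```

Downstream tasks then consume such masks directly, e.g. by overlaying each class mask on the original image in a distinct colour.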
4. COMPUTER VISION PROBLEM
IMAGE SEGMENTATION
[Figure: an original image of several persons (P1, P2, P3) shown under object identification (position and identification mask), instance segmentation (a separate mask per object instance), and semantic segmentation (a single multi-class mask).]
Fig: difference between object identification, instance segmentation and semantic segmentation.
5. BUILDING MODEL
Our initial approach to train our model was by using
“Transformers”. We replicated model proposed in a paper
and then thought to improve.
Because we were unable to achieve any improvement
over what was already given in the paper, so we did a
comparative analysis of 3 state of the art papers. Also we
generated few image, video sample and also tried
same on live stream.
APPROACH
7. BUILDING MODEL
TRANSFORMER
Inspired by its results in NLP
Treats the problem as sequence-to-sequence conversion
Divides the image into patches, like word tokens in text
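The patching step can be sketched with NumPy. This is a toy example with sizes chosen for illustration (ViT-style models typically cut 224x224 images into 16x16 patches); the image here is just a zero placeholder:

```python
import numpy as np

# Toy image: 64x64 pixels, 3 channels (zeros as placeholder values).
image = np.zeros((64, 64, 3))
patch = 16  # ViT-style patch size

# Cut the image into non-overlapping 16x16 patches and flatten each one,
# giving a sequence of "visual tokens" analogous to word tokens in text.
h, w, c = image.shape
patches = image.reshape(h // patch, patch, w // patch, patch, c)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

print(patches.shape)  # (16, 768): 4x4 = 16 tokens, each of dimension 16*16*3
```

Each flattened patch is then linearly projected to an embedding and fed to the Transformer encoder as one token of the input sequence.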
13. TRAINING
ENVIRONMENT / TOOLS
Google Colab Pro
PyTorch v1.11
Implemented directly from the official source code
MMSegmentation library by OpenMMLab used for SegFormer and Swin
Default configurations with the smallest image and patch sizes
FFmpeg tool to create side-by-side sample videos
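The side-by-side videos can be produced with FFmpeg's `hstack` filter. A minimal sketch; the file names here are hypothetical placeholders, and the clips are assumed to share the same height and duration:

```shell
# Place the original clip and the segmented clip next to each other.
ffmpeg -i original.mp4 -i segmented.mp4 \
       -filter_complex hstack=inputs=2 \
       side_by_side.mp4
```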
19. BUILDING MODEL
RESULT – SEGMENTER
Training 1:
epochs = 1, mIoU = 12.08
Training 2:
epochs = 64, mIoU = 8.54
Training 3:
epochs = 64, mIoU = 18.98
Training 4:
epochs = 64, mIoU = 38.37
20. BUILDING MODEL
RESULT – SegFormer
Training 1:
epochs = 16, mIoU = 11.53
[Plots: mIoU score and training loss]
21. BUILDING MODEL
RESULT – Swin Transformer
Training 1:
epochs = 16, mIoU = 11.53
[Plots: mIoU score and training loss]
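The mIoU metric reported above averages the per-class Intersection over Union between the predicted and ground-truth label maps. A minimal NumPy sketch with toy 2x3 masks (our actual evaluation relied on the libraries' built-in metrics, not this code):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean Intersection-over-Union over classes present in pred or gt."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both masks
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy 2x3 label maps with two classes.
gt   = np.array([[0, 0, 1], [1, 1, 1]])
pred = np.array([[0, 1, 1], [1, 1, 1]])

print(mean_iou(pred, gt, num_classes=2))  # 0.65 = (0.5 + 0.8) / 2
```

One mislabelled pixel hurts both the class it was taken from and the class it was given, which is why mIoU is a stricter measure than plain pixel accuracy.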
22. FUTURE POSSIBILITIES
Train for more epochs
Use better pretrained weights for feature extraction
Evaluate on different datasets
Implement the model on videos
ON IMPROVEMENTS
23. LITERATURE REVIEW
REFERENCES
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [2020]
Segmenter: Transformer for Semantic Segmentation [2021]
SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers [2021]
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows [2021]
24. GRATITUDE
WE LEARNT A LOT
All IIT Mandi Faculty members
WilyNXT team
Mentor TAs
Master Class Mentors
Group 9 Team members
And everyone else behind the scenes who made this
possible.