Project Shringar - An Exploration Of Approaches Towards Virtual TryOn
Nivedit Jain (B18CSE039)1, Mitul Indravadanbhai Patel (B18CSE041)1, and Rajat Sharma
(B18CSE043)1
1Department of Computer Science and Engineering
Indian Institute of Technology, Jodhpur
Project Report for completion of B.Tech. Project (BTech Pre-Final Year, Trimester 2, Academic Year 2020-2021) under the guidance
of Dr. Anand Mishra, Indian Institute of Technology, Jodhpur.
I INTRODUCTION
Augmented Reality (AR) tries to combine computer machinery
with human interaction, and has become one of the fastest
growing fields in information technology over the past
couple of years. Companies and entrepreneurs have
invested a lot of capital and human resources to bring
humans and computers closer, and new innovations have
opened doors into industries that were previously
considered out of scope for these technologies. One of
these industries is fashion, which is the one we are
targeting in this project. Our goal is to create a Virtual
Try-On application, where users do not have to go through
the hassle of trying out clothes in a changing room, and can
instead try them out online through an AR interface. A
popular example of such a system is currently in use by
Lenskart.com®, which offers it for spectacle frames; we
aim to do the same for clothes.
Through this project, we aim to accomplish the following
goals:
• Survey the existing and upcoming methodologies and
innovations that can be used for creating the
application.
• Expand our knowledge base about the internal
workings of different methods, so as to design
efficient and cost effective solutions.
• Implement starter modules for our application pipeline
which can later be bundled into the system.
• Gather knowledge about the field of Deep Learning
and Augmented Reality in general, and their
applications in the fashion industry, along with future
prospects.
Over the course of this project we surveyed several
papers and other literature pertaining to our use case,
learned about new technologies such as GANs and
auto-encoders, trained Image Segmentation models for
fashion-specific uses, and developed a method for cloth
size estimation using Human Pose Estimation techniques.
Overall it was a highly enriching learning experience for
all of us.
* jain.22@iitj.ac.in, patel.6@iitj.ac.in, sharma.30@iitj.ac.in
Figure 1: How Coded Light Technique Works ([1])
II LITERATURE SURVEY
While working on this project we have explored a number
of interesting papers which we have discussed in this
section.
1 Depth Based Camera for Measurements
A Depth Camera, as the name suggests, is a camera that can
not only capture images of an object but also detect the
depth/distance of the object from the camera, usually
using one of the following approaches [1]:
• Structured Light and Coded Light: In this class of
approaches a pattern of light is emitted by the sensor,
and distance is calculated from the change in that
pattern. See Figure 1.
• Stereo Depth: Two sensors a known distance apart
capture the scene (often using reflected infrared light
along with visible light) and produce depth perception
just like our eyes. The basic idea is that, since we
know the distance between the sensors, we can use the
two captured images to predict the depth of the object.
See Figure 2.
• Time of Flight and LiDAR: Sensors in this class rely
on the speed of light (or of the emitted waves), which
is already known. The working is quite intuitive: depth
is based on the time it takes a beam of light to come
back to the camera.
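The stereo and time-of-flight ideas above reduce to two short formulas. The following is a minimal sketch, assuming an idealised, rectified pinhole stereo pair; the function names and parameters (`focal_length_px`, `baseline_m`, `disparity_px`, `round_trip_time_s`) are illustrative and are not taken from [1].

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def stereo_depth_m(focal_length_px, baseline_m, disparity_px):
    """Depth from a rectified stereo pair: Z = f * B / d.

    focal_length_px: focal length expressed in pixels (assumed known)
    baseline_m:      distance between the two sensors, in metres
    disparity_px:    horizontal shift of the same point between both images
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_length_px * baseline_m / disparity_px


def time_of_flight_depth_m(round_trip_time_s):
    """Depth from time of flight: the beam travels to the object and back."""
    return SPEED_OF_LIGHT_M_PER_S * round_trip_time_s / 2.0


if __name__ == "__main__":
    print(stereo_depth_m(700.0, 0.05, 35.0))
    print(time_of_flight_depth_m(2.0 / SPEED_OF_LIGHT_M_PER_S))
```

Both relations show why these sensors need calibration: the stereo estimate depends on an accurate baseline and focal length, and the time-of-flight estimate on very precise timing.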
We observe that these sensors are gradually finding their
way into our mobile devices, creating some high-potential
use cases. Examples include the Apple iPhone 12 (LiDAR)
and the Samsung Galaxy S20 Ultra (Time of Flight based),
among many others.
Figure 2: How Stereo Depth Works ([1])
A number of techniques can be found in the literature
that use depth cameras to produce avatars of a human and
to take measurements. Most of these techniques are based
on two main tools: Random Forests[2] and Geodesic
Distances[3].
One such approach that we particularly considered for the
first step of our pipeline is by Tan Xiaohui et al.,
Automatic human body feature extraction and personal size
measurement [4]. They were able not only to generate a 3D
avatar effectively using depth cameras and to parse human
body features using a random forest approach, but also to
use geodesic distances to predict the size and shape of
important features such as shoulder, chest, waist, hip and
legs, with a reported average error of 0.0617 cm (over all
measurements). The computational requirements were also
not very high.
2 Using Non Depth Based Camera
With the evolution of computer vision, several methods
have been developed to get very accurate information
about the shape and size of a variety of human body parts.
However, these approaches need some additional inputs
along with the photo or video of the person, such as
height, weight, gender and age. Moreover, these approaches
also require the person to be in a particular pose and
orientation. We were able to find several applications,
startups and companies that are working on similar
approaches or have successfully implemented them;
however, the majority of these approaches and algorithms
are not publicly available.
After carefully studying them, we designed one of our
approaches and also suggested some improvements that we
felt we could implement. We found that the majority of
these approaches calculate a real-world measurement per
pixel and then try to refine it using learning approaches.
Some ask the user to keep a fixed distance from the camera,
use basic trigonometry to calculate the height, and then
move to a per-pixel scale. Others go further and improve
the estimates using Deep Learning based approaches, and
some use alignment methods to align images taken from
different angles. A few examples are Presize[5],
Nettelo[6] and Sizer[7]. See Figures 3, 4 and 5.
Figure 3: Basic Trigonometry Approach for calculating height,
the basic idea is that if we have at least one parameter then we
can try to predict the other parameters.
3 Parsing clothes and Humans
Parsing and understanding a variety of objects is a
well-studied problem in computer science, and there are
several approaches to it. To parse and segment clothes
we first started with a Variational Autoencoder based
approach [8] and simultaneously tried various operations
on the results, such as changing colors. However, we later
changed our approach to a Mask R-CNN[9] based instance
segmentation approach, as a lot of the literature was
based upon it. We also found that much of the earlier
literature used a UNet[10] based approach, so we
experimented with that as well (both of these approaches
are discussed in detail in later sections of this text).
Many approaches can already be found in the literature
for parsing humans, and especially human pose. We went
through and used one such approach, namely OpenPose [11]
(discussed in detail in later sections). There are,
however, other strong techniques to segment and parse
humans; one such technique is pose2seg[12], which segments
human instances by building on their pose.
4 Virtual Try On (2D)
Virtual try-on is a field that is becoming increasingly
popular in computer science. In a very naive sense, it
means changing the clothing in an image or trying out
variations in clothing, such as colour or dress. In this
work we have worked with two prominent papers, VITON: An
Image-based Virtual Try-on Network [13] and Towards
Photo-Realistic Virtual Try-On by Adaptively
Generating ↔ Preserving Image Content [14] (both are
discussed in detail in later sections).
Apart from this, we have also surveyed and learnt the
basics of several technologies such as AR, GANs,
Variational Autoencoders and R-CNNs; these can be found
in the presentation available at
https://drive.google.com/drive/folders/
15FRuGU0VDZM2ySMLFDOEQlqK1gBPbFZg?usp=sharing
III OUR SOLUTIONS
1 U-Net
The U-Net was developed by Olaf Ronneberger et al. for
biomedical image segmentation using convolutional
networks[15]. It is based on fully convolutional networks.
Predicting on a 512x512 image with a U-Net takes less than
a second on a recent GPU.[16] It has shown very good
performance on very different biomedical segmentation
applications. U-Net is used for semantic segmentation:
for example, given an image of a person wearing some
clothes, the image can be segmented into the individual
clothing items, such as T-Shirt, Jeans, Hat etc.
Figure 4: Presize.ai extra input methods
Figure 5: Presize.ai video capturing in a particular alignment
1.1 Methodology
1.1.1 Structure Similar to a convolutional network, it
initially has downscaling layers, followed by upscaling
layers. Each upscaling step applies a transposed
convolution and concatenates a copy of the last layer from
the corresponding downscaling component, as shown in the
generic structure in Figure 6. As we can see, instead of
downscaling at every step, downscaling happens in discrete
steps, in bursts of 3-layer groups.
1.1.2 Energy The energy function is computed by a
pixel-wise soft-max over the final feature map combined
with the cross entropy loss function.
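As a rough illustration of the energy described above, the sketch below computes a pixel-wise soft-max over a final feature map and the resulting cross-entropy, using NumPy. It is a simplified, unweighted version averaged over pixels (the original U-Net paper additionally uses a per-pixel weight map); the array shapes and function names are assumptions made for this example.

```python
import numpy as np


def pixelwise_softmax(logits):
    """Soft-max over the channel axis of a (H, W, C) feature map."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def cross_entropy_energy(logits, labels):
    """Mean cross-entropy between soft-max outputs and integer labels (H, W)."""
    probs = pixelwise_softmax(logits)
    h, w = labels.shape
    # Pick, for every pixel, the probability assigned to its true class.
    picked = probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    return float(-np.log(picked + 1e-12).mean())


if __name__ == "__main__":
    logits = np.zeros((2, 2, 2))           # two classes, no preference
    labels = np.zeros((2, 2), dtype=int)   # every pixel belongs to class 0
    print(cross_entropy_energy(logits, labels))  # close to ln(2)
```

With uniform logits over two classes each pixel gets probability 0.5 for the true class, so the energy is close to ln 2; training pushes this value towards zero.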
1.2 Implementation
• We implemented a basic U-Net to segment out the
T-shirt/upper-body clothing from a given image of a
person.
• We prepared a dataset from the VITON dataset[17],
consisting of each image and its T-shirt or upper-body
clothing mask (a binary mask), as shown in Figure 8.
• We built a network that takes in the image of the
person (256x256x3) and predicts the segmentation mask
of the T-shirt or upper-body clothing (256x256x1).
• We used a structure similar to the one shown in
Figure 6, except that we were going for (256x256x3)
-> (256x256x1). We also used "same" padding in the
convolutions, so that the feature maps keep the spatial
size of the input image instead of losing the border
pixels.
• We trained the network on the dataset generated
earlier from VITON[17] up to an accuracy of 99.28%.
• We measured the mean IoU (background excluded) on a
test dataset of 1000 images to be 0.9284. The metric is
defined as:
\[
\mathrm{IoU} = \frac{|\mathrm{Pixels}_{Pred} \cap \mathrm{Pixels}_{GT}|}{|\mathrm{Pixels}_{Pred} \cup \mathrm{Pixels}_{GT}|} \tag{1}
\]
where Pixels_Pred and Pixels_GT are the sets of pixels
classified into the given category in the prediction and
in the ground truth respectively, and |·| denotes the
number of pixels in a set.
• We also experimented with using the masked-out T-shirt
to change T-shirt colours, as shown in Figure 9.
Figure 6: A Generic U-Net Structure[15]
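The mean IoU metric defined in the list above can be computed directly from binary masks. This is a minimal NumPy sketch, not the evaluation script used for the reported number; treating two empty masks as a perfect match is an assumption of this example.

```python
import numpy as np


def mask_iou(pred_mask, gt_mask):
    """IoU of the foreground (clothing) class for two binary masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # assumption: both masks empty counts as full agreement
    return float(np.logical_and(pred, gt).sum() / union)


def mean_iou(pred_masks, gt_masks):
    """Average the per-image IoU over a test set."""
    return float(np.mean([mask_iou(p, g) for p, g in zip(pred_masks, gt_masks)]))


if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]
    gt = [[1, 0], [0, 0]]
    print(mask_iou(pred, gt))  # intersection 1 pixel, union 2 pixels
```

Because only foreground pixels enter the intersection and the union, the large background area does not inflate the score, which is why this metric is stricter than plain pixel accuracy.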
(a) Person (b) T-Shirt Mask
Figure 8: A Slice of the DataSet generated by us from the Viton
DataSet[17]
1.3 Results
The results are shown in Figure 9. The mean IoU over the
test dataset was 0.9284.
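One simple way to recolour a garment once its mask is known is sketched below. This is only an illustration under stated assumptions, not necessarily the procedure used to produce Figure 9: it keeps the grey-level shading of each masked pixel and tints it with a target colour. The function name `recolor` and its arguments are hypothetical.

```python
import numpy as np


def recolor(image, mask, target_rgb):
    """Tint the masked region of an RGB uint8 image with target_rgb.

    image:      (H, W, 3) uint8 array
    mask:       (H, W) array, non-zero where the garment is
    target_rgb: desired colour as an (R, G, B) tuple in 0..255
    """
    image = image.astype(np.float32)
    # Per-pixel brightness in [0, 1], used to preserve folds and shadows.
    shading = image.mean(axis=-1, keepdims=True) / 255.0
    tinted = shading * np.asarray(target_rgb, dtype=np.float32)
    out = np.where(np.asarray(mask)[..., None] > 0, tinted, image)
    return np.clip(out, 0, 255).astype(np.uint8)


if __name__ == "__main__":
    img = np.full((2, 2, 3), 255, dtype=np.uint8)  # plain white image
    msk = np.array([[1, 0], [0, 0]])               # one garment pixel
    print(recolor(img, msk, (0, 0, 255))[0, 0])
```

Pixels outside the mask are returned unchanged, so the quality of the recolouring depends directly on the quality of the predicted segmentation mask.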
Figure 7: The U-Net Structure used by us[18]
2 Mask RCNN
2.1 Motivation
The main motivation behind using Mask RCNN[19] is for
the cloth segmentation part, so as to prepare a cloth mask
that can be used as input for the cloth warping stage in the
VITON[13] model.
2.2 Methodology
Mask RCNN is based upon the RCNN (Region-based
Convolutional Neural Network) family of neural networks,
and is an extension of Faster RCNN, which was developed
for object detection tasks in image processing. Mask
RCNN extends this to instance segmentation by also
creating segmentation masks (which specify the category
to which each pixel belongs), and hence is suitable for
our use case. The internal working of the model is
described as follows:
2.2.1 Region Proposal Networks[20] Region proposal
forms the basis of RCNNs[21][22][20][19], wherein
prospective regions that may contain objects are marked
either by a heuristic algorithm (RCNN[22], Fast
RCNN[21]) or by an RPN (Region Proposal Network)
(Faster RCNN[20]). The RPN takes as input a
convolutional feature map of the image, from the last
convolutional layer, and outputs region proposals in the
form of bounding boxes, each with a classification score
for two classes (object and no object). A small network
slides over the input map: it takes an n × n spatial
window of the map and maps it to a lower-dimensional
feature (256-d or 512-d), which is then fed into two
parallel fully connected layers, i.e. a proposal bounding
box regression layer (reg) and a box classification layer
(obj). To make region proposal invariant to the scale and
aspect ratio of objects in the image, we predict k region
proposals for each sliding window simultaneously, each
relative to a reference box (anchor) with its own scale
and aspect ratio. Thus the reg layer generates 4k outputs
representing the bounding box coordinates, while the obj
layer generates 2k outputs representing class scores, for
the k proposals. The RPN assigns a binary label to each
anchor (i.e. whether it contains an object or not): an
anchor is labelled positive if it has the highest IoU with
a ground-truth box, or if it has an IoU > 0.7 with any
ground-truth box. The loss function for the RPN is
$L = L_{obj} + L_{reg}$, where $L_{obj}$ is the log loss
over the two classes, while $L_{reg}$ is the robust
(smooth L1) loss over the region proposal box coordinates.
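The anchor mechanism above can be made concrete with a small sketch. It generates k = 9 anchors (3 scales × 3 aspect ratios, the values used as defaults in the Faster RCNN paper) around one sliding-window position and labels them with the two IoU rules from the text. The helper names are illustrative, and the "ignore" and negative-threshold handling of the full method is left out.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) boxes (x1, y1, x2, y2) centred at (cx, cy)."""
    anchors = []
    for s in scales:
        for r in ratios:                      # r = height / width
            w, h = s / math.sqrt(r), s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


def box_iou(a, b):
    """Intersection over Union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_anchors(anchors, gt_boxes, pos_thresh=0.7):
    """1 = object, 0 = no object, following the two rules in the text."""
    ious = [[box_iou(a, g) for g in gt_boxes] for a in anchors]
    labels = [1 if max(row, default=0.0) > pos_thresh else 0 for row in ious]
    for j in range(len(gt_boxes)):            # best anchor per ground-truth box
        best = max(range(len(anchors)), key=lambda i: ious[i][j])
        labels[best] = 1
    return labels


if __name__ == "__main__":
    anchors = make_anchors(256, 256)
    k = len(anchors)
    print(k, 4 * k, 2 * k)                    # anchors, reg outputs, obj outputs
    print(label_anchors(anchors, [anchors[4]]))
```

For k = 9 the reg layer therefore emits 36 values and the obj layer 18 values per sliding-window position.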
2.2.2 RoI Align[19] The region proposals from the
RPN are mapped onto the feature map and used to pool
features. In the earlier RCNN variants this is done by an
RoI pooling layer, which is basically a max-pooling layer
that divides each proposal into a fixed grid of
sub-windows and performs pooling on them, in order to give
an output of dimensions (N,7,7,512), where N is the number
of regions proposed. The main problem with RoI pooling is
that, while mapping each RoI to the extracted features, it
quantizes the region coordinates, which results in a
misalignment between the input and the extracted features.
RoI Align avoids this by keeping the actual floating point
values of the region coordinates, and uses bilinear
interpolation to compute the values that enter the pooling
step. This preserves the mapping from input to extracted
features, and hence is suitable for pixel-level
segmentation tasks.
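The difference between the two sampling strategies can be seen on a tiny feature map. The sketch below is illustrative only: `quantized_sample` mimics the coordinate snapping of RoI pooling (flooring is assumed here), while `bilinear_sample` shows the interpolation that RoI Align relies on.

```python
import math


def bilinear_sample(feature_map, y, x):
    """Value of a 2D feature map at a floating point location (y, x)."""
    h, w = len(feature_map), len(feature_map[0])
    y = min(max(y, 0.0), h - 1.0)             # clamp to the map
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature_map[y0][x0] * (1 - dx) + feature_map[y0][x1] * dx
    bottom = feature_map[y1][x0] * (1 - dx) + feature_map[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def quantized_sample(feature_map, y, x):
    """RoI-pooling style lookup: the coordinate is snapped to the grid first."""
    return feature_map[int(math.floor(y))][int(math.floor(x))]


if __name__ == "__main__":
    fmap = [[0.0, 1.0],
            [2.0, 3.0]]
    print(bilinear_sample(fmap, 0.5, 0.5))   # blends all four neighbours
    print(quantized_sample(fmap, 0.5, 0.5))  # falls back to the top-left cell
```

At the point (0.5, 0.5) the interpolated value is the average of the four cells, whereas the quantized lookup ignores the sub-pixel offset entirely; this lost offset is the misalignment described above.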
2.2.3 Basic Structure The final structure of the
network is as follows: the input image is passed through
several convolutional layers of some convolutional
network, which outputs a feature map that is passed
through an RPN, and the regions proposed are mapped
onto the feature map and pooled using the RoI align layer.
The output from the RoI layer is passed through two fully
connected layers that are connected to two parallel object
classification and bounding box regression branches. The
output from the RoI layer is also fed to a mask prediction
branch simultaneously via an FCN (Fully Convolutional
Network). The loss function for the network is defined as
$L = L_{cls} + L_{box} + L_{mask}$, where $L_{cls}$ is a
log loss for each class, $L_{box}$ is a smooth L1
regression loss for each class, while $L_{mask}$ is an
averaged binary cross entropy loss over each pixel, for
each class.
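Two of the loss terms named above are easy to write down explicitly. The following is a minimal pure-Python sketch on flat lists, assuming the common smooth L1 form with a transition at |x| = 1; it is not the matterport implementation, and the function names are chosen for this example.

```python
import math


def smooth_l1(pred, target):
    """Smooth L1 loss summed over box coordinates (the L_box term)."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        # Quadratic near zero, linear for large errors (robust to outliers).
        total += 0.5 * diff * diff if diff < 1.0 else diff - 0.5
    return total


def mean_binary_cross_entropy(pred_probs, target_mask, eps=1e-7):
    """Binary cross entropy averaged over mask pixels (the L_mask term)."""
    losses = []
    for p, t in zip(pred_probs, target_mask):
        p = min(max(p, eps), 1.0 - eps)       # avoid log(0)
        losses.append(-(t * math.log(p) + (1 - t) * math.log(1.0 - p)))
    return sum(losses) / len(losses)


if __name__ == "__main__":
    print(smooth_l1([0.5, 3.0], [0.0, 0.0]))
    print(mean_binary_cross_entropy([0.5, 0.5], [1, 0]))
```

The linear branch of the smooth L1 loss keeps a single badly placed box from dominating the gradient, while the per-pixel averaging makes the mask term independent of the mask resolution.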
2.3 Implementation
• We used the matterport implementation[24] of Mask
RCNN[19] (implemented in TensorFlow 1.15), with
weights pre-trained on the MS-COCO[25] dataset.
• We trained our network on the iMaterialist 2019[26]
dataset, which has 46 categories of clothing and 45k
images, of which 36k were training images and the
remaining 9k were test images.
• The metric used for measuring performance is the mean
precision (MP) and mean recall (MR) at an IoU threshold
(using bounding boxes). IoU (Intersection over Union)
for the predicted and ground-truth bounding boxes of an
object of class C is defined as:
\[
\mathrm{IoU}_C = \frac{|\mathrm{Box}_{Pred} \cap \mathrm{Box}_{GT}|}{|\mathrm{Box}_{Pred} \cup \mathrm{Box}_{GT}|} \tag{2}
\]
The IoU threshold is a parameter θ: if IoU_C ≥ θ, then
the given image is taken to have an object of class C
predicted for it. Taking the precision (and likewise the
recall) of all classes over all images, and then taking
their mean, forms the basis of our metric. We calculate
MP and MR for θ = 0.5 and θ = 0.75, and also calculate
the average of the two values (listed as MP and MR). We
also calculate the combined F1 score from the averaged
MP and MR. They are listed below:

Model        MP (θ = 0.5)   MP (θ = 0.75)   MP (average)
Mask RCNN    0.316          0.172           0.244

Model        MR (θ = 0.5)   MR (θ = 0.75)   MR (average)
Mask RCNN    0.347          0.186           0.267

Model        F1
Mask RCNN    0.255

(a) Converted Pink T-Shirt into White T-Shirt
(b) Converted Blue T-Shirt into white T-Shirt
(c) Converted Pink T-Shirt into Gray T-Shirt
Figure 9: U-Net Results
Figure 10: Mask RCNN Structure[23]
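The sketch below shows how precision and recall at a threshold θ, and the combined F1 score, can be derived. It is a simplified illustration: it assumes each prediction has already been paired with a ground-truth box of the same class and carries that pair's IoU, which may differ from the exact matching used in our evaluation. The last lines recompute the F1 entry from the MP and MR averages in the tables above.

```python
def precision_recall(prediction_ious, num_ground_truth, theta):
    """Precision and recall when a prediction counts as correct if IoU >= theta."""
    true_pos = sum(1 for iou in prediction_ious if iou >= theta)
    precision = true_pos / len(prediction_ious) if prediction_ious else 0.0
    recall = true_pos / num_ground_truth if num_ground_truth else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2.0 * precision * recall / total if total > 0 else 0.0


if __name__ == "__main__":
    ious = [0.8, 0.6, 0.3]                    # toy values, not project data
    print(precision_recall(ious, num_ground_truth=4, theta=0.5))
    print(precision_recall(ious, num_ground_truth=4, theta=0.75))
    # Consistency check against the reported tables (MP = 0.244, MR = 0.267).
    print(round(f1_score(0.244, 0.267), 3))   # 0.255
```

Raising θ from 0.5 to 0.75 can only remove true positives, which is why both MP and MR drop at the stricter threshold in the tables.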
2.4 Results
Some sample results are shown in Figures 11 and 12. For the
precision and recall at each IoU threshold, please refer to
the tables above.
Figure 11: Mask RCNN Result
Figure 12: Mask RCNN Result
3 Cloth Size Estimator
3.1 Motivation
An important part of shopping for clothing is finding
the right size of apparel to wear. But size measurement is
not an easy task, and it is hard for a person to do on
their own. This led us to create a simple heuristic cloth
size estimator, which would serve as the first module of
our pipeline.
3.2 Methodology
3.2.1 OpenPose[27] OpenPose is a popular open-source
human 2D pose and keypoint estimation model. It performs
pose estimation by taking the input image and preparing
part confidence maps (which show the likelihood of a body
part being present at a given point) and part affinity
fields (which show the orientation and association of
different body parts). The body part candidates are then
associated using a set of bipartite matchings, after which
they are assembled into full body poses for all the people
in the image.
3.2.2 Basic Structure OpenPose is used to generate
keypoints for a given input image, which we use to predict
the person's size. We make the assumption that the person
is standing parallel to a vertical plane, and that the
camera is held perpendicular to that plane. This ensures
that the distances between the keypoints correspond to
actual measurements, and not mere projections. The keypoint
diagram that we referred to is shown in Figure 13. We then
define the metric cm/pixels as:
\[
\text{cm/pixels} = \frac{\text{Height} - 10}{\text{Distance}(P_{14}, P_{16})} \tag{3}
\]
where Height is taken as an input from the user. Once
the metric is calculated, for any Euclidean distance d
between two points on the image, we can find the
corresponding approximate length len in the real world
using the relation:
\[
\text{len} = \lambda \cdot (\text{cm/pixels}) \cdot d \tag{4}
\]
where λ is a constant introduced to approximately account
for the fat on the person's body.
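Equations (3) and (4) translate into a few lines of code. The sketch below is a minimal illustration rather than our actual module: keypoints are assumed to be (x, y) pixel tuples as produced by a pose estimator, and the function and parameter names are chosen for this example. The 10 cm offset is the constant from Equation (3).

```python
import math


def cm_per_pixel(height_cm, p14, p16, offset_cm=10.0):
    """Equation (3): scale factor from the user's height and keypoints P14, P16."""
    return (height_cm - offset_cm) / math.dist(p14, p16)


def real_length_cm(point_a, point_b, scale_cm_per_px, lam=1.0):
    """Equation (4): len = lambda * (cm/pixels) * d for two image points."""
    return lam * scale_cm_per_px * math.dist(point_a, point_b)


if __name__ == "__main__":
    # Toy coordinates, not taken from any real image.
    scale = cm_per_pixel(170.0, (100, 50), (100, 850))
    print(scale)                                        # cm represented by one pixel
    print(real_length_cm((0, 0), (30, 40), scale))      # lambda = 1
    print(real_length_cm((0, 0), (30, 40), scale, 1.1)) # with a body-shape correction
```

Since λ multiplies the result linearly, a 10% change in λ changes every predicted length by 10%, which is why choosing it well matters for the final size.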
3.3 Implementation
1. We implemented a version of the model in Python and
tested it on our own images, since no appropriate
dataset was available to us.
2. We first ran OpenPose on the images to generate the
pose and get the keypoint data, as shown in Figure 14.
3. Then we calculated cm per pixel as described by
Equation 3 in the section above.
4. We took advantage of the fact that, to find a person's
clothing size, we only need to classify them into a
size such as XS, S, M, L or XL; exact measurements are
not required. Hence a small inaccuracy in the
measurements is acceptable to some extent.
Figure 13: Keypoints generated from OpenPose
3.4 Results
We were not able to find a good dataset with all the
necessary parameters to report the accuracy of our model.
However, we tested it on our own images and the model
performed reasonably well. A few examples are as follows:
Person    Parameter        Real      Predicted
Nivedit   Trouser Length   103 cm    102.02 cm
Nivedit   Waist            42 inch   41.63 inch
Nivedit   Shoulder         40 inch   34.44 inch
Pratik    Trouser Length   89 cm     86.11 cm
Pratik    Waist            36 inch   37.8 inch
Pratik    Shoulder         37 inch   36.3 inch
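To make the table easier to read, the relative error of each prediction can be computed as below. The values are copied from the table above; the script itself is only an illustrative helper and is not part of the original experiments.

```python
# (person, parameter, real value, predicted value); units cancel in the ratio.
MEASUREMENTS = [
    ("Nivedit", "Trouser Length", 103.0, 102.02),
    ("Nivedit", "Waist", 42.0, 41.63),
    ("Nivedit", "Shoulder", 40.0, 34.44),
    ("Pratik", "Trouser Length", 89.0, 86.11),
    ("Pratik", "Waist", 36.0, 37.8),
    ("Pratik", "Shoulder", 37.0, 36.3),
]


def relative_error_percent(real, predicted):
    """Absolute error as a percentage of the real measurement."""
    return abs(predicted - real) / real * 100.0


if __name__ == "__main__":
    for person, parameter, real, predicted in MEASUREMENTS:
        err = relative_error_percent(real, predicted)
        print(f"{person:8s} {parameter:15s} {err:5.1f} %")
```

Most predictions fall within a few percent of the real value; the one clear outlier is the shoulder measurement for Nivedit, at roughly 14%.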
We further propose an approach, in a later section, to
improve these results by considering parameters such as
fat and body shape in much more detail, and not just pose
skeletons.
However, we were not able to train and verify the
proposed approach due to the lack of a good dataset for
the purpose. We further propose to build one such dataset.
Figure 14: Keypoints generated from OpenPose for a person
4 VITON
VITON is one of the most popular UNet-like
encoder-decoder networks (see Figure 18) for Virtual
Try-On. It is quite a large architecture, so we describe
it only briefly here. (Note: All images for this section
are taken directly from the VITON paper [13].)
4.1 Methodology
1. First, given the image of a person, we use OpenPose
[11] to get the pose map of the person; then, using the
human parser[28], we get the body shape, and we also
extract the face and hair from the image. All this
information is stacked as 22 image channels to form a
matrix of size (256 * 192 * 22).
2. The input thus generated is passed, together with the
target cloth, into the main VITON network, as shown in
Figure 18.
3. The image of the person thus generated is fed, along
with the target cloth, into a mask improvement network,
as shown in Figure 19.
4. Using the final generated mask, a thin-plate spline
transformation (see Figure 17) is applied to the target
cloth so that it matches the newly generated mask.
5. Finally, the transformed target cloth is superimposed
on the person; the network is trained with a perceptual
loss.
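The stacking in step 1 can be sketched with NumPy as follows. The split into 18 pose channels, 1 body-shape channel and 3 face/hair (RGB) channels follows our reading of the VITON paper [13]; the text above only states the total of 22 channels, so treat the split as an assumption. The function name is hypothetical.

```python
import numpy as np

HEIGHT, WIDTH = 256, 192


def build_person_representation(pose_maps, body_shape, face_hair):
    """Stack the per-person inputs along the channel axis.

    pose_maps:  (256, 192, 18) one heat map per pose keypoint (assumed split)
    body_shape: (256, 192, 1)  binary body mask from the human parser
    face_hair:  (256, 192, 3)  RGB pixels of the face and hair regions
    """
    for name, arr in (("pose_maps", pose_maps),
                      ("body_shape", body_shape),
                      ("face_hair", face_hair)):
        if arr.shape[:2] != (HEIGHT, WIDTH):
            raise ValueError(f"{name} must be {HEIGHT}x{WIDTH}, got {arr.shape}")
    return np.concatenate([pose_maps, body_shape, face_hair], axis=-1)


if __name__ == "__main__":
    rep = build_person_representation(
        np.zeros((HEIGHT, WIDTH, 18), dtype=np.float32),
        np.zeros((HEIGHT, WIDTH, 1), dtype=np.float32),
        np.zeros((HEIGHT, WIDTH, 3), dtype=np.float32),
    )
    print(rep.shape)
```

Keeping all cues in one tensor lets the encoder-decoder see pose, silhouette and identity information at every spatial location, without ever seeing the clothes the person originally wore.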
4.2 Results
We were not able to run the code ourselves, since the
model was very computationally expensive and its proper
dependencies were not available on the DGX2 server. Hence
the qualitative results are taken from the repository of
the original authors [17]. See Figure 15.
Figure 15: Results for VITON Model
Figure 16: Generating input for VITON
5 ACGPN
ACGPN is a recent Virtual Try-On network that aims to
improve on the fidelity of the previously existing Virtual
Try-On networks. It is completely GAN based and has 3
modules, namely the Semantic Generation Module, the
Clothes Warping Module and the Content Fusion Module (see
Figure 20). Each unit is extremely computationally
intensive and would need a very detailed description,
which is out of the scope of this report. We have
therefore provided a brief introduction to the various
units. For details please refer to [14].
5.1 Semantic Generation Module
This is the first module and, as the name suggests, it is
used to understand the semantics of the image of the
person as well as of the target cloth. All units are based
on conditional GANs[29], with a UNet structure as the
generator and the pix2pixHD[30] network as the
discriminator. The first GAN (G1) is used to identify the
areas of the image where the clothing from the target
cloth needs to be placed. The second GAN unit (G2) is then
used to get the orientation and positioning of the target
cloth.
5.2 Clothes Warping Module
This module is responsible for warping the target cloth
onto the intermediate image of the person. This was the
most challenging part for us to understand. Internally it
uses a second-order difference constraint to warp the
cloth with high fidelity. This constraint is used together
with a Spatial Transformer Network[31] and a Thin-Plate
Spline.
Before being sent to the next module, all the data is
passed through Non-target Body Part Composition, which is
simply done using appropriate (element-wise) products of
the various masks, to make sure that all the necessary
parts that need to be present pass on to the next module.
Figure 17: Transformation Module for better cloth fit
Figure 18: Viton Architecture[13]
5.3 Content Fusion Module
Finally, the data from all channels is fed into a content
fusion module, which uses the third GAN (G3) and all of
the information generated so far to produce a final
high-resolution image of the target. It basically acts
like an inpainting unit which fills in all the missing
parts of the image.
The code for ACGPN can be found at
https://github.com/switchablenorms/
DeepFashion_Try_On.
IV PROPOSED METHODS
In this section we propose some methods that we were not
able to experiment with and obtain results for, but which
we believe could be highly useful.
1 Proposed Size Estimator
• In our previous size estimator we used only the basic
pose skeleton to predict the size of the body. However,
this ignores the body shape in many cases.
• To overcome this, we propose to first segment the
person from the background using a human segmentation
module like pose2seg (pose2seg would also make the
process faster, as it calculates the pose as an
intermediate step).
• In step 2, use OpenPose on that image. The output
should look like Figure 21.
• We then calculate the size as in the first method,
after which we can extrapolate the measured line.
• We can now use the above two parameters, together with
a CNN over the nearby areas, to compute another scaling
factor, which could be produced for each image.
• Finally, our output would be a function f such that
$\text{size}_{part} = f(\text{method1}, \text{extrapolated}, \lambda) + c$
Figure 19: Viton Mask Improvement Architecture[13]
Figure 20: ACGPN Architecture[14]
Figure 21: Keypoints generated from OpenPose for a person
after background removal
2 CGAN Based TShirt Color Changing Approach
CGANs (Controllable Generative Adversarial
Networks)[32] have been used extensively in experiments
on changing a person's hair colour to simulate hair
dyes [33]. The approach we planned was quite similar: we
would aim to find the independent colour variables in the
input vector. The idea is that, as in the hair colour
module, if we could find an independent set of variables
linked to T-shirt colours, then we would be able to change
the colours just by tweaking that set of variables.
We tried to train it, but learning was quite unstable and
the model failed to converge. We feel that the learning
and convergence could be improved given sufficient
computational resources.
V CONCLUSION AND LEARNING OUTCOMES
Our conclusion for this project consists of the following
findings:
1. Explored the scope of AR in the field of fashion.
2. Surveyed several papers about technologies related to
our use case.
3. Studied and implemented two working cloth parsing
models using U-Net and Mask RCNN architectures,
and trained them to work on semantic segmentation
for clothing in images.
4. Developed and implemented size estimation model
using OpenPose.
5. Studied the workings of two existing virtual try-on
models, VITON and ACGPN.
6. The U-Net architecture reports very high semantic
segmentation accuracy for a single class. We tried to
create a multi-class architecture also, but it gave poor
results. It might be possible that training different
U-Nets for different classes and then combining the
predictions can provide for a good albeit slow cloth
parser.
7. The Mask RCNN approach showed results lower than
those of state of the art models [26] (AP@.5 = 60.26
and AP@0.7 = .4765), although they also used only Mask
RCNN. Better training and hyperparameter tuning may
help close this gap.
8. The cloth size estimation approach worked well for the
most part, although we had to factor in the λ factor on
our own, but this can be rectified through our proposed
methods.
9. The virtual try-on models we explored, i.e. VITON
and ACGPN, worked well enough on their test images,
but they were mathematically complex to understand and
computationally expensive to implement on a commercial
scale.
The learning outcomes from this project are as follows:
1. We learnt about the various technologies that can be
used in the fashion industry and surveyed several
papers regarding them.
2. We learnt about semantic segmentation techniques and
applied them for cloth parsing purposes.
3. We developed our own technique for cloth size
estimation using human pose estimation.
4. We read about existing virtual try-on techniques
such as VITON, CP-VTON and ACGPN, understood their
architectures well, and tried to delve into the
mathematical details to the best of our abilities.
5. The search for cloth wrapping techniques also
introduced us to GANs and autoencoders, which we
read about in detail.
6. We explored methods for the λ parameter in the cloth
size estimation module to be learned as a function of
the person's image itself (using image processing), or
as a function of the person's BMI, so as to account for
persons of different sizes.
Detailed presentation on learning outcomes could be found
at https://drive.google.com/drive/folders/
15FRuGU0VDZM2ySMLFDOEQlqK1gBPbFZg?usp=sharing
VI ISSUES FACED
1. The mathematical details regarding the workings of
the virtual tryon models proved to be out of scope for
us, and hence we weren’t able to proceed to create a
version of our own.
2. The multi-class implementation of U-Net gave poor
performance on the test images, and the faults weren’t
clear, and hence we proceeded with a single class
implementation only.
3. A good dataset was not available to run experiments
on, or to train and validate our proposed body shape
model.
VII FUTURE WORK
1. To gain an intuitive understanding of the mathematics
behind VITON and ACGPN.
2. To try to implement U-Net for multi-class semantic
segmentation.
3. To improve the results for Mask RCNN by better
training and hyper parameter adjustments.
4. Implement our proposals for cloth size estimators.
5. To understand and design the target cloth warping.
6. To analyse the logistics of the project from a
commercial perspective and to make it commercially
viable.
7. To develop a simple prototype mobile application.
8. To design and optimise a model such that it can work
on a mobile device with a primitive GPU.
9. To extend/design a Model for 3D imaging and use
AR/VR to project the outputs in a more elegant way
to produce a better visualisation.
VIII ACKNOWLEDGMENTS
We are highly thankful to Dr. Anand Mishra
(https://anandmishra22.github.io/) for allowing
us to explore this wonderful field and work completely as
per our ideas, for valuable discussions, and for
supporting us at each step.
REFERENCES
[1] Intel. https://www.intelrealsense.com/beginners-
guide-to-depth/.
(2021).
[2] T. K. Ho. Random Decision Forests. (2021).
[3] https://www.sciencedirect.com/topics/computer-
science/geodesic-distance.
(2021).
[4] T. Xiaohui, P. Xiaoyu, L. Liwen & X. Qing.
Automatic human body feature extraction and
personal size measurement. Journal of Visual
Languages Computing 47, 9–18 (2018). ISSN:
1045-926X. https://www.sciencedirect.com/
science/article/pii/S1045926X17302835.
[5] Presize.ai. https://www.presize.ai/. (2021).
[6] Nettelo. http://nettelo.com/. (2021).
[7] Size. https://sizer.me/. (2021).
[8] D. P. Kingma & M. Welling. An Introduction to
Variational Autoencoders. Foundations and
Trends® in Machine Learning 12, 307–392 (2019).
ISSN: 1935-8245.
http://dx.doi.org/10.1561/2200000056.
[9] K. He, G. Gkioxari, P. Dollár & R. Girshick. Mask
R-CNN 2018. arXiv: 1703.06870 [cs.CV].
[10] O. Ronneberger, P. Fischer & T. Brox. U-Net:
Convolutional Networks for Biomedical Image
Segmentation 2015. arXiv: 1505.04597 [cs.CV].
[11] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei &
Y. Sheikh. OpenPose: Realtime Multi-Person 2D
Pose Estimation using Part Affinity Fields 2019.
arXiv: 1812.08008 [cs.CV].
[12] S.-H. Zhang et al. Pose2Seg: Detection Free Human
Instance Segmentation 2019. arXiv: 1803 . 10683
[cs.CV].
[13] X. Han, Z. Wu, Z. Wu, R. Yu & L. S. Davis. VITON:
An Image-based Virtual Try-on Network 2018. arXiv:
1711.08447 [cs.CV].
[14] H. Yang et al. Towards Photo-Realistic Virtual
Try-On by Adaptively Generating↔Preserving
Image Content 2020. arXiv: 2003.05863 [cs.CV].
[15] P. F. Olaf Ronneberger & T. Brox. U-Net:
Convolutional Networks for Biomedical Image
Segmentation 2015. arXiv: 1505.04597 [cs.CV].
[16] https: // en. wikipedia. org/ wiki/ U-Net
[17] https: // github. com/ xthan/ VITON
[18] https : / / github . com / HarisIqbal88 /
PlotNeuralNet / blob / master / examples /
Unet/ Unet. pdf
[19] K. He, G. Gkioxari, P. Dollár & R. B. Girshick.
Mask R-CNN. CoRR abs/1703.06870 (2017).
arXiv: 1703 . 06870.
http://arxiv.org/abs/1703.06870.
[20] S. Ren, K. He, R. B. Girshick & J. Sun. Faster
R-CNN: Towards Real-Time Object Detection with
Region Proposal Networks. CoRR abs/1506.01497
(2015). arXiv: 1506 . 01497.
http://arxiv.org/abs/1506.01497.
[21] R. B. Girshick. Fast R-CNN. CoRR abs/1504.08083
(2015). arXiv: 1504.08083. http://arxiv.org/
abs/1504.08083.
[22] R. B. Girshick, J. Donahue, T. Darrell & J. Malik.
Rich feature hierarchies for accurate object
detection and semantic segmentation. CoRR
abs/1311.2524 (2013). arXiv: 1311.2524.
http://arxiv.org/abs/1311.2524.
[23] https://ars.els-cdn.com/content/image/1-s2.0-S0168169919301103-gr4.jpg
[24] https://github.com/matterport/Mask_RCNN
[25] T.-Y. Lin et al. Microsoft COCO: Common Objects
in Context. CoRR abs/1405.0312 (2014). arXiv:
1405.0312.
http://arxiv.org/abs/1405.0312.
[26] S. Guo et al. The iMaterialist Fashion Attribute
Dataset. CoRR abs/1906.05750 (2019). arXiv:
1906.05750.
http://arxiv.org/abs/1906.05750.
[27] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei &
Y. Sheikh. OpenPose: Realtime Multi-Person 2D
Pose Estimation using Part Affinity Fields. CoRR
abs/1812.08008 (2018). arXiv: 1812.08008.
http://arxiv.org/abs/1812.08008.
[28] K. Gong, X. Liang, D. Zhang, X. Shen & L. Lin.
Look into Person: Self-supervised Structure-sensitive
Learning and A New Benchmark for Human Parsing
2017. arXiv: 1703.05446 [cs.CV].
[29] M. Mirza & S. Osindero. Conditional Generative
Adversarial Nets 2014. arXiv: 1411.1784
[cs.LG].
[30] T.-C. Wang et al. High-Resolution Image Synthesis
and Semantic Manipulation with Conditional GANs.
CoRR abs/1711.11585 (2017). arXiv: 1711.11585.
http://arxiv.org/abs/1711.11585.
[31] M. Jaderberg, K. Simonyan, A. Zisserman &
K. Kavukcuoglu. Spatial Transformer Networks
2016. arXiv: 1506.02025 [cs.CV].
[32] M. Lee & J. Seok. Controllable Generative
Adversarial Network 2019. arXiv: 1708.00598
[cs.LG].
[33] https://towardsdatascience.com/dye-your-hair-or-look-older-using-ai-930bc6928422


Through the course of this project we surveyed papers and other literature relevant to our use case, learned about technologies such as GANs and autoencoders, trained image segmentation models for fashion-specific uses, and developed a method for cloth size estimation using human pose estimation techniques. Overall, it was a highly enriching learning experience for all of us.

* jain.22@iitj.ac.in, patel.6@iitj.ac.in, sharma.30@iitj.ac.in

Figure 1: How Coded Light Technique Works ([1])

II LITERATURE SURVEY
While working on this project we explored a number of interesting papers, which we discuss in this section.

1 Depth Based Camera for Measurements
A depth camera, as the name suggests, captures not only an image of an object but also the object's depth, that is, its distance from the camera. It usually relies on one of the following approaches [1]:
• Structured Light and Coded Light: the sensor emits a pattern of light, and distance is calculated from the change in that pattern. See Figure 1.
• Stereo Depth: these cameras usually use reflected infrared light together with visible light and perceive depth much as our eyes do. The distance between the two sensors is known, and the two captured images are used together to predict the depth of the object. See Figure 2.
• Time of Flight and LiDAR: these sensors rely on the known speed of light (or of the emitted wave). The working is quite intuitive: depth follows from the time a beam of light takes to come back to the camera.

Figure 2: How Stereo Depth Works ([1])

We observe that these sensors are gradually making their way into mobile devices, creating some high-potential use cases. Examples include the Apple iPhone 12 Pro (LiDAR) and the Samsung Galaxy S20 Ultra (Time of Flight), among many others.

A number of techniques in the literature use these devices to produce avatars of a human and to obtain body measurements. Most of them build on two main techniques, Random Forests [2] and geodesic distances [3]. One approach we particularly considered for the first step of our pipeline is Automatic human body feature extraction and personal size measurement by Tan Xiaohui et al. [4]. They not only generate a 3D avatar effectively from depth cameras, parsing human body features with a random forest approach, but also use geodesic distances to predict the size and shape of important features such as the shoulders, chest, waist, hips and legs, with an extremely low average error of 0.0617 cm across all measurements. The computational requirements are also not very high.

2 Using Non Depth Based Camera
With the evolution of computer vision, several methods have been developed to obtain very accurate information about the shape and size of a variety of human body parts. However, these approaches need additional inputs alongside the photo or video of the person, such as height, weight, gender and age. They also require the person to be in a particular pose and orientation.

We found several applications, startups and companies that are working on similar approaches or have successfully implemented them; however, the majority of these approaches and algorithms are not publicly available. After carefully studying them, we designed one of our approaches and suggested some improvements that we felt we could implement.

We found that the majority of these approaches calculate a measurement per pixel and then try to refine it using learning-based approaches. Some also ask the user to keep a fixed distance from the camera, use basic trigonometry to calculate the height, and then move to a per-pixel scale.
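The per-pixel idea described above can be illustrated with a short sketch. It is not the method of any of the products mentioned; it only shows the arithmetic under simplifying assumptions: the person stands upright in a plane parallel to the image plane, lens distortion is ignored, and the camera's distance and viewing angles are known. All function names and the example pixel coordinates are hypothetical.

```python
import math


def height_from_angles(distance_cm, angle_top_deg, angle_bottom_deg):
    """Estimate a person's height from the camera distance and two angles.

    angle_top_deg is the angle above the horizontal to the top of the head,
    angle_bottom_deg the angle below the horizontal to the feet (assumed inputs).
    """
    top = math.tan(math.radians(angle_top_deg))
    bottom = math.tan(math.radians(angle_bottom_deg))
    return distance_cm * (top + bottom)


def cm_per_pixel(known_height_cm, head_y_px, feet_y_px):
    """Scale factor obtained from one known measurement (here, the height)."""
    pixel_height = abs(feet_y_px - head_y_px)
    if pixel_height == 0:
        raise ValueError("head and feet must be at different image rows")
    return known_height_cm / pixel_height


def measure_cm(point_a, point_b, scale):
    """Convert the pixel distance between two (x, y) keypoints to centimetres."""
    dx = point_b[0] - point_a[0]
    dy = point_b[1] - point_a[1]
    return math.hypot(dx, dy) * scale


# Hypothetical example: a 170 cm person spans rows 100..950 in the image,
# and the two shoulder keypoints are 200 pixels apart.
scale = cm_per_pixel(170.0, head_y_px=100, feet_y_px=950)
shoulder_width_cm = measure_cm((300, 250), (500, 250), scale)
```

Because a single scale factor is applied to the whole image, any body part that is closer to or farther from the camera than the reference plane will be mis-measured, which is one reason the approaches above refine the raw estimate with learned models.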
Others go further and improve the estimates using deep learning based approaches, and some use various alignment methods to align images taken from different angles. A few examples are Presize [5], Nettelo [6] and Sizer [7]. See Figures 3 and 4.

Figure 3: Basic Trigonometry Approach for calculating height. The basic idea is that if we know at least one parameter, we can try to predict the other parameters.

3 Parsing Clothes and Humans
Parsing and understanding a variety of objects is a well-studied problem in computer science, and there are several approaches to it. To parse and segment clothes, we first started with a Variational Autoencoder based approach [8] and simultaneously tried various operations on the results, such as changing colors. We later switched to a Mask R-CNN [9] based instance segmentation approach, as a lot of the literature is based upon it. We also found that much of the earlier literature used a U-Net [10] based approach, so we experimented with similar approaches as well (both of these approaches are discussed in detail in later sections of this text).

Many approaches to parsing humans, and especially human pose, can already be found in the literature. We studied and used one such approach, OpenPose [11] (discussed in detail in later sections). There are other strong techniques for segmenting and parsing humans; one of them is Pose2Seg [12], which segments humans based on their pose.

4 Virtual Try On (2D)
This field is becoming increasingly popular in computer science. In a very naive sense, it means changing a person's clothing or trying out variations of it, such as colour or dress. In this work we worked with two prominent papers, VITON: An Image-based Virtual Try-on Network [13] and Towards Photo-Realistic Virtual Try-On by Adaptively Generating ↔ Preserving Image Content [14].
(Both of them are discussed in detail in later sections.) Apart from this, we also surveyed and learnt the basics of many technologies such as AR, GANs, Variational Autoencoders and R-CNNs. These can be found in the presentation available at https://drive.google.com/drive/folders/15FRuGU0VDZM2ySMLFDOEQlqK1gBPbFZg?usp=sharing

III OUR SOLUTIONS

1 U-Net

The U-Net was developed by Olaf Ronneberger et al. for biomedical image segmentation using convolutional networks [15]. It is based on fully convolutional networks. Using a U-Net, prediction on a 512x512 image takes less than a second on recent GPUs [16]. It has shown very good performance on very different biomedical segmentation applications. U-Net is used for semantic segmentation: for example, given an image of a person wearing some clothes, the image can be segmented into the individual clothing items, such as T-shirt, jeans, hat, etc.

Figure 4: Presize.ai extra input methods
Figure 5: Presize.ai video capturing in a particular alignment

1.1 Methodology

1.1.1 Structure

Similar to a convolutional network, it initially has downscaling layers, followed by upscaling layers. Each upscaling stage concatenates a copy of the last layer of the corresponding downscaling stage and applies a transposed convolution, as shown in the generic structure in Figure 6. As we can see, instead of downscaling at every step, downscaling is discrete and happens in bursts, once per group of 3 layers.

Figure 6: A Generic U-Net Structure [15]

1.1.2 Energy

The energy function is computed by a pixel-wise soft-max over the final feature map combined with the cross entropy loss function.

1.2 Implementation

• We implemented a basic U-Net to segment out the T-shirt/upper-body clothing from a given image of a person.
• We prepared a dataset from the VITON dataset [17], consisting of each image and the binary mask of its T-shirt or upper-body clothing, as shown in Figure 8.
• We produced a network that takes in the image of the person (256x256x3) and predicts the segmentation mask of the T-shirt or upper-body clothing (256x256x1).
• We used a structure similar to the one shown in Figure 6, except that we map (256x256x3) -> (256x256x1). We also used "same" padding when convolving over the previous group of layers, so that the spatial size of the input image is maintained instead of losing the border pixels.
• We trained the network on the dataset generated earlier from VITON [17] up to an accuracy of 99.28%.
• We measured the mean IoU, with the background excluded, on a test dataset of 1000 images to be 0.9284. The metric is defined as:

$$\mathrm{IoU} = \frac{|\mathrm{Pixels}_{Pred} \cap \mathrm{Pixels}_{GT}|}{|\mathrm{Pixels}_{Pred} \cup \mathrm{Pixels}_{GT}|} \tag{1}$$

where Pixels denotes the set of pixels classified into the given category and |·| is the number of pixels in a set.
• We also experimented with the masked-out T-shirt to change the T-shirt colours, as shown in Figure 9.

Figure 8: A slice of the dataset generated by us from the VITON dataset [17]: (a) Person, (b) T-Shirt Mask

1.3 Results

The results are shown in Figure 9. The IoU over the dataset was found to be 0.9284.
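The IoU metric of Equation (1) can be illustrated with a minimal NumPy sketch. This is not the code used for the reported numbers; the function names `binary_iou` and `mean_iou`, the toy masks, and the convention of returning 1.0 for two empty masks are assumptions made for illustration. Because only the foreground (clothing) pixels enter the ratio, the background is excluded by construction.

```python
import numpy as np


def binary_iou(pred_mask, gt_mask):
    """IoU of the foreground class for two binary masks of equal shape."""
    pred = np.asarray(pred_mask).astype(bool)
    gt = np.asarray(gt_mask).astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    intersection = np.logical_and(pred, gt).sum()
    return float(intersection / union)


def mean_iou(pred_masks, gt_masks):
    """Average the per-image foreground IoU over a set of images."""
    return float(np.mean([binary_iou(p, g) for p, g in zip(pred_masks, gt_masks)]))


# Toy 2x3 masks: 3 predicted pixels, 3 ground-truth pixels, 2 shared.
pred = np.array([[0, 1, 1],
                 [0, 1, 0]])
gt = np.array([[0, 1, 0],
               [0, 1, 1]])
example_iou = binary_iou(pred, gt)  # intersection 2, union 4
```

In practice the predicted mask would first be obtained by thresholding the network's 256x256x1 output, which is a step this sketch leaves out.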
Figure 7: The U-Net Structure used by us [18]

2 Mask RCNN

2.1 Motivation

The main motivation behind using Mask RCNN [19] is the cloth segmentation part: it prepares a cloth mask that can be used as input for the cloth warping stage in the VITON [13] model.

2.2 Methodology

Mask RCNN is based upon the RCNN (Region-based Convolutional Neural Network) family of neural networks, and is an extension of Faster RCNN, which was developed for object detection tasks in image processing. Mask RCNN extends this by also providing instance segmentation through segmentation masks (which specify the category to which each pixel belongs), and hence is suitable for our use case. The internal working of the model is described as follows:

2.2.1 Region Proposal Networks [20]

Region proposal forms the basis of RCNNs [21][22][20][19]: prospective regions that may contain objects are marked either by a heuristic algorithm (RCNN [22], Fast RCNN [21]) or by an RPN (Region Proposal Network) (Faster RCNN [20]). The RPN takes as input a convolutional feature map of the image, from the last convolutional layer, and outputs region proposals in the form of bounding boxes together with a classification score for two classes (object and no object). A small network slides over the input map: it takes an n × n spatial window of the map and maps it to a lower dimensional feature (256-d or 512-d), which is then fed into two parallel fully connected layers, i.e. a proposal bounding box regression layer (reg) and a box classification layer (obj). To make region proposal invariant to the different scales and aspect ratios of objects, we predict k region proposals for each sliding window simultaneously, each with its own scale and aspect ratio. Thus the reg layer generates 4k outputs representing the bounding box coordinates, while the obj layer generates 2k outputs representing class scores, for each of the k proposals.
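The relationship between the k proposals per window and the 4k/2k outputs can be sketched as follows. This is only an illustration, not part of our implementation: the helper name `make_anchors` is hypothetical, and the three scales and three aspect ratios are assumed typical values (giving k = 9) rather than settings taken from our experiments.

```python
import numpy as np


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) boxes (x1, y1, x2, y2) centred on (cx, cy).

    Each box has area scale**2 and aspect ratio height / width = ratio.
    """
    boxes = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(boxes)


anchors = make_anchors(300.0, 200.0)  # one sliding-window position
k = len(anchors)
reg_outputs = 4 * k  # box coordinates predicted by the reg layer
obj_outputs = 2 * k  # object / no-object scores predicted by the obj layer
```

With these assumed settings, every sliding-window position yields 9 candidate boxes, so the reg layer produces 36 values and the obj layer 18 values per position.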
The RPN assigns a binary label to each anchor (i.e. whether it contains an object or not). An anchor is labelled positive if it has the highest IoU with a ground-truth box, or if it has an IoU > 0.7 with any ground-truth box. The loss function for the RPN is L = Lobj + Lreg, where Lobj is the log loss over the two classes, while Lreg is the robust (smooth L1) loss over the region proposal box coordinates.

2.2.2 RoI Align [19]

The region proposals from the RPN are mapped onto the feature map and are used to pool features with an RoI pooling layer. This is essentially a max-pooling layer that divides each proposal into sub-windows of a fixed size and performs pooling on them, giving an output of dimensions (N,7,7,512), where N is the number of initial regions proposed. The main problem with RoI pooling is that, while mapping the RoI to the extracted features, it introduces a level of quantization, which results in a misalignment between the input and the extracted features. RoI Align avoids this by maintaining the actual floating point values of the region coordinates and using bilinear interpolation to compute the values used in the pooling step. This preserves the mapping from the input to the extracted features, and hence is suitable for use in segmentation tasks.

2.2.3 Basic Structure

The final structure of the network is as follows. The input image is passed through several convolutional layers of a backbone convolutional network, which outputs a feature map. The feature map is passed through an RPN, and the proposed regions are mapped onto the feature map and pooled using the RoI Align layer. The output from the RoI layer is passed through two fully connected layers that feed two parallel branches for object classification and bounding box regression. The output from the RoI layer is also fed simultaneously to a mask prediction branch via an FCN (Fully Convolutional Network).
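The bilinear interpolation that RoI Align uses in place of coordinate rounding (Section 2.2.2) can be illustrated with a minimal sketch. It samples a single-channel feature map at one fractional location; the function name `bilinear_sample` and the toy feature map are assumptions for illustration, and a full RoI Align layer would additionally average several such samples per output bin.

```python
import numpy as np


def bilinear_sample(feature_map, y, x):
    """Sample a 2D feature map at a fractional location (y, x)."""
    h, w = feature_map.shape
    # Clamp to the valid range so the four neighbours always exist.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = (1 - dx) * feature_map[y0, x0] + dx * feature_map[y0, x1]
    bottom = (1 - dx) * feature_map[y1, x0] + dx * feature_map[y1, x1]
    return float((1 - dy) * top + dy * bottom)


fmap = np.array([[0.0, 1.0],
                 [2.0, 3.0]])
centre_value = bilinear_sample(fmap, 0.5, 0.5)  # average of the four cells
```

RoI pooling would instead round (0.5, 0.5) to an integer cell and read a single value, which is the quantization that causes the misalignment described above.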
The loss function for the network is defined as L = Lcls + Lbox + Lmask, where Lcls is a log loss for each class, Lbox is a smooth L1 regression loss for each class, and Lmask is a binary cross entropy loss averaged over each pixel, for each class.

2.3 Implementation

• We used the matterport implementation [24] of Mask RCNN [19] (implemented in TensorFlow 1.15), with weights pre-trained on the MS-COCO [25] dataset.
• We trained our network on the iMaterialist 2019 [26] dataset, which has 45k images, of which 36k were train images and the remaining 9k were test images, and 46 categories of clothing.
• The metric used for measuring performance is mean precision and recall over an IoU threshold (using bounding boxes). IoU (Intersection over Union) for the predicted and ground truth bounding boxes of an object of class C is defined as:

$$\mathrm{IoU}_C = \frac{|\mathrm{Box}_{Pred} \cap \mathrm{Box}_{GT}|}{|\mathrm{Box}_{Pred} \cup \mathrm{Box}_{GT}|} \tag{2}$$

where |·| denotes the area of a region. The IoU threshold is a parameter θ: if IoU_C >= θ, the given image is considered to have an object of class C predicted for it. Taking the precision of all classes over all images and then taking the mean forms the basis of our metric (and similarly for recall). We calculate the mean precision (MP) and mean recall (MR) for θ = 0.5 and θ = 0.75, and also calculate the average of the two values (listed as MP and MR). We also calculate the combined F1 score from MP and MR. They are listed below:

Model       MP for θ = 0.5   MP for θ = 0.75   MP
Mask RCNN   0.316            0.172             0.244

Model       MR for θ = 0.5   MR for θ = 0.75   MR
Mask RCNN   0.347            0.186             0.267

Model       F1
Mask RCNN   0.255

Figure 9: U-Net Results: (a) Converted Pink T-Shirt into White T-Shirt, (b) Converted Blue T-Shirt into White T-Shirt, (c) Converted Pink T-Shirt into Gray T-Shirt
Figure 10: Mask RCNN Structure [23]

2.4 Results

Some sample results are shown in Figures 11 and 12. For the IoU-based metrics, please refer to the tables above.

Figure 11: Mask RCNN Result
Figure 12: Mask RCNN Result

3 Cloth Size Estimator

3.1 Motivation

An important part of shopping for clothing is finding the right size of apparel to wear. But size measurement is not an easy task, and certainly cannot easily be done by a person on their own. This led us to create a simple heuristic cloth size estimator, which would serve as the first module of our pipeline.

3.2 Methodology

3.2.1 OpenPose [27]

OpenPose is a popular open source model for human 2D pose and keypoint estimation. It performs pose estimation by taking the input image and preparing part confidence maps (which show the likelihood of a body part being present at a given point) and part affinity fields (which show the orientation and association of different body parts). The body part candidates are then associated using a set of bipartite matchings, after which they are assembled into full body poses for all the people in the image.

3.2.2 Basic Structure

OpenPose is used to generate keypoints for a given input image, which we use to predict the person's size. We assume that the person is standing upright, parallel to the image plane, and that the camera is held perpendicular to that plane. This ensures that the distances between the keypoints correspond to actual measurements and not mere projections. The keypoint diagram that we referred to is shown in Figure 13. We then define the metric cm/pixels as:

$$\text{cm/pixels} = \frac{\text{Height} - 10}{\text{Distance}(P_{14}, P_{16})} \tag{3}$$

where Height is taken as an input from the user. Once the metric is calculated, for any Euclidean distance d between two points on the image, we can find the corresponding approximate length len in the real world using the relation:

$$\text{len} = \lambda \cdot (\text{cm/pixels}) \cdot d \tag{4}$$

where λ is a constant introduced to approximately account for the fat on the person's body.

3.3 Implementation

1. We implemented a version of the model in Python and tested it on our own images, since no appropriate dataset was available to us.
2. We first used OpenPose on the images to generate the pose and obtain the keypoint data, as shown in Figure 14.
3. Then we calculated cm per pixel as described by Equation (3) in the section above.
4. We took advantage of the fact that, to find the size of a person, we only need to classify them as XS, S, M, L or XL; exact measurements are not required. Hence small inaccuracies in the measurements are acceptable to some extent.

Figure 13: Keypoints generated from OpenPose

3.4 Results

We were not able to find a good dataset with all the necessary parameters to report the accuracy of our model. However, we tested it on ourselves and the model performed reasonably well. A few examples from our tests are as follows:

Person    Parameter        Real      Predicted
Nivedit   Trouser Length   103 cm    102.02 cm
Nivedit   Waist            42 inch   41.63 inch
Nivedit   Shoulder         40 inch   34.44 inch
Pratik    Trouser Length   89 cm     86.11 cm
Pratik    Waist            36 inch   37.8 inch
Pratik    Shoulder         37 inch   36.3 inch

In a later section we propose an approach to improve these results by considering parameters such as fat and body shape in much more detail, rather than pose skeletons alone. However, we were not able to train and verify the proposed approach due to the lack of a good dataset for the purpose. We further propose to create such a dataset.

4 VITON

VITON is one of the most popular U-Net-like encoder-decoder networks (see Figure 18) for virtual try-on. It is quite a large architecture, so we describe it only briefly here. (Note: all images for this section are taken directly from the VITON paper [13].)

Figure 14: Keypoints generated from OpenPose for a person

4.1 Methodology

1. First, given the image of a person, we use OpenPose [11] to get the pose map of the person. Then, using the human parser [28], we get the body shape; we also remove the face and hair from the image. All this information is stacked as 22 image channels over one another to form a matrix of size (256 * 192 * 22).
2. The input thus generated is passed, together with the target cloth, into the main VITON network, as shown in Figure 18.
3. The image of the person thus generated is fed, together with the target cloth, into a mask improvement network, as shown in Figure 19.
4. Using the final generated mask and the target image, a thin plate spline transformation (see Figure 17) is applied to the target cloth with reference to the newly generated mask.
5. Finally, the transformed target cloth is superimposed on the person with a perceptual loss.

4.2 Results

We were not able to run the code ourselves, since the model was very computationally expensive and its required dependencies were not available on the DGX2 server. Hence the qualitative results are taken from the repository of the original authors [17]. See Figure 15.

Figure 15: Results for VITON Model
Figure 16: Generating input for VITON

5 ACGPN

ACGPN is a recent virtual try-on network that aims to improve on the fidelity of the previously existing virtual try-on networks. It is completely GAN based and has 3 modules, namely the Semantic Generation Module, the Clothes Warping Module and the Content Fusion Module (see Figure 20). Each unit is extremely computationally intensive and would need a very detailed description, which is out of the scope of this report. We therefore provide only a brief introduction to the units. For details, please refer to [14].
5.1 Semantic Generation Module

This is the first module and, as the name suggests, it is used to understand the semantics of the image of the person as well as the target cloth. All units are based on conditional GANs [29], with a U-Net structure as the generator and the discriminator of the pix2pixHD [30] network. The first GAN (G1) is used to find the areas of the image where the clothing from the target cloth needs to be placed. Then the second GAN unit (G2) is used to get the orientation and positioning of the target cloth.

5.2 Clothes Warping Module

This module is responsible for warping the target cloth onto the intermediate image of the person. This was the most challenging part for us to understand. Internally it uses a second-order difference constraint to warp the cloth with high fidelity. This constraint is used with a Spatial Transformer Network [31] and a Thin-Plate Spline. Before being sent to the next module, all the data is passed through Non-target Body Part Composition. This is simply done using appropriate products of the various masks, to make sure that all the necessary parts that need to be present pass on to the next module.

Figure 17: Transformation Module for better cloth fit
Figure 18: VITON Architecture [13]

5.3 Content Fusion Module

Finally, the data from all channels is fed into a content fusion module, which uses the third GAN (G3) and all the pose information to produce a final high resolution image of the target. It basically acts like an inpainting unit that fills in all the missing parts of the image.

The code for ACGPN can be found at https://github.com/switchablenorms/DeepFashion_Try_On.

IV PROPOSED METHODS

In this section we propose some methods that we were not able to experiment with to obtain results, but that we believe could be highly useful.

1 Proposed Size Estimator

• In our previous size estimator we used only the basic pose skeleton to predict the size of the body. However, this ignores the body shape in many cases.
• To overcome this, we propose to first segment the person from the background using a human segmentation module such as Pose2Seg (Pose2Seg will also make the process faster, as it calculates the pose as an intermediate step).
• In step 2, use OpenPose on that image. The output should look like Figure 21.
• Now we calculate the size as in the first method; after that, we can extrapolate the line.
• We can now use the above two parameters and a CNN over the nearby areas to compute another scaling factor, which could be given for each image.
• Finally, our output would be a function f such that

$$\text{size}_{part} = f(\text{method1}, \text{extrapolated}, \lambda) + c$$

Figure 19: VITON Mask Improvement Architecture [13]
Figure 20: ACGPN Architecture [14]
Figure 21: Keypoints generated from OpenPose for a person after background removal

2 CGAN Based TShirt Color Changing Approach

CGANs (Controllable Generative Adversarial Networks) [32] have been widely experimented with for changing a person's hair colour and simulating hair dyes [33].
Our planned approach was quite similar: we would aim to find the independent colour variables in the latent vector. The idea is that, as in the hair colour module, if we could find an independent set of variables linked to T-shirt colours, then we would be able to change the colours just by tweaking that set of variables.
We tried to train it, but learning was quite unstable and the model failed to converge. We feel that we could improve the learning and convergence given sufficient computational resources.

V CONCLUSION AND LEARNING OUTCOMES

Our conclusion for this project consists of the following findings:

1. Explored the scope of AR in the field of fashion.
2. Surveyed several papers about technologies related to our use case.
3. Studied and implemented two working cloth parsing models using the U-Net and Mask RCNN architectures, and trained them to perform segmentation of clothing in images.
4. Developed and implemented a size estimation model using OpenPose.
5. Studied the workings of the existing virtual try-on models VITON and ACGPN.
6. The U-Net architecture reports very high semantic segmentation accuracy for a single class. We also tried to create a multi-class architecture, but it gave poor results. It might be possible that training different U-Nets for different classes and then combining the predictions can provide a good, albeit slow, cloth parser.
7. The Mask RCNN approach showed results lower than those of state of the art models [26] (AP@.5 = 60.26 and AP@0.7 = .4765), although they also used Mask RCNN only. Better training and adjustment of hyperparameters may help increase this.
8. The cloth size estimation approach worked well for the most part, although we had to set the λ factor on our own; this can be rectified through our proposed methods.
9. The virtual try-on models we explored, i.e. VITON and ACGPN, worked well enough on their test images, but they were mathematically very complex to understand and computationally expensive to implement on a commercial scale.

The learning outcomes from this project are as follows:

1. We learnt about the various technologies that can be used in the fashion industry and surveyed several papers regarding them.
2. We learnt about semantic segmentation techniques and applied them for cloth parsing purposes.
3. We developed our own technique for cloth size estimation using human pose estimation.
4. We read about the existing virtual try-on techniques such as VITON, CP-VTON and ACGPN, understood their architectures well, and tried to delve into the mathematical details to the best of our efforts.
5. The search for cloth warping techniques also introduced us to GANs and autoencoders, which we read about in detail.
6. We explored methods for the λ parameter in the cloth size estimation module to be learned as a function of the person's image itself (using image processing), or as a function of the person's BMI, so as to take care of persons with different sizes.

A detailed presentation on the learning outcomes can be found at https://drive.google.com/drive/folders/15FRuGU0VDZM2ySMLFDOEQlqK1gBPbFZg?usp=sharing

VI ISSUES FACED

1. The mathematical details regarding the workings of the virtual try-on models proved to be out of scope for us, and hence we were not able to proceed to create a version of our own.
2. The multi-class implementation of U-Net gave poor performance on the test images, and the faults were not clear; hence we proceeded with a single class implementation only.
3. A good dataset was not available to run experiments, or to train and validate our proposed body shape model.

VII FUTURE WORK

1. To gain an intuitive understanding of the mathematics behind VITON and ACGPN.
2. To try to implement U-Net for multi-class semantic segmentation.
3. To improve the results for Mask RCNN through better training and hyperparameter adjustments.
4. To implement our proposals for cloth size estimators.
5. To understand and design the target cloth warping.
6. To analyse the logistics of the project from a commercial perspective and to make it commercially viable.
7. To develop a simple prototype mobile application.
8. To design and optimise a model such that it can work on a mobile device with a primitive GPU.
9. To extend/design a model for 3D imaging and use AR/VR to project the outputs in a more elegant way to produce a better visualisation.
VIII ACKNOWLEDGMENTS

We are highly thankful to Dr. Anand Mishra (https://anandmishra22.github.io/) for allowing us to explore this wonderful field and work completely as per our ideas, for valuable discussions, and for supporting us at each step.

REFERENCES

[1] Intel. https://www.intelrealsense.com/beginners-guide-to-depth/. (2021).
[2] T. K. Ho. Random Decision Forests. (2021).
[3] https://www.sciencedirect.com/topics/computer-science/geodesic-distance. (2021).
[4] T. Xiaohui, P. Xiaoyu, L. Liwen & X. Qing. Automatic human body feature extraction and personal size measurement. Journal of Visual Languages & Computing 47, 9–18 (2018). ISSN: 1045-926X. https://www.sciencedirect.com/science/article/pii/S1045926X17302835.
[5] Presize.ai. https://www.presize.ai/. (2021).
[6] Nettelo. http://nettelo.com/. (2021).
[7] Sizer. https://sizer.me/. (2021).
[8] D. P. Kingma & M. Welling. An Introduction to Variational Autoencoders. Foundations and Trends® in Machine Learning 12, 307–392 (2019). ISSN: 1935-8245. http://dx.doi.org/10.1561/2200000056.
[9] K. He, G. Gkioxari, P. Dollár & R. Girshick. Mask R-CNN. 2018. arXiv: 1703.06870 [cs.CV].
[10] O. Ronneberger, P. Fischer & T. Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. 2015. arXiv: 1505.04597 [cs.CV].
[11] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei & Y. Sheikh. OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. 2019. arXiv: 1812.08008 [cs.CV].
[12] S.-H. Zhang et al. Pose2Seg: Detection Free Human Instance Segmentation. 2019. arXiv: 1803.10683 [cs.CV].
[13] X. Han, Z. Wu, Z. Wu, R. Yu & L. S. Davis. VITON: An Image-based Virtual Try-on Network. 2018. arXiv: 1711.08447 [cs.CV].
[14] H. Yang et al. Towards Photo-Realistic Virtual Try-On by Adaptively Generating↔Preserving Image Content. 2020. arXiv: 2003.05863 [cs.CV].
[15] O. Ronneberger, P. Fischer & T. Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. 2015. arXiv: 1505.04597 [cs.CV].
[16] https://en.wikipedia.org/wiki/U-Net
[17] https://github.com/xthan/VITON
[18] https://github.com/HarisIqbal88/PlotNeuralNet/blob/master/examples/Unet/Unet.pdf
[19] K. He, G. Gkioxari, P. Dollár & R. B. Girshick. Mask R-CNN. CoRR abs/1703.06870 (2017). arXiv: 1703.06870. http://arxiv.org/abs/1703.06870.
[20] S. Ren, K. He, R. B. Girshick & J. Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. CoRR abs/1506.01497 (2015). arXiv: 1506.01497. http://arxiv.org/abs/1506.01497.
[21] R. B. Girshick. Fast R-CNN. CoRR abs/1504.08083 (2015). arXiv: 1504.08083. http://arxiv.org/abs/1504.08083.
[22] R. B. Girshick, J. Donahue, T. Darrell & J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR abs/1311.2524 (2013). arXiv: 1311.2524. http://arxiv.org/abs/1311.2524.
[23] https://ars.els-cdn.com/content/image/1-s2.0-S0168169919301103-gr4.jpg
[24] https://github.com/matterport/Mask_RCNN
[25] T.-Y. Lin et al. Microsoft COCO: Common Objects in Context. CoRR abs/1405.0312 (2014). arXiv: 1405.0312. http://arxiv.org/abs/1405.0312.
[26] S. Guo et al. The iMaterialist Fashion Attribute Dataset. CoRR abs/1906.05750 (2019). arXiv: 1906.05750. http://arxiv.org/abs/1906.05750.
[27] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei & Y. Sheikh. OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. CoRR abs/1812.08008 (2018). arXiv: 1812.08008. http://arxiv.org/abs/1812.08008.
[28] K. Gong, X. Liang, D. Zhang, X. Shen & L. Lin. Look into Person: Self-supervised Structure-sensitive Learning and A New Benchmark for Human Parsing. 2017. arXiv: 1703.05446 [cs.CV].
[29] M. Mirza & S. Osindero. Conditional Generative Adversarial Nets. 2014. arXiv: 1411.1784 [cs.LG].
[30] T.-C. Wang et al. High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs. CoRR abs/1711.11585 (2017). arXiv: 1711.11585. http://arxiv.org/abs/1711.11585.
[31] M. Jaderberg, K. Simonyan, A. Zisserman & K. Kavukcuoglu. Spatial Transformer Networks. 2016. arXiv: 1506.02025 [cs.CV].
[32] M. Lee & J. Seok. Controllable Generative Adversarial Network. 2019. arXiv: 1708.00598 [cs.LG].
[33] https://towardsdatascience.com/dye-your-hair-or-look-older-using-ai-930bc6928422