Project Shringar - An Exploration Of Approaches Towards Virtual TryOn
Nivedit Jain (B18CSE039)1, Mitul Indravadanbhai Patel (B18CSE041)1, and Rajat Sharma
(B18CSE043)1
1Department of Computer Science and Engineering
Indian Institute of Technology, Jodhpur
Project Report for completion of B.Tech. Project (BTech Pre-Final Year, Trimester 2, Academic Year 2020-2021) under the guidance
of Dr. Anand Mishra, Indian Institute of Technology, Jodhpur.
I INTRODUCTION
Augmented Reality (AR) tries to combine computing with
human interaction, and has become one of the fastest
growing fields in information technology over the past
few years. Companies and entrepreneurs have invested
considerable capital and human resources to bring humans
and computers closer, and new innovations have opened
doors into industries that were previously considered to
be out of scope for these technologies. One of these
industries is fashion, which is the one we target in this
project. Our goal is to create a Virtual Try-On
application, where users do not have to go through the
hassle of trying on clothes in a changing room, and can
instead try them on online through an AR interface. A
popular example of such a system is currently in use by
Lenskart.com®, which offers it for spectacle frames,
whereas we aim to do the same for clothes.
Through this project, we aim to accomplish the following
goals:
• Survey the existing and upcoming methodologies and
innovations that can be used for creating the
application.
• Expand our knowledge base about the internal
workings of different methods, so as to design
efficient and cost-effective solutions.
• Implement starter modules for our application pipeline,
which can later be bundled into the system.
• Gather knowledge about the fields of Deep Learning
and Augmented Reality in general, and their
applications in the fashion industry, along with future
prospects.
Over the course of this project we surveyed several
papers and other literature pertaining to our use case,
learned about new technologies such as GANs and
auto-encoders, trained image segmentation models for
fashion-specific uses, and developed a method for cloth
size estimation using human pose estimation techniques.
Overall, it was a highly enriching learning experience for
all of us.
* jain.22@iitj.ac.in, patel.6@iitj.ac.in, sharma.30@iitj.ac.in
Figure 1: How Coded Light Technique Works ([1])
II LITERATURE SURVEY
While working on this project we explored a number of
interesting papers, which we discuss in this section.
1 Depth Based Camera for Measurements
A depth camera, as the name suggests, is a camera that can
not only capture images of an object but also detect the
depth (distance) of the object from the camera, usually
using one of the following approaches [1]:
• Structured Light and Coded Light: in this class of
approaches, a pattern of light is emitted by the sensor
and the distance is calculated from the change in that
pattern. See Figure 1.
• Stereo Depth: these sensors usually use reflected
infrared light together with visible light and perceive
depth much like our eyes do. The basic idea is that the
distance between the two sensors is known, and the two
captured images are used to predict the depth of the
object. See Figure 2.
• Time of Flight and LiDAR: sensors in this class rely on
the speed of light, which is already known. The working
is quite intuitive and is based on the time it takes a
beam of light to come back to the camera.
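As a rough illustration of the last two approaches, the sketch below converts a stereo disparity and a time-of-flight round trip into a distance. The pinhole stereo relation Z = f·B/d and the round-trip relation d = c·t/2 are standard textbook formulas and are not taken from [1]; the function names, parameter names and sample values are hypothetical.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def stereo_depth_m(focal_length_px, baseline_m, disparity_px):
    """Depth from stereo: Z = f * B / d.

    focal_length_px: focal length in pixels (assumed known from calibration).
    baseline_m: distance between the two sensors in metres.
    disparity_px: horizontal shift of the same point between the two images.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_length_px * baseline_m / disparity_px


def time_of_flight_depth_m(round_trip_s):
    """Depth from time of flight: the beam travels to the object and back."""
    return SPEED_OF_LIGHT_M_PER_S * round_trip_s / 2.0


if __name__ == "__main__":
    # Hypothetical sensor: 700 px focal length, 5 cm baseline, 35 px disparity.
    print(stereo_depth_m(700.0, 0.05, 35.0))
    # A round trip of about 6.67 nanoseconds corresponds to roughly one metre.
    print(time_of_flight_depth_m(6.67e-9))
```

Note that the stereo depth is inversely proportional to the disparity, so distant objects (small disparity) are measured less precisely than nearby ones.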
We observe that these sensors are gradually making their
way into mobile devices, creating some high-potential
use cases. Examples include the Apple iPhone 12 Pro
(LiDAR), the Samsung Galaxy S20 Ultra (Time of Flight
based), and many more.
A number of techniques can be found in the literature
that use such devices to produce avatars of a human and
also obtain measurements. Most of these techniques are
based on two main ideas: Random Forests[2] and Geodesic
Distances[3].
Figure 2: How Stereo Depth Works ([1])
One such approach, which we particularly considered for
the first step of our pipeline, is by Tan Xiaohui et al.,
Automatic human body feature extraction and personal size
measurement [4]. They were able not only to generate a 3D
avatar effectively using depth cameras and to parse human
body features using a random forest approach, but also to
use geodesic distances to predict the size and shape of
various important features such as the shoulder, chest,
waist, hip and legs, with a very low average error of
0.0617cm (over all measurements). The computational
requirements were also not very high.
2 Using Non Depth Based Camera
With the evolution of computer vision, several methods
have been developed to obtain very accurate information
about the shape and size of a variety of human body parts.
However, these approaches need some additional inputs
along with the photo or video of the person, such as
height, weight, gender, age etc. Moreover, these approaches
also require the person to be in a particular pose and
orientation. We were able to find several applications,
startups and companies that are working on similar
approaches or have successfully implemented them; however,
the majority of these approaches and algorithms are not
publicly available.
After carefully studying them, we designed one of our
approaches and also suggested some improvements which we
felt we could implement. We found that the majority of
these approaches calculate a measurement per pixel and
then try to improve it using learning-based approaches.
Some ask the user to keep the distance to the camera fixed
and then use basic trigonometry to calculate the height,
before moving to a per-pixel scale. Others go further and
improve the estimates using Deep Learning based approaches,
and some use various alignment methods to align images
taken from different angles. A few examples are Presize[5],
Nettelo[6] and Sizer[7]. See Figures 3, 4 and 5.
3 Parsing clothes and Humans
Parsing and understanding a variety of objects is a
well-studied problem in computer science, and there are
several approaches for it. To parse and segment clothes we
first started with a Variational Autoencoder based
approach [8] and simultaneously tried various operations
on the results, such as changing colors. Later we changed
our approach to a Mask R-CNN[9] based instance
segmentation approach, as a lot of the literature was based
upon it. We also found that much of the earlier literature
used a U-Net[10] based approach, so we experimented with
similar approaches as well (both of these approaches are
discussed in detail in later sections of this text).
Figure 3: Basic Trigonometry Approach for calculating height,
the basic idea is that if we have at least one parameter then we
can try to predict the other parameters.
Many approaches can already be found in the literature
for parsing humans, and especially human pose. We have
gone through and used one such approach, namely
OpenPose [11] (discussed in detail in later sections).
There are, however, other good techniques to segment and
parse humans; one of them is Pose2Seg[12], which segments
humans based on their pose.
4 Virtual Try On (2D)
Virtual try-on is a field that is becoming increasingly
popular in computer science. In a very naive sense, it
means changing the clothing in an image or trying out
variations of it, such as colour, dress etc. In this work
we have worked with two prominent papers, VITON: An
Image-based Virtual Try-on Network [13] and Towards
Photo-Realistic Virtual Try-On by Adaptively
Generating ↔ Preserving Image Content [14] (both of them
are discussed in detail in later sections).
Apart from this, we also surveyed and learnt the basics of
many technologies such as AR, GANs, Variational
Autoencoders, R-CNNs, etc. These can be found in the
presentation available at
https://drive.google.com/drive/folders/15FRuGU0VDZM2ySMLFDOEQlqK1gBPbFZg?usp=sharing
III OUR SOLUTIONS
1 U-Net
The U-Net was developed by Olaf Ronneberger et al. for
biomedical image segmentation using convolutional
networks[15]. It is based on fully convolutional networks.
It takes less than a second to make a prediction for a
512x512 image on recent GPUs.[16] It has shown very good
performance on very different biomedical segmentation
applications. U-Net is used for semantic segmentation: for
example, given an image of a person wearing some clothes,
the image can be segmented into the individual clothes,
such as T-Shirt, Jeans, Hat etc.
Figure 4: Presize.ai extra input methods
Figure 5: Presize.ai video capturing in a particular alignment
1.1 Methodology
1.1.1 Structure Similar to a convolutional network,
it initially has downscaling layers, followed by upscaling
layers. Each upscaling stage concatenates a copy of the
last layer of the corresponding downscaling stage and
applies a transposed convolution, as shown in the generic
structure in Figure 6. As we can see, instead of
downscaling at every layer, the downscaling is discrete
and happens in bursts, after groups of 3 layers.
1.1.2 Energy The energy function is computed by a
pixel-wise soft-max over the final feature map combined
with the cross entropy loss function.
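The energy described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the training code used in this project: it assumes the network output is an array of per-pixel class scores of shape (H, W, K), it averages the loss over pixels, and it omits the per-pixel weight map used in the original U-Net paper. The function name is hypothetical.

```python
import numpy as np


def pixelwise_softmax_cross_entropy(logits, labels):
    """Mean cross entropy of a pixel-wise soft-max.

    logits: float array of shape (H, W, K), the final feature map.
    labels: int array of shape (H, W) with the true class of every pixel.
    """
    # Subtract the per-pixel maximum for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    h, w = labels.shape
    rows = np.arange(h)[:, None]
    cols = np.arange(w)[None, :]
    # Pick the log-probability of the true class at every pixel.
    picked = log_probs[rows, cols, labels]
    return float(-picked.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scores = rng.normal(size=(4, 4, 2))        # hypothetical 4x4 image, 2 classes
    truth = rng.integers(0, 2, size=(4, 4))
    print(pixelwise_softmax_cross_entropy(scores, truth))
```

With all scores equal, every class gets probability 1/K, so the loss equals ln K; this is a convenient sanity check for an untrained network.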
1.2 Implementation
• We implemented a basic U-Net to segment out the
T-Shirt/upper-body clothing from a given image of a
person.
• We prepared a dataset from the VITON dataset[17],
consisting of images and their corresponding T-Shirt or
upper-body clothing masks (binary masks), as shown
in Figure 8.
Figure 6: A Generic U-Net Structure[15]
• We built a network that takes in the image of the
person (256x256x3) and predicts the segmentation
mask of the T-Shirt or upper-body clothing
(256x256x1).
• We used a structure similar to the one shown in
Figure 6, except that we map (256x256x3)
-> (256x256x1). We also used 'same' padding in the
convolutions over each group of layers, so that the
spatial size of the input image is maintained instead
of losing the border pixels.
• We trained the network on the dataset generated
earlier from VITON[17], up to an accuracy of 99.28%.
• We measured the mean IoU (background excluded) on
a test dataset of 1000 images to be 0.9284. The
metric is defined as:
\[ IoU = \frac{Pixels_{Pred} \cap Pixels_{GT}}{Pixels_{Pred} \cup Pixels_{GT}} \tag{1} \]
where Pixels denotes the pixels classified into the given
category, and the ratio is taken between the number of
pixels in the intersection and in the union.
• We also experimented with the masked out T-Shirt to
change the T-Shirt Colours. As shown in Figure 9.
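The IoU of Equation (1) can be computed for a pair of binary masks as in the following sketch. The function name and the convention of returning 1.0 when both masks are empty are assumptions made for illustration; they are not taken from our training code.

```python
import numpy as np


def mask_iou(pred_mask, gt_mask):
    """Intersection over Union of two binary masks of the same shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Assumption: two empty masks are treated as a perfect match.
        return 1.0
    intersection = np.logical_and(pred, gt).sum()
    return float(intersection / union)


if __name__ == "__main__":
    predicted = np.array([[1, 1], [0, 0]])
    ground_truth = np.array([[1, 0], [1, 0]])
    # One pixel in common, three pixels in the union.
    print(mask_iou(predicted, ground_truth))
```

Because only foreground pixels enter the intersection and the union, the background is excluded from the score, which matches the "background excluded" convention used above.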
(a) Person (b) T-Shirt Mask
Figure 8: A Slice of the DataSet generated by us from the Viton
DataSet[17]
1.3 Results
The results are shown in Figure 9. The mean IoU (background
excluded) over the test dataset was 0.9284.
Figure 7: The U-Net Structure used by us[18]
2 Mask RCNN
2.1 Motivation
The main motivation behind using Mask RCNN[19] is for
the cloth segmentation part, so as to prepare a cloth mask
that can be used as input for the cloth warping stage in the
VITON[13] model.
2.2 Methodology
Mask RCNN is based upon the RCNN (Region-based
Convolutional Neural Network) family of neural networks,
and is an extension of Faster RCNN, which was developed
for object detection tasks in image processing. Mask
RCNN extends this by also providing instance segmentation
through segmentation masks (which specify the
category to which each pixel belongs), and hence is suitable
for our use case. The internal working of the model is
described as follows:
2.2.1 Region Proposal Networks[20] Region proposal
forms the basis of RCNNs[21][22][20][19], wherein
regions that are likely to contain objects are marked
using a heuristic algorithm (RCNN[22], Fast
RCNN[21]), or an RPN (Region Proposal Network)
(Faster RCNN[20]). The RPN takes as input a
convolutional feature map of the image, from the last
convolutional layer, and outputs region proposals in the
form of bounding boxes, together with a classification
score for two classes (object and no object).
A small network is slid over the input map; it takes an
n × n spatial window of the map and maps it to a
lower dimensional feature (256-d or 512-d), which is then
fed into two parallel fully connected layers, i.e. a proposal
bounding box regression layer (reg) and a box classification
layer (obj). To account for different scales and aspect
ratios, and to make region proposal invariant to these
factors, we predict k region proposals for each sliding
window position simultaneously, each relative to a
reference box (anchor) with its own scale and aspect
ratio. Thus the reg layer generates 4k outputs
representing the bounding box coordinates, while the obj
layer generates 2k outputs representing class scores, for
the k proposals. The RPN assigns a binary label to each
anchor (i.e. whether it contains an object or not): an
anchor is labelled positive if it has the highest IoU with
a ground-truth box, or if it has an IoU > 0.7 with any
ground-truth box. The loss function for the RPN is
L = L_obj + L_reg, where L_obj is the log loss over the
two classes, while L_reg is the robust (smooth L1) loss
over the region proposal box coordinates.
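The anchor labelling rule described above can be sketched as follows. This is a simplified illustration and not the Faster RCNN reference implementation: it generates k anchors at a single window position, marks the positive anchors only, and does not handle negative or ignored anchors. All function names, scales and ratios are hypothetical.

```python
import numpy as np


def box_iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def make_anchors(cx, cy, scales, ratios):
    """k = len(scales) * len(ratios) anchors centred at (cx, cy)."""
    anchors = []
    for scale in scales:
        for ratio in ratios:  # ratio = height / width
            w = scale / np.sqrt(ratio)
            h = scale * np.sqrt(ratio)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(anchors)


def label_anchors(anchors, gt_boxes, pos_thresh=0.7):
    """1 for positive anchors, 0 otherwise (negatives/ignored not modelled)."""
    ious = np.array([[box_iou(a, g) for g in gt_boxes] for a in anchors])
    labels = ious.max(axis=1) > pos_thresh
    # The anchor with the highest IoU for each ground-truth box is positive.
    labels[ious.argmax(axis=0)] = True
    return labels.astype(int)


if __name__ == "__main__":
    anchors = make_anchors(50, 50, scales=[32, 64], ratios=[0.5, 1.0, 2.0])
    k = len(anchors)
    print(k, "anchors ->", 4 * k, "reg outputs,", 2 * k, "obj outputs")
    print(label_anchors(anchors, gt_boxes=[(18, 18, 82, 82)]))
```

With k = 6 anchors per position, the reg layer in this sketch would produce 24 values and the obj layer 12, in line with the 4k and 2k outputs described above.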
2.2.2 RoI Align[19] The region proposals from the
RPN are mapped onto the feature map, and are used to
pool features using an RoI pooling layer. This is
essentially a max-pooling layer that divides the proposal
into sub-windows of a fixed size and performs pooling on
them, in order to give an output of dimensions (N,7,7,512),
where N is the number of initial regions proposed. The
main problem with RoI pooling is that, while it maps each
RoI to the extracted features, it introduces a level of
quantization when performing the mapping, which results in
a misalignment between the input and the extracted
features. RoI align avoids this by maintaining the actual
floating point values of the region coordinates, and uses
bilinear interpolation to compute the values used in the
pooling step. This preserves the mapping between the input
and the extracted features, and hence is suitable for use
in segmentation tasks.
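The role of bilinear interpolation in RoI align can be illustrated with the sketch below. It is a simplification made for illustration only: it works on a single-channel feature map and takes one sample at the centre of each output bin, whereas the Mask RCNN paper samples several points per bin and aggregates them. The function names are hypothetical.

```python
import numpy as np


def bilinear_sample(feature_map, y, x):
    """Value of a 2D feature map at a fractional location (y, x)."""
    h, w = feature_map.shape
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = (1 - dx) * feature_map[y0, x0] + dx * feature_map[y0, x1]
    bottom = (1 - dx) * feature_map[y1, x0] + dx * feature_map[y1, x1]
    return float((1 - dy) * top + dy * bottom)


def roi_align(feature_map, roi, out_size=2):
    """Pool a region (y1, x1, y2, x2) without rounding its coordinates."""
    y1, x1, y2, x2 = roi
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    out = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            # One sample at the centre of each bin (simplifying assumption).
            out[i, j] = bilinear_sample(
                feature_map, y1 + (i + 0.5) * bin_h, x1 + (j + 0.5) * bin_w
            )
    return out


if __name__ == "__main__":
    fmap = np.arange(16, dtype=float).reshape(4, 4)
    print(roi_align(fmap, roi=(0.25, 0.25, 2.25, 2.25)))
```

Rounding the region above to integer coordinates before pooling would shift it by a quarter of a cell; keeping the fractional coordinates is what avoids the misalignment.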
2.2.3 Basic Structure The final structure of the
network is as follows: the input image is passed through
several convolutional layers of some convolutional
network, which outputs a feature map that is passed
through an RPN, and the regions proposed are mapped
onto the feature map and pooled using the RoI align layer.
The output from the RoI layer is passed through two fully
connected layers that are connected to two parallel object
classification and bounding box regression branches. The
output from the RoI layer is also fed to a mask prediction
branch simultaneously via an FCN (Fully Convolutional
Network). The loss function for the network is defined as
L = L_cls + L_box + L_mask, where L_cls is a log loss for
each class, L_box is a smooth L1 regression loss for each
class, while L_mask is an averaged binary cross entropy
loss over each pixel, for each class.
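The three terms of this loss can be written out for a single region as in the sketch below. It is only meant to make the formula concrete: the function names are hypothetical, the inputs are plain arrays rather than network outputs, and the terms are summed without any weighting, following the expression above.

```python
import numpy as np


def log_loss(class_probs, true_class):
    """L_cls: negative log-probability of the true class."""
    return float(-np.log(class_probs[true_class]))


def smooth_l1(pred_box, true_box):
    """L_box: smooth L1 loss summed over the box coordinates."""
    diff = np.abs(np.asarray(pred_box, float) - np.asarray(true_box, float))
    return float(np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5).sum())


def mask_bce(pred_mask, true_mask, eps=1e-7):
    """L_mask: binary cross entropy averaged over the mask pixels."""
    p = np.clip(np.asarray(pred_mask, float), eps, 1.0 - eps)
    t = np.asarray(true_mask, float)
    return float(-(t * np.log(p) + (1.0 - t) * np.log(1.0 - p)).mean())


def mask_rcnn_loss(class_probs, true_class, pred_box, true_box,
                   pred_mask, true_mask):
    """L = L_cls + L_box + L_mask for one region."""
    return (log_loss(class_probs, true_class)
            + smooth_l1(pred_box, true_box)
            + mask_bce(pred_mask, true_mask))


if __name__ == "__main__":
    total = mask_rcnn_loss(
        class_probs=np.array([0.25, 0.75]), true_class=1,
        pred_box=[0.5, 0.0, 2.0, 0.0], true_box=[0.0, 0.0, 0.0, 0.0],
        pred_mask=np.full((2, 2), 0.5), true_mask=np.array([[1, 0], [0, 1]]),
    )
    print(total)
```

The smooth L1 term is quadratic for small errors and linear for large ones, which makes the box regression less sensitive to outliers than a plain squared error.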
2.3 Implementation
• We used the Matterport implementation[24] of Mask
RCNN[19] (implemented in TensorFlow 1.15), with
weights pre-trained on the MS-COCO[25] dataset.
• We trained our network on the iMaterialist 2019[26]
dataset, which has 46 categories of clothing. We used
45k images, of which 36k were training images and the
remaining 9k were test images.
(a) Converted Pink T-Shirt into White T-Shirt
(b) Converted Blue T-Shirt into white T-Shirt
(c) Converted Pink T-Shirt into Gray T-Shirt
Figure 9: U-Net Results
• The metric used for measuring performance is mean
precision and recall over IoU threshold (using
bounding boxes). IoU (Intersection over Union) for
prediction and ground truth bounding boxes for an
object of class C is defined as:
\[ IoU_C = \frac{Box_{Pred} \cap Box_{GT}}{Box_{Pred} \cup Box_{GT}} \tag{2} \]
The IoU threshold is a parameter θ: if
IoU_C >= θ, then the given image is taken to have an
object of class C predicted for it. Taking the precision of
all classes over all images, and then taking their mean,
forms the basis of our metric; the recall is treated in the
same way. We calculate the mean precision (MP) and mean
recall (MR) for θ = 0.5 and θ = 0.75, and also calculate
the average of the two values (listed as MP and MR). We
also calculate the combined F1 score from MP and MR.
They are listed below:
Figure 10: Mask RCNN Structure[23]
Model       MP for θ = 0.5   MP for θ = 0.75   MP
Mask RCNN   0.316            0.172             0.244

Model       MR for θ = 0.5   MR for θ = 0.75   MR
Mask RCNN   0.347            0.186             0.267

Model       F1
Mask RCNN   0.255
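The averaged values and the F1 score in the tables above can be reproduced from the per-threshold numbers. The short sketch below assumes that F1 is the usual harmonic mean of MP and MR, which is consistent with the reported value; the helper names are hypothetical.

```python
def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Per-threshold values taken from the tables above.
mp = mean([0.316, 0.172])   # mean precision over theta = 0.5 and 0.75
mr = mean([0.347, 0.186])   # mean recall over theta = 0.5 and 0.75
f1 = f1_score(mp, mr)

if __name__ == "__main__":
    print(f"MP = {mp:.4f}, MR = {mr:.4f}, F1 = {f1:.4f}")
```

The averages come out at 0.244 and 0.2665 (reported as 0.267), and the resulting F1 rounds to the 0.255 listed in the table.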
2.4 Results
Some of the sample results are shown in Figures 11 and 12.
Also for IoU please refer to the tables above.
Figure 11: Mask RCNN Result
Figure 12: Mask RCNN Result
3 Cloth Size Estimator
3.1 Motivation
An important part of shopping for clothing is finding
the right size of apparel to wear. But size measurement is
not an easy task, and certainly cannot easily be done by a
person on their own. This led us to create a simple
heuristic cloth size estimator, which would serve as the
first module of our pipeline.
3.2 Methodology
3.2.1 OpenPose[27] OpenPose is a popular open
source model for 2D human pose and keypoint estimation. It
performs pose estimation by taking the input image and
preparing part confidence maps (which show the likelihood
of a body part being present at a given point) and part
affinity fields (which show the orientation and association
of different body parts). The body part candidates are then
associated using a set of bipartite matchings, after which
they are assembled into full body poses for all the people
in the image.
3.2.2 Basic Structure OpenPose is used to
generate keypoints for a given input image, which we use
to predict the person’s size. We make the assumption that
the person is standing parallel to a vertical plane, and
that the camera is held perpendicular to that plane. This
ensures that the distances between the keypoints
correspond to actual measurements, and not mere
projections. The keypoint diagram that we referred to is
shown in Figure 13. We then define the metric cm/pixels as:
\[ \text{cm/pixels} = \frac{\text{Height} - 10}{\text{Distance}(P_{14}, P_{16})} \tag{3} \]
where the Height is taken as an input from the user. Once
the metric is calculated, for any Euclidean distance d
between two points on the image, we can find the
corresponding approximate length len in the real world
using the relation:
\[ len = \lambda \cdot (\text{cm/pixels}) \cdot d \tag{4} \]
where λ is a constant introduced to approximately account
for the fat on the person’s body.
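Equations (3) and (4) translate directly into code. The sketch below is an illustration rather than our actual implementation: the keypoint coordinates are made up, `p14` and `p16` stand for the keypoints P14 and P16 of Figure 13, and the default value of λ is an arbitrary placeholder.

```python
import math


def cm_per_pixel(height_cm, p14, p16, offset_cm=10.0):
    """Equation (3): (Height - 10) / Distance(P14, P16)."""
    pixels = math.dist(p14, p16)
    if pixels == 0:
        raise ValueError("keypoints P14 and P16 coincide")
    return (height_cm - offset_cm) / pixels


def real_length_cm(point_a, point_b, scale_cm_per_px, lam=1.0):
    """Equation (4): len = lambda * (cm/pixels) * d."""
    d = math.dist(point_a, point_b)
    return lam * scale_cm_per_px * d


if __name__ == "__main__":
    # Hypothetical keypoints (x, y) in pixels for a person of height 170 cm.
    scale = cm_per_pixel(170.0, p14=(100.0, 50.0), p16=(100.0, 850.0))
    # Hypothetical pair of keypoints 80 pixels apart.
    print(real_length_cm((60.0, 200.0), (140.0, 200.0), scale, lam=1.1))
```

In this made-up example the scale is 0.2 cm per pixel, so an 80 pixel span corresponds to 16 cm before the λ correction and 17.6 cm with λ = 1.1.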
3.3 Implementation
1. We implemented a version of the model in Python and
tested it on our own images, since no appropriate
dataset was available to us.
2. We first ran OpenPose on the images to generate the pose
and obtain the keypoint data, as shown in Figure 14.
3. We then calculated cm per pixel as described by
Equation 3 in the section above.
4. We took advantage of the fact that, to find the size of
a person, we only need to classify them into a size
class such as XS, S, M, L or XL; exact measurements are
not required. Hence a small inaccuracy in the exact
measurements is acceptable to some extent.
Figure 13: Keypoints generated from OpenPose
3.4 Results
We were not able to find a good dataset with all the
necessary parameters to report the accuracy of our model.
However, we tested it on ourselves and the model performed
quite decently. A few examples from our tests are as follows:
Person    Parameter        Real      Predicted
Nivedit   Trouser Length   103 cm    102.02 cm
Nivedit   Waist            42 inch   41.63 inch
Nivedit   Shoulder         40 inch   34.44 inch
Pratik    Trouser Length   89 cm     86.11 cm
Pratik    Waist            36 inch   37.8 inch
Pratik    Shoulder         37 inch   36.3 inch
In a later section we further propose an approach to
improve these results by considering parameters such as
fat and body shape in much more detail, and not just pose
skeletons.
However, we were not able to train and verify the
proposed approach due to the lack of a good dataset for the
purpose. We further propose to create one such dataset.
4 VITON
VITON is one of the most popular U-Net-like
encoder-decoder networks (see Figure 18) for Virtual
Try-On. It is quite a large architecture, so we describe it
only briefly here. (Note: all images in this section are
taken directly from the VITON paper [13].)
Figure 14: Keypoints generated from OpenPose for a person
4.1 Methodology
1. First, given the image of a person, we use OpenPose
[11] to get the pose map of the person. Then, using the
human parser[28], we get the body shape, and we also
segment out the face and hair from the image. All this
information is stacked in the form of 22 image channels
to form a matrix of size (256 * 192 * 22).
2. The input thus generated is passed, together with the
target cloth, into the main VITON network, as
shown in Figure 18.
3. The image of the person thus generated is fed, together
with the target cloth, into a mask improvement network,
as shown in Figure 19.
4. Using the final generated mask and the target cloth
image, a thin-plate spline transformation (see Figure 17)
is applied to the target cloth with reference to the
newly generated mask.
5. Finally, the transformed target cloth is superimposed
on the person using a perceptual loss.
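The stacking in step 1 can be sketched with NumPy as follows. The split of the 22 channels into 18 pose heatmaps, 1 body shape mask and 3 RGB channels for the face and hair follows our reading of the VITON paper [13] and should be treated as an assumption here; the function name and the zero-filled inputs are placeholders for the real OpenPose and human parser outputs.

```python
import numpy as np

HEIGHT, WIDTH = 256, 192


def build_person_representation(pose_maps, body_shape, face_hair_rgb):
    """Stack the clothing-agnostic inputs into one (256, 192, 22) array.

    pose_maps: (256, 192, 18) pose heatmaps, one per keypoint (assumed split).
    body_shape: (256, 192) single-channel body shape mask.
    face_hair_rgb: (256, 192, 3) RGB image keeping only the face and hair.
    """
    assert pose_maps.shape == (HEIGHT, WIDTH, 18)
    assert body_shape.shape == (HEIGHT, WIDTH)
    assert face_hair_rgb.shape == (HEIGHT, WIDTH, 3)
    return np.concatenate(
        [pose_maps, body_shape[..., None], face_hair_rgb], axis=-1
    )


if __name__ == "__main__":
    # Placeholder inputs; real ones would come from OpenPose and the parser.
    representation = build_person_representation(
        np.zeros((HEIGHT, WIDTH, 18)),
        np.zeros((HEIGHT, WIDTH)),
        np.zeros((HEIGHT, WIDTH, 3)),
    )
    print(representation.shape)
```

Keeping each source of information in its own channels lets the network see pose, shape and identity cues separately while the original clothing is left out of the input.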
4.2 Results
We were not able to run the code ourselves, since the
model was very computationally expensive and its required
dependencies were not available on the DGX2 server. Hence
the qualitative results are taken from the repository of the
original authors [17]. See Figure 15.
Figure 15: Results for VITON Model
Figure 16: Generating input for VITON
5 ACGPN
ACGPN is a recent Virtual Try-On network that aims to
improve on the fidelity of the previously existing Virtual
Try-On networks. It is completely GAN based and has
3 modules, namely the Semantic Generation
Module, the Clothes Warping Module and the Content Fusion
Module (see Figure 20). Each of these units is extremely
computationally intensive and would need a very detailed
description, which is out of the scope of this report. We
have therefore provided a brief introduction to the
various units. For details, please refer to [14].
5.1 Semantic Generation Module
This is the first module and, as the name suggests, it is
used to understand the semantics of the image of the person
as well as of the target cloth. All units are based on
conditional GANs[29], with a U-Net structure as the
generator and the discriminator taken from the
pix2pixHD[30] network.
The first GAN (G1) is used to understand the areas of the
image where the clothing from the target cloth needs to be
placed. The second GAN unit (G2) is then used to get the
orientation and positioning of the target cloth.
5.2 Clothes Warping Module
This module is responsible for warping the target cloth
over the intermediate image of a person. This was the most
challenging part for us to understand. Internally it uses
a second order difference equation to warp the cloth
effectively and with high fidelity. This difference
equation is used with a Spatial Transformer Network[31]
and a Thin-Plate Spline.
Before being sent to the next module, all the data is
passed through Non-target Body Part Composition. This is
simply done using appropriate element-wise products of
various masks, to make sure that all the necessary parts
which need to be present are passed to the next module.
Figure 17: Transformation Module for better cloth fit
Figure 18: Viton Architecture[13]
5.3 Content Fusion Module
Finally, the data from all channels is fed into the content
fusion module, which uses the third GAN (G3) to produce
the final high resolution image of the target from all the
information gathered so far. It basically acts like an
inpainting unit which fills in all the missing parts of the
image.
The code for ACGPN can be found at
https://github.com/switchablenorms/DeepFashion_Try_On.
IV PROPOSED METHODS
In this section we propose some methods which we were
not able to experiment with and obtain results for, but
which we believe could be highly useful.
1 Proposed Size Estimator
• In our previous size estimator we used only the basic
pose skeleton to predict the size of the body. However,
this ignores the body shape in many cases.
• To overcome this, we propose to first segment the person
from the background using a human segmentation
module such as Pose2Seg (Pose2Seg would also make the
process faster, as it calculates the pose as an
intermediate step).
• In step 2, we use OpenPose on that image. The output
should look like Figure 21.
• We then calculate the size as in the first method and,
after that, extrapolate the line.
• We can now use the above two parameters, and a CNN
over the nearby areas, to compute another scaling
factor, which could be given for each image.
• Finally, our output would be a function f, such that
\[ size_{part} = f(method1, extrapolated, \lambda) + c \]
Figure 19: Viton Mask Improvement Architecture[13]
Figure 20: ACGPN Architecture[14]
Figure 21: Keypoints generated from OpenPose on a person
after background removal
2 CGAN Based TShirt Color Changing Approach
CGANs (Controllable Generative Adversarial
Networks)[32] have been used in many experiments on
changing the hair color of a person and simulating hair
dyes [33]. Our planned approach was quite similar: we
would aim to find the independent color variable in the
generator’s input vector. The idea is that, as in the hair
color case, if we could find an independent set of
variables linked to T-shirt colors, then we would be able
to change the colors just by tweaking that set of variables.
We tried to train it, but learning was quite unstable and
the model failed to converge. We feel that the learning and
convergence could be improved given sufficient
computational resources.
V CONCLUSION AND LEARNING OUTCOMES
Our conclusions for this project consist of the following
findings:
1. We explored the scope of AR in the field of fashion.
2. We surveyed several papers about technologies
related to our use case.
3. We studied and implemented two working cloth parsing
models using the U-Net and Mask RCNN architectures,
and trained them to perform semantic segmentation
of clothing in images.
4. We developed and implemented a size estimation model
using OpenPose.
5. We studied the workings of the existing virtual try-on
models VITON and ACGPN.
6. The U-Net architecture reports very high semantic
segmentation accuracy for a single class. We also tried to
create a multi-class architecture, but it gave poor
results. It might be possible that training different
U-Nets for different classes and then combining the
predictions can provide a good, albeit slow, cloth
parser.
7. The Mask RCNN approach showed results lower than
those of state of the art models [26] (AP@.5 = 60.26
and AP@0.7 = .4765), although they also used only Mask
RCNN. Better training and adjustment of
hyperparameters may help improve this.
8. The cloth size estimation approach worked well for the
most part, although we had to set the λ factor
ourselves; this can be rectified through our proposed
methods.
9. The virtual try-on models we explored, i.e. VITON
and ACGPN, worked well enough on their test
images, but they were mathematically complex
to understand and computationally expensive to
implement on a commercial scale.
The learning outcomes from this project are as follows:
1. We learnt about the various technologies that can be
used in the fashion industry and surveyed several
papers regarding them.
2. We learnt about semantic segmentation techniques and
applied them for cloth parsing purposes.
3. We developed our own technique for cloth size
estimation using human pose estimation.
4. We read about existing virtual try-on techniques
such as VITON, CP-VTON and ACGPN, understood
their architectures well, and tried to delve
into the mathematical details to the best of our
efforts.
5. The search for cloth warping techniques also
introduced us to GANs and autoencoders, which we
read about in detail.
6. We explored methods for the λ parameter in the cloth
size estimation module to be learned as a function of
the person’s image itself (using image processing), or
as a function of the person’s BMI, so as to take care of
persons of different sizes.
A detailed presentation on the learning outcomes can be
found at
https://drive.google.com/drive/folders/15FRuGU0VDZM2ySMLFDOEQlqK1gBPbFZg?usp=sharing
VI ISSUES FACED
1. The mathematical details regarding the workings of
the virtual try-on models proved to be out of scope for
us, and hence we were not able to proceed to create a
version of our own.
2. The multi-class implementation of U-Net gave poor
performance on the test images, and the faults were not
clear; hence we proceeded with a single-class
implementation only.
3. A good dataset was not available to run experiments
on, or to train and validate our proposed body
shape model.
VII FUTURE WORK
1. To gain an intuitive understanding of the mathematics
behind VITON and ACGPN.
2. To try to implement U-Net for multi-class semantic
segmentation.
3. To improve the results for Mask RCNN through better
training and hyperparameter adjustments.
4. To implement our proposals for cloth size estimators.
5. To understand and design the target cloth warping.
6. To analyse the logistics of the project from a
commercial perspective and to make it commercially
viable.
7. To develop a simple prototype mobile application.
8. To design and optimise a model such that it can work
on a mobile device with a primitive GPU.
9. To extend/design a model for 3D imaging and use
AR/VR to project the outputs in a more elegant way
to produce a better visualisation.
VIII ACKNOWLEDGMENTS
We are highly thankful to Dr. Anand Mishra
(https://anandmishra22.github.io/) for allowing
us to explore this wonderful field and to work completely
as per our ideas, for the valuable discussions, and for
supporting us at each step.
REFERENCES
[1] Intel. https://www.intelrealsense.com/beginners-guide-to-depth/. (2021).
[2] T. K. Ho. Random Decision Forests. (2021).
[3] https://www.sciencedirect.com/topics/computer-science/geodesic-distance. (2021).
[4] T. Xiaohui, P. Xiaoyu, L. Liwen & X. Qing.
Automatic human body feature extraction and
personal size measurement. Journal of Visual
Languages & Computing 47, 9–18 (2018). ISSN:
1045-926X. https://www.sciencedirect.com/science/article/pii/S1045926X17302835.
[5] Presize.ai. https://www.presize.ai/. (2021).
[6] Nettelo. http://nettelo.com/. (2021).
[7] Sizer. https://sizer.me/. (2021).
[8] D. P. Kingma & M. Welling. An Introduction to
Variational Autoencoders. Foundations and
Trends® in Machine Learning 12, 307–392 (2019).
ISSN: 1935-8245.
http://dx.doi.org/10.1561/2200000056.
[9] K. He, G. Gkioxari, P. Dollár & R. Girshick. Mask
R-CNN 2018. arXiv: 1703.06870 [cs.CV].
[10] O. Ronneberger, P. Fischer & T. Brox. U-Net:
Convolutional Networks for Biomedical Image
Segmentation 2015. arXiv: 1505.04597 [cs.CV].
[11] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei &
Y. Sheikh. OpenPose: Realtime Multi-Person 2D
Pose Estimation using Part Affinity Fields 2019.
arXiv: 1812.08008 [cs.CV].
[12] S.-H. Zhang et al. Pose2Seg: Detection Free Human
Instance Segmentation 2019. arXiv: 1803.10683
[cs.CV].
[13] X. Han, Z. Wu, Z. Wu, R. Yu & L. S. Davis. VITON:
An Image-based Virtual Try-on Network 2018. arXiv:
1711.08447 [cs.CV].
[14] H. Yang et al. Towards Photo-Realistic Virtual
Try-On by Adaptively Generating↔Preserving
Image Content 2020. arXiv: 2003.05863 [cs.CV].
[15] O. Ronneberger, P. Fischer & T. Brox. U-Net:
Convolutional Networks for Biomedical Image
Segmentation 2015. arXiv: 1505.04597 [cs.CV].
[16] https://en.wikipedia.org/wiki/U-Net
[17] https://github.com/xthan/VITON
[18] https://github.com/HarisIqbal88/PlotNeuralNet/blob/master/examples/Unet/Unet.pdf
[19] K. He, G. Gkioxari, P. Dollár & R. B. Girshick.
Mask R-CNN. CoRR abs/1703.06870 (2017).
arXiv: 1703.06870.
http://arxiv.org/abs/1703.06870.
[20] S. Ren, K. He, R. B. Girshick & J. Sun. Faster
R-CNN: Towards Real-Time Object Detection with
Region Proposal Networks. CoRR abs/1506.01497
(2015). arXiv: 1506.01497.
http://arxiv.org/abs/1506.01497.
[21] R. B. Girshick. Fast R-CNN. CoRR abs/1504.08083
(2015). arXiv: 1504.08083.
http://arxiv.org/abs/1504.08083.
[22] R. B. Girshick, J. Donahue, T. Darrell & J. Malik.
Rich feature hierarchies for accurate object
detection and semantic segmentation. CoRR
abs/1311.2524 (2013). arXiv: 1311.2524.
http://arxiv.org/abs/1311.2524.
[23] https://ars.els-cdn.com/content/image/1-s2.0-S0168169919301103-gr4.jpg
[24] https://github.com/matterport/Mask_RCNN
[25] T.-Y. Lin et al. Microsoft COCO: Common Objects
in Context. CoRR abs/1405.0312 (2014). arXiv:
1405.0312.
http://arxiv.org/abs/1405.0312.
[26] S. Guo et al. The iMaterialist Fashion Attribute
Dataset. CoRR abs/1906.05750 (2019). arXiv:
1906.05750.
http://arxiv.org/abs/1906.05750.
[27] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei &
Y. Sheikh. OpenPose: Realtime Multi-Person 2D
Pose Estimation using Part Affinity Fields. CoRR
abs/1812.08008 (2018). arXiv: 1812.08008.
http://arxiv.org/abs/1812.08008.
[28] K. Gong, X. Liang, D. Zhang, X. Shen & L. Lin.
Look into Person: Self-supervised Structure-sensitive
Learning and A New Benchmark for Human Parsing
2017. arXiv: 1703.05446 [cs.CV].
[29] M. Mirza & S. Osindero. Conditional Generative
Adversarial Nets 2014. arXiv: 1411.1784
[cs.LG].
[30] T.-C. Wang et al. High-Resolution Image Synthesis
and Semantic Manipulation with Conditional GANs.
CoRR abs/1711.11585 (2017). arXiv: 1711.11585.
http://arxiv.org/abs/1711.11585.
[31] M. Jaderberg, K. Simonyan, A. Zisserman &
K. Kavukcuoglu. Spatial Transformer Networks
2016. arXiv: 1506.02025 [cs.CV].
[32] M. Lee & J. Seok. Controllable Generative
Adversarial Network 2019. arXiv: 1708.00598
[cs.LG].
[33] https://towardsdatascience.com/dye-your-hair-or-look-older-using-ai-930bc6928422