2. Problem Statement
Semantic segmentation is a computer vision task in which each pixel of an image is labeled with a semantic class such as "person," "road," or "building." This dense labeling provides a detailed understanding of the image's contents, enabling better decision-making in fields like agriculture, disaster response, environmental monitoring, infrastructure management, and urban planning. Semantic segmentation of aerial images in particular can provide critical information and insights for a wide range of applications.
4. Dataset
The Aerial Semantic Segmentation Drone Dataset comprises 400 high-resolution aerial images captured by a drone camera. Each image measures 6000 × 4000 pixels and is accompanied by binary and RGB label masks.
5. Encoding Mask
The preprocess() function applies the read_image() and read_mask() functions to each batch of input and target data, respectively. It first decodes the input and target data, which are initially in byte format. tf.numpy_function() is then used to apply read_image() and read_mask() to the input and target data, yielding tensors of types tf.float32 and tf.int32. The mask tensor is converted to a one-hot tensor with tf.one_hot(), where num_classes specifies the number of classes. Finally, set_shape() ensures that the dimensions of the input and target tensors are correctly set to H × W × 3 and H × W × num_classes, respectively.
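A minimal sketch of this pipeline is shown below. The file-reading logic is replaced with synthetic arrays so the structure stands alone; H, W, and num_classes are placeholder values, not the project's actual settings.

```python
import numpy as np
import tensorflow as tf

H, W = 128, 128      # assumed training resolution (placeholder)
num_classes = 23     # assumed; set to the dataset's actual class count

def read_image(path):
    # NumPy-side loader (runs eagerly via tf.numpy_function).
    # A synthetic image stands in for decoding the file at `path`.
    return np.random.rand(H, W, 3).astype(np.float32)

def read_mask(path):
    # Integer class-id mask; stands in for reading the label PNG.
    return np.random.randint(0, num_classes, (H, W)).astype(np.int32)

def preprocess(x, y):
    def f(x, y):
        return read_image(x), read_mask(y)
    image, mask = tf.numpy_function(f, [x, y], [tf.float32, tf.int32])
    # Expand integer class ids into a one-hot channel per class.
    mask = tf.one_hot(mask, num_classes, dtype=tf.int32)
    # Static shapes are lost through tf.numpy_function; restore them.
    image.set_shape([H, W, 3])
    mask.set_shape([H, W, num_classes])
    return image, mask

ds = tf.data.Dataset.from_tensor_slices((["img.png"], ["mask.png"]))
ds = ds.map(preprocess).batch(1)
image, mask = next(iter(ds))
```

The explicit set_shape() calls are needed because tf.numpy_function executes arbitrary NumPy code, so TensorFlow cannot infer the output shapes on its own.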
8. U-NET Model
U-Net is a U-shaped encoder-decoder network architecture consisting of four encoder blocks and four decoder blocks connected via a bridge. The encoder network (contracting path) halves the spatial dimensions and doubles the number of filters (feature channels) at each encoder block. Conversely, the decoder network (expansive path) doubles the spatial dimensions and halves the number of feature channels at each decoder block.
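This halving/doubling pattern can be traced with simple arithmetic. The 256 × 256 input and 64 base filters below follow the original U-Net configuration and are illustrative assumptions, not values from this project:

```python
size, filters = 256, 64   # assumed input resolution and base filter count

stages = []
for _ in range(4):                 # four encoder blocks
    stages.append((size, filters))
    size //= 2                     # 2x2 max-pooling halves H and W
    filters *= 2                   # the next block doubles the channels

bridge = (size, filters)           # bottleneck between the two paths

# The decoder mirrors the encoder: upsampling doubles H and W while
# each block halves the feature channels.
decoder = [(s, f) for (s, f) in reversed(stages)]

print("encoder:", stages)   # [(256, 64), (128, 128), (64, 256), (32, 512)]
print("bridge:", bridge)    # (16, 1024)
print("decoder:", decoder)  # [(32, 512), (64, 256), (128, 128), (256, 64)]
```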
9. Architecture
This architecture is a U-Net model, which is commonly used for image segmentation tasks. It consists of a contracting path that captures
context and a symmetric expanding path that enables precise localization. The model has 5 convolutional layers in the contracting path,
with 16, 32, 64, 128, and 256 filters, respectively. Each convolutional layer is followed by a dropout layer to prevent overfitting, and max-pooling layers with a pool size of (2, 2) reduce the spatial dimensions of the feature maps.
In the expansive path, the model uses 4 transposed convolutional layers with 128, 64, 32, and 16 filters, respectively. Each transposed
convolutional layer is followed by a concatenation layer that merges the feature maps from the corresponding contracting path layer. The
concatenation is done along the channel dimension (axis=3). The model also has two convolutional layers with 128 and 64 filters,
respectively, after the first and second transposed convolutional layers. These convolutional layers are followed by a dropout layer.
Finally, the model has an output layer that uses a 1x1 convolution with softmax activation to classify each pixel into one of the specified
number of classes.
The model is compiled with the Adam optimizer and categorical cross-entropy loss function, and the accuracy metric is used for
evaluation during training. The summary of the model is printed to the console.
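A Keras sketch consistent with this description is given below. The dropout rates, the 256 × 256 input resolution, and the num_classes value are assumptions, and each block is simplified to a uniform conv–dropout–conv pattern rather than the exact per-block layer counts above.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_unet(input_shape=(256, 256, 3), num_classes=23):
    # num_classes=23 is an assumption; set it to the dataset's class count.
    inputs = layers.Input(input_shape)

    # Contracting path: conv -> dropout -> conv, then 2x2 max-pooling.
    skips, x = [], inputs
    for f in (16, 32, 64, 128):
        x = layers.Conv2D(f, 3, activation="relu", padding="same")(x)
        x = layers.Dropout(0.1)(x)
        x = layers.Conv2D(f, 3, activation="relu", padding="same")(x)
        skips.append(x)
        x = layers.MaxPooling2D((2, 2))(x)

    # Bottleneck with 256 filters.
    x = layers.Conv2D(256, 3, activation="relu", padding="same")(x)
    x = layers.Dropout(0.2)(x)
    x = layers.Conv2D(256, 3, activation="relu", padding="same")(x)

    # Expansive path: transposed conv, then concatenate the matching
    # contracting-path feature maps along the channel axis (axis=3).
    for f, skip in zip((128, 64, 32, 16), reversed(skips)):
        x = layers.Conv2DTranspose(f, 2, strides=2, padding="same")(x)
        x = layers.concatenate([x, skip], axis=3)
        x = layers.Conv2D(f, 3, activation="relu", padding="same")(x)
        x = layers.Dropout(0.1)(x)
        x = layers.Conv2D(f, 3, activation="relu", padding="same")(x)

    # 1x1 convolution with softmax classifies each pixel.
    outputs = layers.Conv2D(num_classes, 1, activation="softmax")(x)
    model = Model(inputs, outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_unet()
```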
12. Confusion Matrix
A confusion matrix is a two-dimensional table used to evaluate the performance of a classifier by comparing the predicted class labels to the actual class labels. The rows of the matrix represent the actual class labels, and the columns represent the predicted class labels.
The diagonal entries of the confusion matrix count the instances that are correctly classified, while the off-diagonal entries count the instances that are incorrectly classified. In the binary case, the diagonal cells are the True Positives (TP) and True Negatives (TN), and the off-diagonal cells are the False Positives (FP) and False Negatives (FN).
Based on these values, various performance metrics
such as accuracy, precision, recall, and F1-score can be
calculated to evaluate the classifier.
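The construction of the matrix and the derived metrics can be sketched in a few lines of NumPy; the labels below are toy data, not results from this project:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    # Rows = actual class, columns = predicted class.
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def precision_recall_f1(cm, c):
    # One-vs-rest metrics for class c, read directly off the matrix.
    tp = cm[c, c]
    fp = cm[:, c].sum() - tp     # predicted c but actually another class
    fn = cm[c, :].sum() - tp     # actually c but predicted another class
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, 3)
accuracy = np.trace(cm) / cm.sum()   # correct predictions / all predictions
```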