A convolution slides a kernel over the input. At each position, the kernel weights are multiplied element-wise with the pixel values in the kernel's receptive field, the products are summed, and a bias term is added to produce a single output value. Without padding, the output is spatially smaller than the input. Convolutional neural networks (CNNs) share the same kernel weights across all neurons in a feature map, which reduces the number of parameters and the computation compared with a fully connected layer. Convolutional layers are equivariant to translation: shifting the input shifts the output by the corresponding amount. Padding adds pixels (typically zeros) around the input border and, when sized to match the kernel, preserves the spatial dimensions after convolution. The stride specifies how many pixels the kernel moves at each step; a larger stride produces a lower-resolution output. ReLU is applied to the convolution output to introduce non-linearity, which lets the network learn mappings that a purely linear stack of convolutions cannot represent.
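The operations described above can be illustrated with a minimal pure-Python sketch. It is not how any particular library implements convolution: the function names `conv2d` and `relu`, the 5x5 test image, and the 3x3 horizontal-gradient kernel are all assumptions chosen for illustration. Like most deep-learning libraries, the sketch computes a cross-correlation (the kernel is not flipped). The output size along each axis follows `(n + 2 * padding - k) // stride + 1` for input size `n` and kernel size `k`.

```python
def conv2d(image, kernel, bias=0.0, stride=1, padding=0):
    """Naive single-channel 2D convolution (cross-correlation) on nested lists."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])

    # Zero-pad the border so the kernel can be centred on edge pixels.
    padded = [[0.0] * (w + 2 * padding) for _ in range(h + 2 * padding)]
    for r in range(h):
        for c in range(w):
            padded[r + padding][c + padding] = image[r][c]

    # Output size along each axis: (n + 2 * padding - k) // stride + 1
    out_h = (h + 2 * padding - kh) // stride + 1
    out_w = (w + 2 * padding - kw) // stride + 1

    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum over the receptive field, plus the bias.
            # The same kernel weights are reused at every position (weight sharing).
            total = bias
            for a in range(kh):
                for b in range(kw):
                    total += kernel[a][b] * padded[i * stride + a][j * stride + b]
            row.append(total)
        out.append(row)
    return out


def relu(feature_map):
    """Element-wise ReLU: negative values become 0, positive values pass through."""
    return [[max(0.0, v) for v in row] for row in feature_map]


# Hypothetical 5x5 image with pixel values 0..24 and a 3x3 horizontal-gradient kernel.
image = [[float(r * 5 + c) for c in range(5)] for r in range(5)]
kernel = [[1.0, 0.0, -1.0] for _ in range(3)]

valid = conv2d(image, kernel)                         # no padding: 5x5 -> 3x3
same = conv2d(image, kernel, padding=1)               # padding=1 keeps 5x5
strided = conv2d(image, kernel, stride=2, padding=1)  # stride=2: 5x5 -> 3x3
activated = relu(valid)                               # negatives clamped to 0
```

In this example the pixel values increase by 1 per column, so every entry of `valid` is -6.0 and ReLU maps the whole feature map to 0.0. Comparing the shapes of `valid`, `same`, and `strided` shows how padding and stride control the output resolution.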