This document traces the evolution of artificial intelligence in computer vision, from early vision systems to modern architectures such as Vision Transformers (ViT). It highlights key breakthroughs, including the rise of deep convolutional neural networks (CNNs) following AlexNet in 2012 and the subsequent emergence of Vision-Language Models (VLMs) capable of zero-shot prediction across a range of tasks. It also underscores how ongoing innovation, together with the integration of different architectures, continues to expand what AI can do in understanding and processing visual information.