MobileViT presents a light-weight and efficient vision transformer architecture for mobile and embedded vision applications by incorporating spatial inductive biases through depth-wise convolutions and a multi-scale training sampler to improve performance and training efficiency. Experimental results show that MobileViT achieves competitive accuracy to CNNs while being significantly more lightweight, and it also outperforms other vision transformers when evaluated on various mobile hardware platforms.