This document presents ViT-Adapter, a method that enhances the plain Vision Transformer (ViT) for dense prediction tasks, allowing it to match the performance of vision-specific transformers. It adds a pre-training-free adapter that introduces image-related inductive biases into the model, and it has achieved state-of-the-art results on object detection and segmentation benchmarks without using extra detection data. The study emphasizes the effectiveness of multi-modal pre-training for the plain ViT backbone and positions ViT-Adapter as a strong alternative to vision-specific transformers.
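To make the adapter idea concrete, below is a minimal, hypothetical sketch in PyTorch of the pattern this summary describes: a small convolutional branch supplies image-related inductive biases (local, spatial features), which are injected into a plain ViT's token stream via cross-attention, leaving the backbone's pre-training untouched. The real ViT-Adapter design is more elaborate (it uses a spatial prior module with deformable-attention injector/extractor blocks and produces multi-scale features); every class, name, and dimension here is illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class SpatialPriorModule(nn.Module):
    """Hypothetical convolutional stem producing spatial (image-related)
    features to inject into a plain, frozen-or-finetuned ViT."""
    def __init__(self, dim=768):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, dim // 4, kernel_size=3, stride=4, padding=1),
            nn.GELU(),
            nn.Conv2d(dim // 4, dim, kernel_size=3, stride=4, padding=1),
        )

    def forward(self, x):
        # (B, 3, H, W) -> (B, N_spatial, dim): flatten the feature map
        # into a sequence of spatial tokens.
        f = self.stem(x)
        return f.flatten(2).transpose(1, 2)

class Injector(nn.Module):
    """Cross-attention that adds spatial-prior information to ViT tokens.
    Zero-initialized gating keeps the adapter an identity map at the start,
    so the pre-trained backbone is not perturbed early in training."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gamma = nn.Parameter(torch.zeros(dim))

    def forward(self, vit_tokens, spatial_tokens):
        out, _ = self.attn(vit_tokens, spatial_tokens, spatial_tokens)
        return vit_tokens + self.gamma * out

# Toy usage: interleave injectors with plain transformer blocks
# (nn.TransformerEncoderLayer stands in for real ViT blocks).
dim = 768
blocks = nn.ModuleList(
    nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True) for _ in range(2)
)
spm, injector = SpatialPriorModule(dim), Injector(dim)

img = torch.randn(1, 3, 224, 224)
spatial = spm(img)                  # (1, 196, 768) spatial-prior tokens
tokens = torch.randn(1, 196, dim)   # stand-in for ViT patch embeddings
for blk in blocks:
    tokens = blk(injector(tokens, spatial))
```

The key design point this sketch tries to capture is that the adapter is trained from scratch alongside dense-prediction heads, while the ViT itself can keep whatever (e.g., multi-modal) pre-trained weights it started with.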