This document describes a self-evolving vision transformer for chest X-ray diagnosis trained with knowledge distillation. It covers the model architecture, the pseudo-labeling scheme used for distillation, and the associated loss functions. Experimental results show that the vision transformer outperforms a CNN baseline, and that its performance improves steadily as more unlabeled data is incorporated through self-supervision and self-training with knowledge distillation.
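The self-training setup summarized above relies on a teacher network producing soft pseudo-labels that a student network is trained to match. As a minimal sketch, the standard temperature-scaled distillation loss (Hinton-style KL divergence) looks like the following; the paper's exact formulation may differ, and all names here are illustrative:

```python
import math

def softmax(logits, T=1.0):
    """Numerically stable softmax over a list of logits, softened by temperature T."""
    z = [l / T for l in logits]
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The teacher's softened probabilities act as soft pseudo-labels for the
    unlabeled image; the T^2 factor keeps gradient magnitudes comparable to
    a hard-label cross-entropy term (as in Hinton et al.'s formulation).
    """
    p = softmax(teacher_logits, T)  # soft pseudo-labels from the teacher
    q = softmax(student_logits, T)  # student predictions
    return T * T * sum(
        pi * (math.log(pi + 1e-12) - math.log(qi + 1e-12))
        for pi, qi in zip(p, q)
    )

# Illustrative per-image logits (hypothetical values, e.g. 3 diagnosis classes):
teacher = [2.0, 0.5, -1.0]
student = [1.8, 0.7, -0.9]
loss = distillation_loss(student, teacher)
```

In the self-evolving loop described by the document, the student from one round typically becomes the teacher for the next, so the loss above would be recomputed against progressively stronger pseudo-labels as more unlabeled data is added.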