Knowledge distillation is a process that transfers knowledge from a large teacher network to a smaller student network. It was originally proposed to help deploy complex models on devices with limited resources by distilling their knowledge into a compact student model, and it is now commonly used to compress models for more efficient inference or training. Hinton et al. first formalized knowledge distillation in "Distilling the Knowledge in a Neural Network" (presented at the NIPS 2014 Deep Learning Workshop): the teacher's temperature-softened softmax outputs serve as soft targets, the so-called "dark knowledge" about how the teacher ranks incorrect classes, which the student learns to match in addition to the hard ground-truth labels. Since then, many methods have been proposed that apply different distillation losses and training procedures to transfer knowledge from large pretrained models to more compact student models while maintaining high performance.
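To make the classic formulation concrete, below is a minimal PyTorch sketch of the distillation loss: a weighted sum of the hard-label cross-entropy and a KL-divergence term between the temperature-softened teacher and student distributions. The function name, the temperature `T`, and the weight `alpha` are illustrative assumptions, not values from the original text.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Combine hard-label cross-entropy with a soft-target KL term.

    T softens both distributions; the soft term is scaled by T**2 so its
    gradient magnitude stays comparable across temperatures, following
    Hinton et al. (2015). `alpha` weights the hard-label term.
    """
    # Hard targets: standard cross-entropy against ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    # Soft targets: KL divergence between the temperature-scaled
    # teacher and student distributions ("dark knowledge").
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)

    return alpha * hard_loss + (1 - alpha) * soft_loss
```

In practice, the teacher's logits are computed under `torch.no_grad()` so that only the student receives gradient updates, and `T` and `alpha` are tuned per task.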