The document discusses recent theories on why deep neural networks generalize well despite being highly overparameterized. Classic learning theory, which assumes that restricting the hypothesis space is necessary for generalization, fails to explain modern neural networks. Recent studies suggest neural networks generalize because (1) conventional capacity measures overestimate their effective complexity and (2) the implicit regularization of SGD steers training toward flat minima. Sharpness-aware minimization (SAM) optimizes for flat minima directly and consistently improves generalization, especially for vision transformers, which converge to sharper regions of the loss landscape than ResNets. SAM also produces more interpretable attention maps and significantly boosts the performance of vision transformers and MLP-Mixers on both in-domain and out-of-domain tasks.
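
To make the "optimizes for flat minima directly" point concrete, below is a minimal sketch of SAM's two-step update: first ascend to an approximate worst-case point within a small radius rho around the current weights, then descend using the gradient taken at that perturbed point. The toy model, dummy batch, and rho value are illustrative assumptions for this sketch, not details taken from the source.

```python
# Minimal SAM two-step update sketch (PyTorch). The model, data, and rho
# below are hypothetical placeholders, not values from the summarized work.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 2)                       # toy stand-in for a ViT/Mixer
base_opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
rho = 0.05                                     # neighborhood radius (SAM hyperparameter)

x = torch.randn(32, 10)                        # dummy batch
y = torch.randint(0, 2, (32,))

# Step 1: ordinary gradient of the loss at the current weights w.
loss = loss_fn(model(x), y)
loss.backward()

# Step 2: move to the approximate worst point in the rho-ball:
# epsilon = rho * grad / ||grad||, using the global gradient norm.
grads = [p.grad.detach().clone() for p in model.parameters()]
grad_norm = torch.norm(torch.stack([g.norm() for g in grads]))
epsilons = []
with torch.no_grad():
    for p, g in zip(model.parameters(), grads):
        eps = rho * g / (grad_norm + 1e-12)
        p.add_(eps)                            # w -> w + epsilon
        epsilons.append(eps)

# Step 3: the gradient of the *perturbed* loss is the sharpness-aware gradient.
base_opt.zero_grad()
loss_fn(model(x), y).backward()

# Step 4: undo the perturbation, then step the base optimizer at w
# using the gradient computed at w + epsilon.
with torch.no_grad():
    for p, eps in zip(model.parameters(), epsilons):
        p.sub_(eps)
base_opt.step()
base_opt.zero_grad()
```

In practice this two-step procedure wraps a base optimizer at every iteration (at roughly twice the cost of a normal step), and rho controls how large a neighborhood the minimum must stay flat over.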