The document discusses transformer and BERT models. It provides an overview of attention mechanisms, the transformer architecture, and how transformer models work. It then introduces BERT, explaining how it differs from the original transformer in that it uses only the encoder stack (no decoder) and is pretrained on two unsupervised tasks, masked language modeling and next sentence prediction. The document outlines BERT's architecture and embeddings. Pretrained BERT variants are also discussed, including DistilBERT, RoBERTa, ALBERT, and DeBERTa.
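As a minimal sketch of the pretrained models mentioned above, the snippet below loads a BERT encoder and demonstrates the masked-language-modeling objective. It assumes the Hugging Face `transformers` library and PyTorch are installed; the checkpoint names (`bert-base-uncased`, `distilbert-base-uncased`) are illustrative choices, not taken from the document.

```python
# Sketch only: assumes Hugging Face "transformers" and PyTorch are installed.
from transformers import AutoTokenizer, AutoModel, pipeline

# Load a pretrained BERT encoder (no decoder) and its tokenizer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Encode a sentence and inspect the final hidden states (one vector per token).
inputs = tokenizer("BERT uses only the transformer encoder.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)

# The masked-language-modeling pretraining task, seen at inference time:
# the model predicts the token hidden behind [MASK].
fill = pipeline("fill-mask", model="distilbert-base-uncased")
print(fill("BERT is pretrained with a masked language [MASK] objective."))
```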