The document discusses sequence-to-sequence models for speech recognition. It first describes how traditional automatic speech recognition (ASR) systems combine separately trained acoustic, pronunciation, and language models. It then introduces sequence-to-sequence models such as Listen, Attend and Spell (LAS), which consists of an encoder, an attender, and a decoder. LAS improves on traditional ASR by folding all of these components into a single neural network trained jointly, aided by further optimizations such as minimum word error rate (MWER) training and scheduled sampling. The sequence-to-sequence models discussed achieve around an 11% relative improvement in word error rate over traditional ASR systems.
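To make the attender's role concrete, the following is a minimal sketch of a dot-product attention step: at each decoding step, the decoder state scores every encoder state, the scores are normalized with a softmax, and the encoder states are averaged with those weights into a context vector. This is a simplified illustration under assumed toy dimensions; LAS itself uses learned attention over RNN hidden states, and the function names here (`softmax`, `attend`) are illustrative, not from the document.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Dot-product attention: score each encoder state against the
    current decoder state, softmax the scores, and return the
    weighted sum of encoder states (the context vector)."""
    scores = [sum(d * h for d, h in zip(decoder_state, enc))
              for enc in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * enc[i] for w, enc in zip(weights, encoder_states))
               for i in range(dim)]
    return context, weights

# Toy example: 3 encoder time steps with 2-dimensional states.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [1.0, 0.0]
context, weights = attend(decoder_state, encoder_states)
```

In a full model the decoder would concatenate this context vector with its hidden state to predict the next output character, repeating the attention step at every decoding step.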