The document describes a lipreading model that predicts words from video input. The model uses a spatiotemporal front-end with a ResNet, followed by a temporal convolutional backend and then a bidirectional LSTM backend. The document reports statistics on the model's best and worst predictions on a test set containing 50 examples for each of 500 labels. Proposed future work includes improving the preprocessing step and incorporating the concept of sentences.
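As a rough illustration of how best and worst predictions might be tabulated, the sketch below computes per-label accuracy from pairs of true and predicted labels and ranks the labels. It is not the document's evaluation code: the function names `per_label_accuracy` and `best_and_worst` and the toy word labels are hypothetical, chosen only for this example.

```python
from collections import defaultdict


def per_label_accuracy(true_labels, predicted_labels):
    """Return {label: fraction of that label's examples predicted correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, guess in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1
    return {label: correct[label] / total[label] for label in total}


def best_and_worst(accuracy, k=3):
    """Return the k highest-accuracy and k lowest-accuracy (label, accuracy) pairs.

    Ties are broken alphabetically by label in both lists.
    """
    items = list(accuracy.items())
    best = sorted(items, key=lambda kv: (-kv[1], kv[0]))[:k]
    worst = sorted(items, key=lambda kv: (kv[1], kv[0]))[:k]
    return best, worst


# Hypothetical toy data: three word labels with two test examples each.
true_labels = ["ABOUT", "ABOUT", "BECAUSE", "BECAUSE", "SPEND", "SPEND"]
predicted = ["ABOUT", "ABOUT", "BECAUSE", "SPEND", "SPENT", "SPENT"]

accuracy = per_label_accuracy(true_labels, predicted)
best, worst = best_and_worst(accuracy, k=1)
print("best:", best)
print("worst:", worst)
```

On a real test set the two input lists would hold one entry per test video, and `k` would be set to however many labels the report lists at each end of the ranking.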