This document describes the Transformer, a novel neural network architecture based solely on attention mechanisms rather than recurrent or convolutional layers. The Transformer uses stacked encoder and decoder blocks, each combining multi-head self-attention with position-wise feed-forward layers, to achieve state-of-the-art results on machine translation tasks. Key aspects of the architecture include multi-head attention, which lets the model jointly attend to information from different representation subspaces; positional encoding, which injects information about token order since the model contains no recurrence or convolution; and a causal attention mask in the decoder, which prevents each position from attending to subsequent positions. The Transformer outperforms RNN-based models on translation benchmarks while using fewer parameters, and its computation can be parallelized across sequence positions rather than proceeding step by step.
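
The sketch below illustrates the three mechanisms named above: sinusoidal positional encoding, scaled dot-product attention with an optional causal mask, and a single multi-head attention step. It is a minimal NumPy illustration under assumed toy dimensions; the projection matrices are random stand-ins for learned weights, not the paper's trained parameters.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings added to the token embeddings."""
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000.0, (2 * i) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def scaled_dot_product_attention(q, k, v, causal=False):
    """softmax(QK^T / sqrt(d_k)) V, optionally masking future positions."""
    d_k = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)  # (heads, seq, seq)
    if causal:
        seq = scores.shape[-1]
        mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)         # block attention to later positions
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ v

def multi_head_attention(x, num_heads, rng, causal=False):
    """Project x into per-head Q, K, V, attend in each subspace, then recombine."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    # Random projections stand in for the learned matrices W_Q, W_K, W_V, W_O.
    w_q, w_k, w_v, w_o = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                          for _ in range(4))
    def split_heads(t):
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    out = scaled_dot_product_attention(q, k, v, causal=causal)   # (heads, seq, d_head)
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)       # concatenate heads
    return out @ w_o

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 6, 16, 4                # illustrative toy sizes
tokens = rng.standard_normal((seq_len, d_model))
x = tokens + positional_encoding(seq_len, d_model)    # inject position information
encoded = multi_head_attention(x, num_heads, rng)                  # encoder-style (unmasked)
decoded = multi_head_attention(x, num_heads, rng, causal=True)     # decoder-style (masked)
print(encoded.shape, decoded.shape)                    # (6, 16) (6, 16)
```

Because every position's output is computed from matrix products over the whole sequence at once, the attention step has no sequential dependency between positions, which is what allows the parallelization mentioned above.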