- The document presents a new formulation of the Transformer attention mechanism in terms of kernels, which allows attention to be defined over a larger space and positional embeddings to be integrated directly into the kernel (a minimal sketch of this view follows the list).
- Experiments on neural machine translation and sequence prediction tasks study different kernel forms like RBF and their combination with positional encodings.
- Results show that the RBF kernel performs best and that the kernel-based models reach performance competitive with state-of-the-art models at a lower computational cost.
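
Below is a minimal sketch of attention viewed as kernel smoothing, assuming an RBF kernel and one illustrative way of folding positional features into the kernel input (by concatenation). The function names (`rbf_kernel`, `kernel_attention`) and the concatenation choice are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Pairwise RBF kernel: k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)  # (n_a, n_b)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def kernel_attention(queries, keys, values, kernel=rbf_kernel):
    """Attention as kernel smoothing: each output row is a kernel-weighted
    average of the value rows, normalized over the key positions."""
    weights = kernel(queries, keys)             # (n_q, n_k) unnormalized scores
    weights /= weights.sum(-1, keepdims=True)   # row-normalize to sum to 1
    return weights @ values                     # (n_q, d_v)

# Toy usage: 5 positions, content dim 8, positional features concatenated
# onto the query/key features (an illustrative choice, not necessarily the
# paper's construction).
rng = np.random.default_rng(0)
n, d = 5, 8
content = rng.normal(size=(n, d))
positions = np.eye(n)                                    # toy positional features
qk_features = np.concatenate([content, positions], -1)   # (n, d + n)
values = rng.normal(size=(n, d))
out = kernel_attention(qk_features, qk_features, values)
print(out.shape)  # (5, 8)
```

Note that with the exponential of a scaled dot product in place of the RBF kernel, the same normalization recovers the standard softmax attention weights, which is what makes the kernel view a generalization of ordinary Transformer attention.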