Swin Transformer is a hierarchical vision transformer that computes self-attention within shifted local windows and produces multi-scale feature representations. It begins with small image patches and merges neighboring patches in deeper layers, so each successive level represents the image at a coarser resolution. This hierarchy helps it handle objects of varying sizes, unlike earlier vision transformers such as ViT, which produce a feature map at a single low resolution. Because self-attention is computed within fixed-size local windows at each level rather than globally, the computational complexity of Swin Transformer is linear in the number of image patches (and thus in image size) rather than quadratic.
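
The windowing idea and its effect on cost can be illustrated with a small NumPy sketch. This is not the reference implementation: the function names (`window_partition`, `shifted_window_partition`, `global_attention_pairs`, `windowed_attention_pairs`) are hypothetical, the feature map is assumed to have height and width divisible by the window size, and the attention masking that the real model applies after the cyclic shift is omitted. The cost functions only count query-key pairs as a rough proxy for attention cost.

```python
import numpy as np


def window_partition(x, window_size):
    """Split an (H, W, C) feature map into non-overlapping windows.

    Returns an array of shape (num_windows, window_size**2, C).
    Assumes H and W are divisible by window_size (an assumption of this sketch).
    """
    H, W, C = x.shape
    M = window_size
    x = x.reshape(H // M, M, W // M, M, C)
    x = x.transpose(0, 2, 1, 3, 4)  # (H/M, W/M, M, M, C)
    return x.reshape(-1, M * M, C)


def shifted_window_partition(x, window_size):
    """Cyclically shift the map by half a window, then partition it.

    The shift makes the new windows straddle the previous window borders.
    Masking of wrapped-around regions is deliberately left out here.
    """
    s = window_size // 2
    shifted = np.roll(x, shift=(-s, -s), axis=(0, 1))
    return window_partition(shifted, window_size)


def global_attention_pairs(h, w):
    """Query-key pairs when every patch attends to every patch."""
    return (h * w) ** 2


def windowed_attention_pairs(h, w, window_size):
    """Query-key pairs when attention is restricted to each window."""
    num_windows = (h // window_size) * (w // window_size)
    tokens_per_window = window_size ** 2
    return num_windows * tokens_per_window ** 2


if __name__ == "__main__":
    feature_map = np.zeros((56, 56, 96))  # hypothetical stage-1 sized map
    print(window_partition(feature_map, 7).shape)
    print(global_attention_pairs(56, 56), windowed_attention_pairs(56, 56, 7))
```

For a 56×56 grid of patches with 7×7 windows, the windowed count is 64 windows × 49² = 153,664 pairs, compared with 3,136² = 9,834,496 pairs for global attention. Doubling each side of the grid multiplies the windowed count by 4 (linear in the number of patches) but the global count by 16 (quadratic).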