NÜWA is a unified multimodal pre-trained model that can generate new visual data (images and videos) or manipulate existing visual data. It uses a 3D transformer encoder-decoder framework that handles videos as 3D data and adapts to text and images as 1D and 2D data, respectively. Its 3D Nearby Attention (3DNA) mechanism restricts attention to nearby positions, which reflects the locality of visual data and reduces computational complexity. NÜWA achieves state-of-the-art results on tasks such as text-to-image generation, text-to-video generation, and video prediction. It also shows strong zero-shot capabilities on text-guided image and video manipulation.
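To make the idea behind 3D Nearby Attention more concrete, the following is a minimal sketch of local attention over a (time, height, width) grid of tokens. It is not NÜWA's implementation: the function names `neighborhood` and `nearby_attention_3d`, the `extent` parameter, the dictionary-based token layout, and the choice to clip windows at the grid border are all assumptions made for illustration, and the sketch omits learned projections, multiple heads, and any causal masking.

```python
import math
from itertools import product


def neighborhood(pos, shape, extent):
    """Grid positions within `extent` of `pos` on each axis, clipped to the grid.

    Hypothetical helper: border handling by clipping is an assumption.
    """
    ranges = [
        range(max(p - e, 0), min(p + e + 1, n))
        for p, e, n in zip(pos, extent, shape)
    ]
    return list(product(*ranges))


def nearby_attention_3d(q, k, v, shape, extent=(1, 1, 1)):
    """Scaled dot-product attention restricted to a local 3D window.

    q, k, v: dicts mapping a (t, h, w) position to a list of floats.
    shape:   (T, H, W) size of the token grid.
    extent:  half-width of the window along each axis (illustrative parameter).
    Returns a dict with the same keys as v.
    """
    out = {}
    for pos in product(*(range(n) for n in shape)):
        window = neighborhood(pos, shape, extent)
        dim = len(q[pos])
        # Score the query only against keys inside its local window.
        scores = [
            sum(a * b for a, b in zip(q[pos], k[nb])) / math.sqrt(dim)
            for nb in window
        ]
        # Numerically stable softmax over the window.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        # Weighted average of the values inside the window.
        out[pos] = [
            sum(w * v[nb][d] for w, nb in zip(weights, window)) / total
            for d in range(len(v[pos]))
        ]
    return out
```

In this sketch each query scores at most (2e_t + 1)(2e_h + 1)(2e_w + 1) keys for window extents (e_t, e_h, e_w), rather than all T × H × W positions as in full attention, which is where the reduction in computation comes from.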