This document presents controllable image-to-video translation for synthesizing facial expression videos from a single neutral photo. Given the neutral input, the method generates a video that transitions smoothly to the peak expression state while preserving the subject's identity. A single model is trained to generate multiple expression types jointly, combining an adversarial loss, a temporal continuity loss, and facial landmark prediction to keep the generated facial dynamics natural. Experiments analyze temporal smoothness and visualize the generated expression videos.
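To make the training signals concrete, below is a minimal PyTorch sketch of how the three objectives named above (adversarial, temporal continuity, landmark prediction) might be combined into a single generator loss. The specific loss forms, the weighting coefficients `lambda_tc` and `lambda_lm`, and all tensor names are illustrative assumptions, not details taken from the document.

```python
import torch
import torch.nn.functional as F

def temporal_continuity_loss(frames):
    # Penalize large changes between consecutive generated frames
    # to encourage smooth transitions from neutral to peak expression.
    # frames: (batch, time, channels, height, width) -- assumed layout
    diffs = frames[:, 1:] - frames[:, :-1]
    return diffs.abs().mean()

def generator_loss(frames, fake_logits, pred_landmarks, true_landmarks,
                   lambda_tc=1.0, lambda_lm=1.0):
    # Adversarial term: non-saturating GAN loss on the discriminator's
    # logits for generated frames (one common choice; the document does
    # not specify the exact adversarial formulation).
    adv = F.binary_cross_entropy_with_logits(
        fake_logits, torch.ones_like(fake_logits))
    # Temporal continuity term: smooth frame-to-frame dynamics.
    tc = temporal_continuity_loss(frames)
    # Landmark term: L1 between predicted and reference facial landmarks,
    # steering the generator toward natural facial motion.
    lm = F.l1_loss(pred_landmarks, true_landmarks)
    return adv + lambda_tc * tc + lambda_lm * lm
```

In this sketch the discriminator and landmark predictor are external modules that produce `fake_logits` and `pred_landmarks`; only the way their outputs are folded into one weighted objective is shown here.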