Controllable image-to-video translation: A case study on facial expression generation
Introduction
■ Task Description
– How to generate video clips of rich facial expressions from a single profile photo with a neutral expression
■ Difficulties
– Image-to-video translation might seem like an ill-posed problem because the output has many more unknowns to fill in than the input provides
– Humans are familiar with and sensitive to facial expressions, so even small artifacts are easily noticed
– The face identity must be preserved in the generated video clips
Introduction
■ Different people express emotions in similar manners
■ The expressions are often “unimodal” for a fixed type of emotion
■ The human face in a profile photo draws most of a viewer’s attention, making the quality of the generated background less important
Method
■ Problem formulation
■ Given an input image I ∈ R^{H×W×3}, where H and W are respectively the height and width of the image, our goal is to generate a sequence of video frames {V(a) := f(I, a); a ∈ [0, 1]}, where f(I, a) denotes the model to be learned and the scalar a controls the progression of the expression from neutral (a = 0) to its peak (a = 1).
■ Properties:
– When a = 0, f(I, 0) = I, i.e., the first frame reproduces the input image
– Smoothness: f(I, a) and f(I, a + Δa) should be visually similar when Δa is small
– V(1) should be the peak state of the expression
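For concreteness, here is a minimal sketch of this interface in PyTorch; the generator f is hypothetical (standing in for the learned model) and the frame count is an arbitrary choice, not something the formulation specifies.

    import torch

    # Minimal sketch of the formulation above, assuming a learned
    # generator `f` (hypothetical) that maps (image, a) to a frame V(a).
    def generate_clip(f, image, num_frames=16):
        """Sample V(a) := f(I, a) at evenly spaced a in [0, 1]."""
        frames = [f(image, a) for a in torch.linspace(0.0, 1.0, num_frames)]
        # By the properties above: a = 0 reproduces the input image,
        # nearby values of a yield similar frames, and a = 1 is the peak.
        return torch.stack(frames)  # shape: (num_frames, 3, H, W)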
Method
■ Training loss
– Adversarial loss
– Temporal continuity loss
– Facial landmark prediction loss L_k
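A hedged sketch of how these three terms could combine into one generator objective is shown below; the networks G, D, and landmark_net, the pair of control values a and a_next, and the weights lambda_tc and lambda_lm are illustrative assumptions, not the slide's exact formulation.

    import torch
    import torch.nn.functional as F

    # Sketch of a combined objective with the three terms named above.
    # G (generator), D (discriminator), landmark_net (landmark
    # regressor), and the lambda_* weights are assumptions.
    def generator_loss(G, D, landmark_net, image, a, a_next,
                       target_landmarks, lambda_tc=1.0, lambda_lm=1.0):
        frame = G(image, a)
        frame_next = G(image, a_next)  # a_next = a + a small step Δa

        # Adversarial loss: the generated frame should fool D.
        logits = D(frame)
        adv = F.binary_cross_entropy_with_logits(
            logits, torch.ones_like(logits))

        # Temporal continuity: nearby control values -> similar frames.
        tc = F.l1_loss(frame, frame_next)

        # Facial landmark prediction L_k: landmarks detected on the
        # generated frame should match the targets for intensity a.
        lm = F.mse_loss(landmark_net(frame), target_landmarks)

        return adv + lambda_tc * tc + lambda_lm * lm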
Method
■ Jointly learning the models of different types of facial expressions
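One plausible way to share a single model across expression types, sketched below, is to condition the generator on a one-hot expression label; the slide does not specify the sharing mechanism, so this architecture is an assumption.

    import torch
    import torch.nn as nn

    # Sketch of joint learning over several expression types by
    # conditioning one generator on a one-hot label e (an assumption).
    class ConditionalGenerator(nn.Module):
        def __init__(self, num_expressions=6, feat=64):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(3, feat, 4, stride=2, padding=1), nn.ReLU())
            # Fuse image features with the control scalar a and label e.
            self.fuse = nn.Conv2d(feat + 1 + num_expressions, feat,
                                  3, padding=1)
            self.decoder = nn.Sequential(
                nn.ReLU(),
                nn.ConvTranspose2d(feat, 3, 4, stride=2, padding=1),
                nn.Tanh())

        def forward(self, image, a, expr_onehot):
            # image: (B, 3, H, W); a: (B,); expr_onehot: (B, num_expressions)
            h = self.encoder(image)
            b, _, hh, ww = h.shape
            a_map = a.view(b, 1, 1, 1).expand(b, 1, hh, ww)
            e_map = expr_onehot.view(b, -1, 1, 1).expand(
                b, expr_onehot.size(1), hh, ww)
            h = self.fuse(torch.cat([h, a_map, e_map], dim=1))
            return self.decoder(h)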
Experiments
■ Visualization
Experiments
■ Analysis on temporal continuity
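The slide does not state which metric this analysis uses; one simple, commonly used proxy for temporal continuity is the mean absolute difference between consecutive frames (smaller means smoother), so the sketch below is an assumption along those lines.

    import torch

    # Mean absolute difference between consecutive frames of a clip,
    # a simple smoothness proxy (an assumption; the slide's exact
    # analysis is not given). frames: tensor of shape (T, 3, H, W).
    def mean_frame_difference(frames):
        return (frames[1:] - frames[:-1]).abs().mean().item()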