2010.01.25 - Developmentally inspired computational framework for embodied speech imitation (PhD presentation)

This thesis is concerned with the autonomous acquisition of speech production skills by a robotic system.
The acquisition should occur in interaction with a human tutor, making few or no assumptions about the vocabulary and language of interaction.

A particular target embodiment of the acquisition framework presented in this thesis is the humanoid robot ASIMO.
Because of its size and its limited knowledge of the world, a child's voice is probably the most appropriate voice for such an interactive system.

This means, however, that the acoustic properties of the tutor's voice are very different from the system's.
Consequently, the system has to address the correspondence problem in speech.

For this, inspired by findings in the development of speech skills in infants, we propose an interaction scheme involving a cooperative tutor that provides imitative feedback for simple utterances of the system.
It allows the robot to learn a probabilistic correspondence model, which lets the system associate configurations of its own vocal tract with the acoustic properties of the tutor's voice.
Using this correspondence model, the system can project a target tutor utterance into its motor space, making an imitation possible.

We also integrated this interaction scheme in an embodied speech structure acquisition framework, already used to teach and interact with the robot.
With this integration, we measure the tutor response, and the utterances to be imitated, in a previously trained perceptual space.
This is not only biologically more plausible, but also paves the way for an embodiment in the humanoid robot.

We also investigated a new speech synthesis algorithm, which operates in the acoustic domain and provides the system with a child-like voice.
Its architecture is a hybrid of a harmonic model and a channel vocoder, and uses a gammatone filter bank to produce the spectral representations.
For the control of the speech synthesizer in the context of imitation learning, a synergistic coding scheme based on the concept of motor primitives was investigated.
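To make the gammatone stage concrete, a single channel's impulse response can be sketched as below. The ERB bandwidth formula, fourth-order filter, and 24-channel log spacing are common textbook choices assumed here for illustration, not necessarily the exact parameters used in the thesis.

```python
import numpy as np

def gammatone_ir(fc, fs, duration=0.025, order=4):
    """Impulse response of a gammatone filter centred at fc (Hz).

    The bandwidth follows the usual ERB approximation; both it and
    the filter order are standard assumptions, not thesis parameters.
    """
    t = np.arange(int(duration * fs)) / fs
    erb = 24.7 + 0.108 * fc            # ERB approximation of auditory bandwidth
    b = 1.019 * erb                    # gammatone bandwidth factor
    g = t**(order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    return g / np.max(np.abs(g))       # normalize peak amplitude

# A bank of such filters, log-spaced over the speech range, yields the
# spectral representation that feeds the channel-vocoder stage.
bank = [gammatone_ir(fc, fs=16000) for fc in np.geomspace(100, 6000, 24)]
```

Filtering a waveform with each impulse response (e.g. via convolution) and taking the channel envelopes would give the spectral representation mentioned above.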

Slide notes:
  • There I had presented and evaluated a framework for synthesizing speech with a child’s voice.
    The ultimate goal was to use the framework to learn speech through interaction with a tutor.
    In the end, I’d shown you the first steps



  • Language assumptions:
    syllable structure
    number of vowels in the vowel system
    prosody

    Traditional HMM-based synthesis approaches are therefore not suitable.

  • - Explain the difficulties of working with a child’s voice
    - Motivate the need for the new technique
    - Articulatory synthesis: limited in the available voices and phoneme sets
    - The channel vocoder has been shown to work well given good spectral representations

    - Speech is the physical result of air being expelled from the lungs and passing through the vocal tract
    - Source-filter model of speech production
    - A source signal (larynx, vocal-tract constriction) is modulated by a vocal tract filter function
    - There are different ways of representing and deriving the vocal tract filter function

  • focus on the architecture and properties

    we tested for intelligibility and naturalness

  • Acoustic properties are different, BUT the meaning is the same
  • 1. Even if it were true, there is no known speech representation that would do the job
    2.

    Also mention the work of the group in Edinburgh, which performs spectral morphing between an adult and a child speaker by maximizing the likelihood of a given sequence

    Gros-Louis 2006 - interactive, differentiated and proximate responses increase production of more advanced utterances
    Goldstein 2003 -

  • Also mention the work of the group in Edinburgh, which performs spectral morphing between an adult and a child speaker by maximizing the likelihood of a given sequence

    M. Vaz, H. Brandl, F. Joublin, and C. Goerick, “Speech imitation with a child’s voice: addressing the correspondence problem,” accepted for the 13th Int. Conf. on Speech and Computer (SPECOM), 2009

  • \begin{split}
    p_1(t) &= F_1(t) \\
    p_2(t) &= F_2(t) - F_1(t) \\
    p_3(t) &= F_3(t) - F_1(t) \\
    p_{4,5,6}(t) &= \log( S( C_{1,2,3}(t), t ) )
    \end{split}
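Per frame, the parameter vector above can be computed roughly as follows. The function name and the scalar inputs (formant frequencies F1–F3 in Hz and spectral values sampled at three reference channels) are illustrative assumptions matching the equations, not an interface from the thesis.

```python
import numpy as np

def primitive_params(F1, F2, F3, S_c1, S_c2, S_c3):
    """Control parameters for one frame:
    p1..p3 are formant-based (F1 plus offsets of F2, F3 relative to F1),
    p4..p6 are log spectral energies at three reference channels."""
    return np.array([
        F1,              # p1: first formant
        F2 - F1,         # p2: second formant, relative to F1
        F3 - F1,         # p3: third formant, relative to F1
        np.log(S_c1),    # p4: log spectral value at channel C1
        np.log(S_c2),    # p5: log spectral value at channel C2
        np.log(S_c3),    # p6: log spectral value at channel C3
    ])
```

Applying this frame by frame over formant and spectral-envelope tracks would yield the time-varying parameter trajectories p(t).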

  • Why kNN?
    It makes no assumptions about the distribution of the elements of each class,
    which is important because the data is quite irregular.

    For a set of labels or vocal classes $C_j$ and an input feature vector $x$, we consider a neighbourhood $V$ of $x$ that contains exactly $K$ points.
    The posterior probability of class membership depends on the number of training points of class $C_j$ present in $V$, denoted by $K_j$:
    \begin{equation}
    p( C_j \mid x ) = \frac{K_j}{K}
    \end{equation}

    \begin{equation}
    \alpha = \frac{p( C_{j1} \mid x )}{ p( C_{j1} \mid x ) + p( C_{j2} \mid x ) }
    \end{equation}
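A minimal sketch of this kNN posterior follows; the class labels, feature dimensionality, and value of K are illustrative, and the Euclidean metric is an assumption.

```python
import numpy as np
from collections import Counter

def knn_posterior(train_X, train_y, x, k=5):
    """Posterior p(C_j | x) = K_j / K from the K nearest training points.

    No assumption is made about the class-conditional distributions,
    which suits irregular data."""
    dists = np.linalg.norm(train_X - x, axis=1)   # Euclidean distances (assumed metric)
    neighbours = train_y[np.argsort(dists)[:k]]   # labels of the K nearest points
    counts = Counter(neighbours)
    return {c: counts.get(c, 0) / k for c in set(train_y)}

def pairwise_alpha(post, c1, c2):
    """Interpolation weight between two competing vocal classes."""
    return post[c1] / (post[c1] + post[c2])
```

For example, a query point surrounded only by class-'a' training samples yields p(a|x) = 1 and a pairwise alpha of 1 against any other class.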
  • S_c is always better: the system benefits from an extended vocal repertoire.

    Trends:
    canonical vowels
    generalization isn’t working 100%: the morphing might be introducing some distortions

  • - The correspondence model has the form of a matrix, because the perceptual space is discrete
    - From a given input, the Azubi model

    C_{ij} = P( m_j \mid \lambda_i^{p} ) = \frac{ P( \lambda_i^p \mid m_j, D_j ) \, P( \lambda_i^p ) }{ P( m_j ) }

    M_{ij} = P( \lambda_i^p \mid m_j, D_j )

    [ \lambda^p_{1}, \dots, \lambda^p_{n} ] = \arg\max_{[\lambda^p] \in \mathcal{P}} P( [\lambda^p] \mid X_{tutor} )

    Add to the scheme that the system gets the phone models after they have been
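A toy sketch of using such a discrete correspondence matrix to project a recognized tutor sequence into motor space: rows index perceptual (tutor-space) classes, columns motor primitives, and each element approximates P(m_j | lambda_i^p). The matrix values and index convention below are invented for illustration, not learned from interaction.

```python
import numpy as np

# Toy correspondence matrix C[i, j] ~ P(m_j | lambda_i^p).
# Rows: perceptual classes lambda_i^p; columns: motor primitives m_j.
C = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
])

def project_to_motor(perceptual_sequence):
    """Map a recognized sequence of perceptual class indices to the
    most probable motor primitive for each element (argmax per row)."""
    return [int(np.argmax(C[i])) for i in perceptual_sequence]

print(project_to_motor([0, 1, 2]))  # -> [0, 1, 2]
```

In the full framework the tutor utterance is first decoded into the perceptual-class sequence (the argmax over phone-model sequences above); this per-element projection is only the final lookup step.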
  • - Correspondence model has the form of a matrix, because the perceptual space is discrete
    - from a given input, the Azubi model

    C_{ij} = P( m_j | lambda_i^{p} ) = frac{ P( lambda_i^p | m_j , Dj) , P(lambda_i^p) } { P(m_j) }

    M_{ij} = P( lambda_i^p | m_j , Dj)

    [ lambda^p_{1}, ... , lambda^p_{n} ] = argmax_{[lambda^p] in mathcal{P}} P( [lambda^p] | X_{tutor})

    add to scheme that the system gets the phone models after they have been
  • - Correspondence model has the form of a matrix, because the perceptual space is discrete
    - from a given input, the Azubi model

    C_{ij} = P( m_j | lambda_i^{p} ) = frac{ P( lambda_i^p | m_j , Dj) , P(lambda_i^p) } { P(m_j) }

    M_{ij} = P( lambda_i^p | m_j , Dj)

    [ lambda^p_{1}, ... , lambda^p_{n} ] = argmax_{[lambda^p] in mathcal{P}} P( [lambda^p] | X_{tutor})

    add to scheme that the system gets the phone models after they have been
  • - Correspondence model has the form of a matrix, because the perceptual space is discrete
    - from a given input, the Azubi model

    C_{ij} = P( m_j | lambda_i^{p} ) = frac{ P( lambda_i^p | m_j , Dj) , P(lambda_i^p) } { P(m_j) }

    M_{ij} = P( lambda_i^p | m_j , Dj)

    [ lambda^p_{1}, ... , lambda^p_{n} ] = argmax_{[lambda^p] in mathcal{P}} P( [lambda^p] | X_{tutor})

    add to scheme that the system gets the phone models after they have been
  • - Correspondence model has the form of a matrix, because the perceptual space is discrete
    - from a given input, the Azubi model

    C_{ij} = P( m_j | lambda_i^{p} ) = frac{ P( lambda_i^p | m_j , Dj) , P(lambda_i^p) } { P(m_j) }

    M_{ij} = P( lambda_i^p | m_j , Dj)

    [ lambda^p_{1}, ... , lambda^p_{n} ] = argmax_{[lambda^p] in mathcal{P}} P( [lambda^p] | X_{tutor})

    add to scheme that the system gets the phone models after they have been
  • - Correspondence model has the form of a matrix, because the perceptual space is discrete
    - from a given input, the Azubi model

    C_{ij} = P( m_j | lambda_i^{p} ) = frac{ P( lambda_i^p | m_j , Dj) , P(lambda_i^p) } { P(m_j) }

    M_{ij} = P( lambda_i^p | m_j , Dj)

    [ lambda^p_{1}, ... , lambda^p_{n} ] = argmax_{[lambda^p] in mathcal{P}} P( [lambda^p] | X_{tutor})

    add to scheme that the system gets the phone models after they have been
  • - Correspondence model has the form of a matrix, because the perceptual space is discrete
    - from a given input, the Azubi model

    C_{ij} = P( m_j | lambda_i^{p} ) = frac{ P( lambda_i^p | m_j , Dj) , P(lambda_i^p) } { P(m_j) }

    M_{ij} = P( lambda_i^p | m_j , Dj)

    [ lambda^p_{1}, ... , lambda^p_{n} ] = argmax_{[lambda^p] in mathcal{P}} P( [lambda^p] | X_{tutor})

    add to scheme that the system gets the phone models after they have been
  • - Correspondence model has the form of a matrix, because the perceptual space is discrete
    - from a given input, the Azubi model

    C_{ij} = P( m_j | lambda_i^{p} ) = frac{ P( lambda_i^p | m_j , Dj) , P(lambda_i^p) } { P(m_j) }

    M_{ij} = P( lambda_i^p | m_j , Dj)

    [ lambda^p_{1}, ... , lambda^p_{n} ] = argmax_{[lambda^p] in mathcal{P}} P( [lambda^p] | X_{tutor})

  • over-representation: more phone models than vowels
    1. some phonemes show only sparse activity
    2. some phone models are never active
    3. some are active all the time

    a whole subset of phone models is not covered
    - the primitives are only vowels

    some primitives show stronger dispersion than others, due to either
    - a non-uniform imitative response of the tutor to the vocal primitive
    - limitations of synthesizing a phoneme with only one spectral vector
    - the absence of any phone model that fully represents the imitative response
    - issues of over- or under-representation






  • which of the conclusions still hold here?
    revise the conclusions
  • also mention the work of the Edinburgh group, who perform spectral morphing between an adult and a child speaker by maximizing the likelihood of a given sequence

    M. Vaz, H. Brandl, F. Joublin, and C. Goerick, “Speech imitation with a child’s voice: addressing the correspondence problem,” Proc. 13th Int. Conf. on Speech and Computer (SPECOM), 2009


Transcript

  • 1. University of Minho Engineering School. Developmentally inspired computational framework for embodied speech imitation. Miguel Vaz (mvaz@dei.uminho.pt), Dep. Industrial Electronics, University of Minho, Portugal; Honda Research Institute Europe, Offenbach am Main, Germany. 25th January, Guimarães
  • 2. Long-term goal: verbal interaction with ASIMO 2
  • 3. Long-term goal: verbal interaction with ASIMO speech perception speech production meaning / language 2
  • 5. Constraints and specificities of the ASIMO platform 3
  • 6. Constraints and specificities of the ASIMO platform no pre-defined vocabulary 3
  • 7. Constraints and specificities of the ASIMO platform acquire speech in interaction  imitation  online learning no pre-defined vocabulary  unlabeled data minimize language assumptions  no corpus for system’s voice 3
  • 9. Constraints and specificities of the ASIMO platform acquire speech in interaction  imitation  online learning no pre-defined vocabulary  unlabeled data minimize language assumptions  no corpus for system’s voice system has child’s voice 3
  • 10. Constraints and specificities of the ASIMO platform acquire speech in interaction  imitation  online learning no pre-defined vocabulary  unlabeled data minimize language assumptions  no corpus for system’s voice synthesize child’s voice system has child’s voice 3
  • 11. Constraints and specificities of the ASIMO platform acquire speech in interaction  imitation  online learning no pre-defined vocabulary  unlabeled data minimize language assumptions  no corpus for system’s voice synthesize child’s voice system has child’s voice address correspondence problem 3
  • 12. outline synthesize child’s voice address correspondence problem 4
  • 14. outline synthesize child’s voice  vocoder using gammatone filter bank address correspondence problem 4
  • 15. outline synthesize child’s voice  vocoder using gammatone filter bank address correspondence problem  sensorimotor model trained with tutor imitative feedback  feature space  perceptual space 4
  • 16. Speech: source-filter model of speech production [figure: glottal airflow (source spectrum)  vocal tract (filter function, formant frequencies)  output from lips (output spectrum)] 5
  • 17. Spectral feature extraction with a gammatone filter bank [scheme and example: speech and pitch  Gammatone Filterbank  Envelope Extraction  Harmonic Structure Elimination; spectrograms of the word “zurück”] 6
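As a rough illustration of this analysis step, a gammatone channel can be simulated and a per-channel envelope extracted. This is a generic textbook sketch (4th-order gammatone with Glasberg & Moore ERB bandwidths), not the thesis' implementation, and the output magnitude is only a crude stand-in for the actual envelope extraction:

```python
import numpy as np

def gammatone_ir(fc, fs=16000, order=4, dur=0.025):
    """Impulse response of a 4th-order gammatone filter at centre
    frequency fc, with ERB bandwidth after Glasberg & Moore."""
    t = np.arange(int(dur * fs)) / fs
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)
    b = 1.019 * erb
    return t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)

def envelopes(signal, centre_freqs, fs=16000):
    """Filter the signal with each gammatone channel and take the
    magnitude of the output as a crude per-channel envelope."""
    return np.array([np.abs(np.convolve(signal, gammatone_ir(fc, fs), mode="same"))
                     for fc in centre_freqs])

env = envelopes(np.random.randn(400), [500.0, 1000.0, 2000.0])  # 3 channels x 400 samples
```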
  • 18. VOCODER-like synthesis algorithm with a gammatone filter bank 7
  • 19. VOCODER-like synthesis algorithm with a gammatone filter bank: hybrid architecture  channel vocoder for frication  harmonic model for voicing [figure: pitch and energy drive harmonic voicing synthesis through a voicing mask; white noise through the gammatone filter bank yields frication through a frication mask; both sample the spectral vectors] 7
  • 20. good naturalness for high- and low-pitch voices  good results in comparison to standard acoustic synthesis techniques  tested against MCEP-based synthesis 7
  • 21. good intelligibility  tested with the Modified Rhyme Test for German 7
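The hybrid idea on these slides can be caricatured in a few lines. This is only a sketch of the voiced/unvoiced split (harmonics of f0 with amplitudes sampled from a spectral envelope, versus envelope-scaled noise), with invented parameter names; it omits the masks, the gammatone shaping of the noise, and frame overlap-add:

```python
import numpy as np

def synth_frame(env_freqs, env_amps, f0, voiced, n, fs=16000):
    """One frame of the hybrid scheme: a sum of harmonics of f0 with
    amplitudes sampled from a spectral envelope when voiced, and
    envelope-scaled white noise when unvoiced (frication)."""
    t = np.arange(n) / fs
    if voiced:
        harmonics = np.arange(f0, fs / 2, f0)              # multiples of the pitch
        amps = np.interp(harmonics, env_freqs, env_amps)   # sample the envelope
        return sum(a * np.sin(2 * np.pi * f * t) for a, f in zip(amps, harmonics))
    return np.random.randn(n) * np.sqrt(np.mean(np.square(env_amps)))

# voiced frame: 200 Hz pitch, flat (made-up) envelope, 10 ms at 16 kHz
frame = synth_frame(np.array([100.0, 8000.0]), np.array([1.0, 1.0]), 200.0, True, 160)
```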
  • 22. Example from copy synthesis 8
  • 24. outline synthesize child’s voice  vocoder using gammatone filter bank address correspondence problem  sensorimotor model trained with tutor imitative feedback  feature space  perceptual space 9
  • 25. Correspondence problem [figure: matching the system’s “asimo” to the tutor’s “Asimo”] 10
  • 26. Correspondence problem in the literature 11
  • 27. Correspondence problem in the literature innate representations  [Marean1992, Kuhl1996, Minematsu2009]  labeled data in standard Voice Conversion systems 11
  • 28. Correspondence problem in the literature innate representations  [Marean1992, Kuhl1996, Minematsu2009]  labeled data in standard Voice Conversion systems important information from feedback of parent / tutor  imitation [Papousek1992, Girolametto1999]  reward, stimulation  distinctive maternal responses [Gros-Louis2006, Goldstein2003] 11
  • 29. Correspondence problem in the literature innate representations  [Marean1992, Kuhl1996, Minematsu2009]  labeled data in standard Voice Conversion systems important information from feedback of parent / tutor  imitation [Papousek1992, Girolametto1999]  reward, stimulation  distinctive maternal responses [Gros-Louis2006, Goldstein2003]  mutual imitation games guide acquisition of vowels  [Miura2007, Kanda2009]  tutor imitation as reward signal in RL framework  [Howard2007, Messum2007] 11
  • 30. We use tutor’s imitative feedback 12
  • 31. We use tutor’s imitative feedback cooperative tutor (always) imitates 12
  • 32. We use tutor’s imitative feedback: cooperative tutor (always) imitates [figure: motor commands  vocal tract model  tutor imitative response  cochlear model  sensorimotor model; probabilistic mapping motor  tutor’s voice, motor repertoire] 12
  • 33. We use tutor’s imitative feedback cooperative tutor (always) imitates probabilistic mapping  tutor’s voice motor repertoire innate vocal repertoire  vowels (primitives)  8 vectors  10 year old boy  TIDIGITS corpus  formant-annotated 12
  • 34. We use tutor’s imitative feedback: cooperative tutor (always) imitates; probabilistic mapping motor  tutor’s voice; innate vocal repertoire  vowels (primitives)  8 vectors  10 year old boy  TIDIGITS corpus  formant-annotated; morphing to combine primitives: p_c = p_i + ((c − c_i)/(c_j − c_i)) (p_j − p_i), q_c = q_i + ((c − c_i)/(c_j − c_i)) (q_j − q_i)  assumption: intermediate states will sound “inbetween” 12
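The morphing rule on the slide is plain linear interpolation between two primitive vectors; a sketch with hypothetical values:

```python
def morph(p_i, p_j, c_i, c_j, c):
    """Linear morphing between two primitive vectors p_i and p_j:
    p_c = p_i + (c - c_i)/(c_j - c_i) * (p_j - p_i)."""
    w = (c - c_i) / (c_j - c_i)
    return [a + w * (b - a) for a, b in zip(p_i, p_j)]

# halfway between two (made-up) two-dimensional primitives
p_c = morph([0, 0], [2, 4], 0, 1, 0.5)
```

At c = c_i this returns p_i, at c = c_j it returns p_j, and intermediate coordinates give the "inbetween" states the slide assumes.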
  • 35. Training phase m1 m2 m3 13
  • 36. Training phase vocal primitive m1 m2 m3 13
  • 37. Training phase vocal primitive m1 m2 m3 tutor imitative response 13
  • 38. Training phase: vocal primitive (m1, m2, m3)  tutor imitative response  feature space: p_1(t) = F_1(t), p_2(t) = F_2(t) − F_1(t), p_3(t) = F_3(t) − F_1(t), p_{4,5,6}(t) = log(S(C_{1,2,3}(t), t)) 13
  • 40. Training phase: build model of the response to each primitive in the feature space 13
  • 41. Imitation phase 14
  • 42. Imitation phase tutor target utterance 14
  • 43. Imitation phase: tutor target utterance  k-Nearest Neighbours class posterior probabilities: p(C_j|x) = K_j / K, where K_j is the number of points of class C_j in a neighbourhood V(x) with K elements 14
  • 44. Imitation phase: k-NN class posteriors  population coding 14
  • 45. Imitation phase: k-NN class posteriors  population coding  spectral output, with α = p(C_{j1}|x) / (p(C_{j1}|x) + p(C_{j2}|x)) 14
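A compact sketch of this classification step under invented data: k-NN posteriors as class fractions among the K nearest points, and the population-coding coefficient α computed from the two most active classes.

```python
import numpy as np

def knn_posteriors(train_x, train_y, x, k, n_classes):
    """k-NN class posteriors: p(C_j | x) = K_j / K, where K_j is the
    number of points of class C_j among the K nearest neighbours of x."""
    nearest = np.argsort(np.linalg.norm(train_x - x, axis=1))[:k]
    return np.bincount(train_y[nearest], minlength=n_classes) / k

def morph_coefficient(p1, p2):
    """Population coding over the two most active classes:
    alpha = p(C_j1 | x) / (p(C_j1 | x) + p(C_j2 | x))."""
    return p1 / (p1 + p2)
```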
  • 46. Imitation example 15
  • 47. Imitation example [figure: target utterance spectrogram; classification posteriors p(C_j|x) over time; morphed primitives; imitation spectrogram] 15
  • 49. Imitation example [as above, with pitch + energy added to the imitation] 15
  • 50. other examples adult imitation aia aua papa 16
  • 57. Subjective evaluation of imitation: experiment  how similar is the content of the two sounds?  1 (different) ... 5 (same)  24 test subjects; stimuli  3 systems x 13 phoneme pairs <human, imitated> (O, e, @, o, a, E, i, U, Y, 9, aI, aU, OI)  8 pairs <human, control>  supervised activation; subsets: S3 = {a, i, U}, S5 = {a, i, U, E, O}, S8 = {a, i, U, E, O, e, @, o} 17
  • 58. outline synthesize child’s voice  vocoder using gammatone filter bank address correspondence problem  sensorimotor model trained with tutor imitative feedback  feature space  perceptual space 18
  • 59. Integration with an existing speech acquisition system (Azubi) [figure: phone/syllable/word model pools and language models; phone recognizer, syllable spotter, word spotter; score normalization, phonotactic constraints, symbol grounding] 19
  • 60. Goals:  integrate with perceptual model  make it more appropriate to use in real scenarios 19
  • 61. Azubi model [Brandl et al, 2008]: acquires speech  phones, syllables, words  already used in interaction scenarios [Bolder et al, 2008, etc] 19
  • 63. perceptual phone models λp_1 ... λp_5 19
  • 64. correspondence model trained at the phone model level [figure: utterance generation  correspondence mapping  primitive activity  synergistic activity contour encoder  synthesizer, over the production primitives] 19
  • 65. Training phase: correspondence model [figure: perceptual phone models λp_1 ... λp_5 and vocal primitives m1, m2, m3] 20
  • 66. Training phase: vocal primitive 20
  • 67. Training phase: segmentation and classification of the tutor imitation, [λp_1, ..., λp_n] = argmax_{[λp] ∈ P} P([λp] | X_tutor) 20
  • 68. Training phase: update of the probabilistic mapping, M_ij = P(λp_i | m_j, D_j), C_ij = P(m_j | λp_i) = P(λp_i | m_j, D_j) P(m_j) / P(λp_i) 20
  • 70. Imitation phase [figure: perceptual phone models λp_1 ... λp_5 and vocal primitives m1, m2, m3] 21
  • 71. Imitation phase: target tutor utterance  segmentation, [λp_1, ..., λp_n] = argmax_{[λp] ∈ P} P([λp] | X_tutor) 21
  • 72. Imitation phase: vocal primitives’ posterior probabilities 21
  • 74. Imitation phase: population coding  gaussian activation contours  spectral output 21
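A toy version of this imitation pipeline, with an invented 2x2 correspondence matrix and invented primitive spectra: each recognized phone model indexes a row of C to obtain primitive posteriors, and the spectral output is the posterior-weighted mix of the primitives' spectra (the Gaussian activation contours are omitted from this sketch).

```python
import numpy as np

def imitate(phone_seq, C, primitive_spectra):
    """phone_seq: recognized phone-model indices, one per segment.
    C[i, j] = P(m_j | lambda_i^p): correspondence matrix.
    Output: per-segment spectral vectors as the activity-weighted
    mix of the primitives' spectra (population coding)."""
    activities = C[np.array(phone_seq)]        # (segments, primitives)
    return activities @ primitive_spectra      # (segments, spectral dims)

C = np.array([[0.9, 0.1],
              [0.2, 0.8]])                     # 2 phone models x 2 primitives (invented)
spectra = np.array([[1.0, 0.0],
                    [0.0, 1.0]])               # each primitive's spectral vector (invented)
out = imitate([0, 1, 1], C, spectra)
```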
  • 75. Experimental results: correspondence matrix [figure: vocal primitives × phone models]  “child-directed”-like speech  ±1 min of interaction  15 imitations of each vocal primitive 22
  • 79. Imitation example 23
  • 80. Imitation example: “mama” [figure: input spectrum  population coding  spectral output] 23
  • 83. Summary: a framework in which speech imitation becomes possible  speech synthesis technique for a child’s voice  channel vocoder meets gammatone filterbank  evaluation  addressing the correspondence problem  probabilistic mapping between the tutor’s voice and the system’s motor space  tutor feedback interpreted in  feature space  unsupervisedly acquired perceptual space  integration in an online speech acquisition framework (Azubi)  paves the way for usage on the robot 24
  • 84. Publications "Learning from a tutor: embodied speech acquisition and imitation learning" M.Vaz, H.Brandl, F.Joublin, C.Goerick Proc. IEEE Intl. Conf. on Development and Learning 2009, Shanghai, China "Speech imitation with a child’s voice: addressing the correspondence problem" M.Vaz, H.Brandl, F.Joublin, C.Goerick Proc. SPECOM’2009, St Petersburg, Russia "Linking Perception and Production: System Learns a Correspondence Between its Own Voice and the Tutor's" M.Vaz, H.Brandl, F.Joublin, C.Goerick, Speech and Face to Face Communication Workshop in memory of Christian Benoît: GIPSA- lab, Grenoble, Université Stendhal, France "Speech structure acquisition for interactive systems" H.Brandl, M.Vaz, F.Joublin, C.Goerick Speech and Face to Face Communication Workshop in memory of Christian Benoît: GIPSA- lab, Grenoble, Université Stendhal, France "Listen to the Parrot: Demonstrating the Quality of Online Pitch and Formant Extraction via Feature-based Resynthesis" M.Heckmann, C.Glaeser, M.Vaz, T.Rodemann, F.Joublin, C. Goerick Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems 2008, Nice, France 25
  • 85. Thank you Dr. Estela Bicho Dr. Frank Joublin Dr. Wolfram Erlhagen Colleagues @ Honda Research Institute Colleagues @ DEI Family Friends 26