University of Minho
Engineering School

Developmentally inspired computational framework for embodied speech imitation
(PhD presentation)
Long-term goal:
verbal interaction with ASIMO

    speech perception
    speech production
Constraints and specificities of the ASIMO platform

    no pre-defined vocabulary
    acquire speech in interaction
outline

    synthesize child’s voice
        vocoder using gammatone filter bank
    address correspondence problem
        sensor...
Speech: source-filter model of speech production

    [figure: waveform and spectrogram, time (s)]
Spectral feature extraction with a gammatone filter bank

    scheme ...
VOCODER-like synthesis algorithm with a gammatone filter bank

hybrid architecture
    channel vocoder for frication
    harmonic model for voiced speech
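The hybrid analysis/resynthesis idea can be sketched as a minimal channel vocoder: a bank of gammatone-like band-pass filters extracts per-channel envelopes, which then modulate a carrier (noise for frication, a pulse train for voiced frames). This is an illustrative toy, not the thesis implementation; the filter order, centre frequencies, and the Glasberg–Moore ERB formula are standard textbook values, not taken from the slides.

```python
import numpy as np

def gammatone_ir(fc, fs, n=4, dur=0.025):
    """4th-order gammatone impulse response at centre frequency fc (Hz)."""
    t = np.arange(int(dur * fs)) / fs
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)        # Glasberg & Moore ERB
    b = 1.019 * erb                                 # bandwidth parameter
    ir = t ** (n - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    return ir / np.sqrt(np.sum(ir ** 2))            # unit-energy normalisation

def channel_vocoder(x, fs, centres, voiced=False, f0=250.0):
    """Analyse x with a gammatone bank, then resynthesise by modulating a
    carrier (noise or pulse train) with each channel's smoothed envelope."""
    if voiced:
        carrier = np.zeros(len(x))
        carrier[::int(fs / f0)] = 1.0               # pulse train at f0
    else:
        carrier = np.random.randn(len(x))           # noise for frication
    win = np.hanning(int(0.01 * fs))                # ~10 ms envelope smoother
    win /= win.sum()
    y = np.zeros(len(x))
    for fc in centres:
        h = gammatone_ir(fc, fs)
        band = np.convolve(x, h, mode="same")       # analysis band
        env = np.convolve(np.abs(band), win, mode="same")
        y += np.convolve(carrier, h, mode="same") * env
    return y

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)                     # toy input signal
y = channel_vocoder(x, fs, centres=[250, 500, 1000, 2000, 4000])
```

A real system would use many more channels and switch the carrier per frame based on a voicing decision; here the voiced/unvoiced choice is a single flag for brevity.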
Example from copy synthesis
outline

    synthesize child’s voice
        vocoder using gammatone filter bank
    address correspondence problem
        sensor...
Correspondence problem

    asimo   ?   Asimo
Correspondence problem in the literature

innate representations
    [Marean1992, Kuhl1996, Minematsu2009]
    labeled...
We use tutor’s imitative feedback

    cooperative tutor (always) imitates
    probabilistic mapping
        tutor’s voice → motor space
Training phase

    vocal primitive:  m1  m2  m3
    tutor imitative response
Imitation phase

    tutor target utterance
    k-Nearest Neighbours
    class ...
Imitation example

    target utterance ...
other examples

    adult    imitation

    aia
    aua
    papa
Subjective evaluation of imitation

experiment
    how similar is the content of the two sounds?
        1 (differe...
outline

    synthesize child’s voice
        vocoder using gammatone filter bank
    address correspondence problem
        sensor...
Integration with an existing speech acquisition system (Azubi)

Goals:
    integrate with perceptual model
    ...
Training phase: correspondence model

    vocal primitive
    phone models λ^p_1 … λ^p_5
    segmentation
    classific...
Imitation phase

    target tutor utterance
    phone models λ^p_1 … λ^p_5
    segmentation:  [λ^p_1, ..., λ^p_n] = argmax_{[λ^p] ∈ P} P([λ^p] | X_tutor)
Experimental results

    Correspondence matrix

phone models
    “child-...
Imitation example: “mama”

    input spectrum
    population coding
    spectral output
Summary

Framework in which speech imitation becomes possible

    speech synthesis technique to synthesize child’s voice
    ...
Publications

“Learning from a tutor: embodied speech acquisition and imitation learning”
M. Vaz, H. Brandl, F. Joublin, C. Goerick, ...
Thank you

Dr. Estela Bicho
Dr. Frank Joublin
Dr. Wolfram Erlhagen

Colleagues @ Honda Research Institute
Colleagues @ ...
2010.01.25 - Developmentally inspired computational framework for embodied speech imitation (PhD presentation)

This thesis is concerned with the autonomous acquisition of speech production skills by a robotic system.
The acquisition should occur in interaction with a human tutor, making few or no assumptions about the vocabulary and language of interaction.

A particular target embodiment of the acquisition framework presented in this thesis is the humanoid robot ASIMO.
Because of its size and its limited knowledge of the world, a child's voice is probably the most appropriate type of voice for such an interactive system.

This means, however, that the acoustic properties of the tutor's voice are very different from the system's.
Consequently, the system has to address the correspondence problem in speech.

For this, inspired by findings in the development of speech skills in infants, we propose an interaction scheme involving a cooperative tutor that provides imitative feedback for simple utterances of the system.
This allows the robot to learn a probabilistic correspondence model, which lets the system associate configurations of its own vocal tract with the acoustic properties of the tutor's voice.
Using this correspondence model, the system can project a target tutor utterance into its motor space, making an imitation possible.

We also integrated this interaction scheme in an embodied speech structure acquisition framework, already used to teach and interact with the robot.
With this integration, the tutor's responses and the utterances to be imitated are measured in a previously trained perceptual space.
This is not only biologically more plausible, but also paves the way for an embodiment in the humanoid robot.

We also investigated a new speech synthesis algorithm, which operates in the acoustic domain and provides the system with a child-like voice.
Its architecture is a hybrid of a harmonic model and a channel vocoder, and uses a gammatone filter bank to produce the spectral representations.
For controlling the speech synthesizer in the context of imitation learning, we explored a synergistic coding scheme based on the concept of motor primitives.

  • There I had presented and evaluated a framework for synthesizing speech with a child’s voice.
    The ultimate goal was to use the framework to learn speech through interaction with a tutor.
    In the end, I had shown the first steps.



  • language assumptions:
    syllable structure
    number of vowels in the vowel system
    prosody

    traditional HMM synthesis approaches are not suitable

  • - explain the difficulties of working with a child’s voice
    - motivate the need for the new technique
    - articulatory synthesis: limited in voices and phoneme sets
    - the VOCODER has been shown to work well given good spectral representations

    - speech is the physical result of air being expelled from the lungs and passing through the vocal tract
    - source-filter model of speech production
    - a source signal (larynx, vocal tract constriction) is modulated by a vocal tract filter function
    - different ways of representing and deriving the vocal tract filter function
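The source-filter view in this note can be illustrated in a few lines: a glottal source (a pulse train, for voiced speech) is passed through resonators standing in for the vocal tract filter function. A toy sketch with made-up /a/-like formant frequencies and bandwidths, assuming SciPy is available:

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000
f0 = 220.0                        # toy glottal pitch (Hz)
n = fs // 2                       # half a second of samples

# Source: impulse train modelling voiced glottal excitation
source = np.zeros(n)
source[::int(fs / f0)] = 1.0

def resonator(freq, bw, fs):
    """Two-pole digital resonator at the given centre frequency/bandwidth."""
    r = np.exp(-np.pi * bw / fs)
    theta = 2 * np.pi * freq / fs
    a = [1.0, -2 * r * np.cos(theta), r * r]   # poles inside the unit circle
    return [1.0 - r], a                         # rough gain normalisation

# Filter: cascade of resonators approximating the vocal tract transfer function
y = source
for freq, bw in [(700, 100), (1200, 120), (2600, 150)]:   # assumed formants
    b, a = resonator(freq, bw, fs)
    y = lfilter(b, a, y)
```

Swapping the pulse train for white noise gives an unvoiced source; the same filter cascade then produces whispered-sounding output, which is the essence of the source/filter separation.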

  • focus on the architecture and properties

    we tested for intelligibility and naturalness

  • acoustic properties differ, but the meaning is the same
  • 1. even if it were true, there is no known speech representation that would do the job
    2.

    also mention the work of the Edinburgh group, who perform spectral morphing between an adult and a child speaker by maximizing the likelihood of a given sequence

    Gros-Louis 2006 - interactive, differentiated and proximate responses increase the production of more advanced utterances
    Goldstein 2003 -

  • also mention the work of the Edinburgh group, who perform spectral morphing between an adult and a child speaker by maximizing the likelihood of a given sequence

    M. Vaz, H. Brandl, F. Joublin, and C. Goerick, “Speech imitation with a child’s voice: addressing the correspondence problem,” accepted for the 13th Int. Conf. on Speech and Computer - SPECOM, 2009


  • \begin{split}
    p_1(t) & = F_1(t) \\
    p_2(t) & = F_2(t) - F_1(t) \\
    p_3(t) & = F_3(t) - F_1(t) \\
    p_{4,5,6}(t) & = \log( S( C_{1,2,3}(t), t ) )
    \end{split}
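These primitive equations translate directly into code: the first three controls come from the formant tracks (F1 and the F2/F3 offsets), the last three from log spectral energies sampled at three channels. A sketch with hypothetical array shapes (formant tracks `F` as 3×T, spectrum `S` as bins×T, channel indices `C`), since the slides do not fix the data layout:

```python
import numpy as np

def motor_primitives(F, S, C, t_idx):
    """Map formant tracks F (3 x T) and spectrum S (bins x T), sampled at
    channel indices C (3,), to the six control primitives p_1..p_6."""
    p = np.empty(6)
    p[0] = F[0, t_idx]                # p1 = F1(t)
    p[1] = F[1, t_idx] - F[0, t_idx]  # p2 = F2(t) - F1(t)
    p[2] = F[2, t_idx] - F[0, t_idx]  # p3 = F3(t) - F1(t)
    p[3:6] = np.log(S[C, t_idx])      # p4..p6 = log spectral energies
    return p

# toy data: one frame with /a/-like formants and a flat unit spectrum
F = np.array([[700.0], [1200.0], [2600.0]])
S = np.ones((8, 1))                   # log(1) = 0 energy in every bin
C = np.array([1, 3, 5])               # assumed channel indices
p = motor_primitives(F, S, C, t_idx=0)
```

Encoding F2 and F3 as offsets from F1 keeps the primitives loosely decorrelated, which is presumably the point of this synergistic parameterization.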

  • why kNN?
    no assumptions on the distribution of the elements of each class
    important because the data are quite irregular

    For a set of labels or vocal classes $C_j$ and an input feature vector $x$, we consider a neighbourhood $V$ of $x$ that contains exactly $K$ points.
    The posterior probability of class membership depends on the number of training points of class $C_j$ present in $V$, denoted by $K_j$:
    \begin{equation}
    p( C_j | x ) = \frac{K_j}{K}
    \end{equation}

    \alpha = \frac{p( C_{j1} | x )}{ p( C_{j1} | x ) + p( C_{j2} | x )}
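The kNN posterior in this note is just a neighbour count: find the K nearest training points and take class fractions. A minimal sketch with toy 1-D data (the function name and Euclidean metric are illustrative choices, not from the slides):

```python
import numpy as np

def knn_posteriors(train_X, train_y, x, K=5):
    """p(C_j | x) = K_j / K over the K nearest training points (Euclidean)."""
    d = np.linalg.norm(train_X - x, axis=1)       # distance to every point
    neighbours = train_y[np.argsort(d)[:K]]       # labels of the K nearest
    return {c: float(np.mean(neighbours == c)) for c in np.unique(train_y)}

# toy data: two well-separated 1-D clusters
X = np.array([[0.0], [0.1], [0.2], [1.0], [1.1]])
y = np.array([0, 0, 0, 1, 1])
post = knn_posteriors(X, y, np.array([0.15]), K=3)   # query near cluster 0

# blending weight between the two most probable classes, as in the note
alpha = post[0] / (post[0] + post[1])
```

Because the estimate makes no parametric assumption about each class's distribution, it tolerates the irregular clusters the note mentions; the price is that K must be chosen against the local data density.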
  • S_c is always better
    the system benefits from an extended vocal repertoire

    trends:
    canonical vowels
    generalization isn’t working 100%: morphing might be introducing some distortions


  • add to scheme that the system gets the phone models after they have been

    C_{ij} = P( m_j \mid \lambda_i^{p} ) = \frac{ P( \lambda_i^p \mid m_j , D_j ) \, P( m_j ) }{ P( \lambda_i^p ) }

    M_{ij} = P( \lambda_i^p \mid m_j , D_j )

    [ \lambda^p_{1}, \ldots , \lambda^p_{n} ] = \operatorname{argmax}_{[\lambda^p] \in \mathcal{P}} P( [\lambda^p] \mid X_{tutor} )

  • - The correspondence model has the form of a matrix, because the perceptual space is discrete
    - from a given input, the Azubi model

    C_{ij} = P( m_j \mid \lambda_i^{p} ) = \frac{ P( \lambda_i^p \mid m_j , D_j ) \, P( m_j ) }{ P( \lambda_i^p ) }

    M_{ij} = P( \lambda_i^p \mid m_j , D_j )

    [ \lambda^p_{1}, \ldots , \lambda^p_{n} ] = \operatorname{argmax}_{[\lambda^p] \in \mathcal{P}} P( [\lambda^p] \mid X_{tutor} )

    add to scheme that the system gets the phone models after they have been
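Because the perceptual space is discrete, the correspondence matrix can be maintained as a normalized count table, applying Bayes' rule to turn the likelihoods M_{ij} into posteriors C_{ij}. The following is a hedged sketch under simplifying assumptions (Laplace-smoothed co-occurrence counts, uniform prior over primitives); the variable names and sizes are illustrative, not taken from the system:

```python
import numpy as np

# Hypothetical sizes: n_p phone models (lambda^p_i), n_m vocal primitives (m_j).
n_p, n_m = 5, 3

# Laplace-smoothed co-occurrence counts between phone models and primitives.
counts = np.ones((n_p, n_m))

def update(counts, phone_idx, primitive_idx):
    """Accumulate one observed (phone model, primitive) co-occurrence."""
    counts[phone_idx, primitive_idx] += 1

def correspondence(counts, prior_m=None):
    """C[i, j] ~ P(m_j | lambda^p_i) via Bayes' rule over the count table."""
    M = counts / counts.sum(axis=0, keepdims=True)    # M[i, j] ~ P(lambda^p_i | m_j)
    if prior_m is None:                               # uniform prior over primitives
        prior_m = np.full(counts.shape[1], 1.0 / counts.shape[1])
    joint = M * prior_m                               # ~ P(lambda^p_i, m_j)
    return joint / joint.sum(axis=1, keepdims=True)   # normalize over m_j

update(counts, phone_idx=0, primitive_idx=1)
C = correspondence(counts)
print(C[0])   # posterior over primitives for phone model 0
```

After a single observed co-occurrence of phone model 0 with primitive 1, the posterior row `C[0]` already leans toward primitive 1; repeated imitations sharpen the mapping.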

















  • over-representation: more models than vowels
    1. there are some phonemes for which there is only sparse activity
    2. some phone models are never active
    3. some are active all of the time

    the whole subset is not covered
    - primitives are only vowels

    some primitives show a stronger dispersion than others
    possible causes:
    - non-uniform imitative response of the tutor to the vocal primitive
    - limitations of synthesizing a phoneme with only one spectral vector
    - the inexistence of any phone model fully representing the imitative response
    - issues of over- or under-representation






  • which conclusions hold here?
    revisit the conclusions
  • also mention the work of the group in Edinburgh, who perform spectral morphing between an adult and a child speaker by maximizing the likelihood of a given sequence

    M. Vaz, H. Brandl, F. Joublin, and C. Goerick, “Speech imitation with a child’s voice: addressing the correspondence problem,” accepted for the 13th Int. Conf. on Speech and Computer (SPECOM), 2009.

  • 2010.01.25 - Developmentally inspired computational framework for embodied speech imitation
 (PhD presentation)

    1. 1. University of Minho Engineering School Developmentally inspired computational framework for embodied speech imitation Miguel Vaz mvaz@dei.uminho.pt Dep. Industrial Electronics Honda Research Institute Europe University of Minho Offenbach am Main Portugal Germany 25th January, Guimarães
    2. 2. Long-term goal: verbal interaction with ASIMO 2
    3. 3. Long-term goal: verbal interaction with ASIMO speech perception speech production meaning / language 2
    4. 4. Long-term goal: verbal interaction with ASIMO speech perception speech production meaning / language 2
    5. 5. Constraints and specificities of the ASIMO platform 3
    6. 6. Constraints and specificities of the ASIMO platform no pre-defined vocabulary 3
    7. 7. Constraints and specificities of the ASIMO platform acquire speech in interaction  imitation  online learning no pre-defined vocabulary  unlabeled data minimize language assumptions  no corpus for system’s voice 3
    8. 8. Constraints and specificities of the ASIMO platform acquire speech in interaction  imitation  online learning no pre-defined vocabulary  unlabeled data minimize language assumptions  no corpus for system’s voice 3
    9. 9. Constraints and specificities of the ASIMO platform acquire speech in interaction  imitation  online learning no pre-defined vocabulary  unlabeled data minimize language assumptions  no corpus for system’s voice system has child’s voice 3
    10. 10. Constraints and specificities of the ASIMO platform acquire speech in interaction  imitation  online learning no pre-defined vocabulary  unlabeled data minimize language assumptions  no corpus for system’s voice synthesize child’s voice system has child’s voice 3
    11. 11. Constraints and specificities of the ASIMO platform acquire speech in interaction  imitation  online learning no pre-defined vocabulary  unlabeled data minimize language assumptions  no corpus for system’s voice synthesize child’s voice system has child’s voice address correspondence problem 3
    12. 12. outline synthesize child’s voice address correspondence problem 4
    13. 13. outline synthesize child’s voice address correspondence problem 4
    14. 14. outline synthesize child’s voice  vocoder using gammatone filter bank address correspondence problem 4
    15. 15. outline synthesize child’s voice  vocoder using gammatone filter bank address correspondence problem  sensorimotor model trained with tutor imitative feedback  feature space  perceptual space 4
    16. 16. Speech: source-filter model of speech production time (s) time (s) glottal airflow vocal tract output from lips filter function dB dB dB Hz Hz Hz source spectrum output spectrum formant frequencies 5
    17. 17. Spectral feature extraction with a gammatone filter bank scheme example 8 3 freq (kHz) Envelope Harmonic Gammatone 1 Extraction Structure Filterbank 0.4 Elimination 0.1 0.25 0.5 time (s) 8 3 freq (kHz) speech 1 ... ... 0.4 0.1 8 3 freq (kHz) pitch 1 0.4 0.1 zur¨ ck u 6
    18. 18. VOCODER-like synthesis algorithm with a gammatone filter bank 7
    19. 19. VOCODER-like synthesis algorithm with a gammatone filter bank hybrid architecture  channel vocoder for frication  harmonic model for voicing voicing mask harmonic voicing pitch energy sampling spectral vectors synthesis white gammatone noise filter bank frication frication mask 7
    20. 20. VOCODER-like synthesis algorithm with a gammatone filter bank hybrid architecture  channel vocoder for frication  harmonic model for voicing voicing mask harmonic voicing pitch energy sampling good naturalness for high- and low- spectral pitch voices vectors synthesis  good results in comparison to standard white gammatone acoustic synthesis techniques noise filter bank frication  tested against MCEP based synthesis frication mask 7
    21. 21. VOCODER-like synthesis algorithm with a gammatone filter bank hybrid architecture  channel vocoder for frication  harmonic model for voicing voicing mask harmonic voicing pitch energy sampling good naturalness for high- and low- spectral pitch voices vectors synthesis  good results in comparison to standard white gammatone acoustic synthesis techniques noise filter bank frication  tested against MCEP based synthesis frication mask good intelligibility  tested with Modified Rhyme Test for german 7
    22. 22. Example from copy synthesis 8
    23. 23. Example from copy synthesis 8
    24. 24. outline synthesize child’s voice  vocoder using gammatone filter bank address correspondence problem  sensorimotor model trained with tutor imitative feedback  feature space  perceptual space 9
    25. 25. Correspondence problem ? asimo Asimo 10
    26. 26. Correspondence problem in the literature 11
    27. 27. Correspondence problem in the literature innate representations  [Marean1992, Kuhl1996, Minematsu2009]  labeled data in standard Voice Conversion systems 11
    28. 28. Correspondence problem in the literature innate representations  [Marean1992, Kuhl1996, Minematsu2009]  labeled data in standard Voice Conversion systems important information from feedback of parent / tutor  imitation [Papousek1992, Girolametto1999]  reward, stimulation  distinctive maternal responses [Gros-Louis2006, Goldstein2003] 11
    29. 29. Correspondence problem in the literature innate representations  [Marean1992, Kuhl1996, Minematsu2009]  labeled data in standard Voice Conversion systems important information from feedback of parent / tutor  imitation [Papousek1992, Girolametto1999]  reward, stimulation  distinctive maternal responses [Gros-Louis2006, Goldstein2003]  mutual imitation games guide acquisition of vowels  [Miura2007, Kanda2009]  tutor imitation as reward signal in RL framework  [Howard2007, Messum2007] 11
    30. 30. We use tutor’s imitative feedback 12
    31. 31. We use tutor’s imitative feedback cooperative tutor (always) imitates 12
    32. 32. We use tutor’s imitative feedback cooperative tutor (always) imitates vocal tract model probabilistic mapping motor  tutor’s voice motor repertoire commands tutor cochlear model sensory- imitative motor response model 12
    33. 33. We use tutor’s imitative feedback cooperative tutor (always) imitates probabilistic mapping  tutor’s voice motor repertoire innate vocal repertoire  vowels (primitives)  8 vectors  10 year old boy  TIDIGITS corpus  formant-annotated 12
    34. 34. We use tutor’s imitative feedback cooperative tutor (always) imitates probabilistic mapping  tutor’s voice motor repertoire 0 p0 p1 pc p2 p3 p4 innate vocal repertoire  vowels (primitives) S(α, c) c1 α  8 vectors c c2 c3  10 year old boy  TIDIGITS corpus 1 q4 q0 q1 qc q2 q3  formant-annotated p c = pi + cj −ci (pj − pi ) c−ci morphing to combine primitives  assumption: qc = qi + cj −ci (qj − qi ) c−ci  intermediate states will sound “inbetween” 12
    35. 35. Training phase m1 m2 m3 13
    36. 36. Training phase vocal primitive m1 m2 m3 13
    37. 37. Training phase vocal primitive m1 m2 m3 tutor imitative response 13
    38. 38. Training phase vocal primitive m1 m2 m3 tutor imitative response feature space p1 (t) = F1 (t) p2 (t) = F2 (t) − F1 (t) p3 (t) = F3 (t) − F1 (t) p{4,5,6} (t) = log(S(C{1,2,3} (t), t)) 13
    39. 39. Training phase vocal primitive m1 m2 m3 tutor imitative response feature space p1 (t) = F1 (t) p2 (t) = F2 (t) − F1 (t) p3 (t) = F3 (t) − F1 (t) p{4,5,6} (t) = log(S(C{1,2,3} (t), t)) 13
    40. 40. Training phase vocal primitive m1 m2 m3 tutor imitative response feature space build model of p1 (t) = F1 (t) response to p2 (t) = F2 (t) − F1 (t) primitive p3 (t) = F3 (t) − F1 (t) p{4,5,6} (t) = log(S(C{1,2,3} (t), t)) 13
    41. 41. Imitation phase 14
    42. 42. Imitation phase tutor target utterance 14
    43. 43. Imitation phase tutor target utterance k-Nearest Neighbours class Kj posterior p(Cj |x) = K probabilities Kj - number of points of class Cj in a neighbourhood V (x) with K elements 14
    44. 44. Imitation phase tutor target utterance k-Nearest Neighbours class Kj posterior p(Cj |x) = K probabilities Kj - number of points of class Cj in a neighbourhood V (x) with K elements population coding 14
    45. 45. Imitation phase tutor target utterance k-Nearest Neighbours class Kj posterior p(Cj |x) = K probabilities Kj - number of points of class Cj in a neighbourhood V (x) with K elements population coding ... spectral p(Cj1 |x) α= output ... p(Cj1 |x) + p(Cj2 |x) 14
    46. 46. Imitation example 15
    47. 47. Imitation example target utterance classification 8000 0.8 P(C|x) 0.6 0.4 p(Cj |x) 3000 0.2 0 0.25 0.5 0.75 1 freq (Hz) time (s) 8000 1000 3000 morphed freq (Hz) 1000 primitives 100 0.25 0.5 0.75 1 100 time (s) time (s) imitation 8000 3000 freq (Hz) 1000 100 0.25 0.5 0.75 1 time (s) 15
    48. 48. Imitation example target utterance classification 8000 0.8 P(C|x) 0.6 0.4 p(Cj |x) 3000 0.2 0 0.25 0.5 0.75 1 freq (Hz) time (s) 8000 1000 3000 morphed freq (Hz) 1000 primitives 100 0.25 0.5 0.75 1 100 time (s) time (s) imitation 8000 3000 freq (Hz) 1000 100 0.25 0.5 0.75 1 time (s) 15
    49. 49. Imitation example target utterance classification 8000 0.8 P(C|x) 0.6 0.4 p(Cj |x) 3000 0.2 0 0.25 0.5 0.75 1 freq (Hz) time (s) 8000 1000 3000 morphed freq (Hz) 1000 primitives 100 0.25 0.5 0.75 1 100 time (s) time (s) imitation 8000 3000 freq (Hz) 1000 pitch + energy 100 0.25 0.5 0.75 1 time (s) 15
    50. 50. other examples adult imitation aia aua papa 16
    51. 51. other examples adult imitation aia aua papa 16
    52. 52. other examples adult imitation aia aua papa 16
    53. 53. other examples adult imitation aia aua papa 16
    54. 54. other examples adult imitation aia aua papa 16
    55. 55. other examples adult imitation aia aua papa 16
    56. 56. other examples adult imitation aia aua papa 16
    57. 57. Subjective evaluation of imitation experiment  how similar is the content of the two sounds?  1 (different) ... 5 (same)  24 test subjects stimuli  3 systems x 13 phonemes pairs < human, imitated > O, e, @, o, a, E, i, U Y, 9, aI, aU, OI S3 a, i, U S5 a, i, U, E, O  8 pairs < human, control >  supervised activation S8 a, i, U, E, O, e, @, o 17
    58. 58. outline synthesize child’s voice  vocoder using gammatone filter bank address correspondence problem  sensorimotor model trained with tutor imitative feedback  feature space  perceptual space 18
    59. 59. Integration with an existing speech acquisition system (Azubi) phone syllable model word model initialization model lexicon pool pool training phone segments syllable word LM LM LM syllable phone score syllable sequence word normalization recognizer spotter spotter detect words phone syllabic activities phonotactic constraints symbol speech model grounding 19
    60. 60. Integration with an existing speech acquisition system (Azubi) Goals:  integrate with perceptual model  make it more appropriate to use in phone syllable real scenarios model model initialization model word lexicon pool pool training phone segments syllable word LM LM LM syllable phone score syllable sequence word normalization recognizer spotter spotter detect words phone syllabic activities phonotactic constraints symbol speech model grounding 19
    61. 61. Integration with an existing speech acquisition system (Azubi) Goals:  integrate with perceptual model  make it more appropriate to use in phone syllable real scenarios model model initialization model word lexicon pool pool training Azubi model [Brandl et al, 2008] phone LM segments syllable LM word LM syllable acquires speech phone recognizer score normalization syllable spotter sequence word spotter  phones, syllables, words detect words phone syllabic  already used in interaction activities phonotactic constraints symbol speech model scenarios [Bolder et al, 2008, etc] grounding 19
    62. 62. Integration with an existing speech acquisition system (Azubi) Goals:  integrate with perceptual model  make it more appropriate to use in phone syllable real scenarios model model initialization model word lexicon pool pool training Azubi model [Brandl et al, 2008] phone LM segments syllable LM word LM syllable acquires speech phone recognizer score normalization syllable spotter sequence word spotter  phones, syllables, words detect words phone syllabic  already used in interaction activities phonotactic constraints symbol speech model scenarios [Bolder et al, 2008, etc] grounding 19
    63. 63. Integration with an existing speech acquisition system (Azubi) Goals:  integrate with perceptual model  make it more appropriate to use in phone syllable real scenarios model model initialization model word lexicon pool pool training Azubi model [Brandl et al, 2008] phone LM segments syllable LM word LM syllable acquires speech phone recognizer score normalization syllable spotter sequence word spotter  phones, syllables, words detect words phone syllabic  already used in interaction activities phonotactic constraints symbol speech model scenarios [Bolder et al, 2008, etc] grounding λp λp 1 2 λp 3 λp 4 λp 5 19
    64. 64. Integration with an existing speech acquisition system (Azubi) Goals:  integrate with perceptual model  make it more appropriate to use in phone syllable real scenarios model model initialization model word lexicon pool pool training Azubi model [Brandl et al, 2008] phone LM segments syllable LM word LM syllable acquires speech phone recognizer score normalization syllable spotter sequence word spotter  phones, syllables, words detect words phone syllabic  already used in interaction activities phonotactic constraints symbol speech model scenarios [Bolder et al, 2008, etc] utterance grounding generation production primitives Correspondence model trained at correspondence primitive synergistic activity synthesizer the phone model level activity contour mapping encoder λp λp 1 2 λp 3 λp 4 λp 5 19
65. Training phase: correspondence model
The tutor imitates each vocal primitive; the imitation X_tutor is segmented and classified against the phone model set P:

  [λ^p_1, …, λ^p_n] = arg max_{[λ^p] ∈ P} P([λ^p] | X_tutor)

Each classification updates the probabilistic mapping between phone models λ^p_i and motor primitives m_j:

  M_ij = P(λ^p_i | m_j, D_j)
  C_ij = P(m_j | λ^p_i) = P(λ^p_i | m_j, D_j) · P(m_j) / P(λ^p_i)
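The update on the slide can be sketched numerically: estimate M_ij = P(λ^p_i | m_j) from imitation-game counts and invert it with Bayes' rule to get the correspondence C_ij = P(m_j | λ^p_i). The dimensions (5 phone models, 3 motor primitives), the count-based estimator, and the Laplace smoothing are assumptions for illustration, not the thesis' exact procedure.

```python
import numpy as np

n_phones, n_motors = 5, 3
counts = np.ones((n_phones, n_motors))   # Laplace-smoothed co-occurrence counts

def observe(phone_i, motor_j):
    """One imitation game: motor primitive j was classified as phone model i."""
    counts[phone_i, motor_j] += 1

def correspondence():
    M = counts / counts.sum(axis=0, keepdims=True)  # M_ij = P(lambda_i | m_j)
    p_m = counts.sum(axis=0) / counts.sum()         # P(m_j)
    p_l = M @ p_m                                   # P(lambda_i), marginalized
    C = (M * p_m) / p_l[:, None]                    # Bayes: C_ij = P(m_j | lambda_i)
    return M, C

# 15 imitations of each primitive, as in the experiments
for _ in range(15):
    observe(0, 0); observe(2, 1); observe(4, 2)
M, C = correspondence()
```

Because C is obtained by exact Bayesian inversion of M, each row of C is a proper posterior over motor primitives, which is what lets the system pick a primitive given a recognized phone model at imitation time.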
70. Imitation phase
- target tutor utterance → segmentation: [λ^p_1, …, λ^p_n] = arg max_{[λ^p] ∈ P} P([λ^p] | X_tutor)
- vocal primitives' posterior probabilities
- population coding → Gaussian activation contours → spectral output
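The population-coding step above can be sketched as follows: each vocal primitive gets a Gaussian activation contour over time, scaled by its posterior probability, and the spectral output is the activation-weighted blend of per-primitive spectral templates. The template shapes, contour widths, and two-band spectrum are illustrative assumptions.

```python
import numpy as np

def gaussian_contour(t, center, width):
    """Gaussian activation contour centered at `center` (seconds)."""
    return np.exp(-0.5 * ((t - center) / width) ** 2)

t = np.linspace(0.0, 1.0, 100)                              # time axis (s)
templates = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # 3 primitives x 2 bands

# posteriors and activation centers would come from the segmentation step
posteriors = np.array([0.7, 0.2, 0.1])
centers = np.array([0.3, 0.5, 0.7])

# one contour per primitive, weighted by its posterior -> (3, 100)
activation = np.stack([p * gaussian_contour(t, c, 0.05)
                       for p, c in zip(posteriors, centers)])

# blend spectral templates by activation at each time step -> (100, 2)
spectral_output = activation.T @ templates
```

Around t = 0.3 s the output spectrum is dominated by the first primitive's template, since its contour peaks there with the largest posterior; the smooth Gaussian overlap is what yields gradual transitions between primitives.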
75. Experimental results
Correspondence matrix (phone models × vocal primitives):
- "child-directed"-like speech, about 1 min of interaction
- 15 imitations of each vocal primitive
79. Imitation example: "mama" — input spectrum → population coding → spectral output
83. Summary
A framework in which speech imitation becomes possible:
- speech synthesis technique for a child's voice: channel vocoder meets gammatone filter bank; evaluation
- addressing the correspondence problem: probabilistic mapping between the tutor's voice and the system's motor space; tutor feedback interpreted both in feature space and in an unsupervisedly acquired perceptual space
- integration into an online speech acquisition framework (Azubi): paves the way for usage on the robot
84. Publications
- "Learning from a tutor: embodied speech acquisition and imitation learning", M. Vaz, H. Brandl, F. Joublin, C. Goerick, Proc. IEEE Intl. Conf. on Development and Learning 2009, Shanghai, China
- "Speech imitation with a child's voice: addressing the correspondence problem", M. Vaz, H. Brandl, F. Joublin, C. Goerick, Proc. SPECOM'2009, St. Petersburg, Russia
- "Linking Perception and Production: System Learns a Correspondence Between its Own Voice and the Tutor's", M. Vaz, H. Brandl, F. Joublin, C. Goerick, Speech and Face to Face Communication Workshop in memory of Christian Benoît, GIPSA-lab, Université Stendhal, Grenoble, France
- "Speech structure acquisition for interactive systems", H. Brandl, M. Vaz, F. Joublin, C. Goerick, Speech and Face to Face Communication Workshop in memory of Christian Benoît, GIPSA-lab, Université Stendhal, Grenoble, France
- "Listen to the Parrot: Demonstrating the Quality of Online Pitch and Formant Extraction via Feature-based Resynthesis", M. Heckmann, C. Glaeser, M. Vaz, T. Rodemann, F. Joublin, C. Goerick, Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems 2008, Nice, France
85. Thank you
Dr. Estela Bicho, Dr. Frank Joublin, Dr. Wolfram Erlhagen, colleagues @ Honda Research Institute, colleagues @ DEI, family, friends