Modeling the synergy between verbal and nonverbal communication
using formal, analytic methods of sequential analysis

Conversation Analysis
Video and Audio Data
Human Intersubjectivity




Our research has two thrusts

1) Contribute to the science of
human interaction and
intersubjectivity


2) Develop novel technical
capabilities for computers to detect
and infer the states of
intersubjectivity based on video and
audio data




The Problem

•  Parties who know or suspect that others might observe them
    via surveillance systems will try to communicate covertly, using
    non-explicit references and non-verbal communication (gesture,
    expression, and other “body language”) to reach shared understanding.
•  Video contains rich visual cues that audio alone lacks and that can
    enable decoding of this non-explicit communication.
    – Some of these cues can be interpreted by a third-party human observer
      (e.g., an intelligence analyst).
    – The exponential growth of video and audio data will create a
      problem of scale: the volume of continuously generated data will
      overwhelm human intelligence analysts.
    – Automated techniques are needed to supplement and ultimately
      reduce the amount of human analysis required.



Discovering the architecture of intersubjectivity

People display their understanding in interaction in a well-documented
three-turn sequence:

   1. Initiation
   2. Response
   3. Confirmation/correction

We study verbal and non-verbal cues.
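
The three-turn sequence above can be sketched as a simple pattern over labeled turns. This is a minimal illustration, not the project's method: the Turn class, the act labels ("initiation", "response", "confirmation", "correction"), and the rule that the third turn comes from the initiating speaker are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Turn:
    speaker: str
    act: str  # hypothetical labels: initiation, response, confirmation, correction


def find_three_turn_sequences(turns: List[Turn]) -> List[Tuple[int, str]]:
    """Return (start_index, outcome) for each initiation/response/third-turn triple."""
    found = []
    for i in range(len(turns) - 2):
        first, second, third = turns[i], turns[i + 1], turns[i + 2]
        if (
            first.act == "initiation"
            and second.act == "response"
            and second.speaker != first.speaker  # assumed: a different party responds
            and third.act in ("confirmation", "correction")
            and third.speaker == first.speaker  # assumed: the initiator closes the sequence
        ):
            found.append((i, third.act))
    return found


# Hand-labeled toy data, invented purely for illustration.
turns = [
    Turn("A", "initiation"),
    Turn("B", "response"),
    Turn("A", "confirmation"),
    Turn("B", "initiation"),
    Turn("A", "response"),
    Turn("B", "correction"),
]
sequences = find_three_turn_sequences(turns)
print(sequences)  # [(0, 'confirmation'), (3, 'correction')]
```

In practice the act labels would have to come from human annotation or from upstream detectors; the sketch only shows how the third turn reveals whether the response was confirmed or corrected.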



Solution/Approach
Identify the structure and sequence of implicit reference (allusion),
understanding, agreement and sentiment between meeting parties
as it evolves over time.

Create models to develop algorithms that detect and infer the state
of these behaviors.




•  Culture as we know it is “an apparatus for generating recognizable actions” (Sacks,
    1992).
    – By “recognizable,” we mean “can be readily seen and recognized by any member of
      the culture.” We concern ourselves here with four types:
1. Confirming implicit allusion: confirming, through various recognizable vocal actions,
   both the content of a participant’s prior remarks and the fact that it was conveyed
   implicitly rather than stated explicitly.
2. Understanding: the architecture of intersubjectivity, in which understanding is achieved
   or navigated in three-turn sequences.
3. Agreement: the structure of agreement and disagreement sequences. a. Negotiation
   and confirmation of mutual agreement; if disagreement, persuasion.
4. Sentiment: the emotional significance of an expression as distinguished from its verbal
   context. a. Positive/negative detection; b. Reaction to detected sentiment.



Solution/Approach
•  Analyze videotapes of:
    – Naturally occurring, mundane face-to-face conversation, to determine
      whether and when the accompanying interactional gestures are
      systematic and can serve as automatically detectable cues.
    – More formal environments (planning meetings), in which all
      parties must achieve understanding and agreement on a list of
      subtasks and estimates for completing them.
•  Based on the analyzed video and audio data, existing methods in body
    tracking, gesture detection, and audio analysis will be used to
    extract machine-perceivable cues.
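
One way to picture machine-perceivable cues is as time-stamped events from each modality that can be aligned against speech. The sketch below is illustrative only: the Cue class, the modality and label names, and the time values are hypothetical, and no particular body-tracking or audio library is assumed.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Cue:
    modality: str  # assumed labels: "video" or "audio"
    label: str     # e.g. "head_nod", "speech" (hypothetical names)
    start: float   # seconds from the start of the recording
    end: float


def overlapping(a: Cue, b: Cue) -> bool:
    """True when the two cues share any stretch of time."""
    return a.start < b.end and b.start < a.end


def cues_during(target: Cue, cues: List[Cue]) -> List[Cue]:
    """Return cues from other modalities that overlap the target in time."""
    return [c for c in cues if c.modality != target.modality and overlapping(target, c)]


# Invented example: one utterance and three detected cues.
utterance = Cue("audio", "speech", 2.0, 4.5)
detected = [
    Cue("video", "head_nod", 3.0, 3.6),
    Cue("video", "gaze_shift", 5.0, 5.4),
    Cue("audio", "speech", 4.0, 6.0),
]
co_occurring = cues_during(utterance, detected)
print([c.label for c in co_occurring])  # ['head_nod']
```

Aligning cues by time keeps the detectors for each modality independent; which co-occurrences are actually systematic is the empirical question the video analysis is meant to answer.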




Why Now?
•  Most of this formal analytic study has been derived from telephone
    conversation data, that is, from audio alone.
•  Video analysis can leverage the communicative intent of gestures in
    conversation.
•  The development of models and algorithms for detecting human action
    would benefit significantly from analysis of the ways in which gestures are
    systematically related to these observable and recognizable social actions.
•  Recent advances in the analysis of embodied conduct suggest that it may be
    possible to detect social action that embodies intent but is not explicitly stated
    by participants.
     – Head nods, eye gaze, and body position can indicate
       acceptance, agreement, disagreement, engagement or disengagement
       with an ongoing course of conversation.
     – Analysis tools can proactively assist the analyst, enabling a new kind of
       analytic process in which the system itself finds relevant instances
       that the analyst may not have the time or intuition to identify.




PARC CONFIDENTIAL

Parc Human Interaction
