1. Modeling the synergy between verbal and nonverbal communication
using formal, analytic methods of sequential analysis
Video and Audio Data
2. Our research has two thrusts
1) Contribute to the science of human interaction, and
2) Develop novel technical capabilities for computers to detect and infer
the states of intersubjectivity based on video and audio data.
3. The Problem
Parties who know or suspect that others might observe them
via surveillance systems
– will try to communicate covertly, using non-explicit
references and non-verbal communication (gesture, expression, and
other “body language”) to reach shared understanding.
Video contains rich visual cues that enable the decoding of
non-explicit communication in ways that audio alone cannot,
– some of which can be interpreted by a third-party human observer
(e.g., an intelligence analyst).
– The exponential growth of video and audio data will lead to a
problem of scale for human intelligence analysts, as the volume of
data that is continuously generated will overwhelm them.
– Automated techniques are needed to supplement and ultimately
reduce the amount of human analysis needed.
4. Discovering the architecture of intersubjectivity
People display their understanding in interaction in a well-documented
three-turn sequence
We study verbal and non-verbal cues
– Identify the structure and sequence of implicit reference (allusion),
understanding, agreement and sentiment between meeting parties
as it evolves over time.
– Create models to develop algorithms that detect and infer
the state of these behaviors.
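The three-turn sequence mentioned on this slide can be illustrated with a minimal sketch. It assumes that turns have already been hand-coded with a role label, and that the third turn is where the first speaker either confirms or repairs the understanding displayed in the second turn. The `Turn` class, the role labels, and `find_three_turn_sequences` are hypothetical names chosen for illustration. They are not part of the project described here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Turn:
    speaker: str
    # Hypothetical hand-coded label, one of:
    # "initiation", "response", "confirmation", "repair", "other"
    role: str


def find_three_turn_sequences(turns: List[Turn]) -> List[Tuple[int, str]]:
    """Return (start_index, outcome) for each initiation/response/third-turn triple."""
    found = []
    for i in range(len(turns) - 2):
        first, second, third = turns[i], turns[i + 1], turns[i + 2]
        if first.role != "initiation" or second.role != "response":
            continue
        # The response must come from another party, and the third turn
        # must return to the speaker who produced the first turn.
        if second.speaker == first.speaker or third.speaker != first.speaker:
            continue
        if third.role == "confirmation":
            found.append((i, "understanding_confirmed"))
        elif third.role == "repair":
            found.append((i, "understanding_repaired"))
    return found
```

Under these assumptions, a coded transcript goes in as a list of `Turn` objects and the function returns the positions where understanding was displayed and then confirmed or repaired. Automatic role labeling is outside the scope of this sketch.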
Culture as we know it is “an apparatus for generating recognizable actions” (Sacks,
– By “recognizable,” we mean “can be readily seen and recognized by any member of
the culture.”
We concern ourselves here with four types of recognizable action:
1. Confirming implicit allusion: implicit references that are not explicitly stated, i.e., confirming,
through various recognizable vocal actions, both the content and the implicit conveyance
of a participant’s prior remarks
2. Understanding: the architecture of intersubjectivity, in which understanding is achieved
or navigated in three-turn sequences
3. Agreement: the structure of agreement and disagreement sequences
a. Negotiation and confirmation of mutual agreement; if disagreement, persuasion
4. Sentiment: the emotional significance of an expression as distinguished from its verbal
context
a. Positive/negative detection
b. Reaction to detected sentiment
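One way to record the four types listed above as time-stamped annotations is sketched below. This is an illustrative schema only: `ActionType`, `Annotation`, `timeline`, and the `detail` strings are hypothetical names, and the slides do not specify an annotation format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class ActionType(Enum):
    ALLUSION_CONFIRMATION = "confirming implicit allusion"
    UNDERSTANDING = "understanding"
    AGREEMENT = "agreement"
    SENTIMENT = "sentiment"


@dataclass
class Annotation:
    start_s: float  # start time in seconds from the beginning of the recording
    end_s: float    # end time in seconds
    speaker: str
    action: ActionType
    detail: Optional[str] = None  # hypothetical sub-label, e.g. "positive" or "disagreement"


def timeline(annotations: List[Annotation], action: ActionType) -> List[Annotation]:
    """Return the annotations of one action type, ordered by start time."""
    return sorted(
        (a for a in annotations if a.action is action),
        key=lambda a: a.start_s,
    )
```

Sorting by start time makes it possible to follow how one type, such as agreement, evolves over the course of a meeting.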
Analyze videotapes of:
– Naturally occurring, mundane face-to-
face conversation to determine if and
when the accompanying interactional
gestures are systematic and can serve
as cues that can be automatically
– More formal environments, such as planning
meetings, in which all parties must
achieve understanding and agreement
regarding a list of subtasks and
estimates for completing them.
Based on the analyzed video and audio
data, existing methods in body tracking,
gesture detection, and audio analysis will
be used to extract machine perceivable
7. Why Now?
Most of this formal analytic study has been derived from telephone
Video analysis can leverage the communicative intent of gestures in
The development of models and algorithms for the detection of human action
would benefit significantly from analysis of the ways in which gestures are
systematically related to these observable and recognizable social actions.
Recent advances in the analysis of embodied conduct suggest that it may be
possible to detect social action that embodies intent but is not explicitly said
– the occurrence of head nods, eye gaze, and body position can indicate
acceptance, agreement, disagreement, engagement or disengagement
with an ongoing course of conversation.
– analysis tools can provide assistance proactively to enable a new kind of
analytic process, one in which the system itself finds relevant instances
that the analyst herself may not have time or intuition to identify.
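The idea of a system that surfaces relevant instances for the analyst can be sketched as a simple temporal-overlap search. The sketch assumes that an upstream detector has already produced time-stamped cue intervals (for example head nods) and that turn boundaries are known. `Interval`, `overlaps`, `flag_instances`, and the cue label strings are hypothetical names. The sketch does not decide what a cue means. It only collects the turns during which a cue of interest occurred, so that a human can review them.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class Interval:
    label: str      # e.g. a turn identifier or a cue name such as "head_nod"
    start_s: float  # start time in seconds
    end_s: float    # end time in seconds


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two intervals share any span of time."""
    return a.start_s < b.end_s and b.start_s < a.end_s


def flag_instances(
    turns: List[Interval],
    cues: List[Interval],
    cue_labels: Set[str],
) -> List[Tuple[Interval, List[Interval]]]:
    """Pair each turn with the cues of interest that co-occur with it."""
    flagged = []
    for turn in turns:
        hits = [c for c in cues if c.label in cue_labels and overlaps(turn, c)]
        if hits:
            flagged.append((turn, hits))
    return flagged
```

Keeping interpretation out of this step is a deliberate choice in the sketch: which cue signals agreement, disagreement, or disengagement is exactly what the analysis described in these slides would need to establish.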