Multimodal Analysis of Stand-up Comedians
Audio, Video and Lexical Analysis
By: Yash Singh, Madhav Sharan, Sree Priyanka Uppu, Nandan PC, Harsh Fatepuria, Rahul Agrawal
This research project was done as part of USC's course CSCI 535: Multimodal Probabilistic Learning of Human Communication, under the guidance of Prof. Stefan Scherer.
Our team analyzed the performances of stand-up comedians by examining acoustic, visual, and linguistic features
for automatic humor recognition. We annotated stand-up comedy video segments as big, small, or
no laughter using a simple approach based on the product of laughter intensity and pitch values. Our
analysis helps determine the influence of three modalities on the audience's laughter: facial expressions (smile, certain
Action Units, emotion), acoustic features (pitch and intensity), and verbal features (pause timings, sentiment
of transcripts). For predictive analysis of humor we used standard supervised
learning classifiers such as boosted decision trees and Support Vector Machines (SVMs). Since humor is
sequential in nature, we also used Long Short-Term Memory (LSTM) networks for predictive
analysis.
3. Why stand-up comedians?
● We love watching stand-up
● They express a variety of emotions
● Feedback from the audience is available in the form of laughter
● Relatively new research domain
Motivation
4. H1 : Certain facial expressions could contribute to laughter
5. H2 : Pauses and word elongation contribute to laughter
Hypotheses
6. H3 : Voice modulation (pitch and intensity changes) can also play a crucial role
7. H4 : Laughter is sequential in nature, meaning small laughs could add up to bigger laughs.
8. • We collected 3 hours 46 minutes of data from ‘The Tonight Show
Starring Jimmy Fallon’ and ‘Late Night with Conan O’Brien’.
• 46 videos (11.76 GB), approximately 5 minutes each
• 27 male and 19 female artists
• The backdrop in the videos is dark
• For most of each video (80-90%), the artist faces the camera.
Data Collection
9. • For facial feature extraction, we manually blacked out the frames
in which the camera does not capture the artist’s face, which sets the
video features for those frames to 0.
• Audio features can also be 0, e.g., during a pause or while the
audience is laughing.
Pre-Processing
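A minimal sketch of this masking step. The slides describe blacking out frames in the video itself; this is an equivalent post-hoc version that zeroes the extracted features, assuming OpenFace's per-frame CSV with its 'success' flag (the file name is hypothetical):

import pandas as pd

def mask_missing_face(csv_path):
    # OpenFace writes one row per frame; 'success' is 1 when a face is tracked
    df = pd.read_csv(csv_path, skipinitialspace=True)
    feature_cols = [c for c in df.columns
                    if c not in ("frame", "face_id", "timestamp",
                                 "confidence", "success")]
    # Zero out all video features on frames with no tracked face
    df.loc[df["success"] == 0, feature_cols] = 0.0
    return df

frames = mask_missing_face("artist_01.csv")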
10. • Manually segment the videos based on punch lines
• Annotate the laughter level in each segment based on the product of
mean pitch and mean intensity:
o Big (55%-100%)
o Small (36%-55%)
o No laughter (0%-36%)
• A pitch range of 75 to 625 Hz yields a good sampling step of 10 ms
and covers a wide range of frequencies.
• The pitch of laughter varies across videos, so it is normalized to
the range [0, 1].
Data Annotation
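A minimal sketch of this annotation rule, assuming per-frame pitch and intensity values for each segment; the thresholds are the percentages above, and the exact scaling of the pitch-intensity product is our assumption:

import numpy as np

def minmax(x):
    # Normalize per video so laughter pitch/intensity lie in [0, 1]
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

def laughter_level(pitch, intensity):
    # pitch, intensity: per-frame values (10 ms frames) for one segment,
    # already min-max normalized over the whole video
    score = np.mean(pitch) * np.mean(intensity)
    if score >= 0.55:
        return "big"
    if score >= 0.36:
        return "small"
    return "no laughter"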
11. openSMILE
• Extracted 5 low-level descriptors:
✧ Musical chroma features (tone)
✧ Prosody features (loudness and pitch)
✧ Energy (1)
✧ MFCC (13 coefficients, 0-12, from 26 Mel-frequency bands)
• All these features were captured at a frame step of 10 ms.
• Processing: aggregated the features by mean and standard deviation
for each segment
Feature Engineering - Audio
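A sketch of this per-segment aggregation, assuming the frame-level openSMILE output has been loaded into a pandas DataFrame with an added 'segment' id column:

import pandas as pd

def aggregate_segments(frames):
    # frames: one row per 10 ms frame, plus a 'segment' id column
    feature_cols = [c for c in frames.columns if c != "segment"]
    agg = frames.groupby("segment")[feature_cols].agg(["mean", "std"])
    # Flatten the (feature, statistic) column MultiIndex: 'pitch' -> 'pitch_mean'
    agg.columns = [f"{feat}_{stat}" for feat, stat in agg.columns]
    return agg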
12. OpenFace
• Extracted:
✧ Eye gaze direction vectors in world coordinates for both eyes
✧ Head location with respect to the camera (millimeters) and rotation (radians)
✧ 68 facial landmark locations in 2D pixel coordinates (x, y)
✧ 33 rigid and non-rigid shape parameters
✧ 11 Action Unit (AU) intensities and AU occurrences
• Processing: aggregated the features by mean and standard deviation
for each segment
Feature Engineering - Video
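The video features are aggregated the same way; a sketch that picks out the OpenFace column groups listed above by their usual prefixes, reusing aggregate_segments from the audio sketch ('frames' is assumed to be the masked OpenFace DataFrame with a 'segment' column added):

def select_openface_features(df):
    # Typical OpenFace column prefixes: gaze vectors, head pose,
    # 2D landmarks (x_*/y_*), shape parameters (p_*), and AUs
    prefixes = ("gaze_", "pose_", "x_", "y_", "p_", "AU")
    cols = [c for c in df.columns if c.startswith(prefixes)]
    return df[["segment"] + cols]

video_features = aggregate_segments(select_openface_features(frames))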
13. • Analyzed features such as Action Units, gaze (y and z directions), pose (head
rotation), various facial landmark points, frown, and eyebrow raise.
14. IBM Watson
● Pauses
● Last pause
● Word elongation
● Sentiments
Feature Engineering - Textual
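Watson Speech to Text can return per-word timestamps; a sketch of deriving the pause features from them, where the 0.3 s pause threshold and the duration-per-character elongation heuristic are our assumptions:

def pause_features(word_timestamps, min_pause=0.3):
    # word_timestamps: [[word, start_sec, end_sec], ...] as returned by
    # Watson Speech to Text when timestamps=true; assumes a non-empty list
    gaps = [nxt[1] - cur[2]
            for cur, nxt in zip(word_timestamps, word_timestamps[1:])]
    pauses = [g for g in gaps if g >= min_pause]
    last_pause = gaps[-1] if gaps else 0.0  # silence before the final word
    # Crude elongation proxy: seconds of speech per character of the word
    elongation = max((end - start) / max(len(word), 1)
                     for word, start, end in word_timestamps)
    return {"num_pauses": len(pauses),
            "total_pause_time": sum(pauses),
            "last_pause": last_pause,
            "max_elongation": elongation}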
15. H1 : AU-related features
Feature Analysis - Visual
[Plots: AU 07 (Lid Tightener) and AU 14 (Dimpler)]
21. H1 : Certain facial expressions could contribute to laughter
H2 : Pauses and word elongation contribute to laughter
Results
22. H3 : Voice modulation (pitch and intensity changes) can also play a crucial role
XGBoost
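A minimal sketch of the XGBoost setup for the three laughter classes, assuming X holds the per-segment aggregated features and y the encoded labels; the hyperparameters are illustrative, not the project's:

from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# X: per-segment aggregated features, y: labels encoded 0=no, 1=small, 2=big
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = XGBClassifier(objective="multi:softprob", n_estimators=200,
                    max_depth=4, learning_rate=0.1)
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))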
23. Early Fusion:
● Min. video frames = 100 frames/segment
● Min. audio frames = 30 frames/segment
● Min. text = 0 words/segment
● There is no good way of taking an equal number of frames from each
modality, which makes early fusion difficult (a padding sketch follows below)
H4 : Laughter is sequential in nature
LSTM - Challenge
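One common workaround for variable-length segments is zero-padding plus masking; a Keras sketch, where 'segments' is an assumed list of per-segment feature matrices and the layer sizes are illustrative:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Masking, LSTM, Dense
from tensorflow.keras.preprocessing.sequence import pad_sequences

# segments: list of per-segment feature arrays of shape (n_frames_i, n_features)
X = pad_sequences(segments, dtype="float32", padding="post")

model = Sequential([
    Masking(mask_value=0.0, input_shape=(X.shape[1], X.shape[2])),
    LSTM(64),
    Dense(3, activation="softmax"),  # big / small / no laughter
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

Note that padding with 0 interacts with the pre-processing step, which also writes 0 for blacked-out frames; those real frames would be masked as well, so a distinct mask value may be preferable.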