Multimodal Analysis of Stand-up Comedians
Audio, Video and Lexical Analysis
This research project was done as part of the graduate course CSCI 535: Multimodal Probabilistic Learning of Human Communication at USC.
Our team collected data from YouTube and analyzed stand-up comedians across multiple modalities to understand how facial, acoustic, and lexical features affect audience response.
We applied machine learning techniques such as boosted decision trees and neural networks to build our predictive models.
Our team:
Yash Singh, Madhav Sharan, Sree Priyanka Uppu, Nandan PC, Harsh Fatepuria, Rahul Agrawal
Motivation
Why stand-up comedians?
● We love watching stand-up
● They express a wide variety of emotions
● Audience feedback is readily available in the form of laughter
● A relatively new domain to analyze
Hypotheses
H1: Certain facial expressions could contribute to laughter.
H2: Pauses and word elongation contribute to laughter.
H3: Voice modulation (pitch and intensity changes) can also play a crucial role.
H4: Laughter is sequential in nature: small laughs can add up to bigger laughs.
Data Collection
• We collected 3 hours 46 minutes of footage from ‘The Tonight Show Starring Jimmy Fallon’ and ‘Late Night with Conan O’Brien’.
• 46 videos (11.76 GB), approximately 5 minutes each
• 27 male and 19 female artists
• The backdrop in the videos is dark
• For most of each video (80–90%), the artist faces the camera
Pre-processing
• For facial feature extraction, we manually blacked out the frames in which the camera does not capture the artist’s face, setting the video features for those frames to 0 (see the sketch below).
• Audio features can also be 0, e.g., during a pause or while the audience is laughing.
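A minimal sketch of the masking step, assuming the per-frame video features sit in a NumPy matrix and the no-face frames are marked by a manually annotated boolean mask (both names here are illustrative):

```python
import numpy as np

def mask_no_face_frames(features: np.ndarray, face_visible: np.ndarray) -> np.ndarray:
    """Zero out per-frame video features wherever the artist's face is off camera.

    features: (n_frames, n_features) per-frame video feature matrix.
    face_visible: (n_frames,) boolean mask from the manual annotation.
    """
    masked = features.copy()
    masked[~face_visible] = 0.0  # blacked-out frames contribute nothing downstream
    return masked
```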
Data Annotation
• Manually segment the videos based on punch lines.
• Annotate the laughter level in each segment based on the product of mean pitch and mean intensity:
  o Big (55% to 100% intensity)
  o Small (36% to 55% intensity)
  o No (0% to 36% intensity)
• A pitch range of 75 to 625 Hz gives a good sampling rate of 10 ms and covers a wide range of frequencies.
• The pitch of laughter varies across videos, so it is normalized to the range [0, 1] (a bucketing sketch follows this list).
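A minimal sketch of the bucketing rule, under the assumption that the per-video normalization is applied to the pitch-intensity product before thresholding (the 0.36 and 0.55 cut-offs are the percentages above):

```python
import numpy as np

def annotate_laughter(mean_pitch: np.ndarray, mean_intensity: np.ndarray) -> np.ndarray:
    """Label each segment of one video as 'big', 'small', or 'no' laughter."""
    score = np.asarray(mean_pitch) * np.asarray(mean_intensity)
    # Normalize per video to [0, 1], since laughter pitch varies across videos.
    score = (score - score.min()) / (score.max() - score.min() + 1e-9)
    return np.where(score > 0.55, "big", np.where(score > 0.36, "small", "no"))
```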
Feature Engineering - Audio
OpenSmile
• Extracted 5 low-level descriptors:
  ✧ Musical chroma features (tone)
  ✧ Prosody features (loudness and pitch)
  ✧ Energy (1)
  ✧ MFCC (13 coefficients, 0 to 12, from 26 Mel-frequency bands)
• All these features were captured at a frame rate of 10 ms.
• Processing: aggregated the features by mean and standard deviation for each segment (see the sketch below).
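A minimal sketch of the per-segment aggregation, assuming the 10 ms frame-level descriptors for one segment have been loaded into a pandas DataFrame (the column names depend on the openSMILE configuration used):

```python
import pandas as pd

def aggregate_segment(frames: pd.DataFrame) -> pd.Series:
    """Collapse frame-level LLDs for one segment into per-feature mean and std."""
    stats = frames.agg(["mean", "std"])
    # Flatten to names like 'loudness_mean', 'loudness_std'.
    return pd.Series({f"{col}_{stat}": stats.loc[stat, col]
                      for col in frames.columns for stat in ("mean", "std")})
```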
Feature Engineering - Video
OpenFace
• Extracted:
  ✧ Eye gaze direction vectors in world coordinates for both eyes
  ✧ The location of the head with respect to the camera in millimeters, and its rotation (radians)
  ✧ 68 facial landmark locations in 2D pixel format (x, y)
  ✧ 33 rigid and non-rigid shape parameters
  ✧ 11 AU intensities and AU occurrences
• Processing: aggregated the features by mean and standard deviation for each segment
• We analyze features such as Action Units, gaze (y and z directions), pose (head rotation), various facial landmark points, frown, and eyebrow raise (a loading sketch follows).
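A minimal sketch of pulling these features out of an OpenFace FeatureExtraction CSV for one punch-line segment; the gaze_*/pose_*/AU* column prefixes follow OpenFace's output conventions, while the segment frame boundaries come from our manual segmentation:

```python
import pandas as pd

def video_segment_features(csv_path: str, start_frame: int, end_frame: int) -> pd.DataFrame:
    """Per-segment mean/std of the OpenFace gaze, pose, and AU columns."""
    df = pd.read_csv(csv_path)
    df.columns = df.columns.str.strip()  # OpenFace pads CSV headers with spaces
    seg = df[(df["frame"] >= start_frame) & (df["frame"] <= end_frame)]
    cols = [c for c in seg.columns if c.startswith(("gaze_", "pose_", "AU"))]
    return seg[cols].agg(["mean", "std"])
```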
Feature Engineering - Textual
IBM Watson (a pause-feature sketch follows this list)
● Pauses
● Last pause
● Word elongation
● Sentiments
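A minimal sketch of deriving the pause features from IBM Watson Speech to Text word timestamps (Watson returns [word, start, end] triples when timestamps=true; the 0.3 s pause threshold here is an illustrative assumption):

```python
def pause_features(timestamps, segment_end):
    """Pause statistics for one segment from Watson word timestamps.

    timestamps: list of [word, start_sec, end_sec] triples.
    segment_end: end time of the segment in seconds.
    """
    gaps = [nxt[1] - cur[2] for cur, nxt in zip(timestamps, timestamps[1:])]
    return {
        "num_pauses": sum(g > 0.3 for g in gaps),  # 0.3 s threshold is an assumption
        "mean_pause": sum(gaps) / len(gaps) if gaps else 0.0,
        # 'Last pause': gap between the final word and the end of the segment.
        "last_pause": segment_end - timestamps[-1][2] if timestamps else 0.0,
    }
```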
Feature Analysis - Visual
H1: AU-related features, e.g., AU 07 (lid tightener) and AU 14 (dimpler).
Results
H1: Certain facial expressions could contribute to laughter.
H2: Pauses and word elongation contribute to laughter.
H3: Voice modulation (pitch and intensity changes) can also play a crucial role.
XGBoost
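A minimal sketch of the boosted-tree model on the aggregated segment features; the hyperparameter values and the 80/20 split are illustrative, not our tuned settings:

```python
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

def train_laughter_model(X, y):
    """Fit a boosted-tree classifier on fused segment features.

    X: (n_segments, n_features) audio/video/text features.
    y: laughter labels encoded as 0 (no), 1 (small), 2 (big).
    """
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
    clf.fit(X_tr, y_tr)
    return clf, accuracy_score(y_te, clf.predict(X_te))
```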
H4: Laughter is sequential in nature (LSTM challenge)
Early fusion:
● Minimum video frames = 100 frames/segment
● Minimum audio frames = 30 frames/segment
● Minimum text = 0 words/segment
● There is no good way to take an equal number of frames from each modality, which makes early fusion difficult (a padding/masking sketch follows this list).
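One hedged workaround for the unequal segment lengths: pad each segment to a fixed length and let a masked LSTM skip the padding. This is a sketch of that idea in Keras, not the exact architecture we ran:

```python
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.sequence import pad_sequences

def pad_segments(segments, max_len):
    """segments: list of (n_frames_i, n_features) arrays; pad/truncate to max_len."""
    return pad_sequences(segments, maxlen=max_len, dtype="float32",
                         padding="post", truncating="post")

def build_laughter_lstm(max_len, n_features, n_classes=3):
    """Masked LSTM over padded per-segment frames, predicting the laughter level."""
    model = models.Sequential([
        # Padding frames are all-zero, matching mask_value, so the LSTM skips them.
        layers.Masking(mask_value=0.0, input_shape=(max_len, n_features)),
        layers.LSTM(64),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```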