This document summarizes an approach to no-audio multimodal speech detection using CNNs and Fisher Vectors. Dense trajectories and Fisher Vectors were used to extract features from video, with 1-second windows labeled as speaking or not speaking. A 1D CNN was applied to acceleration data using 3-second windows, also labeled at 1-second intervals. Preliminary experiments showed that the CNN outperformed logistic regression and was competitive with LSTMs. While the CNN benefited from the added temporal context, the 1-second windows proved too short for accurate video predictions. Opportunities for future work included exploring optimal context sizes and detecting more subtle speech cues.
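The windowing scheme described above (3-second accelerometer windows producing one label per second) might be sketched as follows. This is a minimal illustration, not the authors' implementation: the 20 Hz sampling rate, the 3-axis signal, and the choice to label each window by its centre second are all assumptions made for the example.

```python
import numpy as np

def make_windows(signal, labels_per_sec, fs=20, win_sec=3, hop_sec=1):
    """Slice a 1D accelerometer stream into overlapping windows.

    signal: array of shape (n_samples, n_axes).
    labels_per_sec: one binary speaking/not-speaking label per second.
    Each 3-second window inherits the label of its centre second
    (an assumption for illustration), so predictions land on a
    1-second grid as in the summary above.
    """
    win = win_sec * fs   # samples per window
    hop = hop_sec * fs   # samples between consecutive window starts
    windows, targets = [], []
    for start in range(0, len(signal) - win + 1, hop):
        windows.append(signal[start:start + win])
        centre_sec = (start + win // 2) // fs
        targets.append(labels_per_sec[centre_sec])
    return np.stack(windows), np.array(targets)

# Hypothetical data: 10 s of 3-axis acceleration at 20 Hz,
# with one speaking/not-speaking label per second.
fs = 20
signal = np.random.randn(10 * fs, 3)
labels = np.random.randint(0, 2, size=10)
X, y = make_windows(signal, labels, fs=fs)
print(X.shape)  # (8, 60, 3): 8 overlapping windows of 60 samples x 3 axes
```

The resulting `X` can be fed directly to a 1D CNN, with `y` as the per-second targets; the overlap between consecutive windows is what gives the network its extra temporal context relative to the 1-second video windows.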