L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data

L3-Net Deep Audio
Embeddings to Improve
COVID-19 Detection
from Smartphone Data
Mattia G. Campana (IIT-CNR)

Andrea Rovati (UniMi)

Franca Delmastro (IIT-CNR)

Elena Pagani (UniMi)
IEEE SMARTCOMP 2022, June 20-24

Aalto University, Espoo, Finland

L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data
AI response to the COVID-19 pandemic
2 IEEE SMARTCOMP 2022
Help the healthcare system
• Machine Learning (ML) classifiers for blood test results

• Deep Learning (DL) models to analyze chest X-ray and lungs
Computed Tomography (CT) images
Track behaviours in public places
• Monitoring social distancing

• Face mask detection systems

m-health systems based on respiratory sounds
Diagnosis
• Pervasive & low-cost solution for fast screening

• Support the healthcare system in identifying new cases (prevention of new outbreaks)

• Track the disease evolution

COVID-19 Detection from respiratory sounds
Handcrafted acoustic features (HC)
Main drawbacks
• Dif
fi
cult to
fi
nd the best set of features • Typically outperformed by Deep Learning models
Shallow
 
classifier
Time domain Frequency domain Time-frequency representations
• RMS Energy (how loud is the signal)

• Zero crossing rate (how fast the signal changes)
• Spectral centroid

• Period (freq. with highest amplitude)
• Spectrogram

• Mel-Frequency Cepstral Coefficients (MFCC)
Features

Extraction
COVID-19
 
positive/negative

DL-based approach
Representative work:
E. A. Mohammed et al., “An ensemble learning approach to digital corona virus preliminary screening from cough
sounds”, Scienti
fi
c Reports, 2021.
Main drawback: Requires large-scale datasets, especially for complex models
Graphical representation
 
(i.e., Spectrogram-like image)
Convolutional Neural Network
 
(CNN)
COVID-19
 
positive/negative
Ensemble of CNN with different audio representations

“Hybrid” approach
Representative work:
Brown, Chloë, et al. "Exploring automatic diagnosis of COVID-19 from crowdsourced respiratory sound data." In
Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020.
HC features
Deep audio
embeddings
+ Shallow
 
classifier
Pre-trained DL model
COVID-19
 
positive/negative
477 HC features + VGGish (trained with AudioSet ~ 2 million samples)

Improving the Hybrid approach
Investigation of an alternative embedding model: L3-Net
HC features
Deep audio
embeddings
+ Shallow
 
classifier
Pre-trained DL model
COVID-19
 
positive/negative

L3-Net: Look, Listen and Learn
Arandjelovic, Relja, and Andrew Zisserman. "Look, listen and learn." Proceedings of the IEEE International Conference on Computer Vision. 2017.
Fusion layers

(Fully-connected)
Video embeddings
Audio embeddings
Mel-Spectrogram

(1s window)
Video frame image
Image and
audio come
from the
same video?
Video sub-network
Audio sub-network

L3-Net for COVID-19 Detection
+
Shallow
Classifier
COVID-19
positive/negative
Cough/Breath
audio sample
Audio frames Mel-Spectrogram
HC features
Dimensionality
reduction (PCA)
Audio
fi
le embeddings

Combination of the
frames embeddings

(Mean + std)
Audio embeddings
Audio sub-network
Cramer, Jason, et al. "Look, listen, and learn more: Design choices for deep audio embeddings." ICASSP 2019-2019 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019.
OpenL3 model trained with AudioSet

Experimental Evaluation: Goals
IEEE SMARTCOMP 2022
1) Improve the classification performance with respect to:
 
                                         
2)Can we perform the classification task directly on the mobile device?
- Brown et al. (2020): same approach but different embedding model (i.e., VGGish vs L3-Net)

- Mohammed et al. (2021): ensemble model (CNN trained from scratch vs pre-trained model)
- Memory footprint evaluation
9

Datasets
Cambridge

crowdsourced breath and cough audio samples (data agreement)

www.covid-19-sounds.org
COSWARA

crowdsourced cough samples

coswara.iisc.ac.in

https://github.com/iiscleap/Coswara-Data
Virufy

Cough samples collected in hospital; labels based on
 
COVID-19 PCR test results

https://github.com/virufy/virufy-covid
Cambridge COSWARA Virufy
62
860
282
7
2758
752
Healthy
COVID-19

Best model
Dev set Test set
Performances
 
AUC, Precision, Recall
Training & Tuning
Balanced Dataset
 
(Under-sampling)
Train set Validation set
Evaluation protocol
5-fold nested Cross Validation
 
with stratified user-based splits
PCA explained variance: [0.7, 0.8, 0.9, 0.95, 0.99]
Shallow classifiers: Logistic Regression (LR), Support Vector Machines (SVM), AdaBoost (AB), Random Forest (RF)
Features sets:
F1: deep audio embeddings F2: embeddings + Period, Tempo, Duration
F3: embeddings + HC features, except Δ-MFCC, Δ2-MFCC F4: embeddings + all HC feature (i.e, 477 HC)

Classification Results vs Brown et al. (2020)
TABLE III: Classification results
Task Method Modality Features Classifier PCA Mean (± std)
AUC Precision Recall
1
baseline Cough + Breath F2 LR .95 .80 (.07) .72 (.06) .69 (.11)
our (same) Cough + Breath F2 LR .95 .76 (.092) .69 (.095) .68 (.158)
our (best) Cough + Breath F2 SVM .70 .80 (.068) .77 (.096) .68 (.139)
2
baseline Cough F2 SVM .90 .82 (.18) .80 (.16) .72 (.23)
our (same) Cough F2 SVM .90 .69 (.227) .74 (.187) .61 (.276)
our (best) Breath F1 LR .80 .84 (.168) .92 (.106) .60 (.237)
3
baseline Breath F3 SVM .70 .80 (.14) .69 (.20) .69 (.26)
our (same) Breath F3 SVM .70 .64 (.254) .69 (.154) .66 (.269)
our (best) Breath F1 AB .70 .88 (.066) .82 (.152) .79 (.192)
rufy
baseline Top 4 Ensemble CNN - .77 .80 .71
our F3 LR .99 .99 (.001) .99 (.006) .99 (.007)
1: COVID-positive vs COVID-negative

2: COVID-positive with cough vs COVID-negative

3: COVID-positive with cough vs COVID-negative with asthma and cough
Dataset: Cambridge (cough & breath audio samples)
3 classification tasks
Gain (%)
Task 1 Task 2 Task 3
10
-12
-1
13
12
5
8
2
0
AUC
Precision
Recall
1
2
3
rufy
our F3 LR .99 .99 (.001) .99 (.006) .99 (.007)
1
2
3
rufy
our F3 LR .99 .99 (.001) .99 (.006) .99 (.007)

Classification Results vs Mohammed et al. (2021)
Dataset: COSWARA + Virufy (cough audio samples)
1
2
3
ufy
our F3 LR .99 .99 (.001) .99 (.006) .99 (.007)
1
2
3
ufy
our F3 LR .99 .99 (.001) .99 (.006) .99 (.007)
4 CNN with different inputs:
- Power Spectrum

- MFCC
- Spectrogram

- Mel-spectrogram
SVM .99 (.001) .99 (.002) .98 (.01)
RF .81 (.024) .79 (.031) .59 (.07)
AB .85 (.011) .77 (.021) .75 (.03)
Gain (%)
28
19
22
AUC
Precision
Recall
Classification Task: COVID-positive vs COVID-negative

Memory footprint
Cambridge Task 1 COSWARA + Virufy
Cambridge Task 2 Cambridge Task 3
LR with PCA 99%: 7.19 KB
AB with PCA 70%: 17 KB
LR with PCA 80%: 1.03 KB
SVM with PCA 70%: 48 KB
Low memory impact in all the experiments

Contributions
• We investigated the use of a pre-trained instance of L3-Net (OpenL3) to improve the COVID-19
detection from respiratory sound data

• Evaluation: subject-independent experiments with 3 datasets

• Results: +8% AUC vs VGGish, +22% AUC vs ensemble of end-to-end CNN

• Low memory footprint: we can perform the whole task on resource-constrained devices

Future Work
• Distinguish between COVID-19 and other respiratory diseases (e.g., asthma)
Fixed CNN layers Train FC layers
Diagnosis
• Fine-tuning OpenL3, proposing a single model for both features extraction and classification
• Extensive comparison of different audio embedding models

L3-Net Deep Audio
Embeddings to Improve
COVID-19 Detection
from Smartphone Data
Mattia G. Campana

Ubiquitous Internet Research Unit

 
Institute of Informatics and Telematics
 
National Research Council of Italy
mattiacampana.github.io
mattia.campana@iit.cnr.it
linkedin.com/in/mattiacampana

Handcrafted acoustic Features
• The audio sample is re-sampled to a standard value for audio tasks (e.g., 16kHz or 22kHz)

• Extraction of features related to both frame (i.e., audio chunks) and segment (whole sample) perspectives

• We used the same 477 HC features (including statistics) considered by Brown et al. (2020)
Feature Description
Duration Total length (in seconds) of the audio sample
Onset Number of pitch onset (i.e., “events”) in the audio signal
Tempo Rate of beats that occur at regular intervals throughout the entire audio signal
Period The frequency with the highest amplitude among those obtained from the Fast Fourier transform
(FFT)
RMS Energy Root-Mean-Square of the signal power (i.e., the magnitude of the short-time Fourier transform)
Spectral Centroid The centroid value of the frame-wise magnitude spectrogram. Identifies percussive and sustained
sounds.
Roll-off Frequency The frequency under which the 85% of the total energy of the frame-wise spectrum is contained
Zero-crossing rate The number of times the signal value crosses the zero axe, and it is computed for each frame
MFCC Shape of the cosine transformation of the song logarithmic spectrum, expressed in Mel-bands
Δ-MFCC and Δ2-MFCC The first and second order derivatives of MFCC along time

L3-Net vs VGGish
# parameters: • L3-Net: 4.7M
• VGGish: 62M
Cramer, Jason, et al. "Look, listen, and learn more: Design choices for deep audio embeddings." ICASSP 2019-2019 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019.

L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data

Recommended

Recommended

More Related Content

Similar to L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data

Similar to L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data (20)

Recently uploaded

Recently uploaded (20)

L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data