Presentation of the paper entitled "L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data" at SMARTCOMP 2022, Aalto University, Finland.
L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data
1. L3-Net Deep Audio
Embeddings to Improve
COVID-19 Detection
from Smartphone Data
Mattia G. Campana (IIT-CNR)
Andrea Rovati (UniMi)
Franca Delmastro (IIT-CNR)
Elena Pagani (UniMi)
IEEE SMARTCOMP 2022, June 20-24
Aalto University, Espoo, Finland
2. L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data
AI response to the COVID-19 pandemic
2 IEEE SMARTCOMP 2022
Help the healthcare system
• Machine Learning (ML) classifiers for blood test results
• Deep Learning (DL) models to analyze chest X-ray and lung Computed Tomography (CT) images
Track behaviours in public places
• Monitoring social distancing
• Face mask detection systems
3. L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data
m-health systems based on respiratory sounds
Diagnosis
• Pervasive & low-cost solution for fast screening
• Support the healthcare system in identifying new cases (prevention of new outbreaks)
• Track the disease evolution
4. L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data
COVID-19 Detection from respiratory sounds
Handcrafted acoustic features (HC)
Main drawbacks
• Difficult to find the best set of features
• Typically outperformed by Deep Learning models
Pipeline: feature extraction → shallow classifier → COVID-19 positive/negative
Time domain:
• RMS Energy (how loud the signal is)
• Zero-crossing rate (how fast the signal changes)
Frequency domain:
• Spectral centroid
• Period (frequency with the highest amplitude)
Time-frequency representations:
• Spectrogram
• Mel-Frequency Cepstral Coefficients (MFCC)
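Two of the time-domain features above can be sketched in a few lines of NumPy; the frame length, hop size, and test tone are illustrative choices, not the paper's exact settings:

```python
import numpy as np

def frame_signal(x, frame_len=2048, hop=512):
    """Slice a 1-D signal into overlapping frames."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])

def rms_energy(frames):
    """Root-mean-square energy per frame (how loud the signal is)."""
    return np.sqrt(np.mean(frames ** 2, axis=1))

def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs whose sign differs (how fast the signal changes)."""
    return np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)

# Toy input: one second of a 440 Hz tone at 16 kHz, standing in for a cough recording
sr = 16000
t = np.arange(sr) / sr
frames = frame_signal(np.sin(2 * np.pi * 440 * t))
print(rms_energy(frames).mean())          # ~0.707 for a pure sine
print(zero_crossing_rate(frames).mean())  # ~2 * 440 / 16000 crossings per sample
```

In a real system, libraries such as librosa compute these (and the spectral features) directly from the audio file.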
5. L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data
COVID-19 Detection from respiratory sounds
DL-based approach
Representative work:
E. A. Mohammed et al., "An ensemble learning approach to digital corona virus preliminary screening from cough sounds", Scientific Reports, 2021.
Main drawback: Requires large-scale datasets, especially for complex models
Pipeline: graphical representation (i.e., a spectrogram-like image) → Convolutional Neural Network (CNN) → COVID-19 positive/negative
Ensemble of CNNs with different audio representations
6. L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data
COVID-19 Detection from respiratory sounds
“Hybrid” approach
Representative work:
Brown, Chloë, et al. "Exploring automatic diagnosis of COVID-19 from crowdsourced respiratory sound data." In
Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020.
Pipeline: HC features + deep audio embeddings (from a pre-trained DL model) → shallow classifier → COVID-19 positive/negative
477 HC features + VGGish (trained on AudioSet, ~2 million samples)
7. L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data
Improving the Hybrid approach
Investigation of an alternative embedding model: L3-Net
Pipeline: HC features + deep audio embeddings (from a pre-trained DL model) → shallow classifier → COVID-19 positive/negative
8. L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data
L3-Net: Look, Listen and Learn
Arandjelovic, Relja, and Andrew Zisserman. "Look, listen and learn." Proceedings of the IEEE International Conference on Computer Vision. 2017.
Two sub-networks trained jointly on an audio-visual correspondence task:
• Video sub-network: video frame image → video embeddings
• Audio sub-network: Mel-spectrogram (1 s window) → audio embeddings
• Fusion layers (fully connected): do the image and the audio come from the same video?
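A minimal NumPy sketch of the fusion head only, with random stand-in weights and hypothetical layer sizes (the real sub-networks are convolutional and trained end to end):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embedding size; the real sub-networks are CNNs ending in such vectors
D = 512
video_emb = rng.standard_normal(D)  # output of the video sub-network
audio_emb = rng.standard_normal(D)  # output of the audio sub-network

# Fusion head: concatenate the two embeddings, then two fully-connected layers
W1, b1 = 0.01 * rng.standard_normal((128, 2 * D)), np.zeros(128)
W2, b2 = 0.01 * rng.standard_normal((1, 128)), np.zeros(1)

h = np.maximum(W1 @ np.concatenate([video_emb, audio_emb]) + b1, 0.0)  # ReLU
p_match = 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))  # P(image and audio share a video)
print(p_match)
```

Training this binary "correspondence" objective on unlabeled videos is what gives the audio sub-network useful embeddings for free.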
9. L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data
L3-Net for COVID-19 Detection
Pipeline:
• Cough/breath audio sample → audio frames → Mel-spectrograms
• Audio sub-network → per-frame audio embeddings
• Combination of the frame embeddings (mean + std) → audio file embeddings
• Audio file embeddings + HC features → dimensionality reduction (PCA) → shallow classifier → COVID-19 positive/negative
Cramer, Jason, et al. "Look, listen, and learn more: Design choices for deep audio embeddings." ICASSP 2019-2019 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019.
OpenL3 model trained with AudioSet
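The mean + std aggregation step can be sketched as follows; the random matrix stands in for the (n_frames × 512) output that OpenL3 would produce, and the `openl3` call in the comment is only indicative:

```python
import numpy as np

def file_embedding(frame_embs):
    """Aggregate per-frame embeddings into one vector per audio file by
    concatenating the element-wise mean and standard deviation."""
    return np.concatenate([frame_embs.mean(axis=0), frame_embs.std(axis=0)])

# In practice the frames would come from OpenL3, roughly:
#   emb, ts = openl3.get_audio_embedding(audio, sr)  # shape (n_frames, 512)
# Random data stands in for those embeddings here.
rng = np.random.default_rng(42)
frame_embs = rng.standard_normal((10, 512))
print(file_embedding(frame_embs).shape)  # (1024,)
```

The resulting fixed-size vector is what gets concatenated with the HC features and fed to PCA.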
10. L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data
Experimental Evaluation: Goals
1) Improve the classification performance with respect to:
- Brown et al. (2020): same approach, but a different embedding model (VGGish vs. L3-Net)
- Mohammed et al. (2021): ensemble model (CNNs trained from scratch vs. a pre-trained model)
2) Can we perform the classification task directly on the mobile device?
- Memory footprint evaluation
11. L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data
Datasets
Cambridge
crowdsourced breath and cough audio samples (data agreement)
www.covid-19-sounds.org
COSWARA
crowdsourced cough samples
coswara.iisc.ac.in
https://github.com/iiscleap/Coswara-Data
Virufy
Cough samples collected in hospital; labels based on
COVID-19 PCR test results
https://github.com/virufy/virufy-covid
[Bar chart: number of Healthy and COVID-19 samples in each dataset (Cambridge, COSWARA, Virufy)]
12. L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data
Evaluation protocol
5-fold nested cross-validation with stratified, user-based splits:
• Balanced dataset (under-sampling)
• Train set / Validation set → training & tuning → best model
• Dev set / Test set → performance (AUC, Precision, Recall)
PCA explained variance: [0.7, 0.8, 0.9, 0.95, 0.99]
Shallow classifiers: Logistic Regression (LR), Support Vector Machines (SVM), AdaBoost (AB), Random Forest (RF)
Feature sets:
• F1: deep audio embeddings
• F2: embeddings + Period, Tempo, Duration
• F3: embeddings + HC features, except Δ-MFCC and Δ²-MFCC
• F4: embeddings + all 477 HC features
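A user-based split can be sketched as below; this toy version assigns whole users to folds but omits the label stratification used in the actual protocol, and all names are illustrative:

```python
import numpy as np

def user_based_folds(user_ids, n_folds=5, seed=0):
    """Assign every sample of a given user to the same fold, so no user
    appears in both the training and the test side of a split."""
    users = np.unique(user_ids)
    rng = np.random.default_rng(seed)
    rng.shuffle(users)
    fold_of_user = {u: i % n_folds for i, u in enumerate(users)}
    return np.array([fold_of_user[u] for u in user_ids])

user_ids = np.array(["u1", "u1", "u2", "u3", "u3", "u4", "u5", "u6"])
folds = user_based_folds(user_ids)
print(folds)
```

Keeping each subject on one side of the split is what makes the evaluation subject-independent: the classifier cannot exploit speaker identity.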
15. L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data
Memory footprint
• Cambridge Task 1: LR with PCA 99% (7.19 KB)
• COSWARA + Virufy: AB with PCA 70% (17 KB)
• Cambridge Task 2: LR with PCA 80% (1.03 KB)
• Cambridge Task 3: SVM with PCA 70% (48 KB)
Low memory impact in all the experiments
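As a rough sanity check on sizes of this order, a logistic-regression model after PCA stores about one coefficient per retained component; the byte count and component number below are illustrative assumptions, not the paper's measurements:

```python
# A logistic-regression model after PCA stores one coefficient per retained
# component plus an intercept; with 64-bit floats that is:
def lr_size_kb(n_components, bytes_per_param=8):
    return (n_components + 1) * bytes_per_param / 1024

# e.g. keeping 128 of the 1024 embedding dimensions:
print(f"{lr_size_kb(128):.2f} KB")  # 1.01 KB, the same order as the sizes above
```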
16. L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data
Contributions
• We investigated the use of a pre-trained instance of L3-Net (OpenL3) to improve COVID-19 detection from respiratory sound data
• Evaluation: subject-independent experiments with 3 datasets
• Results: +8% AUC vs. VGGish, +22% AUC vs. an ensemble of end-to-end CNNs
• Low memory footprint: we can perform the whole task on resource-constrained devices
17. L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data
Future Work
• Distinguish between COVID-19 and other respiratory diseases (e.g., asthma)
• Fine-tune OpenL3 (fixed CNN layers, trainable FC layers) to obtain a single model for both feature extraction and classification (diagnosis)
• Extensive comparison of different audio embedding models
18. L3-Net Deep Audio
Embeddings to Improve
COVID-19 Detection
from Smartphone Data
Mattia G. Campana
Ubiquitous Internet Research Unit
Institute of Informatics and Telematics
National Research Council of Italy
mattiacampana.github.io
mattia.campana@iit.cnr.it
linkedin.com/in/mattiacampana
19. L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data
Handcrafted acoustic Features
• The audio sample is re-sampled to a standard value for audio tasks (e.g., 16kHz or 22kHz)
• Extraction of features related to both frame (i.e., audio chunks) and segment (whole sample) perspectives
• We used the same 477 HC features (including statistics) considered by Brown et al. (2020)
• Duration: Total length (in seconds) of the audio sample
• Onset: Number of pitch onsets (i.e., "events") in the audio signal
• Tempo: Rate of beats occurring at regular intervals throughout the audio signal
• Period: The frequency with the highest amplitude among those obtained from the Fast Fourier Transform (FFT)
• RMS Energy: Root-mean-square of the signal power (i.e., the magnitude of the short-time Fourier transform)
• Spectral Centroid: The centroid of the frame-wise magnitude spectrogram; distinguishes percussive from sustained sounds
• Roll-off Frequency: The frequency below which 85% of the total energy of the frame-wise spectrum is contained
• Zero-crossing rate: The number of times the signal crosses the zero axis, computed for each frame
• MFCC: Shape of the cosine transform of the signal's logarithmic spectrum, expressed in Mel bands
• Δ-MFCC and Δ²-MFCC: The first- and second-order derivatives of MFCC along time
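The roll-off frequency, for instance, can be computed directly from a magnitude spectrum; the single-bin toy spectrum below is only for illustration:

```python
import numpy as np

def rolloff_frequency(magnitudes, freqs, pct=0.85):
    """Frequency below which `pct` of the total spectral energy lies."""
    energy = magnitudes ** 2
    cumulative = np.cumsum(energy) / energy.sum()
    return freqs[np.searchsorted(cumulative, pct)]

# Toy spectrum with all energy concentrated in one bin near 500 Hz
freqs = np.linspace(0, 8000, 1024)
mags = np.zeros(1024)
mags[64] = 1.0
print(rolloff_frequency(mags, freqs))  # ~500 Hz (the bin holding the energy)
```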
20. L3-Net Deep Audio Embeddings to Improve COVID-19 Detection from Smartphone Data
L3-Net vs VGGish
Number of parameters:
• L3-Net: 4.7M
• VGGish: 62M
Cramer, Jason, et al. "Look, listen, and learn more: Design choices for deep audio embeddings." ICASSP 2019-2019 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019.