KTTO_2015_Vavrek
1. The Development of Excellence of the Telecommunication Research Team in
Relation to International Cooperation - CZ.1.07/2.3.00/20.0217
Multi-level audio classification architecture
Jozef Vavrek, Jozef Juhár
Department of Electronics and Multimedia Communications
Faculty of Electrical Engineering and Informatics
Technical University of Košice
email: {Jozef.Vavrek; Jozef.Juhar}@tuke.sk
3. Content
1. Motivation and aim
2. Proposed classification system
3. Audio data
4. Segmentation, preprocessing, feature extraction, smoothing
4.1 Feature extraction techniques (cepstral)
4.2 Feature extraction techniques (spectral)
5. Basic principles of BN audio data classification via BDT
6. Basic principles of BN audio data classification via BDA
7. Binary discrimination architecture employing Support Vector
Machine classifier (BDASVM)
8. Experimental setup
9. Results
10. Additional experiments – One Against One (OAO) architecture
11. Additional results
12. Conclusions & future work
4. 1. Motivation and aim
We built the classification system with the intention of using it to refine the acoustic models for each particular audio class and to lower the word error rate of the automatic speech recognition (ASR) system.
We proposed a binary discrimination architecture utilizing a support vector machine classifier (BDASVM) in order to surpass the classification accuracy of binary decision trees with SVM (BDTSVM) and to alleviate the misclassification error that propagates from the top of the architecture.
5. 2. Proposed classification system
6. 3. Audio data
Database: Slovak TV broadcast news corpus BNKE1, part of the COST-278 database
Audio: 16 kHz 16 bit mono PCM
Metadata: manually annotated using Transcriber
Duration: 65 hours (188 recordings)
Audio data used for training and testing:
Audio event                    Training set (min)   Testing set (min)
Pure Speech (PS)               10.19                9.16
Speech with env. sound (SES)   9.26                 9.44
Speech with music (MS)         9.41                 9.25
Music (M)                      11.7                 9.04
Env. sound (background, B)     9.06                 9.31
Total                          49                   46.2
7. 4. Segmentation, preprocessing, feature extraction, smoothing
8. 4.1 Feature extraction techniques (cepstral)
Mel-Frequency Cepstral Coefficients (MFCC)
Variance of Acceleration Mel-Frequency Cepstral Coefficients (VAMFCC)
Variance of Mel-Filter Bank Energy (VMFBE)
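The cepstral feature slides themselves are figures in the original deck. As a rough illustration of the quantity behind VMFBE, here is a minimal NumPy sketch of mel filter-bank energies and their per-band variance over a segment; the filter count (24), FFT size (512) and log compression are assumptions, not the deck's actual settings:

```python
import numpy as np

def hz_to_mel(f):
    # Standard mel-scale mapping used for MFCC-style filter banks.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=24, n_fft=512, sr=16000):
    # Triangular filters spaced evenly on the mel scale up to Nyquist.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)   # rising slope
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)   # falling slope
    return fb

def mel_band_energies(frames, n_fft=512, sr=16000):
    # frames: (n_frames, frame_len) windowed signal frames.
    spec = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    return spec @ mel_filterbank(n_fft=n_fft, sr=sr).T

def vmfbe(frames):
    # VMFBE-style statistic: variance of each band's log energy
    # across the frames of a segment.
    e = np.log(mel_band_energies(frames) + 1e-10)
    return e.var(axis=0)
```

MFCC would continue from the same filter-bank energies with a DCT; VAMFCC would then take the variance of the MFCC acceleration (delta-delta) coefficients.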
9. 4.2 Feature extraction techniques (spectral)
Spectral Centroid (SC)
Spectral Flux (SF)
Spectral Spread (SS)
10. 4.2 Feature extraction techniques (spectral)
Spectral ROLL-OFF (ROLLOFF)
Band Periodicity (BP)
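The spectral descriptors above follow their textbook definitions; a compact NumPy sketch over a magnitude spectrogram (the 85% roll-off threshold and other parameter choices are assumptions, and Band Periodicity is omitted since the deck does not show its band layout):

```python
import numpy as np

def spectral_features(mag, sr=16000, rolloff_pct=0.85):
    """Per-frame spectral descriptors from a magnitude spectrogram
    mag of shape (n_frames, n_bins)."""
    n_bins = mag.shape[1]
    freqs = np.linspace(0.0, sr / 2, n_bins)
    power = mag ** 2
    total = power.sum(axis=1) + 1e-10

    # Spectral Centroid (SC): power-weighted mean frequency.
    sc = (power * freqs).sum(axis=1) / total
    # Spectral Spread (SS): power-weighted std. dev. around the centroid.
    ss = np.sqrt((power * (freqs - sc[:, None]) ** 2).sum(axis=1) / total)
    # Spectral Flux (SF): L2 norm of the frame-to-frame magnitude change.
    sf = np.r_[0.0, np.sqrt((np.diff(mag, axis=0) ** 2).sum(axis=1))]
    # Spectral Roll-off: frequency below which rolloff_pct of power lies.
    cum = np.cumsum(power, axis=1)
    ro = freqs[(cum >= rolloff_pct * cum[:, -1:]).argmax(axis=1)]
    return sc, ss, sf, ro
```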
11. 5. Basic principles of BN audio data classification via BDT
12. 6. Basic principles of BN audio data classification via BDA
13. 7. Binary Discrimination Architecture employing Support Vector Machine classifier (BDASVM)
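The BDASVM topology itself appears only as a figure in the deck, so the following is a generic sketch of binary-discrimination routing over the five broadcast-news classes. The tree layout is an assumption inferred from the discriminator names on the results slide (S-NS, PS-NPS, MS-SES, M-B), and the threshold "discriminators" are toy stand-ins for trained SVMs:

```python
def classify(x, s_ns, ps_nps, ms_ses, m_b):
    """Route a feature vector x through binary discriminators.
    Each discriminator is a callable returning True for its
    first-named class (e.g. s_ns(x) is True for speech)."""
    if s_ns(x):                      # speech vs. non-speech
        if ps_nps(x):                # pure speech vs. non-pure speech
            return "PS"
        return "MS" if ms_ses(x) else "SES"
    return "M" if m_b(x) else "B"    # music vs. background

# Toy usage with threshold "discriminators" on a 2-value feature vector.
label = classify(
    (0.9, 0.2),
    s_ns=lambda x: x[0] > 0.5,
    ps_nps=lambda x: x[1] < 0.3,
    ms_ses=lambda x: x[1] > 0.6,
    m_b=lambda x: x[0] > 0.8,
)
```

A BDT-style cascade like this lets each level use its own parameterization and feature selection, which is the advantage the conclusions claim over the flat OAO architecture.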
14. 8. Experimental setup
Segmentation: rectangular window, length 200 ms, 100 ms overlap
Preprocessing: Hamming window, length 50 ms, 25 ms overlap
Feature extraction: frame-based, segment-based, frame-based with smoothing, segment-based with smoothing
Smoothing: floating window, length 1 s
Classification: support vector machine classifier, RBF kernel function, 5-fold cross-validation
• Evaluation parameters:
– for cross-validation: Area Under the Curve (AUC) = ⟨TPR⟩, AUC ∈ (0.5, 1)
– for classification performance: Accuracy (Acc) = (TP + TN) / (TP + FP + TN + FN)
– Processing Time (PT)
Software: wavex (wav extractor), libsvm-3.17
Hardware: HPC TUKE, 24 nodes; IBM Blade System x HS22 with two six-core Intel Xeon L5640 (2.27 GHz) processors and 48 GB RAM
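The preprocessing and smoothing steps above can be sketched as follows. The 40 frames/s rate (25 ms hop) and majority-vote smoothing are assumptions; the deck does not specify how the 1 s floating window combines the per-frame decisions:

```python
import numpy as np

def frame_signal(x, sr=16000, win_s=0.05, hop_s=0.025):
    # Slice the signal into 50 ms Hamming-windowed frames with a
    # 25 ms hop, matching the deck's preprocessing step.
    win = int(win_s * sr)
    hop = int(hop_s * sr)
    n = 1 + max(0, (len(x) - win) // hop)
    frames = np.stack([x[i * hop:i * hop + win] for i in range(n)])
    return frames * np.hamming(win)

def smooth_decisions(labels, frames_per_s=40, win_s=1.0):
    # 1 s floating-window vote over per-frame class labels: each
    # frame takes the most frequent label in its surrounding window.
    half = int(win_s * frames_per_s) // 2
    out = []
    for i in range(len(labels)):
        window = labels[max(0, i - half):i + half + 1]
        out.append(max(set(window), key=window.count))
    return out
```

Smoothing of this kind is what separates the "framefw"/"segfw" parameterization levels in the results from the raw frame- and segment-based ones.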
15. 9. Results
Tab.: Classification performance of BDTSVM and BDASVM architectures for different parameterization levels (Acc [%] is the average over the S-NS, PS-NPS, MS-SES and M-B discriminators)

Topology   frame    framefw   seg      segfw
BDTSVM     74.82    88.05     78.13    86.81
BDASVM     75.10    89.12     80.43    90.52
Diff       +0.28    +1.07     +2.3     +3.71

Tab.: The overall classification performance of BDTSVM and BDASVM architectures (Acc [%] is the average over all parameterization levels)

Topology   PS      MS      SES     M       B       Avg     PT [min]
BDTSVM     85.69   54.46   48.63   72.75   77.83   67.87   44.13
BDASVM     85.94   53.29   48.94   72.85   80.74   68.35   48.37
Diff                                               +0.48   -4.24
16. 10. Additional experiments – One Against One (OAO) architecture
17. 11. Additional results
Tab.: Classification performance of OAOSVM and BDASVM architectures for different parameterization levels (Acc [%] is the average over the PS, MS, SES, M and B classes)

Topology   frame    framefw   seg      segfw
OAOSVM     63.70    77.38     64.41    74.16
BDASVM     54.95    79.75     62.91    75.79
Diff       -8.75    +2.37     -1.5     +1.63

Tab.: The overall classification performance of OAOSVM and BDASVM architectures (Acc [%] is the average over all parameterization levels)

Topology   PS      MS      SES     M       B       Avg     PT [min]
OAOSVM     86.92   53.22   46.79   76.49   86.13   69.91   24.56
BDASVM     85.94   53.29   48.94   72.85   80.74   68.35   48.37
Diff                                               -1.56   -23.81
18. 12. Conclusions & future work
Advantages of BDASVM
• significant classification error reduction on each individual classification level, regardless of the parameterization level (against BDTSVM)
• higher overall classification accuracy (against BDTSVM)
• possibility of using optimal parameterization and feature selection techniques on each individual classification level (against OAOSVM)
Disadvantages of BDASVM
• higher number of classifiers => higher processing time (against BDTSVM and OAOSVM)
• a need to find an optimal feature selection algorithm for selecting the optimal training and testing sets (against BDTSVM and OAOSVM)
In the near future, we will compare the BDASVM with the One Against All SVM (OAASVM) architecture and extract each audio class using phoneme-based alignment.
Future work will also be directed towards implementing the BDASVM in the BN transcription system.
19. Thank you for your attention