1. High Performance Computing & Systems Lab
Unsupervised feature learning for audio classification using convolutional deep belief networks
Honglak Lee, Yan Largman, Peter Pham, Andrew Y. Ng
Computer Science Department, Stanford University, Stanford, CA 94305
Advances in Neural Information Processing Systems 22 (NIPS 2009)
Presenter: Chung il Kim
Paper Seminar, 31 Aug 2017
Contents
Abstract & Introduction
Theory & Algorithm
Convolutional Deep Belief Networks (CDBN)
On Shift-Invariant Sparse Coding (SISC)
Unsupervised Feature Learning
Application to Audio Recognition Tasks
Speech Recognition
Music Classification
Discussion and Conclusion
1. Abstract & Introduction (1)
Abstract
Deep learning approaches
Build hierarchical representations from unlabeled data
This work focuses on unlabeled auditory data
Using convolutional deep belief networks (CDBN)
Evaluate the learned features on various audio classification tasks, comparing:
RAW
MFCC
CDBN (L1, L2)
1. Abstract & Introduction (2)
Introduction
Issues in audio data recognition
Audio data is high-dimensional and complex
Previous work [1, 2]
Sparse coding learns filters that correspond to cochlear filters
Related work [3]
Efficient sparse coding algorithms for audio classification tasks
– Feature-sign search algorithm (FS-EXACT, FS-Window)
– Lagrangian dual solved via the DFT
[1] E. C. Smith and M. S. Lewicki. Efficient auditory coding. Nature, 439:978–982, 2006.
[2] B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607–609, 1996.
[3] R. Grosse, R. Raina, H. Kwong, and A. Y. Ng. Shift-invariant sparse coding for audio classification. In UAI, 2007.
1. Abstract & Introduction (3)
Introduction
The limits of those methods
They learn relatively shallow,
1-layer representations
Many promising approaches [4, 5, 6, 7, 8], usually applied to images:
fast learning algorithms for deep belief nets [4]
sparse representations via energy-based models [5]
greedy layer-wise training [6]
empirical evaluation of deep architectures [7]
But deep learning had not been applied to auditory data
[4] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
[5] M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun. Efficient learning of sparse representations with an energy-based model. In NIPS, 2006.
[6] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In NIPS, 2006.
[7] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In ICML, 2007.
[8] H. Lee, C. Ekanadham, and A. Y. Ng. Sparse deep belief network model for visual area V2. In NIPS, 2008.
1. Abstract & Introduction (4)
Introduction
Deep belief network (DBN)
A generative probabilistic model
– Composed of one visible layer and multiple hidden layers
Trained well using greedy layer-wise training
Convolutional deep belief network (CDBN) [9]
Also trained in a greedy, bottom-up fashion
Good performance on several visual recognition tasks
This work: CDBN on unlabeled audio data
Evaluate the learned feature representations
– on several audio classification tasks
[9] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, 2009.
2. Convolutional Deep Belief Network (1)
Convolutional Restricted Boltzmann Machines (CRBMs)
A CDBN consists of stacked CRBM blocks
<Figure 1> Image of convolutional deep belief networks
1. Select a local receptive field
2. Compute detections through the filters
(highly overcomplete, so sparsity is needed)
3. Pool (usually max-pooling)
4. Train greedily, layer by layer
– stacking more than one layer
5. Recover the patterns of the visible data
2. Convolutional Deep Belief Network (2)
Convolutional Restricted Boltzmann Machines (CRBMs)
An extension of ‘regular’ Restricted Boltzmann Machines (RBMs)
Convolution decreases the relevant dimensionality
but leaves the representation overcomplete, creating a sparsity problem
<Figure 2> Dimensionality reduction and sparsity
2. Convolutional Deep Belief Network (3)
CDBNs
Energy function
The CRBM probability distribution is defined via this energy (next page)
<Formula 1> Energy function of CRBMs with binary (top) and real-valued (bottom) visible units:

$$E(v, h) = -\sum_{k=1}^{K} \sum_{j=1}^{n_H} \sum_{r=1}^{n_W} h_j^k W_r^k v_{j+r-1} - \sum_{k=1}^{K} b_k \sum_{j=1}^{n_H} h_j^k - c \sum_{i=1}^{n_V} v_i$$

$$E(v, h) = \frac{1}{2}\sum_{i=1}^{n_V} v_i^2 - \sum_{k=1}^{K} \sum_{j=1}^{n_H} \sum_{r=1}^{n_W} h_j^k W_r^k v_{j+r-1} - \sum_{k=1}^{K} b_k \sum_{j=1}^{n_H} h_j^k - c \sum_{i=1}^{n_V} v_i$$

n_V : length of the visible (input) array
n_W : filter length
K : number of filters
n_H : length of each hidden group, n_H = n_V − n_W + 1
b_k : shared bias for hidden group k
c : shared bias for the visible units
2. Convolutional Deep Belief Network (4)
CDBNs
Probability distribution
The CRBM probability distributions are defined via the energy function
<Formula 2> Joint and conditional probability distributions:

$$P(v, h) = \frac{1}{Z} \exp\big(-E(v, h)\big)$$

$$P(h_j^k = 1 \mid v) = \sigma\big((\tilde{W}^k *_v v)_j + b_k\big)$$

$$P(v_i = 1 \mid h) = \sigma\Big(\big(\textstyle\sum_k W^k *_f h^k\big)_i + c\Big)$$

*_v : valid convolution; *_f : full convolution; \tilde{W}^k is W^k flipped. For real-valued visible units, v_i given h is Gaussian with the same mean.
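A minimal NumPy sketch of these 1-D conditionals; all shapes and names are illustrative assumptions, not the authors' code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_given_visible(v, W, b):
    # P(h^k_j = 1 | v) = sigmoid((tilde-W^k *_v v)_j + b_k); a valid
    # convolution with the flipped filter equals valid cross-correlation.
    return np.array([sigmoid(np.correlate(v, W[k], mode='valid') + b[k])
                     for k in range(len(W))])

def visible_mean_given_hidden(h, W, c):
    # E[v_i | h] = (sum_k W^k *_f h^k)_i + c for real-valued visible units.
    n_v = h.shape[1] + W.shape[1] - 1
    mean = np.full(n_v, float(c))
    for k in range(len(W)):
        mean += np.convolve(h[k], W[k], mode='full')
    return mean

# toy usage: K = 3 filters of length n_W = 6 on an n_V = 20 signal
rng = np.random.default_rng(0)
K, n_w, n_v = 3, 6, 20
W, b, c = 0.1 * rng.normal(size=(K, n_w)), np.zeros(K), 0.0
v = rng.normal(size=n_v)
p_h = hidden_given_visible(v, W, b)            # shape (3, 15) = (K, n_V - n_W + 1)
v_mean = visible_mean_given_hidden(p_h, W, c)  # shape (20,)
```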
2. Convolutional Deep Belief Network (5)
Pooling layer
Shrinks the feature map
For classification, max-pooling is the most common choice
Input (4x4):
0    0.5  0.5  0.4
0.7  0.1  0.2  0.4
0.9  0.3  0.7  0.5
0.5  0.8  0.2  0

Output (3x3, 2x2 windows with stride 1):
0.7  0.5  0.5
0.9  0.7  0.7
0.9  0.8  0.7

<Picture 3> Image of max-pooling
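A small NumPy sketch reproducing the example above (2x2 windows with stride 1, matching the figure; the experiments later use a non-overlapping pooling ratio of 3):

```python
import numpy as np

def max_pool(x, size=2, stride=1):
    """Max-pool a 2-D map with a size x size window."""
    rows = (x.shape[0] - size) // stride + 1
    cols = (x.shape[1] - size) // stride + 1
    return np.array([[x[i*stride:i*stride+size, j*stride:j*stride+size].max()
                      for j in range(cols)] for i in range(rows)])

x = np.array([[0.0, 0.5, 0.5, 0.4],
              [0.7, 0.1, 0.2, 0.4],
              [0.9, 0.3, 0.7, 0.5],
              [0.5, 0.8, 0.2, 0.0]])
print(max_pool(x))   # [[0.7 0.5 0.5] [0.9 0.7 0.7] [0.9 0.8 0.7]]
```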
2. Convolutional Deep Belief Network (6)
Process of CDBNs
1. Select a local receptive field
2. Compute detections through the filters (highly overcomplete, so sparsity is needed)
3. Pool (usually max-pooling)
4. Train greedily, layer by layer (stacking more than one layer)
5. Recover the patterns of the visible data
<Picture 4> Process of CDBNs
https://deeplearning4j.org/kr/convolutionnets
3. On Shift-Invariant Sparse Coding (1)
Sparsity
A typical CRBM is highly overcomplete
A sparsity penalty term is added to the log-likelihood objective
– to address overfitting in deep neural networks
– to avoid effectively full connectivity
The sparse-coding step uses the LASSO (the Least Absolute Shrinkage and Selection Operator) penalty
<Formula 3> The training objective
<Formula 4> The sparsity objective
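A hedged reconstruction of the missing formulas, assuming the standard forms from the cited work: the sparsity-regularized training objective follows Lee et al. [8], and the SISC objective uses an L1 (LASSO) penalty:

```latex
% Training objective: negative log-likelihood plus a sparsity penalty
\min_{W,b,c}\; -\sum_{l} \log P\big(v^{(l)}\big)
  \;+\; \lambda \sum_{j} \Big\| p - \tfrac{1}{m}\sum_{l}
  \mathbb{E}\big[h_j \mid v^{(l)}\big] \Big\|^2

% SISC objective: reconstruction error plus an L1 (LASSO) term
\min_{b,s}\; \sum_{l} \Big\| x^{(l)} - \sum_{k} b^{(k)} * s^{(l,k)} \Big\|_2^2
  \;+\; \beta \sum_{l,k} \big\| s^{(l,k)} \big\|_1
```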
3. On Shift-Invariant Sparse Coding (2)
Two algorithms solve SISC for audio data
For the coefficients: the feature-sign search algorithm
Efficient for short signals (low-dimensional x)
Impractical for signals longer than about one minute
<Pseudo 1> Feature-sign search algorithm 1
R. Grosse, R. Raina, H. Kwong, and A.Y. Ng. Shift-invariant sparse coding for audio classification. In UAI, 2007
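The feature-sign pseudocode image did not survive extraction. As a hedged stand-in (a substitute solver, not the paper's algorithm), scikit-learn's coordinate-descent Lasso minimizes the same L1-penalized objective:

```python
# Feature-sign search and LASSO both minimize ||y - A x||^2 + gamma * ||x||_1,
# so sklearn's Lasso recovers an equivalent sparse coefficient vector.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
A = rng.normal(size=(64, 256))             # dictionary of 256 basis functions
x_true = np.zeros(256)
x_true[rng.choice(256, size=5, replace=False)] = rng.normal(size=5)
y = A @ x_true + 0.01 * rng.normal(size=64)

# sklearn minimizes (1/(2*n)) ||y - A x||^2 + alpha ||x||_1,
# so alpha = gamma / (2 * n_samples) matches the objective above.
gamma = 0.1
lasso = Lasso(alpha=gamma / (2 * len(y)), max_iter=10000)
lasso.fit(A, y)
x_hat = lasso.coef_                        # sparse coefficient vector
print(np.count_nonzero(np.abs(x_hat) > 1e-6), "active coefficients")
```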
3. On Shift-Invariant Sparse Coding (3)
Two algorithms solve SISC for audio data
For the bases: the Lagrangian with the DFT
1st, apply the Discrete Fourier Transform
– to decompose the signal
2nd, form the Lagrangian
– to solve the constrained optimization
3rd, solve with Newton's method (used in this paper)
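A hedged sketch of why the DFT helps, assuming circular convolution: by Parseval's theorem the reconstruction error decouples across frequencies, so the bases subproblem can be handled per frequency ω in the Lagrangian dual:

```latex
% Parseval: convolution becomes per-frequency multiplication
\Big\| x - \sum_{k} b^{(k)} * s^{(k)} \Big\|_2^2
  \;=\; \frac{1}{n} \sum_{\omega}
  \Big| \hat{x}(\omega) - \sum_{k} \hat{b}^{(k)}(\omega)\,\hat{s}^{(k)}(\omega) \Big|^2
```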
3. On Shift-Invariant Sparse Coding (4)
Approach to the coefficient task
Uses the LASSO (L1) objective
Solved through its partial-derivative optimality conditions
The L1 penalty raises bias and lowers variance (a trade-off)
Liang Sun Arizona State University, Efficient Sparse Coding Algorithms, http://slideplayer.com/slide/4953202/
<Pseudo 2> Feature-sign search algorithm 2
3. On Shift-Invariant Sparse Coding (5)
Approach to the coefficient task
The problem reduces to an ‘unconstrained QP’
whose analytical solution can be computed in closed form
over the active subvector of x
A discrete line search (LS) then updates x toward that solution:
collect every point where a coefficient changes sign, and update to the one with the lowest objective value
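A hedged reconstruction of the closed-form step, following the feature-sign derivation in Lee et al., Efficient sparse coding algorithms (NIPS 2006): minimizing $\|y - \hat{A}\hat{x}\|^2 + \gamma\,\hat{\theta}^{\top}\hat{x}$ over the active subvector $\hat{x}$ with fixed signs $\hat{\theta}$ gives:

```latex
% Closed-form minimizer of the unconstrained QP on the active set
\hat{x}_{\text{new}}
  = \big(\hat{A}^{\top}\hat{A}\big)^{-1}
    \Big(\hat{A}^{\top} y - \frac{\gamma\,\hat{\theta}}{2}\Big)
```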
Liang Sun Arizona State University, Efficient Sparse Coding Algorithms, http://slideplayer.com/slide/4953202/
<Pseudo 3> Feature-sign search algorithm 2
3. On Shift-Invariant Sparse Coding (6)
Approach to the coefficient task
Finally, check the optimality conditions and repeat until they hold
Liang Sun Arizona State University, Efficient Sparse Coding Algorithms, http://slideplayer.com/slide/4953202/
<Pseudo 4> Feature-sign search algorithm 2
3. On Shift-Invariant Sparse Coding (7)
Result of FS search (learning speed)
3. On Shift-Invariant Sparse Coding (8)
Result of FS search (speech)
Speech data (TIMIT)
32 basis functions learned from 1-second speech signals
Filters compared:
SISC (with FS), MFCC (Mel-Frequency Cepstral Coefficients), RAW
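For context on the MFCC baseline, a minimal extraction sketch; librosa and all parameters here are assumptions, not the paper's tooling:

```python
# Minimal MFCC sketch on a synthetic 1-second signal.
import numpy as np
import librosa

sr = 16000
t = np.linspace(0, 1.0, sr, endpoint=False)
y = np.sin(2 * np.pi * 440.0 * t).astype(np.float32)   # 1 s test tone

# 13 cepstral coefficients per frame (a common, illustrative choice)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfcc.shape)                                       # (13, n_frames)
```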
3. On Shift-Invariant Sparse Coding (9)
Result of FS search (musical genre)
2-second segments, 5-way musical genre classification
Filters compared:
SISC (with FS), TC (Tzanetakis & Cook),
MFCC (Mel-Frequency Cepstral Coefficients), RAW
4. Unsupervised Feature Learning (1)
Description of the TIMIT data
A corpus for research on speech recognition systems
American English
In this research
Spectrogram form
Window size: 20 ms
Overlap: 10 ms
PCA whitening (with 80 components)
– to reduce the dimensionality
Research contents
Phonemes
Speaker gender
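A hedged preprocessing sketch matching the settings above; scipy/sklearn, the log-amplitude step, and the synthetic input are assumptions:

```python
# Spectrogram with 20 ms windows and 10 ms overlap, then PCA whitening
# to 80 components (as in the slide; tooling is a stand-in).
import numpy as np
from scipy import signal
from sklearn.decomposition import PCA

sr = 16000                                   # TIMIT sample rate
audio = np.random.randn(sr * 2)              # stand-in for a TIMIT utterance
nperseg = int(0.020 * sr)                    # 20 ms window
noverlap = int(0.010 * sr)                   # 10 ms overlap
f, t, spec = signal.spectrogram(audio, fs=sr, nperseg=nperseg,
                                noverlap=noverlap)

# Rows = time frames, columns = frequency bins; whiten across frequency.
frames = np.log(spec.T + 1e-8)               # log-amplitude (a common choice)
pca = PCA(n_components=80, whiten=True)
whitened = pca.fit_transform(frames)         # (n_frames, 80) CDBN input
```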
4. Unsupervised Feature Learning (2)
Layer and training settings
1st layer
300 bases
Filter length (n_W): 6
Max-pooling ratio: 3
2nd layer
300 bases (input: the 1st layer's output)
Filter length: 6
Max-pooling ratio: 3
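A small sketch of the map sizes these settings imply, assuming valid 1-D convolution (n_H = n_V − n_W + 1) and non-overlapping pooling; the input length is illustrative:

```python
def cdbn_layer_sizes(n_v, filter_len=6, pool_ratio=3, n_layers=2):
    """Detection and pooled sizes per layer for a 1-D CDBN."""
    sizes = []
    for _ in range(n_layers):
        n_h = n_v - filter_len + 1        # detection layer (valid conv)
        n_p = n_h // pool_ratio           # pooled layer
        sizes.append((n_h, n_p))
        n_v = n_p                         # pooled output feeds the next layer
    return sizes

print(cdbn_layer_sizes(n_v=100))          # [(95, 31), (26, 8)]
```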
4. Unsupervised Feature Learning (3)
Phonemes and the CDBN features
Analysis
Vowels (“ah”, “oy”)
Prominent horizontal bands
in the lower frequencies
“oy”
An upward-slanting pattern
4. Unsupervised Feature Learning (4)
Phonemes and the CDBN features
Analysis
Fricatives (“s”)
Energy concentrated in the high frequencies
“el”
High intensity in the low frequencies,
followed by low intensity in the high frequencies
4. Unsupervised Feature Learning (5)
Speaker gender information and CDBN features
Female speakers show a finer horizontal banding pattern in the low frequencies
L1 and L2 denote first- and second-layer bases
5. Speech Recognition (Speaker ID) (1)
About the data
No. of speakers: 168
Sentences per speaker: 10
Total sentences: 1,680
1. Speaker identification test
10 random trials
Training: TIMIT data
All data expressed as spectrograms
Features: RAW, MFCC, CDBN L1, CDBN L2, CDBN L1+L2
Simple summary statistics computed for each channel
Features evaluated with standard supervised classifiers:
SVM (Support Vector Machine), GDA (Gaussian Discriminant Analysis),
KNN (K-Nearest Neighbor classification)
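A hedged evaluation sketch with scikit-learn stand-ins (LDA playing the role of GDA; the features and labels are synthetic placeholders):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 80))        # stand-in summary-statistic features
y = rng.integers(0, 2, size=200)      # stand-in labels (e.g. gender)

# Same feature set, three standard supervised classifiers.
for name, clf in [("SVM", SVC()),
                  ("GDA/LDA", LinearDiscriminantAnalysis()),
                  ("KNN", KNeighborsClassifier(n_neighbors=5))]:
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: {acc:.3f}")
```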
5. Speech Recognition (Speaker ID) (2)
Speaker Identification
5. Speech Recognition (Speaker ID) (3)
2. Speaker gender classification
Randomly sampled training examples
200 test examples
20 trials
5. Speech Recognition (Speaker ID) (4)
3. Phone classification
39-way phone classification accuracy
Averaged over 5 random trials
6. Music Classification (1)
1. Genre classification
1st and 2nd layers
Music data from ISMIR
Bases: 300
Filter length: 10
Max-pooling ratio: 3
Randomly sampled 3-second segments (as training or test samples)
Genres: 5-way (classical, electronic, jazz, pop, and rock)
20 random trials for each number of training examples
6. Music Classification (2)
2. Artist classification
1st and 2nd layers (same settings as genre classification)
Music data from ISMIR
Bases: 300
Filter length: 10
Max-pooling ratio: 3
Randomly sampled 3-second segments (as training or test samples)
Classical music only
4-way artist classification
Averaged over 20 random trials
6. Music Classification (3)
2. Artist classification
7. Discussion
Not directly suited to modern speech corpora
which are much larger than the TIMIT data set
This research's target:
settings with a restricted amount of labeled data
Remaining interesting problems:
applying deep learning to larger datasets
and to more challenging tasks
8. Conclusion
Applied CDBNs to audio data
Evaluated on various audio classification tasks
without using a large amount of labeled data
The learned features often equaled or surpassed MFCC
(MFCC is hand-tailored to audio data)
Combining both achieves even higher classification accuracy
L1 CDBN features performed well on multiple audio recognition tasks
Hope: to inspire automatic learning of deep features
for audio data