This document discusses techniques for unsupervised feature learning from unlabeled data using neural networks. It describes using sparse autoencoders to learn feature hierarchies in an unsupervised manner, by training networks to reconstruct their inputs while enforcing sparsity constraints. Convolutional deep belief networks are also discussed as a method for hierarchical probabilistic modeling of audio, images, and video. The document concludes that unsupervised feature learning has achieved state-of-the-art results on tasks such as object classification, activity recognition, and speech processing.
7. Self-taught learning. Sparse coding, LCC, etc. learn bases φ1, …, φk from unlabeled data. Use the learned φ1, …, φk to represent the training/test sets: approximate each input as a combination of the bases, giving sparse activations a1, …, ak as features. If the labeled training set is small, this can give a huge performance boost. (Example task: Car vs. Motorcycle.)
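The self-taught learning pipeline above can be sketched in NumPy. This is a minimal illustration, not the talk's actual implementation: it assumes the bases φ1, …, φk have already been learned (here a random unit-norm matrix `phi` stands in for them) and computes the sparse activations a1, …, ak with a few ISTA (iterative soft-thresholding) steps, a standard way to solve the sparse-coding encoding problem:

```python
import numpy as np

def sparse_code(x, phi, lam=0.1, n_iter=50):
    """Encode x as sparse activations a with x ~= phi @ a, via ISTA steps
    on 0.5*||x - phi a||^2 + lam*||a||_1."""
    a = np.zeros(phi.shape[1])
    L = np.linalg.norm(phi, 2) ** 2              # Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = phi.T @ (phi @ a - x)             # gradient of the reconstruction term
        z = a - grad / L
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return a

rng = np.random.default_rng(0)
phi = rng.standard_normal((20, 8))               # stand-in for 8 learned bases
phi /= np.linalg.norm(phi, axis=0)               # unit-norm columns
x = 1.5 * phi[:, 2] + 0.01 * rng.standard_normal(20)
a = sparse_code(x, phi)
# a = [a1, ..., ak] is the new feature vector fed to the supervised learner
```

The activations `a` would then be the features for the labeled Car/Motorcycle classifier.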
12. Logistic regression. Logistic regression has a learned parameter vector θ. On input x, it outputs hθ(x) = 1 / (1 + exp(−θᵀx)). Draw a logistic regression unit as: inputs x1, x2, x3 and a +1 bias unit, all feeding a single sigmoid output.
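The unit in the diagram can be written directly from the formula. A minimal sketch (the parameter values below are hypothetical):

```python
import numpy as np

def logistic_unit(x, theta):
    """Logistic regression unit: 1 / (1 + exp(-theta^T [x; 1])).
    The trailing 1 plays the role of the '+1' bias input in the diagram."""
    z = theta @ np.append(x, 1.0)
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])     # inputs x1, x2, x3
theta = np.zeros(4)                # illustrative all-zero parameters
out = logistic_unit(x, theta)      # zero weights give output 0.5
```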
14. Neural Network. Example 4-layer network with 2 output units: inputs x1, x2, x3 plus a +1 bias unit (Layer 1), two hidden layers each with a +1 bias unit (Layers 2 and 3), and an output layer (Layer 4).
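The forward pass of such a network is just repeated sigmoid layers. A minimal sketch, with hypothetical layer sizes matching the slide (3 inputs, two hidden layers, 2 outputs) and random illustrative weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights):
    """Forward pass through a fully connected net; each W maps
    [activations; 1] (the +1 bias unit) to the next layer."""
    a = x
    for W in weights:
        a = sigmoid(W @ np.append(a, 1.0))
    return a

rng = np.random.default_rng(0)
sizes = [3, 4, 4, 2]                  # Layer 1 .. Layer 4
weights = [0.1 * rng.standard_normal((m, n + 1))
           for n, m in zip(sizes, sizes[1:])]
y = forward(np.array([1.0, 0.0, -1.0]), weights)   # the 2 output units
```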
18. Unsupervised feature learning with a neural network. Training a sparse autoencoder: given an unlabeled training set x1, x2, …, minimize a reconstruction error term plus an L1 sparsity term on the hidden activations a1, a2, a3.
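The objective on the slide, reconstruction error plus an L1 penalty on the hidden activations, can be sketched as follows. This is an illustrative formulation (sigmoid encoder, linear decoder), not necessarily the exact parameterization used in the talk:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sae_loss(W1, b1, W2, b2, X, lam=0.1):
    """Sparse autoencoder objective:
    sum_i ||x_i - xhat_i||^2  +  lam * sum_{i,j} |a_ij|."""
    A = sigmoid(X @ W1 + b1)       # hidden activations a for each input row
    Xhat = A @ W2 + b2             # linear reconstruction of the input
    return np.sum((X - Xhat) ** 2) + lam * np.sum(np.abs(A))

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))    # 4 unlabeled examples x1..x4
W1, b1 = np.zeros((3, 2)), np.zeros(2)
W2, b2 = np.zeros((2, 3)), np.zeros(3)
loss = sae_loss(W1, b1, W2, b2, X)
```

Minimizing this loss over (W1, b1, W2, b2), e.g. by gradient descent, yields the sparse features.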
19. Unsupervised feature learning with a neural network. [Figure: a 3-layer autoencoder. Layer 1: inputs x1, …, x6 plus a +1 bias unit; Layer 2: hidden units a1, a2, a3 plus a +1 bias unit; Layer 3: reconstruction of x1, …, x6.]
20. Unsupervised feature learning with a neural network. [Figure: the encoder half of the autoencoder; the hidden activations a1, a2, a3 form a new representation for the input x1, …, x6.]
22. Unsupervised feature learning with a neural network. [Figure: a second autoencoder trained on the activations a1, a2, a3, with hidden units b1, b2, b3.] Train the parameters so that the reconstruction matches (a1, a2, a3), subject to the bi's being sparse.
25. Unsupervised feature learning with a neural network. [Figure: the second-layer encoder; the activations b1, b2, b3 form a new representation for the input.]
27. Unsupervised feature learning with a neural network. [Figure: a third autoencoder trained on b1, b2, b3, with hidden units c1, c2, c3.]
28. Unsupervised feature learning with a neural network. [Figure: the full stack x → a → b → c.] The activations c1, c2, c3 form a new representation for the input. Use [c1, c2, c3] as the representation to feed to the learning algorithm.
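The greedy layer-wise procedure in slides 18–28 (train one autoencoder, freeze it, train the next on its activations) can be sketched end to end. This is a toy stand-in: a plain autoencoder without the sparsity penalty, trained by gradient descent, with hypothetical layer sizes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, n_steps=200, lr=0.1, seed=0):
    """One-hidden-layer autoencoder (sigmoid encoder, linear decoder),
    trained by gradient descent on the squared reconstruction error."""
    rng = np.random.default_rng(seed)
    W1 = 0.1 * rng.standard_normal((X.shape[1], n_hidden))
    W2 = 0.1 * rng.standard_normal((n_hidden, X.shape[1]))
    for _ in range(n_steps):
        A = sigmoid(X @ W1)                    # encoder
        err = A @ W2 - X                       # reconstruction error
        gW2 = A.T @ err / len(X)
        gA = err @ W2.T * A * (1 - A)          # backprop through the sigmoid
        gW1 = X.T @ gA / len(X)
        W1 -= lr * gW1
        W2 -= lr * gW2
    return W1

def stack_features(X, layer_sizes):
    """Greedy layer-wise training: train a layer, freeze it, train the
    next layer on its activations; return the top-layer representation."""
    H = X
    for n_hidden in layer_sizes:
        W = train_autoencoder(H, n_hidden)
        H = sigmoid(H @ W)                     # new representation for the input
    return H

rng = np.random.default_rng(1)
X = rng.standard_normal((32, 6))
C = stack_features(X, [5, 4, 3])               # x -> a -> b -> c, as in the slides
# C plays the role of [c1, c2, c3]: the features fed to the learning algorithm
```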
30. Restricted Boltzmann machine (RBM). Input [x1, x2, x3, x4]; Layer 2 [a1, a2, a3] (binary-valued). An MRF with joint distribution P(x, a) = exp(xᵀWa + bᵀx + cᵀa) / Z. Use Gibbs sampling for inference. Given observed inputs x, want the maximum likelihood estimate of the parameters: maximize Σ log P(x).
31. Restricted Boltzmann machine (RBM). Gradient ascent on log P(x): ∂ log P(x) / ∂Wij = [xi aj]obs − [xi aj]prior, where [xi aj]obs comes from fixing x to its observed value and sampling a from P(a|x), and [xi aj]prior comes from running Gibbs sampling to convergence. Adding a sparsity constraint on the ai's usually improves results.
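In practice the [xi aj]prior term is rarely computed by running Gibbs sampling to convergence; contrastive divergence (CD-1) approximates it with a single Gibbs step. A minimal sketch for a binary RBM, with biases omitted for brevity (an assumption for illustration, not the talk's exact training code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(W, X, lr=0.05, rng=None):
    """One CD-1 step for a binary RBM with weight matrix W (visible x hidden).
    CD-1 replaces the intractable [x_i a_j]_prior statistics with statistics
    from a single Gibbs step started at the data."""
    rng = rng or np.random.default_rng(0)
    # Positive phase: fix x to the data, sample a ~ P(a|x).
    pa = sigmoid(X @ W)
    A = (rng.random(pa.shape) < pa).astype(float)
    # Negative phase: one Gibbs step x' ~ P(x|a), then P(a|x').
    px = sigmoid(A @ W.T)
    Xn = (rng.random(px.shape) < px).astype(float)
    pan = sigmoid(Xn @ W)
    # Gradient ascent on log P(x): [x_i a_j]_obs - [x_i a_j]_recon.
    grad = (X.T @ pa - Xn.T @ pan) / len(X)
    return W + lr * grad

rng = np.random.default_rng(0)
X = (rng.random((10, 4)) < 0.5).astype(float)   # binary inputs [x1..x4]
W = np.zeros((4, 3))                            # 4 visible units, 3 hidden units
W = cd1_update(W, X)
```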
33. Deep Belief Network. A stack of RBMs: input [x1, x2, x3, x4]; Layer 2 [a1, a2, a3]; Layer 3 [b1, b2, b3]; Layer 4 [c1, c2, c3].
37. Probabilistic max pooling. Convolutional neural net: max{x1, x2, x3, x4}, where the xi are real numbers. Convolutional DBN: max{x1, x2, x3, x4}, where the xi are {0, 1} and mutually exclusive. Thus there are 5 possible cases: (x1, x2, x3, x4) = 0000, 1000, 0100, 0010, or 0001. This collapses 2^n configurations into n+1 configurations and permits bottom-up and top-down inference.
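Because only n+1 configurations are possible, the posterior over a pooling block can be computed exactly in closed form. A minimal sketch, assuming each detection unit k receives a bottom-up input Ik and the all-off configuration has energy 0 (the standard convolutional-DBN formulation):

```python
import numpy as np

def prob_max_pool(I):
    """Probabilistic max pooling over one block of detection units.
    I holds the bottom-up inputs I_1..I_n; the 2^n joint configurations
    collapse to n+1: 'unit k on' (k = 1..n) or 'all off'.
    Returns (P(unit k on) for each k, P(pooling unit off))."""
    e = np.exp(I - np.max(I))          # numerically stabilized exp(I_k)
    z0 = np.exp(-np.max(I))            # weight of the all-off configuration
    Z = z0 + e.sum()                   # partition function over n+1 cases
    return e / Z, z0 / Z

p_on, p_off = prob_max_pool(np.array([1.0, -0.5, 0.3, 0.0]))
# p_on sums with p_off to 1: a proper distribution over the 5 cases
```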
41. Convolutional DBN for Images. Input data V: visible nodes (binary or real). Detection layer H: hidden nodes (binary), computed with shared "filter" weights W^k. Max-pooling layer P: binary "max-pooling" nodes; within each pooling region, at most one hidden node is active.
42. Convolutional DBN on face images. The learned hierarchy: pixels → edges → object parts (combinations of edges) → object models. Note: sparsity is important for these results.
43. Learning of object parts. Examples of learned object parts from object categories: faces, cars, elephants, chairs.
44. Training on multiple objects. Trained on 4 classes (cars, faces, motorbikes, airplanes). Plot of the conditional entropy H(class | neuron active). Second layer: shared features and object-specific features. Third layer: more specific features. [Figures: second- and third-layer bases learned from the 4 object categories.]
45. Hierarchical probabilistic inference. Combine bottom-up and top-down inference. Generating posterior samples from faces by "filling in" experiments (cf. Lee and Mumford, 2003): input images; samples from feedforward inference (control); samples from full posterior inference.
50. State-of-the-art task performance.

Audio:
  TIMIT phone classification: prior art (Clarkson et al., 1999) 79.6%; Stanford feature learning 80.3%
  TIMIT speaker identification: prior art (Reynolds, 1995) 99.7%; Stanford feature learning 100.0%
Images:
  CIFAR object classification: prior art (Yu and Zhang, 2010) 74.5%; Stanford feature learning 75.5%
  NORB object classification: prior art (Ranzato et al., 2009) 94.4%; Stanford feature learning 96.2%
Multimodal (audio/video):
  AVLetters lip reading: prior art (Zhao et al., 2009) 58.9%; Stanford feature learning 63.1%
Video:
  UCF activity classification: prior art (Kalser et al., 2008) 86%; Stanford feature learning 87%
  Hollywood2 classification: prior art (Laptev, 2004) 47%; Stanford feature learning 50%
Editor's Notes
Sometimes, most data wins. So, how do we get more data? Even with AMT (Amazon Mechanical Turk), collecting labels is often slow and expensive.
End: one of the challenges is scaling up. Most people work with images from 14x14 up to 32x32 pixels.
Time-invariant features
Visual bases: look at them and see if they make sense / correspond to Gabor filters. Try to perform a similar analysis on the audio bases.
http://www.cbsnews.com/stories/2000/06/29/tech/main210684.shtml: 12.3 Tflops, $110 million, used to simulate nuclear weapon testing. Comparable to 13 graphics cards costing $250 each; 40 people with a US$250 graphics card would match #18 on the top-supercomputers list from 2 years back. http://www.top500.org/list/2006/11/100