A Tutorial on Deep Learning at ICML 2013: Presentation Transcript

    • Deep Learning Tutorial, ICML, Atlanta, 2013-06-16.
      Yann LeCun, Center for Data Science & Courant Institute, NYU, yann@cs.nyu.edu, http://yann.lecun.com
      Marc'Aurelio Ranzato, Google, ranzato@google.com, http://www.cs.toronto.edu/~ranzato
    • Deep Learning = Learning Representations/Features. The traditional model of pattern recognition (since the late 50s): fixed/engineered features (or fixed kernel) + trainable classifier. End-to-end learning / feature learning / deep learning: trainable features (or kernel) + trainable classifier. (Diagram: hand-crafted feature extractor + "simple" trainable classifier, versus trainable feature extractor + trainable classifier.)
    • This Basic Model has not evolved much since the 50s. The first learning machine: the Perceptron, built at Cornell in 1960. The Perceptron was a linear classifier on top of a simple feature extractor: y = sign(Σ_{i=1..N} W_i F_i(X) + b). The vast majority of practical applications of ML today use glorified linear classifiers or glorified template matching. Designing a feature extractor requires considerable effort by experts.
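      As a minimal illustration of this fixed-features-plus-linear-classifier setup (an illustrative sketch, not code from the tutorial; the toy data, feature map, and names are made up), here is a numpy perceptron on top of a hand-crafted feature extractor F(x):

```python
import numpy as np

def features(x):
    # hand-crafted (fixed) feature extractor: the raw inputs plus a product term
    return np.array([x[0], x[1], x[0] * x[1]])

def train_perceptron(X, y, epochs=20):
    # y in {-1, +1}; (w, b) is the linear classifier on top of the fixed features
    w, b = np.zeros(3), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            f = features(xi)
            if yi * (np.dot(w, f) + b) <= 0:   # misclassified (or on the boundary): perceptron update
                w += yi * f
                b += yi
    return w, b

# toy data: labels follow the sign of x0*x1, not linearly separable in the raw (x0, x1) plane
X = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
y = np.array([1, -1, -1, 1])

w, b = train_perceptron(X, y)
preds = np.sign([np.dot(w, features(x)) + b for x in X])
print(preds)   # matches y: the hand-crafted product feature makes the problem linearly separable
```

      The point of the sketch is the division of labor: all the "intelligence" is in the fixed feature map, while the trainable part is only the linear decision on top of it.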
    • Architecture of "Mainstream" Pattern Recognition Systems. Modern architecture for pattern recognition. Speech recognition: early 90s – 2011. Object recognition: 2006 – 2012. (Pipeline: fixed low-level features → unsupervised mid-level features → supervised classifier; e.g. MFCC → mixture of Gaussians → classifier, or SIFT/HoG → K-means / sparse coding + pooling → classifier.)
    • Deep Learning = Learning Hierarchical Representations. It's deep if it has more than one stage of non-linear feature transformation. (Pipeline: low-level features → mid-level features → high-level features → trainable classifier.) Feature visualization of a convolutional net trained on ImageNet, from [Zeiler & Fergus 2013].
    • Trainable Feature Hierarchy. Hierarchy of representations with increasing level of abstraction; each stage is a kind of trainable feature transform. Image recognition: pixel → edge → texton → motif → part → object. Text: character → word → word group → clause → sentence → story. Speech: sample → spectral band → sound → ... → phone → phoneme → word.
    • Learning Representations: a challenge for ML, CV, AI, Neuroscience, Cognitive Science... How do we learn representations of the perceptual world? How can a perceptual system build itself by looking at the world? How much prior structure is necessary? ML/AI: how do we learn features or feature hierarchies? What is the fundamental principle? What is the learning algorithm? What is the architecture? Neuroscience: how does the cortex learn perception? Does the cortex "run" a single, general learning algorithm (or a small number of them)? CogSci: how does the mind learn abstract concepts on top of less abstract ones? Deep Learning addresses the problem of learning hierarchical representations with a single algorithm. (Diagram: a stack of trainable feature transforms.)
    • The Mammalian Visual Cortex is Hierarchical. The ventral (recognition) pathway in the visual cortex has multiple stages: Retina - LGN - V1 - V2 - V4 - PIT - AIT... Lots of intermediate representations. [Picture from Simon Thorpe] [Gallant & Van Essen]
    • Let's be inspired by nature, but not too much. It's nice to imitate Nature, but we also need to understand. How do we know which details are important? Which details are merely the result of evolution and the constraints of biochemistry? For airplanes, we developed aerodynamics and compressible fluid dynamics. We figured out that feathers and wing flapping weren't crucial. QUESTION: What is the equivalent of aerodynamics for understanding intelligence? L'Avion III de Clément Ader, 1897 (Musée du CNAM, Paris). His Éole took off from the ground in 1890, 13 years before the Wright Brothers, but you probably never heard of it.
    • Trainable Feature Hierarchies: End-to-end learning. A hierarchy of trainable feature transforms. Each module transforms its input representation into a higher-level one. High-level features are more global and more invariant; low-level features are shared among categories. (Pipeline: trainable feature transform → trainable feature transform → trainable classifier/predictor, with learned internal representations.) How can we make all the modules trainable and get them to learn appropriate representations?
    • Three Types of Deep Architectures. Feed-forward: multilayer neural nets, convolutional nets. Feed-back: stacked sparse coding, deconvolutional nets. Bi-directional: deep Boltzmann machines, stacked auto-encoders.
    • Three Types of Training Protocols. Purely supervised: initialize parameters randomly; train in supervised mode, typically with SGD, using backprop to compute gradients; used in most practical systems for speech and image recognition. Unsupervised, layerwise + supervised classifier on top: train each layer unsupervised, one after the other; train a supervised classifier on top, keeping the other layers fixed; good when very few labeled samples are available. Unsupervised, layerwise + global supervised fine-tuning: train each layer unsupervised, one after the other; add a classifier layer, and retrain the whole thing supervised; good when the label set is poor (e.g. pedestrian detection). Unsupervised pre-training often uses regularized auto-encoders.
    • Do we really need deep architectures? The theoretician's dilemma: "We can approximate any function as close as we want with a shallow architecture. Why would we need deep ones?" Kernel machines (and 2-layer neural nets) are "universal". Deep learning machines: deep machines are more efficient for representing certain classes of functions, particularly those involved in visual recognition; they can represent more complex functions with less "hardware". We need an efficient parameterization of the class of functions that are useful for "AI" tasks (vision, audition, NLP...).
    • Why would deep architectures be more efficient? A deep architecture trades space for time (or breadth for depth): more layers (more sequential computation), but less hardware (less parallel computation). Example 1: N-bit parity requires N-1 XOR gates in a tree of depth log(N) (even easier if we use threshold gates), but requires an exponential number of gates if we restrict ourselves to 2 layers (a DNF formula with an exponential number of minterms); see the sketch after this slide. Example 2: a circuit for adding two N-bit binary numbers requires O(N) gates and O(N) layers using N one-bit adders with ripple carry propagation, but requires lots of gates (some polynomial in N) if we restrict ourselves to two layers (e.g. Disjunctive Normal Form). Bad news: almost all boolean functions have a DNF formula with an exponential number of minterms, O(2^N). [Bengio & LeCun 2007, "Scaling Learning Algorithms Towards AI"]
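      To make Example 1 concrete, here is a small Python sketch (an illustration, not from the tutorial) contrasting the depth-log(N) XOR tree, which uses N-1 gates, with the number of AND terms a flat two-layer DNF realization of parity needs:

```python
import itertools

def parity_tree(bits):
    """N-bit parity via a balanced tree of 2-input XOR gates: N-1 gates, depth ~log2(N)."""
    layer, depth = list(bits), 0
    while len(layer) > 1:
        nxt = [layer[i] ^ layer[i + 1] for i in range(0, len(layer) - 1, 2)]  # one XOR gate each
        if len(layer) % 2:
            nxt.append(layer[-1])          # odd element passes through to the next level
        layer, depth = nxt, depth + 1
    return layer[0], depth

def parity_dnf_terms(n):
    """A two-layer (DNF) realization needs one AND term per odd-weight input pattern: 2^(n-1) terms."""
    return sum(1 for bits in itertools.product([0, 1], repeat=n) if sum(bits) % 2 == 1)

bits = [1, 0, 1, 1, 0, 1, 0, 0]
value, depth = parity_tree(bits)
print(value, depth)            # parity of the 8 bits, computed at depth 3 with 7 gates
print(parity_dnf_terms(8))     # 128 = 2^7 minterms for the flat, 2-layer version
```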
    • Which Models are Deep? 2-layer models are not deep (even if you train the first layer), because there is no feature hierarchy. Neural nets with 1 hidden layer are not deep. SVMs and kernel methods are not deep: layer 1: kernels; layer 2: linear. The first layer is "trained" with the simplest unsupervised method ever devised: using the samples as templates for the kernel functions. Classification trees are not deep: no hierarchy of features; all decisions are made in the input space.
    • Are Graphical Models Deep? There is no opposition between graphical models and deep learning. Many deep learning models are formulated as factor graphs; some graphical models use deep architectures inside their factors. Graphical models can be deep (but most are not). Factor graph: a sum of energy functions over inputs X, outputs Y and latent variables Z, with trainable parameters W:
      −log P(X, Y, Z | W) ∝ E(X, Y, Z, W) = Σ_i E_i(X, Y, Z, W_i)
      Each energy function can contain a deep network; the whole factor graph can be seen as a deep network. (Diagram: example factor graph with factors E1(X1,Y1), E2(X2,Z1,Z2), E3(Z2,Y1), E4(Y3,Y4) over variables X1, X2, Z1, Z2, Z3, Y1, Y2.)
    • Deep Learning: A Theoretician's Nightmare? Deep Learning involves non-convex loss functions. With non-convex losses, all bets are off. Then again, every speech recognition system ever deployed has used non-convex optimization (GMMs are non-convex). But to some of us, all "interesting" learning is non-convex: convex learning is invariant to the order in which samples are presented (it only depends on asymptotic sample frequencies). Human learning isn't like that: we learn simple concepts before complex ones; the order in which we learn things matters.
    • Deep Learning: A Theoretician's Nightmare? No generalization bounds? Actually, the usual VC bounds apply: most deep learning systems have a finite VC dimension. We don't have tighter bounds than that. But then again, how many bounds are tight enough to be useful for model selection? It's hard to prove anything about deep learning systems. Then again, if we only study models for which we can prove things, we wouldn't have speech, handwriting, and visual object recognition systems today.
    • Deep Learning: A Theoretician's Paradise? Deep Learning is about representing high-dimensional data; there have to be interesting theoretical questions there. What is the geometry of natural signals? Is there an equivalent of statistical learning theory for unsupervised learning? What are good criteria on which to base unsupervised learning? Deep learning systems are a form of latent variable factor graph: internal representations can be viewed as latent variables to be inferred, and deep belief networks are a particular type of latent variable model. The most interesting deep belief nets have intractable loss functions: how do we get around that problem? Lots of theory at the 2012 IPAM summer school on deep learning: Wright's parallel SGD methods, Mallat's "scattering transform", Osher's "split Bregman" methods for sparse modeling, Morton's "algebraic geometry of DBN", ...
    • Deep Learning and Feature Learning Today. Deep Learning has been the hottest topic in speech recognition in the last 2 years: a few long-standing performance records were broken with deep learning methods; Microsoft and Google have both deployed DL-based speech recognition systems in their products; Microsoft, Google, IBM, Nuance, AT&T, and all the major academic and industrial players in speech recognition have projects on deep learning. Deep Learning is the hottest topic in computer vision: feature engineering is the bread-and-butter of a large portion of the CV community, which creates some resistance to feature learning; but the record holders on ImageNet and semantic segmentation are convolutional nets. Deep Learning is becoming hot in natural language processing. Deep learning / feature learning in applied mathematics: the connection with applied math is through sparse coding, non-convex optimization, stochastic gradient algorithms, etc.
    • In Many Fields, Feature Learning Has Caused a Revolution (methods used in commercially deployed systems). Speech recognition I (late 1980s): trained mid-level features with Gaussian mixtures (2-layer classifier). Handwriting recognition and OCR (late 1980s to mid 1990s): supervised convolutional nets operating on pixels. Face & people detection (early 1990s to mid 2000s): supervised convolutional nets operating on pixels (YLC 1994, 2004, Garcia 2004); Haar feature generation/selection (Viola-Jones 2001). Object recognition I (mid-to-late 2000s: Ponce, Schmid, Yu, YLC...): trainable mid-level features (K-means or sparse coding). Low-res object recognition: road signs, house numbers (early 2010s): supervised convolutional nets operating on pixels. Speech recognition II (circa 2011): deep neural nets for acoustic modeling. Object recognition III, semantic labeling (2012, Hinton, YLC, ...): supervised convolutional nets operating on pixels.
    • (Diagram: models arranged along a SHALLOW → DEEP axis: Perceptron, GMM, Bayes NP, SVM, decision tree, boosting, sparse coding, RBM, AE, D-AE, DBN, DBM, Neural Net, Conv. Net, RNN.)
    • (Same SHALLOW → DEEP diagram, with a second distinction between neural-network models and probabilistic models.)
    • (Same diagram, additionally marking which models are trained supervised and which unsupervised.)
    • (Same diagram.) In this talk, we'll focus on the simplest and typically most effective methods.
    • What Are Good Features?
    • Discovering the Hidden Structure in High-Dimensional Data: the manifold hypothesis. Learning representations of data: discovering & disentangling the independent explanatory factors. The manifold hypothesis: natural data lives in a low-dimensional (non-linear) manifold, because variables in natural data are mutually dependent.
    • Discovering the Hidden Structure in High-Dimensional Data. Example: all face images of a person. 1000x1000 pixels = 1,000,000 dimensions. But the face has 3 Cartesian coordinates and 3 Euler angles, and humans have less than about 50 muscles in the face; hence the manifold of face images for a person has <56 dimensions. The perfect representation of a face image: its coordinates on the face manifold, and its coordinates away from the manifold. We do not have good and general methods to learn functions that turn an image into this kind of representation. (Diagram: an ideal feature extractor maps the image to a code like [1.2, −3, 0.2, −2, ...] encoding face/not face, pose, lighting, expression, ...)
    • Disentangling factors of variation. (Diagram: the ideal disentangling feature extractor maps pixels 1...n to separate "expression" and "view" coordinates.)
    • Data Manifold & Invariance: some variations must be eliminated. (Figure: azimuth-elevation manifold, ignoring lighting. [Hadsell et al. CVPR 2006])
    • Basic Idea for Invariant Feature Learning. Embed the input non-linearly into a high(er) dimensional space: in the new space, things that were non-separable may become separable. Pool regions of the new space together: bringing together things that are semantically similar, like pooling. (Pipeline: input → non-linear function → high-dimensional, unstable/non-smooth features → pooling or aggregation → stable/invariant features.)
    • Non-Linear Expansion → Pooling. (Diagram: entangled data manifolds → non-linear dimension expansion / disentangling → pooling / aggregation.)
    • Sparse Non-Linear Expansion → Pooling. Use clustering to break things apart, pool together similar things. (Diagram: clustering / quantization / sparse coding → pooling / aggregation.)
    • Overall Architecture: Normalization → Filter Bank → Non-Linearity → Pooling. Stacking multiple stages of [Normalization → Filter Bank → Non-Linearity → Pooling]. Normalization: variations on whitening; subtractive: average removal, high-pass filtering; divisive: local contrast normalization, variance normalization. Filter bank: dimension expansion, projection on an overcomplete basis. Non-linearity: sparsification, saturation, lateral inhibition...; rectification (ReLU), component-wise shrinkage, tanh, winner-takes-all. Pooling: aggregation over space or feature type: X_i;  L_p: (Σ_i X_i^p)^{1/p};  PROB: (1/b)·log(Σ_i e^{b X_i}). (Pipeline: [Norm → Filter Bank → Non-Linear → feature Pooling] repeated, then a classifier.) A sketch of one such stage in code follows this slide.
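      To make the Norm → Filter Bank → Non-Linearity → Pooling stage concrete, here is a small numpy sketch (illustrative only: random filters, and the normalization here is global over the patch rather than truly local contrast normalization):

```python
import numpy as np

def contrast_normalize(x, eps=1e-5):
    # subtractive + divisive normalization over the whole patch (a crude stand-in for local normalization)
    x = x - x.mean()
    return x / (x.std() + eps)

def conv2d_valid(x, k):
    # plain "valid" 2D correlation, enough for a toy filter bank
    H, W = x.shape; h, w = k.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+h, j:j+w] * k)
    return out

def lp_pool(fmap, p=2, size=2):
    # non-overlapping Lp pooling over size x size windows: (sum |x|^p)^(1/p)
    H, W = fmap.shape
    fmap = np.abs(fmap[:H - H % size, :W - W % size])
    blocks = fmap.reshape(H // size, size, W // size, size)
    return (blocks ** p).sum(axis=(1, 3)) ** (1.0 / p)

rng = np.random.default_rng(0)
image = rng.standard_normal((16, 16))
filters = rng.standard_normal((4, 3, 3))           # a tiny random "filter bank"

x = contrast_normalize(image)                      # Normalization
maps = [conv2d_valid(x, k) for k in filters]       # Filter bank (dimension expansion)
maps = [np.maximum(0.0, m) for m in maps]          # Non-linearity (ReLU)
pooled = [lp_pool(m, p=2, size=2) for m in maps]   # L2 pooling
print(pooled[0].shape)                             # (7, 7)
```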
    • Deep Supervised Learning (modular approach)
    • Multimodule Systems: Cascade. Complex learning machines can be built by assembling modules into networks. Simple example: sequential/layered feed-forward architecture (cascade). Forward propagation.
    • Multimodule Systems: Implementation. Each module is an object: contains trainable parameters; inputs are arguments; output is returned, but also stored internally. Example: 2 modules m1, m2.
      Torch7 (by hand):
        hid = m1:forward(in)
        out = m2:forward(hid)
      Torch7 (using the nn.Sequential class):
        model = nn.Sequential()
        model:add(m1)
        model:add(m2)
        out = model:forward(in)
    • Computing the Gradient in Multi-Layer Systems
    • Computing the Gradient in Multi-Layer Systems
    • Computing the Gradient in Multi-Layer Systems
    • Jacobians and Dimensions
    • Back Propagation
    • Multimodule Systems: Implementation. Backpropagation through a module: contains trainable parameters; inputs are arguments; the gradient with respect to the input is returned; the arguments are the input and the gradient with respect to the output.
      Torch7 (by hand):
        hidg = m2:backward(hid,outg)
        ing = m1:backward(in,hidg)
      Torch7 (using the nn.Sequential class):
        ing = model:backward(in,outg)
    • Linear Module
    • Tanh Module (or any other pointwise function)
    • Euclidean Distance Module
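      The three module slides above each define a forward map and its backward (Jacobian-transpose) map. Here is a compact, self-contained numpy sketch of that module interface (an illustration in the spirit of the Torch7 snippets, not Torch itself; all class names are made up):

```python
import numpy as np

class Linear:
    def __init__(self, n_in, n_out, rng):
        self.W = rng.standard_normal((n_out, n_in)) * 0.1
        self.b = np.zeros(n_out)
    def forward(self, x):
        self.x = x
        return self.W @ x + self.b
    def backward(self, grad_out):
        # store parameter gradients, return gradient w.r.t. the input
        self.gW, self.gb = np.outer(grad_out, self.x), grad_out
        return self.W.T @ grad_out

class Tanh:
    def forward(self, x):
        self.y = np.tanh(x)
        return self.y
    def backward(self, grad_out):
        return grad_out * (1.0 - self.y ** 2)   # pointwise Jacobian

class EuclideanLoss:
    def forward(self, y, target):
        self.diff = y - target
        return 0.5 * np.sum(self.diff ** 2)
    def backward(self):
        return self.diff

class Sequential:
    def __init__(self, *modules):
        self.modules = list(modules)
    def forward(self, x):
        for m in self.modules:
            x = m.forward(x)
        return x
    def backward(self, grad_out):
        for m in reversed(self.modules):
            grad_out = m.backward(grad_out)
        return grad_out

rng = np.random.default_rng(0)
model = Sequential(Linear(4, 8, rng), Tanh(), Linear(8, 2, rng))
loss_fn = EuclideanLoss()

x, target = rng.standard_normal(4), np.array([0.5, -0.5])
loss = loss_fn.forward(model.forward(x), target)
model.backward(loss_fn.backward())              # fills .gW / .gb in each Linear module
print(loss, model.modules[0].gW.shape)          # scalar loss, (8, 4)
```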
    • Any Architecture Works. Any connection is permissible: networks with loops must be "unfolded in time". Any module is permissible: as long as it is continuous and differentiable almost everywhere with respect to the parameters, and with respect to non-terminal inputs.
    • Module-Based Deep Learning with Torch7. Torch7 is based on the Lua language: a simple and lightweight scripting language, dominant in the game industry; has a native just-in-time compiler (fast!); has a simple foreign function interface to call C/C++ functions from Lua. Torch7 is an extension of Lua with: a multidimensional array engine with CUDA and OpenMP backends; a machine learning library that implements multilayer nets, convolutional nets, unsupervised pre-training, etc.; various libraries for data/image manipulation and computer vision; a quickly growing community of users. Single-line installation on Ubuntu and Mac OS X:
        curl -s https://raw.github.com/clementfarabet/torchinstall/master/install | bash
      Torch7 machine learning tutorial (neural net, convnet, sparse auto-encoder): http://code.cogbits.com/wiki/doku.php
    • Example: building a Neural Net in Torch7. Net for SVHN digit recognition: 10 categories; input is 32x32 RGB (3 channels); 1500 hidden units. Creating a 2-layer net: make a cascade module; reshape input to a vector; add Linear module; add Tanh module; add Linear module; add log-softmax layer; create loss function module.
        noutputs = 10; nfeats = 3; width = 32; height = 32
        ninputs = nfeats*width*height
        nhiddens = 1500
        -- Simple 2-layer neural network
        model = nn.Sequential()
        model:add(nn.Reshape(ninputs))
        model:add(nn.Linear(ninputs,nhiddens))
        model:add(nn.Tanh())
        model:add(nn.Linear(nhiddens,noutputs))
        model:add(nn.LogSoftMax())
        criterion = nn.ClassNLLCriterion()
      See the Torch7 example at http://bit.ly/16tyLAx
    • Example: training a Neural Net in Torch7. One epoch over the training set: get the next batch of samples; create a "closure" feval(x) that takes the parameter vector as argument and returns the loss and its gradient on the batch; run the model on the batch; backprop; normalize by the size of the batch; return loss and gradient; call the stochastic gradient optimizer.
        for t = 1,trainData:size(),batchSize do
          inputs,targets = getNextBatch()
          local feval = function(x)
            parameters:copy(x)
            gradParameters:zero()
            local f = 0
            for i = 1,#inputs do
              local output = model:forward(inputs[i])
              local err = criterion:forward(output,targets[i])
              f = f + err
              local df_do = criterion:backward(output,targets[i])
              model:backward(inputs[i], df_do)
            end
            gradParameters:div(#inputs)
            f = f/#inputs
            return f,gradParameters
          end -- of feval
          optim.sgd(feval,parameters,optimState)
        end
    • Toy Code (Matlab): Neural Net Trainer.
        % F-PROP
        for i = 1 : nr_layers - 1
          [h{i} jac{i}] = nonlinearity(W{i} * h{i-1} + b{i});
        end
        h{nr_layers-1} = W{nr_layers-1} * h{nr_layers-2} + b{nr_layers-1};
        prediction = softmax(h{nr_layers-1});
        % CROSS ENTROPY LOSS
        loss = - sum(sum(log(prediction) .* target)) / batch_size;
        % B-PROP
        dh{nr_layers-1} = prediction - target;
        for i = nr_layers - 1 : -1 : 1
          Wgrad{i} = dh{i} * h{i-1}';
          bgrad{i} = sum(dh{i}, 2);
          dh{i-1} = (W{i}' * dh{i}) .* jac{i-1};
        end
        % UPDATE
        for i = 1 : nr_layers - 1
          W{i} = W{i} - (lr / batch_size) * Wgrad{i};
          b{i} = b{i} - (lr / batch_size) * bgrad{i};
        end
    • Deep Supervised Learning is Non-Convex. Example: what is the loss function for the simplest 2-layer neural net ever? Function: a 1-1-1 neural net; map 0.5 to 0.5 and -0.5 to -0.5 (the identity function) with quadratic cost.
    • Backprop in Practice. Use ReLU non-linearities (tanh and logistic are falling out of favor). Use cross-entropy loss for classification. Use stochastic gradient descent on minibatches. Shuffle the training samples. Normalize the input variables (zero mean, unit variance). Schedule a decrease of the learning rate. Use a bit of L1 or L2 regularization on the weights (or a combination), but it's best to turn it on after a couple of epochs. Use "dropout" for regularization: Hinton et al. 2012, http://arxiv.org/abs/1207.0580. Lots more in [LeCun et al., "Efficient Backprop", 1998]. Lots, lots more in "Neural Networks, Tricks of the Trade" (2012 edition), edited by G. Montavon, G. B. Orr, and K-R Müller (Springer). A code sketch combining several of these tricks follows this slide.
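      Here is a self-contained numpy sketch (illustrative only; the toy data and all hyper-parameters are made up) that strings together several of the tricks listed above: ReLU units, softmax cross-entropy, minibatch SGD with shuffling, input normalization, a decreasing learning-rate schedule, L2 regularization switched on after two epochs, and dropout.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy 3-class data in 20 dimensions
n, d, k, h = 600, 20, 3, 64
class_means = 3.0 * rng.standard_normal((k, d))
y = rng.integers(0, k, n)
X = class_means[y] + rng.standard_normal((n, d))
X = (X - X.mean(0)) / (X.std(0) + 1e-8)          # normalize inputs: zero mean, unit variance

W1 = 0.1 * rng.standard_normal((d, h)); b1 = np.zeros(h)
W2 = 0.1 * rng.standard_normal((h, k)); b2 = np.zeros(k)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

batch, lr0, epochs, l2, p_drop = 32, 0.1, 30, 1e-4, 0.5
for epoch in range(epochs):
    lr = lr0 / (1.0 + 0.1 * epoch)               # learning-rate schedule
    wd = l2 if epoch >= 2 else 0.0               # turn on L2 only after a couple of epochs
    perm = rng.permutation(n)                    # shuffle the training samples
    for s in range(0, n, batch):
        xb, yb = X[perm[s:s + batch]], y[perm[s:s + batch]]
        a1 = np.maximum(0.0, xb @ W1 + b1)       # ReLU hidden layer
        mask = (rng.random(a1.shape) > p_drop) / (1.0 - p_drop)   # dropout (inverted scaling)
        a1d = a1 * mask
        p = softmax(a1d @ W2 + b2)
        dz2 = p.copy(); dz2[np.arange(len(yb)), yb] -= 1.0; dz2 /= len(yb)   # cross-entropy gradient
        dW2 = a1d.T @ dz2 + wd * W2; db2 = dz2.sum(0)
        da1 = (dz2 @ W2.T) * mask * (a1 > 0)
        dW1 = xb.T @ da1 + wd * W1; db1 = da1.sum(0)
        W1 -= lr * dW1; b1 -= lr * db1; W2 -= lr * dW2; b2 -= lr * db2

probs = softmax(np.maximum(0.0, X @ W1 + b1) @ W2 + b2)   # dropout is disabled at test time
print("train accuracy:", (probs.argmax(1) == y).mean())
```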
    • Deep Learning in Speech Recognition
    • Case study #1: Acoustic Modeling. A typical speech recognition system: feature extraction → neural network → decoder → transducer & language model → "Hi, how are you?"
    • Case study #1: Acoustic Modeling. A typical speech recognition system (same pipeline as above). Here, we focus only on the prediction of phone states from short time-windows of spectrogram. For simplicity, we will use a fully connected neural network (in practice, a convolutional net does better). Mohamed et al., "DBNs for phone recognition", NIPS Workshop 2009; Zeiler et al., "On rectified linear units for speech recognition", ICASSP 2013.
    • Data. US English: Voice Search, Voice Typing, read data. Billions of training samples. Input: log-energy filter bank outputs; 40 frequency bands; 26 input frames. Output: 8000 phone states. Zeiler et al., "On rectified linear units for speech recognition", ICASSP 2013.
    • Architecture. From 1 to 12 hidden layers. For simplicity, the same number of hidden units at each layer: 1040 → 2560 → 2560 → ... → 2560 → 8000. Non-linearity: rectified linear, output = max(0, input). Zeiler et al., ICASSP 2013.
    • Energy & Loss. Since it is a standard classification problem, the energy is E(x, y) = −y·f(x), with y a 1-of-N vector. The loss is the negative log-likelihood: L = E(x, y) + log Σ_{y'} exp(−E(x, y')). Zeiler et al., ICASSP 2013.
    • Optimization. SGD with a schedule on the learning rate: θ_t = θ_{t−1} − η_t ∂L/∂θ_{t−1}, with η_t = η / max(1, t/T). Mini-batches of size 40. Asynchronous SGD (using 100 copies of the network on a few hundred machines); this speeds up training at Google but it is not crucial. Zeiler et al., ICASSP 2013.
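      The two ingredients on the last two slides, the softmax negative log-likelihood built from E(x, y) = −y·f(x) and the 1/max(1, t/T) learning-rate schedule, are easy to write down directly; a small numpy sketch (illustrative, with made-up scores and constants):

```python
import numpy as np

def nll_loss_and_grad(scores, label):
    """Negative log-likelihood for E(x, y) = -scores[y]:
       L = E(x, y) + log sum_y' exp(-E(x, y')), i.e. softmax cross-entropy."""
    m = scores.max()
    log_partition = np.log(np.exp(scores - m).sum()) + m     # stable log-sum-exp
    loss = -scores[label] + log_partition
    grad = np.exp(scores - log_partition)                    # softmax probabilities
    grad[label] -= 1.0                                       # dL/dscores
    return loss, grad

def lr_schedule(eta0, t, T):
    """eta_t = eta0 / max(1, t / T), as on the optimization slide."""
    return eta0 / max(1.0, t / T)

scores = np.array([1.0, -0.5, 2.0, 0.1])    # f(x): one score per phone state (toy)
print(nll_loss_and_grad(scores, label=2))
print([round(lr_schedule(0.1, t, T=1000), 4) for t in (1, 1000, 2000, 4000)])  # 0.1, 0.1, 0.05, 0.025
```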
    • Training. Given an input mini-batch, FPROP through the stack: max(0, W_1 x) → max(0, W_2 h_1) → ... → max(0, W_n h_{n−1}) → negative log-likelihood against label y.
    • Training (FPROP, layer 1): h_1 = f(x; W_1).
    • Training (FPROP, layer 2): h_2 = f(h_1; W_2).
    • Training (FPROP, layer n): h_n = f(h_{n−1}).
    • Training (FPROP): compute the negative log-likelihood for label y.
    • Training (BPROP, top layer): ∂L/∂h_{n−1} = (∂L/∂h_n)(∂h_n/∂h_{n−1});  ∂L/∂W_n = (∂L/∂h_n)(∂h_n/∂W_n).
    • Training (BPROP, layer 2): ∂L/∂h_1 = (∂L/∂h_2)(∂h_2/∂h_1);  ∂L/∂W_2 = (∂L/∂h_2)(∂h_2/∂W_2).
    • Training (BPROP, layer 1): ∂L/∂W_1 = (∂L/∂h_1)(∂h_1/∂W_1).
    • Training (parameter update): θ ← θ − η ∂L/∂θ.
    • Training.
    • Word Error Rate. Word error rate vs. number of hidden layers (1, 2, 4, 8, 12, 16); the reported WERs include 12.8%, 11.4%, 10.9%, and 11.1%. GMM baseline: 15.4%. Zeiler et al., "On rectified linear units for speech recognition", ICASSP 2013.
    • Convolutional Networks
    • Convolutional Nets. Are deployed in many practical applications: image recognition, speech recognition, Google's and Baidu's photo taggers. Have won several competitions: ImageNet, Kaggle Facial Expression, Kaggle Multimodal Learning, German Traffic Signs, Connectomics, Handwriting... Are applicable to array data where nearby values are correlated: images, sound, time-frequency representations, video, volumetric images, RGB-Depth images... One of the few models that can be trained purely supervised. (Architecture diagram: input 83x83 → Layer 1: 64@75x75 (9x9 convolution, 64 kernels) → 10x10 pooling, 5x5 subsampling → Layer 2: 64@14x14 → 9x9 convolution (4096 kernels) → Layer 3: 256@6x6 → 6x6 pooling, 4x4 subsampling → Layer 4: 256@1x1 → output: 101.)
    • Fully-connected neural net in high dimension. Example: 200x200 image. Fully connected, 400,000 hidden units = 16 billion parameters. Locally connected, 400,000 hidden units with 10x10 fields = 40 million params. Local connections capture local dependencies.
    • Shared Weights & Convolutions: Exploiting Stationarity. Features that are useful on one part of the image are probably useful elsewhere. All units share the same set of weights. Shift-equivariant processing: when the input shifts, the output also shifts but stays otherwise unchanged. Convolution with a learned kernel (or filter), followed by a ReLU (rectified linear) non-linearity: A_ij = Σ_kl W_kl X_{i+k, j+l};  Z_ij = max(0, A_ij). The filtered "image" Z is called a feature map. Example: 200x200 image; 400,000 hidden units with 10x10 fields = 1000 params; 10 feature maps of size 200x200, 10 filters of size 10x10.
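      A tiny numpy sketch (illustrative, random filter) of the feature-map equation on this slide, which also checks the shift-equivariance claim: shifting the input by two columns shifts the ReLU feature map by the same two columns.

```python
import numpy as np

def feature_map(X, W):
    """Z_ij = max(0, sum_kl W_kl X_{i+k, j+l}): convolution (correlation) followed by ReLU."""
    H, Wd = X.shape; k = W.shape[0]
    Z = np.zeros((H - k + 1, Wd - k + 1))
    for i in range(Z.shape[0]):
        for j in range(Z.shape[1]):
            Z[i, j] = max(0.0, np.sum(W * X[i:i+k, j:j+k]))
    return Z

rng = np.random.default_rng(1)
X = rng.standard_normal((12, 12))
W = rng.standard_normal((3, 3))          # one shared 3x3 kernel for the whole image

Z = feature_map(X, W)
Z_shifted_input = feature_map(np.roll(X, shift=2, axis=1), W)

# shift equivariance: columns affected by np.roll's wrap-around are excluded from the comparison
print(np.allclose(Z[:, :8], Z_shifted_input[:, 2:]))   # True
```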
    • Multiple Convolutions with Different Kernels. Detects multiple motifs at each location. The collection of units looking at the same patch is akin to a feature vector for that patch. The result is a 3D array, where each slice is a feature map.
    • Early Hierarchical Feature Models for Vision. [Hubel & Wiesel 1962]: simple cells detect local features; complex cells "pool" the outputs of simple cells within a retinotopic neighborhood. Cognitron & Neocognitron [Fukushima 1974-1982]. (Diagram: multiple convolutions ("simple cells") followed by pooling/subsampling ("complex cells").)
    • The Convolutional Net Model (Multistage Hubel-Wiesel system). [LeCun et al. 89] [LeCun et al. 98]. Training is supervised, with stochastic gradient descent. (Diagram: multiple convolutions ("simple cells"), pooling/subsampling ("complex cells"), retinotopic feature maps.)
    • Feature Transform: Normalization → Filter Bank → Non-Linearity → Pooling. Stacking multiple stages of [Normalization → Filter Bank → Non-Linearity → Pooling]. Normalization: variations on whitening; subtractive: average removal, high-pass filtering; divisive: local contrast normalization, variance normalization. Filter bank: dimension expansion, projection on an overcomplete basis. Non-linearity: sparsification, saturation, lateral inhibition...; rectification, component-wise shrinkage, tanh, winner-takes-all. Pooling: aggregation over space or feature type, subsampling: X_i;  L_p: (Σ_i X_i^p)^{1/p};  PROB: (1/b)·log(Σ_i e^{b X_i}).
    • Feature Transform: Normalization → Filter Bank → Non-Linearity → Pooling. Filter bank → non-linearity = non-linear embedding in high dimension. Feature pooling = contraction, dimensionality reduction, smoothing. Learning the filter banks at every stage; creating a hierarchy of features. Basic elements are inspired by models of the visual (and auditory) cortex: the simple cell + complex cell model of [Hubel and Wiesel 1962]. Many "traditional" feature extraction methods are based on this: SIFT, GIST, HoG, SURF... [Fukushima 1974-1982], [LeCun 1988-now]; since the mid 2000s: Hinton, Seung, Poggio, Ng, ...
    • Convolutional Network (ConvNet). Non-linearity: half-wave rectification, shrinkage function, sigmoid. Pooling: average, L1, L2, max. Training: supervised (1988-2006), unsupervised + supervised (2006-now). (Architecture diagram: input 83x83 → Layer 1: 64@75x75 (9x9 convolution, 64 kernels) → 10x10 pooling, 5x5 subsampling → Layer 2: 64@14x14 → 9x9 convolution (4096 kernels) → Layer 3: 256@6x6 → 6x6 pooling, 4x4 subsampling → Layer 4: 256@1x1 → output: 101.)
    • Convolutional Network Architecture
    • Convolutional Network (vintage 1990). filters → tanh → average-tanh → filters → tanh → average-tanh → filters → tanh. (Figure: a curved manifold becomes a flatter manifold.)
    • "Mainstream" object recognition pipeline 2006-2012: somewhat similar to ConvNets. Fixed features + unsupervised mid-level features + simple classifier. SIFT + vector quantization + pyramid pooling + SVM [Lazebnik et al. CVPR 2006]. SIFT + local sparse coding macrofeatures + pyramid pooling + SVM [Boureau et al. ICCV 2011]. SIFT + Fisher vectors + deformable parts pooling + SVM [Perronin et al. 2012]. (Pipeline: fixed filter bank (SIFT/HoG/...) with oriented edges and winner-takes-all, histogram (sum) pooling; then an unsupervised filter bank (K-means, sparse coding), non-linearity, spatial max or average pooling; then any simple supervised classifier.)
    • Tasks for Which Deep Convolutional Nets are the Best. Handwriting recognition: MNIST (many), Arabic HWX (IDSIA). OCR in the wild [2011]: StreetView House Numbers (NYU and others). Traffic sign recognition [2011]: GTSRB competition (IDSIA, NYU). Pedestrian detection [2013]: INRIA datasets and others (NYU). Volumetric brain image segmentation [2009]: connectomics (IDSIA, MIT). Human action recognition [2011]: Hollywood II dataset (Stanford). Object recognition [2012]: ImageNet competition. Scene parsing [2012]: Stanford bgd, SiftFlow, Barcelona (NYU). Scene parsing from depth images [2013]: NYU RGB-D dataset (NYU). Speech recognition [2012]: acoustic modeling (IBM and Google). Breast cancer cell mitosis detection [2011]: MITOS (IDSIA). The list of perceptual tasks for which ConvNets hold the record is growing. Most of these tasks (but not all) use purely supervised convnets.
    • Ideas from Neuroscience and Psychophysics. The whole architecture: simple cells and complex cells. Local receptive fields. Self-similar receptive fields over the visual field (convolutions). Pooling (complex cells). Non-linearity: rectified linear units (ReLU). LGN-like band-pass filtering and contrast normalization in the input. Divisive contrast normalization (from Heeger, Simoncelli, ...). Lateral inhibition. Sparse/overcomplete representations (Olshausen-Field, ...). Inference of sparse representations with lateral inhibition. Subsampling ratios in the visual cortex: between 2 and 3 between V1-V2-V4. Crowding and visual metamers give cues on the size of the pooling areas.
    • Simple ConvNet Applications with State-of-the-Art Performance. Traffic sign recognition (GTSRB): German Traffic Sign Recognition Benchmark; 99.2% accuracy; #1: IDSIA, #2: NYU. House number recognition (Google): Street View House Numbers; 94.3% accuracy.
    • Prediction of Epilepsy Seizures from Intra-Cranial EEG. Piotr Mirowski, Deepak Mahdevan (NYU Neurology), Yann LeCun.
    • Epilepsy Prediction: Temporal Convolutional Net. (Architecture diagram: EEG channels over time (in samples) as input; feature extraction over short time windows for individual channels (we look for 10 sorts of features); then integration of all channels and all features across several time samples; inputs → outputs.)
    • ConvNet in Connectomics [Jain, Turaga, Seung 2007-present]. 3D convnet to segment volumetric images.
    • Object Recognition [Krizhevsky, Sutskever, Hinton 2012]. Won the 2012 ImageNet LSVRC. 60 million parameters, 832M MAC ops. (Architecture: CONV 11x11/ReLU 96fm → LOCAL CONTRAST NORM → MAX POOL 2x2sub → CONV 11x11/ReLU 256fm → LOCAL CONTRAST NORM → MAX POOLING 2x2sub → CONV 3x3/ReLU 384fm → CONV 3x3/ReLU 384fm → CONV 3x3/ReLU 256fm → MAX POOLING → FULL 4096/ReLU → FULL 4096/ReLU → FULL CONNECT. Per-layer parameter counts listed in the figure: 4M, 16M, 37M, 442K, 1.3M, 884K, 307K, 35K; per-layer operation counts: 4Mflop, 16M, 37M, 74M, 224M, 149M, 223M, 105M.)
    • Object Recognition: ILSVRC 2012 results. ImageNet Large Scale Visual Recognition Challenge: 1000 categories, 1.5 million labeled training samples.
    • Object Recognition [Krizhevsky, Sutskever, Hinton 2012]. Method: large convolutional net; 650K neurons, 832M synapses, 60M parameters; trained with backprop on GPU; trained "with all the tricks Yann came up with in the last 20 years, plus dropout" (Hinton, NIPS 2012); rectification, contrast normalization, ... Error rate: 15% (whenever the correct class isn't in the top 5). Previous state of the art: 25% error. A REVOLUTION IN COMPUTER VISION. Acquired by Google in Jan 2013. Deployed in Google+ Photo Tagging in May 2013.
    • Object Recognition [Krizhevsky, Sutskever, Hinton 2012].
    • Object Recognition [Krizhevsky, Sutskever, Hinton 2012]. (Figure: test image and retrieved images.)
    • ConvNet-Based Google+ Photo Tagger. Searched my personal collection for "bird". (Screenshot; Samy Bengio???)
    • Another ImageNet-trained ConvNet [Zeiler & Fergus 2013]. Convolutional net with 8 layers, input is 224x224 pixels: conv-pool-conv-pool-conv-conv-conv-full-full-full. Rectified linear units (ReLU): y = max(0, x). Divisive contrast normalization across features [Jarrett et al. ICCV 2009]. Trained on the ImageNet 2012 training set: 1.3M images, 1000 classes; 10 different crops/flips per image. Regularization: dropout [Hinton 2012] (zeroing random subsets of units). Stochastic gradient descent for 70 epochs (7-10 days), with learning rate annealing.
    • Object recognition on-line demo [Zeiler & Fergus 2013]: http://horatio.cs.nyu.edu
    • ConvNet trained on ImageNet [Zeiler & Fergus 2013].
    • Features are generic: Caltech 256. Network first trained on ImageNet; last layer chopped off. Last layer trained on Caltech 256, with the first N-1 layers kept fixed. State-of-the-art accuracy with only 6 training samples/class. (Plot references: 3: [Bo, Ren, Fox, CVPR 2013]; 16: [Sohn, Jung, Lee, Hero, ICCV 2011].)
    • Features are generic: PASCAL VOC 2012. Network first trained on ImageNet. Last layer trained on PASCAL VOC, keeping the N-1 first layers fixed. [15] K. Sande, J. Uijlings, C. Snoek, and A. Smeulders. Hybrid coding for selective search. In PASCAL VOC Classification Challenge 2012. [19] S. Yan, J. Dong, Q. Chen, Z. Song, Y. Pan, W. Xia, Z. Huang, Y. Hua, and S. Shen. Generalized hierarchical matching for sub-category aware object classification. In PASCAL VOC Classification Challenge 2012.
    • Semantic Labeling: labeling every pixel with the object it belongs to [Farabet et al. ICML 2012, PAMI 2013]. Would help identify obstacles, targets, landing sites, dangerous areas. Would help line up depth maps with edge maps.
    • Scene Parsing/Labeling: ConvNet Architecture. Each output sees a large input context: 46x46 window at full rez; 92x92 at 1/2 rez; 184x184 at 1/4 rez. [7x7 conv] → [2x2 pool] → [7x7 conv] → [2x2 pool] → [7x7 conv]. Trained supervised on fully-labeled images. (Diagram: Laplacian pyramid, level 1 features, level 2 features, upsampled level 2 features, categories.)
    • Scene Parsing/Labeling: Performance. Stanford Background Dataset [Gould 2009]: 8 categories. [Farabet et al. IEEE T. PAMI 2013]
    • Scene Parsing/Labeling: Performance. SIFT Flow dataset [Liu 2009]: 33 categories. Barcelona dataset [Tighe 2010]: 170 categories. [Farabet et al. IEEE T. PAMI 2012]
    • Scene Parsing/Labeling: SIFT Flow dataset (33 categories). Samples from the SIFT Flow dataset (Liu). [Farabet et al. ICML 2012, PAMI 2013]
    • Scene Parsing/Labeling: SIFT Flow dataset (33 categories). [Farabet et al. ICML 2012, PAMI 2013]
    • Scene Parsing/Labeling. [Farabet et al. ICML 2012, PAMI 2013]
    • Scene Parsing/Labeling. [Farabet et al. ICML 2012, PAMI 2013]
    • Scene Parsing/Labeling. [Farabet et al. ICML 2012, PAMI 2013]
    • Scene Parsing/Labeling. [Farabet et al. ICML 2012, PAMI 2013]
    • Scene Parsing/Labeling. No post-processing; frame-by-frame. The ConvNet runs at 50 ms/frame on Virtex-6 FPGA hardware, but communicating the features over Ethernet limits system performance.
    • Scene Parsing/Labeling: Temporal Consistency. Causal method for temporal consistency. [Couprie, Farabet, Najman, LeCun ICLR 2013, ICIP 2013]
    • NYU RGB-Depth Indoor Scenes Dataset. 407,024 RGB-D images of apartments; 1449 labeled frames, 894 object categories. [Silberman et al. 2012]
    • Scene Parsing/Labeling on RGB+Depth Images, with temporal consistency. [Couprie, Farabet, Najman, LeCun ICLR 2013, ICIP 2013]
    • Scene Parsing/Labeling on RGB+Depth Images, with temporal consistency. [Couprie, Farabet, Najman, LeCun ICLR 2013, ICIP 2013]
    • Semantic Segmentation on RGB+D Images and Videos. [Couprie, Farabet, Najman, LeCun ICLR 2013, ICIP 2013]
    • Energy-Based Unsupervised Learning
    • Energy-Based Unsupervised Learning. Learning an energy function (or contrast function) that takes low values on the data manifold, and higher values everywhere else. (Figure: energy surface over Y1, Y2.)
    • Capturing Dependencies Between Variables with an Energy Function. The energy surface is a "contrast function" that takes low values on the data manifold, and higher values everywhere else. Special case: energy = negative log density. Example: the samples live on the manifold Y_2 = (Y_1)^2.
    • Transforming Energies into Probabilities (if necessary). The energy can be interpreted as an unnormalized negative log density. Gibbs distribution: probability proportional to exp(−β·energy); the beta parameter is akin to an inverse temperature. Don't compute probabilities unless you absolutely have to, because the denominator (the partition function) is often intractable. (Figures: E(Y, W) and the corresponding P(Y | W).)
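      A tiny numpy illustration (not from the slides; the two-well energy is made up) of the Gibbs construction: discretize Y, exponentiate the negated energy, and normalize, with beta acting as an inverse temperature.

```python
import numpy as np

def gibbs(energy, ys, beta=1.0):
    """P(y) = exp(-beta * E(y)) / sum_y' exp(-beta * E(y')) over a discretized Y."""
    logits = -beta * energy(ys)
    logits -= logits.max()                  # numerical stability
    p = np.exp(logits)
    return p / p.sum()

# toy energy with two wells: low energy near y = -1 and y = +1 (the "data manifold")
energy = lambda y: (y ** 2 - 1.0) ** 2
ys = np.linspace(-2, 2, 401)

for beta in (1.0, 10.0):                    # higher beta = lower temperature = sharper peaks
    p = gibbs(energy, ys, beta)
    print(beta, round(float(p.sum()), 6), float(ys[p.argmax()]))   # normalized, peak near +/-1
```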
    • Learning the Energy Function. Parameterized energy function E(Y, W). Make the energy low on the samples; make the energy higher everywhere else. Making the energy low on the samples is easy. But how do we make it higher everywhere else?
    • Seven Strategies to Shape the Energy Function. 1. Build the machine so that the volume of low-energy stuff is constant: PCA, K-means, GMM, square ICA. 2. Push down the energy of data points, push up everywhere else: max likelihood (needs a tractable partition function). 3. Push down the energy of data points, push up on chosen locations: contrastive divergence, ratio matching, noise contrastive estimation, minimum probability flow. 4. Minimize the gradient and maximize the curvature around data points: score matching. 5. Train a dynamical system so that the dynamics goes to the manifold: denoising auto-encoder. 6. Use a regularizer that limits the volume of space that has low energy: sparse coding, sparse auto-encoder, PSD. 7. If E(Y) = ||Y − G(Y)||², make G(Y) as "constant" as possible: contracting auto-encoder, saturating auto-encoder.
    • #1: Constant volume of low energy. Build the machine so that the volume of low-energy stuff is constant: PCA, K-means, GMM, square ICA... PCA: E(Y) = ||W^T W Y − Y||². K-means (Z constrained to a 1-of-K code): E(Y) = min_Z Σ_i ||Y − W_i Z_i||².
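      To ground strategy #1, here is a numpy sketch (toy data, not from the tutorial) that evaluates the PCA and K-means energies above on a point near the training manifold and on a point far from it; both assign low energy near the data and higher energy away from it without any explicit "push up" term.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data near a 1-D subspace of R^3 (the "manifold")
basis = np.array([1.0, 2.0, -1.0]); basis /= np.linalg.norm(basis)
Y_train = np.outer(rng.standard_normal(200), basis) + 0.05 * rng.standard_normal((200, 3))

# PCA energy: E(Y) = || W^T W Y - Y ||^2, with W the leading principal direction
U, S, Vt = np.linalg.svd(Y_train - Y_train.mean(0), full_matrices=False)
W = Vt[:1]                                                      # 1 x 3
pca_energy = lambda y: np.sum((W.T @ (W @ y) - y) ** 2)

# K-means energy: the 1-of-K code picks the nearest prototype, E(Y) = min_k ||Y - W_k||^2
protos = Y_train[rng.choice(len(Y_train), 8, replace=False)]    # crude 8-prototype "codebook"
kmeans_energy = lambda y: np.min(np.sum((protos - y) ** 2, axis=1))

on_manifold = 2.0 * basis
off_manifold = np.array([2.0, -2.0, 2.0])
print(pca_energy(on_manifold), pca_energy(off_manifold))        # low vs. high
print(kmeans_energy(on_manifold), kmeans_energy(off_manifold))  # low-ish vs. much higher
```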
    • #2: Push down the energy of data points, push up everywhere else. Max likelihood (requires a tractable partition function). Maximizing P(Y|W) on training samples: make the numerator exp(−E(Y, W)) big on the samples, make the normalizing denominator small. Minimizing −log P(Y, W) on training samples: make E(Y, W) small on the samples and the log partition term small. (Figures: E(Y) and P(Y).)
    • #2: Push down the energy of data points, push up everywhere else. Gradient of the negative log-likelihood loss for one sample Y: gradient descent pushes down on the energy of the sample and pulls up on the energy of low-energy Y's.
    • #3: Push down the energy of data points, push up on chosen locations. Contrastive divergence, ratio matching, noise contrastive estimation, minimum probability flow. Contrastive divergence, basic idea: pick a training sample, lower the energy at that point; from the sample, move down in the energy surface with noise; stop after a while; push up on the energy of the point where we stopped. This creates grooves in the energy surface around data manifolds. CD can be applied to any energy function (not just RBMs). Persistent CD: use a bunch of "particles" and remember their positions; make them roll down the energy surface with noise; push up on the energy wherever they are; faster than CD. RBM: E(Y, Z) = −Z^T W Y;  E(Y) = −log Σ_Z e^{Z^T W Y}.
    • #6: Use a regularizer that limits the volume of space that has low energy. Sparse coding, sparse auto-encoder, Predictive Sparse Decomposition.
    • Sparse Modeling, Sparse Auto-Encoders, Predictive Sparse Decomposition, LISTA
    • How to Speed Up Inference in a Generative Model? Factor graph with an asymmetric factor. Inference Z → Y is easy: run Z through the deterministic decoder, and sample Y. Inference Y → Z is hard, particularly if the decoder function is many-to-one. MAP: minimize the sum of the two factors with respect to Z: Z* = argmin_Z Distance[Decoder(Z), Y] + FactorB(Z). Examples: K-means (1-of-K), sparse coding (sparse), factor analysis. (Diagram: generative model, factor A = distance between input Y and Decoder(Z), factor B on the latent variable Z.)
    • Sparse Coding & Sparse Modeling [Olshausen & Field 1997]. Sparse linear reconstruction. Energy = reconstruction_error + code_sparsity: E(Y^i, Z) = ||Y^i − W_d Z||² + λ Σ_j |z_j|. Inference: Y → Ẑ = argmin_Z E(Y, Z). Inference is slow. (Factor graph: input Y, latent code Z, reconstruction ||Y^i − Ỹ||² through the deterministic decoder W_d Z, and sparsity factors |z_j|.)
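      Since inference here is itself an optimization, a common way to compute Ẑ = argmin_Z ||Y − W_d Z||² + λ Σ_j |z_j| is the ISTA proximal-gradient iteration (soft-thresholded gradient steps). A minimal numpy sketch with a random dictionary (illustrative, not the tutorial's code):

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(y, Wd, lam, n_iter=200):
    """Minimize ||y - Wd z||^2 + lam * ||z||_1 by iterative shrinkage-thresholding."""
    L = np.linalg.norm(Wd, 2) ** 2            # squared spectral norm of Wd
    z = np.zeros(Wd.shape[1])
    for _ in range(n_iter):
        grad = Wd.T @ (Wd @ z - y)            # gradient of the reconstruction term (up to a factor 2)
        z = soft_threshold(z - grad / L, lam / (2 * L))
    return z

rng = np.random.default_rng(0)
Wd = rng.standard_normal((20, 50))            # overcomplete dictionary (random, for illustration)
z_true = np.zeros(50); z_true[[3, 17, 41]] = [1.5, -2.0, 1.0]
y = Wd @ z_true + 0.01 * rng.standard_normal(20)

z_hat = ista(y, Wd, lam=0.2)
print(np.nonzero(np.abs(z_hat) > 0.1)[0])     # should roughly recover the support {3, 17, 41}
```

      The "inference is slow" remark on the slide refers exactly to this: every input requires its own iterative minimization, which motivates the trained encoders (PSD, LISTA) on the following slides.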
    • Encoder Architecture. Examples: most ICA models, Product of Experts. (Diagram: fast feed-forward model, factor A = distance between the encoder output and the latent variable Z, factor B on Z.)
    • Encoder-Decoder Architecture. Train a "simple" feed-forward function to predict the result of a complex optimization on the data points of interest. 1. Find the optimal Z^i for all Y^i; 2. Train the encoder to predict Z^i from Y^i. (Diagram: generative-model factor A (decoder + distance) combined with a fast feed-forward factor (encoder + distance) and latent factor B.) [Kavukcuoglu, Ranzato, LeCun, rejected by every conference, 2008-2009]
    • Why Limit the Information Content of the Code? (Figures: input space and feature space; training samples, input vectors which are NOT training samples, feature vectors.)
    • Why Limit the Information Content of the Code? Training based on minimizing the reconstruction error over the training set.
    • Why Limit the Information Content of the Code? BAD: the machine does not learn structure from the training data!! It just copies the data.
    • Why Limit the Information Content of the Code? IDEA: reduce the number of available codes.
    • Why Limit the Information Content of the Code? IDEA: reduce the number of available codes.
    • Why Limit the Information Content of the Code? IDEA: reduce the number of available codes.
    • Predictive Sparse Decomposition (PSD): sparse auto-encoder. Predict the optimal code with a trained encoder. Energy = reconstruction_error + code_prediction_error + code_sparsity: E(Y^i, Z) = ||Y^i − W_d Z||² + ||Z − g_e(W_e, Y^i)||² + λ Σ_j |z_j|, with g_e(W_e, Y^i) = shrinkage(W_e Y^i). [Kavukcuoglu, Ranzato, LeCun, 2008 → arXiv:1010.3467] (Factor graph: reconstruction ||Y^i − Ỹ||², code prediction ||Z − Z̃||², sparsity |z_j|.)
    • PSD: Basis Functions on MNIST. Basis functions (and encoder matrix) are digit parts.
    • Predictive Sparse Decomposition (PSD): Training. Training on natural image patches: 12x12 patches; 256 basis functions.
    • Learned Features on natural patches: V1-like receptive fields.
    • Better Idea: Give the "Right" Structure to the Encoder. ISTA/FISTA: iterative algorithm that converges to the optimal sparse code. (Flow graph: input Y → W_e → + → sh() → Z, with lateral-inhibition feedback through S.) [Gregor & LeCun, ICML 2010], [Bronstein et al. ICML 2012], [Rolfe & LeCun ICLR 2013]
    • LISTA: Train W_e and S matrices to give a good approximation quickly. Think of the FISTA flow graph as a recurrent neural net where W_e and S are trainable parameters. Time-unfold the flow graph for K iterations. Learn the W_e and S matrices with "backprop-through-time". Get the best approximate solution within K iterations. (Diagram: Y → W_e → [+ → sh() → S] repeated → Z.)
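      To see the unrolling, here is a numpy sketch (illustrative only) of a K-step LISTA-style encoder z ← sh(W_e y + S z). Initializing W_e and S from a dictionary W_d makes the unrolled net reproduce plain ISTA; LISTA would instead learn W_e, S (and the threshold) by backprop through the K steps, which this sketch does not do.

```python
import numpy as np

def soft(v, theta):
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def lista_forward(y, We, S, theta, K):
    """Unrolled (L)ISTA encoder: K steps of z <- soft(We @ y + S @ z, theta)."""
    z = soft(We @ y, theta)
    for _ in range(K - 1):
        z = soft(We @ y + S @ z, theta)
    return z

rng = np.random.default_rng(0)
Wd = rng.standard_normal((20, 50))                 # decoder dictionary (random, illustrative)
L = np.linalg.norm(Wd, 2) ** 2
We = Wd.T / L                                      # encoder filters, ISTA-style initialization
S = np.eye(50) - Wd.T @ Wd / L                     # lateral "inhibition" matrix, ISTA-style init
lam = 0.2

z_true = np.zeros(50); z_true[[3, 17, 41]] = [1.5, -2.0, 1.0]
y = Wd @ z_true

for K in (1, 3, 10, 50):                           # more unrolled steps = better sparse code
    z = lista_forward(y, We, S, theta=lam / L, K=K)
    obj = 0.5 * np.sum((y - Wd @ z) ** 2) + lam * np.sum(np.abs(z))
    print(K, round(float(obj), 3))                 # the sparse-coding objective decreases with K
```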
    • Learning ISTA (LISTA) vs. ISTA/FISTA.
    • LISTA with a partial mutual inhibition matrix.
    • Learning Coordinate Descent (LCoD): faster than LISTA.
    • Discriminative Recurrent Sparse Auto-Encoder (DrSAE) [Rolfe & LeCun ICLR 2013]. Architecture: rectified linear units; classification loss: cross-entropy; reconstruction loss: squared error; sparsity penalty: L1 norm of the last hidden layer; rows of W_d and columns of W_e constrained to lie on the unit sphere. (Diagram: encoding filters W_e, lateral inhibition S (can be repeated), decoding filters W_d, classifier W_c.)
    • DrSAE Discovers the manifold structure of handwritten digits. Image = prototype + sparse sum of "parts" (to move around the manifold).
    • Convolutional Sparse Coding. Replace the dot products with the dictionary elements by convolutions. Input Y is a full image; each code component Z_k is a feature map (an image); each dictionary element is a convolution kernel. Regular sparse coding vs. convolutional sparse coding: Y = Σ_k W_k * Z_k. "Deconvolutional networks" [Zeiler, Taylor, Fergus CVPR 2010].
    • Convolutional PSD: Encoder with a soft sh() Function. Convolutional formulation: extend sparse coding from PATCH-based to CONVOLUTIONAL (whole-IMAGE) learning.
    • Convolutional Sparse Auto-Encoder on Natural Images. Filters and basis functions obtained with 1, 2, 4, 8, 16, 32, and 64 filters.
    • Using PSD to Train a Hierarchy of Features. Phase 1: train the first layer using PSD. (Factor graph: reconstruction ||Y^i − Ỹ||², prediction ||Z − Z̃||², sparsity λΣ|z_j|, encoder g_e(W_e, Y^i).)
    • Using PSD to Train a Hierarchy of Features. Phase 2: use encoder + absolute value as the feature extractor.
    • Using PSD to Train a Hierarchy of Features. Phase 3: train the second layer using PSD.
    • Using PSD to Train a Hierarchy of Features. Phase 4: use encoder + absolute value as the 2nd feature extractor.
    • Using PSD to Train a Hierarchy of Features. Phase 5: train a supervised classifier on top. Phase 6 (optional): train the entire system with supervised back-propagation.
    • Pedestrian Detection, Face Detection. [Osadchy, Miller, LeCun JMLR 2007], [Kavukcuoglu et al. NIPS 2010], [Sermanet et al. CVPR 2013]
    • ConvNet Architecture with Multi-Stage Features [Sermanet, Chintala, LeCun CVPR 2013]. Feature maps from all stages are pooled/subsampled and sent to the final classification layers. Pooled low-level features: good for textures and local motifs. High-level features: good for "gestalt" and global shape. (Diagram: 78x126 YUV input → 7x7 filters + tanh (38 feature maps) → 3x3 L2 pooling → 2040 9x9 filters + tanh (68 feature maps) → 2x2 average pooling → filter + tanh → classifier.)
    • Pedestrian Detection: INRIA Dataset. Miss rate vs. false positives. (Curves: ConvNet Color+Skip Supervised; ConvNet Color+Skip Unsup+Sup; ConvNet B&W Unsup+Sup; ConvNet B&W Supervised.) [Kavukcuoglu et al. NIPS 2010], [Sermanet et al. ArXiv 2012]
    • Results on "Near Scale" Images (>80 pixels tall, no occlusions). Datasets: Daimler (p=21790), ETH (p=804), TudBrussels (p=508), INRIA (p=288).
    • Results on "Reasonable" Images (>50 pixels tall, few occlusions). Datasets: Daimler (p=21790), ETH (p=804), TudBrussels (p=508), INRIA (p=288).
    • Unsupervised pre-training with convolutional PSD. 128 stage-1 filters on the Y channel. Unsupervised training with convolutional predictive sparse decomposition.
    • Unsupervised pre-training with convolutional PSD. Stage-2 filters. Unsupervised training with convolutional predictive sparse decomposition.
    • Applying a ConvNet on Sliding Windows is Very Cheap! Traditional detectors/classifiers must be applied to every location on a large input image, at multiple scales. Convolutional nets can be replicated over large images very cheaply. The network is applied at multiple scales spaced by 1.5. (Example: 96x96 input; on a 120x120 image the output is 3x3.)
    • Building a Detector/Recognizer: Replicated Convolutional Nets. Computational cost for the replicated convolutional net: 96x96 → 4.6 million multiply-accumulate operations; 120x120 → 8.3 million; 240x240 → 47.5 million; 480x480 → 232 million. Computational cost for a non-convolutional detector of the same size, applied every 12 pixels: 96x96 → 4.6 million multiply-accumulate operations; 120x120 → 42.0 million; 240x240 → 788.0 million; 480x480 → 5,083 million. (Diagram: 96x96 window, 12-pixel shift, 84x84 overlap.)
    • Musical Genre Recognition with PSD Features. Input: "constant-Q transform" over 46.4 ms windows (1024 samples); 96 filters, with frequencies spaced every quarter tone (4 octaves). Architecture: input: sequence of contrast-normalized CQT vectors; 1: PSD features, 512 trained filters, shrinkage function → rectification; 3: pooling over 5 seconds; 4: linear SVM classifier, pooling of SVM categories over 30 seconds. GTZAN dataset: 1000 clips, 30 seconds each; 10 genres: blues, classical, country, disco, hiphop, jazz, metal, pop, reggae and rock. Results: 84% correct classification.
    • Single-Stage Convolutional Network. Training of filters: PSD (unsupervised). Architecture: contrast norm → filters → shrink → max pooling. (Pipeline: subtractive + divisive contrast normalization → filters → shrinkage → max pooling (5 s) → linear classifier.)
    • Constant-Q Transform over 46.4 ms → Contrast Normalization. Subtractive + divisive contrast normalization.
    • Convolutional PSD Features on Time-Frequency Signals. Octave-wide features vs. full 4-octave features. (Examples: minor 3rd, perfect 4th, perfect 5th, quartal chord, major triad, transient.)
    • PSD Features on the Constant-Q Transform. Octave-wide features: encoder basis functions; decoder basis functions.
    • Time-Frequency Features. Octave-wide features on 8 successive acoustic vectors. Almost no temporal structure in the filters!
    • Accuracy on the GTZAN dataset (small, old, etc...). Accuracy: 83.4%. State of the art: 84.3%. Very fast.
    • Unsupervised Learning: Invariant Features
    • Learning Invariant Features with L2 Group Sparsity. Unsupervised PSD ignores the spatial pooling step. Could we devise a similar method that learns the pooling layer as well? Idea [Hyvärinen & Hoyer 2001]: group sparsity on pools of features: the minimum number of pools must be non-zero; the number of features that are on within a pool doesn't matter; pools tend to regroup similar features. E(Y, Z) = ||Y − W_d Z||² + ||Z − g_e(W_e, Y)||² + Σ_j √(Σ_{k∈P_j} Z_k²). (Factor graph: reconstruction, code prediction, and an L2 norm within each pool.)
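      A tiny numpy illustration (not from the slides) of the group-sparsity term Σ_j √(Σ_{k∈P_j} Z_k²): activity concentrated in one pool is cheaper than the same activity spread over several pools, which is what pushes similar features into the same pool; a plain L1 penalty cannot express this preference.

```python
import numpy as np

def group_sparsity(z, pools):
    """L2 within each pool, L1 across pools: sum_j sqrt(sum_{k in P_j} z_k^2)."""
    return sum(np.sqrt(np.sum(z[list(p)] ** 2)) for p in pools)

# 8 code units split into 2 pools of 4
pools = [range(0, 4), range(4, 8)]

concentrated = np.array([1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0])  # all activity inside one pool
spread       = np.array([1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0])  # same total activity, two pools

print(group_sparsity(concentrated, pools))                # 2.0   (one active pool)
print(group_sparsity(spread, pools))                      # ~2.83 (two active pools cost more)
print(np.abs(concentrated).sum(), np.abs(spread).sum())   # plain L1 cannot tell them apart: 4.0 vs 4.0
```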
    • Learning Invariant Features with L2 Group Sparsity. Idea: features are pooled in groups; sparsity: sum over groups of the L2 norm of the activity in each group. [Hyvärinen & Hoyer 2001]: "subspace ICA"; decoder only, square. [Welling, Hinton, Osindero NIPS 2002]: pooled product of experts; encoder only, overcomplete, log Student-T penalty on L2 pooling. [Kavukcuoglu, Ranzato, Fergus, LeCun, CVPR 2010]: Invariant PSD; encoder-decoder (like PSD), overcomplete, L2 pooling. [Le et al. NIPS 2011]: Reconstruction ICA; same as [Kavukcuoglu 2010] with a linear encoder and tied decoder. [Gregor & LeCun arXiv:1006.0448, 2010], [Le et al. ICML 2012]: locally-connected, non-shared (tiled) encoder-decoder. (Diagram: simple features Z → L2 norm within each pool → invariant features; encoder only (PoE, ICA), decoder only, or encoder-decoder (iPSD, RICA).)
    • Groups are local in a 2D Topographic Map. The filters arrange themselves spontaneously so that similar filters enter the same pool. The pooling units can be seen as complex cells. Outputs of pooling units are invariant to local transformations of the input: for some it's translations, for others rotations, or other transformations.
    • Image-level training, local filters but no weight sharing. Training on 115x115 images; kernels are 15x15 (not shared across space!). [Gregor & LeCun 2010]: local receptive fields, no shared weights, 4x overcomplete, L2 pooling, group sparsity over pools. (Diagram: input → encoder → predicted code / (inferred) code → decoder → reconstructed input.)
    • Image-level training, local filters but no weight sharing. Training on 115x115 images; kernels are 15x15 (not shared across space!).
    • Topographic Maps. 119x119 image input; 100x100 code; 20x20 receptive field size; sigma=5. (Compare: Michael C. Crair et al., The Journal of Neurophysiology, Vol. 77 No. 6, June 1997, pp. 3381-3385 (cat); K. Obermayer and G. G. Blasdel, Journal of Neuroscience, Vol. 13, 4114-4129 (monkey).)
    • Image-level training, local filters but no weight sharing. Color indicates orientation (by fitting Gabors).
    • Invariant Features via Lateral Inhibition. Replace the L1 sparsity term by a lateral inhibition matrix. An easy way to impose some structure on the sparsity. [Gregor, Szlam, LeCun NIPS 2011]
    • Invariant Features via Lateral Inhibition: Structured Sparsity. Each edge in the tree indicates a zero in the S matrix (no mutual inhibition). S_ij is larger if two neurons are far away in the tree.
    • Invariant Features via Lateral Inhibition: Topographic Maps. Non-zero values in S form a ring in a 2D topology. Input patches are high-pass filtered.
    • Invariant Features through Temporal Constancy. An object is the cross-product of object type and instantiation parameters. Mapping units [Hinton 1981], capsules [Hinton 2011]. (Diagram: object type × object size: small, medium, large.) [Karol Gregor et al.]
    • What-Where Auto-Encoder Architecture. (Diagram: inputs S_t, S_{t−1}, S_{t−2}; encoder f ∘ W̃_1 applied to each frame plus W̃_2; inferred/predicted codes C1_t, C1_{t−1}, C1_{t−2} and C2_t; decoder with W_1 and W_2 producing the predicted input.)
    • Low-Level Filters Connected to Each Complex Cell. C1 (where); C2 (what).
    • Generating Images. (Figure: input and generated images.)
    • Future Challenges
    • The Graph of Deep Learning ↔ Sparse Modeling ↔ Neuroscience. (Concept graph linking: architecture of V1 [Hubel, Wiesel 62]; basis/matching pursuit [Mallat 93; Donoho 94]; sparse modeling [Olshausen-Field 97]; Neocognitron [Fukushima 82]; backprop [many 85]; convolutional net [LeCun 89]; sparse auto-encoder [LeCun 06; Ng 07]; restricted Boltzmann machine [Hinton 05]; normalization [Simoncelli 94]; speech recognition [Goog, IBM, MSFT 12]; object recognition [Hinton 12]; scene labeling [LeCun 12]; connectomics [Seung 10]; object recognition [LeCun 10]; compressed sensing [Candès-Tao 04]; L2-L1 optimization [Nesterov, Nemirovski, Daubechies, Osher, ...]; scattering transform [Mallat 10]; stochastic optimization [Nesterov, Bottou, Nemirovski, ...]; sparse modeling [Bach, Sapiro, Elad]; MCMC, HMC, contrastive divergence [Neal, Hinton]; visual metamers [Simoncelli 12].)
• Integrating Feed-Forward and Feedback
  - Marrying feed-forward convolutional nets with generative "deconvolutional nets": Deconvolutional Networks [Zeiler-Graham-Fergus ICCV 2011].
  - Feed-forward/feedback networks allow reconstruction, multimodal prediction, restoration, etc.
  - Deep Boltzmann machines can do this, but there are scalability issues with training.
  - Diagram: a stack of trainable feature transforms.
• Integrating Deep Learning and Structured Prediction
  - Deep learning systems can be assembled into factor graphs.
  - The energy function is a sum of factors; factors can embed whole deep learning systems.
  - X: observed variables (inputs); Z: never observed (latent variables); Y: observed on the training set (output variables).
  - Inference is energy minimization (MAP) or free-energy minimization (marginalization) over Z and Y, given an X.
  - Diagram: an energy model (factor graph) E(X,Y,Z).
• Integrating Deep Learning and Structured Prediction (continued)
  - Minimizing or marginalizing over the latent variable Z turns the energy into a factor over (X, Y): F(X,Y) = min_Z E(X,Y,Z) (MAP), or F(X,Y) = -log ∑_Z exp[-E(X,Y,Z)] (free energy).
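For a small, discrete latent space both definitions of F(X,Y) can be evaluated directly. The sketch below (plain NumPy, an illustration rather than anything from the tutorial) does so with a numerically stable log-sum-exp; the energy function E, the candidate set Z_values, and the toy usage are made up for the example.

```python
# F_map(X,Y)  = min_Z E(X,Y,Z)             (MAP over the latent variable)
# F_marg(X,Y) = -log sum_Z exp(-E(X,Y,Z))  (free energy / marginalization)
import numpy as np

def free_energies(E, X, Y, Z_values):
    energies = np.array([E(X, Y, z) for z in Z_values])
    F_map = energies.min()
    m = F_map                                          # shift for numerical stability
    F_marg = m - np.log(np.exp(-(energies - m)).sum())
    return F_map, F_marg

# Toy example: quadratic energy, Z scanned over a grid.
E = lambda X, Y, Z: (X - Y - Z) ** 2
print(free_energies(E, X=1.0, Y=0.5, Z_values=np.linspace(-1, 1, 21)))
```

Prediction then amounts to choosing the Y with the lowest F(X,Y) among candidate outputs.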
• Integrating Deep Learning and Structured Prediction
  - Integrating deep learning and structured prediction is a very old idea; in fact, it predates structured prediction.
  - Globally-trained convolutional net + graphical models, trained discriminatively at the word level.
  - Loss identical to CRF and structured perceptron.
  - Compositional movable-parts model.
  - A system like this was reading 10 to 20% of all the checks in the US around 1998.
• Future Challenges
  - Integrated feed-forward and feedback: deep Boltzmann machines do this, but there are issues of scalability.
  - Integrating supervised and unsupervised learning in a single algorithm: again, deep Boltzmann machines do this, but...
  - Integrating deep learning and structured prediction ("reasoning"): this has been around since the 1990s but needs to be revived.
  - Learning representations for complex reasoning: "recursive" networks that operate on vector-space representations of knowledge [Pollack 90s] [Bottou 2010] [Socher, Manning, Ng 2011].
  - Representation learning in natural language processing: [Y. Bengio 01], [Collobert & Weston 10], [Mnih & Hinton 11], [Socher 12].
  - Better theoretical understanding of deep learning and convolutional nets: e.g. Stéphane Mallat's "scattering transform", and work on sparse representations from the applied math community.
• SOFTWARE
  - Torch7: learning library that supports neural net training
    http://www.torch.ch
    http://code.cogbits.com/wiki/doku.php (tutorial with demos by C. Farabet)
    http://eblearn.sf.net (C++ library with convnet support by P. Sermanet)
  - Python-based learning library (U. Montreal)
    http://deeplearning.net/software/theano/ (does automatic differentiation)
  - RNN
    www.fit.vutbr.cz/~imikolov/rnnlm (language modeling)
    http://sourceforge.net/apps/mediawiki/rnnl/index.php (LSTM)
  - Misc
    www.deeplearning.net//software_links
  - CUDAMat & GNumpy
    code.google.com/p/cudamat
    www.cs.toronto.edu/~tijmen/gnumpy.html
• REFERENCES: Convolutional Nets
  - LeCun, Bottou, Bengio and Haffner: Gradient-Based Learning Applied to Document Recognition, Proceedings of the IEEE, 86(11):2278-2324, November 1998.
  - Krizhevsky, Sutskever, Hinton: ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012.
  - Jarrett, Kavukcuoglu, Ranzato, LeCun: What is the Best Multi-Stage Architecture for Object Recognition?, Proc. International Conference on Computer Vision (ICCV'09), IEEE, 2009.
  - Kavukcuoglu, Sermanet, Boureau, Gregor, Mathieu, LeCun: Learning Convolutional Feature Hierarchies for Visual Recognition, Advances in Neural Information Processing Systems (NIPS 2010), 23, 2010.
  - See yann.lecun.com/exdb/publis for references on many different kinds of convnets.
  - See http://www.cmap.polytechnique.fr/scattering/ for scattering networks (similar to convnets but with less learning and stronger mathematical foundations).
• REFERENCES: Applications of Convolutional Nets
  - Farabet, Couprie, Najman, LeCun: Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers, ICML 2012.
  - Pierre Sermanet, Koray Kavukcuoglu, Soumith Chintala and Yann LeCun: Pedestrian Detection with Unsupervised Multi-Stage Feature Learning, CVPR 2013.
  - D. Ciresan, A. Giusti, L. Gambardella, J. Schmidhuber: Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images, NIPS 2012.
  - Raia Hadsell, Pierre Sermanet, Marco Scoffier, Ayse Erkan, Koray Kavukcuoglu, Urs Muller and Yann LeCun: Learning Long-Range Vision for Autonomous Off-Road Driving, Journal of Field Robotics, 26(2):120-144, February 2009.
  - Burger, Schuler, Harmeling: Image Denoising: Can Plain Neural Networks Compete with BM3D?, Computer Vision and Pattern Recognition (CVPR), 2012.
• REFERENCES: Applications of RNNs
  - Mikolov: Statistical Language Models Based on Neural Networks, PhD thesis, 2012.
  - Boden: A Guide to RNNs and Backpropagation, Tech Report, 2002.
  - Hochreiter, Schmidhuber: Long Short-Term Memory, Neural Computation, 1997.
  - Graves: Offline Arabic Handwriting Recognition with Multidimensional Neural Networks, Springer, 2012.
  - Graves: Speech Recognition with Deep Recurrent Neural Networks, ICASSP 2013.
• REFERENCES: Deep Learning & Energy-Based Models
  - Y. Bengio: Learning Deep Architectures for AI, Foundations and Trends in Machine Learning, 2(1), pp. 1-127, 2009.
  - LeCun, Chopra, Hadsell, Ranzato, Huang: A Tutorial on Energy-Based Learning, in Bakir, G., Hofmann, T., Schölkopf, B., Smola, A. and Taskar, B. (Eds), Predicting Structured Data, MIT Press, 2006.
  - M. Ranzato: Unsupervised Learning of Feature Hierarchies, PhD thesis, NYU, 2009.
  Practical guides:
  - Y. LeCun et al.: Efficient BackProp, Neural Networks: Tricks of the Trade, 1998.
  - L. Bottou: Stochastic Gradient Descent Tricks, Neural Networks: Tricks of the Trade Reloaded, LNCS, 2012.
  - Y. Bengio: Practical Recommendations for Gradient-Based Training of Deep Architectures, arXiv, 2012.