Deep Learning for Data Scientists - Data Science ATL Meetup Presentation, 2014-01-08



Note: these are the slides from a presentation at Lexis Nexis in Alpharetta, GA, on 2014-01-08 as part of the DataScienceATL Meetup. A video of this talk from Dec 2013 is available on vimeo at

Note: Slideshare mis-converted the images in slides 16-17. Expect a fix in the next couple of days.


Deep learning is a hot area of machine learning named one of the "Breakthrough Technologies of 2013" by MIT Technology Review. The basic ideas extend neural network research from past decades and incorporate new discoveries in statistical machine learning and neuroscience. The results are new learning architectures and algorithms that promise disruptive advances in automatic feature engineering, pattern discovery, data modeling and artificial intelligence. Empirical results from real world applications and benchmarking routinely demonstrate state-of-the-art performance across diverse problems including: speech recognition, object detection, image understanding and machine translation. The technology is employed commercially today, notably in many popular Google products such as Street View, Google+ Image Search and Android Voice Recognition.

In this talk, we will present an overview of deep learning for data scientists: what it is, how it works, what it can do, and why it is important. We will review several real world applications and discuss some of the key hurdles to mainstream adoption. We will conclude by discussing our experiences implementing and running deep learning experiments on our own hardware data science appliance.

Published in: Technology, Education

  • (1:00) Thank organizers & attendees. My background, thesis. Invitation to connect. Talk in 3 parts: introduce and motivate the topic; high-level overview of deep learning details; examples.
  • How many have heard of deep learning?
  • Joke: Wired and ad placement. Companies are acquiring talent and demonstrating use cases. Zuckerberg @ NIPS.
  • Growing popularity. Lots of applications motivated by vision and audio. Sensible because of connections to perception, AI and neural networks. Revolutions have participants.
  • Products are seeing big lift. Example of real-time translation that kept it in the same voice! "I'm speaking in English and hopefully you'll hear me speaking in Chinese in my own voice."
  • Apology for omission.
  • As a data scientist, you consume machine learning.
  • Consider the canonical problem: classification. Cats and dogs; cats and data scientists. In this case, we want to build a magic box that discriminates cats vs. dogs. Play on the Google cat detector: 1000 nodes, 16000 cores, 1 week per trial @ $1/hr = ? (June 2012). Cat detector detects better than a cat. Leaving data on the table.
  • Many examples, from all classes, required. Consequence -> use less data. Features require lots of engineering and work. Example here, SIFT, took over a decade for David Lowe to develop. Many examples of features: tail, fur, eyes, edges, height, etc.
  • Features: raw numbers to a smaller, better pile of numbers. Many examples, from all classes, required. Consequence -> use less data. Features require lots of engineering and work. Example here, SIFT, took over a decade for David Lowe to develop. Many examples of features: tail, fur, eyes, edges, height, etc. Best disciplined approach: copy and tweak. Show of hands: how many of you have experienced this?
  • 80% of the data scientist's job. We don't scale: how long does it take to get a PhD? Each loop we have to do invention and ideation. "Won a Kaggle contest using RF." Workflow, feature engineering.
  • This is not always true, but good for high-variance problems. What are examples of extra data? Not just a little more data, but a lot of data. We often have a lot more data today in the connected world.
  • No principled way to generate features. No playbook for alien data features.
  • Modules that learn features. Stack them and you get a hierarchical decomposition.
  • Hinton split time: before & after.
  • Describe MNIST, boring, easy: "everything works at 96% accuracy."
  • This network achieved 0.35% error using online backprop: 6 hidden layers (2500, 2000, 1500, 1000, 500, 10), with validation & test error of .35% & .32%.
  • Data flows from bottom to top. Affine + nonlinearity. Nonlinear regression. We have to learn the weights and bias. We have to pick the activation function.
  • Backprop top; backprop global.
  • 1000 categories. 25% -> 15% error. Acquired by Google 1/13.

    1. 1. Deep Learning for Data Scientists Andrew B. Gardner
    2. 2. Deep Learning in the Press… (Ng, Hinton, LeCun, Kurzweil, Zuckerberg) "Google Hires Brains that Helped Supercharge Machine Learning," Wired 3/2013. "Facebook Taps 'Deep Learning' Giant for New AI Lab," Wired 12/2013. "Is 'Deep Learning' a Revolution in Artificial Intelligence?" New Yorker 11/2012. "The Man Behind the Google Brain: Andrew Ng and the Quest for the New AI," Wired 5/2013. "New Techniques from Google and Ray Kurzweil Are Taking Artificial Intelligence to Another Level," MIT Technology Review 5/2013.
    3. 3. … Publication & Search Trends … Two charts, both rising from '06 to '11: Google Scholar citations for "deep learning" + "neural network", and Google Trends for big data, data science, deep learning, and machine learning. Domains: computer vision, speech & audio, bioinformatics, etc. Conferences: NIPS, ICLR, ICML, …
    4. 4. … Industry & Products • Google – Android Voice Recognition – Maps – Image+ • Microsoft – real-time English-Chinese translation (demoed by Microsoft Chief Research Officer Rick Rashid, 11/2012) • SIRI • Translation • Documents • …
    5. 5. Deep Learning Epicenters (North America) Academia: de Freitas (UBC), Bengio (U Montreal), Hinton (U Toronto), Ng (Stanford), LeCun (NYU). Industry: Microsoft, Facebook, Google, Yahoo.
    6. 6. Deep Learning: The Origin Story
    7. 7. Before: A Cat Detector We want to build this… a classifier f : X -> Y, where Y ~ the labels {"cat", "dog"} and X ~ the images … for less than $1.0M!
    8. 8. Challenge: Labeled Data Labels are expensive -> less data. Intuitively: more data is good. (Labeled examples: cat, cat, dog, cat, dog; the rest unused, unlabeled.)
    9. 9. Challenge: Features Features are expensive -> fewer, shallow. Intuitively: better features are good. image (pixels) -> magic feature dictionary (SIFT, HoG, B/W SIFT, binary histogram, moments, shape histogram, fang detector, something new) -> x = (1.3, 2.8, …)
    10. 10. Machine Learning (Before) Building a Cat Detector 1.0: Features (expensive) -> Detector (Classifier) (important*)
    11. 11. How Good is "More Data?" Labels are expensive -> less data. • More data dominates* better techniques • Often have lots of data • … we just don't have lots of labels • What if there was a way to use unlabeled data? Figure: learning curves (test accuracy vs. millions of words of training data) for confusion set disambiguation, e.g. {to, two, too}; memory-based, Winnow, perceptron and naïve Bayes learners all keep improving as the corpus grows from 0.1M to 1000M words. The 1-billion-word training corpus was collected from a variety of English texts; the memory-based learner used only the word before and word after as features. "Scaling to Very Very Large Corpora for Natural Language Disambiguation," Banko and Brill, 2001.
    12. 12. The Impact of Features Intuitively: better features are good • Critical to success – even more than data! • How to create / engineer features? – Typically shallow • Domain-specific • What if there was a way to automatically learn features?
    13. 13. Machine Learning (What We Want) Building a Cat Detector 2.0 bountiful important* Features + Detector (Classifier) end-to-end
    14. 14. Deep Nets Intuition: Building an Object Recognition System. image -> FEATURE EXTRACTOR -> intermediate representations -> CLASSIFIER -> label ("CAR"). IDEA: Use data to optimize features for the given task. Lee et al., "Convolutional DBN's for scalable unsup. learning...," ICML 2009. (Figure credit: Ranzato.)
    15. 15. Another Example of Hierarchical Learning: a natural progression from low level to high level structure, as in natural complexity; easier to monitor what is being learned and to guide the machine to better subspaces; a good lower level representation can be used for distinct tasks. Example hierarchy: edges -> parts -> faces.
    16. 16. Hierarchy Reusability? A good lower level representation can be used for distinct tasks: faces, cars, elephants, chairs.
    17. 17. A Breakthrough G. E. Hinton, S. Osindero, and Y. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, pp. 1527–1554, 2006. G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504-507, July 2006. (Before vs. after.)
    18. 18. Deep Belief Nets MNIST (60K + 10K images). Test error (%): DBN (unsupervised pretraining + supervised tuning) 1.25; SVM 1.4; kNN 2.8-4.4; ConvNet 0.4 -> 0.23.
    19. 19. MNIST Sample Errors Ciresan et al. “Deep Big Simple Neural Networks Excel on Handwritten Digit Recognition,” 2010
    20. 20. Key Ideas • Learn features from data – Use all data • Deep architecture – Representation – Computational efficiency – Shared statistics • Practical training • State-of-the-art (it worked)
    21. 21. After: Cat Detector unlabeled images (millions) + labeled images (few) -> deep learning network (more data; automatic, deep features)
    22. 22. How Does It Work?
    23. 23. This Is A Neuron 1. Sum all inputs (weighted): x = w0 + w1 z1 + w2 z2 + w3 z3, where z1, z2, z3 are the inputs, w1, w2, w3 the weights, and w0 the bias (the weight on a constant input of 1). 2. Nonlinearly transform: output y = f(x), where f is the activation function (e.g. sigmoid, tanh).
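
As a concrete companion to this slide (not part of the original deck), here is a minimal NumPy sketch of the two steps; the input and weight names follow the slide, while the sigmoid choice and the example numbers are illustrative assumptions:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def neuron(z, w, w0):
        """One neuron: weighted sum of the inputs plus bias, then a nonlinearity."""
        x = w0 + np.dot(w, z)     # step 1: x = w0 + w1*z1 + w2*z2 + w3*z3
        return sigmoid(x)         # step 2: y = f(x), here f = sigmoid (tanh is another choice)

    # example with three inputs and arbitrary weights
    y = neuron(z=np.array([1.0, 0.5, -2.0]), w=np.array([0.3, -0.1, 0.8]), w0=0.2)
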
    24. 24. A Neural Network forward propagation: weighted sum of the inputs, produce activation, feed forward. Layers, bottom to top: Inputs (the features, e.g. weight = 13.5, n_teeth = 21, n_whiskers = 16) -> Hidden -> Output (cat, dog).
    25. 25. Training Back propagation of error: compare the output to the target (cat = 1, dog = 0), take the total error at the top, and send proportional contributions going backwards through the layers (same example inputs: weight = 13.5, n_teeth = 21, n_whiskers = 16).
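
To make slides 24-25 concrete, here is a hedged, self-contained NumPy sketch (not from the deck) of one forward pass and one backpropagation update for a tiny cat/dog network; the three input features and the cat/dog target echo the slides, while the hidden-layer size, weight scaling, squared-error loss and learning rate are assumptions made for illustration:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # one (features, target) example echoing the slides: weight, n_teeth, n_whiskers -> cat
    x = np.array([13.5, 21.0, 16.0])
    t = np.array([1.0, 0.0])                      # target: (cat, dog)

    rng = np.random.RandomState(0)
    W1, b1 = 0.1 * rng.randn(4, 3), np.zeros(4)   # inputs -> hidden (4 hidden units, assumed)
    W2, b2 = 0.1 * rng.randn(2, 4), np.zeros(2)   # hidden -> output

    # forward propagation: weighted sum, activation, feed forward
    h = sigmoid(np.dot(W1, x) + b1)
    y = sigmoid(np.dot(W2, h) + b2)

    # back propagation of error (squared-error loss at the top)
    d_out = (y - t) * y * (1 - y)                 # error signal at the output layer
    d_hid = np.dot(W2.T, d_out) * h * (1 - h)     # proportional contributions going backwards

    lr = 0.1                                      # learning rate (assumed)
    W2 -= lr * np.outer(d_out, h); b2 -= lr * d_out
    W1 -= lr * np.outer(d_hid, x); b1 -= lr * d_hid

Repeating this update over many labeled examples is the whole of supervised training in this setting; everything that follows in the deck builds on this loop.
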
    26. 26. After Training Each network layer has learned weights; the weights of a layer form a matrix (rows like [.5, -.2, 4, .15, -1, …] and [-.5, -.3, .4, 0, …]), and we can view a weight matrix as an image … plus performance evaluation & logging.
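
The "weight matrix as image" idea can be sketched as follows (again not from the deck; the code assumes, purely for illustration, 28x28 MNIST-style inputs and uses a random matrix W1 as a stand-in for learned first-layer weights):

    import numpy as np
    import matplotlib.pyplot as plt

    # stand-in for a learned input->hidden weight matrix: 100 hidden units, 784 = 28*28 inputs
    W1 = np.random.randn(100, 784)

    fig, axes = plt.subplots(10, 10, figsize=(8, 8))
    for ax, row in zip(axes.ravel(), W1):
        ax.imshow(row.reshape(28, 28), cmap="gray")   # one hidden unit's weights as a 28x28 tile
        ax.axis("off")
    plt.show()
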
    27. 27. Building Blocks So many choices! • Network Topology – Number of layers – Nodes per layer • Layer Type – Feedforward – Restricted Boltzmann – Autoencoder – Recurrent – Convolutional • Neuron Type – Rectified Linear Unit • Regularization – Dropout • Magic Numbers
    28. 28. A Deep Learning Recipe, 1.0 • Lots of data, some+ labels • Train each RBM layer greedily, successively • Add an output layer and train with labels
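
Expressed as code, the recipe is roughly the loop below. This is a sketch under stated assumptions, not the presenter's implementation: it uses scikit-learn's BernoulliRBM as a convenient stand-in RBM, toy random arrays in place of real data, an assumed two-layer topology, and logistic regression as the added output layer:

    import numpy as np
    from sklearn.neural_network import BernoulliRBM
    from sklearn.linear_model import LogisticRegression

    rng = np.random.RandomState(0)
    X_unlabeled = rng.rand(1000, 64)                  # lots of unlabeled data (toy stand-in)
    X_labeled = rng.rand(100, 64)                     # a few labeled examples (toy stand-in)
    y_labeled = rng.randint(0, 2, 100)

    # 1. train each RBM layer greedily, successively, on the unlabeled data
    rbms, h = [], X_unlabeled
    for n_hidden in (32, 16):                         # assumed two-layer topology
        rbm = BernoulliRBM(n_components=n_hidden, learning_rate=0.05, n_iter=10, random_state=0)
        h = rbm.fit_transform(h)                      # unsupervised training, then pass activations up
        rbms.append(rbm)

    # 2. add an output layer and train it with the labels
    feats = X_labeled
    for rbm in rbms:
        feats = rbm.transform(feats)
    clf = LogisticRegression().fit(feats, y_labeled)
    # (in the full recipe the stack is then unrolled and fine-tuned end to end with backprop)
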
    29. 29. A Few Other Important Things • Deep Learning Recipe 2.0 – Dropout / regularization – Rectified Linear Units • Convolutional networks • Hyperparameters • Not just neural networks • Practical Issues (GPU)
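
For concreteness, the two named Recipe 2.0 ingredients look roughly like this in NumPy (a sketch, not from the deck; the "inverted" dropout scaling shown is one common convention and the drop probability is an assumed value):

    import numpy as np

    def relu(x):
        """Rectified Linear Unit: max(0, x), elementwise."""
        return np.maximum(0.0, x)

    def dropout(h, p_drop=0.5, training=True, rng=np.random):
        """'Inverted' dropout: randomly zero activations during training and rescale the
        survivors so the expected activation is unchanged; do nothing at test time."""
        if not training:
            return h
        mask = (rng.rand(*h.shape) >= p_drop) / (1.0 - p_drop)
        return h * mask

    h = relu(np.array([-1.2, 0.3, 2.0]))
    h = dropout(h, p_drop=0.5)
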
    30. 30. Some Applications
    31. 31. Sample Classification Results ImageNet validation classification. Krizhevsky et al., NIPS 2012.
    32. 32. Segmentation neuronal membranes Ciresan et al. “DNN segment neuronal membranes...” NIPS 2012
    33. 33. Caltech 256 Classification accuracy plotted against the number of training examples per class. Zeiler & Fergus, "Visualizing and Understanding Convolutional Networks," arXiv 1311.2901, 2013.
    34. 34. Application: Speech Take a small time window, compute the frequencies in the window, slide by 15 ms, repeat; each window maps to a phoneme. Example sentence: "He can for example present significant university wide issues to the senate." Spectrogram: window in time -> vector of frequencies; slide; repeat.
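
The "window in time -> vector of frequencies; slide; repeat" recipe is a short-time Fourier transform. Below is a minimal NumPy sketch (not from the talk); the sample rate, the 25 ms window length and the synthetic test tone are assumptions, while the 15 ms slide matches the slide:

    import numpy as np

    sr = 16000                                        # assumed sample rate (Hz)
    t = np.arange(sr) / float(sr)
    signal = np.sin(2 * np.pi * 440 * t)              # 1 s synthetic tone standing in for speech

    win = int(0.025 * sr)                             # 25 ms analysis window (assumed)
    hop = int(0.015 * sr)                             # slide by 15 ms, as on the slide
    window = np.hamming(win)

    frames = []
    for start in range(0, len(signal) - win, hop):
        chunk = signal[start:start + win] * window
        frames.append(np.abs(np.fft.rfft(chunk)))     # vector of frequencies for this window

    spectrogram = np.array(frames)                    # shape: (time steps, frequency bins)
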
    35. 35. Automatic Speech CDBNs for speech: a convolutional DBN trained on the unlabeled TIMIT corpus. Experimental results: • Speaker identification (TIMIT accuracy): prior art (Reynolds, 1995) 99.7%; Convolutional DBN 100.0%. • Phone classification (TIMIT accuracy): Clarkson et al. (1999) 77.6%; Gunawardana et al. (2005) 78.3%; Sung et al. (2007) 78.5%; Petrov et al. (2007) 78.6%; Sha & Saul (2006) 78.9%; Yu et al. (2009) 79.2%; Convolutional DBN 80.3%. Also shown: learned first-layer bases. Lee et al., "Unsupervised feature learning for audio classification using convolutional deep belief networks," NIPS 2009.
    36. 36. A Long List of Others • Kaggle – Merck Molecular Activity ('12) – Salary Prediction ('13) • Learning to Play Atari Games ('13) • NLP – chunking, NER, parsing, etc. • Activity recognition from video • Recommendations
    37. 37. Deep Learning In A Nutshell • Architectures vs. features • Deep vs. shallow • Automatic* features • Lots of data vs. best technique • Compute- vs. human-intensive • State-of-the-art • Breaks expert, domain barrier • Details & tricks can be complex
    38. 38. Interested in Deep Learning? Connect for: • Training Workshop (interest list) • Projects / consulting • Collaboration • Questions