Binary classification and linear separators. Perceptron, ADALINE, artificial neurons. Artificial neural networks (ANNs), activation functions, and the universal approximation theorem. Linear versus non-linear classification problems. Typical tasks, architectures and loss functions. Gradient descent and back-propagation. Support Vector Machines (SVMs), soft margins and the kernel trick. Connections between ANNs and SVMs.
Localization and classification. Overfeat: class-agnostic versus class-specific localization, fully convolutional neural networks, greedy merge strategy. Multi-object detection. Region proposals and selective search. R-CNN, Fast R-CNN, Faster R-CNN and YOLO. Image segmentation. Semantic segmentation and transposed convolutions. Instance segmentation and Mask R-CNN. Image captioning. Recurrent Neural Networks (RNNs). Language generation. Long Short-Term Memory (LSTM). DeepImageSent, Show and Tell, and Show, Attend and Tell algorithms.
Overview of the course. Introduction to image sciences, image processing and computer vision. Basics of machine learning, terminologies, paradigms. No-free-lunch theorem. Supervised versus unsupervised learning. Clustering and K-Means. Classification and regression. Linear least squares and polynomial curve fitting. Model complexity and overfitting. Curse of dimensionality. Dimensionality reduction and principal component analysis. Image representation, semantic gap, image features, and classical computer vision pipelines.
Image generation. Gaussian models for human faces, limits and relations with linear neural networks. Generative adversarial networks (GANs), generators, discriminators, adversarial loss and two-player games. Convolutional GANs and image arithmetic. Super-resolution. Nearest-neighbor, bilinear and bicubic interpolation. Image sharpening. Linear inverse problems, Tikhonov and Total-Variation regularization. Super-Resolution CNN, VDSR, Fast SRCNN, SRGAN, perceptual, adversarial and content losses. Style transfer: Gatys model, content loss and style loss.
https://telecombcn-dl.github.io/2017-dlai/
https://telecombcn-dl.github.io/2018-dlai/
https://telecombcn-dl.github.io/dlai-2019/
https://telecombcn-dl.github.io/idl-2020/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both algorithmic and computational perspectives.
Heat equation. Discretization and finite difference. Explicit and implicit Euler schemes. CFL conditions. Continuous Gaussian convolution solution. Linear and non-linear scale spaces. Anisotropic diffusion. Perona-Malik and Weickert model. Variational methods. Tikhonov regularization by gradient descent. Links between variational models and diffusion models. Total-Variation regularization and ROF model. Sparsity and group sparsity. Applications to image deconvolution.
Sample mean, law of large numbers, and method of moments. Mean square error and bias-variance tradeoff. Unbiased estimation: MVUE, Cramér-Rao bound, efficiency, MLE. Linear estimation: BLUE, Gauss-Markov theorem, least-square error estimator, Moore-Penrose pseudo-inverse. Bayesian estimators: likelihood and priors, MMSE and posterior mean, MAP. Linear MMSE, applications to Wiener deconvolution, image filtering with PCA, and the non-local Bayes algorithm. Applications to image denoising.
Limits of Wiener filtering. Non-linear shrinkage functions. Limits of the Fourier representation. Continuous and discrete wavelet transforms. Sparsity and shrinkage in the wavelet domain. Undecimated wavelet transforms, à trous algorithm. Regularization. Sparse regression, combinatorial optimization and matching pursuit. LASSO, non-smooth optimization, and proximal minimization. Link with the implicit Euler scheme. ISTA algorithm. Synthesis versus analysis regularized models. Applications to image deconvolution.
Patch models and sparse decompositions of image patches. Dictionary learning and the k-SVD algorithm. Collaborative filtering and BM3D. Non-local sparse based models. Expected patch log-likelihood. Other applications of patch models in inpainting, super-resolution and deblurring.
IVR - Chapter 2 - Basics of filtering I: Spatial filters Charles Deledalle
Moving averages. Finite differences and edge detectors. Gradient, Sobel and Laplacian. Linear translations invariant filters, cross-correlation and convolution. Adaptive and non-linear filters. Median filters. Morphological filters. Local versus global filters. Sigma filter. Bilateral filter. Patches and non-local means. Applications to image denoising.
Image sciences, image processing, image restoration, photo manipulation. Image and videos representation. Digital versus analog imagery. Quantization and sampling. Sources and models of noises in digital CCD imagery: photon, thermal and readout noises. Sources and models of blurs. Convolutions and point spread functions. Overview of other standard models, problems and tasks: salt-and-pepper and impulse noises, half toning, inpainting, super-resolution, compressed sensing, high dynamic range imagery, demosaicing. Short introduction to other types of imagery: SAR, Sonar, ultrasound, CT and MRI. Linear and ill-posed restoration problems.
MLIP - Chapter 2 - Preliminaries to deep learning
1. ECE 285
Machine Learning for Image Processing
Chapter II – Preliminaries to deep learning
Charles Deledalle
March 11, 2019
1
2. Deep learning – What is deep learning?
What is deep learning?
• Part of the machine learning field of learning representations of data.
Exceptionally effective at learning patterns.
• Utilizes learning algorithms that derive meaning out of data by using a
hierarchy of multiple layers that mimic the neural networks of our brain.
• If you provide the system tons of information, it begins to understand it
and respond in useful ways.
• Rebirth of artificial neural networks.
(Source: Lucas Masuch)
2
3. Machine learning – Brief history
Brief history
• First wave
• 1943. McCulloch and Pitts proposed the first neural model,
• 1958. Rosenblatt introduced the Perceptron,
• 1969. Minsky and Papert’s book demonstrated the limitation of single
layer perceptrons, and almost the whole field went into hibernation.
• Second wave
• 1986. Backpropagation learning algorithm was rediscovered,
• 1989. Yann LeCun (re)introduced Convolutional Neural Networks,
• 1998. Breakthrough of CNNs in recognizing hand-written digits.
• Third wave
• 2006. Deep (neural network) learning gains popularity,
• 2012. Made significant breakthroughs in many applications,
• 2015. AlphaGo first program to beat a professional Go player.
(Source: Jun Wang)
3
4. Deep learning – Interest
Google NGRAM & Google Trends
(Source: Lucas Masuch) 4
5. Machine learning – Timeline
Timeline of (deep) learning
[Timeline figure: 1958 Perceptron; 1969 Perceptron criticized (Perceptrons book); "awkward silence" (AI winter); 1974 Backpropagation; ~1980 Multilayer networks; 1995 SVM reigns; 1998 Convolutional Neural Networks for handwritten recognition; 2006 Restricted Boltzmann Machine; 2012 Google Brain Project on 16k cores; 2012 AlexNet wins ImageNet. Insets illustrate a maximal-margin SVM hyperplane with support vectors, and an artificial neuron (inputs, weights, sum, activation function).]
(Source: Lucas Masuch & Vincent Lepetit) 5
7. Machine learning – ANN - Backpropagation
Perceptron
[Figure: the artificial neuron; inputs X1..X4 weighted by W11..W14 are summed and passed through an activation function f(x). Timeline excerpt: 1958 Perceptron; 1969 Perceptron criticized (Perceptrons book).]
(Source: Lucas Masuch & Vincent Lepetit) 6
8. Machine learning – Perceptron
Perceptron (Frank Rosenblatt, 1958)
First binary classifier based on supervised learning (discrimination).
Foundation of modern artificial neural networks.
At that time: technological, scientific and philosophical challenges.
7
9. Machine learning – Perceptron – Representation
Representation of the Perceptron
Parameters of the perceptron
• wk: synaptic weights
• b: bias
←− real parameters to be estimated.
Training = adjusting the weights and biases
8
10. Machine learning – Perceptron – Inspiration
The origin of the Perceptron
Takes inspiration from the visual system known for its ability to learn patterns.
• When a neuron receives a stimulus
with high enough voltage, it emits
an action potential (aka, nerve
impulse or spike). It is said to fire.
• The perceptron mimics this activation effect: it fires only when $\sum_i w_i x_i + b > 0$
$y = \underbrace{\operatorname{sign}(w_0 x_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 + b)}_{f(x;\, w)} = \begin{cases} +1 & \text{for the first class} \\ -1 & \text{for the second class} \end{cases}$
9
11. Machine learning – Perceptron – Principle
1 Data are represented as vectors:
2 Collect training data with positive and negative examples:
(Source: Vincent Lepetit) 10
12. Machine learning – Perceptron – Principle
Dot product: $\langle w, x \rangle = \sum_{i=1}^{d} w_i x_i = w^T x$
3 Training: find w and b so that:
• $\langle w, x \rangle + b$ is positive for positive samples x,
• $\langle w, x \rangle + b$ is negative for negative samples x.
(Source: Vincent Lepetit) 10
13. Machine learning – Perceptron – Principle
Dot product: $\langle w, x \rangle = \sum_{i=1}^{d} w_i x_i = w^T x$
3 Training: find w and b so that:
• $\langle w, x \rangle + b$ is positive for positive samples x,
• $\langle w, x \rangle + b$ is negative for negative samples x.
The equation $\langle w, x \rangle + b = 0$ defines a hyperplane.
The hyperplane acts as a linear separator.
w is a normal vector to the hyperplane.
(Source: Vincent Lepetit) 10
14. Machine learning – Perceptron – Principle
4 Testing: the perceptron can now classify new examples.
(Source: Vincent Lepetit) 10
15. Machine learning – Perceptron – Principle
4 Testing: the perceptron can now classify new examples.
• A new example x is classified positive if $\langle w, x \rangle + b$ is positive,
(Source: Vincent Lepetit) 10
16. Machine learning – Perceptron – Principle
4 Testing: the perceptron can now classify new examples.
• A new example x is classified positive if $\langle w, x \rangle + b$ is positive,
• and negative if $\langle w, x \rangle + b$ is negative.
(Source: Vincent Lepetit) 10
17. Machine learning – Perceptron – Principle
4 Testing: the perceptron can now classify new examples.
• A new example x is classified positive if $\langle w, x \rangle + b$ is positive,
• and negative if $\langle w, x \rangle + b$ is negative.
(signed) distance of x to the hyperplane: $r = \dfrac{\langle w, x \rangle + b}{\|w\|}$
(Source: Vincent Lepetit) 10
18. Machine learning – Perceptron – Representation
Alternative representation
Use the zero-index to encode the bias as a synaptic weight.
Simplifies algorithms as all parameters can now be treated in the same way.
11
19. Machine learning – Perceptron – Training
Perceptron algorithm
Goal: find the vector of weights w from a labeled training dataset T
How: minimize classification errors
$\min_w E(w) = -\sum_{(x,d) \in \mathcal{T} \text{ s.t. } y \neq d} d \times \langle w, x \rangle$
• penalizes only misclassified samples ($y \neq d$),
• zero if all samples are correctly classified,
• proportional to the absolute distance to the hyperplane ($d \times \langle w, x \rangle < 0$).
12
20. Machine learning – Perceptron – Training
Perceptron algorithm
Algorithm:
• Initialize w randomly
• Repeat until convergence
• For all (x, d) ∈ T
• Compute: $y = \operatorname{sign} \langle w, x \rangle$
• If $y \neq d$:
Update: $w \leftarrow w + d \, x$
• Converges to some solution if the training data are linearly separable,
• Corresponds to stochastic gradient descent for E(w) (see later),
• But may pick any of many solutions of varying quality.
⇒ Poor generalization error.
13
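As a concrete illustration of the algorithm above, here is a minimal NumPy sketch of the perceptron update rule; the toy dataset and variable names are my own additions, not part of the slides.

```python
import numpy as np

def train_perceptron(X, d, n_epochs=100):
    """Perceptron algorithm: X is (N, D) with a leading column of ones
    (bias absorbed into w), d contains labels in {-1, +1}."""
    w = np.random.randn(X.shape[1]) * 0.01    # random initialization
    for _ in range(n_epochs):
        errors = 0
        for x, target in zip(X, d):
            y = np.sign(w @ x)                # y = sign<w, x>
            if y != target:                   # update only misclassified samples
                w += target * x               # w <- w + d x
                errors += 1
        if errors == 0:                       # converged (linearly separable data)
            break
    return w

# Toy linearly separable problem (AND-like labels), bias column included
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
d = np.array([-1, -1, -1, +1])
w = train_perceptron(X, d)
print(w, np.sign(X @ w))                      # predictions should match d
```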
21. Machine learning – Perceptron – Training
Variant: ADALINE (Adaptive Linear Neuron) algorithm
(Widrow & Hoff, 1960).
Loss: minimize instead the least-square error of the (un-thresholded) prediction, without the sign:
$\min_w E(w) = \sum_{(x,d) \in \mathcal{T}} (\langle w, x \rangle - d)^2$
Algorithm:
• Initialize w randomly
• Repeat until convergence
• For all (x, d) ∈ T
• Compute: $y = \langle w, x \rangle$
• Update: w ← w + γ(d − y)x, γ > 0
Also corresponds to stochastic gradient descent for E(w) (see later).
14
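Likewise, a minimal sketch of the ADALINE (delta-rule) update under the same conventions; the value of the learning rate is an arbitrary choice for illustration.

```python
import numpy as np

def train_adaline(X, d, gamma=0.01, n_epochs=200):
    """ADALINE: least-square loss on the un-thresholded output <w, x>."""
    w = np.random.randn(X.shape[1]) * 0.01
    for _ in range(n_epochs):
        for x, target in zip(X, d):
            y = w @ x                       # y = <w, x> (no sign)
            w += gamma * (target - y) * x   # w <- w + gamma (d - y) x
    return w
```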
22. Machine learning – Perceptron – Training
Perceptron vs ADALINE algorithm
Activation function = sign
15
23. Machine learning – Perceptron – Perceptrons book
Perceptrons book (Minsky and Papert, 1969)
A perceptron can only classify data points that are linearly separable:
Seen by many as a justification to stop research on perceptrons.
(Source: Vincent Lepetit)
16
25. Machine learning – Artificial neural network
Artificial neural network
(Source: Lucas Masuch & Vincent Lepetit) 17
26. Machine learning – Artificial neural network
Artificial neural network
• Supervised learning method initially inspired by
the behavior of the human brain.
• Consists of the inter-connection of several
small units (just like in the human brain).
• Essentially numerical but can handle
classification and discrete inputs with
appropriate coding.
• Introduced in the late 50s, very popular in the
90s, reappeared in the 2010s with deep
learning.
• Also referred to as Multi-Layer Perceptron (MLP).
• Historically used after feature extraction.
18
27. Machine learning – Artificial neural network
Artificial neuron (McCulloch & Pitts, 1943)
• An artificial neuron contains several incoming weighted connections, an
outgoing connection and has a nonlinear activation function g.
• Neurons are trained to filter and detect specific features or patterns (e.g.
edge, nose) by receiving weighted input, transforming it with the
activation function and passing it to the outgoing connections.
• Unlike the perceptron, can be used for regression (with proper choice of g).
19
28. Machine learning – ANN
Artificial neural network / Multilayer perceptron / NeuralNet
• Inter-connection of several artificial neurons.
• Each level in the graph is called a layer:
• Input layer,
• Hidden layer(s),
• Output layer.
• Each neuron (also called node or unit) in
the hidden layers acts as a classifier.
• Feedforward NN (no cycle)
• first and simplest type of NN,
• information moves in one direction.
• Recurrent NN (with cycle)
• use for time sequences,
• such as speech-recognition.
Quiz: how many layers does
this network have?
20
29. Machine learning – ANN
Artificial neural network / Multilayer perceptron / NeuralNet
$h_1 = g_1( w^1_{11} x_1 + w^1_{12} x_2 + w^1_{13} x_3 + b^1_1 )$
$h_2 = g_1( w^1_{21} x_1 + w^1_{22} x_2 + w^1_{23} x_3 + b^1_2 )$
$h_3 = g_1( w^1_{31} x_1 + w^1_{32} x_2 + w^1_{33} x_3 + b^1_3 )$
$h_4 = g_1( w^1_{41} x_1 + w^1_{42} x_2 + w^1_{43} x_3 + b^1_4 )$
$y_1 = g_2( w^2_{11} h_1 + w^2_{12} h_2 + w^2_{13} h_3 + w^2_{14} h_4 + b^2_1 )$
$y_2 = g_2( w^2_{21} h_1 + w^2_{22} h_2 + w^2_{23} h_3 + w^2_{24} h_4 + b^2_2 )$
$w^k_{ij}$: synaptic weight between previous node j and next node i at layer k.
$g_k$: any activation function, applied to each coefficient of its input vector.
21
30. Machine learning – ANN
Artificial neural network / Multilayer perceptron / NeuralNet
$h_1 = g_1( w^1_{11} x_1 + w^1_{12} x_2 + w^1_{13} x_3 + b^1_1 )$
$h_2 = g_1( w^1_{21} x_1 + w^1_{22} x_2 + w^1_{23} x_3 + b^1_2 )$
$h_3 = g_1( w^1_{31} x_1 + w^1_{32} x_2 + w^1_{33} x_3 + b^1_3 )$
$h_4 = g_1( w^1_{41} x_1 + w^1_{42} x_2 + w^1_{43} x_3 + b^1_4 )$
$h = g_1(W_1 x + b_1)$
$y_1 = g_2( w^2_{11} h_1 + w^2_{12} h_2 + w^2_{13} h_3 + w^2_{14} h_4 + b^2_1 )$
$y_2 = g_2( w^2_{21} h_1 + w^2_{22} h_2 + w^2_{23} h_3 + w^2_{24} h_4 + b^2_2 )$
$y = g_2(W_2 h + b_2)$
$w^k_{ij}$: synaptic weight between previous node j and next node i at layer k.
$g_k$: any activation function, applied to each coefficient of its input vector.
The matrices $W_k$ and biases $b_k$ are learned from labeled training data.
21
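To make the notation concrete, here is a minimal NumPy sketch of the forward pass of this 3-4-2 network; the weights are random placeholders rather than trained values, and the choice of activations is my own example.

```python
import numpy as np

def g1(a):                      # hidden activation (e.g. ReLU)
    return np.maximum(a, 0)

def g2(a):                      # output activation (identity here)
    return a

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), np.zeros(4)   # 3 inputs -> 4 hidden units
W2, b2 = rng.standard_normal((2, 4)), np.zeros(2)   # 4 hidden units -> 2 outputs

x = rng.standard_normal(3)      # one input vector
h = g1(W1 @ x + b1)             # h = g1(W1 x + b1)
y = g2(W2 @ h + b2)             # y = g2(W2 h + b2)
print(y)
```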
31. Machine learning – ANN
Artificial neural network / Multilayer perceptron
It can have 1 hidden layer only (shallow network),
It can have more than 1 hidden layer (deep network),
each layer may have a different size, and
hidden and output layers often have different activation functions.
22
32. Machine learning – ANN
Artificial neural network / Multilayer perceptron
• As for the perceptron, the biases can be integrated into the weights:
$W_k h_{k-1} + b_k = \underbrace{\begin{pmatrix} b_k & W_k \end{pmatrix}}_{\tilde{W}_k} \underbrace{\begin{pmatrix} 1 \\ h_{k-1} \end{pmatrix}}_{\tilde{h}_{k-1}} = \tilde{W}_k \tilde{h}_{k-1}$
• A neural network with L layers is a function of x parameterized by $\tilde{W}$:
$y = f(x; \tilde{W})$ where $\tilde{W} = (\tilde{W}_1, \tilde{W}_2, \ldots, \tilde{W}_L)^T$
• It can be defined recursively as
$y = f(x; \tilde{W}) = h_L$, with $h_k = g_k(\tilde{W}_k \tilde{h}_{k-1})$ and $h_0 = x$
• For simplicity, $\tilde{W}$ will be denoted W (when no confusion is possible).
23
33. Machine learning – ANN – Activation functions
Activation functions
Linear units: g(a) = a
$y = W_L h_{L-1} + b_L$
$h_{L-1} = W_{L-1} h_{L-2} + b_{L-1}$
$y = W_L W_{L-1} h_{L-2} + W_L b_{L-1} + b_L$
$y = W_L \cdots W_1 x + \sum_{k=1}^{L-1} W_L \cdots W_{k+1} b_k + b_L$
We can always find an equivalent network without hidden units, because a composition of affine functions is affine.
Sometimes used for linear dimensionality reduction (similarly to PCA).
In general, non-linearity is needed to learn complex (non-linear)
representations of data, otherwise the NN would be just a linear function.
Otherwise, back to the problem of nonlinearly separable datasets.
24
34. Machine learning – ANN – Activation functions
Activation functions
Threshold units: for instance the sign function
$g(a) = \begin{cases} -1 & \text{if } a < 0 \\ +1 & \text{otherwise} \end{cases}$
or the Heaviside (aka step) activation function
$g(a) = \begin{cases} 0 & \text{if } a < 0 \\ 1 & \text{otherwise} \end{cases}$
Discontinuities in the hidden layers
make the optimization really difficult.
We prefer functions that are continuous and differentiable.
25
35. Machine learning – ANN – Activation functions
Activation functions
Sigmoidal units: for instance the hyperbolic tangent function
$g(a) = \tanh a = \dfrac{e^{a} - e^{-a}}{e^{a} + e^{-a}} \in [-1, 1]$
or the logistic sigmoid function
$g(a) = \dfrac{1}{1 + e^{-a}} \in [0, 1]$
• In fact equivalent up to a linear transformation: $\tanh(a/2) = 2\,\mathrm{logistic}(a) - 1$
• Differentiable approximations of the sign and step functions, respectively.
• Act as threshold units for large values of |a| and as linear units for small values.
26
36. Machine learning – ANN
Sigmoidal units: logistic activation functions are used in binary classification
(class C1 vs C2) as they can be interpreted as posterior probabilities:
y = P(C1|x) and 1 − y = P(C2|x)
The architecture of the network defines the shape of the separator...
[Figure: three decision boundaries of increasing complexity separating two classes in the unit square]
Separation: $\{x : P(C_1|x) = P(C_2|x)\}$
Complexity/capacity of the network ⇒ trade-off between generalization and overfitting.
27
37. Machine learning – ANN – Activation functions
Activation functions
“Modern” units:
$g(a) = \max(a, 0)$ (ReLU)   or   $g(a) = \log(1 + e^{a})$ (Softplus)
Most neural networks nowadays use ReLU (rectified linear unit), $\max(a, 0)$, for hidden layers, since it trains much faster, is more expressive than the logistic function and prevents the vanishing gradient problem (see later).
(Source: Lucas Masuch)
28
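For reference, a small NumPy sketch of the activation functions discussed in this chapter; the numerically stable form of softplus is my own implementation detail, not something stated on the slides.

```python
import numpy as np

def tanh(a):      return np.tanh(a)                    # in [-1, 1]
def logistic(a):  return 1.0 / (1.0 + np.exp(-a))      # in [0, 1]
def relu(a):      return np.maximum(a, 0.0)            # max(a, 0)
def softplus(a):                                       # log(1 + e^a), overflow-safe form
    return np.maximum(a, 0.0) + np.log1p(np.exp(-np.abs(a)))

a = np.linspace(-5, 5, 5)
for f in (tanh, logistic, relu, softplus):
    print(f.__name__, np.round(f(a), 3))
```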
38. Machine learning – ANN
Neural networks solve non-linearly separable problems
(Source: Vincent Lepetit) 29
39. Machine learning – UAT
Universal Approximation Theorem
(Hornik et al, 1989; Cybenko, 1989)
Any continuous function can be approximated by a feedforward shallow
network (i.e., with 1-hidden layer only) with a sufficient number of neurons in
the hidden layer.
• The theorem does not say how large the network needs to be.
• No guarantee that the training algorithm will be able to train the network.
30
41. Tasks, architectures and loss functions
Approximation – Least square regression
• Goal: Predict a real multivariate function.
• How: estimate the coefficients W of y = f(x; W )
from labeled training examples where labels are real vectors:
• Typical architecture:
• Hidden layer:
ReLU(a) = max(a, 0)
• Linear output:
g(a) = a
31
42. Tasks, architectures and loss functions
Approximation – Least square regression
• Loss: As for the polynomial curve fitting, it is standard to consider the
sum of square errors (assumption of Gaussian distributed errors)
$E(W) = \sum_{i=1}^{N} \| y^i - d^i \|_2^2 = \sum_{i=1}^{N} \| f(x^i; W) - d^i \|_2^2$
and look for $W^*$ such that $\nabla E(W^*) = 0$. Recall: SSE ≡ SSD ≡ MSE.
• Solution: Provided the network has enough flexibility and the size of the training set grows to infinity,
$y = f(x; W) = \mathbb{E}[d \mid x] = \int d \, p(d \mid x) \, \mathrm{d}d$ (posterior mean)
32
43. Tasks, architectures and loss functions
Binary classification – Logistic regression
• Goal: Classify object x into class C1 or C2.
• How: estimate the coefficients W of a real function y = f(x; W ) ∈ [0, 1]
from training examples with labels 1 (for class C1) and 0 (otherwise):
$\mathcal{T} = \{(x^i, d^i)\}_{i=1,\ldots,N}$
• Typical architecture:
• Hidden layer: ReLU(a) = max(a, 0)
• Output layer: $\mathrm{logistic}(a) = \dfrac{1}{1 + e^{-a}}$
33
44. Tasks, architectures and loss functions
Binary classification – Logistic regression
• Loss: it is standard to consider the cross-entropy for two classes (assumption of Bernoulli distributed data)
$E(W) = -\sum_{i=1}^{N} \big[ d^i \log y^i + (1 - d^i) \log(1 - y^i) \big]$ with $y^i = f(x^i; W)$
and look for $W^*$ such that $\nabla E(W^*) = 0$.
• Solution: Provided the network has enough flexibility and the size of the training set grows to infinity,
$y = f(x; W) = P(C_1 \mid x)$ (posterior probability)
34
45. Tasks, architectures and loss functions
Multiclass classification – Multivariate logistic regression
(aka, multinomial classification)
• Goal: Classify an object x into one among K classes C1, . . . , CK .
• How: estimate the coefficients W of a multivariate function
$y = f(x; W) \in [0, 1]^K$
from training examples $\mathcal{T} = \{(x^i, d^i)\}$ where $d^i$ is a 1-of-K (one-hot) code:
• Class 1: $d^i = (1, 0, \ldots, 0)^T$ if $x^i \in C_1$
• Class 2: $d^i = (0, 1, \ldots, 0)^T$ if $x^i \in C_2$
• ...
• Class K: $d^i = (0, 0, \ldots, 1)^T$ if $x^i \in C_K$
• Remark: do not use the class index k directly as a scalar label.
35
46. Tasks, architectures and loss functions
Multiclass classification – Multivariate logistic regression
• Typical architecture:
• Hidden layer:
ReLU(a) = max(a, 0)
• Output layer: $\mathrm{softmax}(a)_k = \dfrac{\exp(a_k)}{\sum_{l=1}^{K} \exp(a_l)}$
Softmax guarantees the outputs yk to be positive and sum to 1.
Generalization of the logistic sigmoid activation function.
Smooth version of winner-takes-all activation model (maxout).
(largest gets +1 others get 0).
36
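A minimal NumPy sketch of the softmax output layer; subtracting max(a) before exponentiating is a standard numerical-stability trick added here, not part of the slide.

```python
import numpy as np

def softmax(a):
    """softmax(a)_k = exp(a_k) / sum_l exp(a_l), computed stably."""
    e = np.exp(a - np.max(a))     # shifting by max(a) does not change the result
    return e / e.sum()

a = np.array([2.0, 1.0, 0.1])
y = softmax(a)
print(y, y.sum())                 # positive outputs that sum to 1
```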
47. Tasks, architectures and loss functions
Multiclass classification – Multivariate logistic regression
• Loss: it is standard to consider the cross-entropy for K classes
(assumption of multinomial distributed data)
$E(W) = -\sum_{i=1}^{N} \sum_{k=1}^{K} d^i_k \log y^i_k$ with $y^i = f(x^i; W)$
and look for $W^*$ such that $\nabla E(W^*) = 0$.
• Solution: Provided the network has enough flexibility and the size of the training set grows to infinity,
$y_k = f_k(x; W) = P(C_k \mid x)$ (posterior probability)
37
48. Machine learning – ANN
Multi-label classification:
(aka, multi-output classification)
• Goal: Classify object x into zero or several classes C1, . . . , CK .
The classes need to be non-mutually exclusive.
• How: Combination of binary-classification networks.
• Typical architecture:
• Remark: For categorical inputs, use also 1-of-K codes.
38
50. Machine learning – ANN - Backpropagation
Learning with backpropagation
(Source: Lucas Masuch & Vincent Lepetit) 39
51. Machine learning – ANN – Learning
Training process
• To train a neural network over a large set of labeled data, you must
continuously compute the difference between the network’s predicted
output and the actual output.
• This difference is measured by the loss E, and the process for training a
net is known as backpropagation, or backprop.
• During backprop, weights and biases are tweaked slightly until the lowest
possible loss is achieved.
→ Optimization: look for $\nabla E = 0$
• The gradient is an important aspect of this process, as a measure of how
much the loss changes with respect to a change in a weight or bias value.
(Source: Caner Hazırba¸s)
40
52. Machine learning – ANN – Learning
Training process
(Source: Lucas Masuch) 41
53. Machine learning – ANN – Optimization
Objective: $\min_W E(W)$ ⇒ $\nabla E(W) = \Big( \dfrac{\partial E(W)}{\partial W_1}, \ldots, \dfrac{\partial E(W)}{\partial W_L} \Big)^T = 0$
Loss functions: recall that classical loss functions are
• Square error (for regression: $d_k \in \mathbb{R}$, $y_k \in \mathbb{R}$)
$E(W) = \frac{1}{2} \sum_{(x,d) \in \mathcal{T}} \| y - d \|_2^2 = \frac{1}{2} \sum_{(x,d) \in \mathcal{T}} \sum_k (y_k - d_k)^2$
• Cross-entropy (for multi-class classification: $d_k \in \{0, 1\}$, $y_k \in [0, 1]$)
$E(W) = -\sum_{(x,d) \in \mathcal{T}} \sum_k d_k \log y_k$
Solution: no closed-form solution ⇒ use gradient descent. What is it?
42
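Both classical losses translate directly into NumPy; in this sketch `Y` and `D` hold one prediction and one one-hot target per row (the names and the small epsilon guarding the log are my own additions).

```python
import numpy as np

def square_error(Y, D):
    """E = 1/2 sum_{(x,d)} ||y - d||^2."""
    return 0.5 * np.sum((Y - D) ** 2)

def cross_entropy(Y, D, eps=1e-12):
    """E = - sum_{(x,d)} sum_k d_k log y_k (D one-hot, Y softmax outputs)."""
    return -np.sum(D * np.log(Y + eps))

Y = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])   # predicted class probabilities
D = np.array([[1, 0, 0], [0, 1, 0]])               # one-hot targets
print(square_error(Y, D), cross_entropy(Y, D))
```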
54. Machine learning – ANN – Optimization – Gradient descent
An iterative algorithm trying to find a minimum of a real function.
Gradient descent
• Let F be a real function, differentiable and lower bounded with an L-Lipschitz gradient (see next slide). Then, whatever the initialization $x^0$, if $0 < \gamma < 2/L$, the sequence
$x^{t+1} = x^t - \gamma \nabla F(x^t)$
converges to a stationary point $x^\star$ (i.e., it cancels the gradient):
$\nabla F(x^\star) = 0$.
• The parameter γ is called the step size (or learning rate in ML field).
• A too small step size γ leads to slow convergence.
43
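The iteration above can be sketched generically as follows; `grad_F`, the starting point and the step size are placeholders chosen for illustration.

```python
import numpy as np

def gradient_descent(grad_F, x0, gamma, n_iter=100):
    """x^{t+1} = x^t - gamma * grad F(x^t)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = x - gamma * grad_F(x)
    return x

# Example: F(x) = 1/2 ||x - y||^2, so grad F(x) = x - y and L = 1.
y = np.array([3.0, -1.0])
x_star = gradient_descent(lambda x: x - y, x0=np.zeros(2), gamma=0.5)
print(x_star)   # converges to y for any 0 < gamma < 2
```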
55. Machine learning – ANN – Optimization – Gradient descent
Lipschitz gradient
• A differentiable function F has an L-Lipschitz gradient if
$\| \nabla F(x_1) - \nabla F(x_2) \|_2 \le L \| x_1 - x_2 \|_2$, for all $x_1, x_2$.
• The mapping $x \mapsto \nabla F(x)$ is necessarily continuous.
• If F is twice differentiable with bounded Hessian, then
$L = \sup_x \| \underbrace{\nabla^2 F(x)}_{\text{Hessian matrix of } F} \|_2$,
where for a matrix A, its 2-norm $\|A\|_2$ is its maximal singular value.
44
56. Machine learning – ANN – Optimization – Gradient descent
Gradient descent and Lipschitz gradient
Example (1d quadratic loss)
• Consider: $F(x) = \frac{1}{2} (x - y)^2$
• We have: $F'(x) = x - y$
• And: $F''(x) = 1$
• Then: $L = \sup_x |F''(x)| = 1$
• Thus: the range for γ is $0 < \gamma < 2/L = 2$
• And: $x^{t+1} = x^t - \gamma (x^t - y)$ converges towards $x^\star$, whatever $x^0$, such that $F'(x^\star) = 0 \Rightarrow x^\star = y$
Question: what is the best value for γ?
45
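A tiny numerical check of this example (values chosen arbitrarily): for $F(x) = \frac{1}{2}(x - y)^2$ the iteration is $x \leftarrow x - \gamma (x - y)$, and $\gamma = 1$ ($= 1/L$ here) reaches the minimum y in a single step, which is presumably the intended answer to the question above.

```python
# Gradient descent on F(x) = 1/2 (x - y)^2, i.e. x <- x - gamma * (x - y)
y, x0 = 2.0, 10.0
for gamma in (0.1, 0.5, 1.0, 1.9):
    x = x0
    for _ in range(20):
        x = x - gamma * (x - y)
    print(f"gamma={gamma}: x={x:.6f}")   # all converge to y=2; gamma=1 is exact after one step
```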
57. Machine learning – ANN – Optimization – Gradient descent
Gradient descent and Lipschitz gradient
Example (1d quartic loss)
• Consider: $F(x) = \frac{1}{4} (x - y)^4$
• We have: $F'(x) = (x - y)^3$
• And: $F''(x) = 3 (x - y)^2$
• Then: $L = \sup_x |F''(x)| = \infty$
• Thus: there is no γ satisfying $0 < \gamma < 2/L$
• And: $x^{t+1} = x^t - \gamma (x^t - y)^3$, for $\gamma > 0$, may diverge, oscillate forever or converge to the solution y.
Convergence can be obtained for some specific choices of γ and initialization $x^0$.
46
58. Machine learning – ANN – Optimization – Gradient descent
These two curves cross at $x^\star$ such that $\nabla F(x^\star) = 0$
47
59. Machine learning – ANN – Optimization – Gradient descent
Here γ is small: slow convergence
47
60. Machine learning – ANN – Optimization – Gradient descent
γ a bit larger: faster convergence
47
61. Machine learning – ANN – Optimization – Gradient descent
γ ≈ 1/L even larger: around fastest convergence
47
62. Machine learning – ANN – Optimization – Gradient descent
γ a bit too large: convergence slows down
47
63. Machine learning – ANN – Optimization – Gradient descent
γ too large: convergence too slow again
47
65. Machine learning – ANN – Optimization – Gradient descent
Gradient descent for convex function
• If moreover F is convex,
$F(\lambda x_1 + (1 - \lambda) x_2) \le \lambda F(x_1) + (1 - \lambda) F(x_2)$, $\forall x_1, x_2$, $\lambda \in (0, 1)$,
then gradient descent converges towards a global minimum
$x^\star \in \operatorname{argmin}_x F(x)$.
• For $0 < \gamma < 2/L$, the sequence $|F(x^k) - F(x^\star)|$ decays in $O(1/k)$.
• NB: all stationary points are global minima (not necessarily unique).
49
67. Machine learning – ANN – Optimization – Gradient descent
Influence of the step parameter (learning rate)
51
68. Machine learning – ANN – Optimization
Let's start with a single artificial neuron
Example ((Batch) Perceptron algorithm)
• Model: $y = \operatorname{sign} \langle w, x \rangle$
• Loss: $E(w) = -\sum_{(x,d) \in \mathcal{T} \text{ s.t. } y \neq d} d \times \langle w, x \rangle$
• Gradient: $\nabla E(w) = -\sum_{(x,d) \in \mathcal{T} \text{ s.t. } y \neq d} d \times x$
• Gradient descent: $w^{t+1} \leftarrow w^t + \gamma \sum_{(x,d) \in \mathcal{T} \text{ s.t. } y^t \neq d} d \times x$
Convex or non-convex?
52
69. Machine learning – ANN – Optimization
Let's start with a single artificial neuron
Example ((Batch) ADALINE)
• Model: $y = \langle w, x \rangle$
• Loss: $E(w) = \frac{1}{2} \sum_{(x,d) \in \mathcal{T}} \underbrace{(\langle w, x \rangle - d)^2}_{= d^2 + w^T x x^T w - 2 d w^T x}$
• Gradient: $\nabla E(w) = \sum_{(x,d) \in \mathcal{T}} x (\underbrace{x^T w}_{y} - d)$
• Gradient descent: $w^{t+1} \leftarrow w^t + \gamma \sum_{(x,d) \in \mathcal{T}} (d - y^t) x$
Convex or non-convex?
53
70. Machine learning – ANN – Optimization
Let's start with a single artificial neuron
Example ((Batch) ADALINE)
• Model: $y = \langle w, x \rangle$
• Loss: $E(w) = \frac{1}{2} \sum_{(x,d) \in \mathcal{T}} \underbrace{(\langle w, x \rangle - d)^2}_{= d^2 + w^T x x^T w - 2 d w^T x}$
• Gradient: $\nabla E(w) = \sum_{(x,d) \in \mathcal{T}} x (\underbrace{x^T w}_{y} - d)$
• Gradient descent: $w^{t+1} \leftarrow w^t + \gamma \sum_{(x,d) \in \mathcal{T}} (d - y^t) x$
Convex or non-convex?
If enough training samples: $\lim_{t \to \infty} w^t = \Big( \underbrace{\sum_{(x,d) \in \mathcal{T}} x x^T}_{\text{Hessian}} \Big)^{-1} \sum_{(x,d) \in \mathcal{T}} d \, x$
53
71. Machine learning – ANN – Optimization
Back to our optimization problem
In our case $W \mapsto E(W)$ is non-convex ⇒ no guarantee of convergence.
Convergence will depend on
• the initialization,
• the step size γ.
Because of this:
• Normalizing each data point x in the range [−1, +1] is important to
control for the Hessian → the stability and the speed of the algorithm,
• The activation functions and the loss should be chosen to have a second
derivative smaller than 1 (when combined),
• The initialization should be random with well chosen variance
(we will go back to this later),
• If all of these are satisfied, we can generally choose γ ∈ [.001, 1].
54
72. Machine learning – ANN – Optimization
Back to our optimization problem
In our case $W \mapsto E(W)$ is non-convex ⇒ no guarantee of convergence.
Even if so, the limit solution depends on:
• the initialization,
• the step size γ.
Nevertheless, really good minima or saddle points are reached in practice by
$W^{t+1} \leftarrow W^t - \gamma \nabla E(W^t)$, $\gamma > 0$
Gradient descent can be expressed coordinate by coordinate as:
$w^{t+1}_{i,j} \leftarrow w^t_{i,j} - \gamma \dfrac{\partial E(W^t)}{\partial w^t_{i,j}}$
for all weights $w_{i,j}$ linking a node j to a node i in the next layer.
⇒ The algorithm to compute $\dfrac{\partial E(W)}{\partial w_{i,j}}$ for ANNs is called backpropagation.
55
73. Machine learning – ANN – Optimization
Backpropagation: computation of ∂E(W)/∂w_{i,j}
Feed-forward least-squares regression context
• Model: feed-forward neural network (for simplicity, without biases).
• Loss function: E(W) = (1/2) Σ_{(x,d)∈T} Σ_k (y_k − d_k)^2
We have: E(W) = Σ_{(x,d)∈T} Σ_k (1/2)(y_k − d_k)^2, and we write e_k = (1/2)(y_k − d_k)^2.
Apply linearity:
∂E(W)/∂w_{i,j} = Σ_{(x,d)∈T} Σ_k ∂e_k/∂w_{i,j}
56
74. Machine learning – ANN – Optimization
1. Case where w_{i,j} is a synaptic weight of the output layer
• j: neuron in the previous hidden layer
• h_j: response of hidden neuron j
• w_{i,j}: synaptic weight between j and i
• y_i: response of output neuron i
y_i = g(a_i) with a_i = Σ_j w_{i,j} h_j
Apply the chain rule:
∂E(W)/∂w_{i,j} = Σ_{(x,d)∈T} Σ_k ∂e_k/∂w_{i,j} = Σ_{(x,d)∈T} Σ_k (∂e_k/∂y_i)(∂y_i/∂a_i)(∂a_i/∂w_{i,j})
57
75. Machine learning – ANN – Optimization
1. Case where w_{i,j} is a synaptic weight of the output layer
e_k = (1/2)(y_k − d_k)^2 ⇒ ∂e_k/∂y_i = y_i − d_i if k = i, 0 otherwise
y_i = g(a_i) ⇒ ∂y_i/∂a_i = g′(a_i)
a_i = Σ_j w_{i,j} h_j ⇒ ∂a_i/∂w_{i,j} = h_j
⇒ ∂E(W)/∂w_{i,j} = Σ_{(x,d)∈T} Σ_k (∂e_k/∂y_i)(∂y_i/∂a_i)(∂a_i/∂w_{i,j})
= Σ_{(x,d)∈T} (y_i − d_i) g′(a_i) h_j
= Σ_{(x,d)∈T} δ_i h_j where δ_i = Σ_k ∂e_k/∂a_i = (y_i − d_i) g′(a_i)
58
76. Machine learning – ANN – Optimization
2. Case where w_{i,j} is a synaptic weight of a hidden layer
• j: neuron in the previous hidden layer
• h_j: response of hidden neuron j
• w_{i,j}: synaptic weight between j and i
• h_i: response of hidden neuron i
h_i = g(a_i) with a_i = Σ_j w_{i,j} h_j
Apply the chain rule (the sum over l runs over the neurons of the layer after i):
∂E(W)/∂w_{i,j} = Σ_{(x,d)∈T} Σ_k ∂e_k/∂w_{i,j}
= Σ_{(x,d)∈T} Σ_k (∂e_k/∂h_i)(∂h_i/∂a_i)(∂a_i/∂w_{i,j})
= Σ_{(x,d)∈T} Σ_k Σ_l (∂e_k/∂a_l)(∂a_l/∂h_i)(∂h_i/∂a_i)(∂a_i/∂w_{i,j})
= Σ_{(x,d)∈T} Σ_l (Σ_k ∂e_k/∂a_l)(∂a_l/∂h_i)(∂h_i/∂a_i)(∂a_i/∂w_{i,j}), where δ_l = Σ_k ∂e_k/∂a_l
59
77. Machine learning – ANN – Optimization
2. Case where w_{i,j} is a synaptic weight of a hidden layer
a_l = Σ_i w_{l,i} h_i ⇒ ∂a_l/∂h_i = w_{l,i}
h_i = g(a_i) ⇒ ∂h_i/∂a_i = g′(a_i)
a_i = Σ_j w_{i,j} h_j ⇒ ∂a_i/∂w_{i,j} = h_j
⇒ ∂E(W)/∂w_{i,j} = Σ_{(x,d)∈T} Σ_l δ_l (∂a_l/∂h_i)(∂h_i/∂a_i)(∂a_i/∂w_{i,j})
= Σ_{(x,d)∈T} (Σ_l w_{l,i} δ_l) g′(a_i) h_j
= Σ_{(x,d)∈T} δ_i h_j where δ_i = g′(a_i) Σ_l w_{l,i} δ_l
60
78. Machine learning – ANN – Optimization
Backpropagation algorithm
(Werbos, 1974 & Rumelhart, Hinton and Williams, 1986)
∂E(W)/∂w_{i,j} = Σ_{(x,d)∈T} δ_i h_j, where h_j = x_j if j is an input node,
and where δ_i = g′(a_i) × (y_i − d_i) if i is an output node, δ_i = g′(a_i) × Σ_l w_{l,i} δ_l otherwise.
For each input x and desired output d:
• Forward step:
→ compute the responses (h_j, a_i and y_i) of all neurons,
→ start from the first hidden layer and proceed towards the output one.
• Backward step:
→ back-propagate the errors (δ_i) from the output layer to the first layer.
Update w_{i,j} ← w_{i,j} − γ δ_i h_j, and repeat everything until convergence.
61
79. Machine learning – ANN – Optimization
Backpropagation algorithm with matrix-vector form
Easier to use matrix-vector notations for each layer:
(k denotes the layer)
∇_{W_k} E(W) = δ_k h_{k−1}ᵀ where h_0 = x
and δ_k = g′(a_k) ⊙ (y − d) if k is the output layer, δ_k = g′(a_k) ⊙ (W_{k+1}ᵀ δ_{k+1}) otherwise
(⊙ denotes the element-wise product).
• x: matrix with all training input vectors as columns,
• d: matrix with the corresponding desired target vectors as columns,
• y: matrix with all predictions as columns,
• a_k = W_k h_{k−1}: matrix with all weighted sums as columns,
• h_k = g(a_k): matrix with all hidden outputs as columns,
• W_k: matrix of weights at layer k.
62
98. Machine learning – ANN – Optimization
Case where g is the logistic sigmoid function
• Recall that the logistic sigmoid function is given by:
g(a) = 1 / (1 + e^{−a}) = 1 / u(a), where u(a) = 1 + e^{−a}
• We have:
u′(a) = −e^{−a}
• Then, we get
g′(a) = −u′(a) / u(a)^2
= e^{−a} / (1 + e^{−a})^2
= (1 + e^{−a}) / (1 + e^{−a})^2 − 1 / (1 + e^{−a})^2
= 1 / (1 + e^{−a}) − 1 / (1 + e^{−a})^2
= [1 / (1 + e^{−a})] [1 − 1 / (1 + e^{−a})]
= g(a) (1 − g(a))
64
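A quick numerical sanity check of the identity g′(a) = g(a)(1 − g(a)) (a minimal sketch, not from the slides), using a centered finite difference:

import numpy as np

# Minimal sketch: compare the analytic derivative of the logistic sigmoid
# with a centered finite-difference approximation.
g = lambda a: 1.0 / (1.0 + np.exp(-a))
a = np.linspace(-5, 5, 11)
eps = 1e-6
numeric = (g(a + eps) - g(a - eps)) / (2 * eps)
print(np.allclose(numeric, g(a) * (1 - g(a)), atol=1e-8))  # True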
99. Machine learning – ANN – Optimization
What if g is non-differentiable?
For instance: g(a) = ReLU(a) = max(a, 0)
• ReLU is continuous and almost everywhere differentiable:
∂max(a, 0)/∂a = 1 if a > 0, 0 if a < 0, undefined if a = 0
• Gradient descent can handle this case with a simple modification:
∂max(a, 0)/∂a := 1 if a > 0, 0 if a ≤ 0
• When g is convex, such a choice is called a sub-gradient, and gradient descent is then called sub-gradient descent.
65
100. Machine learning – ANN – Optimization
Let’s try to learn the x-or function
import numpy as np
# Training set
x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]]).T
d = np.array([0, +1, +1, 0])
# Initialization for a 2-layer feedforward network
b1 = np.random.rand(2, 1)
W1 = np.random.rand(2, 2)
b2 = np.random.rand(1, 1)
W2 = np.random.rand(1, 2)
# Activation functions and their derivatives
def g1(a): return a * (a > 0)   # ReLU
def g1p(a): return 1 * (a > 0)
def g2(a): return a             # Linear
def g2p(a): return 1
66
101. Machine learning – ANN – Optimization
Let’s try to learn the x-or function
import numpy as np
# Training set
x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]]).T
d = np.array([0, +1, +1, 0])
# Initialization for a 2-layer feedforward network
b1 = np.random.rand(2, 1)
W1 = np.random.rand(2, 2)
b2 = np.random.rand(1, 1)
W2 = np.random.rand(1, 2)
# Activation functions and their derivatives
def g1(a): return a * (a > 0)             # ReLU
def g1p(a): return 1 * (a > 0)
def g2(a): return 1 / (1 + np.exp(-a))    # Logistic
def g2p(a): return g2(a) * (1 - g2(a))
66
102. Machine learning – ANN – Optimization
Let’s try to learn the x-or function
gamma = .01  # step parameter (learning rate)
for t in range(0, 10000):
    # Forward phase
    a1 = W1.dot(x)
    h1 = g1(a1)
    a2 = W2.dot(h1)
    y = g2(a2)
    # Error evaluation
    e = y - d
    # Backward phase
    delta2 = g2p(a2) * e
    delta1 = g1p(a1) * W2.T.dot(delta2)
    # Gradient update
    W2 = W2 - gamma * delta2.dot(h1.T)
    W1 = W1 - gamma * delta1.dot(x.T)
67
103. Machine learning – ANN – Optimization
Let’s try to learn the x-or function
Remark 1: dealing with the
bias is similar
gamma = .01  # step parameter
for t in range(0, 10000):
    # Forward phase
    a1 = W1.dot(x) + b1
    h1 = g1(a1)
    a2 = W2.dot(h1) + b2
    y = g2(a2)
    # Error evaluation
    e = y - d
    # Backward phase
    delta2 = g2p(a2) * e
    delta1 = g1p(a1) * W2.T.dot(delta2)
    # Gradient update
    W2 = W2 - gamma * delta2.dot(h1.T)
    W1 = W1 - gamma * delta1.dot(x.T)
    b2 = b2 - gamma * delta2.sum(axis=1, keepdims=True)
    b1 = b1 - gamma * delta1.sum(axis=1, keepdims=True)
67
104. Machine learning – ANN – Optimization
Let’s try to learn the x-or function
Remark 1: dealing with the
bias is similar
Remark 2: using cross-entropy
is simple too
gamma = .01  # step parameter
for t in range(0, 10000):
    # Forward phase
    a1 = W1.dot(x) + b1
    h1 = g1(a1)
    a2 = W2.dot(h1) + b2
    y = g2(a2)
    # Error evaluation
    e = -d / y + (1 - d) / (1 - y)
    # Backward phase
    delta2 = g2p(a2) * e
    delta1 = g1p(a1) * W2.T.dot(delta2)
    # Gradient update
    W2 = W2 - gamma * delta2.dot(h1.T)
    W1 = W1 - gamma * delta1.dot(x.T)
    b2 = b2 - gamma * delta2.sum(axis=1, keepdims=True)
    b1 = b1 - gamma * delta1.sum(axis=1, keepdims=True)
67
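After training, a minimal check (not from the slides) prints the network’s outputs on the four xor inputs; this assumes the logistic-output variant above has been run, so the predictions can be thresholded at 0.5.

# Minimal check, reusing g1, g2, W1, b1, W2, b2, x and d from the code above.
h1 = g1(W1.dot(x) + b1)
y = g2(W2.dot(h1) + b2)
print(np.round(y, 2))   # predicted values for (0,0), (0,1), (1,0), (1,1)
print((y > 0.5) * 1)    # thresholded predictions, to compare with d = 0, 1, 1, 0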
105–132. Machine learning – ANN – Optimization
Let’s see how it works
[Figure sequence: decision map of the trained network over the unit square (axes x1 and x2, output values in [0, 1]), shown at iterations 1, 11, 21, 31, 41, 51, 61, 71, 81, 91, 101, 201, 301, 401, 501, 601, 701, 801, 901, 1001, 2001, 3001, 4001, 5001, 6001, 7001, 8001 and 9001.]
68
134. Machine learning – ANN – Optimization
Comparisons – different initializations & activations
Failures (Local minima or saddle points)
[Figure panels: Linear output / Logistic output]
70
135. Machine learning – ANN – Optimization
The 1990s view of ANN and back-propagation
Advantages
• Universal approximators,
• May be very accurate.
Drawbacks
• Black box models (very difficult to interpret).
• The learning time does not scale well
• it was very slow in networks with multiple hidden layers.
• It got stuck at local optima or saddle points
• these can nevertheless often be surprisingly good.
• Not all global minima have the same quality (generalization error)
• strong variations in prediction errors across different test datasets.
71
137. Machine learning – Support Vector Machine
Support Vector Machine
(Source: Lucas Masuch & Vincent Lepetit) 72
138. Machine learning – Support Vector Machine
Support Vector Machine
(Vapnik, 1995)
• A very successful method in the mid-90s.
• Developed for binary classification.
• A clever type of perceptron.
• Based on two major ideas:
1. large margin,
2. kernel trick.
Many NNs researchers switched to SVMs in the 1990s
because they (used to) work better.
73
139. Machine learning – Support Vector Machine
Shattering points with oriented hyperplanes
• Goal: Build hyperplanes that separate points in two classes,
• Question: Which is the best separating line?
Remember, a hyperplane is defined by ⟨w, x⟩ + b = 0
74
140. Machine learning – Support Vector Machine
Classification margin
• Signed distance from an example to the separator: r = (⟨x, w⟩ + b) / ‖w‖
• Examples closest to the hyperplane are support vectors.
• Margin ρ is the distance between the separator and the support vector(s).
• Is this one good?
75
141. Machine learning – Support Vector Machine
Classification margin
• Signed distance from an example to the separator: r = (⟨x, w⟩ + b) / ‖w‖
• Examples closest to the hyperplane are support vectors.
• Margin ρ is the distance between the separator and the support vector(s).
• Is this one good?
• Consider two testing samples,
where are they assigned?
75
143. Machine learning – Support Vector Machine
Classification margin
• Signed distance from an example to the separator: r = (⟨x, w⟩ + b) / ‖w‖
• Examples closest to the hyperplane are support vectors.
• Margin ρ is the distance between the separator and the support vector(s).
• Is this one good?
• Consider two testing samples,
where are they assigned?
• This separator may have large
generalization error.
75
144. Machine learning – Support Vector Machine
Largest margin = Support Vector Machine
• Maximizing the margin reduces the generalization error (see PAC learning).
• Then, there are necessarily support vectors on both sides of the hyperplane.
• Only support vectors are important ⇒ other examples can be discarded.
[Figure panels: Poor separation / Optimal separation]
76
145. Machine learning – Support Vector Machine
Training
• As for neural networks, the parameters w and b are obtained by optimizing
a loss function (the margin).
• What is the formula for the margin?
• Training dataset: feature vectors x_i, target classes d_i ∈ {−1, +1}
• Assume the training samples to be linearly separable, i.e., there exist w and b such that
⟨x_i, w⟩ + b ≥ +1 if d_i = +1
⟨x_i, w⟩ + b ≤ −1 if d_i = −1
that is, d_i (⟨x_i, w⟩ + b) ≥ 1
Then: |⟨x_i, w⟩ + b| ≥ 1
77
146. Machine learning – Support Vector Machine
Training
• For support vectors, the inequality can be forced to be an equality
|⟨x_i, w⟩ + b| = 1
• Then, the distance between any support vector and the hyperplane is
|r| = |⟨x_i, w⟩ + b| / ‖w‖ = 1 / ‖w‖
• So, the margin is (by definition)
ρ = |r| = 1 / ‖w‖
• Maximizing the margin is then equivalent to minimizing ‖w‖^2, provided the hyperplane (w, b) separates the data.
78
147. Machine learning – Support Vector Machine
Training
• Therefore, the problem can be recast as
min_{w,b} (1/2)‖w‖^2 subject to d_i (⟨x_i, w⟩ + b) ≥ 1, for all i.
⇒ Quadratic (convex) optimization problem subject to linear constraints,
⇒ No local minima! Only a single global one.
Equivalent formulation:
min_{w,b} E(w, b)
where E(w, b) = (1/2)‖w‖^2 if d_i (⟨x_i, w⟩ + b) ≥ 1 for all i, and +∞ otherwise.
79
148. Machine learning – Support Vector Machine
Training
min_{w,b} E(w, b)
where E(w, b) = (1/2)‖w‖^2 if d_i (⟨x_i, w⟩ + b) ≥ 1 for all i, and +∞ otherwise.
• We can use the technique of Lagrange multipliers:
1. Introduce a Lagrange multiplier α_i ≥ 0 for each (inequality) constraint.
2. Let α = (α_1, α_2, . . .)ᵀ be the vector of Lagrange multipliers.
3. Define the (primal) function as
L_P(w, b, α) = (1/2)‖w‖^2 − Σ_i α_i (d_i (⟨x_i, w⟩ + b) − 1)
4. Solve the saddle point problem
E(w, b) = max_α L_P(w, b, α) ⇒ min_{w,b} E(w, b) = min_{w,b} max_α L_P(w, b, α)
80
149. Machine learning – Support Vector Machine
Training
• When the original objective function is convex (and only then), we can interchange the minimization and maximization:
min_{w,b} max_α L_P(w, b, α) = max_α min_{w,b} L_P(w, b, α)
• The following is called the dual problem
L_D(α) = min_{w,b} L_P(w, b, α) = min_{w,b} (1/2)⟨w, w⟩ − Σ_i α_i (d_i (⟨x_i, w⟩ + b) − 1)
• For a fixed α, the minimum is achieved by simultaneously canceling the gradients with respect to w and b:
∇_w L_P(w, b, α) = 0 ⇒ w − Σ_i α_i d_i x_i = 0 ⇒ w = Σ_i α_i d_i x_i
and ∂L_P(w, b, α)/∂b = 0 ⇒ Σ_i α_i d_i = 0
81
150. Machine learning – Support Vector Machine
Training
The dual is obtained by plugging these equations into L_P(w, b, α):
L_D(α) = (1/2)⟨w, w⟩ − Σ_i α_i (d_i (⟨x_i, w⟩ + b) − 1)
= (1/2) ⟨Σ_i α_i d_i x_i, Σ_j α_j d_j x_j⟩ − Σ_i α_i d_i (⟨x_i, Σ_j α_j d_j x_j⟩ + b) + Σ_i α_i
= (1/2) Σ_i Σ_j α_i α_j d_i d_j ⟨x_i, x_j⟩ − Σ_i Σ_j α_i α_j d_i d_j ⟨x_i, x_j⟩ − b Σ_i α_i d_i (= 0) + Σ_i α_i
= Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j d_i d_j ⟨x_i, x_j⟩
82
151. Machine learning – Support Vector Machine
SVM training algorithm
1. Maximize the dual (e.g., using projected coordinate ascent):
max_α L_D(α) = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j d_i d_j ⟨x_i, x_j⟩
subject to α_i ≥ 0 and Σ_i α_i d_i = 0, for all i.
2. Each non-zero α_i indicates that the corresponding x_i is a support vector.
3. Deduce the parameters from these support vectors (s.v.):
w = Σ_{i (s.v.)} α_i d_i x_i and b = d_i − Σ_{j (s.v.)} α_j d_j ⟨x_i, x_j⟩ for any s.v. i
• The dual is in practice more efficient to solve than the original problem, and it allows identifying the support vectors.
• It only depends on the dot products ⟨x_i, x_j⟩ between all training points.
83
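As an illustration of steps 2 and 3 (a sketch, not from the slides), scikit-learn’s SVC exposes the support vectors and the products α_i d_i (attribute dual_coef_), from which w can be rebuilt; the toy data below are an assumption, chosen to be linearly separable.

import numpy as np
from sklearn.svm import SVC

# Minimal sketch: fit a linear SVM and rebuild w = sum_i alpha_i d_i x_i
# from the support vectors, as in step 3 above.
x = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 1.0],          # class +1
              [-1.0, -1.0], [-2.0, -0.5], [-1.5, -2.0]])   # class -1
d = np.array([+1, +1, +1, -1, -1, -1])

svm = SVC(kernel="linear", C=1e6)   # very large C: (almost) hard margin
svm.fit(x, d)

w = svm.dual_coef_.dot(svm.support_vectors_).ravel()   # dual_coef_ holds alpha_i * d_i
print(np.allclose(w, svm.coef_.ravel()))                # True: same hyperplane normal
print(svm.support_vectors_)                             # the points with non-zero alpha_i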
152. Machine learning – Support Vector Machine
SVM classification algorithm
• Given the parameters of the hyperplane w and b, the SVM classifies a new sample x by testing
y = ⟨w, x⟩ + b ≥ 0
(same as for the perceptron).
• But (very important), this can be reformulated as
y = Σ_{i (s.v.)} α_i d_i ⟨x, x_i⟩ + b
which only depends on the dot products ⟨x, x_i⟩ between x and the support vectors – we will return to this later.
84
153. Machine learning – Support Vector Machine
Non-separable case
What if the training set is not linearly separable?
The optimization problem does not admit solutions.
85
154. Machine learning – Support Vector Machine
Non-separable case
What if the training set is not linearly separable?
The optimization problem does not admit solutions.
Solution: allow the system to make errors ε_i.
85
155. Machine learning – Support Vector Machine
Non-separable case – Soft-margin SVM
• Just relax the constraints by permitting errors to some extent
min_{w,b,ε} (1/2)‖w‖^2 + C Σ_i ε_i
subject to d_i (⟨x_i, w⟩ + b) ≥ 1 − ε_i and ε_i ≥ 0, for all i.
• The quantities ε_i are called slack variables.
• The parameter C > 0 is chosen by the user and controls overfitting.
• The dual problem becomes
max_α L_D(α) = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j d_i d_j ⟨x_i, x_j⟩
subject to 0 ≤ α_i ≤ C and Σ_i α_i d_i = 0, for all i,
and does not depend on the slack variables ε_i.
86
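A minimal sketch (not from the slides) of the role of C, on assumed toy data where one point is mislabeled so that no error-free separator exists: a small C tolerates some slack, while a very large C behaves almost like the hard-margin problem.

import numpy as np
from sklearn.svm import SVC

# Minimal sketch: soft-margin SVM with one mislabeled point.
# (2.0, 1.2) is labeled -1 but lies amid the +1 points, so no line separates the data.
x = np.array([[1.0, 1.0], [2.0, 1.5], [3.0, 1.0], [2.0, 1.2],
              [-1.0, -1.0], [-2.0, -0.5], [-1.5, -2.0]])
d = np.array([+1, +1, +1, -1, -1, -1, -1])

for C in (0.1, 1e6):
    svm = SVC(kernel="linear", C=C).fit(x, d)
    # Compare the learned hyperplane and the number of support vectors per class.
    print(C, svm.coef_.ravel(), svm.intercept_, svm.n_support_)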
156. Machine learning – Support Vector Machine
Linear SVM – Overview
• SVM finds the separating hyperplane maximizing the margin.
• Soft-margin SVM: trade-off between large margin and errors.
• These optimal hyperplanes are only defined in terms of support vectors.
• The Lagrangian formulation allows identifying the support vectors: those with non-zero Lagrange multipliers α_i.
• The training and classification algorithms both depend only on dot
products between data points.
So far, the SVM is a linear separator (the best in some sense).
But, it cannot solve non-linear problems (such as xor),
as done by multi-layer neural networks.
87
157. Machine learning – Support Vector Machine
Non-linear SVM – Overview
What happens if the separator is non-linear?
88
158. Machine learning – Support Vector Machine
Non-linear SVM – Map to higher dimensions
Here: (x_1, x_2) → (x_1^2, x_2^2, √2 x_1 x_2)
The input space (original feature space) can always be mapped to some
higher-dimensional feature space where the training data set is linearly
separable, via some (non-linear) transformation x → ϕ(x).
89
159. Machine learning – Support Vector Machine
Non-linear SVM – Map to higher dimensions
Replace all occurrences of x by ϕ(x): for the training step
L_D(α) = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j d_i d_j ⟨ϕ(x_i), ϕ(x_j)⟩
and for the classification step
y = Σ_{i (s.v.)} α_i d_i ⟨ϕ(x), ϕ(x_i)⟩ + b
• This may require a feature space of really high (even infinite) dimension.
• In this case, manipulating ϕ(x) might be tough, impossible, or lead to intense computation: the complexity depends on the feature space dimension.
90
160. Machine learning – Support Vector Machine
Non-linear SVM – Kernel trick
• SVMs do not care about the feature vectors ϕ(x).
• They rely only on dot products between them. Define
K(x, x′) = ⟨ϕ(x), ϕ(x′)⟩
• K is called a kernel function: a dot product in a higher-dimensional space.
• Even if ϕ(x) is tough to manipulate, K(x, x′) can be rather simple.
• Mercer’s theorem: if K is continuous and symmetric positive semi-definite (i.e., the matrix K = (K(x_i, x_j))_{i,j} is symmetric positive semi-definite for all finite sequences x_1, . . . , x_n), then there exists a mapping ϕ such that K(x, x′) = ⟨ϕ(x), ϕ(x′)⟩.
• We do not even need to know ϕ: pick a continuous symmetric positive semi-definite kernel K and use it instead of dot products.
91
161. Machine learning – Support Vector Machine
Non-linear SVM – Kernel trick
• Replace all occurrences of ⟨x, x′⟩ by K(x, x′): for the training step
L_D(α) = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j d_i d_j K(x_i, x_j)
and for the classification step
y = Σ_{i (s.v.)} α_i d_i K(x, x_i) + b
• Python implementation available in scikit-learn.
The complexity depends on the input space dimension.
92
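As a minimal scikit-learn sketch (not from the slides), an RBF-kernel SVM solves the xor problem that no linear separator can handle; the kernel parameters below are illustrative assumptions.

import numpy as np
from sklearn.svm import SVC

# Minimal sketch: kernel SVM on xor. The RBF kernel K(x, x') = exp(-gamma ||x - x'||^2)
# implicitly maps the four points to a space where they become linearly separable.
x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([-1, +1, +1, -1])

svm = SVC(kernel="rbf", gamma=2.0, C=10.0).fit(x, d)
print(svm.predict(x))   # expected: [-1  1  1 -1], i.e., xor is learned
print(svm.n_support_)   # number of support vectors per class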
162. Machine learning – Support Vector Machine
Non-linear SVM – Standard kernels
• Linear: K(x, x′) = ⟨x, x′⟩
• Polynomial: K(x, x′) = (γ⟨x, x′⟩ + β)^p
• Gaussian: K(x, x′) = exp(−γ‖x − x′‖^2) −→ radial basis function (RBF) network.
• Sigmoid: K(x, x′) = tanh(γ⟨x, x′⟩ + β)
In practice: K is usually chosen by trial and error.
A linear SVM is an optimal perceptron, but what is a non-linear SVM?
93
163. Machine learning – Support Vector Machine
Non-linear SVMs are shallow ANNs
Consider: K(x, x′) = tanh(γ⟨x, x′⟩ + β)
We get:
y = Σ_{i (s.v.)} α_i d_i K(x, x_i) + b ← SVM
y = Σ_{i (s.v.)} α_i d_i tanh(γ⟨x, x_i⟩ + β) + b
y = ⟨w_2, g(W_1 x + b_1)⟩ + b_2 ← Shallow ANN
where W_1 = γ [x_1 x_2 . . .]ᵀ (the support vectors stacked as rows), b_1 = β 1 (a constant vector),
w_2 = (α_1 d_1, α_2 d_2, . . .) (Lagrange multipliers and desired classes of the support vectors), b_2 = b, and g = tanh.
The number of support vectors is the number of hidden units.
94
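This equivalence can be checked numerically (a sketch under assumptions, not from the slides): build the one-hidden-layer network from a sigmoid-kernel SVC fitted by scikit-learn, then compare it with the library’s own decision function; the data and kernel parameters are illustrative.

import numpy as np
from sklearn.svm import SVC

# Minimal sketch: a sigmoid-kernel SVM rewritten as y = w2 . tanh(W1 x + b1) + b2,
# with one hidden unit per support vector.
rng = np.random.default_rng(0)
x = rng.normal(size=(40, 2))
d = np.sign(x[:, 0] * x[:, 1])          # xor-like labels

gamma, beta = 0.5, 0.25
svm = SVC(kernel="sigmoid", gamma=gamma, coef0=beta, C=1.0).fit(x, d)

W1 = gamma * svm.support_vectors_        # one row per support vector
b1 = beta
w2 = svm.dual_coef_.ravel()              # alpha_i * d_i
b2 = svm.intercept_[0]

x_test = rng.normal(size=(5, 2))
ann = np.tanh(x_test.dot(W1.T) + b1).dot(w2) + b2
print(np.allclose(ann, svm.decision_function(x_test)))   # True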
164. Machine learning – Support Vector Machine
Difference between SVMs and ANNs
• SVM:
• based on stronger mathematical theory,
• avoid over-fitting,
• not trapped in local-minima,
• works well with fewer training samples.
• But limited to binary classification
→ extensions to regression in (Vapnik et al., 1997)
and to PCA in (Schölkopf et al., 1999).
• Tuning SVMs is tough:
→ selecting a specific kernel (feature space) and the
regularization parameter C is done by trial and error.
95
165. Machine learning – Support Vector Machine
Similarity between SVMs and ANNs
(in binary classification)
• Similar formalisms:
y = ⟨w, ϕ(x)⟩ + b ≥ 0 ?
SVM
• ϕ(x): feature vector,
• ϕ: non-linear function implicitly
defined by the kernel K,
• UAT: ϕ could be approximated by
a shallow NN.
ANN
• ϕ(x): output of the last hidden layer,
• ϕ: non-linear recursive function
learned by backpropagation,
• the output of hidden layers can be
interpreted as feature vectors.
96
166. Machine learning – Support Vector Machine
Similarity between SVMs and ANNs
(in binary classification)
• Both are linear separators in a suitable feature space,
• Mapping to the feature space is obtained by a non-linear transform,
• For SVM, the non-linear transform is chosen by the user.
Instead, ANNs learn it with backpropagation.
• ANNs can also do it recursively layer by layer:
→ this is called a feature hierarchy,
→ this is the foundation of deep learning.
97
167. Machine learning – Deep learning
What’s next?
(Source: Lucas Masuch & Vincent Lepetit) 98
168. Questions?
Next class: Introduction to deep learning
Slides and images courtesy of
• P. Gallinari
• V. Lepetit
• A. O. i Puig
• L. Masuch
• A. W. Moore
• C. Hazırbaş
• M. Welling
98