Overview of function optimization in general and in deep learning. The slides range from basic algorithms such as batch gradient descent and stochastic gradient descent to state-of-the-art algorithms such as Momentum, Adagrad, RMSprop, and Adam.
Talk on Optimization for Deep Learning, which gives an overview of gradient descent optimization algorithms and highlights some current research directions.
Methods of Optimization in Machine Learning (Knoldus Inc.)
In this session we will discuss various methods to optimise a machine learning model and how we can adjust the hyper-parameters to minimise the cost function.
Gradient descent optimization with simple examples. Covers SGD, mini-batch, momentum, Adagrad, RMSprop and Adam.
Made for people with little knowledge of neural networks.
Presentation in Vietnam Japan AI Community in 2019-05-26.
The presentation summarizes what I've learned about Regularization in Deep Learning.
Disclaimer: The presentation is given in a community event, so it wasn't thoroughly reviewed or revised.
The word "stochastic" describes a system or process linked with random probability. Hence, in Stochastic Gradient Descent, a few samples are selected randomly for each iteration instead of the whole dataset. In Gradient Descent, the term "batch" denotes the total number of samples from the dataset that are used to calculate the gradient in each iteration. In typical Gradient Descent optimization, such as Batch Gradient Descent, the batch is the whole dataset. Using the whole dataset helps reach the minimum in a less noisy and less random manner, but it becomes a problem when the dataset gets big.
Suppose you have a million samples in your dataset. With a typical Gradient Descent optimization technique, you have to use all one million samples to complete a single iteration, and this has to be repeated for every iteration until the minimum is reached. Hence, it becomes computationally very expensive.
Stochastic Gradient Descent solves this problem. SGD uses only a single sample, i.e., a batch size of one, to perform each iteration. The dataset is randomly shuffled and one sample is selected for each iteration.
Machine Learning With Logistic Regression (Knoldus Inc.)
Machine learning is the subfield of computer science that gives computers the ability to learn without being explicitly programmed. Logistic Regression is a classification algorithm, based on linear regression, that evaluates the output and minimizes the error.
An overview of gradient descent optimization algorithms (Hakky St)
These slides are based on a paper about gradient descent.
They were prepared for a study meeting on gradient descent.
I used the following paper, which is a very good source of information about gradient descent.
https://arxiv.org/abs/1609.04747
High Dimensional Data Visualization using t-SNE (Kai-Wen Zhao)
Review of the t-SNE algorithm, which helps visualize high-dimensional data lying on a manifold by projecting it onto 2D or 3D space while preserving the metric.
Linear regression with gradient descent (Suraj Parmar)
Intro to the very popular optimization technique (gradient descent) with linear regression. Linear regression with gradient descent on www.landofai.com
Activation functions and Training Algorithms for Deep Neural network (Gayatri Khanvilkar)
Training a deep neural network is a difficult task. Deep neural networks are trained with the help of training algorithms and activation functions. This is an overview of the activation functions and training algorithms used for deep neural networks. It presents a brief comparative study of activation functions and training algorithms.
K-Nearest Neighbor is one of the most commonly used classifiers based on lazy learning. It is one of the most commonly used methods in recommendation systems and document similarity measures. It mainly uses Euclidean distance to measure the similarity between two data points.
This presentation contains an introduction to reinforcement learning, a comparison with other ways of learning, an introduction to Q-Learning, and some applications of reinforcement learning in video games.
Simulators play a major role in analyzing multi-modal transportation networks. As their complexity increases, optimization becomes an increasingly challenging task. Current calibration procedures often rely on heuristics, rules of thumb and sometimes on brute-force search. Alternatively, we provide a statistical method which combines a distributed, Gaussian Process Bayesian optimization method with dimensionality reduction techniques and structural improvement. We then demonstrate our framework on the problem of calibrating a multi-modal transportation network of the city of Bloomington, Illinois. Our framework is sample efficient and supported by theoretical analysis and an empirical study. Finally, we discuss directions for further research.
Opening of our Deep Learning Lunch & Learn series. First session: introduction to Neural Networks, Gradient descent and backpropagation, by Pablo J. Villacorta, with a prologue by Fernando Velasco
One of the central tasks in computational mathematics and statistics is to accurately approximate unknown target functions. This is typically done with the help of data — samples of the unknown functions. The emergence of Big Data presents both opportunities and challenges. On one hand, big data introduces more information about the unknowns and, in principle, allows us to create more accurate models. On the other hand, data storage and processing become highly challenging. In this talk, we present a set of sequential algorithms for function approximation in high dimensions with large data sets. The algorithms are of iterative nature and involve only vector operations. They use one data sample at each step and can handle dynamic/stream data. We present both the numerical algorithms, which are easy to implement, as well as rigorous analysis for their theoretical foundation.
Paper Study: Melding the data decision pipeline (ChenYiHuang5)
Melding the data decision pipeline: Decision-Focused Learning for Combinatorial Optimization, from AAAI 2019.
I derive the equations myself and obtain the same result as the two CMU papers mentioned [Donti et al. 2017, Amos et al. 2017], following the same derivation procedure.
https://telecombcn-dl.github.io/2017-dlai/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both an algorithmic and computational perspectives.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... (pchutichetpong)
M Capital Group (“MCG”) expects to see demand and the changing evolution of supply, facilitated through institutional investment rotation out of offices and into work from home (“WFH”), while the ever-expanding need for data storage as global internet usage expands, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ... (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
2. Topics of today’s talk
● Function optimization
● Basic optimization algorithms and their limitations
● Challenges in gradient descent optimization
● Practical gradient descent algorithms
3. Optimization in Machine learning
● Machine learning cares about a performance measure P that is defined with respect to the test set and may also be intractable
● Learning process: optimize P indirectly by optimizing a cost function J(θ), in the hope that doing so will improve P
● First problem of machine learning: optimization of the cost function J(θ)
5. What is function optimization
● Optimization = minimizing or maximizing
● Maximizing a function f may be accomplished by minimizing -f
● f is called an objective function or criterion
● In the case of minimization, f is also called the cost function, loss function, or error function
10. Optimization Problem
● Example: find the extrema of the function $f(x,y) = x^2 + y^2$
● Easy?
● How about $f(x,y) = -x^2 - y^2$
● Still easy?
● How about $f(x,y) = x^2 - y^2$
12. Gradient and the Hessian matrix
● Gradient: vector of first-order partial derivatives
● Hessian: matrix of second-order partial derivatives
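The formulas shown on this slide did not survive in the transcript. As a reminder, the standard definitions for a twice-differentiable function $f$ of $n$ variables can be written as follows (the variable names $x_1, \dots, x_n$ are a notational assumption, not taken from the slide):

```latex
\nabla f(x) =
\begin{pmatrix}
\dfrac{\partial f}{\partial x_1} \\ \vdots \\ \dfrac{\partial f}{\partial x_n}
\end{pmatrix},
\qquad
H(x) =
\begin{pmatrix}
\dfrac{\partial^2 f}{\partial x_1^2} & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2}
\end{pmatrix}
```

When the second partial derivatives are continuous, $H$ is symmetric, so its eigenvalues are real.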
13. Second Partial Derivative Test
f is a multivariate function
● Stationary points: points where ∇f = 0
● Second partial derivative test: a stationary point is
○ A local minimum if all eigenvalues of the Hessian are positive
○ A local maximum if all eigenvalues of the Hessian are negative
○ A saddle point if the Hessian has both positive and negative eigenvalues
○ Inconclusive if the Hessian is not invertible (at least one eigenvalue is 0)
14. Example: $f(x,y) = x^2 - xy + y^2$
Since the Hessian H has two eigenvalues, 1 and 3, both > 0, the stationary point (0,0) is a minimum
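The second partial derivative test can be sketched in a few lines for functions of two variables. This is a minimal illustration, not code from the talk: the helper names are hypothetical, the Hessian entries at (0,0) are derived by hand for each example function, and the eigenvalues use the closed form for a symmetric 2x2 matrix.

```python
import math

def hessian_eigenvalues_2x2(fxx, fxy, fyy):
    """Eigenvalues (smallest first) of the symmetric matrix [[fxx, fxy], [fxy, fyy]]."""
    mean = (fxx + fyy) / 2.0
    radius = math.sqrt(((fxx - fyy) / 2.0) ** 2 + fxy ** 2)
    return mean - radius, mean + radius

def classify_stationary_point(fxx, fxy, fyy, tol=1e-12):
    """Apply the second partial derivative test to a stationary point."""
    lo, hi = hessian_eigenvalues_2x2(fxx, fxy, fyy)
    if abs(lo) <= tol or abs(hi) <= tol:
        return "inconclusive"      # singular Hessian
    if lo > 0:
        return "local minimum"     # all eigenvalues positive
    if hi < 0:
        return "local maximum"     # all eigenvalues negative
    return "saddle point"          # mixed signs

# Hand-derived Hessian entries (fxx, fxy, fyy) at the stationary point (0, 0).
examples = {
    "x^2 - x*y + y^2": (2.0, -1.0, 2.0),
    "x^2 + y^2": (2.0, 0.0, 2.0),
    "-x^2 - y^2": (-2.0, 0.0, -2.0),
    "x^2 - y^2": (2.0, 0.0, -2.0),
}

for name, entries in examples.items():
    print(name, hessian_eigenvalues_2x2(*entries), classify_stationary_point(*entries))
```

For $f(x,y) = x^2 - xy + y^2$ the eigenvalues come out as 1 and 3, matching the slide.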
17. Batch Gradient Descent
Blindfolded hiker: how to get to the lowest place?
Go step by step
● Which direction? Left or right? → Gradient
● How far should I step? → Learning rate
18. Batch Gradient Descent
Solution: $\theta := \theta - \eta \nabla_\theta J(\theta)$
1. Initialize the step size η
2. Start with a random point θ
3. Calculate the gradient $\nabla_\theta J(\theta)$ at point θ
4. Move in the direction opposite to the gradient → get the new θ
5. Repeat until a minimum is reached
a. Stop condition? → the gradient is small enough
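The steps above can be sketched as follows. This is an illustration under stated assumptions rather than the speaker's implementation: the cost is a mean-squared-error fit of a line y = w*x + b, the four data points are made up, and the start point is fixed at (0, 0) instead of random so the run is reproducible.

```python
def batch_gradient(theta, data):
    """Gradient of J(w, b) = mean((w*x + b - y)^2) / 2 over the WHOLE dataset."""
    w, b = theta
    n = len(data)
    grad_w = sum((w * x + b - y) * x for x, y in data) / n
    grad_b = sum((w * x + b - y) for x, y in data) / n
    return grad_w, grad_b

def batch_gradient_descent(data, eta=0.1, tol=1e-8, max_steps=100_000):
    theta = (0.0, 0.0)                      # fixed start (assumption; the slide uses a random point)
    for step in range(max_steps):
        grad_w, grad_b = batch_gradient(theta, data)
        # Stop condition: the gradient is small enough.
        if (grad_w ** 2 + grad_b ** 2) ** 0.5 < tol:
            break
        # Move against the gradient.
        theta = (theta[0] - eta * grad_w, theta[1] - eta * grad_b)
    return theta, step

# Hypothetical noise-free data generated from y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
theta, steps = batch_gradient_descent(data)
print(theta, steps)
```

Every step touches all samples, which is what makes the method expensive on large datasets.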
19. Batch Gradient Descent
Pros
● Stable convergence
Cons
● Needs to calculate the gradient over the whole dataset
● Slow if not implemented wisely
20. Stochastic Gradient Descent
Principle: same as Batch Gradient Descent
Difference:
● θ is updated at each example of the training dataset
Batch: $\theta := \theta - \eta \nabla_\theta J(\theta)$
Stochastic: $\theta := \theta - \eta \nabla_\theta J(\theta; x^{(i)}, y^{(i)})$
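A minimal sketch of the per-example update, under the same kind of assumptions as a toy line fit: the squared-error cost, the four data points, the fixed seed, and the function names are illustrative choices, not taken from the slides. The data are noise-free so the result can be checked easily; with noisy data the iterates would fluctuate around the minimum.

```python
import random

def example_gradient(theta, x, y):
    """Gradient of J(w, b; x, y) = (w*x + b - y)^2 / 2 for ONE training example."""
    w, b = theta
    err = w * x + b - y
    return err * x, err

def sgd(data, eta=0.05, epochs=500, seed=0):
    rng = random.Random(seed)       # fixed seed for reproducibility (assumption)
    theta = (0.0, 0.0)
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)           # visit the examples in random order
        for x, y in data:
            grad_w, grad_b = example_gradient(theta, x, y)
            # Update after every single example, not after the whole dataset.
            theta = (theta[0] - eta * grad_w, theta[1] - eta * grad_b)
    return theta

# Hypothetical noise-free data generated from y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
theta = sgd(data)
print(theta)
```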
21. Stochastic Gradient Descent
Pros
● Faster than Batch Gradient Descent
● Possible to learn online
Cons
● Unstable convergence
● Does not use optimized vector operations
[Figure: Fluctuation in Stochastic Gradient Descent. Image credit: Pham Quang Khang]
23. Example of a cost function
● Example: cross entropy in logistic regression, where $x_n$ is the input vector, $y_n$ is the label, $w$ is the weight matrix, $N$ is the number of training examples, and $\sigma$ is the sigmoid function
● The shape of the cost function is poorly understood
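The formula itself is missing from the transcript. A commonly used form of the cross-entropy cost for binary logistic regression, written with the symbols named on the slide, is given below. The $1/N$ normalization and the treatment of $w$ as a single weight vector are assumptions; the slide may have used a different convention.

```latex
J(w) = -\frac{1}{N} \sum_{n=1}^{N}
\Big[ y_n \log \sigma(w^\top x_n) + (1 - y_n) \log\big(1 - \sigma(w^\top x_n)\big) \Big],
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}}
```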
24. Convexity problem
● In traditional machine learning, objective functions are carefully designed to be convex
○ Example: the objective function of SVM is convex.
● When training neural networks, we must confront the non-convex case
○ Many local minima → infeasible to find the global minimum
○ Dealing with saddle points
○ Flat regions exist
25. Local minima
● In practice, local minima are not a major problem
● [1] gives some theoretical insights about local minima:
○ For large-size networks, most local minima are equivalent and yield similar performance on a test set
○ The probability of finding a “bad” (high value) local minimum is non-zero for small-size networks and decreases quickly with network size
○ Struggling to find the global minimum on the training set (as opposed to one of the many good local ones) is not useful in practice and may lead to overfitting
[1] Choromanska et al. 2014.
26. Saddle points
● For high-dimensional non-convex functions, saddle points are far more numerous than local minima (and maxima) [1]
● Saddle points slow down the training process
○ Batch Gradient Descent may get stuck at saddle points
○ Stochastic Gradient Descent seems to be able to escape saddle points in many cases [2]
[1] Dauphin et al. 2014.
[2] Goodfellow et al. 2015.
27. Flat regions
● Flat regions: regions of constant value, where the gradient and the Hessian are both 0
● A big problem when those regions have a high value of the objective function
● Escaping from those regions is extremely difficult
[Figure: plot of $y = x^5$]
28. Flat regions: example of cost function $f(x,y) = -x^2 y^2$
● At (0,0) the gradient and the Hessian are both 0
→ f is super flat at (0,0)
● The second partial derivative test can’t determine whether (0,0) is a local minimum, a local maximum or a saddle point
● In this case, (0,0) is a global maximum
29. Flat regions: example of cost function $f(x,y) = xy(x+y)(1+y)$
● At (0,0) the gradient and the Hessian are both 0 too
● But in this case, (0,0) is a saddle point
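Both flat-region examples can be checked numerically. The sketch below is illustrative only: it estimates the Hessian at the origin with central finite differences (the step size h is an arbitrary choice, so the estimates are approximate) and then samples each function on a small grid around (0,0) to see which signs it takes.

```python
def f_flat_max(x, y):
    """f(x, y) = -x^2 * y^2 from slide 28."""
    return -(x ** 2) * (y ** 2)

def f_flat_saddle(x, y):
    """f(x, y) = x*y*(x + y)*(1 + y) from slide 29."""
    return x * y * (x + y) * (1 + y)

def hessian_at_origin(f, h=1e-3):
    """Central finite-difference estimate of (f_xx, f_xy, f_yy) at (0, 0)."""
    fxx = (f(h, 0.0) - 2.0 * f(0.0, 0.0) + f(-h, 0.0)) / h ** 2
    fyy = (f(0.0, h) - 2.0 * f(0.0, 0.0) + f(0.0, -h)) / h ** 2
    fxy = (f(h, h) - f(h, -h) - f(-h, h) + f(-h, -h)) / (4.0 * h ** 2)
    return fxx, fxy, fyy

for f in (f_flat_max, f_flat_saddle):
    print(f.__name__, hessian_at_origin(f))   # all entries are (close to) zero

# Sample both functions on a small grid around the origin.
offsets = [-0.1, -0.05, 0.05, 0.1]
values_max = [f_flat_max(x, y) for x in offsets for y in offsets]
values_saddle = [f_flat_saddle(x, y) for x in offsets for y in offsets]

print(max(values_max) <= 0.0)                           # never exceeds f(0, 0) = 0
print(min(values_saddle) < 0.0 < max(values_saddle))    # takes both signs near (0, 0)
```

The Hessian gives no information in either case; only the behaviour of f in a neighbourhood separates the maximum from the saddle point.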
31. Gradient descent optimization problem
● Objective: find the parameters θ that minimize the loss function J(θ)
● Approach: iteratively update the parameters θ using the gradient ∇J(θ)
● Conventional gradient descent: $\theta_t = \theta_{t-1} - \eta \nabla J(\theta_{t-1})$
32. Momentum
● Add momentum to the parameter update:
$v_t = \alpha v_{t-1} - \eta \nabla J(\theta_{t-1})$
$\theta_t = \theta_{t-1} + v_t$
● Essential meaning of Momentum:
○ Accelerates learning during the training process
○ In physics terms: adds momentum to a ball rolling down a hill
○ The momentum parameter α is less than 1, since there is always a resistance force slowing the ball down; it is usually set to 0.9
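A minimal sketch of the momentum update on a toy problem. The quadratic cost J(x, y) = 0.5 * (x^2 + 10 * y^2), the start point, and the step count are assumptions chosen for illustration. Setting α = 0 reduces the rule to conventional gradient descent, which makes a side-by-side comparison easy.

```python
def grad(theta):
    """Gradient of the toy cost J(x, y) = 0.5 * (x^2 + 10 * y^2) (an assumption)."""
    x, y = theta
    return [x, 10.0 * y]

def momentum_descent(theta, eta=0.01, alpha=0.9, steps=500):
    v = [0.0] * len(theta)                                   # velocity starts at rest
    for _ in range(steps):
        g = grad(theta)
        v = [alpha * vi - eta * gi for vi, gi in zip(v, g)]  # v_t = alpha*v_{t-1} - eta*grad
        theta = [ti + vi for ti, vi in zip(theta, v)]        # theta_t = theta_{t-1} + v_t
    return theta

plain = momentum_descent([5.0, 5.0], alpha=0.0)       # alpha = 0: conventional gradient descent
with_momentum = momentum_descent([5.0, 5.0], alpha=0.9)
print(plain)
print(with_momentum)
```

With the same learning rate and number of steps, the run with α = 0.9 ends much closer to the minimum at (0, 0), especially along the low-curvature x direction.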
33. Adaptive learning rate
● The previous algorithms use a fixed learning rate throughout the learning process
○ The learning rate must either be set very small from the beginning or be decreased periodically
● Adaptive learning rate: the learning rate is automatically decreased during the learning process
● Adaptive learning rate algorithms: Adagrad, RMSprop, Adam
34. Adagrad
● Essential meaning: the more a parameter has changed, the more slowly it gets updated
● Algorithm:
Accumulated sum of squares: $h_t = h_{t-1} + \nabla J(\theta_{t-1}) \odot \nabla J(\theta_{t-1})$
Parameter update: $\theta_t = \theta_{t-1} - \eta \frac{1}{\sqrt{h_t}} \odot \nabla J(\theta_{t-1})$
● The learning rate decreases as the number of update steps increases
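The Adagrad rule can be sketched as below. The toy quadratic cost, the start point, and the hyperparameter values are assumptions for illustration, and the small constant eps added to the denominator is not on the slide; it is a common safeguard against division by zero.

```python
import math

def grad(theta):
    """Gradient of the toy cost J(x, y) = 0.5 * (x^2 + 10 * y^2) (an assumption)."""
    x, y = theta
    return [x, 10.0 * y]

def adagrad(theta, eta=0.5, eps=1e-8, steps=500):
    h = [0.0] * len(theta)                       # accumulated squared gradients
    for _ in range(steps):
        g = grad(theta)
        h = [hi + gi * gi for hi, gi in zip(h, g)]
        # Per-parameter step: eta / sqrt(h) shrinks as h grows.
        theta = [ti - eta * gi / (math.sqrt(hi) + eps)
                 for ti, gi, hi in zip(theta, g, h)]
    return theta, h

theta, h = adagrad([5.0, 5.0])
print(theta, h)
```

Because each coordinate is divided by its own accumulated gradient magnitude, the steep y direction and the shallow x direction follow nearly the same trajectory here, even though their raw gradients differ by a factor of 10.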
35. RMSprop
● Adagrad decreases the learning rate too fast because it adds up all previous squared gradients
● Instead, keep only a decaying portion of the previously accumulated squared gradients:
Accumulated sum of squares: $h_t = \gamma h_{t-1} + (1-\gamma) \nabla J(\theta_{t-1}) \odot \nabla J(\theta_{t-1})$
Parameter update: $\theta_t = \theta_{t-1} - \eta \frac{1}{\sqrt{h_t}} \odot \nabla J(\theta_{t-1})$
● Good practice: γ = 0.9
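A minimal sketch of the RMSprop rule on the same kind of toy problem. The quadratic cost, start point, learning rate, and step count are illustrative assumptions, and eps is an added safeguard against division by zero that does not appear on the slide.

```python
import math

def grad(theta):
    """Gradient of the toy cost J(x, y) = 0.5 * (x^2 + 10 * y^2) (an assumption)."""
    x, y = theta
    return [x, 10.0 * y]

def rmsprop(theta, eta=0.05, gamma=0.9, eps=1e-8, steps=500):
    h = [0.0] * len(theta)                       # decaying average of squared gradients
    for _ in range(steps):
        g = grad(theta)
        h = [gamma * hi + (1.0 - gamma) * gi * gi for hi, gi in zip(h, g)]
        theta = [ti - eta * gi / (math.sqrt(hi) + eps)
                 for ti, gi, hi in zip(theta, g, h)]
    return theta

theta = rmsprop([5.0, 5.0])
print(theta)
```

With a constant η the iterates settle into a small neighbourhood of the minimum rather than converging exactly, because the normalized step does not shrink with the gradient; a smaller or decaying η tightens that neighbourhood.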
36. Adam (Adaptive Moment Estimation)
● Combines the advantages of both Momentum and Adagrad
● Algorithm
Momentum: $v_t = \beta_1 v_{t-1} + (1 - \beta_1) \nabla J(\theta_{t-1})$
Learning rate: $h_t = \beta_2 h_{t-1} + (1 - \beta_2) \nabla J(\theta_{t-1}) \odot \nabla J(\theta_{t-1})$
Because $v$ and $h$ start at zero, they are biased toward zero in the early steps. To counteract this, bias-corrected estimates are calculated:
$v_t' = v_t / (1 - \beta_1^t)$, $h_t' = h_t / (1 - \beta_2^t)$
Parameter update:
$\theta_t = \theta_{t-1} - \eta \, v_t' / (\sqrt{h_t'} + \epsilon)$
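The Adam update can be sketched as follows. The toy quadratic cost, the start point, the learning rate of 0.05, and the step count are assumptions for illustration; β1 = 0.9 and β2 = 0.999 are the values used later in the talk's comparison.

```python
import math

def grad(theta):
    """Gradient of the toy cost J(x, y) = 0.5 * (x^2 + 10 * y^2) (an assumption)."""
    x, y = theta
    return [x, 10.0 * y]

def adam(theta, eta=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    v = [0.0] * len(theta)                       # first moment (momentum term)
    h = [0.0] * len(theta)                       # second moment (learning-rate term)
    for t in range(1, steps + 1):
        g = grad(theta)
        v = [beta1 * vi + (1.0 - beta1) * gi for vi, gi in zip(v, g)]
        h = [beta2 * hi + (1.0 - beta2) * gi * gi for hi, gi in zip(h, g)]
        # Bias correction: v and h start at zero and are too small early on.
        v_hat = [vi / (1.0 - beta1 ** t) for vi in v]
        h_hat = [hi / (1.0 - beta2 ** t) for hi in h]
        theta = [ti - eta * vh / (math.sqrt(hh) + eps)
                 for ti, vh, hh in zip(theta, v_hat, h_hat)]
    return theta

theta = adam([5.0, 5.0])
print(theta)
```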
37. Comparison of all 4 algorithms plus SGD on benchmark data
● Data: part of MNIST
● Input size: 28x28x1, output size: 10
● Model: NN with 4 hidden layers, 100 units in each layer
● Number of training examples: 800, number of test examples: 200
● Batch size: 100
● Number of epochs: 1000
● Initial weights: $\mathcal{N}(0, 0.1)$
38. Result 1: Change of cost function during process
● Learning rate: η = 0.001
● Momentum: α = 0.9
● RMSprop: γ = 0.9
● Adam: $\beta_1 = 0.9$, $\beta_2 = 0.999$
44. Machine Learning Definition
"A computer program is said to learn from experience E with
respect to some class of tasks T and performance measure P
if its performance at tasks in T, as measured by P, improves
with experience E." [1]
[1] Mitchell, T. (1997). Machine Learning. McGraw Hill. p2.
45. Exploding Gradients
● Cliffs: where the gradient is extremely large
● One update step of gradient descent can move the parameters extremely far, usually past the minimum point
● Solution: gradient clipping
[Figure credit: Goodfellow et al., Deep Learning book, p. 289]
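Gradient clipping comes in several variants; the slide does not say which one is meant. The sketch below shows clipping by norm, one common variant: if the gradient norm exceeds a threshold, the gradient is rescaled to that norm while keeping its direction. The function name and the threshold value are illustrative assumptions.

```python
import math

def clip_by_norm(grad, max_norm):
    """Rescale grad so its Euclidean norm is at most max_norm; direction is kept."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)            # small gradients pass through unchanged
    scale = max_norm / norm
    return [g * scale for g in grad]

cliff_gradient = [300.0, -400.0]     # a huge gradient, as found near a cliff
clipped = clip_by_norm(cliff_gradient, max_norm=1.0)
print(clipped)
```

The clipped gradient is then used in place of the raw gradient in whichever update rule is in use, which bounds how far a single step can move the parameters.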
46. Vanishing Gradients
● A major problem when training deep networks
● The gradient tends to get smaller as we move backward through the hidden layers when running backpropagation
● Weights in the earlier layers may not be learned
● Solutions:
○ Use good activation functions (e.g. ReLU) instead of sigmoid
○ Good initialization
○ Better network architectures (LSTMs/GRUs instead of basic RNNs)
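A rough way to see why sigmoid activations shrink gradients: the derivative of the sigmoid is at most 0.25, and backpropagation multiplies one such factor per layer. The sketch below is a simplification that ignores the weight matrices entirely (equivalently, assumes unit weights) and only multiplies activation derivatives; the pre-activation values are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)             # peaks at 0.25 when z = 0

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def backprop_factor(activation_grad, pre_activations):
    """Product of activation derivatives across layers (weights ignored)."""
    factor = 1.0
    for z in pre_activations:
        factor *= activation_grad(z)
    return factor

# Best case for sigmoid: every pre-activation sits at 0, where its slope is largest.
sigmoid_factor = backprop_factor(sigmoid_grad, [0.0] * 10)
# ReLU with positive pre-activations passes the gradient through unchanged.
relu_factor = backprop_factor(relu_grad, [1.0] * 10)
print(sigmoid_factor, relu_factor)
```

Even in the sigmoid's best case, ten layers scale the gradient by 0.25 to the power 10, which is below one in a million, while active ReLU units leave it unchanged.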
47. Sharp and Wide Minima [1]
● Large-batch Gradient Descent tends to converge to sharp minima → poorer generalization
● Small-batch Gradient Descent consistently converges to wide minima → better generalization
[1] Keskar et al. 2017.