A presentation about NGBoost (Natural Gradient Boosting) which I presented in the Information Theory and Probabilistic Programming course at the University of Oklahoma.
2. Outlines
• Introduction.
• What is probabilistic regression?
• Why is it useful?
• How do other methods compare to NGBoost?
• Gradient Boosting Algorithm.
• NGBoost:
• Main components.
• Steps.
• Usage.
• Experiments & Results.
• Computational Complexity.
• Future Work.
• References.
3. Introduction
What is probabilistic regression?
[Figure: probabilistic regression contrasted with standard regression.]
Note: this use of conditional probability distributions is already the norm in classification.
4. Why is probabilistic regression (prediction) useful?
The measure of uncertainty makes probabilistic prediction crucial in applications like healthcare and
weather forecasting.
5. Why is probabilistic regression (prediction) useful?
All in all, probabilistic regression (prediction) provides better insight than standard (scalar) regression.
[Figure: standard regression maps X=x to the point estimate E[Y|X=x]; probabilistic regression maps X=x to the full conditional distribution P(Y|X=x).]
6. Problems with existing methods
Methods:
• Post-hoc variance.
• Generalized Additive Models for Location, Scale and Shape (GAMLSS).
• Bayesian methods such as MCMC.
• Bayesian deep learning.
Problems:
• Inflexible.
• Slow.
• Require expert knowledge.
• Make strong assumptions about the nature of the data (homoscedasticity*).
Limitations of deep learning methods: difficult to use out-of-the-box.
• Require expert knowledge.
• Usually perform only on par with traditional methods on limited-size or tabular data.
• Require extensive hyperparameter tuning.
* Homoscedasticity means that all random variables in a sequence have the same finite variance.
7. Gradient Boosting Machines (GBMs)
• A set of highly modular methods that:
• Work out-of-the-box.
• Perform well on structured data, even with small datasets.
• Demonstrated empirical success on Kaggle and in other data science competitions.
Source: what algorithms are most successful on Kaggle?
8. Problems related to GBMs
• Assume Homoscedasticity: constant variance.
• Predicted distributions should have at least two
degrees of freedom (two parameters) to
effectively convey both the magnitude and the
uncertainty of the predictions.
What is the solution then?
(Spoiler alert) it is NGBoost.
NGBoost solves the problem of simultaneously boosting multiple parameters from the base learners using:
• A multiparameter boosting approach.
• Use of natural gradients.
9. Gradient Boosting Algorithm
• An ensemble of simple models is involved in making a prediction.
• Results in a prediction model in the form of an ensemble of weak models.
• Intuition: the best possible next model, when combined with previous models, minimizes the overall prediction error.
• Components:
• A loss function to be optimized.
• E.g., MSE or Logarithmic Loss.
• A weak learner to make predictions.
• Most common choice is Decision Trees or Regression Trees.
• It is common to constrain the learner, e.g. by specifying the maximum number of layers, nodes, splits, or leaf nodes.
• An additive model to add weak learners to minimize the loss function.
• A gradient descent procedure is used to minimize the loss when adding
trees.
11. Gradient Boosting Algorithm
Explanation:
Step 1: Initialize the prediction to a constant whose value minimizes the loss. You can solve for it with gradient descent, or analytically if the problem is trivial.
Step 2: Build the trees (weak learners). For each tree m:
(A) Compute the residuals between the prediction and the observed data, using the previous step's prediction F(x) = F_{m-1}(x), which is F_0(x) for m = 1.
(B) Fit a tree to the residuals (make the residuals the target output). Here j loops over the leaf nodes.
(C) Determine the output for each leaf in the tree. E.g., if a leaf holds 14.7 and 2.7, its output is the value of γ that minimizes the summation. Unlike Step 1, here the previous prediction F_{m-1}(x_i) is taken into account.
(D) Make a new prediction for each sample. The summation accounts for the case where a single sample ends up in multiple leaves, so you take a scaled sum of the leaf outputs γ. Choosing a small learning rate ν improves prediction.
Step 3: The final prediction is the prediction of the last tree. (A minimal code sketch follows below.)
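The sketch below is my own minimal illustration of the loop above for squared-error loss, where the negative gradient is simply the residual; the dataset and hyperparameters are placeholders, not from the slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    # Step 1: initialize with the constant minimizing squared error (the mean).
    f0 = y.mean()
    pred = np.full_like(y, f0, dtype=float)
    trees = []
    for _ in range(n_trees):
        # Step 2(A): residuals = negative gradient of the squared-error loss.
        residuals = y - pred
        # Steps 2(B)-(C): fit a small tree to the residuals; its leaf values
        # play the role of the per-leaf outputs gamma.
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        # Step 2(D): update the prediction with a scaled step.
        pred += learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, trees

def predict_gbm(f0, trees, X, learning_rate=0.1):
    # Step 3: final prediction = initial constant + all scaled tree outputs.
    return f0 + learning_rate * sum(t.predict(X) for t in trees)
```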
To learn more:
• Paper: Greedy Function Approximation: A Gradient Boosting Machine, Jerome H. Friedman.
• Video explanations: Gradient Boost part 1, part 2, part 3, part 4.
• Decision Trees video explanation: Decision Trees.
• AdaBoost video explanation: AdaBoost.
12. NGBoost: Natural Gradient Boosting
• A method for probabilistic prediction with competitive state-of-the-art performance on a variety
of datasets.
• Combines a multiparameter boosting algorithm with the natural gradient to efficiently estimate how the parameters of the presumed outcome distribution vary with the observed features.
• In a standard prediction setting:
• The object of interest is an estimate of the scalar function E[y|x], where x is the vector of covariates (observed features) and y is the prediction target.
• For NGBoost:
• The object of interest is a conditional probability distribution P_θ(y|x).
• P_θ(y|x) is assumed to have a parametric form with p parameters, where θ ∈ ℝ^p (a vector of p parameters).
14. NGBoost: Natural Gradient Boosting
Steps:
1. Pick a scoring rule to grade our estimate of P(Y|X=x).
2. Assume that P(Y|X=x) has some parametric form.
3. Fit the parameters θ(x) as a function of x using gradient boosting.
4. Use the natural gradient to correct the training dynamics of this approach.
(A usage sketch follows below.)
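As a hedged usage sketch, the four steps map onto the open-source `ngboost` package roughly as below; class and method names follow its documentation and may differ across versions, and the toy data is my own.

```python
import numpy as np
from ngboost import NGBRegressor
from ngboost.distns import Normal      # step 2: parametric form for P(Y|X=x)
from ngboost.scores import LogScore    # step 1: scoring rule (the NLL)

# Toy heteroscedastic data (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1 + 0.05 * X[:, 0])

# Steps 3-4: fit theta(x) = (mu(x), log sigma(x)) by boosting with natural gradients.
ngb = NGBRegressor(Dist=Normal, Score=LogScore, n_estimators=500).fit(X, y)

dist = ngb.pred_dist(X[:5])   # full predictive distributions P(Y|X=x)
print(dist.params["loc"])     # predicted means
print(dist.params["scale"])   # predicted standard deviations
```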
15. Proper Scoring Rule
A proper scoring rule S(P, y) must satisfy:
E_{y∼Q}[S(Q, y)] ≤ E_{y∼Q}[S(P, y)]  ∀ P, Q
where Q is the true distribution of the outcomes y, and P is any other distribution (e.g. the predicted one) of the outcomes y.
In other words, the scoring rule assigns a score to the forecast such that the true distribution Q of the outcomes gets the best score in expectation, compared to any other distribution P.
(Gneiting and Raftery, 2007. Strictly Proper Scoring Rules, Prediction, and Estimation.)
16. 1. Pick a scoring rule to grade our estimate of P(Y|X=x)
Point prediction → loss function; probabilistic prediction → scoring rule.
Example scoring rule: the negative log-likelihood (NLL).
Notes:
• A scoring rule in probabilistic regression is analogous to a loss function in standard regression.
• Minimizing the NLL yields the maximum likelihood estimate (MLE); a concrete form is given below.
• Taking the log simplifies the calculus.
• NLL (MLE) is the most common proper scoring rule.
• CRPS is another good alternative to the NLL.
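To make the example concrete (my own expansion, not on the slide): for a predicted distribution P_θ and observed outcome y, the NLL scoring rule, and its form when P_θ is a Normal, are

```latex
S(P_\theta, y) = -\log p_\theta(y), \qquad
-\log \mathcal{N}(y;\, \mu, \sigma^2)
  = \log \sigma + \frac{(y-\mu)^2}{2\sigma^2} + \tfrac{1}{2}\log 2\pi .
```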
17. 2. Assume P(Y|X=x) has some parametric form
[Figure: four example normal distributions with (μ, σ) = (1, 1), (2, 0.5), (2.5, 0.75), and (3.5, 1.5).]
Note: here they assume a normal distribution, but you can swap in any other distribution (Poisson, Bernoulli, etc.) that fits your application.
18. 3. Fit the parameters θ(x) as a function of x using gradient boosting
[Figure: the same four example normal distributions, with (μ, σ) = (1, 1), (2, 0.5), (2.5, 0.75), and (3.5, 1.5), now fitted as functions of x.]
19. This approach performs poorly in practice.
[Figure: what we get vs. what we want.]
The algorithm fails to adjust the mean, which hurts the prediction.
What could be the solution? Use natural gradients instead of ordinary gradients.
20. What we typically do: gradient descent in the parameter space
• Pick a small region (ball) around your current value of θ.
• Ask which direction of step within that ball decreases the score the most (this is the gradient).
21. What we want to do: Gradient descent in the space of distributions
Every point in this space represents
some distribution.
22. Parametrizing the space of distributions
θ is just a "name" for P.
Each distribution has such a name (i.e. it is "identified" by its parameters).
23. The problem is:
Gradient descent in the parameter space is not gradient descent in the distribution space, because distances in the two spaces do not correspond: the spaces have different shape and density. For example, N(0, 0.1) and N(1, 0.1) are nearly disjoint distributions while N(0, 10) and N(1, 10) are nearly identical, yet the parameter distance is the same in both cases.
24. 4. Use the natural gradient to correct the training dynamics of this approach.
[Figure: the transformed neighborhood; the step direction found there is the natural gradient.]
Idea: do gradient descent in the distribution space by searching for parameters in the transformed region.
25. • I_s(θ) is the Riemannian metric of the space of distributions.
• It depends on the parametric form chosen and on the score function.
• If the score is the NLL, this metric is the Fisher information.
Here's the trick:
• Multiplying the ordinary gradient by the inverse of the Riemannian metric implicitly transforms the optimal direction in parameter space into the optimal direction in the distributional space.
• We can therefore conveniently compute the natural gradient by applying a transformation to the ordinary gradient, as in the sketch below.
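Below is a small numeric sketch of my own (not from the slides) for a Normal with parameters θ = (μ, log σ) under the NLL score, where the metric is the Fisher information, which for this parametrization is diag(1/σ², 2):

```python
import numpy as np

def natural_gradient_normal(y, mu, log_sigma):
    sigma2 = np.exp(2 * log_sigma)
    # Ordinary gradient of the NLL with respect to (mu, log sigma).
    grad = np.array([(mu - y) / sigma2,
                     1.0 - (y - mu) ** 2 / sigma2])
    # Fisher information matrix for the (mu, log sigma) parametrization.
    fisher = np.array([[1.0 / sigma2, 0.0],
                       [0.0,          2.0]])
    # Natural gradient: multiply by the *inverse* of the metric.
    return np.linalg.solve(fisher, grad)

# With sigma = e (log_sigma = 1), the ordinary gradient is ~[-0.406, -0.218],
# but the natural gradient's mu component is mu - y = -3 regardless of sigma:
print(natural_gradient_normal(y=3.0, mu=0.0, log_sigma=1.0))  # ~[-3.0, -0.109]
```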
27. NGBoost
Explanation:
1. Estimate a common θ^(0) such that it minimizes S.
2. For each iteration m:
• Compute the natural gradient g_i^(m) of S with respect to the predicted parameters of that example up to that stage, θ_i^(m−1).
• Fit learners, one per parameter, on the natural gradients, e.g. f^(m) = (f_μ^(m), f_{log σ}^(m)).
• Compute a scaling factor ρ^(m) (a scalar) that minimizes the true scoring rule along the projected gradient, in the form of a line search. In practice, they found that setting ρ = 1 and then halving successively works well.
• Update the predicted parameters.
Notes:
• The learning rate η is typically 0.1 or 0.01, following Friedman.
• Sub-sampling mini-batches can improve computational performance on large datasets.
(A condensed sketch of this loop follows below.)
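Putting the pieces together, here is a condensed sketch of the loop above for a Normal with θ = (μ, log σ) and the NLL score; it reuses natural_gradient_normal from the earlier sketch, and all names and hyperparameters are illustrative, not the authors' implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def nll(theta, y):
    # Mean NLL of a Normal with per-example parameters theta = [mu, log sigma].
    mu, log_sigma = theta[:, 0], theta[:, 1]
    return np.mean(log_sigma + (y - mu) ** 2 / (2 * np.exp(2 * log_sigma)))

def fit_ngboost(X, y, n_stages=100, eta=0.1, max_depth=3):
    # Step 1: a common theta^(0) minimizing S -- the MLE of the marginal Normal.
    theta = np.tile([y.mean(), np.log(y.std())], (len(y), 1))
    stages = []
    for _ in range(n_stages):
        # Step 2a: per-example natural gradients of S at theta^(m-1).
        g = np.array([natural_gradient_normal(yi, mu, ls)
                      for yi, (mu, ls) in zip(y, theta)])
        # Step 2b: one weak learner per distribution parameter.
        trees = [DecisionTreeRegressor(max_depth=max_depth).fit(X, g[:, j])
                 for j in range(2)]
        step = np.column_stack([t.predict(X) for t in trees])
        # Step 2c: scaling rho -- start at 1 and halve while the score worsens.
        rho = 1.0
        while rho > 1e-4 and nll(theta - rho * step, y) > nll(theta, y):
            rho /= 2
        # Step 2d: update the predicted parameters.
        theta -= eta * rho * step
        stages.append((rho, trees))
    return stages
```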
28. Experiments
• UCI ML Repository benchmarks.
• Probabilistic Regression:
• Configuration:
• Data split: 70% training, 20% validation, and 10% testing.
• Repeated 20 times.
• Ablation:
• 2nd-order boosting: use 2nd-order gradients instead of natural gradients.
• Multiparameter boosting: use ordinary gradients instead of natural gradients.
• Homoscedastic boosting: assume constant variance, to measure the benefit of allowing parameters other than the conditional mean to vary across x.
• Why? To demonstrate that multiparameter boosting and the natural gradient work together to improve performance.
• Point estimation.
29. Results
The result is equal or better performance than state-of-the-art probabilistic prediction methods.
33. Computational Complexity
Differences between NGBoost and other boosting algorithms:
• NGBoost fits one series of learners per distribution parameter, whereas standard boosting fits only a single series of learners.
• The natural gradient requires computing the inverse of a p × p matrix I_s(θ) at each step, where p is the number of distribution parameters.
In practice:
• The matrix is small for most commonly used distributions: only 2 × 2 for a Normal distribution.
• If the dataset is huge, it may still be expensive to compute this large number of matrices at each iteration.
34. Future work
• Apply NGBoost to other tasks, e.g. classification and survival prediction.
• Joint prediction: P_θ(z, y|x).
• Technical innovations:
• Better tree-based base learners and regularization are likely to improve performance, especially on large datasets.