Slides of a report on Machine Learning Seminar Series'11 at Kazan (Volga Region) Federal University. See http://cll.niimm.ksu.ru/cms/main/seminars/mlseminar
7 - Model Assessment and Selection
1. Model Assessment and Selection
Machine Learning Seminar Series'11
Nikita Zhiltsov
Kazan (Volga Region) Federal University, Russia
18 November 2011
2. Outline
1 Bias, Variance and Model Complexity
2 Nature of Prediction Error
3 Error Estimation: Analytical methods
AIC
BIC
SRM Approach
4 Error Estimation: Sample re-use
Cross-validation
Bootstrapping
5 Model Assessment in R
4. Notation
x = (x_1, ..., x_D) ∈ X, a vector of inputs
t ∈ T, a target variable
y(x), a prediction model
L(t, y(x)), the loss function for measuring errors
Usual choices for regression:
L(t, y(x)) = (y(x) − t)^2 (squared error) or |y(x) − t| (absolute error)
... and for classification:
L(t, y(x)) = I(y(x) ≠ t) (0-1 loss) or −2 log p_t(x) (log-likelihood loss)
5. Notation (cont.)
err = (1/N) Σ_{i=1}^N L(t_i, y(x_i)), the training error
Err_D = E[L(t, y(x)) | D], the test error (prediction error) for a given training set D
Err = E[Err_D] = E[L(t, y(x))], the expected test error
NB: most methods effectively estimate only Err.
6. Typical behavior of test and training error
Example
Training error is not a good estimate of the test error
There is some intermediate model complexity that gives the minimum expected test error
7. Defining our goals
Model Selection: estimating the performance of different models in order to choose the best one
Model Assessment: having chosen a final model, estimating its generalization error on new data
8. Data-rich situation
Training set is used to learn the models
Validation set is used to estimate prediction error for model selection
Test set is used for assessment of the generalization error of the chosen model
10. Bias-Variance Decomposition
Let's consider the expected loss E[L] for the regression task:
E[L] = ∫∫ L(t, y(x)) p(x, t) dx dt
Under squared error loss, h(x) = E[t | x] = ∫ t p(t | x) dt is the optimal prediction.
Then E[L] can be decomposed into the sum of three parts:
E[L] = bias^2 + variance + noise
where
bias^2 = ∫ (E_D[y(x; D)] − h(x))^2 p(x) dx
variance = ∫ E_D[(y(x; D) − E_D[y(x; D)])^2] p(x) dx
noise = ∫∫ (h(x) − t)^2 p(x, t) dx dt
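The decomposition above lends itself to a numerical check. The sketch below (Python rather than the R used later in these slides; the toy target `true_fn`, the noise level, and the polynomial degree are all illustrative assumptions, not part of the original material) refits a model to many independent training sets D and estimates the three terms at a single test point:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):                 # h(x), the optimal prediction
    return np.sin(2 * np.pi * x)

sigma = 0.3                     # standard deviation of the additive noise
x_test = 0.5                    # a fixed test input
n_train, n_datasets, degree = 30, 2000, 3

preds = np.empty(n_datasets)
for d in range(n_datasets):
    x = rng.uniform(0, 1, n_train)
    t = true_fn(x) + rng.normal(0, sigma, n_train)
    coef = np.polyfit(x, t, degree)       # fit y(x; D) on this training set D
    preds[d] = np.polyval(coef, x_test)

bias2 = (preds.mean() - true_fn(x_test)) ** 2   # (E_D[y] - h)^2
variance = preds.var()                          # E_D[(y - E_D[y])^2]
noise = sigma ** 2                              # irreducible error
print(bias2, variance, noise)
```

The sum bias^2 + variance + noise approximates the expected squared loss at x_test; raising or lowering `degree` shifts weight between the bias and variance terms.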
11. Bias-Variance Decomposition
Examples
For a linear model y(x, w) = Σ_{j=1}^p w_j x_j with all w_j ≠ 0, the in-sample error is:
Err = (1/N) Σ_{i=1}^N (ȳ(x_i) − h(x_i))^2 + (p/N) σ^2 + σ^2
For a ridge regression model (Tikhonov regularization):
Err = (1/N) Σ_{i=1}^N [(ŷ(x_i) − h(x_i))^2 + (ŷ(x_i) − ȳ(x_i))^2] + Var + σ^2
where ŷ(x_i) is the best-fitting linear approximation to h
13. Bias-variance tradeoff
Example
Regression with squared loss
Classification with 0-1 loss
In the second case, prediction error is no longer the sum of squared bias and variance
⇒ The best choices of tuning parameters may differ substantially in the two settings
15. Analytical methods: AIC, BIC, SRM
They give in-sample estimates in the general form:
Êrr = err + ŵ
where ŵ is an estimate of the average optimism
By using ŵ, the methods penalize too complex models
Unlike regularization, they do not impose a specific regularization parameter λ
Each criterion defines its own notion of model complexity involved in the penalizing term
16. Akaike Information Criterion (AIC)
Applicable for linear models
Either log-likelihood loss or squared error loss is used
Given a set of models indexed by a tuning parameter α, denote by d(α) the number of parameters for each model. Then
AIC(α) = err(α) + 2 (d(α)/N) σ̂^2
where σ̂^2 is typically estimated by the mean squared error of a low-bias model
Finally, we choose the model giving the smallest AIC
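As a concrete illustration of the criterion, the sketch below (Python; the synthetic quadratic data, the candidate polynomial degrees, and the use of a deliberately flexible degree-10 fit to estimate σ̂^2 are all assumptions of this example, not taken from the slides) scores a family of polynomial models and picks the one with the smallest AIC:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200
x = rng.uniform(-1, 1, N)
t = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(0, 0.2, N)   # true model is quadratic

# sigma^2 estimated from the MSE of a deliberately flexible (low-bias) model
resid = t - np.polyval(np.polyfit(x, t, 10), x)
sigma2_hat = np.mean(resid ** 2)

def aic(degree):
    pred = np.polyval(np.polyfit(x, t, degree), x)
    err = np.mean((t - pred) ** 2)     # training error of this model
    d = degree + 1                     # d(alpha): number of parameters
    return err + 2 * (d / N) * sigma2_hat

scores = {deg: aic(deg) for deg in range(1, 8)}
best = min(scores, key=scores.get)
print(best, scores[best])
```

Degree 1 is heavily penalized by its bias, while degrees above the true one lower the training error only slightly, so the complexity penalty dominates.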
17. Akaike Information Criterion (AIC)
Example
Phoneme recognition task (N = 1000)
Input vector is the log-periodogram of the spoken vowel, quantized to 256 uniformly spaced frequencies
Linear logistic regression is used to predict the phoneme class
Here d(α) is the number of basis functions
18. Bayesian Information Criterion (BIC)
BIC, like AIC, is applicable in settings where log-likelihood maximization is involved
BIC = (N/σ̂^2) (err + (log N) (d/N) σ̂^2)
BIC is proportional to AIC, with the factor 2 replaced by log N
For N > e^2 ≈ 7.4, BIC tends to penalize complex models more heavily than AIC
BIC also provides the posterior probability of each model m:
e^{−BIC_m/2} / Σ_{l=1}^M e^{−BIC_l/2}
BIC is asymptotically consistent as N → ∞
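The posterior-probability formula on this slide can be evaluated directly. In the sketch below (Python) the three BIC scores are made-up numbers; subtracting the minimum before exponentiating cancels in the ratio and simply avoids underflow for large BIC values:

```python
import numpy as np

# BIC scores for M = 3 candidate models (made-up values)
bic = np.array([110.2, 104.7, 106.1])

# Posterior probability of model m: exp(-BIC_m / 2) / sum_l exp(-BIC_l / 2)
w = np.exp(-0.5 * (bic - bic.min()))
posterior = w / w.sum()
print(posterior)
```

Here the second model has the lowest BIC and therefore the largest posterior probability.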
19. Structural Risk Minimization
The Vapnik–Chervonenkis (VC) theory provides a general measure of model complexity and gives associated bounds on the optimism
Such a complexity measure, the VC dimension, is defined as follows:
The VC dimension of the class of functions {f(x, α)} is the largest number of points that can be shattered by members of {f(x, α)}
E.g. a linear indicator function in p dimensions has VC dimension p + 1; sin(αx) has infinite VC dimension
20. Structural Risk Minimization (cont.)
If we fit N training points using {f(x, α)} of VC dimension h, then with probability at least 1 − η the following bound holds:
Err ≤ err + sqrt((h/N) (ln(2N/h) + 1) − (ln η)/N)
The SRM approach fits a nested sequence of models of increasing VC dimensions h_1 ≤ h_2 ≤ ... and then chooses the model with the smallest upper bound
The SVM classifier efficiently carries out the SRM approach
Issues:
There is a difficulty in calculating the VC dimension of a class of functions
In practice, the upper bound is often very loose
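SRM model choice with the bound above can be sketched as follows (Python; the training errors and VC dimensions of the three nested models are invented for illustration):

```python
import math

def vc_bound(train_err, h, N, eta=0.05):
    # Err <= err + sqrt((h/N) * (ln(2N/h) + 1) - ln(eta)/N), w.p. >= 1 - eta
    penalty = (h / N) * (math.log(2 * N / h) + 1) - math.log(eta) / N
    return train_err + math.sqrt(penalty)

# Nested models of increasing VC dimension; SRM picks the smallest bound
N = 1000
models = [(5, 0.20), (20, 0.12), (100, 0.10)]   # (h, training error)
bounds = {h: vc_bound(err, h, N) for h, err in models}
best_h = min(bounds, key=bounds.get)
print(best_h, bounds[best_h])
```

Although the richer models have lower training error, their capacity penalty grows with h, so here SRM prefers the simplest class, illustrating how loose the bound can be in practice.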
22. Sample re-use: cross-validation, bootstrapping
These methods directly (and quite accurately) estimate the average generalization error
The extra-sample error is evaluated rather than the in-sample one (test input vectors do not need to coincide with training ones)
They can be used with any loss function, and with nonlinear, adaptive fitting techniques
However, they may underestimate the true error for such fitting methods as trees
23. Cross-validation
Probably the simplest and most widely used method
However, it is time-consuming
The CV procedure looks as follows:
1. Split the data into K roughly equal-sized parts
2. For the k-th part, fit the model y^{−k}(x) to the other K − 1 parts
3. Then the cross-validation estimate of the prediction error is
CV = (1/N) Σ_{i=1}^N L(t_i, y^{−k(i)}(x_i))
The case K = N (leave-one-out cross-validation) is roughly unbiased, but can have high variance
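The three steps can be written from scratch (Python with squared-error loss; later slides do the same thing in R via the bootstrap package's crossval(), so the function below, its name, and the toy data are only an illustrative sketch):

```python
import numpy as np

def k_fold_cv(x, t, fit, predict, K=10, rng=None):
    # CV = (1/N) * sum_i L(t_i, y^{-k(i)}(x_i)) with squared-error loss
    if rng is None:
        rng = np.random.default_rng(0)
    N = len(t)
    folds = np.array_split(rng.permutation(N), K)   # step 1: K parts
    total = 0.0
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit(x[train], t[train])             # step 2: fit y^{-k}
        total += np.sum((t[test] - predict(model, x[test])) ** 2)
    return total / N                                # step 3

# Toy usage: straight-line fit to illustrative synthetic data
rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 100)
t = 3 * x + rng.normal(0, 0.1, 100)
cv = k_fold_cv(x, t, fit=lambda x, t: np.polyfit(x, t, 1), predict=np.polyval)
print(cv)
```

For this well-specified linear model the CV estimate lands near the noise variance, as expected.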
24. Cross-validation (cont.)
In practice, 5- or 10-fold cross-validation is recommended
CV tends to overestimate the true prediction error on small datasets
Often the one-standard-error rule is used with CV. See example:
We choose the most parsimonious model whose error is no more than one standard error above the error of the best model
A model with p = 9 would be chosen
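The one-standard-error rule can be made explicit. In the sketch below (Python) the per-model CV errors and standard errors are made-up numbers, chosen so that, as on the slide, the model with p = 9 is selected:

```python
import numpy as np

# Hypothetical CV error and its standard error for models of size p
p  = np.array([1, 3, 5, 7, 9, 11, 13])
cv = np.array([0.80, 0.55, 0.47, 0.445, 0.43, 0.405, 0.41])
se = np.full(p.shape, 0.03)

best = cv.argmin()                      # model with the lowest CV error
threshold = cv[best] + se[best]         # one standard error above the best
# most parsimonious model whose error stays under the threshold
chosen_p = p[np.nonzero(cv <= threshold)[0][0]]
print(chosen_p)
```

The raw minimum sits at p = 11, but the rule trades that marginal gain for the simpler p = 9 model.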
25. Bootstrapping
A general method for assessing statistical accuracy
Given a training set, the bootstrapping procedure steps are:
1. Randomly draw datasets with replacement from it; each sample is of the same size as the original one
2. This is done B times, producing B bootstrap datasets
3. Fit the model to each of the bootstrap datasets
4. Examine the prediction error using the original training set as a test set:
Êrr_boot = (1/N) Σ_{i=1}^N (1/|C^{−i}|) Σ_{b ∈ C^{−i}} L(t_i, y^{*b}(x_i))
where C^{−i} is the set of indices of the bootstrap samples that do not contain observation i
To alleviate the upward bias, the .632 estimator is used:
Êrr^{(.632)} = 0.368 err + 0.632 Êrr_boot
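Steps 1–4 and the .632 correction can be sketched as follows (Python with squared-error loss; the slides' own R version using boot() comes later, so the function name and the toy data here are illustrative assumptions only):

```python
import numpy as np

def boot632(x, t, fit, predict, B=200, rng=None):
    # Leave-one-out bootstrap error Err_boot and the .632 correction
    if rng is None:
        rng = np.random.default_rng(3)
    N = len(t)
    losses = [[] for _ in range(N)]   # losses[i]: L(t_i, y*b(x_i)), b in C^{-i}
    for _ in range(B):                            # steps 1-3
        idx = rng.integers(0, N, N)               # draw with replacement
        model = fit(x[idx], t[idx])
        oob = np.setdiff1d(np.arange(N), idx)     # observations not in sample b
        for i in oob:
            losses[i].append((t[i] - predict(model, x[i])) ** 2)
    err_boot = np.mean([np.mean(l) for l in losses if l])    # step 4
    err_train = np.mean((t - predict(fit(x, t), x)) ** 2)
    return 0.368 * err_train + 0.632 * err_boot   # the .632 estimator

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, 80)
t = 2 * x + rng.normal(0, 0.1, 80)
est = boot632(x, t, fit=lambda x, t: np.polyfit(x, t, 1), predict=np.polyval)
print(est)
```

Averaging only over bootstrap fits that exclude observation i keeps the estimate honest, and the 0.368/0.632 mix pulls it back toward the (optimistic) training error.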
27. http://r-project.org
Free software environment for statistical computing and graphics
R packages for machine learning and data mining: kernlab, rpart, randomForest, animation, gbm, tm, etc.
R packages for evaluation: bootstrap, boot
RStudio IDE
28. Housing dataset at UCI Machine Learning repository
http://archive.ics.uci.edu/ml/datasets/Housing
Housing values in suburbs of Boston
506 instances, 13 attributes + 1 numeric class attribute (MEDV)
29. Loading data in R
> housing <- read.table("~/projects/r/housing.data",
+   header=T)
> attach(housing)
30. Cross-validation example in R
Helper function
Creating a function using crossval() from the bootstrap package
> eval <- function(fit, k=10){
+   require(bootstrap)
+   theta.fit <- function(x,y){lsfit(x,y)}
+   theta.predict <- function(fit,x){cbind(1,x)%*%fit$coef}
+   x <- fit$model[,2:ncol(fit$model)]
+   y <- fit$model[,1]
+   results <- crossval(x,y,theta.fit,theta.predict,ngroup=k)
+   squared.error <- sum((y-results$cv.fit)^2)/length(y)
+   cat("Cross-validated squared error =", squared.error, "\n")}
31. Cross-validation example in R
Model assessment
> fit <- lm(MEDV~., data=housing)   # a linear model that uses all the attributes
> eval(fit)
Cross-validated squared error = 23.15827
> fit <- lm(MEDV~ZN+NOX+RM+DIS+RAD+TAX+PTRATIO+B+LSTAT+CRIM+CHAS,
+   data=housing)                   # less complex model
> eval(fit)
Cross-validated squared error = 23.24319
> fit <- lm(MEDV~RM, data=housing)  # too simple a model
> eval(fit)
Cross-validated squared error = 44.38424
32. Bootstrapping example in R
Helper function
Creating a function using boot() from the boot package
> sqer <- function(formula, data, indices){
+   d <- data[indices,]   # boot() supplies the resampled indices
+   fit <- lm(formula, data=d)
+   return(sum(fit$residuals^2)/length(fit$residuals))
+ }
33. Bootstrapping example in R
Model assessment
> results <- boot(data=housing, statistic=sqer, R=1000,
+   formula=MEDV~.)   # 1000 bootstrapped datasets
> print(results)
Bootstrap Statistics :
     original       bias    std. error
t1*  21.89483    -0.76001     2.296025
> results <- boot(data=housing, statistic=sqer, R=1000,
+   formula=MEDV~ZN+NOX+RM+DIS+RAD+TAX+PTRATIO+B+LSTAT+CRIM+CHAS)
> print(results)
Bootstrap Statistics :
     original        bias    std. error
t1*  22.88726  -0.5400892     2.744437
> results <- boot(data=housing, statistic=sqer, R=1000,
+   formula=MEDV~RM)
> print(results)
Bootstrap Statistics :
     original        bias    std. error
t1*  43.60055  -0.3379168     5.407933
34. Resources
T. Hastie, R. Tibshirani, J. Friedman. The Elements of Statistical Learning, 2008
Stanford Engineering Everywhere CS229 Machine Learning, Handouts 4 and 5
http://videolectures.net/stanfordcs229f07_machine_learning/