4. A model is a representation or abstraction of a process or a system.
Creating a model helps us to:
* Define a problem
* Organize our thoughts
* Understand the data
* Communicate and test that understanding
* Make predictions
The main aim of model creation is to define the problem so that
the important details become visible and the irrelevant details
fall away.
While constructing a model, one should keep in mind what type of
information is needed from it.
6. The basic steps in model creation are:
Model selection
Model fitting
Model validation
7. Model selection
In this step, plots of the data, process knowledge and assumptions
about the process are used to determine the form of the model to fit
to the data.
Model fitting
Using the selected model and any available information about
the data, an appropriate model-fitting method is used to estimate
the unknown parameters in the model.
Model validation
Once the parameter estimates have been made, the model is
carefully assessed to see whether the underlying assumptions of the
analysis appear plausible.
8. If the assumptions seem valid, the model can be used to answer the
scientific or engineering questions that prompted the modelling
effort.
If model validation identifies problems with the current model,
the modelling process is repeated, using information from the
model validation step to select and/or fit an improved model.
The flow chart on the next slide shows the basic model-fitting
sequence, with the related data collection steps integrated into
the model-building process.
9. [Flow chart: basic model-fitting sequence]
10. To obtain a reasonable model, the collected data must be
properly understood before they are used.
The raw data must be processed and then studied to see how
a model can be fitted to them.
It must also be checked whether any correction is needed and, if so,
how it can be applied.
This is discussed below under data analysis.
12. Data analysis is the process of inspecting, cleaning and transforming
data to highlight useful information, suggest conclusions and
support decision making.
It has multiple facets and approaches, encompassing diverse
techniques under a variety of names.
Whatever the case, the first step is to collect the data from the
various sources and enter them into a single data file.
The most convenient layout is the variables-in-column format.
In this format the variable names are the column headings and the
values of the variables are in the rows.
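A minimal sketch of reading a variables-in-column file, assuming the data file is plain CSV text. The helper name `read_columns` is hypothetical; the three rows are the first three rows of the load cell data listed in the case study.

```python
import csv
import io

# Variables-in-column format: the first row holds the variable names,
# each later row holds one observation.
raw = """load,deflection
150000,0.11019
300000,0.21956
450000,0.32949
"""

def read_columns(text):
    """Return a dict mapping each variable name to its list of float values."""
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(float(value))
    return columns

data = read_columns(raw)
```

Keeping one variable per column means every later step (plots, summaries, fits) can pick a variable by name.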
13. There are different types of data analysis techniques:
**Data mining
**Business intelligence
**Predictive analytics
**Text analytics
In statistical applications, data analysis is divided into:
**Descriptive statistics
**Exploratory data analysis (EDA)
**Confirmatory data analysis (CDA)
14. Data Mining
Focuses on modelling and knowledge discovery for predictive
rather than purely descriptive purposes.
Business Intelligence
Data analysis that relies heavily on aggregation, focusing on
business information.
Predictive Analytics
Focuses on the application of statistical or structural models for
predictive forecasting or classification.
Text Analytics
Applies statistical, linguistic and structural techniques to
extract and classify information from textual sources.
EDA
Focuses on discovering new features in the data.
CDA
Focuses on confirming or falsifying existing hypotheses.
15. The next step is to perform a quality check on the data, where we
typically look for data entry problems, unusual data values,
missing data, etc.
The two most useful tools for this are the scatter plot and the histogram.
16. Scatter plots
**By constructing one for each of the response variables, any data entry
problem is easily identified.
**It reveals relationships or associations between two variables.
**Such relationships manifest themselves as any non-random
structure in the plot.
**It is a plot of the values of Y versus the corresponding values of
X.
**Vertical axis: variable Y, the response variable.
Horizontal axis: variable X, a variable suspected of being related to
the response.
17. **A scatter plot can answer the following questions:
1.Are variables X and Y related?
2.Are variables X and Y linearly related?
3.Are variables X and Y non-linearly related?
4.Does the variation in Y change depending on X?
5.Are there outliers?
**The following are a few examples of scatter plots:
1.No relationship
2.Strong linear (positive correlation)
3.Strong linear (negative correlation)
4.Exact linear (positive correlation)
5.Quadratic relationship
6.Exponential relationship
7.Sinusoidal relationship (damped)
8.Variation of Y doesn't depend on X (homoscedastic)
9.Variation of Y does depend on X (heteroscedastic)
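The question "are X and Y linearly related?" can also be checked numerically. The sketch below computes the Pearson correlation coefficient on small made-up data sets (all values are hypothetical). It also shows why the plot itself still matters: a perfectly quadratic relationship can have a correlation of zero.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
exact_positive = [2.0 * x + 1.0 for x in xs]   # exact linear, positive slope
exact_negative = [-3.0 * x + 4.0 for x in xs]  # exact linear, negative slope
quadratic = [(x - 3.0) ** 2 for x in xs]       # structured, but not linear

r_positive = pearson_r(xs, exact_positive)
r_negative = pearson_r(xs, exact_negative)
r_quadratic = pearson_r(xs, quadratic)
```

Here `r_positive` and `r_negative` sit at the two extremes, while `r_quadratic` is zero even though Y is fully determined by X.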
18. 1)No Relationship
i)For a given value of X, the corresponding values
of Y range all over the plot.
ii)Say X=0.5; then Y ranges from -2 to +2.
iii)The same is true for all other values of X.
iv)This lack of predictability in determining Y
from a given value of X, and the unstructured
appearance of the scatter plot, lead to the conclusion
“No Relationship”.
Fig 1
19. 2)Strong linear (positive correlation) Relationship
i)A straight line fits comfortably through the data,
hence a linear relationship exists.
ii)The scatter about the line is quite small, so there is
a strong linear relationship.
iii)The slope of the line is positive.
iv)Small values of X correspond to small values
of Y, and large values of X correspond to large
values of Y.
v)So, there is a positive correlation between X & Y.
Fig 2
20. 3)Strong linear (negative correlation) Relationship
i)This is the same as the strong linear case above,
ii)except that the line has a negative slope.
iii)Small values of X correspond to large values
of Y, and large values of X correspond to small
values of Y.
iv)So, there is a negative correlation between
X & Y.
Fig 3
21. 4)Exact linear (positive correlation) Relationship
i)A straight line fits comfortably through the data,
hence there is a linear relationship.
ii)The scatter about the line is zero, so there is perfect
predictability between X and Y.
iii)So, there is an exact linear relationship.
iv)The slope of the line is positive.
v)So there is a positive correlation between X and Y.
Fig 4
22. 5)Quadratic Relationship
i)No imaginable simple straight line could ever
adequately describe the relationship between
X and Y.
ii)Hence a curvilinear or non-linear function is
needed.
iii)The simplest such curvilinear function is the
quadratic model.
iv)Many other curvilinear functions are possible, but
data analysis principles suggest fitting a quadratic
function first.
Fig 5
23. 6)Exponential Relationship
i)Straight-line and quadratic models prove
lacking for large values of X.
ii)Hence a non-linear function beyond the quadratic
is needed.
iii)Among the many other non-linear functions, one of
the simplest is the exponential model,
iv)of the form Y = A + B*exp(C*X) for some constants A, B and C.
v)In this case the exponential function fits so well
that we conclude the relationship is exponential.
Fig 6
24. 7)Variation of Y doesn't depend on X:-
i)The plot reveals a linear relationship between X & Y:
for a given value of X, the predicted value of Y
falls on a line.
ii)The plot further reveals that the variation in Y about
the predicted value is the same regardless of the value
of X.
iii)Statistically, this is referred to as homoscedasticity.
iv)It is very important because it is an underlying
assumption of regression.
v)Its violation leads to parameter estimates with
inflated variances.
vi)If the data are homoscedastic, the usual
regression estimates can be used.
Fig 7
25. 8)Variation of Y does depend on X:-
i)The plot reveals an approximately linear relationship
between X and Y.
ii)It also reveals a statistical condition referred to as
heteroscedasticity,
iii)in which the variation in Y differs depending on
the value of X.
iv)In this example, small values of X yield small
scatter in Y while large values of X result in
large scatter in Y.
v)It complicates the data analysis, but can be
overcome by proper weighting of the data or by
performing a Y variable transformation.
Fig 8
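One rough numerical companion to the two plots above is to fit a straight line and compare the residual scatter for small X with the scatter for large X. This is only an illustrative sketch on made-up data; the function names are hypothetical and the half-and-half split is an assumption, not a formal test.

```python
import statistics

def fit_line(xs, ys):
    """Ordinary least squares intercept and slope for a straight line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def residual_spread_ratio(xs, ys):
    """Residual std. dev. in the upper half of X divided by that in the lower half."""
    intercept, slope = fit_line(xs, ys)
    pairs = sorted(zip(xs, ys))
    residuals = [y - (intercept + slope * x) for x, y in pairs]
    half = len(residuals) // 2
    return statistics.pstdev(residuals[half:]) / statistics.pstdev(residuals[:half])

# Hypothetical data: a line plus deterministic alternating "noise".
xs = [float(i) for i in range(1, 21)]
signs = [1.0 if i % 2 == 0 else -1.0 for i in range(20)]
constant_scatter = [2.0 * x + 0.5 * s for x, s in zip(xs, signs)]
growing_scatter = [2.0 * x + 0.1 * x * s for x, s in zip(xs, signs)]

ratio_constant = residual_spread_ratio(xs, constant_scatter)
ratio_growing = residual_spread_ratio(xs, growing_scatter)
```

A ratio close to 1 is consistent with homoscedastic data; a ratio well above 1 suggests the scatter in Y grows with X.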
26. 9)Outlier:-
i)An outlier is a data point that emanates from a different
model than the rest of the data.
ii)Outlier detection is important for effective
modelling.
iii)If all the data, including the outlier, are used in a linear
regression, the fitted model will be poor virtually
everywhere.
iv)If the outlier is omitted from the fitting process,
the final fit will be excellent almost
everywhere.
Fig 9
27. Once the data quality problems have been identified and fixed,
the location, spread and shape of each response variable
must be estimated.
This is easily done with a combination of histograms and numerical
summary statistics.
28. Histogram
**Graphically summarizes the distribution of univariate data.
**It shows the location (centre), scale (spread), skewness,
presence of outliers and presence of multiple modes in the data.
**These features provide strong indications of the proper
distributional model for the data.
**The most common form is obtained by splitting the range of the data
into equal-sized bins (called classes).
**Then the number of points from the data set that fall into each bin
is counted.
Vertical axis: counts for each bin (frequency)
Horizontal axis: response variable
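The binning step can be sketched as follows. The function name and the data are hypothetical, and the sketch assumes the values are not all identical (otherwise the bin width would be zero).

```python
def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-sized bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin rather than in a new one.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts

values = [1.0, 1.5, 2.0, 2.5, 2.5, 3.0, 3.0, 3.0, 3.5, 4.0, 4.5, 5.0]
counts = histogram_counts(values, 4)  # bins [1,2), [2,3), [3,4), [4,5]
```

The list `counts` holds the heights of the histogram bars, one per bin, from left to right.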
29. **A histogram can answer the following questions:-
What kind of population distribution do the data come from?
Where are the data located?
How spread out are the data?
Are the data symmetric or skewed?
Are there outliers in the data?
**A few examples of histograms:-
1)Normal
2)Symmetric, non-normal, short-tailed
3)Symmetric, non-normal, long-tailed
4)Symmetric and bimodal
5)Bimodal mixture of 2 normals
6)Skewed (non-symmetric) right
7)Skewed (non-symmetric) left
8)Symmetric with outlier
30. 1)Normal:-
i)This is the classical bell-shaped, symmetric histogram
with most of the frequency counts bunched in the
middle and the counts dying off in the
tails.
ii)If the histogram indicates a symmetric, moderate-tailed
distribution, the recommended next step is to do
a normal probability plot to confirm approximate
normality.
iii)If the normal probability plot is linear, then
the normal distribution is a good model for the data.
Fig 10
31. 2)Symmetric, Non-normal, Short-tailed:-
i)In a symmetric distribution, the “body” of the distribution
refers to its “centre”, and the “tail”
refers to the extreme regions of the
distribution.
ii)For a short-tailed distribution, the tails approach zero
very fast and commonly have a “sawed-off” look.
iii)If the histogram indicates a symmetric, short-tailed
distribution, the next step is to generate a uniform
probability plot.
iv)If the plot is linear, then the uniform distribution is
an appropriate model for the data.
Fig 11
32. 3)Symmetric, Non-normal, Long-tailed:-
i)For a long-tailed distribution, the tails decline
to zero very slowly.
ii)Hence there is appreciable probability a long way
from the body of the distribution.
iii)If the histogram indicates a symmetric, long-tailed
distribution, the next step is to do a
Cauchy probability plot.
iv)If it is linear, then the Cauchy distribution is an appropriate
model for the data.
Fig 12
33. 4)Symmetric and Bimodal:-
i)The histogram shown illustrates a bimodal (two-peak)
distribution.
ii)The histogram serves as a tool for diagnosing the
bimodality.
iii)In this example the bimodality is caused by an underlying
sinusoid in the data.
iv)If the histogram indicates a symmetric, bimodal
distribution, the next steps are:
Do a lag plot or scatter plot to check for
sinusoidality. If the lag plot is elliptical, then
the data are sinusoidal.
If the data are sinusoidal, a spectral plot is
used to graphically estimate the underlying
sinusoidal frequency.
If the data are not sinusoidal, a
Tukey Lambda PPCC plot may determine the
best-fit symmetric distribution for the data.
Fig 13
34. 5)Bimodal Mixture of 2 Normals:-
i)In this example the bimodality is not due to an
underlying deterministic model; it is due to
a mixture of probability models.
ii)If this is the case, the research challenge
is to determine physically why there are two
similar but separate sub-processes.
iii)If the histogram indicates that the data may appropriately
be fitted with a mixture of two normal distributions,
the next steps are:
Fit the normal mixture model using either
least squares or maximum likelihood.
Whichever method is used, the quality
of the fit depends on good starting values.
The histogram can provide initial estimates for the location
and scale parameters of the two normal
distributions.
Both Dataplot and R can be used to fit a
mixture of two normals.
Fig 14
35. 6)Skewed (Non-Normal) Right:-
i)A skewed distribution is one in which there is no
mirror imaging.
ii)One tail of the distribution is considerably
longer or more drawn out than the other tail.
iii)Skewed right means the long tail is on the right side.
iv)For a skewed distribution there is no single “centre”, but
several typical-value metrics are often used:
mode, mean and median.
v)Skewed data often arise because of a lower or upper
bound on the data.
vi)Data that have a lower bound are often skewed right.
vii)If the histogram indicates skewed-right data, the next
steps are:
Quantitatively summarize the data.
Determine the best-fit distribution
from Weibull, gamma, chi-square and lognormal.
Consider a normalizing transformation such as the
Box-Cox transformation.
Fig 15
36. 7)Skewed (Non-Normal) Left:-
i)The issues for skewed-left data are similar
to those for skewed-right data.
ii)Skewed left means the long tail is on the left side.
iii)Data that have an upper bound are often skewed left.
iv)Data collected in scientific and engineering applications
often have a lower bound of zero, so skewed-right data
tend to be the more common case.
Fig 16
37. 8)Symmetric with Outlier:-
i)A symmetric distribution is one in which the two halves of the
histogram appear as mirror images of each other.
ii)This example is symmetric with the exception of
outlying data near Y=9.45.
iii)An outlier is a data point that comes from a
distribution different from the bulk of the data.
iv)All outliers should be taken seriously and
investigated for explanations.
v)Outliers are our best friends: they are trying to tell
us something, and we shouldn’t stop until we are
comfortable explaining each outlier.
vi)If the histogram shows outliers, then:
Graphically check for outliers by generating a
box plot.
Quantitatively check for outliers by carrying out
Grubbs' test.
Fig 17
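Both checks can be sketched with the standard library. The data are made up, the function names are hypothetical, and the 1.5 x IQR fences are the usual box plot convention, assumed here. The sketch only computes the Grubbs' statistic G; deciding significance still requires comparing G with a tabulated critical value for the sample size, which is not reproduced here.

```python
import statistics

def grubbs_statistic(values):
    """Largest absolute deviation from the mean, in sample standard deviations."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return max(abs(v - mean) for v in values) / sd

def box_plot_outliers(values):
    """Values lying beyond the 1.5 * IQR fences used by a box plot."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical measurements with one suspicious value.
values = [10.1, 10.2, 10.2, 10.3, 10.3, 10.3, 10.4, 10.4, 10.5, 14.0]
g = grubbs_statistic(values)
flagged = box_plot_outliers(values)
```

Here the box plot rule flags the single stray value, and `g` is large because that value lies far from the mean relative to the spread.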
38. Types of data
Quantitative data
*Often a continuous decimal number given to a specified number
of significant digits.
*Sometimes a whole counting number.
Categorical data
*Each value is one of several categories.
Qualitative data
*Data such as pass/fail, or the presence or lack of a characteristic.
39. The following software is available for data analysis:-
ROOT - C++ data analysis framework.
PAW - FORTRAN/C data analysis framework.
JHepWork - Java (multi-platform) data analysis framework.
Data Applied - an online data mining and data visualization tool.
Zeptoscope Basic - interactive Java-based plotter.
GeNIe - discovery of causal relationships from data, learning and
inference.
ANTz - C realtime 3D visualization.
R - a programming language and software environment
for statistical computing and graphics.
41. A mathematical model is a description of a system using mathematical
concepts and knowledge.
The process of developing a mathematical model is called
mathematical modelling.
These models are used not only in the natural sciences and engineering
disciplines but also in social sciences such as economics, psychology,
sociology and political science.
They are used most extensively by physicists, engineers, statisticians,
operations research analysts and economists.
A model helps to explain a system, to study the effects of its different
components and to make predictions about its behaviour.
42. The following are some types of mathematical models:
Dynamical systems
Statistical models
Differential models
Game theoretic models
Logical models
43. Dynamical System
A concept in mathematics in which a fixed rule describes the time
dependence of a point in a geometrical space.
Examples: the swinging of a clock pendulum, the flow of water in a pipe
and the number of fish each springtime in a lake.
Topics in dynamical systems include:
*Linear dynamical system
*Local dynamics
*Bifurcation theory
*Ergodic systems
*Multidimensional generalisation
44. Statistical Models
A statistical model is a formalization of relationships between
variables in the form of mathematical equations.
It describes how one or more random variables are related to
one or more other variables.
The model is statistical if the variables are related not
deterministically but stochastically.
Formally, it is a collection of probability distribution functions or
probability density functions.
45. On the basis of whether the parameters are finite- or infinite-dimensional:
*Parametric model
*Non-parametric model
*Semi-parametric model
According to the number of endogenous variables
and the number of equations:
*Complete models
*Incomplete models
46. Other statistical models are:
1)General linear model
*Restricted to continuous dependent variables.
*A statistical linear model.
*Generally written as Y = XB + U, where
Y is a matrix with a series of multivariate measurements,
X is the design matrix,
B is a matrix containing the parameters to be estimated,
U is a matrix containing the errors.
*It incorporates
ANOVA, ANCOVA, MANOVA, MANCOVA, ordinary
linear regression, the t-test and the F-test.
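A minimal sketch of estimating B in Y = XB + U by least squares, for the simplest case where Y has a single column. It solves the normal equations (X'X)B = X'Y with a small Gaussian elimination routine; the helper names and the error-free data are hypothetical, and the sketch assumes the columns of X are linearly independent.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x

def least_squares(design, y):
    """Estimate B by solving the normal equations (X'X) B = X'Y."""
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)

# Hypothetical data generated from Y = 1 + 2*x1 - 3*x2 with no error term.
x1 = [0.0, 1.0, 2.0, 3.0, 4.0]
x2 = [1.0, 0.0, 2.0, 1.0, 3.0]
design = [[1.0, a, b] for a, b in zip(x1, x2)]  # design matrix X, first column is the intercept
y = [1.0 + 2.0 * a - 3.0 * b for a, b in zip(x1, x2)]
b_hat = least_squares(design, y)
```

Because the data contain no error, `b_hat` recovers the generating coefficients up to rounding.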
47. 2)Generalized linear model.
*A flexible generalisation of ordinary linear regression that allows
for response variables that have a distribution other than the normal.
*It generalizes linear regression by allowing the linear model
to be related to the response variable via a link function and
by allowing the magnitude of the variance of each measurement
to be a function of its predicted value.
*Model components are:
A probability distribution from the exponential family.
A linear predictor η = Xβ.
A link function g such that E(Y) = μ = g⁻¹(η).
*Examples:
General linear models, linear regression, binomial data,
multinomial regression, count data, clustered data,
generalised additive models, logistic regression.
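To make the roles of η, g and μ concrete, the sketch below uses the logit link, which is the link commonly used in logistic regression. The coefficient values are hypothetical and nothing is being fitted here; the code only shows how a linear predictor is mapped to a mean.

```python
import math

def linear_predictor(x_row, beta):
    """eta = x . beta for one observation."""
    return sum(x * b for x, b in zip(x_row, beta))

def logit(mu):
    """Link function g: maps a mean in (0, 1) to the real line."""
    return math.log(mu / (1.0 - mu))

def inverse_logit(eta):
    """Inverse link: maps the linear predictor back to a mean in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

beta = [-1.0, 0.5]   # hypothetical coefficients: intercept and one slope
x_row = [1.0, 2.0]   # constant term and one predictor value
eta = linear_predictor(x_row, beta)  # -1 + 0.5 * 2
mu = inverse_logit(eta)              # E(Y) for this observation
```

With these numbers η is 0, so the expected value μ is 0.5; applying `logit` to μ returns η again.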
48. 3)Multilevel model.
*Also called nested models, mixed models,
split-plot designs or random parameter models.
*These are statistical models of parameters that vary at more
than one level.
*They are generalizations of linear models, although they can also
extend to non-linear models.
*They are particularly appropriate for research designs where the
data for participants are organized at more than one level.
*They can be used with many levels, but two levels is the most common:
Level 1 regression equation
Level 2 regression equation
49. Level 1 regression equation.
Yij = β0j + β1j(X1ij) + β2j(X2ij) + eij
Yij: score on the dependent variable for an individual
observation at Level 1
(subscript i refers to the individual case, subscript j refers to the group).
Xij: Level 1 predictor.
β0j: intercept of the dependent variable in group j (Level 2).
β1j: slope for the relationship in group j (Level 2) between the
Level 1 predictor and the dependent variable.
eij: random errors of prediction for the Level 1 equation
(sometimes also referred to as rij).
50. Level 2 regression equation
The dependent variables are the intercepts and
the slopes for the independent variables at
Level 1 in the groups of Level 2.
β0j = γ00 + γ01Wj + u0j
β1j = γ10 + u1j
γ00: overall intercept.
This is the grand mean of the scores on the dependent variable
across all the groups when all the predictors are equal to 0.
Wj: Level 2 predictor.
γ01: overall regression coefficient, or the slope,
between the dependent variable and the Level 2 predictor.
u0j: random error component for the deviation
of the intercept of a group from the overall intercept.
γ10: overall regression coefficient, or the slope,
between the dependent variable and the Level 1 predictor.
u1j: error component for the slope
(meaning the deviation of the group slopes from the overall slope).
51. *Types of multilevel models:
Random intercepts model
Random slopes model
Random intercepts and slopes model
Developing a multilevel model
52. 4)Structural equation model.
*A statistical technique for testing and estimating causal relations
using a combination of statistical data and qualitative causal
assumptions.
*Allows both confirmatory and exploratory modelling,
*meaning it is suited to both theory testing and theory development.
*It can construct latent variables: variables which
are not measured directly but are estimated in the model from
several measured variables.
*The steps involved are:
Model specification
Estimation of free parameters
Assessment of model and model fit
Model modification
Sample size and power
Interpretation and communication
53. Differential Model.
*A differential equation is a mathematical equation for an unknown
function of one or several variables that relates the values of the
function itself and its derivatives of various orders.
*An example of modelling a real-world problem using a differential
equation is determining the velocity of a ball sailing
through the air, considering only gravity and air resistance.
*Types of differential equations are:
Ordinary and partial
Linear and non-linear
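The ball example can be sketched numerically. The slides do not give the form of the air resistance, so the sketch assumes drag proportional to velocity, which gives the ordinary, linear equation dv/dt = -g - c*v (upward positive, c a hypothetical drag coefficient per unit mass). It integrates the equation with the explicit Euler method and compares the result with the known closed-form solution of that equation.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2

def simulate_velocity(v0, drag_per_mass, t_end, dt=0.001):
    """Integrate dv/dt = -G - drag_per_mass * v with the explicit Euler method."""
    steps = round(t_end / dt)
    v = v0
    for _ in range(steps):
        v += dt * (-G - drag_per_mass * v)
    return v

def exact_velocity(v0, drag_per_mass, t):
    """Closed-form solution of the same linear equation."""
    terminal = -G / drag_per_mass  # velocity at which drag balances gravity
    return terminal + (v0 - terminal) * math.exp(-drag_per_mass * t)

v_numeric = simulate_velocity(20.0, 0.5, 2.0)   # thrown upward at 20 m/s
v_exact = exact_velocity(20.0, 0.5, 2.0)
v_late = simulate_velocity(0.0, 0.5, 40.0)      # long after release
```

After a long time the simulated velocity settles at the terminal velocity -G / c, where air resistance balances gravity.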
54. Game theoretic model.
*Game theory is the study of strategic decision making,
*that is, the study of mathematical models of conflict and
co-operation between intelligent rational decision makers.
*Representations of games:
Extensive form
Normal form
Characteristic function form
Partition function form
55. *Types of games:
Cooperative or non-cooperative
Symmetric and asymmetric
Zero-sum and non-zero-sum
Perfect information and imperfect information
Combinatorial games
Infinitely long games
Discrete and continuous games
Stochastic outcomes
Metagames
Differential games
56. Logical Model:
*The study of mathematical models using the tools of mathematical logic.
*Types are:
Finite model theory
First-order logic
Probabilistic logic (e.g. fuzzy logic)
Artificial neural networks (inspired by biological neural networks)
57. The most commonly used methods to create a model are:
**Linear Least Squares Regression
**Nonlinear Least Squares Regression
**Weighted Least Squares Regression
**LOESS
58. Linear Least Squares Regression.
i)It is by far the most widely used modelling method.
ii)It is what most people mean when they say they have used “regression”,
“linear regression” or “least squares” to fit a model.
iii)Not only is it the most widely used method, it has also been adapted
to a broad range of situations that are outside its direct scope.
iv)Linear least squares regression can be used to fit the data with
any function of the form
f(x; β) = β0 + β1x1 + β2x2 + ...
in which:
1)each explanatory variable in the function is multiplied by an
unknown parameter,
2)there is at most one unknown parameter with no corresponding
explanatory variable,
3)all of the individual terms are summed to produce the final
function value.
59. v)The term ‘linear’ is used, even though the function may not be a
straight line, because if the unknown parameters are considered
to be variables and the explanatory variables are considered to be
known coefficients of those “variables”,
vi)then the problem becomes a system of linear equations that can be
solved for the unknown parameters.
vii)Linear models are not limited to straight lines or planes,
but include a fairly wide range of shapes, for example:
**simple quadratic curve: f(x; β) = β0 + β1x + β2x²
**straight-line model in log(x): f(x; β) = β0 + β1 ln(x)
**polynomial in sin(x): f(x; β) = β0 + β1 sin(x) + β2 sin²(x)
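The point that "linear" refers to the parameters, not to the shape of the curve, can be illustrated with the straight-line model in log(x). In the sketch below (hypothetical, error-free data; the function name is an assumption) the explanatory variable is first transformed to u = ln(x), after which the ordinary closed-form least squares formulas for a straight line apply unchanged.

```python
import math

def fit_simple(us, ys):
    """Least squares estimates (b0, b1) for the model y = b0 + b1 * u."""
    n = len(us)
    mean_u = sum(us) / n
    mean_y = sum(ys) / n
    suu = sum((u - mean_u) ** 2 for u in us)
    suy = sum((u - mean_u) * (y - mean_y) for u, y in zip(us, ys))
    b1 = suy / suu
    return mean_y - b1 * mean_u, b1

xs = [1.0, 2.0, 4.0, 8.0, 16.0]
ys = [3.0 + 2.0 * math.log(x) for x in xs]  # curved in x, linear in the parameters

# Transform the explanatory variable, then fit exactly as for a straight line.
b0, b1 = fit_simple([math.log(x) for x in xs], ys)
```

The fitted curve is not a straight line when plotted against x, yet the estimates come from a linear least squares fit.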
60. viii)Advantages of linear least squares regression:
a)It is a primary tool for process modelling because of its
effectiveness and completeness.
b)Many processes are inherently linear or, over short
ranges, can be well approximated by a linear model.
c)It makes very efficient use of the data, and good results can be
obtained from relatively small data sets.
d)The statistical intervals it produces can be used to give clear
answers to scientific and engineering questions.
ix)Disadvantages of linear least squares regression:
a)It becomes difficult to find a linear model that fits the data well
as the range of the data increases.
b)There are limitations in the shapes that linear models can assume
over long ranges, and extrapolation properties may be poor.
c)It is very sensitive to the presence of unusual data points in the
data used to fit a model.
d)One or two outliers can seriously skew the results of a linear
least squares analysis.
61. Nonlinear Least Squares Regression:
i)It extends linear least squares regression for use with a much larger
and more general class of functions.
ii)Almost any function that can be written in closed form can be
incorporated in a nonlinear regression model.
iii)There are very few limitations on the form of the functional part.
iv)The way in which the unknown parameters in the function are
estimated, however, is conceptually the same as it is in linear
least squares regression.
v)A nonlinear model is any model of the basic form
y = f(x; β) + ε
in which,
1)the functional part of the model is not linear with respect to
the unknown parameters,
2)the method of least squares is used to estimate the values
of the unknown parameters,
62. 3)the function is smooth with respect to the unknown parameters,
4)the least squares criterion used to obtain the parameter estimates
has a unique solution.
vi)These last two criteria are not essential to the definition but are of
practical importance.
vii)Some examples of nonlinear models are:
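As one illustration (chosen here, not taken from the slides), the model f(x; b1, b2) = b1*exp(-b2*x) is nonlinear in b2. The sketch below fits it by least squares with a Gauss-Newton iteration, one common iterative scheme, used here as an assumption since no optimizer is named in the slides. The data are hypothetical and error-free, and user-supplied starting values are required.

```python
import math

def gauss_newton(xs, ys, b1, b2, iterations=20):
    """Least squares fit of y = b1 * exp(-b2 * x), starting from (b1, b2)."""
    for _ in range(iterations):
        # Partial derivatives of the model with respect to b1 and b2.
        j1 = [math.exp(-b2 * x) for x in xs]
        j2 = [-b1 * x * math.exp(-b2 * x) for x in xs]
        # Residuals at the current parameter values.
        r = [y - b1 * e for y, e in zip(ys, j1)]
        # Normal equations (J'J) delta = J'r, solved by Cramer's rule.
        a11 = sum(u * u for u in j1)
        a12 = sum(u * v for u, v in zip(j1, j2))
        a22 = sum(v * v for v in j2)
        g1 = sum(u * ri for u, ri in zip(j1, r))
        g2 = sum(v * ri for v, ri in zip(j2, r))
        det = a11 * a22 - a12 * a12
        b1 += (a22 * g1 - a12 * g2) / det
        b2 += (a11 * g2 - a12 * g1) / det
    return b1, b2

xs = [0.5 * i for i in range(9)]              # 0.0, 0.5, ..., 4.0
ys = [5.0 * math.exp(-0.7 * x) for x in xs]   # generated with b1 = 5, b2 = 0.7
b1_hat, b2_hat = gauss_newton(xs, ys, b1=4.0, b2=0.5)
```

Each iteration solves a linear least squares problem for the parameter update, which is why the method is conceptually close to linear least squares. With poor starting values the iteration may fail to converge.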
63. viii)Advantages of Nonlinear Least Squares Method:
a)The biggest advantage is the broad range of functions that
can be fit.
b)Many scientific and engineering processes can be described well by
linear models, but there are many others that are
inherently nonlinear, such as the strengthening of concrete as it cures.
c)Being a “least squares” procedure, this method has the same
advantages as linear least squares regression.
d)In most cases the probabilistic interpretation of the intervals
produced by this method is only approximately correct,
but the intervals still work very well in practice.
ix)Disadvantages of Nonlinear Least Squares Method:
a)Iterative optimization procedures are needed to compute the
parameter estimates.
b)The use of iterative procedures requires the user to provide
starting values for the unknown parameters before the software
can begin the optimization.
c)Fewer model validation tools are available for the detection of
outliers than in linear regression.
64. Weighted Least Squares Regression:
i)Unlike linear and nonlinear least squares regression, it is not
associated with a particular type of function used to describe
the relationship between the process variables.
ii)Instead, it reflects the behaviour of the random errors in the model and
can be used with functions that are either linear or nonlinear
in the parameters.
iii)It works by incorporating extra nonnegative constants, or weights,
associated with each data point into the fitting criterion.
iv)The weight for each observation is given relative to the weights
of the other observations.
v)It is an efficient method that makes good use of small data sets.
65. vi)It shares the advantages of the least squares methods discussed above.
vii)The biggest disadvantage is that the method is based on the assumption
that the weights are known exactly. This is almost never the case in real
applications, so estimated weights must be used instead.
viii)Weighted least squares should be used when the weights can be
estimated precisely relative to one another.
ix)It is also sensitive to outliers: if the weighting actually increases
the influence of an outlier, the results of the
analysis may be far inferior to an unweighted least squares analysis.
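A minimal sketch of weighted least squares for a straight line, where the criterion being minimised is the sum of w*(y - b0 - b1*x)² over the data points. The function name, data and weights are hypothetical.

```python
def weighted_line_fit(xs, ys, weights):
    """Intercept and slope minimising sum(w * (y - b0 - b1 * x) ** 2)."""
    total = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, xs)) / total
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(weights, xs, ys))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 20.0]   # y = 1 + 2x, except for an unreliable last point

unweighted = weighted_line_fit(xs, ys, [1.0, 1.0, 1.0, 1.0, 1.0])
downweighted = weighted_line_fit(xs, ys, [1.0, 1.0, 1.0, 1.0, 0.0])

# Only the relative sizes of the weights matter.
fit_a = weighted_line_fit(xs, ys, [2.0, 2.0, 2.0, 2.0, 1.0])
fit_b = weighted_line_fit(xs, ys, [20.0, 20.0, 20.0, 20.0, 10.0])
```

Giving the last point zero weight recovers the line through the other four points, while multiplying all weights by the same constant leaves the fit unchanged.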
66. LOESS:
LOESS is one of many "modern" modelling methods that
build on "classical" methods, such as linear and nonlinear
least squares regression.
Modern regression methods are designed to address
situations in which the classical procedures do not perform
well.
LOESS is (somewhat) more descriptively known as
locally weighted polynomial regression.
At each point in the data set a low-degree polynomial is fit to
a subset of the data, with explanatory variable values near
the point whose response is being estimated.
The polynomial is fit using weighted least squares, giving
more weight to points near the point whose response is
being estimated and less weight to points further away.
The value of the regression function for the point is then
obtained by evaluating the local polynomial using the
explanatory variable values for that data point.
67. The subsets of data used for each weighted least squares fit in
LOESS are determined by a nearest neighbours algorithm.
A user-specified input to the procedure called the "bandwidth"
or "smoothing parameter" determines how much of the data is
used to fit each local polynomial.
The smoothing parameter, q, is a number between (d+1)/n
and 1, with d denoting the degree of the local polynomial.
The value of q is the proportion of data used in each fit.
The subset of data used in each weighted least squares fit is
comprised of the nq (rounded to the next largest integer) points
whose explanatory variables values are closest to the point at
which the response is being estimated.
68. q is called the smoothing parameter because it controls the
flexibility of the LOESS regression function.
Large values of q produce the smoothest functions that wiggle
the least in response to fluctuations in the data.
The smaller q is, the closer the regression function will conform
to the data.
Using too small a value of the smoothing parameter is not
desirable, however, since the regression function will
eventually start to capture the random error in the data.
Useful values of the smoothing parameter typically lie in the
range 0.25 to 0.5 for most LOESS applications.
69. The local polynomials fit to each subset of the data are almost
always of first or second degree, that is, either locally linear (in
the straight line sense) or locally quadratic.
Using a zero degree polynomial turns LOESS into a weighted
moving average.
Such a simple local model might work well for some situations,
but may not always approximate the underlying function well
enough.
Higher-degree polynomials would work in theory, but yield
models that are not really in the spirit of LOESS.
LOESS is based on the ideas that any function can be well
approximated in a small neighbourhood by a low-order
polynomial and that simple models can be fit to data easily.
High-degree polynomials would tend to overfit the data in each
subset and are numerically unstable, making accurate
computations difficult.
70. As mentioned above, the weight function gives the most weight
to the data points nearest the point of estimation and the least
weight to the data points that are furthest away.
The use of the weights is based on the idea that points near
each other in the explanatory variable space are more likely to
be related to each other in a simple way than points that are
further apart.
Following this logic, points that are likely to follow the local model
best influence the local model parameter estimates the most.
Points that are less likely to actually conform to the local model
have less influence on the local model parameter estimates.
71. The traditional weight function used for LOESS is the tri-cube
weight function,
w(x) = (1 - |x|³)³ for |x| < 1, and w(x) = 0 for |x| ≥ 1.
The weight for a specific point in any localized subset of data
is obtained by evaluating the weight function at the distance
between that point and the point of estimation, after scaling
the distance so that the maximum absolute distance over all
of the points in the subset of data is exactly one.
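Putting the pieces together, a LOESS estimate at a single point can be sketched as follows: take the ceil(n*q) nearest points, scale their distances so the largest is one, weight them with the tri-cube function and fit a local straight line by weighted least squares. The function names and data are hypothetical; the sketch uses a locally linear fit (d = 1) and assumes the neighbourhood contains at least three distinct x values.

```python
import math

def tricube(u):
    """Tri-cube weight: (1 - |u|^3)^3 inside the unit interval, 0 outside."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0

def loess_at(x0, xs, ys, q):
    """Locally linear LOESS estimate at x0 using a fraction q of the data."""
    k = math.ceil(q * len(xs))
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x0))[:k]
    max_dist = max(abs(x - x0) for x, _ in nearest)
    w = [tricube((x - x0) / max_dist) for x, _ in nearest]
    # Weighted least squares fit of a straight line to the neighbourhood.
    total = sum(w)
    mean_x = sum(wi * x for wi, (x, _) in zip(w, nearest)) / total
    mean_y = sum(wi * y for wi, (_, y) in zip(w, nearest)) / total
    sxx = sum(wi * (x - mean_x) ** 2 for wi, (x, _) in zip(w, nearest))
    sxy = sum(wi * (x - mean_x) * (y - mean_y) for wi, (x, y) in zip(w, nearest))
    slope = sxy / sxx
    # Evaluate the local polynomial at the point of estimation.
    return mean_y + slope * (x0 - mean_x)

xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]           # hypothetical, exactly linear data
estimate = loess_at(4.5, xs, ys, q=0.5)    # uses the 5 nearest points
```

Repeating this calculation at every point of interest traces out the LOESS curve. Note that the farthest point in each neighbourhood receives weight zero after scaling.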
72. The biggest advantage LOESS has over many other methods is
the fact that it does not require the specification of a function to
fit a model to all of the data in the sample.
Instead the analyst only has to provide a smoothing parameter
value and the degree of the local polynomial.
In addition, LOESS is very flexible, making it ideal for modelling
complex processes for which no theoretical models exist.
These two advantages, combined with the simplicity of the
method, make LOESS one of the most attractive of the modern
regression methods for applications that fit the general
framework of least squares regression but which have a
complex deterministic structure.
73. One disadvantage of LOESS is the fact that it does not
produce a regression function that is easily represented
by a mathematical formula.
This can make it difficult to transfer the results of an
analysis to other people.
In order to transfer the regression function to another
person, they would need the data set and software for
LOESS calculations.
In nonlinear regression, on the other hand, it is only
necessary to write down a functional form in order to
provide estimates of the unknown parameters and the
estimated uncertainty.
Depending on the application, this could be either a major
or a minor drawback to using LOESS
74. Finally, as discussed above, LOESS is a computationally
intensive method.
This is not usually a problem in our current computing
environment, however, unless the data sets being used are
very large.
LOESS is also prone to the effects of outliers in the data set,
like other least squares methods.
There is an iterative, robust version of LOESS that can be
used to reduce LOESS sensitivity to outliers, but extreme
outliers can still overcome even the robust method.
76. The following case studies illustrate model creation:
LOAD CELL CALIBRATION
This case study builds a model that relates a known load applied to a load cell
to the deflection of the cell.
The model is then used to calibrate future cell readings associated
with loads of unknown magnitude.
•Background and Data
•Selection of initial model
•Model fitting-Initial model
•Graphical Residual Analysis-Initial model
•Interpretation of numerical output-Initial model
•Model Refinement
•Model fitting-Model #2
•Graphical Residual Analysis-Model #2
•Interpretation of numerical output-Model #2
•Use of the model for calibration
77. Background and Data:
The data collected in the calibration experiment consist of:
**A known load applied to the load cell and
**The corresponding deflection of the cell from its nominal
position.
Forty measurements were made over a range of loads
from 150,000 to 3,000,000 units.
Data were collected in two sets in order of increasing load.
The systematic run order makes it difficult to determine
whether or not there was any drift in the load cell or
measuring equipment over time.
Assuming there is no drift, however, the experiment should
provide a good description of the relationship between
the load applied to the cell and its response.
78. Deflection Load
-------------------------------------
0.11019 150000
0.21956 300000
0.32949 450000
0.43899 600000
0.54803 750000
0.65694 900000
0.76562 1050000
0.87487 1200000
0.98292 1350000
1.09146 1500000
1.20001 1650000
1.30822 1800000
1.41599 1950000
1.52399 2100000
1.63194 2250000
1.73947 2400000
1.84646 2550000
1.95392 2700000
2.06128 2850000
2.16844 3000000
0.11052 150000
0.22018 300000
0.32939 450000
0.43886 600000
0.54798 750000
0.65739 900000
0.76596 1050000
0.87474 1200000
0.98300 1350000
1.09150 1500000
1.20004 1650000
1.30818 1800000
1.41613 1950000
1.52408 2100000
1.63159 2250000
1.73965 2400000
1.84696 2550000
1.95445 2700000
2.06177 2850000
Analyses used in this case study can be generated by
using Dataplot code and R code.
79. Selection of Initial model:
The first step in analyzing the data is to select a candidate model.
In the case of a measurement system like this one,
a fairly simple function should describe the relationship
between the load and the response of the load cell.
Plotting the data indicates that the hypothesized, simple relationship
between load and deflection is reasonable.
The plot below shows the data.
It indicates that a straight-line model is likely to fit the data.
It does not indicate any other problems, such as presence of
outliers or non-constant standard deviation of the response.
81. Model Fitting-Initial model:
Using software for computing least squares parameter estimates,
the straight-line model, $D = B_0 + B_1 L + \varepsilon$ (with $D$ the
deflection and $L$ the load), is easily fit to the data.
The regression results are shown below. Before trying to interpret
all of the numerical output, however, it is critical to check
that the assumptions underlying the parameter
estimation are met reasonably well.
82. Parameter Estimate Stan. Dev t Value
B0 0.614969E-02 0.7132E-03 8.6
B1 0.722103E-06 0.3969E-09 0.18E+04
Residual standard deviation 0.0021712694
Residual degrees of freedom 38
Lack-of-fit F statistic 214.7464
Lack-of-fit critical value,F0.05,18,20 2.15
83. Graphical Residual Analysis-Initial model:
After fitting a straight line to the data, many people like to
check the quality of the fit with a plot of the data overlaid with
the estimated regression function.
The plot below shows this for the load cell data.
Based on this plot, there is no clear evidence of any
deficiencies in the model.
85. This type of overlaid plot is useful for showing the relationship
between the data and the predicted values from the regression
function; however, it can obscure important detail about the model.
Plots of the residuals, on the other hand, show this detail well, and
should be used to check the quality of the fit.
Graphical analysis of the residuals is the single most important
technique for determining the need for model refinement or for
verifying that the underlying assumptions of the analysis are met.
Residual plots of interest for this model include:
residual Vs predictor value
residual Vs regression function value
residual run order plot
residual lag plot
histogram of residuals
normal probability plot
87. The structure in the relationship between the residuals and
the load clearly indicates that the functional part of the model is
misspecified.
The ability of the residual plot to clearly show this problem, while
the plot of the data did not show it, is due to the difference in scale
between the plots.
The curvature in the response is much smaller than the linear
trend.
Therefore the curvature is hidden when the plot is viewed in the
scale of the data.
When the linear trend is subtracted, however, as it is in the
residual plot, the curvature stands out.
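The effect of subtracting the linear trend can be reproduced with a short Python sketch. It assumes NumPy is available and uses the deflection and load values exactly as they appear in the table above (39 rows as listed here), so the estimates may differ slightly from the tabulated regression output.

```python
import numpy as np

# Deflection readings as listed in the data table above (39 rows).
deflection = np.array([
    0.11019, 0.21956, 0.32949, 0.43899, 0.54803, 0.65694, 0.76562,
    0.87487, 0.98292, 1.09146, 1.20001, 1.30822, 1.41599, 1.52399,
    1.63194, 1.73947, 1.84646, 1.95392, 2.06128, 2.16844,
    0.11052, 0.22018, 0.32939, 0.43886, 0.54798, 0.65739, 0.76596,
    0.87474, 0.98300, 1.09150, 1.20004, 1.30818, 1.41613, 1.52408,
    1.63159, 1.73965, 1.84696, 1.95445, 2.06177,
])
# Loads: two runs in increasing order; the second run as listed stops at 2,850,000.
run = np.arange(150_000, 3_000_001, 150_000)
load = np.concatenate([run, run[:-1]]).astype(float)

# Straight-line least squares fit (np.polyfit returns the highest degree first).
slope, intercept = np.polyfit(load, deflection, 1)
residuals = deflection - (intercept + slope * load)

# Mean residual at a low, a middle, and a high load level.
for level in (150_000, 1_500_000, 2_850_000):
    mean_res = residuals[load == level].mean()
    print(f"load {level:>9,}: mean residual {mean_res:+.5f}")
```

The residuals are a small fraction of the deflection values, which is why the curvature cannot be seen on the scale of the raw data, yet their sign changes systematically from the ends of the load range to the middle.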
89. Further residual diagnostic plots are shown below.
The plots include a run order plot, a lag plot, a histogram,
and a normal probability plot.
Shown in a two-by-two array like this, these plots comprise
a 4-plot of the data that is very useful for checking the
assumptions underlying the model.
91. Interpretation of Numerical Output-Initial Model:
The fact that the residual plots clearly indicate a problem
with the specification of the function describing the
systematic variation in the data means that there is little point
in looking at most of the numerical results from the fit.
The lack-of-fit test can also be used as part of the model
validation.
The numerical results of the fit are shown below.
92. Parameter Estimate Stan. Dev t value
B0 0.614969E-02 0.7132E-03 8.6
B1 0.722103E-06 0.3969E-09 0.18E+04
Residual Standard Deviation 0.0021712694
Residual Degrees of Freedom 38
Lack-of-fit F statistic 214.7464
Lack-of-fit critical value,F0.05,18,20 2.15
93. The lack-of-fit test statistic, 214.7464, clearly indicates that
the functional part of the model is not right.
The critical value for a test having a significance level of
0.05 is 2.15.
Any value greater than the critical value indicates that the
hypothesis of a straight-line model for this data should be
rejected.
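A lack-of-fit statistic of this kind can be computed directly from replicated data. The Python sketch below assumes NumPy is available; the helper name `lack_of_fit` is hypothetical. It uses the data as listed in the table above, which contains 39 rows, so the degrees of freedom come out as (18, 19) and the statistic will not match the tabulated 214.7464 exactly.

```python
import numpy as np

deflection = np.array([
    0.11019, 0.21956, 0.32949, 0.43899, 0.54803, 0.65694, 0.76562,
    0.87487, 0.98292, 1.09146, 1.20001, 1.30822, 1.41599, 1.52399,
    1.63194, 1.73947, 1.84646, 1.95392, 2.06128, 2.16844,
    0.11052, 0.22018, 0.32939, 0.43886, 0.54798, 0.65739, 0.76596,
    0.87474, 0.98300, 1.09150, 1.20004, 1.30818, 1.41613, 1.52408,
    1.63159, 1.73965, 1.84696, 1.95445, 2.06177,
])
run = np.arange(150_000, 3_000_001, 150_000)
load = np.concatenate([run, run[:-1]]).astype(float)

def lack_of_fit(x, y, degree):
    """Lack-of-fit F statistic for a polynomial of the given degree."""
    coeffs = np.polyfit(x, y, degree)
    sse = np.sum((y - np.polyval(coeffs, x)) ** 2)      # total residual SS
    levels = np.unique(x)
    # Pure error: scatter of replicates around their own level mean.
    ss_pure = sum(np.sum((y[x == lv] - y[x == lv].mean()) ** 2) for lv in levels)
    df_pure = len(x) - len(levels)
    # Lack of fit: what remains of the residual SS after removing pure error.
    df_lof = len(levels) - (degree + 1)
    f_stat = ((sse - ss_pure) / df_lof) / (ss_pure / df_pure)
    return f_stat, df_lof, df_pure

# Loads are rescaled to millions only to keep the arithmetic well conditioned.
f_stat, df_lof, df_pure = lack_of_fit(load / 1e6, deflection, degree=1)
print(f"F = {f_stat:.1f} on ({df_lof}, {df_pure}) degrees of freedom")
```

A statistic far above the critical value points to a misspecified functional form, which is the conclusion drawn for the straight-line model.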
94. Model Refinement:
After ruling out the straight line model for these data, the next
task is to decide what function would better describe the
systematic variation in the data.
Reviewing the plots of the residuals versus all potential
predictor variables can offer insight into selection of a new
model, just as a plot of the data can aid in selection of an initial
model.
Iterating through a series of models selected in this way
will often lead to a function that describes the data well.
96. The horseshoe-shaped structure in the plot of the residuals
versus load suggests that a quadratic polynomial might fit the
data well.
Since that is also the simplest polynomial model,
after a straight line, it is the next function to consider.
97. Model Fitting-Model #2:
Based on the residual plots, the function used to
describe the data should be the quadratic polynomial.
The regression results are shown below.
As for the straight-line model, however, it is important
to check that the assumptions underlying the
parameter estimation are met before trying to interpret
the numerical output.
The steps used to complete the graphical residual
analysis are essentially identical to those used for the
previous model.
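As a rough check on the refinement, the straight-line and quadratic fits can be compared with a few lines of Python. This sketch assumes NumPy is available, uses the 39 rows listed in the data table above, and introduces a hypothetical helper `fit_poly`; loads are expressed in millions of units, so the coefficients are on a different scale from those in the regression tables.

```python
import numpy as np

deflection = np.array([
    0.11019, 0.21956, 0.32949, 0.43899, 0.54803, 0.65694, 0.76562,
    0.87487, 0.98292, 1.09146, 1.20001, 1.30822, 1.41599, 1.52399,
    1.63194, 1.73947, 1.84646, 1.95392, 2.06128, 2.16844,
    0.11052, 0.22018, 0.32939, 0.43886, 0.54798, 0.65739, 0.76596,
    0.87474, 0.98300, 1.09150, 1.20004, 1.30818, 1.41613, 1.52408,
    1.63159, 1.73965, 1.84696, 1.95445, 2.06177,
])
run = np.arange(150_000, 3_000_001, 150_000)
load = np.concatenate([run, run[:-1]]).astype(float)

def fit_poly(x, y, degree):
    """Least squares polynomial fit and its residual standard deviation."""
    coeffs = np.polyfit(x, y, degree)            # highest degree first
    resid = y - np.polyval(coeffs, x)
    res_sd = np.sqrt(np.sum(resid**2) / (len(x) - (degree + 1)))
    return coeffs, res_sd

x = load / 1e6                                   # load in millions of units
line_coeffs, line_sd = fit_poly(x, deflection, 1)
quad_coeffs, quad_sd = fit_poly(x, deflection, 2)
print(f"residual SD, straight line: {line_sd:.6f}")
print(f"residual SD, quadratic:     {quad_sd:.6f}")
```

The quadratic term is negative and the residual standard deviation drops by roughly an order of magnitude, in line with the regression tables for the two models.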
98. Quadratic Fit:
Parameter Estimate Stan. Dev t Value
B0 0.673618E-03 0.1079E-03 6.2
B1 0.732059E-06 0.1578E-09 0.46E+04
B2 -0.316081E-14 0.4867E-16 -65.0
Residual standard deviation 0.0002051768
Residual degrees of freedom 37
Lack-of-fit F statistic 0.8107
Lack-of-fit critical value,F0.05,17,20 2.17
99. Graphical Residual Analysis-Model #2:
The data with a quadratic estimated regression function and the
residual plots are shown below.
Fig 24
100. This plot is almost identical to the analogous plot for the straight-
line model, again illustrating the lack of detail in the plot due to the
scale.
In this case, however, the residual plots will show that the
model does fit well.
Fig 25
101. The residuals, randomly scattered around zero, indicate that the
quadratic is a good function to describe these data. There is also
no indication of non-constant variability over the range of loads.
Fig 26
102. This plot also looks good.
There is no evidence of changes in variability across the range
of deflection.
Fig 27
103. All of these residual plots have become satisfactory
simply by changing the functional form of the model.
There is no evidence in the run order plot of any time
dependence in the measurement process, and the lag plot
suggests that the errors are independent.
The histogram and normal probability plot suggest that the
random errors affecting the measurement process are normally
distributed.
104. Interpretation of Numerical Output-Model #2:
The numerical results from the fit are shown below.
For the quadratic model, the lack-of-fit test statistic is 0.8107.
The fact that the test statistic is approximately one indicates
there is no evidence to support a claim that the functional part
of the model does not fit the data.
The test statistic would have had to have been greater than
2.17 to reject the hypothesis that the quadratic model is
correct at the 0.05 significance level.
105. Parameter Estimate Stan. Dev t Value
B0 0.673618E-03 0.1079E-03 6.2
B1 0.732059E-06 0.1578E-09 0.46E+04
B2 -0.316081E-14 0.4867E-16 -65.0
Residual standard deviation 0.0002051768
Residual degrees of freedom 37
Lack-of-fit F statistic 0.8107
Lack-of-fit critical value,F0.05,17,20 2.17
106. From the numerical output, we can also find the regression
function that will be used for the calibration.
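The function, with its estimated parameters, is
$\hat{D} = 0.673618 \times 10^{-3} + 0.732059 \times 10^{-6}\,L - 0.316081 \times 10^{-14}\,L^2$,
where $D$ is the deflection and $L$ is the load.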
All of the parameters are significantly different from zero, as
indicated by the associated t statistics.
The critical value for the t-distribution with 37 degrees of
freedom and 1-α/2=0.975 is 2.026.
Since the absolute values of all of the t statistics are well above
this critical value, we can safely conclude that none of the
parameters is equal to zero.
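The t statistics can be checked by hand, since each one is simply the estimate divided by its standard deviation. The short Python sketch below uses the estimates and standard deviations from the Quadratic Fit table (slide 98) and the critical value 2.026 quoted above; the variable names are illustrative only.

```python
# (estimate, standard deviation) pairs from the Quadratic Fit table.
estimates = {
    "B0": (0.673618e-03, 0.1079e-03),
    "B1": (0.732059e-06, 0.1578e-09),
    "B2": (-0.316081e-14, 0.4867e-16),
}
t_crit = 2.026  # t critical value for 37 degrees of freedom, as quoted above

# t statistic = estimate / standard deviation of the estimate.
t_values = {name: est / sd for name, (est, sd) in estimates.items()}

for name, t in t_values.items():
    verdict = "significant" if abs(t) > t_crit else "not significant"
    print(f"{name}: t = {t:10.1f}  ({verdict})")
```

The comparison uses the absolute value of each statistic, so the negative t value for B2 is treated the same way as the positive ones.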
107. Use of the Model for Calibration:
Now that a good model has been found for these data, it can be used
to estimate load values for new measurements of deflection.
For example, suppose a new deflection value of 1.239722
is observed.
The regression function can be solved for load to determine
an estimated load value without having to observe it directly.
The plot below illustrates the calibration process graphically.
109. From the plot, it is clear that the load that produced the deflection
of 1.239722 should be about 1,750,000, and would certainly lie
between 1,500,000 and 2,000,000.
This rough estimate of the possible load range will be used to
compute the load estimate numerically.
To solve for the numerical estimate of the load associated with
the observed deflection, the observed value is substituted into the
regression function and the equation is solved for load.
Typically this will be done using a root finding procedure in a
statistical or mathematical package.
That is one reason why rough bounds on the value of the load to
be estimated are needed.
110. Even when a rough estimate of the load associated with an
observed deflection is not needed to solve the equation, it serves a
second purpose: determining which solution to the equation is correct
if there are multiple solutions.
The quadratic calibration equation, in fact, has two solutions.
As we saw from the plot on the previous page, however, there is
really no confusion over which root of the quadratic function is the
correct load.
Essentially, the load value must be between 150,000 and 3,000,000
for this problem.
The other root of the regression equation and the new deflection
value correspond to a load of over 229,899,600.
Looking at the data at hand, it is safe to assume that a load of
229,899,600 would yield a deflection much greater than 1.24.
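Because the calibration function is quadratic, the two roots can also be obtained in closed form. The following Python sketch uses the parameter estimates from the quadratic fit table and the deflection value 1.239722 from the example; it relies only on the standard library, and the variable names are illustrative.

```python
import math

# Parameter estimates from the quadratic fit table.
b0, b1, b2 = 0.673618e-03, 0.732059e-06, -0.316081e-14
observed_deflection = 1.239722

# Solve b2*L^2 + b1*L + (b0 - deflection) = 0 for the load L.
a, b, c = b2, b1, b0 - observed_deflection
disc = b * b - 4 * a * c
roots = sorted((-b + sign * math.sqrt(disc)) / (2 * a) for sign in (1, -1))

# Keep only the root that lies inside the range of loads used in the experiment.
plausible = [r for r in roots if 150_000 <= r <= 3_000_000]

print("both roots:     ", [f"{r:,.0f}" for r in roots])
print("plausible load: ", [f"{r:,.0f}" for r in plausible])
```

Filtering by the experimental load range plays the same role as the rough bounds read from the plot: it discards the root that lies far outside the data.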
111. The final step in the calibration process, after determining the
estimated load associated with the observed deflection, is to
compute an uncertainty or confidence interval for the load.
A single-use 95% confidence interval for the load is obtained
by inverting the formulas for the upper and lower bounds of a
95% prediction interval for a new deflection value.
These inequalities, shown below, are usually solved numerically,
just as the calibration equation was, to find the end points of the
confidence interval.
For some models including this one the solution could actually
be obtained algebraically, but it is easier to let the computer do
the work using a generic algorithm.
112. The three terms on the right-hand side of each inequality are the
regression function (f), a t-distribution multiplier, and the standard
deviation of a new measurement from the process.
Regression software often provides convenient methods for
computing these quantities for arbitrary values of the predictor
variables, which can make computation of the confidence interval
end points easier.
Although this interval is not symmetric mathematically, the
asymmetry is very small, so for all practical purposes the interval
can be written in symmetric form, if desired.
113. ULTRASONIC REFERENCE BLOCK STUDY
This case study illustrates the construction of a non-linear regression
model for ultrasonic calibration data.
It demonstrates fitting a non-linear model and
the use of transformations and weighted fits to deal with the
violation of the assumption of constant standard deviations
for the errors.
This assumption is also called homogeneous
variances for the errors.
1)Background and Data
2)Fit Initial Model
3)Transformation to improve fit
4)Weighting to improve fit
5)Compare the fits
114. Background and data:
The ultrasonic reference block data consist of a response
variable and a predictor variable.
The response variable is ultrasonic response and the
predictor variable is metal distance.
These data were provided by the NIST scientist Dan
Chwirut.
The analyses used in this case study can be generated
using both Dataplot code and R code
116. Fit Initial-Model:
The first step in fitting a nonlinear function is to simply
plot the data.
This plot shows an exponentially decaying pattern in the
data.
This suggests that some type of exponential function
might be an appropriate model for the data.
There are two issues that need to be addressed in the
initial model selection when fitting a nonlinear model.
We need to determine an appropriate functional form for
the model.
We need to determine appropriate starting values for the
estimation of the model parameters.
118. To determine an appropriate functional form for the model.
**Due to the large number of potential functions that can be
used for a nonlinear model, the determination of an
appropriate model is not always obvious.
**The plot of the data will often suggest a well-known function.
**In addition, we often use scientific and engineering knowledge in
determining an appropriate model.
**In scientific studies, we are frequently interested in fitting a
theoretical model to the data.
**We also often have historical knowledge from previous studies
(either our own data or from published studies) of functions that
have fit similar data well in the past.
**In the absence of a theoretical model or experience with prior data
sets, selecting an appropriate function will often require a certain
amount of trial and error.
**Regardless of whether or not we are using scientific knowledge in
selecting the model, model validation is still critical in determining
if our selected model is adequate.
119. To determine Appropriate Starting values.
**Nonlinear models are fit with iterative methods that require
starting values.
**In some cases, inappropriate starting values can result in parameter
estimates for the fit that converge to a local minimum or maximum
rather than the global minimum or maximum.
**Some models are relatively insensitive to the choice of starting
values while others are extremely sensitive.
**In the case where you do not know what good starting values would
be, one approach is to create a grid of values for each of the
parameters of the model and compute some measure of goodness
of fit, such as the residual standard deviation, at each point on
the grid.
**The idea is to create a broad grid that encloses reasonable values
for the parameter.
**However, we typically want to keep the number of grid points for each
parameter relatively small to keep the computational burden down
(particularly as the number of parameters in the model increases).
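The grid approach to starting values can be sketched as follows. This is a minimal Python illustration, assuming NumPy is available. The three-parameter decay function, the grid values, and the synthetic noise-free data are all assumptions chosen for the example; they are not the case-study data.

```python
import itertools
import numpy as np

def model(x, b1, b2, b3):
    """Illustrative three-parameter decay function (an assumption)."""
    return np.exp(-b1 * x) / (b2 + b3 * x)

def grid_search(x, y, grids):
    """Return the grid point with the smallest residual standard deviation."""
    best_params, best_sd = None, np.inf
    n_params = len(grids)
    for params in itertools.product(*grids):
        resid = y - model(x, *params)
        res_sd = np.sqrt(np.sum(resid**2) / (len(x) - n_params))
        if res_sd < best_sd:
            best_params, best_sd = params, res_sd
    return best_params, best_sd

# Synthetic, noise-free data generated from known parameter values.
x = np.linspace(0.5, 6.0, 30)
y = model(x, 0.2, 0.006, 0.01)

# A coarse grid: three candidate values per parameter, 27 fits in total.
grids = [(0.1, 0.2, 0.3), (0.002, 0.006, 0.010), (0.005, 0.010, 0.015)]
best_params, best_sd = grid_search(x, y, grids)
print(best_params, best_sd)
```

The number of evaluations is the product of the grid sizes, which is why the grid for each parameter is kept small as the number of parameters grows. The best grid point would then be passed to the iterative fitting routine as its starting value.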
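120. For this particular data set, the scientist was trying to fit
the following theoretical model:
$y = \dfrac{\exp(-b_1 x)}{b_2 + b_3 x} + \varepsilon$,
where $y$ is the ultrasonic response and $x$ is the metal distance.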
Since we have a theoretical model, we use this as the
initial model.
We set the starting values for all three parameters to 0.1.
The following results were generated for the nonlinear fit.
121. Parameter Estimate Stan. Dev t Value
b1 0.190279 0.2194E-01 8.6
b2 0.006131 0.3450E-03 17.8
b3 0.010531 0.7928E-03 13.3
Residual standard deviation 3.362
Residual degrees of freedom 211
123. This plot shows a reasonably good fit.
It is difficult to detect any violations of the fit assumptions
from this plot.
The estimated model is
$\hat{y} = \dfrac{\exp(-0.190279\,x)}{0.006131 + 0.010531\,x}$
When there is a single independent variable, the plot
provides a convenient method for initial model validation.
125. The basic assumptions for regression models are that the errors are
random observations from a normal distribution with zero mean and
constant standard deviation (or variance).
These plots suggest that the variance of the errors is not constant.
In order to see this more clearly, we will generate a full-sized plot
of the predicted values from the model overlaid with the data, and
plot the residuals against the independent variable, metal distance.
127. This plot suggests that the errors have greater variance for
the values of metal distance less than one than elsewhere.
That is, the assumption of homogeneous variances seems to
be violated.
Except when the Metal Distance is less than or equal to one,
there is not strong evidence that the error variances differ.
Nevertheless, we will use transformations or weighted fits to
see if we can eliminate this problem.
128. Transformations to Improve Fit.
The first step is to try transformations of the response variable
that will result in homogeneous variances.
In practice, the square root, ln, and reciprocal transformations
often work well for this purpose.
We will try these first.
130. In examining these four plots, we are looking for the plot that
shows the most constant variability of the ultrasonic response
across values of metal distance.
Although the scales of these plots differ widely, which would
seem to make comparisons difficult, we are not comparing the
absolute levels of variability between plots here.
Instead we are comparing only how constant the variation within
each plot is for these four plots.
The plot with the most constant variation will indicate which
transformation is best.
Based on constancy of the variation in the residuals, the square
root transformation is probably the best transformation to use for
this data.
131. After transforming the response variable, it is often helpful to
transform the predictor variable as well.
In practice, the square root, ln, and reciprocal transformations
often work well for this purpose.
We will try these first.
These plots show that none of the proposed transformations
offers an improvement over using the raw predictor variable.
Based on these plots, we choose to fit a model with a
square root transformation for the response variable and no
transformation for the predictor variable.
133. Parameter Estimate Stan. Dev t Value
b1 -0.0154326 0.8593E-02 -1.8
b2 0.0806714 0.1524E-02 53.6
b3 0.0638590 0.2900E-02 22.2
Residual standard deviation 0.29715
Residual degrees of freedom 211
Although the residual standard deviation is lower than it was for
the original fit, we cannot compare them directly since the fits
were performed on different scales.
134. The plot of the predicted values with the transformed data
indicates a good fit. The fitted model is
$\sqrt{\hat{y}} = \dfrac{\exp(0.0154326\,x)}{0.0806714 + 0.0638590\,x}$
Fig 35
136. Since we transformed the data, we need to check that all of the
regression assumptions are now valid.
The 6-plot of the data using this model indicates no obvious
violations of the assumptions.
In order to see more detail, we generate a full size version of the
residuals versus predictor variable plot.
This plot suggests that the errors now satisfy the assumption of
homogeneous variances.
138. Weighting to Improve Fit:
Another approach when the assumption of constant variance
of the errors is violated is to perform a weighted fit.
In a weighted fit, we give less weight to the less precise
measurements and more weight to more precise measurements
when estimating the unknown parameters in the model.
In this case, we have replication in the data, so we can fit a
power model
to the variances from each set of replicates in the data and use
the fitted variance function to define the weights.
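The idea of giving less weight to less precise measurements can be illustrated with a small Python sketch, assuming NumPy is available. For simplicity it uses a straight-line model rather than the nonlinear model of this case study, and the data, the weights, and the helper name `weighted_line_fit` are hypothetical.

```python
import numpy as np

def weighted_line_fit(x, y, w):
    """Weighted least squares for y = a + b*x; returns [a, b].

    Minimizes sum(w_i * (y_i - a - b*x_i)**2) via the normal equations.
    """
    X = np.column_stack([np.ones_like(x), x])
    XtW = X.T * w                      # scales the column for point i by w_i
    return np.linalg.solve(XtW @ X, XtW @ y)

# Hypothetical data on the line y = 1 + 2x, with one imprecise reading.
x = np.arange(10, dtype=float)
y = 1.0 + 2.0 * x
y[0] += 5.0                            # the imprecise measurement

equal_weights = np.ones_like(x)
down_weighted = equal_weights.copy()
down_weighted[0] = 0.01                # much less weight on the imprecise point

a_u, b_u = weighted_line_fit(x, y, equal_weights)
a_w, b_w = weighted_line_fit(x, y, down_weighted)
print(f"unweighted slope: {b_u:.4f}")
print(f"weighted slope:   {b_w:.4f}")
```

With equal weights the imprecise point pulls the slope away from 2; once it is down-weighted, the estimate moves back towards the value that generated the data.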
139. The following results were obtained for the fit of
ln(variances) against ln(means) for the replicate groups.
Parameter Estimate Stan. Dev t Value
γ0 2.5369 0.1919 13.1
γ1 -1.1128 0.1741 -6.4
Residual standard deviation 0.6099
Residual degrees of freedom 20
140. The fit output and the plot of the replicate variances against the
replicate means show that the linear fit provides a reasonable fit, with
an estimated slope of -1.1128.
Fig 38
141. Based on this fit, we used an estimate of -1.0 for the exponent in
the weighting function.
Fig 39
142. The residual plot from the fit to determine an appropriate weighting
function reveals no obvious problems.
The results of the weighted fit are shown below.
Parameter Estimate Stan. Dev t Value
b1 0.146999 0.1505E-01 9.8
b2 0.005280 0.4021E-03 13.1
b3 0.012388 0.7362E-03 16.8
Residual standard deviation 4.11
Residual degrees of freedom 211
143. To assess the quality of the weighted fit, we first generate a
plot of the predicted line with the original data.
The plot of the predicted values with the data indicates a good fit.
Fig 40
144. The model for the weighted fit is
$\hat{y} = \dfrac{\exp(-0.146999\,x)}{0.005280 + 0.012388\,x}$
We need to verify that the weighted fit does not violate the
regression assumptions.
146. In order to check the assumption of equal error variances in more
detail, we generate a full-sized version of the residuals versus the
predictor variable.
This plot suggests that the residuals now have approximately equal
variability.
Fig 42
147. Compare the Fits:
It is interesting to compare the results of
the three fits:
Unweighted fit
Transformed fit
Weighted fit
The first step in comparing the fits is to plot all three sets of
predicted values (in the original units) on the same plot with
the raw data.
The plot below shows that all three fits generate comparable
predicted values.
We can also compare the residual standard
deviations (RESSD) from the fits.
148. The RESSD for the transformed data is calculated after
translating the predicted values back to the original scale.
Fig 43
149. RESSD From Unweighted Fit 3.361673
RESSD From Transformed Fit 3.306732
RESSD From Weighted Fit 3.392797
In this case, the RESSD is quite close for all three fits
(which is to be expected based on the plot).
150. Given that the transformed and weighted fits generate predicted
values that are quite close to those of the original fit,
why would we want to make the extra effort to generate
a transformed or weighted fit?
We do so to develop a model that satisfies
the model assumptions for fitting a nonlinear model.
This gives us more confidence that conclusions and
analyses based on the model are justified and appropriate.
151. THERMAL EXPANSION OF COPPER:
This case study illustrates the use of a class of nonlinear models
called rational function models.
The data set used is the thermal expansion of copper
related to temperature.
This data set was provided by the NIST scientist Thomas Hahn.
Background and data
Rational Functional Models
Initial Plot of Data
Fit Quadratic/Quadratic Rational Functional model
Fit Cubic/Cubic model
152. Background and Data.
The response variable for this data set is the coefficient of
thermal expansion for copper.
The predictor variable is temperature in kelvin.
There were 236 data points collected.
These data were provided by the NIST scientist Thomas
Hahn.
The analyses used in this case study can be generated
using both Dataplot code and R code.
154. Rational Function Models.
A polynomial function is one that has the form
$y = a_n x^n + a_{n-1} x^{n-1} + \dots + a_2 x^2 + a_1 x + a_0$
with n denoting a non-negative integer
that defines the degree of the polynomial.
A polynomial with a degree of 0 is simply
a constant, with a degree of 1 is a line, with
a degree of 2 is a quadratic, with a degree
of 3 is a cubic, and so on.
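155. A rational function is simply the ratio of two polynomial
functions,
$y = \dfrac{a_n x^n + a_{n-1} x^{n-1} + \dots + a_1 x + a_0}{b_m x^m + b_{m-1} x^{m-1} + \dots + b_1 x + b_0}$
with n denoting a non-negative integer that defines the
degree of the numerator and m denoting a non-negative integer
that defines the degree of the denominator.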
For fitting rational function models, the constant term in
the denominator is usually set to 1.
Rational functions are typically identified by the degrees
of the numerator and denominator.
For example, a quadratic for the numerator and a cubic
for the denominator is identified as a quadratic/cubic
rational function.
156. A rational function model is a generalization of the
polynomial model.
Rational function models contain polynomial models as a
subset (i.e., the case when the denominator is a constant).
157. Rational function models have the following advantages.
Rational function models have a moderately simple form.
Rational function models are a closed family.
As with polynomial models, this means that rational function
models are not dependent on the underlying metric.
Rational functions are typically smoother and less
oscillatory than polynomial models.
Rational functions can be either finite or infinite
for finite values, or finite or infinite for infinite values.
Rational function models can often be used to model
complicated structure with a fairly low degree in both the
numerator and denominator.
Rational function models are moderately easy to handle
computationally.
158. Rational function models have the following disadvantages:
The properties of the rational function family are not as
well known to engineers and scientists as are those of
the polynomial family.
The literature on the rational function family is also more limited.
Because the properties of the family are often not well
understood, it can be difficult to decide which degrees to choose
for the numerator and denominator for data of a given shape.
Unconstrained rational function fitting can, at times,
result in undesired nuisance vertical asymptotes
due to roots in the denominator polynomial.
These nuisance asymptotes occur occasionally and unpredictably,
but the gain in flexibility of shapes is well worth the chance
that they may occur.
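A fitted rational function can be screened for nuisance asymptotes by looking for real roots of the denominator inside the range of the data. The Python sketch below assumes NumPy is available; the helper name `poles_in_range` and the example denominators are hypothetical and do not come from the case study.

```python
import numpy as np

def poles_in_range(den_coeffs, lo, hi):
    """Real roots of the denominator polynomial that fall inside [lo, hi].

    den_coeffs lists the denominator coefficients in ascending powers,
    e.g. [1, B1, B2] for 1 + B1*x + B2*x**2.
    """
    roots = np.roots(den_coeffs[::-1])          # np.roots wants highest power first
    real = roots[np.abs(roots.imag) < 1e-9].real
    return sorted(r for r in real if lo <= r <= hi)

# Hypothetical denominator 1 - 0.1*x: vertical asymptote at x = 10.
with_pole = poles_in_range([1.0, -0.1], lo=0.0, hi=100.0)

# Hypothetical denominator 1 + x**2: no real roots, hence no asymptote.
without_pole = poles_in_range([1.0, 0.0, 1.0], lo=0.0, hi=100.0)

print(with_pole, without_pole)
```

A check of this kind only flags candidate asymptotes. A denominator root that nearly coincides with a numerator root may have little visible effect on the fitted curve, so the plot of the fit should still be inspected.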
159. One common difficulty in fitting nonlinear models is finding
adequate starting values.
A major advantage of rational function models is the ability to
compute starting values using a linear least squares fit.
To do this, choose p points from the data set, with p denoting the
number of parameters in the rational model.
For example, given the linear/quadratic model
$y = \dfrac{A_0 + A_1 x}{1 + B_1 x + B_2 x^2}$
we need to select four representative points.
We then perform a linear fit on the model
$y = A_0 + A_1 x + \dots + A_{p_n} x^{p_n} - B_1 x y - \dots - B_{p_d} x^{p_d} y$
Here, $p_n$ and $p_d$ are the degrees of the numerator and
denominator, respectively, and the x and y contain the subset of
points, not the full data set.
160. Initial Plot of Data:
The first step in fitting a nonlinear function is to simply plot the
data.
This plot initially shows a fairly steep slope that levels off to a
more gradual slope.
This type of curve can often be modelled with a rational
function model.
The plot also indicates that there do not appear to be any
outliers in this data.
162. Fit Quadratic/Quadratic Rational Function Model.
Based on the procedure described, we fit the model
$y = \dfrac{A_0 + A_1 x + A_2 x^2}{1 + B_1 x + B_2 x^2} + \varepsilon$
using five representative points to generate
the starting values for the Q/Q rational function.
The coefficients from the preliminary linear
fit of the five points are:
A0 = -3.005450
A1 = 0.368829
A2 = -0.006828
B1 = -0.011234
B2 = -0.000306
163. The results for the nonlinear fit are shown below.
Parameter Estimate Stan. Dev t Value
A0 -8.028e+00 3.988e-01 -20.13
A1 5.083e-01 1.930e-02 26.33
A2 -7.307e-03 2.463e-04 -29.67
B1 -7.040e-03 5.235e-04 -13.45
B2 -3.288e-04 1.242e-05 -26.47
Residual standard deviation = 0.5501
Residual degrees of freedom = 231
The regression yields the following estimated model.
$\hat{y} = \dfrac{-8.028 + 0.5083\,x - 0.007307\,x^2}{1 - 0.007040\,x - 0.0003288\,x^2}$
164. We then generate a plot of the fitted rational function model with the raw data.
Fig 45
165. The plot of the fitted function with the raw data appears to
show a reasonable fit.
Even so, we need to validate the model assumptions.
The 6-plot is an effective tool for this purpose.
167. The plot of the residuals versus the predictor variable
temperature (row 1, column 2) and of the residuals versus the
predicted values (row 1, column 3) indicate a distinct pattern in
the residuals.
This suggests that the assumption of random errors is badly
violated.
Hence a full-sized residual plot is generated in order to show
more detail.
The full-sized residual plot clearly shows the distinct pattern in
the residuals.
When residuals exhibit a clear pattern, the corresponding
errors are probably not random.
169. Fit Cubic/Cubic Rational Function Model.
Since the Q/Q model did not describe the data well, we next
fit a cubic/cubic (C/C) rational function model.
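Based on the procedure described above, we fit the model
$y = \dfrac{A_0 + A_1 x + A_2 x^2 + A_3 x^3}{1 + B_1 x + B_2 x^2 + B_3 x^3} + \varepsilon$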
170. We use the following seven representative points to generate the starting values:
TEMP THERMEXP
--------------------------------
10 0
30 2
40 3
50 5
120 12
200 15
800 20
The coefficients from the preliminary linear fit of the seven
points are:
A0 = -2.323648e+00
A1 = 3.530298e-01
A2 = -1.383334e-02
A3 = 1.766845e-04
B1 = -3.395949e-02
B2 = 1.100686e-04
B3 = 7.910518e-06
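The preliminary linear fit can be sketched in Python using the seven representative points listed above. This is a minimal illustration, assuming NumPy is available. The function name `rational_start_values` is hypothetical, and the column scaling is an added numerical precaution rather than part of the described procedure. With exactly as many points as parameters, the linear system is square and can be solved directly.

```python
import numpy as np

# The seven representative points listed above.
temp = np.array([10, 30, 40, 50, 120, 200, 800], dtype=float)
therm = np.array([0, 2, 3, 5, 12, 15, 20], dtype=float)

def rational_start_values(x, y, deg_num, deg_den):
    """Starting values from the linear model
    y = A0 + A1*x + ... + A_pn*x**pn - B1*x*y - ... - B_pd*x**pd*y.

    Assumes len(x) equals the number of parameters (deg_num + 1 + deg_den).
    """
    num_cols = [x**k for k in range(deg_num + 1)]
    den_cols = [-(x**k) * y for k in range(1, deg_den + 1)]
    design = np.column_stack(num_cols + den_cols)
    # Scale each column to unit maximum to improve conditioning, then undo it.
    scale = np.abs(design).max(axis=0)
    coef = np.linalg.solve(design / scale, y) / scale
    return coef[: deg_num + 1], coef[deg_num + 1:], design

a_start, b_start, design = rational_start_values(temp, therm, deg_num=3, deg_den=3)
print("A0..A3:", a_start)
print("B1..B3:", b_start)
```

The resulting values are only starting points. They are passed to the nonlinear least squares routine, which then refines them using the full data set.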
171. The results of fitting the C/C model are shown below.
Parameter Estimate Stan. Dev t Value
A0 1.07913 0.1710 6.3
A1 -0.122801 0.1203E-01 -10.2
A2 0.408837E-02 0.2252E-03 18.2
A3 -0.142848E-05 0.2610E-06 -5.5
B1 -0.576111E-02 0.2468E-03 -23.3
B2 0.240629E-03 0.1060E-04 23.0
B3 -0.123254E-06 0.1217E-07 -10.1
Residual standard deviation = 0.0818
Residual degrees of freedom = 229
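The regression analysis yields the following estimated model.
$\hat{y} = \dfrac{1.07913 - 0.122801\,x + 4.08837 \times 10^{-3}\,x^2 - 1.42848 \times 10^{-6}\,x^3}{1 - 5.76111 \times 10^{-3}\,x + 2.40629 \times 10^{-4}\,x^2 - 1.23254 \times 10^{-7}\,x^3}$
We then generate a plot of the fitted rational function model with
the raw data.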
173. The plot of the fitted function with the raw data appears to show
a reasonable fit.
Even so, we need to validate the model assumptions.
The 6-plot is an effective tool for this purpose.
The 6-plot indicates no significant violation of the model
assumptions.
That is, the errors appear to have constant location and scale (from
the residual plot in row 1, column 2), seem to be random (from the
lag plot in row 2, column 1), and are approximated well by a normal
distribution (from the histogram and normal probability plots in row
2, columns 2 and 3).
A full-sized residual plot is generated to show more detail.
175. The full-sized residual plot suggests that the assumptions of
constant location and scale for the errors are valid.
No distinguishing pattern is evident in the residuals.
We conclude that the cubic/cubic rational function model
does in fact provide a satisfactory model for this data set.
176. References:
NIST/SEMATECH e-Handbook of Statistical Methods,
http://www.itl.nist.gov/div898/handbook/, date.
“BASE Note 1”, Morten Blomhoj, Tinne Hoff Kjeldsen,
Johnny Ottesen, Natural Sciences Basis Program,
Roskilde University Center, Denmark, August 2000.
http://en.wikipedia.org/wiki/Mathematical_model.