This document provides an overview of multivariate statistical techniques that can be used in agriculture and plant science research. It discusses multiple linear regression analysis, which models the relationship between a dependent variable and one or more explanatory variables. The document explains how to determine regression coefficients and test their significance using analysis of variance. It also describes different variable selection techniques for multiple regression like backward elimination, forward selection, and stepwise regression. The goal is to help researchers identify the best predictive model and determine which variables are most important when the number of predictors increases.
This document discusses data analysis and various techniques used in data analysis such as data editing, coding, classification, tabulation, and statistical analysis. It describes different types of statistical tests like z-test, t-test, chi-square test, and their uses. It also discusses various types of tables, diagrams, and graphical representations that are used to present statistical data in a meaningful way. Key types of diagrams mentioned include bar charts, pie charts, histograms and scatter plots. Rules for properly constructing tables and graphs are also provided.
Confirmatory factor analysis (CFA) is a statistical technique used to test whether measures of a construct are consistent with a researcher's understanding of that construct. CFA can be used to confirm or reject a measurement theory by specifying the number of factors and which measured variables relate to which latent variables, unlike exploratory factor analysis. Assumptions of CFA include multivariate normality, sufficient sample size, correct model specification, and a random sample. Statistical software like AMOS, LISREL, EQS, and SAS can be used to conduct CFA.
This document provides an overview of methods for data analysis. It discusses data, descriptive statistics such as measures of central tendency and dispersion, inferential statistics including hypothesis testing and probability, and statistical software packages with a focus on SPSS. SPSS allows users to easily input, manage, and analyze data to obtain summary statistics and perform inferential analyses like t-tests, ANOVA, and regression. Outputs can be copied into reports.
1. The document provides an overview of statistical analysis including the scientific method, common statistical terminology, hypotheses testing, choosing an appropriate statistical method, the normal distribution, and significance and confidence limits.
2. It explains key concepts like the null hypothesis, which is the opposite of the research hypothesis and is disproven through statistical testing.
3. Statistical methods depend on factors like the type of test needed, sample size, and data type, and whether tests of association, difference, or other analyses are required.
Basics of Educational Statistics (T-test) - HennaAnsari
A t-test is a statistical test used to compare the means of two groups and determine if there is a significant difference between them. It can be used for hypothesis testing to see if a treatment has an effect. There are assumptions that the data is independent, normally distributed, and has similar variances within each group. Different types of t-tests exist depending on the type of data, such as whether the groups are related or independent samples. The t-distribution table provides probabilities for assessing the significance of t-test results.
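The independent-samples case described above can be sketched in a few lines of plain Python. This is a minimal illustration of Welch's variant (which relaxes the equal-variances assumption); the sample data are made up for illustration.

```python
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples
    (does not assume equal variances)."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = variance(sample_a), variance(sample_b)  # sample variances (n-1 denominator)
    se = (va / na + vb / nb) ** 0.5                  # standard error of the mean difference
    return (mean(sample_a) - mean(sample_b)) / se

# Hypothetical treatment vs. control measurements
treatment = [5.1, 5.8, 6.2, 5.9, 6.4, 5.7]
control   = [4.8, 5.0, 5.2, 4.9, 5.3, 5.1]
t = welch_t(treatment, control)
print(round(t, 3))
```

In practice the t statistic would be compared against a t-distribution table (or a library such as SciPy would report the p-value directly), as the summary above notes.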
This document discusses factors that influence the selection of data analysis strategies and provides a classification of statistical techniques. It notes that the previous research steps, known data characteristics, statistical technique properties, and researcher background all impact strategy selection. Statistical techniques can be univariate, analyzing single variables, or multivariate, analyzing relationships between multiple variables simultaneously. Multivariate techniques are further classified as dependence techniques, with identifiable dependent and independent variables, or interdependence techniques examining whole variable sets. The document provides examples of common univariate and multivariate techniques.
STATISTICAL TOOLS USED IN ANALYTICAL CHEMISTRY - keerthana151
This document provides an overview of statistical concepts related to analytical chemistry. It defines key terms like error, bias, accuracy, and precision. It discusses measures of central tendency, statistical process control charts, and various statistical tests. It provides examples of calculating Kjeldahl nitrogen and describes different types of control charts and statistical tests like t-tests, F-tests, linear regression, and analysis of variance. It lists several references for further information on statistics topics.
Application of Univariate, Bi-variate and Multivariate analysis - Pooja k shetty, Sundar B N
This document discusses different types of statistical analysis used to analyze data. Univariate analysis examines one variable at a time through methods like frequency distributions, histograms, and pie charts. Bivariate analysis considers the relationship between two variables, such as income and weight. Multivariate analysis studies three or more variables simultaneously, with applications in fields like social science, climatology, and medicine.
This document provides an overview of statistics and biostatistics. It defines statistics as the collection, analysis, and interpretation of quantitative data. Biostatistics refers to applying statistical methods to biological and medical problems. Descriptive statistics are used to summarize and organize data, while inferential statistics allow generalization from samples to populations. Common statistical measures include the mean, median, and mode for central tendency, and range, standard deviation, and variance for variability. Correlation analysis examines relationships between two variables. The document discusses various data types and measurement scales used in statistics. Overall, it serves as a basic introduction to key statistical concepts for research.
This document provides an introduction to biostatistics. It discusses how statistics are important for precision in science and medicine. Biostatistics involves applying statistical tools to biological data from fields like medicine. Some key applications of biostatistics include defining normal ranges, comparing treatment effectiveness, and identifying disease associations. The document also outlines common statistical terms, data sources and types, methods for presenting data, measures of central tendency and variability.
This document discusses statistical analysis using SPSS. It describes descriptive statistics, which present data in a usable form by describing frequency, central tendency, and dispersion. Inferential statistics make broader generalizations from samples to populations using hypothesis testing. Hypothesis testing involves research hypotheses, null hypotheses, levels of significance, and type I and II errors. Choosing an appropriate statistical test depends on the hypothesis and measurement levels of the variables. SPSS is a comprehensive system for statistical analysis that can analyze many file types and generate reports and statistics.
This document provides an overview of key concepts for analyzing medical data from a research perspective, including:
- Statistical concepts important for medical licensing exams like scales of measurement, distributions, hypothesis testing, and study designs.
- How to determine what data is available to answer a clinical question, locate existing datasets, and analyze/interpret findings using software like Excel and SPSS.
- Resources for further learning about epidemiology, health statistics, diagnostic tests, and using statistical software.
The document discusses various methods for analyzing data, including descriptive, statistical, and multivariate analyses. Statistical analysis makes raw data meaningful by testing hypotheses, obtaining significant results, and drawing inferences. The appropriate analysis depends on the type of measurement, number of variables, and type of statistical inference required. Correlation analysis studies relationships between variables while causal analysis examines how independent variables affect dependents. Multivariate techniques include multiple regression, discriminant analysis, ANOVA, and canonical analysis.
This document discusses various statistical analysis techniques used in marketing research. It begins by explaining how to bring raw data into order through arrays, tabulations and establishing categories. It then discusses descriptive, inferential, differences, associative and predictive analysis. The document also covers univariate techniques like t-tests, z-tests, ANOVA, chi-square tests and multivariate techniques like regression, conjoint analysis and cluster analysis. It provides guidance on when to use specific statistical tests and covers statistics used in cross-tabulation like phi coefficient, contingency coefficient and Cramer's V.
Research methodology - Analysis of Data - The Stockker
Processing & Analysis of Data, Data editing, Benefits of data editing, Data coding, Classification of data, Classification according to attributes, Classification on the basis of interval, Tabulation of data, Types of tables, Graphing of data, Bar chart, Pie chart, Line graph, Histogram, Polygon/ogive, Analysis of Data, Descriptive Analysis, Uni-variate Analysis, Bivariate Analysis, Multi-variate Analysis, Causal Analysis, Inferential Analysis, Parametric tests, Non-parametric tests
This document provides an overview of various quantitative data analysis techniques including parametric and non-parametric statistics, descriptive statistics, contingency analysis, t-tests, ANOVA, correlation, and regression. It discusses assumptions and processes for each technique and how to interpret results. Computer software like SPSS and SAS can be used to analyze large, complex datasets.
The document provides an overview of data analysis methods and concepts for graduate fellows. It covers:
1) The objectives of translating research questions into an analysis plan, identifying appropriate data analysis methods and software, and conducting exploratory analysis.
2) Key concepts in data analysis including response and explanatory variables, multi-level data structures, and exploratory versus confirmatory analysis.
3) Guidance on specific exploratory analysis methods and examples of confirmatory analysis options using different statistical models depending on variable types.
This document provides an overview of quantitative analysis techniques using SPSS, including data manipulation, transformation, and cleaning methods. It also covers univariate, bivariate, and other statistical analysis methods for exploring relationships between variables and differences between groups. Specific techniques discussed include computing new variables, recoding, selecting cases, imputing missing values, aggregating data, sorting, merging files, descriptive statistics, correlations, regressions, t-tests, ANOVA, non-parametric tests, and more.
Univariate analysis examines one variable at a time across a sample. There are three main tools used in univariate analysis: distribution of frequency, measures of central tendency (mean, median, mode), and measures of dispersion. Distribution examines individual values, range, and charts. Central tendency measures the average or middle value. Dispersion measures the spread around the central tendency, such as the standard deviation and range. Common univariate analysis procedures include frequencies, descriptives, and explore in SPSS.
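The three univariate tools listed above (frequency distribution, central tendency, dispersion) can be computed directly with the Python standard library; the survey data below are made up for illustration.

```python
from collections import Counter
from statistics import mean, median, mode, stdev

# Hypothetical survey responses (number of visits per respondent)
visits = [2, 3, 3, 4, 2, 5, 3, 4, 2, 3]

freq = Counter(visits)                                # frequency distribution
print(freq.most_common())

print(mean(visits), median(visits), mode(visits))     # central tendency

s = stdev(visits)                                     # dispersion: sample SD
data_range = max(visits) - min(visits)                # dispersion: range
print(round(s, 2), data_range)
```

These mirror what the SPSS Frequencies and Descriptives procedures report for a single variable.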
Parametric and non-parametric tests in biostatistics - Mero Eye
This PPT will be helpful for optometrists in deciding where and when to use biostatistical formulas, with different examples.
- It covers both parametric and non-parametric tests.
Statistics is important in chemistry for collecting, analyzing, and presenting quantitative data. It is used in analytical chemistry to detect, identify, and measure unknown chemical compositions using instrumentation techniques. Descriptive statistics summarize sample data using measures like the mean and standard deviation, while inferential statistics draw conclusions from data subject to random variation. Statistics plays a vital role in chemistry research by guiding data collection, interpretation, and presentation so that results are properly characterized and support reliable conclusions.
Non-parametric statistics is a branch of statistics that does not require data to be normally distributed. It can be used with ordinal or ranked data and does not assume a particular distribution shape or require parameters like the mean or standard deviation. Common non-parametric tests include rank sum tests like the Wilcoxon-Mann-Whitney U test and the Kruskal-Wallis H test, the chi-square test, and Spearman's rank correlation test. These tests make fewer assumptions about the underlying data distribution compared to parametric tests.
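Spearman's rank correlation, one of the non-parametric tests named above, works on ranks rather than raw values. A minimal sketch in plain Python (assuming no tied ranks, and using made-up rater scores):

```python
def ranks(xs):
    """Rank each value (1 = smallest); assumes no ties for simplicity."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_rho(x, y):
    """Spearman's rho via the sum-of-squared-rank-differences formula."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical ordinal scores from two raters
rater1 = [1, 2, 3, 4, 5]
rater2 = [2, 1, 4, 3, 5]
print(spearman_rho(rater1, rater2))  # 0.8
```

Because only ranks enter the calculation, the statistic makes no assumption about the shape of the underlying distribution, which is the defining property of the non-parametric tests listed in the summary.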
This document discusses how to analyze data and perform various statistical tests using SPSS software. It explains how to open data files, enter data, and access the SPSS data editor window. It then covers determining descriptive statistics like frequencies, means, and medians. Finally, it demonstrates how to conduct t-tests, ANOVA, correlation analysis, linear regression, and create scatter plots in SPSS.
This document discusses multivariate analysis and some key concepts in multivariate analysis including:
1. Variates, measurement scales (metric and non-metric), measurement error, statistical significance versus statistical power.
2. Types of measurement scales including nominal, ordinal, interval and ratio scales.
3. Measurement error and how it relates to validity and reliability in multivariate measurement.
4. Statistical significance and types of statistical errors in multivariate analysis.
The effects of worked examples on transfer of statistical reasoning - Marianna Lamnina
An experiment compared the effectiveness of worked examples versus reading a textbook on learning statistical reasoning. Participants completed a pre-test, then the experimental group received a computer tutorial with worked examples and practice problems while the control group read textbook passages. On the post-test, the experimental group scored significantly higher on average than the control group, demonstrating that worked examples facilitated greater learning gains compared to traditional reading.
This document summarizes the key differences between probability sampling and non-probability (quota) sampling in sample surveys. Probability sampling involves randomly selecting samples so that all units have a known chance of selection, allowing results to be generalized to the population. Quota sampling matches sample quotas to population characteristics but involves subjective judgment, preventing determination of selection probabilities. Probability sampling provides unbiased results and a measure of sampling error, while quota sampling relies on untestable models and cannot estimate precision. While quota sampling may be less costly, probability sampling is preferred by statistical agencies for its objectively verifiable quality.
PROBABILITY DISTRIBUTION OF SUM OF TWO CONTINUOUS VARIABLES AND CONVOLUTION - Journal For Research
All physical subjects that involve random phenomena (anything depending on chance) naturally find their way to the theory of statistics. Hence relations arise between the results derived for those random phenomena in different physical subjects and the concepts of statistics. The convolution theorem has a variety of applications in the field of Fourier transforms and many other situations, but it also has elegant applications in statistics. In this paper the authors discuss some notions of electrical engineering in terms of the convolution of probability distributions.
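The statistical side of the convolution theorem mentioned in this abstract can be illustrated with a discrete example: the distribution of a sum of two independent random variables is the convolution of their distributions. A minimal sketch (two fair dice, chosen purely for illustration):

```python
from itertools import product

def convolve(pmf_a, pmf_b):
    """PMF of X+Y for independent X, Y, each given as a {value: probability} dict."""
    out = {}
    for (xa, pa), (xb, pb) in product(pmf_a.items(), pmf_b.items()):
        out[xa + xb] = out.get(xa + xb, 0.0) + pa * pb  # independence: probabilities multiply
    return out

die = {k: 1 / 6 for k in range(1, 7)}   # fair six-sided die
total = convolve(die, die)              # distribution of the sum of two dice
print(round(total[7], 4))               # P(sum = 7) = 6/36
```

For continuous variables, as in the paper, the sum over value pairs becomes the convolution integral of the two density functions.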
Ten Important Life Lessons on Nano Medicine Research Paper Taught Us - science journals
Journal of Nanomedicine & Nanotechnology is a scholarly open-access journal that covers a wide range of themes in this field, including molecular nanotechnology, nanosensors, nanoparticles, nanodrugs, nanomaterials, nanobiotechnology, nanobiopharmaceutics, nanoelectronics, and nanorobotics. This peer-reviewed journal strictly adheres to the standard review process to enhance the quality of its publications.
This document discusses the use of autoregressive integrated moving average (ARIMA) models in statistical analysis beyond just time series data. It provides examples of using ARIMA models with non-temporal data, where the independent variable is something other than time, such as temperature or longitude. Key points include:
1) ARIMA models only require evenly spaced intervals for the independent variable and do not necessarily need time as the variable. Examples of non-temporal ARIMA models are given for white dwarf star populations and the distribution of attorneys.
2) Temperature can act as a "time proxy" for white dwarf stars since temperature and time are monotonically related as the stars cool.
3) ARIM
This is a talk I gave during the third year of my Residency in Internal Medicine at the University of Cincinnati. It goes over the history and evolution of statistical concepts underlying Medical Science and Evidence Based Medicine
A nice summary (from which most of the material after Laplace's time came from) is given in:
http://www.worldscibooks.com/etextbook/4854/4854_chap1.pdf
This document summarizes the growth of coffee shops in India. It discusses how coffee was traditionally less popular than tea in India, but chains like Barista, Café Coffee Day, and Costa Coffee have increased coffee's popularity, especially among youth. Coffee shops have become popular social hubs. Factors driving their growth include India's large youth population, rising incomes, growth of private offices, low startup costs, and availability of franchises. The document reviews research on consumer preferences and behaviors regarding coffee shop brands and purchases. It outlines objectives to compare brands' performance and understand reasons for visits and purchases. Both quantitative and qualitative research methods like surveys and interviews are proposed.
This presentation was intended for employees of Dubai Municipality. It is about how to use SPSS and other statistical data analysis tools like Excel and Minitab in data analysis. The course presented some statistical concepts and definitions.
This document discusses the implications of the World Trade Organization (WTO) on India's agricultural sector. It provides background on GATT, the predecessor to WTO, and explains key aspects of the WTO including its formation, purpose, and differences from GATT. The document then discusses India's large and historically important agricultural industry. It outlines issues such as low productivity and government interventions that impact the sector. Finally, it analyzes India's commitments under the WTO Agreement on Agriculture, including maintaining quantitative restrictions on imports and not providing direct export subsidies.
This document discusses several discrete probability distributions:
1. Binomial distribution - For experiments with a fixed number of trials, two possible outcomes, and constant probability of success. The probability of x successes is given by the binomial formula.
2. Geometric distribution - For experiments repeated until the first success. The probability of the first success on the xth trial is p(1-p)^(x-1).
3. Poisson distribution - For counting the number of rare, independent events occurring in an interval. The probability of x events is (e^-μ μ^x)/x!, where μ is the mean number of events.
This document discusses the role and importance of statistics in scientific research. It begins by defining statistics as the science of learning from data and communicating uncertainty. Statistics are important for summarizing, analyzing, and drawing inferences from data in research studies. They also allow researchers to effectively present their findings and support their conclusions. The document then describes how statistics are used and are important in many fields of scientific research like biology, economics, physics, and more. It also provides examples of statistical terms commonly used in research studies and some common misuses of statistics.
Modelled and Analysed the watershed Dynamics in Mahanadi River Basin. Finally came up with watershed Management Plan to minimise the future LUCC in Mahanadi River Basin
Advice On Statistical Analysis For Circulation ResearchNancy Ideker
This document provides an overview and review of statistical methods for analyzing cardiovascular research data. It discusses common statistical errors in previous decades, such as low statistical power and inadequate analysis of repeated measures studies. It introduces several statistical methods that are useful but not always familiar to cardiologists, including power analysis, methods for analyzing repeated measures, analysis of covariance, multivariate analysis of variance, nonparametric tests, and more. The goal is to help researchers choose the appropriate statistical tests and properly interpret the results.
This document provides an overview of different types of data analysis including univariate analysis, bivariate analysis, and multivariate analysis. It also discusses different types of data structures such as cross-sectional data, time series data, and panel data.
The key points are:
1) Univariate analysis looks at one variable only to describe patterns in the data. Bivariate analysis looks at the relationship between two variables, while multivariate analysis examines three or more variables.
2) Cross-sectional data collects information from different subjects at the same point in time. Time series data observes the same variable over time. Panel data tracks the same subjects over multiple time periods.
3) Different analysis techniques can be used depending on the
Get your quality homework help now and stand out.Our professional writers are committed to excellence. We have trained the best scholars in different fields of study.Contact us now at http://www.essaysexperts.net/ and place your order at affordable price done within set deadlines.We always have someone online ready to answer all your queries and take your requests.
Statistics and types of statistics .docxHwre Idrees
This document discusses different types of statistics. It defines descriptive statistics as summarizing and describing data, while inferential statistics use samples to make inferences about populations. Measures of central tendency like mean, median and mode are described as well as measures of variability such as range, standard deviation and variance. Specific types of each are defined and explained, such as weighted mean, interquartile range, and harmonic mean. Tables and figures are included to illustrate the differences between descriptive and inferential statistics and examples of various statistical measures.
Statistics is the study of collecting, organizing, summarizing, and interpreting data. Medical statistics applies statistical methods to medical data and research. Biostatistics specifically applies statistical methods to biological data. Statistics is essential for medical research, updating medical knowledge, data management, describing research findings, and evaluating health programs. It allows comparison of populations, risks, treatments, and more.
This document provides an overview of basic statistics concepts and terminology. It discusses descriptive and inferential statistics, measures of central tendency (mean, median, mode), measures of variability, distributions, correlations, outliers, frequencies, t-tests, confidence intervals, research designs, hypotheses testing, and data analysis procedures. Key steps in research like research design, data collection, and statistical analysis are outlined. Descriptive statistics are used to describe data while inferential statistics investigate hypotheses about populations. Common statistical analyses and concepts are also defined.
MELJUN CORTES research lectures_evaluating_data_statistical_treatmentMELJUN CORTES
This document discusses the importance of statistics in research and the proper treatment of data. It notes that statistics are the backbone of research and help organize data in tables and graphs to guide meaningful interpretations. The document outlines the data analysis process and different levels of measurement for variables. It provides a matrix for statistical treatment of different types of data and describes common statistical operations like measures of central tendency, variance, correlation, and statistical tests. Dangers of misusing statistics are also discussed.
Level of Measurement, Frequency Distribution,Stem & Leaf Qasim Raza
This document discusses multivariate data analysis and techniques. It begins by defining qualitative and quantitative data, and the different levels of measurement - nominal, ordinal, interval, and ratio. It then discusses frequency distributions, stem and leaf plots, and demonstrates their use in SPSS. Finally, it defines multivariate data analysis as involving two or more variables, and provides examples of multivariate techniques such as multiple regression, discriminant analysis, MANOVA, and their appropriate uses depending on the level of measurement of the variables.
An Overview and Application of Discriminant Analysis in Data AnalysisIOSR Journals
This document provides an overview of discriminant analysis, including its history, key assumptions, and different types (e.g. linear, quadratic). It discusses advantages of discriminant analysis compared to logistic regression, such as its ability to handle small sample sizes. The document also describes steps to develop a discriminant model, including variable selection, assumptions checking, and evaluation. It then presents an application of discriminant analysis to classify failed vs successful companies in Nigeria based on financial ratios. The model was able to predict company failure up to 3 years in advance.
Multivariate Approaches in Nursing Research Assignment.pdfbkbk37
The document discusses multivariate approaches used in nursing research. It discusses key variables, validity and reliability, threats to internal validity, and strengths and limitations of models used in the selected article. The document also provides an overview of different multivariate techniques including multiple regression analysis, logistic regression analysis, multivariate analysis of variance, factor analysis, and discriminant function analysis. It discusses when each technique is appropriate and how to choose the right method to solve practical problems.
The document discusses basics of statistics including key concepts like population, sample, parameters, and statistics. It provides definitions for population as the collection of all individuals or items under consideration, and sample as the part of the population selected for a study. Parameters describe unknown characteristics of the population, while statistics describe known characteristics of the sample and are used to infer parameters. The document also distinguishes between descriptive statistics, which summarize and organize data, and inferential statistics, which draw conclusions about populations from samples.
This document provides an overview of biostatistics and various statistical concepts used in dental sciences. It discusses measures of central tendency including mean, median, and mode. It also covers measures of dispersion such as range, mean deviation, and standard deviation. The normal distribution curve and properties are explained. Various statistical tests are mentioned including t-test, ANOVA, chi-square test, and their applications in dental research. Steps for testing hypotheses and types of errors are summarized.
- Multinomial logistic regression predicts categorical membership in a dependent variable based on multiple independent variables. It is an extension of binary logistic regression that allows for more than two categories.
- Careful data analysis including checking for outliers and multicollinearity is important. A minimum sample size of 10 cases per independent variable is recommended.
- Multinomial logistic regression does not assume normality, linearity or homoscedasticity like discriminant function analysis does, making it more flexible and commonly used. It does assume independence between dependent variable categories.
This document provides an overview of descriptive statistics, inferential statistics, and regression analysis using PASW Statistics software. It discusses topics such as frequency analysis, measures of central tendency, hypothesis testing, t-tests, ANOVA, chi-square tests, correlation, and linear regression. The document is divided into multiple parts that cover opening and manipulating data files, descriptive statistics, tests of significance, regression analysis, and chi-square/ANOVA. It also discusses importing/exporting data and using scripts in PASW Statistics.
Dimensionality Reduction Techniques In Response Surface Designsinventionjournals
Dimensionality reduction has enormous applications in various fields in industries. It can be applied in an optimal way with respect to time and cost related to the agricultural sciences, mechanical engineering, chemical technology, pharmaceutical sciences, clinical trials, biological studies, image processing, pattern recognitions etc. Several researchers made attempts on the reduction of the size of the model for different specific problems using some mathematical and statistical techniques identifying and eliminating some insignificant variables.This paper presents a review of the available literature on dimensionality reduction
This document provides an outline for a presentation on biostatistics and epidemiology. It covers key principles of using biostatistics in research, including distinguishing different variable types, understanding data distributions, hypothesis testing, statistical tests, measures of association, regression, diagnostic tests, and systematic reviews. Statistical concepts like p-values, confidence intervals, and odds ratios are defined. Examples are provided for statistical tests like t-tests, chi-square tests, survival analysis, and diagnostic test metrics.
This study analyzes normalized citation indexes to understand the impact of Brazilian science across 27 fields from 1996-2007. Three normalization procedures are used: mean area (Ma), median (Md), and mean of the top 10% most productive (Ma10%). Correlations between the normalized indexes show the highest correlation is between Ma and Md, indicating similar behavior. Linear regression models show Md fits better to Ma than Ma10% to Ma, suggesting Md values are slightly higher than Ma over time while Ma10% provides complementary information on scientific impact. Overall, the normalized indexes show Brazil performed above average in most fields during this period.
Similar to applied multivariate statistical techniques in agriculture and plant science 2 (20)
Intl. J. Agron. Plant. Prod. Vol. 4 (1), 127-141, 2013
understand the relationships among variables and their relevance to the actual problems being studied
(Johnson and Wichern, 1996). Many different multivariate analysis techniques are available, such as multivariate
analysis of variance (MANOVA), multiple regression analysis, principal component analysis (PCA), factor
analysis (FA), canonical correlation analysis (CC), and cluster analysis. In this review we explain the
applicable multivariate statistical techniques in agriculture and plant science, with related examples,
in order to provide a practical manual for plant scientists in their research work.
Multiple Linear Regression Analysis
Linear regression is an approach to modeling the relationship between a dependent variable, called Y,
and one or more explanatory variables, denoted X. The case of one explanatory variable is called simple
regression. For example, if we want to determine how much a 1 cm increase in the height of a plant
changes its yield, we use simple linear regression (Draper and Smith, 1966). The
prediction model equation for simple linear regression is:
Y = b0 + b1X + ε
b0: the intercept, which geometrically represents the value of the dependent variable (Y) where the regression
line crosses the Y axis. Substantively, it is the expected value of Y when the independent variable equals zero.
b1: the slope coefficient (regression coefficient). It represents the change in Y associated with a one-unit
increase in X.
ε: the error of the prediction. In most situations we are not in a position to determine the population
parameters directly; instead, we must estimate their values from a finite sample of the population, and ε
captures the deviation of the observed values from the fitted line.
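As an aside, the least-squares estimates b1 and b0 can be computed directly from the sample means. The following minimal sketch (standard library only; the height/yield numbers are invented for illustration) shows the calculation:

```python
# Minimal sketch of simple linear regression Y = b0 + b1*X (ordinary least
# squares), using only the standard library. The data below are hypothetical.
from statistics import mean

def simple_ols(x, y):
    """Return (b0, b1) minimizing the sum of squared residuals."""
    xbar, ybar = mean(x), mean(y)
    # b1 = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar          # the fitted line passes through (xbar, ybar)
    return b0, b1

# Hypothetical data: plant height (cm) vs. grain yield (g/plant)
height = [40, 45, 50, 55, 60]
yield_ = [2.1, 2.4, 2.9, 3.1, 3.5]
b0, b1 = simple_ols(height, yield_)
print(round(b0, 3), round(b1, 3))  # -0.7 0.07
```

Here b1 estimates the yield change associated with each additional centimeter of height.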
Multiple regression considers more than one explanatory variable (X) — for example, how a one-unit change
in stem height, stem diameter, root length, or leaf area changes the plant yield.
The prediction model for multiple regression is an expanded form of the simple linear regression model:
Y = b0 + b1X1 + b2X2 + ... + biXi + ε
bi: the partial slope coefficient (also called partial regression coefficient or metric coefficient). It
represents the change in Y associated with a one-unit increase in Xi when all other independent variables
are held constant.
Here b0 is the sample estimate of β0 and bi is the sample estimate of βi, where the β's are the parameters
of the whole population from which sampling is conducted.
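For illustration, the sample estimates b0, b1, ..., bi can be obtained by solving the normal equations (X'X)b = X'y. The sketch below (standard library only; the three-predictor trait values are invented) shows one way to do this with Gaussian elimination:

```python
# Sketch of multiple linear regression via the normal equations (X'X) b = X'y,
# solved by Gaussian elimination; standard library only. Data are invented.

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def multiple_ols(rows, y):
    """rows: list of predictor tuples; returns [b0, b1, ..., bk]."""
    X = [[1.0] + list(r) for r in rows]                # prepend intercept column
    k = len(X[0])
    xtx = [[sum(X[i][a] * X[i][b] for i in range(len(X))) for b in range(k)]
           for a in range(k)]
    xty = [sum(X[i][a] * y[i] for i in range(len(X))) for a in range(k)]
    return solve(xtx, xty)

# Hypothetical traits: (stem height, stem diameter, root length) -> yield
rows = [(40, 3.0, 12), (45, 3.2, 13), (50, 3.1, 15), (55, 3.6, 14), (60, 3.8, 16)]
y = [2.1, 2.4, 2.9, 3.1, 3.5]
b = multiple_ols(rows, y)
print([round(v, 3) for v in b])
```

In practice statistical software does this (with better numerics) and also reports standard errors and t-tests for each coefficient.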
After determining the intercept and regression coefficients, we have to test them for significance using
analysis of variance (ANOVA). ANOVA determines whether the regression coefficients calculated for the
candidate model should be retained in the final model as predictors. Statistical software reports a
P-value (sig-value) for each coefficient's significance test. If the P-value for a coefficient is less than 0.05 (P<0.05),
the coefficient is statistically significant and the related variable should be kept in the model as a predictor;
if it is higher than 0.05 (P>0.05), the coefficient is not statistically significant and the related variable
should not be kept as a predictor (Draper and Smith, 1981). The coefficient of determination, or R-square
(R²), shows how well the model of predictors fits the dependent variable(s): the higher the R², the better
the fit and the goodness of the model. The significance test for the intercept (b0) is conducted in the
same way as for the regression coefficients (Kleinbaum et al., 1998).
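As a quick numerical illustration of the fit measure, R² is computed as 1 − RSS/TSS, and the adjusted R² additionally penalizes the number of predictors k. The observed and fitted values below are invented (they are not from the wheat example):

```python
# Toy computation of R^2 = 1 - RSS/TSS and adjusted R^2; standard library only.
# Observed and fitted values are invented, as if from some fitted model.
from statistics import mean

y     = [2.1, 2.4, 2.9, 3.1, 3.5]   # observed values
y_hat = [2.0, 2.5, 2.8, 3.2, 3.5]   # fitted by a hypothetical one-predictor model

ybar = mean(y)
tss = sum((v - ybar) ** 2 for v in y)                # total sum of squares
rss = sum((v - f) ** 2 for v, f in zip(y, y_hat))    # residual sum of squares
r2 = 1 - rss / tss
n, k = len(y), 1                                     # n observations, k predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)        # penalizes extra predictors
print(round(r2, 3), round(adj_r2, 3))                # 0.968 0.957
```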
Significance tests of the coefficients and R² help researchers decide which predictors are more
important and must be present in the model. Besides these methods, several other techniques have been
developed for determining the best model of predictors. Moreover, when the number of predictors increases,
most of the variables are usually strongly correlated with each other; it is then not necessary for all of
these correlated variables to be present in the model, since they can substitute for each other (Manly, 2001).
Backward elimination: in this technique, unlike forward selection, all variables are initially in the model and
the less important variables are removed step by step. In the first step, all possible models obtained by
removing each single variable are considered, and the variable whose removal gives the smallest mean square
is dropped from the model. The same procedure is applied in the subsequent steps, and whenever the P-value
exceeds the chosen threshold the analysis stops; the model with the remaining variables is taken as the best
predicting model (Burnham and Anderson, 2002).
Forward selection: in this method, in the first step of the analysis, all possible simple regressions for
each of the independent variables are calculated, and the variable with the highest mean square
(or F-value) enters the regression model as the first and most important predictor. In the second
step, the variable entered in the first step stays in the model, all possible two-variable models
containing it are fitted, and the one with the highest mean square becomes the preferred
prediction model. This procedure continues until the P-value of the model exceeds the chosen
threshold; the variables remaining at that point are not included in the prediction model (Harrell,
2001).
Stepwise regression: this variable selection method has proved to be an extremely useful computational
technique in data analysis problems (Dong et al., 2008). As in forward selection, in stepwise regression
all possible univariate models are fitted and the variable with the highest mean square is included
in the model. In the second step, all possible models containing the first included variable are
investigated and the variable with the highest mean square is entered; however, once the
second variable has entered, the first variable must be re-tested for significance in the presence of the
second. If the first variable is still significant, both variables remain in the
model; if it is not, it is removed from the model. This procedure is repeated in the subsequent steps,
and any variable entered in a previous step whose P-value rises above the threshold is removed.
Indeed, this technique uses both forward selection and
backward elimination and is more suitable than either alone (Miller, 2002).
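A greedy forward-selection loop can be sketched as follows. This toy version (standard library only, with invented data) adds at each step the predictor that most reduces the residual sum of squares; real statistical packages instead use F-tests and P-value thresholds for entry and removal, as described above:

```python
# Toy forward selection: greedily add the predictor that most reduces the
# residual sum of squares (software would use an F-test / P-value entry
# threshold instead). Standard library only; the data are invented.

def ols_rss(cols, y):
    """Fit y on an intercept plus the given columns; return the residual SS."""
    n, k = len(y), len(cols) + 1
    X = [[1.0] + [c[i] for c in cols] for i in range(n)]
    # Normal equations (X'X) b = X'y, solved by Gaussian elimination.
    m = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)]
         + [sum(X[i][a] * y[i] for i in range(n))] for a in range(k)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, k):
            f = m[r][col] / m[col][col]
            for c in range(col, k + 1):
                m[r][c] -= f * m[col][c]
    b = [0.0] * k
    for r in range(k - 1, -1, -1):
        b[r] = (m[r][k] - sum(m[r][c] * b[c] for c in range(r + 1, k))) / m[r][r]
    return sum((y[i] - sum(X[i][j] * b[j] for j in range(k))) ** 2 for i in range(n))

def forward_select(predictors, y, min_gain=1e-3):
    """predictors: dict name -> data column. Add predictors while RSS improves."""
    chosen = []
    best_rss = sum((v - sum(y) / len(y)) ** 2 for v in y)  # null (intercept) model
    while len(chosen) < len(predictors):
        gains = {name: best_rss - ols_rss([predictors[p] for p in chosen + [name]], y)
                 for name in predictors if name not in chosen}
        best = max(gains, key=gains.get)
        if gains[best] < min_gain:
            break
        chosen.append(best)
        best_rss -= gains[best]
    return chosen

# y is driven mainly by x4 (spike weight); x2 (spike length) is near-noise.
data = {"x4": [1.0, 1.5, 2.0, 2.5, 3.0, 3.5],
        "x2": [9.0, 8.0, 9.5, 8.5, 9.0, 8.0]}
y = [2.0, 3.1, 4.0, 5.1, 6.0, 7.1]
print(forward_select(data, y))
```

As expected, the informative predictor enters first; backward elimination and stepwise selection modify this loop by starting from the full model or by re-testing previously entered variables.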
Path analysis: regression coefficients depend strongly on the units of the variables. Depending on the
units, the coefficients of the variables can be high or low regardless of their real importance. To make
the coefficients comparable, the solution is to standardize the variables' data by subtracting the mean
and dividing by the standard deviation. After standardizing, the variable with the higher coefficient has
the greater effect on the dependent variable. When the independent variables are correlated with each
other, they can affect one another. In this situation, the correlation between each independent variable
and the dependent variable can be partitioned into the direct effect of that independent variable and its
indirect effects via the other correlated variables (Fig. 1). Using standardized data in the regression
model gives the direct effects of the variables. The indirect effect of a variable is estimated by
multiplying the relevant direct effect by the correlation coefficient between the two or more independent
variables (Shipley, 1997). Therefore, path analysis can be viewed as an extension of the regression model,
used to test the fit of the correlation matrix against two or more causal models being compared by the
researcher (Dong et al., 2008).
Figure 1. Diagram of path analysis: the direct effects of X1, X2 and X3 on Y (final effect) and their indirect effects via the other correlated variables.
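The direct/indirect partitioning can be made concrete with a toy calculation. In the sketch below, the standardized (path) coefficients and the inter-predictor correlations are invented numbers; the correlation of each predictor with Y is reconstructed as its direct effect plus the indirect effects routed through the other predictors:

```python
# Toy partitioning of a correlation into direct and indirect effects, as in
# path analysis: indirect effect of Xi via Xj = r(Xi, Xj) * direct effect of Xj.
# All coefficient and correlation values below are invented for illustration.

# Hypothetical direct effects (standardized regression coefficients) on yield Y
direct = {"x1": 0.60, "x2": 0.20, "x3": 0.10}
# Hypothetical correlations among the predictors
r = {("x1", "x2"): 0.50, ("x1", "x3"): 0.30, ("x2", "x3"): 0.40}

def corr(a, b):
    return 1.0 if a == b else r.get((a, b), r.get((b, a)))

def total_correlation_with_y(var):
    """r(Xi, Y) = direct effect of Xi + sum of indirect effects via the others."""
    return sum(corr(var, other) * direct[other] for other in direct)

for v in direct:
    indirect = total_correlation_with_y(v) - direct[v]
    print(v, "direct:", direct[v], "indirect:", round(indirect, 3))
```

A variable like x3 here can have a small direct effect yet a sizeable total correlation with Y through its correlated neighbours, which is exactly the pattern reported for spikelets/spike in Example 1 below.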
For a better understanding of the regression techniques mentioned above, we present an example here.
Example 1: we measured several morphological traits of three wheat cultivars, consisting of tiller
numbers/plant, spike length, spikelets/spike, spike weight/plant, grains/spike, grain weight/spike, 100-grain
weight, total chlorophyll content of the flag leaf, biologic yield/plant, root weight, leaves area and grain
yield, under four water regimes (Moocheshi et al., 2012). Here we evaluate the relationship between
grain yield and the related measured morphological traits using the techniques mentioned above.
Multivariate regression
Table 1 shows the regression coefficient values, their standard errors, t-values and p-values for the
coefficients. The overall regression equation based on these results is:
Y = 0.5394 - 0.12X1 - 0.02X2 - 0.01X3 + 0.96X4 + 0.01X5 - 0.78X6 - 0.01X7 - 0.004X8 + 0.01X9 + 0.08X10 + 0.001X11
X1= Tiller numbers/plant, X2=Spike length, X3=Spikelets/spike, X4=Spike weight/plant, X5= Grains/spike,
X6= Grain weight/spike, X7= 100-Grain weight, X8= Total chlorophyll content of flag leaf, X9= Biologic
yield/plant, X10= Root weight, X11= Leaves area and Y=Grain yield.
The coefficient of determination (R²) is equal to 99.2%, which is very high, but it is not a realistic
coefficient of determination, because R² increases as more variables are added. Scientists have introduced
the adjusted R² instead of R² to address this problem, but it too is not a completely accepted index. Moreover,
in a situation like this where the variables are abundant, explaining the relation
between the dependent variable and so many independent variables is complex; on the other hand, some
coefficient values are so small that the corresponding variables can be removed from the model. Based on
the p-values, most of the variables are not statistically significant. The p-value indicates which variables
must be present in the model as predictors and which must not. As shown in Table 1, X4 and X6 are the
variables with p-values lower than 0.05, and we select them as the most effective variables on yield. The
predicting model based on the regression analysis is as follows:
Y = 0.96X4 - 0.78X6
Selection procedures
Backward elimination: in the four steps of backward elimination, the four variables X1, X3, X2 and
X7 are removed from the model and the other variables remain. Based on this result, these four
variables are the least important for predicting yield. The predicting model from this procedure is
formulated as follows (Tables 2 and 3):
Y = -0.19 + 0.98X4 + 0.01X5 - 1.54X6 - 0.004X8 - 0.01X9 + 0.1X10 + 0.005X11
X4=Spike weight/plant, X5= Grains/spike, X6= Grain weight/spike, X8= Total chlorophyll content of flag leaf,
X9= Biologic yield/plant, X10= Root weight, X11= Leaves area, and Y= Yield.
Forward selection: as in backward elimination, seven variables are included in the forward selection
model, but the coefficient values differ slightly (Table 4):
Y = -0.003 + 0.98X4 - 0.004X5 + 0.01X6 - 0.01X8 - 1.54X9 + 0.11X10 - 0.003X11
X4=Spike weight/plant, X5= Grains/spike, X6= Grain weight/spike, X8= Total chlorophyll content of flag leaf,
X9= Biologic yield/plant, X10= Root weight, X11= Leaves area, and Y= Yield.
Stepwise selection: Table 5 shows the variables entered into, or removed from, the stepwise regression
model. Similar to the backward and forward results, stepwise selection screens seven variables:
Y = -0.195 + 0.98X4 - 0.01X5 - 1.54X6 - 0.004X8 - 0.01X9 + 0.1X10 - 0.005X11
X4=Spike weight/plant, X5= Grains/spike, X6= Grain weight/spike, X8= Total chlorophyll content of flag leaf,
X9= Biologic yield/plant, X10= Root weight, X11= Leaves area, and Y= Yield.
Which model should serve as the predicting model is the researcher's choice, and the model that best
explains the idea of the research can be used, but stepwise selection is usually the best. On the other hand,
the significance t-test for variables in multivariate regression analysis is not a sufficient technique on its own.
Path analysis: for a better path coefficient analysis and understanding of the relationship between yield
and the other morphological traits, the researcher can feed the results of the selection procedures into the
path analysis; here, however, we considered all variables. In this technique, the correlation coefficient
between yield and each of the measured morphological traits is partitioned into its direct effect and its
indirect effects via the other variables. The highest direct effect on yield was obtained for spike
weight/plant (1.013), while the other variables had very low direct effects on yield (Table 6). The sum of
the indirect effects of spike weight/plant was negative. Except for spike weight/plant, the other variables
had high indirect effects on grain yield. Spikelets/spike showed the lowest contribution to grain yield
through its direct effect, but the highest contribution through the other traits.
Table 1. The regression coefficient (B), standard error (SE), T-value and probability of the estimated
variables in predicting wheat grain yield by the multiple linear regression analysis under inoculation (In)
and non-inoculation (Non-In) conditions and different water levels
Predictor DF B SE T P
Constant 1 0.5394 0.49180 1.10 0.284
X1 1 -0.1164 0.08245 -1.41 0.171
X2 1 -0.0202 0.05014 -0.40 0.691
X3 1 -0.0082 0.02037 -0.40 0.693
X4 1 0.9617 0.01927 49.90 0.001
X5 1 0.0110 0.00699 1.56 0.131
X6 1 -0.7802 0.34490 -2.26 0.033
X7 1 -0.0070 0.00979 -0.71 0.483
X8 1 -0.0042 0.00318 -1.33 0.196
X9 1 0.0131 0.01165 1.12 0.273
X10 1 0.0840 0.09246 0.91 0.373
X11 1 -0.0008 0.00318 -0.25 0.803
X1= Tiller numbers/plant, X2=Spike length, X3=Spikelets/spike, X4=Spike weight/plant, X5=
Grains/spike, X6= Grain weight/spike, X7= 100-Grain weight, X8= Total chlorophyll content of flag leaf,
X9= Biologic yield/plant, X10= Root weight and X11= Leaves area.
Table 3. Backward elimination and variables remaining in the model
Step  Variable  Parameter estimate  Standard error  Sum of squares  F-Value  Pr > F
Intercept -0.19463 0.08673 0.03923 5.040 0.0329
1 x4 0.97670 0.00947 82.8773 640.1 <.0001
2 x5 0.01208 0.00342 0.09736 12.50 0.0014
3 x6 -1.54441 0.21063 0.41875 53.76 <.0001
4 x8 -0.00407 0.00138 0.06753 8.670 0.0064
5 x9 -0.01094 0.00460 0.04402 5.650 0.0245
6 x10 0.09707 0.04682 0.03347 4.300 0.0475
7 x11 0.00505 0.00160 0.07755 9.960 0.0038
X4=Spike weight/plant, X5= Grains/spike, X6= Grain weight/spike, X8= Total chlorophyll content of
flag leaf, X9= Biologic yield/plant, X10= Root weight and X11= Leaves area.
Table 4. Summary of forward selection
Step  Variable entered  Partial R-Square  Model R-Square  Parameter estimate  Standard error  F-Value  Pr > F
1 x4 0.9963 0.9963 0.97859 0.00963 83.85 <.0001
2 x6 0.0013 0.9975 0.01198 0.00341 17.04 0.0002
3 x9 0.0005 0.998 -1.54065 0.21043 7.79 0.0088
4 x5 0.0004 0.9985 -0.00443 0.0043 8.69 0.006
5 x11 0.0002 0.9987 -0.0034 0.00152 5.48 0.0261
6 x8 0.0002 0.9989 -0.01166 0.00465 6.42 0.0169
7 x10 0.0001 0.9991 0.11336 0.04937 4.3 0.0475
Intercept -0.12314 0.11097 0.0034
Table 2. Summary of backward elimination
Step  Variable removed  Number of variables remaining in model  Partial R-Square  Model R-Square  F-Value  Pr > F
1 x1 10 0 0.9991 0.02 0.8836
2 x3 9 0 0.9991 0.03 0.8558
3 x2 8 0 0.9991 0.28 0.6028
4 x7 7 0 0.9991 1.06 0.3117
X1= Tiller numbers/plant, X2= Spike length, X3= Spikelets/spike and X7= 100-Grain weight.
Table 5. Relative contribution (partial and model R²), F-value and probability in predicting wheat grain yield
by the stepwise procedure analysis under non-inoculation condition and different water levels
Step  Variable entered  Variable removed  Partial R-Square  Model R-Square  P-Value ER  Parameter estimate  Standard error  P-Value M
1 x4 - 0.9963 0.9963 <.0001 0.9767 0.00947 <.0001
2 x6 - 0.0013 0.9975 0.0002 -1.54441 0.21063 <.0001
3 x9 - 0.0005 0.998 0.0088 -0.01094 0.00460 0.0245
4 x5 - 0.0004 0.9985 0.0060 0.01208 0.00342 0.0014
5 x11 - 0.0002 0.9987 0.0261 0.00505 0.0016 0.0038
6 x8 - 0.0002 0.9989 0.0169 -0.00407 0.00138 0.0064
7 x10 - 0.0001 0.9991 0.0475 0.09707 0.04682 0.0475
Intercept -0.195 0.0867 0.0329
X4=Spike weight/plant, X5= Grains/spike, X6= Grain weight/spike, X8= Total chlorophyll content of flag
leaf, X9= Biologic yield/plant, X10= Root weight and X11= Leaves area.
R-Square= Coefficient of Determination, P-Value ER= P-value for enter or remove variables and P-Value
M= P-Value for final model.
Principal Component Analysis
Principal component analysis (PCA) is a variable-reduction procedure that is useful when you have
obtained data on a large number of variables and believe that there is some redundancy among them (Fig.
2). PCA reduces data dimensionality by analysing the covariance structure of the variables; its main
advantage is that it reduces the number of dimensions without much loss of information (Everitt and Dunn,
1992). Here, redundancy means that some of the variables are correlated with one another, possibly
because they measure the same construct. PCA uses an orthogonal transformation to convert a set of
observations of possibly correlated variables into a set of values of linearly uncorrelated variables called
principal components (PCs). The number of principal components is less than or equal to the number of
original variables (Dunetman, 1989). The transformation is defined so that the first PC has the largest
possible variance, accounting for as much of the variability in the data as possible, and each succeeding
component in turn has the highest variance possible under the constraint that it be orthogonal to
(uncorrelated with) the preceding components (Jackson, 1991). The PCs are independent when the data set
is jointly normally distributed, and they may then be used as predictor or criterion variables in subsequent
analyses. PCA is sensitive to the relative scaling of the original variables and is mostly used as a tool in
exploratory data analysis and for building predictive models (Anderson, 1984). PCA can be carried out by
eigenvalue decomposition of a data covariance (or correlation) matrix, or by singular value decomposition of
the data matrix, usually after mean-centring (and standardizing, i.e. converting to Z-scores) each attribute.
The results of a PCA are usually discussed in terms of component scores, sometimes called factor scores
(the transformed variable values corresponding to a particular data point), and loadings (the weight by which
each standardized original variable is multiplied to obtain the component score). PCA can thus be thought of
as revealing the internal structure of the data in the way that best explains its variance (Jackson, 1991). If a
multivariate dataset is visualized as a set of coordinates in a high-dimensional data space (one axis per
variable), PCA can supply the user with a lower-dimensional picture: by retaining only the first few principal
components, the dimensionality of the transformed data is reduced (Steel and Torrie, 1960).
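The eigenvalue-decomposition route described above can be sketched directly; this is a generic illustration with names of our own choosing, not the software used in the paper.

```python
import numpy as np

def pca_corr(X):
    """PCA via eigendecomposition of the correlation matrix:
    standardize the variables, decompose their correlation matrix,
    and return eigenvalues (component variances), eigenvectors
    (loading directions), component scores, and the proportion of
    total variance each component explains."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    R = np.corrcoef(Z, rowvar=False)
    eigval, eigvec = np.linalg.eigh(R)       # returned in ascending order
    order = np.argsort(eigval)[::-1]         # largest variance first
    eigval, eigvec = eigval[order], eigvec[:, order]
    scores = Z @ eigvec                      # component scores
    explained = eigval / eigval.sum()        # proportion of variance
    return eigval, eigvec, scores, explained
```

Because the correlation matrix is used, the eigenvalues sum to the number of variables, which is why the eigenvalue-greater-than-one rule used with Table 7 marks components that explain more variance than a single standardized variable.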
Example 2 explains how PCA can be used to explore relationships in an agricultural dataset.
Example 2: Fourteen strawberry cultivars were cultivated in two consecutive years (2009-2010) at the
research center of agriculture and natural resources of Sanandaj, Iran. Ten variables, comprising two sets
(first set morphological traits, second set biochemical traits), were measured (Saed-Moucheshi et al., in
press).
Considering the ten parameters used in this research, ten components were calculated by PCA. As
expected, PC1 showed the highest eigenvalue (3.51), so most of the variation among the data can be
explained by this PC. After component 1, PC2, PC3 and PC4 explain more of the variation than the
remaining components. Together, the first four components explain 85% of the total variation in the data
(Table 7). These components also have eigenvalues greater than unity (1) (Fig. 3), so they were used to
explain the whole variation among the data. Fig. 3 also shows that the eigenvalues decrease as the
component number increases; such eigenvalue patterns are important indicators in genetics and efficient
indicators for screening the genotypes. Flowering period had the highest coefficient in PC1. In components
2, 3 and 4, yield, anthocyanin and berry size, respectively, showed the maximum coefficients among the
traits.
The first component clearly separated the two groups of variables, chemical and morphological. Yield,
berry size, berry weight, and the flowering and fruiting periods had high negative correlations with PC1;
based on this component, these traits have the strongest effects contributing to yield. In PC2, petiole length,
TSS and yield showed the highest (negative) correlations with the component; PC2 suggests that petiole
length had a high effect on yield and, in turn, that higher yield can provide a higher amount of total soluble
solids (TSS). Berry size, berry weight and yield have very low coefficients in PC3, so this component says
little about these traits. The flowering and fruiting periods and anthocyanin content showed the highest
negative contributions to PC3, so these two periods can change the anthocyanin content. Titratable acidity
(TA) had the highest positive coefficient in PC3, and this trait is relatively independent of the others. PC4
likewise showed that higher yield provides higher TSS content, so direct selection for yield should increase
TSS content as well.
Table 7. Principal component analysis of traits measured during two years of strawberry cultivation.

Traits                        Component 1  Component 2  Component 3  Component 4
Anthocyanin                       0.229       -0.191       -0.529       -0.174
Berry size                       -0.379       -0.285        0.009       -0.482
Berry weight                     -0.395       -0.258        0.007       -0.463
Flowering period                 -0.424       -0.016       -0.342        0.326
Fruiting period                  -0.383        0.012       -0.351        0.454
Petiole length                    0.084       -0.480       -0.018        0.348
Stolons/plant                     0.352       -0.196       -0.404       -0.107
Titratable acidity                0.025       -0.376        0.559        0.268
Total soluble solids              0.370       -0.430       -0.059        0.017
Berry yield                      -0.233       -0.469        0.002        0.077
Eigenvalue                        3.510        2.330        1.430        1.251
Proportion of variance (%)        35.1         23.3         14.3         12.5
Cumulative variance (%)           35.1         58.4         72.7         85.2
Factor Analysis
Factor analysis (FA), like principal component analysis, is a statistical method used to describe
variability among observed, correlated variables in terms of a potentially lower number of unobserved
variables called factors. The purpose of FA is to discover simple patterns in the relationships among the
variables (Spearman, 1904). In other words, it is possible, for example, that variations in three or four
observed variables mainly reflect the variations in fewer unobserved variables. FA searches for such joint
variations in response to unobserved latent variables (Anderson, 1984). The observed variables are
modeled as linear combinations of the potential factors, plus error terms. The information gained about the
interdependencies between observed variables can be used later to reduce the set of variables in a dataset
(Manly, 2001). Computationally, this technique is equivalent to a low-rank approximation of the matrix of
observed variables. FA is related to principal component analysis (PCA), but the two are not identical.
Latent variable models, including factor analysis, use regression modeling techniques to test hypotheses,
producing error terms, while PCA is a descriptive statistical technique (Dunetman, 1989).
FA is used to study the patterns of relationship among many dependent variables, with the goal of
discovering something about the nature of the independent variables that affect them, even though those
independent variables were not measured directly. The different methods of FA first extract a set of factors
from a data set. These factors are almost always orthogonal and are ordered according to the proportion of
the variance of the original data that they explain. In general, only a (small) subset of factors is kept for
further consideration, and the remaining factors are considered either irrelevant or nonexistent (i.e., they are
assumed to reflect measurement error or noise). To ease the interpretation of the factors considered
relevant, this first selection step is generally followed by a rotation of the retained factors. Two main types of
rotation are used: orthogonal, when the new axes are also orthogonal to each other, and oblique, when the
new axes are not required to be orthogonal to each other. Because the rotations are always performed in a
subspace (the so-called factor space), the new axes will always explain less variance than the original
factors (which are computed to be optimal), but obviously the part of the variance explained by the total
subspace after rotation is the same as it was before rotation (only the partition of the variance has changed;
Kaiser, 1958).
The model in Fig. 4 proposes that each observed response (measure 1 through measure 5) is influenced
partially by underlying common factors (factor 1 and factor 2) and partially by underlying unique factors (E1
through E5). The strength of the link between each factor and each measure varies, so a given factor
influences some measures more than others. FA is performed by examining the pattern of correlations (or
covariances) between the observed measures. Measures that are highly correlated (either positively or
negatively) are likely influenced by the same factors, while those that are relatively uncorrelated are likely
influenced by different factors (Manly, 1986).
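The varimax rotation discussed above (Kaiser, 1958) admits a compact SVD-based implementation. The routine below is a generic sketch with our own naming, not the package the authors used; it rotates a loading matrix so that squared loadings concentrate within factors, which is what makes tables like Table 8 interpretable.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-6):
    """Varimax rotation (Kaiser, 1958): orthogonally rotate a loading
    matrix (variables x factors) so that the variance of the squared
    loadings within each factor is maximized, driving each variable's
    loading toward 0 or +/-1 and easing interpretation."""
    p, k = loadings.shape
    rotation = np.eye(k)
    var_old = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # gradient of the varimax criterion (standard SVD-based update)
        tmp = rotated * ((rotated ** 2).sum(axis=0) / p)
        u, s, vt = np.linalg.svd(loadings.T @ (rotated ** 3 - tmp))
        rotation = u @ vt
        var_new = s.sum()
        if var_old != 0 and var_new < var_old * (1 + tol):
            break
        var_old = var_new
    return loadings @ rotation
```

Because the rotation is orthogonal, each variable's communality (the row sum of squared loadings, as in the last column of Table 8) is unchanged; only how the explained variance is partitioned among the factors changes.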
Fig. 3. Component numbers and their eigenvalues in principal component analysis.
Figure 4. Diagram of factor analysis.
For factor analysis we use the data of Example 1. In these data, the first two of the twelve factors in the
factor analysis accounted for 60.1% of the total variation in the data structure (Table 8). The first factor
loaded on yield and spike weight/plant and explained 49.5% of the total variation in the dependence
structure, so its suggested name is "yield". The second factor accounted for 10.6% of the total variability,
consisted of total chlorophyll content of the flag leaf, and was named "total chlorophyll". The first two factors
have eigenvalues greater than unity (1) and are graphically shown in Fig. 5 (a).
In this example, factor analysis showed that spike weight/plant and total chlorophyll content of the flag leaf
had the highest relative contributions to wheat grain yield. Such results can also be seen in Fig. 5 (b).
Table 8. Rotated (Varimax rotation) factor loadings and communalities for the estimated variables of
wheat based on factor analysis technique for inoculation and non-inoculation conditions and different
water levels
Variable Factor1 Factor2 Communality
X1 0.159 0.384 0.543
X2 0.194 0.196 0.390
X3 0.324 0.127 0.451
X4 0.875 0.157 1.032
X5 0.230 0.280 0.510
X6 0.246 0.056 0.302
X7 0.250 0.247 0.497
X8 0.220 0.817 1.037
X9 0.411 0.421 0.832
X10 0.299 0.138 0.437
X11 0.374 0.126 0.500
Y 0.885 0.140 1.025
Latent roots 2.338 1.268 3.606
Factor variance (%) 49.50 10.60 60.10
X1= Tiller numbers/plant, X2=Spike length, X3=Spikelets/spike, X4=Spike weight/plant, X5=
Grains/spike, X6= Grain weight/spike, X7= 100-Grain weight, X8= Total chlorophyll content of flag leaf,
X9= Biologic yield/plant, X10= Root weight, X11= Leaves area and Y= Grain yield.
Figure 5 (a). Scree plot showing eigenvalues in response to the number of factors for the estimated variables
of wheat.
Figure 5 (b). Variable loadings from factor analysis with varimax rotation on the first two factors.
X1= Tiller numbers/plant, X2=Spike length, X3=Spikelets/spike, X4=Spike weight/plant, X5= Grain
number/spike, X6= Grain weight/spike, X7= 100-Grain weight, X8= Total chlorophyll content of flag leaf, X9=
Biologic yield/plant, X10= Root weight, X11= Leaves area and Y= Grain yield.
Clustering Analysis
Cluster analysis, or clustering, is the task of assigning a set of objects into groups (called clusters) so
that the objects in the same cluster are more similar (in some sense or another) to each other than to those
in other clusters. Clustering is a main task of exploratory data mining and a common technique for statistical
data analysis, used in many fields including machine learning, pattern recognition, image analysis,
information retrieval, and bioinformatics. Cluster analysis itself is not one specific algorithm, but the general
task to be solved (Romesburg, 1984). It can be achieved by various algorithms that differ significantly in
their notion of what constitutes a cluster and in how to find clusters efficiently. Popular notions of clusters
include groups with low distances among the cluster members, dense areas of the data space, intervals, or
particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization
problem. The appropriate clustering algorithm and parameter settings (including values such as the distance
function to use, a density threshold, or the number of expected clusters) depend on the individual data set
and the intended use of the results (Richard, 2007). Indeed, cluster analysis is an exploratory data analysis
tool for organizing observed data into meaningful taxonomies, groups, or clusters, based on combinations of
variables, maximizing the similarity of cases within each cluster while maximizing the dissimilarity between groups that
are initially unknown. In this sense, cluster analysis creates new groupings without any preconceived notion
of what clusters may arise (Singh and Chowdhury, 1985). Cluster analysis, like factor analysis, makes no
distinction between dependent and independent variables; the entire set of interdependent relationships is
examined. Cluster analysis is the obverse of factor analysis: whereas factor analysis reduces the number of
variables by grouping them into a smaller set of factors, cluster analysis reduces the number of observations
or cases by grouping them into a smaller set of clusters (Johnson and Wicheren, 1996). On the other hand,
Everitt and Dunn (1992) and Nouri et al. (2011) stated that the main advantage of using PCA over cluster
analysis is that each variable can be allocated to one group only.
The first choice that must be made in carrying out a cluster analysis is how similarity (or, alternatively,
distance) between data is to be defined. There are many ways to compute how similar two series of data
are, such as the Pearson correlation, the Spearman rank correlation (for ranked data), the Euclidean
distance, etc. (Romesburg, 1984). After choosing a distance measure for similarity, a related clustering
method, hierarchical or non-hierarchical, must be used. The hierarchical method is the most popular; in this
procedure a hierarchy or tree-like structure is constructed to show the relationships among cases. The
clusters can be arrived at either by splitting off dissimilar observations (divisive methods) or by joining
together similar observations (agglomerative methods). Most common statistical packages use
agglomerative methods, and the most popular agglomerative methods are (1) single linkage (nearest
neighbour approach), (2) complete linkage (furthest neighbour), (3) average linkage, (4) Ward's method,
and (5) the centroid method (Everitt, 1993).
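To make the agglomerative idea concrete, here is a naive single-linkage (nearest-neighbour) clusterer on Euclidean distances. It is a teaching sketch with invented names, not the software behind Fig. 6; real analyses would use an optimized library routine and would usually inspect the full dendrogram rather than cut at a fixed number of clusters.

```python
import numpy as np

def single_linkage(X, n_clusters):
    """Naive agglomerative clustering with single linkage: start with
    each observation as its own cluster and repeatedly merge the two
    clusters whose closest pair of members has the smallest Euclidean
    distance, until n_clusters remain."""
    clusters = [[i] for i in range(len(X))]
    # pairwise Euclidean distance matrix between observations
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    while len(clusters) > n_clusters:
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: minimum distance between members
                d = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a] += clusters[b]   # merge the closest pair of clusters
        del clusters[b]
    return clusters
```

On two well-separated groups of points, the procedure recovers the groups exactly, mirroring how the chickpea genotypes in Fig. 6 fall into clusters of similar cases.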
Example 3: Twenty chickpea cultivars were cultivated in 2005 at the research center of Razi University,
Kermanshah, Iran, under rainfed conditions (Moucheshi et al., 2009-2010). Yield and its components were
measured, and the cultivars were grouped using cluster analysis based on the measured traits.
Cluster analysis of the chickpea genotypes based on grain yield and its components (Fig. 6) classified
the genotypes into four groups containing 5, 4, 2 and 9 genotypes, respectively. The highest distance, or
dissimilarity, between genotypes was observed for genotypes 1 and 17, and the highest similarity was
obtained for genotypes 18 and 20. Based on these results, cultivars grouped within a cluster may have a
common origin; on the other hand, crossing between genotypes in distant clusters, such as the first and
fourth, can provide much variation for plant breeding aims.
Figure 6. Results of cluster analysis for 20 chickpea genotypes under rainfed conditions.
Canonical Correlation
In statistics, dependence refers to any statistical relationship between two random variables or two sets
of data, and correlation refers to any of a broad class of statistical relationships involving dependence.
Familiar examples of dependent phenomena include the correlation between the physical statures of
parents and their offspring, and the correlation between the demand for a product and its price. Formally,
dependence refers to any situation in which random variables do not satisfy a mathematical condition of
probabilistic independence (Steel and Torrie, 1960).
Correlation can refer to any departure of two or more random variables from independence, but
technically it refers to any of several more specialized types of relationship between mean values. There are
several correlation coefficients, often denoted ρ or r, measuring the degree of correlation. The most common
of these is the Pearson correlation coefficient, which is sensitive only to a linear relationship between two
variables (a relationship which may exist even if one variable is a nonlinear function of the other). Other
correlation coefficients have been developed to be more robust than the Pearson correlation, that is, more
sensitive to nonlinear relationships (Johnson and Wicheren, 1996).
In a canonical correlation (a multiple-multiple correlation), the data are divided into two sets of related
variables: one set of two or more X variables and another set of two or more Y variables (for example,
independent and dependent variables, respectively). The goal is to describe the relationships between the
two sets. Canonical weights (coefficients) a1, a2, a3, ..., ap are applied to the p X variables, and b1, b2, b3,
..., bm are applied to the m Y variables, in such a way that the correlation between CVX1 and CVY1 is
maximized (Bratlet, 1974):
CVX1 = a1X1 + a2X2 + ... + apXp
CVY1 = b1Y1 + b2Y2 + ... + bmYm
CVX1 and CVY1 are the first canonical variates, and their correlation is the sample canonical
correlation coefficient for the first pair of canonical variates (Fig. 7). The residuals are then analyzed in the
same fashion to find a second pair of canonical variates, CVX2 and CVY2, whose weights are chosen to
maximize the correlation between CVX2 and CVY2, using only the variance remaining after the variance due
to the first pair of canonical variates has been removed from the original variables. This continues until a
"significance" cutoff is reached or the maximum number of pairs (which equals the smaller of m and p) has
been found (Giffins, 1985).
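Numerically, the canonical correlations (the "roots" of Table 9) are the square roots of the eigenvalues of Sxx⁻¹ Sxy Syy⁻¹ Syx, where S denotes the within- and between-set cross-product matrices. A generic sketch (function name ours, not the authors' software):

```python
import numpy as np

def canonical_corr(X, Y):
    """Canonical correlations between two variable sets: the squared
    canonical correlations (the 'roots') are the eigenvalues of
    Sxx^-1 Sxy Syy^-1 Syx, and there are min(p, m) of them, where p
    and m are the numbers of variables in the two sets."""
    Xc = X - X.mean(axis=0)                  # center each set
    Yc = Y - Y.mean(axis=0)
    Sxx, Syy, Sxy = Xc.T @ Xc, Yc.T @ Yc, Xc.T @ Yc
    M = np.linalg.solve(Sxx, Sxy) @ np.linalg.solve(Syy, Sxy.T)
    roots = np.sort(np.linalg.eigvals(M).real)[::-1]
    roots = roots[: min(X.shape[1], Y.shape[1])]
    return np.sqrt(np.clip(roots, 0.0, 1.0))  # canonical correlations
```

The number of returned roots equals the number of variables in the smaller set, which is why Example 4, with four photosynthesis-related traits, has four roots.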
Figure 7. Diagram for canonical correlation.
Example 4: Nine variables, comprising two sets (yield and its components, and photosynthesis-related
traits), were measured in 20 chickpea genotypes under rainfed conditions at Razi University, Kermanshah,
Iran, in 2005. We want to consider the relationship between these two sets of variables (unpublished data).
The number of roots (eigenvalues, or squared canonical correlations) is equal to the number of
variables in the smaller data set; therefore, the number of roots in this example is 4 (Fig. 8). In this example
none of the canonical correlations between the sets of variables is significant, so there is no relationship
between these two sets (Table 9). For a better understanding of this kind of correlation, assume that the first
canonical correlation (0.428) were significant. Yield has the highest (negative) contribution to the first root
among the first set of the data, while 100-seed weight has a high positive contribution. The highest negative
contribution among the second set of the data belonged to chlorophyll fluorescence, and the highest positive
one was observed for chlorophyll b. These results show that yield and chlorophyll fluorescence have a direct
relationship, with chlorophyll a and number of pods per plant also contributing somewhat to this relationship;
these variables are negatively correlated with the first root. On the other hand, 100-seed weight, seed
weight, number of seeds per plant, chlorophyll b and total chlorophyll are directly related to one another,
though the contributions of SW and NSP in the first set and Ch ab in the second set are low; these variables
are positively correlated with the first root.
Redundancy index is the amount of variance in a canonical variate (dependent or independent)
explained by the other canonical variate in the canonical function. It can be computed for both the dependent
and the independent canonical variates in each canonical function. The variability of each data set explained
by the other in this example is very low (3.00% for the first set and 6.83% for the second set).
Table 9. Summary of the canonical correlation analysis

              Root 1    Root 2    Root 3    Root 4
First set:
100SW          2.694    -0.746     1.906     0.890
SW             0.588     0.100    -1.716     0.245
NSP            0.677     0.907     0.107    -0.060
NPP           -0.231    -0.830    -0.231    -0.954
Y             -2.897     1.042     0.038    -1.069
   Variance extracted: 70.22%; total redundancy: 3.00%
Second set:
Ch a          -0.288     0.999    -0.118     0.147
Ch b           0.529     0.510    -0.076    -0.836
Ch ab          0.402    -0.375    -0.778     0.520
Ch f          -0.903    -0.202    -0.452    -0.253
   Variance extracted: 100%; total redundancy: 6.83%
Eigenvalue     0.1835    0.0805    0.0525    0.001
Can. corr.     0.428     0.284     0.229     0.032
P-value        0.55817   >0.56     >0.56     >0.56

100SW: 100-seed weight; SW: seed weight per plant; NSP: number of seeds per plant; NPP:
number of pods per plant; Y: yield; Ch a: chlorophyll a; Ch b: chlorophyll b; Ch ab: total
chlorophyll content; and Ch f: chlorophyll fluorescence.
Figure 8. Plot of eigenvalues by root number and their contribution to the canonical correlation.
It seems that research in agriculture and plant science is often somewhat weak in its statistical
discussion and interpretation. This review has explained the most widely applied multivariate statistical
methods that researchers in agriculture and plant science can use in their investigations to give more
authority to their work and results.
References
Anderson TW, 1984. An introduction to multivariate statistical analysis. John Wiley, New York.
Bratlet MS, 1974. The general canonical correlation distribution. Annals of Mathematical Statistics 18: 1-17.
Burnham KP, Anderson DR. 2002. Model selection and multimodel inference. Springer, New York.
Dong B, Liu M, Shao HB, Li Q, Shi L, Du F, Zhang Z, 2008. Investigation on the relationship between leaf
water use efficiency and physio-biochemical traits of winter wheat under rained condition. Colloids and
Surfaces B: Biointerfaces 62: 280-287.
Draper NR, Smith H, 1966. Applied Regression Analysis. John Wiley, New York.
Draper NR, Smith H, 1981. Applied regression analysis. John Wiley, New York.
Dunetman GH, 1989. Principal component analysis. Sage Publication, Newbury Park.
Everitt BS, 1993. Cluster Analysis. Wiley, New York.
Everitt BS, Dunn G, 1992. Applied Multivariate Data Analysis, Oxford University Press, New York, NY.
Giffins R, 1985. Canonical analysis: a review with application in ecology. Springer-Verlag, Berlin.
Harrell FE, 2001. Regression modeling strategies: With applications to linear models, logistic regression, and
survival analysis. Springer-Verlag New York.
Jackson JE, 1991. A user's guide to principal component. John Wiley New York.
Johnson RA, Wicheren, DW, 1996. Applied multivariate statistical analysis. Prentice Hall of India, New Delhi.
Kaiser HF, 1958. The varimax criterion for analytic rotation in factor analysis. Psychometrika 23: 187-200.
Kleinbaum DG, Kupper LL, Muller KE, 1988. Applied Regression Analysis and Other Multivariable Methods.
PWS-Kent Publishing Co, Boston.
Manly BFJ, 1986. Multivariate statistical methods: a primer. Chapman and Hall, London - New York.
Manly BFJ, 2001. Statistics for environmental science and management. Chapman and Hall/CRC, Boca
Raton.
Miller AJ, 2002. Subset selection in regression. Chapman and Hall London.
Moucheshi AS, Heidari B, Dadkhodaie A, 2009-2010. Genetic Variation and Agronomic Evaluation of
Chickpea Cultivars for Grain Yield and Its Components Under Irrigated and Rainfed Growing
Conditions. Iran Agricultural Research 28-29: 39-50.
Mouchesi A, Heidari B, Assad MT, 2012. Alleviation of drought stress effects on wheat using arbuscular
mycorrhizal symbiosis. International Journal of AgriScience 2: 35-47.
Nouri A, Etminan A, Dasilva D, Mohammad R, 2011. Assessment of yield, yield-related traits and drought
tolerance of durum wheat genotypes (Triticum turgidum var. durum Desf.). Australian Journal of Crop
Science 5: 8-16.
Richard AJ, 2007. Applied multivariate statistical analysis. Prentice Hall
Romesburg HC, 1984. Cluster analysis for researchers. Lifetime Learning Publications, Belmont.
Saed-Moucheshi A, Karami F, Nadafi S, Khan AA, (in press). Heritability, genetic variability and
interrelationship among some morphological and chemical parameters of strawberry cultivars. Pakistan
Journal of Botany.
Shipley B, 1997. Exploratory path analysis with applications in ecology and evolution. The American
Naturalist 149: 1113-1138.
Singh RK, Chowdhury BD, 1985. Biometrical methods in quantitative genetic analysis. Kalyani Publishers,
Ludhiana, New Delhi.
Spearman C, 1904. General intelligence, objectively determined and measured. American Journal of
Psychology 15: 201-293.
Steel RGD, Torrie JH, 1960. Principles and Procedures of Statistics. McGraw Hill Book Co. Inc., New York.