Cassava, as both a food and a cash crop, has contributed immensely to Sierra Leone's economic development, including by providing employment for Sierra Leoneans. Cassava is the second largest food crop grown across the country. Despite its importance and tremendous contributions to the country's economic development, its production faces several constraints. This work, therefore, used a statistical modeling technique to identify the major factors influencing cassava productivity in the southern part of Sierra Leone. It further measured the effect of each factor on cassava productivity. A multiple binary logistic regression modeling technique was used in the empirical analysis. Two hundred cassava farmers were randomly selected from the communities in the study area. Cassava productivity was measured by the level of cassava yield. Initially, several factors were considered as possible determinants of the level of cassava yield. However, the empirical analysis showed that farm size, educational level, and the age-by-farming-experience interaction are the main factors influencing cassava productivity in the study area. An increase in farm size can increase cassava yield, while an increase in educational level may decrease cassava productivity. Older people with more farming experience can contribute significantly to cassava productivity.
2. A Logistic Regression Model to Identify Factors Influencing Cassava Productivity in the Southern Part of Sierra Leone
Sesay and Koneh 593
increase in the urban population and the demand for food
(Essers et al. 2005). To feed the urban dwellers, food
supply from every farm household has to increase by at
least 63% in 10 years (Sanni et al. 2009). This clearly
points out the necessity for an increase in the growth of a
supplementary food crop like Cassava.
Despite its importance and tremendous contributions,
cassava production in Sierra Leone faces several
constraints. These include inadequate funding, lack of
farming experience, limited availability of land for
farming, and the educational level of cassava farmers.
This study, therefore, aims to identify the major factors
influencing cassava productivity in the Moyamba District,
Southern Province of Sierra Leone. It uses a logistic
regression modeling technique to identify the key
determinants of cassava productivity and to measure the
effect of each determinant on the yield of cassava grown
in the study area.
MATERIALS AND METHODS
Theoretical Frameworks
This section reviews the theoretical and conceptual
frameworks for using a logistic regression method to
analyze a categorical outcome. It also points out the main
statistics used in checking the logistic regression model.
Logistic Regression
Regression analysis is a predictive modeling technique. It
investigates and estimates the relationship between a
variable of interest, called the dependent or target variable,
and one or more variables, called predictors, that may
influence the dependent variable. Based on the
type of dependent variable, the number of independent
variables and the shape of the regression line, there exist
different regression techniques used to investigate
relevant relationships and to make valuable predictions.
Among these numerous regression techniques, this work
used a multiple binary logistic regression modeling
technique to investigate the factors influencing cassava
production in the Moyamba District. Cassava productivity
was measured by the level of cassava yield. The model is
‘multiple’ because there was more than one independent
variable, ‘binary’ because the variable of interest, called
the dependent variable, was dichotomous (high or low yield),
and ‘logistic’ because of the lack of linearity between the
dependent variable and the independent variable(s).
In building the logistic regression model to achieve the
purpose of this research work, the following concepts and
statistics were considered:
The Binomial Distribution
The binomial distribution is appropriate to use as an error
distribution in logistic regression because:
1. the outcome of interest is dichotomous (a success or a
failure); and
2. a number of independent trials are considered.
Let:
$$
y_i =
\begin{cases}
1 & \text{if the } i\text{th farm's cassava yield is high} \\
0 & \text{if the } i\text{th farm's cassava yield is low}
\end{cases}
$$
Equation (1)
where $y_i$ is the level of the yield for farm $i$. Here, $y_i$ is
considered as a realization of a random variable $Y_i$ that
can take the values one and zero with probabilities $p_i$ and
$1 - p_i$ respectively. The distribution of $Y_i$ is called a
Bernoulli distribution with parameter $p_i$ and can be written
as
$$
\Pr(Y_i = y_i) = p_i^{y_i} (1 - p_i)^{1 - y_i}
$$
Equation (2)
for $y_i = 0, 1$. If $y_i = 1$, $p_i$ is obtained, and if $y_i = 0$,
$1 - p_i$ is obtained.
Logistic Regression Model
From the above discussion of the binomial distribution, the
logistic regression model can be understood as a means
of finding the $\beta$ parameters that best fit:
$$
y_i =
\begin{cases}
1 & \text{if } \beta_0 + \beta_1 x + \varepsilon > 0 \\
0 & \text{otherwise}
\end{cases}
$$
Equation (3)
where $\varepsilon$ is an error term.
In short, if $\hat{p}$ is the predicted probability that $Y = 1$, given
the values of $x_1, \dots, x_k$, the model assumes that
$$
\log\frac{\hat{p}}{1 - \hat{p}} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
$$
Equation (4)
where $Y \sim \mathrm{Binomial}(\hat{p})$.
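As an illustration of Equation (4), the sketch below inverts the log-odds relationship to obtain a predicted probability. The function names and the coefficient values are hypothetical and are not estimates from this study.

```python
import math

def predicted_probability(intercept, coefficients, predictors):
    """Invert Equation (4): p = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk)))."""
    linear_predictor = intercept + sum(
        b * x for b, x in zip(coefficients, predictors)
    )
    return 1.0 / (1.0 + math.exp(-linear_predictor))

def log_odds(p):
    """Left-hand side of Equation (4)."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients and predictor values, for illustration only.
p_hat = predicted_probability(-1.0, [0.5, 0.2], [2.0, 1.0])
print(round(p_hat, 4), round(log_odds(p_hat), 4))
```

Applying `log_odds` to the result recovers the linear predictor, which is the sense in which the model is linear on the log-odds scale.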
Parameter Interpretation
Unlike the simple linear model, $Y = \beta_0 + \beta_1 x_1$, in which
$Y$ increases by $\beta_1$ if $x$ increases by 1, in a logistic
regression model it is $\log\frac{\hat{p}}{1 - \hat{p}}$ which increases by $\beta_1$. To
see this, let the predicted probability of the event of interest
be $\hat{p}_0$ when $x = 0$ and $\hat{p}_1$ when $x = 1$; then
$$
\log\frac{\hat{p}_0}{1 - \hat{p}_0} = \beta_0
$$
J. Agric. Econ. Rural Devel. 594
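$$
\log\frac{\hat{p}_1}{1 - \hat{p}_1} = \beta_0 + \beta_1
$$
$$
\log\frac{\hat{p}_1}{1 - \hat{p}_1} = \log\frac{\hat{p}_0}{1 - \hat{p}_0} + \beta_1
$$
Taking the exponential of both sides of this equation, we have:
$$
e^{\log\frac{\hat{p}_1}{1 - \hat{p}_1}} = e^{\log\frac{\hat{p}_0}{1 - \hat{p}_0} + \beta_1}
$$
This gives
$$
\frac{\hat{p}_1}{1 - \hat{p}_1} = \frac{\hat{p}_0}{1 - \hat{p}_0} \times e^{\beta_1}
$$
Equation (5)
This means that, when $x$ increases by 1, the odds of a positive
outcome are multiplied by a factor of $e^{\beta_1}$. Therefore, $e^{\beta_1}$
is called the odds ratio for a unit increase in $x$.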
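The result in Equation (5) can be checked numerically. The sketch below uses hypothetical values of the intercept and slope (not estimates from this study) and confirms that the ratio of odds for a unit increase in x equals the exponential of the slope, whatever the starting value of x.

```python
import math

def probability(b0, b1, x):
    """Predicted probability under a one-predictor logistic model."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def odds(p):
    return p / (1.0 - p)

# Hypothetical coefficients, for illustration only.
b0, b1 = -2.0, 0.7

# Odds ratio for a unit increase in x, evaluated at several starting points.
ratios = [
    odds(probability(b0, b1, x + 1)) / odds(probability(b0, b1, x))
    for x in (0, 1, 5)
]
print([round(r, 6) for r in ratios], round(math.exp(b1), 6))
```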
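To be specific, the odds ratio for a continuous independent
variable, $OR_c$, can be defined as:
$$
OR_c = \frac{\mathrm{odds}(x+1)}{\mathrm{odds}(x)}
= \frac{F(x+1) / \left(1 - F(x+1)\right)}{F(x) / \left(1 - F(x)\right)}
= \frac{e^{\beta_0 + \beta_1 (x+1)}}{e^{\beta_0 + \beta_1 x}}
= e^{\beta_1}
$$
Equation (6)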
In the case of a binary independent variable, the odds ratio
can be defined as $\frac{ad}{bc}$, where a, b, c and d are the cells
of a 2×2 contingency table.
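A minimal sketch of the 2×2 odds ratio follows. The cell layout (a and b in the first row, c and d in the second) and the counts are assumptions made for illustration; the source does not specify the layout or give counts.

```python
def odds_ratio_2x2(a, b, c, d):
    """Odds ratio ad/bc for a 2x2 table laid out as
           outcome=1  outcome=0
    x=1        a          b
    x=0        c          d
    (assumed layout)."""
    if b == 0 or c == 0:
        raise ValueError("odds ratio is undefined when b or c is zero")
    return (a * d) / (b * c)

# Hypothetical counts, for illustration only.
example_or = odds_ratio_2x2(30, 20, 15, 35)
print(example_or)
```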
Measures of fit for Logistic Regression
Like any classical linear model, a vital part of logistic
regression analysis is how well the model fits the data.
Before trusting the result of a model to make valid
conclusions or predict future outcomes, it is important to
check the model beyond all reasonable doubt to make sure
that the model assumed is correctly specified and the data
at hand do not conflict with assumptions made by the
model.
The residuals or differences between observed and fitted
values were the raw materials used in these tests.
Deviance Goodness-of-Fit Test
The deviance goodness-of-fit test assesses the
discrepancy between the current model and the full
(saturated) model. The deviance statistic, denoted $D^2$, is:
$$
D^2 = 2\left[\log L_s(\hat{\beta}) - \log L_m(\hat{\beta})\right]
$$
Equation (7)
where
$\log L_m(\hat{\beta})$ = maximized log-likelihood of the fitted model
$\log L_s(\hat{\beta})$ = maximized log-likelihood of the saturated model
Evidence for model lack-of-fit occurs when the value of $D^2$
is large.
Pearson Goodness-of-Fit Test
The Pearson goodness-of-fit test also assesses the
discrepancy between the current model and the full model.
The test statistic is:
$$
\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}
= N \sum_{i=1}^{n} \frac{(O_i / N - p_i)^2}{p_i}
$$
Equation (8)
where
$\chi^2$ = Pearson's cumulative test statistic, which
asymptotically approaches a $\chi^2$ distribution.
$O_i$ = the number of observations of type $i$.
$N$ = total number of observations.
$E_i = N p_i$ = the expected (theoretical) frequency of type $i$,
asserted by the null hypothesis that the fraction of type $i$ in
the population is $p_i$.
$n$ = the number of cells in the table.
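The two forms of the Pearson statistic in Equation (8) can be computed as in the sketch below. The observed counts and null proportions are hypothetical and serve only to show that both forms give the same value.

```python
def pearson_chi_square(observed, null_proportions):
    """First form of Equation (8): sum over cells of (O - E)^2 / E, with E = N * p."""
    total = sum(observed)
    expected = [total * p for p in null_proportions]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def pearson_chi_square_proportions(observed, null_proportions):
    """Second form of Equation (8): N * sum over cells of (O/N - p)^2 / p."""
    total = sum(observed)
    return total * sum(
        (o / total - p) ** 2 / p for o, p in zip(observed, null_proportions)
    )

# Hypothetical counts in two cells under a 50/50 null hypothesis.
chi_sq = pearson_chi_square([60, 40], [0.5, 0.5])
chi_sq_alt = pearson_chi_square_proportions([60, 40], [0.5, 0.5])
print(chi_sq, chi_sq_alt)
```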
Hosmer Lemeshow
This goodness-of-fit test was used to determine whether
the predicted probabilities deviate from the observed
probabilities in a way that the binomial distribution does not
predict. If the p-value for the goodness-of-fit test is lower
than the chosen significance level, the predicted
probabilities deviate from the observed probabilities in a
way that the binomial distribution cannot predict.
Hosmer and Lemeshow (2000) recommended partitioning
the observations into 10 equal-sized groups according to
their predicted probabilities, so that
$$
G^2_{HL} = \sum_{j=1}^{10} \frac{(O_j - E_j)^2}{E_j \left(1 - E_j / n_j\right)} \sim \chi^2_8
$$
Equation (9)
where
$n_j$ = number of observations in the $j$th group
$O_j = \sum_i y_{ij}$ = observed number of cases in the $j$th group
$E_j$ = expected number of cases in the $j$th group
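The grouping and summation in Equation (9) can be sketched as follows. This is a minimal illustration, not the SPSS procedure used in the study: the function name is hypothetical, groups are formed by simple sorting, and ties in predicted probability are not treated specially.

```python
def hosmer_lemeshow_statistic(outcomes, probabilities, groups=10):
    """Equation (9): sort by predicted probability, split into `groups`
    nearly equal-sized groups, and sum (O_j - E_j)^2 / (E_j * (1 - E_j / n_j)).

    Assumes every group is non-empty and that no group has an expected
    count of exactly 0 or n_j (otherwise the denominator is zero).
    """
    order = sorted(range(len(outcomes)), key=lambda i: probabilities[i])
    n = len(order)
    statistic = 0.0
    for g in range(groups):
        members = order[g * n // groups:(g + 1) * n // groups]
        n_j = len(members)
        observed = sum(outcomes[i] for i in members)       # O_j
        expected = sum(probabilities[i] for i in members)  # E_j
        statistic += (observed - expected) ** 2 / (expected * (1 - expected / n_j))
    return statistic

# Hypothetical data with two groups, for illustration only.
hl_example = hosmer_lemeshow_statistic([0, 0, 1, 1], [0.25, 0.25, 0.75, 0.75], groups=2)
print(round(hl_example, 4))
```

With the default of 10 groups the statistic is referred to a chi-square distribution with 8 degrees of freedom, as stated in Equation (9).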
Measures of the Predictive Power of the Logistic
Regression Model
The R2 statistic for logistic regression was used to
measure the predictive power of the model. There are
different versions of R2 in the statistics literature, but this
work used the Cox and Snell and the Nagelkerke R2
statistics produced by SPSS.
Adding any variable tends to increase the value of R2,
even if that variable is irrelevant. For this reason,
an adjusted R2 is preferred for assessing the predictive
power of the logistic regression model.
Cox and Snell R2: It is sometimes referred to as a
“pseudo” R2. The Cox and Snell R2 is
$$
R^2_{C\&S} = 1 - \left(\frac{L_0}{L_M}\right)^{2/n}
$$
Equation (10)
where n is the sample size, $L_0$ is the likelihood of the
intercept-only model and $L_M$ is the likelihood of the fitted model.
Nagelkerke R Square: The Nagelkerke R Square adjusts
the Cox and Snell R Square so that the range of possible
values extends to 1. This is achieved by dividing the Cox
and Snell R-squared by its maximum possible value,
$$
1 - L(M_{\text{intercept}})^{2/N}
$$
Equation (11)
So, if the full model perfectly predicts the outcome and has
a likelihood of 1, the Nagelkerke R-squared equals 1. That is,
when $L(M_{\text{full}}) = 1$, $R^2 = 1$, and when $L(M_{\text{full}}) = L(M_{\text{intercept}})$,
$R^2 = 0$. The Nagelkerke R Square is:
$$
R^2_N = \frac{1 - \left(\dfrac{L(M_{\text{intercept}})}{L(M_{\text{full}})}\right)^{2/N}}{1 - L(M_{\text{intercept}})^{2/N}}
$$
Equation (12)
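Both pseudo R-squared measures can be computed from the maximized log-likelihoods of the intercept-only and fitted models, as in the sketch below. Working with log-likelihoods instead of raw likelihoods is a choice made here for numerical stability; the function names and the log-likelihood values are hypothetical and are not results from this study.

```python
import math

def cox_snell_r2(loglik_null, loglik_model, n):
    """Equation (10): 1 - (L0 / LM)^(2/n), written with log-likelihoods."""
    return 1.0 - math.exp(2.0 * (loglik_null - loglik_model) / n)

def nagelkerke_r2(loglik_null, loglik_model, n):
    """Cox and Snell R2 divided by its maximum possible value, Equation (11)."""
    max_value = 1.0 - math.exp(2.0 * loglik_null / n)
    return cox_snell_r2(loglik_null, loglik_model, n) / max_value

# Hypothetical log-likelihoods for a sample of 200, for illustration only.
ll_null, ll_model, sample_size = -138.6, -100.0, 200
print(round(cox_snell_r2(ll_null, ll_model, sample_size), 3),
      round(nagelkerke_r2(ll_null, ll_model, sample_size), 3))
```

The two boundary cases described in the text follow directly: a fitted model no better than the intercept-only model gives 0, and a fitted model with likelihood 1 (log-likelihood 0) gives a Nagelkerke value of 1.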
Methodology
This section introduces stages involved in the data
analysis. It also points out the type of data analysis
adopted at each stage together with the need for each
analysis.
Description of study area
This study was carried out in the Moyamba District, in the
southern part of Sierra Leone, which had a population of 318,064
in the 2015 population and housing census (Statistics
Sierra Leone, 2015). Like other parts of the country,
Moyamba District has seasonal variation. It has a rainy
season that starts in May and ends in October and a dry
season that starts in November and ends in April. One of
the main occupations of people living in this part of the
country is farming.
Sampling Technique and Data Collection
In line with the work of Peduzzi et al. (1996) (for sample
size consideration in logistic regression), a random
sampling technique was employed to select two hundred
(200) cassava farmers from the communities in the study
area. Questionnaires containing questions relating to the
level of cassava output together with potential factors that
might influence the level of the output were administered
to all the selected cassava farmers. The data obtained
provided information on the socioeconomic characteristics
of the cassava farmers, output or yield of cassava and
other factors such as farming experience, farm size,
sources of labour, source of farm power, control means of
pest and disease, credit facilities, and extension contacts.
Measurement of Variables
The study used data on the technical coefficients (input-output)
of cassava production. The input factors include labour,
means of pest and disease control, credit facility,
extension services, land and socioeconomic factors. The
socioeconomic factors/variables were made up of the
farmer's gender, age, level of education, marital status,
religious background and family size. The land variable
was the total land area in acres cultivated by the
farmer, which indicates the size of the farm. Labour
calculations were based on the total number of people
employed to work on a given farm in a particular crop
season. The educational level of the farmers was
determined by the number of years spent in school. Family
size was determined by the number of people living in the
household during the crop year. The output factor was the
level of cassava yield, which is the total cassava yield in
bags per acre per crop season. For example, if the
expected yield per acre is 10 bags of cassava during the
crop year, then a yield below 5 bags was considered low,
while a yield of 5 bags and above was considered high.
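The dichotomization described in the example above can be sketched as follows. Treating the cut-off as half of the expected yield is an assumption inferred from the single example given (10 bags expected, cut-off at 5 bags); the function and parameter names are hypothetical.

```python
def classify_yield(bags_per_acre, expected_bags_per_acre=10.0):
    """Return 1 for a high yield and 0 for a low yield.

    Assumes the cut-off is half of the expected yield, as in the example
    where 10 bags are expected and 5 bags or more counts as high.
    """
    threshold = expected_bags_per_acre / 2.0
    return 1 if bags_per_acre >= threshold else 0

# Hypothetical yields in bags per acre, for illustration only.
labels = [classify_yield(y) for y in (3.0, 4.9, 5.0, 8.5)]
print(labels)
```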
Descriptive and Exploratory Data Analysis
The first stage of the analysis was used to gain an
understanding of the distributions of both continuous and
categorical variables.
For the continuous variables, a bivariate exploratory data
analysis was carried out to know if there was a relationship
between each continuous independent variable and the
categorical outcome variable.
The independent sample t-test was used as an exploratory
tool. Like any exploratory analysis, the independent
sample t-test helped to determine whether it was worth
fitting a logistic regression model for the continuous
variables. A significant difference in means indicated
that the variable was associated with the outcome and was
therefore worth carrying forward into the logistic regression model.
Variable Selection
The following steps were taken to select variables to enter
the Logistic Regression Model.
Step 1. Univariable Analyses
Univariate logistic regression was used to test the
association of each explanatory variable (one at a time)
with the outcome variable. This step helped to eliminate
insignificant variables from the model (i.e. variables that do
not show any significant association with the dependent
variable by themselves), as such variables are not likely
to be associated with the outcome variable even after all
the other variables are added to the model.
The results of this univariate analysis include: Wald and
likelihood ratio chi-square test statistics and their p-values;
parameter estimates and standard errors; and odds ratios
and their confidence limits. Each of these results was
considered.
Furthermore, since the parameters of a logistic regression
are estimated on the log-odds scale, odds ratios
were examined. The odds ratios were obtained by
exponentiating the parameter estimates. An odds ratio
greater than one (>1) indicates a positive association, less
than one (<1) indicates a negative association,
and equal to one (=1) indicates no association of the tested
variable with the outcome.
Step 2. Multivariable Analyses
The next step was to carry out the multiple logistic
regression analysis on the selected independent variables.
At the end of the multiple logistic regression analysis,
those variables found to be insignificant were not included
in the final model.
Test for Parameters
After the multiple logistic regression analysis, the
importance of each explanatory variable was assessed by
carrying out statistical tests of the significance of the
coefficients. Parameter estimates and standard errors of
the variables in the model were assessed after addition or
deletion of a variable. This was done using the Wald and
likelihood ratio test statistics and their associated p-values.
The Wald statistic
The Wald $\chi^2$ statistic was used to test the significance of
individual coefficients in the model. The statistic is
calculated as follows:
$$
\text{Wald } \chi^2 = \left(\frac{\text{coefficient}}{SE_{\text{coefficient}}}\right)^2
$$
Equation (13)
Each Wald statistic was compared to a χ2 distribution with
1 degree of freedom. Wald statistics are easy to calculate,
but their reliability is questionable, particularly for small
samples. For data that produce large estimates of the
coefficient, the standard error is often inflated, resulting in
a lower Wald statistic, and therefore the explanatory
variable may be incorrectly assumed to be unimportant in
the model. Likelihood ratio tests (see below) are generally
considered to be superior.
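As a worked check of Equation (13), the sketch below recomputes the Wald statistic for the farming experience coefficient from the rounded values reported in Table 8 (B = .104, S.E. = .032). The result differs slightly from the tabulated 10.690 because the table entries are rounded. The p-value uses the identity that a chi-square variable with 1 degree of freedom is the square of a standard normal variable; the function names are hypothetical.

```python
import math

def wald_statistic(coefficient, standard_error):
    """Equation (13): (coefficient / SE)^2."""
    return (coefficient / standard_error) ** 2

def chi_square_1df_p_value(statistic):
    """Upper-tail probability of a chi-square distribution with 1 df,
    using P(X > x) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(statistic / 2.0))

# Rounded values for FARMING_EXPERIENCE as reported in Table 8.
wald = wald_statistic(0.104, 0.032)
p_value = chi_square_1df_p_value(wald)
print(round(wald, 3), round(p_value, 4))
```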
Likelihood ratio test: The likelihood ratio test for a
particular parameter compares the likelihood (L0) of
obtaining the data when the parameter is zero with the
likelihood (L1) of obtaining the data evaluated at the MLE
of the parameter. The test statistic is calculated as follows:
$$
-2 \ln(\text{likelihood ratio}) = -2 \ln\left(\frac{L_0}{L_1}\right) = -2\left(\ln L_0 - \ln L_1\right)
$$
Equation (14)
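Equation (14) can be evaluated directly from two maximized log-likelihoods, as sketched below. The log-likelihood values are hypothetical, and the comparison uses 3.841, the standard 5% critical value of a chi-square distribution with 1 degree of freedom, which applies when a single parameter is tested.

```python
CHI_SQUARE_1DF_CRITICAL_5PCT = 3.841  # standard tabulated critical value

def likelihood_ratio_statistic(loglik_restricted, loglik_full):
    """Equation (14): -2 * (ln L0 - ln L1)."""
    return -2.0 * (loglik_restricted - loglik_full)

# Hypothetical log-likelihoods with and without one parameter.
lr_statistic = likelihood_ratio_statistic(-120.0, -115.0)
parameter_is_significant = lr_statistic > CHI_SQUARE_1DF_CRITICAL_5PCT
print(lr_statistic, parameter_is_significant)
```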
Measures of Fit for Logistic Regression Model
As already mentioned under the theoretical framework
section, before trusting the result of a model to make valid
conclusions or predict future outcomes, the model should
be checked beyond all reasonable doubt to make sure that
the model assumed is correctly specified and that the
data at hand does not conflict with assumptions made by
the model. In this work, the Hosmer and Lemeshow
Goodness of fit test was used to check whether the logistic
model assumed was correctly specified.
Model Discrimination
How well the model distinguished between the two groups
of the binary outcome was assessed using the area under
the receiver operating characteristic (ROC) curve. This
curve was obtained by plotting sensitivity against
1 − specificity. The diagonal line represents chance. A
curve that is far above the diagonal line shows that an
indicator is accurate. For a useful model, the area
varies between 0.5 and 1. An area of 0.5 corresponds to the
diagonal, attained when no discrimination exists. An area
closer to 1 represents a good indicator, whereas an area
of 1 represents a perfect indicator.
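The area under the ROC curve equals the probability that a randomly chosen high-yield case receives a higher predicted probability than a randomly chosen low-yield case, with ties counted as one half. The sketch below uses this pairwise definition in place of numerical integration of the curve; the data and the function name are hypothetical.

```python
def area_under_roc(outcomes, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly,
    counting tied scores as half a correct ranking."""
    positives = [s for y, s in zip(outcomes, scores) if y == 1]
    negatives = [s for y, s in zip(outcomes, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both outcome classes must be present")
    correct = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return correct / (len(positives) * len(negatives))

# Hypothetical outcomes and predicted probabilities, for illustration only.
auc_example = area_under_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc_example)
```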
Measures of the Predictive Power of the Model
The R2 statistic for logistic Regression was used to
measure the predictive power of the model.
Test for Model Assumptions
In the case of binary logistic regression, the fact that the
probability lies between 0 and 1 imposes a constraint.
Therefore, both the assumptions of constant variance and
normality present in multiple linear regressions are lost.
However, like every statistical test, there are certain
assumptions that needed to be met if the result of the
multiple binary logistic regression model must be useful.
The model was checked to make sure that the data did
not fail those assumptions.
Multicollinearity
Multicollinearity occurs when the model includes multiple
independent variables that are correlated with each other.
This normally occurs when some of the independent
variables are redundant. It is a type of disturbance that
may be present in the data. If this disturbance is not
eliminated from the data, any statistical inferences made
about the data may not be reliable. There are a number of
ways of detecting multicollinearity in a data set. Among
these are two collinearity diagnostics that can help
to identify multicollinearity: the value of the
tolerance and its reciprocal, called the variance inflation
factor (VIF). If the value of the tolerance is less than 0.2 or
0.1 and, simultaneously, the value of the VIF is 10 or above,
then multicollinearity is problematic.
A variable's tolerance is $1 - R^2$, where $R^2$ is obtained by
regressing that variable on the other independent variables.
Generally, a small tolerance value indicates that the variable
under consideration is almost a perfect linear combination of the
independent variables already in the equation and that it
should not be added to the model.
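The relationship between the two diagnostics can be sketched as follows. The R-squared values are hypothetical, the function names are illustrative, and the flagging rule below treats either cut-off being crossed as a warning, which is one possible reading of the thresholds given above.

```python
def tolerance(r_squared):
    """Tolerance = 1 - R^2, where R^2 comes from regressing one
    independent variable on all the others."""
    return 1.0 - r_squared

def variance_inflation_factor(r_squared):
    """VIF is the reciprocal of the tolerance."""
    return 1.0 / tolerance(r_squared)

def flags_multicollinearity(r_squared, tolerance_cutoff=0.1, vif_cutoff=10.0):
    """Flag a variable when its tolerance is below the cut-off or its VIF
    is at or above the cut-off (assumed reading of the rule of thumb)."""
    return (tolerance(r_squared) < tolerance_cutoff
            or variance_inflation_factor(r_squared) >= vif_cutoff)

# Hypothetical R^2 values, for illustration only.
for r2 in (0.50, 0.95):
    print(r2, round(tolerance(r2), 3), round(variance_inflation_factor(r2), 2),
          flags_multicollinearity(r2))
```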
Also, if the standard errors of the regression coefficients
are large, then multicollinearity is an issue.
In addition to the standard errors of the regression
coefficients, this work used the tolerance statistics and the
variance inflation factor (VIF) to test for multicollinearity.
Interaction
To test for interaction, the logistic regression analysis was
carried out with an interaction term. The p-value of the
interaction term in the regression output determined
whether or not to include it in the model. A significant
p-value led to the retention of the interaction term in the
present model.
Influential Observation and Outliers
The final step was to find out if there were observations
that do not fit the model well (outliers), have strange values
for any variable (leverage) or that have undue influence on
the model (influence).
This ended the variable selection for the final model.
Final Model
After the variable selection stage, the next step was to fit
and assess the final logistic regression model. Most of the
diagnostic steps taken during the variable selection stage
were again applied to the final model. This was done to
ensure the appropriateness, adequacy and usefulness of
the final model upon which our conclusion was based.
EMPIRICAL ANALYSIS
Descriptive Statistics/Exploratory Data Analysis
Table 1. Descriptive Analysis: Dependent (DV) and Independent (IV) Variables to be Modeled

| Variable Name | IV/DV | Valid Range | Variable Type |
|---|---|---|---|
| Cassava Yield/Outcome | DV | High, Low | Character, Categorical |
| Educational Level | IV | No Formal Education, Primary School, Secondary School, Tech - Voc. | Character, Categorical |
| Gender | IV | Male, Female | Numeric, Categorical |
| Land Owner | IV | Self, Communal, Lease, Rent | Character, Categorical |
| Family Size | IV | 1-17 | Numeric, Categorical |
| Farm Size | IV | 1-10 acres | Numeric, Continuous |
| Age | IV | 17-59 years | Numeric, Continuous |
| Farming Experience | IV | 1-29 years | Numeric, Continuous |
| Source of Labour | IV | Family, hired, communal | Numeric, Categorical |
| Pesticides | IV | Yes, No | Numeric, Categorical |
| Credit Facility | IV | Yes, No | Numeric, Categorical |
| Extension Services | IV | Yes, No | Numeric, Categorical |
Descriptive Statistics for Categorical Variables
Table 2: Descriptive Statistics

| Variable | N | Range | Minimum | Maximum |
|---|---|---|---|---|
| EDUCATIONAL LEVEL | 200 | 3 | 0 | 3 |
| FAMILY SIZE | 200 | 16 | 1 | 17 |
| OWNERSHIP OF THE FARM LAND | 200 | 4 | 0 | 4 |
| SOURCES OF LABOUR | 200 | 2 | 0 | 2 |
| CREDIT FACILITIES | 200 | 1 | 0 | 1 |
| SOURCE OF FARM POWER | 200 | 2 | 0 | 2 |
| PESTS AND DISEASES CONTROL | 200 | 0 | 0 | 0 |
| Valid N (listwise) | 200 | | | |
Descriptive Statistics for Continuous Variables
Table 3: Descriptive Statistics

| Variable | N | Minimum | Maximum | Mean | Std. Deviation |
|---|---|---|---|---|---|
| FARM SIZE OF THE RESPOND | 200 | 0 | 10 | 5.29 | 3.086 |
| AGE OF RESPONDENT | 200 | 17 | 59 | 41.55 | 8.689 |
| FARMING EXPERIENCE OF RESPONDENT | 200 | 1 | 29 | 14.51 | 8.460 |
| Valid N (listwise) | 200 | | | | |
Exploratory Data Analysis
A bivariate exploratory analysis was carried out to determine
whether there was a relationship between the continuous
independent variables and the categorical outcome
variable. The independent sample t-test was used to
explore the relationship between each of the continuous
independent variables and the outcome variable, cassava
yield.
As with any statistical test, the common assumptions of the
independent sample t-test were considered before using it.
The assumptions of the t-test for independent means
concern sampling, research design, measurement,
population distributions and population variance. The
t-test for independent means is typically considered
robust to violations of the normality assumption when the
sample size is large. This work used the Q-Q plot to see
whether the assumption of normality was satisfied before
using the t-test for independent means.
Quantile-Quantile (Q-Q) plot for continuous
independent variables
The Q-Q plots for the continuous variables are presented
in figure 1. The Q-Q plot is a graphical method for
comparing two probability distributions by plotting their
quantiles against each other. A concave departure from
the straight line in the Q-Q plot is an indication of a heavy
tailed distribution, whereas a convex departure is an
indication of a thin tail.
From the Q-Q plot in figure 1, it is evident that the
continuous independent variables are not perfectly
normally distributed. However, because the sample size is
greater than 30 (so that the Central Limit Theorem applies)
and the data were obtained randomly, the t-test was carried out.
Figure 1: QQ-plot of continuous variables
Independent Samples Test
The independent sample t-test was carried out for each
of the continuous independent variables, to determine if:
(1) there is a statistically significant difference in the mean
experience gained by cassava farmers with high
cassava yield and those with low cassava yield.
(2) there is a statistically significant difference in the mean
farm size used by cassava farmers with high cassava
yield and those with low cassava yield.
(3) there is a statistically significant difference in the mean
age of cassava farmers with high cassava yield and
those with low cassava yield.
The independent sample t-test acted as an exploratory
tool. Like all exploratory analyses, the independent
sample t-tests helped to determine whether it was worth
fitting a logistic regression model for these variables. A
significant difference in means implies that the variable is
associated with the outcome and is worth carrying forward
into the logistic regression. Below are the outputs of the
independent sample t-tests for the continuous variables
used in the final model.
The significance level in the independent sample t-test
(in table 4) for farm size in relation to cassava yield is far
below the threshold significance level of 0.05. This means
that the difference in mean farm size between cassava
farmers with high cassava yield and those with low
cassava yield is statistically significant. This further implies
that there is a relationship between farm size and cassava
yield. The logistic regression model was used to further
explore this relationship.
Similarly, the significance level in the independent
sample t-test (presented in table 6) for farming experience
is far below the threshold significance level of 0.05. This
means that the difference in mean farming experience
between cassava farmers with high yield and those with
low yield is statistically significant. This further implies that
there is a relationship between farming experience and
cassava yield. The logistic regression model was used to
further explore this relationship.
However, the significance level in the independent
sample t-test for the continuous variable age (presented in
table 5) is above the threshold significance level of 0.05.
This means that the difference in the mean age of cassava
farmers with high yield and those with low yield is not
statistically significant.
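The t statistics and confidence limits in Tables 4 to 6 can be approximately reproduced from the reported mean differences and standard errors, as sketched below using the "equal variances assumed" rows. The critical value 1.972 is an assumed approximation to the two-sided 5% point of the t distribution with 198 degrees of freedom, and small discrepancies from the tables arise because the inputs are rounded.

```python
# (mean difference, standard error of the difference) from the
# "equal variances assumed" rows of Tables 4, 5 and 6.
reported = {
    "farm size": (1.979, 0.418),
    "age": (2.353, 1.230),
    "farming experience": (8.033, 1.065),
}

T_CRITICAL = 1.972  # assumed approximate 97.5th percentile of t with 198 df

def t_summary(mean_difference, standard_error, t_critical=T_CRITICAL):
    """Return the t statistic and the lower and upper 95% confidence limits."""
    t_statistic = mean_difference / standard_error
    half_width = t_critical * standard_error
    return t_statistic, mean_difference - half_width, mean_difference + half_width

summaries = {name: t_summary(*values) for name, values in reported.items()}
for name, (t_statistic, lower, upper) in summaries.items():
    print(f"{name}: t = {t_statistic:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

The interval for age includes zero, which matches the non-significant result reported for that variable.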
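Table 4: Independent Samples Test
(F and the first Sig. column: Levene's Test for Equality of Variances; remaining columns: t-test for Equality of Means)

| Variable | Variances | F | Sig. | t | df | Sig. (2-tailed) | Mean Difference | Std. Error Difference | 95% CI of the Difference, Lower | 95% CI of the Difference, Upper |
|---|---|---|---|---|---|---|---|---|---|---|
| FARM SIZE OF THE RESPOND | Equal variances assumed | 2.130 | .146 | 4.738 | 198 | .000 | 1.979 | .418 | 1.155 | 2.803 |
| | Equal variances not assumed | | | 4.745 | 188.042 | .000 | 1.979 | .417 | 1.156 | 2.802 |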
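Table 5: Independent Samples Test
(F and the first Sig. column: Levene's Test for Equality of Variances; remaining columns: t-test for Equality of Means)

| Variable | Variances | F | Sig. | t | df | Sig. (2-tailed) | Mean Difference | Std. Error Difference | 95% CI of the Difference, Lower | 95% CI of the Difference, Upper |
|---|---|---|---|---|---|---|---|---|---|---|
| AGE OF RESPONDENT | Equal variances assumed | .070 | .792 | 1.914 | 198 | .057 | 2.353 | 1.230 | -.072 | 4.778 |
| | Equal variances not assumed | | | 1.917 | 188.006 | .057 | 2.353 | 1.228 | -.069 | 4.775 |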
Table 6: Independent Samples Test
(F and the first Sig. column: Levene's Test for Equality of Variances; remaining columns: t-test for Equality of Means)

| Variable | Variances | F | Sig. | t | df | Sig. (2-tailed) | Mean Difference | Std. Error Difference | 95% CI of the Difference, Lower | 95% CI of the Difference, Upper |
|---|---|---|---|---|---|---|---|---|---|---|
| FARMING EXPERIENCE OF RESPONDENT | Equal variances assumed | .456 | .500 | 7.544 | 198 | .000 | 8.033 | 1.065 | 5.933 | 10.133 |
| | Equal variances not assumed | | | 7.612 | 192.494 | .000 | 8.033 | 1.055 | 5.952 | 10.115 |
Variable Selection
This involves two stages of analysis, the univariate stage
and the multivariable stage.
Univariate Analysis
This is the first stage of the variable selection procedure.
Each of the variables was investigated separately using
univariate logistic regression. Table 7 gives a combined
summary of all the univariate outputs.
From table 7, all the independent variables with p-values
less than the threshold value of 0.05 were found to be
significant and hence associated with the dependent
variable. At the second stage of the variable selection
procedure, all the significant independent variables were
further simultaneously investigated using the multivariable
logistic regression.
Table 7: P-Values and Odds Ratios of Independent Variables from Univariate Analysis

| Factor | P-value (Wald test) | P-value (LR test) | Odds Ratio (OR) |
|---|---|---|---|
| Age | 0.010 | 0.009 | 1.673 |
| Educational Level | 0.00 | 0.00 | 2.256 |
| Family Size | 0.331 | 0.307 | 0.410 |
| Farm Experience | 0.001 | 0.000 | 0.031 |
| Land Owner | 0.474 | 0.474 | 0.923 |
| Farm Size | 0.00 | 0.000 | 0.174 |
| Source of Labour | 0.00 | 0.00 | 0.137 |
| Pesticides | 0.00 | 0.00 | 0.171 |
| Credit Facility | 0.001 | 0.00 | 0.329 |
| Extension Services | 0.193 | 0.191 | 0.680 |
| Gender | 0.078 | 0.078 | 1.680 |
Multivariate Analysis
The multivariate output and the goodness-of-fit
test result for the multivariate analysis are presented in
tables 8 and 9 respectively. From table 8, the Wald test
p-values for some of the independent variables
are greater than the chosen significance threshold of
0.05. The statistically significant independent variables
based on the p-values are: farming experience, educational
level, credit facility, source of labour and means of
pest and disease control. This implies that some of the
variables that entered the model during the multivariable
analysis stage were found to be insignificant. The Hosmer-
Lemeshow goodness-of-fit test (in Table 9) shows that,
at this multivariate analysis stage, the model is not a good
fit to the data, as p = 0.004 < 0.05.
In addition, following further statistical investigation of each
of the statistically significant independent variables
listed in Table 8, some of them did not enter the
final model. The reason is that further statistical
tests on these variables showed that some
of them influenced the outcome variable in such a way that
their inclusion in the model violates the assumption of ‘no
outlier’. For example, when the variable credit facility
entered the model as an independent variable with an
extremely high level of significance, the maximum of
Cook's distance exceeded one (1). It even attained the
value of two (2), which is a clear violation of the assumption
of ‘no outlier’ or influential observation required for the
validity of the result of the logistic regression model.
Some of the findings of the statistical investigations on
the independent variables are in line with reality.
For example, very few farmers have access to credit
facilities. The few that have access may tend to have large
farms, more labourers, and improved planting
materials, leading to very high cassava yield/output. On the
other hand, some cassava farmers may divert the
money received as credit to purposes other than
the cassava production for which it was obtained
(the odds ratio for credit facility is less than 1, meaning that
a higher credit grant decreases the odds of a high cassava
yield). So it was not surprising to see that when credit
facility entered the equation, the incidence of
influential/outlier observations was alarming. Nevertheless,
we still acknowledge that credit facility is a very strong
determinant of a high or low level of cassava yield (the
outcome variable) in the study area.
Table 8: Variables in the Equation

| Step 1a | B | S.E. | Wald | df | Sig. | Exp(B) |
|---|---|---|---|---|---|---|
| FARMING_EXPERIENCE | .104 | .032 | 10.690 | 1 | .001 | 1.109 |
| AGE | -.009 | .026 | .120 | 1 | .729 | .991 |
| EDUCATIONAL | | | 14.967 | 3 | .002 | |
| EDUCATIONAL(1) | -1.874 | 1.189 | 2.486 | 1 | .115 | .153 |
| EDUCATIONAL(2) | .304 | 1.329 | .052 | 1 | .819 | 1.355 |
| EDUCATIONAL(3) | -.181 | 1.250 | .021 | 1 | .885 | .834 |
| FARM_SIZE | .127 | .069 | 3.371 | 1 | .066 | 1.136 |
| SOURCES_OF_LABOUR | | | 6.532 | 2 | .038 | |
| SOURCES_OF_LABOUR(1) | -1.370 | .538 | 6.474 | 1 | .011 | .254 |
| SOURCES_OF_LABOUR(2) | -.490 | .604 | .658 | 1 | .417 | .612 |
| CREDIT(1) | -3.085 | .661 | 21.803 | 1 | .000 | .046 |
| CONTROL_MEAN(1) | -1.816 | .572 | 10.083 | 1 | .001 | .163 |
| Constant | 4.142 | 1.911 | 4.697 | 1 | .030 | 62.952 |

a. Variable(s) entered on step 1: FARMING_EXPERIENCE, AGE, EDUCATIONAL, FARM_SIZE, SOURCES_OF_LABOUR, CREDIT, CONTROL_MEAN.
Table 9: Hosmer and Lemeshow Test
Step Chi-square df Sig.
1 22.670 8 .004
Once the significant independent variables in relation to
the outcome variable had been selected, the next step was
to fit the final model for the logistic regression analysis.
Final Model
This is the last stage of the analysis. After the variables
had been selected in the first two stages of the logistic
regression modeling, the following analytical procedures
were carried out to build and confirm the final model, so
as to achieve our objective of identifying the main factors
that influence cassava productivity and determining the
effect of each factor on cassava produced in the study area.
The categorical variable coding presented in Table 10
shows that the majority of cassava farmers (119 of 200)
had no formal education.
Table 10: Categorical Variables Codings
                                                        Parameter coding
                                            Frequency   (1)     (2)     (3)
EDUCATIONAL LEVEL    NO FORMAL EDUCATION    119         1.000   .000    .000
OF RESPONDENT        PRIMARY SCHOOL         16          .000    1.000   .000
                     SECONDARY SCHOOL       55          .000    .000    1.000
                     TECH - VOC.            10          .000    .000    .000
The model coefficients are contained in the column
headed B in Table 11. A negative coefficient means that
the odds of high cassava yield decrease as the
corresponding variable increases.
The output in Table 11 helped to identify the key
determinants of an increase or decrease in cassava
productivity, that is, those independent variables that
contributed significantly to the level of cassava yield. It
also helped to determine how each determinant influenced
cassava yield.
From Table 11, it is clear that among the independent
variables that entered the final model, farm size, with a
significance level for the Wald statistic (p = 0.004) far
below the threshold of 0.05, is the main factor that
influenced the level of cassava yield. The odds ratio
(Exp(B)) associated with farm size is 1.188, which is
greater than one, meaning that an increase in farm size
increases the odds of high cassava yield. In other words,
each additional unit (acre) of farm size multiplies the
odds of high cassava yield by about 1.19. Also, from
Table 11, educational level is seen as a significant factor
in determining the level of cassava yield (overall Wald
p < 0.001). Its odds ratio (Exp(B)) is less than one for
each of the three coded levels (no formal education,
primary school and secondary school; tech-voc. is the
reference category in Table 10). We interpret this to mean
that the odds of high cassava yield are lower for farmers
with a high educational level than for those with no or a
low educational level. Lastly, from Table 11, the
interaction term, age by farming experience, is a highly
significant factor (p < 0.001) in determining the level of
cassava yield. Its odds ratio (1.003) is greater than one.
This implies that the odds of high cassava yield are higher
for older people with more farming experience than for
younger people with less farming experience. That is, the
odds of high cassava yield rise with each unit increase in
the product of age and farming experience (both in years).
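As a rough check on how the odds ratios in Table 11 relate to the coefficients, the sketch below recomputes Exp(B), an approximate 95% Wald confidence interval and the Wald statistic from B and S.E. The function name odds_ratio_summary is hypothetical, and because the published B and S.E. are rounded to three decimals, the results only approximately match the table.

```python
import math

def odds_ratio_summary(b, se, z=1.96):
    """Return (odds ratio, CI lower, CI upper, Wald statistic) for one coefficient.

    b and se are the logit coefficient and its standard error;
    z = 1.96 gives an approximate 95% Wald interval.
    """
    odds_ratio = math.exp(b)
    lower = math.exp(b - z * se)
    upper = math.exp(b + z * se)
    wald = (b / se) ** 2
    return odds_ratio, lower, upper, wald

# FARM_SIZE row of Table 11 (rounded values as published).
farm_or, farm_lo, farm_hi, farm_wald = odds_ratio_summary(0.172, 0.060)
print(f"OR={farm_or:.3f}, 95% CI=({farm_lo:.3f}, {farm_hi:.3f}), Wald={farm_wald:.2f}")
```

Because the interval is built on the logit scale and then exponentiated, it is asymmetric around the odds ratio; an interval that excludes one corresponds to a Wald test that is significant at roughly the 0.05 level.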
Table 11: Variables in the Equation
                                                                             95% C.I. for EXP(B)
                                    B        S.E.     Wald     df   Sig.   Exp(B)   Lower    Upper
Step 1a  EDUCATIONAL                                  21.921   3    .000
         EDUCATIONAL(1)             -2.099   1.217    2.975    1    .085   .123     .011     1.331
         EDUCATIONAL(2)             -.971    1.322    .539     1    .463   .379     .028     5.051
         EDUCATIONAL(3)             -.085    1.263    .005     1    .946   .918     .077     10.908
         FARM_SIZE                  .172     .060     8.357    1    .004   1.188    1.057    1.335
         AGE by FARMING_EXPERIENCE  .003     .001     26.598   1    .000   1.003    1.002    1.004
         Constant                   -.809    1.275    .402     1    .526   .445
a. Variable(s) entered on step 1: EDUCATIONAL, FARM_SIZE, AGE * FARMING_EXPERIENCE.
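To illustrate how the coefficients in Table 11 combine into a predicted probability, the following sketch evaluates the fitted logit for one farmer. It is only a sketch under stated assumptions: it assumes that HIGH yield is the modelled category, it uses the rounded published coefficients (the interaction coefficient .003 in particular is heavily rounded), and the dictionary keys, the function name predicted_probability and the example farmer are hypothetical.

```python
import math

# Rounded coefficients from Table 11; TECH - VOC. is the reference level (all dummies 0).
INTERCEPT = -0.809
B_EDUCATION = {"none": -2.099, "primary": -0.971, "secondary": -0.085, "tech_voc": 0.0}
B_FARM_SIZE = 0.172
B_AGE_BY_EXPERIENCE = 0.003

def predicted_probability(education, farm_size, age, experience):
    """Probability of a HIGH yield under the logistic model (assumed outcome coding)."""
    logit = (INTERCEPT
             + B_EDUCATION[education]
             + B_FARM_SIZE * farm_size
             + B_AGE_BY_EXPERIENCE * age * experience)
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical farmer: tech-voc. education, 5 acres, aged 40 with 10 years of experience.
p_example = predicted_probability("tech_voc", 5, 40, 10)
print(round(p_example, 3))
```

Comparing the result with the cut value of .500 used in the classification table would assign this hypothetical farmer to one of the two yield groups.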
Model Checking
Chi-square goodness of fit test for model coefficients
The test in Table 12 was used to check whether the present
(new) model, with explanatory variables included, is an
improvement over the baseline model. It uses the
chi-square test to see if there is a significant difference
between the -2 log-likelihoods (-2LL) of the baseline model
and the present model. A significantly reduced -2LL
suggests that the new model explains more of the variation
in the outcome variable than the baseline model, that is,
the new model is an improvement over the baseline model.
From Table 12, the chi-square statistic is highly
significant (chi-square = 87.395, df = 5, p < .001). This
shows that the present (new) model is significantly better
than the baseline model.
Table 12: Omnibus Tests of Model Coefficients
Chi-square df Sig.
Step 1 Step 87.395 5 .000
Block 87.395 5 .000
Model 87.395 5 .000
Outcome Classification
From the classification table presented in Table 13, the
present logistic regression model correctly classified the
outcome for 77% of the cases.
Table 13: Classification Table (a)
                                      Predicted
                                      OUTPUT OR YIELD     Percentage
         Observed                     LOW       HIGH      Correct
Step 1   OUTPUT OR YIELD    LOW       66        22        75.0
                            HIGH      24        88        78.6
         Overall Percentage                               77.0
a. The cut value is .500
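The percentages in Table 13 follow directly from the four cell counts. The sketch below recomputes them, treating HIGH yield as the positive class; the function and argument names are hypothetical labels chosen for illustration.

```python
def classification_metrics(true_low, false_high, false_low, true_high):
    """Summarise a 2x2 classification table, treating HIGH yield as the positive class.

    true_low   : observed LOW, predicted LOW
    false_high : observed LOW, predicted HIGH
    false_low  : observed HIGH, predicted LOW
    true_high  : observed HIGH, predicted HIGH
    """
    total = true_low + false_high + false_low + true_high
    return {
        "specificity": true_low / (true_low + false_high),
        "sensitivity": true_high / (true_high + false_low),
        "accuracy": (true_low + true_high) / total,
    }

# Cell counts as published in Table 13.
metrics = classification_metrics(66, 22, 24, 88)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.1f}%")
```

The row percentages in the table are therefore the specificity (LOW row) and the sensitivity (HIGH row) at the .500 cut value, and the overall percentage is the accuracy.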
Model chi-square goodness of fit test
The hypotheses tested for the model goodness of fit were
stated as:
$H_0$: The model is a good fitting model.
$H_a$: The model is not a good fitting model.
From Table 14, the Hosmer and Lemeshow goodness of fit
test shows that the model is a good fit to the data, as
$p = 0.724 > 0.05$, so $H_0$ is not rejected.
Table 14: Hosmer and Lemeshow Test
Step Chi-square df Sig.
1 5.309 8 .724
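The significance values of the Hosmer and Lemeshow tests can be recovered from the chi-square statistics and their degrees of freedom. The sketch below does this with a closed-form expression that is valid only for an even number of degrees of freedom (df = 8 here); the function name chi2_sf_even_df is hypothetical, and a statistical library would normally be used instead.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df.

    Uses the closed form exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!,
    which holds only when df is an even positive integer.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires an even, positive df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))

p_final = chi2_sf_even_df(5.309, 8)     # Table 14, final model
p_initial = chi2_sf_even_df(22.670, 8)  # Table 9, earlier model
print(f"final: {p_final:.3f}, initial: {p_initial:.3f}")
```

A large p-value means the observed and expected frequencies across the risk groups do not differ significantly, which is why a non-significant result is read as adequate fit.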
Measures of the Predictive Power of the Model
From the model summary presented in Table 15, between 35%
(Cox and Snell R Square) and 47% (Nagelkerke R Square) of
the variation in cassava yield was explained by the
logistic regression model.
Table 15: Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      186.977a            .354                   .474
a. Estimation terminated at iteration number 5 because
parameter estimates changed by less than .001.
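The two pseudo R Square values in Table 15 can be reproduced from the -2 log likelihood of the final model, the omnibus chi-square in Table 12 and the sample size of 200. The sketch below uses the standard Cox and Snell and Nagelkerke definitions; the variable names are hypothetical.

```python
import math

n = 200                     # number of farmers in the sample
neg2ll_model = 186.977      # -2 log likelihood of the final model (Table 15)
model_chi_square = 87.395   # omnibus chi-square (Table 12)

# The omnibus chi-square is the drop in -2LL from the constant-only model.
neg2ll_null = neg2ll_model + model_chi_square

# Cox & Snell: 1 - (L0 / L1)^(2/n), written in terms of the -2LL difference.
cox_snell = 1.0 - math.exp(-model_chi_square / n)
# Nagelkerke rescales Cox & Snell by its maximum attainable value.
max_cox_snell = 1.0 - math.exp(-neg2ll_null / n)
nagelkerke = cox_snell / max_cox_snell

print(f"Cox & Snell R2 = {cox_snell:.3f}, Nagelkerke R2 = {nagelkerke:.3f}")
```

The Nagelkerke value is always at least as large as the Cox and Snell value because the latter cannot reach one for a binary outcome.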
Influential Observation and Outliers
It is also good practice to find out whether there are
observations that do not fit the model well (outliers),
have unusual predictor values (leverage) or have undue
influence on the model (influential observations). In this
work, Cook’s distance, denoted $D_i$, was used to find
influential observations in the data used in the analysis.
In other words, it was used to identify points that unduly
affect the logistic regression model. The measure combines
each observation’s leverage and residual; the higher the
leverage and residual, the higher the Cook’s distance. A
$D_i$ value of more than 1 indicates that an influential
observation is present.
The maximum and minimum values of Cook’s distance for our
analysis are presented in the summary table (Table 16)
below.
From Table 16, the maximum value of $D_i$ is 0.40012,
which is less than one. Therefore, the issue of influential
observations or outliers is not alarming.
Table 16: Analog of Cook's influence statistics
N Valid 200
Missing 0
Mode .00028
Range .40010
Minimum .00003
Maximum .40012
MODEL DISCRIMINATION
How well the model distinguishes between the two groups
of the binary outcome was assessed using the area under
the receiver operating characteristic (ROC) curve.
The two basic measures of diagnostic accuracy are
sensitivity and specificity (Zhou et al. 2002). Plotting
sensitivity against 1 - specificity gives the receiver
operating characteristic (ROC) curve. The diagonal line in
the plot represents chance. The curve in Figure 2 is well
above the diagonal line. In addition, from Table 16 (Area
Under the Curve), the area under the curve (AUC) is 0.814.
This represents a high predictive accuracy of the chosen
model. In other words, an AUC value of 0.814 (which is
close to 1) indicates that the model reliably distinguished
between cassava farmers with high and low cassava
yields.
Figure 2: Receiver Operating Characteristic (ROC) curve
Table 16: Area Under the Curve
Test Result Variable(s):
Area
.814
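The AUC can be read as the probability that a randomly chosen high-yield farmer receives a higher predicted probability than a randomly chosen low-yield farmer. The sketch below computes it in that pairwise form; the function name auc_from_scores and the toy data are hypothetical and are not taken from the study.

```python
def auc_from_scores(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    labels: 1 for the positive class (e.g. HIGH yield), 0 for the negative class.
    scores: predicted probabilities; a tie between a positive and a negative counts as half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one case in each class")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical toy data, not taken from the study.
toy_labels = [0, 0, 1, 0, 1, 1]
toy_scores = [0.10, 0.40, 0.35, 0.80, 0.70, 0.90]
toy_auc = auc_from_scores(toy_labels, toy_scores)
print(round(toy_auc, 3))
```

A value of 0.5 corresponds to the chance diagonal and a value of 1 to perfect separation of the two groups.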
Multicollinearity
As already mentioned in the methodology section, among
the several ways of detecting multicollinearity in a data
set, this work used the tolerance and its reciprocal, the
variance inflation factor (VIF). A variable’s tolerance is
$1 - R^2$. If the tolerance is less than 0.2 or 0.1 and,
simultaneously, the VIF is 10 or above, then
multicollinearity is problematic.
From our analysis, the highest value of $R^2$, which is the
Nagelkerke $R^2$, is equal to 0.474. Hence the tolerance is
$1 - R^2 = 1 - 0.474 = 0.526$ and the VIF is its
reciprocal, $\mathrm{VIF} = 1/0.526 \approx 1.901$. The
tolerance is far above 0.1 and the VIF is far below 10. It
is therefore concluded that multicollinearity is not
problematic. In addition, the standard errors of the
coefficients are not excessively large. This further
suggests that multicollinearity is not an issue here.
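The tolerance and VIF arithmetic can be sketched as follows, taking the VIF as the reciprocal of the tolerance as defined above and applying the stated cut-offs (tolerance below 0.1 together with VIF of 10 or above). The function name tolerance_and_vif is hypothetical.

```python
def tolerance_and_vif(r_squared):
    """Return (tolerance, VIF), with VIF taken as the reciprocal of the tolerance."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must lie in [0, 1)")
    tolerance = 1.0 - r_squared
    return tolerance, 1.0 / tolerance

tolerance, vif = tolerance_and_vif(0.474)  # R-squared value used in the text
problematic = tolerance < 0.1 and vif >= 10
print(f"tolerance={tolerance:.3f}, VIF={vif:.3f}, problematic={problematic}")
```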
RESULTS AND DISCUSSION
A logistic regression analysis was carried out to find
the main factors influencing cassava productivity in the
Moyamba District, southern province of Sierra Leone. The
level of cassava productivity was measured by the level
(high or low) of cassava yield. At the initial stage of the
analysis, many factors were considered as potential
determinants of cassava productivity in the study area.
However, further statistical investigation showed that some
of those factors were not significant determinants of a high
or low yield of cassava. Insignificant factors were dropped
from the analysis. The variables (factors) that entered the
final model are farm size, educational level and the
interaction term, age by farming experience.
At the final stage of the analysis, the logistic regression
model was significant, as the test of the full model against
a model with only the constant was significant (chi-square
= 87.395, df = 5, p < .05). This shows that the predictors
as a set reliably distinguished between a high and a low
yield of cassava. The model explained between 35% and
47% (Cox and Snell $R^2$ and Nagelkerke $R^2$ respectively)
of the variation in the cassava yield.
The Wald criterion showed that, among the independent
variables that entered the final model, farm size, with a
significance level (for Wald) far below the threshold of
0.05, was the main factor that influenced the level of
cassava yield. The odds ratio (Exp(B)) associated with farm
size is 1.188, which is greater than one, meaning that an
increase in farm size increases the odds of high cassava
yield. In other words, each unit (acre) increase in farm
size raises the odds of high cassava yield relative to the
original farm size. This result conforms with the finding
documented by Ren et al. (2019) that farm size plays a
critical role in agricultural sustainability.
Educational level was also shown to be a significant factor
in determining the level (high or low) of cassava yield.
However, in line with the view of the majority of Sierra
Leoneans that subsistence farming is an option for those
who failed to go to school or dropped out of school, its
odds ratio (Exp(B) for EDUCATIONAL(1)) is 0.123, which is
less than one. We interpret this to mean that the odds of
high cassava yield with a unit increase in educational
level are lower than at the original level (no increase).
In other words, the odds of high cassava yield are lower
for a higher educational level. This result is similar to
that obtained by Malte Reimers and Stephan Klasen (2013),
who detected insignificant or even surprisingly negative
effects of schooling on agricultural productivity.
Finally, the interaction term, farming experience by age,
is a highly significant factor (p < 0.001) in determining
the level of cassava yield. Its odds ratio is greater than
one. This implies that the odds of high cassava yield are
higher for older people with more farming experience than
for younger people with less farming experience. In other
words, the odds of high cassava yield rise with each unit
increase in age by experience. This is not surprising, as
extension services for disseminating information on farm
technologies are not common in the rural areas, and farmers
only gain experience after long years of farming. A study
conducted by Gideon Danso-Abbeam et al. (2018) reaffirmed
the critical role of extension programmes in enhancing farm
productivity and household income.
Credit facility, though it did not enter the final model
(as it exhibited extreme behavior), was still recognized as
a significant determinant of a high level of cassava yield.
This is because the Wald p-value associated with credit
facility was significant at both the univariate and
multivariable stages of variable selection in the logistic
regression modeling. This result is supported by Ekwere et
al. (2014), in their book titled “Effects of agricultural
credit facility on the agricultural production and rural
development”. In their book, they documented that the
independent variables loan size, farm size, and inputs
explained the variation in the total value of farmers’
output.
CONCLUSION
The purpose of this work was to identify the main factors
influencing cassava productivity and to determine the
effect of each factor on cassava yield/output. The
empirical evidence showed that farm size, educational
level, and age by farming experience are the main factors
influencing cassava productivity in the study area. An