Makalah Seminar_KNM XVII_ITS

KNM XVII, 11-14 June 2014, ITS, Surabaya
MODELLING ROAD TRAFFIC ACCIDENT
DEATHS IN SOUTH AFRICA USING
GENERALIZED LINEAR MODELS
SHARON OGOLLA (1), SONY SUNARYO (2), IRHAMAH (3)
(1) Institut Teknologi Sepuluh Nopember Surabaya, sha.ogolla@gmail.com
(2) Institut Teknologi Sepuluh Nopember Surabaya, sony_s@statistika.its.ac.id
(3) Institut Teknologi Sepuluh Nopember Surabaya, irhamahn@yahoo.com
Abstract
The World Health Organization (WHO) reports that over 1.2 million people die annually
due to road accidents. The number of deaths resulting from road traffic crashes has been
projected to reach 8.4 million in the year 2020. To analyze the mortality data it is
necessary to consider the mortality rates of specific age groups, so that the groups in
which deaths are most prevalent can be identified. The model is developed within the
Generalized Linear Model (GLM) framework. Poisson regression models were formulated
first. The data were then found to be over-dispersed, that is, the variance was greater
than the mean. As a result, the Negative Binomial model was used as an alternative and
was found to fit the data well. Incremental addition of relevant explanatory variables
expanded the basic model into a comprehensive model. The analysis shows that the 35-49
age group accounts for the largest share of road traffic accident deaths, at 26.6%.
Females had an expected death rate 0.346 times that of males, which is 65.4% lower, at
all ages. The effect of being in the 35–49 year age group, compared with the 65> year
age group, is to multiply the mean death rate by 0.557, that is, to decrease the mean
death rate by an estimated 44.3%, for both genders.
Keywords : Generalized Linear Models, Negative Binomial Regression, Poisson
Regression, South Africa
1. Introduction
Generalized linear models play a very important role in statistical inference. They
provide a mathematical way of quantifying the relationship between a response variable
and a set of independent variables, and they cover a general class of statistical models.
Originally introduced by Nelder and Wedderburn [1], the generalized linear model (GLM) is
an extension of the classical linear model. It includes linear regression models,
analysis of variance models, logistic regression models, Poisson regression models,
zero-inflated Poisson regression models, negative binomial regression models and
log-linear models, as well as many other models.
Several studies have applied generalized linear models to real problems. Umar et al. [2]
carried out a study to determine the impact of running headlights on conspicuity-related
motorcycle accidents in Malaysia. A generalized linear model with Poisson distribution
and log link was used to describe the frequency of conspicuity-related motorcycle
accidents. The explanatory variables were time trends, changes in the recording system,
the effect of fasting during the month of Ramazan, and Balik Kampong, a religious holiday
unique to the multi-cultural society of Malaysia. To overcome over-dispersion in the
data, the quasi-likelihood technique was used. Russo et al. [3] used Poisson regression
models in Brazil to model the number of deaths in Santo Angelo. In health, Jahangeer et
al. [4] used generalized linear models to analyze the factors influencing exclusive
breastfeeding.
Studies by Odero et al. [5] and Balogun et al. [6] have shown that road traffic accidents
are a leading cause of death among adolescents and young adults. There is evidence that
minimum safety standards, improved vehicle crashworthiness, seat belt laws and reduced
alcohol use can substantially reduce deaths on the road (Leon [7]). In developing
countries, including South Africa, the situation differs from that in developed
countries: road traffic accidents are increasing with time, and mortality due to road
traffic accidents is also on the rise (Asogwa [8]). Peden et al. [9] reported that, when
population figures are taken into account, developing countries in Sub-Saharan Africa
have the highest frequency of accidents worldwide.
In South Africa, 3,280,931 deaths were recorded between 2001 and 2006, of which 9.5%
were due to non-natural causes [10]. Road traffic accident deaths comprised 9.3% of
non-natural deaths. Data from the National Injury Mortality Surveillance System
(NIMSS) showed that in 2005, transport-related injuries accounted for 74.3% of all
accidental (or unintentional) deaths [11]. Analysis of the injury burden in South Africa
by Norman et al. [12] showed that the age-standardized road traffic injury mortality
rates for South Africa were about double the global rate for both males and females.
This study aims to provide scientific insight into Generalized Linear Models and to
create a platform for future studies that model the number of deaths using Generalized
Linear Models.
2. Literature Review
A. Generalized Linear Models
Generalized linear models are a natural generalization of classical linear models that
allow the mean of a population to depend on a linear predictor through a non-linear link
function. This allows the response probability distribution to be any member of the
exponential family of distributions.
A generalized linear model (or GLM) consists of three components:
1. A random component, which specifies the conditional distribution of the response
variable $Y_i$, given the explanatory variables $x_{i1}, \dots, x_{ip}$.
2. A linear function of the regression variables, called the linear predictor,
$\eta_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}$ (1)
on which the expected value $\mu_i$ of $Y_i$ depends.
3. An invertible link function, $g(\mu_i) = \eta_i$ (2)
This transforms the expectation of the response to the linear predictor. The inverse of the
link function is sometimes called the mean function
$\mu_i = g^{-1}(\eta_i)$ (3)
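The three components can be illustrated with a minimal Python sketch, assuming a log link. The coefficient and covariate values below are hypothetical and chosen only so that the arithmetic is easy to follow; they are not estimates from this study.

```python
import math

def linear_predictor(beta, x):
    """eta = beta_0 + beta_1*x_1 + ... + beta_p*x_p (x excludes the intercept)."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))

def log_link(mu):
    """Link function g(mu) = ln(mu)."""
    return math.log(mu)

def inverse_log_link(eta):
    """Mean function g^{-1}(eta) = exp(eta)."""
    return math.exp(eta)

# Hypothetical coefficients and covariates, for illustration only.
beta = [0.5, -1.0, 0.25]
x = [1.0, 2.0]

eta = linear_predictor(beta, x)   # 0.5 - 1.0*1.0 + 0.25*2.0
mu = inverse_log_link(eta)        # mean implied by the linear predictor
print(eta, mu, log_link(mu))      # applying the link to mu recovers eta
```

Applying the link to the mean returns the linear predictor, which is the sense in which the link function is invertible.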
B. Poisson Regression Model
The Poisson regression model is a specific type of GLM and is non-linear. Poisson
regression analysis is a technique used to model dependent variables that describe count
data [13]. The Poisson regression model has often been applied to estimate standardized
mortality and incidence ratios in cohort studies and in ecological investigations.
The primary equation of the model is
$P(Y_i = y_i) = \dfrac{e^{-\lambda_i} \lambda_i^{y_i}}{y_i!}, \quad y_i = 0, 1, 2, \dots$ (4)
The most common formulation of this model is the log-linear specification
$\ln \lambda_i = x_i'\beta$ (5)
where $x_i = (1, x_{i1}, \dots, x_{ip})'$. The expected number of events per period is given by
$E(y_i \mid x_i) = \lambda_i = e^{x_i'\beta}$ (6)
The parameters of the Poisson regression model can be estimated using the maximum
likelihood method, with the likelihood function given by:
$L(\beta) = \prod_{i=1}^{n} P(Y_i = y_i) = \prod_{i=1}^{n} \dfrac{e^{-\lambda_i} \lambda_i^{y_i}}{y_i!}$ (7)
and the ln-likelihood function equal to:
$\ln L(\beta) = \sum_{i=1}^{n} y_i x_i'\beta - \sum_{i=1}^{n} e^{x_i'\beta} - \sum_{i=1}^{n} \ln(y_i!)$ (8)
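As a check on the three-sum form of the ln-likelihood, the following Python sketch evaluates it and compares the result with the sum of the log Poisson probabilities computed observation by observation. The design matrix, counts and coefficients are made up for illustration and are not the South African data.

```python
import math

def poisson_loglik(beta, X, y):
    """Three-sum form: sum(y_i * x_i'beta) - sum(exp(x_i'beta)) - sum(ln(y_i!))."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        eta = sum(b * v for b, v in zip(beta, x_i))   # x_i' beta
        total += y_i * eta - math.exp(eta) - math.lgamma(y_i + 1)
    return total

def poisson_logpmf(y_i, lam):
    """ln of the Poisson probability for a single observation."""
    return -lam + y_i * math.log(lam) - math.lgamma(y_i + 1)

# Made-up design matrix (first column is the intercept) and counts.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [2, 3, 7]
beta = [0.7, 0.6]

direct = sum(
    poisson_logpmf(y_i, math.exp(sum(b * v for b, v in zip(beta, x_i))))
    for x_i, y_i in zip(X, y)
)
print(poisson_loglik(beta, X, y), direct)   # the two values should agree
```

Here `math.lgamma(y + 1)` is used for $\ln(y!)$, which avoids overflow for large counts.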
C. Solving For Over-dispersion In Poisson Regression
Over-dispersion may be modeled using compound Poisson distributions. With this
model the count y is Poisson distributed with mean λ, but λ is itself a random variable
which causes the variation to exceed that expected if the Poisson mean were fixed [14].
Thus suppose λ is regarded as a positive continuous random variable with probability
function g(λ). Given λ, the count is distributed as P(λ). Then the probability function of y
is
$f(y) = \int_0^{\infty} \dfrac{e^{-\lambda} \lambda^{y}}{y!} \, g(\lambda) \, d\lambda$ (9)
A convenient choice for g(λ) is the gamma probability function G(μ, ν), implying (9) is
NB (μ, κ) where κ = 1/ν. In other words the negative binomial arises when there are
different groups of risks, each group characterized by a separate Poisson mean, and with
the means distributed according to the gamma distribution [14].
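The Poisson-gamma mixture can be checked numerically. The sketch below approximates the mixing integral with a midpoint rule and compares it with the closed-form negative binomial probability. It assumes that G(μ, ν) denotes a gamma distribution with mean μ and shape ν, and the values μ = 5 and ν = 2 are illustrative only.

```python
import math

def gamma_pdf(lam, mu, nu):
    """Gamma density with mean mu and shape nu (rate nu/mu); an assumed reading of G(mu, nu)."""
    rate = nu / mu
    return math.exp(nu * math.log(rate) + (nu - 1) * math.log(lam)
                    - rate * lam - math.lgamma(nu))

def poisson_pmf(y, lam):
    """Poisson probability of count y given mean lam."""
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))

def mixture_pmf(y, mu, nu, upper=100.0, steps=20_000):
    """Midpoint-rule approximation of the integral of Poisson(y | lam) * g(lam) d lam."""
    h = upper / steps
    total = 0.0
    for k in range(steps):
        lam = (k + 0.5) * h
        total += poisson_pmf(y, lam) * gamma_pdf(lam, mu, nu)
    return h * total

def negbin_pmf(y, mu, kappa):
    """Closed-form NB(mu, kappa) probability with kappa = 1/nu."""
    r = 1.0 / kappa
    return math.exp(math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
                    + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))

mu, nu = 5.0, 2.0   # illustrative values only
for y in range(4):
    print(y, mixture_pmf(y, mu, nu), negbin_pmf(y, mu, 1.0 / nu))
```

The two columns agree to within the accuracy of the numerical integration, which is the content of the statement that a gamma-mixed Poisson count is negative binomial.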
D. Negative Binomial Regression Model
The negative binomial distribution can be derived in many ways. Twelve derivations are
known; among them, it can be obtained as a Poisson-gamma mixture distribution, as a
compound Poisson distribution, as a sequence of Bernoulli trials, or as the inverse of
the binomial distribution [15].
When data are over-dispersed, the common way to account for this is to use a negative
binomial model [15]. Negative binomial regression is a type of generalized linear model
in which the dependent variable Y is a count of the number of times an event occurs.
Statistical comparisons between Poisson and negative binomial regression models
confirm that in most cases the negative binomial represents observed counts better than
the Poisson [15]. Hilbe [15] gave the parameterization of the negative binomial model as
$f(y; \mu, \alpha) = \dfrac{\Gamma(y + 1/\alpha)}{\Gamma(y + 1)\,\Gamma(1/\alpha)} \left( \dfrac{1}{1 + \alpha\mu} \right)^{1/\alpha} \left( \dfrac{\alpha\mu}{1 + \alpha\mu} \right)^{y}$ (10)
where $\mu$ is the mean of $y$ and $\alpha$ is the heterogeneity parameter. Hilbe [15]
derives this parameterization as a Poisson-gamma mixture, or alternatively as the
number of failures before the $(1/\alpha)$th success, though we will not require $1/\alpha$
to be an integer. The negative binomial model is estimated using the Newton-Raphson
method.
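A short Python sketch can confirm the role of the two parameters in this parameterization: the probabilities sum to one, the mean is μ, and the variance is μ + αμ², which exceeds the Poisson variance μ whenever α > 0. The values μ = 4 and α = 0.5 are illustrative assumptions.

```python
import math

def negbin_pmf(y, mu, alpha):
    """Negative binomial probability in the (mu, alpha) form, computed on the log scale."""
    inv = 1.0 / alpha
    log_p = (math.lgamma(y + inv) - math.lgamma(y + 1) - math.lgamma(inv)
             + inv * math.log(1.0 / (1.0 + alpha * mu))
             + y * math.log(alpha * mu / (1.0 + alpha * mu)))
    return math.exp(log_p)

mu, alpha = 4.0, 0.5   # illustrative values only
probs = [negbin_pmf(y, mu, alpha) for y in range(500)]   # tail beyond 500 is negligible here

total = sum(probs)
mean = sum(y * p for y, p in enumerate(probs))
var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
print(total, mean, var)   # compare var with mu + alpha * mu**2
```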

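The likelihood function of the negative binomial model is
$L(\beta, \alpha) = \prod_{i=1}^{n} \dfrac{\Gamma(y_i + 1/\alpha)}{\Gamma(1/\alpha)\, y_i!} \left( \dfrac{1}{1 + \alpha\mu_i} \right)^{1/\alpha} \left( \dfrac{\alpha\mu_i}{1 + \alpha\mu_i} \right)^{y_i}$ (11)
From equation (11) the log-likelihood is formed as
$\ell(\beta, \alpha) = \ln L(\beta, \alpha)$ (12)
where $\mu_i = \exp\{x_i'\beta\}$.
If equation (11) is substituted into (12), the log-likelihood becomes
$\ell(\beta, \alpha) = \sum_{i=1}^{n} \ln\Gamma(y_i + 1/\alpha) - \sum_{i=1}^{n} \ln\Gamma(1/\alpha) - \sum_{i=1}^{n} \ln(y_i!) + \sum_{i=1}^{n} y_i \ln(\alpha\mu_i) - \sum_{i=1}^{n} (y_i + 1/\alpha) \ln(1 + \alpha\mu_i)$ (13)
To maximize the function in equation (13), the first partial derivatives with respect to
the regression coefficients are set equal to zero:
$\dfrac{\partial \ell}{\partial \beta_j} = \sum_{i=1}^{n} \dfrac{(y_i - \mu_i)\, x_{ij}}{1 + \alpha\mu_i} = 0, \quad j = 0, 1, \dots, p$ (14)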
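The first derivative follows from the chain rule. As a sketch, assume the log link $\mu_i = \exp\{x_i'\beta\}$, hold $\alpha$ fixed, and write $\ell_i = y_i \ln(\alpha\mu_i) - (y_i + 1/\alpha)\ln(1 + \alpha\mu_i) + \text{const}$ for the contribution of observation $i$ to the log-likelihood:

```latex
\begin{align*}
\frac{\partial \ell_i}{\partial \mu_i}
  &= \frac{y_i}{\mu_i} - \frac{\alpha\,(y_i + 1/\alpha)}{1 + \alpha\mu_i}
   = \frac{y_i - \mu_i}{\mu_i\,(1 + \alpha\mu_i)}, \\
\frac{\partial \mu_i}{\partial \beta_j}
  &= \mu_i\, x_{ij}, \\
\frac{\partial \ell}{\partial \beta_j}
  &= \sum_{i=1}^{n} \frac{\partial \ell_i}{\partial \mu_i}\,
     \frac{\partial \mu_i}{\partial \beta_j}
   = \sum_{i=1}^{n} \frac{(y_i - \mu_i)\, x_{ij}}{1 + \alpha\mu_i}.
\end{align*}
```

Setting the last expression to zero for each $j$ gives the estimating equations for $\beta$; as $\alpha \to 0$ they reduce to the Poisson score equations $\sum_i (y_i - \mu_i) x_{ij} = 0$.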
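The next step is to calculate the second partial derivatives of the log-likelihood
function in order to form the Hessian matrix. The second partial derivatives of the
log-likelihood function with respect to the regression coefficients $\beta$ are as follows:
$\dfrac{\partial^2 \ell}{\partial \beta_j \, \partial \beta_k} = -\sum_{i=1}^{n} \dfrac{\mu_i (1 + \alpha y_i)}{(1 + \alpha\mu_i)^2} \, x_{ij} x_{ik}, \quad j, k = 0, 1, \dots, p$
where $x_{i0} = 1$.
Based on the results of the second partial derivatives above, the Hessian matrix is
obtained as follows:
$H(\beta) = \begin{pmatrix} \dfrac{\partial^2 \ell}{\partial \beta_0^2} & \cdots & \dfrac{\partial^2 \ell}{\partial \beta_0 \, \partial \beta_p} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 \ell}{\partial \beta_p \, \partial \beta_0} & \cdots & \dfrac{\partial^2 \ell}{\partial \beta_p^2} \end{pmatrix}$ (15)
which is a matrix of size $(p+1) \times (p+1)$.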
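The Hessian matrix is used in the iterative Newton-Raphson procedure, which is run until
the solution of the log-likelihood equations converges; the converged values are used as
the estimates of the parameters. The Newton-Raphson algorithm for the negative binomial
model proceeds as follows:
1. Determine the initial parameter estimate $\hat\beta^{(0)}$ for iteration $m = 0$.
2. Form the gradient vector
$g(\hat\beta^{(m)}) = \left( \dfrac{\partial \ell}{\partial \beta_0}, \dfrac{\partial \ell}{\partial \beta_1}, \dots, \dfrac{\partial \ell}{\partial \beta_p} \right)^{T}$
3. Form the Hessian matrix $H(\hat\beta^{(m)})$.
4. Substitute the value $\hat\beta^{(0)}$ into the elements of the vector $g$ and the
Hessian matrix $H$ to obtain the vector $g(\hat\beta^{(0)})$ and the Hessian matrix $H(\hat\beta^{(0)})$.
5. Perform iterations starting from $m = 0$ using the equation
$\hat\beta^{(m+1)} = \hat\beta^{(m)} - H^{-1}(\hat\beta^{(m)}) \, g(\hat\beta^{(m)})$
6. Update the iteration from $m$ to $m + 1$ and repeat until the parameter estimates
converge, that is, until $\lVert \hat\beta^{(m+1)} - \hat\beta^{(m)} \rVert \le \varepsilon$ for a small tolerance $\varepsilon > 0$.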
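The six steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the procedure used by any particular statistical package: the heterogeneity parameter α is held fixed rather than estimated, the data are made up, and the gradient and Hessian are those of the negative binomial log-likelihood with log link, namely $\sum (y_i - \mu_i) x_{ij} / (1 + \alpha\mu_i)$ and $-\sum \mu_i (1 + \alpha y_i) x_{ij} x_{ik} / (1 + \alpha\mu_i)^2$.

```python
import math

def score_and_hessian(beta, X, y, alpha):
    """Gradient vector and Hessian matrix of the NB log-likelihood with respect to beta."""
    p = len(beta)
    g = [0.0] * p
    H = [[0.0] * p for _ in range(p)]
    for x_i, y_i in zip(X, y):
        mu = math.exp(sum(b * v for b, v in zip(beta, x_i)))
        w1 = (y_i - mu) / (1.0 + alpha * mu)
        w2 = mu * (1.0 + alpha * y_i) / (1.0 + alpha * mu) ** 2
        for j in range(p):
            g[j] += w1 * x_i[j]
            for k in range(p):
                H[j][k] -= w2 * x_i[j] * x_i[k]
    return g, H

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z

def newton_raphson(X, y, alpha, tol=1e-10, max_iter=50):
    """Iterate beta <- beta - H^{-1} g until the largest update is below tol."""
    beta = [math.log(sum(y) / len(y))] + [0.0] * (len(X[0]) - 1)   # step 1
    for _ in range(max_iter):
        g, H = score_and_hessian(beta, X, y, alpha)                # steps 2 to 4
        step = solve(H, g)                                         # H^{-1} g
        new_beta = [b - s for b, s in zip(beta, step)]             # step 5
        if max(abs(nb - b) for nb, b in zip(new_beta, beta)) < tol:  # step 6
            return new_beta
        beta = new_beta
    return beta

# Made-up counts; the first column of X is the intercept. alpha is an assumed fixed value.
X = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [1.0, 1.0], [1.0, 2.0], [1.0, 2.0]]
y = [3, 5, 6, 9, 14, 19]
beta_hat = newton_raphson(X, y, alpha=0.1)
print(beta_hat)
```

At convergence the gradient evaluated at `beta_hat` is numerically zero, which is the condition stated by the first-derivative equation. In practice α is estimated jointly with β, which adds one row and column to the gradient and Hessian.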
3. Analysis And Results
A. Descriptive Statistics
From 2001 to 2006, a total of 28,890 people were killed in road traffic accidents in
South Africa, an average of 4,815 deaths per year. Figure 1 shows the age distribution
of people killed in road traffic accidents in South Africa from 2001 to 2006. The figure
makes it clear that youths and middle-aged people are the most prone to road traffic
accident deaths, and that males are the major victims. The highest death rate for
2001-2006 is in the male 35-49 age group, at 28.62 deaths per 100,000 population,
followed closely by the male 25-34 age group, at 25.69 deaths per 100,000 population.
The lowest male death rate, 4.42 deaths per 100,000 population, is in the 0-14 age
group.
Whilst male death rates peak in the 35–49 age group (as do the death rates for both
sexes combined), female death rates show a roughly linear increase from age group
0–14 to age group 65 years and above. Thus among females, the elderly experienced the
highest death rates due to road traffic accidents. This is notable because most people in
this age group are pensioners and retirees who do not travel regularly. Figure 1 also
shows that male road traffic accident death rates increase rapidly from infancy up to
the 35-49 age group and then decrease. Thus the risk of dying in a road traffic accident
in South Africa peaks at ages 35-49.
Figure 1 Age Distribution of people killed in road traffic accidents
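The descriptive figures quoted above can be rechecked with a few lines of Python. The rates are the values plotted in Figure 1; the male-to-female rate ratio is an additional summary computed here for illustration and is not reported in the paper.

```python
# Death rates per 100,000 population, as plotted in Figure 1.
age_groups = ["0-14", "15-24", "25-34", "35-49", "50-64", "65>"]
female = [3.35, 4.78, 5.75, 7.77, 7.37, 10.05]
male = [4.42, 14.11, 25.69, 28.62, 21.92, 19.6]

total_deaths, years = 28890, 6   # 2001 to 2006 inclusive
print("average deaths per year:", total_deaths / years)

peak_male = age_groups[male.index(max(male))]
peak_female = age_groups[female.index(max(female))]
print("peak male group:", peak_male, "| peak female group:", peak_female)

# Male-to-female rate ratio in each age group (illustrative summary).
for group, m, f in zip(age_groups, male, female):
    print(group, round(m / f, 2))
```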
B. Data Analysis
The deviance of the final Poisson model was 1375.22 on 64 degrees of freedom. The scaled
deviance divided by its degrees of freedom was 21.49, far greater than 1, which
indicates over-dispersion (Table 1). The Negative Binomial model was therefore fitted
instead, and it gave a good fit for all the models considered. For the best model, with
all the variables included, the deviance of the Negative Binomial model was 71.96 on 64
degrees of freedom, and the scaled deviance and Pearson chi-square divided by their
degrees of freedom were close to 1 (1.12 and 1.10 respectively), indicating a good fit
(Table 3). The fit improves as explanatory variables are added. Age, population and
gender were all significant (p-value < 0.05). However, the individual age groups 25-34,
35-49 and 50-64 were not significant, since their p-values were > 0.05.
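The over-dispersion check is a single division: the deviance over its degrees of freedom. A small Python sketch using the deviances reported for the two full models (1375.2179 and 71.9595, each on 64 degrees of freedom):

```python
def dispersion(deviance, df):
    """Deviance divided by its degrees of freedom; values far above 1 suggest over-dispersion."""
    return deviance / df

poisson_ratio = dispersion(1375.2179, 64)   # full Poisson model
negbin_ratio = dispersion(71.9595, 64)      # full Negative Binomial model
print(round(poisson_ratio, 4), round(negbin_ratio, 4))
```

A ratio near 21 for the Poisson fit against a ratio near 1 for the Negative Binomial fit is the basis for preferring the latter.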
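Likelihood ratio statistics for Type I and Type III analyses were computed.
Table 2 shows the Type I analysis, which tests each explanatory variable sequentially,
under the assumption that the previously entered explanatory variables are included in
the model. The fit statistic listed in Table 2 is twice the log-likelihood (compare
2 × 150915.9792 in Table 3), so larger values indicate a better fit. With population
entered last, the statistic has risen by 146.2 in total, from 301685.765 for the
intercept-only model to 301831.96; the population term itself accounts for 6.96 of this
increase.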
Data shown in Figure 1 (deaths per 100,000 population, by age group and gender):
Gender   0-14   15-24   25-34   35-49   50-64   65>
Female   3.35   4.78    5.75    7.77    7.37    10.05
Male     4.42   14.11   25.69   28.62   21.92   19.60

Table 1. Poisson regression model information
Distribution          Poisson
Link Function         Log
Dependent Variable    deaths
Offset Variable       l_popn

Criteria for assessing goodness of fit
Criterion                   DF    Value          Value/DF
Deviance                    64    1375.2179      21.4878
Scaled Deviance             64    1375.2179      21.4878
Pearson Chi-Square          64    1467.7835      22.9341
Scaled Pearson X2           64    1467.7835      22.9341
Log Likelihood                    150368.9831
Full Log Likelihood               -961.1206
AIC (smaller is better)           1938.2411
AICC (smaller is better)          1940.5268
BIC (smaller is better)           1956.4545
The increase from the intercept-only model to the full model is highly significant
(p-value < 0.05) as judged against the $\chi^2$ distribution. With gender already in the
model, the inclusion of age brings the fit statistic up to 301825.00, which is 139.24
above the intercept-only value and 68.53 above the gender-only value (Table 2). This
indicates a much improved fit, achieved at a cost of five degrees of freedom, since there
are five parameters associated with categorical age. The statistic of 68.53 has p-value
<0.0001 on the $\chi^2_5$ distribution, indicating that age is highly significant.
Table 2. LR statistics for Type I and Type III analysis
                                         Type I                Type III
Source      df   2 x log-likelihood     ∆        p-value      ∆        p-value
Intercept        301685.765
Age         5    301825.00              68.53    <0.0001      73.89    <0.0001
Gender      1    301756.46              70.70    <0.0001      92.16    <0.0001
Popl        1    301831.96              6.96     0.0083       6.96     0.0083
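The Type I increments in Table 2 can be rechecked by differencing the fit statistics. The sketch below assumes that the terms were entered in the order gender, age, population, which is the order implied by the tabulated increments (70.70, 68.53, 6.96). For a term with one degree of freedom, the chi-square upper-tail probability can be computed with the standard library alone, since a chi-square variable with one degree of freedom is the square of a standard normal variable.

```python
import math

# Fit statistic after each term is entered, as listed in Table 2.
# The entry order gender -> age -> popl is an assumption inferred from the increments.
fit = {"intercept": 301685.765, "gender": 301756.46, "age": 301825.00, "popl": 301831.96}

delta_gender = fit["gender"] - fit["intercept"]
delta_age = fit["age"] - fit["gender"]
delta_popl = fit["popl"] - fit["age"]
print(round(delta_gender, 2), round(delta_age, 2), round(delta_popl, 2))

def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

# p-value for the population term; compare with the tabulated 0.0083.
print(round(chi2_sf_df1(delta_popl), 4))
```

The differences reproduce the ∆ column up to rounding in the tabulated fit statistics.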
The Type III analysis tests each explanatory variable under the assumption that all other
variables are included in the model. Gender, in the presence of the other variables, has
∆ = 92.16 with p-value <0.0001. Age, in the presence of the other variables, has
∆ = 73.89 with p-value <0.0001 (the same p-value as in the Type I analysis). The value
for population is unchanged at 6.96, because population is the last term entered in the
Type I sequence.
The Akaike Information Criterion (AIC) was used to select the best model. Table 4 shows
that each explanatory variable added to the model improves the fit; the best model is the
one with the smallest AIC.

Table 3. Negative Binomial model information
Distribution          Negative Binomial
Link Function         Log
Dependent Variable    deaths
Offset Variable       l_popn

Criteria for assessing goodness of fit
Criterion                   DF    Value          Value/DF
Deviance                    64    71.9595        1.1244
Scaled Deviance             64    71.9595        1.1244
Pearson Chi-Square          64    70.1381        1.0959
Scaled Pearson X2           64    70.1381        1.0959
Log Likelihood                    150915.9792
Full Log Likelihood               -414.1245
AIC (smaller is better)           846.2490
AICC (smaller is better)          849.1522
BIC (smaller is better)           866.7389
The fit improves as the number of explanatory variables increases. Since the scaled
deviance divided by its degrees of freedom is close to 1, there is no indication of
over-dispersion, and the Negative Binomial model was therefore chosen as the best model.
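The AIC values in Tables 1 and 3 are consistent with the usual definition, AIC = −2 × (full log-likelihood) + 2 × (number of estimated parameters). The Python sketch below backs out the number of parameters implied by each reported pair; the interpretation of those counts (intercept, five age terms, gender and population, plus one dispersion parameter for the Negative Binomial) is an inference, not something stated in the tables.

```python
def aic(full_log_likelihood, n_params):
    """AIC = -2 * log-likelihood + 2 * (number of estimated parameters)."""
    return -2.0 * full_log_likelihood + 2.0 * n_params

def implied_params(aic_value, full_log_likelihood):
    """Number of parameters implied by a reported AIC and full log-likelihood."""
    return (aic_value + 2.0 * full_log_likelihood) / 2.0

k_poisson = implied_params(1938.2411, -961.1206)   # values from Table 1
k_negbin = implied_params(846.2490, -414.1245)     # values from Table 3
print(round(k_poisson, 3), round(k_negbin, 3))

# Difference in AIC between the two full models, recomputed from the log-likelihoods.
print(aic(-961.1206, 8) - aic(-414.1245, 9))
```

The extra parameter in the Negative Binomial fit is more than offset by the gain in log-likelihood, which is why its AIC is so much smaller.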
Table 4. Comparison between the Poisson and Negative Binomial models with their
respective AIC and deviance values

Poisson Regression Model
No.  Explanatory Variables       AIC         Scaled Deviance   Value/DF
1    Age                         8833.86     8274.84           125.37
2    Gender                      6428.59     5877.57           83.97
3    Population                  13020.65    12469.62          178.14
4    Age & Gender                2026.19     1465.17           22.54
5    Age, Gender & Population    1938.24     1375.22           21.49

Negative Binomial Regression Model
No.  Explanatory Variables       AIC         Scaled Deviance   Value/DF
1    Age                         952.82      74.61             1.13
2    Gender                      909.74      73.29             1.05
3    Population                  980.37      76.36             1.09
4    Age & Gender                851.21      72.64             1.12
5    Age, Gender & Population    846.25      71.96             1.12

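By choosing the smallest AIC, model number 5 is the best, with an AIC value of
846.25. The fitted model has the form
$\ln \hat\mu = \ln(\text{popn}) + \hat\beta_0 + \hat\beta_1 x_1 + \dots + \hat\beta_5 x_5 + \hat\beta_6 x_6 + \hat\beta_7 x_7$
where $x_1, \dots, x_5$ represent the age groups 0-14, 15-24, 25-34, 35-49 and 50-64
respectively, $x_6$ represents the female gender and $x_7$ represents population.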
4. Conclusion
This study has shown that, for over-dispersed data, the Negative Binomial model is
better than the Poisson regression model. The Poisson distribution has the special
property that its mean is equal to its variance, and over-dispersion means that the
variance is greater than the mean. The Negative Binomial regression model is more
flexible, as it allows the variance to be greater than the mean. The results also
revealed that most of the people who die in road accidents in South Africa are male.
Females had an expected death rate 0.346 times that of males, which is 65.4% lower, at
all ages. In comparison with the 65> age group, the estimated death rate was lower by
89.6% in the 0-14 age group, 73.3% in the 15-24 age group, 54.9% in the 25-34 age
group, 44.3% in the 35-49 age group and 38.8% in the 50-64 age group, for both genders.
It was also found that an increase in the population was associated with an estimated
increase in the death rate from road traffic accidents; thus the larger the population,
the greater the number of deaths. It can also be noted that accident deaths increase as
the years go by, and thus more care and policies should be provided to reduce road
traffic accident deaths in South Africa.
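With a log link, each percentage reduction quoted above corresponds to a multiplicative rate ratio $e^{\beta} = 1 - \text{reduction}$, so the implied coefficient is $\beta = \ln(1 - \text{reduction})$. The Python sketch below performs this back-calculation from the percentages in the text; the resulting coefficients are derived values for illustration, not estimates copied from the fitted model output.

```python
import math

# Estimated percentage reductions in the mean death rate quoted in the text,
# relative to males (for gender) or to the 65> age group (for age).
reductions = {
    "female": 65.4,
    "age 0-14": 89.6,
    "age 15-24": 73.3,
    "age 25-34": 54.9,
    "age 35-49": 44.3,
    "age 50-64": 38.8,
}

rate_ratios = {}
for term, pct in reductions.items():
    ratio = 1.0 - pct / 100.0    # multiplicative effect exp(beta)
    beta = math.log(ratio)       # implied coefficient on the log scale
    rate_ratios[term] = ratio
    print(f"{term:10s} rate ratio = {ratio:.3f}  implied beta = {beta:.3f}")
```

For the 35-49 age group this gives a rate ratio of 0.557, matching the multiplier quoted in the abstract.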
REFERENCES
[1] Nelder, J.A. and Wedderburn, R.W.M. (1972). “Generalized linear models”. Journal
of the Royal Statistical Society, Series A, 135(3), 370-384.
[2] Radin Umar, R.S., Norghani, M., Hussain, H., Shahrom, B. and Hamdan, M.M.
(1998). Research Report 1, National Road Safety Council Malaysia, Kuala Lumpur.
[3] Russo, S., Flender, D. and da Silva, G.F. (2012). “Poisson Regression Models for
Count Data: Use in the Number of Deaths in the Santo Angelo (Brazil)”. Journal of
Basic & Applied Sciences, 8, 266-269.
[4] Jahangeer, C., Khan, N.M. and Khan, M.H.M. (2009). “Analyzing the factors
influencing exclusive breastfeeding using the Generalized Poisson Regression
model”. World Academy of Science, Engineering and Technology, Vol. 3, 2009-11-29.
[5] Odero, W., Garner, P. and Zwi, A. (1997). “Road traffic injuries in the developing
countries: a comprehensive review of epidemiological studies”. Journal of Tropical
Medicine and International Health, 2(5), 445-460.
[6] Balogun, J.A. and Abereoje, O.K. (1992). “Pattern of road traffic accident cases in a
Nigeria University teaching hospital between 1987 and 1990”. J. Trop. Med. Hyg.,
95(1):239.
[7] Leon, S.R. (1996). “Reducing death on the road: the effects of minimum safety
standards, publicized crash tests, seat belts and alcohol”. Am. J. Public Health,
86(1):31-3.
[8] Asogwa, S.E. (1992). “Road traffic accidents in Nigeria: A review and a
reappraisal”. Accident Analysis and Prevention, 23(5), 343-35.

[9] Peden, M. (Ed.) (2004). “World Report on Road Traffic Injury Prevention”. World
Health Organization, Geneva.
[10] Statistics South Africa (2008). “Mortality and causes of death in South Africa, 2006:
Findings from death notification”. Statistics South Africa.
[11] Medical Research Council and UNISA (2007). “A profile of fatal injuries in South
Africa: 7th Annual Report of the National Injury Mortality Surveillance System
2005”. MRC/UNISA Crime, Violence and Injury Lead Programme, July 2007.
[12] Norman, R., Matzopoulos, R., Groenewald, P. and Bradshaw, D. (2007). “The high
burden of injuries in South Africa”. Bulletin of the World Health Organization,
September 2007, 85(9). WHO, Geneva.
[13] Cameron, A.C. and Trivedi, P.K. (1998). “Regression Analysis of Count Data”.
Cambridge University Press, Cambridge, U.K.
[14] de Jong, P. and Heller, G.Z. (2008). “Generalized Linear Models for Insurance
Data”. The International Series on Actuarial Science, Cambridge University Press.
ISBN-13 978-0-511-38877-4.
[15] Hilbe, J.M. (2011). “Negative Binomial Regression” (2nd edition). New York:
Cambridge University Press.
