2014-mo444-final-project
Using ML techniques to study city traffic accidents
Henrique Nogueira, Luisa Cardoso Madeira, Paulo Renato de Faria∗
Anderson Rocha†
1. Introduction
The dataset comprises approximately 297,853 traffic accidents collected from 2000 until 2013 in Porto Alegre, a Brazilian city with around 1.4 million people.
2. Activities
There are several articles discussing how to accomplish Traffic Accident Analysis in different countries. In Japan, Hasegawa et al. [1] applied a Support Vector Machine (SVM) with a Gaussian kernel to separate major and non-major accidents. In China, Fang et al. [2] used Bayesian networks and Logistic Regression to study the accidents in Jilin province during 2010. Finally, Olutayo and Eludire [3] studied Nigeria's busiest roads during 2 years and compared decision trees (Id3), Radial Basis Function (RBF) and neural networks using a Multilayer Perceptron; in their example, the "Id3 tree algorithm performed better with higher accuracy rate".
3. Proposed Solutions
The proposed solution for this project is to compare classifiers that predict whether accidents involve injured people in the city of Porto Alegre, considering data from 2000 to 2014. The algorithms that will be compared are Decision Tree, Random Forest and Logistic Regression.
A decision tree, or classification tree, is a tree in which each internal (non-leaf) node is labeled with an input feature. The arcs coming from a node labeled with a feature are labeled with each of the possible values of that feature. Each leaf of the tree is labeled with a class or a probability distribution over the classes. To classify an example, filter it down the tree as follows: for each feature encountered in the tree, the arc corresponding to the example's value for that feature is followed; when a leaf is reached, the classification corresponding to that leaf is returned.
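The walk-down above can be sketched in a few lines of R. The tree below is an illustrative toy, not the tree learned in this project: the feature names MOTO and BICICLETA echo the dataset's columns, but the structure and values are made up.

```r
# Hand-built toy tree: internal nodes name a feature and map each
# feature value (arc label) to a subtree; leaves carry a class label.
tree <- list(feature = "MOTO",
             branches = list(
               "yes" = list(label = "with_victims"),
               "no"  = list(feature = "BICICLETA",
                            branches = list(
                              "yes" = list(label = "with_victims"),
                              "no"  = list(label = "without_victims")))))

# Classifying an example = following the arc for the example's value
# of each feature until a leaf is reached.
classify <- function(node, example) {
  while (is.null(node$label)) {
    node <- node$branches[[example[[node$feature]]]]
  }
  node$label
}

classify(tree, list(MOTO = "no", BICICLETA = "yes"))  # "with_victims"
```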
∗Is with the Institute of Computing, University of Campinas (Unicamp). Contact: hrqnogueira@gmail.com, lu.madeira2@gmail.com, paulo.faria@gmail.com
†Is with the Institute of Computing, University of Campinas (Unicamp). Contact: anderson.rocha@ic.unicamp.br
Random Forest works as a large collection of non-correlated Decision Trees. The idea of combining Decision Trees comes from the bagging technique, which allows us to decrease the variance of the learning algorithm by combining a set of them. To verify the Random Forest results, we will use inTrees as described in Deng [4].
The big advantage of these two classifiers is that they are white-box models: unlike Logistic Regression or SVM, we can look inside the model and understand how the decisions are being made. In this project, we would like to predict whether there will be wounded people involved in accidents or not, so the white-box logic may help us understand which are the most important features and how they impact the results. These two algorithms will be compared using ROC curves.
We also used Logistic Regression to compare the ROC performance and analyse each feature's contribution.
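As a minimal sketch of this third model, base R's glm fits a Logistic Regression directly. The moto/hurt data below are simulated for illustration and are not the paper's dataset.

```r
# Simulated toy data: motorcycle involvement raises the chance of injury
set.seed(7)
n <- 200
moto <- rbinom(n, 1, 0.4)
hurt <- rbinom(n, 1, ifelse(moto == 1, 0.7, 0.3))

# Fit the logistic model; the sign and size of each coefficient
# show that feature's contribution to the log-odds of injury
fit <- glm(hurt ~ moto, family = binomial)
coef(fit)

# Predicted injury probability when a motorcycle is involved
predict(fit, newdata = data.frame(moto = 1), type = "response")
```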
3.1. Classification Trees and Information Gain
To build the classification tree, one fundamental step is to find the root node (the attribute that best splits the data). One of the measures used is Entropy (H), which measures the homogeneity of the examples, calculated as below:

H(S) = \sum_{i=1}^{c} (-p_i \log_2 p_i) \quad (1)
The tree split function used to find non-leaf nodes will be Information Gain, which measures the reduction in Entropy, as follows:

IG(S, A) = H(S) - \sum_{v \in values(A)} \frac{|S_v|}{|S|} H(S_v) \quad (2)

where S_v is the subset of S for which A has value v.
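Equations (1) and (2) translate almost literally into base R; the labels and the feature below are toy values for illustration.

```r
# Entropy of a label vector, Eq. (1)
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p * log2(p))
}

# Information gain of splitting `labels` on `feature`, Eq. (2)
info_gain <- function(labels, feature) {
  entropy(labels) - sum(sapply(split(labels, feature), function(s)
    (length(s) / length(labels)) * entropy(s)))
}

y <- c(1, 1, 1, 0, 0, 0, 1, 0)                 # perfectly balanced labels
a <- c("x", "x", "x", "x", "y", "y", "y", "y")
entropy(y)      # 1
info_gain(y, a) # about 0.19
```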
3.2. Quality measures
To assess the quality of the results, TPR (True Positive Rate) and FPR (False Positive Rate) will be used, as plotted in the ROC (Receiver Operating Characteristic) curve.
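For one fixed decision threshold, both rates follow directly from the confusion counts, and sweeping the threshold over the scores traces the ROC curve. A base R sketch on made-up truth/score vectors:

```r
# TPR and FPR of thresholded scores against the ground truth
rates <- function(truth, score, thr) {
  pred <- score >= thr
  c(TPR = sum(pred & truth == 1) / sum(truth == 1),
    FPR = sum(pred & truth == 0) / sum(truth == 0))
}

truth <- c(1, 1, 1, 0, 0, 0)
score <- c(0.9, 0.8, 0.4, 0.7, 0.3, 0.2)
rates(truth, score, 0.5)  # TPR = 2/3, FPR = 1/3
```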
4. Experiments and Discussion
4.0.1 Data splitting
The data was split into 3 partitions for each program under analysis, using the following proportions: 60% for training, 20% for validation and 20% for testing. This was implemented in R as below:
splitdf <- function(dataframe, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  index <- 1:nrow(dataframe)
  # 60% for training
  trainindex <- sample(index, trunc(length(index) * 0.6))
  trainset <- dataframe[trainindex, ]
  otherset <- dataframe[-trainindex, ]
  otherIndex <- 1:nrow(otherset)
  # 20% for validation and
  # 20% for the testing set
  validationIndex <- sample(otherIndex, trunc(length(otherIndex) / 2))
  validationset <- otherset[validationIndex, ]
  testset <- otherset[-validationIndex, ]
  list(trainset = trainset,
       validationset = validationset,
       testset = testset)
}
The next table summarizes the data partition sample sizes:

Dataset      Sample size
Training     180499
Test         60164
Validation   57190
Table 1. Decision Tree - Confusion matrix - hurt people training -
checking learning
4.1. Results
4.1.1 Decision Tree
             Accident without victims   With victims
Train        0.046                      0.35
Validation   0.046                      0.34
Test         0.047                      0.36

Table 2. Decision Tree - Classification error
Figure 1. Decision Tree for Hurt People.
Figure 2. ROC curve for Decision Tree predictions. Red - training,
blue - test, green - validation set.
             Accident without victims   With victims
Train        0.052                      0.28
Validation   0.1                        0.16
Test         0.1                        0.17

Table 3. Random Forest - Classification errors
4.1.2 Random Forest
Support is defined as the proportion of transactions containing the condition. Confidence is defined as the number of transactions containing both the condition and the outcome divided by the number of transactions containing the condition. Length is defined as the number of items contained in an association rule.

Length  Sup    Conf   Condition                                  Prediction
1       0.21   0.695  MOTO > 0.5                                 1
1       0.172  0.556  MOTO <= 0.5                                0
1       0.148  0.598  LOCAL is 'Logradouro'                      1
1       0.138  0.669  AUTO > 1.5                                 0
1       0.136  0.588  AUTO <= 1.5                                1
1       0.134  0.574  LOCAL is 'Cruzamento'                      0
1       0.124  0.892  BICICLETA > 0.5                            1
1       0.112  0.619  BICICLETA <= 0.5                           0
1       0.109  0.667  CAMINHAO <= 0.5                            1
1       0.103  0.889  TIPO ACID in ('ATROPELAMENTO', 'QUEDA')    1

Table 4. Random Forest - Rules derived from hurt people traffic analysis

Figure 3. Accidents causing hurt - RF important variables.
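These definitions of support and confidence can be checked with two lines of base R. The five-row transaction table below is made up for illustration; only the column names echo the dataset.

```r
# Toy transaction table: one row per accident
acc <- data.frame(MOTO = c(1, 1, 0, 0, 2),
                  HURT = c(1, 1, 0, 1, 1))

cond <- acc$MOTO > 0.5  # a rule condition, in the style of Table 4

support    <- mean(cond)                             # fraction matching the condition
confidence <- sum(cond & acc$HURT == 1) / sum(cond)  # outcome rate given the condition
support     # 0.6
confidence  # 1
```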
4.1.3 Logistic Regression
Figure 7 shows the average residual and the average fitted (predicted) value for each bin, or category. The categories are based on the fitted values. 95% of all values should fall within the dotted lines (as happens in our example).
4.1.4 Other data aggregation relevant to the problem
5. Conclusions and Future Work
Type 1 errors (false positives, that is, an accident with victims was predicted but did not happen) were greater in RF, while Type 2 errors (false negatives, no victims were predicted when they in fact occurred) were greater in Decision Trees. In our study, type 2 is the most important, because we do not want to miss predicting an accident when it is probable to happen (playing safe and warning of the chance of an accident is better in this study). Considering this, from the ROC curves in Figures 2 and 5 it is possible to observe that Random Forest delivered better results (fewer type 2 classification errors) than simple Decision Trees.

Figure 4. Accidents causing death - RF important variables.
From the analysis of Figure 3 it is possible to check that the type of the accident and the involvement of motorcycles are the two most relevant variables for indicating hurt. Other relevant variables are accidents involving cars, the place where the accident happened and the hour (though their relevance is not as high as the first two mentioned).
Figure 5. ROC curve for Random Forest predictions. Red - training, blue - test, green - validation set.
Figure 6. ROC Logistic Regression. Red - training, blue - test,
green - validation set.
Figure 4 shows the variables which most impacted death accidents. Again motorcycles were the main cause, but in this case other variables appeared with great importance, such as day/night, car crash, the hour, the type of the accident and whether trucks are involved.
Variable                      Estimate    Std Error
TIPO ACID ATROPELAMENTO 4.0449617 0.0569926
REGIAO NAO CADASTRADO 4.0373454 0.3960868
BICICLETA 3.609395 0.0819853
TIPO ACID QUEDA 2.794752 0.0881739
(Intercept) 2.3929621 0.1181012
MOTO 2.2216396 0.0291009
LOCAL Logradouro 1.4691486 0.019812
CAMINHAO -1.0833747 0.036022
TIPO ACID CAPOTAGEM 0.9057897 0.0855192
CARROCA 0.8206036 0.1144767
LOTACAO -0.7509406 0.0667275
TAXI -0.675485 0.0385326
REGIAO SUL 0.5651705 0.0278513
AUTO -0.5633394 0.0235469
REGIAO NAO IDENTIFICADO 0.5056619 0.3153992
REGIAO LESTE 0.4006684 0.0263747
TEMPO NAO CADAST 0.3895447 0.1076033
ONIBUS INT -0.3546455 0.0541953
REGIAO NORTE 0.2756333 0.0262838
OUTRO -0.2563662 0.0894421
FX HORA -0.0367186 0.0015004
ONIBUS URB 0.0096466 0.0404007
MES 0.0074157 0.0024001
DIA 0.0009066 0.000918
TIPO ACID TOMBAMENTO -0.0693669 0.1325849
TIPO ACID NAO CADASTRADO -0.0826591 0.5226228
DIA SEM SABADO -0.2394208 0.0325041
TEMPO CHUVOSO -0.514578 0.0289924
TIPO ACID CHOQUE -0.5259759 0.0285438
DIA SEM SEXTA-FEIRA -0.6812563 0.0319216
DIA SEM QUINTA-FEIRA -0.7043353 0.0327392
DIA SEM SEGUNDA-FEIRA -0.7156692 0.0331751
DIA SEM QUARTA-FEIRA -0.720596 0.0329271
TIPO ACID COLISAO -0.7549218 0.0194193
DIA SEM TERCA-FEIRA -0.7706036 0.033298
TEMPO NUBLADO -1.0541536 0.0334349
TIPO ACID EVENTUAL -1.0786179 0.0674208
TIPO ACID INCENDIO -1.8502813 0.6154434
NOITE DIA NOITE -2.9580033 0.0995565
NOITE DIA DIA -3.6283964 0.0991302
Table 5. Sorted Logistic Regression coefficients
Table 6 aggregates the top 10 places that concentrate the majority of the accidents with hurt people, while Table 7 shows the same study for the death accidents. The streets, avenues and roads are almost the same in both cases, indicating places where the government should pay more attention to reduce traffic accidents.
Going further into the type of the accident causing hurt, Table 4 depicts the rules derived from the Random Forest. When the place is not a crossroad, the chance of hurt is a little higher. There are some trivial relationships, such as accidents whose type is falldown or run-down causing hurt. However, some are not so trivial, such as accidents involving two cars not having a great chance of hurting people.

Figure 7. Binned Plot - Logistic Regression.

Place (Av/St)                     Nbr Accidents (w/ hurt)
AV PROTASIO ALVES                 2939
AV BENTO GONCALVES                2838
AV ASSIS BRASIL                   2562
AV IPIRANGA                       2442
AV SERTORIO                       1708
AV PROF OSCAR PEREIRA             1380
AV FARRAPOS                       1264
AV BALTAZAR DE OLIVEIRA GARCIA    1117
ESTR JOAO DE OLIVEIRA REMIAO      1038
AV JUCA BATISTA                   984

Table 6. Top 10 hurt accidents (place concentration)

Place (Av/St)                     Nbr Accidents (w/ deaths)
AV BENTO GONCALVES                103
AV ASSIS BRASIL                   92
AV PROTASIO ALVES                 69
AV IPIRANGA                       51
AV BALTAZAR DE OLIVEIRA GARCIA    49
ESTR JOAO DE OLIVEIRA REMIAO      49
AV FARRAPOS                       44
AV SERTORIO                       38
AV PROF OSCAR PEREIRA             37
AV JUCA BATISTA                   33

Table 7. Top 10 death accidents (place concentration)
The ROC curves were very similar for Logistic Regression and Random Forest on the test and validation sets. Logistic Regression also helped to elucidate the behaviour of other features, such as rollover.
In summary, it is possible to affirm that Random Forest and Logistic Regression had similar performances on this problem and both were better than the Decision Tree. From the most valuable variables in the Random Forest it is clear that motorcycles and the type of the accident are crucial for deciding whether there are injured people or not. On the other hand, contrary to common thinking, features like "rainy weather" or "crossing" have low relevance for the decision. There have also been interesting observations, such as the concentration of injuries in accidents happening on a "Logradouro" and in one-car accidents.

The model could be improved if there were additional variables to further analyse the accidents, such as driver characteristics, weather (rain), traffic conditions (jams), or accident impact on the road.
References
[1] Hironobu Hasegawa, Masaru Fujii, Mikiharu Arimura, and Tohru Tamura. A study on traffic accident analysis using support vector machines. 11th World Conference on Transport Research, page 9, 2007.
[2] Zong Fang, Xu Hongguo, and Zhang Huiyong. Prediction for traffic accident severity: Comparing the bayesian network and regression models. Hindawi Publishing Corporation, 2013:9, 2013.
[3] V.A. Olutayo and A.A. Eludire. Traffic accident analysis using decision trees and neural networks. I.J. Information Technology and Computer Science, 02:22-28, 2014.
[4] H. Deng. Interpreting tree ensembles with inTrees. arXiv, 1408.5456:1-18, 2014.