Using ML techniques to study city traffic accidents
Henrique Nogueira, Luisa Cardoso Madeira, Paulo Renato de Faria∗
Anderson Rocha†
1. Introduction
The dataset comprises approximately 297,853 traffic accidents recorded from 2000 to 2013 in Porto Alegre, a Brazilian city with around 1.4 million inhabitants.
2. Activities
Several articles discuss how to perform traffic accident analysis in different countries. In Japan, Hasegawa et al. [1] applied a Support Vector Machine (SVM) with a Gaussian kernel to separate major from non-major accidents. In China, Fang et al. [2] used Bayesian networks and Logistic Regression to study accidents in Jilin province during 2010. Finally, Olutayo and Eludire [3] studied Nigeria's busiest roads over two years and compared decision trees (ID3), Radial Basis Function (RBF) networks, and a Multilayer Perceptron neural network; in their experiments the "Id3 tree algorithm performed better with higher accuracy rate".
3. Proposed Solutions
The proposed solution for this project is to compare classification methods to predict injuries in accidents in the city of Porto Alegre, considering data from 2000 to 2014. The algorithms compared are Decision Tree, Random Forest, and Logistic Regression.
A decision tree, or classification tree, is a tree in which each internal (non-leaf) node is labeled with an input feature. The arcs leaving a node are labeled with the possible values of that node's feature, and each leaf is labeled with a class or a probability distribution over the classes. To classify an example, filter it down the tree: at each internal node, follow the arc corresponding to the example's value for that node's feature; when a leaf is reached, return the classification associated with that leaf.
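The classification procedure above can be sketched in a few lines; the following Python illustration uses a hypothetical tree and feature names (`moto`, `local`), not structures taken from the Porto Alegre dataset.

```python
# Classify an example by filtering it down a decision tree.
# Internal nodes are dicts with a feature name and one branch per value;
# leaves are plain class labels.
def classify(tree, example):
    while isinstance(tree, dict):
        feature = tree["feature"]
        # Follow the arc labeled with the example's value for this feature.
        tree = tree["branches"][example[feature]]
    return tree  # a leaf: the predicted class

# Hypothetical tree: first split on motorcycle involvement, then on place.
tree = {"feature": "moto",
        "branches": {True: "with_victims",
                     False: {"feature": "local",
                             "branches": {"crossing": "without_victims",
                                          "road": "with_victims"}}}}

print(classify(tree, {"moto": False, "local": "crossing"}))  # without_victims
```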
∗With the Institute of Computing, University of Campinas (Unicamp). Contact: hrqnogueira@gmail.com, lu.madeira2@gmail.com, paulo.faria@gmail.com
†With the Institute of Computing, University of Campinas (Unicamp). Contact: anderson.rocha@ic.unicamp.br
A Random Forest works as a large collection of de-correlated decision trees. The idea of combining decision trees comes from the bagging technique, which decreases the variance of the learning algorithm by aggregating a set of such trees. To inspect the Random Forest results, we will use inTrees, as described in Deng [4].
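The variance-reduction argument behind bagging can be checked numerically: averaging B independent predictors, each with variance σ², yields a predictor with variance σ²/B. A minimal Python sketch (the values of σ², B, and the Gaussian noise model are illustrative assumptions, not properties of the accident data):

```python
import random

random.seed(0)
sigma2 = 1.0   # variance of a single tree's prediction (illustrative)
B = 50         # number of trees averaged by the ensemble
n = 20000      # Monte Carlo repetitions

# Each "tree" predicts the true value 0 plus independent Gaussian noise;
# the ensemble prediction is the average of the B trees, as in bagging.
ensemble = [sum(random.gauss(0, sigma2 ** 0.5) for _ in range(B)) / B
            for _ in range(n)]
var_hat = sum(x * x for x in ensemble) / n  # empirical ensemble variance

print(round(var_hat, 3))  # close to sigma2 / B = 0.02
```

In practice the trees of a Random Forest are only approximately de-correlated, so the reduction is smaller than σ²/B, but the mechanism is the same.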
The big advantage of these two classifiers is that they are white-box models: unlike Logistic Regression or SVM, we can look inside the model and understand how decisions are being made. In this project we would like to predict whether or not wounded people will be involved in an accident, so the white-box logic may help us understand which features are most important and how they impact the results. These two algorithms will be compared using ROC curves.
We also used Logistic Regression to compare the ROC performance and to analyse each feature's contribution.
3.1. Classification Trees and Information Gain
To build the classification tree, one fundamental step is to find the root node (the attribute that best splits the data). One of the measures used is entropy (H), which quantifies the homogeneity of the examples, calculated as below:

H(S) = Σ_{i=1}^{c} (−p_i · log2 p_i)    (1)
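As a quick check of Equation (1), entropy can be computed directly from the class counts of a set; a minimal Python sketch:

```python
from math import log2

def entropy(counts):
    """Entropy H(S) from the class counts of a set S (Equation 1)."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# A perfectly balanced binary set has maximal entropy (1 bit),
# while a pure set has entropy 0.
print(entropy([50, 50]))   # 1.0
print(entropy([100, 0]))   # 0.0
```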
The split function used to choose non-leaf nodes will be Information Gain, which measures the reduction in entropy as follows:

IG(S, A) = H(S) − Σ_{v ∈ values(A)} (|S_v| / |S|) · H(S_v)    (2)

where S_v is the subset of S for which A has value v.
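Equation (2) can be illustrated with a small worked example; the split below, which separates the classes perfectly, is hypothetical (a binary attribute over 20 examples), chosen only to show that a perfect split yields the parent's full entropy as gain.

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts_per_value):
    """IG(S, A) per Equation (2): parent entropy minus the size-weighted
    entropy of the subsets S_v induced by each value v of attribute A."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child)
                   for child in child_counts_per_value)
    return entropy(parent_counts) - weighted

# 10 accidents with victims and 10 without; a binary attribute that
# separates them perfectly gains the full 1 bit of parent entropy.
print(information_gain([10, 10], [[10, 0], [0, 10]]))  # 1.0
```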
3.2. Quality measures
To assess the quality of the results, we will use the TPR (True Positive Rate) and FPR (False Positive Rate), plotted as a ROC (Receiver Operating Characteristic) curve.
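A ROC curve is obtained by sweeping a decision threshold over the classifier's scores and computing a (FPR, TPR) pair at each threshold. A minimal Python sketch with made-up scores (the data is illustrative, not from the Porto Alegre dataset):

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs from sweeping a threshold over predicted scores
    (labels: 1 = accident with victims, 0 = without)."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points

# Toy scores from a hypothetical classifier; a good model's curve rises
# toward (0, 1) before reaching (1, 1).
print(roc_points([0.9, 0.8, 0.4, 0.2], [1, 1, 0, 0]))
```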
4. Experiments and Discussion
4.0.1 Data splitting
The data was split into 3 partitions for each program under analysis, using the following proportions: 60% for training, 20% for validation, and 20% for testing. This was implemented in R as below:
splitdf <- function(dataframe, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  index <- 1:nrow(dataframe)
  # 60% for training
  trainindex <- sample(index, trunc(length(index) * 0.6))
  trainset <- dataframe[trainindex, ]
  otherset <- dataframe[-trainindex, ]
  otherIndex <- 1:nrow(otherset)
  # 20% for validation and
  # 20% for testing
  validationIndex <- sample(otherIndex,
                            trunc(length(otherIndex) / 2))
  validationset <- otherset[validationIndex, ]
  testset <- otherset[-validationIndex, ]
  list(trainset = trainset,
       validationset = validationset,
       testset = testset)
}
The next table summarizes the sample size of each data partition:

Dataset     Sample size
Training    180499
Test        60164
Validation  57190
Table 1. Sample size of each data partition
4.1. Results
4.1.1 Decision Tree
            Accident without victims   With victims
Train       0.046                      0.35
Validation  0.046                      0.34
Test        0.047                      0.36
Table 2. Decision Tree - Classification error
Figure 1. Decision Tree for Hurt People.
Figure 2. ROC curve for Decision Tree predictions. Red - training,
blue - test, green - validation set.
            Accident without victims   With victims
Train       0.052                      0.28
Validation  0.1                        0.16
Test        0.1                        0.17
Table 3. Random Forest - Classification errors
4.1.2 Random Forest
Support is defined as the proportion of transactions containing the condition. Confidence is defined as the number of transactions containing both the condition and the outcome divided by the number of transactions containing the condition. Length is defined as the number of items contained in an association rule.

Length  Sup    Conf   Condition                                  Prediction
1       0.21   0.695  MOTO > 0.5                                 1
1       0.172  0.556  MOTO <= 0.5                                0
1       0.148  0.598  LOCAL is 'Logadouro'                       1
1       0.138  0.669  AUTO > 1.5                                 0
1       0.136  0.588  AUTO <= 1.5                                1
1       0.134  0.574  LOCAL is 'Cruzamento'                      0
1       0.124  0.892  BICICLETA > 0.5                            1
1       0.112  0.619  BICICLETA <= 0.5                           0
1       0.109  0.667  CAMINHAO <= 0.5                            1
1       0.103  0.889  TIPO ACID in ('ATROPELAMENTO', 'QUEDA')    1
Table 4. Random Forest - Rules derived from hurt people traffic analysis
Figure 3. Accidents causing hurt - RF important variables.
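These rule metrics can be computed directly from the transactions; a minimal Python sketch over made-up accident records (the column names mirror Table 4, but the data is illustrative):

```python
def support(transactions, condition):
    """Proportion of transactions satisfying the rule's condition."""
    return sum(condition(t) for t in transactions) / len(transactions)

def confidence(transactions, condition, outcome):
    """Among transactions satisfying the condition, the fraction that
    also satisfy the outcome."""
    matching = [t for t in transactions if condition(t)]
    return sum(outcome(t) for t in matching) / len(matching)

# Hypothetical accidents: MOTO = motorcycles involved, hurt = injury flag.
accidents = [{"MOTO": 1, "hurt": 1}, {"MOTO": 1, "hurt": 1},
             {"MOTO": 0, "hurt": 0}, {"MOTO": 0, "hurt": 1}]
cond = lambda t: t["MOTO"] > 0.5

print(support(accidents, cond))                               # 0.5
print(confidence(accidents, cond, lambda t: t["hurt"] == 1))  # 1.0
```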
4.1.3 Logistic Regression
Figure 7 shows the average residual and the average fitted (predicted) value for each bin, or category; categories are based on the fitted values. 95% of all values should fall within the dotted lines (as happened in our example).
4.1.4 Other data aggregation relevant to the problem
5. Conclusions and Future Work
Type 1 errors (false positives, that is, an injury accident was predicted when none happened) were greater for Random Forest, while Type 2 errors (false negatives, no injury was predicted when one in fact happened) were greater for Decision Trees. In our study, Type 2 errors are the most important, because we do not want to fail to predict an accident that is likely to happen (playing safe and warning about a probable accident is better in this study). Considering this, from the ROC curves in Figures 2 and 5, it is possible to observe that Random Forest delivered better results (fewer Type 2 classification errors) than the simple Decision Tree.
Figure 4. Accidents causing death - RF important variables.
From the analysis of Figure 3, it is possible to see that the type of the accident and the involvement of motorcycles are the two most relevant variables for indicating injuries. Other relevant variables are the involvement of cars, the place where the accident happened, and the hour (though not as relevant as the first two).
Figure 5. ROC curve for Random Forest predictions. Red - train-
ing, blue - test, green - validation set.
Figure 6. ROC Logistic Regression. Red - training, blue - test,
green - validation set.
Figure 4 shows the variables that most impacted deadly accidents. Again, motorcycles were the main factor, but in this case other variables appeared with great importance, such as day/night, car crash, the hour, the type of the accident, and whether trucks were involved.
Variable                           Estimate     Std. Error
TIPO ACID ATROPELAMENTO 4.0449617 0.0569926
REGIAO NAO CADASTRADO 4.0373454 0.3960868
BICICLETA 3.609395 0.0819853
TIPO ACID QUEDA 2.794752 0.0881739
(Intercept) 2.3929621 0.1181012
MOTO 2.2216396 0.0291009
LOCAL Logradouro 1.4691486 0.019812
CAMINHAO -1.0833747 0.036022
TIPO ACID CAPOTAGEM 0.9057897 0.0855192
CARROCA 0.8206036 0.1144767
LOTACAO -0.7509406 0.0667275
TAXI -0.675485 0.0385326
REGIAO SUL 0.5651705 0.0278513
AUTO -0.5633394 0.0235469
REGIAO NAO IDENTIFICADO 0.5056619 0.3153992
REGIAO LESTE 0.4006684 0.0263747
TEMPO NAO CADAST 0.3895447 0.1076033
ONIBUS INT -0.3546455 0.0541953
REGIAO NORTE 0.2756333 0.0262838
OUTRO -0.2563662 0.0894421
FX HORA -0.0367186 0.0015004
ONIBUS URB 0.0096466 0.0404007
MES 0.0074157 0.0024001
DIA 0.0009066 0.000918
TIPO ACID TOMBAMENTO -0.0693669 0.1325849
TIPO ACID NAO CADASTRADO -0.0826591 0.5226228
DIA SEM SABADO -0.2394208 0.0325041
TEMPO CHUVOSO -0.514578 0.0289924
TIPO ACID CHOQUE -0.5259759 0.0285438
DIA SEM SEXTA-FEIRA -0.6812563 0.0319216
DIA SEM QUINTA-FEIRA -0.7043353 0.0327392
DIA SEM SEGUNDA-FEIRA -0.7156692 0.0331751
DIA SEM QUARTA-FEIRA -0.720596 0.0329271
TIPO ACID COLISAO -0.7549218 0.0194193
DIA SEM TERCA-FEIRA -0.7706036 0.033298
TEMPO NUBLADO -1.0541536 0.0334349
TIPO ACID EVENTUAL -1.0786179 0.0674208
TIPO ACID INCENDIO -1.8502813 0.6154434
NOITE DIA NOITE -2.9580033 0.0995565
NOITE DIA DIA -3.6283964 0.0991302
Table 5. Logistic Regression coefficients (sorted)
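Under the logistic model, each estimate in Table 5 is a log-odds contribution, so exponentiating it gives the multiplicative effect on the odds of an injury accident, holding the other variables fixed. For example, taking MOTO's estimate of about 2.22:

```python
from math import exp

# MOTO's estimate from Table 5: each additional motorcycle multiplies
# the odds of an injury accident by exp(2.2216396), other variables fixed.
print(round(exp(2.2216396), 1))  # 9.2
```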
Table 6 aggregates the top 10 places that concentrate the majority of injury accidents, while Table 7 shows the same study for deadly accidents. The streets, avenues, and roads are almost the same in both cases, indicating places where the government should pay more attention in order to reduce traffic accidents.
Going further into the types of accidents causing injuries, Table 4 depicts the rules derived from the Random Forest. When the place is not a crossroad, the chance of injury is a little higher. There are some trivial relationships, such as accidents of the fall or run-over types causing injuries. However, some are not so trivial, such as accidents involving two cars not having a great chance of hurting people.
Figure 7. Binned Plot - Logistic Regression.

Place (Av/St)                      Nbr Accidents (w/ hurt)
AV PROTASIO ALVES                  2939
AV BENTO GONCALVES                 2838
AV ASSIS BRASIL                    2562
AV IPIRANGA                        2442
AV SERTORIO                        1708
AV PROF OSCAR PEREIRA              1380
AV FARRAPOS                        1264
AV BALTAZAR DE OLIVEIRA GARCIA     1117
ESTR JOAO DE OLIVEIRA REMIAO       1038
AV JUCA BATISTA                    984
Table 6. Top 10 hurt accidents (place concentration)

Place (Av/St)                      Nbr Accidents (w/ deaths)
AV BENTO GONCALVES                 103
AV ASSIS BRASIL                    92
AV PROTASIO ALVES                  69
AV IPIRANGA                        51
AV BALTAZAR DE OLIVEIRA GARCIA     49
ESTR JOAO DE OLIVEIRA REMIAO       49
AV FARRAPOS                        44
AV SERTORIO                        38
AV PROF OSCAR PEREIRA              37
AV JUCA BATISTA                    33
Table 7. Top 10 death accidents (place concentration)
The ROC curves for Logistic Regression and Random Forest were very similar on the test and validation sets. Logistic Regression also helped to elucidate the behaviour of other features, such as rollover.
In summary, it is possible to affirm that Random Forest and Logistic Regression had similar performances on this problem, and both were better than the Decision Tree. From the most valuable variables in the Random Forest, it is clear that motorcycles and the type of the accident are crucial for deciding whether there are injured people or not. On the other hand, contrary to common thinking, features like "rainy weather" or "crossing" have low relevance for the decision. There were also interesting observations, such as the relevance of the accident place being "Logadouro" and of one-car accidents with wounded people.

The model could be improved if there were additional variables to further analyse the accidents, such as driver characteristics, weather (rain), traffic conditions (jams), or the accident's impact on the road.
References
[1] Hironobu Hasegawa, Masaru Fujii, Mikiharu Arimura, and Tohru Tamura. A study on traffic accident analysis using support vector machines. 11th World Conference on Transport Research, page 9, 2007.
[2] Zong Fang, Xu Hongguo, and Zhang Huiyong. Prediction for traffic accident severity: Comparing the Bayesian network and regression models. Hindawi Publishing Corporation, 2013:9, 2013.
[3] V.A. Olutayo and A.A. Eludire. Traffic accident analysis using decision trees and neural networks. I.J. Information Technology and Computer Science, 02:22–28, 2014.
[4] H. Deng. Interpreting tree ensembles with inTrees. arXiv, 1408.5456:1–18, 2014.

More Related Content

What's hot

Applied Business Statistics ,ken black , ch 6
Applied Business Statistics ,ken black , ch 6Applied Business Statistics ,ken black , ch 6
Applied Business Statistics ,ken black , ch 6AbdelmonsifFadl
 
V. pacáková, d. brebera
V. pacáková, d. breberaV. pacáková, d. brebera
V. pacáková, d. breberalogyalaa
 
Andrew_Hair_Assignment_3
Andrew_Hair_Assignment_3Andrew_Hair_Assignment_3
Andrew_Hair_Assignment_3Andrew Hair
 
15 ch ken black solution
15 ch ken black solution15 ch ken black solution
15 ch ken black solutionKrunal Shah
 
Soạn thảo văn bản bằng LATEX
Soạn thảo văn bản bằng LATEXSoạn thảo văn bản bằng LATEX
Soạn thảo văn bản bằng LATEXHuỳnh Lâm
 
Vasicek Model Project
Vasicek Model ProjectVasicek Model Project
Vasicek Model ProjectCedric Melhy
 
Applied Business Statistics ,ken black , ch 5
Applied Business Statistics ,ken black , ch 5Applied Business Statistics ,ken black , ch 5
Applied Business Statistics ,ken black , ch 5AbdelmonsifFadl
 
12 ch ken black solution
12 ch ken black solution12 ch ken black solution
12 ch ken black solutionKrunal Shah
 
Identification of Outliersin Time Series Data via Simulation Study
Identification of Outliersin Time Series Data via Simulation StudyIdentification of Outliersin Time Series Data via Simulation Study
Identification of Outliersin Time Series Data via Simulation Studyiosrjce
 
10 ch ken black solution
10 ch ken black solution10 ch ken black solution
10 ch ken black solutionKrunal Shah
 
08 ch ken black solution
08 ch ken black solution08 ch ken black solution
08 ch ken black solutionKrunal Shah
 
18 ch ken black solution
18 ch ken black solution18 ch ken black solution
18 ch ken black solutionKrunal Shah
 
161783709 chapter-04-answers
161783709 chapter-04-answers161783709 chapter-04-answers
161783709 chapter-04-answersFiras Husseini
 
An Extension of Downs Model of Political Competition using Fuzzy Logic (Soci...
An Extension of Downs Model of Political Competition using Fuzzy Logic (Soci...An Extension of Downs Model of Political Competition using Fuzzy Logic (Soci...
An Extension of Downs Model of Political Competition using Fuzzy Logic (Soci...Camilo Pechâ Garz
 
153929081 80951377-regression-analysis-of-count-data
153929081 80951377-regression-analysis-of-count-data153929081 80951377-regression-analysis-of-count-data
153929081 80951377-regression-analysis-of-count-dataNataniel Barros
 
03 ch ken black solution
03 ch ken black solution03 ch ken black solution
03 ch ken black solutionKrunal Shah
 
165662191 chapter-03-answers-1
165662191 chapter-03-answers-1165662191 chapter-03-answers-1
165662191 chapter-03-answers-1Firas Husseini
 
Some Unbiased Classes of Estimators of Finite Population Mean
Some Unbiased Classes of Estimators of Finite Population MeanSome Unbiased Classes of Estimators of Finite Population Mean
Some Unbiased Classes of Estimators of Finite Population Meaninventionjournals
 

What's hot (20)

Applied Business Statistics ,ken black , ch 6
Applied Business Statistics ,ken black , ch 6Applied Business Statistics ,ken black , ch 6
Applied Business Statistics ,ken black , ch 6
 
V. pacáková, d. brebera
V. pacáková, d. breberaV. pacáková, d. brebera
V. pacáková, d. brebera
 
Andrew_Hair_Assignment_3
Andrew_Hair_Assignment_3Andrew_Hair_Assignment_3
Andrew_Hair_Assignment_3
 
15 ch ken black solution
15 ch ken black solution15 ch ken black solution
15 ch ken black solution
 
Soạn thảo văn bản bằng LATEX
Soạn thảo văn bản bằng LATEXSoạn thảo văn bản bằng LATEX
Soạn thảo văn bản bằng LATEX
 
Vasicek Model Project
Vasicek Model ProjectVasicek Model Project
Vasicek Model Project
 
Applied Business Statistics ,ken black , ch 5
Applied Business Statistics ,ken black , ch 5Applied Business Statistics ,ken black , ch 5
Applied Business Statistics ,ken black , ch 5
 
12 ch ken black solution
12 ch ken black solution12 ch ken black solution
12 ch ken black solution
 
Identification of Outliersin Time Series Data via Simulation Study
Identification of Outliersin Time Series Data via Simulation StudyIdentification of Outliersin Time Series Data via Simulation Study
Identification of Outliersin Time Series Data via Simulation Study
 
10 ch ken black solution
10 ch ken black solution10 ch ken black solution
10 ch ken black solution
 
08 ch ken black solution
08 ch ken black solution08 ch ken black solution
08 ch ken black solution
 
18 ch ken black solution
18 ch ken black solution18 ch ken black solution
18 ch ken black solution
 
161783709 chapter-04-answers
161783709 chapter-04-answers161783709 chapter-04-answers
161783709 chapter-04-answers
 
Chapter 11
Chapter 11 Chapter 11
Chapter 11
 
Chapter 10
Chapter 10Chapter 10
Chapter 10
 
An Extension of Downs Model of Political Competition using Fuzzy Logic (Soci...
An Extension of Downs Model of Political Competition using Fuzzy Logic (Soci...An Extension of Downs Model of Political Competition using Fuzzy Logic (Soci...
An Extension of Downs Model of Political Competition using Fuzzy Logic (Soci...
 
153929081 80951377-regression-analysis-of-count-data
153929081 80951377-regression-analysis-of-count-data153929081 80951377-regression-analysis-of-count-data
153929081 80951377-regression-analysis-of-count-data
 
03 ch ken black solution
03 ch ken black solution03 ch ken black solution
03 ch ken black solution
 
165662191 chapter-03-answers-1
165662191 chapter-03-answers-1165662191 chapter-03-answers-1
165662191 chapter-03-answers-1
 
Some Unbiased Classes of Estimators of Finite Population Mean
Some Unbiased Classes of Estimators of Finite Population MeanSome Unbiased Classes of Estimators of Finite Population Mean
Some Unbiased Classes of Estimators of Finite Population Mean
 

Viewers also liked

2014-mo444-practical-assignment-04-paulo_faria
2014-mo444-practical-assignment-04-paulo_faria2014-mo444-practical-assignment-04-paulo_faria
2014-mo444-practical-assignment-04-paulo_fariaPaulo Faria
 
CaseStudyIndustryPresentation
CaseStudyIndustryPresentationCaseStudyIndustryPresentation
CaseStudyIndustryPresentationCharles Buie
 
Diminishing Quality of Proposals (Federal) -The Sandwiched Proposal Writers
Diminishing Quality of Proposals (Federal) -The Sandwiched Proposal WritersDiminishing Quality of Proposals (Federal) -The Sandwiched Proposal Writers
Diminishing Quality of Proposals (Federal) -The Sandwiched Proposal WritersHarnoor Sanjeev
 
McDONALD_FRANK_RESUME 2016(20)
McDONALD_FRANK_RESUME 2016(20)McDONALD_FRANK_RESUME 2016(20)
McDONALD_FRANK_RESUME 2016(20)Frank McDonald
 
2014-mo444-practical-assignment-02-paulo_faria
2014-mo444-practical-assignment-02-paulo_faria2014-mo444-practical-assignment-02-paulo_faria
2014-mo444-practical-assignment-02-paulo_fariaPaulo Faria
 
2014-mo444-practical-assignment-01-paulo_faria
2014-mo444-practical-assignment-01-paulo_faria2014-mo444-practical-assignment-01-paulo_faria
2014-mo444-practical-assignment-01-paulo_fariaPaulo Faria
 
Postcards final-all
Postcards final-allPostcards final-all
Postcards final-allHashevaynu
 
Hashevaynu's 13th Annual Dinner
Hashevaynu's 13th Annual DinnerHashevaynu's 13th Annual Dinner
Hashevaynu's 13th Annual DinnerHashevaynu
 
Going Live April 2015
Going Live April 2015Going Live April 2015
Going Live April 2015Andrew Green
 
Service and guidance in education
Service and guidance in educationService and guidance in education
Service and guidance in educationWaqar Nisa
 

Viewers also liked (15)

Article_6
Article_6Article_6
Article_6
 
2014-mo444-practical-assignment-04-paulo_faria
2014-mo444-practical-assignment-04-paulo_faria2014-mo444-practical-assignment-04-paulo_faria
2014-mo444-practical-assignment-04-paulo_faria
 
CaseStudyIndustryPresentation
CaseStudyIndustryPresentationCaseStudyIndustryPresentation
CaseStudyIndustryPresentation
 
Fa102 b
Fa102 bFa102 b
Fa102 b
 
Rebellions Excerpt
Rebellions ExcerptRebellions Excerpt
Rebellions Excerpt
 
Diminishing Quality of Proposals (Federal) -The Sandwiched Proposal Writers
Diminishing Quality of Proposals (Federal) -The Sandwiched Proposal WritersDiminishing Quality of Proposals (Federal) -The Sandwiched Proposal Writers
Diminishing Quality of Proposals (Federal) -The Sandwiched Proposal Writers
 
McDONALD_FRANK_RESUME 2016(20)
McDONALD_FRANK_RESUME 2016(20)McDONALD_FRANK_RESUME 2016(20)
McDONALD_FRANK_RESUME 2016(20)
 
2014-mo444-practical-assignment-02-paulo_faria
2014-mo444-practical-assignment-02-paulo_faria2014-mo444-practical-assignment-02-paulo_faria
2014-mo444-practical-assignment-02-paulo_faria
 
vjq_cv2015
vjq_cv2015vjq_cv2015
vjq_cv2015
 
2014-mo444-practical-assignment-01-paulo_faria
2014-mo444-practical-assignment-01-paulo_faria2014-mo444-practical-assignment-01-paulo_faria
2014-mo444-practical-assignment-01-paulo_faria
 
Power Point
Power PointPower Point
Power Point
 
Postcards final-all
Postcards final-allPostcards final-all
Postcards final-all
 
Hashevaynu's 13th Annual Dinner
Hashevaynu's 13th Annual DinnerHashevaynu's 13th Annual Dinner
Hashevaynu's 13th Annual Dinner
 
Going Live April 2015
Going Live April 2015Going Live April 2015
Going Live April 2015
 
Service and guidance in education
Service and guidance in educationService and guidance in education
Service and guidance in education
 

Similar to 2014-mo444-final-project

Vol 9 No 1 - January 2014
Vol 9 No 1 - January 2014Vol 9 No 1 - January 2014
Vol 9 No 1 - January 2014ijcsbi
 
NEURAL NETWORKS WITH DECISION TREES FOR DIAGNOSIS ISSUES
NEURAL NETWORKS WITH DECISION TREES FOR DIAGNOSIS ISSUESNEURAL NETWORKS WITH DECISION TREES FOR DIAGNOSIS ISSUES
NEURAL NETWORKS WITH DECISION TREES FOR DIAGNOSIS ISSUEScscpconf
 
NEURAL NETWORKS WITH DECISION TREES FOR DIAGNOSIS ISSUES
NEURAL NETWORKS WITH DECISION TREES FOR DIAGNOSIS ISSUESNEURAL NETWORKS WITH DECISION TREES FOR DIAGNOSIS ISSUES
NEURAL NETWORKS WITH DECISION TREES FOR DIAGNOSIS ISSUEScsitconf
 
Optimum Algorithm for Computing the Standardized Moments Using MATLAB 7.10(R2...
Optimum Algorithm for Computing the Standardized Moments Using MATLAB 7.10(R2...Optimum Algorithm for Computing the Standardized Moments Using MATLAB 7.10(R2...
Optimum Algorithm for Computing the Standardized Moments Using MATLAB 7.10(R2...Waqas Tariq
 
Parametric sensitivity analysis of a mathematical model of facultative mutualism
Parametric sensitivity analysis of a mathematical model of facultative mutualismParametric sensitivity analysis of a mathematical model of facultative mutualism
Parametric sensitivity analysis of a mathematical model of facultative mutualismIOSR Journals
 
THE EFFECT OF SEGREGATION IN NONREPEATED PRISONER'S DILEMMA
THE EFFECT OF SEGREGATION IN NONREPEATED PRISONER'S DILEMMA THE EFFECT OF SEGREGATION IN NONREPEATED PRISONER'S DILEMMA
THE EFFECT OF SEGREGATION IN NONREPEATED PRISONER'S DILEMMA AIRCC Publishing Corporation
 
THE EFFECT OF SEGREGATION IN NONREPEATED PRISONER'S DILEMMA
THE EFFECT OF SEGREGATION IN NONREPEATED PRISONER'S DILEMMA THE EFFECT OF SEGREGATION IN NONREPEATED PRISONER'S DILEMMA
THE EFFECT OF SEGREGATION IN NONREPEATED PRISONER'S DILEMMA ijcsit
 
THE EFFECT OF SEGREGATION IN NONREPEATED PRISONER'S DILEMMA
THE EFFECT OF SEGREGATION IN NONREPEATED PRISONER'S DILEMMA THE EFFECT OF SEGREGATION IN NONREPEATED PRISONER'S DILEMMA
THE EFFECT OF SEGREGATION IN NONREPEATED PRISONER'S DILEMMA AIRCC Publishing Corporation
 
ByPREFERENCES FOR CAR CHOICE IN UNITED STATES.docx
ByPREFERENCES FOR CAR CHOICE IN UNITED STATES.docxByPREFERENCES FOR CAR CHOICE IN UNITED STATES.docx
ByPREFERENCES FOR CAR CHOICE IN UNITED STATES.docxclairbycraft
 
ByPREFERENCES FOR CAR CHOICE IN UNITED STATES.docx
ByPREFERENCES FOR CAR CHOICE IN UNITED STATES.docxByPREFERENCES FOR CAR CHOICE IN UNITED STATES.docx
ByPREFERENCES FOR CAR CHOICE IN UNITED STATES.docxDaliaCulbertson719
 
OPTIMAL GLOBAL THRESHOLD ESTIMATION USING STATISTICAL CHANGE-POINT DETECTION
OPTIMAL GLOBAL THRESHOLD ESTIMATION USING STATISTICAL CHANGE-POINT DETECTIONOPTIMAL GLOBAL THRESHOLD ESTIMATION USING STATISTICAL CHANGE-POINT DETECTION
OPTIMAL GLOBAL THRESHOLD ESTIMATION USING STATISTICAL CHANGE-POINT DETECTIONsipij
 
APPLYING DYNAMIC MODEL FOR MULTIPLE MANOEUVRING TARGET TRACKING USING PARTICL...
APPLYING DYNAMIC MODEL FOR MULTIPLE MANOEUVRING TARGET TRACKING USING PARTICL...APPLYING DYNAMIC MODEL FOR MULTIPLE MANOEUVRING TARGET TRACKING USING PARTICL...
APPLYING DYNAMIC MODEL FOR MULTIPLE MANOEUVRING TARGET TRACKING USING PARTICL...IJITCA Journal
 
Image Processing for Automated Flaw Detection and CMYK model for Color Image ...
Image Processing for Automated Flaw Detection and CMYK model for Color Image ...Image Processing for Automated Flaw Detection and CMYK model for Color Image ...
Image Processing for Automated Flaw Detection and CMYK model for Color Image ...IOSR Journals
 
A new analysis of failure modes and effects by fuzzy todim with using fuzzy t...
A new analysis of failure modes and effects by fuzzy todim with using fuzzy t...A new analysis of failure modes and effects by fuzzy todim with using fuzzy t...
A new analysis of failure modes and effects by fuzzy todim with using fuzzy t...ijfls
 
Fuzzy Regression Model for Knee Osteoarthritis Disease Diagnosis
Fuzzy Regression Model for Knee Osteoarthritis Disease DiagnosisFuzzy Regression Model for Knee Osteoarthritis Disease Diagnosis
Fuzzy Regression Model for Knee Osteoarthritis Disease DiagnosisIRJET Journal
 

Similar to 2014-mo444-final-project (20)

report
reportreport
report
 
Vol 9 No 1 - January 2014
Vol 9 No 1 - January 2014Vol 9 No 1 - January 2014
Vol 9 No 1 - January 2014
 
NEURAL NETWORKS WITH DECISION TREES FOR DIAGNOSIS ISSUES
NEURAL NETWORKS WITH DECISION TREES FOR DIAGNOSIS ISSUESNEURAL NETWORKS WITH DECISION TREES FOR DIAGNOSIS ISSUES
NEURAL NETWORKS WITH DECISION TREES FOR DIAGNOSIS ISSUES
 
NEURAL NETWORKS WITH DECISION TREES FOR DIAGNOSIS ISSUES
NEURAL NETWORKS WITH DECISION TREES FOR DIAGNOSIS ISSUESNEURAL NETWORKS WITH DECISION TREES FOR DIAGNOSIS ISSUES
NEURAL NETWORKS WITH DECISION TREES FOR DIAGNOSIS ISSUES
 
Optimum Algorithm for Computing the Standardized Moments Using MATLAB 7.10(R2...
Optimum Algorithm for Computing the Standardized Moments Using MATLAB 7.10(R2...Optimum Algorithm for Computing the Standardized Moments Using MATLAB 7.10(R2...
Optimum Algorithm for Computing the Standardized Moments Using MATLAB 7.10(R2...
 
Parametric sensitivity analysis of a mathematical model of facultative mutualism
Parametric sensitivity analysis of a mathematical model of facultative mutualismParametric sensitivity analysis of a mathematical model of facultative mutualism
Parametric sensitivity analysis of a mathematical model of facultative mutualism
 
THE EFFECT OF SEGREGATION IN NONREPEATED PRISONER'S DILEMMA
THE EFFECT OF SEGREGATION IN NONREPEATED PRISONER'S DILEMMA THE EFFECT OF SEGREGATION IN NONREPEATED PRISONER'S DILEMMA
THE EFFECT OF SEGREGATION IN NONREPEATED PRISONER'S DILEMMA
 
THE EFFECT OF SEGREGATION IN NONREPEATED PRISONER'S DILEMMA
THE EFFECT OF SEGREGATION IN NONREPEATED PRISONER'S DILEMMA THE EFFECT OF SEGREGATION IN NONREPEATED PRISONER'S DILEMMA
THE EFFECT OF SEGREGATION IN NONREPEATED PRISONER'S DILEMMA
 
THE EFFECT OF SEGREGATION IN NONREPEATED PRISONER'S DILEMMA
THE EFFECT OF SEGREGATION IN NONREPEATED PRISONER'S DILEMMA THE EFFECT OF SEGREGATION IN NONREPEATED PRISONER'S DILEMMA
THE EFFECT OF SEGREGATION IN NONREPEATED PRISONER'S DILEMMA
 
ByPREFERENCES FOR CAR CHOICE IN UNITED STATES.docx
ByPREFERENCES FOR CAR CHOICE IN UNITED STATES.docxByPREFERENCES FOR CAR CHOICE IN UNITED STATES.docx
ByPREFERENCES FOR CAR CHOICE IN UNITED STATES.docx
 
ByPREFERENCES FOR CAR CHOICE IN UNITED STATES.docx
ByPREFERENCES FOR CAR CHOICE IN UNITED STATES.docxByPREFERENCES FOR CAR CHOICE IN UNITED STATES.docx
ByPREFERENCES FOR CAR CHOICE IN UNITED STATES.docx
 
Cb36469472
Cb36469472Cb36469472
Cb36469472
 
OPTIMAL GLOBAL THRESHOLD ESTIMATION USING STATISTICAL CHANGE-POINT DETECTION
OPTIMAL GLOBAL THRESHOLD ESTIMATION USING STATISTICAL CHANGE-POINT DETECTIONOPTIMAL GLOBAL THRESHOLD ESTIMATION USING STATISTICAL CHANGE-POINT DETECTION
OPTIMAL GLOBAL THRESHOLD ESTIMATION USING STATISTICAL CHANGE-POINT DETECTION
 
The Short-term Swap Rate Models in China
The Short-term Swap Rate Models in ChinaThe Short-term Swap Rate Models in China
The Short-term Swap Rate Models in China
 
APPLYING DYNAMIC MODEL FOR MULTIPLE MANOEUVRING TARGET TRACKING USING PARTICL...
APPLYING DYNAMIC MODEL FOR MULTIPLE MANOEUVRING TARGET TRACKING USING PARTICL...APPLYING DYNAMIC MODEL FOR MULTIPLE MANOEUVRING TARGET TRACKING USING PARTICL...
APPLYING DYNAMIC MODEL FOR MULTIPLE MANOEUVRING TARGET TRACKING USING PARTICL...
 
1834 1840
1834 18401834 1840
1834 1840
 
1834 1840
1834 18401834 1840
1834 1840
 
Image Processing for Automated Flaw Detection and CMYK model for Color Image ...
Image Processing for Automated Flaw Detection and CMYK model for Color Image ...Image Processing for Automated Flaw Detection and CMYK model for Color Image ...
Image Processing for Automated Flaw Detection and CMYK model for Color Image ...
 
A new analysis of failure modes and effects by fuzzy todim with using fuzzy t...
A new analysis of failure modes and effects by fuzzy todim with using fuzzy t...A new analysis of failure modes and effects by fuzzy todim with using fuzzy t...
A new analysis of failure modes and effects by fuzzy todim with using fuzzy t...
 
Fuzzy Regression Model for Knee Osteoarthritis Disease Diagnosis
Fuzzy Regression Model for Knee Osteoarthritis Disease DiagnosisFuzzy Regression Model for Knee Osteoarthritis Disease Diagnosis
Fuzzy Regression Model for Knee Osteoarthritis Disease Diagnosis
 

2014-mo444-final-project

  • 1. Using ML techniques to study city traffic accidents Henrique Nogueira, Luisa Cardoso Madeira, Paulo Renato de Faria∗ Anderson Rocha† 1. Introduction The dataset comprises approximately 297,853 traffic ac- cidents collected from 2000 until 2013 at Porto Alegre, a brazilian city with around 1,4 million people. 2. Activities There is several articles discussing how to accomplish Traffic Accident Analysis in different countries. In Japan, HASEGAWA et al. [1] applied Support Vector Machine (SVM) using Gaussian kernel to separate major and non- major accidents. In China, Fang et al. [2] used Bayesian networks and Logistic Regression to study the accidents in Jilin province during 2010. Finally, Olutayo and Eludire [3] studied Nigeria’s busiest roads during 2 years and compared decision trees (Id3), Radial Basis Function (RBF) and neu- ral networks using a Multilayer Perceptron, in their example ”Id3 tree algorithm performed better with higher accuracy rate”. 3. Proposed Solutions The proposed solution for this project is to compare two classifier methods to predict wounds in accidents on the city of Porto Alegre, considering data from 2000 to 2014. The algorithms that will be compared are: Decision Tree, Ran- dom Forest and Logistic Regression. A decision tree or a classification tree is a tree in which each internal (non-leaf) node is labeled with an input fea- ture. The arcs coming from a node labeled with a feature are labeled with each of the possible values of the feature. Each leaf of the tree is labeled with a class or a probability distribution over the classes. To classify an example, filter it down the tree, as follows. For each feature encountered in the tree, the arc corresponding to the value of the exam- ple for that feature is followed. When a leaf is reached, the classification corresponding to that leaf is returned. ∗Is with the Institute of Computing, University of Camp- inas (Unicamp). 
Contact: hrqnogueira@gmail.com, lu.madeira2@gmail.com, paulo.faria@gmail.com
†Is with the Institute of Computing, University of Campinas (Unicamp). Contact: anderson.rocha@ic.unicamp.br

A Random Forest works as a large collection of non-correlated decision trees. The idea of combining decision trees comes from the bagging technique, which allows us to decrease the variance of the learning algorithm by combining a set of them. To inspect the Random Forest results, we will use inTrees, as described in Deng [4].

The big advantage of these two classifiers is that they are white-box models. This means that, unlike Logistic Regression or SVM, we can look inside the model and understand how the decisions are being made. In this project we would like to predict whether or not wounded people will be involved in an accident, so the white-box logic may help us understand which features are the most important and how they impact the results. These two algorithms will be compared using ROC curves. We also used Logistic Regression to compare ROC performance and to analyse the contribution of each feature.

3.1. Classification Trees and Information Gain

To build the classification tree, one fundamental step is to find the root node (the attribute that best splits the data). One of the measures used is entropy (H), which measures the homogeneity of the examples, calculated as below:

    H(S) = Σ_{i=1}^{c} −p_i log2(p_i)    (1)

The split function used to select non-leaf nodes will be Information Gain, which measures the reduction in entropy, as follows:

    IG(S, A) = H(S) − Σ_{v ∈ values(A)} (|S_v| / |S|) · H(S_v)    (2)

where S_v is the subset of S for which attribute A has value v.

3.2. Quality measures

To assess the quality of the results, we will use the TPR (True Positive Rate) and the FPR (False Positive Rate), plotted as a ROC (Receiver Operating Characteristic) curve.
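Equations (1) and (2) can be sketched directly in code. The following is a minimal illustration (in Python, for readability); the feature name MOTO and the toy data are hypothetical, not taken from the real dataset:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = sum over classes of -p_i * log2(p_i), Eq. (1)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, feature):
    """IG(S, A) = H(S) - sum over values v of (|S_v|/|S|) * H(S_v), Eq. (2)."""
    n = len(labels)
    remainder = 0.0
    for v in set(x[feature] for x in examples):
        # S_v: labels of the examples where the feature takes value v
        subset = [y for x, y in zip(examples, labels) if x[feature] == v]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical toy data: does motorcycle involvement predict an injury?
X = [{"MOTO": 1}, {"MOTO": 1}, {"MOTO": 0}, {"MOTO": 0}]
y = [1, 1, 0, 1]
print(information_gain(X, y, "MOTO"))
```

A tree builder would evaluate this gain for every candidate attribute and pick the one with the largest value as the split node.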
4. Experiments and Discussion

4.0.1 Data splitting

The data was split into 3 partitions for each program under analysis, using the following proportions: 60% for training, 20% for validation, and 20% for testing. This was implemented in R as below:

splitdf <- function(dataframe, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  index <- 1:nrow(dataframe)
  # 60% for training
  trainindex <- sample(index, trunc(length(index) * 0.6))
  trainset <- dataframe[trainindex, ]
  otherset <- dataframe[-trainindex, ]
  otherIndex <- 1:nrow(otherset)
  # 20% for validation and 20% for testing
  validationIndex <- sample(otherIndex, trunc(length(otherIndex) / 2))
  validationset <- otherset[validationIndex, ]
  testset <- otherset[-validationIndex, ]
  list(trainset = trainset, validationset = validationset, testset = testset)
}

The next table summarizes the sample size of each data partition.

Dataset      Sample size
Training     180499
Test         60164
Validation   57190

Table 1. Data partition sample sizes

4.1. Results

4.1.1 Decision Tree

             Accident without victims   With victims
Train        0.046                      0.35
Validation   0.046                      0.34
Test         0.047                      0.36

Table 2. Decision Tree - Classification errors

Figure 1. Decision Tree for Hurt People.

Figure 2. ROC curve for Decision Tree predictions. Red - training, blue - test, green - validation set.

             Accident without victims   With victims
Train        0.052                      0.28
Validation   0.1                        0.16
Test         0.1                        0.17

Table 3. Random Forest - Classification errors

4.1.2 Random Forest

Support is defined as the proportion of transactions containing the condition.
Confidence is defined as the number of transactions containing both the condition and the outcome divided by the number of transactions containing the condition. Length is defined as the number of items contained in an association rule.

Length   Sup     Conf    Condition                                  Prediction
1        0.21    0.695   MOTO > 0.5                                 1
1        0.172   0.556   MOTO <= 0.5                                0
1        0.148   0.598   LOCAL is 'Logadouro'                       1
1        0.138   0.669   AUTO > 1.5                                 0
1        0.136   0.588   AUTO <= 1.5                                1
1        0.134   0.574   LOCAL is 'Cruzamento'                      0
1        0.124   0.892   BICICLETA > 0.5                            1
1        0.112   0.619   BICICLETA <= 0.5                           0
1        0.109   0.667   CAMINHAO <= 0.5                            1
1        0.103   0.889   TIPO ACID in ('ATROPELAMENTO', 'QUEDA')    1

Table 4. Random Forest - Rules derived from hurt people traffic analysis

Figure 3. Accidents causing hurt - RF important variables.

Figure 4. Accidents causing death - RF important variables.

4.1.3 Logistic Regression

Figure 7 shows the average residual and the average fitted (predicted) value for each bin, or category. Categories are based on the fitted values. 95% of all values should fall within the dotted lines (as happened in our example).

4.1.4 Other data aggregation relevant to the problem

5. Conclusions and Future Work

Type 1 errors (false positives, that is, predicting an accident with victims when none happened) were greater for Random Forest, while Type 2 errors (false negatives, predicting no victims when there in fact were) were greater for Decision Trees. In our study, Type 2 errors are the most important, because we would not like to miss predicting an accident that is likely to happen (playing safe and warning about the chance of an accident is preferable here). Considering this, from the ROC curves in Figures 2 and 5 it is possible to observe that Random Forest delivered better results (fewer Type 2 classification errors) than the simple Decision Tree.

From the analysis of Figure 3 it is possible to see that the type of the accident and accidents involving motorcycles are the two variables most relevant for indicating injury. Other relevant variables are accidents involving cars, the place where the accident happened, and the hour (though not as relevant as the first two).
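The support and confidence metrics reported in Table 4 can be computed with a short sketch like the one below (in Python, for illustration; the transaction list is made up, and the MOTO > 0.5 condition merely mirrors the first rule of Table 4):

```python
def support(transactions, condition):
    """Proportion of transactions satisfying the condition."""
    return sum(condition(t) for t in transactions) / len(transactions)

def confidence(transactions, condition, outcome):
    """Transactions containing both condition and outcome,
    divided by transactions containing the condition."""
    with_condition = [t for t in transactions if condition(t)]
    return sum(outcome(t) for t in with_condition) / len(with_condition)

# Hypothetical accident records: MOTO = motorcycles involved, HURT = injury outcome
transactions = [
    {"MOTO": 1, "HURT": 1},
    {"MOTO": 1, "HURT": 1},
    {"MOTO": 1, "HURT": 0},
    {"MOTO": 0, "HURT": 0},
]
cond = lambda t: t["MOTO"] > 0.5      # rule condition
out = lambda t: t["HURT"] == 1        # rule prediction
print(support(transactions, cond))
print(confidence(transactions, cond, out))
```

With this toy data, 3 of the 4 transactions contain the condition, and 2 of those 3 also contain the outcome, which is exactly what the two functions report.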
Figure 5. ROC curve for Random Forest predictions. Red - training, blue - test, green - validation set.

Figure 6. ROC curve for Logistic Regression predictions. Red - training, blue - test, green - validation set.

Figure 4 shows the variables which most impacted death accidents. Again motorcycles were the main cause, but in this case other variables appeared with great importance, such as day/night, car crash, the hour, the type of the accident, and whether trucks were involved.

Variable                        Estimate      Std Error
TIPO ACID ATROPELAMENTO         4.0449617     0.0569926
REGIAO NAO CADASTRADO           4.0373454     0.3960868
BICICLETA                       3.609395      0.0819853
TIPO ACID QUEDA                 2.794752      0.0881739
(Intercept)                     2.3929621     0.1181012
MOTO                            2.2216396     0.0291009
LOCAL Logradouro                1.4691486     0.019812
CAMINHAO                        -1.0833747    0.036022
TIPO ACID CAPOTAGEM             0.9057897     0.0855192
CARROCA                         0.8206036     0.1144767
LOTACAO                         -0.7509406    0.0667275
TAXI                            -0.675485     0.0385326
REGIAO SUL                      0.5651705     0.0278513
AUTO                            -0.5633394    0.0235469
REGIAO NAO IDENTIFICADO         0.5056619     0.3153992
REGIAO LESTE                    0.4006684     0.0263747
TEMPO NAO CADAST                0.3895447     0.1076033
ONIBUS INT                      -0.3546455    0.0541953
REGIAO NORTE                    0.2756333     0.0262838
OUTRO                           -0.2563662    0.0894421
FX HORA                         -0.0367186    0.0015004
ONIBUS URB                      0.0096466     0.0404007
MES                             0.0074157     0.0024001
DIA                             0.0009066     0.000918
TIPO ACID TOMBAMENTO            -0.0693669    0.1325849
TIPO ACID NAO CADASTRADO        -0.0826591    0.5226228
DIA SEM SABADO                  -0.2394208    0.0325041
TEMPO CHUVOSO                   -0.514578     0.0289924
TIPO ACID CHOQUE                -0.5259759    0.0285438
DIA SEM SEXTA-FEIRA             -0.6812563    0.0319216
DIA SEM QUINTA-FEIRA            -0.7043353    0.0327392
DIA SEM SEGUNDA-FEIRA           -0.7156692    0.0331751
DIA SEM QUARTA-FEIRA            -0.720596     0.0329271
TIPO ACID COLISAO               -0.7549218    0.0194193
DIA SEM TERCA-FEIRA             -0.7706036    0.033298
TEMPO NUBLADO                   -1.0541536    0.0334349
TIPO ACID EVENTUAL              -1.0786179    0.0674208
TIPO ACID INCENDIO              -1.8502813    0.6154434
NOITE DIA NOITE                 -2.9580033    0.0995565
NOITE DIA DIA                   -3.6283964    0.0991302

Table 5. Logistic Regression coefficients, sorted

Table 6 aggregates the top 10 places that concentrate the majority of the accidents with injuries, while Table 7 shows the same study for the death accidents. The streets, avenues, and roads are almost the same in both cases, indicating places where the government should pay more attention in order to reduce traffic accidents.

Going further into the types of accident causing injuries, Table 4 depicts the rules derived from the Random Forest. When the place is not a crossroad, the chance of injury is slightly higher. There are some trivial relationships, such as
accidents of the fall or run-over type causing injuries. However, some are not so trivial, such as accidents involving two cars not having a great chance of hurting people.

Figure 7. Binned Plot - Logistic Regression.

Place (Av/St)                      Nbr Accidents (w/ hurt)
AV PROTASIO ALVES                  2939
AV BENTO GONCALVES                 2838
AV ASSIS BRASIL                    2562
AV IPIRANGA                        2442
AV SERTORIO                        1708
AV PROF OSCAR PEREIRA              1380
AV FARRAPOS                        1264
AV BALTAZAR DE OLIVEIRA GARCIA     1117
ESTR JOAO DE OLIVEIRA REMIAO       1038
AV JUCA BATISTA                    984

Table 6. Top 10 hurt accidents (place concentration)

Place (Av/St)                      Nbr Accidents (w/ deaths)
AV BENTO GONCALVES                 103
AV ASSIS BRASIL                    92
AV PROTASIO ALVES                  69
AV IPIRANGA                        51
AV BALTAZAR DE OLIVEIRA GARCIA     49
ESTR JOAO DE OLIVEIRA REMIAO       49
AV FARRAPOS                        44
AV SERTORIO                        38
AV PROF OSCAR PEREIRA              37
AV JUCA BATISTA                    33

Table 7. Top 10 death accidents (place concentration)

The ROC curves for Logistic Regression and Random Forest were very similar on the test and validation sets. Logistic Regression also helped to elucidate the behaviour of other features, such as rollover.

In summary, it is possible to affirm that Random Forest and Logistic Regression had similar performances on this problem and both were better than the Decision Tree. From the most valuable variables of the Random Forest, it is clear that motorcycles and the type of the accident are crucial for deciding whether or not there are injured people. On the other hand, contrary to common thinking, features like "rainy weather" or "crossing" have low relevance in the decision. There have also been interesting observations, such as the place where accidents happen being "Logadouro" and one-car accidents with wounded people.

The model could be improved if there were additional variables with which to further analyse the accidents, such as driver characteristics, weather (rainy), traffic conditions (jams), or the accident's impact on the road.

References

[1] Hironobu Hasegawa, Masaru Fujii, Mikiharu Arimura, and Tohru Tamura.
A study on traffic accident analysis using support vector machines. 11th World Conference on Transport Research, page 9, 2007.

[2] Zong Fang, Xu Hongguo, and Zhang Huiyong. Prediction for traffic accident severity: Comparing the Bayesian network and regression models. Hindawi Publishing Corporation, 2013.

[3] V. A. Olutayo and A. A. Eludire. Traffic accident analysis using decision trees and neural networks. I.J. Information Technology and Computer Science, 02:22-28, 2014.

[4] H. Deng. Interpreting tree ensembles with inTrees. arXiv:1408.5456, 2014.