The document discusses using machine learning techniques to analyze traffic accident data from Porto Alegre, Brazil, from 2000 to 2013. It compares decision trees, random forests, and logistic regression for predicting whether accidents resulted in injuries. Random forests and logistic regression performed similarly, and both better than decision trees. Motorcycle involvement and accident type were highly predictive of injuries, while factors like weather had low relevance. The models could be improved with additional data on drivers, weather, and traffic conditions.
2014-mo444-final-project
Using ML techniques to study city traffic accidents
Henrique Nogueira, Luisa Cardoso Madeira, Paulo Renato de Faria∗
Anderson Rocha†
1. Introduction
The dataset comprises 297,853 traffic accidents recorded from 2000 to 2013 in Porto Alegre, a Brazilian city of around 1.4 million people.
2. Activities
There are several articles discussing traffic accident analysis in different countries. In Japan, Hasegawa et al. [1] applied a Support Vector Machine (SVM) with a Gaussian kernel to separate major from non-major accidents. In China, Fang et al. [2] used Bayesian networks and logistic regression to study the accidents in Jilin province during 2010. Finally, Olutayo and Eludire [3] studied Nigeria's busiest roads over two years and compared decision trees (ID3), Radial Basis Function (RBF) networks and a Multilayer Perceptron neural network; in their experiments the "Id3 tree algorithm performed better with higher accuracy rate".
3. Proposed Solutions
The proposed solution for this project is to compare classification methods for predicting injuries in accidents in the city of Porto Alegre, considering data from 2000 to 2013. The algorithms compared are Decision Tree, Random Forest and Logistic Regression.
A decision tree, or classification tree, is a tree in which each internal (non-leaf) node is labeled with an input feature. The arcs leaving a node labeled with a feature are labeled with each of the possible values of that feature, and each leaf is labeled with a class or a probability distribution over the classes. To classify an example, filter it down the tree: at each feature node encountered, follow the arc corresponding to the example's value for that feature; when a leaf is reached, return the classification corresponding to that leaf.
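The filtering procedure just described can be sketched in a few lines of Python (the node layout and feature names are hypothetical, purely for illustration):

```python
# Minimal sketch of classifying an example by filtering it down a tree.
# Internal nodes name a feature and map each value to a child; leaves
# carry a class label. The tree below is a toy, not the accident model.

def classify(node, example):
    """Follow the arc matching the example's value until a leaf is reached."""
    while "label" not in node:                 # internal node
        value = example[node["feature"]]       # feature tested at this node
        node = node["children"][value]         # follow the matching arc
    return node["label"]                       # leaf: return its class

tree = {
    "feature": "moto",
    "children": {
        True: {"label": 1},                    # motorcycle involved -> injury
        False: {"feature": "crossing",
                "children": {True: {"label": 0}, False: {"label": 1}}},
    },
}

print(classify(tree, {"moto": True, "crossing": False}))   # -> 1
print(classify(tree, {"moto": False, "crossing": True}))   # -> 0
```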
∗ With the Institute of Computing, University of Campinas (Unicamp). Contact: hrqnogueira@gmail.com, lu.madeira2@gmail.com, paulo.faria@gmail.com
† With the Institute of Computing, University of Campinas (Unicamp). Contact: anderson.rocha@ic.unicamp.br
Random Forest works as a large collection of non-correlated decision trees. The idea of combining decision trees comes from the bagging technique, which decreases the variance of the learning algorithm by combining a set of them. To inspect the Random Forest results, we will use inTrees as described in Deng [4].
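The bagging idea can be illustrated with a minimal Python sketch (this is not the authors' R/inTrees pipeline; the stand-in learner and toy data are assumptions for illustration):

```python
import random
from collections import Counter

# Sketch of the bagging idea behind Random Forest: each learner is
# trained on a bootstrap sample (drawn with replacement) and the
# ensemble predicts by majority vote, which reduces variance.
# A real forest fits a decision tree per sample; here a trivial
# majority-class "stump" stands in for the tree to keep the sketch short.

def bootstrap(data, rng):
    return [rng.choice(data) for _ in data]    # sample with replacement

def train_stump(sample):
    # Stand-in learner: always predicts the majority class of its sample.
    majority = Counter(label for _, label in sample).most_common(1)[0][0]
    return lambda x: majority

def forest_predict(learners, x):
    votes = Counter(learner(x) for learner in learners)
    return votes.most_common(1)[0][0]          # majority vote over the ensemble

rng = random.Random(0)
data = [(i, 1 if i % 3 == 0 else 0) for i in range(30)]   # toy labeled data
forest = [train_stump(bootstrap(data, rng)) for _ in range(25)]
print(forest_predict(forest, 5))               # the ensemble's majority vote
```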
A big advantage of these two classifiers is that they are white-box models: unlike Logistic Regression or SVM, we can look inside the model and understand how the decisions are being made. In this project we would like to predict whether or not there will be injured people involved in an accident, so the white-box logic may help us understand which features are the most important and how they impact the results. The two algorithms will be compared using ROC curves.

We also used Logistic Regression to compare ROC performance and analyse each feature's contribution.
3.1. Classification Trees and Information Gain
To build the classification tree, one fundamental concept is finding the root node, i.e. the attribute that best splits the data. One of the measures used is Entropy (H), which measures the homogeneity of the examples:

H(S) = \sum_{i=1}^{c} -p_i \log_2 p_i    (1)
The split function used to choose non-leaf nodes is Information Gain, which measures the reduction in Entropy:

IG(S, A) = H(S) - \sum_{v \in values(A)} \frac{|S_v|}{|S|} H(S_v)    (2)

where S_v is the subset of S for which attribute A has value v.
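As a sketch, Equations (1) and (2) can be computed directly in Python (the project's experiments were run in R; the field names below are illustrative, not the actual dataset columns):

```python
from collections import Counter
from math import log2

# Entropy H(S) of a list of class labels, Eq. (1).
def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# Information gain IG(S, A), Eq. (2): the entropy reduction obtained by
# partitioning the examples according to their value on attribute A.
def information_gain(examples, labels, attribute):
    n = len(labels)
    partitions = {}
    for ex, y in zip(examples, labels):
        partitions.setdefault(ex[attribute], []).append(y)
    remainder = sum(len(sv) / n * entropy(sv) for sv in partitions.values())
    return entropy(labels) - remainder

# Toy example: 'moto' perfectly separates the labels, so IG equals H(S) = 1.
examples = [{"moto": 1}, {"moto": 1}, {"moto": 0}, {"moto": 0}]
labels = [1, 1, 0, 0]
print(entropy(labels))                              # -> 1.0
print(information_gain(examples, labels, "moto"))   # -> 1.0
```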
3.2. Quality measures
To access the quality of the results it will be used TPR
(True Positive Rate) and FPR (False Positive Rate) as plot-
ted in based on ROC (Receiver Operating Characteristic)
curve.
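TPR and FPR at a given score threshold, and the ROC curve obtained by sweeping that threshold over the classifier's scores, can be sketched as follows (a generic illustration, independent of the models in this paper):

```python
# TPR = TP / (TP + FN), FPR = FP / (FP + TN) at one score threshold;
# sweeping the threshold over all observed scores traces the ROC curve.

def tpr_fpr(y_true, scores, threshold):
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
    return tp / (tp + fn), fp / (fp + tn)

def roc_curve(y_true, scores):
    thresholds = sorted(set(scores), reverse=True)  # one point per threshold
    return [tpr_fpr(y_true, scores, t) for t in thresholds]

y = [1, 1, 0, 0]
s = [0.9, 0.6, 0.4, 0.1]
print(roc_curve(y, s))  # -> [(0.5, 0.0), (1.0, 0.0), (1.0, 0.5), (1.0, 1.0)]
```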
4. Experiments and Discussion
4.0.1 Data splitting

The data was split into three partitions using the following proportions: 60% for training, 20% for validation and 20% for testing. This was implemented in R as below:
splitdf <- function(dataframe, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  index <- 1:nrow(dataframe)
  # 60% for training
  trainindex <- sample(index, trunc(length(index) * 0.6))
  trainset <- dataframe[trainindex, ]
  otherset <- dataframe[-trainindex, ]
  otherIndex <- 1:nrow(otherset)
  # 20% for validation and 20% for testing
  validationIndex <- sample(otherIndex, trunc(length(otherIndex) / 2))
  validationset <- otherset[validationIndex, ]
  testset <- otherset[-validationIndex, ]
  list(trainset = trainset,
       validationset = validationset,
       testset = testset)
}
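For readers working outside R, the same 60/20/20 split can be sketched in Python (an illustrative equivalent of the splitdf function above, not part of the original project code):

```python
import random

# 60% training, 20% validation, 20% testing: shuffle the row indices
# once, then slice them into the three partitions.
def split_df(rows, seed=None):
    rng = random.Random(seed)
    index = list(range(len(rows)))
    rng.shuffle(index)
    n_train = int(len(rows) * 0.6)
    n_valid = (len(rows) - n_train) // 2
    train = [rows[i] for i in index[:n_train]]
    valid = [rows[i] for i in index[n_train:n_train + n_valid]]
    test = [rows[i] for i in index[n_train + n_valid:]]
    return train, valid, test

train, valid, test = split_df(list(range(100)), seed=42)
print(len(train), len(valid), len(test))  # -> 60 20 20
```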
Table 1 summarizes the sample size of each data partition.

Dataset      Sample size
Training     180499
Test         60164
Validation   57190

Table 1. Sample size of each data partition.
4.1. Results

4.1.1 Decision Tree

             Accident without victims   With victims
Train        0.046                      0.35
Validation   0.046                      0.34
Test         0.047                      0.36

Table 2. Decision Tree - Classification error

Figure 1. Decision Tree for Hurt People.

Figure 2. ROC curve for Decision Tree predictions. Red - training, blue - test, green - validation set.
             Accident without victims   With victims
Train        0.052                      0.28
Validation   0.1                        0.16
Test         0.1                        0.17

Table 3. Random Forest - Classification errors
4.1.2 Random Forest
Support is defined as the proportion of transactions containing the condition. Confidence is defined as the number of transactions containing both the condition and the outcome divided by the number of transactions containing the condition. Length is defined as the number of items contained in an association rule.

Length   Sup     Conf    Condition                                  Prediction
1        0.21    0.695   MOTO > 0.5                                 1
1        0.172   0.556   MOTO <= 0.5                                0
1        0.148   0.598   LOCAL is 'Logadouro'                       1
1        0.138   0.669   AUTO > 1.5                                 0
1        0.136   0.588   AUTO <= 1.5                                1
1        0.134   0.574   LOCAL is 'Cruzamento'                      0
1        0.124   0.892   BICICLETA > 0.5                            1
1        0.112   0.619   BICICLETA <= 0.5                           0
1        0.109   0.667   CAMINHAO <= 0.5                            1
1        0.103   0.889   TIPO ACID in ('ATROPELAMENTO', 'QUEDA')    1

Table 4. Random Forest - Rules derived from hurt people traffic analysis

Figure 3. Accidents causing hurt - RF important variables.
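The support and confidence measures reported in Table 4 can be computed as below (a generic sketch over hypothetical accident records, not the actual inTrees output):

```python
# Support: fraction of transactions satisfying the rule's condition.
# Confidence: fraction of condition-satisfying transactions that also
# satisfy the predicted outcome.

def support(transactions, condition):
    return sum(condition(t) for t in transactions) / len(transactions)

def confidence(transactions, condition, outcome):
    matched = [t for t in transactions if condition(t)]
    return sum(outcome(t) for t in matched) / len(matched)

# Hypothetical records: (number of motorcycles involved, injured?)
rows = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 1), (0, 0), (2, 1), (0, 0)]
cond = lambda t: t[0] > 0.5          # rule condition, e.g. MOTO > 0.5
out = lambda t: t[1] == 1            # predicted outcome: injuries

print(support(rows, cond))           # -> 0.5
print(confidence(rows, cond, out))   # -> 0.75
```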
4.1.3 Logistic Regression

Figure 7 shows the average residual and the average fitted (predicted) value for each bin, or category; the categories are based on the fitted values. 95% of all values should fall within the dotted lines (as happened in our example).
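The binned-residual idea behind Figure 7 can be sketched as follows (synthetic fitted values and residuals, purely illustrative):

```python
# Binned residuals: sort observations by fitted value, cut them into
# equal-size bins, and average fitted values and residuals per bin.

def binned_residuals(fitted, residuals, n_bins):
    pairs = sorted(zip(fitted, residuals))        # order by fitted value
    size = len(pairs) // n_bins
    bins = []
    for b in range(n_bins):
        chunk = pairs[b * size:(b + 1) * size]
        avg_fit = sum(f for f, _ in chunk) / len(chunk)
        avg_res = sum(r for _, r in chunk) / len(chunk)
        bins.append((avg_fit, avg_res))
    return bins

fitted = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
resid = [0.0, 0.1, -0.1, 0.0, 0.1, -0.1, 0.0, 0.0]
print(binned_residuals(fitted, resid, 2))  # two (avg fitted, avg residual) pairs
```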
4.1.4 Other data aggregation relevant to the problem
5. Conclusions and Future Work

Type 1 errors (false positives, i.e. an accident was predicted when none happened) were greater for Random Forest, while Type 2 errors (false negatives, i.e. no accident was predicted when one in fact happened) were greater for Decision Trees. In our study, Type 2 errors are the most important, because we would not like to miss predicting an accident that is likely to happen (playing safe and warning about the chance of an accident is better in this study). Considering this, the ROC curves in Figures 2 and 5 show that Random Forest delivered better results (fewer Type 2 classification errors) than simple Decision Trees.

Figure 4. Accidents causing death - RF important variables.

From the analysis of Figure 3, the type of the accident and accidents involving motorcycles are the two variables most relevant for indicating injuries. Other relevant variables are accidents involving cars, the place where the accident happened and the hour (though not as relevant as the first two).
Figure 5. ROC curve for Random Forest predictions. Red - training, blue - test, green - validation set.

Figure 6. ROC Logistic Regression. Red - training, blue - test, green - validation set.
Figure 4 shows the variables which most impacted fatal accidents. Again motorcycles were the main cause, but in this case other variables appeared with great importance, such as day/night, car crash, the hour, the type of the accident and whether trucks were involved.
Variable                      Estimate     Std. Error
TIPO ACID ATROPELAMENTO 4.0449617 0.0569926
REGIAO NAO CADASTRADO 4.0373454 0.3960868
BICICLETA 3.609395 0.0819853
TIPO ACID QUEDA 2.794752 0.0881739
(Intercept) 2.3929621 0.1181012
MOTO 2.2216396 0.0291009
LOCAL Logradouro 1.4691486 0.019812
CAMINHAO -1.0833747 0.036022
TIPO ACID CAPOTAGEM 0.9057897 0.0855192
CARROCA 0.8206036 0.1144767
LOTACAO -0.7509406 0.0667275
TAXI -0.675485 0.0385326
REGIAO SUL 0.5651705 0.0278513
AUTO -0.5633394 0.0235469
REGIAO NAO IDENTIFICADO 0.5056619 0.3153992
REGIAO LESTE 0.4006684 0.0263747
TEMPO NAO CADAST 0.3895447 0.1076033
ONIBUS INT -0.3546455 0.0541953
REGIAO NORTE 0.2756333 0.0262838
OUTRO -0.2563662 0.0894421
FX HORA -0.0367186 0.0015004
ONIBUS URB 0.0096466 0.0404007
MES 0.0074157 0.0024001
DIA 0.0009066 0.000918
TIPO ACID TOMBAMENTO -0.0693669 0.1325849
TIPO ACID NAO CADASTRADO -0.0826591 0.5226228
DIA SEM SABADO -0.2394208 0.0325041
TEMPO CHUVOSO -0.514578 0.0289924
TIPO ACID CHOQUE -0.5259759 0.0285438
DIA SEM SEXTA-FEIRA -0.6812563 0.0319216
DIA SEM QUINTA-FEIRA -0.7043353 0.0327392
DIA SEM SEGUNDA-FEIRA -0.7156692 0.0331751
DIA SEM QUARTA-FEIRA -0.720596 0.0329271
TIPO ACID COLISAO -0.7549218 0.0194193
DIA SEM TERCA-FEIRA -0.7706036 0.033298
TEMPO NUBLADO -1.0541536 0.0334349
TIPO ACID EVENTUAL -1.0786179 0.0674208
TIPO ACID INCENDIO -1.8502813 0.6154434
NOITE DIA NOITE -2.9580033 0.0995565
NOITE DIA DIA -3.6283964 0.0991302
Table 5. Logistic Regression coefficients (sorted).
Table 6 aggregates the top 10 places that concentrate the majority of the injury accidents, while Table 7 shows the same study for fatal accidents. The streets, avenues and roads are almost the same in both cases, indicating places where the government should pay more attention in order to reduce traffic accidents.
Going further into the type of accident causing injuries, Table 4 depicts the rules derived from the Random Forest. When the place is not a crossroad, the chance of injury is a little higher. There are some trivial relationships, such as accidents of the falldown or run-down type causing injuries. However, some are not so trivial: for instance, accidents involving two cars do not have a great chance of hurting people.

Figure 7. Binned Plot - Logistic Regression.

Place (Av/St)                     Nbr accidents (w/ hurt)
AV PROTASIO ALVES                 2939
AV BENTO GONCALVES                2838
AV ASSIS BRASIL                   2562
AV IPIRANGA                       2442
AV SERTORIO                       1708
AV PROF OSCAR PEREIRA             1380
AV FARRAPOS                       1264
AV BALTAZAR DE OLIVEIRA GARCIA    1117
ESTR JOAO DE OLIVEIRA REMIAO      1038
AV JUCA BATISTA                   984

Table 6. Top 10 hurt accidents (place concentration)

Place (Av/St)                     Nbr accidents (w/ deaths)
AV BENTO GONCALVES                103
AV ASSIS BRASIL                   92
AV PROTASIO ALVES                 69
AV IPIRANGA                       51
AV BALTAZAR DE OLIVEIRA GARCIA    49
ESTR JOAO DE OLIVEIRA REMIAO      49
AV FARRAPOS                       44
AV SERTORIO                       38
AV PROF OSCAR PEREIRA             37
AV JUCA BATISTA                   33

Table 7. Top 10 death accidents (place concentration)
The ROC curves for Logistic Regression and Random Forest were very similar on the test and validation sets. Logistic Regression also helped to elucidate the behaviour of other features, such as rollover.

In summary, it is possible to affirm that Random Forest and Logistic Regression had similar performances on this problem, and both were better than the Decision Tree. From the most valuable variables of the Random Forest, it is clear that motorcycles and the type of the accident are crucial for deciding whether there are injured people or not. On the other hand, contrary to common thinking, features like rainy weather or crossings have low relevance for the decision. There were also interesting observations, such as the place where accidents happen being 'Logadouro' and one-car accidents having wounded people.
The model could be improved if there were additional variables with which to further analyse the accidents, such as driver characteristics, weather (rain), traffic conditions (jams), or the accident's impact on the road.
References

[1] Hironobu Hasegawa, Masaru Fujii, Mikiharu Arimura, and Tohru Tamura. A study on traffic accident analysis using support vector machines. 11th World Conference on Transport Research, page 9, 2007.
[2] Zong Fang, Xu Hongguo, and Zhang Huiyong. Prediction for traffic accident severity: Comparing the Bayesian network and regression models. Hindawi Publishing Corporation, 2013:9, 2013.
[3] V. A. Olutayo and A. A. Eludire. Traffic accident analysis using decision trees and neural networks. I.J. Information Technology and Computer Science, 02:22-28, 2014.
[4] H. Deng. Interpreting tree ensembles with inTrees. arXiv, 1408.5456:1-18, 2014.