This document summarizes an analysis of 2017 traffic violation data. It describes the dataset and the preprocessing steps: extraction, classification, oversampling, and imputation. A descriptive analysis examines how seat belt usage, gender, and license status relate to accidents. The modeling techniques applied are decision trees, logistic regression, neural networks, and auto neural networks. Models using only the significant variables (description, time, violation type, and day) achieved accuracy similar to the models using all input variables. The analysis found that accidents are more common among men than women and that many drivers involved in accidents lacked proper licenses.
3. About the Dataset
• Traffic Violation Data 2017
Source: Data.gov
• Groups of Attributes
1. Location
2. Causes of Violation
3. Vehicle Information
4. Driver’s Information
5. Consequences of Violation
• Type of Attributes
4. Huge Data
Excess of categories in a column
Imbalanced categories in Target Variable
Missing Values
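The three data issues above (too many categories, an imbalanced target, missing values) can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the tooling used in the analysis; the function names, the `"Other"` bucket, the `accident` key, and the toy values are all assumptions made for the example.

```python
import random
from collections import Counter

def collapse_rare(values, min_count, other="Other"):
    """Replace categories seen fewer than min_count times with one bucket."""
    counts = Counter(v for v in values if v is not None)
    return [v if v is None or counts[v] >= min_count else other for v in values]

def impute_mode(values):
    """Fill missing entries (None) with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

def oversample(rows, target_key, seed=0):
    """Randomly duplicate minority-class rows until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[target_key], []).append(row)
    largest = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(largest - len(group)))
    return balanced

# Made-up toy values for illustration only; not taken from the dataset.
makes = ["FORD", "FORD", "TOYOTA", None, "FORD", "SAAB"]
cleaned = impute_mode(collapse_rare(makes, min_count=2))

rows = [{"accident": "No"}] * 4 + [{"accident": "Yes"}]
balanced = oversample(rows, "accident")
class_sizes = Counter(r["accident"] for r in balanced)
```

Oversampling is normally applied to the training partition only, so that duplicated rows do not leak into validation or test data.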
9. Descriptive Analysis
Seat Belt as a cause of Accident
Gender involved in an Accident
Relation between Commercial License and Accident
Day-wise distribution of Traffic Violation cases
Different types of Violations
Vehicle types involved in Traffic Violation
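Comparisons such as gender versus accident involvement come down to a cross-tabulation and a per-group rate. The sketch below shows the idea with the standard library; the field names `gender` and `accident` and the records themselves are invented for illustration and are not values from the dataset.

```python
from collections import Counter

# Made-up toy records for illustration only; not values from the dataset.
records = [
    {"gender": "M", "accident": "Yes"},
    {"gender": "M", "accident": "No"},
    {"gender": "F", "accident": "No"},
    {"gender": "M", "accident": "Yes"},
    {"gender": "F", "accident": "Yes"},
]

def crosstab(rows, row_key, col_key):
    """Count how often each (row_key, col_key) value pair occurs."""
    return Counter((r[row_key], r[col_key]) for r in rows)

def rate(rows, group_key, group_value, outcome_key, outcome_value):
    """Share of rows in one group that have the given outcome."""
    group = [r for r in rows if r[group_key] == group_value]
    hits = sum(1 for r in group if r[outcome_key] == outcome_value)
    return hits / len(group) if group else 0.0

table = crosstab(records, "gender", "accident")
male_rate = rate(records, "gender", "M", "accident", "Yes")
female_rate = rate(records, "gender", "F", "accident", "Yes")
```

Comparing rates rather than raw counts matters when one group appears far more often in the data than the other.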
14. Data Modeling using all input variables
Decision Tree
Simple and Easy to use
Suitable for binary target variable
Important Variables
Type of violation
Personal Injury
Property Damage
Charge
Description
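A decision tree ranks variables by how well a split on each one separates the target classes. The sketch below scores categorical splits with Gini impurity, which is one common criterion; the slides do not state which criterion was used, so treat that choice, the field names, and the toy rows as assumptions.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_impurity(rows, feature, target):
    """Weighted Gini impurity after splitting rows on each value of a feature."""
    groups = {}
    for r in rows:
        groups.setdefault(r[feature], []).append(r[target])
    n = len(rows)
    return sum(len(g) / n * gini(g) for g in groups.values())

def best_feature(rows, features, target):
    """Pick the feature whose split leaves the least impurity."""
    return min(features, key=lambda f: split_impurity(rows, f, target))

# Made-up toy rows for illustration only; not values from the dataset.
rows = [
    {"personal_injury": "Yes", "day": "Mon", "accident": "Yes"},
    {"personal_injury": "Yes", "day": "Tue", "accident": "Yes"},
    {"personal_injury": "No", "day": "Mon", "accident": "No"},
    {"personal_injury": "No", "day": "Tue", "accident": "No"},
]
chosen = best_feature(rows, ["personal_injury", "day"], "accident")
```

In this toy data `personal_injury` separates the classes perfectly while `day` does not separate them at all, so the former is chosen as the first split.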
15. Data Modeling using all input variables
Logistic Regression
Recommended for binary target variables
Uses Maximum Likelihood to estimate the model parameters
Important Variables
Day of Week
Hour of Day
Personal Injury
Property Damage
Violation Type
Description
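Maximum likelihood estimation for logistic regression means choosing the weights that maximize the log-likelihood of the observed binary outcomes. The sketch below does this with plain gradient ascent; the learning rate, step count, and toy data are assumptions for illustration, and statistical packages typically use faster solvers.

```python
import math

def sigmoid(z):
    """Logistic function mapping a score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(weights, X, y):
    """Log-likelihood of binary outcomes y under the logistic model."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(w * v for w, v in zip(weights, xi)))
        p = min(max(p, 1e-12), 1 - 1e-12)  # guard against log(0)
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total

def fit_logistic(X, y, lr=0.5, steps=2000):
    """Estimate weights by gradient ascent on the mean log-likelihood."""
    weights = [0.0] * len(X[0])
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for xi, yi in zip(X, y):
            err = yi - sigmoid(sum(w * v for w, v in zip(weights, xi)))
            for j, v in enumerate(xi):
                grad[j] += err * v
        weights = [w + lr * g / len(X) for w, g in zip(weights, grad)]
    return weights

# Made-up toy data: first column is the intercept, second is a 0/1 indicator.
X = [[1, 0], [1, 0], [1, 0], [1, 1], [1, 1], [1, 1]]
y = [0, 0, 1, 1, 1, 0]
weights = fit_logistic(X, y)
p_when_0 = sigmoid(weights[0])               # fitted probability when indicator = 0
p_when_1 = sigmoid(weights[0] + weights[1])  # fitted probability when indicator = 1
```

With a single binary predictor the fitted probabilities converge to the observed outcome share in each group, which is a handy sanity check on the estimator.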
16. Data Modeling using all input variables
Neural Network
• It is a supervised machine learning algorithm
• Data partitioned into
Train – 70%
Validation – 15%
Test – 15%
Train Validation Test
0.0394 0.5042 0.4938
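The 70/15/15 partition above can be reproduced with a shuffled split. This is a minimal sketch, assuming a simple random split with a fixed seed; the slides do not say whether the original partition was stratified, and the function name `partition` is hypothetical.

```python
import random

def partition(rows, train=0.70, validation=0.15, seed=0):
    """Shuffle rows and split them into train, validation, and test parts.

    The test part receives whatever remains after train and validation.
    """
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],
    )

# Placeholder row ids standing in for dataset records.
train_rows, valid_rows, test_rows = partition(range(100))
```

The model is fitted on the training part, tuned on the validation part, and scored once on the test part.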
17. Data Modeling using all input variables
Auto Neural Network
• More flexible than a Neural Network
• We can specify the number of hidden units
Misclassification rate by number of hidden units:
Hidden Units   Train   Validate   Test
1              0.0     0.50       0.48
2              0.49    0.5        0.5021
3              0.46    0.57       0.56
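The misclassification rate is the share of cases where the predicted class differs from the actual class. The sketch below computes it and then picks a hidden-unit count by lowest validation rate, breaking ties in favor of fewer units. That selection rule is an assumption for illustration, since the slides do not say how a configuration was chosen; the rates in `results` are copied from the table above.

```python
def misclassification_rate(actual, predicted):
    """Fraction of predictions that do not match the actual labels."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)

# Rates copied from the table above: hidden units -> (train, validate, test).
results = {
    1: (0.0, 0.50, 0.48),
    2: (0.49, 0.5, 0.5021),
    3: (0.46, 0.57, 0.56),
}

def pick_by_validation(results):
    """Choose the unit count with the lowest validation rate; ties go to fewer units."""
    return min(results, key=lambda units: (results[units][1], units))

example_rate = misclassification_rate([1, 0, 1, 1], [1, 1, 1, 0])
chosen_units = pick_by_validation(results)
```

Selecting on the validation rate rather than the training rate avoids rewarding a model that only memorizes the training data.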
23. Results & Implications
1. Accidents can happen on any day
2. Men are involved in accidents more often than women
3. The majority of drivers involved in accidents did not hold an official driving license
4. The findings can help the driving license department show people the importance of proper training and seat belts