Developed a decision tree classification model using SAS Enterprise Miner to predict the seriousness of an accident case (i.e., whether an accident is fatal or results in injury) based on predictors such as rush hour or non-rush hour, work zone, weather conditions, speed limits, and interstate status. This helps prioritize situations and allocate resources in scenarios where there is a high possibility of an accident resulting in fatalities or serious injury.
3. WHY THIS PROJECT
• “Every 12 minutes someone dies in a car crash in the United States due to a car accident or a collision
between two motor vehicles.” (-NCIPC)
• Accidents are often fatal or involve serious injuries, and by the time help arrives at the crash
site, much of the harm has already been done.
• We attempt to build a model that can predict the seriousness of an accident case (i.e., whether an accident
is fatal or results in injury) based on predictors such as rush hour or non-rush hour, work zone, weather
conditions, speed limits, and interstate status.
• This helps prioritize situations and allocate resources in scenarios where there is a high possibility of an
accident resulting in fatalities or serious injury.
• This will enable emergency care providers to focus on the measures and resources they need
when they arrive at the scene. The accuracy of pre-hospital crash scene details and crash victim assessment
has important implications for the care that can be provided at the crash scene.
4. WHAT ARE WE CONSIDERING
• We will be looking at the characteristics of the environment in which the accident
occurred (weather, road condition, type of road, time of day, the day of the week, and
month of the year) and the characteristics of the crash (direction of accident, speed
limit on the road, work zone area, and how many vehicles were involved).
• All of these variables can affect what kind of accident occurs (no injury,
injury, or fatal). Knowing this can help the medical team arrive prepared for the
actions that need to be taken at the scene.
6. CLEAR DESCRIPTION OF DATA SET
Sl. No Variables Description
1 HOUR_I_R 1=rush hour, 0=not (rush = 6-9 am, 4-7 pm)
2 ALIGN_I 1 = straight, 2 = curve
3 STRATUM_R 1=NASS crash involving at least one passenger vehicle towed from the crash scene due to damage, with no medium or heavy trucks involved; 0=not
4 WRK_ZONE 1= yes, 0= no
5 WKDY_I_R 1=weekday, 0=weekend
6 INT_HWY Interstate? 1=yes, 0=no
7 LGTCON_I_R Light conditions: 1=day, 2=dark (including dawn/dusk), 3=dark but lighted, 4=dawn or dusk
8 MAN_COL_I 0=no collision, 1=head-on, 2=other form of collision
9 PED_ACC_R 1=pedestrian/cyclist involved, 0=not
10 REL_JCT_I_R 1=accident at intersection/interchange, 0=not at intersection
7. CLEAR DESCRIPTION OF DATA SET
Sl. No Variables Description
11 SPD_LIM Speed limit, miles per hour
12 SUR_CON Surface conditions (1=dry, 2=wet, 3=snow/slush, 4=ice, 5=sand/dirt/oil, 8=other, 9=unknown)
13 TRAF_WAY 1=two-way traffic, 2=divided hwy, 3=one-way road
14 VEH_INVL Number of vehicles involved
15 WEATHER_R 1=no adverse conditions, 2=rain, snow or other adverse condition
16 INJURY_CRASH 1=yes, 0= no
17 NO_INJ_I Number of injuries
18 FATALITIES 1= yes, 0= no
19 MAX_SEV_IR 0=no injury, 1=non-fatal inj., 2=fatal inj.
8. FILTERING DATA
• The filtering method used is "Standard Deviations from the Mean".
• This eliminates observations that lie more than three standard deviations
from the mean of a variable.
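The three-standard-deviation rule above can be illustrated with a minimal Python sketch. This is not the SAS Enterprise Miner Filter node itself: the helper name `filter_outliers`, the use of the sample standard deviation, and the example records are assumptions made for illustration only.

```python
from statistics import mean, stdev

def filter_outliers(rows, column, k=3.0):
    """Keep rows whose value in `column` is within k standard deviations of the column mean."""
    values = [row[column] for row in rows]
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (an assumption)
    if sigma == 0:
        return list(rows)  # every value is identical, so nothing is an outlier
    return [row for row in rows if abs(row[column] - mu) <= k * sigma]

# Hypothetical records: 20 two-vehicle crashes plus one extreme vehicle count
crashes = [{"VEH_INVL": 2} for _ in range(20)] + [{"VEH_INVL": 40}]
kept = filter_outliers(crashes, "VEH_INVL")
print(len(crashes), len(kept))
```

In this made-up sample, the single 40-vehicle record lies more than three standard deviations from the mean and is dropped, while the typical records are kept.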
9. DATA PARTITIONING
• We build the model with Training Data
• Test its correctness with Test Data
• Validate it with Validation Data
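A three-way partition like the one listed above might be sketched as follows. The 60/20/20 proportions, the seed, and the function name `partition` are assumptions for illustration; the slides do not state the split that was actually used.

```python
import random

def partition(rows, train=0.6, validation=0.2, seed=42):
    """Shuffle the rows and split them into training, validation, and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = int(len(shuffled) * train)
    n_valid = int(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],  # whatever remains becomes the test set
    )

# Row indices stand in for the 9999 accident records
records = list(range(9999))
train_set, valid_set, test_set = partition(records)
print(len(train_set), len(valid_set), len(test_set))
```

Each record lands in exactly one partition, so no observation used to build the model is reused to evaluate it.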
10. PREDICT, CLASSIFY OR CLUSTER ?
As we are trying to predict the categorical class label MAX_SEV_IR, our analysis is
supervised classification.
Our model intends to discover relationships between the attributes that would make it
possible to predict the outcome variable.
11. MODEL
The following three models are used for our analysis
• Memory-Based Reasoning (MBR)
• Decision Trees
• Logistic Regression
14. BASELINE MISCLASSIFICATION
• MAX_SEV_IR: 0=no injury, 1=non-fatal injury, 2=fatal injury
• Class 0 (No injury): 4949
• Class 1 (Non-fatal injury): 4900
• Class 2 (Fatal injury): 150
• The majority class is 0 (No injury)
• The percentage of majority class in the dataset is: 49.49 % (4949/9999)
• The baseline misclassification rate: 50.51 %
• This is the baseline: a model we build is only useful if its
misclassification rate is lower than the baseline misclassification rate.
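The baseline arithmetic above can be reproduced in a few lines of Python. The class counts are taken from this slide; the variable names are illustrative.

```python
# Class counts for MAX_SEV_IR, as listed on this slide
counts = {0: 4949, 1: 4900, 2: 150}

total = sum(counts.values())                  # 9999 records
majority_class = max(counts, key=counts.get)  # class with the most records
majority_share = counts[majority_class] / total

# Always predicting the majority class is wrong for every other record
baseline_misclassification = 1 - majority_share

print(f"{majority_share:.2%}", f"{baseline_misclassification:.2%}")
# prints: 49.49% 50.51%
```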
15. OUR DEFINITION OF BEST MODEL AS PER BUSINESS REQUIREMENT
• Decision Tree: a supervised, data-driven learning method for classification
• It separates observations into more homogeneous subgroups by creating splits
on predictors.
• As per our business requirement, this model is best at classifying an accident into one of
three cases in order to prioritize resources.
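To make "more homogeneous subgroups" concrete, the sketch below scores a candidate split with Gini impurity. Gini is one common splitting criterion and is an assumption here; the slides do not say which criterion was configured in SAS Enterprise Miner. The parent counts come from the baseline slide, while the left and right child counts are hypothetical.

```python
def gini(counts):
    """Gini impurity of a node given its class counts (0 means perfectly pure)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

def split_impurity(left, right):
    """Size-weighted average impurity of the two child nodes of a split."""
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    return (n_left / n) * gini(left) + (n_right / n) * gini(right)

# Counts are ordered as [no injury, non-fatal injury, fatal injury]
parent = [4949, 4900, 150]   # full data set, from the baseline slide
left = [4000, 1500, 20]      # hypothetical child node
right = [949, 3400, 130]     # hypothetical child node

print(round(gini(parent), 4), round(split_impurity(left, right), 4))
```

A split is worth making when the weighted impurity of the children is lower than the impurity of the parent, which is the case for these hypothetical counts.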
19. INTERPRETATION AND IMPLEMENTATION
• Based on these rules, an application or website can be created which, given
the 5 most important factors (predictors), estimates the percentage
chance of an accident resulting in fatality, injury, or no injury.
• The emergency service provider can then decide how to respond and send the
response team to the accident site accordingly.
21. OUTCOME
• The model predicts the outcome according to the node rule that applies
• Red Cross predicts an 80% chance of injury
• Red Cross predicts a 10% chance of fatality
• Red Cross predicts a 10% chance of no injury
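One way percentages like these can arise from a decision tree is from the class proportions of the training records in the leaf that a new case falls into. The sketch below assumes that mechanism; the leaf counts and the helper name `leaf_probabilities` are hypothetical and chosen only to mirror the 80/10/10 example.

```python
LABELS = {0: "No injury", 1: "Injury", 2: "Fatality"}

def leaf_probabilities(class_counts):
    """Convert the class counts in a leaf node into percentage chances per outcome."""
    total = sum(class_counts.values())
    return {LABELS[k]: 100.0 * v / total for k, v in class_counts.items()}

# Hypothetical training records that reached one leaf of the tree
leaf_counts = {0: 10, 1: 80, 2: 10}

chances = leaf_probabilities(leaf_counts)
predicted = max(chances, key=chances.get)  # most likely outcome for this leaf

for outcome, pct in chances.items():
    print(f"{outcome}: {pct:.0f}%")
print("Predicted outcome:", predicted)
```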
22. SCOPE FOR IMPROVEMENT
• To build a more focused and rigorous model, we are working on identifying more predictors that
can help determine the status of an accident, and on a cleaner model with a lower misclassification rate.
• To achieve this, we intend to try a neural network data mining algorithm.