A project to create at least two predictive Machine Learning models to analyze a business situation.
Description of Business Situation - The hiring managers of Pas de Poisson sought the guidance of a consulting firm to determine which nationality among the foreign workforce entering Canada would have the highest probability of a judge approving their appeal to remain, and thus of being employable in the country.
Establishing a model to best determine which candidates to hire provided exceptional cost-saving opportunities. In the past, if the company was informed that one of its new foreign national workers had not been granted an appeal while that worker was on a fishing deployment, which at times lasted over 45 days, the trawler was forced to return to port. A vessel having to return meant missed revenue opportunities, as it could no longer fish, and unexpected fuel expenses for the return to homeport. Furthermore, the penalties for knowingly employing an illegal foreign worker were harsh from both the Canadian and U.S. fisheries enforcement agencies.
Deliverables -
A description of the business problem we are addressing
How and where we obtained the data, and the steps we went through to ensure that it was "clean"
A summary of modeling steps, with reference to the predictive models in the project file
Assessment of the accuracy of models, with reference to project file results
Our interpretation of the results of our analysis
What we learned, and how it might inform the business situation that we chose to analyze
Source: Rattle Library
Name: “Green: Refugee Appeal”
Predictive Models: "Forest Model" and "Boosting Model"
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Modeling using R
1. Background
Pas de Poisson is a fishing conglomerate headquartered in Montreal, Canada. The fleet is based in two remote locations: Halifax, NS and St. John’s, Newfoundland. The St. John’s fleet primarily works the near-shore fishing grounds of Nova Scotia and Newfoundland, within 12 nautical miles of shore. The Halifax fleet, however, has fishing deployments much further offshore, in most cases using U.S. territorial waters in the North Atlantic under the CANAM bilateral agreements.
The entire crew of the St. John’s fleet are Canadian residents. Hiring managers ensure that 90% of the deck hands working on the Halifax fleet are foreign workers, as the labor rate is significantly lower. The Halifax turnover rate is 6 times that of St. John’s because the constantly rough weather in the North Atlantic creates exceptionally poor working conditions, although the work pays well.
2. Executive Summary
The hiring managers of Pas de Poisson sought the guidance of a consulting firm to determine which nationality among the foreign workforce entering Canada would have the highest probability of a judge approving their appeal to remain, and thus of being employable in the country.
Establishing a model to best determine which candidates to hire provided exceptional cost-saving opportunities. In the past, if the company was informed that one of its new foreign national workers had not been granted an appeal while that worker was on a fishing deployment, which at times lasted over 45 days, the trawler was forced to return to port. A vessel having to return meant missed revenue opportunities, as it could no longer fish, and unexpected fuel expenses for the return transit. Furthermore, the penalties for knowingly employing an illegal foreign worker were harsh from both the Canadian and U.S. fisheries enforcement agencies.
3. Data Integrity
• Source: Rattle Library
• Name: “Green: Refugee Appeal”
• “Cleaning” steps
• Used the Transform tab to remove missing and ignored data after comparing the original and “cleaned” OOB error rates. Additionally, the categorical variable “judges” was deemed statistically insignificant for our purposes, so it was omitted, increasing the integrity of the dataset.
• Steps: To fulfill the hiring strategy, we targeted information from the data (using Rattle, R and Excel) that would let us select future hires based on the probability of an approved appeal.
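The cleaning steps above can be sketched in base R. This is only an illustration of the idea (drop the ignored variable, then drop observations with missing values); the data frame, its column names, and its values are hypothetical stand-ins, not the Rattle dataset or the code Rattle generated.

```r
# Hypothetical stand-in for the refugee appeal data; the column names and
# values are illustrative assumptions, not the Rattle dataset itself.
appeals <- data.frame(
  judge    = c("Heald", "Stone", "Urie", "Pratte", "Marceau", "Mahoney"),
  nation   = c("Iran", "China", NA, "Bulgaria", "Sri Lanka", "Iran"),
  language = c("English", "French", "English", NA, "French", "English"),
  decision = c("no", "no", "yes", "no", "yes", "no"),
  stringsAsFactors = TRUE
)

# Step 1: drop the ignored variable (here, the judge column).
ignored <- "judge"
cleaned <- appeals[, setdiff(names(appeals), ignored), drop = FALSE]

# Step 2: drop observations that still contain a missing value.
cleaned <- cleaned[complete.cases(cleaned), , drop = FALSE]

# Step 3: drop factor levels that no longer occur after the row removal.
cleaned <- droplevels(cleaned)

print(dim(appeals))  # rows and columns before cleaning
print(dim(cleaned))  # rows and columns after cleaning
```

Comparing a model's OOB error on `appeals` and on `cleaned` is then one way to check that the cleanup did not hurt predictive performance.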
4. Forest Model
• Imported the data and rescaled
• Created a Forest model with default options
• OOB error = 30.62%, Type 1 error = 16.12%, Type 2 error = 65.5%, AUC = 0.644
• Our business requires more focus on Type 1 error than on Type 2 error
• Checked the trend of errors and variable importance
• Created a sample of 35,35
• OOB estimate of error rate: 35.83%, Type 1 error rate = 35.02%, Type 2 error rate = 37.77%, AUC = 0.653
• Error rate increased, Type 1 increased: not good
• No major change, although Type 2 decreased
• Look for a better one. Prune the trees at minimum complexity
• Here number of trees = 421 and complexity = 0.2913
• Now, OOB estimate of error rate: 29.32%, AUC = 0.646, Type 1 error = 14.28571%, Type 2 error = 65.55%
• Type 2 is still large, but we are not much concerned about that.
• Best model so far
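To make the three error figures on this slide concrete, here is a minimal sketch of how they relate to a confusion matrix. The counts are hypothetical, and the convention is an assumption: "yes" (appeal approved) is treated as the positive class, so a Type 1 error is a denied case predicted as approved and a Type 2 error is an approved case predicted as denied.

```r
# Hypothetical confusion matrix: rows are actual decisions, columns are
# predicted decisions. The counts are made up for illustration only.
conf <- matrix(c(84, 16,
                 26, 14),
               nrow = 2, byrow = TRUE,
               dimnames = list(actual = c("no", "yes"),
                               predicted = c("no", "yes")))

# Assumed convention: "yes" (appeal approved) is the positive class.
type1 <- conf["no", "yes"] / sum(conf["no", ])    # denied cases predicted approved
type2 <- conf["yes", "no"] / sum(conf["yes", ])   # approved cases predicted denied
overall <- (conf["no", "yes"] + conf["yes", "no"]) / sum(conf)

round(100 * c(type1 = type1, type2 = type2, overall = overall), 2)
```

Because the overall rate is a weighted average of the two class-level rates, a moderate overall error can hide a very high Type 2 rate when approved cases are the minority class.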
5. Forest Model
• Set the relative importance of Type 1 and Type 2 error rates by sampling the data (35,35)
• randomForest(formula = IMO_decision ~ ., data = crs$dataset[crs$sample, c(crs$input, crs$target)], ntree = 421, mtry = 5, sampsize = c(35, 35), importance = TRUE, replace = FALSE, na.action = na.roughfix)
• OOB estimate of error rate: 36.48%, Type 1 error rate = 36.4%, Type 2 error rate = 36.6%
• OOB increased. Type 1 increased, as expected. Not a good solution
• Our best solution so far is:
• 95% CI: 0.5462-0.6554 (DeLong)
• OOB estimate of error rate: 29.32%, Type 1 error rate = 14.28%, Type 2 error rate = 65.6%.
• Run the evaluation on the test dataset to get the final result.
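The effect of sampsize = c(35, 35) with replace = FALSE can be pictured as drawing an equal number of cases from each class for every tree. The base R sketch below imitates one such draw; it is an illustration of the idea rather than randomForest's internal code, and the label vector and its class counts are hypothetical.

```r
set.seed(42)  # arbitrary seed so the illustration is repeatable

# Hypothetical, imbalanced target vector standing in for the training labels.
decision <- factor(c(rep("no", 190), rep("yes", 80)))

# Draw 35 row indices from each class without replacement, which is the idea
# behind sampsize = c(35, 35) with replace = FALSE for a single tree.
per_class <- 35
rows_by_class <- split(seq_along(decision), decision)
drawn <- unlist(lapply(rows_by_class, function(idx) sample(idx, per_class)))

table(decision)         # class counts in the full data
table(decision[drawn])  # class counts in the balanced draw
```

Balancing the classes this way gives the minority class more influence on each tree, which is consistent with the slide's result: Type 2 error falls while Type 1 error rises.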
7. Boosting Model
• Run the Boosting model with default options
• OOB estimate of error rate: 21.8%
• Type 1 error rate is 6.9%, Type 2 error rate is 61.1%. Look for error trends and importance of variables. Analysis: success and language are major predictors
• Training error is high initially, trending downward as the number of iterations increases.
• Try to look at the point where the error graph becomes constant.
• The 1’s shown in the graph depict the trend, but the trend changes again beyond iteration 50.
• Build more iterations to figure out the trend and the point after which the error rate is constant.
• Analysis: success and language are major predictors
• Build the model with iterations = 200
• Analysis: The trend seems clear. After 140 iterations, the error rate graph becomes constant.
• Set the iterations to 140 and continue the boosting model.
• Analysis: OOB error is 21.2%, but Type 2 errors are very large.
• AUC = 0.68. Still room for improvement. Set the importance matrix. We need less Type 2 error.
• Call: ada(IMO_decision ~ ., data = crs$dataset[crs$train, c(crs$input, crs$target)], control = rpart.control(maxdepth = 30, cp = 0.01, minsplit = 20, xval = 10), parms = list(split = "information", loss = matrix(c(0, 1, 1.5, 0), byrow = TRUE, nrow = 2)), iter = 140)
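The loss matrix in the call above is what the slide calls the importance matrix. The sketch below shows how such a matrix shifts a prediction, assuming it is read the way rpart reads a loss matrix (rows are the actual class, columns the predicted class) and assuming the classes are ordered "no", "yes". Both the class order and the node probabilities are hypothetical.

```r
# The loss matrix from the call above. Assumed layout (the rpart convention):
# rows are the actual class, columns the predicted class. The class order
# ("no", "yes") is an assumption about how the target's factor levels sort.
loss <- matrix(c(0, 1, 1.5, 0), byrow = TRUE, nrow = 2,
               dimnames = list(actual = c("no", "yes"),
                               predicted = c("no", "yes")))

# Hypothetical class probabilities at one node of a tree.
p <- c(no = 0.55, yes = 0.45)

# Expected loss of each possible prediction: sum over actual classes of
# P(actual) * loss[actual, predicted].
expected_loss <- drop(p %*% loss)
names(expected_loss) <- colnames(loss)

expected_loss
names(which.min(expected_loss))  # prediction that minimises expected loss
```

Under these assumptions, missing an approved appeal costs 1.5 times as much as the opposite mistake, so a node can predict "yes" even when "no" is slightly more likely. That is the mechanism intended to reduce Type 2 error.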
8. Final Confusion Matrix - Boosting Model
• Analysis: Best so far, although Type 2 error is still large
• Giving Type 2 errors more weight doesn’t help
• No major change in ROC.
10. Conclusion
The best dataset shows, with strong statistical significance, that Czechoslovakia (exhibit 1) is the nation with the highest probability of winning an appeal, based on data analyzed in MS Excel. Furthermore, exhibit 2 shows that 29% of all applicants are denied their appeal. The Rater, the person who determines the merit of a case going forward, is correct 81% of the time when he or she predicts an appeal denial; conversely, the Rater is correct only 48% of the time when predicting that the judge will award the appeal. Finally, the data shows that most applicants who seek an appeal have a higher approval probability with the courts in Montreal than in Toronto.
As with the Appeal data (above), the same inferences can be drawn from the individual Judge data. For the judges tree (exhibit 3), if we assume that the rater predicts success for 33-34% of claimants, 72% of those positive predictions are cases to be heard by judges other than Heald, Hugessen, Iacobucci, MacGuigan, Pratte, and Stone. We can infer that Desjardins, Mahoney, Marceau, and Urie are the judges with the highest probability of ruling positively on an appeal.
Therefore, as Desjardins is from Montreal and rules favorably on Czechoslovakian nationals, it would behoove the company to create a goal-congruent strategy that favors those results.
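The two rater accuracy figures quoted above are conditional on what the rater predicted. A minimal sketch of that calculation follows; the cross-tabulation is hypothetical and does not reproduce the counts behind exhibit 2.

```r
# Hypothetical cross-tabulation of rater prediction vs. judge decision.
# Counts are invented for illustration and are not the figures in exhibit 2.
tab <- matrix(c(200, 50,
                 60, 60),
              nrow = 2, byrow = TRUE,
              dimnames = list(rater = c("no", "yes"),
                              judge = c("no", "yes")))

# Share of the rater's "no" calls that the judge also denied.
denial_accuracy <- tab["no", "no"] / sum(tab["no", ])

# Share of the rater's "yes" calls that the judge actually approved.
approval_accuracy <- tab["yes", "yes"] / sum(tab["yes", ])

round(100 * c(denial = denial_accuracy, approval = approval_accuracy))
```

Reading the table row by row like this explains why the rater can be reliable for denials yet close to a coin flip for approvals.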
11. Exhibit 1
Appeal Rate by Nation
NATION            APPROVED APPEAL RATE
CZECHOSLOVAKIA    73%
SRI LANKA         36%
EL SALVADOR       36%
ARGENTINA         25%
IRAN              25%
CHINA             22%
BULGARIA          7%
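A rate table like Exhibit 1 can be derived from case-level records by averaging an approved/denied flag within each nation. The deck reports doing this in MS Excel; the base R sketch below is an equivalent illustration on hypothetical records, with placeholder nation labels.

```r
# Hypothetical case-level records; nations and outcomes are illustrative
# and do not reproduce the counts behind exhibit 1.
cases <- data.frame(
  nation   = c("A", "A", "A", "A", "B", "B", "B", "B", "B", "C", "C"),
  decision = c("yes", "yes", "yes", "no", "no", "yes", "no", "no", "no", "no", "yes")
)

# Approval rate per nation: the mean of a TRUE/FALSE "approved" flag.
approved <- cases$decision == "yes"
rate <- vapply(split(approved, cases$nation), mean, numeric(1))

# Sort from highest to lowest, as in the exhibit, and show as percentages.
rate <- sort(rate, decreasing = TRUE)
round(100 * rate)
```

Reporting the number of cases per nation next to each rate would help show how much weight a figure such as 73% can bear.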