Predictive Modeling using R

Background

Pas de Poisson is a fishing conglomerate headquartered in Montreal, Canada. The fleet is based in two remote locations: Halifax, NS, and St. John’s, Newfoundland. The St. John’s fleets primarily work the near-shore fishing grounds of Nova Scotia and Newfoundland, within 12 nautical miles of shore. The Halifax fleets, however, have fishing deployments that are located much further offshore, in most cases using U.S. territorial waters in the North Atlantic under the CANAM bilateral agreements. The entire crew of the St. John’s fleet are Canadian residents. Hiring managers ensure that 90% of the deck hands working on the Halifax fleet are foreign workers, as the labor rate is significantly lower. Turnover there is 6 times the rate of St. John’s, because the constantly rough weather in the North Atlantic creates exceptionally poor working conditions, even though the work pays well.

Executive Summary

The hiring managers of Pas de Poisson sought the guidance of a consulting firm to determine which nationality among the foreign workers entering Canada would have the highest probability that a judge would approve their appeal to remain, and who would subsequently be employable in the country. Establishing a model to best determine which candidates to hire provided exceptional cost-saving opportunities. In the past, if the company was informed that one of its new foreign national workers had not been granted an appeal while that worker was on a fishing deployment, which at times lasts over 45 days, the trawler was forced to return to port. A vessel having to return meant lost revenue opportunity, as it could no longer fish, and unexpected fuel expenses for the return transit. Furthermore, the penalty for knowingly employing an illegal foreign worker was harsh from both the Canadian and U.S. fisheries enforcement agencies.

Data Integrity

•  Source: Rattle Library

•  Name: “Green: Refugee Appeal”

•  “Cleaning” steps

•  Used the Transform tab to remove missing and ignored data after comparing the original and “cleaned” OOB error rates. Additionally, the categorical variable “judges” was deemed to be statistically insignificant for our purposes, so it was omitted, thus increasing the integrity of the dataset.

•  Steps: To fulfill the hiring strategy, we used Rattle, R, and Excel to extract from the data the information needed to select future hires based on the probability of an approved appeal.
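The cleaning described above was done through Rattle's interface. A minimal base R sketch of the same idea, assuming a hypothetical data frame `appeals` whose column names and values are invented for illustration (not the real dataset), might look like this:

```r
# Hypothetical stand-in for the appeals data; column names and values are
# illustrative assumptions, not the actual dataset.
appeals <- data.frame(
  judge    = c("A", "B", "A", "C", "B", "C"),
  nation   = c("X", "Y", NA, "X", "Y", "X"),
  language = c("English", "French", "English", NA, "French", "English"),
  decision = factor(c("yes", "no", "no", "yes", "no", "no"))
)

# Drop the column judged uninformative (here: the judge identifier).
cleaned <- appeals[, setdiff(names(appeals), "judge")]

# Remove rows that still contain missing values.
cleaned <- cleaned[complete.cases(cleaned), ]

# Report how many rows and columns survived the cleanup.
cat("rows before:", nrow(appeals), "after:", nrow(cleaned), "\n")
cat("columns kept:", paste(names(cleaned), collapse = ", "), "\n")
```

Fitting the same model on `appeals` and on `cleaned` and comparing the OOB error rates would then show whether the cleanup helped.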

Forest Model

•  Imported the data and rescaled it

•  Created a Forest model with default options

•  OOB error = 30.62%, Type 1 error = 16.12%, Type 2 error = 65.5%, AUC = 0.644

•  Our business requires more focus on Type 1 error than on Type 2 error

•  Checked the trend of errors and variable importance

•  Created a sample of (35, 35)

•  OOB estimate of error rate: 35.83%, Type 1 error rate = 35.02%, Type 2 error rate = 37.77%, AUC = 0.653

•  Error rate increased and Type 1 increased: not good

•  No major change, although Type 2 decreased

•  Look for a better one. Prune the trees at minimum complexity

•  Here tree = 421 and complexity = 0.2913

•  Now, OOB estimate of error rate: 29.32%, AUC = 0.646, Type 1 error = 14.28571%, Type 2 error = 65.55%

•  Type 2 is still large, but we are not much concerned about that.

•  Best model so far
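The overall, Type 1, and Type 2 figures quoted above are all read off a confusion matrix. The arithmetic can be sketched in base R under two assumptions that the slides do not state explicitly: "approved" is treated as the positive class, and Type 1 is the share of actually denied appeals that the model predicts as approved. The counts are made up for illustration.

```r
# Hypothetical confusion matrix: rows are actual outcomes, columns are
# predictions. The counts are illustrative only.
conf <- matrix(c(80, 20,
                 30, 20),
               nrow = 2, byrow = TRUE,
               dimnames = list(actual = c("denied", "approved"),
                               predicted = c("denied", "approved")))

# Overall error: every off-diagonal cell divided by the total.
overall_error <- (sum(conf) - sum(diag(conf))) / sum(conf)

# Type 1 (false positive rate): denied appeals predicted as approved.
type1 <- conf["denied", "approved"] / sum(conf["denied", ])

# Type 2 (false negative rate): approved appeals predicted as denied.
type2 <- conf["approved", "denied"] / sum(conf["approved", ])

print(round(c(overall = overall_error, type1 = type1, type2 = type2), 3))
```

Under this reading, a Type 1 error corresponds to hiring a worker whose appeal is later denied, which is why the slides weight it more heavily.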

Forest Model

•  Set the relative importance of Type 1 and Type 2 error rates by sampling the data (35, 35)

•  randomForest(formula = IMO_decision ~ ., data = crs$dataset[crs$sample, c(crs$input, crs$target)], ntree = 421, mtry = 5, sampsize = c(35, 35), importance = TRUE, replace = FALSE, na.action = na.roughfix)

•  OOB estimate of error rate: 36.48%, Type 1 error rate = 36.4%, Type 2 error rate = 36.6%

•  OOB increased. Type 1 increased as expected. Not a good solution
•  Our best solution so far is:

•  95% CI: 0.5462-0.6554 (DeLong)

•  OOB estimate of error rate: 29.32%, Type 1 error rate = 14.28%, Type 2 error rate = 65.6%.

•  Run the evaluation on the test data set to get the final result.
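The two forest variants discussed here (default sampling versus 35 rows per class) can be compared with a short sketch using the randomForest package. The data below is synthetic and the variable names are hypothetical. Picking the tree count with the lowest OOB error is only an assumption about what "tree = 421" refers to in the slides.

```r
library(randomForest)

set.seed(42)

# Synthetic two-class data; names and values are illustrative assumptions.
n <- 300
dat <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  x3 = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
# Imbalanced outcome, with "yes" as the minority class.
dat$decision <- factor(ifelse(dat$x1 + rnorm(n) > 0.8, "yes", "no"),
                       levels = c("no", "yes"))

# Sampling 35 rows per class without replacement needs at least 35 of each.
stopifnot(min(table(dat$decision)) >= 35)

# Default forest: each tree sees a bootstrap sample of all rows.
rf_default <- randomForest(decision ~ ., data = dat, ntree = 500)

# Balanced forest: each tree draws 35 rows per class without replacement.
rf_balanced <- randomForest(decision ~ ., data = dat, ntree = 500,
                            sampsize = c(35, 35), replace = FALSE,
                            importance = TRUE)

# Confusion matrices with per-class OOB error in the class.error column.
print(rf_default$confusion)
print(rf_balanced$confusion)

# Number of trees at which the OOB error of the default forest is lowest.
best_ntree <- which.min(rf_default$err.rate[, "OOB"])
cat("lowest OOB error at ntree =", best_ntree, "\n")
```

Balanced sampling typically trades a higher error on the majority class for a lower error on the minority class, which matches the direction of the changes reported above.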

Final Confusion Matrix: Forest Model

Boosting Model

•  Run the Boosting model with default options

•  OOB estimate of error rate: 21.8%

•  Type 1 error rate is 6.9%, Type 2 error rate is 61.1%. Look for error trends and importance of variables.
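A default boosting run like the one described above can be sketched with the ada package. The data is synthetic, the variable names are hypothetical, and the 70/30 train/test split is an assumption made for the example rather than something stated in the slides.

```r
library(ada)

set.seed(1)

# Synthetic two-class data; names and values are illustrative assumptions.
n <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$decision <- factor(ifelse(dat$x1 + rnorm(n) > 0.8, "yes", "no"),
                       levels = c("no", "yes"))

# Hold out 90 of the 300 rows for testing.
test_idx <- sample(n, size = 90)
train <- dat[-test_idx, ]
test  <- dat[test_idx, ]

# Boosted trees with 50 iterations, the package default.
fit <- ada(decision ~ ., data = train, iter = 50)

# Confusion matrix and error rate on the held-out rows.
pred <- predict(fit, newdata = test, type = "vector")
conf <- table(actual = test$decision, predicted = pred)
print(conf)
test_error <- mean(as.character(pred) != as.character(test$decision))
cat("test error:", round(test_error, 3), "\n")
```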

Analysis: Success and language are major predictors
•  Training error is high initially and trends downward as the number of iterations increases.

•  Look for the point where the error graph becomes constant.

•  The 1’s shown in the graph depict the trend, but the trend changes again beyond iteration 50.

•  Build more iterations to figure out the trend and the point after which the error rate is constant.

•  Analysis: Success and language are major predictors

•  Build the model with iterations = 200

•  Analysis: The trend seems clear. After 140 iterations, the error rate graph becomes constant.

•  Set the iterations to 140 and continue the boosting model.
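One simple way to make "the point after which the error rate is constant" concrete is to find the first iteration from which the error stays within a small tolerance of its final value. The sketch below uses a synthetic error curve, not the model's actual errors. The helper `plateau_start` and its tolerance are hypothetical choices for illustration.

```r
# Synthetic error curve that decays and then flattens; illustrative only.
iters <- 1:200
err <- 0.20 + 0.15 * exp(-iters / 30)

# Hypothetical helper: first iteration from which every later error stays
# within `tol` of the final error value.
plateau_start <- function(err, tol = 0.002) {
  final <- err[length(err)]
  outside <- which(abs(err - final) > tol)
  if (length(outside) == 0) 1L else max(outside) + 1L
}

k <- plateau_start(err)
cat("error is effectively constant from iteration", k, "\n")
```

Reading the plateau off a plot, as the slides do, gives the same kind of answer. The tolerance simply makes the judgment explicit.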

OOB error is 21.2%, but Type 2 errors are very large.

•  AUC = 0.68. Still room for improvement. Set the importance (loss) matrix. We need less Type 2 error.

•  Call:

ada(IMO_decision ~ ., data = crs$dataset[crs$train, c(crs$input, crs$target)],
    control = rpart.control(maxdepth = 30, cp = 0.01, minsplit = 20, xval = 10),
    parms = list(split = "information",
                 loss = matrix(c(0, 1, 1.5, 0), byrow = TRUE, nrow = 2)),
    iter = 140)
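The `loss` argument in the call above is a 2x2 cost matrix. Assuming rpart's usual convention (rows are the true class, columns the predicted class), the 1.5 entry makes misclassifying a case of the second class 50% more costly than the reverse mistake. The sketch below prints the matrix and applies it to a hypothetical confusion matrix. The class labels and their order are assumptions, and the counts are made up.

```r
# Loss matrix exactly as written in the call above.
loss <- matrix(c(0, 1, 1.5, 0), byrow = TRUE, nrow = 2)

# Assumed class order (not stated in the slides): "no" first, "yes" second.
dimnames(loss) <- list(actual = c("no", "yes"), predicted = c("no", "yes"))
print(loss)

# Hypothetical confusion matrix with the same layout; counts are made up.
conf <- matrix(c(90, 10,
                 30, 20),
               nrow = 2, byrow = TRUE, dimnames = dimnames(loss))

# Total misclassification cost: each cell count times its loss.
total_cost <- sum(conf * loss)
cat("total cost:", total_cost, "\n")
```

With this ordering, the heavier penalty falls on approved appeals that are predicted as denied, which is consistent with the stated aim of reducing Type 2 error.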

Final Confusion Matrix: Boosting Model

•  Best so far, although Type 2 error is still big

•  Giving more importance doesn’t help

•  No major change in ROC.

Comparison of Models

Forest Model vs. Boosting Model

Conclusion

With the best dataset, the analysis shows strong statistical evidence that Czechoslovakia (Exhibit 1) is the nation with the highest probability of winning an appeal, based on data analyzed in MS Excel. Furthermore, Exhibit 2 shows that 29% of all applicants are denied their appeal. The rater, the person who determines the merit of a case going forward, is correct 81% of the time when predicting an appeal denial; conversely, the rater is correct only 48% of the time when predicting that the judge will grant the appeal. Finally, the data shows that most applicants who seek an appeal have a higher approval probability with the courts in Montreal than in Toronto.
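Figures like the rater's 81% and 48% hit rates are the share of rater predictions that matched the judge's decision, computed separately for each predicted outcome. A minimal sketch of that calculation, assuming hypothetical columns `rater` and `decision` and made-up records, could be:

```r
# Hypothetical records: the rater's recommendation and the judge's decision.
cases <- data.frame(
  rater    = c(rep("no", 10), rep("yes", 10)),
  decision = c(rep("no", 8), rep("yes", 2), rep("yes", 5), rep("no", 5))
)

# Cross-tabulate rater predictions (rows) against decisions (columns).
tab <- table(rater = cases$rater, decision = cases$decision)

# Row proportions: of the cases where the rater predicted X, how many got X?
row_share <- prop.table(tab, margin = 1)

correct_when_denial   <- row_share["no", "no"]
correct_when_approval <- row_share["yes", "yes"]
print(round(c(denial = correct_when_denial,
              approval = correct_when_approval), 2))
```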

As with the appeal data above, the same inferences can be drawn from the individual judge data. For the judges tree (Exhibit 3), if we assume that the rater predicts success for 33-34% of claimants, 72% of those positive predictions are cases to be heard by judges other than Heald, Hugessen, Iacobucci, MacGuigan, Pratte, and Stone. We can infer that Desjardins, Mahoney, Marceau, and Urie are the judges with the highest probability of ruling positively on an appeal. Therefore, as Desjardins is from Montreal and rules favorably on Czechoslovakian nationals, it would behoove the company to create a goal-congruent strategy that favors those results.

Exhibit 1

Appeal Rate by Nation

NATION            APPROVED APPEAL RATE
CZECHOSLOVAKIA    73%
SRI LANKA         36%
EL SALVADOR       36%
ARGENTINA         25%
IRAN              25%
CHINA             22%
BULGARIA          7%
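The slides state that this exhibit was produced in MS Excel. An equivalent per-nation approval rate can be sketched in base R. The records below are invented for illustration, and the column names `nation` and `decision` are assumptions.

```r
# Hypothetical claimant records; nations and outcomes are made up.
claims <- data.frame(
  nation   = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C"),
  decision = c("yes", "yes", "yes", "no", "yes", "no", "no", "no", "no", "yes")
)

# Share of approved appeals per nation.
rate <- tapply(claims$decision == "yes", claims$nation, mean)

# Sort from highest to lowest and format as whole percentages.
rate <- sort(rate, decreasing = TRUE)
print(data.frame(nation = names(rate),
                 approved_rate = paste0(round(100 * rate), "%"),
                 row.names = NULL))
```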
