12422, 744 PM Assignment_3localhost8888nbconverthtml.docx

12/4/22, 7:44 PM Assignment_3
localhost:8888/nbconvert/html/Assignment_3/Assignment_3.ipynb?download=false
INSTRUCTIONS:
* Add your code as indicated in each cell.
* Besides adding your code, do not alter this file.
* Do not delete or change test cases. Once you are done with a question, you can run the test cases to see if you programmed the question correctly.
* If you get a question wrong, do not give up. Keep trying until you pass the test cases.
* Rename the file as firstname_lastname_assignmentid.ipynb (e.g., marina_johnson_assignment1.ipynb)
* Only submit .ipynb files (no .py files)
# Question 1
1. Read the employee_attrition dataset and save it as df. Recall that the target variable in this dataset is named 'Attrition'.
2. Check if the dataset is imbalanced by counting the number of Noes and Yeses in the target variable Attrition.
Hints:
Imbalanced data refers to a situation where the number of observations is not the same for all the classes in a dataset. For example, the number of churned employees is 4000, while the number of unchurned employees is 40000. This means this dataset is imbalanced.
You need to access the target variable Attrition and count how many Yes and No values there are in this variable. If the number of Yeses is equal to the number of Noes, then the dataset is balanced. Otherwise, it is not balanced.
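The counting step described in the hints can be sketched as follows. This is a minimal illustration on a small made-up Attrition column, not the real employee_attrition data; the name `toy` and its values are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset.
toy = pd.DataFrame({"Attrition": ["No", "No", "Yes", "No", "No", "Yes", "No"]})

# value_counts() tallies how often each class appears in the column.
counts = toy["Attrition"].value_counts()
n_yes = int(counts["Yes"])
n_no = int(counts["No"])

# The data is balanced only if both classes have the same count.
is_balanced = n_yes == n_no
print(counts.to_dict(), "balanced:", is_balanced)
```

The same pattern should carry over to any dataframe whose target column holds 'Yes' and 'No' labels.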
In [138… # Do not delete this cell
import numpy as np
score = dict()
np.random.seed(333)
Check Module 5g: Encoding Categorical Variables to learn more about data imbalance problems. In particular, check 2.5: Balancing datasets in Module 5.
Do not alter the below cell. It is a test case for Question 1
{'question 1': 'pass'}
# Question 2
1. Identify the names of the numerical input variables and save them as a LIST
2. Identify the names of the categorical input variables and save them as a LIST
Hints:
Remember Attrition is the target (output) variable, so exclude Attrition from both LISTS containing the numerical and categorical input variables.
Check Module 5b: Dropping Variables and Module 3e: Helpful Functions (check after minute 4)
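One possible way to separate numerical from categorical inputs is to select columns by dtype. The sketch below uses a small made-up dataframe (`toy` and its values are assumptions for illustration) and excludes the target before building the two lists.

```python
import pandas as pd

# Hypothetical toy data; the real dataset has many more columns.
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "MonthlyIncome": [5993, 5130, 2090],
    "Department": ["Sales", "R&D", "R&D"],
    "Attrition": ["Yes", "No", "Yes"],
})

# Exclude the target so it ends up in neither list.
inputs = toy.drop(columns=["Attrition"])

# select_dtypes picks columns by data type; tolist() turns the names into a list.
numeric_cols = inputs.select_dtypes(include="number").columns.tolist()
categorical_cols = inputs.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols, categorical_cols)
```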
Question 2
In [139… import pandas as pd
df = # your code to read the dataset goes in here
number_of_yes = # your code to find the number of yeses in the Attrition variable goes in here
number_of_no = # your code to find the number of noes in the Attrition variable goes in here
In [140… try:
    if (number_of_yes == 237 and number_of_no == 1233):
        score['question 1'] = 'pass'
    else:
        score['question 1'] = 'fail'
except:
    score['question 1'] = 'fail'
score
Out[140]:
In [141… numerical_variables = # Your code to identify numerical variables goes in here
categorical_varables = # Your code to identify categorical variables goes in here
In [142… try:
{'question 1': 'pass', 'question 2': 'pass'}
# Question 3
1. Identify the numerical variables with zero variance (i.e., zero standard deviation) and save them in a LIST
2. Drop these numerical variables with zero variance (i.e., zero standard deviation) from the dataset df. The dataset df should not have these variables going forward.
Hints:
For each numerical variable, compute the standard deviation. If the standard deviation is zero, delete (i.e., drop) that variable from the dataset df.
Check Module 5b: Dropping Variables
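A minimal sketch of the zero-standard-deviation check, again on made-up data (the dataframe `toy` and its values are assumptions for illustration): compute the standard deviation of each numerical column, collect the names where it is zero, and drop them.

```python
import pandas as pd

# Hypothetical toy data: StandardHours is constant, so its standard deviation is 0.
toy = pd.DataFrame({
    "Age": [41, 49, 37, 33],
    "StandardHours": [80, 80, 80, 80],
    "Department": ["Sales", "R&D", "R&D", "Sales"],
})

# Standard deviation of every numerical column.
stds = toy.select_dtypes(include="number").std()

# Names of the columns whose standard deviation is exactly zero.
zero_var_cols = stds[stds == 0].index.tolist()

# Drop those columns; the remaining columns are untouched.
reduced = toy.drop(columns=zero_var_cols)
print(zero_var_cols, list(reduced.columns))
```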
Question 3
if ((sorted(numerical_variables) ==
['Age','DailyRate','DistanceFromHome','Education',
'EmployeeCount','EmployeeNumber','EnvironmentSatisfaction',
'HourlyRate','JobInvolvement','JobLevel','JobSatisfaction',
'MonthlyIncome','MonthlyRate','NumCompaniesWorked','PercentSalaryHike',
'PerformanceRating','RelationshipSatisfaction','StandardHours',
'StockOptionLevel','TotalWorkingYears','TrainingTimesLastYear',
'WorkLifeBalance','YearsAtCompany','YearsInCurrentRole',
'YearsSinceLastPromotion','YearsWithCurrManager']) and
(sorted(categorical_varables) ==
['BusinessTravel','Department','EducationField','Gender',
'JobRole','MaritalStatus','Over18','OverTime'])):
else:
except:
score
Out[142]:
In [143… zero_variance_numerical_variables = # your code to find the numerical variables with zero variance goes in here
df = # your code to drop the zero variance numerical variables goes in here
In [144… try:
if (zero_variance_numerical_variables == ['EmployeeCount', 'StandardHours']):
else:
except:
score

{'question 1': 'pass', 'question 2': 'pass', 'question 3': 'pass'}
# Question 4
1. Identify the categorical variables with zero variance (i.e., low cardinality) and save them in a LIST
2. Drop these categorical variables with zero variance (i.e., low cardinality) from the dataset df. The dataset df should not have these variables going forward.
Hints:
For each categorical variable, find the number of levels. If the number of levels is 1, delete (i.e., drop) that variable from the dataset df. For example, if a variable named occupation has only "Engineers" across all the rows (i.e., one level), the variable does not contain any information. In other words, it has zero variation.
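The single-level check from the hints can be sketched with `nunique()`, which counts the distinct levels in each column. The dataframe below is made up for illustration (`toy` and its values are assumptions), mirroring the occupation example above.

```python
import pandas as pd

# Hypothetical toy data: occupation has one level across all rows.
toy = pd.DataFrame({
    "occupation": ["Engineer", "Engineer", "Engineer", "Engineer"],
    "Gender": ["Female", "Male", "Male", "Female"],
    "Age": [41, 49, 37, 33],
})

# Number of distinct levels in each categorical (non-numerical) column.
levels = toy.select_dtypes(exclude="number").nunique()

# Columns with exactly one level carry no information.
single_level_cols = levels[levels == 1].index.tolist()

reduced = toy.drop(columns=single_level_cols)
print(single_level_cols, list(reduced.columns))
```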
Question 4
{'question 1': 'pass',
'question 2': 'pass',

'question 4': 'pass'}
Out[144]:
In [145… zero_variance_categorical_variables = [] # your code to find the categorical variables with zero variance goes in here
df = # your code to drop the zero variance categorical variables goes in here
In [146… try:
if (zero_variance_categorical_variables == ['Over18']):
else:
except:
score
Out[146]:
# Question 5
1. Find the categorical variables with very high variance (i.e., very high cardinality) and save them in a LIST. Use 200 as the threshold. In other words, the categorical variables with over 200 levels should be considered as variables with high cardinality (i.e., with high variance).
2. Drop the categorical variables with very high variance (i.e., very high cardinality) from the dataset df. The dataset df should not have these variables going forward.
Hints:
For each categorical variable, find the number of levels. If the number of levels is greater than 200, delete (i.e., drop) that variable from the dataset df. For example,
Question 5
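The high-cardinality check described in the Question 5 hints follows the same level-counting idea, with a threshold in place of the value 1. The sketch below uses made-up data and a deliberately small threshold so the effect is visible on five rows; `toy`, `badge_id`, and `THRESHOLD = 3` are assumptions for illustration, while the assignment itself specifies 200.

```python
import pandas as pd

# The assignment uses 200; a small value keeps this toy example readable.
THRESHOLD = 3

# Hypothetical toy data: badge_id has a different level in every row.
toy = pd.DataFrame({
    "badge_id": ["a1", "b2", "c3", "d4", "e5"],
    "Gender": ["Female", "Male", "Male", "Female", "Male"],
})

# Count the levels of each categorical column and keep the names above the threshold.
levels = toy.select_dtypes(exclude="number").nunique()
high_card_cols = levels[levels > THRESHOLD].index.tolist()

reduced = toy.drop(columns=high_card_cols)
print(high_card_cols, list(reduced.columns))
```

If no column exceeds the threshold, the list is simply empty and `drop` leaves the dataframe unchanged.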
In [147… high_cardinality_categorical_variables = [] # your code to find the categorical variables with high variance (i.e., cardinality) goes in here
df = # your code to drop the high cardinality categorical variables goes in here
In [148… try:
if (high_cardinality_categorical_variables == []):
else:
except:
score
Out[148]:
# Question 6
1. Scale (i.e., standardize) the numerical variables in the dataset using the standardization method, drop the original numerical variables, and keep only the standardized ones.
2. The new standardized numerical variables should have the same variable names. For example, the age variable after being standardized should be named the same (i.e., age)
Hints:
Feature standardization makes the values of each feature in the data have zero-mean (when subtracting the mean in the numerator) and unit-variance. This method is widely used for normalization in many machine learning algorithms.
Check M5d: Standardization
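As an illustration of the formula behind standardization, z = (x - mean) / std, here is a sketch on made-up data (`toy` and its values are assumptions). It uses the population standard deviation (`ddof=0`), which is the convention scikit-learn's StandardScaler follows; pandas defaults to the sample standard deviation (`ddof=1`), so the two give slightly different results unless `ddof` is set explicitly.

```python
import pandas as pd

# Hypothetical toy data with one numerical and one categorical column.
toy = pd.DataFrame({
    "Age": [30.0, 40.0, 50.0],
    "Gender": ["Female", "Male", "Male"],
})

numeric_cols = toy.select_dtypes(include="number").columns

# Overwrite each numerical column with its standardized version,
# so the column names stay the same and no extra columns are created.
scaled = toy.copy()
scaled[numeric_cols] = (
    toy[numeric_cols] - toy[numeric_cols].mean()
) / toy[numeric_cols].std(ddof=0)

print(scaled)
```

After this step the Age column should have mean 0 and population standard deviation 1, while Gender is unchanged.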
Question 6
# Question 7
1. Encode the categorical input variables. Do not encode the target variable Attrition. You will do that in the following question.
Hints:
You will create dummies for categorical variables.
Example: Let's say you have a variable named occupation. This variable has three levels: Engineer, Teacher, Manager. We will use binary encoding and create dummies for each of these levels to be able to encode the occupation variable. Technically, we are converting the categorical variable into new numerical variables.
We will have two new variables for this occupation variable, such as occupation_engineer and occupation_manager. We do not need occupation_teacher because we can infer if the person is a teacher by checking the occupation_manager and occupation_engineer variables.
For example: If occupation_engineer and occupation_manager are zero, then this person is a teacher.
If occupation_engineer is 1, this person is an engineer.
Check Module 5g: Encoding Categorical Variables
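The occupation example can be sketched with `pandas.get_dummies`. The data below is made up for illustration. Note one difference from the hint: with `drop_first=True`, pandas drops the first level in sorted order (Engineer here) rather than Teacher. Which level is dropped is arbitrary, since the omitted level can always be inferred from the others being zero.

```python
import pandas as pd

# Hypothetical toy data with the three occupation levels from the hint.
toy = pd.DataFrame({"occupation": ["Engineer", "Teacher", "Manager", "Teacher"]})

# One dummy column per level, minus the first level (drop_first=True).
# dtype=int yields 0/1 values instead of booleans.
encoded = pd.get_dummies(toy, columns=["occupation"], drop_first=True, dtype=int)

print(encoded)
```

In the result, a row with zeros in both occupation_Manager and occupation_Teacher corresponds to an engineer.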
In [149… # your code to standardize numerical variables goes in here
df =
In [150… try:
if ((df['Age'].max() == 2.526885578888087) and (df['DailyRate'].max() == 1.7267301192801021)):
else:
except:
score
Out[150]:

Question 7
# Question 8
1. Encode the categorical output variable: Attrition. Yes should be coded as 1, and No should be coded as 0. The new encoded target variable should be named Attrition. Do not forget to drop the categorical Attrition variable. Basically, you will convert the categorical Attrition variable into a numerical Attrition variable such that Yes will be mapped to 1, and No will be mapped to 0.
Hints:
Check Module 3 and Module 5 videos.
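One simple way to express the Yes-to-1 and No-to-0 mapping is `Series.map` with a dictionary. The sketch below uses a made-up column (`toy` and its values are assumptions); assigning the result back to the same column name replaces the categorical version, so no separate drop is needed.

```python
import pandas as pd

# Hypothetical toy target column.
toy = pd.DataFrame({"Attrition": ["Yes", "No", "No", "Yes", "No"]})

# Replace each label with its numeric code, keeping the column name Attrition.
toy["Attrition"] = toy["Attrition"].map({"Yes": 1, "No": 0})

# The mean of a 0/1 column is the share of ones (here 2 out of 5).
print(toy["Attrition"].tolist(), toy["Attrition"].mean())
```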
Question 8
In [151… # your code to encode categorical input variables goes in here
df =
In [152… try:
if ((df['JobRole_Laboratory Technician'].mean() == 0.1761904761904762) and
(df['EducationField_Marketing'].mean() == 0.10816326530612246)):
else:
except:
score
Out[152]:
In [153… # your code to encode the categorical output variable Attrition goes in here
df =
In [154… try:
if (df['Attrition'].mean() == 0.16122448979591836):
else:

# Question 9
1. Balance the dataset
2. Your code should return the input and output variables separately. The input variables will be saved as a dataframe named X. The output variable will be saved as a dataframe named y.
Hints:
Imbalanced data refers to a situation where the number of observations is not the same for all the classes in a dataset. For example, the number of churned employees is 4000, while the number of unchurned employees is 40000. This means this dataset is imbalanced.
You need to access the target variable Attrition and increase the number of ones (i.e., Yeses) so that the number of zeros (i.e., Noes) and the number of ones (i.e., Yeses) will be equal.
Check M5g: Encoding Categorical Variables. Balancing datasets is discussed in this video.
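As an illustration of increasing the number of ones until the classes match, the sketch below uses random oversampling with replacement on made-up data. This is one simple option and may differ from the method shown in the course video; `toy`, its values, and the `random_state` are assumptions for the example.

```python
import pandas as pd

# Hypothetical toy data: four zeros (Noes) and two ones (Yeses).
toy = pd.DataFrame({
    "Age": [25, 32, 47, 51, 38, 29],
    "Attrition": [0, 0, 0, 0, 1, 1],
})

majority = toy[toy["Attrition"] == 0]
minority = toy[toy["Attrition"] == 1]

# Draw minority rows with replacement until there are as many as majority rows.
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=333)

balanced = pd.concat([majority, minority_upsampled], ignore_index=True)

# Separate inputs and output; double brackets keep y as a dataframe.
X = balanced.drop(columns=["Attrition"])
y = balanced[["Attrition"]]

print(y["Attrition"].value_counts().to_dict())
```

Oversampling duplicates existing minority rows, so the majority class keeps its original count and the minority class grows to meet it.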
Question 9

except:
score
Out[154]:
In [156… # Your code to balance the dataset goes in here
X = # dataframe containing the input variables after balancing
y = # dataframe containing the output variable Attrition after balancing
In [157… try:
if ((y.Attrition.value_counts()[0] == 1233) and
(y.Attrition.value_counts()[1] == 1233)):
else:
except:
score

# Question 10
Split the dataset into training and testing sets. Basically, using the X and y dataframes, you will create X_train, X_test, y_train, and y_test.
You need to keep 70% of the dataset for training and 30% for testing.
Hints:
You can use the train_test_split function in the sklearn library
Check Module M6c: Classification
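A minimal sketch of a 70/30 split with `train_test_split`, using made-up X and y dataframes of ten rows (the data and the `random_state` value are assumptions for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy inputs and target with ten rows each.
X = pd.DataFrame({"feature": range(10)})
y = pd.DataFrame({"Attrition": [0, 1] * 5})

# test_size=0.3 holds out 30% of the rows for testing; the rest is for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=333
)

print(X_train.shape, X_test.shape)
```

Fixing `random_state` only makes the shuffle repeatable; the split proportions do not depend on it.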
Question 6
Out[157]:
In [158… # your code to create train and test sets goes in here
X_train, X_test, y_train, y_test = # your code to create train and test sets goes in here
In [159… try:
if ((X_train.shape[0]<1750) and (X_train.shape[0]>1700)):
else:
except:
score
Out[159]:
# Question 11
1. Train a knn model where k is 3 using the training dataset.
2. Make predictions using the test dataset
3. Compute accuracy and save as accuracy
Hints:
You need to use the KNeighborsClassifier class. Instantiate a knn object and pass the number of neighbors to its constructor. Train the model using X_train and y_train. Then make predictions using X_test. Then compute the accuracy using the predicted values and y_test.
Check Module 6d: Model Performance and Module 5c: Classification
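The instantiate, fit, predict, and score sequence from the hints can be sketched as follows. The data here is synthetic (two well-separated clusters generated with NumPy), and the seed and cluster parameters are assumptions for illustration, so the accuracy says nothing about the attrition dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: 50 points around (0, 0) labeled 0 and 50 around (5, 5) labeled 1.
rng = np.random.default_rng(333)
X = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(5, 1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=333
)

# k = 3: each prediction is a majority vote among the 3 nearest training points.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

predictions = knn.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(accuracy)
```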
Question 11
In [160… # Your code to train knn, make predictions, and compute accuracy goes in here
accuracy = # compute accuracy here
In [161… try:
if (accuracy > 0.70):
else:
except:
score
Out[161]:
# Question 12
1. Train a Random Forest model where the number of estimators is 100 using the training dataset.
2. Make predictions using the test dataset
3. Compute accuracy and save as accuracy
Hints:
You need to use the RandomForestClassifier class. Instantiate a RandomForestClassifier object and pass the number of estimators to its constructor. Train the model using X_train and y_train. Then make predictions using X_test. Then compute the accuracy using the predicted values and y_test.
Check Module 6d: Model Performance and Module 5c: Classification
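A comparable sketch for the random forest steps, again on synthetic two-cluster data rather than the attrition dataset (the data, seed, and `random_state` values are assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data: 50 points around (0, 0) labeled 0 and 50 around (5, 5) labeled 1.
rng = np.random.default_rng(333)
X = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(5, 1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=333
)

# n_estimators=100 builds a forest of 100 decision trees.
forest = RandomForestClassifier(n_estimators=100, random_state=333)
forest.fit(X_train, y_train)

predictions = forest.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(accuracy)
```

Setting `random_state` on the classifier makes the trained forest repeatable from run to run.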
Question 6

In [162… # Your code to train random forest, make predictions, and compute accuracy goes in here
accuracy = # compute accuracy here
In [163… try:
if (accuracy > 0.80):
else:
except:
score
Out[163]:
# Your Grade
In [164… print('Your overall score is: ', round(list(score.values()).count('pass')*8.3333))
Your overall score is: 100
