Workshop: Your first machine learning project

Completing your first Machine Learning Project

Today’s Workshop
Goal:
● General overview of what data scientists do when they use “Machine Learning”
● Get our hands dirty with data & code, even if we don’t fully understand it all
● Complete a project you can talk about & share with others
Non-Goals:
● Understand all code / syntax
● Learn to write or deeply understand ML algorithms
● Become a Data Scientist
Audience Expectations:
● Limited or no coding experience
● New to Machine Learning

High-level Machine Learning process steps
● Identify Business Problem / Goal
● Descriptive Statistics on Data
● Visualize the Data
● Find & Clean Data
● Create a Holdout Sample for Testing
● Run Machine Learning Models
● Assess Accuracy & Finalize Model
● Make Predictions and Test

First, credit to the authors
Original code comes from Data-X:
https://github.com/ikhlaqsidhu/data-x

Set-up
1. Download the code and dataset
2. Log in to Jupyter, where we’ll run our code
a. Create a new folder in Jupyter
b. Unzip the file you downloaded
c. Upload the dataset & code to the folder
Find this presentation by going to: tinyurl.com/haasFirstML

Import Required Packages
Here, we pull in sets of instructions for the computer called ‘packages’ that are
specific to data analysis and Machine Learning.
Don’t be intimidated by this first step; this is ‘boilerplate code’.
# No warnings
import warnings
warnings.filterwarnings('ignore') # Filter out warnings
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd
pd.set_option('display.max_columns', 100) # Print 100 Pandas columns
# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
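# machine learning models (one import per algorithm used later, plus a scoring helper)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score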

Retrieve our Dataset
Tell the program where our data is located and give it a name: “train_df”
train_df = pd.read_csv('train.csv')

Now, let’s explore our data

Explore our data
What columns are available?
print(train_df.columns.values)

Explore our data
How does the data look?
# preview the data
train_df.head(5)
General data statistics
# General data statistics
train_df.describe()

Visualize our data
How is the data in each column distributed?
Let’s look at a histogram of each numeric column.
train_df.hist(figsize=(13,10))
plt.show()

Clean our data
What if ‘age’ is not available for some people?
We could replace missing data with the median or mean.
print("Median Age: ",train_df.Age.median())
print("Mean Age: ",train_df.Age.mean())
# We'll use the mean
train_df['Age'] = train_df['Age'].fillna(train_df.Age.mean())
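
Before filling gaps, it can help to count how many values are actually missing. Here is a minimal sketch of that check on a tiny made-up table; `toy_df` and its ages are invented for illustration and are not taken from train.csv.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; the workshop uses train_df instead
toy_df = pd.DataFrame({"Age": [22.0, np.nan, 30.0, np.nan, 38.0]})

# Count missing values in the column before cleaning
missing_before = toy_df["Age"].isnull().sum()

# Fill gaps with the mean of the known ages (mean() skips missing values by default)
age_mean = toy_df["Age"].mean()
toy_df["Age"] = toy_df["Age"].fillna(age_mean)

# Count again to confirm nothing is missing anymore
missing_after = toy_df["Age"].isnull().sum()
print("Missing before:", missing_before, "| after:", missing_after)
print("Mean used for filling:", age_mean)
```

The same two `isnull().sum()` calls should work on `train_df['Age']` before and after the fill.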

Clean our data
Algorithms need numerical inputs.
How do we handle non-numbers, like ‘male’ or ‘female’?
We’ll map them to binary values (male = 0, female = 1)
train_df['Sex'] = train_df['Sex'].map( {'female': 1, 'male': 0} ).astype(int)
train_df.head()
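
One thing to watch with this kind of mapping: any value that is not listed in the dictionary turns into a missing value, and the conversion to integers then fails. A small sketch of a safety check, using a made-up `toy_df` (an assumption for illustration, not the workshop data):

```python
import pandas as pd

# Toy data, invented for illustration
toy_df = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})
sex_codes = {"female": 1, "male": 0}

# Values missing from the dictionary would become NaN, so check before converting
mapped = toy_df["Sex"].map(sex_codes)
unmapped_count = mapped.isnull().sum()

# Only convert to integers when every value was mapped
if unmapped_count == 0:
    toy_df["Sex"] = mapped.astype(int)
print(toy_df["Sex"].tolist())
```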

Clean our data
Let’s drop columns we won’t use from our dataset
# Remove columns that we won't use
train_df = train_df.drop(['PassengerId','Name', 'Parch', 'SibSp','Embarked','Ticket', 'Cabin'], axis=1)
train_df.head(5)

Split our data into input variables and output to predict
Define our independent (X) and dependent (Y) variables
X = train_df.drop("Survived", axis=1) # Training & Validation data
Y = train_df["Survived"] # Response / Target Variable
print(X.shape, Y.shape)

Split our data into training data and validation data
Separate 20% of our data as ‘hold-out’
We’ll use it to test our models.
# Split training set so that we test on 20% of the data
# Note that our algorithms will never have seen the validation
# data during training. This is to evaluate how good our estimators are.
np.random.seed(1337) # set random seed for reproducibility
from sklearn.model_selection import train_test_split
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.2)
print(X_train.shape, Y_train.shape)
print(X_val.shape, Y_val.shape)
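
To see what an 80/20 split does without loading the dataset, here is a minimal sketch on made-up arrays. The names `X_toy` and `Y_toy` are hypothetical, and passing `random_state` directly to `train_test_split` is shown as an alternative way to get a repeatable split, not as the workshop's method.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy inputs, invented for illustration: 10 rows, 2 feature columns
X_toy = np.arange(20).reshape(10, 2)
Y_toy = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# random_state fixes the shuffle so the same rows land in each split on every run
X_tr, X_va, Y_tr, Y_va = train_test_split(
    X_toy, Y_toy, test_size=0.2, random_state=1337
)

# 20% of 10 rows is 2 rows held out, leaving 8 rows for training
print(X_tr.shape, Y_tr.shape)
print(X_va.shape, Y_va.shape)
```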

Testing different Machine Learning Models
There are many different Machine Learning algorithms.
We’ll model and predict using four algorithms, then choose the best performer.
● Logistic Regression
● KNN (k-nearest neighbors)
● Random Forest
● Neural Net

Run each of our 4 models
Don’t worry so much about the specific code syntax. Instead, note that we’re
testing different algorithms and assessing accuracy.
Logistic Regression:
logreg = LogisticRegression() # instantiate
logreg.fit(X_train, Y_train) # fit
Y_pred_train = logreg.predict(X_train) # predict
acc_log_train = sum(Y_pred_train == Y_train)/len(Y_train)*100
print('Logistic Regression Training Accuracy:', str(round(acc_log_train,2)),'%')
Y_pred = logreg.predict(X_val) # predict
acc_log = sum(Y_pred == Y_val)/len(Y_val)*100
print('Logistic Regression Validation Accuracy:', str(round(acc_log,2)),'%')

KNN:
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
print('K-nearest Neighbors Training Accuracy:', str(round(knn.score(X_train, Y_train)*100,2)),'%')
acc_knn = knn.score(X_val, Y_val)
print('K-nearest Neighbors Validation Accuracy:', str(round(acc_knn*100,2)),'%')

Random Forest:
# Random Forest
random_forest = RandomForestClassifier(n_estimators=10)
random_forest.fit(X_train, Y_train)
print('Random Forest Training Accuracy:', str(round(random_forest.score(X_train, Y_train)*100,2)),'%')
acc_random_forest = random_forest.score(X_val, Y_val)
print('Random Forest Validation Accuracy:', str(round(acc_random_forest*100,2)),'%')

Neural Net:
# Neural Net
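hidden_lay = 10   # number of nodes in the first hidden layer
hidden_node = 10  # number of nodes in the second hidden layer
# hidden_layer_sizes=(10, 10) builds two hidden layers with 10 nodes each
NN_model = MLPClassifier(hidden_layer_sizes=(hidden_lay, hidden_node),
                         activation='relu', solver='lbfgs').fit(X_train, Y_train)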
print('Neural Net Training Accuracy:', str(round(accuracy_score(Y_train,NN_model.predict(X_train))*100,2)),'%')
print('Neural Net Validation Accuracy:', str(round(accuracy_score(Y_val,NN_model.predict(X_val))*100,2)),'%')

Assess
Which model was the best?
Print accuracy scores for each model.
print('Logistic Regression Training Accuracy:', str(round(acc_log_train,2)),'%')
print('Logistic Regression Validation Accuracy:', str(round(acc_log,2)),'%')
print('K-nearest Neighbors Training Accuracy:', str(round(knn.score(X_train, Y_train)*100,2)),'%')
print('K-nearest Neighbors Validation Accuracy:', str(round(acc_knn*100,2)),'%')
print('Random Forest Training Accuracy:', str(round(random_forest.score(X_train, Y_train)*100,2)),'%')
print('Random Forest Validation Accuracy:', str(round(acc_random_forest*100,2)),'%')
print('Neural Net Training Accuracy:', str(round(accuracy_score(Y_train,NN_model.predict(X_train))*100,2)),'%')
print('Neural Net Validation Accuracy:', str(round(accuracy_score(Y_val,NN_model.predict(X_val))*100,2)),'%')
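
Reading eight printed lines by eye works, but the comparison can also be done in code. The sketch below is one possible way to do it, on synthetic stand-in data from `make_classification` rather than the workshop dataset; the names `models`, `val_scores` and `best_name` are hypothetical, and only two of the four algorithms are included to keep it short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the workshop dataset)
X_demo, Y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
X_tr, X_va, Y_tr, Y_va = train_test_split(
    X_demo, Y_demo, test_size=0.2, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(),
    "KNN": KNeighborsClassifier(n_neighbors=3),
}

# Fit each model and record its accuracy on the held-out validation rows
val_scores = {}
for name, model in models.items():
    model.fit(X_tr, Y_tr)
    val_scores[name] = model.score(X_va, Y_va)

# Pick the model with the highest validation accuracy
best_name = max(val_scores, key=val_scores.get)
for name, score in val_scores.items():
    print(name, "validation accuracy:", round(score * 100, 2), "%")
print("Best on validation data:", best_name)
```

The choice is made on validation accuracy, not training accuracy, because the validation rows were never seen during fitting.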

Now what? How to understand our model
We have a model, but what drives it?
We can analyze model drivers (sometimes).
# Look at importance of features for random forest
def plot_model_var_imp(model, X, y):
    # Put each feature's importance score in a table, labeled by column name
    imp = pd.DataFrame(
        model.feature_importances_,
        columns=['Importance'],
        index=X.columns
    )
    # Sort ascending and plot up to 10 features as horizontal bars
    imp = imp.sort_values(['Importance'], ascending=True)
    imp[:10].plot(kind='barh')
    print('Training accuracy Random Forest:', model.score(X, y))
plot_model_var_imp(random_forest, X_train, Y_train)

Related Courses
● CS188: Intro to Artificial Intelligence
○ Course Materials
○ Blurb: This class was quite technical. I audited it and found the content interesting to listen in on.
I’m not sure I’d have had the time or expertise to take the full course (Alex)
○ One of my favorite classes (Sven)
● INFO 254: Data Mining & Analytics
○ https://www.ischool.berkeley.edu/courses/info/254
○ Blurb: Learn about how ML algorithms work and mental models for what models to use.
Python knowledge required
● DataX
○ https://data-x.blog/
○ Blurb: Great for learning how to apply ML (not necessarily how all of the models work).
Python is required

Additional Resources
● An additional tutorial: https://machinelearningmastery.com/machine-learning-in-python-step-by-step/
○ Tutorial requires you to execute code ‘locally’ (on your computer rather than on the web). This requires you to download
additional software which is further explained here:
○ https://machinelearningmastery.com/setup-python-environment-machine-learning-deep-learning-anaconda/

Machine Learning Definition
Machine learning algorithms are described as learning a target function (f) that
best maps input variables (X) to an output variable (Y): Y = f(X)
This is a general learning task where we would like to make predictions in the
future (Y) given new examples of input variables (X). We don’t know what the
function (f) looks like or its form. If we did, we would use it directly and we would
not need to learn it from data using machine learning algorithms.
Source: Medium article - A tour of the top 10 algorithms for machine learning newbies
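
To make Y = f(X) concrete, here is a toy sketch. It assumes a hidden function `true_f` (invented for illustration; in a real project f is unknown) and learns a straight-line approximation of it from a handful of example points using ordinary least squares.

```python
# Hidden "true" function, assumed for illustration; in practice f is unknown
def true_f(x):
    return 2.0 * x + 1.0

# Example inputs (X) and the outputs (Y) we observed for them
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [true_f(x) for x in xs]

# Learn a straight-line approximation of f by ordinary least squares
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x

def learned_f(x):
    """Prediction for a new input, using only what was learned from the examples."""
    return slope * x + intercept

print("Learned slope and intercept:", slope, intercept)
print("Prediction for a new input x=10:", learned_f(10.0))
```

The learned function never looks at `true_f` directly; it only sees the example pairs, which is the situation the definition above describes.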

Algorithm overviews
Logistic Regression: Logistic regression is a technique borrowed by machine learning from the field of statistics. It
is the go-to method for binary classification problems (problems with two class values). Logistic regression is like linear
regression in that the goal is to find the values for the coefficients that weight each input variable. Unlike linear regression,
the prediction for the output is transformed using a non-linear function called the logistic function. The logistic function looks
like a big S and will transform any value into the range 0 to 1. This is useful because we can apply a rule to the output of the
logistic function to snap values to 0 and 1 (e.g. IF less than 0.5 then output 0, otherwise output 1) and predict a class value.
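
The S-shaped function and the snapping rule described above can be sketched in a few lines. The function names `logistic` and `predict_class` are hypothetical, and the 0.5 threshold is the conventional choice rather than a requirement.

```python
import math

def logistic(z):
    """Logistic (sigmoid) function: maps any real number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(z, threshold=0.5):
    """Snap the logistic output to a class: 1 at or above the threshold, else 0."""
    return 1 if logistic(z) >= threshold else 0

# Negative inputs land below 0.5, positive inputs above it
for z in (-4.0, 0.0, 4.0):
    print(z, round(logistic(z), 3), predict_class(z))
```

In a fitted model, `z` would be the weighted sum of the input variables using the learned coefficients.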
KNN (k-nearest neighbors): Predictions are made for a new data point by searching through the entire training set for the
K most similar instances (the neighbors) and summarizing the output variable for those K instances. For regression
problems, this might be the mean output variable, for classification problems this might be the mode (or most common)
class value.
The trick is in how to determine the similarity between the data instances. The simplest technique if your attributes are all of
the same scale (all in inches for example) is to use the Euclidean distance, a number you can calculate directly based on
the differences between each input variable.
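
As an illustration of the idea, here is a from-scratch sketch of k-nearest neighbors for classification, using Euclidean distance and a majority vote. The points and labels are made up, and the helper names are hypothetical; the workshop itself uses scikit-learn's `KNeighborsClassifier`.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two points with the same number of inputs."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_points, train_labels, new_point, k=3):
    # Sort training rows by distance to the new point and keep the k closest
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: euclidean(pair[0], new_point),
    )
    nearest_labels = [label for _, label in ranked[:k]]
    # Classification: return the most common label among the neighbors
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy data, invented for illustration: two well-separated groups
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, (1.2, 1.4)))
print(knn_predict(points, labels, (8.4, 8.6)))
```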

Algorithm overviews
Random Forest: Random Forest is one of the most popular and most powerful machine learning algorithms. It is a type of
ensemble machine learning algorithm called Bootstrap Aggregation or bagging.
The bootstrap is a powerful statistical method for estimating a quantity, such as a mean, from a data sample. You take lots
of samples of your data (drawn with replacement), calculate the mean of each, then average all of your mean values to give you a better estimation of the true
mean value.
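
A minimal sketch of the bootstrap for a mean, assuming resamples are drawn with replacement and are the same size as the original sample. The `sample` values, the function name and the default of 1000 resamples are all choices made for illustration.

```python
import random
import statistics

def bootstrap_mean(sample, n_resamples=1000, seed=0):
    """Average the means of many resamples drawn with replacement."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    resample_means = []
    for _ in range(n_resamples):
        # Draw len(sample) values with replacement from the original sample
        resample = [rng.choice(sample) for _ in sample]
        resample_means.append(statistics.mean(resample))
    return statistics.mean(resample_means)

# Toy data, invented for illustration
sample = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 11.0]
estimate = bootstrap_mean(sample)
print("Sample mean:", statistics.mean(sample))
print("Bootstrap estimate of the mean:", estimate)
```

Bagging applies the same resampling idea, but fits a model (usually a decision tree) to each resample instead of computing a mean.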
In bagging, the same approach is used, but instead for estimating entire statistical models, most commonly decision trees.
Multiple samples of your training data are taken then models are constructed for each data sample. When you need to
make a prediction for new data, each model makes a prediction and the predictions are averaged to give a better estimate
of the true output value. Random forest is a tweak on this approach where decision trees are created so that rather than
selecting optimal split points, suboptimal splits are made by introducing randomness.
The models created for each sample of the data are therefore more different than they otherwise would be, but still accurate
in their unique and different ways. Combining their predictions results in a better estimate of the true underlying output
value.

Algorithm overviews
Last Takeaway: A typical question asked by a beginner, when facing a wide variety of machine learning algorithms, is
“which algorithm should I use?” The answer to the question varies depending on many factors, including: (1) The size,
quality, and nature of data; (2) The available computational time; (3) The urgency of the task; and (4) What you want to do
with the data.
Even an experienced data scientist cannot tell which algorithm will perform the best before trying different algorithms.