Completing your first Machine Learning Project
Today’s Workshop
Goal:
● General overview of what Data Scientists do when they use “Machine Learning”
● Get our hands dirty with data & code, even if we don’t fully understand it all
● Complete a project you can talk about & share with others
Non-Goals:
● Understand all code / syntax
● Learn to write or deeply understand ML algorithms
● Become a Data Scientist
Audience Expectations:
● Limited or no coding experience
● New to Machine Learning
High-level Machine Learning process steps
● Identify Business Problem / Goal
● Find & Clean Data
● Descriptive Statistics on Data
● Visualize the Data
● Create a Holdout Sample for Testing
● Run Machine Learning Models
● Assess Accuracy & Finalize Model
● Make Predictions and Test
First, credit to the authors
Original code comes from Data-X:
https://github.com/ikhlaqsidhu/data-x
Project Set-Up
Set-up
1. Download the code and dataset
2. Login to Jupyter where we’ll run our code
a. Create a new folder in Jupyter
b. Unzip the file you downloaded
c. Upload the dataset & code to the folder
Find this presentation by going to: tinyurl.com/haasFirstML
Import Required Packages
Here, we pull in sets of instructions for the computer, called ‘packages’, that are
specific to Machine Learning.
Don’t be intimidated by this first step; this is ‘boilerplate code’.
# No warnings
import warnings
warnings.filterwarnings('ignore') # Filter out warnings
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd
pd.set_option('display.max_columns', 100) # Print 100 Pandas columns
# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
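# machine learning models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score # used to score the Neural Net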
Retrieve our Dataset
Tell the program where our data is located and give it a name: “train_df”
train_df = pd.read_csv('train.csv')
Now, let’s explore our data
Explore our data
What columns are available?
print(train_df.columns.values)
How does the data look?
# preview the data
train_df.head(5)
General data statistics
# General data statistics
train_df.describe()
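describe() summarizes only the numeric columns and skips missing values, so it can help to count the gaps directly before cleaning. A minimal sketch, assuming a small made-up table (toy_df and its rows are illustrative and do not come from train.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train_df; the values are made up for illustration
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.28, 8.05, 53.10],
    "Cabin": [None, "C85", None, "C123"],
})

# Count the missing entries in each column
missing_counts = toy_df.isnull().sum()
print(missing_counts)
```

Running the same call on train_df would show which columns need attention in the cleaning step.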
Visualize our data
How is the data in each column distributed?
Let’s look at a Histogram
train_df.hist(figsize=(13,10))
plt.show()
Clean our data
What if ‘age’ is not available for some people?
We could replace missing data with the median or mean.
print("Median Age: ",train_df.Age.median())
print("Mean Age: ",train_df.Age.mean())
# We'll use the mean
train_df['Age'] = train_df['Age'].fillna(train_df.Age.mean())
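The choice between mean and median matters most when a column has a few extreme values. A minimal sketch on made-up ages (the Series below is an assumption for illustration, not the workshop data):

```python
import numpy as np
import pandas as pd

# Made-up ages with two missing values and one unusually old passenger
ages = pd.Series([20.0, 22.0, 24.0, np.nan, 80.0, np.nan])

mean_age = ages.mean()      # missing values are skipped
median_age = ages.median()  # less sensitive to the single large value

filled_with_mean = ages.fillna(mean_age)
filled_with_median = ages.fillna(median_age)

print("Mean:", mean_age, "Median:", median_age)
```

Here the single large value pulls the mean well above the median, so the two fills would give noticeably different ages to the passengers with missing data.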
Clean our data
Algorithms need numerical inputs.
How do we handle non-numbers, like ‘male’ or ‘female’?
We’ll map them to binary values (male = 0, female = 1)
train_df['Sex'] = train_df['Sex'].map( {'female': 1, 'male': 0} ).astype(int)
train_df.head()
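One thing to watch with map: any label that is not in the dictionary becomes a missing value, and the later astype(int) would then fail. A minimal sketch, assuming a made-up column with one unexpected spelling:

```python
import pandas as pd

# Made-up column with one unexpected label ('Male' with a capital M)
sex = pd.Series(["male", "female", "Male"])

mapped = sex.map({"female": 1, "male": 0})

# Labels missing from the dictionary become NaN
unmapped = sex[mapped.isnull()].unique()
print("Unmapped labels:", list(unmapped))

# Normalizing the text first lets every row map cleanly
clean = sex.str.lower().map({"female": 1, "male": 0}).astype(int)
```

Checking for unmapped labels before casting is a cheap way to catch surprises in a new dataset.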
Clean our data
Let’s drop columns we won’t use from our dataset
# Remove columns that we won't use
train_df = train_df.drop(['PassengerId','Name', 'Parch', 'SibSp','Embarked','Ticket', 'Cabin'], axis=1)
train_df.head(5)
Next up: Machine Learning!
Split our data into input variables and output to predict
Define our independent (X) and dependent (Y) variables
X = train_df.drop("Survived", axis=1) # Training & Validation data
Y = train_df["Survived"] # Response / Target Variable
print(X.shape, Y.shape)
Split our data into training data and validation data
Separate 20% of our data as ‘hold-out’
We’ll use it to test our models.
# Split training set so that we test on 20% of the data
# Note that our algorithms will never have seen the validation
# data during training. This is to evaluate how good our estimators are.
np.random.seed(1337) # set random seed for reproducibility
from sklearn.model_selection import train_test_split
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.2)
print(X_train.shape, Y_train.shape)
print(X_val.shape, Y_val.shape)
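As an alternative to setting a global seed, train_test_split accepts random_state and stratify arguments. A minimal sketch on synthetic data (X_demo and y_demo are made up here; this is one possible variation, not the workshop's original code):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 30% positive labels (made up for illustration)
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 30 + [0] * 70)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo,
    test_size=0.2,
    random_state=1337,  # fixes the shuffle for this call only
    stratify=y_demo,    # keeps the share of each class the same in both parts
)

print(X_tr.shape, X_va.shape)
print("Positive share:", y_tr.mean(), y_va.mean())
```

Stratifying can be useful when one outcome is much rarer than the other, because a purely random 20% sample might otherwise contain too few examples of it.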
Testing different Machine Learning Models
There are many different Machine Learning algorithms.
We’ll model and predict using 4 algorithms, then choose the best performer.
● Logistic Regression
● KNN (k-nearest neighbors)
● Random Forest
● Neural Net
Run each of our 4 models
Don’t worry so much about the specific code syntax. Instead, note that we’re
testing different algorithms and assessing accuracy.
Logistic Regression:
logreg = LogisticRegression() # instantiate
logreg.fit(X_train, Y_train) # fit
Y_pred_train = logreg.predict(X_train) # predict
acc_log_train = sum(Y_pred_train == Y_train)/len(Y_train)*100
print('Logistic Regression Training Accuracy:', str(round(acc_log_train,2)),'%')
Y_pred = logreg.predict(X_val) # predict
acc_log = sum(Y_pred == Y_val)/len(Y_val)*100
print('Logistic Regression Validation Accuracy:', str(round(acc_log,2)),'%')
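The accuracy computed by hand above (correct predictions divided by total) is the same quantity that scikit-learn reports through accuracy_score and the model's score method. A minimal sketch on synthetic data (X_demo and y_demo are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic, made-up data: the label is 1 when the single feature is positive
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 1))
y_demo = (X_demo[:, 0] > 0).astype(int)

model = LogisticRegression().fit(X_demo, y_demo)
pred = model.predict(X_demo)

manual = np.mean(pred == y_demo) * 100           # share of correct predictions
via_metric = accuracy_score(y_demo, pred) * 100  # same value from sklearn.metrics
via_score = model.score(X_demo, y_demo) * 100    # same value from the model itself

print(manual, via_metric, via_score)
```

This is why the later models can use score or accuracy_score interchangeably with the manual formula.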
Run each of our 4 models
KNN:
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
print('K-nearest Neighbors Training Accuracy:', str(round(knn.score(X_train, Y_train)*100,2)),'%')
acc_knn = knn.score(X_val, Y_val)
print('K-nearest Neighbors Validation Accuracy:', str(round(acc_knn*100,2)),'%')
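Because KNN compares rows by distance, a column with large numbers (such as a fare) can drown out a column with small ones. A minimal sketch of one common remedy, scaling the features first, on synthetic data (the data, the noise scale, and the use of StandardScaler are assumptions for illustration, not part of the original workshop code):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, made-up data: only the first (small-scale) feature carries signal;
# the second feature is large-scale noise
rng = np.random.RandomState(0)
signal = rng.normal(size=300)
noise = rng.normal(scale=1000.0, size=300)
X_demo = np.column_stack([signal, noise])
y_demo = (signal > 0).astype(int)

X_tr, X_va = X_demo[:200], X_demo[200:]
y_tr, y_va = y_demo[:200], y_demo[200:]

# KNN on the raw features versus KNN on standardized features
raw_knn = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
scaled_knn = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=3)
).fit(X_tr, y_tr)

raw_acc = raw_knn.score(X_va, y_va)
scaled_acc = scaled_knn.score(X_va, y_va)
print("Raw:", raw_acc, "Scaled:", scaled_acc)
```

In this toy setup the unscaled model is close to guessing, since its distances are dominated by the noisy column.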
Run each of our 4 models
Random Forest:
# Random Forest
random_forest = RandomForestClassifier(n_estimators=10)
random_forest.fit(X_train, Y_train)
print('Random Forest Training Accuracy:', str(round(random_forest.score(X_train, Y_train)*100,2)),'%')
acc_random_forest = random_forest.score(X_val, Y_val)
print('Random Forest Validation Accuracy:', str(round(acc_random_forest*100,2)),'%')
Run each of our 4 models
Neural Net:
# Neural Net
# hidden_layer_sizes=(10, 10) means two hidden layers with 10 nodes each
hidden_lay = 10
hidden_node = 10
NN_model = MLPClassifier(hidden_layer_sizes=(hidden_lay, hidden_node),
                         activation='relu', solver='lbfgs').fit(X_train, Y_train)
print('Neural Net Training Accuracy:', str(round(accuracy_score(Y_train,NN_model.predict(X_train))*100,2)),'%')
print('Neural Net Validation Accuracy:', str(round(accuracy_score(Y_val,NN_model.predict(X_val))*100,2)),'%')
Assess
Which model was the best?
Print accuracy scores for each model.
print('Logistic Regression Training Accuracy:', str(round(acc_log_train,2)),'%')
print('Logistic Regression Validation Accuracy:', str(round(acc_log,2)),'%')
print('K-nearest Neighbors Training Accuracy:', str(round(knn.score(X_train, Y_train)*100,2)),'%')
print('K-nearest Neighbors Validation Accuracy:', str(round(acc_knn*100,2)),'%')
print('Random Forest Training Accuracy:', str(round(random_forest.score(X_train, Y_train)*100,2)),'%')
print('Random Forest Validation Accuracy:', str(round(acc_random_forest*100,2)),'%')
print('Neural Net Training Accuracy:', str(round(accuracy_score(Y_train,NN_model.predict(X_train))*100,2)),'%')
print('Neural Net Validation Accuracy:', str(round(accuracy_score(Y_val,NN_model.predict(X_val))*100,2)),'%')
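To pick a winner, it can be easier to collect the validation scores in one place and sort them. A minimal sketch; the numbers below are placeholders chosen for illustration, not results from this workshop, and should be replaced with the values printed above:

```python
# Hypothetical validation accuracies (in %); placeholders, not real results
val_scores = {
    "Logistic Regression": 79.3,
    "K-nearest Neighbors": 70.4,
    "Random Forest": 81.6,
    "Neural Net": 78.8,
}

# Sort from best to worst and print a small leaderboard
ranked = sorted(val_scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranked:
    print(f"{name:<22}{score:6.2f} %")

best_model_name = ranked[0][0]
```

Comparing on validation accuracy rather than training accuracy matters here, since a model can score very high on data it has already seen.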
Now what? How to understand our model
We have a model, but what drives it?
We can analyze model drivers (sometimes).
# Look at importance of features for random forest
def plot_model_var_imp(model, X, y):
    # One importance score per input column, as reported by the fitted model
    imp = pd.DataFrame(
        model.feature_importances_,
        columns=['Importance'],
        index=X.columns
    )
    imp = imp.sort_values(['Importance'], ascending=True)
    imp[:10].plot(kind='barh')
    print('Training accuracy Random Forest:', model.score(X, y))

plot_model_var_imp(random_forest, X_train, Y_train)
Appendix
Related Courses
● CS188: Intro to Artificial Intelligence
○ Course Materials
○ Blurb: This class was quite technical. I audited it and found the content interesting to listen in on.
I’m not sure I’d have wanted to commit the time, or had the expertise, to take the full course (Alex)
○ One of my favorite classes (Sven)
● INFO 254: Data Mining & Analytics
○ https://www.ischool.berkeley.edu/courses/info/254
○ Blurb: Learn about how ML algorithms work and mental models for what models to use.
Python knowledge required
● DataX
○ https://data-x.blog/
○ Blurb: Great for learning how to apply ML (not necessarily how all of the models work).
Python is required
Additional Resources
● An additional tutorial: https://machinelearningmastery.com/machine-learning-in-python-step-by-step/
○ Tutorial requires you to execute code ‘locally’ (on your computer rather than on the web). This requires you to download
additional software, which is explained further here:
○ https://machinelearningmastery.com/setup-python-environment-machine-learning-deep-learning-anaconda/
Machine Learning Definition
Machine learning algorithms are described as learning a target function (f) that
best maps input variables (X) to an output variable (Y): Y = f(X)
This is a general learning task where we would like to make predictions in the
future (Y) given new examples of input variables (X). We don’t know what the
function (f) looks like or its form. If we did, we would use it directly and we would
not need to learn it from data using machine learning algorithms.
Source: Medium article - A tour of the top 10 algorithms for machine learning newbies
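The mapping Y = f(X) described above can be illustrated with a tiny example. A minimal sketch, assuming a made-up hidden rule (y = 2x + 1) and using linear regression purely for simplicity; the workshop itself uses classification models:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up "hidden" rule the algorithm is not told about: y = 2x + 1
X_demo = np.arange(10, dtype=float).reshape(-1, 1)
y_demo = 2 * X_demo[:, 0] + 1

# Learn an approximation of f from the examples alone
learned_f = LinearRegression().fit(X_demo, y_demo)

# Predict Y for a new input the model has not seen
new_prediction = learned_f.predict(np.array([[20.0]]))[0]
print(round(new_prediction, 2))
```

The model never sees the rule itself, only pairs of inputs and outputs, yet it can still predict the output for a new input.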
Algorithm overviews
Logistic Regression: Logistic regression is another technique borrowed by machine learning from the field of statistics. It
is the go-to method for binary classification problems (problems with two class values). Logistic regression is like linear
regression in that the goal is to find the values for the coefficients that weight each input variable. Unlike linear regression,
the prediction for the output is transformed using a non-linear function called the logistic function. The logistic function looks
like a big S and will transform any value into the range 0 to 1. This is useful because we can apply a rule to the output of the
logistic function to snap values to 0 and 1 (e.g. IF less than 0.5 then output 0, otherwise output 1) and predict a class value.
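The logistic function and the snapping rule can be written in a few lines. A minimal sketch, assuming the common convention that outputs of 0.5 or more are labeled 1 (the function names and input scores are hypothetical, chosen for illustration):

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_class(z, threshold=0.5):
    """Snap the logistic output to a class: 1 at or above the threshold, else 0."""
    return (logistic(z) >= threshold).astype(int)

# Made-up raw scores (weighted sums of inputs) for five examples
scores = np.array([-4.0, -0.5, 0.0, 0.5, 4.0])
probabilities = logistic(scores)
classes = predict_class(scores)
print(probabilities, classes)
```

Large negative scores land near 0, large positive scores land near 1, and a score of exactly 0 sits at 0.5.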
KNN (k-nearest neighbors): Predictions are made for a new data point by searching through the entire training set for the
K most similar instances (the neighbors) and summarizing the output variable for those K instances. For regression
problems, this might be the mean output variable, for classification problems this might be the mode (or most common)
class value.
The trick is in how to determine the similarity between the data instances. The simplest technique if your attributes are all of
the same scale (all in inches for example) is to use the Euclidean distance, a number you can calculate directly based on
the differences between each input variable.
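The KNN idea described above fits in a short function. A minimal sketch for classification, assuming Euclidean distance and a simple majority vote (knn_predict and the toy points are hypothetical, not scikit-learn's implementation):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training row
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]         # indices of the k closest rows
    votes = Counter(y_train[nearest].tolist())  # count class labels among them
    return votes.most_common(1)[0][0]

# Made-up points: class 0 near the origin, class 1 near (5, 5)
X_pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                  [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]])
y_pts = np.array([0, 0, 0, 1, 1, 1])

label_a = knn_predict(X_pts, y_pts, np.array([0.5, 0.5]))
label_b = knn_predict(X_pts, y_pts, np.array([5.5, 5.5]))
print(label_a, label_b)
```

Note that there is no training step beyond storing the data; all of the work happens at prediction time.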
Algorithm overviews
Random Forest: Random Forest is one of the most popular and most powerful machine learning algorithms. It is a type of
ensemble machine learning algorithm called Bootstrap Aggregation or bagging.
The bootstrap is a powerful statistical method for estimating a quantity, such as a mean, from a data sample. You take lots
of samples of your data, calculate the mean of each, then average all of your mean values to give you a better estimation of the true
mean value.
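The bootstrap procedure for a mean can be sketched directly with NumPy. This is an illustration on a made-up sample; the sample size, the 1000 resamples, and the seed are arbitrary choices:

```python
import numpy as np

rng = np.random.RandomState(42)

# Made-up sample of 50 observations
sample = rng.normal(loc=30.0, scale=10.0, size=50)

# Resample with replacement many times and record each resample's mean
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(1000)
])

bootstrap_estimate = boot_means.mean()  # average of the resampled means
bootstrap_spread = boot_means.std()     # rough uncertainty of the mean
print(bootstrap_estimate, bootstrap_spread)
```

The spread of the resampled means is often the more useful output, since it gives a sense of how much the estimate would vary with a different sample.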
In bagging, the same approach is used, but instead for estimating entire statistical models, most commonly decision trees.
Multiple samples of your training data are taken then models are constructed for each data sample. When you need to
make a prediction for new data, each model makes a prediction and the predictions are averaged to give a better estimate
of the true output value. Random forest is a tweak on this approach where decision trees are created so that rather than
selecting optimal split points, suboptimal splits are made by introducing randomness.
The models created for each sample of the data are therefore more different than they otherwise would be, but still accurate
in their unique and different ways. Combining their predictions results in a better estimate of the true underlying output
value.
Algorithm overviews
Last Takeaway: A typical question asked by a beginner, when facing a wide variety of machine learning algorithms, is
“which algorithm should I use?” The answer to the question varies depending on many factors, including: (1) The size,
quality, and nature of data; (2) The available computational time; (3) The urgency of the task; and (4) What you want to do
with the data.
Even an experienced data scientist cannot tell which algorithm will perform the best before trying different algorithms.
