Which flower species is it?
Building Models with Data
What makes this a good learning dataset?
• Clean dataset available: 4 numeric attributes with no missing values
• Target is 3 different species of flowers. Multi-class classification.
• Well known dataset
What software do I need?
• IDE to run Python
• Online: https://repl.it
• Code Editor: VS Code https://code.visualstudio.com/download
• Data Science Platform: Anaconda https://www.anaconda.com/distribution/
Data Source
• Iris dataset, available from the UCI Machine Learning Repository
• Headerless CSV: four measurements followed by the species label
import pandas as pd

def load_data(url):
    '''
    Loads data into the Python environment.
    Parameters: url of a headerless .csv file
    Returns: dataframe
    '''
    # The file has no header row, so supply the column names explicitly.
    variables = ['sepal_len', 'sepal_w', 'petal_len',
                 'petal_w', 'class']
    df = pd.read_csv(url, names=variables)
    return df
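To make the role of the names= argument concrete, here is a minimal standard-library sketch of reading a headerless file with explicit column names. The helper name parse_rows and the two sample rows are illustrative assumptions, not part of the deck; the deck's own loader uses pandas.

```python
import csv
import io

# Column names used by the deck's load_data function.
VARIABLES = ['sepal_len', 'sepal_w', 'petal_len', 'petal_w', 'class']

def parse_rows(text, names=VARIABLES):
    """Parse headerless CSV text into dicts keyed by `names` (hypothetical helper)."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:  # skip blank lines
            continue
        row = dict(zip(names, record))
        # The first four fields are numeric measurements; the last is the label.
        for key in names[:-1]:
            row[key] = float(row[key])
        rows.append(row)
    return rows

# Two made-up sample rows in the same layout (four numbers, then a label).
sample = "5.1,3.5,1.4,0.2,setosa\n7.0,3.2,4.7,1.4,versicolor\n"
rows = parse_rows(sample)
print(rows[0]['petal_len'], rows[1]['class'])
```

Without explicit names, the first data row would be consumed as a header, which is why the loader passes them in.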
Exploratory Data Analysis
1. Summary statistics
2. Data visualization
3. Data processing
Summary Statistics
• # of rows, # of features
• Frequency distribution
• Number of missing values

Variable        # of missing values
Sepal Width     0
Sepal Length    0
Petal Width     0
Petal Length    0
def summary_statistics(df):
    '''
    Prints summary statistics: the number of instances and columns,
    the five-number summary, class frequencies, and missing values.
    Parameters: dataframe
    Returns: none
    '''
    # shape (the column count includes the 'class' label column)
    print('Shape of dataframe: %d instances and %d columns' %
          (df.shape[0], df.shape[1]))
    # description: count, mean, std, min, quartiles, max per numeric column
    print(df.describe())
    # class frequency
    print(df.groupby('class').size())
    # missing values per column
    print(df.isnull().sum())
Flower Species   # of instances
setosa           50
versicolor       50
virginica        50

The base rate is 0.33 (50 of the 150 instances belong to each species). Our model has to beat that.
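The base rate can be computed directly from the class counts. This is a minimal sketch that assumes the base rate means the accuracy of always predicting the most frequent class; the function name base_rate is illustrative.

```python
from collections import Counter

# Class counts from the frequency table in the deck.
labels = ['setosa'] * 50 + ['versicolor'] * 50 + ['virginica'] * 50

def base_rate(labels):
    """Accuracy of always predicting the most frequent class."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

rate = base_rate(labels)
print('Base rate: %.2f' % rate)
```

Because the three classes are perfectly balanced, any constant guess scores the same 50 / 150.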
Data Visualization
• Box Plot
• Histogram
• Scatter plot
• Correlation table
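import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

def visualize(df):
    '''
    Visualizes data using a box plot, histogram, scatter
    matrix, and correlation matrix.
    Parameters: dataframe
    Returns: none
    '''
    # box plot, one panel per numeric feature
    df.plot(kind='box', subplots=True, layout=(2,2),
            showfliers=True, sharex=False, sharey=False)
    plt.show()
    # histogram - distribution
    df.hist()
    plt.show()
    # scatter matrix
    scatter_matrix(df)
    plt.show()
    print()
    # correlation matrix of the numeric columns only; 'class' holds strings
    corr = df.drop(columns='class').corr()
    # .style.background_gradient() only renders in a notebook, so print instead
    print(corr)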
Data Processing
• The dataset is already clean, so minimal processing is needed.
• All features will be selected
• Split into training and test sets
Split Data Set
• We have a small data set, so later on we will use 10-fold cross-validation to
get a more reliable estimate of model performance.
from sklearn import model_selection

def split_train_test(df):
    '''
    Splits available data into 80% training set, 20% test
    set.
    Parameters: dataframe
    Returns: training set - features and output, test set -
    features and output
    '''
    # first four columns are the features, the fifth is the class label
    array = df.values
    X = array[:,0:4].astype(float)
    Y = array[:,4]
    n_test = 0.2
    seed = 7
    # parentheses let the assignment continue over several lines
    X_train, X_test, Y_train, Y_test = (
        model_selection.train_test_split(X, Y, test_size=n_test,
                                         random_state=seed))
    return X_train, X_test, Y_train, Y_test
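For intuition, a seeded 80/20 split can be sketched with the standard library alone. The helper split_indices is hypothetical and will not reproduce the exact rows that scikit-learn selects for the same seed; it only illustrates the idea of shuffling once and cutting at the 20% mark.

```python
import random

def split_indices(n_rows, test_size=0.2, seed=7):
    """Shuffle row indices and cut them into train and test lists (hypothetical helper)."""
    indices = list(range(n_rows))
    # A local Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_size))
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(150)
print(len(train_idx), len(test_idx))
```

With 150 rows this gives 120 training rows and 30 test rows, and no row appears in both lists.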
from sklearn import model_selection

def k_fold_validation(models, X_train, Y_train):
    '''
    Performs 10-fold cross-validation and prints the mean and standard
    deviation of accuracies.
    Parameters: array of models, training set - features and output
    Returns: array of names, models, means, and stds
    '''
    means = []
    stds = []
    names = []
    scoring = 'accuracy'
    seed = 7
    for name, model in models:
        # random_state only has an effect when shuffle=True, and recent
        # scikit-learn versions reject it otherwise
        kfold = model_selection.KFold(n_splits=10, shuffle=True,
                                      random_state=seed)
        cv_results = model_selection.cross_val_score(model, X_train,
                                                     Y_train, cv=kfold,
                                                     scoring=scoring)
        means.append(cv_results.mean())
        stds.append(cv_results.std())
        names.append(name)
        msg = '%s: %f (%f)' % (name, cv_results.mean(),
                               cv_results.std())
        print(msg)
    return names, models, means, stds
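The mechanics of 10-fold cross-validation can be sketched without any library: every row is used for validation exactly once and for training in the other nine rounds. The generator k_fold_indices is an illustrative assumption and uses contiguous folds with no shuffling.

```python
def k_fold_indices(n_rows, n_splits=10):
    """Yield (train, validation) index lists for n_splits contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // n_splits + (1 if i < n_rows % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_rows))
        yield train, validation
        start = stop

# 120 rows is the size of the training set after the 80/20 split.
folds = list(k_fold_indices(120, n_splits=10))
print(len(folds), len(folds[0][0]), len(folds[0][1]))
```

Each of the 10 rounds trains on 108 rows and validates on the remaining 12, so every accuracy in the average comes from rows the model did not see during fitting.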
Model Building = Equation
• Multi-Class Classification with only numeric variables
• Logistic Regression
• Linear Discriminant Analysis
• K Nearest Neighbor
• Decision Tree
• Random Forest
• Naïve Bayes
• Support Vector Machine
Model Building code
• Logistic Regression
• Linear Discriminant Analysis
• K Nearest Neighbor
• Decision Tree
• Random Forest
• Naïve Bayes
• Support Vector Machine
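from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def build_model(X_train, Y_train):
    '''
    Builds the list of candidate models: Logistic Regression, Linear
    Discriminant Analysis, KNN, Decision Tree, Random
    Forest, Naive Bayes, and Support Vector Machine.
    The models are not fitted here.
    Parameters: training set - features and output (unused)
    Returns: array of (name, model) pairs
    '''
    models = []
    # one-vs-rest logistic regression; the wrapper stands in for the
    # multi_class='ovr' argument, which newer scikit-learn versions deprecate
    models.append(('LR',
                   OneVsRestClassifier(
                       LogisticRegression(solver='liblinear'))))
    models.append(('LDA', LinearDiscriminantAnalysis()))
    models.append(('KNN', KNeighborsClassifier()))
    models.append(('CART', DecisionTreeClassifier()))
    models.append(('RF',
                   RandomForestClassifier(n_estimators=100,
                                          max_depth=5)))
    models.append(('NB', GaussianNB()))
    models.append(('SVM', SVC(gamma='auto')))
    return models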
Estimation
• Gini Impurity
• The dimensions of the petal are more predictive than those of the sepal.

Feature         Gini importance
Petal Width     0.46
Petal Length    0.42
Sepal Length    0.09
Sepal Width     0.03
def gini_impurity(models, X_train, Y_train, X_test,
                  df):
    '''
    Examines feature importance using Gini impurity.
    Parameters: models, training set, test set, dataframe
    Returns: none
    '''
    # the random forest is the fifth entry in the list from build_model
    random_forest = models[4][1]
    # names of the four feature columns (everything except 'class')
    keys = df.keys()
    keys = keys[[0,1,2,3]]
    random_forest.fit(X_train, Y_train)
    # pair each importance with its feature name, largest first
    print(sorted(zip(map(lambda x: round(x, 4),
                         random_forest.feature_importances_), keys),
                 reverse=True))
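A random forest's feature importances are built from decreases in Gini impurity, so it helps to see the underlying quantity. The sketch below computes the impurity 1 - sum(p_k^2) for a set of class counts and the impurity decrease of one split. The function names and the example split are hypothetical, chosen only to illustrate the arithmetic.

```python
def gini_impurity_of(counts):
    """Gini impurity 1 - sum(p_k^2) for a list of class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def split_gain(parent, left, right):
    """Impurity decrease from splitting `parent` counts into `left` and `right`."""
    n = sum(parent)
    weighted = ((sum(left) / n) * gini_impurity_of(left)
                + (sum(right) / n) * gini_impurity_of(right))
    return gini_impurity_of(parent) - weighted

# Balanced three-class node, as in the full Iris data.
parent = [50, 50, 50]
# Hypothetical split that isolates one class perfectly.
left, right = [50, 0, 0], [0, 50, 50]
print(round(gini_impurity_of(parent), 4),
      round(split_gain(parent, left, right), 4))
```

This prints 0.6667 and 0.3333: the balanced node starts at an impurity of 2/3, and a split that cleanly separates one species removes half of it. Features that produce large decreases like this across many trees end up with high importance.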
Model Evaluation
• Run on training set
• Performance metric: Accuracy
• The base rate is the accuracy we would get by always predicting a single species, e.g. setosa. Only
algorithms that beat this base rate will be considered.
• Base rate = 0.33 (equivalently, a null error rate of 0.67)
• Visualize in: Error Bars
Model Evaluation
• Error bars show the mean cross-validated accuracy of each model, plus or minus one standard deviation.
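import matplotlib.pyplot as plt

def evaluate_error_bar(names, models, means, stds):
    '''
    Compares accuracy values with an error bar graph.
    Parameters: array of names, models, means, stds
    Returns: none
    '''
    # error bar: marker at the mean accuracy, bar of one standard deviation
    fig = plt.figure()
    fig.suptitle('Algorithm Comparison')
    ax = fig.add_subplot(1, 1, 1)
    # string x values are treated as categories, so the model names
    # become the tick labels without a separate set_xticklabels call
    ax.errorbar(names, means, yerr=stds, linestyle='None',
                marker='^')
    ax.set_ylim(0.92, 1)
    plt.show()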
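Each marker and bar in that chart comes from the per-fold accuracies of one model. The sketch below shows the arithmetic with the standard library; the ten fold accuracies are hypothetical values, not results from the deck.

```python
import statistics

# Hypothetical fold accuracies for one model (not results from the deck).
fold_accuracies = [1.00, 0.92, 1.00, 0.92, 1.00, 1.00, 0.83, 1.00, 1.00, 0.92]

mean_acc = statistics.mean(fold_accuracies)
# NumPy's .std() defaults to the population standard deviation (ddof=0),
# which corresponds to statistics.pstdev rather than statistics.stdev.
std_acc = statistics.pstdev(fold_accuracies)

# The error bar spans one standard deviation either side of the mean.
lower, upper = mean_acc - std_acc, mean_acc + std_acc
print('%.3f (%.3f) -> bar from %.3f to %.3f' % (mean_acc, std_acc, lower, upper))
```

A bar that reaches above 1.0 is not an error: it only reflects that the standard deviation is symmetric while accuracy is capped at 1.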
Explanation
• Run on test set
• Performance metrics: Accuracy, Recall, Precision, F1 score
• Visualize in: Confusion Matrix and Classification Report
import numpy as np
import pandas as pd
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

def test_set(X_train, Y_train, X_test, Y_test, models):
    '''
    Fits the random forest on the training set and runs the test
    set through it. Prints the confusion matrix and classification report.
    Parameters: training set and test set, array of models
    Returns: none
    '''
    for name, model in models:
        # only the random forest is evaluated on the held-out test set
        if name == 'RF':
            model.fit(X_train, Y_train)
            pred = model.predict(X_test)
            print('\n\n\n%s Accuracy: %.2f' % (name,
                  accuracy_score(Y_test, pred)))
            labels = np.unique(Y_test)
            # rows are actual classes, columns are predicted classes
            confusion = confusion_matrix(Y_test, pred,
                                         labels=labels)
            print('\nConfusion Matrix:')
            print(pd.DataFrame(confusion, index=labels,
                               columns=labels))
            print('\nClassification Report:')
            print(classification_report(Y_test, pred))
Explanation
Confusion Matrix Classification Report
Accuracy = 0.87
• Better than the base rate of 0.33
• Precision
• Precision for setosa is perfect (1.00). This means that whenever the model
predicted setosa, it was right.
• Recall
• Recall for setosa is also perfect (1.00). This means that we correctly
identified all setosa flowers.
• F1 Score
• Harmonic mean of precision and recall. Here we see that we do a
better job at identifying setosa (F1 = 1.00) than the other two flower
species (F1 = 0.83 and 0.82).
                Predicted Class
Actual Class    setosa   versicolor   virginica
setosa          7        0            0
versicolor      0        10           2
virginica       0        2            9

Actual Class    Precision   Recall   F1 Score
setosa          1.00        1.00     1.00
versicolor      0.83        0.83     0.83
virginica       0.82        0.82     0.82
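The figures in the two tables can be cross-checked by hand: precision divides the diagonal entry by its column sum, recall divides it by its row sum, and F1 is their harmonic mean. The sketch below does this for the confusion matrix shown above; the helper name per_class_metrics is illustrative, and scikit-learn's classification_report performs the equivalent calculation.

```python
labels = ['setosa', 'versicolor', 'virginica']
# Rows are actual classes, columns are predicted classes (from the deck).
confusion = [
    [7, 0, 0],
    [0, 10, 2],
    [0, 2, 9],
]

def per_class_metrics(confusion, labels):
    """Return {label: (precision, recall, f1)} from a square confusion matrix."""
    metrics = {}
    for k, label in enumerate(labels):
        true_pos = confusion[k][k]
        predicted = sum(row[k] for row in confusion)  # column sum
        actual = sum(confusion[k])                    # row sum
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics

total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[k][k] for k in range(len(labels))) / total
print('Accuracy: %.2f' % accuracy)
for label, (p, r, f1) in per_class_metrics(confusion, labels).items():
    print('%-10s %.2f %.2f %.2f' % (label, p, r, f1))
```

For example, versicolor has 10 correct predictions out of 12 predicted and 12 actual, giving 0.83 for both precision and recall, while overall accuracy is 26 correct out of 30.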
