Topic 9
Machine Learning Implementation with Python
Dr. Sunu Wibirama
Artificial Intelligence Course Module
Course code: UGMx 001001132012
July 4, 2022
1 Course Learning Outcomes
This topic fulfills CPMK 5 (course learning outcome 5): the ability to implement the Python programming language to support the development of intelligent systems.
The indicators that this outcome has been achieved are: understanding how to extract a dataset into Python variables, understanding the use of various classifier functions, and understanding how to validate machine learning models.
2 Scope of Material
The material covered in this topic is as follows:
a) Loading machine learning data: this part covers techniques for downloading data online using Google Colaboratory and Python. It also explains basic statistics that can be used to examine the characteristics of the data. The dataset used in this hands-on session is the PIMA Indian Dataset.
b) Preparing machine learning data: this part covers techniques for examining the distribution of, and the correlation between, the attributes in the dataset.
c) Data visualization: this part covers visualizing data with histograms, density plots, boxplots, correlation matrices, and scatter plots.
d) Data preparation and transformation: this part covers the essential steps for preparing data so as to reduce potential errors when the data becomes input to a machine learning algorithm. It covers data rescaling, data standardization, and data normalization.
e) Feature selection: this part covers the techniques needed to select features from the many features available.
f) Performance evaluation of machine learning algorithms: this part covers techniques for evaluating and selecting machine learning models, such as splitting the dataset into training and testing sets, K-fold cross validation, and repeated random test-train splits.
g) Performance metrics of machine learning algorithms: this part covers the main metrics for measuring the performance of machine learning algorithms, such as classification accuracy, area under the ROC curve, confusion matrix, mean absolute error, mean squared error, and R-squared.
h) Implementation of machine learning and neural network: this part covers real examples of implementing machine learning algorithms for classification, algorithm tuning, and neural networks with Keras.
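Items f) and g) above can be previewed in code. The sketch below is illustrative only: the choice of logistic regression and the 10-fold setup are assumptions, not part of the original outline. It evaluates a classifier on the same PIMA Indian dataset used throughout this topic, combining K-fold cross validation (item f) with classification accuracy (item g):

```python
# Sketch: K-fold cross validation with classification accuracy
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

# load the PIMA Indian Dataset used throughout this topic
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
URL = "https://raw.githubusercontent.com/wibirama/Artificial-Intelligence-Course/master/pima-indians-diabetes.data.csv"
data = pd.read_csv(URL, names=names)
array = data.values
X, Y = array[:, 0:8], array[:, 8]

# evaluate with 10-fold cross validation; each fold serves once as the test set
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
model = LogisticRegression(solver='lbfgs', max_iter=1000)
scores = cross_val_score(model, X, Y, cv=kfold, scoring='accuracy')
print("Accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```

The mean and standard deviation over the 10 folds give a more stable performance estimate than a single train/test split.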
[[0.353 0.744 0.59 0.354 0. 0.501 0.234 0.483]
[0.059 0.427 0.541 0.293 0. 0.396 0.117 0.167]
[0.471 0.92 0.525 0. 0. 0.347 0.254 0.183]
[0.059 0.447 0.541 0.232 0.111 0.419 0.038 0. ]
[0. 0.688 0.328 0.354 0.199 0.642 0.944 0.2 ]]
Task 2: Standardize the data
Standardization is a useful technique to transform attributes with a Gaussian distribution and differing means and standard deviations to a standard Gaussian distribution with a mean of 0 and a standard deviation of 1. It is most suitable for techniques that assume a Gaussian distribution in the input variables and work better with rescaled data, such as linear regression, logistic regression, and linear discriminant analysis. We can standardize data using scikit-learn with the StandardScaler class.
[[ 0.64 0.848 0.15 0.907 -0.693 0.204 0.468 1.426]
[-0.845 -1.123 -0.161 0.531 -0.693 -0.684 -0.365 -0.191]
[ 1.234 1.944 -0.264 -1.288 -0.693 -1.103 0.604 -0.106]
[-0.845 -0.998 -0.161 0.155 0.123 -0.494 -0.921 -1.042]
[-1.142 0.504 -1.505 0.907 0.766 1.41 5.485 -0.02 ]]
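As a quick check on the formula behind StandardScaler, standardization maps each value x in a column to (x − μ)/σ, where μ and σ are that column's mean and standard deviation. A minimal sketch with a small made-up array (the values are arbitrary, chosen only for illustration):

```python
# Sketch: standardization computes (x - mean) / std per column
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# manual formula; np.std defaults to the population std, matching StandardScaler
manual = (X - X.mean(axis=0)) / X.std(axis=0)
scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual, scaled))  # → True
```

Each standardized column then has mean 0 and standard deviation 1, which is exactly what the output above shows for the PIMA attributes.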
Task 3: Normalizing the data
Normalizing in scikit-learn refers to rescaling each observation (row) to have a length of 1 (called a unit norm, or a vector with the length of 1 in linear algebra). This pre-processing method can be useful for sparse datasets (lots of zeros) with attributes of varying scales when using algorithms that weight input values, such as neural networks, and algorithms that use distance measures, such as k-Nearest Neighbors. We can normalize data in Python with scikit-learn using the Normalizer class.
# Rescale Data (between 0 and 1)
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
from numpy import set_printoptions
# load data
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
URL = "https://raw.githubusercontent.com/wibirama/Artificial-Intelligence-Course/master/pima-indians-diabetes.data.csv"
data = pd.read_csv(URL,names=names)
array = data.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
# rescaling the data
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)
# summarize transformed data
set_printoptions(precision=3)
print(rescaledX[0:5,:])
# Standardize Data (0 mean, 1 stdev)
from sklearn.preprocessing import StandardScaler
import pandas as pd
from numpy import set_printoptions
# load data
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
URL = "https://raw.githubusercontent.com/wibirama/Artificial-Intelligence-Course/master/pima-indians-diabetes.data.csv"
data = pd.read_csv(URL,names=names)
array = data.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
# standardize the data
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
# summarize transformed data
set_printoptions(precision=3)
print(rescaledX[0:5,:])
# Normalize Data (length of 1)
from sklearn.preprocessing import Normalizer
import pandas as pd
from numpy import set_printoptions
# load data
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
URL = "https://raw.githubusercontent.com/wibirama/Artificial-Intelligence-Course/master/pima-indians-diabetes.data.csv"
data = pd.read_csv(URL,names=names)
array = data.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
# normalize the data
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)
# summarize transformed data
set_printoptions(precision=3)
print(normalizedX[0:5,:])
[[0.034 0.828 0.403 0.196 0. 0.188 0.004 0.28 ]
[0.008 0.716 0.556 0.244 0. 0.224 0.003 0.261]
[0.04 0.924 0.323 0. 0. 0.118 0.003 0.162]
[0.007 0.588 0.436 0.152 0.622 0.186 0.001 0.139]
[0. 0.596 0.174 0.152 0.731 0.188 0.01 0.144]]
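The unit-norm property can be verified directly: after Normalizer (which uses the L2 norm by default), every row has Euclidean length 1. A minimal sketch with a made-up array (values chosen only for illustration):

```python
# Sketch: Normalizer rescales each row to unit L2 norm
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0],
              [6.0, 8.0]])
normalized = Normalizer().fit_transform(X)

print(normalized)                          # both rows become [0.6, 0.8]
print(np.linalg.norm(normalized, axis=1))  # each row norm is 1.0
```

Note that [3, 4] and [6, 8] point in the same direction, so after normalization they become identical: only the direction of each row survives, not its magnitude.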
9.6 Feature Selection
The data features that we use to train our machine learning models have a huge influence on the performance we can achieve. In this lesson we will discover automatic feature selection techniques that we can use to prepare our machine learning data in Python with scikit-learn.
Feature selection is a process where we automatically select those features in our data that contribute most to the prediction variable or output in which we are interested. Having irrelevant features in our data can decrease the accuracy of many models, especially linear algorithms like linear and logistic regression. Three benefits of performing feature selection before modeling our data are:
- Reduces overfitting: less redundant data means less opportunity to make decisions based on noise.
- Improves accuracy: less misleading data means modeling accuracy improves.
- Reduces training time: less data means that algorithms train faster.
After completing this lesson you will know how to use:
- Univariate Selection.
- Recursive Feature Elimination.
- Feature Importance.
Task 1: Univariate Selection
Statistical tests can be used to select those features that have the strongest relationship with the output variable. The scikit-learn library provides the SelectKBest class that can be used with a suite of different statistical tests to select a specific number of features. The example below uses the Chi-Squared (χ²) statistical test for non-negative features to select 4 of the best features from the Pima Indians onset of diabetes dataset.
Selected features with first 5 entries:
plas test mass age
0 148 0 33.6 50
1 85 0 26.6 31
2 183 0 23.3 32
3 89 94 28.1 21
4 137 168 43.1 33
Chi-square scores of the selected features:
Index(['plas', 'test', 'mass', 'age'], dtype='object')
[1411.887 2175.565 127.669 181.304]
Task 2: Recursive Feature Elimination
Recursive Feature Elimination (or RFE) works by recursively removing attributes and building a model on those attributes that remain. It uses the model accuracy to identify which attributes (and combinations of attributes) contribute the most to predicting the target attribute.
# Feature selection with Univariate Statistical Tests (Chi-squared for classification)
import pandas as pd
from numpy import set_printoptions
from sklearn.feature_selection import SelectKBest, chi2
# load data
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
URL = "https://raw.githubusercontent.com/wibirama/Artificial-Intelligence-Course/master/pima-indians-diabetes.data.csv"
data = pd.read_csv(URL,names=names)
array = data.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
# select four features and create new pandas dataframe
selector = SelectKBest(score_func=chi2, k=4).fit(X,Y)
f = selector.get_support(indices=True)
data_new = data[data.columns[f]]
print ("Selected features with first 5 entries:")
print(data_new.head(5))
# show selected chi-square scores for selected features
print()
print ("Chi-square scores of the selected features:")
x_new = selector.transform(X) # not needed to get the score
scores = selector.scores_
print (data.columns[f])
print(scores[f])
The example below uses RFE with the logistic regression algorithm to select the top 3 features. The choice of algorithm does not matter too much as long as it is skillful and consistent.
Number of selected features via RFE: 3
Boolean of selected features: [ True False False False False True True False]
Feature ranking: [1 2 4 6 5 1 1 3]
Selected features:
preg
mass
pedi
Task 3: Feature Importance
Feature importance refers to techniques that assign a score to input features based on how useful they are at predicting a target variable. Most importance scores are calculated by a predictive model that has been fit on the dataset. Inspecting the importance scores provides insight into that specific model: which features are the most important and least important to the model when making a prediction. This is a type of model interpretation that can be performed for those models that support it.
Bagged decision trees like Random Forest and Extra Trees can be used to estimate the importance of features. In the example below we construct an ExtraTreesClassifier for the Pima Indians onset of diabetes dataset. We can see that we are given an importance score for each attribute, where the larger the score, the more important the attribute. The scores highlight the importance of plas, age, and mass.
Features with importance scores:
preg 0.10917902521438591
plas 0.23778795159254987
pres 0.09677965067606348
skin 0.07938108481610057
test 0.0715765118317984
mass 0.1418165024365237
# Feature Selection with RFE
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# load data
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
URL = "https://raw.githubusercontent.com/wibirama/Artificial-Intelligence-Course/master/pima-indians-diabetes.data.csv"
data = pd.read_csv(URL,names=names)
array = data.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
# feature selection
rfe = RFE(estimator=LogisticRegression(solver='lbfgs', max_iter=1000),n_features_to_select=3)
fit = rfe.fit(X, Y)
print("Number of selected features via RFE: %d" % fit.n_features_)
print("Boolean of selected features: %s" % fit.support_)
print("Feature ranking: %s" % fit.ranking_)
print("Selected features: ")
idx = 0
for x in fit.ranking_:
    if x == 1:
        print(data.columns[idx])
    idx += 1
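The code above fixes the number of selected features at 3. As a possible extension (not covered in the original notebook), scikit-learn also provides an RFECV variant that chooses the number of features automatically via cross validation. A minimal sketch on synthetic stand-in data (the dataset and estimator settings here are assumptions for illustration):

```python
# Sketch: letting cross validation choose the number of features with RFECV
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# synthetic stand-in data: 8 features, like the PIMA inputs, 3 of them informative
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=7)

# RFECV runs RFE inside 5-fold cross validation to pick the best feature count
rfecv = RFECV(estimator=LogisticRegression(solver='lbfgs', max_iter=1000), cv=5)
rfecv.fit(X, y)
print("Optimal number of features:", rfecv.n_features_)
print("Selected feature mask:", rfecv.support_)
```

This removes the need to guess `n_features_to_select` up front, at the cost of extra model fits.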
# Feature Importance with Extra Trees Classifier
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
import numpy as np
import matplotlib.pyplot as plt
# load data
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
URL = "https://raw.githubusercontent.com/wibirama/Artificial-Intelligence-Course/master/pima-indians-diabetes.data.csv"
data = pd.read_csv(URL,names=names)
array = data.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
# feature selection
model = ExtraTreesClassifier()
model.fit(X, Y)
# summarize feature importance scores
print("Features with importance scores:")
idx = 0
for x in model.feature_importances_:
    print(data.columns[idx], x)
    idx += 1
#show bar plot
features = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
plt.bar(features,model.feature_importances_)
plt.show()