Topic 9
Machine Learning Implementation with Python
Dr. Sunu Wibirama
Artificial Intelligence Course Module
Course code: UGMx 001001132012
July 4, 2022
1 Course Learning Outcomes
This topic fulfills CPMK 5 (course learning outcome 5): the ability to implement the Python programming language to support the development of intelligent systems.
The indicators that this outcome has been achieved are: understanding how to extract a dataset into Python variables, understanding the use of various classifier functions, and understanding how to validate machine learning models.
2 Scope of Material
The material covered in this topic is as follows:
a) Loading machine learning data: this part covers techniques for downloading data online using Google Colaboratory and Python. It also explains basic statistics that can be used to examine the characteristics of the data. The dataset used in this hands-on session is the PIMA Indian Dataset.
b) Preparing machine learning data: this part covers techniques for examining the distribution of, and the correlation between, the attributes in the dataset.
c) Data visualization: this part covers visualizing data with histograms, density plots, boxplots, correlation matrices, and scatter plots.
d) Data preparation and transformation: this part covers the essential steps for preparing data so as to reduce potential errors when the data becomes input to a machine learning algorithm. It covers data rescaling, data standardization, and data normalization.
e) Feature selection: this part covers the techniques needed to select features from the many features available.
f) Performance evaluation of machine learning algorithms: this part covers techniques for evaluating and selecting machine learning models, such as splitting the dataset into training and testing sets, K-fold cross validation, and repeated random test-train splits.
g) Performance metrics of machine learning algorithms: this part covers the main metrics for measuring the performance of machine learning algorithms, such as classification accuracy, area under the ROC curve, confusion matrix, mean absolute error, mean squared error, and R-squared.
h) Implementation of machine learning and neural network: this part covers real examples of implementing machine learning algorithms for classification, algorithm tuning, and neural networks with Keras.
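Items f) and g) above can be previewed in code. The sketch below is illustrative only: the choice of logistic regression and the 10-fold setup are assumptions, not part of the original outline. It evaluates a classifier on the same PIMA Indian dataset used throughout this topic, combining K-fold cross validation (item f) with classification accuracy (item g):

```python
# Sketch: K-fold cross validation with classification accuracy
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

# load the PIMA Indian Dataset used throughout this topic
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
URL = "https://raw.githubusercontent.com/wibirama/Artificial-Intelligence-Course/master/pima-indians-diabetes.data.csv"
data = pd.read_csv(URL, names=names)
array = data.values
X, Y = array[:, 0:8], array[:, 8]

# evaluate with 10-fold cross validation; each fold serves once as the test set
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
model = LogisticRegression(solver='lbfgs', max_iter=1000)
scores = cross_val_score(model, X, Y, cv=kfold, scoring='accuracy')
print("Accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```

The mean and standard deviation over the 10 folds give a more stable performance estimate than a single train/test split.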
[[0.353 0.744 0.59 0.354 0. 0.501 0.234 0.483]
[0.059 0.427 0.541 0.293 0. 0.396 0.117 0.167]
[0.471 0.92 0.525 0. 0. 0.347 0.254 0.183]
[0.059 0.447 0.541 0.232 0.111 0.419 0.038 0. ]
[0. 0.688 0.328 0.354 0.199 0.642 0.944 0.2 ]]
Task 2: Standardize the data
Standardization is a useful technique to transform attributes with a Gaussian distribution and differing means and standard deviations to a standard Gaussian distribution with a mean of 0 and a standard deviation of 1. It is most suitable for techniques that assume a Gaussian distribution in the input variables and work better with rescaled data, such as linear regression, logistic regression, and linear discriminant analysis. We can standardize data using scikit-learn with the StandardScaler class.
[[ 0.64 0.848 0.15 0.907 -0.693 0.204 0.468 1.426]
[-0.845 -1.123 -0.161 0.531 -0.693 -0.684 -0.365 -0.191]
[ 1.234 1.944 -0.264 -1.288 -0.693 -1.103 0.604 -0.106]
[-0.845 -0.998 -0.161 0.155 0.123 -0.494 -0.921 -1.042]
[-1.142 0.504 -1.505 0.907 0.766 1.41 5.485 -0.02 ]]
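As a quick check on the formula behind StandardScaler, standardization maps each value x in a column to (x − μ)/σ, where μ and σ are that column's mean and standard deviation. A minimal sketch with a small made-up array (the values are arbitrary, chosen only for illustration):

```python
# Sketch: standardization computes (x - mean) / std per column
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# manual formula; np.std defaults to the population std, matching StandardScaler
manual = (X - X.mean(axis=0)) / X.std(axis=0)
scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual, scaled))  # → True
```

Each standardized column then has mean 0 and standard deviation 1, which is exactly what the output above shows for the PIMA attributes.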
Task 3: Normalizing the data
Normalizing in scikit-learn refers to rescaling each observation (row) to have a length of 1 (called a unit norm, or a vector with the length of 1 in linear algebra). This pre-processing method can be useful for sparse datasets (lots of zeros) with attributes of varying scales when using algorithms that weight input values, such as neural networks, and algorithms that use distance measures, such as k-Nearest Neighbors. We can normalize data in Python with scikit-learn using the Normalizer class.
# Rescale Data (between 0 and 1)
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
from numpy import set_printoptions
# load data
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
URL = "https://raw.githubusercontent.com/wibirama/Artificial-Intelligence-Course/master/pima-indians-diabetes.data.csv"
data = pd.read_csv(URL,names=names)
array = data.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
# rescaling the data
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)
# summarize transformed data
set_printoptions(precision=3)
print(rescaledX[0:5,:])
# Standardize Data (0 mean, 1 stdev)
from sklearn.preprocessing import StandardScaler
import pandas as pd
from numpy import set_printoptions
# load data
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
URL = "https://raw.githubusercontent.com/wibirama/Artificial-Intelligence-Course/master/pima-indians-diabetes.data.csv"
data = pd.read_csv(URL,names=names)
array = data.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
# standardize the data
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
# summarize transformed data
set_printoptions(precision=3)
print(rescaledX[0:5,:])
# Normalize Data (length of 1)
from sklearn.preprocessing import Normalizer
import pandas as pd
from numpy import set_printoptions
# load data
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
URL = "https://raw.githubusercontent.com/wibirama/Artificial-Intelligence-Course/master/pima-indians-diabetes.data.csv"
data = pd.read_csv(URL,names=names)
array = data.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
# normalize the data
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)
# summarize transformed data
set_printoptions(precision=3)
print(normalizedX[0:5,:])
[[0.034 0.828 0.403 0.196 0. 0.188 0.004 0.28 ]
[0.008 0.716 0.556 0.244 0. 0.224 0.003 0.261]
[0.04 0.924 0.323 0. 0. 0.118 0.003 0.162]
[0.007 0.588 0.436 0.152 0.622 0.186 0.001 0.139]
[0. 0.596 0.174 0.152 0.731 0.188 0.01 0.144]]
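The unit-norm property can be verified directly: after Normalizer (which uses the L2 norm by default), every row has Euclidean length 1. A minimal sketch with a made-up array (values chosen only for illustration):

```python
# Sketch: Normalizer rescales each row to unit L2 norm
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0],
              [6.0, 8.0]])
normalized = Normalizer().fit_transform(X)

print(normalized)                          # both rows become [0.6, 0.8]
print(np.linalg.norm(normalized, axis=1))  # each row norm is 1.0
```

Note that [3, 4] and [6, 8] point in the same direction, so after normalization they become identical: only the direction of each row survives, not its magnitude.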
9.6 Feature Selection
The data features that we use to train our machine learning models have a huge influence on the performance we can achieve. In this lesson we will discover automatic feature selection techniques that we can use to prepare our machine learning data in Python with scikit-learn.
Feature selection is a process where we automatically select those features in our data that contribute most to the prediction variable or output in which we are interested. Having irrelevant features in our data can decrease the accuracy of many models, especially linear algorithms like linear and logistic regression. Three benefits of performing feature selection before modeling our data are:
- Reduces overfitting: less redundant data means less opportunity to make decisions based on noise.
- Improves accuracy: less misleading data means modeling accuracy improves.
- Reduces training time: less data means that algorithms train faster.
After completing this lesson you will know how to use:
- Univariate Selection.
- Recursive Feature Elimination.
- Feature Importance.
Task 1: Univariate Selection
Statistical tests can be used to select those features that have the strongest relationship with the output variable. The scikit-learn library provides the SelectKBest class that can be used with a suite of different statistical tests to select a specific number of features. The example below uses the Chi-Squared (χ²) statistical test for non-negative features to select 4 of the best features from the Pima Indians onset of diabetes dataset.
Selected features with first 5 entries:
plas test mass age
0 148 0 33.6 50
1 85 0 26.6 31
2 183 0 23.3 32
3 89 94 28.1 21
4 137 168 43.1 33
Chi-square scores of the selected features:
Index(['plas', 'test', 'mass', 'age'], dtype='object')
[1411.887 2175.565 127.669 181.304]
Task 2: Recursive Feature Elimination
Recursive Feature Elimination (or RFE) works by recursively removing attributes and building a model on those attributes that remain. It uses the model accuracy to identify which attributes (and combinations of attributes) contribute the most to predicting the target attribute.
# Feature selection with Univariate Statistical Tests (Chi-squared for classification)
import pandas as pd
from numpy import set_printoptions
from sklearn.feature_selection import SelectKBest, chi2
# load data
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
URL = "https://raw.githubusercontent.com/wibirama/Artificial-Intelligence-Course/master/pima-indians-diabetes.data.csv"
data = pd.read_csv(URL,names=names)
array = data.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
# select four features and create new pandas dataframe
selector = SelectKBest(score_func=chi2, k=4).fit(X,Y)
f = selector.get_support(indices=True)
data_new = data[data.columns[f]]
print ("Selected features with first 5 entries:")
print(data_new.head(5))
# show selected chi-square scores for selected features
print()
print ("Chi-square scores of the selected features:")
x_new = selector.transform(X) # not needed to get the score
scores = selector.scores_
print (data.columns[f])
print(scores[f])
The example below uses RFE with the logistic regression algorithm to select the top 3 features. The choice of algorithm does not matter too much as long as it is skillful and consistent.
Number of selected features via RFE: 3
Boolean of selected features: [ True False False False False True True False]
Feature ranking: [1 2 4 6 5 1 1 3]
Selected features:
preg
mass
pedi
Task 3: Feature Importance
Feature importance refers to techniques that assign a score to input features based on how useful they are at predicting a target variable. Most importance scores are calculated by a predictive model that has been fit on the dataset. Inspecting the importance scores provides insight into that specific model: which features are the most important and least important to the model when making a prediction. This is a type of model interpretation that can be performed for those models that support it.
Bagged decision trees like Random Forest and Extra Trees can be used to estimate the importance of features. In the example below we construct an ExtraTreesClassifier for the Pima Indians onset of diabetes dataset. We can see that we are given an importance score for each attribute, where the larger the score, the more important the attribute. The scores highlight the importance of plas, age, and mass.
Features with importance scores:
preg 0.10917902521438591
plas 0.23778795159254987
pres 0.09677965067606348
skin 0.07938108481610057
test 0.0715765118317984
mass 0.1418165024365237
# Feature Selection with RFE
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# load data
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
URL = "https://raw.githubusercontent.com/wibirama/Artificial-Intelligence-Course/master/pima-indians-diabetes.data.csv"
data = pd.read_csv(URL,names=names)
array = data.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
# feature selection
rfe = RFE(estimator=LogisticRegression(solver='lbfgs', max_iter=1000),n_features_to_select=3)
fit = rfe.fit(X, Y)
print("Number of selected features via RFE: %d" % fit.n_features_)
print("Boolean of selected features: %s" % fit.support_)
print("Feature ranking: %s" % fit.ranking_)
print("Selected features: ")
idx = 0
for x in fit.ranking_:
    if x == 1:
        print(data.columns[idx])
    idx += 1
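The code above fixes the number of selected features at 3. As a possible extension (not covered in the original notebook), scikit-learn also provides an RFECV variant that chooses the number of features automatically via cross validation. A minimal sketch on synthetic stand-in data (the dataset and estimator settings here are assumptions for illustration):

```python
# Sketch: letting cross validation choose the number of features with RFECV
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# synthetic stand-in data: 8 features, like the PIMA inputs, 3 of them informative
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=7)

# RFECV runs RFE inside 5-fold cross validation to pick the best feature count
rfecv = RFECV(estimator=LogisticRegression(solver='lbfgs', max_iter=1000), cv=5)
rfecv.fit(X, y)
print("Optimal number of features:", rfecv.n_features_)
print("Selected feature mask:", rfecv.support_)
```

This removes the need to guess `n_features_to_select` up front, at the cost of extra model fits.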
# Feature Importance with Extra Trees Classifier
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
import numpy as np
import matplotlib.pyplot as plt
# load data
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
URL = "https://raw.githubusercontent.com/wibirama/Artificial-Intelligence-Course/master/pima-indians-diabetes.data.csv"
data = pd.read_csv(URL,names=names)
array = data.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
# feature selection
model = ExtraTreesClassifier()
model.fit(X, Y)
# summarize feature importance scores
print("Features with importance scores:")
idx = 0
for x in model.feature_importances_:
    print(data.columns[idx], x)
    idx += 1
#show bar plot
features = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
plt.bar(features,model.feature_importances_)
plt.show()