This document contains some of the questions from the Domingos paper. The overall idea is to understand what machine learning is all about, and the paper helps us understand the need for machine learning in our day-to-day lives. I hope you will find this document helpful.
Questions from the paper
"A Few Useful Things to Know about Machine Learning"
Reference: http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
By:
Akhilesh Joshi
mail: akhileshjoshi123@gmail.com
1. What is the definition of ML?
Machine learning is the art of using existing data (historical and present) to forecast/predict ideal solutions with the help of statistical models, with little or no manual intervention. Although techniques for machine learning are still in development, it is one of the important concepts in the field of data science, with various applications that will be helpful to mankind.
2. What is a classifier?
A classifier is a system to which we provide inputs (the inputs may be discrete or continuous) and which gives us an output. The data that we provide to the classifier is called training data. The main aim of a classifier is to produce an output based on the training data, and that output should correctly classify our test data to get more ideal results.
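As a rough, hypothetical illustration (not from the paper), a classifier exposes a training step and a prediction step; the sketch below uses a deliberately trivial majority-class rule, and all names and values are made up:

# Minimal sketch of the classifier idea: train on labeled examples,
# then predict labels for unseen inputs. The "majority class" rule is
# deliberately trivial; real learners use far richer representations.
from collections import Counter

class MajorityClassifier:
    def fit(self, inputs, labels):
        # Training: remember the most common label in the training data.
        self.most_common = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, inputs):
        # Prediction: output that label for every test example.
        return [self.most_common for _ in inputs]

train_x = [[5.1, 3.5], [4.9, 3.0], [6.2, 2.9]]
train_y = ["setosa", "setosa", "virginica"]
clf = MajorityClassifier().fit(train_x, train_y)
print(clf.predict([[5.0, 3.4]]))  # ['setosa']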
3. What are the 3 components of a learning system, according to the author? Explain them briefly.
There are 3 components described for a learning system. They are as follows.
a. Representation
Representation is a very important aspect of applying ML to our set of data. Here we decide how we should represent the data so that a learner can fit it well. For example, a decision tree might be perfectly suited to one dataset, whereas a neural network may be best suited to another.
b. Evaluation
Evaluation helps us distinguish good classifiers from bad classifiers. Good classifiers are those which provide a hypothesis that is well suited to our test data. For student data we might need a "likelihood" evaluation measure for getting a job rather than a "precision and recall" evaluation; the evaluation step helps us determine this.
c. Optimization
Out of all the possible hypotheses, we have to decide which hypothesis provides us with the optimal solution for our test data. Here we search for the best-suited hypothesis to arrive at the most ideal solution.
4. What is information gain?
Given a number of attributes, we have to decide which attribute has the maximum information gain. We calculate the weighted average entropy of the subsets produced by a split and compare it to the entropy of the original set. This helps us to build a decision tree: the attribute with the highest information gain goes at the root node, and then we subdivide the further tree nodes by comparing the information gains of the remaining attributes. The order of the splits in a decision tree is in decreasing order of information gain.
Formula for information gain:
IG(A) = H(S) − Σ_v (|S_v| / |S|) · H(S_v)
IG(A): information gain IG over attribute A
H(S): entropy of all examples
H(S_v): entropy of one subsample after partitioning S on attribute value v
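A small Python sketch of this calculation (illustrative only; the toy attribute values and labels are made up):

# Sketch of the formula above: IG(A) = H(S) - sum_v (|Sv|/|S|) * H(Sv)
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(examples, labels, attribute_index):
    # Partition the examples by the value of the chosen attribute,
    # then subtract the weighted entropy of the parts from H(S).
    partitions = {}
    for example, label in zip(examples, labels):
        partitions.setdefault(example[attribute_index], []).append(label)
    weighted = sum(len(part) / len(labels) * entropy(part) for part in partitions.values())
    return entropy(labels) - weighted

# Toy data: attribute 0 separates the classes perfectly, so its gain equals H(S).
X = [["sunny"], ["sunny"], ["rainy"], ["rainy"]]
y = ["yes", "yes", "no", "no"]
print(information_gain(X, y, 0))  # 1.0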
5. Why is generalization more important than just getting a good result on training data, i.e. the data that was used to train the classifier?
Using training data gives us an insight into what our data looks like, but training our machine learning algorithms on that particular set of data does not guarantee that the algorithm will work correctly on the test data. There might be a case where our test data is completely different from our training data, and the output may not be as desired. So we have to consider both scenarios, where our algorithm works on both our training data and test data. Hence the concept of generalization.
6. What is cross-validation? What are its advantages?
Given the training data S and the hypothesis class H (which contains all the possible hypotheses), we have to find h, the correct hypothesis for our data. To find h correctly, we make use of the cross-validation process so that the data is used to maximum advantage.
Advantages of cross-validation:
- Every example is used for both training and testing, giving us clear insights about the kind of data the algorithm might see or be evaluated on.
- We can set aside part of our training data as test data, which lets us check whether the algorithm gives the desired ideal solutions.
- Since a set of data is already set aside as test data, we do not have to worry about obtaining separate test data.
Illustration of cross-validation:
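A minimal sketch of k-fold cross-validation, assuming placeholder train_fn and accuracy_fn functions supplied by the user:

# Split the training data into k folds, hold out one fold at a time for
# testing, train on the rest, and average the accuracy over the k runs.
def k_fold_cross_validation(examples, labels, k, train_fn, accuracy_fn):
    fold_size = len(examples) // k
    scores = []
    for i in range(k):
        start, end = i * fold_size, (i + 1) * fold_size
        test_x, test_y = examples[start:end], labels[start:end]
        train_x = examples[:start] + examples[end:]
        train_y = labels[:start] + labels[end:]
        model = train_fn(train_x, train_y)                  # learn on k-1 folds
        scores.append(accuracy_fn(model, test_x, test_y))   # evaluate on the held-out fold
    return sum(scores) / k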
7. How is generalization different from other optimization problems?
Optimization problems are more aligned to data that is already known. In generalization, on the other hand, we have to use the errors and findings from our training data to infer something about the test data, or at least try to. Since optimization deals with more ideal situations, where most things are already known, we can expect the outputs to be as desired, which is not the case for generalization problems.
8. If you have a scenario where the function involves 10 Boolean variables, how many possible examples (called the instance space) can there be? If you see 100 examples, what percentage of the instance space have you seen?
The number of instances is 2^N (where N is the number of Boolean variables). So in our case the total number of instances will be 2^10, i.e. 1024 instances. Since we are given only 100 examples, we will see only about 9.77% of the instance space.
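The arithmetic, as a quick check:

# Instance space of 10 Boolean variables and the fraction covered by 100 examples.
n_variables = 10
instance_space = 2 ** n_variables          # 1024 possible examples
fraction_seen = 100 / instance_space
print(instance_space, f"{fraction_seen:.2%}")  # 1024 9.77%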
9. What is the "no free lunch" theorem in machine learning? You can do a Google search if the paper isn't clear enough.
The "no free lunch" theorem suggests that no learning algorithm is inherently superior to other algorithms. If an algorithm performs well on a particular class of problems, then it must perform worse on another class of problems, i.e. the performance is compensated. If we average the error over all possible problems, then the expected difference in error between any two algorithms is zero.
10. What general assumptions allow us to carry out the machine learning process? What is the meaning of induction?
The general assumption is that the training examples are representative of the data the learner will see, for instance that similar examples tend to have similar classes. Induction is making use of a small amount of available knowledge (the training data and our assumptions) and turning it into a large amount of knowledge.
11. How is learning like farming?
Farming is largely a dependent activity: it depends on nature. With the help of nature, farmers combine seeds with nutrients to grow crops. In a similar manner, to grow programs (like crops), a learner has to combine knowledge (logic) with data.
12. What is overfitting? How does it lead to a wrong idea that you have done a really good job on the training dataset?
Overfitting is when the model learns the training data too closely: it gets used to the characteristics of the training data, which even includes its noise and errors. When the model then applies what it learned on the training data, the results are not as expected, and the model might not work well on the test data. Overfitting negatively impacts the model's ability to generalize. Since it is highly unlikely that the test data will be the same as our training data, good performance on the training data gives a wrong impression of how good a job we have done.
13. What is meant by bias and variance? You don't have to be really precise in defining them, just get the idea.
Bias: the learner's erroneous assumptions built into the learning algorithm. Low bias → fewer/weaker assumptions; high bias → more/stronger assumptions.
Variance: the amount by which a model's estimate changes when a different training dataset is used.
14. What are some of the things that can help combat overfitting?
Use of the following techniques might help in combating overfitting:
- cross-validation
- adding a regularization term to the evaluation function
- performing a statistical significance test like chi-square before adding new structure
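As a rough illustration of the regularization idea (not taken from the paper), the sketch below adds an L2 penalty to a squared-error training loss; the function and the lambda value are made up for illustration:

# Sketch of adding a regularization term to the evaluation function:
# the penalty discourages large weights, trading a little training error
# for less overfitting. lam (lambda) controls the strength of the penalty.
def regularized_loss(weights, predictions, targets, lam=0.1):
    squared_error = sum((p - t) ** 2 for p, t in zip(predictions, targets))
    l2_penalty = lam * sum(w ** 2 for w in weights)
    return squared_error + l2_penalty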
15. Why do algorithms that work well in lower dimensions fail at higher dimensions? Think about the number of instances possible in higher dimensions and the cost of similarity calculation.
As the number of dimensions increases, the amount of data required to train a model (in this case, an algorithm) grows exponentially. In a way, algorithms in lower dimensions can generalize (keep training and test data in sync) better than they can maintain generalization in higher dimensions. The same phenomenon is described by the "curse of dimensionality".
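A small illustration of this exponential growth (the grid resolution of 10 cells per dimension and the sample size of one million are arbitrary choices):

# Even a coarse grid of 10 cells per dimension needs 10^d cells to cover
# the space, so a fixed number of examples covers a vanishing fraction.
for d in (1, 2, 5, 10, 20):
    cells = 10 ** d
    print(f"d={d:2d}  cells={cells:.0e}  fraction covered by 1e6 examples={1e6 / cells:.0e}")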
16. What is meant by "blessing of non-uniformity"?
This refers to the fact that observations from real-world domains are often not distributed uniformly, but are grouped or clustered in useful and meaningful ways.
17. What has been one of the major developments in the recent decades about results of induction?
One of the major developments is that we can now have guarantees on the results of induction, particularly if we're willing to settle for probabilistic guarantees.
18. What is the most important factor that determines whether a machine learning project succeeds?
The success of the project depends on the features used. If we have many independent features that each correlate well with the class, learning is easy. On the other hand, if the class is a very complex function of the features, we may not be able to learn it.
19. In a ML project, which is more time consuming – feature engineering or the actual learning process? Explain how ML is an iterative process.
Feature engineering is the more time-consuming part of machine learning, since it involves many things such as gathering data, cleaning it and pre-processing it.
In ML we have to carry out certain tasks iteratively, such as running the learner, analyzing the results, and modifying the data and the learner. Hence it is an iterative process.
20. What, according to the author, is one of the holy grails of ML?
According to the author, automating the feature engineering process is one of the holy grails of ML. It can be done by generating a large number of candidate features and selecting the best based on their information gain with respect to the class, but this approach has some limitations.
21. If your ML solution is not performing well, what are two things that you can do? Which one is a better option?
When an ML solution does not perform well, we have two main choices:
- design a better learning algorithm, or
- gather more data.
It is usually better to go for collecting more data, because a dumb algorithm with more and more data beats a clever algorithm with a modest amount of data.
22. What are the 3 limited resources in ML computations? What is the bottleneck today? What is one of the solutions?
The 3 limited resources in ML computations are:
- time
- memory
- training data
The bottleneck has changed from decade to decade, and today it is time: with so much data available, it takes very long to process it and learn a complex model. So one solution is to come up with faster ways to learn complex classifiers.
23. A surprising fact mentioned by the author is that all representations (types of learners) essentially "all do the same". Can you explain? Which learners should you try first?
All learners work by grouping nearby examples into the same class; the key difference is in the meaning of "nearby". With non-uniformly distributed data, learners can produce widely different frontiers while still making the same predictions in the regions that matter.
It is better to try the simplest learners first. Complex learners are usually harder to use, because they have more knobs you need to turn to get good results, and because their internals are more opaque.
24. The author divides learners into two types based on their representation size. Write a brief summary.
According to the author there are two types of learners based on representation size:
1) learners with a fixed representation size, and
2) learners whose representation size grows with the data.
Fixed-size learners can only take advantage of so much data. Variable-size learners can in principle learn any function given sufficient data, but in practice they may not, because of limitations of the algorithm, computational cost, or the curse of dimensionality. For these reasons, clever algorithms, those that make the most of the data and computing resources available, often pay off in the end.
25. Is it better to have variations of a single model or a combination of different models, known as ensemble or stacking? Explain briefly.
Researchers noticed that if, instead of selecting the best variation found, we combine many variations, the results are often much better, and at little extra effort for the user. In bagging, the simplest ensemble technique, we generate random variations of the training set by resampling, learn a classifier on each, and combine the results by voting. This works because it greatly reduces variance while only slightly increasing bias.
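A rough sketch of the resample-learn-vote idea described above; train_fn stands in for any base learner and is a made-up name:

# Draw bootstrap resamples of the training set, learn one classifier per
# resample, and combine predictions by majority vote.
import random
from collections import Counter

def bagging_predict(examples, labels, test_point, train_fn, n_models=25):
    votes = []
    for _ in range(n_models):
        # Resample the training set with replacement (bootstrap sample).
        idx = [random.randrange(len(examples)) for _ in range(len(examples))]
        model = train_fn([examples[i] for i in idx], [labels[i] for i in idx])
        votes.append(model.predict([test_point])[0])
    # Majority vote over the individual classifiers' predictions.
    return Counter(votes).most_common(1)[0][0]

For instance, the toy MajorityClassifier sketched earlier could serve as the base learner by passing train_fn = lambda x, y: MajorityClassifier().fit(x, y).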
26. Read the last paragraph and explain why it makes sense to prefer simpler algorithms and hypotheses.
When complexity is measured by the size of the hypothesis space, smaller spaces allow hypotheses to be represented by shorter codes. A learner with a larger hypothesis space that tries fewer hypotheses from it is less likely to overfit than one that tries more hypotheses from a smaller space. So it makes sense to prefer simpler algorithms and hypotheses: the more assumptions an explanation needs to make, the more unlikely the explanation is.
27. It has been established that correlation between independent variables and predicted variables does not imply causation; still, correlation is used by many researchers. Explain briefly the reason.
In a prediction study, the goal is to develop a formula for making predictions about the dependent variable, based on the observed values of the independent variables. In a causal analysis, the independent variables are regarded as causes of the dependent variable. Many learning algorithms can potentially extract causal information from observational data, but their applicability is rather restricted. To find causation, you generally need experimental data, not observational data. Correlation is a necessary but not sufficient condition for causation. Correlation is a valuable type of scientific evidence in fields such as medicine, psychology, and sociology. But first correlations must be confirmed as real, and then every possible causative relationship must be systematically explored. In the end, correlation can be used as powerful evidence for a cause-and-effect relationship between a treatment and a benefit, a risk factor and a disease, or a social or economic factor and various outcomes.