2. Unraveled
Machine Learning
ravel [rav-uh l]
verb, raveled, raveling.
1. to disentangle or unravel the threads or fibers of (a woven or knitted fabric, rope, etc.).
2. to tangle or entangle.
5. ML Definitions: 39 Years Apart
Field of study that gives computers the ability to learn without being explicitly programmed.
— Arthur Samuel (1959)
A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
— Tom Mitchell (1998)
6. Field of study that gives computers the ability to learn without being explicitly programmed.
— Arthur Samuel (1959)
9. A solved game is a game whose outcome (win, lose, or draw) can be correctly predicted from any position, given that both players play perfectly.
11. A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
— Tom Mitchell (1998)
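Mitchell's definition is concrete enough to run. A minimal sketch in R, assuming the caret package and R's built-in iris data (both stand-ins I'm introducing, not part of the deck): T is classifying iris species, E is a growing pool of labeled examples, and P is accuracy on held-out rows.

library(caret)
set.seed(1)
test_idx <- sample(nrow(iris), 50)                 # held-out rows for measuring P
pool <- setdiff(seq_len(nrow(iris)), test_idx)
for (n in c(10, 40, 100)) {                        # growing experience E
  fit <- knn3(Species ~ ., data = iris[sample(pool, n), ], k = 3)
  pred <- predict(fit, iris[test_idx, ], type = "class")
  cat(n, "examples ->", mean(pred == iris$Species[test_idx]), "accuracy\n")
}

Accuracy (P) should generally climb as the number of training examples (E) grows, which is exactly Mitchell's criterion for learning.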
16. Statistical Science, 2001, Vol. 16, No. 3, 199–231
Statistical Modeling: The Two Cultures, Leo Breiman
18. This enterprise has at its heart the belief that a statistician, by imagination and by looking at the data, can invent a reasonably good parametric class of models for a complex mechanism devised by nature.
Statistical Modeling: The Two Cultures, Leo Breiman. Statistical Science, 2001, Vol. 16, No. 3, 199–231
46. [Diagram: a dataset of cat and dog pictures, each labeled Cat or Dog.]
47. [Diagram: the training dataset (labeled examples) trains a model; the model recalls labels for the training pictures and makes predictions on a separate test dataset.]
48. [Diagram: the labeled examples are split further, holding out a cross-validation dataset; the model trains on the remaining training dataset and its predictions are scored against the held-out labels.]
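The split in these diagrams takes one call in R. A minimal sketch using caret's createDataPartition, assuming the splorkData frame with its splork label column from the later slides:

library(caret)
set.seed(1)
# hold out 25% of the labeled examples as a test set, preserving the class mix
in_train <- createDataPartition(splorkData$splork, p = 0.75, list = FALSE)
training <- splorkData[in_train, ]
testing  <- splorkData[-in_train, ]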
58. Predictive modeling: The process of developing a mathematical tool or model that generates an accurate prediction.
— Kuhn, Max; Johnson, Kjell (2013). Applied Predictive Modeling. Springer.
Predictions do not have to be accurate to score big value.
— Siegel, Eric. Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die. Wiley.
69. # fit a 3-nearest-neighbor classifier (caret's knn3) on the sampled training rows
> k1 <- knn3(splorkData[samp, -c(3,4)], splorkData$splork[samp], k = 3)
> k1
3-nearest neighbor classification model
Call:
knn3.data.frame(x = splorkData[samp, -c(3, 4)], y = splorkData$splork[samp], k = 3)
Training set class distribution:
 no yes
386 114
# predict the held-out rows and tabulate predictions against the true labels
> pred <- predict(k1, newdata = splorkData[-samp, -c(3,4)], type = "class")
> str(pred)
Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
> table(pred, splorkData$splork[-samp])
pred   no yes
  no  332   5
  yes   0 117
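Reading the table: 332 + 117 of the 454 held-out rows are classified correctly, about 98.9% accuracy, with 5 splorks missed and no false alarms.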
70. # fit a 500-tree random forest on the first 500 rows
> rf <- randomForest(splork ~ ., data=splorkData[1:500,])
> rf
Call:
 randomForest(formula = splork ~ ., data = splorkData[1:500, ])
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 1
        OOB estimate of error rate: 0%
Confusion matrix:
     no yes class.error
no  462   0           0
yes   0  38           0
# score the remaining rows (this call is cropped on the original slide; reconstructed here)
> pred <- predict(rf, newdata = splorkData[-(1:500), ])
> table(pred, splorkData$splork[-(1:500)])
pred   no yes
  no  418   0
  yes   0  36
72. Title: SECOM Data Set
Abstract: Data from a semi-conductor manufacturing process
-----------------------------------------------------
Data Set Characteristics: Multivariate
Number of Instances: 1567
Area: Computer
Attribute Characteristics: Real
Number of Attributes: 591
Date Donated: 2008-11-19
Associated Tasks: Classification, Causal-Discovery
Missing Values? Yes
74. # make training and test subsets
> train <- secom[1:1000,]
> test <- secom[1001:1567,]
# drop near-zero-variance variables, choosing the columns on the training
# set so that train and test keep the same columns
> nzv <- nearZeroVar(train)
> train <- train[,-nzv]
> test <- test[,-nzv]
# impute missing values (randomForest's na.roughfix: median/mode fill)
> train <- na.roughfix(train)
> test <- na.roughfix(test)
# scale and center; estimate the transformation on the training set only
# and apply it to both sets
> tr1 <- preProcess(train, method = c("center", "scale"))
> traincs <- predict(tr1, train)
> testcs <- predict(tr1, test)
76. > fit <- glm(secResp ~ .,data=data.frame(sec[train,],secResp=
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
>
For the background to warning messages about ‘fitted probabilities numerically 0 or 1 occurred’ for binomial GLMs, see Venables & Ripley (2002, pp. 197–8).
There is one fairly common circumstance in which both convergence problems and the Hauck-Donner phenomenon can occur. This is when the fitted probabilities are extremely close to zero or one. Consider a medical diagnosis problem with thousands of cases and around 50 binary explanatory variables (which may arise from coding fewer categorical variables); one of these indicators is rarely true but always indicates that the disease is present.
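That circumstance is easy to reproduce. A hedged toy sketch, with every name made up (this is not the SECOM data): a rarely-true indicator that always implies the positive class has no finite maximum-likelihood coefficient, and glm typically emits exactly these warnings.

set.seed(42)
n <- 1000
rare <- rbinom(n, 1, 0.01)                     # indicator that is rarely true
noise <- rnorm(n)
y <- ifelse(rare == 1, 1, rbinom(n, 1, 0.1))   # outcome is always 1 when the indicator is true
fit <- glm(y ~ rare + noise, family = binomial)
# typically: Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred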
77. Call:
 randomForest(formula = secResp ~ ., data = data.frame(sec
       Type of random forest: classification
             Number of trees: 500
No. of variables tried at each split: 21
        OOB estimate of error rate: 7.6%
Confusion matrix:
    0  1 class.error
0 924  0           0
1  76  0           1
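Note what this confusion matrix says: the forest never predicts class 1. It reaches 92.4% accuracy simply by answering 0 every time, the majority-class baseline, and misses all 76 failing units (class.error = 1).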
79. > ab
Call:
ada(sec[train, subset], y = secResp[train], test.x = sec[test, ], test.y = secResp[test], type = "gentle", iter = 100)
Loss: exponential Method: gentle Iteration: 100
Final Confusion Matrix for Data:
          Final Prediction
True value   0  1
         0 924  0
         1  50 26
Train Error: 0.05
Out-Of-Bag Error: 0.061 iteration= 100
Additional Estimates of number of iterations:
train.err1 train.kap1 test.err2 test.kap2
        95         95         2         1
> pred <- predict(ab, sec[test,])
> table(pred, secResp[test])
pred   0   1
   0 539  28
   1   0   0
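On the training data, gentle boosting recovers 26 of the 76 positives. On the held-out set, though, the table shows every one of the 539 + 28 cases predicted as 0: about 95% accuracy, yet all 28 failures are missed. The majority-class baseline again.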
82. People . . . operate with beliefs and biases. To the extent you can eliminate both and replace them with data, you gain a clear advantage.
— Michael Lewis, Moneyball: The Art of Winning an Unfair Game
Editor's Notes
Hi, I’m Mark Fetherolf. I am a data scientist and president of Numinary Data Science.
My goal today is to *unravel* machine learning
I chose unraveled over explained, expounded, revealed, uncovered, elucidated, and of course *for dummies*
I chose unraveled because
I really like the word *ravel*
It has …
two definitions that are exact opposites
so no matter whether I enlighten or confuse you, I will still achieve the goal of unraveling
Machine Learning also has several definitions …
They are not opposites but are separated by 39 years
The first from Arthur Samuel, arguably the father of Machine Learning
In 1959, checkers on IBM’s first commercial computer, the IBM 701. Sensational; IBM’s stock rose 15 points overnight.
World’s Fair, NY 1965; age 11; Selectric, Touch-Tone, tic-tac-toe; played 100 times; could get to a tie every time. Samuel used “rote learning”: a polynomial function rated each position to score moves (temporal scoring, alpha/beta pruning). Problem with getting “stuck” / nudge … 8 years later …
In 1967, he wrote that getting the program to generate its own parameters, without nudging, seemed as far in the future as it had in 1959.
A mere 40 years later …
In the absence of a solution, in other words, in the absence of exact rules,
—> we rely on heuristics
Games that aren’t solved can still be played
And they can be played according to very specific rules or algorithms
1959: Arthur Samuel wrote computer programs that applied heuristics to checkers; and they were a very special kind of heuristic —> ones that get better with experience!
39 years forward … and that rhymes with T and that stands for _______________ Tom Mitchell -> professor at CMU -> a more formal definition, 39 years after Arthur Samuel. T = Task; E = Experience; P = Performance. Sounds complicated, but really, is it …
Little Amanda - unlike Amy, who got the answer wrong on purpose; funny; Siri - haircut
Task: classify cats and dogs. Experience: 1) guess; 2) look at mom/dad for feedback. Performance: right/wrong.
Machine learning has a lot in common with human learning and quite a lot NOT in common … SIMILARLY …
As a species I have the feeling that we are unique in being defensive about our intellect.
Comcast: ask people, of the tv shows you haven’t yet seen, which ones are you likely to watch next year
Machine learning has a lot in common with statistics and a lot that is not. How many of you are statisticians? Know a statistician? Know more than one statistician? Some statisticians are exceptionally methodical, perhaps to the point of being fussy about methodology and its proper application … we have these in the computer programming business too … difference between … ML people, on the other hand, are sometimes a bit less *regulated*
Prominent ML pioneer Leo Breiman stirred things up a bunch in 2001 when he wrote that statistical methods led to irrelevant theory and questionable conclusions
Linear regression IS a form of ML, but Breiman points out that
Every ML book and course starts out with linear regression, which we all learned in statistics
so what’s the issue …?
Well, Breiman says ..
an input and an output connected by nature; x is rainfall, y is plant growth; x is force, y is acceleration; x is how much time college students spend playing Minecraft and y is their GPAs
Breiman says that statisticians assume that there must be a parametric model that describes the relationship
algorithmic modelers on the other hand treat nature as entirely unknown and just go for results
Of course some of this difference in perspective can be traced to the parent disciplines
disciplines that have coexisted peacefully but warily; I remember as a maths student marveling and waxing poetic about pi, how such a simple natural thing could be so complex and subtle; my friend Ed, a comp sci major, said, “what’s the big deal? you’ve got a circle. measure it.” In one of my three favorite public relations campaigns …
In one of my three favorite public relations campaigns, we aimed to resolve the conflict by renaming the whole thing data science
which encompasses —>
(would you like to hear the other two)
mathematics, statistics, computer science, machine learning; hacking is included to highlight practicality; data science is a practice not a purely academic discipline;
Data scientists are described as people who are better at programming than most mathematicians and better at maths than most programmers; when I heard this I was, of course, thrilled .. THAT’s me
Davenport and Patil’s HBR article was the icing on the cake for me:
I’ve had lots of jobs; several in the 21st century; none until now sexy
but of course we can’t talk about data science without mentioning the elephant in the room (who knows his name?) Hadoop!
Doug Cutting, Hadoop's creator, named the framework after his child's stuffed toy elephant.
I don’t know about you, but I am deeply suspicious of such a mnemonic. V is the first letter of 0.649% of English words, which means the chance that the three best words would all start with V is 0.000000273359449
(2.7 × 10^-7)
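A quick sanity check of that arithmetic in R:

p <- 0.00649      # fraction of English words starting with V
p^3               # chance three independent words all start with V
#> [1] 2.733594e-07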
Here is, I think, a clearer illustration of big data. Long ago, in the real world, one of my jobs, as a child, was to make a database to collect data from this incredibly clean oil refinery,
actually looked more like this; with thousands of pressure, temperature and flow-sensing devices feeding data into a place like this …
feeds from temperature, pressure, flow rates etc. all fed into the control room where
Harold, the diligent young engineer, … unrealistic (lacks pocket protector)
collects data; in 30 minutes he could record 80–100 measurements;
so he could gather one sample for each data point every week or so;
in reality he focuses his sampling on the project he’s working on that day; optimizing his thin-film reactor perhaps
1937 Census used statistical sampling to measure the extent of unemployment.
Statistics are great when you are overwhelmed with data. Take a random sample and use statistics to predict population parameters within known confidence limits …
and it works great in the absence of sampling bias
and we have ways of measuring these and we have stratification and other strategies that work until we encounter effects like …
data points live in some kind of high-dimensional spatial landscape, and the probability of a point having some property is a function of its neighbors having that same property
We replace Harold …
with a computer system that collects 10,000 data points per second; and it doesn’t care what for or whether they are used
N = ALL is a big idea. So big I put it on my facebook page.
But I don’t spend much time on facebook.
When it comes to obsessive time-wasting on the internet, I much prefer Kaggle;
217K people competing to solve ML problems: sentiment analysis; recommendation; web optimization; the Higgs boson; predicting seizures;
It took me around 500 hours to get up to 925th; which puts me in the top 1% of the sexiest job of the century! Whoa!
Let’s look at cats and dogs. September 2013 - petfinder data - 4 months;
good example to step through the ML process
cats and dogs … EXAMPLES; LABELED
Kaggle got data from Petfinder: 25K labeled pictures of cats and dogs for training and a bunch of others for testing;
4 months later, Pierre Sermanet achieved a 0.98914 success rate
PREDICTING … which are cats and dogs
Interesting we use the word predicting … cat and dog are TRUTH / ground-truth
and we call it predictive modeling, even when what we are predicting may already be in the past. We could call it guessing; when you flip a coin and cover it with your hand and I say heads, am I predicting or GUESSING? I think of predictive modeling as informed guessing …
Jeane Dixon was one of the great guessers of all time, and she did predict that Kennedy would be assassinated, sort of … which brings up the issue of accuracy —>
What does accurate mean? Mark’s weather prediction model. How accurate is it?
I also have a psychological test. It’s a test for being a psychopath: I ask you if you are a psychopath and then classify you as a psychopath regardless of what you say. Like the stopped watch that is right twice daily …
more accurate than Mark’s weather forecast … the concept of a baseline
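The stopped-watch point is worth making concrete. A minimal sketch with made-up numbers (not from the talk): a “model” that always answers no looks 99% accurate when yes is rare, so accuracy only means something relative to this baseline.

# 990 negatives, 10 positives; the "model" always answers no
actual <- factor(c(rep("no", 990), rep("yes", 10)))
predicted <- factor(rep("no", 1000), levels = levels(actual))
mean(predicted == actual)   # 0.99 accuracy, zero skill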
However, if enough patients have taken the alternative therapy, then data could be collected on these patients related to their disease, treatment history, and demographics. Also, laboratory tests could be collected related to patients’ genetic background or other biological data (e.g., protein measurements). Given their outcome, a predictive model could be created to predict the response to the alternative therapy based on these data. The critical question for the doctor and patient is a prediction of how the patient will react to a change in therapy. Above all, this prediction needs to be accurate.
— Kuhn, Max; Johnson, Kjell (2013). Applied Predictive Modeling (p. 4). Springer. Kindle Edition.
edges are numerical values; color is a label or classifier
we will use a common trick and recode colors using dummy variables —> dummy as in showroom dummy, as opposed to quantum electrodynamics for dummies
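In R the recoding is one call. A minimal sketch with a made-up color column (model.matrix is base R):

# expand a factor into one 0/1 indicator column per level
color <- factor(c("red", "green", "blue", "green"))
model.matrix(~ color - 1)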
we will look at the distribution of each of the variables over the domain of splork-ness versus non-splork-ness