Machine Learning – 101
Behzad Altaf
Coursera – Andrew Ng Machine Learning
https://www.coursera.org/learn/machine-learning/home/welcome
At a basic level, machine learning is about predicting the future based on
the past. [Hal Daumé III]
Systems that automatically learn programs from data. [Domingos]
Teaching a computer about the world. [Mark Dredze]
ML Application
Kinds of Learning
- Supervised Learning
- Unsupervised Learning
- Semi-supervised Learning
- Reinforcement Learning
https://en.wikipedia.org/wiki/Machine_learning
• Supervised Learning
Supervised learning is the machine learning task of inferring a function (f) from labeled training data.
- We have data with labels already attached (e.g. we know which messages are spam and which are not)
- We learn a pattern that fits the data
- We apply this pattern to new data to predict its labels
Goal: from the learning sample (the labeled dataset), find a function f of the inputs that best approximates the output
- Symbolic output ⇒ classification problem
- Numerical output ⇒ regression problem
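The two cases above can be sketched with one toy learner. This is a minimal illustration, assuming a 1-nearest-neighbour rule as the function f; the rule, the feature choices and all data values are made up for the example and do not come from the course.

```python
# Labeled training data: (input features, known output). Values are invented.
spam_training = [
    ((8, 1), "spam"),      # hypothetical features: (number of links, known-sender flag)
    ((6, 0), "spam"),
    ((0, 1), "not spam"),
    ((1, 1), "not spam"),
]

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def nearest_neighbor_predict(training, x):
    """f(x): return the output attached to the training input closest to x."""
    _, output = min(training, key=lambda pair: squared_distance(pair[0], x))
    return output

# Symbolic output => classification
label = nearest_neighbor_predict(spam_training, (7, 0))

# Numerical output => regression (same learner, numeric outputs)
price_training = [((50,), 100.0), ((80,), 160.0), ((120,), 240.0)]
price = nearest_neighbor_predict(price_training, (75,))

print(label, price)
```

The only difference between the two calls is the type of the stored outputs: labels in the first, numbers in the second.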
• Unsupervised Learning
In machine learning, the problem of unsupervised learning is that of trying to find
hidden structure in unlabeled data.
- We have a lot of data that we cannot easily make sense of
- We try different algorithms to see if a pattern emerges
Typical Process
Data → split into Training and Test → Hypothesis (learned from Training) → Validate (on Test)
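The process above can be sketched in a few lines. This is an illustrative sketch only: the helper names, the 25% test fraction, the toy data and the choice of a least-squares line as the hypothesis are all assumptions made for the example.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (training, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_line(rows):
    """Learn a hypothesis h(x) = intercept + slope * x by least squares."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def mean_squared_error(hypothesis, rows):
    return sum((hypothesis(x) - y) ** 2 for x, y in rows) / len(rows)

# Data: toy (x, y) pairs following y = 2x + 1
data = [(float(x), 2.0 * x + 1.0) for x in range(20)]

training, test = train_test_split(data)             # Training / Test
hypothesis = fit_line(training)                     # Hypothesis
test_error = mean_squared_error(hypothesis, test)   # Validate
print(len(training), len(test), test_error)
```

The test rows are never shown to `fit_line`, so `test_error` estimates how the hypothesis behaves on unseen data.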
Typical Process with Cross Validation
Data → split into Training and Test → Hypothesis (learned from Training) → Cross Validation → Tune / Select Best → Validate (on Test)
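A rough sketch of the cross-validation variant follows. It assumes k-fold cross validation with contiguous folds and two hypothetical candidate learners (a constant and a straight line); none of these specifics come from the slides.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, validation
        start += size

def fit_constant(rows):
    mean_y = sum(y for _, y in rows) / len(rows)
    return lambda x: mean_y

def fit_line(rows):
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return lambda x: (mean_y - slope * mean_x) + slope * x

def mse(hypothesis, rows):
    return sum((hypothesis(x) - y) ** 2 for x, y in rows) / len(rows)

def cross_val_score(fit, rows, k=5):
    """Average validation error of a learner over k folds of the training set."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(rows), k):
        hypothesis = fit([rows[i] for i in train_idx])
        scores.append(mse(hypothesis, [rows[i] for i in val_idx]))
    return sum(scores) / len(scores)

# Toy data following y = 3x + 2; the test set is held back until the end
training = [(float(x), 3.0 * x + 2.0) for x in range(15)]
test = [(20.0, 62.0), (25.0, 77.0)]

candidates = {"constant": fit_constant, "line": fit_line}
cv_scores = {name: cross_val_score(fit, training) for name, fit in candidates.items()}
best_name = min(cv_scores, key=cv_scores.get)       # Tune / Select Best
final_hypothesis = candidates[best_name](training)  # refit on all training data
test_error = mse(final_hypothesis, test)            # Validate
print(best_name, test_error)
```

Selection uses only the training data; the test set is touched once, for the final validation.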
Issues in Machine Learning
Underfitting
Overfitting
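One way to see both issues is to compare training error with test error for models of different flexibility. The sketch below is an assumption-laden toy: the data, the alternating noise and the three learners (a constant, a line, and a 1-nearest-neighbour memorizer) are invented for illustration.

```python
def fit_constant(rows):
    """Too rigid: always predicts the mean training output."""
    mean_y = sum(y for _, y in rows) / len(rows)
    return lambda x: mean_y

def fit_line(rows):
    """Least-squares straight line."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return lambda x: (mean_y - slope * mean_x) + slope * x

def fit_memorizer(rows):
    """Too flexible: repeats the output of the closest training input."""
    return lambda x: min(rows, key=lambda row: abs(row[0] - x))[1]

def mse(hypothesis, rows):
    return sum((hypothesis(x) - y) ** 2 for x, y in rows) / len(rows)

# True relation is y = x; training outputs carry alternating +1 / -1 noise
training = [(float(x), x + (1.0 if x % 2 == 0 else -1.0)) for x in range(10)]
test = [(x + 0.5, x + 0.5) for x in range(9)]

models = (
    ("constant (underfits)", fit_constant(training)),
    ("line", fit_line(training)),
    ("memorizer (overfits)", fit_memorizer(training)),
)
errors = {name: (mse(h, training), mse(h, test)) for name, h in models}
for name, (train_err, test_err) in errors.items():
    print(f"{name}: train MSE={train_err:.3f}, test MSE={test_err:.3f}")
```

The constant model is poor on both sets, while the memorizer is perfect on the training set yet worse than the line on the test set.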
Issues in Machine Learning
Loan Type (x1)   Loan Amount (x2)   Income Range (x3)   Car Segment (x4)   Customer Type (y)
House            50,000             > 50,000            Audi               HNI
NA               10,000             < 10,000            NA                 LNI
y = θ0 + θ1·x1 + θ2·x2 + θ3·x3 + θ4·x4
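The linear form can be evaluated directly once every input is a number. In this sketch the function name and the theta values are hypothetical; it also shows why the raw categorical values cannot be plugged into the formula as they are.

```python
def linear_hypothesis(theta, x):
    """Compute theta[0] + theta[1]*x[0] + ... + theta[n]*x[n-1]."""
    return theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))

theta = [0.0, 1.0, 0.0, 0.5, 0.5]   # hypothetical weights, not learned values

# Numeric inputs work
score = linear_hypothesis(theta, [1, 50000, 5, 5])

# Raw categorical inputs do not: a float weight cannot multiply a string
try:
    linear_hypothesis(theta, ["House", 50000, "> 50,000", "Audi"])
    categorical_ok = True
except TypeError:
    categorical_ok = False

print(score, categorical_ok)
```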
Data
Loan Type (x1)   Loan Amount (x2)   Income Range (x3)   Car Segment (x4)   Customer Type (y)
1                50,000             5                   5                  HNI
0                10,000             1                   0                  LNI
1                20,000             3                   2                  ?
Feature Vector: 1-Dimensional Array
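A small sketch of turning one table row into a feature vector. The lookup tables contain only the codes that can be read off by comparing the two tables above; codes for any other category would have to be defined, and the function and variable names are hypothetical.

```python
# Category -> numeric code, as implied by the encoded table rows
LOAN_TYPE = {"NA": 0, "House": 1}
INCOME_RANGE = {"< 10,000": 1, "> 50,000": 5}
CAR_SEGMENT = {"NA": 0, "Audi": 5}

def to_feature_vector(loan_type, loan_amount, income_range, car_segment):
    """Encode one row as a 1-dimensional array (a flat list of floats)."""
    return [
        float(LOAN_TYPE[loan_type]),
        float(loan_amount),
        float(INCOME_RANGE[income_range]),
        float(CAR_SEGMENT[car_segment]),
    ]

rows = [
    ("House", 50000, "> 50,000", "Audi", "HNI"),
    ("NA", 10000, "< 10,000", "NA", "LNI"),
]
vectors = [to_feature_vector(*row[:4]) for row in rows]   # inputs x1..x4
labels = [row[4] for row in rows]                         # output y
print(vectors, labels)
```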
What if the DATA IS
The curious case of Spark MLlib
http://spark.apache.org/
Spark and Machine Learning
• MLlib is Spark’s library of machine learning functions.
• It contains only parallel algorithms that run well on clusters.
• All learning algorithms require defining a set of features for each item, which will be fed
into the learning function.
• An important step is feature extraction and transformation to produce these feature
vectors.
http://spark.apache.org/mllib/
Algorithms in MLlib
K-Means
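To give a feel for what K-Means does, here is a single-machine sketch of the basic iterative procedure (Lloyd's algorithm). It is not the MLlib implementation and does not use the Spark API; the naive initialisation (first k points), the fixed iteration count and the sample points are assumptions made for the example.

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def nearest_centroid(point, centroids):
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))

def k_means(points, k, iterations=10):
    centroids = [tuple(p) for p in points[:k]]   # naive init: first k points
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest_centroid(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its cluster
        centroids = [mean_point(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    assignments = [nearest_centroid(p, centroids) for p in points]
    return centroids, assignments

# Unlabeled toy data with two visible groups
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, assignments = k_means(points, k=2)
print(centroids, assignments)
```

No labels are supplied; the grouping emerges from the distances alone, which is what makes this an unsupervised method.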
Random Forest
http://opinions5.blogspot.in/2013/08/random-forest-confidence.html
Random Forest – tree a
http://www.slideshare.net/InfoQ/a-taste-of-random-decision-forests-on-apache-spark
Random Forest – tree b
http://www.slideshare.net/InfoQ/a-taste-of-random-decision-forests-on-apache-spark
PCA
http://www.scipy-lectures.org/packages/scikit-learn/#dimension-reduction-with-principal-component-analysis
