Machine Learning 101

Talha Obaid

About me
• Email Security @ Symantec
• Doing Data Science to fight Spam and Malware
• Organizer for Python Data Science Group Singapore
• Monthly regular meet-ups for over a year
• http://meetup.com/pydata-sg >1.8K members
• https://www.facebook.com/groups/pydatasg/ >1k members
• https://twitter.com/pydatasg
• https://engineers.sg/organizations/118 recorded and uploaded
• Previously with CENSAM @ MIT
• Co-founded startup(s)
• NUS Alumni
• Some questions
• How many of you have heard about Machine Learning or ML?
• How many of you know how to do ML?
• How many of you earn a living doing ML?
• What this talk offers
• Getting a foot in the door
• Grossly oversimplifying things
• How to learn ML from literature
• Relate to ML terms when thrown at you
• Types of ML
• Learning ML models and their coding (scikit-learn, and why?)
• Linear Regression
• Logistic Regression
• Clustering
• Lessons from Practical ML
@ObaidTal

Some terminology
• Data Science
• Data Analytics
• Business Analytics
• Artificial Intelligence
• Machine Learning
Ref. Tuan Q. Phan

What is Data?
• Available data (Internal)
• Health record
• Organization
• University
• …
• Available data (external)
• www.data.gov.sg
• Publicly available corpora
• Quality of data
• Trustworthy or not
• Missing data
• Huge challenge in the scientific community
• Other jargon
• Tiny Data: Data from sensors
• Big Data: Data on massive scale
• Fast Data: Hash-based lookup

Machine Learning, defined
• A field of study that gives computers the ability to learn without being explicitly programmed
– Arthur Samuel (1959)
• Samuel wrote a program to play checkers
• Eventually his program learned to play better
Ref: http://infolab.stanford.edu/pub/voy/museum/samuel.html

When did we all start with Machine Learning?
• Take a look at the following (outputs) and guess the ?:
• 1, 2, 3, 4, 5, 6, ?, …, ?
• 2, 4, 6, 8, 10, 12, ?, …, ?
• 3, 6, 9, 12, ?, …, ?
• 1, 3, 9, 27, ?, …, ?
• 4, 7, 10, 13, ?, …, ?
• So how can I represent the above?
• Input -> [box] -> Output
• X -> [box] -> Y
• Call this box f()
• Output = f(Input) ... in maths
• Y = f(X)
• Answers
• Y=X
• Y=2*X
• Y=3*X + 0
• Y=3^X (taking the input as 0, 1, 2, 3, …; with the input 1, 2, 3, … this is Y=3^(X-1))
• Y=3*X + 1
In school… Really, how? How to find ‘…, ?’ – A: Equation (single variable)
Assuming the input is 1, 2, 3, 4, 5, 6, …
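A minimal sketch of the idea on this slide, assuming the inputs are 1, 2, 3, 4, 5, 6: each sequence is just a rule f applied to the input. The names `rules` and `outputs` are illustrative and not from the slides. Note that with inputs starting at 1, the sequence 1, 3, 9, 27 corresponds to 3^(X-1).

```python
# Each sequence from the slide written as a rule Y = f(X).
rules = {
    "Y=X": lambda x: x,
    "Y=2*X": lambda x: 2 * x,
    "Y=3*X": lambda x: 3 * x,
    "Y=3^(X-1)": lambda x: 3 ** (x - 1),
    "Y=3*X+1": lambda x: 3 * x + 1,
}

inputs = [1, 2, 3, 4, 5, 6]

# Apply every rule to every input to regenerate the sequences.
outputs = {name: [f(x) for x in inputs] for name, f in rules.items()}

for name, ys in outputs.items():
    print(name, ys)
```

Once the rule is known, finding the "?" values is only a matter of plugging in the next input.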

Linear Regression – Statistical term
Y=mx+b… from the last example, b=? & m=? Answer: b = 1, m = 3
[Figure: the line Y=mx+b, with Input on the horizontal axis and Output on the vertical axis]
Suppose this line is Y=3x+1.
Assume that this line is ‘surrounded’ by ‘+’ shaped points, which are the outputs we had, i.e. 4, 7, 10, 13, ?, …, ? (Y), having inputs 1, 2, 3, 4, 5, 6, … (x).
The line Y=3x+1 kind of ‘fits’ these points, so it can be used to find out ‘…, ?’
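As a rough sketch of how m and b could be recovered from the points themselves, the snippet below uses the standard closed-form least-squares formulas for a single input variable. The function name `fit_line` is an assumption made for illustration; the slides do not show this code.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of Y = m*x + b for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by how x varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - m * mean_x
    return m, b


xs = [1, 2, 3, 4]      # inputs from the slide
ys = [4, 7, 10, 13]    # outputs from the slide

m, b = fit_line(xs, ys)
next_y = m * 5 + b     # the first '?' in the sequence
print(m, b, next_y)
```

Because these four points lie exactly on a line, the fit returns the slope and intercept given on the slide; with noisy '+' points it would return the closest line instead.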

Where are we headed…
https://www.ltcconline.net/greenl/courses/154/factor/circle.htm
http://machinelearningmastery.com/basic-concepts-in-machine-learning/
[Figure: x and Y go in; m and b come out]
Since x and Y are already known, we got m and b, and therefore Y=mx+b
How to fit a line between ‘+’ shaped points?
A: Distance formula
Making sure each ‘+’ is closest to the line, or vice versa
So far, there is a single variable ‘x’ on the right-hand side. However, there can be more than one variable. Let’s sum up all the previous equations on the right-hand side:
Y=x+2x+3x+(3x+1)+3^x
Let’s assume the ‘x’ to be different from each other on the right-hand side:
Y=x1+2*x2+3*x3+(3*x4+1)+3^x5
[Figure: x goes into mx+b, which gives Y]
Next, let’s move to different types of Machine Learning…

Supervised Learning
• We provide the known output along with a dataset (input), and the algorithm comes up with the answer, i.e. the model.
• In the literature, the “Boston housing prices” example is a “regression problem”, i.e. the outcome being predicted is a continuous-valued variable.
• “Classification problem”: the variable we are trying to predict is discrete, e.g. in the spam problem the output is either 0 or 1.
• There can be more than one feature (input variable), i.e. a graph with multiple dimensions.
• Give the model the right answers, i.e. Y, train it with a number of input sets, i.e. x, and ask the algorithm or model to replicate the same.
Types of Machine Learning
Ref. Andrew Y. Ng.
1

Unsupervised Learning
• Data is given, and structure must be inferred
• Clustering is one example of it
• Deep Learning is also considered here
• Example is finding clusters in
– Gene data
– Image processing, grouping pixels together
– Social network analysis
– Lots of people talking, extracting the voice of single person, considering
voices of others as noise – Cocktail party problem
– Text processing
• Independent component analysis ICA algorithm Ref. Andrew Y. Ng.
2

Reinforcement Learning
• A sequence of decisions is made over time
• Example
• Flying an autonomous helicopter
• Reward function
• Specify what you want to get done
• Specify a good behavior and bad behavior in Reward function
• Learning algorithm will decide to maximize good behavior and minimize
bad behavior
Ref. Andrew Y. Ng.
3
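To make the reward-function idea concrete, here is a toy sketch. The state names and reward values are hypothetical, chosen only for illustration; a real helicopter controller would use continuous sensor readings and a proper learning algorithm.

```python
def reward(state):
    """Hypothetical reward: good behavior earns points, bad behavior loses them."""
    if state == "crashed":
        return -100          # bad behavior, heavily penalized
    if state == "hovering_steady":
        return 1             # good behavior, rewarded
    return 0                 # neutral states such as "wobbling"


def total_reward(states):
    """Sum the reward over a sequence of states visited over time."""
    return sum(reward(s) for s in states)


steady_flight = ["hovering_steady"] * 5
risky_flight = ["hovering_steady", "wobbling", "wobbling", "crashed"]

print(total_reward(steady_flight), total_reward(risky_flight))
```

A learning algorithm would prefer the sequence of decisions that leads to the higher total reward.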

Getting ready – some more terms…
• The data set/input is also called the training set or observations
• The predictor is called the hypothesis for historical reasons; it is also called a classifier or estimator
• Boston housing price problem (we’ll see more of it)
• We will train/learn and predict price
• Features or input variable on right side of Y = mx+b, i.e. x
• Price, i.e. Y, output or target variable of Y = mx+b
• Linear equation, i.e. Y=mx+b can be written as predictor, where m is slope
and b is intercept
• Cost function – which Y=mx+b is better (we will see more of it)
Let’s get coding… 1
• To remember
• Will expand on

Popular Machine Learning Tool Kit –
Introduction
Project        Language       Highlight
R              R              A language for statistical analysis and ML
Octave         Octave         A language to simulate Matlab for numerical computations
Scikit-learn   Python         Documentation, examples, tutorials available. General purpose with simple API
Tensorflow     Py bindings    A library for numerical computation using data flow graphs
Orange         Python         General purpose ML package
PyBrain        Python         Neural networks, unsupervised learning
MLlib          Python/Scala   Apache’s new library based within Spark
Mahout         Java           Apache’s framework based on Hadoop
Weka           Java           General purpose ML package
GoLearn        Go             Machine Learning in Go
shogun         C++            User interfaces to various languages

Machine Learning Kit – which to choose
• Factors to consider
• Language
• Performance (run speed)
• Scalability
Ref. T. Obaid & H. Zhang
• We choose Scikit learn
• Language: Python
• Performance (run speed): good enough
• Scalability: not critical, and can switch to MLlib in Spark for mass data
• Well documented, enough algorithms, clean API, robust, fast implementation, easy usage
Scikit Learn – Machine Learning in Python
• Simple and efficient tools for data mining and data analysis
• Accessible to everybody, and reusable in various contexts
• Built on NumPy, SciPy, and matplotlib
• Open source, commercially usable – BSD license

Scikit Learn – Examples
• A lot of sample codes are in source folder:
scikit-learn-0.16.1/examples
• Boston housing prices (we will work with this example dataset)
• Will try features one by one (test only 3 of them in this session,
please try more)
• Excerpt of data… (what our data actually looks like), columns 1 to 14:
CRIM     ZN  INDUS  CHAS  NOX    RM     AGE   DIS     RAD  TAX  PTRATIO  B       LSTAT  MEDV
0.00632  18  2.31   0     0.538  6.575  65.2  4.09    1    296  15.3     396.9   4.98   24
0.02731  0   7.07   0     0.469  6.421  78.9  4.9671  2    242  17.8     396.9   9.14   21.6
0.02729  0   7.07   0     0.469  7.185  61.1  4.9671  2    242  17.8     392.83  4.03   34.7
0.03237  0   2.18   0     0.458  6.998  45.8  6.0622  3    222  18.7     394.63  2.94   33.4
0.06905  0   2.18   0     0.458  7.147  54.2  6.0622  3    222  18.7     396.9   5.33   36.2
Details about each feature of this data are coming next…
• To remember
• Please explore …

Features of Boston housing prices
1. CRIM per capita crime rate by town
2. ZN proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS proportion of non-retail business acres per town
4. CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX nitric oxides concentration (parts per 10 million)
6. RM average number of rooms per dwelling
7. AGE proportion of owner-occupied units built prior to 1940
8. DIS weighted distances to five Boston employment centers
9. RAD index of accessibility to radial highways
10. TAX full-value property-tax rate per $10,000
11. PTRATIO pupil-teacher ratio by town
12. B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. LSTAT % lower status of the population
14. MEDV Median value of owner-occupied homes in $1000s
Features and their details
Which of these features are
significant:
• All of them?
• A few of them?
• Another one, not in them?
Let’s observe these…

Scikit Learn – Demo code for Boston house
price. Try it!
import matplotlib.pyplot as plt # for plotting
import numpy as np # for matrix/array operations
from sklearn import datasets, linear_model # datasets and estimators
# Note: load_boston was removed in scikit-learn 1.2, so this demo needs an older release such as 0.16.1
boston = datasets.load_boston()
feature = 12 # column index: 12 for LSTAT, 10 for PTRATIO, 5 for RM; try each one by one
boston_X_train = boston.data[:, np.newaxis, feature] # a single feature, shape (n_samples, 1)
boston_y_train = boston.target # MEDV, the house price
regr = linear_model.LinearRegression() # estimator
regr.fit(boston_X_train, boston_y_train) # learn the coefficient and the intercept
print('Coefficients', regr.coef_)
print('Intercept', regr.intercept_)
fig, ax = plt.subplots()
ax.scatter(boston_X_train, boston_y_train, color='black') # the real data points
ax.plot(boston_X_train, regr.predict(boston_X_train), color='green', linewidth=3) # the fitted line
ax.set_xlabel(boston.feature_names[feature])
ax.set_ylabel('MEDV (real: black points, predicted: green line)')
plt.show()
• Important ...
• Good Feature?
• Not so Good Feature?
• Comments

Scikit Learn – Demo result for Boston house price
• Parameters
(Coefficients, -0.95692593 )
(intercept, 34.7411998746244)
• Feature:
• % lower status of the population
• y=-0.95692593 *LSTAT + 34.7411998746244
• Looks good!
1st Try with LSTAT % lower status of the population

Demo result Contd.
• Parameters
(Coefficients, -2.1571753)
(intercept, 62.3446274748)
• Feature:
• pupil-teacher ratio by town
• y=-2.1571753*PTRATIO + 62.3446274748
• Doesn’t look good!
2nd Try with PTRATIO pupil-teacher ratio by town

Demo result Contd.
• Parameters
(Coefficients, 9.126359)
(intercept, -34.7856369115583)
• Feature:
• average number of rooms per dwelling
• y=9.126359*RM -34.7856369115583
• Looks good!
3rd Try with RM average number of rooms per dwelling

Cost function – the lower the cost, the better the model
Predicted = -0.95692593 * LSTAT + 34.7411998746244
Real (MEDV)  LSTAT  Predicted    Difference    Square
...          ...    ...          ...           ...
18.3         14.1   21.24854426  2.948544262   8.693913263
21.2         12.92  22.37771686  1.177716859   1.387017
17.5         15.1   20.29161833  2.791618332   7.793132909
16.8         14.33  21.0284513   4.228451298   17.87980038
22.4         9.67   25.48772613  3.087726132   9.534052663
20.6         9.08   26.05231243  5.45231243    29.72771084
23.9         5.64   29.34413763  5.444137629   29.63863453
22           6.48   28.54031985  6.540319848   42.77578372
11.9         7.88   27.20062355  15.30062355   234.1090809
Total: 19478.69458
Total/2: 9739.347291

Predicted = 9.126359 * RM - 34.7856369115583
Real (MEDV)  RM     Predicted    Difference    Square
...          ...    ...          ...           ...
18.3         5.794  18.09248713  -0.207512866  0.043061589
21.2         6.019  20.14591791  -1.054082091  1.111089054
17.5         5.569  16.03905636  -1.460943641  2.134356321
16.8         6.027  20.21892878  3.418928781   11.68907401
22.4         6.593  25.38444798  2.984447975   8.906929718
20.6         6.12   21.06768017  0.467680168   0.21872474
23.9         6.976  28.87984347  4.979843472   24.79884101
22           6.794  27.21884613  5.218846134   27.23635497
11.9         6.03   20.24630786  8.346307858   69.66085487
Total: 22062.73306
Total/2: 11031.36653
Ref. Andrew Y. Ng.
Least-squares cost function:
$J = \frac{1}{2} \sum_{i=1}^{m} \left( \text{Predicted}_i - \text{Real}_i \right)^2$
(here m is the number of rows, not the slope)
Comment: the summation is nothing but a for loop over the rows, as: for (i = 1; i <= m; i++)
How well are we doing – compare the good ones
• 1 Good Feature (LSTAT)
VS
• Another Good Feature (RM)
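The cost calculation on this slide can be sketched as a plain loop. This is a minimal illustration, assuming the LSTAT line from the demo result and using only the first three visible rows of the table, so its total will not match the slide's totals, which cover the whole dataset. The function names `predict` and `cost` are illustrative.

```python
def predict(x, m, b):
    """Predicted value from the line Y = m*x + b."""
    return m * x + b


def cost(xs, ys, m, b):
    """Least-squares cost: half the sum of squared differences."""
    total = 0.0
    for x, y in zip(xs, ys):          # the summation is just a for loop
        diff = predict(x, m, b) - y   # Difference column (Predicted - Real)
        total += diff ** 2            # Square column
    return total / 2                  # the Total/2 row


# First three visible rows of the LSTAT table.
lstat = [14.1, 12.92, 15.1]
real = [18.3, 21.2, 17.5]

m, b = -0.95692593, 34.7411998746244  # parameters from the LSTAT demo result
fitted_cost = cost(lstat, real, m, b)
flat_cost = cost(lstat, real, 0.0, 0.0)  # a deliberately poor line, for comparison

print(fitted_cost, flat_cost)
```

The lower the cost, the better the line describes the data, which is why the fitted line beats the flat one.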

Over-fitting and Under-fitting
The Good model is … the “Just right!” model – Why?
• Under-fitting – high bias: the model does not match the data and the cost is too high
• Just right is what we need
• Over-fitting – high variance: happens mostly when too many features are used or the model is too complex
• The model should learn, not memorize
http://i.imgur.com/W0qejU0.png
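A toy illustration of "learn, not memorize", reusing the Y=3x+1 points from earlier in the talk. This is not a real over-fitted model, only a sketch of the difference: the lookup table is perfect on the training data and useless on anything new. All names here are illustrative.

```python
# Training data: the inputs and outputs from the Y = 3x + 1 example.
train = {1: 4, 2: 7, 3: 10, 4: 13}


def memorizer(x):
    """Memorizes the training pairs; has no answer for unseen inputs."""
    return train.get(x)


def learned_line(x):
    """Has learned the underlying rule instead of the individual points."""
    return 3 * x + 1


# Both are perfect on the data they were trained on...
train_ok_memorizer = all(memorizer(x) == y for x, y in train.items())
train_ok_line = all(learned_line(x) == y for x, y in train.items())

# ...but only the learned rule generalizes to a new input.
unseen = 5
print(train_ok_memorizer, train_ok_line, memorizer(unseen), learned_line(unseen))
```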

Scikit Learn – Usage
from sklearn import linear_model
X = [[1], [2], [3], [4]] # source data with shape (n_samples, n_features); sample values from the earlier Y=3x+1 example
Y = [4, 7, 10, 13] # target values with shape (n_samples,)
clf = linear_model.LinearRegression() # estimator (a regressor here, not a classifier)
clf = clf.fit(X, Y) # learn parameters from existing data
Test = [[5], [6]] # same number of features as X
print(clf.predict(Test)) # predict the target for data in Test
The model program skeleton would look something like…
• Important
1. Model
2. Fit
3. Predict
• Comments

Observations from code
• There is always a fit function call, i.e. learning/training on X together with Y.
• Likewise there is a predict function call: given X only, it pops out Y.
• The pandas library (imported as pd) can alternatively be used for a relatively simpler display of the data
• The train_test_split function call serves an important purpose: it shuffles the dataset so we don’t have selection bias. For instance, if the data is ordered by ascending price and one half is used for training and the other half for testing, then the training data may contain only the lower-priced houses.
• To remember
• Subtleties
• Probable Issue

Scikit Learn – Test Data
• Scikit-learn comes with a few standard datasets, for instance the iris and
digits datasets for classification and the Boston house prices dataset for
regression.
• Boston (Boston house prices), iris (iris flower), mlcomp (20 newsgroups), svmlight_file/s, diabetes, lfw_pairs (labeled faces), sample_image/s (china and flower), digits (0-9 handwriting), lfw_people (labeled people), linnerud (for multivariate regression)
• Scipy.misc.lena()
• Load test data … Try others!
from sklearn import datasets
iris = datasets.load_iris()
digits = datasets.load_digits()
Subset of learning datasets – just saw Boston housing prices
• … Seen so far
• Ahead …
• Please Explore …

Scikit Learn – Main Algorithms
• Supervised learning (most have both classifier and regressor)
• Linear model: LinearRegression, Lasso, Ridge, LogisticRegression, SGD
• SVM: LinearSVC, SVC, SVR
• Naïve Bayes: GaussianNB, MultinomialNB, BernoulliNB
• Decision Tree: DecisionTree (optimized version of CART)
• Ensemble method: RandomForest, AdaBoost, GradientBoosting (GBDT)
• Unsupervised learning
• Clustering: KMeans (k-means++, mini-batch), DBSCAN
• Manifold learning(dimension reduction): MDS, Isomap, LocallyLinearEmbedding.
• Algorithm whole list:
http://scikit-learn.org/stable/modules/classes.html
Subset of supported algorithms – we just saw LinearRegression
• … Seen so far
• Ahead …

Logistic (Classification) Regression
• Regression is when our labels y can take any real (continuous) value.
Examples include:
• Predicting stock market.
• Predicting sales.
• Detecting the age of a person from a picture.
• Classification is when our labels y can only take a finite set of values
(categories). Examples include:
• Handwritten digit recognition: x is an image with a handwritten digit, y is a digit between 0 and 9.
• Spam filtering: x is an e-mail, and y is 0 or 1 depending on whether that e-mail is spam or not.
Linear (Regression) vs Logistic (Classification)

Linear (Regression) vs Logistic (Classification)
Classification (finite output values) vs Regression (continuous output values)
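One common way to turn a continuous linear score into a finite output is the logistic (sigmoid) function, which logistic regression is named after. The sketch below assumes a single feature and made-up values for m and b; it is an illustration of the idea, not the scikit-learn implementation.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(x, m, b, threshold=0.5):
    """Linear score m*x + b, squashed to a probability, then cut into 0 or 1."""
    probability = sigmoid(m * x + b)
    return 1 if probability >= threshold else 0


# Hypothetical parameters: x could be, say, a count of suspicious words in an e-mail.
m, b = 1.5, -6.0

labels = [classify(x, m, b) for x in [2, 4, 6]]
print(labels)
```

Linear regression would report the score itself, while the classifier reports only which side of the threshold the score falls on.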

Logistic Regression – with IRIS example
• Categorical output instead of continuous output
• Will use IRIS dataset – to classify 3 species of plants
• Number of Instances: 150 (50 in each of three classes)
• Number of Attributes: 4 numeric, predictive attributes and the class
• Attribute/Feature Information:
• sepal length in cm (will use this)
• sepal width in cm (will use this)
• petal length in cm
• petal width in cm
• Classes i.e. Target:
• Iris-Setosa
• Iris-Versicolour
• Iris-Virginica
IRIS is a database of flower classes… bears a little bit of botany
Setosa Versicolour Virginica
• Petal is the colored part of the flower
• Sepal is the green leaf below the petal

Let’s go code… Try it!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
iris = load_iris()
print("--- Keys ---\n", iris.keys())
print("--- Shape ---\n", iris.data.shape)
print("--- Feature Names ---\n", iris.feature_names)
print("--- Description ---\n", iris.DESCR)
print("--- Target ---\n", iris.target)
iri = pd.DataFrame(iris.data)
print("--- Panda Head ---\n", iri.head())
iri.columns = iris.feature_names
print("--- Panda Columns ---\n", iri.head())
logreg = LogisticRegression(C=1e5) # a large C means weak regularization
X = iris.data[:, :2] # we only take the first two features (sepal length and sepal width)
Y = iris.target
print("--- X ---\n", X)
print("--- y ---\n", Y)
# fit the logistic regression classifier to the data
logreg.fit(X, Y) # again, the infamous fit method
• Preparation
• Important
• Debug

A little bit more… Try it!
# Plotting
h = .02 # step size in the mesh
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
# Prediction: classify every point of the mesh
Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1, figsize=(4, 3))
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)
# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=Y, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())
plt.show()
• Plotting
• Important
• Debug

Classification – output
Two features, thus plotted in 2D plane

Clustering
• Unsupervised learning
• Output unknown
• Grouping observation
K-Means
• One of the most popular "clustering" algorithms.
• Stores k centroids that it uses to define clusters.
• A point is considered to be in a particular cluster if it is closer to that cluster's centroid than to any other centroid.
• Finds the best centroids by alternating between
• assigning data points to clusters based on the current centroids
• choosing centroids (points which are the center of a cluster) based on the current assignment of data points to clusters.
Primitive clustering e.g.
[Figure: input data sorted (34, 43, 49, 58, 70, 81, 89, 101, 116, 121, 131, 145), shown with gap labels 11, 6, 9, 12, 11, 8, 12, 15 and thresholds <=11, <=12, <=15]
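The alternating K-Means steps described above can be sketched in one dimension on the sorted input data from this slide. This is a minimal illustration, assuming k=3 and hand-picked starting centroids (the smallest, a middle, and the largest value); the function name `kmeans_1d` is hypothetical and this is not the scikit-learn implementation.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Alternate between assigning points and re-choosing centroids."""
    centroids = [float(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Step 1: assign each point to the cluster with the nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Step 2: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # nothing moved, so we have converged
            break
        centroids = new_centroids
    return centroids, clusters


data = [34, 43, 49, 58, 70, 81, 89, 101, 116, 121, 131, 145]
centroids, clusters = kmeans_1d(data, centroids=[34, 89, 145])

print(centroids)
print(clusters)
```

On this data the loop settles after a few rounds into three groups of four values each. Different starting centroids can give different groupings, which is why library implementations usually try several initializations.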

Clustering applied on IRIS data
• We used the same IRIS data, as used in logistic regression demo, however
changed two things:
• Added a feature, i.e. three features for clustering, that’s why a 3D plot as output
• Removed the output, to demonstrate unsupervised learning
Three features, thus plotted in 3D plane

Let’s code… Try it!
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D # registers the 3D projection on older matplotlib
from sklearn.cluster import KMeans
from sklearn import datasets
np.random.seed(5)
iris = datasets.load_iris()
X = iris.data # all four features; no use of Y (iris.target) here
est = KMeans(n_clusters=8) # we choose the no. of clusters beforehand; 8 is the default, it can be more or fewer
est.fit(X) # NOTICE! no Y here, “Unsupervised”, Yay!
labels = est.labels_
fig = plt.figure(1, figsize=(4, 3))
ax = fig.add_subplot(projection='3d')
ax.view_init(elev=48, azim=134)
# three of the features are plotted, colored by cluster label
ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=labels.astype(float)) # np.float was removed in NumPy 1.24
ax.xaxis.set_ticklabels([])
ax.yaxis.set_ticklabels([])
ax.zaxis.set_ticklabels([])
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.set_zlabel('Petal length')
plt.show()
• Preparation/Plotting
• Important
• Debug

Lessons learned!
• The dataset on which the model is executed here, is available and well-formatted, which is not the case
always
• Data acquisition and preparation come prior to feature extraction
• Extracting the interesting features, “numerifying” (converting to numbers, if not already) and later
normalizing them, comes prior to running model on it
• Features or data columns can be categorical or inferential variables, or can cause singularity problem;
these affect the performance of data model and hence residual cost
• Selecting the model (linear or logistic) and observing the cost to select appropriate features can also be done in R, where the usual gold-standard threshold for p is ~0.05. Note that even at p = 0.05, the probability of incorrectly rejecting a true null hypothesis is at least 23% (and typically close to 50%)
• Cross validation (CV) is done by splitting the data into training and test sets a few times and measuring the difference between the runs
• The confusion matrix also provides visibility into how many predictions are right and wrong
From real-life Machine Learning
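As a small sketch of what a confusion matrix counts, the snippet below tallies right and wrong predictions for a two-class problem such as spam (1) versus not spam (0). The labels are made-up values for illustration, and the helper name `confusion_matrix` is defined here rather than taken from a library.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count each (actual, predicted) pair."""
    return Counter(zip(actual, predicted))


# Hypothetical true labels and model predictions.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(actual, predicted)
true_pos = cm[(1, 1)]    # spam correctly flagged
true_neg = cm[(0, 0)]    # non-spam correctly passed
false_pos = cm[(0, 1)]   # non-spam wrongly flagged
false_neg = cm[(1, 0)]   # spam that slipped through

accuracy = (true_pos + true_neg) / len(actual)
print(true_pos, true_neg, false_pos, false_neg, accuracy)
```

Looking at the four counts separately shows which kind of mistake the model makes, which a single accuracy number hides.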

Lessons learned!
• If the data is a time series and some data is missing within the time window, then we can apply interpolation or extrapolation. Interpolation works well for archived data, whereas extrapolation suits live data
• Before applying any regression, it so happens that we may have to cluster the data and then apply
regression over it. This would help control outliers, if any, which may impact the model performance.
Outliers are not always noise in the data
• Selection bias happens when we train the model on data, which is not the true representation of the
real occurrences. For instance, dissecting the housing price ordered by ascending, and training over it,
would skip the higher-valued homes. Thus to avoid it, data should be shuffled to achieve even
distribution
• Curse of dimensionality, when challenged with too many features. To deal with it, carefully reduce the
non-significant features including the dependent, categorical or composite features, depending on
where applicable
… Continued
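The interpolation point for time series can be sketched as follows. This assumes evenly spaced readings with gaps marked as None and fills them linearly between the nearest known neighbors; the readings and the function name `interpolate_missing` are hypothetical. Gaps at the very start or end are left alone, since filling those would be extrapolation.

```python
def interpolate_missing(values):
    """Fill None gaps by linear interpolation between the nearest known values."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    # Walk over each pair of neighboring known positions and fill the gap between them.
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


# Hypothetical archived sensor readings with missing samples.
readings = [10.0, None, None, 16.0, 18.0, None, 22.0]
filled = interpolate_missing(readings)
print(filled)
```

This works for archived data because both neighbors of a gap are available; with live data the right-hand neighbor has not arrived yet.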

References
• Stanford’s CS229 by Prof Andrew Y. Ng – Highly recommended!
• https://www.youtube.com/watch?v=UzxYlbK2c7E
• Scikit-Learn tutorial
• http://scikit-learn.org/stable/
• http://scikit-learn.org/stable/install.html
• http://www.shogun-toolbox.org/page/features/
• http://daoudclarke.github.io/machine%20learning%20in%20practice/2013/10/08/machine-learning-libraries/

References
• http://www-bcf.usc.edu/~gareth/ISL/ – Highly Recommended!
• http://bigdataexaminer.com/uncategorized/how-to-run-linear-regression-in-python-scikit-learn/
• http://ipython-books.github.io/featured-04/
• http://stanford.edu/~cpiech/cs221/handouts/kmeans.html
• http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_iris.html
• http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values
Continued…

Thank you!
Talha Obaid
• linkedin.com/in/talhaobaid
• twitter.com/ObaidTal
• github.com/TalhaObaid
• talhaobaid@gmail.com
