Introduction to Machine Learning with Python and scikit-learn
Matt Hagy
PyATL talk about machine learning. Provides both an introduction to machine learning and a guide to doing it with Python. Includes simple examples with code and results.
4. Machine Learning = learning models from data
Which advert is the user most likely to click on?
Who’s most likely to win this election?
Which wells are most likely to fail in the next 6 months?
6. Machine Learning Process
● Get data
● Select a model
● Select hyperparameters for that model
● Fit model to data
● Validate model (and change model, if necessary)
● Use the model to predict values for new data (all six steps are sketched in code below)
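A minimal sketch of those six steps end to end (the toy data and LinearRegression are placeholder choices, not part of the original deck; any scikit-learn estimator follows the same pattern):
import numpy as np
from sklearn.linear_model import LinearRegression
X = np.arange(10).reshape(-1, 1)               # get data: 10 samples, 1 feature
y = 2 * X.ravel() + 1                          # the answers we want the model to learn
model = LinearRegression(fit_intercept=True)   # select a model and its hyperparameters
model.fit(X, y)                                # fit model to data
print(model.score(X, y))                       # validate (R^2; a real check would use held-out data, see later)
print(model.predict([[25]]))                   # use the model to predict a value for new data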
12. Linear Regression: First, get your data
import numpy as np
import pandas as pd
gen = np.random.RandomState(42)
num_samples = 40
x = 10 * gen.rand(num_samples)            # 40 points, uniform on [0, 10)
y = 3 * x + 7 + gen.randn(num_samples)    # around y = 3x + 7, with standard-normal noise
X = pd.DataFrame(x)                       # sklearn expects a 2-D feature matrix
%matplotlib inline
import matplotlib.pyplot as plt
plt.scatter(x, y)
13. Linear Regression: Fit model to data
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept=True)   # fit_intercept lets the line miss (0, 0)
model.fit(X, y)
print('Slope: {}, Intercept: {}'.format(model.coef_, model.intercept_))   # should land near 3 and 7
14. Linear Regression: Check your model
Xtest = pd.DataFrame(np.linspace(-1, 11))   # a grid spanning (and slightly beyond) the data
predicted = model.predict(Xtest)
plt.scatter(x, y)                           # the original points
plt.plot(Xtest, predicted)                  # the fitted line
18. Classification: First, get your data
import numpy as np
from sklearn import datasets
iris = datasets.load_iris()   # 150 flowers, 4 features each, 3 species
X = iris.data
Y = iris.target
19. Classification: Split your data
ntest = 10
np.random.seed(0)
indices = np.random.permutation(len(X))   # shuffle, then hold out the last ntest points
iris_X_train = X[indices[:-ntest]]
iris_Y_train = Y[indices[:-ntest]]
iris_X_test = X[indices[-ntest:]]
iris_Y_test = Y[indices[-ntest:]]
20. Classifier: Fit model to data
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski')   # vote among the 5 nearest points
knn.fit(iris_X_train, iris_Y_train)
21. Classifier: Check your model
predicted_classes = knn.predict(iris_X_test)    # classify the held-out points
print('kNN predicted classes: {}'.format(predicted_classes))
print('Real classes: {}'.format(iris_Y_test))   # compare against the true labels
29. Recap: Choosing an Algorithm
Have: data and expected outputs
Want numbers? Try regression algorithms
Want classes? Try classification algorithms
Have: just data
Want to find structure? Try clustering algorithms
Want to look at it? Try dimensionality reduction (both of these are sketched below)
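The slides above cover regression and classification in code; for the last two cases, a minimal sketch on the same Iris data (KMeans and PCA with illustrative cluster/component counts, not choices from the original deck):
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
X = datasets.load_iris().data
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)   # just data: look for 3 clusters
labels = kmeans.fit_predict(X)
pca = PCA(n_components=2)            # just data: squash 4 features down to 2 for plotting
X_2d = pca.fit_transform(X)
print(labels[:10], X_2d.shape)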
31. How well does the model fit new data?
“Holdout sets”:
split your data into training and test sets
learn your model with the training set
get a validation score for your test set (see the train_test_split sketch below)
Models are rarely perfect… you might have to change the parameters or the model
● underfitting: model not complex enough to fit the training data
● overfitting: model too complex; fits the training data well but does badly on the test set
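A minimal sketch of a holdout split using scikit-learn's train_test_split (the 80/20 split and the kNN model are illustrative choices, not from the original slides):
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)   # hold out 20% as a test set
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)                  # learn from the training set only
print(knn.score(X_test, y_test))           # validation score on unseen data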
34. Test Metrics
Precision:
of all the results the classifier marked “true”, how many actually were “true”?
Precision = tp / (tp + fp)
Recall:
of all the things that really were “true”, how many did the classifier mark as “true”?
Recall = tp / (tp + fn)
F1 score:
harmonic mean of precision and recall
F1_score = 2 * precision * recall / (precision + recall)
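scikit-learn can compute all three via sklearn.metrics; a quick sketch with invented true/predicted labels:
from sklearn.metrics import precision_score, recall_score, f1_score
y_true = [1, 0, 1, 1, 0, 1, 0, 0]       # invented ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]       # invented classifier output
print(precision_score(y_true, y_pred))  # tp / (tp + fp)
print(recall_score(y_true, y_pred))     # tp / (tp + fn)
print(f1_score(y_true, y_pred))         # harmonic mean of the two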
37. Explore some algorithms
Notebooks 6.x contain examples of machine learning algorithms. Run them, play with the numbers in them, break them, think about why they might have broken.
Editor's Notes
What you’re learning isn’t the data, but a model that will help you understand (and possibly also explain) it.
We bother making models because we want to start asking questions, and (hopefully) making changes in our world.
Image from http://www.rosebt.com/blog/descriptive-diagnostic-predictive-prescriptive-analytics
AKA import-instantiate-fit-predict
Hyperparameter: things like “how many clusters of data do I think there are in this dataset?”
Lots of great tutorials on http://scikit-learn.org/stable/
You import from this library, which is called “sklearn” in Python code.
Iris image from Nociveglia https://www.flickr.com/photos/40385177@N07/.
Supervised versus unsupervised learning:
supervised = give the algorithm both input data and the answers for that data (kinda like teaching), and it learns the connection between data and answers;
unsupervised = give the algorithm just the data, and it finds the structure in that data
Semi-supervised learning (where you only have a few answers) does exist, but isn’t talked about much. There’s also reinforcement learning, where you know if a result is better or worse, but not how much it’s better or worse.
Fit a line to a set of datapoints. Use that line to predict new values.
This will give you 40 random samples around the line y = 3x + 7.
RandomState.rand samples from a uniform distribution; RandomState.randn samples from a standard normal distribution.
Note the hyperparameter (fit_intercept). Setting it to True lets the model learn an intercept, rather than forcing the line through (0, 0).
1-feature linear regression on the Diabetes dataset. This is where you need to change your model. In this case, you’d start by trying more features, then adapting the model hyperparameters (e.g. it might not be a straight line that you need to fit) or the model that you use (e.g. linear regression might not be the best model type to use on this dataset).
When there are just two classes, it’s called binary classification.
Classification: finding the link between data and classes.
This is the Iris dataset. It’s one of Scikit-learn’s example datasets.
Why do we split into training and test sets? This is called a “holdout” set… we save some of our data, so we can use it to check how well our classifier does on data it hasn’t seen before.
print('{} training points, {} test points'.format(len(iris_X_train), len(iris_X_test)))
This is the k nearest neighbours algorithm. For every new datapoint, it looks at the k nearest datapoints it has classifications for, and assigns the new datapoint the class that’s most common amongst them.
Here, we’re using 5 neighbours. We’re also using the Minkowski distance (https://machinelearning1.wordpress.com/2013/03/25/three-famous-metrics-manhattan-euclidean-minkowski/) : this tells the algorithm how to compute the distance between two points, so we can define which points are ‘closest’. Common distance metrics you’ll see in machine learning include:
Manhattan, or “city block” distance: add the distance along the x axis to the distance along the y axis (“city block” because that’s how you navigate in Manhattan)
Euclidean distance: calculate the straight-line distance between the two points (e.g. sqrt(x^2 + y^2))
Minkowski distance: a generalization of both, with an exponent parameter p (p=1 gives Manhattan, p=2 gives Euclidean)
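A quick numeric sketch of the three metrics using scipy.spatial.distance (the two points are arbitrary):
from scipy.spatial import distance
a, b = [0, 0], [3, 4]
print(distance.cityblock(a, b))       # Manhattan: |3| + |4| = 7
print(distance.euclidean(a, b))       # Euclidean: sqrt(3^2 + 4^2) = 5
print(distance.minkowski(a, b, p=3))  # Minkowski, p=3: (3^3 + 4^3)^(1/3)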
This is the digits example dataset.
This is all in notebook 6.5
There’s no “best” algorithm for every problem. This is also known as the “no free lunch” theorem.
If you have data and an estimate of better/worse: reinforcement learning
There are lots of variants on these algorithms: the Scikit-learn cheat sheet will help you choose between them: http://scikit-learn.org/stable/tutorial/machine_learning_map/
Overfitting: matches the training data well, performs badly on new data… has high variance
Underfitting: doesn’t match the training data well, and usually does no better on new data… has high bias
Bias/ Variance tradeoff: adjust your hyperparameters until the model performs well on the test data.
See e.g. http://scott.fortmann-roe.com/docs/BiasVariance.html
This is all about your parameters, e.g. the difference between fitting a straight line, a quadratic curve, or an nth-degree polynomial.
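An illustrative sketch of that tradeoff (the noisy quadratic data and the degrees 1/2/15 are invented for this example): fit polynomials of increasing degree and compare train/test scores:
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
rng = np.random.RandomState(0)
x = rng.uniform(-3, 3, 60)
y = x ** 2 + rng.randn(60)    # noisy quadratic data
X = x.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
for degree in (1, 2, 15):     # underfit, about right, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree, model.score(X_train, y_train), model.score(X_test, y_test))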
Figures from Jake VanderPlas’ Python Data Science Handbook.
We’ll talk about the bias-variance tradeoff later.
False positive is also known as a “type 1 error”; false negative is also known as a “type 2 error”.
These numbers are always between 0 and 1. If you want to play with F1, try it in Python, e.g.:
import numpy as np
p = np.array([.25, .25, .125, .5, .75])   # example precision values
r = np.array([.001, .10, .7, .9, .3])     # example recall values
2 * p * r / (p + r)                       # element-wise F1 scores
Support: how many datapoints that are actually in this class were used to calculate these metrics?
Precision: of all the results the classifier marked “true”, how many actually were “true”?
Recall: how many of the things that were really “true” were marked as “true” by the classifier?
F1: harmonic mean of precision and recall