Introduction to Machine Learning with Python and scikit-learn

Python Atlanta
Nov. 14th 2013
Matt Hagy
matt@liveramp.com

Machine Learning (ML):
• Finding patterns in data
• Modeling patterns
• Using models to make predictions

Slide #2

Intro to Machine Learning with Python


ML can be easy*
• You already have ML applications!

• You can start applying ML methods
now with Python & scikit-learn
• Theoretical knowledge of ML not
needed (initially)*
*Gaining more background, theory, and
experience will help
Slide #3



Simple Example

Slide #4



Simple Model

Slide #5



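import numpy as np
from sklearn.linear_model import LinearRegression

# np.load on an .npz file returns an archive keyed by array name.
# Assumption: data.npz stores its arrays under the keys 'x' and 'y'.
with np.load('data.npz') as data:
    x, y = data['x'], data['y']
x_test = np.linspace(0, 200)

model = LinearRegression()
# scikit-learn expects 2-D inputs of shape (n_samples, n_features).
model.fit(x[:, np.newaxis], y)
y_test = model.predict(x_test[:, np.newaxis])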

Slide #6
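The snippet on this slide depends on data.npz, which is not part of the deck. A minimal self-contained sketch of the same fit, assuming synthetic data (a noisy line whose slope, intercept, and noise level are made up for illustration) in place of the original file:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Assumption: synthetic stand-in for data.npz (noisy line, y = 3x + 10 + noise).
rng = np.random.RandomState(0)
x = rng.uniform(0, 200, size=100)
y = 3.0 * x + 10.0 + rng.normal(scale=5.0, size=x.shape)

# scikit-learn expects a 2-D input of shape (n_samples, n_features).
X = x[:, np.newaxis]

model = LinearRegression()
model.fit(X, y)

# Predict on an evenly spaced grid (np.linspace defaults to 50 points).
x_test = np.linspace(0, 200)
y_test = model.predict(x_test[:, np.newaxis])

# The fitted slope and intercept should land close to the values used above.
print(model.coef_[0], model.intercept_)
```

The fitted parameters live in model.coef_ and model.intercept_, so the learned line can be inspected directly.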



Slide #7



Variance/Bias Trade-Off
• Need models that can adapt to
relationships in our data
• Highly adaptable models can overfit
and then fail to generalize to new data
• Regularization – Common strategy to
address the variance/bias trade-off
Slide #8
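One way to see the trade-off is to fit a deliberately flexible model at several regularization strengths. The sketch below is an illustration, not the model from the talk: it assumes synthetic data (a noisy sine curve) and uses ridge regression on degree-12 polynomial features, where alpha is Ridge's regularization strength.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Assumption: synthetic data, a noisy sine curve sampled at 15 points.
rng = np.random.RandomState(0)
x_train = rng.uniform(0, 1, size=15)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.2, size=15)
x_new = np.linspace(0, 1, 200)
y_new = np.sin(2 * np.pi * x_new)  # noise-free curve used to check generalization


def fit_and_score(alpha):
    """Fit a degree-12 polynomial ridge model; return (train MSE, new-data MSE)."""
    model = make_pipeline(
        PolynomialFeatures(degree=12, include_bias=False),
        StandardScaler(),
        Ridge(alpha=alpha),
    )
    model.fit(x_train[:, np.newaxis], y_train)
    train_mse = np.mean((model.predict(x_train[:, np.newaxis]) - y_train) ** 2)
    new_mse = np.mean((model.predict(x_new[:, np.newaxis]) - y_new) ** 2)
    return train_mse, new_mse


# Larger alpha means stronger regularization, i.e. a less adaptable model.
scores = {alpha: fit_and_score(alpha) for alpha in (1e-8, 1e-2, 1e2)}
for alpha, (train_mse, new_mse) in scores.items():
    print(f"alpha={alpha:g}  train MSE={train_mse:.3f}  new-data MSE={new_mse:.3f}")
```

Training error can only grow as alpha increases, while error on new data is typically lowest at some intermediate value. That intermediate value is what tuning the regularization term is looking for.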



Slide #9



import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: data.npz stores its arrays under the keys 'x' and 'y'.
with np.load('data.npz') as data:
    x, y = data['x'], data['y']
x_test = np.linspace(0, 200)

model = Pipeline([
    ('standardize', StandardScaler()),
    # C is the regularization term; a smaller C regularizes more strongly.
    ('svr', SVR(kernel='rbf', verbose=0, C=5e6, epsilon=20)),
])
model.fit(x[:, np.newaxis], y)
y_test = model.predict(x_test[:, np.newaxis])
Slide #10
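To show what the regularization term does in this pipeline, here is a hedged, self-contained sketch. It assumes synthetic data (a noisy sine wave, invented for illustration) instead of data.npz, and compares the slide's C=5e6 with a much smaller C. In scikit-learn's SVR, regularization strength is inversely proportional to C.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Assumption: synthetic stand-in for data.npz (a noisy sine wave on [0, 200]).
rng = np.random.RandomState(0)
x = np.sort(rng.uniform(0, 200, size=80))
y = 100.0 * np.sin(x / 30.0) + rng.normal(scale=5.0, size=x.shape)
x_test = np.linspace(0, 200)


def fit_predict(C):
    """Standardize the input, fit an RBF-kernel SVR, and predict on x_test."""
    model = Pipeline([
        ('standardize', StandardScaler()),
        ('svr', SVR(kernel='rbf', C=C, epsilon=20)),
    ])
    model.fit(x[:, np.newaxis], y)
    return model.predict(x_test[:, np.newaxis])


y_weak = fit_predict(C=5e6)     # value from the slide: weak regularization
y_strong = fit_predict(C=1e-2)  # much smaller C: strong regularization

# np.ptp is the peak-to-peak range (max - min) of each predicted curve.
print(np.ptp(y_weak), np.ptp(y_strong))
```

With the small C the predicted curve is nearly flat, because each training point can contribute very little to the fit. With the large C the model is free to follow the data.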



Supervised Learning
Modeling relationship between inputs and outputs

Each row is one sample:

Input, X    Output, Y
0           1
3           6
1           3
3           7
4           9
2           3
9           17
3           6
4           7

Slide #11
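As a rough illustration, the numbers on this slide can be fed straight into a model. The sketch assumes the first column of values is the input X and the second is the output Y, one sample per row. The choice of LinearRegression is an assumption for the example, not something the slide specifies.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Values from the slide, read as one (input, output) pair per sample.
x = np.array([0, 3, 1, 3, 4, 2, 9, 3, 4])
y = np.array([1, 6, 3, 7, 9, 3, 17, 6, 7])

# One feature per sample, so X has shape (n_samples, 1).
X = x[:, np.newaxis]
print(X.shape, y.shape)  # (9, 1) (9,)

model = LinearRegression().fit(X, y)

# Predict the output for an input value that is not in the table.
print(model.predict([[5]]))
```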



Multiple Inputs

Each row is one sample:

Input, X                  Output, Y
X1   X2   X3   …   Xn
0    2    1    …   4      1
3    3    0    …   7      6
1    1    3    …   0      3
3    6    1    …   2      7
4    8    2    …   9      9
2    9    7    …   1      3
9    1    5    …   3      17
3    2    4    …   2      6
4    3    2    …   1      7

Slide #12
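With several inputs, scikit-learn expects a 2-D array with one row per sample and one column per input. A small sketch, assuming the four columns of input values visible on the slide are X1, X2, X3, and Xn, and the last column is Y (the elided columns are simply left out):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns as they appear on the slide; the elided inputs are not reconstructed.
x1 = [0, 3, 1, 3, 4, 2, 9, 3, 4]
x2 = [2, 3, 1, 6, 8, 9, 1, 2, 3]
x3 = [1, 0, 3, 1, 2, 7, 5, 4, 2]
xn = [4, 7, 0, 2, 9, 1, 3, 2, 1]
y = np.array([1, 6, 3, 7, 9, 3, 17, 6, 7])

# Stack the inputs side by side: shape (n_samples, n_features).
X = np.column_stack([x1, x2, x3, xn])
print(X.shape)  # (9, 4)

# LinearRegression is an illustrative choice; it learns one coefficient per input.
model = LinearRegression().fit(X, y)
print(model.coef_)
```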



Example: Image Classification
• Classify
handwritten digits
with ML models
• Each input is an
entire image
• Output is digit in
the image
Slide #13



[Figure: handwritten digit images (Input, X) paired with their labels (Output, Y): 9 and 2]
Slide #14



import numpy as np
from sklearn.ensemble import RandomForestClassifier

with np.load('train.npz') as data:
    pixels_train = data['pixels']
    labels_train = data['labels']
with np.load('test.npz') as data:
    pixels_test = data['pixels']

# Flatten each image into a 1-D feature vector.
X_train = pixels_train.reshape(pixels_train.shape[0], -1)
X_test = pixels_test.reshape(pixels_test.shape[0], -1)

model = RandomForestClassifier(n_estimators=50)
model.fit(X_train, labels_train)
labels_test = model.predict(X_test)
Slide #15
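The train.npz and test.npz files are not included with the deck. A runnable variant, assuming the small 8x8 digits dataset bundled with scikit-learn as a stand-in (it is not the dataset used in the talk) and a recent scikit-learn that provides sklearn.model_selection:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled digits dataset replaces train.npz/test.npz.
digits = load_digits()
pixels, labels = digits.images, digits.target  # shapes (1797, 8, 8) and (1797,)

# Flatten each 8x8 image into a 64-element feature vector.
X = pixels.reshape(pixels.shape[0], -1)

# Hold out a quarter of the samples to estimate accuracy on unseen images.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# score() returns the fraction of held-out images classified correctly.
accuracy = model.score(X_test, y_test)
print(accuracy)
```

Holding out labeled samples gives an accuracy estimate, which the slide's version cannot provide because its test set has no labels.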



Predicting the tags of Stack Overflow
questions with machine learning
Kaggle Data Science Competition
• Given 6 million
training questions
labeled with tags
• Predict the tags for
2 million unlabeled
test questions
www.users.globalnet.co.uk/~slocks/instructions.html
stackoverflow.com/questions/895371/bubble-sort-homework

Slide #16



Text Classification Overview
Raw Posts → Feature Extraction & Selection → Vector Space → Model Selection & Training → Machine Learning Model
Slide #17
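The stages above map naturally onto a scikit-learn Pipeline. The following is only a sketch under stated assumptions: the posts and tags are invented toy data, and CountVectorizer with one-vs-rest logistic regression is an illustrative choice, not necessarily the approach used for the competition.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical raw posts and tags, invented for illustration.
titles = [
    "Need help with Java homework on sorting",
    "Java array index out of bounds",
    "Python list comprehension question",
    "Sorting a Python list of tuples",
    "Why is my Java sort slow",
    "Python array slicing help",
]
tags = [
    ["java", "sorting"],
    ["java", "arrays"],
    ["python"],
    ["python", "sorting"],
    ["java", "sorting"],
    ["python", "arrays"],
]

# A question can carry several tags, so encode them as one 0/1 column per tag.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)

model = Pipeline([
    # Feature extraction: raw text -> term-count vectors (the vector space).
    ("counts", CountVectorizer(stop_words="english")),
    # Model training: one binary classifier per tag.
    ("clf", OneVsRestClassifier(LogisticRegression())),
])
model.fit(titles, Y)

predicted = model.predict(["Java sorting homework"])
print(mlb.inverse_transform(predicted))
```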


Term Frequency Feature Extraction
Characterize text by the frequency of specific words in each text entry

Example Title:
“Why is processing a sorted array faster than processing an array that is not sorted?”

Term Frequencies:
why   processing   sorted   array   faster
1     2            2        2       1

Ignore common words (i.e. stop words)
Slide #18
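The counting step can be sketched in a few lines of plain Python. The stop word list below is an assumption, hand-picked so that the result matches the frequencies on this slide. Real stop word lists are much longer.

```python
import re
from collections import Counter

# Assumption: a tiny hand-picked stop word list for this one example.
STOP_WORDS = {"is", "a", "than", "an", "that", "this", "not"}


def term_frequencies(text):
    """Lowercase the text, split it into words, and count the non-stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(word for word in words if word not in STOP_WORDS)


title = ("Why is processing a sorted array faster than "
         "processing an array that is not sorted?")
counts = term_frequencies(title)
print(dict(counts))
# {'why': 1, 'processing': 2, 'sorted': 2, 'array': 2, 'faster': 1}
```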



          why   processing   sorted   array   faster   need   help   java   homework
Title 1   1     2            2        2       1        0      0      0      0
Title 2   0     0            0        0       0        1      1      1      1
Title 3   0     0            1        1       0        0      1      0      1

Frequency of key terms is anticipated to be
correlated with the tags of the question

Slide #19
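A matrix like this one can be produced with scikit-learn's CountVectorizer. In the sketch below only the first title comes from the talk. The second and third titles are hypothetical, written so that their counts match rows 2 and 3, and the vocabulary is fixed to the nine terms on the slide.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Column order follows the slide.
vocabulary = ["why", "processing", "sorted", "array", "faster",
              "need", "help", "java", "homework"]

titles = [
    "Why is processing a sorted array faster than "
    "processing an array that is not sorted?",
    "Need help with Java homework",   # hypothetical title for row 2
    "Homework help: sorted array",    # hypothetical title for row 3
]

# With a fixed vocabulary, every other word (including stop words) is ignored.
vectorizer = CountVectorizer(vocabulary=vocabulary)
matrix = vectorizer.fit_transform(titles).toarray()
print(matrix)
# [[1 2 2 2 1 0 0 0 0]
#  [0 0 0 0 0 1 1 1 1]
#  [0 0 1 1 0 0 1 0 1]]
```

Each row is one sample and each column one input, so the matrix can be passed directly to a model's fit method.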



Example Model Coefficients

Slide #22
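The figure for this slide did not survive extraction. As a hedged illustration of what model coefficients look like for term-frequency features, the sketch below trains a logistic regression on a hypothetical toy corpus to predict a single "java" tag, then lists the learned coefficient for each term. The data and the choice of model are assumptions, not taken from the talk.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus: the first three titles carry the "java" tag.
titles = [
    "java homework", "java array", "java sorted array",
    "python homework", "python array", "python sorted array",
]
has_java_tag = np.array([1, 1, 1, 0, 0, 0])

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(titles)

model = LogisticRegression()
model.fit(X, has_java_tag)

# Recover the terms in column order from the fitted vocabulary mapping.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
coefficients = dict(zip(terms, model.coef_[0]))

# Print terms from most positive to most negative coefficient.
for term, weight in sorted(coefficients.items(), key=lambda item: -item[1]):
    print(f"{term:10s} {weight:+.3f}")
```

Terms with large positive coefficients push the prediction toward the tag, terms with large negative coefficients push away from it, and terms that appear equally often with and without the tag end up near zero.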



ML can be easy*
• You already have ML problems!
• You can start applying ML methods now
with Python & scikit-learn
• Theoretical knowledge of ML not needed
(initially)*
scikit-learn.org

github.com/scikit-learn
Slide #24



Helping companies use their marketing data to delight customers

Opportunities
• Backend Engineers
• Data Scientists
• Full-Stack Engineers

Tools
• Java
• Hadoop (Map/Reduce)
• Ruby

Build and work with large distributed systems that
process massive data sets.
Check out: liveramp.com/careers
Slide #25

