Machine Learning
Part 1. Introduction
Classifying images
Boolean: 2000 LOC + dictionary, per language
Probabilistic: 20 LOC + lots of data, all languages (machine learning)
Types of ML
Supervised learning
[Figure: scatter plot of points from two labeled classes, x and o]
Prepare
Install Anaconda: https://conda.io/docs/install/quick.html
Update: conda update conda
Create env: conda create --name <envname> python=3
Switch to env: source activate <envname>
Install libraries: pip install numpy scipy matplotlib ipython scikit-learn pandas pillow
# load breast cancer data
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
# split data into train & test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=66)
# use k-neighbors algorithm to perform classification
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
# predict cancer on test data
clf.predict(X_test)
# check accuracy
clf.score(X_test, y_test)
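The value n_neighbors=3 above is one setting among many. A minimal sketch, assuming the same split as the slide, of comparing training and test accuracy for several values of k. The range 1 to 10 and the `scores` dictionary are illustrative choices, not part of the original slides.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=66)

# Assumption: k from 1 to 10 is an arbitrary range chosen for illustration.
scores = {}
for k in range(1, 11):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for this k.
    scores[k] = (clf.score(X_train, y_train), clf.score(X_test, y_test))

for k, (train_acc, test_acc) in scores.items():
    print(f"k={k:2d}  train={train_acc:.3f}  test={test_acc:.3f}")
```

A large gap between training and test accuracy at small k usually points to overfitting; picking k by test accuracy alone is only a rough guide, and cross-validation would be a more careful option.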
Unsupervised learning
[Figure: scatter plot of unlabeled points (o) on axes X1 and X2]
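# k-means algorithm
from sklearn.cluster import KMeans
km = KMeans(n_clusters=2, random_state=0)
# fit on the training features only (labels are not used) and assign each sample to a cluster
labels_km = km.fit_predict(X_train)
# compare cluster assignments with the true labels
print(labels_km)
print(y_train)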
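Comparing the two printed arrays by eye is hard, and cluster ids are arbitrary (cluster 0 need not correspond to class 0). A minimal sketch, assuming the same breast cancer split as earlier, that scores the agreement with the adjusted Rand index. Using `adjusted_rand_score` and setting `n_init=10` are choices made here for illustration, not something the slides prescribe.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=66)

# Cluster the training features; the labels y_train are never shown to k-means.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels_km = km.fit_predict(X_train)

# The adjusted Rand index ignores how clusters are numbered:
# values near 1 mean the clusters match the classes, values near 0 mean chance level.
ari = adjusted_rand_score(y_train, labels_km)
print(f"adjusted Rand index: {ari:.3f}")
```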
Type of learning?
Algorithm cheat sheet
Data is key
How to prepare it for ML?
Typical tasks
Categorical data -> one-hot-encoding (dummy variable)
Multidimensional data -> scaling
Too many features -> Principal Component Analysis (PCA)
Text -> bag-of-words
One-hot-encoding
# of flights | account          | # of days since join | features
150          | google, facebook | 300                  | gmail_parsed_success
200          | icloud           | 600                  | gmail_parsed_success
1            | live             | 0                    |
3            | google           | 1                    |
One-hot-encoding
account          | has_google | has_facebook | has_icloud | has_live
google, facebook | 1          | 1            | 0          | 0
icloud           | 0          | 0            | 1          | 0
live             | 0          | 0            | 0          | 1
google           | 1          | 0            | 0          | 0
# use pandas; data is a pandas DataFrame with categorical columns
import pandas as pd
data_dummies = pd.get_dummies(data)
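Note that pd.get_dummies treats each distinct string as one category, so "google, facebook" would become a single column of its own. A minimal sketch, assuming the account column holds comma-separated values as in the table, of producing one has_* column per provider with Series.str.get_dummies. The column names `flights` and `days_since_join` are hypothetical names for the slide's columns.

```python
import pandas as pd

# Hypothetical DataFrame mirroring the table on the slide.
data = pd.DataFrame({
    "flights": [150, 200, 1, 3],
    "account": ["google, facebook", "icloud", "live", "google"],
    "days_since_join": [300, 600, 0, 1],
})

# Split each cell on ", " and create one indicator column per distinct value.
account_dummies = data["account"].str.get_dummies(sep=", ").add_prefix("has_")

# Replace the original text column with the indicator columns.
data_dummies = pd.concat([data.drop(columns="account"), account_dummies], axis=1)
print(data_dummies)
```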
Scaling
# min-max scaler: rescales each feature to the range [0, 1]
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X_train)
# scale data
X_train_scaled = scaler.transform(X_train)
print(X_train_scaled)
print(X_train)
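The slide scales only the training data. A minimal sketch, assuming the breast cancer split used earlier, of applying the same fitted scaler to the test set as well. Because the minimum and maximum come from the training set, scaled test values can fall slightly outside [0, 1]; that is expected and not a bug.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=66)

# Learn per-feature min and max from the training data only.
scaler = MinMaxScaler().fit(X_train)

# Apply the same transformation to both sets; never refit on the test set.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("train min/max:", X_train_scaled.min(), X_train_scaled.max())
print("test  min/max:", X_test_scaled.min(), X_test_scaled.max())
```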
PCA (eigenfaces)
# load Labeled Faces in the Wild dataset
from sklearn.datasets import fetch_lfw_people
people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
# display 10 faces
image_shape = people.images[0].shape
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 5, figsize=(15, 8), subplot_kw={'xticks': (), 'yticks': ()})
for target, image, ax in zip(people.target, people.images, axes.ravel()):
    ax.imshow(image)
    ax.set_title(people.target_names[target])
plt.show()
# if the plot isn't displayed, call plt.ion() or set 'backend: TkAgg' in a matplotlibrc file (e.g. in ~/.matplotlib/)
# apply k-neighbors & estimate score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
X_train, X_test, y_train, y_test = train_test_split(
    people.data, people.target, stratify=people.target, random_state=0)
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
# test accuracy without PCA
knn.score(X_test, y_test)
# apply PCA and then KNN
from sklearn.decomposition import PCA
pca = PCA(n_components=100, whiten=True, random_state=0).fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train_pca, y_train)
# test accuracy with PCA
knn.score(X_test_pca, y_test)
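The slide fixes n_components=100 without saying how to pick it. A minimal sketch of one common heuristic: keep enough components to explain a chosen share of the variance. To avoid the large LFW download, this uses scikit-learn's bundled digits dataset as a stand-in, and the 95% threshold is an arbitrary assumption.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in dataset (8x8 digit images, 64 features); not the faces data from the slides.
digits = load_digits()

# Fit PCA with all components so the full variance spectrum is available.
pca = PCA(random_state=0).fit(digits.data)

# Running total of the fraction of variance explained by the first k components.
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest k whose cumulative explained variance reaches the assumed 95% threshold.
n_components_95 = int(np.argmax(cumulative >= 0.95)) + 1
print("components needed for 95% of variance:", n_components_95)
```

The same idea carries over to the faces data by fitting PCA on X_train there and inspecting explained_variance_ratio_.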
# display eigenfaces
fig, axes = plt.subplots(3, 5, figsize=(15, 12), subplot_kw={'xticks': (), 'yticks': ()})
for i, (component, ax) in enumerate(zip(pca.components_, axes.ravel())):
    ax.imshow(component.reshape(image_shape), cmap='viridis')
    ax.set_title("{}. component".format(i + 1))
plt.show()
Eigenfaces
Bag-of-words
# vectorize
sentence = ["Hello world"]
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
vect.fit(sentence)
# print vocabulary (maps each lowercased word to its column index)
vect.vocabulary_
# apply bag-of-words to sentence
bag_of_words = vect.transform(sentence)
bag_of_words.toarray()
What's next?
Exercises
Predict user purchase (User, UserInfo, UserSessionAction)
Find clusters of users (User, UserInfo, UserSessionAction)
Determine whether there is free wifi at the airport (Tip)
Predict CBP wait times at the airport (regression)
Others?
Useful
CSV read: pandas.read_csv
Working with images as numpy arrays: scikit-image
scikit-learn.org
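A minimal sketch of the pandas.read_csv step, assuming a CSV shaped like the one-hot-encoding table from earlier. The column names and the in-memory text are hypothetical; with a real file, pass its path to read_csv instead of the StringIO object.

```python
import io

import pandas as pd

# Hypothetical CSV content; a quoted field may contain a comma.
csv_text = """flights,account,days_since_join
150,"google, facebook",300
200,icloud,600
1,live,0
3,google,1
"""

# read_csv accepts a path or any file-like object.
data = pd.read_csv(io.StringIO(csv_text))

# Numeric columns can be handed to scikit-learn as a numpy array.
X = data[["flights", "days_since_join"]].to_numpy()

print(data.dtypes)
print(X.shape)
```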
