Aly Osama
Machine Learning For Everyone
Teaching Assistant, Ain Shams University
 Computational biology and deep learning
Former Research Software Development Engineer, Microsoft Research (ATLC)
 Speech Recognition Team “Arabic Models”
 Natural Language Processing Team “Virtual Bot”
ABOUT ME
aly.osama@eng.asu.edu.eg
AGENDA
1. Introduction
2. Machine learning tools
 Scikit-learn
3. Case Study applications
 Computer vision
 Natural language processing
 Speech Recognition
4. Learning resources
5. Next step
INTRODUCTION
https://docs.google.com/presentation/d/1kSuQyW5DTnkVaZEjGYCkfOxvzCqGEFzWBy4e9Uedd9k/edit
BREAK
Quick Recap
MACHINE LEARNING
It is hard for people to explicitly write the 'rules' for making decisions.
The solution depends on lots of complex cases.
We don't have the expertise to fully write 'the rules', but we have lots of examples.
Learning from 'examples'
HOW TO LEARN?
Nearest neighbor
LEARNING THROUGH LINEAR SEPARATION
HOW TO LEARN?
Regression
MODEL QUALITY
OVERFITTING PROBLEM
NEURAL NETWORKS
GRADIENT DESCENT
NEURAL NETWORK TRAINING
DEEP LEARNING AS A BLACK BOX
INTERPRETABILITY OF DEEP LEARNING
DEEP LEARNING REQUIRES LARGER TRAINING SETS
RESOURCES
Resources for learning Python
• Codecademy's Python course: browser-based, tons of exercises
• DataQuest: browser-based, teaches Python in the context of data science
• Google's Python class: slightly more advanced, includes videos and downloadable exercises (with solutions)
• Python for Informatics: beginner-oriented book, includes slides and videos
MACHINE LEARNING TOOLS: Scikit-learn
SCIKIT-LEARN
Benefits
• Consistent interface to machine learning models
• Provides many tuning parameters but with sensible defaults
• Exceptional documentation
• Rich set of functionality for companion tasks
• Active community for development and support
Drawbacks
• Harder (than R) to get started with machine learning
• Less emphasis (than R) on model interpretability
GETTING STARTED IN SCIKIT-LEARN WITH THE FAMOUS IRIS DATASET
• 50 samples of each of 3 different species of iris (150 samples total)
• Measurements: sepal length, sepal width, petal length, petal width
Machine learning on the iris dataset
• Framed as a supervised learning problem: predict the species of an iris using the measurements.
• Famous dataset for machine learning because prediction is easy
• Learn more about the iris dataset: UCI Machine Learning Repository
LOADING THE IRIS DATASET INTO SCIKIT-LEARN
Machine learning terminology
• Each row is an observation (also known as: sample, example, instance, record)
• Each column is a feature (also known as: predictor, attribute, independent variable, input, regressor, covariate)
LOADING THE IRIS DATASET INTO SCIKIT-LEARN
Machine learning terminology
• Each value we are predicting is the response (also known as: target, outcome, label, dependent variable)
• Classification is supervised learning in which the response is categorical
• Regression is supervised learning in which the response is ordered and continuous
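To anchor this terminology, a minimal sketch of loading the dataset (the slide's original code is a screenshot and is not reproduced here; this uses scikit-learn's standard loader):

from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)      # (150, 4): 150 observations (rows), 4 features (columns)
print(iris.feature_names)   # sepal length, sepal width, petal length, petal width
print(iris.target[:5])      # the response: species encoded as 0, 1, 2
print(iris.target_names)    # ['setosa' 'versicolor' 'virginica']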
REQUIREMENTS FOR WORKING WITH DATA IN SCIKIT-LEARN
• Features and response are separate objects
• Features and response should be numeric
• Features and response should be NumPy arrays
• Features and response should have specific shapes
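All four requirements can be checked directly on the iris data; a short sketch:

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data      # features: one object
y = iris.target    # response: a separate object
print(type(X), type(y))   # both numpy.ndarray, both numeric
print(X.shape, y.shape)   # (150, 4) and (150,): X is (n_samples, n_features), y is (n_samples,)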
TRAINING A MACHINE LEARNING MODEL WITH SCIKIT-LEARN
K-nearest neighbors (KNN) classification
• Pick a value for K.
• Search for the K observations in the training data that are "nearest" to the measurements of the unknown iris.
• Use the most popular response value from the K nearest neighbors as the predicted response value for the unknown iris (sketched below).
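A from-scratch sketch of these three steps, purely for illustration (this is not scikit-learn's implementation; the scikit-learn version follows in the modeling pattern below). Euclidean distance and a plurality vote are the usual choices:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    # X_train and y_train are NumPy arrays; k was picked by the caller (step 1)
    # Step 2: distance from x_new to every training observation
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]   # indices of the K nearest neighbors
    # Step 3: most popular response value among the K nearest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]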
LOADING THE DATA
SCIKIT-LEARN 4-STEP MODELING PATTERN
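The pattern from the linked notebooks is, roughly: (1) import the model class, (2) instantiate the estimator with its tuning parameters, (3) fit the model to the data, (4) predict the response for a new observation. A minimal sketch on the iris data (the sample measurements are illustrative):

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier   # Step 1: import the class

X, y = load_iris(return_X_y=True)

knn = KNeighborsClassifier(n_neighbors=1)   # Step 2: instantiate the estimator
knn.fit(X, y)                               # Step 3: fit the model with data
print(knn.predict([[3, 5, 4, 2]]))          # Step 4: predict for a new observation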
USING A DIFFERENT VALUE FOR K
USING A DIFFERENT CLASSIFICATION MODEL
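Both variations change only the instantiation step of the pattern above; a brief sketch:

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# A different value for K: only step 2 (instantiation) changes
knn5 = KNeighborsClassifier(n_neighbors=5)
knn5.fit(X, y)
print(knn5.predict([[3, 5, 4, 2]]))

# A different classification model: the fit/predict interface stays the same
logreg = LogisticRegression()
logreg.fit(X, y)
print(logreg.predict([[3, 5, 4, 2]]))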
COMPARING MACHINE LEARNING MODELS IN SCIKIT-LEARN
• Classification task: Predicting the species of an unknown iris
• Used three classification models: KNN (K=1), KNN (K=5), logistic regression
• Need a way to choose between the models
Solution:
Model evaluation procedures:
1. Evaluation procedure #1: Train and test on the entire dataset
2. Evaluation procedure #2: Train/test split
https://github.com/justmarkham/scikit-learn-videos/blob/master/05_model_evaluation.ipynb
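A minimal sketch of evaluation procedure #2 (the split fraction and random seed here are illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Hold out part of the data so training and testing use different observations
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=4)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(accuracy_score(y_test, knn.predict(X_test)))   # testing accuracy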
DATA SCIENCE PIPELINE: PANDAS, SEABORN, SCIKIT-LEARN
Types of supervised learning
• Classification: Predict a categorical response
• Regression: Predict a continuous response
DATA SCIENCE PIPELINE: PANDAS, SEABORN, SCIKIT-LEARN
Reading data using pandas
Pandas: popular Python library for data exploration, manipulation, and analysis
Visualizing data using seaborn
Seaborn: Python library for statistical data visualization built on top of Matplotlib
https://github.com/justmarkham/scikit-learn-videos/blob/master/06_linear_regression.ipynb
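A sketch of that pipeline, assuming an advertising-style CSV like the one used in the linked notebook (the file name and column names here are illustrative, not prescribed by the slides):

import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression

# Read tabular data into a DataFrame
data = pd.read_csv('Advertising.csv')

# Visualize the relationship between each feature and the response
sns.pairplot(data, x_vars=['TV', 'radio', 'newspaper'], y_vars='sales', kind='reg')

# Fit a linear regression: the response is continuous, so this is a regression task
X = data[['TV', 'radio', 'newspaper']]
y = data['sales']
linreg = LinearRegression().fit(X, y)
print(linreg.intercept_, linreg.coef_)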
CROSS-VALIDATION FOR PARAMETER TUNING, MODEL SELECTION, AND FEATURE SELECTION
Review of model evaluation procedures
Motivation: Need a way to choose between machine learning models
Goal is to estimate likely performance of a model on out-of-sample data
Initial idea: Train and test on the same data
But, maximizing training accuracy rewards overly complex models which overfit the training data
CROSS-VALIDATION FOR PARAMETER TUNING, MODEL SELECTION, AND FEATURE SELECTION
Alternative idea: Train/test split
Split the dataset into two pieces, so that the model can be trained and tested on different data
Testing accuracy is a better estimate than training accuracy of out-of-sample performance
But, it provides a high variance estimate, since changing which observations happen to be in the testing set can significantly change testing accuracy
https://github.com/justmarkham/scikit-learn-videos/blob/master/07_cross_validation.ipynb
EFFICIENTLY SEARCHING FOR OPTIMAL TUNING PARAMETERS
Review of K-fold cross-validation
• Steps for cross-validation:
• Dataset is split into K "folds" of equal size
• Each fold acts as the testing set 1 time, and acts as the training set K-1 times
• Average testing performance is used as the estimate of out-of-sample performance
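A minimal sketch of these steps with scikit-learn's cross_val_score (10 folds is the common choice in the linked notebook):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)

# 10 folds: each fold is the testing set once and part of the training set 9 times
scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
print(scores.mean())   # average testing accuracy as the out-of-sample estimate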
EFFICIENTLY SEARCHING FOR OPTIMAL TUNING PARAMETERS
Benefits of cross-validation:
• More reliable estimate of out-of-sample performance than train/test split
• Can be used for selecting tuning parameters, choosing between models, and selecting features
Drawbacks of cross-validation:
• Can be computationally expensive
https://github.com/justmarkham/scikit-learn-videos/blob/master/08_grid_search.ipynb
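A sketch of an exhaustive grid search over K with GridSearchCV, which runs the cross-validation above for every candidate value (hence the computational expense):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try K = 1..30, scoring each value with 10-fold cross-validation
param_grid = {'n_neighbors': list(range(1, 31))}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring='accuracy')
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)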
EVALUATING A CLASSIFICATION MODEL
Review of model evaluation
• Need a way to choose between models: different model types, tuning parameters, and features
• Use a model evaluation procedure to estimate how well a model will generalize to out-of-sample data
• Requires a model evaluation metric to quantify the model performance
EVALUATING A CLASSIFICATION MODEL
Model evaluation procedures
Training and testing on the same data
 Rewards overly complex models that "overfit" the training data and won't necessarily generalize
Train/test split
 Split the dataset into two pieces, so that the model can be trained and tested on different data
 Better estimate of out-of-sample performance, but still a "high variance" estimate
 Useful due to its speed, simplicity, and flexibility
K-fold cross-validation
 Systematically create "K" train/test splits and average the results together
 Even better estimate of out-of-sample performance
 Runs "K" times slower than train/test split
EVALUATING A CLASSIFICATION MODEL
Model evaluation metrics
• Regression problems: Mean Absolute Error, Mean Squared Error, Root Mean Squared Error
• Classification problems: Classification accuracy
https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb
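A short sketch of these metrics (the labels and predictions are made-up toy values):

import numpy as np
from sklearn import metrics

# Classification: accuracy
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
print(metrics.accuracy_score(y_true, y_pred))             # 0.8 (4 of 5 correct)

# Regression: MAE, MSE, and RMSE (the square root of MSE)
y_obs = [3.0, -0.5, 2.0, 7.0]
y_hat = [2.5, 0.0, 2.0, 8.0]
print(metrics.mean_absolute_error(y_obs, y_hat))          # MAE
print(metrics.mean_squared_error(y_obs, y_hat))           # MSE
print(np.sqrt(metrics.mean_squared_error(y_obs, y_hat)))  # RMSE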
LEARNING RESOURCES
4. LEARNING RESOURCES
• Courses:
• Stanford Machine Learning: available via Coursera, taught by Andrew Ng
• Caltech Learning from Data: available via edX, taught by Yaser Abu-Mostafa
• Machine Learning category on VideoLectures.Net: an easy place to drown in the overload of content
• Blogs:
• Machine Learning Mastery
• https://machinelearningmastery.com/best-machine-learning-resources-for-getting-started/
• More:
• https://github.com/josephmisiti/awesome-machine-learning/blob/master/courses.md
• https://github.com/josephmisiti/awesome-machine-learning/blob/master/books.md
• https://github.com/josephmisiti/awesome-machine-learning/blob/master/blogs.md
NEXT STEP
5. NEXT STEP
The next step is learning deep learning.
Project:
1. Finish one of these courses: Stanford Machine Learning or Caltech Learning from Data
2. Submit to all four of these Kaggle competitions (we will select the top 20 students on their leaderboards):
• https://www.kaggle.com/c/titanic
• https://www.kaggle.com/c/ghouls-goblins-and-ghosts-boo
• https://www.kaggle.com/c/digit-recognizer
• https://www.kaggle.com/c/leaf-classification