Aly Osama
Machine Learning For Everyone
Teaching Assistant, Ain Shams University
 Computational biology and deep learning
Former Research Software Development Engineer, Microsoft Research (ATLC)
 Speech Recognition Team “Arabic Models”
 Natural Language Processing Team “Virtual Bot”
ABOUT ME
aly.osama@eng.asu.edu.eg
AGENDA
1. Introduction
2. Machine learning tools
 Scikit-learn
3. Case Study applications
 Computer vision
 Natural language processing
 Speech Recognition
4. Learning resources
5. Next step
INTRODUCTION
https://docs.google.com/presentation/d/1kSuQyW5DTnkVaZEjGYCkfOxvzCqGEFzWBy4e9Uedd9k/edit
BREAK
Quick Recap
MACHINE LEARNING
It is hard for people to explicitly write the 'rules' for making decisions.
The solution depends on lots of complex cases.
We don't have the expertise to fully write 'the rules', but we have lots of examples.
Learning from 'examples'
HOW TO LEARN?
Nearest neighbor
LEARNING THROUGH LINEAR SEPARATION
HOW TO LEARN?
Regression
MODEL QUALITY
OVERFITTING PROBLEM
NEURAL NETWORKS
GRADIENT DESCENT
NEURAL NETWORK TRAINING
DEEP LEARNING AS A BLACK BOX
INTERPRETABILITY OF DEEP LEARNING
DEEP LEARNING REQUIRES LARGER TRAINING SETS
RESOURCES
Resources for learning Python
• Codecademy's Python course: browser-based, tons of exercises
• DataQuest: browser-based, teaches Python in the context of data science
• Google's Python class: slightly more advanced, includes videos and downloadable exercises (with solutions)
• Python for Informatics: beginner-oriented book, includes slides and videos
MACHINE LEARNING TOOLS: Scikit-learn
SCIKIT-LEARN
Benefits
• Consistent interface to machine learning models
• Provides many tuning parameters but with sensible defaults
• Exceptional documentation
• Rich set of functionality for companion tasks
• Active community for development and support
Drawbacks
• Harder (than R) to get started with machine learning
• Less emphasis (than R) on model interpretability
GETTING STARTED IN SCIKIT-LEARN WITH THE FAMOUS IRIS DATASET
• 50 samples of each of 3 different species of iris (150 samples total)
• Measurements: sepal length, sepal width, petal length, petal width
Machine learning on the iris dataset
• Framed as a supervised learning problem: predict the species of an iris using the measurements.
• Famous dataset for machine learning because prediction is easy
• Learn more about the iris dataset: UCI Machine Learning Repository
LOADING THE IRIS DATASET INTO SCIKIT-LEARN
Machine learning terminology
• Each row is an observation (also known as: sample, example, instance, record)
• Each column is a feature (also known as: predictor, attribute, independent variable, input, regressor, covariate)
LOADING THE IRIS DATASET INTO SCIKIT-LEARN
Machine learning terminology
• Each value we are predicting is the response (also known as: target, outcome, label, dependent variable)
• Classification is supervised learning in which the response is categorical
• Regression is supervised learning in which the response is ordered and continuous
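To anchor this terminology, a minimal sketch of loading the dataset (the slide's original code is a screenshot and is not reproduced here; this uses scikit-learn's standard loader):

from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)      # (150, 4): 150 observations (rows), 4 features (columns)
print(iris.feature_names)   # sepal length, sepal width, petal length, petal width
print(iris.target[:5])      # the response: species encoded as 0, 1, 2
print(iris.target_names)    # ['setosa' 'versicolor' 'virginica']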
REQUIREMENTS FOR WORKING WITH DATA IN SCIKIT-LEARN
• Features and response are separate objects
• Features and response should be numeric
• Features and response should be NumPy arrays
• Features and response should have specific shapes
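All four requirements can be checked directly on the iris data; a short sketch:

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data      # features: one object
y = iris.target    # response: a separate object
print(type(X), type(y))   # both numpy.ndarray, both numeric
print(X.shape, y.shape)   # (150, 4) and (150,): X is (n_samples, n_features), y is (n_samples,)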
TRAINING A MACHINE LEARNING MODEL WITH SCIKIT-LEARN
K-nearest neighbors (KNN) classification
• Pick a value for K.
• Search for the K observations in the training data that are "nearest" to the measurements of the unknown iris.
• Use the most popular response value from the K nearest neighbors as the predicted response value for the unknown iris (sketched below).
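A from-scratch sketch of these three steps, purely for illustration (this is not scikit-learn's implementation; the scikit-learn version follows in the modeling pattern below). Euclidean distance and a plurality vote are the usual choices:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    # X_train and y_train are NumPy arrays; k was picked by the caller (step 1)
    # Step 2: distance from x_new to every training observation
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]   # indices of the K nearest neighbors
    # Step 3: most popular response value among the K nearest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]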
LOADING THE DATA
SCIKIT-LEARN 4-STEP MODELING PATTERN
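The pattern from the linked notebooks is, roughly: (1) import the model class, (2) instantiate the estimator with its tuning parameters, (3) fit the model to the data, (4) predict the response for a new observation. A minimal sketch on the iris data (the sample measurements are illustrative):

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier   # Step 1: import the class

X, y = load_iris(return_X_y=True)

knn = KNeighborsClassifier(n_neighbors=1)   # Step 2: instantiate the estimator
knn.fit(X, y)                               # Step 3: fit the model with data
print(knn.predict([[3, 5, 4, 2]]))          # Step 4: predict for a new observation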
USING A DIFFERENT VALUE FOR K
USING A DIFFERENT CLASSIFICATION MODEL
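Both variations change only the instantiation step of the pattern above; a brief sketch:

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# A different value for K: only step 2 (instantiation) changes
knn5 = KNeighborsClassifier(n_neighbors=5)
knn5.fit(X, y)
print(knn5.predict([[3, 5, 4, 2]]))

# A different classification model: the fit/predict interface stays the same
logreg = LogisticRegression()
logreg.fit(X, y)
print(logreg.predict([[3, 5, 4, 2]]))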
COMPARING MACHINE LEARNING MODELS IN SCIKIT-LEARN
• Classification task: Predicting the species of an unknown iris
• Used three classification models: KNN (K=1), KNN (K=5), logistic regression
• Need a way to choose between the models
Solution:
Model evaluation procedures:
1. Evaluation procedure #1: Train and test on the entire dataset
2. Evaluation procedure #2: Train/test split
https://github.com/justmarkham/scikit-learn-videos/blob/master/05_model_evaluation.ipynb
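A minimal sketch of evaluation procedure #2 (the split fraction and random seed here are illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Hold out part of the data so training and testing use different observations
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=4)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(accuracy_score(y_test, knn.predict(X_test)))   # testing accuracy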
DATA SCIENCE PIPELINE: PANDAS, SEABORN, SCIKIT-LEARN
Types of supervised learning
• Classification: Predict a categorical response
• Regression: Predict a continuous response
DATA SCIENCE PIPELINE: PANDAS, SEABORN, SCIKIT-LEARN
Reading data using pandas
Pandas: popular Python library for data exploration, manipulation, and analysis
Visualizing data using seaborn
Seaborn: Python library for statistical data visualization built on top of Matplotlib
https://github.com/justmarkham/scikit-learn-videos/blob/master/06_linear_regression.ipynb
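A sketch of that pipeline, assuming an advertising-style CSV like the one used in the linked notebook (the file name and column names here are illustrative, not prescribed by the slides):

import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression

# Read tabular data into a DataFrame
data = pd.read_csv('Advertising.csv')

# Visualize the relationship between each feature and the response
sns.pairplot(data, x_vars=['TV', 'radio', 'newspaper'], y_vars='sales', kind='reg')

# Fit a linear regression: the response is continuous, so this is a regression task
X = data[['TV', 'radio', 'newspaper']]
y = data['sales']
linreg = LinearRegression().fit(X, y)
print(linreg.intercept_, linreg.coef_)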
CROSS-VALIDATION FOR PARAMETER TUNING, MODEL SELECTION, AND FEATURE SELECTION
Review of model evaluation procedures
Motivation: Need a way to choose between machine learning models
Goal is to estimate likely performance of a model on out-of-sample data
Initial idea: Train and test on the same data
But, maximizing training accuracy rewards overly complex models which overfit the training data
CROSS-VALIDATION FOR PARAMETER TUNING, MODEL SELECTION, AND FEATURE SELECTION
Alternative idea: Train/test split
Split the dataset into two pieces, so that the model can be trained and tested on different data
Testing accuracy is a better estimate than training accuracy of out-of-sample performance
But, it provides a high variance estimate, since changing which observations happen to be in the testing set can significantly change testing accuracy
https://github.com/justmarkham/scikit-learn-videos/blob/master/07_cross_validation.ipynb
EFFICIENTLY SEARCHING FOR OPTIMAL TUNING PARAMETERS
Review of K-fold cross-validation
• Steps for cross-validation:
• Dataset is split into K "folds" of equal size
• Each fold acts as the testing set 1 time, and acts as the training set K-1 times
• Average testing performance is used as the estimate of out-of-sample performance
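A minimal sketch of these steps with scikit-learn's cross_val_score (10 folds is the common choice in the linked notebook):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)

# 10 folds: each fold is the testing set once and part of the training set 9 times
scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
print(scores.mean())   # average testing accuracy as the out-of-sample estimate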
EFFICIENTLY SEARCHING FOR OPTIMAL TUNING PARAMETERS
Benefits of cross-validation:
• More reliable estimate of out-of-sample performance than train/test split
• Can be used for selecting tuning parameters, choosing between models, and selecting features
Drawbacks of cross-validation:
• Can be computationally expensive
https://github.com/justmarkham/scikit-learn-videos/blob/master/08_grid_search.ipynb
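A sketch of an exhaustive grid search over K with GridSearchCV, which runs the cross-validation above for every candidate value (hence the computational expense):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try K = 1..30, scoring each value with 10-fold cross-validation
param_grid = {'n_neighbors': list(range(1, 31))}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring='accuracy')
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)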
EVALUATING A CLASSIFICATION MODEL
Review of model evaluation
• Need a way to choose between models: different model types, tuning parameters, and features
• Use a model evaluation procedure to estimate how well a model will generalize to out-of-sample data
• Requires a model evaluation metric to quantify the model performance
EVALUATING A CLASSIFICATION MODEL
Model evaluation procedures
Training and testing on the same data
 Rewards overly complex models that "overfit" the training data and won't necessarily generalize
Train/test split
 Split the dataset into two pieces, so that the model can be trained and tested on different data
 Better estimate of out-of-sample performance, but still a "high variance" estimate
 Useful due to its speed, simplicity, and flexibility
K-fold cross-validation
 Systematically create "K" train/test splits and average the results together
 Even better estimate of out-of-sample performance
 Runs "K" times slower than train/test split
EVALUATING A CLASSIFICATION MODEL
Model evaluation metrics
• Regression problems: Mean Absolute Error, Mean Squared Error, Root Mean Squared Error
• Classification problems: Classification accuracy
https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb
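A short sketch of these metrics (the labels and predictions are made-up toy values):

import numpy as np
from sklearn import metrics

# Classification: accuracy
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
print(metrics.accuracy_score(y_true, y_pred))             # 0.8 (4 of 5 correct)

# Regression: MAE, MSE, and RMSE (the square root of MSE)
y_obs = [3.0, -0.5, 2.0, 7.0]
y_hat = [2.5, 0.0, 2.0, 8.0]
print(metrics.mean_absolute_error(y_obs, y_hat))          # MAE
print(metrics.mean_squared_error(y_obs, y_hat))           # MSE
print(np.sqrt(metrics.mean_squared_error(y_obs, y_hat)))  # RMSE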
LEARNING RESOURCES
4. LEARNING RESOURCES
• Courses:
• Stanford Machine Learning: available via Coursera, taught by Andrew Ng
• Caltech Learning from Data: available via edX, taught by Yaser Abu-Mostafa
• Machine Learning category on VideoLectures.Net: an easy place to drown in the overload of content
• Blogs:
• Machine Learning Mastery
• https://machinelearningmastery.com/best-machine-learning-resources-for-getting-started/
• More:
• https://github.com/josephmisiti/awesome-machine-learning/blob/master/courses.md
• https://github.com/josephmisiti/awesome-machine-learning/blob/master/books.md
• https://github.com/josephmisiti/awesome-machine-learning/blob/master/blogs.md
NEXT STEP
5. NEXT STEP
The next step is learning deep learning.
Project:
1. Finish one of these courses: Stanford Machine Learning or Caltech Learning from Data
2. Submit to all four of these Kaggle competitions (we will select the top 20 students on their leaderboards):
• https://www.kaggle.com/c/titanic
• https://www.kaggle.com/c/ghouls-goblins-and-ghosts-boo
• https://www.kaggle.com/c/digit-recognizer
• https://www.kaggle.com/c/leaf-classification