This is an introductory workshop on machine learning. It introduces machine learning tasks such as supervised learning, unsupervised learning, and reinforcement learning.
2. ABOUT ME
Teaching Assistant, Ain Shams University
Computational biology and deep learning
Former Research Software Development Engineer, Microsoft Research (ATLC)
Speech Recognition Team “Arabic Models”
Natural Language Processing Team “Virtual Bot”
aly.osama@eng.asu.edu.eg
3. AGENDA
1. Introduction
2. Machine learning tools
Scikit-learn
3. Case Study applications
Computer vision
Natural language processing
Speech Recognition
4. Learning resources
5. Next step
7. MACHINE LEARNING
It is hard for people to explicitly write the 'rules' for making decisions
The solution is dependent on lots of complex cases
We don't have the expertise to fully write 'the rules' but we have lots of
examples
Learning from ‘examples’
25. RESOURCES
Resources for learning Python
•Codecademy's Python course: browser-based, tons of exercises
•DataQuest: browser-based, teaches Python in the context of data science
•Google's Python class: slightly more advanced, includes videos and downloadable
exercises (with solutions)
•Python for Informatics: beginner-oriented book, includes slides and videos
29. SCIKIT LEARN
Benefits
• Consistent interface to machine learning
models
• Provides many tuning parameters but
with sensible defaults
• Exceptional documentation
• Rich set of functionality for companion
tasks
• Active community for development and
support
Drawbacks
• Harder (than R) to get started with
machine learning
• Less emphasis (than R) on model
interpretability
30. GETTING STARTED IN SCIKIT-LEARN WITH THE FAMOUS IRIS DATASET
• 50 samples from each of 3 species of iris (150 samples total)
• Measurements: sepal length, sepal width, petal length, petal width
Machine learning on the iris dataset
• Framed as a supervised learning problem: Predict the species of
an iris using the measurements.
• Famous dataset for machine learning because prediction is easy
• Learn more about the iris dataset: UCI Machine Learning
Repository
31. LOADING THE IRIS DATASET INTO SCIKIT-LEARN
Machine learning terminology
• Each row is an observation (also known as:
sample, example, instance, record)
• Each column is a feature (also known as:
predictor, attribute, independent variable,
input, regressor, covariate)
…
…
32. LOADING THE IRIS DATASET INTO SCIKIT-LEARN
Machine learning terminology
• Each value we are predicting is the
response (also known as: target, outcome,
label, dependent variable)
• Classification is supervised learning in
which the response is categorical
• Regression is supervised learning in which
the response is ordered and continuous
33. REQUIREMENTS FOR WORKING WITH DATA IN SCIKIT-LEARN
• Features and response are separate
objects
• Features and response should be numeric
• Features and response should be NumPy
arrays
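• Features and response should have specific shapes: a 2D feature matrix (observations × features) and a 1D response vector with one value per observation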
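As a minimal sketch of these requirements, the snippet below loads the iris data with scikit-learn's load_iris and checks the types and shapes. The variable names X and y are only a common convention, not something the slides prescribe.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Features and response are stored as separate objects.
X = iris.data      # feature matrix
y = iris.target    # response vector (species encoded as 0, 1, 2)

# Both are numeric NumPy arrays.
print(type(X), type(y))
print(X.dtype, y.dtype)

# Features are 2D (observations x features); the response is 1D.
print(X.shape)     # (150, 4)
print(y.shape)     # (150,)

# Human-readable names for the columns and the encoded species.
print(iris.feature_names)
print(iris.target_names)
```

The first dimension of X must match the length of y, since each observation needs exactly one response value.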
34. TRAINING A MACHINE LEARNING MODEL WITH
SCIKIT-LEARN
K-nearest neighbors (KNN) classification
• Pick a value for K.
• Search for the K observations in the training data that are "nearest" to the
measurements of the unknown iris.
• Use the most popular response value from the K nearest neighbors as the predicted
response value for the unknown iris.
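The three steps above can be sketched by hand with NumPy and then compared with scikit-learn's KNeighborsClassifier. The measurements for the "unknown" iris are made-up values for illustration, and Euclidean distance is an assumption here (it matches scikit-learn's default settings).

```python
from collections import Counter

import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hypothetical measurements for an unknown iris (illustration only).
unknown = np.array([3.0, 5.0, 4.0, 2.0])

# Step 1: pick a value for K.
k = 5

# Step 2: find the K training observations nearest to the unknown iris.
distances = np.linalg.norm(X - unknown, axis=1)   # Euclidean distance
nearest = np.argsort(distances)[:k]

# Step 3: use the most popular response value among those neighbors.
manual_pred = Counter(y[nearest].tolist()).most_common(1)[0][0]
print("manual prediction:", manual_pred)

# The same procedure using scikit-learn's estimator interface.
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X, y)
sk_pred = knn.predict(unknown.reshape(1, -1))[0]
print("scikit-learn prediction:", sk_pred)
```

The library version follows the usual scikit-learn pattern: instantiate the estimator, call fit on the training data, then call predict on new observations.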
41. COMPARING MACHINE LEARNING MODELS IN
SCIKIT-LEARN
• Classification task: Predicting the species of an unknown iris
• Used three classification models: KNN (K=1), KNN (K=5), logistic regression
• Need a way to choose between the models
Solution: model evaluation procedures
1. Train and test on the entire dataset
2. Train/test split
https://github.com/justmarkham/scikit-learn-videos/blob/master/05_model_evaluation.ipynb
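A rough sketch of both evaluation procedures applied to the three models is shown below. The 40% test size, the random_state value, and max_iter=1000 are assumptions chosen for illustration, not settings taken from the slides.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

models = {
    "KNN (K=1)": KNeighborsClassifier(n_neighbors=1),
    "KNN (K=5)": KNeighborsClassifier(n_neighbors=5),
    "Logistic regression": LogisticRegression(max_iter=1000),
}

# Procedure #2 needs a held-out testing set (split settings are assumptions).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=4
)

results = {}
for name, model in models.items():
    # Procedure #1: train and test on the entire dataset.
    model.fit(X, y)
    training_acc = accuracy_score(y, model.predict(X))

    # Procedure #2: train on one piece, test on the other.
    model.fit(X_train, y_train)
    testing_acc = accuracy_score(y_test, model.predict(X_test))

    results[name] = (training_acc, testing_acc)
    print(f"{name}: training={training_acc:.3f}, testing={testing_acc:.3f}")
```

KNN with K=1 scores perfectly under procedure #1 because each training point is its own nearest neighbor, which is why training accuracy alone is a poor basis for choosing a model.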
42. DATA SCIENCE PIPELINE: PANDAS, SEABORN,
SCIKIT-LEARN
Types of supervised learning
• Classification: Predict a categorical response
• Regression: Predict a continuous response
43. DATA SCIENCE PIPELINE: PANDAS, SEABORN,
SCIKIT-LEARN
Reading data using pandas
Pandas: popular Python library for data exploration, manipulation, and analysis
Visualizing data using seaborn
Seaborn: Python library for statistical data visualization built on top of Matplotlib
https://github.com/justmarkham/scikit-learn-videos/blob/master/06_linear_regression.ipynb
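As a small sketch of the pandas part of the pipeline, the snippet below loads the iris data as a DataFrame and fits a regression model on two of its columns. Using iris here (rather than the dataset in the linked notebook) and predicting petal width from petal length are assumptions made to keep the example self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression

# Load the data as a pandas DataFrame (features plus a "target" column).
df = load_iris(as_frame=True).frame

# Explore the data with pandas.
print(df.head())
print(df.shape)        # (150, 5)
print(df.describe())

# Regression: predict a continuous response from a feature.
feature_cols = ["petal length (cm)"]
X = df[feature_cols]           # 2D: DataFrame with one column
y = df["petal width (cm)"]     # 1D: Series

linreg = LinearRegression()
linreg.fit(X, y)
print("intercept:", linreg.intercept_)
print("coefficient:", linreg.coef_[0])
```

For the visualization step, seaborn's pairplot function can be called on the same DataFrame (for example with hue="target") to plot each pair of measurements by species.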
44. CROSS-VALIDATION FOR PARAMETER TUNING,
MODEL SELECTION, AND FEATURE SELECTION
Review of model evaluation procedures
Motivation: Need a way to choose between machine learning models
Goal is to estimate likely performance of a model on out-of-sample data
Initial idea: Train and test on the same data
But, maximizing training accuracy rewards overly complex models which overfit the training
data
45. CROSS-VALIDATION FOR PARAMETER TUNING,
MODEL SELECTION, AND FEATURE SELECTION
Alternative idea: Train/test split
Split the dataset into two pieces, so that the model can be trained and tested on different
data
Testing accuracy is a better estimate than training accuracy of out-of-sample performance
But, it provides a high variance estimate since changing which observations happen to be in
the testing set can significantly change testing accuracy
https://github.com/justmarkham/scikit-learn-videos/blob/master/07_cross_validation.ipynb
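The high-variance point can be illustrated by repeating the train/test split with different random seeds and watching the testing accuracy move. This is only a sketch: K=5, the 40% test size, and the ten seeds are arbitrary assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

accuracies = []
for seed in range(10):
    # A different seed puts different observations in the testing set.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.4, random_state=seed
    )
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)
    # score() returns classification accuracy on the given data.
    accuracies.append(knn.score(X_test, y_test))

print("testing accuracy per split:", [round(a, 3) for a in accuracies])
print("lowest:", min(accuracies), "highest:", max(accuracies))
```

The model and the data are identical in every iteration; only the split changes, so any spread in the printed accuracies comes from the evaluation procedure itself.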
46. EFFICIENTLY SEARCHING FOR OPTIMAL TUNING
PARAMETERS
Review of K-fold cross-validation
• Steps for cross-validation:
• Dataset is split into K "folds" of equal size
• Each fold acts as the testing set 1 time, and acts as the training set K-1 times
• Average testing performance is used as the estimate of out-of-sample performance
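These steps can be sketched with scikit-learn's KFold and cross_val_score. The fold counts (5 and 10) and K=5 for KNN are assumptions chosen for illustration. Note that when cv is an integer and the model is a classifier, cross_val_score uses stratified folds rather than a plain split.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Split the dataset into 5 folds and count how often each
# observation lands in the testing set.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
times_tested = np.zeros(len(X), dtype=int)
for train_idx, test_idx in kf.split(X):
    times_tested[test_idx] += 1

# Every observation is tested exactly once (and trained on 4 times).
print(np.unique(times_tested))

# 10-fold cross-validation: one accuracy score per fold.
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X, y, cv=10, scoring="accuracy")
print(scores)

# The average is the estimate of out-of-sample performance.
print("mean accuracy:", scores.mean())
```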
47. EFFICIENTLY SEARCHING FOR OPTIMAL TUNING
PARAMETERS
Benefits of cross-validation:
• More reliable estimate of out-of-sample performance than train/test split
• Can be used for selecting tuning parameters, choosing between models, and selecting features
Drawbacks of cross-validation:
• Can be computationally expensive
https://github.com/justmarkham/scikit-learn-videos/blob/master/08_grid_search.ipynb
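One way to search for a tuning parameter with cross-validation is scikit-learn's GridSearchCV. The sketch below assumes we are tuning K for KNN over the range 1 to 30 with 10-fold cross-validation; both choices are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Candidate values for the tuning parameter (range is an assumption).
param_grid = {"n_neighbors": list(range(1, 31))}

# Runs 10-fold cross-validation once for every candidate value.
grid = GridSearchCV(
    KNeighborsClassifier(), param_grid, cv=10, scoring="accuracy"
)
grid.fit(X, y)

print("best parameters:", grid.best_params_)
print("best mean cross-validated accuracy:", grid.best_score_)
```

This also shows the computational cost mentioned above: 30 candidate values with 10 folds means 300 model fits, plus one final refit on the whole dataset with the best value.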
48. EVALUATING A CLASSIFICATION MODEL
Review of model evaluation
• Need a way to choose between models: different model types, tuning parameters, and
features
• Use a model evaluation procedure to estimate how well a model will generalize to out-of-
sample data
• Requires a model evaluation metric to quantify the model performance
49. EVALUATING A CLASSIFICATION MODEL
Model evaluation procedures
• Training and testing on the same data
  • Rewards overly complex models that "overfit" the training data and won't necessarily generalize
• Train/test split
  • Split the dataset into two pieces, so that the model can be trained and tested on different data
  • Better estimate of out-of-sample performance, but still a "high variance" estimate
  • Useful due to its speed, simplicity, and flexibility
• K-fold cross-validation
  • Systematically create "K" train/test splits and average the results together
  • Even better estimate of out-of-sample performance
  • Runs "K" times slower than train/test split
50. EVALUATING A CLASSIFICATION MODEL
Model evaluation metrics
• Regression problems: Mean Absolute Error, Mean Squared Error, Root Mean Squared Error
• Classification problems: Classification accuracy
https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb
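The metrics listed above can be computed with sklearn.metrics. The true and predicted values below are made-up numbers chosen so the arithmetic is easy to check by hand; they do not come from any real model.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    mean_absolute_error,
    mean_squared_error,
)

# Hypothetical regression responses (illustration only).
true_values = [100, 50, 30, 20]
predictions = [90, 50, 50, 30]

# Mean Absolute Error: (10 + 0 + 20 + 10) / 4
mae = mean_absolute_error(true_values, predictions)
print("MAE:", mae)     # 10.0

# Mean Squared Error: (100 + 0 + 400 + 100) / 4
mse = mean_squared_error(true_values, predictions)
print("MSE:", mse)     # 150.0

# Root Mean Squared Error: square root of the MSE
rmse = np.sqrt(mse)
print("RMSE:", rmse)   # about 12.25

# Hypothetical classification labels (illustration only).
true_labels = [0, 1, 1, 0, 1]
predicted_labels = [0, 1, 0, 0, 1]

# Classification accuracy: 4 of 5 predictions are correct
acc = accuracy_score(true_labels, predicted_labels)
print("accuracy:", acc)   # 0.8
```

RMSE is often the easiest of the three regression metrics to interpret because it is in the same units as the response.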
52. 4. LEARNING RESOURCES
•Courses
•Stanford Machine Learning: Available via Coursera and taught by Andrew Ng.
•Caltech Learning from Data: Available via edX and taught by Yaser Abu-Mostafa
•Machine Learning Category on VideoLectures.Net: This is an easy place to drown in the overload of content.
•Blogs:
•Machinelearningmastery
•https://machinelearningmastery.com/best-machine-learning-resources-for-getting-started/
•More:
•https://github.com/josephmisiti/awesome-machine-learning/blob/master/courses.md
•https://github.com/josephmisiti/awesome-machine-learning/blob/master/books.md
•https://github.com/josephmisiti/awesome-machine-learning/blob/master/blogs.md
54. 5. NEXT STEP
The next step is to learn deep learning
Project:
1. Finish one of these courses: Stanford Machine Learning or Caltech Learning from Data
2. Submit entries to all 4 of these Kaggle competitions (we will select the top 20 students on their leaderboards)
• https://www.kaggle.com/c/titanic
• https://www.kaggle.com/c/ghouls-goblins-and-ghosts-boo
• https://www.kaggle.com/c/digit-recognizer
• https://www.kaggle.com/c/leaf-classification