ML Fundamentals: Session 2
Supervised Learning with scikit-learn
Alia Hamwi
What is ML?
• “Machine Learning: Field of study that gives
computers the ability to learn without being
explicitly programmed.” -Arthur Samuel (1959)
Traditional Programming vs. Machine Learning
When Do We Use Machine Learning?
• ML is used when:
• Humans can’t explain their expertise (speech recognition)
• Models are based on huge amounts of data (genomics)
• Learning isn’t always useful:
• There is no need to “learn” to calculate payroll
When Do We Use Machine Learning?
• A classic example of a task that requires machine learning:
It is very hard to say what makes a handwritten digit a 2
Types of Learning
• Supervised (inductive) learning
- Given: training data + desired outputs (labels)
Types of Learning
• Unsupervised learning
- Given: training data (without desired outputs)
Types of Learning
• Semi-supervised learning
- Given: training data + a few desired outputs
Types of Learning
• Reinforcement learning
- Given: rewards from a sequence of actions
Types of Supervised learning
• Classification: the output variable is a category, such as “red” or “blue”,
or “disease” or “no disease”.
• Regression: the output variable is a real value, such as an amount in
dollars or a weight.
Supervised learning Applications
• Text categorization (News)
• Face recognition / object recognition / signature recognition
• Music genre classification (for recommendations, e.g. Spotify)
• Spam detection (Gmail)
• Weather forecasting
• Predicting housing prices
• Predicting stock prices
• Predicting a product's price from its attributes
• Predicting whether an employee will leave the company (HR system)
As an ML Engineer...
• Now, choose the right answers for these use cases:
https://forms.gle/zDfcQuxX22UfjUUc6
ML Pipeline
Data Collection
- Rows: examples (instances)
- Columns: features, plus one for the target/label
- Values:
  - Numeric data
  - Ordinal data: the categories have an inherent order
  - Nominal data: the categories do not have an inherent order
Data Preparation
• Data Cleaning
• Remove unwanted data content
• Check formatting
• Imputation / handling missing data
• Numerical: fill with the mean or median
• Categorical: fill with the most frequent value, or add a new “Missing” category
• Either type: drop the example (row)
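The imputation options above can be sketched with scikit-learn's SimpleImputer. This is a minimal illustration, not a recipe: the toy columns (ages, cities) and their values are made up for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: one numeric column and one categorical column,
# each with a single missing value (np.nan).
ages = np.array([[25.0], [np.nan], [35.0], [40.0]])
cities = np.array([["Paris"], ["Rome"], [np.nan], ["Paris"]], dtype=object)

# Numerical: replace missing values with the column median (or use "mean").
num_imputer = SimpleImputer(strategy="median")
ages_filled = num_imputer.fit_transform(ages)

# Categorical: replace missing values with the most frequent category.
cat_imputer = SimpleImputer(strategy="most_frequent")
cities_filled = cat_imputer.fit_transform(cities)

# Categorical alternative: mark missing values with an explicit category.
flag_imputer = SimpleImputer(strategy="constant", fill_value="Missing")
cities_flagged = flag_imputer.fit_transform(cities)

print(ages_filled.ravel())
print(cities_filled.ravel())
print(cities_flagged.ravel())
```

Which strategy is appropriate depends on the data; dropping rows is simplest but throws information away.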
Data Preparation: Encoding
• One-hot encoding / dummy variables
• For each level of a categorical feature, we create a new binary variable.
Each category is mapped to a variable containing either 0 or 1, where 0
represents the absence and 1 represents the presence of that category.
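A small sketch of one-hot encoding with scikit-learn's OneHotEncoder. The "colors" feature and its values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical nominal feature with three levels.
colors = np.array([["red"], ["blue"], ["green"], ["blue"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; convert it for display.
one_hot = encoder.fit_transform(colors).toarray()

print(encoder.categories_[0])  # the learned levels, in sorted order
print(one_hot)                 # one column per level, a single 1 per row
```

Each row contains exactly one 1, in the column that corresponds to its category.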
Data Preparation: Encoding
• Label encoding / ordinal encoding
• We use this encoding technique when the categorical feature is ordinal.
Retaining the order is important, so the encoding should reflect the
sequence (e.g. exam grade, day of week, sizes).
• Ex: 'Degree': {'None': 0, 'High school': 1, 'Diploma': 2, 'Bachelors': 3, 'Masters': 4, 'phd': 5}
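The Degree mapping above can be applied with plain Python, as sketched here. The list of raw values is made up; in scikit-learn, OrdinalEncoder with an explicit categories order plays a similar role.

```python
# Mapping taken from the slide: each level gets an integer that preserves order.
degree_order = {"None": 0, "High school": 1, "Diploma": 2,
                "Bachelors": 3, "Masters": 4, "phd": 5}

# Hypothetical column of raw values.
degrees = ["Diploma", "phd", "None", "Bachelors"]

# Replace each level with its rank.
encoded = [degree_order[d] for d in degrees]
print(encoded)  # [2, 5, 0, 3]
```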
Data Preparation: Standardization
• Standardization
• Standardization rescales each feature using its mean and standard deviation.
Raw feature values can range from very low to very high, and such differences
in scale can hurt the performance of many models. After standardization, each
feature has a mean of zero and a standard deviation of one.
• The standardization formula is:
z = (feature_value - mean) / standard_deviation
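A worked instance of the formula using only the Python standard library. The heights are made-up values. The population standard deviation is used here, which is also what scikit-learn's StandardScaler uses.

```python
from statistics import mean, pstdev

# Hypothetical feature values.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]

mu = mean(heights_cm)       # 170.0
sigma = pstdev(heights_cm)  # population standard deviation

# z = (feature_value - mean) / standard_deviation
z_scores = [(x - mu) / sigma for x in heights_cm]

print([round(z, 3) for z in z_scores])
print(round(mean(z_scores), 6), round(pstdev(z_scores), 6))
```

After the transformation the values have mean 0 and standard deviation 1, up to floating-point rounding.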
Model Training
• Classification:
• Logistic regression
• K nearest neighbors
• Support vector machines (SVM)
• Naive Bayes
• Regression:
• Linear regression with different regularization:
• Lasso
• Ridge
• Elastic Net
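A minimal sketch of training one classifier and one regularized regressor from the lists above with scikit-learn. The datasets (built-in iris, synthetic regression data) and hyperparameters are illustrative choices, not recommendations.

```python
from sklearn.datasets import load_iris, make_regression
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.model_selection import train_test_split

# Classification: logistic regression on the built-in iris dataset.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
clf_accuracy = clf.score(X_test, y_test)  # accuracy on held-out data

# Regression: ridge (L2-regularized linear regression) on synthetic data.
Xr, yr = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(Xr, yr, random_state=0)
reg = Ridge(alpha=1.0)  # alpha controls the regularization strength
reg.fit(Xr_train, yr_train)
reg_r2 = reg.score(Xr_test, yr_test)  # R^2 on held-out data

print(clf_accuracy, reg_r2)
```

The other estimators follow the same fit/predict/score pattern, so swapping Ridge for Lasso or ElasticNet changes only the constructor line.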
Model Training
• Cross-validation
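A short sketch of k-fold cross-validation with scikit-learn's cross_val_score. The model (k nearest neighbors), the dataset (iris), and the choice of 5 folds are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=5)

# Split the data into 5 folds; each fold is used once as the validation set
# while the model is trained on the remaining 4 folds.
scores = cross_val_score(model, X, y, cv=5)

print(scores)                       # one accuracy value per fold
print(scores.mean(), scores.std())  # average performance and its spread
```

Reporting the mean together with the spread gives a more stable estimate than a single train/test split.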
Model Evaluation
• Overfitting can be reduced by:
• Adding more data
• Data augmentation
• Regularization
• Removing features from data
• Underfitting can be reduced by:
• Increasing the model complexity
• Reducing regularization
• Adding features to training data
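One common way to spot overfitting or underfitting is to compare training and test scores. The sketch below assumes synthetic data and uses k nearest neighbors with two arbitrary values of k: with k=1 the model memorizes the training set, so a gap to the test score points to overfitting, while low scores on both sets would point to underfitting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

results = {}
for k in (1, 15):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for each setting of k.
    results[k] = (model.score(X_train, y_train), model.score(X_test, y_test))

for k, (train_acc, test_acc) in results.items():
    print(f"k={k}: train={train_acc:.3f} test={test_acc:.3f}")
```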
As an ML Engineer...
• Now, choose the right answers for these use cases:
https://forms.gle/fN2y2nRueviBf2JX6
Model Evaluation
• Confusion matrix
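A minimal sketch of building a confusion matrix with scikit-learn's confusion_matrix. The label vectors are made up. For binary labels 0/1, rows are the actual classes and columns are the predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
# For binary 0/1 labels the layout is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = cm.ravel()

print(cm)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```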
Model Evaluation
• Precision is the fraction of the model's positive predictions that are actually
positive: TP / (TP + FP). It matters in applications such as music or video
recommendation systems and e-commerce websites, where wrong results could lead to
customer churn and be harmful to the business.
• It indicates how reliable the model's positive predictions are. It is useful when
a false positive is a greater concern than a false negative.
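Precision can be computed directly from its definition, TP / (TP + FP). The helper name and the label vectors below are hypothetical; scikit-learn offers precision_score for the same purpose.

```python
def precision(y_true, y_pred, positive=1):
    """Fraction of predicted positives that are actually positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    # Convention chosen here: return 0.0 when nothing was predicted positive.
    return tp / (tp + fp) if (tp + fp) else 0.0

# Hypothetical labels: 1 = "user liked the item", 0 = "user did not".
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(precision(y_true, y_pred))  # 3 TP, 1 FP -> 0.75
```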
Model Evaluation
• Recall is the fraction of actual positive values that the model predicts
correctly: TP / (TP + FN).
• Recall (sensitivity) is a useful metric when a false negative is costlier than a
false positive. It is important in medical cases, where raising a false alarm is
acceptable but actual positive cases must not go undetected.
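Recall follows the same pattern, using TP / (TP + FN). The helper name and data are hypothetical; scikit-learn's counterpart is recall_score. The example is an overly cautious screening model: every positive prediction is right, yet most actual positives are missed.

```python
def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that the model found."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    # Convention chosen here: return 0.0 when there are no actual positives.
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical screening results: 1 = disease, 0 = no disease.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]

print(recall(y_true, y_pred))  # 1 TP, 3 FN -> 0.25
```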
Model Evaluation
• Increasing precision typically decreases recall and vice versa; this is known as
the precision/recall tradeoff.
• When one model has low precision and high recall and another has the reverse, it
is hard to compare them. The F-score addresses this by combining both metrics
into a single number.
• The F-score (F1) is the harmonic mean of precision and recall, so for a given
sum of the two it is highest when recall equals precision:
F1 = 2 * precision * recall / (precision + recall)
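A small numeric sketch of the F1-score as the harmonic mean of precision and recall, F1 = 2 * P * R / (P + R). The helper name and the input values are made up. Both pairs below have the same arithmetic mean (0.5), but the lopsided pair scores much lower.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # convention chosen here to avoid division by zero
    return 2 * precision * recall / (precision + recall)

balanced = f1(0.5, 0.5)  # precision equals recall
lopsided = f1(0.9, 0.1)  # same sum, very unequal

print(round(balanced, 2), round(lopsided, 2))  # 0.5 0.18
```

Because the harmonic mean is dominated by the smaller value, a model cannot get a high F1 by excelling at only one of the two metrics.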
References
• Best Competitions for Beginners – Kaggle
https://www.kaggle.com/getting-started/78482
• The Hundred-Page Machine Learning Book
• Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow:
Concepts, Tools, and Techniques to Build Intelligent Systems
Thank You
