Supervised learning

ML Fundamentals: Session 2
Supervised Learning with scikit-learn
Alia Hamwi

What is ML?
• “Machine Learning: Field of study that gives
computers the ability to learn without being
explicitly programmed.” -Arthur Samuel (1959)

Traditional Programming vs. Machine Learning
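The contrast can be sketched in a few lines of Python. This is only an illustration under assumed data: the spam-by-link-count task, the function names, and the example values are hypothetical and do not come from the slides. In the first function a person writes the rule; in the second, the rule (a threshold) is chosen from labeled examples.

```python
def is_spam_rule(num_links):
    """Traditional programming: a human writes the rule by hand."""
    return num_links > 3  # threshold picked by the programmer

def learn_threshold(examples):
    """Machine learning (toy version): choose the threshold that best
    separates the labeled examples."""
    candidates = sorted({x for x, _ in examples})

    def accuracy(t):
        # Fraction of examples where the rule "x > t" matches the label
        return sum((x > t) == label for x, label in examples) / len(examples)

    return max(candidates, key=accuracy)

# Hypothetical labeled data: (number of links in a message, is it spam?)
examples = [(0, False), (1, False), (2, False), (5, True), (7, True), (9, True)]
threshold = learn_threshold(examples)
print(threshold)
```

With different labeled examples the learned threshold changes without anyone editing the code, which is the point of the comparison.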

When Do We Use Machine Learning?
• ML is used when:
• Humans can’t explain their expertise (speech recognition)
• Models are based on huge amounts of data (genomics)
• Learning isn’t always useful:
• There is no need to “learn” to calculate payroll

When Do We Use Machine Learning?
• A classic example of a task that requires machine learning is handwritten
digit recognition: it is very hard to say in explicit rules what makes a 2

Types of Learning
• Supervised (inductive) learning
- Given: training data + desired outputs (labels)

Types of Learning
• Unsupervised learning
- Given: training data (without desired outputs)

Types of Learning
• Semi-supervised learning
- Given: training data + a few desired outputs

Types of Learning
• Reinforcement learning
- Given: rewards from a sequence of actions

Types of Supervised learning
• Classification: the output variable is a category, such as “red” or
“blue”, or “disease” or “no disease”.
• Regression: the output variable is a real (continuous) value, such as a
price in dollars or a weight.

Supervised learning Applications
• Text categorization (news)
• Face recognition / object recognition / signature recognition
• Music type (genre) classification (for recommendations, e.g. Spotify)
• Spam detection (Gmail)
• Weather forecasting
• Predicting housing prices
• Stock price prediction, among others
• Predicting a product price from its attributes
• Predicting whether an employee will leave the company (HR systems)

As an ML Engineer...
• Now, choose the right answers for these use cases:
https://forms.gle/zDfcQuxX22UfjUUc6

Data Collection
- Rows: examples (instances)
- Columns: features, plus one column for the target/label
- Values:
  - Numeric data
  - Ordinal data: the categories have an inherent order
  - Nominal data: the categories do not have an inherent order

Data Preparation
• Data cleaning
- Remove unwanted data content
- Check formatting
• Imputation / handling missing data
- Numerical: mean, median
- Categorical: most frequent value, or add a new “Missing” category
- Either type: drop the example
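The imputation options above can be sketched with the Python standard library. This is a minimal illustration, not the deck's own code: the `impute` helper and the sample columns are hypothetical, and missing values are assumed to be represented as `None`. scikit-learn's `SimpleImputer` offers comparable strategies.

```python
from statistics import mean, median, mode

def impute(values, strategy):
    """Replace None entries with a statistic of the observed values."""
    observed = [v for v in values if v is not None]
    statistic = {"mean": mean, "median": median, "most_frequent": mode}[strategy]
    fill = statistic(observed)
    return [fill if v is None else v for v in values]

ages = [20, None, 30, 40]                  # numerical feature (made up)
colors = ["red", "blue", None, "red"]      # categorical feature (made up)

ages_mean = impute(ages, "mean")                  # numerical: mean
ages_median = impute(ages, "median")              # numerical: median
colors_filled = impute(colors, "most_frequent")   # categorical: most frequent
colors_flagged = ["Missing" if c is None else c for c in colors]  # new category

# Either type: drop every example (row) that has a missing value
rows = list(zip(ages, colors))
complete_rows = [row for row in rows if None not in row]
```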

Data Preparation: Encoding
• One-hot encoding / dummy variables
- For each level of a categorical feature, we create a new binary variable
that takes the value 0 or 1. Here, 0 represents the absence and 1 represents
the presence of that category.
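A minimal sketch of one-hot encoding in plain Python follows. The `one_hot` helper and the color values are hypothetical; the column order (alphabetical) is an assumption made for reproducibility. In scikit-learn the corresponding class is `OneHotEncoder`.

```python
def one_hot(values):
    """Return the distinct levels and, for each value, a list of 0/1
    indicators with exactly one 1 marking that value's level."""
    levels = sorted(set(values))  # fixed, alphabetical column order
    encoded = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, encoded

colors = ["red", "blue", "green", "blue"]
levels, encoded = one_hot(colors)

for value, row in zip(colors, encoded):
    print(value, row)
```

Each row contains a single 1, in the column of the category that is present.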

Data Preparation: Encoding
• Label encoding / ordinal encoding
- We use this encoding technique when the categorical feature is ordinal. In
this case, retaining the order is important, so the encoding should reflect
the sequence (exam grades, days of the week, sizes).
- Ex: 'Degree': {'None': 0, 'High school': 1, 'Diploma': 2, 'Bachelors': 3, 'Masters': 4, 'phd': 5}
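Applying the mapping from the example is a dictionary lookup. The sketch below reuses the degree mapping from the slide; the list of sample degrees is made up for illustration.

```python
# Mapping taken from the slide: a higher number means a higher degree
degree_order = {"None": 0, "High school": 1, "Diploma": 2,
                "Bachelors": 3, "Masters": 4, "phd": 5}

degrees = ["Diploma", "phd", "None", "Bachelors"]   # sample column (made up)
encoded = [degree_order[d] for d in degrees]

print(encoded)
```

Because the integers follow the order of the categories, comparisons such as `degree_order["Masters"] > degree_order["Bachelors"]` remain meaningful after encoding.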

Data Preparation: Standardization
• Standardization rescales each feature using its mean and standard
deviation. Raw feature values can range from very low to very high, and such
differences in scale can hurt the performance of many models.
• After standardization, each feature has a mean of zero and a standard
deviation of one.
• The standardization formula is:
z = (feature_value - mean) / standard_deviation
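The formula can be checked with a short sketch. The `standardize` helper and the height values are hypothetical, and the population standard deviation (`pstdev`) is an assumption chosen here; it matches what scikit-learn's `StandardScaler` uses.

```python
from statistics import mean, pstdev

def standardize(values):
    """Apply z = (value - mean) / standard deviation to every value."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    return [(v - mu) / sigma for v in values]

heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]   # made-up feature values
z_scores = standardize(heights_cm)

print([round(z, 3) for z in z_scores])
print(round(mean(z_scores), 6), round(pstdev(z_scores), 6))
```

The second print confirms the property stated above: the standardized values have a mean of zero and a standard deviation of one, up to rounding.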

Model Training
• Classification:
- Logistic regression
- K-nearest neighbors
- Support vector classification (SVM)
- Naïve Bayes
• Regression:
- Linear regression with different regularization:
  - Lasso
  - Ridge
  - Elastic Net
Model Training
• Cross validation
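The slide names the technique without detail, so the following is a sketch of the common k-fold variant under stated assumptions: the data is split into k contiguous folds without shuffling, and each fold serves once as the validation set while the rest is used for training. The `k_fold_indices` helper is hypothetical; scikit-learn provides `KFold` and `cross_val_score` for this.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=6, k=3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

A model would be trained and scored once per fold, and the k scores averaged to estimate how well it generalizes.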

Model Evaluation
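• Underfitting: the model is too simple to capture the pattern in the data.
Remedies:
- Increasing the model complexity
- Reducing regularization
- Adding features to training data
• Overfitting: the model fits the training data too closely and generalizes
poorly. Remedies:
- Adding more data
- Data augmentation
- Regularization
- Removing features from data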

As an ML Engineer...
• Now, choose the right answers for these use cases:
https://forms.gle/fN2y2nRueviBf2JX6

Model Evaluation
• Confusion matrix
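For a binary classifier, the confusion matrix counts four outcomes: true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). The sketch below computes them in plain Python; the `confusion_counts` helper and the label vectors are hypothetical, with 1 assumed to be the positive class. scikit-learn exposes this as `confusion_matrix`.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN and TN for a binary classification problem."""
    pairs = list(zip(y_true, y_pred))
    return {
        "TP": sum(t == positive and p == positive for t, p in pairs),
        "FP": sum(t != positive and p == positive for t, p in pairs),
        "FN": sum(t == positive and p != positive for t, p in pairs),
        "TN": sum(t != positive and p != positive for t, p in pairs),
    }

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual labels (made up)
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]   # model predictions (made up)

counts = confusion_counts(y_true, y_pred)
print(counts)
```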

Model Evaluation
• Precision is the fraction of the examples predicted as positive that are
actually positive: precision = TP / (TP + FP), where TP and FP are the numbers
of true and false positives. It matters in music or video recommendation
systems, e-commerce websites, etc., where wrong results could lead to
customer churn and harm the business.
• It indicates how reliable the model’s positive predictions are. It is useful
when a false positive is a greater concern than a false negative.
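A minimal sketch of the computation, assuming the standard definition precision = TP / (TP + FP). The `precision` helper and the label vectors are hypothetical; returning 0.0 when nothing is predicted positive is a convention chosen here. scikit-learn's equivalent is `precision_score`.

```python
def precision(y_true, y_pred, positive=1):
    """Share of predicted positives that are actually positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    # Convention (assumption): no positive predictions gives 0.0
    return tp / (tp + fp) if (tp + fp) else 0.0

y_true = [1, 0, 1, 1, 0, 0]   # actual labels (made up)
y_pred = [1, 1, 1, 0, 0, 1]   # predictions: 4 positives, 2 of them correct

p = precision(y_true, y_pred)
print(p)
```

Here the model flags four examples as positive and only two are right, so half of its positive predictions can be trusted.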

Model Evaluation
• Recall describes how many of the actual positive examples the model
predicted correctly: recall = TP / (TP + FN), where TP is the number of true
positives and FN the number of false negatives.
• Recall (sensitivity) is a useful metric when a false negative is more costly
than a false positive. It is important in medical cases, where raising a false
alarm is acceptable but actual positive cases should not go undetected!
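A minimal sketch of the computation, assuming the standard definition recall = TP / (TP + FN). The `recall` helper and the label vectors are hypothetical; returning 0.0 when there are no actual positives is a convention chosen here. scikit-learn's equivalent is `recall_score`.

```python
def recall(y_true, y_pred, positive=1):
    """Share of actual positives that the model detected."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    # Convention (assumption): no actual positives gives 0.0
    return tp / (tp + fn) if (tp + fn) else 0.0

y_true = [1, 1, 1, 1, 0, 0]   # 4 actual positives (made up)
y_pred = [1, 1, 1, 0, 0, 1]   # the model misses one of them

r = recall(y_true, y_pred)
print(r)
```

The false alarm on the last example does not lower recall; only the missed positive does.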

Model Evaluation
• Increasing precision typically decreases recall and vice versa; this is
known as the precision/recall tradeoff.
• When one model has low precision and high recall and another has the
reverse, it becomes hard to compare them. To solve this issue we can use the
F-score, which combines both into a single number.
• For a given average of precision and recall, the F-score is highest when the
two are equal. It can be calculated using the formula below:
F1 = 2 * (precision * recall) / (precision + recall)
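A short sketch, assuming the standard F1 definition (the harmonic mean of precision and recall): F1 = 2 * precision * recall / (precision + recall). The `f_score` helper and the two hypothetical models are made up for illustration; scikit-learn's equivalent, computed from labels, is `f1_score`.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0  # convention (assumption) when both are zero
    return 2 * precision * recall / (precision + recall)

# Two hypothetical models with the same average of precision and recall (0.5)
balanced = f_score(0.5, 0.5)   # precision equals recall
lopsided = f_score(0.9, 0.1)   # high precision, low recall

print(balanced, round(lopsided, 2))
```

Both models average 0.5 on the two metrics, yet the balanced one scores far higher, which is why the F-score penalizes a model that does well on only one of them.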

References
• Best Competitions for Beginners - Kaggle
https://www.kaggle.com/getting-started/78482
• The Hundred-Page Machine Learning Book
• Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow:
Concepts, Tools, and Techniques to Build Intelligent Systems
