Machine Learning Essentials
Part 1: Basic algorithms
Lior King
Lior.King@gmail.com
1
Agenda
• Introduction to Machine Learning (ML)
• What is Machine Learning?
• The problems we can solve using ML.
• The learning process
• Basic ML Algorithms using Python and Scikit-learn library
• Linear Regression
• Naïve Bayes
• K-Means
• Artificial Neural Networks (ANN) and Deep Learning (DL)
using the TensorFlow library
• Single-layered ANN (using an MNIST demo).
• Deep Learning (DL) with Multi-layered Neural Networks
• DL example: Convolutional Neural Network (CNN).
2
An astronaut lands on an alien planet
3
The astronaut’s dilemma
4
Male or Female?
An alien lands on earth
5
6
Male or Female?
The alien’s dilemma
Gender recognition algorithm for the alien
If the height is > 180 cm
and/or
weight > 75 kg
or
has a beard
or
has short hair
or
has a deep voice
or
is bald
or…
7
There might be exceptions…
The rule-based approach
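Such a rule set maps directly onto ordinary code. A minimal sketch in Python (the thresholds and attribute names are illustrative assumptions, echoing the slide):

```python
def guess_gender(person: dict) -> str:
    """Hand-written rules: the thresholds are guessed, not learned from data."""
    if person["height_cm"] > 180 or person["weight_kg"] > 75:
        return "male"
    if person["has_beard"] or person["has_deep_voice"] or person["is_bald"]:
        return "male"
    if person["has_short_hair"]:
        return "male"
    return "female"

# There might be exceptions: a tall, short-haired woman is misclassified.
print(guess_gender({"height_cm": 183, "weight_kg": 60, "has_beard": False,
                    "has_deep_voice": False, "is_bald": False,
                    "has_short_hair": True}))
```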
The learning approach
• We show the alien 500 humans and tell it which ones are male and
which are female.
• The alien will find the characteristics that differentiate males
and females – by identifying repeated patterns.
• The alien needs to be exposed to a lot of humans to identify
repeated patterns. That is how it gains EXPERIENCE.
8
How can a computer learn?
Experience = Data
9
What is the machine learning approach?
• An alternative to the rule-based approach
• Based on a lot of data.
• Implements a pre-determined model that uses a standard algorithm to
find DATA CORRELATIONS.
10
Some ML use cases
• Self driving cars and auto pilots
• Cortana, Siri, Google Assistant (NLP – Natural Language Processing)
• Recommendations - Netflix and Amazon know what you like
• Data security
• Healthcare – Computer Assisted Diagnosis
• Spam detection
• Fraud detection
• Algo-trading
• IoT
11
AI vs. ML vs. DL
12
When to use machine learning?
•When it is difficult for humans to express rules
•Too many variables
•Difficult to understand relationships
•When there is a large amount of available historical data
•When the relationships and patterns between data items are dynamic and
keep changing
13
ML is not new - why is it so hot these days?
Problem 1: ML usually requires a lot of data
Solution: We are in the “big data” era.
Problem 2: ML requires a lot of computations.
Solutions:
• CPUs have become very fast
• GPUs can be harnessed to multiply that speed.
• The cloud enables you to build a computing grid quickly
and cheaply.
Problem 3: ML is complex and difficult
Solution: Available free open source libraries and tools
14
How to use machine learning?
1. Define the problem you wish to solve – ask the right question.
2. Prepare the data - make sure you use relevant data which is represented with
meaningful numbers
3. Choose the right algorithm.
4. Use the algorithm to train a model with training data.
5. Test the model to see if it is correct enough.
15
[Diagram: Define the problem → Represent data with numbers → Choose the algorithm → Train the model → Test the model]
Ongoing learning
16
[Diagram: historical data, rules, new data, retraining, deploying, use cases]
17
Machine learning - problem categories
18
Supervised
Unsupervised
Reinforcement
Machine Learning Categories
•Supervised machine learning:
• The program is “trained” on a pre-defined set of “training examples”
• Can reach a pretty accurate conclusion when given new data.
•Unsupervised machine learning:
• The program is given a bunch of data and must find patterns and relationships
therein.
•Reinforcement machine learning:
• The program is given just an “environment” and a “reward” function for successful
actions – without exact instructions on what to do.
• The program finds a set of actions that will grant it maximum total “rewards”.
19
Machine learning - problem categories
20
Classification
Regression
(Prediction)
Clustering
(grouping)
Classification
• “Yes or No” choices:
• Does the patient have cancer?
• Is this email SPAM?
• Is this credit card transaction a fraud?
• Is the stock market going up or down?
• A discrete number of choices:
• Determine age group: 0-18, 18-35, 35-60, 60+
• Recognize handwritten characters – a, b, c, d …
• Customer sentiment analysis – very positive/slightly positive/neutral/slightly negative/strongly negative.
Classification requires training data
21
Classification
(discrete number)
Regression
•Regression – for Predictions or forecasts
• What will be the value of MSFT stock tomorrow?
• How much will we sell in the next quarter?
• How many bugs will we need to fix?
• How long will it take to commute from A to B?
• Outputs a continuous value – a float
• Requires training data
22
Regression
(Prediction)
Clustering
• Clustering is grouping data items into groups
• Customer segmentation
• Pattern recognition and image analysis
• Bioinformatics
• Training data is not required (unsupervised).
23
Clustering
The set of rules known as a MODEL
MODEL = A quantitative representation of relationships between variables.
• Can be a mathematical equation
Or
• A set of if-then-else statements created dynamically.
Example:
A spam filtering model represents the relationship between the text of an email and
whether it is spam or not.
26
Model = Function
Model: f(X1, X2, … , Xn)
[Diagram: data attributes (numbers) are the inputs X1…Xn; the outcome (a number) is the output]
27
The Goal
To find the best model (function) that
produces the desired result on any set
of inputs
28
Supervised Learning
29
Supervised learning – the training process
30
[Diagram: prepared data is split into training data and test data; the algorithm trains a model on the training data; the model is then tested on the test data and judged good or bad]
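A minimal sketch of this split/train/test flow in scikit-learn (the Iris dataset, the 80/20 split, and the choice of classifier are assumptions for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)                # prepared data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)        # splitting the data

model = GaussianNB()
model.fit(X_train, y_train)                      # training a model

accuracy = model.score(X_test, y_test)           # testing the model
print(f"Test accuracy: {accuracy:.2f}")          # good or bad?
```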
31
Basic ML Algorithms
and how to use them with Python
Most common ML algorithms
• Prediction:
• Linear Regression
• Polynomial Regression
• Decision Tree Regression
• Random Forest Regression
• Support Vector Regression (SVR)
• Classification:
• Naïve Bayes
• Logistic Regression
• Decision Tree Classification
• Random Forest Classification
• Support Vector Machines (SVM)
• K-Nearest Neighbors classification (K-NN)
32
• Clustering:
• K-Means
• Hierarchical clustering
• Artificial Neural Networks:
• Convolutional Neural Network (CNN)
• Recurrent Neural Network (RNN)
Some Other Algorithms
• Enhanced algorithms:
• Variations of basic algorithms
• Enhanced to perform better and/or add more functionality.
• Complex to understand and use properly
• Ensemble algorithms:
• Special algorithms that contain/combine multiple algorithms under one interface
• Used when you need to tune the model to increase performance
• Can be complex and difficult to debug and troubleshoot.
33
Regression problems
34
Regression
(Prediction)
Common ML algorithms for regression
• Linear Regression
• Polynomial Regression
• Decision Trees
• Random Forest
35
Regression
(Prediction)
Regression examples
• What will the stock returns be?
• What will be the sales of a product next week?
• If a flight is delayed, how does this affect customer satisfaction?
• If I change my investment portfolio, how will it affect my risk?
• How much will I get for my house?
36
Linear regression
Finding the relation between age and salary;
predicting the salary for any given age.
[Chart: historical data points, Experience vs. Salary]
38
Minimize the error
The error (or residual) is the offset of
the observed dependent value from the
value predicted by the regression line.
The goal of any regression is to minimize
the error for the training data and to
FIND THE OPTIMAL LINE (or curve, in the
case of logistic regression).
[Chart: Experience (independent) vs. Salary (dependent), with the error shown as the vertical offset of each historical data point from the line]
39
Minimize the error – sum of squared differences
The error = Σᵢ₌₁ᴺ (yᵢ − ŷᵢ)²
[Chart: each historical data point’s actual value y versus the predicted value ŷ on the line; the error is the vertical gap between them. Axes: Experience vs. Salary (dependent).]
40
Minimize the error with Stochastic Gradient Descent (SGD)
Error = (1/N) · Σᵢ₌₁ᴺ (yᵢ − ŷᵢ)²   (N = number of historical data points)
1. Initialize some value for the slope and intercept.
2. Find the current value of the error function.
3. Find the slope at the current point (partial derivative) and move slightly downwards in that direction.
4. Repeat until you reach a minimum, or stop after a certain number of iterations.
[Chart: the error surface as a function of slope and intercept]
41
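A minimal numpy sketch of this loop for a straight line (the toy data, learning rate, and iteration count are assumptions; strictly speaking, updating on all points at once as below is batch gradient descent, while SGD updates on randomly sampled points; in practice scikit-learn's SGDRegressor does this for you):

```python
import numpy as np

# Toy historical data: experience (years) vs. salary (thousands) - invented
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([40.0, 50.0, 55.0, 70.0, 75.0])

slope, intercept = 0.0, 0.0               # 1. initialize slope and intercept
lr = 0.01                                 # learning rate (assumed)

for _ in range(5000):                     # 4. repeat / stop after N iterations
    y_pred = slope * x + intercept
    error = np.mean((y - y_pred) ** 2)    # 2. current value of the error
    # 3. partial derivatives of the error w.r.t. slope and intercept
    d_slope = -2 * np.mean(x * (y - y_pred))
    d_intercept = -2 * np.mean(y - y_pred)
    slope -= lr * d_slope                 # move slightly downwards
    intercept -= lr * d_intercept

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```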
Minimize the error
[Chart: Experience vs. Salary (dependent), historical data points with the fitted line]
The iterative SGD process will slowly
change the slope and the intercept until
the error is minimal.
42
Multiple Linear Regression
• Simple linear regression:
𝑌 = 𝑏0 + 𝑏1*𝑥1
• Multiple linear regression:
𝑌 = 𝑏0 + 𝑏1*𝑥1 + 𝑏2*𝑥2 + … + 𝑏n*𝑥n
Important note:
You need to exclude variables that will “mess up” the prediction and keep the
ones that actually help predict the desired result.
43
Polynomial Linear Regression
44
Simple linear regression:
𝑌 = 𝑏0 + 𝑏1*𝑥1
Polynomial linear regression:
𝑌 = 𝑏0 + 𝑏1*𝑥1 + 𝑏2*𝑥1² + … + 𝑏n*𝑥1ⁿ
Quadratic: degree = 2
Cubic: degree = 3
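A sketch with scikit-learn: PolynomialFeatures expands x1 into its powers and an ordinary linear model is fitted on top (the degree and the toy data are assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 4.3, 9.2, 16.8, 24.9])        # roughly quadratic (invented)

poly = PolynomialFeatures(degree=2)              # quadratic: degree = 2
x_poly = poly.fit_transform(x)                   # columns: 1, x1, x1^2

model = LinearRegression().fit(x_poly, y)        # still linear in b0..bn
print(model.intercept_, model.coef_)
print(model.predict(poly.transform([[6.0]])))    # prediction for x1 = 6
```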
Why Python?
• Fast learning curve
• Combines the power of a general-purpose language with ease of use.
• Everything you need for ML: Libraries for data loading, visualization, statistics, natural
language processing, image processing, and more:
• numpy, scipy
• scikit-learn
• matplotlib
• pandas
• tensorflow, pytorch, GraphLab
• A lot of free IDEs and interactive tools (like Spyder, PyCharm, VS Code, and more)
• Allows for the creation of complex graphical user interfaces (GUIs)
• Easy integration into existing systems and web services.
45
Python becomes the leader for ML
46
* KDnuggets is a leading news site
on Business Analytics, Big Data, Data Mining,
Data Science, and Machine Learning
Python’s Scikit-learn library
• Makes it easier to perform training and evaluation tasks:
• Splitting the data into training and test sets.
• Pre-processing the data before we train with it.
• Selecting the important features.
• Model training
• Model tuning for better performance
• Provides a common interface for accessing algorithms
• Based on widely used mathematical libraries such as NumPy, SciPy, Matplotlib
47
Regression Demo
48
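The live demo itself is not reproduced here; a minimal stand-in with scikit-learn's LinearRegression (the experience/salary numbers are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Historical data points: years of experience -> salary (invented)
experience = np.array([[1], [2], [3], [5], [8], [10]])
salary = np.array([45_000, 50_000, 60_000, 80_000, 105_000, 120_000])

model = LinearRegression().fit(experience, salary)
print(model.coef_[0], model.intercept_)    # slope and intercept of the line
print(model.predict([[7]]))                # predicted salary for 7 years
```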
Classification problems
49
Classification
(Yes/No or a
discrete number)
Common ML algorithms for classification
• Naïve Bayes
• Logistic Regression
• Support Vector Machines
• Decision Trees
• K-Nearest Neighbors
50
Classification
(Yes/No or a
discrete number)
Classification examples
• Gender detection:
• Using the first name, length, prefix/suffix, ends with a vowel?
• Age group detection:
• Using user selections and preferences
• Sentiment Analysis – Positive or Negative (polarity identification)
• Using a large bank of tweets and posts – unstructured and complicated.
• Trading stocks/derivatives – Up day or Down day?
• Using week day, month, prices in previous days, prices of related stocks.
51
Classification examples
• Detecting Ads – Is the image an Ad or not an Ad?
• Using the Image URL, page URL, Page text, Image caption and so on…
• Customer Churn – Is the customer about to quit?
• Using: purchases, days since the last purchase, geo location etc…
• Fraud detection – a Fraud or not a Fraud
• Using: payment type, location, failed attempts history, frequency of use
• Credit risk – Will the customer default on a loan?
• Using: Income, employment sector, education, history of defaults
52
Sentiment Analysis Classification
The goal is to classify an unknown review as positive or negative.
53
Movie review → Classification
“The movie was pretty good” → Positive
“It was boring. I almost fell asleep” → Negative
“We had a great evening” → Positive
“The leading actor really sucked” → Negative
…
“It is the worst film ever” → Negative
Naïve Bayes
P(c | x) = [P(x1 | c) · P(x2 | c) · … · P(xn | c) · P(c)] / P(x)
“A great movie” – is it a positive review?
54
The parts of the formula: P(xi | c) is the likelihood, P(c) the prior probability, P(c | x) the posterior probability, and P(x) the marginal likelihood.
P(Positive | “Great Movie”) =
[P(“Great” | Positive) · P(“Movie” | Positive) · P(Positive)] / [P(“Great”) · P(“Movie”)]
Prior probability – What is the probability of a positive review?
Likelihood – What is the probability of finding the word X in a positive review?
Marginal likelihood – What is the probability of the word in the entire set (positive & negative)?
Posterior probability – What is the probability that the word X indicates a positive review?
Naïve Bayes
P(Positive | “A great movie”) =
[P(“Great” | Positive) · P(“Movie” | Positive) · P(Positive)] / P(X)
P(Negative | “A great movie”) =
[P(“Great” | Negative) · P(“Movie” | Negative) · P(Negative)] / P(X)
Compare: P(Positive | “A great movie”) >?< P(Negative | “A great movie”)
55
Naïve Bayes algorithm
1. Extract every word (get rid of words like the/is/a, etc.).
2. Calculate the probability of each word in positive comments:
PPos(“word”) = Sum(freq. of “word” in positive comments) / Sum(freq. of “word” in the entire set)
3. For every sentence, calculate PPos and PNeg:
PPos(sentence) = PPos(word1) · PPos(word2) · … · PPos(All)
PNeg(sentence) = (1 − PPos(word1)) · (1 − PPos(word2)) · … · (1 − PPos(All))
4. Compare PPos(sentence) and PNeg(sentence).
56
Movie review → Classification
“The movie was pretty good” → Positive
“It was boring. I almost fell asleep” → Negative
“We had a great evening” → Positive
“The leading actor really sucked” → Negative
…
“It is the worst film ever” → Negative

Word → PPos(word)
Great → 95%
Boring → 10%
Movie → 50%
Worst → 10%
…
For the entire set: 55% positive → PPos(All)

PPos(“The movie is great”) = 0.5 * 0.95 * 0.55 = 0.261
PNeg(“The movie is great”) = (1 − 0.5) * (1 − 0.95) * (1 − 0.55) = 0.011
0.261 > 0.011 → Positive
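The worked example above as a short Python sketch (the word probabilities are the illustrative values from the table; a real implementation would also smooth zero counts and multiply in log-space to avoid underflow):

```python
p_pos_word = {"great": 0.95, "boring": 0.10, "movie": 0.50, "worst": 0.10}
p_pos_all = 0.55                          # 55% of the training set is positive

def classify(sentence: str) -> str:
    words = [w for w in sentence.lower().split() if w in p_pos_word]
    p_pos, p_neg = p_pos_all, 1 - p_pos_all
    for w in words:
        p_pos *= p_pos_word[w]            # PPos(word1) * PPos(word2) * ...
        p_neg *= 1 - p_pos_word[w]        # (1 - PPos(word1)) * ...
    return "Positive" if p_pos > p_neg else "Negative"

print(classify("The movie is great"))     # 0.261 > 0.011 -> Positive
```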
Naïve Bayes – Continuous values (Gaussian)
Features: Age, Salary
[Chart: Age vs. Salary scatter plot]
Blue circle = did not purchase = 40
Red cross = purchased = 30
Around X: did not purchase = 15, purchased = 10
The chance that X will purchase =
(the chance that the customers around X purchased) * (the chance of purchasing in general) / (the chance of a customer being around X)
= (# of purchases around X / total purchases) * (total purchases / total customers) / (total customers around X / total customers)
P(purchase | x) = [P(x | purchase) * P(purchase)] / P(x)
= (10/30 * 30/70) / (25/70) = 0.4 = 40%
Assuming a normal distribution around X.
57
Naïve Bayes Algorithm
• Each attribute (in our case – a word) is used independently (hence the term “naïve”).
• Phrases like “far out” are not taken into consideration.
• Simple to understand
• Fast training
• Stable – insensitive to small changes in the training data
• Can be very robust for solving many classification problems – especially when:
• There is a small amount of training data
• You don’t have a lot of knowledge about the problem itself
58
Naïve Bayes Demo
(Gaussian)
59
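The live demo is not reproduced here; a minimal stand-in with scikit-learn's GaussianNB (the age/salary purchase data is invented for illustration):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Features: [age, salary]; label: 1 = purchased, 0 = did not purchase (invented)
X = np.array([[22, 30_000], [25, 35_000], [47, 90_000], [52, 110_000],
              [46, 85_000], [56, 120_000], [23, 28_000], [51, 95_000]])
y = np.array([0, 0, 1, 1, 1, 1, 0, 1])

model = GaussianNB().fit(X, y)             # fits a normal distribution per class
print(model.predict([[30, 40_000]]))       # predicted class for a new customer
print(model.predict_proba([[30, 40_000]])) # [P(no purchase), P(purchase)]
```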
Logistic Regression
60
K-NN (K Nearest Neighbors)
61
Clustering problems
62
Clustering
Common ML algorithms for clustering
• K-Means
• Fuzzy clustering
• Hierarchical clustering
• Density-based clustering
• Distribution-based clustering
63
Clustering
Clustering use case example
• A cellular company needs to put antennas in a region so that its users receive
the best possible signal
• Locating police stations so they can arrive quickly in areas with high crime rates.
• Identify important product features from customer feedback, emails, etc.
• Perform efficient data compression
64
Reviews theme clustering
• We need to represent each review with numeric attributes.
• In this example we’ll use a technique called “Term Frequency Representation” (TFR).
Sample: “With tears in my eyes”
All words: (movie, good, bad, with, boring, tears, yesterday, my, eyes)
(0, 0, 0, 1, 0, 1, 0, 1, 1)
We represent each review using word frequencies.
Some words characterize a document more than others: “With tears in my eyes”.
These words usually occur more rarely and differentiate the review from the others.
65
Reviews theme clustering
Some words characterize a document more than others: “With tears in my eyes”.
These words usually occur more rarely and differentiate the review from the others.
“With tears in my eyes” → (with: common, tears: rare, in: common, my: common, eyes: rare)
We now weight the word frequencies so that the rare words stand out and the
common words get minimal weight.
New weight = 1 / frequency of the word
This representation is called Term Frequency – Inverse Document Frequency (TF-IDF)
66
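A sketch of this weighting with scikit-learn's TfidfVectorizer (the reviews are illustrative; the library uses a smoothed logarithmic IDF rather than the simple 1/frequency above, but the idea is the same):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = ["With tears in my eyes",
           "The movie was pretty good",
           "It was boring, I almost fell asleep"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(reviews)      # one weighted vector per review

print(vectorizer.get_feature_names_out())  # the vocabulary
print(X.toarray()[0])                      # TF-IDF weights of the first review
```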
K-Means algorithm
• Every review is a tuple of N numbers:
(0, 3, 0, 4, 0, …)
• So every review is a point in an
N-dimensional hypercube.
• With the K-Means algorithm you define K –
the number of clusters you want the
data to converge into.
1. Initialize the mean points (also called
“centroids”).
67
K-Means iteration 1
2. Assign each review (point) to the
nearest centroid.
3. Look at each cluster and find a
new centroid for the cluster.
4. Repeat 2,3 until the means stop
changing.
68
K-Means iteration 2
2. Assign each review (point) to the
nearest centroid.
3. Look at each cluster and find a
new centroid for the cluster.
4. Repeat 2,3 until the means stop
changing.
69
K-Means iteration 3
2. Assign each review (point) to the
nearest centroid.
3. Look at each cluster and find a
new centroid for the cluster.
4. Repeat 2,3 until the means stop
changing.
70
K-Means iteration 4
2. Assign each review (point) to the
nearest centroid.
3. Look at each cluster and find a new
centroid for the cluster.
4. Repeat 2,3 until the means stop
changing.
71
K-Means iteration 5
2. Assign each review (point) to the
nearest centroid.
3. Look at each cluster and find a new
centroid for the cluster.
4. Repeat 2,3 until the means stop
changing.
72
K-Means iteration 6
2. Assign each review (point) to the
nearest centroid.
3. Look at each cluster and find a new
centroid for the cluster.
4. Repeat 2,3 until the means stop
changing.
73
K-Means algorithm
1. Initialize the mean points
(also called “centroids”).
2. Assign each review
(point) to the nearest
centroid.
3. Look at each cluster and
find a new centroid for
the cluster.
4. Repeat 2,3 until the
means stop changing.
74
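A minimal sketch of these steps with scikit-learn (K = 2 and the toy 2-D points are assumptions; in the review example each point would be an N-dimensional TF-IDF vector):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D points standing in for the N-dimensional review vectors (invented)
points = np.array([[1, 2], [1, 4], [0, 3],      # one apparent group
                   [8, 8], [9, 10], [10, 9]])   # another apparent group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)                      # cluster assignment of each point
print(kmeans.cluster_centers_)             # the final centroids
```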
Clustering vs. Classification
• Classification – when you want to classify data into pre-defined categories.
• Clustering – grouping data into a set of categories that is NOT known beforehand.
• We can mix them both (see the sketch below):
• Start with clustering the data
• Then train a model to recognize each cluster.
• Use the classification model to classify new data.
78
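A minimal sketch of this mix (the data is invented; we cluster first, train a classifier on the cluster labels, then classify new data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

X = np.random.RandomState(0).rand(100, 2)   # invented, unlabeled data

# 1. Start with clustering the data
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# 2. Train a classifier to recognize each cluster
classifier = KNeighborsClassifier().fit(X, clusters)

# 3. Use the classification model to classify new data
print(classifier.predict(np.array([[0.5, 0.5]])))
```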
79
Thank you!
Lior.King@gmail.com