INTRODUCTION
TO
MACHINE LEARNING
- Sathvik N U
Where does AI fit in?
What is Machine Learning?
Concepts of Machine Learning
Steps to build a predictive model
INTRODUCTION
TO
MACHINE LEARNING
• The industrial revolution in manufacturing changed the way we think about and work in industry.
• Industrial Stages:
• Industry 1.0 (water and steam power) – 1760s
• Industry 2.0 (the Technological Revolution) – 1840s
• Industry 3.0 (Information Technology) – 1970s
• Industry 4.0 – Present
• With the connectivity produced in Industry 3.0, disruptive automation technology arose; now, in Industry 4.0, smart machines are being built that communicate among themselves to achieve the target.
ARTIFICIAL INTELLIGENCE
Ability of a machine to
imitate intelligent human
behavior
Hierarchy of Artificial Intelligence
Industries where AI is used at full scale
• Retail
• Healthcare
• Manufacturing
• Banking
ARTIFICIAL INTELLIGENCE
Long Term Memory in Human Beings
Long-term memory refers to the storage of information over an extended period. If you can remember something that happened more than just a few moments ago, whether it occurred just hours ago or decades earlier, then it is a long-term memory.
Types of Long-Term Memories
• Explicit Memory
  • Semantic
  • Episodic
• Implicit Memory
  • Procedural
  • Emotional Conditioning
How do you fight
corona?
By staying at home
To learn more about how computers grasp these memories, follow the link.
UNDERSTANDING ARTIFICIAL INTELLIGENCE LANDSCAPE
IS ARTIFICIAL INTELLIGENCE HARMFUL?
“INSPIRATIONAL” QUOTES GENERATED BY ARTIFICIAL INTELLIGENCE
Machine Learning
Subset of AI which uses statistical and probabilistic methods to enable machines to improve with experience
INTRODUCTION
TO
MACHINE LEARNING
CERTAIN PRE-REQUISITES FOR MACHINE LEARNING
• Statistics
• Probability and Queuing Theory
• Linear Algebra
• Calculus
• Basic coding skills
• Patience!!
Of the above pre-requisites, the most important would be patience, as the machine learning world is almost infinite. It is ever-evolving, and you can never conquer it, but you can master it.
A wise man once said, “I think the target of anything in life should be to do it so well that it becomes an art.” I firmly believe that applies to machine learning. After all, it is mathematics, guys!
A statistical model is all about getting a simple formulation of a frontier in a classification problem. Here we see a nonlinear boundary which, to some extent, separates risky people from non-risky people.
When we see the contours generated by a Machine Learning algorithm, we witness that, for the problem at hand, statistical modeling is in no way comparable to the Machine Learning algorithm. The contours of machine learning seem to capture all patterns, beyond any constraints of linearity or even continuity of the boundaries. This is what Machine Learning can do for you.
HOW IS MACHINE LEARNING DIFFERENT FROM STATISTICS?
Machine Learning algorithms are used in the recommendation engines of Google, Facebook, Amazon, etc., which can churn through trillions of data points in a second to produce nearly perfect recommendations. Even with a 16 GB RAM laptop, I daily work on datasets of millions of rows with thousands of parameters and build an entire model in no more than 30 minutes. A statistical model, on the other hand, needs a supercomputer to run a million observations with a thousand parameters.
Machine Learning
Examples use cases of
Machine Learning
• Intelligent Navigation
• News Recommendation
• Spam Detection
• Google Assistant
• Google Ads
• Product Recommendation
• Ad Delivery
• Optimizing Supply Chain
• Personalized Recommendation Engine
• Streaming quality by predicting bandwidth
• Personalized Ads
• Friend Suggestion
• Face Tagging
Machine Learning
How industries have capitalized on Machine Learning
Machine Learning Framework
Types of Machine Learning
Machine Learning Algorithms
Algorithms try to find a mathematical relationship f(x) between the inputs (x) and the output (y) by going through the whole dataset one row at a time, adjusting the equation and the coefficients at every step to accommodate all the rows covered up to that point. This process, through which learning happens, is called ‘Training’ or ‘Model Fitting’.
When the model gets trained on the historical data, it generates coefficients called model weights, which are used for predictions. The trained models and weights are stored as files and can be reused whenever unseen data with the same features used in training needs to be predicted.
The trained model from the previous step is tested on unseen data (data which was not part of the data used to train the model) for which the actual output is known. The predicted outputs are compared to the actual outputs, and accuracy scores are calculated. These scores act as an evaluation metric for the model.
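For concreteness, here is a minimal sketch of the train / store / evaluate cycle described above, using scikit-learn; the synthetic dataset, the logistic-regression model, and the file name are illustrative assumptions, not part of the slides.

```python
# Sketch of the train -> save -> evaluate-on-unseen-data cycle described above.
# Assumptions: scikit-learn is available; the dataset and file name are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import joblib

# Historical data: features X and known outputs y
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Hold out unseen data whose actual output is known, for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 'Training' / 'Model Fitting': the coefficients (model weights) are learned here
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The trained model (with its weights) is stored as a file for later reuse
joblib.dump(model, "model.joblib")

# Later: reload the file and predict on unseen data with the same features
reloaded = joblib.load("model.joblib")
y_pred = reloaded.predict(X_test)

# Compare predicted outputs to actual outputs -> evaluation metric
print("Accuracy:", accuracy_score(y_test, y_pred))
```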
Machine Learning Algorithms
Algorithms try to mimic a human brain by finding a mathematical relationship between the inputs and the output
Algebraic Models
Probabilistic Models (Conditional)
Tree Models
And Other Models...
Unsupervised Learning
Since the examples given to the learner are unlabelled, there is no error or reward signal to evaluate a potential solution.
This distinguishes unsupervised learning from supervised learning and reinforcement learning.
Patterns in the data
• Which value(s) occur most frequently?
• How much does the data vary?
• How symmetrically does the data vary around the center?
• Is the data clustered around value(s)?
• Sub-spaces where the data is “concentrated”
It is the task of inferring a function to describe hidden structure or patterns from unlabeled data.
Unsupervised
Unsupervised Learning
Association Rules
• Find features (dimensions) which occur together
• Find features (dimensions) which are “correlated”
Summary Statistics
Patterns that can be found in the data:
• Median
• Variance, Standard Deviation, Skewness, Kurtosis
• Mode
Clustering
• Find data elements which are similar
• Find “areas” in space where data is concentrated
Multiple Kernel Learning
• Are two features correlated?
• Are two dimensions correlated?
• Combines data from different sources
Dimensionality Reduction
• Find smaller-dimensional representations of the data which preserve its essential structure.
• Find subspaces where data varies the most.
ASSOCIATION RULE MINING
The value of one feature tells us the value of another feature
▪ People who buy diapers are likely to buy baby powder
▪ If (people buy diapers), then (they buy baby powder)
<!> Watch the directionality <!>
(A ➔ B does not mean B ➔ A)
In A ➔ B, A is the antecedent and B is the consequent
Applications
▪ Market Basket Analysis
▪ Medical Diagnosis
▪ Census Data
▪ Protein Sequences
▪ Cross Marketing
▪ Catalog Design
Association Rules
• Are statements about relations among features (attributes), across elements (tuples)
• Use a Transaction Itemset Data Model
TRANSACTION ITEMSET DATA MODEL
ASSOCIATION RULE MINING
▪ Market Basket Analysis is the most common use, where each basket is a row and each item is a column
▪ It is not the only use case. It can be used on any dataset where the features take only two values, 0/1
▪ It can work on any dataset where features can be transformed to take only two values, 0/1, by discretization and feature selection
▪ There are a certain few measures of effectiveness that help us find relations between features in a set of transaction-itemset data points
TRANSACTION ITEMSET DATA MODEL
Measures of Effectiveness
• Support
• Confidence
• Lift: the ratio of the observed support to that expected if X and Y were independent
• Others: Affinity, Leverage
ASSOCIATION RULE MINING
Support(A) = fraction of transactions that contain itemset A
Confidence(A ➔ B) = Support(A ∪ B) / Support(A): of the transactions containing A, the fraction that also contain B
TRANSACTION ITEMSET DATA MODEL
• {Diaper, Beer} ➔ Milk
• Support = 2/5, Confidence = 2/3
• {Milk} ➔ {Diaper, Beer}
• Support = 2/5, Confidence = 2/4
• {Milk, Diaper} ➔ Bread
• Support = 2/5, Confidence = 2/3
• {Milk, Beer} ➔ Diaper?
Is Confidence = 1?
• Caution : Diaper is very popular!
• Does the inclusion of {Milk, Beer} increase the probability of Diaper?
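These supports and confidences are consistent with the classic five-transaction market-basket table, so a short sketch can recompute them; the table below is an assumption reconstructed from the quoted numbers, not taken verbatim from the slides.

```python
# Recomputing support, confidence and lift for the rules above.
# The five transactions are the classic market-basket example; the quoted
# supports/confidences (2/5, 2/3, ...) are consistent with this table.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
N = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / N

def confidence(antecedent, consequent):
    """P(consequent | antecedent) = support(A u B) / support(A)."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Observed support over the support expected under independence."""
    return confidence(antecedent, consequent) / support(consequent)

print(support({"Diaper", "Beer", "Milk"}))        # 0.4  (= 2/5)
print(confidence({"Diaper", "Beer"}, {"Milk"}))   # 0.666... (= 2/3)
print(confidence({"Milk", "Beer"}, {"Diaper"}))   # 1.0 -- but Diaper is popular!
print(lift({"Milk", "Beer"}, {"Diaper"}))         # 1.25 -> only a mild increase
```

The last two lines answer the caution above: confidence alone says 1.0, but because Diaper appears in 4 of 5 baskets anyway, the lift is only 1.25.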
LIFT
• If the lift is > 1, it tells us the degree to which those two occurrences depend on one another, and makes those rules potentially useful for predicting the consequent in future data sets. Positively correlated.
• If the lift is < 1, it tells us the items are substitutes for each other. This means that the presence of one item has a negative effect on the presence of the other item, and vice versa. Negatively correlated.
• If the lift is = 1, then the two occurrences are not correlated.
APRIORI ALGORITHM - ASSOCIATION RULE MINING
Key Idea
• If {a,b,c} is frequent, {a,b} must be frequent
• Downward closure a.k.a. anti-monotonicity
Algorithm
• Find all frequent 1-itemsets (frequent ➔ support above the threshold)
• Find all frequent 2-itemsets from the filtered 1-itemsets
• Find all frequent 3-itemsets from the filtered 2-itemsets
• …
Salient Features
• Exploits downward closure to optimize search
• Lower Support ➔ Higher computational complexity
• Confidence, Lift as post-processing filters
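A compact, illustrative sketch of this level-wise search. It is a teaching-oriented implementation, not an optimized library; the `min_support` value and the input format (a list of item sets, such as the five-transaction table above) are assumptions.

```python
# Minimal Apriori sketch: level-wise search exploiting downward closure.
# Illustrative only -- real workloads would use an optimized library.
from itertools import chain

def apriori(transactions, min_support=0.4):
    N = len(transactions)
    items = set(chain.from_iterable(transactions))
    # Level 1: frequent 1-itemsets
    level = [frozenset([i]) for i in items
             if sum(i in t for t in transactions) / N >= min_support]
    frequent = list(level)
    k = 2
    while level:
        # Candidate k-itemsets are built only from frequent (k-1)-itemsets:
        # downward closure guarantees nothing else can be frequent.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        level = [c for c in candidates
                 if sum(c <= t for t in transactions) / N >= min_support]
        frequent.extend(level)
        k += 1
    return frequent

# Example run on the five-transaction table above:
# for s in apriori(transactions): print(set(s))
```

Note how lowering `min_support` lets far more candidates survive each level, which is exactly the "lower support ➔ higher computational complexity" trade-off named above; confidence and lift are then applied as post-processing filters on the frequent itemsets.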
APRIORI ALGORITHM - LIMITATIONS
Computational Complexity
• How long does it take to run?
• How much memory does it need?
Approaches
• Throw more compute / RAM at it
• Parallelize
• Increase support
• Leverage item hierarchy
• Another algorithm?
When sequence of transactions matters
• Define a sequence as an item
• Combinatorial Explosion : Computational Complexity
• Read-Up!
Rare patterns
• Rules with low support but which may be very valuable
• People who buy ______ likely to buy luxury cars
RARE PATTERNS ARE IMPORTANT TOO
It is very rare in real-world cases to have lift factors as high as 8.
But there was a case where it did happen. That case was discovered by
Walmart in 2004, when a series of hurricanes crossed the state of
Florida. After the first hurricane, several more hurricanes were
seen in the Atlantic Ocean heading toward Florida, and so Walmart mined
their massive retail transaction database to see what their customers
really wanted to buy prior to the arrival of a hurricane. They found one
particular item that increased in sales by a factor of 7 over normal
shopping days. That was a huge lift factor for a real-world case. That one
item was not bottled water, or batteries, or beer, or flashlights, or
generators, or any of the usual things that we might imagine. The item
was strawberry pop tarts! One could imagine lots of reasons why this
was the most desired product prior to the arrival of a hurricane – pop
tarts do not require refrigeration, they do not need to be cooked, they
come in individually wrapped portions, they have a long shelf life, they
are a snack food, they are a breakfast food, kids love them, and we love
them. Despite these “obvious” reasons, it was still a huge surprise! And
so Walmart stocked their stores with tons of strawberry pop tarts
prior to the next hurricanes, and sold them out. That is a win-win:
Walmart wins by making the sale, and customers win by getting
the product that they most want.
CLUSTERING
Motivation
• Transaction data (Customer–Product matrix)
• Segmentation, pre-processing, multimodal distributions, “I can’t explain it”
Clustering
• Find elements (rows, tuples) which are similar.
• Finding “areas” in space where data is concentrated
• WYSIWYG : What You Select Is What You Get
When are two elements / rows similar?
• A measure of (dis)similarity.
• Which dimensions (attributes) are relevant?
• Normalization?
• How many clusters?
CLUSTERING
Three Metrics Used for Clustering
data points
• Similarity
• Dissimilarity
• Distance
When do we scale, and when don’t we?
• Scale if variables measure different units (kg, metre, sec, …)
• Scale if you explicitly want each variable to have equal weight
• Don’t scale if units are the same for all variables
If variables are scaled
• Every variable gets equal weight
• A similar alternative is re-weighting
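As a small illustration of the scaling rule above, scikit-learn's StandardScaler gives every variable equal weight before distance-based clustering; the two-feature data below is a made-up example.

```python
# Scaling before clustering: kg and metres are on different units/scales,
# so without scaling the larger-ranged variable dominates every distance.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[70.0, 1.75],    # weight in kg, height in metres
              [82.0, 1.80],
              [55.0, 1.60]])

X_scaled = StandardScaler().fit_transform(X)  # each column: mean 0, std 1
print(X_scaled)  # now each variable gets equal weight in distance computations
```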
Cosine Similarity
Gower Distance
DISTANCE - CLUSTERING
SIMILARITY - CLUSTERING
Measures of Similarity
Cosine of the angle between two vectors, a.k.a. normalized inner product
• Similarity between vectors is captured by the cosine of the angle x between them: cos(x) = (a · b) / (|a| |b|)
• The denominator involves the lengths of the vectors
What kinds of problems are solved
with cosine similarity?
• Relevant in information retrieval
• Documents (query) as vectors of
words
• Two documents similar if they
contain the same words
• Does size of the document matter?
• Yes: Dot Product
• No: Cosine similarity
Cosine Similarity
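A minimal sketch of cosine similarity versus the plain dot product, on hypothetical word-count vectors for two documents.

```python
# Cosine similarity: cos(x) = (a . b) / (|a| |b|).
# Length-independent, so two documents with the same word proportions
# score alike regardless of document size.
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical word-count vectors for two documents
doc_a = np.array([3, 0, 2, 1])
doc_b = np.array([6, 0, 4, 2])   # same proportions, twice the length

print(np.dot(doc_a, doc_b))             # dot product: document size matters
print(cosine_similarity(doc_a, doc_b))  # 1.0: cosine ignores size
```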
DISSIMILARITY - CLUSTERING
Measures of Dissimilarity
Categorical Variables
• One-hot encoding
A few dissimilarity measures used for clustering:
• Hamming Distance
• Simple Matching Coefficient (SMC)
• Dice coefficient
• Jaccard coefficient
For each coefficient, Distance = 1 − coefficient
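A small sketch computing these coefficients for two hypothetical one-hot (binary) vectors, with distance = 1 − coefficient.

```python
# Dissimilarity measures for one-hot / binary vectors.
import numpy as np

a = np.array([1, 0, 1, 1, 0, 0])
b = np.array([1, 1, 1, 0, 0, 0])

m11 = np.sum((a == 1) & (b == 1))   # positions where both are 1
m00 = np.sum((a == 0) & (b == 0))   # positions where both are 0
m10 = np.sum((a == 1) & (b == 0))
m01 = np.sum((a == 0) & (b == 1))

smc     = (m11 + m00) / (m11 + m00 + m10 + m01)  # Simple Matching Coefficient
jaccard = m11 / (m11 + m10 + m01)                # ignores 0/0 matches
dice    = 2 * m11 / (2 * m11 + m10 + m01)
hamming = m10 + m01                              # count of mismatched positions

# Distance = 1 - coefficient for the similarity coefficients
print(1 - smc, 1 - jaccard, 1 - dice, hamming)
```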
NUMBER OF CLUSTERS - KMEANS – CLUSTERING ALGORITHM
KMEANS – CLUSTERING ALGORITHM
• Initialize
• Pick k data points randomly from X: these become the centroids
• Iterate
• Assign each data point to the closest / most similar centroid
• For each cluster, update the centroid
• Repeat
• Terminate when the “change in within-cluster variation” < threshold
• (within-cluster variation: the amount by which elements in the same cluster differ)
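A minimal NumPy sketch of these steps; k, the tolerance, and the data are assumptions, and the empty-cluster edge case is not handled.

```python
# Minimal K-means sketch following the steps above (illustrative only).
import numpy as np

def kmeans(X, k, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize: pick k distinct data points at random as centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    while True:
        # Assign each data point to the closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Terminate when the centroids (and hence the within-cluster
        # variation) stop moving by more than the threshold
        if np.linalg.norm(new_centroids - centroids) < tol:
            return labels, new_centroids
        centroids = new_centroids

# Example: labels, centres = kmeans(np.random.default_rng(1).normal(size=(200, 2)), k=3)
```

Because the initial centroids are random, different seeds can give different clusterings, which is one of the limitations listed below.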
KMEANS – CLUSTERING ALGORITHM
Limitations of K-Means
• Hyperparameter: k
• Not really a “limitation”
• Initial centroids are picked at random (results can vary between runs)
• Categorical (mixed) data?
• Non-convex clusters
CLUSTERING ALGORITHMS
SUPERVISED LEARNING
It is the task of assigning a label to a data point, based on labeled data.
Supervised learning is the
machine learning task of
learning a function that maps
an input to an output based on
example input-output pairs.
Supervised
SUPERVISED LEARNING
CLASSIFICATION VS REGRESSION
REGRESSION - SUPERVISED LEARNING
Distance / Time Estimates
Temperature
A regression problem is one where the output variable is a real or continuous value, such as “price” or “temperature”. Many different models can be used; the simplest is linear regression, which tries to fit the data with the best hyperplane through the points. Regression is the task of predicting a continuous quantity.
A few regression algorithms
• Linear Regression
• Multiple Linear Regression
• Polynomial Regression
• Decision Tree Regression
• Random Forest Regression
Certain Applications of Regression
•To determine the economic growth of a country or a state in the coming quarter.
•Can also be used to predict the GDP of a country.
•To predict what would be the price of a product in the future.
•To predict the number of runs a player will score in the coming matches.
A regression algorithm may predict a discrete value, but only in the form of an integer quantity.
CLASSIFICATION – SUPERVISED LEARNING
Classification specifies the class to which data elements belong and is best used when the output has finite and discrete values. It predicts a class for an input variable.
Types of Classification:
•Binomial
•Multi-Class
Certain Applications of Classification:
• To find whether an email received is spam or ham
• To find if a bank loan is granted
• To identify if a kid will pass or
fail in an examination
• Facial Recognition
A few classification algorithms
•Linear Models
• Logistic Regression
• Support Vector Machines
•Nonlinear models
• K-nearest Neighbours (KNN)
• Kernel Support Vector Machines
(SVM)
• Naïve Bayes
• Decision Tree Classification
• Random Forest Classification
A classification algorithm may predict a continuous
value, but the continuous value is in the form of a
probability for a class label
REGRESSION ALGORITHMS – SUPERVISED LEARNING
Linear Regression
• Linear Regression is the statistical model used to predict the relationship between the independent and dependent variables by examining two factors.
• The first is which variables are significant predictors of the outcome variable, and the second is how well the regression line makes predictions with the highest possible accuracy.
Linear regression is a linear approach for modeling the relationship between a scalar dependent variable y and an independent variable x.
• The relationship between these two variables is assumed to be linear, i.e. a straight line can be used to model it.
• The objective is to find the line that best fits these two variables while keeping the error as low as possible.
• The error is the sum of squared vertical distances between the points and the line (least squares).
• Independent variable: X
• Dependent variable: Y
• β0 is the intercept of the regression model
• β1 is the slope of the regression model
• ϵ is the error term
LINEAR REGRESSION – SUPERVISED LEARNING
Here, there is a positive relationship between the monthly e-commerce sales (Y) and online advertising costs (X):
Y = β0 + β1X
Y = 125.8 + 171.5·X
Linear regression aims to find the best-fitting straight line through the points. The best-fitting line is known as the regression line.
• If the data points lie close to a straight line when plotted, the correlation between the two variables is higher. In our example, the relationship is strong.
• The orange diagonal line is the regression line and shows the predicted e-commerce sales for each possible value of the online advertising costs.
Interpretation of the results:
• The slope of 171.5 shows that for each increase of one unit in X, we predict the average of Y to increase by an estimated 171.5 units.
• The formula estimates that for each increase of 1 dollar in online advertising costs, the expected monthly e-commerce sales are predicted to increase by $171.5.
Error Calculation:
• The error is calculated using metrics such as Root Mean Squared Error, Mean Absolute Error, and Mean Squared Error.
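A sketch of fitting such a line with scikit-learn. The advertising-cost/sales numbers below are illustrative stand-ins, not the slide's actual dataset, so the fitted coefficients will differ from 125.8 and 171.5.

```python
# Fitting Y = b0 + b1*X by least squares on illustrative data.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # online advertising costs
y = np.array([300.0, 480.0, 620.0, 820.0, 980.0])   # monthly e-commerce sales

model = LinearRegression().fit(X, y)
print("intercept b0:", model.intercept_)   # plays the role of 125.8 in the slide
print("slope b1:", model.coef_[0])         # plays the role of 171.5 in the slide

# Each extra unit of ad spend shifts predicted sales by ~b1 units
print(model.predict([[6.0]]))
```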
MULTIPLE LINEAR REGRESSION – SUPERVISED LEARNING
Here, there is a positive relationship between the monthly e-commerce sales (Y) and the online advertising costs (X).
y = w1·x1 + w2·x2 + b
Multiple Linear Regression is a statistical technique used to predict the outcome of a response variable through several explanatory variables and to model the relationships between them. It represents a line fit between multiple inputs and one output.
• Multiple linear regression (MLR) is used to determine a mathematical relationship among several random variables.
• In other terms, MLR examines how multiple independent variables are related to one dependent variable.
• Once each of the independent factors has been determined to predict the dependent variable, the information on the multiple variables can be used to create an accurate prediction of the level of effect they have on the outcome variable.
• The model creates a relationship in the form of a straight line (linear) that best approximates all the individual data points.
Error Calculation:
• The error is calculated using metrics such as Root Mean Squared Error, Mean Absolute Error, and Mean Squared Error.
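A sketch of the same idea with two explanatory variables, including the error metrics named above; the data and the feature names are illustrative assumptions.

```python
# Multiple linear regression y = w1*x1 + w2*x2 + b on illustrative data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Columns: x1 = ad spend, x2 = number of promotions (hypothetical features)
X = np.array([[1.0, 2], [2.0, 1], [3.0, 4], [4.0, 3], [5.0, 5]])
y = np.array([320.0, 450.0, 700.0, 790.0, 1010.0])

mlr = LinearRegression().fit(X, y)
print("weights w1, w2:", mlr.coef_)   # effect of each independent variable
print("intercept b:", mlr.intercept_)

# Error calculation with the metrics named above
y_hat = mlr.predict(X)
print("MSE:", mean_squared_error(y, y_hat))
print("MAE:", mean_absolute_error(y, y_hat))
print("RMSE:", mean_squared_error(y, y_hat) ** 0.5)
```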
CLASSIFICATION ALGORITHMS – SUPERVISED LEARNING
A few classification algorithms
•Linear Models
• Logistic Regression
• Support Vector Machines
•Nonlinear models
• K-nearest Neighbours (KNN)
• Kernel Support Vector Machines (SVM)
• Naïve Bayes
• Decision Tree Classification
• Random Forest Classification
CLASSIFICATION ALGORITHMS – SUPERVISED LEARNING
Naïve Bayes
It is a classification technique based on Bayes’ Theorem
with an assumption of independence among predictors. In
simple terms, a Naive Bayes classifier assumes that the
presence of a particular feature in a class is unrelated to
the presence of any other feature.
Bayes’ Theorem: P(c|x) = P(x|c) · P(c) / P(x), where
• P(c|x) is the posterior probability of the class (c, target) given the predictor (x, attributes)
• P(c) is the prior probability of the class
• P(x|c) is the likelihood, i.e. the probability of the predictor given the class
• P(x) is the prior probability of the predictor
Example
A fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, all of these properties independently contribute to the probability that this fruit is an apple, and that is why it is known as ‘Naive’.
CLASSIFICATION ALGORITHMS – SUPERVISED LEARNING
Naïve Bayes Example
I have a training data set of weather conditions and a corresponding target variable ‘Play’ (possibilities of playing). Now, we need to classify whether players will play or not based on the weather conditions. Let’s follow the steps below to perform it.
To find: players will play if the weather is sunny.
Is this statement correct?
CLASSIFICATION ALGORITHMS – SUPERVISED LEARNING
Naïve Bayes Example
Steps to solve for the posterior probability
1. Convert the data set into a frequency table
2. Create a likelihood table by finding the probabilities, e.g. probability of Overcast = 0.29 and probability of playing = 0.64
3. Now, use the Naive Bayes equation to calculate the posterior probability for each class
4. The class with the highest posterior probability is the outcome of the prediction
P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P(Yes) = 9/14 = 0.64
P(Yes | Sunny) = 0.33 × 0.64 / 0.36 = 0.60 ⇒ higher probability to play!
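The same arithmetic in a few lines of Python, using the counts quoted above (3 sunny ‘Yes’ days, 5 sunny days total, 9 ‘Yes’ out of 14 rows).

```python
# Reproducing the posterior computation from the weather/'Play' example,
# using the counts from the slide's frequency table.
p_sunny_given_yes = 3 / 9     # P(Sunny | Yes)
p_sunny           = 5 / 14    # P(Sunny)
p_yes             = 9 / 14    # P(Yes)

# Bayes' theorem: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(round(p_yes_given_sunny, 2))   # 0.6 -> 'play' is the likelier class
```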
CLASSIFICATION ALGORITHMS – SUPERVISED LEARNING
Naïve Bayes Applications
Real-time Prediction:
• Naive Bayes is an eager learning classifier, and it is certainly fast. Thus, it can be used for making predictions in real time.
Multi-class Prediction:
• This algorithm is also well known for its multi-class prediction feature. Here we can predict the probability of multiple classes of the target variable.
Text Classification / Spam Filtering / Sentiment Analysis:
• Naive Bayes classifiers, mostly used in text classification (due to better results in multi-class problems and the independence rule), have a higher success rate compared to other algorithms. As a result, they are widely used in spam filtering (identifying spam e-mail) and sentiment analysis (in social media analysis, to identify positive and negative customer sentiments).
Recommendation Systems:
• A Naive Bayes classifier and collaborative filtering together build a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource or not.
MACHINE LEARNING VS DEEP LEARNING
NEURAL NETWORKS
REINFORCEMENT LEARNING
The science of making optimal decisions motivated by rewards
Carrying the Basic Understanding Forward
• The topics covered in this presentation are just droplets from the ocean of machine learning and data science concepts.
• If this presentation interested you, you can dig into reference materials for further study.
• You can also enrol in various data science courses online, read articles, and keep up with research in the fields of data science via websites like Towards Data Science, Analytics Vidhya, etc.
