INTRODUCTION
TO
MACHINE LEARNING
- Sathvik N U
Where does AI fit in?
What is Machine Learning?
Concepts of Machine Learning
Steps to build a predictive model
INTRODUCTION
TO
MACHINE LEARNING
• The industrial revolution in manufacturing changed the way we think about and work in industry.
• Industrial Stages:
• Industry 1.0 (water and steam power) – 1760s
• Industry 2.0 (the Technological Revolution) – 1840s
• Industry 3.0 (Information Technology) – 1970s
• Industry 4.0 – Present
• With the connectivity produced in Industry 3.0, disruptive automation technology arose; now, in Industry 4.0, smart machines are being built that communicate among themselves to achieve the target.
ARTIFICIAL INTELLIGENCE
Ability of a machine to
imitate intelligent human
behavior
Hierarchy of Artificial Intelligence
Industries where AI is used at full scale
• Retail
• Healthcare
• Manufacturing
• Banking
ARTIFICIAL INTELLIGENCE
Long Term Memory in Human Beings
Long-term memory refers to the storage of information over an extended period. If you can remember something that happened more than just a few moments ago, whether it occurred just hours ago or decades earlier, then it is a long-term memory.
Types of Long-Term Memories
• Explicit Memory
  • Semantic
  • Episodic
• Implicit Memory
  • Procedural
  • Emotional Conditioning
How do you fight
corona?
By staying at home
To learn more about how computers grasp these memories, follow the link.
UNDERSTANDING ARTIFICIAL INTELLIGENCE LANDSCAPE
IS ARTIFICIAL INTELLIGENCE HARMFUL?
“INSPIRATIONAL” QUOTES GENERATED BY ARTIFICIAL INTELLIGENCE
Machine Learning
Subset of AI which uses statistical and probabilistic methods to enable machines to improve with experience
INTRODUCTION
TO
MACHINE LEARNING
CERTAIN PRE-REQUISITES FOR MACHINE LEARNING
• Statistics
• Probability and Queuing Theory
• Linear Algebra
• Calculus
• Basic coding skills
• Patience!!
Of the above pre-requisites, the most important would be patience, as the machine learning world is almost infinite. It is ever-evolving, and you can never conquer it, but you can master it.
A wise man once said, “I think the target of anything in life should be to do it so well that it becomes an art.” I firmly believe that applies to machine learning. After all, it is mathematics, guys!
A statistical model is all about getting a simple formulation of a frontier in a classification problem. Here we see a nonlinear boundary which, to some extent, separates risky people from non-risky people.
When we see the contours generated by a Machine Learning algorithm, we witness that, for the problem at hand, statistical modeling is in no way comparable to the Machine Learning algorithm. The contours of machine learning seem to capture all patterns, beyond any constraints of linearity or even continuity of the boundaries. This is what Machine Learning can do for you.
HOW IS MACHINE LEARNING DIFFERENT FROM STATISTICS?
Machine Learning algorithms are used in the recommendation engines of Google, Facebook, Amazon, etc., which can churn through trillions of data points in a second to produce nearly perfect recommendations. Even with a 16 GB RAM laptop, I daily work on datasets of millions of rows with thousands of parameters and build an entire model in no more than 30 minutes. A statistical model, on the other hand, needs a supercomputer to run a million observations with a thousand parameters.
Machine Learning
Examples use cases of
Machine Learning
• Intelligent Navigation
• News Recommendation
• Spam Detection
• Google Assistant
• Google Ads
• Product Recommendation
• Ad Delivery
• Optimizing Supply Chain
• Personalized Recommendation Engine
• Streaming quality by predicting bandwidth
• Personalized Ads
• Friend Suggestion
• Face Tagging
Machine Learning
How industries have capitalized on Machine Learning
Machine Learning Framework
Types of Machine Learning
Machine Learning Algorithms
Algorithms try to find a mathematical relationship f(x) between the inputs (x) and the output (y) by going through the whole dataset one row at a time, adjusting the equation and the coefficients at every step to accommodate all the rows covered up to that point. This process, through which learning happens, is called ‘Training’ or ‘Model Fitting’.
When the model gets trained on the historical data, it generates coefficients called model weights, which are used for predictions. The trained models and weights are stored as files and can be reused whenever unseen data with the same features used in training needs to be predicted.
The trained model from the previous step is tested on unseen data (data which was not part of the data used to train the model) for which the actual output is known. The predicted outputs are compared to the actual outputs, and accuracy scores are calculated. These scores act as an evaluation metric for the model.
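For concreteness, here is a minimal sketch of the train / store / evaluate cycle described above, using scikit-learn; the synthetic dataset, the logistic-regression model, and the file name are illustrative assumptions, not part of the slides.

```python
# Sketch of the train -> save -> evaluate-on-unseen-data cycle described above.
# Assumptions: scikit-learn is available; the dataset and file name are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import joblib

# Historical data: features X and known outputs y
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Hold out unseen data whose actual output is known, for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 'Training' / 'Model Fitting': the coefficients (model weights) are learned here
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The trained model (with its weights) is stored as a file for later reuse
joblib.dump(model, "model.joblib")

# Later: reload the file and predict on unseen data with the same features
reloaded = joblib.load("model.joblib")
y_pred = reloaded.predict(X_test)

# Compare predicted outputs to actual outputs -> evaluation metric
print("Accuracy:", accuracy_score(y_test, y_pred))
```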
Machine Learning Algorithms
Algorithms try to mimic a human brain by finding a mathematical relationship between the inputs and the output
Algebraic Models
Probabilistic Models (Conditional)
Tree Models
And Other Models...
Unsupervised Learning
Since the examples given to the learner are unlabelled, there is no error or reward signal to evaluate a potential solution.
This distinguishes unsupervised learning from supervised learning and reinforcement learning.
Patterns in the data
• Which value(s) occur most frequently?
• How much does the data vary?
• How symmetrically does the data vary around the center?
• Is the data clustered around value(s)?
• Sub-spaces where the data is “concentrated”
It is the task of inferring a function to describe hidden structure or patterns from unlabeled data.
Unsupervised
Unsupervised Learning
Association Rules
• Find features (dimensions) which occur together
• Find features (dimensions) which are “correlated”
Summary Statistics
Patterns that can be found in the data:
• Median
• Variance, Standard Deviation, Skewness, Kurtosis
• Mode
Clustering
• Find data elements which are similar
• Find “areas” in space where data is concentrated
Multiple Kernel Learning
• Are two features correlated?
• Are two dimensions correlated?
• Combines data from different sources
Dimensionality Reduction
• Find smaller-dimensional representations of the data which preserve its essential structure.
• Find subspaces where data varies the most.
ASSOCIATION RULE MINING
The value of one feature tells us the value of another feature
▪ People who buy diapers are likely to buy baby powder
▪ If (people buy diapers), then (they buy baby powder)
<!> Watch the directionality <!>
(A ➔ B does not mean B ➔ A)
In A ➔ B, A is the antecedent and B is the consequent
Applications
▪ Market Basket Analysis
▪ Medical Diagnosis
▪ Census Data
▪ Protein Sequences
▪ Cross Marketing
▪ Catalog Design
Association Rules
• Are statements about relations among features (attributes), across elements (tuples)
• Use a Transaction Itemset Data Model
TRANSACTION ITEMSET DATA MODEL
ASSOCIATION RULE MINING
▪ Market Basket Analysis is the most common use, where each basket is a row and each item is a column
▪ It is not the only use case. It can be used on any dataset where the features take only two values, 0/1
▪ It can work on any dataset where features can be transformed to take only two values, 0/1, by discretization and feature selection
▪ There are a certain few measures of effectiveness that help us find relations between features in a set of transaction-itemset data points
TRANSACTION ITEMSET DATA MODEL
Measures of Effectiveness
• Support
• Confidence
• Lift: the ratio of the observed support to that expected if X and Y were independent
• Others: Affinity, Leverage
ASSOCIATION RULE MINING
Support(A) = fraction of transactions that contain itemset A
Confidence(A ➔ B) = Support(A ∪ B) / Support(A): of the transactions containing A, the fraction that also contain B
TRANSACTION ITEMSET DATA MODEL
• {Diaper, Beer} ➔ Milk
• Support = 2/5, Confidence = 2/3
• {Milk} ➔ {Diaper, Beer}
• Support = 2/5, Confidence = 2/4
• {Milk, Diaper} ➔ Bread
• Support = 2/5, Confidence = 2/3
• {Milk, Beer} ➔ Diaper?
Is Confidence = 1?
• Caution : Diaper is very popular!
• Does the inclusion of {Milk, Beer} increase the probability of Diaper?
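These supports and confidences are consistent with the classic five-transaction market-basket table, so a short sketch can recompute them; the table below is an assumption reconstructed from the quoted numbers, not taken verbatim from the slides.

```python
# Recomputing support, confidence and lift for the rules above.
# The five transactions are the classic market-basket example; the quoted
# supports/confidences (2/5, 2/3, ...) are consistent with this table.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
N = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / N

def confidence(antecedent, consequent):
    """P(consequent | antecedent) = support(A u B) / support(A)."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Observed support over the support expected under independence."""
    return confidence(antecedent, consequent) / support(consequent)

print(support({"Diaper", "Beer", "Milk"}))        # 0.4  (= 2/5)
print(confidence({"Diaper", "Beer"}, {"Milk"}))   # 0.666... (= 2/3)
print(confidence({"Milk", "Beer"}, {"Diaper"}))   # 1.0 -- but Diaper is popular!
print(lift({"Milk", "Beer"}, {"Diaper"}))         # 1.25 -> only a mild increase
```

The last two lines answer the caution above: confidence alone says 1.0, but because Diaper appears in 4 of 5 baskets anyway, the lift is only 1.25.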
LIFT
• If the lift is > 1, it tells us the degree to which those two occurrences depend on one another, and makes those rules potentially useful for predicting the consequent in future data sets. Positively correlated.
• If the lift is < 1, it tells us the items are substitutes for each other. This means that the presence of one item has a negative effect on the presence of the other item, and vice versa. Negatively correlated.
• If the lift is = 1, then the two occurrences are not correlated.
APRIORI ALGORITHM - ASSOCIATION RULE MINING
Key Idea
• If {a,b,c} is frequent, {a,b} must be frequent
• Downward closure a.k.a. anti-monotonicity
Algorithm
• Find all frequent 1-itemsets (frequent ➔ support above the threshold)
• Find all frequent 2-itemsets from the filtered 1-itemsets
• Find all frequent 3-itemsets from the filtered 2-itemsets
• …
Salient Features
• Exploits downward closure to optimize search
• Lower Support ➔ Higher computational complexity
• Confidence, Lift as post-processing filters
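A compact, illustrative sketch of this level-wise search. It is a teaching-oriented implementation, not an optimized library; the `min_support` value and the input format (a list of item sets, such as the five-transaction table above) are assumptions.

```python
# Minimal Apriori sketch: level-wise search exploiting downward closure.
# Illustrative only -- real workloads would use an optimized library.
from itertools import chain

def apriori(transactions, min_support=0.4):
    N = len(transactions)
    items = set(chain.from_iterable(transactions))
    # Level 1: frequent 1-itemsets
    level = [frozenset([i]) for i in items
             if sum(i in t for t in transactions) / N >= min_support]
    frequent = list(level)
    k = 2
    while level:
        # Candidate k-itemsets are built only from frequent (k-1)-itemsets:
        # downward closure guarantees nothing else can be frequent.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        level = [c for c in candidates
                 if sum(c <= t for t in transactions) / N >= min_support]
        frequent.extend(level)
        k += 1
    return frequent

# Example run on the five-transaction table above:
# for s in apriori(transactions): print(set(s))
```

Note how lowering `min_support` lets far more candidates survive each level, which is exactly the "lower support ➔ higher computational complexity" trade-off named above; confidence and lift are then applied as post-processing filters on the frequent itemsets.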
APRIORI ALGORITHM - LIMITATIONS
Computational Complexity
• How long does it take to run?
• How much memory does it need?
Approaches
• Throw more compute / RAM at it
• Parallelize
• Increase support
• Leverage item hierarchy
• Another algorithm?
When sequence of transactions matters
• Define a sequence as an item
• Combinatorial Explosion : Computational Complexity
• Read-Up!
Rare patterns
• Rules with low support but which may be very valuable
• People who buy ______ likely to buy luxury cars
RARE PATTERNS ARE IMPORTANT TOO
It is very rare in real-world cases to have lift factors as high as 8.
But there was a case where it did happen. That case was discovered by
Walmart in 2004, when a series of hurricanes crossed the state of
Florida. After the first hurricane, several more hurricanes were
seen in the Atlantic Ocean heading toward Florida, and so Walmart mined
their massive retail transaction database to see what their customers
really wanted to buy prior to the arrival of a hurricane. They found one
particular item that increased in sales by a factor of 7 over normal
shopping days. That was a huge lift factor for a real-world case. That one
item was not bottled water, or batteries, or beer, or flashlights, or
generators, or any of the usual things that we might imagine. The item
was strawberry pop tarts! One could imagine lots of reasons why this
was the most desired product prior to the arrival of a hurricane – pop
tarts do not require refrigeration, they do not need to be cooked, they
come in individually wrapped portions, they have a long shelf life, they
are a snack food, they are a breakfast food, kids love them, and we love
them. Despite these “obvious” reasons, it was still a huge surprise! And
so Walmart stocked their stores with tons of strawberry pop tarts
prior to the next hurricanes, and sold them out. That is a win-win:
Walmart wins by making the sale, and customers win by getting
the product that they most want.
CLUSTERING
Motivation
• Transaction data (Customer–Product matrix)
• Segmentation, pre-processing, multimodal distributions, “I can’t explain it”
Clustering
• Find elements (rows, tuples) which are similar.
• Finding “areas” in space where data is concentrated
• WYSIWYG : What You Select Is What You Get
When are two elements / rows similar?
• A measure of (dis)similarity.
• Which dimensions (attributes) are relevant?
• Normalization?
• How many clusters?
CLUSTERING
Three Metrics Used for Clustering
data points
• Similarity
• Dissimilarity
• Distance
When do we scale, and when don’t we?
• Scale if variables measure different units (kg, metre, sec, …)
• Scale if you explicitly want each variable to have equal weight
• Don’t scale if units are the same for all variables
If variables are scaled
• Every variable gets equal weight
• A similar alternative is re-weighting
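As a small illustration of the scaling rule above, scikit-learn's StandardScaler gives every variable equal weight before distance-based clustering; the two-feature data below is a made-up example.

```python
# Scaling before clustering: kg and metres are on different units/scales,
# so without scaling the larger-ranged variable dominates every distance.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[70.0, 1.75],    # weight in kg, height in metres
              [82.0, 1.80],
              [55.0, 1.60]])

X_scaled = StandardScaler().fit_transform(X)  # each column: mean 0, std 1
print(X_scaled)  # now each variable gets equal weight in distance computations
```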
Cosine Similarity
Gower Distance
DISTANCE - CLUSTERING
SIMILARITY - CLUSTERING
Measures of Similarity
Cosine of the angle between two vectors, a.k.a. normalized inner product
• Similarity between vectors is captured by the cosine of the angle x between them: cos(x) = (a · b) / (|a| |b|)
• The denominator involves the lengths of the vectors
What kinds of problems are solved
with cosine similarity?
• Relevant in information retrieval
• Documents (query) as vectors of
words
• Two documents similar if they
contain the same words
• Does size of the document matter?
• Yes: Dot Product
• No: Cosine similarity
Cosine Similarity
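A minimal sketch of cosine similarity versus the plain dot product, on hypothetical word-count vectors for two documents.

```python
# Cosine similarity: cos(x) = (a . b) / (|a| |b|).
# Length-independent, so two documents with the same word proportions
# score alike regardless of document size.
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical word-count vectors for two documents
doc_a = np.array([3, 0, 2, 1])
doc_b = np.array([6, 0, 4, 2])   # same proportions, twice the length

print(np.dot(doc_a, doc_b))             # dot product: document size matters
print(cosine_similarity(doc_a, doc_b))  # 1.0: cosine ignores size
```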
DISSIMILARITY - CLUSTERING
Measures of Dissimilarity
Categorical Variables
• One-hot encoding
A few dissimilarity measures used for clustering:
• Hamming Distance
• Simple Matching Coefficient (SMC)
• Dice coefficient
• Jaccard coefficient
For each coefficient, Distance = 1 − coefficient
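A small sketch computing these coefficients for two hypothetical one-hot (binary) vectors, with distance = 1 − coefficient.

```python
# Dissimilarity measures for one-hot / binary vectors.
import numpy as np

a = np.array([1, 0, 1, 1, 0, 0])
b = np.array([1, 1, 1, 0, 0, 0])

m11 = np.sum((a == 1) & (b == 1))   # positions where both are 1
m00 = np.sum((a == 0) & (b == 0))   # positions where both are 0
m10 = np.sum((a == 1) & (b == 0))
m01 = np.sum((a == 0) & (b == 1))

smc     = (m11 + m00) / (m11 + m00 + m10 + m01)  # Simple Matching Coefficient
jaccard = m11 / (m11 + m10 + m01)                # ignores 0/0 matches
dice    = 2 * m11 / (2 * m11 + m10 + m01)
hamming = m10 + m01                              # count of mismatched positions

# Distance = 1 - coefficient for the similarity coefficients
print(1 - smc, 1 - jaccard, 1 - dice, hamming)
```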
NUMBER OF CLUSTERS - KMEANS – CLUSTERING ALGORITHM
KMEANS – CLUSTERING ALGORITHM
• Initialize
• Pick k data points randomly from X: these become the centroids
• Iterate
• Assign each data point to the closest / most similar centroid
• For each cluster, update the centroid
• Repeat
• Terminate when the “change in within-cluster variation” < threshold
• (within-cluster variation: the amount by which elements in the same cluster differ)
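A minimal NumPy sketch of these steps; k, the tolerance, and the data are assumptions, and the empty-cluster edge case is not handled.

```python
# Minimal K-means sketch following the steps above (illustrative only).
import numpy as np

def kmeans(X, k, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize: pick k distinct data points at random as centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    while True:
        # Assign each data point to the closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Terminate when the centroids (and hence the within-cluster
        # variation) stop moving by more than the threshold
        if np.linalg.norm(new_centroids - centroids) < tol:
            return labels, new_centroids
        centroids = new_centroids

# Example: labels, centres = kmeans(np.random.default_rng(1).normal(size=(200, 2)), k=3)
```

Because the initial centroids are random, different seeds can give different clusterings, which is one of the limitations listed below.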
KMEANS – CLUSTERING ALGORITHM
Limitations of K-Means
• Hyperparameter: k
• Not really a “limitation”
• Initial centroids are picked at random (results can vary between runs)
• Categorical (mixed) data?
• Non-convex clusters
CLUSTERING ALGORITHMS
SUPERVISED LEARNING
It is the task of assigning a label to a data point, based on labeled data.
Supervised learning is the
machine learning task of
learning a function that maps
an input to an output based on
example input-output pairs.
Supervised
SUPERVISED LEARNING
CLASSIFICATION VS REGRESSION
REGRESSION - SUPERVISED LEARNING
Distance / Time Estimates
Temperature
A regression problem is one where the output variable is a real or continuous value, such as “price” or “temperature”. Many different models can be used; the simplest is linear regression, which tries to fit the data with the best hyperplane through the points. Regression is the task of predicting a continuous quantity.
A few regression algorithms
• Linear Regression
• Multiple Linear Regression
• Polynomial Regression
• Decision Tree Regression
• Random Forest Regression
Certain Applications of Regression
•To determine the economic growth of a country or a state in the coming quarter.
•Can also be used to predict the GDP of a country.
•To predict what would be the price of a product in the future.
•To predict the number of runs a player will score in the coming matches.
A regression algorithm may predict a discrete value, but only in the form of an integer quantity.
CLASSIFICATION – SUPERVISED LEARNING
Classification specifies the class to which data elements belong and is best used when the output has finite and discrete values. It predicts a class for an input variable.
Types of Classification:
•Binomial
•Multi-Class
Certain Applications of Classification:
• To find whether an email received is spam or ham
• To find if a bank loan is granted
• To identify if a kid will pass or
fail in an examination
• Facial Recognition
A few classification algorithms
•Linear Models
• Logistic Regression
• Support Vector Machines
•Nonlinear models
• K-nearest Neighbours (KNN)
• Kernel Support Vector Machines
(SVM)
• Naïve Bayes
• Decision Tree Classification
• Random Forest Classification
A classification algorithm may predict a continuous
value, but the continuous value is in the form of a
probability for a class label
REGRESSION ALGORITHMS – SUPERVISED LEARNING
Linear Regression
• Linear Regression is the statistical model used to predict the relationship between the independent and dependent variables by examining two factors.
• The first is which variables are significant predictors of the outcome variable, and the second is how well the regression line makes predictions with the highest possible accuracy.
Linear regression is a linear approach for modeling the relationship between a scalar dependent variable y and an independent variable x.
• The relationship between these two variables is assumed to be linear, i.e. a straight line can be used to model it.
• The objective is to find the line that best fits these two variables while keeping the error as low as possible.
• The error is the sum of squared vertical distances between the points and the line (least squares).
• Independent variable: X
• Dependent variable: Y
• β0 is the intercept of the regression model
• β1 is the slope of the regression model
• ϵ is the error term
LINEAR REGRESSION – SUPERVISED LEARNING
Here, there is a positive relationship between the monthly e-commerce sales (Y) and online advertising costs (X):
Y = β0 + β1X
Y = 125.8 + 171.5·X
Linear regression aims to find the best-fitting straight line through the points. The best-fitting line is known as the regression line.
• If the data points lie close to a straight line when plotted, the correlation between the two variables is higher. In our example, the relationship is strong.
• The orange diagonal line is the regression line and shows the predicted e-commerce sales for each possible value of the online advertising costs.
Interpretation of the results:
• The slope of 171.5 shows that for each increase of one unit in X, we predict the average of Y to increase by an estimated 171.5 units.
• The formula estimates that for each increase of 1 dollar in online advertising costs, the expected monthly e-commerce sales are predicted to increase by $171.5.
Error Calculation:
• The error is calculated using metrics such as Root Mean Squared Error, Mean Absolute Error, and Mean Squared Error.
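A sketch of fitting such a line with scikit-learn. The advertising-cost/sales numbers below are illustrative stand-ins, not the slide's actual dataset, so the fitted coefficients will differ from 125.8 and 171.5.

```python
# Fitting Y = b0 + b1*X by least squares on illustrative data.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # online advertising costs
y = np.array([300.0, 480.0, 620.0, 820.0, 980.0])   # monthly e-commerce sales

model = LinearRegression().fit(X, y)
print("intercept b0:", model.intercept_)   # plays the role of 125.8 in the slide
print("slope b1:", model.coef_[0])         # plays the role of 171.5 in the slide

# Each extra unit of ad spend shifts predicted sales by ~b1 units
print(model.predict([[6.0]]))
```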
MULTIPLE LINEAR REGRESSION – SUPERVISED LEARNING
Here, there is a positive relationship between the monthly e-commerce sales (Y) and the online advertising costs (X).
y = w1·x1 + w2·x2 + b
Multiple Linear Regression is a statistical technique used to predict the outcome of a response variable through several explanatory variables and to model the relationships between them. It represents a line fit between multiple inputs and one output.
• Multiple linear regression (MLR) is used to determine a mathematical relationship among several random variables.
• In other terms, MLR examines how multiple independent variables are related to one dependent variable.
• Once each of the independent factors has been determined to predict the dependent variable, the information on the multiple variables can be used to create an accurate prediction of the level of effect they have on the outcome variable.
• The model creates a relationship in the form of a straight line (linear) that best approximates all the individual data points.
Error Calculation:
• The error is calculated using metrics such as Root Mean Squared Error, Mean Absolute Error, and Mean Squared Error.
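A sketch of the same idea with two explanatory variables, including the error metrics named above; the data and the feature names are illustrative assumptions.

```python
# Multiple linear regression y = w1*x1 + w2*x2 + b on illustrative data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Columns: x1 = ad spend, x2 = number of promotions (hypothetical features)
X = np.array([[1.0, 2], [2.0, 1], [3.0, 4], [4.0, 3], [5.0, 5]])
y = np.array([320.0, 450.0, 700.0, 790.0, 1010.0])

mlr = LinearRegression().fit(X, y)
print("weights w1, w2:", mlr.coef_)   # effect of each independent variable
print("intercept b:", mlr.intercept_)

# Error calculation with the metrics named above
y_hat = mlr.predict(X)
print("MSE:", mean_squared_error(y, y_hat))
print("MAE:", mean_absolute_error(y, y_hat))
print("RMSE:", mean_squared_error(y, y_hat) ** 0.5)
```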
CLASSIFICATION ALGORITHMS – SUPERVISED LEARNING
A few classification algorithms
•Linear Models
• Logistic Regression
• Support Vector Machines
•Nonlinear models
• K-nearest Neighbours (KNN)
• Kernel Support Vector Machines (SVM)
• Naïve Bayes
• Decision Tree Classification
• Random Forest Classification
CLASSIFICATION ALGORITHMS – SUPERVISED LEARNING
Naïve Bayes
It is a classification technique based on Bayes’ Theorem
with an assumption of independence among predictors. In
simple terms, a Naive Bayes classifier assumes that the
presence of a particular feature in a class is unrelated to
the presence of any other feature.
Bayes’ Theorem: P(c|x) = P(x|c) · P(c) / P(x), where
• P(c|x) is the posterior probability of the class (c, target) given the predictor (x, attributes)
• P(c) is the prior probability of the class
• P(x|c) is the likelihood, i.e. the probability of the predictor given the class
• P(x) is the prior probability of the predictor
Example
A fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, all of these properties independently contribute to the probability that this fruit is an apple, and that is why it is known as ‘Naive’.
CLASSIFICATION ALGORITHMS – SUPERVISED LEARNING
Naïve Bayes Example
I have a training data set of weather conditions and a corresponding target variable ‘Play’ (possibilities of playing). Now, we need to classify whether players will play or not based on the weather conditions. Let’s follow the steps below to perform it.
To find: players will play if the weather is sunny.
Is this statement correct?
CLASSIFICATION ALGORITHMS – SUPERVISED LEARNING
Naïve Bayes Example
Steps to solve for the posterior probability
1. Convert the data set into a frequency table
2. Create a likelihood table by finding the probabilities, e.g. probability of Overcast = 0.29 and probability of playing = 0.64
3. Now, use the Naive Bayes equation to calculate the posterior probability for each class
4. The class with the highest posterior probability is the outcome of the prediction
P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P(Yes) = 9/14 = 0.64
P(Yes | Sunny) = 0.33 × 0.64 / 0.36 = 0.60 ⇒ higher probability to play!
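The same arithmetic in a few lines of Python, using the counts quoted above (3 sunny ‘Yes’ days, 5 sunny days total, 9 ‘Yes’ out of 14 rows).

```python
# Reproducing the posterior computation from the weather/'Play' example,
# using the counts from the slide's frequency table.
p_sunny_given_yes = 3 / 9     # P(Sunny | Yes)
p_sunny           = 5 / 14    # P(Sunny)
p_yes             = 9 / 14    # P(Yes)

# Bayes' theorem: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(round(p_yes_given_sunny, 2))   # 0.6 -> 'play' is the likelier class
```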
CLASSIFICATION ALGORITHMS – SUPERVISED LEARNING
Naïve Bayes Applications
Real-time Prediction:
• Naive Bayes is an eager learning classifier, and it is certainly fast. Thus, it can be used for making predictions in real time.
Multi-class Prediction:
• This algorithm is also well known for its multi-class prediction feature. Here we can predict the probability of multiple classes of the target variable.
Text Classification / Spam Filtering / Sentiment Analysis:
• Naive Bayes classifiers, mostly used in text classification (due to better results in multi-class problems and the independence rule), have a higher success rate compared to other algorithms. As a result, they are widely used in spam filtering (identifying spam e-mail) and sentiment analysis (in social media analysis, to identify positive and negative customer sentiments).
Recommendation Systems:
• A Naive Bayes classifier and collaborative filtering together build a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource or not.
MACHINE LEARNING VS DEEP LEARNING
NEURAL NETWORKS
REINFORCEMENT LEARNING
The science of making optimal decisions motivated by rewards
Carrying the Basic Understanding Forward
• The topics covered in this presentation are just droplets from the ocean of machine learning and data science concepts.
• If this presentation interested you, you can dig into reference materials for further study.
• You can also enrol in various data science courses online, read articles, and keep up with research in the fields of data science via websites like Towards Data Science, Analytics Vidhya, etc.
