 Data science is all about data.
 Every single tech company out there is collecting huge amounts of
data. And data is revenue.
 The more data you have, the more business insights you can
generate.
 Data Science is the art of turning data into actions by creating data products.
 Performing data science requires extracting timely, actionable information from diverse data sources to drive data products.
 Artificial intelligence is a field of computer science that builds computer systems able to mimic human intelligence. The term combines the two words "Artificial" and "intelligence", meaning "human-made thinking power." Hence we can define it as:
 Artificial intelligence is a technology with which we can create intelligent systems that simulate human intelligence.
 An artificial intelligence system does not need to be pre-programmed for every task; instead, it uses algorithms that can work with their own intelligence, such as reinforcement learning algorithms and deep learning neural networks. AI is used in many places, such as Siri, Google's AlphaGo, chess-playing programs, etc.
 Machine learning is about extracting knowledge from data. It can be defined as:
 Machine learning is a subfield of artificial intelligence which enables machines to learn from past data or experiences without being explicitly programmed.
 Machine learning enables a computer system to make predictions or take decisions using historical data without being explicitly programmed. It uses massive amounts of structured and semi-structured data so that a machine learning model can generate accurate results or predictions from that data.
 Machine learning works on algorithms that learn on their own from historical data, and it works only for specific domains: if we create a machine learning model to detect pictures of dogs, it will only give results for dog images; if we provide new data, such as a cat image, it will fail to recognize it. Machine learning is used in many places, such as online recommender systems, Google's search algorithms, email spam filters, Facebook's auto friend-tagging suggestions, etc.
It can be divided into three types:
 Supervised learning
 Reinforcement learning
 Unsupervised learning
 Artificial intelligence and machine learning are closely related branches of computer science. They are two of the most trending technologies used for creating intelligent systems.
 Although the two are related and people sometimes use them as synonyms for each other, they remain distinct terms in many contexts.
 AI is the broader concept of creating intelligent machines that can simulate human thinking capability and behaviour, whereas machine learning is an application or subset of AI that allows machines to learn from data without being programmed explicitly.
 A machine learning system learns from historical data, builds prediction models, and, whenever it receives new data, predicts the output for it. The accuracy of the predicted output depends upon the amount of data, as a huge amount of data helps build a better model which predicts the output more accurately.
 Suppose we have a complex problem where we need to perform some predictions. Instead of writing code for it, we just feed the data to generic algorithms; with the help of these algorithms, the machine builds the logic from the data and predicts the output. Machine learning has changed the way we think about such problems. The block diagram below explains the working of a machine learning algorithm:
Features of Machine Learning:
Machine learning uses data to detect various patterns in a given dataset.
It can learn from past data and improve automatically.
It is a data-driven technology.
Machine learning is similar to data mining, as both deal with huge amounts of data.
 Traditional programming refers to any manually created program that uses input data and runs on a computer to produce the output. It is a manual process: a person (the programmer) creates the program and has to formulate and code the rules by hand. We have the input data, and someone (the programmer) codes a program that uses that data and runs on a computer to produce the desired output.
 In Machine Learning, also known as augmented analytics, the input data and the output are fed to an algorithm to create a program. This yields powerful insights that can be used to predict future outcomes.
This is the basic difference between traditional programming and machine learning: in traditional programming, one has to manually formulate and code the rules, while in machine learning the algorithm automatically formulates the rules from the data, which is very powerful.
Fig:- Block diagram of decision flow architecture for ML System
 1. Data Acquisition
As machine learning relies on available data for the system to make decisions, the first step in the architecture is data acquisition. This involves collecting the data, preparing and segregating the case scenarios based on the features involved in the decision-making cycle, and forwarding the data to the processing unit for further categorization. This stage is sometimes called the data pre-processing stage. The data model expects reliable, fast and elastic data, which may be discrete or continuous in nature. The data is then passed into stream-processing systems (for continuous data) or stored in batch data warehouses (for discrete data) before being passed on to the data modelling or processing stages.
 2. Data Processing
The data received in the data acquisition layer is then sent forward to the data processing layer, where it is subjected to advanced integration and processing; this involves normalization of the data, data cleaning, transformation, and encoding. The data processing also depends on the type of learning being used. For example, if supervised learning is being used, the data needs to be segregated into samples for training the system; the data thus created is called training sample data, or simply training data.
 3. Data Modeling
This layer of the architecture involves selecting the algorithms that can adapt the system to the problem for which the learning is being devised. These algorithms are evolved from, or inherited from, a set of libraries, and are used to model the data accordingly; this makes the system ready for the execution step.
 4. Execution
This is the stage of machine learning where experimentation is done, testing is involved and tunings are performed. The general goal is to optimize the algorithm in order to extract the required machine outcome and maximize system performance. The output of this step is a refined solution capable of providing the required data for the machine to make decisions.
 5. Deployment
Like any other software output, ML outputs need to be operationalized or forwarded for further exploratory processing. The output can be considered a non-deterministic query which needs to be deployed into the decision-making system. It is advisable to move the ML output seamlessly to production, where it enables the machine to make decisions directly based on the output and reduces the dependency on further exploratory steps.
 Machine Learning Applications in Healthcare :-
Doctors and medical practitioners will soon be able to predict with accuracy how long patients with fatal diseases will live. Medical systems will learn from data and help patients save money by skipping unnecessary tests.
i) Drug Discovery/Manufacturing
ii) Personalized Treatment/Medication
 Machine Learning Applications in Finance :-
More than 90% of the top 50 financial institutions around the world are using machine learning and advanced analytics. The application of machine learning in the finance domain helps banks offer personalized services to customers at lower cost, achieve better compliance and generate greater revenue.
i) Machine Learning for Fraud Detection
ii) Machine Learning for Focused Account Holder Targeting
 Machine Learning Applications in Retail :-
Machine learning in retail is more than just the latest trend; retailers are implementing big data technologies like Hadoop and Spark to build big data solutions. Machine learning algorithms process this data intelligently and automate the analysis, making this ambitious goal achievable for retail giants like Amazon, Alibaba and Walmart.
i) Machine Learning Examples in Retail for Product Recommendations
ii) Machine Learning Examples in Retail for Improved Customer Service
 Machine Learning Applications in Travel :-
Travel providers can help travellers find the best time to book a hotel or buy a cheap ticket by taking advantage of machine learning. When a deal is available, the application can send a notification to the user.
i) Machine Learning Examples in Travel for Dynamic Pricing
ii) Machine Learning Examples in Travel for Sentiment Analysis
 Machine Learning Applications in Media :-
Machine learning offers the most efficient means of engaging billions of social media users. From personalizing news feeds to rendering targeted ads, machine learning is at the heart of all social media platforms, for their own and their users' benefit. Social media and chat applications have advanced to such an extent that users no longer pick up the phone or use email to communicate with brands; they leave a comment on Facebook or Instagram, expecting a speedier reply than through the traditional channels.
 Earlier, Facebook used to prompt users to tag their friends, but nowadays the social network's machine learning algorithm, based on artificial neural networks, identifies familiar faces from the contact list. The ANN (Artificial Neural Network) algorithm mimics the structure of the human brain to power facial recognition.
 A professional network like LinkedIn knows where you should apply for your next job, whom you should connect with, and how your skills stack up against your peers as you search for a new job.
Let's understand the types of data available in datasets from the perspective of machine learning.
1. Numerical Data :-
Any data points which are numbers are termed numerical data. Numerical data can be discrete or continuous. Continuous data can take any value within a given range, while discrete data takes distinct values. For example, the number of doors of a car is discrete, i.e. two, four, six, etc., while the price of the car is continuous, e.g. $1000 or $1250.50. The data type of numerical data is int64 or float64.
2. Categorical Data :-
Categorical data is used to represent characteristics, for example car colour, date of manufacture, etc. It can also be a numerical value, provided the numerical value indicates a class. For example, 1 can be used to denote a gas car and 0 a diesel car. We can use categorical data to form groups but cannot perform mathematical operations on it. Its data type is object.
3. Time Series Data :-
It is a sequence of numbers collected at regular intervals over a certain period of time. It is very important in fields like the stock market, where we need the price of a stock after each constant interval of time. This type of data has a temporal field attached to it, so that the timestamp of the data can be easily monitored.
4. Text Data :-
Text data is nothing but literals. The first step in handling text data is to convert it into numbers, as our model is mathematical and needs data in the form of numbers. To do so we might use a bag-of-words formulation, as in the sketch below.
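Here is a minimal bag-of-words sketch using scikit-learn's CountVectorizer (a common implementation; the toy corpus is made up, and get_feature_names_out assumes scikit-learn 1.0+):
from sklearn.feature_extraction.text import CountVectorizer
docs = ["the car is red", "the car is fast", "red fast car"]  # toy corpus
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # sparse matrix of word counts
print(vectorizer.get_feature_names_out())   # vocabulary learned from the corpus
print(X.toarray())                          # one count vector per document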
 ML Dataset :-
A machine learning dataset is defined as the collection of data needed to train a model and make predictions. Datasets are classified as structured or unstructured: structured datasets are in tabular format, in which each row corresponds to a record and each column to a feature, while unstructured datasets correspond to images, text, speech, audio, etc. The data is acquired through Data Acquisition, Data Wrangling and Data Exploration. During the learning process these datasets are divided into training, validation and test sets for training the model and measuring its accuracy.
 Types of datasets :-
1. Training Dataset: This dataset is used to train the model, i.e. it is used to update the weights of the model.
2. Validation Dataset: This type of dataset is used to reduce overfitting. It is used to verify that an increase in accuracy on the training dataset actually carries over to data that was not used in training. If the accuracy over the training dataset increases while the accuracy over the validation dataset decreases, this is a case of high variance, i.e. overfitting.
3. Test Dataset: Often, when we make changes to the model based on the output of the validation set, we unintentionally let the model peek into the validation set, and as a result the model might overfit on the validation set as well. To overcome this issue we keep a test dataset that is only used to test the final output of the model in order to confirm its accuracy. A common way to obtain the three sets is sketched below.
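One common way to carve out all three sets is two successive calls to scikit-learn's train_test_split; the 60/20/20 ratio and toy data below are only illustrative assumptions:
import numpy as np
from sklearn.model_selection import train_test_split
X = np.arange(20).reshape(10, 2)  # toy feature matrix
y = np.arange(10)                 # toy labels
# First split off 20% as the test set, then split the remainder into train/validation.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)
# 0.25 of the remaining 80% yields a 60/20/20 train/validation/test split.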
 Pre-processing refers to the changes applied to our data before feeding it to the ML algorithm. Data pre-processing is a technique used to convert the collected raw data into a clean dataset. In other words, whenever data is gathered from different sources, it is collected in a raw format which is not feasible for analysis or processing by an ML model. The following figure shows the transformations performed on raw data before, during and after applying ML techniques:
 Data Pre-processing in Machine Learning can be broadly divided into 3 main
parts –
1.Data Integration
2.Data Cleaning
3.Data Transformation
There are various steps in each of these 3 broad categories. Depending on
the data and machine learning algorithm involved, not all steps might be
required though.
1. Data Integration and Formatting :-
During hackathons and competitions we often deal with a single CSV or Excel file containing all the training data. But in the real world the sources of data are rarely this simple: we may have to extract data from various sources and integrate it.
2. Data Cleaning :-
2.1 Dealing with Missing Data :-
It is common to have some missing or null values in a real-world dataset. Most machine learning algorithms will not work with such data, so it becomes important to deal with missing or null values. Some of the common measures taken are:
 Get rid of a column if plenty of its rows have null values.
 Eliminate a row if plenty of its columns have null values.
 Replace the missing value with the mean, median or mode of that column, depending on the data distribution in that column.
 Substitute the missing values with 'NA' or 'Unknown' or some other relevant term; in the case of a categorical feature column, we can treat missing data as a new category in itself.
 Come up with an educated guess of a possible candidate value by applying regression or classification techniques to predict the missing value. A minimal pandas sketch of the simpler options follows.
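This sketch assumes pandas; the column names and values are made up for illustration:
import numpy as np
import pandas as pd
df = pd.DataFrame({"Age": [25, np.nan, 40, 35],
                   "City": ["Pune", None, "Delhi", "Pune"]})
# df = df.dropna()                              # alternatively, drop rows containing nulls
df["Age"] = df["Age"].fillna(df["Age"].mean())  # numeric column: replace with the mean
df["City"] = df["City"].fillna("Unknown")       # categorical column: missing becomes its own category
print(df)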
2.2 Remove Noise from Data :-
Noise consists of slightly erroneous observations that do not follow the trend or distribution of the rest of the data. Though each individual error may be small, cumulatively noisy data results in a poor machine learning model. Noise in data can be minimized or smoothed out using the popular techniques below –
 Binning
 Regression
 Binning Method:
 First sort the data and partition it into bins.
 Then one can smooth by bin means, medians or boundaries (a Python sketch follows the example).
For example:
• Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
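A small sketch of the mean-smoothing variant with equal-size bins, matching the example above:
import numpy as np
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]        # already sorted
bins = np.array_split(prices, 3)                              # partition into 3 equal-size bins
smoothed = [int(np.mean(b).round()) for b in bins for _ in b] # replace each value by its bin mean
print(smoothed)  # [9, 9, 9, 9, 23, 23, 23, 23, 29, 29, 29, 29]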
 Regression Method:
 Here the data can be smoothed by fitting it to a function.
 Linear regression involves finding the "best" line to fit two attributes, so that one attribute can be used to predict the other (a minimal sketch follows this list).
 Multiple linear regression is an extension of linear regression in which more than two attributes are involved and the data are fit to a multidimensional surface.
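A minimal sketch with numpy's polyfit, fitting the "best" least-squares line to made-up noisy data:
import numpy as np
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # attribute used to predict
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])    # noisy attribute to be smoothed
slope, intercept = np.polyfit(x, y, 1)      # least-squares best-fit line
y_smoothed = slope * x + intercept          # replace noisy values with fitted values
print(y_smoothed)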
2.3 Remove Outliers from Data :-
 Outliers are observations with extreme values, far beyond the normal range of values for that feature. For example, the very high salary of a company's CEO can be an outlier if we consider the salaries of the other, regular employees of the company.
 Even a few outliers in a dataset can contribute to poor accuracy of a machine learning model. The common methods to detect and remove outliers are –
 Standard Deviation
 Box Plot
 Standard Deviation
In statistics, if a data distribution is approximately normal, then about 68% of the data values lie within one standard deviation of the mean, about 95% within two standard deviations, and about 99.7% within three standard deviations.
 Therefore, any data point that lies more than three standard deviations from the mean is very likely to be anomalous, i.e. an outlier; a sketch of this rule follows.
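A sketch of the three-standard-deviation rule in numpy, using synthetic data with one injected extreme value:
import numpy as np
rng = np.random.default_rng(0)
data = np.append(rng.normal(50, 5, 1000), 120)  # normal data plus one extreme value
mean, std = data.mean(), data.std()
outliers = data[np.abs(data - mean) > 3 * std]
print(outliers)  # the injected 120 is flagged (a few genuine extreme draws may be too)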
 Box Plots:
 Box plots are a graphical depiction of numerical data through their quartiles. They are a very simple but effective way to visualize outliers. Think of the lower and upper whiskers as the boundaries of the data distribution: any data points that show up above or below the whiskers can be considered outliers or anomalous.
 The concept of the Interquartile Range (IQR) is used to build box plot graphs. The IQR is a concept in statistics used to measure statistical dispersion and data variability by dividing the dataset into quartiles.
 In simple words, any dataset or set of observations is divided into four defined intervals based on the values of the data and how they compare to the entire dataset. The quartiles divide the data at three points into four intervals; a sketch of IQR-based detection follows.
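A sketch of IQR-based detection, using the usual 1.5 × IQR fences that define the box plot whiskers (the salaries are made up, with the CEO as the outlier):
import numpy as np
salaries = np.array([40, 45, 50, 52, 55, 58, 60, 500])
q1, q3 = np.percentile(salaries, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(salaries[(salaries < lower) | (salaries > upper)])  # -> [500]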
2.4 Dealing with Duplicate Data :-
The approach to dealing with duplicate data depends on whether the duplicates represent a real-world scenario or are an inconsistency. If it is the former, the duplicate data should be preserved; otherwise it should be removed.
2.5 Dealing with Inconsistent Data :-
Some data might not be consistent with the business rules. Identifying and dealing with such inconsistencies requires domain knowledge.
3. Data Transformation :-
3.1 Feature Scaling :-
Feature scaling is one of the most important prerequisites for most machine learning algorithms. The features of a dataset can each have their own range of values, far different from one another. For example, human age mostly lies within the range of 0-100 years, but the populations of countries are in the millions.
These huge differences between the ranges of features in a dataset can distort the training of a machine learning model, so we need to bring the ranges of all the features to a common scale. The common approaches to feature scaling, sketched below, are –
 Mean Normalization
 Min-Max Normalization
 Z-Score Normalization or Standardization
We will give a brief overview of these approaches, with examples, in the last section of this unit.
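As a preview, here is a minimal numpy sketch of the three approaches on a made-up age column:
import numpy as np
ages = np.array([20.0, 35.0, 50.0, 65.0, 80.0])
mean_norm = (ages - ages.mean()) / (ages.max() - ages.min())  # mean normalization
min_max = (ages - ages.min()) / (ages.max() - ages.min())     # min-max: squeezed into [0, 1]
z_score = (ages - ages.mean()) / ages.std()                   # z-score: zero mean, unit variance
print(mean_norm, min_max, z_score)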
3.2 Dealing with Categorical Data :-
 Categorical data, also known as qualitative data, is text or string-based data. Examples of categorical data are the gender of a person (Male or Female), names of places (India, America, England) and the colour of a car (Red, White).
 Most machine learning algorithms work on numerical data only and cannot process categorical data, so we need to transform categorical data into numerical form without losing the sense of the information. Below are the popular approaches to convert categorical data into numerical form (see the sketch after this list) –
 Label Encoding
 One Hot Encoding
 Binary Encoding
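A short pandas sketch of the first two approaches (binary encoding usually needs an extra package such as category_encoders, so it is omitted here):
import pandas as pd
df = pd.DataFrame({"colour": ["Red", "White", "Red", "Blue"]})
df["colour_label"] = df["colour"].astype("category").cat.codes  # label encoding: category -> integer code
one_hot = pd.get_dummies(df["colour"], prefix="colour")         # one-hot: one 0/1 column per category
print(pd.concat([df, one_hot], axis=1))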
3.3 Dealing with Imbalanced Data Sets :-
 An imbalanced dataset is one in which most of the data belongs to one class and very little belongs to the other class. This is common in medical diagnosis and anomaly detection, where the data belonging to the positive class is a very small percentage.
 For example, only 5-10% of the data might belong to the disease-positive class, which can be an expected distribution in medical diagnosis. But this skewed distribution can trick the machine learning model during the training phase into only identifying the majority class while failing to learn the minority class. For example, the model might fail to identify the medical condition even though it shows a very high accuracy by identifying the negative scenarios.
 We need to do something about an imbalanced dataset to avoid a bad machine learning model. Below are some approaches to deal with this situation (a SMOTE sketch follows the list) –
 Under Sampling the Majority Class
 Over Sampling the Minority Class
 SMOTE (Synthetic Minority Oversampling Technique)
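As a sketch, the imbalanced-learn package (a separate install: pip install imbalanced-learn) provides all three strategies; SMOTE is shown below on synthetic data:
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE  # from the imbalanced-learn package
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)  # ~95%/5% imbalance
print(Counter(y))                         # heavily skewed class counts
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))                     # balanced after synthetic minority oversampling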
3.4 Feature Engineering :-
 Feature engineering is the art of creating new features from the given data by applying domain knowledge, common sense, or both.
 A very common example of feature engineering is converting a Date feature into additional features like Day, Week, Month and Year, thus adding more information to the dataset (see the sketch below).
 Feature engineering enriches the dataset with more information that can contribute to a good machine learning model.
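A sketch of the Date example with pandas (the dates are made up; isocalendar assumes pandas 1.1+):
import pandas as pd
df = pd.DataFrame({"Date": pd.to_datetime(["2021-01-15", "2021-06-03"])})
df["Day"] = df["Date"].dt.day
df["Week"] = df["Date"].dt.isocalendar().week  # ISO week number
df["Month"] = df["Date"].dt.month
df["Year"] = df["Date"].dt.year
print(df)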
3.5 Training-Test Data Split :-
 This is usually the last step of data pre-processing, performed just before training a supervised machine learning model: the dataset, after undergoing all the cleaning and transformation, is divided into two parts – one for training the machine learning model and one for testing the trained model.
 Though there is no hard rule of thumb, the training-test split is usually done randomly at an 80%-20% ratio. While splitting the dataset, care has to be taken that there is no loss of information in the training dataset.
Example: We have registered the speed of 13 cars:
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
1. Mean - the average value.
 (99+86+87+88+111+86+103+87+94+78+77+85+86) / 13 = 89.77
 The NumPy module has a mean() method for this.
 import numpy
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
x = numpy.mean(speed)
print(x)
2. Median - the midpoint value.
 The median value is the value in the middle after you have sorted all the values:
 (77, 78, 85, 86, 86, 86, 87, 87, 88, 94, 99, 103, 111) → 87
 It is important that the numbers are sorted before you find the median.
 The NumPy module has a median() method to find the middle value.
 import numpy
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
x = numpy.median(speed)
print(x)
 If there are two numbers in the middle, divide the sum of those numbers by two.
 Ex. (77, 78, 85, 86, 86, 86, 87, 87, 94, 98, 99, 103)
 (86 + 87) / 2 = 86.5
3. Mode - the most common value.
 The mode is the value that appears the most often.
 (99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86) = 86
 The SciPy module has a mode() method to find the number that appears most often.
 from scipy import stats
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
x = stats.mode(speed)
print(x)
 Standard deviation is a number that describes how spread out the values are.
 A low standard deviation means that most of the numbers are close to the mean (average) value.
 A high standard deviation means that the values are spread out over a wider range.
 Example: We have registered the speed of 7 cars:
 speed = [86,87,88,86,87,85,86]
 The standard deviation is 0.9.
 Meaning that most of the values are within 0.9 of the mean value, which is 86.4.
 Example: We have registered the speed of another 7 cars:
 speed = [32,111,138,28,59,77,97]
 The standard deviation is 37.85.
 Meaning that most of the values are within 37.85 of the mean value, which is 77.4.
 A higher standard deviation indicates that the values are spread out over a wider range.
 The NumPy module has a std() method to find the standard deviation.
 import numpy
speed = [86,87,88,86,87,85,86]
x = numpy.std(speed)
print(x)
 import numpy
speed = [32,111,138,28,59,77,97]
x = numpy.std(speed)
print(x)
 Standard Deviation is often represented by the symbol Sigma: σ
 Variance is another number that indicates how spread out the values are.
 If you take the square root of the variance, you get the standard deviation!
 Conversely, if you multiply the standard deviation by itself, you get the variance!
 To calculate the variance you have to do as follows:
1. Find the mean:
(32+111+138+28+59+77+97) / 7 = 77.4
2. For each value: find the difference from the mean:
32 - 77.4 = -45.4
111 - 77.4 = 33.6
138 - 77.4 = 60.6
28 - 77.4 = -49.4
59 - 77.4 = -18.4
77 - 77.4 = - 0.4
97 - 77.4 = 19.6
3. For each difference: find the square value:
(-45.4)² = 2061.16
(33.6)² = 1128.96
(60.6)² = 3672.36
(-49.4)² = 2440.36
(-18.4)² = 338.56
(-0.4)² = 0.16
(19.6)² = 384.16
4. The variance is the average of these squared differences:
(2061.16+1128.96+3672.36+2440.36+338.56+0.16+384.16) / 7 = 1432.2
 The NumPy module has a NumPy var() method to find the variance.
import numpy
speed = [32,111,138,28,59,77,97]
x = numpy.var(speed)
print(x)
 Variance is often represented by the symbol Sigma Squared: σ²
 Percentiles are used in statistics to give you a number that describes the value that a given percent of the values are lower than.
 Example: we have an array of the ages of all the people that live in a street.
 ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]
 What is the 75th percentile? The answer is 43, meaning that 75% of the people are 43 or younger.
 The NumPy module has a percentile() method to find the percentiles:
 import numpy
ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]
x = numpy.percentile(ages, 75)
print(x)
 What is the age that 90% of the people are younger than?
 import numpy
ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]
x = numpy.percentile(ages, 90)
print(x)
 Statistics plays a major role in the field of research. It helps us in the collection, analysis and presentation of data.
Before starting with descriptive and inferential statistics, let us get the basic idea of population and sample.
Population :-
The population is the group that is targeted to collect the data from; our data is the information collected from the population. The population is always defined first, before starting the data collection process for any statistical study. A population does not necessarily consist of people: it could be a batch of batteries, measurements of rainfall in an area, or a group of people.
Sample :-
A sample is the part of the population which is selected randomly for the study. The sample should be selected such that it represents all the characteristics of the population. The process of selecting the subset from the population is called sampling, and the subset selected is called the sample.
1. Descriptive Statistics :-
It describes the important characteristics/properties of the data using measures of central tendency like mean/median/mode and measures of dispersion like range, standard deviation, variance, etc.
Data can be summarized and represented in an accurate way using charts, tables and graphs.
For example: we have the marks of 1000 students and we may be interested in the overall performance of those students as well as the distribution and spread of the marks. Descriptive statistics provides us the tools to describe our data in the most understandable and appropriate way.
2. Inferential Statistics :-
It is about using data from a sample to make inferences about the larger population from which the sample was drawn. The goal of inferential statistics is to draw conclusions from a sample and generalize them to the population. It determines the probability of the characteristics of the sample using probability theory. The most common methodologies used are hypothesis tests, analysis of variance, etc.
For example: suppose we are interested in the exam marks of all the students in India. It is not feasible to measure the exam marks of all the students in India, so we measure the marks of a smaller sample of students, for example 1000 students. This sample now represents the large population of Indian students. We would use this sample for our statistical study of the population from which it is drawn.
 Probability theory is a branch of mathematics concerned with the analysis of random phenomena. The outcome of a random event cannot be determined before it occurs, but it may be any one of several possible outcomes. The actual outcome is considered to be determined by chance.
 Ex. Tossing a coin.
When a coin is tossed, there are two possible outcomes:
1. heads (H) or
2. tails (T)
We say that the probability of the coin landing H is ½, and the probability of the coin landing T is ½.
Ex. Throwing a die.
When a single die is thrown, there are six possible outcomes: 1, 2, 3, 4, 5, 6. The probability of any one of them is 1/6.
So,
Probability of an event happening = (Number of ways it can happen) / (Total number of outcomes)
 Many supervised and unsupervised machine learning models, such as K-NN and K-Means, depend upon the distance between two data points to predict the output. Therefore, the metric we use to compute distances plays an important role in these models.
1. Euclidean Distance:
 Euclidean distance is the straight-line distance between two data points in a plane.
 It is calculated from the Minkowski distance formula by setting 'p' to 2; it is therefore also known as the L2-norm distance metric.
 The formula is similar to the Pythagorean theorem, and thus it is also known as the Pythagorean distance. The formula is:
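For two points p = (p1, p2, …, pn) and q = (q1, q2, …, qn):
d(p, q) = √((p1 − q1)² + (p2 − q2)² + … + (pn − qn)²)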
2. Manhattan Distance:
 We use Manhattan distance, also known as city block distance or taxicab geometry, when we need to calculate the distance between two data points along a grid-like path. The Manhattan distance metric can be understood with the help of a simple example.
 Imagine each cell of a grid to be a building, and the grid lines to be roads. If we want to travel from point A to point B along the roads, the path is not straight and there are turns. In this case, we use the Manhattan distance metric to calculate the distance walked.
 We get the equation for Manhattan distance by substituting p = 1 in the Minkowski distance formula. The formula is:
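For the same two points p and q:
d(p, q) = |p1 − q1| + |p2 − q2| + … + |pn − qn|
A small numpy sketch computing both metrics for two made-up points:
import numpy
p = numpy.array([1.0, 2.0])
q = numpy.array([4.0, 6.0])
euclidean = numpy.sqrt(numpy.sum((p - q) ** 2))  # straight-line distance -> 5.0
manhattan = numpy.sum(numpy.abs(p - q))          # taxicab distance -> 7.0
print(euclidean, manhattan)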
 In the early days, people performed machine learning tasks by manually coding all the algorithms and the mathematical and statistical formulas. This made the process time-consuming, tedious and inefficient. In modern times it has become much easier and more efficient thanks to various Python libraries, frameworks and modules. Today, Python is one of the most popular programming languages for this task, and it has replaced many languages in the industry; one of the reasons is its vast collection of libraries. Python libraries used in machine learning include:
 Numpy
 Scipy
 Scikit-learn
 Theano
 TensorFlow
 Keras
 PyTorch
 Pandas
 Matplotlib
 The single most important reason for the popularity of Python in the field of AI and ML is that it provides thousands of inbuilt libraries with ready-made functions and methods to easily carry out data analysis, processing, wrangling, modelling and so on.
In the sections below, we discuss the libraries for the following tasks:
1. Statistical Analysis
2. Data Visualization
3. Data Modelling and Machine Learning
4. Deep Learning
5. Natural Language Processing (NLP)
1. Statistical Analysis:
 Python comes with tons of libraries for the sole purpose of statistical analysis. The top statistical packages that provide in-built functions to perform the most complex statistical computations are:
NumPy :- NumPy, or Numerical Python, is one of the most commonly used Python libraries. The main feature of this library is its support for multi-dimensional arrays for mathematical and logical operations. Functions provided by NumPy can be used for indexing, sorting, reshaping, and representing images and sound waves as multi-dimensional arrays of real numbers.
SciPy :- Built on top of NumPy, the SciPy library is a collection of sub-packages that help in solving the most basic problems related to statistical analysis. It is used to process the array elements defined using the NumPy library, and is often used to compute mathematical equations that cannot be done using NumPy alone.
Pandas :- Pandas is another important statistical library mainly used in a
wide range of fields including, statistics, finance, economics, data analysis
and so on. The library relies on the NumPy array for the purpose of
processing pandas data objects. NumPy, Pandas, and SciPy are heavily
dependent on each other for performing scientific computations, data
manipulation and so on. Pandas is one of the best libraries for processing
huge chunks of data, whereas NumPy has excellent support for multi-
dimensional arrays and Scipy, on the other hand, provides a set of sub-
packages that perform a majority of the statistical analysis tasks.
StatsModels :- Built on top of NumPy and SciPy, the StatsModels Python
package is the best for creating statistical models, data handling and model
evaluation. Along with using NumPy arrays and scientific models from SciPy
library, it also integrates with Pandas for effective data handling. This library
is famously known for statistical computations, statistical testing, and data
exploration.
2. Data Visualization:
 A picture speaks more than a thousand words. Data visualization is all about expressing the key insights from data effectively through graphical representations. It includes the implementation of graphs, charts, mind maps, heat maps, histograms, density plots, etc., to study the correlations between various data variables.
 The best Python data visualization packages that provide in-built functions to study the dependencies between various data features are:
Matplotlib :- Matplotlib is the most basic data visualization package in Python. It provides support for a wide variety of graphs such as histograms, bar charts, power spectra, error charts, and so on. It is a 2-dimensional graphical library that produces clear and concise graphs, which are essential for Exploratory Data Analysis (EDA).
Seaborn :- The Matplotlib library forms the base of the Seaborn library. In comparison to Matplotlib, Seaborn can be used to create more appealing and descriptive statistical graphs. Along with extensive support for data visualization, Seaborn comes with an inbuilt dataset-oriented API for studying the relationships between multiple variables.
Plotly :- Plotly is one of the most well-known graphical Python libraries. It provides interactive graphs for understanding the dependencies between target and predictor variables. It can be used to analyze and visualize statistical, financial, commercial and scientific data to produce clear and concise graphs, sub-plots, heat maps, 3D charts and so on.
Bokeh :- One of the most interactive libraries in Python, Bokeh can be used to build descriptive graphical representations for web browsers. It can easily process huge datasets and build versatile graphs that help in performing extensive EDA. Bokeh provides well-defined functionality for building interactive plots, dashboards, and data applications.
3. Machine Learning :
 Implementing ML, DL, etc. involves coding thousands of lines, and this can become even more cumbersome when you want to create models that solve complex problems through neural networks. Thankfully, we don't have to code the algorithms ourselves, because Python comes with several packages built just for implementing machine learning techniques and algorithms.
 The top ML packages that provide in-built functions to implement all the ML algorithms are:
Scikit-learn :- One of the most useful Python libraries, Scikit-learn is the best library for data modelling and model evaluation. It comes with tons of functions for the sole purpose of creating models. It contains all the supervised and unsupervised machine learning algorithms, and it also comes with well-defined functions for ensemble learning and boosting.
XGBoost :- XGBoost, which stands for Extreme Gradient Boosting, is one of the best Python packages for boosting machine learning. Libraries such as LightGBM and CatBoost are equally well equipped with well-defined functions and methods. XGBoost is built mainly for implementing gradient boosting machines, which are used to improve the performance and accuracy of machine learning models.
ELI5 :- ELI5 is another Python library, mainly used to inspect, debug and explain the predictions of machine learning models. It is relatively new and is usually used alongside XGBoost, LightGBM, CatBoost and so on.
4. Deep Learning :
 The biggest advancements in ML and AI have come through deep learning. With the introduction of deep learning, it is now possible to build complex models and process huge datasets. Thankfully, Python provides the best deep learning packages for building effective neural networks.
 The top deep learning packages that provide in-built functions to implement Convolutional Neural Networks are:
Tensorflow :- One of the best Python libraries for deep learning, TensorFlow is an open-source library for dataflow programming across a range of tasks. It is a symbolic math library used for building strong and precise neural networks. It provides an intuitive multiplatform programming interface that is highly scalable over a vast domain of fields.
Pytorch :- Pytorch is an open-source, Python-based scientific computing package used to implement deep learning techniques and neural networks on large datasets. This library is actively used by Facebook to develop neural networks that help in various tasks such as face recognition and auto-tagging.
Keras :- Keras is considered one of the best deep learning libraries in Python. It provides full support for building, analyzing, evaluating and improving neural networks. Keras is built on top of the Theano and TensorFlow Python libraries, which provide additional features for building complex and large-scale deep learning models.
5. Natural Language Processing:
 Have you ever wondered how Google so aptly predicts what you're searching for? The technology behind Alexa, Siri, and other chatbots is Natural Language Processing. NLP has played a huge role in designing AI-based systems that help describe the interaction between human language and computers.
 The top Natural Language Processing packages that provide in-built functions to implement high-level AI-based systems are:
NLTK (Natural Language Toolkit) :-
NLTK is considered to be the best Python package for analyzing human language and behaviour. Preferred by most data scientists, the NLTK library provides easy-to-use interfaces containing over 50 corpora and lexical resources that help in describing human interactions and in building AI-based systems such as recommendation engines.
spaCy :- spaCy is a free, open-source Python library for implementing advanced Natural Language Processing (NLP) techniques. When you're working with a lot of text, it is important to understand the morphological meaning of the text and how it can be classified to understand human language. These tasks can be easily achieved with spaCy.
Gensim :- Gensim is another open-source Python package, modelled to extract semantic topics from large documents and texts in order to process, analyze and predict human behaviour through statistical models and linguistic computations. It has the capability to process huge amounts of data, even when the data is raw and unstructured.
 How Can We Get Big Data Sets?
 To create big data sets for testing, we use the Python module NumPy, which comes with a number of methods to create random data sets of any size.
Example
Create an array containing 250 random floats between 0 and 5:
import numpy
x = numpy.random.uniform(0.0, 5.0, 250)
print(x)
 Histogram
 To visualize the data set we can draw a histogram with the data we
collected.
 We will use the Python module Matplotlib to draw a histogram:
Example
Draw a histogram:
import numpy
import matplotlib.pyplot as plt
x = numpy.random.uniform(0.0, 5.0, 250)
plt.hist(x, 5)
plt.show()
 We use the array from the example above to draw a histogram with 5 bars.
 The first bar represents how many values in the array are between 0 and 1.
 The second bar represents how many values are between 1 and 2, etc.
 Which gives us this result:
52 values are between 0 and 1
48 values are between 1 and 2
49 values are between 2 and 3
51 values are between 3 and 4
50 values are between 4 and 5
 Big Data Distributions
An array containing 250 values is not considered very big, but now you know how to create a random set of values, and, by changing the parameters, you can make the data set as big as you want.
Example
Create an array with 100000 random numbers, and display them
using a histogram with 100 bars:
import numpy
import matplotlib.pyplot as plt
x = numpy.random.uniform(0.0, 5.0, 100000)
plt.hist(x, 100)
plt.show()
 In probability theory this kind of data distribution is known as the normal data distribution, or the Gaussian data distribution, after the mathematician Carl Friedrich Gauss, who came up with the formula for this distribution.
Example
A typical normal data distribution:
import numpy
import matplotlib.pyplot as plt
x = numpy.random.normal(5.0, 1.0, 100000)
plt.hist(x, 100)
plt.show()
 A normal distribution graph is also known as the bell curve because of its characteristic bell shape.
 We use the array from the numpy.random.normal() method, with 100000
values, to draw a histogram with 100 bars.
 We specify that the mean value is 5.0, and the standard deviation is 1.0.
 Meaning that the values should be concentrated around 5.0, and rarely
further away than 1.0 from the mean.
 And as you can see from the histogram, most values are between 4.0 and
6.0, with a top at approximately 5.0.
 A scatter plot is a diagram where each value in the data set is represented
by a dot.
 The Matplotlib module has a method for drawing scatter plots, it needs two
arrays of the same length, one for the values of the x-axis, and one for the
values of the y-axis:
 x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
 y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
 The x array represents the age of each car.
 The y array represents the speed of each car.
Example
Use the scatter() method to draw a scatter plot diagram:
import matplotlib.pyplot as plt
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
plt.scatter(x, y)
plt.show()
 The x-axis represents ages, and the y-axis represents speeds.
 What we can read from the diagram is that the two fastest cars were both 2
years old, and the slowest car was 12 years old.
 Random Data Distributions
In machine learning, data sets can contain thousands, or even millions, of values.
 Let us create two arrays that are both filled with 1000 random numbers from
a normal data distribution.
 The first array will have the mean set to 5.0 with a standard deviation of 1.0.
 The second array will have the mean set to 10.0 with a standard deviation of
2.0:
Example
A scatter plot with 1000 dots:
import numpy
import matplotlib.pyplot as plt
x = numpy.random.normal(5.0, 1.0, 1000)
y = numpy.random.normal(10.0, 2.0, 1000)
plt.scatter(x, y)
plt.show()
 We can see that the dots are concentrated around the value 5 on the x-axis,
and 10 on the y-axis.
 We can also see that the spread is wider on the y-axis than on the x-axis.
 Execution of data pre-processing methods using Python commonly involves the following steps:
 Importing the libraries
 Importing the Dataset
 Handling of Missing Data
 Handling of Categorical Data
 Splitting the dataset into training and testing datasets
 Feature Scaling
For this data pre-processing script, we are going to use Anaconda Navigator, and specifically the Spyder IDE, to write the following code.
Importing the libraries :-
import numpy as np   # used for handling numbers
import pandas as pd  # used for handling the dataset
from sklearn.impute import SimpleImputer  # used for handling missing data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder  # used for encoding categorical data
from sklearn.model_selection import train_test_split  # used for splitting training and testing data
from sklearn.preprocessing import StandardScaler  # used for feature scaling
from sklearn.compose import ColumnTransformer  # used to apply transformers to columns of an array
Importing the Dataset :-
 First of all, let us have a look at the dataset we are going to use for this particular example. You can download or take this dataset from:
https://github.com/tarunlnmiit/machine_learning/blob/master/DataPreprocessing.csv
 It is as shown below:
 By pressing the Raw button at the link, copy this dataset and store it in a Data.csv file in the folder where your program is stored.
 In order to import this dataset into our script, we use pandas as follows.
dataset = pd.read_csv('Data.csv')  # import the dataset into a variable
# Split the attributes into independent and dependent attributes
X = dataset.iloc[:, :-1].values  # independent attributes
Y = dataset.iloc[:, -1].values   # dependent variable / class
 When you run this code section along with the library imports, you should not see any errors. When it executes successfully, you can open the Variable Explorer in the Spyder UI and you will see the following three variables.
 When you double click on each of these variables, you should see
something similar.
Handling of Missing Data :-
 The first idea is to remove the rows/observations where some data is missing. But that can be quite dangerous, because this dataset may contain key information, and it would be risky to remove such observations. So we need a better way to handle this problem, and the most common idea for handling missing data is to take the mean of the column, as discussed in the earlier section.
 If you noticed, our dataset has two values missing: one in the Age column at data index 6 and one in the Income column at data row 4. Missing values should be handled during the data analysis, so we do that as follows.
# Handle the missing data: replace missing values (nan from numpy)
# with the mean of all the other values in the column.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:])
X[:, 1:] = imputer.transform(X[:, 1:])
 After execution of this code, the independent variable X will transform into the following.
Here you can see that the missing values have been replaced by the average values of the respective columns.
Handling of Categorical Data :-
 In this dataset we have two categorical variables: the Region variable and the Online Shopper variable. These two variables are categorical because they simply contain categories: Region contains three categories (India, USA and Brazil) and Online Shopper contains two (Yes and No). That is why they are called categorical variables.
 Since machine learning models are based on mathematical equations, you can naturally understand that keeping the text of the categorical variables in the equations would cause problems, because we only want numbers in the equations. That is why we need to encode the categorical variables, i.e. encode the text into numbers. To do this we use the following code snippet.
 # encode categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
rg = ColumnTransformer([("Region", OneHotEncoder(), [0])], remainder='passthrough')
X = rg.fit_transform(X)
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
After execution of this code, the independent variable X and the dependent variable Y will transform into the following.
 Here you can see that the Region variable is now made up of a 3-bit binary encoding, with one bit per country; the columns follow the encoder's category order (alphabetical: Brazil, India, USA). If a bit is 1, the row represents data for that country; otherwise it does not.
For the Online Shopper variable, 1 represents Yes and 0 represents No.
Splitting the dataset into training and testing datasets :-
 Any machine learning algorithm needs to be tested for accuracy. In order to do that, we divide our dataset into two parts: a training set and a testing set. As the names suggest, we use the training set to make the algorithm learn the behaviours present in the data, and we check the correctness of the algorithm by testing it on the testing set. In Python, we do that as follows:
 # splitting the dataset into training set and test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
 Here we take the training set to be 80% of the original dataset and the testing set to be 20%. This is the usual ratio in which they are split, though you may sometimes come across a 70-30% or 75-25% split. You don't want to split it 50-50%, however, as that leaves the model too little data to learn from and leads to a poorly fitted model. For now, we are going to split it in an 80-20% ratio.
 After the split, our training set and testing set look like this.
Feature Scaling (Variable Transformation) :-
 As you can see, we have two columns, Age and Income, that contain numerical values. You will notice that the variables are not on the same scale: the ages go from 32 to 55, while the salaries go from 57.6K to 99.6K. Because the age and salary variables do not have the same scale, this will cause issues in your machine learning models. Why? Because many ML models are based on the Euclidean distance.
 We use feature scaling to convert the different scales to a standard scale, to make things easier for machine learning algorithms. We do this in Python as follows:
# feature scaling
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
 After the execution of this code, our training independent variable X_train and our testing independent variable X_test look like this.
Thanks !!!
Understanding The Pattern Of RecognitionUnderstanding The Pattern Of Recognition
Understanding The Pattern Of Recognition
 
Eckovation Machine Learning
Eckovation Machine LearningEckovation Machine Learning
Eckovation Machine Learning
 
Unit IV.pdf
Unit IV.pdfUnit IV.pdf
Unit IV.pdf
 
Pattern recognition
Pattern recognitionPattern recognition
Pattern recognition
 
Machine Learning Contents.pptx
Machine Learning Contents.pptxMachine Learning Contents.pptx
Machine Learning Contents.pptx
 
YASH DATA SCIENCE SEMINAR.pptx
YASH DATA SCIENCE SEMINAR.pptxYASH DATA SCIENCE SEMINAR.pptx
YASH DATA SCIENCE SEMINAR.pptx
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
ML vs AI
ML vs AIML vs AI
ML vs AI
 
Machine Learning in Business What It Is and How to Use It
Machine Learning in Business What It Is and How to Use ItMachine Learning in Business What It Is and How to Use It
Machine Learning in Business What It Is and How to Use It
 
Intro/Overview on Machine Learning Presentation -2
Intro/Overview on Machine Learning Presentation -2Intro/Overview on Machine Learning Presentation -2
Intro/Overview on Machine Learning Presentation -2
 

More from Harsha Patel

Introduction to Reinforcement Learning.pptx
Introduction to Reinforcement Learning.pptxIntroduction to Reinforcement Learning.pptx
Introduction to Reinforcement Learning.pptxHarsha Patel
 
Introduction to Association Rules.pptx
Introduction  to  Association  Rules.pptxIntroduction  to  Association  Rules.pptx
Introduction to Association Rules.pptxHarsha Patel
 
Introduction to Clustering . pptx
Introduction    to     Clustering . pptxIntroduction    to     Clustering . pptx
Introduction to Clustering . pptxHarsha Patel
 
Introduction to Classification . pptx
Introduction  to   Classification . pptxIntroduction  to   Classification . pptx
Introduction to Classification . pptxHarsha Patel
 
Introduction to Regression . pptx
Introduction     to    Regression . pptxIntroduction     to    Regression . pptx
Introduction to Regression . pptxHarsha Patel
 
Intro of Machine Learning Models .pptx
Intro of Machine  Learning  Models .pptxIntro of Machine  Learning  Models .pptx
Intro of Machine Learning Models .pptxHarsha Patel
 
Unit-V-Introduction to Data Mining.pptx
Unit-V-Introduction to  Data Mining.pptxUnit-V-Introduction to  Data Mining.pptx
Unit-V-Introduction to Data Mining.pptxHarsha Patel
 
Unit-IV-Introduction to Data Warehousing .pptx
Unit-IV-Introduction to Data Warehousing .pptxUnit-IV-Introduction to Data Warehousing .pptx
Unit-IV-Introduction to Data Warehousing .pptxHarsha Patel
 
Unit-III-AI Search Techniques and solution's
Unit-III-AI Search Techniques and solution'sUnit-III-AI Search Techniques and solution's
Unit-III-AI Search Techniques and solution'sHarsha Patel
 
Unit-II-Introduction of Artifiial Intelligence.pptx
Unit-II-Introduction of Artifiial Intelligence.pptxUnit-II-Introduction of Artifiial Intelligence.pptx
Unit-II-Introduction of Artifiial Intelligence.pptxHarsha Patel
 
Unit-I-Introduction to Recent Trends.pptx
Unit-I-Introduction to Recent Trends.pptxUnit-I-Introduction to Recent Trends.pptx
Unit-I-Introduction to Recent Trends.pptxHarsha Patel
 
Using Unix Commands.pptx
Using Unix Commands.pptxUsing Unix Commands.pptx
Using Unix Commands.pptxHarsha Patel
 
Using Vi Editor.pptx
Using Vi Editor.pptxUsing Vi Editor.pptx
Using Vi Editor.pptxHarsha Patel
 
Shell Scripting and Programming.pptx
Shell Scripting and Programming.pptxShell Scripting and Programming.pptx
Shell Scripting and Programming.pptxHarsha Patel
 
Managing Processes in Unix.pptx
Managing Processes in Unix.pptxManaging Processes in Unix.pptx
Managing Processes in Unix.pptxHarsha Patel
 
Introduction to Unix Concets.pptx
Introduction to Unix Concets.pptxIntroduction to Unix Concets.pptx
Introduction to Unix Concets.pptxHarsha Patel
 
Handling Files Under Unix.pptx
Handling Files Under Unix.pptxHandling Files Under Unix.pptx
Handling Files Under Unix.pptxHarsha Patel
 
Introduction to OS.pptx
Introduction to OS.pptxIntroduction to OS.pptx
Introduction to OS.pptxHarsha Patel
 
Using Unix Commands.pptx
Using Unix Commands.pptxUsing Unix Commands.pptx
Using Unix Commands.pptxHarsha Patel
 
Introduction to Unix Concets.pptx
Introduction to Unix Concets.pptxIntroduction to Unix Concets.pptx
Introduction to Unix Concets.pptxHarsha Patel
 

More from Harsha Patel (20)

Introduction to Reinforcement Learning.pptx
Introduction to Reinforcement Learning.pptxIntroduction to Reinforcement Learning.pptx
Introduction to Reinforcement Learning.pptx
 
Introduction to Association Rules.pptx
Introduction  to  Association  Rules.pptxIntroduction  to  Association  Rules.pptx
Introduction to Association Rules.pptx
 
Introduction to Clustering . pptx
Introduction    to     Clustering . pptxIntroduction    to     Clustering . pptx
Introduction to Clustering . pptx
 
Introduction to Classification . pptx
Introduction  to   Classification . pptxIntroduction  to   Classification . pptx
Introduction to Classification . pptx
 
Introduction to Regression . pptx
Introduction     to    Regression . pptxIntroduction     to    Regression . pptx
Introduction to Regression . pptx
 
Intro of Machine Learning Models .pptx
Intro of Machine  Learning  Models .pptxIntro of Machine  Learning  Models .pptx
Intro of Machine Learning Models .pptx
 
Unit-V-Introduction to Data Mining.pptx
Unit-V-Introduction to  Data Mining.pptxUnit-V-Introduction to  Data Mining.pptx
Unit-V-Introduction to Data Mining.pptx
 
Unit-IV-Introduction to Data Warehousing .pptx
Unit-IV-Introduction to Data Warehousing .pptxUnit-IV-Introduction to Data Warehousing .pptx
Unit-IV-Introduction to Data Warehousing .pptx
 
Unit-III-AI Search Techniques and solution's
Unit-III-AI Search Techniques and solution'sUnit-III-AI Search Techniques and solution's
Unit-III-AI Search Techniques and solution's
 
Unit-II-Introduction of Artifiial Intelligence.pptx
Unit-II-Introduction of Artifiial Intelligence.pptxUnit-II-Introduction of Artifiial Intelligence.pptx
Unit-II-Introduction of Artifiial Intelligence.pptx
 
Unit-I-Introduction to Recent Trends.pptx
Unit-I-Introduction to Recent Trends.pptxUnit-I-Introduction to Recent Trends.pptx
Unit-I-Introduction to Recent Trends.pptx
 
Using Unix Commands.pptx
Using Unix Commands.pptxUsing Unix Commands.pptx
Using Unix Commands.pptx
 
Using Vi Editor.pptx
Using Vi Editor.pptxUsing Vi Editor.pptx
Using Vi Editor.pptx
 
Shell Scripting and Programming.pptx
Shell Scripting and Programming.pptxShell Scripting and Programming.pptx
Shell Scripting and Programming.pptx
 
Managing Processes in Unix.pptx
Managing Processes in Unix.pptxManaging Processes in Unix.pptx
Managing Processes in Unix.pptx
 
Introduction to Unix Concets.pptx
Introduction to Unix Concets.pptxIntroduction to Unix Concets.pptx
Introduction to Unix Concets.pptx
 
Handling Files Under Unix.pptx
Handling Files Under Unix.pptxHandling Files Under Unix.pptx
Handling Files Under Unix.pptx
 
Introduction to OS.pptx
Introduction to OS.pptxIntroduction to OS.pptx
Introduction to OS.pptx
 
Using Unix Commands.pptx
Using Unix Commands.pptxUsing Unix Commands.pptx
Using Unix Commands.pptx
 
Introduction to Unix Concets.pptx
Introduction to Unix Concets.pptxIntroduction to Unix Concets.pptx
Introduction to Unix Concets.pptx
 

Recently uploaded

Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 

Recently uploaded (20)

Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 

Introduction to Machine Learning.pptx

  • 1. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 2.  Data science is all about data.  Every single tech company out there is collecting huge amounts of data. And data is revenue.  The more data you have, the more business insights you can generate.  Data Science is the art of turning data into actions through creating data products.  Performing data science requires extraction of timely, actionable information from diverse data sources to drive data products. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 3.  Artificial intelligence is a field of computer science which makes a computer system that can mimic human intelligence. It is comprised of two words "Artificial" and "intelligence", which means "a human- made thinking power." Hence we can define it as,  Artificial intelligence is a technology using which we can create intelligent systems that can simulate human intelligence.  The Artificial intelligence system does not require to be pre- programmed, instead of that, they use such algorithms which can work with their own intelligence. It involves machine learning algorithms such as Reinforcement learning algorithm and deep learning neural networks. AI is being used in multiple places such as Siri, Google's AlphaGo, AI in Chess playing, etc. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 4.  Machine learning is about extracting knowledge from the data. It can be defined as,  Machine learning is a subfield of artificial intelligence, which enables machines to learn from past data or experiences without being explicitly programmed.  Machine learning enables a computer system to make predictions or take some decisions using historical data without being explicitly programmed. Machine learning uses a massive amount of structured and semi-structured data so that a machine learning model can generate accurate result or give predictions based on that data.  Machine learning works on algorithm which learn by it’s own using historical data. It works only for specific domains such as if we are creating a machine learning model to detect pictures of dogs, it will only give result for dog images, but if we provide a new data like cat image then it will become unresponsive. Machine learning is being used in various places such as for online recommender system, for Google search algorithms, Email spam filter, Face book Auto friend tagging suggestion, etc. It can be divided into three types:  Supervised learning  Reinforcement learning  Unsupervised learning Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 5.  Artificial intelligence and machine learning are the part of computer science that are correlated with each other. These two technologies are the most trending technologies which are used for creating intelligent systems.  Although these are two related technologies and sometimes people use them as a synonym for each other, but still both are the two different terms in various cases.  AI is a bigger concept to create intelligent machines that can simulate human thinking capability and behavior, whereas, machine learning is an application or subset of AI that allows machines to learn from data without being programmed explicitly. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 6. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 7. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 8.  A Machine Learning system learns from historical data, builds the prediction models, and whenever it receives new data, predicts the output for it. The accuracy of predicted output depends upon the amount of data, as the huge amount of data helps to build a better model which predicts the output more accurately.  Suppose we have a complex problem, where we need to perform some predictions, so instead of writing a code for it, we just need to feed the data to generic algorithms, and with the help of these algorithms, machine builds the logic as per the data and predict the output. Machine learning has changed our way of thinking about the problem. The below block diagram explains the working of Machine Learning algorithm: Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 9. Features of Machine Learning: Machine learning uses data to detect various patterns in a given dataset. It can learn from past data and improve automatically. It is a data-driven technology. Machine learning is much similar to data mining as it also deals with the huge amount of the data. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 10.  Traditional Programming refers to any manually created program that uses input data and runs on a computer to produce the output. Traditional programming is a manual process — meaning a person (programmer) creates the program. But without anyone programming the logic, one has to manually formulate or code rules. We have the input data, and someone (programmer) coded a program that uses that data and runs on a computer to produce the desired output. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 11.  Machine Learning, also known as augmented analytics, the input data and output are fed to an algorithm to create a program. This yields powerful insights that can be used to predict future outcomes. Machine Learning, on the other hand, the input data and output are fed to an algorithm to create a program. This is the basic difference between traditional programming and machine learning. Without anyone programming the logic, In Traditional programming one has to manually formulate/code rules while in Machine Learning the algorithms automatically formulate the rules from the data, which is very powerful. . Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 12. Fig:- Block diagram of decision flow architecture for ML System Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 13.  1. Data Acquisition As machine learning is based on available data for the system to make a decision hence the first step defined in the architecture is data acquisition. This involves data collection, preparing and segregating the case scenarios based on certain features involved with the decision making cycle and forwarding the data to the processing unit for carrying out further categorization. This stage is sometimes called the data preprocessing stage. The data model expects reliable, fast and elastic data which may be discrete or continuous in nature. The data is then passed into stream processing systems (for continuous data) and stored in batch data warehouses (for discrete data) before being passed on to data modeling or processing stages.  2. Data Processing The received data in the data acquisition layer is then sent forward to the data processing layer where it is subjected to advanced integration and processing and involves normalization of the data, data cleaning, transformation, and encoding. The data processing is also dependent on the type of learning being used. For e.g., if supervised learning is being used the data shall be needed to be segregated into multiple steps of sample data required for training of the system and the data thus created is called training sample data or simply training data. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 14.  3. Data Modeling This layer of the architecture involves the selection of different algorithms that might adapt the system to address the problem for which the learning is being devised, These algorithms are being evolved or being inherited from a set of libraries. The algorithms are used to model the data accordingly, this makes the system ready for execution step.  4. Execution This stage in machine learning is where the experimentation is done, testing is involved and tunings are performed. The general goal behind being to optimize the algorithm in order to extract the required machine outcome and maximize the system performance, The output of the step is a refined solution capable of providing the required data for the machine to make decisions.  5. Deployment Like any other software output, ML outputs need to be operational zed or be forwarded for further exploratory processing. The output can be considered as a non- deterministic query which needs to be further deployed into the decision-making system. It is advised to seamlessly move the ML output directly to production where it will enable the machine to directly make decisions based on the output and reduce the dependency on the further exploratory steps. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 15.  Machine Learning Applications in Healthcare :- Doctors and medical practitioners will soon be able to predict with accuracy on how long patients with fatal diseases will live. Medical systems will learn from data and help patients save money by skipping unnecessary tests. i) Drug Discovery/Manufacturing ii) Personalized Treatment/Medication  Machine Learning Applications in Finance :- More than 90% of the top 50 financial institutions around the world are using machine learning and advanced analytics. The application of machine learning in Finance domain helps banks offer personalized services to customers at lower cost, better compliance and generate greater revenue. i)Machine Learning for Fraud Detection ii)Machine Learning for focused Account Holder Targeting Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 16.  Machine Learning Applications in Retail :- Machine learning in retail is more than just a latest trend, retailers are implementing big data technologies like Hadoop and Spark to build big data solutions. Machine learning algorithms process this data intelligently and automate the analysis to make this supercilious goal possible for retail giants like Amazon, Alibaba and Walmart. i)Machine Learning Examples in Retail for Product Recommendations ii)Machine Learning Examples in Retail for Improved Customer Service.  Machine Learning Applications in Travel :- Travel providers can help travellers find the best time to book a hotel or buy a cheap ticket by taking advantage of machine learning. When the agreement is available, the application will probably send a notification to the user. i)Machine Learning Examples in Travel for Dynamic Pricing ii)Machine Learning Examples in Travel for Sentiment Analysis Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 17.  Machine Learning Applications in Media :- Machine learning offers the most efficient means of engaging billions of social media users. From personalizing news feed to rendering targeted ads, machine learning is the heart of all social media platforms for their own and user benefits. Social media and chat applications have advanced to a great extent that users do not pick up the phone or use email to communicate with brands – they leave a comment on Facebook or Instagram expecting a speedy reply than the traditional channels.  Earlier Facebook used to prompt users to tag your friends but nowadays the social networks artificial neural networks machine learning algorithm identifies familiar faces from contact list. The ANN (Artificial Neural Networks)algorithm mimics the structure of human brain to power facial recognition.  The professional network like LinkedIn knows where you should apply for your next job, whom you should connect with and how your skills stack up against your peers as you search for new job. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 18. Let’s understand the type of data available in the datasets from the perspective of machine learning. 1. Numerical Data :- Any data points which are numbers are termed as numerical data. Numerical data can be discrete or continuous. Continuous data has any value within a given range while the discrete data is supposed to have a distinct value. For example, the number of doors of cars will be discrete i.e. either two, four, six, etc. and the price of the car will be continuous that is might be 1000$ or 1250.5$. The data type of numerical data is int64 or float64. 2. Categorical Data :- Categorical data are used to represent the characteristics. For example car color, date of manufacture, etc. It can also be a numerical value provided the numerical value is indicating a class. For example, 1 can be used to denote a gas car and 0 for a diesel car. We can use categorical data to forms groups but cannot perform any mathematical operations on them. Its data type is an object. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 19. 3. Time Series Data :- It is the collection of a sequence of numbers collected at a regular interval over a certain period of time. It is very important, like in the field of the stock market where we need the price of a stock after a constant interval of time. The type of data has a temporal field attached to it so that the timestamp of the data can be easily monitored. 4. Text Data :- Text data is nothing but literals. The first step of handling test data is to convert them into numbers as or model is mathematical and needs data to inform of numbers. So to do so we might use functions as a bag of word formulation. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 20.  ML Dataset :- Machine learning dataset is defined as the collection of data that is needed to train the model and make predictions. These datasets are classified as structured and unstructured datasets, where the structured datasets are in tabular format in which the row of the dataset corresponds to record and column corresponds to the features, and unstructured datasets corresponds to the images, text, speech, audio etc. which is acquired through Data Acquisition, Data Wrangling and Data Exploration, during the learning process these datasets are divided as training, validation and test sets for the training and measuring the accuracy of the mode. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 21.  Types of datasets :- 1.Training Dataset: This data set is used to train the model i.e. these datasets are used to update the weight of the model. 2.Validation Dataset: These types of a dataset are used to reduce over fitting. It is used to verify that the increase in the accuracy of the training dataset is actually increased if we test the model with the data that is not used in the training. If the accuracy over the training dataset increase while the accuracy over the validation dataset decrease, then this results in the case of high variance i.e. over fitting. 3.Test Dataset: Most of the time when we try to make changes to the model based upon the output of the validation set then unintentionally we make the model peek into our validation set and as a result, our model might get over fit on the validation set as well. To overcome this issue we have a test dataset that is only used to test the final output of the model in order to confirm the accuracy. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 22.  Pre-processing refers to the changes applied to our data before feeding it to the ML algorithm. Data pre-processing is a technique that is used to convert the created or collected (raw) data into a clean data set. In other words, whenever the data is gathered from different sources it is collected in raw format which is not feasible for the analysis or processing by ML model. Following figure shows transformation processing performed on raw data before, during and after applying ML techniques: Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 23.  Data Pre-processing in Machine Learning can be broadly divided into 3 main parts – 1.Data Integration 2.Data Cleaning 3.Data Transformation There are various steps in each of these 3 broad categories. Depending on the data and machine learning algorithm involved, not all steps might be required though. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 24. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 25. 1.Data Integration and formatting :- During hackathon and competitions, we often deal with a single csv or excel file containing all training data. But in real world, source of data might not be this simple. In real life, we might have to extract data from various sources and have to integrate it. 2. Data Cleaning :- 2.1 Dealing with Missing data :- It is common to have some missing or null data in the real-world data set. Most of the machine learning algorithms will not work with such data. So, it becomes important to deal with missing or null data. Some of common measures taken are,  Get rid of the column if there are plenty of rows with null values.  Eliminate the row if there are plenty of columns with null values.  Change the missing value by mean or median or mode of that column depending on data distribution in that column. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 26.  By substituting the missing values by ‘NA’ or ‘Unknown’ or some other relevant term, in case of categorical feature column, we can consider missing data as a new category in itself.  In this method to come up with educated guesses of possible candidate, replace missing value by applying regression or classification techniques. 2.2 Remove Noise from Data :- Noise are slightly erroneous data observations which does not fulfil with trend or distribution of rest of data. Though each error can be small, but communally noisy data results in poor machine learning model. Noise in data can be minimized or smooth out by using below popular techniques –  Binning  Regression Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 27.  Binning Method:  First sort data and partition  Then one can smooth by bin mean, median and boundaries. For Example: • Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 * Partition into bins: - Bin 1: 4, 8, 9, 15 - Bin 2: 21, 21, 24, 25 - Bin 3: 26, 28, 29, 34 * Smoothing by bin means: - Bin 1: 9, 9, 9, 9 - Bin 2: 23, 23, 23, 23 - Bin 3: 29, 29, 29, 29 * Smoothing by bin boundaries: - Bin 1: 4, 4, 4, 15 - Bin 2: 21, 21, 25, 25 - Bin 3: 26, 26, 26, 34 Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 28.  Regression Method:  Here data can be smoothed by fitting the data to a function.  Linear regression involves finding the “best” line to fit two attributes, so that one attribute can be used to predict the other.  Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 29. 2.3 Remove Outliers from Data :-  Outliers are those observation that has extreme values, much beyond the normal range of values for that feature. For example, a very high salary of CEO of a company can be an outlier if we consider salary of other regular employees of the company.  Even few outliers in data set can contribute to poor accuracy of machine learning model. The common methods to detect outliers and remove them are –  Standard Deviation  Box Plot Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 30.  Standard Deviation In statistics, if a data distribution is approximately normal then about 68% of the data values lie within one standard deviation of the mean and about 95% are within two standard deviations, and about 99.7% lie within three standard deviations. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 31.  Therefore, if you have any data point that is more than 3 times the standard deviation, then those points are very likely to be anomalous or outliers.  Box Plots:  Box plots are a graphical depiction of numerical data through their quintiles. It is a very simple but effective way to visualize outliers. Think about the lower and upper sideburns as the boundaries of the data distribution. Any data points that show above or below the sideburns, can be considered outliers or anomalous.  The concept of the Interquartile Range (IQR) is used to build the box plot graphs. IQR is a concept in statistics that is used to measure the statistical dispersion and data variability by dividing the dataset into quartiles.  In simple words, any dataset or any set of observations is divided into four defined intervals based upon the values of the data and how they compare to the entire dataset. A quartile is what divides the data into three points and four intervals. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 32. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 33. 2.4 Dealing with Duplicate Data :- The approach to deal with duplicate data depends on the fact whether duplicate data represents the real-world scenario or is more of an inconsistency. If it is former than duplicate data should be conserved, else it should be removed. 2.5 Dealing with Inconsistent Data :- Some data might not be consistent with the business rule. This requires domain knowledge to identify such inconsistencies and deal with it. 3.Data Transformation :- 3.1 Feature Scaling :- Feature scaling is one of the most important prerequisites for most of the machine learning algorithm. Various features of data set can have their own range of values and far different from each other. For e.g. age of human mostly lies within range of 0-100 years but population of countries will be in ranges of millions. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 34. This huge differences of ranges between features in a data set can distort the training of machine learning model. So, we need to bring the ranges of all the features at common scale. The common approaches of feature scaling are –  Mean Normalization  Min-Max Normalization  Z-Score Normalization or Standardization, We will make brief overview with examples on these approaches in the last section of this unit. 3.2 Dealing with categorical data :-  Categorical data, also known as qualitative data are text or string-based data. Example of categorical data are gender of persons (Male or Female), names of places (India, America, England), colour of car (Red, White).  Most of the machine learning algorithms works on numerical data only and will not be able to process categorical data. So, we need to transform categorical data into numerical form without losing the sense of information. Below are the popular approaches to convert categorical data into numerical form –  Label Encoding  One Hot Encoding  Binary Encoding Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 35. 3.3 Dealing with Imbalanced Data Set :-  Imbalanced data set are type of data set in which most of the data belongs to only one class and very few data belongs to other class. This is common in case of medical diagnosis, anomaly detection where the data belonging to positive class is a very small percentage.  For e.g. only 5-10% of data might belong to a disease positive class which can be an expected distribution in medical diagnosis. But this skewed data distribution can trick the machine learning model in training phase to only identify the majority classes and it fails to learn the minority classes. For example, the model might fail to identify the medical condition even though it might be showing a very high accuracy by identifying negative scenarios.  We need to do something about imbalanced data set to avoid a bad machine learning model. Below are some approaches to deal with such situation –  Under Sampling Majority Class  Over Sampling Minority Class  SMOTE (Synthetic Minority Oversampling Technique) Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 36. 3.4 Feature Engineering :-  Feature engineering is an art of creating new feature from the given data by either applying some domain knowledge or some common sense or both.  A very common example of feature engineering is converting a Date feature into additional features like Day, Week, Month, Year thus adding more information into data set.  Feature engineering enriches data set with more information that can contribute to good machine learning model. 3.5 Training-Test Data Split :-  Just before training supervised machine learning model, this is usually the last step of data pre-processing in which for a given data set, after undergoing all cleaning and transformation, is divided into two parts – one for training of machine learning model and second for testing the trained model.  Though there is no rule of thumb, but usually training-test split is done randomly at 80%-20% ratio. While splitting data set, care has to be taken that there is no loss of information in training data set. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 37. Example: We have registered the speed of 13 cars: speed = [99,86,87,88,111,86,103,87,94,78,77,85,86] 1.Mean -The average value.  (99+86+87+88+111+86+103+87+94+78+77+85+86) / 13 = 89.77  The NumPy module has NumPy mean() method for this.  import numpy speed = [99,86,87,88,111,86,103,87,94,78,77,85,86] x = numpy.mean(speed) print(x) Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 38. 2.Median - The midpoint value.  The median value is the value in the middle, after you have sorted all the values:  (77, 78, 85, 86, 86, 86, 87, 87, 88, 94, 99, 103, 111)  It is important that the numbers are sorted before you can find the median  The NumPy module has a NumPy median() method to find the middle value.  import numpy speed = [99,86,87,88,111,86,103,87,94,78,77,85,86] x = numpy. median(speed) print(x)  If there are two numbers in the middle, divide the sum of those numbers by two.  Ex. (77, 78, 85, 86, 86, 86, 87, 87, 94, 98, 99, 103)  (86 + 87) / 2 = 86.5 Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 39. 3.Mode - The most common value.  The Mode value is the value that appears the most number of times.  (99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86) = 86  The SciPy module has a SciPy mode() method to find the number that appears the most.  from scipy import stats speed = [99,86,87,88,111,86,103,87,94,78,77,85,86] x = stats. mode(speed) print(x) Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 40.  Standard deviation is a number that describes how spread out the values are.  A low standard deviation means that most of the numbers are close to the mean (average) value.  A high standard deviation means that the values are spread out over a wider range.  Example: We have registered the speed of 7 cars:  speed = [86,87,88,86,87,85,86]  The standard deviation is:0.9  Meaning that most of the values are within the range of 0.9 from the mean value, which is 86.4.  Example: We have registered the speed of another 7 cars:  speed = [32,111,138,28,59,77,97]  The standard deviation is:37.85  Meaning that most of the values are within the range of 37.85 from the mean value, which is 77.4.  A higher standard deviation indicates that the values are spread out over a wider range. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 41.  The NumPy module has a NumPy std() method to find the standard deviation.  import numpy speed = [86,87,88,86,87,85,86] x = numpy.std(speed) print(x)  import numpy speed = [32,111,138,28,59,77,97] x = numpy.std(speed) print(x)  Standard Deviation is often represented by the symbol Sigma: σ Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 42.  Variance is another number that indicates how spread out the values are.  if you take the square root of the variance, you get the standard deviation!  if you multiply the standard deviation by itself, you get the variance!  To calculate the variance you have to do as follows: 1. Find the mean: (32+111+138+28+59+77+97) / 7 = 77.4 2. For each value: find the difference from the mean: 32 - 77.4 = -45.4 111 - 77.4 = 33.6 138 - 77.4 = 60.6 28 - 77.4 = -49.4 59 - 77.4 = -18.4 77 - 77.4 = - 0.4 97 - 77.4 = 19.6 Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 43. 3. For each difference: find the square value: (-45.4)2 = 2061.16 (33.6)2 = 1128.96 (60.6)2 = 3672.36 (-49.4)2 = 2440.36 (-18.4)2 = 338.56 (- 0.4)2 = 0.16 (19.6)2 = 384.16 4. The variance is the average number of these squared differences: (2061.16+1128.96+3672.36+2440.36+338.56+0.16+384.16) / 7 = 1432.6  The NumPy module has a NumPy var() method to find the variance. import numpy speed = [32,111,138,28,59,77,97] x = numpy.var(speed) print(x)  Variance is often represented by the symbol Sigma Square: σ2 Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 44.  Percentiles are used in statistics to give you a number that describes the value that a given percent of the values are lower than.  Example: we have an array of the ages of all the people that lives in a street.  ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]  What is the 75. percentile? The answer is 43, meaning that 75% of the people are 43 or younger.  The NumPy module has a method NumPy percentile() to find the percentiles:  import numpy ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31] x = numpy.percentile(ages, 75) print(x)  What is the age that 90% of the people are younger than?  import numpy ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31] x = numpy.percentile(ages, 90) print(x) Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 45.  Statistics plays a main role in the field of research. It helps us in the collection, analysis and presentation of data. Before starting with descriptive and inferential statistics let us get the basic idea of population and sample. Population:- Population is the group that is targeted to collect the data from. Our data is the information collected from the population. Population is always defined first, before starting the data collection process for any statistical study. Population is not necessarily be people rather it could be batch of batteries, measurements of rainfall in an area or a group of people. Sample:- It is the part of population which is selected randomly for the study. The sample should be selected such that it represents all the characteristics of the population. The process of selecting the subset from the population is called sampling and the subset selected is called the sample. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 46. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 47. 1.Descriptive Statistics :- It describes the important characteristics/ properties of the data using the measures the central tendency like mean/ median/mode and the measures of dispersion like range, standard deviation, variance etc. Data can be summarized and represented in an accurate way using charts, tables and graphs. For example: We have marks of 1000 students and we may be interested in the overall performance of those students and the distribution as well as the spread of marks. Descriptive statistics provides us the tools to define our data in a most understandable and appropriate way. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 48. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 49. 2.Inferential Statistics :- It is about using data from sample and then making inferences about the larger population from which the sample is drawn. The goal of the inferential statistics is to draw conclusions from a sample and generalize them to the population. It determines the probability of the characteristics of the sample using probability theory. The most common methodologies used are hypothesis tests, Analysis of variance etc. For example: Suppose we are interested in the exam marks of all the students in India. But it is not feasible to measure the exam marks of all the students in India. So now we will measure the marks of a smaller sample of students, for example 1000 students. This sample will now represent the large population of Indian students. We would consider this sample for our statistical study for studying the population from which it’s deduced. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 50. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 51. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 52.  Probability theory, a branch of mathematics concerned with the analysis of random phenomena. The outcome of a random event cannot be determined before it occurs, but it may be any one of several possible outcomes. The actual outcome is considered to be determined by chance.  EX. Tossing a Coin. When a coin is tossed, there are two possible outcomes: 1.heads (H) or 2.tails (T) We say that the probability of the coin landing H is ½ And the probability of the coin landing T is ½ EX. Throwing a Dice . When a single die is thrown, there are six possible outcomes: 1, 2, 3, 4, 5, 6. The probability of any one of them is 1/6. So, Probability of an event happening = Number of ways it can happen Total no of Outcomes Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 53.  Many Supervised and Unsupervised machine learning models such as K-NN and K-Means depend upon the distance between two data points to predict the output. Therefore, the metric we use to compute distances plays an important role in these models. 1.Euclidean Distance:  Euclidean distance is the straight line distance between 2 data points in a plane.  It is calculated using the Minkowski Distance formula by setting ‘p’ value to 2, thus, also known as the L2 norm distance metric.  This formula is similar to the Pythagorean theorem formula, Thus it is also known as the Pythagorean Theorem. The formula is:- Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 54. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 55. 2.Manhattan Distance:  We use Manhattan distance, also known as city block distance, or taxicab geometry if we need to calculate the distance between two data points in a grid-like path. Manhattan distance metric can be understood with the help of a simple example.  In the above picture, imagine each cell to be a building, and the grid lines to be roads. Now if I want to travel from Point A to Point B marked in the image and follow the red or the yellow path. We see that the path is not straight and there are turns. In this case, we use the Manhattan distance metric to calculate the distance walked.  We can get the equation for Manhattan distance by substituting p = 1 in the Minkowski distance formula. The formula is:- Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 56. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 57.  In the older days, people used to perform Machine Learning tasks by manually coding all the algorithms and mathematical and statistical formula. This made the process time consuming, tedious and inefficient. But in the modern days, it becomes very much easier and efficient compared to the olden days by various python libraries, frameworks, and modules. Today, Python is one of the most popular programming languages for this task and it has replaced many languages in the industry, one of the reasons is its vast collection of libraries. Python libraries that used in Machine Learning are:  Numpy  Scipy  Scikit-learn  Theano  TensorFlow  Keras  PyTorch  Pandas  Matplotlib Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 58.  The single most important reason for the popularity of Python in the field of AI and ML is the fact that Python provides 1000s of inbuilt libraries that have in-built functions and methods to easily carry out data analysis, processing, wrangling, modelling and so on. In the below section, we’ll discuss the libraries for the following tasks: 1.Statistical Analysis 2.Data Visualization 3.Data Modelling and Machine Learning 4.Deep Learning 5.Natural Language Processing (NLP) Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 59. 1.Statistical Analysis:  Python comes with tons of libraries for the sole purpose of statistical analysis. Top statistical packages that provide in-built functions to perform the most complex statistical computations are: NumPy :- NumPy or Numerical Python is one of the most commonly used Python libraries. The main feature of this library is its support for multi- dimensional arrays for mathematical and logical operations. Functions provided by NumPy can be used for indexing, sorting, reshaping and conveying images and sound waves as an array of real numbers in multi- dimension. SciPy :- Built on top of NumPy, the SciPy library is a collective of sub- packages that help in solving the most basic problems related to statistical analysis. SciPy library is used to process the array elements defined using the NumPy library, so it is often used to compute mathematical equations that cannot be done using NumPy. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 60. Pandas :- Pandas is another important statistical library mainly used in a wide range of fields including, statistics, finance, economics, data analysis and so on. The library relies on the NumPy array for the purpose of processing pandas data objects. NumPy, Pandas, and SciPy are heavily dependent on each other for performing scientific computations, data manipulation and so on. Pandas is one of the best libraries for processing huge chunks of data, whereas NumPy has excellent support for multi- dimensional arrays and Scipy, on the other hand, provides a set of sub- packages that perform a majority of the statistical analysis tasks. StatsModels :- Built on top of NumPy and SciPy, the StatsModels Python package is the best for creating statistical models, data handling and model evaluation. Along with using NumPy arrays and scientific models from SciPy library, it also integrates with Pandas for effective data handling. This library is famously known for statistical computations, statistical testing, and data exploration. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
• 61. 2. Data Visualization:
 A picture speaks more than a thousand words. Data visualization is all about expressing the key insights from data effectively through graphical representations. It includes the implementation of graphs, charts, mind maps, heat-maps, histograms, density plots, etc., to study the correlations between various data variables.
 The best Python data visualization packages that provide built-in functions to study the dependencies between various data features are:
Matplotlib :- Matplotlib is the most basic data visualization package in Python. It provides support for a wide variety of graphs such as histograms, bar charts, power spectra, error charts, and so on. It is a two-dimensional plotting library that produces clear and concise graphs, which are essential for Exploratory Data Analysis (EDA).
Seaborn :- The Matplotlib library forms the base of the Seaborn library. In comparison to Matplotlib, Seaborn can be used to create more appealing and descriptive statistical graphs. Along with extensive support for data visualization, Seaborn also comes with an inbuilt dataset-oriented API for studying the relationships between multiple variables.
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
• 62. Plotly :- Plotly is one of the most well-known graphical Python libraries. It provides interactive graphs for understanding the dependencies between target and predictor variables. It can be used to analyze and visualize statistical, financial, commercial and scientific data to produce clear and concise graphs, sub-plots, heatmaps, 3D charts and so on.
Bokeh :- One of the most interactive libraries in Python, Bokeh can be used to build descriptive graphical representations for web browsers. It can easily process enormous datasets and build versatile graphs that help in performing extensive EDA. Bokeh provides well-defined functionality to build interactive plots, dashboards, and data applications.
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
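Matplotlib examples appear later in these slides; as a small taste of Seaborn's higher-level interface, here is a minimal sketch (assuming Seaborn 0.11 or newer, where histplot is available; the sample data are arbitrary):

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

x = np.random.normal(5.0, 1.0, 1000)  # arbitrary sample data
sns.histplot(x, kde=True)             # histogram with a smoothed density curve overlaid
plt.show()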
• 63. 3. Machine Learning:
 Implementing ML algorithms from scratch involves writing thousands of lines of code, and this becomes even more cumbersome when you want to create models that solve complex problems through neural networks. Thankfully, we don't have to code these algorithms ourselves, because Python comes with several packages built just for the purpose of implementing machine learning techniques and algorithms.
 Top ML packages that provide in-built functions to implement the common ML algorithms:
Scikit-learn :- One of the most useful Python libraries, Scikit-learn is the best library for data modelling and model evaluation. It comes with tons of functions for the sole purpose of creating a model. It contains implementations of the common Supervised and Unsupervised Machine Learning algorithms, and it also comes with well-defined functions for Ensemble Learning and Boosting.
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
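Every scikit-learn estimator follows the same fit/predict pattern; here is a minimal sketch using the library's bundled Iris dataset (the choice of k-nearest neighbours is just an example, any classifier works the same way):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)                           # learn from the training data
print(accuracy_score(y_test, model.predict(X_test)))  # evaluate on unseen data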
• 64. XGBoost :- XGBoost, which stands for Extreme Gradient Boosting, is one of the best Python packages for performing Boosting Machine Learning. Libraries such as LightGBM and CatBoost are equally well equipped with well-defined functions and methods. This library is built mainly for the purpose of implementing gradient boosting machines, which are used to improve the performance and accuracy of Machine Learning models.
ELI5 :- ELI5 is another Python library, mainly focused on inspecting and debugging Machine Learning models and explaining their predictions. This library is relatively new and is usually used alongside XGBoost, LightGBM, CatBoost and so on to understand and improve the behaviour of Machine Learning models.
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
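XGBoost exposes a scikit-learn-compatible estimator, so the workflow mirrors the sketch above; a minimal example (assuming the xgboost package is installed, and again using Iris purely for illustration):

from xgboost import XGBClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier(n_estimators=100)  # an ensemble of 100 boosted trees
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))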
• 65. 4. Deep Learning:
 The biggest advancements in ML and AI have come through deep learning. With the introduction of deep learning, it is now possible to build complex models and process enormous data sets. Thankfully, Python provides the best deep learning packages that help in building effective neural networks.
 Top deep learning packages that provide built-in functions to implement convolutional neural networks are:
TensorFlow :- One of the best Python libraries for Deep Learning, TensorFlow is an open-source library for dataflow programming across a range of tasks. It is a symbolic math library that is used for building strong and precise neural networks. It provides an intuitive multi-platform programming interface which is highly scalable over a vast domain of fields.
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
• 66. PyTorch :- PyTorch is an open-source, Python-based scientific computing package that is used to implement Deep Learning techniques and Neural Networks on large datasets. This library is actively used by Facebook to develop neural networks that help in tasks such as face recognition and auto-tagging.
Keras :- Keras is considered one of the best Deep Learning libraries in Python. It provides full support for building, analyzing, evaluating and improving Neural Networks. Keras runs on top of the Theano and TensorFlow Python libraries, which provide the machinery needed to build complex and large-scale Deep Learning models.
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
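As a flavour of the Keras API, a minimal sketch of a tiny binary classifier (assuming TensorFlow 2.x, where Keras ships as tensorflow.keras; the data here are random and purely illustrative):

import numpy as np
from tensorflow import keras

X = np.random.rand(100, 4)             # 100 made-up samples with 4 features
y = np.random.randint(0, 2, size=100)  # made-up binary labels

model = keras.Sequential([
    keras.layers.Dense(8, activation="relu", input_shape=(4,)),  # hidden layer
    keras.layers.Dense(1, activation="sigmoid"),                 # binary output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, verbose=0)   # train briefly on the toy data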
• 67. 5. Natural Language Processing:
 Have you ever wondered how Google so aptly predicts what you're searching for? The technology behind Alexa, Siri, and other chatbots is Natural Language Processing. NLP has played a huge role in designing AI-based systems that help in describing the interaction between human language and computers.
 Top Natural Language Processing packages that provide built-in functions to implement high-level AI-based systems are:
NLTK (Natural Language Toolkit) :- NLTK is considered to be the best Python package for analyzing human language and behaviour. Preferred by many Data Scientists, the NLTK library provides easy-to-use interfaces to over 50 corpora and lexical resources that help in describing human interactions and in building AI-based systems such as recommendation engines.
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
• 68. spaCy :- spaCy is a free, open-source Python library for implementing advanced Natural Language Processing (NLP) techniques. When you're working with a lot of text, it is important to understand the morphological meaning of the text and how it can be classified in order to understand human language. These tasks can be easily achieved with spaCy.
Gensim :- Gensim is another open-source Python package, built to extract semantic topics from large documents and texts in order to process, analyze and predict human behaviour through statistical models and linguistic computations. It is capable of processing huge volumes of data, even when that data is raw and unstructured.
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
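A minimal NLTK sketch of word tokenization (assuming the punkt tokenizer model has been downloaded once via nltk.download, which requires network access):

import nltk
nltk.download("punkt")  # one-time download of the sentence/word tokenizer model

from nltk.tokenize import word_tokenize

text = "Machine learning enables computers to learn from data."
print(word_tokenize(text))
# ['Machine', 'learning', 'enables', 'computers', 'to', 'learn', 'from', 'data', '.']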
• 69.
 How Can We Get Big Data Sets?
 To create big data sets for testing, we use the Python module NumPy, which comes with a number of methods to create random data sets of any size.
Example: Create an array containing 250 random floats between 0 and 5:
import numpy
x = numpy.random.uniform(0.0, 5.0, 250)
print(x)
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
• 70.
 Histogram
 To visualize the data set we can draw a histogram with the data we collected.
 We will use the Python module Matplotlib to draw a histogram:
Example: Draw a histogram:
import numpy
import matplotlib.pyplot as plt
x = numpy.random.uniform(0.0, 5.0, 250)
plt.hist(x, 5)
plt.show()
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 71. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
• 72.
 We use the array from the example above to draw a histogram with 5 bars.
 The first bar represents how many values in the array are between 0 and 1, the second bar how many are between 1 and 2, and so on.
 Which gives us this result:
52 values are between 0 and 1
48 values are between 1 and 2
49 values are between 2 and 3
51 values are between 3 and 4
50 values are between 4 and 5
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
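These bar counts can also be computed directly, without drawing anything, using NumPy's histogram function (the exact counts will vary from run to run because the data are random):

import numpy
x = numpy.random.uniform(0.0, 5.0, 250)
counts, edges = numpy.histogram(x, bins=5, range=(0.0, 5.0))
print(counts)  # e.g. [52 48 49 51 50] -- the height of each of the 5 bars
print(edges)   # [0. 1. 2. 3. 4. 5.]  -- the bin boundaries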
• 73.
 Big Data Distributions
 An array containing 250 values is not considered very big, but now you know how to create a random set of values; by changing the parameters, you can make the data set as big as you want.
Example: Create an array with 100000 random numbers, and display them using a histogram with 100 bars:
import numpy
import matplotlib.pyplot as plt
x = numpy.random.uniform(0.0, 5.0, 100000)
plt.hist(x, 100)
plt.show()
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
• 74.
 The example above drew values from a uniform distribution. Let us now create an array where the values are concentrated around a given value. In probability theory this kind of data distribution is known as the normal data distribution, or the Gaussian data distribution, after the mathematician Carl Friedrich Gauss, who came up with the formula for this distribution.
Example: A typical normal data distribution:
import numpy
import matplotlib.pyplot as plt
x = numpy.random.normal(5.0, 1.0, 100000)
plt.hist(x, 100)
plt.show()
 A normal distribution graph is also known as the bell curve, because of its characteristic bell shape.
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 75. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
• 76.
 We use the array from the numpy.random.normal() method, with 100000 values, to draw a histogram with 100 bars.
 We specify that the mean value is 5.0 and the standard deviation is 1.0, meaning that the values should be concentrated around 5.0 and should rarely be much further than 1.0 (one standard deviation) from the mean.
 And as you can see from the histogram, most values are between 4.0 and 6.0, with a peak at approximately 5.0.
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
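We can check these properties numerically as well; with this many samples, the sample statistics land very close to the parameters we passed in:

import numpy
x = numpy.random.normal(5.0, 1.0, 100000)
print(x.mean())  # approximately 5.0
print(x.std())   # approximately 1.0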
  • 77.  A scatter plot is a diagram where each value in the data set is represented by a dot. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
• 78.
 The Matplotlib module has a method for drawing scatter plots. It needs two arrays of the same length: one for the values of the x-axis, and one for the values of the y-axis:
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
 The x array represents the age of each car.
 The y array represents the speed of each car.
Example: Use the scatter() method to draw a scatter plot diagram:
import matplotlib.pyplot as plt
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
plt.scatter(x, y)
plt.show()
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 79. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
• 80.
 The x-axis represents ages, and the y-axis represents speeds.
 What we can read from the diagram is that the two fastest cars were both 2 years old, and the slowest car was 12 years old.
 Random Data Distributions
 In Machine Learning the data sets can contain thousands, or even millions, of values.
 Let us create two arrays that are both filled with 1000 random numbers from a normal data distribution.
 The first array will have the mean set to 5.0 with a standard deviation of 1.0.
 The second array will have the mean set to 10.0 with a standard deviation of 2.0:
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
• 81.
Example: A scatter plot with 1000 dots:
import numpy
import matplotlib.pyplot as plt
x = numpy.random.normal(5.0, 1.0, 1000)
y = numpy.random.normal(10.0, 2.0, 1000)
plt.scatter(x, y)
plt.show()
 We can see that the dots are concentrated around the value 5 on the x-axis, and 10 on the y-axis.
 We can also see that the spread is wider on the y-axis than on the x-axis.
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 82. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
• 83.
 Execution of data pre-processing methods using Python commonly involves the following steps:
 Importing the libraries
 Importing the Dataset
 Handling of Missing Data
 Handling of Categorical Data
 Splitting the dataset into training and testing datasets
 Feature Scaling
For this data pre-processing script, we are going to use Anaconda Navigator and, specifically, the Spyder IDE to write the following code.
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
• 84. Importing the libraries :-
import numpy as np  # used for handling numbers
import pandas as pd  # used for handling the dataset
from sklearn.impute import SimpleImputer  # used for handling missing data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder  # used for encoding categorical data
from sklearn.model_selection import train_test_split  # used for splitting training and testing data
from sklearn.preprocessing import StandardScaler  # used for feature scaling
from sklearn.compose import ColumnTransformer  # used to apply transformers to columns of an array
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
• 85. Importing the Dataset :-
 First of all, let us have a look at the dataset we are going to use for this particular example. You can download this dataset from: https://github.com/tarunlnmiit/machine_learning/blob/master/DataPreprocessing.csv
 It is as shown below:
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
• 86.
 By pressing the Raw button at that link, copy this dataset and store it in a Data.csv file in the folder where your program is stored.
 In order to import this dataset into our script, we use pandas as follows.
dataset = pd.read_csv('Data.csv')  # import the dataset into a variable
# Splitting the attributes into independent and dependent attributes
X = dataset.iloc[:, :-1].values  # independent attributes
Y = dataset.iloc[:, -1].values  # dependent attribute / class
 When you run this code section along with the libraries, you should not see any errors. Once it has executed successfully, you can open the Variable Explorer in the Spyder UI and you will see the following three variables.
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 87. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 88.  When you double click on each of these variables, you should see something similar. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
• 89. Handling of Missing Data :-
 The first idea is simply to remove the observations where there is some missing data. But that can be quite dangerous: imagine this data set contains key information; it would be risky to remove such observations. So we need a better way to handle this problem, and the most common one is to replace a missing value with the mean of its column, as discussed in an earlier section.
 If you noticed, our dataset has two values missing: one in the Age column at the 6th data index and one in the Income column in the 4th data row. Missing values should be handled during the data analysis, so we do that as follows.
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
• 90.
# handle the missing data: replace missing values (np.nan) with the mean of the other values in the column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:])
X[:, 1:] = imputer.transform(X[:, 1:])
 After execution of this code, the independent variable X will transform into the following.
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
• 91. Here you can see that the missing values have been replaced by the average values of the respective columns. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
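As an aside, the same mean imputation can be done directly in pandas before extracting the arrays; a minimal sketch, assuming the numeric columns in Data.csv are named Age and Income (hypothetical names for illustration):

import pandas as pd

dataset = pd.read_csv('Data.csv')
# fill missing entries in the numeric columns with each column's mean
dataset[['Age', 'Income']] = dataset[['Age', 'Income']].fillna(dataset[['Age', 'Income']].mean())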
• 92. Handling of Categorical Data :-
 In this dataset we have two categorical variables, namely the Region variable and the Online Shopper variable. They are categorical variables simply because they contain categories: Region contains three categories (India, USA and Brazil), and the Online Shopper variable contains two (Yes and No).
 Since machine learning models are based on mathematical equations, you can naturally understand that keeping text in the categorical variables would cause problems, because we only want numbers in the equations. That is why we need to encode the categorical variables, i.e. encode the text into numbers. To do this we use the following code snippet.
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
• 93.
# encode categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
rg = ColumnTransformer([("Region", OneHotEncoder(), [0])], remainder='passthrough')
X = rg.fit_transform(X)
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
After execution of this code, the independent variable X and the dependent variable Y will transform into the following.
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 94. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
• 95.
 Here you can see that the Region variable is now made up of three binary (one-hot) columns, one per country. Note that scikit-learn's encoders order categories alphabetically, so the left-most bit represents Brazil, the second bit India, and the last bit USA; if a bit is 1, the row contains data for that country. For the Online Shopper variable, 1 represents Yes and 0 represents No.
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
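An equivalent and often simpler route to the same one-hot columns is pandas' get_dummies; a small sketch, assuming the categorical column in Data.csv is named Region:

import pandas as pd

dataset = pd.read_csv('Data.csv')
encoded = pd.get_dummies(dataset, columns=['Region'])  # one column per category, ordered alphabetically
print(encoded.head())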
• 96. Splitting the dataset into training and testing datasets :-
 Any machine learning algorithm needs to be tested for accuracy. In order to do that, we divide our data set into two parts: a training set and a testing set. As the names suggest, we use the training set to make the algorithm learn the behaviours present in the data, and we check the correctness of the algorithm by testing it on the testing set. In Python, we do that as follows:
# splitting the dataset into training set and test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
 Here, we take the training set to be 80% of the original data set and the testing set to be 20%. This is the usual ratio in which they are split, though you will sometimes come across a 70–30% or 75–25% split. You don't want to split it 50–50%, however, as that leaves too little data for the model to learn from. For now, we are going to split it in an 80–20% ratio.
 After the split, our training set and testing set look like this.
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 97. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
• 98. Feature Scaling (Variable Transformation) :-
 As you can see, we have two columns, age and income, that contain numerical values. Notice that the variables are not on the same scale: the ages range from 32 to 55, while the salaries range from about 57.6K to 99.6K. Because the age and salary variables do not share the same scale, this can cause issues in your Machine Learning models. Why? Because many ML models are based on what is called the Euclidean distance, and a variable on a much larger scale will dominate that distance.
 We use feature scaling to convert the different scales to a standard scale, to make things easier for Machine Learning algorithms. We do this in Python as follows:
# feature scaling
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
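For intuition, StandardScaler standardizes each column to z = (x − mean) / std; a minimal NumPy check of the same transform (the age values are made up for the example):

import numpy as np

ages = np.array([32.0, 55.0, 41.0, 38.0])  # hypothetical values on the age scale
z = (ages - ages.mean()) / ages.std()      # the per-column transform StandardScaler applies
print(z)                                   # centred at 0 with unit variance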
• 99.
 After the execution of this code, our training independent variable X and our testing independent variable X look like this.
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 100. Thanks !!! Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune