 Data science is all about data.
 Every single tech company out there is collecting huge amounts of
data. And data is revenue.
 The more data you have, the more business insights you can
generate.
 Data Science is the art of turning data into actions by creating data products.
 Performing data science requires extracting timely, actionable information from diverse data sources to drive data products.
 Artificial intelligence is a field of computer science that builds computer systems able to mimic human intelligence. The term combines the two words "Artificial" and "intelligence", meaning "human-made thinking power." Hence we can define it as:
 Artificial intelligence is a technology with which we can create intelligent systems that simulate human intelligence.
 An artificial intelligence system does not need to be pre-programmed for every task; instead, it uses algorithms that can work with their own intelligence, such as reinforcement learning algorithms and deep learning neural networks. AI is used in many places, such as Siri, Google's AlphaGo, chess-playing programs, etc.
 Machine learning is about extracting knowledge from data. It can be defined as:
 Machine learning is a subfield of artificial intelligence which enables machines to learn from past data or experiences without being explicitly programmed.
 Machine learning enables a computer system to make predictions or take decisions using historical data without being explicitly programmed. It uses massive amounts of structured and semi-structured data so that a machine learning model can generate accurate results or predictions from that data.
 Machine learning works on algorithms that learn on their own from historical data, and it works only for specific domains: if we create a machine learning model to detect pictures of dogs, it will only give results for dog images; if we provide new data, such as a cat image, it will fail to recognize it. Machine learning is used in many places, such as online recommender systems, Google's search algorithms, email spam filters, Facebook's auto friend-tagging suggestions, etc.
It can be divided into three types:
 Supervised learning
 Reinforcement learning
 Unsupervised learning
 Artificial intelligence and machine learning are closely related branches of computer science. They are two of the most trending technologies used for creating intelligent systems.
 Although the two are related and people sometimes use them as synonyms for each other, they remain distinct terms in many contexts.
 AI is the broader concept of creating intelligent machines that can simulate human thinking capability and behaviour, whereas machine learning is an application or subset of AI that allows machines to learn from data without being programmed explicitly.
 A machine learning system learns from historical data, builds prediction models, and, whenever it receives new data, predicts the output for it. The accuracy of the predicted output depends upon the amount of data, as a huge amount of data helps build a better model which predicts the output more accurately.
 Suppose we have a complex problem where we need to perform some predictions. Instead of writing code for it, we just feed the data to generic algorithms; with the help of these algorithms, the machine builds the logic from the data and predicts the output. Machine learning has changed the way we think about such problems. The block diagram below explains the working of a machine learning algorithm:
Features of Machine Learning:
Machine learning uses data to detect various patterns in a given dataset.
It can learn from past data and improve automatically.
It is a data-driven technology.
Machine learning is similar to data mining, as both deal with huge amounts of data.
 Traditional programming refers to any manually created program that uses input data and runs on a computer to produce the output. It is a manual process: a person (the programmer) creates the program and has to formulate and code the rules by hand. We have the input data, and someone (the programmer) codes a program that uses that data and runs on a computer to produce the desired output.
 In Machine Learning, also known as augmented analytics, the input data and the output are fed to an algorithm to create a program. This yields powerful insights that can be used to predict future outcomes.
This is the basic difference between traditional programming and machine learning: in traditional programming, one has to manually formulate and code the rules, while in machine learning the algorithm automatically formulates the rules from the data, which is very powerful.
Fig:- Block diagram of decision flow architecture for ML System
 1. Data Acquisition
As machine learning relies on available data for the system to make decisions, the first step in the architecture is data acquisition. This involves collecting the data, preparing and segregating the case scenarios based on the features involved in the decision-making cycle, and forwarding the data to the processing unit for further categorization. This stage is sometimes called the data pre-processing stage. The data model expects reliable, fast and elastic data, which may be discrete or continuous in nature. The data is then passed into stream-processing systems (for continuous data) or stored in batch data warehouses (for discrete data) before being passed on to the data modelling or processing stages.
 2. Data Processing
The data received in the data acquisition layer is then sent forward to the data processing layer, where it is subjected to advanced integration and processing; this involves normalization of the data, data cleaning, transformation, and encoding. The data processing also depends on the type of learning being used. For example, if supervised learning is being used, the data needs to be segregated into samples for training the system; the data thus created is called training sample data, or simply training data.
 3. Data Modeling
This layer of the architecture involves selecting the algorithms that can adapt the system to the problem for which the learning is being devised. These algorithms are evolved from, or inherited from, a set of libraries, and are used to model the data accordingly; this makes the system ready for the execution step.
 4. Execution
This is the stage of machine learning where experimentation is done, testing is involved and tunings are performed. The general goal is to optimize the algorithm in order to extract the required machine outcome and maximize system performance. The output of this step is a refined solution capable of providing the required data for the machine to make decisions.
 5. Deployment
Like any other software output, ML outputs need to be operationalized or forwarded for further exploratory processing. The output can be considered a non-deterministic query which needs to be deployed into the decision-making system. It is advisable to move the ML output seamlessly to production, where it enables the machine to make decisions directly based on the output and reduces the dependency on further exploratory steps.
 Machine Learning Applications in Healthcare :-
Doctors and medical practitioners will soon be able to predict with accuracy how long patients with fatal diseases will live. Medical systems will learn from data and help patients save money by skipping unnecessary tests.
i) Drug Discovery/Manufacturing
ii) Personalized Treatment/Medication
 Machine Learning Applications in Finance :-
More than 90% of the top 50 financial institutions around the world are using machine learning and advanced analytics. The application of machine learning in the finance domain helps banks offer personalized services to customers at lower cost, achieve better compliance and generate greater revenue.
i) Machine Learning for Fraud Detection
ii) Machine Learning for Focused Account Holder Targeting
 Machine Learning Applications in Retail :-
Machine learning in retail is more than just the latest trend; retailers are implementing big data technologies like Hadoop and Spark to build big data solutions. Machine learning algorithms process this data intelligently and automate the analysis, making this ambitious goal achievable for retail giants like Amazon, Alibaba and Walmart.
i) Machine Learning Examples in Retail for Product Recommendations
ii) Machine Learning Examples in Retail for Improved Customer Service
 Machine Learning Applications in Travel :-
Travel providers can help travellers find the best time to book a hotel or buy a cheap ticket by taking advantage of machine learning. When a deal is available, the application can send a notification to the user.
i) Machine Learning Examples in Travel for Dynamic Pricing
ii) Machine Learning Examples in Travel for Sentiment Analysis
 Machine Learning Applications in Media :-
Machine learning offers the most efficient means of engaging billions of social media users. From personalizing news feeds to rendering targeted ads, machine learning is at the heart of all social media platforms, for their own and their users' benefit. Social media and chat applications have advanced to such an extent that users no longer pick up the phone or use email to communicate with brands; they leave a comment on Facebook or Instagram, expecting a speedier reply than through the traditional channels.
 Earlier, Facebook used to prompt users to tag their friends, but nowadays the social network's machine learning algorithm, based on artificial neural networks, identifies familiar faces from the contact list. The ANN (Artificial Neural Network) algorithm mimics the structure of the human brain to power facial recognition.
 A professional network like LinkedIn knows where you should apply for your next job, whom you should connect with, and how your skills stack up against your peers as you search for a new job.
Let's understand the types of data available in datasets from the perspective of machine learning.
1. Numerical Data :-
Any data points which are numbers are termed numerical data. Numerical data can be discrete or continuous. Continuous data can take any value within a given range, while discrete data takes distinct values. For example, the number of doors of a car is discrete, i.e. two, four, six, etc., while the price of the car is continuous, e.g. $1000 or $1250.50. The data type of numerical data is int64 or float64.
2. Categorical Data :-
Categorical data is used to represent characteristics, for example car colour, date of manufacture, etc. It can also be a numerical value, provided the numerical value indicates a class. For example, 1 can be used to denote a gas car and 0 a diesel car. We can use categorical data to form groups but cannot perform mathematical operations on it. Its data type is object.
3. Time Series Data :-
It is a sequence of numbers collected at regular intervals over a certain period of time. It is very important in fields like the stock market, where we need the price of a stock after each constant interval of time. This type of data has a temporal field attached to it, so that the timestamp of the data can be easily monitored.
4. Text Data :-
Text data is nothing but literals. The first step in handling text data is to convert it into numbers, as our model is mathematical and needs data in the form of numbers. To do so we might use a bag-of-words formulation, as in the sketch below.
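Here is a minimal bag-of-words sketch using scikit-learn's CountVectorizer (a common implementation; the toy corpus is made up, and get_feature_names_out assumes scikit-learn 1.0+):
from sklearn.feature_extraction.text import CountVectorizer
docs = ["the car is red", "the car is fast", "red fast car"]  # toy corpus
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # sparse matrix of word counts
print(vectorizer.get_feature_names_out())   # vocabulary learned from the corpus
print(X.toarray())                          # one count vector per document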
 ML Dataset :-
A machine learning dataset is defined as the collection of data needed to train a model and make predictions. Datasets are classified as structured or unstructured: structured datasets are in tabular format, in which each row corresponds to a record and each column to a feature, while unstructured datasets correspond to images, text, speech, audio, etc. The data is acquired through Data Acquisition, Data Wrangling and Data Exploration. During the learning process these datasets are divided into training, validation and test sets for training the model and measuring its accuracy.
 Types of datasets :-
1. Training Dataset: This dataset is used to train the model, i.e. it is used to update the weights of the model.
2. Validation Dataset: This type of dataset is used to reduce overfitting. It is used to verify that an increase in accuracy on the training dataset actually carries over to data that was not used in training. If the accuracy over the training dataset increases while the accuracy over the validation dataset decreases, this is a case of high variance, i.e. overfitting.
3. Test Dataset: Often, when we make changes to the model based on the output of the validation set, we unintentionally let the model peek into the validation set, and as a result the model might overfit on the validation set as well. To overcome this issue we keep a test dataset that is only used to test the final output of the model in order to confirm its accuracy. A common way to obtain the three sets is sketched below.
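One common way to carve out all three sets is two successive calls to scikit-learn's train_test_split; the 60/20/20 ratio and toy data below are only illustrative assumptions:
import numpy as np
from sklearn.model_selection import train_test_split
X = np.arange(20).reshape(10, 2)  # toy feature matrix
y = np.arange(10)                 # toy labels
# First split off 20% as the test set, then split the remainder into train/validation.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)
# 0.25 of the remaining 80% yields a 60/20/20 train/validation/test split.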
 Pre-processing refers to the changes applied to our data before feeding it to the ML algorithm. Data pre-processing is a technique used to convert the collected raw data into a clean dataset. In other words, whenever data is gathered from different sources, it is collected in a raw format which is not feasible for analysis or processing by an ML model. The following figure shows the transformations performed on raw data before, during and after applying ML techniques:
 Data Pre-processing in Machine Learning can be broadly divided into 3 main
parts –
1.Data Integration
2.Data Cleaning
3.Data Transformation
There are various steps in each of these 3 broad categories. Depending on
the data and machine learning algorithm involved, not all steps might be
required though.
1. Data Integration and Formatting :-
During hackathons and competitions we often deal with a single CSV or Excel file containing all the training data. But in the real world the sources of data are rarely this simple: we may have to extract data from various sources and integrate it.
2. Data Cleaning :-
2.1 Dealing with Missing Data :-
It is common to have some missing or null values in a real-world dataset. Most machine learning algorithms will not work with such data, so it becomes important to deal with missing or null values. Some of the common measures taken are:
 Get rid of a column if plenty of its rows have null values.
 Eliminate a row if plenty of its columns have null values.
 Replace the missing value with the mean, median or mode of that column, depending on the data distribution in that column.
 Substitute the missing values with 'NA' or 'Unknown' or some other relevant term; in the case of a categorical feature column, we can treat missing data as a new category in itself.
 Come up with an educated guess of a possible candidate value by applying regression or classification techniques to predict the missing value. A minimal pandas sketch of the simpler options follows.
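This sketch assumes pandas; the column names and values are made up for illustration:
import numpy as np
import pandas as pd
df = pd.DataFrame({"Age": [25, np.nan, 40, 35],
                   "City": ["Pune", None, "Delhi", "Pune"]})
# df = df.dropna()                              # alternatively, drop rows containing nulls
df["Age"] = df["Age"].fillna(df["Age"].mean())  # numeric column: replace with the mean
df["City"] = df["City"].fillna("Unknown")       # categorical column: missing becomes its own category
print(df)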
2.2 Remove Noise from Data :-
Noise consists of slightly erroneous observations that do not follow the trend or distribution of the rest of the data. Though each individual error may be small, cumulatively noisy data results in a poor machine learning model. Noise in data can be minimized or smoothed out using the popular techniques below –
 Binning
 Regression
 Binning Method:
 First sort the data and partition it into bins.
 Then one can smooth by bin means, medians or boundaries (a Python sketch follows the example).
For example:
• Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
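A small sketch of the mean-smoothing variant with equal-size bins, matching the example above:
import numpy as np
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]        # already sorted
bins = np.array_split(prices, 3)                              # partition into 3 equal-size bins
smoothed = [int(np.mean(b).round()) for b in bins for _ in b] # replace each value by its bin mean
print(smoothed)  # [9, 9, 9, 9, 23, 23, 23, 23, 29, 29, 29, 29]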
 Regression Method:
 Here the data can be smoothed by fitting it to a function.
 Linear regression involves finding the "best" line to fit two attributes, so that one attribute can be used to predict the other (a minimal sketch follows this list).
 Multiple linear regression is an extension of linear regression in which more than two attributes are involved and the data are fit to a multidimensional surface.
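A minimal sketch with numpy's polyfit, fitting the "best" least-squares line to made-up noisy data:
import numpy as np
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # attribute used to predict
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])    # noisy attribute to be smoothed
slope, intercept = np.polyfit(x, y, 1)      # least-squares best-fit line
y_smoothed = slope * x + intercept          # replace noisy values with fitted values
print(y_smoothed)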
2.3 Remove Outliers from Data :-
 Outliers are observations with extreme values, far beyond the normal range of values for that feature. For example, the very high salary of a company's CEO can be an outlier if we consider the salaries of the other, regular employees of the company.
 Even a few outliers in a dataset can contribute to poor accuracy of a machine learning model. The common methods to detect and remove outliers are –
 Standard Deviation
 Box Plot
 Standard Deviation
In statistics, if a data distribution is approximately normal, then about 68% of the data values lie within one standard deviation of the mean, about 95% within two standard deviations, and about 99.7% within three standard deviations.
 Therefore, any data point that lies more than three standard deviations from the mean is very likely to be anomalous, i.e. an outlier; a sketch of this rule follows.
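A sketch of the three-standard-deviation rule in numpy, using synthetic data with one injected extreme value:
import numpy as np
rng = np.random.default_rng(0)
data = np.append(rng.normal(50, 5, 1000), 120)  # normal data plus one extreme value
mean, std = data.mean(), data.std()
outliers = data[np.abs(data - mean) > 3 * std]
print(outliers)  # the injected 120 is flagged (a few genuine extreme draws may be too)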
 Box Plots:
 Box plots are a graphical depiction of numerical data through their quartiles. They are a very simple but effective way to visualize outliers. Think of the lower and upper whiskers as the boundaries of the data distribution: any data points that show up above or below the whiskers can be considered outliers or anomalous.
 The concept of the Interquartile Range (IQR) is used to build box plot graphs. The IQR is a concept in statistics used to measure statistical dispersion and data variability by dividing the dataset into quartiles.
 In simple words, any dataset or set of observations is divided into four defined intervals based on the values of the data and how they compare to the entire dataset. The quartiles divide the data at three points into four intervals; a sketch of IQR-based detection follows.
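A sketch of IQR-based detection, using the usual 1.5 × IQR fences that define the box plot whiskers (the salaries are made up, with the CEO as the outlier):
import numpy as np
salaries = np.array([40, 45, 50, 52, 55, 58, 60, 500])
q1, q3 = np.percentile(salaries, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(salaries[(salaries < lower) | (salaries > upper)])  # -> [500]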
2.4 Dealing with Duplicate Data :-
The approach to dealing with duplicate data depends on whether the duplicates represent a real-world scenario or are an inconsistency. If it is the former, the duplicate data should be preserved; otherwise it should be removed.
2.5 Dealing with Inconsistent Data :-
Some data might not be consistent with the business rules. Identifying and dealing with such inconsistencies requires domain knowledge.
3. Data Transformation :-
3.1 Feature Scaling :-
Feature scaling is one of the most important prerequisites for most machine learning algorithms. The features of a dataset can each have their own range of values, far different from one another. For example, human age mostly lies within the range of 0-100 years, but the populations of countries are in the millions.
These huge differences between the ranges of features in a dataset can distort the training of a machine learning model, so we need to bring the ranges of all the features to a common scale. The common approaches to feature scaling, sketched below, are –
 Mean Normalization
 Min-Max Normalization
 Z-Score Normalization or Standardization
We will give a brief overview of these approaches, with examples, in the last section of this unit.
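As a preview, here is a minimal numpy sketch of the three approaches on a made-up age column:
import numpy as np
ages = np.array([20.0, 35.0, 50.0, 65.0, 80.0])
mean_norm = (ages - ages.mean()) / (ages.max() - ages.min())  # mean normalization
min_max = (ages - ages.min()) / (ages.max() - ages.min())     # min-max: squeezed into [0, 1]
z_score = (ages - ages.mean()) / ages.std()                   # z-score: zero mean, unit variance
print(mean_norm, min_max, z_score)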
3.2 Dealing with Categorical Data :-
 Categorical data, also known as qualitative data, is text or string-based data. Examples of categorical data are the gender of a person (Male or Female), names of places (India, America, England) and the colour of a car (Red, White).
 Most machine learning algorithms work on numerical data only and cannot process categorical data, so we need to transform categorical data into numerical form without losing the sense of the information. Below are the popular approaches to convert categorical data into numerical form (see the sketch after this list) –
 Label Encoding
 One Hot Encoding
 Binary Encoding
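A short pandas sketch of the first two approaches (binary encoding usually needs an extra package such as category_encoders, so it is omitted here):
import pandas as pd
df = pd.DataFrame({"colour": ["Red", "White", "Red", "Blue"]})
df["colour_label"] = df["colour"].astype("category").cat.codes  # label encoding: category -> integer code
one_hot = pd.get_dummies(df["colour"], prefix="colour")         # one-hot: one 0/1 column per category
print(pd.concat([df, one_hot], axis=1))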
3.3 Dealing with Imbalanced Data Sets :-
 An imbalanced dataset is one in which most of the data belongs to one class and very little belongs to the other class. This is common in medical diagnosis and anomaly detection, where the data belonging to the positive class is a very small percentage.
 For example, only 5-10% of the data might belong to the disease-positive class, which can be an expected distribution in medical diagnosis. But this skewed distribution can trick the machine learning model during the training phase into only identifying the majority class while failing to learn the minority class. For example, the model might fail to identify the medical condition even though it shows a very high accuracy by identifying the negative scenarios.
 We need to do something about an imbalanced dataset to avoid a bad machine learning model. Below are some approaches to deal with this situation (a SMOTE sketch follows the list) –
 Under Sampling the Majority Class
 Over Sampling the Minority Class
 SMOTE (Synthetic Minority Oversampling Technique)
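As a sketch, the imbalanced-learn package (a separate install: pip install imbalanced-learn) provides all three strategies; SMOTE is shown below on synthetic data:
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE  # from the imbalanced-learn package
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)  # ~95%/5% imbalance
print(Counter(y))                         # heavily skewed class counts
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))                     # balanced after synthetic minority oversampling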
3.4 Feature Engineering :-
 Feature engineering is the art of creating new features from the given data by applying domain knowledge, common sense, or both.
 A very common example of feature engineering is converting a Date feature into additional features like Day, Week, Month and Year, thus adding more information to the dataset (see the sketch below).
 Feature engineering enriches the dataset with more information that can contribute to a good machine learning model.
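A sketch of the Date example with pandas (the dates are made up; isocalendar assumes pandas 1.1+):
import pandas as pd
df = pd.DataFrame({"Date": pd.to_datetime(["2021-01-15", "2021-06-03"])})
df["Day"] = df["Date"].dt.day
df["Week"] = df["Date"].dt.isocalendar().week  # ISO week number
df["Month"] = df["Date"].dt.month
df["Year"] = df["Date"].dt.year
print(df)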
3.5 Training-Test Data Split :-
 This is usually the last step of data pre-processing, performed just before training a supervised machine learning model: the dataset, after undergoing all the cleaning and transformation, is divided into two parts – one for training the machine learning model and one for testing the trained model.
 Though there is no hard rule of thumb, the training-test split is usually done randomly at an 80%-20% ratio. While splitting the dataset, care has to be taken that there is no loss of information in the training dataset.
Example: We have registered the speed of 13 cars:
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
1. Mean - the average value.
 (99+86+87+88+111+86+103+87+94+78+77+85+86) / 13 = 89.77
 The NumPy module has a mean() method for this.
 import numpy
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
x = numpy.mean(speed)
print(x)
2. Median - the midpoint value.
 The median value is the value in the middle after you have sorted all the values:
 (77, 78, 85, 86, 86, 86, 87, 87, 88, 94, 99, 103, 111) → 87
 It is important that the numbers are sorted before you find the median.
 The NumPy module has a median() method to find the middle value.
 import numpy
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
x = numpy.median(speed)
print(x)
 If there are two numbers in the middle, divide the sum of those numbers by two.
 Ex. (77, 78, 85, 86, 86, 86, 87, 87, 94, 98, 99, 103)
 (86 + 87) / 2 = 86.5
3. Mode - the most common value.
 The mode is the value that appears the most often.
 (99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86) = 86
 The SciPy module has a mode() method to find the number that appears most often.
 from scipy import stats
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
x = stats.mode(speed)
print(x)
 Standard deviation is a number that describes how spread out the values are.
 A low standard deviation means that most of the numbers are close to the mean (average) value.
 A high standard deviation means that the values are spread out over a wider range.
 Example: We have registered the speed of 7 cars:
 speed = [86,87,88,86,87,85,86]
 The standard deviation is 0.9.
 Meaning that most of the values are within 0.9 of the mean value, which is 86.4.
 Example: We have registered the speed of another 7 cars:
 speed = [32,111,138,28,59,77,97]
 The standard deviation is 37.85.
 Meaning that most of the values are within 37.85 of the mean value, which is 77.4.
 A higher standard deviation indicates that the values are spread out over a wider range.
 The NumPy module has a std() method to find the standard deviation.
 import numpy
speed = [86,87,88,86,87,85,86]
x = numpy.std(speed)
print(x)
 import numpy
speed = [32,111,138,28,59,77,97]
x = numpy.std(speed)
print(x)
 Standard Deviation is often represented by the symbol Sigma: σ
 Variance is another number that indicates how spread out the values are.
 If you take the square root of the variance, you get the standard deviation!
 Conversely, if you multiply the standard deviation by itself, you get the variance!
 To calculate the variance you have to do as follows:
1. Find the mean:
(32+111+138+28+59+77+97) / 7 = 77.4
2. For each value: find the difference from the mean:
32 - 77.4 = -45.4
111 - 77.4 = 33.6
138 - 77.4 = 60.6
28 - 77.4 = -49.4
59 - 77.4 = -18.4
77 - 77.4 = - 0.4
97 - 77.4 = 19.6
3. For each difference: find the square value:
(-45.4)² = 2061.16
(33.6)² = 1128.96
(60.6)² = 3672.36
(-49.4)² = 2440.36
(-18.4)² = 338.56
(-0.4)² = 0.16
(19.6)² = 384.16
4. The variance is the average of these squared differences:
(2061.16+1128.96+3672.36+2440.36+338.56+0.16+384.16) / 7 = 1432.2
 The NumPy module has a NumPy var() method to find the variance.
import numpy
speed = [32,111,138,28,59,77,97]
x = numpy.var(speed)
print(x)
 Variance is often represented by the symbol Sigma Squared: σ²
 Percentiles are used in statistics to give you a number that describes the value that a given percent of the values are lower than.
 Example: we have an array of the ages of all the people that live in a street.
 ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]
 What is the 75th percentile? The answer is 43, meaning that 75% of the people are 43 or younger.
 The NumPy module has a percentile() method to find the percentiles:
 import numpy
ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]
x = numpy.percentile(ages, 75)
print(x)
 What is the age that 90% of the people are younger than?
 import numpy
ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]
x = numpy.percentile(ages, 90)
print(x)
 Statistics plays a major role in the field of research. It helps us in the collection, analysis and presentation of data.
Before starting with descriptive and inferential statistics, let us get the basic idea of population and sample.
Population :-
The population is the group that is targeted to collect the data from; our data is the information collected from the population. The population is always defined first, before starting the data collection process for any statistical study. A population does not necessarily consist of people: it could be a batch of batteries, measurements of rainfall in an area, or a group of people.
Sample :-
A sample is the part of the population which is selected randomly for the study. The sample should be selected such that it represents all the characteristics of the population. The process of selecting the subset from the population is called sampling, and the subset selected is called the sample.
1. Descriptive Statistics :-
It describes the important characteristics/properties of the data using measures of central tendency like mean/median/mode and measures of dispersion like range, standard deviation, variance, etc.
Data can be summarized and represented in an accurate way using charts, tables and graphs.
For example: we have the marks of 1000 students and we may be interested in the overall performance of those students as well as the distribution and spread of the marks. Descriptive statistics provides us the tools to describe our data in the most understandable and appropriate way.
2. Inferential Statistics :-
It is about using data from a sample to make inferences about the larger population from which the sample was drawn. The goal of inferential statistics is to draw conclusions from a sample and generalize them to the population. It determines the probability of the characteristics of the sample using probability theory. The most common methodologies used are hypothesis tests, analysis of variance, etc.
For example: suppose we are interested in the exam marks of all the students in India. It is not feasible to measure the exam marks of all the students in India, so we measure the marks of a smaller sample of students, for example 1000 students. This sample now represents the large population of Indian students. We would use this sample for our statistical study of the population from which it is drawn.
 Probability theory is a branch of mathematics concerned with the analysis of random phenomena. The outcome of a random event cannot be determined before it occurs, but it may be any one of several possible outcomes. The actual outcome is considered to be determined by chance.
 Ex. Tossing a coin.
When a coin is tossed, there are two possible outcomes:
1. heads (H) or
2. tails (T)
We say that the probability of the coin landing H is ½, and the probability of the coin landing T is ½.
Ex. Throwing a die.
When a single die is thrown, there are six possible outcomes: 1, 2, 3, 4, 5, 6. The probability of any one of them is 1/6.
So,
Probability of an event happening = (Number of ways it can happen) / (Total number of outcomes)
 Many supervised and unsupervised machine learning models, such as K-NN and K-Means, depend upon the distance between two data points to predict the output. Therefore, the metric we use to compute distances plays an important role in these models.
1. Euclidean Distance:
 Euclidean distance is the straight-line distance between two data points in a plane.
 It is calculated from the Minkowski distance formula by setting 'p' to 2; it is therefore also known as the L2-norm distance metric.
 The formula is similar to the Pythagorean theorem, and thus it is also known as the Pythagorean distance. The formula is:
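For two points p = (p1, p2, …, pn) and q = (q1, q2, …, qn):
d(p, q) = √((p1 − q1)² + (p2 − q2)² + … + (pn − qn)²)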
2. Manhattan Distance:
 We use Manhattan distance, also known as city block distance or taxicab geometry, when we need to calculate the distance between two data points along a grid-like path. The Manhattan distance metric can be understood with the help of a simple example.
 Imagine each cell of a grid to be a building, and the grid lines to be roads. If we want to travel from point A to point B along the roads, the path is not straight and there are turns. In this case, we use the Manhattan distance metric to calculate the distance walked.
 We get the equation for Manhattan distance by substituting p = 1 in the Minkowski distance formula. The formula is:
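For the same two points p and q:
d(p, q) = |p1 − q1| + |p2 − q2| + … + |pn − qn|
A small numpy sketch computing both metrics for two made-up points:
import numpy
p = numpy.array([1.0, 2.0])
q = numpy.array([4.0, 6.0])
euclidean = numpy.sqrt(numpy.sum((p - q) ** 2))  # straight-line distance -> 5.0
manhattan = numpy.sum(numpy.abs(p - q))          # taxicab distance -> 7.0
print(euclidean, manhattan)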
 In the early days, people performed machine learning tasks by manually coding all the algorithms and the mathematical and statistical formulas. This made the process time-consuming, tedious and inefficient. In modern times it has become much easier and more efficient thanks to various Python libraries, frameworks and modules. Today, Python is one of the most popular programming languages for this task, and it has replaced many languages in the industry; one of the reasons is its vast collection of libraries. Python libraries used in machine learning include:
 Numpy
 Scipy
 Scikit-learn
 Theano
 TensorFlow
 Keras
 PyTorch
 Pandas
 Matplotlib
 The single most important reason for the popularity of Python in the field of AI and ML is that it provides thousands of inbuilt libraries with ready-made functions and methods to easily carry out data analysis, processing, wrangling, modelling and so on.
In the sections below, we discuss the libraries for the following tasks:
1. Statistical Analysis
2. Data Visualization
3. Data Modelling and Machine Learning
4. Deep Learning
5. Natural Language Processing (NLP)
1. Statistical Analysis:
 Python comes with tons of libraries for the sole purpose of statistical analysis. The top statistical packages that provide in-built functions to perform the most complex statistical computations are:
NumPy :- NumPy, or Numerical Python, is one of the most commonly used Python libraries. The main feature of this library is its support for multi-dimensional arrays for mathematical and logical operations. Functions provided by NumPy can be used for indexing, sorting, reshaping, and representing images and sound waves as multi-dimensional arrays of real numbers.
SciPy :- Built on top of NumPy, the SciPy library is a collection of sub-packages that help in solving the most basic problems related to statistical analysis. It is used to process the array elements defined using the NumPy library, and is often used to compute mathematical equations that cannot be done using NumPy alone.
Pandas :- Pandas is another important statistical library mainly used in a
wide range of fields including, statistics, finance, economics, data analysis
and so on. The library relies on the NumPy array for the purpose of
processing pandas data objects. NumPy, Pandas, and SciPy are heavily
dependent on each other for performing scientific computations, data
manipulation and so on. Pandas is one of the best libraries for processing
huge chunks of data, whereas NumPy has excellent support for multi-
dimensional arrays and Scipy, on the other hand, provides a set of sub-
packages that perform a majority of the statistical analysis tasks.
StatsModels :- Built on top of NumPy and SciPy, the StatsModels Python
package is the best for creating statistical models, data handling and model
evaluation. Along with using NumPy arrays and scientific models from SciPy
library, it also integrates with Pandas for effective data handling. This library
is famously known for statistical computations, statistical testing, and data
exploration.
2. Data Visualization:
 A picture speaks more than a thousand words. Data visualization is all about expressing the key insights from data effectively through graphical representations. It includes the implementation of graphs, charts, mind maps, heat maps, histograms, density plots, etc., to study the correlations between various data variables.
 The best Python data visualization packages that provide in-built functions to study the dependencies between various data features are:
Matplotlib :- Matplotlib is the most basic data visualization package in Python. It provides support for a wide variety of graphs such as histograms, bar charts, power spectra, error charts, and so on. It is a 2-dimensional graphical library that produces clear and concise graphs, which are essential for Exploratory Data Analysis (EDA).
Seaborn :- The Matplotlib library forms the base of the Seaborn library. In comparison to Matplotlib, Seaborn can be used to create more appealing and descriptive statistical graphs. Along with extensive support for data visualization, Seaborn comes with an inbuilt dataset-oriented API for studying the relationships between multiple variables.
Plotly :- Plotly is one of the most well-known graphical Python libraries. It provides interactive graphs for understanding the dependencies between target and predictor variables. It can be used to analyze and visualize statistical, financial, commercial and scientific data to produce clear and concise graphs, sub-plots, heat maps, 3D charts and so on.
Bokeh :- One of the most interactive libraries in Python, Bokeh can be used to build descriptive graphical representations for web browsers. It can easily process huge datasets and build versatile graphs that help in performing extensive EDA. Bokeh provides well-defined functionality for building interactive plots, dashboards, and data applications.
3. Machine Learning :
 Implementing ML, DL, etc. involves coding thousands of lines, and this can become even more cumbersome when you want to create models that solve complex problems through neural networks. Thankfully, we don't have to code the algorithms ourselves, because Python comes with several packages built just for implementing machine learning techniques and algorithms.
 The top ML packages that provide in-built functions to implement all the ML algorithms are:
Scikit-learn :- One of the most useful Python libraries, Scikit-learn is the best library for data modelling and model evaluation. It comes with tons of functions for the sole purpose of creating models. It contains all the supervised and unsupervised machine learning algorithms, and it also comes with well-defined functions for ensemble learning and boosting.
XGBoost :- XGBoost, which stands for Extreme Gradient Boosting, is one of the best Python packages for boosting machine learning. Libraries such as LightGBM and CatBoost are equally well equipped with well-defined functions and methods. XGBoost is built mainly for implementing gradient boosting machines, which are used to improve the performance and accuracy of machine learning models.
ELI5 :- ELI5 is another Python library, mainly used to inspect, debug and explain the predictions of machine learning models. It is relatively new and is usually used alongside XGBoost, LightGBM, CatBoost and so on.
4. Deep Learning :
 The biggest advancements in ML and AI have come through deep learning. With the introduction of deep learning, it is now possible to build complex models and process huge datasets. Thankfully, Python provides the best deep learning packages for building effective neural networks.
 The top deep learning packages that provide in-built functions to implement Convolutional Neural Networks are:
Tensorflow :- One of the best Python libraries for deep learning, TensorFlow is an open-source library for dataflow programming across a range of tasks. It is a symbolic math library used for building strong and precise neural networks. It provides an intuitive multiplatform programming interface that is highly scalable over a vast domain of fields.
Pytorch :- Pytorch is an open-source, Python-based scientific computing package used to implement deep learning techniques and neural networks on large datasets. This library is actively used by Facebook to develop neural networks that help in various tasks such as face recognition and auto-tagging.
Keras :- Keras is considered one of the best deep learning libraries in Python. It provides full support for building, analyzing, evaluating and improving neural networks. Keras is built on top of the Theano and TensorFlow Python libraries, which provide additional features for building complex and large-scale deep learning models.
5. Natural Language Processing:
 Have you ever wondered how Google so aptly predicts what you're searching for? The technology behind Alexa, Siri, and other chatbots is Natural Language Processing. NLP has played a huge role in designing AI-based systems that help describe the interaction between human language and computers.
 The top Natural Language Processing packages that provide in-built functions to implement high-level AI-based systems are:
NLTK (Natural Language Toolkit) :-
NLTK is considered to be the best Python package for analyzing human language and behaviour. Preferred by most data scientists, the NLTK library provides easy-to-use interfaces containing over 50 corpora and lexical resources that help in describing human interactions and in building AI-based systems such as recommendation engines.
spaCy :- spaCy is a free, open-source Python library for implementing advanced Natural Language Processing (NLP) techniques. When you're working with a lot of text, it is important to understand the morphological meaning of the text and how it can be classified to understand human language. These tasks can be easily achieved with spaCy.
Gensim :- Gensim is another open-source Python package, modelled to extract semantic topics from large documents and texts in order to process, analyze and predict human behaviour through statistical models and linguistic computations. It has the capability to process huge amounts of data, even when the data is raw and unstructured.
 How Can We Get Big Data Sets?
 To create big data sets for testing, we use the Python module NumPy, which comes with a number of methods to create random data sets of any size.
Example
Create an array containing 250 random floats between 0 and 5:
import numpy
x = numpy.random.uniform(0.0, 5.0, 250)
print(x)
 Histogram
 To visualize the data set we can draw a histogram with the data we
collected.
 We will use the Python module Matplotlib to draw a histogram:
Example
Draw a histogram:
import numpy
import matplotlib.pyplot as plt
x = numpy.random.uniform(0.0, 5.0, 250)
plt.hist(x, 5)
plt.show()
 We use the array from the example above to draw a histogram with 5 bars.
 The first bar represents how many values in the array are between 0 and 1.
 The second bar represents how many values are between 1 and 2, etc.
 Which gives us this result:
52 values are between 0 and 1
48 values are between 1 and 2
49 values are between 2 and 3
51 values are between 3 and 4
50 values are between 4 and 5
 Big Data Distributions
An array containing 250 values is not considered very big, but now you know how to create a random set of values, and, by changing the parameters, you can make the data set as big as you want.
Example
Create an array with 100000 random numbers, and display them
using a histogram with 100 bars:
import numpy
import matplotlib.pyplot as plt
x = numpy.random.uniform(0.0, 5.0, 100000)
plt.hist(x, 100)
plt.show()
 In probability theory this kind of data distribution is known as the normal data distribution, or the Gaussian data distribution, after the mathematician Carl Friedrich Gauss, who came up with the formula for this distribution.
Example
A typical normal data distribution:
import numpy
import matplotlib.pyplot as plt
x = numpy.random.normal(5.0, 1.0, 100000)
plt.hist(x, 100)
plt.show()
 A normal distribution graph is also known as the bell curve because of its characteristic bell shape.
 We use the array from the numpy.random.normal() method, with 100000
values, to draw a histogram with 100 bars.
 We specify that the mean value is 5.0, and the standard deviation is 1.0.
 Meaning that the values should be concentrated around 5.0, and rarely
further away than 1.0 from the mean.
 And as you can see from the histogram, most values are between 4.0 and
6.0, with a top at approximately 5.0.
 A scatter plot is a diagram where each value in the data set is represented
by a dot.
 The Matplotlib module has a method for drawing scatter plots, it needs two
arrays of the same length, one for the values of the x-axis, and one for the
values of the y-axis:
 x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
 y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
 The x array represents the age of each car.
 The y array represents the speed of each car.
Example
Use the scatter() method to draw a scatter plot diagram:
import matplotlib.pyplot as plt
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
plt.scatter(x, y)
plt.show()
 The x-axis represents ages, and the y-axis represents speeds.
 What we can read from the diagram is that the two fastest cars were both 2
years old, and the slowest car was 12 years old.
 Random Data Distributions
In machine learning, data sets can contain thousands, or even millions, of values.
 Let us create two arrays that are both filled with 1000 random numbers from
a normal data distribution.
 The first array will have the mean set to 5.0 with a standard deviation of 1.0.
 The second array will have the mean set to 10.0 with a standard deviation of
2.0:
Example
A scatter plot with 1000 dots:
import numpy
import matplotlib.pyplot as plt
x = numpy.random.normal(5.0, 1.0, 1000)
y = numpy.random.normal(10.0, 2.0, 1000)
plt.scatter(x, y)
plt.show()
 We can see that the dots are concentrated around the value 5 on the x-axis,
and 10 on the y-axis.
 We can also see that the spread is wider on the y-axis than on the x-axis.
 Execution of data pre-processing methods using Python commonly involves the following steps:
 Importing the libraries
 Importing the Dataset
 Handling of Missing Data
 Handling of Categorical Data
 Splitting the dataset into training and testing datasets
 Feature Scaling
For this data pre-processing script, we are going to use Anaconda Navigator, and specifically the Spyder IDE, to write the following code.
Importing the libraries :-
import numpy as np   # used for handling numbers
import pandas as pd  # used for handling the dataset
from sklearn.impute import SimpleImputer  # used for handling missing data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder  # used for encoding categorical data
from sklearn.model_selection import train_test_split  # used for splitting training and testing data
from sklearn.preprocessing import StandardScaler  # used for feature scaling
from sklearn.compose import ColumnTransformer  # used to apply transformers to columns of an array
Importing the Dataset :-
 First of all, let us have a look at the dataset we are going to use for this particular example. You can download or take this dataset from:
https://github.com/tarunlnmiit/machine_learning/blob/master/DataPreprocessing.csv
 It is as shown below:
 By pressing the Raw button at the link, copy this dataset and store it in a Data.csv file in the folder where your program is stored.
 In order to import this dataset into our script, we use pandas as follows.
dataset = pd.read_csv('Data.csv')  # import the dataset into a variable
# Split the attributes into independent and dependent attributes
X = dataset.iloc[:, :-1].values  # independent attributes
Y = dataset.iloc[:, -1].values   # dependent variable / class
 When you run this code section along with the library imports, you should not see any errors. When it executes successfully, you can open the Variable Explorer in the Spyder UI and you will see the following three variables.
 When you double click on each of these variables, you should see
something similar.
Handling of Missing Data :-
 The first idea is to remove the rows/observations where some data is missing. But that can be quite dangerous, because this dataset may contain key information, and it would be risky to remove such observations. So we need a better way to handle this problem, and the most common idea for handling missing data is to take the mean of the column, as discussed in the earlier section.
 If you noticed, our dataset has two values missing: one in the Age column at data index 6 and one in the Income column at data row 4. Missing values should be handled during the data analysis, so we do that as follows.
# Handle the missing data: replace missing values (nan from numpy)
# with the mean of all the other values in the column.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:])
X[:, 1:] = imputer.transform(X[:, 1:])
 After execution of this code, the independent variable X will transform into the following.
Here you can see that the missing values have been replaced by the average values of the respective columns.
Handling of Categorical Data :-
 In this dataset we have two categorical variables: the Region variable and the Online Shopper variable. These two variables are categorical because they simply contain categories: Region contains three categories (India, USA and Brazil) and Online Shopper contains two (Yes and No). That is why they are called categorical variables.
 Since machine learning models are based on mathematical equations, you can naturally understand that keeping the text of the categorical variables in the equations would cause problems, because we only want numbers in the equations. That is why we need to encode the categorical variables, i.e. encode the text into numbers. To do this we use the following code snippet.
 # encode categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
rg = ColumnTransformer([("Region", OneHotEncoder(), [0])], remainder='passthrough')
X = rg.fit_transform(X)
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
After execution of this code, the independent variable X and the dependent variable Y will transform into the following.
 Here you can see that the Region variable is now made up of a 3-bit binary encoding, with one bit per country; the columns follow the encoder's category order (alphabetical: Brazil, India, USA). If a bit is 1, the row represents data for that country; otherwise it does not.
For the Online Shopper variable, 1 represents Yes and 0 represents No.
Splitting the dataset into training and testing datasets :-
 Any machine learning algorithm needs to be tested for accuracy. In order to do that, we divide our dataset into two parts: a training set and a testing set. As the names suggest, we use the training set to make the algorithm learn the behaviours present in the data, and we check the correctness of the algorithm by testing it on the testing set. In Python, we do that as follows:
 # splitting the dataset into training set and test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
 Here we take the training set to be 80% of the original dataset and the testing set to be 20%. This is the usual ratio in which they are split, though you may sometimes come across a 70-30% or 75-25% split. You don't want to split it 50-50%, however, as that leaves the model too little data to learn from and leads to a poorly fitted model. For now, we are going to split it in an 80-20% ratio.
 After the split, our training set and testing set look like this.
Feature Scaling (Variable Transformation) :-
 As you can see, we have two columns, Age and Income, that contain numerical values. You will notice that the variables are not on the same scale: the ages go from 32 to 55, while the salaries go from 57.6K to 99.6K. Because the age and salary variables do not have the same scale, this will cause issues in your machine learning models. Why? Because many ML models are based on the Euclidean distance.
 We use feature scaling to convert the different scales to a standard scale, to make things easier for machine learning algorithms. We do this in Python as follows:
# feature scaling
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
 After the execution of this code, our training independent variable X_train and our testing independent variable X_test look like this.
Thanks !!!
Understanding The Pattern Of RecognitionUnderstanding The Pattern Of Recognition
Understanding The Pattern Of Recognition
 
Eckovation Machine Learning
Eckovation Machine LearningEckovation Machine Learning
Eckovation Machine Learning
 
Unit IV.pdf
Unit IV.pdfUnit IV.pdf
Unit IV.pdf
 
Pattern recognition
Pattern recognitionPattern recognition
Pattern recognition
 
Machine Learning Contents.pptx
Machine Learning Contents.pptxMachine Learning Contents.pptx
Machine Learning Contents.pptx
 
YASH DATA SCIENCE SEMINAR.pptx
YASH DATA SCIENCE SEMINAR.pptxYASH DATA SCIENCE SEMINAR.pptx
YASH DATA SCIENCE SEMINAR.pptx
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
ML vs AI
ML vs AIML vs AI
ML vs AI
 
Machine Learning in Business What It Is and How to Use It
Machine Learning in Business What It Is and How to Use ItMachine Learning in Business What It Is and How to Use It
Machine Learning in Business What It Is and How to Use It
 
Intro/Overview on Machine Learning Presentation -2
Intro/Overview on Machine Learning Presentation -2Intro/Overview on Machine Learning Presentation -2
Intro/Overview on Machine Learning Presentation -2
 

More from Harsha Patel

Introduction to Reinforcement Learning.pptx
Introduction to Reinforcement Learning.pptxIntroduction to Reinforcement Learning.pptx
Introduction to Reinforcement Learning.pptxHarsha Patel
 
Introduction to Association Rules.pptx
Introduction  to  Association  Rules.pptxIntroduction  to  Association  Rules.pptx
Introduction to Association Rules.pptxHarsha Patel
 
Introduction to Clustering . pptx
Introduction    to     Clustering . pptxIntroduction    to     Clustering . pptx
Introduction to Clustering . pptxHarsha Patel
 
Introduction to Classification . pptx
Introduction  to   Classification . pptxIntroduction  to   Classification . pptx
Introduction to Classification . pptxHarsha Patel
 
Introduction to Regression . pptx
Introduction     to    Regression . pptxIntroduction     to    Regression . pptx
Introduction to Regression . pptxHarsha Patel
 
Intro of Machine Learning Models .pptx
Intro of Machine  Learning  Models .pptxIntro of Machine  Learning  Models .pptx
Intro of Machine Learning Models .pptxHarsha Patel
 
Unit-V-Introduction to Data Mining.pptx
Unit-V-Introduction to  Data Mining.pptxUnit-V-Introduction to  Data Mining.pptx
Unit-V-Introduction to Data Mining.pptxHarsha Patel
 
Unit-IV-Introduction to Data Warehousing .pptx
Unit-IV-Introduction to Data Warehousing .pptxUnit-IV-Introduction to Data Warehousing .pptx
Unit-IV-Introduction to Data Warehousing .pptxHarsha Patel
 
Unit-III-AI Search Techniques and solution's
Unit-III-AI Search Techniques and solution'sUnit-III-AI Search Techniques and solution's
Unit-III-AI Search Techniques and solution'sHarsha Patel
 
Unit-II-Introduction of Artifiial Intelligence.pptx
Unit-II-Introduction of Artifiial Intelligence.pptxUnit-II-Introduction of Artifiial Intelligence.pptx
Unit-II-Introduction of Artifiial Intelligence.pptxHarsha Patel
 
Unit-I-Introduction to Recent Trends.pptx
Unit-I-Introduction to Recent Trends.pptxUnit-I-Introduction to Recent Trends.pptx
Unit-I-Introduction to Recent Trends.pptxHarsha Patel
 
Using Unix Commands.pptx
Using Unix Commands.pptxUsing Unix Commands.pptx
Using Unix Commands.pptxHarsha Patel
 
Using Vi Editor.pptx
Using Vi Editor.pptxUsing Vi Editor.pptx
Using Vi Editor.pptxHarsha Patel
 
Shell Scripting and Programming.pptx
Shell Scripting and Programming.pptxShell Scripting and Programming.pptx
Shell Scripting and Programming.pptxHarsha Patel
 
Managing Processes in Unix.pptx
Managing Processes in Unix.pptxManaging Processes in Unix.pptx
Managing Processes in Unix.pptxHarsha Patel
 
Introduction to Unix Concets.pptx
Introduction to Unix Concets.pptxIntroduction to Unix Concets.pptx
Introduction to Unix Concets.pptxHarsha Patel
 
Handling Files Under Unix.pptx
Handling Files Under Unix.pptxHandling Files Under Unix.pptx
Handling Files Under Unix.pptxHarsha Patel
 
Introduction to OS.pptx
Introduction to OS.pptxIntroduction to OS.pptx
Introduction to OS.pptxHarsha Patel
 
Using Unix Commands.pptx
Using Unix Commands.pptxUsing Unix Commands.pptx
Using Unix Commands.pptxHarsha Patel
 
Introduction to Unix Concets.pptx
Introduction to Unix Concets.pptxIntroduction to Unix Concets.pptx
Introduction to Unix Concets.pptxHarsha Patel
 

More from Harsha Patel (20)

Introduction to Reinforcement Learning.pptx
Introduction to Reinforcement Learning.pptxIntroduction to Reinforcement Learning.pptx
Introduction to Reinforcement Learning.pptx
 
Introduction to Association Rules.pptx
Introduction  to  Association  Rules.pptxIntroduction  to  Association  Rules.pptx
Introduction to Association Rules.pptx
 
Introduction to Clustering . pptx
Introduction    to     Clustering . pptxIntroduction    to     Clustering . pptx
Introduction to Clustering . pptx
 
Introduction to Classification . pptx
Introduction  to   Classification . pptxIntroduction  to   Classification . pptx
Introduction to Classification . pptx
 
Introduction to Regression . pptx
Introduction     to    Regression . pptxIntroduction     to    Regression . pptx
Introduction to Regression . pptx
 
Intro of Machine Learning Models .pptx
Intro of Machine  Learning  Models .pptxIntro of Machine  Learning  Models .pptx
Intro of Machine Learning Models .pptx
 
Unit-V-Introduction to Data Mining.pptx
Unit-V-Introduction to  Data Mining.pptxUnit-V-Introduction to  Data Mining.pptx
Unit-V-Introduction to Data Mining.pptx
 
Unit-IV-Introduction to Data Warehousing .pptx
Unit-IV-Introduction to Data Warehousing .pptxUnit-IV-Introduction to Data Warehousing .pptx
Unit-IV-Introduction to Data Warehousing .pptx
 
Unit-III-AI Search Techniques and solution's
Unit-III-AI Search Techniques and solution'sUnit-III-AI Search Techniques and solution's
Unit-III-AI Search Techniques and solution's
 
Unit-II-Introduction of Artifiial Intelligence.pptx
Unit-II-Introduction of Artifiial Intelligence.pptxUnit-II-Introduction of Artifiial Intelligence.pptx
Unit-II-Introduction of Artifiial Intelligence.pptx
 
Unit-I-Introduction to Recent Trends.pptx
Unit-I-Introduction to Recent Trends.pptxUnit-I-Introduction to Recent Trends.pptx
Unit-I-Introduction to Recent Trends.pptx
 
Using Unix Commands.pptx
Using Unix Commands.pptxUsing Unix Commands.pptx
Using Unix Commands.pptx
 
Using Vi Editor.pptx
Using Vi Editor.pptxUsing Vi Editor.pptx
Using Vi Editor.pptx
 
Shell Scripting and Programming.pptx
Shell Scripting and Programming.pptxShell Scripting and Programming.pptx
Shell Scripting and Programming.pptx
 
Managing Processes in Unix.pptx
Managing Processes in Unix.pptxManaging Processes in Unix.pptx
Managing Processes in Unix.pptx
 
Introduction to Unix Concets.pptx
Introduction to Unix Concets.pptxIntroduction to Unix Concets.pptx
Introduction to Unix Concets.pptx
 
Handling Files Under Unix.pptx
Handling Files Under Unix.pptxHandling Files Under Unix.pptx
Handling Files Under Unix.pptx
 
Introduction to OS.pptx
Introduction to OS.pptxIntroduction to OS.pptx
Introduction to OS.pptx
 
Using Unix Commands.pptx
Using Unix Commands.pptxUsing Unix Commands.pptx
Using Unix Commands.pptx
 
Introduction to Unix Concets.pptx
Introduction to Unix Concets.pptxIntroduction to Unix Concets.pptx
Introduction to Unix Concets.pptx
 

Recently uploaded

Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 

Recently uploaded (20)

Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 

Introduction to Machine Learning.pptx

  • 1. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 2.  Data science is all about data.  Every single tech company out there is collecting huge amounts of data. And data is revenue.  The more data you have, the more business insights you can generate.  Data Science is the art of turning data into actions through creating data products.  Performing data science requires extraction of timely, actionable information from diverse data sources to drive data products. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 3.  Artificial intelligence is a field of computer science which makes a computer system that can mimic human intelligence. It is comprised of two words "Artificial" and "intelligence", which means "a human- made thinking power." Hence we can define it as,  Artificial intelligence is a technology using which we can create intelligent systems that can simulate human intelligence.  The Artificial intelligence system does not require to be pre- programmed, instead of that, they use such algorithms which can work with their own intelligence. It involves machine learning algorithms such as Reinforcement learning algorithm and deep learning neural networks. AI is being used in multiple places such as Siri, Google's AlphaGo, AI in Chess playing, etc. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 4.  Machine learning is about extracting knowledge from the data. It can be defined as,  Machine learning is a subfield of artificial intelligence, which enables machines to learn from past data or experiences without being explicitly programmed.  Machine learning enables a computer system to make predictions or take some decisions using historical data without being explicitly programmed. Machine learning uses a massive amount of structured and semi-structured data so that a machine learning model can generate accurate result or give predictions based on that data.  Machine learning works on algorithm which learn by it’s own using historical data. It works only for specific domains such as if we are creating a machine learning model to detect pictures of dogs, it will only give result for dog images, but if we provide a new data like cat image then it will become unresponsive. Machine learning is being used in various places such as for online recommender system, for Google search algorithms, Email spam filter, Face book Auto friend tagging suggestion, etc. It can be divided into three types:  Supervised learning  Reinforcement learning  Unsupervised learning Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 5.  Artificial intelligence and machine learning are the part of computer science that are correlated with each other. These two technologies are the most trending technologies which are used for creating intelligent systems.  Although these are two related technologies and sometimes people use them as a synonym for each other, but still both are the two different terms in various cases.  AI is a bigger concept to create intelligent machines that can simulate human thinking capability and behavior, whereas, machine learning is an application or subset of AI that allows machines to learn from data without being programmed explicitly. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 6. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 7. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 8.  A Machine Learning system learns from historical data, builds the prediction models, and whenever it receives new data, predicts the output for it. The accuracy of predicted output depends upon the amount of data, as the huge amount of data helps to build a better model which predicts the output more accurately.  Suppose we have a complex problem, where we need to perform some predictions, so instead of writing a code for it, we just need to feed the data to generic algorithms, and with the help of these algorithms, machine builds the logic as per the data and predict the output. Machine learning has changed our way of thinking about the problem. The below block diagram explains the working of Machine Learning algorithm: Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 9. Features of Machine Learning: Machine learning uses data to detect various patterns in a given dataset. It can learn from past data and improve automatically. It is a data-driven technology. Machine learning is much similar to data mining as it also deals with the huge amount of the data. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 10.  Traditional Programming refers to any manually created program that uses input data and runs on a computer to produce the output. Traditional programming is a manual process — meaning a person (programmer) creates the program. But without anyone programming the logic, one has to manually formulate or code rules. We have the input data, and someone (programmer) coded a program that uses that data and runs on a computer to produce the desired output. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 11.  Machine Learning, also known as augmented analytics, the input data and output are fed to an algorithm to create a program. This yields powerful insights that can be used to predict future outcomes. Machine Learning, on the other hand, the input data and output are fed to an algorithm to create a program. This is the basic difference between traditional programming and machine learning. Without anyone programming the logic, In Traditional programming one has to manually formulate/code rules while in Machine Learning the algorithms automatically formulate the rules from the data, which is very powerful. . Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 12. Fig:- Block diagram of decision flow architecture for ML System Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 13.  1. Data Acquisition As machine learning is based on available data for the system to make a decision hence the first step defined in the architecture is data acquisition. This involves data collection, preparing and segregating the case scenarios based on certain features involved with the decision making cycle and forwarding the data to the processing unit for carrying out further categorization. This stage is sometimes called the data preprocessing stage. The data model expects reliable, fast and elastic data which may be discrete or continuous in nature. The data is then passed into stream processing systems (for continuous data) and stored in batch data warehouses (for discrete data) before being passed on to data modeling or processing stages.  2. Data Processing The received data in the data acquisition layer is then sent forward to the data processing layer where it is subjected to advanced integration and processing and involves normalization of the data, data cleaning, transformation, and encoding. The data processing is also dependent on the type of learning being used. For e.g., if supervised learning is being used the data shall be needed to be segregated into multiple steps of sample data required for training of the system and the data thus created is called training sample data or simply training data. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 14.  3. Data Modeling This layer of the architecture involves the selection of different algorithms that might adapt the system to address the problem for which the learning is being devised, These algorithms are being evolved or being inherited from a set of libraries. The algorithms are used to model the data accordingly, this makes the system ready for execution step.  4. Execution This stage in machine learning is where the experimentation is done, testing is involved and tunings are performed. The general goal behind being to optimize the algorithm in order to extract the required machine outcome and maximize the system performance, The output of the step is a refined solution capable of providing the required data for the machine to make decisions.  5. Deployment Like any other software output, ML outputs need to be operational zed or be forwarded for further exploratory processing. The output can be considered as a non- deterministic query which needs to be further deployed into the decision-making system. It is advised to seamlessly move the ML output directly to production where it will enable the machine to directly make decisions based on the output and reduce the dependency on the further exploratory steps. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 15.  Machine Learning Applications in Healthcare :- Doctors and medical practitioners will soon be able to predict with accuracy on how long patients with fatal diseases will live. Medical systems will learn from data and help patients save money by skipping unnecessary tests. i) Drug Discovery/Manufacturing ii) Personalized Treatment/Medication  Machine Learning Applications in Finance :- More than 90% of the top 50 financial institutions around the world are using machine learning and advanced analytics. The application of machine learning in Finance domain helps banks offer personalized services to customers at lower cost, better compliance and generate greater revenue. i)Machine Learning for Fraud Detection ii)Machine Learning for focused Account Holder Targeting Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 16.  Machine Learning Applications in Retail :- Machine learning in retail is more than just a latest trend, retailers are implementing big data technologies like Hadoop and Spark to build big data solutions. Machine learning algorithms process this data intelligently and automate the analysis to make this supercilious goal possible for retail giants like Amazon, Alibaba and Walmart. i)Machine Learning Examples in Retail for Product Recommendations ii)Machine Learning Examples in Retail for Improved Customer Service.  Machine Learning Applications in Travel :- Travel providers can help travellers find the best time to book a hotel or buy a cheap ticket by taking advantage of machine learning. When the agreement is available, the application will probably send a notification to the user. i)Machine Learning Examples in Travel for Dynamic Pricing ii)Machine Learning Examples in Travel for Sentiment Analysis Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 17.  Machine Learning Applications in Media :- Machine learning offers the most efficient means of engaging billions of social media users. From personalizing news feed to rendering targeted ads, machine learning is the heart of all social media platforms for their own and user benefits. Social media and chat applications have advanced to a great extent that users do not pick up the phone or use email to communicate with brands – they leave a comment on Facebook or Instagram expecting a speedy reply than the traditional channels.  Earlier Facebook used to prompt users to tag your friends but nowadays the social networks artificial neural networks machine learning algorithm identifies familiar faces from contact list. The ANN (Artificial Neural Networks)algorithm mimics the structure of human brain to power facial recognition.  The professional network like LinkedIn knows where you should apply for your next job, whom you should connect with and how your skills stack up against your peers as you search for new job. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 18. Let’s understand the type of data available in the datasets from the perspective of machine learning. 1. Numerical Data :- Any data points which are numbers are termed as numerical data. Numerical data can be discrete or continuous. Continuous data has any value within a given range while the discrete data is supposed to have a distinct value. For example, the number of doors of cars will be discrete i.e. either two, four, six, etc. and the price of the car will be continuous that is might be 1000$ or 1250.5$. The data type of numerical data is int64 or float64. 2. Categorical Data :- Categorical data are used to represent the characteristics. For example car color, date of manufacture, etc. It can also be a numerical value provided the numerical value is indicating a class. For example, 1 can be used to denote a gas car and 0 for a diesel car. We can use categorical data to forms groups but cannot perform any mathematical operations on them. Its data type is an object. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 19. 3. Time Series Data :- It is the collection of a sequence of numbers collected at a regular interval over a certain period of time. It is very important, like in the field of the stock market where we need the price of a stock after a constant interval of time. The type of data has a temporal field attached to it so that the timestamp of the data can be easily monitored. 4. Text Data :- Text data is nothing but literals. The first step of handling test data is to convert them into numbers as or model is mathematical and needs data to inform of numbers. So to do so we might use functions as a bag of word formulation. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 20.  ML Dataset :- Machine learning dataset is defined as the collection of data that is needed to train the model and make predictions. These datasets are classified as structured and unstructured datasets, where the structured datasets are in tabular format in which the row of the dataset corresponds to record and column corresponds to the features, and unstructured datasets corresponds to the images, text, speech, audio etc. which is acquired through Data Acquisition, Data Wrangling and Data Exploration, during the learning process these datasets are divided as training, validation and test sets for the training and measuring the accuracy of the mode. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 21.  Types of datasets :- 1.Training Dataset: This data set is used to train the model i.e. these datasets are used to update the weight of the model. 2.Validation Dataset: These types of a dataset are used to reduce over fitting. It is used to verify that the increase in the accuracy of the training dataset is actually increased if we test the model with the data that is not used in the training. If the accuracy over the training dataset increase while the accuracy over the validation dataset decrease, then this results in the case of high variance i.e. over fitting. 3.Test Dataset: Most of the time when we try to make changes to the model based upon the output of the validation set then unintentionally we make the model peek into our validation set and as a result, our model might get over fit on the validation set as well. To overcome this issue we have a test dataset that is only used to test the final output of the model in order to confirm the accuracy. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 22.  Pre-processing refers to the changes applied to our data before feeding it to the ML algorithm. Data pre-processing is a technique that is used to convert the created or collected (raw) data into a clean data set. In other words, whenever the data is gathered from different sources it is collected in raw format which is not feasible for the analysis or processing by ML model. Following figure shows transformation processing performed on raw data before, during and after applying ML techniques: Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 23.  Data Pre-processing in Machine Learning can be broadly divided into 3 main parts – 1.Data Integration 2.Data Cleaning 3.Data Transformation There are various steps in each of these 3 broad categories. Depending on the data and machine learning algorithm involved, not all steps might be required though. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 24. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 25. 1.Data Integration and formatting :- During hackathon and competitions, we often deal with a single csv or excel file containing all training data. But in real world, source of data might not be this simple. In real life, we might have to extract data from various sources and have to integrate it. 2. Data Cleaning :- 2.1 Dealing with Missing data :- It is common to have some missing or null data in the real-world data set. Most of the machine learning algorithms will not work with such data. So, it becomes important to deal with missing or null data. Some of common measures taken are,  Get rid of the column if there are plenty of rows with null values.  Eliminate the row if there are plenty of columns with null values.  Change the missing value by mean or median or mode of that column depending on data distribution in that column. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 26.  By substituting the missing values by ‘NA’ or ‘Unknown’ or some other relevant term, in case of categorical feature column, we can consider missing data as a new category in itself.  In this method to come up with educated guesses of possible candidate, replace missing value by applying regression or classification techniques. 2.2 Remove Noise from Data :- Noise are slightly erroneous data observations which does not fulfil with trend or distribution of rest of data. Though each error can be small, but communally noisy data results in poor machine learning model. Noise in data can be minimized or smooth out by using below popular techniques –  Binning  Regression Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 27.  Binning Method:  First sort data and partition  Then one can smooth by bin mean, median and boundaries. For Example: • Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 * Partition into bins: - Bin 1: 4, 8, 9, 15 - Bin 2: 21, 21, 24, 25 - Bin 3: 26, 28, 29, 34 * Smoothing by bin means: - Bin 1: 9, 9, 9, 9 - Bin 2: 23, 23, 23, 23 - Bin 3: 29, 29, 29, 29 * Smoothing by bin boundaries: - Bin 1: 4, 4, 4, 15 - Bin 2: 21, 21, 25, 25 - Bin 3: 26, 26, 26, 34 Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 28.  Regression Method:  Here data can be smoothed by fitting the data to a function.  Linear regression involves finding the “best” line to fit two attributes, so that one attribute can be used to predict the other.  Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 29. 2.3 Remove Outliers from Data :-  Outliers are those observation that has extreme values, much beyond the normal range of values for that feature. For example, a very high salary of CEO of a company can be an outlier if we consider salary of other regular employees of the company.  Even few outliers in data set can contribute to poor accuracy of machine learning model. The common methods to detect outliers and remove them are –  Standard Deviation  Box Plot Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 30.  Standard Deviation In statistics, if a data distribution is approximately normal then about 68% of the data values lie within one standard deviation of the mean and about 95% are within two standard deviations, and about 99.7% lie within three standard deviations. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 31.  Therefore, if you have any data point that is more than 3 times the standard deviation, then those points are very likely to be anomalous or outliers.  Box Plots:  Box plots are a graphical depiction of numerical data through their quintiles. It is a very simple but effective way to visualize outliers. Think about the lower and upper sideburns as the boundaries of the data distribution. Any data points that show above or below the sideburns, can be considered outliers or anomalous.  The concept of the Interquartile Range (IQR) is used to build the box plot graphs. IQR is a concept in statistics that is used to measure the statistical dispersion and data variability by dividing the dataset into quartiles.  In simple words, any dataset or any set of observations is divided into four defined intervals based upon the values of the data and how they compare to the entire dataset. A quartile is what divides the data into three points and four intervals. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 32. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 33. 2.4 Dealing with Duplicate Data :- The approach to deal with duplicate data depends on the fact whether duplicate data represents the real-world scenario or is more of an inconsistency. If it is former than duplicate data should be conserved, else it should be removed. 2.5 Dealing with Inconsistent Data :- Some data might not be consistent with the business rule. This requires domain knowledge to identify such inconsistencies and deal with it. 3.Data Transformation :- 3.1 Feature Scaling :- Feature scaling is one of the most important prerequisites for most of the machine learning algorithm. Various features of data set can have their own range of values and far different from each other. For e.g. age of human mostly lies within range of 0-100 years but population of countries will be in ranges of millions. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 34. This huge differences of ranges between features in a data set can distort the training of machine learning model. So, we need to bring the ranges of all the features at common scale. The common approaches of feature scaling are –  Mean Normalization  Min-Max Normalization  Z-Score Normalization or Standardization, We will make brief overview with examples on these approaches in the last section of this unit. 3.2 Dealing with categorical data :-  Categorical data, also known as qualitative data are text or string-based data. Example of categorical data are gender of persons (Male or Female), names of places (India, America, England), colour of car (Red, White).  Most of the machine learning algorithms works on numerical data only and will not be able to process categorical data. So, we need to transform categorical data into numerical form without losing the sense of information. Below are the popular approaches to convert categorical data into numerical form –  Label Encoding  One Hot Encoding  Binary Encoding Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 35. 3.3 Dealing with Imbalanced Data Set :-  Imbalanced data set are type of data set in which most of the data belongs to only one class and very few data belongs to other class. This is common in case of medical diagnosis, anomaly detection where the data belonging to positive class is a very small percentage.  For e.g. only 5-10% of data might belong to a disease positive class which can be an expected distribution in medical diagnosis. But this skewed data distribution can trick the machine learning model in training phase to only identify the majority classes and it fails to learn the minority classes. For example, the model might fail to identify the medical condition even though it might be showing a very high accuracy by identifying negative scenarios.  We need to do something about imbalanced data set to avoid a bad machine learning model. Below are some approaches to deal with such situation –  Under Sampling Majority Class  Over Sampling Minority Class  SMOTE (Synthetic Minority Oversampling Technique) Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 36. 3.4 Feature Engineering :-  Feature engineering is an art of creating new feature from the given data by either applying some domain knowledge or some common sense or both.  A very common example of feature engineering is converting a Date feature into additional features like Day, Week, Month, Year thus adding more information into data set.  Feature engineering enriches data set with more information that can contribute to good machine learning model. 3.5 Training-Test Data Split :-  Just before training supervised machine learning model, this is usually the last step of data pre-processing in which for a given data set, after undergoing all cleaning and transformation, is divided into two parts – one for training of machine learning model and second for testing the trained model.  Though there is no rule of thumb, but usually training-test split is done randomly at 80%-20% ratio. While splitting data set, care has to be taken that there is no loss of information in training data set. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 37. Example: We have registered the speed of 13 cars: speed = [99,86,87,88,111,86,103,87,94,78,77,85,86] 1.Mean -The average value.  (99+86+87+88+111+86+103+87+94+78+77+85+86) / 13 = 89.77  The NumPy module has NumPy mean() method for this.  import numpy speed = [99,86,87,88,111,86,103,87,94,78,77,85,86] x = numpy.mean(speed) print(x) Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 38. 2.Median - The midpoint value.  The median value is the value in the middle, after you have sorted all the values:  (77, 78, 85, 86, 86, 86, 87, 87, 88, 94, 99, 103, 111)  It is important that the numbers are sorted before you can find the median  The NumPy module has a NumPy median() method to find the middle value.  import numpy speed = [99,86,87,88,111,86,103,87,94,78,77,85,86] x = numpy. median(speed) print(x)  If there are two numbers in the middle, divide the sum of those numbers by two.  Ex. (77, 78, 85, 86, 86, 86, 87, 87, 94, 98, 99, 103)  (86 + 87) / 2 = 86.5 Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 39. 3.Mode - The most common value.  The Mode value is the value that appears the most number of times.  (99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86) = 86  The SciPy module has a SciPy mode() method to find the number that appears the most.  from scipy import stats speed = [99,86,87,88,111,86,103,87,94,78,77,85,86] x = stats. mode(speed) print(x) Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 40.  Standard deviation is a number that describes how spread out the values are.  A low standard deviation means that most of the numbers are close to the mean (average) value.  A high standard deviation means that the values are spread out over a wider range.  Example: We have registered the speed of 7 cars:  speed = [86,87,88,86,87,85,86]  The standard deviation is:0.9  Meaning that most of the values are within the range of 0.9 from the mean value, which is 86.4.  Example: We have registered the speed of another 7 cars:  speed = [32,111,138,28,59,77,97]  The standard deviation is:37.85  Meaning that most of the values are within the range of 37.85 from the mean value, which is 77.4.  A higher standard deviation indicates that the values are spread out over a wider range. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 41.  The NumPy module has a NumPy std() method to find the standard deviation.  import numpy speed = [86,87,88,86,87,85,86] x = numpy.std(speed) print(x)  import numpy speed = [32,111,138,28,59,77,97] x = numpy.std(speed) print(x)  Standard Deviation is often represented by the symbol Sigma: σ Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 42.  Variance is another number that indicates how spread out the values are.  if you take the square root of the variance, you get the standard deviation!  if you multiply the standard deviation by itself, you get the variance!  To calculate the variance you have to do as follows: 1. Find the mean: (32+111+138+28+59+77+97) / 7 = 77.4 2. For each value: find the difference from the mean: 32 - 77.4 = -45.4 111 - 77.4 = 33.6 138 - 77.4 = 60.6 28 - 77.4 = -49.4 59 - 77.4 = -18.4 77 - 77.4 = - 0.4 97 - 77.4 = 19.6 Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 43. 3. For each difference: find the square value: (-45.4)2 = 2061.16 (33.6)2 = 1128.96 (60.6)2 = 3672.36 (-49.4)2 = 2440.36 (-18.4)2 = 338.56 (- 0.4)2 = 0.16 (19.6)2 = 384.16 4. The variance is the average number of these squared differences: (2061.16+1128.96+3672.36+2440.36+338.56+0.16+384.16) / 7 = 1432.6  The NumPy module has a NumPy var() method to find the variance. import numpy speed = [32,111,138,28,59,77,97] x = numpy.var(speed) print(x)  Variance is often represented by the symbol Sigma Square: σ2 Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 44.  Percentiles are used in statistics to give you a number that describes the value that a given percent of the values are lower than.  Example: we have an array of the ages of all the people that lives in a street.  ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]  What is the 75. percentile? The answer is 43, meaning that 75% of the people are 43 or younger.  The NumPy module has a method NumPy percentile() to find the percentiles:  import numpy ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31] x = numpy.percentile(ages, 75) print(x)  What is the age that 90% of the people are younger than?  import numpy ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31] x = numpy.percentile(ages, 90) print(x) Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 45.  Statistics plays a main role in the field of research. It helps us in the collection, analysis and presentation of data. Before starting with descriptive and inferential statistics let us get the basic idea of population and sample. Population:- Population is the group that is targeted to collect the data from. Our data is the information collected from the population. Population is always defined first, before starting the data collection process for any statistical study. Population is not necessarily be people rather it could be batch of batteries, measurements of rainfall in an area or a group of people. Sample:- It is the part of population which is selected randomly for the study. The sample should be selected such that it represents all the characteristics of the population. The process of selecting the subset from the population is called sampling and the subset selected is called the sample. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 46. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 47. 1.Descriptive Statistics :- It describes the important characteristics/ properties of the data using the measures the central tendency like mean/ median/mode and the measures of dispersion like range, standard deviation, variance etc. Data can be summarized and represented in an accurate way using charts, tables and graphs. For example: We have marks of 1000 students and we may be interested in the overall performance of those students and the distribution as well as the spread of marks. Descriptive statistics provides us the tools to define our data in a most understandable and appropriate way. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 48. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 49. 2.Inferential Statistics :- It is about using data from sample and then making inferences about the larger population from which the sample is drawn. The goal of the inferential statistics is to draw conclusions from a sample and generalize them to the population. It determines the probability of the characteristics of the sample using probability theory. The most common methodologies used are hypothesis tests, Analysis of variance etc. For example: Suppose we are interested in the exam marks of all the students in India. But it is not feasible to measure the exam marks of all the students in India. So now we will measure the marks of a smaller sample of students, for example 1000 students. This sample will now represent the large population of Indian students. We would consider this sample for our statistical study for studying the population from which it’s deduced. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 50. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 51. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 52.  Probability theory, a branch of mathematics concerned with the analysis of random phenomena. The outcome of a random event cannot be determined before it occurs, but it may be any one of several possible outcomes. The actual outcome is considered to be determined by chance.  EX. Tossing a Coin. When a coin is tossed, there are two possible outcomes: 1.heads (H) or 2.tails (T) We say that the probability of the coin landing H is ½ And the probability of the coin landing T is ½ EX. Throwing a Dice . When a single die is thrown, there are six possible outcomes: 1, 2, 3, 4, 5, 6. The probability of any one of them is 1/6. So, Probability of an event happening = Number of ways it can happen Total no of Outcomes Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 53.  Many Supervised and Unsupervised machine learning models such as K-NN and K-Means depend upon the distance between two data points to predict the output. Therefore, the metric we use to compute distances plays an important role in these models. 1.Euclidean Distance:  Euclidean distance is the straight line distance between 2 data points in a plane.  It is calculated using the Minkowski Distance formula by setting ‘p’ value to 2, thus, also known as the L2 norm distance metric.  This formula is similar to the Pythagorean theorem formula, Thus it is also known as the Pythagorean Theorem. The formula is:- Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 54. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 55. 2.Manhattan Distance:  We use Manhattan distance, also known as city block distance, or taxicab geometry if we need to calculate the distance between two data points in a grid-like path. Manhattan distance metric can be understood with the help of a simple example.  In the above picture, imagine each cell to be a building, and the grid lines to be roads. Now if I want to travel from Point A to Point B marked in the image and follow the red or the yellow path. We see that the path is not straight and there are turns. In this case, we use the Manhattan distance metric to calculate the distance walked.  We can get the equation for Manhattan distance by substituting p = 1 in the Minkowski distance formula. The formula is:- Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 56. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 57.  In the older days, people used to perform Machine Learning tasks by manually coding all the algorithms and mathematical and statistical formula. This made the process time consuming, tedious and inefficient. But in the modern days, it becomes very much easier and efficient compared to the olden days by various python libraries, frameworks, and modules. Today, Python is one of the most popular programming languages for this task and it has replaced many languages in the industry, one of the reasons is its vast collection of libraries. Python libraries that used in Machine Learning are:  Numpy  Scipy  Scikit-learn  Theano  TensorFlow  Keras  PyTorch  Pandas  Matplotlib Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 58.  The single most important reason for the popularity of Python in the field of AI and ML is the fact that Python provides 1000s of inbuilt libraries that have in-built functions and methods to easily carry out data analysis, processing, wrangling, modelling and so on. In the below section, we’ll discuss the libraries for the following tasks: 1.Statistical Analysis 2.Data Visualization 3.Data Modelling and Machine Learning 4.Deep Learning 5.Natural Language Processing (NLP) Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 59. 1.Statistical Analysis:  Python comes with tons of libraries for the sole purpose of statistical analysis. Top statistical packages that provide in-built functions to perform the most complex statistical computations are: NumPy :- NumPy or Numerical Python is one of the most commonly used Python libraries. The main feature of this library is its support for multi- dimensional arrays for mathematical and logical operations. Functions provided by NumPy can be used for indexing, sorting, reshaping and conveying images and sound waves as an array of real numbers in multi- dimension. SciPy :- Built on top of NumPy, the SciPy library is a collective of sub- packages that help in solving the most basic problems related to statistical analysis. SciPy library is used to process the array elements defined using the NumPy library, so it is often used to compute mathematical equations that cannot be done using NumPy. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 60. Pandas :- Pandas is another important statistical library mainly used in a wide range of fields including, statistics, finance, economics, data analysis and so on. The library relies on the NumPy array for the purpose of processing pandas data objects. NumPy, Pandas, and SciPy are heavily dependent on each other for performing scientific computations, data manipulation and so on. Pandas is one of the best libraries for processing huge chunks of data, whereas NumPy has excellent support for multi- dimensional arrays and Scipy, on the other hand, provides a set of sub- packages that perform a majority of the statistical analysis tasks. StatsModels :- Built on top of NumPy and SciPy, the StatsModels Python package is the best for creating statistical models, data handling and model evaluation. Along with using NumPy arrays and scientific models from SciPy library, it also integrates with Pandas for effective data handling. This library is famously known for statistical computations, statistical testing, and data exploration. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
• 61. 2. Data Visualization:
 A picture speaks more than a thousand words. Data visualization is all about expressing the key insights from data effectively through graphical representations. It includes the implementation of graphs, charts, mind maps, heat-maps, histograms, density plots, etc., to study the correlations between various data variables.
 The best Python data visualization packages that provide built-in functions to study the dependencies between various data features are:
Matplotlib :- Matplotlib is the most basic data visualization package in Python. It provides support for a wide variety of graphs such as histograms, bar charts, power spectra, error charts, and so on. It is a two-dimensional plotting library that produces clear and concise graphs, which are essential for Exploratory Data Analysis (EDA).
Seaborn :- The Matplotlib library forms the base of the Seaborn library. In comparison to Matplotlib, Seaborn can be used to create more appealing and descriptive statistical graphs. Along with extensive support for data visualization, Seaborn also comes with an inbuilt dataset-oriented API for studying the relationships between multiple variables.
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
• 62. Plotly :- Plotly is one of the most well-known graphical Python libraries. It provides interactive graphs for understanding the dependencies between target and predictor variables. It can be used to analyze and visualize statistical, financial, commercial and scientific data to produce clear and concise graphs, sub-plots, heatmaps, 3D charts and so on.
Bokeh :- One of the most interactive libraries in Python, Bokeh can be used to build descriptive graphical representations for web browsers. It can easily process enormous datasets and build versatile graphs that help in performing extensive EDA. Bokeh provides well-defined functionality to build interactive plots, dashboards, and data applications.
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
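Matplotlib examples appear later in these slides; as a small taste of Seaborn's higher-level interface, here is a minimal sketch (assuming Seaborn 0.11 or newer, where histplot is available; the sample data are arbitrary):

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

x = np.random.normal(5.0, 1.0, 1000)  # arbitrary sample data
sns.histplot(x, kde=True)             # histogram with a smoothed density curve overlaid
plt.show()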
• 63. 3. Machine Learning:
 Implementing ML algorithms from scratch involves writing thousands of lines of code, and this becomes even more cumbersome when you want to create models that solve complex problems through neural networks. Thankfully, we don't have to code these algorithms ourselves, because Python comes with several packages built just for the purpose of implementing machine learning techniques and algorithms.
 Top ML packages that provide in-built functions to implement the common ML algorithms:
Scikit-learn :- One of the most useful Python libraries, Scikit-learn is the best library for data modelling and model evaluation. It comes with tons of functions for the sole purpose of creating a model. It contains implementations of the common Supervised and Unsupervised Machine Learning algorithms, and it also comes with well-defined functions for Ensemble Learning and Boosting.
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
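Every scikit-learn estimator follows the same fit/predict pattern; here is a minimal sketch using the library's bundled Iris dataset (the choice of k-nearest neighbours is just an example, any classifier works the same way):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)                           # learn from the training data
print(accuracy_score(y_test, model.predict(X_test)))  # evaluate on unseen data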
• 64. XGBoost :- XGBoost, which stands for Extreme Gradient Boosting, is one of the best Python packages for performing Boosting Machine Learning. Libraries such as LightGBM and CatBoost are equally well equipped with well-defined functions and methods. This library is built mainly for the purpose of implementing gradient boosting machines, which are used to improve the performance and accuracy of Machine Learning models.
ELI5 :- ELI5 is another Python library, mainly focused on inspecting and debugging Machine Learning models and explaining their predictions. This library is relatively new and is usually used alongside XGBoost, LightGBM, CatBoost and so on to understand and improve the behaviour of Machine Learning models.
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
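XGBoost exposes a scikit-learn-compatible estimator, so the workflow mirrors the sketch above; a minimal example (assuming the xgboost package is installed, and again using Iris purely for illustration):

from xgboost import XGBClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier(n_estimators=100)  # an ensemble of 100 boosted trees
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))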
• 65. 4. Deep Learning:
 The biggest advancements in ML and AI have come through deep learning. With the introduction of deep learning, it is now possible to build complex models and process enormous data sets. Thankfully, Python provides the best deep learning packages that help in building effective neural networks.
 Top deep learning packages that provide built-in functions to implement convolutional neural networks are:
TensorFlow :- One of the best Python libraries for Deep Learning, TensorFlow is an open-source library for dataflow programming across a range of tasks. It is a symbolic math library that is used for building strong and precise neural networks. It provides an intuitive multi-platform programming interface which is highly scalable over a vast domain of fields.
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
• 66. PyTorch :- PyTorch is an open-source, Python-based scientific computing package that is used to implement Deep Learning techniques and Neural Networks on large datasets. This library is actively used by Facebook to develop neural networks that help in tasks such as face recognition and auto-tagging.
Keras :- Keras is considered one of the best Deep Learning libraries in Python. It provides full support for building, analyzing, evaluating and improving Neural Networks. Keras runs on top of the Theano and TensorFlow Python libraries, which provide the machinery needed to build complex and large-scale Deep Learning models.
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
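As a flavour of the Keras API, a minimal sketch of a tiny binary classifier (assuming TensorFlow 2.x, where Keras ships as tensorflow.keras; the data here are random and purely illustrative):

import numpy as np
from tensorflow import keras

X = np.random.rand(100, 4)             # 100 made-up samples with 4 features
y = np.random.randint(0, 2, size=100)  # made-up binary labels

model = keras.Sequential([
    keras.layers.Dense(8, activation="relu", input_shape=(4,)),  # hidden layer
    keras.layers.Dense(1, activation="sigmoid"),                 # binary output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, verbose=0)   # train briefly on the toy data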
• 67. 5. Natural Language Processing:
 Have you ever wondered how Google so aptly predicts what you're searching for? The technology behind Alexa, Siri, and other chatbots is Natural Language Processing. NLP has played a huge role in designing AI-based systems that help in describing the interaction between human language and computers.
 Top Natural Language Processing packages that provide built-in functions to implement high-level AI-based systems are:
NLTK (Natural Language Toolkit) :- NLTK is considered to be the best Python package for analyzing human language and behaviour. Preferred by many Data Scientists, the NLTK library provides easy-to-use interfaces to over 50 corpora and lexical resources that help in describing human interactions and in building AI-based systems such as recommendation engines.
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
• 68. spaCy :- spaCy is a free, open-source Python library for implementing advanced Natural Language Processing (NLP) techniques. When you're working with a lot of text, it is important to understand the morphological meaning of the text and how it can be classified in order to understand human language. These tasks can be easily achieved with spaCy.
Gensim :- Gensim is another open-source Python package, built to extract semantic topics from large documents and texts in order to process, analyze and predict human behaviour through statistical models and linguistic computations. It is capable of processing huge volumes of data, even when that data is raw and unstructured.
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
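A minimal NLTK sketch of word tokenization (assuming the punkt tokenizer model has been downloaded once via nltk.download, which requires network access):

import nltk
nltk.download("punkt")  # one-time download of the sentence/word tokenizer model

from nltk.tokenize import word_tokenize

text = "Machine learning enables computers to learn from data."
print(word_tokenize(text))
# ['Machine', 'learning', 'enables', 'computers', 'to', 'learn', 'from', 'data', '.']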
• 69.
 How Can We Get Big Data Sets?
 To create big data sets for testing, we use the Python module NumPy, which comes with a number of methods to create random data sets of any size.
Example: Create an array containing 250 random floats between 0 and 5:
import numpy
x = numpy.random.uniform(0.0, 5.0, 250)
print(x)
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
• 70.
 Histogram
 To visualize the data set we can draw a histogram with the data we collected.
 We will use the Python module Matplotlib to draw a histogram:
Example: Draw a histogram:
import numpy
import matplotlib.pyplot as plt
x = numpy.random.uniform(0.0, 5.0, 250)
plt.hist(x, 5)
plt.show()
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 71. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
• 72.
 We use the array from the example above to draw a histogram with 5 bars.
 The first bar represents how many values in the array are between 0 and 1, the second bar how many are between 1 and 2, and so on.
 Which gives us this result:
52 values are between 0 and 1
48 values are between 1 and 2
49 values are between 2 and 3
51 values are between 3 and 4
50 values are between 4 and 5
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
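These bar counts can also be computed directly, without drawing anything, using NumPy's histogram function (the exact counts will vary from run to run because the data are random):

import numpy
x = numpy.random.uniform(0.0, 5.0, 250)
counts, edges = numpy.histogram(x, bins=5, range=(0.0, 5.0))
print(counts)  # e.g. [52 48 49 51 50] -- the height of each of the 5 bars
print(edges)   # [0. 1. 2. 3. 4. 5.]  -- the bin boundaries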
• 73.
 Big Data Distributions
 An array containing 250 values is not considered very big, but now you know how to create a random set of values; by changing the parameters, you can make the data set as big as you want.
Example: Create an array with 100000 random numbers, and display them using a histogram with 100 bars:
import numpy
import matplotlib.pyplot as plt
x = numpy.random.uniform(0.0, 5.0, 100000)
plt.hist(x, 100)
plt.show()
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
• 74.
 The example above drew values from a uniform distribution. Let us now create an array where the values are concentrated around a given value. In probability theory this kind of data distribution is known as the normal data distribution, or the Gaussian data distribution, after the mathematician Carl Friedrich Gauss, who came up with the formula for this distribution.
Example: A typical normal data distribution:
import numpy
import matplotlib.pyplot as plt
x = numpy.random.normal(5.0, 1.0, 100000)
plt.hist(x, 100)
plt.show()
 A normal distribution graph is also known as the bell curve, because of its characteristic bell shape.
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 75. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
• 76.
 We use the array from the numpy.random.normal() method, with 100000 values, to draw a histogram with 100 bars.
 We specify that the mean value is 5.0 and the standard deviation is 1.0, meaning that the values should be concentrated around 5.0 and should rarely be much further than 1.0 (one standard deviation) from the mean.
 And as you can see from the histogram, most values are between 4.0 and 6.0, with a peak at approximately 5.0.
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
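We can check these properties numerically as well; with this many samples, the sample statistics land very close to the parameters we passed in:

import numpy
x = numpy.random.normal(5.0, 1.0, 100000)
print(x.mean())  # approximately 5.0
print(x.std())   # approximately 1.0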
  • 77.  A scatter plot is a diagram where each value in the data set is represented by a dot. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
• 78.
 The Matplotlib module has a method for drawing scatter plots. It needs two arrays of the same length: one for the values of the x-axis, and one for the values of the y-axis:
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
 The x array represents the age of each car.
 The y array represents the speed of each car.
Example: Use the scatter() method to draw a scatter plot diagram:
import matplotlib.pyplot as plt
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
plt.scatter(x, y)
plt.show()
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 79. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
• 80.
 The x-axis represents ages, and the y-axis represents speeds.
 What we can read from the diagram is that the two fastest cars were both 2 years old, and the slowest car was 12 years old.
 Random Data Distributions
 In Machine Learning the data sets can contain thousands, or even millions, of values.
 Let us create two arrays that are both filled with 1000 random numbers from a normal data distribution.
 The first array will have the mean set to 5.0 with a standard deviation of 1.0.
 The second array will have the mean set to 10.0 with a standard deviation of 2.0:
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
• 81.
Example: A scatter plot with 1000 dots:
import numpy
import matplotlib.pyplot as plt
x = numpy.random.normal(5.0, 1.0, 1000)
y = numpy.random.normal(10.0, 2.0, 1000)
plt.scatter(x, y)
plt.show()
 We can see that the dots are concentrated around the value 5 on the x-axis, and 10 on the y-axis.
 We can also see that the spread is wider on the y-axis than on the x-axis.
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 82. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
• 83.
 Execution of data pre-processing methods using Python commonly involves the following steps:
 Importing the libraries
 Importing the Dataset
 Handling of Missing Data
 Handling of Categorical Data
 Splitting the dataset into training and testing datasets
 Feature Scaling
For this data pre-processing script, we are going to use Anaconda Navigator and, specifically, the Spyder IDE to write the following code.
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
• 84. Importing the libraries :-
import numpy as np  # used for handling numbers
import pandas as pd  # used for handling the dataset
from sklearn.impute import SimpleImputer  # used for handling missing data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder  # used for encoding categorical data
from sklearn.model_selection import train_test_split  # used for splitting training and testing data
from sklearn.preprocessing import StandardScaler  # used for feature scaling
from sklearn.compose import ColumnTransformer  # used to apply transformers to columns of an array
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
• 85. Importing the Dataset :-
 First of all, let us have a look at the dataset we are going to use for this particular example. You can download this dataset from: https://github.com/tarunlnmiit/machine_learning/blob/master/DataPreprocessing.csv
 It is as shown below:
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
• 86.
 By pressing the Raw button at that link, copy this dataset and store it in a Data.csv file in the folder where your program is stored.
 In order to import this dataset into our script, we use pandas as follows.
dataset = pd.read_csv('Data.csv')  # import the dataset into a variable
# Splitting the attributes into independent and dependent attributes
X = dataset.iloc[:, :-1].values  # independent attributes
Y = dataset.iloc[:, -1].values  # dependent attribute / class
 When you run this code section along with the libraries, you should not see any errors. Once it has executed successfully, you can open the Variable Explorer in the Spyder UI and you will see the following three variables.
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 87. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 88.  When you double click on each of these variables, you should see something similar. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
• 89. Handling of Missing Data :-
 The first idea is simply to remove the observations where there is some missing data. But that can be quite dangerous: imagine this data set contains key information; it would be risky to remove such observations. So we need a better way to handle this problem, and the most common one is to replace a missing value with the mean of its column, as discussed in an earlier section.
 If you noticed, our dataset has two values missing: one in the Age column at the 6th data index and one in the Income column in the 4th data row. Missing values should be handled during the data analysis, so we do that as follows.
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
• 90.
# handle the missing data: replace missing values (np.nan) with the mean of the other values in the column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:])
X[:, 1:] = imputer.transform(X[:, 1:])
 After execution of this code, the independent variable X will transform into the following.
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
• 91. Here you can see that the missing values have been replaced by the average values of the respective columns. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
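As an aside, the same mean imputation can be done directly in pandas before extracting the arrays; a minimal sketch, assuming the numeric columns in Data.csv are named Age and Income (hypothetical names for illustration):

import pandas as pd

dataset = pd.read_csv('Data.csv')
# fill missing entries in the numeric columns with each column's mean
dataset[['Age', 'Income']] = dataset[['Age', 'Income']].fillna(dataset[['Age', 'Income']].mean())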
• 92. Handling of Categorical Data :-
 In this dataset we have two categorical variables, namely the Region variable and the Online Shopper variable. They are categorical variables simply because they contain categories: Region contains three categories (India, USA and Brazil), and the Online Shopper variable contains two (Yes and No).
 Since machine learning models are based on mathematical equations, you can naturally understand that keeping text in the categorical variables would cause problems, because we only want numbers in the equations. That is why we need to encode the categorical variables, i.e. encode the text into numbers. To do this we use the following code snippet.
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
• 93.
# encode categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
rg = ColumnTransformer([("Region", OneHotEncoder(), [0])], remainder='passthrough')
X = rg.fit_transform(X)
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
After execution of this code, the independent variable X and the dependent variable Y will transform into the following.
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 94. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
• 95.
 Here you can see that the Region variable is now made up of three binary (one-hot) columns, one per country. Note that scikit-learn's encoders order categories alphabetically, so the left-most bit represents Brazil, the second bit India, and the last bit USA; if a bit is 1, the row contains data for that country. For the Online Shopper variable, 1 represents Yes and 0 represents No.
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
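An equivalent and often simpler route to the same one-hot columns is pandas' get_dummies; a small sketch, assuming the categorical column in Data.csv is named Region:

import pandas as pd

dataset = pd.read_csv('Data.csv')
encoded = pd.get_dummies(dataset, columns=['Region'])  # one column per category, ordered alphabetically
print(encoded.head())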
• 96. Splitting the dataset into training and testing datasets :-
 Any machine learning algorithm needs to be tested for accuracy. In order to do that, we divide our data set into two parts: a training set and a testing set. As the names suggest, we use the training set to make the algorithm learn the behaviours present in the data, and we check the correctness of the algorithm by testing it on the testing set. In Python, we do that as follows:
# splitting the dataset into training set and test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
 Here, we take the training set to be 80% of the original data set and the testing set to be 20%. This is the usual ratio in which they are split, though you will sometimes come across a 70–30% or 75–25% split. You don't want to split it 50–50%, however, as that leaves too little data for the model to learn from. For now, we are going to split it in an 80–20% ratio.
 After the split, our training set and testing set look like this.
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 97. Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
• 98. Feature Scaling (Variable Transformation) :-
 As you can see, we have two columns, age and income, that contain numerical values. Notice that the variables are not on the same scale: the ages range from 32 to 55, while the salaries range from about 57.6K to 99.6K. Because the age and salary variables do not share the same scale, this can cause issues in your Machine Learning models. Why? Because many ML models are based on what is called the Euclidean distance, and a variable on a much larger scale will dominate that distance.
 We use feature scaling to convert the different scales to a standard scale, to make things easier for Machine Learning algorithms. We do this in Python as follows:
# feature scaling
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
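For intuition, StandardScaler standardizes each column to z = (x − mean) / std; a minimal NumPy check of the same transform (the age values are made up for the example):

import numpy as np

ages = np.array([32.0, 55.0, 41.0, 38.0])  # hypothetical values on the age scale
z = (ages - ages.mean()) / ages.std()      # the per-column transform StandardScaler applies
print(z)                                   # centred at 0 with unit variance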
• 99.
 After the execution of this code, our training independent variable X and our testing independent variable X look like this.
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
  • 100. Thanks !!! Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune