2. • Interpolation is a technique for estimating
unknown data points that lie between known data
points. In Python, it is mostly used to impute
missing values in a DataFrame or Series while
preprocessing data, for example when using Python
in Power BI or preparing input for machine learning
algorithms. Interpolation is also used in image
processing: when an image is enlarged, the value of
each new pixel is estimated from its neighboring
pixels.
3. When to Use Interpolation?
• We can use interpolation to estimate a missing (null) value
from its neighbors. When imputing missing values with the
overall average does not fit the data well, interpolation is
a common alternative.
• Interpolation is mostly used with time-series data, where
a missing value is usually best estimated from the values
closest to it in time. For example, to fill in a missing
temperature reading for today, we would prefer the mean of
the last 2 days over the mean of the whole month.
Interpolation can also be used when calculating moving
averages.
4. EXAMPLE
• You can use interpolation when the data has an order or a
sequence and you want to estimate a missing value in
that sequence. For example, suppose train tickets come in
several classes: first class, second class, and so on.
You would naturally expect a ticket in a higher class to
cost more than one in a lower class.
• In that case, if the ticket price of an intermediate class
is missing, you can use interpolation to estimate the
missing value.
5. Using Interpolation to Fill Missing
Values in Series Data
  import numpy as np
  import pandas as pd

  fare = {'first_class': 100,
          'second_class': np.nan,
          'third_class': 60,
          'open_class': 20}
  ser = pd.Series(fare)
  ser
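As a minimal sketch of how the missing fare could be filled, the series above can be passed through pandas' default linear interpolation. The name `filled` is an assumption made for illustration. Note that the default method ignores the index labels and treats the values as equally spaced.

```python
import numpy as np
import pandas as pd

fare = {'first_class': 100,
        'second_class': np.nan,
        'third_class': 60,
        'open_class': 20}
ser = pd.Series(fare)

# The default method is "linear": the NaN is replaced by the midpoint
# of its two neighbors, (100 + 60) / 2.
filled = ser.interpolate()
print(filled)
```

The second-class fare comes out halfway between the first-class and third-class fares, which matches the expectation that an intermediate class is priced between its neighbors.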
7. Linear Interpolation
• Linear interpolation estimates a missing value
by connecting the known points on either side
of it with a straight line and reading the
value off that line.
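A small sketch of the straight-line idea, using a made-up series named `a` (the values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical series with a gap of two missing values
a = pd.Series([0.0, 2.0, np.nan, np.nan, 8.0])

# Each gap is filled along the straight line running from 2.0 to 8.0
linear = a.interpolate(method="linear")
print(linear.tolist())
```

The two missing values are spaced evenly between their known neighbors, so the filled series rises by the same amount at each step across the gap.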
8. Polynomial Interpolation
• In polynomial interpolation, you need to
specify an order, which is the degree of the
polynomial fitted through the available data
points. Missing values are then read off the
resulting curve, so the fill follows a curved
shape (a parabola for order 2) rather than a
straight line.
• a.interpolate(method="polynomial", order=2)
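The idea behind an order-2 fill can be sketched with NumPy alone: fit a parabola through three known points and evaluate it at the missing position. This illustrates the concept and is not a description of what pandas does internally. pandas hands method="polynomial" to SciPy, which must be installed, and its result may differ. The sample points below are made up.

```python
import numpy as np

# Hypothetical known points lying on y = x**2; the value at x = 2 is missing
x_known = np.array([0.0, 1.0, 3.0])
y_known = np.array([0.0, 1.0, 9.0])

# Fit a degree-2 polynomial through the three known points
coeffs = np.polyfit(x_known, y_known, deg=2)

# Evaluate the fitted parabola at the missing position
y_missing = np.polyval(coeffs, 2.0)
print(round(float(y_missing), 6))
```

A straight line between the neighbors (1, 1) and (3, 9) would give 5 at x = 2, while the parabola gives a value close to 4, which shows how the choice of method changes the estimate.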
9. Interpolation padding
• Interpolation with padding simply means filling each
missing value with the last valid value that appears
above it in the dataset. If the missing value is in the
first row, there is nothing above it to copy, so it stays
missing. You can also specify a limit, which is the
maximum number of consecutive NaN values to fill.
• So, if you are working on a real-world project and want
to fill every missing value with the previous value, either
leave the limit unset or set it to the number of rows in the
dataset.
• a.interpolate(method="pad", limit=2)
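The same padding behavior can be sketched with `ffill()`, which recent pandas versions recommend in place of `interpolate(method="pad")` (the latter is deprecated there). The series below is a made-up example.

```python
import numpy as np
import pandas as pd

# Hypothetical series: a leading NaN and a run of three NaNs
a = pd.Series([np.nan, 1.0, np.nan, np.nan, np.nan, 5.0])

# Forward fill, copying each valid value down at most 2 rows
padded = a.ffill(limit=2)
print(padded.tolist())
```

The leading NaN has no value above it and stays missing, and the third NaN in the run is left untouched because the limit of 2 has been reached.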
10. Application
• Uses of interpolation include helping users
estimate values at points where they did not
collect data. Scientists, engineers,
photographers and mathematicians likewise use
it to fit data when analysing trends.
11. Discretization
• Data discretization is the process of converting
continuous data into discrete buckets by grouping it.
Discretized data is also easier to maintain. Training a
model on discrete data can be faster and more effective
than training on the same data in continuous form.
Although continuous-valued data contains more
information, huge amounts of it can slow the model
down, and discretization can help strike a balance
between the two. Two well-known methods of data
discretization are binning and using a histogram.
Although discretization is useful, picking an effective
range for each bucket is a challenge.
12. • Here we make use of a function
called pandas.cut(). This function is useful for
sorting continuous values into buckets
(segments).
13. • Perform bucketing using the pd.cut() function on
the marks column and display the first 10
rows. The cut() function takes parameters
such as x, bins, and labels, and these three are
the only ones used here. Add the following code to
implement this:
• df['bucket'] = pd.cut(df['marks'], 5,
  labels=['Poor', 'Below_average', 'Average',
          'Above_Average', 'Excellent'])
• df.head(10)
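A self-contained sketch of the bucketing step follows. The slide does not show how `df` is built, so the marks below are made-up values chosen for illustration.

```python
import pandas as pd

# Hypothetical marks out of 100
df = pd.DataFrame({'marks': [0, 10, 30, 50, 70, 90, 100]})

# bins=5 splits the observed range (0 to 100 here) into
# 5 equal-width intervals and attaches one label to each
df['bucket'] = pd.cut(df['marks'], 5,
                      labels=['Poor', 'Below_average', 'Average',
                              'Above_Average', 'Excellent'])
print(df.head(10))
```

Because an integer `bins` value is derived from the minimum and maximum of the data, the bucket boundaries shift if the data changes. Passing explicit bin edges instead keeps the boundaries fixed.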
14. Binning
• Data binning, which is also known as
bucketing or discretization, is a technique
used in data processing and statistics.
• Binning can be used, for example, when there are
more possible data points than observed data
points.
15. • Bins do not necessarily have to be numerical;
they can be categorical values of any kind, like
"dogs", "cats", and so on.
• Binning is also used in image processing, where
it reduces the amount of data by combining
neighboring pixels into a single pixel: k x k
binning reduces each area of k x k pixels to a
single pixel.
• Pandas provides easy ways to create bins and to
bin data.
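The k x k image binning described above can be sketched with NumPy. The image values are made up, and summing the pixels in each block is an assumption; averaging them is an equally common choice.

```python
import numpy as np

# Hypothetical 4 x 4 grayscale image with pixel values 0..15
image = np.arange(16).reshape(4, 4)
k = 2  # bin size; both image sides must be divisible by k

# Split the image into k x k blocks and sum each block into one pixel
h, w = image.shape
binned = image.reshape(h // k, k, w // k, k).sum(axis=(1, 3))
print(binned)
```

With k = 2 the 4 x 4 image becomes a 2 x 2 image, so the amount of data drops by a factor of k x k.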
20. Outlier Detection
• Outliers are an important part of a dataset. They
can hold useful information about your data.
• In simple terms, an outlier is a data point that is
extremely high or extremely low relative to the
other values in the dataset or graph you're
working with.
• Outliers stand out greatly from the overall pattern
of values.
21. How to Identify an Outlier in a
Dataset
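• A common rule uses the interquartile range,
IQR = Q3 - Q1, where Q1 and Q3 are the first and
third quartiles. A value is flagged as an outlier if:
• outlier < Q1 - 1.5(IQR)
• outlier > Q3 + 1.5(IQR)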
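The two conditions above can be sketched with NumPy as follows. The data values are made up, and `np.percentile` with its default settings is one of several ways to compute quartiles, so other tools may give slightly different fences.

```python
import numpy as np

# Hypothetical data with one extreme value
data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])

# First and third quartiles, and the interquartile range
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Fences from the rule: Q1 - 1.5(IQR) and Q3 + 1.5(IQR)
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only the values that fall outside the two fences
outliers = data[(data < lower) | (data > upper)]
print(outliers)
```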
22. Random Sampling With and Without
Replacement
• sample() is a built-in function of the random module in
Python that returns a list of a given length, with items
chosen from a sequence such as a list, tuple, or string.
It is used for random sampling without replacement.
• Syntax: random.sample(sequence, k)
• Parameters:
sequence: can be a list, tuple, or string (sets are no
longer accepted as of Python 3.11; convert them to a
list first).
k: an integer that specifies the length of the sample.
• Returns: a new list of k elements chosen from the
sequence.
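A short usage sketch of `random.sample()`. The list contents and the seed are arbitrary choices made for illustration.

```python
import random

random.seed(0)  # fixed seed so that repeated runs give the same sample

sacks = [12, 13, 14, 15, 16, 17, 18]

# Draw 3 items without replacement: no element can be picked twice
picked = random.sample(sacks, 3)
print(picked)
```

Asking for more items than the sequence contains raises a ValueError, since there is nothing left to draw once every element has been used.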
23. • Randomly selecting records from a large data set may be
helpful if the data set is so large that it prevents or slows
processing, or if you are conducting a survey and need to
select a random sample from some master database. When
you select records randomly from a larger data set (or some
master database), you can do the sampling in a few
different ways, including:
• sampling without replacement, in which a subset of the
observations is selected randomly, and once an
observation is selected it cannot be selected again.
• sampling with replacement, in which a subset of
observations is selected randomly, and an observation
may be selected more than once.
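Both ways of sampling records can be sketched with the pandas `DataFrame.sample()` method. The `records` table and its `record_id` column are hypothetical stand-ins for a master database.

```python
import pandas as pd

# Hypothetical master data set of 100 records
records = pd.DataFrame({'record_id': range(100)})

# Without replacement: each record appears at most once
without = records.sample(n=10, replace=False, random_state=1)

# With replacement: the same record may be drawn more than once
with_repl = records.sample(n=10, replace=True, random_state=1)

print(without['record_id'].tolist())
print(with_repl['record_id'].tolist())
```

The `random_state` argument only fixes the random draw so the example is repeatable; it can be dropped when a fresh sample is wanted each time.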
24. • Sampling with replacement:
• Consider a population of potato sacks, each of which has
either 12, 13, 14, 15, 16, 17, or 18 potatoes, and all the
values are equally likely. Suppose that, in this population,
there is exactly one sack with each number. So the whole
population has seven sacks. If I sample two with
replacement, then I first pick one (say 14). I had a 1/7
probability of choosing that one. Then I replace it. Then I
pick another. Every one of them still has 1/7 probability of
being chosen. And there are exactly 49 different
possibilities here (assuming we distinguish between the
first and second.) They are: (12,12), (12,13), (12, 14),
(12,15), (12,16), (12,17), (12,18), (13,12), (13,13), (13,14),
etc.
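The count of 49 can be checked with a small sketch that enumerates every ordered pair in which the same sack may appear twice. The variable names are chosen here for illustration.

```python
from itertools import product

sacks = [12, 13, 14, 15, 16, 17, 18]

# Ordered pairs drawn with replacement: the first sack goes back
# before the second draw, so (12, 12) is a valid outcome
pairs_with = list(product(sacks, repeat=2))

print(len(pairs_with))
print(pairs_with[:3])
```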
25. • Sampling without replacement:
• Consider the same population of potato sacks, each of
which has either 12, 13, 14, 15, 16, 17, or 18 potatoes, and
all the values are equally likely. Suppose that, in this
population, there is exactly one sack with each number. So
the whole population has seven sacks. If I sample two
without replacement, then I first pick one (say 14). I had a
1/7 probability of choosing that one. Then I pick another. At
this point, there are only six possibilities: 12, 13, 15, 16, 17,
and 18. So there are only 42 different possibilities here
(again assuming that we distinguish between the first and
the second.) They are: (12,13), (12,14), (12,15), (12,16),
(12,17), (12,18), (13,12), (13,14), (13,15), etc.
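The count of 42 can be checked the same way by enumerating every ordered pair of two different sacks. The variable names are chosen here for illustration.

```python
from itertools import permutations

sacks = [12, 13, 14, 15, 16, 17, 18]

# Ordered pairs drawn without replacement: the first sack is not
# put back, so pairs such as (12, 12) cannot occur
pairs_without = list(permutations(sacks, 2))

print(len(pairs_without))
print(pairs_without[:3])
```

The total is 7 x 6 = 42: seven choices for the first draw and six remaining choices for the second.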