An outlier is an element of a dataset that distinctly stands out from the rest of the data. Outliers can represent either a) items that are so far outside the norm that they need not be considered or b) the illustration of a unique and singular variable that is worth exploring, either to capitalize on a niche or find an area where an organization can offer a unique focus.
What is Outlier Analysis and How Can It Improve Analysis?
1. Master the Art of Analytics
A Simplistic Explainer Series For Citizen Data Scientists
Journey Towards Augmented Analytics
4. What Are Outliers in Data
An outlier is an element of a data set that distinctly stands out from the rest of the data.
In other words, outliers are the observations lying outside the overall pattern of a distribution, as shown in the figure.
[Figure: Outliers]
5. Example Outliers
An outlier in the list 212, 361, 201, 203, 227, 221, 188, 192, 198 is 361.
An outlier in the list 14, 9, 17, 19, 42, 22, 35, 99, 32, 2 is 99.
In these examples, 361 and 99 lie far from the remaining values, which makes them outliers.
7. How to Detect Outliers
The easiest way to detect outliers is by creating a graph. Plots such as box plots, scatterplots and histograms can easily help us detect outliers.
Alternatively, we can use the mean and standard deviation to list the outliers.
The Interquartile Range and quartiles can also be used to detect outliers.
8. Detecting Outliers
With Mean + Standard Deviation or Inter Quartile Range
• We can use the following formula to identify outliers; the analyst may change this criterion as needed: Xi is an outlier if |Xi - mean| > 3 * σ
Where Xi = Observation
σ = Standard Deviation
• This classifies as outliers those data points whose distance from the mean exceeds 3 standard deviations
• Alternatively, we can use Q1-1.5*IQR and Q3+1.5*IQR to detect lower and upper outliers, where IQR is the Inter Quartile Range, i.e. Quartile 3 - Quartile 1 (75th Percentile - 25th Percentile)
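The mean and standard deviation rule from this slide can be sketched in Python roughly as follows. This is a minimal illustration, not a definitive implementation: the function name `three_sigma_outliers`, the parameter `k` and the `readings` data are hypothetical, and the sample standard deviation (`statistics.stdev`) is an assumed choice, since the slide does not say which form of standard deviation to use.

```python
import statistics

def three_sigma_outliers(values, k=3.0):
    """Return the values whose distance from the mean exceeds k standard deviations."""
    mean = statistics.mean(values)
    sigma = statistics.stdev(values)  # assumption: sample standard deviation
    return [x for x in values if abs(x - mean) > k * sigma]

# Hypothetical readings: nineteen values near 11 and one extreme value.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
            12, 10, 11, 13, 12, 11, 10, 12, 11, 95]
print(three_sigma_outliers(readings))

# The nine-value list from slide 5.
slide_list = [212, 361, 201, 203, 227, 221, 188, 192, 198]
print(three_sigma_outliers(slide_list))
```

On the hypothetical readings the rule flags 95. On the nine-value list from slide 5 it flags nothing: 361 is only about 2.6 standard deviations from the mean, because in a small sample a single extreme value inflates the standard deviation. This is one reason an analyst may prefer to adjust the threshold or use the IQR rule.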
9. Detecting Outliers
With Histogram
A univariate outlier is a data point that consists of an extreme value on one variable.
If you look at the histogram, you can see one value that lies far to the left of all the other data. This data point is an outlier.
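As a rough illustration of what the histogram reveals, the sketch below counts values per equal-width bin and prints a text histogram. The function name `bin_counts`, the choice of 7 bins and the `measurements` data are all hypothetical; any plotting tool would show the same gap.

```python
def bin_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / num_bins
    if width == 0:
        raise ValueError("all values are identical; nothing to bin")
    counts = [0] * num_bins
    for value in values:
        # The maximum value would land one past the end, so clamp it into the last bin.
        index = min(int((value - lowest) / width), num_bins - 1)
        counts[index] += 1
    return counts

# Hypothetical measurements with one value far to the left of the rest.
measurements = [5, 48, 50, 51, 52, 53, 55, 55, 56, 58, 60, 61]
counts = bin_counts(measurements, num_bins=7)
for i, count in enumerate(counts):
    print(f"bin {i}: {'#' * count}")
```

The single count in the first bin, separated from the rest of the data by several empty bins, is the kind of isolated bar that marks a univariate outlier in a histogram.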
10. Detecting Outliers
With Box plot
A data point is an outlier if it is more than 1.5*IQR above the third quartile or more than 1.5*IQR below the first quartile.
In other words, low outliers are below Q1-1.5*IQR and high outliers are above Q3+1.5*IQR, where IQR is the Inter Quartile Range.
If you look at the box plot below, you can spot the outliers easily; these are the points lying above Q3+1.5*IQR or below Q1-1.5*IQR.
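The box plot rule can be checked numerically with a short Python sketch, applied here to the two example lists from slide 5. The function names `iqr_fences` and `iqr_outliers` are hypothetical. Quartiles are computed with the default method of `statistics.quantiles`; other tools use slightly different quartile conventions, so the fence values may differ a little from tool to tool.

```python
import statistics

def iqr_fences(values):
    """Return (Q1 - 1.5*IQR, Q3 + 1.5*IQR) for the given values."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def iqr_outliers(values):
    """Return the values lying below the lower fence or above the upper fence."""
    low, high = iqr_fences(values)
    return [x for x in values if x < low or x > high]

print(iqr_outliers([212, 361, 201, 203, 227, 221, 188, 192, 198]))  # [361]
print(iqr_outliers([14, 9, 17, 19, 42, 22, 35, 99, 32, 2]))         # [99]
```

Under these assumptions the rule flags 361 in the first list and 99 in the second, matching the values identified as outliers in the examples.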
11. Detecting Bivariate Outliers
With Scatterplot
When we are working with two quantitative variables, we can look at a scatterplot to identify bivariate (two-variable) outliers.
A bivariate outlier is an observation that does not fit the pattern of the other observations. In the plot below, an arrow points out the outlier.
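The slides identify bivariate outliers by eye. As one possible numerical counterpart, the sketch below fits a least-squares line and applies the 1.5*IQR rule to the residuals. This is an assumed technique chosen for illustration, not a method stated in the slides, and it only suits data with a roughly linear pattern. The function name `residual_outliers` and the data are hypothetical.

```python
import statistics

def residual_outliers(xs, ys):
    """Flag points whose residual from a least-squares line lies outside the IQR fences."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual: vertical distance of each point from the fitted line.
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    q1, _median, q3 = statistics.quantiles(residuals, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [(x, y) for x, y, r in zip(xs, ys, residuals) if r < low or r > high]

# Hypothetical data following y = 2x, except for the point (5, 30).
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2, 4, 6, 8, 30, 12, 14, 16, 18, 20]
print(residual_outliers(xs, ys))
```

Here the point (5, 30) is flagged because it sits far above the line that the other nine points follow, even though neither its x value nor its y value alone is extreme enough to be flagged by the 1.5*IQR rule.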
13. How to Handle Outliers
• We can either remove the outliers altogether from the selected dataset or replace them with a recommended statistical measure, namely percentiles
• It is general practice to replace lower and upper outliers with the 5th and 95th percentile values respectively
• But in domains demanding high accuracy and no loss of data, we can use the 1st and 99th percentile values to replace lower and upper outliers respectively
• Thus, it is at the sole discretion of the analyst which approach to select and apply
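The replacement approach can be sketched in Python as follows. This is a minimal illustration under stated assumptions: outliers are identified with the 1.5*IQR rule, the function name `cap_outliers` and its parameters are hypothetical, and percentiles use the "inclusive" method of `statistics.quantiles` so that they stay within the observed range of a small data set.

```python
import statistics

def cap_outliers(values, lower_pct=5, upper_pct=95):
    """Replace IQR-rule outliers with the chosen lower and upper percentile values."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # quantiles(n=100) returns 99 cut points; index p - 1 holds the p-th percentile.
    percentiles = statistics.quantiles(values, n=100, method="inclusive")
    low_value, high_value = percentiles[lower_pct - 1], percentiles[upper_pct - 1]
    capped = []
    for value in values:
        if value < low_fence:
            capped.append(low_value)    # lower outlier
        elif value > high_fence:
            capped.append(high_value)   # upper outlier
        else:
            capped.append(value)        # not an outlier: keep as is
    return capped

data = [14, 9, 17, 19, 42, 22, 35, 99, 32, 2]
print(cap_outliers(data))          # 99 replaced by the 95th percentile
print(cap_outliers(data, 1, 99))   # 99 replaced by the 99th percentile
```

With the second example list from slide 5, only 99 is replaced; the other values are left unchanged. Using the 1st and 99th percentiles keeps the replacement closer to the original value than the 5th and 95th do.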
14. Want to Learn More?
Get in touch with us at support@Smarten.com, and do check out the Learning section on Smarten.com.
June 2018