Descriptive statistics helps users to describe and understand the features of a specific dataset, by providing short summaries and a graphic depiction of the measured data. Descriptive Statistical algorithms are sophisticated techniques that, within the confines of a self-serve analytical tool, can be simplified in a uniform, interactive environment to produce results that clearly illustrate answers and optimize decisions.
What is Descriptive Statistics and How Do You Choose the Right One for Enterprise Analysis?
1. Master the Art of Analytics
A Simplistic Explainer Series For Citizen Data Scientists
Journey Towards Augmented Analytics
4. List of statistical summary measures and plots
Measures
• Mean, Median, Mode
• Percentile, Quartile, Interquartile Range
• Skewness
• Variance, Standard deviation
Plots
• Box plot and Histogram plot
6. Introduction with example
Mean :
• It is simply the average of all the data values
• This measure can be biased when the data contains a significant number of outliers
• Descriptive statistics help describe and understand the features of a specific dataset by giving short summaries of the data. The most recognized types of descriptive statistics are listed and explained below
• Mean, median, and mode are different ways to figure out an average
Median :
• It is the value in the middle when the data items are arranged in ascending order (with an even number of items, the average of the two middle values)
• This measure is relatively robust to outliers, which makes it a more appropriate measure of average when a significant number of outliers are present in the data
• For instance, when profiling customers on attributes such as age, income or balance, the median age/income/balance can be looked at instead of the mean to avoid bias due to outliers
Mode :
• It is the most frequently occurring value in a series of data
• If no value repeats, there is no mode
• For example, in satisfaction survey analysis, the mode can be used to find the most common rating given by respondents to a particular service/product
• The second most popular use of the mode is imputing missing values of a character variable; when, say, a region variable has a number of missing values, the usual approach is to replace them with the most frequently occurring region, i.e. the mode of region
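The three measures above can be illustrated with a minimal Python sketch. All data values here (incomes, ratings, regions) are hypothetical and chosen only to show how an outlier affects the mean but not the median, and how the mode can fill missing values of a character variable.

```python
from statistics import mean, median, multimode

# Hypothetical incomes (in thousands); the last value is a deliberate outlier.
incomes = [40, 45, 50, 55, 60, 1000]

avg_income = mean(incomes)      # pulled upward by the outlier
mid_income = median(incomes)    # stays close to the bulk of the data

# Hypothetical survey ratings on a 1-5 scale.
ratings = [4, 5, 3, 4, 4, 2, 5, 4]
common_ratings = multimode(ratings)   # list of the most frequent value(s)

# Mode imputation for a character variable: missing regions (None) are
# replaced with the most frequently occurring region.
regions = ["East", "West", None, "East", "North", None, "East"]
known_regions = [r for r in regions if r is not None]
fill_value = multimode(known_regions)[0]
imputed_regions = [r if r is not None else fill_value for r in regions]

print("mean income:", avg_income)
print("median income:", mid_income)
print("most common rating(s):", common_ratings)
print("imputed regions:", imputed_regions)
```

Here the mean income lands above 200 because of the single extreme value, while the median remains 52.5, which is why the median is preferred when outliers are present.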
7. Introduction with example
Percentile :
• It represents a percentage position in a list of data
• For example, the 20th percentile is the value below which 20% of the observations may be found
• Let's consider the 25th percentile for the 8 numbers in Table 1. The numbers are sorted in ascending order and given ranks ranging from 1 for the lowest number to 8 for the highest number
• Step 1: Compute the rank (R) of the 25th percentile. This is done using the following formula:
  R = P/100 x (N + 1)
  where P is the desired percentile (25 in this case) and N is the number of numbers (8 in this case). Therefore,
  R = 25/100 x (8 + 1) = 9/4 = 2.25
Table 1
Number   Rank
3        1
5        2
7        3
8        4
9        5
11       6
13       7
15       8
8. Introduction with example
Percentile :
Step 2: If R is an integer, the Pth percentile is the number with rank R. When R is not an integer, we compute the Pth percentile as follows:
• Define IR as the integer portion of R (the number to the left of the decimal point). For this example, IR = 2
• Define FR as the fractional portion of R. For this example, FR = 0.25
• Find the scores with Rank IR and with Rank IR + 1. For this example, this means the score with Rank 2 and the score with Rank 3. The scores are 5 and 7
• Multiply the difference between the scores by FR and add the result to the lower score. For these data, this is (0.25)(7 - 5) + 5 = 5.5
Therefore, the 25th percentile is 5.5
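The two steps above can be sketched in Python as follows. The function name `percentile` and the handling of ranks that fall outside 1..N are assumptions made for this illustration; the steps in the text do not cover that edge case.

```python
def percentile(values, p):
    """Pth percentile using the rank rule R = P/100 x (N + 1)."""
    data = sorted(values)
    n = len(data)
    rank = p / 100 * (n + 1)
    ir = int(rank)       # IR: integer portion of R
    fr = rank - ir       # FR: fractional portion of R
    # Assumption: ranks outside 1..N are clamped to the lowest/highest value.
    if ir < 1:
        return data[0]
    if ir >= n:
        return data[-1]
    lower, upper = data[ir - 1], data[ir]   # scores with Rank IR and Rank IR + 1
    return lower + fr * (upper - lower)


numbers = [3, 5, 7, 8, 9, 11, 13, 15]   # the 8 numbers from Table 1
p25 = percentile(numbers, 25)
print(p25)   # 5.5, matching the worked example
```

When R is an integer, FR is 0 and the function simply returns the number with rank R, as Step 2 requires.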
9. Introduction with example
Quartile :
Quartiles are specific percentiles which divide the dataset into four equal parts
First Quartile = Q1 = 25th Percentile
Second Quartile = Q2 = 50th Percentile = Median
Third Quartile = Q3 = 75th Percentile
For instance, if the lower quartile is at Income = 100k, then the bottom 25% of the population have income <= 100k. If the median, i.e. the second quartile, is at Income = 200k, then the bottom 50% of the population have income <= 200k, and so on
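As a rough illustration, the quartiles of a small dataset can be computed with Python's standard library. The choice of `method="exclusive"` is an assumption made here because it interpolates at position P/100 x (N + 1), the same rank rule used in the percentile example; other tools may use different conventions and give slightly different quartiles.

```python
from statistics import median, quantiles

numbers = [3, 5, 7, 8, 9, 11, 13, 15]   # the 8 numbers used in the percentile example

# n=4 asks for the three cut points that split the data into four parts.
q1, q2, q3 = quantiles(numbers, n=4, method="exclusive")
iqr = q3 - q1   # interquartile range

print("Q1:", q1)    # 5.5, the 25th percentile
print("Q2:", q2)    # 8.5, equal to the median
print("Q3:", q3)    # 12.5
print("IQR:", iqr)  # 7.0
```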
10. Introduction with example
Standard deviation and Variance:
Both are popular measures of how spread out the data points are around a central value, the mean
For example, let’s find the standard deviation of the following data: 1,2,2,4,6
1. Calculate the mean of data: 15/5 = 3
2. Subtract the mean from each data value: -2, -1, -1, 1, 3
3. Square each of the new data value: 4,1,1,1,9
4. Sum these squared data values: 16
5. Divide this sum by (number of observations -1): 16 / (5-1) = 4
6. This number is the variance, and its square root is the standard deviation: Sqrt(4) = 2
For instance, standard deviations of price data are frequently used as a measure of volatility. Similarly, when monitoring an industrial process, indicators that drift beyond design standards may signal trouble, so the variance/standard deviation can be used to track them
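The six steps of the worked example can be traced in a short Python sketch. The variable names are illustrative; the cross-check at the end uses the standard library's sample variance and sample standard deviation, which also divide by (number of observations - 1).

```python
import statistics
from math import sqrt

data = [1, 2, 2, 4, 6]

mean_value = sum(data) / len(data)               # step 1: 15 / 5 = 3
deviations = [x - mean_value for x in data]      # step 2: -2, -1, -1, 1, 3
squared = [d ** 2 for d in deviations]           # step 3: 4, 1, 1, 1, 9
total = sum(squared)                             # step 4: 16
variance = total / (len(data) - 1)               # step 5: 16 / (5 - 1) = 4
std_dev = sqrt(variance)                         # step 6: sqrt(4) = 2

print("variance:", variance)
print("standard deviation:", std_dev)

# Cross-check against the standard library's sample statistics.
print(statistics.variance(data), statistics.stdev(data))
```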
11. Introduction with example
Skewness:
It is a measure of symmetry or, more precisely, the lack of symmetry. A dataset is symmetric if it looks the same to the left and right of the center point
If skewness is less than −1 or greater than +1, the distribution is highly skewed
If skewness is between −1 and −0.5 or between +0.5 and +1, the distribution is moderately skewed
If skewness is between −0.5 and +0.5, the distribution is approximately symmetric
12. Introduction with example
Skewness Calculation Formula:
• Where:
n = Number of observations
s = Standard deviation
S = Skewness
Xi = Ith observation
X avg = Mean of observations
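The formula itself is not reproduced in this text, so the sketch below is an assumption: it uses the adjusted Fisher-Pearson coefficient, S = n / ((n - 1)(n - 2)) x sum(((Xi - X avg) / s)^3), which is one common definition built from the same symbols. It may differ from the formula on the original slide. The function names and the classification helper, which applies the thresholds from the previous slide, are illustrative.

```python
from math import sqrt


def sample_skewness(values):
    """Adjusted Fisher-Pearson skewness (an assumed choice of formula)."""
    n = len(values)
    if n < 3:
        raise ValueError("at least 3 observations are required")
    x_avg = sum(values) / n
    # s: sample standard deviation, dividing by (n - 1)
    s = sqrt(sum((x - x_avg) ** 2 for x in values) / (n - 1))
    if s == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return n / ((n - 1) * (n - 2)) * sum(((x - x_avg) / s) ** 3 for x in values)


def describe_skew(skew):
    """Apply the rule-of-thumb thresholds for interpreting skewness."""
    if skew < -1 or skew > 1:
        return "highly skewed"
    if skew < -0.5 or skew > 0.5:
        return "moderately skewed"
    return "approximately symmetric"


data = [1, 2, 2, 4, 6]   # same data as the standard deviation example
skew = sample_skewness(data)
print(skew, describe_skew(skew))
```

For this data the result is positive, reflecting the longer right tail created by the value 6.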
13. Introduction with example
Histogram:
• It is a graphical display where the data is grouped into buckets (bins) and then plotted as bars
• For example, a price by volume chart shown below is a common histogram that shows how many shares of a stock traded in a given price range
• Here the share price is converted into 12 bins, and the counts of shares traded in each price range are plotted as bars
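The binning idea can be sketched in plain Python. The trade data below is hypothetical, the function name `volume_by_price` is illustrative, and equal-width bins are an assumption; the chart on the original slide may have used different bin edges.

```python
def volume_by_price(trades, bins=12):
    """Group (price, shares) pairs into equal-width price bins and total the shares."""
    prices = [price for price, _ in trades]
    low, high = min(prices), max(prices)
    if high == low:
        raise ValueError("all prices are identical; cannot form bins")
    width = (high - low) / bins
    totals = [0] * bins
    for price, shares in trades:
        # The maximum price would fall just past the last bin, so clamp it in.
        index = min(int((price - low) / width), bins - 1)
        totals[index] += shares
    edges = [low + i * width for i in range(bins + 1)]
    return edges, totals


# Hypothetical trades: (price, shares traded)
trades = [(10.0, 500), (10.4, 300), (11.1, 800), (11.9, 1200), (12.0, 900),
          (12.6, 700), (13.2, 400), (14.5, 250), (15.0, 150), (16.0, 100)]

edges, totals = volume_by_price(trades)
for i, total in enumerate(totals):
    bar = "#" * (total // 100)   # one '#' per 100 shares
    print(f"{edges[i]:6.2f} to {edges[i + 1]:6.2f} | {bar}")
```

Each printed row is one bar of the histogram, with its length proportional to the shares traded in that price range.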
14. Introduction with example
The histogram is an effective graphical technique for showing the skewness and kurtosis of a dataset
For example, before applying a predictive algorithm that requires normally distributed data, the histogram gives a quick check of whether the data follows a normal distribution. Kurtosis can be looked at, and a transformation can be applied to the data if necessary to achieve normality
If the bulk of the data is at the left and the right tail is longer, we say that the distribution is skewed right or positively skewed
If the peak is toward the right and the left tail is longer, we say that the distribution is skewed left or negatively skewed
In negatively (left) skewed data, the mean is typically less than the median and mode, whereas in positively (right) skewed data, the mean is typically greater than the median and mode
For example, casual equity investors look at the chart of a stock's price and try to invest in companies that have a positive skew. The idea is to invest in a company with a long right tail, which in the equity markets means a stock price distribution that is strongly positively skewed
15. Introduction with example
Box plot:
It is a standardized way of displaying the distribution of data based on the five-number summary: minimum, first quartile, median, third quartile, and maximum
The central rectangle spans the first quartile to the third quartile (the interquartile range or IQR). A segment inside the rectangle shows the median, and "whiskers" above and below the box show the locations of the minimum and maximum
For example, for the quartiles Q1, Q2 and Q3 with values 4.5, 7 and 11.5 respectively and minimum = 0.5, maximum = 22, the box plot can be drawn as shown on the right
17. Business use cases – In general
Descriptive Statistics, as the name implies, describes or summarizes the raw data and makes it interpretable by humans
Common examples of descriptive analytics are reports that provide historical insights regarding the company's production, financials, operations, sales, inventory and customers
Thus, Descriptive Statistics is to be used when you need to understand at an aggregate level what is going on in your company, and when you want to summarize and describe different aspects of your business
18. Business use cases – Mode
Business problem : Identify the most popular dish served in a restaurant, the most frequent rating given by customers for a given movie/restaurant, or the most frequent size or category of a sold product, etc.
Business benefit :
• By identifying the mode of the names of dishes purchased, a restaurant owner becomes aware of the most popular dish and can decide the pricing of that dish accordingly
• By identifying the most frequently rated movie or restaurant of the month, news publishers or other researchers can broadcast this information, or a market researcher can provide it to prospective restaurant owners who are surveying market conditions before launching
• By knowing the most frequently bought product category/size, stock inventory can be better managed
19. Business use cases – Mean/Median
Business problem : Find out the average age and income of the customers who purchase a particular product category
Business benefit :
• By identifying the mean/median income of this customer segment, marketing can be targeted at the segment in order to improve ROI and sales revenue
• However, the median gives a more accurate picture than the mean; for instance, if a couple of users have extreme income values, they will distort the mean
20. Business use cases – Percentile
Business problem : A bank's loan manager needs to find out the percentile distribution of the credit scores of loan applicants
Business benefit :
• By checking the credit score distribution, the loan manager can see what percentage of applicants fall in the top 10 percentiles and can estimate the total number of eligible loan applicants based on the bank's credit score criteria for loan eligibility
• A high number of delinquencies and defaults can be avoided by using such statistical measures to make an informed decision on whom to give a loan
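A minimal sketch of this use case is shown below. The credit scores and the eligibility cutoff of 700 are hypothetical, and the decile cut points come from the standard library's default ("exclusive") method, which is only one of several percentile conventions.

```python
from statistics import quantiles

# Hypothetical credit scores of 15 loan applicants.
scores = [580, 610, 640, 655, 670, 690, 700, 710, 725, 740,
          755, 770, 790, 805, 820]

cutoff = 700   # hypothetical eligibility criterion set by the bank

eligible = [s for s in scores if s >= cutoff]
eligible_share = len(eligible) / len(scores)

# Nine cut points dividing the scores into ten parts; the last one is the
# 90th percentile, i.e. the score that marks the top 10 percentiles.
deciles = quantiles(scores, n=10)
top_decile_floor = deciles[-1]

print("eligible applicants:", len(eligible))
print("eligible share:", eligible_share)
print("90th percentile score:", top_decile_floor)
```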
21. Business use cases – Quartiles, Interquartile Range + Boxplot
Business problem : A business owner wants to reduce the business process cycle time
Business benefit :
• By checking the Q1, Q3 and interquartile range (Q3 - Q1) values of each step of the process, we can find out which particular step has scope for time reduction
For instance, in the box plot above, we can observe that the steps of preliminary analysis, database research, evaluation, record keeping and follow up have a high interquartile range (indicated by box height), making them the steps for further inspection, with the follow up step being the primary concern owing to its highest interquartile range
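The comparison can be sketched as below. The step durations are hypothetical values invented for illustration (the data behind the original box plot is not available), and only three of the steps are included.

```python
from statistics import quantiles

# Hypothetical durations (in hours) recorded for three process steps.
step_times = {
    "Preliminary analysis": [2, 3, 4, 6, 8, 9, 11],
    "Evaluation": [1, 2, 2, 3, 3, 4, 5],
    "Follow up": [1, 3, 6, 10, 14, 18, 25],
}


def interquartile_range(values):
    """IQR = Q3 - Q1, using the standard library's default quartile method."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1


spread = {step: interquartile_range(times) for step, times in step_times.items()}
widest_step = max(spread, key=spread.get)

for step, iqr in spread.items():
    print(f"{step}: IQR = {iqr}")
print("step to inspect first:", widest_step)
```

The step with the largest IQR has the most variable duration, which makes it the first candidate for inspection.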
22. Business use cases – Standard deviation/variance
Business problem : A stock broker wants to analyze the price volatility of a stock as a measure of risk
Business benefit :
• By analyzing the standard deviation or variance, one can measure the risk associated with a particular stock in terms of price fluctuations
• If these measures are relatively low, a reliable estimate of future pricing as well as expected volatility can be made
• Thus standard deviation/variance tells us how much the stock price or fund's return deviates from the expected normal price or return, and is therefore used by investors as a gauge of expected volatility
• However, to make a valid comparison of the volatility of two stocks, one would have to divide the standard deviation by the closing price
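The point about dividing by the closing price can be shown with a small sketch. Both price series are hypothetical, and the helper name `relative_volatility` is illustrative; it uses the sample standard deviation and treats the last price in each series as the closing price.

```python
from statistics import stdev

# Hypothetical daily prices for two stocks with very different price levels.
stock_a = [10, 12, 9, 11, 10]
stock_b = [100, 102, 99, 101, 100]


def relative_volatility(prices):
    """Sample standard deviation divided by the closing (last) price."""
    return stdev(prices) / prices[-1]


sd_a, sd_b = stdev(stock_a), stdev(stock_b)
rel_a, rel_b = relative_volatility(stock_a), relative_volatility(stock_b)

print("standard deviations:", sd_a, sd_b)   # the same absolute spread
print("relative volatility:", rel_a, rel_b)
```

Both stocks move by the same absolute amounts, so their standard deviations match, but relative to its price the cheaper stock is ten times as volatile.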
23. Business use cases – Standard deviation/variance
For instance, let's compute the standard deviation of the 10 days of stock prices shown in the table:
• Calculate the average (mean) price for the number of periods or observations
• Determine each period's deviation (price – mean)
• Square each period's deviation
• Sum the squared deviations
• Divide this sum by the number of observations; this is the population variance (the earlier worked example divides by the number of observations – 1, which gives the sample variance)
• The standard deviation is then equal to the square root of this number
The lower this number, the lower the volatility in price and the easier it is to estimate the future price of the stock
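The difference between dividing by the number of observations and dividing by one less can be seen in a short sketch. The 10 prices below are hypothetical stand-ins, since the table from the original slide is not available in this text.

```python
from statistics import pstdev, pvariance, stdev, variance

# Hypothetical closing prices for 10 days.
prices = [20, 21, 19, 22, 20, 18, 21, 23, 20, 16]

# Population versions divide the summed squared deviations by N (here 10).
pop_variance = pvariance(prices)
pop_std_dev = pstdev(prices)

# Sample versions divide by N - 1 (here 9).
sample_variance = variance(prices)
sample_std_dev = stdev(prices)

print("population variance and standard deviation:", pop_variance, pop_std_dev)
print("sample variance and standard deviation:", sample_variance, sample_std_dev)
```

The sum of squared deviations here is 36, so the population variance is 36 / 10 = 3.6 and the sample variance is 36 / 9 = 4. The gap narrows as the number of observations grows.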
24. Business use cases – Skewness + Histogram
Business problem : A quality control manager of a company producing elevator rails needs to know which machine is ideal for producing rails
Business benefit :
If the required diameter for an elevator rail is 3 inches, you can conclude from the image on the right that machine A is producing elevator rails that are too narrow, whereas machine B is producing elevator rails that are too wide
Hence, both machines are failing to produce the required diameter of elevator rails
However, machine C is producing the right diameter most of the time, making it the ideal choice for producing elevator rails with a diameter of 3 inches
25. Want to Learn More?
Get in touch with us at support@Smarten.com and check out the Learning section on Smarten.com
June 2018