Exploratory data analysis (EDA) is an approach made up of tools that help you understand your data easily. These tools can be used with minimal knowledge of statistics.
The School of Continuous Improvement presents these EDA tools here so that anyone who wants to use them can do so.
Exploratory data analysis v1.0
1. (C) The School of Continuous Improvement v1.0
1
Exploratory Data Analysis
2. Disclaimer
This module on Exploratory Data Analysis is offered free of charge to interested
individuals who wish to learn more about using these tools to understand their datasets better.
We recommend using these tools with the help of a mentor. Please contact us at
vishy@theschoolofci.org should you need our mentoring on Exploratory Data Analysis.
Reproducing, distributing, or selling this module for financial benefit will invite stringent
action, under the applicable law, by the institution facilitating this module.
3. Body of knowledge
1. Stem and Leaf Plot
2. Box Plot
3. Median Polish
4. Resistant Line
5. Resistant Smooth
6. Rootogram
4. Introduction to Exploratory Data Analysis
Exploratory Data Analysis is an approach that has a list of techniques which can be used to understand the
data better without the need to use significance or confidence level testing.
Uses of Exploratory Data Analysis are as below:
1. Get detailed insight into your dataset.
2. Understand some critical impact variables that influence the dataset.
3. Detect if any outliers are present in the dataset.
4. Test the underlying assumptions of the dataset.
Exploratory Data Analysis can be done in a matter of 3 minutes using Minitab or any other statistical
software package.
Be surprised, though: we will use Microsoft Excel® to build these tools.
5. Stem and Leaf Plot
6. Stem and Leaf Plot
A contact center quality team evaluates 100 calls in the contact center. The
Quality Manager decides to review the quality scores of the operations floor.
Let us draw a stem and leaf plot to understand the data.
A snapshot of the data sheet is attached here. This data sheet can be
found in the file EDA.xls.
7. Stem and Leaf Plot
Step 1 – Sort the data in ascending order.
Step 2 – Find the minimum and maximum values using the MIN and MAX functions.
Step 3 – Find the range using the formula MAX – MIN.
Step 4 – Construct the stems, starting from 0 and ending with 8. Rule for
constructing stems: if you have a data set with 3-digit values, the
stems need to be constructed according to the hundreds place.
8. Stem and Leaf Plot
Step 5 – We need to write the formula to compute the leaves. For example, let us
take Stem 3, highlighted with a yellow background. We need to count how
many values fall in the 30s.
Let us first write the formula to count the values that are exactly 30.
Press Enter. See how the leaf shows up as 0. This means we have one
value of 30.
Let us change the first value of the dataset to 30, for the sake of simulation!
As we see here, you now have two values of 30. So, the formula works!
9. Stem and Leaf Plot
Step 6 – Let us now build the formula which will count all the numbers in the
series of 30-40, i.e. 31, 32, 33, 34, 35, and so on.
Huh! That formula seems to never end, does it? Well, do it just once and then it
will be easy.
Yes, it is some pain, but it is worth it!
10. Stem and Leaf Plot
Step 7 – The Stem and Leaf Plot
is as shown here.
Step 8 – Let us use the LEN and
SUBSTITUTE functions
together to add the
interpretation.
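For readers who prefer a script to the Excel formulas, the steps above can be sketched in Python. This is a minimal sketch, assuming non-negative whole-number scores, so that the units digit is the leaf and everything to its left is the stem. The function name `stem_and_leaf` and the sample scores are illustrative; they are not the values from EDA.xls.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return {stem: 'leaves'} where leaf = units digit, stem = remaining digits."""
    stems = defaultdict(list)
    for v in sorted(values):              # Step 1: sort ascending
        stem, leaf = divmod(int(v), 10)   # e.g. 35 -> stem 3, leaf 5
        stems[stem].append(str(leaf))
    lowest, highest = min(stems), max(stems)
    # Keep empty stems so that gaps in the data stay visible.
    return {s: "".join(stems.get(s, [])) for s in range(lowest, highest + 1)}

# Hypothetical quality scores (assumption, for illustration only).
scores = [30, 30, 31, 35, 42, 47, 47, 58, 61, 66, 73, 79, 85]
plot = stem_and_leaf(scores)
for stem, leaves in plot.items():
    # The count in brackets plays the role of the LEN-based interpretation.
    print(f"{stem} | {leaves}  ({len(leaves)})")
```

Each printed row shows a stem, its leaves, and how many values fall on that stem.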
11. Stem and Leaf Plot
1. You have an easier option to run a macro to generate the
Stem and Leaf Plot, but VBA coding is not everyone’s
cup of tea.
2. You could use some statistical software but that may turn
out to be slightly expensive.
3. With the use of some simple Excel formulas, you have
discovered tool 1 which is used to show granularity in
information in the dataset.
4. That is the Stem and Leaf Plot for you.
12. Box Plot
13. Box Plot
Granularity as provided by the Stem and Leaf Plot is good, but at times you
need a graph that shows the data shape, its distribution and the spread. That’s
where we use the Box Plot.
Let us draw a Box plot to understand the data.
Five teams in a factory produce homogeneous units. The sampled cycle
times are shown below.
14. Box Plot
Step 1 – Let us set up the table as
seen here. We know how to
calculate the Minimum and
Maximum values.
Step 2 – Calculate the Median and
the Quartile values using the
formulas below.
Median: =MEDIAN(Data range)
1st Quartile: =PERCENTILE(Data range, 25%)
3rd Quartile: =PERCENTILE(Data range, 75%)
15. Box Plot
Step 3 – Although you have
prepared the basic data needed, we
aren’t ready to draw the Box Plot
yet. We need to prepare another
table, one that is shown here.
Step 4 – In the row titled Series 1,
fetch the minimum values for the
Teams.
In the row titled Series 2,
subtract the Minimum value
from the 1st Quartile value
from the Summary Range table.
16. Box Plot
Step 5– In the row titled Series 3,
subtract the 1st Quartile value from
the Median value.
In the row titled Series 4, subtract
the Median from the 3rd Quartile
value.
In the row titled Series 5, subtract
the 3rd Quartile value from the
Maximum value.
Let us now try to draw the
Box Plot.
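The summary table and the five stacked series from Steps 1 to 5 can be sketched in Python as a cross-check on the spreadsheet. This is a minimal sketch: it assumes that `statistics.quantiles` with `method="inclusive"` interpolates the same way as Excel's PERCENTILE, and the function name `box_plot_series` and the cycle times are hypothetical.

```python
import statistics

def box_plot_series(data):
    """Five-number summary plus the stacked-column series used for the chart."""
    # Assumption: the "inclusive" method mirrors Excel's PERCENTILE interpolation.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    lowest, highest = min(data), max(data)
    summary = {"min": lowest, "q1": q1, "median": median, "q3": q3, "max": highest}
    series = {
        "Series 1": lowest,          # hidden base of the stack
        "Series 2": q1 - lowest,     # becomes the lower whisker
        "Series 3": median - q1,     # lower half of the box
        "Series 4": q3 - median,     # upper half of the box
        "Series 5": highest - q3,    # becomes the upper whisker
    }
    return summary, series

# Hypothetical cycle times for one team (assumption, for illustration only).
cycle_times = [12, 15, 18, 20, 22, 25, 31, 40, 55]
summary, series = box_plot_series(cycle_times)
print(summary)
print(series)
```

Because each series is the gap between neighbouring summary values, the five series always add up to the maximum, which is a quick way to check the table before charting.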
17. Box Plot
Step 6 – Select the data from Series 1 to Series 4. Don’t select Series 5 yet; we
will do it later.
Select 2D Column – Stacked Column Chart.
18. Box Plot
Step 7– Obviously the chart is not a
completed Box Plot. We need to work
around a few things on Excel. Let us
first hide the Series 1 in the graph
generated.
To do this, right click on Series 1 on
the graph.
Click on Format Data Series.
Click on Fill. Select No fill.
Click on Border Color. Select No
color.
See how the blue bars for Series
1 go away.
19. Box Plot
Step 8 – Repeat the same steps as in
Step 7 on the previous slide,
but this time with
Series 2 selected.
Step 9 – We need to define the
Whiskers. To do that,
Click on Layout, click on Error
Bars and click on More Error
Bar options.
Step 10 – In the dialog window box
that opens up, select Minus for
Direction and change the
percentage to 100.
20. Box Plot
Step 11– After doing Step 9 and Step
10, the graph changes shape to what
is seen here. Take a look at the graph.
Step 12 – Repeat steps 9 and 10 for
Series 4. A small change. In the More
Error bars options, select the
Direction to Plus.
You will see how the lower and
upper whiskers are defined
now.
22. Box Plot
Step 13– Oops something went wrong
with the graph here. We have not
defined the Maximum values here.
Step 14 – Click on the lines at the top.
Click on Layout, Click on More Error
Bars and in the window that opens
up, select Custom and specify values.
Select the maximum values from
the data for chart table, aka
Series 5.
23. Box Plot
Step 15– The Box Plot is ready now. We can now start interpreting. Obviously
we spent some time making this Box Plot, but it is a one time effort. Once you
are able to construct this, you can use this as a Box Plot Template.
Box Plot
Interpretation
1. The Median cycle time for Team C seems the
lowest at approximately 20 minutes.
2. Team A shows the greatest spread in data.
3. Data for Team A is also heavily skewed.
4. Team E seems to have a good % of population
in the lower end of the cycle time.
24. Box Plot
1. Box Plot doesn’t confirm anything. It is thus not a confirmatory data analysis
tool.
2. Given the fact that a Box Plot is able to tell you information about central
tendency, spread and shape of the data, you can use this EDA tool pretty
much everywhere you have stratified data.
3. You can also use this tool where you just have one sample of data and you
wish to study properties of that sample.
25. Median Polish
26. Median Polish
In Inferential statistics, Analysis of Variance is a Hypothesis testing measure
that fits an additive model to a 2-way design and identifies data patterns not
explained by Row and Column variable effects.
Median Polish does a similar thing except that Median Polish will
use Medians.
A company wishes to conduct a Median Polish on the percentage
scores achieved by students in each course of an IT institution.
Table 1
27. Median Polish
Step 1 – First find the median
of each course's scores
and subtract that median from the
individual performance scores for
that course. This is known as the 1st
sweep.
Step 2 – Now, do the 2nd sweep. In
the second sweep, subtract the
median from table 2 (Last row)
and the Row median from table 2
(Last column) (Both highlighted)
from the table values of table 1.
For the column median, subtract
2nd Sweep value for any cell with
the corresponding cell in 1st sweep.
Table 2
Table 3
28. Median Polish
Step 3 – Let’s do the 3rd sweep
now. Subtract the row values
obtained in table 3 from the row
medians. Identify the new column
medians in the 3rd sweep itself. The
new row medians = Change
Median – Median from table 3.
Table 4
Step 4 – Time for the 4th sweep. Subtract all the row values in table 4
from the 3rd sweep column median. This will give you the row values for
the new table that we will construct.
Also add the Column Median value to the 3rd Sweep Column Median.
29. Median Polish
Table 5
30. Median Polish
Table 6 – Final
Residual Table
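The sweeps described above can also be sketched in Python. This sketch follows the standard median polish procedure (alternating row and column sweeps), which may keep its books differently from the spreadsheet tables shown in this module. The function name `median_polish`, the `iterations` parameter (two iterations give four sweeps), and the sample scores are assumptions for illustration, not the module's data.

```python
from statistics import median

def median_polish(table, iterations=2):
    """Split a two-way table into overall + row effect + column effect + residual."""
    resid = [[float(x) for x in row] for row in table]
    n_rows, n_cols = len(resid), len(resid[0])
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(iterations):
        # Row sweep: move each row's median from the residuals into its row effect.
        for i in range(n_rows):
            m = median(resid[i])
            resid[i] = [x - m for x in resid[i]]
            row_eff[i] += m
        # Re-centre the column effects; their median goes into the overall value.
        m = median(col_eff)
        col_eff = [c - m for c in col_eff]
        overall += m
        # Column sweep: move each column's median into its column effect.
        for j in range(n_cols):
            m = median([resid[i][j] for i in range(n_rows)])
            for i in range(n_rows):
                resid[i][j] -= m
            col_eff[j] += m
        # Re-centre the row effects in the same way.
        m = median(row_eff)
        row_eff = [r - m for r in row_eff]
        overall += m
    return overall, row_eff, col_eff, resid

# Hypothetical scores: rows are courses, columns are attendance bands (assumption).
scores = [[40, 45, 50], [53, 58, 63]]
overall, row_eff, col_eff, resid = median_polish(scores)
print(overall, row_eff, col_eff, resid)
```

Every original cell equals overall + row effect + column effect + residual, so large residuals point to cells that the row and column effects do not explain.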
31. Median Polish
Interpretations
1. The average test score performance
across all the courses was 44.25%.
2. People who do JAVA programs alone
score approximately 13 points less than
those who do .NET.
3. Oh yes, look at the Column effects from
the Residual table. Students with 90%
attendance outscore the ones with 70%
attendance by 5 points.
32. Median Polish
Final Notes
1. The tediousness of the calculations shouldn’t scare you away from this wonderful
tool.
2. In a 2*2 design where there is a possibility that one of them is categorical,
Median polish comes in very handy in establishing relationships.
3. With the power of calculating residuals with the Median Polish tool, you can
also predict on what could happen in the future.
33. Histogram
34. Histogram
The histogram is another important EDA tool, which you can use when you wish
to check the shape of the data. Importantly, a histogram will highlight issues in the data such as:
1. Modality issues
2. Skew issues
3. Mixed distribution issues
Let us go back to the cycle time data and try to plot the histogram with the
help of Excel.
35. Histogram
Step 1 – Let us first calculate the descriptive statistics measures for all the teams.
As you can see from the table shown here, most of the formulas are basic except
for the ones shaded in Light amber background.
IQR = 3rd Quartile – 1st Quartile
Bin width = 2 * Count^(1/3)
Number of bins = (Maximum – Minimum)/ Bin width
36. Histogram
Step 2 – Let us now define the bins. Start with the minimum value. For
example, for Team A the first bin would be 0.32. The next bin will be 0.32 + Bin
Size (7.26). The third bin would be 7.53 + 7.26 and so on. Continue this until you
reach 7 bins.
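The bin calculation in Steps 1 and 2 can be sketched in Python. This is a minimal sketch under stated assumptions: it reads the bin-width formula as 2 × Count^(1/3), rounds the number of bins up, and treats each bin value as an upper bound, which is how Excel's Histogram tool counts. The function names and the cycle times are hypothetical.

```python
import math
import statistics

def histogram_setup(data):
    """Return IQR, bin width, number of bins and the list of bin values."""
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    bin_width = 2 * len(data) ** (1 / 3)   # assumed reading of the formula
    n_bins = math.ceil((max(data) - min(data)) / bin_width)  # assumption: round up
    # First bin is the minimum; each later bin adds one bin width.
    bins = [min(data) + k * bin_width for k in range(n_bins + 1)]
    return iqr, bin_width, n_bins, bins

def count_into_bins(data, bins):
    """Count values into the first bin whose upper bound they do not exceed."""
    counts = [0] * len(bins)
    more = 0                               # values above the last bin
    for x in data:
        for k, upper in enumerate(bins):
            if x <= upper:
                counts[k] += 1
                break
        else:
            more += 1
    return counts, more

# Hypothetical cycle times for one team (assumption, for illustration only).
cycle_times = [0.32, 3.1, 5.0, 7.5, 9.9, 12.4, 20.0, 56.0]
iqr, bin_width, n_bins, bins = histogram_setup(cycle_times)
counts, more = count_into_bins(cycle_times, bins)
print(round(bin_width, 2), n_bins, counts, more)
```

The expression 2 × Count^(1/3) is also commonly quoted as the Rice rule for the number of bins, so it may be worth treating the resulting width as a starting point and adjusting it if the chart looks too coarse or too fine.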
37. Histogram
Step 3 – Let us first draw the Histogram for one team’s metric performance, e.g.
Team A.
Steps to draw a Histogram
1. Click on Data. Click on Data Analysis (If this option is not available, please
insert the Data Analysis Add-in).
2. From the Data Analysis Dialog window, choose Histogram.
3. In the section showing Input variable, select data corresponding to Team A.
4. In the section showing Bin range, select Bin range corresponding to Team A.
5. Put a tick on Chart Output and Click Ok.
38. Histogram
We achieved this nice
looking Histogram by
reducing the Gap to 0%
on the graph.
39. Histogram
Interpretations
1. Bi-modality observed at 7.53 and 56. Is
this due to an external issue?
2. If the bi-modality is resolved, we’d get
close to a perfect distribution, but what is
the reason for this bi-modality?
3. It could be a difference in suppliers, a difference
in changeovers, or a difference in raw materials.
Anything?
40. Rootogram
Interpretations
1. Introduction of a new tool here. Instead of having the frequencies on the
vertical axis, you can now take the square root of all the frequencies on
the vertical axis and what you have is known as the Rootogram.
2. The x-axis is the response variable instead of bins used in a Histogram.
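The square-root step is small enough to sketch in a few lines of Python. This is a minimal sketch: the function name `rootogram_heights` and the bin counts are hypothetical, and the text bars are only a stand-in for the chart.

```python
import math

def rootogram_heights(frequencies):
    """Bar heights for a rootogram: the square root of each frequency."""
    return [math.sqrt(f) for f in frequencies]

# Hypothetical bin counts from a histogram (assumption, for illustration only).
freqs = [1, 4, 9, 16, 4, 0]
heights = rootogram_heights(freqs)
for f, h in zip(freqs, heights):
    # Draw a rough text bar, four characters per unit of height.
    print(f"{f:>3} -> {h:.2f} " + "#" * round(h * 4))
```

Taking square roots compresses the tall bars more than the short ones, so low-frequency bins are easier to see next to the peak.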
41. Histogram
Based on the 4 Histograms drawn for each of the teams, what can you
infer?
Which team’s data distribution is close to being a normal
distribution?
43. Scatter Plot
44. Scatter Plot
Most times in projects we find that x impacts y. In other
words, y = f(x). Using scatter plots, you can visually understand whether there is
a relationship between x and y.
Let us use data for two variables, machine downtime and
production capacity for a factory, to understand how a scatter
plot works. Downtime is expressed in % and production capacity is
expressed in tons.
45. Scatter Plot
Step 1 – Select the data, Click
on Insert, Click on Scatter and
Click on Scatter with only
markers.
Step 2 – Voila – you are done.
There you have the scatter
chart as seen here.
46. Scatter Plot
Step 3 – Modification to a Regression equation
This is where you can use an EDA tool as an inferential statistics tool. Right
click on any point in the graph and click on Add Trendline. Select Linear, Display
Equation on chart and Display R-squared value on chart.
47. Scatter Plot
Step 4 – Interpretation
While the scatter graph itself visually revealed the absence of any strong correlation
between downtime and production capacity, the regression statistics merely
confirm it.
The R-Square value needs to be > 0.64 for us to conclude a strong correlation.
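The trendline equation and R-Square value that Excel displays can be reproduced with a short least-squares sketch in Python. This is a minimal sketch; the function name `linear_fit` and the downtime and capacity figures are hypothetical, not the factory data used in this module.

```python
def linear_fit(xs, ys):
    """Least-squares line y = slope * x + intercept, plus R-squared."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)   # square of the correlation coefficient
    return slope, intercept, r_squared

# Hypothetical data: downtime in %, production capacity in tons (assumption).
downtime = [2, 4, 6, 8, 10]
capacity = [95, 90, 88, 80, 77]
slope, intercept, r_squared = linear_fit(downtime, capacity)
print(f"y = {slope:.2f}x + {intercept:.2f}, R-Square = {r_squared:.3f}")
print("strong correlation" if r_squared > 0.64 else "no strong correlation")
```

The 0.64 cut-off is the square of 0.8, so it corresponds to a correlation coefficient above 0.8 or below -0.8.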
48. Final Notes
1. This module covers most of the tools used in Exploratory data analysis.
2. Some other tools are:
a. Parallel Coordinates
b. Run Charts
c. Odds Ratio
d. Principal Components Analysis
e. Ordination
Please write to us at vishy@theschoolofci.org if you
have doubts about the usage of EDA tools, or follow us
on LinkedIn at The School of Continuous
Improvement.
49. Thank you.