
# Data summary metrics

A little document I made for myself to clarify some concepts about the various data summary metrics.

### Data summary metrics

1. Data Summary Metrics: Mean, Median, Mode and More. Mandar Gadre (July 2013)

   - Populations & Samples, Parameters & Statistics
   - Summarizing the Data with a Metric
   - Discrepancy and Error
   - Estimating the Summary Metrics: Minimizing the Error
   - Arithmetic Mean, Median, and Mode
   - Geometric Mean, Harmonic Mean and Mid-Range
   - Breakdown Points of the Arithmetic Mean and the Median
2. Population and Sample

   (Diagram: a "sample" dataset drawn from the whole population.)

   - Parameter: a certain property of the population as a whole, e.g. the median age of all Indian citizens.
   - Statistic: an estimate of that property, computed from the sample dataset, e.g. the estimated median calculated from the ages of, say, 1 million citizens.
3. Summarizing the Data

   - If a certain property has the same value for every individual in the population, there is nothing to summarize.
   - For non-identical values of the property (almost everywhere in the real world), we look for a "Summary Statistic" such as a mean.
   - But how do we choose the summary statistic $S$ that stands in for the data $x_i$, $i = 1, \dots, N$?
4. Discrepancy

   - Discrepancy $e_i$ is the "deviation" of an individual data-point $x_i$ from a candidate summary statistic $s$. You could define your own way of calculating discrepancy!
   - Three common ways:
     1. Comparison (is the individual reading the same as the candidate?): $e_i = 1$ if $x_i \neq s$, and $e_i = 0$ if $x_i = s$.
     2. Absolute difference from the candidate: $e_i = |x_i - s|$.
     3. Square of the difference from the candidate: $e_i = (x_i - s)^2$.
5. Error

   - Error $E$ is the aggregate of the individual discrepancies $e_i$ of all the data-points $x_i$ from the candidate summary statistic $s$.
   - $E$ for the three types of discrepancy:
     1. Comparison with the candidate: $E = \sum_i e_i$, where $e_i = 1$ if $x_i \neq s$, and $e_i = 0$ if $x_i = s$.
     2. Absolute difference from the candidate: $E = \sum_i |x_i - s|$.
     3. Square of the difference from the candidate: $E = \sum_i (x_i - s)^2$.
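
As a minimal sketch of the three error definitions, the Python below computes $E$ for a single candidate $s$. The function names and the sample values are illustrative assumptions, not part of the original slides.

```python
def comparison_error(data, s):
    """E for the comparison discrepancy: number of points that differ from s."""
    return sum(1 for x in data if x != s)


def absolute_error(data, s):
    """E for the absolute-difference discrepancy."""
    return sum(abs(x - s) for x in data)


def squared_error(data, s):
    """E for the squared-difference discrepancy."""
    return sum((x - s) ** 2 for x in data)


# Hypothetical readings, chosen only for illustration.
data = [2, 3, 3, 5, 12]
candidate = 3

print(comparison_error(data, candidate))  # 3 (the points 2, 5 and 12 differ)
print(absolute_error(data, candidate))    # 12 = 1 + 0 + 0 + 2 + 9
print(squared_error(data, candidate))     # 86 = 1 + 0 + 0 + 4 + 81
```
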
6. Calculating S

   - We define the Summary Statistic $S$ as the value of $s$ for which $E$ is minimized.
   - The three types of discrepancy give rise to three Summary Statistics, each with its own name.

   Arising from Comparison: $S$ such that the error $E = \sum_i e_i$ (where $e_i = 1$ if $x_i \neq s$, and $e_i = 0$ if $x_i = s$) is minimized. The value of $s$ that occurs most frequently minimizes this error, and there may be one or more such values. We call this the Mode. E.g. if we want to sell single-size men's t-shirts, we can get them made in the size equal to the mode.
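
The claim that the most frequent value minimizes the comparison error can be checked by brute force. This is a sketch under assumptions: the t-shirt sizes are made-up data and `modes` is a hypothetical helper name.

```python
from collections import Counter


def comparison_error(data, s):
    """Number of data-points that differ from the candidate s."""
    return sum(1 for x in data if x != s)


def modes(data):
    """Return every value that attains the highest count (there may be several)."""
    counts = Counter(data)
    top = max(counts.values())
    return sorted(value for value, count in counts.items() if count == top)


# Hypothetical t-shirt sizes of seven customers.
sizes = ["M", "L", "M", "S", "L", "M", "XL"]

# Try every observed size as the candidate and keep the one with the lowest error.
best = min(set(sizes), key=lambda s: comparison_error(sizes, s))

print(modes(sizes))  # ['M']
print(best)          # M
```

When two values tie for the highest count, `modes` returns both, matching the remark that there may be more than one mode.
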
7. Arising from Absolute Difference: $S$ such that the error $E = \sum_i |x_i - s|$ is minimized.

   Each term is an absolute value function, whose derivative is the signum function, so $dE/ds = -\sum_i \operatorname{sgn}(x_i - s)$. This derivative is zero when as many data-points lie above $s$ as below it, which happens at the middle reading when all the data-points are arranged in increasing order. (With an even number of data-points, any value between the two middle readings minimizes $E$; by convention their average is used.) We call this the Median.

   If we want to summarize income-per-household in a huge country like India (a dataset with severe outliers that should not be given extra weight), we use the median.
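
A small numerical check of the median as the minimizer of the absolute error, assuming made-up income figures with one severe outlier. The integer grid search is only an illustration, not an efficient method.

```python
import statistics


def absolute_error(data, s):
    """E for the absolute-difference discrepancy."""
    return sum(abs(x - s) for x in data)


# Hypothetical household incomes (arbitrary units) with one severe outlier.
incomes = [12, 15, 18, 20, 22, 25, 900]

median = statistics.median(incomes)
mean = sum(incomes) / len(incomes)

# Brute-force search over integer candidates for the lowest absolute error.
best = min(range(0, 1001), key=lambda s: absolute_error(incomes, s))

print(median)  # 20
print(best)    # 20
print(mean)    # about 144.6, pulled far upward by the single outlier
```
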
8. Arising from Square of the Difference: $S$ such that the error $E = \sum_i (x_i - s)^2$ is minimized.

   Setting the derivative with respect to $s$ to zero gives

   $$\sum_i 2(x_i - S) = 0 \quad\Rightarrow\quad N S = \sum_i x_i \quad\Rightarrow\quad S = \frac{1}{N} \sum_i x_i$$

   We call this the Arithmetic Mean. If we want to summarize the height of children in a kindergarten class, we may use the mean: the data is approximately normally distributed and most likely there aren't any extreme outliers (though in such cases the median and the mode are no worse as summary statistics).

   In summary:

   - The mode is used for categorical/nominal data.
   - The mean is used when extreme values should influence the summary.
   - The median is used for datasets with extreme outliers that should not be given extra weight.

   Outliers sway the arithmetic mean much more than they sway the median, because the squared discrepancy weighs large distances much more heavily.
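
The sketch below checks two points from this slide on made-up data: the arithmetic mean gives a lower squared error than nearby candidates, and a single outlier moves the mean far more than the median. The heights and the data-entry slip are assumptions for illustration.

```python
import statistics


def squared_error(data, s):
    """E for the squared-difference discrepancy."""
    return sum((x - s) ** 2 for x in data)


# Hypothetical kindergarten heights in cm.
heights = [98, 101, 103, 104, 106, 108]
mean = sum(heights) / len(heights)
median = statistics.median(heights)

# The mean should beat candidates slightly below and slightly above it.
assert squared_error(heights, mean) < squared_error(heights, mean - 0.5)
assert squared_error(heights, mean) < squared_error(heights, mean + 0.5)

# Introduce one outlier: 1080 typed instead of 108.
with_outlier = heights[:-1] + [1080]
mean_shift = sum(with_outlier) / len(with_outlier) - mean
median_shift = statistics.median(with_outlier) - median

print(mean_shift)    # 162.0 (up to floating-point rounding): the mean moves a lot
print(median_shift)  # 0.0: the median does not move
```
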
9. Other Summary Statistics: Geometric Mean

   Defined only for datasets in which all numbers are positive, it is the $N$th root of the product of all $N$ data-points:

   $$G = \left( \prod_i x_i \right)^{1/N}$$

   It is used when summarizing or aggregating data that involves different categories and scales, e.g. rating companies on various metrics taken together. It is also used where the data-points show compounding behavior, e.g. summarizing the performance of a stock over the past $N$ years.
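
For the compounding case, a minimal sketch: the geometric mean of yearly growth factors is the single constant factor that would give the same overall growth. The growth factors are invented, and `geometric_mean` here is a hand-written helper compared against Python's `statistics.geometric_mean` (available from Python 3.8).

```python
import math
import statistics


def geometric_mean(data):
    """Nth root of the product of N positive numbers."""
    if any(x <= 0 for x in data):
        raise ValueError("geometric mean needs all-positive data")
    return math.prod(data) ** (1 / len(data))


# Hypothetical yearly growth factors of a stock: +10%, -5%, +20%.
factors = [1.10, 0.95, 1.20]

g = geometric_mean(factors)
total_growth = math.prod(factors)

# Compounding g for three years reproduces the overall growth.
print(math.isclose(g ** 3, total_growth))                        # True
print(math.isclose(g, statistics.geometric_mean(factors)))       # True
```
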
10. Other Summary Statistics: Harmonic Mean

    Defined only for datasets in which all numbers are non-zero (in practice it is used for positive values), it is the reciprocal of the arithmetic mean of the reciprocals of $x_i$:

    $$H = \left( \frac{1}{N} \sum_i \frac{1}{x_i} \right)^{-1} = \frac{N}{\sum_i 1/x_i}$$

    It is used when summarizing rates, e.g. the average speed of an aircraft over numerous Mumbai-London trips (each covering the same distance), or the average rate (in ml/min) at which a blood donor fills a bag over multiple visits.
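
A sketch of why the harmonic mean suits rates over equal distances: it equals total distance divided by total time. The distance and speeds below are assumed round numbers, not real flight data.

```python
import math
import statistics


def harmonic_mean(data):
    """Reciprocal of the arithmetic mean of reciprocals (positive data assumed)."""
    return len(data) / sum(1 / x for x in data)


# Hypothetical: two trips over the same 7200 km route at different speeds (km/h).
distance_km = 7200
speeds = [800, 900]

total_time_h = sum(distance_km / v for v in speeds)    # 9 h + 8 h
true_average = len(speeds) * distance_km / total_time_h

h = harmonic_mean(speeds)

print(math.isclose(h, true_average))                       # True
print(math.isclose(h, statistics.harmonic_mean(speeds)))   # True
print(h < sum(speeds) / len(speeds))                       # True: below the arithmetic mean of 850
```
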
11. Other Summary Statistics: Mid-Range

    Defined as the arithmetic mean of the maximum and minimum data-points:

    $$\text{Mid-Range} = \tfrac{1}{2} (x_{\max} + x_{\min})$$

    This is one of the least efficient statistics (since it ignores all the data-points except the min and max) and one of the least robust (since it depends only on the extreme data-points and will be swayed if they are extreme outliers). It is used in process control, e.g. where the process is tightly controlled and the outliers have already been handled or trimmed out.
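
A short sketch of the mid-range and its sensitivity to a single extreme value. The readings are made up and `mid_range` is a hypothetical helper name.

```python
def mid_range(data):
    """Arithmetic mean of the smallest and largest data-points."""
    return (min(data) + max(data)) / 2


# Hypothetical readings from a tightly controlled process.
readings = [10, 11, 12, 13, 14]
print(mid_range(readings))          # 12.0

# One untrimmed outlier is enough to drag the mid-range away.
print(mid_range(readings + [100]))  # 55.0
```
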
12. Robust Statistics

    - Robust Statistics are summary statistics that are insensitive to which sample we choose from the population, or to the presence of contaminated/bad/incorrect data in that sample. The "Breakdown Point" represents the degree of robustness of a statistic.
    - The Breakdown Point is the largest proportion of contaminated data-points (e.g. an arbitrarily large data-point) that a statistic can handle before yielding an absurd result (e.g. an arbitrarily large statistic).
    - Since the arithmetic mean depends on all the values and is swayed by changing even one value among $N$, its Breakdown Point is 0 (a single bad value is a fraction $1/N$ of the data, which tends to 0 as $N$ grows).
    - The median is the most robust of these statistics, with its Breakdown Point at 50%. (If more than 50% of the data is contaminated, a statistic cannot be defined anyway, since there is no way to distinguish between the actual underlying distribution and the contaminated one.)
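
The breakdown behaviour can be seen in a small experiment, sketched here under assumptions: 11 made-up readings, a hypothetical `contaminate` helper, and `1e9` standing in for an "arbitrarily large" bad value.

```python
import statistics


def contaminate(data, k, bad_value=1e9):
    """Return a copy of data with its last k entries replaced by bad_value."""
    return data[: len(data) - k] + [bad_value] * k


# 11 hypothetical clean readings: 1, 2, ..., 11 (median 6).
clean = list(range(1, 12))

for k in range(0, 7):
    sample = contaminate(clean, k)
    mean = sum(sample) / len(sample)
    median = statistics.median(sample)
    print(k, mean, median)

# The mean becomes absurd as soon as k = 1.
# The median stays at 6 up to k = 5 (fewer than half of the 11 points)
# and only jumps to the bad value at k = 6.
```
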