Slide 1zjzckkasjasfjsajkfakjlasfasajfdfjaksdffj.pdf

Research needs good understanding of data analysis
Vikash Raj Satyal
(Vikash@kusom.edu.np)
Summarize your Data:

What to look for in the dataset?
If our study has a large data set, we
(researchers) want to know:
• What the central value is,
• What the spread around the center is,
• What the shape & size of the data
distribution are

Major economic dataset
Questions
• What is per capita GDP?
• Whose per capita GDP is this?
• Did you earn $1191 in this FY
142920 (Rs. 126,018)? (Rs. 11,910 monthly)

• The per capita GDP of Nepali people is about
55 times lower than that of the USA, and
165 times lower than that of
Monaco

Research Paradigm
1. Set up research hypothesis / Refine (Lit. Review)
2. Develop instruments
3. Survey (collect data)
4. Statistical analysis
5. There is not enough evidence to support the research (alternative) hypothesis (HA)
= Failure of research hypothesis
5a. Report writing
6. Research hypothesis accepted: HA is true
7. Report writing

Why do Dolpa & Mugu also have the
highest annual growth rates?
Why do Achham and Palpa have some of
the lowest growth rates?
[Map labels: Mugu, Dolpa]

• What is the general IQ of US university students?
• In the US the mean IQ for persons completing no more than a…..
• Bachelor’s degree 113 (80th centile)
• Master’s degree 117 (87th centile)
• PhD, LLD, MD 124 (95th centile)

Central Tendency in large sample data
In any large data set, the data are
clustered around a center, so
researchers focus on finding
that central value.
Depending on the shape of the
data distribution, the center is
calculated with a different
statistical formula

Statistical way of measuring
the center of a data set
•Mean(AM, GM, HM, Weighted mean)
•Median
•Mode
•Partition values
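
The different means listed above can be compared on a small example. This is a minimal sketch using Python's standard `statistics` module; the values and weights are hypothetical and chosen only for illustration.

```python
import statistics

# Hypothetical sample values and weights, chosen only for illustration.
values = [2, 4, 8]
weights = [1, 1, 2]

am = statistics.mean(values)             # arithmetic mean (AM)
gm = statistics.geometric_mean(values)   # geometric mean (GM)
hm = statistics.harmonic_mean(values)    # harmonic mean (HM)

# Weighted mean: each value counts in proportion to its weight.
wm = sum(w * x for w, x in zip(weights, values)) / sum(weights)

print("AM:", am)
print("GM:", gm)
print("HM:", hm)
print("Weighted mean:", wm)
```

For positive values the three means always satisfy HM ≤ GM ≤ AM, which the output illustrates.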

Median, not mean, for:
(i) Open-end classes.
(ii) Data tables with unequal class intervals.
(iii) Data with several extreme values (outliers).
(iv) Qualitative data (in frequency).
(v) Data that strongly lack normality.
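
The effect of an outlier on the mean and the median can be seen in a short sketch. The income figures below are hypothetical; only the last one is extreme.

```python
import statistics

# Hypothetical incomes; the last value is a deliberate outlier.
incomes = [10, 12, 13, 15, 500]

mean_income = statistics.mean(incomes)      # pulled up by the outlier
median_income = statistics.median(incomes)  # middle value, unaffected

print("mean:", mean_income)
print("median:", median_income)
```

The single extreme value drags the mean far above every typical observation, while the median stays at the middle value.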

Quantiles
or Partition values
•Quartiles (3)
•Deciles (9)
•Percentiles (99)
• Quintiles (4 cut points, 5 groups; not to be confused with "quantile")
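
As a rough illustration of partition values, `statistics.quantiles(data, n=k)` from Python's standard library returns the k−1 cut points that split the data into k equal-sized groups. The data set below (the integers 1 to 100) is hypothetical.

```python
import statistics

# Hypothetical data: the integers 1..100.
data = list(range(1, 101))

quartiles = statistics.quantiles(data, n=4)        # 3 cut points
quintiles = statistics.quantiles(data, n=5)        # 4 cut points
deciles = statistics.quantiles(data, n=10)         # 9 cut points
percentiles = statistics.quantiles(data, n=100)    # 99 cut points

print(len(quartiles), len(quintiles), len(deciles), len(percentiles))
print("Q1, Q2, Q3:", quartiles)
```

Different tools use slightly different interpolation rules, so the exact cut points may differ a little from one package to another.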

Mode is the most frequently occurring value
• Less used
• Popular in business and industry
• The only way to locate a central value when the data are nominal
(e.g., which type sold most? what is the most preferred flavor of ice cream?)
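
For nominal data the mode is found by counting. A minimal sketch, assuming a hypothetical list of survey answers about ice cream flavors:

```python
from collections import Counter

# Hypothetical survey answers (nominal data).
flavours = ["vanilla", "chocolate", "mango",
            "chocolate", "vanilla", "chocolate"]

counts = Counter(flavours)

# most_common(1) returns a list with the single most frequent (value, count).
mode_flavour, mode_count = counts.most_common(1)[0]

print(mode_flavour, mode_count)
```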

Which Average is better?
AM is best for interval data; however, it should not be used:
• for highly skewed data
• in open-end classes
• when there are very large or very small items (outliers)
• for averaging ratios and rates of change
Median is the best average for:
• open-end classes
• skewed data or data with outliers
• ordinal qualitative data, e.g., less honest, honest, very honest
Mode is used for qualitative nominal data and is frequently used in
business and industry

Do the Shape and Size of the data matter?
• Elongation of the left or right tail is skewness
• Skewness describes a data set's symmetry, or lack of
symmetry.
• A perfectly symmetrical data set has a skewness of
0.

Skewness
• Negative (left) skewness indicates a longer left tail (extreme small values)
• Positive (right) skewness indicates a longer right tail (extreme large values)

• Kurtosis measures extreme values in either tail.
• The normal curve has zero excess kurtosis (its kurtosis is 3)
• Kurtosis is measured by comparison with
the normal curve
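
Skewness and kurtosis can be sketched from the central moments of the data. The function below uses the simple moment definitions (dividing by n); spreadsheet functions such as Excel's SKEW and KURT apply sample-size adjustments, so their values differ slightly. The function name and both data sets are hypothetical.

```python
def moment_shape(xs):
    """Return (skewness, excess kurtosis) using moment definitions."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in xs) / n   # third central moment
    m4 = sum((x - mean) ** 4 for x in xs) / n   # fourth central moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3          # 0 for a normal curve
    return skewness, excess_kurtosis

# Hypothetical data sets for illustration.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

skew_sym, kurt_sym = moment_shape(symmetric)
skew_right, kurt_right = moment_shape(right_skewed)

print("symmetric:", skew_sym, kurt_sym)
print("right-skewed:", skew_right, kurt_right)
```

The symmetric set gives a skewness of 0, and the set with one large value gives a positive skewness.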

Calculation in EXCEL:
Statistics for lexp(life expectancy) using NHDR2014.xlsx

Calculation in EXCEL:
Statistics for price using auto.xlsx

Use the nhdr2014 data to calculate the following:
1. Average life expectancy (‘life’)
2. Average GDP per capita (‘income’)
3. Average life expectancy (‘life’) of the 3 ecologies (e.g., average life(mountain) = …. )
4. Calculate Q1, Q2, Q3 of ‘income’
5. Using the 3 quartiles of ‘income’ we can divide any other data into 4 equal parts.
Make a new variable, call it ‘groups’, that has 4 value labels according to
the criteria below:
‘poor’ if below Q1
‘below average’ if between Q1 and Q2
‘above average’ if between Q2 and Q3
‘rich’ if above Q3
6. Find the average of ‘life’ & ‘hdi’ for each of the 4 groups of this new variable
7. How many ‘districts’ fall in each of these ‘groups’? And which districts have the
highest & lowest ‘life’ values within each of these 4 ‘groups’?
8. Save this data for your future use
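
Steps 4 to 7 of the exercise can be sketched in Python as an alternative to the spreadsheet. This is only an outline under stated assumptions: the rows below are invented stand-ins (district codes D1 to D8 and their values are not real NHDR figures), and only the column names ‘district’, ‘life’, ‘income’ and ‘groups’ come from the exercise.

```python
import statistics
from collections import defaultdict

# Hypothetical rows standing in for nhdr2014 (invented values).
rows = [
    {"district": "D1", "life": 61.0, "income": 500},
    {"district": "D2", "life": 63.5, "income": 600},
    {"district": "D3", "life": 65.0, "income": 700},
    {"district": "D4", "life": 66.2, "income": 800},
    {"district": "D5", "life": 67.8, "income": 900},
    {"district": "D6", "life": 68.4, "income": 1000},
    {"district": "D7", "life": 70.1, "income": 1100},
    {"district": "D8", "life": 71.3, "income": 1200},
]

# Step 4: the three quartiles of 'income'.
q1, q2, q3 = statistics.quantiles([r["income"] for r in rows], n=4)

def income_group(income):
    """Step 5: label a district by where its income falls."""
    if income < q1:
        return "poor"
    if income < q2:
        return "below average"
    if income < q3:
        return "above average"
    return "rich"

by_group = defaultdict(list)
for r in rows:
    r["groups"] = income_group(r["income"])
    by_group[r["groups"]].append(r)

# Steps 6-7: count, mean 'life', and extreme districts per group.
summary = {}
for name, members in by_group.items():
    summary[name] = {
        "count": len(members),
        "mean_life": statistics.mean([m["life"] for m in members]),
        "lowest_life": min(members, key=lambda m: m["life"])["district"],
        "highest_life": max(members, key=lambda m: m["life"])["district"],
    }

for name, stats in summary.items():
    print(name, stats)
```

Quartile conventions differ between tools, so group boundaries computed in a spreadsheet may not match these exactly.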

Dispersion in data is meaningful

A central value alone can disguise the picture

Variability is the beauty of wild nature
•Geographical variation generates
variety in species of flora and fauna
•Ethnography –cultural diversity
•Epidemiology treats variation in
disease

How to measure data dispersion?
Range
Standard Deviation
Quartile Deviation
Coefficient of variation

1. Range
Range= Largest value – Smallest value
•A high range in temperature contributes to desertification
•Range of mobile sets
•Range of social disparity

2. Quartile deviation (semi-interquartile range)
• Interquartile range = Q3 – Q1
• Quartile deviation (Q.D.):
$Q.D. = \frac{Q_3 - Q_1}{2} = \frac{33.25 - 13.86}{2} \approx 9.69$
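
The range and quartile deviation are simple enough to check directly. In this sketch the quartiles are the two values quoted on the slide (13.86 and 33.25); the list of temperatures used for the range is hypothetical.

```python
# Quartiles quoted on the slide.
q1, q3 = 13.86, 33.25

iqr = q3 - q1    # interquartile range
qd = iqr / 2     # quartile deviation (semi-interquartile range)

# Hypothetical daily temperatures, used only to illustrate the range.
temps = [4, 11, 18, 25, 31]
temp_range = max(temps) - min(temps)   # largest value minus smallest value

print("IQR:", round(iqr, 2))
print("Q.D.:", round(qd, 3))
print("Range:", temp_range)
```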

3. Variance & Standard Deviation
•Most popular measure of variation
•It uses all observations
•Std (standard deviation) is the square root of the variance
•Std = $\sqrt{\text{variance}}$

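Sample vs population variance

For population (individual data):
$s^2 = \frac{\sum (X-\bar{X})^2}{n} = \frac{\sum X^2}{n} - \left(\frac{\sum X}{n}\right)^2$

For population (grouped data), with $N = \sum f$:
$s^2 = \frac{\sum f(X-\bar{X})^2}{N} = \frac{\sum fX^2}{N} - \left(\frac{\sum fX}{N}\right)^2$

For sample:
$S^2 = \frac{\sum (X-\bar{X})^2}{n-1}$

Also,
$S^2 = \frac{n}{n-1}\, s^2 = \frac{n}{n-1}\left[\frac{\sum fX^2}{N} - \left(\frac{\sum fX}{N}\right)^2\right]$

When n → ∞, sample mean → population mean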
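
The relation between the two variances can be checked numerically. This sketch uses Python's standard `statistics` module, where `pvariance` divides by n (the slide's s²) and `variance` divides by n−1 (the slide's S²); the data values are hypothetical.

```python
import statistics

# Hypothetical individual observations.
data = [4, 8, 6, 5, 7]
n = len(data)

pop_var = statistics.pvariance(data)   # s^2: divides by n
samp_var = statistics.variance(data)   # S^2: divides by n - 1

# Shortcut form: sum(X^2)/n - (sum(X)/n)^2
shortcut = sum(x * x for x in data) / n - (sum(data) / n) ** 2

print("population variance:", pop_var)
print("sample variance:", samp_var)
print("n/(n-1) * s^2:", n / (n - 1) * pop_var)
print("shortcut formula:", shortcut)
```

As n grows, the factor n/(n−1) approaches 1, so the two variances converge.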

Example: Variance and std of the life
of electric bulbs (in hours)

Length of life   No. of bulbs (f)   Mid-value (X)   fX       fX²
500–700          5                  600             3000     1800000
700–900          11                 800             8800     7040000
900–1100         26                 1000            26000    26000000
1100–1300        10                 1200            12000    14400000
1300–1500        8                  1400            11200    15680000
SUM              60                                 61000    64920000

Mean = 1016.67
Variance = 48388.89
Std = 219.9747
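
The figures in the bulb example can be reproduced with the grouped-data formula. This sketch takes the mid-values and frequencies from the table and applies the population form of the variance (dividing by N), which is what the slide's result corresponds to.

```python
import math

# Mid-values (X) and frequencies (f) from the bulb table.
mids = [600, 800, 1000, 1200, 1400]
freqs = [5, 11, 26, 10, 8]

N = sum(freqs)
sum_fx = sum(f * x for f, x in zip(freqs, mids))
sum_fx2 = sum(f * x * x for f, x in zip(freqs, mids))

mean = sum_fx / N
variance = sum_fx2 / N - mean ** 2   # population variance for grouped data
std = math.sqrt(variance)

print("N =", N, " sum fX =", sum_fx, " sum fX^2 =", sum_fx2)
print("Mean =", round(mean, 2))
print("Variance =", round(variance, 2))
print("Std =", round(std, 4))
```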

4. Coefficient of Variation(C.V.)
The co-efficient of variation is the relative measure based on the
standard deviation and is defined as the ratio of the standard
deviation to the mean expressed in percent.
C.V. = $\frac{\sigma}{\mu} \times 100\%$
It is used to compare the compactness of two or more data sets.
A smaller C.V. indicates more consistent (less variable) data.
C.V. is unit-less, so data in the same or different units can be compared
with it, e.g., weights in kg and in pounds.

Which type of electric bulb has better consistency in life span?

Length of life   fa (alpha)   fb (beta)   Mid-value (X)   X·fa     X·fb     X²·fa      X²·fb
500–700          5            4           600             3000     2400     1800000    1440000
700–900          11           30          800             8800     24000    7040000    19200000
900–1100         26           12          1000            26000    12000    26000000   12000000
1100–1300        10           8           1200            12000    9600     14400000   11520000
1300–1500        8            6           1400            11200    8400     15680000   11760000
SUM              60           60                          61000    56400    64920000   55920000

mean(a) = 1016.7    mean(b) = 940.0
std(a) = 220.0      std(b) = 220.0
CV(a) = 21.64%      CV(b) = 23.4%

Alpha has the smaller C.V., so its life span is the more consistent of the two.
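
The comparison of the two bulb types can be reproduced in a few lines. This sketch reuses the mid-values and frequencies from the table; the helper name `grouped_cv` is hypothetical, and the population form of the standard deviation is assumed because it matches the slide's figures.

```python
import math

def grouped_cv(mids, freqs):
    """Return (mean, std, CV in percent) for grouped data."""
    n = sum(freqs)
    mean = sum(f * x for f, x in zip(freqs, mids)) / n
    variance = sum(f * x * x for f, x in zip(freqs, mids)) / n - mean ** 2
    std = math.sqrt(variance)
    return mean, std, std / mean * 100

mids = [600, 800, 1000, 1200, 1400]
freq_alpha = [5, 11, 26, 10, 8]
freq_beta = [4, 30, 12, 8, 6]

mean_a, std_a, cv_a = grouped_cv(mids, freq_alpha)
mean_b, std_b, cv_b = grouped_cv(mids, freq_beta)

print(f"alpha: mean={mean_a:.1f} std={std_a:.1f} CV={cv_a:.2f}%")
print(f"beta:  mean={mean_b:.1f} std={std_b:.1f} CV={cv_b:.2f}%")

# The type with the smaller CV is the more consistent one.
more_consistent = "alpha" if cv_a < cv_b else "beta"
print("more consistent:", more_consistent)
```

Both types have nearly the same standard deviation, so the difference in C.V. comes from their different means.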

Hans Rosling
(27 July 1948 – 7 February 2017)
Among the most admired TED talks
Swedish epidemiologist with strong data-exploration skills
Gapminder Foundation
2014: second time in Nepal, from UNESCO
How not to be ignorant / The Joy of Statistics
(first 5 minutes of the 1-hour video)
http://www.gapminder.org/videos/the-joy-of-stats/
