This session will help you understand the basic concepts of statistics, such as data types, levels of measurement, central tendency, dispersion, graphs, univariate analysis, bivariate analysis and more. It will also help you select appropriate summary statistics and charts for your data.
2. SESSION FLOW
What is Statistics?
Population & Sample
What is Data?
Types of Data
Levels of Measurement
Summary Statistics
Types of Charts
Presentation of data
Univariate Analysis
Bivariate Analysis
3. Statistics
Statistics is the science concerned with developing and
studying methods for collecting, analysing, interpreting
and presenting data.
4. Population is the entire group that you
want to draw conclusions about.
Sample is a subset of a population that
contains characteristics of that
population.
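The population and sample idea can be sketched in a few lines of Python. This is only an illustration: it treats the 16 customer IDs from the example table on the Data slide as the whole population, and the sample size of 5 and the seed value are arbitrary assumptions.

```python
import random

# Population: every customer ID in the example table (10001 to 10016).
population = list(range(10001, 10017))

# Fixed seed (an arbitrary, assumed value) so the sketch is reproducible.
rng = random.Random(42)

# Sample: 5 customers drawn at random, without replacement.
sample = rng.sample(population, k=5)

print("Population size:", len(population))
print("Sample:", sample)
```

Every element of the sample belongs to the population, and no customer is drawn twice.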
6. What is Data?
Data is a collection of facts or information from which
conclusions may be drawn.
7. Data
Laal Singh Chaddha (Aamir Khan) is that passenger on your train who has a lot of
stories to tell, even if you don’t want to be part of it. That’s how the story starts by
Laal making the viewers the co-passengers on a train to Chandigarh and starting to
narrate his journey from a dim-witted guy wearing leg-braces to the front-page
celebrity of a famous magazine. Laal grows up with just one person Rupa (Kareena
Kapoor Khan) who actually gets him after his mother (Mona Singh).
Cust ID | Gender | Age | Region | Source    | Payment          | Product      | Amount | Time Of Day
10001   | Male   | 38  | East   | TV advt   | Credit Card      | Books        | 617    | 22:19
10002   | Female | 25  | West   | Email     | Paypal           | Clothing     | 3083   | 13:27
10003   | Male   | 24  | North  | Email     | Net Banking      | Grocery      | 1762   | 14:27
10004   | Male   | 33  | West   | Email     | Paypal           | Home Kitchen | 2248   | 15:38
10005   | Male   | 21  | South  | TV advt   | Cash On Delivery | Grocery      | 1299   | 15:21
10006   | Male   | 28  | West   | Web       | Paypal           | Mobile       | 13041  | 13:11
10007   | Male   | 20  | East   | Email     | Paypal           | Mobile       | 14455  | 21:59
10008   | Female | 20  | West   | TV advt   | Credit Card      | Home Kitchen | 13090  | 04:04
10009   | Female | 38  | West   | TV advt   | Cash On Delivery | Grocery      | 16322  | 19:35
10010   | Male   | 26  | South  | Newspaper | Credit Card      | Grocery      | 11716  | 13:26
10011   | Female | 27  | South  | Newspaper | Paypal           | Home Kitchen | 18176  | 14:17
10012   | Male   | 45  | East   | Newspaper | Credit Card      | Books        | 15505  | 01:01
10013   | Male   | 58  | North  | Email     | Cash On Delivery | Books        | 21649  | 10:04
10014   | Male   | 49  | East   | Email     | Debit Card       | Home Kitchen | 18227  | 09:09
10015   | Female | 29  | West   | Email     | Net Banking      | Clothing     | 10971  | 05:05
10016   | Male   | 19  | West   | TV advt   | Credit Card      | Clothing     | 12956  | 20:29
8. Types of Data
Qualitative or Attribute data - the characteristic being
studied is nonnumeric.
E.g.: Gender, religious affiliation, state of birth, condition of
patient, words, images, videos.
Quantitative data - the characteristic being studied is
numeric.
E.g.: time (in seconds) for a 400 m race, age of a corona patient,
no. of WBC in a blood sample.
9. Quantitative Data
Discrete variables can only assume certain values.
E.g.: no. of pregnancies, no. of missing teeth in children of a
school, no. of visits made by a doctor, the number of goals
in a football match, the number of wickets taken by a bowler in
a cricket match.
Continuous variables can assume any value within a specified
range.
E.g.: the height of an athlete or the weight of a boxer, skull
circumference, diastolic blood pressure, serum cholesterol.
12. Nominal-Level Data
Properties:
• Observations of a qualitative variable can only
be classified and counted.
• There is no particular order to the labels.
E.g. Blood group, Marital status, Eye colour,
Gender, Religion, Favourite beverage, Group membership
13. Ordinal-Level Data
Properties:
• Data classifications are represented by sets of
labels or names (high, medium, low) that have
relative values.
• Because of the relative values, the data
classified can be ranked or ordered.
E.g. Stage of disease, Severity of pain, level of
satisfaction, Likert scale
14. Interval-Level Data
Properties:
• Data classifications are ordered according to
the amount of the characteristic they possess.
• Equal differences in the characteristic are
represented by equal differences in the
measurements.
E.g. Temperature (in °C or °F), SAT score, Shoe size, Dress
size, distance from landmark, geographical
coordinates (longitudes, latitudes)
15. Ratio-Level Data
Properties:
• Data classifications are ordered according to the amount of the
characteristics they possess.
• Equal differences in the characteristic are represented by equal
differences in the numbers assigned to the classifications.
• The zero point is the absence of the characteristic and the ratio
between two numbers is meaningful.
E.g. Head circumference, Time until death, Height, Weight,
Kelvin temperature
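The difference between interval-level and ratio-level data can be checked with a small temperature example. The sketch below is only illustrative, and the two readings (10 °C and 20 °C) are made-up values: differences are meaningful on both scales, but a ratio is only meaningful on the Kelvin scale, where zero means absence of the characteristic.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius reading to Kelvin (a scale with a true zero)."""
    return celsius + 273.15

# Two hypothetical temperature readings in degrees Celsius.
c_low, c_high = 10.0, 20.0

# Differences agree on both scales (interval property).
diff_celsius = c_high - c_low
diff_kelvin = celsius_to_kelvin(c_high) - celsius_to_kelvin(c_low)

# The Celsius ratio suggests "twice as hot", which is not meaningful.
naive_ratio = c_high / c_low

# The Kelvin ratio is the meaningful one (ratio property).
kelvin_ratio = celsius_to_kelvin(c_high) / celsius_to_kelvin(c_low)

print("Difference (C):", diff_celsius, " Difference (K):", round(diff_kelvin, 2))
print("Celsius ratio:", naive_ratio, " Kelvin ratio:", round(kelvin_ratio, 4))
```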
23. Pie Chart
The pie (circle) represents 100% of the variable and is divided into sectors.
The area of each sector is proportional to the frequency of the category
it represents.
24. Bar Chart
Bar charts are commonly
used to represent
categorical variables.
The bars can be vertical
or horizontal and can
show the frequency or
the percentage of each
category.
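Both the pie chart and the bar chart start from the same frequency table. As a rough sketch, the Region column of the example customer table can be turned into frequencies, percentages (bar heights) and sector angles (pie slices). The variable names are illustrative, not part of the original slides.

```python
from collections import Counter

# Region column of the 16 customers in the example table, in row order.
regions = [
    "East", "West", "North", "West", "South", "West", "East", "West",
    "West", "South", "South", "East", "North", "East", "West", "West",
]

freq = Counter(regions)      # frequency of each category
total = sum(freq.values())   # number of observations

# For each category: (frequency, percentage, pie-sector angle in degrees).
table = {
    category: (count, 100 * count / total, 360 * count / total)
    for category, count in freq.items()
}

for category, (count, percent, angle) in table.items():
    print(f"{category:<6} count={count} percent={percent:.2f} angle={angle:.1f}")
```

The percentages add up to 100 and the angles add up to 360, which is why the pie represents the whole variable.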
25. Histogram
It is similar to the bar chart, but
there are no gaps between the
bars as the variable is continuous.
The width of each bar of the
histogram relates to a range of
values for the variable, but in
most cases, the width is kept the
same.
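A histogram is built by counting how many values fall into each range (bin). The sketch below does this for the Age column of the example customer table, assuming equal-width bins of 10 years starting at 10. The bin width and starting point are arbitrary choices, and `histogram_counts` is a hypothetical helper name.

```python
def histogram_counts(values, start, width, n_bins):
    """Count values in n_bins equal-width bins: [start, start + width), ..."""
    counts = [0] * n_bins
    for v in values:
        idx = int((v - start) // width)
        if 0 <= idx < n_bins:   # ignore values outside the covered range
            counts[idx] += 1
    return counts

# Age column of the example customer table.
ages = [38, 25, 24, 33, 21, 28, 20, 20, 38, 26, 27, 45, 58, 49, 29, 19]

counts = histogram_counts(ages, start=10, width=10, n_bins=5)

# Print a simple text histogram, one row per bin.
for i, count in enumerate(counts):
    low = 10 + 10 * i
    print(f"{low}-{low + 9}: {'#' * count} ({count})")
```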
26. Scatter Diagram
If we have two variables that are
numerical, the relationship between
them can be illustrated using a scatter
diagram.
It plots one variable against the other in
a two-way diagram. One variable is
represented on the horizontal axis and
the other is plotted on the vertical axis
with each dot representing one case.
27. Box-Whisker Plot
The boxplot (also called Box and Whisker plot) is used to summarize numerical
variables based on the five-number summary.
Those five numbers are minimum, maximum, median, upper quartile, and lower
quartile.
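The five-number summary behind a boxplot can be computed with Python's standard `statistics` module. This is a sketch on the Age column of the example customer table. Note that several quartile conventions exist, so other software may report slightly different quartiles than `statistics.quantiles` does with its default method.

```python
import statistics

# Age column of the example customer table.
ages = [38, 25, 24, 33, 21, 28, 20, 20, 38, 26, 27, 45, 58, 49, 29, 19]

# quantiles(..., n=4) returns the three cut points that split the data
# into four parts: lower quartile, median, upper quartile.
lower_q, _, upper_q = statistics.quantiles(ages, n=4)

five_number_summary = (
    min(ages),                # minimum
    lower_q,                  # lower quartile
    statistics.median(ages),  # median
    upper_q,                  # upper quartile
    max(ages),                # maximum
)

print(five_number_summary)
```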
28. Which Chart?
FIRST VARIABLE | ONLY ONE VARIABLE | SECOND VARIABLE: SCALE | SECOND VARIABLE: CATEGORICAL
SCALE          | HISTOGRAM         | SCATTER PLOT           | BOX-PLOT
CATEGORICAL    | PIE / BAR         | BOX-PLOT               | MULTIPLE / STACKED BAR
31. Univariate Analysis
Univariate analysis is a basic kind of analysis technique for
statistical data. Here the data contains just one variable.
The main objective of the univariate analysis is to describe
the data in order to find out the patterns in the data.
Some of the measures in Univariate Analysis:
• Central Tendency
• Dispersion
• Skewness
• Kurtosis
32. Central Tendency
The Mean of a variable
can be computed as the
sum of the observed
values divided by the
number of observations.
The Median is the point
at the centre of the
ordered data, where half
of the values are above
it and half are below it.
The Mode is the most
frequently occurring
value in the dataset.
Measures that indicate the approximate centre of the data are called
Measures of Central Tendency.
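As a quick illustration, the three measures can be computed for the Age column of the example customer table with Python's standard `statistics` module. `multimode` is used because this column has more than one most frequent value.

```python
import statistics

# Age column of the example customer table.
ages = [38, 25, 24, 33, 21, 28, 20, 20, 38, 26, 27, 45, 58, 49, 29, 19]

mean_age = statistics.mean(ages)      # sum of values / number of values
median_age = statistics.median(ages)  # middle of the ordered data
modes = statistics.multimode(ages)    # all most frequent values

print("Mean:", mean_age)      # 31.25
print("Median:", median_age)  # 27.5
print("Mode(s):", sorted(modes))  # [20, 38]
```

Here the mean (31.25) lies above the median (27.5) because a few older customers pull the average upwards.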
33. Dispersion
The Range is simply the
difference between the
largest and smallest values.
The Inter-Quartile Range is
simply the difference
between the upper quartile
and the lower quartile.
The Variance is the average
of the squared deviations
from the mean.
The Standard Deviation is
calculated as the square
root of the variance.
Measures that describe the spread of the data from central tendency are
Measures of Dispersion.
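The four measures can be sketched on the Age column of the example customer table. Two assumptions are worth flagging: the variance here divides by the number of observations (population variance, `pvariance`), matching the "average of squared deviations" wording, whereas sample variance would divide by n - 1. The quartiles also follow the default convention of `statistics.quantiles`, which may differ slightly from other software.

```python
import statistics

# Age column of the example customer table.
ages = [38, 25, 24, 33, 21, 28, 20, 20, 38, 26, 27, 45, 58, 49, 29, 19]

# Range: largest value minus smallest value.
data_range = max(ages) - min(ages)

# Inter-quartile range: upper quartile minus lower quartile.
lower_q, _, upper_q = statistics.quantiles(ages, n=4)
iqr = upper_q - lower_q

# Variance: average of squared deviations from the mean (population form).
variance = statistics.pvariance(ages)

# Standard deviation: square root of the variance.
std_dev = statistics.pstdev(ages)

print("Range:", data_range)
print("IQR:", iqr)
print("Variance:", variance)
print("Standard deviation:", round(std_dev, 2))
```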
35. Kurtosis
Kurtosis is a statistical measure used to describe the degree to which
observations cluster in the tails or the peak of a frequency distribution.
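One common way to put a number on kurtosis is the moment-based formula: the fourth central moment divided by the squared variance, minus 3 so that a normal distribution scores zero ("excess kurtosis"). The slides do not state a formula, so treat the sketch below as one possible definition rather than the one intended by the author. `excess_kurtosis` is a hypothetical helper name.

```python
def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3

# Evenly spread values have light tails, so the result is negative.
flat = [1, 2, 3, 4, 5]
print(round(excess_kurtosis(flat), 4))

# Age column of the example customer table.
ages = [38, 25, 24, 33, 21, 28, 20, 20, 38, 26, 27, 45, 58, 49, 29, 19]
print(round(excess_kurtosis(ages), 4))
```

Positive values indicate heavier tails than the normal distribution, negative values indicate lighter tails.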
36. Choosing Summary Statistics
Type of Variable            | Central tendency | Dispersion
Scale, normally distributed | Mean             | Standard deviation
Scale, skewed data          | Median           | Interquartile range
Categorical, ordinal        | Median           | Interquartile range
Categorical, nominal        | Mode             | None
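The guidance on this slide can be written as a small lookup. The function name `recommend_summary` and its argument values are hypothetical and chosen only for illustration. The sketch encodes the slide's rule of thumb and nothing more.

```python
def recommend_summary(variable_type, skewed=False):
    """Return (central tendency, dispersion) for a variable type.

    variable_type: "scale", "ordinal" or "nominal" (assumed labels).
    skewed: only relevant for scale variables.
    """
    if variable_type == "scale":
        if skewed:
            return ("median", "interquartile range")
        return ("mean", "standard deviation")
    if variable_type == "ordinal":
        return ("median", "interquartile range")
    if variable_type == "nominal":
        return ("mode", None)  # no measure of dispersion is recommended
    raise ValueError(f"Unknown variable type: {variable_type!r}")

print(recommend_summary("scale"))
print(recommend_summary("scale", skewed=True))
print(recommend_summary("ordinal"))
print(recommend_summary("nominal"))
```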
37. Bivariate Analysis
Bivariate analysis is the analysis of the relationship
between two variables or attributes.
It explores whether the two variables are related, how
strong that relationship is, and what might explain any
differences between them.
Some of the measures in Bivariate Analysis:
• Correlation
• Regression
• Time Series
38. Correlation
Positive Correlation
If the change in the two variables is
in the same direction.
E.g. Temperature and Sales of Ice-cream
Negative Correlation
If the change in the two variables is
in the opposite direction.
E.g. Temperature and Sales of Woollen
clothes
If two variables change simultaneously because of a direct or indirect
cause-and-effect relationship, then there is a correlation between the variables.
39. Correlation Coefficient
Scatter Plot
A scatterplot is a type of
data display that shows
the relationship between
two numerical variables.
Karl Pearson
It measures the linear
association between two
numeric variables.
Correlation coefficient is a statistical measure that indicates the extent to
which two or more variables fluctuate in relation to each other.
Spearman
It measures the association
between the ranks assigned
to the individual items of
two variables, so it captures
any monotonic relationship,
not only a linear one.
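Both coefficients can be sketched in plain Python. Spearman's coefficient is computed here as Pearson's coefficient applied to the ranks, with tied values sharing their average rank. The temperature and ice-cream sales figures are made-up numbers used only to show a relationship that is increasing but not linear.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation: linear association between two numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Rank values from 1 upwards; tied values get their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data: sales rise with temperature, but not in a straight line.
temperature = [20, 25, 30, 35, 40]
ice_cream_sales = [10, 20, 40, 80, 160]

print("Pearson:", round(pearson(temperature, ice_cream_sales), 4))
print("Spearman:", round(spearman(temperature, ice_cream_sales), 4))
```

Because sales always increase with temperature, the rank-based coefficient reaches 1, while Pearson's coefficient stays below 1 since the points do not lie on a straight line.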
40. Regression
If this functional relationship is linear
in nature, it is called Linear Regression.
The regression line is given as
$y = a + b_{yx}\,x$
where $a$ is the intercept and $b_{yx}$ is the
regression coefficient, which measures the change
in the dependent variable $y$ for a unit change in
the independent variable $x$.
Regression is the functional relationship between two or more variables, such
that we can estimate the value of the dependent variable for given value(s) of
the independent variable(s).
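A regression line of the form y = a + b·x can be estimated with the least-squares method, a standard technique that the slides do not name explicitly. In the sketch below the slope is the sum of cross-deviations divided by the sum of squared deviations of x, and the intercept makes the line pass through the point of means. `fit_line` and the data are hypothetical.

```python
def fit_line(x, y):
    """Least-squares estimates (a, b) for the line y = a + b * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx             # regression coefficient (slope)
    a = mean_y - b * mean_x   # intercept
    return a, b

# Hypothetical data lying exactly on the line y = 1 + 2x.
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]

a, b = fit_line(x, y)
print("Intercept a:", a, " Slope b:", b)

# Estimate the dependent variable for a new value of the independent variable.
x_new = 6
print("Estimated y at x = 6:", a + b * x_new)
```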
41. Time Series
A time series is a time-ordered sequence of observations taken at regular intervals (e.g.
hourly, daily, weekly, monthly, quarterly, annually).
Examples of Time Series
• Daily: Stock price, temperature
• Weekly: Retail sales of a department store
• Monthly: Unemployment rate, consumer price index
• Quarterly: GDP of a country
• Yearly: Production of crops
42. Multivariate Analysis
Multivariate analysis is the analysis of the
relationships among more than two variables or
attributes.
Some of the measures in Multivariate Analysis:
• Multiple Correlation
• Multiple Regression
• Discriminant Analysis
• ANOVA
• Structural Equation Modelling
44. THANK YOU
Dr Parag Shah | M.Sc., M.Phil., Ph.D. ( Statistics)
pbshah@hlcollege.edu
www.paragstatistics.wordpress.com