Lekcija 1 - Uvod.pdf

Statistical Analysis 2021/22
Lecture 1
About the Course
Introduction to Statistics & Data

What You Need to Know Before Your
Course Begins
• We begin on time!
• So please, be there on time!
• Tuesday, 08:30 – 11:00
• We take a break after 90 minutes at most (usually after 75 minutes), so hold off
on food or other addictions until the break!
• Any special treatment / status you need (early exams, longer absences…)
• Send an email or come to my office before October 21 and we’ll try to
make some arrangements.
• After that, the rules are set for all!

Course Outline
1) Data management
2) Visualization
3) Descriptive statistics
4) Statistical distributions
5) Confidence intervals
6) Hypothesis testing of one population
7) Comparing two populations
8) Non-parametric testing
9) Simple linear and multiple regression
10) Regression model building
11) Time series.

Teaching (Lectures + Exercises +
Assignments)
• Offline, IRL, synchronous
• Lectures: every Tuesday (08:30 – 11:00), M. Pahor
• they will be recorded, the recording posted on
Canvas
• Exercises: every Friday (09:00 – 11:00), M. Pahor
• “computer lab”
• exercises, examples, Google Sheets and R Studio
Cloud, etc.
• Google Sheets and R Studio Cloud
• You’ll need a Google account and an R Studio Cloud account
(both free)

Teaching (Lectures + Exercises +
Assignments)
• Online, Canvas, asynchronous
• DIY Exercises, K. Dvorski
• You will have recap quizzes with explanations for
every lecture available at all times for you to practice
and prepare for the exams successfully
• Extra assignments (10% of your final grade), K. Dvorski
• You will have to solve and submit a number of
assignments by a certain date over the course of
the semester (no make-up tests and no deadline
extensions)
• The details will be announced in a timely manner
and you will have enough time and the TA’s support
for every assignment.

Teaching (Mid-Terms)
• Canvas platform
• On-line tests (midterms) are a part of the grade (30%) → There will be
no re-do tests!
• Three (3) on-line tests will be given on the dates and times specified.
• Each mid-term makes up 10% of your final grade
• Mid-terms are here to help you learn and pass the course! Please, don’t stress!
• DIY Exercises, assignments, exercises and lectures help you work
towards your grade.
• Don’t worry: if you take an active part in the course, you will have no
problem in passing the mid-terms and the final exam!

Exam & Grading
• Course grading: Your course grade will be based on the following
components:
• Graded tests (midterms) on Canvas (30 percent)
• Bonus assignments during semester on Canvas (10 percent)
• Final exam (60 percent)
• You must answer more than 50% of questions correctly in order to pass the
final exam (e.g., 25 MCQs; 13 correct answers!)
• The final exam might be on-line
• All the important details on exams and assignments will be announced
shortly!

Time & Space
• IRL at SEB:
• Lectures: Tuesdays, 8:30-11:00
• Exercises: Friday, 09:00 – 10:30
• Instructor: Prof. Marko Pahor
• email: marko.pahor@ef.uni-lj.si
• Office hours
• Wednesdays, 2 p.m. on Zoom (check link)
• Assistant: Ms. Katarina Dvorski
• will try to answer your questions and provide support within 24
hours from receiving your email.
• email: katdvorski@gmail.com

Learning Objectives
After this lecture you should be able to:
LO 1: Describe the importance of statistics.
LO 2: Differentiate between descriptive statistics and
inferential statistics.
LO 3: Explain the various data types
LO 4: Describe variables and types of measurement
scales.

Statistics or Sadistics?
• “There are three types of lies -- lies, damn lies, and statistics.”
Benjamin Disraeli
• “A single death is a tragedy; a million deaths is a statistic.”
Joseph Stalin
• “Most people use statistics like a drunk man uses a lamppost; more for
support than illumination”
Andrew Lang

What is Statistics?
Statistics
• the branch of mathematics that examines ways to process and analyse
data. It provides procedures to collect and transform data in ways that are
useful to business decision makers.
• is the science of data. This involves collecting, classifying, summarizing,
organizing, analysing, presenting, and interpreting numerical data.
• In the broadest sense, we may define the study of statistics as the
methodology of extracting useful information from a data set.
Statistic
• is a numerical measure that describes a characteristic of a sample.
Statistical Analysis
• is used to manipulate, summarize, and investigate data for the purpose of
useful decision-making.

Statistics & The Real World
• Insurance companies use data on homeowners, drivers and many
more to define the insurance premium.
• Accountants use sample data concerning a company’s actual
sales revenues to assess whether the company’s claimed sales
revenues are valid.
• Marketing experts help businesses decide which products to
develop and offer by using data that reveal consumer preferences.
• Politicians focus on public opinion polls to formulate legislation and
to create campaign strategies.
• Scientists use data on the effectiveness of drugs and vaccines to
improve our health and advance knowledge.

…
• Finance – correlation and regression, index numbers, time series
analysis,…
• Marketing – hypothesis testing, chi-square tests, nonparametric
statistics,…
• HRM – hypothesis testing, chi-square tests, nonparametric tests,…
• Operations Management – hypothesis testing, estimation, analysis
of variance, time series analysis,…

Types (Branches) of Statistics
• Descriptive statistics
• utilizes numerical and graphical methods to look for patterns in a data
set, to summarize the information revealed in a data set, and to
present that information in a convenient form;
• focuses on collecting, summarizing and presenting a set of data.
• Inferential statistics
• utilizes sample data to make estimates, decisions, predictions, or other
generalizations about a larger set of data;
• uses sample data to draw conclusions about a population.

…
Descriptive Statistics
• describe, organize, summarize
• common terminology: mode, median, mean, averages
• data presentation (tables, graphs)
• standard tools: central tendency, dispersion, skewness
Inferential Statistics
• generalize findings from samples to populations
• estimation, predictions, assessing relationships between variables
• common terminology: margin of error, statistically significant
• standard tools: hypothesis testing, confidence intervals, regression analysis
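
The descriptive tools named above (central tendency and dispersion) can be tried on a small data set. The following is a minimal sketch in Python using only the standard library; the scores are hypothetical values invented for illustration and do not come from the course.

```python
import statistics

# Hypothetical sample: exam scores of ten students (illustrative values only).
scores = [62, 71, 71, 75, 78, 80, 84, 88, 90, 95]

# Central tendency
mean_score = statistics.mean(scores)      # arithmetic mean
median_score = statistics.median(scores)  # middle value of the sorted data
mode_score = statistics.mode(scores)      # most frequent value

# Dispersion
std_dev = statistics.stdev(scores)        # sample standard deviation
value_range = max(scores) - min(scores)   # largest minus smallest value

print("mean:", mean_score)
print("median:", median_score)
print("mode:", mode_score)
print("standard deviation:", round(std_dev, 2))
print("range:", value_range)
```

All of these numbers only describe the ten scores at hand; nothing is claimed about any larger group of students, which is what separates description from inference.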

…
Four Elements of Descriptive Statistical
Problems
1. The population or sample of interest
2. One or more variables (characteristics
of the population or sample units) that
are to be investigated
3. Tables, graphs, or numerical summary
tools
4. Identification of patterns in the data
Five Elements of Inferential Statistical
Problems
1. The population of interest
2. One or more variables (characteristics
of the population units) that are to be
investigated
3. The sample of population units
4. The inference about the population
based on information contained in the
sample
5. A measure of the reliability of the
inference

Statistical Inference
• Estimation
• e.g., estimate the population mean
weight using the sample mean weight
• Hypothesis testing
• e.g., test the claim that the population
mean weight is 70 kg
• Statistical inference
• process of drawing conclusions or
making decisions about a population
based on sample results (data).
• making the inference is only part of the
story. We need to know its reliability –
how good the inference is.
Statistical Inference: an estimate, prediction, or some
other generalization about a population based on
information contained in a sample.
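
Estimation and hypothesis testing for the weight example can be sketched as follows. This is a rough illustration only: the weights are hypothetical, and the normal approximation is used as a simplifying assumption so that the sketch needs only the Python standard library (for a sample this small, a t distribution would normally be preferred).

```python
import math
import statistics
from statistics import NormalDist

# Hypothetical sample of body weights in kg (illustrative values only).
weights = [68.2, 72.5, 70.1, 74.3, 69.8, 71.6, 73.0, 67.9, 72.2, 70.4,
           75.1, 69.5, 71.9, 73.4, 70.8, 72.7, 68.8, 74.0, 71.2, 70.6]

n = len(weights)
sample_mean = statistics.mean(weights)                 # point estimate of the population mean
std_error = statistics.stdev(weights) / math.sqrt(n)   # estimated standard error of the mean

# Estimation: approximate 95% confidence interval (normal approximation, an assumption).
z_crit = NormalDist().inv_cdf(0.975)
ci_low = sample_mean - z_crit * std_error
ci_high = sample_mean + z_crit * std_error

# Hypothesis testing: the claim (null hypothesis) is that the population mean is 70 kg.
claimed_mean = 70.0
z_stat = (sample_mean - claimed_mean) / std_error
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))      # two-sided p-value

print(f"sample mean = {sample_mean:.2f} kg")
print(f"approx. 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
print(f"z = {z_stat:.2f}, two-sided p-value = {p_value:.4f}")
```

The confidence interval and the p-value are the "measure of reliability" part of the story: they indicate how much trust the inference deserves, not just what the estimate is.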

Population – Sample – Variable
Population: all items of interest in a statistical problem, all elements under study.
Sample: subset of population (should be representative of the population)
Unit of observation: an entity about which information is collected
Data set: all the data collected in a particular study
Elements: individual entities of a data set (individuals)
Observation: the set of measurements obtained for a particular element
Individuals: every individual element in the population
Variables: characteristics of interest for the elements (individuals)

Population & Sample
• Population
• the entire set of individuals or objects of interest or the
measurements obtained from all individuals or objects of
interest;
• A population parameter is a number calculated using the
population measurements that describes some aspect of the
population. That is, a population parameter is a descriptive
measure of the population.
• Sample
• a portion, or part, of the population of interest; it should be
representative of the population;
• A sample statistic is a number calculated using the sample
measurements that describes some aspect of the sample.
That is, a sample statistic is a descriptive measure of the
sample.
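
The difference between a population parameter and a sample statistic can be illustrated with a small simulation. This is a sketch under stated assumptions: the population is artificial (normally distributed spending figures generated for illustration), and the seed is fixed only to make the run repeatable.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch gives the same result on every run

# Hypothetical population: monthly spending (EUR) of 10,000 customers.
population = [random.gauss(500, 120) for _ in range(10_000)]
population_mean = statistics.mean(population)   # population parameter

# Simple random sample of 100 customers drawn without replacement.
sample = random.sample(population, 100)
sample_mean = statistics.mean(sample)           # sample statistic

print(f"population mean (parameter): {population_mean:.2f}")
print(f"sample mean (statistic):     {sample_mean:.2f}")
print(f"difference:                  {sample_mean - population_mean:.2f}")
```

In practice the parameter is usually unknown; the simulation only makes it visible so that the statistic can be compared against it.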

Variable(s)
Independent Variable
• also known as the explanatory or predictor variable
• it explains variations in the response variable
• in a study, it is manipulated by the researcher.
Dependent Variable
• also known as the response or outcome variable
• its value is predicted or its variation is explained by the explanatory
variable
• in a study, this is the outcome that is measured following
manipulation of the explanatory variable.
Confounding Variable
• a variable, other than the independent variable of interest, that may
affect the dependent variable.
• an unforeseen or unaccounted-for factor that may call into question
the finding of a relationship between two other factors or variables.

Inferential Statistics: The Need for
Sampling
Resource constraints (e.g., time, money)
• obtaining information on the entire population is expensive / time-
consuming
• the monthly unemployment rate in the U.S. is calculated by the Bureau of
Labor Statistics (BLS): Is it reasonable to assume that the BLS counts
every unemployed person each month?
It is impossible to examine every member of the population
• Suppose we are interested in the average length of life of a VARTA AAA
battery. If we tested the duration of each VARTA AAA battery, then in the
end, all batteries would be dead and the answer to the initial question
would be useless.

Recap: Example 1
• „Cola wars“ is the popular term for the intense competition between Coca-
Cola and Pepsi displayed in their marketing campaigns. Their campaigns
have featured claims of consumer preference based on taste tests.
• In 2013, the Huffington Post conducted a blind taste test of 9 cola brands
that included Coca-Cola and Pepsi. (Pepsi finished 1st and Coke finished
5th).
• Suppose, as part of a Pepsi marketing campaign, 1,000 cola consumers
are given a blind taste test (i.e., a taste test in which the two brand names
are disguised). Each consumer is asked to state a preference for brand A
or brand B.
a) Describe the population.
b) Describe the variable of interest.
c) Describe the sample.
d) Describe the inference.
Source: McClave, J. T., Benson, G. P., & Sincich, T. (2017). Statistics for Business and Economics. Harlow, UK: Pearson
Education Limited, p. 32.

…
a) Because we are interested in the responses of cola consumers in a taste
test, a cola consumer (individual) is the experimental unit. Thus, the
population of interest is the collection or set of all cola consumers.
b) The characteristic that Pepsi wants to measure is the consumer’s cola
preference as revealed under the conditions of a blind taste test, so cola
preference is the variable of interest.
c) The sample is the 1,000 cola consumers selected from the population of all
cola consumers.
d) The inference of interest is the generalization of the cola preferences of the
1,000 sampled consumers to the population of all cola consumers. The
preferences of the consumers in the sample can be used to estimate the
percentage of all cola consumers who prefer each brand.
Source: McClave, J. T., Benson, G. P., & Sincich, T. (2017). Statistics for Business and Economics. Harlow, UK: Pearson
Education Limited, p. 32.
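
The inference in part d) can be made concrete with a short calculation. The sketch below assumes a hypothetical result (560 of the 1,000 consumers prefer brand A); this count is invented for illustration and is not reported in the source. The margin of error uses the usual normal approximation for a proportion.

```python
import math

n = 1000          # sample size from the example
prefers_a = 560   # hypothetical count of consumers preferring brand A (assumption)

p_hat = prefers_a / n                           # sample proportion (the estimate)
std_error = math.sqrt(p_hat * (1 - p_hat) / n)  # estimated standard error of the proportion
margin = 1.96 * std_error                       # approximate 95% margin of error

print(f"estimated share preferring brand A: {p_hat:.1%}")
print(f"approximate 95% margin of error:    +/- {margin:.1%}")
```

The margin of error plays the role of the measure of reliability: it states how far the sample percentage might plausibly be from the percentage among all cola consumers.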

Recap: Example 2
• Let’s say you want to find out how alcohol consumption affects mortality.
You decide to compare the mortality rates between two groups – one
consisting of heavy users of alcohol, one consisting of teetotallers (people
who never drink alcohol).
• What would be your independent and dependent variables in this case?
• If you find that people who consume more alcohol are more likely to die, it
might seem intuitive to conclude that alcohol use increases the risk of
death. In reality, however, the situation might be more complex. Is it
possible that alcohol use is not the only mortality-affecting factor that differs
between the two groups?
• What would be possible confounding variables?

…
• alcohol consumption is the independent variable
• mortality is the dependent variable
• age, sex, ethnicity, diet, BMI are possible confounding variables
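
How a confounding variable can create a misleading comparison is easiest to see with numbers. The counts below are entirely hypothetical and were chosen so that age is the only thing driving mortality; they are a teaching sketch, not real data.

```python
# Hypothetical counts as (people, deaths) by drinking group and age band.
counts = {
    "heavy drinkers": {"older": (80, 16), "younger": (20, 1)},
    "teetotallers":   {"older": (20, 4),  "younger": (80, 4)},
}

def crude_rate(group):
    """Mortality rate of a group, ignoring age."""
    people = sum(p for p, _ in counts[group].values())
    deaths = sum(d for _, d in counts[group].values())
    return deaths / people

def stratum_rate(group, age_band):
    """Mortality rate of a group within one age band."""
    people, deaths = counts[group][age_band]
    return deaths / people

for group in counts:
    print(f"{group}: crude rate = {crude_rate(group):.0%}")
    for age_band in counts[group]:
        print(f"  {age_band}: {stratum_rate(group, age_band):.0%}")
```

Ignoring age, heavy drinkers appear to die at about twice the rate of teetotallers. Within each age band the rates are identical, so in this made-up data the whole difference comes from the drinkers being older, not from alcohol.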

WHAT ARE DATA?
WHERE DO WE GET DATA AND HOW?

Data: A Quick Intro
• Let’s say you are interested in the state of the COVID pandemic…
• What data will you look for?
• Where will you look for this data?

Data: A Definition
Data
• facts, opinions, and figures from which conclusions can be drawn
• typically in numerical form
• data becomes information when it informs the decision making of the user
• DIKW pyramid
• Data is not information, information is not knowledge, knowledge is not
understanding, understanding is not wisdom.
Clifford Stoll

Basic Concepts of Data
• variables are characteristics of items or individuals
• data are the observed values of variables
• all variables should have an operational definition – a universally accepted
meaning that is clear to all associated with an analysis
• e.g., ‘country of birth of person’, which is the country identified as being
the one in which the person was born.
• statistical techniques are processes that convert data into information
• univariate data sets (one variable) ↔ univariate techniques
• bivariate data sets (two variables) ↔ bivariate techniques
• multivariate data sets (more than two variables) ↔ multivariate techniques
Source: Berenson, M. L. et al. (2019). Basic Business Statistics: Concepts and Applications.
Melbourne, AU: Pearson Australia, p. 6.

…
• Cross-sectional data contain values of a characteristic of many subjects at
the same point or approximately the same point in time.
• variations of ice cream flavours at a particular store, heart rates of 100 patients at
the beginning of the same procedure
• Time series data contain values of a characteristic of a subject over time.
• e.g., a patient’s ECG heart data, daily closing prices over one year for a single
financial security
• Panel data (longitudinal data) contain observations about different cross
sections across time.
• examples of groups that may make up panel data series include countries, firms,
individuals, or demographic groups.
• e.g., GDP per capita for Sub-Saharan African (SSA) countries over the period
1960 – 2020 and stock prices of listed companies in Slovenia over the period
2000 – 2020.
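
The relationship between the three data structures can be sketched in a few lines. The country labels and GDP figures below are made up for illustration; the point is only that a panel contains both a cross-section (one year, many subjects) and a time series (one subject, many years).

```python
# Hypothetical panel: GDP per capita (invented values) for three countries over three years.
panel = [
    {"country": "A", "year": 2018, "gdp_per_capita": 1500},
    {"country": "A", "year": 2019, "gdp_per_capita": 1560},
    {"country": "A", "year": 2020, "gdp_per_capita": 1490},
    {"country": "B", "year": 2018, "gdp_per_capita": 2300},
    {"country": "B", "year": 2019, "gdp_per_capita": 2380},
    {"country": "B", "year": 2020, "gdp_per_capita": 2250},
    {"country": "C", "year": 2018, "gdp_per_capita": 900},
    {"country": "C", "year": 2019, "gdp_per_capita": 940},
    {"country": "C", "year": 2020, "gdp_per_capita": 910},
]

# Cross-sectional data: many subjects at one point in time.
cross_section_2019 = [row for row in panel if row["year"] == 2019]

# Time series data: one subject followed over time.
time_series_a = [row for row in panel if row["country"] == "A"]

print("cross-section 2019:", [(r["country"], r["gdp_per_capita"]) for r in cross_section_2019])
print("time series for A: ", [(r["year"], r["gdp_per_capita"]) for r in time_series_a])
```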

…
• Structured data generally refers to data that has a well-defined length and
format.
• data reside in a predefined row-column format
• fits nicely into a spreadsheet, relational database
• numbers, dates, and groups of words and numbers called strings
• Unstructured data (unmodeled data) do not conform to a predefined row-
column format.
• textual (e.g., e-mail or open-ended survey responses), multimedia
contents (e.g., photographs, videos, and audio data)
• social media data, such as those that appear on LinkedIn, Twitter,
YouTube are examples of unstructured data
• AI is changing the landscape of unstructured data

…
• The term big data is used to describe a massive volume of both structured
and unstructured data that are extremely difficult to manage, process, and
analyze using traditional data processing tools.
• The availability of big data, however, does not necessarily imply complete
(population) data.

…
• outliers: values that appear to be excessively large or small
compared with most values observed.
• missing values: refers to when no data value is stored for one or
more variables in an observation.
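
A quick check for missing values and outliers might look like the sketch below. The readings are hypothetical, `None` is used here to mark a missing value, and the 1.5 × IQR rule is one common convention for flagging outliers; it is an assumption of this example, not a rule given in the lecture.

```python
import statistics

# Hypothetical measurements; None marks a value that was not recorded.
readings = [12, 14, None, 15, 13, 16, 14, 95, 15, None]

# Missing values: positions where no value is stored.
missing_positions = [i for i, value in enumerate(readings) if value is None]
observed = [value for value in readings if value is not None]

# Outliers: flag values outside the 1.5 x IQR fences (a common convention).
q1, _, q3 = statistics.quantiles(observed, n=4)
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [value for value in observed if value < low_fence or value > high_fence]

print("missing at positions:", missing_positions)
print("fences:", low_fence, high_fence)
print("flagged as outliers:", outliers)
```

Flagging a value is not the same as deleting it: whether 95 is a data-entry error or a genuine extreme observation has to be decided from the context of the study.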

Sources of Data
• data distributed by an
organisation or an individual
• a designed experiment
• a survey
• an observational study
• data collected by ongoing
business activities.
• official statistics offices, e.g.:
• www.stat.si
• ec.europa.eu/Eurostat
• https://sdw.ecb.europa.eu
• …
• statistics departments of
international organizations
• stats.oecd.org/
• unstats.un.org/
• …
• Other sources:
• https://www.dataquest.io/blog
/free-datasets-for-projects/
• …
Primary sources: provide first-hand data collected
by the data analyser.
Secondary sources: provide (already collected)
data collected by another person or organization.

Methods and Properties of Data
Collection
• The reliability and accuracy of the data affect the validity of the results in a
statistical analysis.
• The reliability and accuracy of the data depend on the method(s) of data
collection.
• Three of the most popular sources of statistical data are:
• published data
• observational studies
• experimental studies.
Reliability
• Definition: The consistency of repeated assessments.
• Example: Measuring someone's weight with a weighing scale would give
consistent results for the same person.
Validity
• Definition: How well the assessment measures the concept it is
intended to measure.
• Example: The heavier a person is, the more likely they are to be overweight.
Accuracy
• Definition: How well an assessment measures what (i.e., the variable)
it is supposed to measure.
• Example: A properly calibrated weighing scale would accurately measure
kilograms.

Variables & Data: Qualitative vs.
Quantitative
Qualitative
• also known as descriptive or attributive
• labels or names are used to categorize
the distinguishing characteristics of a
qualitative variable
• observed values (data) not numerical
in nature
• can be quantified in the translation
process ↔ attributes may be coded
into numbers for purposes of data
processing
Quantitative
• also known as numerical
• a quantitative variable assumes
meaningful numerical values (data),
and can be further categorized as
either discrete or continuous
• a discrete variable assumes a
countable number of values
• a continuous variable is characterized
by uncountable values within an
interval

VARIABLES AND SCALES OF MEASUREMENT

Scales of Variable Measurement
• Variables are measured using an instrument, device, or computer.
• The scale of the variable measured drastically affects the type of analytical
techniques that can be used on the data, and what conclusions can be
drawn from the data.
• There are four scales of measurement: nominal, ordinal, interval, and ratio.

Measurement
• Nominal measurement reflects classification of objects (e.g., codes A, N and P to
represent aggressive, normal, and passive drivers); the order has no meaning, and the
difference between identifiers is meaningless. In practice it is often useful to assign
numbers instead of letters to represent nominal scale variables, but the numbers
should not be treated as ordinal, interval, or ratio scale variables (e.g., bar codes).
• Ordinal measurement reflects rank (e.g., 1 = use often; 2 = use sometimes; 3 = never
use); order matters, but the difference between responses is not consistent across the
scale or individuals.
• Interval measurement enables meaningful interpretation of numbers assigned to
objects (e.g., temperature in Celsius or Fahrenheit, time); difference between
measurements is the same anywhere along the scale and consistent across
measurements. Ratios of interval scale variables have limited meaning because there
is not an absolute zero for interval scale variables.
• Ratio measurement has all the attributes of interval scale variables and one additional
attribute: an absolute “zero” point. For example, traffic density (vehicles per kilometre)
represents a ratio scale. The density of a link is defined as zero when there are no
vehicles in a link.
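
Why ratios have limited meaning on an interval scale can be checked with temperatures. The sketch below compares 10 °C and 20 °C (arbitrary example values) in three units; the conversion formulas are the standard ones.

```python
def celsius_to_fahrenheit(c):
    """Standard conversion from degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32

def celsius_to_kelvin(c):
    """Standard conversion from degrees Celsius to kelvin (absolute zero = 0 K)."""
    return c + 273.15

cold_c, warm_c = 10.0, 20.0   # arbitrary example temperatures

# Ratios on the two interval scales disagree, so "twice as hot" is not meaningful.
ratio_celsius = warm_c / cold_c
ratio_fahrenheit = celsius_to_fahrenheit(warm_c) / celsius_to_fahrenheit(cold_c)

# Kelvin has an absolute zero (ratio scale), so this ratio is physically meaningful.
ratio_kelvin = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cold_c)

# Differences, by contrast, are consistent: 10 degrees Celsius equals 10 kelvin.
diff_celsius = warm_c - cold_c
diff_kelvin = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cold_c)

print("ratio in Celsius:   ", ratio_celsius)
print("ratio in Fahrenheit:", round(ratio_fahrenheit, 3))
print("ratio in kelvin:    ", round(ratio_kelvin, 3))
print("difference in Celsius vs kelvin:", diff_celsius, round(diff_kelvin, 2))
```

The same two temperatures give a ratio of 2 in Celsius but a different ratio in Fahrenheit, because the zero point of each scale is arbitrary. Only the scale with an absolute zero supports statements such as "x times as much".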

Measurement Levels & Variables

Measurement Levels & Variables: An
Alternative Approach

Measurement & Variables: A
Comparison
Provides                                       Nominal  Ordinal  Interval  Ratio
Categorizes and labels values                     ✓        ✓        ✓        ✓
The order of values is known                      -        ✓        ✓        ✓
Counts (frequency distribution)                   ✓        ✓        ✓        ✓
Mode                                              ✓        ✓        ✓        ✓
Median                                            -        ✓        ✓        ✓
Mean                                              -        -        ✓        ✓
Can quantify the difference between each value    -        -        ✓        ✓
Can add or subtract values                        -        -        ✓        ✓
Can multiply or divide values                     -        -        -        ✓
Has true zero                                     -        -        -        ✓
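
The comparison above can be encoded as a small lookup that answers "which summaries make sense for a variable on this scale?". This is a sketch: the operation labels and the function name `allowed_operations` are chosen here for illustration, while the assignments themselves follow the comparison.

```python
# Scales ordered from least to most informative.
SCALES = ["nominal", "ordinal", "interval", "ratio"]

# Lowest scale at which each property or operation becomes meaningful.
MIN_SCALE = {
    "categorize and label": "nominal",
    "counts": "nominal",
    "mode": "nominal",
    "order values": "ordinal",
    "median": "ordinal",
    "mean": "interval",
    "quantify differences": "interval",
    "add or subtract": "interval",
    "multiply or divide": "ratio",
    "true zero": "ratio",
}

def allowed_operations(scale):
    """Return the operations that are meaningful for the given measurement scale."""
    level = SCALES.index(scale)
    return [op for op, minimum in MIN_SCALE.items() if SCALES.index(minimum) <= level]

for scale in SCALES:
    print(f"{scale}: {', '.join(allowed_operations(scale))}")
```

For example, `allowed_operations("ordinal")` includes the median but not the mean, which is why averaging coded survey answers such as 1 = often, 2 = sometimes, 3 = never calls for caution.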

Types of Variables with Respect to
Data
Variables
• Categorical (qualitative): nominal or ordinal
• Numerical (quantitative): interval (discrete or continuous) or
ratio (discrete or continuous)

Discrete vs. Continuous Variables
As a general rule, counts are discrete and measurements are continuous.
But…think of a bucket filled with identical grains of sand. Although they are
countable, it might be easier to devise a weighing device that will have a
scale indicating the number of grains of sand.

Recap: Example 3
For this problem, state whether the variables included are
cross-sectional or time series:
• current GPAs of Purdue Statistics Graduate Students
• GPA of Marko during his time at Purdue
• value of Jordan Belfort’s portfolio over the last 3 years before
pleading guilty to fraud in 1999
• value of all portfolios at Charles Schwab in January 2019
• total salary of the Dallas Mavericks throughout the 1990s
• salaries of all Vuelta cycling teams in 2021.

…
• cross-sectional
• time series
• time series
• cross-sectional
• time series
• cross-sectional

Recap: Example 4
• The labor market data of Cornwell and Rupert (1988) consist of the following
variables for 595 individuals over 7 years:
• EXP =Work experience
• WKS =Weeks worked
• OCC =Occupation, 1 if blue collar
• IND =1 if manufacturing industry
• SOUTH =1 if resides in south
• SMSA =1 if resides in a city (SMSA)
• MS =1 if married
• FEM =1 if female
• UNION =1 if wage set by union contract
• ED =Years of education
• BLK =1 if individual is black
• LWAGE=Log of wage
• This is an example of a panel data set.
• What analysis can be made by using this data?
• e.g., Returns to Schooling

Recap: Example 5
• How many elements are in the data set?
• How many variables are in the data set?
• What type of variable is each variable in the data set (be sure to
answer both qualitative or quantitative as well as nominal, ordinal,
interval, or ratio). Define the quantitative variables as discrete or
continuous.
Grade      Major       GPA   Credit Hours
Sophomore  Psychology  3.14  30
Senior     Spanish     2.89  105
Senior     Religion    3.01  99
Freshman   Philosophy  2.45  12
Source: Huang, W. (2014). Lecture 25: Types of Data, Sampling. PowerPoint Presentation. Purdue University.

…
• 4 elements in total
• 4 variables in this data set. They are Grade, Major, Credit
Hours, and GPA (grade point average)
• Grade: qualitative (ordinal); Major: qualitative (nominal);
GPA: quantitative (interval, continuous); Credit hours:
quantitative (ratio, discrete)
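
The counts in this answer can be verified by writing the data set down as records. The sketch below uses the four rows from the example; the summaries chosen for each variable (mean for GPA, total for credit hours, counts for the qualitative variables) are illustrative choices that respect each variable's measurement scale.

```python
import statistics
from collections import Counter

# The four elements (students) from the example, one record per element.
students = [
    {"Grade": "Sophomore", "Major": "Psychology", "GPA": 3.14, "Credit Hours": 30},
    {"Grade": "Senior",    "Major": "Spanish",    "GPA": 2.89, "Credit Hours": 105},
    {"Grade": "Senior",    "Major": "Religion",   "GPA": 3.01, "Credit Hours": 99},
    {"Grade": "Freshman",  "Major": "Philosophy", "GPA": 2.45, "Credit Hours": 12},
]

n_elements = len(students)        # rows = elements
n_variables = len(students[0])    # columns = variables

# Quantitative variables support arithmetic summaries.
mean_gpa = statistics.mean([row["GPA"] for row in students])
total_credits = sum(row["Credit Hours"] for row in students)

# Qualitative variables are summarized by counts (and the mode).
grade_counts = Counter(row["Grade"] for row in students)

print("elements:", n_elements, "variables:", n_variables)
print("mean GPA:", round(mean_gpa, 4))
print("total credit hours:", total_credits)
print("grade counts:", dict(grade_counts))
```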
