2. Populations and Samples
• A population includes all of the entities of interest: people,
households, machines, or whatever. The following are three typical
populations:
• All potential voters in a general election
• All subscribers to a cable television provider
• All telecom users in Pakistan
• In these situations and many others, it is virtually impossible to obtain
information about all members of the population. For example, it is
far too costly to ask all potential voters which presidential candidates
they prefer.
3. Sampling vs Population
• Therefore, we often try to gain insights into the characteristics of a
population by examining a sample, or subset, of the population.
• Samples should be representative of the population so that observed
characteristics of the sample can be generalized to the population as
a whole.
• A population includes all of the entities of interest in a study.
• A sample is a subset of the population, often randomly chosen and
preferably representative of the population as a whole.
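As a rough illustration of drawing a sample from a population, the sketch below uses Python's standard library. The population of monthly bills is invented for the example (it is not data from the slides), and the sample size of 200 is an arbitrary choice.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the example is repeatable

# Hypothetical population: monthly bills (in dollars) for 10,000 subscribers.
population = [round(rng.uniform(20, 120), 2) for _ in range(10_000)]

# Simple random sample without replacement: every member is equally likely.
sample = rng.sample(population, k=200)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

With a randomly chosen sample, the sample mean usually lands close to the population mean, which is the sense in which a sample characteristic can be generalized to the population.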
4. Data Sets, Variables, and Observations
• A data set is generally a rectangular table of data where the columns
contain variables, such as height, gender, and income, and each row
contains an observation.
• Each observation includes the attributes of a particular member of
the population: a person, a company, a city, a machine, or whatever.
• A variable (column) is often called a field or an attribute, and an
observation (row) is often called a case or a record.
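One way to picture the rectangular layout is as a list of records, sketched below in Python. The variable names follow the survey example used in these slides; the three rows are made-up values for illustration only.

```python
# Each dict is one observation (row, case, record);
# each key is one variable (column, field, attribute).
# The values are invented for illustration, not taken from the survey file.
records = [
    {"Person": 1, "Age": 35, "Gender": "Male", "State": "Minnesota",
     "Children": 1, "Salary": 65400, "Opinion": 5},
    {"Person": 2, "Age": 61, "Gender": "Female", "State": "Texas",
     "Children": 2, "Salary": 62000, "Opinion": 1},
    {"Person": 3, "Age": 35, "Gender": "Male", "State": "Ohio",
     "Children": 0, "Salary": 63200, "Opinion": 3},
]

variables = list(records[0].keys())      # the column names
ages = [row["Age"] for row in records]   # one variable across all observations

print(variables)
print(ages)
```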
5. Data Types
• There are several ways to categorize data, as we explain in the context
of Example (Green Environment survey data).
• A basic distinction is between numeric and categorical data. The
distinction is whether you intend to do any arithmetic on the data.
• A variable is numeric if meaningful arithmetic can be performed on it.
Otherwise, the variable is categorical.
6. Example Excel Spreadsheet
• In the questionnaire data set, Age, Children, and Salary are clearly
numeric.
• For example, it makes perfect sense to sum or average any of these.
• In contrast, Gender and State are clearly categorical because they are
expressed as text, not numbers.
• The Opinion variable is less obvious. It is expressed numerically, on a
1-to-5 scale.
• However, these numbers are really only codes for the categories
“strongly disagree,” “disagree,” “neutral,” “agree,” and “strongly
agree.”
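The numeric versus categorical distinction can be sketched as follows. The data values are hypothetical, and the `labels` dictionary is just one possible way to spell out the 1-to-5 coding described above.

```python
from collections import Counter
import statistics

# Hypothetical values for four respondents.
salaries = [65400, 62000, 63200, 52000]       # numeric: arithmetic is meaningful
genders = ["Male", "Female", "Male", "Female"]  # categorical: text labels
opinions = [5, 1, 3, 5]                        # categorical, stored as codes

# Averaging a numeric variable makes sense.
mean_salary = statistics.mean(salaries)

# For categorical variables, counting categories is the natural summary.
gender_counts = Counter(genders)

# The Opinion codes only stand for category labels.
labels = {1: "strongly disagree", 2: "disagree", 3: "neutral",
          4: "agree", 5: "strongly agree"}
opinion_counts = Counter(labels[code] for code in opinions)

print(mean_salary)
print(gender_counts)
print(opinion_counts)
```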
7. • The Opinion variable's categories have a definite ordering, whereas there is no
natural ordering of the categories for the Gender or State variables.
• When there is a natural ordering of categories, the variable is
classified as ordinal.
• If there is no natural ordering, as with the Gender or State variables,
the variable is classified as nominal.
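A small Python sketch of what the ordering allows: ordinal categories can be sorted and a middle response identified, while nominal categories can only be sorted by an arbitrary convention such as alphabetical order. The responses and state names are made up for illustration.

```python
# The five Opinion categories, listed from lowest to highest.
OPINION_ORDER = ["strongly disagree", "disagree", "neutral",
                 "agree", "strongly agree"]
rank = {label: position for position, label in enumerate(OPINION_ORDER)}

# Hypothetical responses from five people.
responses = ["agree", "strongly disagree", "neutral",
             "strongly agree", "disagree"]

# Ordinal: sorting by rank is meaningful, so a middle (median) response exists.
ordered = sorted(responses, key=lambda label: rank[label])
median_response = ordered[len(ordered) // 2]

# Nominal: alphabetical order is only a convention, not a property of the data.
states = ["Texas", "Ohio", "Minnesota"]
alphabetical = sorted(states)

print(ordered)
print(median_response)
```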
8. • A categorical variable is ordinal if there is a natural ordering of its
possible categories. If there is no natural ordering, the variable is
nominal.
• Note:
Three types of variables that appear to be numeric but are usually
treated as categorical are phone numbers, zip codes, and Social
Security numbers. Do you see why? Can you think of others?
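One possible illustration for the note above, using zip codes: storing them as numbers loses leading zeros, and arithmetic on them produces values with no meaning. The three codes are arbitrary examples.

```python
# Zip codes kept as text, the way categorical identifiers are usually stored.
zip_codes = ["02139", "10001", "94105"]

# Converting to numbers drops the leading zero of the first code.
as_numbers = [int(code) for code in zip_codes]
first_as_text = str(as_numbers[0])

# The "average zip code" can be computed, but it describes nothing real.
mean_zip = sum(as_numbers) / len(as_numbers)

print(first_as_text)
print(mean_zip)
```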
9. Green environment survey
• Each observation lists the person’s age, gender, state of residence,
number of children, annual salary, and opinion of the president’s
environmental policies.
• These six pieces of information represent the variables. It is
customary to include a row (row 1 in this case) that lists variable
names.
• These variable names should be concise but meaningful. Note that an
index of the observation is often included in column A. If you sort on
other variables, you can always sort on the index to get back to the
original sort order.
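The role of the index column can be sketched in Python as below. The field name `Index` and the row values are assumptions made for the example.

```python
# Hypothetical rows; "Index" records the original position of each observation.
records = [
    {"Index": 1, "Age": 35, "Salary": 65400},
    {"Index": 2, "Age": 61, "Salary": 62000},
    {"Index": 3, "Age": 35, "Salary": 63200},
]

# Sort on another variable, which scrambles the original order.
by_salary = sorted(records, key=lambda row: row["Salary"])

# Sort on the index to get back to the original order.
restored = sorted(by_salary, key=lambda row: row["Index"])

print([row["Index"] for row in by_salary])
print(restored == records)
```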
10. Categorical variables
• Categorical variables can be coded numerically. In the Excel example file,
Gender has not been coded, whereas Opinion has been coded. This is
largely a matter of taste—so long as you realize that coding a
categorical variable does not make it numeric and appropriate for
arithmetic operations.
• In a recoded version of the data, Opinion is shown as text and Gender is coded
as 1 for males and 0 for females. This 0/1 coding for a categorical
variable is very common. Such a variable is usually called a dummy
variable, and it often simplifies the analysis. You will see dummy
variables throughout the book.
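A minimal Python sketch of the 0/1 coding described above, with made-up Gender values. One reason a dummy variable can simplify analysis is that its mean equals the proportion of observations coded 1.

```python
# Hypothetical Gender values for five respondents.
genders = ["Male", "Female", "Female", "Male", "Female"]

# Dummy variable: 1 for males, 0 for females, as in the coding above.
is_male = [1 if gender == "Male" else 0 for gender in genders]

# The mean of a 0/1 variable is the proportion of 1s.
proportion_male = sum(is_male) / len(is_male)

print(is_male)
print(proportion_male)
```

The codes are still only labels for two categories; the mean is interpretable here solely because of the 0/1 convention.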