By: SOURABH MODGIL
Where do data come from?
We’ve seen our data for this lab, all nice and collated in a database, from:
  • Insurance companies (claims, medications, procedures, diagnoses, etc.)
  • Firms (demographic data, productivity data, etc.)
Where do data come from?
Take a step back: if we’re starting from scratch, how do we collect or find data?
  • Secondary data
  • Primary data
Secondary Data
Secondary data – data someone else has collected
  • This is what you were looking for in your assignment.
Secondary Data – Examples of Sources
• County health departments
• Vital statistics (birth and death certificates)
• Hospital, clinic, and school nurse records
• Private and foundation databases
• City and county governments
• Surveillance data from state government programs
• Federal agency statistics (Census, NIH, etc.)
Secondary Data – Limitations
• What did you find frustrating as you looked for data on the state’s websites?
Secondary Data – Limitations
• When was it collected? For how long?
  • May be out of date for what you want to analyze.
  • May not have been collected for long enough to detect trends.
  • E.g., have new anticorruption laws impacted Russia’s government accountability ratings?
Secondary Data – Limitations
• Is the data set complete?
  • There may be missing information on some observations.
  • Unless such missing information is caught and corrected for, the analysis may be biased.
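One way to catch missing information before analysis is to measure how much of a variable is missing and whether the observations with missing values look different from the rest. The sketch below is only an illustration: the records, the field names (`age`, `claim_cost`), and both helper functions are hypothetical and not part of the lab database.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "claim_cost": 1200.0},
    {"id": 2, "age": 51, "claim_cost": None},
    {"id": 3, "age": 47, "claim_cost": 800.0},
    {"id": 4, "age": 62, "claim_cost": None},
    {"id": 5, "age": 29, "claim_cost": 400.0},
]


def missing_share(rows, field):
    """Return the fraction of rows in which `field` is missing (None)."""
    if not rows:
        raise ValueError("no rows to inspect")
    return sum(1 for r in rows if r[field] is None) / len(rows)


def _mean(values):
    """Mean of a list, or None when the list is empty."""
    return sum(values) / len(values) if values else None


def mean_by_missingness(rows, field, other):
    """Mean of `other` where `field` is missing vs. where it is observed.

    A large gap between the two means suggests the values are not
    missing at random, so simply dropping those rows could bias results.
    """
    missing = [r[other] for r in rows if r[field] is None]
    observed = [r[other] for r in rows if r[field] is not None]
    return _mean(missing), _mean(observed)


if __name__ == "__main__":
    print(missing_share(records, "claim_cost"))
    print(mean_by_missingness(records, "claim_cost", "age"))
```

In this made-up sample the rows missing `claim_cost` belong to older people on average, which is the kind of pattern that would need correcting for rather than ignoring.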
Secondary Data – Limitations
• Are there confounding problems?
  • Sample selection bias?
  • Source choice bias?
  • In time series, did some observations drop out over time?
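A quick check for observations dropping out over time is to compare who is present in each period against the first period. The following is a minimal sketch under assumed data: the `waves` dictionary, its years, and the unit identifiers are invented for illustration.

```python
# Hypothetical panel: for each year, the set of unit IDs observed.
waves = {
    2019: {"A", "B", "C", "D"},
    2020: {"A", "B", "D"},
    2021: {"A", "D"},
}


def attrition_report(waves):
    """For each period, report the share of baseline units still
    present and the baseline units that are absent in that period."""
    years = sorted(waves)
    baseline = waves[years[0]]
    if not baseline:
        raise ValueError("baseline period has no observations")
    report = []
    for year in years:
        remaining = baseline & waves[year]
        dropped = sorted(baseline - waves[year])
        report.append((year, len(remaining) / len(baseline), dropped))
    return report


if __name__ == "__main__":
    for year, retained, dropped in attrition_report(waves):
        print(year, f"{retained:.0%} retained", "absent:", dropped)
```

If the units that drop out differ systematically from those that stay, trends estimated from the remaining units may not describe the original population.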
Secondary Data – Limitations
• Are the data consistent/reliable?
  • Did variables drop out over time?
  • Did variables change in definition over time?
    • E.g., number of years of education versus highest degree obtained.
Secondary Data – Limitations
• Is the information exactly what you need?
  • In some cases, you may have to use “proxy variables”: variables that approximate something you really wanted to measure. Are they reliable? Do they correlate with what you actually want to measure?
  • E.g., gauging student interest in U.W. by how students rank it on the FAFSA, which is subject to gamesmanship.
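If a small validation sample exists in which both the proxy and the quantity of real interest were measured, the correlation between them can be checked directly. This is a sketch only: `pearson_r` is a hand-written helper, and the two lists are invented numbers, not real measurements.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / (spread_x * spread_y)


if __name__ == "__main__":
    # Hypothetical validation sample: proxy vs. the directly measured value.
    proxy = [1, 2, 3, 4, 5, 6]
    target = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]
    print(round(pearson_r(proxy, target), 3))
```

A high correlation in a validation sample supports using the proxy, but it does not rule out problems such as gamesmanship, which can change the relationship once people know the proxy is being used.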
Secondary Data – Advantages
• No need to reinvent the wheel.
  • If someone has already found the data, take advantage of it.
Secondary Data – Advantages
• It will save you money.
  • Even if you have to pay for access, it is often cheaper than collecting your own data. (More on this later.)
Secondary Data – Advantages
• It will save you time.
  • Primary data collection is very time-consuming. (More on this later, too!)
Secondary Data – Advantages
• It may be very accurate.
  • Especially when a government agency has collected the data, a great deal of time and money went into it, so it is probably highly accurate.
Secondary Data – Advantages
• It has great exploratory value.
  • It helps in exploring research questions and formulating hypotheses to test.
Primary Data
Primary data – data you collect
Primary Data - Examples
• Surveys
• Focus groups
• Questionnaires
• Personal interviews
• Experiments and observational studies
Primary Data - Limitations
• Do you have the time and money for:
  • Designing your collection instrument?
  • Selecting your population or sample?
  • Pretesting/piloting the instrument to work out sources of bias?
  • Administering the instrument?
  • Entering and collating the data?
Primary Data - Limitations
• Uniqueness
  • You may not be able to compare your results to other populations.
Primary Data - Limitations
• Researcher error
  • Sample bias
  • Other confounding factors
Data collection choice
• What you must ask yourself:
  • Will the data answer my research question?
Data collection choice
• To answer that:
  • You must first decide what your research question is.
  • Then you need to decide what data/variables are needed to answer the question scientifically.
Data collection choice
• If the data exist in secondary form, then use them to the extent you can, keeping their limitations in mind.
• But if they do not, and you are able to fund primary collection, then primary collection is the method of choice.
Data collection methods
