Statistical Measures: Working with Categorical Data

Data
In statistics we work with data.
Statistical data comes in the form of tables.
Each row of a table is called a case or a subject.
Each column of a table is a characteristic or attribute of the case or subject.
Because the columns potentially hold different information for each row, they are also called variables.

Data on Titanic Passengers
Each row refers to one passenger (a case).
Each column refers to a characteristic of the case. It is called a variable because it can hold different values for different cases.
In this table, all of the variables are categorical (qualitative). Their values are labels: usually words, or numbers that are not used in computations, such as a zip code, a phone number, or a social security number.
Rows are horizontal; columns are vertical.

Survived  Age    Sex     Class
Dead      Adult  Male    Third
Dead      Adult  Male    Crew
Dead      Adult  Male    Third
Dead      Adult  Male    Crew
Dead      Adult  Male    Crew
Dead      Adult  Male    Crew
Alive     Adult  Female  First
Dead      Adult  Male    Third
Dead      Adult  Male    Crew

Frequency Table & Bar Chart
This frequency table comes from counting how often each value of the Class variable occurs in the table on the previous slide.

Class   Frequency
First   325
Second  285
Third   706
Crew    885

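The counting step can be sketched in a few lines of Python. This is only an illustration: it uses the nine rows shown on the Titanic data slide rather than all 2201 passengers, and the names `classes` and `frequency_table` are made up for this example.

```python
from collections import Counter

# Class values for the nine passengers shown on the data slide.
# (Assumption: a small illustration only; the full table has 2201 rows.)
classes = ["Third", "Crew", "Third", "Crew", "Crew",
           "Crew", "First", "Third", "Crew"]


def frequency_table(values):
    """Count how often each distinct value occurs."""
    return dict(Counter(values))


freq = frequency_table(classes)
for label, count in freq.items():
    print(f"{label:<6} {count}")
```

Run on the full data set, the same counting would produce the frequencies in the table above.
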
Relative Frequency Table & Pie Chart
The degree measure of each slice of the pie is calculated by multiplying the relative frequency (the percent in decimal form) by 360.

Class   Rel. Frequency (%)
First   14.8
Second  12.9
Third   32.1
Crew    40.2

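As a rough sketch, the relative frequencies and the pie-slice angles can be computed from the class counts on the frequency table slide. The function names here are hypothetical and chosen only for this example.

```python
# Class counts from the frequency table.
freq = {"First": 325, "Second": 285, "Third": 706, "Crew": 885}


def relative_frequencies(freq):
    """Divide each count by the grand total to get a proportion."""
    total = sum(freq.values())
    return {label: count / total for label, count in freq.items()}


def pie_angles(rel):
    """Slice angle in degrees = proportion (decimal form) times 360."""
    return {label: p * 360 for label, p in rel.items()}


rel = relative_frequencies(freq)
angles = pie_angles(rel)
for label in freq:
    print(f"{label:<6} {rel[label]:6.1%} {angles[label]:6.1f} degrees")
```

Because the proportions add up to 1, the slice angles add up to 360 degrees.
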
Contingency Table
This table is called a contingency table because it shows how the individuals are distributed along each variable, contingent (based) on the value of the other variable. It is also called a two-way table of categorical data.

             First   Second  Third   Crew    Total
Alive        202     118     178     212     710
% of row     28.5%   16.6%   25.1%   29.9%   100%
% of column  62.2%   41.4%   25.2%   24.0%   32.3%
Dead         123     167     528     673     1491
% of row     8.2%    11.2%   35.4%   45.1%   100%
% of column  37.8%   58.6%   74.8%   76.0%   67.7%
Total        325     285     706     885     2201
% of row     14.8%   12.9%   32.1%   40.2%   100%

Conditional distribution = one row or one column.
Marginal distribution = the total row or the total column.
Conditional distribution of class for surviving passengers: the Alive row.
Marginal distribution of passenger class: the Total row.
Conditional distribution of survival status for first class passengers: the First column.
Marginal distribution of survival status: the Total column.

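The "% of row" and "% of column" entries can be reproduced from the counts alone. The sketch below assumes the counts are stored as a nested dictionary; that layout and the function names are choices made for this example, not part of the original slides.

```python
# Counts from the contingency table: survival status by passenger class.
counts = {
    "Alive": {"First": 202, "Second": 118, "Third": 178, "Crew": 212},
    "Dead":  {"First": 123, "Second": 167, "Third": 528, "Crew": 673},
}


def row_percents(counts, status):
    """Conditional distribution of class for one survival status (% of row)."""
    total = sum(counts[status].values())
    return {cls: 100 * n / total for cls, n in counts[status].items()}


def column_percents(counts, cls):
    """Conditional distribution of survival status for one class (% of column)."""
    total = sum(counts[status][cls] for status in counts)
    return {status: 100 * counts[status][cls] / total for status in counts}


def marginal_class_totals(counts):
    """Marginal distribution of passenger class (the Total row), as counts."""
    classes = counts["Alive"].keys()
    return {cls: sum(counts[status][cls] for status in counts) for cls in classes}


alive_by_class = row_percents(counts, "Alive")
first_by_status = column_percents(counts, "First")
class_totals = marginal_class_totals(counts)

print({cls: round(p, 1) for cls, p in alive_by_class.items()})
print({status: round(p, 1) for status, p in first_by_status.items()})
print(class_totals)
```

Each conditional distribution divides by the total of its own row or column, which is why every row of row percents and every column of column percents sums to 100%.
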
Questions
What percent of 1st class passengers survived?
What percent of survivors were 2nd class passengers?
What percent of all passengers were crew members?
What percent of all passengers died?
What percent of those who died were crew members?

Conditional and Marginal Distributions
List the conditional distribution of survival status for crew members.
List the conditional relative frequency distribution of passenger class for passengers who died.
List the marginal distribution of passenger class.
List the marginal distribution of survival status.

Segmented (Stacked) Bar Chart
[Figure: segmented bar chart of survival status by passenger class]

Segmented (Stacked) Bar Chart
[Figure: segmented bar chart of passenger class by survival status]

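Independence
The colors of M&Ms are independent of whether bags are opened by boys or girls.
Survival status is not independent of passenger class: the percent who survived differs from class to class (62.2% of first class passengers, but only 24.0% of crew members).
Survival status and passenger class are dependent.
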
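One informal way to see the dependence is to compare the survival rate within each class to the overall survival rate. If the two variables were independent, the rates would all be about the same. The sketch below is a descriptive comparison only, not a formal test, and the names used are hypothetical.

```python
# Counts from the contingency table: survival status by passenger class.
counts = {
    "Alive": {"First": 202, "Second": 118, "Third": 178, "Crew": 212},
    "Dead":  {"First": 123, "Second": 167, "Third": 528, "Crew": 673},
}


def survival_rate_by_class(counts):
    """Proportion of each class that survived (conditional on class)."""
    rates = {}
    for cls, alive in counts["Alive"].items():
        rates[cls] = alive / (alive + counts["Dead"][cls])
    return rates


rates = survival_rate_by_class(counts)

# Overall (marginal) survival rate across everyone on board.
everyone = sum(sum(row.values()) for row in counts.values())
overall = sum(counts["Alive"].values()) / everyone

# Gap between the highest and lowest class survival rates.
spread = max(rates.values()) - min(rates.values())

for cls, rate in rates.items():
    print(f"{cls:<6} {rate:6.1%}")
print(f"Overall {overall:6.1%}")
print(f"Spread  {spread:6.1%}")
```

A large gap between the class rates and the overall rate is what the slide means by "not independent".
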
Why?
Even if two variables are dependent, we cannot assume a causal relationship.
Knowing only that there is an association between two variables, we may not be able to tell why it exists.

Assignment
You are to find actual data with at least two categorical variables.
You might be able to find data online, or in a magazine or newspaper (e.g. Consumer Reports, USA Today, etc.).
Bring your data with you to class on Thursday.
If you have any problems, see me tomorrow. Don’t wait until Thursday!

Kevin’s Project
Source: Chicago Blackhawks Web Site

Name             Country  Education
Bryan Bickell    Canada   Juniors
Dave Bolland     Canada   Juniors
Troy Brouwer     Canada   Juniors
Jake Dowell      USA      College
Marian Hossa     Europe   Juniors
Ryan Johnson     Canada   College
Patrick Kane     USA      Juniors
Tomas Kopecky    Europe   Juniors
Fernando Pisani  Europe   College
…                …        …

Country

Country  Count  %
Canada   14     60.9%
Europe   5      21.7%
USA      4      17.4%
Total    23     100%

Education

Education  Count  %
Juniors    13     56.5%
College    10     43.5%
Total      23     100%

Country and Education

             Juniors  College  Total
USA          1        3        4
% of row     25%      75%      100%
% of column  7.7%     30%      17.4%
Canada       7        7        14
% of row     50%      50%      100%
% of column  53.8%    70%      60.9%
Europe       5        0        5
% of row     100%     0%       100%
% of column  38.5%    0%       21.7%
Total        13       10       23
% of row     56.5%    43.5%    100%
