- The study aimed to predict depression in young adults using data from the National Longitudinal Study of Adolescent to Adult Health (Add Health), which surveyed the same individuals over multiple waves from adolescence to age 32.
- The researchers built several prediction models using machine learning techniques such as random forests and support vector regression on survey responses from Wave I to predict depression scores derived from Wave IV responses.
- They addressed challenges such as missing data, the large number of survey questions relative to sample size, and participants lost to follow-up over time when developing and evaluating their prediction models.
This document describes using the Add Health dataset to predict depression scores. It discusses the structure and size of the Add Health data across multiple waves. The outcome variable is a depression score from Wave IV, measured using 10 depression indicator questions. Challenges in building a prediction model include data cleaning, too many predictors relative to sample size, sparseness, and loss to follow-up between waves. Methods to address these include filtering to respondents present in both Waves I and IV, expanding categorical variables, reducing dimensions via random projection and PCA, and using techniques like OLS, pre-shrinking, and machine learning models to make predictions. Results show that machine learning models such as random forest and XGBoost achieved the lowest square root of the MSPE for prediction.
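The comparison metric used above, the square root of the mean squared prediction error (MSPE), is simple to compute directly. A minimal sketch follows; the scores and predictions are invented for illustration, not taken from Add Health:

```python
import math

def rmspe(y_true, y_pred):
    """Square root of the mean squared prediction error, the metric used to compare models."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

# Hypothetical held-out depression scores and two candidate predictors.
y_true = [2.0, 5.0, 1.0, 4.0, 3.0]
mean_baseline = [3.0] * 5                   # predict the overall mean for everyone
model_preds = [2.5, 4.5, 1.5, 3.5, 3.0]     # a model that tracks individual variation

print(rmspe(y_true, mean_baseline))
print(rmspe(y_true, model_preds))           # lower value indicates better prediction
```

A model "achieving the lowest square root of MSPE" simply means it minimizes this quantity on held-out data.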
This document discusses misuses and limitations of statistics. It provides examples of how statistics can be misleading when organizations selectively publish studies, questions are worded to influence responses, or samples are not representative of the overall population. Limitations of statistics include that they deal with aggregates rather than individuals, quantitative rather than qualitative data, and laws that are true on average rather than exactly. Statistics also cannot prove causation and are limited by the quality of data collection and analysis.
This document discusses methods of collecting statistical data. It describes census and sample investigation methods. The census method collects data from every unit of the population, while the sample method collects data from only a few representative units. The census method is more reliable but costly, while the sample method is less expensive but less accurate. Key differences between the two methods are also outlined.
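The cost/accuracy trade-off between the two methods can be illustrated with a quick simulation. The population below is synthetic; in practice enumerating every unit is what makes a census costly:

```python
import random

random.seed(0)
population = [random.gauss(50, 10) for _ in range(10_000)]  # synthetic population frame

census_mean = sum(population) / len(population)   # census: every unit, exact but costly
sample = random.sample(population, 200)           # sample: only a few representative units
sample_mean = sum(sample) / len(sample)           # cheaper, close to but not exactly the census value

print(census_mean, sample_mean)
```

With a well-drawn random sample, the sample mean lands close to the census mean at a fraction of the collection effort, which is the trade-off the document describes.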
1) Statistics is the study of collecting, organizing, analyzing, and drawing conclusions from data. It involves sampling, hypothesis testing, and using statistical tests tailored to measurement scales and hypothesis types.
2) Descriptive statistics describe and summarize data quantitatively, while inferential statistics allow generalizing from samples to populations through statistical testing and other methods.
3) The document discusses differences between statistics and statistical data, types of data, levels of measurement, sampling techniques, and uses of statistics.
This document provides an introduction to statistics. It discusses key concepts including the role of statistics in research, the typical research process, variables, scales of measurement, and descriptive and inferential statistics. Specifically, it describes how statistics is used for collecting, analyzing and interpreting data to answer research questions. It also outlines the typical steps in research including developing questions and hypotheses, choosing measures, designing the study, analyzing data, and drawing conclusions.
Selection of appropriate data analysis technique (RajaKrishnan M)
- The document discusses choosing the right statistical method for data analysis, which depends on factors like the number and measurement level of variables, the distribution of variables, the dependence/independence structure, the nature of the hypotheses, and sample size.
- It presents flowcharts for choosing a statistical method based on whether the hypothesis involves one variable (univariate), two variables (bivariate), or more than two variables (multivariate).
- For univariate data, descriptive statistics or a one-sample t-test can be used depending on whether description or inference is the goal; for bivariate data, the choice depends on the nature of the hypothesis (difference or association) and the level of measurement (parametric or nonparametric).
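The flowchart logic described above can be sketched as a small decision function. The branch names below are a simplified stand-in for the slide's actual chart, not a reproduction of it:

```python
def choose_method(n_vars, goal="description", hypothesis=None, parametric=True):
    """Hypothetical decision helper mirroring the univariate/bivariate/multivariate branching."""
    if n_vars == 1:
        # Univariate: describe, or test one sample against a reference value.
        return "descriptive statistics" if goal == "description" else "one-sample t-test"
    if n_vars == 2:
        # Bivariate: branch on hypothesis type, then on measurement level.
        if hypothesis == "difference":
            return "two-sample t-test" if parametric else "Mann-Whitney U"
        return "Pearson correlation" if parametric else "Spearman correlation"
    # More than two variables: multivariate territory.
    return "multivariate methods (e.g. multiple regression, MANOVA)"

print(choose_method(1))
print(choose_method(2, hypothesis="association", parametric=False))
```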
1. The document discusses the introduction to statistics, providing definitions and explaining key concepts. It describes how statistics is used in various fields like education, business, medical research, and agriculture.
2. Statistics is defined as the science of collecting, organizing, summarizing, presenting, analyzing, and interpreting data. It can be used as both a science and an art. Statistics has various applications in fields like administration, business, education, and medical and agricultural research.
3. The document outlines the basic terminology used in statistics, including data, variables, observations, quantitative and qualitative data, continuous and discrete variables. It distinguishes between primary and secondary data and their characteristics.
This document provides an introduction to statistics. It defines statistics as the scientific methods for collecting, organizing, summarizing, presenting and analyzing data to derive valid conclusions. Statistics is useful across many fields and careers as it helps make informed decisions based on data. The document outlines descriptive and inferential statistics, and notes that descriptive statistics simplifies complexity while inferential statistics allows for conclusions to be drawn. It also discusses types of data sources, including primary data collected directly and secondary data that has already been collected.
Hiv Replication Model for The Succeeding Period Of Viral Dynamic Studies In A... (inventionjournals)
This document presents a new model for HIV replication dynamics. It introduces an exponential distribution to model the rate of HIV multiplication by infected cells. A Bayesian approach is used to estimate the posterior distribution of the rate parameter, using an incomplete gamma function as the prior. The model allows estimating the HIV count in the succeeding period based on viral load and CD4 count data collected periodically.
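The paper's exact prior (built on an incomplete gamma function) is not reproduced here, but the general idea of Bayesian updating of an exponential rate parameter can be sketched with the standard conjugate Gamma prior. The data values are invented:

```python
# For an exponential likelihood with rate parameter lam and a Gamma(a, b) prior on lam,
# conjugacy gives the posterior Gamma(a + n, b + sum(x)). This is a simplified stand-in
# for the paper's incomplete-gamma construction.

def gamma_posterior(a, b, data):
    """Posterior shape and rate after observing exponential data."""
    return a + len(data), b + sum(data)

def posterior_mean(a, b):
    """Mean of a Gamma(shape=a, rate=b) distribution."""
    return a / b

observations = [0.5, 1.2, 0.8, 0.9]   # hypothetical periodic measurements
a_post, b_post = gamma_posterior(a=1.0, b=1.0, data=observations)
print(posterior_mean(a_post, b_post))  # point estimate of the replication rate
```

The posterior mean would then drive the estimate of the HIV count in the succeeding period.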
This presentation covers statistics, its importance, its applications, branches of statistics, basic concepts used in statistics, data sampling, types of sampling, types of data, and collection of data.
This document provides an overview of key concepts in statistics. It discusses how statistics involves collecting, organizing and analyzing numerical data. There are two main types of statistics: descriptive statistics which summarizes and presents data, and inferential statistics which uses samples to make estimates about populations. Key elements discussed include populations, samples, variables, and measures of reliability. Both quantitative and qualitative data are examined. Methods for collecting data include published sources, designed experiments, surveys and observational studies. The role of statistics in critical thinking is also discussed.
Statistics involves collecting, describing, and analyzing data. There are two main areas: descriptive statistics which describes sample data, and inferential statistics which draws conclusions about populations from samples. A population is the entire set being studied, while a sample is a subset of the population. Variables are characteristics being measured, and can be either qualitative (categorical) or quantitative (numerical). Data is collected through experiments or surveys using sampling methods to obtain a representative sample from the population. There is usually variability in data that statistics aims to measure and characterize.
This document discusses the scope and uses of statistics across various fields such as planning, economics, business, industry, mathematics, science, psychology, education, war, banking, government, sociology, and more. It outlines functions of statistics like presenting facts, testing hypotheses, forecasting, policymaking, enlarging knowledge, measuring uncertainty, simplifying data, deriving valid inferences, and drawing rational conclusions. It also covers characteristics, advantages, and limitations of statistics.
Inferential statistics use samples to make generalizations about populations. It allows researchers to test theories designed to apply to entire populations even though samples are used. The goal is to determine if sample characteristics differ enough from the null hypothesis, which states there is no difference or relationship, to justify rejecting the null in favor of the research hypothesis. All inferential tests examine the size of differences or relationships in a sample compared to variability and sample size to evaluate how deviant the results are from what would be expected by chance alone.
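Comparing an observed difference against variability and sample size, as described above, is exactly what a t statistic does. A minimal one-sample version, with invented data:

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """t = (sample mean - null mean) / (s / sqrt(n)): the effect scaled by variability and n."""
    n = len(sample)
    m = statistics.mean(sample)
    s = statistics.stdev(sample)          # sample standard deviation (n - 1 denominator)
    return (m - mu0) / (s / math.sqrt(n))

sample = [5.1, 4.9, 5.3, 5.2, 5.0, 5.4]
print(one_sample_t(sample, mu0=5.0))
```

A large |t| means the sample deviates from the null hypothesis by more than chance variability would typically produce, justifying rejection of the null.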
This presents an overview about relevance and significance of statistics as a valid tool in enhancing quality of research. It also touches upon some misuse and abuse of statistics.
This document provides an introduction to statistics. It defines statistics and discusses its importance, limitations, and application areas. It also outlines the main classifications of statistics including descriptive and inferential statistics. Descriptive statistics describes data without making conclusions while inferential statistics makes generalizations beyond the data. The document concludes by defining key statistical terms and outlining the typical steps in a statistical investigation.
De-Mystifying Stats: A primer on basic statistics (Gillian Byrne)
This document provides an overview of key concepts in research methods and statistical analysis. It defines important terms like hypotheses, variables, sampling, and statistical significance. It also describes common statistical tests like t-tests, ANOVA, correlation coefficients, and their appropriate uses and limitations. Various measures of central tendency, dispersion, and their interpretations are outlined. Examples are provided to illustrate statistical concepts. The document serves as a useful introduction and reference guide for understanding research methodology and statistics.
Fundamentals of Statistics: definition of statistics, descriptive and inferential statistics, major types of descriptive statistics, and statistical data analysis.
Chapter 1 introduction to statistics for engineers (abfisho)
This document provides an introduction to statistics. It defines statistics as the science of collecting, analyzing, and presenting data systematically. Statistics has two main branches - descriptive statistics, which describes data through measures like averages without generalizing beyond the sample, and inferential statistics, which makes generalizations from samples to populations. The document lists important terms in statistics like data, variables, population, sample, and sample size. It also outlines the main steps in a statistical investigation, including collecting and organizing data. Statistics has many applications in fields like business, engineering, health, and economics.
1. The document discusses the meaning, uses, functions, importance and limitations of statistics. It defines statistics as the collection, presentation, analysis and interpretation of numerical data.
2. Statistics has various uses across different fields such as policy planning, management, education, commerce and accounts. It helps present facts precisely and enables comparison, correlation, formulation and testing of hypotheses, and forecasting.
3. While statistics is important for planning, administration, economics and more, it also has limitations such as only studying aggregates, numerical data, and being an average. Statistics can also be misused if not used carefully by experts.
The data set is about the 1987 national Indonesia contraceptive prevalence survey. Data Retrieving, cleaning, exploration, modelling with classification using Decision Tree and KNN model.
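A minimal KNN classifier in the spirit of that modelling exercise might look like the sketch below. The feature columns and labels are invented stand-ins for the survey's actual fields:

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Minimal k-nearest-neighbours classifier: Euclidean distance, majority vote."""
    dists = sorted((math.dist(row, x), label) for row, label in zip(train_X, train_y))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Hypothetical rows: (wife's age, number of children) -> contraceptive-use class.
train_X = [(24, 3), (45, 10), (43, 7), (26, 1), (30, 2)]
train_y = ["no-use", "long-term", "long-term", "no-use", "no-use"]

print(knn_predict(train_X, train_y, (25, 2)))
```

A real pipeline would first perform the cleaning and exploration steps the summary mentions, then evaluate both the KNN and decision-tree models on held-out data.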
This document discusses inferential statistics and different types of samples that can be drawn from a population. Inferential statistics involves making inferences about a population based on a sample. It consists of generalizing from samples to populations, estimating parameters, hypothesis testing, and determining relationships. Two main methods are estimation, which uses a sample to estimate a parameter and construct a confidence interval, and hypothesis testing, which involves assuming and testing a null hypothesis against collected data. Types of samples discussed are simple random samples, systematic samples, and stratified random samples.
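The estimation method described, using a sample to estimate a parameter and construct a confidence interval, can be sketched as follows (normal approximation, invented data):

```python
import math
import statistics

def mean_ci(sample, z=1.96):
    """Approximate 95% confidence interval for the population mean (normal-based)."""
    n = len(sample)
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)   # standard error of the mean
    return m - z * se, m + z * se

sample = [12, 15, 14, 10, 13, 14, 11, 15]
lo, hi = mean_ci(sample)
print(lo, hi)
```

The interval brackets the sample mean; under repeated sampling, roughly 95% of such intervals would cover the true population mean.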
Top 10 Uses Of Statistics In Our Day to Day Life (Stat Analytica)
Do you know the uses of statistics in our daily life? If not, check out this presentation to learn a lot more about them.
This document discusses statistics and their uses in various fields such as business, health, learning, research, social sciences, and natural resources. It provides examples of how statistics are used in starting businesses, manufacturing, marketing, and engineering. Statistics help decision-makers reduce ambiguity and assess risks. They are used to interpret data and make informed decisions. However, statistics also have limitations as they only show averages and may not apply to individuals.
Statistics for the Health Scientist: Basic Statistics I (DrLukeKane)
This document provides an introduction to statistics. It defines key statistical concepts like variables, data, and different types of variables. Descriptive statistics are used to summarize raw data through tables and charts. Different types of charts are described that are suitable for categorical or quantitative variables. The goals are to classify variables, choose appropriate charts and tables, and understand how to describe and communicate data.
This presentation includes an introduction to statistics, introduction to sampling methods, collection of data, classification and tabulation, frequency distribution, graphs and measures of central tendency.
This document provides an overview and objectives of Chapter 1: Introduction to Statistics from an elementary statistics textbook. It covers key statistical concepts like data, population, sample, variables, and the two branches of statistics - descriptive and inferential. Potential pitfalls in statistical analysis like misleading conclusions, biased samples, and nonresponse are also discussed. Examples are provided to illustrate concepts like voluntary response samples, statistical versus practical significance, and interpreting correlation.
1. The document discusses key concepts in statistics including populations, samples, descriptive statistics, inferential statistics, qualitative and quantitative data, and scales of measurement.
2. It provides examples of statistical tests that can be used including one-sample t-tests, two-sample t-tests, paired t-tests, one-way ANOVA tests, and examples of how they can be applied.
3. The guidelines for designing a statistical study are outlined including identifying variables of interest, developing a data collection plan, collecting and describing data, and interpreting results.
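One of the tests listed above, the two-sample t-test, can be sketched in its pooled (equal-variance) form; the group scores are invented:

```python
import math
import statistics

def two_sample_t(a, b):
    """Pooled two-sample t statistic (assumes equal variances in the two groups)."""
    na, nb = len(a), len(b)
    ma, mb = statistics.mean(a), statistics.mean(b)
    # Pooled variance: weighted average of the two sample variances.
    sp2 = ((na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (ma - mb) / math.sqrt(sp2 * (1 / na + 1 / nb))

group_a = [82, 85, 88, 90, 86]
group_b = [78, 80, 83, 81, 79]
print(two_sample_t(group_a, group_b))
```

The statistic would then be compared against a t distribution with na + nb - 2 degrees of freedom to decide significance.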
This document provides an overview of basic statistical concepts including descriptive and inferential statistics, variables and levels of measurement, and methods of data collection and presentation. Descriptive statistics summarize and organize data, while inferential statistics make conclusions about a population based on a sample. There are various methods used to collect both primary and secondary data, including observation, surveys, and existing records. Data is typically presented through tables, diagrams, and graphs. Frequency distributions group and summarize data into classes to aid in analysis and interpretation.
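The frequency-distribution step mentioned above, grouping raw data into classes, might be sketched like this (equal-width classes, invented ages):

```python
def frequency_distribution(data, width):
    """Group raw values into equal-width classes starting at the minimum, and count each class."""
    lo = min(data)
    table = {}
    for x in data:
        start = lo + ((x - lo) // width) * width
        key = (start, start + width)
        table[key] = table.get(key, 0) + 1
    return dict(sorted(table.items()))

ages = [21, 23, 25, 25, 28, 31, 34, 35, 38, 41]
for (lo_c, hi_c), count in frequency_distribution(ages, width=5).items():
    print(f"{lo_c}-{hi_c}: {count}")
```

Grouping like this trades exact values for a compact summary, which is the "aid to analysis and interpretation" the document describes.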
This document discusses statistical significance, power, and effect size in response to a reexamination of reviewer bias. It argues that the power of the bogus study used in the original research was sufficient to detect typical effect sizes found in published research in the Journal of Counseling Psychology. While the median effect size reported in another study was small, the effect size was increasing over time and would correspond to a large effect by the year the current study was conducted. Further examination of the data supports the claim that the bogus study had adequate power to detect published effect sizes.
This document discusses key concepts in statistics including:
- The process of statistics involves identifying a research question, collecting data, organizing and summarizing the data, and drawing conclusions.
- Variables can be qualitative (categorical) or quantitative and data can be qualitative, quantitative, discrete or continuous depending on the variable.
- An example is provided to illustrate the process of statistics and distinguishing between variable types.
This document provides an introduction to biostatistics. It defines biostatistics as the application of statistical tools and concepts to data from biological sciences and medicine. The two main branches of statistics are described as descriptive statistics, which involves organizing and summarizing sample data, and inferential statistics, which involves generalizing from samples to populations. Several key statistical concepts are also defined, including populations, samples, variables, data types, levels of measurement, and common sampling methods. The objectives are to demonstrate knowledge of these fundamental statistical terms and concepts.
- Descriptive statistics are used to describe and summarize key characteristics of a data set.
- They include measures such as counts, means, ranges, and standard deviations.
- Descriptive statistics provide simple summaries about the sample and the measures, but do not make any claims about the population.
- The document provides examples of how descriptive statistics could be used to summarize caseload data from public defender offices.
Data Matrix Of Cpi Data Distribution After Transformation...Kimberly Jones
Here is a draft essay on default risk and ways of identifying it:
Introduction
Default risk refers to the possibility that a borrower will fail to make timely payments on their debt obligations. For lenders and investors, understanding and assessing default risk is crucial. Higher default risk indicates a greater likelihood that the borrower may default, resulting in losses for the lender. This essay will discuss default risk and some key ways that lenders and analysts identify and measure default risk.
Credit Ratings
One of the most common ways default risk is assessed is through credit ratings provided by rating agencies such as Moody's, S&P, and Fitch. These agencies analyze a borrower's financial strength and assign a rating that indicates their
This document discusses quantitative data analysis and descriptive and inferential statistics. It provides the following key points:
- Quantitative data is measured numerically along a scale and reported as scores or other numeric values. Descriptive statistics summarize and describe quantitative data.
- Descriptive statistics include numerical summaries that measure central tendency and variability of data, as well as graphical summaries like histograms. They characterize data without generalizing beyond the sample.
- Inferential statistics allow inferences about populations based on samples. Methods include hypothesis testing, regression, and principle components analysis. Hypothesis testing involves stating and evaluating null and alternative hypotheses using sample data and test statistics.
- Descriptive statistics simply describe or characterize sample data, while inferential
Pg. 05Question FiveAssignment #Deadline Day 22.docxmattjtoni51554
Pg. 05
Question Five
Assignment #
Deadline: Day 22/10/2017 @ 23:59
[Total Mark for this Assignment is 25]
System Analysis and Design
IT 243
College of Computing and Informatics
Question One
5 Marks
Learning Outcome(s):
Understand the need of Feasibility analysis in project approval and its types
What is feasibility analysis? List and briefly discuss three kinds of feasibility analysis.
Question Two
5 Marks
Learning Outcome(s):
Understand the various cost incurred in project development
How can you classify costs? Describe each cost classification and provide a typical example of each category.Question Three
5 Marks
Learning Outcome(s):
System Development Life Cycle methodologies (Waterfall & Prototyping)
There a several development methodologies for the System Development Life Cycle (SDLC). Among these are the Waterfall and System Prototyping models. Compare the two methodologies in details in terms of the following criteria.
Criteria
Waterfall
System Prototyping
Description
Requirements Clarity
System complexity
Project Time schedule
Question Four
5 Marks
Learning Outcome(s):
Understand JAD Session and its procedure
What is JAD session? Describe the five major steps in conducting JAD sessions.
Question Five
5 Marks
Learning Outcome(s):
Ability to distinguish between functional and non functional requirements
State what is meant by the functional and non-functional requirements. What are the primary types of nonfunctional requirements? Give two examples of each. What role do nonfunctional requirements play in the project overall?
# Marks
4 - PRELIMINARY DATA SCREENING
4.1 Introduction: Problems in Real Data
Real datasets often contain errors, inconsistencies in responses or measurements, outliers, and missing values. Researchers should conduct thorough preliminary data screening to identify and remedy potential problems with their data prior to running the data analyses that are of primary interest. Analyses based on a dataset that contains errors, or data that seriously violate assumptions that are required for the analysis, can yield misleading results.
Some of the potential problems with data are as follows: errors in data coding and data entry, inconsistent responses, missing values, extreme outliers, nonnormal distribution shapes, within-group sample sizes that are too small for the intended analysis, and nonlinear relations between quantitative variables. Problems with data should be identified and remedied (as adequately as possible) prior to analysis. A research report should include a summary of problems detected in the data and any remedies that were employed (such as deletion of outliers or data transformations) to address these problems.
4.2 Quality Control During Data Collection
There are many different possible methods of data collection. A psychologist may collect data on personality or attitudes by asking participants to answer questions on a questionnaire..
Presentation is made by the student of M.phil Jameel Ahmed Qureshi Faculty of Education Elsa Kazi campus Hyderabad UoS Jamshoron, This presentation is an assignment assign by the Dr. Mumtaz Khwaja
The document discusses basics of statistics including key concepts like population, sample, parameters, and statistics. It provides definitions for population as the collection of all individuals or items under consideration, and sample as the part of the population selected for a study. Parameters describe unknown characteristics of the population, while statistics describe known characteristics of the sample and are used to infer parameters. The document also distinguishes between descriptive statistics, which summarize and organize data, and inferential statistics, which draw conclusions about populations from samples.
WEEK 7 – EXERCISES Enter your answers in the spaces pr.docxwendolynhalbert
WEEK 7 – EXERCISES
Enter your answers in the spaces provided. Save the file using your last name as the beginning of the file name (e.g., ruf_week6_exercises) and submit via “Assignments.” When appropriate,
show your work
. You can do the work by hand, scan/take a digital picture, and attach that file with your work.
A sports researcher gave a standard written test of eating habits to 12 randomly selected professionals, four each from baseball, football, and basketball. The results were as follows:
Eating Habits Scores
Baseball Players
Football Players
Basketball Players
34
27
35
18
28
44
21
67
47
65
42
61
Is there a difference in eating habits among professionals in the three sports? (Use the .05 significance level.)
a.
Use the five steps of hypothesis testing.
b.
Sketch the distribution involved.
c.
Determine effect size.
2.
To study the effectiveness of treatments for insomnia, a sleep researcher conducted a study with 12 participants.
Four participants were instructed to count sheep (Sheep Condition), four were told to concentrate on their breathing (Breathing Condition), and four were not given any special instructions. Over the next few days, measures were taken of how many minutes it took each participant to fall asleep. The average times for the participants in the Sheep Condition were 14, 28, 27, and 31; for those in the Breathing Condition, 25, 22, 17, and 14; and for those in the control condition, 45, 33, 30, and 41.
Do these results suggest that the different techniques have different effects?
(Use the .05 significance level.)
a.
Use the five steps of hypothesis testing.
b.
Sketch the distribution involved.
c.
Figure the effect size of the study.
d.
Explain your findings (including the logic of comparing within-group to between-group population variance estimates, how each of these is figured, and the
F
distribution).
High school juniors planning to attend college were randomly assigned to view one of four videos about a particular college, each differing according to what aspect of college life was emphasized: athletics, social life, scholarship, or artistic/cultural opportunities. After viewing the videos, the students took a test measuring their desire to attend this college. The results were as follows:
Desire to Attend this College
Athletics
Social Life
Scholarship
Art/Cultural
68
89
74
76
56
78
82
71
69
81
79
69
70
77
80
65
Do these results suggest that the type of activity emphasized in a college film affects desire to attend that college? (Use the .01 significance level.)
a.
Use the five steps of hypothesis testing.
b.
Sketch the distribution involved.
c.
Figure the effect size of the study.
d.
Explain the logic of what you have done to a person who is unfamiliar with the analysis of variance.
A team of psychologists designed a study in which 12 psychiatric patients diagnosed as having generalized anxiety disorder were randomly assigned to one of three new types of th.
This document discusses research on developing an intelligent depression detection system using natural language processing of social media posts. It summarizes several previous studies that have used Facebook and Twitter data to predict depression by analyzing language and behavior. Specifically, some key studies are highlighted that have successfully predicted depression levels based on survey responses and self-reported diagnoses on social media, with prediction accuracy rates up to 89% in some cases. The document also reviews approaches that have used online forum membership and posts to classify mental health conditions. Overall, the research suggests social media can provide insights into users' mental states and has potential for early detection of depression.
This document contains a summary of key statistical concepts and methods for analyzing grouped data. It includes 10 topics: (1) names of group members, (2) acknowledgements, (3) an introduction to statistics, (4) grouped data, (5) mean of grouped data, (6) mode of grouped data, (7) median of grouped data, (8) graphical representation of cumulative distribution, (9) conclusion, and (10) bibliography. The document provides examples and explanations of statistical techniques for summarizing and visualizing grouped data, including calculating the mean, mode, and median from frequency tables and constructing cumulative frequency distributions in ogive graphs.
The document summarizes key concepts from Chapter 1 of the textbook "Elementary Statistics" including:
- The difference between a population and a sample, and how statistics uses samples to make inferences about populations.
- The different types of data: quantitative, categorical, discrete vs. continuous data.
- The different levels of measurement for data: nominal, ordinal, interval, and ratio.
- The importance of critical thinking when analyzing data and statistics, including considering context, sources, sampling methods, and avoiding misleading graphs, samples, conclusions, or survey questions.
This document provides an overview of key concepts from Chapter 1 of the textbook "Elementary Statistics". It defines important statistical terms like population, sample, parameter, and statistic. It also distinguishes between different types of data and levels of measurement. Additionally, it discusses the importance of collecting sample data through appropriate random sampling methods. Critical thinking in statistics is emphasized, highlighting factors like the context, source, and sampling method of data when evaluating statistical claims. Different ways of collecting data through studies and experiments are also introduced.
Complete the Frankfort-Nachmias and Leon-Guerrero (2018) SPSS®.docxbreaksdayle
Complete
the Frankfort-Nachmias and Leon-Guerrero (2018) SPSS® problems and chapter exercises listed below.
Ch. 6: Chapter Exercises 2, 4, 6, 8, and 12
Ch. 7: SPSS® Problem 2
Ch. 7: Chapter Exercises 2, 4, 6, 8, and 12
Include
your answers in a Microsoft® Word document.
Click
the Assignment Files tab to upload your assignment.
Please see Chapter 6 material.
Sampling and Sampling Distributions
Chapter Learning Objectives
Describe the aims of sampling and basic principles of probability
Explain the relationship between a sample and a population
Identify and apply different sampling designs
Apply the concept of the sampling distribution
Describe the central limit theorem
Until now, we have ignored the question of who or what should be observed when we collect data or whether the conclusions based on our observations can be generalized to a larger group of observations. In truth, we are rarely able to study or observe everyone or everything we are interested in. Although we have learned about various methods to analyze observations, remember that these observations represent a fraction of all the possible observations we might have chosen. Consider the following research examples.
Example 1:
The Muslim Student Association on your campus is interested in conducting a study of experiences with campus diversity. The association has enough funds to survey 300 students from the more than 20,000 enrolled students at your school.
Example 2:
Environmental activists would like to assess recycling practices in 2-year and 4-year colleges and universities. There are more than 4,700 colleges and universities nationwide.
1
Example 3:
The Academic Advising Office is trying to determine how to better address the needs of more than 15,000 commuter students, but determines that it has only enough time and money to survey 500 students.
The primary problem in each situation is that there is too much information and not enough resources to collect and analyze it.
Aims of Sampling
2
Researchers in the social sciences rarely have enough time or money to collect information about the entire group that interests them. Known as the
population
, this group includes all the cases (individuals, groups, or objects) in which the researcher is interested. For example, in our first illustration, there are more than 20,000 students; the population in the second illustration consists of 4,700 colleges and universities; and in the third illustration, the population is 15,000 commuter students.
Population
A group that includes all the cases (individuals, objects, or groups) in which the researcher is interested.
Fortunately, we can learn a lot about a population if we carefully select a subset of it. This subset is called a
sample
. Through the process of
sampling
—selecting a subset of observations from the population—we attempt to generalize the characteristics of the larger group (population) based on what we learn from the smaller group (t ...
Complete the Frankfort-Nachmias and Leon-Guerrero (2018) SPSS®.docx
main
Prediction Study on the AddHealth Dataset
James Mason, Joao Carreira, Tomofumi Ogawa, Yannik Pitcan
Abstract
The National Longitudinal Study of Adolescent to Adult
Health (AddHealth) is a longitudinal study of a nation-
ally representative sample of adolescents. The study
was conducted in 4 different waves, the earliest of which
occurred in 1994-1995 (grades 7-12) and the most re-
cent in 2008 (ages 24-32). In each wave an in-home in-
terview was conducted with the goal of understanding
what forces and behaviors may influence adolescents’
health during the lifecourse.
The public-use dataset contains about 4-6K respondents in each wave, and each respondent was surveyed on 2-3K items.
In this work, we leverage the information in the Ad-
dHealth study to build a prediction model for depres-
sion from the information we have from participants
in their adolescent period.
There are several challenges involved in building such
a model. First, the ratio between participants and
survey variables is low (around 6K participants for 2-
3K variables). Second, the survey results are sparse –
many items are left unanswered. Third, as is normal
in a longitudinal study such as this one, some partici-
pants were lost in the later waves of the study. Last, in
the study dataset answers are represented with numer-
ical values. However, for many of the 2-3K items these
values don’t have a numerical meaning (e.g., value 4
means “refused to answer”).
In this report we apply different statistical and ma-
chine learning techniques to predict depression and ad-
dress these challenges.
1 Introduction
AddHealth sampled students in grades 7-12 in 1994-95
in Wave I, then conducted three additional waves of
follow-up. Wave II was in 1996, Wave III was in 2001-
02, and Wave IV was in 2008-09. An additional wave
is planned for 2016-18.
The public-use dataset contains about 4-6K
respondents, depending on which wave, with about 2-
3K variables collected in each. This dataset is survey-
sampled, with some demographic groups oversampled
compared to others. We ignore the external validity
issues presented by this (and thus do not use the sample
weights implied by the sampling design).
Our central interest is prediction. Since the environ-
ment of each person in their adolescence and young-
adulthood might affect their mental health as adults,
we decided to use the variables collected in Wave I to
predict a depression score in Wave IV. The depression
data in Wave IV are a series of categorical frequency
indicators (0=“never” through 3=“most of the time”)
in the “Social Psychology and Mental Health” section;
hence we propose a depression score constructed from
these responses.
One of the major potential problems we might face
in the analysis is that the number of possible predic-
tors in Wave I overwhelms the sample size available
to us, therefore it will be necessary to select the best
predictors to use in our analysis. Another issue has to
do with the problem of doing predictions when many
respondents may be missing data on some fraction of
the large number of predictor variables.
We have decided to approach this problem in the
following way. First, we chose a set of prediction tech-
niques that we thought were promising for a dataset
with these characteristics. Some of the prediction
methods were OLS, Random Forests and Support Vec-
tor Regression. Second, because some of the methods
(e.g., OLS) assume linearity of variables we one-hot en-
coded our input data set. Third, to reduce the number
of predictor variables we applied two dimensionality
reduction techniques: Random Projection and Princi-
pal Component Analysis. Fourth, because these meth-
ods are sensitive to the value of their hyperparameters
we trained these models with cross-validation. Finally,
because we are concerned with how well we are able
to predict depression in new/unseen data we split our
dataset into two sets: a training set and a test set.
We report the error of predictions when our models
(trained with the training set) are used to predict de-
pression of people in the test set.
This report is organized in the following way. In
Section 2 we present in detail the dataset we analyzed
and discuss some of the challenges of analyzing such a
dataset. In Section 3 we describe how we prepared the
dataset for analysis using prediction models. In Sec-
tion 4 we present the methodology we used to build our
prediction models. In Section 5 we compare the quality of the predictions from the various models. Finally,
in Section 6 we summarize our work.
2 Dataset
The data are split into four waves.
Wave I consisted of two stages. Stage 1 was a strat-
ified, random sample of all high schools in the United
States. Stage 2 was an in-home sample of 27,000
teenagers consisting of core samples from each commu-
nity. Some over-samples were also selected. In other
words, an adolescent could qualify for multiple samples. Wave II was identical to the Wave I sample with a few exceptions: (a) it excluded those who were in the 12th grade at Wave I and not part of the genetic sample, (b) it excluded respondents who were only in the Wave I disabled sample, and (c) it added 65 adolescents who were in the genetic sample but not interviewed in Wave I.
If Wave I respondents could be found and inter-
viewed again after six years, they were in the Wave III
sample. Urine and saliva samples were also collected
at this time.
Finally, the Wave IV sample consisted of all original
Wave I respondents. Readings of vitals were also taken.
There were missing (NA) values in the data after collection. The Wave I dataset had the highest proportion, with over five percent of its values being NAs.
3 Data Processing
We wish to predict depression of participants 10 to 15
years after they first participated in the survey.
In AddHealth, depression is measured using ten items,
which respondents answer using a four-point frequency
scale, as described in Section 3.2. The responses to
these ten items are then summarized into a single de-
pression score.
To predict depression we performed the following study. First, we preprocessed the data: 1) we filtered out individuals who were lost to follow-up during the study, 2) we cleaned the data, and 3) we generated, for each participant in Wave IV, a general mental health score that we aim to predict.
Second, we built a linear prediction model regressed
on an expanded design matrix. In this expanded ma-
trix each predictor variable has a binary value indicat-
ing whether for a specific survey item the participant
chose a specific answer. Third, we used cross-validation to choose the hyperparameter values that give the model the smallest prediction error.
Analysis was conducted using R [7] and Python. In
the rest of the section we explain each step in more
detail and explain the reasoning behind our approach.
3.1 Data Processing
Before building a prediction model we cleaned the raw
AddHealth dataset. Two problems we addressed while cleaning the data were filtering out individuals lost to follow-up and removing survey variables unimportant to our prediction task.
To address the first problem, loss to follow-up, we filtered out all participants who took part in Wave I of the study but not in Wave
IV. We intend to analyze this subset of the population
in order to understand if they differ significantly from
the other participants.
Second, some variables in the dataset are likely to
not be important for prediction purposes. For instance,
the survey variables IMONTH, IDAY, and IYEAR indicate
the month, day and year the survey was conducted,
respectively. Such variables were removed and ignored
in the context of our analysis.
3.2 Depression Scores
Our outcome measure is depression, as measured in
Wave IV. The “Social Psychology and Mental Health”
section of Wave IV contains ten items, H4MH18 through
H4MH27, measuring depression. Most of these items
were taken from the Center for Epidemiologic Studies
Depression Scale (CES-D) [8]. These ask “How often
was the following true during the past seven days”
with a rating scale of 0=“never” through 3=“most of the time”. The items were:
• “You were bothered by things that usually don’t
bother you”
• “You could not shake off the blues, even with help
from your family and your friends”
• “You felt you were just as good as other people” *
• “You had trouble keeping your mind on what you
were doing”
• “You felt depressed”
• “You felt that you were too tired to do things”
• “You felt happy” *
• “You enjoyed life” *
• “You felt sad”
• “You felt that people disliked you”
Three items (indicated by *) were positive, and were
reverse-coded before analysis (i.e., 3=“never” through
0=“most of the time”).
To generate a single outcome which could be
modeled as a continuous outcome, there are three
commonly-used methods. First, we could construct sum-scores by adding the codes (0:3) from each item, yielding a score range from 0:30. This would treat each item equally, and treat the spaces between the four levels within each item equally. Second, we
could conduct an Exploratory Factor Analysis (EFA)
and generate factor scores for a single Principal Factor.
This would weight each item differently, but still treat
the spaces between the four levels within those items
equally. Third, we could fit an Item-Response Theory
(IRT) model, which is a kind of latent variable model,
and use the latent variable scores.
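The first of these options can be sketched directly. Below is a minimal illustration of sum-score construction with reverse coding on simulated responses; the mapping of the three positively-worded items to H4MH20, H4MH24, and H4MH27 codes is an assumption made for illustration (the real positions come from the codebook).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
items = [f"H4MH{n}" for n in range(18, 28)]           # the ten depression items

# Simulated responses on the 0-3 frequency scale (stand-in for real data)
df = pd.DataFrame(rng.integers(0, 4, size=(5, 10)), columns=items)

# Assumed positions of the three positively-worded items (illustrative only)
positive = ["H4MH20", "H4MH24", "H4MH25"]
df[positive] = 3 - df[positive]                       # reverse-code: 3="never" ... 0

sum_score = df[items].sum(axis=1)                     # each score lies in 0..30
```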
We chose to use IRT scores, because their distribu-
tion was closer to Normal than sum-scores or EFA
scores. Specifically, we fit Masters’ Partial Credit
Model [5] using the TAM [4] package.
P(X_{pi} = c) = \frac{\exp\left(\sum_{k=0}^{c} (\theta_p - \delta_{ik})\right)}{\sum_{l=0}^{l_i} \exp\left(\sum_{k=0}^{l} (\theta_p - \delta_{ik})\right)}
where Xpi ∈ {0, 1, 2, 3} is the response of person p
to item i, θp is the (latent) depression of person p, δik
is how high on the depression scale a person must be
to answer item i at level k rather than k − 1, and li is
the maximum level for item i (always 3).
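To make the formula concrete, the category probabilities for a single item can be computed directly. This is a sketch, not the TAM implementation, and it adopts the common convention of fixing δ_{i0} = 0, which cancels in the normalization.

```python
import numpy as np

def pcm_probs(theta, deltas):
    """Partial Credit Model category probabilities for one item.

    theta  : latent depression score theta_p of the person
    deltas : step parameters (delta_i1, delta_i2, delta_i3);
             delta_i0 is fixed at 0 by convention
    """
    d = np.concatenate(([0.0], np.asarray(deltas, dtype=float)))
    log_num = np.cumsum(theta - d)     # sum_{k=0}^{c} (theta - delta_ik)
    num = np.exp(log_num)
    return num / num.sum()             # normalize over levels c = 0..3

# Illustrative parameter values
probs = pcm_probs(theta=0.5, deltas=[-1.0, 0.0, 1.0])
```

A higher θ shifts probability mass toward the higher (more depressed) response levels, as the model intends.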
We estimated the model parameters (item param-
eters, and parameters of the θ distribution) using
Marginal Maximum Likelihood. We predicted depression scores θ_p for each individual using Empirical Bayes: Expected a-Posteriori (EAP), with a N(0, σ²) prior.
Although EAP can generate scores for individuals
with missing data on the indicators, we chose not to
generate depression scores for those who answered fewer
than seven of the ten items. This resulted in the loss
of a single observation.
3.3 Predicting on categorical variables
There are 6504 participants and 2794 variables in
Wave I. Most of these variables are categorical,
whereas others are numerical. For instance, item 19 in
Wave I of the study is “(During the past seven days:)
You could not shake off the blues, even with help
from your family and your friends” with the following
answers:
• 0 – ”never or rarely”
• 1 – ”sometimes”
• 2 – ”a lot of the time”
• 3 – ”most of the time or all of the time”
• 4 – ”refused”
For categorical variables like this we cannot use the value of the survey response directly in a linear model, because the numeric codes have no linear meaning. The answer ’refused’
is not indicative of more depression than the answer
’sometimes’.
To address this problem, we decided to expand each
variable in the original dataset into a set of binary val-
ues, one for each possible response to the corresponding
survey item (one-hot encoding). For this item, we gen-
erated 5 variables, one for each of the 5 answers. This
way we are able to pull apart the contribution of each
particular answer to the final prediction.
This procedure expanded our 2794 predictor vari-
ables into 20086 variables.
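The expansion can be sketched with pandas, using hypothetical item columns and codes rather than the real Wave I variable names:

```python
import pandas as pd

# Toy Wave I responses: numeric codes that are really categories
wave1 = pd.DataFrame({"item19": [0, 1, 4, 2],
                      "item20": [3, 3, 0, 1]})

# One binary indicator column per observed answer code (one-hot encoding)
X = pd.get_dummies(wave1, columns=["item19", "item20"])
# item19 expands to item19_0, item19_1, item19_2, item19_4;
# item20 expands to item20_0, item20_1, item20_3
```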
4 Prediction
4.1 Reducing number of predictor vari-
ables
The number of predictor variables largely exceeds the
number of observations. This is an issue especially
when we use regression based models such as OLS and
exacerbated by one-hot encoding. Thus, reducing the
number of predictors is necessary.
To address this problem we have tried two ap-
proaches: Principal Component Analysis (PCA) and
Random Projection.
To perform PCA on the one-hot encoded matrix we
first computed the set of principal components that
accounts for more than 80% of the variance in the
dataset. This step resulted in 658 principal compo-
nents. We then projected our design matrix onto the
subspace spanned by these principal components.
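With scikit-learn this step is direct: passing a float to n_components keeps the smallest number of components whose cumulative explained variance exceeds that fraction. A minimal sketch on simulated data standing in for the one-hot matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))      # stand-in for the one-hot design matrix

pca = PCA(n_components=0.80)        # keep components explaining >= 80% of variance
X_pca = pca.fit_transform(X)        # rows projected onto the retained components
```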
As an alternative to the PCA approach, we decided
to project the one-hot encoded design matrix into a
subspace spanned by random vectors (Random Pro-
jection). By the Johnson-Lindenstrauss lemma [9] we
know that when projecting our data vectors into a
lower-dimensional space the distance between points
is approximately preserved with high probability. This
approach has the advantage of being computationally
faster than other dimensionality reduction alternatives
(e.g., PCA).
To perform this projection we did the following.
First, we created a D × p projection matrix V where
D is a parameter chosen by the user, p is the number of predictor variables in the expanded matrix, and each
entry Vij is a standard normal variable, Vij ∼ N(0, 1).
Then, we projected our data vectors into the space
spanned by V , i.e., we compute
Xtransformed = XV^T.
The n × D projected matrix Xtransformed was used on
several prediction methods explained in detail in the
next subsections.
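A minimal sketch of this projection on simulated data follows; note that some formulations also scale by 1/√D to approximately preserve norms, but we follow the unscaled description above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, D = 100, 2000, 50
X = rng.normal(size=(n, p))              # stand-in for the expanded design matrix

V = rng.standard_normal(size=(D, p))     # D x p matrix with i.i.d. N(0, 1) entries
X_transformed = X @ V.T                  # n x D projected matrix
```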
4.2 Regression Based Model
We employed three regression-based models to predict
depression scores: 1) OLS, 2) preshrunk estimator sug-
gested by Copas [2], and 3) preshrunk estimator with
cross-validated shrinkage K̂. Notice that all three mod-
els require the number of predictor variables to be
smaller than the number of observations and hence re-
duction of predictor variables.
For each of the three models, PCA was performed
where the subset of principal components was chosen
to explain 80% of total variance.
In addition, for each of the three models, Random Projection was used with the size D of the projected matrix chosen by 5-fold cross-validation. We chose candidate sizes beforehand: 5, 10, 25, 50, 100, 150, and 200 for the preshrunk models, with 500, 1000, and 2000 added for OLS. For each candidate D we produced the projected matrix Xtransformed, split the training data into 5 folds, and for each fold used the other 4 folds to build a model and predict the depression scores in the held-out fold. In other words, each fold was used 4 times to help build a model predicting scores in the other folds, and was left out exactly once to obtain its own predicted scores. From these procedures we obtained the mean prediction error on the training set, which we used to decide the best size Dbest.
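The selection of the best projection size can be sketched as follows, using OLS on simulated data; the candidate sizes and toy outcome here are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 400))                   # stand-in one-hot matrix
y = X[:, 0] + rng.normal(scale=0.5, size=300)     # toy outcome

best_D, best_mse = None, np.inf
for D in [5, 10, 25, 50, 100]:                    # candidate projection sizes
    V = rng.standard_normal(size=(D, X.shape[1]))
    Xt = X @ V.T                                  # project, then 5-fold CV
    mse = -cross_val_score(LinearRegression(), Xt, y,
                           cv=KFold(n_splits=5, shuffle=True, random_state=0),
                           scoring="neg_mean_squared_error").mean()
    if mse < best_mse:
        best_D, best_mse = D, mse
```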
4.3 Tree-based prediction
Prediction methods based on OLS have two limita-
tions. First, OLS cannot (directly) take into account
interactions between predictor variables. Second, OLS
imposes a strict condition on the ratio between observations and predictor variables.
Tree-based methods have been successfully used in
classification and regression settings and do not suffer
from these problems. Because of this, we have used two
prediction methods based on decision trees: random
forest regression and XGBoost regression.
4.3.1 Random Forests
Random forests is an ensemble method used for classifi-
cation and regression. Because we are trying to predict
a continuous value (depression) we focus on regression.
Random forest regression works by constructing mul-
tiple decision trees at training time and outputting the
mean of the values predicted by each tree.
We used the sklearn.ensemble.RandomForestRegressor Python class for this task. Because we applied this method to the one-hot encoded design matrix, each decision tree built during this process performs a simple "Yes/No" decision on a random subset of the predictor variables.
Because training of Random Forests can be compu-
tationally expensive we made use of 10 cores to speed
up this process.
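A minimal sketch of this setup, with synthetic 0/1 predictors standing in for the one-hot design matrix; the parameter values are placeholders, and n_jobs=-1 uses all available cores rather than the 10 used in the report:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic 0/1 predictors mimicking a one-hot encoded design matrix
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 30)).astype(float)
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.1, size=200)

# Each tree sees a bootstrap sample and random feature subsets;
# n_jobs parallelizes tree construction across CPU cores
rf = RandomForestRegressor(n_estimators=100, max_depth=10,
                           n_jobs=-1, random_state=0)
rf.fit(X, y)
preds = rf.predict(X)  # prediction = mean over the individual trees
```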
4.3.2 XGBoost
XGBoost [1] is a gradient tree boosting library used for
prediction. Boosted trees methods work by using en-
sembles of decision trees to build predictions. However,
unlike random forests, these methods build ensembles
of decision trees in an incremental fashion, minimiz-
ing the residual errors at each new decision tree. XG-
Boost in particular is an efficient library for performing
boosted trees prediction.
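The xgboost API itself is not reproduced here; the incremental residual-fitting idea can be illustrated with scikit-learn's GradientBoostingRegressor, which likewise fits each new tree to the residuals of the current ensemble (synthetic data with an interaction term that a plain linear model would miss):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
# Target driven by an interaction, which OLS cannot capture directly
y = X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=300)

# Each of the 200 shallow trees is fit to the residuals left by the
# ensemble built so far, shrunk by the learning rate
gbr = GradientBoostingRegressor(n_estimators=200, max_depth=3,
                                learning_rate=0.1, random_state=0)
gbr.fit(X, y)
```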
4.4 Support Vector Regression
Support Vector Regression (SVR) is a prediction method that casts the problem into a convex optimization framework of the form:

    minimize    (1/2) ||w||²
    subject to  y_i − ⟨w, x_i⟩ − b ≤ ε
                ⟨w, x_i⟩ + b − y_i ≤ ε

This formulation can be used to find a hyperplane that approximates the data with an error of at most ε for each data point.
We applied the SVR implementation in the scikit-learn Python library [6] to our prediction problem. Because we use a linear kernel, we chose the LinearSVR Python class. We found this class to be significantly faster than the sklearn.svm.SVR class, because it is built on top of the LIBLINEAR [3] library rather than the slower LIBSVM.
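A minimal LinearSVR sketch on synthetic data; the epsilon and C values below are placeholders rather than the cross-validated ones:

```python
import numpy as np
from sklearn.svm import LinearSVR

# Synthetic linear data standing in for the design matrix and scores
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = X @ rng.normal(size=20) + rng.normal(scale=0.1, size=300)

# epsilon sets the width of the error tube; C penalizes points outside it
svr = LinearSVR(epsilon=0.1, C=1.0, max_iter=10000, random_state=0)
svr.fit(X, y)
preds = svr.predict(X)
```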
4.5 Prediction Models Used
From the strategies above, we selected nine prediction
models to evaluate:
1. Random projection (number of dimensions deter-
mined by cross-validation) followed by OLS,
2. Random projection (number of dimensions deter-
mined by cross-validation) followed by Copas’ pre-
shrunk regression,
3. Random projection (number of dimensions deter-
mined by cross-validation) followed by pre-shrunk
regression where the shrinkage factor was deter-
mined by cross-validation,
4. Principal Component Analysis followed by OLS,
5. Principal Component Analysis followed by Copas’
pre-shrunk regression,
6. Principal Component Analysis followed by pre-
shrunk regression where the shrinkage factor was
determined by cross-validation,
7. Random Forest Regression (using the original fea-
ture matrix) with maximum depth and number of
trees determined by cross-validation,
8. XGBoost Regression (using the original feature
matrix) with maximum depth and minimum child
weight determined by cross-validation,
9. Support Vector Regression (using the original
feature matrix) with C determined by cross-
validation.
In addition, we implemented a random predictor,
which predicted the depression score for an individ-
ual by randomly selecting another individual’s score.
This provides a baseline root mean squared prediction error against which the other models can be compared.
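This baseline can be sketched by permuting held-out scores, so each individual is assigned another individual's score at random (synthetic scores standing in for the IRT values):

```python
import numpy as np

rng = np.random.default_rng(0)
y_test = rng.normal(size=500)  # stand-in for held-out depression scores (SD ~ 1)

# Baseline: predict each score by a randomly chosen other individual's score
baseline_pred = rng.permutation(y_test)
rmspe = np.sqrt(np.mean((y_test - baseline_pred) ** 2))
```

For scores with standard deviation σ, the expected squared error between two independent draws is 2σ², so this baseline's RMSPE is about √2·σ; with σ ≈ 1.06 that gives roughly 1.50, consistent with the baseline of 1.508 reported in Section 5.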
5 Results
5.1 IRT Scores
The range of IRT scores was from about -1.90 to 4.21,
with a mean of about 0 (by design) and a standard de-
viation of about 1.06. The distribution of IRT scores is
presented in Figure 1. Note that since the IRT scores
have a SD of about 1, when evaluating root mean
squared prediction errors, the errors are essentially in
standard deviation units.
5.2 Cross Validation
We used 5-fold cross-validation to determine the best
value for the hyperparameters of the following models:
regression-based, tree-based and support vector regres-
sion.
[Figure: histogram of depression scores (IRT); x-axis: Depression Scores (IRT), y-axis: Frequency]
Figure 1: Distribution of IRT scores for depression
Regression When using Random Projection we
used CV to determine the best choice of D, the number
of dimensions of projection. Since the optimal choice
may vary based on the type of prediction model, this parameter was independently cross-validated for OLS, OLS with Copas shrinkage, and OLS with cross-validated shrinkage. The candidate values for D, along with the associated root mean squared prediction errors, are listed in Table ??.
For OLS, the Dbest was 100, whereas when Copas
or cross-validated shrinkage was used, Dbest was 150.
These values of Dbest were used in subsequent analyses.
Note that over-fitting is a problem to be concerned
with: in OLS with D = 2000, the root mean square
prediction error was almost twice as large as with the
optimal D = 100.
Tree-based methods We used cross-validation
with random forest regression and XGBoost.
For random forest regression we tried the parameters and corresponding values shown in Table 1. For XGBoost regression we tried the parameters and corresponding values shown in Table 2.
Parameter     Description      Values
max_depth     Max tree depth   10, 50, 100, 200
n_estimators  Number of trees  100, 1000, 1500, 2000, 4000, 8000
Table 1: Parameters and values used for cross-validation with Random Forest
Support Vector Regression We have cross-
validated our model using the parameters and respec-
tive values presented in Table 3.
Parameter         Description                                     Values
max_depth         Max tree depth                                  1, 3, 5, 7, 9
min_child_weight  Minimum sum of instance weights (observations)  1, 3, 5, 7, 9
                  required in each leaf node
Table 2: Parameters and values used for cross-validation with XGBoost
Parameter  Description                       Values
epsilon    Error margin                      0.01, 0.1, 0.5, 1, 2, 4, 8, 16
C          Penalty given to data points far  1×10⁻⁸, 1×10⁻⁷, 1×10⁻⁶, 1×10⁻⁵,
           from the optimal hyperplane       1×10⁻⁴, 1×10⁻³, 1×10⁻², 1×10⁻¹,
                                             1, 10, 100
Table 3: Parameters and values used for cross-validation with SVR
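The per-model grid searches summarized in the tables can be sketched with scikit-learn's GridSearchCV; shown here for Random Forest with synthetic data and a deliberately trimmed grid:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] + rng.normal(scale=0.5, size=200)

# Trimmed grid; the study's full candidate lists appear in Table 1
param_grid = {"max_depth": [10, 50], "n_estimators": [100, 200]}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid,
                      cv=5, scoring="neg_root_mean_squared_error")
search.fit(X, y)
best = search.best_params_  # hyperparameters with the lowest CV RMSPE
```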
5.3 Prediction Quality
The prediction results from each prediction method,
using the best settings as determined by cross-
validation, are presented in Table ??.
Considering root mean squared prediction error (RMSPE), the tree-based and support-vector models worked better than the PCA-based models, which in turn worked better than the random-projection-based models.
Among the former, the best predictions were from
Random Forest regression with RMSPE of 0.970, fol-
lowed closely by XGBoost regression with RMSPE of
0.975. Support Vector regression fared slightly worse,
with RMSPE of 0.991.
Among the PCA models, OLS with Copas shrinkage
did the best, with RMSPE of 0.996, followed by OLS
with cross-validated shrinkage factor, with RMSPE of
1.002, and OLS with no shrinkage, with RMSPE of
1.010.
Among Random Projection models, OLS with Co-
pas shrinkage did the best with RMSPE of 1.033, fol-
lowed by OLS with cross-validated shrinkage factor,
with RMSPE of 1.042, and OLS with no shrinkage,
with RMSPE of 1.049.
However, all of these models performed better than
the baseline RMSPE of 1.508 provided by random pre-
diction.
6 Conclusion
In this report we presented a methodology for the pre-
diction of depression in a sparse dataset from an ado-
lescent health study.
We identified challenges in doing prediction using
this dataset and addressed them with a combination
of feature reduction, statistical and machine learning
techniques.
All the prediction methods we used performed better than a random predictor. The machine learning techniques achieved slightly better predictions than the methods based on OLS (0.970 vs. 0.996 RMSPE for the best model of each kind). This might result from their ability to capture interactions between different variables.
Interestingly, the predictions after Random Projec-
tion were in the ballpark of the RMSPE of other predic-
tions without dimensionality reduction. This indicates
that Random Projection can work in practice.
The AddHealth dataset was not constructed with prediction in mind, but rather to take a snapshot of the state of adolescent health in the US. We believe a dataset built for prediction would have taken into account from the outset some of the problems we addressed: the lack of observations, the format of the dataset, and the depression scores. Despite these challenges, we were able to build a predictor of depression scores with an RMSPE of about one standard deviation, significantly better than a random predictor.
References
[1] T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. CoRR, abs/1603.02754, 2016.
[2] J. B. Copas. Regression, prediction and shrinkage.
Journal of the Royal Statistical Society. Series B
(Methodological), 1983.
[3] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang,
and C.-J. Lin. LIBLINEAR: A library for large
linear classification. Journal of Machine Learning
Research, 9:1871–1874, 2008.
[4] T. Kiefer, A. Robitzsch, and M. Wu. TAM: Test
Analysis Modules, 2016. R package version 1.16-0.
[5] G. N. Masters. A Rasch model for partial credit scoring. Psychometrika, 47(2):149–174, 1982.
[6] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 2011.
[7] R Core Team. R: A Language and Environment for
Statistical Computing. R Foundation for Statistical
Computing, Vienna, Austria, 2015.
[8] L. S. Radloff. The CES-D scale: A self-report depression scale for research in the general population. Applied Psychological Measurement, 1(3):385–401, 1977.
[9] W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Conference in Modern Analysis and Probability, 1984.