This document describes an analysis of count data from a study on the detection of anthelmintic resistance in gastrointestinal nematodes of small ruminants. The data consists of egg counts from 30 goats and 30 sheep that were grouped into Albendazole, Ivermectin, and control groups. The data was analyzed using Poisson and negative binomial regression models in R software. The Poisson model did not fit the data well due to overdispersion. However, the negative binomial regression model provided a better fit for the overdispersed data. Key findings from the negative binomial regression analysis are summarized.
This document discusses variance and standard deviation. It defines variance as the average squared deviation from the mean of a data set. Standard deviation measures how spread out numbers are from the mean and is calculated by taking the square root of the variance. The document provides step-by-step instructions for calculating both variance and standard deviation, including examples using test score data.
This document discusses multivariate analysis (MVA), which involves observing and analyzing multiple outcome variables simultaneously. It describes key components of MVA like variates, measurement scales, and statistical significance. Various MVA techniques are explained, including cross correlations, single-equation models, vector autoregressions, and cointegration. An example using crime rate data from US states is provided. Applications of MVA in fields like marketing, quality control, process optimization, and research are also mentioned.
- Confidence intervals provide an estimated range of values that is likely to include an unknown population parameter, such as a mean, with a specified degree of confidence.
- The margin of error depends on the sample size, standard deviation, and confidence level, with a larger sample size and smaller standard deviation yielding a smaller margin of error.
- When the sample size is small, a t-distribution rather than normal distribution is used to construct the confidence interval due to the unknown population standard deviation. The t-distribution is wider than the normal and accounts for additional uncertainty from an unknown standard deviation.
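As a rough illustration of the small-sample case, here is a minimal R sketch of a 95% t-based confidence interval; the sample x is invented purely for the example.

  # 95% t-based confidence interval for a mean (population SD unknown)
  x <- c(12.1, 11.4, 13.0, 12.7, 11.9, 12.4)   # illustrative sample
  n     <- length(x)
  xbar  <- mean(x)
  se    <- sd(x) / sqrt(n)               # estimated standard error of the mean
  tcrit <- qt(0.975, df = n - 1)         # critical value from the t-distribution
  c(lower = xbar - tcrit * se, upper = xbar + tcrit * se)
  # t.test(x)$conf.int returns the same interval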
This document provides an introduction to Poisson regression models for count data. It outlines that Poisson regression can be used to model count variables that have a Poisson distribution. A simple equiprobable model is presented where the expected count is equal across all categories. This equiprobable model establishes a null hypothesis that can be tested using likelihood ratio or Pearson's test statistics. Residual analysis is also discussed. Finally, the document introduces how a covariate can be added to a Poisson regression model to establish relationships between the count variable and explanatory variables.
Residuals represent variation in the data that cannot be explained by the model.
Residual plots are useful for discovering patterns, outliers, or misspecifications of the model. Any systematic patterns discovered may suggest how to reformulate the model.
If the residuals exhibit no pattern, then this is a good indication that the model is appropriate for the particular data.
This document discusses various statistical tests used to analyze categorical data, including contingency tables and chi-square tests. It begins by defining continuous and categorical variables. It then discusses how to represent associations between categorical variables using contingency tables. It explains how to calculate expected frequencies and chi-square values to test for relationships between categorical variables. Finally, it discusses other tests that can be used for contingency tables like Fisher's exact test, McNemar's test, and Yates correction.
The document discusses simple linear regression. It defines key terms like regression equation, regression line, slope, intercept, residuals, and residual plot. It provides examples of using sample data to generate a regression equation and evaluating that regression model. Specifically, it shows generating a regression equation from bivariate data, checking assumptions visually through scatter plots and residual plots, and interpreting the slope as the marginal change in the response variable from a one unit change in the explanatory variable.
This document discusses outliers, including what they are, how they impact regression analysis, potential causes of outliers, methods for detecting outliers, and approaches for dealing with outliers. Outliers are observations that are distant from other observations and can be caused by data errors, sampling issues, or legitimate rare cases. They can negatively influence predictions if not addressed but also sometimes provide important insights. The document reviews techniques for identifying outliers like Mahalanobis distance and for making analyses more robust to outliers such as trimmed means, winsorization, least trimmed squares, and least median of squares methods.
The document provides an overview of regression analysis including:
- Regression analysis is a statistical process used to estimate relationships between variables and predict unknown values.
- The document outlines different types of regression like simple, multiple, linear, and nonlinear regression.
- Key aspects of regression like scatter diagrams, regression lines, and the method of least squares are explained.
- An example problem is worked through demonstrating how to calculate the slope and y-intercept of a regression line using the least squares method.
Multiple regression analysis is one of the most popular regression methods; this document discusses the method along with its applications and purposes.
This document provides an overview of simple linear regression. It defines regression as determining the statistical relationship between variables where changes in one variable depend on changes in another. Regression analysis is used for prediction and exploring relationships between dependent and independent variables. The key aspects covered include:
- Dependent variables change due to independent variables.
- Lines of regression show the relationship between the variables.
- The method of least squares is used to determine the line of best fit that minimizes the error between predicted and actual values.
- Linear regression models take the form y = a + bx and are used for tasks like prediction and determining the impact of independent variables.
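To make the least-squares fit concrete, here is a minimal R sketch with invented data; lm() is shown only to confirm the hand calculation.

  # Fit y = a + bx by least squares
  x <- c(1, 2, 3, 4, 5)
  y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
  b <- cov(x, y) / var(x)        # slope
  a <- mean(y) - b * mean(x)     # intercept (the line passes through the means)
  c(intercept = a, slope = b)
  coef(lm(y ~ x))                # built-in least squares gives the same estimates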
The document introduces the maximum likelihood method (MLM) for determining the most likely cause of an observed result from several possible causes. It provides examples of using MLM to determine the most likely father of a child from potential candidates and the most likely distribution of balls in a box based on the observed colors of balls drawn from the box. MLM involves calculating the likelihood of each potential cause producing the observed result and selecting the cause with the highest likelihood as the most probable explanation.
This document defines variance and standard deviation and provides formulas and examples to calculate them. It states that variance is the average squared deviation from the mean and measures how far data points are from the average. Standard deviation tells how clustered data is around the mean and is the square root of the variance. It provides step-by-step instructions to find variance and standard deviation, including calculating the mean, deviations from the mean, summing the squared deviations, and taking the square root. Worked examples are shown to find the variance and standard deviation of students' test scores and people's heights in a room.
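A minimal R sketch of those steps, using invented test scores; note that R's built-in var() and sd() divide by n - 1 rather than n.

  scores     <- c(70, 80, 85, 90, 95)
  m          <- mean(scores)                        # step 1: the mean
  deviations <- scores - m                          # step 2: deviations from the mean
  pop_var    <- sum(deviations^2) / length(scores)  # average squared deviation
  pop_sd     <- sqrt(pop_var)                       # square root of the variance
  c(variance = pop_var, sd = pop_sd)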
This document discusses sampling and sampling distributions. It begins by explaining why sampling is preferable to a census in terms of time, cost and practicality. It then defines the sampling frame as the listing of items that make up the population. Different types of samples are described, including probability and non-probability samples. Probability samples include simple random, systematic, stratified, and cluster samples. Key aspects of each type are defined. The document also discusses sampling distributions and how the distribution of sample statistics such as means and proportions can be approximated as normal even if the population is not normal, due to the central limit theorem. It provides examples of how to calculate probabilities and intervals for sampling distributions.
Logistic regression is a statistical method used to predict a binary or categorical dependent variable from continuous or categorical independent variables. It generates coefficients to predict the log odds of an outcome being present or absent. The method assumes a linear relationship between the log odds and independent variables. Multinomial logistic regression extends this to dependent variables with more than two categories. An example analyzes high school student program choices using writing scores and socioeconomic status as predictors. The model fits significantly better than an intercept-only model. Increases in writing score decrease the log odds of general versus academic programs.
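A hedged R sketch of a multinomial logistic model of that kind; the data frame hs and its columns prog, write and ses are hypothetical stand-ins, not the original data set.

  library(nnet)
  hs$prog   <- relevel(factor(hs$prog), ref = "academic")  # baseline category
  multi_mod <- multinom(prog ~ write + ses, data = hs)
  summary(multi_mod)      # one set of log-odds coefficients per non-baseline category
  exp(coef(multi_mod))    # odds ratios relative to the academic program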
The document discusses the normal distribution, which produces a symmetrical bell-shaped curve. It has two key parameters - the mean and standard deviation. According to the empirical rule, about 68% of values in a normal distribution fall within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations. The normal distribution is commonly used to model naturally occurring phenomena that tend to cluster around an average value, such as heights or test scores.
The document discusses the chi-square test, which offers an alternative method for testing the significance of differences between two proportions. It was developed by Karl Pearson and follows a specific chi-square distribution. To calculate chi-square, contingency tables are made noting observed and expected frequencies, and the chi-square value is calculated using the formula. Degrees of freedom are also calculated. Chi-square test is commonly used to test proportions, associations between events, and goodness of fit to a theory. However, it has limitations when expected values are less than 5 and does not measure strength of association or indicate causation.
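As a small illustration, the chi-square calculation can be reproduced in R on an invented 2x2 table of observed frequencies.

  tab <- matrix(c(30, 20,
                  10, 40),
                nrow = 2, byrow = TRUE,
                dimnames = list(group = c("A", "B"), outcome = c("yes", "no")))
  chisq.test(tab)                   # 2x2 tables get Yates' continuity correction by default
  chisq.test(tab, correct = FALSE)  # plain Pearson chi-square
  chisq.test(tab)$expected          # expected frequencies under independence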
This document provides an overview of logistic regression, including when and why it is used, the theory behind it, and how to assess logistic regression models. Logistic regression predicts the probability of categorical outcomes given categorical or continuous predictor variables. It relaxes the normality and linearity assumptions of linear regression. The relationship between predictors and outcomes is modeled using an S-shaped logistic function. Model fit, predictors, and interpretations of coefficients are discussed.
Regression is a statistical tool used to predict unknown values of a dependent variable from known values of one or more independent variables. It estimates the average change in the dependent variable given a change in the independent variable(s). There are two regression lines - one with Y as the dependent variable (Y on X) and one with X as the dependent variable (X on Y). The regression equation expresses these lines algebraically. The constants a and b are estimated using the method of least squares, which finds the line that minimizes the vertical differences between actual and estimated Y values. Multiple regression uses more than one independent variable to increase prediction accuracy.
This document provides an overview of statistical tests and hypothesis testing. It discusses the four steps of hypothesis testing, including stating hypotheses, setting decision criteria, computing test statistics, and making a decision. It also describes different types of statistical analyses, common descriptive statistics, and forms of statistical relationships. Finally, it provides examples of various parametric and nonparametric statistical tests, including t-tests, ANOVA, chi-square tests, correlation, regression, and decision trees.
This document discusses confidence intervals, which provide a range of values that is likely to include an unknown population parameter based on a sample statistic. It defines key concepts like confidence level, confidence limits, and factors that determine how to set the confidence interval like sample size, population variability, and precision of values. It explains how larger sample sizes and more precise measurements result in narrower confidence intervals. Applications to clinical trials are discussed, showing how sample size impacts the ability to make definitive recommendations based on trial results.
This document discusses multiple regression analysis and its use in predicting relationships between variables. Multiple regression allows prediction of a criterion variable from two or more predictor variables. Key aspects covered include the multiple correlation coefficient (R), squared correlation coefficient (R²), adjusted R², regression coefficients, significance testing using t-tests and F-tests, and considerations for using multiple regression such as sample size and normality assumptions.
Discriminant analysis is a statistical technique used to classify cases into categories based on a set of predictor variables. It determines which continuous variables discriminate between two or more naturally occurring groups. For example, a researcher could use discriminant analysis to determine which fruit characteristics best predict whether a fruit will be eaten by birds, primates, or squirrels, based on data collected on various fruit properties from each animal group. Discriminant analysis involves estimating parameters, computing discriminant functions to classify new observations, and using cross-validation to estimate misclassification probabilities.
The document defines and explains how to calculate and interpret an odds ratio. An odds ratio is a measure of association used in case-control studies to compare the odds of exposure to a risk factor in cases versus controls. It is calculated by dividing the odds of exposure in cases by the odds of exposure in controls. An odds ratio of 1 indicates no association, while a ratio greater than 1 means the risk factor is associated with higher odds of the health outcome. The document provides an example of using a 2x2 table to calculate the odds ratio to determine if drug abuse is associated with higher odds of having a stroke.
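A minimal R sketch of that calculation with invented counts of exposed and unexposed cases and controls:

  exposed_cases      <- 40
  unexposed_cases    <- 60
  exposed_controls   <- 20
  unexposed_controls <- 80
  odds_cases    <- exposed_cases / unexposed_cases        # odds of exposure among cases
  odds_controls <- exposed_controls / unexposed_controls  # odds of exposure among controls
  odds_cases / odds_controls   # odds ratio = (40 * 80) / (60 * 20) = 2.67, i.e. > 1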
This document discusses various measures of dispersion used to quantify how spread out or varied values in a data set are. It defines dispersion as the difference or deviation of values from the central value. Measures of dispersion described include range, standard deviation, quartile deviation, mean deviation, variance, and coefficient of variation. Both absolute measures, which use numerical variations, and relative measures, which use statistical variations based on percentages, are examined. Relative measures allow for comparison between different data sets.
This document discusses Poisson regression in R. Poisson regression is a type of regression where the response variable is counts, like number of births or wins. The document shows how to create a Poisson regression model in R using the glm() function, specifying the Poisson family. It uses the built-in warpbreaks data to predict the number of warp breaks based on wool type and tension level, and the summary of the model shows that wool type B and higher tension levels have a significant impact on the number of breaks.
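A short sketch of that model, using R's built-in warpbreaks data as described above:

  pois_mod <- glm(breaks ~ wool + tension,
                  family = poisson(link = "log"),
                  data   = warpbreaks)
  summary(pois_mod)     # coefficients are on the log scale
  exp(coef(pois_mod))   # exponentiate for multiplicative effects (rate ratios)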
The document provides an overview of inferential statistics. It defines inferential statistics as making generalizations about a larger population based on a sample. Key topics covered include hypothesis testing, types of hypotheses, significance tests, critical values, p-values, confidence intervals, z-tests, t-tests, ANOVA, chi-square tests, correlation, and linear regression. The document aims to explain these statistical concepts and techniques at a high level.
This presentation covered the following topics:
1. Definition of Correlation and Regression
2. Meaning of Correlation and Regression
3. Types of Correlation and Regression
4. Karl Pearson's methods of correlation
5. Bivariate Grouped data method
6. Spearman's Rank correlation Method
7. Scatter diagram method
8. Interpretation of correlation coefficient
9. Lines of Regression
10. Regression equations
11. Difference between correlation and regression
12. Related examples
Basic Statistics for application in Medical Assessment (Shrushrita Sharma)
- Percentages represent occurrences in proportions of 100 and are calculated by dividing the number of items by the total number and multiplying by 100.
- The mean, median, and mode are measures of central tendency used to understand data distribution. The mean is the average, median is the midpoint, and mode is the most frequent value.
- Standard deviation measures the spread of data around the average and is used to determine if data is normally distributed and identify outliers. Confidence intervals indicate the probable range of a population parameter.
- P-values give the probability of obtaining results at least as extreme as those observed when there is no true effect, with lower p-values indicating stronger evidence of significance. Parametric and non-parametric tests are used to analyze different types of data.
This document discusses various statistical concepts and their applications in clinical laboratories. It defines descriptive statistics, statistical analysis, measures of central tendency (mean, median, mode), measures of variation (variance, standard deviation), probability distributions (binomial, Gaussian, Poisson), and statistical tests (t-test, chi-square, F-test). It provides examples of how these statistical methods are used to monitor laboratory test performance, interpret results, and compare different laboratory instruments and methods.
Confidence Intervals in the Life Sciences Presentation (maxinesmith73660)
Confidence Intervals in the Life Sciences Presentation
Names
Statistics for the Life Sciences STAT/167
Date
Fahad M. Gohar M.S.A.S
Conservation Biology of Bears
Normal Distribution
Standard normal distribution
Confidence Interval
Population Mean
Population Variance
Confidence Level
Point Estimate
Critical Value
Margin of Error
Welcome to the presentation on confidence intervals in the conservation biology of bears.
The team will define the normal distribution and use an example to show why it is important. The standard normal distribution is discussed, along with how it differs from other normal distributions. The confidence interval is defined and related to its use in the conservation biology of bears. We will learn how a confidence interval helps researchers estimate the population mean and population variance. The presenters define a point estimate and explain how a point estimate is found from a confidence interval. The confidence level is defined, with a short explanation of how it relates to the confidence interval. Lastly, the critical value and margin of error are explained with examples from Statdisk.
Normal Distribution
A normal distribution is one in which the mean, median, and mode are the same and values spread away from the mean in the proportions given by the empirical rule. Not all data sets have every measure of central tendency, since some have no value that occurs more than once, but every data set has a mean and a median. The mean is only appropriate for interval and ratio data, while the median can be used with interval, ratio, and ordinal data. The median is preferred when there are many outliers, and the mean when there are few.
The normal distribution is continuous and has only two parameters, the mean and the variance. The mean can be any real number and the variance any positive number, so there are infinitely many normal distributions. You want your sample data to represent the population distribution, because claims made from the sample's distribution are meant to apply to the entire population.
Some examples in the business world: pharmaceutical companies model average blood pressure with normal distributions and can make medicine that will help the majority of people with high blood pressure. A company can also model its average production time using the normal distribution; several statistics can be calculated from it, and hypothesis tests can be carried out with the normal distribution that models the average time.
Our chosen life science is bears. The age of the bears can be modeled by a normal distribution, and it is important to monitor because it tells us the average age of the bears and can tell us a lot about the population. If the mean is high and the standard deviation ...
- Multinomial logistic regression predicts categorical membership in a dependent variable based on multiple independent variables. It is an extension of binary logistic regression that allows for more than two categories.
- Careful data analysis including checking for outliers and multicollinearity is important. A minimum sample size of 10 cases per independent variable is recommended.
- Multinomial logistic regression does not assume normality, linearity or homoscedasticity like discriminant function analysis does, making it more flexible and commonly used. It does assume independence between dependent variable categories.
applied multivariate statistical techniques in agriculture and plant science 2 (Amir Rahmani)
This document provides an overview of multivariate statistical techniques that can be used in agriculture and plant science research. It discusses multiple linear regression analysis, which models the relationship between a dependent variable and one or more explanatory variables. The document explains how to determine regression coefficients and test their significance using analysis of variance. It also describes different variable selection techniques for multiple regression like backward elimination, forward selection, and stepwise regression. The goal is to help researchers identify the best predictive model and determine which variables are most important when the number of predictors increases.
ExcelR is a proud partner of Universiti Malaysia Sarawak (UNIMAS), Malaysia's 1st public university, ranked 8th among Malaysian universities and among the top 200 in the QS Asian University Rankings 2017. Participants will be awarded an international Data Science certification from UNIMAS.
Offset regression accounts for exposure variables in Poisson regression models. It incorporates an exposure variable, such as time or number of opportunities, using an offset option which adds the log of the exposure to the regression equation. This allows comparison of event counts that may have different exposures like awards counted over different time periods. Negative binomial regression can be used when the variance of count data is greater than the mean, indicating overdispersion. It has a dispersion parameter that allows the variance to differ from the mean. Zero inflated regression models are used when there are excess zeros in count data compared to a standard Poisson or negative binomial distribution.
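A hedged R sketch of these ideas; the data frame df with columns events, exposure and x is hypothetical, not taken from the document.

  library(MASS)   # glm.nb()
  # Poisson rate model: log(exposure) enters as an offset with coefficient fixed at 1
  rate_mod <- glm(events ~ x + offset(log(exposure)), family = poisson, data = df)
  # Negative binomial model for overdispersed counts; theta is the dispersion parameter
  nb_mod <- glm.nb(events ~ x + offset(log(exposure)), data = df)
  # For excess zeros, pscl::zeroinfl(events ~ x | 1, data = df) fits a zero-inflated model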
This document compares several dimension reduction techniques for survival analysis when there are many covariates: principal component analysis (PCA), partial least squares (PLS), and three variants of random matrices (RM) based on Johnson-Lindenstrauss embeddings. It simulates 5,000 datasets using the accelerated failure time model and determines the total bias error and mean-squared error between the true and estimated survivor curves for each method. The results indicate that PCA outperforms PLS, the RMs are comparable, and the RMs outdo both PCA and PLS.
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri... (cambridgeWD)
Clinical trials and health outcomes research differ in important ways that impact statistical modeling approaches. Clinical trials typically use homogeneous samples and focus on a single endpoint, while health outcomes data is heterogeneous with multiple endpoints. Predictive modeling techniques used in health outcomes research, like those in SAS Enterprise Miner, are better suited than traditional methods as they can handle complex real-world data without strong assumptions and more accurately predict rare events. Validation of models on separate test data is also important for generalizing results.
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri... (cambridgeWD)
This document discusses the differences between clinical trials and health outcomes research. Clinical trials use homogeneous samples, surrogate endpoints, and focus on a single outcome. They are also typically underpowered for rare events. Health outcomes research uses heterogeneous data from the general population to examine multiple real endpoints simultaneously. It has larger samples and data that allow analysis of rare occurrences. Predictive modeling is better suited than traditional statistical methods for analyzing heterogeneous health outcomes data due to relaxed assumptions like normality.
This document discusses processing and analyzing data. It defines processing as editing, coding, classifying, and tabulating raw data. Analysis is categorized as descriptive or inferential. Descriptive analysis studies distributions through measures like mean, median and correlation, while inferential analysis determines relationships through regression and hypothesis testing. Multivariate analysis simultaneously analyzes more than two variables using techniques like multiple regression, discriminant analysis, and ANOVA. Proper data analysis requires understanding concepts like sampling, standard error, and estimation to make valid statistical inferences.
This document provides an overview of Module 5 on sampling distributions. It discusses key concepts like parameters vs statistics, sampling variability, and sampling distributions. It explains that the sampling distribution of a sample mean is a normal distribution with a mean equal to the population mean and standard deviation equal to the population standard deviation divided by the square root of the sample size. The central limit theorem states that as the sample size increases, the distribution of sample means will approach a normal distribution regardless of the shape of the population distribution. The module also covers binomial distributions for sample counts and proportions.
This study evaluated the performance of bootstrap confidence intervals for estimating slope coefficients in Model II regression with three or more variables. Simulation studies were conducted for different correlation structures between variables, sampling from both normal and lognormal distributions. The results showed that bootstrap intervals provided less than the nominal 95% coverage. Scenarios with strong relationships between variables produced better coverage, while scenarios with weaker relationships and bias produced poorer coverage, even with larger sample sizes. Future work could explore additional scenarios and alternative interval methods to improve accuracy of confidence intervals in Model II regression.
The document discusses using R to analyze data from the NHANES dataset. Density estimation revealed the age variable was bimodal. Linear discriminant analysis and classification trees were used to predict class variables with mixed results. Support vector machines better predicted insulin use with a polynomial kernel than a linear kernel.
This document discusses the normal distribution curve, also called the bell curve or normal curve. It describes several key properties of the normal distribution including that it is symmetrical around the mean, the area under the curve sums to 1, and most values cluster around the mean. The normal distribution is important because many natural phenomena and psychological variables follow this pattern. Statistical tests often assume a normal distribution of data, and the empirical rule can be used to determine what percentage of values fall within a given number of standard deviations from the mean for a normal distribution. The document provides guidance on checking if a dataset follows a normal distribution.
This document discusses data transformation techniques for statistical analysis. It explains that if measurement data is not normally distributed or has unequal variances, transformation is necessary. It then outlines steps to test for normality in SPSS. The document focuses on three common transformations: logarithmic for count data with a wide range, square root for rare count events, and arcsine for proportional or percentage data to make distributions normal. Examples and formulas are provided for each transformation.
This document summarizes a research article that proposes a new three-parameter generalized beta-Poisson dose-response model for quantitative microbial risk assessment. The model allows for the minimum number of organisms required to cause infection to be a random variable, rather than fixed at one organism as in traditional single-hit beta-Poisson models. The researchers use an approximate Bayesian computation algorithm to estimate parameters for the new model by fitting it to four experimental dose-response data sets from previous studies. The results show that while the new model may better characterize some dose-response processes, it did not significantly improve fit to three of the four data sets, possibly due to small sample sizes. The generalized model provides a way to investigate dose-response mechanisms
This document discusses descriptive statistics and exploratory data analysis. It defines descriptive statistics as procedures for summarizing quantitative data in a clear way, while exploratory data analysis involves examining data to understand its characteristics. The document outlines common descriptive statistics like the mean, median, mode, standard deviation, and frequency distributions. It also discusses examining distributions, central tendency, dispersion, and using SPSS to calculate descriptive statistics.
- Sampling distribution describes the distribution of sample statistics like means or proportions drawn from a population. It allows making statistical inferences about the population.
- The central limit theorem states that sampling distributions of sample means will be approximately normally distributed regardless of the population distribution, if the sample size is large.
- Standard error measures the amount of variability in values of a sample statistic across different samples. It is used to construct confidence intervals for population parameters.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag... (Sameer Shah)
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
ADDIS ABABA UNIVERSITY
COLLEGE OF VETERINARY MEDICINE AND AGRICULTURE
Assignment for the course “Advanced Biostatistics” ON ANALYSIS OF COUNT DATA
By Walkite Furgasa Chala (DVM), ID No.: GSR/2792/10
Submitted to:
Samson Leta (DVM, MSc, Assistant Professor)
December, 2017
Bishoftu, Ethiopia
Table of Contents
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
SUMMARY
1. INTRODUCTION
2. STATISTICAL TESTS TO ANALYZE COUNT DATA
2.1 Poisson Regression
2.2 Negative Binomial Regression
2.3 Zero Inflated Regression
3. ANALYSIS OF THE COUNT DATA
3.1 Source of Data
3.2 Types of Variables of the Data
3.3 Poisson Regression Analysis and Its Interpretation
3.4 Negative Binomial Regression Analysis and Its Interpretation
4. REFERENCES
LIST OF FIGURES
Figure 1. Q-Q plot of Poisson regression analysis
Figure 2. Q-Q plot of negative binomial regression analysis
LIST OF ABBREVIATIONS
AIC Akaike Information Criterion
EPG Eggs per gram of faeces
GLM Generalized linear model
IRR Incidence rate ratio
NBREG Negative binomial regression model
ZINB Zero inflated negative binomial model
ZIP Zero inflated Poisson model
SUMMARY
In statistics, count data is a statistical data type in which the observations can take only non-negative integer values. Count models are a subset of discrete response regression models; count data are distributed as non-negative integers, are intrinsically heteroskedastic, right skewed, and have a variance that increases with the mean. An individual piece of count data is often termed a count variable. When such a variable is treated as a random variable, the Poisson and negative binomial distributions are commonly used to represent its distribution, and if there are excess zeros, zero-inflated regression models are used. The objective of this assignment was to present and analyze a count data set using R software. The title of the data is “Detection of Anthelmintic Resistance in Gastrointestinal Nematodes of Small Ruminants in Haramaya University Farms”. Sheep and goats infected with gastrointestinal nematodes were selected; 30 goats and 30 sheep were taken. The goats and sheep were divided into an Albendazole group (10), an Ivermectin group (10), and a control group (10). Eggs were counted before and after treatment in the treated groups, and counted twice in the control group in parallel with the treated groups. The change in egg count was taken from the treated groups, and the second egg count from the control group, for this assignment. The data were analyzed with R software using Poisson regression and negative binomial regression models. The Poisson model did not fit the data: the overdispersion test gave strong evidence of overdispersion (c estimated at 872.046), which argues strongly against the assumption of equidispersion (c = 0). The chi-square goodness-of-fit test (pchisq) also returned a p-value of approximately 0, indicating that the model does not fit the data. The normal quantile plot likewise indicates that the errors are not normally distributed. In general, since almost all assumptions were violated, the goodness-of-fit checks show that the Poisson model is not appropriate. The dispersion test indicated that the data are overdispersed, but the negative binomial regression model fit the data. The results were therefore interpreted on the basis of the negative binomial regression.
Keywords: Analysis, count data
1. INTRODUCTION
In statistics, count data is a statistical data type in which the observations can take only the
non-negative integer values {0, 1, 2, 3, ...}, and where these integers arise from counting.
The statistical treatment of count data is distinct from that of binary data, in which the
observations can take only two values, usually represented by 0 and 1, and from ordinal data,
which may also consist of integers but where the individual values fall on an arbitrary scale
and only the relative ranking is important (Cameron and Trivedi, 2013).
Count models are a subset of discrete response regression models. Count data are distributed
as non-negative integers, are right skewed, and have a variance that increases with the mean.
Examples of count data include the length of hospital stay, the number of a certain species of
fish per defined area of ocean, the number of flashes displayed by fireflies over a specified
time period, and the classic cases of the number of deaths or the number of thunderstorms in
a calendar year. An individual piece of count data is often termed a count variable. When such
a variable is treated as a random variable, the Poisson and negative binomial distributions are
commonly used to represent its distribution (Cameron and Trivedi, 1986).
Graphical examination of count data may be aided by the use of data transformations chosen
to have the property of stabilising the sample variance. In particular, the square root
transformation might be used when the data can be approximated by a Poisson
distribution (although other transformations have modestly improved properties), while an
inverse sine (arcsine) transformation is available when a binomial distribution is preferred
(Hilbe, 2011b).
2. STATISTICAL TESTS TO ANALYZE COUNT DATA
2.1 Poisson Regression
The Poisson distribution can form the basis for some analyses of count data and in this
case Poisson regression may be used. This is a special case of the class of generalized linear
models which also contains specific forms of model capable of using the binomial
distribution (binomial regression, logistic regression) or the negative binomial distribution
where the assumptions of the Poisson model are violated, in particular when the range of
count values is limited or when overdispersion is present(Hilbe, 2011a).
A key feature of the Poisson model is the equality of the mean and variance functions. When
the variance of a Poisson model exceeds its mean, the model is termed overdispersed.
Simulation studies have demonstrated that overdispersion is indicated when the Pearson
χ² dispersion is greater than 1.0. The dispersion statistic is defined as the Pearson χ² divided
by the model residual degrees of freedom. Overdispersion, common to most Poisson models,
biases the parameter estimates and fitted values. When Poisson overdispersion is real, and
not merely apparent, a count model other than Poisson is required (Hilbe, 2008).
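As an illustration of the dispersion statistic described above, the following short R sketch (using simulated data, not the assignment data) fits a Poisson GLM and computes the Pearson χ² statistic divided by the residual degrees of freedom; values well above 1.0 point to overdispersion.
# Minimal sketch with simulated (overdispersed) counts, not the assignment data
set.seed(1)
x <- rnorm(200)
y <- MASS::rnegbin(200, mu = exp(1 + 0.5 * x), theta = 1.5)  # overdispersed counts
fit <- glm(y ~ x, family = poisson)
# Pearson chi-square dispersion statistic
sum(residuals(fit, type = "pearson")^2) / fit$df.residual    # well above 1 here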
Poisson regression is the basic model on which a variety of count models are based. It is
derived from the Poisson probability mass function. The Poisson regression model is the
benchmark model for count data in much the same way as the normal linear model is the
benchmark for real-valued continuous data(Cameron and Trivedi, 1986).
The Poisson model is simple, and it is robust. If the only interest of the analysis lies in
estimating the parameters of a log-linear mean function, there is hardly any reason (except
for efficiency) to ever contemplate anything other than the Poisson regression model. In
fact, its applicability extends well beyond the traditional domain of count data: the Poisson
regression model can be used for any constant-elasticity mean function, whether or not the
dependent variable is a count, and there are good reasons why it should be preferred over
the more common log transformation of the dependent variable. And yet, there are instances
where the Poisson regression model is unsuited. Essentially, the Poisson model is always
overly restrictive when it comes to estimating features of the population other than the mean,
such as the variance or the probability of single outcomes.
The Poisson distribution has a positive mean µ. Although a GLM can model a positive mean
using the identity link, it is more common to model the log of the mean. Like the linear
predictor α + βx, the log mean can take any real value. The log mean is the natural parameter
for the Poisson distribution, and the log link is the canonical link for a Poisson GLM. A
Poisson loglinear GLM assumes a Poisson distribution for Y and uses the log link. The
Poisson loglinear model with explanatory variable X is log µ = α + βx. For this model, the
mean satisfies the exponential relationship µ = exp(α + βx) = e^α (e^β)^x. A one-unit
increase in x has a multiplicative impact of e^β on µ: the mean at x + 1 equals the mean at x
multiplied by e^β.
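A one-line numerical check of this multiplicative interpretation, with arbitrary illustrative values α = 1 and β = 0.3 (chosen purely for this example):
alpha <- 1; beta <- 0.3                 # arbitrary values for illustration
mu <- function(x) exp(alpha + beta * x) # mean function of the loglinear model
mu(3) / mu(2)                           # ratio of means one unit apart ...
exp(beta)                               # ... equals exp(beta), about 1.35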
In some contexts, the Poisson distribution describes the number of events that occur in a
given time period, where its mean µ is the average number of events per period. It has the
unusual feature that its mean equals its variance. Its probability mass function is
Pr(Y = y) = e^(−µ) µ^y / y!, for y = 0, 1, 2, …, where e is the base of the natural logarithms
and y! is the factorial of y. The skewness of the Poisson distribution is 1/√µ and the kurtosis
is 3 + 1/µ, so that for large µ the distribution approaches the Normal N(µ, µ), with skewness
of zero and kurtosis of three (Christopher, 2010).
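These properties are easy to check numerically; the sketch below simulates Poisson counts and compares the sample mean and variance, and verifies that R's dpois() agrees with the probability mass function written above.
mu <- 4
y  <- rpois(1e5, lambda = mu)
mean(y); var(y)                                           # both close to mu = 4
k  <- 0:10
all.equal(dpois(k, mu), exp(-mu) * mu^k / factorial(k))   # TRUE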
2.2 Negative Binomial Regression
A limitation of the Poisson distribution is the equality of its mean and variance. One may often
observe count data processes where this equality is not reasonable, in particular where the
conditional variance is larger than the conditional mean. This is termed overdispersion, and its
presence renders the assumption of a Poisson distribution for the error process untenable. It is
particularly likely to occur in the case of unobserved heterogeneity. In this circumstance, a
reasonable alternative is negative binomial regression. The negative binomial is a conjugate
mixture distribution for count data. The negative binomial (NB) distribution is a two-parameter
distribution. For a positive integer n, it is the distribution of the number of failures that occur in a
sequence of trials before n successes have occurred, where the probability of success in each trial
is p; the distribution is in fact defined for any positive n. The negative binomial distribution arises
as a mixture of the Poisson distribution with a Gamma distribution for its mean. Unlike the
Poisson, which is fully characterised by its mean µ, the NB distribution is a function of both µ and
a dispersion parameter α. Its mean is still µ, but its conditional variance is µ(1 + α), so that as
α → 0 the distribution reduces to the Poisson distribution (Christopher, 2010).
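The mean–variance relationship of the negative binomial can also be checked by simulation. The sketch below uses MASS::rnegbin(), which is parameterised by µ and θ with variance µ + µ²/θ (a different but equivalent way of writing the extra-Poisson variation described above).
library(MASS)
set.seed(2)
y <- rnegbin(1e5, mu = 4, theta = 2)  # simulated negative binomial counts
mean(y)                               # close to mu = 4
var(y)                                # close to mu + mu^2/theta = 4 + 16/2 = 12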
2.3 Zero Inflated Regression
In many studies count data may contain an excess of zeros. If the data consist of non-negative,
highly skewed counts with a large proportion of zeros, zero-inflated Poisson (ZIP),
zero-inflated negative binomial (ZINB) and hurdle models are useful for analysing such data.
Zero counts may not arise from the same process as the positive counts, and zero-inflated
count data typically do not have equality of mean and variance, so over-dispersion (or
under-dispersion) needs to be taken into account (Lambert, 1992).
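A hedged sketch of how such models can be fitted in R, assuming the pscl package is available; the outcome y and covariate x below are simulated purely for illustration and are not the assignment data.
library(pscl)
set.seed(3)
x <- rnorm(300)
y <- ifelse(runif(300) < 0.3, 0, rpois(300, exp(0.5 + 0.4 * x)))  # counts with excess zeros
zip  <- zeroinfl(y ~ x | 1, dist = "poisson")  # zero-inflated Poisson (ZIP)
zinb <- zeroinfl(y ~ x | 1, dist = "negbin")   # zero-inflated negative binomial (ZINB)
AIC(zip, zinb)                                 # compare the two fits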
3. ANALYSIS OF THE COUNT DATA
3.1. Source of Data and Its Description
The data come from my DVM thesis, which was, at the time of writing, at the galley proof stage
of the East African Journal of Veterinary and Animal Science (Walkite et al., 2017). The title of
the study is “Detection of Anthelmintic Resistance in Gastrointestinal Nematodes of Small
Ruminants in Haramaya University Farms”. Sheep and goats infected with gastrointestinal
nematodes were selected, and 30 goats and 30 sheep were taken. The goats and sheep were each
grouped into an Albendazole group (10), an Ivermectin group (10) and a control group (10). Eggs
were counted before and after treatment in the treated groups, and eggs were also counted twice
in the control group in parallel with the treated groups. The change in egg count was taken for
the treated groups and the second egg count was taken for the control group for this assignment
(Walkite et al., 2017).
Table 1. The raw data of the assignment

No.  ID    Age        Species  Sex     Treatment    EPG
1    1546  >3yrs      goat     male    Albendazole  1050
2    1595  <-1yrs     goat     male    Albendazole  2500
3    1612  2yrs-3yrs  goat     male    Albendazole  2800
4    1599  <-1yrs     goat     male    Albendazole  1450
5    1576  2yrs-3yrs  goat     male    Albendazole  9050
6    1593  <-1yrs     goat     male    Albendazole  2300
7    1609  <-1yrs     goat     male    Albendazole  1050
8    1608  <-1yrs     goat     female  Albendazole  650
9    1526  2yrs-3yrs  goat     female  Albendazole  1850
10   1605  <-1yrs     goat     female  Albendazole  2350
11   63    2yrs-3yrs  goat     female  Ivermectin   400
12   42    2yrs-3yrs  goat     female  Ivermectin   3300
13   110   2yrs-3yrs  goat     male    Ivermectin   5750
14   111   2yrs-3yrs  goat     male    Ivermectin   4900
15   28    2yrs-3yrs  goat     male    Ivermectin   1800
16   1425  2yrs-3yrs  goat     male    Ivermectin   1100
17   80    2yrs-3yrs  goat     male    Ivermectin   2200
18   96    2yrs-3yrs  goat     male    Ivermectin   1050
19   72    2yrs-3yrs  goat     female  Ivermectin   350
20   87    2yrs-3yrs  goat     female  Ivermectin   1500
21   1536  >3yrs      goat     female  control      2550
22   1543  >3yrs      goat     female  control      1600
23   1580  >3yrs      goat     male    control      2250
24   13    >3yrs      goat     male    control      350
25   68    >3yrs      goat     male    control      2800
26   6     >3yrs      goat     male    control      3450
27   5     >3yrs      goat     male    control      700
28   21    >3yrs      goat     male    control      600
29   31    >3yrs      goat     female  control      1000
30   259   >3yrs      goat     female  control      700
31   106   <-1yrs     sheep    male    Albendazole  300
32   13    2yrs-3yrs  sheep    female  Albendazole  2050
33   237   2yrs-3yrs  sheep    female  Albendazole  400
34   42    2yrs-3yrs  sheep    female  Albendazole  200
35   95    >1yrs      sheep    male    Albendazole  5100
36   190   <-1yrs     sheep    male    Albendazole  250
37   148   >3yrs      sheep    male    Albendazole  1550
38   89    >3yrs      sheep    female  Albendazole  1150
39   158   >3yrs      sheep    male    Albendazole  1500
40   187   >3yrs      sheep    female  Albendazole  2100
41   109   2yrs-3yrs  sheep    male    Ivermectin   1100
42   5     2yrs-3yrs  sheep    female  Ivermectin   350
43   110   >3yrs      sheep    male    Ivermectin   500
44   168   >1yrs      sheep    female  Ivermectin   1200
45   120   >yrs       sheep    male    Ivermectin   2350
46   20    2yrs-3yrs  sheep    male    Ivermectin   300
47   83    1yrs       sheep    male    Ivermectin   1850
48   60    2yrs-3yrs  sheep    female  Ivermectin   2100
49   14    2yrs-3yrs  sheep    female  Ivermectin   900
50   909   >3yrs      sheep    male    Ivermectin   800
51   6     >3yrs      sheep    female  control      1350
52   218   >3yrs      sheep    female  control      1150
53   11    2yrs-3yrs  sheep    female  control      350
54   86    2yrs-3yrs  sheep    male    control      1200
55   220   2yrs-3yrs  sheep    female  control      1350
56   217   >3yrs      sheep    female  control      150
57   147   2yrs-3yrs  sheep    male    control      550
58   15    2yrs-3yrs  sheep    male    control      1200
59   2     2yrs-3yrs  sheep    female  control      1350
60   9     >3yrs      sheep    female  control      1350
3.2. Types of Variables of the Data
EPG is the count response variable, and sex, species, age and treatment are the explanatory
variables.
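Before modelling, it can help to confirm these variable types explicitly; the sketch below (assuming the data frame walkite_Assignment_ has already been loaded as in Section 3.3) declares the explanatory variables as factors.
str(walkite_Assignment_)   # EPG should be numeric; the other variables categorical
walkite_Assignment_$age       <- factor(walkite_Assignment_$age)
walkite_Assignment_$species   <- factor(walkite_Assignment_$species)
walkite_Assignment_$sex       <- factor(walkite_Assignment_$sex)
walkite_Assignment_$treatment <- factor(walkite_Assignment_$treatment)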
3.3. Poisson Regression Analysis and Its Interpretation
# Load the data (file name as used in the R session below)
library(readxl)
walkite_Assignment_ <- read_excel("~/walkite Assignment .xlsx")
attach(walkite_Assignment_)
names(walkite_Assignment_)
View(walkite_Assignment_)

# Fit a Poisson regression of EPG on age, species, sex and treatment
nematode <- glm(EPG ~ factor(age) + factor(species) + factor(sex) + factor(treatment),
                family = "poisson", data = walkite_Assignment_)
nematode
summary(nematode)

# Coefficients and incidence rate ratios (IRR)
coef <- coefficients(nematode)
coef
IRR <- exp(coefficients(nematode))
IRR

# Predicted values and residual error
pred <- predict(nematode, type = "response")  # predicted values
pred
res <- residuals(nematode, type = "deviance") # deviance residuals
res
qqnorm(res, plot.it = TRUE)
qqline(res)

# Evaluating the goodness of fit of the Poisson regression model
pchisq(nematode$deviance, df = nematode$df.residual, lower.tail = FALSE)

# Test for overdispersion (AER package)
library(AER)
dispersion <- dispersiontest(nematode, trafo = 1)
dispersion
###################################################
library(readxl)
> walkite_Assignment_ <- read_excel("~/walkite Assignment .xlsx")
> View(walkite_Assignment_)
> attach(walkite_Assignment_)
The following object is masked _by_ .GlobalEnv:
age
The following objects are masked from walkite_Assignment_ (pos = 3):
age, EPG, ID, no,, sex, species, treatment
The following objects are masked from walkite_Assignment_ (pos = 4):
age, EPG, ID, no,, sex, species, treatment
The following objects are masked from walkite_Assignment_ (pos = 12):
age, EPG, ID, no,, sex, species, treatment
> names(walkite_Assignment_)
[1] "no," "ID" "age" "species" "sex" "treatment"
[7] "EPG"
> View(walkite_Assignment_)
>nematode<-glm(EPG~factor(age)+factor(species)+factor(sex)+factor(treatment),family =
"poisson",data = walkite_Assignment_)
> nematode
Call: glm(formula = EPG ~ factor(age) + factor(species) + factor(sex) +
factor(treatment), family = "poisson", data = walkite_Assignment_)
Coefficients:
(Intercept) factor(age)>3yrs
7.452148 0.005106
factor(age)2yrs-3yrs factor(species)sheep
0.308401 -0.500520
factor(sex)male factor(treatment)control
0.393118 -0.356036
factor(treatment)Ivermectin
-0.307651
Degrees of Freedom: 59 Total (i.e. Null); 53 Residual
Null Deviance: 64280
Residual Deviance: 49460 AIC: 50000
> summary(nematode)
Call:
glm(formula = EPG ~ factor(age) + factor(species) + factor(sex) +
factor(treatment), family = "poisson", data = walkite_Assignment_)
Deviance Residuals:
Min 1Q Median 3Q Max
-41.835 -28.155 -6.764 14.689 78.557
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 7.452148 0.009425 790.677 <2e-16 ***
factor(age)>3yrs 0.005106 0.010897 0.469 0.639
factor(age)2yrs-3yrs 0.308401 0.009429 32.708 <2e-16 ***
factor(species)sheep -0.500520 0.006708 -74.620 <2e-16 ***
factor(sex)male 0.393118 0.006936 56.681 <2e-16 ***
factor(treatment)control -0.356036 0.009612 -37.039 <2e-16 ***
factor(treatment)Ivermectin -0.307651 0.008409 -36.586 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 64280 on 59 degrees of freedom
Residual deviance: 49456 on 53 degrees of freedom
AIC: 50005
Number of Fisher Scoring iterations: 5
> coef <- coefficients(nematode)
> coef
(Intercept) factor(age)>3yrs
7.452147624 0.005106036
factor(age)2yrs-3yrs factor(species)sheep
0.308400816 -0.500519597
factor(sex)male factor(treatment)control
0.393117562 -0.356036156
factor(treatment)Ivermectin
-0.307651001
> IRR <- exp(coefficients(nematode))
> IRR
(Intercept) factor(age)>3yrs
1723.5607335 1.0051191
factor(age)2yrs-3yrs factor(species)sheep
1.3612465 0.6062156
factor(sex)male factor(treatment)control
> pchisq(nematode$deviance,df=nematode$df.residual,lower.tail = FALSE)
[1] 0
Interpretation: In this result the p-value is essentially zero, which is significant and indicates lack of
fit. The significance of this goodness-of-fit test shows that there is overdispersion and that the
Poisson model does not fit the data.
> library(AER)
> dispersion <- dispersiontest(nematode,trafo=1)
> dispersion
Overdispersion test
data: nematode
z = 4.2675, p-value = 9.884e-06
alternative hypothesis: true alpha is greater than 0
sample estimates:
alpha
871.0029
The result of the overdispersion test indicates that there is evidence of overdispersion (α is estimated
to be about 871.00), which speaks quite strongly against the assumption of equidispersion (α = 0). Since
almost all of the assumptions were violated, the goodness-of-fit test indicates that the Poisson model
does not fit, and the ‘dispersiontest’ shows the data to be overdispersed. The normal quantile plot also
indicates that the errors are not normally distributed. Thus, it is better to turn to
Negative Binomial Regression.
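As a cross-check (a sketch, assuming the fitted ‘nematode’ model above is still in the workspace), the Pearson χ² dispersion statistic can be computed directly; a value far above 1.0 confirms the overdispersion reported by dispersiontest().
# Pearson chi-square divided by residual degrees of freedom
sum(residuals(nematode, type = "pearson")^2) / nematode$df.residual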
3.4. Negative Binomial Regression Analysis and Its Interpretation
# Negative binomial regression (MASS package)
library(MASS)
NBREG <- glm.nb(EPG ~ factor(age) + factor(species) + factor(sex) + factor(treatment),
                data = walkite_Assignment_)
NBREG
summary(NBREG)

# Checking the model assumption: compare Poisson and negative binomial fits
library(lmtest)
lrtest(nematode, NBREG)

# Coefficients and incidence rate ratios (IRR)
coef <- coefficients(NBREG)
coef
IRR <- exp(coefficients(NBREG))
IRR

# Predicted values and residual error
pred <- predict(NBREG, type = "response")   # predicted values
pred
res <- residuals(NBREG, type = "deviance")  # deviance residuals
res
qqnorm(res, plot.it = TRUE)
qqline(res)
################################
> library(MASS)
>NBREG<-glm.nb(EPG~factor(age)+factor(species)+factor(sex)+factor(treatment),data =
walkite_Assignment_)
> NBREG
Call: glm.nb(formula = EPG ~ factor(age) + factor(species) + factor(sex) +
factor(treatment), data = walkite_Assignment_, init.theta = 1.923394949,
link = log)
Coefficients:
(Intercept) factor(age)>3yrs
7.59952 -0.08562
factor(age)2yrs-3yrs factor(species)sheep
-0.22734653 -0.55289232 0.71465995 0.46009582 -1.21473337 -0.11852987
55 56 57 58 59 60
0.47377501 -1.85712317 -1.04996037 -0.11852987 0.47377501 0.71465995
> qqnorm(res, plot.it = TRUE)
> qqline(res)
>
The normal quantile plot indicates that the errors are approximately normally distributed; thus
the negative binomial regression fits the data.
Interpretation:
The interpretation should be based on the negative binomial regression analysis because the
Poisson model does not fit the data. In the negative binomial regression analysis above,
‘Albendazole’ (treatment), ‘female’ (sex), ‘<-1yrs’ (age) and ‘goat’ (species) were used as
the reference levels. Sex and age have a statistically non-significant effect on EPG count, and
the control group also has a non-significant effect on EPG count, while species has a significant
effect. The reduction factor associated with the Ivermectin drug is (exp(-0.21070) - 1) * 100 =
-18.998. Even though there is a reduction in EPG count, Ivermectin has a non-significant effect
on EPG count because its p-value is 0.3991. This indicates that the parasites are resistant, or
that the efficacy of the drug is poor. In general, since the control group (p-value = 0.2661) has
a non-significant effect on EPG count, resistance of the parasites to both Albendazole and
Ivermectin was detected.
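For reference, the percentage change quoted above can be reproduced from any coefficient with a one-liner (the value -0.21070 for Ivermectin is taken from the negative binomial output discussed above):
(exp(-0.21070) - 1) * 100   # about -18.998, i.e. roughly a 19% reduction in expected EPG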
4. REFERENCES
Cameron, A.C., Trivedi, P.K., 1986. Econometric models based on count data: comparisons and
applications of some estimators and tests. Journal of Applied Econometrics 1, 29-53.
Cameron, A.C., Trivedi, P.K., 2013. Regression Analysis of Count Data. Cambridge University
Press.
Christopher, B., 2010. Models for Count Data and Categorical Response Data.
Hilbe, J.M., 2008. Brief overview on interpreting count model risk ratios: an addendum to
Negative Binomial Regression. Cambridge University Press, Cambridge.
Hilbe, J.M., 2011a. Modeling count data. In: International Encyclopedia of Statistical Science.
Springer, pp. 836-839.
Hilbe, J.M., 2011b. Negative Binomial Regression. Cambridge University Press.
Lambert, D., 1992. Zero-inflated Poisson regression, with an application to defects in
manufacturing. Technometrics 34, 1-14.
Walkite, F., Negesse, M., Anwar, H., 2017. Detection of anthelmintic resistance in
gastrointestinal nematode parasites in small ruminants in Haramaya University farms. East
African Journal of Veterinary and Animal Science, pp. 13-19.