This document discusses correlated data structures and methods for analyzing correlated binary outcome data, specifically generalized estimating equations (GEE) and generalized linear mixed models (GLMM). It begins with examples of correlated data and an overview of GEE and GLMM. It then compares GEE and GLMM, noting that GEE makes population-level inferences while GLMM allows for individual-level inferences. The document concludes by stating that both GEE and GLMM can be applied to genome-wide association studies (GWAS) to account for genetic correlations.
Introduction to Principal Component Analysis (PCA) - Mohammed Musah
This document provides an introduction to principal component analysis (PCA), outlining its purpose for data reduction and detecting structure. It defines each principal component as a linear combination of weighted observed variables. The procedure section discusses assumptions such as normality, homoscedasticity, and linearity that are evaluated prior to PCA. Requirements for performing PCA include variables measured at the metric (interval or ratio) level, a sufficient sample size and cases-to-variables ratio, and adequate correlations between the variables.
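The computation summarized above can be sketched for the two-variable case, where the eigenvalues of the covariance matrix (the principal component variances) have a closed form. This is a minimal stdlib-only Python illustration; the data values are fabricated:

```python
import math

# Fabricated two-variable dataset for illustration
x = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1]
y = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Sample covariance matrix [[sxx, sxy], [sxy, syy]]
sxx = sum((xi - mx) ** 2 for xi in x) / (n - 1)
syy = sum((yi - my) ** 2 for yi in y) / (n - 1)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)

# Eigenvalues of a 2x2 symmetric matrix in closed form
mean_diag = (sxx + syy) / 2
d = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
l1, l2 = mean_diag + d, mean_diag - d  # variances of PC1 and PC2

explained = l1 / (l1 + l2)  # proportion of total variance kept by PC1
print(f"PC1 explains {explained:.1%} of total variance")
```

Because the two variables are strongly correlated, the first component captures most of the total variance, which is exactly the data-reduction purpose the summary describes.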
A two-way ANOVA analyzes the influence of two independent variables on a single dependent variable. It tests for main effects of each independent variable as well as an interaction between them. The independent variables are categorical and the dependent variable is measured on an interval or ratio scale. The test compares sums of squares and mean squares to determine whether the means of observations grouped by each factor differ significantly. An example tests the effect of gender and age on test scores, with gender and age as independent variables and test score as the dependent variable.
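For a balanced design, the sums of squares just described can be computed directly. A stdlib-only Python sketch using fabricated gender/age test-score data (three replicates per cell):

```python
# Balanced 2x2 two-way ANOVA on fabricated test scores
cells = {
    ("M", "young"): [78, 82, 80],
    ("M", "old"):   [70, 74, 72],
    ("F", "young"): [88, 86, 90],
    ("F", "old"):   [80, 84, 82],
}
genders, ages, reps = ["M", "F"], ["young", "old"], 3

def mean(vals):
    return sum(vals) / len(vals)

grand = mean([s for v in cells.values() for s in v])

def level_mean(idx, level):
    return mean([s for key, v in cells.items() if key[idx] == level for s in v])

# Main effects: squared deviations of factor-level means from the grand mean
ss_gender = sum(len(ages) * reps * (level_mean(0, g) - grand) ** 2 for g in genders)
ss_age = sum(len(genders) * reps * (level_mean(1, a) - grand) ** 2 for a in ages)
# Interaction: cell-mean variation not explained by the two main effects
ss_cells = sum(reps * (mean(v) - grand) ** 2 for v in cells.values())
ss_inter = ss_cells - ss_gender - ss_age
# Error: within-cell variation
ss_error = sum((s - mean(v)) ** 2 for v in cells.values() for s in v)

f_gender = (ss_gender / 1) / (ss_error / 8)  # df: (2 - 1) and (12 - 4)
print(f"SS gender={ss_gender:.1f}, age={ss_age:.1f}, "
      f"interaction={ss_inter:.1f}, F(gender)={f_gender:.1f}")
```

Each F ratio (mean square for a factor over the error mean square) is then compared to an F distribution with the corresponding degrees of freedom.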
This document discusses inferential statistics and hypothesis testing. It begins by explaining the difference between descriptive and inferential statistics, and how inferential statistics are used to make inferences about populations based on data collected from samples. It then discusses key concepts in hypothesis testing including the null hypothesis, type I and type II errors, significance, confidence intervals, and p-values. Examples are provided to illustrate hypothesis testing and how to determine the appropriate statistical test to use based on the variables. Common parametric and non-parametric tests are also outlined.
This document provides information on using binary logistic regression and linear regression in SPSS. It explains that binary logistic regression is used to predict a dichotomous dependent variable, like yes/no outcomes. Examples are provided of using it to predict vaccination status and surgical infection complications. Linear regression is described as appropriate when the dependent variable is continuous. The document offers tips for conducting and interpreting the analyses in SPSS, like transforming categorical variables, checking for significance of models, and interpreting coefficient outputs. Examples are presented of using both techniques to analyze factors influencing surgical infection rates and Babesia infection in dogs.
Data Science - Part IV - Regression Analysis & ANOVA - Derek Kane
This lecture provides an overview of linear regression analysis, interaction terms, ANOVA, optimization, and log-level and log-log transformations. The first practical example centers on the Boston housing market, while the second dives into business applications of regression analysis for a supermarket retailer.
The document discusses various statistical tests used for hypothesis testing, including parametric and non-parametric tests. Parametric tests like the z-test and t-test assume a normal distribution, while non-parametric tests like the chi-square test, sign test, and Mann-Whitney test make fewer assumptions. The z-test specifically compares a sample mean to a hypothesized population mean for large samples or when the population variance is known.
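The one-sample z-test described above is straightforward to compute by hand: standardize the gap between the sample mean and the hypothesized mean, then read a p-value off the standard normal CDF. A minimal Python sketch with fabricated numbers, using the error function for the normal CDF:

```python
import math

# One-sample z-test: sample mean vs. hypothesized population mean,
# assuming the population standard deviation is known (numbers fabricated)
n = 50
xbar, mu0, sigma = 103.2, 100.0, 10.0

z = (xbar - mu0) / (sigma / math.sqrt(n))
# Two-sided p-value via the standard normal CDF, Phi(t) = (1 + erf(t/sqrt(2))) / 2
p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
print(f"z = {z:.3f}, p = {p:.4f}")
```

With p below 0.05, the null hypothesis that the population mean equals mu0 would be rejected at the usual 5% level.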
Logistic regression is a statistical method used to predict a binary or categorical dependent variable from continuous or categorical independent variables. It generates coefficients to predict the log odds of an outcome being present or absent. The method assumes a linear relationship between the log odds and the independent variables. Multinomial logistic regression extends this to dependent variables with more than two categories. An example analyzes high school students' program choices using writing scores and socioeconomic status as predictors. The model fits significantly better than an intercept-only model. Increases in writing score decrease the log odds of choosing the general program relative to the academic program.
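The coefficient-fitting step described above (predicting log odds from a predictor) can be sketched with plain gradient ascent on the binary-outcome log-likelihood. The data and learning rate below are fabricated for illustration; real analyses would use a fitted routine such as iteratively reweighted least squares:

```python
import math

# Fabricated binary outcomes against a single continuous predictor
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

# Fit log-odds(y=1) = b0 + b1*x by gradient ascent on the log-likelihood
b0 = b1 = 0.0
lr = 0.01  # step size chosen small enough for stable convergence
for _ in range(20000):
    resid = [y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)]
    b0 += lr * sum(resid)
    b1 += lr * sum(r * x for r, x in zip(resid, xs))

print(f"log-odds = {b0:.2f} + {b1:.2f}*x")
```

The fitted b1 is positive here, so higher x raises the log odds (and hence the probability) of the outcome, mirroring the sign interpretation in the summary.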
The document discusses techniques for imputing missing data (NA values) in R. It introduces common imputation methods like MICE, missForest, and Hmisc. MICE creates multiple imputations using chained equations to account for uncertainty, while missForest uses random forests to impute missing values. Hmisc offers functions to impute missing values using methods like mean, regression, and predictive mean matching. The goal is to understand missing data, learn imputation methods, and choose the best approach for a given dataset.
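The simplest of the methods mentioned, mean imputation, fits in a few lines (sketched in Python rather than R, with None standing in for NA; MICE and missForest are far more sophisticated, producing multiple imputations or forest-based predictions):

```python
# Mean imputation for a numeric column with missing values
# (None plays the role of R's NA in this sketch; data fabricated)
ages = [23, None, 31, 27, None, 35]

observed = [a for a in ages if a is not None]
mean_age = sum(observed) / len(observed)
imputed = [a if a is not None else mean_age for a in ages]
print(imputed)
```

Mean imputation preserves the column mean but shrinks its variance and ignores imputation uncertainty, which is precisely the weakness that multiple-imputation methods like MICE address.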
The document discusses mixed models, which contain both fixed and random effects. Fixed effects have all levels of interest included in the study, while random-effect levels are a random sample from a larger population. The mixed model is written Y = Xβ + Zγ + ε, where X and Z are the design matrices for the fixed and random effects, β are the fixed-effect coefficients, γ are the random effects, and ε is the error term. Mixed models can model both fixed and random effects, account for correlation in the errors, and handle missing data. They provide correct standard errors in settings where general linear models (GLMs) do not. Model fitting involves likelihood ratio tests and information criteria to select the best-fitting model.
The General Linear Model is an ANOVA procedure in which the calculations are performed using the least-squares regression approach to describe the statistical relationship between one or more predictors and a continuous response variable. Predictors can be factors and covariates. More information on the General Linear Model is available at: http://www.transtutors.com/homework-help/statistics/general-linear-model.aspx
This document provides an overview of logistic regression, including when and why it is used, the theory behind it, and how to assess logistic regression models. Logistic regression predicts the probability of categorical outcomes given categorical or continuous predictor variables. It relaxes the normality and linearity assumptions of linear regression. The relationship between predictors and outcomes is modeled using an S-shaped logistic function. Model fit, predictors, and interpretations of coefficients are discussed.
This document provides an overview of simple linear regression. It defines regression as determining the statistical relationship between variables where changes in one variable depend on changes in another. Regression analysis is used for prediction and exploring relationships between dependent and independent variables. The key aspects covered include:
- Dependent variables change due to independent variables.
- Lines of regression show the relationship between the variables.
- The method of least squares is used to determine the line of best fit that minimizes the error between predicted and actual values.
- Linear regression models take the form of y = a + bx and are used for tasks like prediction and determining impact of independent variables.
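The least-squares line of best fit described in the bullets has a closed form: the slope is b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)², and the intercept is a = ȳ - b·x̄. A quick Python sketch with fabricated data:

```python
# Least-squares fit of y = a + b*x (data fabricated for illustration)
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Slope: covariance of x and y over variance of x
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx  # intercept: the line passes through (x-bar, y-bar)
print(f"y = {a:.2f} + {b:.2f}x")
```

These are exactly the values that minimize the sum of squared vertical distances between the data points and the fitted line.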
Logistic regression is a statistical model used to predict binary outcomes like disease presence/absence from several explanatory variables. It is similar to linear regression but for binary rather than continuous outcomes. The document provides an example analysis using logistic regression to predict risk of HHV8 infection from sexual behaviors and infections like HIV. The analysis found HIV and HSV2 history were associated with higher odds of HHV8 after adjusting for other variables, while gonorrhea history was not a significant independent predictor.
This document provides an overview of logistic regression. It begins by explaining that linear regression is not appropriate when the dependent variable is dichotomous. Logistic regression uses an S-shaped logistic function to model the probability of each outcome. The logit transformation expresses those probabilities as log odds, which are linear in the predictors and can therefore be modeled with regression techniques. Examples demonstrate how logistic regression can be used to predict the probability of coronary heart disease from age and to analyze the relationship between patient satisfaction and residence.
Pearson Correlation, Spearman Correlation & Linear Regression - Azmi Mohd Tamil
This document discusses correlation and linear regression. It defines correlation as a statistic that measures the strength and direction of the linear relationship between two continuous variables. Positive correlation indicates that as one variable increases, so does the other. Negative correlation means the variables are inversely related. Linear regression can be used to predict a continuous outcome variable based on a continuous predictor variable using the regression equation y=a+bx. The regression line minimizes the sum of squared differences between the data points and the line. The slope coefficient b indicates the strength of the linear prediction and can be tested for significance.
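The significance test mentioned above uses the statistic t = r·√((n − 2)/(1 − r²)), compared against a t distribution with n − 2 degrees of freedom. A stdlib-only Python sketch with fabricated data:

```python
import math

# Pearson correlation between two continuous variables, plus the t statistic
# for testing whether r differs from zero (data fabricated for illustration)
xs = [4, 8, 12, 16, 20]
ys = [5, 10, 18, 20, 26]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
sxx = sum((x - mx) ** 2 for x in xs)
syy = sum((y - my) ** 2 for y in ys)

r = sxy / math.sqrt(sxx * syy)
t = r * math.sqrt((n - 2) / (1 - r ** 2))  # compare to t distribution, df = n - 2
print(f"r = {r:.3f}, t = {t:.2f}")
```

A large |t| relative to the critical t value for n − 2 degrees of freedom indicates the linear relationship is unlikely to be due to chance.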
Chapter 6 part 2 - Introduction to Inference - Tests of Significance, Stating Hyp... - nszakir
Mathematics, Statistics, Introduction to Inference, Tests of Significance, The Reasoning of Tests of Significance, Stating Hypotheses, Test Statistics, P-values, Statistical Significance, Test for a Population Mean, Two-Sided Significance Tests and Confidence Intervals
The document discusses how to choose the appropriate statistical test based on the characteristics of the data. It outlines several key considerations for selecting a test, including the number and type of variables, whether the data is paired or independent, and whether the continuous variables follow a normal distribution. The document then describes many commonly used statistical tests for different types of comparisons, including one-sample, bivariate, and multivariate tests. It emphasizes that the correct statistical test must be applied to ensure valid conclusions can be drawn from the data analysis.
The Kolmogorov-Smirnov test is a nonparametric test used to compare a sample distribution to a reference distribution. It can be used to test whether two underlying probability distributions differ. The test statistic D is calculated as the maximum distance between the empirical distribution functions of the two samples. If the calculated D value is greater than the critical value from a table, the null hypothesis that the samples are from the same distribution is rejected. An example calculates D for student interest in different academic streams and rejects the null hypothesis since D is greater than the critical value, indicating a difference in interest levels across streams.
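The test statistic D described above is simply the largest vertical gap between the two empirical distribution functions. A short Python sketch with fabricated samples:

```python
# Two-sample Kolmogorov-Smirnov statistic D: the maximum gap between
# the empirical CDFs of two samples (data fabricated for illustration)
sample1 = [1.2, 1.9, 2.4, 3.1, 3.8, 4.4]
sample2 = [2.8, 3.3, 4.1, 4.9, 5.6, 6.2]

def ecdf(sample, t):
    """Empirical CDF: fraction of the sample at or below t."""
    return sum(1 for v in sample if v <= t) / len(sample)

# The maximum gap occurs at one of the observed data points
points = sorted(sample1 + sample2)
D = max(abs(ecdf(sample1, t) - ecdf(sample2, t)) for t in points)
print(f"D = {D:.3f}")
# Reject "same distribution" if D exceeds the critical value from a KS table
```

The decision rule then matches the summary: compare D to the tabulated critical value for the two sample sizes and reject the null hypothesis when D is larger.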
This document discusses key concepts in research methods and biostatistics, including hypothesis testing, random error, p-values, and confidence intervals. It explains that hypothesis testing involves determining if study findings reflect chance or a true effect. The p-value represents the probability of observing results as extreme or more extreme than what was observed by chance alone. A p-value less than 0.05 indicates statistical significance. Confidence intervals provide a range of values that are likely to contain the true population parameter.
This document discusses approaches for handling missing data in statistical analyses. It begins by contrasting an ideal scenario with no missing data against the real-world problem of missing data. Common types of missing data are defined, including missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). Complete case analysis is described as biased except for large samples with little missing data. Alternative approaches like multiple imputation are presented along with their advantages and disadvantages. An example using indicator variables to handle missing clinic data in a regression is provided. Issues like creating indicators for each missing covariate are also noted.
The document discusses significance tests and their role in hypothesis testing. It defines key terms like p-value, significance level, confidence level, rejection region, and classification of significance tests. The p-value represents the probability of observing the results by chance if the null hypothesis is true. The significance level is set before data collection and represents the probability of incorrectly rejecting the null hypothesis. A p-value less than the significance level leads to rejecting the null hypothesis.
Chi Square test for independence of attributes / Testing association between two categorical variables, Chi-Square test for Goodness of fit / Testing significant difference between observed and expected frequencies
The document provides an overview of multiple linear regression (MLR). MLR allows predicting a dependent variable from multiple independent variables. It extends simple linear regression by incorporating additional predictors. Key points covered include: purposes of MLR for explanation and prediction; assumptions of the method; interpreting R-squared values; comparing unstandardized and standardized regression coefficients; and testing the statistical significance of predictors.
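The estimation step behind MLR can be sketched via the normal equations (XᵀX)b = Xᵀy, solved here with Gaussian elimination so the example stays stdlib-only. The two-predictor dataset is fabricated, generated near y = 1 + 2·x1 + 1·x2 with small noise:

```python
# Multiple linear regression via the normal equations (X'X) b = X'y
X = [[1, 2.0, 1.0],   # leading 1 = intercept column
     [1, 3.0, 2.0],
     [1, 5.0, 2.0],
     [1, 7.0, 3.0],
     [1, 9.0, 5.0]]
y = [6.1, 9.2, 13.0, 17.9, 24.1]

k = len(X[0])
XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]

# Gaussian elimination with partial pivoting on the augmented system
A = [XtX[i] + [Xty[i]] for i in range(k)]
for col in range(k):
    pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
    A[col], A[pivot] = A[pivot], A[col]
    for r in range(col + 1, k):
        f = A[r][col] / A[col][col]
        for c in range(col, k + 1):
            A[r][c] -= f * A[col][c]
b = [0.0] * k
for i in reversed(range(k)):
    b[i] = (A[i][k] - sum(A[i][j] * b[j] for j in range(i + 1, k))) / A[i][i]

print("coefficients (intercept, b1, b2):", [round(v, 3) for v in b])
```

Each unstandardized coefficient is interpreted as in the summary: the expected change in y per unit change in that predictor, holding the other predictors constant.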
The chi-square test is used to determine if there is a relationship between two categorical variables in two or more independent groups. It can be used when data is arranged in a contingency table with observed and expected frequencies. A sample problem demonstrates how to calculate chi-square by finding the difference between observed and expected counts, squaring these differences, dividing by the expected counts, and summing across all cells. Degrees of freedom and critical values from tables determine whether to reject or fail to reject the null hypothesis of independence. Larger tables can be partitioned into subtables to identify where differences lie. Guidelines are provided for when chi-square or Fisher's exact test should be used based on sample size and expected cell counts.
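The calculation walked through above (observed minus expected, squared, divided by expected, summed over cells) is easy to sketch for a 2x2 contingency table with fabricated counts:

```python
# Chi-square test of independence for a 2x2 contingency table (counts fabricated)
observed = [[30, 10],
            [20, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand  # expected count under independence
        chi2 += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)
print(f"chi2 = {chi2:.2f} with {df} df")
# Compare to the critical value (3.84 for df=1 at alpha = 0.05)
```

As the summary notes, this approximation needs adequate expected counts per cell; with small expected counts, Fisher's exact test is preferred.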
This document provides an overview of logistic regression analysis. It introduces the need for logistic regression when the dependent variable is binary. Key concepts covered include the logistic regression model, interpreting the beta coefficients, assessing goodness of fit using various tests and metrics, and an example of fitting a logistic regression line to predict burger purchasing based on a customer's age. Students are instructed to use statistical software to estimate a logistic regression model and interpret the results.
This document discusses kappa statistics, which measure interrater reliability beyond chance agreement. Kappa statistics are useful when multiple raters are interpreting subjective data, such as radiology images. The kappa formula compares the observed agreement between raters to the agreement expected by chance. Examples show how to calculate kappa when two raters assess whether a biomarker is present or absent in samples. Confidence intervals for kappa are constructed using the standard normal multiplier 1.96, giving a 95% confidence level.
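The two-rater present/absent case described above can be sketched directly from the four cells of the agreement table. The counts below are fabricated, and the standard error uses a common simple approximation:

```python
import math

# Two raters score 100 samples as biomarker present/absent (counts fabricated)
# a: both say present, d: both say absent, b and c: the two disagreement cells
a, b, c, d = 40, 9, 6, 45
n = a + b + c + d

po = (a + d) / n  # observed proportion of agreement
# Chance agreement expected from each rater's marginal present/absent rates
pe = ((a + b) / n) * ((a + c) / n) + ((c + d) / n) * ((b + d) / n)
kappa = (po - pe) / (1 - pe)

# Approximate standard error and 95% CI using the normal multiplier 1.96
se = math.sqrt(po * (1 - po) / (n * (1 - pe) ** 2))
ci = (kappa - 1.96 * se, kappa + 1.96 * se)
print(f"kappa = {kappa:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

A kappa near 0 means agreement no better than chance, while values approaching 1 indicate near-perfect agreement beyond chance.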
The document provides an overview of inferential statistics. It defines inferential statistics as making generalizations about a larger population based on a sample. Key topics covered include hypothesis testing, types of hypotheses, significance tests, critical values, p-values, confidence intervals, z-tests, t-tests, ANOVA, chi-square tests, correlation, and linear regression. The document aims to explain these statistical concepts and techniques at a high level.
It is widely agreed that complex diseases are typically caused by the joint effects of multiple genetic variations rather than a single variation. Multi-SNP interactions, also known as epistatic interactions, have the potential to provide information about the causes of complex diseases, building on GWAS studies that look at associations between single SNPs and phenotypes. However, epistatic analysis methods are computationally expensive and, being command-line based, have limited accessibility for biologists wanting to analyse GWAS datasets. Here we present APPistatic, a prototype desktop pipeline for epistatic analysis of GWAS datasets. This application combines ease of use, via a GUI, with accelerated implementations of the BOOST and FaST-LMM epistatic analysis methods.
Association mapping using local genealogies - mailund
This document summarizes the process of association mapping through local genealogies to locate disease-affecting polymorphisms. It discusses how markers that are locally correlated within cases and controls can be used to search for indirect signals and perform multi-marker indirect association mapping. The ancestral recombination graph is used to determine a local tree for each point on the chromosome, showing how local phylogenies can connect genetic variations to disease phenotypes through ancestral relationships.
This document introduces a crash course in probability theory and statistics for machine learning. It explains that probability theory is needed to mathematically model uncertainty and randomness, and to quantify aspects like uncertainty in data and conclusions, the goodness of a model when confronted with data, and expected error and success rates. Probability theory can provide the mathematical foundations for these important machine learning concepts.
This document provides an overview of linkage in genetics. It discusses how Sutton and Boveri proposed the chromosomal theory of inheritance, where genes located on the same chromosome tend to be inherited together. It then summarizes Bateson and Punnett's pioneering experiments in 1906 that first discovered linkage between genes controlling flower color and pollen shape in sweet peas. The document outlines different phases and types of linkage, how linkage can be detected through test crosses, and the significance of linkage for plant breeding and genetics research.
This document summarizes genetic variation projects like HapMap and 1000 Genomes that aimed to catalog common human genetic variants. It describes types of variation like SNPs and how factors like selection and recombination influence their distribution. It provides overviews of the HapMap and 1000 Genomes projects, including their goals, populations studied, methods, and data formats. The information from these projects can be used to study traits, diseases, and human history.
This document introduces different measures used to calculate linkage disequilibrium (LD) between two loci, including D, D', r, and r2. It provides the steps and equations to calculate each measure using an example with two SNPs, each with two alleles. The key measures are D, D', and r2. D represents the deviation from expected haplotype frequencies under linkage equilibrium. D' standardizes D between 0-1. r2 is the squared correlation coefficient and can be used to test for statistically significant LD between loci. An example calculation is shown to illustrate the application of these measures.
The idea of chromosomal Linkage. It starts with understanding the Mendel's law of segregation and Independent assortment and later discusses why certain traits does not follows 9:3:3:1 ratio as in Mendel's law of Independent assortment. Also briefly covers the Genetic mapping and phenotypic mapping unit.
Estimation of Linkage Disequilibrium using GGT2 SoftwareAwais Khan
This document provides instructions for using the GGT 2.0 software to estimate linkage disequilibrium (LD) decay from genotype data. It describes how to import data from an Excel file into GGT, view individual genomes and the LD between markers, calculate LD using Lewontin's D' measurement, and view the results in diallel format, as pairwise distances and values, or in a heat plot view of the LD pattern. An example is given showing the LD values between 5 SNPs in a 1.64 kb gene region.
This document provides an overview of a genetics lecture that discusses haplotypes, linkage disequilibrium, and the HapMap project.
The lecture begins by explaining how mutations give rise to SNPs, which then give rise to haplotypes. Recombination leads to linkage disequilibrium, where haplotypes are seen together more often than by chance. The document then discusses different measures used to calculate linkage disequilibrium, including D, D', and r2.
The second half focuses on the HapMap project. It provides background on the project and explains how it aimed to characterize linkage disequilibrium across the human genome. HapMap helped identify tag SNPs that could represent other SNPs in linkage
The document describes how to create a kinship matrix using microsatellite data in Microsatellite Analyzer (MSA) software by first preparing genotypic data from molecular markers in a spreadsheet, importing this data into MSA and running an analysis to generate a pairwise kinship coefficient matrix, which provides a measure of genetic relatedness between individuals that can be used in association mapping studies to account for population structure.
A lecture for UW EPI 519 providing background for genome-wide association studies, a few examples of recent papers in the CVD GWAS literature, and some lessons and new directions. The talk was originally given in 2008 (in collaboration with a colleagure), this version has been updated slightly for 2010 and includes references for further reading.
Some of the typefaces may have been mangled on conversion; the file download should be more reliable.
This document discusses gene mapping and linkage. It explains that linked genes on the same chromosome are inherited together, while genes on different chromosomes assort independently. Thomas Morgan discovered linkage in fruit flies and determined they have 4 linkage groups, or chromosomes. The document then describes how Alfred Sturtevant developed the concept of using recombination rates between linked genes to construct gene maps and determine the relative positions of genes on chromosomes. An example is given showing how to use offspring data from a fruit fly cross to calculate the map distance between the black and vestigial genes as 83.8 map units apart.
Introduction to association mapping and tutorial using tasselAwais Khan
This presentation introduces association mapping/linkage disequilibrium mapping and also includes a tutorial showing association mapping analysis using TASSEL software.
Genome-wide association study (GWAS) technology has been a primary method for identifying the genes responsible for diseases and other traits for the past ten years. GWAS continues to be highly relevant as a scientific method. Over 2,000 human GWAS reports now appear in scientific journals. Our free eBook aims to explain the basic steps and concepts to complete a GWAS experiment.
This document summarizes research applying genome-wide association studies (GWAS) and transcriptomics to study gene-trait associations in banana. A GWAS was performed on a panel of 106 banana accessions to identify genomic regions associated with seedlessness. Five regions were identified, with two genes directly linked to female sterility in Arabidopsis. Additional studies explored drought tolerance by measuring leaf temperatures under drought and conducting transcriptomics analyses of three genotypes under osmotic stress. The research aims to aid banana breeding through gene discovery using these complementary genomic and phenomic approaches.
The document provides instructions for determining linkage and mapping distances between genes using three-point crosses. It explains how to identify parental and recombinant classes, determine gene order based on double recombinants, and calculate map distances. For the example three-point cross, the parental classes are identified as calm, five, smooth and dithery, four, grizzled. The double recombinants indicate the gene order is five-calm-smooth. Map distances are calculated as 10 LMU between five and calm loci and 13 LMU between calm and smooth loci.
This document provides an overview of genome-wide association studies (GWAS). It defines key terms related to GWAS such as linkage disequilibrium, minor allele frequency, and odds ratio. It compares linkage mapping and association mapping. It describes the methodology of GWAS including identifying population structure, selecting case and control subjects, genotyping samples, and determining associated SNPs. It discusses challenges such as multiple hypothesis testing and population structure. It provides examples of successful GWAS in crops like maize and Arabidopsis. Overall, the document provides a comprehensive introduction and overview of GWAS.
The document discusses genetic recombination and gene linkage. During meiosis, homologous chromosomes pair up and may exchange genetic material through crossing over. Tracking how closely linked genes segregate during crosses allows their relative positions on chromosomes to be determined. Genetic markers are used to infer the presence of other linked genes. Independent assortment of chromosomes leads to many possible combinations of alleles in gametes. The analysis of crosses differs depending on whether genes are on the same or different chromosomes. Recombination frequency between linked genes can be measured and used to create genetic maps.
Unsupervised Deep Learning Applied to Breast Density Segmentation and Mammogr...Jinseob Kim
1. The document describes a method using unsupervised deep learning for breast density segmentation and mammographic risk scoring from medical images.
2. A convolutional sparse autoencoder (CSAE) model is used to learn features from unlabeled mammogram patch data at multiple scales to perform the segmentation and risk scoring tasks.
3. Experimental results show the CSAE approach achieves state-of-the-art performance for both density segmentation and texture-based cancer risk prediction.
The document discusses Wright's F-statistics and Cockerham's θ-statistics, which are methods used to calculate genetic differentiation between populations. It also discusses methods to detect signatures of positive selection, including Extended Haplotype Homozygosity (EHH), integrated Haplotype Score (iHS), and cross population Extended Haplotype Homozygosity (xpEHH). EHH detects when a particular haplotype is over-represented in a population by measuring how quickly homozygosity declines with genetic distance from the core haplotype. iHS and xpEHH are derived from EHH scores to identify haplotypes that have increased in frequency due to positive selection.
New Epidemiologic Measures in Multilevel Study: Median Risk Ratio, Median Haz...Jinseob Kim
This document introduces new epidemiological measures for multilevel studies, including the median risk ratio, median hazard ratio, and median beta. It begins with an introduction and overview of intraclass correlation coefficients and variance partition coefficients. It then provides formulas for calculating the new measures based on binomial, Poisson, and Cox proportional hazards multilevel models. Examples are shown using real data on breast cancer and families to demonstrate how to compute and interpret the median odds ratio, median risk ratio, and median hazard ratio. The document concludes by discussing applications of the new measures to other data types like count and survival data.
This document provides code to calculate extended haplotype homozygosity (EHH), integrated haplotype score (iHS), and cross-population composite likelihood ratio (XP-CLR) from population genetics data. It loads example data, calculates EHH and iHS for a set of SNPs on chromosome 12, and plots the results. It then loads example results for composite likelihood ratio (CLR) between cattle populations and calculates relative extended haplotype homozygosity (REHH) between the populations, plotting the output. Finally, it calculates iHS for all SNPs on chromosome 1 from one of the cattle populations and plots those results.
The document summarizes Wright's F-statistics and Cockerham's θ-statistics, which are methods used to calculate genetic differentiation between populations. It then discusses methods to detect signatures of positive selection, including Extended Haplotype Homozygosity (EHH), integrated Haplotype Score (iHS), and cross population Extended Haplotype Homozygosity (xp-EHH). EHH detects when a haplotype is over-represented in a population due to recent positive selection. iHS and xp-EHH are derived from EHH to identify specific genomic regions under selection. The document uses examples and figures to illustrate key concepts.
The document introduces DISMOD and DISMOD II, software used to model disease burden. DISMOD uses differential equations to estimate disease measures like incidence, remission, and case fatality from available data. DISMOD II improves on DISMOD by allowing estimation of measures from other available data using statistical methods. It also introduces a graphical user interface. Both tools are used to model disease measures over age, sex, and location where data may be limited or uncertain. Newer approaches aim to have more flexible models that account for covariates and better represent uncertainty.
This document discusses analyzing time-series data using a case-crossover study design and conditional logistic regression. It begins with concepts of individual versus population risk, the case-crossover design which uses a subject's other time periods as controls, and how the data structure changes. It then reviews basic linear regression, logistic regression, and conditional logistic regression. Finally, it discusses practical issues and demonstrates using the season package in R to conduct case-crossover analyses and conditional logistic regression.
This document discusses analyzing time-series data using generalized additive models (GAM). It covers non-linear issues in regression, GAM theory including various spline methods and model selection, descriptive analysis of time-series data through plots, and applying GAM to analyze incidence data from Seoul using the mgcv package in R. Examples are provided to illustrate spline fitting and model selection for both Poisson and quasipoisson GAMs.
1. This document discusses the history and development of deep learning from the perceptron in 1958 to modern deep neural networks.
2. It describes the key milestones as the perceptron in 1958, multilayer perceptrons in the 1980s which could solve the XOR problem, and Boltzmann machines in the 1980s which introduced unsupervised learning.
3. Deep learning has gained popularity since 2010 due to increases in data and computational power. It is now being applied to problems in computer vision, natural language processing and other domains.
This document discusses the changing role of human scientists in an era where metahuman science has advanced far beyond human comprehension. It outlines how human scientists have shifted from conducting original research to interpreting and analyzing the work of metahumans through hermeneutic approaches like textual analysis of publications, reverse engineering of technological artifacts, and remote sensing of research facilities. While some see these as a waste of time, the document argues they are worthwhile pursuits that continue scientific inquiry and increase human knowledge, and may even uncover applications not considered by metahumans.
This document discusses advanced tree-based machine learning methods including bagging, random forests, and boosting. Bagging involves resampling data and growing trees on each sample to average predictions and reduce variance compared to a single tree. Random forests build on bagging by randomly selecting features at each split. Boosting fits trees sequentially to emphasize training examples that previous trees misclassified to produce a stronger learner. These ensemble methods aggregate multiple tree models to improve over a single decision tree.
This document discusses the history and development of deep learning. It describes how early neural networks like perceptrons had limitations in tasks like the XOR problem. The development of multilayer perceptrons with hidden layers and the backpropagation algorithm helped address these issues. However, training these networks remained a challenge until recent breakthroughs in unsupervised learning using methods like restricted Boltzmann machines and deep belief networks. These approaches pre-train the lower layers of neural networks in an unsupervised manner before fine-tuning the entire network with a supervised method like backpropagation.
1. Association Study: Binomial Case
GEE & GLMM
Jinseob Kim
GSPH, SNU
July 2, 2014
Jinseob Kim (GSPH, SNU) Association Study: Binomial Case July 2, 2014 1 / 45
2. Contents
1 Correlated = Not Independent
Concept
Example
2 GEE & GLMM Basic
Basic Linear Regression
GEE
GLMM
Comparison
3 GEE & GLMM in GWAS
Concepts of GWAS
Genetic Correlation
Use GEE & GLMM
4 Conclusion
4. Correlated = Not Independent
5. Correlated = Not Independent: Concept
i.i.d.?
ε_i ~ iid N(0, σ²), i.e. ε ~ N(0, σ²I_n)
Independent
Identically distributed
ε_i ~ N(0, σ_i²)
Independent
Not identically distributed
7. Correlated = Not Independent Example
Repeated Measure
8. Correlated = Not Independent Example
Clustered/Multilevel study
9. Correlated = Not Independent Example
Serial Correlation
10. Correlated = Not Independent Example
Familial structure in Genetic Study
14. Estimation in linear regression
1 Ordinary Least Squares (OLS): semi-parametric
2 Maximum Likelihood Estimator (MLE): parametric
15. GEE & GLMM Basic: Basic Linear Regression
Least Squares
Minimize the sum of squared residuals ε_i²: no distributional assumption on y is required.
Figure. OLS Fitting
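The least-squares idea on this slide can be sketched in a few lines of Python (a hypothetical illustration, not code from the talk): fit y = b0 + b1·x by minimizing the sum of squared residuals, with no distributional assumption on y.

```python
def ols_fit(x, y):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx        # slope: Cov(x, y) / Var(x)
    b0 = my - b1 * mx     # the fitted line passes through the means
    return b0, b1

# Points on the exact line y = 1 + 2x are recovered perfectly.
b0, b1 = ols_fit([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```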
16. GEE & GLMM Basic: Basic Linear Regression
Likelihood?
Likelihood vs probability
Discrete: likelihood = probability. e.g., the probability that a fair die shows 1 is 1/6.
Continuous: likelihood ≠ probability. The probability that a continuous variable takes any exact value (say 0.7) is 0; only the density is positive.
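A tiny numeric illustration of the distinction (an assumed example, not from the slides): for a discrete variable the likelihood of an outcome is its probability; for a continuous one the probability of any exact value is 0 and only the density enters the likelihood.

```python
# Discrete: the likelihood of an outcome IS its probability.
p_die_shows_one = 1 / 6            # fair six-sided die

# Continuous: P(X = 0.7) = 0 for Uniform(0, 1); the density f(0.7) = 1
# is what enters the likelihood, and it is not a probability.
def uniform01_density(x):
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

prob_exact_value = 0.0             # P(X = 0.7) for a continuous variable
density_at_07 = uniform01_density(0.7)
```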
17. GEE & GLMM Basic: Basic Linear Regression
Maximum likelihood estimator (MLE)
Likelihood function: how likely the parameter makes the observed data x_1, …, x_n.
1 Write down the likelihood function of the data.
2 The larger the likelihood, the better the parameter explains the observed data.
3 Find the parameter value that maximizes the likelihood.
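The three steps can be run end-to-end for the simplest case, the mean of N(μ, 1) (a toy sketch with made-up data values; the MLE is known to be the sample mean):

```python
import math

data = [1.0, 2.0, 3.0, 6.0]        # made-up observations; sample mean = 3.0

def log_likelihood(mu):
    # Step 1: the (log-)likelihood of the data as a function of the parameter.
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (x - mu) ** 2
               for x in data)

# Steps 2-3: a larger likelihood explains the data better, so take the
# maximizer (grid search here; calculus gives the sample mean exactly).
grid = [i / 100 for i in range(1001)]       # 0.00, 0.01, ..., 10.00
mu_hat = max(grid, key=log_likelihood)
```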
19. GEE & GLMM Basic: Basic Linear Regression
MLE
Maximize the probability of the observed data: a distributional assumption on y is required.
20. GEE & GLMM Basic: Basic Linear Regression
Logistic function: MLE
Figure. Fitting Logistic Function
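Fitting the logistic function by maximizing the likelihood can be sketched with plain gradient ascent (a toy example with made-up data, not the talk's code):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Made-up binary outcomes over a covariate x.
x = [20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 55.0]
y = [0, 0, 0, 1, 0, 1, 1, 1]

# Gradient ascent on the log-likelihood of the logistic model
# P(y = 1) = sigmoid(b0 + b1*x); the gradient is sum((y - p) * [1, x]).
b0, b1 = 0.0, 0.0
lr = 0.0005
for _ in range(20000):
    g0 = sum(yi - sigmoid(b0 + b1 * xi) for xi, yi in zip(x, y))
    g1 = sum((yi - sigmoid(b0 + b1 * xi)) * xi for xi, yi in zip(x, y))
    b0 += lr * g0
    b1 += lr * g1
# b1 > 0: the fitted probability of y = 1 rises with x in this toy data
```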
21. GEE & GLMM Basic: Basic Linear Regression
LRT? Wald? Score?
Likelihood Ratio Test vs Wald test vs Score test
1 All three are ways to test the significance of coefficients.
2 LRT compares likelihoods; Wald compares the distance of the estimate from the null value; the score test compares the slope of the log-likelihood at the null.
22. GEE & GLMM Basic: Basic Linear Regression
Comparison
Figure. Comparison
23. GEE & GLMM Basic: Basic Linear Regression
AIC
To compare models we need more than the likelihood alone.
1 AIC = −2 log(L) + 2k
2 k: number of estimated parameters (intercept, slopes, variance, …)
3 The smaller, the better the model!!!
A larger likelihood means a better fit, but using too many parameters is penalized!!!
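The formula above in action (the log-likelihoods and parameter counts are made-up toy numbers, not results from the talk): the bigger model has the larger likelihood but loses after the 2k penalty.

```python
def aic(log_lik, k):
    # AIC = -2 log(L) + 2k; k = number of estimated parameters
    return -2.0 * log_lik + 2 * k

small_model = aic(log_lik=-105.0, k=3)   # 216.0
big_model = aic(log_lik=-104.5, k=8)     # 225.0
# The smaller AIC wins: five extra parameters bought only 0.5 log-likelihood.
```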
42. GEE & GLMM Basic: GLMM
Random effects can be predicted (BLUP).
43. GEE & GLMM Basic: Comparison
GEE example: Continuous
running glm to get initial regression estimate
(Intercept) age sex BMI
-64.2956645 0.1811694 -42.3958662 8.5256257
gee(formula = TG ~ age + sex + BMI, id = FID, data = a, corstr = exchangeable)
Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept) -67.2665582 35.8624272 -1.8756834 35.9094269 -1.8732284
age 0.1751885 0.3340099 0.5245007 0.3996143 0.4383938
sex -42.2905294 11.3716707 -3.7189372 8.3038131 -5.0929048
BMI 8.6744524 1.2930220 6.7086657 1.4041520 6.1777161
Working Correlation
[,1] [,2] [,3] [,4]
[1,] 1.0000000 0.2582559 0.2582559 0.2582559
[2,] 0.2582559 1.0000000 0.2582559 0.2582559
[3,] 0.2582559 0.2582559 1.0000000 0.2582559
[4,] 0.2582559 0.2582559 0.2582559 1.0000000
44. GEE & GLMM Basic: Comparison
GLMM example: Continuous
lmer(formula = TG ~ age + sex + BMI + (1 | FID), data = a)
Estimate Std. Error t value
(Intercept) -65.222107 35.8720093 -1.8181894
age 0.109564 0.3318413 0.3301699
sex -41.942137 11.3684264 -3.6893529
BMI 8.648601 1.2917159 6.6954362
Groups Name Std.Dev.
FID (Intercept) 39.356
Residual 72.007
39.356^2/(39.356^2+72.007^2)=0.23
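The last line of the lmer output divides the between-family variance by the total variance; the arithmetic checks out (a quick check using the SDs printed above):

```python
# Std. deviations from the lmer output above.
sd_family, sd_residual = 39.356, 72.007

# Intraclass correlation: share of total variance lying between families.
icc = sd_family ** 2 / (sd_family ** 2 + sd_residual ** 2)
# icc is about 0.23, close to the exchangeable working correlation
# 0.258 estimated by the GEE fit on the same data.
```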
45. GEE & GLMM Basic: Comparison
GEE example: Binomial
running glm to get initial regression estimate
(Intercept) age sex BMI
-5.457458529 0.009749659 -1.385819506 0.157734298
gee(formula = hyperTG ~ age + sex + BMI, id = FID, data = a,
family = binomial, corstr = exchangeable)
Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept) -5.453486897 1.10811194 -4.9214224 1.14198243 -4.7754561
age 0.008754136 0.00997040 0.8780125 0.01087413 0.8050421
sex -1.337114934 0.53428456 -2.5026270 0.52621253 -2.5410169
BMI 0.158988089 0.03867076 4.1113256 0.04248749 3.7419975
Working Correlation
[,1] [,2] [,3] [,4]
[1,] 1.0000000 0.1942491 0.1942491 0.1942491
[2,] 0.1942491 1.0000000 0.1942491 0.1942491
[3,] 0.1942491 0.1942491 1.0000000 0.1942491
[4,] 0.1942491 0.1942491 0.1942491 1.0000000
46. GEE & GLMM Basic: Comparison
GLMM example: Binomial
glmer(formula = hyperTG ~ age + sex + BMI + (1 | FID), data = a, family = binomial)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.65451749 1.48227814 -4.4893852 7.142904e-06
age 0.01052907 0.01206682 0.8725635 3.829010e-01
sex -1.48506920 0.60773433 -2.4436158 1.454090e-02
BMI 0.19131619 0.05022612 3.8090977 1.394749e-04
Groups Name Std.Dev.
FID (Intercept) 1.1163
47. GEE & GLMM in GWAS
48. GEE & GLMM in GWAS: Concepts of GWAS
Issues
Concepts
Samples vs SNPs (3,461 vs 500,000)
More than 500,000 repeated regressions...!!!!
Strict p-value (5 × 10⁻⁸)
Issues
Computational burden.. speed!!
Complex correlation structure
Approximation techniques
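The strict p-value on this slide is the conventional genome-wide significance threshold: a Bonferroni correction of α = 0.05 for roughly one million independent common variants (the 10⁶ count is the conventional assumption, not a number from the talk):

```python
alpha = 0.05
n_independent_tests = 1_000_000     # conventional genome-wide figure

# Bonferroni: divide alpha by the number of tests to keep the
# family-wise error rate at alpha.
genome_wide_threshold = alpha / n_independent_tests   # 5e-8
```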
49. GEE & GLMM in GWAS: Genetic Correlation
GCM
Genetic Correlation Matrix
The correlation structure can be known in advance (pedigree vs data).
It is complicated, with no simple structure.
Computation...
51. GEE & GLMM in GWAS: Use GEE & GLMM
Note
There is no clean cluster: every individual is correlated with the others.
The GCM can be used directly as the working correlation.
52. GEE & GLMM in GWAS: Use GEE & GLMM
GWAS example: GEE-continuous
running glm to get initial regression estimate
(Intercept) age sex BMI genecount
-63.0665181 0.1441694 -39.0676606 7.8280011 19.8533844
gee(formula = TG ~ age + sex + BMI + genecount, id = ID, data = a,
R = kin, corstr = fixed)
Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept) -63.0665181 35.4400639 -1.7795261 31.4650444 -2.0043359
age 0.1441694 0.3376881 0.4269307 0.3558302 0.4051635
sex -39.0676606 11.2797186 -3.4635315 7.2549380 -5.3849751
BMI 7.8280011 1.2914399 6.0614519 1.3054881 5.9962258
genecount 19.8533844 6.2315166 3.1859635 5.8534124 3.3917624
Working Correlation
[,1] [,2] [,3] [,4]
[1,] 1.0 0.5 0.5 0.5
[2,] 0.5 1.0 0.5 0.5
[3,] 0.5 0.5 1.0 0.0
[4,] 0.5 0.5 0.0 1.0
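The fixed working correlation printed above can be read as twice the kinship coefficient for each pair: 2 × 0.25 = 0.5 for first-degree relatives, 0 for an unrelated pair. One pedigree consistent with it is a nuclear family, two children (rows 1-2) and two unrelated parents (rows 3-4). A hypothetical reconstruction:

```python
# Pairwise kinship coefficients for a 4-member family (hypothetical
# pedigree chosen to reproduce the matrix in the gee() output above).
kinship = {
    (0, 1): 0.25, (0, 2): 0.25, (0, 3): 0.25,
    (1, 2): 0.25, (1, 3): 0.25,
    (2, 3): 0.0,                      # the two parents are unrelated
}

n = 4
R = [[1.0 if i == j else 2.0 * kinship[min(i, j), max(i, j)]
      for j in range(n)] for i in range(n)]
# R now equals the fixed working correlation passed to gee() via R = kin
```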
53. GEE & GLMM in GWAS: Use GEE & GLMM
GWAS example: GEE-binomial
running glm to get initial regression estimate
(Intercept) age sex BMI genecount
-5.482288956 0.009646267 -1.348154797 0.151819412 0.192508455
gee(formula = hyperTG ~ age + sex + BMI + genecount, id = ID,
data = a, R = kin, family = binomial, corstr = fixed)
Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept) -5.482288957 1.10060632 -4.9811535 1.07919392 -5.0799850
age 0.009646267 0.01004073 0.9607134 0.01027862 0.9384789
sex -1.348154801 0.53873048 -2.5024662 0.52100579 -2.5876004
BMI 0.151819412 0.03861585 3.9315312 0.04199752 3.6149615
genecount 0.192508455 0.18683677 1.0303564 0.19281252 0.9984230
Working Correlation
[,1] [,2] [,3] [,4]
[1,] 1.0 0.5 0.5 0.5
[2,] 0.5 1.0 0.5 0.5
[3,] 0.5 0.5 1.0 0.0
[4,] 0.5 0.5 0.0 1.0
54. GEE & GLMM in GWAS: Use GEE & GLMM
GWAS example: GLMM
The lme4 package cannot fit this (no arbitrary kinship covariance).
The hglm package can.
GenABEL provides the polygenic and hglm-based functions.
55. GEE & GLMM in GWAS: Use GEE & GLMM
Limitation
Both GEE & GLMM
Slow, especially with a pedigree correlation and a binomial outcome..
Continuous: approximations ease the burden: FASTA, GRAMMAR, GEMMA..
Binomial: approximations perform poorly.., so the speed problem remains.
56. Conclusion
58. Conclusion
END
Email: secondmath85@gmail.com
Office: (02) 880-2473
H.P.: 010-9192-5385