This document provides solutions to sample problems using various datasets. It demonstrates how to use R functions like bargraph.CI(), boxplot(), hist(), and table() to analyze and visualize data. For example, it shows how to create bar charts comparing mean BMI by gender and mean AFP difference by drug concentration using the bargraph.CI() function from the sciplot package. It also provides solutions for manipulating datasets, such as recoding a variable or sorting and subsetting data.
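As a rough illustration of the kind of sciplot call described above, here is a minimal R sketch; the data frame survey and its columns are invented placeholders rather than the document's actual dataset.

# Hypothetical example: mean BMI by gender with error bars, via sciplot.
library(sciplot)

set.seed(1)
survey <- data.frame(
  gender = factor(rep(c("Female", "Male"), each = 30)),
  bmi    = c(rnorm(30, mean = 24, sd = 3), rnorm(30, mean = 26, sd = 3))
)

bargraph.CI(x.factor = gender, response = bmi, data = survey,
            xlab = "Gender", ylab = "Mean BMI")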
This document provides an overview of analysis of covariance (ANCOVA). It describes a scenario where a one-way ANOVA is desired but there is also a continuous predictor variable. ANCOVA accounts for variation associated with this covariate. It models the relationship as additive, with the response variable a function of the intercept, factor effects, covariate effects, and error. ANCOVA is described as a hybrid of ANOVA and linear regression. The document then demonstrates ANCOVA using example diet data, showing how it allows detecting a diet effect when ANOVA alone did not.
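In R, the additive model described above would typically be fit along the following lines; the variable names (gain, diet, initial_wt) are hypothetical stand-ins, not the document's diet data.

# ANCOVA: response = intercept + factor (diet) + covariate (initial weight) + error.
set.seed(2)
growth <- data.frame(
  diet       = factor(rep(c("A", "B", "C"), each = 20)),
  initial_wt = runif(60, 45, 55)
)
growth$gain <- 2 + 0.5 * growth$initial_wt +
  c(A = 0, B = 1, C = 2)[as.character(growth$diet)] + rnorm(60, sd = 1)

anova_fit  <- aov(gain ~ diet, data = growth)               # one-way ANOVA only
ancova_fit <- aov(gain ~ initial_wt + diet, data = growth)  # covariate entered before the factor
summary(anova_fit)
summary(ancova_fit)   # diet effect tested after adjusting for the covariate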
The document discusses contrasts, estimation, and power analysis in the context of a one-way ANOVA experiment with four brands (A, B, C, D) of chainsaws. Orthogonal contrasts are constructed to compare: (1) groups A&D vs B&C, (2) group A vs D, and (3) group B vs C. The contrasts are tested using an ANOVA model in R. Confidence intervals are estimated for the effect sizes and differences in group means are calculated for each contrast. Finally, power is analyzed for a two-sample t-test and one-way ANOVA.
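A compact R sketch of that contrast setup and the power calculations is shown below; the brand labels follow the summary, while the response values are simulated placeholders.

# Orthogonal contrasts for four chainsaw brands: A&D vs B&C, A vs D, B vs C.
set.seed(3)
saws <- data.frame(
  brand    = factor(rep(c("A", "B", "C", "D"), each = 5)),
  kickback = rnorm(20, mean = rep(c(30, 40, 42, 32), each = 5), sd = 3)
)

contrasts(saws$brand) <- cbind(
  AD_vs_BC = c(1, -1, -1,  1),
  A_vs_D   = c(1,  0,  0, -1),
  B_vs_C   = c(0,  1, -1,  0)
)

fit <- aov(kickback ~ brand, data = saws)
summary(fit, split = list(brand = list("A&D vs B&C" = 1, "A vs D" = 2, "B vs C" = 3)))

# Power for a two-sample t-test and a one-way ANOVA (illustrative inputs only).
power.t.test(delta = 5, sd = 3, power = 0.9)
power.anova.test(groups = 4, between.var = var(c(30, 40, 42, 32)),
                 within.var = 9, power = 0.9)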
This document discusses a statistical analysis that was conducted to determine the impact of area and fertilizer consumption on crop production. A multiple linear regression model was developed with crop production as the dependent variable and area and fertilizer consumption as the independent variables. The analysis found that the regression model was statistically significant, with area and fertilizer consumption also being statistically significant predictors of crop production. The model explained 95.1% of the variation in crop production.
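A hedged R sketch of such a model fit follows; the data frame crops and its columns are invented for illustration, since the document's data are not reproduced here.

# Multiple linear regression of crop production on area and fertilizer consumption.
set.seed(4)
crops <- data.frame(area = runif(30, 100, 500), fert = runif(30, 10, 80))
crops$production <- 5 + 0.8 * crops$area + 2.5 * crops$fert + rnorm(30, sd = 20)

fit <- lm(production ~ area + fert, data = crops)
summary(fit)   # overall F-test, coefficient t-tests, and R-squared (cf. the 95.1% reported)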
A personal statistical analysis that I conducted in SPSS and the R programming language. A logistic regression was performed to investigate which myopia-related factors are the most significant.
COVID-19 (Coronavirus Disease) Outbreak Prediction Using a Susceptible-Exposed-Symptomatic Infected-Recovered-Super Spreaders-Asymptomatic Infected-Deceased-Critical (SEIR-PADC) Dynamic Model
This paper compares L1 norm regression algorithms in terms of accuracy and efficiency. Other comparative studies are mentioned, and their conclusions are discussed. Many experiments were performed to evaluate the comparative efficiency and accuracy of the selected algorithms.
This document provides an overview of Chapter 7 from a statistics textbook. The chapter covers sampling and sampling distributions. It has 6 main learning objectives, including determining when to use sampling vs a census, distinguishing random and nonrandom sampling, and understanding the impact of the central limit theorem. The chapter outline lists 7 sections that will be covered, such as sampling, sampling distributions of the mean and proportion, and key terms. It provides examples and formulas to illustrate the central limit theorem.
Metaheuristic Tuning of Type-II Fuzzy Inference System for Data Mining - Varun Ojha
The document proposes using metaheuristic optimization techniques to tune the parameters of an interval type-2 fuzzy inference system (IT2FIS) for data mining applications. Specifically, it aims to 1) create diverse rules in the IT2FIS, 2) reduce the number of fuzzy rules, 3) determine appropriate shapes for type-2 fuzzy sets, and 4) analyze the performance of proposed IT2FIS optimization methods. The proposed framework uses genetic algorithms to tune the IT2FIS knowledge base and swarm intelligence methods to tune rule parameters. Experimental results on four datasets show that differential evolution generally provides the best performance, though no single algorithm works best on all datasets.
This document discusses two methods of unsupervised learning: principal component analysis (PCA) and clustering. It applies PCA and clustering to cancer microarray gene expression data (NCI60) to explore patterns in the data without a response variable. PCA of the NCI60 data finds the first seven principal components explain 40% of the variance. Scatter plots of the first seven principal components show cancer types cluster together, though imperfectly. Hierarchical clustering with complete linkage also tends to cluster cell lines within a single cancer type together.
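A short R sketch of those two unsupervised steps is given below, using the NCI60 object from the ISLR package; if the document used a different data source, the details may differ.

# PCA and hierarchical clustering of the NCI60 expression data.
library(ISLR)                         # provides NCI60: $data (expression matrix) and $labs (cancer types)
nci_data <- NCI60$data
nci_labs <- NCI60$labs

pr_out <- prcomp(nci_data, scale. = TRUE)
summary(pr_out)$importance[3, 1:7]    # cumulative variance explained by the first seven PCs

plot(pr_out$x[, 1:2], col = as.numeric(factor(nci_labs)), pch = 19,
     xlab = "PC1", ylab = "PC2")      # cancer types tend to cluster, imperfectly

hc_out <- hclust(dist(scale(nci_data)), method = "complete")
plot(hc_out, labels = nci_labs, cex = 0.5)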
I am Watson A. I am a Statistics Assignment Expert at statisticsassignmenthelp.com. I hold a Master's in Statistics from Liberty University, USA.
I have been helping students with their homework for the past 6 years. I solve assignments related to Statistics.
Visit statisticsassignmenthelp.com or email info@statisticsassignmenthelp.com.
You can also call on +1 678 648 4277 for any assistance with Statistics Assignments.
Logistic Regression in Case-Control Study - Satish Gupta
This document provides an introduction to using logistic regression in R to analyze case-control studies. It explains how to download and install R, perform basic operations and calculations, handle data, load libraries, and conduct both conditional and unconditional logistic regression. Conditional logistic regression is recommended for matched case-control studies as it provides unbiased results. The document demonstrates how to perform logistic regression on a lung cancer dataset to analyze the association between disease status and genetic and environmental factors.
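For a matched case-control analysis of the kind described, a conditional logistic regression in R might be sketched as follows; the data frame and variable names are placeholders, not the lung cancer dataset analyzed in the document.

# Conditional logistic regression for matched case-control data.
library(survival)
set.seed(5)
cc <- data.frame(
  set      = rep(1:50, each = 2),        # matched pair identifier
  case     = rep(c(1, 0), times = 50),   # 1 = case, 0 = control
  genotype = rbinom(100, 1, 0.3),        # hypothetical genetic factor
  smoking  = rbinom(100, 1, 0.5)         # hypothetical environmental factor
)

fit <- clogit(case ~ genotype + smoking + strata(set), data = cc)
summary(fit)    # odds ratios conditional on the matched sets

# Unconditional logistic regression, appropriate for unmatched designs:
fit_u <- glm(case ~ genotype + smoking, data = cc, family = binomial)
summary(fit_u)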
The document provides information on using SPSS and PSPP statistical software to analyze data and conduct statistical tests. It includes 4 lessons:
1. How to define and input data into the software.
2. How to generate descriptive statistics like measures of central tendency and variability to describe data.
3. How to examine relationships between variables using correlation, regression, and graphs.
4. How to perform statistical inference tests for means using one-sample t-tests, independent two-sample t-tests, and paired t-tests. Examples of hypotheses testing and interpreting results are provided.
I am Luke M. I love exploring new topics, and academic writing seemed an interesting option for me. After working for many years with statisticsassignmentexperts.com, I have assisted many students with their assignments. I can proudly say that each student I have served is happy with the quality of the solution I have provided. I acquired my Master's Degree in Statistics from Arizona University, United States.
The document discusses using k-way interaction loglinear models to analyze gene interaction from microarray data. It begins with background on challenges with existing clustering and association rule methods. It then proposes using loglinear models to identify interactions that cannot be explained by pairwise associations alone. The method transforms expression data, finds frequent gene sets, and fits k-way interaction models to identify statistically significant higher-order interactions. Experimental results on yeast data identified known and potentially new biological interactions.
This document provides an overview of recursive partitioning and classification and regression tree (CART) methods. It describes how trees are built by recursively splitting nodes into left and right sons based on variables that maximize impurity reduction. Two common measures of impurity are the Gini index and information index. The tree is then pruned back using cross-validation to prevent overfitting. Examples are provided to illustrate the methods.
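A minimal rpart sketch of the grow-then-prune workflow described above, using the kyphosis data bundled with rpart (the document's own examples may differ):

# Grow a classification tree, inspect cross-validated error, then prune.
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class", parms = list(split = "gini"))
printcp(fit)                          # cross-validation error for each subtree size

best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)   # prune back to avoid overfitting

plot(pruned, margin = 0.1)
text(pruned, use.n = TRUE)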
The document discusses applying machine learning techniques to identify compiler optimizations that impact program performance. It used classification trees to analyze a dataset containing runtime measurements for 19 programs compiled with different combinations of 45 LLVM optimizations. The trees identified optimizations like SROA and inlining that generally improved performance across programs. Analysis of individual programs found some variations, but also common optimizations like SROA and simplifying the control flow graph. Precision, accuracy, and AUC metrics were used to evaluate the trees' ability to classify optimizations for best runtime.
1 FACULTY OF SCIENCE AND ENGINEERING SCHOOL OF COMPUT.docx - mercysuttle
FACULTY OF SCIENCE AND ENGINEERING
SCHOOL OF COMPUTING, MATHEMATICS & DIGITAL MEDIA
REASSESSMENT COURSEWORK 2013/14
UNIT CODE:
6G6Z3005
UNIT DESC:
APPLIED REGRESSION AND MULTIVARIATE ANALYSIS
ASSESSMENT ID:
1CWK30
ASSESSMENT NAME:
Coursework 30%
WEIGHT
FACTOR: 30%
See below.
NAME OF STAFF SETTING ASSIGNMENT: Dr B L Shea
MANCHESTER METROPOLITAN UNIVERSITY
FACULTY OF SCIENCE AND ENGINEERING
SCHOOL OF COMPUTING, MATHEMATICS & DIGITAL TECHNOLOGY
ACADEMIC YEAR 2013-2014:
REFERRED COURSEWORK
BSC(HONS) FINANCIAL MATHEMATICS
BSC(HONS) MATHEMATICS
YEAR/STAGE THREE
UNIT 6G6Z3005 : APPLIED REGRESSION AND MULTIVARIATE ANALYSIS
Answer ALL questions.
The pass mark is 40% which corresponds to a minimum of 72
marks out of a possible 180 marks.
The deadline is 8th August 2014.
SECTION A
1. (a) Three measurements x1, x2 and x3 have the following sample covariance matrix.
∑̂ =
9 2 0
2 4 1
0 1 4
(i) Verify that the corresponding sample correlation matrix C is given by
C =
1      1/3    0
1/3    1      1/4
0      1/4    1
[2]
(ii) Given that one of the eigenvalues of C is equal to one, calculate the other two
eigenvalues and determine the proportion of the variation in the data explained
by the first principal component.
[6]
(iii) Using the sample correlation matrix C, calculate the first principal component.
[6]
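As a quick numerical check of part (a), outside the exam paper itself, the covariance matrix can be converted to a correlation matrix and its eigenstructure inspected in R:

# Verify C and examine its eigenvalues and first principal component.
S <- matrix(c(9, 2, 0,
              2, 4, 1,
              0, 1, 4), nrow = 3, byrow = TRUE)
C <- cov2cor(S)                   # should reproduce the matrix C stated above
eig <- eigen(C)
eig$values                        # one eigenvalue equals 1, as part (ii) states
eig$values[1] / sum(eig$values)   # proportion of variation explained by the first PC
eig$vectors[, 1]                  # loadings of the first principal component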
(b) A Principal Components Analysis of the prices of food items in 23 cities was carried
out with a view to forming a measure of the Consumer Price Index (CPI). A Minitab
analysis of this data is attached.
(i) Explain why Principal Components Analysis was performed on the correlation
matrix instead of the covariance matrix.
[2]
(ii) If the first Principal Component is taken as a measure of the CPI, calculate, to
one decimal place, the value of the index for Atlanta.
[2]
(iii) Which is the most expensive city and which is the least expensive city?
[2]
(Question 1 continued overleaf)
(Question 1 continued)
Minitab output for Question 1
Descriptive Statistics: bread, burger, milk, oranges, tomatoes
Variable N Mean Median TrMean StDev SE Mean
bread 23 25.291 25.300 25.267 2.507 0.523
burger 23 91.86 91.00 91.63 7.55 1.58
milk 23 62.30 62.50 61.96 6.95 1.45
oranges 23 102.99 105.90 102.90 14.24 2.97
tomatoes 23 48.77 46.80 48.74 7.60 1.59
Principal Component Analysis: bread, burger, milk, oranges, tomatoes
Eigenanalysis of the Correlation Matrix
Eigenvalue 2.4225 1.1047 0.7385 0.4936 0.2408
Proportion 0.484 0.221 0.148 0.099 0.048
Cumulative 0.484 0.705 0.853 0.952 1.000
Variable PC1 PC2 PC3 PC4 PC5
bread 0.496 -0.309 0.386 -0.509 -0.500
burger 0.576 -0.044 0.262 0.028 0.773
milk 0.340 -0.431 -0.835 -0.049 0.008
oranges 0.225 0.797 -0.292 -0.479 -0.006
tomatoes 0.506 0.287 0.012 0.713 -0.391
(Question 1 continued overleaf)
(Question 1 continued)
Data Display
Row c ...
In the healthcare sector, data are enormous and diverse because they contain many different types of information, and extracting knowledge from these data is crucial. Data mining techniques can be used to extract that knowledge by building models from healthcare datasets. At present, classifying heart disease patients is a demanding research challenge for many researchers. To build a classification model for these patients, we used four different classification algorithms: NaiveBayes, MultilayerPerceptron, RandomForest and DecisionTable. The aim of this work is to classify whether a patient tests positive or negative for heart disease, based on diagnostic measurements included in the dataset.
This document provides instructions for performing multiple regression analysis in SPSS. It demonstrates entering variables, running the regression using the enter, stepwise, and backward methods, and interpreting the output including R-square values, F-tests, beta coefficients, and equations for predicting the dependent variable based on the independent variables. Age and education were identified as the best predictors of months of full-time employment using both the stepwise and backward regression methods.
1 Useful Hints on Assignment 5 Exercise 1 (Chapter - AbbyWhyte974
Useful Hints on Assignment 5
Exercise 1: (Chapter 6)
To help you better understand the calculations for Exercise 1 of Assignment 5, see below for an explanation on
how to correctly compute the risk rating of an asset.
Using the terminology from Chapter 6 of the textbook, the formula for calculating the risk rating of an asset
can be written as:
Risk rating = I x V x (1.0 - C + U)
where,
I : is Impact value of an asset
V : is Likelihood of vulnerability
C : is Percentage of risks mitigated by controls on the asset (example: Firewall etc.)
U : is Uncertainty of assumptions and data
Worked Example:
Let us see how we can apply this to an example problem. Assume that an organization has three assets A, B, C
as follows:
(1) Asset A: has an impact value of 50, and likelihood of vulnerability is estimated to be 1.0. Also
assume that there are no current controls in place to protect the asset, and there is a 90% certainty
of these assumptions and data. Thus we can write:
I : Impact value of asset is given as 50
V : Likelihood of vulnerability is given as 1.0
C : Assume that there are no current controls in place to protect this asset.
(So, Percentage of risk mitigated by current controls = 0% (i.e. 0))
U : Certainty of assumptions is given as 90%
- so the Uncertainty of assumptions = 10% (i.e. 0.1)
Risk rating for asset A = I x V x (1 – C + U) = (50 x 1.0) x (1.0 - 0 + 0.1) = 55
(2) Asset B: has an impact value of 100, and likelihood of vulnerability is estimated to be 0.5. Also
assume that current controls in place address 50% of the risk, and there is an 80% certainty of
these assumptions and data. Thus we can write:
I : Impact value of asset is given as 100
V : Likelihood of vulnerability is given as 0.5
C : Assume that current controls for this vulnerability address 50% of the risk.
(So, Percentage of risk mitigated by current controls = 50% (= 0.50))
U : Certainty of assumptions is given as 80%
- so Uncertainty of assumptions = 20% (i.e. 0.2)
Risk rating for asset B = I x V x (1 – C + U) = (100 x 0.5) x (1.0 - 0.5 + 0.2) = 35
(3) Asset C: has an impact value of 100, and likelihood of vulnerability is estimated to be 0.1. Also
assume that there are no current controls in place to protect the asset, and there is an 80%
certainty of these assumptions and data. Thus we can write:
I : Impact value of asset is given as 100
V : Likelihood of vulnerability is given as 0.1
C : Assume that there are no current controls in place to protect this asset.
(So, Percentage of risk mitigated by current controls = 0% (i.e. 0))
U : Certainty of assumptions is given as 80%
- so Uncertainty of assumptions = 20% (i.e. 0.2)
Risk rating for asset C = I x V x (1 – C + U) = (100 x 0.1) x (1.0 - 0 + 0.2) = ...
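The worked examples follow directly from the formula; a small R helper (not part of the original hints) makes the arithmetic easy to reproduce, using the asset values given above.

# Risk rating = Impact x Likelihood x (1.0 - Controls + Uncertainty)
risk_rating <- function(I, V, C, U) I * V * (1.0 - C + U)

risk_rating(I = 50,  V = 1.0, C = 0.0, U = 0.1)   # Asset A -> 55
risk_rating(I = 100, V = 0.5, C = 0.5, U = 0.2)   # Asset B -> 35
risk_rating(I = 100, V = 0.1, C = 0.0, U = 0.2)   # Asset C (calculation truncated above)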
ECN 425 Introduction to Econometrics Alvin Murphy .docx - tidwellveronique
ECN 425: Introduction to Econometrics
Alvin Murphy Arizona State University: Fall 2018
Assignment #1
Due at the beginning of class on Thursday, September 6th
PART I: DERIVING OLS ESTIMATORS
(You must show all work to receive full credit)
1) Suppose the population regression function can be written as: y = β₀ + β₁x + u, where E[u] = 0 and E[u|x] = 0. The sample equivalents to these two restrictions imply: (1/n) Σᵢ₌₁ⁿ ûᵢ = 0 and (1/n) Σᵢ₌₁ⁿ xᵢûᵢ = 0. Parts (a)-(c) of this problem ask you to derive the OLS estimators for β₀ and β₁. Please show all of your work.
(20 points: 5/5/10)
(a) Use (1/n) Σᵢ₌₁ⁿ ûᵢ = 0 to demonstrate that the OLS estimator for β₀ can be written as: β̂₀ = ȳ − β̂₁x̄, where ȳ = (1/n) Σᵢ₌₁ⁿ yᵢ and x̄ = (1/n) Σᵢ₌₁ⁿ xᵢ.
(b) Use (1/n) Σᵢ₌₁ⁿ xᵢûᵢ = 0 together with the result from (a) to demonstrate that the OLS estimator for β₁ can be written as: β̂₁ = [ Σᵢ₌₁ⁿ xᵢ(yᵢ − ȳ) ] / [ Σᵢ₌₁ⁿ xᵢ(xᵢ − x̄) ].
(c) Use your result from (b) together with the definition of the variance and covariance to demonstrate that β̂₁ = cov(xᵢ, yᵢ) / var(xᵢ).
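As a numerical companion to part (c), and not part of the assignment itself, a short R sketch on simulated data confirms that the manual formula matches lm():

# Check that the slope cov(x, y) / var(x) agrees with lm().
set.seed(6)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)

beta1_manual <- cov(x, y) / var(x)
beta0_manual <- mean(y) - beta1_manual * mean(x)

c(beta0_manual, beta1_manual)
coef(lm(y ~ x))    # should agree with the manual estimates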
2) Suppose the population regression function is y = β₀ + β₁zᵢ + u, and you estimate the following sample regression function: yᵢ = β̂₀ + β̂₁xᵢ + ûᵢ, where x ≠ z.
(20 points: 10/10)
(a) Express your estimator, β̂₁, in terms of the data and parameters of the population regression function, xᵢ, zᵢ, β₁, and uᵢ.
(b) Use your result from (a) to demonstrate that β̂₁ is generally a biased estimator for β₁.
PART II: USING A FAKE DATA EXPERIMENT TO INVESTIGATE OLS ESTIMATORS
A fake data experiment can be a useful way to investigate the properties of an estimator. This
process begins by specifying the “true” economic model (i.e. the population regression
function). The next step is to use this model to generate some data that represent a population.
Finally, by taking repeated samples from the population and using these samples to estimate the
sample regression function several times, you can evaluate how well your estimator performs
(e.g. bias and variance) under specific conditions.
3) In this problem, you will use a fake data experiment to demonstrate the importance of
correctly specifying the form of the sample regression function. More precisely, you will
compare the bias of the OLS estimator when the model is correctly specified, to the bias
when the model is incorrectly specified to use the wrong explanatory variable. In the file
“fake1.dta”, I have generated a population of 500 observations from the (true) regression
equation: y = β₀ + β₁z + u, such that E[u] = 0, E[u|z] = 0, and var(u|z) = σ².
(25 points: 5/5/5/5/5)
a) Use these data to calculate the population parameters ...
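The file fake1.dta is not reproduced here, but a sketch of the kind of fake-data experiment Part II describes, generating the population directly in R rather than reading the Stata file, might look like this:

# Compare the bias of the OLS slope when the model is correctly specified
# (regress y on z) versus misspecified (regress y on some other variable x).
set.seed(7)
N <- 500                                   # population size, as in the assignment
z <- rnorm(N)
x <- rnorm(N)                              # stand-in for the "wrong" explanatory variable
y <- 1 + 2 * z + rnorm(N, sd = sqrt(2))    # true model: beta0 = 1, beta1 = 2

one_draw <- function(n = 50) {
  idx <- sample(N, n)
  c(correct = coef(lm(y[idx] ~ z[idx]))[2],
    wrong   = coef(lm(y[idx] ~ x[idx]))[2])
}

estimates <- t(replicate(1000, one_draw()))
colMeans(estimates)   # the correctly specified slope centers near 2; the misspecified one does not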
Sparse Representation for Fetal QRS Detection in Abdominal ECG Recordings - Riccardo Bernardini
Slideshow of the presentation given at EHB 2015
In this work, we consider the problem of detecting fetal heart beats from abdominal, non-invasive mixture recordings. We propose a new method for separating maternal and fetal beats based on sparse decomposition in an over-complete dictionary of Gaussian-like functions. To increase the detection capability, we also use Independent Component Analysis (ICA) after maternal template subtraction. We show that the proposed detection method can be applied to the original mixture with a sensitivity close to 95%. Moreover, our method may also be used for single-channel abdominal ECG signals and in real-time applications.
2013.11.14 Big Data Workshop Bruno Voisin NUI Galway
Bruno Voisin from the Irish Centre for High End Computing presented this Introduction to Data Analytics Techniques and their Implementation in R during the Big Data Workshop hosted by the Social Sciences Computing Hub at the Whitaker Institute on the 14th November 2013
USING CUCKOO ALGORITHM FOR ESTIMATING TWO GLSD PARAMETERS AND COMPARING IT WI... - ijcsit
This study introduces and compares different methods for estimating the two parameters of generalized logarithmic series distribution. These methods are the cuckoo search optimization, maximum likelihood estimation, and method of moments algorithms. All the required derivations and basic steps of each algorithm are explained. The applications for these algorithms are implemented through simulations using different sample sizes (n = 15, 25, 50, 100). Results are compared using the statistical measure mean square error.
Here are the steps to visualize a potential indel region after realignment:
1. Run GATK IndelRealigner on the target list:
java -jar $EBROOTGATK/GenomeAnalysisTK.jar -T IndelRealigner -R ../human_g1k_v37.fasta -I sample.dedup.bam -targetIntervals sample.intervals -o sample.realigned.bam
2. Index the realigned BAM:
samtools index sample.realigned.bam
3. Load the realigned BAM into IGV and navigate to a region of interest from the target list (sample.intervals).
4. In I
This document discusses phylogenetic analysis and tree building. It introduces the Bioinformatics and Computational Biology Branch (BCBB) group and their work analyzing biological sequences and constructing phylogenetic trees. The document explains why biological sequences are important to analyze and compares sequences to understand relatedness and evolution. It also covers multiple sequence alignment, substitution models, and algorithms for building trees, including neighbor-joining.
The webinar covered new features and updates to the Nephele 2.0 bioinformatics analysis platform. Key updates included a new website interface, improved performance through a new infrastructure framework, the ability to resubmit jobs by ID, and interactive mapping file submission. New pipelines for 16S analysis using DADA2 and quality control preprocessing were introduced, and the existing 16S mothur pipeline was updated. The quality control pipeline provides tools to assess data quality before running microbiome analyses through FastQC, primer/adapter trimming with cutadapt, and additional quality filtering options. The webinar emphasized the importance of data quality checks and highlighted troubleshooting tips such as examining the log file for error messages when jobs fail.
1) METAGENOTE is a new web-based tool for annotating genomic samples and submitting metadata and sequencing files to the Sequence Read Archive (SRA) at the National Center for Biotechnology Information (NCBI).
2) It provides templates and controlled vocabularies to streamline sample metadata annotation using existing ontologies and standards. This allows for easier cross-study comparisons.
3) The demonstration showed how to use METAGENOTE's interface to annotate a mouse ear skin sample with terms from relevant ontologies, import additional annotations in batch, and submit the metadata and files to NCBI SRA through a 5-step wizard.
This document provides an introduction to homology modeling using computational tools like I-TASSER and Phyre2. It discusses how homology modeling can be used to generate 3D structural models of proteins when an experimental structure is not available. The document addresses common questions from users and outlines the I-TASSER modeling pipeline. Hands-on exercises are provided to allow users to run homology modeling tools and examine the resulting models.
This document summarizes different computational methods for protein structure prediction, including homology modeling, fold recognition, threading, and ab initio modeling. Homology modeling relies on identifying proteins with similar sequences and known structures. Fold recognition and threading can be used when there are no homologs, to identify proteins with the same overall fold but different sequences. Ab initio modeling uses physics-based modeling and protein fragments to predict structure from sequence alone, and has challenges due to the vast number of possible conformations.
Homology modeling is a computational technique for predicting the structure of a protein target based on its sequence similarity to proteins with known structures, and it involves finding a suitable template, aligning the target and template sequences, building a 3D model of the target, and evaluating the model quality. While experimental methods like X-ray crystallography and NMR can determine protein structures, they have limitations in terms of which proteins can be studied, so computational methods like homology modeling are needed to predict structures for the many proteins whose structures remain unknown.
The document discusses function prediction for unknown proteins. It begins with an overview of common methods for function prediction, including sequence and structure similarity, domains and motifs, gene expression, and interactions. It then uses a protein called Msa as a case study, analyzing it with various tools and finding evidence it may function as a signal transducer in bacterial response to environment. Finally, it briefly discusses another protein M46 and challenges in evaluating prediction accuracy.
This presentation discusses protein structure prediction using Rosetta. It begins with an overview of the Critical Assessment of Protein Structure Prediction (CASP) experiments and notes that Rosetta is one of the top performing free-modeling servers. The presentation then describes the basic ab initio protocol used by Rosetta, which involves fragment insertion, scoring, and refinement. It also discusses limitations and success rates. Key aspects of the Rosetta energy functions and sampling algorithms are presented. Examples of specific Rosetta applications including low-resolution modeling and refinement are provided.
This document provides an outline for a presentation on biological networks, including introducing biological networks, describing their basic components and types, methods for predicting and building networks, sources of interaction data, tools for network visualization and analysis, and a demonstration of building, visualizing and analyzing biological networks using Cytoscape. The presentation covers topics like nodes and edges in networks, features used to analyze networks, methods for predicting networks from sequences and omics data, integrated databases for interaction data, and popular tools for searching, visualizing and performing network analysis.
This document provides an overview and introduction to using the command line interface and submitting jobs to the NIAID High Performance Computing (HPC) Cluster. The objectives are to learn basic Unix commands, practice file manipulation from the command line, and submit a job to the HPC cluster. The document covers topics such as the anatomy of the terminal, navigating directories, common commands, tips for using the command line more efficiently, accessing and mounting drives on the HPC cluster, and an overview of the cluster queue system.
This document provides an overview of statistical analyses that can be performed in PRISM. It discusses how to perform common statistical tests like t-tests, ANOVA, linear regression, and summarizes the appropriate tests to address different research questions. Examples are given of how to analyze pre-post treatment data using paired t-tests and compare groups using independent t-tests or ANOVA. Guidance is also provided on interpreting results and checking assumptions.
1) JMP is statistical software that allows for easy import, organization, and analysis of data. It features spreadsheet-like data tables, powerful statistical modeling capabilities, and customizable graphics.
2) The document reviews various features of JMP including importing data, organizing data tables, performing statistical analyses through platforms like distribution and fit model, and creating graphs and reports.
3) Assistance is available for using JMP through free training, support contacts, and detailed help menus within the software. JMP allows for both simple and advanced statistical analysis of data.
This document discusses methods for analyzing categorical data and response variables, including contingency tables, chi-square tests, Fisher's exact test, odds ratios, logistic regression, and generalized linear models. Contingency tables are used to display relationships between categorical variables and tests of independence. Fisher's exact test and chi-square tests determine if a relationship is statistically significant. Odds ratios and relative risk indicate the magnitude of relationships. Logistic regression models relationships between continuous predictors and categorical responses. Generalized linear models extend these methods.
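A few base-R calls corresponding to the listed methods are sketched below; the 2x2 counts are invented purely for illustration.

# Contingency table, tests of independence, odds ratio, and logistic regression.
tab <- matrix(c(20, 10,
                15, 30), nrow = 2, byrow = TRUE,
              dimnames = list(exposure = c("yes", "no"),
                              disease  = c("yes", "no")))

chisq.test(tab)          # chi-square test of independence
fisher.test(tab)         # Fisher's exact test (also reports an odds ratio)
(20 * 30) / (10 * 15)    # odds ratio computed directly from the cell counts

# The same association via logistic regression on long-format data:
d <- data.frame(disease  = rep(c(1, 0, 1, 0), times = c(20, 10, 15, 30)),
                exposure = rep(c(1, 1, 0, 0), times = c(20, 10, 15, 30)))
exp(coef(glm(disease ~ exposure, data = d, family = binomial)))["exposure"]   # odds ratio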
This document provides a training manual on better graphics in R. It begins with an overview of R and BioConductor and reviews basic R functions. It then covers creating simple and customized graphics, multi-step graphics with legends, and multi-panel layouts. The manual aims to help researchers learn visualization techniques to improve the communication of their data and results.
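In the spirit of that manual, a tiny base-graphics sketch with a customized plot, a legend, and a multi-panel layout (all data simulated):

# Customized plot with a legend, plus a simple two-panel layout.
set.seed(8)
x  <- 1:50
y1 <- cumsum(rnorm(50))
y2 <- cumsum(rnorm(50))

op <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))    # two panels side by side

plot(x, y1, type = "l", lwd = 2, col = "steelblue",
     xlab = "Time", ylab = "Value", main = "Two series")
lines(x, y2, lwd = 2, lty = 2, col = "tomato")
legend("topleft", legend = c("Series 1", "Series 2"),
       col = c("steelblue", "tomato"), lwd = 2, lty = c(1, 2), bty = "n")

hist(y1 - y2, col = "grey80", border = "white",
     main = "Difference", xlab = "y1 - y2")

par(op)                                            # restore the previous graphics settings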
This document describes two web tools that were created using R to automate biostatistics workflows: HDX NAME and DRAP. HDX NAME analyzes hydrogen-deuterium exchange mass spectrometry data to estimate protein flexibility. It computes protection factors, compares groups, and maps results to protein structures. DRAP fits logistic dose-response curves to drug screening data from multiple plates. It automates curve fitting, compares results, and exports summaries. Both tools were created with R on the backend for analysis and web interfaces for usability. This allows researchers to perform complex analyses without programming expertise.
This document discusses several common problems with data handling and quality including building and testing models with the same data, confusion between biological and technical replicates, and identification and handling of outliers. It provides examples and explanations of key concepts such as experimental and sampling units, pseudo-replication, outliers versus high influence points, and leverage plots. The importance of proper data handling techniques like dividing data into training, test, and confirmation sets and using cross-validation is emphasized to avoid overfitting models and generating spurious findings.
The document provides an overview of statistical testing, including:
- When to use parametric vs. nonparametric tests
- When large sample tests or exact tests are needed
- When adjustments for multiple testing are required
It discusses key concepts like null and alternative hypotheses, test statistics, p-values, and type I and II errors. Examples of the Student's t-test and Wilcoxon rank sum test are provided.
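The two example tests mentioned translate directly into base R; the samples below are simulated.

# Parametric and nonparametric two-sample comparisons.
set.seed(9)
group1 <- rnorm(20, mean = 10, sd = 2)
group2 <- rnorm(20, mean = 12, sd = 2)

t.test(group1, group2)        # Student/Welch t-test of equal means
wilcox.test(group1, group2)   # Wilcoxon rank-sum test, the nonparametric alternative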
This document summarizes a presentation on curve fitting using GraphPad Prism. It discusses nonlinear regression techniques for analyzing dose-response and binding curve data commonly used by biologists. Specific nonlinear regression models like sigmoidal dose-response curves are described. The document provides guidance on choosing and fitting appropriate models, evaluating model fit, and improving model fit if needed.
Training Manual Appendix
Crash Course:
R and BioConductor
Jeff Skinner, M.S.
Sudhir Varma, Ph.D.
Bioinformatics and Computational Biosciences Branch (BCBB)
NIH/NIAID/OD/OSMO/OCICB
http://bioinformatics.niaid.nih.gov
ScienceApps@niaid.nih.gov
Appendix
Solutions to Sample Problems for Students
#1. {Fisher’s iris data} Sir Ronald A. Fisher famously used this set of iris flower data
as an example to test his new linear discriminant statistical model. Now, the iris
data set is used as a historical example for new statistical classification models.
A) Search the help menu for the keyword “linear discriminant”, then report
the names of the functions and packages you find.
Ans. > help.search(“linear discriminant”) returns results for the
functions lda() and predict.lda() from the MASS package library.
B) Search the help menus or a search engine for additional classification
models that could be tested with the iris data.
Ans. Any results are OK, but two examples are the knn() function from the
class package library and the randomForest() function from the
randomForest package library.
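As a quick illustration (a sketch not included in the original solutions), the lda() function from the MASS package can be fit to the iris data directly:
> library(MASS)
> iris.lda <- lda(Species ~ ., data = iris)   # fit the linear discriminant model
> iris.pred <- predict(iris.lda, iris)        # classify the training observations
> table(iris$Species, iris.pred$class)        # confusion matrix of true vs. predicted species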
C) The measurements from the iris data set were made in centimeters, but
suppose a researcher wanted to compare the performance of their classifier
for measurements in both cm and inches. Remember 1 cm = 0.3937 inch
and create a new iris data set with measurements in inches.
Ans. One possible answer is shown below:
> irisINCHES <- data.frame(0.3937*iris[,1:4],iris[,5])
> iris[1:4,]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
> irisINCHES[1:4,]
Sepal.Length Sepal.Width Petal.Length Petal.Width iris...5.
1 2.00787 1.37795 0.55118 0.07874 setosa
2 1.92913 1.18110 0.55118 0.07874 setosa
3 1.85039 1.25984 0.51181 0.07874 setosa
4 1.81102 1.22047 0.59055 0.07874 setosa
D) Use indexing to verify that the 77th
plant (i.e. row 77) has petal length of
approximately 1.89 inches.
Ans. Two possible answers are shown below:
> iris[77,"Petal.Length"]*0.3937
[1] 1.88976
> irisINCHES[77,3]
[1] 1.88976
#2. {AFP data} Suppose alpha-fetoprotein (AFP) is a potential biomarker for liver
cancer and other cancer types. A researcher might be interested in AFP levels
before and after taking a new drug in one of four concentrations.
A) The example in section 2.7.2 of the manual provided a list of 20 AFP
levels before drug treatment. Use your own methods to enter a new
column of 20 AFP levels after drug treatment, then enter another column
with the difference between the pre- and post-treatment AFP levels.
Ans. One possible answer is shown below:
# simulate post-treatment AFP levels for the 20 patients
> AFP.after <- AFP.before - 1.2 + 0.2*rnorm(20)
> AFP.diff <- AFP.after - AFP.before
> afp.data <- data.frame(subject,gender,height,weight,BMI,
drug,AFP.before,AFP.after,AFP.diff)
> afp.data
B) Verify the storage mode of the data set afp.data. Verify the storage
mode of the variable drug. Verify the storage mode of the variable
gender. Convert the storage mode of drug to factor.
Ans. One possible answer shown below
> class(afp.data)
[1] "data.frame"
> class(afp.data$drug)
[1] "numeric"
> class(afp.data$gender)
[1] "factor"
> afp.data$drug <- as.factor(afp.data$drug)
C) Create a subset of the AFP data that only includes male patients with
BMI > 25.5 or weight > 180 lbs. How many men are included in the
data subset?
Ans. Six male patients are included in the subset. One example is shown:
> afp.subset <- afp.data[afp.data$gender=="male",]
> indx <- afp.subset$BMI > 25.5 | afp.subset$weight > 180
> afp.subset <- afp.subset[indx,]
> afp.subset
subject gender height weight BMI drug ...
2 2 male 69.15696 202.9318 29.82865 5 ...
3 3 male 69.35599 211.0632 30.84607 10 ...
5 5 male 71.44586 241.4526 33.25317 20 ...
7 7 male 68.21618 297.4155 44.93081 5 ...
8 8 male 69.77130 289.2935 41.77731 10 ...
10 10 male 66.95951 178.6660 28.01385 20 ...
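The same rows can also be selected in one step with the subset() function; a brief alternative sketch (not part of the original answer, using the hypothetical name afp.subset2) is:
> afp.subset2 <- subset(afp.data, gender == "male" & (BMI > 25.5 | weight > 180))
> nrow(afp.subset2)   # should also report 6 male patients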
D) Sort the entire data subset created in part C) by the BMI variable in
descending order. What is the row ordering of the sorted data subset?
Save the data subset as a comma separated value (.csv) text file.
Ans. The row order is: 7, 8, 5, 3, 2, 10. A possible solution is below:
> afp.subset <- afp.subset[order(afp.subset$BMI,
decreasing=TRUE),]
> afp.subset
subject gender height weight BMI drug ...
7 7 male 68.21618 297.4155 44.93081 5 ...
8 8 male 69.77130 289.2935 41.77731 10 ...
5 5 male 71.44586 241.4526 33.25317 20 ...
3 3 male 69.35599 211.0632 30.84607 10 ...
2 2 male 69.15696 202.9318 29.82865 5 ...
10 10 male 66.95951 178.6660 28.01385 20 ...
> write.csv(afp.subset,file="~/subset.csv")
#3. {AE data} Doctors, epidemiologists and other researchers look at adverse events
to explore the symptoms and medical conditions affecting patients. A researcher
might choose to look for associations between adverse events and diet.
A) One of the adverse events in the data table is “Malaise”. Recode the AE
data table, such that all entries for “Malaise” read “Discomfort” instead.
Ans. Hint: you need to convert the adverse event variable to a character variable
> AE$Adverse.Event <- as.character(AE$Adverse.Event)
> indx <- AE$Adverse.Event == "Malaise"
> AE$Adverse.Event <- replace(AE$Adverse.Event,indx,"Discomfort")
> AE$Adverse.Event <- as.factor(AE$Adverse.Event)
B) Look at the results of your recoded adverse events. How many different
types of adverse events are there? Look through their names. Do you see
any potential problems? Fix any problems that you might find.
Ans. Initially, there are 18 different types of adverse events. There appears to
be a typo; “Mylagia” should be “Myalgia”. After correction, there are 17
different types of adverse events.
> length(levels(AE$Adverse.Event))
[1] 18
> AE$Adverse.Event <- as.character(AE$Adverse.Event)
> indx <- AE$Adverse.Event == "Mylagia"
> AE$Adverse.Event <- replace(AE$Adverse.Event,indx,"Myalgia")
> AE$Adverse.Event <- as.factor(AE$Adverse.Event)
> length(levels(AE$Adverse.Event))
[1] 17
C) Create an adverse event table to examine relationship between different
adverse event symptoms and their severities. Make sure the “Discomfort”
AE shows up in the table, instead of “Malaise”.
Ans. One possible solution is shown:
> attach(AE)
> AEtable <- table(Adverse.Event,Severity)
> AEtable
Severity
Adverse.Event Mild Moderate Severe
Anemia 2 3 1
Arthralgia 2 0 0
Dimpling 1 0 0
Discomfort 1 1 3
Ecchymosis 0 2 1
Elavated CH50 0 0 1
Erythema 0 3 1
Headache 1 5 0
Induration 1 3 0
Leukopenia 1 1 2
Myalgia 2 0 1
Nausea 4 0 1
Nodule 0 1 0
Pain 2 5 0
Papule 0 3 0
Swelling 1 2 1
Tenderness 2 2 1
D) Search the help menus for the functions rowSums and colSums. Use these
functions to count up the number of patients with each adverse event and
the number of patients with mild, moderate and severe symptoms.
Ans. An example is shown below
> AEsymptoms <- rowSums(AEtable)
> AEsymptoms
Anemia Arthralgia Dimpling Discomfort ...
6 2 1 5 ...
> AEseverity <- colSums(AEtable)
> AEseverity
Mild Moderate Severe
20 31 13
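A related shortcut (not used in the original answer) is addmargins(), which appends the row and column totals directly to the table:
> addmargins(AEtable)   # AE table with "Sum" row and column added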
E) Define a new variable AEmatrix by converting the AE table into a matrix.
Define two new matrix variables: LL = matrix(1,1,17) and RR = c(1,1,1).
Compute the products of LL by AEmatrix; AEmatrix by RR; and LL by
AEmatrix by RR. Do you notice anything?
Ans. The matrix product LL by AEmatrix is equal to the colSums(), AEmatrix
by RR is equal to the rowSums() and LL by AEmatrix by RR is equal to
the sample size n = 64. An example is shown below:
> AEmatrix <- as.matrix(AEtable)   # convert the AE table to a matrix, as the problem asks
> LL = matrix(1,1,17)
> RR = c(1,1,1)
> LL %*% AEmatrix
Severity
Mild Moderate Severe
[1,] 20 31 13
> AEmatrix %*% RR
Adverse.Event [,1]
Anemia 6
Arthralgia 2
Dimpling 1
Discomfort 5
Ecchymosis 3
Elavated CH50 1
Erythema 4
Headache 6
Induration 4
Leukopenia 4
Myalgia 3
Nausea 5
Nodule 1
Pain 7
Papule 3
Swelling 4
Tenderness 5
> LL %*% AEmatrix %*% RR
[,1]
[1,] 64
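As a quick sanity check (not in the original solution), the equivalence can be verified numerically; both comparisons should return TRUE:
> all.equal(as.vector(LL %*% AEmatrix), unname(colSums(AEtable)))
> all.equal(as.vector(AEmatrix %*% RR), unname(rowSums(AEtable)))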
#4. {Fisher’s iris data} Sir Ronald A. Fisher famously used this set of iris flower data
as an example to test his new linear discriminant statistical model. Now, the iris
data set is used as a historical example for new statistical classification models.
A) Make a boxplot of all four measurements from Fisher’s iris data
Ans. An example is shown below:
> boxplot(iris[,1:4],main="Fisher's Iris Data",ylab="cm",
xlab="measurement",col="wheat")
B) Create a multi-panel figure with histograms of all four measurements. Do
you notice anything that could not be seen from the boxplot?
Ans. An example is shown below:
> par(mfrow=c(2,2))
> hist(iris[,1],main="Fisher's Iris Data -- Sepal Length",
ylab="count",xlab="Sepal Length (cm)",col="red")
> hist(iris[,2],main="Fisher's Iris Data -- Sepal Width",
ylab="count",xlab="Sepal Width (cm)",col="yellow")
> hist(iris[,3],main="Fisher's Iris Data -- Petal Length",
ylab="count",xlab="Petal Length (cm)",col="green")
> hist(iris[,4],main="Fisher's Iris Data -- Petal Width",
ylab="count",xlab="Petal Width (cm)",col="blue")
The boxplot in part A did not show the bimodal distributions of petal length
and petal width, which are probably caused by differences among the species.
C) Create a multi-panel figure with boxplots of all four measurements,
paneled by the three different species. Do you notice any differences
among species?
Ans. An example is shown below:
> par(mfrow=c(1,3))
> boxplot(iris[iris$Species=="setosa",1:4],
main="Fisher's Iris Data -- Setosa",ylab="cm",
xlab="measurement",col="wheat")
> boxplot(iris[iris$Species=="versicolor",1:4],
main="Fisher's Iris Data -- Versicolor",ylab="cm",
xlab="measurement",col="olivedrab")
> boxplot(iris[iris$Species=="virginica",1:4],
main="Fisher's Iris Data -- Virginica",ylab="cm",
xlab="measurement",col="grey")
Yes. There are big differences among the three species.
#5. {AFP data} Suppose alpha-fetoprotein (AFP) is a potential biomarker for liver
cancer and other cancer types. A researcher might be interested in AFP levels
before and after taking a new drug in one of four concentrations.
A) In section 3.2.1, the barplot() and arrows() commands were used to
create a barchart of mean(BMI) by gender with error bars. Install the
sciplot package library and use the bargraph.CI() command to
replicate that graph.
Ans. An example is shown below:
> library(sciplot)
> bargraph.CI(as.factor(afp.data$gender),afp.data$BMI,
col=c("pink","sky blue"),
main="Mean BMI by Gender",ylim=c(0,50),ylab="BMI")
> legend(x="topleft",legend=c("Female","Male"),
fill=c("pink","sky blue"))
B) Use the bargraph.CI() command to create a bar chart that compares AFP
difference over all five drug concentrations.
Ans. An example is shown below:
> bargraph.CI(as.factor(afp.data$drug),afp.data$AFP.diff,
col=rainbow(5),main="Mean AFP Difference by Drug",
ylim=c(0,-2),ylab="AFP difference",
xlab="Drug Concentration")
> legend(x="topleft",legend=seq(0,20,by=5),fill=rainbow(5),
title="Drug Concentration")
C) Create an interleaved bar chart that plots mean AFP difference by both
drug concentration and gender
Ans. An example is shown below:
> bargraph.CI(as.factor(afp.data$drug),afp.data$AFP.diff,
group=as.factor(afp.data$gender),
col=c("pink","sky blue"),
main="Mean AFP Difference by Drug and Gender",
ylim=c(0,-2),ylab="AFP difference",
xlab="Drug Concentration")
> legend(x="topleft",legend=c("Female","Male"),
fill=c("pink","sky blue"))
#6. {AE data} Doctors, epidemiologists and other researchers look at adverse events
to explore the symptoms and medical conditions affecting patients. A researcher
might choose to look for associations between adverse events and diet.
A) Create a histogram of Percent Body Fat (or your choice of continuous
response variable), then overlay a normal curve.
Ans. An example is shown below:
> norm.curve <- qnorm(seq(0,1,length=10000),
mean(AE$Percent.Body.Fat),
sd(AE$Percent.Body.Fat))
> hist(AE$Percent.Body.Fat,col="wheat",freq=FALSE,
xlab="Percent Body Fat")
> lines(density(norm.curve))
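An alternative way to overlay the theoretical normal density (not shown in the slides) is curve() with dnorm():
> hist(AE$Percent.Body.Fat, col="wheat", freq=FALSE,
       xlab="Percent Body Fat")
> curve(dnorm(x, mean=mean(AE$Percent.Body.Fat), sd=sd(AE$Percent.Body.Fat)),
        add=TRUE, lwd=2)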
B) Install the lattice package and use the barchart() command to graph the
AEtable data table created for question #3. C) in the previous chapter.
What kind of plot is this? Add the appropriate figure legend.
Ans. The plot is a stacked bar chart, with stacked boxes representing the mild,
moderate and severe symptoms. An example is shown below:
> barchart(AEtable,main="Bar Chart of Adverse Event by Severity",
col=c("red","yellow","blue"))
> legend(x="topright",legend=levels(AE$Severity),
fill=c("red","yellow","blue"))
#7. {Nonparametric statistics} Search the help menus to find the command(s) for a
non-parametric statistical test analogous to the Student’s t-test (e.g. Mann-
Whitney U-test, Wilcoxon rank sum test, ...). Repeat at least one of the Student’s
t-test examples from section 4.1 with this non-parametric test.
Ans. An example is shown below:
> # Define a vector of % body fat data for men from AE data
> bfat.m <- AE[AE$Gender == "Male",6]
> # Define a vector of % body fat data for women from AE data
> bfat.f <- AE[AE$Gender == "Female",6]
> # Compute a two-sided Wilcoxon rank sum test with AE data
> wilcox.test(bfat.m,bfat.f,alternative="two.sided")
Wilcoxon rank sum test with continuity correction
data: bfat.m and bfat.f
W = 553, p-value = 0.5811
alternative hypothesis: true location shift is not equal to 0
Warning message:
In wilcox.test.default(bfat.m, bfat.f, alt. = "two.sided") :
cannot compute exact p-value with ties
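The same function also handles the paired case; for instance, a Wilcoxon signed rank test of the pre- versus post-treatment AFP levels could be run as follows (an additional sketch, not part of the original answer):
> wilcox.test(afp.data$AFP.before, afp.data$AFP.after,
              paired=TRUE, alternative="two.sided")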
#8. {Linear models} Add a second predictor variable to the formula parameter of the
lm() procedure from the regression or ANOVA example in section 4.2 to create a
more complicated linear model. Use the AFP data.
Ans. An example of multiple regression is shown below:
> # Define afp.data data frame with stringsAsFactors FALSE
> afp.data <- data.frame(subject,gender,height,weight,BMI,drug,
AFP.before,AFP.after,AFP.diff,
stringsAsFactors=FALSE)
> # Call the lm() procedure to fit regression
> afp.reg <- lm(formula = AFP.diff ~ drug*BMI, data = afp.data)
> afp.reg
Call:
lm(formula = AFP.diff ~ drug * BMI, data = afp.data)
Coefficients:
(Intercept) drug BMI drug:BMI
-1.3568528 0.0123046 0.0049974 -0.0003010
> anova(afp.reg)
Analysis of Variance Table
Response: AFP.diff
Df Sum Sq Mean Sq F value Pr(>F)
drug 1 0.00863 0.00863 0.2017 0.6594
BMI 1 0.00384 0.00384 0.0897 0.7685
drug:BMI 1 0.00542 0.00542 0.1265 0.7267
Residuals 16 0.68512 0.04282
> summary(afp.reg)
Call:
lm(formula = AFP.diff ~ drug * BMI, data = afp.data)
Residuals:
Min 1Q Median 3Q Max
-0.26127 -0.12370 -0.01925 0.14384 0.40517
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.3568528 0.3473496 -3.906 0.00126 **
drug 0.0123046 0.0268771 0.458 0.65325
BMI 0.0049974 0.0107781 0.464 0.64913
drug:BMI -0.0003010 0.0008463 -0.356 0.72670
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2069 on 16 degrees of freedom
Multiple R-Squared: 0.02545, Adjusted R-squared: -0.1573
F-statistic: 0.1393 on 3 and 16 DF, p-value: 0.935
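If desired, the usual regression diagnostics can be examined by plotting the fitted model object (a brief sketch, not part of the original answer):
> par(mfrow=c(2,2))   # 2 x 2 panel of residual and influence plots
> plot(afp.reg)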
#9. {Workflow scripting} Create a script to automate the creation, graphing, and linear
model analysis of the AFP data. Use your previous results from questions #2, #5
and #8, if necessary.
Ans. An example is shown:
############### Import AFP data ########################
# generate a list of subject IDs, numbered from 1 to 20
subject <- 1:20
# create 10 entries for male subjects
males <- rep("male",10)
# create 10 entries for female subjects
females <- rep("female",10)
# combine male and female entries into one column vector
gender <- c(males,females)
# bind subjectID and gender columns together
afp.data <- cbind(subject,gender)
# generate 10 male and 10 female random normal heights
height <- as.numeric(c(rnorm(10,70,2.5),rnorm(10,64,2.2)))
# generate 10 male and 10 female random uniform weights
weight <- as.numeric(c(runif(10,155,320),runif(10,95,210)))
# compute body mass index (BMI) for 10 men and 10 women
BMI <- as.numeric((weight*703)/(height**2))
# enter five treatment levels of a new drug (ng/mL)
drug <- rep(x = seq(from = 0, to = 20, by = 5), times = 4)
# manually enter Alpha-fetoprotein (AFP) levels for 20 patients
AFP.before <- as.numeric(c(0.8,2.3,1.1,4.8,3.7,12.5,0.3,4.4,4.9,0.0,
                           1.8,2.4,23.6,8.9,0.7,3.3,3.1,0.5,2.7,4.5))
AFP.after <- AFP.before - 1.2 + 0.2*rnorm(20)
AFP.diff <- AFP.after - AFP.before
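The transcript skips a slide at this point. The remaining lines assume that the skipped code rebuilt the data frame, fit a one-way ANOVA of AFP difference on drug concentration, and opened a PDF graphics device, roughly as sketched below (an approximation, not the original code):
# rebuild the full data frame and make drug a factor (as in questions #2 and #8)
afp.data <- data.frame(subject,gender,height,weight,BMI,drug,
                       AFP.before,AFP.after,AFP.diff)
afp.data$drug <- as.factor(afp.data$drug)
# fit a one-way ANOVA of AFP difference on drug concentration
afp.aov <- aov(AFP.diff ~ drug, data = afp.data)
afp.summary <- summary.lm(afp.aov)   # summary.lm() provides the residual SD ($sigma)
# labels and colors used by the bar chart below
main <- "Mean AFP Difference by Drug"
xlab <- "Drug Concentration"
ylab <- "AFP difference"
colors <- rainbow(5)
# open a PDF device so the figure is written to ANOVA.pdf
pdf("ANOVA.pdf")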
means = afp.aov$fitted.values[1:5]
names(means) = levels(afp.data$drug)
mp <- barplot(height =
means,main=main,xlab=xlab,ylab=ylab,col=colors,ylim=c(0,-2))
X0 <- X1 <- mp
Y0 <- means - afp.summary$sigma
Y1 <- means + afp.summary$sigma
arrows(X0,Y0,X1,Y1,code=3,angle=90)
dev.off()
browseURL("ANOVA.pdf")
#10. {Function scripts} Create your own script to compute two new types of row
statistic (e.g. standard deviation and interquartile range) for a data frame or
matrix. Be creative, add graphics or a statistical test (e.g. linear regression).
Ans. An example is shown below:
# Define a function to compute row statistics with a for() loop
row.stats.loop <- function(x){
# Initialize vectors
row.sd <- row.IQR <- vector("numeric",length=nrow(x))
# Use a for() loop to compute the SD and IQR for each row
for(i in 1:nrow(x)){
row.sd[i] <- sd(x[i,])
row.IQR[i] <- IQR(x[i,])}
# Perform a linear regression
row.reg <- lm(formula = row.sd ~ row.IQR)
# Create a list of output
output <- list()
output[["row sd"]] <- row.sd
output[["row IQR"]] <- row.IQR
output[["lm"]] <- row.reg
output[["anova"]] <- anova(row.reg)
output[["summary"]] <- summary(row.reg)
# Call the output list to report final results
output}
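The function can be tried on a small matrix of random numbers (hypothetical test data, not from the slides):
> set.seed(1)
> m <- matrix(rnorm(50), nrow=10)   # 10 rows x 5 columns of random data
> res <- row.stats.loop(m)
> res[["row sd"]]
> res[["summary"]]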
#11. Download the microarray dataset with the accession number “GDS10” from the
GEO website using the GEOquery package
Ans. The following loads the library, downloads the dataset and converts it to an
ExpressionSet object
library("GEOquery")
gds = getGEO("GDS10")
expset=GDS2eSet(gds, do.log2=TRUE)
A) Convert the data into three data frames, one for gene expression, one for
phenotypes and one for gene annotations
Ans. The following is an example script that will do this. Here we convert gds,
the output from getGEO(), to an ExpressionSet object before converting it to
the three data frames. We could also do this directly from the getGEO() output
(see the documentation for the GEOquery package on Bioconductor).
#Extract the expression matrix
X=exprs(expset)
#Extract the phenotypes
pheno.names=varLabels(expset)
> pheno.names
[1] "sample"        "tissue"        "strain"        "disease.state"
[5] "description"
phenotypes=data.frame(sample=expset$sample, tissue=expset$tissue,
strain=expset$strain, disease.state=expset$disease.state,
description=expset$description)
#Convert each column from factor to character type
for(i in 1:ncol(phenotypes))
phenotypes[,i]=as.character(phenotypes[,i])
#Extract the gene annotations
annot.columns= fvarLabels(expset)
> annot.columns
[1] "ID" "GB_ACC" "SPOT_ID"
annot.obj=featureData(expset)
annot=data.frame(id=annot.obj$ID, genbank.acc=annot.obj$GB_ACC,
spot.id=annot.obj$SPOT_ID)
B) Plot boxplots for each sample in one plot with different colors for each
sample. (Hint: use the stack() function and use a formula in the
boxplot() function. A vector of n colors can be obtained by using
rainbow(n))
Ans. The following is probably the easiest way to do this. You should look up
the help page for stack() to better understand how this works.
nsamp=ncol(X)
boxcol=rainbow(nsamp)
X.stack=stack(as.data.frame(X))
#Draw the boxplot
#Option las=3 makes the x axis labels vertical
boxplot(values~ind, data=X.stack, col=boxcol, las=3)
C) Compare the samples from the thymus and spleen for diabetic-resistant
mice and find the 10 most significant genes using the adjusted p-value.
Ans. This is a relatively lengthy script, but the explanation for each step can be
found here and in the manual.
#Find the samples that come from diabetic-resistant mice that
#originate from thymus
qt=which(phenotypes$disease.state=="diabetic-resistant" &
phenotypes$tissue=="thymus")
Xt=X[,qt]
#Find the samples that come from diabetic-resistant mice that
#originate from spleen
qs=which(phenotypes$disease.state=="diabetic-resistant" &
phenotypes$tissue=="spleen")
Xs=X[,qs]
#Compute the p-value and fold change for all genes
p.value=c()
fold.change=c()
for(i in 1:nrow(Xs))
{
#Find number of non-missing samples
n1=sum(!is.na(Xs[i,]))
n2=sum(!is.na(Xt[i,]))
if(n1 >= 2 & n2 >=2)
{
tt.res=t.test(Xs[i,], Xt[i,])
p.value[i]=tt.res$p.value
#The log fold change is calculated by the
#difference in means between the two classes
fold.change[i]=tt.res$estimate[2]-
tt.res$estimate[1]
}else
{
p.value[i]=NA
fold.change[i]=NA
}
}
#Compute adjusted p-values
adj.p.value=p.adjust(p.value)
#Find the smallest 10 p-values
qo=order(adj.p.value)
sig.genes=qo[1:10]
> adj.p.value[sig.genes]
[1] 1.859514e-12 7.615543e-12 1.852015e-11
[4] 3.337001e-11 4.210158e-11 5.769339e-11
[7] 7.557780e-11 9.369532e-11 1.125353e-10
[10] 1.331595e-10
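The same row-by-row t-tests could also be written more compactly with sapply() instead of an explicit for() loop (an alternative sketch, not part of the original solution):
p.value.alt <- sapply(seq_len(nrow(Xs)), function(i) {
  #only test rows with at least 2 non-missing values per group
  if(sum(!is.na(Xs[i,])) >= 2 & sum(!is.na(Xt[i,])) >= 2)
    t.test(Xs[i,], Xt[i,])$p.value
  else NA
})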
D) Write the gene annotations, p-value, adjusted p-value and expressions in
all the samples for these 10 genes to a CSV file.
Ans. An example is shown below
d=data.frame(annot[sig.genes,], p.value=
p.value[sig.genes], adj.p.value=adj.p.value[sig.genes],
X[sig.genes,])
write.csv(d, file="report.csv", row.names=FALSE)